Author Topic: "The AGC wasn't powerful enough"  (Read 36873 times)

Offline ka9q

  • Neptune
  • ****
  • Posts: 3014
Re: "The AGC wasn't powerful enough"
« Reply #30 on: December 03, 2019, 09:44:03 AM »
The point was that code like
Code:
while ( true )
   ;
while you're doing nothing but waiting for an interrupt was often acceptable in embedded systems.
That's bad news even in embedded systems. It runs up the power and heat. If you're testing a memory location it also hits memory (and any cache) very hard, depriving any other cores of those access cycles. But even in general purpose multitasking operating systems you sometimes want to busy wait when you know the wait will be so short that it would take much more time to call the scheduler to release the CPU, only to come right back to where you left off. For this reason, Intel (and other CPU architectures, I suppose) added the PAUSE instruction for use inside spinloops. It's like a NOOP (no operation) that takes some small but unspecified time to complete, greatly reducing CPU power consumption.

On an Intel system, if you simply wanted to wait for an interrupt (without actually testing anything) you'd use the HLT (halt) instruction. It continues after an interrupt has been serviced.
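
A rough sketch of the short-wait case in C, for what it's worth. It spins on a flag for a bounded number of iterations with a PAUSE hint, then gives up so the caller can yield or block instead. The names spin_wait, cpu_relax and max_spins are made up for illustration. __builtin_ia32_pause() is the GCC/Clang builtin that emits PAUSE; on other architectures this sketch simply spins without a hint. HLT isn't shown because it's a privileged instruction that only kernel-mode code can execute.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative macro: emit PAUSE on x86 (GCC/Clang), do nothing elsewhere. */
#if defined(__x86_64__) || defined(__i386__)
#define cpu_relax() __builtin_ia32_pause()
#else
#define cpu_relax() ((void)0)
#endif

/*
 * Busy-wait for *flag to become true, but only for max_spins iterations.
 * Returns true if the flag was seen set. Returns false if the spin budget
 * ran out, in which case the caller should fall back to a blocking wait.
 */
bool spin_wait(atomic_bool *flag, unsigned max_spins)
{
    for (unsigned i = 0; i < max_spins; i++) {
        if (atomic_load_explicit(flag, memory_order_acquire))
            return true;
        cpu_relax();   /* hint to the CPU that this is a spin loop */
    }
    return atomic_load_explicit(flag, memory_order_acquire);
}
```

The spin budget is the knob: it should be small enough that spinning costs less than a trip through the scheduler.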
« Last Edit: December 03, 2019, 09:46:47 AM by ka9q »

Offline JayUtah

  • Neptune
  • ****
  • Posts: 3814
    • Clavius
Re: "The AGC wasn't powerful enough"
« Reply #31 on: December 03, 2019, 12:09:33 PM »
They really mean "I'm running out of real time". The specific ways that's detected don't matter. Adding more memory wouldn't have fixed the problem because running out of memory was only a symptom of the real problem.

Right.  Adding more erasable memory means more core sets available, which only fixes the immediately reported symptom.  Since the underlying cause remains unaddressed, the problem is likely to behave like the proverbial bag of bread dough:  the underlying problem will just manifest itself elsewhere, in a different way.  In a way, that's what happened.  1201 and 1202 both mean a resource shortfall.  In one case the OS was reporting that when the interrupt happened to tell the AGC to run its real-time tasks, they were still running from the previous interrupt.  In the other case, the OS was telling a new task that it couldn't have a core set because they were full -- and they were full because the real-time tasks were taking too long to complete.  All of that was the result of too much time being spent reading unexpected data from the radar.
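
To make the symptom-versus-cause point concrete, here is a toy model in C. It is not the AGC Executive, just a single CPU with a fixed pool of job slots standing in for core sets. A job is requested every `period` ticks and needs `work` ticks of CPU time. Every name and number here is an assumption for illustration.

```c
/*
 * Toy model: one CPU, a fixed pool of job slots ("core sets").
 * A new job is requested every `period` ticks; each job needs `work`
 * ticks of CPU and holds its slot until it finishes.
 * Returns the tick at which a request found no free slot (the analogue
 * of an overflow alarm), or -1 if that never happens within `horizon`.
 */
long first_overflow_tick(int core_sets, int period, int work, long horizon)
{
    int queued = 0;      /* jobs currently holding a slot */
    int remaining = 0;   /* CPU ticks left on the job at the head of the queue */

    for (long t = 0; t < horizon; t++) {
        if (t % period == 0) {            /* periodic job request */
            if (queued == core_sets)
                return t;                 /* no slot free: raise the alarm */
            if (queued == 0)
                remaining = work;         /* CPU was idle, start this job now */
            queued++;
        }
        if (queued > 0 && --remaining == 0) {   /* one tick of CPU work */
            queued--;                     /* job done, slot released */
            if (queued > 0)
                remaining = work;         /* start the next waiting job */
        }
    }
    return -1;
}
```

With `work` below `period` the pool never fills, however long the model runs. Push `work` just above `period` and the pool eventually fills. Doubling `core_sets` only postpones the alarm; it doesn't prevent it.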

Quote
Had I been in Aldrin's position I like to think that I would have immediately realized what was going on, though not why or whether I could continue.

Yeah, it's not clear whether Aldrin apprehended what was going on.  But in general the troubleshooting checklist for flying still has Aviate as the prime directive.  If the craft is flyable notwithstanding the warnings, you have to consider whether the best plan is to press on to the landing and then try to troubleshoot once you're on the ground.

One thing Aldrin had that not all pilots do is the flight controllers and their back rooms.  As soon as the program alarm is reported, you can hear one of the flight controllers say, "Same thing we had."  He's referring to the ill-fated simulation where the controllers had first been presented with the program alarms and had incorrectly called for an abort.  Chagrined, the story goes that they delved into the program alarms and had playbooks for all of them.  So when pressed for a recommendation, they were able to give a go.  Then you hear them trying to troubleshoot:  "It seems to happen whenever we have a 16 68 up," and later, "Noun 68 may well be the problem here."  Verb 16 means to display the specified noun data on the DSKY and update it at intervals.  In order to do a semi-continuous update, you have to schedule a task to be woken up at those intervals.  That lengthens the list of things that have to be done on a periodic basis.

The controllers were methodical, but initially wrong.  Neither verb 16 nor noun 68 was the problem.  They were what revealed the problem.  But for the floating radar clocks (the root cause), nothing would have been wrong with Aldrin wanting to monitor noun 68.  I too like to think I'd be adept at gathering all the information and making sound decisions.  But historically even well-trained, highly-skilled people are often fairly bad at it.  The history of watching engineers and operators respond to emergencies suggests they do largely what Bales and his colleagues did:  they hypothesize a de minimis cause and then unconsciously filter new data according to whether it fits that hypothesis.

Quote
Any real-time computer system MUST have some idle time left over, or it won't keep up.

Indeed, even those with a fixed duty cycle.  The AGC didn't have a fixed duty cycle:  the operator could add real-time tasks to it willy-nilly, and the different modes of operation changed the cycle.  For example, in unaccelerated flight, the digital autopilot operated much more leisurely.
"Facts are stubborn things." --John Adams

Offline rocketman

  • Mercury
  • *
  • Posts: 19
Re: "The AGC wasn't powerful enough"
« Reply #32 on: December 03, 2019, 12:37:33 PM »
Does anyone here know about, or is anyone involved with, this?

https://www.ibiblio.org/apollo/

Offline JayUtah

  • Neptune
  • ****
  • Posts: 3814
    • Clavius
Re: "The AGC wasn't powerful enough"
« Reply #33 on: December 03, 2019, 12:59:14 PM »
That's bad news even in embedded systems. It runs up the power and heat.

Right; it would only work in the dumbest of control systems.  You could do it, but it would still be inadvisable.  But in a time-sharing system, it's anathema except for the reason you gave, where you can know that scheduler overhead is worse in that case.  Linux, I believe, has a spinlock for just that purpose.

Quote
On an Intel system, if you simply wanted to wait for an interrupt (without actually testing anything) you'd use the HLT (halt) instruction. It continues after an interrupt has been serviced.

That's good to know.  I didn't have any specific architecture in mind when I was writing.  But you bring up an important point:  these days "embedded" is quite likely to mean "battery-powered," such as in a consumer handset.  Things like power usage and heat are extremely important.  However, "embedded" can also mean "industrial controller."  In that case it's hooked up to an almost limitless power supply, and its thermal environment may be extreme.  What you may want from that is for it to be dumb and rugged.
"Facts are stubborn things." --John Adams

Offline JayUtah

  • Neptune
  • ****
  • Posts: 3814
    • Clavius
Re: "The AGC wasn't powerful enough"
« Reply #34 on: December 03, 2019, 01:00:45 PM »
Does anyone here know about, or is anyone involved with, this?

https://www.ibiblio.org/apollo/

Yes, back in the day I was somewhat involved with it.  I didn't write any of the code, but I corresponded with the guy who was writing it and I tested the end result.  If you want to write code to run on the AGC, I highly endorse it.
"Facts are stubborn things." --John Adams

Offline bknight

  • Neptune
  • ****
  • Posts: 3132
Re: "The AGC wasn't powerful enough"
« Reply #35 on: December 03, 2019, 01:08:09 PM »
They really mean "I'm running out of real time". The specific ways that's detected don't matter. Adding more memory wouldn't have fixed the problem because running out of memory was only a symptom of the real problem.

Right.  Adding more erasable memory means more core sets available, which only fixes the immediately reported symptom.  Since the underlying cause remains unaddressed, the problem is likely to behave like the proverbial bag of bread dough:  the underlying problem will just manifest itself elsewhere, in a different way.  In a way, that's what happened.  1201 and 1202 both mean a resource shortfall.  In one case the OS was reporting that when the interrupt happened to tell the AGC to run its real-time tasks, they were still running from the previous interrupt.  In the other case, the OS was telling a new task that it couldn't have a core set because they were full -- and they were full because the real-time tasks were taking too long to complete.  All of that was the result of too much time being spent reading unexpected data from the radar.

Quote
Had I been in Aldrin's position I like to think that I would have immediately realized what was going on, though not why or whether I could continue.

Yeah, it's not clear whether Aldrin apprehended what was going on.  But in general the troubleshooting checklist for flying still has Aviate as the prime directive.  If the craft is flyable notwithstanding the warnings, you have to consider whether the best plan is to press on to the landing and then try to troubleshoot once you're on the ground.

One thing Aldrin had that not all pilots do is the flight controllers and their back rooms.  As soon as the program alarm is reported, you can hear one of the flight controllers say, "Same thing we had."  He's referring to the ill-fated simulation where the controllers had first been presented with the program alarms and had incorrectly called for an abort.  Chagrined, the story goes that they delved into the program alarms and had playbooks for all of them.  So when pressed for a recommendation, they were able to give a go.  Then you hear them trying to troubleshoot:  "It seems to happen whenever we have a 16 68 up," and later, "Noun 68 may well be the problem here."  Verb 16 means to display the specified noun data on the DSKY and update it at intervals.  In order to do a semi-continuous update, you have to schedule a task to be woken up at those intervals.  That lengthens the list of things that have to be done on a periodic basis.

The controllers were methodical, but initially wrong.  Neither verb 16 nor noun 68 was the problem.  They were what revealed the problem.  But for the floating radar clocks (the root cause), nothing would have been wrong with Aldrin wanting to monitor noun 68.  I too like to think I'd be adept at gathering all the information and making sound decisions.  But historically even well-trained, highly-skilled people are often fairly bad at it.  The history of watching engineers and operators respond to emergencies suggests they do largely what Bales and his colleagues did:  they hypothesize a de minimis cause and then unconsciously filter new data according to whether it fits that hypothesis.

Quote
Any real-time computer system MUST have some idle time left over, or it won't keep up.

Indeed, even those with a fixed duty cycle.  The AGC didn't have a fixed duty cycle:  the operator could add real-time tasks to it willy-nilly, and the different modes of operation changed the cycle.  For example, in unaccelerated flight, the digital autopilot operated much more leisurely.

Also, reading on in the ALSJ page covering the A11 landing, there is a passage noting that Pete had a lot of trouble nulling out the horizontal movements during the A12 landing.  The guys at MIT rewrote the program for the AGC to null out the lateral movements automatically, IIRC.
Truth needs no defense.  Nobody can take those footsteps I made on the surface of the moon away from me.
Eugene Cernan

Offline ka9q

  • Neptune
  • ****
  • Posts: 3014
Re: "The AGC wasn't powerful enough"
« Reply #36 on: December 03, 2019, 04:33:55 PM »
Yeah, it's not clear whether Aldrin apprehended what was going on.  But in general the troubleshooting checklist for flying still has Aviate as the prime directive.  If the craft is flyable notwithstanding the warnings, you have to consider whether the best plan is to press on to the landing and then try to troubleshoot once you're on the ground.
Well, that was the big question, wasn't it? Would the craft remain flyable notwithstanding the warnings? Especially during the dead-man zone within the last 100 m or so when an abort wouldn't succeed?
Quote
One thing Aldrin had that not all pilots do is the flight controllers and their back rooms.  As soon as the program alarm is reported, you can hear one of the flight controllers say, "Same thing we had."  He's referring to the ill-fated simulation where the controllers had first been presented with the program alarms and had incorrectly called for an abort.  Chagrined, the story goes that they delved into the program alarms and had playbooks for all of them.  So when pressed for a recommendation, they were able to give a go.

I can't think of a better example of the aphorism "Luck favors the prepared". Gene Kranz discusses all this at length in Failure Is Not an Option. What Kranz had his controllers do after that simulation is what programmers call achieving high code coverage: testing every possible decision path to make sure it does the right thing. One interesting way to achieve this is called fuzzing. You just throw random garbage at the program. If it crashes (or runs out of time in a real-time system), you've got a problem to fix. You don't expect the program to do anything useful, of course; you only want to know if it will crash, which it should never do regardless of its input. In a sense, the Apollo simulation supervisors were fuzzing the controller/astronaut system.

Fuzzing isn't the only thing you do, of course; it is simply an adjunct to methodical analysis (code "walk throughs"). But it has an uncanny ability to reveal problems that only become obvious in hindsight. Accident investigations often do the same thing, but they're a little more costly.
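
A bare-bones illustration of the idea in C. The target, parse_count, is a made-up function that should return either -1 (rejected) or a value from 1 to 99; the harness throws random bytes at it and counts any result outside that contract. Real fuzzers are coverage-guided and far smarter than this; this one just calls rand() in a loop.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical target: parse a one- or two-digit decimal count (1..99)
 * from raw bytes. Returns the count, or -1 if the input is rejected.
 */
int parse_count(const unsigned char *buf, size_t len)
{
    int n = 0;

    if (len == 0 || len > 2)
        return -1;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] < '0' || buf[i] > '9')
            return -1;
        n = n * 10 + (buf[i] - '0');
    }
    return n >= 1 ? n : -1;
}

/*
 * Feed `trials` random inputs of 0..8 bytes to parse_count and return how
 * many times the result violated the contract (-1, or a value in 1..99).
 */
unsigned fuzz_parse_count(unsigned seed, unsigned trials)
{
    unsigned failures = 0;

    srand(seed);
    for (unsigned t = 0; t < trials; t++) {
        unsigned char buf[8] = {0};
        size_t len = (size_t)(rand() % 9);

        for (size_t i = 0; i < len; i++)
            buf[i] = (unsigned char)(rand() % 256);

        int r = parse_count(buf, len);
        if (!(r == -1 || (r >= 1 && r <= 99)))
            failures++;
    }
    return failures;
}
```

The harness doesn't care whether any input was "meaningful". It only checks that the target kept its promise for every input it was handed.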

You know, I could probably teach a course based entirely on NTSB reports and what they reveal about human, engineering and system failures.

Quote
The controllers were methodical, but initially wrong.  Neither verb 16 nor noun 68 was the problem.
I can't fault them for that. They didn't know why the AGC was running out of cycles; that took an engineering investigation. (I think MIT figured it out while they were on the moon, and the ascent checklist was modified to turn off the rendezvous radar.) But they probably did already know during the landing that 16 68 was pretty compute intensive, and turning it off would relieve the load. It was a good call.

Quote
The history of watching engineers and operators respond to emergencies suggests they do largely what Bales and his colleagues did:  they hypothesize a de minimis cause and then unconsciously filter new data according to whether it fits that hypothesis.
Yeah, and this is why I teach my students about fault trees and how they help you avoid jumping to conclusions. But I think you're still a little hard on the Apollo 11 flight controllers. They succeeded, didn't they?

Quote
Quote
Any real-time computer system MUST have some idle time left over, or it won't keep up.

Indeed, even those with a fixed duty cycle.  The AGC didn't have a fixed duty cycle:  the operator could add real-time tasks to it willy-nilly, and the different modes of operation changed the cycle.  For example, in unaccelerated flight, the digital autopilot operated much more leisurely.
Yeah. So you have to test for the worst case with every possible task running to make sure you have enough spare cycles. If you don't, you have to carefully plan which tasks are allowed to run.
« Last Edit: December 03, 2019, 04:40:50 PM by ka9q »

Offline ka9q

  • Neptune
  • ****
  • Posts: 3014
Re: "The AGC wasn't powerful enough"
« Reply #37 on: December 03, 2019, 05:01:12 PM »
But in a time-sharing system, it's anathema except for the reason you gave, where you can know that scheduler overhead is worse in that case.  Linux, I believe, has a spinlock for just that purpose.
Yup, that's what I had in mind. The spin loop is used on things like waiting for access to a shared variable. Only one thread of execution at a time can be allowed to modify a shared variable, but these accesses are so quick (a few instructions) that a spinwait is faster and more efficient. You still use the PAUSE instruction, though. Also the usual case is that nobody else is using the variable (it is unlocked) so you don't have to wait at all.
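
Roughly what that looks like in user-space C11, as a sketch only. This is not the Linux kernel's spinlock implementation; demo_spinlock and its functions are names invented here, and cpu_relax is an illustrative macro wrapping the GCC/Clang PAUSE builtin on x86.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative macro: emit PAUSE on x86 (GCC/Clang), do nothing elsewhere. */
#if defined(__x86_64__) || defined(__i386__)
#define cpu_relax() __builtin_ia32_pause()
#else
#define cpu_relax() ((void)0)
#endif

typedef struct {
    atomic_flag locked;   /* set while some thread holds the lock */
} demo_spinlock;

/* Try once. In the usual uncontended case this succeeds immediately. */
bool demo_trylock(demo_spinlock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire);
}

/* Spin until the lock is free, pausing between attempts. */
void demo_lock(demo_spinlock *l)
{
    while (!demo_trylock(l))
        cpu_relax();
}

/* Release the lock so another thread's test-and-set can succeed. */
void demo_unlock(demo_spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

A lock like this only makes sense when the critical section is a handful of instructions, as with the shared-variable update described above.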

By the way, you can use the halt instruction as a "slow pause" instruction. Even if you're not waiting for an interrupt yourself, one will always come along (the system clock timer if nothing else). You wouldn't do this in Linux except when the system is completely idle because the wait will almost certainly be long enough to make it worthwhile to invoke the scheduler.

Quote
But you bring up an important point:  these days "embedded" is quite likely to mean "battery-powered," such as in a consumer handset.  Things like power usage and heat are extremely important.  However, "embedded" can also mean "industrial controller."  In that case it's hooked up to an almost limitless power supply, and its thermal environment may be extreme.  What you may want from that is for it to be dumb and rugged.
I worked for Qualcomm, so to me "embedded" naturally implies the former case (small, battery powered, extremely energy-starved).

« Last Edit: December 03, 2019, 05:08:40 PM by ka9q »

Offline ka9q

  • Neptune
  • ****
  • Posts: 3014
Re: "The AGC wasn't powerful enough"
« Reply #38 on: December 03, 2019, 05:04:39 PM »
Also, reading on in the ALSJ page covering the A11 landing, there is a passage noting that Pete had a lot of trouble nulling out the horizontal movements during the A12 landing.  The guys at MIT rewrote the program for the AGC to null out the lateral movements automatically, IIRC.
And yet all the Apollo commanders prided themselves on "manually" landing the LM. There was no such thing as a fully manual landing mode...

Offline JayUtah

  • Neptune
  • ****
  • Posts: 3814
    • Clavius
Re: "The AGC wasn't powerful enough"
« Reply #39 on: December 03, 2019, 06:02:55 PM »
Well, that was the big question, wasn't it? Would the craft remain flyable notwithstanding the warnings?

Yep.  You can hear the urgency in Armstrong's voice when he asks for them to rule on the 1202.

Quote
One interesting way to achieve this is called fuzzing. You just throw random garbage at the program.

We fuzz our software extensively.  The joke goes that the testing department for a bar tests for patrons asking for 1 beer, -1 beer, 9999999 beers, 0 beers, and "dog" beers.  Then the whole thing blows up when someone asks to use the restroom.

Quote
In a sense, the Apollo simulation supervisors were fuzzing the controller/astronaut system.

Dunno...  If I were going to fuzz the Net-1/MOCR setup, I'd have a mariachi band suddenly appear right at DOI.

Quote
You know, I could probably teach a course based entirely on NTSB reports and what they reveal about human, engineering and system failures.

I've taken such courses, based largely on those kinds of sources.  There are also a couple of good books written by sociologists who study how critical decision-makers work in technical environments.

Quote
But they probably did already know during the landing that 16 68 was pretty compute intensive...

Except that I don't think it is.  Updating noun 68 is intensive, and it happens anyway as part of the landing tasks.  I don't think displaying it is, even if it's as often as once per second.  What I gather from the analysis is that it was just long enough.  If you're running at 99% capacity and you add an extra 2%, the nonlinear response is what gets you.  You don't get a 1202 at 99% but you get one at 101%.

Quote
...and turning it off would relieve the load. It was a good call.

Yes.  If the recommendation based on initial analysis is that the crew has to reduce the load on the Executive, then any real-time tasks that can be dispensed with should be.  But the snippets we hear on the FD loop have them speculating what's so special about noun 68.  This would have been a wrong direction to go, but as you say, they had no urgency beyond stabilizing the current thing.

A similar situation happened during the fatal Columbia re-entry.  As temperature sensors and other sensors started going offline, the flight controllers were looking for systemic commonalities.  It wasn't until much later in the troubleshooting process that they realized all those sensors were going offline because they were being destroyed -- the commonality was that they were in the rapidly heating part of the orbiter.

Quote
Yeah, and this is why I teach my students about fault trees and how they help you avoid jumping to conclusions. But I think you're still a little hard on the Apollo 11 flight controllers. They succeeded, didn't they?

They did, and in the final analysis that's all that matters.  I merely bring it up as an example of de minimis thinking.  The saving grace is that de minimis remedies aren't doomed to immediate failure.  Also, in their further defense, there is a theory of operating complex systems that says you apply only the minimal effective remedy.  You don't fix more than what the data say is broken.

Quote
Yeah. So you have to test for the worst case with every possible task running to make sure you have enough spare cycles. If you don't, you have to carefully plan which tasks are allowed to run.

And I gather MIT generally took the latter approach.  The specs for the computer had to be locked down at a certain point, but afterwards people started realizing what a useful gadget the computer was and gave it more and more tasks to do.
"Facts are stubborn things." --John Adams

Offline Obviousman

  • Jupiter
  • ***
  • Posts: 743
Re: "The AGC wasn't powerful enough"
« Reply #40 on: December 03, 2019, 08:14:49 PM »
And yet all the Apollo commanders prided themselves on "manually" landing the LM. There was no such thing as a fully manual landing mode...

I thought P67 was a full "manual" landing?

Offline smartcooky

  • Uranus
  • ****
  • Posts: 1966
Re: "The AGC wasn't powerful enough"
« Reply #41 on: December 04, 2019, 05:51:35 AM »
Dunno...  If I were going to fuzz the Net-1/MOCR setup, I'd have a mariachi band suddenly appear right at DOI.

I just can't pass up this opportunity...



If you're not a scientist but you think you've destroyed the foundation of a vast scientific edifice with 10 minutes of Googling, you might want to consider the possibility that you're wrong.

Offline bknight

  • Neptune
  • ****
  • Posts: 3132
Re: "The AGC wasn't powerful enough"
« Reply #42 on: December 04, 2019, 09:38:12 AM »
Also, reading on in the ALSJ page covering the A11 landing, there is a passage noting that Pete had a lot of trouble nulling out the horizontal movements during the A12 landing.  The guys at MIT rewrote the program for the AGC to null out the lateral movements automatically, IIRC.
And yet all the Apollo commanders prided themselves on "manually" landing the LM. There was no such thing as a fully manual landing mode...

Of course they did; all but one were jet jockeys, and the majority flew off carriers.
Truth needs no defense.  Nobody can take those footsteps I made on the surface of the moon away from me.
Eugene Cernan

Offline JayUtah

  • Neptune
  • ****
  • Posts: 3814
    • Clavius
Re: "The AGC wasn't powerful enough"
« Reply #43 on: December 04, 2019, 11:12:44 AM »
I just can't pass up this opportunity...

Good on you.  I figured Confuse-A-Cat might have been too obscure a reference.

The idea behind fuzzing is that you're not necessarily trying to exercise some specific means.  Take lock-picking, for example.  The overt approach is to twist the lock barrel to apply shear to the lock pins, then use a tiny pick to raise each pin to the appropriate position.  The constant shear holds it there while you work on the other pins.  But there's also a sawtooth tool you can just randomly slide in and out while you hold the barrel in torsion, and it sort of randomly raises and lowers the pins.  It's often much faster than the explicit method, and it requires less skill.  This would be equivalent to fuzzing the lock.

Now consider the AGC.  The DSKY operates by sending a specific byte to the computer when each key is pressed.  Modern keyboards work pretty much the same way.  The uplink channel is simply a virtual DSKY.  The modem receives digital values that are fed to the AGC as keystrokes, allowing anything that can be done on the DSKY to be done remotely.  An explicit plan to take over an Apollo spacecraft might involve feeding it keystrokes that tell the computer to do something deleterious, like orient the ship for gimbal lock or apply RCS hardover commands.  There are probably innumerable ways to sabotage an Apollo mission by hijacking the AGC uplink.  Fuzzing, on the other hand, would be simply feeding random words over this channel, to see if any of them accidentally put the computer into an unusable state.  It's exactly equivalent to button-mashing the DSKY.

It would be impossible to program the AGC to recognize and reject all the possible sequences of improper input.  But it can apply strict controls on what it accepts as proper input.  And anything outside the canonical sequences like "verb-key, digit, digit, noun-key, digit, digit, enter-key" essentially puts the AGC into a mode where it doesn't accept any more input until the key-release button is pressed.  This makes the computer harder to operate because there's no equivalent to the backspace key.  But it's safer in that fuzzy input will very quickly get caught, and all subsequent input rejected, without impairing the background operation of the software.  I'm sure ka9q can talk at length about similar discriminators in his technologies.
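
The accept-only-canonical-input idea can be sketched as a small state machine in C.  This is a simplification built from the description above, not the actual AGC keyboard logic.  The one-character key codes ('V', 'N', 'E', 'R' for release, and the digits) and every name in the code are assumptions made for illustration.

```c
#include <stdbool.h>

/* What the keypad expects next. Order matters: digit states advance by ++. */
enum {
    ST_IDLE,     /* expecting the verb key */
    ST_V1,       /* expecting first verb digit */
    ST_V2,       /* expecting second verb digit */
    ST_NOUN,     /* expecting the noun key */
    ST_N1,       /* expecting first noun digit */
    ST_N2,       /* expecting second noun digit */
    ST_ENTER,    /* expecting the enter key */
    ST_LOCKED    /* bad input seen: ignore everything until release */
};

typedef struct {
    int state;
    int verb;
    int noun;
} keypad;

static bool is_digit(char k) { return k >= '0' && k <= '9'; }

/* Feed one key. Returns true only when a full verb/noun command is accepted. */
bool keypad_feed(keypad *kp, char k)
{
    if (k == 'R') {                 /* key release always clears the latch */
        kp->state = ST_IDLE;
        return false;
    }
    switch (kp->state) {
    case ST_IDLE:
        if (k == 'V') { kp->verb = 0; kp->state = ST_V1; return false; }
        break;
    case ST_V1:
    case ST_V2:
        if (is_digit(k)) { kp->verb = kp->verb * 10 + (k - '0'); kp->state++; return false; }
        break;
    case ST_NOUN:
        if (k == 'N') { kp->noun = 0; kp->state = ST_N1; return false; }
        break;
    case ST_N1:
    case ST_N2:
        if (is_digit(k)) { kp->noun = kp->noun * 10 + (k - '0'); kp->state++; return false; }
        break;
    case ST_ENTER:
        if (k == 'E') { kp->state = ST_IDLE; return true; }
        break;
    default:
        break;                      /* ST_LOCKED: fall through and stay locked */
    }
    kp->state = ST_LOCKED;          /* anything off the canonical path latches */
    return false;
}
```

Feeding it "V16N68E" yields verb 16, noun 68.  One stray key anywhere in the sequence latches it, and everything after that is ignored until an 'R' arrives, which is the behavior that makes button-mashing harmless.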

Even better, this channel is disabled unless the crew explicitly switches the Uplink switch to Accept.  This too has tradeoffs.  What if the crew are incapacitated and can't enable ground control of the computer?
"Facts are stubborn things." --John Adams

Offline ka9q

  • Neptune
  • ****
  • Posts: 3014
Re: "The AGC wasn't powerful enough"
« Reply #44 on: December 04, 2019, 11:17:25 AM »
And yet all the Apollo commanders prided themselves on "manually" landing the LM. There was no such thing as a fully manual landing mode...

I thought P67 was a full "manual" landing?
Even in P67 the astronauts' inputs are still being passed through the computer. The LM could not be landed without a functioning computer.