James Routley

The Apollo Guidance Computer (AGC) is one of the most scrutinised codebases in history. Thousands of developers have read it. Academics have published papers on its reliability. Emulators run it instruction by instruction. We found a bug in it that had been missed for fifty-seven years: a resource lock in the gyro control code that leaks on an error path, silently disabling the guidance platform's ability to realign.

We used Claude and Allium, our open-source behavioural specification language, to distil 130,000 lines of AGC assembly into 12,500 lines of specs. The specs were derived from the code itself, and the process signposted us directly to the defect.

Charting the code

The source code has been publicly available since 2003, when Ron Burkey and a team of volunteers began painstakingly transcribing it by hand from printed listings at the MIT Instrumentation Laboratory. In 2016, former NASA intern Chris Garry’s GitHub repository went viral, topping the trending page. Thousands of developers scrolled through the assembly language of a machine with 2K of erasable RAM and a 1MHz clock.

The AGC’s programs were stored in 74KB of core rope: copper wire threaded by hand through tiny magnetic cores in a factory (a wire passing through a core was a 1; a wire bypassing it was a 0). The women who wove it were known internally as the “Little Old Ladies”, and the memory itself was called LOL memory. The program was physically woven into the hardware. Ken Shirriff has analysed it down to individual gates, and the Virtual AGC project runs the software in emulation, having confirmed the recovered source byte-for-byte against the original core rope dumps.

As far as we can determine, no formal verification, model checking or static analysis has been published against the flight code. The scrutiny has been deep, but it has been a particular kind of scrutiny: reading the code, emulating the code, verifying the transcription.

We took a different approach. We used Allium to distil a behavioural specification from the Inertial Measurement Unit (IMU) subsystem, the gyroscope-based platform that tells the spacecraft which way it is pointing. The specification models the lifecycle of every shared resource: when it is acquired, when it must be released, and on which paths.

It surfaced a flaw that reading and emulation had missed.

Losing the reference

The AGC manages the IMU through a shared resource lock called LGYRO. When the computer needs to torque the gyroscopes (to correct platform drift or perform a star alignment), it acquires LGYRO at the start and releases it when all three axes have been torqued. The lock prevents two routines from fighting over the gyro hardware at the same time.

The lock is acquired on the way in and released on the way out. But there is a third possibility, and it doesn’t release the lock.

‘Caging’ is an emergency measure: a physical clamp that locks the IMU’s gimbals in place to protect the gyroscopes from damage. The crew could trigger it with a guarded switch in the cockpit.

When the torque completes normally, the routine exits via STRTGYR2 and the LGYRO lock is cleared. When the IMU is caged while a torque is in progress, the code exits via a routine called BADEND, which does not clear the lock. Two instructions are missing:

        CAF  ZERO
        TS   LGYRO

Four bytes.

Once LGYRO is stuck, every subsequent attempt to torque the gyros finds the lock held, sleeps waiting for a wake signal that will never come, and hangs. Fine alignment, drift compensation, manual gyro torque: all blocked.

On 21 July 1969, while Neil Armstrong and Buzz Aldrin walked on the lunar surface, Michael Collins orbited alone in the Command Module Columbia. Every two hours he disappeared behind the Moon, out of radio contact with Earth. “I am alone now, truly alone, and absolutely isolated from any known life. I am it,” he wrote in Carrying the Fire. “If a count were taken, the score would be three billion plus two over on the other side of the moon, and one plus God knows what on this side.”

During each pass he ran Program 52, a star-sighting alignment that kept the guidance platform pointed in the right direction. If the platform drifted, the engine burn to bring him home would point the wrong way.

Radio silence

Here’s how the bug might have manifested.

Collins has just finished his star sightings at the optics station in the lower equipment bay and keyed in the final commands. The computer is torquing the gyroscopes to apply the correction across all three axes.

He moves back toward the main panel in a cramped cockpit, past a cage switch protected by a flip-up cover. An elbow catches the cover and nudges the switch. The code handles this gracefully: a routine called CAGETEST detects the cage, abandons the torque and exits. The P52 fails, and he understands why: the cage interrupted the correction. He uncages the IMU and heads back to the optics station to realign.

He starts a new P52. The program hangs.

No alarm, no program light. The DSKY (display and keyboard, his only interface to the computer) accepts the input and does nothing. He tries V41, the manual gyro torque verb. Same result. Everything else on the computer works. Only gyro operations are dead.

The first failure looked normal: a cage event during alignment, with a known recovery. The second gives no clue what is wrong. The trained response to an accidental cage is to uncage and realign. Collins had been trained to restart the computer, but nothing about this failure would suggest he needed to. Commands were accepted, everything else worked. It would look like faulty hardware, not a stuck lock.

“My secret terror for the last six months has been leaving them on the Moon and returning to Earth alone”, Collins later wrote of the rendezvous. A dead gyro system behind the Moon, with Armstrong and Aldrin on the surface waiting for a rendezvous burn that depends on a platform he can no longer align, is exactly that scenario.

A hard reset would have cleared it. But the 1202 alarms during the lunar descent had been stressful enough with Mission Control on the line and Steve Bales making a snap abort-or-continue call.

Behind the Moon, alone, with a computer that was accepting commands and doing nothing, Collins would have had to make that call by himself.

The known landmarks

Margaret Hamilton (as “rope mother” for LUMINARY) approved the final flight programs before they were woven into core rope memory. Her team at the MIT Instrumentation Laboratory pioneered concepts we now take for granted: priority scheduling, asynchronous multitasking, restart protection and software-based error recovery. Even the term ‘software engineering’ is hers.

Their priority scheduling saved the Apollo 11 landing when the 1202 alarms fired during descent, shedding low-priority tasks under load exactly as designed. Most modern systems don’t handle overload that gracefully.

The most serious bugs that did surface were specification errors, not coding mistakes. Don Eyles, who wrote the lunar landing guidance code, documented several. For example, the ICD for the rendezvous radar specified that two 800 Hz power supplies would be frequency-locked but said nothing about phase synchronisation. The resulting phase drift made the antenna appear to dither, generating roughly 6,400 spurious interrupts per second per angle and consuming roughly 13% of the computer’s capacity during Apollo 11’s descent. This was the underlying cause of the 1202 alarms.

This defect has the same shape. BADEND is a general-purpose termination routine shared by all IMU mode-switching operations. It clears MODECADR (the stall register), wakes sleeping jobs, and exits. But LGYRO is a gyro-specific lock, acquired only by the pulse-torquing code and released only by the normal completion path in STRTGYR2. When the error path routes through BADEND, it handles the general resources correctly, but not the gyro-specific lock.

The AGC was written so defensively that latent faults like this would be silently corrected by the restart logic, which clears LGYRO as a side effect of full erasable-memory initialisation. Any test that happened to trigger a restart after the bug would see the system recover seamlessly.

The defensive coding hid the problem, but it didn’t eliminate it. A cage event without a subsequent restart would still leave the gyros locked. Collins would have no way to realign the guidance platform and no diagnostic clue pointing to the fix.

Star sighting

We found this defect by distilling a behavioural specification of the IMU subsystem using Allium, an AI-native behavioural specification language. The specification models each shared resource as an entity with a lifecycle: acquired, held, released.

The IMU entity declares a gyros_busy field modelling LGYRO. Two rules govern it:

rule GyroTorque {
    -- Sends gyro torquing pulse commands. Reserves the gyros,
    -- enables power supply, and dispatches pulses per axis.
    when: GyroTorque(command: GyroTorqueCommand)

    requires:
        imu.mode != caged
        imu.gyros_busy = false

    ensures:
        imu.gyros_busy = true
        GyroTorqueStarted()
}

rule GyroTorqueBusy {
    -- Gyros already reserved by another torquing operation.
    -- Caller sleeps until LGYRO is cleared.
    when: GyroTorque(command: GyroTorqueCommand)

    requires: imu.gyros_busy = true

    ensures:
        JobSleep(job: calling_job())
}

GyroTorque requires gyros_busy = false and ensures gyros_busy = true: the lock is acquired. Somewhere, on every path that follows, the lock must be released. The spec doesn’t show where in the code the release happens, but it makes the obligation explicit: if gyros_busy goes to true, something must set it back to false.

With that obligation written down, Claude traced every path that runs after gyros_busy is set to true. The normal completion path (STRTGYR2) clears it. The cage-interrupted path (BADEND) does not. MODECADR, the other shared resource, is correctly cleared in BADEND: LGYRO is missing.

The specification forces this question on every path through the IMU mode-switching code. A reviewer examining BADEND would see correct, complete cleanup for every resource BADEND was designed to handle.

The specification approaches from the other direction: starting from LGYRO and asking whether any paths fail to clear it.

Tests verify the code as written; a behavioural specification asks what the code is for.

A specification distilled by Allium models resource lifecycles across all paths, including the ones nobody thought to test. You can view the Allium specifications and reproduction of the bug on GitHub.

Course correction

Hamilton’s team released resources by loading the constant zero into the accumulator (CAF ZERO) and storing it into the lock register (TS LGYRO). Every release placed manually, by a programmer who remembered every path that could reach that point.

Modern languages have tried to make lock leaks structurally impossible: Go has defer, Java has try-with-resources, Python has with, Rust’s ownership system makes lock leaks a compile-time error.

Nevertheless, lock leaks persist. MITRE classifies the pattern as CWE-772: “Missing Release of Resource after Effective Lifetime”, and rates its likelihood of exploitation as high. Not all resources are managed by a language runtime. Database connections, distributed locks, file handles in shell scripts, infrastructure that must be torn down in the right order: these are still often the programmer’s responsibility. Anywhere the programmer is responsible for writing the cleanup, the same bug is waiting.

Every Apollo crew came home safely. But the IMU mode-switching routines were carried forward across missions in both the Command Module software (COMANCHE) and the Lunar Module software (LUMINARY). The fault was never noticed and never fixed.

A fifty-seven-year-old bug hid in flight-proven assembly. What’s hiding in yours? Let’s talk.

Thanks to Farzad “Fuzz” Pezeshkpour for independently reproducing the bug, and to Danny Smith and Prashant Gandhi for reviewing early drafts of this article.