Margaret Hamilton: The Woman Who Taught Me That Code Can Save Lives (Or End Them)
You know that moment when you’re staring at production logs at 3am, watching a cascade failure spread through your infrastructure, and you think “fuck, I should have planned for this”?
I have a photo on my wall (okay, it’s my desktop wallpaper, but same energy) that stops me from ever having that thought again. It’s from 1969. Margaret Hamilton—this 33-year-old MIT engineer—standing beside stacks of computer printouts that are literally taller than she is.
Margaret Hamilton with the Apollo Guidance Computer source code she and her team produced at MIT, 1969
Most people look at that photo and think “wow, that’s a lot of code.”
I look at it and see every single thing that could kill three astronauts, documented, tested, and handled.
That’s not just impressive engineering. That’s a completely different philosophy about what it means to write software. And once you understand it, you can’t go back to writing code the old way.
How I Found This Photo (And Why It Lives Rent-Free in My Head)
I first stumbled across Hamilton’s work around 2005 or 2006, deep in the trenches of building ISP infrastructure. The kind of systems where “just restart it” isn’t an option because you’ve got 50,000 people whose entire internet connection depends on your code not shitting the bed.
Someone posted that photo on a forum (remember forums?) with the caption “This is what software engineering looked like before Stack Overflow.”
I did what any curious engineer would do: I went down the rabbit hole.
Turns out, Hamilton didn’t just write the Apollo Guidance Computer software. She invented the discipline of software engineering as we know it. She pioneered defensive programming. She created the concept of priority-based task scheduling. She proved that software could be reliable, testable, and verified—all in an era when most people treated programming like black magic.
And she did it all while NASA kept telling her “that could never happen in production.”
Narrator: It absolutely happened in production.
The P01 Bug: Why You Should Listen to Four-Year-Olds
Here’s my favorite Margaret Hamilton story, and I swear to god I think about it every single time someone tells me “users would never do that.”
During Apollo program simulations, Hamilton would sometimes bring her four-year-old daughter Lauren to work. (Because apparently in 1968, MIT’s idea of “bring your kid to work day” was “sure, let your four-year-old play with the multi-million-dollar lunar simulator.”)
One day, Lauren is mashing buttons on the simulator and accidentally triggers P01—the pre-launch program—in the middle of a simulated flight.
The simulator crashes hard.
Hamilton looks at the crash and has this beautifully simple thought: “If my four-year-old can trigger this by accident, so can a highly trained astronaut.”
So she files a change request with NASA: add code to prevent astronauts from accidentally selecting P01 during flight.
NASA’s response?
“Denied. Astronauts are highly trained professionals who would never make such an error.”
(I want you to sit with that for a second. NASA—the people who put humans on the moon—just deployed the classic “our users would never do that” argument.)
Hamilton documented the bug anyway. She wrote the recovery procedure. She filed it in the manual.
And then she waited.
Apollo 8. December 1968. First manned mission to orbit the moon.
Three days into the mission, astronaut Jim Lovell is doing a routine procedure when he accidentally selects P01.
The computer wipes out all the navigation data.
All of it.
They’re orbiting the moon with no way to calculate their trajectory home.
Record scratch. Freeze frame. “You’re probably wondering how I ended up in this situation.”
It took nine hours for Hamilton’s ground team to upload new navigation data. Nine hours of three astronauts wondering if they’d ever see Earth again. Nine hours that only worked because Hamilton had documented the recovery procedure—even though NASA wouldn’t let her prevent the bug.
They made it home. And NASA immediately approved her fix for future missions.
The lesson: When someone finds a bug—even if they’re four years old, even if it seems impossible, even if “users would never do that”—assume it will happen in production. On Apollo 8, if possible.
I think about Lauren Hamilton every time I’m tempted to skip input validation.
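The modern version of that lesson is a guard on the “impossible” state transition. A toy sketch in Python (made-up mode names, nothing like the actual AGC code):

```python
VALID_TRANSITIONS = {
    "PRELAUNCH": {"ASCENT"},
    "ASCENT": {"ORBIT"},
    "ORBIT": {"DESCENT"},
    "DESCENT": {"LANDED", "ABORT"},
}

def change_mode(current: str, requested: str) -> str:
    # Reject the "impossible" request instead of assuming nobody will ever make it.
    if requested not in VALID_TRANSITIONS.get(current, set()):
        raise ValueError(f"refusing {current} -> {requested}: not a valid transition")
    return requested

change_mode("ORBIT", "DESCENT")        # fine
# change_mode("DESCENT", "PRELAUNCH")  # raises ValueError instead of silently wiping your state
```

The guard costs a few lines. The missing guard cost Apollo 8 nine hours.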
The Day Apollo 11 Almost Failed (And Why It Didn’t)
Okay, so now we get to the big one. The moment that validated everything Hamilton believed about defensive programming.
July 20, 1969. Neil Armstrong and Buzz Aldrin are three minutes from landing on the moon.
And then the computer starts screaming.
1202 alarm. 1201 alarm. 1202 again.
The Apollo Guidance Computer is overloaded. The rendezvous radar is flooding it with spurious data while the landing guidance is trying to run, and there simply aren’t enough cycles to handle both.
Now, here’s the thing you need to understand: in 1969, if software crashed, you rebooted and tried again. That was the state of the art. Hell, that’s still what most consumer software does in 2025.
But you can’t reboot the lunar module.
You get one shot.
Most software written in 1969 would have just crashed. Game over. Mission abort. Go home—if you can figure out how.
But Margaret Hamilton had designed the Apollo Guidance Computer with something that barely existed yet: priority-based task management and graceful degradation.
The system:
- Detected it was being asked to do too much
- Identified which tasks were critical (landing guidance) and which weren’t (rendezvous radar)
- Dropped the lower-priority tasks
- Kept the critical landing functions running
- Gave the astronauts a clear choice: land or abort
Armstrong and Aldrin made the call: land.
They became the first humans to walk on the moon.
Hamilton later wrote: “If the computer hadn’t recognized this problem and taken recovery action, I doubt if Apollo 11 would have been the successful Moon landing it was.”
The software saved the mission because it was designed to fail gracefully instead of catastrophically.
And that—that right there—changed how I think about every system I build.
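The AGC’s executive was hand-built assembly on 1960s hardware, but the pattern translates straight into modern code. Here’s a toy sketch of priority scheduling with load shedding (my framing and made-up task names, not Apollo’s actual design):

```python
import heapq

class Scheduler:
    """Toy priority scheduler: under overload, shed low-priority work and keep critical tasks running."""

    def __init__(self, capacity):
        self.capacity = capacity  # task-units of work we can afford per cycle
        self.tasks = []           # min-heap ordered by priority (lower number = more critical)

    def submit(self, priority, cost, name, fn):
        heapq.heappush(self.tasks, (priority, cost, name, fn))

    def run_cycle(self):
        budget = self.capacity
        shed = []
        while self.tasks:
            priority, cost, name, fn = heapq.heappop(self.tasks)
            if cost > budget:
                shed.append(name)   # drop it gracefully instead of crashing everything
                continue
            budget -= cost
            fn()
        if shed:
            print(f"overload: shed {shed}, critical tasks kept running")

# Hypothetical usage: landing guidance survives; radar processing gets shed under load.
sched = Scheduler(capacity=10)
sched.submit(priority=1, cost=8, name="landing_guidance", fn=lambda: print("guidance running"))
sched.submit(priority=5, cost=6, name="rendezvous_radar", fn=lambda: print("radar running"))
sched.run_cycle()
```

The point isn’t the data structure. The point is that overload is an expected input with a designed response, not a crash.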
What Hamilton Taught Me About Building Infrastructure That Doesn’t Suck
You gotta understand: when I first learned about Hamilton’s work, I was already deep into building the kind of infrastructure where downtime means angry customers, lost revenue, and me getting paged at 3am on a Saturday.
(There’s a special kind of hell reserved for infrastructure engineers who have to explain to the CTO why the entire payment processing system went down during Black Friday.)
Hamilton’s Apollo experience gave me a framework that I still use today. Not as theory. As survival skills.
The “Everything Will Fail” Design Principle
When I’m building an MCP server or setting up infrastructure, I play this game Hamilton taught me:
Assume everything will fail. Not “might fail.” Will fail.
- That API you’re calling? It’s going to timeout mid-request.
- That database connection? It’s going to drop during a critical transaction.
- That network link? It’s going to flake out right when you need it most.
- That perfectly formatted JSON? Someone’s going to send you XML. Or HTML. Or the complete works of Shakespeare.
- That rate limit you thought you’d never hit? You’re about to hit it. During a demo. To a customer.
When I built mcp-vultr—335+ tools for Vultr infrastructure automation—Hamilton’s voice was in my head the entire time:
“What happens when the Vultr API goes down during a deployment?”
So I built:
- Retry logic with exponential backoff
- Circuit breakers that fail fast when services are down
- Detailed error logging for debugging at 3am
- Graceful degradation when non-critical services fail
- Fallback modes for when the API is rate-limiting
Because if Hamilton could design software that handled computer overload three minutes before landing on the moon, I can damn well handle an API timeout.
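Here’s roughly what the retry-with-backoff piece looks like (a simplified sketch with hypothetical names, not the actual mcp-vultr code; the circuit breaker wraps the same call with a “stop trying entirely for a while” state):

```python
import random
import time

def call_with_retries(fn, max_attempts=5, base_delay=0.5):
    """Retry a flaky call with exponential backoff plus jitter; fail loudly when retries run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except (TimeoutError, ConnectionError) as exc:
            if attempt == max_attempts:
                # Out of retries: surface the failure so a fallback path (or a human) can take over.
                raise RuntimeError(f"gave up after {attempt} attempts") from exc
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1)
            print(f"attempt {attempt} failed ({exc!r}); retrying in {delay:.2f}s")
            time.sleep(delay)

# Hypothetical usage: deploy_server() stands in for a real API call that can time out.
# result = call_with_retries(lambda: deploy_server(region="ewr", plan="vc2-1c-1gb"))
```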
The Documentation Obsession (Or: Future You Will Thank Present You)
See those stacks of printouts in Hamilton’s photo? That’s not just code. It’s:
- Complete verification procedures for every module
- Interface specifications for every component
- Every possible failure mode, documented
- Recovery procedures for each failure
- Test results proving it all works
Hamilton said: “There was no second chance. We knew that. We took our work seriously.”
I think about this every time I’m tempted to skip writing a README. Every time I think “I’ll document this later.” Every time I write a quick hack with a comment like “// TODO: fix this properly.”
Sure, most of my code isn’t landing humans on the moon.
But when a system goes down at 3am, documentation is the difference between a 5-minute fix and a 5-hour nightmare of git blame and “what the fuck was I thinking?”
(Past Ryan: surprisingly often not thinking clearly.)
I have a rule now: if you can’t explain it to the person who’ll debug it at 3am, you don’t understand it well enough to deploy it.
That person is usually Future Me. Future Me is always tired, angry, and has forgotten everything Present Me thought was “obvious.”
Test the Failure Modes (Not Just the Happy Path)
Here’s something that blew my mind when I learned about Hamilton’s testing process:
She tested failures just as thoroughly as success.
Most developers test that their code works when everything’s fine. Hamilton tested what happened when everything was simultaneously on fire.
- Overload conditions? Tested.
- Bad inputs? Tested.
- Hardware failures? Tested.
- Multiple failures at once? Tested.
This is why your error handling code should have just as many tests as your business logic.
Maybe more.
Because in production, the happy path is boring and predictable. The error paths are where systems die, data gets corrupted, and engineers get paged.
When I’m writing tests now, I ask: “What would Hamilton test?”
Then I test that. And then I test what happens when that fails too.
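Concretely, that means the test file spends at least as much space on failure as on success. A toy pytest example, where parse_qty stands in for whatever your real business logic is:

```python
import pytest

def parse_qty(raw: str) -> int:
    """Tiny function under test: parse an order quantity, rejecting bad input explicitly."""
    if not raw.strip().isdigit():
        raise ValueError(f"not a valid quantity: {raw!r}")
    qty = int(raw)
    if qty <= 0 or qty > 1000:
        raise ValueError(f"quantity out of range: {qty}")
    return qty

def test_happy_path():
    assert parse_qty("2") == 2

def test_rejects_garbage():
    # Someone will eventually send you Shakespeare.
    with pytest.raises(ValueError):
        parse_qty("To be, or not to be")

def test_rejects_impossible_values():
    # The "users would never do that" inputs get tests too.
    with pytest.raises(ValueError):
        parse_qty("-3")
    with pytest.raises(ValueError):
        parse_qty("999999")
```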
The “Software Engineering” Revolution That Changed Everything
Quick history lesson that most engineers don’t know:
Margaret Hamilton coined the term “software engineering.”
Not as a cool name. As a revolutionary statement.
In the 1960s, most people treated software as “an art” or “magic”—something creative that you couldn’t really plan, measure, or verify systematically. You wrote code, ran it, and hoped for the best.
Hamilton looked at this approach and thought: “That’s insane. If you’re building a bridge, you don’t ‘fail fast and iterate.’ You prove it works BEFORE people drive on it.”
Her argument was simple:
If your software controls things that matter—infrastructure, money, medical devices, autonomous systems, AI decisions—apply the same rigor you’d demand from any other branch of engineering.
This is why the “move fast and break things” mentality drives me absolutely crazy.
That’s fine for a social media app. It’s catastrophic for infrastructure.
Hamilton showed us a different path: Move deliberately. Prevent breakage. Verify before deploying.
And she proved it works. Her software landed humans on the moon and brought them home safely. Every. Single. Time. Six missions. Twelve astronauts. Zero software-related failures.
The Hamilton Standard (Or: How I Actually Review Code Now)
When I’m reviewing code—mine or anyone else’s—I run it through Hamilton’s filter:
1. What could go wrong?
- Have we identified all failure modes?
- Have we tested them?
- What happens under load?
- What breaks first?
2. How does it fail?
- Does it crash catastrophically?
- Does it fail gracefully?
- Does it lose data?
- Does it corrupt state?
- Can we detect the failure?
3. Can we recover?
- Is this a one-way door?
- Can we roll back?
- What’s the recovery procedure?
- Have we tested recovery?
4. Is it documented?
- Could someone else debug this?
- At 3am?
- Without access to me?
- While half asleep?
5. Would Hamilton approve?
- Have we taken this seriously enough?
- Are we treating this like it matters?
- Would we deploy this if lives depended on it?
That last question might seem dramatic. But coding as if lives depend on it makes you a better engineer—even when lives don’t actually depend on it.
Because someday, for some system you build, they might.
Real-World Application: Building MCP Servers and AI Infrastructure
When I’m building MCP servers and AI infrastructure, Hamilton’s principles aren’t theoretical philosophy. They’re practical survival skills for keeping systems running.
Error Prevention at the Boundary:
    # Don't do this (trust everything)
    result = process_user_data(request.data)

    # Do this (Hamilton would approve)
    def handle_request(request):
        validated_data = validate_schema(request.data)      # reject bad input at the boundary
        if not validated_data.is_valid:
            log_validation_failure(validated_data.errors)    # leave a trail for the 3am debugger
            return graceful_error_response()

        with circuit_breaker("external_api"):                # fail fast when the dependency is down
            return process_user_data(validated_data)
Defensive Programming for AI Systems:
- Never trust AI output without verification
- Have human-in-the-loop for critical decisions
- Design rollback mechanisms for every action
- Log reasoning chains for debugging
- Test failure modes explicitly
- Rate limit to prevent runaway costs
- Implement kill switches for when things go wrong
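None of this needs to be exotic. Here’s a minimal sketch of the shape I aim for: an executor that wraps AI-proposed actions with a rate limit, a kill switch, and human approval for anything risky (hypothetical names, not any particular library’s API):

```python
import time

class GuardedExecutor:
    """Wrap AI-proposed actions with a rate limit, a kill switch, and human approval for risky ones."""

    def __init__(self, max_actions_per_minute=30):
        self.max_per_minute = max_actions_per_minute
        self.recent = []          # timestamps of recently executed actions
        self.kill_switch = False  # flip to True to stop everything immediately

    def execute(self, action, risky, run, approve=lambda action: False):
        if self.kill_switch:
            return {"status": "blocked", "reason": "kill switch engaged"}

        now = time.time()
        self.recent = [t for t in self.recent if now - t < 60]
        if len(self.recent) >= self.max_per_minute:
            return {"status": "blocked", "reason": "rate limited"}          # prevent runaway costs

        if risky and not approve(action):
            return {"status": "blocked", "reason": "needs human approval"}  # human-in-the-loop

        self.recent.append(now)
        print(f"executing: {action}")  # log the decision so Future You can reconstruct what happened
        return {"status": "ok", "result": run()}

# Hypothetical usage: destructive actions need explicit approval; read-only ones just run.
ex = GuardedExecutor()
print(ex.execute("delete_server", risky=True, run=lambda: "deleted"))
print(ex.execute("list_servers", risky=False, run=lambda: ["srv-1", "srv-2"]))
```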
The parallel to Apollo is direct: Hamilton couldn’t patch the lunar module after launch. I can’t always roll back an AI decision after it’s been executed and caused real-world effects.
You design it to work the first time. Or you design it to fail gracefully.
Those are your two options. Pick one and commit.
Why This Matters More Than Ever (Spoiler: AI Changes Everything)
The tech industry is in love with “move fast and break things.” Ship it and iterate. Fail fast. Learn from production.
That works fine for consumer apps where the worst-case scenario is someone sees a broken UI.
It fails catastrophically for:
- Infrastructure (like what I build)
- Financial systems (like payments)
- Medical devices (like insulin pumps)
- Autonomous vehicles (like self-driving cars)
- AI systems making real-world decisions (like everything we’re building now)
And here’s the thing that keeps me up at night:
We’re deploying AI into all of these domains. AI that we don’t fully understand. AI that can make mistakes in subtle, hard-to-detect ways. AI that can fail in modes we haven’t even imagined yet.
We need Hamilton’s approach more than ever.
Design systems that degrade gracefully when things go wrong.
Because with AI, things will go wrong in ways we haven’t planned for.
The Legacy
In 2016, President Obama awarded Margaret Hamilton the Presidential Medal of Freedom.
Margaret Hamilton receiving the Presidential Medal of Freedom from President Obama, November 22, 2016 (Wikimedia Commons)
But her real legacy isn’t the medal. It isn’t even landing on the moon.
It’s showing us how to build software that matters.
Software you can trust. Software that works when everything else is failing. Software that might—just might—save lives.
She pioneered:
- Defensive programming patterns
- Error detection and recovery systems
- Priority-based task management
- Asynchronous real-time communication
- Graceful degradation under load
- The entire discipline of software engineering
These aren’t historical curiosities. These are foundational techniques for modern reliable systems.
For over five decades, Hamilton’s methods have shaped how we build software that people depend on.
Building Like Hamilton (A Practical Checklist)
Next time you’re building something that matters, channel Hamilton:
Before writing code:
- Understand all possible failure modes
- Design error handling strategy first (not as an afterthought)
- Plan testing approach for both success and failure
- Document design decisions while you still remember why
While writing code:
- Validate all inputs at system boundaries
- Handle errors explicitly (no silent failures)
- Add observability everywhere (logs, metrics, traces)
- Write clear, actionable error messages for Future You
After writing code:
- Test success paths work correctly
- Test failure paths just as thoroughly
- Test edge cases (especially the “impossible” ones)
- Test recovery procedures actually work
- Document behavior (especially weird edge cases)
Before deploying:
- All tests pass (no “we’ll fix it later”)
- Code reviewed by another human being
- Documentation updated to match reality
- Rollback plan exists and is tested
- Monitoring configured to detect failures
- Runbook written for 3am debugging
Hamilton built systems where failure meant people died. We build systems where failure is expected but should never be catastrophic.
Apply her rigor. Your future self—debugging production at 3am, half-asleep, three coffees in—will thank you.
Present Me has thanked Past Me for following these rules more than once. It’s a weird feeling, but I’ll take it.
The Photo That Changed Everything
I keep coming back to that 1969 photo.
Not because it shows a lot of code. Anyone can write a lot of code. (Hell, LLMs can generate millions of lines in seconds.)
I come back to it because it shows what it looks like to take software seriously.
Every page in those stacks represented decisions:
- What if this fails? (We handle it gracefully)
- How do we recover? (Here’s the procedure)
- How do we prevent this? (Here’s the safeguard)
- How do we know it works? (Here’s the proof)
That’s the standard.
That’s what it means to be a software engineer, not just someone who writes code.
Build like lives depend on it.
Sometimes they do.
“There was no second chance. We knew that. We took our work seriously, many of us beginning this journey while still in our 20s. Coming up with solutions and new ideas was an adventure. Dedication and commitment were a given.” — Margaret Hamilton, 2009
(She was 32 when Apollo 11 landed. Think about that next time someone tells you you’re too young to work on something important.)