Learning from Failure: Lessons from the Challenger Disaster and Engineering Ethics

We all make mistakes. The only difference is how we deal with them; hiding them is always the worst solution. Let's look at one of the most famous human failures, one that led to the worst possible outcome: deaths.

The Challenger Disaster

The year was 1986. On the 28th of January, the Space Shuttle Challenger broke apart 73 seconds after liftoff, killing all seven crew members.

But why did it happen? What was the cause?

As with any disaster like this, a commission was set up (the Rogers Commission, appointed by President Reagan) to understand what happened, so that everyone could learn from the mistakes and avoid repeating them.

The O-ring problem

The Challenger used solid rocket boosters (SRBs). The SRBs were the first solid-propellant rockets used for primary propulsion on a vehicle for human spaceflight.

Each SRB was built from seven sections, six of which were joined in pairs at the factory. This left four segments to be assembled at the launch site. The factory joints were sealed with asbestos-silica insulation, while each field joint between segments was sealed with two rubber O-rings.

As with any engineering project, there were tests to ensure everything worked as expected. In 1977, a test that used pressurized water to simulate the effects of booster combustion showed that the metal parts of the joint bent away from each other, opening a gap. This phenomenon, known as "joint rotation", caused a momentary drop in air pressure, which made it possible for combustion gases to erode the O-rings. If the O-rings eroded, a flame path could develop, causing the joint to burst and destroying the booster and the Shuttle.

With this knowledge from the experiments, engineers wrote to the manager of the SRB project, saying that the O-ring design was not acceptable for flight. Now, guess what happened: the manager didn't pass this information on to the engineers at the contractor that built the boosters, and the field joints were accepted for flight in 1980.

While this knowledge was based on tests, the first confirmed case of O-ring erosion in flight was found on the second Space Shuttle mission, STS-2. You might think that, at this point, the problem was reported so it could be addressed. The reality is quite different: instead of following the regular reporting rules, the Marshall Center did not report it to senior management at NASA and shared it only with the contractor that built the boosters (Thiokol).

But how? They had known for years that this could be a problem, and they had ignored it. Instead of briefing everyone, they chose to keep it quiet and handle it with the contractor.

After all these reports, the O-rings were finally classified as "Criticality 1", which means that a failure could cause the destruction of the orbiter. Did this information lead anyone to stop the flights? No. No one at the Marshall Center suggested grounding the Shuttle until they had a fix.

The following flights produced more evidence that the O-rings were unsafe. The post-flight analysis of STS-41-D found that the O-rings had suffered erosion, but the risk was judged to be low because each joint had two rings.

Flights continued to produce more and more evidence that the O-rings could destroy the Shuttle if no solution was found. STS-51-B showed, for the very first time, a primary O-ring so badly eroded that it no longer sealed, which let hot gases erode the second ring as well. At this point, it was clear from an engineering point of view that flights should have been stopped. However, flights continued.

Adding cold weather to the mix

All the previous evidence was gathered in "good weather conditions". None of the flights or tests had been run at really low temperatures, so the O-rings were never certified to operate in the cold. The forecast for launch day predicted temperatures below -1 °C, the minimum temperature permitted for launch.

The contractor's engineers were worried that others would not share their concerns about the effects of low temperatures on the boosters. In October 1985, engineer Bob Ebeling wrote a memo titled "Help!". You can imagine his desperation: his goal was that someone would read it and take action about flying in low temperatures.

After seeing the weather forecast for launch day, NASA contacted the contractor about this issue. A contractor manager asked Ebeling about the possibility of a launch at 18 °F (-8 °C). He answered: "We're only qualified to 40° (4 °C) ... what business does anyone even have thinking about 18°, we're in no-man's land." At this point, the team agreed that a launch risked disaster, so they called NASA and recommended postponing the launch until temperatures were within the approved range. NASA manager Jud Lovingood replied that the contractor could not make this recommendation without providing a safe temperature. The contractor organized a teleconference two hours later to justify the no-launch recommendation.

At the teleconference, several engineers reiterated their concerns about the effects of low temperatures on the O-rings and insisted on postponing the launch. The conversation centered on the lack of data to determine whether the O-rings would work properly below 54 °F (12 °C). This was important because, if you remember, the SRB O-rings were designated a "Criticality 1" component, meaning there is no backup if both rings fail. The failure could destroy the orbiter and kill its crew.

While NASA was against postponing the launch, the contractors tried to convince them. During the conference, NASA said things like: "I am appalled by your recommendation." or "My God, Thiokol, when do you want me to launch—next April?"

NASA believed that even if the first O-ring failed, the second one would work. However, this had not been tested at all, and in any case the argument was not valid because the O-rings were marked as a "Criticality 1" component. Astronaut Sally Ride, a member of the Rogers Commission, pointed out while questioning NASA managers that it is forbidden to rely on a backup for a "Criticality 1" component.

With all this mess, the clock was running, and it appears that a second conference was held between NASA and the contractor, but without the engineers. It is still unclear why the contractor's managers disregarded their engineers' warnings and finally recommended launching as initially scheduled. Ebeling told his wife that night that Challenger would blow up.

Launch day arrived, and the temperatures were below freezing: 28 °F to 28.9 °F (-2.2 °C to -1.7 °C). Due to all the ice covering the Shuttle that day, NASA postponed the launch by an hour so the Ice Team could perform another inspection. This last inspection showed that the ice appeared to be melting, so Challenger got the green light for launch at 11:38 am EST.

At around T+73 seconds, the vehicle broke apart. Shortly afterwards, mission control announced: "We have a report from the Flight Dynamics Officer that the vehicle has exploded. The flight director confirms that. We are looking at checking with the recovery forces to see what can be done at this point."

What can we learn from this?

As this article has shown, several engineers warned several times about the O-ring issues. However, NASA and the contractor's managers ignored them, even though precise data showed that the orbiter could be destroyed.

So, was the message not clear enough? Didn't the engineers provide good enough information? They did, but there was pressure from other spheres about the launch day. The critical moment is the meeting between NASA and the contractor without any engineers present. Why would you remove the engineers, if not to silence them and steer everyone else towards your goal of launching on the day you wanted? The most depressing part is that the engineers asked only to postpone the launch, not to cancel the flight (even though flying at all was risky as hell).

Luckily, all the engineers involved with the O-ring tests and problems documented everything, which helped clarify the problem and how NASA pressured Thiokol to get the approval. They left signed documents recommending that the launch be postponed, and they refused to sign the documents that approved it.

These engineers were the best professionals I can think of: they risked their jobs by opposing the top bosses in their chain of command.

The takeaway from this specific case, for other engineers, is to always document everything. Documentation will save you and your project if something like this is pushed through without your approval. If things go awry, you will always have proof that you at least tried to inform your superiors about the possible consequences. Also, build tests. Test everything. Even when you test everything, there will be tons of uncertainty and errors. Tests help us reduce, manage, and handle that uncertainty properly.
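
Here is a minimal sketch of that idea in Python. It is only an illustration, not how NASA or Thiokol evaluated anything: the names `Decision` and `launch_gate` are hypothetical, and the two numbers are the ones quoted above (a forecast of 18 °F against a qualification limit of 40 °F). The gate says no by default whenever conditions fall outside what was actually tested, and it always records the reason in writing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    """Outcome of a release gate (hypothetical type for this sketch)."""
    go: bool
    reason: str  # written justification, so the decision leaves a trace


def fahrenheit_to_celsius(deg_f: float) -> float:
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (deg_f - 32) * 5 / 9


def launch_gate(forecast_f: float, lowest_qualified_f: float) -> Decision:
    """Allow a launch only inside the range we have evidence for.

    The burden of proof is on "it is safe": anything below the
    qualified temperature is a NO-GO, even without proof of failure.
    """
    if forecast_f >= lowest_qualified_f:
        return Decision(True, f"{forecast_f} °F is within the qualified range")
    forecast_c = fahrenheit_to_celsius(forecast_f)
    return Decision(
        False,
        f"{forecast_f} °F ({forecast_c:.1f} °C) is below the qualified "
        f"minimum of {lowest_qualified_f} °F: no evidence that it is safe",
    )


# Tests written next to the gate, using the figures quoted in this article.
def test_cold_forecast_is_blocked():
    decision = launch_gate(forecast_f=18, lowest_qualified_f=40)
    assert not decision.go
    assert "no evidence" in decision.reason


def test_qualified_forecast_passes():
    assert launch_gate(forecast_f=54, lowest_qualified_f=40).go
```

The two test functions can be run with pytest or simply called by hand. The point of the design is the default: the gate never asks anyone to prove that launching is unsafe, it asks for evidence that launching is safe.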

And yes, I'm a computer engineer; I don't build rockets, but pressure to launch products without proper tests is always there. Finding the right balance is something you learn over the years, but launching a project because someone is pushing you is not the right way of doing things. And if you re-read everything, that scenario could lead to the worst possible outcome: deaths. Yes, this is an exaggeration because most of us build web or phone apps, but now that software is ubiquitous, engineers should take all this more seriously. AI, web3, etc. are powerful, and we should be vigilant and take care of what we build, how we test it, and also, really importantly, how we report problems back to our superiors and convince them to avoid the next big failure.

Daniel Lombraña

Commented 1 year ago

Just to start the year, here you have a long blog post about how we can learn from mistakes when developing new projects and products: https://paragraph.xyz/@teleyinex.eth/learning-from-failure The aim of the article is to show why it is important to document everything, test everything, and be a professional.

Ed O'Shaughnessy

Commented 1 year ago

This line has always struck me as a fundamental observation about any high risk endeavour. "NASA appeared to be requiring a contractor to prove that it was not safe to launch, rather than proving it was safe." https://history.nasa.gov/rogersrep/v1ch5.htm