Failure is ALWAYS an Option
“Failure is not an option.” — Gene Kranz, NASA Chief Flight Director, Gemini and Apollo missions
“Failure is always an option.” — me. Also, Adam Savage, special effects designer
As far as facts go, Gene Kranz never uttered the phrase “Failure is not an option.” The Apollo 13 screenwriters fabricated that phrase to summarize his tenure as NASA Chief Flight Director over the Gemini and Apollo space missions.
Special effects designer Adam Savage of Mythbusters fame coined the phrase “Failure is ALWAYS an option” as a tagline for his show.
Gene Kranz had told the Apollo 13 screenwriters that when anything went wrong on any of the NASA space missions, “we just calmly laid out all the options, and failure was not one of them”.
However much the NASA engineers planned, things would and did go wrong.
Because failure is always an option. In engineering, in nature, in life. Thirty-three percent of the (original) laws of thermodynamics are dedicated to this notion: the second law tells us that any isolated system gravitates toward a state of disorder. In short, all things break down.
Failure in the delivery cycle
However, it is easy to overlook this, or to not prepare for these eventualities, within the software development lifecycle: get a deadline, write code, deliver, receive bug reports, and complain about why the user misused our product like that.
As a self-taught developer and engineer, I can look back and see my follies: tackling a problem, typing code for a few hours or a day or two, testing my code, and feeling like Doc Brown in Back to the Future. It works!
But how many times did I think that I was done because my code worked? How many times have I heard a delivery manager communicate this same sentiment? We got our code to execute in the predetermined way within the allotted time, so let’s deploy and delight in meeting the goal.
Not every developer or delivery manager is this myopic in their approach to achieving code completion. From my experience and observations, however, the engineers who see beyond the happy path, in comparison with those focused on rapidity, are often deemed the stragglers within the build cycle. These engineers may not have the shared vocabulary to educate their peers as to why working code does not equate to code complete.
And if they are able to express that the ticket or request is incomplete in addressing known use cases and unknown edge cases, they're seen as putting perfection ahead of progress.
Right-Now solution vs the Right solution
Over the last few years, I have had several opportunities to help bridge this expectation gap between engineers who are targeting every use and edge case and delivery managers focused on meeting the prescribed deadline.
Where some delivery managers see these engineers as a hindrance to code completion, I quickly realized that these engineers indeed have a problem: an image problem. The delivery managers may understand that these engineers have great skills (or they may even misjudge an engineer's acumen based on their previous adherence to timelines). But in each of these instances, delivery had not grasped the engineer's reasoning for why the code is not complete even though it executes as expected.
Routine one-on-ones afforded me the opportunity to discuss (both with engineers and with delivery management) the spectrum between the teams' expectations and the risk factor of delivering the “right now” solution, usually the happy path packaged up as a mock service, with the service hardened as a second delivery. I was able to adapt the Pareto Principle (80/20 rule) to help normalize expectations.
The largest single use case in testing and using the code is the happy path. However, that single use case is not the majority. The combination of known use cases and unknown or undiscovered edge cases makes up the majority of many code lifecycles.
Normalizing expectations
I have frequently fostered these conversations, sometimes repeatedly between the same engineer and delivery manager. However, after normalizing each other’s expectations, I’ve watched the delivery team not only understand the engineers’ hesitation in delivering happy-path code, but actively advocate that this perceived delay has the potential to protect the team from undesired consequences.
These two parts of the team now lay out a more equitable timeline to ensure both rapid delivery and hardening against unexpected use cases. And together my engineering team and the delivery team are able to normalize their understandings of each other’s anxieties.
As I put it, we are weighing our balance of headaches: rapidity versus stability. This framing helps delivery management understand that the happy path is but a small piece of the entire use-case landscape. Likewise, it has helped engineers manage their anxiety about delivering code, with the shared understanding that hardening against 100% of the exception cases is near impossible.
In these instances, it has been my prerogative as a manager to take responsibility off the engineer for code that he or she is not 100% confident covers yet-undiscovered scenarios, so that downstream development can continue. And engineers who were once seen as an encumbrance to the sprint cycle are now seen as trusted leaders in ensuring the stability of the product.
Fantasy Management
I have never had the desire to explore fantasy sports and “manage” my own team of football or basketball stars. But that shouldn't stop me from armchair-managing two engineering groups that my team routinely relies upon and that exemplify either side of this spectrum.
The overly prudent team is upstream of us and is a non-stop, 24/7 content-delivery provider for the entire organization; the happy path rapid-delivery team is an external group that builds front-end applications.
This upstream team is understandably reluctant to make significant changes in how they provide their service. They literally have no window of opportunity to roll out a new (and drastically different) paradigm without jeopardizing one of a half-dozen clients who operate on divergent delivery cycles and cannot be down, either for upstream maintenance or for failures and disruptions in their content distribution. Any changes this team makes need to be carefully arranged around each client's unique needs.
The downstream team is an outsourced development team. They have tighter deadlines, as they need to interface with a variety of external distribution channels. They also have the additional disadvantage of operating outside the confines of our offices: they miss out on the hallway conversations and nuances that could help address assumptions and misunderstandings.
In any given situation, we only have three options to effect change:
- Say something
  * Talk with the people in charge of effecting change. This could be via influential or peer one-on-ones.
- Take it upon yourself to do something
  * This could be going against the rules or even leaving your given situation.
- Accept the situation as is
  * Understand that you may be unwilling or unable to do either of the above and can only surrender to your given situation.
I have standing peer one-on-ones with the tech leadership on both teams. But saying something does not always lead to change. At least not in a timely manner.
Be the change you want to see
Since my team is afforded the benefit of smaller scale — we only support one brand — we can test services with time enough to modify functionality or scale.
As such, I have begun an experiment with our upstream team: researching the architectural changes that I believe could benefit that team's rollout, then helping to prove out and guide the implementation of the tools we find work well in the stateless world of serverless computing, such as better methods of triaging, scaling, and handling throughput.
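As one illustration of the kind of tooling we are proving out, here is a minimal sketch (my own example, not the upstream team's implementation) of retrying a downstream call with exponential backoff and full jitter, so that a stateless function backs off during a partial outage instead of amplifying the load. The downstream call itself is a hypothetical placeholder.

```typescript
// Sketch only: a generic retry helper with exponential backoff and full jitter.
// Nothing here is tied to a particular cloud provider.
async function withBackoff<T>(
  action: () => Promise<T>,
  maxAttempts = 4,
  baseDelayMs = 200,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await action();
    } catch (err) {
      if (attempt >= maxAttempts) throw err; // give up after the final attempt
      // Full jitter: wait a random amount up to an exponentially growing cap.
      const cap = baseDelayMs * 2 ** (attempt - 1);
      await new Promise((resolve) => setTimeout(resolve, Math.random() * cap));
    }
  }
}

// Hypothetical usage inside a stateless handler:
// const content = await withBackoff(() => fetchContentFromOrigin(contentId));
```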
We have only just begun. I hope that our smaller scale can translate to a more efficient rollout for this larger team with much bigger consequences at stake.
As for the outsourced development team, I have begun addressing the idea of failure from their perspective. In the past, they have required, and then relied on, the fallacy of a guaranteed contract. Because we often aggregate and deliver a myriad of content sources as part of our payload, we adhere to a predetermined structure in times of content success or failure.
However, in our routine office-hour meetings, I reiterate the mantra that failure is always an option. What if our service is working, but the CDN is down? What if the devices they maintain lose connectivity? Our options are twofold: break the consumers' experience or plan for failures.
We have discussed circuit-breaking design patterns, decoupling responses, and shared or redundant monitoring systems. A next step would be to help facilitate QA scenarios for specific unhappy paths.
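To make the circuit-breaking conversation concrete, the sketch below is my own illustration rather than the partner team's code: a small breaker that wraps a content fetch and serves a placeholder payload once the CDN starts failing. The endpoint URL, thresholds, and fallback shape are all assumptions for the sake of the example.

```typescript
// Minimal circuit-breaker sketch. After repeated failures the breaker "opens"
// and serves the fallback immediately; after a cool-down it lets one trial
// request through ("half-open") before fully closing again.
type BreakerState = "closed" | "open" | "halfOpen";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly failureThreshold = 3,
    private readonly resetTimeoutMs = 30_000,
  ) {}

  async call<T>(action: () => Promise<T>, fallback: () => T): Promise<T> {
    if (this.state === "open") {
      if (Date.now() - this.openedAt < this.resetTimeoutMs) return fallback();
      this.state = "halfOpen"; // cool-down elapsed; allow one trial request
    }
    try {
      const result = await action();
      this.failures = 0;
      this.state = "closed";
      return result;
    } catch {
      this.failures += 1;
      if (this.state === "halfOpen" || this.failures >= this.failureThreshold) {
        this.state = "open";
        this.openedAt = Date.now();
      }
      return fallback();
    }
  }
}

// Hypothetical usage: serve a stale or empty payload rather than breaking the
// consumer's experience when the CDN is down.
async function loadContent() {
  const cdnBreaker = new CircuitBreaker();
  return cdnBreaker.call(
    async () => {
      const res = await fetch("https://cdn.example.com/content.json");
      if (!res.ok) throw new Error(`CDN responded ${res.status}`);
      return res.json();
    },
    () => ({ items: [], stale: true }), // placeholder fallback payload
  );
}
```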
As Eddie Cantor once said, it takes 20 years to make an overnight success. The small, iterative changes I have in mind will take much less than that. But the concept is the same: small, iterative changes. And lots of patience.