~ 9 mins. read time | Last modified May 7, 2019
Don’t Kill Yourself or Anyone Else
Not that it’s in any training manuals, but I think it’s safe to say a good rule of thumb for anyone flying an aircraft is:
- Don’t kill yourself
- Don’t kill anyone else.
However, these rules are sometimes broken. Sometimes the cause is outside of the pilot’s control, and sometimes the rules are broken in spite of the pilot’s best efforts, often when one of the following conditions is met:
- The pilot isn’t operating at 100%: they’re tired, stressed, or ill.
- The situation they’re in is non-standard, and they react to it incorrectly.
These are known as “human performance” issues. As a Private Pilot*, I had to sit a theory exam on human performance issues during training, which tested my knowledge around the risk factors that we introduce to aircraft as fallible human beings, along with the tried and tested procedures and patterns to mitigate them.
Sure, falling foul of human performance issues can be a more permanent mistake in aviation than in tech, but the mitigations carry over well when both designing and maintaining production software.
In this post, I’ll be sharing a few of my favourite aviation processes and systems that I learned through both my Private Pilot Licence training and a general anorak-wearing interest in aviation incident investigations, along with how I apply them to software engineering and infrastructure management.
* I’m not a commercial/airline pilot. There’s a whole world of knowledge above my licence that I am not privy to. I’d love to hear thoughts from commercial and other GA pilots in the comments of this post around the aviation procedures and conventions you employ outside of the cockpit!
Checklists are simple documents that describe how to carry out potentially complex operations step by step, such that each step is an individually small piece of work. They are designed to require minimal thinking, so that you can still carry out the tasks they describe even if your mental load is high at the time of reading. Tasks can range from the mundane, such as checking an aircraft over before a flight, to the extreme, such as securing an engine fire in flight.
In software, checklists also come with the added benefits of:
- Allowing someone else to follow the checklist if you are not around.
- Having a repeatable, written process for the task you wish to carry out, which is the precursor to being able to automate it away in code.
As an Infrastructure Engineer/SRE, let’s assume a scenario where you need to carry out a repeatable but non-standard operation such as an SQL database replica restoration. You have a multi-step, fiddly process that if you get wrong would take some time to recover from, during which availability could suffer. With this in mind, you decide to carry out the process out-of-hours to minimise the risk of affecting users.
You are now carrying out a slightly stressful, non-standard operation when you’re tired. Remember those two human performance danger conditions?
Having a checklist to follow in these situations offloads the thinking and planning to a less stressful time, leaving your mental capacity free to deal with the situation as it occurs.
A few tips for writing checklists:
- Use a tool or service that supports Markdown, such as GitHub Gists or Wikis. Your checklists will be formatted nicely with no effort, and the current state of each checkbox will be stored.
- Try to keep each item as simple as possible with one action, such as running a single command or checking a single log file.
- Create a master checklist for your processes then make a copy of it every time you need to run through it. Make sure you tick each item as you go through to get a free and automatic audit log of your actions for later analysis, should you need it.
For example, let’s take a look at a checklist for an abnormal situation in which your automated deployment tooling is down and you need to revert an application change:
Manually Flip a Blue/Green application's colour:
- [ ] SSH to web server
- [ ] Check current live version of the site: `grep "colour" /etc/nginx/sites-enabled/app-live.conf`
- [ ] Update Nginx config parameter "$live_colour" to point at NEW colour: `vim /etc/nginx/sites-enabled/app-live.conf`
- [ ] Reload Nginx: `nginx -t && systemctl reload nginx`
- [ ] Run test suite: `/var/www/app/tests.sh`
- [ ] TESTS SUCCESSFUL
  - [ ] Post @here in #deployments on Slack, alert team site was deployed.
- [ ] TESTS FAILED:
  - _repeat checklist to flip back to original colour_
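Once a checklist like this has been run by hand a few times, the config edit is a natural first candidate for automation. Here is a rough sketch of what that could look like. It assumes the live config sets the parameter with a line such as `set $live_colour blue;` and that the only two colours are `blue` and `green`. The checklist only names the parameter, so that syntax, along with the function names, is an assumption for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: assumes the live config contains a line such as
#   set $live_colour blue;
# The exact syntax is an assumption; the checklist only names the parameter.

# Print the colour currently marked live in the given config file.
current_colour() {
  sed -n 's/^[[:space:]]*set[[:space:]][[:space:]]*[$]live_colour[[:space:]]*\([a-z]*\);.*/\1/p' "$1" | head -n 1
}

# Print the colour that is NOT the one given.
other_colour() {
  case "$1" in
    blue)  echo green ;;
    green) echo blue ;;
    *)     echo "unknown colour: '$1'" >&2; return 1 ;;
  esac
}

# Point $live_colour at the other colour and print the new value.
# The edit goes to a temp file first, so a failed sed never leaves
# a half-written config behind.
flip_colour() {
  local conf="$1" old new tmp
  old="$(current_colour "$conf")"
  new="$(other_colour "$old")" || return 1
  tmp="$(mktemp)" || return 1
  if sed "s/\([$]live_colour[[:space:]]*\)${old};/\1${new};/" "$conf" > "$tmp"; then
    cat "$tmp" > "$conf"   # keeps the original file's owner and mode
  else
    rm -f "$tmp"
    return 1
  fi
  rm -f "$tmp"
  echo "$new"
}
```

Running `flip_colour /etc/nginx/sites-enabled/app-live.conf` would only replace the `vim` step. The remaining items (`nginx -t && systemctl reload nginx`, then the test suite) still apply, and are worth keeping as separate, visible steps until the whole process has earned your trust.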
Cross-Checks and known responsibilities
Checklists are a great first step, but their existence doesn’t enforce their adoption, and nothing is stopping you from skipping individual items even if you do use them.
This lack of enforcement is a real problem in single-pilot aviation, with many fatal and non-fatal accidents attributed to improper checklist discipline. It’s mitigated via “USE YOUR CHECKLIST!” being yelled at you during training, but ultimately, as single operators, it is entirely our responsibility to be strict with checklist discipline when the situation is normal to ensure that it’s our instinct to use them if a situation becomes abnormal.
“Doors to manual and cross-check, please.”
But monitoring ourselves is HARD. It’s a hell of a lot harder than monitoring someone else, which is one of the reasons that in commercial aviation, there are often two pilots on the flight deck. One will act as the Pilot Flying, i.e. they are in charge of the aircraft’s flight controls, and the other will be the Pilot Monitoring, whose job it is to carry out any ancillary tasks requested by the PF, and to “cross-check” their actions.
This cross-checking doesn’t stop at the cockpit doors, either. You may have also heard the phrase, “doors to manual and cross-check” over the PA after landing on a commercial flight. The pilots are asking the cabin crew to disarm the evacuation slides on the exits, and then to check each other’s work.
As software engineers in charge of infrastructure, we are often on a team with other software engineers. In non-standard or stressful situations, you can minimise your own mistakes by running through your checklists with someone else, but remember:
- Decide who will act as the “Pilot Flying” and “Pilot Monitoring” before you begin. You could use the same titles, but to avoid getting weird looks from across the office, I suggest using something like Lead and Monitor.
- Any actions taken should be carried out by the Lead, and checked by the Monitor. The Monitor should only ever carry out actions if explicitly asked by the Lead.
- The Monitor should make a copy of the checklist, date it, and check the items off as the Lead steps through the list.
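The Monitor’s side of this can be helped along with a few small shell functions. The following is a minimal sketch, assuming checklists are Markdown files using `- [ ]` checkboxes like the example earlier in this post. The function names and the dated file naming scheme are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: function names and the file naming scheme are hypothetical.

# Copy a master checklist to a dated working copy and print the new path.
# Refuses to overwrite an existing copy, so an earlier audit log is never lost.
start_checklist() {
  local master="$1" outdir="$2" stamp copy
  stamp="$(date -u +%Y-%m-%dT%H%M%SZ)"
  copy="$outdir/$(basename "$master" .md)-$stamp.md"
  if [ -e "$copy" ]; then
    echo "already exists: $copy" >&2
    return 1
  fi
  cp "$master" "$copy" && echo "$copy"
}

# Tick the Nth checkbox (1-based) in a checklist copy.
tick_item() {
  local copy="$1" n="$2" tmp
  tmp="$(mktemp)" || return 1
  awk -v n="$n" '
    /^[ \t]*- \[[ x]\]/ {
      count++
      if (count == n) sub(/- \[ \]/, "- [x]")
    }
    { print }
  ' "$copy" > "$tmp" && cat "$tmp" > "$copy"
  rm -f "$tmp"
}

# Count the items still unticked, so the Monitor can confirm nothing was skipped.
remaining_items() {
  grep -c -- '- \[ \]' "$1" || true   # grep exits non-zero when the count is 0
}
```

For example, `copy="$(start_checklist flip-colour.md ./runs)"` followed by `tick_item "$copy" 1` as the Lead completes the first step. The master file is never modified, so every run leaves behind its own dated record.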
A great example of both checklists and cross-checking in action can be found in this YouTube video, where a real-life failure on an Airbus A340 happened to occur while the crew were being filmed.
You can see the crew working through a checklist item-by-item, which they use to diagnose a problem with one of the generators, and carry out the corrective action of shutting it down. Commercial jets are full of redundant systems so this was a mere inconvenience, but they maintained the checklist and cross-check discipline to ensure the PF shut down the correct generator. Accidentally shutting down the wrong generator could have easily added confusion and escalated the severity of the situation.
Carrying out corrective action against the wrong target has been the cause of disasters in both aviation (British Midland Flight 92) and in tech (GitLab Database Outage), so it’s well worth the extra few seconds to cross-check your actions.
Dealing with Incidents
In most of the world, aviation accidents or incidents are investigated by a regulatory body. In the UK, it’s the AAIB who endeavour to establish facts, identify a cause, and issue recommendations around mitigating future risk from an incident. However, most General Aviation aircraft don’t have the high-tech black boxes airliners have, so if the pilot isn’t around to provide the facts of an incident, the reports can sometimes be akin to, “They crashed”. Even so, these reports are still a fantastic resource for pilots and non-pilots alike to learn a little more about aviation.
As software engineers, we have access to better logging and monitoring than most General Aviation aircraft, so we really should be able to create useful incident reports that are much more detailed than, “It crashed”. Internal incident reports enable your technical team to better understand issues, and your non-technical team to understand both the complexity of your systems and the effort that goes into keeping them online.
The draft for these reports should be written by the Monitor as the Lead investigates and carries out actions. The Lead should be vocal in the actions they are carrying out (pasting in Slack channels is a good compromise for having to break your concentration by speaking) for the Monitor to jot down in a time-stamped document. The Monitor should also be pasting relevant log lines and Slack chat into this document.
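A time-stamped document doesn’t need special tooling. As a minimal sketch, the Monitor could keep the draft with two small shell functions like the ones below. The function names and the entry format are hypothetical, and timestamps are recorded in UTC to avoid confusion across time zones.

```shell
#!/usr/bin/env bash
# Sketch only: function names and the entry format are hypothetical.

# Append one UTC-timestamped entry to the incident draft.
# Usage: log_entry DRAFT WHO MESSAGE...
log_entry() {
  local draft="$1" who="$2"
  shift 2
  printf '%s [%s] %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$who" "$*" >> "$draft"
}

# Append raw log lines read from stdin, indented by four spaces
# so they stand apart from the narrative entries.
# Usage: some_command | log_evidence DRAFT
log_evidence() {
  local draft="$1"
  sed 's/^/    /' >> "$draft"
}
```

For example, `log_entry incident.md Lead "Reloaded Nginx on web server"` records an action, and `tail -n 5 /var/log/nginx/error.log | log_evidence incident.md` pastes the supporting log lines underneath it (the log path here is just an illustration).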
Once the issue has been resolved, take the incident report draft into a room (RL or IM) with anyone interested in the outage (including your non-technical colleagues) and run through the incident in chronological order, fleshing out the document into a full report as you go.
The end goal should be a document that includes:
- A summary of the issue
- The impact of the issue
- The approach[es] used to investigate and resolve
- Timings of the investigation
- Lessons learned
- Details of any information that can and can’t be released publicly for security/compliance reasons
Learn from your mistakes
Just like the AAIB, lots of tech companies release their incident reports publicly. This usually results in public discussion that (even through embarrassment) ultimately leads to better infrastructure.
Releasing your incident reports publicly is a decision that you’ll have to make as a company. This may be hindered by compliance and security considerations, but it’s almost always worth releasing them internally.
If you’re interested in learning more about aviation incidents and safety, the AAIB is a great place to start in the UK, or the FAA in the US. The Air Safety Institute is a great YouTube channel, and there are plenty of episodes of Mayday/Air Crash Investigation available on YouTube too. (Be warned though, all doomed flights take off from one of the busiest airports in the world.)
And remember the rules of thumb:
- Don’t `kill -9 $$`
- Don’t `kill -9 $(ps axo pid | shuf -n 1)`