Don’t worry, there will be no “kill or be killed” or “death before dishonor” slogans here.
I do not work in a place where the production environment is responsible for people's lives. No one would die if I mess up, or even if I really, really mess up. A lot of people, however, would have to look for new jobs. With this in mind, let's see what lessons from the military would benefit you in maintaining your production environment.

 

Not working is bad. Blowing up in your face is worse

No one wants to hear “click” when you expected “bang”.

No one wants to take a look at your online store in the morning and see an error message. Your potential users will wander away, and you have now lost 100% of your potential earnings. That's a bad scenario, and it is tested for. How fast does the site load? Does the home page look good on an obscure browser that only runs on mobile devices from 20 years ago? Very important indeed.

“Kaboom” when you expected “bang” is worse.

Your site looks great, but the API for the credit card company was not tested to the same extent (you have to coordinate with a third party, set up new accounts… let's use a stub and call it a day). Maybe you tested on your local servers, but your production servers have limited bandwidth or different firewall rules. Now imagine your lovely site works fine and then throws an exception right after the user clicks "complete payment". Not only are you not going to keep any of that money, you will also have to spend the rest of your cash on legal issues and doubling your support team.
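A minimal sketch of the kind of check that catches this class of problem: an integration test that calls the payment provider's sandbox from a production-like environment instead of relying only on a local stub. The endpoint, environment variables and payload here are hypothetical placeholders, not any real provider's API.

```python
import os
import requests

# Hypothetical sandbox integration check. PAYMENT_SANDBOX_URL and
# PAYMENT_API_KEY are assumed environment variables, not real provider settings.
SANDBOX_URL = os.environ["PAYMENT_SANDBOX_URL"]
API_KEY = os.environ["PAYMENT_API_KEY"]

def check_payment_gateway(timeout_seconds: float = 5.0) -> None:
    """Fail loudly if the gateway is unreachable or slow from this network."""
    response = requests.post(
        f"{SANDBOX_URL}/charges",
        json={"amount": 100, "currency": "USD", "test": True},
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=timeout_seconds,  # surfaces firewall and bandwidth problems as errors
    )
    response.raise_for_status()

if __name__ == "__main__":
    check_payment_gateway()
    print("Payment sandbox reachable from this environment")
```

Run it from the actual production network, not your laptop, so the firewall rules and bandwidth limits you will ship with are the ones being tested.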

Know where your real risks are, test them and invest your brainpower to challenge and crack them.

Know your MTBF, publish it and prepare accordingly

Would you equip yourself with something which is state of the art but has a chance of failing miserably once a week? Maybe. I do it with my mobile device (I could still use a 90s Nokia that can operate during a nuclear apocalypse, but I choose to restart my smartphone once in a while). Would you do the same with your car brakes? Heck, no. What if someone decided that for you, and you were not tech-savvy enough to ask?

When something goes wrong, the first thing you will be asked is "how could it go wrong?" (which is more than fair), followed by "why were we not notified of this risk?". You can answer that nobody asked, which is probably true, and also that the requirement demanded a complex feature on a low-cost virtual server, which is bound to go wrong sometime. This is also true, but it will not help you.

Mean Time Between Failures is a very military-friendly metric. It has numbers, acronyms, and is very useful when blame has to be shifted. We don't really need an accurate measurement to the nth decimal place on our production environment, as long as we can learn from experience. If something is bound to fail once every X days, weeks or months, the following three tasks should be handled:

  1. Share the information with the stakeholders. Your sales team might be babbling about 99.99999% uptime, while your production environment, even if it does nothing but stay online, resides on a virtual machine with a 99.5% SLA (the sketch after this list shows what those percentages mean in actual downtime per year). A well-informed sales department will be able to craft contracts and SLAs that protect you.
  2. Build KPIs, monitoring and alerting mechanisms that will allow you to minimize your reaction time.
    Separate your analytic and long-term business KPIs from the immediate-response ones. Only the immediate KPIs should wake you up at night.
  3. Have an emergency response procedure set up. Have professional technical support on call, or designate a person for that in shifts. Don’t send your team to battle without the training and tools they require!
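
To make item 1 concrete, here is a minimal sketch of the arithmetic: MTBF computed from a handful of made-up incident timestamps, and what the two uptime percentages above actually allow in downtime per year.

```python
from datetime import datetime, timedelta

# Made-up incident timestamps, purely for illustration.
failures = [
    datetime(2016, 1, 3, 2, 15),
    datetime(2016, 2, 14, 23, 40),
    datetime(2016, 3, 29, 6, 5),
]

# MTBF: average gap between consecutive failures.
gaps = [later - earlier for earlier, later in zip(failures, failures[1:])]
mtbf = sum(gaps, timedelta()) / len(gaps)
print(f"MTBF: roughly {mtbf.days} days")

# What an uptime promise actually allows per year.
for sla in (99.5, 99.99999):
    allowed_downtime = timedelta(days=365) * (1 - sla / 100)
    print(f"{sla}% uptime allows about {allowed_downtime} of downtime per year")
```

Roughly 44 hours a year versus about three seconds: a gap your sales team should know about before they sign anything.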

 

Develop and train procedures for high-risk tasks (mainly deployment) until everybody can do them

Reminder: Professionals Practice Until They Can’t Get It Wrong

Deploying new versions to the production environment is usually a complex ordeal. Even the best of us are bound to forget a step some day. Unless we have the mother of all checklists.

I'm totally stealing this analogy, but it works too well to replace. Your operating procedures should be structured, readable and rehearsed like a space shuttle launch. Each step should be so simple that everybody who might ever be required to do it will know how. Add validations, tests, whatever – but always keep each individual step dead simple. Sticking to the NASA analogy, you should also let your team acquire the experience and confidence they will need when the you-know-what hits the you-know-what. When the procedure breaks, you need people who can go full Apollo 13 mode without losing their cool.
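
A toy sketch of what "dead simple steps" can look like in practice: a checklist runner that executes each step in order and stops at the first failure, so the person on shift always knows exactly where the launch sequence broke. The step names and scripts are hypothetical placeholders, not a real deployment pipeline.

```python
import subprocess
import sys

# Hypothetical checklist; each entry is one dead-simple step.
STEPS = [
    ("Run the test suite", ["pytest", "-q"]),
    ("Back up the database", ["./scripts/backup_db.sh"]),    # assumed helper script
    ("Deploy the new version", ["./scripts/deploy.sh"]),      # assumed helper script
    ("Smoke-test the live site", ["./scripts/smoke_test.sh"]),
]

def run_checklist() -> None:
    for name, command in STEPS:
        print(f"==> {name}")
        result = subprocess.run(command)
        if result.returncode != 0:
            # Stop at the first failed step so the operator knows exactly
            # where the procedure broke instead of guessing afterwards.
            sys.exit(f"Step failed: {name}")
    print("All steps completed")

if __name__ == "__main__":
    run_checklist()
```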

 

Maintain your environment and it will return the favor. Ignore it, and it will bite you in the backside

I always thought that this is one of those things that should be too obvious to mention. But guess what: when I ask people "What are you actively doing on a regular basis to maintain maximal performance and lower risk?" they usually point at someone else (same deal when questioning that someone else), or just say something along the lines of "We will deal with the problems as they present themselves". Maintenance is not sexy. There is no glory in it. But it will keep you high above the muck you will find yourself in if you avoid it.

Expect the unexpected (prepare for unknown unknowns)

OK… I just couldn't help myself, and I had to let one cliché in. But as I mentioned here, the real world is out to get you. The real world is a collection of data, users and environmental complexities that will try to crack each line of your code, each server. It also includes some very bad people who will try to do the same. Sooner or later you will encounter some completely unexpected failure that will harm your production environment, or even shut it down. You might think that what you do now will determine how long it takes to overcome the failure, but that is not entirely true. It's how you hired, trained and motivated your team that will make or break this ordeal.

Do not base your training only on scenarios. Scenarios about things that can go wrong are limited by our knowledge and imagination, or in one word – limited. Things that you could not imagine can and will go wrong. However, the ways these issues can affect you are limited. You might find yourself without communication, and it doesn't matter which of the gazillion complexities of cause and effect led to that. You can find yourself without your most valuable employee, and it does not matter if she won the lottery or was kidnapped by aliens (in any case she is not coming to the office today). The scenario does not really matter – the end result that disrupts your life does. You can prepare for the end result with or without role-playing the entire process that led to it.