- Incident Response
Every Minute Counts: Coordinating Heroku’s Incident Response
The hardest thing about ops and incident response isn’t designing robust systems, debugging production, or quickly repairing technical issues. The toughest challenge is organizing to respond, communicating internally, and most importantly, communicating externally. These difficult challenges require massive preparation and are critical to developing customer trust and building a successful business.
- The Problem
- Software Breaks
- It's Stressful
- Poor Incident Handling
- Heroku's Incident Response in 2012
- Communication Tools
- Practicing Incident Response
- Post-Mortem Ownership
- Reasons of Blame
- Incident Command System
- Key Concepts
- Reworking ICS at Heroku
- Incident Commander
- Other ICS Lessons
- Training Your Team
- Clear Communications
- Terms, Tools & Process
Things I'm not going to talk about tonight:
I'm not going to talk about how to build robust systems, how to debug your production issues, how to fix them quickly, how to monitor your systems or any of that. First of all those skills are absolutely necessary, but they're not what you need to build customer trust.
I am going to talk about how we coordinate Heroku's incident response and how you can apply that towards your startup. In particular we're going to look at how you can communicate with your company and your customers to coordinate that response, how to organize your company's response to incidents, as well as, most importantly, how to build customer trust. Ultimately you want to be able to do that even through your company's most difficult times.
A little bit of a problem statement here: software breaks. This happens to everyone and it doesn't matter if your software is well-built or not.
Ultimately you're going to run into bugs, human error, power outages, security problems and you're going to have an issue at some point. It's probably not possible to prevent this from ever happening. But it absolutely is possible to determine how well your company reacts to it, to react as well as possible when shit hits the fan.
Production incidents are stressful. That's a second problem here. There's a ton of stuff going on during an incident. You've got people complaining on Twitter, you've got support tickets coming in, your phones are ringing, Hacker News articles are popping up. All the while you're busy trying to actually figure out what the problem is and fix it. To make matters worse this often happens at four in the morning.
We're probably all familiar with the effects of poor incident handling. First of all there's the primary effect, like the immediate loss of revenue from not being able to run your business and process new customers, as well as the SLA credits that come from having to reimburse people when you're down. Then there's the secondary effect, the long term impact on customer trust and customers leaving you, and your inability to get deals in the future because of your outages. The good news though is that this is almost entirely avoidable.
Everyone still with me?
Before I give you the context on the changes we made, I want to tell the story about what Heroku's incident response looked like in 2012. Heroku around that time had rapidly grown to about 75 people. A year earlier than that we were at about 25. Half of those were engineers. We started to collectively notice during that time that our incident response was degrading, things were becoming a little more chaotic and the quality of our responses was really varying a lot. We decided it was time to focus on improving that.
Our primary communication channel during incidents was Campfire. This worked well for us during incidents and during our regular day to day work as well, as far as coordinating what's being worked on. During larger incidents when you have more people involved we could move things to Skype just to get a little more bandwidth on the communication and not have to waste so much time typing things. Both of these systems worked pretty well for us. But there was a common question that kept coming up every time someone would come in to help out on the incident or had to figure out what was going on and maybe help customers.
That question was, "Can somebody fill me in?"
This single question illustrates so many of the problems that we were experiencing. So many of the symptoms as well.
If you're an engineer that's coming in, if you're a support rep that has to answer people on Twitter, if you're a salesperson that's dealing with a high profile customer, if you're in marketing and you have to respond to people that way, for any one of these roles you need to know what's going on with the incident and you don't necessarily even know who's working on it.
We found that we were wasting a lot of time on the communication between these different groups of people as people kept having to ask what was happening, what the latest status was. With those channels, like Campfire as I was saying, it's really tough to scroll back through them while a real-time conversation is going on and try to figure out what's been happening for the last hour or two.
We found that we were spending so much mental energy on this, doing it in an ad-hoc way, that it was really impacting our response time.
Next, we had a lot of issues with context switching preventing us from getting into a flow. You could imagine engineers concentrating on solving the problem and figuring out what's broken, but all the while we had to actually get updates out to our customers to tell them what was broken.
It's really tempting to have the person who best understands what's going on, the person who is working on it, trying to explain that to people. It's pretty natural, I think. But it turns out that writing these status updates and getting that information out in a customer-friendly way actually takes a lot of mental effort and there's a huge cognitive shift when you're doing that versus doing deep problem solving. It turns out to be really slow and disruptive to the engineers. That was also really impacting our recovery time.
Next, we had a number of reasons that customers were being kept in the dark as far as what the state of the incident was. You could imagine we didn't have any good goals, or really any goals at all, defined for what the public status updates should look like, how frequently they should be happening, what their content should be.
We ended up with status updates that were vague, inconsistent, uninformative, sporadic and sometimes downright inaccurate. Our customers basically were left to conclude that either we didn't know what was happening or we weren't going to tell them what was happening.
We realized the root problem here was actually that we didn't have a way to practice these things outside of an actual incident. We didn't have any framework on which to practice our communication procedures. We did do wargaming occasionally. That was always focused on technical problem solving or simulating an outage, and that sort of thing is really helpful but it's a different set of skills. You can't really work on them simultaneously. We wanted a way to focus just on the communication procedures and to practice those.
The final problem here. We had an incident around this timeframe, on a Saturday I believe it was. About six days after this we still hadn't made any progress on a public post-mortem. These are things that you usually want to get out within a couple of days after an incident to restore trust, but nobody had actually been assigned to own the post-mortem here. So there had been no progress made on it. No one was actually responsible for making sure that was written. There are all kinds of reasons to blame for that.
These are typical problems that all startups face: your company is growing, your product is growing, people are coming in and out of the company. Ultimately we realized that our incident response was kind of chaotic and disorganized; it had started to affect our business and it was something that we needed to fix. As we started digging into this, we quickly realized that incident response in general is considered a bit of a solved problem. That's the good news.
The Incident Command System is what other organizations use, other emergency responders use, in order to deal with the same kinds of problems, coordinating large groups of people that are dealing with stressful incidents and emergencies. IT operations folks aren't the first ones to deal with those problems. You can imagine firefighters that are working on dealing with wildfires, people who are responding to traffic accidents, natural disasters.
All of these problems that these emergency responders have really mirrored the problems that Heroku was facing at this time.
The Incident Command System: they designed this back in the 60's to help with the fighting of California wildfires. It was originally based on the Navy's management procedures. Those are the procedures that help the Navy manage a 3,000 person aircraft carrier that's full of 18-year-olds who aren't very well-trained during the heat of battle. You can imagine those are pretty intense procedures and they have to be effective in order to make things work well in that situation.
Since then, it's actually evolved into a federal standard for emergency response. There are a lot of scenarios where you might be getting federal funding and a mandate that you follow this system in order to get that funding. I'm not going to spend a ton of time on the details of the ICS right now. I'm just going to give you the key concepts of it.
The first and foremost is that you have an organizational structure that can scale along with the incident. Second is unity of command. That means that each individual reports to one and only one supervisor. Next up is the limited span of control, so that means each supervisor has no more than three to seven people reporting directly to them. If you get to a point where you have more than that you've got to split it out, add another layer of hierarchy or add another management group. That's just to make sure that everyone can actually keep track of what their direct reports are doing and coordinate things that way.
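To make the span-of-control rule concrete, here is a minimal sketch of how you might check a response org chart for supervisors with too many direct reports. The `Responder` class, the `over_span` function and the names in the usage note are all hypothetical, made up for illustration; only the upper bound of seven comes from the rule described above.

```python
from dataclasses import dataclass, field

MAX_SPAN = 7  # upper bound from the "three to seven direct reports" rule


@dataclass
class Responder:
    """One person in the incident org chart (hypothetical structure)."""
    name: str
    reports: list = field(default_factory=list)


def over_span(root, max_span=MAX_SPAN):
    """Return the names of supervisors with more than max_span direct reports."""
    offenders = []
    stack = [root]
    while stack:
        node = stack.pop()
        if len(node.reports) > max_span:
            offenders.append(node.name)
        stack.extend(node.reports)
    return offenders
```

For example, a commander with eight engineers reporting directly would be flagged, and splitting those engineers under two team leads clears the warning. Because each person appears under exactly one supervisor in a tree like this, unity of command holds by construction.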
Clear communications is a huge thing to focus on. Having common terminology, making sure that your company or your group is on the same page as far as the terminology they're using to describe what's happening, so there's no confusion there when somebody uses a keyword.
Management by objective. That's clear, specific and prioritized objectives as far as what your company or what your group should be focused on doing to resolve the incident and in what order they should be doing those things.
As I said, I'm not going to spend a ton of time on this. There are a lot of good resources out there for the Incident Command System and in particular how to apply it towards IT operations. I'm going to show these slides. I recommend that you check them out if you're interested in that. But instead, I'm going to focus on how we applied this at Heroku, how we use the ICS to improve our incident response.
First and foremost, I mentioned the organizational structure. When we studied the ICS we decided that we should have three primary organizational groups. The first of those being the incident command group. I'm going to go into detail on each one of these in a little bit here, but just to briefly go over them: incident command, operations and communications.
If you were to apply this to your company, you don't necessarily have to do exactly what we did as far as what your organization is structured like. The main important part is that you have this defined up front and everyone knows what the responsibilities are for each one of these groups.
Incident Command. This is typically defined by the incident commander role. That's a single person who is in charge, with the final decision-making authority. They have the say on what's happening and what decisions we make. This actually tends to be the first person, or by definition is the first person, that responds during an incident. The first person who figures out that something might be broken, they have to automatically assume the incident commander role. Along with that they have to also assume all responsibilities that haven't yet been assigned to other people until they get handed off or until the incident is resolved.
Heroku has a rotation of people that are specifically trained for this role. That's helpful. A few of their responsibilities: they're tracking the progress of the incident, coordinating the response between different groups. They have to act as kind of a communication hub between the different groups in the organization. They have to make the call on state changes. They have to decide how the incident is evolving, whether it becomes resolved or not. They have to issue situation reports. This is a really important one that we're going to dig into next here. As I said, they take over all responsibilities that aren't already assigned to another group or that haven't been handed off to other people.
A sitrep. What is a sitrep? A sitrep is a really concise description of what's happening and it's an internal thing that uses blunt language. You don't have to worry about whether customers are going to see it or not. You can see here that we're talking about exactly what's broken, what the symptoms are, how widespread the impact is and any other important information as far as what we're doing to resolve it.
We have links to a ticket here from Amazon that's a support case that we have with our infrastructure provider. We also have a link to a Trello card that's our follow-up, showing what we're doing, so that we can keep track of any temporary changes and roll those back once the incident is actually resolved. Finally we have the people that are actually working on the incident listed here. We have the incident commander and then we have the people that are in charge of communications as well as the other engineers that are working on the problem.
This is very, very dense information and it's a very great summary of what's happening. Most importantly it can be blunt and to the point. The purpose of it is to keep the company at large informed about what's happening with the incident.
Sales, support, marketing, all these folks need to know what's happening with the incident. The way this works is it pushes it out through our company-wide mailing list. It blasts everybody. Everyone has access to it. Everyone knows what the current state is. It also goes to our HipChat channels so it's coordinated in the same place where we're actually doing our incident response.
As I said, it should contain what's broken, scope of the impact, who is working on it, what we're doing to fix it, and it should be sent regularly. That could be every half hour, every hour or whenever there's something major that changes about the incident, when there's new information or you figure out something that customers could do in order to recover from it faster. That gets sent to the entire company.
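As a rough illustration of the sitrep contents listed above, here is a minimal sketch of a sitrep as a small data structure that renders to a plain-text message. The `Sitrep` class, its field names and the output labels are assumptions made for this example; they are not Heroku's actual format or tooling.

```python
from dataclasses import dataclass, field


@dataclass
class Sitrep:
    """Internal situation report (hypothetical field names)."""
    whats_broken: str
    scope: str
    actions: str
    incident_commander: str
    communications: str
    engineers: list = field(default_factory=list)
    links: list = field(default_factory=list)

    def render(self) -> str:
        """Render the sitrep as blunt, plain text for email or chat."""
        lines = [
            f"BROKEN: {self.whats_broken}",
            f"SCOPE: {self.scope}",
            f"ACTIONS: {self.actions}",
            f"IC: {self.incident_commander}",
            f"COMMS: {self.communications}",
            f"ENGINEERS: {', '.join(self.engineers) or 'none yet'}",
        ]
        lines += [f"LINK: {url}" for url in self.links]
        return "\n".join(lines)
```

Keeping the fields fixed means every sitrep answers the same questions in the same order, so someone joining late can find the scope and the people involved at a glance instead of asking to be filled in.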
The event loop is something that we use to describe what the incident commander is doing during an incident. It's more or less a series of questions. They don't have to be these exact ones, but it's a series of things that they're doing as the incident progresses. Basically, they're just following this loop, asking these questions, making sure that everything is being handled, that nobody else needs additional support. They go through this list and at the end of it they just keep repeating until the incident is resolved and closed out. That's the first group, the first organization.
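The shape of that loop can be sketched in a few lines. This is only an illustration: the questions in `EXAMPLE_QUESTIONS` are placeholders invented for the sketch (the real ones are on the slide and, as noted, don't have to be exact), and `run_event_loop`, `ask` and `is_resolved` are hypothetical names.

```python
# Placeholder questions, invented for illustration only.
EXAMPLE_QUESTIONS = [
    "Is a new sitrep due?",
    "Does any group need additional support?",
    "Has the state of the incident changed?",
]


def run_event_loop(questions, ask, is_resolved, max_rounds=100):
    """Cycle through the questions until the incident is resolved.

    ask(question) is called for each question in order; is_resolved()
    is checked before every pass. max_rounds guards against looping
    forever in this sketch. Returns the number of completed passes.
    """
    rounds = 0
    while not is_resolved() and rounds < max_rounds:
        for question in questions:
            ask(question)
        rounds += 1
    return rounds
```

In practice `ask` is a person working through a list rather than a function, but writing it down this way makes the point that the incident commander's job is a repeatable routine, not improvisation.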
The second one we have is operations. These are the people responsible for actually fixing the problem. At a software company that tends to be engineers. That tends to be only a couple of engineers for smaller incidents, maybe one or two or three people. And then once it's a larger incident, you're going to involve more people and you'll have more groups and more teams involved.
To give you an example at Heroku, if our platform is having issues, if we have a major outage, then we'll have a team of people working on just the platform side of things. We'll have another group that's in charge of restoring our database operations and those will be separate teams that are both coordinated underneath the incident commander.
As I said, their responsibility is diagnosing the problem, fixing it and then reporting progress back up the chain to the incident commander so that he or she can communicate with the rest of the company about what's happening.
Importantly they don't have to spend a single cycle of their time worrying about how they communicate this process or about it being communicated in a way that can be shown to customers.
They can just be straight to the point and not have to waste any cycles on that. That's the second group.
Our third one here is communications. These are the people that are responsible for keeping customers informed about the state of the incident and what's happening. At Heroku this is typically managed by our customer support personnel. There are a few reasons for that. They don't have to context switch with problem solving. They're already dealing with customer issues all the time, and even during an incident they're probably answering customer tickets as well. They don't have to make that context switch, and they're already used to speaking the customers' language.
We used to waste all kinds of time, as I said, with engineers mincing words, trying to get things massaged in a way that both describes the issue but also doesn't look bad to customers. If you don't have to do that you can focus entirely on problem solving. Offloading this is really important.
It's not about whether engineers are capable of that. All of Heroku's engineers could easily write a status update. It's mostly about the context switch.
The status updates, these are public messages that we post periodically to keep our customers informed about what's broken, what they can do to fix it or what they can do to work around it, and what we're doing to fix the problem. We post these both to our public status website as well as the dedicated Heroku status Twitter account that we have for this. I don't know if it's necessary to use a dedicated account. We just find it helpful to keep things out of our main Twitter stream because it can get messy during an incident where you're pushing a lot of information out.
Communications, importantly, works very closely with the incident commander to make sure that they have all the information they need to describe the impact of the incident to customers.
Some guidelines for the content of these status updates:
They should be honest, they should be transparent and up front about what's happening. Don't hide stuff from your customers. And they should explain the progress that you're making on actually resolving it.
You should probably establish some goals within your company as far as what you want your customer communications to look like and what you're trying to accomplish with them.
Things they shouldn't do. They shouldn't provide an explicit ETA around when things are going to be resolved. We've found so many times that we think things are going to be resolved in an hour and it turns out that we actually thought wrong about what the problem was. So, maybe one hour or two hours later you're still saying things are going to be fixed in an hour. And that never looks good.
Just don't overpromise because you're never going to know for sure how long it's going to take.
Don't presume to know the root cause. If you haven't done a proper post-mortem yet, you're not going to know what the actual causes were so don't assume that you understand that at the time of the incident.
Don't shift blame. Ultimately you and you alone are responsible for your infrastructure choices and for your own availability. That does not fall on your providers. It doesn't matter if you're using Heroku or Amazon Web Services directly. That was your choice for your product, but your customers are paying you for that service. Ultimately it's your fault for making those decisions if they don't work out. Don't shift the blame there.
Don't do this. This may have been something that you've seen if you've looked at Amazon's status site before. This is their "Everything is okay except for this little thing that may actually be a huge issue that's taking your site down."
Don't beat around the bush. If things are broken just own up to it and be honest with your customers about it.
Some proactive handling around your top customers. Sometimes the incident commander finds out that there's a situation where you can recover your top customers more quickly. You can imagine a situation where you've got manual work that's required to recover a bunch of customers' databases. You can't do them all at once. There might be constraints on that. Ultimately you have to do one at a time. In a situation like that you can prioritize your top customers and give them the best experience there, get them up the soonest, the ones who are most important to your company or the ones that are paying you the most.
Also, account managers or sales teams can proactively reach out to those top customers even if they haven't complained yet. Even if they don't know that they're being affected, you can tell them if they're affected and that establishes a communication channel with them and makes them really confident that you're on top of things.
Handling support tickets during incidents. You can imagine that in a widespread incident we're going to have lots of tickets coming in, lots of customers are affected, and it's not really possible to respond effectively to all of them during the heat of an incident. What we do is have a macro that our support team will set up when the incident begins that will redirect customers to the status site. That's the place where we're already putting all of the relevant information that they need to know, everything that we are able to tell customers about the problem and what they can do to fix it.
What they also do when they assign tickets to this macro is label them in a way that they can easily find those tickets after the incident, which is wonderful because it gives them a chance after the incident is over to apologize to customers, first of all. Secondly, to make sure that the customer's problem was actually being caused by the incident. It happens all the time that you think the problem is due to whatever is broken, but it might actually be something else. It gives you a chance to then follow up and see that their problem was actually resolved by the incident being closed. Then you can re-engage with them.
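Here is a minimal sketch of that macro-and-label idea, using plain dictionaries as tickets. Everything in it is hypothetical: the `STATUS_URL`, the reply wording, the tag format and both function names are made up for illustration and do not correspond to any real helpdesk API.

```python
STATUS_URL = "https://status.example.com"  # hypothetical status site


def apply_incident_macro(ticket, incident_id):
    """Send the canned redirect reply and tag the ticket for later follow-up."""
    ticket["reply"] = (
        "We're investigating an incident. "
        f"Updates are posted at {STATUS_URL}."
    )
    ticket.setdefault("tags", []).append(f"incident-{incident_id}")
    return ticket


def tickets_for_followup(tickets, incident_id):
    """Find every ticket that was answered with the macro for this incident."""
    tag = f"incident-{incident_id}"
    return [t for t in tickets if tag in t.get("tags", [])]
```

After the incident closes, `tickets_for_followup` gives the support team the list of customers to apologize to and to check whether their problem really went away with the incident.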
That's it for the organizational units. We've got incident command, operations and communications. As I said, this isn't really set in stone. It's not that you have to use this exact structure for your company.
You should use what works for your organization. Don't be afraid to change this also as your company grows and evolves and your org changes. It's okay to change this when it makes sense for you.
Some other ideas that we implemented based on the Incident Command System: training and simulations is a really important one. Incidents are really stressful, if I haven't made that clear yet. People under stress tend to revert to their habits. I played volleyball in college and this was something that I heard all the time from coaches at a higher level, which is that under stress, when the game is on the line, players are going to do what they're comfortable doing, what their habits are.
The only way to really change that is via realistic training. It's not just about practice, it's about practicing under as much of a realistic setting as you can. The stressful environment is really what makes you able to change your behavior when you practice under that environment. Studying is not enough. You'll see this in any field where people are expected to perform under stress. There's a heavy emphasis on a realistic training environment. If you want to respond as quickly and effectively as possible it really has to be second nature. You have to be able to know these steps by heart.
It's okay to have a checklist so that you don't forget things, but if you don't know what all those steps on the checklist are you're not going to be able to perform them quickly during an incident. You're going to have to look up what you're actually supposed to be doing there. Airline pilots, when they have to land a plane, have a long checklist of things that they go through, but they've trained thousands and thousands of hours on that so they know what everything is already. It's just to make sure they don't forget a step.
With our training and simulations we want to mimic our production environment as much as possible. We'll have an entire clone of our production infrastructure set up as well as simulation copies of our status site and of our Twitter accounts and our Zendesk. All of that to give you basically the complete experience around this so you can practice using the same tools that you would during an actual incident.
As far as the intervals on which you should train, you probably want to have some kind of a jumpstart for your team when people are coming on board or getting into on-call procedures. You want to jumpstart their training and make sure that they're brought up to speed on what they should be doing.
It's also important to train regularly. Maybe every three months or something, every quarter, just to keep those skills fresh.
You want to track the things that you care about during training, things that are important to your company. For us we placed a heavy emphasis on how long it took us to get that first status update out to customers even if you're not sure what the problem is. We really try to be focused on getting that investigating post up there so that we've acknowledged that we're at least looking into something, which is a great thing when customers are having a problem. They already know that we're working on it.
Clear communications, that was another big focus of the ICS. There are a few ways in which we've tried to achieve that goal: having explicit state changes and handoffs, being very clear with messaging about when things are evolving or when responsibilities are being handed over. You can see in these first couple: imagine that I am the incident commander and I want to hand over that responsibility to Ricardo. I'm going to have a clear message that everyone can recognize, that alerts everyone, that says, "Hey, there's a new incident commander and it's this person." Likewise for communications if I want to hand that off to somebody else.
We also have messages here that we use when we are confirming that the incident has grown in scope or we know that there's a real problem, or when it's resolved, to let everyone know that we're winding down our handling of the incident. This is actually built into our tooling for our status sites so it ends up being automatic for us. That makes it really hard to forget it. Basically, impossible to forget it.
The dedicated communications channel, this was something that we had before, but we really reinforced it when we took another look at this. We have this HipChat room called "Platform Incidents" that everyone knows that you're supposed to go to when you think something's broken. That's where the investigation takes place. That's where things are coordinated. We can still escalate things to Skype or Google Hangouts when things need a little bit more communication bandwidth, but it's important that everyone has a known starting point they're going to go to when the incident gets going and that it's defined in advance.
There's a lot in the ICS also about defining your terminology and your process, making sure that those goals are set up up front.
Among those, we found it really helpful to define some product health metrics. We picked two or three specific metrics that we think explain whether our platform is working effectively and whether our customers are having issues or not. We saw a lot of problems with inconsistent handling of incidents where an engineer might be getting woken up at 3:00 a.m. and they're going to be really hesitant about waking someone else up when they're not sure what the problem is. Especially if they think they can handle it.
This is meant to prevent the hero syndrome. We don't want people to be heroes. We want to have consistently good responses. Having these metrics defined was actually very valuable for us on that. Describing the severity of the incident and the state of it becomes much easier when you've got a couple of metrics that tell you the most important parts about how your product is functioning. It makes it much easier to justify calling in help at 4:00 in the morning when you've got the numbers to back it up.
It's actually harder than it sounds though, unfortunately. For our metrics, we use continuous platform integration testing, HTTP availability numbers, and we try and gauge the number of apps or customers that are impacted. We also have thresholds for those that we use to determine how severe the incident might be. As I said, in practice it's been difficult for us. I think that's somewhat related to the fact that the more complex your product is, the harder it is to really limit it down to just a couple of metrics. It's still really valuable to try and focus on that and try and find those most important things that tell you how your product is working.
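To show what thresholds on a couple of health metrics might look like, here is a minimal sketch that maps two numbers to a severity level. The threshold values, the severity names and the `classify_severity` function are all assumptions picked for illustration; they are not Heroku's real numbers.

```python
# Hypothetical thresholds, chosen only to illustrate the idea.
MAJOR_AVAILABILITY = 0.95     # below this HTTP availability: major
MINOR_AVAILABILITY = 0.999    # below this HTTP availability: minor
MAJOR_APPS_FRACTION = 0.25    # above this share of apps impacted: major
MINOR_APPS_FRACTION = 0.01    # above this share of apps impacted: minor


def classify_severity(http_availability, apps_impacted_fraction):
    """Map two health metrics (both in the range 0..1) to a severity level."""
    if (http_availability < MAJOR_AVAILABILITY
            or apps_impacted_fraction > MAJOR_APPS_FRACTION):
        return "major"
    if (http_availability < MINOR_AVAILABILITY
            or apps_impacted_fraction > MINOR_APPS_FRACTION):
        return "minor"
    return "ok"
```

The value of writing thresholds down is that the engineer who gets paged at 3:00 a.m. doesn't have to make a judgment call about whether to wake someone else up. The numbers make the call.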
Another problem is that Heroku hasn't historically had a really great continuous integration story internally. That's something that I think is really important to invest in with any product, so I'd recommend that to any of you.
The next piece here is having tooling around our incident response. This is a screenshot from the admin side of our status site. This is what the communications person would be using to draft a public status update. It's where they can control what the public colors are, describing the severity of the incident, and also adjust the scope of it to tell who's impacted or what services are impacted.
There's a similar interface that the incident commander uses also when they're writing a situation report, a sitrep. This is integrated with our status site so when you write something here it's automatically up on the status site. It gets pushed out to everybody by email, it gets put on our HipChat room and it makes everything easy to have this kind of built into our tooling.
Another tool that we have, a HipChat bot that we use. One of the big purposes of it is to be able to page in people. Whether it's another specific person that you know who you want to call in or if you want to call in whoever is on call for a particular rotation. We have bots that make that really easy for us so we don't have to leave the place where we're already doing our incident response. We don't have to go muck around with PagerDuty's UI to figure out who's on call and to figure out who we want to escalate it to. Having that automated here and having a really convenient way to access it has been huge for us.
I will say about tools: it's only helpful to have them and have great tooling if people know how to use them. If you haven't been trained on them up front you're not going to be able to figure out how to respond to the incident, how to use your tools during the heat of an incident. You have to be trained.
Another tool that we have here is the incident state machine. This is our internal nomenclature that we use to describe the state of the incident. If you could imagine, everything is functioning normally in steady state, everything's healthy. That's our state zero. That's our "everything's normal" state. As soon as anybody suspects that something might be broken you're going to escalate that into investigating when you have a strong suspicion that something is wrong. There are actually triggers for that.
If you want to escalate things to state one, you have to open that public status incident to say that you're investigating the incident so that people know. Finally, once you've confirmed the scope and the impact, the symptoms of the incident, then you can confirm that there is actually a problem and you can update that way. And then when there's a major incident, we have a separate state for that. That is our big red button that calls in everyone when we know that there's something really major going on.
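The states described above can be sketched as a small state machine. Only state zero (normal) and state one (investigating) are numbered in the talk; the numbers given to the confirmed and major states, the set of allowed transitions, and the names `IncidentState`, `ALLOWED` and `transition` are assumptions made for this illustration.

```python
from enum import IntEnum


class IncidentState(IntEnum):
    NORMAL = 0         # steady state, everything healthy
    INVESTIGATING = 1  # strong suspicion; public status incident opened
    CONFIRMED = 2      # scope and impact confirmed (number is an assumption)
    MAJOR = 3          # "big red button" state (number is an assumption)


# Hypothetical transition table: escalate, de-escalate, or resolve to normal.
ALLOWED = {
    IncidentState.NORMAL: {IncidentState.INVESTIGATING},
    IncidentState.INVESTIGATING: {
        IncidentState.CONFIRMED, IncidentState.MAJOR, IncidentState.NORMAL,
    },
    IncidentState.CONFIRMED: {IncidentState.MAJOR, IncidentState.NORMAL},
    IncidentState.MAJOR: {IncidentState.CONFIRMED, IncidentState.NORMAL},
}


def transition(current, target):
    """Return the new state, or raise ValueError if the move isn't allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot go from {current.name} to {target.name}")
    return target
```

Making the transitions explicit means a state change is always a deliberate, announced event, which fits with the explicit state-change messages built into the status tooling.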
I want to say a word here about follow-ups and post-mortems. As I said, the big problem that we had here was not knowing whether somebody owned this, not having someone who at the end of each incident was assigned to own this process and make sure that they were getting all the input and all the information they needed to compile the post-mortem as quickly as possible.
Make sure that somebody owns that after each incident.
As far as how to write a good one, I like to look at the Mark Imbriaco formula. It's just three simple steps. You want to apologize for what happened. Make sure that your customers really feel like you feel bad about it. Secondly, you want to demonstrate that you understand what happened, that you understand the sequence of events that led to it. Finally, you want to demonstrate or explain what you're going to do to fix it. You want to tell them that this isn't going to happen again and here's why. You can read basically any post-mortem that Mark Imbriaco has written on these. They're all great.
I'd also recommend you check out the ones that CloudFlare writes. They're really in depth and detailed technical analyses, but they also basically follow this formula. They do a great job especially at getting them out quickly. I've seen them have an incident on a Sunday and they'll have these post-mortems out the next day in extreme detail. That's really impressive and as a customer it makes me want to trust them more.
Those are the changes we made. As far as how well this has actually worked for us, it turns out that customers actually really notice and appreciate this kind of thing when you're proactively communicating with them. There were a lot of recent examples here from the Heartbleed SSL debacle a couple of months ago. People were very adamant about praising us for how good of a job we had done keeping them informed about the state of things.
I think it's important to note here, we didn't really do anything fundamentally different as far as what we said than what AWS did, but the way that we communicated it was much, much different. We didn't even really necessarily move faster than them. In a lot of cases we were constrained by them for certain pieces of our infrastructure.
In spite of that, people really praised the way that we responded to it and they were mocking how Amazon responded to it. The way that you're doing this proactively makes a huge difference.
We've also seen a nice side effect where our support volume is much lower during incidents as a result of the way that we push this information out. People are less tempted to have to open support tickets when they know we're already working on what's broken. I think during normal incidents we had on the order of 50 to 100 tickets that might be opened and now that actually goes down to like five or 10 that might be opened during a regular incident. Huge impact there.
We're definitely far from perfect. After our initial response or initial sprint to improve this stuff a couple of years ago, we've really slacked on the regular training aspect of it. As our company has grown, as people have left or new people have come in, our response, our ability to coordinate this cleanly internally has kind of degraded. We're really getting back into that now with the more regular training and making sure that happens so people keep their skills fresh.
I want to recap all the things we talked about as far as how to apply them to your company. Standardizing your process is really important here. You want to define that org structure, get it defined up front. The sooner that you can have more than one person involved with it, the sooner you can really reap the benefits of having people able to focus on particular tasks and not have all that context switching.
Standardize your tooling and your process, making sure that's not ad-hoc because that leads to inconsistency.
Really working hard and trying to define those product health metrics and thresholds for them. Also, establishing those goals up front for what you want to accomplish as far as customer communications during the incident. Limit that to no more than two or three if you can.
Clear internal communications: there are a few things around that, like making sure those handoffs are really explicit. Really embrace the sitrep. That's done an amazing job for us as far as keeping people informed. There's no longer a need for people to keep jumping on HipChat and asking details about what's happening. They already have that information pushed out to them.
Owning the post-mortem process, making sure that someone's in charge of that. It would probably be useful here also to have anybody who's going to be involved with these incidents, whether they're writing status posts or writing sitreps, to do some practice ahead of time, maybe write out some templates. It doesn't matter if you're going to use those templates during the incident, the important thing is that you've practiced on what the content is that should go into them so you're able to do that under stress.
Finally, training yourself so you can make sure that you are able to practice or apply these skills regularly when it matters.
The best procedures mean nothing if people haven't been trained on how to follow them and if they're not going to follow them when it matters.
If you want to recover as quickly and effectively as possible you really have to be trained on them.
Q: How does a training session work at Heroku?
A: We try to have someone who's involved. Basically, the coordinator of the training sessions, they'll plan out a game plan for it ahead of time and figure out what they're going to break and keep track of logging what people are doing during the incidents. They're going to take notes about it throughout the whole time and they're going to write down things about when each person came in to actually deal with it. We usually get people into a single room and they'll get called into the room as they get paged or as they get called in like they normally would for many of their alerting triggers.
They'll come into the room and we'll really keep everyone in close watch so we can take notes on how they're responding. We have the logs in HipChat as far as when certain events were happening. That's the important part of it I think. We have them pretend that they're not in the same room even though they are, sort of speaking their thought process out loud as they're thinking about what they should be doing next so we can take notes about how to refine that process and make sure that everyone is well-trained on that. I think that's about it.
As I said, they're really focused on the communication aspect of things. The actual wargaming, the training of fixing technical problems, that's kind of a separate piece and I think you really have to train them separately.
During these incident response training sessions we don't really worry that much about whether the person is able to figure out effectively what's broken or figure out how to fix it. That's sort of a separate thing that we can then work on afterwards with them if they're having issues with it. It's more about just making sure the communication process is really refined.
Q: What's the distinction between production and development on the Heroku status page?
A: You use Heroku a lot and you don't understand the distinction on our status site between what's considered production and development. I could say that we don't really either. That's kind of a distinction that . . . I guess I could say that on camera. No, we've had debate about that internally. Customers have been confused about that for a while.
It's hard because we're trying to not convey things in a way that matches how our internal systems work. I mean, it would be easy for us to describe things in terms of whether the platform, whether your existing apps are up and running versus whether you can push new code and whether the API is functional. That's kind of how it ends up breaking down. But we really want to focus on what customers care about rather than how the internal systems work which is why we've kind of struggled with those definitions a bit so far. Don't know if I have recommendations for how you might do that for another service.
Q: When do you change the incident commander or escalate the incident to the next level?
A: When do you decide whether the incident commander should change or when the incident should be escalated to the next level?
As far as when it should be escalated to the next level, we usually try and define that based on the severity, based on our metrics.
If we detect that two percent of apps are affected or one percent of apps are affected, I'm just making numbers up here, that's considered a minor incident. Then if something like 20 percent is affected that's a really big problem, that's huge. If it's a small handful of customers it might just be a regular incident.
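A metric-based severity rule like this can be sketched in a few lines of Python. The thresholds below reuse the numbers from the answer, which were explicitly made up on the spot, and the function name, the labels, and the ordering of "regular" below "minor" are my own assumptions rather than Heroku's actual definitions.

```python
def classify_severity(affected_apps, total_apps, minor_pct=1.0, major_pct=20.0):
    """Map the share of affected apps to a severity label.

    minor_pct and major_pct are illustrative defaults, not real thresholds.
    """
    if total_apps <= 0:
        raise ValueError("total_apps must be positive")
    if not 0 <= affected_apps <= total_apps:
        raise ValueError("affected_apps must be between 0 and total_apps")

    pct = 100.0 * affected_apps / total_apps
    if pct >= major_pct:
        return "major"    # "a really big problem"
    if pct >= minor_pct:
        return "minor"    # around one or two percent of apps
    return "regular"      # assumed label for "a small handful of customers"
```

For example, `classify_severity(20, 1000)` is 2 percent of apps and falls in the "minor" band, while `classify_severity(250, 1000)` crosses the 20 percent line and returns "major".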
As far as when to change the incident commanders, I think our rule of thumb there is that we don't like to have people in the rotation for more than four hours at a time. If you start an incident and it might be a long one that's going through the night or something, I think it's important to change that person out regularly every four hours because there's just a lot that they have to keep track of during the incident. At the end of four hours you really start losing the ability to be sharp on that stuff and coordinate effectively.
For long incidents we'll have people rotating every four hours, changing throughout the night and the other people will go off and take a rest. They may have to get called back 12 hours later, but at least they can take a mental break and sleep and recharge a little bit.
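The four-hour rule is easy to turn into a schedule. The sketch below is only an illustration: the function name `ic_shifts`, the round-robin order, and the example names are assumptions, and only the four-hour cap comes from the talk.

```python
from datetime import datetime, timedelta


def ic_shifts(start, end, commanders, max_shift=timedelta(hours=4)):
    """Split an incident window into IC shifts of at most max_shift each.

    Commanders are assigned round-robin (an assumed policy).
    Returns a list of (commander, shift_start, shift_end) tuples.
    """
    if not commanders:
        raise ValueError("need at least one incident commander")
    if max_shift <= timedelta(0):
        raise ValueError("max_shift must be positive")

    shifts = []
    shift_start = start
    index = 0
    while shift_start < end:
        # The final shift is cut short if the incident ends early.
        shift_end = min(shift_start + max_shift, end)
        shifts.append((commanders[index % len(commanders)], shift_start, shift_end))
        shift_start = shift_end
        index += 1
    return shifts


# Hypothetical overnight incident lasting ten hours.
begin = datetime(2014, 6, 1, 22, 0)
for name, s, e in ic_shifts(begin, begin + timedelta(hours=10), ["ana", "ben", "cy"]):
    print(name, s.strftime("%H:%M"), e.strftime("%H:%M"))
```

With three commanders in the pool, the first one is only called back if the incident runs past twelve hours, which lines up with the "called back 12 hours later" case mentioned above.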
Q: How do you deal with people who complain online, even when you're being as transparent as possible?
A: His question was even if you're doing a great job at this stuff there's still a chance that somebody will go on Hacker News and post about it or complain about the fact that your service is broken. Yeah. How do you deal with that as an engineer when you think you're doing things right and then people are still complaining about it?
I think from Heroku's perspective it's pretty normal now. Anytime we have any kind of an incident, I don't know if it's people who are subscribed to our text messages, that just immediately then go post about it on Hacker News. But it basically shows up every single time no matter how severe the incident is. I think we've just gotten used to that and they tend to die off the front page now pretty quickly because people are getting tired of those posts showing up all the time, every time something is broken.
I think you just have to really take a look at how you responded and try and think objectively. Listen to the customer's feedback if they're complaining about stuff. Ultimately it's up to you to decide whether you're doing things as well as you can or as well as you need to for your company to succeed.
Q: Do you have a refined process for the post-mortem and is the incident commander involved in it?
A: Do we have a refined process around the post-mortem and is the incident commander involved in that process? The sitrep that I showed earlier, there was a link to a Trello card. We use Trello for all kinds of things internally. When there's an incident the IC tends to open a Trello card that they'll use to track things: both to track what's going on during the incident, so we can maybe undo temporary changes afterwards, and also so that we can track that there has been someone assigned to that post-mortem responsibility, or that there has been a follow-up meeting scheduled in which we're going to do that the next day or something depending on when things wrap up.
We do have more well-defined processes for having meetings around that and making sure that all the people whose systems were involved are there to give their information about what they figured out happened. I'm not sure if there are specific aspects of it that you're curious about, but yeah, we do have a more formal process around that. We have an SRE team now that's really defining that stuff in a much more clear way. The incident commander, they're usually involved in the post-mortem if only to be there as the person who's explaining the way things happened and explaining the sequence of events.
Q: How many people would be involved in a normal incident?
A: How many people would be involved in a normal type of incident? There are times when it's only one. I've said those responsibilities should be split out and they really should when it's anything other than a really trivial incident.
Sometimes we have a single database server that goes down and a data engineer might go on there and be involved with that one customer and they're able to handle everything. For the most part, we really try and emphasize having somebody take over as incident commander as quickly as possible. Worst case, maybe the IC is responsible for writing those status updates. If it's a minor thing or not that much is changing, then they can avoid having to call in a communications person.
Definitely trying to split out the IC role from the engineering role is important and we try and get at least two people involved there.
For a regular incident, I'm trying to estimate here, but I don't think it's typical for a regular incident to have more than two or three engineers involved in it. There are certain people who are already on call and it's their job to kind of fix these things. Typically you can add more people to the fire but it's not going to help you recover faster from most situations.
Q: How often do you update the status site and how detailed is each status update?
A: I guess one of them was how often do we look to update the status site, how often do we look to get those updates out and then how much detail do we go into about what the state is and how things have changed? What the impact is and what the symptoms might be or the specific problem? As far as that last one, we try to avoid talking about our internal systems as much as possible on those.
We really want to focus on what the customer effects are, what symptoms they're noticing and what impact there is to their apps.
We'll go into as much detail as we can about which features or which functionality might be affected at a given time, or how widespread the impact is. We'll get into that stuff.
As far as the timeline, we like to get status updates out basically every time we have new information to convey to people and sometimes that can be every five or 10 minutes that we'll figure out more and more about what's happening or we'll be making progress resolving large numbers of the customers that are affected. There are times like during the Heartbleed incident where there's really not much changing over the course of an hour, and so we'll make sure that we update every hour just to say, "We're still working on this, we're still focused on it. We just don't have anything new to say because we're just proceeding as we were before."
I think at least every hour for those longer incidents is still really helpful, but there are always going to be places where you don't have anything new to say.
I guess that's a tradeoff you have to strike a balance on whether you want to just keep saying the same thing over and over again versus trying to get information out only when it changes.
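The cadence described here boils down to two triggers: post when there is something new, and otherwise never stay silent for more than an hour. A minimal sketch of that rule in Python, where the function and constant names are hypothetical and only the one-hour ceiling comes from the talk:

```python
from datetime import datetime, timedelta

# One-hour ceiling on silence, as described in the talk.
MAX_SILENCE = timedelta(hours=1)


def should_post_update(now, last_update, has_new_info, max_silence=MAX_SILENCE):
    """Return True if a status update should go out right now."""
    if has_new_info:
        # New information is always worth posting immediately.
        return True
    # Otherwise post a "still working on it" update once the ceiling is hit.
    return now - last_update >= max_silence
```

Making `max_silence` a parameter keeps the tradeoff explicit: a shorter ceiling means more repeated "no change" posts, a longer one means customers wait longer for reassurance.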
Q: Will people be upset if you solve the problem but don't update or communicate properly?
A: Yes. It's certainly possible that people will get upset if you're not keeping them up to date on things. That's kind of why we've resolved to that one hour time frame at the absolute latest. We don't want to go any longer than that. Ultimately, if you can talk about things that are changing more rapidly than that, you should definitely do so. The more people that are involved with this, if you've got a separate person for each one of those orgs that's in charge of those responsibilities, it becomes easier for them to just keep churning out that information as they're getting new pieces in from the groups that are working on it and new reports about what's happening.
Q: How do you pick the first incident commander and how do you train people to do that task?
A: Yes. Whoever finds the problem, they are the incident commander. When we go through the training process everyone is familiar with what the different responsibilities are for the different orgs.
They're supposed to be comfortable with the fact that if they discover that there's a problem or they think there might be a problem then they have to basically immediately assume that IC role.
They're also trained that they should bring in someone as soon as they think they're overwhelmed or they're uncomfortable with all the responsibility they might have. If they're really confused about what might be happening and they're going to have to really dig deep on it, they'll probably be more likely to just call in an IC immediately using our HipChat bots and not worry about the fact that they're going to have to wake someone up. They should be more focused on trying to make sure that they're responding effectively and that someone is able to get information out to customers even while they're working on diagnosing the problem.
Q: What's your approach to events that are resolved before you're first able to notify your customers?
A: Our approach to how we deal with events that end by the time we get around to saying they started. These days we've gotten a lot more adamant about posting retroactive incidents. We might have an API database failure that we noticed that takes our API offline for one or two minutes and sometimes it's faster than we can react and put something on our status site. During which time like our automated database recovery things might be taking over and the problem might be fixed.
What we try and do in those cases is, I'm not sure exactly what the threshold is, but if we determine that there was any real impact from it we try and go back and post the retroactive incident on our site.
You may have noticed our status site getting more noisy. It's not really because we've had more problems, it's just because we've been focused much more on trying to get every bit of information about problems that we've had up there to be more transparent and open with our customers about it.
Q: How do you determine who's responsible for fixing a particular problem at Heroku?
A: So you're asking how we determine who's responsible for actually fixing the issues. I think that depends on our internal responsibilities. Different teams are responsible for different parts of our infrastructure and different subsystems. Basically, if it's your system that's having an issue there's going to be someone who's on call for that particular team who's responsible for coming in and fixing that.
This is how it works in a larger organization anyway. If you're a smaller company, I'll say in the past we had a rotation system where only one person was on call at a given time. There's an issue with that, but it's obviously more feasible for a smaller company to do that so that everyone doesn't have to carry around their laptops all the time.
As far as how we figure out where the problem is or who needs to be alerted for that, we hope that anytime our internal systems have problems that those teams have done a good enough job of monitoring their systems and keeping an eye out for problems that they get alerted, that they get a PagerDuty notification that makes them come online as the problem is starting.
That doesn't always happen. Of course, there are certain issues that are kind of nebulous, hard to define. In those cases whoever the first responder is, or the IC, they will have to make the judgment call on where they think the problem might be or call someone in who can help them figure that out.