Things I'm not going to talk about tonight:
I'm not going to talk about how to build robust systems, how to debug your production issues, how to fix them quickly, how to monitor your systems or any of that. First of all those skills are absolutely necessary, but they're not what you need to build customer trust.
I am going to talk about how we coordinate Heroku's incident response and how you can apply that towards your startup. In particular we're going to look at how you can communicate with your company and your customers to coordinate that response, how to organize your company's response to incidents, as well as most importantly how to build customer trust. Ultimately you want to be able to do that even through your company's most difficult times.
A little bit of a problem statement here: software breaks. This happens to everyone and it doesn't matter if your software is well-built or not.
Ultimately you're going to run into bugs, human error, power outages, security problems and you're going to have an issue at some point. It's probably not possible to prevent this from ever happening. But it absolutely is possible to determine how well your company reacts to it, to react as well as possible when shit hits the fan.
Production incidents are stressful. That's a second problem here. There's a ton of stuff going on during an incident. You've got people complaining on Twitter, you've got support tickets coming in, your phones are ringing, Hacker News articles are popping up. All the while you're busy trying to actually figure out what the problem is and fix it. To make matters worse this often happens at four in the morning.
We're probably all familiar with the effects of poor incident handling. First of all there's the primary effect like the immediate loss of revenue from not being able to run your business and process new customers, as well as the SLA credits that come from having to reimburse people when you're down. Then there's the secondary effect, the long term impact on customer trust and customers leaving you, and your inability to get deals in the future because of your outages. The good news though is that this is almost entirely avoidable.
Everyone still with me?
Before I give you the context on the changes we made, I want to tell the story of what Heroku's incident response looked like in 2012. Heroku around that time had rapidly grown to about 75 people. A year earlier we were at about 25. Half of those were engineers. We started to collectively notice during that time that our incident response was degrading, things were becoming a little more chaotic and the quality of our responses was varying a lot. We decided it was time to focus on improving that.
Our primary communication channel during incidents was Campfire. This worked well for us during incidents and during our regular day to day work as well, as far as coordinating what's being worked on. During larger incidents when you have more people involved we could move things to Skype just to get a little more bandwidth on the communication and not have to waste so much time typing things. Both of these systems worked pretty well for us. But there was a common question that kept coming up every time someone would come in to help out on the incident or had to figure out what was going on and maybe help customers.
That question was, "Can somebody fill me in?"
This single question illustrates so many of the problems that we were experiencing. So many of the symptoms as well.
If you're an engineer that's coming in, if you're a support rep that has to answer people on Twitter, if you're a salesperson that's dealing with a high profile customer, if you're in marketing and you have to respond to people that way, for any one of these roles you need to know what's going on with the incident and you don't necessarily even know who's working on it.
We found that we were wasting a lot of time on the communication between these different groups of people, as people kept having to ask what was happening and what the latest status was. With a channel like Campfire, it's really tough to scroll back through a real-time conversation and figure out what's been going on for the last hour or two.
We found that we spent so much mental energy doing this in an ad hoc way that it was really impacting our response time.
Next, we had a lot of issues with context switching preventing us from getting into a flow. You could imagine engineers concentrating on solving the problem and figuring out what's broken, but all the while we had to actually get updates out to our customers to tell them what was broken.
It's really tempting to have the person who best understands what's going on, the person who is working on it, trying to explain that to people. It's pretty natural, I think. But it turns out that writing these status updates and getting that information out in a customer-friendly way actually takes a lot of mental effort and there's a huge cognitive shift when you're doing that versus doing deep problem solving. It turns out to be really slow and disruptive to the engineers. That was also really impacting our recovery time.
Next, there were a number of reasons that customers were being kept in the dark about the state of the incident. We didn't have any goals defined for what the public status updates should look like, how frequently they should go out, or what their content should be.
We ended up with status updates that were vague, inconsistent, uninformative, sporadic and sometimes downright inaccurate. Our customers basically were left to conclude that either we didn't know what was happening or we weren't going to tell them what was happening.
We realized the root problem here was actually that we didn't have a way to practice these things outside of an actual incident. We didn't have any framework on which to practice our communication procedures. We did do wargaming occasionally. That's always focused on technical problem solving or simulating an outage, and that sort of thing is really helpful but it's a different set of skills. You can't really work on them simultaneously. We wanted a way to focus just on the communication procedures and to practice those.
The final problem here. We had an incident around this timeframe, on a Saturday I believe it was. About six days after this we didn't actually have any progress made on a public post-mortem. These are things that you usually want to get out within a couple of days after an incident to restore trust, but nobody had actually been assigned to own the post-mortem here. So there had been no progress made on it. No one was actually responsible for making sure that was written. There are all kinds of reasons to blame for that.
These are typical problems that all startups face: your company is growing, your product is growing, people are coming in and out of the company. Ultimately we realized that our incident response was kind of chaotic and disorganized, it had started to affect our business, and it was something that we needed to fix. As we started digging into this, we quickly realized that incident response in general is considered a bit of a solved problem. That's the good news.
The Incident Command System is what other organizations use, other emergency responders use, in order to deal with the same kinds of problems, coordinating large groups of people that are dealing with stressful incidents and emergencies. IT operations folks aren't the first ones to deal with those problems. You can imagine firefighters that are working on dealing with wildfires, people who are responding to traffic accidents, natural disasters.
All of these problems that these emergency responders have really mirrored the problems that Heroku was facing at this time.
The Incident Command System was designed back in the '60s to help with fighting California wildfires. It was originally based on the Navy's management procedures. Those are the procedures that help the Navy manage a 3,000 person aircraft carrier that's full of 18-year-olds who aren't very well-trained during the heat of battle. You can imagine those are pretty intense procedures and they have to be effective in order to make things work well in that situation.
Since then, it's actually evolved into a federal standard for emergency response. There are a lot of scenarios where you might be getting federal funding and a mandate that you follow this system in order to get that funding. I'm not going to spend a ton of time on the details of the ICS right now. I'm just going to give you the key concepts of it.
The first and foremost is that you have an organizational structure that can scale along with the incident. Second is unity of command. That means that each individual reports to one and only one supervisor. Next up is the limited span of control so that means each supervisor has no more than three to seven people reporting directly to them. If you get to a point where you have more than that you've got to split it out, add another layer of hierarchy or add another management group. That's just to make sure that everyone can actually keep track of what their direct reports are doing and coordinate things that way.
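As a rough illustration of unity of command and limited span of control, here is a minimal Python sketch. It assumes a single extra layer of hierarchy is enough, and it promotes the first person in each group to team lead purely for illustration. The names `MAX_REPORTS`, `split_into_groups`, and `build_hierarchy` are hypothetical and not part of the ICS or of Heroku's tooling.

```python
from typing import Dict, List

MAX_REPORTS = 7  # upper end of the "three to seven" span of control


def split_into_groups(responders: List[str], max_reports: int = MAX_REPORTS) -> List[List[str]]:
    """Split responders into consecutive groups of at most max_reports people."""
    if max_reports < 1:
        raise ValueError("max_reports must be at least 1")
    return [responders[i:i + max_reports] for i in range(0, len(responders), max_reports)]


def build_hierarchy(commander: str, responders: List[str]) -> Dict[str, List[str]]:
    """Return a supervisor -> direct reports mapping.

    If everyone fits under the commander, they report directly. Otherwise the
    first person in each group becomes a (hypothetical) team lead who reports
    to the commander, and the rest of the group reports to that lead. Each
    person ends up with exactly one supervisor (unity of command).
    This sketch adds only one layer, so it assumes at most 49 responders.
    """
    if len(responders) <= MAX_REPORTS:
        return {commander: list(responders)}
    hierarchy: Dict[str, List[str]] = {commander: []}
    for group in split_into_groups(responders):
        lead, members = group[0], group[1:]
        hierarchy[commander].append(lead)
        hierarchy[lead] = members
    return hierarchy
```

With ten responders, this produces two teams under the commander instead of ten direct reports. A real organization would pick leads by expertise rather than list position.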
Clear communications is a huge thing to focus on. Having common terminology, making sure that your company or your group is on the same page as far as the terminology they're using to describe what's happening, so there's no confusion there when somebody uses a keyword.
Management by objective. That's clear, specific and prioritized objectives as far as what your company or what your group should be focused on doing to resolve the incident and in what order they should be doing those things.
As I said, I'm not going to spend a ton of time on this. There are a lot of good resources out there for the Incident Command System and in particular how to apply it to IT operations. I'm going to show these slides. I recommend that you check them out if you're interested in that. But instead, I'm going to focus on how we applied this at Heroku, how we use the ICS to improve our incident response.
First and foremost, I mentioned the organizational structure. When we studied the ICS we decided that we should have three primary organizational groups. The first of those being the incident command group. I'm going to go into detail on each one of these in a little bit here, but just to briefly go over them: incident command, operations and communications.
If you were to apply this to your company, you don't necessarily have to do exactly what we did as far as what your organization is structured like. The main important part is that you have this defined up front and everyone knows what the responsibilities are for each one of these groups.
Incident Command. This is typically defined by the incident commander role. That's a single person who is in charge with the final decision making authority. They have the say on what's happening and what decisions we make. This actually tends to be the first person or by definition is the first person that responds during an incident. The first person who figures out that something might be broken, they have to automatically assume the incident commander role. Along with that they have to also assume all responsibilities that haven't yet been assigned to other people until they get handed off or until the incident is resolved.
Heroku has a rotation of people that are specifically trained for this role. That's helpful. A few of their responsibilities: they're tracking the progress of the incident and coordinating the response between different groups. They have to act as kind of a communication hub between the different groups in the organization. They have to make the call on state changes. They have to decide how the incident is evolving and whether it's resolved or not. They have to issue situation reports. This is a really important one that we're going to dig into next here. As I said, they take over all responsibilities that aren't already assigned to another group or that haven't been handed off to other people.
A sitrep. What is a sitrep? A sitrep is a really concise description of what's happening, and it's an internal thing that uses blunt language. You don't have to worry about whether customers are going to see it or not. You can see here that we're talking about exactly what's broken, what the symptoms are, how widespread the impact is and any other important information as far as what we're doing to resolve it.
We have links to a ticket here from Amazon that's a support case that we have with our infrastructure provider. We also have a link to a Trello card that's our follow-up, showing what we're doing, that we can keep track of any temporary changes and roll those back once the incident is actually resolved. Finally we have the people that are actually working on the incident listed here. We have the incident commander and then we have the people that are in charge of communications as well as the other engineers that are working on the problem.
This is very, very dense information and it's a very great summary of what's happening. Most importantly it can be blunt and to the point. The purpose of it is to keep the company at large informed about what's happening with the incident.
Sales, support, marketing, all these folks need to know what's happening with the incident. The way this works is it pushes it out through our company-wide mailing list. It blasts everybody. Everyone has access to it. Everyone knows what the current state is. It also goes to our HipChat channels so it's coordinated in the same place where we're actually doing our incident response.
As I said, it should contain what's broken, scope of the impact, who is working on it, what we're doing to fix it, and it should be sent regularly. That could be every half hour, every hour or whenever there's something major that changes about the incident when there's new information or you figure out something that customers could do in order to recover from it faster. That gets sent to the entire company.
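To make the contents of a sitrep concrete, here is a small Python sketch of one as a data structure with a plain-text rendering. The field names and output format are assumptions made for illustration. The talk only specifies what a sitrep should contain, not how Heroku's tooling represents it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sitrep:
    """Internal situation report. Field names are illustrative, not Heroku's."""
    whats_broken: str
    scope: str
    actions: str
    incident_commander: str
    communications: str
    responders: List[str] = field(default_factory=list)
    links: List[str] = field(default_factory=list)  # e.g. support case, follow-up card

    def render(self) -> str:
        """Render a short, blunt, internal-only summary."""
        lines = [
            f"BROKEN: {self.whats_broken}",
            f"SCOPE: {self.scope}",
            f"DOING: {self.actions}",
            f"IC: {self.incident_commander}",
            f"COMMS: {self.communications}",
        ]
        if self.responders:
            lines.append("RESPONDERS: " + ", ".join(self.responders))
        for link in self.links:
            lines.append(f"LINK: {link}")
        return "\n".join(lines)
```

The rendered text could then be sent to a mailing list and a chat room on whatever cadence you choose. That delivery step is left out here because it depends entirely on your own tooling.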
The event loop is something that we use to describe what the incident commander is doing during an incident. It's more or less a series of questions. They don't have to be these exact ones, but it's a series of things that they're doing as the incident progresses. Basically, they're just following this loop asking these questions, making sure that everything is being handled, that nobody else needs additional support. They go through this list and at the end of it they just keep repeating until the incident is resolved and closed out. That's the first group, the first organization.
The second one we have is operations. These are the people responsible for actually fixing the problem. At a software company that tends to be engineers. That tends to be only a couple of engineers for smaller incidents, maybe one, two or three people. Once it's a larger incident, you're going to involve more people and you'll have more groups and more teams involved.
To give you an example at Heroku, if our platform is having issues, if we have a major outage then we'll have a team of people working on just the platform side of things. We'll have another group that's in charge of restoring our database operations and those will be separate teams that are both coordinated underneath the incident commander.
As I said, their responsibility is diagnosing the problem, fixing it and then reporting progress back up the chain to the incident commander so that he or she can communicate with the rest of the company about what's happening.
Importantly they don't have to spend a single cycle of their time worrying about how they communicate this process or about it being communicated in a way that can be shown to customers.
They can just be straight to the point and not have to waste any cycles on that. That's the second group.
Our third one here is communications. These are the people that are responsible for keeping customers informed about the state of the incident and what's happening. At Heroku this is typically managed by our customer support personnel. There are a few reasons for that. They don't have to context switch away from problem solving. They're already dealing with customer issues all the time, and even during an incident they're probably answering customer tickets as well. They don't have to make that switch, and they're already used to speaking the customers' language.
We used to waste all kinds of time, as I said, with engineers mincing words, trying to get things massaged in a way that both describes the issue but also doesn't look bad to customers. If you don't have to do that you can focus entirely on problem solving. Off loading this is really important.
It's not about whether engineers are capable of that. All of Heroku's engineers could easily write a status update. It's mostly about the context switch.
The status updates, these are public messages that we post periodically to keep our customers informed about what's broken, what they can do to fix it or what they can do to work around it, and what we're doing to fix the problem. We post these both to our public status website as well as our Heroku status dedicated Twitter account that we have for this. I don't know if it's necessary to use a dedicated account. We just find it helpful to keep things out of our main Twitter stream because it can get messy during an incident where you're pushing a lot of information out.
Communications, importantly, works very closely with the incident commander to make sure that they have all the information they need to describe the impact of the incident to customers.
Some guidelines for the content of these status updates:
They should be honest, they should be transparent and up front about what's happening. Don't hide stuff from your customers. And they should explain the progress that you're making on actually resolving it.
You should probably establish some goals within your company as far as what you want your customer communications to look like and what you're trying to accomplish with them.
Things they shouldn't do. They shouldn't provide explicit ETA around when things are going to be resolved. We've found so many times that we think things are going to be resolved in an hour and it turns out that we actually thought wrong about what the problem was. So, maybe one hour or two hours later you're still saying things are going to be fixed in an hour. And that never looks good.
Just don't over promise because you're never going to know for sure how long it's going to take.
Don't presume to know the root cause. If you haven't done a proper post-mortem yet, you're not going to know what the actual causes were so don't assume that you understand that at the time of the incident.
Don't shift blame. Ultimately you and you alone are responsible for your infrastructure choices and for your own availability. That does not fall on your providers. It doesn't matter if you're using Heroku or Amazon Web Services directly. That was your choice for your product, but your customers are paying you for that service. Ultimately it's your fault for making those decisions if they don't work out. Don't shift the blame there.
Don't do this. This may have been something that you've seen if you've looked at Amazon status site before. This is their "Everything is okay except for this little thing that may actually be a huge issue that's taking your site down."
Don't beat around the bush. If things are broken just own up to it and be honest with your customers about it.
Some proactive handling around your top customers. Sometimes the incident commander finds out that there's a situation where you can recover your top customers more quickly. You can imagine a situation where you've got manual work that's required to recover a bunch of customers' databases. You can't do them all at once. There might be constraints on that. Ultimately you have to do one at a time. In a situation like that you can prioritize your top customers and give them the best experience there, get them up the soonest, the ones who are most important to your company or the ones that are paying you the most.
Also, account managers or sales teams can proactively reach out to those top customers even if they haven't complained yet. Even if they don't know that they're being affected, you can tell them if they're affected and that establishes a communication channel with them and makes them really confident that you're on top of things.
Handling support tickets during incidents. You can imagine a widespread incident, we're going to have lots of tickets coming in, lots of customers are affected, and it's not really possible to respond effectively to all of them during the heat of an incident. What we do, we have a macro that our support team will set up when the incident begins that will redirect them to the status site. That's the place where we're already putting all of the relevant information that they need to know, everything that we are able to tell customers about the problem and what they can do to fix it.
What they also do when they assign tickets to this macro, they can label them in a way that they can easily find those tickets after the incident which is wonderful because it gives them a chance after the incident is over to apologize to customers first of all. Secondly, to make sure that the customer's problem was actually being caused by the incident. That happens all the time when you think the problem is due to whatever is broken, but it might actually be something else. It gives you a chance to then follow up and see that their problem was actually resolved by the incident being closed. Then you can re-engage with them.
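The macro-and-label workflow can be sketched in a few lines of Python. This is an in-memory illustration only, not the Zendesk API. The `Ticket` class, the `incident-<id>` label format, and the placeholder status URL are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Set

STATUS_URL = "https://status.example.com"  # placeholder, not a real status site


@dataclass
class Ticket:
    ticket_id: int
    subject: str
    labels: Set[str] = field(default_factory=set)
    replies: List[str] = field(default_factory=list)


def apply_incident_macro(ticket: Ticket, incident_id: str) -> None:
    """Reply with a pointer to the status site and label the ticket for later."""
    ticket.replies.append(
        f"We are investigating an incident. Updates are posted at {STATUS_URL}."
    )
    ticket.labels.add(f"incident-{incident_id}")


def tickets_for_follow_up(tickets: List[Ticket], incident_id: str) -> List[Ticket]:
    """After the incident, find every ticket that received the macro."""
    label = f"incident-{incident_id}"
    return [t for t in tickets if label in t.labels]
```

Once the incident is closed, the follow-up list is the set of customers to apologize to and to check on, in case their problem had a different cause.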
That's it for the organizational units. We've got incident command, operations and communications. As I said, this isn't really set in stone. It's not that you have to use this exact structure for your company.
You should use what works for your organization. Don't be afraid to change this also as your company grows and evolves and your org changes. It's okay to change this when it makes sense for you.
Some other ideas that we implemented based on the Incident Command System: training and simulations is a really important one. Incidents are really stressful, if I haven't made that clear yet. People under stress tend to revert to their habits. I played volleyball in college and this was something that I heard all the time from coaches at a higher level, which is that under stress, when the game is on the line, players are going to do what they're comfortable doing, what their habits are.
The only way to really change that is via realistic training. It's not just about practice, it's about practicing under as much of a realistic setting as you can. The stressful environment is really what makes you able to change your behavior when you practice under that environment. Studying is not enough. You'll see this in any field where people are expected to perform under stress. There's a heavy emphasis on realistic training environment. If you want to respond as quickly and effectively as possible it really has to be second nature. You have to be able to know these steps by heart.
It's okay to have a checklist so that you don't forget things, but if you don't know what all those steps on the checklist are you're not going to be able to perform them quickly during an incident. You're going to have to look up what you're actually supposed to be doing there. Airline pilots when they have to land a plane they have a long checklist of things that they go through, but they've trained thousands and thousands of hours on that so they know what everything is already. It's just to make sure they don't forget a step.
With our training and simulations we want to mimic our production environment as much as possible. We'll have an entire clone of our production infrastructure setup as well as simulation copies of our status site and of our Twitter accounts and our Zendesk. All of that to give you basically the complete experience around this so you can practice using the same tools that you would during an actual incident.
As far as the intervals on which you should train, you probably want to have some kind of a jumpstart for your team when people are coming on board or getting into on-call procedures. You want to jumpstart their training and make sure that they're brought up to speed on what they should be doing.
It's also important to train regularly. Maybe every three months or something, every quarter just to keep those skills fresh.
You want to track the things that you care about during training, things that are important to your company. For us we placed a heavy emphasis on how long it took us to get that first status update out to customers even if you're not sure what the problem is. We really try to be focused on getting that investigating post up there so that we've acknowledged that we're at least looking into something which is a great thing when customers are having a problem. They already know that we're working on it.
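If you want to track time to first status update during training, the measurement itself is simple. Here is a minimal Python sketch. The function name and the idea of passing in raw timestamps are assumptions, and how you collect those timestamps depends on your own status tooling.

```python
from datetime import datetime, timedelta
from typing import List, Optional


def time_to_first_update(
    detected_at: datetime, update_times: List[datetime]
) -> Optional[timedelta]:
    """Return the delay between detection and the earliest public update.

    Updates timestamped before detection are ignored. Returns None if no
    update was posted at or after detection.
    """
    later = [t for t in update_times if t >= detected_at]
    if not later:
        return None
    return min(later) - detected_at
```

Recording this number for every drill and every real incident gives you a trend to watch from quarter to quarter.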
Clear communications, that was another big focus of the ICS. There are a few ways in which we've tried to achieve that goal: having explicit state changes and handoffs, and being very clear with messaging about when things are evolving or when responsibilities are being handed over. You can see that in these first couple. Imagine that I am the incident commander and I want to hand over that responsibility to Ricardo. I'm going to have a clear message that everyone can recognize, that alerts everyone and says, "Hey, there's a new incident commander and it's this person." Likewise for communications if I want to hand that off to somebody else.
We also have messages here that we use when we are confirming that the incident has grown in scope or we know that there's a real problem or when it's resolved to let everyone know that we're wrapping down our handling of the incident. This is actually built into our tooling for our status sites so it ends up being automatic for us. That makes it really hard to forget it. Basically, impossible to forget it.
The dedicated communications channel, this was something that we had before, but we really reinforced it when we took another look at this. We have this HipChat room called "Platform Incidents" that everyone knows that you're supposed to go to when you think something's broken. That's where the investigation takes place. That's where things are coordinated. We can still escalate things to Skype or Google Hangouts when things are needing a little bit more communication bandwidth, but it's important that everyone has a known starting point they're going to go to when the incident gets going and that it's defined in advance.
There's a lot in the ICS also about defining your terminology and your process making sure that those goals are set up up front.
Among those, we found it really helpful to define some product health metrics. We picked two or three specific metrics that we think explain whether our platform is working effectively and whether our customers are having issues or not. We saw a lot of problems with inconsistent handling of incidents where an engineer might be getting woken up at 3:00 a.m. and they're going to be really hesitant about waking someone else up when they're not sure what the problem is. Especially if they think they can handle it.
This is meant to prevent the hero syndrome. We don't want people to be heroes. We want to have consistently good responses. Having these metrics defined was actually very valuable for us on that. Describing the severity of the incident and the state of it becomes much easier when you've got a couple of metrics that tell you the most important parts about how your product is functioning. It makes it much easier to justify calling in help at 4:00 in the morning when you've got the numbers to back it up.
It's actually harder than it sounds though, unfortunately. For our metrics, we use continuous platform integration testing, HTTP availability numbers, and we try to gauge the number of apps or customers that are impacted. We also have thresholds for those that we use to determine how severe the incident might be. As I said, in practice it's been difficult for us. I think that's because the more complex your product is, the harder it is to narrow it down to just a couple of metrics. It's still really valuable to try to find those most important things that tell you how your product is working.
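Here is a minimal Python sketch of turning a few health metrics into a severity call. The three inputs mirror the metrics mentioned in the talk, but every threshold value and the `healthy` / `investigate` / `major` labels are made-up placeholders. The talk does not give Heroku's actual numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthSnapshot:
    integration_pass_rate: float  # fraction of platform integration tests passing, 0.0 to 1.0
    http_availability: float      # fraction of HTTP requests served successfully, 0.0 to 1.0
    impacted_apps: int            # estimated number of affected apps


def classify(s: HealthSnapshot) -> str:
    """Map a snapshot to a severity label using hypothetical thresholds."""
    # Any single metric past its "major" threshold is enough to call in help.
    if s.http_availability < 0.95 or s.integration_pass_rate < 0.80 or s.impacted_apps >= 1000:
        return "major"
    # Smaller deviations justify opening an investigation.
    if s.http_availability < 0.999 or s.integration_pass_rate < 0.95 or s.impacted_apps >= 10:
        return "investigate"
    return "healthy"
```

The point of writing the thresholds down is that the engineer who gets paged at 3:00 a.m. doesn't have to make a judgment call about whether to wake someone else up.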
Another problem is that Heroku hasn't historically had a really great continuous integration story internally. That's something that I think is really important to invest in with any product, so I'd recommend that to any of you.
The next piece here is having tooling around our incident response. This is a screenshot from the admin side of our status site. This is what the communications person would be using to draft a public status update. It's where they can control what the public colors are describing the severity of the incident and also adjust the scope of it to tell who's impacted or what services are impacted.
There's a similar interface that the incident commander uses also when they're writing a situation report, a sitrep. This is integrated with our status site so when you write something here it's automatically up on the status site. It gets pushed out to everybody by email, it gets put on our HipChat room and it makes everything easy to have this kind of built into our tooling.
Another tool that we have, a HipChat bot that we use. One of the big purposes of it is to be able to page in people. Whether it's another specific person that you know who you want to call in or if you want to call in whoever is on call for a particular rotation. We have bots that make that really easy for us so we don't have to leave the place where we're already doing our incident response. We don't have to go muck around with PagerDuty's UI to figure out who's on call and to figure out who we want to escalate it to. Having that automated here and having a really convenient way to access it has been huge for us.
I will say this about tools: it's only helpful to have great tooling if people know how to use it. If you haven't been trained on your tools up front, you're not going to be able to figure out how to use them during the heat of an incident. You have to be trained.
Another tool that we have here is the incident state machine. This is our internal nomenclature that we use to describe the state of the incident. If you could imagine, everything is functioning normally in steady state, everything's healthy. That's our state zero. That's our "everything's normal" state. As soon as anybody suspects that something might be broken you're going to escalate that into investigating when you have a strong suspicion that something is wrong. There are actually triggers for that.
If you want to escalate things to the state one, you have to open that public status incident to say that you're investigating the incident so that people know. Finally, once you've confirmed the scope and the impact, the symptoms of the incident, then you can confirm that there is actually a problem and you can update that way. And then when there's a major incident, we have a separate state for that. That is our big red button that calls in everyone when we know that there's something really major going on.
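The incident states described here can be sketched as a small state machine in Python. The talk names state zero (normal) and state one (investigating). The numbering of the confirmed and major states, the set of allowed transitions, and the class and method names are all assumptions for illustration.

```python
from enum import Enum


class State(Enum):
    NORMAL = 0         # everything healthy
    INVESTIGATING = 1  # strong suspicion that something is wrong
    CONFIRMED = 2      # scope and impact confirmed (numbering assumed)
    MAJOR = 3          # "big red button" (numbering assumed)


# Hypothetical transition table; adjust it to match your own process.
ALLOWED = {
    State.NORMAL: {State.INVESTIGATING},
    State.INVESTIGATING: {State.CONFIRMED, State.MAJOR, State.NORMAL},
    State.CONFIRMED: {State.MAJOR, State.NORMAL},
    State.MAJOR: {State.CONFIRMED, State.NORMAL},
}


class Incident:
    def __init__(self) -> None:
        self.state = State.NORMAL
        self.public_status_open = False

    def open_public_status(self) -> None:
        """Record that a public 'investigating' post has been opened."""
        self.public_status_open = True

    def transition(self, new_state: State) -> None:
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"cannot go from {self.state.name} to {new_state.name}")
        # Escalating to state one requires the public status incident to exist.
        if new_state is State.INVESTIGATING and not self.public_status_open:
            raise RuntimeError("open the public status incident before escalating")
        self.state = new_state
        if new_state is State.NORMAL:
            self.public_status_open = False
```

Encoding the trigger as a hard precondition is one way to make the public acknowledgement difficult to forget, in the same spirit as building the handoff messages into the tooling.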
I want to say a word here about follow-ups and post-mortems. As I said, the big problem that we had here was not knowing whether somebody owned this, not having someone who at the end of each incident was assigned to own this process and make sure that they were getting all the input and all the information they needed to compile the post-mortem as quickly as possible.
Make sure that somebody owns that after each incident.
As far as how to write a good one, I like to look at the Mark Imbriaco formula. It's just three simple steps. You want to apologize for what happened. Make sure that your customers really feel like you feel bad about it. Secondly, you want to demonstrate that you understand what happened, that you understand the sequence of events that led to it. Finally, you want to demonstrate or explain what you're going to do to fix it. You want to tell them that this isn't going to happen again and here's why. You can read basically any post-mortem that Mark Imbriaco has written on these. They're all great.
I'd also recommend you check out the ones that CloudFlare writes. They're really in depth and detailed technical analyses, but they also basically follow this formula. They do a great job especially at getting them out quickly. I've seen them have an incident on a Sunday and they'll have these post-mortems out the next day in extreme detail. That's really impressive and as a customer it makes me want to trust them more.
Those are the changes we made. As far as how well this has actually worked for us, it turns out that customers actually really notice and appreciate this kind of thing when you're proactively communicating with them. There were a lot of recent examples here from the Heartbleed SSL debacle a couple of months ago. People were very adamant about praising us for how good of a job we had done keeping them informed about the state of things.
I think it's important to note here, we didn't really do anything fundamentally different as far as what we said than what AWS did, but the way that we communicated it was much, much different. We didn't even really necessarily move faster than them. In a lot of cases we were constrained by them for certain pieces of our infrastructure.
In spite of that, people really praised the way that we responded to it and they were mocking how Amazon responded to it. The way that you're doing this proactively makes a huge difference.
We've also seen a nice side effect where our support volume is much lower during incidents as a result of the way that we push this information out. People are less tempted to have to open support tickets when they know we're already working on what's broken. I think during normal incidents we had on the order of 50 to 100 tickets that might be opened and now that actually goes down to like five or 10 that might be opened during a regular incident. Huge impact there.
We're definitely far from perfect. After our initial response or initial sprint to improve this stuff a couple of years ago, we've really slacked on the regular training aspect of it. As our company has grown, as people have left or new people have come in, our response, our ability to coordinate this cleanly internally has kind of degraded. We're really getting back into that now with the more regular training and making sure that happens so people keep their skills fresh.
I want to recap all the things we talked about as far as how to apply them to your company. Standardizing your process is really important here. You want to define that org structure, get it defined up front. The sooner that you can have more than one person involved with it, the sooner you can really reap the benefits of having people able to focus on particular tasks and not have all that context switching.
Standardize your tooling and your process, making sure that's not ad-hoc because that leads to inconsistency.
Really working hard and trying to define those product health metrics and thresholds for them. Also, establishing those goals up front for what you want to accomplish as far as customer communications during the incident. Limit that to no more than two or three if you can.
Clear internal communications: there are a few things around that. Make sure those handoffs are really explicit, and really embrace the sitrep. That's done an amazing job for us as far as keeping people informed. There's no longer a need for people to keep jumping on HipChat and asking details about what's happening. They already have that information pushed out to them.
Owning the post-mortem process, making sure that someone's in charge of that. It would probably be useful here also to have anybody who's going to be involved with these incidents, whether they're writing status posts or writing sitreps, to do some practice ahead of time, maybe write out some templates. It doesn't matter if you're going to use those templates during the incident, the important thing is that you've practiced on what the content is that should go into them so you're able to do that under stress.
Finally, training yourself so you can make sure that you are able to practice or apply these skills regularly when it matters.
The best procedures mean nothing if people haven't been trained on how to follow them and if they're not going to follow them when it matters.
If you want to recover as quickly and effectively as possible you really have to be trained on them.
That's it. Thanks.
Q: How does a training session work at Heroku?
A: We try to have someone who's involved. Basically, the coordinator of the training sessions, they'll plan out a game plan for it ahead of time and figure out what they're going to break and keep track of logging what people are doing during the incidents. They're going to take notes about it throughout the whole time and they're going to write down things about when each person came in to actually deal with it. We usually get people into a single room and they'll get called into the room as they get paged or as they get called in like they normally would for many of their alerting triggers.
They'll come into the room and we'll really keep everyone in close watch so we can take notes on how they're responding. We have the logs in HipChat as far as when certain events were happening. That's the important part of it I think. We have them pretend that they're not in the same room even though they are, sort of speaking their thought process out loud as they're thinking about what they should be doing next so we can take notes about how to refine that process and make sure that everyone is well-trained on that. I think that's about it.
As I said, they're really focused on the communication aspect of things. The actual wargaming, the training of fixing technical problems, that's kind of a separate piece and I think you really have to train them separately.
During these incident response training sessions we don't really worry that much about whether the person is able to figure out effectively what's broken or figure out how to fix it. That's sort of a separate thing that we can then work on afterwards with them if they're having issues with it. It's more about just making sure the communication process is really refined.
Q: What's the distinction between production and development on the Heroku status page?
A: You use Heroku a lot and you don't understand the distinction on our status site between what's considered production and development. I could say that we don't really either. That's kind of a distinction that . . . I guess I could say that on camera. No, we've had debate about that internally. Customers have been confused about that for a while.
It's hard because we're trying to not convey things in a way that matches how our internal systems work. I mean, it would be easy for us to describe things in terms of whether the platform, whether your existing apps are up and running versus whether you can push new code and whether the API is functional. That's kind of how it ends up breaking down. But we really want to focus on what customers care about rather than how the internal systems work which is why we've kind of struggled with those definitions a bit so far. Don't know if I have recommendations for how you might do that for another service.
Q: When do you change the incident commander or escalate the incident to the next level?
A: When do you decide whether the incident commander should change or when the incident should be escalated to the next level?
As far as when it should be escalated to the next level, we usually try and define that based on the severity, based on our metrics.
If we detect that two percent of apps are affected or one percent of apps are affected, I'm just making numbers up here, that's considered a minor incident. Then if something like 20 percent is affected that's a really big problem, that's huge. If it's a small handful of customers it might just be a regular incident.
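A metric-based escalation rule like this can be sketched as a simple threshold check. The 20 percent figure below is the made-up number from the answer above, and the function and parameter names are hypothetical; real thresholds would come from your own product health metrics.

```python
def needs_major_escalation(affected_apps, total_apps, major_threshold_pct=20.0):
    """Return True when the share of affected apps reaches the major threshold.

    The default threshold is illustrative only, not a real Heroku value.
    """
    if total_apps <= 0:
        raise ValueError("total_apps must be positive")
    affected_pct = 100.0 * affected_apps / total_apps
    return affected_pct >= major_threshold_pct
```

For example, `needs_major_escalation(250, 1000)` is true, while `needs_major_escalation(10, 1000)` is not.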
As far as when to change the incident commanders, I think our rule of thumb there is that we don't like to have people in the rotation for more than four hours at a time. If you start an incident and it might be a long one that's going through the night or something, I think it's important to change that person out regularly every four hours because there's just a lot that they have to keep track of during the incident. At the end of four hours you really start losing the ability to be sharp on that stuff and coordinate effectively.
For long incidents we'll have people rotating every four hours, changing throughout the night and the other people will go off and take a rest. They may have to get called back 12 hours later, but at least they can take a mental break and sleep and recharge a little bit.
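The four-hour rule of thumb is easy to turn into a schedule. This is a minimal sketch that computes when handoffs should happen for an incident expected to run long; the function name and the idea of planning against an expected end time are assumptions for illustration.

```python
from datetime import datetime, timedelta

# Rule of thumb from the talk: no more than four hours as incident commander.
MAX_SHIFT = timedelta(hours=4)


def handoff_times(start, expected_end, shift=MAX_SHIFT):
    """List the times at which the incident commander should be swapped out."""
    times = []
    next_handoff = start + shift
    while next_handoff < expected_end:
        times.append(next_handoff)
        next_handoff += shift
    return times
```

An incident that starts at 22:00 and is expected to last ten hours would get handoffs at 02:00 and 06:00.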
Q: How do you deal with people who complain online, even when you're being as transparent as possible?
A: His question was even if you're doing a great job at this stuff there's still a chance that somebody will go on Hacker News and post about it or complain about the fact that your service is broken. Yeah. How do you deal with that as an engineer when you think you're doing things right and then people are still complaining about it?
I think from Heroku's perspective it's pretty normal now. Anytime we have any kind of an incident, I don't know if it's people who are subscribed to our text messages, that just immediately then go post about it on Hacker News. But it basically shows up every single time no matter how severe the incident is. I think we've just gotten used to that and they tend to die off the front page now pretty quickly because people are getting tired of those posts showing up all the time, every time something is broken.
I think you just have to really take a look at how you responded and try and think objectively. Listen to the customer's feedback if they're complaining about stuff. Ultimately it's up to you to decide whether you're doing things as well as you can or as well as you need to for your company to succeed.
Q: Do you have a refined process for the post-mortem and is the incident commander involved in it?
A: Do we have a refined process around the post-mortem and is the incident commander involved in that process? The sitrep that I showed earlier had a link to a Trello card. We use Trello for all kinds of things internally. When there's an incident the IC tends to open a Trello card to track what's going on during the incident, so we can undo temporary changes afterwards. The card also tracks that someone has been assigned the post-mortem responsibility, or that a follow-up meeting has been scheduled for the next day or so, depending on when things wrap up.
We do have more well-defined processes for having meetings around that and making sure that all the people whose systems were involved are there to give their information about what they figured out happened. I'm not sure if there are specific aspects of it that you're curious about, but yeah, we do have a more formal process around that. We have an SRE team now that's really defining that stuff in a much more clear way. The incident commander, they're usually involved in the post-mortem if only to be there as the person who's explaining the way things happened and explaining the sequence of events.
Q: How many people would be involved in a normal incident?
A: How many people would be involved in a normal type of incident? There are times when it's only one. I've said those responsibilities should be split out and they really should when it's anything other than a really trivial incident.
Sometimes we have a single database server that goes down and a data engineer might go on there and be involved with that one customer and they're able to handle everything. For the most part, we really try and emphasize having somebody take over as incident commander as quickly as possible. Worst case, if it's a minor thing or not that much is changing, the IC can write those status updates themselves and avoid having to call in a communications person.
Definitely trying to split out the IC role from the engineering role is important and we try and get at least two people involved there.
For a regular incident, I'm trying to estimate here, but I don't think it's typical for a regular incident to have more than two or three engineers involved in it. There are certain people who are already on call and it's their job to kind of fix these things. Typically you can add more people to the fire but it's not going to help you recover faster from most situations.
Q: How often do you update the status site and how detailed is each status update?
A: I guess one of them was how often do we look to update the status site, how often do we look to get those updates out and then how much detail do we go into about what the state is and how things have changed? What the impact is and what the symptoms might be or the specific problem? As far as that last one, we try to avoid talking about our internal systems as much as possible on those.
We really want to focus on what the customer effects are, what symptoms they're noticing and what impact there is to their apps.
We'll go into as much detail as we can about which features or which functionality might be affected at a given time, or how widespread the impact is. We'll get into that stuff.
As far as the timeline, we like to get status updates out basically every time we have new information to convey to people and sometimes that can be every five or 10 minutes that we'll figure out more and more about what's happening or we'll be making progress resolving large numbers of the customers that are affected. There are times like during the Heartbleed incident where there's really not much changing over the course of an hour, and so we'll make sure that we update every hour just to say, "We're still working on this, we're still focused on it. We just don't have anything new to say because we're just proceeding as we were before."
I think at least every hour for those longer incidents is still really helpful, but there are always going to be places where you don't have anything new to say.
I guess that's a tradeoff: you have to strike a balance between saying the same thing over and over again and getting information out only when it changes.
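The update cadence described here boils down to two rules: post whenever there is new information, and never stay silent for more than an hour. A minimal sketch, with hypothetical function and parameter names:

```python
from datetime import datetime, timedelta

# Longest gap between status updates mentioned in the talk.
MAX_SILENCE = timedelta(hours=1)


def should_post_update(now, last_update, has_new_info, max_silence=MAX_SILENCE):
    """Decide whether a status update is due."""
    if has_new_info:
        # New information always goes out right away.
        return True
    # Otherwise post a "still working on it" update once the gap is too long.
    return now - last_update >= max_silence
```

Something like this could drive a reminder for whoever owns customer communications during the incident.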
Q: Will people be upset if you solve the problem but don't update or communicate properly?
A: Yes. It's certainly possible that people will get upset if you're not keeping them up to date on things. That's kind of why we've resolved to that one hour time frame at the absolute latest. We don't want to go any longer than that. Ultimately, if you can talk about things that are changing more rapidly than that, you should definitely do so. The more people that are involved with this, if you've got a separate person for each one of those orgs that's in charge of those responsibilities, it becomes easier for them to just keep churning out that information as they're getting new pieces in from the groups that are working on it and new reports about what's happening.
Q: How do you pick the first incident commander and how do you train people to do that task?
A: Yes. Whoever finds the problem they are the incident commander. When we go through the training process everyone is familiar with what the different responsibilities are for the different orgs.
They're supposed to be comfortable with the fact that if they discover that there's a problem or they think there might be a problem then they have to basically immediately assume that IC role.
They're also trained that they should bring in someone as soon as they think they're overwhelmed or they're uncomfortable with all the responsibility they might have. If they're really confused about what might be happening and they're going to have to really dig deep on it, they'll probably be more likely to just call in an IC immediately using our HipChat bots and not worry about the fact that they're going to have to wake someone up. The focus should be on making sure that they're responding effectively and that someone is able to get information out to customers even while they're working on diagnosing the problem.
Q: What's your approach to events that are resolved before you're first able to notify your customers?
A: Our approach to how we deal with events that end by the time we get around to saying they started. These days we've gotten a lot more adamant about posting retroactive incidents. We might have an API database failure that takes our API offline for one or two minutes, and sometimes that's faster than we can react and put something on our status site. During that time our automated database recovery might take over and the problem might already be fixed.
What we try to do in those cases (I'm not sure exactly what the threshold is) is that if we determine there was any real impact from it, we go back and post the retroactive incident on our site.
You may have noticed our status site getting more noisy. It's not really because we've had more problems, it's just because we've been focused much more on trying to get every bit of information about problems that we've had up there to be more transparent and open with our customers about it.
Q: How do you determine who's responsible for fixing a particular problem at Heroku?
A: So you're asking how we determine who's responsible for actually fixing the issues. I think that depends on our internal responsibilities. Different teams are responsible for different parts of our infrastructure and different subsystems. Basically, if it's your system that's having an issue there's going to be someone who's on call for that particular team who's responsible for coming in and fixing that.
This is how it works in a larger organization anyway. If you're a smaller company, I'll say in the past we had a rotation system where only one person was on call at a given time. There's an issue with that, but it's obviously more feasible for a smaller company to do that so that everyone doesn't have to carry around their laptops all the time.
As far as how we figure out where the problem is or who needs to be alerted for that, we hope that anytime our internal systems have problems that those teams have done a good enough job of monitoring their systems and keeping an eye out for problems that they get alerted, that they get a PagerDuty notification that makes them come online as the problem is starting.
That doesn't always happen. Of course, there are certain issues that are kind of nebulous, hard to define. In those cases whoever the first responder is, or the IC, they will have to make the judgment call on where they think the problem might be or call someone in who can help them figure that out.