Ep. #49, Incident Commanders with Fred Hebert of Honeycomb
about the episode

In episode 49 of o11ycast, Charity Majors and Jessica Kerr speak with Fred Hebert of Honeycomb about incident commanders. Listen in for insights on the role of SREs, the educational value in incidents, improving feedback cycles, and choosing the right OKRs.

Fred Hebert is a Site Reliability Engineer (SRE) at Honeycomb, with over a decade of previous software engineering experience. He’s a published technical author who loves distributed systems and systems engineering, and has a strong interest in resilience engineering and human factors.

transcript

Charity Majors: So what is Incident Commander, Fred? What does that mean?

Fred Hebert: God, what doesn't it mean?

This is going to be anything, really, that touches the intersection between how you organize your response and how you muster up all the resources to fix an incident that might be ongoing.

So it's going to have all the definitions that everybody in the world can think of in terms of what we want for authority and accountability during an incident.

So in some cases it's facilitation, in some cases it's who do you blame for things and everything.

Charity: What do we mean by it at Honeycomb? What does Incident Commander mean at Honeycomb?

Fred: Yeah. Honeycomb is still small enough that we don't have people trained specifically in the Incident Commander role, so it's a hat that people can wear from time to time when they jump into an incident and people feel that, "This is going to be high paced, it's going to have a lot of events, a lot of things to track. We need someone in charge of doing that coordination right now."

Charity: And by incidents we mean anytime the software isn't working right?

Fred: That's another endless trap door that leads to everything. I will use the definition of incident as something happening that distracts you from your planned work.

Charity: That's a great definition.

Fred: I will avoid the definition where it has user impact because a lot of incidents can be massive in response but have absolutely nothing to do with customers even knowing something is wrong.

Sometimes things do go wrong for customers if the response has not managed to turn the ship around in time, or something like that.

Charity: This sounds like a great time for both of you to introduce yourselves. How about we start with Fred?

Fred: Yeah. So I'm Fred Hebert, and I'm the Site Reliability Engineer here at Honeycomb.

I'm here because I care a whole lot about incidents, things going wrong, and I'm one of the people who sees a show with explosions and is like, "I'd like to know how that happened, and dig into that."

Jessica Kerr: I'm Jessica Kerr, Jessitron on Twitter, and I'm a host of the o11ycast now because I'm a Developer Advocate at Honeycomb.

Charity: Our newest host, hooray. I'm so happy to have both of you here, this is really exciting for me.

Honeycomb has a reputation for being a very opsy company, because what we do is let people outsource a lot of their ops stuff to us, right?

But Fred is actually our one and only actual SRE.

Fred: Yeah. It's an interesting dynamic because this is also the first time I've ever been... Sorry, I've been a software engineer who dealt a lot with operations in the past, but I came here for the first time as an SRE, and I work with a lot of people who were SREs or managed SRE teams and are now software engineers at the same time.

So there's a very interesting dynamic in this one compared to other places I've been at where SRE was the center of expertise for all of that.

It's much more widely distributed within Honeycomb as far as I can tell.

Charity: Nice. Well, let's go back to where ICs started. IC could in a lot of contexts mean Individual Contributor, and here of course in this podcast it means Incident Commander.

So how did the IC role evolve here? In the beginning there were just a few engineers, how did we decide that someone needed to have a specialized role in all of this?

Fred: Right. So that's a bit of a funny one because, as far as I can tell, the Incident Commander role comes from much, much bigger organizations which frequently had these issues and things like that and it became a pattern, especially around natural disasters, forest fires, floods and everything like that.

Jessica: That makes more sense.

Fred: Yeah. So you needed people who could find the right response if you need multiple fire departments, the trucks, the things to transport the refugees, and everything like that.

You might be thinking of things that talk to multiple departments, multiple government authorities, civilians, military, stuff like that, and you needed someone to own the entire thing and send that through.

So in these organizations it can be a very, very strict role with explicit protocols for talking across all sorts of agencies, and it gets used in a watered-down form in a lot of software companies, where it becomes that thing where you own the incident, but it can be a lot more about facilitation or coordination of the response.

Some companies, the bigger ones, will have things like tech leads, communications managers, people who deal with incident liaison, people who deal with the emergencies.

You might have people who talk to the legal department, depending on the type of incident you have.

Then in smaller companies either you have nothing at all, or you have the role, which is extremely ill-defined but necessary depending on the pace of the incident at hand.

Charity: So something that jumps out at me from this doc that you wrote about Incident Commanders, at the very top in italics it says, "ICs do not troubleshoot. They help people who troubleshoot do so effectively."

Jessica: So you have people who are down in the weeds troubleshooting, and the ICs job is to not get in the weeds but to be above that and able to liaise above the weeds?

Fred: Yeah. I think that's the effective pattern.

There are places, I believe, which have a more authority driven way of solving issues where the Incident Commander is the person who decides which solutions are good or not.

So before taking corrective actions, they are there to check and make sure it's not going to be a liability, that the right type of decision is being made, and so they have that sort of authority to pick a solution over a different one.

That can lead to a way to defer authority to the commander and they can become a bottleneck in some cases.

Which, in some incidents, is fine if the pace at which you need to make decisions is slow enough to allow that.

I think Dr. Laura Maguire has super interesting research about that, if you want to look her up.

But it's one of the things that's unique about software in that there's a lot of information and a lot of parallel work, and a lot of ways to send that information everywhere at the same time, and that pattern of authority breaks down really, really rapidly in the software world. So coordination ends up being more effective.

Charity: Interesting. So is it important that the IC know the domain intimately or no?

Fred: I think in general it makes sense to know the domain intimately, just because you don't want to be interrupting all the responders all the time to explain to you what it is that they're doing.

You understand the actions they are taking, the moves they're making, you're able to gauge their relevance, but really your role there is to make sure that what they're doing is aligned with what everybody else is doing.

Jessica: So the more shared context you have, the better you're going to be at communicating quickly and effectively.

That counts both with the engineers troubleshooting things down in the weeds, and with legal.

Fred: Right.

Charity: Yeah, that's the other thing I was thinking about.

It seems like a very important part of this job is being able to translate, right? Being able to switch codes, switch contexts because you've got a lot of stakeholders here and not all of them are engineers.

You've got the marketing folks, you've got the customer success folks, you've got people who are talking to your biggest contracts, you've got legal, and when do these people need to know things?

That's partly a science and it's partly an art, but you have to know, at least you have to be able to switch modes in your head in order to be effective for all of these people.

Fred: Yeah. And even in smaller companies like Honeycomb, it does happen that there's a lot of communications to be had on top of tracking everything so the Incident Commander just picks someone to also be in charge of communications, updating the status pages and stuff like that.

And keeping in mind that we need to get frequent updates and stuff like that so that they can focus on the actual incident resolution steps and coordination there as well.

Charity: So when an incident happens, what is it? All hands on board? Everybody drops what they're doing and jumps in to help?

Fred: I think that's how it is when you're a really, really small company, right?

If you have something like fewer than 10 engineers, you just page everyone, and whoever knows they can handle it ends up doing it.

As you grow, your engineering departments silo themselves, gain that sort of expertise, they own components, they own part of the alerting, and so they're able to page fewer and fewer people.

Jessica: At some point adding people would slow things down.

Fred: Yeah, absolutely.

It's one of the things that in a growing company the Incident Commander ends up doing, which is you can call for more people to join and lend a hand but there's also the opportunity of saying, "We have enough people, please step back."

Especially if you know or you feel that the incident is going to be a long one, it's going to be like 10, 12 hours, maybe more.

Then you can start thinking about, "You know what? Go rest because if this gets to be longer we'll need people to take over the next shift or something."

Charity: Right, we're going to need you again.

Jessica: There's some degree of doing things, but really you need the pool of knowledge that's going to let you figure out what's going on.

Fred: Right.

Charity: What's the trade off between you've got people who are experts in their area of stuff, but you also don't want to call those same experts over and over again? Right?

How do you balance between letting the people who are on call take a swing, maybe fail a couple of times or run it a while before you escalate? How do you decide when to escalate?

Fred: I think this is one of the strong attributes of an Incident Commander, is knowing who knows what or has a decent idea about, "My own knowledge ends there, but this or that person or these people know about it and I can call upon them to help me in such a situation."

So there's this aspect of management of that in there, but it helps to have also the longer term view of the incidents.

There's a lot of people talking about things like Hero Culture, where the same people answer the incidents all the time, and if you validate that and give it a lot of value, you end up with that concentration of the same people dealing with all the incidents, and having all the knowledge to deal with them, burning out.

Jessica: Right. In the Phoenix Project this is Brent.

I have a blog post about this where it's the Purple Developer who knows everything, and looks like a 10X developer, but it's really just their knowledge.

Charity, I love that you mentioned the people on call probably don't have the knowledge, and that's so important because as a developer I feel like I don't want to be on call until I know what to do in all situations.

But the thing is, you don't get there without being on call.

Charity: Even the experts are not usually in a place where they know what's going on.

They're just better at using their tools to figure it out because they figured out similar problems before.

Fred: Right. Everything is a map that's hyper connected and you never have the ability to know what makes sense.

Incidents and things like chaos engineering and these kinds of things let you highlight the portions of the web of connections that are actually significant and useful in these incidents.

So the incidents are one of the best ways to learn about how the system works because they tell you how it breaks down, and the way we have mental models is that we're very, very happy to have an incomplete and inaccurate mental model so long as the decisions we make with it are good.

The incidents are one of the best parts where we figure out, "Oh, my mental model is not right at all."

It ends up being a collective opportunity to repair that stuff that's happening.

So I feel that, yeah, an Incident Commander has to have an understanding about that, who knows what, who's been answering the same thing all the time, and sometimes in the people management you might do, especially if it's not the most important incident in the world, you can let people take a stab at things and try it a few times.

But prepare your experts to take over if it doesn't go super well.

Jessica: Or maybe say, "Hey, person on call who doesn't have a ton of experience. You go ask this person and let them build that relationship and build that knowledge."

And yes, that takes longer than you as the IC asking them, but if the world isn't completely on fire, it's totally worth it for building future response.

Charity: What is the difference in how we respond to incidents that are external facing versus ones that are broken but users aren't able to tell yet?

Fred: There's two sorts of perspective on that. One of them is that the code does not know what is public or private.

It's broken and so the steps required to fix it technically shouldn't have a concern about whether it's public or not.

It should be the same sort of response, regardless of what happened.

In practice the bigger difference is going to be that it creates a stress and an importance in the situation that everyone is aware of.

Jessica: So the social side of the system is very aware?

Fred: Yeah. You can't make an abstraction out of that, things are on fire for a reason, and are we going to tell ourselves that it's just software, nobody is going to die?

It does sometimes happen that people die as a direct result of what happens, but we don't like that idea in general.

It helps to have a lower stress to say, "That's one of the incidents that's under the budget." Or something like that.

Jessica: Under the error budget?

Fred: Yeah. Under the error budget.

If you're using SLOs and you have an error budget, it's possible that you're being paged because it's burning fast, but you still have some capacity.

There are ways to relax and lower the stakes while still having to answer the incident the same way.
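
To make the error-budget idea concrete, here is a minimal sketch (not from the episode; the 99.9% target, 30-day window, and request counts are hypothetical) of how an SLO target turns into an error budget and how much of it an incident consumes:

```python
# Minimal sketch, not from the episode: the SLO target, window, and
# request counts below are hypothetical.
SLO_TARGET = 0.999   # fraction of events that should succeed
WINDOW_DAYS = 30     # rolling window the SLO is evaluated over

def error_budget(total_events: int) -> float:
    """Number of events allowed to fail in the window without breaking the SLO."""
    return total_events * (1 - SLO_TARGET)

def budget_remaining(total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still left (1.0 = untouched, 0.0 = spent)."""
    budget = error_budget(total_events)
    return max(0.0, 1 - bad_events / budget)

# Example: 10M requests so far in the window, 4,000 of them failed.
# The budget is roughly 10,000 failures, so about 60% of it remains: the
# incident pages, but it is still "under the error budget".
print(budget_remaining(10_000_000, 4_000))  # ~0.6
```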

The other perspective is one where incidents are a normal part of operating complex systems, in software or not.

For me that's one of the interesting perspectives to bring to incident command.

It's not an incident that is going to happen once and then never again, and I'm going to make sure that it never happens again, and that everyone is going to live happily ever after.

Jessica: Yeah, tell that to the firefighters.

Fred: Yeah. This incident is going to be followed by another one and another and another one, and so the way you manage this one if you have in mind that you're going to have other incidents, changes things from the panic of saying, "Oh crap, things are on fire, we messed up. We have to fix it and not lose face."

It's part of the cycle of things that are going to keep happening, and so that's where I think that perspective of managing a bit of who responds to what to share the experience around becomes really interesting.

This incident, you take a bit of a loss on it to take a bit more time because you know it's going to help you in the future as well.

Charity: Yeah, I also think about that when it comes to pairing. It takes longer to pair when you're an expert at something, it takes longer if you bring someone along with you.

But I think that often it's worth the time, because it's easy to tell someone what you did, but pairing with them, and maybe having them drive while you consult with them or something, is a much better way of making sure that they are prepared to handle it themselves the next time.

Fred: Yeah, absolutely.

Charity: Let's talk about severities.

Jessica: Dude, what's up with trying to take all these aspects of the consequences of an incident and cram them into one number?

Fred: Yeah, that's one of the things that I like tremendously about severities.

When there's a metric, there's usually a thing you measure because it's easy to measure, then there's the thing you're actually concerned with which is different.

It's the idea that we want nine 9s of uptime, or five 9s of uptime, or four 9s of uptime, because we assume that response time and being available correspond to user happiness or user satisfaction.

And, if it's not there, then your metric is not necessarily going to be useful.

The severity is playing that same role where the severity tells you how concerned you should be about the incident, the level of response you should have, how many departments, how many people should be involved with this, how serious are we going to be about the action items that might come up after that, how thorough are we going to be in the investigation.

That just gets all boiled down to that level one, two, three, four, five, which might make sense at the natural disaster level where you have to get the resources from five different government agencies or something like that.

Jessica: Oh, okay. So for like a fire or an earthquake, comparing the severity of one earthquake to another makes sense.

Fred: And even then that's different because none of them are going to be the same earthquake as the last one, right?

They're not going to be in the same place, none of the same things are going to break, the response is going to be different and so there's this vision where a lot of severity has to do with the impact and the other one where the severity has to do with the resources that you need to help solve this issue.

Both of them look into that number from one to five, or zero to five depending on where you're at, and they conflate all of them, and so you see the number and you try to have the appropriate reaction to it.

In most software environments, which are much smaller than natural disasters, the early part of the incident is spent discussing which severity this should be, because these definitions always vary and are not the same.

So personally, for the longest time possible I want to avoid having these severities because we can manage calling people into the incident or out of it without that.

Severity might be a bit interesting later on, to help communicate to people external to the response the level of importance we assign to some things that have been discovered during the incident.

But at the level where it's all people in a single company and roughly the same three or four timezones, I don't know that the severity is that useful as a tool.

Charity: Is it useful retrospectively? Analyzing like, "How well are we doing? How often are we exceeding our SLOs? Do we need to shift more engineering time away from product development towards reliability work?"

Fred: Yeah, I think so. This is the feedback loop, right?

The people who are on call and operating the system are the people living on the edge of all the decisions that have been made by the organization until that point.

Charity: I love that.

Fred: There's this idea that if you deploy more carefully or less often, you're going to have fewer incidents, that deploys are correlated with incidents.

And that's really a garbage idea, right? What the deployment does is put in production all the assumptions you had during development time and all the practices that you had, and now you find out about it.

The bug or the misunderstanding lives in the codebase whether you deploy it or not, and it lives in your people's heads whether they deploy it or not.

So the incidents, if you want them to be productive, have to be a source of learning about how we got into that situation.

It cannot be a thing about, "We need to be more careful about writing tests, we didn't have enough tests."

The question that's more interesting is what made us believe at the time that we had enough tests at this point?

And I recall at a previous job asking that question and people just telling me, "Well, we knew we were shipping garbage code this time around, but the time pressures to ship were just that strict."

So people in management can ask these developers to be more careful all the time, but they know that they are cutting corners because that's what the organization rewards.

Charity: Yeah, that's so true. Most people, their production systems are systems they've never understood, right?

And every day they ship more code to these systems that they don't understand, to these systems that they've never understood and it just accumulates like a fucking hairball.

I love the way you put that, "It's the accumulation of all the decisions that you've ever made."

And it's so true because as human beings when we get scared or we get nervous or we want to be careful, what do we do? We slow down, right?

We're like, "Okay, I'm going to slow down, get my belt, hold on tight."

But in software it's exactly the opposite because in software, speed is safety.

The continual swift turnaround of shipping smaller diffs but more often, having a very small interval between when you wrote it and when it's live and you're looking at it, that's what keeps you safe.

Slowing down is going to fuck you up.

Fred: It's the rate at which you get feedback.

The reason we tend to slow down is that there's too much information coming in too fast and we want to slow down to be able to perceive all of it.

In software, when you tend to go faster, you tend to get smaller magnitudes of feedback much more frequently, so you don't have to dig as much to understand what has been going on between them.

So the speed is about, I think, increasing the granularity of the feedback you get.

Jessica: Granularity, that's really good. Yeah, because it turns out not to be about going slower.

It means going smaller, going smaller helps us stay safer, and that happens to have faster deployments.

Sometimes it feels like slower feature development, on your laptop anyway.

You can get that feature done faster on your laptop in one big step.

But then I love how Fred points out that the deployment puts the assumptions into the real world, so putting off deployment is hoarding those assumptions and hiding everything that's wrong.

Charity: Yeah, yeah. And they rot, they go sour and then they rot very quickly.

Fred: Yeah, and the longer that happens, it's that you also build on these assumptions and so you just have more and more brittle foundations as you accumulate stuff on top of them.

Charity: And no matter how quickly you're shipping things, I think that if you don't go and look at it, if you aren't instrumenting it throughout your code, if you don't go look at it in production...

Because it's not just the code, it's the intersection of that code on top of that infra, at this point in time with these particular users, and if you don't close the loop by going and examining it, you don't actually know what you shipped because our systems are resilient to a lot of bugs and errors and problems, right?

They're never going to rise to the level of taking you down, until maybe some day they do.

Jessica: Some huge combination of them.

Charity: Right. Or they've just been festering forever and finally they tip over.

But most problems start out very small and they're there if you go and look for them, but a lot of people don't.

A lot of people don't have the tooling that would even let them do so if they wanted to.

Jessica: Or the time because that feature work, man, it must march forward.

Fred: Generally having this awareness within the organization, and by within the organization I mean within engineering but also the other departments, is one of the things that I think creates that psychological safety for people being on call.

It's going to be really, really hard as an Incident Commander or a person on call just to say, "Don't worry, it's an incident, it's normal," if everybody else through the rest of your organization is breathing down your neck asking like, "What the hell is wrong? Why did this mess up? We had something very important going on, please fix it. I don't care how, just fix it."

Having that broad understanding that these issues happen and are part of it, I think is one of the things that makes being on call a lot simpler and easier.

It's not a personal fault to be handling this, it's a service you give to other people around and it can be anyone's turn.

Having that perception just makes things a lot better and easier for everyone involved, it removes a lot of stress because you don't have the impression that you've been messing up by being there.

Jessica: So shared ownership of code, I imagine, helps with that because then you're troubleshooting as a service to the whole team, instead of a personal failure.

Fred: The concept of a personal failure, I hate to use that term because that's going to sound like... Really, it's a social construct that you have.

The failures we see are failures we built.

They're not there to be discovered, they're there for us to interpret, and so we can decide that someone fucked up and that's the reason, or the bug, and that's where you stop.

You can decide that that person was under these pressures and something happened.

The interpretation that you take, whether it is that they were put in an unsafe system, it was their fault, it was management's fault, or it's just how things tend to happen, is something that we choose as investigators and as an organization to validate as acceptable or not.

The flip side of that, the construction of risk, is the same thing.

What do we think is an acceptable risk, what do we consider dangerous or normal is also a thing that we collectively within an organization agree on as a definition.

And so if we tend to construct faults as being personal failures, there's very little you can do to bring psychological safety back.

You're in a situation where it's all your fault.

Jessica: And even if, as an organization, you're working hard to give that psychological safety and you don't put blame on people, there's still all that social conditioning that we have to somehow counteract.

Fred: Blame is always going to be there because there's this sense of a slight or an injustice or something.

It's a feeling you can't avoid, the same way you can't avoid feeling bad if it's a public incident more than when it's a private one.

It's always going to be there, and there's this idea of blame awareness about that.

There's also something I like to call Shadow Blamelessness, where you decide that we don't name anybody, but the fault still lies with someone fucking up somewhere.

It's just like you're not getting the retribution, but the blame is still on the person in the system without naming names.

Charity: All you got was anonymity, right?

Fred: Yeah. And so anonymity is like step one to avoiding having issues of people being mad at something like that.

But the opposite of that is do you have a better blameless or blame-aware culture when you are free to name people and you know that nothing bad's going to happen to them?

Jessica: Yeah. Or the other day I screwed up something at Honeycomb and somebody popped into Slack with, "This is screwed up."

And I was like, "That was my bad." And I had no fear about saying that because all the feedback was, "Let's do it differently in the future."

Fred: Yeah. It's one of the things that's tricky about an incident, is how do we make sure this never happens again?

And usually before you've even started with the incident review, people have taken the most critical action items and already put them into practice.

The proper question is how do we make sure that the next incident is going to be handled better, based on what we have learned in this one?

Charity: What are the good things about having incidents? Why should we value them?

Fred: Because they are always going to happen, and so it's one of the things where you have diverging forces within an organization, multiple priorities, and you have to make compromises and trade-offs in some of these decisions, and they're not always going to be right and things might be pulling in different directions.

So you have to value them as good opportunities to better introspect the way you work, the way things are happening in your organization, and to adjust and better adapt in the future.

They become that sort of opportunity and it's easy to say that, I think in the software world, because a lot of our incidents are low stakes.

It's harder to say like, "We should be thankful for earthquakes because they let us get better building codes."

That doesn't fly as easily.

Charity: Yeah. I also think of it in terms of we're doing things, we're making progress.

Jess, maybe you caused this because you did something, it was a change that needed to happen. We don't want people to feel afraid.

Jessica: Yeah. Yeah, and people were very much, "Keep doing stuff."

Charity: Yeah, keep doing stuff. If you aren't breaking things, to some extent then you probably aren't making as much change as you could be or you should, right?

I remember at Linden Lab whenever people would join the backend team, we would crown them their very first time they caused an outage.

We'd give them the Shrek ears. Some people were like, "That sounds mean."

But it's like, no, it was an opportunity for us all to learn about what was going on in the system.

We would praise people like, "Oh yeah, you're really one of us now you've made the backend go down," because people were so afraid of it that we leaned into it by celebrating it.

Like, "You're not really one of us until you've made this thing go down," because otherwise people were paralyzed.

They didn't want to break anything, they didn't want to be the cause of anything bad happening, and so I think reframing it in terms of just anyone who's doing anything real and hard is going to cause things to break, as just an inevitability, can be helpful.

Fred: Yeah. We've had super interesting discussions about picking the right OKRs, the objectives for quarters or something, and we had discussions about the incident count, and we ended up shying away from that for the usual reason that people under-declare things that might be happening.

But the other one is just wanting to take a more positive framing around this, which is you should have success factors that depend on the things you manage to do and not necessarily on the things that you wish hadn't happened.

So the question there is you don't necessarily prevent incidents.

How do we pick objectives that have more to do with improving what we believe is an inadequate response to incidents, and then learning from that?

That's a better objective than trying to prevent them from happening at all costs.

Jessica: A lot of small incidents is so much healthier than a few big ones.

Fred: Yeah. There's an idea that there's a good pacing. It's like exercise: if you do it too rarely, then you forget, you get rusty, you're not really good at it.

If it's all the time, always, then you get extremely tired and you need to rest, and you burn your people out.

So there is such a thing as a healthy pace for incidents where you have smallish or manageable incidents frequently enough, and if you don't have them enough naturally that's where things like chaos engineering becomes interesting.

You have to keep current with practice.

Jessica: Yeah. You mentioned that people really want to know, "How do we make sure this never happens again?"

But if you can ask instead, "How do we make sure stuff like this is less of a big deal? When this kind of thing happens again, how do we make it smaller?"

You can assuage whole categories of fear.

Fred: It's more productive as well. There's this idea with people doing phishing or social engineering for security tests where you start with the idea that you clicked the bad email, shame on you all the time.

And the more productive exercise in that one is you start with the assumption that someone's account has been compromised, now what?

How do you deal with that? It's going to happen sooner or later, you wish it wouldn't, but you want to be prepared when it actually does, and that's a much more constructive way to prepare for the incidents.

Prevention always makes sense, right? The ROI of a fire alarm is hard to evaluate, but it turns out to be really, really useful.

Jessica: Do you mean the alarm itself, the dinging thing?

Fred: Yeah, yeah, yeah. You don't know. How much money am I going to save by having a fire alarm?

I don't know, it depends on if you have a fire or something like that. It's uncountable.

Jessica: We'll just tell people to be more careful and then we won't have fires, and then we won't need these fire alarms.

Fred: Exactly right. So these practices have the assumption that you don't want a fire, but you want to be ready if it ever happens.

It's that sort of thing where you don't want it to happen but you have to be prepared anyway.

Charity: The concept of SLOs is great because I think a lot of people feel like the goal is always to reach 100%, right?

The goal is always like, "Be up all the time, et cetera, and if you exceed an SLO, well, great."

But we operate a little bit differently, right? What do we do with the budgets when we run over?

Fred: The beauty of the SLOs we have is that they're self selected, all right?

And for some of them, when we ran over the budget, the critical thing is really to have this conversation with your team at first, but at some point to bring it up higher in the organization.

We had that case recently at Honeycomb where, after a few weeks if not months of having SLOs that were really, really hard to meet, we had stopped responding to them as seriously as everything else.

We had perfectly good explanations for why they were burning faster. The aspirational value of it was not matching the sort of stresses we were having in the system for real, and in practice we were just disregarding them.

We had the on-call handoff weekly, discussing this with each other and saying, "This one's going to burn in two days, just reset it, put it into the SLO reset log where we track the explanations and all the burn rates that we've had, to have that sort of history of what we were doing in the past with the SLOs."

And at some point we just said, "You know what? There's a reason why we're not meeting these. Sometimes it has to do with the organization's priorities, there are stressors in different parts of the systems that we're addressing already. This one is going to keep burning."

And so we just got rid of the SLOs or relaxed them, and sent a bigger communication up the chain in the company, saying, "You know what? We are no longer meeting this. We are not actually going to be paged for that. We're going to think about how we model what we think is acceptable performance and quality and then come back to you, but we have to change a few things."

And that communication up the chain is one of the really, really interesting parts of that, having that discussion, following up on the SLO, saying, "Those are the objectives, we were meeting them. We're not meeting them anymore, and it cannot be resolved as part of being paged while on call. What do we do now?"
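
As a rough illustration of that "this one's going to burn in two days" projection, here is a hypothetical sketch (not Honeycomb's actual tooling; the numbers and function names are made up) of turning a burn rate into a time-to-exhaustion estimate:

```python
# Hypothetical sketch, not Honeycomb's tooling: project when an SLO's
# error budget will be exhausted at the current burn rate.
from datetime import timedelta

def burn_rate(budget_spent: float, window_elapsed: float) -> float:
    """Pace of budget spend relative to the ideal pace.
    budget_spent and window_elapsed are fractions of the whole budget/window;
    1.0 means exactly on track, >1.0 means burning too fast."""
    return budget_spent / window_elapsed

def time_to_exhaustion(budget_spent: float, window: timedelta,
                       window_elapsed: float) -> timedelta:
    """Projected time until the budget hits zero, assuming the pace holds."""
    rate = burn_rate(budget_spent, window_elapsed)  # budget fraction spent per full window
    remaining = 1 - budget_spent
    return window * (remaining / rate)

# Example: 80% of the budget is gone only 40% of the way into a 30-day window.
# At that pace the rest of the budget lasts about three more days.
print(time_to_exhaustion(0.8, timedelta(days=30), 0.4))  # ~3 days at this pace
```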

Jessica: So instead of repeating pages and resetting, you push this reality up the chain of, "Look, this is how things are. If you want it to be different, let's shift priorities."

Fred: Yeah, we need to talk about it.

And at some point it's one of those realities: I think engineers want to do good work, and we like to have more nines than can be necessary, and we like to have better uptime than what might be the actual objective of the organization you're in.

Getting the extra nine or going back to the nine you used to have might require, I don't know, seven figures of investment in people time but it's very, very possible that this is not actually a priority in your organization.

They want to increase revenue or user acquisition, or any other metric you want to choose. That might be a higher priority, and there is something really, really unhealthy for your engineering culture in trying to hit targets that don't align with what the organization actually wants, because you're just burning yourself out and you're never going to get the actual support you need on that.

So when you see that disconnect, for me it's super important to bring it back up the chain, because it might be that they had no idea you were no longer meeting the standards and there's an investment required for that, or it might be that they never cared to hit it at the level you wanted.

At which point, great, relax that, stop paging people for things that don't matter to them. It's that social conceptualization of risk again.

How do we consider this to be risky, good or bad?

There's an agreement to be had and it's different from just the engineer wanting to have the most solid system ever because that's what looks really, really cool online.

Jessica: Right, bragging rights.

Charity: Well, I think we're about out of time. But what would you recommend to people who don't have ICs, people who are in very reactive modes where it's just, you get paged, things are down, and you work all night and stuff? Where should people start?

Fred: The hardest part is to take a break and look at the things that we're seeing right now.

Is there a feeling of panic right now? Do I feel overwhelmed?

Are there things that stress me out that I don't know how to do? Is there a feeling that things are chaotic and everything like that?

And raising your hand and saying, "I'm noticing this, what are we going to do about that?"

And starting a conversation is usually step one.

We're really, really good at reading the room even though we might be remote engineers who are not reputed to be able to do that.

But there's this idea that if everyone seems to be on a death march, you're going to get on the death march as well, and it's very possible that just raising your hand is a great way to stop that.

And if you do it when you're burning out and nothing seems to be going all right, then that's a very, very good signal of whether the rest of your organization is perfectly happy to burn engineers nonstop.

Charity: Right, that is a very good point.

Your life, your mental health, your sleep are precious things, and if you're at a place where those things are being taken for granted, that's not good.

Fred: Yeah. People who are healthy either physically or mentally are, I assume, in a better position to respond to stressful situations like incidents.

So the long term health of your team is also the long term health of your response.

Charity: People often I think feel like the health and happiness of engineers and customers are in some way opposed to each other, like engineers should be burning themselves on the pyre or whatever it is to make customers happy.

But in my experience, in anything but the immediate short term, the health and happiness of engineers and customers almost always rises and falls in tandem with each other.

You can't have a long-term sustainable...

Jessica: Our job is to make decisions and we can make better decisions when we're okay.

Charity: Absolutely. You don't have happy customers when engineers are miserable or vice versa.

Fred: It's one of the places where we still have the benefit of generally not having the high stakes of people dying when things go down, because you don't have to make that decision of either I burn myself out or people die.

It's either I burn myself out or people miss their quarter or they're unhappy with the website.

It's a lot easier to make that decision and we should probably cherish that and treat it like we have that option. It's not life or death most of the time.

Charity: Completely agree.

Fred: It's great to have the ethics of thinking there could be dire consequences of doing a bad job, but yeah, we have that ability to prioritize ourselves without necessarily feeling terrible about it from an ethical standpoint.