Getting There
60 MIN

Ep. #7, The March 2023 Datadog Outage with Laura de Vesine

about the episode

In episode 7 of Getting There, Nora and Niall speak with Laura de Vesine of Datadog. Laura shares a unique perspective on the March 2023 Datadog outage, how the incident was handled internally, the resulting damage of the outage, and the many lessons learned.

transcript

Niall Murphy: Welcome, everyone, to the seventh podcast for Getting There. In addition to Nora and myself, I am pleased to announce that we have a special guest this time who is Laura de Vesine of Datadog, and we are examining the Datadog, essentially, global outage of early March in 2023. So I wanted to say, first of all, personal thanks to Laura for agreeing to come on the show and talk about this.

Obviously a thorny issue, the wound is still slightly open, although it's been mostly stitched back together by now, I'm sure. Thanks to Datadog and the company powers that be for agreeing to do this, because obviously this is not something that everyone does, and we very much value that transparency. So thanks to all involved who came to that decision.

Also, I suppose, to continue my usual rant, thanks to the people who do this kind of work for a living or as part of a wider job because it is often thankless and difficult. Anyway, we will get into that more as we go on. Nora, I wonder if it might be possible to explain a little bit about what Datadog is and why people use it and why they care and so on?

Nora Jones: Yeah. Datadog has been around since 2010. It is very interesting to be talking about the Datadog incident on this podcast because the primary users of Datadog sit within our world, and so I think we have a special interest in the incident reviews and reports that come out of a company where their primary user is SRE folks, observability specialists, folks that are likely writing post incident reports for their companies.

So Datadog is an observability service for cloud scale applications, providing monitoring of servers, databases, tools and services through a SaaS-based data analytics platform. They've been pervasive throughout the industry; folks that operate in our worlds, everyone knows who Datadog is, everyone has probably used it at some point.

Laura, I would love if you could kick us off with a little bit of a background on the company. One thing to add from Niall's intro is Laura was actually the first incident responder on the scene, so we'll get into the details of the incident in a second, but, Laura, could you give me a little bit of background on Datadog, for folks that are not working at the company and might not know it?

Laura de Vesine: Yeah, absolutely. I think particularly for this incident a couple of things are really relevant. We run a whole suite of products, so it's all one integrated platform and you see it through a single pane of glass. But of course it is a number of different products that track your more traditional metrics, your logs, we've got traces, we've got a bunch of profiling information, we've got security products. There's a whole suite of products involved in Datadog, and we run those in isolated regions.

We've taken on that same model that the cloud providers have, that we run isolated regions and they are fully isolated from each other, so we're running them on cloud provider regions that are isolated. We're running on multiple cloud providers to give those, and they are totally separate stacks. They don't share any infrastructure, they don't share any data storage, and that will become relevant to this incident pretty quickly. I think those are the most important pieces to know about the internals.

Nora: Absolutely. Datadog did not start that way, with this whole suite of product offerings, right? It just started as a single product offering and then expanded from there, is that right?

Laura: Yeah, exactly. As a lot of companies do, it started with one product and then there was a lot of market for it, and then it's built some new things and it's acquired some things as it's grown.

Nora: Nice, awesome. Niall, do you want to give us a little bit of background about the incident before we dive into it? Folks that are listening, Laura is our first guest on this podcast and we hope to actually have many more guests come in and talk about their incidents with us: what they experienced, what they learned, what they felt. But I just want to give a little bit of a background on how we'll be chatting with Laura today.

We're not going to be doing a standard interview where we're bombarding her with questions and not giving her a lot of time to reflect. We're going to be doing a cognitive style interview that's going to encourage her to recreate what was happening for her and for her colleagues during the time of the incident, using some retrieval cues to trigger memory, and Laura is not sitting with all the details of the incident in front of her face right now.

I'm sure she's done a million other things since the incident, and so a lot of the questions that we're going to be asking and the ways we're going to be asking them are meant to help her remember what happened and give us some of the details about how things unfolded, how folks coordinated, what was hard, what felt good, things like that. So with that being said, Niall, could you give us a little bit of background on the incident?

Niall: Yeah, I can talk a little bit about this, and probably less convincingly than Laura can, but we'll get into those details in a moment. I suppose the things that I can say are primarily the things that we could say from the outside world, like, "What can we tell about Datadog, given we are looking at it from very far away through big telescopes or whatever?" So I suppose there's a couple of things we could say.

First one is that it appears to have been a widely notable outage, like a lot of people knew about this one and it's interesting because it even made its way into, for example, there's a chap who does a newsletter called The Pragmatic Engineer, Gergely I think is his first name. He talked about the Datadog outage a fair bit because it impacted so many of the people that he tends to talk to in his network right now.

He somehow obtained access to some kind of post mortem or post incident report before it actually was fully publicized, so he did a whole piece on what he thought happened and the bits where he has some evidential things to back up what he says are apparently cool, and the bits where he's guessing are apparently less cool. But it's been interesting that the impact of the incident has been so broadly felt throughout the industry that there's this much attention to it.

I suppose if we're looking for independent metrics, objective metrics as to how much this impact cost in the broadest sense, I think there has been a public statement on a call by one of your C-suite that it's $5 million or thereabouts. So that is notable for two reasons. First of all, okay, it's a large number. I think I've been involved in production incidents which were larger in terms of loss.

So it's a large number but it's not unprecedented in the industry at all. But the second piece is that we don't often get in the industry widely shared numbers which are related to incident impact, and that is in and of itself notable. So again, kudos to Datadog for talking about that. There's obviously a bunch of indirect impacts as well because of not just the unfortunate reality of customer churn and so on and so forth, but also behind the scenes there's a bunch of people who were, both on the Datadog side and on the non Datadog side, saying, "This couldn't possibly fail."

Or slightly more accurately, "This will fail someday but probably not tomorrow." And now this work is increasing in importance and there's a lot of scurrying going on behind the scenes in order to manage this risk. And again, something that we can only see from the outside is that when you go to the Datadog website to look for new jobs that have been created since the 8th of March, you see a list of jobs which are around risk management and so on and so forth. More SREs, that's clearly what we need.

Laura: We did have some of those before.

Niall: Yes. "We had some before, we have some more now or shortly, but we definitely had some of them before." So I think without beating around the bush or being disrespectful or whatever, there's clearly some kind of reputational damage, there's clearly some kind of financial damage, et cetera. It's obviously a terrible situation to be thrust into. But that's what we know publicly.

Nora: One thing that I think there was a lot of comments on is that the incident happened on March 8th, but there was not a public incident review fully published until May 16th, and I want to briefly comment on that because when there's a huge incident that happens in a company, it is so hard and so time consuming to properly investigate and know all the details of what happened.

When you're being pressured publicly to produce something in a short period of time, it actually limits the efforts that the engineers can put in to learn themselves, and so John Allspaw has actually written a lot about the multiple audiences of incident reviews and how they can sometimes contradict each other.

But if you're telling the engineers to write and learn about something in a way that is going to make the customers feel confident and happy, they're not going to be writing and learning about it in all the details that they can use in order to gain more expertise and perform their job better and perform future hires better.

So I just want to comment that I don't think it's necessarily bad when a company takes a couple months to publish a public facing incident review. I actually think it's good because it gives those employees time and space.

Laura: Yeah. You're exactly right about the reasons for the delay in posting something more publicly. I do want to point out, and this is where the stuff from The Pragmatic Engineer came from as well, that we did have an incident report that we shared with customers, actually even during the incident, while it was still ongoing. Then we wrote up some additional things and shared that with, essentially, anyone we had an ongoing business relationship with. So if you were a customer, any customer, not customers of a particular size or anything like that.

If you were talking to a salesperson and you asked, we would send you essentially what's already up on the website. But what we were doing is really exactly what you've implied, we were taking the time to sit down with engineers and not so much what happened, we had largely that put together, but how could we string this together into a single, coherent timeline? We eventually gave up on that, to be perfectly honest, there's just too much.

But also really taking the time to digest what we learned, it's really easy in an incident with this much impact to expect that it needs to be, "You've got to redesign all your systems to handle this exact problem." And so taking the time to take a step back and say, "What do we really need to do to make our systems better? And what lessons are valuable to people outside of Datadog, and what can we include in that?"

So there is obviously the write up that's up on the website right this second. Later today or possibly tomorrow, depending on the editing process, we've got a series of deep dives, actually, that are going to be going up. The first one is going to be on the very specific details of exactly how Kubernetes and Ubuntu broke. There's like 12 iptables in the article, I don't do networking.

But there's a bunch of very, very specific details of what exactly we understand to have broken for us. I'm actually working on a write up, kind of in the same vein as what we're going to be talking about: how do we respond as a company. Then we're going to have at least one about our internal product systems and how that specific one broke and how we fixed it, because there's just so much that there's no way to cover it in a single blog post.

Nora: Yeah, that makes a lot of sense. I love that. I am very much looking forward to that. I'm going to briefly read the first couple paragraphs of the public incident report that was shared last week, not the one that was necessarily shared with customers. It says, "Starting on March 8th, 2023, at 06:03 UTC, we experienced an outage that affected US-1, EU-1, US-3, US-4 and US-5 Datadog regions across all services. When the incident started, users couldn't access the platform or various Datadog services via the browser or APIs, and monitors were unavailable and not alerting.

Data ingestion for various services was also impacted at the beginning of the outage." It mentions that you were first alerted to the issue by your internal monitoring, three minutes after the trigger of the first faulty upgrade, which happened at 06:00 UTC, and that you declared a high severity incident 18 minutes into the investigation. I have a couple of questions there. I think my first question, Laura, is how did you get involved? It mentions that you were first alerted to the issue by internal monitoring. Can you tell me a little bit about your story and what happened in those beginning moments of the incident?

Laura: Yeah. Absolutely. So like any responsible engineers, most of our monitoring is built on Datadog. But as responsible engineers, we do not monitor Datadog exclusively using the same infrastructure stack as Datadog, so the piece of monitoring that actually went off for this incident is our out of band, shares no infrastructure monitoring. That team got alerted, got online, it was about one in the morning for them.

That team got alerted, got online, very quickly saw this was a substantial impact, escalated it to a severity that automatically pages our incident response, on call rotation. We have our engineers on call in general, as a You Build It, You Run It kind of system. The team that happens to own our out of band monitoring is also the team that owns our monitoring in general so they own alerting in particular. But when incidents are large, we've got a rotation of more senior engineers who get pulled in to do incident command for those.

So that's the rotation that I'm on. Because our tooling was impacted, it took them around 10 minutes to open an incident and escalate it to me, and then because it was 1:00 in the morning and I had literally just gone to bed, it took me a couple of minutes to get online. I remember this very, very vividly. I got online, the first thing in the Slack channel is alerting seems to be down in all sites, followed by, "I think this might be a SEV-1." My response was, "This is a SEV-1! Let me get a status page up."

Nora: Wakes up, "Yeah, this is a SEV-1."

Laura: Yeah, exactly. Because that's the first thing that you can do, but at that moment all that I knew was our alerting is down in all sites.

Nora: On all sites? Is that what you said?

Laura: Yeah, on all our sites. I think we call them sites, externally.

Nora: Okay. And what do sites mean? Can you describe that a little bit more?

Laura: Yeah, so sites are those isolated Datadog regions. That's what that US-1, EU-1, US-3, that whole list is. Each of those runs on a cloud provider region and is a totally isolated stack of Datadog.

Nora: Okay. What was that like to wake up to? So it's 1:00 in the morning?

Laura: Well, I didn't wake up. I'm a late night person so I had literally just gone to bed. But, yeah, it was surprising. I think for me, a lot of my first response to getting a page is, "Oh no, I have to go deal with this. What a pain. I was about to go to sleep." It's more of an annoyance, than even a stress response. It's a, "This is very irritating, I had other things to do." So that was my first response, and then once I was in the incident, I was very focused just on what needed to happen. It was, "This is obviously a pretty big incident, I need to get some kind of notification out to customers, I need to get other people involved in the response."

Nora: Okay. So those were all your responsibilities in that moment, getting the status page up, getting awareness on what everyone was doing, what had happened so far, getting responses out to customers. Then are you also orchestrating the folks that are responding?

Laura: Somewhat. So I become the engineering decision maker, like any typical engineering incident commander. What is our priority? What do we need to do next? All of those kinds of things, but obviously or possibly obviously, I don't know how to fix it. I don't even know what's broken at this stage, and I'm not really in charge of that, so much as just making sure people have the right priorities set for their engineering needs.

Nora: Okay. And how many people were in the incident channel when you joined? Were you using Slack to coordinate?

Laura: Yeah, so we mostly use Slack and Zoom to coordinate. When I joined it, I think there were three people in the channel.

Nora: Okay. And what were the roles of everyone? So you were on call and then what were the other-

Laura: Yeah, so I was the on call incident commander. At the moment that I joined it, we had somebody there who was just the person who got paged by the automation, and I believe we had... I'm pretty sure we had somebody from our customer support team at that point.

Nora: Okay. So I'm reading in the incident report, "Our incident response team had an emergency operations center of 10 senior leaders, 70 local incident commanders and a pool of 450 to 750 incident responders active throughout the incident." How many employees does Datadog have? What percentage of employees were involved in this incident?

Laura: Yeah. So that responder list is just the engineers list, it doesn't include support.

Nora: Oh, so it's even bigger?

Laura: We employ around 2,000 engineers. I don't think the number is totally public, and I don't know it anyway. But it's around 2,000 engineers.

Niall: Is that, kind of, nine senior leaders too many for coordination? Or what was the toe stepping?

Laura: So there wasn't toe stepping because what there actually was... No, that's a totally reasonable question. The reason that there's so many people involved is partially just the length of the incident. Because we were on active response, this incident took us a very long time to repair and recover from, we were on active response for almost 48 hours, and most of that nine senior leaders and 70 people, that's really expressing the fact that we handed off throughout.

So most of the time during the incident, we had two or three people on at any given time as engineering leadership, so in that incident command role for what are we trying to resolve at this instant? Keeping on top of getting status updates out to the best of our ability, things along those lines. Then typically one or two actual senior executives at any given time who are primarily engaged in talking to customers.

Nora: I just love that you all put these bits of coordination in your post incident doc. I feel like I don't see that enough in the tech industry. When an incident report gets published and a bunch of folks want to know about it, I am sure you all felt the pressure on Twitter from blog posts, to get something out to explain in all details what happened. That's what I feel like people think they want to know, but there are parts of it like this that there was a lot of coordination efforts involved. It was global for customers, but it was also global for employees, so you must've been working with other incident commanders across a lot of different time zones as well. Is that correct?

Laura: Yeah. Although we are global, we are mostly in either the New York time zone or in Paris time, plus or minus an hour so it wasn't as many time zones as you might think. We don't really have engineers who are in an APAC time zone. We do have a few engineers on the West Coast, but it's less common. But, yes, there were a lot of people to work with.

Nora: Was there only one Slack channel going for this incident? You mentioned you primarily used Slack and Zoom. Did you have a Slack channel and Zoom for the separate threads going on? How did that communication process escalate?

Laura: Yeah, absolutely. We didn't even have a Zoom when I joined. That was one of the first things that I did, was make a Zoom for the incident. We always open a Slack channel automatically for every incident, so there's a Slack channel just for the incident. The Datadog incident app is actually how we coordinate that.

Then we were in the middle of dog fooding a product that I think will be out soon for customers within the incident app, which is around work stream management that allows you to designate a work stream for an incident and opens a Slack channel that is specific to that work stream and coordinates with the broader incident. It sure was useful. We've had a couple of different people try to count the Slack channels that were involved in the incident. I think our official count is 73.

Nora: Wow.

Niall: That's a lot.

Laura: The main incident Slack channel had, I think at peak, around 1,300 people in it.

Nora: And so it looks like about 450 to 750 of those were actively responding. And were the rest-

Laura: Yeah, they were engineers trying to fix it. That's obviously an estimate, right? You can just tell from the range of it. How do you say somebody is an engineer who was actively responding, versus somebody who wandered in to look, versus somebody from support? It's hard to make those specific distinctions, so our best guess is about that many engineers.

Niall: So speaking for myself, reflecting on my personal experience in analogous situations, I'm going, "1,300 people in a Slack channel. Sounds like it's not necessarily a recipe for quick convergence to a set of contributing factors or a common understanding of what's at stake, et cetera." How was your experience with that?

Laura: So obviously this didn't grow immediately. We didn't have 1,300 people online at 1:00 in the morning. The very first thing that happened was I paged in a couple of other teams, I started working with the people who had been initially paged and we tried to get a sense of what is the impact. Is it just our alerting that's down, or are we seeing other symptoms? Because of course the reason that alerting is usually down, is that other things are affected.

And so we got as quickly as we could, a quick assessment of what was broken and tried to figure out what we would get posted for customers right away, and then started basically theory crafting. "Here's the set of priorities of what we need to do. We need to get a better sense of which specific things are impacted, beyond what's immediately obvious. We also need to figure out what just happened?"

Everything just went down, what just happened? That's the 1:30 in the morning experience.

Really, the honest answer is that it was on me to set those priorities, and to coordinate them around that. In general, we train all of our engineers on incident response procedures and good incident management, and so all of our engineers know that there's an incident commander in the room and it's their job to set priorities.

It is not up for debate, right? If you have concerns, if you have things that you need to raise, if you have blockers, please do all of that. But if you disagree with the direction of something but think that there's nothing that can go terribly wrong, you just wish you could do it a different way, that's not a conversation we get to have right now.

Nora: So, Laura, you mentioned that you first pulled a couple of teams in. Who were the couple teams that you pulled in?

Laura: Yeah, so really there were two initial priorities. One is how down are we really? Just how extensive is the impact, and the other is what broke? So in terms of how down are we really, the question becomes we can see that the website is not loading in at least some sites, we know that our alerting is down, we know the dashboards aren't loading. So that's all pretty obvious, we can assess that as any engineer.

The question that becomes really relevant here is: is our intake impacted? We don't know, we can't see, and the monitoring that we would use to determine that is currently down. And so I paged in the teams that were able to do that assessment, or in theory able to do that assessment. Although, what we found was they couldn't really tell us how intake might or might not be impacted until they had a little bit more of an idea of what the cause was.

Then really it was almost an exercise in suspension of disbelief. I've mentioned a couple of times, we run fully isolated stacks. There is no shared infrastructure. Everything went down all at once. Well, that's not supposed to be possible. And so which teams might this be? We paged in our networking team because maybe there's this networking thing we don't understand.

We paged in our compute resources team, the team that runs our Kubernetes nodes because some of the symptoms that we were seeing seemed maybe Kubernetes nodes related. We actually paged in the team that runs the web UI, not because we thought that the problem was only web UI, but because they'd made the last configuration change and we thought that maybe our configurations had responded globally in some way.

Niall: Almost surprised you didn't page in the DNS team.

Laura: We did, we did. That's the networking team.

Niall: Oh right, okay. Fair enough. So you have a struggle, a kind of a cognitive modeling struggle where you're going, "Could it possibly be this bad? Oh yeah, actually it could be." And you're going through various different stages of establishing, "Oh yes, we haven't hit the bottom yet on this." So at some point you actually do bottom out and you go, "Oh yeah, actually this is a very serious global or semi global thing." What happens then?

Laura: So in the official timeline I put in that we officially said it was serious and very global at about 1:30 my time, so we knew it was bad pretty quickly. In terms of how bad, it really took us almost two hours, I think, to understand that what had happened was an impact to our actual computing infrastructure, the Kubernetes nodes that we run, and that it was a single point impact.

It wasn't an ongoing problem, it was that something had gone wrong and now it had stopped going wrong, and we just had the mess to clean up. I'm going to say it was around 3:30 in the morning that we came to that realization that that was what had happened, and that now we just needed to repair all of that. So that was where that process happened.

I don't know, I was honestly so busy I never really hit a rock bottom, "Oh no, this is really, really bad," kind of moment. It was all so much. Yeah, it was so much shoveling of let's get people involved, let's make sure we're on the same page, people keep joining the incident response and need to be given a summary of what's going on. I pulled in several other responders from our major incident rotation to start taking on some of that work.

There was a VP on the call pretty quickly, starting around about 1:40 or so, who was also helping to coordinate some of that, helping to figure out what kind of messaging we wanted to get out on the status page. Were we prioritizing all the right things in terms of questioning our impact, all of that. And so once we'd gotten to the point where we understood that it was a point in time impact to our Kubernetes infrastructure, we get to what do we repair first?

Because again, when you run these totally isolated stacks, yes, you shouldn't have global events take them out but it also means you can't repair them globally. They have to be repaired one at a time. Where do we start? We made the call to start with our EU stack because it was starting to be morning there, that seemed like the most sensible place to start. In retrospect, that may or may not have been our best decision.

Nora: Yeah, tell me about how morning played a role in deciding to start there.

Laura: Yeah. So obviously Datadog is useful to customers in two broad ways. One way is that you get your alerting from Datadog if something is broken, and that matters regardless of what time of day it is. But wouldn't you like to look at your dashboards while you work or while you release things or whatever? And so if our customers are actually in business hours, it matters more to them that Datadog is down in a lot of cases, and so we wanted to repair things for the customers for whom it was business hours first.

It turned out, and again this is something that impacted us a bunch during those first few hours, we run these regions on different cloud providers. These sites are different cloud providers, and different cloud providers actually responded to the initial impact differently, and so we believed that our EU site was more impacted early on because of the cloud provider response, than our US site. It was more obviously broken, but faster to repair, and I think it was very hard to keep track of the fact that there were different impacts in different locations early on and so we really focused on that EU site and didn't realize what the problems were in our US site until somewhat later in the incident.

Nora: Was the initial assumption that it had the same impacts?

Laura: Yeah. I think the initial assumption was that everything would be broken in the same way. It all started at the same time, so we would expect that it was all the same problem. Indeed, it was the same underlying root cause, so that we're not beating around this. The underlying root cause of the problem is that in version 22.04 of Ubuntu there was a change made to systemd-networkd, where when it is restarted it deletes iptables rules that it doesn't know about.

We use Cilium to manage our Kubernetes network connectivity, and when Cilium installs itself on a Kubernetes node it rewrites the iptables rules in order to allow for routing to pods. An automated security update was made by Ubuntu that in no way was a problem in and of itself, but because the security update was to systemd, it restarted systemd. That caused the iptables rules that Cilium had put in to be deleted, and because of the specifics of how Cilium makes those iptables rules, it eliminated network connectivity, all network connectivity, to the host.

Nora: So it deletes iptables rules that it doesn't know about. It was Cilium, you said? How is it communicating back with what the new rules are? Was it a particular timing thing?

Laura: Right. So systemd doesn't communicate back the new rules, systemd is part of the actual Ubuntu. What happens is when you create a node, it starts with the default Ubuntu rules for IP routing and then when we install Cilium, in order to actually integrate it as part of the Kubernetes instance and run pods on it, Cilium deletes some of those rules and writes new ones, in order to manage the routing. When systemd then restarts, it doesn't recognize Cilium's rules and so it deletes all of those rules and you wind up with no networking to your host.
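
To picture the shape of that failure, here is a minimal sketch in Python pseudocode. It is not Cilium or systemd-networkd code, and the rule names are invented; it only mirrors the "delete any rules you don't recognize" behavior Laura describes.

    # Conceptual sketch only, not Cilium or systemd-networkd internals.
    # The rule names are made up; the set logic mirrors the description above.

    host_rules = {"default-input", "default-forward"}   # stock rules at node creation
    rules_networkd_knows = set(host_rules)              # networkd only tracks its own rules

    # Cilium joins the node to Kubernetes and rewrites routing rules for pod traffic.
    host_rules -= {"default-forward"}
    host_rules |= {"cilium-pod-routing", "cilium-masquerade"}

    # The security update restarts systemd; on restart, rules that networkd does not
    # recognize are deleted, which is exactly the set Cilium installed.
    host_rules &= rules_networkd_knows

    print(host_rules)  # {'default-input'}: pod routing is gone, the host drops off the network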

Niall: Which is generally acknowledged to be a bad thing. So there's a similar outage I experienced in my past in a previous employer where the database that stores the giant list of permissions for everyone to edit their data is updated by this script and the script wakes up, looks at the delta in the new permissions it's supposed to apply and just goes through the host going, "Okay, drop these, these, et cetera."

Except it turns out that the delta you need to apply is larger than the buffer that the database can handle, so you do the drop and, okay, no one has permissions now, and then you try and apply the new ones and you segfault in the middle of the application, and then you try and make your connection back again. Except you've just dropped your own permissions to add yourself, so you can't have... So rolling fleet death is very much the theme here.

Laura: Yeah, absolutely. In this case nothing is even trying to add those back in, but that's fundamentally what happened. We had these default Ubuntu security updates on, which certainly none of us in the initial call realized we had. Obviously folks knew about them, we knew that we had them, we'd been running them since 2010, they'd never presented a risk before. They're not how we typically do security updates, we do actual long term node management and all of those things, but we still had these ones on.

Niall: That is also an inherently interesting question as well, going back to the theory of the crime piece. You've realized it's a large problem, you've realized it's something to do with Kubernetes. So how did you connect that? How did you end up connecting that with the updates?

Laura: So we were almost 10 hours into the incident when somebody finally found that that was the problem. We were very clear pretty early on, like I said, comparatively early. Any time it's like two hours before you even have a way forward, that feels terrible. But we did understand relatively early on that something took down a bunch of our Kubernetes nodes and now we need to fix it.

We didn't worry about what had taken them down for a while, but one of the folks on our compute team, after some of that initial repair work had been done, went and started digging through logs of impacted nodes and found that an apt update had been run at the culprit time and said, "Hey, we run that every night." And then eventually found that that was what was happening.

Nora: Good sleuthing. How many theories were going around before that person brought that up 10 hours in? What were other people looking at?

Laura: Yeah. I think we were honestly so baffled that we didn't have a lot of theories. There were some short lived theories around it being something besides our Kubernetes infrastructure, definitely I thought for a while that perhaps we had seen some kind of cascading failure because we saw a lot of nodes trying to restart and not successfully doing so, or maybe not successfully doing so. It was a little bit hard to tell.

Again, all of our monitoring was impacted by this incident. I theorized that maybe there was some kind of cascading failure and we needed to back off on a bunch of systems to see if we could get a recovery there, but that was really only live as a theory for 20 or 30 minutes.

Then we saw that nothing continued to kill new nodes, so that couldn't have been it. In this case, I think that it was so unexpected and so surprising that we had something that could do this that we didn't have a lot of theories going around about what it was without any evidence.

Niall: That makes a lot of sense, if I'm thinking about things that could cause total fleet death in this way. I'm running through networking, DNS, security, have there been cascading failures? It's a long-

Laura: Yeah, we absolutely did page in our security team, to be clear. Some time around 2:00, so around an hour in somebody pointed out, "We should probably get the security team involved," and then we did.

Niall: Absolutely. And in a funny way, it was security. Except not quite in a way we would be thinking of it in that sense. I mean, it's close to maybe not the bottom of the list, but it's certainly in the mid tier or lower tier that I'm going, "System updates," because of course in a sense one presumes that system updates like this are tested by the OS provider, are rolled out to other people, and if it's going to be a major issue it's caught there before it gets to you.

Laura: Absolutely. And it wasn't the system update itself, that was fine. It was this very specific interaction of the restart of systemd with the Cilium rule rewrite that burned us. So nothing about the system update itself was a problem.

Nora: You mentioned that these system updates have been on literally since 2010 and have never led to an issue. Do you know some of the history of that? How they got turned on? How they stayed on?

Laura: Yeah, they're on by default. There's a default Ubuntu configuration that downloads system updates in a randomly chosen 12 hour window for your hosts and then automatically runs them over the course of an hour, configured to 6:00 AM UTC. That's just the default Ubuntu configuration.
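
To make the timing concrete, a small Python sketch of why that default produces a simultaneous fleet-wide event rather than a staggered one. The 12-hour download window and the roughly one-hour upgrade run anchored at 06:00 UTC are taken from Laura's description; nothing else about Datadog's setup is assumed.

    # Rough illustration of the timing described above: package downloads are staggered
    # across a 12 hour window per host, but the upgrade run itself is anchored at
    # 06:00 UTC plus up to an hour, so every host applies the same update in the same hour.
    import random

    def host_schedule():
        download_at = random.uniform(0.0, 12.0)    # hours; harmlessly staggered per host
        apply_at = 6.0 + random.uniform(0.0, 1.0)  # hours UTC; fleet-wide window
        return download_at, apply_at

    fleet = [host_schedule() for _ in range(10_000)]
    print(f"downloads span {min(d for d, _ in fleet):.1f}h to {max(d for d, _ in fleet):.1f}h")
    print(f"upgrades span {min(a for _, a in fleet):.2f}h to {max(a for _, a in fleet):.2f}h UTC")
    # Every host lands in the 06:00 to 07:00 UTC window, which is why roughly half
    # the fleet could break at once when this particular update landed.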

Nora: Okay. So you just left the default on and then it was still working as expected until it wasn't, essentially?

Laura: Exactly, exactly. Back in 2010 when it was a very small company doing very small things, we left the default on and it had never caused us a problem as we grew, as we moved onto Kubernetes, as we expanded to multiple locations and multiple sites and multiple cloud providers. It had never caused a problem, so we had just left it as it was.

Niall: Narayan Desai, of Google now but many other things previously, says that it's unexpected correlation that gets you, and it's discovering that systems have some relationship or correlation that you weren't necessarily expecting. Sometimes it's as simple as, "Okay, at six o'clock UTC everyone is going to wake up and apply everything that's in that directory. Hope you're ready for the lot." Or sometimes it turns out that this system has a critical latency cliff when it goes over such and such degree of throughput, et cetera.

So I presume that you're doing the equivalent of find . -type f -exec grep iptables {} \; for everything now. What is the process for restoration? Because earlier on you were talking about how, "Well, it's one thing to find out how the crime happened, but what are you doing to remediate it?" And you also spoke about the role of the cloud providers in this, and I think there's some nuance here to how that plays out. Can you talk about that a bit?

Laura: Yeah, so there's actually quite a bit of nuance because we use a slightly different version of Cilium for different cloud providers, which also had some impact, although relatively minor in terms of the response above our compute layer. The cloud providers, there is a substantial difference. So on GCP and Azure, when a host stops responding you can configure health checks that will restart the host automatically for you but you don't have to, and we don't have them configured.

So on GCP and Azure when a host stops responding, they let it sit there dead, and that meant that we had hosts that were not responding at all on GCP and Azure, which meant that you couldn't load the website and our intake was substantially more impacted for that period of time because we had... This ultimately affected about half our fleet, there was some variability between different locations and cloud providers and all of that, but this affected around half our fleet.

But they just let the dead node sit there, and so we only had to restart all of the dead nodes in a coordinated order because of how we specifically manage our Kubernetes fleet. Then once we restarted them we had to successfully get them to rejoin actual networking and all of those pieces, and then scale up adequately to process our backlog.

Which was substantial, but a well known set of responses from our teams. On AWS, auto scaling groups detect that the hosts are unhealthy and automatically terminate and recreate them. This is a problem if you run local disks, which we do in a substantial number of our data stores. We treat them as cache, they can be recreated, but of course if you do this to half of our fleet at once we can't really recreate our caches quite that quickly.

Just in general, we had a number of places where these things are cached or they're consensus based data systems of various kinds, Zookeeper, Kafka, various other forms of those things, Cassandra is the other big one. Those need some level of management to bring back, rather than just coming back automatically when you've just lost their data. So really the story of recovery is, at scale, across many products which are all now competing for resources, fix that.

Niall: So what does that involve? Does it involve scripts to reboot a bunch of stuff? Or turn down a cluster, turn up a cluster? Did you establish a safe throughput, or?

Laura: We went back and forth on safe throughput. In terms of restarting the actual clusters, we established a safe throughput that wasn't quite accurate. There's multiple steps to nodes rejoining clusters and creating pods on those clusters, and those steps include both the actual creation of the node and then also a bunch of calls to things like Vault to authenticate the node, a bunch of calls to cloud provider APIs in order to actually join the cluster and start receiving work and all of those things.

We thought we had a safe throughput, discovered we were wrong and had to back off at one point. We certainly were in touch with cloud providers throughout who were generally very responsive in terms of increasing rate limits for us and things, but that still had to be done, that was still a coordination activity. That's the infrastructure piece. Then there's a bunch of, well, your service has been damaged, what do you do with all of your live data and all of your backup restoration and things along those lines.
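
As an illustration of the kind of throttled, back-off-when-needed restart process described above, here is a hypothetical sketch; it is not Datadog's tooling, and the batch size, cooldown, and restart_node helper are all invented for the example.

    # Not Datadog's tooling: a hypothetical sketch of a throttled, batched node restart
    # where a human confirms each batch and can back off when cloud provider rate
    # limits or other symptoms show up.
    import time

    def restart_node(node: str) -> None:
        print(f"restarting {node} ...")      # stand-in for the real restart call

    def restart_in_batches(nodes, batch_size=20, cooldown_s=120):
        for i in range(0, len(nodes), batch_size):
            batch = nodes[i:i + batch_size]
            input(f"About to restart {len(batch)} nodes starting at {batch[0]}; press Enter to continue: ")
            for node in batch:
                restart_node(node)
            time.sleep(cooldown_s)           # pause so humans can watch for rate limits or new breakage

    restart_in_batches([f"node-{n}" for n in range(100)])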

Nora: How did you decide the initial safe throughput?

Laura: Yeah, my honest answer is our compute team went as fast as they could. They wrote some scripts and did some restarts on those scripts. Most of the throughput was limited by humans, rather than by machines. Under the circumstances, we really wanted human beings to be watching pretty much anything that was going on, especially since, again, our own monitoring was impacted.

We wanted things to be done with at least semi manual actions, and so the throughput limiter was human beings and we assumed that as fast as humans could go was a fine rate. We did run into some interesting places where throughput was limited by other things.

So like I said, there were cloud provider API limits that we ran up against and then needed to get increased. What else was some of the interesting stuff? We definitely sped up as we went along, the first cluster that we recovered went pretty slowly and then we wrote some scripts to improve that. Some of the tooling that we would normally use to do these kinds of operations at scale was also down because it runs on our infrastructure, so we had some workflow tooling that we would generally use to restore our various metadata databases from backup.

That tooling was impacted. So folks threw together some Bash scripts and went as quickly as they could, but that was also a big factor. In terms of recovering one of our metrics services, we initially had a Cassandra database that needed to be recovered and our starting ETA for recovery was 33 hours, which just has to do with the fact that we make topology changes to Cassandra serially and we had built in a wait of 10 minutes between those for safety in normal operations. But that meant that our normal operations safety margin made it 33 hours to restore this thing, which was not really an acceptable amount of time.
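
The 33-hour figure falls out of simple arithmetic; here is a back-of-envelope version, where the number of serial topology changes is inferred from the stated ETA rather than something Datadog has published.

    # Back-of-envelope for the 33 hour Cassandra ETA: topology changes are made one at
    # a time, each followed by a 10 minute safety wait from the normal operations
    # runbook. The change count here is inferred, not official.
    safety_wait_min = 10
    topology_changes = 200                         # roughly what a 33 hour ETA implies
    eta_hours = topology_changes * safety_wait_min / 60
    print(f"~{eta_hours:.0f} hours to restore")    # ~33 hours, which was not an acceptable recovery time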

Nora: How were you communicating all these efforts and what was going on with your customer support teams and the folks that were communicating frontlines with the customers? Because that's got to be difficult when you're trying things and they're not maybe quite going as planned, or you're understanding things and your understanding improves and evolves over time. So how was that communication being coordinated and run?

Laura: Yeah, this is a place we really want to improve. We don't think that that was to the standard that we expect of ourselves, that our customers expect of us, so really upfront we're not happy with how that went. The things that we did during the incident were obviously we had the public status page up which was largely managed actually through our direct incident commanders. It's very difficult on a public status page, it's global for all customers, right?

So it doesn't show specifics for any individual customer, and it's really difficult on that public status page to convey the nuance of, "This product is at this state of recovery and this product is at this state of recovery, and we have a better understanding of what's broken. But you're not going to see anything yet for a while. We don't know how long it's going to take, we've never done this before."

We can't really communicate all of that nuance on a public status page, so we had more than 1,000 tickets filed with our customer support. Any customer who's got a direct relationship with a customer success manager or a technical account manager, any of those kinds of folks, they reached out directly to those people, unsurprisingly. That meant that we had hundreds of customer support people asking some of the same questions, and a lot of them questions we couldn't answer or questions we couldn't answer to anybody's satisfaction.

We had a lot of customers saying, "Well, how will this affect me personally?" To which my answer is I have no idea. Or the same way that it does anybody else. We didn't coordinate that messaging very well. We certainly tried, we built an FAQ for our internal customer support folks to read from. As the incident went on, we built up a regular cadence where some of our execs would actually sit down with a bunch of these customer support leads and talk with them about where we were at in an engineering sense and how far along we were. But we certainly heard that our customers did not feel like they got the detail that they wanted, and it wasn't clear to all of our customers that we were really all hands on deck responding to this and making progress.

Niall: It's such a difficult question, right? Because having been in that situation before, one of the things you're asking yourself is, "Do I really communicate with the customers that we have now discarded the 35th thesis that we had about what the hell is going on with this thing? Or do we just repeat the same update, essentially, and move on and try and focus on the real thing that matters?"

Laura: Yeah. Or in fact do I communicate with customers, "Okay, we've restored our compute infrastructure and now we have another two layers of infrastructure to restore before you start to see any impact"? We don't really have language around that, and so how do we communicate that we're making progress in a way that's meaningful to customers and doesn't also sound like a cop out because to some degree, as a customer, if I heard that then that sounds to me like a cop out.

Honestly, one of the things that customers want from you in an outage this broad and this long is they want face time with someone they can yell at. We provided some of that, quite a bit of that. One of the primary things some of the executives involved were doing was getting on calls with customers, kind of just to let them know, "Yes, of course we take this really seriously, this is a big deal to us," because as a customer that's what you want to hear.

Nora: I've certainly had to go on a few apology tours before too, so I totally get that in previous roles that I've been in. It's incredibly important to just over communicate in those situations. But also a friend of mine is an incident commander at a pretty large company and they happen to live in a city where their customer support office was located, so they would go into that office a couple of days a week and work out of there just to have some face time with colleagues.

But what happened was their largest incident happened at the same time they were doing that, and so their customer support folks could overhear everything they were saying to orchestrate the incident without any of it being filtered to them. So they were almost doing exactly what you said, which was over communicating with their customers, details that they probably didn't need to know, details that were incorrect and then correct and then incorrect again.

So I bring all this up not to pass any judgment, because I think it is incredibly hard and it's incredibly hard when you're also trying to fix and figure out what's going on. I love how reflective you all have been about that too.

Laura: The answer here is we're going to automate a lot more of that kind of status reporting, and we're doing some training with our customer support organization around being organized, more organized in large scale incidents so that we don't have every customer facing person asking the same question or trying to interface with the broader engineering response. But rather having their own channel to go through a little bit better. We built that during this incident pretty aggressively, but making sure that we train that organization on incident response just like we've trained our engineering organization on it.

Niall: Yeah, I think it's interesting to note that added communication is, in the end, still communication, and as a result inherits the attributes of emotional labor, which is not something that is often even acknowledged or thought about in the corporate communications domain. It's very technical, very abstract, et cetera, et cetera, but actually a load of people just want, "Ah, I want my thing back," or the equivalent, right?

Laura: Yeah, believe me, we want to get you your thing back. I promise we're working as hard as we can. But of course, you just need to hear that, you need to hear that it's just as important to us or more as it is to you.

Nora: So one thing that was interesting that was noted was the importance of the data hierarchy with regards to an outage.

I think it's a good reminder that when everything is an incident, nothing is an incident. It is important to prioritize some of these things too, and it can be very hard to convince others of it internally when an incident happens.

But yeah, otherwise you're getting interrupted and playing whack-a-mole every day, which has a significant person cost to it. It looks like what happened in the incident is you ended up sacrificing other things that were working in order to get some of the live monitoring portions working. Can you talk about that decision process a little bit and what was learned live?

Laura: Yeah. It was less that we sacrificed things that were working. There was a tiny bit of that, but it was mostly that we really actively prioritized the live data once we realized that was a concern. Again, there was so much limitation to our bandwidth because there were so many things that needed compute resources all at once, there were so many things that needed human attention all at once that we just had to choose what do we fix first.

I don't know that we realized that we needed to do that level of triage for several hours. I think that I was not actually present by the time we realized we needed to do that level of triage because I signed off around 5:00 or 6:00 in the morning to get some sleep and then came back again at about 2:00 in the afternoon that day. So I think that realization that we needed to be really ruthlessly prioritizing happened while I was out.

But in general, the question is what do customers need first? And I don't know that we actually turned anything off, except for some very minor things. We have this Watchdog product that does a bunch of data analysis and tries to determine what patterns you're seeing in things, and that was a thing we turned off and downscaled pretty early on. We said that we weren't going to do backfill for that until other things had been backfilled, because it puts a lot of pressure on other systems when it does its own backfill. So we had done some of that, but it was less that we turned things off and more that we said, "Don't backfill this until this other piece is done."

Nora: Gotcha, that makes sense. My last question, and Niall brought this up a little bit at the beginning. We noticed there were a few new job postings after this incident on the Datadog site. I know you mentioned a few of them were there beforehand.

Laura: Yeah, I actually think it's mostly just quarterly planning that's caused the postings, not the incident.

Nora: It's a good opportunity for new hire training. I feel like incidents are the best form of onboarding.

Laura: Yeah, it's definitely going to go in the new hire training.

Nora: I'm curious how you plan to socialize this incident with new engineers at the company that weren't there internally as it unfolded.

Laura: Yeah, absolutely. We do actually plan to talk about it directly in new hire onboarding as a case study of how you think about resilient design. The biggest changes that we're making around the incident are more around how we design our systems to better gracefully degrade, which is going to be long term work. The obvious quick fix is we turned off automatic upgrades, that part's easy. Root cause solved.

But in terms of restoring service to customers more quickly and being able to recover from major, huge problems like this more quickly, most of that comes back to we need to be addressing the resilient design of our systems and the ways that they recover in degraded mode. So that's where a lot of our focus is, and so as part of new hire onboarding we're going to be talking to people about that and giving this incident as an example of it.

Internally, we've done a bunch of presenting to people who saw only their part of the incident who weren't involved at all with the larger story. Those are all recorded. My team has actually put together a little bit of a multimedia experience for people around going through that internally with some like, "How long do you have to learn about this? What are you really interested in?" And then it has some entry points for people. Definitely we've done some internal development around making sure that this becomes a thing that people really know about. Then like any good engineering team, especially any good SRE team, obviously there are stickers.

Niall: "I survived the 6:00 AM UTC cron job," stickers, or?

Laura: Something like that. I think the internal incident number and fire might be the theme.

Nora: I always say there's certain incident numbers I'm probably going to be rattling off when I'm an old lady. They kind of stick in my head. What did you title this incident internally?

Laura: We called it The Apptocalypse internally, for the fun name, because it's apt update. The incident number, I can just tell you, it's 19254.

Nora: You'll know it for years, I'm sure.

Laura: Yeah, we'll know it forever. Right. So we mostly just refer to it by number.

Nora: Gotcha. Well, I think we are at a great place to wrap up. Niall, did you have any other questions?

Niall: Yeah, I suppose, Laura, is there anything else about the future? I mean, we've talked about the past a lot and the future a little bit. But if there's anything else you wanted to surface about what Datadog is going to do, this would be an opportunity to do so.

Laura: Yeah, I don't really think so. Like I said, we've got redesigns and reworks and all of that, that in many cases are already underway around more resilient designs and have been accelerated. Some places that we found we did have some gaps that we want to address. As you might expect, we found some circular dependencies that we need to come back and deal with as part of recovering from this.

But honestly, there's the quick fix part of this which is just turning off automated upgrades and then there's write better systems which is a thing we were already doing. There's not really a lot else to do to respond, other than, again, fixing our customer communications, practicing our incident response. The stuff we were already doing.

Nora: Well, thank you so much for being on the show, Laura. It was really awesome to have you here.