JUN 28, 2022

40 MIN

Ep. #3, The October 2021 Roblox Outage

about the episode

In episode 3 of Getting There, Nora Jones and Niall Murphy unpack the Roblox outage of October 2021. Together they review the incident report, discuss the contributing factors and the users affected, and examine the attributes of Roblox’s business model that led to this 73-hour outage.

transcript

Nora Jones: All right, hi, everyone. We are back. The outage we're talking about today made a lot of headlines back in late October, and this was the October Roblox outage that happened around Halloween.

It was very notable, it was very spoken about, it was unique in duration itself and the fact that Chipotle was launching a big campaign around this outage as well made us really interested in digging more into it.

Also, one of the more interesting things was that Roblox the company published their public incident review back in January of 2022, which was a few months after the outage actually took place. I'll let Niall dig a little bit more into that.

Niall Murphy: Yeah, first thing I want to say folks for those listening who were involved in the outage and also other people, just wanted to say I am issuing a public apology, no more questions at this time. But I am issuing a public apology because I stated publicly that I did not believe Roblox would publish this post incident report and they did, so you comprehensively proved me wrong, which is excellent.

And even better, the post incident report is actually very good and there's quite a lot of interesting and motivating detail announced to support the credibility of not only the Roblox team, but what they're going to do in the future.

One of the other interesting things I wanted to surface from this report just before we get into the detail, is that they are also really good about pointing out where they get help from their major technical partner, HashiCorp, which is something you do not often see in a lot of other post incident reports. So apology, hopefully accepted, and not in a Darth Vader sense. But anyway, we'll move on.

Nora: I actually have to commend Roblox for publishing it later, I think a lot of tech companies, especially when it's an outage of this magnitude and it's talked about all over the world, they're in a rush to get it published sooner.

But I really commend Roblox for the amount of time that they took to really understand what led to this outage and what contributed to it, and I wish more companies across the tech industry would take a little bit more time on these things because I think it's actually really helpful as a reader of these documents.

If you are listening to this and you haven't read it, I highly recommend going through it. It's really well done and as Niall said, they collaborated with HashiCorp quite a bit too which was unique in nature.

Niall: Absolutely. Do you want to tell us a bit about the company itself then, Nora?

Nora: Sure thing, yeah. So Roblox was founded actually back in 2004, they didn't launch their first official game until 2006 and they changed a lot. So in 2013, Roblox made a critical change to its developer platform by allowing creators to exchange virtual currency, Robux, for real world currencies.

That developer payout has increased every year since launch, it's been incredibly lucrative for them and it led to Roblox paying $250 million to developers in 2020. They've generated a ton of revenue since then and in 2020 alone they were generating $920 million in revenue with a 111% increase year over year.

They were seeing 150 million people play Roblox once a month. 33.4 million people use it daily. That was a huge uptick in the number of users over time. So they have a few different models, they get money from ads, they also make money from their Robux, people buying Robux. They have freemium models as well and then they also have licensing agreements which has ended up quite lucrative for them.

So their platform being down impacts their revenue almost immediately because they are not getting paid for the folks paying for ads, they're not getting paid for the folks that are buying Robux, et cetera, et cetera. Niall, do you want to tell us a little bit about the outage itself?

Niall: Yes, and actually this is, for me anyway as an engineer on call and as a previous person on call, this outage report is one of the juiciest when it comes to actual details and figuring out what's going on, and describing I suppose the diagnosis process in very great detail. So I really like it from that point of view.

But to start out with the high level pieces, first of all the reason it's notable is it's a 73-hour outage. It is a long weekend outage and it goes on for a long time before they actually start making progress on the actual contributing factors, et cetera. There's also a large number of customers affected. I think in the report they say about 50 million people, elsewhere they say 44 or 45 million daily actives.

So not quite sure, but a very, very large number of people affected. The other interesting thing which we might later come back to in this conversation is the median age of the people affected by this outage is 13, as opposed to, for example, I don't know, 23 or 33 or 43 for other cloud products. Also, interesting to note again from a high level point of view, that it's diagnosis that takes up the bulk of the time.

There's a very significant amount of effort put into restoration as we'll see when we go into the details. The other thing from the technical details point of view, is to say that Roblox do a number of slightly unusual things compared to what I will for the moment declare arbitrarily as the mainstream of production data center activities.

They run the Hashi Stack, which is to say Nomad, Consul, and Vault, one of which turned out to play a fairly major role in this outage, and they also run their own data center which I think in the post incident report they say they have something like 18,000 machines at the time of the incident, and 170,000 containers. So this is a very significant deployment of this stack, possibly a very, very big one indeed.

From a contributing factors point of view, they in the report say that there's two technical underlying factors. A particular feature of Consul and the underlying infrastructural, in this case, storage support for how Consul operates generally, and specifically does in this feature.

They have all kinds of other contributing factors to why this went on for 73 hours as opposed to 7.3 hours or equivalent. But the report is definitely an interesting example of what happens when you are buried in the minutiae and you're trying to figure out what went wrong, but we can go into that in more detail.

Nora: All right, Niall. Could you give us some background on what the stack does, what the stack is and tell our listeners a little bit about that?

Niall: Yeah. This will be extremely high level so I don't promise that folks who are coming natively or naively to this discussion will necessarily understand everything I say first time. But I will say the following stuff, that when you are running a production service in a data center and it is a distributed system, these days you invariably end up relying on some kind of distributed consensus in some way. This is basically a way to agree on a set of values, often it's some kind of log level that you reach in some storage system that tracks changes that you write over time.

In essence, you are obtaining reliability by having a cooperating set of machines agree that, "Yes, the state of the world is X." And X can be a particular configuration of a key value store or it could be arbitrarily many things.

But basically they use distributed consensus, and the thing that does distributed consensus in the HashiCorp world is a combination of Nomad, Consul, and Vault, I suppose. But Consul is really the major thing that does this. Nomad schedules things, so they put the running jobs on the machines, Consul is how you find stuff.

It does service discovery and also tells you where things are, like a mapping between machine and job and stuff like that, and Vault is used for storing secrets. At least that's how I understand it. So these days if you're running anything of any significant size, you are more or less committed to running some system or infrastructure that will allow you to migrate between machines when machines fail. And not manually either, migrate automatically between stuff.
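As a rough illustration of the agreement Niall describes, the sketch below treats consensus as a simple majority vote over what each node reports. It is a toy model, not how Consul or Raft actually works (real protocols also handle leader election, log ordering, and retries), and the function names `quorum_size` and `agreed_value` are made up for this example.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def quorum_size(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def agreed_value(votes: Sequence[Optional[Hashable]]) -> Optional[Hashable]:
    """Return the value reported by a strict majority of nodes, else None.

    A None entry stands for a node that failed or did not respond.
    """
    needed = quorum_size(len(votes))
    counts = Counter(v for v in votes if v is not None)
    if not counts:
        return None
    value, count = counts.most_common(1)[0]
    return value if count >= needed else None


# Five nodes, two of them down: the remaining three still form a majority.
print(agreed_value(["X", "X", "X", None, None]))
```

The point of the majority rule is that a five-node cluster keeps agreeing on state with two nodes lost, which is the hardware-failure tolerance discussed later in the episode.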

Nora: And one thing to note here is just how big the traffic is and how much they have expanded recently, Roblox, as a company so these clusters are likely quite large and difficult to manage and becoming more complex as time goes on too. So responders are constantly having to update their mental model about what is changing with these servers as their business grows and as their user base grows.

Niall: Absolutely. So in this particular case, one of the interesting things I think about this, where they date the start of the outage to which is the 28th of October, 13:37 or thereabouts. So in that afternoon, Vault performance is degraded, which is to say the secret stores, where you are writing passwords and certificates and stuff like that.

They see that a single Consul server has high CPU load, and basically they start to investigate at this point, nothing untoward at this point. No players impacted, according to this. However, reasonably quickly it starts to degrade from that point and the underlying key value store, KV store that Consul uses to store its mappings between the various services and where they're running and so on, that write latency goes up from... 50th percentile latency I should say, goes up from about 300 milliseconds to about 2 seconds.

Now, those folks who have run production services before understand that when your 50th percentile, median write latency goes and multiplies itself by six, you're going to have a bad time and anyone depending on you is going to have a bad time as well. The interesting thing is that of course harking back to the discussion about distributed consensus and the ability to survive the failure of any particular machine, Roblox in the post incident report state very proudly that the Consul can survive hardware failures and the difficult thing is not when you have a hardware failure but when the hardware is slow.

Like it's just slow enough to impact overall performance, but isn't slow enough to be swapped out and actually be replaced. And so, at this stage the team thinks that, "Oh, there's probably degraded hardware performance, and what we'll do is we'll switch out one of the cluster nodes, the broken one, for another piece."

And at this point the Roblox engineers actually page in the HashiCorp folks and they are helping with diagnosis and remediation from this point on, and very nicely in the PIR it says that the team and the engineering team from this point onwards refers to both folks from Roblox and HashiCorp.

Nora: I've been at companies before that have had to pull in HashiCorp just due to high traffic and understanding how Consul and various HashiCorp services work under that amount of traffic too. So I do think it's awesome that the HashiCorp team comes in in those situations, and how companies learn like, "Hey, this is a little bit beyond our expertise here as well."

Niall: Yeah, good stuff. So we're still in the early triage phase, and what happens is they go and replace that hardware but for various reasons we'll come to understand pretty sharpish, that doesn't actually fix anything.

And so they go, "Okay, given the severity of the incident and how quickly we need to react, et cetera," the team replaces all of the Consul cluster nodes with 128-core machines instead of 64-core machines and by 19:00 on that date, the replacement is completely done and the cluster is unfortunately still reporting that a majority of the nodes are not able to keep up and were still at the 50th percentile for writes on the key value store at about two seconds.

Now, to me this is extremely strongly reminiscent of the market can stay irrational longer than you can stay solvent. Or to put it another way, software can stay broken quicker or more thoroughly than you can get your hardware better. So I think there's an interesting lesson here that actually software can be worse, more than hardware can get better.

Nora: It's true. I need that printed on a T-shirt. Okay. So where are we at now, Niall?

Niall: Yeah, so that takes us to about 2:00 AM on the 29th where the Consul leader is still a bit worked, and they decide to get into things like, "Okay, perhaps it's a particular kind of traffic that's coming from a particular kind of source." So they have this pretty clever way of firewalling off, with iptables, the internal traffic of various kinds, in order to see if that's going to make any difference to the write latency. See if it's traffic triggered, essentially.

And this doesn't help. Expect me to repeat, "This doesn't help," a lot during this post mortem. I feel for the poor things, this is just a 73-hour ordeal where they don't really have a good handle on what's happening for about half of it. So after the move to 128-CPU machines, which doesn't help, they realize that actually maybe instead of looking at the traffic between internal services, they'll start looking at actually how Consul itself is behaving internally, which is actually quite a fruitful area of research as it turns out.

So from I think 2:00 AM to 4:00 AM or thereabouts on the 29th, they're doing this and then they go into what turns out to be about 10 hours of research into how contention works and how thread locking works. They also move back to 64-CPU machines from 128-CPU machines because they go, "Oh, actually if contention for the underlying resources is part of what's contributing to this, then with 128 CPUs we just made that twice as bad, so we'll move back to 64 CPUs." Unfortunately this does not in fact help, as per previous comment.

Nora: Sorry, I wish in this report they had talked a little bit more about what responders were feeling and doing at this particular point in time. I mean it's been going on for a couple days now. I imagine they've also rotated responders.

So how are they keeping each other up to date about things that they have looked into and tried, and how did that make things difficult as well? Did they have responders from throughout the world? Where did most of their Consul expertise lie as an organization? Was it in one particular part of the world or was it in multiple parts of the world where people could actually go to bed and continue diagnosing the issue?

I think from a lot of what you just said, it's notable to me that they didn't start looking at Consul in particular until 36 hours into the outage. That has to be really difficult in itself when you're just going down all these different paths and trying a bunch of different things and seeing what lands and what doesn't.

Niall: Yes, I think my observation would be that the document as written leaves unaddressed a series of questions or points around how they conducted incident resolution. I think conversely however you don't see a lot of that generally speaking. You certainly don't see a lot of it from the likes of Amazon, Facebook, Google, et cetera. Sometimes you see a bit of it, but in general you don't see that. You do see a lot of focus on the technical and actions to remediate and the actions they're going to take in the future and so on.

I think this post incident report is quite good on that actually, but broadly speaking they don't really talk about how they coordinate, they don't really talk about how that could improve, and that is, I suppose from a Completionist's point of view, that that is a gap.

Okay. So the next portion of work is what they titled Root Causes Found, Contributing Factors Found, which is from 12:00 to 8:00 PM on the 30th or thereabouts, noon to 8:00 I think. That is the point at which they decide or they make traction with this decision to pivot away from a systems point of view, "Are these machines inappropriately communicating? Is there a traffic overwhelm problem?" Systems point of view, they move to a software point of view, and actually it's flame graphs that save the day because they pull out the call stack, they pull out the performance counters, they look at the flame graphs for what's actually going on inside the software and they see wonderful, light dawns, evidence of streaming code paths being responsible for the contention that is causing the high CPU usage.

And streaming, to come back to the thing we were saying earlier in this, is a new feature which is somewhat ironically, and I expect that some of the Roblox people were a bit annoyed about this, but is somewhat ironically designed to lower CPU usage and RAM. Just a different way of structuring how Consul does its work, and I imagine how long it holds onto things which are sensible and so on that's implied by the streaming model.

So they discover that the call stack says, "Actually we're spending a lot of time in streaming, which," they say, "we recently enabled in a new version of Consul that we rolled out as we moved from 1.9 to 1.10." When the configuration change to remove streaming is rolled out, the 50th percentile for the key value store write goes back to 300ms and they have a breakthrough. They finally have the moment where some change on the system restores previous behavior.

Now, there's still weirdness and in particular this is the kind of weirdness that impedes them from actually making quicker progress because they are trying to roll out new configurations, new versions and so on. And so the issue here is that even though they do have a successful return to baseline performance prior to the incident in some of the Consul leaders, they actually don't see it for everything and so some of the leaders exhibit the same latency problems and they're manually going around bumping the Consul service from machine to machine in order to get a set of known good leaders. That's a lot of manual activity but it does actually restore the overall Consul system to a healthy state.

Okay. So after having that breakthrough, unfortunately they still have a number of technical issues. In particular, some of the masters that they have elected in their Consul network, some of those masters still have latency problems even after they reverted some of the previous configurations.

And as a result, they take what they call the pragmatic decision, which I think is very true, to just essentially prevent those leaders from being elected again and when the problematic leaders are no longer used in the course of the Consul system, then that is part of the restoration to service that they accomplish after they do that rollback.
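The idea of keeping slow nodes out of leadership can be sketched as a simple filter over measured write latency. This is purely illustrative: the episode does not describe how Roblox actually excluded those leaders, and the function name `eligible_leaders`, the node names, and the 500 ms cutoff are all hypothetical.

```python
from typing import Dict, List


def eligible_leaders(p50_write_ms: Dict[str, float],
                     max_p50_ms: float = 500.0) -> List[str]:
    """Return nodes whose median write latency is within the cutoff.

    The cutoff value is an assumption chosen for this example only.
    """
    return sorted(name for name, ms in p50_write_ms.items() if ms <= max_p50_ms)


# Hypothetical measurements: two nodes near the 300 ms baseline, one at 2 s.
measurements = {"consul-a": 300.0, "consul-b": 2000.0, "consul-c": 310.0}
print(eligible_leaders(measurements))
```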

The interesting thing there is really the question about why those leaders ended up in that high latency state and how they persisted in that high latency state even though they thought that they understood that removing the streaming code path was one of the most important underlying issues. Now, this is where HashiCorp comes in because it's HashiCorp engineers, or this is one place where HashiCorp engineers are particularly important because they actually determine that what's going on is there's an underlying database called BoltDB which is used to store the Raft logs which allow the master election process that they're talking about.

It turns out that BoltDB has a particular set of code paths, et cetera, which are triggered in a pathological configuration by what Roblox are doing, and as they say, quote, "Typically write latency is not meaningfully impacted by the time it takes to update the free list. But Roblox workload exposes a kind of a pathological performance issue that makes maintaining what blocks are free to overwrite in the database extremely expensive." End quote.

So with that in mind, at this point in the outage we have some pretty good idea that streaming is involved. We haven't figured out what's going on with BoltDB but we will figure that out some days later. Nonetheless, it's time to start restoring service really. So Nora, what happens when they are restoring caching service?

Nora: So at this point it's been 54 hours since the start of the outage, streaming is disabled and a process is now in place to prevent slow leaders from staying elected and Consul is now consistently stable at this point so the team is able to move on from there and focus on more of a return to service. So we're seeing a different phase of incident response now, and I really did like in their post incident review how they separated it like that because it's really important.

I feel like a lot of companies tend to focus on the length of the outage as a whole, without looking at particular segments like when they were looking at restore, when they were looking at diagnosis, when they were looking at Consul specifically. A lot of those phases may overlap, but it can really help you with your action items if you drill down into those specific points in time.

So part of restoring the caching service needed to be done sensitively and everything went wrong, including scheduling. Likely due to the Consul cluster snapshot reset that had been performed earlier on, internal scheduling data that the cache system stores in the Consul key value store was incorrect. Then we see that deployments of smaller caches start taking longer than expected to deploy and deployments of the large caches were not finishing.

So again, an everything-goes-wrong situation, and I'm sure a lot of the folks listening to this that are responders can empathize with the situation. It turned out there was an unhealthy node that the job scheduler saw as completely open, rather than unhealthy and I'm super curious how and why that happened specifically. But what ended up happening as a result is the job scheduler started aggressively trying to schedule cache jobs on that unhealthy node, which obviously failed because that node was unhealthy.

The caching system's automated deployment tool was built to support incremental adjustments to large scale deployments that were already handling traffic at scale, not iterative attempts to bootstrap a large cluster from scratch. So again, it was built to support one thing and they were using it completely for another thing which led to a lot of the issues they saw with this restore. But then we move into a return to players phase where consumers can actually start using the application again. Niall, do you want to talk about that?

Niall: Yeah. I was just going to say actually, your point about the job scheduler attempting to schedule things on a machine which is totally broken but it's saying, "I am totally awesome and I am open for business. Please schedule your stuff on me." It reminds me of load balancing algorithms that go, "You are the quickest responder, I'll send all my traffic to you." And of course you can very quickly send the 500 or equivalent unhealthy return code and take a lot of traffic that you will, as a result, not in fact serve correctly. Anyway-
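The "quickest responder" trap Niall mentions can be shown in a few lines. The sketch below is a hypothetical illustration, not Nomad's scheduler or any real load balancer: the `Backend` type, the field names, and the 5% error-rate cutoff are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Backend:
    name: str
    avg_latency_ms: float
    error_rate: float  # fraction of recent requests that failed, 0.0 to 1.0


def pick_fastest(backends: Sequence[Backend]) -> Backend:
    """Naive policy: the lowest latency wins, even if every reply is an error."""
    return min(backends, key=lambda b: b.avg_latency_ms)


def pick_fastest_healthy(backends: Sequence[Backend],
                         max_error_rate: float = 0.05) -> Optional[Backend]:
    """Filter out failing backends first, then prefer the lowest latency."""
    healthy = [b for b in backends if b.error_rate <= max_error_rate]
    if not healthy:
        return None
    return min(healthy, key=lambda b: b.avg_latency_ms)


# A broken backend fails instantly, so it looks like the fastest one.
pool = [Backend("broken", 2.0, 1.0), Backend("ok", 40.0, 0.01)]
print(pick_fastest(pool).name)
print(pick_fastest_healthy(pool).name)
```

A node that returns errors immediately will always look fast, so a latency-only policy keeps feeding it work; checking health before ranking by speed avoids that loop.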

Nora: We did a little check for that unhealthy status once. Did you know?

Niall: Well, indeed. You were unhealthy three weeks ago, therefore you are still unhealthy, or you were healthy yesterday, therefore you are still healthy also, similar kind of reasoning. Anyway, from the point of view of the return of players or full return of service, I thought this was very gracefully handled by the team in question because they don't just run over to the panel, flick all the switches to on and stand back and wait for the water to come cascading through the pipes.

They actually do this in a very staged way and they do this using a technique which is very well known, at least among the cloud providers: DNS steering, where basically because of the way DNS works, when you ask the DNS system for the address or addresses of a name, you can get back a set of records and you can manipulate that set of records in order to attract traffic to or away from a particular server.

Basically it's a kind of coarse-grained method of doing traffic control. I say coarse-grained, but it's still pretty effective. Google, Facebook, Amazon, et cetera, all use this technique so it is definitely used. But it's one of the things that they use to leak a certain amount of production traffic into the network, see how things perform, make sure that the performance levels are still relatively stabilized and so on.

So they do this and ramp up the player percentage in 10% increments or thereabouts. The interesting thing that I wanted to call out here is that they're in a position where they don't trust their monitoring, or at least it seems from everything else we've heard so far that they've a reason to not trust their monitoring. At least at the aggregate level, I mean.

And so every time they add another 10% worth of players, they go back and they check the database load and cache performance and stability and all of this kind of stuff. I'm reading between the lines, maybe I'm wrong here, but I'm getting the sense that they're doing a lot of this manually, they're checking the database load, maybe they look at the graph but maybe they're also MySQL'ing in and doing SHOW PROCESSLIST and looking at all those kinds of characteristics directly from the metal rather than actually just looking at the observability piece.

I don't know, I could be making stuff up, but I get that kind of impression which is an interesting thing when we think about that in the context of managing incidents because the question of can I or can I not trust my monitoring or observability stack is always a huge question. Actually, almost by definition when you get into a serious outage, there's something your observability didn't catch or it did catch it but maybe it displays it in the wrong way or there's almost always some kind of gap you get into.

You have to fill it manually. I am specifically interested in that because I am certainly sure that there will be some circumstances in the future where there isn't the equivalent of the break glass, manual action thing and actually the observability will be all you have. So that might be a challenge at some point. Anyway, they very gracefully introduce their customer base back to their services in these staged increments.
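The staged return described above, admitting players in roughly 10% steps and checking system health after each step, can be sketched as follows. This is a simplified model, not Roblox's tooling: real DNS steering typically adjusts which records resolvers hand out rather than hashing individual players, and the function names, the hash bucketing, and the example addresses (taken from the reserved documentation ranges) are all assumptions.

```python
import hashlib
from typing import Callable, List


def bucket(client_id: str) -> int:
    """Map a client to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def answer_for(client_id: str, admitted_percent: int,
               live_addrs: List[str], holding_addrs: List[str]) -> List[str]:
    """Return the record set for this client at the current ramp level."""
    if bucket(client_id) < admitted_percent:
        return live_addrs
    return holding_addrs


def ramp(check_healthy: Callable[[int], bool], step: int = 10) -> int:
    """Raise the admitted percentage in steps; stop at the first failed check."""
    percent = 0
    while percent < 100:
        candidate = min(percent + step, 100)
        if not check_healthy(candidate):
            break  # stay at the last level that passed its checks
        percent = candidate
    return percent


live = ["192.0.2.10"]        # hypothetical production address
holding = ["198.51.100.10"]  # hypothetical maintenance-page address
print(answer_for("player-1", 100, live, holding))
print(ramp(lambda p: p <= 30))
```

Here `check_healthy` stands in for whatever a team inspects between steps, such as database load and cache performance; in the incident as described, those checks appear to have been at least partly manual.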

Nora: Which is always a nerve wracking experience.

Niall: Yes, indeed. Exactly. Okay, everyone. Come on in. Always an interesting moment. Then by 16:45 local time, Sunday, 73 hours into this thing, 100% of players have access and we are restored.

Nora: That is such a long outage and I suspect many of the responders that started with the outage were continuing on for significant periods of time. These are the kinds of outages that ripple through the organization forever. I remember joining a company that had a big, public outage, similar to this and it was like everyone would reference that when someone new joined the company, like, "Oh, did you hear about this?" And things become not necessarily over indexed on the situation, but people remember them and those observability improvements, like you said, hopefully they lead to them.

But sometimes in cases, they lead to very random things that are so specific to this particular outage that they end up being a lot to keep track of later on. Given that it's been several months now since this outage happened, I'm wondering how the organization is faring now, how the new SREs are faring, if they still talk about this outage which my guess is that it does come up occasionally. So that's what I'm wondering about.

Niall: Yeah. I have no evidence for this whatsoever, so do take this with a grain of salt. But I suspect that outages like this have a pretty bimodal distribution, there are the ones that everyone is introduced to when you enter the hallowed halls of the SRE team onboarding process or whatever it is, somebody nicely puts an arm over your shoulder and says, "Let me tell you about the time that it all went horribly wrong."

Nora: Exactly. I did join one organization that had an outage on Halloween too, and they were like, "Have you heard about Halloween?" And as a new person you're like, "No." But I'd actually really encourage new people that join organizations like this to approach it from a curious level. Even if you do know about Halloween, it's really good to say, "Can you tell me about it from your perspective?" And take that curious point of view because you'll piece together a lot of the ways it's still impacting the organization, and a lot of the ways that they didn't talk about it online.

Niall: Yes. I think the other side of that coin is, of course, the other half of the bimodal distribution where you never talk about this ever like it is a-

Nora: That's way worse, yeah.

Niall: It's a taboo story. I think one of the successful things about incident management and post incident response or whatever you call it, however that is done within the context of your organization, the most successful outcomes probably involve storytelling, they involve turning the terrible thing that just happened to you or that happened to you last year into something that you can use to align people actually. That's one of the huge things that I think we sometimes miss when we're thinking about very large organizations and the huge transparency and maybe even legal exposure to various dangers they might have. There's so many limits on what they can say externally, but internally-

Nora: Internally, yeah. Sometimes it becomes a sweep under the rug situation where you talk about it but you don't want to make anyone feel bad, and that's not useful either. I feel like it takes a long time to invest in skills that allow people to share freely and in a psychologically safe way, and that psychological safety will allow for more transparency and, like you mentioned earlier, Niall, alignment. And with the new market that we're experiencing even in the last few weeks, companies are going to want to focus on retention, and they typically do in situations like this and so a big way to focus on retention is to focus on learning and disseminating and storytelling, like you said. Which doesn't come for free but it's going to be even more important during these time periods to invest in that.

Niall: Yeah. If you're going to have the trauma, you might as well have the alignment. And if you manage to get the alignment, there is a real human effect to the kind of bonding you have and ensuring that this will never happen again, I mean, in quotes, "Never happen again." Right? But coming back to that point which regular listeners or readers will know is a particular bugbear of mine, often there's commitments in post incident reports to repair items or activities of various kinds to offset the chance of this happening again, which people outside the organization are, typically speaking, in no position to evaluate, right?

But I think one of the things I like about the credibility here is that they've got some pretty concrete actions, right? They have the telemetry improvements or observability improvements that we might expect, but they also have a bunch of slightly more system design or architecture oriented pieces like they're going to split off particular kinds of load to make sure Consul doesn't get overloaded in that particular way again.

And, hugely important, they're actually going to implement streaming for real this time and have it work. Also interesting, I suppose from an economic point of view, that despite all of this... Of course, the meaning of despite might vary depending on your point of view I suppose, but they say that they want to use some public cloud but they are not intending to switch to it wholesale. Floating in the back of all this is a question about autoscaling and I suppose some kind of parallel techniques available in the public cloud market that might have helped with this outage.

Nora: I'll be interested to see if they chat more about that. I think one thing I'm interested in in these big outages too, the action items that come out of it, how many times in previous incidents have those action items come up and if they were related at all. All right, I think we are at about time. Anything additional to add, Niall?

Niall: I don't think so, other than my congratulations and commiserations once more to the folks involved in this and if there is some kind of challenge coin available for, "I survived this outage and all I got was a lousy T-shirt," you folks definitely deserve that.

Nora: Yeah, I hope they're celebrating that. Yeah. With stickers, with T shirts, with something that ends up being quite fun and it allows for those learning opportunities and that transparency to occur. But yeah, thanks, Niall, and thanks for tuning in, everybody.
