In episode 2 of Getting There, Nora and Niall discuss the socio-technical aspects of the AWS outages that occurred in December 2021. Together they unpack what happened, the inherent implications, and how organizations can learn from outages at such scale.
Nora Jones: All right, folks, this is our second episode.
This podcast has had a one-episode history of discussing fairly wide-ranging outages.
We actually have another one to chat about today.
The AWS outages of December 2021, though we're going to focus on the first one, which happened on December 7th.
We're going to go over what happened, when it happened, the implications, and take a sociotechnical view of the situation based upon what we know to be true.
There's really no shortage of pontification on the outage on the internet.
In fact, you can Google it and find several pages of stuff, news reporters, all kinds of things, some good and some not so good about the outage.
We're going to try to avoid pontificating today. With some of that background, I'm going to let Niall talk to us about what exactly happened with this incident.
Niall Murphy: Hey, Nora.
First thing to say is part of the reason the outage on the 7th is so notable is that it lasted so long.
It's something like from 7:30 AM PST to about 20 past 2:00 PM PST, which is about seven hours, plus or minus a bit.
It also affected US-East-1, which turns out to be this incredibly popular and incredibly significant data center, I suppose, is the right word to use.
The outages of the 15th and 22nd, which are, I think, half an hour and four and a quarter hours or thereabouts, respectively, are not quite as impactful in a number of ways, including the duration, and also US-East-1 just seems to be this crucial data center, as I say.
So we're not going to go into them in any detail, but let's talk a little bit about what happened to poor old US-East-1.
I think the first thing to say is that actually when I came across this outage, I was reminded about another Amazon outage, I think in EBS, the Elastic Block Store, from 2011, which is a very, very long time ago.
Basically, there's a well-known, although rare, scenario or strategy for managing fleets, which involves having a separate physical network to do this.
This is kind of the separating control plane from data plane is one way to think of it.
Also, hosts on two networks are sometimes called dual-homed.
You sometimes hear about things which are dual-homed networks or equivalent.
Basically, as we know on this podcast, it's either DNS or the network.
In this particular case, it turns out to be the network because at 0730 ... Well, okay. It's not just the DNS and the network. It's also sometimes automation.
At 0730 AM PST, there's some automated activity that kicks off to scale capacity of one of the internal AWS services.
I believe it's internal from context, and it's hosted in the main AWS network. "That ends up," says the note from Amazon, "triggering some kind of unexpected behavior" from a large number of clients inside the internal network.
Now, it is not 100% clear what that unexpected behavior is.
There's a lot of speculation, if not indeed active pontification around the net about what this is.
From what they say, it's hard to interpret, but I find myself speculating that it looks like something which is triggering TCP re-establishment.
Like some TCP connection which has to be re-set up, going to a long-running service, maybe jumping across a public-to-private or private-to-private network, possibly even a fleet-wide restart or a very short-duration rolling restart.
Anyway, this seems to exhaust some kind of capacity that the networks have to talk to each other.
Now, this may not just be bandwidth. Of course, there are things other than bandwidth, which can limit conversations between two networks, firewall rules, that kind of thing.
It looks like there's a cascading failure, which is another very common failure mode.
Cascading in many different senses, because we have the network overwhelmed with retries, which raises another question, because TCP itself has a number of natural retries.
There may be some application which attempts to re-establish a connection itself by starting another TCP connection after the first one fails, and so on and so forth.
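The retry stacking described here can be sketched with a little arithmetic: when every layer in the stack retries independently, the attempts multiply. This is an illustrative model only, not anything from the outage report; the function name and the layer counts are made up.

```python
# Sketch (illustrative, not from the outage report): how retries stack
# multiplicatively across layers. Each layer that retries N times
# multiplies the total attempt count seen by the layer below it.

def amplification(retries_per_layer):
    """Total attempts reaching the bottom layer for one user request,
    assuming every attempt fails and every layer exhausts its retries."""
    total = 1
    for retries in retries_per_layer:
        total *= (1 + retries)  # the original attempt plus its retries
    return total

# TCP retransmits, an application-level reconnect loop, and a client
# library's retry policy, each trying 3 extra times:
print(amplification([3, 3, 3]))  # → 64 attempts at the network layer
```

That multiplicative blow-up is why a modest failure can overwhelm the network: one logical request becomes dozens of packets' worth of attempts once every layer joins in.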
After this cascading failure, we have a number of consequent effects.
In particular, one pretty bad one from the point of view of the teams attempting to do resolution, which is that the monitoring data dies.
It's now impossible for the monitoring data about what is happening to get across those networks and be delivered to the places it should be delivered to.
So the teams are essentially forced back to using plain old logs, right? Which is not necessarily great, in that it's hard to summarize or get an accurate and wide-angle view of what's going on in your network if what you're doing is tail -f'ing a bunch of logs. On the other hand, sometimes logs can be a bit more moment-to-moment accurate about what's happening than monitoring data, which can suffer arbitrary delays.
But they talk about what they do in the outage report next.
That is, they spend a lot of time chasing a red herring, which is coming back to the earlier comment that it's either DNS or the network.
They figure out that, hey, maybe this has something to do with the previous widespread DNS outage that the industry suffered from, and they move internal DNS.
I was interested in whether the human resolvers, not DNS resolvers, problem resolvers, incident resolvers, were possibly influenced by the, at that time, recent Facebook outage, or maybe they had other signals to believe that moving internal DNS would solve things.
I think part of the reason that they quote in the report, if I recall correctly, is something like, "The traffic was believed to be contributing to the congestion."
I mean, the DNS traffic. By default, DNS traffic is small and it's UDP and so on.
But there are various ways to run DNS over TCP and it's becoming more popular.
If you do sign zones and a bunch of other stuff, there are various ways for it to end up running on TCP by default.
Anyway, it turns out that moving the DNS made things a little bit better, but it didn't actually fix it. Reading the postmortem, I felt that the writers had possibly felt a little bit cheated that their idea about moving the internal DNS hadn't actually improved anything that much.
I read disappointment into those words, maybe accurately or inaccurately.
They then tried to move a bunch of other stuff around in order to bring down the number of attempted cross communications, but probably unsurprisingly, given they were still lost in the fog of war at that point and couldn't see what was going on properly, there wasn't really much in the way of positive progress.
In fact, touching on a component of our last podcast, in fact, some of the things that they needed to fix were things that were in themselves broken.
There's this sentence, or this phrase in the postmortem where it says, "The impairment to our monitoring systems delayed our understanding of this event."
One particularly strong difficulty for them was the internal deployment systems are themselves living on the private network.
One extremely common way of addressing problems when you're doing this kind of incident response is you're doing infrastructure as code, right?
You change some kind of config file to describe what it is you're doing ... or sorry, the new version of the configuration of the thing that you're pushing out, and you try to push that out.
Of course, if your internal deployment system can't reach the external world or the bit where it's trying to deploy, then you don't actually have any way to fix that and you have to go to the system as it's live, change the config there, which always has some risk.
You essentially can't do what's called nuke-and-repave.
That's basically redeploying the entirety of a particular environment from just your config, in public, with live customer traffic going back and forth.
You have to really do this super, super carefully.
Anyway, they ended up resolving the issue and their list of things that they're going to do to prevent it seem pretty sensible to me.
The first thing that they do, under the stop-the-bleeding heading, is disabling the automated scaling activities that triggered this, because clearly it triggered some bad behavior on the back end and they didn't understand what that was at the time.
So they pause that. They make sure that the current situation is scaled adequately for the predicted traffic in US-East-1.
Rather than just automatically doing the scaling, they're like, "Okay. Manually speaking, we're pretty sure we have 20% headroom on this."
Or whatever the percentage figure is.
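The manual check being described, confirming that provisioned capacity exceeds predicted peak by some margin, is simple arithmetic; here's a minimal sketch. The function name and the numbers are hypothetical, not from the report.

```python
# Sketch of a manual headroom check: how much spare capacity exists
# relative to the predicted peak load. Names and figures are illustrative.

def headroom(provisioned, predicted_peak):
    """Fraction of spare capacity relative to predicted peak load."""
    return (provisioned - predicted_peak) / predicted_peak

# e.g. 1200 capacity units provisioned against a predicted peak of 1000:
print(f"{headroom(1200, 1000):.0%}")  # → 20%
```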
There's this wonderful sentence: "Well-tested request backoff behaviors were in production for many years, but the automated scaling activity triggered the previously unobserved behavior."
I think that's a fascinating technical sentence because what they're saying essentially is, "Okay. So we do maybe not the same thing that everyone else does, but we have some well-tested request backoff behavior, which might be exponential backoff, or it might be three linears and then stop." Or something like that.
Anyway, they're fairly sure they have this backoff thing nailed and then all of a sudden it turns out that no, they don't have that.
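For readers unfamiliar with the pattern, exponential backoff with jitter, one common flavor of "well-tested request backoff behavior," might look roughly like the sketch below. Every name and parameter here is illustrative; we do not know what AWS's actual implementation looks like.

```python
import random
import time

def call_with_backoff(op, max_attempts=5, base=0.1, cap=10.0):
    """Retry `op` with exponential backoff and full jitter.

    Illustrative only: AWS's real backoff behavior is not public.
    """
    for attempt in range(max_attempts):
        try:
            return op()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the failure
            # Full jitter: sleep a random duration up to the capped
            # exponential delay, de-synchronizing retrying clients.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

The jitter matters for exactly the reason discussed here: without it, a fleet of clients that fails together retries together, and the synchronized retry wave itself becomes the congestion.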
Nora: We don't quite know what the unexpected behavior they said was, right?
Niall: Yes. Which, again, gross speculation and pontificating and so on and so forth. I realize we said we weren't going to pontificate, and we are, but-
Nora: No. I mean, I think as long as we announce it.
Niall: Yeah. Exactly. Okay everyone listening, here is where we're going to gross speculation and pontification.
But the reading of this sentence makes me think, "Okay. You had defined, maybe as an overlay configuration, some kind of backoff mechanism, and now everyone's using a different configuration or a new library has been pushed out since you made the automatic scaling change. All of a sudden, all of your apps inherit from some kind of base library that does something different."
But "previously unobserved behavior" is a very interesting phrase.
I suspect that the people who are trying to resolve this are having a lot of fun trying to trace that.
Nora: The interesting thing about these AWS outages is that you end up learning how the entire world is using AWS, and also how AWS is using AWS.
They had a lot of services impacted themselves.
As Niall was saying, one of them was their monitoring service, so they couldn't even get in to see what was happening or how it was happening, which significantly contributed to how long the outage was.
EC2, Connect, DynamoDB, Glue, Athena, Timestream, Chime, and several other AWS services were impacted. Looking at that list of AWS services and what I know about them, it feels random how connected they were, so I am very curious about why those services were the ones impacted.
Niall did bring up a strange loop, like we brought up in the first episode, where the thing you need to fix the thing that's broken is in fact itself broken.
It wasn't just AWS that was having this strange loop, it was also their customers.
They couldn't get into the AWS console to even debug some of these things.
For a lot of the world that was impacted by this, we do see a lot on the internet like, "Oh, it's just a third-party thing. We don't have to take responsibility for this in our software."
Which you totally do have to take responsibility for it in your software.
I mean, you're making an explicit decision not to invest in redundancy beyond this AWS region.
It is entirely on you if your service goes down as well, but it's a decision to be made.
Some folks do not have the time or resources or willingness to invest in multiple regions, which could be completely reasonable depending on their business.
One of the most interesting things I thought went down was Statuspage, which is a popular piece of software for companies to report outages.
That was also impacted by the AWS outage.
Many companies use that to let their customers and others know that they have an outage, so most companies could not communicate to people that they had an outage.
Again, another indication like we were talking about in the first episode with the Facebook outage, like how much the world relies on this one particular thing, and Slack also having issues, which is where people coordinate with each other internally in a lot of companies during incidents.
A lot of companies don't practice what happens when their primary forms of communication fail. You'll notice that you don't actually have the phone number of your coworker to give them a call. You're not really sure how to email them. Maybe they don't even check their email that much. I think that also impacted the length of a lot of their customers' outages as well.
In terms of implications from an AWS perspective, moving fast is entangling things, and there are a lot of ironies of automation as well.
The Ironies of Automation paper is actually a paper from 1982 from Lisanne Bainbridge and it holds up today.
One of the things that she says in there is that the automatic control system has been put in because it can do the job better than the operator, but the operator is still being asked to monitor that it is working effectively.
As Niall talked about earlier, there was unexpected behavior.
That unexpected behavior is a result of exactly what Bainbridge is saying in the paper: the operator is expected to know this automation inside and out, even though it is automation.
Niall, you had some thoughts about some of the AWS impacts themselves.
Do you want to touch on that?
Niall: I think a crucial point here, which you actually touched on earlier, Nora, is this question of where responsibility for reliability lies.
I mean, I agree with you when you say it is ultimately on you, the service provider, to manage the question of your dependencies and how successfully they work together, how successfully they provide the service you need, so on, so forth.
There is another angle here, which is economic incentives for safety.
For example, let's suppose the situation where buying a hot standby VM for the thing that you're doing, I mean, it almost doesn't matter what, but buying a hot standby VM in another availability zone is effectively free.
Maybe you can use only one of them at a time or some other kind of constraint that makes it economically reasonable for the service provider.
Let's also consider a case where, in order to get another VM, you have to pay a thousand dollars or something like that.
Like for some folks, that will be a drop in the ocean.
For other folks that would be a prohibitive cost.
It is interesting to think about the situation where a sufficiently large service provider should be in a position--
I mean, maybe it isn't able to do this for various other reasons, but should be in a position to structure how it charges its clients so that their reliability is maximized and their availability is maximized and so on, in a way that doesn't cost the service provider too much, but preserves its reputation.
I suspect that there's something going on where, if you were a starting-off service provider, you want to be cheap as chips and you want to charge people for what they do, so you get people in the door for cheap.
But that means that they're not spending on the reliability, and when you get to a certain scale, you actually want them to be able to fall over to another VM without disrupting the rest of the world.
I think there are important questions for Amazon to ask itself and for other providers to ask themselves about this.
Niall: Yeah. On the general topic of the economics of safety, really, I think economics and many other things in the computer world overlap or affect each other in various ways of course, and some subtle.
But the thing that I really take from this particular component of the Amazon outage is that yes, there are all kinds of reasons why we might expect to charge more for perceived value-added services and so on.
In many ways that's a completely understandable economic situation.
But I think that we might be reaching the point where reliability as a value-added service actually becomes not in the interest of the cloud provider to get other people to pay for.
I suspect also-- I mean, obviously also not in the interests of the cloud consumer, or the client.
There may well be a different model of how reliability is perceived, how it is paid for, how it is modeled, that folks like Amazon are going to have to wrestle with over the next while.
That's just my feeling. I think there's some unexplored space there.
On other topics, I think one of the most interesting, or I suppose one of the most ironically humorous aspects of this outage is the question of actually finding out whether or not the outage was taking place.
Do you have anything to tell us about that, Nora?
Nora: Yeah. I think we touched on it a little bit earlier, because a lot of primary sources of communication, both with each other and with our customers, were not working, like Statuspage and Slack.
That provides a whole new sense of issues because a lot of the time we don't practice how we talk to each other when our primary sources of communication are down.
We saw that when we talked about the Facebook outage in episode one, I'm sure it's going to keep taking place in future episodes as well.
I think we tend to forget as engineers that not only do we have to enhance our reliability and our technical systems, we have to practice and enhance our reliability with our coordination and communication, which is equally as hard and equally as important.
It presents a whole new set of issues when everything is underwater, and we're not quite sure who to get in touch with or how to get in touch with them.
One humorous aspect of this outage was how folks reacted to it too.
I mean, certain deliveries couldn't get out, and there was a popular tweet that went around because the drivers couldn't do their deliveries.
They took a lighter side of it and started doing karaoke in the warehouses, which was quite humorous to watch.
Along with that, all sorts of different things were impacted as well.
Washing machines, Roombas, Internet of Things devices of many kinds and Amazon itself.
The reverberating impacts were quite great, and this is a really good example of the fact that we're not going to be able to measure the full customer impact and the full human implication of an outage like this.
Niall: I mean, actually that's a huge problem from the point of view of society that we are loading more and more onto the saddle bags of these horses.
Then when these horses get tired or fall over on the trail or whatever, we have really serious impacts throughout society, which are only really recoverable from because we have this expectation that those outages will be relatively short.
If you imagine some correlation ... One of the big questions in reliability engineering of course, is what is correlated in your system?
You can go trivially through some questions like, "Okay. Well, the bits of the distributed system have the same operating system, or the data is stored on stuff that has the same operating system, or the same cables go into the same switch."
Much of the technical exercise of reliability engineering is attempting to de-correlate things which would otherwise be correlated because a single failure can take out a bunch of things that are correlated and it's harder for a single failure to take out a bunch of things that are de-correlated.
But actually, way more things are correlated than I suppose we think, including up to things like we share the same planet, or electrons are the same everywhere, or maybe it's the same electron everywhere.
Basically, I think we are heading towards a future where, as we see more and more intense climate events or politics or all these kinds of things, we're going to discover that more and more of our modern world is correlated than we think, and the act of de-correlation is going to be expensive and difficult but necessary.
Nora: Absolutely. The world is getting more and more correlated over time.
We touched a little bit earlier on the fact that this isn't the first time this flavor of outage has happened with Amazon, not just at the end of last year, but several years before it.
It underscores the importance of writing down everything we know about these incidents afterwards.
Not only that, giving employees the time to study them.
You think about pilots and safety professionals, they read NTSB reports before they take the plane out.
Like, do we allow our employees to do that very much in software? I think the answer is no.
I'm super curious how Amazon invests in that as an organization too, because not only does it help us deal with these kinds of things in the future as they inevitably pop up, it enhances the employees' expertise and trust in the system and trust in each other in general.
I think it's about time for us to wrap up here. We have a few awesome outages that we can talk about next time.
Not that Niall and I are ever wishing for incidents, but we're certainly excited to talk about a few that have popped up recently.
Yeah. Any closing thoughts, Niall?
Niall: Two things really. I suppose the good thing about incidents is that they're essentially an infinitely renewable resource, so there's no question really of ever running out of them, which is good.
The second piece, harking back to what you were saying about giving employees the time to write things down and reflect on outages, is to reflect on that currently irreplaceable piece of value that human beings bring to incident resolution: the out-of-the-system awareness, being able to glue together various pieces of knowledge, improvise, and all of those wonderful things.
The question of whether or not that work is valued or how that work is valued, actually I think in the limit becomes a societal question.
I'm coming back to the societal piece again and again.
I think, for example, there are recent movements to bring some of the air accident investigation framework to how we process and treat security incidents.
I've seen that go past on LinkedIn in the recent past.
To my mind, maybe that isn't necessarily the specific direction that the cloud industry needs to go in.
You could argue it would be an unsuitable model in various ways, but I think that's something which is above the level of a single provider and which can provide a countervailing force to just releasing as many products and features as quickly as possible to get as much money as possible in the door.
We need something, whether it's regulatory or-- I mean, I don't know, but we need something to push that angle, which is other than just the employees wanting to be better at their jobs.
I'm not sure we're going to see a consistent effort towards improving de-correlation and reliability engineering until we have something like that.
Nora: Well said. Well, thank you, Niall. Thank you so much for tuning in folks.