Ep. #4, The April 2022 Atlassian Outage
about the episode

In episode 4 of Getting There, Nora Jones and Niall Murphy discuss the Atlassian outage of April 2022. This talk explores Atlassian’s 20-year history, key takeaways from this 14-day outage, surprising findings from the incident report, and critical discussion of Atlassian’s response.

transcript

Niall Murphy: Hi, welcome to the podcast. This session we are getting to talk about the Atlassian outage of April 2022. This outage is extremely interesting from both Nora's point of view and mine, but we'll get into the nuances of that very shortly.

At the start I just want to give you the general background to the outage and what actually happened. On Tuesday, April 5th, 2022, starting at 07:38 UTC, Atlassian had an outage that affected a lot of customers, with potentially many, many seats behind those customers. Part of the difficulty is interpreting what Atlassian thinks a customer is in the context of this outage report, which is again a topic we'll get into.

This outage started on April 5th and actually lasted for 14 days, which I think is a record for outages we have looked at on the podcast so far. So these 775 customers lose access to their products, to their data, et cetera, and can't do anything with them, but restoration of their service starts on April 8th and everyone is restored by April 18th or thereabouts.

So there's a 10-day spread during which customers are being steadily restored. The curious thing here, and we'll have a lot to say about the conduct of the incident and similar, is that it's actually problems on the Atlassian end with respect to restoration, data coordination, and communication that are the overwhelming themes that come through on reading this report.

And as I say, we'll get into a lot more detail than that in a moment. But, Nora, why don't you tell us, for those who don't know already, what Atlassian is and what they do and a little bit about their history?

Nora Jones: Thanks, Niall. So Atlassian was founded about 20 years ago in 2002 in Sydney, Australia. They have a really impressively scrappy history, they actually bootstrapped the company for several years with $10,000 of credit card debt from the founders and they made it work from there. Their flagship product is called Jira, something that is pretty pervasive throughout the software industry and it's been pervasive beyond the software industry as well.

It is an issue tracking and project management system. Jira's market share is somewhere around 50%. What's interesting, and we'll get into why this pertains to this outage later, is that there's no sales team at Atlassian or traditional support structure. So their customer support team from the beginning has been a bit atypical from what we normally see in the software industry.

They have had some pretty public incidents in the past, and publicity from public events tends to lead to emotional distress and to changed future development. There are a number of references and credits around that. But beyond that, they've also drawn a lot of attention to themselves via acquisitions of various companies. They've had 19 acquisitions total, and 10 of those have actually been made in the last 5 years.

As we all know, when you integrate another company into your company, for a while you're effectively managing two different businesses. Even though you're selling two different things, you start integrating the software, and as you try to add more features and complexity, it becomes pretty overwhelming.

So as you're adding on more companies and acquiring more companies, your legacy software also increases. I think Niall will take us in a sec to what happened around the legacy software and such.

Niall: Yeah, I will just touch on something you said there and say I admire Atlassian and what they've managed to achieve, and a lot of their professed values are things that deeply resonate with me. But I'm challenged by what I see in the post-incident report and how it actually resonates with those values, so I think there's possibly a lot to discuss there.

But anyway, let's talk a little bit about the incident itself. So we have a massive outage, customer data gets deleted, and it turns out that this is a classic example of the cultural values of the Unix API, or the POSIX API perhaps is the slightly better way to say it. When you tell the machine rm -rf /, the machine assumes you mean it and it will go and delete everything.

There are other APIs that don't work quite that way, but the Unix or POSIX API definitely does. So in this particular case the background to the outage is that there's a plugin, because Jira and the general Atlassian ecosystem are certainly large enough to support a marketplace, a plugin-style economic setup. So they have a plugin, it's being retired, and they need to go and delete the plugin from its installation on a bunch of customer sites.

One thing the post-incident report, the public post-incident report, does do quite well is talk a lot about the architecture, how things are laid out in the Atlassian world, and the technical infrastructure they have surrounding that. But broadly speaking, as you might expect, it is possible to delete things, it is possible to delete plugins, it is possible to delete customer data. These things are referenced by IDs.

So there's a script that they have; they need to delete this plugin from a bunch of customers, so they run around and figure out which customers that applies to and they get a list of IDs. Now, unfortunately, as they say in the outage report, the list of IDs, which for completely safe deletion should only reference the plugins that are to be deleted, instead ends up referencing the actual customer sites, rather than the plugin within each site.

So there's some idea of a container object, right? There's some idea of hierarchical data storage or customer attribute storage or equivalent. But the basic deal here is, where they should say, "Please delete Object X in the site of Customer Y," you actually get out the number which is the customer site in its entirety. And so, the second technical trigger of the outage: when they pass this list of IDs to the script, or the setup, that does the deletion, the system believes them. It says, "Okay, you want to delete all these customers? Sure thing, I'll go ahead and do that," and it does that.
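To make that failure mode concrete, here is a minimal, hypothetical sketch of the mix-up. None of the names or structures come from Atlassian's actual systems; it just illustrates a deletion routine that trusts whatever IDs it is handed:

```python
# Hypothetical sketch of the ID mix-up; the names and structure are invented,
# not taken from Atlassian's systems.

SITES = {
    "site-123": {"name": "Customer Y", "apps": {"app-789": "Legacy plugin"}},
}

def delete_by_id(object_id: str) -> None:
    """Delete whatever the ID refers to: the API believes the caller."""
    if object_id in SITES:
        # The caller meant to delete an app but passed a site ID:
        # the whole site, data and all, is removed.
        del SITES[object_id]
        print(f"Deleted entire site {object_id}")
        return
    for site in SITES.values():
        if object_id in site["apps"]:
            del site["apps"][object_id]  # the intended behaviour
            print(f"Deleted app {object_id}")
            return
    print(f"Nothing found for {object_id}")

# The deletion script was meant to receive app IDs ("app-789"), but the list
# that was generated contained the IDs of the customer sites instead:
ids_to_delete = ["site-123"]  # should have been ["app-789"]
for object_id in ids_to_delete:
    delete_by_id(object_id)   # silently deletes the whole customer site
```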

And, as a result, from April 5th a large number of customers, again with some question mark around precisely how many affected human beings or how many systems are sitting behind that number, but 775 customers lose access to their products. Now, in a perfect world you could just hit pause on everything, crank up your restore script, feed in the same list of IDs to this system and just have everything restored. Unfortunately, this world varies from a perfect world in a large number of ways and one of those ways is that actually they don't have good process automation or whatever to do this.

The report is a little ambiguous as to why this is, in my opinion, but broadly speaking, one crucial fact seems to be that when you actually delete the customer site you also delete a bunch of customer IDs, reference numbers that are used to capture and describe the customer, the plugins they have access to, the path names that you find their data on, and so on and so forth.

So when you nuke the customer IDs, you have a lot of downstream effects from that, one of which is, according to my reading of the report anyway, that it's difficult to restore the data for a customer who no longer has an ID on the system.

Not only that, which we'll explore in more detail in a moment, but it turns out it's very hard to allow a customer to report an issue if they can no longer refer to their ID because it has been deleted, and this underpins some of the communication issues that happen later on. But very quickly, to their credit, they realize that stuff's going down and it's bad, and Atlassian support I think acknowledges the incident by about 08:30 UTC.

Then there is what is, to my way of thinking, an unfortunately large gap, where it's only at 00:56 UTC on April 7th that they actually have their first broad, external messaging acknowledging the incident. But anyway, coming back to the things that they did well: they realized there was an issue fairly early on, and they spun up a cross-functional team in order to attack this issue. Then they work on a huge process of restoration with what seems, from the outside, to be some kind of ad hoc process, some kind of automation.

They spend days and days restoring customer data. In fact they reference in the report that there's something like 70 individual steps in the first restoration process that they did. Again, harking back to the fact that when you no longer have a customer ID, it's hard to actually restore that data. But the process involves things like creating a new site, licensing it, a new ID, activating the right products, and migrating the site to the correct region, because they do offer data residency, which is to say that if you want your data in a particular geographical region or legal jurisdiction or whatever, you can configure that.

You can't just restore anywhere, obviously. Then there's a lot of internal stuff about metadata, identity data, product databases, media associations, like the things that attachments are processed by, feature flags, all of this kind of stuff. So the restoration one approach ends up being used for about 53% of impacted users. They say users rather than customers, so again there's some question mark around that. I think that restores like 112 sites out of the 775 or thereabouts.

Then there's restoration two, which takes a substantially shorter amount of time. Restoration two basically involves realizing, "Oh, actually we don't have to make new identifiers. We could reuse the old ones." They don't talk precisely about the implications or the complexities of this, but it seems to involve undeleting some records associated with the site, so it may mean there are some database rows they can restore, or there's some recreation which can take place via identifiers they can find elsewhere.

Anyway, they get to reuse the old site identifiers and that removes like 35 or 40 of the 70 steps in the previous process, so they can go much faster in the technical restoration. But actually there's a lot of overhead that ends up happening in the incident response, because the scripts have to be rewritten and the teams have to manage a much faster process, which in itself increases the communication overhead.
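As a rough illustration of why reusing the old identifiers removes so many steps, here is a hypothetical contrast between the two approaches. The data structures and field names are invented for illustration and are not drawn from Atlassian's report:

```python
# Hypothetical contrast between the two restoration approaches; the structures
# and field names are invented for illustration only.
import uuid

def restore_with_new_id(backup: dict, catalog: dict, references: dict) -> str:
    """Restoration-one style: the old site ID is gone, so a fresh ID is minted
    and everything that pointed at the old one has to be re-linked by hand."""
    new_id = str(uuid.uuid4())
    catalog[new_id] = {
        "deleted": False,
        "data": backup["data"],
        "region": backup["region"],      # data residency has to be honoured
        "products": backup["products"],  # products re-activated under the new ID
    }
    # Every downstream table that referenced the old site ID needs repairing.
    for table in references.values():
        for row in table:
            if row["site_id"] == backup["old_id"]:
                row["site_id"] = new_id
    return new_id

def restore_by_undelete(site_id: str, catalog: dict) -> str:
    """Restoration-two style: undelete the existing records, so every
    reference to the old site ID simply becomes valid again."""
    catalog[site_id]["deleted"] = False
    return site_id
```

The point of the contrast is that the second path leaves existing metadata, media associations, and feature flags pointing at an identifier that still exists, which is roughly why so many of the 70 steps fall away.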

There's parallel batches of restorations, all sorts of work to be done, and they've a lot of testing and validation, which makes for a lot of work, shall we say? So that is the technical background, and I think they declare the incident over in or around April the 18th. Yes, those are the final four days that are spent in restoration two.

Nora: One thing I'm curious about is, given the length of the outage, which as you mentioned was a fairly long outage, despite the 7,000 words in this incident report, I don't see much information about what on-call handoffs looked like. Surely no one was on call the entire time. And I also don't see what the impact was on morale at the company, or how they're recovering some of that. Most of this, even though it's quite long and detailed in certain ways, feels like an apology tour for the customers.

Which is fine, and I think customers deserved that. Obviously there were some complaints about it on social media and elsewhere, and in various Slack communities I participated in.

There's a lot of emotional toll this takes on your organization as well, and that might not feel relevant to bring up in a public post-incident review, but it actually is, and it can be handled in a nuanced way. It does improve customer trust if you're understanding how teams coordinated together, if you're understanding how employees coordinated together, because they also, at the very beginning of this post-incident report, bring up communication gaps as an issue.

I'm sure those communication gaps percolated in the incident as well. We tend to replicate our organization when we're in incidents, and when something is bleeding, all rules and procedures go out the window and you're just trying to stop the bleeding as fast as possible. Which provides a reflection point into how your organization actually works when it's outside of an incident as well.

Right under that, we see "communication gap" and we also see "insufficient system warnings." Both of these were a little bit disappointing for me to read, because that is normative language, and I know this was something that was being shared with customers, but it's a very shallow analysis. When you take normative language like that, it kind of stops the incident review in general. It's saying things like "insufficient" or "gap" or "misbehavior."

It's kind of judging what happened rather than sounding curious, and if you sound curious, your customers will mimic that tone as well. That goes for judging poor behavior and judging good behavior, and so the norms on which this is built are defined in hindsight, like, "Yes, it was insufficient in hindsight and, yes, there was a communication gap in hindsight." But there was a norm before that which wasn't thought of like that, otherwise the accident wouldn't have happened.

So I wish that that detail had been explained a little bit more and that normative language had been avoided at the beginning too, because things like this also affect how people read things internally. Although it's kind of a marketing tactic and an apology, especially since it was written by the CTO at the time, it still impacts how employees view it too. I'm sure that it impacted how people spoke about the incident internally and how people felt about it, which impacts future incidents as well.

Niall: Yes, I have to say I have a large amount of sympathy for Atlassian in this, under a set of narrow categories, shall we say? First of all, for the folks who were working around the clock in order to restore the data, the folks who were improvising. Everybody was basically trying to move stuff along and do the right thing for the customers. I also have sympathy for them in this narrow category as well: Jira has a very large amount of market share, it's very well known in the industry, et cetera.

That's not necessarily because people love bug tracking products in general or Jira in particular, but any organization that has a product that has that kind of wide adoption is going to come in for a lot of criticism and it came in for a lot of criticism. And so I have a lot of sympathy there, but I think I in turn would have a number of pretty big questions to ask about the report as it is written, the mindset it reveals, and some of the things behind the scenes.

I'll give you a couple of questions, coming back to this question of empathy and so on. There's this quote in the report: "At Atlassian, one of our core values is 'Open company, no bullshit.'" But that is literally followed by a paragraph saying, "Although this was a major incident, no customer lost more than 5 minutes of data. In addition, over 99.6% of our customers and users continued to use our cloud products fine." Now, the difficulty with this of course is: no customer lost more than 5 minutes of data, that's cool, good.

Yeah, of course, 5 minutes of data could be a fairly large amount of data depending on how much you use Jira. Also, if it is 14 days between when you lose access to that data and when you get it again, it doesn't really matter that you only lost 5 minutes; it is actually a pretty severe difficulty. Additionally, the bit where they go, "Over 99.6% of our customers continued to use the cloud products just fine," or equivalent: is that 99.6% of the customers that you still had after you deleted the rest? Or is it the other 99.6%? There are some huge question marks about the defensiveness there. Nora?

Nora: Yeah. Eric Dobbs, who's an engineer at Indeed, actually wrote a pretty interesting article called A Better Apology For Atlassian, where he talks about the Open Company - No Bullshit thing that you mentioned as well.

He says, "While this is factually true, it ignores the elephant in the room; two weeks outage, all 200,000 of us customers saw it and are now faced with contingency plans. What if the worst case happened to us? Two weeks without the run books in our Wiki, without the acceptance criteria on all of our software tickets, without the hooks into our own incident management process and follow-ups, and without the hiring process documentation?"

I want to go back to what I was talking about earlier with the acquisitions and the amount of teams they're acquiring and the amount of complexity that adds to their communication and social structures as well which obviously impacts incidents and impacts how teams coordinate. It's increasing the cost of coordination, and so while they might actually believe they have an Open Company - No Bullshit culture, it doesn't just happen automatically. It needs to be actively practiced and I don't hear in here how it is actively practiced beyond that statement. I'm sure there are ways that they practice it, but that is a whole system to maintain on its own beyond just a software system.

Niall: Indeed, and they also have a "Don't F The Customer" value, which you could argue the whole outage, the conduct of the outage, the treatment of the customer and so on and so forth, would undermine somewhat. But I think there's another set of concerns to highlight here around how Atlassian coordinate internally. If I read the report correctly, they're relatively good at the start, realizing that bad things are happening and that they need to pull together loads of people and so on.

In fact, I think they talk volubly about some of their incident management process, but they have this other sentence in the report, and Nora, it says, "Because this had to be done within our production system, it took several days to fully develop, test and deploy," end quote. Several days? I mean, if you had done it sooner outside of the production system, that could've been better.

Nora: I also thought it was interesting that it was brought up in the beginning of the report that there were engineering teams using an existing script and process to delete instances of a standalone application, that the script was executed following their standard peer review process, and that it contained IDs for particular customers. They say in here, verbatim, that it was due to a faulty script.

But I become a lot more curious there. I want the history of the script, I want to know why it was originally created, I want to know which team originally created it, I want to know how people were involved in understanding it together. That is the stuff that I think is the meat of this, rather than some of the technical portions it describes here.

Niall: Yes, there's a huge piece around that, I think, and it's possible to read into that the kind of underinvestment in unlikely operational scenarios that you often see in organizations that are, for example, trying to control costs or scope in some sense. "This will never happen, therefore we won't invest in it," is a fairly common theme you'll understand, of course. I think another piece which is interesting to investigate is the question of why they didn't respond publicly sooner because remember there is this two days-ish kind of public absence of communication.

And part of it is, of course, the fact that having erased the customer IDs you then don't know where they are, or maybe your backups reference old contacts. They talk a lot about some of those contacts being old when they try to explicitly reach out to the customers. But they say, implicitly, "We prioritized communicating directly with affected customers, rather than broadcast."

There's a huge piece around why you would try to communicate individually with customers, rather than broadcast, when you've realized that 775 customers, with potentially many seats behind that, are affected. But having lost the contact information, okay, fine. Here's the real emotional nub of it. They say, "While we immediately knew what had caused the incident, the architectural complexity and the unique circumstances slowed down our ability to scope it, and rather than wait until we had a full picture, we should've been transparent about what we knew and what we didn't know."

And I think that's the key realization there. I see them as fundamentally saying, "Look, we really have to know what we're talking about before we say something." But actually, is that right?

Nora: Yeah. And I see that they did directly address in here "Why We Didn't Publicly Respond Sooner," and they say, "We should have implemented broader communications much earlier." But there's also this big complication and meta component to this, in that Atlassian brands themselves as a company that sells incident tools and incident management tools, and they write their own guides on how to do it.

And so there is a lot of pressure on the people responding, because even though they have this expertise on incidents and sell tools that other companies use in order to handle their own incidents, they're not immune to having them themselves. I actually think they have a much higher responsibility to be very vulnerable in these situations than other companies, because of the tools they sell, and so this trust gets broken a lot more than, I think, in outages with other SaaS providers that you use, because of how they tout themselves and, like you mentioned, the Open Company - No Bullshit thing.

And so there is clearly an investment in marketing some of these things and selling some of these products, but there's not so much of an investment in training their own people: training them on the response, on coordination, on doing incident reviews, on participating in incident reviews. I like that there was some responsibility taken by senior leadership for this, and I wish they had included the engineers in participating in this incident review.

I think that would've gone a really long way because the engineers are the ones maintaining the systems, day in, day out. They're the ones that know the pressures they are under and the trade offs that they're making that lead to some of these incidents, and so I think that would've actually escalated the trust from this document even more.

Niall: Yes. Again coming back to the operational starvation piece I was talking about earlier, it's possible to read into this large outage a consistent underinvestment and, in particular, a consistent underinvestment by Atlassian in the capability of its staff and their training in its own incident management process. Or maybe the incident management process as defined is not suitable for these kinds of incidents, which is another possibility of course. I wanted to touch on the actual recommendations of the report, or the actions they're going to take.

Which, to my mind, well, I'll talk about them and then I'll tell you my opinion. First of all they have, "Learning our lesson. One, soft deletes in code should be universal across all systems. Number two, as part of the DR, Disaster Recovery, program we need to automate restoration for multi-site and multi-product deletion events for a large number of customers. Number three, improve the incident management process for large-scale events.

And lesson four is improve our communications processes." All of which are worthy things to do, but actually I would put them in exactly the reverse order, and there are a couple of reasons for that. The first one I'd say is soft deletes being universal across all systems: actually you probably don't get to do that as coherently as you think you do, because the notion of deletion is so widely spread throughout all of the layers of various stacks.

I don't think you can achieve soft deletes being universal, and it's also probably the wrong design in some of these circumstances; you want hard delete in some circumstances. But anyway, I think they're overfitting, or overfixating on the fact that this issue happened in this particular context. And of course it's important to tidy up, and of course you go through all of the scripts and go, "Do I interpret these IDs as potentially being a site one, or pulling in a lot of data? If so, then flag it, or put up an error, or reach out to a human or something." That's totally fine.
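A guardrail along those lines might look something like the following hypothetical sketch. The threshold, names, and catalog structure are invented; it simply refuses to proceed when an ID in the deletion list resolves to a whole site or to an unexpectedly large amount of data:

```python
# Hypothetical pre-flight check for a destructive script; the names, threshold
# and catalog structure are invented for illustration.

MAX_ROWS_WITHOUT_REVIEW = 1_000  # arbitrary example threshold

class DeletionBlocked(Exception):
    """Raised when a deletion request needs human review before it runs."""

def check_deletion_targets(ids: list[str], catalog: dict) -> None:
    for object_id in ids:
        target = catalog.get(object_id)
        if target is None:
            raise DeletionBlocked(f"Unknown ID {object_id!r}: refusing to guess.")
        if target["kind"] == "site":
            raise DeletionBlocked(
                f"{object_id!r} is an entire site, not an app: "
                "escalate to a human before deleting.")
        if target["row_count"] > MAX_ROWS_WITHOUT_REVIEW:
            raise DeletionBlocked(
                f"{object_id!r} would remove {target['row_count']} rows: "
                "flag for review.")

# With the mistaken list from earlier, the script would stop here instead of
# silently deleting customer sites.
catalog = {
    "site-123": {"kind": "site", "row_count": 500_000},
    "app-789": {"kind": "app", "row_count": 12},
}
check_deletion_targets(["app-789"], catalog)    # passes quietly
# check_deletion_targets(["site-123"], catalog) # would raise DeletionBlocked
```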

But it's actually not what's going to hit you necessarily. If there's some other issue at Atlassian, the things they need to fix are their communications process first of all, and the incident management process. They're very close in priority in my mind: they have to get better at conducting this, and they have to get better at telling their customers what's going on and telling themselves what's going on. It's not clear from the report how good they are at doing that, I suppose. But the restoration program and the soft deletes, basically the technical bits of this, are the least interesting bits, and I think it's process and communication that really matter here.

Nora: Well, we're running out of time here, but I do just want to add that a lot of the time in organizations as big as Atlassian, and especially when we see senior leadership completing one of these post-incident reviews, there are a lot of coordination issues internally, not only between teams but between individual contributors and leadership. I have a hunch that this issue was known by a lot of individual contributors and perhaps even brought up to leadership; I mean, it's alluded to in this document as well.

I'm curious how, one, senior leadership responds to these reliability issues or claims before they happen, and two, how individual contributors are able to give evidence for some of those claims, for how likely they find some of them, and for the potential blast radius of some of them. Especially, as we were talking about earlier, as they're adding all these employees and all these companies, that becomes harder and also more work to do, and the potential blast radius of these outages becomes much greater over time. Any other wrap-up items before we sign off, Niall?

Niall: I think there are two things to mention just off the top of my mind. You don't have to believe what we say; you can look at what a large number of other people have said. There's not just the Eric Dobbs piece of course, but I also want to particularly call out Gergely Orosz, who writes The Pragmatic Engineer newsletter and is very active on Twitter as well. He did a huge piece on this in some detail. There's some claim, I think, that Atlassian actually showed the preliminary version of the post-incident report to him.

His big question, apart from the communication piece of course, is that 775 customers: how many actual people does that resolve to? And he has some guesses in his work there, but again, I suppose they're guesses. Anyway, he has a very interesting, detailed treatment of this and I would seriously recommend it. The other piece I wanted to mention comes back to what you said about senior leadership getting involved, and accountability, and so on.

Of course the CTO of Atlassian is no longer with the company and again it's relatively easy to read all kinds of potential motivations or situations into that. But it is in some sense a loss for the company and a loss of knowledge and so on and so forth, but hopefully the company can do better and get better because we're all getting better in the future.

Nora: And just to add, legacy software will never stop being a thing. When I worked at Netflix, we always used the Roman riding metaphor, where the new software isn't quite ready and the legacy software is still in use, and so you're using both at the same time, which inherently becomes its own new system to understand.

It seemed like they were trying to rush through this process a little bit, and there are complications and pros and cons to both, but as software companies in this really fast-paced industry, we have to get really good at deprecating software, and that changes the norms around change in every company. But all right, thanks so much, Niall, and we will see you folks next time.