May 26, 2016
WelpConf: Striving for Stability
On May 12th, 2016, Heavybit member company Opsee hosted their first ever WelpConf. The event featured guest speaker Andy Smith from Wercker,...
Thanks. Hi, everybody. Yeah, so Noah from Heroku. Let's see, I've been at Heroku for quite a while now, over four years, and when I joined it was about 10 people. We were just in a room around here in SOMA. I was actually in LA at the time, and probably ops was like one person. We had Ricardo. He was the guy that you would just send a message and he would fix everything or he knew how everything worked.
Today, we are close to 150 people and close to 50 people on the engineering team, so I've seen quite a big growth over the years. I've learned a ton of lessons over my time here about scaling the platform and scaling the operations, but also just learned a lot of horror stories, a couple of which I'm here to share with you guys. Some side effects that we really didn't see coming as a team and engineering team and some changes that we needed to make to address all that.
A big part of my journey is actually chronicled on the internet for anybody to see. We have the Heroku status site. I was poking around on this thing today, which any of you can do. It's meant to be transparent for all our customers, to be able to go here and see like, "Oh, wow, something is wrong with the platform," and so that means there's a ton of mistakes that we made that are just out here for everyone to see.
There's data here spanning four years back now and 600 incidents that we've since chronicled. This is all really out there but one really sticks out. It was June, 2012. It was the great Amazon outage, I see some nodding heads. There was an electrical storm and Amazon lost a data center, cascading, basically lost the whole AZ. Really, this was a huge operational incident for so many people.
It hit Heroku really bad. It hit everybody really badly. I still remember this very well and I was thinking like, well, how did we ever find ourselves there and then, what have we done since then?
Of course, there's this giant understanding of all of us around Cloud services and Cloud architectures and all these dependencies. It turns out if a data center full of EBS drives goes away then you lose everything, and you can fix that with our architecture.
Single Points of Failure. We've worked on this all the time, managed our availability zones and get rid of state, put it in different places, and worked the failover incidences, but this talk isn't about that. That would be another talk sometime. This represented a really personal problem for me. Some of you might know that this happened when I was at my friend's wedding. If you know Blake, Blake and Elena, also Heroku.
It was Friday night. We were on Treasure Island here and sat down during the reception, a cocktail in, and my boss comes around and he says, "Hey, the Heroku website is down." That's not funny. It's not down. I'm on call and my pager's not going off. But soon we'd realize like, well, PagerDuty was down too. So of course, my pager wouldn't go off.
Basically, it was chaos. The next five hours, we were bunkered down on Treasure Island, had to go back to San Francisco, and the real annoying thing is that there's a bunch of GitHub folks there and their stuff was fine and they were sitting all smug and drinking while we're scrambling to bring Heroku back.
This was a really low point. I still think about this all the time. It's like a day I'll never get back.
It's fine but I really like to think about the people behind all these things. I'm sure there were many people at Amazon whose nights were ruined and everywhere else. I thought, where were we at this point? This is probably when we're about maybe 30 or 40 people, still a really small team back in the day. I think this probably resonates with a lot of us. Even still today, it absolutely does.
We were this team of hackers and we were heroes and these were badges that we liked. "Oh, hack on some software tonight." Hero. Actually, muscle memory kicks in when I was working on these things; it's part of the name, Heroku. We used to sit around and you would say in Campfire, "Need a Hero", and that's when you would summon one of the guys that would come and fix everything.
But we really needed to grow up and figure out, okay, this was an extraordinary event. These are always gonna happen but how can we build a culture where we're actually ready to handle this and not ruin everybody's lives. This has been an incredibly slow and painful process and we're still not done. It's something that Heroku continues to work to improve.
I'd like to just share some of the lessons I learned from this. Maybe you guys might take something from this as you're a little bit smaller and maybe not make some of the same mistakes that we did. When I think about this, here's one of the roots of the problem:
Hackers who write too much software and basically, we have no real process to understand how to take that software and run it in a reliable fashion. Then the heroes, they mask all these problems. You know you can count on that one guy to just fix everything and so you're missing the big picture and you're not building a team that's ready to respond to this.
Process is definitely a dirty word, I know. Hopefully, I'll show you we don't do too heavy a process, but there's definitely some stuff to learn here.
So hackers, again, too much software. Where this really comes from is kind of this "starting building anything is effortless". In fact, I think Heroku probably more than many places lives this because our product is to make starting a project as simple as possible.
Sit down on a laptop. You git init. You git push heroku. And look at this, I'm running a service in production. Great, but what's missing? What I really didn't know at that time, and still struggle with today, is that that means nothing. It's the operations. That's what's going to cost you over the lifetime of this code, and then finishing this thing, whether it's getting it into a maintenance mode or even better, like getting rid of it, is almost impossible.
You've got customers relying on this. You have other dependencies on this and basically, it's not something we value that much. But again, I pulled some numbers to see how bad Heroku is at starting software. We have 1,200 GitHub repos. Who knows what's in there? All really awesome ideas that somebody wanted to start and didn't really get around to making a solid piece of software.
A side effect of all the software is these legacy services. Part of the reason it took Heroku such a long time to recover on this Friday evening was all these legacy services. At the time, we had two routing stacks and in fact, we still do. We have one for Bamboo, that's the old, more Rails-ish platform, and we had the new Cedar, much simplified, but then there's this problem in hindsight.
Well, cool, we have this simpler architecture but we still have the old one and now we have two problems. It's really annoying. This comes up all the time. Two database services. We have the legacy database services that I didn't see the very beginning of, but I'm sure it was like some of the earliest stuff in Heroku and it was still running because these are production databases that customers depend on.
And of course, this one came with the nice property of backups that aren't exercised. We say on the status site it was eight hours of recovery, but we were literally working for a week to restore the last backups in the old database service, sitting on some weird EBS or S3 buckets somewhere, corrupt the whole time.
Not fun. You know, same thing, but there's the shiny new one, but still you have two problems. Do you bring the new one up, but the new one doesn't actually have as much data as the old one? Metrics is the worst. We're constantly reiterating on our metric service and there's all these little experimental features out there at our experimental services. It'll make our lives easier.
But when they all go down at the same time, and that's the visibility that you actually need to bring stuff back, it's horrible. That was probably one of the hardest parts about this.
Again, I said this is something that we have a problem with to this day at Heroku, and we've constantly been trying to figure out how do we change this? We've come up with a pattern these days.
Pretty boring, but it's not too heavy. All it's about is visibility and inventory. This is something we absolutely did not have at the time and could not tell you what was running in production. At least now, we have that.
The goal here is to just make a visible list of all your software, really reflect the ownership, like, is this something that one person did, right? Is it something that two or three people actually know what's going on? Or is it something that's unowned?
Those were always the worst. And reflect the maturity. You can put stuff in the hands of customers, but we know if things are more ready to go: if it's a prototype, development, production, and then past production, deprecated, deactivated or sunset. I'll dig in to all this. Basically, you move through these gates by following a checklist.
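For readers following along, the inventory idea (a visible list with ownership and maturity) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Service` class, the `board` helper, and the example service names and owners are hypothetical and are not Heroku's actual tooling (the talk says the real board lives in Trello). The stage names are the ones listed in the talk.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    """Maturity stages named in the talk, in lifecycle order."""
    PROTOTYPE = 1
    DEVELOPMENT = 2
    PRODUCTION = 3
    DEPRECATED = 4
    DEACTIVATED = 5
    SUNSET = 6


@dataclass
class Service:
    name: str
    stage: Stage
    owners: list[str] = field(default_factory=list)

    def ownership(self) -> str:
        """Classify ownership: unowned, one person, or a team."""
        if not self.owners:
            return "unowned"
        if len(self.owners) == 1:
            return "single owner"
        return "team owned"


def board(services):
    """Group service names by stage, like columns on a lifecycle board."""
    columns = {stage: [] for stage in Stage}
    for svc in services:
        columns[svc.stage].append(svc.name)
    return columns


# Hypothetical example data, not real Heroku services or people.
inventory = [
    Service("router", Stage.PRODUCTION, ["alice", "bob", "carol"]),
    Service("legacy-bus", Stage.DEPRECATED, ["dave"]),
    Service("old-metrics", Stage.SUNSET),
]

for svc in inventory:
    print(f"{svc.name}: {svc.stage.name.lower()}, {svc.ownership()}")
```

The point of the sketch is only that ownership and maturity are explicit fields, so an unowned or single-owner service shows up as soon as anyone looks at the list.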
This is a lightweight process we have in place. Here is our life cycle board. We use Trello a lot. This is a snapshot of last week, so I counted here, we have 83 different services in production. There's the kernel apps. That's the core of Heroku. That's I think something like 20 different services, then we have platform apps.
We self-host other operational services on Heroku, or other things that make the product, and a bunch of libraries and stuff like that, but this is still way too much software. 80 different services, and then past the production side there's a good 20 services there. Deprecated, it's things that we know absolutely we can't put any more things into it, anything on it, but there's still the RabbitMQ bus, it's still sitting around; the one from the beginning of time and we can't get off it yet, but we're really working on it. The visibility here makes this really obvious to us.
We just did a big sprint and there's a bunch of stuff in the Sunset column, and we just shut down 12 different services, so that feels really good. So as I said, there's these checklists that then you use to move things across gates. My buddy was just reading The Checklist Manifesto. If you haven't read that, it's a super important book for operations. This is super simple, again, there's no actual process here.
You just write down steps, but again, you make the requirements visible. You make progress against them visible, so we could go in the service life cycle board and see like, okay, the things in development still need to do this and this and this. This checklist, all it does is write down your best practices. The ones we know from studying architecture and distributed systems, but just as important, if not more important, are dumb mistakes that we made a bunch and need to just stop doing.
If you can imagine the mistake of, like the first one on our checklist is "Code is visible on GitHub". Too many times there's a production service up there and you just can't actually find the code on GitHub for it, like it's on somebody's laptop. That's a very bad sign, that's not production software. Docs, ops docs, and instructions, a.k.a. the playbook. Super boring but super important.
Staging. "Alerts a human if it is down", that itself is a whole talk. Easier said than done, but again, it almost seems too obvious. But we've been caught way too many times with, "Oh, wow, this thing was down over the weekend. Nobody actually knew about it."
Obviously, the whole platform isn't down, but customers are pissed off and frustrated and it's just a simple mistake. Send an e-mail to the engineering list and tell people there's a new thing. Like it's a very simple thing to do, and of course, auto scaling and all that stuff.
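The checklist-as-gate idea above can be sketched as a small Python function. This is a minimal illustration, not the actual Heroku checklist: the item wording is paraphrased from the talk, and the names `PRODUCTION_CHECKLIST`, `missing_items`, and `can_promote` are hypothetical.

```python
# Checklist items paraphrased from the talk; the real list is longer.
PRODUCTION_CHECKLIST = [
    "Code is visible on GitHub",
    "Docs, ops docs, and playbook exist",
    "Runs in staging",
    "Alerts a human if it is down",
    "Announced on the engineering list",
]


def missing_items(completed, checklist=PRODUCTION_CHECKLIST):
    """Return the checklist items not yet done, in checklist order."""
    done = set(completed)
    return [item for item in checklist if item not in done]


def can_promote(completed, checklist=PRODUCTION_CHECKLIST):
    """A service passes the gate only when nothing is missing."""
    return not missing_items(completed, checklist)


# Hypothetical service that has code and staging but nothing else.
done_so_far = ["Code is visible on GitHub", "Runs in staging"]
print(can_promote(done_so_far))
for item in missing_items(done_so_far):
    print("still needed:", item)
```

Listing what is still missing, rather than returning only a yes or no, matches the talk's point that progress against the requirements should be visible.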
A buddy of mine recently left Heroku and went into consulting. He was telling me he was making 10K a consulting session to just give a company this information. People don't really know this or at least it doesn't come down with authority or the history.
Production checklist: super important. And of course, again, you can drill into all this and it could represent months of work to do the logging correctly, but that's okay.
Then I also mentioned the sunsetting. This was a really important cultural change. Again, the hacker spirit is to make new stuff. How can you stop doing that? The answer isn't, well, just don't do anything new. That's a good start, but we wouldn't be happy adding no new features.
The answer is actually to really actively work to shut things down. We at Heroku treat sunsetting as first-class work.
On engineering, again, it's even harder than standing something up to actually get all the customers off of this. Get them onto another system. Get them onto a better system. That's what we're paid to do and that's super important.
You really gotta get the product guys to buy into this because product loves expanding surface area for features and to sell, so you gotta get them to help. These are some of the most painful things we've done, like shutting down Aspen.
The original stack was super old, based on WN4, and obviously we can just turn this thing off in a heartbeat on the engineering side, but product has to reach out and talk to all these customers and you get these sob stories, like people just like, "What's Heroku? I paid somebody $5 a month or I paid somebody $500 to build my website and what are you talking about? I have no idea. Just please don't take my business offline." So, very, very hard.
Then most importantly, celebrate successful shutdowns.
We have a hack here. We have a burn every time we do this. We just did this last Friday. Definitely based on Burning Man, which if you know James or Adam or Ryan, it's one of the cultural things going on at Heroku in general. Celebrate shutting this thing down. Burn it. Nothing lasts forever. It's over. Let's move on and do something else. It feels really good. That's hopefully a few processes that take us hackers and make us more serious engineers.
Then the next one was heroes, these people that really mask these problems. This is pretty well understood, like, "Oh, rock stars and heroes" and things like that, but if some of you are this and feel like this, it is actually super important in the early phase. All this stuff is. Hacking is important. Being on top of everything is really important when you're growing.
But at some point, it just doesn't scale. You have a handful of people that just by individual will are patching everything and keeping everything up and not bugging anybody; not telling you if things are down because, "Oh, I'll fix it in an hour or whatever."
It feels like it's really high speed and certainly, the individual feels really empowered, but it takes a tremendous amount of energy and ultimately just creates this chaos when it comes time to really coordinate around these things, like a serious incident. Again, during the Amazon outage, literally the guy that was running the control plane for Heroku was on a plane and we figured it out but just not ideal.
It wasn't clear how he's been operating the stuff, the tools, the playbooks, and then generally, there's this pattern I've seen where the heroes are not exporting that expertise either. It's just going into this one place. It's just like a knowledge silo.
How can we fix that? Super important, particularly as we scale out the engineering team. One trick here, and this is where the rubber hits the road, is the pager policy.
At the time, again, I was on call, but we had all these people that were maybe just on call for their own little service and basically, you could just do whatever you want, you know it's covered. But these days, I'm sure you can imagine, as you grow, this is something that you really have to think about how to design, and what this design means, because every single page that goes off means something super serious for all the engineers on the other side.
Here's kinda how we think about this at Heroku these days. One, every service is operated by a team. There's no more "one person on a pager rotation". Every engineer on the team is on call. It's kind of a devops thing, but there's some interesting things happening at that scale; not everybody is obviously an ops-type person, so I suspect that will be another challenge to figure out in the future.
Every page is visible to the entire company on HipChat. It seems obvious in retrospect, but it's just this: we want to export all this stuff that's going on and make it available for the entire company to see, whether it's another person on call and they're looking at a problem they can cross reference, or just people hanging out in the lounge or the chatroom or something like that. Then we get to the actual configuration of this and again, this policy encodes a lot of values.
First one, we have a monkey. This is one of my favorite things. There's a monkey on call first and it's a Campfire/HipChat bot and the page just goes to that, and there's a shared understanding, a shared cultural understanding, that during the day, anybody is watching. They're active on the terminal or active on the chat. They can help and jump in or at least acknowledge a page and say, "Okay, I got some time to help it if not fix it", but this is a team sport during the day.
It's after hours where we need more special stuff. Of course, Level One: On Call, pretty familiar, but it's really important to be explicit about what this means. Okay, you carry around your laptop, you carry around your phone, you're expected to, as reasonably as possible, field this thing, run through the playbook and then if something looks weird, you can escalate.
Then of course to Level Two, but to me that's really encoding an explicit value that escalation is okay. We are a team; not everybody can. You know, you're in the shower, you're on the bus or whatnot. There is a team here to have your back and an individual to have your back. This is something kinda new at size.
This is important to me these days. I'm a manager, I'm not an engineer. I am Level Three all the time. This policy says that there's yet one more layer of teamwork, but somebody whose job is to just really be explicit, to have explicit accountability for making sure the people are doing the right things.
In that case, it's like, well, I'm most empowered to know, "Oh, maybe the page escalated because somebody was on vacation and we messed up the configuration this week or the overrides this week," but ultimately responsible for making sure this whole program is good.
And then finally, the Incident Commander. This is where Ricardo stays now. He is the expert, but he is, what is that, five levels away from a page. He's always there. This is a team of people that's here for the entire company and they're the best at not fixing anything, but communicating about stuff.
They do the comms. They know how to update the status site, they're really confident in their writing and language and they know how quickly to post updates. They're the best interface to Amazon. They know the TAMs really closely and know the right file, "Oh, you put a request ID in this ticket and you'll get a quicker response." They have the authority and the confidence to pull in anybody that needs to be here.
It's not always in a disaster scenario, but you can always page an IC if you need help and they will always come in and help you. But basically, there's no room for a hero anymore. You fit into one of these slots and you do your week and then you help out the next week and everybody's happy.
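The escalation order described above (monkey, Level One, Level Two, Level Three, Incident Commander) can be sketched as a simple chain that a page walks until someone acknowledges it. This is a simplified illustration, not PagerDuty's API or Heroku's real configuration: `ESCALATION_CHAIN`, `route_page`, and the `acknowledges` callback are hypothetical names, and the sketch ignores timeouts, day versus after-hours rules, and the fact that an Incident Commander can also be paged directly.

```python
# Escalation order as described in the talk; labels are illustrative.
ESCALATION_CHAIN = [
    "chat monkey (bot)",
    "level 1 on-call",
    "level 2 on-call",
    "level 3 manager",
    "incident commander",
]


def route_page(acknowledges, chain=ESCALATION_CHAIN):
    """Walk the chain until a level acknowledges the page.

    `acknowledges` is a callable taking a level name and returning
    True if that level picks up the page. Returns the responder (or
    None if nobody did) plus every level that was tried, in order.
    """
    tried = []
    for level in chain:
        tried.append(level)
        if acknowledges(level):
            return level, tried
    return None, tried


# Hypothetical scenario: the bot and Level One both miss the page.
responder, tried = route_page(lambda level: level == "level 2 on-call")
print("acknowledged by:", responder)
print("escalation path:", " -> ".join(tried))
```

Keeping the list of levels tried makes each escalation visible after the fact, which is in the spirit of every page being posted to chat for the whole company.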
That's just the PagerDuty stuff. I highly recommend the "Monkey Pattern". Another side effect of that is a lot of pages that are flapping around, they'll just go to chat and then resolve themselves quickly and not cause anybody any trouble. Put monkeys on call and they can fight all these problems, and then you don't need a hero.
Finally, you can roll all this up into pager metrics. Metrics for running a business, everybody knows this, but it's something we've lost and rediscovered from time to time. But again, the main thing here is that this is a whole-company, team-wide issue.
The hero thing again masks like, "Well, I'm used to being paged a few times a night. I know I can ignore it, that's all right." That's not okay.
That's actually not okay for you. It's not okay for somebody else that's gonna have to help out when you're on vacation. We have to really track the metrics for this.
Recently, we've started a site reliability or service reliability team, SRE, pretty common these days, and they're really responsible for the entire program here, but really simple: measure everything, review it weekly on ops review, and then use that data to find the balance of how you're doing along these lines.
Here's the pager metrics for the last 18 months of Heroku. This is kinda averaged over four-week chunks at a time, so it kinda smooths stuff out, but this begins just after that Amazon outage and we were getting on the order of hundreds of pages a week across, you know, still a fairly small team. Totally unacceptable. Totally surprising when you look at it like this. The purple there are the routing guys and they're just slammed.
Really, it's really scary thinking back how bad this was. This is like your top guys are just getting woken up all the time. They're super grumpy. I don't know why some people just didn't up and quit, woken up all night and you're supposed to come in and have this energy to fix everything.
Really not good. You get the metrics out here and you say, "Okay, we have a serious problem and we all need to help this."
This is where you start throwing your SREs at it. You start throwing some other engineers at it and you can see, we've really improved this over time.
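The smoothing mentioned for the chart, averaging weekly page counts over four-week chunks, is simple enough to sketch. The function name `four_week_averages` and the numbers below are made up for illustration; they are not Heroku's actual pager data.

```python
def four_week_averages(weekly_pages, chunk=4):
    """Average weekly page counts over consecutive fixed-size chunks.

    A trailing partial chunk is averaged over however many weeks it has.
    """
    averages = []
    for start in range(0, len(weekly_pages), chunk):
        window = weekly_pages[start:start + chunk]
        averages.append(sum(window) / len(window))
    return averages


# Made-up weekly page counts for one team, oldest week first.
weekly = [100, 120, 80, 100, 40, 60]
print(four_week_averages(weekly))
```

Chunked averages like this hide individual bad nights, which is fine for spotting a trend in a weekly ops review but means the raw counts are still worth keeping around.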
Still, I haven't been able to get tons of comparison of these metrics to other companies. I don't think it's a secret, but we just don't talk about it too much. Some time later, I would love to know how you guys are doing on this, what your pages look like.
But something where I really feel confident that we've turned the cultural corner here is if you follow the Heroku releases. We recently launched performance dynos. A big change to our platform. One of the biggest architectural changes that my team (I work on the runtime stuff) has done in, literally, years.
That went into limited availability basically at the beginning of December, and you can see that we're bringing a service online here. It's kind of annoying weeks. Ten or fifteen pages, but we really took the pages, the page configuration, the metrics, and the operations super seriously and were super terrified going into Christmas.
Can we set this thing down at all and enjoy our stuff? We had weeks of quietness and we launched it somewhere mid-January, late January, and we've been having weeks without pages while launching a giant new service. I really feel confident that we've turned a huge corner here.
To recap. We've had this culture of hackers but we really have to figure out like, okay, you can hack on stuff on your own time. But if you're trying to get this into the platform and into the customers, you really have to be explicit about what you have and make it visible for everybody else, you have to follow a few guidelines in the checklist. Get your logging right. Get your alerting right.
You have to spend some time working on shutting things down, if not a huge amount of your time. And the heroes, you need to get rid of that mentality and just work with the team.
Get in the pager rotation and work to help us all. Lend operational expertise. If you're truly a hero about this thing, teach everybody else how to be the hero too and really help the entire company fix the metrics.
If you do that, you can develop a healthy operational culture and hopefully then, if these things aren't waking us up all the time, we can keep hacking and we can keep inventing and doing the fun stuff. That's it. Thank you.