March 4, 2014 | 26 Min
Thanks. Hi, everybody. Yeah, so, Noah from Heroku. Let's see, I've been at Heroku for quite a while now, over four years, and when I joined it was about 10 people. We were just in a room around here in SOMA. I was actually in LA at the time, and ops was probably like one person. We had Ricardo. He was the guy that you would just send a message to and he would fix everything, or he knew how everything worked.
Today, we are close to 150 people, with close to 50 on the engineering team, so I've seen quite a lot of growth over the years. I've learned a ton of lessons over my time here about scaling the platform and scaling the operations, but I've also collected a lot of horror stories, and I'm here to share a couple with you guys: some side effects that we really didn't see coming as an engineering team, and some changes that we needed to make to address all that.
A big part of my journey is actually chronicled on the internet for anybody to see. We have the Heroku status site. I was poking around on this thing today, which any of you can do. It's meant to be transparent, so all our customers can go here and see like, "Oh, wow, something is wrong with the platform," and that means there's a ton of mistakes that we made that are just out here for everyone to see.
There's data here spanning four years back now and 600 incidents that we've since chronicled. This is all really out there, but one really sticks out. It was June 2012. It was the great Amazon outage. I see some nodding heads. There was an electrical storm and Amazon lost a data center, it cascaded, and they basically lost the whole AZ. Really, this was a huge operational incident for so many people.
It hit Heroku really bad. It hit everybody really badly. I still remember this very well and I was thinking like, well, how did we ever find ourselves there and then, what have we done since then?
Of course, there's this giant understanding of all of us around Cloud services and Cloud architectures and all these dependencies. It turns out if a data center full of EBS drives goes away then you lose everything, and you can fix that with our architecture.
Single points of failure. We've worked on this all the time: managed our availability zones, gotten rid of state, put it in different places, and worked the failover incidents. But this talk isn't about that. That would be another talk sometime. This represented a really personal problem for me. Some of you might know that this happened when I was at my friend's wedding. If you know Blake, it was Blake and Elena, also Heroku.
It was Friday night. We were on Treasure Island here and sat down during the reception, a cocktail in, and my boss comes around and he says, "Hey, the Heroku website is down." That's not funny. It's not down. I'm on call and my pager's not going off. But soon we'd realize like, well, PagerDuty was down too. So of course my pager wouldn't go off.
Basically, it was chaos. For the next five hours, we were bunkered down on Treasure Island, then had to go back to San Francisco, and the really annoying thing is that there were a bunch of GitHub folks there and their stuff was fine, and they were sitting all smug and drinking while we were scrambling to bring Heroku back.
This was a really low point. I still think about this all the time. It's like a day I'll never get back.
It's fine, but I really like to think about the people behind all these things. I'm sure there were many people at Amazon whose nights were ruined, and everywhere else. I thought, where were we at this point? This is probably when we were about maybe 30 or 40 people, still a really small team back in the day. I think this probably resonates with a lot of us. Even still today, it absolutely does.
We were this team of hackers and we were heroes, and these were badges that we liked. "Oh, hack on some software tonight." Hero. Actually, muscle memory kicks in when I'm working on these things. It's part of the name, Heroku. We used to sit around and you would say in Campfire, "Need a Hero", and that's when you would summon one of the guys that would come and fix everything.
But we really needed to grow up and figure out, okay, this was an extraordinary event. These are always gonna happen but how can we build a culture where we're actually ready to handle this and not ruin everybody's lives. This has been an incredibly slow and painful process and we're still not done. It's something that Heroku continues to work to improve.
I'd like to just share some of the lessons I learned from this. Maybe you guys might take something from this, as you're a little bit smaller, and maybe not make some of the same mistakes that we did. When I think about this, here's one of the roots of the problem:
Hackers who write too much software, and basically we have no real process to understand how to take that software and run it in a reliable fashion. Then the heroes, they mask all these problems. You know you can count on that one guy to just fix everything, and so you're missing the big picture and you're not building a team that's ready to respond to this.
Process is definitely a dirty word, I know. Hopefully I'll show you that we don't do too heavy a process, but there's definitely some stuff to learn here.
So hackers, again, too much software. Where this really comes from is kind of this idea that starting, building anything, is effortless. In fact, I think Heroku probably more than many places lives this, because our product is to make starting a project as simple as possible.
Sit down at a laptop. You git init. You git push heroku. And look at this, I'm running a service in production. Great, but what's missing? What I really didn't know at that time, and still struggle with today, is that that means nothing. It's the operations. That's what's going to cost you over the lifetime of this code. And then finishing this thing, whether it's getting it into a maintenance mode or, even better, getting rid of it, is almost impossible.
You've got customers relying on this. You have other dependencies on this, and basically it's not something we value that much. But again, I pulled some numbers to see how bad Heroku is at starting software. We have 1,200 GitHub repos. Who knows what's in there? All really awesome ideas that somebody wanted to start and didn't really get around to making into a solid piece of software.
A side effect of all this software is these legacy services. Part of the reason it took Heroku such a long time to recover on this Friday evening was all these legacy services. At the time, we had two routing stacks, and in fact we still do. We have one for Bamboo, that's the old, more Rails-ish platform, and we had the new Cedar one, much simplified, but then there's this problem in hindsight.
Well, cool, we have this simpler architecture, but we still have the old one, and now we have two problems. It's really annoying. This comes up all the time. Two database services. We have the legacy database service. I didn't see the very beginning of it, but I'm sure it was some of the earliest stuff in Heroku, and it was still running because these are production databases that customers depend on.
And of course, this one came with the nice property of backups that aren't exercised. We say on the status site it was eight hours of recovery, but we were literally working for a week to restore the last backups in the old database service, sitting on some weird EBS or S3 buckets somewhere, corrupt the whole time.
Not fun. You know, same thing, but there's the shiny new one, but still you have two problems. Do you bring the new one up, when the new one doesn't actually have as much data as the old one? Metrics is the worst. We're constantly iterating on our metrics service, and there's all these little experimental features out there, or experimental services. It'll make our lives easier.
But when they all go down at the same time, and that's the visibility that you actually need to bring stuff back, it's horrible. That was probably one of the hardest parts about this.
Again, I said this is something that we have a problem with to this day at Heroku, and we've constantly been trying to figure out, how do we change this? We've come up with a pattern these days.
Pretty boring, but it's not too heavy. All it's about is visibility and inventory. This is something we absolutely did not have at the time, and we could not tell you what was running in production. At least now, we have that.
The goal here is to just make a visible list of all your software, and really reflect the ownership. Like, is this something that one person did, right? Is it something that two or three people actually know what's going on with? Or is it something that's unowned?
Those were always the worst. And reflect the maturity. You can put stuff in the hands of customers, but we know if things are more ready to go: if it's a prototype, development, production, and then past production, deprecated, deactivated, or sunset. I'll dig into all this. Basically, you move through these gates by following a checklist.
This is a lightweight process we have in place. Here is our life cycle board. We use Trello a lot. This is a snapshot of last week, so I counted here, we have 83 different services in production. There's the kernel apps. That's the core of Heroku. That's, I think, something like 20 different services. Then we have platform apps.
We self-host other operational services on Heroku, or other things that make up the product, and a bunch of libraries and stuff like that, but this is still way too much software. 80 different services, and then past the production side there's a good 20 services there. Deprecated, it's things that we know absolutely we can't put any more things into, or anything on, but there's still the RabbitMQ bus. It's still sitting around, the one from the beginning of time, and we can't get off it yet, but we're really working on it. The visibility here makes this really obvious to us.
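The inventory described here, a visible list of every service with its owners and its maturity, could be kept as plain data as easily as on a Trello board. A minimal sketch in Python: the stage names come from the talk, while the service names, owner names, and helper functions are hypothetical and only there for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Stage(Enum):
    # Stage names as listed in the talk, in lifecycle order.
    PROTOTYPE = "prototype"
    DEVELOPMENT = "development"
    PRODUCTION = "production"
    DEPRECATED = "deprecated"
    DEACTIVATED = "deactivated"
    SUNSET = "sunset"


@dataclass
class Service:
    name: str
    stage: Stage
    owners: List[str] = field(default_factory=list)


def unowned(services: List[Service]) -> List[str]:
    """Names of services nobody owns ("those were always the worst")."""
    return [s.name for s in services if not s.owners]


def count_by_stage(services: List[Service]) -> Counter:
    """How many services sit in each lifecycle column."""
    return Counter(s.stage for s in services)


# Hypothetical inventory entries, not real Heroku services.
inventory = [
    Service("router", Stage.PRODUCTION, ["alice", "bob", "carol"]),
    Service("legacy-bus", Stage.DEPRECATED, ["dave"]),
    Service("old-metrics", Stage.PRODUCTION),
]

print(unowned(inventory))
print(count_by_stage(inventory)[Stage.PRODUCTION])
```

The point of the sketch is only that ownership and maturity become queryable once they are written down anywhere at all.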
We just did a big sprint and there's a bunch of stuff in the Sunset column, and we just shut down 12 different services, so that feels really good. So as I said, there's these checklists that you then use to move things across gates. My buddy was just reading The Checklist Manifesto. If you haven't read that, it's a super important book for operations. This is super simple. Again, there's no actual process here.
You just write down steps, but again, you make the requirements visible. You make progress against them visible, so we could go in the service life cycle board and see like, okay, the things in development still need to do this and this and this. This checklist, all it does is write down your best practices: the ones we know from studying architecture and distributed systems, but just as important, if not more important, the dumb mistakes that we made a bunch and need to just stop doing.
If you can imagine the mistake, like, the first one on our checklist is "Code is visible on GitHub". Too many times there's a production service up there and you just can't actually find the code on GitHub for it, like it's on somebody's laptop. That's a very bad sign. That's not production software. Docs, ops docs, and instructions, a.k.a. a playbook. Super boring but super important.
Staging. "Alerts a human if it is down", that itself is a whole talk. Easier said than done, but again, it almost seems too obvious. But we've been caught way too many times with, "Oh, wow, this thing was down over the weekend. Nobody actually knew about it."
Obviously, the whole platform isn't down, but customers are pissed off and frustrated, and it's just a simple mistake. Send an e-mail to the engineering list and tell people there's a new thing. Like, it's a very simple thing to do. And of course, auto scaling and all that stuff.
A buddy of mine recently left Heroku and went into consulting. He was telling me he was making 10K a consulting session to just give a company this information. People don't really know this, or at least it doesn't come down with authority or the history.
Production checklist, super important. And of course, again, you can drill into all this, and it could represent months of work to do the logging correctly, but that's okay.
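The gate idea can be sketched as a list of required items plus a check for what is still missing. This is a rough illustration in Python, not Heroku's actual checklist: the item wording paraphrases the items mentioned in the talk, and the function names are hypothetical.

```python
from typing import Iterable, List

# Paraphrased from the items mentioned in the talk; the real checklist
# is longer and its exact wording is not given here.
PRODUCTION_CHECKLIST: List[str] = [
    "Code is visible on GitHub",
    "Docs, ops docs, and playbook exist",
    "Staging environment exists",
    "Alerts a human if it is down",
    "Announced on the engineering list",
]


def missing_items(completed: Iterable[str]) -> List[str]:
    """Checklist items a service has not yet ticked off, in checklist order."""
    done = set(completed)
    return [item for item in PRODUCTION_CHECKLIST if item not in done]


def can_enter_production(completed: Iterable[str]) -> bool:
    """A service passes the gate only when nothing is missing."""
    return not missing_items(completed)


# A hypothetical service that is part of the way through the list.
done_so_far = ["Code is visible on GitHub", "Staging environment exists"]
for item in missing_items(done_so_far):
    print("still to do:", item)
```

Keeping the remaining items visible per service is what lets anyone look at the board and see what each thing in development still needs.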
Then I also mentioned the sunsetting. This was a really important cultural change. Again, the hacker spirit is to make new stuff. How can you stop doing that? The answer isn't, well, just don't do anything new. That's a good start, but we wouldn't be happy adding no new features.
The answer is actually to really actively work to shut things down. We at Heroku treat sunsetting as first-class work.
On engineering, again, it's even harder than standing something up to actually get all the customers off of this. Get them onto another system. Get them onto a better system. That's what we're paid to do, and that's super important.
You really gotta get the product guys to buy into this, because product loves expanding surface area for features and to sell, so you gotta get them to help. These are some of the most painful things we've done, like shutting down Aspen.
The original stack was super old, based on WN4, and obviously we can just turn this thing off in a heartbeat on the engineering side, but product has to reach out and talk to all these customers, and you get these sob stories, like people are just like, "What's Heroku? I paid somebody $5 a month, or I paid somebody $500 to build my website, and what are you talking about? I have no idea. Just please don't take my business offline." So, very, very hard.
Then most importantly, celebrate successful shutdowns.
We have a hack here. We have a burn every time we do this. We just did this last Friday. Definitely based on Burning Man, which, if you know James or Adam or Ryan, is one of the cultural things going on at Heroku in general. Celebrate shutting this thing down. Burn it. Nothing lasts forever. It's over. Let's move on and do something else. It feels really good. That's hopefully a few processes that take us hackers and make us more serious engineers.
Then the next one was heroes, these people that really mask these problems. This is pretty well understood, like, "Oh, rock stars and heroes" and things like that, but if some of you are this and feel like this, it is actually super important in the early phase. All this stuff is. Hacking is important. Being on top of everything is really important when you're growing.
But at some point, it just doesn't scale. You have a handful of people that, just by individual will, are patching everything and keeping everything up and not bugging anybody, not telling you if things are down because, "Oh, I'll fix it in an hour or whatever."
It feels like it's really high speed, and certainly the individual feels really empowered, but it takes a tremendous amount of energy and ultimately just creates this chaos when it comes time to really coordinate around these things, like a serious incident. Again, during the Amazon outage, literally the guy that was running the control plane for Heroku was on a plane, and we figured it out, but just not ideal.
It wasn't clear how he'd been operating the stuff, the tools, the playbooks. And then generally, there's this pattern I've seen where the heroes are not exporting that expertise either. It's just going into this one place. It's just like a knowledge silo.
How can we fix that? Super important, particularly as we scale out the engineering team. One trick here, this is like where the rubber hits the road, is the pager policy.
At the time, again, I was on call, but we had all these people that were maybe just on call for their own little service, and basically you could just do whatever you want, you know it's covered. But these days, as I'm sure you can imagine, as you grow this is something where you really have to think about how to design it and what this design means, because every single page that goes off means something super serious for all the engineers on the other side.
Here's kinda how we think about this at Heroku these days. One, every service is operated by a team. There's no more "one person on a pager rotation". Every engineer on the team is on call. It's kind of a DevOps thing, and there's some interesting things happening with that at scale. Not everybody is obviously an ops-type person, so I suspect that will be another challenge to figure out in the future.
Every page is visible to the entire company on HipChat. It seems obvious in retrospect, but it's just this: we want to export all this stuff that's going on and make it available for the entire company to see, whether it's another person on call who's looking at a problem and can cross-reference, or just people hanging out in the lounge or the chatroom or something like that. Then we get to the actual configuration of this, and again, this policy encodes a lot of values.
First one, we have a monkey. This is one of my favorite things. There's a monkey on call first, and it's a Campfire/HipChat bot, and the page just goes to that, and there's a shared understanding, a shared cultural understanding, that during the day, anybody is watching. They're active on the terminal or active on the chat. They can help and jump in, or at least acknowledge a page and say, "Okay, I've got some time to help with it, if not fix it", but this is a team sport during the day.
It's after hours where we need more special stuff. Of course, Level One: On Call, pretty familiar, but it's really important to be explicit about what this means. Okay, you carry around your laptop, you carry around your phone, you're expected to, as far as reasonably possible, field this thing, run through the playbook, and then if something looks weird, you can escalate.
Then of course to Level Two, but to me that's really encoding an explicit value that escalation is okay. We are a team. Not everybody can. You know, you're in the shower, you're on the bus or whatnot. There is a team here to have your back, and an individual to have your back. This is something kinda new at size.
This is important to me these days. I'm a manager, I'm not an engineer. I am Level Three all the time. This policy says that there's yet one more layer of teamwork, but somebody whose job is to just really be explicit, to have explicit accountability for making sure the people are doing the right things.
In that case, it's like, well, I'm most empowered to know, "Oh, maybe the page escalated because somebody was on vacation and we messed up the configuration this week, or the overrides this week," but I'm ultimately responsible for making sure this whole program is good.
And then finally, the Incident Commander. This is where Ricardo stays now. He is the expert, but he is, what is that, five levels away from a page. He's always there. This is a team of people that's here for the entire company, and they're the best at, not fixing anything, but communicating about stuff.
They do the comms. They know how to update the status site, they're really confident in their writing and language, and they know how quickly to post updates. They're the best interface to Amazon. They know the TAMs really closely and know the right thing to file: "Oh, you put a request ID in this ticket and you'll get a quicker response." They have the authority and the confidence to pull in anybody that needs to be here.
It's not always in a disaster scenario, but you can always page an IC if you need help, and they will always come in and help you. But basically, there's no room for a hero anymore. You fit into one of these slots and you do your week, and then you help out the next week, and everybody's happy.
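The escalation order described here (monkey first, then Level One, Level Two, Level Three, and finally the Incident Commander) can be sketched as an ordered chain that a page walks until someone acknowledges it. A minimal sketch in Python: the level labels and the `route_page` function are hypothetical, and real paging tools add timeouts and schedules that are not modeled here.

```python
from typing import Callable, List

# Order taken from the talk; the label strings themselves are made up.
ESCALATION_CHAIN: List[str] = [
    "monkey",              # chat bot: the page is posted to the company chat
    "level-1",             # on-call engineer for the owning team
    "level-2",             # backup engineer on the same team
    "level-3",             # the accountable manager
    "incident-commander",  # handles comms and pulls in whoever is needed
]


def route_page(acknowledges: Callable[[str], bool]) -> List[str]:
    """Notify each level in order and stop at the first acknowledgement.

    `acknowledges(level)` is a stand-in for "someone at this level picked
    the page up". Returns every level that was notified along the way.
    """
    notified: List[str] = []
    for level in ESCALATION_CHAIN:
        notified.append(level)
        if acknowledges(level):
            break
    return notified


# Hypothetical after-hours page: nobody in chat, Level One misses it,
# Level Two picks it up.
print(route_page(lambda level: level == "level-2"))
```

Because escalation is just the next entry in the list, missing a page is an expected path through the policy and not a personal failure.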
That's just the PagerDuty stuff. I highly recommend the "Monkey Pattern". Another side effect of that is that a lot of pages that are flapping around, they'll just go to chat and then resolve themselves quickly and not cause anybody any trouble. Put monkeys on call and they can fight all these problems, and then you don't need a hero.
Finally, you can roll all this up into pager metrics. Something that we, you know, metrics for running a business, everybody knows this, but it's something we've lost and rediscovered from time to time. But again, the main thing here is that this is a whole-company, team-wide issue.
The hero thing again masks this, like, "Well, I'm used to being paged a few times a night. I know I can ignore it, that's all right." That's not okay.
That's actually not okay for you. It's not okay for somebody else that's gonna have to help out when you're on vacation. We have to really track the metrics for this.
Recently, we've started a site reliability or service reliability team, SRE, pretty common these days, and they're really responsible for the entire program here. But it's really simple: measure everything, review it weekly at ops review, and then use that data to find the balance of how you're doing along these lines.
Here's the pager metrics for the last 18 months of Heroku. This is kinda averaged over four-week chunks at a time, so it kinda smooths stuff out, but this begins just after that Amazon outage, and we were getting on the order of hundreds of pages a week across, you know, still a fairly small team. Totally unacceptable. Totally surprising when you look at it like this. The purple there is the routing guys, and they're just slammed.
Really, it's really scary thinking back how bad this was. This is like, your top guys are just getting woken up all the time. They're super grumpy. I don't know why some people didn't just up and quit. You're woken up all night and you're supposed to come in and have this energy to fix everything.
Really not good. You get the metrics out here and you say, "Okay, we have a serious problem and we all need to help with this."
This is where you start throwing your SREs at it. You start throwing some other engineers at it, and you can see we've really improved this over time.
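The smoothing mentioned for the chart, averaging weekly page counts over four-week chunks, can be sketched in a few lines. This assumes "four week chunks" means non-overlapping buckets (a rolling average is another plausible reading), and the sample numbers are made up for illustration, not Heroku's data.

```python
from typing import List, Sequence


def four_week_averages(weekly_pages: Sequence[int], window: int = 4) -> List[float]:
    """Average pages per week over consecutive, non-overlapping chunks.

    A trailing chunk shorter than `window` is averaged over the weeks
    it actually contains.
    """
    averages: List[float] = []
    for start in range(0, len(weekly_pages), window):
        chunk = weekly_pages[start:start + window]
        averages.append(sum(chunk) / len(chunk))
    return averages


# Made-up weekly page counts: a noisy month followed by a quiet one.
example_weeks = [120, 100, 80, 100, 10, 14, 0, 0]
print(four_week_averages(example_weeks))  # [100.0, 6.0]
```

Splitting the same counts per team before averaging would give the kind of per-team view the chart uses to show who is getting slammed.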
Still, I would love to, some time later... I haven't been able to get tons of comparison of these metrics to other companies. I don't think it's a secret, but we just don't talk about it too much. I would love to know how you guys are doing on this, what your pages look like.
But something where I really feel confident that we've turned the cultural corner here is, if you follow the Heroku releases, we recently launched performance dynos. A big change to our platform. One of the biggest architectural changes on my team (I work on the runtime stuff) that we've done in, literally, years.
That went into limited availability basically at the beginning of December, and you can see that we're bringing a service online here. It's kind of annoying weeks. Ten or fifteen pages. But we really took the pages, the page configuration, the metrics, and the operations super seriously, and were super terrified going into Christmas.
Can we set this thing down at all and enjoy our stuff? We had weeks of quietness, and we launched it somewhere mid-January, late January, and we've been having weeks without pages while launching a giant new service. I really feel confident that we've turned a huge corner here.
To recap. We've had this culture of hackers, but we really have to figure out, like, okay, you can hack on stuff on your own time. But if you're trying to get this into the platform and out to the customers, you really have to be explicit about what you have and make it visible for everybody else. You have to follow a few guidelines in the checklist. Get your logging right. Get your alerting right.
You have to spend some time working on shutting things down, if not a huge amount of your time. And the heroes, you need to get rid of that mentality and just work with the team.
Get in the pager rotation and work to help us all. Lend operational expertise. If you're truly a hero about this thing, teach everybody else how to be the hero too, and really help the entire company fix the metrics.
If you do that, you can develop a healthy operational culture, and hopefully then, if these things aren't waking us up all the time, we can keep hacking and we can keep inventing and doing the fun stuff. That's it. Thank you.