June 19, 2014
Common Ops Mistakes
PagerDuty's Arup Chakrabarti covers the easily preventable mistakes that all companies (big and small) make and the actual steps to prevent ...
Thanks for the introduction. Here's my background. I'm currently a production engineering manager at Facebook, working on Parse. I was the first Parse ops person hired almost three years ago. We are currently hosting over half a million mobile apps. It's basically mobile back end as a service.
We were acquired by Facebook a year and a half ago. Production engineering at Facebook is basically a hybrid software engineering/operations role. I lead the team that does all of the infrastructure, DBA, automation, and reliability work for Parse.
I have previously been called a sysadmin, operations engineer, release engineer, DBA, and really shitty software engineer. In the past I was also the first hire, as mentioned, for Shopkick, which just sold to some South Korean conglomerate for about $300 million.
Before that I spent five years at Linden Lab. I was one of the first ops hires there. I learned a lot about organizational stuff there because the rate of growth was probably faster than any other company I've ever experienced.
I helped guide that operations team through 10X growth in terms of humans, and almost 1000X growth in terms of machines. I briefly managed the ops team there. I also took a detour for a year and a half or so to design, staff, and build a distributed NOC, which is basically a 24-hour follow-the-sun tier-one ops team, in the U.S., Europe, and Singapore.
By the way, this is basically the first non-technical talk I've ever given. I give a lot of talks, but usually it's about, "Here's what's wrong with your database or here's what's wrong with your systems," so if I get nervous and I start sputtering about database storage engines, it's because I'm nervous and that's my fallback behavior.
But actually, I'm really excited to talk about this, about building ops teams. This is something that I have done repeatedly and I have seen it done repeatedly. I have seen it done badly much more than I've seen it done well, and I have a lot of feelings about it.
By the way, when I use the term ops, I mean it to encompass ops, dev ops, SRE, reliability engineering, production engineering, etc. There are so many terms for it.
I will try not to rant too much about the really stupid ways that people use the term "dev ops" to mean quality operations work done by people who can code. I might have a little dev ops rant in there at some point, but I will try not to rant about it too much.
Since this is an accelerator space, I really want to pitch it to founders who are just starting out, building their ops teams, and trying to figure out how to interview for skill sets that they don't have themselves perhaps.
How to tell when someone is doing a good job if this isn't your core skill set, and what comes next after you've hired someone that you think is good? How do you keep them? How do you keep them challenged? How do you build a healthy culture?
There's basically three parts to this story. Do you really need an ops team? A lot of people think they do, and I would disagree. They certainly don't need an ops person like what I do.
What makes for a really good startup ops hire? Turns out this is actually really different than what makes for a good ops hire at a large organization.
Once you've identified the qualities that you need, when you're at the intersection of the skills that you particularly need for your business in the stage that your business is at, how do you tease those qualities out in an interview? I also want to talk a little bit about how it's extra important if it's your first hire, if it's the foundational hire for a new team.
Do you really need an ops team? A lot of people who think they need ops teams really don't. They just need someone to care deeply about their infrastructure and be passionate about keeping it simple and scalable.
There's an anti-pattern that I've seen startups do over, and over again where there's half a dozen software engineers, they figure out how to do AWS and they figure out how to deploy code. It's really not rocket science. They maybe have a couple of customers, but it's really okay if the site briefly goes down for a couple minutes a week.
At some point, your startup turns a corner and what happens is, you probably have a couple of software engineers who are already better than the others at knowing how to build and maintain infrastructure, and they start getting completely overwhelmed.
They get very unhappy, because all they're doing is ops stuff, it's not their core skill set. It's not what brings them joy. So they get really burned out and frustrated. That's when the team goes, "God damnit. Hire me an operations engineer to fix all this and make all the problems go away."
Maybe they're right, but let's be clear. Operations engineering at scale is a highly specialized skill set. It is not getting a low quality software engineer to do all the shit work for you. It is not getting someone to care about the build pipeline and the deploy cycle, to get alerts in the middle of the night, and own the service reliability so your developers don't have to.
If you want to use the dev ops buzz word, use it there, because that is anti-dev ops. I don't care who you hire. That is the opposite of what you need to do to have a healthy service.
You're not hiring someone to take over the shit work from your developers. You're hiring someone to, and it's important that you acknowledge this, someone who knows things that you do not about running systems at scale.
If you want to hire an ops person worth their salt, your entire team needs to be prepared to listen to them and to integrate them into the work flow.
Instagram, for example, had zero ops people when they sold to Facebook for $1 billion. What they did have was a small team of software engineers who were passionately dedicated to simple, understandable infrastructure; which is one of the core skill sets that any ops engineer worth their salt is going to bring to your company.
They're just going to hammer into your head, again, and again. Don't develop something new. Don't use another database. Don't use another solution when one reusable solution will cover most of your use cases.
They actually have a great blog post about this that they did long before they were acquired that shows their stack. It is brain-dead simple because they're sharing pictures. There's no need to add a lot of complexity. They did an amazing job of implementing that. They got a really long way with that philosophy.
So, do you really need an ops team? The answer is yes if you have really hard operational questions and problems. I'll talk a little bit more about what those are in a minute.
The answer is also yes if you don't have hard operational problems, but your software engineers suck at ops and they refuse to get better.
I think the better solution in this case is to get your developers to be more rigorous about learning good operational practices. At this point, I wouldn't hire an operations engineer who doesn't know how to write code. No one would.
You also should never hire a software engineer who doesn't care about how their services perform in production, who isn't interested in the instrumentation and the reliability, and owning the service from end to end.
Ops teams will build you infrastructure. Developers need to own their services. If you can't, for whatever reason, get your developers to level up and do a better job of ops, you may as well go ahead and hire a middling, reasonably good ops person to fill the gap. They will still bring skills that you don't have in your arsenal, and they can help bridge the gap between total catastrophe like Cowboyville, and reasonably stable production.
But you're not really going to ever attract the top tier talent unless you are offering real, hard, challenging problems of reliability or scale. For example, these are a few things that excite me and other people that I know that I think are really good operations engineers.
Let's just leave aside the whole category of your Googles, Facebooks, and your AWS', because everybody does. They have hard problems to solve, fine. But I'm talking to you guys, so we're talking about startups here.
What kinds of problems can startups have that are hard? That require you to have a world class ops team?
Well one of them is extreme reliability, or extreme security demands. I remember talking to Square, three years ago when they were still genuinely a startup. They were like, "We can never drop an API request, period. We drop 20 API requests, we post-mortem that shit." That's what happens when you're dealing with finance, when you're dealing with peoples' money.
Stripe has the same thing. They don't have problems of, "Oh man, we have hundreds of thousands of requests per second to handle." They have problems of, "We cannot afford to let any of our requests get dropped on the ground, or else our customers will lose confidence in us."
It's no coincidence that all of the startups that I know of that have been really successful in the last few years in the financial space and security space have invested heavily in top tier ops talent. They're chasing those many nines, because they have to.
But most of us don't have those demands, frankly. You don't want to hire people who are just complete idiots about these things, but it doesn't have to be their specialty. Most startups fail, and it's usually not because your website was down for two minutes a week.
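To put rough numbers on "two minutes a week" versus chasing nines, here is a minimal sketch. It assumes availability is simply the fraction of wall-clock time the site is up, which is one common definition but not one the talk spells out; the function names are made up for illustration.

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # 10,080 minutes in a week


def availability_from_downtime(minutes_down_per_week: float) -> float:
    """Fraction of the week the site is up, given weekly downtime in minutes."""
    return 1.0 - minutes_down_per_week / MINUTES_PER_WEEK


def weekly_downtime_budget(nines: int) -> float:
    """Minutes of downtime per week allowed by N nines (3 nines = 99.9%)."""
    unavailability = 10.0 ** (-nines)
    return unavailability * MINUTES_PER_WEEK


if __name__ == "__main__":
    pct = availability_from_downtime(2) * 100
    print(f"2 minutes down per week is about {pct:.2f}% availability")
    for nines in (3, 4, 5):
        seconds = weekly_downtime_budget(nines) * 60
        print(f"{nines} nines allows roughly {seconds:.0f} seconds down per week")
```

Under that assumption, two minutes a week is a bit under four nines, and each extra nine cuts the weekly budget by a factor of ten, which is why the last couple of nines are a specialty rather than a default.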
Another category of really hard problems is your rate of growth. This is the intersection of scalability, reliability, having really good technical judgement, and knowing the landscape.
If you're in a situation where you are reliably 5Xing or 10Xing, hell, even 3Xing year over year, you are going to need top notch operational talent. This is the place where a lot of software engineers, even really good ones, even ones who are really attuned to infrastructure problems, will start tossing processes and best practices out the door in an effort to keep up with the rate of change, which is exactly the opposite of what you need to be doing.
Process, best practice, and taming chaos in general are the bread and butter of really good reliability engineers.
Google has this philosophy that you should never try to design for anything more than 10X what you currently have. They think you have too many incorrect assumptions baked into that planning process.
I think that's really smart, but if you're basically growing so fast that you're 10Xing regularly, you're implementing a brand new infrastructure every one to three years. You need people who can intelligently do that and not learn everything on the job.
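The "every one to three years" figure falls out of simple compounding. A quick sketch, assuming steady year-over-year growth and a design that holds up to 10X today's load; both are simplifications, and the function name is made up for illustration.

```python
import math


def years_until_redesign(yearly_growth: float, headroom: float = 10.0) -> float:
    """Years until load exceeds `headroom` times today's load.

    Solves yearly_growth ** t == headroom for t, assuming constant growth.
    """
    if yearly_growth <= 1.0:
        raise ValueError("yearly_growth must be greater than 1")
    return math.log(headroom) / math.log(yearly_growth)


if __name__ == "__main__":
    for growth in (3, 5, 10):
        years = years_until_redesign(growth)
        print(f"{growth}X per year: a 10X design lasts about {years:.1f} years")
```

At 10X per year you use up a 10X design in one year; at 3X per year it takes a little over two. Either way you are rebuilding often enough that learning everything on the job gets expensive.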
Another category of hard problem is, you really want to solve some core operational problem for everyone on the internet. You want to take on just a seething mass of people who are doing things haphazardly.
Parse is one of these companies that says, "Fuck it, we're going to solve mobile apps for you guys." PagerDuty is a Heavybit company and is a great example of this. They're thinking, "Alerting sucks. Everyone hates it. Everyone is re-implementing this elephant just over and over, and over, and no one wants to do it. We're just going to fix it for the entire internet." Ops is super core to that. They need an ops team.
If you're doing something genuinely new, maybe you need an ops team, but honestly, that happens pretty rarely. I can't even think of anyone who is.
So let's say, hypothetically, you've decided that you have hard operational problems, and you need an ops team. Well, congratulations, you are really doing something right. At this point, you must have actual customers, funding, revenue, and hard problems, so good job.
Just pause for a minute. This is the happiest moment of the entire talk. You decide to start bootstrapping an ops team. There are a couple of points I'm going to touch on.
What qualities make for a good startup hire in general and what are the special challenges of the first hire for the team, or, the foundational member of any new team. This really applies pretty broadly to any type of team, not necessarily ops in particular.
So if you're a founder, and you're saying, "I need an ops team. I'm going to sit back and think about the ideal operations engineer and the list of skills and expertise that I would really like them to possess," and your list probably looks something like this.
Obviously, you want them to be an expert at Ruby or Python, some deep systems knowledge is great. They should also know how to recompile kernel modules. It would be super awesome if they were good at security, and they know how to do TripWire and all that shit.
They should be a great networking engineer. It would be really great if they know how to sniff TCP packets and unpack them. Obviously they need to know Chef and Puppet. All this shit.
Oh, you know what else would be awesome? If they only want to work for $25,000 a year plus stock and snacks and 0.01 percent equity. So, you want a unicorn? Well, we all want unicorns, but we don't get them.
What you do get, hopefully, is an engineer. An individual with very specific strengths and weaknesses. A background that may or may not be relevant to what you really need.
This means that you need to get really ruthless about narrowing down the list of skills. Not even skills, but the list of strengths that will cause your company to succeed. I'll talk a little bit more about how to identify your strengths and how that fits in a little bit later.
I will say this. In my experience, for startup engineers across the board, one of the strongest predictors of success is something I've heard referred to as a T-shaped engineer.
A T-shaped engineer is broadly literate across technology, across a range of technical topics. Ops is so broad: you could specialize in network engineering, routing, operating systems, or a million different kinds of databases. You could be a caching engineer.
They should at least be able to go and speak intelligently across a stack and they should have demonstrated the ability to go deep into at least one area.
When Parse was interviewing me, I had literally never used any of their core technologies. I had never used AWS. I had never used Ruby, Chef, Mongo, Redis, Cassandra, the list just goes on. I had never used any of this shit.
I had experience scaling really fast and I had experience in mobile, several times, with 10X companies and that was what they wanted. They were very clear on that point and so they didn't get distracted by things like, "You don't actually know how to solve our Mongo problems when you're coming in the door." They just assumed that I would learn that shit, and they were right.
By and large, ops people are really good at picking up whatever it is that they need to know.
In my experience, there are a few qualities that all good ops engineers have in common. They may not be the best software engineers and they might not write the best code the fastest, but they are allergic to doing the same thing twice, and they have sufficient coding skills to make that happen.
All good ops engineers feel really, personally responsible when their stuff breaks. Most good ops engineers have a really hard time taking vacations. Great operations engineers, and again, I'm going to extend to great engineers period, have really strong opinions on a wide range of technical topics, but they're not dogmatic about them. They're not religious about them.
They're aware that every engineering decision involves countless trade-offs, and they're persuadable. Emacs versus Vim is different, we'll just leave that aside. It's totally reasonable to be religious about your editor.
All great ops engineers really strive to simplify their architecture. They want to maintain the fewest things possible. They strive for reusable solutions. They strive to reduce complexity and they will push back whenever you want to add a new element to the stack, just by reflex.
It's a negotiation. I think about ops people the way I think about good lawyers. It's a good lawyer's job to never say, "No." It's to tell you how to get to "Yes." Maybe you don't like that answer. Maybe you're not willing to make those trade-offs, but a good operations engineer is there to help you get to where you want to go.
Ops engineers value process, and the reason is not that anyone loves filing tasks or attending SEV reviews. It's that process is what prevents you from making the same mistake over, and over, and over again.
Empathy is the thing that the dev ops movement is really about. It's not about hiring dev ops engineers, because dev ops engineers are not a fucking thing. It's about managing the constant tension between the need for software engineers and product people to ship code, to get features out there to help the company move forward and succeed, and the need of the infrastructure people to make sure that it happens safely and doesn't take anything down.
Dev ops means you have shared responsibility for both goals. It means ops people are saying, "Yeah, yeah, I get it. We need to ship stuff as fast as possible. Let me help you figure out how to make that happen." The software engineers are like, "Yeah, we don't want to get paged in the middle of the night either, so let's figure out how to make that happen."
I will also just say, anecdotally, every good ops engineer I know, is a fucking wizard at Bash. Otherwise, it would just vary, but everyone knows Bash.
What's not on this list? Things that do not predict great startup ops engineers: being great at whiteboarding code, or at any specific technology or language.
I'm also going to say this, I want to talk about the big company pedigree issue. VCs love big company pedigrees. They love it when you hire someone from Google or Facebook. The people who come from Google or Facebook have definitely demonstrated some superior skills at certain things. They have demonstrated their ability to work in very large, very structured environments on a problem that is a slice of a slice, of a slice, of a slice, of a slice, of a slice of a problem.
I think this is a trap. You're selecting for a different set of criteria when you're hiring for Google and Facebook than you are when you're hiring for a startup. Some engineers can cross that gap, and really shine in both areas. But they are rare. Most of us gravitate towards one or the other and can function okay in the other environment.
If your engineer from a big company has been at that big company for a very long time, especially if they went to that big company straight out of college, they don't live in the real world, in technological terms. They can learn, but they're starting from a place of being pretty far behind.
What does the real world use for a web server? Or for a load balancer? Or for a database? Or for a caching solution? Or for development tools? Or for continuous integration? I know an amazing SRE at Google who has said, "What's a LAMP stack?"
I'm not trying to say they can't learn, they can. But I feel like it's about the same amount of risk hiring someone straight out of a long tenure at a big company and hiring your first remote employee. It's a big cultural shift. They don't know if they're going to like it. You don't know if you're going to like it.
There's also a kind of learned helplessness that can set in at big companies after a certain amount of time. This is basically how I feel about the characteristics that make you succeed at a big company or a small company.
People who really, really love working at startups are often the kinds of people who will implement 80 percent of a solution and then move on to the next most important thing. That's usually what you want at a startup. At a big company, that's not really looked upon as a good thing.
Ops at a startup is fundamentally a very reactive role. Every role at a startup is a fundamentally very reactive role. Things are changing behind you all the time, but ops especially. At a big company that's really doing it right, it's not.
Now let's talk about what happens if this is your very first hire for the team. The founding team member for any new team is incredibly important, because it sets the stage. It sets the foundation and it sets the culture for all of your subsequent hires. You're going to rely disproportionately on their judgement to tell you if anyone else you hire is doing well or not.
You're going to rely on their judgement to help you interview and hire anyone else for that team. If you get this right, you can substantially delegate a whole bunch of technical decision making to someone who can do it better than you can. If you get this wrong, you will have a messy clean up job on your hands.
So your first hire is your unicorn. Technical judgement is incredibly important. This person is going to be making decisions for your company that will be with your company until literally the day it dies.
At Linden they are still using things that the other founding ops engineer and I were embarrassed about hacking together 10 years ago. It's still there. They don't have to do everything perfectly, but it's important that your first hire makes more good decisions than bad decisions.
Your ops engineers have to have enough credibility with your software engineers that they will listen to them and so your ops team can actually have the power and the credibility to get shit done.
Your life will also be so much easier if your first hire is fairly senior and can grow into that tech lead role or management role.
It will be very awkward if they aren't capable of this and you have to hire or grow someone to manage them later. Great operations engineers are great communicators, full stop. They're good at getting everyone on board with the vision for the team. This is what needs to happen. This is why we need to pivot from building features to working on reliability. They have to have that credibility. They have to know how to communicate with the other teams, and let me just say, if you find these qualities in someone, it's worth paying for them.
Senior people who are worth their salt should know what they're worth. If you're hiring software engineer number 25 at the same time as you're hiring what you want to be your ops team lead, you should not be offering them the same salary, stock, etc., as software engineer number 25.
If you believe you've found the right founding team member and if operations is truly core to your business, you should compensate them the same way you do your software engineering tech lead.
We've talked a lot about what makes for good startup engineers and how what you're selecting for might be different from what they're selecting for at really big companies.
Now let's talk a little bit about the interview process, and how I would identify the qualities that you really need and how to hire those people.
Interviewing and hiring is hard. Nobody in this industry does it very well.
Not even people who have entire divisions, like PeopleSoft, dedicated to trying to figure out how to tease out good signal. But if you don't have someone who already does ops on your team, the best way to get a feel for the job is to do it yourself.
Kevin and Ilya did a great job of this at Parse. They did this not just for the first ops hire, but also the first marketing hire and for the first sales hire. They really dug into the challenges that we were having and tried to do the job themselves. That was really helpful for them to interview and select for the right people.
You have to be literate about the problem set before you can screen for someone to have it. I owe a lot to Ben Horowitz for this section. His book, if you haven't read it, The Hard Thing About Hard Things, is amazing. Read it. But you need to figure out what strengths you really need, and hire for those, not for lack of weaknesses.
This is something that big companies do really badly because over time, as they get more and more applicants and they have more and more ability to filter, they hire more and more for lack of weaknesses instead of the specific strengths that they need. This is great for you guys because it means that there are lots of engineers out there who can't or won't get hired at the Googles and Facebooks, but have incredibly powerful strengths.
If what you need for your business to succeed is for someone to care about that fourth and fifth nine of reliability, you don't need someone like me, because I get bored around 99.99. I'm like, I'm done, my job is over. I do not care about those last packets.
If you need someone who can scale up really fast, hire for that. If you need someone who can really act as a software engineer but who cares more about the build pipeline and wants to work on developer tooling to support your team, hire for that.
If you're hiring a founding team member, that is one of the strengths that you need to select for, and weigh it against their other weaknesses. Then when you're hiring future people to fill in the rest of the team, look at the weaknesses of the team you have and try to fill those.
I got hired as the first ops engineer on Parse and I hate monitoring and graphing. I hate it. I can do it, but I'm not very good at it. I find it absolutely loathsome as a thing to spend time working on. So the first person that I recruited, and hired, was my friend Ben Hawthorne who loves monitoring and graphing. I could limp us along for a ways but I knew we needed someone to fill that weakness.
What makes for a good interview question? Good questions are really leading, and really broad, and have lots of correct answers. They really give you a chance to plumb the depths of someone's background and their knowledge, and the approach that they're going to take to problem solving.
If you're giving them a coding test, don't lock them in a room with no internet and expect them to solve it without the resources that they would normally have in their day to day job. I know some people who do that, I think that's insane.
One sample question that I really like is, "The site seems to be slowing down today. It's two or three times slower than it was at this time yesterday, and we can't figure out why. Where do you start?" There are just infinite numbers of ways to approach that question.
Or, it's a classic, but it's a good one. What happens when you type Google.com into a browser? There's so many directions that somebody could go with that question.
My favorite question might be, "What is the most catastrophic mistake that you've ever made and how did you recover from it?" You don't want someone who will say, "Well, you know, I pulled out the power cord or something." No, no, no, no, no, no.
Every engineer who has been around for a couple years is going to have some story of something that they did which is truly atrocious. You need someone who's going to be cool under pressure. When the pager goes off at 3 A.M. and you're all up on HipChat or Slack trying to figure out what the fuck is going wrong, you need someone that you want to work with under those circumstances, because you definitely will.
I also ask culture questions. Ask how they felt about their past jobs. If the answer is, "Well, this one sucked and that one sucked and this was bad because of this," pay attention to how they talk about their former colleagues and co-workers.
It's okay to have a bad experience or two, but if you have a chain of them, the common denominator there is probably you.
And if they have the attitude of learned helplessness towards their culture and their environment, ask them what they did to try to change the things that they hated about their past jobs. If their answer is nothing, that's not a good signal for a startup hire, because at a startup, everything is everyone's problem. You don't get to delegate it to some other team who is supposed to take care of these problems somehow. Learned helplessness is a startup killer: kryptonite.
So, congratulations, you hired an ops engineer. Now what? Do you just hand them the keys to the kingdom and take your developers off the pager rotation and stop caring about reliability and scalability?
No, you do not. This is where dev ops comes in. This is what dev ops means. Dev ops is not hiring a dev ops engineer, it is making your software engineers care about the health and reliability of their own services.
Operations engineers are an amazing resource for helping you do this. They are really good at building robust infrastructure, but they should not be expected to wholly own services that software engineers are writing. This means shared pager rotations. This means including your operations team in the design phase of every new product.
I love this quote from my friend Rick Branson from Instagram. There is no such thing as a dev ops engineer, there is no such thing as a dev ops team. Now, I get that we are probably losing the linguistic battle against this because it's just become this buzz wordy thing, and that's fine.
Some day I'll come to terms with this, but dev ops just means developers and operations having crossover skills and caring about what each other needs to do and helping each other get that shit done.
If you're advertising for dev ops engineers, just be aware that you're irritating a significant subset of the target market that you're trying to hire, some of whom happen to be at the very top of their industry.
The important questions are always, "Can they execute technically? Can they design processes? And can they handle incident response?" These are operational questions. You just need operational skills, whatever title you call it.
How to spot bad ops people. They try to wall off ops from dev. They add complexity. They don't admit when they've broken something or they don't know something. Fire those people, don't work with them.
On the flip side, here's how to lose really, really, good ops people. Blame them for things when it goes wrong. Don't have hard and interesting problems. Give them all the responsibility, none of the authority, all the crap work, and no other respect. They should quit.
At the end, I just want to make a small pitch for treating people really, really well. I don't mean perks, snacks, meals, and off-sites, etc. I mean really caring about your culture and putting thought into it. Encouraging people to go on vacation, to have healthy work-life balance. To not treat each other like jerks, ever.
I worked at this one startup, I'm not going to name names because this is going to be on the internet, but the founders consistently called it out and glorified it whenever the engineers were in the office until four A.M. the night before a big release. They'd be like, "Ah, you guys are heroes. This is so exciting, look at you!" Guess what? Every single fucking release, engineers were in the office until like four or six A.M.
The patterns that you call out and celebrate are the ones that are going to get repeated. So pay attention to what you're actually giving people props for. They saw this as a great success every time. "We hit this deadline." Oh, good job for staying up all night writing really crappy code because you're so tired. It was not a success. Every single time that happened, it was a failure.
A lot of you are founders or early employees and you're carrying a lot of stress for the success of your company, but caring about culture is not orthogonal to your success. It's really critical for your success in this industry. This is a really small valley. Most startups fail. None of us are going to be at the job that we're at forever.
If people love working for and with you and if they want to follow you from company to company, that's like a super power.
Having people who will vouch for you to their friends and say, "Yeah, working with that person was awesome, and you should totally trust them," that is some powerful shit. That can be the ingredient that makes or breaks a startup.
For every single one of the engineers on my team, I genuinely, wholeheartedly believe that being on my team right now is the best thing that they can do for themselves, for their careers, for their growth.
If I didn't believe that and if I couldn't fix it, I would tell them. Because a few years down the road when I'm at my next startup and I'm pitching those same engineers to come with me to yet another startup, I want them to trust me when I tell them this is the best thing for them. For their career, and for their growth.
But you have to mean it. Sometimes that means making hard choices between what's right for your company and what's right for your employees in the short term.
Over the long term, I think there's no question that investing in your people is always the winning strategy.
That's about it. I have some thanks to Kevin and Ilya, and my friend Mark who would be here, but he's sick.