October 31, 2019
In episode 12 of O11ycast, Charity Majors and Liz Fong-Jones speak with Rich Archbold of Intercom. They discuss the crucial importance of timely shipping, high-cardinality metrics, and the engineering value of running less software.
Charity Majors: Does Intercom ship code on Fridays?
Rich Archbold: We ship code every day of the week.
We actually ship code at the weekend if we need it.
For me, the question about "What day of the week do you actually ship code on?", the answer to that is more like "How do you actually think about shipping--"
The question is, "Do you actually feel confident in your software development and software deployment processes?"
For me, I think about how our CICD is generally about five minutes long.
We can actually fully roll out or roll back any change in about 10 minutes' time.
We ship almost all of our changes under feature flags, and that means we can actually turn features or code on and off at the flip of a button.
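[Editor's sketch] To make the feature-flag idea concrete, here is a minimal illustration in Python. It is not Intercom's implementation: the `FeatureFlags` class, the `new_inbox` flag name, and `render_inbox` are all hypothetical, and a real system would keep flags in a shared service rather than in memory.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureFlags:
    """Hypothetical in-memory flag store, for illustration only."""
    enabled: set[str] = field(default_factory=set)

    def turn_on(self, name: str) -> None:
        self.enabled.add(name)

    def turn_off(self, name: str) -> None:
        self.enabled.discard(name)

    def is_on(self, name: str) -> bool:
        return name in self.enabled


def render_inbox(flags: FeatureFlags) -> str:
    # The new code path is already deployed, but stays dark until the flag is on.
    if flags.is_on("new_inbox"):
        return "new inbox"
    return "old inbox"


flags = FeatureFlags()
before = render_inbox(flags)   # flag off: old behavior
flags.turn_on("new_inbox")
during = render_inbox(flags)   # flag on: new behavior, no redeploy needed
flags.turn_off("new_inbox")
after = render_inbox(flags)    # flag off again: effectively a rollback
```

The point of the pattern is that deploying code and releasing behavior become separate steps, so turning something off does not require another deploy.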
Charity: This is even more impressive when you realize that you guys run Ruby.
Rich: Yeah. We run Ruby on Rails on the back end, Ember on the front end.
Shipping is our company's heartbeat. We absolutely believe speed of deployment and speed of development is a competitive advantage.
Charity: I love that phrase.
Rich: Basically everything we do is oriented toward shipping code as frequently and as fast as possible.
Now caveated with that, occasionally something actually isn't going to be able to be fully de-risked or rolled back: a massive MongoDB major version upgrade, or something like that.
Absolutely we're not going to do that on a Friday evening.
Charity: Right. You're not going to yo-yo a DB upgrade.
"Shipping is your company's heartbeat," I love that.
I think that really gets to the core of this argument that Liz and I have been having with the world about software.
But this feels like a good time for you to introduce yourself.
Rich: My name is Rich Archbold. I am a senior director of engineering at Intercom.
I actually run Intercom's back end and foundations engineering group.
I've been with them for about five years now, I'm based in our--
For anybody who doesn't know Intercom, I'd say go to the Honeycomb website and check out the customer chat widget in the bottom right hand corner. That's us.
Our mission is to make business personal, and we actually provide customer communication software that helps every part of the business talk to its customers.
Charity: We use it extensively, and you guys were pretty much our first real serious customers too.
You remain to this day the only customers that have ever been really sold by the term "High-cardinality dimensions."
Liz is staring at me because it's one of the first things she said to me, "This is not going to work."
Rich: Before actually working at Intercom I used to work at Facebook and also worked at Amazon, so I've seen the inside of two massive megacorps.
Then I've also seen all of the tooling on the outside of the world, and really-- Scuba at Facebook was this game changer.
Charity: Totally a game changer.
Rich: It was something that Amazon had never seen or had any example of, so I actually knew this was gold working at Intercom because we had actually come across the need for all of these high-cardinality metrics.
We called it "Per-app metrics," or "Per-customer metrics" is really what we were looking for.
We ended up actually running two different Graphite infrastructures, which was taking just hours and hours.
Engineers and engineers worth of--
Charity: We had done the same with Ganglia at Parse. Just running, trying to pre-generate all of the dashboards for every single individual user, because that's what you need.
Rich: Yeah. We also tried running all this stuff through Datadog, and we actually crippled our Datadog instance because we just had so many metrics in there. Everything actually slowed down.
Charity: Turns out, you care about everything combined with everything else.
Liz Fong-Jones: But this isn't a Honeycomb ad.
So, let's talk about how you got to that state of having shipping code be your lifeblood.
Did it start from the very beginning, or was it something that came to Intercom later in its life?
Rich: I think this is actually one of the interesting bits about the Intercom story, is that in the first six months of Intercom's creation as a company our CTO and VP of engineering decided we actually needed to have a world class CICD system and built their own CICD system at the time, which is an internal application called Muster.
This is the thing which allowed fully-automated push on green CICD.
Charity: Now, this is so interesting to me. I love talking about Intercom's engineering culture.
It's one of the companies at Honeycomb we've always seen as an older sibling.
Farther along, but believing so many of the same tenets.
I first realized this when I heard you talking about your approach to software, which is "Run less software."
Rich: Run less software, yeah.
Charity: I love that.
Rich: "Run less software" for us, it's probably my favorite Intercom engineering value.
It came about shortly after I joined.
We had our first infrastructure offsite, and we were just talking about MySQL and whether or not we actually needed to hire a dedicated MySQL engineer, or whether or not we needed to hire a dedicated Graphite engineer.
Liz: This was in the days when RDS was not necessarily the oldest thing in the book.
Rich: This was maybe three years before Aurora. We were thinking, "We're going to have to spend dedicated engineering headcount on these tiny things that--"
Charity: You don't want to have to care about.
Rich: "Customers don't care about." Our CTO was just railing against this, and he said "I want to run less software, not more software."
This actually was the beginning of this concept. For us, "Run less software" is one of our competitive advantages, it's how we actually lean into high quality, high speed, low cost software engineering.
We focus on writing software for things that our customers care about, we actually build them out of a small set of core technology components.
These are Ruby on Rails, Go, Ember, AWS, Aurora, MySQL. By focusing on these things we can actually train our engineers in them and make them absolute experts in them.
Liz: You have a standard set of technologies and a standard set of frameworks.
That was part of what I really liked about when I used to work at Google, was that I could parachute into any product at--
And I could understand, "Where is the monitoring located? Where is the telemetry? How do I restart this?" All standardized.
Rich: It actually creates fungibility amongst your engineers.
Engineers are able to move from system to system, and it makes the operations easy because you can centralize it.
Charity: You also have this very strong focus on "Everyone builds things that impact users."
That's the thing that I hadn't really thought of as an infrastructure engineer.
Yes you have need for infrastructure, but you keep it as little as possible because you want all your engineers to be thinking about being the customer.
I think it's very interesting and very fortuitous and makes sense that you, with that firm focus on the customer, that you would find us that early on.
Liz: The last thing anyone wants to do is have to build their own monitoring system, and build their own observability system.
Charity: That being able to see the world from every single user's perspective, every single app's perspective, is absolutely central.
Rich: Everybody knows that once you've got a platform that has tens of thousands of customers on it, a system-level SLA of 99.9%--
Charity: It covers over so many sins.
Rich: Yeah, that's way too low a bar to be thinking about things. You need to be thinking about how your customers experience it.
Charity: Sometimes some of your highest-paying and most important customers will be barely a fraction of a percent.
Rich: Enterprise customers who would use a tiny feature once or twice, but that's actually super important.
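[Editor's sketch] A small worked example of why a system-level number "covers over so many sins". The request log below is invented for illustration: the customer names and counts are hypothetical, chosen so that the aggregate success rate is 99.9% while one small customer fails on every request.

```python
from collections import defaultdict

# Hypothetical request log: (customer_id, succeeded)
requests = [("big_co", True)] * 9990 + [("small_enterprise", False)] * 10


def success_rates(log):
    """Return the overall success rate and the success rate per customer."""
    totals = defaultdict(int)
    ok = defaultdict(int)
    for customer, succeeded in log:
        totals[customer] += 1
        ok[customer] += int(succeeded)
    overall = sum(ok.values()) / sum(totals.values())
    per_customer = {c: ok[c] / totals[c] for c in totals}
    return overall, per_customer


overall, per_customer = success_rates(requests)
# overall is 9990 / 10000 = 0.999, yet per_customer["small_enterprise"] is 0.0
```

Grouping by customer is the high-cardinality dimension being discussed: the aggregate looks healthy, and only the per-customer view shows that one customer is completely broken.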
Liz: Tell us a little bit more about how you think about your platforming decisions. How did you arrive at Ruby and Go and Ember?
Rich: That's a really interesting question. This is actually one of the ones where I think there is no right or wrong answer.
I think the right answer is that you have standards, and that you have made choices. I think it's actually way less important what those actual choices are.
For us, our CTO was a Rails engineer and loved Rails.
Some of our earliest software development hires were Ember specialists, so therefore we had Rails on the back end and Ember on the front end.
But I could see some other companies saying "We're going to have Python on the back end, and we're going to have--"
Charity: It doesn't matter as long as you pick something.
Rich: Yeah, it doesn't matter once you pick something.
Charity: Speaking of choice, though. You also have a completely voluntary opt-in on-call rotation.
This is something I've been repeatedly told is completely impossible. So, tell me how you got there.
Rich: That's a fun story. I love on-call.
Charity: Me too. I think on-call is great.
Liz: Me three.
Rich: But I also respect that some people find it super stressful, and some people don't want to do it, and some people are in different stages of their life where they've got different things going on outside of work or whatever.
Charity: I always said that was just excuses.
Then I had an engineer on one of my teams at Facebook who was totally willing, but was so anxiety-ridden that he wouldn't sleep all week.
At first I was like, "He'll get used to it."
He never did.
I started to feel like I was just torturing the poor fellow, like he couldn't--
So instead of being on-call and carrying a pager, he became the person who was on-call for the CICD system every other week, and everybody won.
Having production excellence means involving everyone in some capacity, not necessarily putting them on-call in the evenings and weekends.
Charity: The on-call system, if something's not-- You don't suffer. Right? Making it palatable to have end of life.
Rich: I actually found by making the on-call system voluntary, but both paid and recognized and rewarded, and everything like that for the specialist skill it is.
By making it voluntary, so during daytime hours all of the teams are on-call for their own systems and it's only at nighttime--
Charity: That's how you achieve the feedback loops.
Rich: Yeah, the feedback loop is incredibly powerful. Nobody ever wants their team to be the one which is actually waking somebody else up.
One of your peers doing it voluntarily, so that peer pressure feedback loop makes sure things actually get fixed way quicker than if it was the team carrying their own evening on-call load.
Charity: It actually helps to accrue some status to the engineers who choose into it too, because it's not a low-status thing where "You're the lowest man on the totem pole. We're going to make you do this." Instead, it flips the script.
Rich: "You're the superhero who's able to take on this thing."
Charity: "You have to earn this role. You have to be this tall to ride this ride, and once you're able to you get paid for it." Yeah, totally.
Rich: I think it's certainly the prestige, though. Like "I'm on the strike team." We call it "The strike team," it's like a superhero thing.
We actually have a bunch of swag around it, and it's also super well-supported by the leadership team.
Everybody who is a director or above is on strike L2, level 2 on-call.
Basically we tell the on-call engineer, "Listen. If you get paged, follow the runbook.
If you can't actually follow the runbook or feel unsure, nervous, anxious in any way, shape or form, page L2."
Level 2's job is then to come in and be the incident commander, take control, help you out and make it easy.
Charity: Pull other people in.
Rich: Absolutely. Ring other people, ring the engineering manager, thank the person for escalating.
It's real, "Thanks for calling me and I'm here to help. We appreciate a great job, thanks. How can I help?"
Liz: Yeah, but you were talking earlier about the idea of wanting your service to be reliable enough so that it doesn't page the volunteers in the middle of the night.
How did you get that to work with pushing changes on Friday?
Rich: When you have a 5 to 10 minute feedback loop on pushing a change, and when you're actually able to push that change out under a feature flag, and when you're pushing it out in a monolith which is safe and well-instrumented.
When you're actually pushing it out you're following a process which has good principal engineer review over it, the likelihood of it breaking or breaking 1 hour later is just so low. If it's going to break, it's going to break in those first 5 minutes. You're still around and you just roll it back.
Liz: You have to be intentional about making your system, it sounds, not leave time bombs for the on-call engineer.
Charity: You have to have done it enough that you're not afraid of it.
You don't fear, because you know what to do if something goes wrong.
You know how to catch it and you know how to fix it, and you know it won't take that long.
Liz: And you know it won't ruin someone's weekend even if they do get a page. They can turn the feature flag off.
Rich: We deploy about 1,000 times a week.
Charity: That's awesome.
Rich: If you are determined to deploy code about 1,000 times a week and have each one take about 10 minutes, it's just hard to be afraid of it.
Charity: Yeah, that's true.
Rich: We just go, "We do this so often. It's just how we work."
Liz: Exactly. One of the conversations I was having a while back was about the idea of "If you push code automatically every single day, it becomes weird to turn the system off."
Rich: Our actual biggest problems are when we actually haven't deployed code for an hour, because then you have five or six changes backed up at the one time.
Charity: All the worst outages of my career have happened after freezes for holidays. All of them.
Because you no longer know what it is you're shipping, and you've got all these changes in there together jumbled up and nobody can remember what they were doing.
It's hard to find the problem when something goes wrong.
Rich: The safest change is the smallest one.
Charity: Yes. That's what they said in Accelerate, "Our intuition tells us that if we just slow down and we're more careful, then the errors will fall."
But in fact, it's the opposite. They work in tandem.
Your velocity increases and your errors decrease.
Liz: Yeah, there is an interesting thing in the 2018 Accelerate State of DevOps Report in which they pointed out the misguided performers that were slowing down and getting the worst of both worlds.
Rich: Yeah, I think even people understanding what "Fast" is and what "Frequent" is, is interesting. I remember--
Charity: Yeah. What is the time interval between when your engineers commit the code and when it's being used by users?
Rich: About 10 minutes, 5-10 minutes.
Charity: That's really good, and that's what they called out in that book as being the best statistic that tells how good a team is, how efficient they are, how much of a high performer they are.
And yet, almost no teams I know are tracking that.
Rich: I remember being at Facebook and when they would ship code twice a day it would take several hours.
There would be 50 or 60 people showing up on IRC to check in their change and support it, and I remember thinking "This is crazy. This is not fast."
Charity: Facebook is now fully-automated, finally.
Rich: That's great.
Liz: That's a question of, "Do you do this early on? Or do competitive pressures make you eventually have to do it, and at a huge cost?"
Rich: I think it's interesting as well, how much do you actually care about it?
Because I think it's easy to have it at the start when you're tiny, but then as you get bigger and bigger and bigger, are you willing to fight for it?
Are you willing to fight to keep it? Are you willing to create whole teams in order to lean into this.
Charity: Because it's not costless.
You have to choose to work on that instead of something else.
Rich: We have more CI servers than we have any other server type, but we've done that because we've gone to EC2 spot instances.
We used to use about three different third-party CI services, and at the scale we're at now in order to have the speed that we want and the quality that we want, we've had to go active/active to have dual CI providers racing against each other.
One of them is us, we've had to become our own CI provider.
This is one of those things where I would love to outsource this undifferentiated heavy-lifting, but at the scale and speed at which we want to do it, it is no longer undifferentiated heavy-lifting. It's actually a competitive advantage.
Charity: What's interesting to me is that you said your CTO built that from the very beginning, back when you would not have thought that it would be a differentiator.
Rich: Yeah. But I think this is why we've kept with it for so long, because it is one of the core beliefs of the company, that speed of development is a competitive advantage.
Charity: Back then, I think I would have told you that it was a mistake.
I think anyone would have said that was a big mistake, to focus on writing a CI tool that early in your development.
Rich: I think our CEO might have said it at the time, but here we are.
Charity: Yes. You and he had the last laugh, I suppose.
Liz: For companies that are not Intercom, how would you advise that they get started down this path?
How do you move from deploying once per day to deploying every single change within 10 minutes?
Rich: I think one of the other aspects of culture that we have is zero-touch operations, and moving to operations codified in software rather than in a Wiki.
I think just looking to understand your process, and just slowly but surely automate every step.
Liz: Step zero in that process is even writing it down. A lot of stuff lives in people's shell scripts in their heads, and even getting it down into a Wiki--
When I used to work at Google, we had the idea of the checklist Sisyphus.
It literally would be you write a list of bullet points, and the automation reminds you "Did you do X? Did you do Y? Did you do Z?"
You check it off by hand, but eventually you can automate each individual piece.
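[Editor's sketch] The checklist approach Liz describes can be illustrated roughly as follows. This is not the internal Google tool she mentions; `Step`, `run_checklist`, and the step descriptions are hypothetical names made up for the example. Each step starts as a manual prompt and can later be given an automation function without changing the rest of the list.

```python
from typing import Callable, Optional


class Step:
    """One checklist item; `automation` is None until someone scripts it."""

    def __init__(self, description: str,
                 automation: Optional[Callable[[], bool]] = None):
        self.description = description
        self.automation = automation


def run_checklist(steps: list[Step], confirm: Callable[[str], bool]) -> list[str]:
    """Run steps in order; automated ones run themselves, others ask a human."""
    log = []
    for step in steps:
        if step.automation is not None:
            done = step.automation()
            how = "automated"
        else:
            done = confirm(f"Did you {step.description}?")
            how = "manual"
        if not done:
            log.append(f"STOPPED at: {step.description}")
            break
        log.append(f"{how}: {step.description}")
    return log


steps = [
    Step("run the test suite", automation=lambda: True),  # already scripted
    Step("notify the on-call engineer"),                   # still done by hand
    Step("flip the feature flag"),                         # still done by hand
]

# For a non-interactive demo every prompt is answered "yes"; an interactive
# version could pass: lambda q: input(q + " [y/n] ").strip().lower() == "y"
log = run_checklist(steps, confirm=lambda question: True)
```

Automating the process then means replacing one `automation=None` at a time, which matches the "slowly but surely automate every step" advice.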
Charity: Then once you've automated, then you start measuring to see where you can--
I think you guys instrumented your whole build pipeline with Honeycomb, and that let you just keep knocking off slow tests.
Rich: Totally. This was actually one of those things where you go, "How much do you believe-- How much are you willing to fight for it? How much work are you willing to invest in it?"
This is where we had, "We have tens of thousands of tests and they are now arbitrarily bucketed together, and are now taking 20 minutes to run. Are we OK with that? Are we OK with our tests taking 20 minutes to run?"
"We're actually a big company now, maybe this is OK. Maybe this is the bar as a company our size."
Charity: I remember that.
Rich: And you go, "I don't think so. I don't think we're happy to live with that."
We set about using Honeycomb to instrument every single test and to see the P50, P75, P90, P100 and that wellness of--
Liz: Yeah. Because people talk about flaky tests, but people don't necessarily talk about long-running or variable-timing tests.
This sounds like a new thing.
Rich: "Is a test flaky all of the time, or is it flaky some of the time? Is the test flaky when it's actually in with this bucket, versus in with this bucket?"
"Do we actually have to run every single test individually and measure its time and measure its standard deviation, and whatnot?"
There was actually a huge amount of instrumentation, analysis, individual test analysis in order to try and figure out, "How can we binpack these tests more effectively and how can we find out which are the ones which have flaky dependencies, or which are the ones which are unneeded or simply just badly written or bloated?"
We were eventually able to get that time down from 20 minutes down to consistently sub-5 minutes and keep it there.
That's actually that broken window syndrome, once you actually get it back down that low you know, "OK. No matter how big it gets, we can keep it at this size."
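[Editor's sketch] The measure-then-binpack idea can be illustrated with a few lines of Python. This is an assumption-laden sketch, not Intercom's tooling: the test names and timings are invented, the median stands in for whichever percentile a team prefers, and the packing is a simple greedy heuristic (longest test first, into whichever bucket is currently lightest), which is not guaranteed to be optimal.

```python
import heapq
import statistics

# Hypothetical per-test timings in seconds, collected over several runs.
samples = {
    "test_a": [88.0, 90.0, 95.0],
    "test_b": [59.0, 60.0, 61.0],
    "test_c": [50.0, 50.0, 52.0],
    "test_d": [38.0, 40.0, 41.0],
    "test_e": [30.0, 30.0, 31.0],
    "test_f": [20.0, 30.0, 90.0],  # variable timing: worth a closer look
}

# Typical duration per test, and the test whose timing varies the most.
durations = {name: statistics.median(runs) for name, runs in samples.items()}
noisiest = max(samples, key=lambda name: statistics.pstdev(samples[name]))


def pack_tests(durations: dict[str, float], n_buckets: int) -> list[list[str]]:
    """Greedy packing: longest test first, into the currently lightest bucket."""
    heap = [(0.0, i) for i in range(n_buckets)]  # (total seconds, bucket index)
    heapq.heapify(heap)
    buckets: list[list[str]] = [[] for _ in range(n_buckets)]
    for name, secs in sorted(durations.items(), key=lambda kv: kv[1], reverse=True):
        load, i = heapq.heappop(heap)
        buckets[i].append(name)
        heapq.heappush(heap, (load + secs, i))
    return buckets


buckets = pack_tests(durations, n_buckets=3)
bucket_times = [sum(durations[t] for t in b) for b in buckets]

# Arbitrary bucketing (in file order, two per bucket) for comparison.
names = list(durations)
naive_times = [sum(durations[t] for t in names[i:i + 2]) for i in range(0, 6, 2)]
```

Since buckets run in parallel, the slowest bucket sets the wall-clock time, so the comparison that matters is `max(bucket_times)` against `max(naive_times)`.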
Charity: I like this approach.
You can contrast it with a lot of the objections we are getting about, "You can't deploy on Fridays," where people just accept that this is the way the world is.
That lack of agency and that lack of-- If you tell me that you can't deploy on Fridays, I absolutely accept that today.
But what about a month from now? Do you want that to still be true?
Wouldn't it be better if you could trust your deploys, if it wasn't a big deal?
If you weren't scared of it? I don't understand why so many people just seem to be happy with the status quo.
Liz: It's even the basics of, "Can you restart a server with a no-op change on Friday? Let's start there. Or, can you deploy a whitespace change? Can you--?"
Charity: If you really think that you're not going to find various problems for 48 hours, maybe we start there.
Are you looking at your deploy after it goes out? Are you instrumenting it?
There are only a few categories of problems that will take a day or two to show up, and that's not the majority of what most people encounter.
Rich: I think sometimes emotionally something can feel hard.
"This feels hard. I don't think we can do this. We haven't been able to do it so far."
And then you go, "Let me actually break it down into smaller chunks.
Let me actually measure each bit."
This was actually the test time situation, "We're at 20 minutes. I don't know, it feels like it's going to be really hard. Are we going to be able to do this?"
"Let's actually just start to measure each one and then measure it a different way, and just actually start to layer on more and more on the inside."
And then you go, "OK. I think if we do this bit and this bit and this bit and this bit, we should--"
Charity: Individually, the problems aren't that big.
Liz: It's really interesting.
We've talked a lot about observability-driven development, and it seems like observability-driven development is the opposite of superstition-driven development.
We are thinking, "This is going to be so hard," or "The payoff is not going to be worth it." You can quantify it.
Rich: Yeah. I think that's one of the other things we talk about a lot, which is this design-thinking methodology where you really understand something from first principles before deciding how to solve it or how to fix it.
There's a lot of instrumentation, understanding, measurement, testing of hypotheses and whatnot that actually goes into helping you break something down into its primary constituents.
Generally, once you actually break something down into its smallest constituents, you go "I can solve this using standard technologies.
This isn't some crazy, horrific problem. This is actually a database scaling problem."
Liz: If you look at it all as one blob, it seems overwhelming.
Charity: It's terrifying.
Liz: But if it's decomposed, then you can actually assign people to work on it. You can get pieces of it done.
Rich: You can have specialists.
Liz: And you can see improvements as you go along, rather than saying "It's all or nothing."
Charity: And once you start to get used to the idea that you can actually fix things, then you hopefully gain some confidence.
Rich: The funny thing is, engineers love this stuff. Engineers love it.
Charity: That's why it's so baffling to hear this from engineers. It's like, "Do you like the way you live?"
Liz: People get burned once and then they don't want to touch the fire again right away.
Charity: A good point that someone brought up is that sometimes they don't have control over a lot of things, and the only thing that they can control is this one dumb stick, which is "I refuse to deploy on Fridays because otherwise I'm going to be working all weekend."
The problem is the things that lead to them working all weekend, not the Friday deploys, but that's the only thing that they have to push back.
In that case, I would say "Find another job, if you can."
But I get that as long as they're aware that that's what they're doing, and now they're not trying to argue that this is the way things should be.
What bothers me the most is when I can tell that people are holding up the "No Friday deploys" as an example of "This is how it should be, and if you don't believe this you don't care about people and their weekends."
Liz: The idea of "We don't care about our engineers."
Charity: I fucking care about your engineers a lot.
That's why I don't want them to have to suffer through this any day.
Rich: I think being afraid generally to deploy on a Friday is a smell of weird engineering practices in general, it isn't a thing in itself, it's just an indicator that something else is going on.
Charity: There is some stuff that needs to get fixed.
Rich: If your competitors are willing to deploy five days a week or six days a week or seven days a week, and you're actually only able to do it four days a week, you're going to lose.
Charity: I also feel like using rules should be a last resort.
Culture can do and should do so much heavy lifting for you here, because when you use culture and you use norms what you're doing is you're allowing engineers to learn and develop their own good judgment.
Liz: That goes into autonomy. Autonomy as a way of keeping engineers happy.
Rich: Even just about using software, we're actually OK doing automated rollbacks as well as--
Charity: Let's talk about ownership just a little bit as we're winding down here.
How does that manifest? How do your engineers own their work, when is their job done? How do you feel about life/work balance?
Clearly, you work your engineers 24/7.
Rich: Ownership is a funny thing.
I like to think about pride, and I like to think about enabling people to have pride in their work.
I think you can have certain low bar of standards which set the base level of what people are going to do, but hopefully trying to inspire people to actually really be proud of the work, that's better than any stick of ownership or accountability.
Charity: When I look back on my career, it is not the jobs where I coasted and everyone thought I was amazing when I was doing an hour or two of work a day, those are not the ones that I look back and just go "That was amazing." I look back at the ones where I was pushed by circumstances, by users, by growth, by work.
But when I was pushed, and I grumbled sometimes, but I grew by leaps and bounds.
Rich: And you were proud of it, looking back?
Charity: I was proud of it.
Rich: I wrote a blog post a couple of years ago called Pride Over Process.
Charity: I read that. I loved it.
Rich: And that was definitely for me, I think so much of the companies we work in people are here to do the best work of their careers.
Generally actually trying to tap into that, and I think structurally process-wise you'd like to set up teams which enable people to have end to end ownership as much as possible, enable them to know really who their customers are.
Charity: Get those feedback loops.
Rich: Yeah, get them feedback loops. Get them as connected to their customers as possible.
Charity: But not everybody is in that phase throughout their life.
There are times when you might want to take a step back, and I think that self-knowledge is the important thing here.
Self-knowledge and businesses kind of accurately and honestly representing what kind of teams they want to build.
Liz: Also what they reward. If someone is giving everything that they have, they should be rewarded for it rather than having someone profit off of that.
Charity: Absolutely. This has been a real pleasure to have you.
Rich: Thanks so much. It's been super fun, thank you.