In episode 10 of O11ycast, Charity Majors and Liz Fong-Jones speak with Bugsnag CEO James Smith. They discuss the seemingly impossible ways an organization can measure technical debt and how they can attempt to reduce it.
Liz Fong-Jones: James, you talk about measuring technical debt. How can teams actually measure such an immeasurable thing?
James Smith: It's difficult because I think that there's an emotional component
debt, and then there is a measurable technical component to technical debt.
The emotional component of,
"I don't feel good in this code base," I think is always
going to be really tough to measure.
You have to probably take more subtle approaches, making sure that you have retrospectives and making sure you're allowing your team and your individual contributors and your managers to share where they're pissed off about things.
When it comes to measuring the direct technical impact of technical debt, that's something that I actually end up talking to people about a lot.
One of the ways to do that is measuring crash rates or stability. Another layer on top of that might be the performance impact that's introduced. Try and look at these measurable aspects as well.
"Can we measure in this code base in this application?
Have we got better or worse in terms of stability?
Has the crash rate gone up per release, for example?"
Or, "When you turn on this A/B test, has the crash rate gone up in this experiment?"
One of the things you can do there is take a look at that number and use it almost as-- Not necessarily ammo, but almost as a start of a conversation.
Liz: Right. If you project that outward, it looks really bad over time. If you're getting 5 bugs, 10 bugs, 20 crashes per week then you know something is going to be up.
I also love what you said about capturing the human aspects, because in my old job at Google we used to think about "How long does it take a new SRE or product developer to get onto the on-call rotation? How confusing is it?"
Charity Majors: This sounds like a good time for you to introduce yourself.
James: I'm James Smith. I'm the co-founder and CEO of a company called Bugsnag, and we help companies understand, measure and improve software stability.
Liz: How do we decide how we're allocating our time? How do I decide when is too much technical debt? When should I start doing fixes instead of new features?
James: This is a great question, and is almost more of a negotiation point as well. I think it depends on the size and complexity of your organization, but in almost all--
Charity: Is it predictable?
James: I think it's predictable, but changes as you grow. I think when you're starting off in an industry, if you're building a new product, you have more tolerance for risk.
You're going to ship things quickly, you're going to see if the product feels right, you're going to see if people want to use your product.
Charity: No one ever failed because they shipped code too slowly.
James: Exactly. You'll start off and you'll just get something out there, and I think that it's probably predictable that your health/technical debt--
Charity: The successful companies will have racked up a lot of technical debt along the way, is what you're saying?
James: Yes, but I think there's different attitudes that cause things to exponentially get worse vs. just happen as part of the application code base getting bigger.
Charity: You can imagine two different teams having very different approaches.
The thing that I often see is these early teams are comprised mostly of software engineers, and by the time I show up knocking on their door as the first infrastructure engineer they are doing terrible things to themselves, like sending themselves 100 emails a day to validate that these backups have finished. I'm just like--
If you actually start out with an ops person from day one you rack up a very different kind of technical debt.
James: Yeah, it almost feels like there's very smart people working on these problems that let themselves slip into this situation where there's these non-scalable things happening in
For example, email alerts that they've triggered themselves manually that no one's ever thought, "Can we do a better job of this?"
I think actually, what I've seen a lot of the time comes down to ownership, almost. Sometimes no one's truly owning technical debt or truly owning the observability or monitoring side of the stack.
Liz: It's so insidious, right? When you have something that's not owned by anyone, then people don't feel like they can fix it. It just is a nagging thing that keeps on getting worse and worse.
James: That actually comes back to your question from a minute ago, I think the best teams that are big-- There's teams now that have massive code bases, like Square, where they still give a shit. They actually genuinely care about the quality of the code they're writing.
Charity: Because you identify with it personally.
James: Exactly. Like you say, when you join a company sometimes there will be that groan of looking at the "Here be dragons" in the code base. Like, "Don't touch that file. That's for the junk."
Charity: It's always that shit where someone used to own it and that person left or moved on, or something.
James: Yes. It's always easy to blame the person who's no longer there, and stuff. I don't think it ever is malice in that situation, it's just more like it happens over time and you don't have an owner.
Charity: We take the lack of ownership and then we assign some sort of agency to that.
James: Exactly. Going back to that question, I think that there's teams that have managed to scale giving a shit, and I think in those environments it works.
Charity: Lots of technical debt is pretty insidious. It's only visible in retrospect.
How can frustrated engineers do a better job of surfacing the consequences of this debt to decision makers, so they can be allocated real time towards fixing it?
This is something that I have a lot of thoughts and feelings on, but I'd like to let you take a stab.
James: I have certainly opinions there. The way we think about it-- Because the reason I built Bugsnag in the first place was to address one of-- This is one of the main problems we were trying to address.
Charity: Did it grow out of direct pain that you had?
James: Absolutely. My previous company was a Y Combinator startup and I was CTO at that company, and before then I worked in enterprise.
In my YC startup, everything fell on me.
If the code quality was bad or I had trouble with engineers being mad at how much technical debt there was, that's on me. I was like "I have to solve this."
Whereas at Bloomberg, it was like "Tough cheese. There's nothing you can do about this. You have to--"
It's a great company in terms of having entrepreneurial spirit there, but there's no way you could justify spending time on fixing technical debt.
Charity: That's so weird.
Liz: You have that ever-growing list of features you're supposed to implement and your boss doesn't really care.
James: In fact, as an engineer at that company I was excited about the new features, so that I was almost convinced that I shouldn't be focusing on technical debt.
But yeah, "Measuring bugs" for lack of a better word is a really good way to do that. Because if you want to prove that you are--
Charity: Or paging alerts.
James: Yeah, exactly. Any customer-facing output, anything that's causing an outage or a session-ending bug, or something like that.
You can measure it, you can say "This customer saw pain due to the fact that we had technical debt."
Liz: But that requires you to measure whether your customers are having pain, which seems like its own struggle in and of itself.
James: There's multiple ways to do that as well. Like you're saying, there's the full-outage metric, which is like "Everything's on fire right now." There's the "Did a customer see a crash or an exception during a session?"
Charity: Did they get frustrated and go away because everything was so slow?
James: Performance issues, exactly. Frozen screens. We actually just launched a feature on Bugsnag for Android applications that detects something called ANRs, which is when an Android screen is frozen for more than five seconds.
Because we think that's just as bad as if you've seen a crash in the application, you're probably going to bounce.
Liz: It's almost even worse, because then you have to wait there and see "Is it going to come back or not?"
James: Exactly, and probably most customers don't know how to swipe up and kill an application on a phone.
You want to pick some technology or tool that is low instrumentation work, but high signal when it comes to that kind of stuff.
Charity: I feel like decision-makers are always trying to do their best, and when we are frustrated about what we think are bad decisions being made it's almost always because they are only seeing a certain set of costs and pain, and they aren't seeing things that are being amortized over longer periods of time or things that are being felt by other teams.
Or things that teams are feeling but they aren't bubbling it back up the ladder so the bosses never hear about it.
This is why I feel like the number one job of any senior person at a company is to look for those things, those critical pains that are not being factored into decisions, and amplify them.
Look for ways to measure them, to surface them, to speak the language that the business decision-makers are speaking.
Sometimes it's just, we're like "We're getting paged all the time," and the business leaders don't understand why that's bad or how it's going to cost them.
You just have to keep trying over and over until they get it, then you can have confidence in the decisions being made.
James: It's interesting you mention that. Companies aren't good at observability and monitoring and measurement.
Typically there will be the "Highest paid person" issue, where if you get an email from the CEO saying "Why is this not working?" Or "I've had a key customer saying this isn't working," that's what everyone jumps onto.
That's the same pain as a poor stability rate or lots of crashes in your application, but one of them is being seen and directly observed and one of them is being measured and is very scientific.
I think that if you can draw those closer together, that's how you can bring this--
Charity: Data should be democratizing.
James: Exactly. A lot of companies will use something like OKRs. If you're running an OKR system, I know Google's a big user of OKRs. But if you run your OKR system--
Charity: Do you have feelings, Liz?
James: If it's run well, I think it works well. I think that's my opinion on this.
Charity: What do you do that's better than OKRs?
Liz: What are you even measuring, right? That's the important factor. Is reliability and code quality a measurement, or is that just a can that you kicked down the road knowing that it's not going to be you who's slowed down?
James: Right. One of the really cool ones, and I've seen examples of this, is poor reviews on the App Store.
For example, if you're a mobile application, something that's very visible to the marketing team or the CEO or someone who's not writing code is why we're getting so many one star reviews.
Maybe your OKR is "Reduce the number of one star reviews in the App Store," or "Become a five star rated application."
Which is something everyone can rally around, and then one of the key results under the hood might be increased stability from 98%
So that's the measurable thing, that's injecting measurability into the goal that everybody shares, and then maybe the CEO gets an email from someone when they're mad that the app broke.
Liz: What about the sneaky things, supposing that it takes four months for a new engineer to become productive or for that feature to be developed because it's so complicated and you have to plumb it through so many places? How do we prioritize that?
James: That's really hard.
I think the best big companies I've seen will have some-- I think Airbnb has a team called "Developer happiness," and their job is just purely to make the engineering team more efficient and onboard people
I'm not going to get that exactly right, but that's their job, that's their objective. Maybe they're measured on that there, but obviously not every company has the latitude and budget to hire a developer happiness team.
Liz: We can all internalize the idea of developer happiness, and--
James: "Be your own developer happiness expert."
I love that.
If you are prioritizing developer happiness, how can you then justify that you're going to be spending 2, 5, 10 hours a week of your time and not building things out of that? That's the other question, as well.
Liz: Because you're making everyone else more productive.
James: Yes. So if you can measure that, that would be amazing.
Charity: I really do think that looking for ways to measure things is a next frontier in engineering management.
I think of this every time the subject of budgets comes up, they can spend infinite dollars into the AWS budget but they can't spend more than like 20 bucks a month on
It's just like factoring in hiring, if they have to hire five people to run an observability team instead of paying a fraction of that amount. But we're not good at measuring that.
Liz: One of the fascinating things was I recently in the past couple of months gave a talk about how to build systems that are humane to run, and I was able to hire a graphic artist to do it.
Which was phenomenal, but I never would have been able to do that because the previous company I was at didn't trust me with the autonomy to "Yes, you can hire someone to help you with this."
Charity: That's so random.
James: That comes back to the emotional side of things as well. If you can present this in a way that is compelling, it's going to convince people. There's the measurability, and then there's a need to feel it in your gut.
Charity: They say in order to love your work, what do you need? You need autonomy, you need mastery, you need impact.
I feel like so much of the modern corporation is bent on dehumanizing you and making you part of a system and a cog.
We manage to be pretty happy throughout our days because we can mostly forget about that, but it's little things like this that just remind you.
Liz: Speaking of autonomy, how do you see teams choosing which tools they adopt versus using a centralized tool provided by their company?
How do you look at teams and advise them, "What should I pick. How should I establish safeguards?"
James: We are the company that definitely gets mostly adopted from the bottom up, and I think talking about scaling and giving a shit earlier, you have these champions that give a shit way above the average in the company.
They tend to drag everyone else up. I love finding those people, I love finding the champions.
Sometimes the champions understand business value and understand how software is bought, a lot of the time they don't know how to do that.
I think that the ideal person is someone who makes software that's mostly targeted towards software developers.
The ideal champion is someone who has bought software at their company before, can help us prove their business value, but also is going to kick the tires and run this thing themselves and set up the first part of it.
By the nature of what we're building, at least, as well. You roll out something like Bugsnag on an application per application basis, so you'll start on Android and then you go to iOS and you go to the web, then you go to the back end.
Getting a top-down sell is something that we've definitely done, but you tell a different story. It's a very different story. I think one of them is, "I'm going to help you do a better job today," and the other one is more of the high level metric side of things.
Charity: We talk a lot about breaking down silos in tech, but I often feel like tools build silos at their edges. The edge of your tool creates a silo and you no longer speak the language of everyone around you.
How do we use these tools for good and not for evil, how do we not just add one more problem? "Now you have five problems."
James: You're right, there's some tools out there that are so complex to understand and the learning curve is so high, and I think that especially in the developer tools space a lot of developer tools are made by software developers.
Charity: Sad, but true.
James: It is. I'm a software engineer by trade, I don't get to code much anymore, but there's a lot of software developers who think "OK. I need to make every piece of data available, and I need to have everything filterable, and it needs to be super-- It needs to do everything on one screen."
If you can productize things, if you can say "This is designed to do this one thing." Like I mentioned, stability scores. That's not exactly a complicated mathematical concept.
It's like, "What percentage of user sessions were crash free?" That's it. That's just two pieces of data in that math, but presenting that in a way that you can rally around it and productizing it--
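As a rough illustration of the two-input math described here, a stability score can be sketched as the share of user sessions that ended without a crash. The function name and the sample counts are hypothetical, chosen only for illustration; this is not a description of how Bugsnag computes its score.

```python
def stability_score(total_sessions: int, crashed_sessions: int) -> float:
    """Return the percentage of sessions that were crash free.

    Both inputs are assumed to be plain counts over the same time window.
    """
    if total_sessions <= 0:
        raise ValueError("total_sessions must be positive")
    if not 0 <= crashed_sessions <= total_sessions:
        raise ValueError("crashed_sessions must be between 0 and total_sessions")
    crash_free = total_sessions - crashed_sessions
    return 100.0 * crash_free / total_sessions


# Hypothetical numbers: 10,000 sessions, 200 of which saw a crash.
print(stability_score(10_000, 200))  # 98.0
```

The point of keeping it to two inputs is that anyone in the company can recompute the number by hand and trust it.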
Liz: That gets to a subject that's near and dear to my heart, which is a service-level objective. What you're describing is a service-level objective, but instead of for a web service you're looking at an application.
James: That's right. SLAs and SLOs are, I think, very simple to understand. The math under the hood shouldn't be that complicated.
It should be, again, and we're talking about common languages a little bit. Common language between--
Charity: I disagree. These are infinitely complicated. It's like a fuckin' fractal.
James: I wish they weren't. Because I remember New Relic in the APM space, they had this thing called Apdex, and I think they still have it.
The idea is it's a score that's generated, and maybe it's just me and my cynical British developer friends mostly, but I don't trust numbers where I can't understand how the number was calculated under the hood, as a software engineer.
So I saw this Apdex number and I'm like, "I wish I could get everyone to look at this number," but I just didn't understand how it was made up and didn't believe it.
Again, it's probably only two or three inputs there under the hood, but it needs to be simple enough that you can explain it in a quick sentence.
Charity: The other thing is, "Is it the appropriate amount of reliability?" We're not going for infinite reliability, because that way lies madness.
James: I did talk about this and I did a blog post about this really recently, we got a post that was on top of Hacker News that was "Why you should not fix every bug."
I don't think that's a controversial statement to make, but people who want to fix every bug aren't the sort of people who want to get into using Bugsnag or Honeycomb. It doesn't make sense to them.
We started-- The reason we built stability in the first place was to say "How many nines of stability are you going after?" It's a concept that people understand.
SLAs and SLOs, you can do it on orders of magnitude, you're going to pick the numbers there.
It simplifies it down, but maybe under the hood the math is complicated, but the concept should be simple for it to be I think relied upon by the org.
Charity: Yeah, I agree. Definitely. It's something that you can use as a translation layer between management and engineers too. It's a thing you can all agree upon.
James: I have an interesting anecdote about that.
When we had-- One of our customers is HotelTonight, and when we were talking to them for a case study they were explaining how they use the concept of stability score to genuinely figure out whether they should be building features or fixing bugs.
Liz: That sounds a lot like the SRE error budget in the SLO.
James: Exactly. In fact, a lot of this stuff is in the SRE book. It's the same concept. "Don't fix every bug, know what number you're going after and use that as a powerful tool to make business decisions."
That's obviously what we intend, but it was great to see that in use.
In fact, the fact that they were below their stability score target at one point actually came up internally.
I think it was in an exec meeting or a board meeting where they had to explain "We're not going to ship this feature on time because we need to get the stability under
As a common language in that part of the org chart and the exec meetings, that's what we're going after.
Again, as a former developer who's worked on building these things on that side of the table, now I'm on the side where I'm a pain in the ass to everyone saying "When is this going to ship?" and they can tell me.
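The features-or-fixes decision in the HotelTonight anecdote can be sketched as a simple comparison of measured stability against an agreed target, in the spirit of an error budget. The target value, the session counts, and the function names below are assumptions made up for illustration; they do not describe HotelTonight's or Bugsnag's actual process.

```python
def measured_stability(total_sessions: int, crashed_sessions: int) -> float:
    """Percentage of sessions in the window that were crash free."""
    if total_sessions <= 0:
        raise ValueError("total_sessions must be positive")
    return 100.0 * (total_sessions - crashed_sessions) / total_sessions


def next_priority(target_pct: float, total_sessions: int, crashed_sessions: int) -> str:
    """Suggest where to spend time, given an agreed stability target."""
    if measured_stability(total_sessions, crashed_sessions) < target_pct:
        # Below target: the budget for crashes is spent, so stability comes first.
        return "fix bugs"
    # At or above target: there is room to take on risk and ship features.
    return "build features"


# Hypothetical week with a 99.5% target and 50,000 sessions.
print(next_priority(99.5, 50_000, 400))  # fix bugs (measured 99.2%)
print(next_priority(99.5, 50_000, 100))  # build features (measured 99.8%)
```

The useful property is that the target is agreed in advance, so the conversation in the exec meeting is about one number rather than about opinions.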
There seems to be this persistent myth that many engineering leaders believe, which is that they can find these cycles to work on reliability and everything just in the couch cushions somewhere.
"You can do it in your spare time, just squeeze it in between that feature work." Which spoiler alert, does not work.
James: Doesn't work.
Charity: Doesn't work.
James: Obviously not going to happen, or the person who's doing it--
Charity: Will do it so shoddily--
James: Evenings and weekends and then burns out. It's just never a great thing to do.
Charity: It has to be a first level priority, and it's the job of the engineering managers to do battle for their people when they need to in order to get them that time and space to fix things.
James: It's interesting as well that this seems obvious to us, but when I started my career and even before I started my career, software quality and the concept of measuring this was already a thing.
It was already a very important thing, it's just back in the day it was all static analysis.
Liz: I remember those days.
James: So now it went out of being cool because everyone moved to scripting languages, and then everyone came back to compiled languages and strictly typed languages again.
Charity: There's nothing new under the sun.
James: It's the same. We all care about-- If we're building a product, hopefully we care about the same things. But it's not something--
Charity: I think we're getting better. We're getting better and better at figuring out how to measure this.
I think that the shift from traditional metrics on the back end to the event-oriented way that we gather, it's better because it better maps to the actual experience of the users as they're traversing your
We're not doing this because it's super fun and cool, it's because we realized that when you're just measuring what's happening on the back end you're missing out on a lot of context and a lot of experience can just go through the cracks.
James: I know a lot of people who use tools like Splunk and they've got terabytes of data.
Charity: Junk data. Just paying for it, and nobody--
James: No one is looking at it. No one's looking at this data, it's just garbage.
Liz: Right. "Write once, read never."
James: Yeah. Exactly.
Liz: "Why are you paying to index all of that?" Right?
James: It drives me crazy. A lot of our customers will be using Splunk, and--
Charity: It's a safety blanket. That's all it is.
James: That's exactly what it is.
Charity: It's not useful for debugging in any meaningful way.
James: Exactly. You have-- Really I think there's a split happening right now.
You've got the product analytics team who are measuring the KPI and success of the product, and then you have the observability teams that are measuring the health of the product experience.
I feel like there is a natural divergence there.
Charity: I feel like I see almost the same thing.
What I've seen is the traditional ops people looking at the health of the service, which is all you're ever going to get with aggregates. All you're ever going to get is these averages, and then you've got the people who are trying to understand the health of the experience for the user, which is where I would say the observability folks are
I think of it as monitoring versus observability.
But I feel like we're both describing the same elephant.
James: I think it is.
I think it's naturally evolving, it's not "Just shove everything in a big pile of data."
It's now back to ownership again.
It's like, "Who owns what? I own the experience, I own this." We launched something last year which was the feature flag analytics, so as a release manager or a product manager you can use the stability score or error data to say "Is this feature healthy?" It ties them back together again.
Charity: Because as an engineer we have to give them the tools so they can own their stuff from end to end. If we're just giving them "Ownership" like, "Congratulations. You now own this."
They're just like, "OK. What does that mean? How do we know if I succeed or not?" You have to give them a set of tools and a standard, like "This is what it means. This is how you know when you have successfully owned it."
James: You should probably, hopefully, good product teams are writing that down before they even write a design spec or line of code.
Charity: Definitely. Every single one.
James: Or, in reality, probably--.
Liz: We talk a lot about observability-driven development. How am I going to know where it's working, where it's not working?
The only way to do that is to start it from the beginning. Or, at least the most pain free way of doing it. You can piggy back things on later, but it's not the same.
Charity: Ideally it goes back to that whole autonomy, and having the ability to be creative at your job.
The product owner's job is not to tell engineers how to implement things. There is a vast amount of creativity and control and interesting stuff to be done there.
The API layer between product and engineering should be, you know, fairly porous, and it should be driven by what is reasonable to build.
"Am I building what I thought I said I was building? Is what I shipped what I think I shipped, and did it actually fit the need that the product manager had?"
James: There might be shared OKRs or objectives or goals between product and engineering, but you should have your own.
You should be able to measure things you care about, but hopefully the ones that impact customer experience should be shared, and those are the ones that bubble back up.
Charity: I think that there's no debate anymore about how this builds better experiences for users, but it also has to involve trust.
Because when you're giving someone ownership that means you don't have it anymore, you're giving over that point of pride to them and then holding them accountable--
Liz: You're giving away your Legos.
Charity: It is, it's giving away your Legos, and then you have to trust but verify so that they get to exercise that spark of creativity and creative energy.
James: It's interesting, measuring quality of software via observability or something like
Also the way I think about this is if you are using a tool like this and you can see you are causing one particular type of problem on a regular basis, and if you take pride in your work, you can get better at your job based on this information.
Charity: Every engineer that I have ever worked with that I have liked working with, which is most of them, has had that fundamental
When you see these bad outcomes where teams don't have the ability to see what is hurting them, or when you have this fundamental mismatch where the pain is not being felt by the people who are empowered to fix the
The traditional dev ops split created all of these terrible feedback mechanisms where the developers were not getting the pain of the bad things that they put out there, and the people who are getting the pain aren't empowered to fix them. It's just terrible.
James: You need the autonomy and accountability together.
Charity: When you have just one and not the other you become very cynical and depressed.
James: I get it. It makes sense.
Liz: I'm starting to see an interesting thing happen though, which is that people are even starting to have some aggregate metrics but they're not necessarily getting a view of the pain that individual customers are experiencing anymore.
That we've scaled up so far we've lost sight of that, and I think that's where observability really matters, is can you actually identify which specific users are having pain?
James: 100%, and we definitely share a product direction on that.
In Bugsnag one of the things that has continually been exciting to demo to big accounts is the fact that we can say, "Look. Here's your stability score, but here's the crashes that are impacting paying customers. Here's the crashes that are impacting customers during their trial onboarding process."
Segmenting that whole stack down.
Charity: It reminds me of that famous quote which was, "One death is a tragedy and a million is a statistic."
The way to actually be emotionally engaged in your work is to make a difference for individual people.
James: The craziest thing that we saw from Bugsnag after we launched that feature was, like I said we've been bottoms up. Engineers tend to buy in Bugsnag. But we started seeing customer success reps using the product, and they're account managers.
They're looking at-- This dashboard is not built for them right now. We're working on making it more accessible to those teams.
Charity: They can do a better job at their job if they have more access to data.
James: Exactly. Or if you're doing success teams, and account managers do quarterly business reviews.
They'll go into these big accounts and they say, "We're going to check in on the health of you as a customer." Rather than just asking questions, saying "How are you doing?"
Charity: Do some research and show that you actually see and care.
James: Exactly. "We found this bug, we proactively fixed it within this amount of time." That is just a wow factor for those customers and those accounts.
Charity: Do you think the technical debt is getting-- That the mass of it is getting larger, or smaller?
James: The amount of technical debt that we're creating I think is getting bigger.
I think that I would like the world to make technical debt in an informed way.
I think it's OK to build things quickly, it's almost like the Agile methodology. "Build something quickly--"
Charity: But Liz wasn't asking what you want to have happen.
James: What's in reality going to happen is a different story. But no, genuinely I think that people are making more technical debt, and I think that given that, I think that navigating that world is going to become more and more important.
Charity: I agree. But I also think that we've gotten-- As an industry we are starting to get a handle on the difference between good and bad debt.
James: Yes. That's right.
Charity: Not all that debt is bad. The debt that you take out to buy a mortgage or to put yourself through college, that is good debt. There is good debt that helps your business succeed.
Liz: It's the same as complexity. There is essential complexity and inessential complexity, and some complexity or technical debt has a higher interest rate than other technical debt.
James: I love that analogy. That's perfect. It comes to the end, if you can measure the emotional and mathematical components of that, it's OK. You should be able to create that--
Charity: I feel like we just closed on the most interesting thing that we've said all day. But, thank you. It was very interesting.
Liz: Thank you, James.
James: Thanks for having me. It was really fun.