Ep. #36, Resilience Engineering with Jacob Scott of Stripe

Guests: Jacob Scott

In episode 36 of o11ycast, Charity and Liz speak with Jacob Scott of Stripe about the need for SRE teams, prioritizing customer happiness, and the limitations of distributed tracing tools.


About the Guests

Jacob Scott is a reliability engineer at Stripe. He previously co-founded BetterWorks and was a software engineer at Lyft.

Transcript

Jacob Scott: I questioned the concept of an SRE team, right?

I don't think Stripe has SRE teams, right?

So you come back and say, "Okay, teams are responsible for their own code and production, end-to-end, but they also need some amount of expertise," right, which I think Netflix CORE models well.

Charity Majors: Yeah. They're not experts.

Nobody's asking you to be an expert in everything. SREs can help with that.

Jacob: So this is why this term, SRE, I think the same way that you might say reliability or resilience, right.

But a specialist team helping to coordinate-

Charity: What about the embedded model?

Jacob: The embedded model is interesting, I guess. I have limited exposure.

I haven't seen all of these models and I've only worked in so many orgs.

What I can say is that I think helping people make the right decisions at the right time, including in design, understanding that product engineers have 60 different trade-offs they're making and probably their management chain is most interested in the velocity of their software.

So understanding how they interface with security, reliability, accessibility, any of these cross-cutting non-functional concerns, having an answer.

Maybe it's embedded?

Liz Fong-Jones: It almost doesn't matter. Right?

What matters is that somewhere in the ecosystem, you need to have someone whose job it is to understand this thing deeply and help you do it better. Right?

It's not abdicating responsibility, it's having someone that you can go to.

Jacob: Well, whose job it is interesting.

It's probably right, but I think there's this issue of, if it's no one's job, it's everyone's job, and if it's everyone's job, it's no one's job.

I see myself at Stripe doing some of this where I have a pretty full engineering workload.

I'm also involved in, at various organizational levels, trying to up-level our incident life cycle.

I think that there's a benefit to having practitioners, right?

So again, with all this stuff, there are trade-offs and it's like, "How does it work in your organization? What's legible? How do you set people up for success?"

But I think there is access to the expertise and then an organization which doesn't try and have its cake and eat it too. Right?

Because if you tell people, you want infinity nines and you tell people that you want all the features tomorrow, and you have a team of three people instead of 20 people, you won't get all of those things, right.

It is simply not possible. I mean, under this made up scenario.

Liz: Hmm. So now would be a good time for you to tell us about yourself.

Jacob: Sure. My name is Jacob Scott. I am a software engineer at Stripe.

I've been at Stripe for about a year, and I work on the product infrastructure that supports teams at Stripe building public APIs.

And I like being on the interface of infrastructure and product.

And I have a particular passion and interest in, well, you might call it resilience engineering, but that gets into a definitional issue.

I would say, modern safety science intersecting reliability.

Liz: So you're very well positioned to think about this.

How do you bring these practices and distribute them across a large organization?

Which I think is really, really cool.

Jacob: Yes. I mean, I agree completely, with the caveat that it is a forever work in progress.

I think that every organization does this differently.

Paul Osman, who's now at Honeycomb, who's awesome, who I've had the privilege to chat with, I think said recently, "Best practices may or may not exist or may be context specific."

And so I think that understanding how, at Stripe, you help with alignment.

I think there's this really interesting cultural piece that, at least as I first saw it, goes back to Netflix, I think maybe 15 or 20 years ago, about loose coupling and high alignment.

And I think that in the startup part of the tech world that I have spent most of my career in, everyone is on board with loose coupling.

No one wants the CEO to sign your TPS reports before you make your commit. But alignment is a lot more challenging in terms of, well, how are we all rowing the same way? And then particularly in some of these complex areas in reliability, if it doesn't metric well, it gets really hard, I think, or much harder to align.

Maybe I'm being too mean to executives, but I think that in general, they like to have a bunch of stoplights, and then if all the bottom stoplights are green, then they roll them up into the next level.

And they're always looking for any red or yellow stoplight to trace it down to the root, and some stuff in reliability doesn't work quite that way, and it can be a challenge to untangle that and create alignment.

Liz: So when we think about the idea that, are our systems working or not, right?

People instinctively tend to reach towards this idea of perfection and infinite nines, and we know that doesn't really work so well, but what about the reduction of SRE to SLOs?

I know that you pushed back against that recently.

Jacob: This is super interesting, and again, there are a couple of prisms.

First I think, it's great. You can pull out one perspective on resilience engineering, which is safety one and safety two.

Safety one being how do we make stuff not go wrong?

Which is what we think about a lot. Safety two being, how do we make stuff go right?

How do we look at what we're doing well and accelerate it?

If you're using SLOs to good effect, more power to you.

Where I think I have this question about SLOs being insufficient is actually captured--

There's Narayan Desai, I think, who is an SRE at Google, who gave a talk at SREcon.

"The Map Is Not the Territory."

And it's this question of-- I think SLOs, and this is my personal perspective, right?

I think they work well in statistical regimes.

So if you're trying to go from 0.995 to 0.996 availability on a thing that you do CI/CD on, and you're leaking some failures, then you're like, "Okay, we need to prioritize fixing that so we can improve our availability."

They don't work well for all the failures you read about in the press, like when Google Cloud's networking goes down because they push some thing in some way.

I used to know the actual incident number, GCNET 9007 or something; it was a year or two ago that Google's network and Google Cloud networking went down for six or eight hours.
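To put rough numbers on the 0.995 versus 0.996 example in this answer, here is a minimal sketch of the arithmetic. It assumes a time-based availability SLO measured over a 30-day window; the function name and the window length are illustrative assumptions, not anything described in the episode.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO over a window.

    Assumes a time-based SLO; request-based SLOs would count failed requests instead.
    """
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


# The "statistical regime": a small change in the target moves the budget a little.
for slo in (0.995, 0.996):
    minutes = error_budget_minutes(slo)
    print(f"SLO {slo:.3f}: about {minutes:.1f} minutes of budget per 30 days")

# The other regime: one long outage (six hours, as in the anecdote) against the same budget.
outage_minutes = 6 * 60
budget = error_budget_minutes(0.995)
print(f"A single 6-hour outage is about {outage_minutes / budget:.2f}x the 0.995 budget")
```

The loop shows the kind of incremental trade-off an SLO handles well, while the last line shows why a single large outage does not fit the same framing: it exceeds the whole window's budget at once.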

Liz: Right. The black Swan model. Right?

You cannot predict the things that you cannot predict.

Jacob: Yeah. This is unknown unknowns, it's dark debt, it's catastrophic failures and complex systems.

And so my belief is that SLOs are sort of insufficient for that, and I think it's fine for them to be insufficient.

My concern is about people who-- Because SLOs are metrics, and so they fit in this metrics box.

They're just like, "Great. Our reliability is metrics. We have an SRE team managing our SLOs. We're done."

And that is not -- You also miss this human component right?

Charity: People do that?

Jacob: Well, I think that it is easy for people to do it.

Okay, so there's a Harvard Business Review article, a cover story from the past year or so.

"Don't let metrics destroy your business."

It talks about Wells Fargo, is the example where they're like, "We want to attach, which means, we want to cross sell. If you have a Wells Fargo account, you should have six Wells Fargo accounts. That's how we're going to bonus. This is a system design and incentives and like, blah, blah."

Well, the answer is that their sales associates committed fraud to sign people up for accounts that they didn't know they had, because it then looked good on their metrics that got them bonuses.

Charity: Yeah. Soon as you have a metric, people will try to goose it.

Jacob: Yeah. Good hearts or whatever it is.

And so I think that it is-- in a world with so many trade-offs, including at the executive level, right, I perceive an ease that the downward slope of, it should be a metric, show me the metrics, we'll monitor all the metrics, we'll shift resources so that the metrics are good.

Charity: In a lot of ways, I think that this is also just the natural end state of capitalism, which is to replace humans by widgets at every--

I told this story on this show before, but it blew my mind when I realized that most C-levels have closer relationships with their vendors than with their people, because people come and go, but vendors are forever.

And so when a vendor is like, "Pay me tens of millions of dollars and I'll make it so that I'll tell your people what to look at and I'll tell your people what it means. And they'll never have to think again."

Because I think that there's a lot of anxiety for executives and like C-levels in that, yeah, it's risky.

Your systems depend on your people wanting to work there and being happy and it's become like a risk for them to try to eliminate.

Problem is, you fucking can't.

Jacob: Yeah. And that, actually, it's this question of control versus nurture, right?

Charity: Oh yeah. You nurture people, that's risky.

I mean, they might not do everything you want them to; it might not be best for them.

Do you really want to get them thinking like that? No, of course not.

Jacob: So how do you shift that mindset though?

How do you get people towards organizations that are oriented towards positive people-based outcomes?

Charity: Well, we succeed like a motherfucker.

And this is why I think that it can be very difficult to sit in the middle of your friends who are very idealistic and, "How could you work at a place like Google? How could you work with a person like Charity Majors?"

And it's like, we all have to make our own ethical, moral choices, but I believe that it's worth doing.

I believe that it's worth compromising my soul.

I believe that it's worth trying to build something mainstream, not a niche product where I can feel superior to everyone, but I want to change the world in a big way.

And that means you have to do big things.

Jacob: I would say to compose with that perspective, a pragmatic perspective might be, carefully, which is to say, see what you can do in alignment with authority.

Also, try and catch a wave. Right?

Like if an executive is like, "I went and looked at some incident reports and they're pretty shitty," maybe that's a great time to say, "Yes, and let's give people the Etsy facilitator debriefing guide, right? Like let's not just do--"

Charity: Be prepared and ready with your answers for how to change things.

Jacob: Well, this is interesting because of course, one definition of resilience is sustained adaptive capacity, right?

It's the, like, this is the academic woods or whatever, but it's your ability to respond to an emergency.

So it's like, are you just ready to pounce when the organization shows you an opportunity to make improvements along this ethical and moral dimension that are also legible to the organization, because--

Charity: Ultimately, the execs more or less want the same thing as we do when we're talking about CI/CD and everything, it's just that I think as engineers, we often fail to speak the right language to them to really show them how the solutions that we propose will help them move more quickly, will help them make their people happier.

If we take a very oppositional stance from the beginning, they're going to assume that whatever you want to do must be wrong somehow, because they just don't trust you, right?

But if we're all-- and Liz is a master of this, of just being on the same page as everyone and helping them slowly walk closer towards each other.

Jacob: In particular as a practitioner, sort of-- I don't see myself as an engineer at a vendor the same way that--

I don't mean vendor to besmirch. Right.

But I'm hoping to learn from the go-to-market of all of these, of Honeycomb, of Jeli, of others. LaunchDarkly.

I don't know the full list, and excuse me if I'm missing anyone. I think those are some good examples.

Once you figured out how to entice the C-suite or how to shift, I'm drafting off your capitalism. Because like you'll succeed financially if you figure out how to position and pitch this.

Then we'll see. I have no purchasing power, which makes it easy.

But some products, some Stripe may or may not use at some point, not my decision, but the observability maturity model, et cetera, et cetera, those sorts of content and perspective and positioning, I will read all that stuff and try and figure out how to tweak it to fit Stripe's culture and Stripe's executive legibility and see what I can do with them.

Liz: Right, exactly.

It's almost like you're trying to sail with the wind, not against the wind, right?

Jacob: Yes. The thing is, sailing directly into the wind, the market will stay irrational longer than you'll stay solvent.

Executives will stay around longer than you'll stay cranky.

Charity: Yeah. Yeah.

Even just explaining things to them in terms of going through, I mean, this is the thing where I think engineers try to be so precise, and so we get frustrated, be like, "Well, you can't convert the CI/CD. You can't convert that into dollars and cents, headcount."

Yeah, you can. It's not going to be accurate, but that's not your problem.

Business people make decisions every day based on back-of-the-envelope stuff; it's all they've got, so they have to move so that they're not--

And engineers need to get better at this, too, just going to like--

Because I pulled it out of my ass that it takes twice as many engineers to support a system that takes hours to deploy instead of, but you know what, I've been checking my math with technologists for the past month and it's basically right. So.

Jacob: Yep. I think the other thing I would add, which is interesting, and I don't know whether it's more controversial, but, and some of these things you can sort of dual boot in your brain, right, they're different models and you can look at whichever one is best at the time.

Executives are people too, which is to say that everyone's in a complex system.

Executives have, there's a power differential, but they also are operating under uncertainty.

They also have cognitive load. They also have managers.

Charity: And you know what, we all overestimate how much of what we know, they know.

They're operating with so little knowledge, you know?

And I feel like engineers tend to say things once and if it's not heard go, "Well, that didn't work."

And we also tend to just assume that there's knowledge.

Our manager, our manager's manager, manager's manager's manager.

And we also, I think, tend to assume that those people are inaccessible to us. None of those things are true.

Liz: What this feels like to me is almost like-- We were talking earlier about how as SLO's the map of the territory, right?

It almost feels like the value of SLOs is not the concrete number.

The value of the SLO is the shared understanding you build along the way.

Jacob: You can get a lot of benefit without injecting any faults. Right.

You design the experiment with not just the senior people in the room, but everyone on the team in the room, and see who's surprised by what. So, absolutely and completely.

Charity: Yeah. Although I do feel like there's this tendency, and I noticed this with the resilience engineering community, of just always saying that everyone needs to be involved in everything.

And it's like, there literally isn't enough time in the day, honey, we won't get any work done.

But the solution is always, well, include security, include design, include blah, blah, blah.

Include blah, blah, blah. Now you've got 10, 15 teams.

You can't make a decision with that many people, and it's like, you do have to, it's not always the answer.

Sometimes it's helpful and you need to be mindful of it.

Jacob: And I guess this is what I would say about virtualizing multiple, running many different AMIs on your instance, right, like in swapping.

Because you can take a pure Google SRE, you could just forget about resilience engineering and look at your error budget run down, right?

Or you could forget about SRE and just quote straight out of Dekker and understanding how to balance these, and it's going to be different in every organization, with different sorts of pressure, different executives, different engineers.

And I guess also if you tie it back to alignment or whatever, it really helps people understand the ground truth and understand the trade-offs, right.

Because like what is-

Charity: And understands if people know what the point is too.

Jacob: Yes. Which has all common ground, right, because it's certainly the case.

Maybe you shouldn't run that game day.

Maybe one person should run that game day instead of a whole team.

Then you should expect, in the totality of time, maybe some drawbacks or less understanding.

That person, when they're going on call, didn't participate in that game day and will be less prepared.

That was a trade-off that we made and we thought was the right trade-off, right?

You're always making decisions under uncertainty.

Charity: And I think that the reason that resilience engineers are often just being, "Throw everything in the soup," is because not enough ingredients are currently being used.

And so it's not that it's the answer, it's that they're trying to pull the pendulum back to including more people and realizing that there are more stakeholders.

Liz: Right. Kind of pushing the Overton window, right?

We often ask for things that we know we're not going to get.

Jacob: I think it is also interesting to think about what people tweet on Twitter and what actually happens to people in organizations. Right? I think it's ...

Liz: Long discussion about, what does real CD look like, right?

Charity's like, "You must practice real CD," and it turns out, even at Honeycomb, there are a lot of shortcomings in our CD, right?

Jacob: I mean, I think it's Fred who started as Honeycomb's SRE, who's certainly part of the resilience engineering community and maybe in their circles has hot takes on root cause not existing.

That's, I don't think, going to prevent him from writing code and solving Honeycomb's pressing infrastructure problems.

Charity: Twitter is a cesspool. Let's just agree with that.

It is Satan's laboratory. It is a Hell site. Sorry. I've had a rough week.

Jacob: I guess you dual boot. Twitter can be-- I don't know that I would be involved in this domain without Twitter.

Liz: One of the things that Twitter has possibly done for us is Twitter was how I met Jacob and then how we invited Jacob to come to a company open.

Honeycomb in general is a pretty open company, so we were like, "Hey Jacob, you don't work here. We're curious for your take on-- would you like to sit in on an incident review?" Like that was super fun.

Jacob: Yeah. It was fun.

And this, I'm just looking, was in December of 2019, or maybe I visited in November.

So it's like 18 months ago.

And it's really interesting to see exactly the same stuff we're talking about in this podcast show up in my writeup, like the positivity that was in that incident review meeting, right, as potentially an example of Westrum generativity, right?

Like this surprise that Valgrind, being in your CI/CD system, didn't catch this out-of-memory error, last anecdote, maybe, right?

The fact that I think you were dropping observability data at that time, because the exception happened after half the span had been sent, but not the other half.

And so bugs in observability systems are the worst, right, because that's what you're using to detect the bugs.

Charity: I can't trust my tools to help me debug my tools.

Jacob: But, yeah. That was super fun.

And I think I was at the South Park Commons then, which is a great organization I was lucky to be at, but it's just super interesting to see and learn from everybody.

And I think that this curiosity about reliability, I think is a great signal and sort of the-

Charity: Too often our tools punish our curiosity instead of welcoming it.

Jacob: Completely. But I think in a positive lens, right, I think that curiosity is a really interesting frame coupled--

I see surprise and curiosity as linked, and I see, if you think about the default way people think about reliability, which is, I want it to be perfect, if it's not perfect, it's broken.

It's bad that we're unreliable. Right?

Everyone starts getting anxious, you know, executives going to be at the review, right?

Curiosity is your inner child, it's like, "Oh, this is so-- what did I learn about the system today?"

Charity: One of the things that's been so great about-- yeah, sure, observability, blah, blah, blah.

You all have heard the spiel, but it makes it so much more fun to work, you know? It's a different job.

When you can tap into that childlike curiosity and just explore and find amazing things and horrible things and fix them and make your customers happy, you just go home feeling high at the end of the day.

And we're so lucky. Who gets a job like this?

The tools that we build, that we use in our socio-technical systems, have the potential to help us transform ourselves as well as our users.

Jacob: That's a pretty good pitch for either building or working with great observability tools.

Liz: Speaking of building or working with great observability tools, what's your observability journey been like now that you've started at Stripe?

What have you been up to that's been fun and exciting that you can share?

Jacob: An oldie but a goodie, I think, is from--

So again, two years ago, in fact, Brandur who's a teammate of mine wrote a blog post on Stripe, which maybe I remember from Charity retweeting before I worked at Stripe, about our use of canonical log lines.

So this is, Stripe is monolith-ish and we have, I guess it's a wide event that gets logged on every API call and actually a thing, speaking of what's gone right.

That data doesn't just live in a logging system that you can use during incident response or whatever, but gets archived as, I imagine, could be common.

I don't know how common, but it's certainly doable for other folks. Right. It goes to Kafka, gets archived in RK.

There's a web tool that's pretty useful that you can make SQL queries on.

And so you can slice and dice to say, did latency regress-

Charity: It's revolutionary. When you can slice and dice in high-cardinality dimensions and the high dimensionality data, it's just transformative.

And it doesn't sound like a big deal.

This is why so many people have to see it in motion on their data for it to like go, "Oh."

Jacob: No, I think that's right. We have a lot of customers.

We have a lot of merchants. We have a lot of sub systems. We have a lot of products.

Charity: And without this, you're literally just, you're forming guesses or looking at top 10 list, you're guessing, and then you're looking for evidence that you were right.

With the slicing and dicing, it is transformative. It changes your life.
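As a rough illustration of the slicing and dicing described above, the sketch below loads a handful of made-up wide events into an in-memory SQLite table and groups them by a high-cardinality field. The table name, column names, and every value are hypothetical; nothing here reflects Stripe's actual schema or tooling, and SQLite stands in for whatever query engine sits behind a real archive.

```python
import sqlite3

# Invented sample events: (merchant_id, http_path, status, duration_ms).
rows = [
    ("m_1", "/v1/charges", 200, 110.0),
    ("m_1", "/v1/charges", 200, 130.0),
    ("m_2", "/v1/charges", 200, 900.0),
    ("m_2", "/v1/charges", 500, 1100.0),
    ("m_2", "/v1/refunds", 200, 95.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE canonical_log_lines ("
    "merchant_id TEXT, http_path TEXT, status INTEGER, duration_ms REAL)"
)
conn.executemany("INSERT INTO canonical_log_lines VALUES (?, ?, ?, ?)", rows)

# Slice by merchant and endpoint instead of looking only at a global average.
query = """
SELECT merchant_id,
       http_path,
       COUNT(*)           AS requests,
       AVG(duration_ms)   AS avg_ms,
       SUM(status >= 500) AS server_errors
FROM canonical_log_lines
GROUP BY merchant_id, http_path
ORDER BY avg_ms DESC
"""
results = conn.execute(query).fetchall()
for row in results:
    print(row)
```

With this toy data, the slow requests turn out to belong to a single merchant on a single endpoint, which is the kind of answer a global latency graph or a top-10 list would tend to hide.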

Jacob: I guess I'm lucky enough that this was mostly in place when I showed up.

Charity: So interestingly enough, it took us a year and a half to figure out the instrumentation side of observability, the single arbitrarily wide structure, log line per service per request.

Turns out Amazon's been doing this in EC2 for like, 15 years. And I'm just like, "God damn it."

So I'm delighted to see it.

I really think it's one of the primitives of observability and the more that people are doing it, the happier they will be.
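A minimal sketch of that primitive, one wide structured line per request, might look like the following. The class name, the field names, and the choice of JSON as the output format are all assumptions made for illustration; the episode does not describe how Stripe or Honeycomb implement this.

```python
import json
import time


class CanonicalLogLine:
    """Accumulates fields during one request and emits them as a single line."""

    def __init__(self, **fields):
        self.fields = dict(fields)
        self._start = time.monotonic()

    def add(self, **fields):
        # Any layer handling the request can attach more context as it learns it.
        self.fields.update(fields)

    def emit(self) -> str:
        # Duration is computed once, at the end, so the line describes the whole request.
        self.fields["duration_ms"] = round((time.monotonic() - self._start) * 1000, 3)
        line = json.dumps(self.fields, sort_keys=True)
        print(line)
        return line


def handle_request(method: str, path: str, merchant_id: str) -> int:
    """Hypothetical request handler that emits exactly one line per request."""
    log = CanonicalLogLine(http_method=method, http_path=path, merchant_id=merchant_id)
    try:
        log.add(db_queries=2, rate_limited=False)  # stand-ins for real work
        log.add(http_status=200)
        return 200
    except Exception as exc:
        log.add(http_status=500, error=type(exc).__name__)
        raise
    finally:
        log.emit()  # runs on success and on failure


handle_request("POST", "/v1/charges", "m_1")
```

Emitting in a `finally` block is one way to make sure the line is written even when the handler raises, so failed requests are as queryable as successful ones.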

Liz: You need two ingredients, right?

You need the right data, but also you need really great visualization and intuitiveness and interactiveness, really.

Those are the two magical ingredients.

Jacob: Yeah. And I think, in particular, a place where I'm curious is for distributed tracing, the places I've been maybe haven't had the slam dunk visualization there. Right.

And so it's like, if you don't have a good way to use that data that's easy for a junior or a regular engineer who has 50-- the same security or reliability expertise that people don't necessarily have because they're a product engineer doing 50 things.

How do you give people quick insight?

And I mean, BubbleUp is maybe a great example from Honeycomb, but in terms of distributed tracing, I've seen it be more expert oriented than like--

Charity: Yeah. I mean, that's been the fatal flaw with the last generation or two of distributed tracing tools is that, two flaws.

Number one, you can't roll it out partially. If there are gaps in it, it's just, well, it's broken.

And number two, every place that I know of that's rolled it out, it's not broadly used by people.

It's like, there's one person.

Everybody goes to that one person when they need something traced.

Liz: Yeah, it's definitely a pattern that you have to consciously work at.

You have canonical log lines and what are you doing to utilize them more fully at Stripe?

Jacob: I think it's about learning-

Charity: They hired Jacob.

Jacob: I would say plug it into the incident pipeline to understand like, okay, what crazy queries are other people in the organization writing?

When we went through this incident, what queries could we have written, right, to understand how to navigate the data?

Liz: So leveling up through social connections, right?

Leveling up based on people's past queries in order to help teach future people, or--

Jacob: Again, alignment, common ground.

The organization as a gestalt has pockets of deep expertise in understanding how to use this data and how do we get the right information to the right people at the right time so that they can also query like someone who's been there for five years because that person might be on vacation when an incident happens.

Liz: You mentioned earlier this idea that you're trying to align what the industry is doing versus what executives think and perceive.

What is it that startups, what is it that the rest, that the tools ecosystem can do to help facilitate that adoption?

Jacob: I mean, keep doing what you're doing, I guess.

Maintain the Goldilocks zealousness, right?

Take the strong underpinnings, whether it's Scuba or John Allspaw's note to tooling vendors or whatever, right?

Take those inspirations, stay true to them, and then figure out--

If you stay true to them and you're successful, then you all figure it out for me, right, because you've successfully sold it to executives.

Liz: Yeah. I think that's perfectly fair, right? Peer validation, right?

Marketplace validation is a sign that enables you to pick up that same tool set and adapt it elsewhere. Yep.

Well, this was super awesome.

Where can people find more about you, Stripe, your writing, and what you've been reading recently?

Jacob: Awesome. So for me, I would say, just keep it simple.

Twitter, J-H-S-C-O-T-T, JH Scott on Twitter, and I'll update.

I am trying to write more. I got to find, I think I'm on Substack, I don't know, but I tweet a lot too.

And then, Stripe there, I would highlight, Stripe is hiring.

We have a lot of interesting problems on all facets of this, and that's stripe.com/jobs.

And I guess one last plug for Stripe Press and The Increment.

The Increment is a great tech magazine, it's published some of Charity's work. Stripe Press-

Liz: And some of my work too.

Jacob: Oh, sorry, some of your work.

I have the whole set of issues, but haven't read every article in them.

Stripe Press publishes great books.

So yeah, definitely check out Stripe.

Liz: Awesome. Thank you so much for joining us.