Ep. #25, Reliability First with Amy Tobey of Blameless
In episode 25 of O11ycast, Charity and Shelby speak with Amy Tobey of Blameless. They explore the evolution of the SRE role, incident management, and the pains of rewriting system architecture.
transcript
Amy Tobey: I've been at Blameless for about six months, and when I showed up I was the first SRE full time, and the team had built quite a lot of software with logging.
There were some metrics here and there, but there was a lot of time spent in incidents where we would be trying to figure out what's going on.
There were a lot of incidents in the past where we didn't figure out what happened.
And so I had a lot of work to do reinventing the SRE program and other things around, but kept pointing out and saying "We don't have enough observability in this system."
Until it came to a head, and at one point we had an incident with a customer where the customer very clearly said to us, "We're very unhappy with this behavior."
That was the opportunity for me to go to leadership and say, "We need to pull back some feature velocity and use it to deeply invest in observability.
Because we can't even tell people what is going on, so if we can't solve that problem we're going to be continually in incident hell and not making forward progress."
Charity Majors: You can't understand it for yourselves.
That's the first step, is understanding it before you can even help to explain it to someone else or fix it, or anything.
We see this so often where people want to move so fast, and they're just sprinting off in all different directions and they can't see what the fuck they're even doing.
Anyway, this sounds like a great time to introduce yourself.
Amy: Sure. I'm Amy Tobey, I've been doing what we call SRE since about 1999.
Charity: Before it was SRE?
Amy: Before it was SRE, before DevOps, and starting with things like monitoring projects, doing Unix stuff, writing Perl.
But over that 20 years I've come to these places of observability and SLOs, and where we are in modern SRE because we're finally starting to find things that actually work to bring organizations closer to reliability, whereas that first 15-18 years really was just--
Charity: There wasn't a lot of progress.
Amy: A whole lot of shoveling shit, where I want to go and build something nice and I'm too busy cleaning out the barn.
Shelby Spees: I find it interesting that at Blameless you're facing customers without being able to explain what's going on, when a lot of the purpose of the Blameless tool is to help with incident review, by my understanding.
How did that play out?
Amy: That has been my primary tool for convincing people that we need to do this investment.
I think part of it is me bringing my experience, and being in and out of the trenches, and then bringing that concept and saying "Our audience, our customers have a higher expectation than you're going to get at a consumer company. Or even maybe some of the monitoring companies, because our tool primarily gets used when things are already going wrong for someone."
The reliability of our tool is absolutely a paramount feature of it, so that's really what most of my work in my time at Blameless has been.
Our number one feature is reliability, because our audience is people who care about reliability.
Charity: You said you've been doing this since '99, and it feels to me like we're starting to reach the point as an industry where we have some institutional memory.
You see this young company, like every team, we all have to learn this stuff over and over.
But it always seems like the right thing to do is to move fast and to just do something, it always seems like the right thing to do is just get it out the door.
You have to see that fail a few times before you can retrain yourself to look for different things in that cycle and pause at different places.
It's reaching for a different certainty, reaching for a different comfort, maybe, with what's going on.
Amy: I have a simpler view that's a little less pleasant, which is people are reaching for control.
In large systems, and in the systems we run today, control isn't really available to us.
Charity: I think you're absolutely right, but the urge to reach for control has to be retrained.
We have to let our rational brain take over and go "OK, I did this 10 times. Every time I reached for control. Now, what's really happening? What can I really expect out of this and how can I act in a way that will produce the best outcome?"
Shelby: It's definitely an entire paradigm shift, and it's something that I've been learning about, actually thanks to you Amy, for introducing me to a lot of the learning from incidents community and resilience engineering communities, and things like that.
The concept of resilient systems over strong but brittle systems where if you solve for your current state and things change, that's what causes failure modes.
You have these emergent failure modes.
Charity: It's not about making systems that don't fail, it's about making systems that can fail a lot and still serve their core purpose of serving the customer.
Shelby: Exactly. Where nobody notices that your server fell over, or 100 servers fell over, if you're at that scale.
Amy: My training earlier in my life was I was a band nerd, and then I went to college to be a music educator, and obviously that's not what I ended up doing.
But all through that, there's a common phrase that gets said a lot. "Practice makes perfect, or perfect practice makes perfect."
I feel like that's what Charity was saying about "We have to fail more," we have to practice in failure because if we don't, we don't develop the abilities and the muscles and the skills and the reflexes that allow us to respond to those.
Then we also discover where we're brittle, so we don't really have a choice. There was that brief moment when Amazon Web Services was first starting to take root when it was so bad, it failed so much that we built these muscles because we had to. Because instances would just die all the time, and then it got good.
Then things started getting more brittle again, at least that's what I saw.
So it was again, we fell out of practice and we spent too much time on the couch.
And now we're starting to develop actual discipline, like chaos engineering that will actually help us exercise our abilities.
Charity: It's about retraining ourselves to embrace this, and to feel-- I've always been excited when things fail.
That's when I click into my happiest self. I'm just like, "Oh my God. Everything's fucked. This is amazing."
But I feel like you can practice that, too. You can do that intentionally.
You can just go, "This is exciting. This is rare, this is one of the moments where I feel most alive. Things are just completely on fire."
I think that we underestimate our own ability to teach ourselves to react in different ways to situations.
I think that there's something to be said for just consciously going, "I'm going to decide to react in a different way the next time something like this happens."
It's stupid and scary and annoying to me how much this mirrors becoming an adult, because becoming an adult is about--
What do the psychologists call it? They call it "Mastering your executive brain."
Making the meta brain drive more so that you're not just reacting and existing in a space of Id, but you are noticing your reactions and paying attention to them, identifying them, communicating them, and making decisions about how to act that are consistent with your long term goals.
Amy: Right, but not everybody is good at that.
Charity: Oh, no.
Amy: There's an intersection with neuro-divergence there, which I would be surprised if any of the three of us turned out to be normal.
My own experience is that I struggled through school with ADHD, undiagnosed, unmedicated, all that.
I still am because I developed skills to cope with it, but I feel like part of the trajectory of my career was that I landed in operations early and I got to bring it to bear as a skill as opposed to a handicap.
I feel like that's starting to become something that more people are-- Like, these skills that we developed, neuro-divergents.
I'm not going to claim you two are without you two saying it, but--
Charity: No, absolutely.
I remember trying to work as a software engineer and it was just so tedious, the idea that I would know what code I was supposed to write days or weeks in the future, I just lost all interest in computers.
I've gravitated towards Ops because-- I was diagnosed with ADHD just last year. Everyone in my life was like, "Of course."
Amy: Duh.
Charity: And I'm just like, "What are you talking about? It never occurred to me." But yeah, you're right.
Shelby: I think it's really interesting, and it's a theme that comes up a lot.
I appreciate the metaphor of adulthood, or we were hearing from Donna at LaunchDarkly today.
A story from home, about parenting and DevOps and how you're just improvising all the time.
You have no idea what you're doing, and I feel like that's the theme of adulthood.
We've reached this level and we don't know what we're doing, but you can get better at improvising. It's a skill that you can develop.
That was something about incident response that I'm learning, now that I'm actually in teams that care about incident response and training it, is that it is a skill that you can learn.
You can learn incident response the way you can learn to be an EMT.
Charity: I'm curious to hear, Amy, about the systems that you have at Blameless.
I guess I'm a little surprised, in my mental model it's not the kind of company where you'd have a lot of production systems that are fragile, or they need to be post-mortemed a lot.
What do you have in the back end there?
Amy: It's a microservice architecture, and the older stuff is based on an RPC system called Nameko that was built in Python and uses RabbitMQ for its message bus.
Charity: Oh, say no more.
Amy: I knew that when I joined, and there's work in progress to eradicate it.
But yeah, RPC over RabbitMQ is exactly as bad as it sounds.
Charity: Totally. I will say, for queueing things, not to be marketing our own shit too much but one of the only ways that I've found to get a handle on those problems is with something like Honeycomb.
Where you can track everything that's in flight, where you can sum up all of the -- Or even just what percentage of in-flight workers are being used by--?
Break down by customer, breakdown by endpoint. Without something like that, it's just impossible. Being a DBA without that stuff--
Amy: Oh, my gosh.
Charity: Before and after, being able to just say "OK. Which of our customers is consuming most of the lock time in this table?"
Just really basic, straightforward questions like that. It is insane to me that as an industry we have gotten this far without having it.
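Editor's note: The kind of question described here ("which customer is consuming most of the lock time?") amounts to a group-by over per-request events. The following is a minimal sketch of that idea only, not Honeycomb's API; the event fields (`customer`, `endpoint`, `lock_ms`) and the sample values are invented for illustration.

```python
from collections import defaultdict

# Hypothetical per-request events; field names and values are illustrative only.
events = [
    {"customer": "acme", "endpoint": "/save", "lock_ms": 120.0},
    {"customer": "acme", "endpoint": "/find", "lock_ms": 30.0},
    {"customer": "globex", "endpoint": "/find", "lock_ms": 900.0},
    {"customer": "initech", "endpoint": "/save", "lock_ms": 50.0},
]


def share_by(events, key, measure):
    """Return {group: fraction of the total `measure`} with events grouped by `key`."""
    totals = defaultdict(float)
    for event in events:
        totals[event[key]] += event[measure]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {group: 0.0 for group in totals}
    return {group: value / grand_total for group, value in totals.items()}


# Break lock time down by customer, then find the heaviest consumer.
lock_share = share_by(events, "customer", "lock_ms")
top_customer = max(lock_share, key=lock_share.get)
```

Passing `"endpoint"` as the key gives the breakdown by endpoint instead. The point is that the raw events keep enough dimensions to slice either way after the fact.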
Amy: In the case of this architecture, what I believe happened -- It's all rumors and legends, is somebody really loved this piece of software and felt that it was the right choice at the time.
Charity: Of course they did.
Amy: They were the lead developer, so they charged that and they built a ton of software on it.
So in a way, it's a very successful system. We have a successful business built on it.
Charity: Yes, you have survived long enough to hate yourselves.
Amy: Now the trick is turning what we've learned from that experience into what we're calling our multitenant re-architecture. Which is switching from-- We continue with micro services and leaning heavily on the micro service architecture to get out of that RPC system. But that's the other thing that's happening right now, is we're rewriting a huge chunk of our architecture.
Charity: Oh, boy. The Parse story in a nutshell was we wrote this very popular mobile back end as a service using Ruby and Rails.
So, you've got this pool of unicorn workers that can have one process per request at each point in time, and by the time we got acquired by Facebook I think we had 60,000 mobile apps all contending for those resources.
Around the time I left, we had over a million.
So you can see the problems here. As we start spinning up more and more databases behind this fixed pool of workers--
It doesn't matter how big you make the pool of workers, we're running at 10-20% usage steady state under normal circumstances, but it can spike to many hundreds of times that amount just if a single backend gets a little bit slow.
Then every single worker will just be waiting on that database to return its query, or that Redis instance to return its--
Once you have tens of those databases, pretty much something is always slow.
We had to bite the bullet and go, "OK. We are going to need to do a full rewrite of our API in a threaded language," and we chose Go, and it was painful.
It was painful, but I genuinely don't know how people do it without something like Honeycomb.
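Editor's note: The saturation described above can be illustrated with Little's law (average busy workers = arrival rate x time each request holds a worker). This is a back-of-the-envelope sketch; the pool size, request rate, and latencies are invented numbers, not Parse's.

```python
def busy_workers(requests_per_sec, traffic_mix):
    """Little's law: average workers needed = arrival rate x time in system.

    traffic_mix is a list of (fraction_of_traffic, latency_seconds) pairs.
    """
    return sum(requests_per_sec * fraction * latency
               for fraction, latency in traffic_mix)


POOL_SIZE = 200   # hypothetical fixed pool of one-request-at-a-time workers
RATE = 300.0      # hypothetical steady request rate, per second

# Every backend healthy: each request holds a worker for 100 ms.
healthy = busy_workers(RATE, [(1.0, 0.1)])

# One backend serving 10% of traffic slows to 10 s per request.
one_slow = busy_workers(RATE, [(0.9, 0.1), (0.1, 10.0)])

print(f"healthy:  {healthy:.0f} workers needed ({healthy / POOL_SIZE:.0%} of pool)")
print(f"one slow: {one_slow:.0f} workers needed ({one_slow / POOL_SIZE:.0%} of pool)")
```

Under these assumptions the healthy system needs about 15% of the pool, while a single slow backend pushes demand past the whole pool, so requests for every other backend queue behind it.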
Shelby: How do you make that decision and how do you get buy-in for a rewrite, or for your architecting entire systems?
Charity: It's like what Benjamin Franklin said about Americans, that we always do the right thing after exhausting all other possible options.
Amy: That's how it goes very frequently, because those rewrites that happen without that are often worse.
Charity: That's the thing. If there's any other option, you should take it.
Shelby: Just the hero working nights and weekends to write it all from the ground up.
Charity: It's not about that, it's about the longer your software has been around, the more stable it is and the more it's known.
The more it's boring, and boring software is what powers the world.
I know we're all singing the same song here, but having to take your more or less stable 3 or 4 year old API and rewrite a new one.
Not only that, but the impedance mismatch. Ruby and its implicit everything, it's just going to guess at all your types.
Then you try to map that to Go and it's going to be like, "Nope."
All of the mobile apps out there in the wild, which aren't happening on a regular release schedule, they're stuck in 2002 when it comes to releasing.
They're like, "Once or twice a year we're going to do all or nothing."
Amy: And you have to be bug-compatible with that software.
Charity: "Bug-compatible" is exactly the word I would use. It doesn't matter.
There's no such thing as right, what is right is whatever mistakes you made before.
So we did a bunch of contortions, and this was the most painful thing that I've ever been through.
We finally settled on a cadence that was basically, we wrote this little thin shim thing that would fork traffic.
We would put that on one of the app workers and it would fork the traffic, and it would send it to a stable one and a Go one for that end point, and it would return and diff the results that it had gotten and log any differences to a file.
Every morning someone would come in and check that file, see which bugs we had and which edge cases we had.
So, those were the easy ones. Those were the read end points.
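Editor's note: A rough sketch of the fork-and-diff idea for read endpoints follows. It is not the Parse shim: the two handlers are hypothetical stand-ins, the request shape is invented, and an in-memory list stands in for the log file that was checked each morning.

```python
import json

_MISSING = object()  # marks a field present in only one of the two responses


def stable_handler(request):
    # Stand-in for the existing implementation (hypothetical).
    return {"id": request["id"], "count": str(request["n"])}


def rewritten_handler(request):
    # Stand-in for the rewrite; it returns an int where the old API returned a string.
    return {"id": request["id"], "count": request["n"]}


def diff_responses(old, new):
    """Return {field: (old_value, new_value)} for every field that differs."""
    differences = {}
    for field in sorted(set(old) | set(new)):
        old_value = old.get(field, _MISSING)
        new_value = new.get(field, _MISSING)
        if old_value != new_value:
            differences[field] = (old_value, new_value)
    return differences


def shadow(request, log):
    """Answer from the stable handler; compare against the rewrite on the side."""
    old = stable_handler(request)
    try:
        new = rewritten_handler(request)
    except Exception as exc:  # the rewrite must never break live traffic
        log.append(json.dumps({"request": request, "error": repr(exc)}))
        return old
    differences = diff_responses(old, new)
    if differences:
        log.append(json.dumps({"request": request, "diff": differences}, default=repr))
    return old


mismatch_log = []
response = shadow({"id": 7, "n": 3}, mismatch_log)
```

The caller always receives the stable response, so a mismatch costs nothing in production. It only leaves a log entry (here, the string-versus-integer `count`) for someone to triage.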
Amy: You're approximating a solution with that. It's almost like how ML guesses at things by wiggling into the right spot.
Charity: It's purely just, "Is the end user going to think this is a bug? Then we have to think this is a bug."
For the write end points we actually had to fork the database and daisy chain two of these so that we could--
It worked, but this is why when I was leaving Facebook, I went "Holy fuck. I don't know how to live without the tooling that we had built around Scuba, because you literally have to be able to inspect down to every single request. What is happening and what correlates?"
If you're dealing with metrics and aggregates, they are as good as useless.
Amy: They tell you the shape of what's happening in the system, but they don't tell you what is actually happening in the system.
They can't tell you anything about the reality at the pointy end where the work is happening.
Charity: Yeah, and I didn't understand all of the whys.
All I knew was that it had been categorically different than all of the field tooling that I had tried before, and it took two or three years.
I'm still going on explaining that experience.
I'm still trying to explain it to myself and to the world, just what is it that matters?
Why does it matter and who does it matter to?
Shelby: I found it really interesting, just coming into learning about observability without decades of experience, living with metrics and logs and stack traces as my debugging tools.
Where when I first was living in production systems, I'm like "Why can't I answer the questions that I have with these dashboards in front of me?"
Exactly what we talk about at Honeycomb, and I get so frustrated.
It's like "I just want to know what was happening at this point in time," and then I learned "You can just record that as unstructured data and then go back and ask those questions. Who wouldn't want to do that?"
So it's been a very interesting perspective. Saying, "Of course we should store it this way," and then going and facing everyone who's like, "We've had our quantiles for the last two decades, how--?"
Charity: You get used to a tool.
Amy: I have a theory about who doesn't want that.
It was me at a time, and it was probably Charity at a time, where my coming around on observability and really understanding how the modern tracing world is pushing the state of the art for it was the realization that the gut instinct that I bring to the table as an experienced systems administrator, performance analyst, whatever--
Is that I've built that knowledge graph inside of myself to be able to look at a huge wall of graphs, looking at the curves and looking at different pages, and putting the information together in my head.
What I struggled with is people would keep saying, "Amy. Can you please teach me how to do that? I really want to know how, because you always troubleshoot stuff and you get to the answer really fast. How do you do that?"
The best answer I could say is, "You've got to troubleshoot a lot of stuff."
Charity: It's intuitive. It's pattern matching.
We just learned it that way, and the feeling of power when you just can look at a graph and just go, "It's Redis."
And everybody's like, "How did you do that? It doesn't say Redis anywhere." And you're just like, "But I bet you I'm right."
And you are. That is godlike power and it feels so good.
Amy: It feels so good, yes.
Charity: But it's not replicable.
Shelby: It's diagnosing based on these peripheral symptoms.
Charity: It's an intuitive leap. It's an intuitive leap based on the scar tissue of many past outages.
Amy: I was going to say, 101 intuitive, which is why I was reluctant to be like "I don't need these tools. I can already do this."
But the newer tools, Honeycomb, and we're using Stackdriver right now, these tools bring that intuition.
They democratize it, they make it available to all the engineers that maybe haven't or don't work in ops and infrastructure, and see how all the pieces are put together and where the wires and lines are.
But that now these tools connect things for us, so it's a cognitive assist.
Now everyone can make those intuitive leaps at the speed that maybe Charity and I were doing it 10 years ago.
Charity: One of the things that I've come fairly recently to realize is key to this, is that when you're giving software engineers tools you need to give them tools that speak the language that they speak all day, which is endpoints and variables and functions and services.
The languages that you and I speak are like "There's four or five different kinds of memory. Which kind is it? Is it resident memory or is it shared memory or is it peripheral memory?"
We're very good at translating from one to the other, and that's not an expectation that I think is reasonable for your average engineer.
Because what they need to know is, "Is each request able to execute? And if not, why not?" That's it.
They don't need to think about all the different kinds of memory, what they need to know is if they just shipped a change, did it triple the size of the memory usage?
That's useful information, but the memory usage is really all that they need to know.
Amy: There's an analogy that stuck in my head years ago, and I don't remember where I read it, but it was about why in detective shows they always show the detective with the flashlight in a dark room shining the flashlight.
Somebody gave a great explanation that said, "The neat thing about using a flashlight like that is it directs your focus into a smaller area. You're more likely to notice details in that small area than if you're in a well lit room trying to observe everything."
So one of the things that these modern tools are doing is they're bringing our focus closer to the problem more quickly. It is the flashlight in the dark that lets us make those intuitive leaps.
So now in a software engineer, who maybe has a ton of context in programming languages and build tools and all that stuff, doesn't need all the other context in infrastructure because they get spotlighted in to where they are strongest, where they are going to notice the right things.
Charity: They need to know the consequences of the changes that they have just made.
I feel like I get unfairly maligned sometimes for shitting all over metrics, but I believe that the metrics and aggregates, the Datadogs, the Prometheuses of the world--
That is the right tool for the job if you're managing infrastructure, and it's an amazing tool and an amazing tool kit. It is what you need.
You need to care about capacity planning, you need to care about in aggregate "Am I doing my job? Is the infrastructure serving its purpose?"
What you don't give a shit about is all the 500s that the software engineers are out there causing.
You can't feel personally responsible for every one of them because they aren't all your fault. There's only so much you can do.
There's this constant debate, or just push/pull dynamic about "Should we care about the error rate or not?"
Because we care about some errors, the ones that are caused by infrastructure.
Amy: We should care about the error rates that our customers care about.
Charity: Yes, but from the perspective of responsibility.
I've been thinking about trying to draw the lines between "What does a tool like Honeycomb need to provide?"
And it is the information that software engineers need to know to course correct, to understand the code that they're putting out in the world.
But then, infra teams and most of my ops people work for Amazon.
But they have a different need, they don't need to give a shit about my customers.
I think that platform providers are the ones where working as a platform provider at Parse is where this became a very crisp thing to me.
You are both the infrastructure provider and you're writing your own APIs and tools and stuff, and it could be very difficult to tease apart "Are these 500s that we caused that are our fault? Or ones that our customers cause that we are enabling because we're a platform, but it's not our job to fix them?"
Shelby: So where does that go? We've identified something that maybe is unexpected behavior.
My specialty lately, trying to make this more about me, is we discover these things happening in production and very often still even in orgs that have SRE teams and that have pretty strong software teams.
Like what I'm doing now, but the information just fell on the ground.
Charity: When I say "It's not our problem, it's their errors," the responsibility that I think we do have to them is to surface enough information that they can self-serve.
That they can understand what they've done, that they can correct it, that they can-- I think that there's a pretty clear line of demarcation between the stuff that I can control and the stuff that I'm providing that I can't control.
I can't control how users are going to use it, I just need to give them a good experience on my platform and I need to make it clear enough to them when they've done something wrong, when they've used my service wrong, I need to help them self-serve so that they can figure out how to fix that. They are other engineers.
Amy: But there's another side to that, is when we discovered these conditions where the users are using the API wrong and there's one, as you mentioned, an avenue where we go and engage with our users and DevRels like Shelby and I do and work with them to do it better.
But there's also that information needs to go back to our product teams, so when they're designing V2 of the API or the next API, at least that information is available to make better decisions.
Charity: I think what I'm thinking of here specifically is people would use the Parse API and they would write these horrendous database queries.
The worst database-- They'd be doing something that seemed pretty legit through the API, and it would translate into five full table scans on hundreds of millions of rows.
Just five times, because why not?
I think we never really got to the point where we had enough engineers or enough sophisticated customers who didn't want to just bang on it, but I always wished that we had funneled up better information to them about how those were translating.
How the API was translating into MongoDB queries, because it wasn't really their fault.
I couldn't really blame them, they were just doing something reasonable with an algorithm and it was spitting out this incredible horseshit.
I felt we could have held them to a higher standard, if we had given them the tools to see why it was taking five minutes.
Amy: Have we broken that wall yet, though?
Because if somebody goes and buys Honeycomb and implements it deeply in their stack, that's great for the people inside the firewall, but the customer still doesn't get that observability benefit out of that.
Charity: They don't get that, unless we surface that to them.
You're absolutely right, and this is absolutely true.
Shelby: It's something that I don't want our users, our customers to have to go and look at the code and how our SDKs are implemented in order to understand the best way to use it.
That's a ridiculous standard to hold people to, and it's exactly like you're saying.
It's meeting them in a place where they can get the most out of the tool, out of the service.
Charity: Treating engineers like people, not like engineers.
Amy: I guess what I'm trying to edge towards is that at some point, as we're all working in a very SaaS world that's getting even more SaaS as the idea that things should be boring, we shouldn't be building crap that's not differentiating our business.
There's more of this, but now we're entering a situation where if we got 20 services in our infrastructure and 10 of them are SaaS, how do we tie that all together?
When I have an incident going and I need to go and say, "OK. I talked to my database SaaS and got back this result that went to my service that didn't talk to this other SaaS, that then looped back and then went to this other SaaS. Now I need to figure out what the heck is going on."
Charity: I think we've gotten really good at making APIs to get things in, and we haven't gotten very good at getting APIs to get things out.
This is something that we're struggling with at Honeycomb right now and we're actively working on it, but it's a different skill set.
It's a different-- You need to understand your customer's needs in a very different way.
There is not nearly-- Sadly, there's not nearly as much of a clear line to profit in doing it well.
Amy: That's true. I keep thinking about similar space things and I keep coming to I'm probably not going to be able to build a business around this, might have to talk somebody big into building a foundation or something, and that's not going to happen.
Charity: But I do think that there is something really valuable here, and it's deep value, it's long-term value, it is interoperability value.
I do think that we will get there, it's just going to be later.
You need to have established yourself and you need to not be worrying about survival when you tackle these problems, or I think you'll make poor decisions.
Amy: But shifting back to the current state of the art, that's often how people get in the hole where they're struggling, trying to achieve a place of feeling confident but unable to get there because the tools aren't available and the observability is not there.
The infrastructure just isn't evolving because there's no direction.
Charity: That is true. Now, in terms of observability and instrumentation, I think that the serverless kids have it right here.
I think that as long as your platform is instrumentable in the way that serverless is, which is to say they assume no log files and no loglines.
No anything except the code that you're writing should be able to report on its status at any point, and that--
When people are asking how to instrument in the brave new world, I'm always like "Think of how serverless does it, because I think that those are the right bundle of assumptions and practices to work towards."
Most things that let you-- Amazon functions, etc. You can make your software report its status back to you, and if you're using something like Honeycomb which is pretty agnostic, you can just send events to it from anywhere.
Knit them together with a trace ID so that you're persisting it whether it bounces around from system to system.
That's fine, as long as you as the provider are persisting those fields and allowing people to do the tracing and to report back out, I think that's a pretty achievable bar.
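Editor's note: One way to picture "knit them together with a trace ID" is that each hop emits a structured event and passes the same ID to the next hop. This is a sketch under stated assumptions, not any vendor's SDK: the `x-trace-id` header name, the service names, and the event fields are all hypothetical, and a plain list stands in for wherever events are sent.

```python
import time
import uuid


def new_trace_id():
    return uuid.uuid4().hex


def make_event(service, name, trace_id, **fields):
    """Build one structured event; every field name here is illustrative."""
    event = {
        "timestamp": time.time(),
        "service": service,
        "name": name,
        "trace_id": trace_id,
    }
    event.update(fields)
    return event


def resize_image(sink, headers):
    # Reuse the caller's trace ID if one was passed along, else start a new trace.
    trace_id = headers.get("x-trace-id") or new_trace_id()
    sink.append(make_event("resizer", "resize", trace_id, duration_ms=42))


def handle_upload(sink, headers):
    trace_id = headers.get("x-trace-id") or new_trace_id()
    sink.append(make_event("api", "upload", trace_id, status=200))
    # Persist the same ID across the hop so the two events can be joined later.
    resize_image(sink, {"x-trace-id": trace_id})
    return trace_id


sink = []  # stand-in for an event backend
trace_id = handle_upload(sink, {})
trace = [event for event in sink if event["trace_id"] == trace_id]
```

Filtering the sink by `trace_id` recovers both hops of the request, which is the property a provider preserves by carrying that field through its own systems.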
Amy: I don't think the modern serverless is even possible without that, because it'd just be little black boxes that break all the time and you can't do anything about it.
Charity: But when you're talking about knitting together all of these different platforms, that is the standard that I would feel comfortable holding everyone to.
Amy: Absolutely. I'm with you on that.
Charity: I think we've just solved all the world's problems.
Shelby: Thanks for joining us today, Amy.
Charity: Thanks for coming, Amy. This was delightful.
Amy: It's my pleasure.