In episode 11 of O11ycast, Charity Majors and Liz Fong-Jones speak with Gremlin chaos engineer Ana Medina. They discuss the relevance of breaking things in order to engineer them more efficiently, monitoring vs observability, and chaos engineering at scale.
About the Guests
Charity Majors: Ana, you're a software engineer. What drew you to chaos engineering and the broader operability-side of the house?
Ana Medina: It was just completely different to me. I actually come from a self-taught background, I started coding about 12 years ago.
But I was always very focused on the software layer, and I never really had any mentors in the space of systems or network that even told me to look into this.
I actually stumbled upon operations when I joined Uber as a software engineer, except I was placed on the SRE team.
Liz Fong-Jones: They just threw you into the deep end, didn't they?
Ana: Yeah, they definitely did. It was very much like, "You want to be an intern at Uber? We're going to make you the first SRE intern.
That means you come in as a software engineer and we figure out your role as you go." Which turned out to be crazy, and really interesting.
I was placed on call in my third week with the company, so that made me ramp up really fast on a lot of system skills that you just needed, as well as understanding microservices and distributed systems.
But at least I had a lot of really smart-minded folks around me that I was able to get into rooms and be like, "Can you explain to me how distributed systems work, and microservices?"
Charity: Basically it wasn't so much a choice as you were dropped in the deep end and went, "OK. It's fine here."
Charity: All right, cool. This seems like a good time to introduce yourself.
Ana: My name is Ana Medina. I currently work as a chaos engineer at a startup based out of San Francisco by the name of Gremlin, and we currently focus on building a chaos engineering platform.
Charity: Awesome. You might also recognize the voice of my new co-host.
Liz: Hi, I'm Liz. I'm the new co-host with Charity of O11ycast.
Charity: Thank you so much to Rachel Chalmers for starting the podcast. Our episode with Liz was our most popular episode to date.
Liz: Also, now I work at Honeycomb.
Charity: Also, now we're coworkers. It's so great. Back to the topic of the day, you're working on chaos engineering now. Were you also dropped into the deep end on that, or did you gravitate towards it because you loved it?
Ana: No, I actually also got dropped into chaos engineering.
Charity: I'm all about like opportunism.
Ana: I've definitely always had the mindset of "Just put me in, Coach." That's definitely what fell into place when I joined Uber. It was very much "We need someone to be doing chaos engineering, be software engineering--"
Charity: Gremlin, though. Gremlin is a choice that you made.
Ana: Yes, yeah. Gremlin was--
Charity: Tell me about Gremlin and chaos engineering over there.
Ana: I've been at Gremlin for about a year, and I joined due to a mentor of mine in the space already. Tammy Butow, she's the principal SRE for Gremlin. She'd been recruiting me for about three years.
Liz: Tammy's very persistent.
Ana: Yeah, she definitely was like, "I see that you want to work somewhere else."
This was something that we had talked about, and she's like "How about joining this company that I'm at?" I was like, "Wait, what do you all do?"
She's like, "Chaos engineering," and I'm like "That's that thing that I used to do at Uber."
Liz: That's totally your jam.
Charity: Totally your jam.
Ana: I was like, "I'm down to go break some stuff again. So, let's go do this thing."
Charity: "Breaking stuff." That's a good way of putting it.
Liz: Yes. In the intro of the podcast that we just recorded, we talked about "Making things appropriately reliable." How does breaking stuff fit into that?
Ana: It definitely comes with the idea that everything is going to break, therefore you should actually break it on purpose so you can dissect the parts where the failure can occur, and you're able to set up time to focus on those issues.
Charity: You can only break it in the ways you predict, or the problems you predict you're going to have. Aren't those the easy ones?
Ana: That's correct. You will never be 100% covered from failure, and it's definitely one of the things in the space that we're still trying to grasp onto. It's like, "How do we get to that point of doing this properly?"
In a sense, you can think about "OK, how have my systems of software broken up to this point?" And do chaos engineering on those. But I also--
Charity: So, it's basically unit tests for production.
Ana: That's a good way to put it for someone that doesn't understand the space.
But I would say that you can also take it in to the point that you can go read on different outages that have happened with other companies, and it's like, "OK. Let's form a chaos engineering experiment on the knowledge that this company has shared openly about what has happened there."
You can make sure that you're predicting what can happen in your own systems.
Liz: I really love that idea. It's similar to how Chef taught us that we could borrow infrastructure from other companies and apply it in our infrastructure.
That we don't have to wait for failures to happen for the first time in our infrastructure before we figure out how to deal with them, that we can introduce them by learning from that experience of other people.
That's really cool. What role does observability play in this, though? Can I do chaos engineering without observability? Or, how does observability intersect?
Ana: I definitely see that you want to have some type of observability monitoring in place before you do chaos engineering, because there is no point in doing chaos engineering when you actually don't know how your system is behaving at your current state before you inject chaos.
That might be just even having standard system metrics of just knowing your CPU load, what processes are running, and things like that. Just to at least get to the point where you are actually able to--
Charity: I would say that if all you have is monitoring, as Martin Fowler would say, "You're not tall enough to ride this ride. You're not ready to start breaking your systems yet."
Because there's going to be this long, thin tail of effects, "Chaos effects," if you will.
I've heard of people running a chaos experiment and then 10 days later realizing that some node was stuck in a broken state ever since then.
In order to be able to quickly find those outliers, those things that actually broke, these high level system aggregates are not going to help you at all. You actually need to have observability, the way that we define it in the control theory way, being able to ask any question and be able to break down by any level of granularity to see each and every component, or how it's behaving for each and every user.
Ana: Yeah, I know. I definitely completely agree with that, as being able to know exactly what call was done and being able to trace it completely back.
Charity: Otherwise it's just breaking. If you're not breaking and learning, then it's just breaking.
Ana: Yeah, I know. I agree on that. It would just definitely be that every company and every team is sometimes at different states, that they do need to start somewhere else.
That those first baby level steps come out to be "OK, we can actually do chaos engineering by just having some out of the box things."
Charity: But should we? I'm not going to nail you on this.
Liz: It's interesting, the idea of chaos engineering as chaos "Engineering." Are we asking questions? Do we know that we'll be able to measure their answers to our questions?
It's interesting to talk about monitoring versus observability there, in that monitoring might tell you "You completely broke everything, you need to roll it back."
But it won't tell you "Was your roll back completely successful?" It won't tell you "What did we break along the way?"
And give you that high resolution data into "What are the after-effects? What happened during the experiment aside from 'Everything broke.' Right?"
That's the place where you need to figure out "How do we shore things up next time?" That's the "Engineering" part that's missing.
Ana: It's still constantly evolving. We're only in the preliminary stages of talking about chaos engineering.
Seeing it be done at different scales, not just the Uber scale, the Netflix scale, the Amazon scale, whereas smaller companies are trying to pick it up.
My personal take on it is very much on the "It's good to still inject that failure early on, even when you don't have that built-out observability tool, just to know how things are going and to know where you're heading."
Charity: But that's the point. You won't know. If you don't have the observability, you won't know.
Liz: You'll know that things broke, but you won't necessarily--
Charity: You might. Eventually you'll know.
Ana: Yeah, you won't know. The 10-day trace that you mentioned earlier, that definitely is like what you won't get unless you have that observability or that open-tracing tool that can call it out.
Liz: So, you talked earlier about how you made the transition from being a dev into learning ops stuff.
How do you see the evolution of software engineers in terms of starting to think about the production consequences of their code?
Ana: It's fascinating because I came from that mentality of "My code works on my machine? That's all I need to care about. I'm going to check it in, someone's going to deploy it for me, I shake my hands and I go to sleep happily. I don't have to worry about getting paged."
I love the theory of having to put devs on call for their own software for that same reason, because it totally changes your mind in the way that you build stuff.
You actually are like "OK, I should maybe make sure that I'm doing those edge cases testing, of like 'Am I even checking for a memory leak or any thread that continues running?'
To the point of like, 'Is this software allowed to be deployed in a part where things auto-scale for me in case things go really well?'"
When you start giving devs that mentality of like, "If you actually don't set things up properly, you're going to be the one that has to wake up in the middle of the night and figure out what's going on, and the one to fix it."
Charity: I think of it as less ops and more reality. You don't know your code if you just know it on your laptop.
You can think you know your code, but you don't know it until you've watched it run on real infrastructure with real users talking to it. You have no idea what piece of crap you wrote.
Liz: Speaking of testing on real infrastructure, what is your thought on testing in stage versus testing in prod?
Ana: I'm 100% on set for testing on prod. We should all be testing on prod.
You can never create an environment that's going to be like production, like you can get close to it but it's never going to have every single little identical thing as much as we try doing it.
So, I 100% support testing in production. Then when it comes to the topic of doing chaos engineering, on staging and on production, I very much love telling folks "OK, let's go. Let's do chaos engineering in production."
But at the same time, you also don't want to run a chaos engineering experiment if you know it's going to break production--
Charity: Do people actually run chaos engineering in staging?
Ana: Yes, at Uber we were running chaos engineering at staging, and then now at Gremlin we have a few customers that started with running chaos engineering experiments in staging first. It was just to verify things are going--
Charity: Chaos engineering experiments, just to clarify, it's things like intentionally breaking the network connectivity or making the network connectivity lossy for a period of time?
That sort of thing? Making disks go away?
Ana: Yeah. It includes various things of like, "OK. What happens if I shut down my server or my container, or actually change the system clock, to black-holing certain ports, to actually just injecting latency into calls?"
Liz: Yeah, exactly. We had an outage a while ago in which we had our MySQL servers get slow and then that caused a cascading failure, and that would have been a really cool thing that we would have known that was a failure possibility if we'd experimented with that in a controlled circumstance.
Charity: If you had thought to experiment.
Liz: Yes. If we had thought to experiment, which was always the hindsight bias.
Ana: It's like we are trying to get folks to share more a little bit about what chaos engineering experiments they're doing at their companies.
At Gremlin we're going to be pushing out how we're running game days internally.
We actually just released the first one today, that is like "We actually want to test how we're doing monitoring on our staging environments, doing chaos engineering."
We started doing that in the preface of "We want folks to be sharing about what ideas they are thinking. That way we can also continue thinking about them without having to wait for that postmortem to be out."
Charity: When I think about the traditional way that a monitoring system would mature, you would build a system and you would look at it and you would predict the ways it was going to fail.
Then you'd monitor for those things, and nowadays you would add chaos engineering tests for those things.
Then over the course of the next year or so, as you ran it, you would gradually encounter more and more of the rare events and then you would add monitoring checks for those.
Then you would add chaos regression tests, basically, just to make sure that you don't regress there.
Then what's left is the long tail things that almost never happen, that you can only really-- This is where observability comes in, it's the ability to introspect and ask any question.
Do you need to test for every failure?
Ana: In an ideal world, yes. But it goes to that point that you mentioned earlier, it's like "You can't predict every single failure."
Charity: Do you test for-- Do you include things like testing user behavior? Like, injecting users that are doing abusive things to your platform?
Ana: Chaos engineering can definitely add onto that. I feel like we can get to a point that a chaos engineering experiment can include what actually happens on the infrastructure, plus injecting that user load.
Charity: What is the difference between a test, and a chaos engineering experiment?
Ana: In terms of chaos testing versus chaos engineering?
Charity: No, just what is the difference? Can I just use these terms interchangeably?
Ana: Yes and no. Chaos engineering is a part of testing, but it goes more into the resilience space of actually just pushing things to the edge.
Liz: One specific test, though. If I decide that I'm going to have 1% of queries suddenly take twice as long, and I set that up, is that a test or is that an experiment? Or is that both?
Ana: You can call it both. My personal take on it is that a chaos engineering experiment can have more than one test.
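Liz's example of making 1% of queries take twice as long could be sketched roughly as follows. This is only a minimal Python illustration, not how Gremlin or any other platform implements latency injection. The name `with_injected_latency`, its parameters, and the injectable `sleep` hook are hypothetical and chosen for the sketch.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def with_injected_latency(
    query: Callable[[], T],
    fraction: float = 0.01,   # share of calls to slow down (1% in Liz's example)
    slowdown: float = 2.0,    # affected calls take roughly this many times as long
    rng: Optional[random.Random] = None,
    sleep: Callable[[float], None] = time.sleep,  # injectable so tests need not wait
) -> T:
    """Run query() and, for roughly `fraction` of calls, pad its duration."""
    if rng is None:
        rng = random.Random()

    start = time.perf_counter()
    result = query()
    elapsed = time.perf_counter() - start

    # rng.random() is in [0.0, 1.0), so the call is selected with
    # probability `fraction`.
    if rng.random() < fraction:
        # Sleeping for the extra (slowdown - 1) share of the measured time
        # makes the whole call last about `slowdown` times as long.
        sleep(elapsed * (slowdown - 1.0))
    return result
```

In Ana's terms, a single wrapped call path like this would be one test. An experiment might combine it with a hypothesis, a limited set of hosts, and agreed abort conditions.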
Charity: The first wave of DevOps, I would say, was ops learning to write code. We have received the message, we all write code now, and I feel like the second wave we've been on for a couple of years now.
It's like inviting software engineers, "It's your turn, time to learn about operability and what happens after you push deploy."
How far do you think we are into that transformation? Where do you see the industry these days?
Ana: We still have a long way to go, but I would say that I would give credit to how far we've come.
I've been in tech for not as long as many folks, like it's still just like 10 years. In the operation space, only 3 years.
Even in just these 3 years I've seen a lot more folks being open to sharing their failures and how they learn more about the things that are going on.
But realistically, we still have a very long way to go just because we're still doing operations in a very reactive sense versus thinking in more of that proactive sense, and that's exactly where observability and chaos engineering come together.
Charity: It feels like the argument is over. Everyone accepts now that this is where we need to go, but we aren't sure how to get there, and that just needs to be thought out by someone.
Liz: That's where your point about us talking to each other is super important. I co-chair SREcon fairly often, and a lot of the value of that event is getting people on board with understanding "I accept that I should do this. How do I get started?"
And also for people who are experienced, "What's at the forefront? How can we keep pushing at this rather than doing the same things over and over?"
Ana: Yeah, definitely. That sharing culture has continued.
Charity: Consensus. Driving consensus on core questions of the day.
Ana: This also goes really hand-in-hand with the DevOps movement, the SRE movement and those things.
Here we have more resources of how larger companies have done this, let's write huge books on it that would tell you "In this chapter, we're going to talk about how you can get started on just monitoring, just on logging, just on observability, just on on-call."
It breaks it up into smaller pieces, so that if you're a small startup that only has 3 engineers, you can still be like "I want to be better in operations and implement some of the SRE model, but I don't have the bandwidth to implement the entire Google SRE model--"
Liz: Nor should you.
Ana: Yeah, exactly. So, we're getting there.
Liz: To jump back a little bit about testing in prod versus stage, what about game days?
If you know you're going to need to leave a system in a broken state for someone to fix, is that a staging thing?
If you are testing, can our engineers figure out how to, for instance, bring up Kafka after it has gotten horridly stuck and crashed?
Ana: I'm a huge believer, just because I've been burned by this, of using chaos engineering to make sure you have really strong run books.
Game days comes into that mindset, where it's like you've got your engineers together and you're performing these experiments, but at the same time you're going to be using the run books you wrote 200 days ago.
You're going to want to be updating those, because maybe you realized that you even forgot to pass a flag or you can actually do this better by doing it in a different command, and then be able to just validate that.
It's like "OK, crap. I just broke down Kafka. But if I actually do these five different commands like this, I will actually be able to bring it back."
Liz: It reduces the pressure and gives you an environment in which to think about the technical debt. That's your documentation, right? To update your documentation that's gotten stale.
Charity and I have sometimes gone off and provocatively said, "Staging is dead. Don't use staging." But I think that it's important for us to have this conversation too, about what is staging useful for and what is it not?
Charity: What is the new usefulness of staging?
Liz: Yeah. What is the new usefulness of staging? Which is not, "Can we reproduce every possible user failure?" Because you can't.
Ana: I feel like definitely staging could be that good one, where you're definitely putting those production outages to the test just to make sure your documentation doesn't fall under tech debt.
That way you're also-- There is muscle memory to an extent, not all of on-call is muscle memory because you'll be encountering new things.
But there's a part where it's like, "OK. I've done this before because I prepared for it in a game day setting."
It's like the usage of tools is a muscle memory, even if the specific combination of tools that you put together is novel every time.
Liz: This is one of the things that we've struggled with at Honeycomb, is that people tend to assume Honeycomb is for the difficult, gnarly problems such that people don't really get fluency with it unless they use it every day.
Charity: People are so used to the idea that they don't know what's going on with their code in production, and that's OK. That's the way it's supposed to be.
Once you've had the experience of being able to see, it's impossible to go back.
This is why our best customers are typically people who are leaving their job and going to the next company, and they bring us with them.
Because it's impossible to go back to just blindly shipping and crossing your fingers, but most of the world hasn't had that experience yet, so they don't know what they're missing.
Ana: It's definitely a thing like back to the earlier question of we're still really early into the stages of these two topics, and it's like "Until we start getting more smaller companies involved and talking about it--"
Charity: People need to raise their standards too. People need to expect more from their tools. This is not really acceptable.
Ana: This is true.
Liz: So Ana, if I am a company that wants to get onboard the chaos engineering train, how do I get started and convince my executives that this is a safe thing for me to do?
Rather than, "Oh my God, you're introducing more outages. We can't possibly have that."
Ana: You can definitely start with the fact that chaos engineering is not done to produce outages, it's to actually lower the downtime that your company has, as well as the money that you spend on it.
But there's a few things that you have to keep in mind as those prerequisites of doing chaos engineering, it's having that monitoring and observability in place being the number one thing.
Then second of all, you want to make sure that you're either going to be using a platform or building your own tool that has a big shiny red button that lets you stop any chaos engineering experiment that's currently going on and safely roll back to a steady state, as well as thinking about the blast radius.
If you actually don't know how this chaos engineering experiment is going to go on 4 of your hosts, you don't want to be running this on 100 of your hosts either.
So thinking about "How do I do this in a very small way and proactively increase it as the chaos engineering is successful?"
As well as you also want to come up with those abort conditions and write them down and let your engineers know.
Like "If we actually see that we have lost like 20% of our users, let's stop the chaos engineering experiment."
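The prerequisites Ana lists here (a stop button that rolls back to a steady state, a blast radius that starts small and grows, and abort conditions written down in advance) can be pictured with a short Python sketch. Everything in it is an assumption made for illustration: `run_experiment`, `ExperimentResult`, the callback parameters, and the stage sizes are hypothetical, and the 20% threshold simply mirrors the figure Ana mentions.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class ExperimentResult:
    completed_stages: List[int]  # blast-radius sizes that finished cleanly
    aborted: bool                # True if an abort condition stopped the run


def run_experiment(
    hosts: Sequence[str],
    stages: Sequence[int],                       # e.g. [4, 20, 100] hosts
    inject: Callable[[Sequence[str]], None],     # starts the failure on targets
    rollback: Callable[[Sequence[str]], None],   # the "big red button"
    user_loss_fraction: Callable[[], float],     # read from your observability tooling
    abort_threshold: float = 0.20,               # the written-down abort condition
) -> ExperimentResult:
    """Grow the blast radius stage by stage, rolling back after every stage."""
    completed: List[int] = []
    for size in stages:
        targets = list(hosts[:size])
        inject(targets)
        try:
            if user_loss_fraction() >= abort_threshold:
                # Abort condition met: stop here, never widen the blast radius.
                return ExperimentResult(completed, aborted=True)
        finally:
            # Always return the targets to a steady state, abort or not.
            rollback(targets)
        completed.append(size)
    return ExperimentResult(completed, aborted=False)
```

With stages of 4, 20, and 100 hosts, a run that trips the abort condition at 20 hosts would roll back those 20 and never touch the remaining 80.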
Liz: Exactly. That you may want a service-level objective, or something to tell you that you are hurting your actual users, rather than just conducting an experiment without causing harm.
It also is really interesting to think about things like, "If I'm running an experiment on a small subset of hosts, how do I tell what's the difference in behavior between what's going on in this host that I'm experimenting with and the control group, as it were?"
Ana: Yeah. I think that is a great way to start off when you're doing something like this, because then you can also think of "OK. What is my error budget for this quarter?" Then use that error budget to do chaos engineering in production.
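As a rough illustration of budgeting an experiment against an error budget, the arithmetic might look like the Python sketch below. It assumes a request-based SLO (a target fraction of successful requests over some window), which is only one common way to define one. The function names and the numbers in the usage note are hypothetical.

```python
def error_budget_remaining(
    slo_target: float,      # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int,    # requests served so far in the SLO window
    failed_requests: int,   # requests that already failed in the window
) -> float:
    """Return how many more failed requests the SLO still allows."""
    allowed_failures = (1.0 - slo_target) * total_requests
    return allowed_failures - failed_requests


def can_run_experiment(
    slo_target: float,
    total_requests: int,
    failed_requests: int,
    expected_experiment_failures: float,  # your estimate of the experiment's cost
) -> bool:
    """True if the estimated cost of the experiment fits in the remaining budget."""
    remaining = error_budget_remaining(slo_target, total_requests, failed_requests)
    return remaining >= expected_experiment_failures
```

For example, with a 99.9% target, 1,000,000 requests in the window, and 400 failures so far, about 600 failures of budget remain. An experiment expected to cost 500 failed requests would fit, and one expected to cost 700 would not.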
Liz: Yeah, that definitely makes a lot of sense. If you agree in advance what your error budget is and it's not zero, it makes it a lot easier to argue for.
"Yes, we can afford to cause a little bit of pain and have the safety measures to roll back."
Charity: Because fundamentally you need-- Since, as you say, "Your systems are broken always."
Like, there's so many things that are broken right now that you don't know about, or you suspect but you haven't been able to find.
That means that as people running the systems, we need to be constantly practicing small failures too.
Practicing how to recover from them, often in code, but sometimes not because that's what helps you train your team.
It helps them practice, it helps them not freak out when something goes wrong. It gives them experience in recovering and rolling back to a good state, or whatever your practices are. That's just as important as the software side.
Liz: It's really interesting to think about rollbacks as well, because a lot of the pain that we've seen people talk about are "Rolling back is not as simple as 5 minutes."
It's like, "OK. We could accept that you can't roll back in 5 minutes, or we can actually dig into that and say 'What stops you from being able to roll back in 5 minutes?' Right?"
Ana: It's like, "How can we actually make this better?" It's what you mentioned earlier, we just have to have higher standards for our tools and our infrastructure--
Charity: And for the amount of self-abuse that we accept. We often, especially on the ops side, we have this long history of throwing ourselves on grenades. Like, "I will suffer through this," it's just not good for us.
Ana: Yeah, it's great to also be talking about--
Charity: It's not good for the systems, either. It's not good for anyone.
Liz: It also brings us to the stereotypical example of "The disk got full."
It's like, "Do we need to throw ourselves on that grenade? What if we purposefully fill up the disk and verify that the world does not end?"
Charity: "What if we could absorb a lot of full disks and it could be OK until morning when someone wakes up and wants to deal with it?"
Ana: It gets to a point where it's like, "OK. We have managed to auto-scan our disk in a sense that now on call is 10 in the morning. You've maybe had your coffee. 'Can you look at me now?'"
Liz: Yeah, getting things out of being crises in the middle of the night and transforming them into problems that we can think about when we're awake is such a radical transformation.
Charity: Awesome. Thank you so much for coming, it's been really fun to talk to you.
Ana: Thanks for having me.