O11ycast
34 MIN

Ep. #6, Customer Reliability Engineering with Google’s Liz Fong-Jones

about the episode

In episode 6 of O11ycast, Charity and Rachel talk to Google developer advocate Liz Fong-Jones about ways to build systems to be more transparent and explainable.

Liz Fong-Jones is a developer advocate and former site reliability engineer (SRE) at Google.

transcript

Charity Majors: Liz, do you have a CS degree?

Liz Fong-Jones: I do. But I got a CS degree about 10 years into my career.

Charity: Interesting. How did you manage to get a job without a degree?

Liz: I happened to work as a systems engineer for a few years, and then I managed to parlay that somehow into a job working at Google as a site reliability engineer.

Charity: This seems like a good time to introduce yourself.

Liz: Hi, I'm Liz Fong-Jones. I'm a site reliability engineer at Google and I work on the customer reliability engineering team. So my job is to help people make good use of their public clouds by building and operating reliable systems.

Charity: Nice. When I think about computers, it feels to me like it's been a wild, wild west for so long. That's definitely what helped me get my start as a music-major dropout. But there's increasing professionalization lately. How do you feel about that?

Liz: I have mixed feelings about it.

It is good for us to develop communities of practice and figure out what's important to us, but it's important to have on-ramps that allow people from any background to find their way into our profession.

Rather than saying, "You must have a degree from a top-10 school in computer science."

Rachel Chalmers: Right, because professionalization has two meanings. One is the gatekeeper stuff, where it's preserving this domain so that only lawyers can make money on it. The other is like the Canadians having that iron ring for engineers, which is super cool because it's a commitment to public safety. I'd really like to have the second kind of engineering professionalism.

I really don't want to see CS turn into a legal profession where you have to pass the bar exam.

Charity: Or a barber profession. States will enact these things where you have to have a certain certification just to make it a cartel. It's been really interesting, because given all the economic trends in this country this has been one of the only avenues for people who haven't had the money to go to college to make a really nice living.

Rachel: Definitely. When I came up in the '90s, we were sysadmins, which is what we did when we rode dinosaurs to work instead of doing SRE stuff. Most of the sysadmins I knew had English degrees or history degrees, or theater or art history. And it was awesome, because we'd sit around drinking and talking about Monet and computers.

Charity: It turns out computers aren't that hard.

Rachel: Exactly.

Charity: They really reward tenacity, curiosity and exploration.

Liz: And communication--

Charity: And communication.

Rachel: It's interesting that you say curiosity, and we'll probably come back to this, but I have a theory that people write the software that reflects their truest self. So Edith Harbaugh, who is basically Leslie Knope, has written software which is binders full of processes, so that you can have backup plans for your backup plans for your backup plans. And Charity, you've written software which is about your curiosity and your desire to explore the world.

Liz: Yeah. Curiosity is definitely...

When I talk to students who are interested in getting into SRE, I tell them that curiosity is the number one trait that I look for.

Charity: And lust for power, don't forget that part.

Rachel: And whiskey.

Charity: I say that jokingly, but there is an element of having a God complex. Just, "I can make all these amazing things happen!" And you just get high off that. Is that just me?

Rachel: Not just you.

Charity: All right. Moving on.

Rachel: It's actually an interesting segue to the next question, because part of the CS question is, what kind of people excel in the tech industry? And it's not only people who know how to code. It's not even primarily people who know how to code, it's systems thinkers and those people can come from anywhere.

That question is intimately related to the anxiety people have over AI and ML and being automated out of a job. Let's talk about whether automation can replace human technologies, or human approaches.

Charity: I've had many Google engineers tell me confidently that there's no point in building Honeycomb because AI is coming. Like, they've seen this at Google, and pretty soon there's going to be no need to understand our computers because the robots will do it for us.

Rachel: And they say it in exactly the same tone of voice that they used to ask me why I was bothering to study English.

Charity: Yeah. Very confident.

Liz: I'm really disappointed in those Google engineers who were telling you that.

The truth of the matter is that computers are really good at doing things that are rote. They are not great at delivering insights, and certainly not at delivering them in a predictable manner.

You can have unpredictable insights but when you're trying to build systems that run reliably, it's really hard to rely on something that you can't easily comprehend. That you can't question, "Why did you do that?"

And therefore, I think it's really important when we're designing systems, whether they be for observability or whether they be for anything else, like healthcare or decisions about whether to give someone a loan, that they be transparent. And if you can't have that transparency, if you can't debug them, then you're building something that you're never going to be able to get to run reliably and repeatedly.

Charity: You can't hold someone accountable for something that you can't explain. I read recently about parole hearings that are being done by algorithms.

Rachel: Oh God, no.

Charity: And it turned out later, as no one could have predicted, that the algorithms became very racist.

Rachel: The algorithms are not impartial.

Charity: No one could have predicted this.

Rachel: They're not unbiased. Algorithms are human artifacts and they contain all of the preconceptions of the people who built them. And not to pick on Google, I apologize, but this is my go-to example: Google Buzz. Before Google Buzz came out, I was reading the blog of Harriet Jacobs, a pseudonym for a woman who was on the run from an abusive husband.

Liz: Yes, I remember this.

Rachel: Because Google Buzz connected you to your friends of friends' contact details, her ex-husband hunted her down. And there were people within Google, I know this, who said, "This is a really bad idea." And they were shouted down by the people who said, "No one should have anything to hide."

Sure, in a perfect world and to a first approximation, which is, "If you're a straight white dude, you have nothing to hide." But for the rest of us, life is really complicated and gnarly, and algorithms that don't acknowledge that are actively dangerous.

Liz: Yeah. It's definitely important to have people who are marginalized involved in the design of these algorithms. I do think that eventually we'll have explainable AI, but that's an open research area. That's not something that's coming anytime in the next year or two.

The robots are not immediately coming for us, and in the long term, if we involve the right set of people and think about the most challenging problems rather than slapping AI on everything, then sure. Maybe one day the robots will let us go and do more productive things with our time.

Charity: I just look at it like, "OK, but what about when the AI breaks?"

Liz: Totally.

Charity: The more mysterious it is, it's kind of like the difference between the old cars that anyone with a wrench and some eyes could figure out, and the new ones where it's like, "Where do I start?"

Rachel: It's an electric space ship.

Charity: Yes. And sometimes it's great. And sometimes, like if you're trying to fix things or your entire society relies on these things, you need to have people who are capable of understanding them.

Rachel: It would be nice if we knew what happens inside Diebold voting machines, for example.

Liz, I wanted to jump on that word "explainability," explainable AI. Can you unpack that? That sounds super fascinating.

Liz: It's important for humans to be able to understand their systems. How are decisions being made, where are they coming from, what factors are they taking into account, and what combination of factors is leading them to make decisions?

Even knowing, what are the inputs? How are they being weighted? And how could you change the outcome? If someone gets turned down for a loan, what factors would cause them to get the loan in the future?

Charity: There just needs to be an audit trail, right? Reproducibility.

If one person has these results, then another person should be able to understand it and get the same results.

Liz: Yes, exactly. And also being able to get some degree of consistency, and knowing when you change your algorithms, is it leading to a better experience? For which users? Is it leading to a worse experience, and for which users? Those are things that are tricky to get right if you don't think about them.

Charity: The phrase "it just works" has always terrified me.

Liz: Nothing is really magic.

Charity: No, as it turns out.

Rachel: In practice it sounds like that would include adding metadata, and adding structured events, and adding audit trails to decision making pieces of software? I'm trying to visualize how that would actually work.

Charity: It's an ops log, basically. Being able to list and articulate all the inputs that come in and their dependencies, and then the sequence of decisions that are made.

Liz: And being able to see what happens if you change these factors, how much does it influence it?
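To make that concrete, here is a minimal sketch in Python of what one structured "decision event" could look like: every input, how it was weighted, and the outcome recorded together so the decision can be audited, questioned, and replayed later. The field names and the emit() helper are invented for illustration; they are not any particular product's API.

```python
# Hypothetical sketch: emit one structured "decision event" per decision so the
# inputs, weights, and outcome are captured as an audit trail. Field names and
# the emit() helper are illustrative only.
import json
import time
import uuid

def emit(event: dict) -> None:
    # Stand-in for whatever event pipeline you use (stdout, a log shipper, etc.)
    print(json.dumps(event))

def decide_loan(applicant_id: str, features: dict, weights: dict, threshold: float) -> bool:
    # A deliberately simple, explainable scoring rule: weighted sum vs. threshold.
    score = sum(weights[name] * value for name, value in features.items())
    approved = score >= threshold
    emit({
        "event_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "decision": "loan_approval",
        "applicant_id": applicant_id,
        "inputs": features,     # every input that went into the decision
        "weights": weights,     # how each input was weighted
        "score": score,
        "threshold": threshold,
        "outcome": "approved" if approved else "denied",
    })
    return approved
```

With events like this, "what would have changed the outcome?" becomes a query over recorded inputs and weights rather than guesswork.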

Rachel: I'm going to have to Google some more about that. Which comes first, the tool or the profound culture change?

Liz: There is a need to have both tooling to facilitate your culture change, as well as some kind of direction that the culture change is coming from. It's very rare to see, especially in large enterprises, grassroots efforts succeed on their own without some kind of champion. Without someone who's willing to provide that air cover to say, "Yes. We want to do this. Let's do this." And then yes, the tooling helps you.

But tooling by itself doesn't motivate people to change the way that they work.

Charity: I think of it more like rocks in the stream. We don't like to change things because it's hard, and if the river is flowing and a tool is there, we're just going to flow around it.

Rachel: You interpret it as damage.

Charity: Exactly. We're not going to try and flow into it harder. I do think that tooling shapes our behaviors in many ways that are very implicit, that we don't stop and think about and that we aren't conscious of.

One of the things that I think about is on-call. The tools that we use for on-call matter! And the amount of thought that goes into this is usually quite shallow. It's, "What have I used before? What's the quickest, fastest thing that I can get started doing?" And yet people are going to be woken up in the middle of the night. It's going to impact their ability to plan their vacations, their sleep, their family life.

Liz: Yeah. That's definitely a thing that often happens: when people are under pressure, they don't necessarily have time to think about how to change the situation for the future. You have to allocate time towards doing that, you have to make sure that people have the room to do it.

If you have people who are completely disconnected from the production operations of their system, they're never going to design useful tooling. And if you have people who are completely immersed in running their system in production and have no time to write tools, that's not great, either.

So you have to be somewhere in that happy middle, where you know enough about your systems to know what tooling is useful to write and have the time to write the tooling.

Charity: This is kind of software ownership; the end-state of DevOps is that everybody ships code and everybody supports what they ship. Supporting the code that someone else wrote is terrible because you don't have the context. And I'm talking about early stuff, right after it's shipped. Once it's stable, you can hand it off to teams of people to maintain it.

Rachel: Maybe all code needs to be explainable.

Charity: Yeah, very much so.

Liz: One of the interesting things about that is even if people are supporting their own code, different people have different abilities to do that, and different interest in doing that. It's a matter of cultivating those skills and making it possible for people to level up, rather than perceiving it as, "I don't do that," or, "I'm awful at that, I'm never going to do it."

Charity: Exactly. And a lot of this comes down to prestige. Ops has gotten kicked around for a long time, and so nobody perceives there being any prestige in learning how to operate their own systems. Well, I say nobody, but that's not actually true. I've worked at places where operations was highly valued.

Rachel: Name those places.

Charity: Linden Lab, Honeycomb, Parse. All three of them. Very, very much valued.

Rachel: Interestingly, all three companies are likely to put out a generation of engineers who are really unusually skilled.

Charity: I've found that the people who operate software tend to have a tighter cohesive bond and identity than any other engineering team, and I think it's because there's that element of being in the trenches together. You're the last line of defense.

There's some really cool psychological benefits that you can leverage, and I've had other engineering managers ask me how they can get their teams to feel as cohesive and have as much fun as my team does. And I'm like, "Put them on call."

Liz: That's true up to a point, though.

Charity: Definitely up to a point.

Liz: You can have situations where you'll get hero culture, you can have situations of firefighting culture.

Charity: It's like putting salt in your dinner. You don't dump a cup in, you dump a couple teaspoons and it makes it flavorful.

Liz: I love that analogy.

Charity: There has to be some amount of high stakes now and then to make you taste life to the fullest.

Rachel: It's somewhere between Marine Corps morale and actual shellshock.

Charity: Yes, exactly.

Liz: Yeah. And that goes back to having a community of practice. As you said, having parity of esteem, and having people feel like they're valued in doing this work.

Charity: Senior engineers have to live out these principles. Because other engineers, junior engineers, everybody knows who the best engineer is and everybody is watching them because we are hierarchical beings. We're hairless monkeys. So it's really important for those people to model the values that you want your entire team to display.

Rachel: And I hate to bring it up, but I think there's a real toxic masculinity at play here as well. There's so much esteem granted to someone who's seen as a firefighter or a first responder.

Charity: Instead of being a servant supporter.

Rachel: Versus someone in public safety, someone who goes and investigates an aviation accident and documents exactly what went down. And yet the latter saves so many more lives. It's much more highly leveraged.

Liz: And that's why I highly recommend Tanya Reilly's talk on the history of fire escapes and fire codes. That's an amazing talk. Also people like Alice Goldfuss, who have talked about how we need to move away from this macho culture, toward a culture that celebrates people who are quietly making the world better.

Rachel: Now I'm double fan-girling. Not only that I'm in a room with you, Liz, but also that we mentioned Alice Goldfuss. So, yeah. I'll just wither into a heap. What happens to cloud-native apps as they scale up and up? Is the curve continuous, or are there step changes?

Liz: The treacherous part of it is that it looks continuous, but that there are step changes. That as you build systems, you often don't think about the technical debt and complexity that you're introducing until one day you realize that you no longer understand your system.

Your herd of individual pets has become cattle, and yet you're still dealing with them as if they're individual pets.

That kind of a situation just creeps up on you one day until it smacks you in the face.

Charity: I've always heard and said that, "You cannot design a solution for more than 10x your current problem." You don't know what the breaking point is going to be, you don't know what the extra variable is. You don't know what it is that's going to have changed, you don't know what is going to break first. All you can know with some amount of certainty is that something will break.

Liz: Conversely, if you use something that was built for someone 10 times your size, that's not going to work well, either.

Charity: No, not at all. We all have specific workloads, it turns out. And for platforms this is extra important because a platform in my mind is a system that you've built where you're inviting everyone else's chaos to come live in your house.

You don't have the ability to track down those engineers and make them change their query, you just have to make it work and not affect anyone else in the house.

Rachel: Wait. This is extremely disturbing. As a VC, I'm in the business of selling people the stuff that Google uses. When in fact it's the stuff that Google used 10 years ago.

Charity: Well, there's that too. I'm constantly entertained by people who are like, "This is what Google used!" And the Googlers are just like, "No, no it's not."

Liz: That's what people are getting wrong about SRE outside of Google. They're taking away this idealized image of Google and thinking, "I can just adapt this without regard to what my current context is." And that's not even what we're doing at Google.

The number one way that manifests, getting back to our earlier conversation about how people hire and how people get into the industry, is that people have this conception that every single SRE they hire needs to be a battle-tested veteran sysadmin who can also code and has a computer science degree from a top-10 school. And those people are very few and far between. You'd be much better off investing in building people's skills.

Charity: People will spend six months looking for their unicorn, trying to recruit one when they could have trained one or built one, some hungry new grad in half that time.

Rachel: How often have you worked in a company where somebody from a big name competitor parachutes in, spends six months resting and vesting--

Charity: All their assumptions are wrong. Or they're just breaking things, because they're so sure that Google does it this way, so that must be right.

Rachel: Or the big company got rid of them.

Charity: Yes. One of my friends says, "There's no such thing as a good ex-Google engineer." Because they often leave, but then Google knows which ones they are and they always lure them back with a lot of cash. Or they know they weren't that good and they just let them go. Which in my experience, you can round that down to true.

Rachel: Liz, you've talked about how your ultimate goal is to empower people and make them more productive. That's one of the reasons we're gleeful to have you on the show, because that's ours, too. How does observability fit your goals, if at all?

Liz: Part of what's going on in IT operations right now is a change in how we approach our operations.

Observability is one piece of many pieces in the change that we're making.

Therefore, this change that we're doing is making people more productive. It's enabling a team of 10 people to manage way more complexity, and way more services, and way more scale than they could 5 or 10 years ago. That's amazing and I want to share it with so many people.

A lot of my customers, for instance, are big banks, and they are used to running things the same way they have run them for 5 or 10 or 15 or 20 years. And showing them that there is a better way of doing things, and that there are techniques they can borrow from our playbook.

"Take the ones that you need and leave behind the ones that don't work for you. We can help you become more productive and empowered." That's amazing. To have all these people that are suffering from lack of parity of esteem of the ops and telling them, "You have a valuable and important skill. Here's how to leverage it, and here's how to feel empowered."

Charity: A thing that we struggle with a lot, and that is hard to talk to people about, is that people are used to hearing happy talk: "Do this and everything will be better." So much so that they don't believe it when you tell them, even when it's true. The newer ways of doing things, like observability-driven development where you lead by instrumenting and you check yourself by instrumenting, really are better. They are better for humans.

You build more understandable systems, you don't get woken up as much, you don't have to change context as much, you can take over other people's software so much more easily and it feels like software that you wrote too.

The bad old ways are bad, and the benefits of switching to a world where software engineers are-- Software engineers being on call was one of those sticking points. I get it, we have a real problem with masochism in operations.

We're not saying, "All right, software engineers. Come be masochists with us." We're saying, "OK. It's time for the masochism to stop. We need to adopt the software engineering principles that will make our systems better, and we also need to shorten these feedback loops." Everyone wins. It is a better world.

Liz: It's a difference between a local maximum and a global maximum. People have gotten stuck in this rut of, "I can make my systems work better by outsourcing operations to someone else."

Charity: Through whatever means, they've got it working, and they're clutching onto it with white knuckles because it's fragile and they're afraid it's going to vanish. And what breaks systems? Usually it's introducing change. So they're like, "I'm going to introduce change into the system? Really? I just got a handle on it."

Rachel: The example of banks is super interesting because their environment is fundamentally different from any of these scale-out apps in the valley. And their architecture: they're still running IBM mainframes, there are still microcomputers in there. There's probably still a DEC VAX running somebody's general ledger.

Liz: And yet, they're experimenting too.

Rachel: Right. I spent a lot of time with the banks, first with VMware and then with Docker, because the idea that they could encapsulate that complexity and present a new set of interfaces to younger people coming into the workforce so that they didn't have to learn all of the layers behind them, was super interesting to them.

Charity: If you don't experiment, you die. And banks might be lagging in tech, but they're good at business and they know this.

Rachel: Yeah. And one of the things that keeps them in business is being very risk averse. Except when--

Charity: But also, looking at the cutting edge all the time.

Liz: Yes. They absolutely are looking at the cutting edge.

Charity: Some of the first people to reach out to us were from Barclays.

Rachel: They have a real interest in keeping their systems up and in understanding what went wrong when it went wrong, because their downtime can be measured as millions of dollars.

Liz: So getting back into, "Why do I care about observability?" I think that it's really important for people who are contemplating making these kinds of leaps to have the right tooling to support it. To know, "What are the best practices?" And have them implemented for them.

Charity: It's also a way of doing this incrementally, of introducing change without having to make a big-bang switch from one system to another. You do it in steps, and you check yourself with observability. That's how you gain confidence in what you're doing.

Rachel: That's what I was trying to get to earlier. Is observability a way to make systems explain themselves?

Charity: Yeah, absolutely. Instrumentation is the way that the software explains itself back to us.

Liz: You can't even do a successful migration without knowing, "Did you make it better?" If you don't have the metrics you're not going to be able to tell.

Charity: "Did you do it well? Did you do it completely? It will always be different. Do you know how?"

Rachel: "What is time? What are numbers?"

Charity: It's true. I've become accustomed to having such fine-grained visibility into my systems, and I think back to some of the database upgrades that I did. MySQL 4.1 to 5.0. I didn't know, "Did it work? It seems to be up. Site seems to be working. Log in once or twice, yup, carry on." Terrifying to me now. I'd never do that.

Liz: Yeah. We need to have both the top-level service level objectives, "Are things working correctly? Are you confident that your customers are happy?" As well as the ability to dive in when your service level objectives are in danger.

Charity: I really like what Google has done just to proselytize SLOs and SLAs, because that to me has forced engineers to start thinking more about business and users.

Your beautiful nines don't matter if your users are not happy.

Rachel: There it is. Charity's catchphrase. Take a drink.

Charity: It's going to be on my gravestone. But seriously, there was a month at Parse where our uptime was 99.99% and I just looked at it and went, "People are complaining all the time. Either we're not measuring the right thing or something's wrong."
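As a back-of-the-envelope illustration of that point (the numbers here are invented, not from Parse), a four-nines aggregate success rate can coexist with genuinely unhappy users when the failures cluster on a small slice of the traffic:

```python
# Illustrative arithmetic with made-up numbers: aggregate availability
# versus what one customer actually experiences.
total_requests = 10_000_000
failed_requests = 1_000  # 0.01% of all traffic failed

aggregate_availability = 1 - failed_requests / total_requests
print(f"Aggregate availability: {aggregate_availability:.4%}")  # 99.9900% -- "four nines"

# Now suppose every one of those failures landed on a single customer's workload.
customer_requests = 50_000
customer_failures = 1_000

customer_availability = 1 - customer_failures / customer_requests
print(f"That customer's availability: {customer_availability:.2%}")  # 98.00% -- they notice
```

Measured in aggregate, everything looks fine; measured per user, someone is having a very bad month.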

Rachel: Liz, we've both come up with super positive stereotypes about Google and super negative stereotypes. I'm guessing that the truth is somewhere in the middle. What are the kinds of things that people get wrong about the inside of Google, and especially SRE?

Liz: There is definitely a myth that every single team at Google has SREs, and that's absolutely wrong. SREs build the platform that everyone else uses. And sure, some services do have SRE support, but it's almost a failure of complexity if you need a dedicated SRE team for your service. You're much better served building on top of the platform and managing it yourself.

Charity: And everyone's on call.

Liz: Yeah.

Rachel: How does that compare to the infra team within Facebook? Is it an exact analogy, or is it different?

Charity: Facebook is the same. Production engineering: some teams have one or two PEs in the rotation just as the software engineers are. Sometimes software engineers don't get PE support until their service is of a particular quality, so it's supportable.

It's the carrot and the stick. Some teams are half and half, production engineers and software engineers, but overall there's about a 7-to-1 or 10-to-1 ratio of software engineers to PEs. They're pretty rare; it's an unusual skill set.

Liz: That's about the same for Google as well. There are about 2,000 SREs compared to tens of thousands of software engineers doing product development.

Charity: Makes sense.

Rachel: How many other companies would you think of in the vanguard of this? Maybe Amazon, anyone else?

Liz: I am not familiar enough with Amazon's practice in this area. I haven't seen a lot of externally visible efforts there.

Charity: They don't really talk about what they do internally. Everything I know I only know from drinking with people.

Liz: Right. So, examples of companies where I have seen a lot of things that I respect and admire include Lyft. Lyft and the Envoy effort are so much in the vein of what we do as SREs: building platforms that do the right thing out of the box.

Charity: Etsy used to be on the cutting edge of this stuff.

Rachel: Yeah, when Allspaw was there.

Charity: They were the first generation of really living out their values.

Liz: And then of course you have companies like Dropbox, you have companies like Facebook, you have companies like Shopify and Fastly. There are a lot of companies that are traveling in the same direction that we are, and it's really exciting to have this community around SRE where we're sharing with each other.

Getting to where the next 5 years are going, and the biggest change I've seen in the past two years, has been the change from people not talking about how they do operations and considering it a secret, to--

Charity: To not being able to shut up about it.

Liz: Yeah. To not being able to shut up about it, like us.

Charity: It's a welcome shift.

Rachel: Fastly is another one where the founder wrote software that reflects his true character. Fastly is impatient software, like Artur. We've got to get Artur on the show.

Charity: Yeah, we do.

Rachel: All right, so we've kicked around the phrase "software ownership." What does it mean to each of you?

Charity: To me, it means you develop software, no matter how small or how large, to production quality. You have the ability to deploy, the ability to roll back, and the ability to debug it in production. That to me is the full lifecycle. If you're missing one of those elements, if you don't have the power to deploy, roll back, and get the system into a good state, it's not ownership.

If you don't have the time to write code, that's not ownership.

And if you don't have the tooling that lets you explore and understand how it's performing, maybe you could call it ownership, but it's like negligent parenting. Like, you're not feeding your kids. To me, that is the natural end-state of the entire last 10 years of DevOps. I've said this many times, but the first wave of DevOps was all focused on lecturing ops people and telling them to learn to write code. And we did.

It's no longer considered acceptable to be an engineer and not write software. The last couple of years, the pendulum has really started swinging the other direction. The focus is on, "OK, software engineers, it's your turn." The systems are getting so large that with distributed systems, operability is the primary concern. Most of our time is spent maintaining and extending and debugging software, not doing greenfield development.

Liz: Yeah, it turns out there is a fascinating statistic we looked up for the SRE book: somewhere between 40% and 90% of the total cost of ownership of software is spent maintaining it, not writing it.

Charity: And people ask all the time, "Is ops going away?" Well, no. It just looks different.

Liz: I like to think about ownership in terms of, "Who owns the user happiness? Who owns the reliability of this?" And it's important that everyone be invested in it, that everyone have agreement on, "How are we going to measure it? How are we going to defend it? What are we going to do?"

Charity: User happiness does not always correlate with reliability, either. Sometimes it correlates more with other metrics. Obviously, if your system is down all the time, that's not going to make people happy.

Rachel: Slack is pretty outage-y, but I still love it.

Charity: Slack is pretty shitty all the time. It has to be good enough.

Liz: Yes, it has to be good enough. "Good enough."

Charity: The iPhone camera of reliability.

Rachel: That's a super interesting idea, that ownership is owning user happiness. Because it makes me think about Steve Jobs and how completely obsessed he was with the out of the box experience, and the integrity of the product. Is there some product thinking in this idea of software ownership?

Liz: There is product thinking, and there is ethical thinking. If we own user happiness, then we as the people who are operating the system need to think about, "How does our software impact the people that it interacts with?"

Rachel: And, "What are the disastrous failure modes?"

Liz: Exactly. Ethical failures are product failures and reliability failures.

Charity: I've definitely worked on products where I felt better about the world when they were down.

Rachel: Not to pick on Facebook, but this discussion about on the one hand Facebook is saying, "We can pinpoint your ads with laser precision." And on the other they're saying, "We can't really figure out who's posting all of this nasty stuff."

Charity: It's the purest form of capitalism. It is completely amoral. They're chasing clicks.

Liz: It's a willful disregard for explainability.

Rachel: That's interesting. Say more about that.

Liz: You can perfectly well design a system that can laser-target things but can't tell you why it's doing it. It is perfectly self-consistent that both might be true.

Rachel: They're telling the truth.

Liz: Yeah. They may very well be telling the truth because they haven't invested in explainability.

Rachel: Because they don't want to own it.

Charity: Yeah. "The software did it."

Rachel: Capitalize on it without ownership.

Liz: It's like saying, "My dog ate my homework."

Rachel: Should the people who write the code also support the code in production? I guess we've covered that. That's a resounding "Yes" from all of us.

Charity: I don't think anybody is going to say no to that. At this point they're just going to come up with a lot of reasons why it's inconvenient, and they need exceptions.

Liz: We do need to support people who have, for instance, recently had a child. We have to support people who can't work between Friday evenings and Saturday evenings.

Charity: Absolutely.

Rachel: I was going to say, we're all still using Unix, and Dennis is gone and there's systems which will outlive us. Who supports them then, is it our children? Is this feudal?

Charity: There's an element of shared sacrifice to this, but it is not self flagellation. There's definitely a rule that anyone who is being woken up by a small screaming child is not also going to be woken up by the pager, because that's just inhumane.

Rachel: How small? Because my big one is 15.

Charity: But that doesn't mean that they don't do their share. They'll be on call during the day, maybe handle extra escalation from support. I had a boy on my team at Facebook who tried to be on call and really wanted to pull his weight but he had such extreme anxiety. He just wouldn't sleep, all week.

Even if the pager didn't go off, it wasn't getting better, and it's who he is; he's in his 30s. So we took him off call, because that clearly was not OK. Instead, he owned Jenkins reliability in the build pipeline, and I swear that was worse. But it didn't impact his life.

Yes, we need to expect some shared supporting from everyone, but it doesn't have to look the same for everybody.

Liz: It takes different forms. That's part of the discussion of mental health in SRE and ops, and also of humane on-call. I found it really cool that Intercom gave a talk last week at Datadog Dash.

Charity: David at Intercom is really doing cutting-edge stuff.

Liz: Yeah. Intercom gave a talk at Datadog Dash about how they staffed their 12 hours of not being in the office out of just volunteers.

Charity: 100% volunteer.

Liz: And that was amazing. That was cool to hear about.

Charity: To hear their engineers talk about it, it is a badge, it is prestige. You know enough about your systems to be responsible for them, and they don't get paged more than once or twice a week. It's what they aspire to do, and that is exactly how it should be.

Rachel: That's super interesting, because it's another axis of diversity, it's another axis of difference. Mental illness is part of it and just temperament. Extrovert vs. introvert.

Charity: This is something that we should pay more attention to. We talk a lot about how to mitigate the downsides of on-call, we don't talk about what an amazing recruiting thing it can be for your team if you do it well and nail it and talk about it a lot.

It's a differentiator in a world where everybody is competing and trying to outcompete each other to get engineers, and engineering time is the scarcest resource. Just fix your team and talk about it and you will not have trouble hiring.

Liz: You also waste less of your engineers' time. It's a win-win.

Charity: And stop trying to rebuild things internally that you should just be paying vendors to do.

Rachel: Anyone who has had one toxic job in this industry becomes hyper aware of team dynamics. You can spot a speaker on stage and you can tell.

Liz: That's also an important thing. Trust has to be earned, and therefore it's totally understandable that someone might say, "Put me on the tickets rotation for a while. That's how I want to contribute to ops until I can actually see what your team's on-call culture is like."

We all have different ways that we contribute, and the most important thing when we talk about ownership is, "Are you involved enough in the system to understand its production characteristics? Are you involved enough that you're going to not throw a heap of toil and operational load on someone else?"

Charity: The reason I keep emphasizing prestige and everything is because it's the difference between norms and laws.

Yes, you can enact laws, but they are a very blunt instrument and they tend to backfire on you and nobody can keep track of them if you have too many laws and everything. But if you have a norm, then people can use their judgment for when to flout it.

If you have a norm that everyone is on call and this is something you aspire to, then you can trust people to opt in or out. But if you haven't built a culture of that, then you have people who don't want to participate and suddenly more than 50% aren't participating so you have to make a law so that everyone has to participate. Any time you're trying to push behavior around with rules, you've lost the plot.

Liz: It's a delicate balance though. There are all these interesting studies about how women tend to get "voluntold" into volunteer tasks. If you have a guideline then it's this tricky thing of figuring out, "How do you enforce accountability and fairness?"

Rachel: How about this? How about we expand the notion of software ownership to, all of us who are involved in the production of software are ultimately responsible for the damage software wreaks on the world. To people who are on call, to Muslims who may be on a registry. What if we all step up and own that and try to make things better?

Charity: The reason that's so frustrating to hear is because we all say, "Yes!" But we don't have the power, we don't have the power to make changes.

Liz: We don't think we have the power to make changes, and I think that's--

Charity: We don't directly. In aggregate we do, but this is a slow and frustrating and diffuse way of wielding power.

Rachel: So, let's unionize.

Liz: Charity can't speak because she's a manager, whereas I can. That's one of the powerful and useful things.

Charity: I would love for there to be unions in the tech industry. I would love it. At the same time, it does feel a little bit indulgent sometimes when people start talking about it, and I'm just like, "OK. We're all being paid $150,000 or more a year. There are people in this society who are suffering so much more than us." I don't know. I'm torn.

Liz: We can do both. My viewpoint on that is, "Why not both?"

Charity: Let's do both.

Rachel: All right. We fixed it. Congratulations!

Charity: We fixed it. I think we're done. Good job everyone.

Rachel: Everything is great. Thank you so much for joining us, Liz. This was a delight.

Charity: This was great.