Ep. #9, High Performance DevOps with Jez Humble
In episode 9 of o11ycast, Charity and Rachel sit down with Jez Humble, Co-Founder and CTO of DevOps Research and Assessment, or DORA (acquired by Google since this session was recorded), to discuss DevOps security and how a team's culture relates to their success.
Jez Humble was the Co-Founder and CTO of DevOps Research and Assessment (acquired by Google since this session was recorded), a prominent team of DevOps researchers. He is also the coauthor of several books on software, including Continuous Delivery, Lean Enterprise, The DevOps Handbook and, most recently, Accelerate.
Transcript
Charity Majors: Very few people have had the cross-industry exposure to DevOps teams of all types that you on the DORA team have had. To what extent are there unifying skills that cross all boundaries, or is it impossible to characterize? Really, what I'm asking is, how fungible is the DevOps skill set, really?
Jez Humble: That's a great question, and I can only talk about things that I've seen. I don't think there's a common skill set in the sense that you must have this, this, this. My personal experience, which I've kind of seen reflected in the industry, is--I started in 2000 at a dot-com in London where I was the second employee, and I had to do everything. I was racking the servers, I was sorting out the operating system.
Charity: And you knew how to do none of it when you showed up, I assume.
Jez: Right.
Rachel Chalmers: It was a great time for those of us with English degrees.
Charity: The music majors.
Jez: I had a philosophy degree. I don't have a CS degree.
Charity: This is a room full of liberal arts majors.
Jez: Perfect. I had a personal PC when I was a kid and I did my philosophy essays on Emacs on Red Hat.
Charity: 'Cause you do.
Rachel: Pour one out for Red Hat.
Jez: Right. My god.
Rachel: Kids, this was before there were any SREs. We called ourselves sysadmins.
Charity: Sysadmins.
Jez: System administrators.
Charity: Sysops, even.
Rachel: When sysadmins ruled the earth.
Jez: Right. The only reason I got Red Hat was because I bought a computer with Windows on it and it died, and I didn't have the CD, so I went downstairs.
Charity: It's free.
Jez: Right. [I went to] Oxford University computing services, and I'm like, "Can I have an operating system, please?" And they said, "Here's a Red Hat CD."
Charity: Passed around like...
Jez: It was solving problems, and everything's evolving all the time.
The idea that you can be an expert in some skill and have your career be that thing--once upon a time, maybe, but not anymore. The one universal is that we're all solving problems, and that's really what it's all about.
I mean, I was doing some work with Nava PBC, and there was a new hire there and we were talking about DevOps.
I said, "What does DevOps mean to you?" And she said, "All the things that I didn't need to know when I used Heroku," which I thought was a great definition of DevOps.
Rachel: Yeah. That's really good.
Jez: Right, which is fascinating, but it points to the fact that you've got to learn shit, you've got to solve problems, you've got to work it out. And certainly I've been in situations where I haven't known how to do things, and I've been really scared about it, and I've been like, "Oh my god, what if I screwed this up?" And I get through it with the help of people around me. Being able to ask for help is a really important skill.
Charity: Most screw-ups are not permanent.
Jez: Huh?
Charity: Most screw-ups are not permanent.
Jez: Right, most screw-ups are not permanent. In most places, there are people you can ask to solve problems, and if not, you find a way to do it in a way that hopefully isn't catastrophic. Which is an important skill--knowing how to try things out in a way that isn't catastrophic. But ultimately, it's electronics; it's machines. The fundamental principles are well-known and well-understood.
Rachel: All ones and zeros. Now would be a great time to introduce yourself.
Jez: My name is Jez Humble, and I'm Chief Technology Officer of DevOps Research and Assessment, a small, three-person company that researches DevOps and does assessments for people. I've also written some books on software: I co-authored Continuous Delivery, Lean Enterprise, and The DevOps Handbook, and most recently Accelerate, with Dr. Nicole Forsgren, my CEO, and Gene Kim.
Charity: And Jez is short for Jezebel?
Rachel: It is now.
Charity: Great, just checking.
Rachel: So, as I said, Charity and I have already committed the Accelerate State of DevOps Report to memory, and we preach it to the team.
Charity: I read it every night before bed, a few verses.
Rachel: We fell on it with glad cries, because you identified this group of elite DevOps performers, and this mirrors exactly what we've been seeing in the market. Can you talk about your elite group and what characterizes them?
Jez: Yes. This is the fifth year we've done it, and in previous years we've had low, medium, and high performers. We do this thing called cluster analysis in statistics, where we take all the responses and we ask the algorithm to split them into groups that are more like each other than they are like the other groups.
Rachel: This is where Nicole's academic background really shines in the report, it's so clear.
Jez: Yeah, I mean Nicole is a truly brilliant statistician and scientist, but she's also really good at explaining things, and if you read the second half of Accelerate, she lays out all the statistical stuff behind it in a very clear way, which is excellent. A great communicator as well as just a brilliant scientist. I'm not going to go into the details here, but we do cluster analysis, and we find high, medium, and low performing groups.
Then this year, something really interesting happened. We found this elite performing group, which is actually a subset of the high performers--7% of our overall responses. They are deploying on-demand, multiple times per day. They can get changes out into production in less than an hour, and in common with our high performers, they also achieve really high levels of stability. They can restore service in less than an hour in the event of an outage or service degradation, and they have these really low change fail rates, which is a measure of the quality of your release process. When they push changes out, they don't typically need to remediate.
So they achieve high levels of stability, high levels of throughput, and we measured availability for the first time this year. They also achieved significantly higher levels of availability. In fact, 3.5 times higher than our low-performing group.
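As a rough illustration of the cluster analysis Jez describes, here's a minimal sketch: survey responses scored on the four measures he names (deploy frequency, lead time, time to restore, change fail rate) are standardized and split into groups that are more like each other than the rest. The data, the scaling, and the choice of k-means are illustrative assumptions, not the study's actual method or numbers.

```python
# Sketch: splitting survey responses into performance groups, in the spirit
# of the cluster analysis Jez describes. The data, the scaling, and the
# choice of k-means are illustrative assumptions, not the study's method.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# One row per respondent: [deploys/day, lead time (hrs),
# time to restore (hrs), change fail rate (%)] -- fabricated numbers.
responses = np.array([
    [10.0, 0.5, 0.5, 5.0],      # looks "elite": on-demand, <1h lead time
    [8.0, 1.0, 1.0, 10.0],
    [1.0, 24.0, 8.0, 15.0],
    [0.1, 720.0, 48.0, 20.0],
    [0.05, 2160.0, 336.0, 45.0],
    [0.02, 4320.0, 720.0, 50.0],
])

# Standardize so no single metric dominates the distance calculation.
features = StandardScaler().fit_transform(responses)

# Ask the algorithm for groups more like each other than the other groups.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)
for row, label in zip(responses, labels):
    print(f"cluster {label}: {row}")
```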
Charity: I love that you attached actual numbers to those groups. This was really eye-opening for me because I read that, I read the numbers and I went, "Holy shit, we've been building for not only the top 7% but the top 3.5% of all teams," and I thought in my mind that we were building for more like the top 20%, so that was a real come-to-Jesus moment for me. "Oh shit."
Rachel: I have been telling you for two years that you assume everyone's somewhere near as smart as you are, and--
Charity: I'm not that smart, honestly, but I've been fortunate enough to work with really great teams and so my idea of what a great team looks like is skewed.
Rachel: But what really fascinated me about this is that the difference between these elite teams and the rest often comes down to culture, and you spend a lot of time talking about what characterizes the culture--
Charity: I went on a long Tweetstorm about this, 'cause I started thinking about it like, "This is so true, it is not the best, quote-unquote, engineers that make the best teams." God bless them, I love my team, but it's not the best engineers that I've ever worked with at Honeycomb. By and large, on average, we have a lot of junior folks. Very consciously, we've brought in a lot of junior folks, a lot of people right out of hack academies, and that's part of why I thought, "We're probably around the 75th percentile," because it's not the superstar engineers that I've worked with.
The skill set of being a great team member is not the same skill set as being a great, quote-unquote, engineer.
Rachel: Exactly, and--
Jez: Totally agree.
Rachel: It would be so easy to conflate these results with "Those teams just have the 10x engineers," and that turns out to not be true. It's all about emotional intelligence.
Jez: There's a great story from Adrian Cockcroft, where he would go and give these lectures to big enterprises, and they'd say to Adrian, "It's all right for you at Netflix. You've got all these amazing people. Where do you get them from?" And he would turn around and say, "I get them from you," because it's not actually the people.
Charity: There's a little bit there where I'm like, "OK, but the Netflixes and the Facebooks of the world, because they have tens of thousands of applicants every year, they can screen them however the hell they want to and get great candidates, right?" So I feel like, yes, they're kind of skimming what they see as the cream of the crop. But they're skimming the cream of the crop in a very narrow sense, and they're leaving behind a lot of incredibly high performers because they don't conform and they can't answer this one unique distributed systems question in an interview.
Jez: It's really interesting. I took a totally different lesson away from that story than you did, which is fascinating in its own right. What I took away from that story was that it was the system effects that were important, and the organizational structure, not the individual people.
Charity: That's also true.
Jez: I see that again and again.
Charity: 'Cause the thing that you see is that an engineer is hired into a Netflix or a Facebook or whatever, and they come in with a set of raw skills, which are what the Facebooks and Netflixes value. But they come out transformed, and they are able to take that culture and transplant it somewhere else, and bring the things that they've learned, and bring the team up with them, because they're then the ambassador for the culture that they learned, and the culture was what empowered engineers to succeed.
Jez: I agree.
The team is the unit of performance. This idea of individual performance is fundamentally flawed.
Charity: It's ridiculous.
Jez: Individual variance is overwhelmed by team and organization-level effects, and--
Charity: We love to think in the West that it's the individual, but we are the product of our environment and the things around us.
Jez: Absolutely, and I think that the only way it makes sense to talk about individual difference-- I really love the work of Carol Dweck on mindset. When people talk about A and B players, I love to tell the story that Malcolm Gladwell tells about the big company that followed the McKinsey advice of the '90s of only hiring the A players and letting them do whatever they want. Which was Enron.
Rachel: Lord of the Flies.
Jez: Right, and there are A and B players. The A players are the people with a growth mindset, which goes back to what we were talking about earlier, about DevOps engineers being people who solve problems and acquire the necessary skills, and master them in a team context.
Charity: And are willing and able to turn that eye towards problem solving on the organization itself, and iterate. To learn from their failures, not just try the same thing over and over.
Rachel: This is one area where I actually see the really big players learning from their experience. I would say that in the last 10 years, Facebook and Google have changed their recruiting, and it still systemically discriminates against underrepresented people, but one of the things they're literally testing for now is emotional intelligence. At Facebook it's called the Jedi qualities, and you are seeing, on the one hand, this selection of people who do know how to problem-solve in a consensus way.
On the other hand, and this is something we continually neglect in these conversations, you're also seeing organizational commitment to having those people's needs met. Having really good insurance, having really good support for maternity and paternity leave. Because a lot of what we're talking about, when you look at the difference between A and B players, and people who have resilience and people who don't, is how many resources they have available to them.
Charity: And how distracted are they? Can they show up to work with their whole self and not feel tugged in 15 different directions 'cause they're so anxious about everything?
Rachel: It's the emotional intelligence of the organization that allows people to feel the emotional safety that they need to be able to problem-solve in innovative ways.
Jez: There's a really good interview in The New York Times with Laszlo Bock. He used to be SVP of People Operations at Google, and he talks about the three characteristics that he hires on. He says all those dumb questions like how many pianos there are in California, he literally says those predict nothing. What predicts stuff is your learning ability, the ability to synthesize information and process it on the fly; your mindset, people who can learn from failure and who don't commit the fundamental attribution error; and emergent leadership, which is not just "Can you take charge where you can help?" but also "Can you step back when it's time to let someone else lead instead?" I thought that was just good.
Rachel: There are enormous applications for my industry when-- By my industry, I mean finance and venture. If you look at how venture is conducted, the vast majority of it is still selecting for A players, and far from providing psychological safety, it's the opposite.
Charity: They're chasing those few outliers that they think will--
Rachel: Everybody wants their own Uber or Airbnb. It's very much winner-take-all, and I think it's still very much an open question whether you can practice venture in a different way and be sustainable.
Jez: There's a good book I read called Chasing Stars [by Boris Groysberg], which is all about the portability of performance in financial services. Sounds like you've read that.
Rachel: I have not, but it sounds amazing.
Jez: It's really good and that talks about some of this stuff as well.
Rachel: Anyway, we've totally derailed from talking about your amazing report.
Charity: It was a great derailment, though.
Jez: I can talk about culture in the report. I mean none of the stuff I just talked about is--
Charity: What characterizes some of the organizations who are struggling with DevOps?
Jez: We got another really interesting group out of our cluster analysis this time, and--
Rachel: The underperformers, I think you called them.
Jez: We really had trouble finding a good name for this group.
Rachel: Because their intentions are really good. I did find this whole section really fascinating.
Charity: One of the most, one of the--
Jez: Misguided.
Charity: Slogans that have had the greatest impact on me: "What's the worst that can happen?" We used to say this to each other all the time at Linden Lab. When we were trying to decide whether or not to do something, we'd often just look at each other and go, "What is the worst that could happen? Site goes down? Fine, it's happened before. We can get it back up." I love that attitude, because more companies fail, more teams fail, because they don't try enough things or because they don't move fast enough. They don't fail for lack of fundamental resources; they fail because they aren't moving fast enough and trying things fast enough and learning.
Rachel: Linden always fascinates me. I mean you've heard me say this before, but there's a generation of extraordinary women engineers who came out of Linden.
Charity: And trans engineers of both genders.
Rachel: And it wasn't even a particularly progressive environment as San Francisco companies go, but it was psychologically safe.
Charity: It was psychologically safe.
Rachel: In a way that maybe--
Charity: I would show up to meetings in my pajamas. I would sleep wherever, and I never felt unsafe to do anything there.
Rachel: If you had told me 15 years ago that Second Life would be this extraordinarily influential engineering achievement, I'd be like, "I don't think so," and yet here it is.
Charity: In many ways, it wasn't. I never really-- We were before our time, but--
Rachel: Technically, it wasn't. Culturally, it's been enormous.
Charity: Huge. It's really undervalued.
Jez: It's interesting. I was working for the federal government at 18F a couple of years ago, and we had some ex-Linden people at 18F. 18F was, for me--I mean, I never worked at Linden, but 18F had that whole psychological safety thing. It was really good in that respect, and I loved working there. On my first day, I got this sticker, which was made by a woman called Lauren Ancona, and it says, "Winging it: we're all making it up as we go along," and I stuck that on my laptop. It was one of the first things I did when I joined 18F, and that really helped me, that idea that actually, we're all in the same boat. We're all just making it up, and we can lean on each other.
Rachel: I wonder if this is a struggle that we have when we're talking about observability, because we have tended to talk about how complex these systems are and how difficult it is to manage them, and how the tools need to help you navigate this really difficult world. Maybe what we need to foreground is the collaboration that you can achieve with observability tools and--
Charity: This has been a thing at Honeycomb from the very beginning: we don't build for individuals, we build for teams. I always say I learned Unix by reading other people's Bash history files. That's how I learned, and to the extent that we can tap into people's curiosity, lower the barriers, and just tap into the social snoopiness-- I want to just look over the shoulder of the great engineer who's working on a thing that I'm fascinated by. I don't want to ask them, because that's terrifying. I just want to look at the questions that they're asking through this tool. I want to look at their history.
If I get paged about something, the first thing I want to do is go to the expert in that--and not talk to them, because maybe it's 3 AM, but look at what they were doing when they interacted with that system. Because it's so informative, and this is the thing that we've seen over and over. We have a hard time getting Honeycomb into organizations, but once we get it inside, we have to do almost nothing after that. The adoption just goes up and up, because we've made it so easy to just observe what each other is doing and collaborate.
I post something into Slack, just like, "This is interesting." You click on it. You can access not just my graph but my history and everything that I've done. And I really want to find ways to incentivize people with the UI to add comments. Put something that's in your brain about the context into the tool, so that other people can search it and explore using that information. Because we can't keep leaning on our brains to reason about these systems; they're too big, they change too much, and they're sprawling. And not only that, but my brain is not accessible to you. The tool is.
Rachel: This is where I think the real legacy of both XP and Agile is: pair programming. I think it dwarfs every other innovation. Just sitting beside somebody, sitting down together and working on something--it forces you to empathize, and it forces you to see through somebody else's eyes.
Charity: Talk and think out loud, right? That to me was the biggest value I got from public speaking. I've always been someone who could not think and talk at the same time. I could do one or the other. If you wanted me to change my mind in the course of an argument, I would have to leave the room, and--
Rachel: Put on a new persona.
Charity: I would. I would process it offline and come back to you. Now I can, after five years of just grueling practice. And pair programming is the same way: it forces you to talk through what you're doing and thinking.
Rachel: That's the fundamental problem. I mean, Jez, you're characterizing everything we do as technical workers as problem-solving. Our little brains are too small to solve these problems. The systems have emergent behaviors--
Charity: They've gone--
Rachel: Way beyond human scale. We have to figure out how to work together.
Jez: The other problem, which you also hint at, is that a lot of the heuristics we use are not actually accessible to the conscious mind. You're using these heuristics, but if you ask someone to explain what the heuristic is and how they're applying it--it's intuitive, so that's no good. You can't scale that, and so you've got to work out how you can explain and teach other people that. Which is one of the big problems of Ops.
I mean there's no ops school, there's no DevOps school, and we all develop these heuristics that we can't articulate but we apply all the time.
Charity: This is the thing I talk about all the time when it comes to the tools that we've had. We've got the time series aggregates, where everything that happened in this interval of a second gets smushed into one value. The only way to interpret these old-fashioned dashboards is with your intuition, and when you give them to a software engineer and say, "Software engineer, please be on call, here are your dashboards," they don't have that intuition. It's literally impossible. You're asking them to do two jobs, and a big part of what we've been trying to do at Honeycomb starts from recognizing that that's just a bad model.
You have to be able to get to the raw events. You have to be able to speak not in terms of CPU and load average and memory, but in terms of functions and variable names, for software engineers who spend all day looking at code. When they're trying to debug it and understand it, they need to be looking at something that's familiar to the context from which they come. I don't remember where I was going with that.
Yes, intuition, absolutely, I absolutely agree. The reason I rant about dashboards is you need so much intuition to unpack what's going on. There are no straight lines to draw to the problem that you're trying to solve in your code.
Rachel: This is why there are so many humanities grads in DevOps, because it is about intuition. It is about interpreting very, very complex signals from very, very large fields of data.
Jez: And synthesizing.
Charity: Yes, and also, if you're a good enough technologist--I consider myself a good enough technologist; I'm a pretty weak programmer, but a good enough technologist--communication skills are so key that it kind of doesn't matter, because you're on a level playing field.
If you can communicate and synthesize and ask for help, then you're just as good as anyone who is a rock star at those things. None of us are going to build Google on our own. The key insight of Sapiens was that humans are a storytelling culture, and the thing that sets us apart is our ability to build these massive stories together, and to build off of each other's work.
Rachel: Like the US Constitution.
Charity: Yeah.
Jez: This is why I hate this idea that we should use human nature as some guide to how we should create our societies, because the one thing that distinguishes us is that we change things and solve problems and synthesize things in new ways that have never been done before.
Charity: Evo-psych is bullshit.
Rachel: Co-signed so hard.
Jez: Yeah, plus a hundred. I do want to get back to the misguided performers and people who do badly, which we briefly touched on. We found this really interesting group where they're deploying really infrequently, and it takes them a long time to get stuff live--between one and six months--and they also have this relatively low change-fail rate, 16 to 30%, which is higher than our high performers' but not as bad as our low performers'.
Here's the thing. As you know, data doesn't tell you why people are doing something. What happens is their time to restore service is really, really bad. What we think is happening is they're putting loads of work into preventing bad things from happening--doing more testing, more inspection, more heavyweight change management. They're trying to really make sure nothing goes wrong, and most of the time, that works. But when something does go wrong, oh my god, they're totally fucked, and it takes them a really long time to fix it.
Rachel: Again, because of my finance DNA, this just totally jumped out to me as a risk-averse population. They're trying really hard to forestall any hint of failure, and as a result, their results are not competitive with the top--
Charity: When I talk about testing in production, this is what I'm talking about: taking some of that energy away from preventing failures, and literally just reassigning it to resiliency, to detecting quickly, to making it not that big of a deal, to rolling out to small subsets of the population using feature flags or internal testing or canaries or automated promotion. It's not free. It's not like you just quit caring about stuff after you've taken these resources away from pre-production. It's that you reallocate them to hardening and to making it not a big deal when failures do happen.
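Here's a minimal sketch of the small-subset rollout Charity describes: a percentage-based feature flag with stable hashing, so the same users stay in the early cohort as the rollout widens. The names and the bucketing scheme are illustrative assumptions, not any particular vendor's API.

```python
# Sketch: rolling a change out to a small, stable subset of users, in the
# spirit of the feature flag/canary approach Charity describes. The names
# and the bucketing scheme are illustrative assumptions, not a vendor API.
import hashlib

def in_rollout(user_id: str, flag: str, percent: float) -> bool:
    """Deterministically bucket a user, so the same users stay in the
    canary cohort as the percentage is ramped up."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    return bucket < percent / 100.0

# Ramp 1% -> 10% -> 50% -> 100%, watching your instrumentation at each step.
for user in ["alice", "bob", "carol", "dave"]:
    path = "new" if in_rollout(user, "new-checkout-flow", percent=10) else "old"
    print(f"{user}: {path} code path")
```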
Rachel: Now I realize it dovetails with the helicopter parenting versus free-range parenting conversation as well.
Jez: Right. You can't. Also, that's how kids learn.
Charity: It's how systems learn, too. They need to fail a lot, it turns out.
Rachel: Systems are our kids.
Charity: They are, in a very real way that's a little too real. But yes, they need to fail a lot, and teams need to practice failing a lot, too. Everybody on your team needs to know how to get to a known good state. Everybody on your team needs to not freak out when something breaks 'cause--
Rachel: They need to learn to self-soothe.
Charity: Spoiler alert, things are broken right now that you just don't know about, all the time.
Jez: This is why I like the DiRT stuff that Google does.
Charity: Eating dirt? That's great.
Jez: No, but yes. Also good.
Rachel: Good for your immune system.
Jez: The Disaster Recovery Testing exercises they do at Google. Kripa Krishnan's team.
Charity: Is it literally called DiRT?
Jez: Yeah, Disaster Recovery Testing.
Charity: Eating DiRT, I love that. For your system's immune system.
Rachel: I found myself feeling so much compassion for these misguided underperformers, because their intentions are really good. I was sort of thinking, "Who hurt you? Who told you that failure meant that you were not--?"
Charity: They have a boss who yells at them.
Rachel: Exactly.
Jez: There are some other things. Before we go into this, I just want to go back a bit. One of the things you're talking about is that failure's inevitable. I remember this talk that John Allspaw gave, it must have been eight years ago now, where he talks about the move from MTBF to MTTR. Instead of trying to prevent failures and extend the time between failures, it's all about time to restore. That insight has really stayed with me and has proven to be true.
Charity: The thing is that you need to practice it. You do actually need a constant stream of small failures to practice, or you're not going to be able to shrink that amount of time. It's like running drills.
Jez: I love that you're talking about testing in production. Cindy Sridharan's blog posts about testing in production--I found those really excellent as well. She has a really nice diagram about pre-prod testing and then prod testing, and all the different types of prod testing, which I've stolen and put on all my slides--with attribution, of course--because it expresses it so well.
We find that these misguided performers are also strong users of functional outsourcing, which is one of the other things we looked into this year. We find functional outsourcing is really bad, but also pervasive, despite the fact that--you were talking about Agile and XP--they all talk about cross-functional teams and the importance of cross-functional teams. I still go to lots of organizations who say they're doing Agile but they outsource testing, or testing is a completely separate team, and we find that is very bad. Elite performers almost never use functional outsourcing. But that's another one of those practices in XP and Agile where it's either, "Of course, everyone's doing that," or it's, "No, we're not going to do that."
Rachel: Yeah, and it's all the way back to The Mythical Man-Month. The reason those functional teams decrease performance is because of the communications overhead. You've got these very solid walls between the different functions, and getting through those walls takes a real resource cost.
Jez: And you can model this in math. Basically, what happens is you have high transaction cost. When you have high transaction cost, you end up with big batches.
Rachel: Yes.
Jez: Then that's exactly what's happening. It's just queueing theory: when you have big batches, you want to put a lot of effort into stopping something from going wrong, because you just can't work in small batches. And small batches are an essential prerequisite to being able to test in production and restore service rapidly in the event of something going wrong.
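One standard way to put numbers on this tradeoff (a formalization in the spirit of what Jez says, not something from the episode) is the economic batch size: total cost is the transaction cost amortized over the batch, plus a delay cost that grows with batch size, minimized at Q* = sqrt(2DS/H). The sketch below shows the optimal batch growing with the cost of a release.

```python
# Sketch: the economic batch size tradeoff, one standard way to formalize
# "high transaction costs produce big batches." The model and the numbers
# are an illustration, not anything from the DORA research itself.
import math

def optimal_batch(demand_rate: float, transaction_cost: float,
                  delay_cost: float) -> float:
    """Classic economic-order-quantity result: the batch size that
    minimizes amortized transaction cost plus delay (holding) cost."""
    return math.sqrt(2 * demand_rate * transaction_cost / delay_cost)

# 50 changes demanded per week, delay cost of 1 per change-week held back.
for release_cost in [1, 10, 100, 1000]:   # cost of performing one release
    q = optimal_batch(50, release_cost, 1)
    print(f"release cost {release_cost:>4}: optimal batch ~ {q:5.1f} changes")
```

Driving the transaction cost of a release toward zero drives the optimal batch toward single changes, which is the continuous delivery argument in miniature.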
Rachel: God. That's a really strong incentive to be risk-averse.
Jez: Right.
Rachel: Yeah, it's all math.
Jez: It is amazing, the extent to which you can just bring it back to simple math. There are so many organizations where it's just like, "We can't do this, this is impossible," and it's like, "Then you will fail."
Charity: How often do teams actually execute this as a turnaround? How often do teams who are low performers really transform themselves and become high performers? Do you see this happen often?
Jez: Yes. Absolutely.
Charity: What does it take?
Jez: In fact, the story of the Continuous Delivery book is the story of a team that I was on in 2005, where we were doing releases on weekends in the data center using Gantt charts. There were eight of us whose job was to get the software deployed into a production-like environment, and it was a shit job.
Charity: Sounds bad.
Jez: Yeah, it was terrible.
Rachel: And the pay was terrible.
Jez: We were consultants, so we were paid better than the permies, but still. We were literally in this tiny room and it was really sweaty and gross. It was Java, right? So Java's platform-independent.
Charity: Poo, really gross, sorry.
Jez: We're developing on Windows laptops, deploying to Solaris Cluster. That couldn't possibly go wrong, right?
Rachel: Oh my god, you are from the past.
Jez: Our team was in charge of deploying to this Solaris Cluster, and we found all these problems the moment we deployed. It took us two weeks to deploy the first time. We found all these problems, like developers caching data on the file system. That works great on your laptop, not so good in the Cluster. We put NFS in place to fix that problem.
Charity: Now you have two problems.
Jez: Right, there you go, perfect. So that was where it came from. And we actually had automation--we had an 8,000-line Ant script which automated the process. We went to the ops people and we're like, "How do you like our 8,000-line Ant script?" Broadly speaking, they told us to go fuck ourselves, and we said, "What technology do you like?" And they said, "We use Bash." So we said, "OK," and we built them a deployment system in Bash called Conan the Deployer. You would give it a tag in CVS to build off, and the name of the environment to deploy to.
It took us a couple of months to build this thing, and we did some really janky things. We were deploying to WebLogic, and in those days, you installed WebLogic through the click-through installer, and we weren't going to do any of that shit. What we did is we got a clean Linux install, we installed WebLogic, and then we did a filesystem diff, and we took all the binaries and put them in CVS. Then we created a directory for every environment we were deploying to, and copied the environment's configuration into that directory. Installing WebLogic was like, "Check out the binaries, check out the right directory with that environment's configuration," and you were done, so--
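A minimal sketch of that trick, assuming invented paths and repository layout: materialize an install from version-controlled binaries plus a per-environment configuration overlay.

```python
# Sketch of the install trick Jez describes: materialize a server from
# version-controlled binaries plus a per-environment config overlay.
# All paths and the repository layout are invented for illustration.
import shutil
from pathlib import Path

REPO = Path("vcs-checkout")  # binaries captured via the filesystem diff

def install(environment: str, target: Path) -> None:
    """Copy the shared binaries, then overlay that environment's config."""
    shutil.copytree(REPO / "weblogic-binaries", target, dirs_exist_ok=True)
    shutil.copytree(REPO / "environments" / environment, target,
                    dirs_exist_ok=True)  # env-specific files win on conflict

install("staging", Path("/opt/app/staging"))
```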
Rachel: The machine does the clicking for you.
Jez: Right, and it certainly invalidated the EULA, that was clearly the case, but it solved the problem. So we ended up being able to do-- And again, we only had one set of production hardware. We had these Sun E4500s, which were these sexy--
Charity: I remember those.
Rachel: TARDISes.
Jez: Yeah, huge, really good I/O, tiny little CPUs. We calculated that each of these boxes would be outperformed by an iPod, so we were literally deploying to a cluster of iPods.
Rachel: Awesome blinky lights, though. Really good blinky lights.
Jez: They look amazing.
Rachel: And do you remember how they smelled? I used to love the smell of those machine rooms, it's so good. Sorry.
Jez: No, that visceral sense is amplified by those weekends in the data center.
Charity: Rachel is high on drugs this week.
Jez: And also has Stockholm Syndrome from data centers. So anyway, we got it down. We only had one set of hardware, so blue-green deployments were invented to overcome the fact that we only had one set of production hardware.
Charity: Totally.
Jez: But we got the deployment process down to less than a second.
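The episode doesn't say how they achieved that, but one common mechanism for sub-second cutover is an atomic symlink flip between fully prepared release directories; here's a minimal sketch, assuming that mechanism.

```python
# Sketch: sub-second cutover via an atomic symlink flip between fully
# prepared release directories. This mechanism is a common pattern and an
# assumption here; the episode doesn't describe their implementation.
import os

def activate(release_dir: str, current_link: str = "current") -> None:
    """Point the live symlink at a prepared release, atomically."""
    tmp = current_link + ".tmp"
    os.symlink(release_dir, tmp)   # build the new link off to the side
    os.replace(tmp, current_link)  # atomic rename: traffic flips at once

# Deploy and warm up the idle side ("green"), then:
activate("releases/green")
```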
Charity: So you know it can be done. That's not my question.
Jez: How often is it done?
Charity: How often does it get done, and for a team that has been a low performer for a while, what is the ingredient that suddenly makes them transform?
Jez: It's the sense of urgency that actually we need to do something about this.
Charity: And maybe personal identification with the problem. You feel not just like, "Wow, help us." It's like, "I should probably identify with this problem. It's mine to solve."
Jez: Yeah, there's a certain sense that you see learned helplessness in organizations.
Charity: Can teams hire their way out of this problem?
Jez: I think so. I mean, Deming has a saying which I really like, which is that if you hire a good person into a bad system, what happens is that you break the person.
Charity: Then what, is it hopeless? What makes a bad system into a good system?
Jez: What changes, and we see this again in the State of DevOps Report, is good leadership, effective leadership. We talk about transformational leadership in Accelerate, and the things that make an effective transformational leader, and those things predict your ability to implement the technical practices: monitoring and observability, continuous delivery; the management practices: effective management of work in process, visualization of the flow of work and of quality; and product management practices like working in small batches and allowing teams to experiment. All these things are enabled by effective leadership. They predict culture, so that's interesting.
The way you change culture is by implementing the practices and the way you improve the practices is by having leadership which encourages you to do that, and then rewards people who do it well.
Rachel: I'd flip the argument on you, Charity. I would say every company we talk about that's engaged in DevOps transformation is doing exactly this. I mean, the canonical example of the elite is Capital One, which has been around for a long time, and it really was the commitment of the leadership to changing the culture. It's been around since Waterfall was state-of-the-art, and now it's able to--
Charity: It's mysterious because I've never really participated in that. Linden was my college, right? And they did things well, and I've always been lucky enough to work places where they do things well, to the point that I feel offended if I come into a place that doesn't do things well, so I make them do things well.
Jez: Whereas I came from consulting, which is--
Charity: Sure, the opposite.
Jez: Right.
Charity: My other reason for asking this question is because in my experience, and I've said this a few times, when you're putting software engineers on call, assuming a functional team which is a giant assumption. But assuming that you have a functional team, I have put software engineers on call successfully, I have put software engineers on call and had it fail spectacularly. Sometimes, I've had it fail and succeed with the exact same team.
For me, assuming a functional team with good communication and all these things, the missing link there has been observability: giving software engineers a tool that speaks their language, that they can use to debug their own stuff in production. Instead of making it so that they're trying to fly blind, fixing problems without having the data that they need, the debugging context that they need, or the access that they need to do it. My experience has been that with the monitoring software we've built over the past 20 years, you can't put software engineers in front of that and say, "Now own your code." You just can't do it.
Rachel: It's written by ops for ops.
Charity: It's been by ops, for ops, and it takes all the intuition, like you were talking about, to interpret it and to draw a line back to lines of code that are changing. We need to speak to them in their language.
Jez: We find that in the data. We looked at monitoring and observability this year. We found that it's one of the factors that predicts software delivery performance, and it reduces burnout.
Charity: Were you breaking it down by observability versus monitoring tools, though?
Jez: It's really interesting. We asked people a bunch of questions about monitoring, and we asked people a bunch of questions about observability, and what we found--which was kind of weird and interesting in its own right--is that they load together, which means that people perceive them as being the same thing, even though we were very clear on--
Charity: But you could ask questions that are about different things without asking them to define it for themselves.
Jez: Right, exactly, which is what we did. I mean, we didn't say "observability," we didn't use that word, or "monitoring." We asked people a bunch of questions that we had predefined based on our understanding, and on asking domain experts like you about what these things meant. That was kind of interesting. But they do predict, along with those other technical practices, reduced burnout and lower deployment pain and so forth. We also found that what's crucial is having a feedback loop from what's going on in production back to business decisions.
Rachel: Yes.
Charity: Yes.
Rachel: We were really surprised to get our first business use case--what, in the first six months? Charity Navigator: the IT admin just gave his chief development officer an account so she could see big donations roll in. It was astonishing, so there is going to be a--
Charity: There are a lot of these feedback loops that have been going on. Chaos engineering is another big one. Chaos engineering without observability is just chaos. You're just firing shit out there. You have to get down to the raw events to tell very specifically, "What is the impact of this? How did it change? What changed?" Otherwise, you see people doing chaos engineering, they fire shit off into their stacks, and they find out a month or two later that they screwed themselves up in some way--1-2% of all requests have been failing because of something that they did, that they couldn't see.
So I totally hear you when you're saying that people think it's the same thing. It's our job to start explaining to people that they're different, and the reason that it's important to me that we define them and disambiguate them is because we have 20 years of best practices for monitoring. This is good shit that I don't want to lose. I don't want to muddy the waters, because the best practices for observability are often the exact opposite. For monitoring, it's "Every alert must be actionable. You shouldn't have to stare at graphs all day, the system should let you know when it's dead," and all of these rules.
They're good rules, and I don't want to muddy the water, because observability is very different. It's not biased towards outages; it's not biased towards alerts, usually, because it's about interrogating. Maybe the questions aren't about downtime or outages or anything. It's not biased towards that, and you should look at graphs every day. You should have the muscle memory of shipping some code and going to look at it. "Did what you think you just deployed actually deploy? Did you ship what you think you shipped? Is the impact what you expected it to be? Does anything else look weird?" There's just so much context there that your eye may pick out, that you could never have predicted and written a monitoring alert for.
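Those post-deploy questions can be turned into an automated sanity check. A minimal sketch, where the query_* helpers are hypothetical stand-ins for whatever your observability tooling exposes, not a real client library:

```python
# Sketch: Charity's post-deploy questions as an automated check. The
# query_* helpers are hypothetical stand-ins for whatever your
# observability tooling exposes, not a real client library.
def query_live_version(service: str) -> str:
    """Hypothetical: ask the fleet which build is actually serving."""
    return "v1.2.3"

def query_error_rate(service: str, minutes: int) -> float:
    """Hypothetical: recent error rate from your event store."""
    return 0.002

def verify_deploy(service: str, expected: str, baseline_error: float) -> None:
    # "Did what you think you just deployed actually deploy?"
    live = query_live_version(service)
    assert live == expected, f"expected {expected}, got {live}"
    # "Is the impact what you expected it to be?"
    rate = query_error_rate(service, minutes=10)
    assert rate <= baseline_error * 1.5, f"error rate jumped to {rate:.2%}"
    # "Does anything else look weird?" -- that still needs human eyes.

verify_deploy("api", expected="v1.2.3", baseline_error=0.002)
```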
That's why I feel very strongly about observability as an emerging thing. I'm not trying to denigrate monitoring at all. I'm just saying it's different, particularly for software engineers who we're trying to empower to own their own systems. You can't really give software engineers a monitoring system and let them own their own code through it, because it doesn't speak that language of variables and functions and so forth.
Jez: One of the interesting things that's happening more and more as we move to a product-based model is that the system's changing quite often, and the behavior of the system's changing quite often. So this idea that you can predefine the things that are going to predict bad behavior--that just doesn't hold anymore. As we move to--
Charity: Your runbook that will have all of the possible outages: "You can refer to page 74 and..." I don't get that.
Jez: I was talking to [inaudible] Alice the other day, and he was saying, "This is a real thing. Level three support is no longer as effective as it was, because we can't predefine all the things that are going wrong anymore, because our systems are changing so frequently." As we're moving to this Google model of developers owning the software, at least at the beginning, before it's become stable and predictable--that's when you need to be able to understand what's going on without already knowing how the system behaves, 'cause you don't know yet.
Charity: Yeah, exactly. It's a real shift from the known unknowns and the LAMP stack, where you could visualize it, you could look at a dashboard and use your intuition, and you would know where the problem was instantly. You don't get that anymore. Now you have a system that looks more like the national electrical grid. It's chaotic, it's ephemeral, it's blipping in and out of existence, and it's all unknown unknowns, because you've solved the known unknowns. You're not getting paged about them; you fixed them. Every time you answer your phone, it's like, "This is new."
Rachel: This is where I get really passionate about tools like Honeycomb and LaunchDarkly: we are entering this new realm of complexity, and what we're actually working on is tools that encourage the best ways to solve those problems, the most social and collaborative ways for people to bring their insights together and figure out unknown unknowns, really--
Charity: We keep cutting you off, Jez, I'm sorry.
Rachel: Yes, sorry.
Jez: No, it's fine. This is great. The only other thing I was going to say is security.
This is the other place where the unknown unknowns are really important, and monitoring won't necessarily help you because people are finding new and exciting ways to hack into things all the time. One of the things that struck me about a lot of the security breaches we've seen in the last couple of years is people didn't even know they were happening until much later, and it strikes me that's a perfect use case for this.
Charity: Absolutely. It turns out that the origin of most of the chaos in the universe is people. It doesn't matter if they're trying to use your systems or attack them; that's where all of the unpredictable things that you never thought would happen are coming from.
Jez: At least in physics, we have perturbation theory, but--
Rachel: So what are you thinking about going forward, Jez? What interesting problems do you think this research pertains to and where do you think you can apply it?
Jez: We've spent a lot of time looking at the role of culture, which we started off this section of the podcast talking about, and that for me is very interesting. I know Nicole finds it very interesting as well, and we're both really excited about the fact that we found a valid and reliable way to measure culture and its impact, and the factors that impact it. I can see us doing some more stuff on that.
This year, we also looked at the role of retrospectives, and of autonomy and trust, and we found some really interesting results there. Again, people always ask, "How do you change culture?" Because we're using psychometrics, we can investigate that. Just taking a step back: the story of the scientific investigation of software teams is a terrible story, because you cannot do randomized controlled experiments. Or at least I'll say, it's extremely hard to do randomized controlled experiments.
Rachel: And unethical.
Jez: Also, companies don't want to have the control team, because it's expensive. What Nicole did is use psychometrics to investigate software teams. That's a paradigm shift, and it's proven hugely powerful, so that's the future of this: "How can we use psychometric methods to investigate how to build high-performing teams and high-performing organizations?" That, I think, is a huge question, and it's going to be very fruitful, and I would love more people to use these methods.
Charity: We have to stop being one-offs as teams and we have to start learning from each other, yes.
Jez: Yeah, in scientific and valid--
Charity: We've done this in our-- We're starting to do this with our tools more and more, but we haven't really begun in terms of our teams.
Jez: Yeah.
Rachel: Cool. Always such a delight talking to you, Jez.
Charity: Absolutely a delight.
Rachel: Thank you so much.
Jez: Massive pleasure, thanks for having me.