Ep. #5, The State of SRE and Beyond
Nora Jones: Hello, everyone. This is actually the second time we're recording this podcast. I forgot to hit record the first time. We had great content, but we're SREs, we adapt, we prepare for situations that did not go as planned. But today is actually a really special edition of the Getting There podcast. Niall and I are together in person for the first time ever. I don't think we've actually met in person before.
Niall Murphy: I would be surprised if we weren't 150 meters from each other at some point or other at some conference. But not actually introduced.
Nora: Yeah. So we're at SREcon in Amsterdam. I am here as an attendee and a sponsor. Niall is a cochair of the conference, along with his program cochair and the rest of the program committee and USENIX board that helped put together this amazing event. Today we're just going to jive on the state of SRE, what we're learning at this SREcon, talk about some of the talks and, yeah, you'll get us live and unfiltered.
Niall: Yeah. So I suppose to start off with the theme of the conference, the CFP, Call For Papers, Call For Participation, had the text, "What could SRE be?" As the theme of the conference. It's kind of interesting, because the job of cochair is, of course you get roped into a bunch of stuff, but the job of cochair is this very indirect one where you have massive program influence but it's hugely indirect.
You get to ask people to give talks, essentially, and get to curate, work behind the scenes a lot, with the idea of addressing the theme, in essence. I think in this particular situation, we were trying to stimulate a conversation, not necessarily about this existential angst that we face about our identity and so on, which maybe we'll come to, but about the question of what the future of SRE could be.
Because the way I see it, there's a bunch of different flavors to what we do that are entering the mix, and some of those flavors are relatively new like human factors, cognitive systems, engineering safety science, all of that kind of stuff, and some of them are relatively old like large scale systems administration and so on and so forth.
So a bunch of different flavors and trying to start a conversation about not just whether or not we like pistachio or whether or not pineapple on pizza is a valid flavor. Some people are on different sides of that.
Nora: It's so true.
Niall: I didn't think you were that kind of person. Anyway, so it's trying to start a useful conversation about, of those flavors, do we have an opportunity to steer ourselves in one way or another? Should we get the community talking more about certain of these flavors? All of those kinds of concerns, and I think that some of this obviously relates to the economic situation we're in.
The recession, potential recession, et cetera, et cetera. But also a bit about what are the problems of the future going to be? Are they going to look like the problems of the past? Or are we entering into various new domains? I think my intuition is that the problems of the future are as yet... we don't find many examples of them in the past and so we're probably going to have to change as a profession, accommodate more flavors, get better at certain things and so on.
Nora: I'm almost curious, I think you mentioned SREcon has been going on since 2014, I'm curious about how the theme of these conferences has changed throughout time, and also what has stayed consistent. You mentioned large scale systems administration has stayed consistent, but I'm curious to map the theme of the conference to also what is happening in the world. I cochaired SREcon Americas the year that the pandemic started, it was December, 2020.
It was the first time we had to go online, and we actually had a really cool panel there about all these places that suddenly had a lot of scale. We had people from food delivery companies that had to be SREs and so that one was really related to the theme of the world at the time.
I think you're right, I think our jobs are changing, they're evolving. Tech is only getting more used, it's not getting less used. Every company is a technology company and every company is going to have incidents that involve technology.
So our jobs are evolving and changing, and I think like you were mentioning, a lot of the human element is coming into this too. I'm thinking about socio-technical and complex systems, and the humans that are using them, the humans that are building it and the humans that are using it. And, how all of that relates to each other and affects each other, yeah, it's really fascinating to me.
Niall: Yeah. There's a huge amount to respond to there. I think the first thing I would say is very early SREcon flows on really from LISA which was for large scale-
Nora: That's right, the performance. Yeah.
Niall: Yeah. And in fact, LISA, that conference has been shut down now and SREcon is more or less the moral successor of it, however you define that. So there is a thread of continuity about the tactics and the day to day experience of managing computer systems, which continues. That's cool and fine, and we totally, absolutely need that.
However, what has started to happen... obviously this depends on your conception of the world. Personally speaking, for me, this is the year 6 A.B, which is after Bowie, because after David Bowie left I feel the world really took a wrong turn. So after that point there's a lot of concerns about system breakdown in the world. There's this obvious one about the pandemic, but there's political systems and so on. As I said in the opening keynote, if you look at the graphs, we will shortly have a UK Prime Minister every four seconds.
This kind of general instability in the world, I think is bleeding into the conversation that we have about ourselves and about our value. That's one of the themes of the conference, really, which is how do we understand and communicate our value.
Nora: As SREs.
Niall: As SREs. As systems thinkers, which is one of these sub flavors that is in the profession, and the system thinker, the fate of the system thinker as I wrote in a piece a while ago is more or less to be right, to be correct, but never to be valued. I think that this relates to the legend of Cassandra, who is a prophetess and always correct, but never believed.
I think that the difficult thing about never being valued in an organization is that you may well contribute a lot of value, and I believe we do, systems thinkers do contribute value, but if that's not interpreted and processed well by the organization as a whole, or the individuals you come to are buried in their silos, and you're this SRE trying to stitch together a coalition of 17 teams in order to improve something or other-
Nora: And they don't quite know what you're doing, but you're talking to them. We talked a little bit on our first recording, is it too soon to mention that? We talked a little bit about our first recording on expertise in those areas, systems thinkers have this expertise in how all the pieces of the puzzle fit together, but they're not quite going deep on any piece of the puzzle.
But they do recognize that all the pieces of the puzzle require deep thinking. They're just not the ones to do the deep thinking on an individual piece, they understand it to a certain point. Whereas there's going to be people in a certain niche that don't understand how the whole system works together, which is deep thinking in its own way.
So I think both sides of the coin really need to respect and understand each other, but part of our duty as systems thinkers and systems SREs is to help the person that is managing a single service, not understanding the view of the world of 17 services, really understand how they fit into the bigger picture, why they should care, and help build that bridge between us and them.
Lorin Hochstein and Laura Maguire, Lorin is a former colleague of mine and Laura is a current colleague of mine, and they're both brilliant people, talked a lot about what the future of SRE holds and the history of SRE. It really got me thinking about the need for historians in organizations, around how we think about systems, around how we think about technologies, around how we think about the interaction of the technologies and the people.
But you can really only gain some of that knowledge by studying it and being around for a certain period of time, that you can relay it to others that have joined the organization. You can use that during incidents, and we need to also figure out how to equip some of our colleagues with that knowledge too. We have our own skillsets in being able to do this in other organizations, but you could be the most brilliant SRE around and it's still going to take you some time to ramp up in a new organization and get some of that history in a way to relay it meaningfully and help manage folks in times of crisis and emergency.
Niall: Yeah. A huge amount to say here, could really talk for hours about this. Interestingly enough, Google was one of the few organizations I've worked with that had a historian on staff.
Nora: Really? And that was their job title.
Niall: I think it might've been cultural anthropologist or something like that, but essentially they were an historian.
Nora: Okay. Engineering background, or?
Niall: I don't know, I'm afraid. I think possibly anthropological background. I think the person in question is no longer working, and I'm not even sure the role exists anymore. But I suppose I've been seeing that within certain companies and certain moments in time they have a sense of the historical value of somebody who-
Nora: Yeah. It reaches a point where no one can hold it all in their head anymore, and the business is evolving and you need someone that can help you manage some of that and accept it.
Niall: Yeah. And I think the terrible thing about Silicon Valley is that in combination with this really rather wonderful drive to reinvent and do new things, and push forward the wall of the world by throwing yourself at it enough times that it moves forward. Going hand in hand with that is also an emotional commitment to forgetting, because you have to forget that all these other people failed in order to try yourself. Or the incredibly arrogant, but maybe not.
Nora: But you also have to accept that those things are going to happen, or else you're going to keep repeating the same line of thinking, but just with a different flavor. I didn't get to attend Andrew Clay Shafer's talk but I heard a lot about it, does the language we use matter around what we call ourselves? Did you get to catch it?
Niall: I did indeed, SRE, as she is spoke, which is a great title and I hope that when the Open Access process actually puts the recordings up you go check it out because it is genuinely very good. But his thing is, sure, our language in obvious and subtle ways guides our behavior, constrains our thinking or enables thinking.
But the great point he had, I think, which is the key one I'll retain, is that when you're doing a new thing there's a bunch of people who do the new thing because they're seeking advantage, and then once that group is exhausted or completed or however you would put that, there's a bunch of people who are seeking legitimacy flowing on from that. So I think part of what we're doing right now with the systems thinking, with the cognitive systems engineering, with all of this stuff is seeking legitimacy for these acts which are justifiable to no single individual.
Nora: No, it's hard to measure the ROI of it.
Niall: And I think there's also a huge thing here from organizational dynamics because typically speaking in, we will say, the standard corporate setup however you would define that, but typically speaking in those organizations we overload a high up organizational position with the idea of systems thinking and cross scope. You're allowed to have cross scope legitimacy if you are the VP of whatever.
Nora: Right. But they're not always systems thinkers.
Niall: They're not at all.
Nora: And they're also managing people, so it's like they have their own sharp end but it's not the same as the SRE who probably does not have direct reports. I don't think they should, honestly, and they are understanding the systems, they're understanding how people work together, they're understanding the nuances.
Them not having direct power, should we say, over anyone will help build that trust, too. That's why I think it's so important for that person to not have some of these direct reports. This is a little bit of a tangent but one of my colleagues, Emily Ruppe, is here and she is giving a talk later today on Jurassic Park and incidents. Have you seen Jurassic Park?
Niall: I have.
Nora: Yeah. So she talks about the different Jurassic Parks and how far away they are from each other and how there are some of the same attitudes towards the situation being repeated. She goes on to talk about whether they learned from that incident, right? Or do we have this new set of characters that didn't learn from the characters before? Because that's what's going to happen in our orgs too.
Especially as the tech industry is still really young and there's new startups popping up all the time, and at this point in time you might be working with all your same coworkers. But it might completely flip later on. I know you and I have both inherited legacy systems that we have no idea, we don't even know the people that wrote it anymore.
I've certainly been at companies where I've looked at Git commit history and then looked people up on LinkedIn to try and get in their head a little bit. It's almost like we want to make some of that thinking easier in the industry. How can we understand why it made sense to this person to do the things that they did in that moment? Because that will help us as the present employees do a little bit better too.
Niall: Yeah. I think that's part of it. Just to turn the Jurassic Park thing, I think I would describe Jurassic Park as being well supplied with autopsies and poorly supplied with organizational post mortems.
Nora: You should attend that talk later, it's pretty funny.
Niall: I'll have a look. As I said earlier, the whole question of demonstrating value and how questions of value resonate with different places in the organization, but also with different humans because humans have backgrounds and specific ways of interpreting the world and so on. One of the underpinning forces that drives a lot of this, I think though, is complexity.
I also think about this under the heading or using some of the language from economics, right? Because in the world of complex, online systems you often end up adding complexity in order to chase revenues, so there's some relationship between system age, revenue it captures and complexity.
In general, the evolution is towards more complex, right?
Nora: Yeah. I don't think we should avoid complexity. I think we need to embrace it. Because it's complex, it's not bad. Dr. Richard Cook said a lot of times, "You're having incidents because you're successful. Your systems are complex because you are successful." So I think that's important for us to celebrate, is there are certainly ways that we can... I think it's shifting towards understanding rather than judging and trying to fix our systems, we need to shift towards understanding them.
Not only we, because I think we are really getting there as the SRE industry, but we also need to empower our colleagues, right? We have a huge responsibility to our orgs because we are the ones that are maintaining, understanding all the services and how they fit together. And so our relationships to our colleagues are deeply important. I think our job is one of the most important jobs to have in terms of making sure we're collaborating with our colleagues that are not on our team, and we're not trying to fix our systems but we're trying to understand them and we're trying to enable them to understand them a little bit better too.
Niall: Yeah. I think inherently that's difficult if you're coming into a situation where you're trying to explain somebody else's shit to them. That's generally speaking, poorly received. I agree by the way, that complexity is inevitable and is in many ways a consequence of success. But you can also end up with a situation where the complexity management is poorly executed and so the precise trade off between success and complexity does not in fact enable you to grow at the rate you would otherwise have, and so on and so forth. But in general, people handle this in two ways. The first one is good scope. What can I successfully ignore? Okay, you are turning off to the side.
Nora: And how are they deciding what to let turn off to the side? Helping other people understand that, I feel like, is also hugely important because there's always things worth turning off to the side. But how do you help your colleagues understand why these things are turning to the side?
Niall: Also a huge question. But sorry, just to tie off that point, you can cut scope by ignoring various things that are going on, or you can tell a story about the things that are going on which is effectively summarization. So you say, "Oh yes, we launched V2 of the system and it's just like the first one, except it does this thing differently." That is a mechanism by which we try and address complexity in a way that the human brain aligns. Well, I shouldn't say the human brain. There's a lot of human brains, many of those human brains think in different ways. But one useful way is to go, "It's just like this thing except a bit different over here." Many people absorb that message a bit better.
Nora: It's interesting, what you're saying right now reminds me of an incident I was in a while ago. I was at an organization that had really long tenured employees, but it also had a lot of new employees. It probably had more new employees than it did longer tenured employees. But as a result, there were a lot of systems that there was a new version and then the new version wasn't quite feature parity with the old version, and so the old version wasn't deprecated. We had this one system where it was a homegrown feature flagging system.
There were hacks that people had to turn some things off, they would just put in a ridiculous feature flagging number that probably didn't exist so it was like one million to try to turn a particular feature off. I don't remember all the nuances of it at this point. But what happened with this incident is that one million or whatever the number was, was higher than the max integer that Java could parse, and the entire thing went down. There was a lot of whispers and stuff in the office like, "How did this happen? This feels silly."
Almost like, "How did this happen?" I was actually curious how it happened, and so I went and talked to the person that had typed in that number. It was reasonable to me, we come up with things in the moment to try to get the thing to work as quickly as possible and we grab onto whatever we can, and I thought it was a really creative approach.
But I went to the system where you could do feature flags like that and I couldn't type in a number that high. I was like, "How were they able to do it?" And so I sat at their desk and I was like, "Can you show me how you did that?" And they pulled this UI that I'd never seen before, and then I noticed that this person has been at the organization for 10 years, and I was like, "I've never seen that UI before?" And they go, "Well, what do you use?" Then I pull up a new UI and they were like, "I've never seen that UI before."
It's things like that, that it's so utterly important, socialization and communication and storytelling is so hard, but it's so important and we both learned that this system existed. Then I looked up the system that they were using and it was running on a single node and no one was managing it any more. But when you're in your nuanced world, you don't know things like that, right? And neither should you. What they did completely made sense to them.
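The failure mode in Nora's story, a flag value too large for Java to parse as an integer, can be sketched as follows. This is a minimal illustration under stated assumptions, not the system she describes: the class name `FlagValueDemo` and the helper `parseFlagValue` are hypothetical. Note that one million itself fits comfortably in a Java `int` (the maximum is 2,147,483,647), so the real value was presumably larger, as "whatever the number was" allows. `Integer.parseInt` throws `NumberFormatException` for anything beyond that range, and an unhandled exception of that kind is one plausible way a single typed-in number could take a service down.

```java
import java.util.OptionalInt;

// Hypothetical sketch: none of these names come from the system in the story.
class FlagValueDemo {

    // Parse a raw flag value defensively: return empty instead of letting
    // NumberFormatException propagate when the text is not a valid int.
    static OptionalInt parseFlagValue(String raw) {
        try {
            return OptionalInt.of(Integer.parseInt(raw.trim()));
        } catch (NumberFormatException e) {
            // Thrown for non-numeric text and for values outside the int range.
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        String[] inputs = {"1000000", "2147483647", "2147483648", "99999999999"};
        for (String raw : inputs) {
            OptionalInt parsed = parseFlagValue(raw);
            String outcome = parsed.isPresent()
                    ? "parsed as " + parsed.getAsInt()
                    : "rejected (not a valid Java int)";
            System.out.println(raw + " -> " + outcome);
        }
    }
}
```

The design choice here is only that the boundary check lives next to the parse, so every entry point, including a forgotten older UI, hits the same validation rather than relying on one front end to limit what can be typed.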
Niall: Yeah. I'm completely comfortable with observing that the old UI probably did things better than the new UI.
Nora: Exactly, exactly. And I am sure there were people in the organization that knew about the new UI and also chose not to use it for that reason. I just think storytelling is more important than ever and not trying to fix complexity, but also just really making sure we're socializing it.
Niall: Yeah. I think also developing that a little bit more, but also you briefly said it in your last couple paragraphs, that collaboration is hard. I think the strange thing is, it might not actually be that hard inherently, but we put a lot of barriers in its way.
Nora: We individuate a lot in the tech industry. It's not totally focused on collaboration sometimes.
Niall: That's part of it, and I think there is a cultural component to that. I think there's a Valley cultural component to that. I think there's a mainstream business cultural component to that. There's a bunch of different ways where we pivot from going, "Okay, this is a whole group thing and I'm trying to make things better for everyone." Versus, "Okay, this is now an individual thing and I'm making things better for me, or maybe for a very narrow scope, larger than one individual." But this question of better for the team versus better for the company or better for the individual versus-
Nora: Well, it's also like when you just, in general human relationships, you want to try to understand that person's world a little bit better and then you can connect with them more. It's like when the person that is managing a single service, but doing it very deeply and very nuanced, versus the SRE that is managing the whole system. They both need to work to understand each other's worlds a little bit more. I think each of them has a responsibility to make it easier for the other person to understand their world, which I feel like is what we're not quite hitting as an industry with collaboration between SREs and service owners today.
Niall: Well, I think also part of the social hierarchy dynamics end up coming into this, unfortunately. Somebody said to me the other day that part of the genius of SRE is, "What if operations, but high social status?" Which I think is an interesting way to put it, and I've seen that put in a variety of different ways. Obviously being software oriented rather than operationally oriented has for some reason acquired this cachet, et cetera.
Perhaps that's also related to revenue generated, revenue generating versus perceived as a cost sector and so on. All of those things are longstanding cultural and business related behaviors that have gone on for a long while. But broadly speaking, what I see is that the act of collaboration happens most easily with people who are kind of born believing collaboration is inherently a good activity.
Nora: So you don't think it can be taught?
Niall: Oh, I think it can, but it's tougher. Right? It's particularly tougher in certain cultural contexts.
Nora: I think it's honestly table stakes for an SRE role. I think it should be part of the interview process, like, "Hey, you need to bring in someone to an incident that you've never met before and you know their service is impacting the incident, but they don't know. Go figure out how they would approach that situation." I just think there's little nuances like that that we have to deal with all the time, that don't feel that "technical" but they're deeply technical and require a different level of thinking. I do have a question for you, we're only halfway through the conference and we're on day two, but I'm curious what talk that you've seen so far has left you thinking the most? What talk keeps popping in your brain now that you've-
Niall: As cochair, I could obviously not publicly say.
Nora: I'm not asking you to pick a favorite, I was very, very careful.
Niall: All of my children are wonderful, and they are. But the thing that resonated with me the most: one of the subtracks that we were trying to pick is what we informally called overlap, which is overlap between other professions and SREs and so on. Andrew Clay Shafer, co-founder of the DevOps movement, however you would define that, I really thought his language around legitimacy versus advantage, and the question of where DevOps folks see themselves in the world and what they regard their role as, and SRE folks and so on.
I thought that filled in a lot of background and I thought that was a great talk. I will also say that another thing we're trying to do this year was start this outage review talk, and Meta did a great talk on the incident with turning off the backbone routers that we did a podcast on previously. They provide some additional context there, so I thought that was great.
Nora: Yeah. I felt for them. They actually had an incident with WhatsApp during this conference that I'm really curious to know about. We couldn't quite pull something like that together for this episode super quickly, but it was a really interesting incident. Some of their SREs are at SREcon.
Niall: Yeah, some of the people who know the answer are probably 200 meters away. Anyway.
Nora: Well, cool. We're just about at time, and we have lots more exciting talks to go through. But is there any closing words that you'd like to say?
Niall: I suppose for those listening to this after the conference, which is going to be everybody, please do check out the videos on the USENIX page on YouTube, which is where they typically go. But also in your day to day work, keep in mind the question of how you understand and how you represent your value. There is almost never a downside to communicating your value. It's just a question of figuring out how to do that authentically, because I know loads of people run from the very idea of saying, "Hey, I'm great." And it turns out that that's occasionally useful.
Nora: It's super useful, and on the other side of that, seeing how people understand what you do, the value that you provide to the org and being curious about it, because I think that helps you improve that communication as well.
Niall: Good stuff.
Nora: Well, cool. It was great chatting with you in person, and we'll see you next time, folks.