Incident Response and DevOps in the Age of Generative AI
How Does Generative AI Work With Incident Response?
Software continues to eat the world as more dev teams depend on third-party microservices for their daily infrastructure. As a result, outages are more common and costly than ever, costing upwards of $100K per incident, and successful incident response workflows matter more than ever as well. What about the wave of artificial intelligence products from Microsoft, GitHub, and OpenAI? Reports suggest generative AI tools boost developer productivity, reducing bottlenecks by streamlining coding with code snippets, search, and summaries. Could generative AI be a breakthrough for IT operations that helps stem this rising tide of incidents?
62% of the general populace is “concerned” about modern AI, and 86% “believe AI could accidentally cause a catastrophic event.” So where, if at all, does GenAI fit into incident response and day-to-day site reliability engineering? For this article, we consulted a panel of site reliability and incident management veterans with more than 40 years of collective experience. As one of our experts put it, GenAI is good at “confidently delivering text that is pleasant to read, but not always complete, or correct.” As another suggested, GenAI is “not good at making decisions for you...or [emulating other people’s] expertise.”
Below, our panel explores known GenAI vulnerabilities in security and privacy, its well-documented hallucinations, and the need for ad hoc collaboration and consequential decisions in incident management (IM). Is there an eventual future for AI-powered incident commanders, or will teams always need that proverbial human in the loop? Our panel discusses:
- The Strengths and Weaknesses of GenAI for IR and SRE: Which capabilities of GenAI are a strong fit for day-to-day incident management.
- How GenAI Will Affect the DevOps and SRE Professions: How GenAI will impact professionals who work on both product and the operational side of product.
- How GenAI Will Ultimately Affect Dev: Our panel also weighed in with their thoughts on how GenAI will impact the general business of software development.
Disclaimer: While the panelists interviewed here hail from companies such as Jeli, Amazon, and PagerDuty, the views expressed below are those of the individual panelists and do not reflect the views of their employers.
How to Utilize GenAI Within Incident Management Platforms
Nora Jones is an incident response veteran who led IR teams at Slack and Netflix before founding the developer-first IR startup Jeli. She’s also a co-founder of the IR community LFI. Her company has built GenAI directly into its platform, using natural language to quickly spin up shareable incident reports that get team members up to speed. The platform also drafts overarching narratives from touch points across an incident’s life, including detection, diagnosis, and repair moments drawn directly from chat logs, which the platform already annotates. Jones believes that GenAI’s ability to accelerate incident logging makes it worth considering as another member of the team–but not as the overarching decision maker:
- GenAI Has [At Least] Two Key Strengths for IR:
  - Spinning Up Summaries to Catch Up Teams: As incidents are happening, GenAI’s ability to quickly spin up content can be useful to provide instantaneous summaries of incidents to relevant team members to keep everyone in the loop as things happen (rather than pulling people sideways by requiring them to drop everything and hunt down the details).
  - Incident Analysis: Post-incident, GenAI can help teams uncover and compile context around incidents to create richer, more-valuable post-mortems by collecting insights and notes across the incident lifecycle.
- Why GenAI May Not Be Taking the Incident Commander Chair Anytime Soon: Incident response continues to be a field full of unknowns and exceptions–not exactly a good fit for tools that are built largely to pattern-match based on past data. Fully AI-run incident remediation is unlikely to be “a thing” anytime soon.
Discussion: Where GenAI Makes Sense for IR with Nora Jones
In addition to offering the above observations, Jones opines that modern SREs can get more out of GenAI by recognizing its strengths and weaknesses. Specifically, GenAI was not built on all-knowing, benevolent algorithms designed to make important decisions, such as choosing issue resolution steps. GenAI algorithms, at least for now, are optimized to generate and summarize content.
“I don't think you trust GenAI to be an ‘expert’ in anything. It's not good at making decisions for you. It's not good at [emulating other people’s] expertise. It is good at summarizing pieces of information. But just because it's easy to use and easy to ‘sprinkle’ AI on anything you're doing doesn't mean you should. I would really encourage folks that are starting to play around with it to understand actually how it works.”
“I think what we really want to do is use AI to get people more curious about what's happening in their incidents.” -Nora Jones, Founder / Jeli
“Ultimately, I think what we really want to do is use AI to get people more curious about what's happening in their incidents. I’ve always believed that if you learn how an incident actually happens, you'll be better off in the future. You can be more proactive about your incidents, resolving some of them more quickly, getting the right people in the room more quickly,” Jones explains. “Where AI seems really interesting for incident management is when we can use it to bubble up some of those interesting learnings, which then gets people investigating the incident...and gets people a little bit more curious about how it unfolded in the first place.”
Jones suggests that artificial intelligence provides opportunities to help both professional SREs and developers of all stripes. “I think GenAI will bring big changes in the field of incident management in terms of how incidents get communicated to stakeholders that are impacted by those incidents. But for developers in general, I think there’s an opportunity for them to use AI to accelerate their processes and help them get curious about other areas.” In the future, Jones suggests the possibility of AIs trained on large amounts of previous incident data being helpful in taking a more-proactive approach. “I don't think generative AI is going to fix the incidents for you, but I think eventually, it might help point you to previous incidents that look like the one that you're solving right now. But as far as I know, it can't get people to talk to each other. And I don’t see it being a magic box for auto-remediation anytime soon.”
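As a rough illustration of what “pointing you to previous incidents that look like the one you're solving” could involve, here is a minimal Python sketch. It is not how Jeli or any panelist’s tooling works, and it uses plain word overlap (Jaccard similarity) rather than a trained model. The incident IDs, summaries, and function names are all hypothetical.

```python
from __future__ import annotations

import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similar_incidents(
    current: str, past: dict[str, str], top_n: int = 3
) -> list[tuple[str, float]]:
    """Rank past incident summaries by word overlap with the current one."""
    current_tokens = tokenize(current)
    scored = [
        (incident_id, jaccard(current_tokens, tokenize(summary)))
        for incident_id, summary in past.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical incident archive, invented for illustration only.
PAST_INCIDENTS = {
    "INC-101": "checkout api latency spike after database failover",
    "INC-102": "login page errors caused by expired tls certificate",
    "INC-103": "search index rebuild exhausted disk space",
}

if __name__ == "__main__":
    query = "database failover caused checkout latency"
    for incident_id, score in similar_incidents(query, PAST_INCIDENTS):
        print(incident_id, round(score, 2))
```

A lookup like this only surfaces candidates. Deciding whether an old incident is actually relevant, and getting the right people talking, stays with the responders.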
Where GenAI Impacts DevOps and the Future of Software Dev
Jeremy Edberg is a longtime DevOps expert who currently helps lead Amazon’s Alexa Operational Excellence Team and has done tours of duty at leading tech companies including eBay, Reddit, and Netflix–where he was a founding member of Netflix’s SRE team.
- Large Language Models May Be the Next Major Evolutionary Step in Human-Computer Interfaces: Whether GenAI achieves the Nirvana-like goal of artificial general intelligence (AGI), prompt-based LLM chatbots may well represent the next step in the way humans interact with technology, as they effectively help computers take a big step toward being able to understand human language.
- We’re Not Yet at a Point Where LLMs Can Credibly Recommend Remediation Steps: Right now, human users will trust their monitoring and alerting systems after those systems have proven themselves reliable, even going as far as allowing them to take automatic actions. But we are not there yet with LLMs. Right now, the best we can hope for is LLMs trained on previous incidents and patterns producing one (or a few) possible remediation steps and having a human select the best course of action. Over time, if the LLMs prove to produce the correct course of action in almost every case, they will be trusted to work autonomously.
- Potential Career Evolution for DevOps: Language Model Operations in AIOps?: DevOps, being generally tasked with the maintenance and caretaking of infrastructure, may also inherit the care and feeding of language models. As LLMs come to represent more-significant components in infrastructure, organizations will need people who understand distributed computing, machine learning inference, managing GPUs and CPUs next to each other, storage, and other maintenance considerations. Could there be a point where entire careers are focused on monitoring AI models, updating them, and making sure the models are getting the right inputs and appropriately learning new things?
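The “LLM proposes, human selects” arrangement described in the second point above might be wired up roughly like this. It is a minimal sketch under stated assumptions: `Proposal`, `choose_and_run`, and the callbacks are hypothetical names, and the proposals are hard-coded stand-ins for whatever a model would return.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Proposal:
    """One candidate remediation step, as a model might suggest it."""
    action: str
    rationale: str


def choose_and_run(
    proposals: list[Proposal],
    choose: Callable[[list[Proposal]], Optional[int]],
    execute: Callable[[str], None],
) -> Optional[str]:
    """Execute only the proposal a human picked; do nothing otherwise.

    `choose` stands in for the human decision (a CLI prompt, a chat
    button, and so on). It returns the index of the approved proposal,
    or None to reject every option.
    """
    if not proposals:
        return None
    index = choose(proposals)
    if index is None or not 0 <= index < len(proposals):
        return None  # No approval or invalid choice: take no action.
    chosen = proposals[index]
    execute(chosen.action)
    return chosen.action


if __name__ == "__main__":
    # Hypothetical proposals; a real system would get these from a model.
    options = [
        Proposal("restart-api-pods", "memory usage climbed before the errors"),
        Proposal("rollback-last-deploy", "errors began right after the deploy"),
    ]
    ran = choose_and_run(options, choose=lambda p: 1, execute=print)
    print("executed:", ran)
```

The point of the design is that nothing runs without an explicit selection. Removing the `choose` step is the kind of autonomy Edberg suggests would have to be earned over time.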
Discussion: What the Future Looks Like for Devs Using GenAI with Jeremy Edberg
“Right now, GenAI is something of an advisory tool. We're not to the point where we trust it enough to take the actions based on what it says,” Edberg explains. “In some ways, you could compare some of GenAI’s use cases to those of what monitoring used to be–or how things are when you're first starting out because you don't know that your monitoring and alerting are correct.”
“As an advisory tool, GenAI can tell you, ‘Hey, something is probably wrong here, and you should look into it,’ but there still needs to be a human in the loop there. Eventually, we'll get to the point where we can take the human out of the loop for the easy stuff...maybe. The thing is, better monitoring and alerting have already made changes to the way we operate. And LLMs will definitely make changes to the way we operate, but there'll be new challenges instead. Overall, I don’t think GenAI will eliminate DevOps jobs. It will, hopefully, make DevOps practices–and practitioners–much more efficient. So maybe in that regard, it would actually generate some net-new jobs.”
“In the future, if you are good at logic and want to learn how to reason about computer systems, [software engineering will still be] a great place to be.” -Jeremy Edberg, Principal Engineer / Amazon
What effect will GenAI have on day-to-day dev workflows, or on software engineers as a profession? “If I were addressing a class of junior developers, I’d tell them, ‘GenAI is going to be a tool that will drastically speed up your development process, but it will not replace you.’ Not yet, anyway,” says Edberg. “Could it lower the barrier to entry for getting a job as an engineer? I could definitely see a situation where people–who hadn't considered this type of career before, maybe because they weren't interested in learning the details of coding syntax, for example, but are still good at general reasoning–might choose engineering now instead of business, law, or some other path.”
“Somebody who has these reasoning, logic, and analytical skills might be more interested now because the ‘hard parts’ are taken care of, the syntax, the math, that kind of stuff. In the future, I think if you are good at reasoning, good at logic, and want to learn how to reason about computer systems, it's still a great place to be. If jobs do end up going away, they will be the ‘I've learned enough to know how to write code, and I'm going to spend most of my days writing basic, boilerplate stuff’ jobs, because the LLMs will take care of that.”
Deferring Low-Level Tasks to AI so Humans Can Focus on Strategy
Mandi Walls is a longtime developer advocate who has long championed AI and automated solutions that make SRE and DevOps teams more productive. She’s currently building communities of highly engaged developers at PagerDuty and has also served tours of duty at Chef and AOL.
- We’re a Ways Off From Fully AI-Powered L1 Responders: The sheer volume of training data from real, recorded incidents that a single organization would need to stand up completely AI-powered Level One incident responders just doesn’t exist yet. The closest path to something like this might come from large orgs running similar services on a very similar platform with similar runtimes, which would presumably generate incidents of a similar character with similar symptoms.
- The Most Immediate AI Opportunity in SRE May Be for Low-Level Remediation Tasks: Generative AI might be able to make the most immediate impact if it were trained to manage the low-level hiccups and false alarms that do not require extensive triage, and which experienced SREs resolve in minutes, anyway. There’s also strategic value in developing AI tools that can manage most low-level remediation tasks–because such tools would free up veteran SRE teams to focus more of their time and undivided attention on higher-priority incidents and post-mortems.
Discussion: The Division of Labor Between Human SRE and AI with Mandi Walls
Walls suggests that the immediate value of generative AI in SRE might come from spinning up documentation and after-action reports, as well as from a variety of other areas. “Our incident response process includes Zoom calls, recordings, transcripts, and Slack channels, along with charts and graphs and many other kinds of data and artifacts...it’s a slog. So there’s value in letting AI generate all the components and artifacts we need.”
Regarding how GenAI and its associated tools, such as code generators, could affect the profession of development as a whole, Walls sees opportunities in many areas for GenAI to be valuable. “Stuff like coding assistants are super interesting. Some of it is really clever and is already doing a really good job for folks doing some of that work. But as someone who uses a lot of products, I'm hoping for improved documentation–and API documentation in particular–that developers don't have to write themselves. It’d be good to see tools improve enough to automatically generate all that stuff and make it more useful.”
Walls suggests that testing may be another area of opportunity for GenAI to improve development pipelines. “Another use case would be generating tests. I think there's a lot of knowledge already in that space, especially over the last 10 years as that whole practice has become more automated, and maybe it will become even more so. So maybe the work of test engineers will move more towards creating better harnesses and doing performance monitoring on the testing process rather than anything like writing a tool. Also, developers have artifact repositories. There's all this stuff that has to run together really closely. And keeping that all in line plus maintaining changes that come in from the vendors would definitely be helped by additional tooling that's a little bit smarter than what we have right now.”
“For positions like SRE that are usually more directly integrated with an engineering practice, they'll see more benefits from coding tools, which could start to learn as much about infrastructure tooling as they do about regular languages and runtime-application code.” -Mandi Walls, Developer Advocate / PagerDuty
“I’m thinking about something that’s even a level up from Dependabot–which right now, will send you an email that says, ‘Hey, here's this thing that needs to be updated.’ It would be really useful to see this kind of use case broadening out to alert you that your vendor is doing an upgrade. An alert that could tell you, ‘Here's what we recommend for your specific use case.’ For example, if your cloud provider is turning off instances of your level, here's where you need to migrate...and then starting to do that work for you without having to really intervene.”
On how GenAI may affect the business of DevOps workflows, Walls is less eager to make predictions due to variance across orgs. “DevOps jobs are different in every organization. So it's possible that GenAI tools, such as code generators, could make a difference because, in some places, a lot of those folks are writing more code. However, in other places, they're just working more in advisory positions. And then, some orgs take more of a build-and-release approach. So it's hard to say. At the macro level, if there's going to be a deep change in what it means to work in DevOps due to generative AI, well...I'm not sure there's enough of a consensus of what a DevOps engineer should be doing to be able to say that.”
On how GenAI will affect the business of SRE, Walls is significantly more bullish. “I think for those positions like SRE that are usually more directly integrated with an engineering practice, I think they'll see more benefits from coding tools and things like that...which potentially could start to learn as much about infrastructure tooling as they do about regular languages and runtime–application code versus infrastructure code. I’d like to see GenAI help SRE teams push forward along their golden path because so much of their infrastructure is hopefully managed as code.” And in the same way that developer technology such as containers expanded into open source with Kubernetes, there may be opportunities to see open source contribute to generative coding assistants for SREs. “I think these teams will also benefit from those same code generation tools–but they may be in Terraform or Pulumi, rather than Python/Elixir/Go/Rust.”
Why Humans in the Loop May Always Be Needed in SRE
Brent Chapman is a pioneer in what is now known as modern SRE. Throughout his career in technology, he has also worked as a volunteer in public safety and emergency services, starting as a search-and-rescue pilot and incident commander for air search and rescue. He applied the principles he learned in emergency services to his tenure at Google, where he developed the company’s internal Incident Management at Google (IMAG) practice, and later brought similar foundational practices to Slack. He currently runs the incident management consultancy Great Circle Associates.
- Large Language Models Have the “Natural Language” Part Down, But May Still Lack in Other Areas: While things may certainly change in the future, LLM chatbots seem best at confidently delivering text that is pleasant to read, but not always complete, or correct, and certainly not above human verification. Today’s LLMs are sometimes wrong, but never uncertain.
- GenAI’s Greatest Value to SRE Might Be for After-Action Reports: Post-incident phases call for a great many write-ups to document the conditions leading up to outages, the circumstances and effects of the outages, and the actions taken to resolve the outages. GenAI can certainly produce something readable and user-friendly for general audiences, though expert engineers may prefer to keep all the gory details.
- Maybe There’s a Future for AI-Powered Pattern-Matching and Timeframe Planning in IM: Chapman recalls his years working with highly experienced engineers who were so well-versed in their systems that they seemed to have a sixth sense when it came to browsing a series of graphs and detecting seemingly imperceptible irregularities when investigating root causes. Could AI tools eventually become “smart” enough to detect such inconsistencies? Maybe. They might be even more useful in helping size incidents and projected response time windows that correspond with severity.
Discussion: The Brushes May Change, but Engineers Will Still Create Art with Brent Chapman
As Chapman reflects on the fundamental practice of incident management, he finds few intersections with GenAI’s biggest strengths and sees many of the processes as still being fundamentally human. “In incident management, the challenge is always that we know something has gone wrong, but we don't always know what has gone wrong, or who needs to do what to fix it. The process is about figuring out the details, executing that response, getting things back to a stable situation, and then getting fully recovered to a normal state of operation. All very challenging activities that involve working under time pressure that people normally don't have, working across teams that don't routinely work together and don't know each other's capabilities and concerns and considerations and so forth.”
“In our day-to-day work, we establish project teams. We spend a lot of time ‘storming and norming’ to build that whole framework of ‘how do we learn about each other and work together effectively,’ and have debates and arguments and joint planning activities that let technology companies do the amazing things they do,” Chapman offers. But in the same way that Agile methodology teams emphasize flexibility and a pragmatic approach to delivery, incident teams also need to be realistic.
“Those things all take time and energy to establish. And you don't have that time and energy available during an emergency.” Downtime doesn’t just provide opportunities for collaboration–it demands collaboration. “You need to find a way to work together quickly and effectively enough for the emergency, even if it's not necessarily a great way to work together in the long run. It's very top-down, it's very authoritative, it's very hierarchical, it's very old-fashioned. But it works better in an emergency. Also, not everybody who's going to help you will be available at the same time, and certainly not at the start of the incident. You have to start responding with who's available at the time and incorporate more people over time. You need to have ways of effectively putting people to work and then putting more people in the process without disrupting the work that's already in progress, so people can come up to speed without disrupting those activities and then plug themselves in, offload, or take on some new tasks related to the emergency. And somebody has to manage and coordinate all of this and manage the communications.”
“[Working with GenAI is] a lot like dealing with a very junior programmer. You still have to check it. You still have to write the code. You still have to write the test.” -Brent Chapman, Principal / Great Circle Associates
Chapman suggests that part of the excitement, and confusion, around AI and its benefits may come from an excessive widening or narrowing of definitions. The SRE veteran discusses automated systems that he worked on at Google for real-time ‘traffic’ (server cluster load balancing) management. “I worked on a system that tracked incident response patterns whenever there were problems with a given cluster, such that the first thing we’d do is drain incoming traffic away from that cluster, and send the traffic somewhere else that's still healthy. Realizing that this was the first thing we almost always did, we decided to automate that. But we needed some guardrails. For instance, we didn't want to drain traffic away from the last cluster in any continent, or from a cluster when there were only two clusters left. So I built this system in Python which, when it was alerted to unhealthy clusters, it would ‘think’ about draining it, run through the list of checks for that service, and decide whether or not to drain the service in that cluster. All taking place while people were still responding to their pagers. But depending on your definitions, this might be more of a style of ‘mechanical turk’-style automation rather than ‘AI,’ I suppose.”
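The guardrail logic Chapman describes can be sketched in a few lines of Python. This is not his Google system, whose details are not given here. `Cluster`, `should_drain`, and the sample fleet are hypothetical, and the two checks are simply the two examples he mentions: never drain the last cluster in a continent, and never drain when only two clusters are left.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    name: str
    continent: str
    serving: bool = True  # True while the cluster still receives traffic


def should_drain(target: Cluster, clusters: list[Cluster]) -> tuple[bool, str]:
    """Run the guardrail checks and return (decision, reason)."""
    serving = [c for c in clusters if c.serving]
    if target not in serving:
        return False, "target is not serving traffic"
    # Guardrail: do not drain when only two (or fewer) clusters are left.
    if len(serving) <= 2:
        return False, "only two or fewer serving clusters remain"
    # Guardrail: do not drain the last serving cluster in a continent.
    in_continent = [c for c in serving if c.continent == target.continent]
    if len(in_continent) <= 1:
        return False, "target is the last serving cluster in its continent"
    return True, "all guardrail checks passed"


# Hypothetical fleet, invented for illustration only.
FLEET = [
    Cluster("eu-a", "europe"),
    Cluster("eu-b", "europe"),
    Cluster("us-a", "north-america"),
]

if __name__ == "__main__":
    for cluster in FLEET:
        decision, reason = should_drain(cluster, FLEET)
        print(cluster.name, decision, reason)
```

Every rule here is explicit and deterministic, which fits Chapman’s own caveat that a system like this is rule-based automation rather than anything that would usually be called AI.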
What will the future hold for software engineers? For operations, Chapman sees potential in highly-trained AI systems with the ability to take certain actions, such as taking steps to provision new systems, or adjusting configurations, autonomously. “For instance, if you tie a generative AI system to your AWS console–your control system for your cloud computing system–and you can start having a conversation with it about, let’s say, bringing online another 20% of capacity in London. And the system replies that there's not enough spare capacity in London, but it can give you 15% in London and 5% across Europe, for example. You can imagine starting to have these sorts of operative discussions with it that are going to result in things happening...under approval with your supervision, and so forth. Right now, a lot of our monitoring control systems present human operators with problems, and it's up to the operator to solve them. I think the next step is going to be to have generative AI propose solutions. Now, is it going to propose better solutions than your average new hire six months out of college? Maybe. That's going to be an interesting question.”
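To make the London example concrete, here is a minimal sketch of the arithmetic behind such a counter-proposal. It assumes a hypothetical table of spare capacity per region, expressed in percentage points, and calls no real cloud API. `plan_capacity` and the region names are invented for illustration, and the result is only a plan for a human to approve.

```python
from __future__ import annotations


def plan_capacity(
    requested: int, preferred: str, spare: dict[str, int]
) -> tuple[dict[str, int], int]:
    """Split a capacity request across regions, preferred region first.

    Returns (plan, shortfall), where plan maps region to the amount
    taken and shortfall is whatever could not be placed anywhere.
    """
    plan: dict[str, int] = {}
    remaining = requested
    # Try the preferred region first, then the others in listed order.
    order = [preferred] + [region for region in spare if region != preferred]
    for region in order:
        if remaining <= 0:
            break
        take = min(remaining, spare.get(region, 0))
        if take > 0:
            plan[region] = take
            remaining -= take
    return plan, remaining


if __name__ == "__main__":
    # Hypothetical spare capacity, in percentage points of current load.
    spare = {"london": 15, "frankfurt": 10}
    plan, shortfall = plan_capacity(20, "london", spare)
    print(plan, shortfall)
```

With London limited to 15 points of spare capacity, the sketch places the remaining 5 elsewhere in Europe, mirroring the reply Chapman imagines. Whether to act on that plan is still the operator’s call.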
On the topic of AI’s impact on developers as a whole, Chapman still feels strongly that systems will ultimately still need humans in the loop. “I think there's going to continue to be a need for more developers. However, there's going to be a new skill set that many of them have: ‘prompting,’ basically developing prompts, asking the right questions, feeding the right data to get a useful result out of the generative AI systems they're working with. It seems comparable to art. ‘Creating art’ is going to change from knowing how to mix paints and pick a brush and applying certain physical techniques to knowing how to describe what you're looking for to the AI...so that it can generate something that looks like what you want. I'm already hearing a bunch of my programmer colleagues talking about how they’ve had good luck using ChatGPT for tasks such as providing the framework of a Python application that does such and such. It’ll write the first hundred lines of code and create the first 10 files, and basically sets up your project for you. And then you can go from there. But it's a lot like dealing with a very junior programmer who doesn't always understand your intent and sometimes just goes off into the weeds. You still have to check it. You still have to write the code. You still have to write the test. I’ve also heard some people report success generating unit tests based on inputted code (which seems kind of backward, since you’re supposed to write your tests first and write the code to make the test pass), but is that better than no unit tests at all? Yeah, probably.”
“One of my favorite science fiction authors is Vernor Vinge, who was a professor of computer science at San Diego State University. He wrote a story with a character who, for various reasons having to do with relativistic time dilation and traveling near the speed of light, ends up returning home a thousand years later, despite only aging about 10 years. He believes his skills are going to be completely out of date in this new world. And it turns out, the guy's a programmer and there is a place for his skills as basically someone who understands the system 14 layers underneath what the AIs are doing in that present day. He has an ability, by understanding those underpinnings, to bypass a lot of layers and go straight to the ‘low level,’ a bit like a programmer today who still understands assembly language. Obviously, this is kind of a simplified example, but I think there are still going to be plenty of roles for humans in technology. Someone still has to verify–to ask the questions: ‘Is this right? Is this useful? Is this complete?’ I don't see that role moving away from human judgment.”
Conclusion
GenAI is already seeing direct applications in day-to-day SRE work. For more information and discussion on how AI may affect the future of incident management, DevOps, and the business of software development as a whole, join the DevGuild: AI Summit event.