Generationship
36 MIN

Ep. #2, Putting LLMs to Work with Liz Fong-Jones and Phillip Carter of Honeycomb

about the episode

In episode 2 of Generationship, Rachel Chalmers speaks with Liz Fong-Jones and Phillip Carter of Honeycomb. Together they explore use cases for LLMs, insights on navigating AI hallucinations, the tradeoffs between prompt engineering and fine-tuning, and the privacy and security implications inherent to commercial LLMs.

Liz Fong-Jones is a developer advocate, labor organizer, and ethicist, as well as a site reliability engineer with over 18 years of experience. She is currently Honeycomb's Field CTO.

Phillip Carter is Principal Product Manager at Honeycomb, where he's leading AI efforts and focusing on developer experience. He was previously Senior Program Manager at Microsoft.

transcript

Rachel Chalmers: Today it's my absolute pleasure to welcome Liz Fong-Jones and Phillip Carter, both of the category-defining observability company Honeycomb. Liz is a legendary developer advocate, labor organizer and ethicist, as well as a site reliability engineer with over 18 years of experience. At Google she worked on everything from Cloud Load Balancer to Flights, and she's currently Honeycomb's Field CTO.

Phillip Carter is principal product manager at Honeycomb, where he's leading AI efforts. He came there from Microsoft where he was the owner of the F# programming language. Most recently, Phillip went viral with a fantastic post on the Honeycomb blog titled All The Hard Stuff Nobody Talks About When Building Products With LLMs. That's what we're going to talk about today. Phillip, for our listeners who may not have read the post, can you quickly summarize what Query Assistant is and what it does?

Phillip Carter: Absolutely. So, the general idea is that Honeycomb as an observability product is a tool that you use to query your data. You want to be able to figure out what's going wrong with your application, and you want to slice and dice all kinds of interesting dimensions in your data related to latency or errors, or user behavior. Pretty much anything at all that could fall under the umbrella of software reliability.

Now, the problem with that is you need to learn how to query your data and we find that especially new users who come to Honeycomb struggle with that quite a bit. There's a user interface that you need to learn and there's a methodology in querying your data that people often struggle with. However, when we talk to a lot of these users they already knew what they wanted to start with in their head and they could certainly express it using natural language, but they didn't necessarily know how to use our UI to accomplish what they wanted.

And so, as luck would have it, the Large Language Models that exist right now, and in particular earlier this year at about springtime, are extremely good at taking natural language inputs and interpreting them, and being able to output something useful.

And so in this case, we built a system that uses OpenAI's Large Language Model to take natural language inputs, and then it produces a JSON object, what we call a Honeycomb Query Specification. So the idea is that that object can then be read by Honeycomb itself and turned into a query that executes on behalf of the user.

The idea is that if I'm interested in slow requests through my system, maybe grouped by endpoint, I can say, "Show me my slow requests by endpoint," and then it's able to translate that based off of the schema of the data that you actually sent us. So you don't have to know how you actually query that source, and so then what it typically accomplishes is you can continue to use it to refine your query or something like that.

Or you can just keep using our UI because little pieces of the UI are filled out for you and you can see, "Oh yeah, I can click right here and maybe group by the other column, or maybe do this other thing," and level up how to use Honeycomb initially to the point where you're now basically proficient at using Honeycomb.

That's what Query Assistant is. It does a lot of things under the covers to accomplish that, but at the end of the day it's really a tool to help people onboard, either onboard into Honeycomb themselves for the first time or if they know how to use Honeycomb but maybe they're querying a new data set that they have no familiarity with. Being able to ask really broad questions and see what's available and get a decent query at the end of what they're looking for.
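
To make that concrete, the "show me my slow requests by endpoint" request could translate into a small JSON object along these lines. This is a sketch only: the field names and the columns (duration_ms, http.route) are illustrative assumptions, not necessarily Honeycomb's exact Query Specification format.

```python
# Hypothetical translation of "Show me my slow requests by endpoint".
# Field names and columns are illustrative, not Honeycomb's exact spec.
query_spec = {
    "calculations": [{"op": "P95", "column": "duration_ms"}],
    "filters": [{"column": "duration_ms", "op": ">", "value": 1000}],
    "breakdowns": ["http.route"],   # "by endpoint"
    "orders": [{"op": "P95", "column": "duration_ms", "order": "descending"}],
    "time_range": 7200,             # last two hours, in seconds
}
```

Because the model only has to emit a small, well-defined structure like this rather than free-form text, the rest of the product can validate it and hand it to the existing query engine.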

Rachel: I really think this is my favorite use case for Large Language Models so far, because observability is so powerful, and yet, as you say, there's that challenge with getting the first query out there. This really fills in that gap and helps people get productive with Honeycomb so much faster. It's pretty exciting. Have you seen an uptake in usage since you've published the Query Assistant?

Phillip: Absolutely yes. We're actually going to be publishing a blog post that we're titling So We Shipped An AI Product, Did It Work? And we're going to lay out a pretty honest set of metrics that we were tracking the entire time. For the most part it's positive. There's some metrics that it didn't actually move, which is why I use the word honest. But I think the thing that's most promising is we track what we call manual querying retention for active teams, or new teams that were created maybe this year that are considered active.

That means that they've been in the product, clicking around, doing stuff at least once over a given month. So what we care about is how often are you querying your data, because if you're querying more, then you're likely sending us more interesting information, you have some live system.

You're using us meaningfully, basically, if you're querying. So we found that for new active teams who were created this year who use Query Assistant, by week six, about 26% of those teams were still running manual queries. However, if we looked at the group that did not use Query Assistant, that are still these new teams, that number is only 4.5%, and so there's a really, really strong correlation there that if you use Query Assistant, you know how to then query your data and you're much more likely to keep querying it over time.

So that's more than a 5X improvement, 26% versus 4.5% works out to roughly 5.8X, on that particular measure that we have.

Liz Fong-Jones: What I find really interesting about this though is that it requires conscious design to get to that point, because you can imagine a lot of gimmicky AI stuff where people play with it a couple of times and then it doesn't actually contribute to learning, and people either become dependent on the AI and its limitations or they just give up entirely, right? What's interesting here is that hybrid approach of using an AI to reinforce that kind of initial human learning, but not having it be the be-all, end-all restriction on what people can accomplish.

Rachel: Yeah. There are two really delightful things about it. One is that it's providing that missing translation step between human language and machine language. The other is that it's delivering, it sounds like, on that Holy Grail of retention in a big, complex software product. It's pretty exciting. Phillip, in the blog post you talked about having to make a trade off between correctness and usefulness. Has that balance shifted over time?

Phillip: The balance has definitely not shifted. In this post I'm also going to have a little section on some of the weird and wild and wonderful things people have been using Query Assistant for. Basically, the kind of inputs that they do. But so what I wrote about earlier this year was the idea that people are coming into Honeycomb for the first time, they may have a vague sense of what they're interested in but they may not necessarily know how to describe it specifically.

And so, we don't want to just give up if somebody is vague in describing what they want. We want to try to do a best effort, like, "Hey, this is a query that you can get based off of your inputs, and we'll execute it for you, and if that teaches you how to use the product then mission accomplished." It may not have necessarily answered your question, but it did its job.

Liz: I think that's another really fascinating thing, which is: what is the threshold for success here? What does the AI have to be sufficiently good at to beat 90% of users?

It turns out AI can beat 90% of first time users at formulating a query that runs right, that generates some meaningful data.

It may not be the data that the user was looking for, but by being able to start to fill in that blank and give someone a springboard to iterate off of, it's doing its job there. I think this is a reason why, Rachel, you were saying observability is a perfect use case. I think it's both that the data is very rich, which you touched on, but also that there are no mistakes in observability, by and large.

You can run a query that returns the wrong results or that doesn't return any results at all, and you haven't caused any lasting harm. I think that that is a lot larger of a risk in systems where there's some potential to perturb the state of your system by running a query through an AI.

Rachel: Yeah, that's absolutely true. What I was thinking as you both were talking about this is that you've taught the AI to listen generously and to offer answers that may approximate what the user was asking for, and to me that's a real reflection of the culture that you've built at Honeycomb. The founders have always been so mindful of building an inclusive and blameless culture, and I think that really comes out in the products, and in this product as well.

Phillip: Yeah, definitely. As we've been exploring and iterating in production, I think a lot of this has just been reinforced. We went out with an initial philosophy, I guess you could say, around trying to accept as broad a set of inputs as possible.

In particular, there are prompt engineering techniques that you can apply that increase accuracy, but that does come at the cost of not necessarily accepting a certain kind of input.

That's kind of a fundamental trade off for certain kinds of techniques that you do when you're building with this AI. So our philosophy was, all right, we're going to get it out there, we're going to see what people do, and if it turns out that people are quite precise in what they want then maybe we could actually be a little bit more accurate and apply some of these techniques in certain ways.

And, it turns out, we were completely correct in our assessment that people were going to be extremely broad on the kinds of things that they do, and people are pasting in SQL statements for some reason, which doesn't make any sense. But whatever, it works. People are pasting in just IDs of things and we've even looked, we've traced through the feedback mechanism, did they give us a thumbs up or a thumbs down or something on the response?

People are pasting in what I often think of as nonsense, but then they say thumbs up, and it's like, "All right, well, hopefully it was not a misclick. Hopefully that was actually helpful for you," and when we talked with a few people, that's actually what it was.

So certainly I think for our use case, that was the right approach, and this is a dimension of product development that people are going to have to keep in mind when they're building with AI, because it might actually be the case that accuracy is much more important than accepting a broad number of inputs. In those situations, you need to know how you're going to do your prompt engineering and maybe how you're going to get the model to accept that particular modality.

Rachel: But at least understanding that that's an affordance and that you can dial it up and down, it captures the idea of error budgets from SRE. There's a level at which more information is not useful, let's optimize around the level of information that is really going to make a difference to someone. How are you handling hallucinations?

Phillip: That's a good question. There's a couple things that we do in our pipeline. You can think of it on both ends of the spectrum, on the inputs and the outputs side, because hallucinations are a fact of life when you're working with LLMs. In fact, you might actually take the philosophy that a Large Language Model's outputs are always hallucinations, it is just a matter of picking which hallucinations work best for you.

Rachel: Man, you're blowing my mind.

Phillip: So that's actually a helpful framing that you can have there. There are techniques that you can apply, both on the input side and on the output side, to help with that. First of all, one of the most common ones that people apply today is retrieval augmented generation. The idea is that you pass in a bunch of context to the language model and you say, like, "Hey, this is the useful data, and these are the rules that we're playing by," and there's all these sliding scales.

If you don't have enough context, then it's not going to be too helpful, but if you have too much, it can get lost in understanding some of that. So you want to figure out what's the right piece of context to fit in there, and we do a lot of work to base that off of somebody's schema and select the correct subset based off of what they're asking, and a few other pieces that come from some settings.
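
As an illustration of that input-side work, here is a minimal sketch of selecting a relevant slice of a dataset's schema and assembling a prompt per request. The selection heuristic, prompt wording, and function names are assumptions made for the example, not Honeycomb's actual implementation; note that only column names (metadata), never row values, go into the prompt.

```python
# Minimal sketch of input-side prompt assembly. The heuristic and wording are
# illustrative assumptions, not Honeycomb's actual implementation.

def select_columns(question: str, schema: list[str], limit: int = 30) -> list[str]:
    """Pick the schema columns most lexically similar to the question.
    (A production system might use embeddings instead of word overlap.)"""
    question_words = set(question.lower().split())

    def score(column: str) -> int:
        column_words = set(column.lower().replace(".", " ").replace("_", " ").split())
        return len(question_words & column_words)

    return sorted(schema, key=score, reverse=True)[:limit]

def build_prompt(question: str, schema: list[str]) -> str:
    """Assemble a request-specific prompt from column names only (metadata,
    never row values) plus the user's natural-language question."""
    columns = select_columns(question, schema)
    return (
        "You translate questions about telemetry into a JSON query object.\n"
        f"Available columns: {', '.join(columns)}\n"
        f"Question: {question}\n"
        "Respond with a single JSON object and nothing else."
    )
```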

We try to parameterize a lot of stuff, and this is all on the input side so that on every single request, there's a very specific prompt that is actually assembled at run time and sent off to service that request. Now, on the output side, we parse those outputs and we validate what we have parsed into an object before we do anything. And so with that validation, there are a whole bunch of different rules that could potentially fail.

Now, when we're at that validation step, it turns out that a lot of those pieces that may be missing or incorrect, or are superfluous or whatever, we can actually programmatically correct them right then and there because in a lot of these cases, the Large Language Model outputs something that's 99% correct. That additional 1% we actually know statically, and so we can just add that additional 1% and move along with our day.

We think that's a hell of a lot better than saying, "Oh well, it failed validation. Sorry, we can't give you a query." And so those are two very, very, I think, important pieces. Being able to parse and validate outputs and then do something with that stuff that you're doing, and then apply steps on the input side, especially using retrieval augmented generation and really being thoughtful about selecting the right context to pass into the LLM. Then it ends up being pretty good on the hallucinations front.
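
On the output side, the parse-validate-repair step described here might look roughly like the following sketch; the specific validation rules and default values are invented for illustration, not Honeycomb's actual code.

```python
import json

def parse_and_repair(llm_output: str, known_columns: set[str]) -> dict | None:
    """Parse the model's JSON output, drop anything invalid, and fill in the
    small pieces we can supply statically instead of failing outright."""
    try:
        spec = json.loads(llm_output)
    except json.JSONDecodeError:
        return None  # truly unusable; surface an error to the user instead

    if not isinstance(spec, dict):
        return None

    # Drop references to columns that don't exist in this dataset's schema.
    spec["breakdowns"] = [c for c in spec.get("breakdowns", []) if c in known_columns]

    # Fill in missing-but-knowable pieces rather than rejecting the whole query.
    spec.setdefault("time_range", 7200)                 # e.g. last two hours
    spec.setdefault("calculations", [{"op": "COUNT"}])  # e.g. a simple count
    return spec
```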

Liz: Phillip, I think you're forgetting one of the most important aspects of our workload, which is that we are one-shot, right? It is not a continuous back and forth. As we know, the longer you go back and forth, the longer your inputs, the longer the output that you're expecting, the more likely you are to see a hallucination. So we limit a lot of that possibility just by saying you get maybe, what? 1,000 characters of input to describe what you're looking for, that gets tacked onto the end of a prompt, and then at the end, as Phillip was saying, all we ask for is a very short JSON object. We're not asking it to write a long piece, so it's not going to hallucinate details of characters and people, and so forth.

Rachel: Keeping it easy for the machines, lowering our expectations.

Liz: It's important to know what things they're good at and what things are risky. That can calibrate what you're doing, and we know that as long as we stay within the bounds of things that are less risky, we're less likely to have major problems. Similarly, we talk in this sector a lot about trying to exfiltrate prompts. We don't care that much at the end of the day if you get our prompt, right? It's basically, "Here's a description of the Honeycomb query language. You are a helpful AI assistant, please construct a query." It's really not that secret, so again, we're not really worried about exfiltration that way. Again, if we were working in a different environment, that would be something we would have to more rigorously defend against.

Rachel: We'll come back to that in a minute, but just before we do that, Phillip, how do you think about the trade off between fine tuning and prompt engineering?

Phillip: So first of all, fine tuning a model is very difficult and very expensive for most organizations. In theory, it's a way that you can get more out of a particular model. Now there's a whole entire ecosystem of foundational models, most actually are based off of the ones that Facebook is releasing, and then you can fine tune them to your particular use case. It's a way of overfitting the model to a very specific use case, which is actually what you would want.

But the challenge with it is you need really good data that informs it, like, "These are the inputs and outputs that you should expect," and so you want really, really good representative data. The only way you can collect that is by either working for years and years, trying to be as creative as possible, trying to mimic everything possible that your users could do. Or you collect data about how your users actually interact with stuff.

And so that puts you in a situation where you're like, "Well, how do I collect that data?" Well, best way to do it is to ship a product that is using a Large Language Model already that is giving them that opportunity to interact with and build up that data set to start fine tuning. Now, it's expensive and there's compute expense, but it's really time. It's a time problem, and so you may find that if you fine tune a model a whole lot, it's not necessarily going to output as good of...

It's not going to be that much necessarily better than if you just did better prompting in the first place, and furthermore it doesn't obviate the need for good prompting techniques either because you still need to... Basically, you could imagine a prompt that has a set of examples. Well, a fine tuned model won't have that set of examples but it still needs to have the rest of the prompt to know how it's going to behave, based off of the data that it's fine tuned on.

And so it's not necessarily an either/or, you're always going to have prompt engineering. But if you have the time and money and the affordance to be able to actually go with fine tuning, it's certainly worth a try.
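
For what it's worth, the data you would collect for that kind of fine-tuning is usually just pairs of real inputs and the outputs you wish the model had produced, written out, for example, in OpenAI's chat-style JSONL fine-tuning format. The example content below is invented for illustration.

```python
import json

# Hypothetical logged interactions: (user question, corrected query spec)
# pairs harvested from production usage and positive feedback.
examples = [
    ("show me slow requests by endpoint",
     {"calculations": [{"op": "P95", "column": "duration_ms"}],
      "breakdowns": ["http.route"]}),
]

# Write them out as chat-format JSONL, one training example per line.
with open("finetune.jsonl", "w") as f:
    for question, spec in examples:
        record = {
            "messages": [
                {"role": "system",
                 "content": "Translate questions into JSON query objects."},
                {"role": "user", "content": question},
                {"role": "assistant", "content": json.dumps(spec)},
            ]
        }
        f.write(json.dumps(record) + "\n")
```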

Liz: This is why we're really excited about the GPT Extension thing, to be able to give some additional custom training and essentially pre-bake what you would otherwise put in the prompt, rather than having to train your own model from scratch. Because one of the reasons we didn't train from scratch initially is what Phillip described, the cost in terms of time and money and energy; I think the other angle is access to broader sets of information.

We don't know necessarily what's going to be relevant or not to the training of a model that is useful for this Query Assistant use case. But what we do know in general is that access to information about how systems work, how businesses work, that is all very germane to how people think about querying their data because they're usually querying their data in the context of they have a business problem that they're trying to solve.

If we narrowly constrain to training just on Honeycomb queries, then it might not understand what a browser is or what a user is, to the extent that any LLM understands these things. But I think having that context to better understand our users' inputs helps it generate better output than if we had trained it more narrowly and with a tighter focus.

Rachel: Super cool. Back to what you were saying earlier, Liz, a question for both of you. In using these commercial LLMs, do you worry about the privacy and security implications?

Liz: Yes. We worry a whole lot about it, and I think that there are a couple of different angles here. First of all, that we obviously have contracts with our enterprise customers and those customers can choose to opt in or out of having access to the query system turned on for their users.

Secondly, when someone is not necessarily in an enterprise tier, and they are using the Query Assistant, we are disclosing as part of our terms of service that OpenAI is a sub-processor, that we are working with OpenAI, and should we work with other machine learning companies in the future, we will disclose who our sub-processors are. I think the other angle here is that because we use vector embedding, we try to not necessarily leak a huge amount of information that can be more accurately summarized as vectors.

I think the final piece here is that we don't actually feed in user data. We feed in the names of columns, so we feed in metadata but not actual user data, and I think that ameliorates some of that risk. At the end of the day, you do have to trust OpenAI that they say that data that's fed to them via API and not via the chat interface, that that API fed data is not used for their own learning, that you're not going to wind up having your data resurface in someone else's queries.

But I think it's this defense in depth approach, where you have to disclose, disclose, disclose, add appropriate safeguards and, at the end of the day, mitigate the risk of what could possibly leak, should something wind up leaking.

Phillip: Yeah. I would second all of that. We were very, very thoughtful along this line because, in particular, as we were building the feature, this was when OpenAI was having some of their data leakage issues and they were having some of their kerfuffles with being compliant with EU regulations and stuff. Now all of that is behind them and it's a lot easier, probably, to use something like OpenAI than it was back then.

But I think they learned the hard way that you need to earn trust as a platform and we're seeing them earn that trust, which is definitely good. The other aspect to this is that privacy is itself a product trade off. If you can get something out there to even just a subset of your users, where this is not really a hyper concern for them, then it might be worth it to actually do it and use a commercial model while you work towards building something that is going to be as compliant as possible for every possible need that you have for your users.

In our case, for example, we have a set of customers that the feature is not even available to them ever, and it's literally just because of needing to be able to sign a particular piece of paper with OpenAI and they haven't been able to sign it with us.

Because of that, we can't actually opt these people in. Now, we are thinking about how we can adopt an open source model that we then host ourselves, and then we're fully compliant as far as everything else is concerned. But that didn't prevent us from actually going out and shipping stuff.

Now, this is a trade off that's going to be different for everybody, but if you're like us and you have a whole bunch of users where data and, in particular in this case, metadata is being sent to a provider that is SOC-2 compliant and all of that, and it's used to augment our product experience just to make their lives easier.

If they're mostly okay with that then it's totally worth it to build with a proprietary model like OpenAI's and use this third party vendor, then figure out how you're going to handle it for the rest of your people in the interim.

Rachel: Yeah. The thing about building on that foundation of consent and opt in and transparency is you're building the trust with your customers and you're empowering them to make the right choices and the right risk trade offs for their environments.

Liz: And there has to be something compelling there for the user, right? I think no one really wants their data to be just sucked up and used to improve ad performance 5% against them. I think that when there is a lot more tangible of a benefit that you get in exchange for allowing AI to work with your data, I think that's something that people are a lot more willing to consent to.

Rachel: Great point. Let's talk about the user. Liz, how do you see AI enhancing or detracting from the work of site reliability engineers in general?

Liz: I have been on the record consistently as a very, very deep skeptic of AI and I still am.

I think that a lot of the previous wave of AI ops has offered promises that were over simplistic, that were overblown frankly, and that did not actually make systems meaningfully more safe.

So when we talk about the goal of having AI assist humans, I think a lot of people jump to immediately what's going to happen to my job, is the AI going to completely take over my job?

And a lot of the marketing of these things was originally, "Hey, you can hire fewer engineers. The AI will just run your system for you." And it just does not work that way. So I think, basically, the skepticism comes from how we relate the role of the AI or the role of the computer system and the role of the human. I think that humans need to be in the loop for any substantial decision.

Now, we can debate over what a substantial decision is, and that's definitely changed over time, but I think it revolves around accountability, understandability, how do we actually ensure that a human being is able to explain the outcome of the system and to tweak or change it if there is a problem. So when we look at the previous generation of AI focused systems, a lot of them were, "Hey, you're getting 10,000 pages or alerts per day, right? We're going to sift through those and find the 10 most impactful per day." It turns out 10,000 is actually not that big of a data set in the world of AI.

You really, really need much larger data sets in order to get a reasonable margin of true positive, true negative, et cetera. Also, these things were taking autonomy away from humans and they're encouraging people to proliferate more garbage in, rather than cleaning up their garbage. So I think that's an example of a first wave AI system that failed to achieve its promise. It was supposed to help people sift through all of these alerts, and in fact just made the humans' lives more complicated and made it harder for them to understand what was happening.

But if we can instead think of things as AI is a tool that we can use in order to enhance what we can do as humans, to extend the reach of what we can do but we're still ultimately in the driver's seat, then it becomes a lot safer, it becomes a lot easier to reason about what are the affordances, what does someone on call need, and how can we ensure that they have a synergistic relationship? Rather than working against their tools.

One of the expressions that I've been using for several years now is we're here to build mech suits, we're not here to build Skynet. The mech suits are to fight Skynet, right? We're not going to make your Skynet problem worse. That's how we think a lot about acting as an extension of the person.

Rachel: I want to dig into something you said, Liz, about the promise of AI ops, you'll be able to fire all of your engineers. As a labor rights activist yourself, what are some of the risks you see LLMs posing to the workforce and how do we mitigate those risks?

Liz: I think we have to even back up beyond the workforce. I think that consumers should be the most concerned about the biases that are baked into LLMs. If you ask an LLM to generate a text or string about a doctor or an engineer in a language that has gendered pronouns, you will find that it tends to make assumptions about people's genders.

And so on and so forth with the generation of art via AI, where you have these generative art models that are creating primarily white, or white-gaze, depictions of what the world is supposed to look like. So I think that to me is the more terrifying macro situation: regardless of the working conditions of individual human beings, I think the impact on society at large is our first, most core responsibility.

Then, yes, when we think about how AI will impact workers, I think that companies are too often focusing on what can I use AI to replace workers with. This has been going on for a long time, by the way. Hiring screens used to be done by recruiters, and now are done by personality tests administered by robots so I don't think of this as new to LLMs.

But I think as plenty of commentators have said, LLMs are a situation where there's the temptation for workplaces to try to replace knowledge workers, people who are producing written artifacts or other kinds of knowledge products. I think in these situations, it's very, very shortsighted because who's going to actually monitor the quality of what's being spat out? How do you know? AIs are prone to cheat at things.

How do you actually know whether the output is crap or not? As an example, I think Corey Quinn asked an AI to put together a proposed sick note or something, like, "Hey, I'm going to miss my meeting because I'm sick," and the AI hallucinated something that would've been completely off-putting to receive as a customer. So I think that's kind of the caution here, is you do need human beings to review the results of this.

It is much better to think about things as how do you extend the power and capability of what your workforce can do with augmentation, rather than slashing and burning your workplace because at the moment, at least, the technology is at a point where you can really tell when something was written by an AI versus not.

It has the corresponding effect, especially in industries that are built on person to person relationships. If someone sends you an AI written missive, you know that they are not really being respectful of you and of your relationship.

Rachel: Yeah. Our industry is so person to person. The bean counting mentality is just wild to me. It's like, haven't you ever worked on a team where you genuinely liked all of the other people? It's such a great experience, why would you want to replace that?

Liz: Yeah. Oh, and I think another thing that I should've said earlier also is that AI models don't necessarily fully remove humans. What happens in that case is a displacement or replacement of labor. What you don't see is the mechanical Turk. You don't see the people who are feeding inputs into the system, you don't see the people who are having to do safety evaluations, who are doing all of these menial steps that supposedly AI has replaced, except that it actually hasn't.

You have to look very carefully at the workplace dynamics of that, in terms of what the socioeconomic statuses of these various jobs are and what the racial makeup is of the people who are working these jobs, and I think that one of the findings is that mechanical Turk type labor is undervalued and very heavily racialized.

Rachel: Oh yeah, absolutely. Again, it's another example of how AI is amplifying long preexisting problems in the tech industry. That's always been true, but the amount of mechanical Turk work that's required to maintain LLMs in particular is enormous and, as you say, it tends to get outsourced to countries where people are brown and make much less money. It's something that we really need to be aware of. Another question for both of you, if everything goes the way you'd like it to for the next five years, you're God Emperors of AI, you get to dictate how things turn out. What do each of your futures look like? I'll start with you, Phillip.

Phillip: Oh boy. I view what we have as a step change, but an incremental step change in technology. So, with that is going to come a whole lot of things throughout our entire industry. It's kind of hard to predict them. What I would say that I hope is true for my job day to day, since I'm a product manager, I do like technical work and non-technical work if you can even call it that.

But I think fundamentally the bulk of what matters in my work is looking at data and circumstances and talking to people and trying to make sense of stuff, to try to understand if we're doing the right thing at the right time for the right people. Sometimes, a lot of that is rote, mechanical work. I have differentiated myself in the past by knowing how to write good SQL statements in a business intelligence suite, but I think the SQL is not actually what's important.

It's being able to get the results of that SQL statement that runs against all our BI data and interpret that information, and ascribe meaning to what's going on and use that to inform what we should be doing. That's actually the value that I bring when I'm in the workplace.

It's not the fact that I can write the SQL statement that gets me there. So the more tools I can get to automate that stuff that's not actually what I'm paid for, not actually my job, or why I have meaning, I guess, the better, because we're bottlenecked on so many things that are extremely important in our work and in our lives right now.

We have a lot of decisions that we have to make, a lot of things that we have to interpret that we don't have enough time to actually spend really carefully doing that because we have to do a whole bunch of work and stuff that leads up to that important stuff in the first place. I would love to live in a world where I have enough tools, certainly in the workplace, to let me do that a lot more.

We're clearly not there yet, and I don't think we're going to be there maybe even in the next five years. But hopefully it'll be at least a little bit better.

Rachel: That sounds really great.

Liz: I love how you're so glass half-full, Phillip. I'm the glass half-empty here. What I fear a lot about what's going to happen over the next two to three years is companies are going to shift towards a lot of automated generation of code and generation of artifacts, without investing in the ability to understand it. That work is going to fall on my fellow ops people. I come from an ops background, I'm currently field CTO, it's true, but my people are the ops people.

I think what happens when you get rid of half your developers and replace them with churning out AI-written code is that no one winds up being able to understand it and everything goes to shit. I think that's the real danger, and I'm continually steering us away from the rocks there. It's like, "Hey, how are we going to actually validate this output? How are we going to be able to understand what the produced code is doing in production? And how do we empower people to run with higher leverage, this higher volume of code?"

So when I look at GitHub Copilot, I am excited and I'm terrified. Excited, as Phillip is, for the, hey, I don't have to write this boilerplate myself, but also terrified for what if there's a bug in those 500 lines of code that was generated? Where am I going to find it? Is my code reviewer going to catch it, or are they going to deploy an AI to review my code and that's not going to catch it either?

The responsibility has to sit somewhere, and I think that's what keeps me up at night, is what happens when a business critical system is written using AI and it crashes and no one can resuscitate it, and suddenly we're reliving Healthcare.gov, except instead of an army of humans debugging it, it's a much smaller army of humans debugging it because the machines wrote all of the code.

Rachel: It is interesting, even though you're coming at it from different angles, how your answers rhyme. You're both fundamentally concerned with why does this matter? Why does this matter to humans? What's the real world, lived implication of this? And I do think I'll probably end up quoting this in every episode of the podcast, Fernando Flores, the Chilean economist, said, "Work is the making and keeping of promises."

And that really is at the heart of it, it's taking responsibility for outcomes and understanding what those outcomes mean to other people. So it's exciting to have two people as high profile and conscientious as you two working on it. Another fun question. The podcast is called Generationship because we're on a spaceship sailing to the stars. If you each had a colony ship, what would you name it?

Phillip: I'm an incredibly boring person. I would probably call it Phillip's Ship.

Rachel: At least we would know it's yours.

Liz: I'm going to have to go with the SimCity arcologies, the Launch Arcology. I'll literally call it Launch Arcology because that's what it is.

Rachel: It's beautiful. Yeah, I love that. Phillip, Liz, thank you so much for coming on the show. It's been great talking to you. I'm excited about your future, I'm excited to read the next blog post. It's like hearing that my favorite show got a new season. Good luck with Query Assistant and with everything you're working on.

Liz: We've got a lot of very fun things coming up. If I can tease one thing?

Rachel: Please.

Liz: We've been putting a little bit of thought into what jobs SREs do that are repetitive and one of those jobs is coming up with service level objectives and translating that into one's observability tooling. So that's an interesting area that we're putting some thought into and poking around at, that has that kind of curve of an AI is maybe 90% better than humans at doing X. Not going to specify what that X is, aside from saying that it's in the service level objective space.

Rachel: Super cool. Thank you, both.

Liz: Thank you.

Phillip: Thank you.