Generationship
21 MIN

Ep. #35, Wisdom with Brooke Hopkins of Coval

about the episode

In episode 35 of Generationship, Rachel is joined by Brooke Hopkins to explore what it takes to make voice AI agents reliable, robust, and ready for real-world deployment. Drawing from her experience at Waymo and her current work at Coval, Brooke reveals how testing and evaluation are the key to staying ahead in a fast-moving AI landscape.

Brooke Hopkins is the founder of Coval, a platform for simulation and evaluation of AI agents. She previously led core simulation infrastructure at Waymo, where she helped build tools to test and validate autonomous vehicle models. Brooke holds a B.S. in Computer Science and Mathematics from NYU.

transcript

Rachel Chalmers: Today, I am delighted to have Brooke Hopkins on the call.

Brooke is the founder of Coval, a simulation and evaluation platform for AI agents.

She previously led evaluation job infrastructure at Waymo, and if you're in San Francisco and you haven't taken one of the robot taxis yet, you have to. It's so science fiction, it's amazing.

Brooke's team was responsible for the dev tools for launching and running simulations, and she engineered many of the core simulation systems from the ground up.

She has a bachelor of science degree in compsci and mathematics from New York University.

Brooke, it's so great to have you on the show.

Brooke Hopkins: Thank you so much for having me, Rachel.

Rachel: Super excited. How is testing voice AI agents different from testing traditional software? And how does it compare to your experience testing the AV driving models at Waymo?

Brooke: Yeah, so when I was at Waymo, I think I spent a lot of time building out these simulation and evaluation tools for how we can test and create really reliable self-driving cars.

And when I left Waymo, it felt like there were so many similar problems happening in the rest of the industry and AI that weren't just AV-specific: how do you take really large data sets and find interesting examples within those data sets, and then test non-deterministic models?

So being able to run the same examples over and over to get some probabilistic output.

And then taking all of those insights and being able to make sense of all this data in a way that you can draw conclusions about how your system is performing.

And these are all the same problems that are happening in AI. Especially with voice.

How do you run voice agents so that they get from point A to point B? Let's say it's a customer support ticket; how do you make sure that ticket is being handled correctly every single time?

Even across lots of different scenarios and circumstances: how that person responds, whether there's background noise, and all of these other attributes.

And so this is really similar to self-driving cars, because when you're driving down the street, you might come to the same intersection every single time, but there are always going to be these slight variances.

And so we're taking the same approach as robotics and self-driving: how can you take these simulation best practices and apply them to voice and chat agents?

And eventually to other agents that are navigating the external environment, to be able to produce really reliable autonomous agents.

Rachel: So the car needs to get from A to B in this very challenging and changing environment without killing any pedestrians, and the customer support agent has to get the customer the right answer even though the customer is describing the problem in different ways, and it's really complicated.

And the way you do that is you run 1,000 simulations and you try to get the bad outcomes down below 0.1%. Is that right?

Brooke: Exactly.

I think that the big difference between traditional software engineering and agent or non-deterministic model development is that for the same input, you might get lots of different outputs.

And so a test where you say, for this input, I expect this exact output, those tests no longer make sense because really, what you care about is how often am I getting it right across a slew of use cases.

And for some cases like that, reliability needs to be super high. And so similar to cloud infrastructure or these other places, you know, you might want six nines of reliability for those.

Rachel: Please don't kill any pedestrians.

Brooke: Exactly. Or we definitely want to be compliant, or want to make sure that we're getting this answer right every time, even if it means sometimes we don't serve it autonomously.

And then on this other side, there are some cases where actually just like any response is okay, and you might have a lot lower reliability requirements.

And so I think being able to find the in-between among all of these is super important.

So it's not just, is it getting it right for this one input? It's, what's the probability across all of my simulations?
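
To make that concrete, here's a minimal sketch of what a probabilistic pass-rate check could look like; the `run_agent` callable, the scenario structure, and the thresholds are hypothetical placeholders for illustration, not Coval's actual API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Scenario:
    name: str
    prompt: str
    passed: Callable[[str], bool]  # did this run meet the goal (e.g. ticket resolved)?

def estimate_pass_rate(run_agent: Callable[[str], str],
                       scenario: Scenario,
                       runs: int = 1000) -> float:
    """Run the same scenario many times and measure how often it succeeds,
    since a non-deterministic agent can answer differently on every run."""
    successes = sum(scenario.passed(run_agent(scenario.prompt)) for _ in range(runs))
    return successes / runs

def meets_reliability_target(rate: float, target: float = 0.999) -> bool:
    # Instead of "input X must produce output Y", assert a reliability target.
    return rate >= target
```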

Rachel: So those are the big differences. What engineering principles from traditional software dev still apply versus the ones you've had to completely rethink?

Brooke: Yeah, I think sometimes people put their hands up and they're like, "There's no way we can get really high reliability for autonomous agents because they're non-deterministic and we'll just never be able to control the models."

And I think this really isn't true because if you look at cloud infrastructure, that's a great example from traditional software engineering where there's inherent unreliability across the stack, right?

Like everything is unreliable from the servers, from the network layer, all the way up to the application layer.

There are so many layers of unreliability that theoretically, if you compound all of those, you would basically get no uptime.

And yet we've been able to engineer systems through redundancy and all sorts of other engineering best practices to be able to get really, really high reliability, like six nines of reliability.

And so I think what I'm really excited about is how can we build those same things for agents where you're not just relying on a single call to an agent and if it fails, then it falls over.

But how can we start thinking about, for example, like fallback mechanisms so that when an agent fails to complete a task, like can it retry or can it have another agent that's helping to check those answers?

And people are already doing this today with guardrails. Like, can we have checks as the agent goes along throughout the task, to double-check those answers with smaller models or more expensive models, etc.?

And then at a more fundamental layer, can you have multiple calls so that you're double-checking the work, etc.?

And so I think taking some of those principles of how do you create redundancy and strong fallback systems in a fundamentally unreliable system to create reliability is super interesting.
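
As a rough sketch of that redundancy idea, under the assumption of hypothetical `primary`, `verifier`, and `fallback` callables (this isn't any particular framework's API), a retry-plus-guardrail loop might look like:

```python
from typing import Callable

def answer_with_fallback(primary: Callable[[str], str],
                         verifier: Callable[[str, str], bool],
                         fallback: Callable[[str], str],
                         task: str,
                         max_retries: int = 2) -> str:
    """Retry the primary agent, guard each answer with a cheaper checker model,
    and fall back (e.g. to a human or a scripted path) if the checks keep failing."""
    for _ in range(max_retries + 1):
        answer = primary(task)
        if verifier(task, answer):  # guardrail: double-check before serving
            return answer
    return fallback(task)  # don't serve an unverified answer autonomously
```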

I think testing best practices have in many ways dramatically changed, but in many ways they're still very similar. You want to have unit tests, you want to have some integration tests, you want to have a testing pyramid: these are all my base cases, these are all my edge cases.

And running a variety of small tests to iterate really quickly, and then large tests to be able to test your system end-to-end before big releases.
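
A hedged sketch of what that testing pyramid might look like for a voice agent, using pytest; `run_agent` and `handled_correctly` are stand-ins for a real agent call and a real grader:

```python
import pytest

def run_agent(prompt: str) -> str:
    # Stand-in for a real voice-agent call; purely illustrative.
    return f"handled: {prompt}"

def handled_correctly(prompt: str, reply: str) -> bool:
    # Stand-in grader; a real one might use a rubric or an LLM judge.
    return reply.startswith("handled")

# Base cases sit at the bottom of the pyramid: small, fast, run on every change.
@pytest.mark.parametrize("prompt", ["reset my password", "check my order status"])
def test_base_cases(prompt):
    passes = sum(handled_correctly(prompt, run_agent(prompt)) for _ in range(20))
    assert passes / 20 >= 0.95

# Edge cases and end-to-end suites are heavier and run before big releases.
@pytest.mark.parametrize("prompt", ["caller talks over the agent", "heavy background noise"])
def test_edge_cases(prompt):
    passes = sum(handled_correctly(prompt, run_agent(prompt)) for _ in range(200))
    assert passes / 200 >= 0.9
```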

Rachel: I love using the SRE and platform engineering world as a sort of example of how to deal with uncertainty and the messiness and chaos of the real world.

It seems like a really powerful analogy for what you're doing.

Brooke: Totally. I did a lot of SRE work at Waymo, and reliability work there is really exciting because you have all of these non-deterministic models across the stack, and you can jump into any problem and ask, from first principles, what's going on here?

There's something really thrilling about that.

Rachel: We live in a chaotic and messy world, and mathematics and statistics give us a really powerful way of thinking about it.

Meanwhile, the voice AI market is evolving at just an incredible rate. Now that you're at Coval, how are you helping companies keep up with this rate of change?

Brooke: Yeah, I think it's been really crazy since we started. Back last summer, there were very few people doing voice.

I think the models weren't yet good enough. It was still extremely early in terms of every layer of the stack. The speech-to-text models didn't even exist.

The ability to do text-to-speech and speech-to-text, and then also the reasoning capabilities of these models, were just much farther behind where they are now.

And so I think this explosion of voice has been really exciting, but it also feels like things are moving so fast where every week, there's a new thing that's happening in AI, or in voice AI.

And so a question we get from our customers a lot is, which parts of the stack should I be looking to switch out? Which parts of the stack should I just keep and assume will continue to get better?

And I think this is where having really strong evals is super important. Like Garry Tan said that the foundation of really strong agentic systems is actually in how strong your evals are.

And the reason for this is I think because you're able to swap out these different modules as things improve so that you can move really fast with the rest of the market.

So when the next best voice model comes out, you can throw that in your system, see how well it performs with your use cases, tune it and ship that really quickly.

Whereas if you are in the dark about how your voice agents are performing, it becomes really hard to swap out any one piece because everything is so brittle. And I think this is true with prompts.

When you change to different LLMs or different reasoning capabilities, it's going to be a lot harder to try out these new experimental models if you don't have the evals and baselines to compare that against.
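
One way to read that in code: keep a fixed eval set and a grading function, and score any candidate model against the current baseline before swapping it in. This is just an illustrative sketch with hypothetical names, not Coval's interface:

```python
from typing import Callable, Dict, List

def compare_to_baseline(eval_set: List[str],
                        grade: Callable[[str, str], bool],
                        baseline_agent: Callable[[str], str],
                        candidate_agent: Callable[[str], str]) -> Dict[str, float]:
    """Run the same eval set through both configurations so a new model
    can be judged against the current one before it ships."""
    def pass_rate(agent: Callable[[str], str]) -> float:
        return sum(grade(case, agent(case)) for case in eval_set) / len(eval_set)

    return {
        "baseline": pass_rate(baseline_agent),
        "candidate": pass_rate(candidate_agent),
    }
```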

Rachel: Because you're quantifying the performance of all of these different stacks and models, you're in a position to advise your customers on what's actually working with some objective backup to it.

That's a very cool place to be.

Brooke: Totally. And I think that you also have the ability to experiment.

I think some of the companies at the forefront of voice agents are there because they're constantly trying out new architectures, they're constantly trying out different ways of doing it.

With one of our customers, I think we pair a lot not just on how they should set up their evals, but on how they can run experiments with these novel voice architectures to see how they compare against their baseline.

And this is changing constantly; every month they're thinking about new ways of doing these things.

And so I think at that rate, it's just really hard to compete in the market if you're not also iterating at that speed.

Rachel: So we're ending up with these stacks that are going to be very dynamic and people are going to be reassembling them from different components as new capabilities come online.

How do you design dev tools that can remain relevant through all these transitions?

Brooke: Yeah, I think the beauty of doing the testing aspect is that even if we get to AGI, what we're doing is asking: are you achieving the outcomes that you're looking to achieve with your autonomous agents?

And so it's less about any of the specifics of different parts of the stack. If you're using a cascading architecture, that's using speech-to-text, then an LLM, then text-to-speech, and kind of iterating through that loop to respond to your user.

And that can be compared with speech-to-speech. And I think here we're seeing, even in all these different cases, you're still ultimately trying to get back to your user as fast as possible.
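
For readers who haven't seen the two shapes side by side, here is a hedged sketch of one turn of a cascading pipeline; the three callables are placeholders for whatever STT, LLM, and TTS providers are in use, and a speech-to-speech model would collapse them into a single call:

```python
from typing import Callable

def cascading_turn(audio_in: bytes,
                   speech_to_text: Callable[[bytes], str],
                   llm_respond: Callable[[str], str],
                   text_to_speech: Callable[[str], bytes]) -> bytes:
    """One turn of a cascading voice pipeline: transcribe, reason, synthesize.
    Whichever architecture is used, the evaluation question is the same:
    did the user get the right outcome, and how fast?"""
    transcript = speech_to_text(audio_in)
    reply_text = llm_respond(transcript)
    return text_to_speech(reply_text)
```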

And so I think even as we evolve into web agents or as we evolve into these other autonomous agents, it's going to be a very similar process in the sense that you're trying to evaluate the ultimate outcome.

And so as the tools shift under you, the key is, I think, obviously staying up to date and being in tune with the latest that's happening in agents and the best practices, etc.

But I think we have it a little bit easier there, whereas the models definitely are just constantly changing.

And if you're not constantly releasing the best model... yeah, the market moves so fast.

Rachel: So what metrics matter the most when you're benchmarking voice AI systems, and have those changed since you started Coval?

Brooke: Yeah, I think they've definitely changed over time. Even the more traditional NLP metrics we started with are no longer useful, just because the models have become so good.

So for example, GLUE and ROUGE, or some of these other academic benchmarks that have been used for a long time in benchmarking NLP models, are no longer super useful for being able to say how accurate or how conversational an agent is.

And then similarly, there are a lot of other things: for example, word error rate is very common for measuring text-to-speech models, and latency is obviously a big consideration.
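
For reference, word error rate is just word-level edit distance normalized by the length of the reference transcript. A minimal, self-contained sketch (not Coval's implementation):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# e.g. word_error_rate("my number is four one five", "my number is for fifteen") -> 0.5
```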

But I think beyond that, we're actually now seeing like the frontier of what people care about constantly shift.

So in the beginning people were just thinking about like, can it respond to the user with a relevant response?

Like basically, can it have a natural conversation for a prolonged period of time without forgetting who the agent is and where in space they are?

Rachel: Can it pass The Turing Test?

Brooke: Yeah, exactly. Like can it just have a conversation back and forth? But now, I think we're seeing people like they're saying, can it follow this really complex workflow?

Can it call all of these tools and make function calls in really intelligent ways? I think other things that we see come up a lot are: can it say emails or phone numbers or complex alphanumeric strings?

These are all super difficult to nail because the cadence at which you say different numbers matters depending on the context.

And then I think also being able to tune, and this is a metric that we don't yet have, but we're working on some exciting things around it, how casual or formal an agent is.

So being able to create the user experience around these agents. So for some agents, you want them to be much more professional or rigid, whereas other agents you want them to be much more casual and human-like.

And so you can add in things like disfluencies around saying, "like," "um," all the things that make me sound really human.

And then you can have these very professional agents, like if you're an airline, you're obviously going to want to sound very professional.

Be it like web agents or voice agents, chat agents, like all of these different modalities, you're going to want to create a different brand and a different tone.

Rachel: Yup. Teach the voice agents how to code switch.

Brooke: Exactly.

Rachel: How do you think about creating standardized benchmarks in a field where every company seems to be measuring success differently?

Brooke: Yeah, this is something that we're really excited about because we get this question from customers so often, which is like, which parts of the stack should I be using?

We've actually started to do benchmarking across lots of different models. So for example, with text-to-speech, how can you think about which text-to-speech provider to use?

Like running continuous latency, word error, and other benchmarking for those.

I think something that's really missing from benchmarking right now is the continuous aspect.

So being able to go and say like, "What is the latest for all of these different models on the continuous basis? Like over the last week, how reliable were each of these models? Can I compare how these different models sound for my different use cases?"

And so that's something that we're investing some time in, both for ourselves, so we can determine which ones we should be using internally, and to help our customers decide that.

And I think it's also just a huge need right now; there are so many models coming out all of the time, and you need to be able to choose between them. I think voice right now is pretty underserved in terms of benchmarking capabilities.
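
A rough sketch of what one pass of that kind of continuous benchmark could look like, reusing the `word_error_rate` helper sketched earlier; the provider and transcriber callables are hypothetical, and a real setup would run this on a schedule and store results over time:

```python
import time
from typing import Callable, Dict, List

def benchmark_tts_providers(providers: Dict[str, Callable[[str], bytes]],
                            transcribe: Callable[[bytes], str],
                            prompts: List[str]) -> Dict[str, Dict[str, float]]:
    """One benchmarking pass: synthesize each prompt with each provider,
    record latency, and round-trip through a transcriber to estimate WER."""
    results = {}
    for name, synthesize in providers.items():
        latencies, errors = [], []
        for prompt in prompts:
            start = time.perf_counter()
            audio = synthesize(prompt)
            latencies.append(time.perf_counter() - start)
            errors.append(word_error_rate(prompt, transcribe(audio)))
        results[name] = {
            "avg_latency_s": sum(latencies) / len(latencies),
            "avg_wer": sum(errors) / len(errors),
        }
    return results
```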

Rachel: What unique challenges have you encountered building dev tools specifically for voice AI companies? What surprised you?

Brooke: Yeah, I think we went into it thinking that there are a lot of similarities with self-driving.

And I think that has proved even more true than we expected. I think there are a couple of really core fundamentals emerging from self-driving, robotics, and now autonomous agents that we're going to look back on as kind of this new era of software engineering best practices.

I think similar to object-oriented programming, or kind of web architecture and web interface architecture.

I think we're going to start to see how agentic and non-deterministic AI systems should be architected and what those architectures look like.

I think in self-driving, there emerged a really common architecture across all the companies: perception, behavior prediction, and planning.

Perception is perceiving all of the inputs from the different sensors on the car. Behavior prediction is: given all of these different inputs, like this is a pedestrian, this is a car, etc., what do I think all of the objects in the scene are going to do next?

And then the planner is: based on all of these inputs, what should the car do next in order to achieve its goal of getting to point B?

That's obviously oversimplified, but I think there's a lot of commonalities between all of the different self-driving companies.
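
Reduced to signatures, and heavily simplified, that three-stage shape might look something like the following; the types and function names are illustrative only:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class DetectedObject:
    kind: str                      # "pedestrian", "car", ...
    position: Tuple[float, float]  # simplified 2D position

@dataclass
class PredictedBehavior:
    obj: DetectedObject
    likely_next_move: str          # e.g. "crossing", "yielding"

def perception(sensor_frames: List[bytes]) -> List[DetectedObject]:
    """Turn raw sensor input into labeled objects in the scene."""
    ...

def behavior_prediction(objects: List[DetectedObject]) -> List[PredictedBehavior]:
    """Given the labeled objects, predict what each will do next."""
    ...

def planner(predictions: List[PredictedBehavior], goal: str) -> str:
    """Given the predictions and the goal (get to point B), choose the next action."""
    ...
```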

And then on top of that, also how you simulate and how you deploy software here. There are a lot of learnings there, and so I think we're going to continue to draw on those learnings.

And self-driving is used as an example, I think increasingly so in autonomous agents, for that reason: it's a great trend line to draw between all of these different non-deterministic agents.

How do we start to establish new software engineering and deployment practices for these new systems?

Rachel: So sort of along the lines of those emerging standards from the autonomous driving companies, how do you balance the needs of different stakeholders within the voice AI companies?

You've got your ML engineers, you've got your product managers, you've got your QA teams, how do you make sure everybody's needs are met?

Brooke: Yeah, I think this part is actually really exciting, because as AI is kind of upending all these different roles within companies, a lot of times people are talking about how PMs and engineers are merging, or how different roles are changing so dramatically.

I think we're seeing that the line between developers and PMs, or developers and sales, is closing as well.

So for example, when you're buying an agent, you can't just assume all the buttons work, right?

With traditional B2B SaaS, you didn't have to like go into the dashboard and be like, okay, you say that you track all my applicants, but do you actually? Like, let me go check.

And with agents, you totally have to do this, because the difference between a really highly performing agent and a poorly performing agent determines whether or not it's useful.

I think with a lot of these agents, obviously, you would love for the agent to be able to close more sales for you or handle all of your customer support requests.

But if it's not actually doing that for 90% of them, then it's actually just creating a lot of noise.

And so I think that a lot of these roles are going to change where software engineers actually have to provide a lot of these low level analytics or insights into how well the agent is performing in order to convince customers that their agents are behaving as they expect.

And vice versa, I think as a company procuring agents, how do you think about that buying process and how do you think about like, what questions should you even be asking as a CTO?

Like should you be calling an agent over and over? I think that with Coval, like what we really want to do is be able to provide a framework for how people can think about these questions.

Rachel: Brooke, you've convinced me. I'm going to make you God emperor of the solar system for the next five years.

Everything goes the way you think it should go. What does the world look like in five years?

Brooke: I think we are much more creative and we have space to build so much more.

I think in general, over time, humans have become more creative with every new technology advancement. And I think that's overwhelmingly apparent with AI.

People are building software who never could have built software before because you have things like Lovable or Bolt.new to be able to just spin up applications out of nowhere.

And then on top of that, you have the ability to do all this amazing design work, or to put words to thoughts that you had and maybe just weren't able to bring together cohesively.

And so I think I'm really excited about just like how creative it will make us.

Rachel: And now to test your creativity. If you had a colony ship to the stars, if you had a ship that was going on 100-year journey, what would you name it?

Brooke: I love this question. Is it inspired by a Generationship?

Rachel: Yes, exactly. That's where we get the name from. The idea that the journeys that we make will take more than one human generation to survive.

Brooke: I love that. Well, I would have to name it Sofya because we're named after Sofya Kovalevskaya and she was the first female mathematician to get her PhD.

And then Coval is also conversational evals. And so we're definitely named in her honor.

But I think that this idea of being at the forefront of your field and being like the pioneer obviously fits so well with being a Generationship too, in many ways. So I'll have to name it Sofya.

Rachel: And of course "Sophia" is wisdom, and wisdom is going to get us to the stars if nothing else does.

Thank you so much, Brooke. It's been a delight to have you on the show. Come back any time.

Brooke: Thank you so much, Rachel. It's been great to chat.