Open Source Ready
40 MIN

Ep. #27, Rethinking AI Evals with Adam Hevenor

about the episode

On episode 27 of Open Source Ready, Brian Douglas and John McBride speak with Adam Hevenor, creator of Vibecheck. Together they explore the fundamentals of evals, how teams can run structured experiments across model variations, and why cost-efficient design matters more than ever. Adam also offers grounded perspectives on MCP adoption, Claude Skills, and the economics shaping today’s AI tooling ecosystem.

Adam Hevenor is a software engineer focused on building practical, developer-friendly tooling for AI agents. He is the creator of Vibecheck, an eval and experimentation platform designed to help teams understand and improve model behavior. With a background in observability and infrastructure, Adam brings a deeply technical perspective to emerging AI patterns and workflows.

transcript

John McBride: Welcome back everybody to another episode of Open Source Ready. Here again, as always, with Brian. How are you doing?

Brian Douglas: I'm doing fantastic. I'm definitely staying busy over here in Continue Land, out here in Oakland. But yeah, I'm happy to be chatting with Hev and also with you, John.

John: Yeah, very excited for this. So today we have Adam Hevenor, or maybe "Hev" as some people would know him. And Adam, I would love for you to introduce yourself and tell us what you've been up to.

Adam Hevenor: Yeah, my name's Adam Hevenor. I'm a software person, I work on lots of different projects. John and I go back a ways and share a former employer, and yeah, I'm working on really a new agent framework for doing evals.

So we'll dig into what evals are, what that means. But yeah, I've been spending really all my time figuring out how to put that together and getting a developer experience that's a little bit more pleasurable for that.

John: Yeah, very nice. It's funny and kind of serendipitous because we've been having a lot of these AI technical adjacent conversations. What is an eval? This is a word I keep hearing and I'm not even certain I know what it is. So why don't you tell our audience what that is?

Adam: Yeah, so an eval is a small repeatable test that you can run against model output. And that model could include tool calling and things like that. So it could be a model or it could be an agent outcome that you're looking to accomplish. And it's really how you measure if your system is improving.

So you set up these evals, much like test driven development. You take a collection of evals, you put them into an eval suite, and then you use that suite to benchmark performance. So, that's what Vibecheck allows you to do. We have a series of abstractions to put that suite together and then allow you to run experiments against different models, different system prompts, things like that.
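
To make that concrete, here is a minimal, illustrative sketch of an eval and an eval suite in Python. It is not Vibecheck's actual API; the client, model name, and check are placeholder assumptions against an OpenAI-compatible endpoint.

```python
# A minimal sketch of an eval as a small, repeatable test against model output.
# Illustrative only, not Vibecheck's API; assumes an OpenAI-compatible client
# and a placeholder model name.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def run_model(prompt: str, model: str = "gpt-4o-mini") -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def eval_capital_of_france() -> bool:
    """One eval: a fixed prompt plus a check on the output."""
    output = run_model("What is the capital of France? Answer in one word.")
    return "paris" in output.lower()

# An eval suite is just a collection of these checks, benchmarked together.
suite = [eval_capital_of_france]
score = sum(e() for e in suite) / len(suite)
print(f"Suite pass rate: {score:.0%}")
```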

John: Yeah, I am reminded of this stuff that we looked at, Brian and I, a year plus ago when we were doing this incubator with Microsoft and they were showing us all this stuff in their ginormous AI studio. And you could plug all this stuff in and then get percentages and even apply some machine learning to understand how your model was doing in real production use cases.

How does your open source framework and product, Vibecheck, how does it kind of compare to bigger suites of tools? Is it similar?

Adam: Yeah, it's similar, but hopefully simpler is one of the differentiators for us. A lot of the products that are out there are sort of grown to be complex and they're also, in a lot of cases, focused around that production monitoring use case you mentioned.

So when you say evals, a lot of times people are applying those evals to production systems to just determine if anything bad has happened. A lot of times with evals, you sort of take a risk, what is undesirable behavior that the model might do, and you try to isolate that, test for that and make sure that's not happening.

So a lot of the existing products and services in the space are kind of focused around that. They're also focused around fine tuning use cases and things like that. And they've grown in complexity quite a bit.

There's a good one called MLflow. It's like a Databricks product. It's pretty industry standard across ML teams that work with Spark, but it's enormously complex. It has tons of different score weightings, all these things that you have to like wrap your head around.

And I'm trying to just bring that back to some basics, use some of the paradigms that folks are familiar with from CI/CD systems and make that developer experience just really familiar. We have a DSL, or domain specific language, that allows you to write evals in YAML, and then you can perform a series of different checks to confirm or see if a tool has been called and things like that.
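
For illustration only, and not Vibecheck's YAML DSL: the same kind of "did the model call the expected tool" check, written directly against an OpenAI-compatible tools API with a placeholder model and tool definition.

```python
# Illustrative sketch of checking whether the model called the expected tool.
# Not Vibecheck's DSL; just the same idea via the OpenAI-compatible tools API.
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model choice
    messages=[{"role": "user", "content": "What's the weather in Denver?"}],
    tools=tools,
)

calls = resp.choices[0].message.tool_calls or []
assert any(c.function.name == "get_weather" for c in calls), \
    "Eval failed: model did not call the expected tool"
```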

John: Yeah, totally. Brian, I'm curious if you and the Continue folks have been using evals or how that plays into your product.

Brian: So I've been at Continue since June. Technically March, I was consulting. But yeah, June is when I started full time.

In the time between now and June, there's been a new model dropping almost nonstop.

I think we had a bit of a slowdown in maybe the end of July, but then we picked right back up with some Chinese models and Kimi 2, and stuff like that. But I guess I bring this up because when I asked the question, are we using evals internally or for our users? Yes, users are setting up their own evals.

But the rebuttal I hear from a lot of folks is like, the models drop so fast and they get better and better. So like, by the time you do an eval, you're already on the next version of the model. Like, you go from Claude 4 to Claude 4.5 within the span of like six months or whatever it was, or seven months.

So, how do you predict or how do you like invest the time or who's investing time in doing evals when there's new models coming all the time?

Adam: Yeah. And actually to continue my thought, you know, we have these existing systems that are focused on monitoring production and what's going on. I built my platform really focused on the experimentation phase and I built it on top of a service called OpenRouter. So we are able to bring in models basically as soon as they release them.

I have a little bit of validation in place to make sure the model works, and there are certain rate limits and things on some models from OpenRouter that mean I can't offer them. But like the Kimi models, for example, we already have the Kimi models. I've been testing out the Kimi models to kind of see, is this claim that it's better than GPT-5 legit?

I haven't seen that, to be honest. I think it's getting a lot of hype. And I think it's one of these interesting thinking models. I'm admittedly not trying to run super complex benchmarks. I'm like, how many Rs are in Strawberry?

But I have some like, pretty interesting evals, things that like, ask the model to translate between languages and things like that. And it doesn't perform nearly as well as GPT-5 does.

Brian: Yeah, and this is interesting because, like, the background at Continue, it's the first open source coding agent to get VS Code and Ollama to work together back in the summer of 2023, and the progress expanded beyond that. But we have such a strong community, like we have 60,000 people in Discord who all they do is this.

Where they're like, okay, Kimi 2's out. Is it better than Copilot and Claude? How can we get to that point with different rules and MCPs and stuff like this?

And I imagine they're doing evals. I imagine there's a lot of this, "put your finger there and see how the experience works."

I'm curious. So you mentioned Vibecheck. Is this a platform, GitHub Repo, how are folks interacting with it?

Adam: Yeah, it's the combination of both. So we have GitHub repo, which sort of outlines how the DSL works, how to write your evals, and is a CLI package you can use to log onto our cloud platform. And then the cloud platform is where we execute the evals. I like to call it an eval experiment platform.

So you might already have like a system to go monitor your production. You're seeing certain behaviors in production. You can then bring it in, write a quick eval onto my platform, and then try it with different system prompts or try it across different models. Maybe you want to adjust the temperature on your model.

We make that really easy to run really fast. We're still kind of in a design phase for an experiment framework so that you can systematically do what I just described. Like, "I want to run this system prompt against all these various models and assess on a score."

Or maybe I want to vary the temperature. You choose what variable you want to set on your experiment and then you can plot the results. Like I said, a lot of the products focus on that monitoring eval case. We're really focused on trying to make it easy to experiment so you can get the best possible settings and model that matter most to your use case.
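
A hedged sketch of that experiment loop, assuming OpenRouter's OpenAI-compatible endpoint; the model slugs, prompt, and scoring check are placeholders, not Vibecheck's implementation.

```python
# Run the same eval across different models and temperatures and compare scores.
# OpenRouter exposes an OpenAI-compatible API; check its catalog for exact slugs.
import os
from itertools import product
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

models = ["openai/gpt-5", "moonshotai/kimi-k2"]  # example model slugs
temperatures = [0.0, 0.7]

def score(output: str) -> int:
    """Toy check standing in for a real eval suite."""
    return int("bonjour" in output.lower())

results = {}
for model, temp in product(models, temperatures):
    resp = client.chat.completions.create(
        model=model,
        temperature=temp,
        messages=[{"role": "user", "content": "Translate 'hello' to French."}],
    )
    results[(model, temp)] = score(resp.choices[0].message.content)

for (model, temp), s in sorted(results.items()):
    print(f"{model} @ temperature={temp}: score={s}")
```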

Like, one of the insights I've seen working especially with MCP stuff is like, the instinct is to go for one of these big models. Like, "oh, I'm going to go through Claude. They've created MCP, of course, let's put Claude in there."

But the more reasoning the model, potentially, the more it may want to riff on the thing that your tool does.

There's a lot more chance for, like I was working with the tool that's kind of in like the Wealth Advisor space. And like, the models know about IRAs and everything and they really want to like, help you and so it tries itself to help you before it calls the tool.

And you know, in that case we actually found that like downgrading to a much smaller model that has tool calling capability is a lot more effective at just handing off to the tool.

John: Yeah.

Adam: So just, like I said, figuring out the behavior you're looking for or not looking for, building that out, running experiments, we see that helps teams get to production, helps them save money, helps them iterate faster.

That's kind of the hypothesis around Vibecheck. I should say, we just launched, we're in Developer Preview so anyone's interested in listening to this, you can come on for free and try it out.

John: Yeah, two of the things that I think are so interesting I would love to double click on is like the smaller model and the cost savings there. That was actually, for me at the time, a very surprising finding, like a year plus ago at OpenSauced for Brian and I, where we were using a much smaller model at the time.

It was GPT-3.5. I don't even think it was Turbo, but it was like the smallest one that had tool calling capabilities, these huge RAG pipelines and like some tool calling things, doing like some summarization and yeah, using some of the smaller models was way more effective. And really cheaper ultimately.

Adam: Totally.

John: And one of these things that keeps coming up in these conversations we've been having is the AI economics around a lot of this. Where have you seen I guess some of the other cost savings or where like some of the economics of a lot of this is going for teams.

Like, is it in, you know, picking models, fine tuning, finding ways to make usage of tokens much better?

Adam: Yeah. What I've seen be the most successful pattern for teams that are both shipping and not hitting a surprise inference bill, or having to kind of scale back or something, is, first off, looking at application designs that aren't just the completely open ended chat window.

That is a pretty hard like space to solve for. And especially with the evals, a lot of the customers I've been talking to, they're in some kind of like regulated industry. They want to adopt AI, but they're also specifically looking to tools and other things to constrain AI in specific ways.

So there's two aspects to that. One is application design that looks for smaller tasks with a lesser search space on what the end user might put into the box, and then that just leads to, if you have the ability to do the experiments and you're able to kind of try different models, what you'll see is that you really just don't need a larger model in these circumstances.

You're not trying to go update 30 files in a repo. All you need is a small context window potentially and then the ability to call a tool, and that can often work the best. I have also seen pipelines that hand off from one model for the complex step to a smaller model for the simpler step. And I also think that's a really good design pattern.

When you first get into this, it's intimidating, all the models and what you would use them for, but after you start to get familiar with them and you understand what you're going for in your particular application pattern, that's when you can make a more informed choice on what to pick.
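
A minimal sketch of the handoff pattern Adam describes, where a larger model handles the complex step and a smaller, cheaper model handles the simpler one; the model names and prompts are placeholder assumptions, not a specific product's pipeline.

```python
# Two-stage pipeline: bigger model for the complex step, smaller model for the
# constrained step. Any OpenAI-compatible endpoint and model names would work.
from openai import OpenAI

client = OpenAI()

def ask(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

ticket = "Customer says exports time out for workspaces with >10k rows."

# Step 1: the complex step on a larger model.
plan = ask("gpt-4o", f"Summarize this bug report into three concrete debugging steps:\n{ticket}")

# Step 2: the simple, constrained step on a smaller model.
title = ask("gpt-4o-mini", f"Write a one-line issue title for this plan:\n{plan}")

print(title)
print(plan)
```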

John: Yeah, I mean it reminds me a lot of just good product development and understanding the user space and what users are trying to do ultimately and where you can add value with the software.

The paradigm really seems to be very hyper focused in my eyes on that chat window and being this all-consuming God product that is just like it'll do all of it because you're chatting with God basically or something, right?

Adam: Yes.

John: Yeah. So I love what you said there about like scoping it down and the product and really understanding that space.

Adam: I think we're going to see a lot more of those inbox looking products where you've got the set of agents that are accomplishing tasks. And I've actually found with the evals, you can set it loose. You can run such a big experiment on every model if you want that. It's just a matter of how much you want to spend on inference.

Like, I built in some throttling because I had a bad experience where I spent about $50 in 90 seconds. So I think we'll see that as more of a design pattern, especially because you can already accomplish so much with these existing chatbots that have done a great job of getting access to your entire computer and your cloud accounts and everything. So it's pretty hard to compete with that.
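
A rough sketch of that kind of throttling, under the assumption that you track estimated cost per call and stop a run before it blows the budget; the prices and token counts here are made-up placeholders.

```python
# Cap the estimated spend of an experiment run before it becomes a surprise bill.
class BudgetExceeded(Exception):
    pass

class SpendThrottle:
    def __init__(self, budget_usd: float):
        self.budget_usd = budget_usd
        self.spent_usd = 0.0

    def charge(self, prompt_tokens: int, completion_tokens: int,
               usd_per_1k_prompt: float, usd_per_1k_completion: float) -> None:
        cost = (prompt_tokens / 1000) * usd_per_1k_prompt \
             + (completion_tokens / 1000) * usd_per_1k_completion
        if self.spent_usd + cost > self.budget_usd:
            raise BudgetExceeded(
                f"Run would exceed budget: ${self.spent_usd + cost:.2f} > ${self.budget_usd:.2f}"
            )
        self.spent_usd += cost

# Usage: check the throttle around every model call in a sweep.
throttle = SpendThrottle(budget_usd=5.00)
throttle.charge(prompt_tokens=1200, completion_tokens=400,
                usd_per_1k_prompt=0.005, usd_per_1k_completion=0.015)
print(f"Spent so far: ${throttle.spent_usd:.4f}")
```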

John: Yeah, you had mentioned MCP and tool calling which, for those unaware, is the hot thing right now in enabling AI agents to go and do additional things, have additional capabilities with third party APIs and all these things.

And back in April we had Steve Manuel on the podcast talking about and unpacking MCP really early on, before the hype cycle had even hit. And Hev, I know that you've been in deep building with MCP, so I'm very curious if you could give our listeners kind of your hot take on where MCP is at, where people should be focusing in the space.

Adam: Yeah.

I would say the hot take I have is that I have zero MCP servers hooked up to any of my agents.

Like there's a couple that I, if I was doing a lot of Jira, I probably would hook up to Atlassian, but I just have the luxury of not doing a lot of Jira at the moment.

John: Yeah.

Adam: And the reason for that is--

I do think we're in the glory days of local tool calling.

And pretty much all my workflows are enabled by some CLI I have locally, and I don't really need an MCP server. I can set up a skill for Claude, really keep that context window much more reasonable and not have to bloat it up with a bunch of MCP tools.
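
A hedged sketch of local tool calling in that spirit: wrap a CLI you already have (here, plain git) as a function tool instead of standing up an MCP server. The model name is a placeholder, and this is not Claude Skills' actual format.

```python
# Expose a local CLI as a tool the model can call; no MCP server involved.
import json
import subprocess
from openai import OpenAI

client = OpenAI()

def git_log(n: int = 5) -> str:
    """Run a local CLI command and return its output as the tool result."""
    out = subprocess.run(["git", "log", f"-{n}", "--oneline"],
                         capture_output=True, text=True, check=True)
    return out.stdout

tools = [{
    "type": "function",
    "function": {
        "name": "git_log",
        "description": "Show the most recent commits in this repository.",
        "parameters": {"type": "object", "properties": {"n": {"type": "integer"}}},
    },
}]

messages = [{"role": "user", "content": "What changed recently in this repo?"}]
resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=tools)
msg = resp.choices[0].message

if msg.tool_calls:
    messages.append(msg)
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments or "{}")
        messages.append({"role": "tool", "tool_call_id": call.id,
                         "content": git_log(**args)})
    final = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    print(final.choices[0].message.content)
```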

John: Yeah.

Adam: At the same time as I'm building Vibecheck, I'm looking at these agent use cases for if you want to run a model in the cloud that's going to accomplish a specific task, the MCP model there basically unlocks a capability to kind of access really anything that could be behind an API and, and do so in an isolated environment where you can like put some guardrails on it. It's only running with--

You know, I like to think of agents as, you're going to assign it to do a specific task with a specific model and a specific tool. And MCP does work quite well for that.

John: Yeah. It's interesting. I also don't have a lot of stuff hooked up with MCP right now, and I notice that almost as a litmus test. Brian, where's your MCP usage at these days?

Brian: So MCP is a year old. Like, we're literally like a year into it. Like, the launch of this podcast, one of the first episodes was with Steve Manuel about mcp.run. And over the summer I sat with the Continue team, and we do this daily chat about our product and just go around the table, give feedback.

And we went around the room, all the engineers, and we were just like, "who's using MCPs?" And like, no one. I was the only one. And I happened to have two MCPs I use pretty regularly. Supabase, actually that's the only one I use pretty regularly. Playwright I use occasionally but I see the complete shift.

So I'm based out here in the Bay Area and I think MCP is maturing rapidly now in the last like literally 60 days. So I think where maybe three months ago it'd be like ah not really doing much with it. I think now people are realizing okay, we kind of left a lot of money on the table I guess per se or a lot of like opportunity there.

Like Sentry, for example, I love their MCP because they have their AI Seer thing that will take your Sentry error and give you like a progressive step of: go look at the error message, go get some actionable steps, go track how many times the steps fix themselves and then go report back to Sentry.

It's still pretty naive, but I could see the sort of Play-Doh of what would be useful with specifically the Sentry MCP and why I wouldn't go directly to the API.

Adam: I've noticed there's a lot more like hosted MCPs which I do think are kind of like the breakthrough of like going from something you can kind of accomplish another way. Like I said, I am able to accomplish a lot of the tool calling goals I have through like a local CLI I was already using.

Like I said, Jira happens to be a gap. There's not really a good Jira CLI. Maybe you can go like probably find somebody's open source one or something. But Atlassian has a good MCP server and that's a good use case of create this issue for me.

And certainly we have an MCP example on Vibecheck and it's linear and it's another good hosted one. I don't happen to really use Linear in my day to day, but it works pretty well.

Brian: We're heavy Linear users at the day job and we've gone down the rabbit hole of building. So Linear has a really good entry point for building agents inside of Linear. They shipped it about a month ago. We were using it pretty heavily in the last three months as they were shipping these different pieces. So if you think of like a GitHub app or like an OAuth integration, like those types of interfaces, that's how Linear has now been agent friendly.

Adam: Interesting.

Brian: It's great. You can assign an agent to a Linear ticket. It can go fill out your description. Because most engineers are like, "title: like bug happened yesterday."

Adam: Yeah.

Brian: And they're like, okay, what's the description? Like, how did we get here? Where's the details? And what we've done, we have a Slack integration that will hopefully launch pretty soon publicly, where we can connect Slack to Linear and Linear to Slack. And they now have a first party experience of that, where a feedback channel or a bug channel is like a bunch of threads of stuff people going back and forth on.

And then you could just be like, "@continueagent," go kick off a Linear ticket to track all the stuff and get out of Slack. And then you don't have to constantly try to figure out, where did we start this conversation? Where did it end? Is it fixed?

Adam: Absolutely.

I mean, I feel that's one of the bigger unlocks of AI and agents and everything: building in a different way. Like, the fact that everything all the way from the user feedback point to agent pickup is automatable is pretty wild.

John: Yeah. You had mentioned skills in Claude Code, which I do find very fascinating and reminds me of some of the stuff we had talked about with Solomon Hykes about Dagger, and how you can use that sort of containerized engine and SDKs to build a bunch of really personalized agents and just kind of have these things that get spun up really quickly.

And it feels very similar with Claude Code Skills, where you can have like a little Python script or really even just have Claude go and actually build your little CLI as a skill for it to then go and use itself. And I sort of see the flywheel there, maybe, as the future.

And it makes me wonder like where Anthropic's bigger, broader head is in the space. Because they obviously went and invented MCP, said that like this is a thing that we want even before Claude Code, which is kind of crazy to think that MCP was a thing before Claude Code was even really available for, you know, people to go use coding agents and all this stuff.

Yeah, I don't know, it feels like things are still very rapidly in flight and changing all the time.

Adam: Definitely.

The pace is overwhelming. It's hard to see where maybe there's winners or losers. It does seem like MCP has come a long way in a year.

And I think it's found some adoption in, like I said, some of these places where there's some, especially like CLI or local tool gaps. I think Skills fills that nicely. But MCP, I do feel like, has this chance to kind of achieve escape velocity, where the Airbnb MCP is the number one app and you can easily book your whole vacation or something.

Whereas Skills, I think, will always remain kind of developer focused. I have a skill that writes evals in my DSL.

John: Oh nice. That is kind of the thing: it needs a development environment to be able to run and build a skill.

The other thing this makes me think of is OpenAI's new apps SDK, which actually uses MCP kind of as the bootstrap to go and, you know, pop up with these little, really, I guess, web widget things inside of ChatGPT's window. And that would be the Airbnb experience for the everyday person who's not technical and plugging these MCP servers into whatever by themselves, which is kind of a gnarly user experience.

Instead, there can be these little app widget things that have MCP as the sort of foundation.

Adam: Yeah. And actually one of those MCP projects I worked on, we did exactly that. Like the tool call.

John: Oh, nice.

Adam: We created a custom chatbot and it had a listener, and the tool call would push an event to the listener, which would render an embedded HTML element.

John: Yeah.

Adam: Because we kind of saw this need of like, if you want to put controls on it, like just like throw out the HTML and allow them to use the dropdown menu.

Brian: Yeah.

Adam: Rather than just like riff off, you know, "are you married? What's your contribution limit?"

John: Yeah. Yeah.

Adam: So definitely think that step is going to be really important. And I kind of wonder what the MCP response is going to be to that because it does seem. I mean I'm sure it will be just kind of the W3C standard version of the proprietary thing OpenAI puts out.

John: Yeah. My very hot take is that all of these AI labs are trying to obviously find ways to monetize the crazy amount of compute commitment spend that they have coming up, you know, the billions and billions of dollars. And they're trying to basically create the app store for AI and monetize that where, like, that was obviously very lucrative for Apple. And I guess really anybody who like, has a monopoly on like how you run the thing for your phone, for your computer, for your AI agent.

So the apps SDK feels like, oh, we can capture a bunch of other companies using our platform to get some work done, book your Airbnb, do whatever, create the app store and then take 3%. So I also am very curious how Anthropic will respond, or I guess even Gemini. I'd be surprised if everybody just leaned into OpenAI's apps SDK.

Brian: Yeah, I looked at the apps SDK. It feels way more like Zapier than like building your own MCP. It's like a lot of point and click.

But the one thing, my thought is, so we started this part of the conversation with not a lot of people using MCP. And I think that's still true today. I think there's just not a lot of... like, where you have an app store and there are a bunch of fart apps, the fart apps like 15 years ago were actually providing value. Like, I can install it, I can show it really quickly.

And I think with MCP today, outside of the hosted ones, a lot of it takes too much work for people to even get value. So there's more pain to get to the value: setting this thing up, making sure you have a token, and then not exposing it and making it secure.

From my world, a lot of enterprises are building MCPs themselves. Well they'll see like GitHub's MCP, but then they'll just rebuild it behind their firewall because there's just too much friction in trying to maintain and manage all the tool calling.

Like, everyone's solving stuff piecemeal, slowly. Like, GitHub has like 200 tools or whatever the crazy number is, and now you can selectively enable tools. But six months ago everyone built their own GitHub MCP. So it's almost too late now. You gotta wait another six month cycle for people to adopt that.

So my two cents is like, it's still pretty messy and like we need a bit of a cleanup before we can see like a proper app store. So Anthropic gave us like the MCP registry, but it still feels like it's missing a lot of pieces.

Actually, what it really feels like is GitHub Actions, back when I was at GitHub. We weren't hosting the actions, but we were referencing them, and it was all over the place on what actions did what and who made these and which ones were verified and how we did verification. That's what the registry feels like today.

John: Interesting. Well, I think that's a good note to take us to Reads. So Hev, I gotta ask you, are you ready to read?

Adam: I'm not ready. What do I do?

John: Okay, so I had a couple reads I can kick us off with, two kind of short ones. The first is this article that Montana becomes the first state in the United States to enshrine the right to compute into law.

And for the astute listener out there, they know that I, you know, love Stallman-esque philosophical conversations. And obviously this is an open source podcast. We love to talk about open source licenses when we can and the right to compute and being able to fork stuff.

But I just thought this was so interesting and that they would go and you know, especially in the age of AI, where being able to run your own compute should be like basically enshrined in that state's constitution.

No idea what that means for the future or what practically it even means, but I think it's cool. So go Montana. What do you think, Hev? You're very close to Montana out in Colorado.

Adam: Yeah, I mean, I think that's a pretty insightful thing. It's a pretty surprising thing and I would love to see Colorado adopt something similar. I see where they're coming from. It does seem like it's kind of like headed towards utility status. So, yeah, that'll be interesting.

Brian: Yeah, I was going to say like, if this was added to your electric or your gas bill like that would be wild.

Adam: There's been something similar, and it's been wildly successful in Colorado: the city of Longmont, Colorado decided to create its own wi-fi, its own, you know, telecom network. They're a telecom and they provide the entire city with free wi-fi. You just get free Internet for living there.

Obviously it's built a little bit into your taxes or whatever. But yeah, it's like excellent. It's extremely good. The IBM campus is just right around the corner. So they're near fiber and everything and people rave about it, it's one of the best things of living there.

John: Well, and this always keeps, you know, the other Internet service providers in Denver very afraid. I think it keeps prices low in Denver and speeds very high. Like, you can get really good fiber in Denver for, I don't know, like 60 bucks a month or something, which is just insane.

And I think they know that if they did anything crappier, the people from Longmont would be like, "hey, we did this. And it's amazing at like, 10 gigabit speed," or something.

Brian: Yeah, I actually remember it. So in Oakland during the pandemic, I think I had to upgrade because we did some live streaming and when everyone was, like, moved to stay in your house, so we did some live streaming from some large events to like, 20,000 people with like a 10.

So they had me upgrade my Internet. And I remember, like, going from whatever I had, I might have had, like, 25 megs to like 100. And it was like, over 100. I was paying nothing. And then I'm going up to like, 120 bucks a month for Comcast, which got me to start lurking for other alternatives.

So I am on fiber now and I'm super happy about that. Thank you, Sonic out here in the East Bay. But, yeah, that's wild. Like Longmont. I think I might be looking for houses pretty soon.

John: Yeah. My next read for today is actually one from Fast Company, and I saw this on Hacker News, and it definitely resonated with me. It's titled AI Isn't Replacing Jobs, AI Spending Is.

And something funny that I noticed that they kind of briefly touch on in this article is that, you know, a couple weeks ago, Amazon announced they were laying off, like, 30,000 people. And it was a very big layoff across corporate, which was touching all kinds of people in engineering and all kinds of stuff.

And then the news cycle kind of went and came and it was gone. And then they announced they were going to spend x billions of dollars with OpenAI, some $30 billion, I think, in compute commit, again, the bubble grows ever bigger and bigger.

But I think some of the commentary was that, like, "oh, well, that spend had to come from somewhere. And a good place to go find that spend is in jobs," because then, you know, you can loosen it up a little bit. You don't have as much commit to salary, so then you can commit to spending on compute.

Weird times. Very weird times we're, we're living in.

Adam: All right, well, I got one.

John: Yeah, get it.

Adam: It's really the story of something huge that has been built and just recently discovered: the largest spider web in the world discovered in a cave in the Balkans. Something like 1400 square feet or something.

John: Oh, my gosh.

Adam: You guys see this?

Brian: Oh wow.

Adam: 1140 square feet and it's like inches thick, put together by two different species. It's so creepy.

John: Was this that video where the person was like?

Adam: Yes, there's a video and they're like--

John: And they're like poking it? Yes, I did see this and I quickly closed the video. It was like something straight out of Lord of the Rings. Like Shelob was just going to be there.

Adam: It looks like stone or something too. It's wild.

Brian: Yeah, yeah. I immediately go to Indiana Jones. I imagine there's a future episode with his, hopefully not Shia LaBeouf, but maybe his ancestor.

John: It is interesting because, like, you know, we're all very online people, I guess, you know, building computers and software and all this stuff, but it's crazy to think that in the physical world there's still things that we have not really discovered or places we haven't been.

That always boggles me a little bit to think that like oh wow, there's just like potentially a whole species of animals or insects or something.

Adam: We're not the only ones surfing the web.

John: Not the only one surfing the web. Wow. Pun of the episode. I love it.

Brian: Amazing. Yeah. Actually I have a couple picks. So this came up in our conversation, Linear. This is actually about the self-driving SaaS. So Chatbots Are Back. So I worked at GitHub, I don't know if you got that background, Adam, but I worked at GitHub for five years. We had this very extensive chatbot system on how we deploy and ship stuff at GitHub.

And part of the reason why GitHub's probably gonna stick on Slack for the longest time is because Teams doesn't have this type of functionality. But Linear is now building this within their products. So I kind of hinted at this a bit with how we're leveraging Linear at Continue.

So yeah, it's on their homepage, it's the self-driving SaaS. I would definitely recommend people check it out if they're interested in seeing the value that Linear offers.

And then the other thing that is kind of tangential is, Andrej Karpathy talked about the self-healing infrastructure early this year at the YC event. And I think we're almost there, where you can set up webhooks so you can connect your models and your LLMs to start taking inbound, like, Sentry issues and a bunch of other stuff, to basically react and progressively enhance your product, or whatever you're sort of working on.

You know, it's hit or miss if people are super trusting and like in like one shot stuff and get fixes. But what we're doing at Continue is like we're building a--

So like, all of our Sentry errors go in this staging area within our product, and then you can kick off... you get a bunch of errors that sort of look alike. You can kick off an agent with like a hundred errors that all hit within the last couple of days, and now you have all this context you can go send to the agent to then go debug and report back.

So sometimes it's a Linear ticket, sometimes it's an actual PR. But we're now trying to sort of measure, you know, low, medium, high, which things we can confidently give the agent to go fix because they happen frequently and we just don't have time to go actually clean those up, or what is going to require like a full on Notion doc to start doing some remediation and figuring out how we got here and how we fix it.

So it's been fun to watch. Hopefully we'll have a blog post about this soon. But check out Andrej Karpathy's self healing infrastructure post.
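
A hedged, simplified sketch of the triage pattern Brian describes, not Continue's actual implementation: group look-alike errors and only dispatch an agent once a group is frequent enough to act on confidently.

```python
# Group error events by a fingerprint and dispatch an agent only for frequent groups.
from collections import defaultdict

# Hypothetical events, shaped loosely like what an error tracker might send.
events = [
    {"type": "TimeoutError", "message": "export timed out", "route": "/export"},
    {"type": "TimeoutError", "message": "export timed out", "route": "/export"},
    {"type": "KeyError", "message": "'user_id'", "route": "/profile"},
]

def fingerprint(event: dict) -> tuple:
    """Group errors that 'look alike': same type and route."""
    return (event["type"], event["route"])

groups: dict[tuple, list[dict]] = defaultdict(list)
for event in events:
    groups[fingerprint(event)].append(event)

CONFIDENCE_THRESHOLD = 2  # only hand frequent, well-understood groups to an agent

for key, group in groups.items():
    if len(group) >= CONFIDENCE_THRESHOLD:
        # In a real system this would kick off an agent run with the grouped
        # context (stack traces, recent deploys) and open a ticket or PR.
        print(f"Dispatch agent for {key} with {len(group)} occurrences")
    else:
        print(f"Hold {key} for human triage ({len(group)} occurrence)")
```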

John: Yeah, I thought it was such an interesting post. It kind of goes back for me to the whole like AI economics thing. I imagine enterprises will have a hard time or maybe not enterprises, but maybe like medium sized companies will have a hard time kind of stomaching the many thousands and thousands, tens of thousands of dollars in tokens just like kind of iterating in a loop and spinning on like-- because you know, you basically would have to eliminate the noisy neighbor problem in your logs and your errors to make sure that you weren't just sending endless cruft to these things.

Brian: Yeah, more than likely. Like, the idea is, I think we're not ready for the one shots. I think it's just when you get enough context, enough information to have confidence in the agent. I guess if you have a tool, maybe call it Vibecheck, you can actually set an eval to identify: hey, that Sentry error, looks like it crushed it. Let's get more of those types and see what happens.

So yeah, like after this definitely drop the link to Vibecheck in the show notes. I honestly, personally would love to check it out and maybe point users to it.

John: Yeah, Adam we didn't get into this in the episode but I did want to mention that, yeah, way back when we worked at Pivotal together, you were really involved in the like observability space and Prometheus and Grafana and like deep, deep SRE practices.

On the whole like self-healing infrastructure stuff, do you see a lot of the same patterns emerging like, but now it's just AI agents?

Adam: Yeah, no, I definitely do. And going back to our MCP discussion, this is where the MCP just starts to make a lot of sense. You want these isolated things that are going to go do a tool call. Kind of like I said, you want to have these agents that have limited scope: the task they're going to try to accomplish and the specific tools they're going to call.

And it's easier to plumb that with MCP than it is to try to install stuff in some container so it could use it. I mean, that is a way, you could do it that way. But MCP kind of opens up, "Hey, any of our APIs we need to go grab, go hit this MCP server."

You know, Brian, you were mentioning Supabase. I'm seeing that as... like, I'm working on a project right now and I'm like, "oh, Snowflake has an MCP," and it's like, oh, that's a huge unlock potentially.

Brian: Yeah. I just gave a workshop at the Small Data conf. So DuckDB hosts this, and they got wind of how I used the DuckDB MCP and also DLT Hub's MCP and built this DLT pipeline to basically siphon data off of GitHub. And it was just more of like, hey, can I do it?

One, I can't do it without the help of AI. Two, like let me use these MCPs to get me unblocked to build this product. And where I find MCP working really well is like onboarding. Like I didn't know anything about those two open source projects but I onboarded myself pretty quickly into them thanks to MCP.

Am I using them day to day all the time? No, but when I have to go do some cleanup, or if I wanted to build like a larger feature or new pipeline, I will bring those in for that experience.

Adam: Yeah. And I mean, one of the things with the Snowflake one that I'm kind of excited about is that Snowflake has put a lot into text-to-SQL. And so you can have your own kind of internal stuff to query your warehouse, and it's almost basically free. You can kind of just set it up and be like, if I work in healthcare, "show me patients where this happened."

John: Yeah.

Adam: And definitely that's a game changer for them.
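
A generic, hedged sketch of that text-to-SQL pattern, not Snowflake's actual API or the Arctic models: an LLM translates a natural-language question into SQL against a known schema, and the SQL is run against a local in-memory database for illustration.

```python
# Natural-language question -> LLM -> SQL -> query results.
import sqlite3
from openai import OpenAI

client = OpenAI()

SCHEMA = "CREATE TABLE patients (id INTEGER, name TEXT, diagnosis TEXT, admitted DATE);"

def text_to_sql(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; the episode mentions Arctic text-to-SQL models
        messages=[
            {"role": "system",
             "content": f"Translate the question into a single SQLite SELECT statement. Schema:\n{SCHEMA}\nReturn only SQL."},
            {"role": "user", "content": question},
        ],
    )
    sql = resp.choices[0].message.content.strip()
    if sql.startswith("```"):  # tolerate a fenced response
        sql = sql.strip("`").removeprefix("sql").strip()
    return sql

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.execute("INSERT INTO patients VALUES (1, 'A. Doe', 'flu', '2025-01-02')")

sql = text_to_sql("Show me patients diagnosed with flu.")
print(sql)
print(conn.execute(sql).fetchall())
```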

John: Yeah, there were actually a few models we were looking at using at the Linux Foundation, their Arctic text-to-SQL models, which are very good and very cutting edge and can start to unlock some of those niche use cases that you were talking about, where it's like maybe you don't need the whole entire chat box window thing or chat agent, but something that can take a shorter query, turn it into SQL, hit your Snowflake, you get a bunch of data, and then that can unlock people who don't maybe have the data expertise.

It's very cool. But I think that's all the time we have for today. So Adam Hevenor, I wanted to say, thank you for joining us today. Check out Vibecheck, GitHub.com/hev/vibecheck or vibecheck.io. And remember, listeners, stay ready.