42 MIN

Ep. #91, Every Failure Becomes an Eval with Janaki Vivrekar

about the episode

On episode 91 of o11ycast, Ken Rimple and Jess Kerr sit down with Janaki Vivrekar. Janaki shares how Amplitude is building AI-powered analytics agents, why evaluation frameworks are becoming essential to AI product development, and how teams can use observability techniques to improve agent performance over time. The conversation explores eval-driven development, production feedback loops, and the challenges of helping AI systems reason about complex business data.

Janaki Vivrekar is an engineer at Amplitude, where she works on AI-powered analytics systems and agentic workflows. Her work focuses on helping organizations derive insights from behavioral data while building reliable, evaluatable AI products. In addition to her engineering work, Janaki is a writer and artist whose projects explore the intersection of technology, storytelling, and human experience.

transcript

Janaki Vivrekar: I think something surprising for us was, whenever the new models drop, we upgrade our systems to use them and we hope we get a little bit of a lift on our performance. But at some point there's a little bit of a cap to that is what we realized.

With upgrading to Opus 4.6, we ended up seeing a lift on some of our long running tasks, or tasks that take multiple steps, but for basic tasks the performance remained about the same compared to like Sonnet 4.5. And so that was surprising to us at first, but then it made a lot of sense because that's what Opus 4.6 is good at, is like longer running tasks.

And so we realized that a lot of the meat of how well our system does is related to the harness around our agents rather than just upgrading the models.

Jessica "Jess" Kerr: The harness, like the deterministic code in the agent?

Janaki: Exactly. And just like, what are the specific tools that are accessible to the agent, what is the logic that it can run? And how do we encode being able to do analytics into a system rather than just using state of the art models?

Jess: Oh. Okay. Tell our audience who you are, where you work, and what you mean by analytics.

Janaki: Yeah. Hi everyone. My name is Janaki and I'm an engineer at Amplitude.

I help build some of the systems that allow us to do analytics much faster. And so analytics includes the world of product analytics, marketing analytics, any sort of information that you want to know about how users are interacting with your product in production and how they're behaving with your tools and what you can do to make them better.

Jess: Nice. So the agent that you're working on that's doing the analytics, does it have a name?

Janaki: Yeah. So we just launched a product called Global Agent at Amplitude, which does the job of an analyst. And that involves not only asking what is going on in my data, but also why? And that's really the job of an analyst is to investigate why spikes are happening, why dips are happening, and then what are we supposed to do about those? What are we supposed to do about that information?

Ken Rimple: So you're building these agents. Specifically this is like a persona almost, right? You're kind of building a persona of an agent that has the workflow and the thought process of a human analyst to a degree. So how do you break that apart? How do you actually break that into steps and make sure each of those steps is being properly processed and doing what it's supposed to do?

Janaki: Yeah, that's a really good point because there's a lot of steps to doing the job of an analyst really well. Like if you imagine, like Ken, if you were an analyst and I work with you and I want some information from you, I'd probably tell you, like, "hey, I noticed there's like a surge in signups happening in our product. Go and figure out why."

And there could be a number of reasons why that's happening. It could be something related to just organically users finding your product more, or it could be the result of an experiment that you're running or like some product improvement that took place.

There could be something globally happening that is causing people to discover your product more. And so you as an analyst have to piece together information across the global sphere, across what's happening in your product, and break that down by segments, do a lot of hours of analytics and time consuming work to arrive at an answer.

And so we had to encode all of these different steps into an agent that not only understands natural language questions, but is also able to query the correct data sources, understand your taxonomy for how you instrument all these different events that users may be doing in your product, build the correct analytical workflow, and then do the actual analytics behind the scenes and come back to you with a cogent explanation for why you have certain observed behavior, for example, in your product, and then also tell you potentially like, what you can do about it.

Like, "maybe this experiment's really successful, you should consider rolling it out" or, "it looks like this cohort of users is struggling with your product. Maybe you should deploy a survey or add a guide to help them better learn how to use it."

Jess: Okay, okay. Where does it get all this context for the why? Is it reading the news?

Janaki: Yeah, there's a lot of different places where it can find all this data. So backing up a little bit, here's a little bit of context about Amplitude. Amplitude, if you use it, has a whole host of behavioral data.

Like your users, whenever they interact with your product, are doing lots of different actions. They're clicking on buttons, they're scrolling, they're hovering on things. And whatever provides you, as a builder, signal, you might have instrumented that already. So you're sending that data to Amplitude.

So Amplitude knows how users are interacting with your product behaviorally. And that's a lot of data. And so for a human to sift through that, like I said, it takes hours and hours, but for an agent to look through that, it can do that in minutes.

Jess: Right. We use Amplitude at Honeycomb. And we can follow-- People go to this page and where do they get to from there kind of thing. But that is how an agent can get to what was happening, definitely. But for the why, you need to know what experiments are being run, what marketing campaigns are happening, did your CEO just speak at an event?

Janaki: Yeah, that's really where the golden ticket lies. I think an agent that's really good at doing analytics will be able to synthesize all of that information. So in Amplitude, there is information about product releases that have taken place.

If you're going in and like documenting that, there's information that you might have encoded into your projects on Amplitude, about any time based information that's relevant to making sense of spikes and dips in your data. But then also there's kind of this like synthetic influx of data where it's like, you know, maybe there's a spike happening and there's no product release or like, there's no global event or anything that you can be aware of happening right now, but we're still observing a spike.

How do you then go in and figure that out? And a really common workflow that analysts do is they segment by different properties. Like, okay, maybe this spike could be attributed to users in Latin America rather than other users. Or maybe only users on our mobile platform are exhibiting this behavior.

And so that takes hours to manually go in and segment my data by all of these properties. And so the Amplitude Global Agent is able to run those for you in the background and then come back to you saying like, okay, I was able to isolate this particular factor as a contributor to a spike or a drop or any trend that you're observing.
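The segmentation workflow described here can be sketched in a few lines. This is only an illustration, assuming events are plain dictionaries of properties; the function name `segment_contributions`, the property names, and the sample data are all hypothetical and are not how Global Agent is implemented.

```python
from collections import Counter

def segment_contributions(baseline, spike, prop):
    """Rank the values of one property by how much their event count grew
    between a baseline period and a spike period."""
    before = Counter(event[prop] for event in baseline)
    after = Counter(event[prop] for event in spike)
    deltas = {value: after[value] - before[value] for value in set(before) | set(after)}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)

# Hypothetical signup events for a quiet week and a spike week.
baseline = [{"region": "NA", "platform": "web"}] * 50 \
         + [{"region": "LATAM", "platform": "mobile"}] * 10
spike = [{"region": "NA", "platform": "web"}] * 52 \
      + [{"region": "LATAM", "platform": "mobile"}] * 40

for prop in ("region", "platform"):
    print(prop, segment_contributions(baseline, spike, prop))
# region [('LATAM', 30), ('NA', 2)]
# platform [('mobile', 30), ('web', 2)]
```

Running the same comparison over every property is the repetitive part an analyst would otherwise do by hand; the top-ranked value for each property is a candidate contributor, not a confirmed cause.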

And so automating that time consuming work is really the value proposition of Global Agent for some of these longer running tasks. That being said, it's also really good at doing the manual work of exploratory data analysis in collaboration with the user.

And so if I just want to explore and build charts and build analyses, I can use Global Agent to do that by just asking it like, hey, show me how my weekly actives have changed over the past year. Actually compare that against the year before. Or you can build complex analyses with Global Agent that understands your question in natural language and is able to map that into how Amplitude works behind the scenes.

Ken: So it sounds like there's a lot of tools that are being built as well as prompts and such that can parse out. I was reading some of the posts and I guess one of the harder challenges has been query to SQL kind of mappings and figuring out what you actually want to ask.

So it sounds like you've got this iterative process. Obviously everything is iterative in every world, realistically. But in these development loops, how do you determine the quality of what your various steps are doing?

Janaki: Yeah, that's also a really good question because quality can erode at various points in the process as well.

If you think about how accuracy loss compounds in just 10 steps, if each step is about 95% the way there, or 95% accurate, that leads to your response being, with napkin math, about 60% accurate. And so that's a huge jump. Even if you're doing really well each step of the way, ultimately your answer may be more or less subpar.
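The napkin math works out as follows, under the simplifying assumption (ours, not stated in the conversation) that each step succeeds independently of the others. The function name is hypothetical.

```python
def end_to_end_accuracy(per_step_accuracy: float, steps: int) -> float:
    """Chance that every step is right, assuming the steps are independent."""
    return per_step_accuracy ** steps

# Ten steps at 95% each compound to roughly 60% overall.
print(f"{end_to_end_accuracy(0.95, 10):.1%}")  # 59.9%
```

The same function shows why per-step evals matter: raising each step to 99% brings ten steps back above 90% overall.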

And so the way that we try to make sure that our system is doing well on all accounts is to make sure that we're doing our absolute best each step of the way. And this is where evals comes into play. This is where we have evals on specific tools. We have evals on like the overall response quality. We have evals on making sure that the responses are formatted in a way that like users perceive them to be useful.

There's a lot of aspects of what makes a response from an agent good.

Jess: Do you do any of these evals in real time? Is there a double check step anywhere?

Janaki: Yeah, so we do run certain checks in real time.

Jess: Okay, so they're checks in real time. That makes sense.

Janaki: Yeah, yeah, there's checks in real time and that lets us actually be able to go through some of these traces of chats that users are having in real time manually and see like, "oh, hey, this one was flagged with a response quality that is less than 50% or that's what our evaluator system deems." And so maybe it's worth looking into manually.

Now, that being said, a whole lot of users are using this product already and so there's a huge volume of these. How do we even figure out, out of all the issues that may be taking place, where to focus our attention? Where do we continue to invest our product resources into making our system better?

You know, certain users may be facing issue A and then others may be facing issue B. How do you measure which one's more important? And as a real world product development team, we have to be able to make those trade offs. And so if you want, we can talk about how to think about being efficient in that process too.

Jess: How do you think about being efficient in that process? Haha.

Janaki: Haha. Yeah great question. I think being efficient there actually starts with being really inefficient at first and not just being really bad at it, but being a little slow. And so this is what I mean: At Amplitude, we were so excited to roll this out to customers and see an incoming flux of real users using our product. We're getting feedback from them in a couple different ways.

We have a survey attached to this where they're able to give us 0 to 5 stars and then they can write in feedback. And so that's great manually written human feedback, super high signal for us to dive into. But then if you think about how many users even bother to write feedback, we need better ways to determine whether they left the conversation feeling happy or sad.

And so we have a thumbs up, thumbs down attached to it as well. So that's another lever of manual feedback that we can look into attached to chats. Now if you think about how many users even bother to give us that thumbs up, thumbs down, we still want to know how good those chat conversations are.

And so that's where we decided to come up with an auto tagging system where we can see what sorts of issues are happening in each conversation. What do I mean by an issue? Maybe the agent failed to find the context, the right data context for answering the user's query. And that can yield a pretty disappointing result.

Maybe it executed the wrong tool calls, maybe it did the wrong workflow, or maybe it got the right data but it just interpreted it wrong. Or maybe there was a UI issue. And so there are a lot of things that can go wrong and decrease the quality of the end user experience.

Jess: How do you detect any of those?

Janaki: That was the question we were asking ourselves, where we're able to manually look through these conversations, but how do you automate that process? And so the part that was really slow was actually going through manually and tagging each conversation with the issues that we perceived were happening.

And what we did was, based on a combination of user feedback, manual internal testing, and then running a bunch of responses through our internal AI system, or internal responses through it, we were able to come up with a taxonomy of error codes.

And so these error trace codes captured some of the different types of issues that I enumerated earlier, where we're now able to manually go through each conversation and tag them with like, oh, user was missing context, or user just asked a bad question, or the system failed to make the tool call, for example, or the chart was malformed and there was an error on the front end.

And so manually tagging was the first layer, and we're still actually doing a lot of that and investing a lot of time into doing manual tagging. Number one, it builds our user context, but number two, it helps us figure out what we need to encode into a system that can automatically detect when these issues are happening.

Ken: Okay, so you've gone through that. Now you're on to the next step. So where are you headed from there?

Janaki: Yes, where we're headed from here is to encode a lot of information about the volume of errors that we're seeing on many of these different counts and instruct an agentic layer to be the evaluator for determining, fairly accurately, when certain errors are taking place.

Jess: That's the auto tagging step?

Janaki: Yeah, yeah.

Jess: Okay, so you did a bunch of manual tagging, and also you said you had examples from internal responses. So you're like dogfooding your product where there you can really understand the data that it's working with and whether it got it right and really get the examples for the issues that you-- Do you then use those in prompts to get the agent to recognize that in other scenarios?

Janaki: Exactly. That's exactly the flow that we're dealing with. Now that we've identified a pretty tight set of error codes, we're able to instruct our evaluator system, "hey, keep an eye out for these sort of errors and whenever you think one of them is happening, one or multiple of them is happening in an incoming chat thread, flag that for us."

And now we're able to use our own product, actually, that we built, called Agent Analytics, to filter down by particular error codes or see that, "hey, this particular error code has like a huge volume of issues. We should invest our product development efforts here."
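Filtering by error code and looking at volume amounts to a simple count over tagged traces. The sketch below is illustrative only: the trace shape and the error code names are assumptions made up for the example, and it does not use or represent the Agent Analytics product.

```python
from collections import Counter

# Hypothetical tagged traces; a trace can carry zero, one, or several error codes.
traces = [
    {"id": "t1", "error_codes": ["MISSING_CONTEXT"]},
    {"id": "t2", "error_codes": ["WRONG_TOOL_CALL", "MISSING_CONTEXT"]},
    {"id": "t3", "error_codes": []},
    {"id": "t4", "error_codes": ["MALFORMED_CHART"]},
    {"id": "t5", "error_codes": ["MISSING_CONTEXT"]},
]

def rank_error_codes(traces):
    """Count how many traces carry each error code, most frequent first."""
    counts = Counter(code for trace in traces for code in set(trace["error_codes"]))
    return counts.most_common()

def traces_with(traces, code):
    """Filter down to the traces tagged with one particular error code."""
    return [trace["id"] for trace in traces if code in trace["error_codes"]]

print(rank_error_codes(traces)[0])           # ('MISSING_CONTEXT', 3)
print(traces_with(traces, "MISSING_CONTEXT"))  # ['t1', 't2', 't5']
```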

And so the Agent Analytics product that we also shipped recently is built for teams like us trying to look at traces of agentic conversations and figure out what to do and how to make them better.

Jess: So is Agent Analytics, is it an agent or is it analytics? Is it analytics about agents or is an agent about analytics?

Janaki: It's analytics about agents. Great question.

Jess: Haha. Okay, okay. So it's actually looking at the traces for you and at that point the traces are already tagged?

Janaki: Yes, they're auto tagged. And then also there is a huge like manual layer on top of it as well, where, as humans looking at these traces with the auto tags from a system, you can also go in and manually update them, you can add notes, you can iterate on these auto tagged layers.

Jess: Okay, so your traces, which are describing the whole conversation that the agent had and how it performed and all those tool calls and stuff?

Janaki: Mhm.

Jess: All right, so those are something you can look at and then they get tagged by the auto flagger, or auto tagger. Also they can be looked at by humans and tagged by humans.

Janaki: Exactly. And that iterative process, obviously the auto tagging is like a great to have because it automates a lot of the hard work of going in and figuring out like, okay, what do we do next?

You don't always have customers writing into you like, "hey, I would really like this particular tool to be more accurate." They're not going to give you that feedback up front. You need to be able to figure it out from the overall response quality.

So that gives our product teams much needed direction into where to invest. Like, where could these systems be better? What is ultimately going to help users feel like the response quality of our agent is trending upwards? And our users may not even know what's happening behind the scenes.

And so we need to be able to figure out what the users want without them having to ask for it explicitly. And that's really similar to how pretty much all product development works. But it's a little bit amplified in this world of AI where there's a lot less transparency into how these agentic systems work.

Jess: True, true. Is the interface to Global Agent, is it like a chatty interface?

Janaki: It is, it is a chat based product. And so it's a chat interface where the chat shows up in context with where you're doing your analysis. And then also you can access some of these tools through our external MCP product. You can access it through Slack. Pretty much wherever you do your analytics work, this Global Agent and the MCP will show up alongside it.

Jess: That makes sense.

Ken: So I know we were talking earlier about like models and evaluations and so this is kind of like prompts, evaluation prompts, things like that. How do you control the iteration of those things and approval of changes of them and review of even the prompts themselves?

Like have you found workflows that work better for this as opposed to like, you know, I mean it is version control of files that have prompts in them after all. Right? I assume.

Janaki: Yeah.

Ken: Like what do you find there as being challenging or interesting?

Janaki: Yeah. Whenever we have significant changes happening to like either if it's prompts or tools or other parts of the agentic flow, we definitely run all of our evals and make sure that we're not seeing, first of all, any adverse effects or like we're not like suddenly like failing half of our evals or anything like that.

But then, if there is an intended effect, we also check that the changes that we're intending to make are taking place based on our set of evals. And so we do run our evals whenever we're trying to ship big or small changes.

Jess: Okay. We've been talking about evals on production.

Janaki: Mhm.

Jess: Is there a separate eval process pre-release for: We ask it the standard set of questions with this standard data set and see if it does a good job?

Janaki: Yeah, yeah, there definitely is an eval running process that we have offline as well and we try to run that very frequently as well, multiple times a week and also associated with whenever we're shipping changes. But these evals cover all sorts of things even--

Well, first of all, they cover how good is the response quality that we're seeing, like rubric based evaluations. Or how long are the responses taking? We shouldn't be shipping changes that are suddenly like making the response time skyrocket or anything like that.

Do the responses make sense semantically to an end customer? Are the responses violating any rules that we have set? Is the agent performing all the right tool calls? So we have evaluations for each of these different, I guess, correctness criteria for our agentic product.
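One way to picture running evals before shipping a change is a comparison of aggregate metrics between the current system and the candidate. This is a minimal sketch under our own assumptions: the metric names, the 5% tolerance, and the function `find_regressions` are hypothetical and are not described in the episode.

```python
# Whether a larger value is better for each (hypothetical) aggregate eval metric.
HIGHER_IS_BETTER = {
    "rubric_score": True,
    "tool_call_accuracy": True,
    "latency_seconds": False,
    "rule_violations": False,
}

def find_regressions(baseline, candidate, tolerance=0.05):
    """Return the metrics where the candidate is worse than the baseline
    by more than `tolerance`, measured relative to the baseline value."""
    regressed = []
    for metric, higher_is_better in HIGHER_IS_BETTER.items():
        old, new = baseline[metric], candidate[metric]
        improvement = (new - old) if higher_is_better else (old - new)
        if improvement < -tolerance * abs(old):
            regressed.append(metric)
    return regressed

baseline = {"rubric_score": 0.80, "tool_call_accuracy": 0.90,
            "latency_seconds": 12.0, "rule_violations": 0}
candidate = {"rubric_score": 0.84, "tool_call_accuracy": 0.90,
             "latency_seconds": 20.0, "rule_violations": 0}

print(find_regressions(baseline, candidate))  # ['latency_seconds']
```

Here the candidate improves the rubric score, which would be the intended effect, but the response time jumps, which is the kind of adverse effect the offline run is meant to catch.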

Ken: And in your article, by the way, I just want to take this one quick diversion because this is an interesting one. So in the one you posted on X, you were saying that early on when you were doing evaluations you were doing like a pass fail. So there was no such thing as partial credit. Right?

So that's something you learned as you were doing these agents. Right? And these evaluations of the agents, you were learning that like maybe the way we evaluate something needs to be more fine grained than just a true false.

Janaki: Yeah, right, exactly, exactly. Like for example, if a user has asked the Amplitude Global Agent to make me a funnel analysis and measure how many users fully complete registration if they scanned a code at checkout, for example, if this is like a storefront, the agent should be able to create you that funnel chart.

So there's like one check there. Like did it make you a funnel chart? Did it do the task that you asked it to? Does the funnel chart now have the six steps that we're expecting for this particular storefront? Did it use the right conversion window of an hour?

Are we showing you the right conversion rate and drop off? Is the response interpreting the chart? Does it have the right numbers that a user is expecting? Like the median or the delta between steps? Does it break down by platform? Does it have all of the information that the user has asked for?

And we encode all of that into like the expected insights that a user should be seeing for this particular case. So evals are a really great opportunity for you to predefine what you're expecting your agentic system to output.
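The move from pass/fail to partial credit can be sketched as a list of independent checks whose pass rate becomes the score. Everything below is an assumption for illustration: the response fields, the check names, and the equal weighting are hypothetical, and only the deterministic checks from the funnel example are shown.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Check:
    name: str
    passed: Callable[[dict], bool]

# Deterministic checks drawn from the funnel example; field names are made up.
FUNNEL_CHECKS = [
    Check("is_funnel_chart", lambda r: r.get("chart_type") == "funnel"),
    Check("has_six_steps", lambda r: len(r.get("steps", [])) == 6),
    Check("one_hour_conversion_window", lambda r: r.get("conversion_window_minutes") == 60),
    Check("breaks_down_by_platform", lambda r: "platform" in r.get("group_by", [])),
]

def score(response, checks):
    """Return the fraction of checks passed plus the per-check results."""
    results = {check.name: check.passed(response) for check in checks}
    return sum(results.values()) / len(results), results

# A response that builds the right chart but uses a one-day conversion window.
response = {
    "chart_type": "funnel",
    "steps": ["scan", "cart", "address", "payment", "confirm", "register"],
    "conversion_window_minutes": 1440,
    "group_by": ["platform"],
}

total, details = score(response, FUNNEL_CHECKS)
print(total)  # 0.75
print([name for name, ok in details.items() if not ok])  # ['one_hour_conversion_window']
```

A strict pass/fail eval would mark this response as a failure; the per-check breakdown shows that only the conversion window is wrong.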

Jess: And some of those are verifiable deterministically. Like did it make a funnel chart? Does it have the right number of steps? And some of that is interpretive, like is this response something that the user will be able to understand?

Janaki: Correct. Correct. And those things seem like they're a little bit more qualitative. But that's where our product building context comes in. We know that users find responses formatted in maybe a certain way to be more helpful. And so we can check whether responses are structured in that certain way and break down certain criteria of a response.

Jess: You could say, "oh, that's clearly too many em dashes."

Janaki: Exactly. Haha.

Jess: Or "dang, it said, 'you're absolutely right' again." What is the optimal level of emoji content?

Ken: Haha!

Janaki: Haha. Yeah, we've had so many excited AI esque responses and teaching the system to respond in a matter of fact way that's still approachable and friendly is also part of the way that you build a system that's perceived to be correct or perceived to be usable. Cut all the fluff and just get straight to the facts.

Jess: Yeah. And there's like a level of confidence that I want it to express.

Janaki: Mhm.

Jess: Which is not, "you're absolutely right. Everything you said. Yes, yes."

Janaki: Yeah. And you know, in the world of analytics, that's even more important because you can see a case where maybe a customer has seen like a huge dip in their data and they're panicking already. They're like, why are my signups going down? And then like, if you have an AI coming to you and being like, "oh, your system is down," that can be very alarming. And that would affect users' trust in this AI tool.

And so we have some pretty strong instructions to tell it to output hypotheses and not try to make determinations about facts. And that's important as an analytics assistant tool that it's only hypothesizing on potentially things that could be going wrong and handing control back to the user.

And like, "hey, these are five hypotheses that I have. And let me know if you want me to dig into any of these further. Let me know if you want me to take this action for you." And really using the human as a partner, rather than being like, I went off, did analysis and sorry, your product's down.

Jess: Right. Because if it gets too confident on one thing, it can take us in the wrong direction because it hasn't been reading the news and maybe it doesn't know that it's Super Bowl Sunday and all your users are just watching TV.

Janaki: Exactly.

Jess: There's always context that a human has that it doesn't.

Janaki: Yeah, yeah. And in analytics, the human having context is really key to like, not just related to world events, but also about your particular taxonomy. And what I mean by your taxonomy is how do you label different events that are coming in? And wouldn't you know it, a lot of customers have really messy taxonomies.

Things are labeled poorly. Things are labeled things that they used to be called 10 years ago and not necessarily what they're called today. How do you even interpret that as--

You know, this is something that would take you or I effort too, in terms of going in and figuring out how do I interpret this one customer's taxonomy when I'm not even that familiar with their product? And so that's also really difficult. And this is where we also allow customers to add in specific context related to their product.

And we let our agent read that context and be able to know like, oh, okay, this customer actually defines weekly actives in a way that's completely different from how I would have thought. And so let's use that definition or let's use these standard definitions for metrics. When they say clicked on the Checkout button, they're not talking about the event that's labeled checkout button, they're actually using this other event.

And so those sort of org specific contexts are able to be defined by users and the agent has access to those.

Jess: Yeah, that's so important.

Ken: So one of the things that I know I've struggled with is when you're dealing with building evaluations, when you just get started with them, there are libraries that have all sorts of like basic built in ones.

You're using an LLM as a judge. Right? So you're basically filling up a prompt with business specific things in it that are important to your customers. And then you have hopefully enough context with that, plus the data that's happening and being filled in along the way with all the analytics. What does some of that look like when you get it wrong?

Janaki: You mean like an evaluation where the agent is not finding the right answer yet?

Ken: Or more like you didn't really, as you're starting to build your evaluations, you didn't get the evaluator right. Like how do you debug those things?

Jess: How do you check your checker?

Ken: Yes.

Janaki: Yeah. Part of it is understanding exactly what you're trying to measure. I think when you get your checker wrong, it's either being too general, like trying to check like correctness in an under specified way, or it's being too specific in a direction where maybe like you know, this isn't exactly what you're trying to check, but you're getting the wrong signal or you're checking for the wrong signal.

And so really crystallizing what are the correctness criteria for your system that you care about is some of the pre-work that you have to do before you even start defining evals. You have to define the structure of your eval.

You have to be able to figure out, okay, I care a lot about whether the agent created the right type of analysis. And that's like one thing that I'm going to measure. That now is a well specified problem. If you step back and you're like, I want the agent to give me good responses that make a lot of sense. You know, you have to do a lot of work. What does it mean to make sense?

And you have to break that down into like, okay, it should tell the right story and be interpretable by an end user. But then it should also be interpreting data correctly. It should also be producing the right sort of artifacts.

Jess: It should be saying, "I don't know," when it doesn't know.

Janaki: Exactly. It should be refusing to do certain tasks that it can't do instead of making them up. It should not be hallucinating. And so these are all different types of correctness checks that you then have to break down and layer into different types of evaluators.

Jess: And then depending on those outputs, you're refining the prompts and the tool sets for each step of the Global Agent's work.

Janaki: Exactly. Refining those steps, sometimes adding more steps and realizing like, "oh, users really want to do this particular analytics task and we have no tools that can help them do this. Let's add one." And so that's where we use failures happening in real time to change where we're investing our product development efforts.

Ken: You had a whole article that you wrote, Making Stone Soup.

Janaki: Mhm.

Ken: Eval Driven Development for Analytics with AI, which we'll link to in the show notes. I think it's excellent. Really kind of talking about like start with some sort of framework of setting up your evals and put the stone in the water. Right?

Janaki: Correct, correct.

Ken: Is that how you started getting people interested in jumping in and adding some business context in the eval process for these things?

Janaki: We absolutely did. And it was a team effort, let me tell you. It took a lot of hands on deck to get a really great set of internal evals because one thing we have a lot of is internal data. We've been using our own product for ages.

And then also we internally have so many experts on Amplitude because we have people in the day to day who are training others on how to use Amplitude. We're using Amplitude ourselves to see how people interact with Amplitude. It's very meta.

And so we decided to use our own product as a really great set of example cases where we're asking real analytics questions like, "why did our signup surge even though we had no top of funnel product improvements?"

And so real investigations that took our own PMs hours to do then turned into evals that we were able to test our Global Agent system on. Like, hey, is it able to get this thing right that our PM spent like four to five hours over a week trying to investigate last year? And it's a really great signal when the agent's able to get something like that right, we know we're headed in the right direction.

Ken: That's gold really if you think about it. All those little findings that you're able to pull out of experience and years of work with your product and then suddenly being able to apply those to other customers on the fly.

Jess: Right. We do this at Honeycomb too, to take the expertise that we have internally and encode it into the agent to make it available to everyone without our customers having to develop that expertise on their own in order to use the product to maximum effect.

Janaki: Right, exactly. And the magic of it is we have users at Amplitude who span so many different personas. Like, we have product developers like PMs and engineers, and then we have our customer success managers who are also using Amplitude to answer their sort of questions. We have people on the sales and marketing side.

And so when we bring all those people together and say, like, hey, what are questions that you would like to ask Amplitude? Or like, what are questions that you try to answer in your day to day? Get those questions, get their expected outputs, and then put our system to the test, that's when we can build that internal confidence that we're building a robust system.

And it takes a lot of effort. You know, we're still working on it internally. People are manually adding in evals based on their real life experiences. But there's also a piece to it where we automated this process of generating evals as well from that very article that you mentioned, the Stone Soup one, where we realized we want more evals and we want them fast.

So we have an artifact in Amplitude called a notebook, which we've been using internally at Amplitude for a while too, which is essentially a handwritten document online where you are documenting your steps to analysis.

Like, maybe I shipped an experiment and it was very successful and I've documented that in an Amplitude notebook. Here's the analyses that I did to figure out what's going on. Here's the output result that I have.

And so we parsed a lot of our internal notebooks at Amplitude and generated nice tight evals from that that give us an input question like, oh, what was the success rate of this experiment? How well did this experiment do? And an expected output.

Jess: That's excellent. Speaking of notebooks, this is not really a notebook, but I wanted to take a few minutes and ask you about something really important, which is your art. I found your website. It's janakivivrekar.com, right?

Janaki: Yes, that's the one.

Jess: Yes. And it lists all your writing, your blogs, like the ones that we've been talking about today. Also it has art and I love the one from March 10th. About the calendars. It's like you've somehow taken a Google Calendar that like starts in January of this year, I think.

Janaki: Yes.

Jess: And it, I clicked on it and it displays like this week. And there's meetings on it for today that say, "pain is not something I want to feel yet it is something I have to carry." That's from 1 to 7am and then at 8am we have, "hot blooded." And at 12:30, "what should we name her?" And then, "and we named her Longing."

This is so beautiful. And then I can hit back and see previous weeks and I can even add the events to my calendar.

Ken: Like poetry and philosophy in a calendar.

Janaki: Yes, exactly. This is a live intervention project that I'm running right now which is transforming this, you know, very synthetic element that we use in our day to day lives, the calendar, to bring order to our lives, really. But instead I'm using it to add chaos.

And so it's a really subversive project because it's repurposing all these affordances that we have. You know, you have an event title, you have event duration, you have a location, you have a description. And we use all of these to tighten up and schedule our lives down to the minute. But our real lived experiences are so different.

There's things going on minute to minute that we don't choose to document. And so I've decided to bring some of those more latent narratives that we all have ongoing in our lives and bring them into this grid system of a calendrical frame and instead use it to write poetry or thoughts that otherwise you wouldn't even think of putting in a calendar.

Jess: Oh yeah, Hot Blooded takes place at a Spotify track.

Janaki: Exactly. There's embedded artifacts within these events if you choose to click into them. There's images, there's Google forms, there's Spotify links. And so really it's this landscape that now my audience can interact with and participate in.

But, you know, it's really trying to project these vague thoughts and impressions, you know, fleeting thoughts, into this tangible calendar space.

Jess: I have to ask, did you use AI anywhere in this project? And if so, how did you evaluate the results? Haha.

Janaki: Haha. That's so funny. If only, if only this was an AI project, but this is decidedly not an AI project. In fact, if you go to March 3rd on that calendar, you'll see a little bit of a nod to how this is very much not an AI project.

Jess: Whoa, there's a cool triangle. I'll just let our listeners go to March 3rd in Janaki's calendar poetry project and see how not AI it is.

Ken: That's pretty great.

Jess: It's beautiful.

Janaki: Thank you. Thanks for checking it out.

Ken: All right, so now we know where we can find your writings, your calendar poetry and calendar-free writing and fun. Where else do you write things and talk about things? You're on X with your work for Amplitude, right?

Janaki: Yes, you can find me on Twitter, you can find me on my own website. You can reach out to me.

Jess: Would you spell your URL for people?

Janaki: Yes, my website is J A N A K I V I V R E K A R dot com. That's my name. First name, last name, dot com.

Jess: Thanks. And I would like to ask you for a piece of advice or a random thought that also works to leave our listeners with.

Janaki: Yes. One piece of advice as we're all figuring out how to orient ourselves in the AI landscape is particularly about evals, since we've been talking about them so much, is try to demystify evals for yourself.

It can seem kind of daunting and scary if you've never written an eval before, if you've never built an evaluator before. But really think about it to start off as just an input and an output. It's just like a test that you want to run on your agentic system. What is something that you can think of asking it, and what should it respond?

And then once you think about what it should respond, break that down then into what are different angles of that? What are the success criteria? Should the output have a particular artifact like I mentioned, or should it be structured in a particular way? Should it necessarily contain a specific detail? And then take it from there.

So break it down for yourself. It's not as mysterious as it may sound where everyone's talking about evals today, but really all it comes down to is an input and an output. And you can get to evaluating your AI systems this way.
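As a rough illustration of the "input, output, and success criteria" framing above, here is a minimal Python sketch. It is not Amplitude's eval framework: the `Eval` class, `run_eval`, the criteria names, and the stand-in agent with its canned answer are all hypothetical, invented only to show the shape of the idea.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Eval:
    """One eval: an input question plus named success criteria (hypothetical shape)."""
    question: str
    criteria: dict[str, Callable[[str], bool]] = field(default_factory=dict)


def run_eval(ev: Eval, agent: Callable[[str], str]) -> dict[str, bool]:
    """Ask the agent the question and score the answer against each criterion."""
    answer = agent(ev.question)
    return {name: check(answer) for name, check in ev.criteria.items()}


def stand_in_agent(question: str) -> str:
    # Placeholder for a real agent call; the answer text is made up for the demo.
    return "Signups rose because of a referral campaign. Chart: signups_by_source"


signup_eval = Eval(
    question="Why did our signups surge with no top-of-funnel product improvements?",
    criteria={
        # Should the answer contain a specific detail?
        "mentions_referral": lambda a: "referral" in a.lower(),
        # Should the answer include a particular artifact?
        "includes_chart": lambda a: "chart:" in a.lower(),
        # Should the answer be structured in a particular way (here: short)?
        "is_concise": lambda a: len(a.split()) <= 60,
    },
)

results = run_eval(signup_eval, stand_in_agent)
print(results)
```

Each criterion is a property of the response rather than an exact expected string, so the same eval keeps working when the agent phrases its answer differently.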

Jess: If you, in the audience, have ever done property-based testing, it's a lot like that.

Janaki: Yeah, yeah.

Jess: Just thinking about the properties or correctness criteria that you want the responses to have because you can't hard code, "it should say exactly this."

Janaki: Yeah, absolutely. And once you have that up and running, the next step then is how do we turn this into product development? How do we learn from our evals and the outputs?

You know, whether you're running them online or offline or both, there's information to be learned from them and figure out where your system could be better and then invest your product development efforts there.

Ken: Because you're specifying these things in regular language, you have a conversation piece to use with everybody about this instead of code. So in a way, it kind of makes it easier for more people to participate than if it was software that was running deterministically with unit tests, for example.

Janaki: 100%. A lot of our evals are written by our own PMs, our own customer success managers.

Jess: Nice.

Janaki: Also other engineers. But really, anyone who uses your product can help you write an eval. We even have had customers being like, "hey, like, I really asked this particular question a lot. And the system could be better in this way." What did we do? We turned that into an eval for ourselves. So every failure that you witness can turn into an eval that you have running perpetually in the future.
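One way to picture "every failure becomes an eval" is a suite that only grows: each observed failure is appended as a case, and the whole suite is rerun after every change. The Python sketch below assumes a JSON-lines file and a simple "answer must contain these terms" check; `record_failure`, `run_suite`, the file layout, and the example question are hypothetical and not taken from Amplitude's system.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def record_failure(suite_path: Path, question: str, must_contain: list[str]) -> None:
    """Append a question the agent got wrong to a JSON-lines eval suite."""
    case = {"question": question, "must_contain": must_contain}
    with suite_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(case) + "\n")


def run_suite(suite_path: Path, agent: Callable[[str], str]) -> list[tuple[str, list[str]]]:
    """Rerun every recorded case; return (question, missing terms) for each failure."""
    failures = []
    with suite_path.open(encoding="utf-8") as f:
        for line in f:
            case = json.loads(line)
            answer = agent(case["question"]).lower()
            missing = [t for t in case["must_contain"] if t.lower() not in answer]
            if missing:
                failures.append((case["question"], missing))
    return failures


with tempfile.TemporaryDirectory() as tmp:
    suite = Path(tmp) / "evals.jsonl"
    # A user reports a weak answer; it becomes a permanent eval case.
    record_failure(suite, "How did the checkout experiment perform?", ["conversion"])
    # Stand-in agents (made up for the demo): one before a fix, one after.
    before_fix = run_suite(suite, lambda q: "I am not sure.")
    after_fix = run_suite(suite, lambda q: "Conversion improved in the variant.")

print(before_fix)
print(after_fix)
```

Because the case stays in the file, a later change that reintroduces the same failure shows up the next time the suite runs.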

Ken: A domain specific language called language.

Jess: Haha!

Janaki: Haha! Yeah.

Jess: Beautiful. Thank you so much.

Janaki: Thanks, Ken. Thanks, Jessitron.