In episode 53 of o11ycast, Charity and Jess speak with Jean Yang of Akita Software about legacy software, API-driven observability, tips for improving team efficiency, and insights on how observability can inform programming languages.
About the Guest
Jean Yang is Founder & CEO of Akita Software. Jean has over ten years of experience with programming systems research, including work that won the Best Paper Award at the Programming Language Design and Implementation (PLDI) conference in 2009 and that got her named one of the MIT Technology Review’s Top 35 Innovators Under 35 in 2016. Before starting Akita, Jean was a tenure-track Assistant Professor in the Computer Science Department at Carnegie Mellon University.
Jean Yang: It's been a great transition, lots of fun, but I think it was a surprising transition to a lot of people, because before this I was a professor in computer science, working on very academic programming languages.
So programming language design, program verification. So how do we prove programs correct? Program analysis. How do we analyze programs for properties? And observability, from afar, seems to many people like a completely different thing, because in a lot of the topics in programming languages, you're playing God. You're saying, "Okay, here's some specifications. I'm going to prove down to every last atom that this is doing exactly what I said it would do. And-"
Jessica Kerr: Whereas, we're like, "No, no, no, no. We know that's not the case. Tell us what it's actually doing."
Jean: Yeah, exactly. And I feel like especially the brand of observability that Honeycomb champions, we test in production, we don't even test. The world I came from, everyone assumes you write code, you write types, you write specifications, you write tests. On the one hand, it seems like such an opposite world, but I had been very seduced by reality.
And so here I was in the ivory tower, I'm working on tools to make your application level code correct, down to the last line of assembly. I literally wrote a paper. The title is Safe to the Last Instruction. It's about verifying assembly code. And that's all fine and good if you're running a rocket ship, if you're Amazon and you're running everything down to the metal and the rest of the internet depends on you. But for the vast majority of web app companies out there, that's just not the case. And so-
Charity: I would argue that at every company, it's just not the case. Production has a way of messing with everyone's greatest hopes and dreams. This sounds like a good time for you to introduce yourself.
Jean: Sure. I'm Jean. I'm the founder and CEO of a company called Akita Software. We do developer tools. We're in the observability space. I would say we're very complementary to what Honeycomb does. If you want very basic, drop-in observability, just to start understanding what your API endpoints are, what's in your system, that's what we do.
Charity: And how do you pronounce your Twitter handle?
Jean: It's @jeanqasaur.
Jean: Yeah, J-E-A-N-Q-A-S-A-U-R. The backstory is it's a very AZN thing. I feel like when you're an Asian American teen, it's a very cutesy thing to use Qs instead of Gs. And I don't know, I just started using Qs everywhere, because I thought it was so funny people did this. And somehow this became part of my nickname that my friends started calling me.
Jessica Kerr: That's the best place to get Twitter handles.
Charity: Well, so what can programming languages learn from observability?
Jean: That's a great question. And in fact, ever since I ran away from academia and started working on APIs and observability, I get asked to speak in classrooms and at grant meetings sometimes. And they're like, "Jean, we've mostly been working on this very top-down, specify-everything stuff. What should we be thinking about with APIs and things like that?"
And what I really like about the observability space, and especially Honeycomb's work in it, it's just very accepting of, again, reality. And so, you're just like, look, you just can't understand large parts of your system behavior until production.
And so I think for me, the big revelation there was: I was doing a ton of work on application-level tooling. So language design, program analysis, program verification: they're all assuming that everything happens in the language. But even for me, the cracks were starting to show. You have the runtime; in any memory-managed language, you have this giant honking runtime.
What are you doing with that? Any time you call out to the database, you have a bunch of stuff. And for me, the big thing that caused me to get into the observability space was network calls, because the bane of my existence, the cause of all of my existential crises, was every time you do a network call-
Charity: Anytime it hops the network, you're in mystery land, you have no more control over it.
Jean: Yeah, you subvert all of your guarantees. I think the acceptance that you're not going to prove all of that away, and you're not going to test all of that away, you can't, I think is one of the big things to learn.
Charity: I think it never was the case. It really is reality in operations forcing itself into your gated kingdom or whatever. And that's why the shift from the monolith to microservices is so fascinating, because it used to be that you could kind of believe in a world where there were developers over here and operations over here. But now that you're inserting network hops and web hops, you can't live under... Every developer has to know a lot more about operations now that everything's hopping the network.
Jean: Yep, yep. Exactly. And I think the big thing is, one of the things I love about how you all talk about this, is it's a people problem. Because I think that when you're living within the application layer you can kind of trick yourself into thinking it's a technical only problem. "If only I had better specifications, if only I had better types, if only I had better tests, then no humans would have to talk to each other. You don't have to trust anybody."
But I think something that I picked up from the observability world was, look, it's always a people problem. You always have to trust people. And a lot of what the tooling is doing is helping you do archeology on what did people do in the past, what did they do recently, what caused these issues. It's all people.
Jessica: What was your phrase from Twitter today, Charity? "The smallest unit of software delivery is a team."
Jean: I love that.
Jessica: It's not a program. It's not an application. It's a team, which includes some code, but also people.
Charity: And it's more than just one person too. This is why you can't have a piece of software that is owned by a person, because you've got the bus factor. You can only-
Jessica: Well, it doesn't even take a bus. You're sick sometimes.
Charity: Yeah. Or you're going on vacation, or whatever.
Jessica: You might have a life.
Charity: Yeah. Good point, Jess.
Jessica: And that's the difference between a program and an application too. You pointed out that a single person can write software. We can write programs. A program can be correct, but a program that isn't hooked up to the network and talking to any person or other program isn't useful.
Jean: Yeah. Yep.
Charity: It's a difference between a program and a software system.
Jean: Yep, yep, yep. Yeah, that's a really good way of putting it. And an analogy I like to make is that these software systems are like rainforests. They're constantly evolving; there are things coming in and out. Especially with microservices and the API economy, this has become very, very apparent.
Charity: Even just with different user patterns starting. It's like the difference between complicated and complex. I would say that a little piece of code is a complicated system, but once it's live, it's in reality, once it has users, and traffic patterns, and different infrastructure underneath it, it becomes complex. And that means you can't predict what's going to happen when you change something. You have to change it and watch and see what happens, under some controlled environment.
And I think that a lot of people think of legacy software as some kind of thing that doesn't apply to most companies. The minute you ship code, you have legacy software. There's stuff that binds you. You can't change it. I don't remember things I did last week, and I think that's true of every single person out there that creates code.
Charity: This is why we're all trying to figure out how we can do more with less: how we can ship faster and do better with the people that we have. Because with software as it exists in the wild, you're always accumulating commitments, you're accumulating user patterns, you're accumulating promises that you've made to people and teams and everything. Which means even if you write and ship no code, you're accumulating more to do with the same number of people, every day that you exist.
Jean: Yep, yep. I completely agree.
Jessica: And then, like you said, as soon as you ship code, that's a promise. As soon as someone else has hit that API, they expect that to still work.
Charity: Yeah. Which is why I've been on this rampage for the past year, about how if you're trying to be a more efficient team, the number one thing you can do is shrink that... And this is not a novel insight, we've talked about this for years, but it really sunk home for me just in the past couple years, just like you were saying, Jean, "I don't remember what I wrote last week or last month." This is why the cost of finding and fixing problems in our software goes up exponentially, the longer it's been since you wrote it.
And the more you can do to get that fast feedback loop of I'm writing it, shipped it and I'm looking at it in production through the lens of my instrumentation, ask myself, "Is it doing what I expected it to do? Does anything else look weird?" If you have that down to minutes, oh my God, your software is going to be so much better, and more tractable, and more comprehensible, than if you've got a lag time of weeks or months, and the person who's actually trying to debug it isn't even you.
Jean: Yeah, absolutely. That's the sort of thinking that converted me, because the assumption that a lot of people in formal programming languages make is that the language is the powerful thing: if you just designed the language and allowed people to say what they wanted, everything would follow. And for me, the big realization is, there's a huge gap between what you want and what actually happens.
Gerald Jay Sussman, the computer scientist, once spoke in an undergrad class I was taking. And he's like, "A lot of people like programming because you can play God." And I think for many decades, systems were small enough, systems weren't used by that many people, and they weren't hooked up to the internet and all that.
You could, for a short period of time, play God as a programmer. And now, I think that we have to recognize our humanity again. And the humility that we just have no clue. We have no control over anything. No clue what's going on.
Jessica: Monotheism is over, now you have to fight with all the other gods.
Charity: No, that's so true. And that world where we felt like we had this illusion of control, that's where the great divide between dev and ops came from too. We're like, "Oh, this is fine. By writing my code and making sure it passes tests I know it's good. Someone else can run it now." And what we've learned as stuff gets more complicated is that was never true, and it's getting less and less true.
You have no prayer of running the software if you weren't the person or the team that writes it, that gets inside of it, that looks at it from the... You can't just treat it like a black box to be operated. Part of development is operating it, and seeing how it behaves under different systems and constraints. You're never going to understand your code in the IDE.
Jean: Yep, yep. I completely agree.
Jessica: Yeah. And then there's that reality of, okay, but a lot of code, we are running it without the original developers, the capital L legacy stuff. And so Jean, how does Akita help with that?
Jean: We fell into the business of doing that. I am very transparent. We started out thinking we would do observability for API security. And so as you know, security teams hate invasive installs. They can't get developers to do very much. So we started developing a black-box approach for understanding where data was in systems.
So that's how we got to our eBPF-based, passive, network-traffic-based approach. We had a lot of stuff that sat completely black-box outside the system and watched what APIs were doing, and told people stuff about it.
And what happened was, I think security teams were like, "Okay, but can you do these 18 other OWASP things?" We were like, "No, no, no, we just do this one thing, but pretty well." And then developers were like, "Whoa. If you can actually just drop into my system and tell me about it, I'm actually struggling to log. I'm struggling to put metrics in the right places.
And I'm struggling to find people expert enough to understand how to properly do ops on my system." And so that's how we fell into proper observability, and that changed who started showing up. Because our website started out very vague. We were just like, "Something, something, drop into your system. We tell you stuff about it." But who ended up showing up was people with legacy code. And so an example of one of our users is Flickr.
They're a Web2 company, they've been around 20 years. They have this legacy PHP monolith, and they've been breaking it out into services. And that modernization process is so tough. I had a series of conversations with our user there, and I was kind of like, "We're such a rough tool right now, why are you in our beta?" And he was just like, "Look, there's not a lot out there, and my team isn't big. Nobody on the original code base is still working on the code, and we're just trying to wrap our heads around it."
Charity: Sounds like you're using it almost more as a service map.
Jean: Yeah. Service map is one part of it. I will say that a lot of our users are actually monolith users. So it's initially a discovery tool. And then it's a, "Let me know if anything happens with my endpoint that I want to know about." So a lot of these teams, they're not doing the very fancy advanced Honeycomb level stuff.
They're not trying to optimize 99th-percentile tail latency to within an inch of its life. The reason their code has been around for 20 years is that it's been mostly okay. But how do you still move fast doing that? Let's say you made one endpoint slower; you've got to trade that off somewhere else. And so there are a lot of very high-level trade-offs that people make on a day-to-day basis. People still need to make changes.
They don't know exactly what the existing system does, or even what is talking to what. And so initially, I didn't realize that there was a whole population of developers like this, actually. We just had people starting to show up to our system, and we asked them, "What other tools do you use?" And it was often a very ad hoc story.
I think every system that was built in the last year kind of has the same stuff, but every system that was built five, 10, 15, 20 years ago, they're all old in different ways. Every legacy system, I think, is legacy in its own unique way. And so for a while our investors were saying, "Hey, what trends are you jumping onto? What star have you hitched your wagon to?" And we're like, "None. The opposite. It seems like a lot of our users are PHP users."
Charity: Because reality.
Jean: Yeah. Yeah. And I think people had said, "Look, it's really hard to grow a business with users who aren't growing." And I'm like, "More and more legacy users keep showing up. There's a lot of them, and they're growing some." But it wasn't us strategically saying, "Hey, there's this neglected set of users out there." It was us starting to build under a model and a set of assumptions that we had to be as noninvasive as possible.
Us developing a set of technology that would allow us to do it. And then developers picking this up and being like, "Oh my gosh, this is what I've been looking for, for this slightly other purpose." And so then we just leaned in. We were like, "All right, it seems like this is what we're doing now."
Jessica: So you're using the operating system to spy on network calls that the software makes?
Jean: We're not. Okay, we actually are not even looking at the whole operating system. We don't do the file system or anything like that. We just do the PCAP layer. So it's just ports. Basically, we look at the ports, we watch the traffic. So one way to think about it: it's like Wireshark, but with a whole inference layer on top of it.
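[Editor's note: to make the "Wireshark with an inference layer" description concrete, here's a minimal sketch. This is not Akita's actual code; the record fields are made up for illustration. It only shows the reification step: turning raw captured HTTP request bytes into a structured call record.]

```python
def reify_http_request(raw: bytes) -> dict:
    """Parse a raw HTTP/1.1 request into a structured call record."""
    # Split headers from body at the blank line.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8", errors="replace").split("\r\n")
    # Request line: METHOD SP PATH SP VERSION
    method, path, version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return {
        "method": method,
        "path": path,
        "version": version,
        "headers": headers,
        "body_bytes": len(body),
    }

call = reify_http_request(
    b"GET /users/123 HTTP/1.1\r\nHost: api.example.com\r\n\r\n"
)
# call["method"] == "GET", call["path"] == "/users/123"
```

In a real passive-capture pipeline, the raw bytes would come from a PCAP capture on the watched ports rather than a hard-coded literal.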
Jessica: The inference layer, does that notice changes?
Jean: Yeah. So first we infer the structure of the APIs. And so initially we were just looking at all the traffic and showing it to our users. And they were like, "We were having trouble reading our logs before; this is worse."
Jessica: Those are all bytes.
Jean: Yeah. Well, we would just show them: here's a call. We would kind of reify the call a little bit. And after that, we put a ton of work into inferring path structure for URL endpoints and types. We can actually provide a pretty concise API model on top of that, which we automatically infer. And then on top of that, we're able to start inferring changes, so we can detect very basic things, like this endpoint was removed.
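[Editor's note: a toy sketch of the two inference steps Jean describes, assuming a simple numeric-segment heuristic. Akita's real inference of paths, types, and changes is far richer; this only illustrates the idea of collapsing concrete URLs into endpoint templates and diffing two models.]

```python
def infer_endpoints(paths: list[str]) -> set[str]:
    """Collapse path segments that look like IDs into a {id} placeholder."""
    templates = set()
    for path in paths:
        parts = []
        for seg in path.strip("/").split("/"):
            # Heuristic: purely numeric segments are treated as path params.
            parts.append("{id}" if seg.isdigit() else seg)
        templates.add("/" + "/".join(parts))
    return templates

def removed_endpoints(old: set[str], new: set[str]) -> set[str]:
    """Basic change detection: endpoints present before, absent now."""
    return old - new

before = infer_endpoints(["/users/1", "/users/42", "/orders/7/items"])
after = infer_endpoints(["/users/99"])
# before == {"/users/{id}", "/orders/{id}/items"}
# removed_endpoints(before, after) == {"/orders/{id}/items"}
```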
Charity: So do you have something like a library for people to instrument their code with, or are you just sniffing traffic?
Jean: There's no instrumentation. So the challenge we set was: how far can we get with no instrumentation? And Charity, you and I have gone back and forth. I completely agree that if you're trying to do anything that involves deep debugging, you need to instrument. For us, we are trying to do everything up to what you don't have to instrument to do. So we can detect endpoint-level changes; because we have some inference, we can detect type-level changes. We can detect some performance changes that are externally visible. But for the kind of debugging that you all are working on, we can't do that, because we don't instrument.
Jessica: Yeah. Instrumentation is more like, well to do it, you have to have control over the code. You have to be able to deploy the instrumentation and hopefully increment on it and make it better. So it's really good for code that you to some degree have your hands around.
Jessica: Whereas you're working with code that people definitely don't have their hands around.
Jean: Yeah, exactly. And I think that there could be an interesting Akita Honeycomb collaboration where people are trying to get their heads around something, that we make it real easy for them to drop us in, get a very basic understanding. But in your end of this, what we envision is there's a whole observability stack. What I've noticed is, most companies adopting observability aren't just doing one tool anymore. They have, "Here's my monitoring for this. Here's my..." Yeah.
Charity: Yeah. I don't think they want that to be the case though.
Jean: Yeah. But I think inevitably, unless someone comes in and unifies it all, there are different tools for different purposes. So I can imagine us playing well with Honeycomb, so that if people are like, "Okay, now I want to go deeper. I can do some instrumentation," or something like that. And that's not what we've been working on at all. We're just like, "Look, you are in a hole. We'll get you out of the hole and give you some basic amenities in life. But once you get your head around the code, if you want to go deeper, there's a set of other things you can do."
Jessica: Yeah. Like full instrumentation. And Akita traces, or Akita... What do you call them, events? What do you call the...
Jean: Yeah, we've been calling them traces, but it's confusing because they're not observability traces. They're just traffic. They're traffic traces. Yeah.
Jessica: So that could be linked to a trace that is instrumented around the code?
Jessica: And then you could be like, "Well, my code says it's doing this, but the network says..."
Jean: Yeah, exactly. And you can also imagine feeding Honeycomb data into our change inference; the more sources of information we have for that, the better. Because the persona we have in mind is someone who's like, "Look, I don't have a lot of DevOps expertise or bandwidth on my team. I just need someone to gut check or really flag big things." They might not have the expertise to set up detailed dashboards or anything like that. And so we're really just a very basic starter. This is like-
Charity: There was a piece that you wrote recently, something about APIs that I really enjoyed. Do you remember what that was about?
Jean: I wrote a few things. One, I wrote "The Case for Developer Experience," where I talked about why observability is the thing to work on. That was a few months ago.
Charity: Yeah. I liked that one. That wasn't the one I was thinking of.
Jean: On the Andreessen blog.
Charity: Yeah. I definitely read that one. It was a different one that I was thinking of. This was a really cool one; we should definitely put this in the notes. The difference between abstraction and complexity. This is something that Christine and I have talked a lot about in the past, where there are tools that help you back away from complexity and treat things abstractly, and then there are tools that lean into it and help make complexity tractable.
And we definitely see Honeycomb as being, at some point at the end of the day, somebody somewhere's going to have to understand your code. And we are definitely in the business of making complexity tractable for people who have to come up and understand it later on.
Jean: Yeah. I love that you are in that space. And something that had been really frustrating to me is, I think all observability tools are by definition complexity-embracing tools. If people wanted an easier life, they would just pay someone to write and run all of their code. And then-
Jessica: Well, it would all be in one app and then you'd prove it correct.
Jean: Yeah, exactly. And sometimes I get so frustrated, because I feel like the sexiest demos in a lot of the dev tools will get a lot of hype; it's some TikTok video.
Charity: Yeah. It's so easy to fake a demo. It's so easy to make anything look fast or good. This is why, with Honeycomb, we somewhat often run into people who just literally don't believe us. They don't believe that what we're saying about our tool is possible. They think we're lying. They think we're faking our demos.
And I get it. If you've lived in a world where high cardinality and high dimensionality and all these things are just impossible, then it sounds like we're just high on our own supply or something. So it's really satisfying when one of two things happens. Either, this happened to Christine the other week, where she was sitting next to a CTO at some dinner, and he just didn't believe it. And she's like, "Okay."
And she talked them through like, "Well, here's how the column store works. And here's how we can scan so many rows so quickly." And that's fun. It's a good way to shut people up, but it's more fun when they actually try Honeycomb on their servers, try it on their data. You can see it click in their eyes when they see something that was literally impossible before, and now happens like reliably in sub-second queries on Honeycomb. And you just get the mind blown, and that's really satisfying.
Jean: Yeah. No, I think that's amazing. And it's something we struggled with in the beginning because we fell into the trap of, "We have to make such a slick demo." And then we were like, "No, no, no. It's never a demo until you're running on their traffic. And actually, it isn't a really compelling demo until you're running on their production."
I would always look at other tools where they're just like, "We replaced your back end with a faster back end. Here's a demo." And then everyone's like, "Okay, I can totally see how that does it. I can imagine it." And I think for complexity-embracing tools, there's not really a playbook. People forget about us. I feel like complexity is just something that people so continually sweep under the rug that when people think of buying tools, when they think of recommending tools, it's just like-
Charity: People would be horrified to realize just how many of the outages and problems that these large, complex systems have are never understood, and how often they happen. We all freeze. We start looking into it and it resolves itself. And we all sit there looking at each other going, "Did you see that?" And it's like, "Okay, we can spend the rest of our day trying to figure it out, or we can just wait and see if it happens again." That's the norm. The norm is not that these things are well understood.
Jean: Yeah. What I like about the observability movement is that it says: look, complexity is here. It's biting you. You need to deal with it. This is reality. And here's a better way of dealing with it.
Charity: Yeah. And the thing is that there are so many problems happening in most people's systems, steady-state, that they have no idea about, because they don't happen to affect more than 2% of the traffic, or they happen rarely. But once you have the ability to slice and dice and look beneath the covers, oh my God, there's so much going on. And we have our paging thresholds set quite high, which we should, because otherwise being on-call would be impossible. But that just means that there are so many things going on right now in your systems that you just have no idea about.
Jean: Yeah. I feel like system monitoring and ops is sort of where medicine was in the Middle Ages. It's like, okay.
Charity: Armed with, like, a lancet and a leech.
Jean: Yeah. The leech is just waiting.
Charity: And then if it fixed it they take credit, and if not, well, that was just God's plan.
Jean: Yeah. And I think that what I hope to bring to this whole world is a perspective of, look, I came from a world where there is order, that world happened to be such a small fraction of the whole real world. But I think there is such a thing as order and more sanity and more structure.
Charity: There are systems that are better understood than others, and they're more pleasant to work on. You can move much more quickly with much less suffering. I think we've all experienced these systems that are just like a hairball that some cat coughed up. They've never been understood. We ship code every day that we don't understand, onto systems we've never understood. And then we wonder why it's a nightmare to run. But when you have a system that is fairly well understood, where you have developers who are checking themselves and their code, it's like night and day.
Jessica: Where it's last-week legacy instead of three-hires-ago legacy.
Jean: Yeah. My dream is... I used to live in these worlds where we pretended everything was clean and you had full control. And I think there's some kind of hybrid situation where you have more control over the mess, even if you don't have full control. And going the other way, what I think observability can benefit from in programming-language ideas is this: the spirit of all the work I did was that we take messy, complex systems and we bring order.
And what I thought was unfortunate was that no one was bringing that much order to real systems and ops. To me, logs are like assembly. They're low level; you can do anything with them. But what can we extract from logs? What kind of insights? What kind of questions do people really want to be asking about them?
Charity: Logs are just spaghetti strings. They should be deprecated. Nobody should be logging unless you're in a development environment.
Jean: Yeah. Right. And so, where can we raise the abstraction from logs, metrics, and traces? That's been the big challenge we've set out for ourselves.
Charity: To events. The perspective of the user.
Jean: Yep. Exactly.
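[Editor's note: one hedged sketch of what "raising the abstraction from logs" can mean in practice, assuming structured log lines that share a request ID. The field names here are hypothetical; the idea is to coalesce the scattered lines of one request into a single wide, structured event.]

```python
def logs_to_event(request_id: str, log_lines: list[dict]) -> dict:
    """Merge all log lines sharing a request ID into one wide event."""
    event = {"request_id": request_id}
    for line in log_lines:
        if line.get("request_id") == request_id:
            # Fold every field from this line into the wide event.
            for key, value in line.items():
                if key != "request_id":
                    event[key] = value
    return event

lines = [
    {"request_id": "r1", "endpoint": "/users/{id}"},
    {"request_id": "r2", "endpoint": "/orders"},
    {"request_id": "r1", "duration_ms": 42, "status": 200},
]
event = logs_to_event("r1", lines)
# event == {"request_id": "r1", "endpoint": "/users/{id}",
#           "duration_ms": 42, "status": 200}
```

One wide event per request is queryable by any combination of its fields, which is what makes the slicing and dicing described above possible.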
Jessica: Because we can't have complete order. As soon as the network is involved, we have to give up on having perfect guarantees. But that doesn't mean we can't have some. We can have more clues, more tools to detangle the hairball.
Charity: It can be better.
Jean: Yeah. Never going to be perfect, but it can be a whole lot better.