FEB 25, 2021

32 MIN

Ep. #34, Diminishing Complexity with Jaana Dogan of AWS

Guests: Jaana Dogan

about the episode

In episode 34 of o11ycast, Charity and Liz speak with Jaana Dogan of AWS. They discuss Jaana’s career journey, life before observability tools, and reducing system complexity within large organizations.

about the guests

Jaana Dogan is a Principal Software Development Engineer at Amazon Web Services. She was previously Tech Lead and Staff Engineer at Google.

show notes

transcript

Jaana Dogan: I realized that like one of the biggest blockers in my career was the lack of observability data.

And a while ago actually, we were not calling this observability data.

It's such a kind of concept that you helped create in the last, you know, very recently.

As a person who is very new into the industry, one of my like onboarding experiences was always going to the monitoring tools or the performance tools.

You were trying to like understand actually what's going on from an outsider perspective.

And these tools were like not very sufficient but were my entry point.

So I kind of like started to see that there's a trend and I became more interested in performance plus monitoring all along the way.

Even though, you know, I was a software engineer, I think I was at least spending 20 to 30% of my time on, you know, kind of like establishing how we use these tools, how we are like collecting data, and stuff like that.

Charity Majors: When you say it was holding you back, you mean like, from shipping quickly or from understanding or like--

What was it holding you back from?

Jaana: It was both.

You know, if you're not coming to a small team, working on a small project, most of the time you're looking at this huge code base with so many different components and you don't necessarily have a great big picture understanding of what actually it does, plus how it behaves.

So I can give an example.

When I joined Google, on my second day, I actually fixed a bug because my mentor was suggesting me to, "Hey, like we have this like distributed tracing tool. Do you want me to give you a demo?"

And I said, "Yes, like, what is that?"

Right? Like, and as soon as he demonstrated, I was able to see all the RPCs and there were like 200 services in the chain, which I would have been spending maybe five years just learning the names, right? So.

Liz Fong-Jones: So it's kind of like a shortcut to understanding for you, right?

For understanding what goes where that you otherwise would have had to stumble across kind of tripping over the roots.

Jaana: Exactly.

Charity: We spend our time when we are introduced to a new system, trying to build up this intricate mental map, which is always out of date, and it's always like subtly off.

And like, just getting that out of our heads with a tool where you can see it and where it's democratically available to everyone.

Everyone sees the same, you know, vision of the system, which is more or less up to date.

It is amazing how much that transforms your ability to locate yourself in the problem and spend more of your time on the actual solving of the problem instead of just like where in the system is the code that I need to find in order to even start debugging the problem.

And then you're talking about very large systems but I think the part of our thesis is that it's not just large systems anymore. It's like--

Jaana: That is completely true. Like even, you know, when we're talking about large system we're always thinking about like the systems that we own.

But if you think about the whole space, I rely on tons of like external dependencies, layers and layers of stack, like, which I don't necessarily have a good understanding of.

Charity: A lot of your system isn't your system.

Liz: It's our system. It's collectively our system.

Jaana: That's true. Like you only own like a small piece on the, you know, top of everything and, you know, the complexity is there.

And it's just not getting any better, it's getting worse over time, right?

Charity: It's getting way worse.

Jaana: Yeah.

Liz: So now would be a good time for you to introduce yourself.

Jaana: Hi, I'm Jaana. I'm a principal engineer at AWS.

I'm leading our instrumentation and telemetry collection work.

And as well as I'm leading some of the newer observability services that we are working on.

I've been working in this field for a while and, you know, I'm a big fan of your work. So thanks a lot for inviting me.

Charity: I'm going to guess that if you're listening to o11ycast, you know who Jaana is. Honestly.

Jaana: That's such a big compliment. Thanks.

Liz: Yeah, I think that's super, super interesting to see how the field has evolved.

You mentioned earlier that you started off as a kind of pure software engineer and you became more ops minded and kind of eventually grew to focus on developer tooling.

Charity: Or you sound like you've got reality embedded.

Jaana: Yeah, it's actually unbelievable to have those boundaries, right?

Like it's software engineer plus like ops like it's just this continuous cycle.

So, I think historically we created these boundaries, but you know, those boundaries are just more fluid and just going away.

Charity: You can't separate creating and maintaining.

Jaana: Exactly.

Charity: Like it's the same system.

Jaana: And it's just really good for the health of the system.

Like you should be doing the operational work as well to understand what's going on.

Otherwise it's kind of like creating this like unbalanced situation where you as a developer don't necessarily care about the operational side of the things.

Charity: It's like if you have a surgeon who's just doing surgery and then never checks on their patient. They're just like, "Well."

Jaana: Exactly.

Charity: "Nurse's job now."

Jaana: That's a great analogy, yeah.

Liz: So I think earlier we talked about how observability is kind of a newer term and that it did exist at kind of some of these larger companies who weren't necessarily calling it observability.

When did this kind of start to coalesce for you?

Like what was the commonality that led towards it being like its own discipline?

Jaana: I think that, you know, it is still a difficult topic.

I sometimes ask, you know, five people what their definition of observability is and I hear like 10 different, you know, definitions, right? From five people.

But I think the general theme is around, you know, being able to ask questions that you didn't know that you should ask and being able to correlate stuff, being able to, you know, dynamically enable things and just making the most out of what you collected with, you know, enriching things with context and so on is how I see observability.

You have probably a better answer to this question than I.

Charity: No, I have answer number 11 to add to your--

Liz: Our listeners have heard, you know, the first five things that we define observability as so it's kind of, you know, we'd like to hear from our guests about not just what their definition is, right?

But the road, how did they arrive there?

So kind of along those lines, like, you know, what are kind of some of the blockers, right?

Like what are some of the things that make people come to observability too late?

Jaana: I think one of the reasons is, historically speaking there's always been a barrier.

You know, we talk a lot about like there's been different initiatives going on in terms of telemetry data or context or what the telemetry data we collect means and so on.

So, you know, as a developer, you always think like this is not my mandatory job.

We just need to, you know, check a box so we can have like some basic monitoring and stuff like that, so--

Charity: It's not what they're getting judged on or evaluated on, right? It's like--

Jaana: Exactly.

Charity: Yeah.

Jaana: It's just like a late time, you know, I need to check this box so we can push things to production, but you know it's more involved than just the operational side of things.

I think what observability enabled recently was it became a tool to learn about your systems and now, you know, people see the value and then they care about this very early on.

Charity: And the users. It literally links you to the impact of your work.

How much more motivating can you get than that?

Liz: Except for where like I think, people felt maybe it was too hard, right?

Like there was too much of a barrier because they didn't know what to do or how to get there.

Jaana: Or you know, like they had to be very prepared.

When we were thinking about like typical monitoring, we are talking about having some canonical signals that you need to agree on before you start monitoring them, right?

Like it just requires too much work and when we were developing systems, we are doing everything very gradually.

You learn more about the system, you care.

Like you start to learn more what to collect and how to like interpret them in the longterm.

So I think that like gradual, you know, growth just didn't really match with the monitoring style because monitoring was like, "I need to have like these canonical things to take a look at. I need to set my SLOs, maybe."

Maybe observability is helping people because it's just kind of like works well with that like gradual growth and learning.

Charity: I think that there's kind of a thing here where like you said, you know, people think of like everything that happens after deploy as being like an afterthought or a nice to have or extra.

I think there's kind of a parallel there. There's so much work has gone into this already.

You know, there's like layers and layers of like teams gone by, and graphs, and dashboards.

They are just accumulating everything.

And I think that the initial reaction is often just one of weariness.

Liz: People don't want another tool. They just don't want another tool.

Charity: You don't want another tool if you have a system that is working-ish and you've been raised to see computers as like a source of like much fear and suffering. Like, "Don't change it."

It's a "nobody touch this" thing. Right?

Then the last thing you want to do is disrupt that fragile ecosystem on the grand promises of a vendor or somebody on Twitter. Right?

And I get that. And I also feel like it sounds hard because it's-- but like I think that the thing that I keep trying to make people understand is that it's easier this way.

It's actually simpler this way.

You don't have to be a great engineer if you can see what the fuck you're doing, right?

Like you said in your very first answer, like, "I could see it. The 200 services. How am I going to know which one is there?" Right?

When you see it, you don't have to be a genius.

And this is great because we shouldn't have to be geniuses in our everyday life, you know.

We should just be like code monkeys just doing our work and like focusing on, you know, delivering value to our customers and not just having to like save the day, you know.

It's easier this way, and that's what I keep, you know.

I think it's on us who are trying to like tell a story to then follow it up with, and here's, step-by-step, you know, guide to getting there, you know.

Because it's not actually their job to innovate in observability, it's their job to do something else, right?

That's what we get paid to do.

So I do think that, you know, it really sets the bar for us to follow up.

Liz: So in that vein of kind of getting things out of the way for people and kind of paving that golden path, three or four years ago, there was an OpenCensus and an OpenTracing.

And therefore there were multiple paths and people had to kind of be experts to figure out which one they wanted.

Kind of what led to the combination of those two projects together.

Jaana: Yeah, that's an interesting thing, right?

Like I think, so for the listeners who don't have much context, I was working on the Census team, which was an internal instrumentation library at Google that was linked to every like production service and it was collecting metrics and traces.

So at some point we realized that like we just needed something very similar externally for the cloud users.

We looked at like what's around, you know, OpenTracing was doing the work for tracing but it lacked some of the things that we like such as like, you know, being able to correlate things.

It just was more of like, you know, API rather than the implementation.

We just wanted to have a couple of features that we can maybe talk about them with.

We ended up diverging and maybe contributed to the problem of having too many tools because people are looking for practices and the golden paths.

And we were giving them like this yet another tool that you need to learn, and like, you need to be an expert in this area to be able to understand the nuances and such.

And we quickly realized, and in the beginning since the beginning, I was very interested in merging these projects to be honest.

But because the goals were diverse, they went in different directions, and I think we didn't necessarily see OpenCensus becoming a big thing so soon.

And we just didn't expect like it's going to be adopted that widely.

So we didn't think that we are going to disrupt like and create this level of confusion.

But as soon as it happened, it just became clear to us that like, you know, it makes no sense to have like two very similar things.

Liz: All right, like neither project was going to succeed if they were competing with each other.

Jaana: Exactly. You know, everybody was coming to us, like all the customers, "Why, you know, do I have to know two, you know, tools for very similar things?"

And you know, "Which one I should pick? Like what small nuances are you talking about?"

So this became a huge, you know, issue. You know, there's this concept of when you're trying to introduce a new standard or something, you actually like add one more. So, we were--

Charity: Now you had three problems.

Jaana: Exactly, yeah.

So we were contributing to the problem rather than solving any problems.

It became very clear that like, you know, merging needs to happen and it happened and it's a positive thing.

Liz: And now you're, you know, sitting in a slightly different role, but still working in the OpenTelemetry ecosystem because you and I, Jaana and I, used to be colleagues at Google.

And now, you know, I work at Honeycomb, Jaana works at AWS but we still work on observability.

Kind of I've spoken a little bit about what that shift was like for me, going from Google to Honeycomb.

What's that shift like for you, going from Google where you were doing internal facing tools plus cloud tools to AWS?

Like, what was that shift like?

Jaana: Yeah, like almost a year ago, AWS asked me, they want to adopt OpenTelemetry more widely all across because they believe in this mission that it is important for us to be able to collect the right data and be able to push it to the tools that our customers want to use, right?

Like they're not necessarily interested in pushing data only to AWS services, but, you know, want to be able to also push to Honeycomb as well, or, you know, the other tools.

Liz: Right, it's that fragmentation and that kind of data wall issue, that's kind of almost similar to what we just talked about with OpenCensus and OpenTracing where people just wanted one thing to just work.

Jaana: Exactly. They just don't want to understand any of these building blocks, they just want all of these pipelines to just work.

So they can choose whatever tool they want or they can write their custom tools to, you know, if they need like something very specific for their use case.

So they came up to me suggesting like, "Hey, would you like to lead some of these efforts?"

Right? Like, "It's a huge, you know, cross-company effort. We don't have a lead for that."

And it just, I felt, you know, immediately that, if I really--

Charity: That's a big deal.

Jaana: Yeah, it is a big deal but at the same time, like, hey, I can help OpenTelemetry a lot by doing this work because you know, it'll be, it's a huge provider, has a lot of like services.

It's a huge problem there and as well as like the customers.

If I can get this right, it could be also a good reference point.

That's how I got excited and, you know, ended up joining AWS.

And now, I'm trying to, you know, drive some of these integrations and want to make sure that, you know we're trying to adopt it more widely.

It's not just like we're adopting the instrumentation libraries or the collectors for the customers.

We want to use the same thing ourselves.

Charity: Something that I learned just this last year. I've been preaching this, like, "Arbitrarily-wide structured data blobs are the way to understand your data," for years and years.

And I found out on Twitter just this last year that that's what AWS has done since the beginning of time.

Like it's just like flat files and every host and everything but then they do like distribute it.

I was just like, "God damn! That's--" I mean it's even infuriating and awesome and validating and cool.
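To make the "arbitrarily-wide structured data blob" idea concrete, here is a minimal Go sketch of one wide event per unit of work, written out as a single JSON line. It is only an illustration: the `Event` type, its helper methods, and the field names are hypothetical, and it does not reproduce the format used by AWS, Scuba, or Honeycomb.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Event is one arbitrarily wide record describing a single unit of work.
// It is a plain map so that any field can be added without a fixed schema.
type Event map[string]interface{}

// NewEvent starts an event with fields that every request shares.
func NewEvent(service, route string) Event {
	return Event{"service": service, "route": route}
}

// Add attaches one ad hoc field and returns the event for chaining.
func (e Event) Add(key string, value interface{}) Event {
	e[key] = value
	return e
}

// Encode renders the event as a single JSON line (keys sorted by name).
func (e Event) Encode() (string, error) {
	b, err := json.Marshal(e)
	return string(b), err
}

func main() {
	start := time.Now()

	// Hypothetical fields: anything that might help later goes on the event.
	ev := NewEvent("checkout", "/cart/pay").
		Add("customer_id", "c-123").
		Add("cart_items", 3).
		Add("payment_provider", "example-pay")
	ev.Add("duration_ms", time.Since(start).Milliseconds())

	line, err := ev.Encode()
	if err != nil {
		panic(err)
	}
	fmt.Println(line)
}
```

Because the event is a map rather than a struct, a team can attach a new high-cardinality field at any call site without coordinating a schema change first.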

Liz: Really we should have been sharing these practices and ideas and like trying to figure out how to get these things to inter-operate.

Charity: And now that you're there, I'm like, now maybe some of these things will get broadcasted a little bit more widely.

And everybody now, Jaana no longer has me blocked on Twitter, so we can talk about this stuff.

Jaana: Yeah. You know, this is also a really big company.

I mean, think about like Amazon as a whole, it's just a huge company and there's so much flexibility.

I think there's a huge, sharp difference between, you know, some of the companies that I work here in Amazon.

Charity: Amazon is hungry.

Jaana: It's so big and at the same time, that's why like, you know, each team needs to have some flexibility.

And that's why, like you see just like more like, you know, less structured stuff going around.

Charity: I like that though, because it's like, there is this mandate.

You have to sell it, right? You have to sell your vision.

The teams have to opt you in affirmatively. They have to be convinced.

And I believe that that ultimately, it leads to a stronger product and a stronger like technical discipline.

Liz: And also translates better to customer needs, right?

Like if you have customers who need to buy in, right? Like, and your customers happen to also be internal, right?

Like that's a lot more similar than being able to say, "We have a new mandate."

Jaana: Yeah.

Charity: Yeah. Like it basically like, Scuba, which Honeycomb is based off of, is flat ugly.

It's like aggressively hostile to use or it's in its, it looks like it was designed in 2010 because it was.

And that's because they don't have customers. They have, "Here's what you get to use."

"Take it or leave it."

Liz: So when someone is not a captive audience, right?

Like what are the things that you are trying to nudge them towards Jaana, in terms of like best observability practices?

Like what are you trying to get people to do?

Jaana: I think, you know, like it starts from how much, you know, you're invested in this like from the beginning.

Or is there a later time? I mean, I think a lot of things like are starting from like canonical things that we need to do and everything else is case by case and ad hoc.

What I mean by like canonical things is, we need, you know, sufficient telemetry.

This is most of the time, you know, in the typical terms is, getting like metrics, basic metrics, plus, you know, some traces and logs in a correlated fashion.

And then, you know, being able to provide some like primitives.

So, if a team wants to, you know, provide some additional data, they should be able to, you know, correlate that data.

But most of the interesting cases are enabled by case-by-case what we need to do in terms of instrumentation.

Since I'm working on instrumentation, I think, my answer to this is going to be very like scoped to what we do in terms of like, you know, telemetry collection.

And I think that I should answer this question from that perspective because you are, you know, inviting other guests who can give some other broader answers.

But since this is my expertise, I think it makes more sense for me to, you know, talk about my scope.

So most of the time, the bigger challenge I think in this area is like representing the end-to-end needs when you're working in telemetry collection or the best practice or what we need to actually capture.

It's just more about like users', you know, journey in a critical path, right?

Like most of the teams where I was explaining to you this is a large company and everybody has a lot of flexibility, but at the same time, my customers are touching so many, you know, services along the way.

So somebody needs to be owning that entire story and like try to shift the teams to do the right things in order to, you know, bring more like, telemetry data out of the box.

Liz: So it sounds like what you're describing is offering a consistent user experience, right?

To make sure that you are exploring the same golden signals of metrics.

That you are exploring the same fields, that you would expect to see on kind of traces and wide events.

Jaana: Exactly.

And the difficulty is, I think a lot of teams at large companies just look at their like small scope and they really understand what they can expose but they don't necessarily know what it means in, you know, the larger picture.

And so I'm filling that gap or trying to fill that gap.

Because there's so much work to do there and I think specifically in a company where teams have a lot of like autonomy, it's sometimes like much of a harder problem.

But you know, that's why I'm here.

Liz: I think another question that I had was kind of relating to language proliferation, right?

Like how do you support all the varying languages that Amazon teams are developing in and make sure that it is all kind of adhering to the same kind of templates and practices?

Jaana: Yeah, this is a huge challenge and, you know, in the industry in general this is a huge challenge.

I never want to talk about like just the internal stuff because all the, you know, industry problems are also our problems because we have similar like scale and like fragmentation between teams.

If you think about, like, for example, we want to, you know, have like high cardinality labeling plus being able to, you know, propagate those labels between different services because everybody uses microservices and you want to, you know, just kind of put that context on the wire and pass it along.

So when you're like collecting data, you just want to you know, put the right context.

All these things require a lot of like, you know, agreements in terms of both on wire and in languages.

Charity: So the schema that you have to like come up with the definition and everything, and you don't really want there to be a rigid schema because you need people to be able to toss in ad hoc data at any point.

Jaana: Exactly. Exactly.

And you know, if you take a look at what's going on in the industry right now, like the trace work with Trace Context and OpenTelemetry is also like introducing some of these primitives, but at the language level, like not every language has a canonical way.

So like in Java, for example, there are five different ways to, you know, propagate the context in--

Charity: One of the ways that I've described observability to some people 'cause different definitions, like, click for different people.

One of the ways is just by talking about how, you know, we used to have the app, right?

The monolithic app. And almost all the complexity was really bound up in the app.

And if you had, you could attach a debugger and just step right through it and see, you know.

But all that context, right?

It's bound up in the process, it's executing.

And then when you introduce multiple services, suddenly your process is just hopping from service to service and discarding all of that context every single time.

And the role of instrumentation for observability is to pack up as much of that context that might ever be useful as possible and just ship it along with the request at every hop.

Jaana: And you know, this is why observability came about as a concept, right?

Like the fact that like distributed systems is just an everyday fact now, is the reason that observability as a concept arrived and, you know, we had a very context specific, you know, definition, yeah.

Charity: It's what made it necessary, not a nice to have.

Because suddenly our old tools just don't work anymore. Right?

It pushed so much of software engineering out of the data structures and algorithms realm and into the realm of networks.

Lossy networks, packets, disks, hop, hop, hop, hop, hop.

And suddenly you're an ops person and you need different tools.

Jaana: Yeah, the funny thing is, you know, a couple of years ago, like back in, I think 2016, '17, '18, I was working on the performance and profiling tools and we wanted to just expand on the Go programming language.

And we wanted to expand that to like observability to see what else we can do.

And as soon as like, we are doing that, like I was leading some of the work over there.

As soon as, you know, I started to step into observability, I realized that, hey, I can't solve this in the scope of a language.

Like, it's just, you need agreements, you know?

And if you really build something really good for a language, like then you're fragmenting the community so badly, right?

Like you will have to like fix that problem in the next whatever years.

So I didn't want to do it.

And that's how I pivoted into the Census team, because they are trying to, you know, solve some of these problems.

And that's how I left the Go team.

Charity: And here you are!

Jaana: Exactly, yeah.

Liz: Yeah, it's super, super interesting, right?

Like from the automatic instrumentation perspective you know, working with the various auto-instrumentation, right?

Some of them you can do by code hooking or you can just override modules, right?

Like, so we see that in Python, Ruby, Java, right?

Where you can just reach in and do things whereas for us for Go, you kind of have to have hooks.

You have to have wrappers. And if the language doesn't support it, right?

Like if its library isn't supported, you're really out of luck.

And that's kind of been a really painful and sore spot.

Jaana: That's true. And like, there's been several actually ideas, you know, how to like hook into runtime events and such.

Liz: Like, thank goodness for HTTP Trace, right?

What did we do before HTTP Trace?

Jaana: Exactly, yeah.

Like these things just came at a later time and there's so much space to improve, but you know, like the proposals, all the proposals that I've seen were kind of at the earliest stage.

There's so much, you know, to do.

People coming from JVM, for example, they are like, "Oh, you know, I can intercept every call. I can instrument things."

You know, at a later time we unfortunately don't have that type of capabilities in Go, at least not officially.

Some people have been trying to hack into things, but, you know, I can't recommend it for the production services.

Liz: Speaking of kind of innovative things, specific languages, you know, I do--

Honeycomb uses pprof, like not often but like when we do use it, it's a lifesaver.

But that feels like it's even more at the fringes of observability, right?

Like, you know, you don't have to be an expert to use distributed tracing data, but it kind of still feels like you have to be an expert to use pprof.

So kind of when are tools like eBPF and pprof, like, going to reach mainstream adoption?

Jaana: I think it's because of the fact that like you need them when you hit a really, you know, micro-optimization problem.

I mean, most of the time you don't necessarily have a CPU optimization problem, right?

But everybody wants to have like a bird's eye view of their services so they care about distributed tracing more because it has that everyday use.

Charity: Yeah, I really feel like the low level system stuff and it pains me to say this 'cause I love them.

Like I love performance tooling.

But like for the average engineer, needing to fall back and use one of them is a sign that, you know, it's a leaky abstraction, right?

Because when is the last time I went to the colo and like, had to figure out if there was bad RAM in a machine.

You know, it's just, it's something that's like, you know, we're moving up the stack and for the average software engineer, you shouldn't need to use them often.

It should be an exception, you know? I think that will be more and more the case.

Jaana: I completely agree. Like fundamental building blocks, maybe you actually need to take care.

Like one example is the telemetry instrumentation libraries, right?

The area that we are working on.

It's important to maybe optimize a couple of things because you are adding some, like, you know, overhead to every request because you create some, you know, trace spans--

Liz: Who traces the tracer? Yeah, that's totally reasonable.

Jaana: Exactly. This is in everything, so you're just, like, eating like another 10 to 20% overhead.

So you care about those things.

And you know, one of the reasons I actually liked continuous profiling and I spent some time working on it is it kind of helps you to recognize what are some of these super hot parts in the entire production.

Like in the continuous profiling world, you're uploading your profiling data.

It aggregates everything and tells you, "Oh, these are like the functions that, you know, you call all across the company or all across the organization."

And then if there's any, you know, thing that you can improve, at least, you know, it gives you like the priority list, right?
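The aggregation step described here, summing uploaded profile data across services and ranking the hottest functions, can be sketched as follows. This is a deliberately simplified model: real profiles (for example pprof's format) carry full call stacks, and the `Sample`, `Hot`, and `Rank` names and the sample data are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// Sample is a simplified profile entry: how many CPU samples one
// service attributed to one function.
type Sample struct {
	Service  string
	Function string
	Count    int
}

// Hot is a function together with its total across all services.
type Hot struct {
	Function string
	Total    int
}

// Rank sums samples per function across services and returns the
// functions ordered from hottest to coldest (ties broken by name).
func Rank(samples []Sample) []Hot {
	totals := map[string]int{}
	for _, s := range samples {
		totals[s.Function] += s.Count
	}
	out := make([]Hot, 0, len(totals))
	for fn, n := range totals {
		out = append(out, Hot{Function: fn, Total: n})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Total != out[j].Total {
			return out[i].Total > out[j].Total
		}
		return out[i].Function < out[j].Function
	})
	return out
}

func main() {
	// Made-up numbers, purely for illustration.
	samples := []Sample{
		{"checkout", "json.Marshal", 40},
		{"search", "json.Marshal", 55},
		{"search", "regexp.Match", 30},
		{"checkout", "tls.Handshake", 10},
	}
	for _, h := range Rank(samples) {
		fmt.Printf("%-14s %d\n", h.Function, h.Total)
	}
}
```

A function that looks cheap inside any single service can still end up at the top of the list once its cost is summed across the organization, which is what turns the output into a priority list.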

Liz: Hmm. So it's kind of a difference between what are the tools that a platform team uses and what are the tools that kind of product dev teams should use daily.

And I think that that kind of distinction in audiences is something that kind of hasn't really been super well articulated, right?

Like you definitely see tools being put out there.

You know, saying, "Hey, you know, use continuous profiling."

Right? But, but not actually saying, "This is what it's good for. And this is when you should use it instead of tracing."

Charity: I think that those tools are mostly kind of a luxury for the teams that already have their shit in order.

Whereas you have "you must be this tall to ride" you know, much like SLOs.

It's like, it's a luxury that everyone should have.

It's like, "What should I do this week? Oh, let's see if I can lower my AWS bill."

You know, I mean, thumbs up. But so many teams are just not in that space.

They've got a ways to go before they get there.

Jaana: Yeah, and you know, everything is becoming cheaper also, right?

Like, you know, memory and like CPU is just becoming cheaper so it becomes more of a luxury.

Charity: Just throw more hardware at it. It's like, "Well, what if we just add more of them?"

Jaana: Yeah. Yeah.

Liz: It's kind of addressing that hierarchy of needs for sure.

Jaana: Yeah. And like, you know, whatever you can improve is just really this like top of the stack like your actual application.

But there's layers of layers of, you know, complexity and, you know, unoptimized stuff underneath it.

So you know that like, you know, you're blocked in a way so, you know, yes, of course you can improve things, but in the larger picture, you have very few chances to actually improve your bill.

Charity: Yeah. What is the best thing to spend your energy on as an engineer?

Like for most companies, for most teams of most companies it's like spend it on your core business differentiators.

That is what most people at most companies should spend most of their time on.

And it's only when you get very large where you have real specialists or when you're really in trouble.

Or you're just being kind of wasteful with your most precious resource, which many teams are.

And so many, for so many engineers, because this is fun to us, it's just, it's a real temptation sometimes to just be like, "Oh, I really need to do this this week."

And stop you from, you know.

Jaana: Yeah, I've seen this a lot.

Like I have this like, you know, Friday, you know, on my happy projects or something like that.

And like, let me profile and like, let's see, you know what I can prematurely optimize, right? Like it's such a common behavior.

Charity: And once in a while, you pick up that rug and there's a giant fucking cockroach and you need to go, "See, I've told you so. You're so lucky that I looked into that right today."

Jaana: Yeah.

Liz: So one final topic before we wrap up.

Kind of what is that intersection of kind of application logic, cloud customers, and kind of when the leaky abstraction becomes like leaky, right?

Like how do we better diagnose these things?

What is our path forward when we start, you know, seeing things where, you know, there's sharding that's going on on the provider side, that's invisible, right?

Like what happens when the magic is no longer so magical?

Jaana: Yeah, like going back to the fact that everything is becoming, you know, more complex in terms of like, you know, what we depend on.

You know, in the end of the day, my app is just a really tiny part of the entire system, right?

It just relies on all these different systems.

And one thing that I realized, like especially like among customers, is that they have to learn the behavior.

They're optimizing for that learned behavior over time.

But in the end of the day, they never actually understand what actually went wrong.

One of the typical problems was like, "Hey, there's an outage."

They can't always really tell, like, is it my problem?

Or is it like cloud provider's problem?

They were at like that level.

Once we are able to answer that question, the next thing that we can do is exactly telling them like what actually went wrong, you know, in that particular path.

So, you know, if it's a misuse or something, they actually like learn how the system--

Charity: So you are being their observability.

Jaana: That is true, yeah.

Like, and I feel like this is one of the reasons I took this job because, you know, all the things that we provide is just a huge, you know, unknown box.

Like they don't know what the behavior is, what's going on inside, but it's just really important to the, you know, the availability and the entire like health of the, you know, the critical path.

We are sort of like in this position to provide the best.

So, you know, you can truly understand what actually happened because, you know, you want to rely on it and that's the only way that we can communicate what's happening.

And that's, you know, the value proposition pretty much.
