JUL 10, 2020

34 MIN

Ep. #22, Designing for Observability with Jimmy Bogard of Headspring

Guests: Jimmy Bogard

about the episode

In episode 22 of O11ycast, Liz and Charity speak with Jimmy Bogard of Headspring. They discuss maintaining balance for on-call engineers, what’s missing in the average engineer’s toolkit, and moving from monoliths to microservices.

about the guests

Jimmy Bogard is an independent software consultant and the Chief Architect at Headspring. He was previously a software developer at Dell.

show notes

Headspring’s Blog

transcript

Jimmy Bogard: I try to stick around my clients until the projects actually go to production and hopefully, after that as well.

So I learned about it pretty quickly, because things tend to go wrong in production, especially for new systems, new applications.

And when I have to be tasked to figure out what went wrong, then I start to care very deeply about observability.

Liz Fong-Jones: You don't throw things over the wall, you definitely encourage your teams to practice kind of full cycle ownership of their code.

Charity Majors: Is there even a wall anymore? Like it feels like there never should have been a wall?

And we've been busy tearing the wall down. And then there are remnants of the wall, but like is there still a wall out there for most people.

Jimmy: There is. I do get to interact with a pretty wide variety of clients.

And when the whole devops movement came about, as people got further down the adoption curve, larger organizations assumed that meant "I need a devops department," which is hilarious because--

Charity: Yeah, I know, oh my God.

Jimmy: You just missed the point. So I still see that sometimes, less and less.

So with my current client, our operations person is on our team, sitting next to us.

And so we're working directly with them to figure out what we need to do on the development side to help make their job better, and what things they need to make our job better and kind of work hand in hand.

Charity: Who's on call?

Jimmy: So, it depends on how severe the issue is. But right now there are still tiers of support for my clients.

Charity: Mhm.

Jimmy: But eventually it does come back to the developers. If something is a high enough severity, then we're going to be called to figure out what's going wrong, especially if it's a highly critical issue.

Charity: Yeah, how's the process been for engineers? Like, I feel like it's so often this lost opportunity, because it's pitched to them as, like, "you've got to eat your vegetables," right?

Now you too must suffer. Like we in ops have suffered for so long. Right?

But like, as engineers, we want to write good code, we want to do things well, and we want to see the fruits of our effort, right?

Like, don't we enjoy figuring things out?

Jimmy: Yeah, but not necessarily at 2 a.m. on a Friday or Sunday.

That's not as much fun. And I've definitely been on those kinds of clients and projects.

So it does take a little bit of feeling the pain to understand, you know, what does need to improve in order to not have those 2 a.m. calls.

Charity: How many times do you think is reasonable for an engineer to get woken up at 2 a.m.? Say, how many times per year?

Jimmy: Well, I would say, once is enough, before they realize I don't want that to ever happen again.

Charity: So obviously, this is a leading question. Perhaps I have thoughts and feelings on this topic.

I don't know, you know and there's no right answer.

You know, it depends on where you're at, in terms of stability and willingness, blah, blah, blah, all the caveats. Sure, right?

But I do feel like you know, the process of putting engineers on call has to be met with an equal commitment from management to giving them enough time to solve the problems so they're not woken up, right?

Liz: The only thing worse than being woken up at 2 a.m. is being woken up at 2 a.m. night after night after night.

Jimmy: Yeah.

Charity: And knowing how to fix it, but not being given time away from the product roadmap to do it.

Jimmy: So it usually takes one time for that to happen for suddenly those kinds of features to be prioritized. Like we need these kinds of operations tools.

Charity: Yeah.

Jimmy: But yeah, if that doesn't get prioritized properly and given the right amount of budget and time, then you have a turnover problem. People will leave.

Charity: Totally, exactly.

And I feel like it is reasonable to ask an engineer to own a service and get woken up two or three times a year, right?

Like I think I would ask that of anyone except for people with new babies.

Like people with new babies, or people who have serious problems.

Like, I would totally exempt them from that, right?

But I do feel like you know, it's a serious responsibility and ownership is real, and I feel like that's like a good balance.

Liz: So now's a good time for you to introduce yourself.

Jimmy: So yes, I'm Jimmy Bogard, a consultant out of Austin, Texas.

I've been in software consulting for, oh gosh, 14 years or so now.

And before that I worked in product, worked in IT, worked in startups. And I like consulting the best, because if I don't like the client,

I can just wait six months and I have another one.

Charity: There are not many of you.

Jimmy: Are there not?

Charity: There aren't, no.

Jimmy: I guess not.

Liz: There aren't many independent consultants, right? Like, there are the McKinseys of the world.

Jimmy: Yeah.

Liz: Right, but definitely, in terms of people who are hanging out their own shingle and getting results, that is a very small set.

Jimmy: Yeah.

Charity: Do you find your clients via word of mouth? Or?

Jimmy: Yeah, for the most part. I do a lot of, you know, blogging and speaking and things like that, and more or less accidentally have built up somewhat of a personal brand in the .NET world.

Charity: Um mhm.

Jimmy: So I don't find it terribly difficult, but I think I'm very fortunate in that case.

Charity: Yeah, that's interesting, .NET.

So if I was to, like, name a software engineering community that is stereotypically as far away from operations as you can get, that would probably be one of the ones I'd come up with. Is that fair or not?

Jimmy: I know, I know. It's especially frustrating as I'm trying to introduce these kinds of concepts at organizations.

Charity: Yeah.

Jimmy: And then I realize, like, okay, why don't you use this tool?

I go to the SDKs, and .NET's at the very bottom of the list for just about everything, right?

Charity: Yeah.

Jimmy: Okay, I want to introduce you to what it is. Well, the tooling is years behind even enterprise Java.

Charity: Yeah.

Jimmy: Which I consider more enterprisey, but somehow they made it much further along in the observability space, so.

Charity: Is this just a hangover from Microsoft's reputation or what? Do you have any theories?

Jimmy: Oh, there's a couple of things. Like, most of my clients have a lot of on-prem deployments.

And so they really look towards Microsoft to fill a lot of those gaps for them.

And Microsoft hasn't filled those gaps, so they just haven't bothered.

And when you still run giant monolithic systems, the need for things like distributed tracing just isn't really there.

But as they started to make this move, they realize, okay, other companies have been running, you know, Jaeger for years. We don't have that at all.

We just kind of have to build that from the ground up for ourselves.

Liz: Mm hmm, but I imagine that because .NET is a, you know, virtual-machine-based language, there must be reasonably good single-host APM tools that at least provide some kind of learning path.

Has that been your experience? Or like, what does that learning path look like when people go "Okay, I'm moving from monolith to distributed" and what are those analogies that I can kind of grasp for?

Jimmy: Sure, so the story is definitely getting a lot better.

About 10 years ago, I was part of a big monolith project. It wasn't even called microservices, the term didn't exist at the time, but services, service-oriented architecture.

And then yeah, they went to production. When we flipped the switch, something went wrong. We had zero insight into what went wrong.

Charity: You're literally, like, trying to craft mind castles in the sky, just trying to reason about where your packet could be and what it could look like.

Jimmy: Yeah.

Charity: That sounds familiar.

Jimmy: So we wound up having to build a lot of those tools ourselves, like having to discover even what distributed tracing needs in order to function: that you need these identifiers and parent identifiers, that you have to have propagation, and that you have to identify this request versus that request.
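
The pieces Jimmy lists can be sketched in a few lines of Python using only the standard library. Each unit of work (a span) carries a trace identifier shared across the whole request, its own identifier, and its parent's identifier, and those identifiers are propagated to the next service in a header. The `Span` class and the helper names are hypothetical, chosen for illustration; the header layout assumes the W3C Trace Context `traceparent` format rather than anything the speakers' own tools used.

```python
import secrets
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    trace_id: str             # shared by every span in one request
    span_id: str              # unique to this unit of work
    parent_id: Optional[str]  # None for the root span
    name: str


def start_span(name: str, parent: Optional[Span] = None) -> Span:
    """Start a root span, or a child that inherits the parent's trace_id."""
    trace_id = parent.trace_id if parent else secrets.token_hex(16)
    parent_id = parent.span_id if parent else None
    return Span(trace_id, secrets.token_hex(8), parent_id, name)


def inject(span: Span) -> dict:
    """Outbound propagation: encode the identifiers as a traceparent header."""
    return {"traceparent": f"00-{span.trace_id}-{span.span_id}-01"}


def extract(headers: dict, name: str) -> Span:
    """Inbound propagation: continue the caller's trace in this service."""
    _version, trace_id, caller_span_id, _flags = headers["traceparent"].split("-")
    return Span(trace_id, secrets.token_hex(8), caller_span_id, name)


# Service A handles a request and calls service B.
root = start_span("GET /orders")
outbound = start_span("call inventory", parent=root)
headers = inject(outbound)

# Service B reads the header and joins the same trace.
remote = extract(headers, "reserve stock")
```

Because `remote` shares `root`'s trace identifier and names `outbound` as its parent, a tracing backend can stitch the two services' spans into one tree, which is what makes "this request versus that request" answerable.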

Charity: You know, it makes me sad, scared to just think of how many of us reinvented tracing out there in the wild on our own.

Liz: And also how many people tried the off the shelf things from five or 10 years ago, and were thoroughly disappointed because it was too complex.

Charity: Yeah, they're just like, aah this is not worth it.

Jimmy: Yeah, we have some clients that, now that they've moved towards this more distributed model, try this sort of easy mode, which is doing a lot of auto-instrumentation, and then find that that's sorely lacking because some component doesn't have it working for them for whatever reason.

Charity: Right.

Jimmy: And now they have to roll a whole bunch of custom vendor-specific code to try to plug those holes.

Liz: So I guess that's kind of how you came to the OpenTelemetry community then, by starting from this position of, you know, what resources are out there for vendor-independent instrumentation.

Jimmy: Yeah, that's exactly it. Being a consultant, I get to see a lot of different kinds of observability tools in practice, especially across hosting providers.

So you can't just count on a single kind of tool to be in place.

And because, especially in the .NET world, the SDKs are typically the last ones to be developed, it was usually me having to develop those plugins for whatever next observability tool someone was using at that client.

And I was getting tired of developing, "Okay, I have to develop this Zipkin plugin for RabbitMQ, but that'll go away because the next client is going to use something else. They're going to use App Insights."

And it just got really old.

Charity: I'm so curious, have you ever shown up at a client and just been, like, blown away by how awesome their telemetry was?

Jimmy: No, but I'm a consultant though, so I only get called when things are wrong.

Charity: I guess you're kind of like a trauma surgeon, right? Like, they don't call me for the easy cases.

Jimmy: I know, exactly.

Charity: But like it's so mystifying because, you know, we all crave impact, right?

Like, we all want to, you know, it's thrilling to put in a small amount of effort and just watch, like, a mushroom cloud of impact.

Like, that's awesome. As long as it's not a literal mushroom cloud, but you get what I'm saying. And yet, if I try to think about where people can apply effort that reliably pays off in, like, orders of magnitude of benefits, it's in instrumentation and observability.

And like, shedding light on things, and just like --

I feel like so much of the technical debt, it doesn't just come down to lack of being able to see things but it starts there, and then it grows.

Liz: I guess the interesting question that I'd ask, though, is, you know, is it because people have not been investing in telemetry?

Or is it that their previous investments haven't worked out well?

Right, like we've all been in the shop where they log everything using printf debug statements, right? Like, where they invest a lot of effort into it, right?

Like, because of the kind of full-text indexing.

Charity: And I think like, I've definitely been at places where we invested a shit ton of effort into like our monitoring tools.

And they were useful, they paid off inside the ops team but not outside of it because it's almost like so much translation was required to translate the language of low level systems, you know, counters and statistics into the stuff that would make sense to software engineers that they kind of needed us to stand next to them and explain it all the time.

And I feel like maybe that's one of the leaps that I feel like we're starting to make is making tools that are way more powerful available to people that is kind of--

It's in the language that you write in every day, like endpoints and variables, and it should feel familiar to a software engineer who sits down and looks at it. So maybe that will help.

Jimmy: I do often see an education gap, because people are seeing symptoms but don't understand the disease at all.

They don't understand what questions they need to be asking, to say "this is the thing that we actually need to know," or what tools can help them solve those problems. They see that they're spending sometimes a week trying to diagnose a problem, but don't understand that if they had these specific tools in place, then that would have taken a second to understand, or that question would have been answered by a tool for them.

Charity: Right.

Liz: That is like lack of imagination, right?

Like, people can't imagine that there's something that solves it, right? Like, they just assume this is the way that it has to be, or "this is for people who are, you know, way more sophisticated than I am," not realizing that there are actually solutions appropriate to where they are.

Charity: I think it's also maybe that we're used to tools where we can craft specific answers. Like, how many times do you remember, okay, I've got this problem, I struggle, I wrestle with it, and then days later I figure it out.

Cool, I figured out how to ask this question. So then I create a dashboard, so that I always have the answer to that question up, and I leave it there. And this is why our past is, like, littered with a trail of ancient dashboards--

But the idea of making it so that I can ask this kind of question easily and repeatedly, in any circumstance, is kind of a level of abstraction above that, that we aren't used to, maybe.

Jimmy: I don't know about you two, but one of the things I also run into is just the complexity of the budgets, of who pays for this tool in companies.

Charity: Oh, my God, that is so frustrating for us right now.

Like, because IT ops teams hold the budget.

But the people who need to use Honeycomb are people who are writing code and are in the trenches with it every day, who are most often, sadly, because in my world that's not the case,

but in the median case, not ops people. They're software engineers.

Jimmy: So who's going to foot that bill?

Charity: So who, yeah, so it's kind of like, you know, you have to sell to one while another is paying, and that's not great.

Liz: The other situation I've often seen is kind of the centralized omnibus budget for logging vendors, right?

Where one team, which is, you know, probably the ops team, just takes on the, you know, six-figure Splunk bill or whatever it may be, right?

And then they don't wind up doing chargeback, right?

They wind up just having teams send all their logs, kind of fire-hosing all their logs, to that one team.

And it's not even a vendor bill, necessarily. There are people who are growing their own ELK clusters, right?

And it winds up, right, like the teams that are sending all this garbage, don't wind up paying the cost for it. And then when those teams instead try to spend money on something better, right, then they're told why are you spending money, right?

Charity: Right.

Liz: So how have you found navigating that, and helping sell that to your clients, Jimmy?

Jimmy: I kind of go the guerrilla route, which is I'm going to get something deployed in a container that is just visible to my dev environment.

And like, just let's start getting the developers using it as the kind of a local development tool or just for the dev environments.

And then we can start to sell and show the value, to say, "Hey, remember that thing we were trying to fix?"

Just in the dev environment, because, you know, we're trying to get acceptance. That's something that took five minutes to answer, as opposed to the week it used to take.

So that was X amount of dollars that we saved on that time, and we could spend it on something more important.

Charity: We are so bad at quantifying the cost of our time as engineers.

And part of it is because we love what we do, but it's hard to see it as work sometimes.

And part of it is just because I don't know, it's another conversion step.

Like I don't know, like the number of hours that people spend wrestling with this stuff.

It far outstrips any vendor bill that they're paying. But it's really hard to see it in those terms.

Or like the people who are like, I'm going to staff a team to build an observability tool, internally like, and it's not just the cost of the team there.

But like the opportunity cost, the loss of focus on the business. This is one thing that we are seeing now with the economic downturn: a lot of people who were planning on hiring their own teams to do these things are suddenly looking at the fact that their headcount is frozen for the next, you know, who knows how long. And I feel like that's actually helping people think more concretely about, okay, my engineering cycles are precious and rare.

Should I be spending them on what's core to my business? And, you know, is this it? Which is actually maybe a good side effect.

I don't know, TBD, I guess.

Liz: The other thing that I thought was super interesting about what Jimmy said about starting with a dev environment is that usually, if you're using a telemetry provider and sending your telemetry somewhere, for the volumes encountered in a dev environment, there's actually not that much cost associated with it, right?

Like, you can often get it for free and--

Charity: Yeah, and the problems aren't that hard.

Jimmy: Yeah, right now we're strictly using it as that kind of distributed debugging tool, knowing that we'll be having to answer more complicated questions in the future, but just trying to clear hurdles so the developers are really focused on adding value, as opposed to just spelunking and trying to, you know, go through logs and junk like that.

Charity: What does that mean for you, adding value?

Jimmy: You know, I used to do a whole bunch of, like, lean software stuff before I got jaded with the whole agile thing.

And so I would often go through these exercises with customers of doing the whole value stream mapping, of, you know, mapping out value-adding activities versus value-detracting ones, and saying, you know, there's value in the answer you have for the customer, but not necessarily in the amount of time it takes you to get the answer. The value's in the answer, not in the work.

Charity: Interesting.

Liz: So how did you first become involved in OpenTelemetry? Kind of, what was your introduction to it?

Jimmy: It was, I guess, about six to nine months ago, I was seeing the problem of this current client who was needing distributed tracing.

And they did not have a target yet of what tool they wanted to use to achieve that.

We saw the tool that they're using today, but it was using auto-instrumentation.

And even worse, it was using strictly log parsing to kind of infer spans and traces. And it clicked, like, I don't want to build on top of vendor-specific headers.

I don't want to build on top of this proprietary API. And then that's not portable.

I can't take that to the next client, nor can I keep it if they decide to go with something else. That's all wasted work, all just chucked out the door.

Charity: Yeah.

Jimmy: So you just start looking, well, what is out there? Just, you know, start googling "open source tracing."

That's how I first got introduced to, I guess, the OpenTracing project and OpenCensus, before that all got merged together.

Charity: I've got to say, I was a little dubious. You know, the past is, like, littered with the corpses of projects that have tried to do this and failed, and it sounded like yet another one.

Liz: Right, the old XKCD comic about, you know, there are 20 standards, let's introduce one standard to solve them all.

Now there are 21 standards, right? But I think the commitment to sunsetting the old things is what really drove adoption.

Jimmy: That's what we're seeing actually happening in the Microsoft world.

They're not quite the new Microsoft. There's still lots of old Microsoft, you know, anti-OSS stuff, but at least with the OpenTelemetry stuff, we're seeing that they're modifying their published APIs to better support OpenTelemetry.

So the folks that are on the OpenTelemetry .NET side include someone that actually works at Microsoft.

So they're able to go back and make changes to the APIs to ensure that OpenTelemetry works better with them.

And then new components can work better with it as well.

So that's what really encouraged me. It's not something that has no support from Microsoft, because in the Microsoft world, our dev teams expect to get spoon-fed by them.

So it was good to see that there is first class support from them on this project.

Liz: Yeah, it makes a lot of sense. In our world we develop in Golang, and part of what makes Golang work particularly well for us in distributed tracing is the fact that everything in Go expects to be passed a context object, right?

Like it's kind of this basic thing that results in it being easier to plumb things through all of your methods.

And then you have the data in place to make those distributed calls traced afterwards.
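
The pattern Liz describes can be sketched briefly. This is written in Python rather than Go, purely for illustration, and the `Context` class, the function names, and the `x-trace-id` header are all hypothetical; none of it is Go's `context` package or any tracing library's API. The point is only that when every function takes a context as its first argument, the trace identifier is already in hand wherever an outbound call is made.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Context:
    """Hypothetical request-scoped context, passed explicitly to every call."""
    trace_id: str


def handle_request(ctx: Context, order_id: int) -> dict:
    # The handler does nothing tracing-specific; it just hands ctx down.
    return load_order(ctx, order_id)


def load_order(ctx: Context, order_id: int) -> dict:
    # Deeper in the call stack, ctx is still available for the outbound call.
    return call_inventory(ctx, {"order_id": order_id})


def call_inventory(ctx: Context, payload: dict) -> dict:
    # Stand-in for an HTTP client: attach the trace id to the outgoing headers.
    headers = {"x-trace-id": ctx.trace_id}
    return {"headers": headers, "payload": payload}


result = handle_request(Context(trace_id="abc123"), 42)
```

A platform without this convention has to get the identifier to `call_inventory` some other way, which is the gap Jimmy goes on to describe for .NET.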

Jimmy: So that is one huge challenge that Microsoft has in the .NET world versus really any of the other platforms being supported,

as they don't have that easy ability just to wrap or pass things through anything.

So in Java, you can do bytecode weaving. In JavaScript, you can do whatever the hell you want, because it's JavaScript.

And so they actually need to bake the support in and release it as part of the libraries they publish. And so that's why I was really excited to see them directly support it, because they would have to do that in order for OpenTelemetry to be able to hook into the right places and publish out the things that it needs.

Liz: But at a certain point, right, it's in their self-interest, right? It's a competitive thing, right?

Like, if the language that they are promoting heavily does not include support for distributed tracing out of the box, then it will be an inferior language for developing microservices and distributed systems.

They kind of have to do this almost.

Jimmy: Yes, I think so, especially with Azure becoming as big as it is, money-wise.

And I'm seeing, at least me personally, there's a huge gap in their built-in tooling for observability.

I mean, there's App Insights, but it's just not the same as other observability tools I've used.

Liz: Yeah, definitely it's a common pattern, that kind of cloud provider instrumentation is okay, right?

Like, if you're running on the cloud, but it's definitely not a substitute for kind of full, application-dedicated observability that's not coupled specifically to your cloud provider.

Jimmy: And I think that's where they got to, because they did that two or three years ago, where they just tried to bake their tool into everything.

And very quickly, they realized that's just not going to work.

We can't re-release our entire stack just because our specific observability tool changed an API. That just can't work.

Liz: So you played around with OpenTelemetry, you realized that it solves some problems, and then you started writing about it, which is, you know, how we first heard of you and invited you onto the show.

So kind of what were the kind of key things you wanted people to learn from your blog post, which we'll also link in the show notes?

Jimmy: Oh sure, so primarily it was just education, just understanding that this project exists, and that, you know, if you're doing any kind of distributed systems, you need to be keeping an eye on it to understand how it's going to affect you.

And the other big thing was just understanding the built-in observability tools inside of Microsoft, how they're changing those tools to be able to better support projects like this.

So I work with a pretty wide variety of systems and components in production.

And Microsoft only has two or three of those supporting any kind of tracing out of the box.

So if I'm coming somewhere new, and I need to extend some library to include support for observability, then I wanted to leave some breadcrumbs for other folks to at least learn, you know, how I did it, and hopefully how they can do that in the future.

Liz: That makes a lot of sense, right?

Because you have to persuade people that you know, there was a need to add observability hooks into their libraries or better yet for them to just add observability into libraries to begin with, and that requires people to appreciate the value.

That makes a lot of sense.

Charity: Yeah, what do you see being the tipping point for people? You know, if you've seen any converts who have gone from being, you know, dubious to being rabid partisans, or even just to being like, you know, "yes, this is worth it."

You must have seen this journey at a lot of customers now. Like, what?

What are the common themes, or what tends to, like, widen people's eyes and get them on board?

Jimmy: Honestly, the first time someone sees any kind of distributed tracing UI, they're usually just blown away, because they've wanted that for so long.

Charity: Yeah.

Jimmy: And they're probably so tired of just digging through Splunk or something and trying to retrace the steps and figure out what the heck happened.

Charity: 'Cause the problem with logs is you have to know what you're looking for before you can find it.

Jimmy: They're fine for their purposes, but they're not an end-all, be-all.

Charity: Yeah.

Jimmy: So that's usually what I tell them, like, "Look, here's X, Y, Z distributed tracing UI. Isn't this great?"

Charity: Yeah, if it was that easy, we would all be happily using traces and all this stuff like what? Why isn't it that easy?

Jimmy: Well, in the .NET world, it's not that easy. That's what OpenTelemetry is really helping to solve.

Charity: Is it just friction in tooling? Is that the only reason?

Jimmy: For me in Microsoft it is.

Charity: Yeah.

Jimmy: So in my experience, it's simply the lack of tooling.

So, for example, if I wanted to just pick Zipkin.

Zipkin itself in the .NET world might only have extensions for two of the three components I'm using.

But if I'm using RabbitMQ, there is no support for that. So now I get to write that. I'm using MongoDB? That doesn't exist, I've got to do that.

If I picked literally any of the other platforms that those tracing tools support, it would be super easy.

But it's like okay, I have to write code for my clients, and they have to spend my hours to do so. And are they going to recoup that?

Liz: Yeah, exactly like, this is kind of the argument that we had was like, you know, "hey, why should we as an industry be employing an army of you know, hundreds of solutions engineers to write integrations for everything when we could just do it once and be done."

Charity: When we started talking to people about tracing, there were two things we heard over and over and over that were problems.

Number one, that you had to instrument everything before you even got partial benefit.

And that was really frustrating for a lot of people, because they have to take it on faith that it is going to be that valuable and worth their time.

And then number two, that like over and over again, like it would turn into a thing where like, yes, they would have deployed tracing, but one person knows how to use it.

And everybody goes to that person when they need to see something traced.

And it's never really become something that is just, like, part of the average engineer's toolkit, even in places that do have a very extensive, you know, deployment.

Have you experienced that, or has it not been that way?

Jimmy: Yes, unfortunately. So I can show them the picture of the trace, and that's great.

But there's the second part of it is okay, let's actually practice how to use it, so.

Charity: Yeah.

Jimmy: Sometimes what I'll do is I'll make some API endpoint--

I just throw an exception, throw an error deep down somewhere, like, okay, well, let's diagnose this and figure it out, and kind of give them the shortcuts to understand when something goes wrong.

Here's the, you know, here's the quick way you can figure out what the heck happened and go back from there.

Liz: Yeah, the way that I characterize that problem is, it's not just gathering the traces. It's being able to find the right trace, right?

Like, you can't just have a, you know, giant collection of traces that no one ever looks at, because no one knows how to find the right trace.

You kind of have to incorporate it into people's existing flows around, you know, being able to go all the way from "here's the high level graph, here's the error rate, oh, no"--

Charity: It's an education problem, but it points to weaknesses in our tooling.

And this is why Honeycomb, from the very beginning, was so focused on, we don't build for individuals, we build for teams, because we realize that the act of collaborating with other people, like publishing snippets of your brain, and making your history available to yourself and other people--

Like, even for people who don't think of collaboration as being that important, it always is, because when you're debugging, you're always collaborating with, like, your past self.

And when you're writing instrumentation, you're collaborating with your future self.

And if you could make those really solid and really grounded, then yes, it helps.

It helps the entire team, but it also helps you. I don't know about you, but when I'm working on a part of the system, I know everything about that, like, intensely. But after I've stopped, after I've moved on to something else, my knowledge decays. When you're working on a large distributed system, part of it lives in your head, part of it lives in every other team member's head, but each of you is responsible for the whole thing, right? You have to be able to know your part intimately. But then you have to be able to debug and trace it through the whole thing.

So I feel like this has been a real gap in products that have been built so far, with the exception of Honeycomb.

But inevitably, other people have got to start thinking this way about, you know, wearing grooves in the system as you're using it, so that the way the expert user debugs and understands and learns isn't lost, so that knowledge isn't locked up in our heads, but it's in a centrally available source of truth that we all have equal access to, so we can democratize it.

Liz: And so that if that one expert leaves your team, they don't take all their knowledge with them.

If people are interested in learning more about this, they should definitely listen to the last episode that we published, or two episodes ago, the episode with Jessica Kerr, because that was kind of her central theme, right?

It's this idea of joint work between multiple humans and multiple systems.

Charity: Yeah.

Jimmy: Jessitron, she's great. She's really wonderful.

Charity: She really is.

Jimmy: So one of the things I try to emphasize with my teams here is that designing for observability is not something that's just done.

It's part of a process and that process needs feedback. So I need to understand what my developers are spending their time on to understand, well, what things do I need to put in on the observability side to make their lives easier.

But it's not something where it's just, okay, we've ticked the box now, it's done.

It's something that still requires feedback to continually improve.

Charity: Yeah.

Liz: Yeah, for completing this process, right?

Like it's my attorney likes to talk about observability driven development, the same way we would do testing driven development, right?

Like you kind of have to bake it into a continuous process with that feedback loop.

Charity: Yeah and this goes back to the lean stuff that you were talking about too.

Because like, I see like, you know software engineers, they kind of have just the system in their heads, like they write the code, they, you know, they test it.

And then once it's tested, their job is done, right?

Like tests pass, I can go home, where like, that's just like the pre-test, right?

That's just like the, alright, let's just make sure there are no regressions, no obvious things, but then you're not testing that shit till you're rolling it out in prod, you're not testing it until people are interacting with it.

And you're not testing it 'cause like you know, start with the fact that like nobody's staging environment in any way resembles production. It just doesn't and it never will.

And it's better for us to just admit that and accept it and start embracing prod as part of our, you know, test loop and stop chasing this, like, fucking dragon that doesn't exist.

Even if it was possible, you know, no one's going to pay for it for one thing, and even if they were willing to pay for it, even if we did, like, the gold standard, capture and replay, right?

I've written this piece of software myself, I know, for fucking three databases: capture 24 hours' worth of traffic and replay it, you know, with various knobs. Even if you did all that, even if you paid for it, you know, users are chaos engines.

You know, it's like the Michael Jackson death problem like it had never happened before.

Tomorrow is not today, and it never will be. And the harder we fight that, the more problems we're going to have. We should just embrace it, like we embrace that failure happens all the time. It's fine.

The point is not to make things not fail. The point is to make it so that users can't tell, that it doesn't impact them, that it's graceful, that it's not fatal, right?

Jimmy: I wish my business could take that too.

Charity: Well, you know, I feel like people need to feel safe if you're asking them to accept this risk, the failure and stuff.

You need to help them feel safe and oriented which starts with putting the fucking glasses on so you can see where you are, right?

Like, if you're hurtling down the highway, and you're as blind as I am, and you don't have your glasses on, everything's going to be scary, right?

And it's not that it doesn't take skill, and like effort and focus and prioritization and all these things, but like, first you have to be able to see where you're at and get rapid feedback about your actions.

And I really feel like people who've grown up with, you know, the last couple decades of technology haven't really adjusted to just how detailed, how specific, how exact, how much visibility they should expect from their systems.

How much is reasonable. And like, the bar just needs to be raised in people's heads, and they need to realize it's for them too. It's not just for Google and Facebook.

You're not in the Bay Area, which I love. I really like talking to people who aren't inside our bubble, right? Like, do you encounter this perception?

Or this kind of sense of preemptive defeat very often out there in the field, of just like, yeah, we're not Google or Facebook.

We don't get nice things. This is as good as we can do. This is as good as we should expect for ourselves.

Jimmy: Yeah, pretty frequently, but I guess that's one advantage to being a consultant is I'm trying to be there to help them solve the problems that they know about, but also kind of find their, their unspoken needs as well.

And things like observability are definitely one of those areas.

Charity: Yeah.

Jimmy: Maybe teams that say, you know, we've got a quarterly release process and I, you know, my stomach turns but you know, better.

Charity: You see the DORA report, right, where they've got their, you know, year over year they publish their, you know, metrics, if you're an elite team, or if you're a high performing team, you know, and year over year for the last couple years, the bottom 50% is actually, like, losing ground.

Jimmy: Oh, no.

Charity: Well, you know, the top 50% is accelerating, like they're getting better faster.

Liz: Yeah, the good performers are becoming great performers and the middle performers are actually regressing back to low performers, and we have to fix that.

Charity: Well just because like if you're standing still in tech, you're losing ground because chaos is always increasing.

You know, entropy is out to get you like things are getting harder.

And if you're just standing still, if you're just doing the same thing as you were doing six months ago, a year ago, two years ago, things are not staying in place.

They're getting worse. And it's not your imagination. Like, it's a thing that we have to actively fight by actively seeking out better tooling, better strategies.

You know, we have to see our tech as, like, a thing that we cycle through, not a thing we deploy and achieve.

We don't achieve high ground and stay there. It's more like we're on an escalator.

And we're trying to climb faster than it could, you know, go down.

Liz: Yeah, we have to definitely invest in reducing complexity and otherwise kind of paying down our technical debt or else we're never going to be able to keep up.

Charity: Yeah.

Liz: All right well, we're nearing the end of our hour together. Jimmy, any kind of closing thoughts?

Jimmy: Well, nothing really, thanks so much for having me on. Appreciate it.

Liz: You're very welcome.

Charity: Yeah, it was really great to talk to you.
