O11ycast
32 MIN

Ep. #55, The Dev Side of Observability with Martin Thwaites of Honeycomb

about the episode

In episode 55 of o11ycast, Charity and Liz speak with Martin Thwaites, a developer advocate at Honeycomb. They discuss the dev side of observability, exploring topics like logging pipelines, the pain of context switching for developers, the significance of the DevOps movement, best practices for scaling engineering teams, and outside-in test-driven development.

Martin Thwaites is a developer advocate at Honeycomb, a public speaker and activist for observability, and a committee member at NDC Conferences.

transcript

Martin Thwaites: So I'd say that my favorite client was more to do with the constraints that they'd provided, which was using Azure PaaS functions rather than just plain compute, which brings in some really interesting challenges around how do you scale this? How do you work with those constraints? What is it useful for? What is it not useful for?

And trying to do things at a massive scale with pure Azure PaaS, which was a really nice angle to throw at a very complex problem that they had anyway, and trying to run that in a constrained environment where you don't have access to, say, the runtime that you would in a normal application.

So using things like Azure functions, using things like Azure App Service, and just piecing all of that together and trying to do that in a secure and scalable way was really, really interesting.

Liz Fong-Jones: When we're talking about a PaaS, what are the key elements that are important there? You mentioned not having control over the runtime, so that means no virtual machines. Is this just pure functions or is this also other managed things?

Martin: Yeah. So obviously Azure Functions is the functions-as-a-service type thing where you don't get access to a lot of that runtime. It's very, very constrained, to the point where you don't even control the end request and response, so you can't stop the response from coming back, for instance, which is really interesting.

But it's also basically any service where you're not going to be able to really access that raw machine, that raw compute, where it does everything for you. So App Service, for instance, you tell it to scale, you say, "I want five of these," and you just give it your code and it runs it. Very similar to the Heroku type environments, as opposed to running it in the traditional VMs, but also very different to running it in scaled containers like ECS and Fargate type things.
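For a sense of how constrained that surface is, here is a minimal sketch of an Azure Functions HTTP trigger using the isolated worker model. The function name and route are hypothetical; the point is that the platform owns the host, the scaling, and the request/response lifecycle, and your code only sees this narrow slice.

```csharp
using System.Net;
using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.Azure.Functions.Worker.Http;

public class TrackBaggage
{
    // Hypothetical function: the Functions runtime decides when and where this
    // runs, how many instances exist, and when the response is actually sent.
    [Function("TrackBaggage")]
    public async Task<HttpResponseData> Run(
        [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequestData req)
    {
        var body = await req.ReadAsStringAsync();

        var response = req.CreateResponse(HttpStatusCode.Accepted);
        await response.WriteStringAsync($"received {body?.Length ?? 0} bytes");
        return response;
    }
}
```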

Charity Majors: Everybody is us nowadays. This feels like a good time for you to introduce yourself.

Martin: Yeah. So my name is Martin Thwaites, I go by @MartinDotNet on Twitter. I'm a developer advocate at Honeycomb, focusing on the developer side of observability and predominantly .NET and Azure, and all of those kind of things.

Charity: Yeah. Welcome to Honeycomb, Martin.

Martin: Thank you very much-

Charity: We hope to have you as a host moving forward on this occasionally.

Martin: As a long-time listener, first-time guest, it would be a pleasure to do that, and obviously amazing to be at Honeycomb after spending so many years on the outside trying to bring .NET to Honeycomb, and obviously maintaining that library for so many years.

Charity: That's true. We first heard from Martin... We started the company at the start of 2016, and I'm pretty sure we heard from Martin in early 2017, it was early. No one was paying attention to us and Martin was like, "Hey, guys. You need some .NET?"

Martin: Recently I was going back through one of the talks I gave back in 2017, where I was talking about logging and metrics and just getting into observability. Right at the end of that talk I threw in a slide just saying, "I've found this company called Honeycomb, and I think they're the future of observability. They're doing this thing called events and it's high performance and all of this kind of stuff." To go from that five years ago to being here now is a massive journey.

Charity: That's amazing.

Liz: I think that weaves these two threads together, right? We were talking earlier about the idea of having situations that might be challenging to traditionally monitor, and then there you were, you found Honeycomb and you realized maybe that you needed to bring those two things together, right?

Martin: Yeah. At the time I was working for Manchester Airport and we were working on systems where we had a very, very high load at the time. Well, high load to us. Obviously some people may not say high load. But it was high load to us, and trying to understand how do we throw things in?

At the time we were using Elasticsearch and doing structured logging, but struggling to see the performance stuff. Bringing that into Honeycomb was that really interesting moment of, "Oh, we can do some really interesting things with graphing this out at scale and looking at these things like high cardinality," which at the time was just a revelation, being able to see really in depth what was going on with the different requests.

I think the first thing I actually ended up doing was ingesting application load balancer logs via S3 and using triggers on the S3 buckets for the logs, and just pushing that into Honeycomb to see all of this stuff come down. It was just amazing to see the difference between that and just dropping in structured logs.
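A rough sketch of what pushing an event into Honeycomb can look like: one wide, structured event per unit of work sent to Honeycomb's Events API over HTTP. The dataset name, field names, and environment variable here are assumptions for illustration, not the actual Manchester Airport setup.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Json;
using System.Threading.Tasks;

class HoneycombEventSketch
{
    static async Task Main()
    {
        using var client = new HttpClient { BaseAddress = new Uri("https://api.honeycomb.io/") };
        client.DefaultRequestHeaders.Add(
            "X-Honeycomb-Team",
            Environment.GetEnvironmentVariable("HONEYCOMB_API_KEY")
                ?? throw new InvalidOperationException("Set HONEYCOMB_API_KEY"));

        // One wide event per request: every field you might later want to
        // group or filter by, including high-cardinality ones like user IDs.
        var response = await client.PostAsJsonAsync("1/events/alb-logs", new
        {
            request_path = "/baggage/track",
            status_code = 200,
            duration_ms = 43.7,
            user_id = "u-1234",
            terminal = "T2"
        });
        response.EnsureSuccessStatusCode();
    }
}
```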

Charity: What had you been using before using events? What kind of instrumentation?

Martin: I would say before that it was Elasticsearch and Serilog, which is the de facto standard in .NET for logging, and using that to produce structured logs, throwing them into Elasticsearch, trying to provide some sort of faceting and stuff through Kibana at the time. Which was useful, it was really good, it was a step above nothing. It was a step above taking the logs off the server and trying to work out what was going on.
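For context, the setup Martin describes looks roughly like this in .NET: Serilog producing structured logs and shipping them to Elasticsearch, to be faceted later in Kibana. The index name, URL, and message fields are assumptions; the sink options are the standard Serilog.Sinks.Elasticsearch ones.

```csharp
using System;
using Serilog;
using Serilog.Sinks.Elasticsearch;

class StructuredLoggingSketch
{
    static void Main()
    {
        // Structured logs shipped to Elasticsearch, queried later through Kibana.
        Log.Logger = new LoggerConfiguration()
            .Enrich.FromLogContext()
            .WriteTo.Elasticsearch(new ElasticsearchSinkOptions(new Uri("http://localhost:9200"))
            {
                AutoRegisterTemplate = true,
                IndexFormat = "flights-api-{0:yyyy.MM.dd}"   // hypothetical index name
            })
            .CreateLogger();

        // Message template properties become fields you can facet on.
        Log.Information("Rebooked passenger {PassengerId} onto flight {FlightNumber} in {ElapsedMs} ms",
            "p-42", "MAN123", 187);

        Log.CloseAndFlush();
    }
}
```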

It was that sort of leveling up a little bit. But there was that distance between leveling up from nothing to Elasticsearch and structured logs, and then taking those events, doing an event per request at the time, throwing all of that data on and leveling up into Honeycomb, and being able to do all of those queries at blazing speed.

This was a difference of seconds to milliseconds in running these queries, it was just night and day in what you could see.

The other thing was the fact that we were using logging pipelines. We were taking logs, we weren't sending them straight to our backends, we weren't sending them straight to Elasticsearch, it was going through Logstash. It was going through that pipeline, which meant that it took minutes for you to be able to see that data, so that, to me, was the mind-blown moment.

Jessica Kerr: That's so frustrating, especially when you're trying to debug something in production and you make a change and then you wait. You make a change, and then you wait like three minutes, five minutes. It's just like, "Ah, no wonder people SSH into boxes and tail logs because it's just so infuriating to have to wait for your telemetry to catch up."

Martin: And that's a scary time, you're there going, "Did I do it right? Did I just take the server offline? Because I've not got the logs through." It's really nerve wracking, and when I speak to people, when I start showing them the things in Honeycomb and they start comparing it to other tools and they look at it and go, "Oh, it's there now." And they don't believe it's there.

Charity: We forget about the speed because you just get used to it and it just stops being something that you think about, "Oh yeah, it's going to be available instantly. Why would it not be?" And it's always funny to watch other people go, "What? Oh yeah, I guess that's kind of cool."

Martin: Yeah. You start to take that into, say, the Azure PaaS world where it's just ingrained into people that, "Well, yeah, you do the request and then you wait a few minutes. You do the request and then you wait a few minutes." Trying to take them into a world where that is abnormal and something is wrong if that happens, and you show them it and they don't believe it.

That development experience where you run the code and the response comes back so quickly that you don't believe it did the thing that you expected it to do, and you spend so much time trying to work out why it came back so quickly. Actually, it was because you wrote the code so well.

Charity: You spend so much time talking and thinking about operable software and production software, and it's great because I think you and I think about a lot of the same things, but I come from much more the operational side and you come from the development side. But I feel like it is underappreciated just how much getting that cycle as short and as tight as possible matters. It's partly the responsiveness of the tool.

It's partly making it so that your software is being auto deployed within a few minutes. We don't have the glorious future of, say, Dark, where you're literally writing code and saving the file to production as you're writing it. But if you can get a nice loop in there where you're instrumenting your code as you go and you merge your code up to main, then within a few minutes you're looking at it in production.

You've still got all that context in your head, everything you're trying to do is fresh and you can poke through the instrumentation that you just wrote and it's an entirely different occupation than the prospect of writing software and then just shipping it off into the Great Unknown and then at some point in the next day, week, month or whatever your code goes live. But you're completely detached from it, right? Something breaks, goes into a ticket, you're not really... Okay, you're a software engineer, but it's-

Martin: It's that context switch.

Charity: Yeah. That context switch is brutal.

Martin: And it's not just the context switch of tasks, it's not just the context switch of developing Feature A then developing Feature B. It's the context switch of when you're using the debugger in your local development flow, you hit the big play button, and as .NET developers, and I've been one for many years, we love the big green play button inside of Visual Studio.

You hit play, it drops you into a debugger where you can just step through all of your code. Then all of a sudden you go to production and you don't have that, and you say, "Well, how do I know what happened? How do I find out that this particular function was called? How do I find out that this database was updated?"

That's what I'm trying to push with people at the moment around the idea of operable software, because if we take the tools that you use in production and make them as useful in your development pipeline, then what we end up with is a much smoother flow and lack of that context switch.

Because, well, if I want to know whether the database was hit, I use the observability data that I've just got out of the backend. If I want to see it in production, I'm doing exactly the same thing.
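As a sketch of what that looks like in .NET, the same ActivitySource-based instrumentation can answer "did we hit the database?" both locally and in production; only the exporter changes. The source and span names below are hypothetical.

```csharp
using System.Diagnostics;
using System.Threading.Tasks;

public class OrderStore
{
    // One ActivitySource for the service; exporters (console locally,
    // OpenTelemetry/Honeycomb in production) decide where the spans go.
    private static readonly ActivitySource Telemetry = new("Checkout.Api");

    public async Task SaveAsync(string orderId)
    {
        // The same span answers "did we hit the database?" in every environment.
        using var activity = Telemetry.StartActivity("db.save_order");
        activity?.SetTag("order.id", orderId);
        activity?.SetTag("db.system", "postgresql");

        await Task.Delay(10); // stand-in for the real database call
    }
}
```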

Liz: I really, really love this. It's this thing about how we can make our systems more humane for people to run, if they have the right context and they already know how to go about it.

Martin: Yeah, and that's the thing. If it's ingrained in developers that, "This is how I work," all of a sudden production doesn't become scary. It's just another one of my environments that I work on and I can do in production what I do locally. I don't care about being on call, I don't care about being given that production issue because it's just the same as a bug. I would use the same tools, I'll look through my trace logs, I'll look through all the metrics and say, "Oh yeah, there was a big spike."

Charity: It's just amazing. So much yes.

Martin: But it's the context switch that all developers hate, and if we can stop those context switches and make it faster. Like you said, if we can get it so that you deploy fast as well as using the same tools-

Charity: The thing is that people are always like, "Yeah, but that's only for fools and really good engineers." You got to be so good to be able to write code this way, and that's exactly the opposite... It is so much easier to write and ship code this way. It's so much easier, you have to keep less context in your head, you get to... The idea of just if you ran your code and you had to wait three days to see the output, whether it failed or not-

Martin: Please don't let us go back to that.

Charity: That would be so much harder, right? It's so much easier, and the idea that it's hard to get it so that you can deploy your code quickly, these are just symptoms. Getting your tests down so that they run in minutes instead of hours, getting your deploy stuff done, adding guardrails, these are not hard things to do.

If teams can write unit tests, if teams can write tests, they can do this right. It's mostly people lacking the confidence in themselves, they're lacking the... They just don't believe they can do it, and so they don't do it, so they're right. They can't do it. But if they just did it, then they would know that they did it.

Martin: I believe, therefore I am Not.

Charity: Right, exactly. I don't believe, therefore I do Not. But if they just did it... This is what people are always asking me, "How do you affect this change? How do you take a team that isn't used to working this way and get it so that they work this way?" And honestly, most of the time that I see this happen, it only happens because someone joins the team who has worked this way before, and they know it's not hard, they know it can be done and they're unwilling to work any other way, quite frankly.

Martin: It's the stubborn people, the stubborn people are the ones that change the world.

Charity: Well, it's the people who have experienced it this way and they know how much... they don't want to go back to wasting so much time and having to do it the hard way, and so they start pulling their team into the future because they know how much easier it is.

Martin: And I think there's a certain element of those people who have worked that way, generally end up being more elevated through their career, therefore they get more. When they go into those new organizations, they have a bit more power, they have a bit more influence.

Charity: Yeah. You learn more faster, you're more productive, you get to do more of the fun stuff of engineering, which means that you become a far better engineer much faster than if you were doing it the old way. It's just because of this incredible feedback loop, or as they call it on the DTM side, the flywheel, where everything accelerates everything else and it's just a whole different universe.

Martin: Yeah. I've equated this recently to the movement around TDD, the whole idea of go slow to go fast, the idea that if I write my tests I can actually go faster because I have more confidence in my code. I think we're in the same mode with pulling observability back into that development flow, because if I do all those things and I do them in the same way that I used to do TDD, for instance, I do them as part of my development flow. I use them to help architect my platform by using that observability.

Actually, when you go into production you've got fewer issues, and maybe the issues that you do get are easier to diagnose, so actually you get less support load.

Like you say, it's that flywheel then of things just becoming easier and easier and easier. And I think back to those problems that we had trying to introduce TDD to organizations, where we had to hide the fact that we were doing TDD because people were saying, "No, no, you shouldn't be doing unit tests. Don't spend time doing unit tests, you need to do features. I need features. The testing is for the QA team to do, you don't need to write tests."

Charity: That does sound familiar.

Martin: Yeah, it's the nightmare I still have occasionally.

Charity: I wasn't around for those conversations. That's fascinating.

Martin: Yeah. This is where we used to be, the, "Well, we have a QA team that does all these things. Well, okay, we've got an SRE team that does all of the production monitoring stuff." Well, no, no. That shouldn't be the way it is. You should care about that production.

The whole DevOps movement was around this idea of you don't throw it over the wall, you don't wait for somebody else to do it because if you do that what you end up with is no internal visibility because I don't care about it as a developer, I've got my debugger. I can hit play and I can see what's going on. But the ops team don't.

Charity: "I can prove my code is correct."

Martin: "Yes, because I can step through this code and it passed this to that function. Therefore, that's good." Then you've got the ops team are then going, "Well, I need to know this information because I need to know when I need to scale this service, I need to know where the bottlenecks are. Is it the database? Is it the caching? Is it this, is it that?"

And you'd never get it, and obviously bringing those two teams together in the DevOps movement allowed us to say, "Actually no, the developers are adding more information, which makes the ops team happier because there are fewer issues, there's less tension between those two teams."

What if we can do that with observability? What if we can bring that observability inside that development pipeline so that actually the developers value that observability more than they value their debugger?

Charity: Because they see it makes their jobs easier too, right? The DevOps movement I think of as the solution to the original sin, which was when we split up Dev and Ops in the first place. That should never have happened. It's like saying, "Okay, this is the person who is going to cook the food and that's the person who's going to eat the food," and the cook can't taste what they're cooking. We've got the taster and the cooker, but they've got to be two separate people.

Martin: I love that, I love that, because then the chef is putting way too much salt in it because they like salt.

Charity: Yeah, and like three days later someone is tasting it and going, "What the fuck?" So they file a Jira task and they're like, "Too much salt in the soup," and the chef is like, "What soup? I'm onto my shortbread now or something." Yeah. But it's ridiculous, right? Why would you ever split those things up?

There can be specialization, there are QA teams that specialize in this stuff and there are organizations where you need that. There are always going to be people who specialize in operability and in operations because these are really fucking complex systems and they require... We're not saying that developers have to be an expert in everything.

You don't have to be the world's best QA person, the world's best SRE and the world's best developer. But you have to know enough in order to do your job well, and knowing these skills make it easier to do a better job.

Martin: Absolutely. And I think there's a lot in the recruitment space at the moment where they advertise for somebody who has all of those skills, which is an entire department's worth of things, when what you really want is somebody who's pretty good at most of those things, who has an understanding of those things. But then people who specialize in those things work as consultants, so it's not as a... QA is a prime example.

When I go into organizations to talk about how to change, how to embrace things like Agile and start to do things like continuous delivery and all of those good things that are on this flywheel that we were just talking about, when I go in it's all about saying, "Let's not put a gate in. Let's not put a QA team in place that says, 'I'm going to be the gate, I'm going to make sure that you don't do things wrong.'

Don't put an ops team as a gate in there that says, 'I'm going to make sure that the deployment is correct.'" What you need is those people to advise the developers on the ways to do things better so that these people who are actually good at what they do can get that expert advice. Testing is a completely different mindset, having that mindset of, "I want to be pedantic, I want to be able to break things," I want to be able to say, "What about on a Tuesday when it's raining outside?"

The different mindsets, the engineering mindset to do infrastructure is different to the engineering mindset of building applications, so use those people for what they're good at and pass the expertise back into the teams. That's a much, much better way to scale your teams.

Charity: It's like saying, "I think that every engineer, certainly every engineer who calls themselves a senior engineer, should be able to answer the question: after you merge your code, where does it go? How does it get deployed? How does it get out there? And if it broke, how would you go about debugging it, right?"

Everybody should know how to do that, but that doesn't mean that you have to write the entire deploy system or be an expert in it. You just have to know how to navigate, you just have to know how to find the answers that you need in order to do your job.

Martin: Yeah, if you're fixing the bug and you don't know that your code has been deployed into a Kubernetes cluster, and it's in a container where you didn't actually put the commands in yourself, you don't know what's on that container. You just go, "Here's my code, it ran on my machine, you go and make it work."

If you don't know the steps that it's going to take, the compile steps, the way it's going to be deployed into production from a container or an app service, and the PaaS stuff that we're talking about, they're very different paradigms.

Charity: That is part of your production code, you can't separate the software from what it runs on.

Martin: Yeah, and I've seen people who especially... Obviously I come from an Azure and .NET world, so Azure PaaS is a really big thing in my community, where people will run it on their local machine and it'll work, and then they'll run it on PaaS, which has different restrictions around how it works, and they don't understand that. They just say, "Well, it runs on a machine, doesn't it?" Well, no, actually there's a load balancer in front of that.

Charity: What's the difference between AWS versus the Azure platform as a service?

Martin: So there's a huge amount of commonality, and they both provide similar services. In my experience, AWS is very, very, very good at infrastructure. They're very good at the raw compute type things where they'll provide EKS, they'll provide you with that Kubernetes stuff to be able to run all your containers.

They're really good at the virtual machines, their network is incredibly fast, all of that kind of stuff. But what they don't provide as much of is that layer above, that abstraction above, which is where Azure spends a lot of their time, because they obviously came out of pandering to .NET developers and, for our sins, .NET developers are very, very behind the times with DevOps and all of that kind of stuff, owning production.

We're very, very behind, so they pandered to those people by saying, "You can just right-click, deploy. You can right-click in your IDE and then say, 'Here's my app service. You go and work out what runtime needs to be on there. You go and patch my service for me, I don't want to do anything like that. You go and do all of those things for me.'"

That, to me, is the big difference between Azure and AWS, because we've got this layer on top that abstracts all of those things away. I think it's the right level of abstraction, so we're not going to a point where you literally provide the uncompiled code and somebody else does the compilation and all of those things that go with it. It's a medium level abstraction, if you like, that is very, very good.

Charity: Yeah. What are some of the lessons that you've brought back from the very large, mature Azure community?

Martin: I'll have to have a think on that one.

Charity: This is a question that came from Rynn. Rynn, do you want to give us more detail?

Rynn Mancuso: Yeah. I was just curious.

Charity: This is Rynn, by the way.

Rynn: Hi, I'm Rynn. I'm Honeycomb's community manager and I'm on a team with Martin and Charity. The reason that I wrote in with that question is that I'm really curious about what your experiences have been as you've seen the Azure community grow and expand into a wider range of timezones and what you think we can learn from that in the Honeycomb community and as an observability and OpenTelemetry community?

Martin: I think one of the interesting things that's happening in the Azure community, and I think applies wider to .NET as well really, is people moving towards the owning production thing. They're starting to understand that they can't just right-click deploy anymore, they can't do things the way they used to, where you would do it straight from Visual Studio.

You would connect your Visual Studio to your Azure CLI and just deploy it. They're realizing that that isn't okay anymore, they're realizing that we need to put in CI/CD pipelines. We need to play a lot more with the deployments and testing and removing gates, and not rely on somebody else to test our software or to deploy our software. We really need to understand, really deep down, how it runs.

Charity: So you're saying that they're learning DevOps about 10 years later?

Martin: Yes. I am absolutely saying that. I am one of those people-

Charity: Gotcha, gotcha. What lessons do they have to teach us? Because I'm sure they do things that way for a reason, right?

Martin: Yeah. I think there's an element of handoff that is good. This is in one of my favorite management books, The One Minute Manager Meets the Monkey, where you've got so many monkeys on your back. It's an amazing book and it's really short, because I don't like reading long things.

But it's the idea that if I've got so many things that I'm trying to worry about then I struggle to do anything, and I think the .NET development community went the opposite way with having no monkeys. There's a lot more on people's plates now in the environments that we're talking about today, to the point that they have too many things.

So maybe there are more things that we can hand off, that we don't need to care about, more things that we can abstract away and use tools for. The Azure community is very, very good at saying, "I want Azure to provide me everything. I want one place to go and get everything. I want Azure to manage everything for me."

Whereas the AWS community is a lot more, "No, I want to manage that. I want to understand the internals of it." So I think that's really interesting, that the Azure community is very good at saying, "I want to hand things off." But not very good at taking things from other companies, which is another side to it.

Charity: Can you define outside-in testing for us?

Martin: Yeah. So we've been very good at doing unit testing, or the mockist world of unit testing where we take an individual class-

Charity: Well, very good is perhaps overstating it. But we've been all right.

Martin: We've given it a good ole' college try, as they say. But we've spent a lot of time on building out that really low-level testing where we take things down to a very, very small component, and we do a lot of testing of things that really don't need testing. Tests that when you call a method it sets a property on a class. Awesome. That really didn't need to be there.

Outside-in testing is about taking this to the application or service level that you're trying to run. The idea of, "Well, if I pass in some data and I get out the right data, do I care what happened in the middle of that?"

Not really, because if all the time I'm passing in the right data and getting out the right data, everything in the middle, yeah, I can refactor the whole thing away. I can do it in one line of code, I can do it in 400 lines of code, I don't really care. Talking to a lot of people outside the .NET community, they're alarmed that people don't do it this way.

They're like, "Well, that is what testing is, isn't it?" So they believe that that is what everybody should be doing, and that was a bit of a revelation, which is why I started really investing in trying to do the education piece for people about how easy this is, especially in the .NET world where we have the tools that are already built for us to do it.
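In .NET, those tools would typically be something like ASP.NET Core's WebApplicationFactory: spin the whole service up in memory, pass data in at the HTTP edge, and assert only on what comes back out. The endpoint, response type, and Program entry point below are hypothetical.

```csharp
using System.Net;
using System.Net.Http.Json;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Mvc.Testing;
using Xunit;

public record OrderConfirmation(string OrderId);

public class CheckoutApiTests : IClassFixture<WebApplicationFactory<Program>>
{
    private readonly WebApplicationFactory<Program> _factory;

    public CheckoutApiTests(WebApplicationFactory<Program> factory) => _factory = factory;

    [Fact]
    public async Task Posting_a_valid_order_returns_a_confirmation()
    {
        var client = _factory.CreateClient();

        // Outside-in: data goes in at the HTTP edge...
        var response = await client.PostAsJsonAsync("/orders", new { Sku = "ABC-1", Quantity = 2 });

        // ...and we only assert on what comes back out. Everything in the
        // middle is free to be refactored.
        Assert.Equal(HttpStatusCode.Created, response.StatusCode);
        var body = await response.Content.ReadFromJsonAsync<OrderConfirmation>();
        Assert.False(string.IsNullOrEmpty(body?.OrderId));
    }
}
```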

The pushback I used to get was, "Well, I need to know things like the retries, so I sent the data in and it got the right data out, but I need to know whether it retried properly." That, to me, is where we can start to couple in things like observability, because if you wanted to know that in your test, you're sure as hell going to want to know when it's live that that retry happened, so why aren't we doing the observability side within that outside-in testing?
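One way to couple the two, sketched under the assumption that the service emits a span for each retry: the test listens to the same telemetry the production exporter would receive, rather than mocking the retry policy. The ActivitySource and span names here are assumptions.

```csharp
using System.Collections.Generic;
using System.Diagnostics;
using Xunit;

public class RetryTelemetrySketch
{
    // A stand-in for the service's real instrumentation.
    private static readonly ActivitySource Telemetry = new("Checkout.Api");

    [Fact]
    public void The_retry_shows_up_in_the_telemetry_not_in_a_mock()
    {
        var recorded = new List<Activity>();

        // Capture the same spans the production exporter would receive.
        using var listener = new ActivityListener
        {
            ShouldListenTo = source => source.Name == "Checkout.Api",
            Sample = (ref ActivityCreationOptions<ActivityContext> _) =>
                ActivitySamplingResult.AllDataAndRecorded,
            ActivityStopped = activity => recorded.Add(activity)
        };
        ActivitySource.AddActivityListener(listener);

        // Stand-in for driving the system from the outside; in the real version
        // the retry policy inside the service would emit this span.
        using (var retry = Telemetry.StartActivity("payment.retry"))
        {
            retry?.SetTag("retry.attempt", 1);
        }

        // Assert on the observability data instead of mocking the retry internals.
        Assert.Contains(recorded, a => a.OperationName == "payment.retry");
    }
}
```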

Charity: Right, right. If I had to choose between tests and observability, which I don't have to and I'm glad for that, you should have both. But if I had to pick one above the other, I would take observability every-fucking-day of the week.

Martin: Yeah. The unit testing, even the outside-in testing, will only tell you in that isolated environment that that thing worked, and obviously with testing in production, your customers are testing your product all day, every day. So I'd prefer to know whether their tests are failing, versus mine that are failing in the pipeline. But if we can start to use the same tests in both-

Charity: Reality will beat staging any day of the week. Well, I think we're about out of time. But Martin, you now have office hours, right? How can people find you?

Martin: I do. So on the Honeycomb website we have office hours for our developer advocates. You can go on there, you can book some time with myself or my colleagues to talk about anything. I will wax lyrical to you about Azure and .NET and how we can make things better in development.

Charity: If you couldn't tell, this is literally true. You will wax lyrical until we stop you.

Martin: Yes, and I'm sure you will hit the stop button and I will still keep talking, so yes, you'll have to put the phone down.

Charity: Thank you so much, Martin. And thanks, Rynn, for the cameo. We will talk to you all next time.