Ep. #67, Managing Infrastructure Costs with Performance Engineering
In episode 67 of o11ycast, Martin, Jess, and Liz dive deep on performance engineering. Pulling heavily from Liz’s extensive experience in the field, they share insights on utilizing data to balance resources like developers’ time and infrastructure costs. Listeners can expect to learn helpful lessons on infrastructure optimization, performance data accessibility, and why Honeycomb embraced Kubernetes.
Transcript
Liz Fong-Jones: And it turns out that there is a certain satisfaction in making a line go up or making a line go down. So this is one of those things where you're always curious, how hard can I push it? It's similar to how people are like, "How hard can I floor my performance car?" You're not necessarily a race car driver, but it still is interesting to figure out how fast does my car go if I really push the limits? What can I do to tweak it and improve it?
And, yeah, sometimes I've been told off for, "Oh, we don't need to make that optimization right now. Why are you worrying about this?" But, yeah, basically this innate curiosity about how do I explore all the nooks and crannies? How do I squeeze every drop of performance I can get? And sometimes it can seem a little bit pointless and tedious, but it turns out every 0.5 or 1 or 2% really, really adds up if you keep on doing it for a long period of time. So that's how I got into this.
Martin Thwaites: It kind of gets a bit addictive, I think. Doesn't it? You can get a little bit intoxicated by those little small additions on that graph, just a little bit more.
Liz: Yeah, it's the feedback loop. Right? It's absolutely the feedback loop. We talk a lot on o11ycast about feedback loops and how engineers get dopamine hits from seeing their stuff go out live. In my case it's not necessarily seeing users use the feature, so much as the immediate dopamine hit of, "I've just paid off my salary for this year." That's really cool, to be able to justify the cost of your headcount.
Jessica Kerr: Strictly by saving us money on AWS?
Liz: Yeah.
Martin: I think there are times that I've heard you talk about the nanoseconds that we've saved in a lambda somewhere, and then you go, "Oh, that doesn't seem like much." And then you tell us the cost of what that is and you go, "Okay. Okay."
Liz: Turns out that shaving 100 nanoseconds off of a call that gets invoked billions of times per day, yeah, that kind of adds up. The trick is figuring out what's the tight loop and what is something that is not the performance-critical section.
Jessica: Oh, so if you pick the right line and you make that line go down, you're accomplishing something?
Liz: Yeah, exactly. But if you're picking a different line and it's something that isn't actually all that significant or important, then it's an interesting brain snipe, it's a nerd snipe. But it doesn't necessarily make a dent in the bottom line.
Jessica: Great.
Liz: There's actually this cool challenge that's going around as we're recording this called the One Billion Row Challenge, where you're supposed to aggregate a bunch of weather station temperature measurements and do it as fast as possible. That's the kind of cool stuff that people who are interested in this general field like to do.
Martin: Yeah, it started in Java. Some of the Java people were putting together the One Billion Row Challenge to see how fast you can parse one billion rows and aggregate data over one billion rows. Which is a really interesting challenge that they did after the festive period. Maybe before the festive period would've been a better time to do that and everybody could've been hacking on it over Christmas.
I love those kinds of challenges. I've been looking at some of the solutions and they're doing little things like bit shifting, really getting into manipulating the actual bit values instead of doing addition to make it quicker. I feel like there's a point where it gets way too addictive, where you can go way too low-level, you can be recompiling kernels to try and make it go faster.
Jessica: But it's only too much if you're only going to run it once. When you're a SaaS company and you're supplying this lambda, a billion times a second or billions of times over some time period, to hundreds of thousands of other companies, then a little bit of performance improvement really matters.
Liz: Yeah. And it definitely used to be the case that you had developer time being relatively expensive. It used to be that you could say, "This will slow our engineers down by an hour every month that we have to think about it. Let's just let those servers burn." That doesn't necessarily fly anymore. Both developer time and your AWS bill have now become very, very expensive resources that you want to carefully optimize as companies are thinking a lot more about profitability, about their rate of cash burn and so forth.
Martin: Yeah. The idea of autoscaling, let's really get autoscaling in so we can scale down to one server rather than 10. Then you work out that actually it's only going to cost us $2,000 a month to run 10 servers all the time. That becomes a, "Well, actually, no. I'll just get the developers to do other things," because it's way cheaper than them spending three months trying to make autoscaling work to save $9,000. Wow. That's nothing compared to what we're going to be paying them in salaries for that amount of time.
Jessica: Okay. So the trick here is knowing what that amount is and knowing whether this is a line you should optimize. I really want to get into how do you find that line? How do you get it, and how do you know which one? But first, I know a lot of people who listen to o11ycast already know who Liz is, but that is not everybody. Liz, would you introduce yourself?
Liz: Hi. I'm Liz Fong-Jones. I'm the Field CTO at Honeycomb, working in the office of the CTO with Charity Majors and I am a former member of the DevRel team at Honeycomb.
Jessica: And before that?
Liz: In past lives, yeah, I was a Site Reliability Engineer at Google and I worked with some of the most awesome brands that were using GCP to help them adopt SRE practices. As I was alluding to earlier, I've always had an interest in performance engineering and figuring out how to tweak my capacity to be as efficient and lean as possible.
Jessica: So at Honeycomb, in addition to many other things, Liz does a lot of optimization of our infrastructure on Kubernetes.
Liz: Yeah. It's definitely been a fun little challenge to keep our cost of goods sold to a minimum, so that we can enable spending that money on delivering innovative, new features, on paying our developers to continue to develop new features. It's just this ongoing exercise where we are always striving to help our customers get quick access to data. I think that's one of the other aspects of optimization that's often overlooked: sometimes it's about the dollars, and sometimes it's actually about the latency that you serve to your customers.
Jessica: Yeah. Faster and cheaper come together.
Liz: Yeah, exactly. If you're billed by the lambda second, every second that you save saves a customer waiting for the results of the lambda and it also results in a lower bill.
Jessica: So y'all were talking about a Billion Row Challenge, and at Honeycomb that's like every second someone runs a query. Our thing is querying an unreasonable amount of data live, when you ask for it, and getting the results back in an unreasonably short amount of time. That takes a lot of compute, and then that compute is expensive, and so a lot of our special sauce is making that not too expensive.
Martin: Do you think that the lambda equation has changed how we think about it? Because before when we were thinking about servers, we think about, well, when we've rented a server off AWS and it's got four CPUs in it, we should actually use all four CPUs. We don't care if we only end up using one of those CPUs. Well, fine. I don't care. I want all the CPUs to be used. But when you're charged by the millisecond, do you think that's changed how we think about performance engineering?
Liz: Yes, it absolutely does influence that, and in ways that you don't necessarily anticipate. For instance, if you own a server that is on and running all the time, you can keep your TLS connections warm. Right? You can keep a WebSocket or you can keep a gRPC connection open to whatever servers you need to connect to. If you have a lambda that AWS is freezing and unfreezing, and it could go minutes between requests or it could wake up five milliseconds later and just run another request, that lambda has to restart its connections to all of the other backend servers, like AWS S3, that it needs to talk to.
That can add up to a surprising amount. One of the funny things that we had to do was cache the SSL certificate for S3 that AWS hands back, and just say, "If I've seen this certificate before, I don't need to verify that it's still a valid certificate," because you wind up running those set up and tear down processes so often. So it's not just about trimming seconds out of your core compute, it's also about thinking, "What do I need to set up and tear down, and how do I make that as efficient as possible?" versus if you're running something steady state. So there are a lot of different trade-offs.
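A minimal sketch of the warm-start pattern Liz is describing, not Honeycomb's actual code: in Go, clients created once at package scope survive across warm Lambda invocations, so TLS connections and certificate verification don't have to be redone on every request. The handler and client names are illustrative.

```go
// Sketch of the general pattern: create clients once per container so warm
// Lambda invocations reuse TLS connections instead of re-handshaking and
// re-verifying certificates on every request.
package main

import (
	"context"

	"github.com/aws/aws-lambda-go/lambda"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

// Initialized once per container, reused across invocations while the
// Lambda execution environment stays warm.
var s3Client *s3.Client

func init() {
	cfg, err := config.LoadDefaultConfig(context.Background())
	if err != nil {
		panic(err)
	}
	s3Client = s3.NewFromConfig(cfg)
}

func handler(ctx context.Context) error {
	// Use s3Client here; its connections (and their completed TLS handshakes)
	// persist between invocations instead of being rebuilt each time.
	_ = s3Client
	return nil
}

func main() {
	lambda.Start(handler)
}
```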
But I think, Martin, the thing you are alluding to is actually one of the reasons that we switched to Kubernetes. Honeycomb, for a very long time, was an EC2-only shop, with the use of lambda for the querying backend. But we discovered that we were leaving a lot of performance and capacity on the table because, even with autoscaling and scaling each individual service separately, we were not bin packing things onto machines.
There was a period of time where we could have interleaved workloads on that same machine and we weren't availing ourselves of it, because each pool of workers was its own dedicated set of EC2 hosts. So I think definitely that desire to push utilization higher was an important side effect of our transition to Kubernetes. The actual real motivation behind the transition to Kubernetes, though, was the ability to handle deployments more easily. It turns out our GNT system that was rolling things out via cron jobs on our EC2 hosts really needed to be replaced with Deployments, StatefulSets, and so forth in Kubernetes.
We just decided, hey, we're not going to segregate these workloads anymore. Kubernetes lets us not have to worry about it, it's just all generically CPU.
Martin: I think that alludes to the idea that we waited at Honeycomb until the point where it became a problem, where it was cost effective for us to skill people up on Kubernetes, to understand what the Kubernetes paradigm was in terms of scaling and performance and all of that kind of stuff.
Liz: It also got cheaper over time because the number of people with Kubernetes experience coming in from the outside was much, much, much higher the longer we waited. Back in 2016, 2017, it would not have been reasonable to expect anyone that you interviewed for an SRE or a core services job to understand Kubernetes, and now it is this thing that most people in our area of the world have some exposure to.
Jessica: That's a great point.
Liz: Not all, and I think that's the tipping point: at what point does something have traction? At what point does it become the default? I think that brings us nicely to the subject of cognitive load, which is a cost as well: managing technology choices, managing code complexity. I'm not going to hand-write in assembly something that is frequently modified, no matter how performance critical it is, because you might value the ability to iterate on that code a lot more than saving an extra $1,000, or $2,000, or even $5,000 a month on it.
Jessica: On the other hand, you tend to come back to code that's very stable and runs a zillion, trillion times a second, and then you performance optimize it.
Liz: Yeah, exactly. So I think that goes back to how do you pick what to work on?
Jessica: Yeah. How do you choose the right line to min/max?
Liz: And I think it's not even picking the right line. I think it starts with these global perspectives: you have to zoom out and look at your entire AWS bill. Then you have to look at, okay, what elements of your AWS bill are the most expensive? If it's your RDS, there's no need to hand optimize your application code. That's a case for, okay, maybe we should look at Aurora or maybe we should look at the Graviton instance types for RDS.
There are different control knobs for each AWS service, only some of which relate to your application code. But supposing that you look at it and it is, indeed, your EC2 instances or your lambdas. At that point you then have to figure out, okay, what are the majority of these instances being used for? In some cases it's idle capacity, in some cases it's one or two applications, or in some cases it may just be a mixture of different applications that are all consuming that quota of that particular service.
Jessica: You mentioned it might be idle capacity and Kubernetes helps with that.
Liz: It can help with that. The challenge is you have to have a mixture of different workload sizes, which is possible to do at a large company that has hundreds of different microservices, each of which have their own different capacity scalings. But if you're a company like Honeycomb that has macroservices where we have maybe a dozen different kinds of microservices, each of which are very, very horizontally scaled, you suddenly don't have this heterogeneity of workloads.
You just think about things like, can I bin pack onto this machine? What's the most common combination of my top two or three workloads that I expect to see, and how do I size things accordingly? For instance, if you have a machine that has 32 cores available, you want one core allocated for various maintenance tasks, daemon sets, the OTel collector, right? So that leaves you 31 CPUs to work with. If you have jobs that are 17 CPUs each, you're never going to be able to squeeze two of them onto the same machine.
Jessica: Whoa, that's a lot of CPUs on one job.
Liz: Right. So that might waste a significant fraction of your capacity if all you have is 17 CPU jobs. Then you're going to be wasting, what's that, 14 CPUs? Unless you have something that can plug into that 14 CPU hole. So, in general, we find it's best to make sure that every job is no more than half, minus one CPU, of your smallest machine size. It's also important, for the much, much smaller jobs, to be able to fit two or three of them in there.
Let's suppose that you had a 15 CPU job and then you had some 5 CPU jobs, so you can squeeze a 15 and then a 5 and a 5 and a 5 to make up 30, which is close enough to fully utilizing the machine. So Kubernetes makes it possible to squeeze those workloads together, but it doesn't guarantee that there'll be zero wastage. It's on you to size your jobs appropriately. Jess, I 100% agree with you that 15 CPUs, 17 CPUs, that's a lot of CPU, but if you have shared caches that are maintained in memory that are used by all 15 of those cores, you don't want to be populating three separate caches across 5 CPU jobs.
You'd rather have those all in one thing that's talking to the database one time to refresh its memory.
So it's this trade off of how much work do I wind up doing populating things that are shared, versus how much work is feasible to multiplex and scale out vertically within a job? So there are all these interesting challenges that you have to keep an eye on and see where am I under utilizing? Where am I over utilizing? How am I sizing things? And are the Kubernetes processes working correctly?
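To make the arithmetic above concrete, here is a back-of-the-envelope sketch, with illustrative numbers only, of first-fit-decreasing placement of job CPU requests onto 32-core nodes with one core reserved for daemon sets and the OTel collector:

```go
// Back-of-the-envelope sketch of the bin-packing arithmetic discussed above:
// first-fit-decreasing placement of job CPU requests onto nodes, reporting
// how many cores end up stranded. The numbers are illustrative, not real
// workloads.
package main

import (
	"fmt"
	"sort"
)

func packAndReportWaste(nodeCores, reserved int, jobs []int) {
	usable := nodeCores - reserved // e.g. 32 cores minus 1 for daemon sets and the OTel collector
	sort.Sort(sort.Reverse(sort.IntSlice(jobs)))

	var nodes []int // remaining capacity per node
	for _, j := range jobs {
		placed := false
		for i := range nodes {
			if nodes[i] >= j {
				nodes[i] -= j
				placed = true
				break
			}
		}
		if !placed {
			nodes = append(nodes, usable-j) // open a new node for this job
		}
	}

	wasted := 0
	for _, free := range nodes {
		wasted += free
	}
	fmt.Printf("jobs %v -> %d nodes, %d stranded cores\n", jobs, len(nodes), wasted)
}

func main() {
	// 17-CPU jobs can never pair up on 31 usable cores: lots of waste.
	packAndReportWaste(32, 1, []int{17, 17, 17, 17})
	// 15 + 5 + 5 + 5 = 30 of 31 usable cores: close to full utilization.
	packAndReportWaste(32, 1, []int{15, 5, 5, 5, 15, 5, 5, 5})
}
```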
Jessica: So this is an observability podcast, therefore I'm going to ask you how do you find out how many CPUs each job is using and requires? How much memory is being shared in these shared caches that are valuable enough that we want to keep all the CPUs busy sharing them? How do you find out about these things?
Liz: So it's important, if you have any kind of shared in-memory data structures where you're evicting things from cache, to have telemetry about the size of the cache and how often you're evicting things. You also need to keep pod level and node level statistics, and that's where the OpenTelemetry collector running the node metrics receivers can really help with gathering the information on what pods are running on the current host.
That's where having the OTel collector as part of your daemon set that's running on every single node can really populate and fill in the details about how much you are reserving and how much you are actually using. But it doesn't give you a breakdown by which specific area of memory. It just says, "This one pod is using this much memory."
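As a sketch of the cache telemetry Liz mentions, the following Go snippet uses the OpenTelemetry metrics API to publish an eviction counter and a cache-size gauge. The instrument names and the cache type are hypothetical, and a MeterProvider is assumed to be configured elsewhere.

```go
// Minimal sketch (hypothetical instrument names) of emitting cache telemetry
// with the OpenTelemetry Go metrics API: an eviction counter plus an
// observable gauge for the current cache size.
package cache

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/metric"
)

type Cache struct {
	entries   map[string][]byte
	evictions metric.Int64Counter
}

func New() (*Cache, error) {
	meter := otel.Meter("example.com/cache") // assumes a MeterProvider is set up elsewhere
	c := &Cache{entries: make(map[string][]byte)}

	var err error
	if c.evictions, err = meter.Int64Counter("cache.evictions"); err != nil {
		return nil, err
	}
	// Report the current number of entries whenever metrics are collected.
	_, err = meter.Int64ObservableGauge("cache.size",
		metric.WithInt64Callback(func(_ context.Context, o metric.Int64Observer) error {
			o.Observe(int64(len(c.entries)))
			return nil
		}))
	return c, err
}

func (c *Cache) Evict(ctx context.Context, key string) {
	delete(c.entries, key)
	c.evictions.Add(ctx, 1) // how often we're throwing cached work away
}
```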
Jessica: Right. So if you want to know the size of your cache, you're going to have to emit that yourself, because your application knows, or can know, how much memory it's using for this particular purpose. I want to clarify a couple of terms. When you said a computer with 32 CPUs on it, in Kubernetes that means a node, right?
Liz: Yes. One individual Kubelet.
Jessica: Kubelet is the process that runs on the node computer, in order to make it into a Kubernetes node. Right?
Liz: That's correct. The Kubelet is the scheduler and resource manager of everything that's running on that individual VM or physical server that turns it into a Kubernetes node.
Jessica: Yes. And on a Kubernetes node you can run a daemon set. Well, specifically, if you run a daemon set in Kubernetes then you get one job as a pod, one pod per node, and it can do the tracking things per node, which means per computer. So you can get the information about that physical infrastructure.
Liz: Yes, exactly. So people use that for monitoring, people use that for telemetry forwarding and processing, people use that for emitting logging, and people use it for security purposes if some people have some kind of security monitoring job that they want to attach to every single Kubernetes node.
Jessica: And, Martin, you advise people to send their application telemetry to the collector running as a daemon set on the node, right?
Martin: Correct, yeah. So if you've got things running on there, you use the node IP, which means that it doesn't go through the software networking and there's no chance of it traversing the software networking between multiple nodes and all that kind of stuff.
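A rough sketch of what Martin describes: pointing the application's OTLP exporter at the collector DaemonSet on the same node, using the host IP injected through the Kubernetes Downward API. The NODE_IP variable name and the port are assumptions for illustration.

```go
// Sketch: send traces to the node-local OTel collector. Assumes the pod spec
// injects the node's address via the Downward API, for example:
//
//   env:
//   - name: NODE_IP
//     valueFrom:
//       fieldRef:
//         fieldPath: status.hostIP
package main

import (
	"context"
	"fmt"
	"log"
	"os"

	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func newTracerProvider(ctx context.Context) (*sdktrace.TracerProvider, error) {
	exporter, err := otlptracegrpc.New(ctx,
		// Talk to the collector on this node, skipping the service network
		// and any cross-AZ load balancer hop.
		otlptracegrpc.WithEndpoint(fmt.Sprintf("%s:4317", os.Getenv("NODE_IP"))),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}
	return sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter)), nil
}

func main() {
	tp, err := newTracerProvider(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	defer tp.Shutdown(context.Background())
}
```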
Liz: Otherwise you are crossing availability zones, maybe using a load balancer. I think this goes to the question of what are you optimizing for? Sure, it might in some world be more efficient to have a centralized fleet of OTel collectors behind a load balancer, but now you're trading off the efficiency of having a CPU dedicated to your OTel collector on each node for, "Now I'm guaranteed to be crossing the network and crossing a load balancer and paying for all of that network traffic."
So it's kind of a situation where networking, we have found, can often be more expensive than the cost of the compute itself for a highly optimized compute job that is handling lots of data.
Jessica: Oh, okay. So if you optimize your compute in a way that hides a bunch of networking costs, you could lose?
Liz: Yes, exactly. Imagine that, "Oh my god, I can run only one OTel collector, and maybe it scales to three and then it scales back to one at night. I've just saved a bunch of money on my compute." But now I have this problem that everything needs to cross to a different machine, and maybe I cross to different availability zones, so I'm getting billed a different way.
Jessica: So in Kubernetes you typically run multiple nodes, usually at least three, and for high availability you want those nodes in different availability zones. Now in AWS there are multiple availability zones in each region, so you're in the same AWS region but this cross availability zone traffic costs you money.
Liz: A lot of it.
Jessica: Even within the same region?
Liz: Yes. And that's true across all cloud providers. It's not just an AWS thing. There is a real cost to digging fibers to go between different buildings on the same campus, or even different campuses in the same city. That does cost money, I understand why they charge for it.
Jessica: And it takes time, so again faster is also cheaper.
Martin: One of the interesting things: I was talking with a customer last year who'd made a decision that their entire ecosystem was cross-region, so an order service in Availability Zone A in Region One could communicate with the email service in Availability Zone B in Region Two. Everything was 100% cross-region, and from a telemetry point of view that's a nightmare if you think about things like tracing, because you're trying to get all of that into one place so you can do things like sampling and all that kind of stuff.
They'd optimized their compute so that they had full reliability, everything could go between all the different regions, they could lose an entire region or part of a region or a service in a region or an availability zone in a region.
Liz: Right, because they were running active-active so it didn't matter. They didn't care which availability zone they were talking to.
Martin: Exactly.
Jessica: Because they had two active availability zones at all times.
Martin: Well, they had six.
Jessica: Oh, six. I was close.
Martin: Yeah. In my opinion, a sane person would have three availability zones running.
Jessica: In the same region?
Martin: In the same region, and they would scope it to that particular region. A request would live inside of that particular region. But they were doing it across multiple, and they had reasons for it. The problem was that the costs then went up because, yes, we talk about cross availability zone costs being high, cross region costs are even higher.
Liz: Yeah. And it's fine to run active-active as long as you default to your own region and you default to your own availability zone, and you only communicate across if you have no other option. But again, this is a thing where, if you don't have the data to understand what your architecture is costing you, you have no idea. You're just saying, "I want to maximize reliability. Let's just Round Robin it and talk to any available host."
Jessica: Yeah. And as a developer, what I love about Kubernetes is the abstraction, it's that I don't have to understand servers. I can just understand deployments and I can describe how many of my servers at least should be running somewhere and I don't have to worry about where.
Liz: And you shouldn't have to worry about it. I think that's why you have people like myself or Lex from our SRE team that go and look at our infrastructure afterwards and we say, "Okay, this is starting to tick up. We have to look at this now, we have to revisit it." You don't have to prematurely optimize things as long as you have sensible defaults for people and you're catching any issues that are coming up and increasing the cost that are not expected.
Jessica: Nice, okay. So your strategy with capacity and performance optimizations is to let developers do what they do and then come back and clean up after them?
Liz: Not so much clean up, as if something winds up being more expensive than we expect, then that's an experiment that we then go and say, "Okay, let's do that but differently."
Jessica: So it's react to problems instead of preventing them?
Liz: Yeah. Because if you try to prevent everyone from spending any AWS money, that's how you get really restrictive financial controls, that's how you get shadow IT. So it's much better to, within bounds, let people spend up to $5,000 or $10,000 a month, and if people wind up costing more then we'll see what we can do to help them achieve the same results but for cheaper.
Martin: So do you have any advice for people on how to then forward educate people so that the same problem doesn't happen? Because obviously these are recurrent problems, what can people do to try and make it better? Is it just observability?
Liz: Yeah, it's the feedback loop. It's looking at the performance of the thing that you're deploying and the cost of the thing that you're deploying, so tagging which service is responsible for what is crucial. Let's say that we've solved the bin packing problem and we don't have a bunch of unallocated wasted space, and let's suppose that we've solved the problem of growing and shrinking our pool of nodes. At that point we start looking at, okay, out of these CPUs that you're consuming, let's go ahead and take a profiler to it, or let's go ahead and figure out what we can do to balance the load better so that you don't have one node that's at 10% utilization and one node that's sitting at 120% utilization.
So yeah, it's kind of a challenge because there are so many different ways that you can have issues. You teach people to recognize some of the patterns and which tools to apply, but you're still relying upon them to look at the signal, you're still relying on them to pattern match and say, "Okay, this looks like a thing that I've seen before. I know what to do because Lex or Dean showed me how to deal with this situation."
Martin: Yeah. I think this actually comes down to people actually caring about it a little bit. They have lots and lots and lots and lots of things to care about, this is just one extra thing.
Jessica: And if they have easy access to that information, it's a lot easier to care.
Liz: Yeah. That's why democratizing access to data, at least read-only, is so important. This is why we so firmly believe in what the DevOps and other movements say: when you shift things left, when you make it easier for people to see for themselves what's going on, rather than handing off to another team, they tend to take responsibility and ownership of it.
Jessica: It's interesting that the shifting left here is shifting left of visibility and awareness, and the accountability happens naturally then.
Liz: Yeah, exactly. In the past, when companies were hiring 10 people a week, it was like, whatever, the cost of machines doesn't really matter. We need to onboard these people and we need to get them effective. But when your company is potentially facing having to furlough people or freeze hiring, and you can point to something and be like, "Hey, maybe let's save this $5,000 a month?" If you do three or four of those improvements, that pays for a developer, right? So I think that's the thing: people care about it when they know that it matters to the business, when it matters to the users.
Martin: Yeah. Very similar to that, we had Iris from Farfetch on, talking about exactly the same sort of things. She'd gone through the same ideas of getting that information visible so that people can care about it, making it easily accessible, that democratization. It feels like that is becoming the superpower, if you can make that available.
Liz: Yeah, and also making the things that used to take wizards accessible to ordinary engineers. This has been a fun little journey. On a past o11ycast we had, I believe, the folks from Polar Signals on, talking about profiling. Profiling is one of those things that I think is severely underrated, because for a long time it's only been accessible to the likes of Brendan Gregg. Brendan Gregg is awesome, and there's only one of him in the world.
How do we ensure that people who are not performance engineers with a capital P and a capital E, that ordinary developers, have access to profiling so they can see in their IDE what lines of code are the most expensive? I think that kind of making performance data accessible helps people make the right decisions, because people aren't going out seeking to do a bad job or waste CPU. They just don't know what they don't know.
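For Go services, one low-friction way to make this kind of profiling available to any developer, offered here as a sketch rather than a description of Honeycomb's setup, is the standard library's net/http/pprof endpoints. The port number is illustrative.

```go
// Expose the standard pprof endpoints on a localhost-only port so any
// developer can pull a profile from a running service.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on DefaultServeMux
)

func main() {
	go func() {
		// Anyone can then grab a CPU profile with, for example:
		//   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	select {} // stand-in for the service's real work
}
```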
Jessica: They know that spending a ton of time optimizing every line is wrong.
Liz: Yes, exactly. This is why, when I'm looking at the full stack, I start from top to bottom: I look at the biggest wasters and I figure out how I can optimize the biggest wasteful lines of code in the biggest wasters. But if you own your individual service, it's just a lot easier if you tidy up as you go, because it means that you have a lot more context than I do as to what it is doing and whether it is supposed to be expensive.
Jessica: Okay. So if a team can look at the CPU usage that's tagged as their services, and see where it's highest, they have context and they're able to improve that a lot faster without zooming all the way out to the entire company.
Liz: Yeah, exactly. It's not necessarily that you should spend five hours optimizing this, so much as, if you see something pop up and it's something that's quick for you to fix, you may as well do it. But the challenge with profiling, I think, has been that, number one, it's been hard to use; number two, it's been hard to figure out where to apply it. Then number three, I think, is the difference between, "Hey, we found a bunch of these technical debt things where we're wasting 30% of the CPU on this job."
Once you get rid of those obviously glaring things, you're suddenly into trudging through 0.5%, 1% improvements, and having to do a whole bunch of them to make up that 10 or 20 or 30% performance change. There are diminishing returns, because a lot of things people just don't fix until they have visibility. As soon as they have visibility, they fix them. So the classic example that I use for this is inside of Honeycomb's services, we have our ingest service, our beloved Shepherd.
Shepherd, the ingest service, for a long time, for many, many years, was wasting 30% of its time rebuilding its HTTP routing mux. Again, this didn't matter originally when it was a $1,000 a month service, whatever, but by now it was a $5,000 a month service. It had a mux that was saying, "For this path, for this regex of paths, route it to this particular function." And instead of storing a copy of that on startup, for every request it was initializing a new multiplexer, and it spent 30% of its CPU time doing that, and no one realized it was doing that.
It was an oversight, and it didn't matter much initially, until it did. So we fixed that, of course, as soon as we saw it, and then we stopped having those improvements, we ran out of 30% improvements to make. I think, depending upon who's on your team, the threshold for the severity of things that you're fixing may really, really vary, because it may take someone more time to dig into that 1% or 2% improvement, and it's only a 1% or 2% improvement.
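The Shepherd code itself isn't shown here, but the pattern Liz describes looks roughly like this in Go: the wasteful version rebuilds the router on every request, while the fix constructs it once at startup. Route paths and handlers are illustrative.

```go
// Generic illustration of the pattern described above (not Honeycomb's actual
// code): build the request router once at startup instead of rebuilding it
// on every request.
package main

import (
	"log"
	"net/http"
)

// Wasteful version: a fresh mux, with all its route registrations, is
// constructed for every single incoming request.
func wastefulHandler(w http.ResponseWriter, r *http.Request) {
	mux := http.NewServeMux()
	mux.HandleFunc("/events/", handleEvents)
	mux.HandleFunc("/batch/", handleBatch)
	mux.ServeHTTP(w, r)
}

// Fixed version: build the mux once and reuse it for the life of the process.
var mux = func() *http.ServeMux {
	m := http.NewServeMux()
	m.HandleFunc("/events/", handleEvents)
	m.HandleFunc("/batch/", handleBatch)
	return m
}()

func handleEvents(w http.ResponseWriter, r *http.Request) { w.WriteHeader(http.StatusAccepted) }
func handleBatch(w http.ResponseWriter, r *http.Request)  { w.WriteHeader(http.StatusAccepted) }

func main() {
	log.Fatal(http.ListenAndServe(":8080", mux))
}
```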
Jessica: Right. But 1% or 2% of what? And how does your salary compare to that?
Liz: How does your salary compare, and how does it compare to the other things that you could be working on?
Martin: The other thing I think is missed in these debates is the idea that it might look like a real improvement when you look at it locally, so that idea of, "Oh, I'll look through this code. This function here is really inefficient, I'm going to spend a load of time fixing this particular function."
But unless you know that that's something that's used in production at scale, that that particular function is a real problem, you can spend a load of time saying, "Yeah, I've saved 17 seconds. This function was taking 17 seconds, but it's only used once a day," so it's not actually a problem. So there's knowing something is a problem in production versus it looking like it's a problem because I've seen the code. I think people tend to not see those two as the same thing. You need to actually look at both of them and see, is this a real problem?
Liz: Yeah. And people sometimes still complain to me about, "Oh, OpenTelemetry tracers add 1% overhead or whatever to my process," and I'm like, "What are you not discovering because you've failed to invest 1%, at most, 1% of your CPU time on OTel?"
And it may be that you are optimizing the wrong things, that you are failing to find that 5% or 10% regression, or failing to find that one function that's very expensive that's captured by a span. You really should consider investing in visibility so that you have the data to work with, even if it costs a little bit of performance.
Jessica: Right. Because it's that 1%, if that, of overhead for telemetry that gives you the observability to find the 30% improvements.
Liz: The 30% improvements, then the 5% and the 2% improvements after that, and there will be many. They will continue to crop up continuously.
Jessica: Right, because maybe you get down to 1.5% improvements and then you stop but keep an eye on it for the next 5% or 10% or 30% or whatever to crop up.
Martin: Because we will all introduce those problems, the code is constantly changing.
Jessica: That is our job.
Martin: We will change something. I always equate it to if you've got a dog and you tickle it behind its ear, and all of a sudden its leg starts waggling. Those are the systems that we have these days, where we'll make a change in the frontend and all of a sudden the backend will... "Oh, I've made this change in the frontend, it now makes three more calls and it makes it so much more efficient." Then all of a sudden the backend, the graphs are going like a sawtooth everywhere and you're like, "Um, I've broken something elsewhere. I didn't know that that was going to... Oh, you didn't want me to call all four of them? Oh, sorry."
Liz: Right, exactly. That is another area where when you're performance tweaking, you try not to bring production down, you can try to still maintain your service level objectives and also have better utilization. Availability is job one.
Martin: Yeah, don't break production. I mean, rule one is don't die. Rule two, don't break production. I'm happy with that priority scale there.
Liz: Or at least don't break production for too long, right? It's okay to break production, just roll it back.
Martin: Whereas rule one, don't die ever. Don't just die for a little bit, just don't die.
Jessica: Unless you're a Kubernetes pod, in which case, whatever. You'll get restarted.
Liz: That's suspended animation, right? You get brought back.
Jessica: Because lambdas are really good at it. Hey Liz, got any stories from your recent work about this?
Liz: Yeah. The fun one that I've done recently has been wiring up the OTel collectors to emit trend data on every pod on every node type running in our fleet.
Jessica: Wait, what kind of data?
Liz: The pod utilization data. How much CPU is it using? How much memory is it using? And how much is it wasting? How much of the reserved capacity is not in use because you said you needed it and then you weren't actually using it?
Jessica: Oh, for like pods that say they need 2 CPUs and-
Liz: And they're actually using 0.5 CPUs or whatever. So I've set up Honeycomb derived columns to basically emit, on a per-pod basis, how much wasted CPU there is, and that's made it a lot faster to drill down into which services I should be investigating next, which services I should make autoscale. It's uncovered things like, hey, this service that we thought a while ago had a bug that meant we couldn't run more than one partition's worth of workload on one Kubernetes pod, we should maybe look at revisiting that assumption, because now this is suddenly costing a lot of money.
This is preventing us scaling down at night, it's resulting in abysmally low utilization, because we are running the same amount of CPU, one pod for each partition, all day, all night, regardless of day of week, regardless of time of day. Looking at the dollars and cents cost of it was like, okay, this has now become expensive enough, we've added more partitions, we really need to look at autoscaling this. We need to look at the possibility of running multiple partitions' worth of work on the same pod so that we can scale the number of pods in and out, rather than fixing the number of pods to equal the number of partitions.
So it was really, really gratifying to make that change, to validate that the bug we were seeing before no longer manifested, and that we could scale this in and out, and then suddenly you could see that the amount of wasted CPU, the amount of wasted RAM, dropped to like a tenth of what it was before. That was one really, really fun one. I like charts, I like graphs, I like seeing the numbers go down, and that gave me the ammunition to go to the team and say, "This is something that's worth investing in because here's how much it's costing us, and this is how long it's been since we last revisited this."
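The exact derived-column definitions aren't given here, but the underlying arithmetic is simply requested minus used, per pod. A small illustrative sketch, with hypothetical field names and sample values:

```go
// Sketch of the arithmetic behind a "wasted CPU" column: waste is what a pod
// requested minus what it actually used. Field names and sample values are
// hypothetical; real ones depend on your collector configuration.
package main

import "fmt"

type PodSample struct {
	Pod             string
	CPURequestCores float64 // cores the pod asked the scheduler to reserve
	CPUUsageCores   float64 // cores it actually consumed during the interval
}

func wastedCores(s PodSample) float64 {
	if s.CPUUsageCores >= s.CPURequestCores {
		return 0 // using everything it reserved (or bursting above it)
	}
	return s.CPURequestCores - s.CPUUsageCores
}

func main() {
	samples := []PodSample{
		{Pod: "example-worker-0", CPURequestCores: 2.0, CPUUsageCores: 0.5},
		{Pod: "example-ingest-1", CPURequestCores: 4.0, CPUUsageCores: 3.9},
	}
	for _, s := range samples {
		fmt.Printf("%s: %.1f cores reserved but idle\n", s.Pod, wastedCores(s))
	}
}
```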
Jessica: So at some point we saved developer effort, because it was too hard to fix, maybe it was an underlying library bug.
Liz: It was an underlying library bug, yeah. It was an underlying library bug, and at the time, yeah, the developer time needed to fix it was not worth the amount of extra CPU we were spending on it. But the library got fixed, and we added more partitions, which meant we had to scale this up and keep it scaled up all the time.
Jessica: And this is a thing where we had to revisit a trade off where the consequences had changed?
Liz: Right. That was like two and a half years ago, yeah. I went and dug up the pull request that was, "Fix this to a fixed size."
Jessica: Nice. Yeah, that's yet another example of software is never done.
Liz: It gets both new bugs and new features.
Martin: I heard somebody recently that software is only done when it's done its job in production, you've taken it out, deleted the code. That's when it's done.
Jessica: Yeah. And even then you're left with the data.
Martin: The carcass of the code.
Liz: The other really fun thing in languages like Java and Go and anything that's garbage collected is that there is a resource that you are using whether you realize it or not. Every time you allocate a new object on the heap, you are creating future work for the garbage collector. Developers so often are not aware of this because it just turns up as generic time spent doing garbage collection, so you have to separately collect, you have to do separate measurements of heap, both in terms of total amount of heap and also in terms of number of allocations to heap. That winds up being, again, one of those surprising things.
You're looking at the service and it's like, "Wait a second, why is this spending 50% of its time doing garbage collection?" So I think this is one of those things that it's a pattern, right? As soon as you recognize it, you know what to do about it. But if you've never encountered it before, never had to think about garbage collections before, you're just like, "Why shouldn't I just allocate this on the heap?" And in most cases it's fine to allocate on the heap.
It's only when something is being done a lot that it starts to matter and you don't know in advance whether something is going to be done a lot. You kind of might be able to detect it, but you might not. It's one of the things where you have to test in production to find out what the performance characteristics are.
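A sketch of how this shows up in Go, under the assumption that your hot path allocates a buffer per call: a sync.Pool reuses buffers to cut the garbage collector's workload, and a benchmark run with -benchmem makes allocations per operation visible. Names are illustrative.

```go
// Sketch: surface the "hidden resource" of heap allocations. A sync.Pool
// reuses buffers so the hot path stops creating garbage-collector work, and
// the benchmark reports allocations per operation.
// (Benchmark code like this lives in a _test.go file.)
package buffers

import (
	"bytes"
	"sync"
	"testing"
)

var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// encodeWithPool reuses a pooled buffer instead of allocating a fresh one per
// call, which starts to matter once this runs millions of times per second.
func encodeWithPool(payload []byte) int {
	buf := bufPool.Get().(*bytes.Buffer)
	defer func() {
		buf.Reset()
		bufPool.Put(buf)
	}()
	buf.Write(payload)
	return buf.Len()
}

// Run with: go test -bench=. -benchmem
// The -benchmem output shows allocs/op, making GC pressure visible.
func BenchmarkEncodeWithPool(b *testing.B) {
	payload := make([]byte, 1024)
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		encodeWithPool(payload)
	}
}
```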
Jessica: Yeah. Like a lot of our frontend code is littered with useMemo, which is a caching functionality in a React render function, which might be called 30 times a second or 30 frames a second or something, I hope not that many. But it's a lot, and so it avoids allocating a function object on the heap every render, and it also clutters up the code and it's annoying. So there's that: you can save the cognitive load of even knowing what the heap is when you're young, until you have some consequences. And if you can measure those consequences, then you know how long you may retain your innocence. I mean ignorance.
Martin: I love the idea that you either die young or live old enough to understand heap allocations.
Liz: I think the thing there that's especially scary for frontend observability, for FEO, is that those are not your machines. Those are machines that you do not control instrumentation on, right? You can add instrumentation but you will not necessarily see-
Jessica: Which is really expensive because it's definitely a different region.
Liz: It's definitely a different region, it's your customer, it's your customer's computers and cellphones. The most you might get a complaint about is, "Oh, this is really slow to load or Chrome said this tab was running slowly and to kill it." But until it gets to that point, you can hide a lot of ugliness that way so it's incumbent on you to be economical with your customer's phone's battery life.
But if you don't know what's going on, you have no idea that this is happening, you have no idea if you introduced a regression, so how can you get better visibility into that so you know? I talked earlier about saving dollars, I talked earlier about saving developer time and enabling developers to keep their jobs. But also, let's maybe save the planet a little bit. That's been my side quest for a while, with talking about Graviton chips, talking about performance engineering. Every milliwatt of power that you draw, that's something that needs to be generated. We have to start thinking about that.
Jessica: Yeah. So cheaper, faster, better for the planet.
Liz: Yeah. It's just been hard because people haven't had the right signals, people haven't had access to the signals, and now we're starting to crack that, I think. I think the incentives are much more in place for people to start looking at that more seriously.
Jessica: And observability has some overhead both in developer time and in CPU and other costs. But it gives you a hope of improving the right thing now. Great, I think that's a good place to wrap up. Liz, where can people find out more?
Liz: You can find more on the Honeycomb blog in general. That's where you'll find a lot of the research and writing that I do.
Jessica: Honecomb.io/blog.