APR 7, 2020

30 MIN

Ep. #18, Real User Monitoring with Michael Hood of Optimizely

Guests: Michael Hood

about the episode

In episode 18 of O11ycast, Charity and Liz speak with Michael Hood, a senior staff performance engineer at Optimizely. They discuss real user monitoring (RUM), the shortcomings of traditional metrics, and ramping up observability.

about the guests

Michael Hood is a systems engineer-architect passionate about developing, tuning, and scaling high-performance infrastructure/applications on distributed systems. He is the Senior Staff Performance Engineer at Optimizely, where he is responsible for low-latency delivery of 9 Gbps of JavaScript.

show notes

Web Development is Distributed Systems Programming

transcript

Michael Hood: My background is in infrastructure engineering, so I was the "First principles" type of thing.

We had no external monitoring. Everyone starts with outside black-box monitoring, where you have synthetic stuff hitting your API endpoints, and then inevitably as your systems grow, that is only telling you what the weather is right now but not telling you how you got there.

I started instrumenting individual systems, and this was back before tracing was really popularized or made accessible to smaller startups, especially.

I started just building fundamental building blocks and sending stuff into Graphite.

Charity Majors: What are the fundamental building blocks?

Michael: Timers, counters, that kind of stuff. Wrapping them anywhere and everywhere and later on it becomes anywhere you can afford to keep them, because when you're running your own infrastructure that stuff--
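A minimal sketch of the "timers and counters" building blocks Michael describes, assuming metrics are accumulated in memory and rendered in Graphite's plaintext line format (`<path> <value> <timestamp>`). The metric names and the `handle_request` function are hypothetical and not taken from the conversation.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

counters = defaultdict(int)   # metric name -> running count
timings = defaultdict(list)   # metric name -> list of durations in ms


def count(name, n=1):
    """Increment a named counter."""
    counters[name] += n


@contextmanager
def timer(name):
    """Time the wrapped block and record the duration in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name].append((time.perf_counter() - start) * 1000.0)


def flush(now=None):
    """Render Graphite plaintext lines: '<path> <value> <timestamp>'."""
    ts = int(now if now is not None else time.time())
    lines = [f"{name}.count {value} {ts}" for name, value in sorted(counters.items())]
    for name, samples in sorted(timings.items()):
        lines.append(f"{name}.mean_ms {sum(samples) / len(samples):.3f} {ts}")
    return lines


def handle_request():
    # Hypothetical request handler wrapped with a counter and a timer.
    count("api.requests")
    with timer("api.handle_request"):
        sum(range(1000))  # placeholder for real work


handle_request()
handle_request()
lines = flush(now=1586217600)
```

In a real deployment the lines would be written to a socket rather than returned, and every extra timer costs storage, which is the "anywhere you can afford to keep them" tradeoff.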

Charity: You're from the infrastructure side, how much do you identify as a software engineer?

Michael: That's an interesting question. Nowadays as I've gotten more senior I tend to look at myself as a force multiplier, so I try to lend my expertise wherever it can help move the business forward and wherever I can help grow less experienced engineers or engineers who have less experience in a certain area that maybe I do.

Liz Fong-Jones: I think it was super interesting, when you came in you were talking about the weather report and you were talking about how the weather report from the entire San Francisco Bay is very different than your experience in each of San Francisco's microclimates.

You got rained on as a result when you came here.

Michael: Right. That's a good analogy.

I think there are two different lenses to look at things. One is knowing the individual health of your different systems, and there are a lot of ways to do that. But more importantly, maybe, it's knowing the quality of an individual experience that goes through the system for things that are user-facing.

Charity: It always looks different from the inside than from the outside.

Liz: Now would be a good time for you to introduce yourself.

Michael: Sure. My name is Michael Hood, and I run performance engineering at Optimizely.

Liz: What does Optimizely do?

Michael: Optimizely is a company that provides an experimentation and feature flagging platform. Companies who want to experiment, whether on their website or deeper in their stack with our Full Stack product, can implement our SDKs, run A/B or multivariate tests through feature flags, deliver different experiences to different users and find out what the optimum experience is.

Charity: You guys were doing feature flags before feature flags were cool.

Michael: Yeah, that's true.

Liz: Your customers require observability to understand how their experiments are running, and you use observability yourselves in order to understand how your platform is doing.

Michael: Absolutely. Our primary product historically has been our web product, and that's a big piece of JavaScript that companies add as a third-party JavaScript to their site.

Obviously nowadays that's a hot button issue with regards to security, performance, etc.

But for a long time, and still today, we have had a small real user monitoring beacon embedded in our third-party JavaScript that evaluates its performance on the website. We take about 80 different measurements and send those back to a backend that we have, sampled to some degree.

We can do very deep analysis on what the interactions between our JavaScript and other popular third-party JavaScripts are.

Liz: A lot of companies would kill to have that, to have real user monitoring embedded in every endpoint. That's really amazing.

Charity: It makes sense to me now why Optimizely was one of our earliest customers, because you all are very familiar with this.

Michael: Yeah, absolutely. I brought that background to Optimizely when I joined, of thinking about JavaScript and the browser as a distributed system. It has a lot of the same hallmarks.

There's noisy neighbor type issues, you're in a multi-tenant environment that you don't fully control. It's like running in the cloud.

Liz: Because I know a lot of infrastructure engineers falsely put down front end engineering.

Charity: "What? The web? That's not distributed systems."

Liz: It's like, "No. It's super complex. It's super interesting."

Michael: It's extremely complex, and it's unfortunate that there's that tension or animosity between the two sides.

There's a lot more in common than they would appreciate, I think.

Charity: I watched that talk by meat computer, called Web is the Original Distributed System a couple of years ago and it really stuck with me.

Michael: They already had a more simplified existing version of what we call RUM now internally, and that was just recording how long it took to download the JavaScript from the browser.

But they weren't measuring how long it took to execute or anything about that execution, and so now we've built our own little small tracing by measuring different spans inside there.

We hope at some point to be able to use those to represent spans inside the tracing functionality in Honeycomb, even, because we already have that coming back and we're aggregating it in our back-end system. We also want to avoid survivorship bias, which is a really common problem with these types of sensors, where you have a thing that only sends once it's fully done executing.

So you can get all those measurements, but that doesn't tell you about the ones that went bad, so we actually send two. We send a lightweight one as soon as it's done downloading, when it starts executing, and then another one at the end of the page once we have about 70-something more measurements available. We stitch those together in the back end and we actually send those to Honeycomb.

Liz: We had one thing that was very similar to that in that we had an outage recently at Honeycomb where we failed to get the telemetry from processes that were crashing due to running out of memory, and therefore it looked like network traffic was just dropping when in fact it was that our telemetry was failing to send at the end of execution.

Charity: It's an interesting and hard problem, and it's not one where there's a correct answer because you don't want to call home after every single function executes.

That would just be excessive. That would saturate your network traffic.

You don't want to call home only after it's successfully exited or errored, because that's not going to catch the kill-9's.

This is something where every place that I've worked has arrived at a different compromise, which suggests to me that you still do have to understand your code.

Liz: You have to understand your code, and you have to have a way to look for those incomplete executions that are a sign that something has gone wrong with your own instrumentation, which is a sign that something is deeply amiss.

Charity: You answered the question that I was going to get to though, which is that you instrumented the lifecycle of this stuff but obviously you just did it with metrics that are completely disconnected from each other, because that would make sense.

I'm being sarcastic, but this is the tool that's been available to us for so long.

It's just metrics, and this is something I think that it clicks for everybody eventually.

But for a lot of people it still hasn't clicked yet, so maybe you can explain to us what the shortcomings of traditional metrics are?

Michael: Metrics require foresight into what's going to matter later on. If we stick with a browser example, although there are probably more interesting ones when we're down this path, you need to know what matters about it.

That's especially-- it's really hard to know. What can you measure? Also, there are the things that you would be able to reflect out of a traditional environment.

Like memory usage, and so forth. This isn't available to you, so you have to infer those things from sizes of different objects you have, so "Should we start tracking how big this array got?" Things like that.

I think the eventual holy grail that everyone wants to get to is being able to ambiently become aware of those things.

Measure everything and figure out what matters later, and the tradeoffs about how much data you're transmitting, how much you're storing, how much you're ingesting at your provider, etc.

They all come into play there.

Charity: I have a sticker that says, "Graph everything. Fuck you."

Liz: I think the other interesting avenue to this is also talking about not just how much data are you collecting, but are you pre-aggregating?

Are you deciding in advance how you're aggregating, or are you leaving it up to the analysis later as you say?

Michael: That's an interesting question.

When I said before that we are aggregating, I actually meant just that we're sessionizing those multiple beacons that come in.

We stitch them together in a little memory window that's 30 seconds long.

Liz: OK, so that's creating individual events out of multiple different measurements.

Michael: Yes. Let's say there are two beacons, and I'm literally just appending them on the same row.
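A minimal sketch of the stitching described here, assuming each browser execution carries an ID shared by its early and late beacons. The class, the field names (`download_ms`, `execute_ms`, `dom_ready_ms`) and the `complete` column are hypothetical. Only the two-beacon design and the 30-second in-memory window come from the conversation.

```python
from dataclasses import dataclass, field

WINDOW_SECONDS = 30  # from the conversation: a 30-second in-memory window


@dataclass
class Session:
    first_seen: float
    row: dict = field(default_factory=dict)
    beacons: int = 0


class Sessionizer:
    """Appends beacons from the same execution onto one wide row."""

    def __init__(self):
        self.open = {}  # execution_id -> Session

    def receive(self, execution_id, fields, now):
        session = self.open.setdefault(execution_id, Session(first_seen=now))
        session.row.update(fields)  # append the new columns to the same row
        session.beacons += 1

    def flush(self, now):
        """Emit one wide row per execution whose window has expired."""
        expired = [eid for eid, s in self.open.items()
                   if now - s.first_seen >= WINDOW_SECONDS]
        rows = []
        for eid in expired:
            s = self.open.pop(eid)
            # An execution that sent only the early beacon is still emitted,
            # flagged as incomplete, so failed runs are not silently dropped.
            rows.append({"execution_id": eid, **s.row, "complete": s.beacons >= 2})
        return rows


sz = Sessionizer()
sz.receive("a", {"download_ms": 42}, now=0.0)
sz.receive("a", {"execute_ms": 120, "dom_ready_ms": 800}, now=3.0)
sz.receive("b", {"download_ms": 55}, now=1.0)  # never sends its second beacon
rows = sz.flush(now=31.0)
```

Keeping the incomplete rows is what guards against the survivorship bias mentioned earlier: the executions that "went bad" still show up in the data.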

Liz: That makes a lot of sense.

That goes with what Charity and I have been saying for a while, which is "Wide events. Single, wide events rather than emitting multiple log lines or multiple beacons per user load."

Michael: We send a single event, both to Honeycomb and to other sinks, as one execution of our platform in the browser, and that's about a hundred columns wide.

Liz: That's not aggregation, that's just correlation as it were.

Michael: Yes, exactly.

Charity: I guess you could say it's aggregating around the event itself.

Liz: I don't really like overloading the word "Aggregation."

Charity: I don't either, but I've had a bunch of nitpicky people point out to me "But actually--" So sure, they're correct.

Michael: At some level, everything is an aggregation if we're going to take it to that extreme.

Charity: Any time you stick two things together, you are--

Michael: I think that without some level of aggregation, if we're going to overload it to mean that, WHERE clauses, as it were, are not useful anymore.

Charity: That's a very good point.

Liz: Let's talk more about that, WHERE clauses are not useful anymore.

Michael: Sure. I just mean that you're aggregating something, and you need something to identify all these things happening together.

If you just had a big event waterfall and no way to say, "OK. These things happened to the same person or the same connection or whatever," these would be fairly meaningless in most cases.

Liz: That's what most metrics platforms are doing. They're saying, "The average of all of these loads is, or--"

If you have a more sophisticated provider at least it's a heat map or histogram. It still is-- You have no connection between these things.

Michael: I tend to think that most ways that metrics are interpreted in legacy senses are meaningless, frankly.

People have a thermometer when they need something more like a stethoscope.

Charity: I'm always so interested in thinking about how the technology that we have is giving rise to the practices that we have, and the social interactions that we have.

The legacy metrics have given rise to this generation of people who thought that debugging was done by intuition.

We'd have these graphs and past experience would tell us that sometimes it correlates with an outage or correlates with this system being wrong or something.

We don't actually have the information to debug it systematically, so people have just grown up thinking this is normal and this is called "Debugging," just having battle scars and applying them to dashboards.

Liz: Or using your analogy of a thermometer versus stethoscope, if someone says "This person has an elevated temperature, I bet 99% of the time that it's a cold."

Then they miss the infection, or they miss the--

Charity: "The average temperature of all people in San Francisco is--" It's just not, "If it goes up a little what does it tell you?"

If you're a doctor and you've seen plagues before, you're like "This must be the plague."

It's like, "No. You have a very blunt signal and you have your experiences."

Michael: That brings out an interesting thing, and ties back to the question about aggregations before, in the original meaning of the word.

I think people are being forced to realize that they have to do away with the legacy notion of StatsD-style "just roll up everything in this window," because now we have the ability to query things along so many dimensions that those sorts of roll-ups no longer make sense.
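A small sketch of why a single roll-up stops being useful once events can be sliced along several dimensions. The events, dimension names and numbers below are invented for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw events; every value here is made up for the example.
events = [
    {"browser": "chrome", "country": "US", "load_ms": 100},
    {"browser": "chrome", "country": "US", "load_ms": 120},
    {"browser": "chrome", "country": "DE", "load_ms": 110},
    {"browser": "safari", "country": "US", "load_ms": 130},
    {"browser": "safari", "country": "DE", "load_ms": 900},
    {"browser": "firefox", "country": "DE", "load_ms": 140},
]


def rollup(events):
    """StatsD-style roll-up: one number for the whole window."""
    return mean(e["load_ms"] for e in events)


def group_mean(events, *dims):
    """Query-time aggregation along whichever dimensions are asked for."""
    groups = defaultdict(list)
    for e in events:
        groups[tuple(e[d] for d in dims)].append(e["load_ms"])
    return {key: mean(values) for key, values in groups.items()}


overall = rollup(events)
by_browser_country = group_mean(events, "browser", "country")
slowest = max(by_browser_country, key=by_browser_country.get)
```

The roll-up says the window was somewhat slow. Only the grouped query, which needs the raw events, shows that one browser and country combination accounts for it.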

Liz: Exactly.

One of the challenges that I encountered when we had even a very sophisticated metrics system at Google was that you had to know which metrics to look at, and knowing which metrics to look at was either intuition or leafing through 20 pages of 20 different dashboards, with 20 different graphs in each dashboard.

What wiggled at the same time?

Charity: Everything wiggles at the same time.

Michael: Just given the stochastic nature of it, you're going to make a lot of inferences that seem interesting at the time, and whether you are investigating a current incident or trying to look at it after the fact, it tends to lead you down the wrong path in my experience.

Liz: So, AI Ops. What do you think about AI Ops?

Michael: I think it's interesting as a concept. I'm super skeptical of anything that uses the "AI" label and increasingly so of things that tend to fundraise on the ML label right now.

Because a lot of what I've seen that's called "ML" is best probably expressed as a case in a switch or some nested "if" statements.

Charity: But it's a machine and it's a language, Michael.

Michael: I'm sure those words mean something, but I don't know what.

Frankly, I lack the classical education to fully understand the more sophisticated stuff that's out there.

Charity: That's very gracious of you.

Michael: But the vast majority of what I see isn't the more sophisticated stuff, it's like "I could do that without machine learning."

Liz: I think the other angle is what we just talked about, if you roll 100 different dice some combination of three of them will all come up sixes.

You wind up inevitably having these experiments where one experiment happens to return a sufficiently low p-value, because you've run enough experiments that one of them will do that by accident.
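Liz's point is the multiple comparisons problem. A short sketch of the arithmetic, assuming the tests are independent and each uses the conventional 0.05 significance threshold (both are assumptions, not figures from the conversation):

```python
def prob_any_false_positive(num_tests, alpha=0.05):
    """Chance that at least one of num_tests independent null tests
    comes back 'significant' at level alpha purely by accident."""
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_alpha(num_tests, alpha=0.05):
    """Per-test threshold that keeps the overall false-positive rate
    at or below alpha (the Bonferroni correction)."""
    return alpha / num_tests


one_test = prob_any_false_positive(1)
hundred_tests = prob_any_false_positive(100)
corrected = bonferroni_alpha(100)
```

With a hundred comparisons an accidental "finding" is close to certain, which is why a tool that scans many signals for correlations needs some correction before its results are trusted.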

Charity: I think the saddest thing that I've ever heard of people using AI Ops for is to manage their alerts. Like, come on.

Michael: Creating a spam filter for your alerts, basically?

Charity: Yes, basically. But if you have so many alerts that you think that you need to use AI or ML to manage them, I've got a suggestion.

Delete them all. Just delete all of them, they're not helping you.

Liz: It's a lot cheaper, for sure, than paying someone to--

Charity: It's a lot cheaper, it'll drive you less crazy.

Honestly we see people delete all their alerts when they go to SLO-based monitoring and it drops by an order of magnitude.

80-90% of them are gone.

Michael: As far as I'm concerned, if you have a way to measure customer experience accurately and your user experience accurately, threshold-based alerting doesn't really have a place in that world. It's that simple.

Charity: But the reason it's stuck around for so long is because people have gotten used to thinking that it's debugging.

They get paged about a symptom and this is how they know to go fix something, because they don't have the sophisticated tooling to ask iterative debugging questions of their systems based on science. But they know that they'll get these signals from somewhere in their system and then somebody has to go fix something.

That's the only way that they know that these things are happening, and so they don't want to delete it because they're like "Chaos will reign."

Liz: It's the "If a tree falls in the forest and it doesn't fall on anyone, does it really matter?"

Michael: It's been widely talked about, alerting fatigue is very real as well and people get desensitized to those things.

So using alerts for symptoms like, "A thing is about to happen, everyone get on the edge of their seats" is extremely bad for culture, I think.

Liz: Sorry, terminology nitpick. That's a potential cause of a problem rather than a symptom.

When we talk about symptoms, we usually are talking about symptoms of users who are actually in pain, because if users are actually in pain and they are seeing symptoms of pain like high latency, high error rates, we want to do something.

We don't necessarily want to alert on potential causes.
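One way to alert on user pain rather than potential causes is to compute a service level indicator from real requests and page only when the error budget is being consumed too quickly. This is a simplified sketch, not a description of how Honeycomb or Optimizely implement it. The latency threshold, the SLO target, the paging threshold and the event fields are all assumptions.

```python
def sli(events, latency_threshold_ms):
    """Fraction of requests that were 'good': no error and fast enough."""
    good = sum(1 for e in events
               if not e["error"] and e["duration_ms"] <= latency_threshold_ms)
    return good / len(events)


def burn_rate(events, slo_target, latency_threshold_ms):
    """How fast the error budget is being spent.
    1.0 means exactly on budget; above 1.0 means faster than allowed."""
    error_budget = 1.0 - slo_target
    return (1.0 - sli(events, latency_threshold_ms)) / error_budget


def should_page(events, slo_target=0.99, latency_threshold_ms=1000,
                page_above=1.0):
    # Hypothetical policy: page only when users are hurting faster than
    # the SLO allows, regardless of CPU, disk or other cause-side signals.
    return burn_rate(events, slo_target, latency_threshold_ms) > page_above


# 100 hypothetical requests: one error and one slow request.
window = ([{"error": False, "duration_ms": 100}] * 98
          + [{"error": True, "duration_ms": 100},
             {"error": False, "duration_ms": 5000}])
current_burn = burn_rate(window, slo_target=0.99, latency_threshold_ms=1000)
page = should_page(window)
```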

Charity: We're not precogs.

Liz: Exactly. We're not precogs. So, how does your team handle on-call? What's your alert load like? How do people debug?

Michael: That's interesting.

I'll say specifically for my team, I'm responsible for Optimizely's delivery infrastructure of the JavaScript server that we've talked about, as well as the JSON data files that provide to our SDKs what feature flags exist and a manifest of what should happen.

That gets obviously many requests per second 24/7. It's a fairly flat load, and you would think that's something that requires a serious on-call structure in triage and so forth.

I don't want to downplay the support that my team and my organization has provided, and certainly other teams that manage more lively systems have good support for this, but I have been the primary on call on this rotation for four years, 24/7.

Including vacations, I've gotten paged I think three times. The S3 outage, the Dyn DNS outage and one other one that was a false positive.

Charity: That's great. Congratulations.

Michael: So the escalation path is basically straight to a VP of engineering if I were to miss the page, that's how confident I am that if we get paged it's serious and it should absolutely go to the top if for some reason I fail to answer that.

Charity: I feel like we should be congratulating ourselves more than we do when things are good.

Liz: Yeah, exactly. We shouldn't be congratulating ourselves on, "You handled this firefighting last week. Oh my god, what a martyr. Charity, you're such a martyr."

Charity: Yes.

Liz: How do you get the confidence in your systems to be able to ship all the time and not get paged? What are your safety measures?

Michael: Feature flags. Everything is behind feature flags, and everything is done through canary releases.

If I introduce a change in functionality to our CDN or the way that we deliver stuff, I do it to a small percentage of traffic first.
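A minimal sketch of exposing a change to a small percentage of traffic first, assuming users are assigned to a stable bucket by hashing their ID together with the flag key. The hashing scheme, the flag key and the function names are illustrative assumptions and not Optimizely's actual bucketing algorithm.

```python
import hashlib

BUCKETS = 10_000  # resolution of 0.01 percent


def bucket(user_id, flag_key):
    """Deterministically map a (flag, user) pair to a bucket in [0, BUCKETS)."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def is_enabled(user_id, flag_key, rollout_percent):
    """True if this user falls inside the current rollout percentage."""
    return bucket(user_id, flag_key) < rollout_percent * BUCKETS / 100


# Hypothetical flag guarding a change in how assets are delivered.
users = [f"user-{i}" for i in range(10_000)]
canary = [u for u in users if is_enabled(u, "new_cdn_routing", 5)]
wider = [u for u in users if is_enabled(u, "new_cdn_routing", 50)]
```

Because the bucket depends only on the user and the flag, raising the percentage keeps everyone who already had the change and only adds new users, which is what makes a gradual ramp safe to observe.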

Liz: So you drink your own champagne.

Michael: Absolutely.

Charity: The term of art, I believe, is "Progressive deployments."

Michael: I like that.

Charity: That's the one that James has been trying to popularize, and I really like it.

Michael: I'm happy to help with that, we use progressive deployments. Yeah, absolutely everything.

We have OKRs around this, everything will be shipped behind a feature flag, and we're increasingly moving.

That's actually a surprisingly interesting area of difficulty for us to do canary releases though, in an enterprise application some of our customers are the largest software companies in the world.

Having an experience where even if you were to think, "Different teams could see different experiences," that can be really confusing if you're like "My experience of using the application has changed as compared to another team."

Also, they expect very stable release cycles.

There's a legacy expectation in the SaaS world that you have these really long release cycles, and it's like "Now you shipped V.24 and you have a new add button here," or something.

That doesn't really jibe with modern product development.

Liz: Exactly. I was actually just about to ask this, which is how do you and your customers deal with multiple different feature flags intersecting?

There's no longer a linear number of versions. Instead, it's a combination of different feature flags that can be on or off at any time.

Michael: It's a "Combinatorial explosion," is the term that we use internally.

If you have 30 different feature flags that someone could be exposed to, even if they're only boolean true or false type things, that's an enormous number of permutations that you have out there.
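The size of that explosion is easy to check: each boolean flag doubles the number of possible combinations. A short sketch, using the 30 flags from Michael's example:

```python
from itertools import product


def flag_combinations(num_flags):
    """Number of distinct on/off combinations of num_flags boolean flags."""
    return 2 ** num_flags


# Enumerating is only feasible for a handful of flags.
three_flags = list(product([False, True], repeat=3))

thirty_flags = flag_combinations(30)  # 1,073,741,824 combinations
```

More than a billion possible states means a pre-declared metric per combination is out of the question. The combinations that actually occur have to be discovered from the events themselves.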

Liz: And you can no longer have a metric for every single combination.

Michael: Exactly. That goes back to what I was saying before, why observability becomes important.

You can't have a StatsD metric with 80 underscores that describes this thing.

If you did, it would have two data points and it wouldn't be meaningful.

We're actually working on some really interesting stuff this coming year in what we call "feature monitoring," which looks to explore the interactions between the different states of feature flags and the impact that they have from an experimentation lens.

The impact that they have on business metrics, for example.

Perhaps two people who have this combination of feature flag values have an outsized impact to some other business metric you wouldn't have considered.

Liz: That's something we definitely think a lot about at Honeycomb. Having statistical correlations on any given combination of fields makes it easier to spot, "This spate of errors is correlated with these two particular flags being on at the same time."
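A toy sketch of the kind of question Liz describes, assuming each event records which flags were on and whether the request errored. It computes the error rate for every pair of flags that were enabled together. The flag names and events are invented, and this is not a description of Honeycomb's actual analysis.

```python
from collections import defaultdict
from itertools import combinations


def error_rate_by_flag_pairs(events):
    """Error rate for each pair of flags that were on at the same time."""
    stats = defaultdict(lambda: [0, 0])  # pair -> [errors, total]
    for ev in events:
        enabled = sorted(name for name, on in ev["flags"].items() if on)
        for pair in combinations(enabled, 2):
            stats[pair][1] += 1
            if ev["error"]:
                stats[pair][0] += 1
    return {pair: errors / total for pair, (errors, total) in stats.items()}


# Hypothetical events; flag names are made up for the example.
events = [
    {"flags": {"new_checkout": True, "fast_cache": True, "dark_mode": False}, "error": True},
    {"flags": {"new_checkout": True, "fast_cache": True, "dark_mode": True}, "error": True},
    {"flags": {"new_checkout": True, "fast_cache": False, "dark_mode": True}, "error": False},
    {"flags": {"new_checkout": False, "fast_cache": True, "dark_mode": True}, "error": False},
    {"flags": {"new_checkout": False, "fast_cache": False, "dark_mode": False}, "error": False},
]

rates = error_rate_by_flag_pairs(events)
worst_pair = max(rates, key=rates.get)
```

Neither flag is harmful on its own in this data. Only the combination errors every time, which a per-flag metric would not reveal.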

Michael: Optimizely has invested really heavily in the stats field, and in ways that are far beyond my understanding.

There's a lot of academic papers we released, one called Stats Engine that powers Optimizely's ability to determine what variations of an experiment are statistically significant and better or worse.

Like, the lift or the fall from that.

A lot of it, like I said, is beyond my realm of understanding but a lot of our customers come to us specifically because we have that statistical model.

Charity: So, experiments.

This is actually one of our company values, that everything is an experiment.

Liz: It's painted on our walls.

Charity: It's painted on our walls, yeah.

Michael: That's really cool. I'd love to see a picture of that.

Charity: I'll show you one. But Christine and I say this to each other all the time, but it was recently pointed out to us that we do that not very well because everything's an experiment.

For us, it lowers the barrier to trying things.

We don't have to get attached to outcomes because we can just try things, but we haven't been very good at measuring results.

How do you feel about that?

Michael: It's an interesting problem, "What is experimentation without results?"

That's just doing random things and-- I think when you say "Without results," you obviously have some results but they're just not granular enough.

You have the intuitive ones, like things feel better, but obviously that's not good enough.

If you look at it from that lens, it's like "How good of results do you require for this?"

The cardinality of the different types of experiences that we talked about before, which makes StatsD-style metrics untenable, is the same reason you need better results here.

If you only had exactly one type of experience, you provided it to exactly one type of customer, and all of them did the exact same thing, your need for results would be pretty basic.

You could just count "How many people seemed happy?" That's not good enough anymore.

You need to be able to do what we call attribution to be able to associate "OK, this business metric was higher. But why? What was the result of that? What experiences were they exposed to, whether they were behind flags or as a result of experiments we were running, and what were the contributors to that? How much should you attribute that lift to these different things?" With enough statistical power, because you have enough data going through, you can discover those things.

Liz: That's super interesting to think about, because in my particular world I think a lot about the correlation between system performance and user experience, rather than feature flags and user experience.

It's two sides of the same coin, we're trying to measure user experience but for slightly different purposes.

I'm trying to measure how good does this system have to perform in order to make users happy, and you're measuring what changes can we make in order to make users happier?

Michael: I think the performance thing is really interesting. It's something that some of our customers have expressed a desire to do, and we've actually been working with the Web Standards Group on a potential standard for ablation testing of performance through Chromium and hopefully other browsers as well, which would allow you to synthetically slow things down.

Liz: Yes, so bringing chaos engineering to everyone rather than just requiring you to buy a vendor's product.

Michael: Because something I see that happens a lot is customers have a correlation plot between page load time and revenue, or some proxy thereof.

For a lot of reasons that's OK as a starting point, but that's not strictly correct to do.

The users who have had those slow load experiences have them for different reasons.

Maybe they have slower devices, maybe they have slower connectivity, and those traits themselves introduce a type of bias, so you can't infer things from it.

Like, "OK. If this was 100 milliseconds faster, they would have spent X amount more because other people who had that 100 milliseconds faster experience did."

Liz: Exactly. People who can afford higher internet connection speeds and lower latency might have higher incomes.

Michael: Exactly.

That's the immediate one that comes to mind, but there's a long tail of reasons that we're not ever going to think of here, and so rather than have to try to figure that out something that I'm interested in is being able to establish a plot that shows your visitor's tolerance for slowdowns.

Because it doesn't make sense necessarily, we all know about the diminishing returns of trying to invest in faster and faster performance.

But wouldn't it be cool if you could have a plot that showed where the shelves are in that, where it's like "If you get to here, that makes a big difference, but beyond this--"

And identified a few little notches where you should shoot for.

Liz: Exactly. That helps people set better service level objectives if they know what their service level indicator target should be for latency.

Michael: Yeah, exactly.

Liz: Onto our final topic, how do you spread knowledge of good practices within your team? How do you bring everyone up to the level of your best debugger?

Michael: Our best debugger, I think that's probably something I'm not great at conveying the knowledge on because I am able to have the privilege of leaning heavily on past experience.

I mentioned before how often I've gotten pages, that was definitely not always the case.

I've worked places where I was paged so frequently and so around the clock that at one point I would sleep with headphones in so I wouldn't wake up my wife when I would get paged.

Charity: That makes me hurt.

Michael: I still have the habit of sleeping with headphones in because of the way that place has ruined my psyche for sleeping at night, apparently.

But now it's just like I don't listen to alerts now, I listen to music or something because I've gotten in the habit of listening to something.

But I think that I'm able to draw on all that experience of-- I've seen things break in every conceivable way, so I have this big pool.

But that's not scalable, you can't somehow impute that knowledge to other folks.

Liz: It used to be the case that you would become an excellent systems engineer by learning the hard way, by having 5-10 years of running your system into outages.

Michael: I mentioned the impact that it's had on my quality of life, not just for the funny anecdote, but because that's not something we want to do to people.

We could teach people in a trial by fire, but that's not right, so now we have to find better ways and give them better tools to do so.

As far as spreading the knowledge, written documents, especially as teams become more remote we rely heavily on Google Docs.

Charity: Yes, but also I feel like the days of the playbook are coming to an end. It used to be that your systems could break in some pretty predictable ways.

Liz: Yeah. Playbooks are for the known unknowns.

Charity: Exactly, and increasingly I think that the goal in the future should be every time you get paged it should be something new. Because--

Liz: It might be that the mitigation is in the playbook, but the ability to debug it should be something new.

Charity: Because our systems get to a certain level of bigness and complexity and everything, and you have to fix those things, the common failures and the ones that page over and over.

Because if you don't fix them, you're fucked.

We assume you do fix them, and I think that the still useful place is having a starting point, almost like a dating profile for each service.

Like, "This is how you reach me. This is who's responsible for--" All right, this veered off a little bit.

But here's a good starting point, here are some jumping off points, here are some common patterns to look for.

Liz: Like, "Here is the feature flag to turn off this feature."

Oh, my goodness. If you give me a quick red button, I will press that quick red button and keep my service from being on fire so I can go back to bed and debug it tomorrow morning.

Charity: Resiliency is not making your systems not have problems, it's about your system being able to have lots and lots and lots of problems without waking you up.

I think that having a couple of different channels for alerts, it should only be if your users are in pain that you get woken up.

This is not something we were good at when I was growing up, it was like we just abused ourselves.

We just fired off alerts at ourselves all the time.

Liz: But in this case, Michael's saying that he's gotten paged three times in four years.

How do you convey knowledge to your peers when you get paged three times in four years and you're the only one on call?

Charity: I assume there's still stuff that you need to do to fix things.

Michael: To be clear, when I say that I've gotten paged I'm referring to outside business hours. PagerDuty actually sent me the notification.

We still have, and it's something that I'm not a big fan of, but we have what I'm sure we're all familiar with.

It's the half alert, where you have some system that pushes a Slack notification somewhere and says "A thing is awry."

Those are typically threshold-based monitoring and dead man's switches, which I'm actually a big fan of under the circumstances.

Things like that, so we use those and we look at those things that are coming up and we use those.
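A minimal sketch of a dead man's switch: instead of alerting when something happens, it alerts when an expected heartbeat stops arriving. The class, the job and the timeout are hypothetical. Time is passed in explicitly so the example runs without waiting.

```python
class DeadMansSwitch:
    """Trips when no heartbeat has been seen for longer than the timeout."""

    def __init__(self, timeout_seconds, started_at):
        self.timeout_seconds = timeout_seconds
        self.last_heartbeat = started_at

    def heartbeat(self, now):
        """Called by the monitored job every time it completes successfully."""
        self.last_heartbeat = now

    def is_tripped(self, now):
        """Checked periodically by the monitor; True means 'notify someone'."""
        return now - self.last_heartbeat > self.timeout_seconds


# Hypothetical job expected to report at least every 5 minutes.
switch = DeadMansSwitch(timeout_seconds=300, started_at=0)
switch.heartbeat(now=120)
healthy = switch.is_tripped(now=400)   # 280s since last heartbeat
silent = switch.is_tripped(now=1000)   # 880s since last heartbeat
```

This catches the failure mode discussed earlier, where a process dies before it can report anything: silence itself becomes the signal.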

In the course of regular business hours we have an on-call rotation. The person who is on-call that week focuses on triaging and understanding those sorts of issues, researching them, putting items in the backlog for enhancements, responding to any escalations that come to our team. That person is responsible for taking a look at those, understanding them, and then pairing up, if necessary, with someone else who may understand it better.

Liz: It relieves me so much to hear every time someone says, "There is a dedicated person for on duty or on call."

Rather than spraying it across the entire team.

Michael: Absolutely. I think that's a huge innovation, and to be fair I wasn't the one who drove that.

We have good, great people on our team who think about this stuff and I think that it is a big boost in quality of life for people to know that in the absence of an utmost emergency, an unexpected thing is not coming their way for the next three weeks and they can focus on their job.

Charity: I feel like this is the exchange we have to make, in exchange for everyone being on call.

I do think every engineer should be on call for their systems in some form or another. We have to make it not suck. We have to make it not something you have to plan your life around. We have to make it something that isn't going to predictably interrupt your sleep.

Liz: Indeed, a past guest from Intercom said they run an all-volunteer on-call rotation.

Charity: I love that.

Michael: That's really interesting.

We more or less do in the sense that we auto-generate the rotation through PagerDuty, but people trade and swap so frequently and regularly it more or less becomes that.

I'd be really interested to hear about the details of how they--

Charity: They've written some great blog posts about it, and they're super.

We should put those in the show notes too, but I like the idea of making a world where being on-call or being in the pager rotation is seen as a badge of honor. Something that you--

Liz: A badge of honor, but not a heroic badge of honor.

Charity: Not a heroic badge of honor, but if you're a new grad maybe you haven't yet gotten cool enough that you could be a--

But once you are a responsible member, certainly once you're any senior engineer, you have ascended to the ranks of the people who support this service.

Liz: Exactly. It's not a measure of how much heroism you do, it's a measure of how much expertise and skills that you have to be able to handle the system.

Charity: I also think that we should make it a world where people look forward to their on-call weeks, because think of all the crappy--

You're so busy with your project work, you're building your stuff and there's all these things that annoy you that you never get to.

On-call week is when you have permission to do nothing but that.

Liz: Nothing but look at the weird and interesting things happening inside of your system.

Charity: Yes. Explore, fix things that have been bugging you, improve the deployment process, iterate on these things that are outside your normal lane.

I've always looked forward to being on-call when that's what I get to look forward to.

Michael: I think that's really interesting, and it's certainly putting that lens on it. For an on-call rotation, rather than something that people dread--

Charity: It's agency, a week of agency.

Michael: Absolutely. I think that's expectation setting, and it's a culture thing, that on-call is the time.

It's almost like a hack week. In the event that you've taken care of your stuff and nothing breaks all the time, it's your hack week where you can improve the things that bug you about your area of the product.

Charity: Once you've earned the right, yeah.

Michael: Absolutely. You earn the right by making sure you don't have these nagging alerts that you should have already taken care of, fixing the issues underlying them.

Charity: Totally.

Liz: Cool. Thank you for joining us, Michael.

Michael: Thank you.
