In episode one of O11ycast, Merian Ventures' Rachel Chalmers and Honeycomb.io's Charity Majors dive into a few ways observability can drive fundamental changes in the way we approach software development.
Rachel Chalmers: What are your goals for this podcast?
Charity Majors: I think that the idea was born when I realized how many people were saying, "Yeah, I get the tech. I get the changes that are happening, but I don't know how to introduce this to my org. I don't know the physics of rolling this out across my organization, I don't know how to get people on board, I don't know what cultural changes need to happen."
Because the more you think about it, it's such a sea change in the way you approach computing and developing, that it kind of gets bigger the more you look at it. And so I really just wanted to hear people's stories.
I want to talk to people who have been through this transformation. I want to invite people who are struggling with it, who have failed at it.
I personally only learn through other people's stories and pain, so I thought it would be a good idea to listen to them. I also really just want us to get inspired.
Now that I have a company, I find that people don't trust my word as much as they used to. There are some topics, now, if I talk about databases, they're like, "Yeah, sure, fine." But if I'm talking about observability, they're like, "Well, you have something to sell." And they're not wrong, they're not wrong.
So I want to bring some more credible voices because some of the claims that we make are kind of fantastic.
Charity: We're saying things that have never been possible before are now possible, and more than that, they're trivial, they're easy. Which is always something that makes you kind of cock your head and go, "what are you trying to sell me?"
But I believe that this transformation really is that large and that worth investing in. I feel like, as an engineer, if you're asking me to learn an entirely new system, a new language, a new community, a new tool, it has to be at least 10x better than my old one.
Rachel: At least 10x.
Charity: At least. But it's also never trivial when it's that large of a change, so I just think that it's time to start collecting those stories and talking about them.
Rachel: Great. Feels like a good time for us to introduce ourselves.
Charity: It does.
Rachel: I'm Rachel Chalmers, I'm with Merian Ventures. I'm an investor in Charity's company Honeycomb.
Charity: I'm Charity Majors. I am the CEO and co-founder of Honeycomb. My history is kind of polyglot but very operationally focused.
I've been an early or first engineer on the infrastructure side repeatedly. I was at Linden Lab back in the crazy days of Second Life. I went to a couple of small startups that are very different from each other after that. And then I landed at Parse where I was the first infra engineer. I was there, I built the systems.
I then switched to management at some point and built the teams. I was there through the Facebook acquisition, and I'm someone who's always been drawn to where the beautiful theory of computing meets the awkward, messy reality of actually trying to do things.
And that's what brings me to Honeycomb.
Rachel: My career path has been oddly parallel but also extremely different. I'm a failed English professor. I got more interested in the computer science department than the English literature department at the school where I did graduate studies. It was downhill from there, ending up in a small apartment in San Francisco.
Charity: So that's why the Oscar Wilde references.
Rachel: I can't be stopped. I ended up joining together with a group of friends and becoming employee number 10 at an industry analyst company called 451, where I built the infrastructure practice.
Charity: Oh wow.
Rachel: So I basically told stories about infrastructure. I was the first analyst to cover some companies you might've heard of like VMware and Cloudera.
Rachel: And Splunk. We sold a lot of our research to investors. Eventually, an investor took a punt on me, decided to give me a crack at actually putting my money where my mouth was.
I spent four years at one investment firm where I invested in Wit.ai, which got sold to Facebook, and Docker. And moved on from that gig to a new gig where I'm investing in infrastructure companies majority owned by women, which is a whole other story.
Rachel: But one thing that really became apparent to me in the course of talking to probably 12 or 13 hundred startups by now is that infrastructure is highly, highly leveraged.
The decisions that people make about infrastructure software reverberate for many years, much longer than decisions about applications.
And if I wanted to have an outsized impact in the work that I did, the infrastructure layer was the place to do it.
Charity: I've always said the lower you go in the stack, the more powerful you become, which is somewhat tongue in cheek. But there's a great feeling to be powering the thing that everything else rests on.
Charity: It's an invisible dependency. When you're doing your job well, nobody knows you exist.
Rachel: And it comes to what you said about the hairiness of computing. I think one thing that frustrates me about the tech industry, as a completely reconstructed 1990s tech idealist, is how far we fall short of the promises that we've made.
There's no real technical reason why we can't keep those promises, the limitations are all human factor. The limitations are all cultural. And so one of the things that I've become increasingly passionate about over the years is trying to, not even to change the culture of technology, but trying to get people to actually walk the walk. Everyone talks a good talk here.
Charity: I look at it like it's so much easier to, instead of shutting people down, saying, "Stop doing this, stop doing that, stop doing this." It's like telling somebody: "Stop eating candy, stop eating candy." It doesn't work. What does work is giving somebody something exciting that's better, that they can do.
Rachel: Because I'm a massive nerd, it ties into my huge interest in policy, public policy, and game theory where you can incent the right kinds of behavior without coercing people.
Charity: Not winner-take-all, but you actually make it a collaborative outcome.
Charity: These things don't have to be, "You were a winner before and now you're a loser." It can be, "We all win together, we have the power."
Rachel: Let's talk specifically about observability. We've talked a lot about infrastructure. What's the narrative arc of a life that ends in hosting a podcast about observability specifically?
Charity: Yeah, well, it comes from a lot of failure. What was it that Winston Churchill said about the United States? "They always do the right thing eventually when they've exhausted all the alternatives." I feel like this is how we've arrived at everything we know about distributed systems.
Rachel: He also said that democracy was the worst of all possible systems of government except for--
Charity: Everything else, yes, exactly. Exactly, and that's very much how I feel about observability too. It's kind of like an admission of defeat, in a way, when it comes to predicting what problems you're going to have.
I've been doing ops for years and the pattern has been, you look at a system, you size it up, it got built, it's beautiful, it's about to start serving traffic. You're like, "How should I monitor this thing? What problems am I going to have?"
Then you write a bunch of checks, you write a bunch of graphs and then the site gets turned on. If you're lucky, you could predict most of them, but inevitably, you can't predict a lot of them. And the balance between known unknowns and unknown unknowns is swinging hard.
Charity: When you were building a LAMP stack to serve an e-commerce website, cool. You can download pre-generated Postgres or MySQL.
There was a lot of baked stuff because all these systems were very much alike. Now there's spaghetti. There's just fucking spaghetti all over the place.
Components, you may have half a dozen different data stores, half of which are cutting edge, by which I mean completely broken. And you're gluing this together with routers that you read about in Hacker News because it seemed like a good idea at the time. Or you didn't, somebody did, you just have to come in and make it work.
Observability is basically just acknowledging, "I can't predict it. I can't, and it's stupid for me to even try because the failures that manifest in a distributed system are like this infinitely long tail of things that almost never happen, except that once they did."
Rachel: Black swans galore.
Charity: Yes. Or it has to be five different rare conditions that all collide and you can't stage these.
You can't find these in an artificial environment because it relies on real users, real scale, real data to even find these problems.
So how does a life end up doing observability? I'll tell the story at Parse that really crystallized this. We were getting acquired by Facebook in, what was it, 2013?
Rachel: Yeah.
Charity: And this was about the time that I'm coming to the dawning horror of a realization that we built a system that was effectively undebuggable by some of the best engineers in the world doing "all the right things."
And yet, every day someone, customers would write in, they'd be like, "Parse is down." They're very upset, I'm like, "Parse is not down, motherfuckers. Look at my wall of dashboards. They're all green. They're all green. Check your WiFi."
Rachel: Computer says no.
Charity: Computer says no, right? And I'd be arguing with them, which is a dumb thing to do with your users. Because you're not convincing them that they're not having pain, you're just losing credibility the longer you argue with them about their experience.
Eventually I'd dispatch an engineer to go figure out why things were down. Maybe it's Disney, maybe they're doing eight requests per second, and we're doing 100,000 requests per second. Never going to show up in any of my time series aggregates ever. It's never going to trigger a monitoring alarm.
But we can be completely down for them, so an engineer would go off, investigate and come back in hours, if not days, because of the sheer range of possible root causes.
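Editor's sketch: the arithmetic behind that point can be made concrete. The request rates below are the ones mentioned in the conversation; the alert threshold and all names are assumptions chosen for illustration, not a description of how Parse actually alerted.

```python
# Hypothetical setup: one small customer is completely down while the
# platform as a whole looks healthy.
TOTAL_RPS = 100_000      # platform-wide requests per second (from the transcript)
CUSTOMER_RPS = 8         # one customer's requests per second (from the transcript)
ALERT_THRESHOLD = 0.001  # assumed: page when more than 0.1% of requests fail


def error_rate(failed: float, total: float) -> float:
    """Fraction of requests that failed."""
    return failed / total if total else 0.0


# Every request from this customer fails; everyone else is fine.
global_rate = error_rate(CUSTOMER_RPS, TOTAL_RPS)
customer_rate = error_rate(CUSTOMER_RPS, CUSTOMER_RPS)

print(f"global error rate:   {global_rate:.4%}")
print(f"customer error rate: {customer_rate:.0%}")
print("alert fires:", global_rate > ALERT_THRESHOLD)
```

Under these assumed numbers the aggregate error rate stays far below the paging threshold even though one customer sees every request fail, which is why a per-customer breakdown is needed to see the problem at all.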
Rachel: What could possibly--
Charity: What could possibly go wrong? It's like the infrastructure mantra right there. So it could be anything, it could be some edge case that they were hitting on our systems. It could be a bug in the SDK, it could be anything, hours or days, and sometimes we would have to give up.
We'd just be like, "Well, these guys are paying us 200 bucks a month. We're going to choose not to answer their question." Because 70, 80% of our time, both backend teams, 70, 80% of our on-call time was just tracking these down. And we were not shipping anything and we were still losing ground.
I cannot even tell you all the things that I tried. If there's a piece of software out there, I tried it. If there's a technique, a team thing, I tried it. Tried hiring more people, I tried it. The only thing that finally dug us out of this hole was getting some of our data into a butt-ugly Facebook tool called Scuba that's aggressively hostile to users.
It was developed, developed is a kind word for it, it was thrown together back when they were trying to figure out their own demons in MySQL 10-plus years ago. And it was just so useful that it's just kind of hung around even though nobody's invested in it. That, plus a bunch of plumbing and stuff that we built ourselves to whatever.
It took us about six to nine months to roll this all out to all of our systems. But even after a month or two, we had cut down the time that we spent understanding these issues from hours or days or impossible, to seconds or minutes, and predictably, reliably, and even sales and support could do it.
They could tweak, ask this question about a new user, they could usually find it pretty quickly. I'm in ops, so as soon as we had a handle on that problem, I was on to the next one. I didn't really stop to think about what had happened. All I knew was I wasn't hurting anymore, so I looked at something else, as we do.
It wasn't until I was looking at leaving and I was planning to go be an engineering manager somewhere else and I started thinking about going back to the old tools I'd had to use, and I suddenly realized I no longer know how to engineer without this tool. I can't.
It's like trying to imagine writing software without my development environment and without servers. It's so fundamental to the way I understand what I'm doing that I knew that I would be half the engineer. And so that's when I decided this tool needs to exist.
Let me also say that every single tool's marketing site will tell you that they do all of the things that we do, every single one.
There was a long period where Christine and I were just trying them all and talking to their users. And that's when we realized just how much pent up, barely pent up anger and frustration there is out there with everything, with all these solutions that they've been trying for so long and the ways that they fall short.
It wasn't until we started, I was like, "Well, let's see if we can build this thing. So then if nothing else, I can open source it and I'll always have it, I'll never have to engineer without this thing." We started building it and that's, honestly, when I started to understand what underlying characteristics had made this experience so transformative for me, and that was when I started to understand that this wasn't just a platform problem like I had thought, but this is a problem of complexity of systems.
Charity: And the range of possible outcomes and everybody is hurtling towards this cliff. There's a tipping point, and it can be from a lot of different things, sometimes that's from the complexity of the product itself, the platform where you invite users to do creative things on your systems.
It can be when you adopt microservices. Sometimes it comes from Kubernetes, but there comes a point when you can just no longer predict most of the problems that you're going to have.
Rachel: Emergent complexity.
Charity: Emergent complexity. And you know it when you hit that wall. Everyone knows it, because they're helpless. They start homebrewing stuff. They're desperate because they cannot understand their problems.
Rachel: The reason I'm grinning like an idiot over here is because we haven't actually had this conversation before. And the moment at which you left Facebook and started to build Honeycomb is when you and I met.
Rachel: To hear you describe observability as a failure of preexisting tools cracks me up because I was looking for a Honeycomb. Because history doesn't repeat itself, but it rhymes.
Rachel: All of this has happened before and will happen again. I mentioned that I was among the first analysts to cover VMware and then immediately afterwards, Splunk and New Relic appeared.
I invested in Docker, I was looking around for a tool that would let people manage the infinitely increased complexity of systems based on microservices. So I walked into that meeting knowing that you were onto something and for you to perceive that as a failure of all of the other avenues--
Charity: Not until it bashed me in the head.
Rachel: Right, I had been waiting for an engineer to figure out that the complexity had spiraled out of control.
Most engineers are just way too arrogant ever to admit that they can't comprehend everything that's going on in their systems.
Charity: Yeah, there is a real urge for control, for the illusion of control that we all have. You look at the sales pitch of every tool, it's like, "You just buy this, you will always know what's happening, you never have to think about it. You never have to figure it out. The tool will tell you what to look at."
Rachel: Honestly, I'm clearly a humanities bigot, but I do think this is one of those serious monocultural risks to Silicon Valley, is that so many people came up through STEM careers and business careers and they're really uncomfortable with uncertainty.
Rachel: And hi, the real world is full of uncertainty.
Rachel: The black swan idiom just cracks me up because I was 23 before I saw a swan that wasn't black. Where I come from, all swans are black. I mean, seriously.
Charity: Well, and when you start talking to engineers, especially if you get them a little drunk, you pick up this rock and suddenly, all of the bugs squirrel out and you realize just how many people feel a lot of shame.
Charity: Because they know how many of their postmortems they never know the root cause for, they don't know. And nobody wants to admit that, because it makes it sound like you're a bad engineer. Everybody always gets to the root of it immediately because we're all incredibly good at what we do.
Rachel: The shame is horrible because it means that a lot of what we're building, we're building on delusion. And that is a really strong way to create very weak and flawed software, which is what we're seeing happening.
Charity: Yes, yes, and shame is real. We talk about blameless postmortems, but we don't really talk much about shameless processes, not shaming people for not knowing things, for not always being the expert, for not always knowing what to do, and how do we meet them where they're at.
Rachel: The other thing that I think this culture of rationality and empirical solutions to problems squishes is real curiosity, genuine inquisitiveness.
Rachel: Exactly, these systems are becoming organic in their complexity and that's super interesting. Nobody really wants to talk about it except in the context of, "Oh, Roko's Basilisk AIs are going to eat us."
Rachel: What if they don't, what if AIs are something completely unimaginable? I bet they will be and we're not having real conversations about that because of the fear and the shame.
Charity: We're not. I want to pivot off of what you said about curiosity and exploration to talk a little bit about the philosophy that I think that we all share at Honeycomb.
There's a certain type of person that's attracted to work at Honeycomb. They're a very interesting breed, but it has to do with exploration and curiosity and being comfortable with the unknown, like you said.
You have to give people that dopamine hit of, "Oh, I found it. Oh, there's this thing that I didn't know that customers were going to find and I found it before they did." And in order to do that, I think that the key is social interaction.
Charity: Is the social graph, as Facebook would say, because it is incredibly time-consuming and difficult and expensive to debug something, to learn something, to learn the full stack all the way down to that one little bug in OpenSSL that was written 14 years ago and that's why your thing is crashing.
It takes all this time and then it decays rapidly. It's so cognitively expensive to understand these things, and it's so cheap to share.
Rachel: The one that got away from me as an investor was Slack. I looked at Slack really early on and I tried to get that firm interested and I could not. Because, again, the power of Slack is in its social engagement, in the fact that it magnifies everyone's intelligence.
Charity: Compounds your impact, really. We talk a lot about how can we bring everyone up to the level of your best expert in every single area. Because if one person knows it, everyone should have access to that information. It should be, the iPhone is like your outsourced brain. You don't have to page through it all the time, you don't have to remember it. It lives somewhere and you can find it when you need to.
Rachel: Only '80s kids will understand having to remember your friends' phone numbers.
Charity: Oh my god, yes. But if I get paged and it's about something that I don't know well, it's like an outage with Cassandra, say, and I don't know anything about Cassandra.
But I do know that we had an outage two or three months ago and I think Christine was on call and that's all I need to know to find what she did to solve it, like literally, the questions that she asked while she was trying to understand it.
What did she think was interesting enough to leave a note or add to a postmortem or post to Slack or share with a friend or someone else on the team? And that getting access to that information, or it could be me, maybe I debugged it and I've completely forgotten it because it was a year ago. If we can just help people forget less.
Charity: If we can just help them be slightly better versions of themselves and embed themselves more, draw on the wisdom of their team. Because when an engineer walks out the door and takes all of their information with them, all of their data with them, it's such a loss to the team.
If you're leaning on that person and calling them every time there's a problem with the thing that they know, well, that sucks too. That leads to the burnout, that leads to attrition. That leads to, we would never accept that in a distributed system, why do we expect that in our teams?
Rachel: Because we think that humans are disposable or fungible and they really aren't.
Rachel: What is the biggest problem you're trying to solve right now and why is it difficult?
Charity: The problem we're trying to solve right now will unsurprisingly not be technical. We're very privileged to have a team where I can literally just hand wave away questions, like, "Okay, we need this new storage engine, we'll just write one."
The tech is table stakes, it's always the product questions of, "How do we make this the thing that people need that they don't know that they need?" One of our biggest problems is that people always tell us that they want a faster horse, not a car. If we listened to our users, we would be building better metric systems.
Rachel: Listen, I genuinely want a faster horse and not a car.
Charity: I know you do, honey, I know you do. I wish I could help you with that. But it's difficult because of path dependency and the fact that the industry has 20 years now of developing monitoring systems where you predict what you want to ask and you ask the question again and again.
You check the expected state against the actual state, and you do that with metrics. And so metrics, you can use the term in two ways. One is just a synonym for data, and the other is StatsD-type metrics, where it's a number and you have tags that you can append to the number. These are very, very fast and performant, but they have a characteristic: you've stripped away all the context of that number.
You can't link it to anything else in that event, which means that you have no ability to use it for debugging whatsoever and you can't have high cardinality because of the write amplification of the tags that you've appended to the metric.
Every time you write that metric, instead of being fast and cheap, you write the metric and all the tags, which is very slow, exponentially slower the more you add. So you're limited in cardinality to the number of tags you can have, which is typically a couple hundred.
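Editor's sketch: one way to see why tags run into trouble, assuming a metrics store that keeps a separate series for every unique combination of tag values. That storage model, the tag names, and the counts below are all assumptions for illustration; nothing here describes a specific product.

```python
from math import prod


def max_series(tag_cardinalities: dict[str, int]) -> int:
    """Upper bound on distinct series for one metric name.

    Assumes one series per unique combination of tag values, so the
    bound is the product of the number of distinct values per tag.
    """
    return prod(tag_cardinalities.values())


# Hypothetical tag sets: a couple of low-cardinality tags, then the same
# tags plus a per-user identifier.
low = {"region": 5, "status": 4}
high = {**low, "user_id": 1_000_000}

print("low-cardinality tags: ", max_series(low))
print("with user_id as a tag:", max_series(high))
```

Adding a single per-user tag multiplies the possible series count by the number of users, which is the kind of blow-up that makes identifying fields impractical to store as metric tags under this model.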
When I say high cardinality, definition time, all I mean is say you have a set of 100 million users, your highest cardinality information is always in a unique ID, unique social security number, very high cardinality, very high cardinality but not the highest, would be first name and last name. Low cardinality would be things like gender.
Rachel: Less low than it used to be.
Charity: Less low than it used to be, sure. But lower than last name and the lowest of all would be something like species equals human, right?
Rachel: Yes, yeah.
Charity: So when you think about this intuitively, you know that almost all the interesting identifying information is very high cardinality.
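Editor's sketch: cardinality, as defined above, is just the number of distinct values a field takes. The tiny data set and field names below are made up for illustration.

```python
from collections import defaultdict

# Made-up user records; field names are illustrative only.
users = [
    {"user_id": "u1", "last_name": "Nguyen", "species": "human"},
    {"user_id": "u2", "last_name": "Smith", "species": "human"},
    {"user_id": "u3", "last_name": "Smith", "species": "human"},
    {"user_id": "u4", "last_name": "Okafor", "species": "human"},
]


def cardinality(records: list[dict]) -> dict[str, int]:
    """Count the distinct values seen for each field."""
    seen = defaultdict(set)
    for record in records:
        for field, value in record.items():
            seen[field].add(value)
    return {field: len(values) for field, values in seen.items()}


counts = cardinality(users)
print(counts)
```

The unique ID has as many distinct values as there are users, the last name has fewer, and the species field has exactly one, matching the ordering described in the conversation.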
Rachel: It's in the long tail.
Charity: It's in the long tail and you cannot have high cardinality when you are using metrics. People have been told for years at almost every conference that this is impossible, you can't do it. It can't be done. And they're right. You can't do it with metrics because of the way you're storing bits on disk.
You can absolutely do it; this is not a hard data problem at all. They've had nice things in BI for years. If BI worked the way systems did, they would start with a few dashboards that represented the possible end-stage answers, and they'd just start flipping through to see which one best matches the question that they were trying to ask, you know, number of users that converted to "blah."
That's insane, we would never tolerate that. Instead, they start by asking a small question. Then they look at the answer, based on the answer, they ask another question and they follow the breadcrumbs of data where they're leading you. It seems so easy when you think of it that way, and yet we haven't had this in systems land.
All the mental muscles that we would use to follow these breadcrumbs and debug things intuitively don't exist.
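Editor's sketch: the "ask a small question, look at the answer, ask another" loop can be illustrated with a handful of in-memory events that keep all their fields together. The events, field names, and the `count_by` helper are hypothetical and are not Honeycomb's API or storage format.

```python
from collections import Counter

# Hypothetical events: one record per request, with all fields kept together.
events = [
    {"customer": "a", "endpoint": "/push", "sdk": "ios-1.2", "status": 200},
    {"customer": "a", "endpoint": "/query", "sdk": "ios-1.2", "status": 200},
    {"customer": "b", "endpoint": "/push", "sdk": "android-3.0", "status": 500},
    {"customer": "b", "endpoint": "/push", "sdk": "android-3.0", "status": 500},
    {"customer": "b", "endpoint": "/query", "sdk": "android-3.0", "status": 200},
    {"customer": "c", "endpoint": "/push", "sdk": "android-2.9", "status": 200},
]


def count_by(rows, field, **where):
    """Count rows per value of `field`, keeping only rows that match `where`."""
    matching = (r for r in rows if all(r[k] == v for k, v in where.items()))
    return Counter(r[field] for r in matching)


# Question 1: which customers are seeing errors?
by_customer = count_by(events, "customer", status=500)
# Question 2, based on that answer: which endpoint fails for that customer?
by_endpoint = count_by(events, "endpoint", status=500, customer="b")
# Question 3: is a particular SDK version behind the failing endpoint?
by_sdk = count_by(events, "sdk", status=500, endpoint="/push")

print(by_customer, by_endpoint, by_sdk)
```

Each question is chosen only after seeing the previous answer, and it works because every event still carries the customer, endpoint, and SDK fields together rather than as separate pre-aggregated numbers.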
It is so much easier for us to take a new grad from college and just plop them in front of Honeycomb, and they get it immediately. It's harder for someone who's been using metrics and monitoring for 20 years and has baked in all of these assumptions about how they have to ask the question in these weird ways that get around it.
If you google high cardinality and metrics, all you find is a bunch of people trying to tell you how not to have these problems and you can't. Context is also very important and the wider your event is, the more context you have, the more weird edge cases you can tease out, right?
Charity: So you want an extremely wide event. This is why we had to write our own storage engine. Doesn't exist, it doesn't exist. So that was a long rambling answer about one of the hard problems, which is just getting people to change their frame.
Charity: And understand that they can just ask the question directly and it's simple and it's easy. This gets back to why we want to have customers on this and users and people who can talk about having gone through this journey. Because it sounds like crazy vendor speak when you're just like, "Well, you can do this and it's magic and it just works and it's easy." So I'm not credible on this topic.
Charity: There's also trying to convince teams that owning your shit and being on call does not have to be a life sentence, doesn't have to be miserable. That life becomes better.
Rachel: Well, this again back to Silicon Valley's devaluing of human labor.
Charity: Absolutely. When you talk to software engineers about being on call, the first thing that comes to mind is all of the suffering that they've seen ops teams undergo and impose on themselves. We have a problem with masochism. I'm over 30 now. I don't want to get woken up anymore, either.
Rachel: You still have a problem with workahol, I just got to say.
Charity: I do. That's lifelong, it's fine. But I don't want to get woken up and I value my time. I value everyone's time and the fact is that "on call" does not have to be miserable. The outcomes are measurably better.
Putting people on call who know how to develop the software, we call this "software ownership," where the same person has access, builds something, debugs, triages, talks to customers occasionally, has the ability to deploy and roll back, and is on call sometimes.
It doesn't say you have to do those things all of the time. It says that you can do these things. You understand how and the value of them. Making that feedback loop short and without a bunch of extra hops means you need an order of magnitude fewer people, honestly.
It means that their time is better used, it means that you're not dropping as many packets. It's like a game of telephone. The ops team gets paged and escalates to this other team who isn't the owner, so they escalate to someone else and by the time it eventually gets to the person who wrote it, who understands it, or the person who takes the time to debug it, a lot of the context and immediacy is lost, maybe isn't reported correctly.
It's a mess, it's a giant mess. This is why nobody likes these rotations. But it can be so much easier and it can be so much more empowering to have that ownership and control.
Rachel: My hunch is that this is where the really long-term dividend of observability will come from: this combination of a much more collaborative process and just giving people more agency and autonomy and problem-solving.
Charity: Yes, exactly. And there are safety and security issues.
Rachel: For sure.
Charity: A lot of people will go, "I can't give my software engineers root." Well, this is so you don't have to give them root.
Charity: They should not have to log into a machine to debug their code. They should be instrumenting it at the right level of abstraction so that they can ask and answer new questions. Have we even defined observability yet?
Rachel: We have not.
Charity: Let's do that.
Rachel: Let's define observability.
Charity: What does observability mean? Well, the term is taken from control theory, as everybody's seen the Wikipedia definition. But applied to software, all it really means is you can understand the inner workings of a system, the software in the system, by asking questions from the outside.
You can understand the inside by asking from the outside without, and this is key, without having to ship new code every time.
It's easy to ship new code to answer a specific question that you found that you need to ask. But instrumenting so that you can ask any question and understand any answer is both an art and a science. And your system is observable when you can ask any question of your system and understand the results without having to SSH into a machine.
Rachel: One of the analogies that I've been using to convey my enthusiasm to my non-technical friends is that metrics and monitoring, up until now, have just been a burglar alarm. Observability is closed-circuit TV.
Charity: Nice. This speaks to one of the reasons I think it's important that we have a different term. A lot of people have gotten really up in arms about "Monitoring covers all this, blah, blah, blah. It's been long understood because black box, whatever."
We have 20 years of best practices that we've built up for monitoring systems, and I'm not saying throw these out the door, it's very important to monitor your systems. The best practices are often the exact opposite of observability best practices and I don't think it does us any favors to dilute the waters, to confuse the two.
For example, a classic best practice for monitoring is you should not have to stare at graphs all day. You should not have to look at them. You should trust your alerting system to let you know when there's a problem and you should go look at it.
That's great, that's a great best monitoring practice. But with observability, with these very complex systems, a much better approach is the best practice of, "When I ship some code, I'm going to go look at it. I'm going to see what happened. Did what I expected to happen actually happen? Does anything else look weird around it?"
Just spend some time being curious and exploring. This should be muscle memory for anyone who's working in distributed systems because the alternative is one of two extremes: either you never look at it and therefore there are problems that exist for a long, long time and they're completely baffling when someone does notice them. This is the way most of your systems look right now, by the way.
The other extreme is you add a bunch of alerts to everything and you drive yourselves and everyone else nuts and your ops team quits as they should and you burn yourselves out, or you silence them all and you never see them anyway. So you've just wasted all that effort. Those are your options.
You really need a best practice of just going and looking at it when you've shipped a change to your system.
I feel like observability is, everybody talks about wanting to ship faster and with more confidence. Nobody has any confidence because they shouldn't.
Because they can't actually look at what... I think it's insane that we ship code and wait to get paged. That is batshit.
Rachel: Yeah, yeah, wait for it to fail, wait for somebody to hurt.
Charity: To fail big enough.
Charity: That it rises to the threshold of one of our alarms, that's crazy. It's so much better to just get in the habit of looking at your systems, especially when they're normal.
If you're not used to looking at your systems when they're normal, you don't actually know what abnormal looks or feels like.
And a lot of this is so subtle and it takes intuition and it takes the familiarity of frequently looking at your stuff to see how it behaves under various conditions.
Rachel: That's where the curiosity comes in.
Rachel: What is the thing intelligent bystanders most often misunderstand about your work?
Charity: There are a couple of paired things. They misunderstand that we are not just monitoring and that we do not hate monitoring. Neither of those is true.
A lot of people still try to think about Honeycomb like it's a time series metrics thing. They try to apply all of the same intuitions and they ask us the questions in the same way. And I think it helps to just learn to visualize what an event looks like.
Rachel: You know that not everyone can visualize in multiple dimensions as easily as you can, right?
Charity: I know. Well, no one's spent as much time on it. I think that the main thing that people misunderstand, is that the paradigm shift is real. They have been told that this is impossible and they can't do it. They've been told that it's hard.
It's not harder, it's easier because it's actually addressing the problem instead of spackling over it with all these other tools. Empowering engineers does lead to better services, and it leads to better engineers.
The feeling that I had was, when I left Facebook, when I was like, "I can't live without this," it's because it made me a better engineer and I didn't want to give that up.
Rachel: People want to do good work.
Charity: People want to do good work, people want to be asked to do things that matter.
Charity: And if you're doing something that you believe matters, then you want to do a good job and you want to have the ability to do your work, the power and the trust to fix what needs to be fixed.
There's nothing more frustrating as an engineer than being given something to do, like a task, not being given enough time to do it well, and then being asked to support it. But you're not given enough time to fix it either, you just see the users who want to use it.
But it's kind of shitty, so they're complaining about it. They care, but you're not allowed to do a good job for them, or you don't know how to do a good job. And a lot of this comes down to not being able to directly observe it.
There's a giant black hole that people don't notice is there because they've never seen into it, you know? Just being able to look at what you're doing. And we talked about what a high-cardinality field is: users. Being able to break down by one in 100 million users and then any combination of everything else is how you answer questions from help desk tickets.
Users are like, "I'm experiencing this," and we have all these complicated ways of looking at logs and looking at dashboards and trying to correlate what happened in this system, what happened in that system.
People will have half a dozen tools up trying to investigate instead of simply pulling up the service, breaking down by that user ID and then looking at errors and latency and where they came from. And it's so simple.
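As a rough illustration of the breakdown Charity describes here, the following is a minimal Python sketch, not Honeycomb's API or data model. The event fields (`user_id`, `service`, `status`, `duration_ms`), the sample events, and the helper names are all hypothetical, chosen only to show the idea of filtering to one user and then breaking down by another field.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical "wide events": one record per request, each carrying a
# high-cardinality user_id alongside whatever other fields you have.
EVENTS = [
    {"user_id": "u-1", "service": "api", "status": 200, "duration_ms": 40},
    {"user_id": "u-2", "service": "api", "status": 500, "duration_ms": 900},
    {"user_id": "u-2", "service": "db", "status": 200, "duration_ms": 700},
    {"user_id": "u-1", "service": "api", "status": 200, "duration_ms": 60},
    {"user_id": "u-2", "service": "api", "status": 500, "duration_ms": 1100},
]


def breakdown(events, key):
    """Group events by `key` and summarize count, error rate, and latency."""
    groups = defaultdict(list)
    for event in events:
        groups[event[key]].append(event)

    summary = {}
    for value, group in groups.items():
        errors = sum(1 for e in group if e["status"] >= 500)
        summary[value] = {
            "count": len(group),
            "error_rate": errors / len(group),
            "avg_ms": mean(e["duration_ms"] for e in group),
        }
    return summary


def for_user(events, user_id):
    """Keep only the events belonging to a single user."""
    return [e for e in events if e["user_id"] == user_id]


# Break down everything by user, then drill into one user by service.
by_user = breakdown(EVENTS, "user_id")
by_service_u2 = breakdown(for_user(EVENTS, "u-2"), "service")

for service, stats in by_service_u2.items():
    print(service, stats)
```

The point of the sketch is that the same `breakdown` works for any field on the event, so a help desk question about one user becomes a filter plus a group-by rather than a hunt across several tools.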
Rachel: The amazing thing about that particular feature of observability is that it's instantiated user empathy.
Rachel: You can literally see things from the user's point of view.
Charity: You can be in their shoes.
Charity: Back to the Disney example where I'm like, "Check your WiFi." Well, after we got just our edge dataset in, I could look and see if they were ever hitting our edge. I could tell them confidently in about two seconds, "Ah, you're not hitting our edge, check your WiFi."
That's not a dick way of saying, "You're wrong, check your WiFi." It's, "Well, you're not hitting our edge, we're not seeing your traffic, or we are seeing it and here's the error rate. Here's the latency, hmm, I'm going to investigate. Click, click, click, oh, it's coming from this."
One of the hardest problems in distributed systems is everything's slowing down a little bit, why? Especially for a system that loops back into itself, where any single node or service or database node can slow down the entire thing.
It's circular so you can't tell where, unless you have observability. And then it's like two clicks and every time you can go, that one. Once you know where to look, then you can apply all your traditional tools.
Rachel: Right now the number of systems that really see these kinds of errors is probably in the minority, but all systems are trending that way.
Charity: That's the other thing that smart people miss, is that this is coming for them too.
Charity: This is a cliff that they too are likely to... unless they are planning on working on something very small forever, but most of us don't aspire to that. We like problems of scale. We like problems, lots of users.
And it can be hard to understand just how awful it is until you find yourself there and realize how lost you are and how much time, how many bodies you're throwing at this.
People will try to solve this by hiring outsourcing teams in India to just be cannon fodder, buying all this training, spending incredible sums of money to capture all of their information in logs. When, in fact, they're just using the wrong tools and it doesn't have to be that hard.
There's also just the fact that everybody needs to get better at instrumentation. This is no longer optional for anyone.
No pull request should ever be accepted without being able to answer, how will I know if this doesn't work? Which is not something that we're used to.
This is why I feel like observability-driven development is the next natural extension beyond TDD. TDD was huge for the industry, right? Oh, we develop to match the predicted output of these tests, but TDD stops at the border of your laptop.
Rachel: It does.
Charity: Stops at the network and that means it stops before you hit anything real.
Rachel: Throw it into production and you don't really know what's going to happen.
Charity: You don't have any idea. The only way to do this safely is to test in production, and people are scared of that sentence. But they shouldn't be because they're already doing it, whether they admit it or not.
Every deploy is a unique test of that deploy artifact, that deploy script and those sets of deploy targets.
Rachel: These are live human tests that would never pass an ethical review board.
Charity: Exactly. And once we've admitted this to ourselves, then we can talk about how to make it safer and better with canaries and feature flags and blah, blah, blah. That's a topic for a different podcast.
But observability-driven development means you ship it safely, a small amount to production and you watch it and you see what happens. And you gain confidence.
This is actually the only way to ship faster and more confidently, is to get better at observing it and to make that part of your development process. You release the code to production as soon as possible and you watch it and you develop based on what you see in that feedback.
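One way to picture "ship a small amount and watch it" is to compare events from the new build against everything else. The sketch below is only an illustration under stated assumptions: the `build_id` and `error` fields, the sample data, and the 1% tolerance are hypothetical, and a simple threshold like this is a starting point for looking at the data, not a replacement for it.

```python
def error_rate(events):
    """Fraction of events flagged as errors (0.0 for an empty list)."""
    if not events:
        return 0.0
    return sum(1 for e in events if e["error"]) / len(events)


def canary_looks_healthy(events, canary_build, max_increase=0.01):
    """Compare the canary build's error rate against all other builds.

    `max_increase` is an arbitrary, hypothetical tolerance: the canary may
    exceed the baseline error rate by at most this much.
    """
    canary = [e for e in events if e["build_id"] == canary_build]
    baseline = [e for e in events if e["build_id"] != canary_build]
    return error_rate(canary) - error_rate(baseline) <= max_increase


# Hypothetical traffic: 100 requests on the old build (1 error),
# 10 on a bad canary (2 errors), 10 on a clean canary (0 errors).
events = (
    [{"build_id": "v1", "error": i == 0} for i in range(100)]
    + [{"build_id": "v2", "error": i < 2} for i in range(10)]
    + [{"build_id": "v3", "error": False} for _ in range(10)]
)

print("v2 healthy:", canary_looks_healthy(events, "v2"))
print("v3 healthy:", canary_looks_healthy(events, "v3"))
```

Tagging every event with the build that produced it is what makes this comparison possible at all, which is the instrumentation habit the conversation is arguing for.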
Rachel: This cliff is coming for you, all of you dear listeners. And this is the beginning of a very long conversation that Charity and I are going to have with a bunch of our friends. Thank you so much for coming today.
Charity: Thank you so much for being here with me, Rachel. This has been delightful.