November 7, 2019
In episode 15 of O11ycast, Charity and Liz speak with Rachel Myers of Google and Emily Nakashima of Honeycomb. They discuss serverless development, customer-vendor relations, using logs correctly, and better understanding your data model.
About the Guests
Liz Fong-Jones: Serverless is not actually new, that's a hypothesis that I have, that serverless has been around since Google App Engine. But what is serverless then?
Rachel Myers: I would define serverless as using a managed service for every part of your stack, which is not exactly--
Charity Majors: Every part? What if you just use it for offloading index jobs, or something?
Rachel: Then you're partly serverless. Yeah, you can be somewhat serverless or you can be totally serverless.
Liz: "Serverless" is a spectrum, like being --
Charity: Like gender.
Rachel: Yeah, I would agree. I think that serverless has a bad name for people that are all-in on serverless, and they become ideological about that.
I don't think you have to do that. You can ease into serverless.
Charity: Also, it's the worst name in the world ever. Because it's not serverless. There are servers there.
Rachel: There are servers somewhere, but you can't see them though.
Emily Nakashima: Just put in an ops person.
Liz: I would rather cope with serverless than opsless or no ops. No ops pisses me off.
Charity: As long as we can have no devs, I'm cool with it.
Emily: I'm down for it-- I can't say that.
Charity: This seems like a great time to introduce yourselves.
Rachel: I'm Rachel. I work at Google, and before that I started a company called Opsolutely. Before that I worked at GitHub, and before that I worked at ModCloth.
Emily: I'm Emily, I manage engineering and design teams at Honeycomb. But more important than that for this conversation, Rachel and I used to co-speak together about web perf back in the day, so that is the thing that formed this bond between us.
Liz: How did you two wind up speaking about web perf together?
Rachel: Because we were doing it wrong, and then we started trying to do it better, and then we started telling people about that transition.
Charity: From a pure sense of anger?
Rachel: For me, it was more like a confession. Like, "Let me tell you about this ridiculous thing I did."
We have different approaches to speaking, so I think that the best talks are disasters and she thinks they should always end with something positive or something uplifting, or like a good takeaway.
I am very content to just leave people--
Charity: You think that's just false confidence, or false hope?
Rachel: Like, we'll tell a disaster and then she's like "Here's a tool that solves this problem if you want to pay that much money," or something like that.
I'm happy to just like leave them in a pit of despair.
Emily: That explains so well how we got to where we are now.
Liz: If you have no servers or if you're serverless, does that mean that you don't need to monitor or observe?
Charity: Can you make it somebody else's problem?
Rachel: It would be really great if the people who are managing those servers for you were able to see what you wanted to see, if they were looking for the things you want to look for.
That's a hard problem, though. It's hard as a manager, as you're trained to solve the general problem.
As a specific application, you're trying to look for things that only you know about your application.
That's why I think it's hard to have the generalized solution too. Like, observability for--
Charity: That's a trick question. Nobody can own your service for you, only you know what you're looking for. Every app is specific.
You can offload components to other people, to people who are specializing in those components, but you cannot offload ownership.
Liz: The other thing that I think is interesting about that is everyone has their own view of the world.
By necessity, no matter how much you outsource, you have to keep the interactions between all different components and the glue between all the different components in your head and understand how things flow.
Rachel: That makes a lot of sense.
Liz: So how do we increase the amount of observability across that gulf between your vendor that you've outsourced your thing to and your own code, and you?
Rachel: Right now it's a really hard problem. Right now, I feel like this is a point where we could wax philosophical about how software tends to develop.
Or we have many different solutions and then we find out what's right, and we hone in on that one.
Charity: Then it eventually becomes a commodity, and then you move on to the next one.
Rachel: Exactly. Then we have to-- Like, something else blooms. Serverless right now is in the "Many different varieties" phase.
There's not one way to plug into things, every serverless component is operating a little bit differently and has different quirks and different places where you could plug in.
So, how you do this is like a very one-off answer. It depends what you're doing.
Charity: It's a lot like the clouds were 10 years ago.
Charity: Emily, you're so quiet.
Emily: The troll question that I've been sitting on is "What do you tell people when they're like, 'How do I observe Firebase?' As a vendor, what do you say?"
Rachel: I tell them it's a great question, that they're asking the right questions and they're worried about the right things.
There's not a "Hit this button to observe." Like, that's not a button.
What you can do is your functions are allowed to call out to things, so you can start putting things in there.
It depends on what function provider you are using. I'm biased towards the one I work for, but everyone can choose their own functions provider.
They all have a way for you to either do retries on your own or retries are built-in. So, thinking about how you want it to retry.
If you want it to report your retries vs. "Just keep trying for a while."
It's nice to know when the retries start in case they're all retrying eventually, because you could have a cascading failure.
Like, "We started these retries and then the servers came up, and it was brought right back down because all of the retries had queued up."
Liz: The giant thundering herd.
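As an illustrative aside (not part of the recorded conversation), the retry concern above is commonly handled with capped exponential backoff plus random jitter, so that clients do not all retry at the same moment, along with a hook that reports each retry. The sketch below is a minimal example under those assumptions; `backoff_delay`, `call_with_retries`, `on_retry`, and the default delays are hypothetical names and values chosen here, not any function provider's API.

```python
import random
import time


def backoff_delay(attempt, base=0.1, cap=5.0, rng=random):
    """Return a jittered delay in seconds for a zero-based retry attempt."""
    # Exponential growth, capped so a delay never exceeds `cap`.
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: pick uniformly in [0, ceiling] so clients spread out.
    return rng.uniform(0, ceiling)


def call_with_retries(func, max_attempts=5, on_retry=None, sleep=time.sleep):
    """Call func(), retrying on any exception with jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception as exc:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = backoff_delay(attempt)
            if on_retry is not None:
                # Report the retry (hypothetical hook) before waiting.
                on_retry(attempt + 1, delay, exc)
            sleep(delay)
```

Passing `on_retry` a function that emits telemetry is one way to "know when the retries start" instead of silently retrying for a while.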
Charity: This is giving me so many flashbacks to Parse.
In the beginning we told everybody "Parse is the best way to build mobile apps. You don't have to worry about the backend. You don't have to bother your pretty little heads about it, just use our SDKs."
Rachel: How did they like being condescended to like that?
Charity: Fabulous. They love it, absolutely. I giggle when I condescend and so nobody notices, and for a long time that was true. It's for hobbyists, it's for people who are bootstrapping.
For a long time you don't have to understand your data model, or the underlying storage or the query planner, or all these other things. But then you get featured on the front page of the iTunes store and suddenly you do, and it's the worst possible time.
We give you no insight into it and you're like, "Parse. What the fuck is happening under the covers of my fucking app?"
We're like, "Funnily, you're doing a full table scan times three over here, which you couldn't have known because you're using--"
Liz: It was fast enough earlier.
Charity: You're using an API, right? How are you going to know how that translates into our query planner, or much less how MongoDB is going to do under the covers?
Much less all of the people who are sharing a lock with you on that single database?
Liz: So, it's kind of the multi-tenancy problem? Where you as a single tenant don't have the right to poke around in what's going on in your peers, and yet you need to at least have an
It brings us to the question of how to measure what's going on in a service you don't necessarily have telemetry
Can you just start a trace span before you call out to someone else?
Charity: I've been telling people, for developers who are asking "How do we
Because I feel like instrumentation is something that as an industry, we have no idea how to do it. We have no standards.
There's no consistency, there are a few different schools of thought. If you happen to be lucky enough to work with a good senior dev when you're early in your career you might know how, otherwise you're fucked.
But I feel like the best advice you can give people is "Imagine you're a serverless developer, now instrument your code that way."
Because you should not have to-- Ideal world, whatever. You shouldn't have to care about CPU, load average, read memory that is buffered in vs. page to disk.
I was like, "You should only have to care about 'Can your code execute from end to end successfully or not?' And if it can't, you should have alternatives that you can do."
Rachel: You should have some way to call out and say, "There's a problem here. Let me dump what I have so you can debug this later," or something.
Charity: You should have a plan for what happens when your request doesn't succeed.
Liz: I think the other thing that I saw when I used to work at the same company as you, Rachel, is the notion of support needs to have the same view as the customer.
Even if the customer can't see into the vendor's solution, you still need to be able to say "OK. That request from that customer, they're passing me a trace ID. What does that trace ID look like inside of my system?"
So you're not scrambling around trying to reproduce behavior.
Charity: IDs are the little magical things that knit all systems together, so if you expose them to your customers maybe they can get something sensible out of it.
Emily: Liz, I love that world that you live in where this is a support problem and not an engineering problem. Tell me how to live in that world.
Liz: I think that it inevitably happens at a certain scale and size, where you cannot have every customer request being routed to a product development engineer.
Charity: I often talk and think about this as "Observability has a way of making so many support problems not engineering problems," because instead of having to use your creative brain like "I don't know what's going on, I'm going to go spelunking and open it," because anytime it's engineering it's an
Because you don't know yet, it's a new problem, and the definition of that vs. a support problem is it's bounded and it's
You can give someone a playbook or a recipe and they can do it.
I feel like with observability, when we were debugging Parse it could take hours or days to debug some of these problems.
With observability, it was seconds or minutes. It was repeatable and it felt like a support problem.
Liz: Even if it was a new problem every time, it's the standardizing "How do we answer new
It's not that there's fewer kinds of questions coming in, it's the time to resolve and the complexity of those questions.
Charity: It's the practice, the approach to the mental model of bisecting and taking one small step, and looking at the answer.
You can't tell people which steps they're going to need to take, but you can give them the process and they can always debug.
Liz: What about the process of looking at your log files?
Charity: That's a great question. What do we think about logs?
Rachel: I think that logs are a great fallback, and I think we should log things.
Emily: "Logs are a great fallback" was such a diplomatic way of saying something rude about logs. I'm going to remember that.
Charity: You don't have to be polite here. What do you want to say about logs?
Liz: The reason I ask that rhetorical question is that I'm with you, Rachel. I think that logs are great if you use them only for what they're intended for. Logs are great--
Charity: What do you think about structured vs. unstructured?
Liz: I think that sometimes you really just need to throw a [inaudible] debug in there.
Charity: Sure, if you're a developer who's actively debugging something and you're looking at it on your screen.
Liz: Yeah. But my point is that you can still do that in production, as long as you're not trying to index all of it.
If you're just putting it somewhere and if you need it, it's there, and if you don't need it then it rots and gets expired in a day.
Charity: Spoken like someone who's used to having a very large budget.
Rachel: That's true, it is. I'm pro structured logs. I think they're great. It lets you search everything that you need to find really quickly.
Charity: You can always derive unstructured from structured, and the reverse is not true.
The reason, when I say "I hate logs," I think there are a lot of assumptions that are bundled with the term "Log."
You hear "Logs," you think strings and you think unstructured, you think emitted at random intervals throughout the code, whenever somebody was like, "I'm going to print something out here," and none of those are necessarily true.
You can log in a way that's structured, that's predictable, that's the way we do it.
You initialize a Honeycomb event at the beginning of the request in the service and you pre-populate it with a bunch of stuff, you stuff more stuff in it as it's executing, and then at the end before it's erroring or exiting you ship it off in a single wide block.
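As an illustrative aside (not part of the recorded conversation), the pattern described here, one event per request that is pre-populated, enriched during execution, and shipped once at the end, can be sketched in a few lines. This is not the Honeycomb SDK: `WideEvent`, `handle_request`, and every field name below are hypothetical, and the event is simply serialized to JSON and handed to an `emit` callable.

```python
import json
import time


class WideEvent:
    """Accumulates fields for one request and emits them as a single record."""

    def __init__(self, **fields):
        self.fields = dict(fields)          # pre-populated context
        self._start = time.monotonic()

    def add(self, **fields):
        self.fields.update(fields)          # stuff more context in as we go

    def send(self, emit=print):
        elapsed = time.monotonic() - self._start
        self.fields["duration_ms"] = round(elapsed * 1000, 3)
        emit(json.dumps(self.fields, sort_keys=True))


def handle_request(user_id, path, emit=print):
    """Hypothetical request handler that emits exactly one wide event."""
    event = WideEvent(service="api", path=path, user_id=user_id)
    try:
        # Context gathered while the request executes (values are made up).
        event.add(cache_hit=False, rows_returned=42)
        event.add(status=200)
    except Exception as exc:
        event.add(status=500, error=repr(exc))
        raise
    finally:
        event.send(emit)  # shipped once, on success or on error
```

The `finally` block is what makes this one wide record per request rather than several log lines scattered through the code.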
Liz: You're describing a structured event, not a structured log.
Rachel: Yeah. It reminds me a lot of event-driven programming.
Charity: That is what I call "Events."
But I get people yelling at me, "That's just structured logging."
I'm like, "Yes. But the assumptions that you have for logs are the opposite of everything that we do, which is why we called them 'Events.'"
Rachel: What do you mean, "It's the opposite?" What do you mean?
Charity: Logging traditionally is strings, unstructured, emitted at various points and not wide. Not many nouns.
Rachel: Like, what we're putting in production is like that?
Rachel: Like, what we're putting out--? Like, logging from our production apps is like that?
Charity: I'm just talking about in general, yes, people log that way. That's the historical, like most logs look like that.
Liz: Right, but what most people are going to do when you tell them to structure their logs is put a JSON shim around it, but they're still going to emit multiple per request.
Charity: Yes, which is terrible.
Liz: Here is the compelling argument for why sometimes even if it's just "OK, fine. I threw a little JSON on it, but I'm still emitting it bits here and bits there." Here's why I think it's valuable.
If you're trying to understand conflicts between different things happening in different threads on the same machine or same process, you have to look at not just the events when they finish, but you have to look at what else was going on immediately.
Like, above and below, even if it is in a different request.
Charity: I would argue that's a job for a debugger.
The job of structured logs or events and observability is helping you find where in your system is the code that you need to debug, and giving you enough of the context that you can feed that into your local debugging instance and recreate it.
Liz: But you can't necessarily recreate everything.
Charity: If it's at scale, then you can't log enough to give you that context. Almost certainly.
Liz: The advantage of scale, though, is that if you have something running on a machine you can probably keep a one gigabyte log that just is an open buffer. It's a ring buffer.
Charity: That's true, and it probably gets written over every two or three minutes.
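As an illustrative aside (not part of the recorded conversation), a ring buffer for local logs can be approximated with a bounded deque: new lines push out the oldest ones, so memory stays fixed. The sketch bounds the buffer by line count rather than by bytes, which is a simplification of the one gigabyte buffer mentioned above; `RingLog` is a hypothetical name.

```python
from collections import deque


class RingLog:
    """Keeps only the most recent `max_lines` log lines in memory."""

    def __init__(self, max_lines=1000):
        # A deque with maxlen silently drops the oldest entry when full.
        self._lines = deque(maxlen=max_lines)

    def write(self, line):
        self._lines.append(line)

    def dump(self):
        """Return the retained lines, oldest first."""
        return list(self._lines)
```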
Rachel: I have a question for both of you, because you -- Charity, you don't think that no one should ever log. That's not the position you're taking.
Charity: No, of course not.
Rachel: I want to hear, "What are the necessary and sufficient conditions for when to log and how to log," or whatever?
What is a good log? What does a good log look like in both of your heads?
Maybe Emily, you have opinions about this too, but it's just that you're not out there with your daggers so we don't know.
Emily: I'm Switzerland in the logging wars, I'm sending this one to Liz and Charity.
Charity: I feel like another assumption that people usually mean when they mean "Logs" is "Logging to local disk."
Which again, there is lots of problems with that.
You don't generally want to log to the same disk as you're running your application on because it'll fill up, it'll have contention, if you're running at scale.
Again, these are big problems and it tends to impact you at the absolute worst times, when you're already at peak capacity or there's some other problem.
That's when your log is also going to overflow and compound your problems, so I think that the best telemetry sent from an application doesn't get buffered to disk locally.
Liz: I'm with you there.
I think that the magic of telemetry and the magic of observability is figuring out where you need to poke around more.
I think maybe what we're differing on, Charity, is what our definition of "Poking around more" is.
Because for me, in my past life as an SRE, I spent a lot of time where my poking around was not necessarily spinning up a copy of the same binary and trying to recreate the conditions on my own disk.
But instead, my service was on fire and I was trying to figure out "What do I need to drain, what do I need to whack?"
Charity: I have exactly the same background and past, and I just feel like it's deprecated. I feel like there are generally better ways.
Emily: She giggled. That means she is being serious.
Charity: Being serious, yes. No, I hear you. Look, I would never take away anyone's strace.
I would never let anyone take away my strace, but I feel a little ashamed and dirty inside every time I drag my strace out.
Liz: Yes. You should absolutely feel ashamed and dirty, but at the same time--
Charity: That's how I feel about logging too.
Liz: Yeah, it's a useful tool though.
Charity: I think we're violently agreeing.
Liz: Yeah. We're actually violently agreeing.
Charity: It's very useful, but it's not something I want to lead out and say "Hey kids, do this. Lean on this."
Because it should be your last line of defense, because it is unreliable. It is often harmful and if you lean on it you will send people astray.
You should have people leaning on getting the telemetry out into a place where it's shared state, where everyone can poke around and debug it, and look for patterns across many instances.
But worst case scenario, you've got to have these in your back pocket.
Liz: Yeah, worst case scenario.
I think that's the generalization, that people make the assumption that because it's useful to look at one machine, that they must be able to correlate it across multiple machines. I think that's where that falls apart.
Liz: You use your structured events. You use your metrics, you use your traces to figure out how to-- "Where do I look?"
Charity: The entire notion of a process running and doing something is deprecated. Conceptually, you don't want people thinking about logging into a machine or restarting that binary.
Yes, you sometimes have to do this, but you want people to be thinking about "How do you manage a service as a group, as a
Because there are so many patterns that you will only see when you're zooming out and you're looking at-- You're looking at it from a much wider lens.
Liz: Which is the appeal of serverless, in a way.
Rachel: Nice. Well done. Brought it back around. Wait, before we leave logging I have one question.
Charity: You are determined to start a fight here.
Rachel: I'm not. I want to get the conclusion, I want to get the takeaway.
Say, I feel like I should be
What do you want me to know? Like, how should I be logging well? What kind of things should I be logging, and what kind of content should I put in there?
Charity: Whatever you have to debug, whatever terrible thing you can't get any other way, and so you're just temporarily putting it in there.
Example, you can't get at it through a debugger because for whatever reason, you don't know why, but you can't seem to trip it in a debugger.
Add some logging, put it back in prod and watch it like a hawk, see if you can find it around whatever it is that you suspect might be the problem.
Rinse and repeat, but that is a legit last ditch.
Rachel: Liz, do you agree?
Liz: I disagree. I think that the place where local logs are helpful is for a logical analysis of co-tenancy issues.
That was my bread and butter for 10 years at Google, was co-tenancy issues where you have the one request that is slow is not actually the request that's tripping it.
Charity: For me too at Parse, but logs were not useful to me.
Liz: The logs are very useful, I think, because we were very thoughtful about putting "I'm spending time working on this request. Wait, it's been there for 10-20-30
So you can see, "Here's the request that got me here to begin with."
Like, "I found that through my centralized telemetry. But now I'm looking at what else was happening on that machine around the same time."
Charity: I think the reason that you had that experience and I didn't is because almost all of my co-tenancy problems took place in MongoDB.
So at the application layer, I wouldn't have been able to detect that at all because they were all just waiting on the database and you couldn't see what actually was happening.
Liz: Yeah. Whereas I was operating the database, I was operating Bigtable.
Charity: I was operating the database too, but I didn't have the ability to add the right kind of log lines to this that would have shown those.
Liz: Yeah. Whereas I had those log lines, and therefore I'm seeing from my experience of writing those log lines "OK. It turns out that we are in fact in violent agreement, and that in Charity's case she wasn't able to get useful logs and therefore she--"
Charity: We might or might not have a couple of times forked the MongoDB source code and inserted some log lines, but we were absolutely not allowed to do that as a licensing matter so we probably didn't ever do that.
Liz: What do you think about that?
Emily: This is so fun to me, because what I see here is that there's a missing
If you still have to go SSH somewhere and tail a log file to figure something out, there is such a gap in tooling right here. I'm already thinking about product features that could address this.
Charity: This is what I want people to take away from this. Anytime you have to SSH into a box or something, it's not that you were bad or you did bad things.
You did what you had to do, but it means that there was something missing and you should try and solve that so the next person doesn't have to repeat that.
Rachel: Do you mean something in your automation, something in your tooling?
Charity: Something in your tooling that you're using to understand what's happening.
Rachel: SSH is like the "Break in case of emergency."
Charity: Exactly, yeah.
Rachel: Cool. We solved it.
Charity: We fixed it.
Emily: No, it just became a ticket in my backlog.
Charity: It's about to become a feature.
Liz: I'm going to plug these lovely folks on the podcast. If you found this conversation interesting, go look at TailScale.io.
What they do is they are your last line of defense in a way that's slightly less shitty than writing to a local disk.
Charity: Is this distributed tail -f?
Liz: Yeah, effectively.
Charity: Cool. But this is a great question that I think Rachel had, "What are the interesting new infrastructure and tooling companies and products? What are they building and why is it cool?"
Rachel: Yeah, that is a question I have. I like collecting these companies, but I figure y'all live and breathe this, so tell me.
Charity: One of my favorites, of course, is LaunchDarkly.
I just was tweeting the other day about that awesome little poor man's observability that you can get with just feature flags and logging.
To tie all the conversation threads together: basically, you instrument your code in such a way that you can flip very detailed logging on and off around snippets of code using feature flags, without having to redeploy or do anything except flip the flags. I was just so mind-blown.
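As an illustrative aside (not part of the recorded conversation), the idea of gating detailed logging behind a flag that can be flipped at runtime might look like the sketch below. It does not use the LaunchDarkly SDK: `FLAGS` is a hypothetical in-memory stand-in for a real flag service, and `flag_enabled`, `apply_discount`, and the flag name are made up for the example.

```python
import logging

logger = logging.getLogger("checkout")
logger.setLevel(logging.DEBUG)

# Hypothetical in-memory flag store; a real flag service would be queried here.
FLAGS = {"verbose-checkout-logging": False}


def flag_enabled(name):
    """Return the current value of a flag, defaulting to off."""
    return FLAGS.get(name, False)


def apply_discount(total, percent):
    """Apply a percentage discount, logging details only when flagged on."""
    discounted = total * (100 - percent) / 100
    if flag_enabled("verbose-checkout-logging"):
        # Detailed logging around this snippet, toggled without a redeploy.
        logger.debug(
            "apply_discount total=%s percent=%s result=%s",
            total, percent, discounted,
        )
    return discounted
```

Because the flag is checked on every call, flipping it changes logging behavior immediately, with no new code shipped.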
Rachel: That's such a cool use of LaunchDarkly. It's not customer facing at all, it's just for you. That's great.
Charity: It's just for you.
Rachel: I love it.
Charity: See, the first time-- People have often gotten on my case about "You're just defining observability to mean what Honeycomb does," which is backwards.
We first thought about observability and what needed to happen, and then we built Honeycomb to spec.
But I've been waiting for someone else to come up with an open source way of doing the same thing, and this is the first thing that I've seen where I am like "Yes. This lets you ask and answer any question about the internal state of your system without having to ship new code."
Liz: And without paying a million dollars.
Charity: And without paying a million dollars.
Rachel: At Firebase we have a similar-- We have A/B testing as a feature in Firebase. I'll just throw that out there.
Charity: Is that the same though?
Rachel: Not exactly. It's not a one to one.
Charity: It does sound cool.
Rachel: It's cool.
Liz: I think the important thing is specification at runtime of which users or which servers you need in that set.
It's not just that you're running a randomized A/B controlled trial, it's that you want control over where you are turning these things on.
Or similarly, like right now, we're working on some dynamic sampling-related things in Honeycomb.
One of the things that is always on my mind is "How do we make it easy for someone to say, 'I think that user might be in pain. Let's go sample their stuff one for one temporarily and then turn it back down.'"
Charity: What are your thoughts and feelings on sampling?
Rachel: I think that sampling is a great way to avoid having to ingest a ton of data.
Charity: Having to make hard choices.
Rachel: I don't know, it depends how you sample. If you're sampling well, you can minimize the downside. If you're sampling poorly, then you are missing things.
Charity: There's a lot of people in the industry who tend to hear "Sampling," and the only thing that they hear is "You're a very blunt instrument."
Like, "You're taking one of 20 requests and just dropping the rest on the floor."
Liz: It's the difference between clubbing someone with a rusty knife and actually using a scalpel.
Charity: Yeah. To dig out their toothache or something.
Rachel: This is so disturbing. Like, what is this?
If you think about it, the majority of your traffic you do not care about except in aggregate. Errors, you always care about.
So why not just sample heavily on the things you know to be mostly boring, you care about their shape and the trajectory of the curve, and some representative samples of them.
But then take the things that you know are usually interesting, like any hits to the payments end point, or anything to do with money or errors or a user that you can't seem to capture, or weird behavior.
If you could just dynamically turn that dial up and be like, "Can you capture everything for this user for the next five minutes?"
Then why would you ever want to keep everything all the time?
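As an illustrative aside (not part of the recorded conversation), the sampling policy described above can be sketched as a function that picks a per-event sample rate: keep everything for errors, payment traffic, and any user whose capture has been temporarily turned up, and keep roughly 1 in N of the rest. All names here (`sample_rate_for`, `maybe_keep`, `boosted_until`, the `/payments` prefix, the default rate of 20) are assumptions for the example, not Honeycomb's implementation.

```python
import random


def sample_rate_for(event, boosted_until, now, default_rate=20):
    """Return N, meaning 'keep 1 in N' events like this one."""
    if event.get("error") or event.get("path", "").startswith("/payments"):
        return 1  # always keep errors and payment traffic
    # boosted_until maps user_id -> timestamp until which we keep everything.
    if boosted_until.get(event.get("user_id"), 0) > now:
        return 1  # temporary full capture for this user
    return default_rate  # mostly boring traffic: keep a representative sample


def maybe_keep(event, boosted_until, now, rng=random):
    """Return the event annotated with its sample rate, or None if dropped."""
    rate = sample_rate_for(event, boosted_until, now)
    if rate == 1 or rng.random() < 1.0 / rate:
        # Record the rate so aggregates can be re-weighted later.
        return dict(event, sample_rate=rate)
    return None
```

To "capture everything for this user for the next five minutes", a caller would set `boosted_until[user_id] = now + 300`; the boost expires on its own once `now` passes that timestamp.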
Liz: Yeah, it's interesting in that the world that I used to live in and Rachel lives in now, is the concept of how it turns out that histograms and heat maps align really closely with a sampled event-driven
In that at the end of the day a histogram bucket is a count that represents some combination of events in a time window, and some range of values, and you can choose to attach high-cardinality values.
Then suddenly that starts looking an awful lot like an event containing those high-cardinality values with a sample rate on it, and those are synonyms for each other.
It's just a different way of getting there.
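As an illustrative aside (not part of the recorded conversation), the equivalence described here can be shown directly: if each kept event carries its sample rate, a histogram bucket count is recovered by summing sample rates rather than counting events. The field names `duration_ms` and `sample_rate`, the function name, and the 100 ms bucket width are assumptions for the example.

```python
from collections import Counter


def histogram_from_sampled(events, bucket_ms=100):
    """Estimate per-bucket counts from sampled events.

    Each event kept at rate N stands in for roughly N original events,
    so it contributes N to its bucket instead of 1.
    """
    counts = Counter()
    for event in events:
        # Lower edge of the bucket this duration falls into.
        bucket = (event["duration_ms"] // bucket_ms) * bucket_ms
        counts[bucket] += event["sample_rate"]
    return dict(counts)
```

For example, two events in the 0-99 ms bucket kept at a rate of 1 in 20 represent about 40 requests, while a single slow request kept at rate 1 represents exactly itself.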
Rachel: I'm really glad that Emily works here now, because when we were doing front end performance talks one of the lines that she would end with was "You always want a histogram when you're looking for a tool, just make sure they can draw you a histogram."
I feel like you found your people. This seems good for you.
Emily: That's upsetting, because now I'm the person who goes "We don't have time to build that." But I do want them, I want more of them in Honeycomb for sure.
Charity: What would you like to see us build for browsers, in terms of appealing for browsers? What is most needed that nobody seems to have yet?
Emily: The sampling stuff for browsers is actually super interesting.
Charity: The more often you tell me that I don't need to worry about this because of OpenTelemetry, the more dubious I get. Which is not to say that you're wrong.
Rachel: But if you're collecting OpenTelemetry, you still need a way to do something with it.
That's a standard for collecting the data, and then what do you do once you have it?
Liz: Then it winds up being shipped off, anything from printing to standard out or sending it to Stackdriver or sending it to Honeycomb.
The idea is the API for collecting the information is standardized, and the collection, similar to a Beeline, will be done automatically.
So it just automatically says, "Here are the headers, here are the useful attributes. Make sure to propagate the context."
Charity: It's too good to be true. I would love to believe, I'd love to think that it could be true. I hate doing this collection.
Rachel: I want to hear Emily's wish list. What was the wish list going to be? Sampling for browsers?
Emily: That was just the idea of OpenTelemetry in browsers, just there's so much to dig into there.
There's a whole 30 minutes right there, because the best practices around collecting traces in the browser are so unclear right now.
Like, "Is a trace an entire page view, is it a
There's so many interesting tooling problems about how to best represent what the user is actually experiencing in that case, so it makes me so happy to know that other people will be thinking about it.
Because I feel like these are problems that keep me up at night in my lonely cell, and now there will be other people in there with me.
Charity: I really hope that this project as a side effect can have the effect of pulling together more of the
People have been doing this in the backend for so long, and it's not that you haven't had stats on the front end side, but it's been very
How would you characterize that?
Liz: You've got projects like Crashlytics, you've got projects like Firebase. You've got projects like the one that-- Fabric, right?
Charity: How would you characterize the difference in approach between those and what we're doing?
Emily: I look at the back end side of that and it is definitely more mature than the front end side, but I just see people reinventing the wheel over and over again.
To me, the really exciting thing about something like OpenTelemetry is that we can help get framework authors to start standardizing.
If we can say "We all want to collect this data this way. Can you send it to us this way?"
Then every framework author out there can go "This is the shape that I can make my errors, this is the shape that I can make my hooks for adding context," that kind of thing.
That's the world I want to live in, for sure.
Rachel: If it's in the framework, then every developer just starts doing this by default? It's so easy for them.
Charity: They don't even have to think about it. They're just doing errors like normal.
Because we do jump through so many hoops to try and auto instrument all these different frameworks, and the way that context works in each one is different and the way that errors are propagated in each one is different.
Even just people thinking about the problem of "How do I get events out of this that mean something about what my app is doing?"
If everyone just thought about that and then had the same example code to look at for how someone had done it before them, we would be in such a better place.
Rachel: Then you would be supporting everyone by default, right?
Charity: I do feel like we're in a time of convergence, or about to be finally.
There's been letting a thousand flowers blossom and a new category springs up every night.
It feels like there's APM, there's tracing, there's monitoring and metrics, and there is logging, and there's all the front end--
I feel like we've reached a point where every one of these vendors, by the way, wants the same large amount of money from you.
I believe that practically 10-30% of everyone's infra bill should go to observability, but that's in total.
All these vendors are like, "We should get 10% of your bill."
You add that up and it's like 300% of your bill.
I feel like the fragmentation is so costly for teams when you have to use a human to copy paste an ID from this system to that system to get a trace, or if you're trying to go from your dashboards but all you care about is one user.
Then you have to copy over to the log into a search, and it's just the hopping from system to system and then being asked to pay to store this data.
It's expensive all around, and I think you're starting to see a lot of people pushing back on this from a cost--
I think that we're seeing the ops teams that are no longer willing to sit in the middle.
They're no longer willing to sit in the middle and translate for everybody what graphs mean for their code, and my hope is that in a few years we'll just have an observability category and everyone will be reading from the same view of reality.
Rachel: Do you have an idea of how to converge, like how these many different silos will come to be un-siloed?
Charity: It starts with the people who are writing code, who are actively in conversation with it while they're developing and as they're shipping it. One thing that I think three years ago was not widely agreed upon was that software engineers should all be on call.
It was not. I got so many people angry at me every time we talked about this, and now I feel like that--
Rachel: Because the argument was, "I'm not an ops person. You should be on call, I'm a developer and I just write the code. You operate it."
Charity: Because they're like, "I don't know how to operate it. I'm just going to get woken up all the time, and my brain is too special.
I should not be woken up ever to support my--"
Rachel: I'm sure that's exactly what they said, Charity.
Charity: Kind of, it kind of is. A lot of it is.
Liz: It's the difference between the nice thing that people say and the subtext.
Charity: Like a lot of people were just like, "Keep it away from me." But this is the only way that we can build systems for users that are highly available.
It just is. You have to have the original intent in your head and you have to carry it all the way out to watch the users interact with it.
Liz: It's so weird seeing the metrics that people falsely developed in the world of NOCs.
People were saying, "It's a bad thing," but this team has an 80-90% escalation rate where they can't solve the issue 80-90% of the time.
It's like, on one hand maybe you shouldn't have that layer of indirection. On the other hand, do you mean the status quo before?
Where they were solving 90% of the tickets by doing them rote, by hand? Do you think that was a good thing? They know, right?
Rachel: They've gotten it down to just the real problems.
Liz: Exactly. If you've gotten yourself to where it's just the real problems, then you're right.
That's a step towards "You should just own it yourself."
If it's not the case, and you haven't gotten it down to just unique problems and you still are having these repetitive things that require manual glue, that's not a success criterion for--
That's a failure condition where you're pushing this work on to other people.
Charity: The services should not be flappy, they should not be down all the time, they should not be expensive to maintain.
The only way that you get to that promised land is if the people who are writing the systems who have it in their head are empowered to fix the systems, and they actually look at the systems.
You can't break this up, you can't break up writing and fixing.
Rachel: I feel like I should push back. Like, y'all are team ops. Good for you. I'm really happy for you.
I'm team dev, and here's what I would say. It's not "My brain is too special," it's that "I'm not allowed to--" Like, I am now, but "I wasn't allowed to play with the tools."
Charity: Yes, I get it.
Rachel: So saying, "You write code, you give it to me, I do things with it and then you're not allowed to touch it at that point."
Charity: Yeah, and that's terrible.
Rachel: In that case, if it's like "Rachel. Now you're on call, but you can't use any of these tools that I built," then I definitely would push back.
That is not the case anymore, I definitely live in a better world where I get to play with all the tools, but that's where it's coming from.
Liz: That's about the divides. Some of it is devs saying "I'm too good to do this," but some of it is the ops people who perceive themselves as the keepers of the knowledge.
Rachel: They are gatekeepers, and they see-- There's a strong sense of "This is what makes me special."
Sorry, if you're making fun of devs by saying "My brain is too special," I will make fun of ops people by saying "My tools are too special, you can't touch them."
Charity: Absolutely deserved.
Liz: I think that we have to move to this world where it is safe to experiment with production and it is safe to touch production, so that you don't have to live in fear.
Charity: Yeah. It's OK to fail.
We have to have less fear of production.
We've built a glass castle and we need a playground.
We need to be not focusing on "Make it so it never breaks," we need to be focusing on "How many things can we let break before users ever notice, or before anyone has to get paged in the middle of the night? How many things can we let break and it will be fine until somebody wakes up in the morning and resets it to a good state?"
Rachel: I feel like serverless is great for that.
Charity: It super is.
Rachel: I feel like in the case of serverless, a lot of it is like it will go down and it will come right back up. You'll have no idea.
Rachel: It's managed. Someone might get woken up, but it won't be you.
Charity: Yes. I feel like this is the second coming of DevOps, and the first coming of DevOps was all "Ops people must learn to write code," and I'm like "Message received."
Now for the past couple of years, I feel like the pendulum has--
Now it's like "OK software engineers, it's time to learn to operate your services and to build operable services, because those go hand in hand."
Liz: And we've made it easier, so it's not--
Charity: We've made it easier, and instead of being the ops professionals and instead of being the gatekeepers, we are specialists who are--
We are here to help support you as you learn to do this better, and a lot of devs didn't want to be on call for very good reasons.
Which were that ops people have a long history of masochism, just abusing themselves to--
What I want to say is we're not trying to invite you to this terrible land of abuse, we're trying to genuinely make it better for everyone so that nobody has to be abused.
That means that the people who have original intent, we have to support them in getting all the way to the end.
Because that's how we get that virtuous feedback cycle up and running, so that problems are actually getting fixed and they don't reoccur, instead of just slapping a Band-Aid on top of a Band-Aid.
Liz: So Rachel, you mentioned that you're in a better place now. What changed? What was different?
Was it that the culture was different when you came in, or did you change the culture?
Rachel: When I started working at GitHub was the first time that I was required to be on call for the services that I was writing, for the features that I was writing.
That was a great experience. The tooling was at a place where I could use it, I was automatically given all the credentials to use it.
You sign in and you get the credentials, that was a weird thing at the place where I worked before.
Where "Here's the tool," so I would click through to the tool, "I can't sign in." "You're still responsible for that."
That's fun, so I don't think that I was the--
I feel like I changed jobs and the tooling was at a place where-- And also the culture was at a place.
Liz: How can we learn which companies have good cultures of supporting their devs, and which ones don't?
Rachel: What a question. I don't know.
Emily: No one in the world is better at asking upsetting questions during the interview process and making it look good, so Rachel, you should just tell us your tricks and techniques for finding out all the dirty secrets before you get in the door.
Rachel: Like asking awkward questions in the interview?
Emily: I feel like you'll just go for it, that's the thing.
Where you are like, "Are your developers on call? How awful is that for them?" Like, "How often do they get paged? When was the last time a developer cried in front of you?"
I feel like you have just a wonderful way of pulling that off that other people can't.
Rachel: It's because I have long hair and they think I'm innocent, and I'm not. Not at all. Doesn't that work for you at all?
Charity: Yes, absolutely.
Emily: No, not for me.
Charity: You don't have long hair.
Rachel: You giggle, and that's how you get away with a lot of things.
Charity: I just said that. Yes, it's true.
Rachel: It's so true. People see you as more innocent.
Let's see, how do you find out before you get into a place what it's--
I think that you can ask questions, that's one thing. I think it's fair to ask them about their tooling.
Like, if you ask them "How often are you woken up?" they will lie to you. I don't feel like that's a reliable question that you could ask.
Charity: "What happened the last time you were paged? How long did it take to resolve it? Did you have to ask for help? Is it okay to ask for help?"
Rachel: "What happened the last time you had to ask for help?" Because that's a hard one to lie about.
I think asking about what the tooling is and then maybe asking to see it, if they can go right to it and they can navigate to it, if they can click around and show you things then that means that they're using it on a regular basis.
Charity: "Are you trusted? Do you get the credentials that you need to do your job? Are there different levels of access for ops and dev, and why?"
Liz: I love that question, like "Take me on a tour of your tooling." I love that, that's amazing.
Rachel: I imagine that there's probably some places where it's their own, and there would be another NDA situation there.
Charity: I would sign that.
Rachel: If they will show you, I would ask to see.
Charity: For sure.
Liz: Even if they can't show you the live tool, it's like, "Diagram for me on the whiteboard: what does your investigative process look like?"
Charity: "Where does it hurt when--?"
Rachel: In the case of Google, all the tools are homegrown.
We made all the tools and one of the things that's great about that is that people can describe, "Here's this tool and here's what it does. Then here's this other tool, and here's how they relate to each other."
So if someone can outline for you what all of the tools are and what they're all doing, then maybe they can't show it to me but they could describe for me what it's like.
Liz: The thing that I certainly found during my time was once you learned the tools once, sure it took you six months, but after that you could switch teams every 10-18 months and be perfectly fine.
You'd spin up within a month and you'd be good.
Charity: Another question that I really like is "Do the managers track how often people get paged or woken up?"
Because if nobody's tracking, then they don't care.
Rachel: Yeah, that's a great question. I would also like--
The thing I've heard about Etsy is they had a graph of how often people were getting woken up, and if you got woken up they would automatically-- It would readjust the on call schedule so you could sleep a little bit more.
Charity: That's really nice. What I would always do at Parse was I took myself out of the on call rotation as a manager so that anytime someone got woken up, I'd take it the next night so that they could sleep.
Because that helped me stay in practice and be hands on too.
Liz: I think that 24 hour on call is one of the worst things that we inflict upon our engineers.
Like, break it up. Seriously break it up so you can't be woken up at 3AM and then at 8AM.
Charity: It's doable for a while. Not at Google's scale, for sure.
Rachel: Just to be clear, no one is responsible for all of Google. There's no one person that's responsible for all that.
Charity: You're breaking my brain right now.
Liz: There's an interesting group of about two dozen people who are the defenders of last resort.
Rachel: Wait. Can we--? Is this something you're allowed to talk about? Can you tell us about the defenders of last resort?
Liz: I can say a little bit about it. I served on an on-call rotation that was if multiple things at Google are all broken, you page this team.
They don't fix it hands on, but they coordinate the response.
Charity: They know enough of the pieces of the system that they can fit it into their heads and reason about it.
Liz: Yeah. Like, fit all of the-- Not every detail, but enough of the moving pieces that you can say, "Wake up this person, wake up that person. Get them talking to each other." This higher level incident--
Charity: It's as much a human problem as it is--
Liz: Yeah, it is.
Charity: That last question you asked though, was about "How will this convergence take place?"
That's when I went into "Software engineers have to be on call," because I think that's where it starts.
But I think that the next step of it is if you've got your software engineers on call and you've stabilized to a place where it's not terrible, I think that there's a lot of power to be unlocked by having everyone speak a common language.
Tools create silos, and the border of your tool is the border of your silo.
Right now you've got all these engineering teams who have radically different views of the universe, of reality, and often when you're trying to solve a problem together you spend more time arguing about the nature of reality than you do solving the problem.
I feel like there's a lot of value to consolidation, because if you can have all these teams share a single view of reality, then you can start creating entry points for engineering-adjacent teams, your support teams, your product managers, or people who don't write code but they are elbow deep in your production systems every--
For support teams, instead of just being like "Escalate and triage dumbly," or whatever.
Be like, "OK. I'm going to create a view for you with five questions so that you can put in the user ID, verify that it's actually happening, see if it falls into one of half a dozen problems that we've already triaged and we've got fixes coming for them.
Maybe the advanced ones, futz with that a little bit more and see if you can figure out something before you escalate it."
Liz: Right, support automation. It's absolutely support automation.
Charity: But this is so much about enhancing the humans, it's not about AI and it's not about ML, it's not about taking the humans out of the way.
It's about letting the machines do what they do best and crunch a lot of numbers to support the human who's using their creative human brain to actually solve problems.
Liz: That was a super great conversation. Thank you both for being on the show.
Rachel: I had a great time. Thanks for having me.
Emily: This was part of my job and I loved it.
Charity: Thanks for coming, you guys.