In episode 20 of O11ycast, Charity and Shelby speak with Marco Rogers of Mode. They discuss the wall between dev and ops, application analysis, and wrangling vast amounts of data.
About the Guests
Shelby Spees: So, Marco, tell us about your observability journey at Mode.
Marco Rogers: So I've been at Mode for about a year and a half.
But my observability journey started even before that. I've actually been tracking the conversation around it and Charity and I have been friends for a little while.
So I was aware of Honeycomb and have been wanting to use it actually for several years.
And when I landed at Mode, it was available in a way that I could start digging into, but we hadn't really invested heavily in using it.
And so I kind of became the advocate internally for this thing can do way more than what we've been using it for and we really need to invest in making it valuable.
Charity Majors: It's funny because y'all are a data company, and so, like, Matt over there at Mode was one of the first people who-- He really got it.
I'm curious, when you showed up, what was the delta between what you saw and the potential you saw?
Marco: So Matt heads up our infrastructure and DevOps department. So he's always gotten it.
And that team had definitely started to invest heavily in it. But I came on to the product engineering team.
I've always been an application engineer, product engineer, whatever you want to call it.
But the investment there was a lot more nascent, and what I found is that we hadn't really done enough work to figure out how to make it valuable in the application space, which is why I kind of reached out and why I wanted to talk more about it.
So Matt and I were able to talk really easily about the value and the potential that we saw, but I was on the product application side and able to start to help people figure out what that looked like, because they had it instrumented into the infrastructure side.
But translating that into what we care about on the application side is, I think, not as obvious to people.
Charity: Not obvious, no, please tell us about that. I'm so curious.
Marco: Yeah, so I think I would say that Mode has what has become kind of a more traditional stack, and I think I'm hedging there because I think there's a much wider industry.
But if you look at kind of tech startup stacks, it should look really familiar.
There's kind of this central hub built around Ruby on Rails, and then a bunch of services in our back end that break out different parts of it to scale them up in various ways.
And then a really sophisticated front end build system, because the product itself has a lot of front end interactions.
And so there's a whole build chain and tooling around the front end as well.
And I think what we get out of the box with Honeycomb at kind of the infrastructure level was really great. And what we get out of the box from a product perspective was helpful as well, on the Rails side is what I would say.
So you can immediately just see all the traffic to all of your endpoints, how long they're taking, and things like that.
And that was super helpful. And then the outcome was that it just immediately raised more questions.
Like why is this happening? You know, and I think that that's a great place to be.
But then you kind of have to accept that it's time to dig in and figure out how you can get the answers to those second order questions.
Shelby: Yeah, as soon as you sort of open the curtain on that first level of answers, you're opening up all these new questions that you had never thought about before--
Charity: It's like picking up the rock and seeing all of the bugs just go blah. And you're like "that was there?" Well this seems like a good time to introduce yourself.
Marco: Absolutely. So I'm Marco Rogers. I'm a senior software engineer at Mode on the enterprise product team.
Charity: This might be the first time we've actually had somebody on this observability podcast that is a big Honeycomb user. Well we had not on but we didn't really talk about Honeycomb stuff at all. You know, I think that we've tried so hard to be very agnostic and very not pitching Honeycomb. We're doing that thing that engineers do when they get so afraid of being salesy that they forget to mention their own stuff. This is awesome.
Like yeah that that feeling of like, I answered a question. Great now I have 10 more. This to me, I feel like we were all kind of fed a lie when it comes to debugging our systems, cause vendors are always like you know, we'll just show you the answer.
And like we'll tell you what to look at. And like you'll see it instantly. And you never ever do.
And like the process of like, that I like grew up debugging was like, you look at a system you see errors you think real hard.
And you formulate a hypothesis, and then you go look for the data to support the hypothesis.
But with observability that's not the workflow at all. It's more like you start at the edge or super high level and you ask a small question, like where are the errors? Like which endpoints? And then you ask another question, and it's one foot in front of the other, and it always takes you to the answer.
Shelby: I love the term Liz uses of first principles debugging, where you don't need to know what's going on underneath to be able to find the outliers and to be able to ask novel questions.
You can just take what's in front of you and start asking about that.
And I feel like that's really important, especially across product teams, where not everybody knows the ins and outs of every part of the system, and it gives people access to the different parts of the system. So that's always been exciting for me.
Marco: I agree, and I think maybe my background kind of comes into play here because I feel like debugging has always been one of my strong suits.
I think different engineers can have different qualities and skills, but I've always kind of felt pretty comfortable with debugging.
And I do think of it as investigation. You have to start with, what do I know and what questions do I have, and then be able to investigate.
And you kind of mentioned other tools.
I'll be that person to put the stake in the ground. I've just never met an analytics tool that I liked because, and this is my perspective, they all kind of take the stance of, like, you're producing tons of data. We know what you want to see, we'll show it to you. And I'm like, you don't know what to see. And you're not letting me dig in to figure out what I want from my data.
Shelby: And I think that's, I mean, you even brought it up. I think that's the difference between an analytics tool and an observability tool: analytics gives you the answers.
Observability helps you ask the questions. And I think it's really important to make that distinction and continually hold that line there.
For Honeycomb, and I think for observability tooling in general, like, the goal shouldn't be about trying to outsmart the engineer--
Charity: It centers the human like we want to help them do better.
It's been really interesting for me, something I never realized as an engineer, is that like big companies, like CEOs, CTOs they trust their vendors more than their people.
Like to them like employees come and go and a vendor relationship is forever.
So like the pitch is actually, it kind of horrified me when somebody told me that and I realized that it was true.
Like, they don't want to hear that their people are necessary in a weird way, they want their people to be fungible.
They want the tool to be, cause it's more reliable to them. And they're just doomed, it's not going to work.
Because at the end of the day, somebody has got to understand your system. And I feel like, you can feel the difference.
Marco: I think you and I are aligned on this Charity.
One of the things that may not be obvious from my introduction is I've been doing this for a really long time.
I've been building websites and applications for 15 plus years. And I also spent quite a bit of time in management.
Charity has these great blog posts about kind of the pendulum swinging between engineering and management, which I think is off topic but I also love them and you should check them out.
But I've been on that management side and specifically been in the position of evaluating vendors and whether to say "yes."
So if I was kind of putting my leadership hat on here, what I would say is that it's really important for leadership to provide tools to the team.
But that's insufficient, you have to go the next step of empowering the team and making sure that the tools are paying dividends.
And that's what I see being missing a lot.
People think just, like, buying New Relic and giving it to the team produces value. It does not. People just pay them tons of money and don't see the value.
So taking that extra step I think is critical. But figuring out how to do it is also tough.
Charity: Figuring out how to do it is very tough.
I remember when Danielle and I started talking about the stuff that eventually led to BubbleUp.
It was the fact that for-- I was frustrated, because I'm like, people keep asking me, "Can you just show me the answer?"
And I keep launching into these long winded blah blah blah. "Here's why you don't want that."
And I'm like what if I could just say, "Sure," and round up to Yes.
Because this is VCs, they don't actually want to understand this, they just want to hear that it's not magic, or that there is magic or whatever.
And Danielle, our brilliant, like, data scientist, went off and arrived at BubbleUp, where it's brilliant because we're not taking away any of the data.
But what we are doing is we're computing, like, for all values inside the bubble that you drew and outside the bubble, and then diffing them and sorting them.
So the detail gets sifted to the top, and your eye's just naturally drawn to it. All the data is there. We didn't make any decisions.
But we laid it out in a way so that you can pick out the patterns for yourself.
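As an aside for readers, the inside-versus-outside comparison Charity describes can be sketched in a few lines of Python. This is an illustration of the general idea only, not Honeycomb's BubbleUp implementation; the function names, the event fields (`endpoint`, `build`, `duration_ms`), and the "slower than 500 ms" selection are all assumptions made up for the example.

```python
from collections import Counter

def value_shares(events, field):
    """Return the fraction of events that carry each value of `field`."""
    counts = Counter(e[field] for e in events if field in e)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()} if total else {}

def compare_selection(events, in_selection, fields):
    """Rank (field, value) pairs by how differently they are represented
    inside the selection versus outside it."""
    inside = [e for e in events if in_selection(e)]
    outside = [e for e in events if not in_selection(e)]
    rows = []
    for field in fields:
        share_in = value_shares(inside, field)
        share_out = value_shares(outside, field)
        for value in set(share_in) | set(share_out):
            diff = share_in.get(value, 0.0) - share_out.get(value, 0.0)
            rows.append((field, value, diff))
    # Largest absolute difference first; nothing is dropped, only reordered.
    return sorted(rows, key=lambda row: abs(row[2]), reverse=True)

# Hypothetical events; the "bubble" is every request slower than 500 ms.
events = [
    {"endpoint": "/export", "build": "v2", "duration_ms": 900},
    {"endpoint": "/export", "build": "v2", "duration_ms": 850},
    {"endpoint": "/home", "build": "v1", "duration_ms": 40},
    {"endpoint": "/home", "build": "v2", "duration_ms": 35},
    {"endpoint": "/export", "build": "v2", "duration_ms": 60},
]
ranked = compare_selection(
    events, lambda e: e["duration_ms"] > 500, ["endpoint", "build"]
)
for field, value, diff in ranked:
    print(f"{field}={value}: {diff:+.2f}")
```

In this toy data the `endpoint` values sort ahead of the `build` values, because the slow requests differ from the rest more by endpoint than by build. Every row is still in the output; deciding which difference matters is left to the person reading it.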
Marco: I love the term BubbleUp.
It resonated with me immediately, because that's what I do want.
I don't think we are kind of suggesting that there's no value in taking some of the really common things and surfacing them. Right?
Putting them in front of people in a way that might make them take action. But I do know the limitations of that.
Shelby: Sure, and what I like about BubbleUp specifically is, like Charity said, it's not trying to be smarter than you.
It's about-- I mean, it's much more complex on the data analysis side, but it's about the level of, like, sorting, or order by price versus order by most recent or something like that, and so--
Charity: The key insight of BubbleUp, I think, is that any machine can detect a spike, only a human can assign value to it.
Like, think about how many times you can look at your dashboards, you can see random spikes and you're like, "This is a good spike, I expected that to happen. I wanted that, that's fine."
It doesn't have meaning until you come along and assign meaning to it.
And the cool thing about BubbleUp is, you know what's important, so you just, like, point to it.
And then like it lets the humans do what humans are good at and machines do what machines are good at, which is crunching lots of numbers.
Marco: Humans are really good at looking through things and pattern matching and pulling out things that look like anomalies.
Which is what I love about BubbleUp. It's not like "here's the answer." It's like "here's a bunch of stuff that might be going on, like, figure out which one matters to you."
Charity: The other thing I like about it is that like, the Big Bang style of debugging where you're like, "it's probably this I'm going to look for evidence of that."
Well, it might be that and five other things, or that might also be, like, another symptom of the true root cause. Like, how many times have you just been like, "I know what this is," you run over and do the thing. "Oh, it wasn't that."
I feel like the model of assuming you don't know the answer, once you--
This is a real leap, though, for people who have spent, like, 20 years or whatever conforming, figuring out how to express themselves in the very limited language of, like, metrics and time series tools.
Like, there's so much that you have to unlearn before you can relearn it again as a very simple thing, because it's doable with, like, the tools that we used, Ganglia and geometrics.
Like, most of these things are doable if you know what you're trying to do in advance and you're gathering the data in the right way.
It falls apart when you can't predict the ways you're going to need to gather the data to ask the questions.
But it's not that it was impossible. It was just that-- I don't know where I was going with that.
Shelby: But it's a different paradigm. It's a different direction.
Marco: So if I was bringing it back to the topic of kind of product analysis, or like application analysis, I kind of follow what's happening in the infrastructure space.
And I really love the conversation, but it's not my area.
And instead I think that in the infrastructure space, we've seen kind of an explosion of innovation, but also kind of an explosion of complexity and an explosion of data.
Like we generate many orders of magnitude more data than we had in kind of previous decades, and all of the innovation has been about trying to wrangle that.
And, you know, this is kind of my perspective as well. I think we're starting to see that happen more on the application side.
Because it was simpler in the past, like I just needed to tail logs. And I just got really good at that, and you can do a lot with that.
But that just feels insufficient these days with the complexity of our applications.
And so I think and I hope that on the product application side, we're going to start seeing more of that innovation.
Because I think product analysis is also an observability problem. It's part of the reason why I feel like I'm here.
Charity: That's so interesting, cause I feel like infra is starting to catch up with product when it comes to this.
Cause, like, things like a user ID have always been a high cardinality dimension. And it's always been critically important to be able to explore by, group by, and break down by.
Like, the entire key to us getting over our year of terribleness at Parse is just the ability to break down by, like, one in a million app IDs, and then break down by endpoint, and then break down by raw query, just chaining them up like pearls on a string.
But those are all very producty things. And like it feels to me like y'all have been ahead of us there, honestly.
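For readers who want to see the "pearls on a string" workflow concretely, here is a minimal Python sketch of chaining breakdowns: find the noisiest value of one field, narrow to it, and break down by the next field. The event fields (`app_id`, `endpoint`, `query`) and the sample data are hypothetical; they are not Parse's schema or Honeycomb's query API.

```python
from collections import Counter

def break_down(events, field, top=3):
    """Count events per value of `field`, most frequent first."""
    return Counter(e[field] for e in events).most_common(top)

def narrow(events, field, value):
    """Keep only the events where `field` equals `value`."""
    return [e for e in events if e[field] == value]

# Hypothetical error events.
errors = [
    {"app_id": "app-17", "endpoint": "/classes/Score", "query": "find where level > ?"},
    {"app_id": "app-17", "endpoint": "/classes/Score", "query": "find where level > ?"},
    {"app_id": "app-17", "endpoint": "/classes/Score", "query": "count"},
    {"app_id": "app-17", "endpoint": "/login", "query": "find where user = ?"},
    {"app_id": "app-42", "endpoint": "/login", "query": "find where user = ?"},
]

# Each step asks one small question, then narrows before asking the next.
scope = errors
path = []
for field in ["app_id", "endpoint", "query"]:
    value, count = break_down(scope, field, top=1)[0]
    path.append((field, value, count))
    scope = narrow(scope, field, value)

for field, value, count in path:
    print(f"{field}={value} ({count} errors)")
```

The point of the sketch is that each field can have any number of distinct values; nothing has to be pre-aggregated for a breakdown chosen in advance.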
Marco: Maybe the grass always looks greener.
Charity: True, but like users are the original agents of chaos. You can never predict what users are going to do to your systems, and you shouldn't even try.
And you guys have embraced that from the start.
Marco: I think that's true. I think as a kind of a counterpoint, maybe it helps to extend into the front end.
Because I believe that the tooling there is nowhere near where it needs to be.
We're all building like huge complex client side applications.
And I'm like that feels like such a failure to me in 2020.
Shelby: I mean front end is the ultimate distributed running of your application, there's nothing you can predict about what--
Charity: Environment's going to be. What else is going to be happening?
Marco: Yeah in many real ways your server cluster has extended to everybody's laptop.
Shelby: So I want to hear more about, like, the kinds of questions you want to be able to answer with observability, and maybe we can talk more about what that might look like.
Marco: Yeah absolutely. If I'm thinking about what Charity was getting at, I do think that we have some good tools for what I call product usage analysis.
It started out with things like Google Analytics, which has gotten way too complicated and not useful.
But if you look at things like, you know, Heap, there's some really great products out there that will tell you what your users are doing, like what they're clicking around to, what views they're looking at, and it's really great feedback from a product perspective.
But I think what I'm really getting at is that space in the middle where the engineering team is still trying to maintain this system and deal with issues and bugs and errors.
And the kind of data that we need visibility into is a little bit different. I understand that they clicked on something, I want to know why it didn't work.
Shelby: Or you know where people drop off in the user experience, things like A/B testing or feature flagging, where you could end up with like an explosion of cardinality, comparing different changes.
And especially on the front end, it's very hard to track and compare and diff people's experiences.
My vision for this is for a single user ID, you can follow the trace or you can follow the exact path that they took from clicking a button to like how that calls back to the back end, and then the database and then all the way back to the front end where it returns.
Charity: Yes but like that is only like, until like hardware is so just incredibly cheap that we can't really fathom it like--
You're going to have to do that in a heavily sampled way, because nobody's ever going to pay, I don't think, for observability that is 10 to 100 times the size of their infrastructure cost, which is what you'd be in for there.
Shelby: And I think at the level of, like, capturing every single user's every single experience, that might not be--
Charity: Like maybe if there's a user that you're working with on a bug, you could ask for permission to, like, capture this on the fly in high fidelity for a period of time. And then you could generate that volume of data.
Marco: We need it to be configurable. I think you're right about that.
You know, there's a whole conversation. I work at Mode, we live and breathe data as well.
And I think it's really important to accept that we moved from a world of kind of not really knowing what was going on and systems being really opaque to the systems generate tons of data, we're drowning in data and the challenge has become how do we make sense of it?
Right? How do we get the signal out of the noise.
And that's the challenge of a lot of the tooling. So it has to be configurable so that, again, you need to bubble up. You want all the data, cause you never know what is going to be useful, but you want it to bubble up in a way that a human can manage.
And that kind of becomes the challenge. So I don't want to throw any data away. I just don't want to have to drown in it trying to figure out how to maintain my system.
Charity: You just want something magical. That's all. Is that so much to ask for?
You just want to keep everything forever and have it, like, magically surface the right little nugget at the right time. Like come on.
Marco: And I want a pony.
Charity: And you want it free. Don't forget the free part.
Shelby: Until we can grant that, something I've talked about with people is even, like, feature flagging your instrumentation, so that if you start to hear complaints about a very specific subset of, like, a user experience, you can turn on the full stack trace below a certain level of granularity, which you might not want all the time.
It starts to get a little bit into like the over optimizing or optimizing for costs versus using sampling and dynamic sampling to be like smart about your ingest.
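To make the idea in this exchange concrete (heavy sampling by default, with full-fidelity capture switched on for specific users), here is a minimal Python sketch. It is only an illustration under stated assumptions: the `high_fidelity_users` set stands in for whatever feature-flag system you use, the 1-in-10 rate is arbitrary, and none of these names come from a real library.

```python
import random

def should_keep(event, high_fidelity_users, sample_rate, rng):
    """Return (keep, effective_rate) for one event.

    Users in `high_fidelity_users` (a hypothetical flag) are never sampled;
    everyone else is kept roughly 1 time in `sample_rate`.
    """
    if event["user_id"] in high_fidelity_users:
        return True, 1
    return rng.randrange(sample_rate) == 0, sample_rate

def sample_events(events, high_fidelity_users, sample_rate, seed=0):
    """Apply the sampling decision and record the rate on each kept event."""
    rng = random.Random(seed)
    kept = []
    for event in events:
        keep, rate = should_keep(event, high_fidelity_users, sample_rate, rng)
        if keep:
            # Storing the rate lets later queries re-weight counts.
            kept.append({**event, "sample_rate": rate})
    return kept

# 1000 hypothetical click events spread across 50 users.
events = [{"user_id": f"u-{i % 50}", "action": "click"} for i in range(1000)]
kept = sample_events(events, high_fidelity_users={"u-7"}, sample_rate=10)

# Each kept event stands in for `sample_rate` original events.
estimated_total = sum(e["sample_rate"] for e in kept)
print(len(kept), "events kept; estimated original volume:", estimated_total)
```

Recording the sample rate on every kept event is the design choice that matters here: it keeps aggregate counts estimable even though most events were dropped, while the flagged user's path stays complete.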
Charity: I feel like we have to help engineers usher themselves into like, a different way of interacting with production.
Like they think of it as this one way, they push the code out, and then they back away, and maybe they're on call, they're kind of ahead of the curve.
But like production has to be a constant, like sort of conversation between the owners of the code and the code as it's being used by the users.
And what I love about feature flags and that model in general is that it just, like, shortens that loop a lot.
And it makes it so much more reactive and interactive and like instant and fine grained and I think it's the right--
There are so many people who have gone, "Honeycomb didn't really make sense to me until I also tried tracing and feature flags, and it all kind of goes together, doesn't it?"
And like, yeah it does. And it's really about, like that wall between Dev and Ops just has to like die in a fire because it can't exist.
Like we have to be unified, integrated builders and maintainers in our craft. And that means getting our hands, getting ourselves in up to our elbows in the clay sometimes, constantly.
Marco: Absolutely. I think feature flags are a really interesting example to dig into in the product application space, because we tend to think about our product as this kind of one experience, but it actually breaks down into multiple experiences, because you're rolling out different feature flags and different users or different customers are potentially having very different experiences.
So when you're looking to maintain those, you need that level of visibility as well.
If I'm getting this problem reported, can I see where it's happening?
Can I see a breakdown by which customer it's happening to?
Can I see if it's only in this feature flag that we're trying to roll out right now?
And it's that level of investigation that we end up doing. But the tooling needs to catch up. It's hard to pull those things out today.
Charity: You know, I feel like the engineers who spend all their time in, like, test clusters and staging and stuff, I feel like they never grow up.
I feel like they have instincts that are honed and, like, forged in environments that aren't real. Like their intuitions are off.
And, just, like, the definition of a senior engineer in engineering terms to me is really someone whose instincts I trust.
Like you have to be exposed to the real world, right, in order to train your own little neural nets.
Shelby: Yes, and I think a lot of that has to do with enabling people and empowering people to feel ownership and to take ownership--
Be proactive about taking ownership of their services in production.
I definitely agree that there's a certain amount of maturity as an engineer that you just don't have access to unless you've lived in production and seen, you know, what your code actually does.
Marco: Yeah I think that's really interesting. And it takes me back to what we were saying before, which is kind of how there are so many vendor tools.
And in many ways the vendor tools are like your window into the application, rather than building your own, like mental model and level of comfort and familiarity.
And there's all these tropes, like never SSH into production, like I'm that person.
I was like, "Really, I can't anymore?" Like, and I think there's kind of a generational gap where I don't know how you would build a level of familiarity and comfort without just really getting in there in the system.
So I don't know what that experience is like for engineers who are walking into these complex systems without them.
Charity: That said I will point out that like for me, I think of serverless developers as being --
When I'm trying to explain to someone like the next generation of like paradigms for how you interact with your systems, how you instrument how you like--
Just think of how serverless does it, because it is doable. It's just that we all learned to interact with systems through the shell.
But the lens of your own instrumentation is arguably the better lens for a developer to learn through as long as the toolset ecosystem is rich enough to actually peer under the hood.
Marco: Yeah absolutely. I think it brings me to a question that we might want to explore.
I'd love to hear your perspective on which is, where are we in that journey of the tools being good enough to give you the right insight.
Because I think what we end up talking about a lot is how the tools aren't adequate.
The tools aren't helping me the tools aren't giving me the level of confidence or the level of visibility or discoverability that I need.
Charity: God yeah we're early on.
But I will also say that I think that part of the fault and responsibility lies with engineers. Like, we're like 10 or 20 years behind where we ought to be when it comes to instrumentation best practices, because we've been relying on this magic from the vendors.
And so it's almost like there are these oral traditions out there. Like there's the Google oral tradition.
There's like some other like tribes that we all learn these from each other, but like people don't know how to gather data from their system that they're going to need in the future.
They don't know how to structure it, they don't know how to capture it, they don't know how to project themselves into the future by a couple of weeks and be like, what is future me really going to want to have at 2 a.m.? You know, that instinct and that internal loop has just kind of withered away.
Shelby: Yeah and some of it you've mentioned before, like some of it is on the vendors.
Like basically discouraging people from adding certain kinds of instrumentation, because it's just going to get just like prohibitively expensive.
Charity: Or just like, "we can infer it all from what you did." No you can't, you can never infer original intent. It has to come from the dev.
Marco: I think that that's super interesting. So kind of going back to being an observability advocate.
What I've run into, I think, you start with kind of introducing people to observability.
Like what is it? Like it's a buzzword, but what do people actually mean? How is it different from what I've been doing?
But I think once you start to kind of dig in, you start to talk about instrumentation.
There are some things we get out of the box. And when I do demos, when I show people what we're getting from a system like Honeycomb, just kind of out of the box,
they didn't even do anything, but look what you can see, like people are kind of astounded, and it, you know, generates a lot more interest.
But then there comes the work. And I'm like, now I need you to start instrumenting things, and they're just like, "I don't know, what does that mean? Like how do I do that?" And like--
Charity: And the thing is it is easier if you figure out how to do it. And it becomes just like intuition.
It is easier to write code, like, instrumentation first than without it.
Like it gets to a point where you just feel blind without it. And you start to rely on that, because--
It angers me when I see people trying to understand their code by reading it because fuck you, you can't.
You know that is not the source of truth for your code.
The source of truth for your software is in production. And the sooner you make that mental shift and start seeing your instrumentation not as an afterthought or later or, you know, eventually, or just for monitoring, but as how you can see if your code actually works or not. And it doesn't work until you go and verify that.
That's just like, how do we get people across that river?
Marco: I agree with you. But I think I want to challenge that a little bit because I think that there's a paradigm shift and I want to be really clear about what I'm saying.
So at least trying to represent what I understand in application development, we do instrument code.
But the thing that most engineers I think get comfortable with is just logging.
You just pepper your code with log lines. And there's a thing that shifted, I don't know when, a few years ago, something like that, but you can't get logs anymore.
I'm like, "Where my logs?" Like, I put logs in here. How do I see them?
And the answer you get is like, "Well, I don't know, we used to dump them in, I don't know, like Sumo Logic or whatever."
I guess I'm kind of asking the question because I feel like in some ways, the tools that we used to have that we at least felt comfortable with kind of went away, and they haven't been replaced.
Shelby: This is kind of what excites me about the term observability-driven development, because you can draw a lot of parallels between ODD, I guess is how we abbreviate it now,
and test-driven development, where the idea is to write the test before you write the code, but it ends up being this really tedious process, and people try to write code generators. It's a huge pain.
And what I found exciting about instrumenting my code is it's like, I don't have to come up with every like possible edge case, which is another thing that blows up in cardinality.
I just think like, what's the important thing I care about here? Cool let's instrument that let's add that field.
And so I really want to like help people feel that about instrumenting their code where it's like, it's part of the writing process like--
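As a reader aid, here is one way "instrument that, add that field" might look in Python: a single structured event per unit of work that the code annotates as it goes. This is a minimal sketch under stated assumptions, not any vendor's SDK; `request_event`, the `sink` callable, and the field names (`customer_id`, `flag.new_exporter`, `row_count`) are all hypothetical.

```python
import json
import time
from contextlib import contextmanager

@contextmanager
def request_event(name, sink):
    """Collect fields for one unit of work and emit them as one JSON line."""
    event = {"name": name}
    start = time.monotonic()
    try:
        yield event
    except Exception as exc:
        # Record what went wrong, then let the error propagate unchanged.
        event["error"] = type(exc).__name__
        raise
    finally:
        event["duration_ms"] = round((time.monotonic() - start) * 1000, 2)
        sink(json.dumps(event))

lines = []  # stand-in sink; in practice this might write to stdout or an agent

with request_event("export_report", lines.append) as ev:
    # Add whatever future-you would want to break down by.
    ev["customer_id"] = "cust-123"
    ev["flag.new_exporter"] = True
    ev["row_count"] = 5000

print(lines[0])
```

Adding a field is one line at the point where the value is known, which is what makes it feel like part of writing the code rather than a separate chore.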
Charity: We somehow won the battle to get people to comment their code. And I think of that as, like, JV instrumentation.
Shelby: Yes, and comments, you know, everyone knows comments get stale. Comments become wrong, people don't update them. Sometimes you don't even read them as you're editing the code and verifying it.
They're just whatever somebody crapped out. There's a lot to be said about people, like, littering and polluting the docs.
Charity: But we culturally made that happen. We've made it unacceptable to not write down a few sentences about what we were trying to do. And I feel like I hear what you're saying, Marco, like about, yes, we knew how to log.
But I feel like correct me if I'm wrong, because I'm not an app developer, I feel like the purpose of those logs and statements was somehow subtly different.
Marco: I'm interested in hearing your perspective on that. How different? In what way?
Charity: Well, I've seen a lot of application logs that were like, ones we would archive for, like, billing logs, that, like, capture the fact of who was billed and all this stuff.
And there was a lot of optimization for shortcomings of the logging thing, like the paradigm where, you'd be like well, it's too expensive to log this here.
So I feel like it was less about how the code was running, what was going on or what the state was, and more about, here's the fact, this fact I might want to grep for someday, or this is important to some other system that I need to send across to them, and logs are what I have. I don't know, maybe.
Marco: It's interesting. I do think that there's a different perspective there.
Like we talked about kind of debugging and investigation. My mental model from application development is that's what I plan to do with logs, like logs essentially become your trace. Right?
I plan to go and look at this backlog of things that happened, and put together this picture of how the code executed. That's the mental model that I started out with, that I think a lot of application developers start out with, and that's the model that's kind of been taken away.
And I'm kind of trying to replace that with the observability model.
Like if you just do a little bit more planning, you still have your trace. It's going to go over here, and it's actually going to be much more accessible and visible.
And it's going to be awesome. All right but I feel like we're in that valley between the old thing and the new thing.
Charity: I think you're right. And I think what I'm getting at is, because of the verbosity of logs you'd be like, well, for every request, what's the maximum number of log lines that I can really reasonably output?
Like, maybe five or I'm just going to get rate limited and throttled. I mean did you do any stuff like hashes or sampling or like rate limiting in the code itself?
Marco: No, we didn't. And I think that's another reason why I've been kind of really paying attention to the observability conversation as it's happening in infrastructure and ops, because I was introduced to this idea of sampling and how to still glean insights and things.
Because, again, I may be dating myself, but before you would just put logs in and you just thought logs were free. You just put all the logs in, tail them, grep them, it's fine.
But like I said, like the amount of data has increased just by orders and orders of magnitude, and it's no longer free.
So I know that we're on kind of a paradigm shift. But like I said, I think the tooling hasn't caught up, at least from my perspective, and we're still in this valley, where it's like, it's not free anymore. But I still need to do the thing that I needed to do. I still need my traces. I still need to know what's happening.
Charity: Super fascinating. Oh my god I could talk with you all day about this stuff.
Is there anything else that you want to say Marco, it's delightful having you. Do you want to make any grand statements and predictions about where application observability will be this time next year?
Marco: I don't know if I have any grand statements, but I do have a vision.
I think it's possible for us to have really strong observability and to have that actually be like, I'm hoping it kind of takes over as the primary paradigm for engineers.
Like when I give talks and things internally at Mode, about observability, I try to come up with a definition of observability.
And the one that I kind of settled on that I like, is that "an observable system is one where you can diagnose problems by asking questions of the system in near real time."
And as I kind of developed that, I recognized that that was a paradigm shift.
People don't expect to be able to ask their system questions. They expect to just watch logs scroll by, or look at a dashboard and feel silly because they don't understand what the dashboard is trying to tell them.
And I do think that we want a paradigm shift where, like you said, it puts the engineer back into the driver's seat like I know what I'm trying to do. My tool should be serving me.
Charity: Amen, brother.
Shelby: I'm looking forward to that for all of us.
Charity: Thank you so much for being here.
Marco: Thank you for having me.