Ep. #23, Beyond Ops with Erwin van der Koogh of Linc
Charity Majors: Erwin, I met you when I visited Australia for the first time.
You took me around and introduced me to, I think, every engineer in Australia, I think all of you know each other and you went to the same school, you go drinking together.
It was amazing. I was exhausted, but you were amazing.
I had so much fun and I was so impressed with Australians.
I spent a lot of time in Europe and compared to the European visits that I've made, Australia just seems very open to things that are new.
Erwin van der Koogh: There is this really interesting thing, where I think if you go back to the early 2000s there are a lot of pretty big, mature scale-ups around.
You've got the real estate sites, you've got the second hand car sales sites, and you've got the job board.
Charity: Yeah. In Australia, some big-ish companies.
Erwin: They're really big now, the extent of it internationally.
Of course, there's Atlassian, this sort of big, massive story.
Charity: Yeah, but big does not necessarily usually correlate with appetite for risk and novelty.
I think that was what was interesting to me.
Erwin: They're big, but they're not massive. I think that's the really interesting phenomenon, but also there's a lot of--
Charity: I thought you were going to be like, "It comes from our roots as cowboys in the Wild West."
Erwin: I don't know, I'm Dutch. I got here ten years ago.
Charity: This seems like a great time for you to introduce yourself.
Erwin: I'm originally Dutch, currently living in Australia, running a startup around bringing front end applications to production.
Charity: That's the worst thing in the world to do. As your friend, I counsel you to never start a company.
Erwin: It's funny, I do this a lot with aspiring founders. They go, "You're a founder? I want to start a startup." And I go, "Don't be dumb."
Charity: Just don't. The cult of the founder is something that makes me just grind my teeth. I hated it before, and I hate it even more now.
Erwin: Like, "You're dumb. Go join someone else's startup that's somewhat successful." You've done well, Shelby.
Shelby Spees: I remember when I was interviewing for a job, I was at my previous job and I was explaining how I'm not an ideas person.
I don't have some big idea that I want to go out and impose on the world.
Like, show me what you're working on and I will find 50 ways to make it better.
I am a nitpicker, I'm a critic, I'm a Virgo. I will sit down and just pick on every little thing and make it better.
Charity: "I will take your shitty idea and I will make it right."
Erwin: You've done exactly the right thing. Join someone else's startup that's been somewhat successful and make it better.
Charity: Let them suffer, and make it better.
Erwin: So much better. Being a founder is just dumb.
There's an incredible amount of stress, an incredible amount of work, and the expected payoff is zero.
Charity: You can never put it down.
Shelby: It's like having a kid or something, like you don't stop-- The job doesn't go away.
Erwin: I have a kid, and the startup is worse. The startup is way worse. The kids occasionally hug you before they go to bed.
Charity: The kids occasionally hug you, that's a very good point.
What brought you to observability?
Because you were very early to zero in it and go "Yes, I want this."
Erwin: It was funny, because you did a serverless talk at some conference, and that's how I found out about you.
Then I was like, "What's this Honeycomb she's working on?"
And then as soon as I saw the homepage of Honeycomb, I'm going, "Surely this can't be true."
Charity: I feel like you were one of the fi--, back when I was in charge of marketing and I was like "High cardinality is going to be the term. It's going to bring all the-- My milkshake is high cardinality and it's going to bring all the engineers to the yard, yo."
It took me six months of Christine just going "Nobody knows what this is."
And I was like, "I will tell them what it is."
And there were like five people in the world who were like "Yes. We are all in."
And it was you and Intercom, and it was just so funny because you all were like "Yes. Say no more. Clearly, this is the next big thing."
Erwin: Yes. It was exactly that.
Charity: But I think it was the high cardinality stuff that you were able to map that to what you are doing and go, "Obviously."
Because it's so obvious once you've seen it, you're like "Obviously these are the questions I would want to ask. Obviously."
Shelby: What are those questions? What were you working on that just made it click so fast?
Erwin: The fun thing is, and I think it's interesting for later as well, I was running a startup at the time.
I had no traffic, I had no nothing. It wasn't actually that it was a solution for a problem I had at the time.
Charity: Production wasn't even an issue.
Erwin: Yeah, production was down. I'm like, "Whatever." That was the stage we were at the time.
But having worked on these big massive systems where you get a bug report and you just go, "I don't know."
This long series of very specific things that you just wanted to chain together like a little pearl necklace, and the composability.
Because it's always obvious after that, you're like-- And this is where people are just like, "Cool. I'll just generate a dashboard for this question and show this forever."
Erwin: But even then it was the high cardinality, but for me the thing that drew me in was the fact that you just log everything.
Log things once, because the thing is that the problem with the dashboard is that it's great, but it will never tell you the past. You can never go back.
Charity: You're always fighting the last battle and you're always looking at the dashboard for the last outages.
Erwin: Or specific log things, and so what you have to do, and that's how a lot of the gnarly debugging was always "Figure out where things could go wrong, then put in the logs, then run it in production for a while."
Charity: "Predict where your shit is going to break and then go instrument it really well so that you can see it."
Because you have to be very skimpy with it, with the traditional logging and metrics stuff.
You're constantly balancing "How much can I pay for?"
"What's my write amplification going to look like? How much am I already logging?"
"Have I already logged this in some other place?"
Because if you log it in one place, you'll get to see it in that place, it doesn't persist in the entire context.
Erwin: It's all of that.
Because I spent a lot of time in banks, so money wasn't even the issue.
It wasn't even about paying for it, but it was that you're now looking at logs and you'd have a couple of million users at the same time.
Now you've got to figure out "This log line here, is that the same request?"
Charity: Being able to break down by-- If you have two million users, being able to break down by that user is what I feel like--
You can describe high cardinality all day long, and until someone has experienced the power of "I'm literally just generating any possible one of my dashboards for this user on the fly whenever I want?"
Erwin: "In the past."
Charity: Yeah, in the past.
Shelby: Over time you can just look at the entire window of everything you care about.
Charity: That's so key and it's so core, and it's so hard to explain to people just how revolutionary it is to everything about understanding your systems.
Shelby: It's so funny, because I remember even when I started following you and learning from you, high cardinality meant nothing to me.
But hearing about stories or hearing examples and being able to connect that to what my team was struggling with at the time, where it's like "OK. We have issues with this one customer, with this one placement, with this one platform."
Our engineers are just going in and adding StatsD metrics for that, and they didn't even have feature flags so they would have to deploy for every time they wanted to ask questions about that, and then the error wouldn't ever come back.
The issue would never surface because it was some emergent failure mode, so it was just like, I remember--
Erwin: The good old Heisenbug. Heisenbugs are good.
Shelby: Yeah. It wasn't until I made that connection, it was like "Wait. Of course you should be able to query for this one user, this one customer or whatever."
And then just seeing what those events looked like, where I was just like "OK. This is just object oriented logging. It's just not even that crazy."
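One way to picture the "object oriented logging" idea is a single wide, structured event per request instead of many scattered log lines. The following is only a minimal sketch: the function name, field names, and values are hypothetical and are not taken from Honeycomb or from anything the speakers describe.

```python
import json
import time
import uuid


def handle_request(user_id, endpoint, build_id):
    """Handle one request and return a single wide event describing it."""
    start = time.monotonic()
    # Accumulate all context on one dict instead of scattering log lines.
    event = {
        "request_id": str(uuid.uuid4()),
        "user_id": user_id,      # high cardinality: one value per user
        "endpoint": endpoint,
        "build_id": build_id,    # high cardinality: one value per deploy
    }
    try:
        # Placeholder for the real work (hypothetical).
        event["status"] = 200
    except Exception as exc:
        event["status"] = 500
        event["error"] = repr(exc)
    event["duration_ms"] = (time.monotonic() - start) * 1000
    return event


event = handle_request("user-1842", "/export", "build-97")
print(json.dumps(event))
```

Because every field travels with the request, a question like "show me everything for this one user" becomes a filter on `user_id` rather than a new metric and a new deploy.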
Charity: I like that, "Object oriented logging."
It's so simple but it's hard to explain to people how revolutionary it is, because it seems so obvious.
This is where I feel like what we're seeing is that the effects of tooling and tools just took a very specific path.
StatsD was built on top of the metric that was MRTG and SNMPD, a simple network-- A "Simple network," I want to throw things every time I hear that.
I've still got this knee-jerk, but it was built on those data stores so it inherited all of the qualities of those very primitive systems that were limited in these really intense ways, down to the very nature of the data structure that was underpinning it.
Because it was built on the trunk of that tree, it's like we just never saw outside of our own courtyard.
Makes you wonder how many more things like that are out there in tech right now.
Shelby: Where are our blindspots? What don't we realize we're doing?
Charity: What could we be doing radically better if we hadn't been just so tempted by our assumptions from the past?
What does your startup do, Erwin? Why don't you tell us about it?
Erwin: We take front end development, like React or Angular or Vue, things that run in a browser but talk to APIs, and we're reinventing CI and CD.
It's really interesting because we have the same thing where every commit should be deployed and available forever, but once you start to do that you break a couple of these constraints that we had previously, where you had very linear CI/CD pipelines.
First you'd build and after the build you run your unit tests, and after the unit test you run your acceptance tests and then you deploy to there and then you run the other tests.
Charity: The entire area of release engineering has been such an underappreciated, under-invested-in area for as long as I've been in tech.
Which means that there are so many huge, amazing leaps you could do with very little effort in most places.
It could be very rewarding. One of our challenges is that we go, "No.
I know setting up pipelines is a bitch, but ours is literally 15 minutes. Five if it's somewhat standard."
There's all these mental models that you have around these things that are so hard to break. They really, really are.
Erwin: The fun thing was that even we have to break through our own mental models, and we write the damn thing.
Charity: This goes back to something we've been talking and thinking about recently a lot, which is the fact that these are very complex sociotechnical systems that you cannot really design or plan so much as they emerge and evolve, which is both the frustrating part and the strange joy of any--
Not even management role, but any crossing the event horizon to being a senior engineer just basically means you are now equipped to reason about the system instead of your own tiny little component.
I think it's sobering to realize just how much of your ability to ship code quickly and sanely is not actually under your control at all, it doesn't come from the algorithms or data structures in your head, it doesn't come from your experience.
It comes from the system surrounding you, and you basically rise or fall to the level of the system that you join within a couple of months.
I've seen amazingly effective engineers at one company go and join a company that did not ship very quickly, and they stroll in with the arrogance of a thousand suns but it doesn't last very long because--
Shelby: They're blocked.
Charity: Everybody is blocked by the same goddamn logs in the river, and conversely this isn't just a depressing tale either.
Conversely, I think a lot of people have an unfairly low opinion of themselves as engineers just because high performing teams are so incredibly rare that most people have never had the opportunity to work on one. So they don't know that they're just as good as all those other engineers over there who they look up to, they just haven't gotten to work on the same teams.
This is why I literally have stickers that I've made that say, "Quit your job" with a bunch of rainbows.
Because I feel like people should have such higher expectations for themselves, and they should crave this experience.
Especially fresh out of college, what should you be looking for? I don't give a shit about the product, or the industry.
How good is the team, how good is their system that you're going to get plopped into?
Because that is where you're going to learn all of your expectations, all of your initial set of understandings about what it means to work and build in one of these systems.
You should want to get on the best system that you can.
Shelby: Totally, and especially those first couple of years when you've never lived in production before, you don't even know what production is.
I remember the first time I learned what CI was.
Charity: The more you can learn good stuff, the less bad habits you have to unlearn before you can learn better ones.
Shelby: At my first job, I remember having to fight that we should have a test suite.
I was like, "We should have a test suite. We should have automated tests." They're like, "No. We're on a deadline."
Of course, we pass the deadline and they're still working on it years later.
Erwin: I've heard this story so many times.
Charity: If there's one thing that Jez and Nicole's book Accelerate has taught us all in these arguments, it's that the way to be reliable is to move fast.
Pick up your speed and your reliability will improve. It's like riding a bicycle.
If you slow it-- It goes so contrary to our instincts, when we're feeling out of control we want to slow down and seize up. Like just, "Nobody move."
But we have to train that out of ourselves because it's so counterproductive.
Erwin: I had a colleague who did an internal presentation 15 years ago maybe, it was at least 10 years ago, and she did this incredible presentation internally.
This was a five minute presentation, and she starts out by talking about race cars.
She talks about race cars, like innovations in racing for five minutes straight.
We're all going, "This is really--?" There's all of us in the audience, like 60-70 people in the audience.
We're also going, "This is really interesting. But, why?"
Shelby: Is she even at the right event?
Erwin: She was a colleague, we knew that, and she was very good at what she did. But why?
Then the last thing, in the last seven seconds she goes, "Just in case you haven't noticed, all of these innovations were safety innovations, and it's that safety that allowed us to go faster."
Now, I've been to a couple of hundred talks since, and I still vividly remember that one because that's when it hit home for me.
It was like, "If you can make things safe, safety allows you to go faster because you know that the brakes work."
Shelby: I'm jumping up and down over here.
I love that sentence, because that's exactly how it works.
We put guardrails on ourselves, we shrink our build time so that we get smaller feedback loops and just all of this stuff.
That's such a beautiful thing because it rings true for every example I can think of for developer tooling and process improvements and all of that stuff.
I'm thrilled right now. I'm so stoked.
Erwin: That's what's counterintuitive, that you have to invest in this safety and in these tests and the builds and all of that.
For me, Honeycomb is in that same space. Observability is in that same space.
Charity: You can see where you're putting your feet. It's like going out to hike up a mountain in the dead of night and not bringing a flashlight.
No, you're going to go faster if you have a little headlamp on so you can see exactly where you're about to put your feet.
Erwin: For me, we put something in production.
I'm pretty darn sure that it's good, but it messes with this whole workflow that we may or may not even be talking enough about.
This post-production phase, after I put something live, I just go click around in Honeycomb for a while, just to see whether the change that I have made is actually working as I intended it.
Charity: This is something where, as I'm sure you're aware, there are a lot of companies out there who are selling their "Observability tools" that do not provide the kind of observability that will let you do this.
This is one of the things, I have many posts out there on the internet talking about our definition of observability, which is about "Unknown unknowns."
But specifically, you cannot do that if you don't have the ability-- If you don't support arbitrarily wide structured data blocks, and if you don't allow people to break down by high cardinality.
Build ID and high cardinality dimensions.
Infinitely incrementing a high cardinality dimension, you need to be able to string together as many of these so that it has to be--
And I'm not just making this shit up because Honeycomb has these and so therefore everyone must have them, but it's like, "No. You can't actually do the things we're talking about if all you have is the time series aggregates in the metrics tools. You can't do those things."
You have to be able to basically point at any point at the spike and go "What's different about it? What is different about these requests than all those other requests?"
It's not going to be one thing, it's usually going to be three or four or five or six or seven things.
What chance do you have of guessing those? It's pretty low.
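The question "what is different about these requests than all those other requests?" can be sketched as a comparison of attribute values between the selected events and a baseline. This is an illustrative toy, not how Honeycomb implements it; the function name, field names, and sample events are all assumptions, and it presumes every field value is hashable.

```python
from collections import Counter


def attribute_differences(selected, baseline):
    """Rank (field, value) pairs by how much more often they appear
    in the selected events than in the baseline events."""
    def shares(events):
        # Fraction of events carrying each (field, value) pair.
        counts = Counter((k, v) for e in events for k, v in e.items())
        return {kv: n / len(events) for kv, n in counts.items()}

    sel, base = shares(selected), shares(baseline)
    diffs = {kv: share - base.get(kv, 0.0) for kv, share in sel.items()}
    return sorted(diffs.items(), key=lambda item: item[1], reverse=True)


# Hypothetical events: the slow requests all share a customer and a type.
baseline = [
    {"customer": "a", "content_type": "image/png"},
    {"customer": "b", "content_type": "text/html"},
    {"customer": "c", "content_type": "image/gif"},
    {"customer": "a", "content_type": "text/html"},
]
slow = [
    {"customer": "c", "content_type": "image/gif"},
    {"customer": "c", "content_type": "image/gif"},
]

ranked = attribute_differences(slow, baseline)
for (field, value), diff in ranked:
    print(f"{field}={value}: {diff:+.2f}")
```

The point of the sketch is that several dimensions stand out at once, which only works when the raw events keep every field rather than pre-aggregated time series.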
Erwin: It was hilarious. I vividly remember doing a demo for a friend of mine over in Perth, so I was just clicking around in my own Linc dataset to show them.
Because we host other people's-- I didn't learn from your mistake, Charity, from Parse.
Charity: Platform problems.
Terrible mistake. To our listeners, never do it.
Erwin: It's a little bit too late for that now, but it means that things like performance don't necessarily make any sense because I can look at an aggregate because--
Charity: It could be something that they did, something that you did, or some intersection of the two, or something that any one of the other customers sharing any pool of services or any database. It's just meaningless.
Erwin: So, I was looking for this demo for them, and there's this big massive spike in latency.
I go to the heat map and then just did a BubbleUp for that spike. It's the images for this particular customer.
It was only the GIFs of this one particular customer, and you just go "Why? Why is that the case?"
You've got this thread, and now you can go down and go "Why? Why only the GIFs?"
Charity: Or it's like, "It's just this customer and it's just this customer's exports, and it's just because they're all so much bigger than all the others."
At a glance you can tell that this outlier is special in all of these ways, and the years of my life's worth of hours of debugging that this one thing alone would have saved me is just-- It almost hurts to think about.
Erwin: I had this really interesting bug a couple of weeks ago where one of our customers pinged me and he goes, "It's really weird."
Everything was fine for him, but when he sent one of these preview links for one of those commits to someone else in the organization, it didn't work for them.
The guy happened to be the CEO of that customer, so it didn't work for the CEO.
It worked for the dev, but not for the CEO.
Shelby: So you say "It works on my machine." And you go home and--?
Erwin: "It works on my machine, so this is what you do." But then I checked out the link and it didn't work for me either.
Then I gave it to my co-founder and it worked for him. Now you sort of go, "What?"
Shelby: You start tearing your hair out.
Erwin: This is one of the things I vividly remember, you were talking about a lot when I was dragging you around Australia.
Error percentages just don't mean anything.
Charity: Percentages cover over so many sins, they just erase all details.
Erwin: There's a big difference between one in a thousand requests going wrong for everyone, which is fine, whatever.
Just refresh. But it's very different if it's 100% of one customer.
Charity: Or everyone whose last name starts with this, or everyone who is in this region running this thing at this time.
The possibilities are infinite and they are getting harder and harder to track down.
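The gap between an aggregate error percentage and a per-customer breakdown can be shown with a few lines of arithmetic. The sketch below uses made-up events (the customer names and counts are hypothetical) in which 1% of all requests fail, but every one of those failures belongs to a single customer.

```python
from collections import defaultdict


def error_rates(events):
    """Return the overall error rate and the error rate per customer."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for e in events:
        totals[e["customer"]] += 1
        if e["status"] >= 500:
            errors[e["customer"]] += 1
    overall = sum(errors.values()) / len(events)
    per_customer = {c: errors[c] / totals[c] for c in totals}
    return overall, per_customer


# 990 successful requests spread over 99 customers, 10 failures for one.
events = [{"customer": f"cust-{i % 99}", "status": 200} for i in range(990)]
events += [{"customer": "cust-ceo", "status": 500} for _ in range(10)]

overall, per_customer = error_rates(events)
print(f"overall error rate: {overall:.1%}")
print(f"cust-ceo error rate: {per_customer['cust-ceo']:.1%}")
```

The aggregate looks like "one in a hundred, just refresh", while grouping by the high cardinality field shows one customer for whom nothing works at all.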
Erwin: When I finally figured out what was going on, it turns out a cookie wasn't set that should have been set.
If you had the cookie on your machine already, then it worked because the cookie was there.
But then the next question became, "Why is the cookie not set?"
Charity: I compare it to following a trail of breadcrumbs.
Debugging isn't hard when you are just putting one foot in front of the other and you can't see where you're going.
You have no idea where you're going to, but you always see where to put your foot based on the answer you just got.
This is so qualitatively and in every way different from the way we used to "Debug," which was throwing out random guesses that we tried to match to the dashboards on our wall and going, "Would this scenario support this set of errors?"
Shelby: I call it "Armchair engineering."
Charity: I love it.
Then you go look for evidence that your guess was correct, which is another problem, because if you're just looking for evidence that your guess is correct it's exhausting.
It is exhausting, it is stressful and it is hard to debug that way.
The person who's been there the longest is always going to be the best at it, and it gets depressing and demoralizing.
But it's not scientific, it's not debugging. You're not actually following the data.
Erwin: But here's the thing I realized after that thing, because it goes back to my thing earlier, having that historical data.
Once I'd figured out what the problem was I could figure out that, yes, these other two customers were affected as well by the same bug.
But more importantly, I had full confidence after fixing that bug that I had fixed all the instances of this bug because I could go and--
Like, "It wasn't just this one thing here, this fix that I have now shipped would fix all of the causes of this bug."
Charity: This confidence that you're describing, this confidence is not the same as cockiness. It's confidence. It comes from--
This is what is so hard to describe to people who have never worked in a system that was well understood, or that was even capable of being understood by people.
And these are the people who are always on Twitter freaking the fuck out over Friday deploys.
They're just like, "What the fuck are you thinking? Doing that on a Friday?"
It's like they're accusing me of shipping bombs. That is the level of trust and confidence that they have, not just in their changes but in their ability to even know or understand the ramifications or the likely danger of what they're doing or whether they fix something or not when they think they have.
They're just waiting for monitoring checks to go green, and then it's like "Phew."
They have no confidence that anything is fixed, they're just like "I've gotten it to be quiet," so they can go to bed.
Erwin: Because it goes back to that piece of workflow that we may not be talking about enough, is that post-deploy step of verification of--
Charity: Do you know why they don't look at it?
Because almost nobody has actually automated everything that happens between when you merge your branch and when it's 100% live and you can look at it.
Almost nobody has automated all of that, which means that it is God knows how long.
It could be half an hour, it could be three days. You can't make muscle memory out of that and you can't hold people accountable for looking at that.
As soon as you insert a human gate, and they probably are even batching up multiple merges for deploy, which let us not even speak of such horrors.
Shelby: I was lucky to be on the DevOps team and having that access where we would batch up our Chef deploys or put "Wait until Monday," or "Wait until next week."
And then when we finally deployed to prod, it was my job to go in and SSH into the server and tail the Chef logs and watch the processes and all of that stuff.
I'm glad for the experience because now I can talk about that stuff, but at the same time you don't--
Erwin: Tell your kids.
Charity: Those are valuable hours of your life.
Shelby: Yeah. Or oh my gosh, Terraform changes.
When we manually scale up and scale back down in order to rotate our AMIs and things like that, and it's just not a good use of our time.
And the same thing for deploying code, if you don't know how to validate that your code is behaving as expected in production it's a feedback loop you don't get, and you can't learn, and it actually stagnates people.
We have an entire generation of engineers who could be so productive that tooling and their sociotechnical systems are holding them back. It's a thing.
Charity: I made the slide for exactly that, that was showing a picture of two people.
Two people, they left college at the same time, same courses and everything.
One of them joined a team where anytime someone merges they automatically get shipped, which means they ship 12-15 times a day on average without really thinking about it.
The other joins a team with equally good engineers, just as much money and everything, that ships twice a month.
Fast forward two years, which of those engineers is going to be super-powered? It's not even funny.
Shelby: It's the thing that pisses me off about the LeetCode and the whiteboard interviews, and all of that--
Erwin: Don't get me started on those.
Shelby: Right? Because they're such bad signals for what makes a good engineer and what makes a good team.
It's like we have this false advertising around "What kind of team do you want to join?" Or "What kind of engineer do you want to hire?"
And the other thing that I love about the term sociotechnical and stuff is that it's a new instance of the system.
As soon as you add or remove a person it's no longer the same team that it was.
Erwin: I talk about that a lot.
Shelby: Before personnel changes or tooling changes, so not only is it constantly evolving--
Erwin: It's essential.
Shelby: But it's not a single entity.
Charity: It's never the same team.
Shelby: It's such a shame because I've been really fortunate to learn from really amazing people, and I've been already learning so much from just my short time at Honeycomb so far.
I'm thinking about, "Gosh. If I could have done this stuff my first couple of years, how much better would we be at solving the problems that we're setting out to solve as an industry if we weren't held back by our tools and our process and our practices?"
Erwin: I'm also really optimistic about the future in this regard from the technical side of things, not so much on some of the other dimensions of the future.
If you look at how we're now very much in this contraction, the undoing of the fragmentation.
A large part of that was driven by things like Docker and containers.
Where before everything had to be bespoke, because we had a Java application and it needed this other thing.
There's this massive reduction in fragmentation going on right now, that we can start to build up this tooling around, hosting around, like getting stuff into production.
Then there's these-- A couple of the companies that are doing some really cool stuff around this space as well, like Honeycomb when it comes to observability or LaunchDarkly when you were talking about feature flagging.
The stuff that we're trying to do is the front end of things, but there's actually a lot of--
I'm sure there's many more of these companies, because these contractions are happening of that fragmentation, I think it will get so much better over the next couple of years, where it's becoming easier and easier to build these systems that do this from the outset.
Which actually brings me to the point. One of the points of observability that I make a lot is "Start with it."
People have this feeling that "We don't need observability now because we're a startup and we're not a real company."
Charity: Yeah, "Until we get to be this big or this complicated." But it's never not easier.
Erwin: One of the things that is really powerful, because we've done it almost from the beginning, it's certainly not in everything that we do but we are now in a position where quite literally every alert goes into a Slack channel with the three of us. Look, and we've got notifications on.
Charity: The thing is, it's really hard to dig yourself out of a pit when your team has gotten into a pit.
When you're losing ground, it is really hard and painful and you usually sacrifice all forward momentum as a business while you sort your shit out.
If you can avoid that, if you could just avoid sinking into that hole, oh my God are you going to bless your past self.
Shelby: I loved Jessica Kerr's story about just starting to send from dev.
Erwin: I have such a crush on Jessica. She is so good.
Charity: Don't we all? She's amazing.
Shelby: She's so cool. That's been my thing. It's just like, if you can send from dev you can send from prod.
Getting people comfortable with just instrumenting code and thinking about how that code is going to behave in production before they even start writing it, and that's--
Charity's awesome post on Observability Driven Development, that's the definition.
If you're starting a new greenfield project or something where you don't know when you're going to ship, you don't know when you're going to have general availability yet, you can still gain so much about starting with observability and building an observable system from the ground up.
I've learned, "OK. That's how Rails thinks about that," or whatever, just really basic stuff that it's so much quicker.
Charity: Just being able to see what you're doing, you just learn little random things along the way about what's happening under the hood.
It's what helps to build up that rich, vibrant, intricate, real, mental representation that you have where most of us believe the craziest shit about our systems.
Taking it out of our heads and putting it in the source of truth that we all have access to is-- Like, each of us are working on these distributed systems.
I'm responsible for a very small little corner of the world, and we know our corner intimately while we're working on it.
The way that you interact with that corner while you're working it, that is how an expert interacts with it.
An expert who knows what's meaningful, knows what's important, knows what tends to break, what trends to look for.
This is why we've always put history into Honeycomb, because you're working on your corner but you're responsible for the whole thing.
You're on call for the whole thing.
You have to be able to hop around and look at other people's corners of the world and just have access to their history, seeing the grooves that they wore in the system while they were interacting with it is incalculably priceless.
Shelby: I would love to just keep talking all day about this. We could go on forever.
Charity: We could keep ranting all night, couldn't we?
Shelby: But this has been so wonderful. Thank you, Erwin.
Erwin: No worries. It's always good fun chatting with the two of you.