In episode 13 of O11ycast, Charity Majors and Liz Fong-Jones talk with Natalie Bennett, Software Engineering Manager at Pivotal. They discuss the difference between collaborative projects and teams, continuous verification, and diagnosing failed deployments.
About the Guests
Liz Fong-Jones: So, you've been at one company for a while. How many teams have you been on?
Natalie Bennett: I've been at Pivotal for about four years now, and I've been on nine teams.
Charity Majors: Nine teams?
Nat: Nine teams. Yeah, I probably hold the record for most teams--
Nat: In four years. But it's not unusual, I think most people who've been here four years have been on three or four teams.
Charity: So you have a culture of team switching internally.
Nat: Yeah, we try to de-silo.
Charity: That's so great.
Nat: We try to have high bus numbers.
Liz: It's not that foreign to me, because I served on something like 10 teams in 11 years at Google.
It really depends upon how your team is set up and how your organization is set up.
Charity: What do you think you two have in common?
Liz: I think that when you seek out growth opportunities, when you start saying "I'm feeling comfortable," make yourself not comfortable again.
Nat: Yeah, I hate that. I get bored and then I start looking for ways to entertain myself.
Charity: That sounds like trouble.
Nat: Nobody likes that. Nobody.
Charity: Nat, this sounds like a great time for you to introduce yourself.
Nat: I'm Nat Bennett and I'm a software engineer at Pivotal.
Liz: What do you work on at Pivotal?
Nat: I work on Cloud Foundry. Those nine teams have all been different Cloud Foundry component teams or release engineering teams.
My number is a little bit inflated because the first team that I was on was a one-pair team testing the high availability features of Cloud Foundry during the 1.4 to 1.5 upgrade.
We just spent like six weeks messing around with Gatling and looking at which parts of it go down when you're upgrading.
Liz: What's the difference to you between a project and a team?
Nat: In the Pivotal context, a team is a group of engineers who are all pairing with each other and are rotating on a daily or weekly basis.
We've tried to do project teams, but they always turn into long term things.
Charity: The career trajectory of a build engineer is not really well understood. What does one do?
Nat: The team right now is split into two pieces. We take input from all the other teams, so they're pushing--
Nat: Artifacts, yeah, into our system. Then running it through a pretty massive series of build pipelines that are running a bunch of different tests.
Fresh deploy, upgrade deploy, deploys across a bunch of different--
Charity: You own the code all the way from the time that it's committed until it's deployed to users?
Nat: What do you mean by "Own the code?"
Charity: Are you the one who says whether it goes or gets bounced back?
Nat: Kind of. Usually if it gets bounced back what we'll do is open up a cross team pair and try to figure out what's going on and what's wrong.
Charity: How many developers would you say that your team of release engineers supports?
Nat: 100-200? I know that the Cloud Foundry Development Organization is about 400 engineers.
Liz: How many of you are there?
Nat: There are 15 total, if you count both the open source and the closed source release engineering.
Liz: You have a different problem than a lot of people have, in that you're not just releasing something that you run in your own production environment.
You have to build things that other people can just take and run, right?
Nat: Yeah. It takes code sometimes up to a year to hit production, real production. Like, customers running into production.
Because we have a quarterly release cycle, so that's going to take often at least three months between somebody writing code and it getting shipped.
Then our customers are mostly large enterprises and big banks, often.
Charity: Conservative types.
Nat: Yeah, there's a lot of risk.
They're handling millions of people's money.
Charity: I would like them to be conservative with my money.
Liz: Yeah. So how do you build confidence, then? We talk about the idea of shipping as lifeblood, of being able to test things early.
How do you test things at Pivotal?
Charity: And how do you maintain that? That sense of the lifeblood of your-- Like, urgency?
Shipping things if you only ever release once a quarter, do you have internal releases or things that create a more regular cadence for people to ship regularly?
Nat: Yeah. One thing is that we have a couple of sources of internal feedback.
If you've used Pivotal Tracker, you're using an application that's running on a--
Charity: You dog food.
Nat: PCF, yeah. That's the EU operations team. Hi, James.
He runs that and gives a lot of feedback, like they do things and provide a lot of feedback about things like how long upgrades take, for instance. Which is something that is hard to get.
We also are just now firing up and starting to get feedback from this thing called telemetry and insights, where we're getting stats back from customer sites.
Charity: From out in the wild.
Nat: Yeah. So now we're able to see things like, "What versions are people actually running in dev versus production?"
Charity: Oh my God, you haven't had that before?
Liz: Like sources of truth from production, it's amazing when you get them.
Nat: We found out some of the settings that people always change and some of the settings that people never change.
Liz: I think that it's really fascinating to think about. So how do we observe our production systems?
For you, to an extent your production systems are the end product, but also you are a build engineer.
You have production systems too, so what are your production systems like then?
Nat: Concourse. Yeah.
Charity: I keep hearing about this.
Liz: Funnily enough, I was pairing with a bunch of people over at Meetup a while ago and they use Concourse because they're staffed by a whole bunch of ex-Pivotal people.
Nat: Yeah. We go places and do things to absorb people into the hive mind. So, Concourse is a continuous integration system.
It's really, I just picked this up yesterday, "Continuous Verification."
It's a continuous thing do-er, and it lets us build these gigantic pipelines and control what versions of things are moving through it and what tests versions of things have to have passed before they move into the next stage.
It lets us orchestrate these deploys across a bunch of different scenarios, run tests, and tear them down.
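To make the gating idea concrete, here is a minimal Python sketch of "a version has to have passed these tests before it moves to the next stage." It is an illustration only, not Concourse's configuration format or API. The names `Artifact`, `eligible`, `run_stage`, and the stage names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Artifact:
    """One version of a component moving through the pipeline (hypothetical model)."""
    name: str
    version: str
    passed: set = field(default_factory=set)  # names of stages this version has passed


def eligible(artifact, required):
    """A version may enter a stage only if it has passed every required upstream stage."""
    return set(required) <= artifact.passed


def run_stage(stage, required, artifacts, check):
    """Run `check` on every eligible artifact and record the stage on those that pass."""
    promoted = []
    for artifact in artifacts:
        if not eligible(artifact, required):
            continue  # this version has not earned entry to the stage yet
        if check(artifact):
            artifact.passed.add(stage)
            promoted.append(artifact)
    return promoted
```

With this shape, an "upgrade-deploy" stage that requires "fresh-deploy" never sees a version whose fresh deploy failed, which is the property the pipeline is meant to enforce.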
Liz: How long is a typical Concourse pipeline for you?
You're probably one of the more sophisticated users, but yeah.
Nat: For release engineering if nothing goes wrong, it takes about two days to run an artifact through our pipelines.
The amazing thing about build engineering that I've always been fascinated by is the way that you touch everything and everyone. Maybe this feeds your itch to jump teams a little bit, I don't know. But the flip side of that is your mistakes affect everyone.
How do you keep that from being paralyzing?
Charity: What's that?
Nat: "Semantic versioning." That's a little bit of a flip answer, but for instance--
Charity: Has this been an evolution across your career too? I would love it if you could speak to the ways that senior Nat is like, "Junior Nat, here's where you're going."
Nat: Senior Nat is a lot more chill than Junior Nat. That's a lot.
Liz: What do you mean by "More chill?" Is it in terms of perfection is not an aim, or is it in terms of being battle hardened?
Charity: "Everything breaks, everything will get fixed."
Nat: Junior Nat was like, "We're doing these things wrong and I don't understand why, and why are you doing it this way? What's wrong with you?"
Charity: That's adorable.
Nat: Somewhat Senior Nat is more like, "Interesting. Tell me more about what this does for you. Have you considered trying this instead?"
Liz: I remember making that mistake when I switched onto the team that was the former ITA software team in Boston, and I was like "You are doing SRE wrong.
Here are the five things you're doing wrong." I literally said that the first two weeks I was on that team and they never listened to me again.
Charity: How to win friends and influence people.
Nat: You've got to spend the first two weeks on a new team just asking people, "Why do you do it this way? What does this do for you?"
I've seen in my--"I like doing it this way. I have had this experience," or "This has worked out badly for me in the past."
But like, you're doing it. So why? What is that doing for you?
Liz: Then figuring out how do we pair with people, whether you're an SRE like me or whether you're a build engineer like you, Nat.
We have to figure out how we partner with people because we can't do everything by ourselves.
Charity: SREs and build engineers are tightly, very coupled.
It's before prod and after prod, but they're the same skill sets and often the same processes and tools, and mindsets.
Nat: At one point I rolled from another one of our dog fooding teams, one of our production teams onto the open source release integration team, and spent a lot of time going "This is what operators think of this," or "This is what operators think of that."
Charity: Which is a really valuable thing to learn.
Liz: It's so valuable to rotate around. Besides the "I get bored" itch, it's also the "I want to better understand someone else's situation" itch.
Charity: The thing that I've noticed in you, Liz, is that you have this incredible ability to jump into any situation and just size it up immediately.
This is clearly a skill set in and of itself, I take a while to acclimate to new environments because I have not made a practice of that.
It's been interesting to watch you just sail in to customers, "All right. Here's the top three things--" But very nicely. It's a skill.
Liz: Learning to do it nicely is important. What was that expression you used the other week involving a velociraptor?
Charity: Yes, "Liz is a velociraptor. She just enters the room, sizes up the flows of information and how to position herself to be in the route as much as possible without annoying anyone."
Something like that.
Nat: Ramping up is definitely a skill. Can we swear on this?
Charity: Yes. Abso-fucking-lutely.
Liz: It's Charity fucking Majors, of course you can swear.
Nat: On the last couple of rotations I kept a "What the fuck" notebook, which is a big part of how I stay chill.
I just have a running log of everything that makes me go "What the fuck? OK. It's going into the notebook."
And then later I'll either figure out why, or I'll fix it. I have this list of stuff that I'm going to fix.
Liz: What are the most common WTFs you encounter, and what are the most common things you tend to do when you first join a team?
We're talking about that onboarding, how do you make that smooth for yourself, Nat?
Nat: Often it's stuff that's like flaky builds, or workstation set up that's not quite right, or--
Charity: The thing that I will often notice is the amount of pain that people will subject themselves to.
Nat: They tolerate it.
Charity: Without noticing, like over and over.
Liz: Yeah. Because it creeps up on you over time.
Charity: You're just like, "Yeah. That thing that I manually do every morning before I can do anything? It's fine."
Liz: Yeah. I want to plug some of my friends at a startup called Windmill developing a tool called Tilt, and the goal of Tilt is to make it one command to stand up your entire Docker-ified or Kubernetes-ified development environment.
Have all your logs piped to one place, you can see what's going on and what's failing. It shouldn't have to be this pipeline of manual steps that you run to provision your dev environments.
Nat: We're obsessive at Pivotal about having standardized ways to do that setup.
Charity: So, question. If you were popped into some hypothetical place, new job, and you have access to none of the internal Pivotal tools.
Nat: No BOSH?
Charity: No BOSH.
Nat: No Concourse?
Charity: We're talking to the rest of the world here. No BOSH, no Concourse.
Liz: You could probably set up Concourse from scratch, but.
Charity: If you have to go do it from scratch, you can, but most people don't know what those things are.
Imagine you're flying in to help some team, Obama for America needs your help to fix their build. What kinds of things do you look for?
You've got to see common problems and have a repertoire of fixes, but what I'm getting at is there are so many people out there who have a lot of developers, no dedicated build engineers, and it's so bad they don't know where to start.
What advice would you give them?
Nat: I walked into a system that was mostly running. The history of build engineering at Pivotal is fascinating.
Charity: Let me start with the thing. It takes an hour and a half for them to deploy code.
Nat: Sure, yeah.
Charity: Is that acceptable? [Inaudible] people.
Nat: It depends on what kind of code it is, but probably not.
Liz: The answer "It depends" is a common one, it's the same thing with service level objectives.
How do you know whether 3,000 milliseconds is OK or not?
Charity: Which is why I'm asking, what questions do you start asking and what answers do you start looking for to steer you?
If you get 10 questions to determine your next 2 weeks of work?
Nat: The first thing I'm going to start with is just checking whether or not they're having retros, and then are they actually identifying real problems in the retro and ways to change it.
Charity: Talk about a healthy retro.
Nat: Yeah, a healthy retro is you're going to talk about some of the successes, some of the fun things that have happened that week, you're going to talk about some of the things that bothered people.
I'm a big fan of talking about feelings at work.
Liz: I love how the first thing that you jumped to is not tools, but instead culture. I love that.
Nat: People organizations are the most interesting distributed system that I have access to.
Charity: Junior engineers are always like "Stupid feelings," and senior engineers are just like "Always go straight to the people."
Nat: It's all feelings.
Liz: So you set up people with retros, they start talking about issues. What do you do next?
Nat: Probably look at "Can you set up the real software, an actual distributed system, not some single node baby version of it, but the real software."
Can you do that with a single command or a couple of commands? Can you actually bring the software up, play with it, test it?
I've been on teams that for instance, I rolled on a team once and I was having a heck of a time replicating a customer problem where the customer problem was every week like clockwork.
We turn this thing off and then we turn it back on again and it doesn't work, and the team is like, "This is impossible. We can't replicate this. Why are they doing this?"
And I rolled on, I was like, "I have some bad news for you about the cloud and what it does."
Liz: You have to get things to a repeatable state.
Nat: Yeah, they couldn't replicate the problem even though it was a pretty simple problem because they couldn't deploy the software.
They did all of their testing on a single node, like Docker-ized version of it.
You have to be able to deploy the real software and get feedback, and then looking at "What is your deploy cycle? What's your minimum time to real feedback from production?"
And then, "What other sources of feedback are you getting?" Like pairing, we obsess about pairing at Pivotal because that's the fastest possible way that you can get immediate feedback.
Charity: It's interesting that you go so quickly to "You have to be able to stand up a system that looks like production."
I assume you mean hardware instances as close to production as possible.
Charity: Because I often-- And I am taking a somewhat maximal stance that I don't 100% agree with myself here, but I will often be like "What you need is prod, well instrumented that you can understand and explain really complex questions, and your laptop."
That's mostly what you need, because most problems that I've seen you can spend a lot of time trying to find them in staging. You may or may not find them, because the conditions may or may not exist, but they will always exist in prod.
Often you'll start looking for problems and you'll find different ones in staging than exist in prod.
I think that you run the basics on your laptop which you know is not prod, so you don't have to blur the lines in your head.
A lot of developers start thinking that staging is prod, and it's not.
Liz: Exactly. It's this interesting thing. I remember the old XKCD comic, someone's like "My code's compiling."
And then the manager is like, "OK." And it's like, "No. You should make your own local build blazing fast so you can write a line of code and have it running in five or 10 seconds."
But then you have to also make sure that you can deploy that code to production with a fast cycle.
Charity: You have to stand it up and look at it.
Nat: So that's actually maybe the actual first place that I would start, is "Can you run your tests locally?"
Charity: That's a really great starting point.
Nat: You're seeing some bias there, because I've been working with packaged software that you ship.
Charity: No, for sure.
Nat: But also I think you should actually be able to understand how your software behaves.
Charity: Developers should be able to tear down their environment and get started again from scratch, and very quickly.
Liz: Also that in the new maturity model that Charity and I developed, we think it's important to use the same tools to observe your production environment and your local development environment.
If you're not testing with the same telemetry, if you're leaning on looking at local verbose logs on your local laptop you're going to have a hell of a time figuring out what's going on when it hits production. Because you can't look at all those logs, it's inordinately expensive.
Nat: We actually tore down the staging environment while I was on the cloud ops team.
It was named Jeff because it was actually a sandbox. It was a playing environment for the ops team.
Charity: It wasn't a reference to serverless?
Nat: No, I don't think so. But we'd gotten to a point where we updated the staging environment most of the time after we updated production.
Charity: I've seen this too. I think that staging honestly-- It's mostly, there are some problems where, yes, you can go prod.
It's going to be destructive. It deals with deep data things that I can't mess with.
There are cases where you absolutely need a staging environment.
Nat: Like configuration problems.
We're actually probably going to set a staging environment back up, because most of our outages over the past year have been configuration issues that would have been caught by a really simple smoke test deploy.
Charity: Rolling to config, that's a really good process. Smoke tests, exactly.
Liz: I think also for smoke tests, it's important that they be relatively high signal to noise.
I think high signal to noise, but also not so critical that you're going to be screaming if it breaks.
This is the beauty of what we wind up doing at Honeycomb with our dog food environment ingesting the telemetry from our production environment.
Charity: Dog food, we use it constantly because it's how we understand the customers' Honeycombs.
Liz: But if it breaks, no end customers are affected, it just impairs our own ability to see.
Charity: I'm a huge fan of every form and flavor of dog fooding or trying things on yourself first.
Nat: Dog fooding can also lead you astray if you're really different from the customers.
For instance, Cloud Foundry engineers, the developers who are working on the thing, install and deploy it way more than any customer ever does.
The vast majority, probably 99% of the times that Cloud Foundry gets stood up, it's at Pivotal or another Cloud Foundry Foundation member developing it.
We have gotten newer engineers and they will get frustrated with the various rough edges on that experience, which is legitimate.
Liz: It's the missing stair. Everyone knows "Don't step there" except for the new engineer, so you really need that user experience viewpoint.
Nat: But it's also not-- That's something that a customer is going to do once.
The thing that's actually really important for customers is upgrades, which we never do.
Liz: Or you do upgrades, they're much more incremental. They're not big bangs.
Charity: Yeah. You're never going from v1 to v10.
Nat: We also have things like the operations team in San Francisco sits right next to the BOSH team and sits right next to the API team.
If something goes wrong, they can just go tap somebody on the shoulder and have them hot patch the BOSH Director if they need to. Customers can't do that, so it causes us to not notice some things that are frustrating about operating our product.
Charity: It just goes to show that there is no one answer to anything.
The one thing about staging that we use it for is for the UI stuff, so that you can deploy if you're working on the UI and the UX and stuff, putting it on your laptop is not going to give you the right experience.
Whereas deploying it to staging--
Liz: It's not even staging, for us it is a window into real production data.
It's just running a different binary so it's not in the critical customer path, but it's--
Charity: We do technically have a staging that is separate from dogfood.
Liz: Not anymore.
Charity: We don't?
Liz: Not anymore. I actually have a change list out to delete all the remnants of it.
Charity: Shows how much I know. We are staging-free.
Nat: How much do you love deleting things, Liz?
Liz: Oh my God, it's the best. I wanted to talk a little bit more about the--
You mentioned it takes two days to run your build pipeline from start to end. How do you--?
You said that's if it goes well, so how do you diagnose when things go badly?
And not only go badly, "Go badly" doesn't necessarily mean 100% errors. Like, how do you debug flakiness?
How do you debug slowness? How do you not have these things sneak up?
Charity: How do you know if it's gone badly or not? What does "Sad" mean?
Nat: I could talk about that for a couple of hours, probably. But I'll try to give you the short version.
Usually something going wrong is a failed deploy or a failed test run, sometimes it is a Concourse failure or an underlying Concourse failure.
Because we run probably one of the biggest Concourses that exists, so we're driving it pretty hard and we get all kinds of nice scaling feedback for them.
Liz: Sometimes the worker fails, sometimes it's a flaky test, sometimes it actually is bad code?
Nat: Yeah, there may be bad configuration. There may be a problem, there may be an actual integration problem, like--
Charity: Are they pretty clear?
Nat: No. There will be a failure, and we're working on making this more clear.
We actually have a process, we have playbooks, you can roll a new engineer on to PAS release engineering and have them running what we call a release train.
Charity: The thing about this category of problem, we've seen this with a lot of our customers at Circle and whatnot.
It's like Tolstoy said, "Every success looks the same and every failure is different in its own special way."
Nat: We do try to, during a train we'll have one person leading the whole process, and when there are failures we will pull those and record them and try to group them.
We've actually got one of our engineers, Carlos, working on a tool that pulls information from the Concourse API, Pipe stat.
Charity: Do you do a single run per merge or per commit?
Nat: No. We do tend to batch things, which causes--
Because there is that certain minimum fixed cost of running the pipelines, and I'm always the person going like "What if we put fewer things into this batch?"
Charity: Is it not possible to make it faster?
Nat: We had one go-- We have made it faster. It used to take longer, basically.
We have made it faster by reducing the amount of testing that we were doing.
Charity: Is that the only way?
Liz: I think that's really cool, actually, to think about. Why do we want to write a test?
Does it have to stick around forever? The ROI of testing is very interesting.
Nat: Yeah. That's one of my perpetual conversations that I'm having with people, "Yeah. I understand this is valuable, but is it worth it? Like, is it valuable enough to be worth the cost?"
That's a case by case.
Liz: How do you measure the cost? How do you figure out which ones you're going to go after and say, "Is this really worth it or can I optimize it?"
Like, what do you do for that?
Nat: For tests specifically, I'm glad you asked that, I was really interested in this question.
We use Ginkgo for most of our testing, Go testing, and Ginkgo has reporters.
We wrote a Honeycomb reporter that every time a test runs it'll report pass or fail, what line did it fail when it failed, how long did it take?
For instance, when we're dealing with super flaky tests we now can use any Ginkgo suite.
Basically, we can hook it up and get a rank ordered list of the tests that have failed the most in the last 30 days, or whatever it is.
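A rough sketch of that rank-ordered list, written in Python rather than as a Ginkgo reporter or a Honeycomb query. It assumes each test run has already been captured as a dictionary; the field names `test`, `passed`, and `timestamp` and the function `rank_failures` are hypothetical, chosen only for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta


def rank_failures(events, now: datetime, window_days: int = 30):
    """Return (test name, failure count) pairs, most-failing first.

    `events` is an iterable of dicts with hypothetical keys:
      test      - name of the test
      passed    - True if the run passed, False if it failed
      timestamp - datetime of the run
    Only failures within the last `window_days` days are counted.
    """
    cutoff = now - timedelta(days=window_days)
    failures = Counter(
        event["test"]
        for event in events
        if not event["passed"] and event["timestamp"] >= cutoff
    )
    return failures.most_common()
```

The same per-run records could also carry the duration, so sorting by total time spent instead of failure count would surface the slowest tests.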
Liz: Nice. It's the way to identify the low hanging fruit. "What is failing the most often, what's taking the most time?" Interesting.
Nat: That's the main way that we've been able to use Honeycomb so far, but that's something that in general I would like to--
I've been exploring a couple of different ways to instrument more of our Concourse pipeline with Honeycomb.
Charity: I'm curious what you guys use for observability and build pipelines, because this is not something I've really thought about.
I always think of observability for production.
Our customers keep dragging us over towards being like, "No. But we could visualize our build pipeline as a waterfall using tracing."
Intercom was the first to instrument their build pipeline and drop their entire deploy from commit until it's in prod to four and a half minutes using Ruby on Rails.
Charity: I know. I am still blown away by that.
Liz: It's just those little things, like figuring out what is the distribution of latencies for this particular span type?
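As a small illustration of that question, the sketch below groups build-pipeline spans by type and summarizes their durations. It is plain Python, not a Honeycomb query or any tracing library's API. The span fields `type` and `duration_s` and the function `latency_summary` are assumptions made for the example.

```python
from collections import defaultdict
from statistics import median


def latency_summary(spans):
    """Summarize durations per span type.

    `spans` is an iterable of dicts with hypothetical keys:
      type       - what kind of step the span represents, e.g. "compile"
      duration_s - how long the step took, in seconds
    """
    durations = defaultdict(list)
    for span in spans:
        durations[span["type"]].append(span["duration_s"])
    return {
        span_type: {
            "count": len(values),
            "median": median(values),
            "max": max(values),
        }
        for span_type, values in durations.items()
    }
```

A large gap between the median and the maximum for one span type is a hint about where to look first.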
Charity: I was so blown away by that. All of the advanced teams, almost all of them, tend to come back to us eventually and go "We're using Honeycomb for our build pipeline," which is not something I ever would've predicted.
Nat: Yeah. You have to start with measuring it because otherwise you're just throwing darts in the dark.
Charity: It's a high-cardinality problem.
Liz: It's not even-- It is a high cardinality problem, but it is not that expensive of a problem.
The amount of data that's produced relative to the cost to run your tests in a VM, that's tiny.
Nat: Yeah, it totally is. Several times last couple of months I've had a team come to me, they have flaky tests and they heard that I have something for it. I'm like, "Here. Get set up with this. It's very exciting."
And then they come back and they're like, "How much does this cost though? Is this going to be OK?"
I'm like, "It's $70 a month. You're not going to use even all of that." They're like, "How do we make this more efficient?"
I'm like, "Have you actually--? Just check how much you're paying for it right now, check how much data that you're using."
They're like, "It's like less than a gigabyte." I'm like, "Yeah. It's less than-- It's very small."
Actually checking first before you start optimizing makes a huge difference.
Charity: The thing about--
Liz: Premature optimization is the worst.
Charity: That's true. I find that a lot of companies also, the thing is that they instrument to look for the answer to one question and then once they pick up the rock, they're like "Look what lives under there."
It's just thing after thing after thing. Sometimes they blame us. They're like, "Before we had Honeycomb we did not know."
And I'm like "It was-- Your customers knew. Your users knew."
Nat: Yeah. I came out of-- I started as a software tester, as an exploratory tester.
Testers will talk about breaking the system. Testers don't break the system, testers reveal-- Software is just always broken, none of it works.
Liz: But at the same time, it's our obligation as software engineers to think about which groups of people are worst affected by the breakages in our software.
Is it concentrated on one user, or is it evenly spread out? It's almost never evenly spread out.
Charity: A spike is a spike, and then you don't know until you start disaggregating it, "Is it everyone who's impacted more or less evenly?"
Almost certainly not. "Is it 10% of users who are like completely locked out? Is it--?" You don't know. This is what bugs me about dashboards.
This is why I hate dashboards, there's a spike and people just assume that they know what it is because they've seen something before that looks like it, and it's too hard and expensive to actually go figure out for sure.
Because you have to jump into logs and all this detailed shit. Or, you could use Honeycomb.
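A minimal sketch of what disaggregating a spike can look like, assuming the raw events are available as dictionaries. This is generic Python, not Honeycomb's query interface. The field names `user` and `error` and the function `error_share_by_user` are hypothetical.

```python
from collections import Counter


def error_share_by_user(events):
    """Return (user, share of all errors) pairs, largest share first.

    `events` is an iterable of dicts with hypothetical keys:
      user  - identifier of the user who made the request
      error - True if the request failed
    """
    errors = Counter(event["user"] for event in events if event["error"])
    total = sum(errors.values())
    if total == 0:
        return []
    return [(user, count / total) for user, count in errors.most_common()]
```

If one user accounts for most of the errors, the spike is a very different problem from one where every user fails a little, even though both draw the same line on a dashboard.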
Nat: I have this problem where once I understand that a problem is solvable, not even solved, just solvable, I get bored immediately and I'm like, "Somebody else can do that."
Charity: "I've figured it out."
Nat: Which is why I gravitate towards people problems, basically. But it's also why I've been really attracted--
Why I have been attracted to Honeycomb and similar tools that let you collect information for problems you didn't even know that you had, necessarily.
Charity: And you started out as a tester.
The workflow, when I was trying to figure out how the fuck to describe what we were doing, before I landed on "Observability," one of the things that I was playing with was BI for systems.
Because in BI, in business intelligence they never would've been satisfied with "Here are a dozen dashboards. Now whenever you have something happen in your business, just fit them to one of these. Or worst case scenario, we'll make a new dashboard to describe that."
Because every scenario is so specific and so unique and so new, and you need to take one small step, look at the answer, and then based on the answer take another small step.
Follow the breadcrumbs to this specific answer every time.
Liz: The answer to "Can our system answer this question?" shouldn't be either "Yes" or "No."
It should be "It might take us a minute to figure out, it might take us 10 minutes to figure out, it might even, heaven forbid, take half an hour to figure out."
But never should the answer be "We can't do that."
Charity: Or revert to SSHing and stracing your binary, which is a thing that I used to do all the time at Parse.
Liz: So, how often do you have to go and look at individual Concourse workers?
Nat: I have a confession to make. I've been spending the last couple of weeks deep in manager leveling, so I haven't been hands on with the software in a little bit, but none of it-- I don't know, once a week at most.
Charity: It's riding a bicycle. See, the thing is that anytime you have to do that you know in your heart that you failed in some way.
Any time you have to look, if you've got to trace it all the way down to the end, this is how I used to feel.
If I had to SSH into a machine and look at some log or some state in the machine, it meant--
It's not a catastrophic failure, but it just means that my tools have failed to answer my questions at a higher level.
Liz: Lower your instrumentation. Sometimes the onus is on you to instrument, and the way you get that signal is "Am I having to do manual work too often?"
Nat: That's what's great, and that's one of the things that I love about working on developer tooling now.
It makes it your job to fail, so then your customers don't have to fail, and every time you do some debugging that's like "This was really hard."
And you're like, "How can we make this easier for the operator?"
Charity: And you're never just solving for one, you're solving a category problem. I enjoy that.
Liz: As we reach the end of our time here, are there any closing thoughts that you want to talk about?
Charity: I wanted to-- You mentioned briefly just now being a manager, would you like to talk about the career arc of someone who identifies as a build engineer or release engineer?
Because I don't think that people really understand that as a career.
Nat: Yeah. I think in my context, it's something that often people will come into for a year, year and a half, two years and then go back to another team. But there are--
Charity: What do you think of that?
Nat: I think that's good. I think that's healthy. We need people who understand the system on the release team.
Charity: It's a great way to do that.
Liz: It's the same way that I love to talk to people about, "You should do a site reliability engineering rotation."
Nat: But I also think that this is a--
There's a handful of us who think this is an emerging specialty, an emerging area, and we are interested in "How do we take this and make it a career? What does that look like?"
Charity: Is there a community or a group or someplace that people who are interested in this can join and find out more, or anything?
Nat: There's not yet, but there probably should be. There is one internal at Pivotal, but we should start a release engineering guild.
Charity: I think we should make-- I think we should take this--
Liz: Having that idea of a common community of practice. Or even things sometimes get started with O'Reilly books.
Where is the O'Reilly book for modern release engineering?
Nat: Our anchor is thinking about writing it. Can I rant for a second about spreadsheets?
Charity: Yes, please.
Nat: So we touched a little bit on people trying to fix things without measuring at all, and you just end up fixing a little bit here and there and you never make traction on the real problems.
Charity: Cut your own toe off by accident.
Nat: We talked a little bit about making it easy for people to measure, like it's good to measure.
The real reason that I love Honeycomb and the real reason that I wrote this Ginkgo reporter was to rescue people from dumping data into spreadsheets that they never look at.
Because often when people have flaky tests, they start collecting data by hand in a spreadsheet.
Charity: Their test data? I'm sad now.
Nat: I don't even care if you never look at the automatically collected data. Just don't be spending hours of engineer time a week--
Charity: On a spreadsheet.
Liz: It's like, would you rather invest in the right tools or would you rather waste a bunch of your engineers' time on inadequate solutions?
Nat: And it's so much better. I was taught by my first engineering manager, basically--
It's better to do things in a way that's a little bit slower to start with but lets you have fun and lets you write code than it is to do something painful and boring and not learn anything.
Charity: That's true. I've seen people, now that you've mentioned it, I've seen people on their laptops that have a copy of-- Not MySQL but the Microsoft version of that.
Charity: MSSQL, and dump in telemetry data and start sifting through it. It's equally sad.
Nat: And you can take that too far. There's definitely premature optimization and all of that.
Liz: But is it prototyping? Or are you actually making this load bearing? If it's load bearing you need to consider it appropriately.
Nat: Yeah, but you want to always be learning, and you want to make it really easy to collect the data that you need.
Because otherwise you're not going to change what data you're collecting.
Charity: If you needed it once you're going to need it again.
Liz: I love that idea of having your instrumentation never be a fixed thing, or having your testing coverage never be a fixed thing. I love that idea. A continuous process.
Nat: Yeah, I can go in-- We started collecting information on the acceptance tests, and then I was like "We want to have the version of the CLI that generated this."
And that was 30 minutes, 15 minutes and a line of Bash. Most of the Bash was--
Charity: Bash for the win.
Nat: Most of the 15 minutes was learning how to write the Bash, and then I learned that forever.
Charity: No, you didn't.
Nat: That one little piece.
Liz: How easy is it to add a new column to your data?
Nat: Yeah, and once you get it started, if you just build things up over time.
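To make the "add a new column" idea concrete, here is a minimal sketch in Python. It is not the Ginkgo reporter or the line of Bash Nat describes. The event fields, the `tool_version` and `add_column` helpers, and the use of the Python interpreter as a stand-in for the CLI under test are all assumptions made for illustration.

```python
import json
import subprocess
import sys


def tool_version(command):
    """Run a version command and return its first line of output, or "unknown"."""
    try:
        result = subprocess.run(command, capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        # Missing binary or non-zero exit: record that the version is not known.
        return "unknown"
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else "unknown"


def add_column(event, name, value):
    """Return a copy of the event with one extra field; the original is untouched."""
    enriched = dict(event)
    enriched[name] = value
    return enriched


# Hypothetical test-result event; the field names are illustrative only.
event = {"suite": "acceptance", "test": "pushes an app", "passed": False, "duration_s": 41.2}

# Stand-in for the CLI under test: the Python interpreter running this script.
enriched = add_column(event, "cli_version", tool_version([sys.executable, "--version"]))
print(json.dumps(enriched, sort_keys=True))
```

Once events are structured like this, each new question costs one more field on the event instead of another hand-kept spreadsheet column.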
Charity: Thank you for being here, Nat.
Liz: It was a wonderful conversation.