Ep. #38, Infrastructure Changes with Andy Davies of Reaktor
Andy Davies: It basically came down to wanting to provide tools to many other teams.
We have a lot of teams, and they often want to spin up some piece of infrastructure on a fairly, maybe not short term basis, but they want something out now.
And not all of the teams have all of the infra capabilities.
So as a first step, you get a web UI and inside that you can say, "I want a VPC and I need a database and I need a Kubernetes cluster" or a Nomad cluster or whatever you want from our set of options.
And click a button, we create all of your infrastructure, but we also do things like push the credentials straight into the secret provider, so they don't have to be visible on a screen anywhere, ever.
Liz Fong-Jones: So it's basically safe, out-of-the-box defaults for all of your application teams.
Liz: How many are there?
Andy: In this project we've got 50 in the current team I'm in and another 20 or 30 teams I think.
Liz: Wow, that's pretty high leverage.
Andy: Yeah, varies based on the client.
I can't actually mention the client names, but yeah it depends on the client.
So we ended up reimplementing this per client as well because it's not our own stuff and it's always really specific to their infrastructure.
The other thing we've been trying to do is give them a small CLI that can do these things for them as well.
This is all written in Go so we can do that.
But we also want to provide an escape hatch. Click a button and get all your infrastructure code. So it's just a block of Pulumi code in this case. And now you can tweak it to your heart's content and know that it works to start with.
And providing this we started realizing that we don't actually know what happens when they click the create button.
We get, we understand that, you know they've requested these things and we've logged everything about the request.
And then there's this 10 minute block of infrastructure creating.
And hopefully at the end of the 10 minute block, everything works, but not always.
I mean, sometimes things are broken or we've not got our validation correct.
So what we wanted to do is have far more insight into what was actually happening.
Now, if you create a VPC, sure.
That all creates, and then you create a database and a Kubernetes cluster and then something breaks in the Kubernetes cluster.
I want to know exactly what broke so that I can try and prevent it from happening again.
Or debug it when someone's panicking at the other end of a Slack message.
Liz: How important is it that it runs in 10 minutes or five minutes or 20 minutes? How important is that feedback loop to you?
Andy: When I'm developing new modules the feedback loop is much, much more important.
If we decide to suddenly support a new database type or something that's not just a built-in Amazon service then the feedback loop to me is very important.
The feedback loop on time is less important for a developer whose process is saying, "I need a cluster." Click.
But while they don't need a feedback loop, a progress bar telling them you've got 38 pieces of infrastructure creating, 17 of them are done.
We're waiting on these five, that's useful. They don't specifically care that it's taken 10 minutes or 15 minutes.
It's, is this actually working or has it stopped? How long roughly do I have left?
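To make the progress-bar idea concrete, here is a minimal Go sketch of the kind of summary described above: how many resources exist, how many are done, and which ones are still in flight. The `State` type, the `Summarize` function, and the resource names are all hypothetical; nothing here is taken from the actual tool.

```go
package main

import (
	"fmt"
	"sort"
)

// State is a hypothetical lifecycle state for one piece of infrastructure.
type State int

const (
	Pending State = iota
	Creating
	Done
	Failed
)

// Summary holds what a progress bar would need to render.
type Summary struct {
	Total     int
	Done      int
	Failed    int
	WaitingOn []string // resources currently being created, sorted by name
}

// Summarize counts resources by state and lists the ones still in flight.
func Summarize(resources map[string]State) Summary {
	s := Summary{Total: len(resources)}
	for name, st := range resources {
		switch st {
		case Done:
			s.Done++
		case Failed:
			s.Failed++
		case Creating:
			s.WaitingOn = append(s.WaitingOn, name)
		}
	}
	sort.Strings(s.WaitingOn)
	return s
}

// String renders the summary as a single status line.
func (s Summary) String() string {
	return fmt.Sprintf("%d of %d done, %d failed, waiting on %v",
		s.Done, s.Total, s.Failed, s.WaitingOn)
}

func main() {
	// Illustrative resource names only.
	resources := map[string]State{
		"vpc":         Done,
		"database":    Creating,
		"k8s-cluster": Creating,
		"dns-record":  Pending,
	}
	fmt.Println(Summarize(resources))
}
```

Listing the in-flight resources by name is the part that answers "is this actually working or has it stopped?"; a bare percentage would not.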
Liz: Right. It's the teenagers in a car phenomenon. Are we there yet? Are we there yet?
Andy: Yeah. Getting those first Slack messages, "Hi, I clicked the button and it's broken," and you go and look at it and go, "No, it's just that this particular thing takes a long time to create."
Charity Majors: Yeah. There's nothing worse than having something just appear to be working or not.
But you're just standing there watching, going, "At what point do I bail out of this?"
At what point do I ruin everything by just Ctrl-C-ing or something?
Should I, how long has it been? It's just so frustrating.
Andy: Yes. And then we get into other parts of it.
If we're creating EC2 machines, checking that they have the performance characteristics that they should have.
Charity: Oh, god. I know.
Andy: And they do, usually.
Charity: And pulling all the bits down off S3 to warm them.
Charity: Mm, so fun.
Andy: Yeah. So that kind of thing when you just have a progress bar and it just says checking EC2, and then what does this mean?
It means we're just checking that you got the number of CPUs that you asked for or rather not the number, but their performance.
Charity: Tell us a little bit at a high level what Pulumi is and does.
Andy: So Pulumi I think is the next step on from Terraform.
Terraform is infrastructure, is configuration code in a sense.
But I mean, my view on configuration is it's just code without tests.
Andy: Yeah. So it's the next step on because you can now write your infrastructure in an actual programming language.
So you don't have to worry about that slightly weird for-loop syntax in HCL, which, you know, works well enough. Usually.
Charity: It works well enough if you're doing it once a week or once a month or something.
But if you're doing it many times a day you have higher level demands for it.
Andy: Yeah. And there's two levels to that. If you're doing it every day, it's fine.
Cause you remember all the syntax and everything.
Andy: And if you do it every few months you now have to go and look it up again.
Whereas no one forgets how to write a for loop.
Andy: So, use a for loop.
Liz: Hmm. One of the interesting things I have also found is that with HCL, it's very, very difficult to test. It has its own bespoke testing syntax, right?
You might work with HCL every day but Sentinel is this giant ball of wax that's hard to unpack.
Whereas you can write a unit test for your regular Go code.
Andy: Yeah, exactly. So Pulumi lets us write the code. It's imperative code, new S3 bucket and then maybe a new S3 bucket appears.
Or it might be the one that was already created but you can treat it as a new resource.
Liz: Now would be a good time for you to introduce yourself. Who are you? What do you do?
Andy: Right. So my name's Andy Davies. I'm a senior software developer at Reaktor.
We're a consultancy which helps our clients around the world with technology, design and strategy.
I specialize in making other developers' lives easier.
That's what my main passion in software development is.
And I try and do this by making everything easy around what they're trying to do. I don't want them to have to think about stuff.
They should just be, "we need to solve this problem and this problem and this problem."
Charity: You are a curator and an improver of socio-technical systems.
Andy: I guess. Yeah. Yeah.
My usual first step when I arrive is speed up the build process.
Charity: I love you so much already.
Liz: So tell us about that. What do people mean when they ask to speed up the build process? What do you typically find?
Charity: Or do they not ask? And you just tell them, that's what they need.
Andy: I had an ex-boss describe me once as, "Andy doesn't do what I tell him to but he does what's needed."
And that's followed me around.
Charity: Yeah. People don't get how absolutely fundamental this is.
It's like this building block where if you don't get that build process down to something small enough, it will haunt you forever.
But you might not be aware of it. It's one of those things that you just, you got to get it right before you can afford to do anything.
Andy: Yeah. And I can't remember who said it and they were like, "Optimize for developer happiness."
Listening to developers rage about waiting for a build pipeline that takes too long.
And then by the time it's finished someone else's merged to master and you now have to update your branch and then run again.
That's frustrating. And obviously they should be doing trunk based development but one battle at a time.
Charity: And yet people get this sort of learned helplessness.
Cause they've probably never worked any place that had a fast deploy system.
Andy: Yeah. Or they did, but they just assume it has to be slow because of how big the thing is they're building.
Charity: It turns out that the speed of build times is the result of many little optimizations.
It's what are you prioritizing? Day after day after day after day after day. There is no reason it has to be slow except that we made it that way.
Liz: A lot of the time people don't prioritize decomposing the system. Right?
They assume that if the build takes 20 minutes it's always going to take 20 minutes.
Andy: Yes. And the first solution always seems to be, "We need microservices."
No, you just need to stop building stuff again. This hasn't changed in six months. Why do you recompile?
Charity: Have you tried parallelizing your tests or tried pruning old tests or tried visualizing your build pipeline over time or tried any number of these things before you took to microservices?
Andy: So one of my clients, when I joined, they were like, "We need to speed up the build system."
Which I was like, "Okay, great. That's what I want to do anyway."
And they said, "So building this container is too slow."
I think it was NGINX or something with a bunch of configuration.
And I was like, "Okay, three minutes does seem excessive for an NGINX container."
So we wrote a tool to get that sort of observability-type data.
Like spans per step and executable. Like build events, but not.
And we looked at the graph, and I'm like, "Right. We could spend some time speeding up NGINX, but it happens in parallel to this other container, which takes 15 minutes."
It doesn't matter if we make NGINX take two milliseconds. The build is not going to get shorter than 15 minutes.
Andy: Because it's happening in parallel.
Liz: So often people misdirect their effort because it's on things that don't matter. Right?
You have to figure out where the critical path is and go from there.
Andy: Yeah. And it seemed so obvious that three minutes is excessive for NGINX but it just doesn't matter, yet.
Andy: It now actually matters in this build pipeline but that's because of progress.
So a lot of the time it's actually: instrument the thing, figure out what is actually slow. Can I parallelize?
Charity: Don't assume you know. Instrument first.
I feel this is one of the things that divides, I want to say the men from the boys, but the junior from the senior developers: don't assume you know, instrument first. Right?
Go put your fucking glasses on and then take action.
But don't just jump into action before you started instrumenting, before you validated that what you think is the problem, what you think is slow is actually slow.
Liz: I find it so interesting that there were these folks who claim that, you know, "Oh, instrumentation is a waste of time, right?"
Instrumentation is the most valuable thing you can do.
Charity: Instrumentation is a waste of time. Instrumentation is not valuable work.
It's not mission critical. It's not, you know, you should be spending your time on things that move the business forward.
Not things that are optional. And I'm just like, my head is just exploding at this point because how do you know, how do you know if you're working on things that are mission critical or not?
If you can't actually see it.
Andy: Yeah. I have a four step process to how I try and develop software which I try to indoctrinate as many people into as possible.
Which is have a hypothesis, measure, implement a change, verify.
Charity: That seems terribly scientific of you.
Andy: Number two, measure current. The first step is usually add instrumentation.
Actually, you have to have another set of four steps in there.
Add some instrumentation because I don't know where your build is.
My guess when I see a build pipeline these days is it's probably not caching Docker layers properly because we're spending six minutes installing node modules every single build.
Charity: Every single time. Or are you starting and stopping your database from scratch every single time.
Andy: Yeah. Use a library that lets you roll back.
Liz: But it's not just your build pipeline, right? There are all kinds of areas where you might want to measure first.
Andy: Well, yeah. I mean, providing any business feature value, is measure first.
We think that this button should be green instead of blue.
That's a fairly minor one, but what do you expect to happen? Moving a button on the screen.
Liz: It's that tying back to business value that's so, so critical, right?
And so often people just forget to do that stuff and then they forget to measure afterwards to check it had the impact they wanted.
Andy: Yeah. I can't count the number of projects I've been in where they're like, "Yes, we've spent six months developing a thing and it's out the door and it's a success. Cause it's in production and doesn't have any errors" and that's, that is good.
Or does it not have any errors because no one's using it?
Charity: Mm-hmm, right?
Andy: It often requires so much organizational buy-in to get that level.
I like starting with the build pipeline and then expanding it to Cron jobs and things. That's where I add observability.
Liz: Hmm. That's kind of interesting because normally the chicken and egg phenomenon that we see is people tend to start with having a crisis, understanding their application and having it break in prod.
And then they come back and add observability to their build process later.
But it sounds like you start with the build process and then work observability into the rest of the app.
Andy: Yeah. Essentially, it's what is the easiest, lightest touch that gives a positive impact?
I guess in all things it's trying to grab all the low hanging fruit first.
Liz: Mm. Or time to first value. Yeah.
Andy: But if I can demonstrate that I can chop 20 minutes off your build time by spending a week instrumenting and then a week making changes based on that data.
That's 15 minutes per developer, per day for forever based on how many pull requests they submit.
Andy: Which is a lot. And that pays for itself very quickly.
Andy: So yeah, I try and do that. And then people are like, "Oh, this is interesting. Could we add it to our...?"
Whatever they're building. You know what, actually, yes, I have a Cron job here that deletes old Kubernetes releases.
Here's all the stuff that it spits out that tells me when it's not working. And then people go, "Oh well, actually, if I put that in our--"
I don't know, background worker process of some form and then it kind of infects the application.
Liz: So it's kind of the golden thread model, except instead of doing it on the monolith first, you're doing it on people's Cron jobs and other kinds of daemons, and they can then cut and paste that code into other places.
Andy: Yeah. And I guess when people start seeing it then it starts coming from the top down as well.
Andy: So you just need to demonstrate some value because people don't want to waste their time on things that don't provide value.
And so you just need to show that there's a bit.
Charity: Yeah, and it's a hard pitch, right?
If you're like, "So in the short term we're going to redirect, you know, a bunch of your engineering cycles away from product development and onto something that sounds like gobbledygook to you."
"But, you know, we promise it'll make it better but in the short term it's going to make us perhaps less reliable and slower. And at some point in the future it'll totally make us better for reasons that you probably don't understand."
Because the idea of pushing immediately to prod sounds terrifying to most business people. They're like, "But don't we care about stability and security, and don't we want human eyes...?"
One of our very human default beliefs is that the slower we go, the safer it will be, and the more we get our human eyes on something, the better it will be. Software physics is exactly the opposite of our intuition on both counts.
Software physics says that it should be like a heartbeat, absolutely unremarkable, consistent, small and constant and fast.
And that's how it becomes safe. Slowing down to put more human eyes on it adds almost no value.
It decreases value.
Andy: It's the queuing thing. It's the handoff between departments.
Even if you're a QA person, for example. I have a real story of this, because we had a QA person join our team and their first reaction was, "Oh my God, you guys are terrifying. You're deploying to production how often?
My previous team was like once every two weeks and you're doing five times a day?" And we were like, "We actually don't know."
Charity: We don't know, because it's constant. Why should we count? (laughs)
Andy: And then after a week or so, they're like, "But you have so many less errors. This is weird. I thought you guys were fast and out of control but you're fast and with control."
So it's let them have more time to look at things like exploratory testing.
Liz: Hmm. Is it really fewer errors or is it just lower magnitude errors?
Andy: I'd say in this case it was actually fewer errors.
Charity: I think it's fewer errors, I do, because I think that those big bang releases, they hide so much bullshit, you know?
Andy: Oh yes.
Charity: It turns out that you actually don't ever have enough time to look under the hood or to lift up the rock. You know?
Andy: Yeah. My first company that I worked at had a six monthly release cycle.
And this was an internal software development team in a company making a desktop software application.
Andy: And our manager left for an unrelated reason.
And a few of us there started deploying more often because we're like, "Oh, I finished this thing. Why should I wait?"
Which was in retrospect, a really, really silly thing to do because we had zero unit testing, zero testing of any sort actually.
We didn't have a test person either. So yeah, we had to test in production, which was: it seems to work, push it out.
But after the first few times it got to the point where biweekly or however often we were doing it was still not quick enough.
And we were like, "This is actually less painful than those four weeks of hell after we've deployed something and everything is on fire for four weeks."
Liz: The lesson that we know is that trying to prevent failures doesn't work.
The better thing to do is to get exposed to the failure sooner.
So you can remediate fast while the state is still in your head.
Andy: Yeah. And I think certainly build process but deployment process speed comes into this as well.
Having a deployment process that takes 10 minutes is all right, until something goes wrong.
And then now you need another 10 minutes to undo this thing, and none of our customers can access the website.
Charity: How long does it take you to ship a single line of code?
Andy: Too long.
Charity: And people are all up in arms about, "15 minutes, that's too fast."
And actually, I feel like that's too slow. I feel like I'm being generous here going 15 minutes or less.
Because if it isn't, what you're going to find is people just doing distributed shell-outs, just doing SSH, copying the individual line out and doing it just like: hot fix, restart.
You don't want that. You want the default path that people use every day to be the quickest, fastest, most optimized way or else you're going to get people doing stupid shortcuts like that all the time.
Andy: We had a problem once where we had to fix something very fast. We had some static assets for our website in an S3 bucket.
And the build process took probably 20 minutes. This was a fair few years ago.
And the fix was, we'll just take the assets and copy and paste them onto the S3 bucket, problem solved.
And it didn't solve the problem because they forgot there was CloudFront and they forgot there was caching and all other kinds of things.
And yeah, sure, they'd named the files correctly, and sure, the build was too slow, but it turned out that if they'd taken the slow build, it would still have been faster than trying to hack it through.
And if hacking it through had worked, why don't we just do that all the time?
Charity: Right. All the self-inflicted, the footgun stuff.
Potential for footgun is just infinite when you start doing things by hand.
Andy: Yeah. Yeah.
Liz: So concretely, what are the steps that you take someone through?
You mentioned build events earlier, but for our listeners who aren't familiar with it, right?
How do you start with a completely uninstrumented build pipeline that's slow?
What do you wind up doing to make it better?
Andy: Depends what tools my clients have access to, or will allow me to use.
Given my own choice, it's Honeycomb and build events, but depending on what they've invested in, that's not necessarily the easiest way to go. Essentially, it's: write something that will log how long the whole build takes and then add something around blocks of steps, like build Docker, run tests, that kind of level.
And then after that drill into those more.
I've written an application, which when you run Docker builds you pipe the output of that through this application and you get a span per Docker layer.
Andy: Which tells you what the Docker layer did and whether it hit the cache.
Whether it hit the cache, that's the important one.
That's the most important part from that whole process so far because people don't have distributed Docker caches.
So your builds are slow.
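A sketch of the "span per Docker layer, with a cache-hit flag" idea might look like the following. It is not Andy's application. It assumes the text output of the classic (non-BuildKit) Docker builder, where each instruction is announced as `Step N/M : ...` and a cached layer is followed by `---> Using cache`; BuildKit prints a different format and would need a different parser. The sample output and layer IDs are made up for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// Layer is one Dockerfile instruction and whether it was served from cache.
type Layer struct {
	Instruction string
	CacheHit    bool
}

// ParseLayers reads classic-builder `docker build` output and returns one
// Layer per "Step N/M : ..." line, marking those followed by "Using cache".
func ParseLayers(output string) []Layer {
	var layers []Layer
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		switch {
		case strings.HasPrefix(line, "Step "):
			// "Step 2/3 : RUN npm ci" -> "RUN npm ci"
			if _, instr, ok := strings.Cut(line, " : "); ok {
				layers = append(layers, Layer{Instruction: instr})
			}
		case line == "---> Using cache" && len(layers) > 0:
			layers[len(layers)-1].CacheHit = true
		}
	}
	return layers
}

func main() {
	// Illustrative output only; the IDs are invented.
	sample := `Step 1/3 : FROM node:18
 ---> 1a2b3c4d5e6f
Step 2/3 : COPY package.json .
 ---> Using cache
 ---> 2b3c4d5e6f7a
Step 3/3 : RUN npm ci
 ---> Running in 3c4d5e6f7a8b`

	for _, l := range ParseLayers(sample) {
		fmt.Printf("cache_hit=%-5t %s\n", l.CacheHit, l.Instruction)
	}
}
```

Each `Layer` could then be emitted as a span with `cache_hit` as a field, which makes "how often does the install step miss the cache?" a simple group-by.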
Liz: Yeah. Something you might be interested in, for people who cannot use Honeycomb or cannot use build events:
Amy Tobey has written an OpenTelemetry CLI that will generate OpenTelemetry-formatted spans from the command line.
Liz: Including, you know, having the command right at the start and a command right at the end so you don't have to have a process wrapping everything.
Andy: That would be nice. I've essentially written that several times in Go now, and the output is usually OpenTelemetry-based, going to whatever output systems the client has in place.
Liz: How many of your clients are now having tracing out of the box, as opposed to you having to recommend a tracing provider?
Andy: Most of them have had it, but have not been using it.
They use the logging because everyone knows logs and they have metrics cause everyone knows metrics.
And one application that I've come across has all the tracing stuff but only the automated tracing.
So they can tell how long a SQL query takes and how long an API request takes.
But there's no useful data. Like what the customer's email address was or anything else attached to it.
Liz: Right. If you don't have the context it doesn't make sense to the application developers.
They never look at it. That makes sense to me.
Andy: Exactly. Yeah. I think there's more interest.
As people begin to hit the pain of microservices they start to want to know, "What's going on with my request?"
As it travels through the system, it suddenly stopped. Well, the downstream timed out.
Okay, well, why did that happen? And you want to really dig into things.
But I'd say the last two or three clients have had one of the incumbent logging providers.
Who've got one of every tool available. They've got it. They've just not been using it.
And once you start showing them what you can get out of that, then they start being interested.
The second you show them a, here's your burn down graph of how long your build process was versus, here's the one of it now. "Oh that's..." Oh, and it's scaled.
And then you show them how much shorter, narrower a slot time it is.
Once they start seeing what useful information can be shown even on a single process level, you don't have to instrument your entire microservice... forest?
What do you call a group of microservices?
Liz: Mm. That's such a common trend, right?
People think that distributed tracing is all or nothing, but you can even get value out of just tracing what happens inside of a single process.
Andy: Yeah. Or depending on the process, when you start just a single API call.
Find your problematic one, add something to it and then add more.
Liz: Yeah. The kind of iterative school of adding information.
Take the slowest thing, decompose it, take the next slowest thing, decompose it.
Andy: You don't want to spend two weeks adding instrumentation to your service and deploy it in a big bang. That's the reverse of what you want to be doing. So yeah.
Liz: You mentioned the logs earlier.
What is the thing that you find people using logs for? Do they have some degree of structure, at least?
Do they have a common identifier per request or is it just log spew all over the place?
Andy: Luckily structured logging is everywhere these days.
I haven't seen many systems recently that have just been plain log files saying "found thing, did stuff."
I haven't seen many of those recently. I mean, obviously things like web services are still spitting out unstructured logs, but they'll catch up eventually.
But actual applications that people have written? Structured logging is where it's at.
Especially the .NET world has done huge amounts for this, starting off with the open source libraries and then Microsoft starting to provide structured logging output out of the box.
But most of that has structured logging.
So it is better, but I don't ever want to have to run Elasticsearch myself again.
That's always two engineering cycles.
Liz: Hmm. That kind of gets to the issue of the format versus how you store it.
Where in theory, those structured logs with request IDs could be displayed as traces. Most people just don't bother to do it yet.
Andy: And that's probably how I first appeared on your radar: the "how to do observability without Honeycomb" thing, which was ripping up the Elastic stack.
So I used the Honeycomb libraries so that I could write data to an Elastic stack because that's what was available at the time and just changing the output format.
It's good, but it's not great.
Honeycomb is much, much better at searching my data than Elasticsearch is, assuming it's structured. And that's one step.
But if you can persuade developers to stop writing strings of text... because, I mean, structured logging is there, but it's still with a message property that has a blob of text that I need to search through.
And that's what I want to get rid of.
Liz: Mm. So it's, semi-structured, it has some properties but it doesn't have all the properties pulled out.
Andy: Exactly. Yeah. The structured parts are great, and filtering on structured data is great.
But then when you're looking for a piece of information, that's not there.
If you've got a function that logs 17 lines of log data which have 17 messages, and now you're looking for the one function that didn't write a message?
Like I don't know, user found in cache.
Andy: How do you notice a piece of data that's not there? It's really difficult.
Charity: Really difficult.
Andy: Whereas if it's a Boolean property, "found in cache": group by the Boolean property. Oh look, we have 20 people who weren't found, and they're the errors.
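The difference between a log line that may or may not be written and a Boolean field that is always present can be shown in a few lines of Go. This is a minimal sketch: the `Event` type and its field names are hypothetical, and the group-by is done in memory rather than in an observability tool.

```go
package main

import "fmt"

// Event is one structured event per request. The field names are
// hypothetical; the point is that FoundInCache is always set, so its
// false value is something you can query for.
type Event struct {
	UserID       string
	FoundInCache bool
	Error        bool
}

// Counts holds totals for one group.
type Counts struct {
	Requests int
	Errors   int
}

// GroupByFoundInCache groups events by the Boolean field and counts
// requests and errors in each group.
func GroupByFoundInCache(events []Event) map[bool]Counts {
	groups := make(map[bool]Counts)
	for _, e := range events {
		c := groups[e.FoundInCache]
		c.Requests++
		if e.Error {
			c.Errors++
		}
		groups[e.FoundInCache] = c
	}
	return groups
}

func main() {
	events := []Event{
		{UserID: "u1", FoundInCache: true},
		{UserID: "u2", FoundInCache: true},
		{UserID: "u3", FoundInCache: false, Error: true},
		{UserID: "u4", FoundInCache: false, Error: true},
	}
	groups := GroupByFoundInCache(events)
	for _, found := range []bool{true, false} {
		c := groups[found]
		fmt.Printf("found_in_cache=%t requests=%d errors=%d\n", found, c.Requests, c.Errors)
	}
}
```

With a message-only log, the "not found" case is the absence of a line, which no filter can match; with the field, it is simply the `false` group.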
Liz: Right. I might start to call that the Andy rule of Observability. Right?
Is your system observable or not? Right? It doesn't actually depend upon which format you're using.
But it's the question: can you, in your analytics, understand whether this field is present or absent from a function call?
Andy: Yeah. And there's nothing wrong with any of the existing providers.
It's just having my data split across multiple places when I'm trying to figure out what went wrong.
That's annoying. I don't want that friction.
Liz: Yeah. And then it just adds minutes and minutes or even hours to your resolution time when you're figuring something out.
Andy: Yeah. I want to type in my correlation ID and then see everything related to that.
I don't want to have to go and check four places or search strings of text. But, yeah, this is definitely a thing that's improving over time.
Liz: So when you show a customer this, I assume that they're like, "Oh my God, I can finally get value out of a waterfall trace. I can finally see what's going on."
How does that culture then spread within the company? Right.
You're not around forever because you're a consultant.
Andy: Well, usually they're impressed.
Sometimes it depends who you're showing. They might not understand.
That's probably the hardest part is that sell of showing why it's important.
But I think once you've converted, it starts sounding semi-religious, once I've converted some people, once it's solved a problem for one or two people, they then become the biggest advocates of it.
And this actually happens inside the consultancy itself.
You have me banging on about observability constantly.
And just the other day, one of my coworkers was like, "I saw his talk about observability but it didn't fit my mental model. So I didn't care, but he's really passionate about this. So I decided to give it another listen. And now I understand."
And he's now started to actually look into what is different.
And so far seems to think it will be useful in his clients as well.
So I think it's finding those one or two people who then start to evangelize it as well.
Once you've got a few people who are really passionate about it, more and more people will start to like it.
And if you're coming into on-call on a new company and your first experience is one of the systems that has the observability data everywhere that's a much more pleasant experience you know?
That person moves to another team because people move around.
They move to another team and go, "Oh, I'll just go and check that. Oh, we don't have it. Right. Well, I'm going to add it."
And then it just kind of organically grows.
Liz: I swear the number one source of Honeycomb customers is when people change jobs from a Honeycomb-using company to a non-Honeycomb-using company. It's stunning to see how quickly they come on board.
Andy: That doesn't surprise me. Especially with logging, obviously. "What's the best logging provider?"
Comes up often in internal discussions and external ones.
And then there's those people who go, "Oh, this is really good, and this is really good, and you should avoid this one."
Then there's me going, "Just stop writing logs, do this instead."
Liz: Yeah. That's a completely different way of approaching the problem.
Andy: Yeah. And logs still have value. It's just that I want more value.
I want easier value, I guess. I don't want to be full-text searching my production issues or paying for full text searching.
Liz: So one thing that Charity and I often disagree about is 15 minute rule for continuous delivery.
How important is it to apply to your infrastructure as opposed to your code?
Andy: Ooh. I guess I would. I'm going to have to give you the standard consultant answer, it depends.
Databases, 15 minutes, not a good idea.
Database schema changes, maybe that could be faster, but modifying database infrastructure itself seems a bit risky. Yeah.
I think it would depend on what the infrastructure was.
Generally speaking we've had it so that the infrastructure planning is run as part of the pipeline but someone has to approve it to actually be applied, because there's that slight distrust that you are going to take out the actual compute cluster or modify a load balancer so that no traffic can come in or something.
But infrastructure being abstracted away enough? I mean, Kubernetes manifests.
Not infrastructure. And that applies pretty, well, sometimes applies pretty quickly. So maybe.
Liz: That sounds like you're saying that there's kind of a bucket of slow things and fast things.
And some infrastructure things are in the fast path but some of them are in the slow path where you absolutely must tread with caution.
Andy: Yeah. And I think that's probably a good way of looking at it.
It's maybe not a fast and a slow path but maybe a trusted and non-trusted path.
I don't know. I don't trust DNS updates very much because I've been bitten by DNS so many times.
So I want to review those.
I mean, they're fast, but I still want to check that you're not accidentally redirecting all your traffic to localhost, again. A database update changing instance size upwards? That I'm fine with being fast. Changing it downwards? I might want to look at that. So even the direction of change might matter.
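One way to picture the trusted and non-trusted paths is a small rule that decides whether a change needs a human look. This is only a sketch under stated assumptions: the change kinds (`dns_record`, `db_instance_size`) and the size ordering in `SIZE_ORDER` are made up for illustration and would differ per provider.

```python
from dataclasses import dataclass

# Assumed ordering of instance sizes, smallest first (illustrative only).
SIZE_ORDER = ["small", "medium", "large", "xlarge", "2xlarge"]


@dataclass
class Change:
    kind: str      # hypothetical labels such as "dns_record" or "db_instance_size"
    old: str = ""
    new: str = ""


def needs_review(change: Change) -> bool:
    """Return True when the change should go down the non-trusted (reviewed) path."""
    if change.kind == "dns_record":
        # DNS is fast to apply but always gets a second look.
        return True
    if change.kind == "db_instance_size":
        try:
            old_rank = SIZE_ORDER.index(change.old)
            new_rank = SIZE_ORDER.index(change.new)
        except ValueError:
            return True  # unknown size name: be cautious
        # Scaling up is trusted; scaling down needs review.
        return new_rank < old_rank
    # Anything we have no rule for defaults to the reviewed path.
    return True
```

With this rule, going from `large` to `xlarge` passes straight through, while going from `xlarge` to `large` is held for review.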
Yeah. I'll stick with my original answer. It depends.
Also it depends how much visibility you've got of what is happening when you change that infrastructure.
Is there an undo button? If you broke a DNS record? Well, you can fix it in as record.
You destroyed your production database? You can fix it but you need a backup that's recent.
So it can be fixed, but I'm not sure that it's worth the time saving of letting someone update it instantaneously rather than in five minutes.
Yeah. I guess the trade-off is probably cost and trust.
Liz: Those are indeed some of the challenging and hard problems.
Andy: Yeah. And I'd like to point out that's trust of the system, not trust of the people doing it.
I trust all my colleagues; people make mistakes.
Do I trust that the infrastructure is in exactly the same state as my HCL file or my Pulumi program or whatever I've got? I don't know, hopefully.
Liz: Yeah. And our entire jobs are about trying to reconcile those two things and keep them in sync as much as possible.
Andy: Yeah. And I've had places where the easiest way to verify that, when a lot of people have had production access to the AWS or GCP console, is running a Terraform plan on a loop in the background.
And if Terraform plan shows changes and there's not anything around, that's fine.
But when it shows changes for a long time, that's not fine.
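A rough sketch of that background loop, assuming the plan is run with Terraform's `-detailed-exitcode` flag, which exits with 0 when there are no changes, 1 on error, and 2 when the plan contains changes. The `DriftTracker` class and its threshold are hypothetical; the idea is only to tell brief drift apart from drift that lingers.

```python
import subprocess
from typing import Optional

# Exit codes of `terraform plan -detailed-exitcode`.
NO_CHANGES, PLAN_ERROR, HAS_CHANGES = 0, 1, 2


def run_plan(workdir: str) -> int:
    """Run a plan in workdir and return its exit code."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
    )
    return result.returncode


class DriftTracker:
    """Hypothetical helper that remembers when drift was first seen."""

    def __init__(self, max_drift_seconds: float):
        self.max_drift_seconds = max_drift_seconds
        self.drift_started_at: Optional[float] = None

    def observe(self, exit_code: int, now: float) -> bool:
        """Record one plan result; return True if drift has lasted too long."""
        if exit_code == NO_CHANGES:
            self.drift_started_at = None  # back in sync, reset the clock
            return False
        if exit_code == HAS_CHANGES:
            if self.drift_started_at is None:
                self.drift_started_at = now
            return (now - self.drift_started_at) > self.max_drift_seconds
        # Plan errors say nothing about drift; leave the state untouched.
        return False
```

A scheduler could call `tracker.observe(run_plan(path), time.monotonic())` every few minutes and raise an alert when it returns `True`. Counting how often `drift_started_at` gets set would also show how frequently the configuration drifts.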
Liz: It's the difference between a red build and a green build, right?
You want your build to be green steady state rather than red.
Andy: Yeah. So just knowing that the change has happened and something's started to drift.
It's not a problem until some point later, but I want to know how often the configurations drift.
And for the most part, with current clients, not much. Infrastructure actually doesn't change that much anymore.
It's only when new stuff's been developed already or if there's a sudden likelihood of incoming traffic, but--
Liz: Yeah. It's the difference between the things that you do when you're prototyping versus the things that you do when you're trying to remediate an outage, but you want to roll it back to the good state or you want to codify your new good state--
Andy: Yeah. Just as long as it's a process that works for people and that they understand that's how it works.
Liz: Yeah. You can have all of the fancy stuff in the world, but if people don't follow the process then you're not actually going to see the benefits.
Andy: Yeah. I saw someone many years ago describing it as the Pit of Success.
I want to make it so easy to follow the process that people do that.
If I can make it so that, for infrastructure changes, the easiest way to do it is the correct way, people won't do it the non-correct way unless they've got a really good reason, because it's painful and people don't like fighting against pain that they don't have to.
So yeah. Try and make it so that, I don't know, don't give people write access to AWS.
They can plan as many changes as they want but they can't hit apply.
Liz: Or at least they have to do it through the system and not through fiddling with the console.
Andy: Yeah. Yeah. So the CI machine would go in and then you have a, "Do you approve this change?"
Yes, that's fine. Yeah. You can plan as much as you want and you can change configuration files and do whatever you want.
Just make sure a second pair of eyes has seen it, really. Ideally at the beginning of the process.
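The rule Andy describes could look something like the sketch below: people can plan freely, only the CI machine can apply, and an apply needs an approval from someone other than the author. The role names, the `PERMISSIONS` table and the `ChangeRequest` type are assumptions for illustration, not a real access-control API.

```python
from dataclasses import dataclass, field
from typing import Set

# Hypothetical role-to-action mapping: people plan, only the CI machine applies.
PERMISSIONS = {
    "developer": {"plan"},
    "ci": {"plan", "apply"},
}


@dataclass
class ChangeRequest:
    author: str
    approvers: Set[str] = field(default_factory=set)

    def approve(self, reviewer: str) -> None:
        """Record an approval; the author cannot be their own second pair of eyes."""
        if reviewer == self.author:
            raise ValueError("authors cannot approve their own change")
        self.approvers.add(reviewer)


def can_apply(role: str, request: ChangeRequest) -> bool:
    """Apply is allowed only for roles with the apply action and at least one approval."""
    allowed = "apply" in PERMISSIONS.get(role, set())
    return allowed and len(request.approvers) > 0
```

Anyone can still run a plan and edit configuration files as much as they like; the only thing that is gated is the apply.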
Liz: Code review for infra, absolutely. Every time.
Andy: Yeah. Or possibly pair programming for infra.
Liz: Yeah. We had an outage recently that was caused by a single letter typo between X-large and large.
That was not even caught with code review.
I think pairing would have eliminated that instead, if we'd been talking about it in real time with each other.
Andy: Yeah. And with these kinds of things, so often I'm like, "Could I have prevented this?"
And sometimes the answer is yes, but the cost of preventing it might've been too great. I don't like the phrase, "Move fast and break things."
But if you're moving slowly and still breaking things then it might not be great.