In episode 41 of o11ycast, Charity and Liz speak with Aaron Aldrich of Red Hat. Together they pull back the curtain on organizing conferences, accelerating feedback cycles, multi-tenancy debugging, and getting started with new tools.
About the Guests
Aaron Aldrich: Especially from my position with o11yfest or with DevOpsDays Boston recently, it's been from the production booth.
I've been running all the actual video production side of it.
It's very much dashboardy, because everything is live and it matters in the moment.
I've got like the livestream going and I've got other chats going and I've got the actual windows that I'm managing.
It's definitely sort of in that realm of if you're actively monitoring incidents and you're trying to watch, okay, I can't just know everything that's happening so I've got to have all these things up to try and monitor what's going on and all right, what are my changes affecting?
How is that coming through in the actual live feed by the time it gets to production versus what my changes are?
There's a lot of balancing the signals on all sides of it, from one side of the broadcast to the other. It's similar to that above-the-line, below-the-line idea.
Everything I'm doing up here, how's that affecting everything else?
Liz Fong-Jones: Yeah. It's super interesting to see that phenomenon of cognitive overload of having way too many dashboards, feeling like you have to have all those dashboards to see what's going on.
Aaron: Yeah, absolutely. You sort of find the balance of okay, what do I need right now?
What can I background? Okay, I can throw that off to the side.
I need it quick if I need to glance at it, but I don't need it up in front of me.
What things do I just need to alert?
Charity Majors: Do you find that conferences tend to fail in a lot of predictable ways?
Or do they fail in lots of novel and unique ways that you need to debug in crazy different ways?
Aaron: I think it's got a little bit of a balance if only because there's so many and it depends on if everything's live or if things are prerecorded.
All of that comes through with different groups of it.
There's almost always some measure of computer hardware, internet connection failure from some guest or host that you're managing that is almost guaranteed to be a predictable failure that you're watching out for.
But yeah, everything else is always new and unique.
I had never had this one before: at o11yfest, for instance, we were using OBS.Ninja, so all of the video feeds, rather than being some sort of video input, are technically browser windows.
And it just decided it didn't want to do browsing anymore so it's like, oh, I've lost all the video feeds.
Everybody standby, which was like, okay, this is brand new.
Charity: This reminds me, sometimes you get some playbooks from organizers who you can tell have been through the trenches at conference after conference because they're fucking 15 pages long.
And they're do this, this, this, this, this. Try this, this, this. Be there, there, there. And it's simple.
You're just like, oh honey, I get that you have scar tissue, but you can't inflict that on all of us.
Nobody's going to read your 15 page document and nobody's going to remember it or internalize it if they do.
Aaron: Yeah. Yeah. I think that's sort of the key.
And that's the key with any incident management or observability too: here are the main things to watch out for, but everything else you either have to mitigate.
Charity: Embrace the chaos.
Aaron: Or we're going to deal with it, here are the people that know the information you need to know.
Liz: Now would be a good time for you to introduce yourself.
Aaron: Yeah. Awesome. I'm Aaron Aldrich.
I am a managed OpenShift black belt at Red Hat.
I started DevOpsDays Hartford in Connecticut and I also have organized with DevOpsDays Boston and New York City and o11yfest just recently.
Charity: Managed OpenShift black belt. What does that mean?
Aaron: It's one of those great customer facing titles, which is we stole the black belt concept from Microsoft.
They have their global black belt teams, but we focus mainly on the managed OpenShift products, which are kind of the platform as a service, as opposed to the self-run, open source end of it.
And it's very customer success focused, so it's kind of like DevRel, but very targeted: we would like you to DevRel at this company, please.
Liz: It's kind of this transition of how do we apply all of these DevOps skills to clients?
And how do we make sure that they stay successful?
Aaron: Yeah, exactly. I'm sure you've experienced it where a client, they know the tool and they've been sold on the idea, but there's the gap between the tool that they want and where they want to be and where they're starting.
A lot of what my role is, is coming in and saying, "Okay, how can we help you bridge the skill gap? Help you look at like, where should we be looking for skills? What things do we want to skill up on? How can we get from managing our monolith apps on virtual machines to a Kubernetes platform?"
Liz: And even is the Kubernetes platform the right thing to do?
What are the design goals? What are we trying to do here?
Where should we start? Not, let's not peanut butter and Kubernetes over everything, right?
Aaron: Right, exactly.
And part of that is trying to find that out. I tend to be targeted at big groups where almost always there's some project that will fit that mold, so some of it's trying to find out: what is a good project to do?
Charity: I just had the best idea for a candy bar, Kubernutties, like a peanut butter bar. Kubernutties goes with everything.
Liz: And what are the sorts of challenges that you see people having? Where are those skill gaps?
Aaron: There's a couple things.
A lot of it is figuring out because we are in managed, there's a lot of drawing the line of what does it mean to have the platform managed?
And what things do I no longer have to worry about if it's a hosted platform?
And so there's a lot of questions like, well, how do we do X?
I'm like, well, you just don't have to do that.
That's why you're paying more for it because you just don't have to do that anymore.
Charity: And how does this circle back to observability? How does that help you?
Liz: Because you used to be an observability practitioner for many, many years.
Charity: Say more about that.
Aaron: Yeah. I'm trying to figure out where all my lines will draw here.
I think there's a couple things that get into it.
And part of it is redirecting a lot of old practice of how do we monitor the systems to how do we actually know what our applications are doing?
And especially, they're coming from being Red Hat customers. They're Red Hat folks.
They get infrastructure. That's where they're focused. They want to monitor their systems, well what happens if our nodes X, Y, and Z? I'm like, well, you don't monitor that anymore. That's something we're going to handle.
Now you need to focus on your application and how that builds and what's happening with your customers and making sure all that's working out for you.
There is a lot of that shift of mentality to, yeah, you don't need to worry about the disk availability.
You don't need to worry about all of these underlying pieces.
Charity: No more CPU load averages ever.
Aaron: Right. That's what your auto balancer was for.
That's why we set this up so that you won't have to think about it. It'll just magically get better.
And if it won't, that's why we have these background alert systems that our teams are paying attention to and the actual platform is handling.
Charity: What do they have to pay attention to?
Aaron: That's the great question.
Everything that's built on top of it.
They're going to care about, what are my actual pods doing?
The actual applications, how are they performing?
Anything you'd want to look at for observability: how are customers handling that?
What are the success criteria for our application?
And are we able to meet those goals? And are we able to follow through with that?
And I think that's a lot of learning, especially for big enterprise groups that are not necessarily used to it, because a lot of their apps are internal, especially at most big groups like that.
There's some customer facing stuff, but there are maybe a hundred internal applications being run on these platforms that no one ever hears about, unless you work in a certain department or inside the company.
And I think there's a big shift of understanding that these are customers of that product, and so trying to apply the same mentality: you shouldn't be finding out that you've got low database throughput because you're getting a call from finance saying they can't run their calculations for the quarter.
That would be a bad way to find out that your application is underperforming.
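What Aaron is describing is essentially a service level objective: define success criteria up front and watch the error budget, so the app team hears about low throughput before finance does. A rough sketch of the idea in Python (the function name, target, and data shape here are illustrative, not anything from Red Hat or the episode):

```python
def slo_status(outcomes, target=0.999):
    """Given a list of per-request success booleans and a target success
    ratio, report whether the SLO is met and how much of the error
    budget remains (1.0 = untouched, 0.0 = exhausted)."""
    if not outcomes:
        return {"met": True, "budget_remaining": 1.0}
    success_ratio = sum(outcomes) / len(outcomes)
    allowed_failure = 1.0 - target          # failure rate the SLO permits
    actual_failure = 1.0 - success_ratio    # failure rate observed
    if allowed_failure == 0:
        remaining = 1.0 if actual_failure == 0 else 0.0
    else:
        remaining = 1.0 - (actual_failure / allowed_failure)
    return {"met": success_ratio >= target,
            "budget_remaining": max(0.0, remaining)}
```

Alerting when `budget_remaining` drops too fast is what turns "a call from finance" into a page to the owning team instead.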
Liz: Yeah. And it's that shift of mindset as well towards having faster feedback cycles where people are definitely used to throwing stuff over the wall rather than being responsible if it breaks and I think that also can be a challenge.
Aaron: That is a lot too.
I'm finding that a lot of it is, I think I've said this before, any successful transformation is a non-zero portion of therapy for the company.
And I'm finding a lot of fractured groups like that.
Where I'm talking to the app dev group about how are we going to move this application to a new OpenShift platform?
And it turns out, oh, but then I'm talking to the platform team who manages their container as a service at that company, or the DevOps team that's doing all these containers and platform stuff, and there's still that disconnect between the actual applications being operated on the platform and the team that's operating the underlying infrastructure, how it's being managed, and what's being monitored.
Liz: Right. It's a thing where these people should be talking to each other, should be teaching each other things, and yet the data is just siloed.
Aaron: Yeah, exactly. It's still getting into those silos and sometimes it's really hard to break that down.
There's so many internal company politics that I think you get really exposed to, especially in that customer success and sales facing role, because so much of that has to be navigated just to get to the end of the conversation and figure out where the blockers are.
You'll have one conversation one day with say the platform or container team and they're like, yeah, this sounds great.
Everything's wonderful. And then suddenly you talk to the app team the next day, who also needs to buy in, and they flag:
Oh no, we've always had problems with this. Wait, what?
How did this not come up? How does no one tell me that there are problems with this so far?
Liz: What are the steps that you take people through?
What do you have people start measuring? How do you have people start instrumenting?
Aaron: Sometimes starting is really just having these conversations and finding out what the problem points they're having.
If AppDev has been struggling even moving their applications over to a containerized structure, it can get really down into: does this even make sense?
Or what is even working with it? It's really weird.
We haven't done a ton of it because we've been a fairly new team for Red Hat as well.
The move towards a managed OpenShift platform from something that people are operating themselves is pretty new, so there haven't been a ton of cases I can give as a good example.
Liz: How has the experience been, then, moving from an observability vendor to a company that is providing services to these larger enterprises?
Aaron: Yeah. That's actually been a really interesting shift.
A lot more of what I did before was so focused on the instrumentation and how do I get data out of my applications?
And now switching to this platform has been a really interesting move for me, just having to brush up on a lot of skills that atrophy a bit when you DevRel for a very specific role and then move somewhere else.
I think that goes to what we mentioned a little bit earlier, the pendulum between DevRel and engineering, and it sort of swung back to the middle for me. Before, it was like, I just needed to throw some application up somewhere.
As long as I could create some sort of manifest, I could throw it wherever, throw Elastic on it, and pull data in. As a colleague of mine said: highly instrumented hello worlds, and show up places.
And now it's a lot more of, oh right. I have to start caring about how do I set this platform up again?
And how do I actually get my applications running? And how do I tune?
Liz: Right, because you're living a lot more with the consequences of your decisions, rather than showing a DevRel app that isn't running an actual production workload.
Aaron: Yeah, exactly.
It's a lot more of, oh, I'm building a lot more pipelines, a lot more of all that infrastructure, and I actually want to care about how it's operating, as opposed to, I can run it on minikube and it'll be fine.
It'll do what it needs to do and give me something. Now it's like, oh yeah, okay.
One of the weirdest bits at the beginning was just getting access to everything I needed, because with Elastic it was like, I can run it locally.
I can run it somewhere else. I can run it in their cloud product and it's fine.
And suddenly I'm like, oh, I need to set up whole clusters.
I can't just spend a little bit and expense it, because there's no cheap way for me to stand up an enterprise cluster, even a small cluster for a proof of concept. I need a number of machines that need to get funded.
I need a number of underlying services and accounts that get spun up.
There was a lot more brushing up on how everything fits together rather than just focusing on this one observability product.
It's been kind of interesting seeing how observability fits into that too.
Of having that mindset of not only do I want to get this app up and running, but how do I then make it look like something operational?
It's been funny going from the opposite end: very basic apps that kind of run wherever but are highly instrumented, getting all sorts of telemetry data out of them.
And now I'm like, okay, I have this really robust infrastructure running a kind of basic app and oh, look, I can throw some telemetry out and I'm getting some data out of it. Cool.
That's going to be good enough for people to figure out where the differences are from there.
And it's been really interesting approaching everything from a management perspective. With a managed platform, we don't want to do a ton of stuff that's one-off, okay, here's how you hand build it all.
I want a lot more of: here's the operator that you run an install on, here are the parameters that you change in order to get it running.
Liz: And also I imagine debugging multitenancy is always a fun and exciting thing.
Aaron: Yeah, always fun and exciting. Especially with the way the accounts are split; I've learned a lot more about cloud operator partnerships, which has been a whole thing.
Charity: How do you debug that with your tooling?
Aaron: The multi-tenancy with it isn't as overlapping as it could be, because so much is separated by cloud accounts.
Our issue is mainly bringing everything back in to the backend of it so that we get all the visibility we need into customer accounts.
Charity: You don't have problems where one customer is spiking or ballooning with big loads, and it bleeds out and affects a whole bunch of other customers, and then you have to figure out, because they're all getting slow at the same time, which one initiated it, that sort of thing?
Aaron: No, especially with a managed platform. It's all run through partnerships, especially with Microsoft on Azure, and on AWS. All the tenancy is happening at the cloud provider.
If anything's happening there, it ends up getting offloaded to the cloud provider.
And it would be at the underlying EC2 instances, so it's pretty rare that we're running into issues with it.
More of our issue is: how do we allow a centralized SRE team at Red Hat to look at all of these accounts across Azure and AWS and IBM Cloud and Google Cloud, and do some sort of centralized management for all of that?
Which is why I talked about operators have been huge.
The joke was, they don't do anything at Red Hat SRE without writing an operator first, because that's literally how they have to approach customer management: we can't do anything unless we can automate it and do it for everybody.
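The "write an operator first" discipline is the Kubernetes reconcile pattern: declare the desired state, and let a control loop converge the actual state toward it, identically for every customer. A minimal sketch of that loop in plain Python (the state shapes are invented for illustration; real operators work through the Kubernetes API):

```python
def reconcile(desired: dict, actual: dict) -> list:
    """Compare desired vs. actual state and return the actions needed to
    converge them -- the core of the operator / reconcile-loop pattern."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name, spec))      # missing entirely
        elif actual[name] != spec:
            actions.append(("update", name, spec))      # drifted from spec
    for name in actual:
        if name not in desired:
            actions.append(("delete", name, None))      # no longer wanted
    return actions
```

Running a loop like this on every change event is what lets one SRE team manage thousands of clusters: a fleet-wide fix becomes an edit to the desired state rather than a hand-applied change per customer.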
There's sometimes feature gap between what can a customer self manage and what is available through the managed offerings only because we haven't automated it in a way that works well enough to just say, "Yeah, click these few buttons and it'll work."
And sometimes there's a gap in building the automation behind it, rather than the functionality itself.
That's one of the weird things with regulated industries using a managed service: there's all sorts of background work figuring out how we can successfully tie into a customer's infrastructure while they stay compliant with whatever regulations they need to meet.
And so there's a lot of work around that as well happening in the back.
I know it's not directly observability, but all of it ties in, because we have to pull the telemetry data from their infrastructure back to Red Hat so that we can monitor it. We're the ones monitoring the actual node health; all of the actual cluster health is being sent back to Red Hat.
We make sure if any of those are acting up, that's when we start taking action in order to fix it.
Liz: And with the huge diversity in customer workloads, this is kind of the same problem that Charity and Christine were working on at Parse: people being able to hit your cluster with all kinds of random stuff that's going to break things.
Aaron: Yeah, exactly. And there's some of those around that too.
Especially for folks that are used to self-managing, when we're talking about the managed product, it's: oh, well, do we have access to all these underlying things?
I'm like, well, we can give you full cluster admin, but, asterisk, don't mess with this stuff.
This is all the things we're managing for it.
You're the one that breaks it, all we're going to do is let you know that you broke it as opposed to being able to actually undo it and fix it for you because we don't know what you've done at that point.
We don't want to break other stuff so it's going to come to the hey, heads up, we lost your nodes.
Is that on purpose? What are you doing?
As an example, it turned out we needed this functionality: we had some customers where every day, we just watched these nodes disappear, and we're like, what the heck is going on?
We'd try to bring them back up as these nodes went down, and come to find out, they were trying to use it temporarily.
They wanted to use it as a test cluster, so they wanted to shut the cluster down at night when they weren't using it and bring it back up.
But they were just going in and turning off and deleting nodes, so they were just disappearing from our monitoring.
We were just losing clusters or losing nodes until we finally called them up and we're like, what is happening? Oh, we're turning them off.
Oh, please stop because it's really messing with us.
Now that we know we can turn off our alerting for it because it's not going to be helpful, but also heads up we've turned off our alerting for it because you keep turning them off.
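The fix Aaron describes, knowing a missing node is intentional before paging anyone, amounts to checking for a declared off-window before alerting. A hedged sketch of that check (the `scheduled_off` field and its shape are made up for illustration):

```python
from datetime import datetime, time

def should_alert(node_missing: bool, cluster: dict, now: datetime) -> bool:
    """Page only when a node is missing AND the customer has not declared
    a scheduled-off window covering the current time."""
    if not node_missing:
        return False
    window = cluster.get("scheduled_off")  # e.g. {"start": time(20), "end": time(6)}
    if window is None:
        return True                        # no declared window: always page
    start, end = window["start"], window["end"]
    t = now.time()
    if start > end:                        # window wraps past midnight
        in_window = t >= start or t < end
    else:
        in_window = start <= t < end
    return not in_window
```

The same check, inverted, is why the team warns the customer: once the off-window is declared, a node vanishing inside it no longer alerts anyone.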
It's kind of that balance. And now we're going through it, and that was one of the features.
I don't remember what our GA for it was, but I know we're trying to work on it.
How can we build that capability, where they can turn their clusters on and off and still get the full managed experience?
Liz: Yeah, it's definitely been a fun experience for us taking our workload, some of which is stable and some of which is transient, and moving it onto Kubernetes. Only in the past two or three months has Honeycomb really started to embrace that. But prior to that, it was just like, no, these are our machine goals. These are our spot instances. These are our non-spot instances.
Aaron: Yeah. There's definitely that curve, where your understanding as a product developer clashes with the real world actual usage of the software.
In our mind, we're like well, we're spinning up whole clusters.
No one's going to be turning these on and off all the time and come to find out people are like, how can I turn this on and off all the time?
It's like, oh, okay, hold on. I guess we have to figure that out.
Liz: Your day job is very much focused on the practical realities, how to do these things. How did you wind up doing the video and event production stuff that you've been doing?
Aaron: I think the irony is that's probably where I have the most formal training, because I actually went to music school initially out of high school. I went to Berklee College of Music, and I was planning to go into production engineering.
That was sort of my trajectory, and then life happened. Tech pays really well, it turns out, and music doesn't.
Anyway, it's always something I've had a knack for and really enjoyed.
I've done lots of live sounds or live production and things like that.
Yeah, this was one of those things that started as COVID came in and everyone sort of had to learn, oh, I guess I'm going to figure out how streaming works and how all this video production works. I knew folks at a couple of different conferences that were going on, like the folks over at Deserted Island DevOps; I participated in that as a speaker.
And so that was something that kind of got that taste of how things are working on Twitch.
Austin has been doing a lot of stuff on Twitch and kind of experimenting with that platform.
And as other conferences rolled around, I was like, I kind of know what's going on in the background of these things. I think I can try it out.
OBS, Open Broadcaster Software, is open source. I can download it and try it out.
And so it was a matter of being the person on the DevOpsDays Boston organizer team that had enough background skill in live production. It was like, all right, I'll just jump in and do it, because the other options are not great, and I will care about the quality of production that comes out.
Rather than just trying to use the built-in tools to hop in or whatever, let me just do it and make it better.
And I just, at a certain point was like, hey, I can totally do this.
Let me just drop everything else I'm doing because I kind of have to build this from the ground up and I'll take over this production aspect of it and it will be much better.
Liz: And we very much appreciated your volunteer efforts, especially for o11yfest, which you and I are both organizers of.
Aaron: Yeah. That was great. That was nice to do as the second one.
Having one under the belt and learning a little bit more about it and some of the production was really nice of doing with o11yfest.
Liz: There was still a bit of testing in production, as we talked about in the opening.
Aaron: Yeah, there's a couple things.
I didn't predict, for instance, that if you don't give people guidelines on what the video should look like, aspect ratios and that sort of thing, you will get as many different responses as you have people.
There were some things where I would crop a video because it looked like there was a bunch of black space, but it turned out they actually used that space later, so I had to do a few things here and there. But I think one of the common things, whether it's incident ops or live production or stage production or whatever, is that doing-it-live mindset of: well, I know how to fix this, so let me just fix it and set it live.
Let me just fix this and let it go.
And it's better to fix things halfway through than just let it play out, sometimes.
And so there was a lot of that going back and forth and a lot of little things where I know there's at least one moment in o11yfest where Paul was like, oh sure wish we could do screen sharing right now.
I'm like, oh yeah, we can do that. Give me 30 seconds. We'll have it up on the screen, no problem.
Liz: Which of the talks was your favorite from o11yfest? It's a big step that you got to enjoy the talks rather than just running the production.
Aaron: I think about this, because for so many of them I was monitoring and making sure everything was working rather than taking that moment to have a break.
The panel, of all things, is probably the one I paid the most attention to, and I really enjoyed it; it was a decent panel discussion.
Part of it was that it stood out to me as a really good way to have this conversation with people about observability. I haven't seen a really well done panel in a long time.
That was kind of nice to see.
Charity: Panels are tough.
Aaron: Yeah, all of mine are.
You see, I've got such bad answers because so much of what I was focusing on was the production of this. I really appreciated watching Ted's, if only because of his microphone afterwards in his Q&A, when I was listening to the noise canceling work. I'd never heard such a stark example of noise canceling functioning live.
Liz: For the people who didn't catch it live: what happened during o11yfest was that Ted was taking Q&A with a jackhammer going in his backyard. It was amazing to see.
Aaron: Yeah, it was great because it would stop and all of a sudden you'd have this really decent quality vocals going through.
And all of a sudden you'd hear this faint jackhammering, but his vocal quality would cut down.
It'd sound like he's talking through a telephone on the other end.
It was like, oh, it's just doing frequency reduction on all these overloud frequencies.
That's really interesting.
Charity: You've been doing engineering for a while and then you switched over to doing DevRel stuff and now you've switched back.
Talk to us about why? What was compelling or impelling you at both steps?
Aaron: Yeah. In this case I would hedge it a little bit and say I feel like I'm in the middle of a pendulum swing, rather than all the way back into engineering, because it's still a very customer focused role and I still do a lot of talks.
They just happened to be very targeted as opposed to direct engineering work.
Charity: What is it that's compelling you to make a change though?
Aaron: Yeah. The thing for me was actually the opportunity that was available at Red Hat.
I happened to know some of the folks who were working on this new team and trying to build it.
And for me it was the opportunity to go to a traditional infrastructure company like Red Hat, which is very used to that big three year contract motion of sales, and start to bring in this new cloud services, managed services opportunity. How do we start turning this around and changing the focus?
There's still a lot of like, I'm trying to think of how to word it.
That's why I'm kind of floating around the words here, but they've always very much been, and they'll say it, the enterprise open source company.
And the side effect of being in that world is that even as a product company you start thinking in that frame of mind: big projects, long-term requirements. And they're starting to move towards wanting to think about DevOps transformation.
It is a big shift in motion from the big enterprise direction, where you might think of long-term requirements and all these big projects and selling three year contracts, to this more cloud native, DevOpsy approach.
How do we become more agile? How do we target people who want to be able to click a button and start a service?
Liz: And also, how do you deliver that value sooner, rather than over three years in a kind of big bang initiative?
Aaron: Right, exactly.
That's the big thing that we're trying to work on is how do we help folks realize the value that can be delivered?
And how do we actually get that spun up? Part of the challenge: after selling a big product on a three year contract, it's kind of fine to walk away, and they'll either figure it out or they won't. But when you're doing cloud services, you can't really have someone spin something up and then not use it or not know how to use it, because they can turn it off and walk away from your service just as easily as they started it up.
That's a big shift in mindset that the whole company has to go through.
And so it was really interesting to be able to come in with folks who are interested in doing that, interested in making that turn towards how can we provide value for customers?
How can we do stuff faster?
How can we focus on this cloud native approach as opposed to big on premise enterprise deals?
And that was sort of interesting to me. I thought it was interesting to be able to work in that role.
Part of it was because I grew up in Connecticut and only just recently in a summer of 2019 moved up to Vermont.
And so most of the environment I was around, dealing with DevOpsDays in Connecticut, was big enterprise companies.
Hartford is full of Aetna and United Healthcare.
Liz: The center of the insurance world. Little known secret for anyone who's not from the Northeast.
Aaron: Yeah. In fact, it's one of the funny things about Hartford is it's a ghost town after hours.
We do DevOpsDays and there's no real nightlife because everybody leaves at 5:30 and goes home.
Enterprise was very much what I grew up with and everything that was around that space.
I kind of have a heart for government and enterprise.
Folks that are doing that chop wood, carry water.
And they're trying to do something good and make something good in this environment that's highly structured and regulated and slow to change.
It's really interesting for me to be in a place that can talk to those folks.
Charity: Doing that for a while and then why decide to swing back to engineering?
Aaron: Yeah, well that was what I was saying, was this role is focused on those customers.
Where I was in DevRel, I was focused on big community groups; it was very much anyone and everyone who could show up if I was talking at a conference or an Elastic meetup.
Very rarely am I getting a community of enterprise folks; it's more the community of startups or smaller groups that are around, unless I'm targeted at going into that company.
With Red Hat and working on the OpenShift team, it was focusing on those enterprise groups.
It's still sort of community, but it's very targeted at these big enterprises and how I can help them do that.
Like I said, it's kind of a little bit of engineering for me, but really my main goal was being that customer success facing role and that sort of thing.
And so it's still being able to help people out. The engineering changes, what I'm brushing up on, have been all around making sure I can have those conversations and offer useful advice to someone in an enterprise, as opposed to having conversations for a startup or for folks that are totally greenfield.
I'm almost always entering into projects that are already underway, stuff that's ancient stuff that's very brownfield and trying to figure out how to make those changes.
Charity: What kind of engineering do you actually do in your current role?
Aaron: Mine right now is still pretty limited. It's more around the ideals and getting started with OpenShift in this case.
A lot more of it is how can we build an application delivery pipeline?
I've got some of the functions around that of making sure those are working and adapting it for customers' actual applications.
There's not a ton that's direct engineering and building it.
It's a lot more putting the parts together inside of a big project like OpenShift, that's sort of an amalgam of open source projects.
Liz: Yeah, definitely one of the things that I highlight often is people copy the first working example they find so you'd better have that first example be correct or else people are copying these broken idioms and paradigms.
Aaron: Yeah, right.
That's actually a really good point about it because that's a lot of what we're trying to do is get those things right.
How do we get you up and running in the right way, so that you're doing it the same way every time? In the case of app delivery pipelines, a lot of the question is around, I hate saying DevSecOps because I hate all these DevExOps things, but I think we're stuck with the terminology the market throws at us. There's a lot of focus on DevSecOps ideas: how can we build some amount of security and observability into our whole delivery pipeline, making sure our packages are secure?
Charity: Doing all this thinking about this sort of stuff, what advice do you have for our customers, or for users, just for developers in general, who are asking, where do I start?
What is the smallest amount of value that I can get out, just to prove to myself, to my team, to my boss, that this is worth investing more time in?
Because I get this question all the time, and it's difficult for me to synthesize an answer because I probably haven't seen as many examples as you have.
If you're coming from a traditional monitoring world of the golden signals or whatever, but you're trying to get more towards the constant conversation with your code live in production, the high cardinality, the high dimensionality, just the next generation, the next wave of thinking about tools that is not so infrastructure-facing.
Where it's like, is my service up? Is it down? Is it healthy?
It's more about what is the experience of each user end to end?
But that's a pretty big investment so where do they start?
Aaron: Yeah, yeah, yeah. One of the things I've mentioned before is there's all the talk, especially in DevOps, about shift left.
One of the things I've mentioned is you have to shift your perspective. It's not a good technical answer, but that's the practical-thinking answer that everything, I feel like, stems out of.
It's thinking of, what is the customer experience?
And what's the success criteria of this actual application?
Because so many are not even thinking of that. So much of it is making sure the architecture is up and that's running the application.
And so I think the big first shift is okay, well if it's all running, what does it mean to be running well?
What does it mean for your customer to have a good experience?
And what's your tolerance on that? How many times can I fail?
And I think getting away from the 100%, five-nines myths that are out there.
Because I run into that a ton of well, what's the availability?
We ought to make sure it's up 99-point-whatever percent of the time.
Liz: 99.999999% of the time.
Aaron: Right, exactly. And I'm like, well, does it really? What does that actually mean?
You're talking about minutes a year. What does this actually mean?
And do you actually have that tolerance?
And I think there's a huge learning curve, just shifting that mindset away from perfection, into what is my error tolerance?
And what's that sort of thing?
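To make the "minutes a year" point concrete, here is a minimal sketch (function and variable names are my own, not from the episode) that converts an availability target into the downtime it actually permits:

```python
# Allowed downtime per year for a given availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year

def allowed_downtime_minutes(availability: float) -> float:
    """Return the minutes of downtime per year permitted by the target."""
    return MINUTES_PER_YEAR * (1.0 - availability)

for target in (0.99, 0.999, 0.9999, 0.99999):
    print(f"{target:.3%} -> {allowed_downtime_minutes(target):8.1f} min/year")
```

Five nines works out to roughly five minutes a year, which is the error tolerance question Aaron is pointing at: very few teams genuinely need, or can afford, that.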
Liz: What about feedback cycles? What about kind of the amount of time that it takes to push code through your delivery pipeline?
Aaron: Yeah. That's another big one too.
And I think that's where the pipelining comes in in the first place.
And that's why one of the things we have been working on is a sort of secure, ideal pipeline example that we can throw out there as our reference architecture for how this should look.
For OpenShift, for instance, we adopted Tekton for our pipeline processes.
That's kind of the main pipelines product that's there. Although, there's all sorts of other stuff because open source.
But what that can allow folks to do is have reusable parts; that's kind of the goal with it.
We're trying to get folks in the mindset of your pipeline should have concrete steps of physically move this from here to there.
But we also want the abstract of that as well.
What does this look like abstractly that can be applied to every single application that you have?
So that your build process is no longer custom written for every single application that you use, but you can start just reusing these processes you've already built and plug in new variables.
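A Tekton pipeline along those lines might look roughly like the sketch below. This is a hedged illustration, not Red Hat's actual reference architecture: the pipeline and parameter names are hypothetical, and real pipelines also need workspaces and a PipelineRun to execute, which are omitted here.

```yaml
apiVersion: tekton.dev/v1
kind: Pipeline
metadata:
  name: app-delivery            # hypothetical shared pipeline name
spec:
  params:                       # the "variables" each application plugs in
    - name: git-url
      type: string
    - name: image-name
      type: string
  tasks:
    - name: fetch
      taskRef:
        name: git-clone         # reusable task from the Tekton catalog
      params:
        - name: url
          value: "$(params.git-url)"
    - name: build
      runAfter: ["fetch"]
      taskRef:
        name: buildah           # reusable image-build task
      params:
        - name: IMAGE
          value: "$(params.image-name)"
```

The point of the abstraction is that every application reuses the same `fetch` and `build` steps, and only the parameter values change per app.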
Liz: Which kind of gets to that point of developer productivity.
If everyone is reinventing their own pipelines, then they're all doing their bespoke things.
They're breaking in different ways.
Aaron: Right. Right. You have to do this all over again.
It's like, oh, did we get that Jenkins script right this time? Do we copy it from the right places? It's like, no, no, no.
We don't want to do that. Just plug in the variables that you need, reuse the bits that work every time and let's use that.
And at the same time, we make sure that, ideally, the builds and pipelines are all fed back into our OpenShift interface, so that developers can go right to one place and watch their build run. I've always been a fan of not having to jump between, okay, I'm building this, now you need to open up Jenkins.
Now I need to go back and figure out what happened with it.
And I need to go to this other product to figure out what's happening there.
Charity: Right. The instant that you have to connect, switch, change tools, alt tab out, wait five minutes, that just kills your momentum.
Aaron: Yeah. Absolutely.
That's been one thing I've always said: yeah, you can use these other tools, but I would use the ones that build into the one interface you're using for everything.
That way the interface at least knows about them.
And you can click the button to bring up the tool as opposed to having to remember, all right, where do I have to go to find that out again?
As a new person to that team, I've really enjoyed the thought process that went into that.
Of being able to build an app and have one click links out to hey, let's go check the source code for this thing.
Or one click links out to like, hey, what's this actual build step doing? Not just what's its status?
And that sort of thing. It's been really nice to have those contexts built in.
Liz: Right. It's kind of like the ideal that people were supposed to get with 12-factor apps, which very few people have actually achieved in practice, except you're building it for larger enterprises and they're actually able to do it systematically.
Aaron: Yeah. A lot of it is pushing a lot of that opinion into that process.
Of saying, "Hey, this is the way we're going to say you should do it."
And you can take the time to do it some other way, but we want to make the right way the easy way. That's the ideal.
Liz: Yeah, definitely a lot of similarities with how we've been really opinionated at Honeycomb, where it kind of gives people good defaults.
But on the other hand, it takes a little bit of adjustment to realize that, yes, we probably have a reason for being opinionated about this.
We've been bitten by a lot of these things and don't want you to be bitten by them too.
Aaron: Yeah, exactly.
And it's coming from an open source-backed company that's been very much in the realm of, we don't have an opinion and you can use whatever you want to do this sort of thing.
It has been a bit of a shift in mindset there, to say it's okay to have an opinion about this and tell people there's a right way to do it.
It is okay to have that and say, "Hey, this is how I would do it and how you should do it."
Liz: Oh, awesome. Thank you very much, Aaron, for joining us and people can find you on the internet @crayzeigh.
Aaron: Yes. C-R-A-Y-Z-E-I-G-H.