DEC 2, 2020

53 MIN

Ep. #7, Keptn with Andreas Grabner of Dynatrace

Guests: Andreas Grabner

about the episode

In episode 7 of The Kubelist Podcast, Marc joins Andreas Grabner of Dynatrace to discuss the continuous delivery control plane Keptn, performance-driven engineering, and the problem with monolithic pipelines.

about the guests

Andreas Grabner is a DevOps Activist at Dynatrace, working on the CNCF Sandbox project Keptn.

show notes

transcript

Marc Campbell: I'm here today with Andreas Grabner.

Andreas is a self-described DevOps activist at Dynatrace, the company behind Keptn, a CNCF Sandbox Project. Welcome, Andreas.

Andreas Grabner: Hey, thanks for having me on the show. How are you doing?

Marc: I'm great. How are you?

Andreas: Not too bad, actually.

It's early in November as we just talked earlier, but people probably don't know we talked a little bit about-- Christmas time is looming--

Christmas markets, but unfortunately this year not a whole lot of Christmas markets.

Marc: Covid is keeping us inside.

Andreas: Exactly.

Marc: We're here today to talk about Keptn, but I'd love to understand a little bit about your background before we dive into the project here.

So, can you fill us in on your career and what got you to where you are right now?

Andreas: Sure. I started in 1999 just before the 2000 boom, I would say.

I came out of high school and had my first job for two years, and then I actually made a career changing move to start working for a company that was focused on performance engineering.

We built performance testing tools, and I was actually-- I started as a tester on a testing tool, which was interesting.

I stayed there for eight years and it got me hooked on performance engineering, and then after the eight years I followed my CTO back then in the first company.

Because he founded Dynatrace, and two years into his journey at Dynatrace I followed him over.

I've been with Dynatrace now for 12 and a half years, and I've been always having an evangelist DevRel role, really helping our customers at Dynatrace.

But I think more the general market to understand the power of monitoring, the power of performance engineering, if you are using it strategically in your software delivery lifecycle to speed up delivery and to bring better quality software to the markets.

So yeah, a big background in performance engineering and now I'm doing a lot of work on monitoring and observability, and in the last two years I was heavily involved in the Keptn open source project.

Marc: I think a lot of us in the Kubernetes world are familiar with monitoring and observability, but you brought up the term "Performance engineering."

You actually have your own podcast talking about some of these concepts too. Can you help explain what performance engineering means and what its scope is?

Andreas: Good question. The podcast is called Pure Performance, in case anybody is interested in listening in.

One of my mentors in life, Mark Tomlinson, he came up with a great quote and he said, "Performance engineering is the art of impacting the next line of code that the developer writes, so that it is the best performing code and the best scaling code that it can be."

So basically, he differentiates between performance testing and performance engineering in a way that testing means you are testing an existing piece of code, and then you're throwing back the results while engineering is more holistic.

Where you say, "I'm going to show you and teach you how you can become a better engineer, a better developer, a better architect, because I help you understand the dynamics of your system, the scalability aspects and the performance aspects."

So performance engineering is really all about building better-scaling, better-performing, optimized systems.

Marc: That's great.

Thinking about performance before you're actually monitoring and observing it, thinking about it like the way you think about security when you write code.

Make it a first principle.

Andreas: Exactly. Like in software engineering, we do test driven development.

Maybe you want to do performance driven development, where you are thinking about what can potentially go wrong with this piece of code.

What's the performance impact if I'm adding in this new library, or if I change the way I access my database?

Or, if I am loading this particular piece of resource file uncompressed versus compressed.

There's a lot of different things that we've learned over the years, and patterns.

So we want to educate engineers to think more about performance before they start writing the next line of code.

Marc: That's awesome. So, let's dive into the Keptn project, what we're here to talk about today.

Can you start us off, just explain what the project is doing?

Andreas: Sure. If you go to our website, Keptn.sh, hopefully it does a good job in explaining what it does.

Basically, it automates the configuration of monitoring and observability, alerting, delivery and auto-remediation of an artifact throughout its lifecycle.

So basically what we are doing with Keptn, we saw that there is a need in rethinking how delivery of cloud native applications works and rethinking how operations works for cloud native applications.

Keptn provides a new event-driven approach where we are using a declarative approach for defining your workflows and processes for delivery and operations, and we're using an event driven model so that every tool that you want to use in your lifecycle can easily participate in that workflow without having to hard code it into your delivery pipelines or auto-remediation workflows.

With Keptn we are addressing pains that we have seen within our organization, but also pains that we have seen with companies we work for that all realize "We're all moving now to Kubernetes, we're all moving to containers. We are deploying them much more frequently."

And while we all know how Jenkins works for delivery and while we all know how runbooks work in operations, it feels like we need a new approach to allow us to scale this, because we are talking about a different scale of how many moving pieces we have in modern software architectures.

Marc: That's cool. I think I get it, and I think one thing that helps me understand some of these newer concepts is a more concrete example.

Is there a really simple concrete example that you can provide of where Keptn is able to solve a problem in the Kubernetes native world that we can grasp on to?

Andreas: I think the challenge we had when we started with Keptn, we initially focused on continuous delivery there but then we also focused on-- We call it "Continuous or automated operations."

Keptn covers a lot of ground. It really manages the whole life cycle of an artifact, and that's why we always have a little challenge describing what Keptn is.

What does it do? It's not just doing CD, it is not just doing operations automation, it's doing a lot of things, but the great thing about Keptn is you can make it do things that level up your current processes around delivery and the operations.

So to give you a concrete example, because this is also what drove this within our organization, focusing now on actually continuous delivery for us as an example.

When we moved as an engineering organization from, let's say, more traditional monolithic applications to container based architectures, we obviously for delivery chose the tools that we were familiar with.

We chose Jenkins, because that's the tool we've been working with for years and we had our pipelines deploying our apps, and these pipelines were obviously integrated well with the tools that we use to deploy.

So if you look at a classical delivery pipeline, you have your build and your deploy, you have your test stage all in that pipeline, and you have your different plugins of the CI tool of choice.

Like in our case, Jenkins, that then talks with the right tool to build and to deploy and then to run some tests.

It's all hard coded in that Jenkins pipeline. Then we took these pipelines and adapted them to micro services. It worked well: more services were onboarded, and more projects and more teams started to build micro services.

They all said, "Do you guys already have a Jenkins pipeline over there for deploying containers on Kubernetes?"

"Sure, I'll take yours." Copy/paste, maybe do some slight modifications, and then over the first couple of months-- and here we realized that we started with one Jenkins pipeline for one micro service that was a little hard coded with Helm, and then with JMeter as a testing tool.

We saw that we had many copy/paste versions of the Jenkins pipeline, because every tool and every project had their own little nuances and requirements.

So we ended up with many snowflakes of these Jenkins pipelines, and now we got the requirement, "We need to switch tools for testing. We need to switch tools for deployment."

Then we said, "Where do we even have all these tools, and in which pipelines do we have them?"

It's very complex to manage, so it took a long time.

So we said, "If we have this problem, then I'm pretty sure a lot of other organizations have the same problem of hard coded--" We called them "monolithic pipelines," because they came out of this monolithic world.

So we said, "We want to solve the biggest problem here, which is actually the separation of concerns. Which is, what is the process of delivery and what are the toolings that are used?"

Because in the Jenkins world, everything is hard coded in this particular file.

The problem that Keptn solves is exactly this, what Keptn does is it gives you the option to define a process that defines how you're moving an artifact through the different stages.

What should happen to each-- Define what should happen in each stage and how many stages you have, and then you have another team that can define which type of tools provide capabilities that can participate in that workflow once that workflow is executed.

Just as in modern software architecture, where you have micro services that communicate with each other, they communicate through eventing.

Also Keptn at its core is an event engine, so that means when Keptn orchestrates a workflow like a delivery workflow, it sends events that basically say "I need somebody or some tool to execute a certain task in my current process."

Then you can have one or multiple tools that provide that capability to react to that event, and then they raise their hand virtually, obviously, and say "I can do the deployment. I can do the testing. I'm actually the tool that has been chosen for this particular environment to do the testing in our organization, so let me do it."

This clear separation of concerns, this event driven model, gives us a lot of flexibility because we can change the process at any time.

If our organization decides, besides having dev, staging and production, that we also want to have an additional security stage where we do additional security testing, then we just change the process definition.

We don't have to worry about all the tool integrations that are hard coded anywhere.

On the other side, we can easily switch from one tool to another without having to figure out "Where are the hard coded dependencies to all of our pipelines?"

That's one of the problems that Keptn solves.

It allows us to easily adapt and change process and tooling that can be done by different teams, because typically there are roles for different teams within the organization that define the process and the tooling.

It's all event driven, it's all based on CloudEvents, which is a standard.

That means everyone can easily also integrate their own tools, it's all standardized.

It's an event based system and we are obviously using tracing to trace all these events, so we actually have a full audit of what is happening at any point in time.

Obviously everything we do is also Git based. That means every configuration file that is changed, or that is needed by any tool that participates in the process, is stored in Git, and every change is made there.

So, everything is completely version controlled.
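The separation between process and tooling described above can be sketched in a few lines. This is a minimal in-memory illustration, not Keptn's API: the `EventBus` class, the event type names, and the two tool functions are all hypothetical and stand in for a real event broker and real tool integrations.

```python
from collections import defaultdict


class EventBus:
    """Tiny in-memory stand-in for an event broker (hypothetical, not Keptn's API)."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        # Every subscribed tool sees the event; a tool returns None to decline.
        results = []
        for handler in self._handlers[event_type]:
            result = handler(payload)
            if result is not None:
                results.append(result)
        return results


def helm_deployer(payload):
    return f"helm deployed {payload['artifact']} to {payload['stage']}"


def jmeter_tester(payload):
    # This tool only "raises its hand" for the stage it was chosen for.
    if payload["stage"] != "staging":
        return None
    return f"jmeter tested {payload['artifact']} in {payload['stage']}"


bus = EventBus()
bus.subscribe("deployment.triggered", helm_deployer)  # hypothetical event names
bus.subscribe("test.triggered", jmeter_tester)

# The process names the tasks only; it never mentions a tool.
process = ["deployment.triggered", "test.triggered"]


def run_stage(bus, process, artifact, stage):
    log = []
    for event_type in process:
        log.extend(bus.publish(event_type, {"artifact": artifact, "stage": stage}))
    return log
```

Swapping the testing tool means changing one `subscribe` call, and adding a task means appending one entry to `process`; neither change touches the other side.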

Marc: OK. That is a big and really interesting problem, and I'm sure it's actually useful even as you scale an organization or scale a project.

If I don't have the need for all of these different workflow steps, adopting Keptn early allows me to easily insert those as the organization grows.

Andreas: Exactly, and thanks for bringing this up.

I know this was a lengthy description basically on the architecture and the separation of concerns, but you're right.

In the end, I described the full end to end delivery process at multiple stages. Deploy, test and validate.

The entry point use case that we currently see about 80-90% of our users adopting is actually one key capability of Keptn that is used in delivery, but it can also be used standalone.

Which is what we call "An SLO based quality gate capability."

So part of the delivery process, remember I said "Deploy, test and evaluate," and the way Keptn does the evaluation is it reaches out to your monitoring and observability tools, to your testing tools, to your quality tools, to your security tools, and basically pulls in indicators.

SLIs, service level indicators, like "What's the response time?" or "What's the memory consumption? Do we have any new vulnerabilities? What's your failure rate of the last test execution?"

It pulls in all these metrics, again, event based.

It reaches out to these tools by sending an event and then all the tools that say, "Yes. I have some data for this environment where you need data."

They basically respond and say, "I have these five data points."

Then Keptn takes this data and compares the values against your objectives, so we have been using the same principles that Google has been spearheading with their SRE best practices around SLIs and SLOs.

So what Keptn does, it automates the collection of key metrics and SLIs, and it's then comparing them against the objectives that you as a Keptn user also declare or write down in a YAML file.

Which, again, are version controlled. That means they typically live next to your source code so that every developer can say, "Here are my SLIs that I want, and here are my SLOs."

Every time I now want to ask Keptn to validate my software that is currently running somewhere and that is monitored by some observability tool, then Keptn can automate the capturing of the data and comparing it. This use case, again, is very embedded in continuous delivery and also in automated operations in Keptn.

You can also use this standalone, and to your point, this is really what helps our users to say, "We don't want to replace everything end to end. But what we want to do is we want to, let's say, take our existing delivery pipeline that works well for us because we can deploy and we run some tests.

But what's missing right now is this event driven, automated evaluation of a quality gate," and a lot of people are catching on to SLIs and SLOs.

When we use these terms people are familiar with them already, many organizations have already defined SLOs in the production environment.

So we say, "It's great. If you already know what defines the success of your software in production, why only look at this in production?

Why not shift left and let Keptn look at these SLOs, and maybe some additional ones that are more interesting for you as engineers?

Let Keptn automatically monitor those, evaluate those, and give you a quality score, a quality gate score every time you run a build through your existing pipeline.

With this, we automatically level up your existing pipelines by giving it this automated validation step."
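The evaluation described here, comparing collected SLI values against declared objectives and returning a 0-100 score, can be sketched as follows. This is a deliberately simplified model: the metric names, the limits, and the equal weighting of objectives are assumptions for illustration and do not reproduce Keptn's actual SLO file format or scoring rules.

```python
def evaluate_quality_gate(sli_values, objectives):
    """Return a 0-100 score: the share of objectives whose SLI stays within its limit.

    Simplifying assumption: every objective is an upper bound and all
    objectives carry the same weight.
    """
    if not objectives:
        raise ValueError("at least one objective is required")
    passed = 0
    for name, limit in objectives.items():
        value = sli_values.get(name)
        # A missing SLI value counts as a failed objective.
        if value is not None and value <= limit:
            passed += 1
    return 100.0 * passed / len(objectives)


# Hypothetical objectives, as a developer might keep them next to the source code.
objectives = {
    "response_time_p95_ms": 500,
    "error_rate_percent": 1.0,
    "memory_mb": 512,
    "new_vulnerabilities": 0,
}

# Hypothetical values pulled from monitoring, testing and security tools.
sli_values = {
    "response_time_p95_ms": 420,
    "error_rate_percent": 0.4,
    "memory_mb": 600,  # violates the 512 MB objective
    "new_vulnerabilities": 0,
}

score = evaluate_quality_gate(sli_values, objectives)
```

An existing pipeline could call a function like this after its test step and continue only if the score clears a threshold of its choosing.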

Marc: Got it. I have so many questions right now about how you actually are able to do that, so let's start with based on the outcome of that validation score, you mentioned that it goes through the deploy, test, evaluate phase.

Is that where it ends, and it leaves an event that I can then take an action based on that score?

Or does Keptn also have the ability to hook up remediation or some next steps to the process?

Andreas: If you're just using the quality gates capability, where you say "Keptn, you go off. You analyze my SLIs and my SLOs that I have defined here in my git repository, and you analyze them on the system I just deployed for the last thirty minutes, because that's when I ran my tests."

Then Keptn will launch a process, launch a workflow, an event driven process, and at the end it tells you "I got your results. Your result is 90% score, because we always come back with 0-100%."

So that means you can use this in your existing pipeline, and then based on that number you say, "OK. My pipeline continues."

Now if you use Keptn for continuous delivery, the whole end to end workflow where we can orchestrate the delivery and the testing and then the evaluation, then Keptn actually takes that evaluation result and says "All right. We've just deployed in dev, we ran the tests, we got a score of 95%, which based on your process definition allows us to automatically promote this to the next stage."

Which depending on your process definition might be staging, and then Keptn would automatically continue that workflow and say, "We have a new artifact, it just comes out of dev and it has a 95% quality rating. Now, I need somebody that can deploy this into the next stage."

Once this has happened, the test happens in that stage and then the evaluation again. So the process continues.

Keptn uses this quality gate information as part of the delivery process, if you use Keptn for the full delivery process to either stop the process or let it continue to the next phase.

It also, and to answer the other point you were alluding to, Keptn can also instruct a delivery tool based on your process definition to do, for instance, a Canary deployment.

When we get a new version out of dev or staging, we want to do a Canary deployment of that new artifact in production with 10% traffic.

Then Keptn can actually say, "I have just initiated the deployment." The deployment itself can be done by the built-in capabilities of Keptn. Keptn can also talk to Argo for Argo Rollouts, and Keptn can also talk to Flux or Spinnaker.

Keptn is, as I said, event driven so you can actually use any type of tools for the actual delivery action.

But what Keptn also does once the deployment happens and you run your Canary in production, then Keptn again enforces the quality gate and says, "We just deployed into production. We have a Canary release. Now, I want to tell you that your score is below your accepted criteria."

Let's say your score is only 50% on your Canary, and then Keptn can say "Based on that information, I know that you and your process said you want to roll back or you want to turn off the Canary."

So then Keptn would automatically send another event to say "Let's scale down that Canary, take it out."

Or, it could also happen the other way around, it could say "We got 100% quality of this new Canary. Maybe we want to scale up and add more traffic to it?"
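The Canary decision described above (take the Canary out on a low score, give it more traffic on a high one) reduces to a small function. The pass threshold of 90 and the 20-point traffic step are assumptions chosen for illustration; in practice they would come from the process definition.

```python
def adjust_canary(traffic_percent, score, pass_score=90.0, step=20):
    """Return the new share of traffic for the Canary after a quality gate run.

    pass_score and step are illustrative assumptions, not Keptn defaults.
    """
    if score < pass_score:
        return 0  # roll back: route all traffic away from the Canary
    return min(100, traffic_percent + step)  # scale up, capped at full traffic
```

With a Canary at 10% traffic, a score of 50 takes it down to 0, while a score of 100 raises it to 30.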

Marc: A lot of what you just described there is really heavily around the integrations, that Keptn can integrate into other ecosystem products.

You mentioned Argo, Flux, Spinnaker.

A lot of the progressive delivery and roll out is starting to be covered more and more with SMI in service meshes, like Linkerd and Istio, and that integration.

Can you describe a little bit how Keptn integrates into service meshes?

If I already have one running in my cluster, and I already have that process?

Andreas: So I think again it comes to the actual delivery tools, at least as far as I see it, from my perspective and my knowledge of all of these progressive delivery frameworks.

From the Keptn perspective, if you install Keptn, we deploy what we call the Helm service. So, we're using Helm.

We have a service that can listen to the deployment requests from Keptn and then can deploy a Helm chart, and if you are onboarding a Helm chart for your service and in the process definition you specify BlueGreen, we actually assume you have Istio and then we automatically configure your virtual services and leverage the service mesh for the actual BlueGreen deployment, for the traffic switch.

Now, the same is obviously true if you are using tools like Spinnaker or Flux or Argo, so these delivery tools will then deliver based on your, let's say, roll out definition.

Then, technically, they use the service mesh that you have to do this.

So I think what we see instead of having Keptn from a delivery perspective integrating with a service mesh, which obviously would make sense, we see right now more Keptn being integrated with your specific delivery tools that can provide a Canary deployment, and then depending on how this tool does the Canary deployment, may use Istio or whatever service mesh you have.

So, hopefully this answers the question a little bit.

Marc: Yeah, I think that does.

I think the next question that I have really around that is integration into legacy tools. You mentioned a little while ago that Keptn is using CloudEvents.

So I'm guessing a lot of that was done in order to provide compatibility with the whole massive ecosystem of tools that existed before Kubernetes.

If I have a legacy tool, like Jenkins, you've mentioned Spinnaker is obviously a modern tool.

But what about really old legacy tools?

What's the effort that I have to go through in order to integrate one of those into a Keptn workflow or pipeline?

Andreas: The effort is actually small. If you know how to build a container and how to write code that can listen for an HTTP POST request that contains a JSON object, then you're halfway there.

The way this works is Keptn, as I said, sends events to ask basically your tools, we call them "Keptn services," "Who can do the delivery?"

For instance, Jenkins is a good example because we already have an existing so-called "Jenkins service."

That means if you are adding the Jenkins service to your tool belt of Keptn, that Jenkins service then actually listens to a deployment request.

The Jenkins service itself is basically a very small proxy container that receives that event.

That event then contains information about which Keptn project, which Keptn service, which stage, some additional metadata about what type of deployment should happen, and information about the artifact that should be deployed.

This service can then, like in the Jenkins example, just make a REST call to the Jenkins API to trigger a job.

That job, again, obviously needs to be configured. "Which job do you want to execute?"

This applies to every Keptn service that you develop, whether this is the Jenkins service that we just mentioned or maybe a service you write for another, let's say, older legacy tool.

The Keptn service that participates in the workflow, when it gets triggered by an event, also has full access to the Keptn configuration repository, which is the git repository that I mentioned earlier.

So that means as a Jenkins service, I can say "Keptn. Please give me my Jenkins configuration file."

The way we did it with the Jenkins service, you can upload a so-called Jenkins configuration file where you can say which Jenkins job should be executed with which parameters, if a certain Keptn event is sent.

This is basically a very easy way to call any Jenkins pipeline based on any type of Keptn event, and if you want to build your own service again, the only thing you need to do is build a container that can listen to HTTP requests and then understand and be able to parse CloudEvents.

We have templates for that, we also have the tutorial out there, and I think I've recorded a video where within five minutes you can write your own Keptn service to integrate any type of tool.

The template is so easy.

You just clone that repository and the only thing you need to do is fill in the code in the event handler for that particular event that you want to handle. So you need to know a little bit of coding, but you can decide which language: whether this is Go, which is one of our preferred languages, or Python, or any other language.
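The "glue code" idea can be sketched with the Python standard library: accept an HTTP POST, parse the JSON body as a CloudEvent, and dispatch on its type. The four required attributes (`specversion`, `id`, `source`, `type`) come from the CloudEvents 1.0 specification; the event type name, the payload fields, and `trigger_legacy_job` are hypothetical placeholders and are not taken from Keptn's service template.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Attributes that every CloudEvents 1.0 event must carry.
REQUIRED_ATTRIBUTES = ("specversion", "id", "source", "type")


def trigger_legacy_job(data):
    # Placeholder: a real service would call the legacy tool here,
    # for example with a REST request to its API.
    return {"job": "deploy", "artifact": data.get("artifact"), "stage": data.get("stage")}


# Hypothetical event type mapped to the code that handles it.
HANDLERS = {"example.deployment.triggered": trigger_legacy_job}


def handle_event(body):
    """Parse a JSON-encoded CloudEvent and return (HTTP status, response object)."""
    try:
        event = json.loads(body)
    except ValueError:
        return 400, {"error": "body is not valid JSON"}
    if not isinstance(event, dict):
        return 400, {"error": "event must be a JSON object"}
    missing = [name for name in REQUIRED_ATTRIBUTES if name not in event]
    if missing:
        return 400, {"error": "missing attributes: " + ", ".join(missing)}
    handler = HANDLERS.get(event["type"])
    if handler is None:
        return 200, {"status": "ignored"}  # not an event this service cares about
    return 200, handler(event.get("data") or {})


class EventRequestHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, result = handle_event(self.rfile.read(length))
        payload = json.dumps(result).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Start the listener; call this from the container's entry point."""
    HTTPServer(("", port), EventRequestHandler).serve_forever()
```

Keeping the parsing in `handle_event`, separate from the HTTP plumbing, makes the event handling testable without starting a server.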

Marc: But at the end, it's just some glue code that I have to write.

Andreas: That's all it is, yeah.

Marc: That's cool. Let's switch gears for a minute and talk about the tech stack that you're using for Keptn.

Like, can you describe what languages and what frameworks and what tools you're using to build it right now?

Andreas: Yeah, sure. What we primarily-- Everybody can go to our website or our GitHub page, the Keptn organization on GitHub.

Then you'll see that's the Keptn Core Project, and I would say primarily we have Go there as the primary language for all of our core services.

There's obviously some Angular, some JavaScript for our bridge, which is our UI, but it's primarily Go.

We have some Python in there, we also have the Keptn Contrib organization.

This is for contributor projects. These are projects like, I think, the Prometheus integration, the Dynatrace integration, the Neotys integration.

These are all Keptn services that have been around with Keptn almost since the beginning. They also represent key integrations for Keptn to work correctly, obviously with Prometheus as an observability platform.

JMeter is also part of this, for the testing.

Then we have the Keptn sandbox, which is where all of the new Keptn services that are implemented are starting their life.

Then once they get more popular, they have the opportunity to move to Contrib.

I believe that some of our contributors are using Java as the language of choice.

We have, I think, a small number of Python contributions where people just wrote Python, as you called it earlier, "Glue code services."

But the core stack from our end is definitely Go and maybe some other-- I think you asked about frameworks, so Keptn itself when you install it you get our event bridge, our remediation service, our shipyard service.

These are all the core services we have. Then we have NATS as an eventing engine, because Keptn uses events, as I mentioned.

Two years ago when we started, we initially used Knative as an eventing engine, but then we switched over to NATS, which forwards these events to the individual containers that subscribe to them.

We also have Mongo in there as a data store, we have one service that's called the configuration service that holds the git repository, and I believe that's a good overview of the stack.

Marc: Cool. Can you--?

I've been looking through the docs here in the architecture, and I think I'd love to double click on the shipyard service for a minute and understand a little bit more about what that does.

It seems like it's a pretty critical part of the system.

Andreas: Exactly. So remember when I said Keptn separates the concerns between process and tooling?

The shipyard is really the service that is orchestrating the process, because we call the process definition "The shipyard file."

So when you define your delivery process, where you are saying "I have dev, staging and prod, and in dev I want to do direct deployments with, let's say, some basic functional tests."

Then in staging you have, let's say, a BlueGreen deployment and you run some performance tests, and then in production you're doing a Canary deployment and you run some real user tests.

This is basically the process definition, and you define the process in a so-called shipyard file. Keptn is project based, so every time you create a new project in Keptn you give it a shipyard file that describes the process for delivery. The shipyard controller is really, as you say, the heart: it is the orchestrator of that process.

So, every time I send an event to Keptn I say "Keptn, I have a new artifact for you, I want you to launch that process that I have given you or described to you in the shipyard file."

Then the shipyard controller takes over and says, "I know what to do first. First, I need to take your new artifact version which you just gave me, and make sure it ends up in the configuration repository to update your Helm charts.

Then I will send out an event for the deployment. Once the deployment is done, I send out the event for testing. Once testing is done, I send out the event for the evaluation. Then based on the evaluation result, I send out the event to either promote the artifact into the next stage or I stop the process and alert people."

So it's a very critical component, you're right.
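The orchestration loop described for the shipyard controller (deploy, test, evaluate, then promote or stop) can be sketched as a plain function over a list of stages. The `SHIPYARD` structure below is a simplified stand-in written as Python data, not the real shipyard file format, and the pass score of 90 is an assumption.

```python
# Simplified stand-in for a process definition (not the real shipyard schema).
SHIPYARD = [
    {"name": "dev", "deployment": "direct", "tests": "functional"},
    {"name": "staging", "deployment": "blue_green", "tests": "performance"},
    {"name": "production", "deployment": "canary", "tests": "real_user"},
]


def run_delivery(artifact, stages, evaluate, pass_score=90.0):
    """Walk an artifact through the stages and return the list of emitted steps.

    evaluate is a callable taking a stage name and returning a 0-100 score;
    in a real system that score would come from the quality gate.
    """
    steps = []
    for stage in stages:
        name = stage["name"]
        steps.append(f"{name}: deploy {artifact} ({stage['deployment']})")
        steps.append(f"{name}: run {stage['tests']} tests")
        score = evaluate(name)
        steps.append(f"{name}: evaluated")
        if score < pass_score:
            # Stop the whole process; later stages never see the artifact.
            steps.append(f"{name}: stop and alert")
            return steps
        steps.append(f"{name}: passed")
    return steps
```

For example, with scores of 95 in dev and 50 in staging, the run ends with `staging: stop and alert` and production is never touched.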

Marc: Cool. I'd love to dive into the roadmap. Keptn is at version 0.7 right now, at the time that we're recording this?

Andreas: Exactly, 0.7.2. 0.7.3 should be released next week, at the time of the recording.

Marc: I'm curious, what you're working on right now, what we should be looking forward to in the next few releases?

Andreas: The next big move is going to come with 0.8, which is targeted for the end of the year, at least as a beta version.

Right now, what we have when you install Keptn--

And I told you about these services that are actually subscribing to the events and then executing, so right now as of 0.7, these services have to be deployed on the same Kubernetes cluster where Keptn runs.

That means all of the control plane and the execution plane live on the same cluster.

Now, with 0.8 we're changing that.

You can install the Keptn core components, like the shipyard controller and the remediation controller for auto-remediation, and all the other core components, on a Kubernetes cluster.

Then the execution plane, the individual tool proxies or tool integrations, can be installed on other Kubernetes clusters, even on other remote systems that don't have to be Kubernetes based.

So this then allows you to say, "I have a central Keptn control plane and when Keptn is orchestrating a process for delivery, it can actually then execute these external tools. Not from within that Kubernetes cluster, but from anywhere where you have your execution plane installed."

And that's a big change for us.

Marc: That is actually really cool.

I think we're seeing a lot more of that in the Kubernetes ecosystem and in a lot of CNCF projects, the multi cluster support.

We were just talking to the folks at Linkerd and they've introduced some new support for that recently.

Diving into that for a minute, how does eventing work when I have multiple clusters running? How does Keptn handle that communication?

Andreas: Today it's very easy because basically when you are deploying your Keptn service, and let's take the Jenkins service as an example, when it starts up it basically subscribes to the events, and that eventing stays within the same cluster, and that's easy.

Obviously this won't work anymore once we install the execution plane outside of the Kubernetes cluster, because you probably don't want to expose that to the outside world so that everybody can just register. So we will have a polling mechanism.

That means you can install your execution plane, let's say again Jenkins on different machines, and when you launch them then they will register themselves on Keptn and basically say "I'm here. This is my metadata, I am the Jenkins service for this particular type of environment and I have these and these capabilities, and I am interested in these types of events."

Then this execution plane, this Jenkins service will then constantly poll and basically ask, "Keptn. Do you have new events for me to act upon?"

Once there is an event then Keptn will forward it, and then what's important, a Keptn service like the Jenkins service will then need to tell Keptn by sending an event back to Keptn that it actually starts executing the action.

So, asking for "Is there an event that I'm interested in?" is one thing, but then if the execution plane then decides, "Yes. This is actually something I'm interested in and I can do it," it then sends back the start event.

So it says, "I'm starting the delivery now."

Which tells Keptn who is working on this, actually, and it can then also obviously refuse and coordinate between multiple services that maybe try to do the same thing.

But it also tells Keptn, "Somebody is working on it. I'm also expecting a final result once the task is complete," which means the execution plane eventually also needs to send the finished event with a final status, successful or failed.

And the reason why we need these events, obviously, is because then Keptn needs to control the workflow, and eventually if one of the tools says I'm doing this but never comes back, then we have the opportunity for timeouts and for retries and things like that.
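The handshake Andreas describes (register, poll, send a start event, send a finished event, with timeouts for tools that never report back) can be sketched roughly as follows. This is a minimal in-memory illustration, not Keptn's actual API: the `ControlPlane` class, its method names, the state names, and the `deployment.triggered` event type are all assumptions made for the example.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Task:
    event_type: str                      # illustrative name, e.g. "deployment.triggered"
    id: str = field(default_factory=lambda: uuid.uuid4().hex)
    state: str = "triggered"             # triggered -> started -> finished | timed_out
    owner: Optional[str] = None
    started_at: Optional[float] = None
    result: Optional[str] = None


class ControlPlane:
    """Hypothetical in-memory stand-in for the control plane (not Keptn's API)."""

    def __init__(self, timeout_s: float = 30.0) -> None:
        self.timeout_s = timeout_s
        self.interests = {}              # service name -> set of event types
        self.tasks = {}                  # task id -> Task

    def register(self, service, event_types):
        """'I'm here, and I am interested in these types of events.'"""
        self.interests[service] = set(event_types)

    def trigger(self, event_type):
        task = Task(event_type)
        self.tasks[task.id] = task
        return task

    def poll(self, service):
        """'Do you have new events for me to act upon?'"""
        wanted = self.interests.get(service, set())
        return [t for t in self.tasks.values()
                if t.state == "triggered" and t.event_type in wanted]

    def started(self, service, task_id, now=None):
        """Claim a task; returns False if another service already claimed it."""
        task = self.tasks[task_id]
        if task.state != "triggered":
            return False
        task.state, task.owner = "started", service
        task.started_at = time.monotonic() if now is None else now
        return True

    def finished(self, service, task_id, result):
        """Report the final status: 'succeeded' or 'failed'."""
        task = self.tasks[task_id]
        if task.state == "started" and task.owner == service:
            task.state, task.result = "finished", result

    def expire_stale(self, now=None):
        """Mark tasks that were started but never finished as timed out."""
        now = time.monotonic() if now is None else now
        stale = [t for t in self.tasks.values()
                 if t.state == "started" and now - t.started_at > self.timeout_s]
        for t in stale:
            t.state = "timed_out"
        return stale


cp = ControlPlane(timeout_s=30.0)
cp.register("jenkins-service", ["deployment.triggered"])
task = cp.trigger("deployment.triggered")

for pending in cp.poll("jenkins-service"):
    if cp.started("jenkins-service", pending.id):      # "I'm starting the delivery now."
        cp.finished("jenkins-service", pending.id, "succeeded")

print(task.state, task.result)
```

Because the remote service always initiates the connection, the control plane never has to reach into the remote network, and the start event doubles as a claim so that two services interested in the same event do not both execute it.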

Marc: That's in 0.8 coming at the end of the year, is that something that Dynatrace needed and drove that? Or, was that more driven from the community?

Andreas: I think it was more driven by the community, where Dynatrace itself is also part of our community.

Because we are, while the core engineers or the core contributors of Keptn are Dynatrace employees, we see also Dynatrace as a user of Keptn.

So that's why I'm saying they're also part of the community, and therefore also driving requirements and future improvements that the remote execution was-- While it was also driven by Dynatrace internally, it was more driven by our adopters.

Actually it's funny that we started down on the Jenkins train, because Jenkins was one of the reasons they wanted Keptn to orchestrate the end to end delivery process but still wanted to use their existing Jenkins pipeline.

For instance, for deployments, because they have obviously a great Jenkins pipeline that they have built over the years and they don't want to throw this away, so that's completely fair to then let Keptn orchestrate the end to end workflow, but then call Jenkins pipelines for individual tasks.

The other big use case where the remote execution was demanded was for the auto-remediation piece, because so far we focused on delivery and Keptn can orchestrate the delivery process.

But the second big process that Keptn orchestrates is auto-remediation, meaning if your monitoring or alerting tool is detecting a problem and then sends an alert or a problem to Keptn.

Keptn can then take this problem and can then look up a defined remediation workflow, so similar to a delivery workflow where you say "What should happen at each stage?"

The remediation workflow in Keptn says basically "If this problem comes in, you try this action. If this doesn't work, then the next action and the next--"

Basically, you specify a list of actions.

Actions could, for instance, be-- Let's say Dynatrace detects there is a rogue process on some remote machine eating up all the CPU, and therefore impacting a critical service that also shares this particular host.

So Dynatrace can send this problem to Keptn, and then Keptn says "For a rogue process I have a remediation workflow."

It basically says something very simple, like "Restart the process" or "Kill the process."

But in order to do this, you obviously need to have access to that remote machine.

Therefore, we need to have Keptn services that can then execute these remediation actions also on remote services, and this was a big ask from a couple of large enterprises that wanted to really look into using Keptn for automated remediation.

I think to finish this, why they are looking into Keptn is because Keptn is not only doing a fire and forget.

It's not that you send Keptn a problem and then it executes action 1, action 2 and action 3.

What Keptn does, it executes the first remediation action and then it evaluates the state of the system by again using our SLO approach.

So it again reaches out to your monitoring tools and says "We have a critical situation in this particular area based on this problem, for which we have executed remediation actions. Now after we've done this, please give me the current state of the system by pulling in your critical SLOs. If the system is back to a healthy state, that's great. Then we know the remediation worked. If not, then execute the next action."

So it's this, we call it a "Closed loop remediation."

We always execute an action and then validate, and if the validation says the system is still in the critical situation, then we can execute the next action and keep trying until we've worked through the whole process.

If nothing works, then in the end obviously we can escalate to a human being, for instance.
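The closed loop described here (run one action, re-evaluate the SLOs, move to the next action only if the system is still unhealthy, escalate when the list is exhausted) might look something like the sketch below. Everything in it is illustrative: the function name, the simulated host, the action names, and the 80% CPU threshold are assumptions for the example, not Keptn's implementation.

```python
from typing import Callable, List, Optional, Tuple

Action = Tuple[str, Callable[[], None]]


def closed_loop_remediation(
    actions: List[Action],
    slos_healthy: Callable[[], bool],
    escalate: Callable[[List[str]], None],
) -> Optional[str]:
    """Run actions one at a time, validating the SLOs after each one.

    Returns the name of the action that restored health, or None if every
    action was tried without success and the problem was escalated.
    """
    tried: List[str] = []
    for name, run in actions:
        run()
        tried.append(name)
        if slos_healthy():          # validate before trying anything else
            return name
    escalate(tried)                 # nothing worked: hand over to a human
    return None


# Simulated host: a rogue process is eating the CPU.
host = {"cpu_percent": 97}
escalations: List[List[str]] = []


def restart_process() -> None:
    pass                            # in this simulation the restart does not help


def kill_process() -> None:
    host["cpu_percent"] = 12        # killing the rogue process frees the CPU


fixed_by = closed_loop_remediation(
    actions=[("restart-process", restart_process), ("kill-process", kill_process)],
    slos_healthy=lambda: host["cpu_percent"] < 80,   # hypothetical SLO threshold
    escalate=escalations.append,
)
print(fixed_by)
```

The difference from fire and forget is the `slos_healthy()` check inside the loop: the second action only runs because the first one was validated and found not to have worked.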

Marc: That's such a powerful concept there.

I think it's definitely worth chatting a little bit more about.

These SRE processes, you mentioned the Google SRE handbook and taking that and allowing me to build completely automated runbooks that operate completely on their own, they are autonomous systems.

I don't have to have a team of SREs who understand "Here's the runbook to apply."

But there's many to many action that Keptn orchestrates, Dynatrace has great first class support in Keptn, I'm sure.

But not just limiting it to that, being able to receive inputs from everywhere and the closed loop is really cool.

Andreas: Yeah, exactly. I think another important piece to this discussion, and I just did a-- I told you that we have our podcast, PurePerformance, and we just talked with Ana Medina from Gremlin.

She's a chaos engineer over there, and we talked about the interesting aspect of how we all try to educate people about auto-remediation in production.

So basically, we're executing code in production.

In this case, in terms of Keptn orchestrating a process and therefore triggering actions which are basically our code, but while this is obviously the desired state that we want to get to, it's also very scary.

If you think about that you are deploying remediation code into production and may have never tested it before, so that's why we believe if you take all this and shift it left and start thinking about test driven operations--

Which means when you have an engineering team that should be responsible end to end for the services.

They obviously need to write great code, they should write great tests for their unit test coverage, their functional test coverage, their performance test coverage. But if these teams are also responsible for operations, they also need to write these remediation scripts or these remediation workflows.

For instance, using Keptn as the orchestrator and then individual actions that can then do things to bring the system back to a healthy state.

But how do we test this? The way we can test it, I believe, is part of a delivery pipeline.

Deploying your new code, putting it under load, and then enforcing chaos to simulate a chaotic situation.

With this, also test if your remediation code actually works as expected.

I think that's the thing we also need to get into the heads of all the site reliability engineers that want to enable their engineers to take more ownership and responsibilities, because this has to be built into the delivery pipelines.

Because you should not deploy something in production that has not been validated against SLOs, but also that has not successfully made it through chaos tests where you've also tested the remediation workflows.
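As a rough sketch of what such a pipeline gate could look like: deploy, apply load, check the SLOs, inject chaos, let the remediation workflow run, then check the SLOs again. The `remediation_gate` function and the `FakeStaging` environment are hypothetical stand-ins; a real pipeline would call actual deployment, load testing, and chaos tools, and the 5% error-rate threshold is only an example SLO.

```python
from typing import Callable


def remediation_gate(
    deploy: Callable[[], None],
    apply_load: Callable[[], None],
    inject_chaos: Callable[[], None],
    run_remediation: Callable[[], None],
    slos_healthy: Callable[[], bool],
) -> bool:
    """Return True only if the build survives load and recovers from chaos."""
    deploy()
    apply_load()
    if not slos_healthy():
        return False        # fails the plain quality gate before any chaos
    inject_chaos()
    run_remediation()       # the workflow under test
    return slos_healthy()   # did remediation bring the SLOs back?


class FakeStaging:
    """Toy environment; a real setup would call deployment, load and chaos tools."""

    def __init__(self, remediation_works: bool) -> None:
        self.remediation_works = remediation_works
        self.error_rate = 0.0

    def deploy(self) -> None:
        self.error_rate = 0.0

    def apply_load(self) -> None:
        self.error_rate = 0.01

    def inject_chaos(self) -> None:
        self.error_rate = 0.40

    def remediate(self) -> None:
        if self.remediation_works:
            self.error_rate = 0.01

    def slos_healthy(self) -> bool:
        return self.error_rate < 0.05     # hypothetical SLO: under 5% errors


def run_gate(env: FakeStaging) -> bool:
    return remediation_gate(env.deploy, env.apply_load, env.inject_chaos,
                            env.remediate, env.slos_healthy)


print(run_gate(FakeStaging(remediation_works=True)))
print(run_gate(FakeStaging(remediation_works=False)))
```

A build whose remediation workflow cannot bring the SLOs back after the chaos step fails the gate, so untested remediation code never reaches production.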

Marc: I think you'll be hard pressed to find an SRE who doesn't really want to automate everything, but to get to the point of what you just described, the long term vision of these complete automated workflows requires a high amount of trust in the process and the tools and the code.

You're going to go to bed at night and you're going to wake up in the morning, and potentially there's different things running in the cluster than when you went to bed.

Andreas: Correct. No, I agree with you.

I know I'm preaching now, like "Yes. This is what you should do. Why doesn't everybody do this already?"

But I understand that it's easier said than done, but we have to have a vision.

If we don't have a vision, if we don't give people the idea or some way we need to go to, then nobody would improve.

So that's why I strongly believe that we all need to automate, but when we still are automating operations we also need to think about the quality of that remediation code in production.

In order to ensure quality, we need to test it.

That's why I believe test driven operations, which means integrating your remediation workflows in a test batch in pre prod, where you are deploying your code and putting it under load and then enforcing chaos.

Then see if your system actually makes it through that chaos fully automatically without the negative impact on your SLOs, then you can go to sleep at night and not wake up with nightmares.

Marc: Let's dive back into the community side for a little bit here.

With 0.7 out and 0.8 coming out, Keptn is currently a sandbox project. Is there a particular use case that you're looking for?

Or, what's the best way for somebody in the community to engage that provides a lot of value to you right now and helps advance the Keptn project and vision?

Andreas: There are several ways to contribute.

On the one side, when you go to the Keptn Git repo you will find a lot of issues that are labeled "good first issue."

So, if you really want to contribute code and you want to help us implement features or fix bugs, then you're very welcome to have a look at our good first issues on the repo.

We have been involved in the Community Bridge program where we got a couple of people that helped us with actual contributions, it was great from the CNCF side.

We also have been part of hackathons, hackfests.

That's where we typically started with these good first issues, and I think that's one great way.

The other great way to contribute is if you look at Keptn and you have your tools in your organization and you say "I wish Keptn would have an integration."

Start building your Keptn service for that tool. That would be great.

We also have a couple of issues, open issues on the git repo, like where we said "We would love to see integrations with Tool X Y Z."

So for instance, we have-- Obviously I'm representing Dynatrace, but we have issues there to build integrations with Datadog and New Relic.

That would be great to see. It will be great to see some better integrations with Spinnaker and some other tools.

Selenium often comes up as a testing tool where we would love to see contribution, so if you're familiar with these tools and you like what you hear with Keptn, then build it.

The other thing you can do is you can start joining our community meetings that happen every Thursday at 5:00PM Central European Time, which I know is early in the morning on the Pacific side, 8AM or about 11AM Eastern.

That will be great, and the best way to get just a feel for Keptn, or if Keptn is something for you, go to our website or go to Tutorials.Keptn.sh and give it a try.

Give us feedback on the tutorials, on the website, on the documentation, because we know we can always do better.

I'm sure there's mistakes or things that aren't clear, so feedback is welcome and the best way to give us feedback is either through issues on the Git repo or join us on Slack.Keptn.sh and just let us know through the Slack channel.

Marc: The Keptn project makes a lot of sense right now to me, if I want to get started with it do you have any recommendations for best practices on how to maybe carve out a small way to get started that's not as overwhelming as trying to orchestrate an entire pipeline, end to end?

Andreas: Again, I have to give you multiple options.

But if you go to the tutorials page, they all work typically through end to end use cases.

Also on the delivery side, but we use our sample apps, so the only thing you really need is either a Kubernetes cluster where you can then deploy Keptn to.

Or, and I think this is a great thing that we have invested in, if you don't have Kubernetes experience or for whatever reason you cannot get a Kubernetes cluster we also have tutorials using K3s, the lightweight Kubernetes distribution from Rancher.

So, there is also tutorials on there.

There the only thing you need is a Linux box, basically, with one vCPU and 4GB of RAM, and then you can walk through the tutorial that shows you quality gates and shows you delivery and auto-remediation.

Check out the tutorials, and also there's tutorials for Prometheus, there's tutorials for Argo, tutorials for Dynatrace. Plenty of things to get started.

Marc: Great. Then it's a sandbox project right now, Keptn is, has the team given any thought to milestones or timelines in order to move up to an incubation phase?

Andreas: So actually we just had our first Keptn advisory board meeting yesterday, and we had folks on the advisory board that also asked the same questions.

Yes, there's a clear goal for us to move to the next stage.

The goals are more adoption, so we are-- One of my goals, one of my personal goals is that I increase adoption.

Not me knowing that I know that people are using Keptn, but actually on the Keptn project we have an adopters file where we keep track of adopters that are allowed to say that--

They want to say that they are actually adopting Keptn, so this is one of the measures that we have: public adoption of Keptn and reference-level adoptions.

The second thing that we learned yesterday in the advisory board, we need to grow contributors and we need to grow the people that are participating in our regular community meetings, because while we've been doing this for a while we still have not the numbers.

We still are low in attendance there, because I know there's a lot of projects out there and a lot of projects try to do the same thing, so we are all fighting for the same people.

But these are all goals, contributions and adoption.

Marc: Yeah, I think that's great.

I was just taking a look at the adopters file right now, and it's really hard because when you're writing these open source projects and putting them out there, there's definitely adoption happening that is running behind a firewall.

They're not sharing that they're adopting it, which is great that they have the ability to do that, but it's really hard to get a handle on the full breadth of the adoption.

Andreas: Exactly, because if you look at the adopters file there's a handful in there, but I am personally working with about 50 different users and different organizations where I know they have it running.

I'm pretty sure the grey number is much higher because I don't know every Keptn user out there, even though I try to keep track of it.

But it's impossible, as you say, as the number is much higher.

But we just want to have more people officially declaring that they are actually using it because this helps us with the CNCF status, and it's also a great feeling for the core engineers to see who is officially using it.

Marc: Cool. So right now, some takeaways for me are "Get Keptn installed, have it wired into our Kubernetes clusters, go check out the latest episode of the PurePerformance podcast with Gremlin and chaos engineering integrations," that sounds super interesting.

What other takeaways or topics are worth bringing up that we haven't chatted about so far?

Andreas: You did a great job in asking all the right questions.

No, I think there's not a whole lot more to add other than every project like this lives through the feedback and the community.

So, we need to grow. But the question is, what do we need to do in order to grow?

Do we need to be building the right thing?

I think that's the most important thing, are we building the right thing and are we solving problems with it that need to be solved?

I think the easiest way to answer this, "Is your organization maybe looking into ways to optimize your current delivery or your way and approach to automate operations?"

I'm not talking about replacing things that you have, we don't want to rip and replace, but are you doing automated quality gate enforcement?

Do you have an approach for automated operations? Meaning, for automated remediation?

Or are you looking into this? Are you building something yourself?

Because this was actually one of the things that initially triggered us when we saw we needed a new tool, we had a meeting two years ago.

We had a big event and actually, my boss was in a roundtable meeting with 50 CIOs.

He basically asked a question, "Who is trying to figure out what the next generation of delivery and operations looks like in the organization on your new cloud native platforms?"

And everybody raised their hand, so everyone was already starting and thinking about building this new platform, or have already built it.

So the question is, "Why are all of you building the next generation platform, and why don't we have an open source tool to try to standardize on how the future of delivery and operations looks?" I think this is why we also think that Keptn will take off, because we know there's a pain out there and we know there's a need, and instead of everybody having to build their own thing we want to be the one that is building the core platform.

By keeping it so open we can integrate with any type of tool that you have, because again, there's a lot of great tools out there.

We don't want to rip and replace them, but we want to orchestrate the end-to-end delivery process and the automation in operations.

Marc: Yeah, Andreas.

I think that message just shows how much effort has been put into the thought and the implementation here at Keptn.

It's easy to build something to rip and replace, to modernize, but to orchestrate and coordinate legacy and modern tooling to give me a really good workflow is definitely not the easy path here.

I think Keptn looks like it's doing a great job of it, so I'm excited to give it a try.

Andreas: Very cool. Give us your feedback, if you run into any issues go to Slack.Keptn.sh, that would be great.

Marc: Awesome. Andreas Grabner from Dynatrace, and one of the creators of the Keptn Project.

I definitely appreciate the conversation here today, and I'm really excited to see what the next steps are for Keptn.

Andreas: Thank you so much, and thanks for giving me a chance to talk to your audience. Stay safe and healthy.
