SEP 30, 2020

55 MIN

Ep. #1, Practicing Chaos with Uma Mukkara of MayaData

Guests: Uma Mukkara

about the episode

In this inaugural episode of The Kubelist Podcast, host Marc Campbell of Replicated speaks with Uma Mukkara of MayaData. They explore the tooling and best practices for adopting chaos engineering, as well as MayaData’s founding story and multi-project roadmap.

about the guests

Uma Mukkara is the Co-Founder and COO of MayaData, the originator of the popular cloud-native projects OpenEBS, Litmus Chaos, and Kubera.

show notes

transcript

Marc Campbell: We're here today with Uma Mukkara, the chief operating officer of MayaData, who recently released a project into the CNCF Sandbox called LitmusChaos.

We want to talk with you about this project: the motivation behind it, its current state, and the future plans. Welcome, Uma.

Uma Mukkara: Hi, Marc. Yes, thank you for having me here.

Marc: Just to get started, I'd love to hear a little bit about your background, your professional background, and how you got started with MayaData.

Potentially, even go farther back before that.

What led you to your career, and to working with projects in the CNCF ecosystem?

Uma: Right. We started MayaData about four years ago, along with Evan Powell.

Before that, for about six years I did storage for virtual machines.

The company name was CloudByte, and it was doing storage in containers, although we did not really call them "containers."

That was really trying to put isolation in storage for virtual machines.

I did that for about five years, and we did really well. Then we saw a great shift towards open source, with DevOps teams being among the primary adopters of that shift towards open source and microservices.

Then I was really thinking, "Look. There is a big wave coming in, a big shift. Either I can grow this company without challenges, or we can actually put everything into open source and adopt the microservices model for real."

That's when we started OpenEBS, that was towards the end of 2016.

The idea was to create an open source technology that is really oriented towards microservices, and we tried to do OpenEBS storage in userspace.

We chose Kubernetes to be our orchestration layer, or platform. We really bet on Kubernetes, although we had other choices in 2016, like Nomad.

The first release of OpenEBS had Nomad as the underlying orchestrator.

But very soon we saw Kubernetes was one of the choices for most of the adopters of microservices.

And then we moved on primarily to Kubernetes, and that's how MayaData was founded.

It's to provide data in a pure anti-cloud-lock-in manner to enterprises who were going toward a microservices model.

We wanted to build this project in an open way, very community-centric so that it gets cooked, and then really marry the management of it towards the way you do Kubernetes.

That's how I really started. I didn't have a plan to start Litmus or Chaos engineering back in 2017, but what happened was that when we built the initial version of the project, it was ready to be tested.

I wanted to adopt Chaos engineering in our internal DevOps, and we also wanted to tell the community that OpenEBS is being tested along all failure paths.

We wanted it to be demonstrated in a way that community can see.

So, "How did you do your testing?" Whoever adopts OpenEBS won't actually do that testing themselves before taking the OpenEBS bits into production.

It was around 2018 that we started writing Chaos tests.

Of course, I looked around at various choices for writing the Chaos test cases, for a framework for that.

We all know Chaos Monkey, but it was not built for Kubernetes, and we wanted Chaos tests to be completely declarative and Kubernetes friendly.

It should be native to Kubernetes, the way of doing things.

So we started writing this project, Litmus, originally for OpenEBS community and also for our own purpose.

We had a SaaS product called Director, which we now call Kubera.

That Director SaaS product also needed to be tested, because it was running completely on Kubernetes.

So we used Litmus for Chaos testing our SaaS, as well as OpenEBS.

That's really how we started with Litmus, but then we saw a lot of interest coming in from the community, with people initially saying, "This is cool."

And we said, "Let's actually turn this into a project."

We got really good encouragement from CNCF in those early days.

So CNCF started the Chaos engineering group.

We saw that there were a lot of initiatives inside CNCF to promote Chaos engineering, and Chris from CNCF really invited us to the working group and we said, "Yes. Let's actually start this project."

And we announced it at KubeCon Europe 2018, but it was really geared towards testing OpenEBS using Litmus.

Around six months down the line we realized operators were becoming a thing around the end of 2018, and I thought "Let's actually write a Chaos operator. Chaos CRDs.

Let's make it more generic among Kubernetes users on how to do Chaos. Chaos is the way to build resilience into whatever they're deploying on Kubernetes, or Kubernetes itself."

That's how we started providing a shape to Chaos engineering with Kubernetes.

Then we staffed more people, because we really believed in this becoming a need.

There were signals that a lot of interest was coming towards the CNCF ecosystem, people asking, and we saw Gremlin as well being reasonably adopted here and there.

We thought, "We need to have a Chaos engineering project in a Kubernetes-native way." So, that's really how it started.

Marc: That's great. I'd love to talk a little bit more about Gremlin and the whole Chaos engineering ecosystem, but early bets on Kubernetes--

Here we are at the end of 2020 and it seems obvious that making a Kubernetes native Chaos service is the way to go.

But back in 2016, I imagine the bet was on microservices before Kubernetes.

You were just talking about that. Can you talk a little bit more about the traction you saw in the Kubernetes ecosystem that made you decide to go all in on Kubernetes specifically for the product that you're building?

Uma: We had two different choices at the time.

Generally, "microservices" at that time meant Docker. That was the word that came to mind.

Then you had three choices to orchestrate these containers: Kubernetes, Docker Swarm, and Mesosphere.

These were the popular choices. What led us to believe in Kubernetes was CNCF.

We saw that the early KubeCons around 2017 were awesome, and it was very natural.

The community is going to come around it, and there was a strong force trying to put the community together.

It was clear to us that Kubernetes was community-centric and probably will lead to broader growth or adoption.

Then we did a few interviews with very large friends or customers who were at a similar early stage of adopting containers as a strategy.

They were using Docker or Mesosphere at that time, but they had longer term plans to consider Kubernetes as well.

We had one choice as a startup. We had to place our bets, and we said, "Let's place it on Kubernetes."

I think a year down the line it was very clear that was the right one.

I personally believe that the entire cloud native ecosystem has gone down a successful path mainly because of the foundation, how it is being run, and the transparency.

The driving force is pretty clear.

Marc: That's great. You mentioned that you originally started out creating Litmus as a way to prove and show that transparency into the testing that the team was doing around OpenEBS and the other products that you were shipping.

I'm curious, what was required for you to take that project that was dedicated to testing microservice storage?

It was probably pretty purpose-built at the time, and then you turned it into a more generic solution to test any microservice architecture on Kubernetes.

Did the Kubernetes API help with that? What did it take to make that transition?

Uma: First of all, there is why we felt there was a need for a broader set of users.

And second is what it would take for us to move towards a common API.

On the first one, OpenEBS is a data product, so you need to test your data, and it is the most critical thing in your application stack.

You cannot lose data at any point.

To test your data, you have to accept that things can go wrong anywhere: in the infrastructure, within your software, or with the application itself.

It really meant that you had to do Chaos testing at all levels.

You had to see who's consuming your data, and what happens if that pod is moved from here to there. Will the OpenEBS containers still serve the data, and will the application continue to live on without any disconnection?

So, how do you introduce that Chaos?

Then within our OpenEBS architecture, if the target pod goes away, or hangs, or slows down severely, what happens to the application?

Really, the observability was at the top layer, and the Chaos could go anywhere.

We ended up writing the Chaos tests originally in Ansible, the normal way for applications.

So, "Instead of databases and then OpenEBS and then underlying, let's take the node CPU very high and let's eat some memory in some of the nodes and let's fill up the disks. Let's lose the disks."

These kinds of things. And then the API that really led us to believe that "Let's create an infrastructure in a very generic way for Chaos" was the CRDs.

I think around 2018 the CRDs came into the picture, I think that was one of the announcements that happened in KubeCon.

Then came the operator around CRDs, so there was a very clear path that was visible on all applications that need to be--

They will be built and orchestrated by themselves with Kubernetes as the underlying substrate.

That really led us to believe, "We should use that API, the operator API, and the CRDs to build a very generic framework for Kubernetes Chaos."

The underlying design is an orchestration layer for Chaos at the top, and below that layer are the actual experiments and different tests that you can do.

So, CRDs and operators, these were the ones that really led us there.

Chaos testing was the original need, but these two things together with that need are really what Litmus is now.

Then we are building much more on top of that right now, but that's probably how we ended up here.

Marc: That makes sense. Back when you were creating it, there were other Chaos products out there. Gremlin comes to mind.

Right around the same time that Litmus was added into the Sandbox, there was another project called Chaos Mesh, and I'm curious if you can talk a little bit about the motivating factor to continue working on an independent project.

Maybe, what are the differences between Litmus versus Chaos Mesh and Gremlin, and where the strengths are of your product?

Especially use cases that it handles really well?

Uma: The difference between Gremlin and Litmus is that we are open source and we're cloud native, in the sense that Chaos is built, orchestrated, and probably monitored as well in a cloud native way through YAML constructs and YAML manifests.

So, if an organization or enterprise is looking at doing things in a cloud native way, Litmus is probably the better choice.

You can do GitOps, you can automate highly scalable Chaos workflows using Litmus, because everything is YAML and you can use Argo Workflows and ArgoCD for continuous deployment, and then really build the real DevOps around Chaos.

That's probably how I would differentiate between Litmus and Gremlin.

It's a completely open source project, and even new features that are being built, like Litmus Portal, which is an observability platform for Chaos, are also completely open right now.

It's part of CNCF, so it's co-owned by a larger community.

As for the difference between Chaos Mesh and Litmus, I think both of us started around testing storage-related projects.

TiDB and OpenEBS are both stateful workload-centric, and they probably also felt there was a need for a framework for testing TiDB, exactly as we felt with OpenEBS.

But primarily, the difference is we are considering it a little bit more like a framework in addition to the actual experiments. For example, Litmus is Litmus Orchestrator, Litmus Hub and Litmus Experiments, Litmus Operators. You can envision Chaos engineering or resilience engineering for an organization, "What is my Chaos engineering strategy for the next two years? What is the framework that's needed?" Litmus will come into that picture.

Also, technically, the bigger difference between Litmus and Chaos Mesh is the way we introduce the Chaos.

To introduce Chaos with Litmus, you don't need to change the application spec. You stay outside of the application spec and introduce Chaos through the Kubernetes APIs.

You don't need to sit inside a sidecar and then do that, which could be a big problem for many users.

My application itself is being managed through GitOps and that cannot be changed.

Even if I go and put it in a sidecar, it could be removed by somebody.

Or by some change that's happening on the application manifest.

So, we wanted to not patch the application spec. That is the strategy that we took.

What I heard, or at least what I know of Chaos Mesh, is that you need to go and patch the application spec, which will have some advantages in some areas.

But that's one of the primary differences.
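To make the "stay outside the application spec" idea concrete, here is a minimal Python sketch that builds a ChaosEngine-style manifest as a plain dictionary. The target application is referenced only by namespace, label, and kind, so its own manifest is never edited. The field names follow the `litmuschaos.io/v1alpha1` ChaosEngine schema as commonly documented, but treat them as assumptions and check them against the Litmus version you run. The helper `build_chaos_engine`, the engine name `nginx-chaos`, and the service account naming are hypothetical.

```python
import json


def build_chaos_engine(name, namespace, app_label, app_kind, experiment, duration_s):
    """Return a ChaosEngine-style manifest as a plain dict.

    The application is selected by label; nothing is added to its own spec.
    Field names are assumed from the v1alpha1 ChaosEngine schema.
    """
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Target selection lives here, outside the application manifest.
            "appinfo": {
                "appns": namespace,
                "applabel": app_label,
                "appkind": app_kind,
            },
            # Hypothetical service account name for illustration.
            "chaosServiceAccount": f"{experiment}-sa",
            "experiments": [
                {
                    "name": experiment,
                    "spec": {
                        "components": {
                            "env": [
                                {
                                    "name": "TOTAL_CHAOS_DURATION",
                                    "value": str(duration_s),
                                }
                            ]
                        }
                    },
                }
            ],
        },
    }


engine = build_chaos_engine(
    "nginx-chaos", "default", "app=nginx", "deployment", "pod-delete", 30
)
print(json.dumps(engine, indent=2))
```

Because the result is ordinary data, it can be serialized to YAML or JSON and committed to a Git repository, which is what makes the GitOps flow described earlier possible.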

The other one is the hub itself, with hub and the CRD infrastructure, what happens is you can bring your Chaos logic onto Litmus infrastructure very quickly.

For example, there are a few other Chaos engineering projects, like Pumba, or PowerfulSeal from Bloomberg.

We have integrated at least some of these experiments of these two projects into Litmus originally.

Now we have native experiments that are native to Litmus for some of this Chaos.

But it was very easy for us to run specific Chaos that was written in Pumba or PowerfulSeal inside Litmus, because you just put that into a Docker container image and call it as a library inside our experiment spec.

We call it "bring your own Chaos."

If you are already doing some Chaos and you want to take advantage of the framework overall, it's probably a day's work for a developer.

Say you had five experiments that you were using, and then you just made them compatible with Litmus.

Now you don't just have 5, you have another 32.

You now have 37 experiments that you can go and do your Chaos magic on your Kubernetes.

That was another advantage. It really is the hub and the way we run Chaos without touching the application.

Marc: That's interesting.

It sounds like a lot of the design decisions around Litmus were driven by the way you see organizations being able to integrate Chaos Engineering into their workflows.

I'd love to jump in and discuss for a little bit, if you have recommendations around best practices for adopting Chaos engineering in general, should the developers who are writing the code be doing it?

Should a separate team of SREs be writing the experiments and managing that?

Do you have any tips for somebody who's just getting started with Chaos?

Uma: Yes, I do. I mostly learn from how I see people reacting to Chaos.

So far I've heard two sides of the stories.

"I'm really scared to practice Chaos engineering in production. Even if I am convinced, my management is not convinced. My developers are not going to be happy if I am going to intentionally start breaking things."

This is the first reaction. But in general, the SRE community has been increasingly open to the idea of the Chaos first principle.

There is a lot of advocacy that has been happening in the last few years, right from Amazon and CNCF, and a few other folks.

Netflix, of course, has been a great promoter of it. SREs do know that Chaos engineering is an inevitable choice, and the real question is the developers.

That's also one of the reasons why we created the CRDs, so that it's very developer friendly.

For example, you can create a PVC when you're developing an application and you're consuming the storage underneath.

All you do is create a Kubernetes object and then apply that object.

You get your resource. Similarly, as an extension of that, even before your module's code goes into deeper integration, whatever Chaos can happen there, a developer can easily write that Chaos.

What we've been recommending is that SREs should really encourage developers to start writing Chaos tests as part of their development infrastructure or ecosystem itself, so that they get used to the idea that the organization is going to introduce Chaos at some point anyway.

It's better to work with SREs and work with Chaos engineering rather than just delaying that.

With Litmus, it is easy for developers to use Chaos in their development lifecycle itself.

My best practice is, developers should start doing Chaos. For example, pod deletes in those CI pipelines, just do some simple Chaos tests and see what happens.

You don't need to go for complex workflows. In addition to your unit testing and integration testing, do some Chaos testing, because it's easy to do and you do learn something.

With all the simple experiments that are available on the hub, you just bring one and attach it to your application or whatever you want to introduce Chaos on, and then you're done.
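As a rough illustration of a simple Chaos step in a CI pipeline, the sketch below gates a build on experiment verdicts. It assumes each result has already been reduced to a small dictionary with a `name` and a `verdict` of "Pass" or "Fail"; that shape, and the function `chaos_gate`, are hypothetical simplifications and not the layout of any Litmus resource.

```python
def chaos_gate(results):
    """Return (exit_code, failed_names) for a list of simplified result dicts.

    Any verdict other than "Pass" (including a missing one) fails the gate,
    so an experiment that never finished does not slip through.
    """
    failed = [r["name"] for r in results if r.get("verdict") != "Pass"]
    return (1 if failed else 0), failed


# Hypothetical results collected after running two hub experiments in CI.
sample = [
    {"name": "pod-delete", "verdict": "Pass"},
    {"name": "container-kill", "verdict": "Fail"},
]

code, failed = chaos_gate(sample)
print(code, failed)  # prints: 1 ['container-kill']
```

A pipeline step could pass `code` to `sys.exit` so that a failed experiment blocks the merge in the same way a failed unit test would.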

That really starts the mindset of developers as well, and it encourages the SREs to do further and more difficult Chaos tests.

SREs themselves, I think, don't go directly into production. They take some time to introduce Chaos into production, probably six months to a year sometimes.

But it's good to start with your long running testbeds, your staging, all of that, I think, and there's no reason why SREs should not be using it.

Because anyway, if you take some time to learn and tune, a bug fixed in staging is a bug fixed in production.

There may be more bugs that are specific to production, but at least you're finding something that would eventually be found as a weakness in production.

My recommendation: definitely start early, and start with developers as well. Kubernetes upgrades are another driving force for Chaos testing.

They're happening quite often. The way I've been observing it, you just upgraded Kubernetes now, and 3 months down the line you have one more version available.

You need to do that, and if your services are in production and the scale is large enough, then it's a big thing to do the upgrade.

You have to do it in an automated way, and CD is available for that.

Continuous deployment-- So you have to really inject the process of testing before that upgrade, and you can automate that very easily.

Chaos engineering is almost becoming an extension to the existing practices in DevOps. As with any automation, it takes time in the beginning, but you reap the benefits very soon.

Marc: Yeah, for sure. The ability to use GitOps with Argo CD or Flux CD or any other GitOps tool to deploy the experiments is super cool and interesting.

It sounds like also, the advice is if you're getting resistance from breaking things in production intentionally, there's still a ton of value in using Litmus on a pre-prod or even just in the CI process.

Get comfortable there, start fixing the low hanging fruit, the bugs that come up.

Eventually down the road, start making the case for running it in production.

Uma: Yes. CI pipelines are the first target, of course.

We started using Litmus in OpenEBS.CI.

So, that's the CI pipeline for-- It's an end to end testing pipeline for OpenEBS, we call it "OpenEBS.CI," but the pipelines are the definite starting place.

We also have, similar to the other tools, that's another thing that we're seeing that is a trend.

You can do that in CI pipelines, you can build your Chaos testing into that, but there is another trend that's happening as a Chaos CI type.

Where you have drivers, you have GitLab, GitHub actions, and why can't you have Chaos actions?

That gets executed before the PR gets merged, so it's an additional pipeline, just a Chaos pipeline itself.

It could be run against a target. I mean, we have so many bots that are coming up now in your pipelines to test.

A PR is approved, so that's great. Then it went through all the pipeline testing, and now I want to do additional testing.

"Let's build up a Chaos testing pipeline and then use a bot to go and kickstart it and do it. You need not touch your existing CI pipeline, you can have an additional Chaos CI pipeline."

So that's picking up, we are doing it and then some people are appearing.

But I see that going forward, Chaos testing will become part of the developer's mindset as well.

They will be scared initially, but you have more freedom once the PR is merged.

Yes, I know that it was all tested, and whatever the SREs are going to test is much more complex use cases or Chaos workflows.

But as a developer, I've seen the code standing up against a certain set of failures, and that gives more freedom to the developer's mind.

Marc: That's great. I think you mentioned earlier that SREs and folks working in the cloud native ecosystem, they see Chaos as the inevitable choice.

It's an interesting way to think about that, because they're running Kubernetes clusters, and their pods, their workloads, are in the cloud.

It's an inevitable choice because, whether they realize it or not, they're working in a chaotic environment and they're getting Chaos anyway.

Pods, nodes, things, networking, everything is going to be breaking on them at all times.

Uma: We call it the Chaos first principle.

You're an SRE, you're starting on your ops strategizing.

You build your infrastructure, you build your upgrade strategies, and operational strategies.

Don't bring Chaos strategy later, bring Chaos strategy on day one.

I build my infrastructure and I also introduce Chaos from day one.

Chaos cannot be postponed. This I heard from many engineering practitioners, I mean those who are SREs, and also from known advisors like Adrian Cockcroft from AWS, who is a big promoter of Chaos engineering.

I really like the way he advocates Chaos first principle.

And it's true, we wanted Chaos to be the first choice. That's why we created Litmus, to have our own SaaS ops.

Marc: Great, and you've doubled down. You've continued to work on it.

This was originally an internal testing tool at MayaData, and then you made it an open source project.

You really doubled down on it.

I'm curious, are there any examples that come to mind where it caught some error, some problem for you in a CI pipeline, in pre-prod or in a production environment, and it saved you from actually troubleshooting and trying to repair a stateful service that was broken?

That just made you realize all the stories about Chaos engineering are true.

We're going to make this an integral part of how we think about validating every release before we ship it.

Uma: Sure. There are two experiences that come to my mind, one in OpenEBS.

One of the challenges for us was volumes going into read-only mode.

It was not easy to reproduce, it takes-- Sometimes you need to bring down this volume and then at the same time the node also goes down, and then it comes up in the rebuild process.

That is a sequence in which these volumes can go into read-only. A lot of community users were using it successfully, but the most common error that we had seen at that time was people reporting read-only volumes.

We did test some negative test cases, but it was not reproducible, and then it was taking a lot of time for our automation developers to reproduce it.

We wanted to automate that, and we wanted to write various Chaos test flows and scenarios.

Scenario number one is, "Do this, do this, then bring it up and then wait for some time, bring this down."

That's scenario one, and then similarly we came up with a few more. We put those into pipelines, and within a few days you would see certain things.

Volumes were going into read-only, and then our developers instrumented the stack to see "What are the race conditions?"

And "What could go wrong?" And I think within a few weeks those bugs were fixed, which had not been possible in more than a month before.

That's an example of how reproducing real weaknesses can be automated, so the developers don't need to wait for it. They know that the system can reproduce that weakness, so you go and put in instrumentation.

I'm sure these guys can reproduce it in a day or two, so you can kickstart more CI pipelines continuously and get the bug at will.
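The idea of writing scenarios as ordered steps and replaying them until a weakness shows up can be sketched in a few lines of Python. Everything here is a toy: `ToyVolume` is not OpenEBS code, the step names are invented, and the "bug" is a hypothetical stand-in for the read-only condition described above.

```python
class ToyVolume:
    """Toy stand-in for a replicated volume; not OpenEBS code."""

    def __init__(self):
        self.replica_up = True
        self.node_up = True
        self.read_only = False

    def apply(self, step):
        if step == "replica_down":
            self.replica_up = False
        elif step == "node_down":
            self.node_up = False
        elif step == "bring_up":
            # Hypothetical bug: recovering while both the replica and the
            # node were down leaves the volume stuck in read-only mode.
            if not self.replica_up and not self.node_up:
                self.read_only = True
            self.replica_up = True
            self.node_up = True
        else:
            raise ValueError(f"unknown step: {step}")


def run_scenario(steps):
    """Replay the steps in order on a fresh volume; True if it ends read-only."""
    vol = ToyVolume()
    for step in steps:
        vol.apply(step)
    return vol.read_only


def find_reproducers(scenarios):
    """Return the names of scenarios that reproduce the read-only state."""
    return [name for name, steps in scenarios.items() if run_scenario(steps)]


scenarios = {
    "scenario-1": ["replica_down", "node_down", "bring_up"],
    "scenario-2": ["replica_down", "bring_up", "node_down", "bring_up"],
}
print(find_reproducers(scenarios))  # prints: ['scenario-1']
```

Once a scenario reliably reproduces the problem, it can be run on every pipeline execution, which is what lets developers add instrumentation and get the bug on demand.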

That was one thing, and then the second thing was in the cloud ecosystem that we have--

I don't want to name that cloud service provider, but the nodes were going down pretty often.

There was an instance where all the nodes were in not-ready state for a few minutes.

One node going down is OK, but all nodes going down were a problem.

Our SaaS service went down once, primarily because there was no test case for all the nodes becoming unavailable and then rebooting.

You don't expect all the nodes of a Kubernetes service on a cloud service provider to go down at the same time, but they do go down.

Or, they did in our case.

In our staging environment we introduced this case where randomly, not at a particular time, you just bring down the nodes "ABCDE" one after another without giving much time, and then we saw whatever problems we could observe, and then we fixed them.

I think sometimes you can start from a particular thing that just happened to you, and Chaos engineering will-- This infrastructure will help you just quickly set it up.

The Chaos workflow can be set up, because you have a clear idea on what it is and you can be proactive in trying to imagine what can go wrong.

But at least to start with, "Yes. I burned my hands. Let me just do the same testing in staging, and developers can come and--"

They're more enthusiastic now, because you're not just telling that there's a problem, you're there to come back with a solution.

You're helping them on staging environment, pre-prod environment, "I can reproduce it. Come, let's solve it."

That was a great example where our devs and the SREs worked together rather than just pointing fingers, like "Yeah. My code is good and it won't be reproduced again," something like that.

Marc: Great. And then I'm assuming that a lot of these learnings in the early days are what drove the experiments that you're shipping in the hub right now.

If I just get started and install Litmus into my cluster and look in the hub for some default experiments that I can run, some of these are going to be some of the lessons that you've learned the hard way.

Battle tested, like "Here are good ways to get started."

Uma: I think most of the generic experiments that you need to do Chaos are already there, and they're battle tested for sure.

For example, the pod memory hog and node memory hog extensions were added by a community user after they used Litmus in production on a very large scale.

So, that was really awesome.

It was not just our experience. Late last year we found Intuit using Litmus internally for Kubernetes Chaos engineering.

Now they're a public reference, so I can talk about them.

They have become a maintainer on the project and they've been contributing a lot, and they just now merged a lot of AWS Chaos testing that they did with Litmus back into the hub.

So they are definitely battle tested, and that's the idea of the hub where more minds are coming together and they are sharing their experience back to the community by sending a PR to an existing experiment.

And tunables are another part. The Chaos experiments are completely tunable in terms of what library you would use to inject the Chaos or how long you do that, and together with Argo Workflows, it's really amazing how quickly you can code up a complex error or scenario that you want to introduce, primarily into staging.

You don't introduce such complex ones into production, production starts slowly, but what I've seen is you know exactly what happened and you want to code it up.

So, let's go. Within a day you are actually reproducing it.

These are very much battle tested, we had about 20 experiments 6 months down the line, and now we have 33.

Recent ones were primarily coming from our community, not just our own experience.

Marc: Great. And Intuit created Argo Workflows. Adding Litmus into that feels like it's just a really good way to ensure your GitOps pipeline is working well.

Uma: Yes.

Marc: And I assume you're using Litmus to run Chaos experiments against Litmus, so I know I'm getting a high quality product when I'm putting it into my cluster, right?

Uma: Yes, we are.

It becomes a little bit complicated if you open that, "What happens if Litmus doesn't work?"

That's the Chaos you want to introduce.

So, we have used some logic of Litmus not in a Kubernetes-native way, but we run pipelines for Litmus itself, and that pipeline testing is also in the open.

What we do is, "What happens if there is an operator issue?"

The LitmusChaos operator, which is introducing or managing the Chaos itself, just stops.

So we go and break some of it by injecting Chaos from outside the Litmus cluster, and in summary we are using the Litmus mindset to test Litmus itself.

On your other point, definitely. Interactions with them led us to adopt Argo Workflows in a very native way. The Litmus Portal is under development and coming out soon, and the code is already open source.

It's building Argo workflows natively into the Litmus Portal itself.

So, install Litmus Portal, and you can create an Argo workflow by taking all these experiments and putting them together.

We in fact have some predefined templates for these Chaos workflows. You can just run them one by one, and I'm pretty sure you will have some learnings just by running them once.

Marc: Yeah, those early design partners, like Intuit for you, are just so instrumental to building a good product that's solving and meeting customer demand.

Talking about that for a minute, I'd love to understand more about what types of experiments Litmus is capable of doing today.

You mentioned a few around memory hog for a pod and killing a node, but how deeply does it integrate with the cloud provider or with Kubernetes?

What types of stuff can I do currently with Litmus?

Uma: We group them into two types of experiments: generic versus application specific.

Generic really means that they are Kubernetes specific.

Kubernetes has got a lot of resources, like a pod, a container and a network, etc.

Whatever Chaos you introduce against these resources, we call generic Chaos.

Among the generic Chaos experiments, there is another category called infrastructure Chaos.

You want to bring down a Kubernetes node, or you want to introduce node CPU hog and memory hog.

So overall, inside the generic experiments we have pod delete, container kill, pod memory hog, CPU hog, etc.

Network is another important thing. There can be latency, and there can be complete loss of network between two pods.

Sometimes network corruption can also happen, or the packets don't come.

Network duplication can come in.
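The four network fault types mentioned here (latency, loss, corruption, duplication) can be illustrated with a small simulation over a list of packets. This is only a toy model for reasoning about what each fault does to traffic; it says nothing about how Litmus injects these faults, and the function `inject`, the 200 ms delay, and the packet shape are all assumptions made for the example.

```python
import random

FAULTS = ("latency", "loss", "corruption", "duplication")


def inject(packets, fault, rate, seed=0):
    """Apply one fault type to a list of (payload, delay_ms) packets.

    Each packet is affected with probability `rate`. A fixed seed keeps
    runs repeatable, which matters when comparing before and after a fix.
    """
    if fault not in FAULTS:
        raise ValueError(f"unknown fault: {fault}")
    rng = random.Random(seed)
    out = []
    for payload, delay_ms in packets:
        if rng.random() >= rate:
            out.append((payload, delay_ms))          # packet untouched
        elif fault == "latency":
            out.append((payload, delay_ms + 200))    # assumed extra delay
        elif fault == "loss":
            continue                                 # packet never arrives
        elif fault == "corruption":
            out.append((payload[::-1], delay_ms))    # payload scrambled
        else:  # duplication
            out.extend([(payload, delay_ms)] * 2)    # packet arrives twice
    return out


packets = [("ab", 10), ("cd", 20)]
for fault in FAULTS:
    print(fault, inject(packets, fault, rate=1.0))
```

Running an application's client code against such a model in a unit test is one cheap way to check that retries, timeouts, and idempotency hold up before trying the real experiments in a cluster.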

So we have some network related experiments in these areas, and the other one that really came at one of the meetups.

A user was saying, "It's great, but we really had a problem with our Kubelet service itself becoming not responsive the way we expected it."

With public cloud service providers who manage the Kubernetes master nodes for you, if there is an issue with the Kubelet service they'll restart it immediately.

But if you are doing your own Kubernetes master management, you probably have to be aware that you have to automate it. It means that the Kubelet service can go down, and it's like your heart stopped beating for some time.

There is-- There's no Kubelet service redundancy. What happens to your services at that time?

You want to test that scenario, so we introduced a Kubelet service kill experiment. Docker service kill is another one, since the underlying Docker service can go for a toss.

So these are what we call generic, and then there are the application-specific ones.

Application specific is where we are pretty excited about going further. Once you are finding weaknesses related to the faults inside Kubernetes resources, your resilience has reached a certain level.

But applications themselves can go for a toss.

For example, we know OpenEBS very well and we know exactly what can happen inside an OpenEBS resource.

For example, just making the service stop responding for a certain amount of time.

How can you even reproduce that fault scenario?

Only OpenEBS developers can write that as an experiment, and then share it with OpenEBS users.

It applies to whoever is writing the CI pipelines, even complex CI pipelines, whether the project is open source or not. If they want their users to introduce such application errors in their production or staging environments, it's a chance for them to share those experiments.

We have some for Cassandra, CoreDNS, and OpenEBS, and more are coming. Redis, I think, is on its way.

We expect the community to realize the value of uploading those experiments specific to their application, so that their users can make use of them.

It's all about being transparent with your users.

There is a problem that can happen, so don't hide it, but just tell them, "This is how you can introduce a failure in one of my application scenarios.

Here is a solution, here is how you configure your application to avoid it."

It's better to be in that mode rather than just document it somewhere as a known issue, and let your users face that issue over and over.

That was the motive for us at OpenEBS, just to tell users, "This is the Chaos testing that you can do. Go break OpenEBS and see still--"
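[Editor's sketch] The generic experiments Uma lists can be pictured as a small catalog plus a declarative request to run one of them against an application. The following is a minimal sketch only, not the Litmus API: the hyphenated experiment identifiers, the manifest field names (modeled on the Litmus ChaosEngine custom resource), and the helper `build_chaos_engine` are all assumptions for illustration and should be checked against the Litmus documentation and ChaosHub.

```python
import json

# Experiment names as listed in the conversation, grouped the way Uma
# describes them. The hyphenated identifiers are assumptions; check the
# Litmus ChaosHub for the exact names.
GENERIC_EXPERIMENTS = {
    "pod": ["pod-delete", "container-kill", "pod-cpu-hog", "pod-memory-hog"],
    "network": ["pod-network-latency", "pod-network-loss",
                "pod-network-corruption", "pod-network-duplication"],
    "infrastructure": ["node-cpu-hog", "node-memory-hog",
                       "kubelet-service-kill", "docker-service-kill"],
}


def build_chaos_engine(name, namespace, app_label, experiment, duration_seconds):
    """Return a ChaosEngine-style manifest as a plain dict.

    Field names are assumptions modeled on the Litmus ChaosEngine
    custom resource; verify them against the Litmus documentation.
    """
    known = {e for group in GENERIC_EXPERIMENTS.values() for e in group}
    if experiment not in known:
        raise ValueError(f"unknown generic experiment: {experiment}")
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "appinfo": {"appns": namespace, "applabel": app_label,
                        "appkind": "deployment"},
            # Hypothetical service account name, for illustration only.
            "chaosServiceAccount": f"{experiment}-sa",
            "experiments": [{
                "name": experiment,
                "spec": {"components": {"env": [
                    {"name": "TOTAL_CHAOS_DURATION",
                     "value": str(duration_seconds)},
                ]}},
            }],
        },
    }


if __name__ == "__main__":
    engine = build_chaos_engine("nginx-chaos", "default", "app=nginx",
                                "pod-delete", 30)
    print(json.dumps(engine, indent=2))
```

The sketch only builds a dictionary. Serializing it to YAML and applying it to a cluster is left out on purpose, since that step depends on the installed Litmus version.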

Marc: That's great. I think as applications are being packaged as operators and CRDs today, they're starting to rely on the underlying Kubernetes infrastructure even more.

A lot of them will use etcd and CoreDNS a lot more internally.

It'd be great to live in a world where I'm deploying a Helm chart, and the Helm chart comes with Chaos experiments that I can enable by turning that on in the values YAML if I want to.

Uma: Install it and then run something, yeah.
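[Editor's sketch] The idea Marc describes, a chart that ships with Chaos experiments that stay off until the user turns them on in the values file, can be modeled in a few lines. This is a hedged illustration only: the `chaos.enabled` and `chaos.experiments` keys are hypothetical and are not an existing Helm or Litmus convention, and the merge function only approximates how chart defaults are overlaid with user-supplied values.

```python
from copy import deepcopy

# Hypothetical chart defaults: experiments ship with the chart but are off.
DEFAULT_VALUES = {
    "replicaCount": 1,
    "chaos": {"enabled": False, "experiments": ["pod-delete"]},
}


def merge_values(defaults, overrides):
    """Recursively overlay user overrides on chart defaults."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def experiments_to_install(values):
    """Return the bundled experiments only when the user opted in."""
    chaos = values.get("chaos", {})
    if not chaos.get("enabled", False):
        return []
    return list(chaos.get("experiments", []))


if __name__ == "__main__":
    # Default install: nothing chaotic gets deployed.
    print(experiments_to_install(DEFAULT_VALUES))
    # Opt in, as a user would with an override in their values file.
    user_values = merge_values(DEFAULT_VALUES, {"chaos": {"enabled": True}})
    print(experiments_to_install(user_values))
```

Keeping the flag off by default reflects the opt-in behavior Marc asks for ("if I want to"): installing the application alone never injects faults.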

Marc: Now that Litmus is in the sandbox, I'd love to move on and talk about the roadmap and the plans.

What are you guys thinking about? What's next right now?

What are the current tasks that you're working on, and challenges?

Uma: Right. I think for us, observability is the next piece in Chaos engineering.

We did the work over the last 18 months to make sure that your design is right, your design is verified, and you bring in more maintainers.

You just don't preach that your design is right, but have some real references where people have adopted you because your design is good, and we got a great response in that respect.

We believe the basic infrastructure from a design perspective is complete and we have enough use cases, and the next one is observability.

So what if I have a great tool or great infrastructure for introducing a fault, and I want to understand what exactly happened when the fault was introduced? This is typically where the complex issues are buried.

For example, the last upgrade went fine. But after I upgraded this time, something is not working right.

This is what my staging is telling me, so you want to go and verify through observability.

You look at your stats and your graphs around the losses now, and try to correlate.

You want to get a visual display of the context of various resources when the Chaos happened last time versus now.

So, we are introducing-- That's one, and the other one is the learning that Chaos workflows are a need.

With Argo Workflows, it's one thing to say that you can automate it yourself using an Argo Workflow, versus "Here is an Argo Workflow, here are ten Argo Workflows. Just go run them and let your team see what happens."

So, that's where Litmus Portal is coming in. I think it's a big project for us.

It'll take an extra 2-3 quarters to get it completely done, but we're releasing an alpha of it next month.

For incubation, I think it's the due diligence that matters since the process is clear.

You need more companies, more maintainers from various companies, rather than just the primary sponsor.

At MayaData, I think we do have a good pipeline in that context.

Our list of maintainers is at the level of incubation already, and we need to have good reference users, with large production deployments using Litmus out there.

I mean, we already have you, but I'm pretty sure in the coming months a few more will come.

I think that's good for incubation, and I'm thinking incubation can happen quickly the way Litmus is situated right now.

But for us, graduation really means a lot of application-specific Chaos experiments.

We are trying to do I/O Chaos, and filesystem Chaos is another thing that keeps coming up.

Naturally, all those popular projects should see Litmus as a first-class citizen.

The way you mentioned with Helm: "Let me pack in some Chaos experiments that I think should be installed."

That kind of a thing will happen soon after.

It's one thing leading to the other: incubation first, then graduation, with more CNCF projects using Litmus as a natural choice.

Then it's driven by community, so that's where I would think-- We also see some cloud service providers expressing interest, "This is good."

You gave just the Helm chart example. A cloud provider gives you a cluster and says, "So see it for yourself before going too much into production." You can use Chaos testing as a service.

On the MayaData side, we have plans to build some kind of a solution using Litmus, but that's another thing that I see happening using Litmus: Chaos testing as a service by clouds.

Marc: That's great. The portal sounds super interesting.

I'm curious now that Litmus is in the sandbox, what's the best way for people to get engaged?

Are you looking mostly for people to write experiments, contribute to the product, or just create use cases and run it in their environments?

Uma: Great question, actually. This is the confusion we also had.

There is so much interest coming in. First and foremost is usage: you tell us whether things are working the way you want, and people obviously come to the Slack channel saying that something isn't working.

So you found an issue, and that's happening. Our call-home metrics show around 40-50 operators being installed almost on a daily basis.

That was some time ago, and then because of the ecosystem that is there, people are very much inclined to contribute back.

We see people go extend the experiment and add it back to the hub, that's definitely happening.

Another thing with Sandbox is the use cases, you mentioned "Will they come back with new use cases?"

There is observability, there is Helm deployment, so what we did is we created SIGs within Litmus, driven by the enthusiasm of the CNCF model.

We have multiple SIGs within Litmus: an observability SIG, a deployment SIG, an experiments SIG, etc.

We are seeing small teams being formed within the community, and some person says, "I'm interested in actually submitting a Grafana chart for a particular application, because I'm an expert in that. How do I actually overlay Chaos with a particular application on my chart?"

There is a lot more development that needs to be done.

We have tons of applications, and we need Chaos charts around those applications.

We have the observability SIG, and we introduced that model just last week.

We expect the community to form their own groups, and we, at least the core team, are trying to be available as mentors to give them the initial start, with somebody eventually leading each group as we go beyond incubation.

Marc: It sounds like Litmus was pretty mature for entering the sandbox.

How actively were you involved in the process of actually submitting this to the sandbox?

And do you feel that was an easy effort, and worth it in the end?

Uma: If it were today, I would have found it much easier, since the TOC has recently clarified a much easier process for getting into Sandbox.

But we were around the time where the process itself was being updated, and I did feel at some point that "We have enough evidence to do the incubation directly."

The Sandbox process was in limbo at that point, and I felt "Maybe we should have applied to incubation."

But they quickly came back and said that "These are the guidelines for a project to be accepted into Sandbox, and the TOC is going to vote on it depending on whether a project has adhered to those guidelines."

Guidelines are simple enough, you need to prove that there's a community value.

It should be around the overall cloud native ecosystem and it should benefit it, and the questions were good.

I think now it's much easier, but there was a time a few months ago. For example, we applied in March and we waited for about 3 months just for the process to become clear.

There were many projects in the same boat, that's why you see a bunch of projects, around 10 or 11 of them, being accepted in one shot. But after that, I think it's pretty easy.

Marc: We're actually really excited about that. I think lowering the bar and making it much easier for sandbox projects is great. Not all of them are going to make it out of the sandbox, but it's really great to get some more eyes on the whole ecosystem in general.

Uma: Yeah. CNCF has earned that image. I guess everybody looks at it as, "OK. You are in Sandbox, that means there is some degree of due diligence that is already put in, so I should go look at it."

It builds enthusiasm for the maintainers as well, and it's easier to build an ecosystem around that.

You get more ideas, more people are trying out Litmus for example.

We had 50,000 experiments in July, and now we're already at 100,000 in just a few months.

The rate at which Litmus is being used is much higher after the Sandbox acceptance.

It definitely helps from a community adoption point of view.

Marc: That's a really good metric there.

And you have another project. MayaData has OpenEBS, which has been around in the sandbox for quite a while.

It's a little bit off topic from Litmus, but I'm curious how you've seen-- That one's been around for so much longer.

Do you see OpenEBS also graduating into the incubation phase pretty soon?

Uma: Yes, definitely. In fact, we just submitted that for incubation stage last week, or early this week.

It's data, and data is a little more complex in terms of the standards, but it's a lot more critical and valuable.

So, everybody needs a proper-- I mean, it's a barrier to adoption within an organization. How successful your Kubernetes implementation becomes really depends on how good your data layer underneath is.

The project itself is hugely adopted, we just went through the annual sandbox review process and it got cleared.

Some of it is really busy, and it takes some time to go through this process and then put it out there, but I think in the next few months we should see the OpenEBS project graduating into incubation.

We have some real large-scale adoptions, so for due diligence, I don't think that's an issue.

It's just that the process that is required to submit for incubation had to cover certain guidelines, documentation, and all that.

I think we submitted that, and we should see that sailing through.

Marc: That's great. It's really cool that you guys are building Litmus out in the open and started putting it into the sandbox pretty early, and taking advantage of that process and that adoption.

It's great to see those and hear those metrics around the increased use, the visibility that the sandbox has created for the project.

Uma: Yes, awesome.

Marc: Great, that's a lot of really great information about Litmus and Chaos engineering principles in general.

I think it's-- Hopefully everybody's running Chaos right now, or running intentional Chaos. Chaos first principles, Uma did we--?

Is there anything else that we missed that we should talk about?

Uma: No, I think-- Thanks for asking wide ranging questions, Marc.

We did cover the past, present and future.

The final words from me are "Follow the Chaos first principle."

It's easy for developers to imagine Chaos as an extension to your code, and if you follow that principle you will find bugs before somebody else finds them.

I think Chaos engineering will become a natural next step for both developers and SREs, and there's more clarity coming around.

I'm happy to be part of this journey and looking forward to getting more feedback through the community and contributions from the community.

Marc: Awesome. Thanks, Uma. Looking forward to continuing to try Litmus and watching it mature and grow through the CNCF ecosystem.

Uma: Thanks, Marc.
