In episode 31 of The Kubelist Podcast, Marc and Benjie speak with Katrina Verey, Senior Staff Production Engineer at Shopify. This conversation examines Shopify's extensive adoption of Kubernetes, as well as the Kustomize project, a tool for addressing configuration management.
Marc Campbell: Hi there. As you've just heard, we have Katrina Verey on with us today. Katrina is currently a senior staff software developer and production engineer at Shopify. She's also a project owner and maintainer of Kustomize and a SIG CLI co-chair and tech lead. We're going to have a fun and pretty technical conversation here today. Before we start, of course Benjie is here too. Benjie, how's it going?
Benjie De Groot: It's going well. Excited to talk to Katrina, there's a lot of questions to ask.
Marc: For sure. All right, let's dive right in here. Katrina, we'd love for you just to start out by telling us a little bit about you and how you ended up working in the Kubernetes ecosystem.
Katrina Verey: Hi, I'm really happy to be here. My background is maybe a little non traditional. I actually studied translation in university and worked for the government of Canada for about five years in the translation and editing space. As much as I enjoyed that career, I decided it wasn't what I wanted to do for the rest of my life so at that point I started exploring what other options might make sense for me.
I stumbled on Rails Bootcamp, so I actually went to a Rails Bootcamp, discovered a love of programming while in that program and from there managed to get a position at Shopify as a developer advocate intern. In that position I worked on the Shopify app store which is the third party ecosystem of apps that plug into Shopify for our merchants to use, and I got a lot of really great mentorship as part of that opportunity that really let me solidify my developer skills.
From there I moved onto product work at Shopify, worked on the orders team and the storefront editor in particular. I had a mentor during that time because having this non traditional background, I had a lot of gaps in my knowledge that filling them would really help me with my growth as a software developer. My mentor was really helping me with that, and one of the things that he was emphasizing, I'd say around the beginning of 2016, was just infrastructure knowledge.
So one day he pointed out to me that Hack Days were coming up and there were a few infrastructure related projects as part of that Hack Day, Hack Days being an opportunity for developers at Shopify to spend a few days experimenting on random projects that don't have anything to do with their normal day to day work.
Anyway, my mentor was suggesting to me like, "Hey, why don't you take this opportunity to join an infrastructure related project and learn something new?" I thought, "Wow, that sounds like a great idea." So I went and I checked out the projects and the one that stood out to me was, basically, "Let's try this new Kubernetes thing. Let's check it out and see how it works, see if we can run our most important app on it, see how that goes for us." I didn't know what Kubernetes was, I had no idea, but it sounded interesting so I joined that project. I helped out in the ways that I could and learned a lot, but by and large I would say I had no idea what I was doing.
Nevertheless it was a good learning experience for me and introduced me to what would be an incredibly important concept for my career because later in that year, folks with way more experience and knowledge than me who were on that team decided that Kubernetes was actually very promising and we should give it a more official try with a small team and try and build out what an experimental cloud platform based on Kubernetes could look like.
To my utter surprise, I was invited to join that team, which was just an incredible opportunity and I was very intimidated by it because, like I said, I had no idea what I was doing and I felt a huge amount of imposter syndrome around working on anything infrastructure related whatsoever. But I thought I couldn't just pass up this opportunity to learn and to grow in this area that was really interesting to me, so I joined the team and I never looked back essentially.
The team did an experimental cloud platform, it was a success and within a couple of years, all of Shopify was running on Kubernetes and I really had just the most amazing experience getting to follow that journey from the beginning.
Marc: That's amazing. It's interesting, one comment there you said at the beginning, the hack days, that, "Can we run our most important app on this new Kubernetes platform that's coming out?" Why start with the most important app? Why not start with something that's lower risk? Do you have any thoughts about why you took that approach?
Katrina: Yeah. It wasn't my decision at all, so that's not my good practice. But I do think it's a very good idea and a practice that we have in production engineering at Shopify to run towards the risk, that's literally the expression that we use. There's going to be risks associated with any project and the more you know about them, the more you can tackle them upfront and discover how much they're going to be a problem and whether or not that problem is tractable, the better off you're going to be in the long run.
So instead of hiding away for later, we tackle them head on from the start. The idea there in this particular case was, if we're going to be able to use Kubernetes as the common basis for all of our applications, that's going to need to include Shopify Core as we call it, which is the most difficult thing for any platform to handle of all the things that we had. So let's find out upfront whether or not that one's going to work.
Benjie: I love that. That's really a great thing to encourage an engineering team to be, how to approach problems. I had a question for you, were you guys already containerized at that point in 2016? Or was this a bigger lift? If you can share that, what did the infrastructure look like before Kubernetes?
Katrina: Yeah, part of the motivation for wanting to explore potential bases for a cloud platform was that our infrastructure wasn't unified at the time. We had several, maybe four or five different ways that we were running our applications. But the Shopify Core application that I was talking about being the first one we tackled, that one actually was containerized.
For that one we were very early adopters of Docker and that made it easier to do that experiment right because the containerization piece was already in place. We used a variety of other technologies to run our other applications, which was kind of a problem for us because the lessons that we learned in building for the app that really stretched us in terms of scale and technology, they weren't transferrable to these other applications.
We were kind of having to reinvent the wheel and have multiple tools and techniques for maintaining these various things, and not being able to transfer lessons that we've already learned from the big apps to the other ones as they scaled.
So that was a big motivation in wanting to create a platform and finding a common technology that could be the basis for apps of any scale.
Benjie: One thing that I know is that I'm pretty sure that Shopify is probably one of the biggest users, if not the biggest user of Kubernetes. Maybe Google and Amazon are maybe bigger users, but you guys are pretty massive users. What does your architecture, high level, look like today?
Katrina: Yeah, we certainly are very big users. I don't know how to find out exactly where we rank but we certainly have a huge amount of clusters and also apps that we run on those clusters. When we first started building the platform, we had a vision that we would have a smaller number of really large clusters with massive multi tenancy and collocation of various kinds of workloads.
The original experimental version of the platform mostly just made a distinction based on quality of service, which actually did work out for us pretty well and is still a feature to some extent of the architecture that we have today where the experimental apps don't live alongside the most business critical ones. Our experience over time taught us that it actually can work better to have a larger number of clusters that are more uniform in terms of the type of thing that they run, so we no longer collocate stateless and stateful workloads and we actually are moving towards treating clusters as disposable as much as possible.
Instead of investing in tooling to make upgrades seamless, we invest in tooling to make applications resilient to being transferred to another cluster so that same tooling is then able to be used both when we want to do a cluster upgrade because that actually looks like spinning up a new one on the new version and moving the applications over to it.
The same way as we can apply that tooling to a failover scenario, so we end up with these really robust resiliency tools for our applications that are applicable in both scenarios. As a result of this re-architecture, we now have a really massive fleet of clusters, the number is actually more than 400, which is pretty crazy. Another thing that we've invested in is tooling for managing those clusters as a fleet.
Marc: Yeah, 400 clusters is a lot. You said you were really focused on trying to have some uniformity. Is that inside the cluster or do you really have a platform team that says, "Oh great, you need another cluster or another dozen clusters. It's very easy for us to create additional clusters."
Katrina: It is really easy for us to create additional clusters for sure, that's something that wasn't always the case and we invested heavily in a few years ago to make that possible. Right now we're still improving our notion of having sets of clusters that follow a uniform topology that we have defined for them so that it's not 400 different clusters, it's 400 clusters divided into a smaller number of well defined shapes.
In terms of what I meant by collocating certain types of workloads and separating others, is that we might have one of those shapes be application clusters and that would run the stateless component specifically of the applications that the end users of our platform want to run. I should mention we run everything on Kubernetes, I didn't emphasize that enough earlier.
We run everything from stateless to stateful services that are the components of our platform all on that common technology. Some of the other clusters that we have are dedicated to a specific shape of stateful system, for example, we might have one dedicated to MySQL or Redis or Memcached, Elasticsearch, Kafka and so on, all those other services that we run.
Then we have some understanding of what's in that cluster and what it takes to move it to the new one for the purposes of upgrades, that also gives us all those... Those aren't owned by the application teams, the application teams, they end up using one of the application clusters based on the opinionated platform tooling that we provide. The clusters such as the ones that run our stateful systems are owned by the stateful services teams that provide those systems as a service to our internal customers.
Benjie: So if I'm on the application team and I want to use a Kafka service, am I connecting across cluster? I want to understand how that works. Or any other service, any type of database service or whatever, just because you obviously have these clusters isolated from each other to some degree. How do you bridge that? How, as an application developer, do I get a service that maybe is in a different shaped cluster?
Katrina: Right. So we have heavily made use of the pattern of custom resources as a way to declaratively request infrastructure and the way that that often works today is that the application owner would have a custom resource in their cluster alongside their application, their stateless application resources that would request either the creation of or just access to a stateful system that might exist elsewhere.
Then the controller backing that custom resource is going to take a look at that and say, "I need access to MySQL, I need access to Redis." It's going to set up a specially configured proxy to allow the app in question to connect to that service wherever it may live, and that obviously includes validating that the app is supposed to be able to access that service and it owns it, and then setting up all of the security configuration requirements to get it to connect to wherever that might live.
Marc: Cool. We're talking about Shopify and how you're currently running infrastructure and the path to get there. I want to definitely come back to that, but another interesting topic I want to transition to for a second or for maybe more than a second, is you're a project owner and a maintainer of Kustomize. I'd love to hear more about Kustomize. Can you tell us what's the inspiration for Kustomize, maybe for folks who aren't super familiar with it? Even just a high level overview of what Kustomize does.
Katrina: Sure. Kustomize is a configuration management tool that you can use straight from kubectl, it's embedded in kubectl as kubectl kustomize. The purpose is really to satisfy the basic configuration management need of managing variants that you might have for your application. Most folks who are deploying to Kubernetes have various shapes of their application that they might need to manage, and Kustomize is designed to fill that need.
Marc: It makes sense. Look, I've used Kustomize. I remember the first time it came out, it was like a new take on a really hard problem. There's other tools in the CNCF ecosystem that attempt to solve this problem using different ways, one that comes to mind is Helm. I think there's a proliferation of config management tools these days and Kustomize has a very opinionated, specific solution for it.
Katrina: Kustomize specifically is motivated by the desire to provide a very Kubernetes native declarative configuration management workflow for end users of Kubernetes to take advantage of.
It was actually motivated, from what I understand, I was not around on the project at the time, but from what I understand it was motivated by identifying a shortcoming of kubectl, that it had no declarative solution for this really common problem of needing to make simple transformations ahead of deploying your workloads to Kubernetes.
It had kubectl patch, kubectl label, all these imperative commands for making those transformations but the best practice is to have a completely declarative workflow from start to finish when it comes to configuration management. kubectl didn't give our users any way to approach that problem.
Marc: When you think about adding that on top of kubectl, that makes total sense. At the time that Kustomize came out, I think if I remember the timelines right, Helm existed for sure, and some of the other things that are available here in 2022 maybe were not available yet. There's definitely some stuff there.
So how do you think about the way that Kustomize solves the problem of giving users the ability to do that last mile configuration versus the way Helm does it? If you can talk a little bit about the strengths of the way Kustomize does it and maybe... Helm still exists and Helm has continued to grow in popularity, just like Kustomize has, so Helm clearly is solving a problem that folks have too. Can you share a little bit of your thoughts as to where each of the projects' strengths are?
Katrina: Yeah. The configuration management space is really rich and diverse. It was before Kustomize existed and it still is today, and I don't really see that changing to be honest. There are a lot of different tools in the space that come with different opinions and that will work better in different situations than others. There's really not going to be a silver bullet, and Kustomize isn't trying to be one.
I don't think any great tool is going to be built to address every possible use case because great tools tend to have strong opinions about one way to approach a problem. I think Kustomize's approach was pretty novel at the time, Helm takes a templating based approach which is fairly common across the space, and Kustomize specifically doesn't do that.
If you want to learn about the philosophy behind Kustomize, which I think is actually pretty interesting but I'm biased on that topic, there is a document called Declarative Application Management in Kubernetes that was written by Brian Grant in 2017. Brian Grant from Google. That document really is an overview of the configuration management space that is just super insightful into the possible approaches and the trade offs that come with them.
When I was working on configuration management internally at Shopify, I discovered this document and was super inspired by it. I think it can be useful for folks who maintain configuration tools or folks who are just trying to make a decision about what's best for their organization to take a look at something like that which has a very thorough overview, to get a good understanding of what the possibilities are and the trade offs they come with.
Marc: We'll include a link to that document in the show notes because it is absolutely worth reading if you're trying to figure out how to make your application configurable or deploy an application to Kubernetes.
Katrina: Yeah. One thing I would say about the difference between Kustomize and Helm is that they really have very different scopes.
They have in common that they both have a configuration management solution, but Kustomize is solely focused on configuration management and it is specifically focused on being a solution that is incredibly Kubernetes native and only deliberately exposes the Kubernetes APIs themselves in a purely declarative and configuration as data approach.
Whereas Helm is a lot more than that, it has a template based configuration management built in but it also is a full fledged package manager that helps you define, distribute and release your packages, and that is just completely out of scope for Kustomize. It's not something that's ever been a part of a Kustomize project or anything that we're interested in doing. That said, you can use these tools together as well.
You can use post render hooks in Helm to invoke Kustomize, or you could do it the other way around where you start off with a Helm chart as your starting point and then you import that into Kustomize using its Helm feature to then make all of your transformations, create your variants, whatever it is that you need to do using Kustomize's declarative style.
Benjie: Do you like that? Have you seen a bunch of setups like that? What would you suggest if I was just getting started with Kustomize?
Katrina: It depends on what you have to begin with. If you are starting with a Helm chart, certainly that is a very approachable way to get started and Kustomize is really focused on making it easy to make these common transformations so if that's exactly what you need to do then plugging your Helm chart into Kustomize and just defining those simple transformations you need to make is a really great way to go.
Ultimately if you end up doing something more complex with Kustomize, you might want to render out that chart into your directory and set up a more Kustomize oriented system overall. If you're just getting started and you don't have any configuration defined to begin with or you just have a plain set of Kubernetes resources, I would suggest starting with Kustomize directly.
Which is very easy to do because it's very oriented around those resources. All you need to do is drop them into a directory, add a kustomization beside them that imports them, and then define those declarative transformations that you want to make.
Benjie: Yeah, so really if I'm starting from scratch, Kustomize in your opinion, not biased at all, is the way to begin? Yeah, okay. No, that's super interesting. There's a lot of nuance here and I think that I'm very excited to read through that doc actually, so I'm going to be looking at the show notes for sure.
Katrina: Yeah, it's about five years old now so it doesn't have all the tools that exist today and some other ones that are also inspired by it, there's another one by Google called kpt that was also derived from a similar philosophy. But the principles that it outlines there are certainly still pretty comprehensive and very informative.
Benjie: Totally. At Shipyard we're big users of Helm, I'm pretty sure that at Replicated they're big users of Kustomize. Not that we're trying to start a turf war here. But tell me more about this Helm feature, I had no idea. You just mentioned a Helm feature that I can use and I want to understand that one more, what was that?
Katrina: Yeah, so most of the features in Kustomize you could frame as being either generating or transforming configuration, and one of the features that we have is based on generating configuration with a Helm chart as a starting point. There is a Helm field that you can specify where you point us to your chart and some of the parameters needed to render it and Kustomize will render that out.
It's actually shelling out to the Helm command that you have installed on your machine, Helm V3, and it just uses that to get that raw configuration set to use as the starting point for the transformations that you're going to make with Kustomize. That said, because we are relying on an external executable there are some trade offs to keep in mind with this, which is that you need to make that available and you're going to need to specify an additional set of flags to tell us that, yes, you really want us to invoke this specific executable and you believe it's safe and it's actually Helm. That's for security purposes, which makes it a little less convenient to use.
Because of that we actually have a plan that is documented in an issue on the Kustomize repo for moving the Helm integration into the new extensions format called KRM functions, which is another sub project that I lead within SIG CLI. The idea there is that you can encapsulate it in a container and then we can provide the Helm binary that gets run and you can just approve the fact that we want to run Helm.
It's a little ways out still, but that's the direction we're headed with it and then we can provide a more complete Helm feature set because we'll be building something that is even more focused on providing that integration and making it complete.
Marc: That's cool. Well, the KRM stuff looks interesting. I think we'll talk a little bit more about that.
Benjie: Wait, hold on, Marc. This is the most important question that you know I like to ask. Katrina, "Kube cuttle" or "Kube control"?
Katrina: I say it three different ways, and I think most of the SIG CLI maintainers now also just say whatever comes to mind in the moment. Somebody asked us this at KubeCon actually during a session. There is an old release note that I think says "Kube control". It says one thing or another. That is not the same thing as what is implied via our new logo from a few years ago which shows a cuttlefish, strongly implying that it's "Kube cuttle."
Benjie: The right answer is "Kube Cuttle." I just want everybody to know that, just based on that it sounds so nice.
Marc: The right answer.
Benjie: The right answer. You know I'm an advocate, the listeners of the podcast know that I am on a mission to make sure it is "Kube cuttle." We are going to win this fight. So Katrina, the right answer is "Kube cuttle," just so you know.
Katrina: Okay, thank you for telling me.
Benjie: Wait, tell me more about this KRM stuff. I don't know anything about this. What is this?
Katrina: Yeah, so if you want to get more details, there's some talks available from both the maintainer session at this past KubeCon and Jeff Regan, the previous maintainer of Kustomize, and I spoke about how to use them in Kustomize specifically in depth at the North America KubeCon from last year. KRM functions is a format, essentially, for defining declarative extensions for configuration management tools. What does that mean?
KRM stands for Kubernetes Resource Model, and that means that we want to build pieces of code that are going to effect configuration management transformations based on a declarative state expressed as a Kubernetes resource.
In a way, Kustomize itself follows this exact model, because a Kustomization is a Kubernetes resource and Kustomize the tool takes that Kubernetes resource as its declarative instructions for what it needs to produce. The KRM function specification describes how you, as someone who wants to extend a configuration management tool like Kustomize (it's also supported by kpt), can adhere to the same principles and build an extension that integrates really nicely into these tools. Did that make sense?
Marc: That does make sense. So it doesn't depend on the Kubernetes API server as a way to distribute and execute that, but the interesting part too is that as far as the security model goes and a few other side effects you get, and benefits really, is the KRM functions, I guess is that what you call them? Are packaged as containers.
Katrina: Yeah. The recommended distribution mechanism is containers, it's not the only one that's supported. You can build a KRM function that is implemented with an executable. We recognize that there are some situations such as various CI related contexts where you actually cannot run a container which puts a serious limitation on us if we were to restrict the format to that exclusively.
So from a security standpoint, certainly a container is the best option if that's a possibility. What we're working on is a concept called Catalog that would enable you to, as a user of KRM functions extensions, so a Kustomize user that wants to incorporate extensions, to tell Kustomize that you want to run these specific ones.
And, to be able to list out what those are supposed to look like. So what container can run them, or what executable can run them, and whether or not you as the end user prefer one or the other. That way it's a full metadata format so you are able to, as the person who is authoring these for publication and then as the user who wants to consume them, specify exact checksums that you expect the binary to have so that Kustomize can validate that for you before running it.
Then when you're defining your Kustomization and you're saying, "I want to run my company generator," you can just say kind: MyCompanyGenerator and Kustomize can look up which one that is, see that you've approved it in your catalog that you've included and then just run it seamlessly for you. That was a bit of a tip.
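The checksum validation Katrina describes can be pictured with a small Python sketch. This is only an illustration of the idea of comparing a binary against a pinned digest before running it; the Catalog format was still a work in progress at the time of this conversation, so the function names and the choice of SHA-256 here are assumptions, not Kustomize's actual implementation.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Return the hex-encoded SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def matches_checksum(data: bytes, expected_sha256: str) -> bool:
    """Check that data hashes to the digest the user approved (hypothetical helper).

    compare_digest avoids leaking information through comparison timing.
    """
    actual = sha256_hex(data)
    return hmac.compare_digest(actual, expected_sha256.strip().lower())


def executable_is_approved(path: str, expected_sha256: str) -> bool:
    """Read an executable from disk and verify it before it would be run."""
    with open(path, "rb") as handle:
        return matches_checksum(handle.read(), expected_sha256)
```

A tool following this pattern would refuse to invoke the extension whenever the check returns False.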
Benjie: Wow. No, that's great. So what is the timeline on KRM functions, all that stuff? When do you think that the implementation might be there? Or the spec is finished at least?
Katrina: The spec is pretty solid at this point. It's actually pretty simple when it comes down to it. The spec says that you have to use the Kubernetes resource model to describe the intended functionality and then your program has to accept a list in a certain format on standard in, and emit that same format on standard out so that these KRM functions are composable.
But it's very easy to comply with, basically what it looks like from the perspective of the person writing the function is you take in an object called a ResourceList that has a field for the list of resources that you are supposed to transform or add to. And then a field with the function configuration, which is the declarative KRM object that the end user is expressing their intent with, so telling you what to generate or what transformation to make.
Then you do whatever it is you want to do in your program to make that happen and you spit out the same ResourceList format on the other end with the new list of items that has the transformations or generations or validations applied to it. The format I think is pretty solid. The plans around Catalog to make this easier to use in the tools that adopt KRM functions as a format for their extensions, those are still more work in progress.
We have KEPs that describe the plan for them, but we need more folks to join us to work on the implementation and help give us feedback to make sure that what we graduate in terms of the extensions is something that is going to work for the real world use cases that folks have.
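To make the shape of that contract concrete, here is a minimal Python sketch of a function that takes a ResourceList, reads its functionConfig, and returns a ResourceList with every item transformed. The LabelSetter kind, its spec fields, and the example.com API group are hypothetical. The sketch uses JSON to stay within the standard library; functions are normally exchanged as YAML, so a real one would need a YAML parser on the input side.

```python
import json


def add_label(resource_list, key, value):
    """Return a new ResourceList whose items all carry metadata.labels[key] = value."""
    items = []
    for item in resource_list.get("items", []):
        updated = dict(item)
        metadata = dict(updated.get("metadata", {}))
        labels = dict(metadata.get("labels", {}))
        labels[key] = value
        metadata["labels"] = labels
        updated["metadata"] = metadata
        items.append(updated)
    result = dict(resource_list)
    result["items"] = items
    return result


def run(stdin_text):
    """Parse a ResourceList, apply the user's declared intent, serialize the result."""
    resource_list = json.loads(stdin_text)
    # functionConfig is the declarative object the end user wrote (hypothetical shape).
    spec = resource_list.get("functionConfig", {}).get("spec", {})
    return json.dumps(add_label(resource_list, spec["key"], spec["value"]))


# Example input: one ConfigMap plus a hypothetical LabelSetter functionConfig.
EXAMPLE_INPUT = json.dumps({
    "apiVersion": "config.kubernetes.io/v1",
    "kind": "ResourceList",
    "items": [
        {"apiVersion": "v1", "kind": "ConfigMap", "metadata": {"name": "settings"}},
    ],
    "functionConfig": {
        "apiVersion": "example.com/v1",
        "kind": "LabelSetter",
        "spec": {"key": "team", "value": "checkout"},
    },
})
```

Wired up as a program, this would amount to writing run(sys.stdin.read()) to standard out. Because the output has the same format as the input, several such functions can be chained one after another.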
Marc: Cool. I think that we don't want to spend the whole time talking about KRM functions. They're super interesting and I've been reading through the spec I found in the Kustomize repo here as you've been explaining it, we'll definitely include a link. I think one thing that can help a little bit is it is a pretty advanced topic that you're clearly building to solve a specific problem. Can you give an example maybe of where KRM functions would make a current process a lot easier to do?
Katrina: Two examples that we have recommended folks build extensions for are features that we don't want to add to Kustomize Core but that fill a use case that folks have in their organization. So if they need to make a really specific transformation that involves knowing something about the way that that organization structures its resources, that's not something that we can accept in Kustomize in particular because we only make structured transformations.
We never use any mechanism like templating or regex to identify unstructured locations to edit. So when you have that knowledge you can build that into your own transformer and have your end users give you the information that you need to do that transformation accurately and do it behind the scenes, however you would like.
Another example is that some folks want Kustomize to emit resources in a specific order for deploy readiness reasons, for a few reasons, whatever it may be, and it doesn't correspond to one of the two ordering strategies that Kustomize supports. Well, in that case you can implement a transformer that you add at the end of the list that just takes in the list, reorders it into that order that you wanted and emits it so that the final order that Kustomize emits is exactly the one you want.
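A reordering transformer of the kind described here could look roughly like the following Python sketch. It assumes the ResourceList has already been parsed into a dictionary, and the kind_order parameter is a hypothetical piece of configuration the end user would supply; it is not an existing Kustomize option.

```python
def reorder(resource_list, kind_order):
    """Return a new ResourceList with items sorted by the position of their kind.

    Kinds missing from kind_order sort after the listed ones. Python's sort is
    stable, so items that tie keep their original relative order.
    """
    rank = {kind: index for index, kind in enumerate(kind_order)}
    fallback = len(kind_order)
    items = sorted(
        resource_list.get("items", []),
        key=lambda item: rank.get(item.get("kind"), fallback),
    )
    result = dict(resource_list)
    result["items"] = items
    return result
```

For instance, passing ["Namespace", "ConfigMap"] as kind_order would move Namespaces first and ConfigMaps second, leaving everything else in the order it arrived.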
Another use case for KRM functions that really resonates with me in particular is the one where you have an organization with strong opinions on what configurations should look like or a need to produce abstractions for your end users to declare a more specific or a more organization specific desired state. So for example, maybe you have a standard app shape that all your customers use to get started and it takes five different parameters that are specific to your organization.
Well, you can make a generator as a KRM function that takes in those five parameters and gives them that starting point that they can then use with standard Kustomize primitives to take it from there to customize it to their own needs. In essence, Shopify actually does take this strategy, our configuration management system predates KRM functions but it was inspired by the Declarative Application Management in Kubernetes document and bears a lot of similarities to the way that KRM functions are structured and built.
In fact, we used the kyaml library, a very, very early version of it, as part of our implementation and our configuration generation tool, although we do use it standalone, is compatible as a Kustomize generator. That was intentionally so, and I think proves out that point that this is a really powerful mechanism for folks who really like Kustomize's approach but have more organizational, bespoke needs to address.
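As a rough illustration of the generator pattern, the Python sketch below expands a small, organization-specific set of parameters into a standard Deployment and Service. The five parameters (name, image, replicas, port, team) and the functionConfig shape are invented for this example; they are not Shopify's actual abstraction, and the ResourceList is assumed to be already parsed into a dictionary.

```python
def generate_app(resource_list):
    """Append a Deployment and a Service built from five hypothetical parameters."""
    spec = resource_list["functionConfig"]["spec"]
    labels = {"app": spec["name"], "team": spec["team"]}
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": spec["name"], "labels": dict(labels)},
        "spec": {
            "replicas": spec["replicas"],
            "selector": {"matchLabels": {"app": spec["name"]}},
            "template": {
                "metadata": {"labels": dict(labels)},
                "spec": {
                    "containers": [{
                        "name": spec["name"],
                        "image": spec["image"],
                        "ports": [{"containerPort": spec["port"]}],
                    }],
                },
            },
        },
    }
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": spec["name"], "labels": dict(labels)},
        "spec": {
            "selector": {"app": spec["name"]},
            "ports": [{"port": spec["port"], "targetPort": spec["port"]}],
        },
    }
    result = dict(resource_list)
    # Generators add to the incoming list rather than replacing it.
    result["items"] = list(resource_list.get("items", [])) + [deployment, service]
    return result
```

The generated resources are plain Kubernetes objects, so the usual Kustomize primitives such as patches can then be layered on top of them.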
Benjie: Totally. We kind of got a little ahead of ourselves, but this is a great place to go back. How did you get involved in Kustomize? Especially being at Shopify, can you tell us a little bit about how OSS works there and how you got to get involved in Kustomize itself and all that stuff?
Katrina: Shopify is a really Open Source friendly company, we really believe in investing in the Open Source software that we depend on and we have a long history of doing so. In particular, for Ruby, Rails and React. Historically we haven't been as involved in Kubernetes but we're really excited to be doing more in that space as well these days. I would say that personally I was always excited about the idea of being involved.
But for all those years in the past, I never really knew how to get started, never found the right approach. While I was working on configuration management, building the product to make configuration management tractable for our app platform, I went to a KubeCon and had the privilege of being introduced to Phil Wittrock, who was one of the tech leads of SIG CLI at the time and also one of the leads of the Kustomize project in particular. I was super excited about configuration management at the time, right?
I'd been working on this project for a while, we were getting ready to ship it, I had all of these ideas about how it would work and Phil was also really excited to chat about that topic since it's close to his heart as well. That's how I started to find my place in the community, by being able to connect to somebody who is working on the same sort of problems that I was working on in my day to day.
For some reason, I don't know why, it had never occurred to me that SIG CLI was the SIG that was working on the same stuff that I was. I had always thought it was maybe SIG Apps. I'd gone to their meetings but I'd never really found a common ground. I gave a demo there once, but that was about it. When I found SIG CLI through these conversations I was like, "Wow, why didn't I realize this sooner?"
From there I started making a few little contributions and eventually I ended up having the opportunity... We didn't touch on this, but I joined Apple briefly for about a year and a half, and part of my role there was working on SIG CLI stuff as well. I've had continuity of subject matter in the last years of my career here. While I was there I also did a lot of work with Phil and was able to even further deepen my involvement with SIG CLI and with Kustomize in particular.
Marc: One thing that's interesting, going back to that declarative application management whitepaper that you mentioned that Brian Grant wrote. Back in 2017, when he wrote it you were working to solve these problems at Shopify, lots of different people are googling this. Brian works at Google. It's a really great paper, but the one sentence that really struck me was that he talked about how, "look, we've been working on this for years at Google, we've tried a lot of stuff.
We still don't really think there's a good solution yet." So it gives you confidence that you're not missing something obvious, everybody is trying to solve this problem, configuration is just hard in general. And so you're motivated, you're solving this problem, you realize that there are a lot of folks out there trying to solve this problem so you start to work on it.
How did you get involved enough to become like you are now, a project owner of Kustomize? You went from working on the project, making PRs into it, I assume, and then eventually said, "Look, I'm going to really focus a lot of my time, I'm going to take some responsibilities here and really drive this project forward."
Katrina: Yeah, that's exactly it. I started getting involved by contributing features here and there, making fixes, talking to the maintainers about the work that I was doing in the space.
What really got me most deeply involved was actually the KRM functions angle and what I was talking about regarding the power of these to address so many different use cases. You can even use them independently like Shopify does, where we have something that could be used as a KRM function extension for Kustomize but in practice we use it independently.
It's a way to build these composable units of configuration tooling that can serve such a wide variety of use cases while adhering to these great principles outlined in Declarative Application Management. So I was really excited about that concept and I contributed a KEP for a resource called Composition, it's still not merged yet.
There's an implementation up in a PR but this is something that is intended to make these third party resources defined as extensions for Kustomize to be more first class citizens of Kustomize and be mixed in alongside the built in generators and transformers. It makes, in particular, variant management when you're using both built ins and external generators and transformers a better experience. So that was my first really major contribution and to do that effectively I had to really deepen my knowledge of Kustomize as a tool and the implementation.
Benjie: So how is the Kustomize community going now? Fast forward a few years and how is it going? How many people are involved? What do you need help with? We have an audience here so what are you looking for? How can we be supportive of the Kustomize project?
Katrina: We have a very small team right now. I maintain Kustomize with Natasha Sarkar from Google and we have another person, Ana, who has just begun to join us recently which we're really excited about. Welcome, Ana.
We have such a wide user base that we really need more folks than just the three of us working on the project. So if you are a big fan of Kustomize, if you rely on it and are interested in helping us make the tool better for the long term, we would really like folks like you to step up and join, I guess, the contributor ladder to help us maintain this software and have the bandwidth, really, to do more of the exciting plans that we have on our roadmap.
Benjie: Wait, by the way, Shopify is using Kustomize today in production, correct?
Benjie: No? Interesting. One day, once KRM is in place they might be?
Katrina: Probably not, honestly. One of the things is that Kustomize isn't particularly intended to be a tool for opinionated platforms like the one that Shopify has. Our configuration management tool is inspired by the same thing, it works on the same principles, it looks just like a KRM extension and perhaps we'll use it from within Kustomize as part of it for some advanced use case in the future.
But where these two things come together is at that philosophical level in terms of the foundational tooling that's used to build them. The thing that we have internally, it's also built on KYAML and it is also a Kubernetes Resource Model declarative approach.
Benjie: That speaks to a pretty awesome Shopify Open Source thing there that you're working on this much and you're not even using it. That's really cool.
Katrina: Yeah, and of course I don't work on Kustomize exclusively. I am involved in SIG CLI more generally and the goal of my team is really to connect Shopify to the upstream community in ways that make sense for us. This is one case where Shopify has a ton of expertise in this space, we have so much experience doing configuration management strategies at scale, and the community has a serious need for folks who have that experience to step up and maintain these tools to help make them sustainable for the long term.
So this is a place that it makes sense for us to help out, and that is really a part of our mission, is to help contribute to the sustainability of the Kubernetes project in ways that make sense for us and our expertise.
Benjie: That's great. I'm sure it would be great if you were able to say, "Yeah, I can fix this in Kustomize and make our internal Shopify systems work better," but it forces you to bring everything up one level higher where you're really thinking about configuration management, not the implementation of Kustomize to solve Shopify's problem.
Katrina: Yes, although those things certainly do come together, even in the concrete implementation of the KYAML library, because increasingly Kustomize is a wrapper around KYAML. KYAML, I haven't really explained it properly yet. It's basically a toolkit for manipulating YAML that specifically is expressing Kubernetes resources. It's this really great tool and we do a lot of client side work at Shopify, so KYAML is a really helpful tool to have in our tool belt for that, and we can help evolve it based on our experience, which also helps Kustomize because Kustomize is built entirely around it.
Benjie: Is KYAML just a Go library, a Go package? Or is it a totally separate-
Katrina: Yeah, it's a Go package that currently still lives inside of the Kustomize repository. The KRM function strategy, we didn't build our tool out of the KYAML functions framework, which is a sub package of the KYAML tool, just because it didn't exist yet. If we were to restart today, building what we have would be so much easier.
Oh my gosh, it would be so much easier and if I had been in the community, working on this stuff at the time that I built the internal solution it would've been even better, honestly. There's so much benefit to be had by being connected to other folks in the industry that are doing the same work as you and that bring new ideas, and best practices, and together that's the great thing about community, right?
Together we build something better than what any one would build separately, so just being connected to the other folks in this domain and being able to be really familiar with, and contribute to, these tools that are very popular. Helping other folks across the industry has inherent benefits for us as well.
Marc: Yeah, it's a hard problem though, right? You solve this hard problem at Shopify, or anybody goes and solves a hard problem, then years later there's Open Source tooling that you're like, "Wow, I could've solved this problem so much quicker." Exactly what you just described. But you then have to weigh is it worth rewriting it and getting the benefits and the maturity of this, versus everything else that I could be doing? So I don't know if there is a future of Shopify to be able to start to migrate some stuff into the KYAML library or not, or how your team thinks about that.
Katrina: Yeah, that's a really good point. The configuration management system is working pretty well, so the chances that we'll turn around tomorrow and change it up are not very high, honestly. But the system that we built was for a very specific use case and it's not the only one that we have internally.
There's still some unsolved problems in the configuration management space that certainly I think KYAML functions, as they exist today with the whole toolkit that's available behind them, could be a really good solution for. That is more likely to happen by far.
We wanted to be really, really opinionated with the solution that we provided to our app developers both to help guide them and to help them avoid common mistakes that we were seeing in the previous system that we had, that this KRM oriented system replaced. We actually intentionally didn't address the full spectrum of possibilities that existed at the time because we wanted to provide the best tool possible for the majority of our users on our platform.
Benjie: That tool is not Open Source, correct? The internal Shopify one?
Katrina: No, it isn't. Because it's built so specifically for the organization, I don't think it would really make sense to Open Source it. What you might see analogously in the future is that as part of the KRM functions project there is a KRM functions registry that is still just getting stood up. There's nothing really of note in there yet, but that is a place where folks who have functions that are more generic can contribute visibly to the community both as examples of what other folks can do, and to literally share these pieces of functionality across companies.
Benjie: Okay, this is really cool. I feel like I'm understanding a lot more about how Shopify operates, both from a strategic perspective with Open Source but also internally and how you guys do stuff.
Katrina: I actually have a few examples, real examples of us having encountered exactly the sort of situation you described a moment ago where we have an internal implementation of something and then a community standard emerges and we're faced with this question, "Do we adopt it? Is it worth the investment?" In a couple of cases we have made that decision and we have not only switched over to the Open Source solution, but started maintaining it. So yeah, I could go into that a little bit if you're interested.
Benjie: Yeah, that would actually be super good because I think there's a lot of really pragmatic and practical decisions that, as an engineer, you're faced with, as a software developer, you're faced with every day when you think about seeing an Open Source project and you're like, "Oh, that's probably better." So I'd love to hear some of that story.
Katrina: Yeah. One example is the case of our use of OpenTelemetry. Francis Bogsanyi, who leads our observability team here, was really into the idea of tracing ahead of the curve. He got started at a time when there was no clear leader and he actually went with an internal implementation that we rolled out to production.
We had some good experiences with it, but we also concluded that we didn't really want to be maintaining our own implementation ourselves, so we started looking out in the community at what was emerging as the solution to go with. At the time that he did that it was OpenCensus that seemed to potentially be the solution, but the Ruby implementation, which was what we needed, was not really maintained.
So we were thinking, "Okay, maybe we can step up and maintain that." Right when we were making that decision, that's when OpenTelemetry itself emerged and it didn't have a Ruby implementation at all so Francis said, "Well, this is a great opportunity. We have all of this experience from building the internal implementation and the community has this need for a Ruby implementation of this new standard, so this is something that Shopify is in a really good position to help out by providing... and that will also lead us to be in this better position for our long term future where we're using the community standard solution and being able to leverage the community documentation for how this works and really be inline with what everyone else in the industry is doing."
So yeah, our observability team stepped up to the challenge there and were able to found the Ruby OpenTelemetry implementation based on the experience gained from running Shopify's internal tracing product in production, so we were starting from a really solid place with that, and we still maintain that to this day. Another example is that we used to use a different ingress provider, but today we use Ingress Nginx. How that happened was we were having some issues with our original provider in terms of the feature set that it offered and we were looking around the community again for what we should perhaps adopt instead.
We identified Ingress Nginx as the likely candidate but there was a big problem. We deploy at a really, really high rate. Our core application is deployed dozens of times a day and it has tons and tons of replicas that it needs to roll out every time we do that, which means that there's a ton of endpoint churn in those clusters. At the time, Ingress Nginx didn't handle that very well.
Without getting into the technical details too much, we would end up in a situation where our Ingress Nginx pods would be proliferating the number of workers they were running and eventually get OOM killed. So what we really needed was to introduce dynamic endpoint reconciliation to be able to handle that endpoint churn in our clusters.
Now, Elvin Efendi from our running team was looking at this problem and he realized that we had a lot of experience with Lua in Nginx from running Nginx ourselves in our data centers in our previous architecture, and that could be applied to solve this problem in the community. So we proposed a solution, and it ended up working out really well, and with that in place we were able to make the switch over to Ingress Nginx.
Which, as a fun fact, we ended up doing a lot earlier than originally planned because our original Ingress provider ended up having a major outage and Elvin had to make the decision like, "We're ready to go, but we weren't planning on launching now. Let's just flip the switch. It's a full outage, let's flip the switch." And we did, and we never rolled back. The Ingress Nginx rollout saved the day.
Benjie: Okay. We're coming up on time but I have a few other questions about Shopify and how it works over there. How does Shopify share the management work? Do the dev teams and the app owners have kubectl access? Should application teams own their own clusters? Do they do that at Shopify? You guys have a really, really extensive Kubernetes setup over there and I feel like there's a lot of lessons that you've learned over the years, so how do you guys look at who owns what and who manages what with all these clusters?
Katrina: Shopify's approach in general is to have a really opinionated technical stack, and that goes throughout the stack really from everything from our choice of languages, we really are a Ruby on Rails heavy shop and for the frontend technology we use React, right through to using Kubernetes universally on the backend and, notably for this case, the integration between those two.
So our app developers, they use Kubernetes through our production platform, which is essentially a platform as a service product that production engineering builds. We have an interesting balance that we strike though because in the production engineering model we build the platform and we build resiliency features into it that help our applications scale and recover from outages and all of those important characteristics.
We don't actually run the applications themselves on behalf of the app owners. The app owners own their applications from start to finish, all the way through shipping them to production and making sure they work well, and it's the app owners themselves that are on call for it. So to that end, we do give them a fair amount of control within their namespace.
Typically an individual application will have a namespace for, say, their production instance and within that namespace they're able to configure the resources as they see fit. The configuration management solution that we provide them gives them a really opinionated start that guides them in the right direction and highly restricts the ways that they can do the configuration to make sure that it ends up being deterministic and that we get repeatable deploys and all those other best practices in the configuration management space baked into what's possible.
But at the same time, they are able to modify arbitrary fields kind of like the way you can do with Kustomize by writing patches, and that means that they might start with us as an experiment but then they grow to need something more bespoke and they really need to tailor their infrastructure to their use case. That is something that they're able to do with our platform.
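For readers who haven't written patches, here is a minimal Python sketch of the general idea: a small overlay document that changes only the fields it names and leaves everything else in the generated resource alone. The `apply_patch` helper is hypothetical and deliberately simplified. It merges nested mappings recursively and replaces everything else wholesale, so unlike Kustomize's strategic merge patches it does not merge list entries by key.

```python
import copy


def apply_patch(base, patch):
    """Recursively merge `patch` into a copy of `base` (mappings merge, other values replace)."""
    result = copy.deepcopy(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            # Both sides are mappings: descend and merge field by field.
            result[key] = apply_patch(result[key], value)
        else:
            # Scalars and lists are replaced wholesale in this simplified sketch.
            result[key] = copy.deepcopy(value)
    return result


# A generated starting point (illustrative values only).
base = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "checkout", "labels": {"app": "checkout"}},
    "spec": {"replicas": 2},
}

# The app owner's patch: only the fields they want to change or add.
patch = {
    "metadata": {"labels": {"tier": "critical"}},
    "spec": {"replicas": 5},
}

patched = apply_patch(base, patch)
```

The base stays untouched, which is what keeps the result deterministic: the same base plus the same patch always produces the same output.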
That said, we are still very opinionated in other ways as well in terms of giving folks production access. We have a developer portal that they can use to get insight into their Kubernetes resources and we use a progressive enhancement model where you don't need to learn anything about Kubernetes upfront. But if you need to dive into what's happening you really can.
We're not hiding Kubernetes from you, and in that sense our approach is somewhat similar to Kustomize where when we need to teach you something, we're going to teach you the real concept so that you can really understand what's going on under the hood so that all the public resources on what you're seeing are going to apply to you. There's no translation layer in there, per se.
Benjie: Wow. It sounds like you guys literally are doing the dream, or what I always think is the idealized dream of a good set up. It sounds really cool. I'm sticking with Shipyard but you're tempting me.
Katrina: Yeah, I think we've made a lot of good decisions over the years, and of course some of these weren't there to begin with, the configuration management stuff in particular. Obviously near and dear to my heart. The thing that we had there in the first place, it was a nightmare, it was not a good idea. It was template based.
Benjie: Yeah, but you guys are killing it. It's just crazy how literally you're describing what you guys do and I'm like, "Are you talking about a hypothetical or have you actually got this all working? That's crazy." Does it work well? Does what you describe work well? Is there a lot of velocity? Do you see that it is helping? Are application people?
Marc: What's next internally? Where are the challenges that you want to actually make that process better?
Katrina: Yeah, it is working very well for us. We have gotten a lot of velocity, as you said. I think having strong opinions as a core part of our model and really guiding our developers to do the right thing by default and then be able to unlock power when they need it, that's been a really effective model for helping such a large number of teams scale such a large variety of applications.
But there are new challenges every day really, and we're constantly working to keep up with the community and make sure we're following the best practices. We are very old users of Kubernetes, so we touched on this earlier, some of the solutions that we built predate the more established practices so we are constantly revisiting what we should be doing to make sure that we're getting the full benefits of the solutions that have emerged.
Right now one of the problems that we're addressing is that we made it really easy to create clusters because of the philosophy of trying to make them more on the restricted scope and disposable side.
But now we have a ton of clusters and it's really challenging to manage a huge, global fleet of clusters so we have some solutions that we're working on there but that work is far from done and that's a space where we're super interested in collaborating with other folks in the industry who are solving similar problems. As a company we really like to collaborate technically with others and talk about our various solutions.
Benjie: I have two quick followup questions. One is you mentioned you guys have 400 clusters, are those all living in GKE, AKS, EKS? Do you have your own data centers or are they spread around? I would assume that they're definitely multi region, but are they in different clouds? Then the other question was how many SRE and devops people, folks do you have at Shopify maintaining all this awesomeness?
Katrina: Yeah, we are GKE exclusive. When we first started the project, when I was talking about the experiment that we did to see if this would work for us, one of the things we did in the early days was run Kubernetes ourselves in our own data center as part of the validation that this is a technology that we're comfortable with and that we really want to invest in. But these days we're all GKE, so all 400 plus of those are GKE clusters and, yes, they are absolutely around the globe.
Benjie: How many devops and SRE folks do you have?
Katrina: Shopify uses a production engineering model where we have more of a platform team approach, so I guess the answer to that question would be roughly how many folks we have working in production engineering, which does include a handful of SREs for the platform itself. That number depends on how you count exactly, but it's somewhere around 300.
Benjie: That sounds about right for what you just described.
Katrina: Another really interesting space that we work in is stateful systems on Kubernetes. That's one of the decisions that was most difficult to make in the beginning, when we were talking about are we going to consolidate everything into this platform? Super easy to say once we'd decided that Kubernetes is a good technology choice that we're going to run all of our stateless stuff on there.
But stateful, especially back when we were making this decision, it was far from obvious that that was the right approach. We decided to make a big bet on it and go all in with all of our stateful systems as well, and we think that's actually worked out really well for us. But it's certainly a challenging space and we have a lot of folks who are really excited to be pushing stateful in Kubernetes forward and solving the hard problems that still exist with doing that seamlessly.
Benjie: All right, wonderful. This has been super informative and I really do feel like you've been describing the goal state for most of our listeners. I know it's the goal state for me, and obviously I think about platform and platform teams a whole lot. I know Marc does as well. But this is super, super informative. One last question before we let you go: are there any other cool Open Source projects that Shopify is a big part of or sponsoring or whatever that we should know about, that we can put in the show notes? We don't have to talk about it too much, but just anything else that you think we should be looking at that you guys are contributing to.
Katrina: We have contributed a couple of our own Open Source tools, notably one called Krane, for deploys, a couple of years ago. That's always been a part of our stack. Another one is called kubeaudit, which our infrastructure security team built. We also do occasional contributions to the many different Open Source projects that we take advantage of from the CNCF landscape: Thanos, Prometheus, Falco, Voucher. Actually we are maintainers of Voucher as well, which is a component of the Grafeas system. I don't think that's CNCF, but it's also maintained by our infrastructure security team.
Benjie: Nice. Well, it sounds like this is a really cool place to work, not to be too much of a commercial for Shopify. But it sounds really cool, I'm just blown away with some of the factoids.
Katrina: Yeah, obviously I want to highlight the strong points of what we've built so far but there's still a lot of work left to do and we're a company that is really heavily investing in Kubernetes as you can tell, and we have lots of openings to work on these various spaces and there's still plenty of hard work and challenging problems left to solve.
Benjie: Wonderful. Katrina, thank you so much for coming on. This was great. I reserve the right to ask you to come back in a year to tell us what's different, because I'm excited to see. Yeah, I don't know what you guys will build by then, but it'll be super interesting. Really appreciate you coming on, and excited to see where Kustomize goes and I will personally be checking out the KRM stuff and keeping a close eye on that, so that was something I learned today. Really appreciate you coming on.
Katrina: Thanks for having me.