In episode 27 of The Kubelist Podcast, Marc Campbell and Benjie De Groot are joined by Webb Brown of Stackwatch. Managing costs and performance-related tradeoffs has always been a difficult problem for users of Kubernetes, especially at scale. This talk spotlights Kubecost, a project offering real-time cost visibility and insights for teams using Kubernetes.
Webb Brown is the Co-founder & CEO of Stackwatch, and creator of Kubecost. He was previously a Product Manager at Google.
Transcript
Marc Campbell: Hi, welcome back to The Kubelist Podcast. We have another great discussion today with Webb Brown, founder of Stackwatch and creator of Kubecost. We're going to talk about their Open Source project, Open Cost. Welcome, Webb.
Webb Brown: Thank you so much, Marc. It's great to be here, really appreciate you having me today.
Marc: Awesome. And of course, Benjie is on too. Hey, Benjie.
Benjie De Groot: Hey, I'm here, I'm here. I'm still here. Hey, Webb. Thanks for coming.
Webb: Hey, Benjie. Thank you again. Really, really great to be here.
Benjie: I'm really excited about this one, it's our pleasure. I have personally been using Kubecost on some level for a few years now so I'm really excited to dig into this one.
Marc: Great. So let's get started here. Webb, just to help kick stuff off, I'd love to hear a little bit about your background, your story, how did you get into the cloud native ecosystem, that type of stuff.
Webb: Yeah, absolutely. So before creating the Kubecost Open Source project in early 2019, my co-founder and I were both at Google between four and five years each. He was an infrastructure engineer, I was a product manager, and we were both focused on infrastructure monitoring broadly, at first on internal platforms powering Google's infrastructure and applications.
My co-founder moved on to Google Cloud, I moved on to a dev tools team. Really throughout this effort we were looking at different aspects of this relationship between cost, performance, health, reliability, et cetera, and what we saw was that on multiple layers they're all deeply interrelated.
You can't think about managing cost effectively without deeply thinking about the potential performance implications, be it an application or a configuration or an infrastructure change. So we were fortunate enough to see some of our teammates join the Kubernetes effort in the really early days, and we saw that, wow, there was a real opportunity to help users, especially at the scale of Kubernetes, manage these kinds of cost and performance related trade offs, so we started talking to users.
One of the biggest pain points that we saw right away was that it was actually really hard for users of Kubernetes to even monitor costs effectively.
When teams were coming from VMs, they typically had relatively complex showback or chargeback concepts in place. When they moved to containers that were really dynamic in Kubernetes environments, they had a much harder time, so that led to the birth of the Kubecost Open Source project.
Marc: What created that difficulty and that change from the VMs? Was it really just the bin packing and all the different workloads running on a shared set of infrastructure?
Webb: Yeah, great question. We think it's twofold. First is the technical complexity, right? This notion that things are constantly being rescheduled, it's more common that teams are using autoscaling so resources tend to be shared more often, and then applications and infrastructure tend to be more dynamic or ephemeral. So that's part one, there are just more moving pieces from a technical complexity standpoint.
But part two is we also think there's a downstream benefit from the amazing innovation of Kubernetes as a platform and the way teams are using it a lot of the time, which is empowering a lot of different engineers throughout the organization to ship applications really rapidly in a really decentralized manner, where there's less of a small, centralized team managing the entire deployment process.
Benjie: So you were at Google, you were working on Borg. Is that right?
Webb: We were working on infrastructure monitoring on top of Borg, yes. Doing anomaly detection and device health monitoring and outage prevention, that sort of stuff.
Benjie: That seems like an intense job, to do that at the Google scale. Just since we're there, any cool little anecdotes that you can share with us that are something? I imagine there were some crazy insights that you guys just flipped a switch and were like, "Oh my god, we're spending what on what?"
Webb: Yeah. I think some of them were just the absolute dollar amounts, right? To see an application that had been deprecated months ago and was still running and nobody noticed, and, yeah, it was costing $100,000 a month, stuff like that which at most organizations would be a really big deal. Just the sheer scale; I would say I saw things in the tens of millions of dollars regularly. But I think that Google is a special place for a lot of reasons, and it had different constraints from most organizations in terms of managing cost, right?
So Google tried to create a culture of transparency, though I felt it was more at the manager level and a little bit less at the engineering level, whereas we now see this major trend where more and more engineers are bringing costs into their Grafana dashboards, or looking at cost via kubectl. We definitely see costs becoming more and more of a first class citizen, and these figures had been in place for years at Google, whereas a lot of engineering teams elsewhere are just now trying to get a sense for, "Here's the cost of this microservice or namespace or application, et cetera."
Marc: So before we go too much further, I'd love if you could paint a timeline for us. You have this Open Source project, Kubecost, it's pretty popular, a lot of folks know about it, you have your startup, Stackwatch, and now you have a different Open Source project coming out that's Open Cost. Can you tell us a little bit about how they're related and how you've gone through that process?
Webb: Yeah, absolutely. So we started the Kubecost Open Source project at the very beginning of 2019. We did that really just because we were personally deeply interested in this problem space and in helping users. We launched it with the hopes that it would lead to us starting the company, but we weren't dead set on it by any means. It was only when we saw that 100 teams were actively using it that we were able to validate that we were truly solving a real business or technology problem.
So fast forward about three months after we launched it, we formed our company; fast forward another three to five months, we raised our first dollar of venture capital and then brought on our first teammates. If you look at where we are today as a company, we're a team of about 40, still mostly engineering backgrounds, and we've raised north of $25 million in venture capital.
So what we did is, right around when we raised money, we started building enterprise features on top of the Kubecost Open Source project that do things that are very relevant in enterprise settings, like RBAC and SSO, SAML, that sort of stuff. We're really excited to launch this brand new effort that's called Open Cost.
It's something that we're doing, something we've never done before, which is working with a lot of partners in the ecosystem basically to take a lot of our Open Source code, contribute more to Open Source and then really build out a standard or spec around how to just measure costs in Kubernetes because there's really no uniform answer for how this is done across different environments, different providers, et cetera.
We've got an amazing group of partners, and yeah, we're really excited to launch that here really soon. You can think about it as like we're contributing a lot of our codebase to that and then the Kubecost project will then implement the Open Cost standard going forward, where again, Open Cost will be specifically aimed at measuring and monitoring the cost of containers or containerized workloads in Kubernetes based environments.
Marc: Got it. So is Open Cost really just the stack in order to create a standard around it, or does Open Cost also have an implementation behind it?
Webb: Yeah, so it'll have both. It'll have our core allocation engine, which is a lot of the internals of the Kubecost project today. Most of that is Open Source now, but there'll be some additional code Open Sourced. Then it's combining that and conforming that to this brand new spec that's being created with this group of partners across the cloud native ecosystem.
Marc: I know you haven't quite launched it yet so you might not be able to really name a bunch of names, but what types of other partners are you working with? Are they financial services or are they just other tech companies, to define the spec?
Webb: Yeah. So a mix of cloud providers themselves, as well as end users, as well as observability players that have been thinking about cost in different dimensions. We think that mix or balance is really valuable.
Obviously the cloud providers have been hearing about this problem from a range of users over the past couple of years as Kubernetes adoption has just exploded; end users that are regularly struggling with the problem are tackling it on a daily basis. Then observability players think about this fitting into their broader suite of observability metrics related to different parts of observing a tech stack more broadly.
Benjie: That sounds like a pretty large coalition you're building over there. Can I ask, how did you start building that out? Obviously we're not going to name names right now because it's not complete, but give me an idea of how do you get started on something like this? How do you wrangle all those folks and how do you start an Open Source project, or consortium or whatever the right term is?
Webb: Yeah. So in this case it was really cool, where it almost happened whether we wanted it to or not in the sense of just the adoption of the Kubecost project and some of the functionality that was exposed. It just drew others to start this conversation with us. I think we started with like four or five, and that really quickly grew from there.
So I think, yeah, for anybody thinking about doing something similar, my experience in this situation was just letting it happen organically, which is starting with a project that you think will be valuable and then really engaging with those that are using it and getting value out of it.
As part of this we have submitted the project to the CNCF, again drawing on the amazing contributions from every partner that's been involved, so we're really excited to see it have a neutral governance home one day soon.
Marc: It's a good path to go through, have an implementation, get adoption of the implementation, let it mature a little bit and then take the lessons from that to build the spec, as opposed to the completely opposite angle which is, "Let's build the spec and then build an implementation." It's just so hard to get it right when you take that approach.
Webb: Yeah, totally agree. At this point we've worked with close to 1,000 teams, and we've got thousands of teams using the Kubecost project today. So yeah, between that and our experience at Google, we were able to draw on a lot of different scenarios.
There's a lot of complexity here and it's complexity that is at the intersection of your tech stack, but also your organization. To just pick one example, there's this notion of shared costs in Kubernetes, whether that be the kube-system namespace or a monitoring namespace, and different teams think about sharing those costs differently.
Ultimately it can be driven by something as simple as, "Well, who manages your clusters and who has the ability to influence those costs?" So there's a lot of complexity here, and it wasn't until we saw many different variations outside of Google that we felt like, yeah, we were in a position to comment in a really deep and meaningful way across the flexibility and controls that you may well have in different deployments.
Benjie: So that's super interesting, and I think what's funny here is, in describing how to do this, you've really described the right way to build any product: get the feedback from the users before you start telling them what the right solution is. Those are lessons that we've all heard and we all try to use every day, but it's just a great example of you guys really executing on that.
To switch gears a little bit, going back to Kubecost for a second, I can tell you my personal Kubecost story is 2019, I'm a high priced Kubernetes consultant, firing up clusters for people left and right.
We got a bunch of startups with all kinds of Google and AWS credits to burn. I think it was actually specifically GCP credits to burn. All of a sudden some of these companies started doing really well and they got a CFO or whatever and they're like, "How much are we spending on this?" Like, "I don't know." And then we found Kubecost and it was a very, very, very easy thing to install. So I've been an early user of Kubecost, I probably don't use it as much as I should these days, but I think what's really interesting is how I remember getting value out of Kubecost within...
I think I had installed a helm chart and that was it, and all of a sudden I had this dashboard that I could proxy into. Maybe not the best CSS I'd ever seen, but I appreciated it nonetheless. So just talk to me, what were the original goals for Kubecost itself? And are you aware of how good a job you did at just making it accessible right at the beginning? How much energy did you put into that? Let's talk about the product a little bit.
Webb: That's amazing feedback. I had not heard that story, so really appreciate you sharing it. I think there's a couple things that highlights for me. First is that there's typically a scale threshold where these costs start to become relevant. Maybe it's $50,000 a month, maybe it's $100,000, maybe it's $500,000. But typically it's when these dollar amounts start to get big enough to where, yeah, a CFO or somebody starts to be a little uncomfortable that they don't have great visibility here.
Now that Kubernetes itself is managing billions and billions of dollars of compute, we're just seeing more and more companies hit that threshold.
Yeah, from there we have been really focused since day one on this being super simple to install and get up and running and get value from. We tried, from day one, to create an installation, a first day experience, that we as engineers would want to use and would be happy with. So for most teams, I would say 90%+ of teams, it's a helm install that, depending on your provider and how long it takes to provision or spin up new pods, could take seconds, or a couple of minutes.
Then from there, by default we're scraping data every minute, that's configurable, but we would start collecting information. So within a handful of minutes you should have real visibility and really start to get insights on your environment.
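For readers who want to try this, here is a minimal sketch of that install flow, using the chart coordinates from Kubecost's public docs (verify against the current docs, since names and defaults change between versions):

```
# Add the Kubecost chart repo and install into its own namespace
helm repo add kubecost https://kubecost.github.io/cost-analyzer/
helm repo update
helm install kubecost kubecost/cost-analyzer \
  --namespace kubecost --create-namespace

# Once the pods are running, port-forward to the bundled dashboard
kubectl port-forward --namespace kubecost deployment/kubecost-cost-analyzer 9090
# then browse to http://localhost:9090
```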
I think you touched on the two pain points that we saw. One is just that observability is really hard here. Again, for a lot of teams it's hard to say the cost of a cluster, the cost of a namespace, or however they logically group applications. But then even more importantly, and you may say a downstream effect of this, it's hard to be really efficient when you don't know the cost of different applications and different environments.
And so what we see is that while the Kubecost application has a bunch of different insights on how you optimize your infrastructure, your Kubernetes configuration, your applications, et cetera, we've seen time and time again it's just that transparency and awareness leads to a bunch of improvements, from enabling accountability, like letting engineering teams manage their own resources, et cetera. So yeah, there's a lot of thought that went into that day one and we've just refused to ever let that installation process become any more complex than just a simple helm install.
Marc: Is there a size or scale where you see people adopting Kubecost and having a lot of success? Whether that's $50,000 or $100,000 a month in Kubernetes spend that you mentioned? Or maybe there's other criteria, like you grow large enough that you have a CFO, you have a team that's looking at this? Or are you seeing folks adopt Kubecost when they're even a lot smaller?
Webb: Yeah. So the range is really big, and it's amazing. The range is big in terms of scale and industry, so there's a lot of factors. I would say there are two pieces that I see as big picture trends. One is that if you're doing things in an environment that is very dynamic and your peak scale load is expensive, even if it doesn't persist for very long, that will give teams heightened sensitivity.
One example may be that for a very short period of time you spin up a really expensive cluster with a lot of GPUs to do a bunch of really expensive training, and you typically wind that down or scale that down quickly, but sometimes that doesn't always go as planned. So you may be planning to spend $50,000 and sometimes you accidentally spend $250,000. Those types of environments, I think, drive people to start this type of monitoring and observability earlier.
But then I would say around the $500,000 a year, $1,000,000 a year mark, in that general range, a lot of teams start looking at this really closely. One thing we do whenever we start working with a team is start with one number from an optimization perspective, and that would be looking at this notion of cost efficiency, which is basically: of every dollar you're spending in all of your environments, how much are you actually utilizing?
It's not uncommon for that to be a single digit percentage when we start working with teams, and that most of the time is majorly eye opening. Typically, at that threshold you're talking about the potential to relatively quickly save multiple headcounts' worth of spend, and so that's typically a pretty good motivator for teams to say, "Okay, we can spend a little bit of engineering time starting to think about this problem."
Marc: Whoa, whoa, whoa. Hold on. So give me that algorithm. Meaning, I've got 10 nodes in my Kubernetes cluster, but I know by looking at my Prometheus that the CPU utilization is at, let's say, 5% on average over the month. Then according to you, that's basically saying that I have 95% too much capacity on any given day. Is that the summary?
Webb: Yeah, yeah. Think about a cluster that has CPU and, let's say, RAM and disk, right? We'd be doing that cost-weighted utilization average that you just described across all those resources. Just what you said: if CPU is 99% of your costs and your average utilization over some historical window is 5%, then yeah, that would anchor you towards a very low cost efficiency number. Then you get into a discussion around, okay, that's your average utilization, what about peak or p99, et cetera?
Then you get into these really intelligent approaches for scaling up based on demand or shifting costs to lower cost tiers, whether that's different tiers of disk, or spot, or savings plans, et cetera. The Kubecost product today has about 16 different insights in the community version, and so there's a lot of different things that teams can think about configuring to increase that number.
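As a rough sketch of the cost efficiency arithmetic Webb describes, with made-up numbers (Kubecost's actual model has more inputs):

```
# Cost efficiency = cost-weighted average utilization across resources
efficiency = sum(cost[r] * utilization[r]) / sum(cost[r])   # r in {CPU, RAM, disk}

# Example: $9,000/mo CPU at 5% used, $800/mo RAM at 40%, $200/mo disk at 70%
# (9000*0.05 + 800*0.40 + 200*0.70) / 10000 = 910 / 10000 = 9.1% efficiency
```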
Marc: So does it just give those insights and those learnings? Or does it offer very, very specific remediations or suggestions, like, "Switch to this instance type of your EKS node group," or something like that?
Webb: Yeah, so you would get both. To use that example, we would be looking at historical utilization across all of your applications. You can also provide context to say, "Hey, this is an HA application, so I want more resources and I'm comfortable with a lower average utilization so I can be very confident that I'm not going to be CPU throttled when I have a burst." So we take all of that into account and then say, "Hey, given that you're in an EKS environment, run these different bin packing algorithms and here is a recommendation on a single node pool setup and more complex..." So from there you now have the ability, as of a very recent release, to one-click accept those actions, whether that be right sizing a set of pods or the infrastructure itself.
Benjie: Okay, so cost efficiency is what you guys call it. That seems pretty darn powerful. I have to say, as a user of clouds for a very long time, I've always felt like they don't want us to know what we're doing. I won't ask you to comment on that, Webb, but I think we all know that part of the model is we just throw it over a wall and have someone else worry about it.
I'm just curious, has there been any weird pushback to this that you've seen from developers being like, "I don't want to know how much I'm spending," or whatever? Is it always positive or is there negative there? What are the lessons that you've learned about when you're opening up people's eyes to this cost efficiency number? Because I know we're guilty of it at Shipyard, but all of a sudden you go to a CFO and you're like, "Oh yeah, so it looks like you spent $100,000 last month and you only need to spend $7,000." That can be a tough pill to swallow. Just curious how you navigate those waters?
Webb: Yeah. I'll tell you, overall it's been incredibly positive. I think part of that probably stems from the fact that we were in those shoes not long ago at the cloud providers, and can understand their perspective: at the end of the day, if it negatively impacts the customer experience in any way, shape or form, you do care about it. Right?
But, that being said, if you look at the arc of the cloud native journey, over the last couple of years most teams were really just getting started running production on Kubernetes, and so there was a lot of focus on day one problems, like getting applications up and running for the first time and making sure that they're secure and reliable, et cetera. Cost typically comes after that, and I think as a result you're starting to see cloud providers be more and more focused on it and think more and more about it.
I think we've already seen that shown with the involvement of some great cloud provider partners in the Open Cost effort. But I think the other insight is if you look at the CIO's perspective or CFO's perspective, and they're betting on Kubernetes in a really big way because it's a no-brainer, with all these amazing innovation benefits and scalability, flexibility, et cetera, oftentimes there's this piece of just getting them comfortable with the fact that they will have some observability and some guardrails and some ability to manage cost.
And so from that perspective, again, we very much see cloud providers realizing that this is a way to get teams comfortable with scaling their workloads in different cloud environments. And again, I think that's only really now coming to the forefront as you're talking about tens of billions of dollars being managed on Kubernetes platforms.
Marc: I'd love to dive into a little bit more about what the product Kubecost does, right? We talked about the initial example that Benjie mentioned, it was CPU utilization. That's useful, but I don't think that that's the depth of what the product actually does. Just poking around the website, cost allocation is something that cloud providers can't really do and Kubecost can do that. Can you help me understand a little bit about what that is and how you're able to do it?
Webb: Yeah, absolutely. So that cost allocation piece is really the core of this observability problem. To go back to the example from earlier, you're in a large scale EKS cluster, with hundreds or thousands or maybe even tens of thousands of nodes across a set of clusters, and you're running different instance types. You may be running across AZs, across regions, again across multiple clusters.
You have RIs, you have savings plans, you're leveraging spot instances, and you have a combination of Kubernetes controllers across deployments, stateful sets, et cetera. You're using autoscaling. To actually come back and say, "Here is the cost of an application in that environment," is really, really complex.
So what Kubecost would do by default, when you helm install it, is use public AWS pricing, and then at any point you can integrate it with your cloud account and it would then reflect your actual bill. That could be because those savings plans, spot discounts, et cetera, are being applied; it could also be that you have an enterprise discount applied on top of it. Kubecost would reconcile any price in your infrastructure to what your actual bill says. That's incredibly powerful as a foundational problem to solve, because once you know the cost of everything in your environment, you can start to think about budgets or policies or alerts with a degree of accuracy that just wasn't possible before that was in place.
Marc: Yeah, that's actually super cool. That will actually go into the Kubernetes cluster, not just looking at the cloud provider, but take the control plane and the shared infrastructure cost, calculate the ratio of the cost that each application is using, and attribute it back to that application.
Webb: Yeah, exactly. So one example would be you deploy a new stateful set in your environment, it gets scheduled on, say, multiple nodes. We'd be looking at the amount of resources that you're requesting and using, the cost of those resources on those particular nodes, and then can aggregate that in any dimension. So if you want to look at that cost by labels or annotation or namespace, or just for that particular stateful set, you can look at that in any different dimension and have a bunch of configurability on top of that to say, "Oh, this is actually a shared resource and it should be distributed in this fashion across my other tenants, et cetera."
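As an illustration, recent Kubecost versions expose this through an allocation API; here is a hedged sketch (paths and parameters may differ by version, so check the API docs):

```
# With the earlier port-forward in place, ask for the last 7 days of
# cost grouped by namespace (returns JSON; jq is just for readability)
curl -s "http://localhost:9090/model/allocation?window=7d&aggregate=namespace" | jq .

# The same data can be re-aggregated by an arbitrary label, e.g. a "team" label
curl -s "http://localhost:9090/model/allocation?window=7d&aggregate=label:team" | jq .
```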
Marc: So the only thing I really need to do is just make sure my apps have some label or annotation or some grouping, and then Kubecost could figure the rest of it out?
Webb: Yeah. There we really try to meet teams where they are, and oftentimes teams already have some structure to determine owner or tenant isolation. So yeah, it could be labels, it could be annotations, it could be just namespaces. We even see things like teams having an entire cluster or node pools with taints and tolerations for different tenants, so a lot of ways in the Kubecost project to slice and dice that data.
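For example, a hypothetical "team" label added to a deployment's pod template becomes usable as an aggregation key (the deployment and label names here are illustrative):

```
# Hypothetical: tag the pods of an existing deployment with team=payments,
# so Kubecost can group their cost under that team
kubectl patch deployment checkout-api --type merge -p \
  '{"spec":{"template":{"metadata":{"labels":{"team":"payments"}}}}}'
```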
Benjie: So one question, well, I have a few questions, but one question that I have is, let's just say that I am very, very, very privacy driven and security conscious, and I don't want Kubecost to know my exact utilization, or maybe there's just some metadata in there. Does my info stay on my own cluster, or what's the data model around where my data lives?
Webb: Yeah, it absolutely does. This is another thing that is super important to us. We talked about building the product experience that we wanted to see from day one, and that was being able to install the product in minutes or less. We definitely didn't want to have to get on the phone with anybody from sales or anything to install the product.
We felt that we should have total control and ownership of all our own data, so by default nothing ever leaves your environment. You can even lock down egress from the namespace where Kubecost is installed. By default we ship with a persistent volume, where there's a time series database and caching layer data stored locally in that cluster. You can choose to write that to, say, an external storage location like an S3 bucket which you own and control completely.
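A minimal sketch of that egress lockdown, assuming Kubecost lives in a kubecost namespace and your CNI enforces NetworkPolicy; in practice you would add allow rules for in-cluster destinations Kubecost still needs (DNS, the Kubernetes API, kubelets):

```
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-egress
  namespace: kubecost
spec:
  podSelector: {}          # every pod in the kubecost namespace
  policyTypes: ["Egress"]
  egress: []               # deny all egress; add allow rules for in-cluster traffic
EOF
```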
All three of those aspects are super, super important to us. That said, we do now have a hosted product that's in limited availability, that we're actually testing with a handful of users who say they would prefer us to manage the data plane for them, where we would just ship a really small, single-pod agent and push metrics to us remotely.
Marc: So that's more of a SaaS model at that point?
Webb: Exactly.
Benjie: Because that is something I remember explicitly being like, "Oh my god, I love this. I'm running this on a pod, the interface itself is on a pod, not actually going to someone's website that has all of my information that I don't want them to have." I have to tell you what I do with my company, Shipyard. Let's say this: every one of our customers has a single tenant cluster and it's been like that since the beginning, and we abused Kubecost in a few tricky ways at the beginning there and got away with some stuff we probably shouldn't have gotten away with.
Sorry about that. I don't know why, it's not a confessional. I'm sorry, but I just felt the need to tell you. I'm sorry, Webb. But it made me love your product, I will say that, so very, very much. Today if I am signing up for Kubecost, and I won't be specific about Shipyard's numbers, but I've got tens, twenties, thirties, possibly hundreds of clusters, but each one of them maybe only has a few nodes.
That would fall into the business tier, right? But then all of my data is living on each one of those individual clusters, so no one sees my data, my customers' data. It's just this unified thing that is pretty great, and then it can also help me figure out how to make those clusters more efficient for my customers. Is that correct?
Webb: Yeah, that's exactly right. First, appreciate you sharing that story. I think it speaks to how we think about building products; if you could have seen my face when you were telling that story, I had a big smile on it. We want to be helpful, first and foremost, and above everything else, always. On top of the community version of Kubecost, we've got two paid tiers, a business tier and an enterprise tier. The business tier allows you to do just what you said, which is toggle back and forth between an unlimited number of clusters.
We've just found that above, say, three, four, five clusters, users generally find it valuable to have a single interface where they can easily transition across them. But then the enterprise product itself deploys with typically a different architecture, which is commonly backed by a storage bucket, and then uses Thanos or Cortex to build a totally unified view of everything.
Then you can truly go and see all the insights or optimizations that Kubecost has, what's the highest dollar impact, and that could be on cluster number 47 out of 120. It also allows you to say, "What's the cost of this application across all of my clusters, all of my environments?" And filter across dev, prod, staging, for example. Those are two of the key features that are unlocked with those paid versions of Kubecost.
Benjie: Yeah. Look, I remember the reason. Someone explained this to me when I was very early on in my career: the reason why it was really easy to crack Photoshop back in the day, or Word or Maya 3D or whatever these things were, is because the college students get the cracked versions, they get good at using them, and then they go to enterprises that pay for licenses. I'd very much say that model worked, or is working a little bit, with you guys. I'm not sure if anyone's supposed to say that out loud, but I just did, so what are you going to do?
But no, I really appreciated how I had the data on my cluster, I could abuse it, sorry, and it really brought a lot of value. So I think this is just super interesting, the way that you guys started. I don't mean to kiss up too much, I sometimes kiss up to projects that I really like, but I really like this project. It obviously left a mark because of how open it was and how much I could get value out of it immediately. I think a question that I would have for you is, the next immediate step is Open Cost, right? But what are you dying and itching to build next into Kubecost, the product itself?
Webb: Yeah. A lot of stuff really building on top of Open Cost, now that we'll have this unified standard to think about costs in different dimensions. I think we're just at the beginning of interoperability or integrations, or just really cool things we can do there. To me, the obvious example is we're going to have better transparency across Grafana, Prometheus and other time series databases that speak PromQL. We can increasingly say, "Here is a broadly community adopted standard for how to measure cost." I just know that so many other really, really interesting approaches for then managing costs, or managing cost relative to performance and these other trade offs, are going to start to emerge.
I could see some of those being part of the Open Cost project, I could see a lot of them also just being built on top of that Open Cost project or integrating those metrics into other products. It's such an interesting time for me, in that I feel like we're about to begin this brand new chapter and I feel like there's 50 different directions and integrations that I'd love to see, and I can't wait to see what comes first, and I really look forward to us working with people across the cloud native ecosystem in addition to some of the core functionality I think we can build within the Kubecost product itself.
Benjie: Yeah. It's actually interesting you say that. I've been looking at cost optimizations a bunch for what we're doing, and a few companies have popped up, like Vantage, and Ternary I think is the other one, and CloudZero. These are just SaaSes that I've seen while doing some research.
They're more about monitoring the entire cost of the entire cloud, whereas I think that it's really interesting that you guys have just really focused on the Kubernetes CNCF side of this world. Do you see a world where you guys all work together or do you see that's something that you would work on outside of the partnership or something like that?
Webb: Yeah, I definitely think there's the possibility. We're of the mind that we're going, as you say, super deep on Kubernetes, and we want to be the absolute best at helping teams think about cost and manage costs. From our experience at Google, that's the biggest pain point that we saw during this move from VMs to containers. But ultimately the intersection of general cloud service cost data and deep container insights is super, super interesting.
We've got a couple partnerships that are early days. I definitely want to see us do more here as time goes on, and now with the launch of Open Cost, that could be through pure Open Source or through enterprise Kubecost, which has some additional functionality on top of that. It just feels like we're going to have more options and paths to go down for different collaborations like that, and I think it will be hugely beneficial. I think there will be a lot of teams that just start with the Open Source, and then we'll see where the integration evolves.
Marc: Webb, I have a question about the product again. We talked about the cloud optimizations and looking at the cost for the different cloud providers, but I think you also work with on prem clusters. How far down that world do you go? There may be folks listening to this that aren't running in AWS or GCP, but they're racking and stacking servers, they have a colo that's running Kubernetes. Can you talk about how you're able to help manage that?
Webb: Yeah, absolutely. So today, between 15 and 20% of Kubecost users have an on prem component to their broader footprint. So we've got a bunch of customization and flexibility around bringing basically custom pricing sheets. That can be determining an amortization schedule on all of your infrastructure, determining from there an hourly operating cost for, say, nodes or resources within your infrastructure. Then from there it can really build up all of the same insights that would be available in a cloud environment.
That can be right sizing applications, configuring Kubernetes, or looking at bin packing efficiency, et cetera, down to the infrastructure level. One thing we haven't touched on yet very explicitly: a lot of teams with an on prem footprint at scale may have had some sort of chargeback model in place pre Kubernetes or pre containers, and this can be a really important component and data source for getting back to that chargeback model, so that different tenants can really take ownership and accountability of the resources that they're consuming. Maybe across not just their on prem environment, but on prem plus cloud in a single pane of glass.
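A hedged sketch of what that custom pricing looks like in practice; the chart has supported custom pricing via values along these lines, but the exact keys and units vary by version, so treat the names below as assumptions and check the chart's values.yaml:

```
# Assumed keys: customPricesEnabled plus per-resource monthly rates derived
# from your amortization schedule (e.g. $/vCPU-month, $/GiB-month)
helm upgrade kubecost kubecost/cost-analyzer --namespace kubecost \
  --set kubecostProductConfigs.customPricesEnabled=true \
  --set kubecostProductConfigs.defaultModelPricing.CPU=18.00 \
  --set kubecostProductConfigs.defaultModelPricing.RAM=2.50 \
  --set kubecostProductConfigs.defaultModelPricing.storage=0.04
```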
Marc: Yeah, that's cool. Kubernetes really does become the modern cloud provider, I guess. You handling the integration from Kubernetes into the underlying infrastructure really allows me, a user of Kubecost, to be able to say, "It doesn't really matter where I'm running this infrastructure, I get the same reports, the same data, the same information."
Webb: Yeah. That's exactly right. If we go back to some of those core tenets early on, we drew a line in the sand. At the time I think it started with 1.11, where we basically said that anywhere the Kubernetes API is fully supported, above 1.11, we will support you with Kubecost.
Now, we may not have the deep built-in API integrations in a cloud provider that we haven't integrated with, but we would work with teams to bring their own pricing sheets if they're really interested. And we now have a number of providers that are on the roadmap for Open Cost, where we want to go do those deep integrations like we have for AWS, GCP and Azure today, where we support enterprise discounts and all of the complexities of modern, at scale, enterprise agreements with cloud providers.
Benjie: I want to go back for one second, because I don't know if we talked about this and I want it to be clear. When I do my helm install of Kubecost, what is it actually installing? You said it could be a restricted namespace potentially and all that stuff, but I just think it would be good for people... Obviously I'm basically a walking commercial for Kubecost, but I think it would be good for you to just explain what's actually installed on my cluster, all the way.
Webb: Yeah. So first, tons of modularity and flexibility, but at the core are two components, the only required ones. One is the core Kubecost pod itself. It is what's collecting all this telemetry, it is what's generating the insights, et cetera. It has that core allocation engine that we talked about. Then it is tightly coupled with a time series database. We by default ship with a Prometheus server, but that can be swapped out with a time series database of your choice that speaks PromQL.
Then optional modules: we can ship a node exporter where you get machine level metrics, which unlocks a couple of extra insights. Then a set of Grafana dashboards. This supplements the core Kubecost project, but it's also meant to just be a reference implementation for teams that want to bring cost metrics to their Grafana dashboards. So really it's those two core pieces, the Kubecost pod and Prometheus, and then a couple of extra optional add-ons like the ones I mentioned, and then also the notion of network cost monitoring as well. That's a totally optional add-on.
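A sketch of toggling those optional pieces at install time; the flag names here are assumptions based on recent chart versions, so check the chart's values.yaml for yours:

```
# Core Kubecost pod + bundled Prometheus, plus the optional node exporter
# and Grafana dashboards Webb mentions (both can be turned off)
helm upgrade --install kubecost kubecost/cost-analyzer --namespace kubecost \
  --set prometheus.nodeExporter.enabled=true \
  --set global.grafana.enabled=true
```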
Benjie: What exactly is it using to get these stats? It's using the Kubernetes API I take it, but tell me how that works a little bit.
Webb: Yeah, so we talk directly to the Kubernetes API, looking at cAdvisor data. If you deploy the network module, we're looking at either a kernel module or eBPF, and then we're also talking to cloud provider billing data if you're in a cloud environment. So a range of data sources, but again, all of that processing and ingestion is done directly in the cluster and stays in that environment.
Benjie: Cool. Wait, so the eBPF piece, is that a new thing? Or was that there a long time ago?
Webb: It is a relatively new thing. That's that optional add-on which I think is really interesting as an engineer: we build a map of your entire Kubernetes environment, and from there we can say intelligent things about things like cloud egress versus cross-region egress, et cetera. We can then come back and allocate that to individual pods or services so that, if and when you did get a large network traffic or gateway bill, et cetera, you can actually allocate those costs back to the core Kubernetes tenant that generated said traffic.
Benjie: That's really cool. For those that listen to the podcast or read the newsletter, everyone knows that I think SBOMs and eBPF are the future, so I am always happy to hype that a little bit more. That network module, that's just a DaemonSet I throw on, right? A DaemonSet that I'm putting on each one of my nodes that I want to have that on for, correct?
Webb: Yeah, exactly. It's like a single helm flag: if you just helm install or helm upgrade, it would deploy that DaemonSet. Then, yeah, you would just start seeing network metrics in Kubecost and different insights at different levels, whether that's seeing the cost of egress by a namespace or a label or a pod, et cetera, et cetera.
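For reference, per Kubecost's public docs that flag looks like the following (verify against your chart version):

```
# Deploy the optional network-cost DaemonSet with a single helm flag
helm upgrade kubecost kubecost/cost-analyzer --namespace kubecost \
  --set networkCosts.enabled=true
```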
Marc: You mentioned earlier that once you formed the company Stackwatch around it, you started building some enterprise functionality. Is that separate binaries that I'm actually running or is it just handled through licensing?
Webb: So if you think about the two different images that we build today, the first is pure Open Source and then everything else is managed via licensing. Our view is we want to, one, make it really easy for teams to try paid features if they are interested in those, and then two, if you were to upgrade to, say, enterprise Kubecost, we want to make it a super smooth transition. So you can just drop a key in and unlock those other RBAC, SAML, et cetera, enterprise features.
Marc: I'm always curious, so when you have that functionality but it's behind an entitlement that somebody has to have a license for, are you just delivering it in the Open Source repo or are you managing that and keeping that code and the implementation of that functionality proprietary?
Webb: It's a mix, but some of those enterprise features today are in a private repo. Yeah.
Marc: Got it, cool. I want to transition to the creation of Open Cost now. You talked a little bit earlier about what that was and how it grew out of building the Kubecost project. I'm curious, was there one event? Can you pinpoint one thing where you were like, "Yeah, now it's actually time to try to create this consortium and actually create this spec and go through this process"?
Webb: Yeah, I think it was really twofold. One was partners coming to us saying they'd seen what we were working on, were struggling with the same problem in some capacity, and asking, could we look at doing something together? Or do you have anything we can leverage? Does your Open Source do this, et cetera?
It was partner engagement, but I think also there was this other piece, which is that after working with close to 1,000 teams, we'd seen a large number of enterprises roll their own solution here. That could be at varying degrees of complexity, some just a Grafana dashboard, others building things on top of Prometheus, and others building their entire end to end observability stack.
After seeing, I would say, more than 100 of these, we saw that basically all of them came to different answers and approached the problem from a different perspective. Seeing that, combined with the partner discussions, really highlighted the pain point: without common definitions and really a common standard, there are going to continue to be a lot of different ways to think about cost, just because there's enough complexity here that teams will continue to build their own solutions and get to their own answers without more guidance from the community.
Marc: So when you actually decided to do this, how much of a discussion did you have as a team to decide, "Yeah, let's not only create the spec but put it in the CNCF and go through that project," versus just, "We have an Open Source project, we can define a spec, we can put this on the GitHub repo"? What is the value in actually contributing that spec to the CNCF and going through that whole process?
Webb: Yeah. We had a fair amount of discussion about it within the team and it's interesting because day one when we launched the Kubecost project, we had ultimately an intention and a goal to contribute either all of that Open Source or some portion of it to CNCF. But fast forward, call it two plus years after launching that project and having a bigger engineering team, we had a lot of discussions around what's the best way to do this and what's the best way to segment code and put a governance model in place, et cetera, et cetera.
So a lot of discussions there, and ultimately it was around how do we do the right thing for the community? And how do we also put ourselves in the position to still be able to build a lot quickly for the Open Source project itself, but also on top of the Open Cost project? Because we still feel like early days for this effort and bringing cost as a first class observability metric, so we had a lot of discussions there.
But ultimately it was bringing these partners in and having conversations there to just see all the benefits from having this in a neutral, third-party-governed environment, because the amount of ideas and input we've gotten through that process, once we stated those intentions and once we developed this in a community-first way, has been huge.
It was a totally different approach from us building V1 of Kubecost just with customer conversations, but really taking our own shot at it at the beginning.
Marc: Got it. I think there's a lot of CNCF sandbox projects and applications right now, there's a discussion on the TOC mailing list around changing some of the process. I'd love to hear, if you can share, a little bit about the timeline and your thoughts on the process. Overall, what's it been like? How long has it taken? I don't think that it is a sandbox project yet. Have you gotten any feedback from the TOC along the way?
Webb: Yeah, so we have submitted it, it is in the queue now. I believe when we submitted it there were 15 to 20 projects in front of us. We probably first submitted it about two months ago, and I think we're at the very front of the queue now. So from our perspective it's moved pretty quickly. We were able to talk to a couple of TOC members who gave really, really helpful guidance in terms of how they think about that, and how they think about going into the sandbox and preparing for incubating, et cetera, et cetera.
I have seen some of the discussion, and I think there are really interesting perspectives. From my perspective, there are a lot more projects that are really interested in being part of the CNCF sandbox, which overall is a great thing. I think it's also a good time to reflect on why the sandbox exists, the value it's providing and whether that is changing over time. But yeah, overall from my perspective, things have moved pretty quickly, and hopefully we're going to get early feedback on the submission pretty soon.
Marc: For what it's worth, I think that's a pretty good approach when you have a relatively successful Open Source project and a commercial entity behind it, just realizing that instead of donating or contributing your project to the CNCF, you can actually contribute a reference spec or a spec and a reference implementation that your project implements allowing... Really opening the door to competition, but keeping control of your commercial entity, your business.
Webb: Yeah, at the end of the day, we think about it as: we want to help the community build this standard that doesn't exist today, and we then want to build a bunch of cool stuff on top of it. But yeah, we would love to see everybody involved in pulling this standard together, with us increasingly being one small voice in this broader group that's driving this effort together. Already we've seen that become true, where we're increasingly just one small piece of a great group that's bringing fresh ideas and details to the spec as it stands today.
Marc: Yeah, so talk a little bit about your existing community, because obviously it's going to grow, hopefully, and you're going to get more support from the sandbox stuff if and when that happens. But talk about the last few years of building out Kubecost and how the community has grown and the contributions there. I bet you maybe had a few hires out of that. Tell us a few highlights about the existing Kubecost community.
Webb: Yeah. A lot of the Kubecost community has come together organically. We don't yet have someone focused on it full time; our engineers are in there, we're there and talk directly to users as founders or creators of the project. It really goes back to us just trying to create a Slack workspace that's welcoming first and foremost, where you can come and ask Kubecost questions, or Kubernetes or general optimization questions.
Today we've got, I believe, about 2,000 users in that community, but really on our side, we're looking to invest and engage with that community in more and more meaningful ways going forward. It doesn't look too much different from when we first started the project, which was just a couple engineers getting together saying, "Here's how we're thinking about optimization in the world of cloud native."
Benjie: Okay. This is all super cool. Now I have to ask you the question I love to ask everybody. What are the weirdest, coolest, oddest, and there can be a few of these, examples of Kubernetes installs running really weird workloads that you did not expect to see? I feel like you've probably seen some pretty cool, random things with your work so far, so do you have any fun anecdotes around that?
Webb: So many cool things, unintentional and intentional. One that I like to share, and I'm really proud of it, though it was not an intentional deployment of Kubernetes by any means: we have now caught multiple Bitcoin miners with the Kubecost project. We talked a little bit about it earlier, but in both of those cases our network module was deployed, and specific namespaces or services were both consuming meaningful amounts of CPU and egressing data when they shouldn't have been.
The maintainers of those clusters went and investigated and, yeah, found malware in the form of Bitcoin miners. That was not necessarily something that we designed Kubecost for, day one, but it made us really proud that we were able to help stem those attacks and consumption of resources that were unwanted.
Benjie: Yeah, that's a fun one. Another question I have for you that's a little bit going back, I would love to know, I'm biased but do you have any rough ballpark on how much people leave preproduction infrastructure on and what the costs are there? This is a number that I need to figure out for my own personal stuff over here, but I'm just curious if you have any insights on that because I've heard some astronomical stuff, I have personal experience obviously helping people reduce their preproduction environment costs. But I'm just curious if you have any thoughts on that?
Webb: It varies, and one of the big things that drives that variance from our perspective is, one, organizational factors, like understanding your application and how you're building microservices, et cetera. But two is how do you provision and isolate different developer environments? Some are giving namespaces per developer, some are giving clusters, some are giving virtual namespaces, et cetera, so depending on how you're doing that, it can have a very heavy impact. Obviously there are a lot of other things, but that is one big factor.
We regularly see it at like 15 to 20% of total compute spend when you give a lot of freedom to developers, but yeah, like you were saying, that can be typically reduced by 50%+ pretty quickly. We work with teams to think through the trade offs, obviously you want your engineers to be able to move really quickly and if they need a fresh environment to test some new service you're building, that can be really important. But also you want to make sure that, yeah, you're not totally shooting yourself in the foot from a cost standpoint while doing that.
Benjie: Right, then building out. Okay, now we're going into real Shipyard territory here. But have you seen people building out cost controls on top of your alerts and stuff like that?
As an example, I think something that I very much am an advocate for is ephemeral environments, so on every pull request you get an environment that's fully encapsulated, yada, yada, yada. But the cleanup of those environments is something that doesn't always happen, and limits on how long they live aren't always enforced either. Are people using Kubecost, because I know there's some alerting syntax, to actually bake that into their existing bespoke devops pipelines? Or are you seeing anything there?
Webb: Yeah. This is where we feel like interoperability and an open API, an open set of standards, is so cool and powerful. We've seen a number of things. One example would be we actually had a user build a Spinnaker integration for Kubecost where, as part of your deployment pipeline, it would look at cost, cost efficiency, et cetera, and could take action for you. That could be blocking a deployment, for example.
But we've also seen teams take exactly what you just said, which is these alerts firing to a webhook and then taking some action, where the action could be pausing workloads in a namespace, or notifying an owner that they will be paused in some certain amount of time, or just right sizing those workloads on their behalf. What we try to do in general is create these really flexible APIs and insights, and we'll have more and more of the ability to take action within our product. But yeah, we've also seen teams navigating this and building really cool stuff on their own.
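A minimal sketch of that webhook-to-action pattern, assuming a hypothetical handler that receives the offending namespace from the alert payload:

```
#!/bin/sh
# Hypothetical webhook handler body: $NAMESPACE comes from the alert payload.
# Pause everything in the namespace by scaling its deployments to zero;
# a real handler would notify the owner first and record previous replica
# counts so the environment can be resumed later.
kubectl scale deployment --all --replicas=0 --namespace "$NAMESPACE"
```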
Benjie: Yeah, my mind races when you start talking about that. A little preview for Kubelist listeners: we're going to try and get Justin Garrison of Karpenter fame on here to talk about that stuff. But are there any cool integrations with the autoscalers of the world? That Spinnaker thing is great. Any projects I could go look at and just be like, "Whoa, here's someone using this cool Open Source thing with this other cool Open Source thing to shape their cluster and get costs down"? Are there any projects that you could point to, or not quite yet?
Webb: Yeah, so a couple things. One is there are now a couple blog posts on the Spinnaker piece; I believe there's a user blog post, and now we have one, and I believe the Armory team has one. But for sure the user's plus ours. Then I think a really cool place to look also is the Kubecost community version's integration with the Cluster Autoscaler. There it can basically give insights into why autoscaler behavior is what it is, and that could be: why isn't my autoscaler scaling down like I would expect it to?
So those would be two cool Open Source integrations to check out. There'd be a handful of others. One would be just Kubecost and Prometheus, where basically all of these metrics that we've talked about from the core allocation engine are written back to a time series database of your choice, which by default is Prometheus. From there you can do Grafana alerts, Alertmanager, custom rules, that sort of stuff.
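For instance, a hedged sketch of a Prometheus alerting rule on one of the metrics Kubecost writes back; node_total_hourly_cost is taken from Kubecost's metric docs, and the threshold is made up:

```
# Write a Prometheus rule file alerting on cluster-wide node spend
cat <<'EOF' > kubecost-cost-rules.yml
groups:
  - name: kubecost-cost-alerts
    rules:
      - alert: ClusterCostSpike
        expr: sum(node_total_hourly_cost) > 50   # > $50/hour across all nodes
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Cluster node spend has exceeded $50/hour for 30 minutes"
EOF
```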
Benjie: Now, I know you said there was a release coming out. We're recording this before the release, obviously, we're actually recording this before KubeCon. Right before KubeCon. Do you want to give us any details on that release that's coming out and what we should look out for, how we can contribute and keep an eye on it?
Webb: Yeah, so within a week after KubeCon EU, expect an announcement around this brand new Open Cost effort. We are super excited about it. Again, this is the first thing that we've built with a handful of partners in the community. We look at it as the beginning of this new chapter of really taking our code and implementation, combining it with this spec, and putting it in a neutral home where others can drive the roadmap forward with us in a really big way.
Benjie: Super cool. All right, well, Webb, that's a lot of really awesome stuff. We all know that I'm a big fan of Kubecost. Really appreciate you coming on, and we'll be taking a pretty big look at Open Cost and figuring out how we can use that in our own projects and across the CNCF. So thanks again.
Webb: Thank you, guys, so much. Really enjoyed the discussion. Thanks so much for having me.
Marc: Thanks, Webb.