The Kubelist Podcast
60 MIN

Ep. #51, CI Is the New Bottleneck with Kyle Galbraith

about the episode

On episode 51 of The Kubelist Podcast, Marc Campbell and Benjie De Groot sit down with Kyle Galbraith. Kyle shares the story behind Depot and explains how the company evolved from accelerating Docker builds into building an entirely new CI platform designed for the AI era. The conversation explores BuildKit internals, remote caching, microVMs, AWS infrastructure, and why modern software development may require rethinking CI from the ground up.

Kyle Galbraith is the co-founder and CEO of Depot, a developer infrastructure platform focused on accelerating container builds, CI workflows, caching, and remote execution. Before founding Depot, Kyle worked on platform engineering and infrastructure systems at multiple startups, including the nonprofit Thorn, where he helped build systems supporting online child safety initiatives. Today, Kyle leads Depot’s efforts to rethink CI and build infrastructure for the AI-native era of software development.

transcript

Marc Campbell: Hey, welcome back to another episode of The Kubelist Podcast. Today we have Kyle Galbraith, co-founder and CEO of Depot, to talk about some of the new stuff that they've been building. I'm really excited for this conversation because I've been using a lot of Depot, their new CI service, since it launched, and going to have some questions here too. Welcome, Kyle.

Kyle Galbraith: Awesome. Thanks for having me.

Benjie De Groot: Yeah, Kyle, super excited to have you on. Been following you for a while. We kind of like to start it off with you just telling us a little bit about your background, where you grew up, when did you get into programming, engineering, all those things. So just give us a quick overview of where you came from.

Kyle: Yeah, I think my story is probably somewhat similar to most engineer-turned-founder stories, kind of all started with video games, way back in the day. I grew up racing motocross, racing dirt bikes. From the time I was six years old, I was on two wheels racing around the country in the US, and I fell in love with a computer game called Motocross Madness. Worked on hacking my own mods into the game, way back in the day, but then kind of shelved computer programming when I went into university.

I thought about going to film school in Southern California, but ended up taking an elective in community college and was back in computer science. I was like, actually, I really enjoy this. This is my preferred thing. And yeah, went through the whole computer science program. Did that while I was working full time as a software engineer for a startup back in Portland.

So I've worked in startups my entire career and pretty much that first startup was like, yeah, like I'm gonna start my own startup someday. I was dead set on that. Kind of went through that for several years, then did a short stint of consulting.

I absolutely hated consulting. It was like a peek behind the curtain that I did not need in my technical life. Sometimes you don't need to know how the sausage is made, let's put it that way. Like you don't need to know that some technologies that are holding up billion dollar companies are built on toothpicks.

And so I left that. And yeah, that's when Jacob and I-- Jacob is Depot's other co-founder. We met at a nonprofit called Thorn and we were kind of brought in as software engineers to that company. If you're not familiar with Thorn, Thorn is the nonprofit that was originally founded by Ashton Kutcher and Demi Moore that focuses on child safety online. So specifically removing child sexual abuse material from the Internet.

So we were working on some really interesting products and services for government agencies. So think like FBI, Department of Homeland Security.

Benjie: Some three letter agencies. That's super interesting. Wait, so you were a motocross person at 6 years old? That is not common for most engineers that I know. So where'd you grow up? Were you in Portland, you said is where you grew up?

Kyle: Originally from Portland, Oregon. Nowadays I live in the south of France, so I live in Montpellier, France specifically.

Benjie: Oh, okay. Again, not very common for, well, for French engineers that's definitely common to live in France. So it's interesting to see that you kind of had a broad swath of engineering experiences before you ended up starting Depot. And that's really cool that you were working on this nonprofit with your co-founder.

So we're going to fast forward a little bit. 2020, 2021 is when Depot kind of was a twinkle in the eye or when did it start?

Kyle: Yeah, I think Jacob and I left Thorn and we joined a database startup called Aero Software that was largely focused on building an Elasticsearch replacement, and then kind of moved into the Datadog, SigNoz type of space. And I think that was 2020.

And then we kind of like worked on that, but then we started like-- We always faced very specific problems working as platform engineers both at Thorn and at this database company where we were building out their entire platform as a service. We actually used Replicated at that company for like how to deploy that database technology to customer accounts.

And we just started building like a better way to build container images initially. So that's the first product that Depot ever built.

Benjie: So you left the company and then built this thing? Or you kind of figured out this problem and started building it internally and then--

Kyle: Yep, started building it internally for ourselves. Effectively, the idea back then was building a container image inside of GitHub Actions is painfully slow for two reasons. One, saving and loading layer cache over the network to GitHub Actions Cache API is, for lack of a better way of putting it, a dumpster fire.

And then building multiplatform images is not great either because you have to emulate the ARM architecture if you're on an Intel runner.

Benjie: Yeah. And the common theme there is speed, I think was the dumpster fire part of all that stuff. So you guys are like, hey, this is a problem, let's try and start solving it. And then you started the company in 2021? 2022?

Kyle: Started the company in 2022. Effectively, we built that first prototype of our container build product, rolled that out in beta in May of that year, had our first 10 beta customers within a week, and then spent between May and July of that year building the first version that we launched publicly.

And when we launched just that product publicly, it was like, I don't know, something like 25 customers in the first week, which is nothing nowadays, but back then was like, oh, actually other engineers have this problem. Like, this could be more than a side project.

Benjie: Hey, a lot of people would kill for 25 customers, Kyle. Haha.

So what would you say in the early days, from a technical standpoint, what was your solution that was kind of unique? I'm guessing you were bootstrapping bare metal or just large EC2 instances to build these things and push images around. But what was some of the core insights early on from a technical standpoint?

Kyle: Yeah, when we were talking about the original Depot product, which is the container build product, there were really two insights, and it comes back to the original pain points. One, building a container image inside of a CI runner is slow because you don't have your previous layer cache. Right?

So of course you can persist it off of the machine over the network, but networks are slow, they're even slower inside of CI systems and they're flaky. And so the first like, real unlock was, hey, we can fork BuildKit. That's the build engine behind Docker. Right?

Benjie: Caching still doesn't work correctly in BuildKit, I'm pretty sure. I haven't tried it in like a year or two, but, oh my God.

Kyle: Yeah.

Benjie: So you actually forked the original. You guys did your own fork of BuildKit?

Kyle: Actually we have our own fork of BuildKit. Nowadays, that fork isn't really a fork anymore. It's our own, essentially, container build product. It's our own build engine. There's a lot that has been changed inside of that, all the way down to like the internal metadata database inside of BuildKit was like this weird archived repository and now we replaced it with like true SQLite inside. Things like that.

Benjie: So going back, so you guys fork BuildKit and then basically spin up some bare metal. Was it bare metal? Was it EC2 instances? What were you guys doing early on?

Kyle: Yep. So the container build product is EC2. So effectively we shift the container image build off of the CI runner and onto a remote machine. So now like your build is happening remotely. And then the original prototype was using EBS. So like we write your layer cache to an EBS volume.

So then like your container build happens remotely. When your build is over, we kill the machine, we keep the EBS volume effectively. Right? And so then effectively when you do another build, we reattach the EBS volume. Your cache is instantly available.
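The lifecycle Kyle describes can be sketched as a toy model in Python. The class and names below are purely illustrative, not Depot's actual code:

```python
# Toy model of the early Depot flow: the build machine is disposable,
# but the project's layer-cache volume survives and gets reattached.
# All names here are hypothetical, not Depot's actual implementation.

class BuildMachine:
    def __init__(self, project_id, volumes):
        self.project_id = project_id
        # `volumes` simulates the pool of persistent EBS volumes, keyed by project
        self.cache = volumes.setdefault(project_id, set())

    def build(self, layer_key):
        # With the volume reattached, layers from prior builds are instantly available
        if layer_key in self.cache:
            return "cache hit"
        self.cache.add(layer_key)
        return "built and cached"

volumes = {}  # persists across "machines", like the kept EBS volumes

first = BuildMachine("proj-1", volumes).build("layer-abc")   # machine killed after
second = BuildMachine("proj-1", volumes).build("layer-abc")  # fresh machine, same volume
print(first, second)  # built and cached cache hit
```

The key design point is that cache lifetime is decoupled from machine lifetime: the expensive resource (the machine) is ephemeral, while the valuable state (the cache) persists.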

Benjie: Right. Well you know, except for the EBS factor.

Kyle: Nowadays Depot's container build product does not use EBS.

Benjie: Yeah, but early, back then, that makes a whole lot of sense. And so what kind of gains were your customers seeing early on?

Kyle: Yeah, I think like back then 2022 we were talking like 5x, 10x speed up with just the layer caching alone. And then we added multi platform image builds. So that concept is effectively take the Intel portion, build it on an Intel machine, take the ARM portion, build it on an ARM machine, skip all of the emulation, runs on native CPUs, merge the image back, you get one image that can run anywhere.

And when you factor in, when you get into multi platform image builds, that's where you get into the 40x, 60x faster because you're skipping the emulation.
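The split-and-merge strategy Kyle outlines might look roughly like this as an illustrative Python sketch (the function names are invented for this example):

```python
import hashlib

# Illustrative sketch: build each platform on a native machine (no QEMU),
# then merge the per-architecture images under one manifest list.

def native_build(image, platform):
    # Stands in for a build running on hardware that matches `platform`,
    # so no emulation is involved.
    digest = hashlib.sha256(f"{image}:{platform}".encode()).hexdigest()[:12]
    return {"platform": platform, "digest": f"sha256:{digest}"}

def build_multiplatform(image, platforms=("linux/amd64", "linux/arm64")):
    per_arch = [native_build(image, p) for p in platforms]
    # The manifest list ("fat manifest") lets a single tag resolve to the
    # right per-architecture image anywhere it's pulled.
    return {"image": image, "manifests": per_arch}

result = build_multiplatform("app:latest")
print([m["platform"] for m in result["manifests"]])
```

The speedup comes from the `native_build` step: each architecture runs on its own CPU type in parallel, and only the cheap manifest-merge step happens at the end.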

Marc: Yeah. And on GitHub Actions you're running in a shared constrained environment, much more limited. But then you're also generally running QEMU on top of that. Unless you're doing like self hosted runners, which then when you try to spin up the ARM ones, it's just going to like be very, very slow.

Kyle: Yep, exactly.

Benjie: Okay, so you mentioned that you got 25 customers in the first week. Especially in today's day and age, which is, you know, this is four years ago or five years ago, but now it seems like 17 lifetimes ago. How did you guys get attention? How did you guys get people to find out about you?

Kyle: Yeah, I think this is still true today, we lean very heavily into writing deeply technical content on our blog. Depot, from a technical aspect, I think we've always been a pretty open book. There's not really many secrets in terms of how we accelerate things. And doing it that way effectively allowed us to have content and material that resonated with people on Hacker News and various different subreddits.

And so that was a major help in getting that initial customer base built out. And then that's where the YC component comes in. Right? So we got those initial customers. It was like, "oh, okay, maybe this isn't a nights and weekends thing. Maybe this isn't a side project. Maybe this could actually be a startup."

And I've told this story a couple different times, but the truth is we applied to YC on a whim. We had no real plans of actually starting a company and doing fundraising and going through all of that.

But like when we applied, like we sweated the details. I think that's like one of the things like many people don't get about YC even today is like they just like apply and they're like, I have this idea like "Twitter for cats with AI generated images. You'll fund this, right?"

Benjie: Yo, Kyle, I told you that in confidence. I don't know why you're-- Haha.

Marc: Haha.

Kyle: Yeah, exactly. And we sweated the details. We were like, what's the actual business here that we could build, again back then just focused on the container build side of things. And we found out that we got into YC and we were in the winter 23 batch. We were the last hybrid batch. So Jacob and I did not move to San Francisco.

Benjie: Were you in France at the time when you applied?

Kyle: Yeah.

I literally interviewed for YC on the floor of my totally empty house back in Portland because like I had already sold everything. Like I was moving that weekend to France.

Benjie: Wow. Wait, and when you say sweat the details, can you give us one example of what that means?

Kyle: Yeah, I think YC really selects for founders that will be resilient. Like as the two of you know, being a founder is not a walk in the park, right? Not in the slightest. So you have trials and tribulations. Not daily, pretty much hourly, once you start getting to things at scale. And so you have to reflect.

Like for lack of a better way of putting it like when have you been through some stuff and you came out the other side of it, like that's sweating the details inside of a YC application. And there's different ways that YC asks that question and I'm sure they ask that question slightly different now, but it's like really showing that you as a founder, like you can get through the trials and tribulations of it.

Benjie: Yeah, I mean, I think that's a pretty consistent thing. And a lot of people talk to me about like, how do I become a founder? Stuff like that. And it's always like, how much "blank" can you eat and smile? That seems to be the best thing. And then, you know, the whole second time, third time founder thing also does make sense. It's like, "oh, I know these people will..." Resilience, I think that goes without saying.

So, okay, you get into YC. You weren't expecting to get into YC. Thank God it was still hybrid because you're moving to France. Is your co-founder in France as well or where's Jacob?

Kyle: Jacob lives in London.

Benjie: Okay.

Kyle: Jacob's originally from Texas, so it was really, I think it was like three months after I started working at Thorn with him, he moved from Texas to London. And then all of the expat like propaganda started there. Like, "Oh, you should move to Europe."

Benjie: Yeah. I mean, I think you guys are pretty smart, from the outside looking in. I don't know. I mean I love New York, but oh boy, it seems like that six week vacation every summer seems pretty awesome. I'm sure you don't do that.

Kyle: No, I wish founder life matched up with the French--

Benjie: Yeah. Maybe it's actually worse to live there because then you're just like, "What am I doing? Like in August, everything is shut down and I'm sitting on my computer talking to Marc and Benjie."

So, okay, you get into YC, that's hybrid. That's the tail-ish end of Covid or I guess Covid was kind of over at that point. Definitely over in Texas, not really in California.

What value did you get out of it at that point? Especially in the hybrid thing? Did you start hiring at that point? Like what, you got some funding? Like where did things start going after that big inflection point of the YC acceptance?

Kyle: Yeah, I think YC for us was-- So again, back then we were focused on just the container build product and how can we accelerate as many container image builds out there, and how can we also offer that as an underlying infrastructure piece too? Right?

So it's funny like back then we were like Fly would be a great customer for Depot. Right? And we had very early conversations with Kurt over at Fly and it was actually Kurt that recommended that like we should go and apply to YC and I think that kind of like hits on what's the value that we got out of YC. It's really like the network.

It's really part of like being part of that network and being able to go to any founder in that group in a non spammy way. If any like current YC founders are listening, please go back and read the handbook because there's a right way and a wrong way to reach out to people in the network.

Benjie: Book face. Book face has changed a little bit in these days. Haha.

Kyle: Yeah, exactly. And so for us it was like really the network. And then also I think what YC instills in you as a founder is, and I still have this like when I talk to our partners over there of like you think big and then you go and talk to them and they're like "okay, but like how would you 10x that, like how would you think even bigger?" And like you thought you like reached the limit of your imagination.

Marc: You're like, "I already 10x that and 100x in my head before I came to you."

Kyle: Yeah, like how could you like think even bigger? Right? And so it's like I think I carry that with me even today. Like whether it's a revenue number or a usage metric, whatever the case may be, it's just like, what would I have to do to like 100x that number in some like constrained timeline? Right?

Because if you give somebody an arbitrary, totally open-ended "how would you 100x it?" Cool. They could come up with all kinds of different ways to do that over three, five, 10 years. But how do you do it in six months?

Marc: I want to go back to the product. Right? So you initially built the product, you kind of described it as like you spun up EC2 instances, they were attaching EBS volumes, but you were spinning up an EC2 instance per workload that you had to run.

Kind of precursor to sandbox's ephemeral style production environments. Benjie likes ephemeral environments. But you also kind of mentioned you're not doing some of that technology anymore. Like, can you describe kind of like how the product works today underneath the hood for just the container build?

Kyle: Yeah, the container build product. Back then it was like a true fork of BuildKit. Nowadays it's our own build engine. So we've effectively simplified it all down to: what does it take to build a container image as quickly as possible inside of a cloud environment? Right?

So that goes all the way down to the scheduling of the build. BuildKit has the concept of de-duplicating builds, so a single BuildKit build can actually be like 10 different container image builds. Right? And it has the capability to de-dupe that work. We've modified that to make it faster.

It's not EBS anymore. Nowadays we run our own Ceph storage cluster. And the reason for that is purely like throughput. EBS does not have great throughput at scale.

Benjie: And you're still cloud, you're still not racking and stacking your own stuff? This is all like AWS, Azure, whatever.

Kyle: Yep, still all inside of AWS. And effectively, like for both our GitHub Action runners product and our container build product, we effectively have created like our own provisioning system. Because we couldn't just like use auto scaling off the shelf from Amazon because auto scaling is too slow. Right?

So when somebody does like a Depot build or they run on a Depot GitHub Action runner, like they expect that to start in one to three seconds, right? Like they don't want to sit there and sit around waiting for it to launch. So we had to create our own provisioning system where effectively we keep machines around that have been warmed and flashed with the AMI.

So like, if you're not familiar with how AMIs work inside Amazon: when you start an EC2 instance, the AMI is effectively streamed off of S3. So as blocks are read, those blocks are streamed off of S3 and into the machine. That wouldn't be performant for starting a brand new instance, because when you're talking about a GitHub Actions runner or even the container build product, you're talking about an AMI that's 70 to 80 gigabytes in size.

So instead we had to like create our own provisioning system where we start the machine with that AMI and we read all of the relevant blocks directly into the machine state and then we stop it. So then when we go to start it again, it can instantly boot up. Like it already has all of the blocks read in into the machine that it needs to effectively start all of the work.
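That warm-pool pattern can be modeled in a few lines of Python. This is a toy illustration only; the class and field names are made up, not Depot's provisioning code:

```python
from collections import deque

# Toy model of the warm pool: machines are booted once so all AMI blocks
# are faulted in, then stopped. A build request pops a pre-warmed machine
# and starts it in seconds, instead of streaming a 70-80 GB AMI from S3.

class WarmPool:
    def __init__(self, target_size):
        self.target_size = target_size
        self.stopped = deque()
        self._refill()

    def _refill(self):
        while len(self.stopped) < self.target_size:
            # "Warming": start the machine, read in all AMI blocks, stop it.
            self.stopped.append({"ami_blocks_loaded": True, "state": "stopped"})

    def acquire(self):
        machine = self.stopped.popleft()   # instant boot: no AMI streaming here
        machine["state"] = "running"
        self._refill()                     # keep the pool topped up for the next request
        return machine

pool = WarmPool(target_size=3)
machine = pool.acquire()
print(machine["state"], len(pool.stopped))  # running 3
```

The trade-off is classic capacity-vs-latency: you pay to keep warmed machines on standby so that the user-visible start time stays in the one-to-three-second range.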

Benjie: So it's kind of like using the snapshot mechanism to get in front of that network S3 latency issue?

Kyle: Yeah, but it's like using-- Because EC2 and snapshotting an AMI is also very slow. So you can't snapshot an EBS volume or an AMI because that can take on the order of 10 to 60 minutes. Right?

So instead we have to like maintain a pool of machines that we've warmed. Then when a builder request comes in, we have to take one out of the pool, start it, run the build, but then we get into the security story as well.

Marc: I will say it's interesting because like I get it, you built the company around this idea of like, "I can actually make this build faster, like considerably faster."

One comment that you made was, "nobody wants to start a CI process and have it sit around for 20 seconds or so while it's like getting the instance ready." We're used to that though. That's actually kind of the way CI exists in the world. I appreciate that Depot solves that, it makes it a lot faster. But you also, you know, I think that the interface that you create it's not, "I commit and a PR starts it and then I don't notice this 20 seconds that GitHub Actions takes to orchestrate and start the runner up."

I'm actually literally running Depot build from the CLI often and I want to see like you kind of create this like environment and this user interface that is just going to be more demanding of a high performance. Like an instant response, right?

Kyle: Yeah, exactly. So if we go back in time when we first started the container build product, like again, if you remember I was talking about, we were facing this problem inside of CI. So we're like, okay, we need to make layer cache instantly available across CI jobs so that we're not saving and loading it over the network and we need to skip emulation so that we can build multi platform stuff in a single runner.

It wasn't until we built that product, and we built that product in such a way that Depot build was a drop in replacement for Docker build. And so what that meant is like people could just swap out their Docker build inside of their GitHub Actions or CircleCI workflow and swap in Depot Build. And then we built that and we're like, wait, like actually I can do the same thing locally too, right?

So I can move my local container image build off of my local machine and onto a remote build host. And what's interesting is what that unlocked: the layer cache is now shared across our machines. That was a total accident. We had not considered, "Oh, you could actually share; you could build the image on your machine, I could go and do a Depot build, and we go against the same BuildKit host with the same cache, and I can just reuse your build results."

That was quite interesting to unlock for ourselves, and then also to walk through the trade-offs of that. Because there are trade-offs with that. Right? You're moving a local build, a container image build specifically, which tends to be quite large, over to a remote host. Right? Which means now your build context has to flow over the wire to the remote BuildKit. It's a game of what's the trade-off. And even today--

Benjie: Right, because if I've got some binaries, whatever, on my local machine, I have to get that up to your build machines.

Kyle: Exactly.

Benjie: So how did you solve that?

Kyle: Smarter syncing, essentially.
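One plausible shape for that "smarter syncing" is content-addressed diffing of the build context: hash each file and upload only what the remote side is missing. The names below are hypothetical; Depot's actual sync protocol isn't spelled out here:

```python
import hashlib

# Sketch of context syncing by content hash: only files whose digests the
# remote builder doesn't already hold get sent over the wire.

def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def files_to_upload(local_files, remote_digests):
    """local_files maps path -> bytes; return the paths the remote is missing."""
    return [path for path, data in local_files.items()
            if digest(data) not in remote_digests]

context = {"Dockerfile": b"FROM alpine", "app.py": b"print('hi')"}
remote_has = {digest(b"FROM alpine")}  # remote already holds the Dockerfile

print(files_to_upload(context, remote_has))  # ['app.py']
```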

Benjie: Okay, so you kind of did like the Tilt model where just kind of like--

Kyle: Yep. You effectively sync only the stuff that's changed that the remote doesn't know about in the build context. And then similarly, so we're talking about sending the build up, but typically when you're building an image locally, right, Marc, you want to then run it as well, right?

So now you have to pull it back down too. Right? And BuildKit as a project by default had always assumed that like the default was the results stay in the BuildKit cache. And then the second option would be like you push it onwards into a registry. Right?

But there's a third option which is like you load it back. So you tell BuildKit like run the build and then send it back to me.

Marc: In CI, that's like, the end state of CI is going to be push it to this remote registry. But like very often you're going to like build it locally and then run a whole bunch of integration tests against that thing. Right? And so like yeah, you're right. Like actually we would solve that. We would use like TTL.sh. We would build, push and then pull in ephemeral and not deal with auth. But like every decision you make like that, you're like, oh, but this is only going to add 10 seconds of latency or 10 seconds of like delay. This one too. This one too. And then like inevitably you're like, why is the CI pipeline taking 30 minutes to run these days? This is like untenable. I don't know how to solve this.

Kyle: Yep. And so we had to also write the sync on the other side. Right? So that when we load back-- By default, BuildKit would load in a very naive way. It would always send everything back, when actually what you want is for load to be a lot more like a Docker pull. Right? Which is, you only send back the layers that have changed.

So we had to build that into our BuildKit.
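The pull-like load behavior reduces to a digest diff over layers. A minimal sketch, with invented names rather than BuildKit's actual API:

```python
# After a remote build, send back only the layers whose digests the local
# side doesn't already have, instead of streaming the whole image.

def layers_to_send(remote_layers, local_digests):
    return [layer for layer in remote_layers if layer not in local_digests]

built_image = ["sha256:base", "sha256:deps", "sha256:app"]
already_local = {"sha256:base", "sha256:deps"}

print(layers_to_send(built_image, already_local))  # ['sha256:app']
```

In the common case where only the top application layer changed, this turns a multi-gigabyte transfer into one small layer.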

Marc: And that's proprietary, the way you built it. Or did you find some open source thing that you're able to kind of leverage for it?

Kyle: We built that ourselves. Yeah.

Marc: Nice. So if I have a large artifact, I don't know, weights of some model that I want to include in an image or whatever, but it's huge. Like it literally, it's not like if that file change, does it actually get down to like what chunks, what bytes of the file have changed and transmit just that?

Kyle: We focus on just the layer at the moment. We did have a project for a while that we called Depot AI, before all of the craze that we're living in, that was effectively like building all of the popular open source models as images and then hosting them inside of our own registry and using a special format that is like not proprietary to Depot, but using eStargz, which is effectively like the smarter version of that sync that I was talking about.

So in that model, like when you say "from Depot AI and the QWEN model," you can essentially say like from that, copy out this file. And eStargz is the format that is smart enough to like not pull the entire layer and then copy out the file, but to actually like look up what is the index of that file in that layer and then only send back the file.
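Conceptually, that per-file lookup works because the layer carries an index of file offsets. Here's a toy model of the idea; the offsets, names, and layout are made up for illustration and don't reflect the real eStargz format in detail:

```python
# Toy model of the eStargz idea: the layer includes a table of contents
# mapping files to byte ranges, so a client can range-read one file's
# bytes instead of pulling and unpacking the entire layer.

LAYER = b"...weights..." + b"model.bin contents" + b"...more..."
TOC = {"model.bin": (13, 18)}  # path -> (offset, length) within the layer blob

def fetch_file(path):
    offset, length = TOC[path]
    # A real client would issue an HTTP Range request against the registry blob
    return LAYER[offset:offset + length]

print(fetch_file("model.bin"))  # b'model.bin contents'
```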

Marc: Oh, that's cool.

Benjie: Sorry, did you say that that product is no longer?

Kyle: Yeah, Depot AI is not really like a product. It was more like an experiment.

Benjie: Can I play with that experiment?

Kyle: Yep, that's still available today.

Benjie: All right, I will check that out. So do you guys have your own registry then, or do you push to other people's registries?

Kyle: We have our own registry nowadays. Back then, when we were first starting Depot, you pushed to your own registry. But now we have our own registry that's built over the top of, effectively, Tigris, and then we front it with various CDNs. That's the newest version of the registry.

Prior to that version, we also had a version of the registry where the layer blobs would be stored in Tigris, but the manifest would be stored in ECR. And that was, total transparency, a Rube Goldberg machine.

That was not good because you have the layer blobs distributed all across the world thanks to Tigris, shout out to them. But then you have this pinch point that is ECR. And generally speaking, that tends to not be too bad until US-East-1 goes down.

And then you have this pinch point where, all of a sudden, you can't get the manifest. The manifest, for context, is how you know which blobs to go get. Right? So if you can't get the manifest, you can't go get the blobs. So nowadays we don't have that. Now we have our own proper registry.

Marc: Is the intent of the registry-- Do you recommend somebody who's using Depot to build to push the images to your registry then, and then have my production cluster pull from your registry?

Kyle: 100%. We've found that the performance is faster; pushing to it is faster because it's inside of our network.

Marc: That's on AWS still?

Kyle: Yep. So effectively what we do is we use Tigris's concept of essentially siphoning out of S3. Right? So we write the layer blobs and the manifest; the registry writes to our own S3 bucket. And then we replicate it out to Tigris.

What's cool about that is then if you do a Depot pull from your machine, it goes to Tigris. So it goes to the closest edge location to you and returns the image to you. The side benefit to us is that doesn't touch the AWS account, and therefore we don't have to think about the egress of that.

Marc: That's exactly where I was going. I was like, oh, I don't want to know what your egress bill is on AWS. Haha.

Kyle: Yep. That's our solution to the egress. But then, let's continue with the container build product, that's what we've been talking about. Let's say that you're doing a container image build inside of Depot and you have a "FROM registry.depot.dev/myimage". Our infrastructure is smart enough to know, oh, this is a container build happening inside of our AWS account.

We're not going to go all the way out to Tigris to fetch that. Like we know exactly where that image lives inside of the infrastructure.

Benjie: Yeah, I'd imagine at scale, not everyone knows this, but egress is how they get you and at scale those numbers are massive. So I know we're talking about the container build project a lot and there's other stuff we want to talk about, but real quick, can you give us some rough numbers of where you were when you joined YC with customers? Give me something that's like, "what the heck," from then till now and then we're going to start talking about other stuff.

Kyle: Yeah, I think when we came out of YC, we're probably talking like 7,500 customers, something like that. We were measuring revenue in the like less than 15,000 per month. Right? So nowadays like where Depot's at across the suite, like we're doing tens of millions of dollars in revenue, thousands of customers across-- What are we up to? Like five different products.

Benjie: On that note, tell us about some other products that you guys evolved into.

Kyle: Yeah, so like we built the container image build products and we really like focused on that for the year following YC.

So we just focused on growing that business and focused on building that up. But it became apparent to us that all of the infrastructure pieces, all of the building blocks that we were assembling, could really be applied to other builds that you could accelerate. But instead of focusing on the individual build, why don't we focus on the entire CI workflow?

We built our own managed GitHub Action runners with our own tech baked inside that are anywhere between three and ten times faster than a GitHub hosted runner. We did a lot of cool things inside of there where effectively like we optimize the GitHub Actions runner binary to short circuit.

We've essentially built a system that doesn't rely on GitHub's webhooks to know if a job needs to run. That's something that many people don't know about Depot's GitHub--

Benjie: Is that just because webhooks are slow? They're also very flaky.

Kyle: Very flaky.

Benjie: I complain about this all the time to Marc.

Kyle: Pro tip to anybody who's curious: go read GitHub's docs on webhooks and you'll quickly find that they are best effort. Which means they won't always be delivered, which is extremely problematic when you want to, like, run a job. Right?
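A hedged sketch of the alternative Kyle describes, not relying on webhook delivery at all: a reconciliation loop that periodically polls the provider for queued jobs and dedupes against what it has already dispatched. A dropped webhook then delays a job by at most one poll interval instead of losing it. All names here are invented for illustration; nothing below is Depot's actual system or a real GitHub API.

```python
class JobSource:
    """Stands in for a provider API, e.g. listing queued workflow runs."""
    def __init__(self):
        self.queued = []

    def list_queued_jobs(self):
        return list(self.queued)


class Reconciler:
    """Polls the source of truth instead of waiting on webhooks."""
    def __init__(self, source):
        self.source = source
        self.dispatched = set()

    def poll_once(self):
        # Dispatch any queued job we haven't seen yet, webhook or no webhook.
        started = []
        for job_id in self.source.list_queued_jobs():
            if job_id not in self.dispatched:
                self.dispatched.add(job_id)
                started.append(job_id)
        return started


source = JobSource()
rec = Reconciler(source)

source.queued = ["run-1", "run-2"]  # pretend the webhook for run-2 was dropped
first = rec.poll_once()             # both jobs get picked up anyway
second = rec.poll_once()            # and the next poll dispatches no duplicates
```

In a real system the poll would hit the provider's list endpoint on a timer, with webhooks kept only as a latency optimization on top.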

Benjie: Yeah, no, we've dealt with this so many times. Some of the things you're talking about, we've built and I wish we would have just used Depot's product at Shipyard.

Wait, by the way, I want to go back to one thing before we dive into the next stuff. You talked about your security posture and how you handle security. You have a pool of workers, you turn them off and on very quickly. Obviously in memory, in cache, there's all kinds of stuff that I don't want my builds, I don't want my-- There's secrets in there. There's all kinds of stuff.

So how do you handle the security part of the container build thing? And then we're going to keep diving into this other stuff.

Kyle: Yeah. When we talk about container builds, we talked about building our own provisioning system. So you have to pull a machine out of this warm pool, start it, and run a build. And I think there are two common misconceptions.

One, people either assume that we just like leave that machine on. Right? But that would be wildly inefficient because, you could run one container image build and then not run another one for three hours. So we're not going to leave the machine on.

The other assumption people make is that we stop it and reuse it. But if you know anything about a Docker image build, the one thing everyone should know is that it requires the highest level of access on the machine possible; effectively, it requires root on the machine.

So we can't really reuse it because we can't trust it. Like we don't know what you did inside of it. Right? Like, you could have tainted it. You could have like stashed something in memory. Like you could have done anything, where if we reused that, it would be a major security hole. So what we do instead is like we nuke it from orbit, we kill it.

Like all build hosts and GitHub Action runners, for that matter, the EC2 instances that back them are single tenant. So effectively we launch it, it runs your job. Once your job finishes or your build finishes, we nuke it from orbit. And that's why it's so important that we have our own provisioning system that maintains a fleet of compute so that we can pull one out, instantly start it, do its work and then kill it.
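The lifecycle Kyle describes, pull a pre-provisioned machine from a warm pool, run exactly one job on it, then terminate it and refill the pool, can be sketched as a toy model. This is illustrative only, not Depot's provisioning code; the class and instance IDs are invented.

```python
import itertools


class WarmPool:
    """Toy single-tenant warm pool: machines are provisioned ahead of time,
    handed out exactly once, and terminated after the job. Never reused."""

    def __init__(self, target_size=3):
        self._ids = itertools.count(1)
        self.target_size = target_size
        self.ready = []        # warm, idle machines waiting for work
        self.terminated = []   # machines nuked from orbit after one job
        self._refill()

    def _refill(self):
        # Provision ahead of demand so checkout is effectively instant.
        while len(self.ready) < self.target_size:
            self.ready.append(f"i-{next(self._ids):04d}")

    def run_job(self, job):
        machine = self.ready.pop(0)      # instant start: already provisioned
        result = f"{job} ran on {machine}"
        self.terminated.append(machine)  # single tenant: kill, never reuse
        self._refill()                   # keep the pool warm for the next job
        return result


pool = WarmPool()
results = [pool.run_job(f"build-{n}") for n in range(5)]
used = [r.split(" on ")[1] for r in results]
```

The invariant worth noticing is that every job gets a fresh machine: `used` never repeats an instance ID, and every used machine ends up terminated.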

I've been on a number of EC2 service team calls to talk about how we use the EC2 API effectively. Nowadays, Depot is making tens of millions of API calls to EC2 for machines. We have to be pretty high up in the top 10% in terms of daily volume of fresh EC2 instances.

Marc: But there's nothing that's had you say, maybe, you know, "AWS is not going to scale with us, the price is going to be better if we actually go back to, what Benjie was saying, rack and stack some servers, throw some kind of other hypervisor on it and run this yourself?"

Kyle: I believe that you're buying a different problem if you go and rack and stack your own servers. You're trying to solve two problems at the same time. One is: what's the software business and value proposition for your customers? And two is a real estate problem, which is: how do you maintain enough capacity in your own infrastructure, with this rack-and-stack methodology, to serve your upcoming demand?

Right? The trade-off you're making if you rack and stack your own servers is effectively that you're going to over-provision. That's kind of the only way; you have to buy way more. I think you're kind of seeing this with all of the frontier lab companies too, right? This is the stuff that's starting to come out: if they don't hit this revenue number, they've already earmarked money for servers that are five years away.

Marc: Yeah, and sandbox companies too. I mean, we've had Daytona on, and they really take pride in how quickly they spin them up, and there's a lot of hardware work; they're basically running on bare metal. But you're right, it's definitely not the easy way, and you're definitely leaning into it, and you have to deal with scaling very differently--

AWS is like, "whatever, there's infinite scale here, I'll make more API calls," until you can't. Which I'm sure you have stories about, when they're denying the requests, or they're out of capacity in a region.

Kyle: Yep.

Benjie: And I would assume you get to take advantage of the Spot instances as well.

Kyle: Spot is interesting when you're talking about build acceleration, right? Because Spot can be pulled out from under you, right?

Benjie: We got like 30 seconds, so I guess that's not enough.

Kyle: You have 30 seconds. But like when we talk about Depot, like one of the trickiest things about operating in our space is like we're literally talking about like a Linux VM that anybody can do anything they want inside, right?

So like if you were able to like box that in and say like only these certain things are happening inside of the VM, then like a 30 second timer on the instance being pulled out from under you, you could probably like engineer your way around. Right?

But because it can be like literally anything, like somebody could be running a massive database migration job inside of the runner, right? Like you don't want to kill that machine. And like there's interesting companies like working in this space too, right?

These are the ones that are like live migrating machines. So they're like figuring out how you snapshot the memory and essentially like, oh, like this Spot one is going away. Like stand up this new one and replay the memory into that one so that you can just fail over at a memory level into another machine.

Benjie: Yeah, we had the Unikraft guys on the other day, so that's right up that alley.

Marc: I mean at some scale though, yeah, that's a lot of complexity, a lot to manage, a lot of moving parts. It is effectively another Rube Goldberg machine that you're creating. But at some scale it's gotta just drive your costs down so much that you have to consider it at least, right?

Kyle: Yeah, I think that's like something that you have to look at. But there's a lot of other levers that you can pull before you get to that. So on the like Depot product roadmap stuff and things that we've shipped recently. So like we shipped our, we have our own GitHub Action runners. Those are wildly popular. There's a lot of cool tech built into those.

But effectively, building that product and scaling that business revealed to us what's challenging about building a managed GitHub Actions runner offering: the dependency on GitHub. You can only accelerate like 30% of the workflow and 30% of the job, because the other 70% still lives with GitHub, still lives in all of its plumbing, still relies on GitHub actually delivering the job down to the runner, still relies on GitHub being online for the runner to report back to the mothership.

And so we went and built our own CI engine and that's our newest product, which is Depot CI and Depot CI is like, what if Depot controlled everything? So from the ground up, it's Depot's compute on Depot's infrastructure with Depot's caching connected to all of the other Depot products with a programmable interface.

So, everything can be done via an API or a CLI command. And the power of that is like, I can give that to any agent and an agent can essentially write its code and then also just like trigger CI right from that session and monitor, "does the CI pass? Is it green? Oh, it's red. Like, dump the logs, figure out what failed, fix it itself."

Marc: And you're doing everything? So GitHub Actions has secrets, right, where, the team will do that. You're hosting secrets, you're delivering secrets out of these, these runners, everything?

Kyle: Yep. And that product uses a very different architecture from our EC2 product. That product is running on bare metal inside of AWS. It's essentially using Cloud Hypervisor underneath the hood, with our own bit of spice mixed in for some of the things that we're doing. And effectively, Depot CI is built on top of our own sandboxes. Like, we've built our own sandboxing over the top of these metal hosts.

And that's fundamentally different from the other products. Right? It's not one-time-use EC2 instances. And the driver behind that is really that we got really good at optimizing how fast an EC2 instance could start.

The general numbers I've seen is an on demand EC2 instance can start in anywhere between 30 and 60 seconds, depending on the AMI and the machine size. We got that down to two seconds, but I would like it to be 200 milliseconds.

Right. And that's where choosing a different architecture comes in. Fundamentally, with a microVM or sandbox running on a metal host, there are things you can do to start it significantly quicker that you just can't do when you're talking about a virtualized service like EC2.

Benjie: Right. So you need KVM, basically, is the answer. I mean, I don't mean to oversimplify it, but at the end of the day, you need KVM, you need access as close to the hardware as possible. You're still on Amazon for that?

Kyle: Yeah.

Benjie: Wait, wait, you said that you got EC2 instances start in two seconds?

Kyle: Yep.

Benjie: Come on. Come on. Is that true? Come on. No way.

Marc: Benjie doesn't believe. Haha.

Kyle: On Linux machines, started in two seconds. Yeah, that's literally ripping out everything at the kernel level that you do not need. Like, all the way down to EC2 hosts starting all kinds of random things that you just don't care about. Again, it comes back to how we warm the machines too.

Benjie: Okay, crazy. So you got these things down to two seconds. They obviously didn't necessarily have every package that you needed, and you still had to hot load stuff in or whatever. And so you got to this point where you're like, we want to have our own CI system, so we're going to go bare metal, KVM, microVM, you said, Cloud Hypervisor. So you're using what, the Rust-- What's that thing called? rust-vmm?

Kyle: Something like that.

Benjie: Yeah, something like that. Oh, you know what, this brings up a great question. Can you just talk about the different stages and like, how many employees you're at now?

Kyle: Yeah, Depot nowadays is 22 people, really spread out all across the US and Europe. So kind of 50-50 nowadays. It's wild. 22 people is not a lot.

Benjie: No.

Kyle: It's not even a drop in the bucket, but it is wild to go from 1 to 20; that's a very weird experience.

Benjie: Are most of those folks engineers, I take it?

Kyle: Most of our team is engineering, yeah.

Benjie: Okay. And so there's some pretty smart folks. We got some Rust people doing crazy things over there. And so going back to the CI product, you built the sandbox underneath so you could turn off and on these things, you obviously have a pool of hot bare metal that's running your stuff. When did you launch the CI product and how's it going?

Kyle: The CI product launched back in March at KubeCon in Amsterdam. It's been great. I think one thing that's unique about Depot CI is it understands other CI syntaxes. So it speaks GitHub Actions. Today we're working on our own SDKs, so you can define your own CI language inside of code, if you want to do that.

We're working on things like supporting GitLab's syntax as well. But effectively, we do the translation. We translate that into our own, we call it an IR, our own intermediate representation. And then we essentially turn that IR into the individual sandbox commands that get executed underneath the hood.

And it's been cool to build that because I think a lot of times people think of a CI engine as being directly tied to a YAML syntax, and we've fundamentally broken that: you can bring a YAML syntax, we'll do the translation, and then it can just run. And so what that means is, for people that want to adopt Depot CI, they just literally drag and drop from the GitHub folder into a Depot folder. And it just works.
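The translation idea can be sketched roughly like this: take a GitHub-Actions-shaped workflow (already parsed from YAML into a dict) and flatten it into an intermediate representation of sandbox commands. The IR shape and field names below are invented for illustration; they are not Depot's actual format.

```python
# A minimal workflow dict, shaped like what parsing GitHub Actions YAML yields.
workflow = {
    "jobs": {
        "test": {
            "runs-on": "ubuntu-latest",
            "steps": [
                {"uses": "actions/checkout@v4"},
                {"name": "Run tests", "run": "make test"},
            ],
        }
    }
}


def to_ir(wf):
    """Flatten a foreign CI syntax into a hypothetical IR of sandbox ops."""
    ir = []
    for job_name, job in wf["jobs"].items():
        # Each job becomes a sandbox with a lifecycle around its steps.
        ir.append({"op": "create_sandbox", "job": job_name,
                   "image": job.get("runs-on", "default")})
        for step in job["steps"]:
            if "run" in step:
                ir.append({"op": "exec", "job": job_name, "cmd": step["run"]})
            elif "uses" in step:
                ir.append({"op": "action", "job": job_name, "ref": step["uses"]})
        ir.append({"op": "destroy_sandbox", "job": job_name})
    return ir


ir = to_ir(workflow)
```

The point of the indirection is the one Kyle makes: once everything lowers to the same IR, a GitLab translator or a code-defined SDK can target the same execution engine.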

One of the things we learned early on in doing that is the one trick there is secrets inside of GitHub Actions, right? Secrets are tricky to pull back out manually. Some organizations, I don't know about yours, can have hundreds of secrets inside of GitHub Actions.

People don't want to go and copy and paste those over into a new CI system, right? So we could make drag-and-dropping the workflow simple, but how do you make the secrets simple? And so we figured out a way to essentially run a one-time GitHub Actions workflow file during a Depot migration that would literally port the secrets over into Depot CI. So imagine the however many hours of manually copying and pasting secrets over into a new CI system; now it's like 5 seconds.

Marc: I will say, Kyle, we at Replicated, we did actually move a couple repos over from GitHub Actions to Depot.

Kyle: Nice.

Marc: Some very long-running, slow CI that has kind of organically grown over years. It works the way Kyle's saying. Literally, the YAML is just the YAML; just change the GitHub folder to a Depot folder and it works. And so far we've been happy. You guys have kept the service running, like, 100% uptime too.

Kyle: Yeah, yeah.

Benjie: Don't jinx him, Marc. It's been a month and a half, first off.

Marc: Haha.

Kyle: I feel my pager catching on fire already. Haha.

Benjie: Yeah, Marc is just trying to-- This is a very weird way of trying to ruin your life. Haha. So why did you decide to do the CI product?

Kyle: I describe Depot as a build acceleration platform. And so my goal, when I take a step back and think about it, is I would really love for all builds to be as close to near-instant as possible. Now of course that isn't always possible, and there are certain things that are always going to take a long time.

Benjie: You're talking about Yarn, you're talking about Yarn. Haha.

Kyle: Haha. I'm specifically talking about one of the benchmarks that we always do, which is just compiling Linux, on anything that we're trying to benchmark. Linux, by the way, is great for testing your CPU performance because it's heavy, heavy CPU use.

And so when it came to building our own CI engine, it was really: what are the limitations in front of us today with the current products? And it was really all about the fact that we can't actually accelerate all the things that we want to accelerate, because we don't have access to them. Right?

Or when you're talking about something as stupid as GitHub not sending you the webhook, right? How can I just remove that latency altogether? Well, don't be reliant on the webhook. And so Depot CI is really our take on: what if we took over a hundred percent of the process? And we're working on a lot of really cool stuff that pushes that even further.

Benjie: So you guys have the CI product. You mentioned there's five products. Give me a 30 second pitch on each one of the five just to make sure we cover that for our audience.

Kyle: Yeah, so there's Container Image Build Product, that's our original product that makes a container image build anywhere between 40 and 60 times faster.

We have our own GitHub Action runners, those are anywhere between 3 and 10 times faster.

We have Depot Registry which is our own container registry.

We have Depot CI which is our own CI engine.

And then we also have a fifth product, a remote caching service that we call Depot Cache. So think of all of the cache performance gains of a container layer cache inside of a container build product, but applied to other build tools. So think Bazel, Turborepo, Gradle, things like that.

Benjie: But it's sitting in your infrastructure. So there's still network latency, right?

Kyle: Mhm. Yep, there's network latency. But when you stitch all of those things together you get like this compounding effect of build performance.

Benjie: Right. I would imagine that there's a lot of SLA uptime stuff you guys stress about. Do you have a good story about something insanely stupid that happened that was maybe your fault?

Kyle: On which direction? Haha.

Benjie: Yeah, well, I always like to have a ridiculous story about a vendor doing something stupid, but then also a mistake that I've made or something like that that's just hilarious. Anything that you're open to sharing. If not, no problem.

Kyle: I think this was talked about on one of your other podcast episodes I listened to. But everybody is feeling the CPU crunch. So we were definitely at a strategic advantage being inside of AWS. But I think one of the things that shocked me at the scale that Depot is running at is-- many people are doing small things with EC2, right? Maybe you're launching 10, 20, maybe 30 instances a day.

You come to rely on: you launch an instance, and it's good. At the scale that Depot's running at now, we're talking millions of EC2 instances per day, and it's not uncommon for us to launch an EC2 instance and it's bad; it's corrupt, or its EBS volume is screwed, or its local instance storage is not operating.

And so once Depot reached a certain scale, now we've built systems to detect that, right? And if anybody's ever curious, like PlanetScale, that team over there has fantastic blog posts about how they see this at their scale.

But effectively you have to build systems where, when you pull a machine out, you run your own health checks of: is the machine actually good? And we had one instance of this. You pull a machine out, and we figured out, okay, we can determine the machine's not good, we'll give it back.

But then we make the API call again and we get back the same flipping machine. And so, yeah, it's a little bit like reverse engineering some APIs, right? Because, to give credit to Amazon, why would they cover that scenario? That's not a normal scenario, right?

Like, I would say, at the scale that we're doing things, that's not the 90% case for that service. But yeah, we had to build a system where: you pull one out, it's bad, but I can't give it back. So I'm just going to hold onto it until I pull one out that's good, and then I'll give it back.
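The workaround Kyle describes, quarantining a bad machine so the API can't hand it right back, might look roughly like this simulated sketch. The provider, pool behavior, and health check are all stand-ins invented for illustration, not real EC2 calls.

```python
class FlakyProvider:
    """Simulated cloud API that would hand back the same bad machine if you
    returned it to the pool immediately -- the scenario described above."""

    def __init__(self):
        self.pool = ["i-bad", "i-good-1", "i-good-2"]

    def launch(self):
        return self.pool.pop(0)

    def terminate(self, machine):
        pass  # machine is gone for good; it never re-enters the pool


def health_check(machine):
    # Stand-in for real checks: disk health, EBS attachment, instance storage.
    return "bad" not in machine


def acquire_healthy(provider, max_attempts=5):
    """Launch until we get a healthy machine, quarantining bad ones so the
    provider can't immediately give them back to us."""
    quarantined = []
    machine = None
    for _ in range(max_attempts):
        candidate = provider.launch()
        if health_check(candidate):
            machine = candidate
            break
        quarantined.append(candidate)  # hold it; don't return it yet
    for bad in quarantined:            # only release bad machines afterward
        provider.terminate(bad)
    return machine, quarantined


machine, held = acquire_healthy(FlakyProvider())
```

The ordering is the whole trick: bad machines are terminated only after a good one is in hand, so a retry can't land on the same instance.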

Benjie: So like error correcting like EC2 basically.

Kyle: Yeah. Error correcting at like the infrastructure level. Right? Where it's like, it's not really like our infrastructure. Like we're getting it from Amazon but like you still have to like handle that scenario.

Benjie: What about ECC2? Haha. How about that? That's kind of good, right? I'm a naming wizard. Haha. Wow, that's-- I personally, I've definitely had some bad AMI or all kinds of instances across the board. Do you guys do any on prem offering or how does that work?

Kyle: Today we do what we call Depot Managed, which is effectively we can put the data plane into your own cloud account. So we don't do like true on prem yet.

Benjie: So hybrid cloud stuff. And is that all, that's only AWS, I take it? No Azure?

Kyle: It's only AWS today. Although we have people working on Azure and GCP at the moment.

Benjie: So we're coming up on time. But I do want to ask a little bit about your roadmap and probably, I mean I feel like the agent world that we are in and that we are heading towards very rapidly is pretty prescient for what you guys are doing.

So I'd just love to hear what you think the next maybe six months for Depot is, and then maybe two years from now also. And that's a really ridiculous question today, because I don't know what tomorrow looks like, but do your best.

Kyle: Two years is a hard one to answer. But I think what's interesting about this current point in time is that CI has kind of always been critically important. But it's also been this, like, it's not product work, it's not feature work, like--

You're not directly delivering value to your customers via CI. That's the general take that engineering teams have historically had. But now we're thrust into this new world where a single engineer with five agents at their side can author code at 20x the throughput of what they could three years ago.

Right? And so when you start to carry that out to, not even a large team, like a five-person team, the throughput at the code-authoring stage is massive. But if you look at what happens after committing the code: all of that code and all of that velocity has to flow through one thing today, and that's CI.

It's because we've taken this new technology and bent it into our existing paradigm, which is a very human-centric paradigm. It goes back decades: we commit some code, open up pull requests, CI runs. I may be reviewing two, three pull requests a day, maybe five, ten deployments a day.

But now we have hundreds of pull requests a day, all going through CI, all still needing to be reviewed. And I think what CI is changing into is really the verification layer of all of this code that's being authored. Effectively, agents are writing more and more code, and engineers can't really review all of that code, because there are so many pull requests, even with AI agents reviewing code.

And so CI is sitting in this unique space where it can be the verification substrate of: "Can I trust this code? Is it high quality? Can it go to production?" So a lot of things that we're working on over the next six months are about how we unlock that. How do we surface that both to engineers, right?

In a high-level, don't-need-to-go-all-the-way-down-into-the-details type of way, but also automatically surface it back into the agents actually writing the code, so that if CI fails, there's not this current clunky workflow where a human engineer has to copy the error out of CI and paste it back into the agent that wrote the code.

It can just know that that code failed and fix it itself, and the loop continues. Right? So that's a lot of the stuff that we're working on now. And then, yeah, two years? I don't know, man. Your guess is as good as mine. Haha.

Benjie: I'll see you on Mars. Haha.

Marc: Yeah, I do love that though. Look, you set off and you made CI, which was slow, you made it faster. And now, yeah, we're producing so much more code. And what we're actually doing is we're just discovering the next bottleneck in the process.

Kyle: Exactly.

Marc: And clearly, for the engineering teams that start shipping more, the ones that figure out agentic remote dev environments, things like this, the bottleneck quickly becomes review. And so exactly that: thinking through not just how do we make this process faster, but how do we rethink this process for a world where humans didn't write the code, and where's the next bottleneck going to be? Continue to solve that, because there's always another one.

Kyle: Yeah. And I think, if you look around the space, especially at this moment, people are talking about a viable replacement to, you know, who-- And a lot of people that are looking at that are looking at it through the same paradigm that we do software engineering today. But we're not doing that software engineering anymore.

We're doing a whole totally different type of software engineering nowadays. And it doesn't mean that software engineers are going away, not in a million years. We need more software engineers than ever.

But, we now need to manage all of these work streams. Right. And these are like asynchronous work streams that we're doing with machines. And, we need to define a new paradigm for that, not bend our existing one into that technology.

Benjie: Yeah. I still haven't quite squared the circle of how our existing systems even make sense for what's happening today. You know, it's a mix. You talk to some customers and they're like, yeah, I got my Cursor token yesterday. And you talk to some teams and they're like, I haven't looked at code in six months.

Kyle: Yeah.

Benjie: So it feels like that's where we're going for the most part is the not really looking at code. Definitely more of systems architecture thing. But then at the same time, why do we need a human to do systems architecture at a certain point, in this evolution of what we're doing? So then how do these traditional systems, these CI systems even make sense other than that validation loop?

Kyle: Yep.

Benjie: Which does need to be kind of rethought from the ground up. It feels like we're trying to figure out how to make cars really fast using stone wheels. I don't know if that's a good analogy, but something to that effect.

Kyle: I think that's a pretty good analogy. That's kind of where we're sitting today.

Benjie: Yeah.

Kyle: I think the space is ripe for, really starting to, rethink these problems and, how can we apply these tools in smarter and smarter ways?

Benjie: Well, you got me convinced. The way that you look at the world is how to make these builds faster, and no matter what, that's pretty defensible today and for the near future, because it still takes a lot of time. I would love one day to find out the numbers that you've been able to do with Yarn and npm and all these other things that we've all struggled with.

And your BuildKit work sounds spectacular. I think I just saw the other day that Solomon Hykes and the Dagger folks actually completely got rid of BuildKit and rewrote it from the ground up. I'm guessing that's open source. I should have checked, but just knowing Dagger, I'm guessing that's open source.

Kyle: I don't know. I know Solomon quite well. And Depot and Dagger talked a lot about like, how could we actually do that together? I don't know where they ended up on that. Like, I think our use case is slightly different than Dagger, but it's really cool that they came to that same conclusion.

Benjie: Well, we did the same thing at Shipyard. We have like a tenth of what you guys have probably built, which we just had to do for our stuff. But at this point, we do tell a bunch of folks, hey, you should just use Depot.

Kyle: Yeah.

Benjie: Well, look, we really appreciate you coming on and taking the time and yeah, the future is exciting. I have a feeling that we might have you come back on in six months if you're up for it and see how unbelievably different this entire conversation is.

I think it's really cool what you folks are doing, and doing it with a 22-person team; that's super duper impressive. But really appreciate you coming on, and looking forward to seeing what the next six months to two years looks like.

Kyle: Awesome. Thank you for having me, guys.