Ep. #26, Origins of Kubernetes with Tim Hockin of Google

Guests: Tim Hockin

In episode 26 of The Kubelist Podcast, Marc Campbell and Benjie De Groot speak with Tim Hockin, principal software engineer at Google and an originator of Kubernetes. This talk contains invaluable lessons from Tim’s storied career, the early days of Kubernetes, and insights on building and maintaining trust throughout the K8s community.


About the Guests

Tim Hockin is one of the originators of Kubernetes. Since 2014, he has been committed to engineering at Google, where he is currently principal software engineer. He was previously a software engineer at Sun Microsystems.

Show Notes

Transcript

Marc Campbell: Hey, everyone, and welcome to another episode of the Kubelist Podcast. As always, Benjie is here with me. Hey, Benjie.

Benjie De Groot: Hey, Marc.

Marc: Cool. So let's just dive right in here. We're really excited. Our guest today is Tim Hockin, a Principal Software Engineer at Google.

Most of you who are familiar with Kubernetes have probably come across Tim's pull requests and contributions to the ecosystem. Welcome, Tim.

Tim Hockin: Thanks for having me on.

Marc: So let's just kick it off with something we'd love to hear more about: your background and your story. How did you get started in software and into the role you're in right now?

Tim: Sure. Well, I started in software in college. Like many people, I discovered Linux and I thought it was pretty cool, and I spent a lot of time hacking around playing with Linux.

I found a particular affinity for operating systems and kernely stuff. When I graduated college I went to a startup called Cobalt Networks which you may have heard about. We used to make little microservers, it was hot for a hot minute.

When Cobalt got acquired by Sun, I spent my time there doing Linux work inside Sun, which was interesting, and then one day Google came calling and said, "We have a kernel team too." So I joined Google, ostensibly to work on the kernel.

Instead I got shanghaied into doing BIOS work, so I wrote assembly for a bunch of years at Google, helping bring up servers, and moved my way up the stack into machine management and data center management.

Eventually I jumped over to the Borg team, where I worked on those same topics. I worked on this project called Omega, which was a revamp of Borg, and when the Kubernetes opportunity came up I saw it as something I couldn't pass up, so I jumped on that. I've been there, gosh, almost 10 years now.

Marc: Great. So I'd love to dig in there. There's two different things that we could chat about, the kernel and the BIOS stuff, sounds like a fun conversation.

I'm going to start right now with talking about that transition from Borg to Omega to Kubernetes.

Can you share a bit about, for anybody that doesn't know that background of the progression from Google into what we currently have as Kubernetes? Just share a quick version of that story?

Tim: Yeah, sure. So Borg was born in like 2003, so even before I joined Google, and it was well established by the time I started really working on it.

It's a very bespoke system that is really designed to run Google's applications in Google's data centers and be integrated with all the rest of the Google ecosystem.

Omega was a project to rethink some of the core primitives with more modern ideas with some of the lessons that we had learned along the way to see what we could do to make that ecosystem better.

It started as a "Let's Replace Borg" project, and it really turned into a research project whose ideas then folded back into Borg, and so Borg today has adopted a lot of the ideas from Omega for many reasons.

Kubernetes was designed in a different market, it's not Googlers, it's not custom software, it's off the shelf, open sourced things. It had a lot of different requirements than Borg so we took what we thought were the best learnings at the time from Borg and Omega and wanted to build them into the version 0 of Kubernetes.

Retrofitting major features onto a system like Borg turned out to be very, very difficult, so starting from scratch with these ideas was more approachable.

Benjie: And the intention there with Kubernetes was not to be used as an internal replacement for Borg, it's kind of like the open source community giving back. Is that correct?

Tim: More or less, yeah. Look, by the time we were working on Kubernetes in 2013, Borg was already 10 years old so it already had thousands and thousands of features that were really custom built for serving ads and search and Gmail, and most applications in the world would never, ever need those capabilities. Kubernetes was designed to be used by the rest of the world, and as such we focused on the requirements of the real world, as opposed to Google.

Which, if you're a startup and Google comes to you saying, "Hey, we want you to customize your product for us," you might consider running away.

It's very different to have such a big, dominant customer who has such weird, unique requirements, so even up till now we've more or less ignored the Google specific requirements in favor of the broader industry requirements.

Benjie: As a Kubernetes user, can you give us just three of those requirements, just so they can blow our minds with the scale and the problems that you guys are dealing with over there?

Tim: Sure. Look, everything at Google runs on Borg and that includes cloud.

So think about the sorts of requirements you have from your container management system when the containers you're running are actually VMs which have to be long lived and plugged into arbitrary overlay networks with static IPs and identity built into all of that.

So there's a ton of requirements that go into Borg just to serve cloud right. Just as a reminder, something like GKE is containers running on VMs running in containers, running on Borg so those sorts of requirements.

Or take something like Search which finely, finely tunes the number of machines that they have in a cluster based on historical information and predictive information about how the search traffic patterns are going to look.

Those clusters are significantly larger than the average Kubernetes cluster and bring a whole host of requirements for scalability and reliability that Kubernetes just doesn't have. Borg is just architected differently.

Benjie: Totally. It's always been fascinating to me that you guys, or Google in particular, was so keen to open source and create this Kubernetes Project. What was that like, getting that through the brass at Google, I guess would be my question.

Tim: Yeah, it was a journey, for sure. We covered some of this in the Kubernetes documentary.

I don't know if you've seen that yet, but it was an interesting exploration from different perspectives of how we got this through.

We, the engineers on the ground, saw this Docker thing happening and we thought it was going to be big, and we knew that it was only a piece of the solution.

We could all see that if that was the Borglet, then we needed a Borg Master, to use Borg's terms, and that's where Kubernetes really could be important.

There were people in the call chain who thought that we would maybe be giving away the Golden Goose or that we would be empowering the competition too much. But ultimately, cooler heads prevailed and we realized this whole thing, this evolution was going to happen and we could see it happening. We could see the beginning of it moving and we had a clue which direction it was going to go, and we had really one opportunity to be part of that. Otherwise we were going to just be along for the ride.

Benjie: Right. And looking back historically, I think about MapReduce and that whitepaper, and Hadoop.

And Google in the past has released pretty significant whitepapers that have created entire industries so it seems like you guys were like, "Yeah, you know what? Maybe it'll be good to lead this and make sure that it happens the way that we want it to happen."

Or help shepherd it, maybe is the right term. Is that maybe the change in thinking that was part of that?

Tim: Sure. We definitely learned from Hadoop and MapReduce. Not that that could have been different.

The organization and the technology just wasn't set up for us to have done that at the time, but we definitely looked at those sorts of experiences and thought that we could do better than that, we could have more impact than that both for the industry and for Google as a business.

And ultimately, that's the argument that won out, was we could actually not just make the world better, but we could actually build a business around this, that this was a product that we could sell that customers would be hungry for if we could just show them the vision.

Benjie: And going back a second, I did watch the Kubernetes documentary.

The quality of that thing was hilarious. I thought it was going to win an Oscar. Not hilarious, it was just really well done.

We'll put a link to that in the show notes, but it's a really great history that dives into a bunch of these questions so I think it's super interesting and I couldn't recommend it enough.

Marc: So transitioning to the brass at Google, as Benjie put it, has given their blessing and you decide you're going to create Kubernetes, not necessarily to solve Google's problems.

But from your description, you're learning from and repackaging the Borg primitives in a way that's more consumable by an open source project, and folks that aren't running necessarily at Google scale but still can benefit from some of that.

You saw this opportunity and you went and you became... You started spending most of your time or all of your time at Google working on Kubernetes in the early, early days?

Tim: Yeah. We knew that we needed to get the thing off the ground, so we needed a core seed team, and some of the folks up in Seattle built a prototype of what approximated Omega and Kubernetes.

That was built on tools like Docker and Go and SaltStack, and a whole lot of Bash code to glue it all together, and they showed us some of their demos of what they were working on, which aligned pretty perfectly with the way that we were thinking about Borg and Borg as a service, which was the goal at the time.

Those two things together added up to an opportunity that needed people to work on it, and so a smaller group of people cracked off from their main projects and started working on this full time.

We had the support of our direct management, which was great, "Go and figure it out," they said.

And so we did. Very quickly we needed more and more people to work on it and so we went and recruited from the best of the best, the people that we'd worked with in the cloud organization, the people we'd worked with in the Borg organization and brought a ton of that context to building Kubernetes.

And getting the run up to that announcement at DockerCon, that really was the moment the snowball started gathering speed.

Marc: The team there was making several bets that had to come together to work, right?

So obviously with Borg and containers, they've been around for a while, you had expertise in containers. But Docker was new and obviously gaining traction as the commoditization and making it easier and easier to run containers everywhere.

But on top of that, you just mentioned that the early versions of Kubernetes, it was written in Go, which is not the language back then that it is today.

How did you think about that and making the decision to say like, "Yeah, let's keep this in Go and go all in and build the entire stack in Go"?

Tim: So that's an interesting story in its own, I think. I came at this project from a C++ background.

Most of Borg is written in C++. In fact we had some components in C++ that we thought maybe we could just open source these things directly.

But some other people came at it from the Java point of view and they had a lot of Java experience and they thought we have some Java components, maybe we could just open source those and use them directly.

Ultimately the prototype was written in Go because Go is a language that is excellent at writing code quickly, it has a good standard library, good support for things like HTTP which were at the center of a lot of what we were doing.

And so we took the prototypes and we ran with them for a while, and when it came down to the discussion of, "What are we going to do this in for reals?"

One of the concerns that we asked was, "Well, if we're going to open source this, it's really critical that we get a community, that we have other people who contribute to this."

And when you look at open source projects that are widely successful, there aren't a lot of C++ projects that fit that category.

There are a handful, but there are not many. C++ is unapproachable by many people, and in fact a lot of people just don't even want to approach it, and not for bad reasons.

Java has its own baggage. Go was new, it had a decent ecosystem, Docker was written in Go, etcd was written in Go, and so those things together said, "Let's see what we can do with Go. The Go team is here, so maybe we can talk to them if we have problems."

Marc: Yeah. That's awesome, and I think obviously you've added to that ecosystem a ton and there's a lot of... We wrote all of our code in Go, so I like Go, a huge fan.

But I want to jump back, you said like, "Look, we were building the first prototypes and the experimentation for early Kubernetes in Go and then we took a step back and we said, 'What are we going to write this in for reals? What's the version one? What are we going to go for?'"

That's an interesting concept that is not necessarily immediately obvious when you look back on it. There was not a clear path of, "Before we write the first line of code we know what it's going to take to 1.0."

Did you have a certain amount of time, or what were the metrics that you were looking at during that experimentation phase of early Kubernetes before you said, "Okay, now we need to settle on a language, now we need to settle on an architecture, now we need to actually write this thing in a way that's going to be production grade, it's time to stop experimenting and time to start building"?

Tim: I don't think there was a moment when we said, "It's time to switch over."

But as we started seeing the prototype grow, it was a case of, "Well, none of us know Go."

Actually very few people on the team knew Go very well, and in fact we've been accused of writing Go code that looks like Java, which is probably true from the early Kubernetes code. We really thought about what do we know, what can we get done, what has this ecosystem.

In fact, the very first version of a proto-Kubelet was written in Python and it was basically shell scripts embedded in Python strings that were calling Docker.

So how's that for a stack? We thought Python probably wasn't the right language to build a large production system in.

There's some internal documentation in Google that says, "Really, please don't do that."

And the general accepted successor to Python for that niche is Go, so that carried a lot of gravity there. It was a component by component decision.

We had a thing that was the Kubelet to Omega, which we called Omelet, which was written in C++ and we thought, "Maybe we could just use that. Maybe the API server, the prototype that's already been written in Go. That's fine, leave that, but we'll write the node agent in C++. We'll be able to use the open source libraries that we had."

But ultimately some folks like Joe argued that the open source communities around C++, the libraries for doing stuff in C++ just weren't nearly as good, and having a polyglot core system was really not going to be a great idea.

Benjie: Yeah, I think that that decision early on to not have a polyglot situation is one of the big reasons.

I know a lot of people that just have really dove into Kubernetes and loved it, and I think it makes it a lot more accessible.

So Tim, I just looked up an email that you and I have from 2015 when I met you at the Kubernetes Architecture and Use Cases - New York Kubernetes Meetup, November 5th, 2015.

That's when I really learned about Kubernetes. I just think that it's amazing, where we are seven years, six and a half years later.

What I want to know is, when you went around proselytizing Kubernetes and all this stuff, did you have any idea that it was going to be this? Or were you just like, "This is really cool, I'm loving it. But who knows what's going to happen?"

Or did you know? Because I felt, when I was listening to you present that night, I knew what this was going to be and I knew it because I heard your excitement, I heard your voice.

You were explaining this stuff, all of a sudden a Canary deploy was not a dream that I would have one day. It was like, "Oh, I could do that."

So I'm just curious, did you actually know? Was I right? Did you know or did I just get too excited? What do you think?

Tim: Well, I'll try not to take too much of the credit here but I had a strong feeling that we were onto something good.

Having used Borg, I didn't see how I could ever live my life without it again, and I believed that if I showed people what they could do with Kubernetes, they would feel the same way, that they would look at it and they would see, "I do that, except what I do is so much harder than the way you just did it."

In fact, I remember that meetup because I was trialing one of the tools that I had written to script demos, and I remember running through those demos of things like a rolling update.

These were really primitive to what Kubernetes can do today, but I remember running through those demos and having people in the audience nodding along going, "Yeah, I get that. Yeah, that makes sense."

So I definitely felt like we were onto something. Now, truthfully, if you had asked me then, "What does success look like in six and a half years?"

I would not have painted you a picture of what today is. I don't know what I would've said, but it wouldn't have been this.

This is wildest dreams territory and this is still growing, right? Every release is bigger than the last, so it's super, super exciting. But no, my crystal ball is not that good.

Benjie: I couldn't tell you how excited you got me, which actually brings up a little anecdote that I love to rib Marc about.

I went to that meetup, at the time Marc and I worked together, I was working at Replicated. I came to LA for a big meeting with three of the best engineers I know, Dmitri, Ethan and Marc, and I am easily the worst engineer in that room.

But I had seen you, and we at the time were making a decision for the Replicated scheduler: do we keep doing our own home grown thing? Do we go all in on Mesos?

Basically I was convinced that Kubernetes was the thing, I couldn't quite articulate myself because I just kept trying to repeat what you were saying.

But I did not do that great a job, and so I love to talk to Marc and give him a hard time about when he told me, "I think Mesos is going to win this thing."

That was a long time ago, obviously things have changed. But trying to copy what you said was a direct influence on me, and literally something clicked in my brain when you did that.

The other thing, I remember that Bash tool, or it was like a scripted Bash thing and you were doing this demo. Do you still have that? Is that a project that we can point people to? Because that thing was great.

Tim: Yeah. So it was something that I slapped together in the Kubernetes contrib repo, when we had one.

We don't anymore because it just turned into a breeding ground for rotted stuff. But I took it and forked it into my own personal GitHub, so if you go to github.com/thockin/micro-demos, I've got it still.

It's a little shell framework for writing these scripted demos and then a bunch of examples of Kubernetes demos that are built around it.

It emulates typing, it throws some randomness into the output so that it looks like you're typing, so if you really want to play a trick you can wave your hands over the keyboard and look like you're typing but never make a mistake, and then you just hit enter and it will run the next command for you.

It's real demos, it's really running those commands. It just makes it so you don't have to remember to type them right.

How many times have you watched a video of somebody who's clearly cutting and pasting from a doc somewhere, or ctrl+R'ing things from their Bash history or whatever they're doing to get it done? It's tedious.

This takes away that tedium and makes sure that your demos are reproducible.
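
[Editor's illustration, not from the episode: the "emulated typing" idea Tim describes can be sketched in a few lines. This is not the code from micro-demos, which is a shell framework; the function names `type_out` and `run_step` and the delay range are assumptions made up for this sketch.]

```python
import random
import subprocess
import sys
import time


def type_out(text, out=sys.stdout, sleep=time.sleep, rng=random):
    """Print text one character at a time with small random pauses."""
    for ch in text:
        out.write(ch)
        out.flush()
        # Random jitter between keystrokes so the output looks hand-typed.
        sleep(rng.uniform(0.02, 0.12))


def run_step(command, out=sys.stdout, sleep=time.sleep, wait=input):
    """'Type' a shell command, wait for Enter, then really run it."""
    out.write("$ ")
    type_out(command, out=out, sleep=sleep)
    wait()  # the presenter presses Enter when ready
    # The command is genuinely executed; only the typing is simulated.
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    out.write(result.stdout)
    return result.returncode


# Example use in a terminal (blocks until Enter is pressed):
#     run_step("echo hello from the demo")
```

The `sleep` and `wait` parameters exist only so the sketch can be exercised without real delays or a keyboard; a demo script would call `run_step` once per command.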

Marc: Yeah, and you want that when you're running a demo too. You want that authenticity, but you get up there in front of a group and a Bash command isn't working, and you're like, "Wait, I don't know why this isn't working."

Benjie, we're going to have to find a link, see if we can track down that meetup that you were at when Tim presented.

Any chance we can find a recording or something on YouTube of it and put it in the show notes here?

Because I'm going to need to watch this and see what inspired you six and a half years ago on Kubernetes so strongly that you still talk about that.

Benjie: Yeah. I'll never stop. It was the one time. It's like a broken clock is right twice a day, so I have to make sure everyone remembers that.

So by the way, that tool, Tim, where can we get that tool? Is it called Milli Vanilli? What's the name of that tool?

Tim: It's cleverly called Micro-Demos. I'll bop you a link and you can paste it in the notes.

Benjie: We'll make sure to do that. So switching gears a little bit, let's go to today.

What's a day in the life of Tim today working within the Kubernetes ecosystem? So we're recording this right before, I think, 1.24 is locked, I believe. The day before.

So thanks for coming on, by the way. But what does it look like? What's a day in the life of Tim? What's the big challenges? And maybe what's the current on the top of your brain of a big challenge?

Tim: Sure. Yes, tomorrow is the code freeze for 1.24. I looked at my calendar for this week and I thought, "Why do I sign up for these things?"

But I love talking with you guys. So this is an unusual week, but my normal week within Kubernetes is one or two SIG meetings during the week, where we have these Zoom calls where we get together and talk about specific interest areas.

I pay attention to SIG Network, which is my big one, plus Multicluster and Architecture, the core stuff of the system. Those are the big SIGs that I pay attention to.

The average week is probably reading 10 times as much as I write, doing either code reviews or KEP reviews, KEPs are our design docs, and giving feedback on how to integrate ideas with the rest of the system, looking for the pitfalls with some creative, new idea. Where is it going to fall over?

Where is it going to intersect badly with some other feature? This week in particular is all code reviews, so it's the biggest of the big code reviews that haven't been reviewed over the last couple of months because they're so large and daunting.

Well, now the pressure is on to get these things reviewed in the next couple of days so I'm spending a lot of time doing code reviews.

Which, realistically, involves flipping back and forth from GitHub to a code terminal and searching for context and figuring out what's going on, and running tests, and oftentimes checking out the PR and running it, throwing some logs in to make sure that it does what I think it's supposed to be doing.

I also spend some time looking at customer issues and user issues, trying to figure out what's going on: is this thing that they're filing a real bug? Because Kubernetes is full of them. And I try to help triage there.

The community at this point is so large and successful that my own role is less critical than it used to be, perhaps, so I try and pay attention to the things that I know I can offer somewhat unique input on as opposed to the everyday things.

Benjie: I am not a contributor today to Kubernetes. To benefit the community, I have stayed away from writing code. But I'd love to hear a little bit about the code review process, actually. So you get a PR, then you run it, just talk to me about what an actual code review looks like. And then also the testing, how do you test it? I want to understand how you-

Tim: Sure. Well, it depends on the sort of PR. There are some that are easy and there are some that are not. At this point in the code review process, the code freeze process for the release, the ones that are left are the ones that are not easy.

So thinking about the reviews that I've done over the last couple of days, there are some that are deep into proxy, so I've spent many hours staring at this code that's restructuring the way iptables rules are being generated, for example. Convincing myself that the changes make sense, that the right use cases are being considered, that the code itself is readable enough that I can comprehend what's happening here, that if I was forced to explain to somebody what's going to happen on their system I could reason about it.

In many of these cases it was check out the code and stare at it in a code editor, jump from call site to function definitions, make sure that they make sense. I offer suggestions sometimes on naming things that would make more sense, or asking questions that will lead to either, "Yeah, you're right. That was a bug."

Or comments to explain what's going on at that place. Testing wise, I'll build it. In this case I built a kube-proxy component, pushed it out to my test cluster, maybe added some debug logs so that I could convince myself the right paths were being hit, and compared the old and the new state to make sure that it worked well.

But that's all the manual stuff. We do have a pretty robust CI system, so when these pull requests are pushed in the first place, we trigger a CI build that will spin up a whole cluster, run our battery of end to end tests against it, and feed back through GitHub how it went and what went wrong.

That gives me a lot of confidence, the fact that that was green already meant that that contributor had already fixed all the obvious stuff.

Benjie: So those end to end tests, my brain swirls when I try and think of what that looks like. Can you give me a super high level?

Give me a few examples of how you do end to end tests when, if I'm changing a Kubelet, is this to open source projects? Or how end to end is it, I guess?

Tim: It's literally spin up a cluster. Depending on which test cases, it may either be an actual cluster on VMs, or it might be a KinD cluster, Kubernetes in Docker.

Spin up a pod, for example, with a particular specification and then verify that in fact it is running and it is serving the traffic that we expect it to serve, that it got scheduled, that the Kubelet ran it, that the volumes that we expected to be mounted are in fact mounted.

Or in the case that I was looking at this week of kube-proxy stuff, it was make sure that when I go and run a pod and create a service for it, all the different flavors of traffic that we expect to be able to reach that pod can in fact reach that pod.

We turn on a load balancer, which actually goes out and provisions a cloud load balancer, and then run traffic through that load balancer to the backend pods to make sure that it is in fact serving from the pods that we expect it to be served from, and try to mutate the various input modes there and make sure that the changes are correct.

Which gives us confidence in the totality of the change. Obviously we have unit tests that say, "The expected output is X and we ran it and we got X, so that's good."

But the end to end test is the thing that really gives us the confidence to say we didn't forget something important, because the end to end tests encode our conformance suite.

So if we did forget something important, then hopefully the end to end test would catch it and say, "Hey, this particular thing failed," and then you'd go and figure out, "Well, why did that traffic mode not work?"
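
[Editor's illustration, not from the episode: the shape of the check Tim describes, waiting for a pod to be running and then confirming that traffic sent through a service lands only on the expected pods, can be sketched as follows. This is not the Kubernetes end to end test framework. `wait_until`, `service_reaches_pods`, and the `send_request` callable are hypothetical names for this sketch, and a real test would talk to an actual cluster instead of the stand-ins a caller passes in here.]

```python
import time


def wait_until(check, timeout=60.0, interval=1.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll check() until it returns True or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def service_reaches_pods(send_request, expected_pods, attempts=20):
    """Send traffic through a service; confirm only expected pods answer.

    send_request is a hypothetical callable that issues one request and
    returns the name of the pod that served it (for example, parsed from
    the response body).
    """
    seen = {send_request() for _ in range(attempts)}
    # At least one response, and every responder is an expected backend.
    return bool(seen) and seen <= set(expected_pods)
```

A test built on these helpers would first call `wait_until` with a check that reports whether the pod is running, then call `service_reaches_pods` once per traffic mode and report which mode failed.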

Marc: Yeah. I think with the surface area that Kubernetes has and how many other services depend on and count on these interfaces, CNI, CRI, CSI, all of them, for Kubernetes to work you need that automation in order to have any kind of confidence pushing it out. But I want to actually go back and chat for a second about that manual process you were talking about.

I think that happens often a little bit behind the scenes, nobody really sees that happening. The amount of time and effort you, or somebody in a role like you, are putting into really detailed looking at that PR and deciding.

You mentioned you replaced the kube-proxy on your local cluster to test it out and really get comfortable with it and understand how it works, and keep that knowledge so you understand. That feels like a hard thing to scale, so how do you get folks outside of Google and how do you grow the people that are in that position of trust that you have to be able to do that?

Tim: That's a great question, and it is existential to the project, that we fix the bus factor, that there are people who can do those things.

So picking on the same example, there are other people within SIG Network who would do the same work, in fact, probably did do the same work that I did to convince themselves that these things work well.

So it's not just me who does it, but it is probably still a relatively small list. It's probably less than a dozen people who do this on a regular basis.

I'm always working with the people who show up, the top contributors, the folks who are willing to take these bugs on, on how our techniques work. We share each other's tricks and get better at doing this process.

In this particular case I spent probably combined over the last week, I don't know, 10 hours on this one pull request, just making sure that it works. But it's a really important pull request that changes the way things work internally, based on clarifying some under-spec'd areas of the system and so it's really important.

I'm certainly not going to spend 10 hours on every pull request, but there's always a handful of these things that require that level of attention to detail.

There's this voice in the back of my head that says, "Wow, this is just too complicated. I should just ignore it. Maybe it'll go away."

But the truth is these are important changes that the system needs to be able to adopt right.

In this particular case we found some set of corner cases that were just inconsistent and don't work well and weren't really well documented, and new features which were bringing really important functionality were going to make the problem worse.

So this person said, "I'm going to step up and I'm going to refactor this fairly intricate subsystem and I'm going to send you a giant pull request that is really well factored into nice commits, but you have to convince yourself that this works."

And saying no to it would mean saying no to all the subsequent capabilities that we would like to build into the system, so it was important.

Marc: Yeah. I think that's so important. Kubernetes, I actually remember early, early days, going back to earlier in the conversation when it came out, it wasn't high scale at all. It was very limited but had a lot of potential, and everybody could see where this thing is going to go.

As the project's matured over the years, the complexity of it actually has gone up a lot too. When your 1.24 code freezes now, people are going to trust when you ship 1.24 that they can put it on their clusters and run it in production, and you don't want to ruin that trust by introducing regressions or being too careless.

But you also want to balance that and encourage community adoption, and you don't want to default to saying no to every PR that comes in either, so it's tricky.

Tim: Yeah. You touch on something really important there, there's a lot of opportunities for changing things that would be simpler or make the code cleaner, but would represent breaking some small corner case that maybe used to work, maybe it shouldn't have ever worked but it did work.

We've had this debate over and over and over again, and repeatedly we fall on the side of don't break users. Even if this thing was not supposed to work, if it did work it's effectively part of the API now.

You get this struggle between, "Well, it only works in these certain circumstances, and not in these other circumstances so we should just fix it and simplify it and make it consistent, and not work in all cases."

I find myself, I'm attracted to those arguments but I have to step away and say, "No, we don't do that. If we can make this work then we have to keep it working and just document that this is a special case and, for historical reasons or for good reasons, whatever the reasons are, we're going to keep it that way."

I think that that really shows the project's commitment to being enterprise ready and happy to make the customers successful. We're not going to say, "Well, that was a bad idea. We're taking it away."

Benjie: What you're describing right there is the constant struggle of every engineering team for every product that is a software application.

For me personally, as someone who obviously has a company outside, this is literally the conversation I think I have once a week, of, "Hey, the customer is always right, even when they're wrong."

How do you balance that? I think that what you just articulated is a really challenging way for all of us engineers to think about things, and I can commend over the years that Kubernetes...

I definitely have done some things that I shouldn't have been doing, but it never broke with new, subsequent releases so thank you personally for that.

It's interesting how much as an open source project, Kubernetes really does have... I can't imagine the number of users that it has, but you have to keep all of them happy.

I think Marc spoke about this earlier, you guys do a great job of that and so it's just impressive. Speaking of which, I wanted to ask you what is the most surprising usage of Kubernetes that you know about, that you were just like, "Oh my god, they're doing that. That's so cool"? Maybe you're like, "Oh crap, we have to keep this feature in there."

But give us a few examples of some of the things that are just mind blowing. Who's using this thing for what, your baby for what?

Tim: Yeah. Well, there's a few that I think are interesting. I tweeted at some point, I think it was at a KubeCon keynote when the folks from CERN were talking about how to use Kubernetes to run the Supercollider, and that blew me away.

That is so cool, right? That is world changing science and Kubernetes is down there helping them control, which just blows me away. Even now, it was years ago, and just saying it out loud still gives me the chills.

There's other demos that I've seen that are mind blowing too, like Kubernetes running in an airplane, running inside the US Air Force, in fighter planes, doing stuff I'm sure they can't tell us about. That is not what it was intended for. It's awesome that it works. I haven't heard any specific requirements from them, things that didn't work. But wow, that's a pretty strong bending of the system in my opinion.

The other one that I thought was interesting but I didn't really see coming, is people using it as a software distribution mechanism, building their SaaS products that they ship to enterprises and bundling it with an embedded Kubernetes cluster and saying, "Just run this and this will turn on a VM in your VMware or whatever, and it will run a Kubernetes cluster on that VM and it will run our software in that Kubernetes cluster."

And they're using Kubernetes as a way to say, "We don't have to deal with some of the problems that we used to have to deal with."

The Kubernetes cluster isn't part of the surface area, it's not part of the product that they're selling. It is just a mechanism for them to deploy their stuff, which I think is pretty cool.

Benjie: I think it's super cool. I think you just described Replicated, by the way, and I don't work there anymore so I think I can say it with less bias, how effective Kubernetes has been.

I have observed what Marc and the team have done over the years, and it's really impressive, the portability that Kubernetes gives us. Again, because of these standards that you guys have built over the years, so I couldn't agree with that more.

One, I don't know if you're a frequent listener to Kubelist, Tim, I'm sure you are, but we actually had the folks over at Rancher on and they told us that there's somebody running a Raspberry Pi Kubernetes cluster in a satellite, in space.

Tim: That does not surprise me. I joked with somebody from NASA at a KubeCon and I just said, "Hey, when you're ready to put it up on the Space Station, you let me know, okay? Just give me a call because I'd like to watch."

Benjie: Don't you think you should get a ride or something for that? Maybe they need you on prem to help install in case there's an issue?

Tim: That seems fair to me. If anybody from NASA is listening.

Benjie: Yeah. Maybe we could get you on one of the next SpaceX flights. Well, look, I think that I've heard a bunch of other stuff.

Another little anecdote I'll share, maybe 15 years ago I had a family friend who was actually a test pilot, maybe it was longer ago than this, and he was telling me, and this was when he was testing out the F-22 so this was years and years ago, he was telling me that those were running a Windows subsystem or running Windows.

I swear this was probably Windows 3.51 running in 2000, and he was telling me that mid flight he would have to reboot half the systems, literally flying, and this was a fly by wire system so I was like, "I don't understand, how does the plane not crash?" I never quite understood that.

So you were saying the fighter jet use case makes a whole lot of sense, to me, just bringing back another old anecdote there.

Yeah, I've seen all kinds of crazy stuff and us, over at Shipyard, we also are running arbitrary workloads, same type of thing. It's just crazy, what this has unlocked for us all.

Tim: Yeah, at the end of the day everything is software, right? And there's nothing that isn't software anymore.

Whether you're talking about enterprise company management or whether you're talking about cloud scale stuff, or whether you're talking about retail and running all of the little devices that are inside of a fast food place, it's all software.

It boggles the mind, 90% of the software in the world or 99% of the software in the world will be bespoke for those sorts of applications and they all need some way of deploying and updating and canarying.

Especially with the rise of things like mobile ordering and the customer cards, it's like the future, you see these things in movies, right? When you walk into the Gap, what was the-

Benjie: Minority Report, right?

Tim: Minority Report. Thank you, yes. Right. We're kind of there, maybe it's not scanning your eyes, but it sure is your phone telling it, "Hey, I'm checking in."

Benjie: Yeah, that's true. Chick-fil-A is running, I think, local edge Kubernetes clusters at each one of their franchises, so we're there.

Tim: Yeah, exactly.

Marc: So let's transition for a minute here. Tim, you mentioned earlier that 1.24 code freeze is tomorrow.

Are you involved at a high enough level in the project, in the overall Kubernetes project right now, not just the sigs that you're focused on, to be able to share some highlights like the TL;DR for what people... I imagine this episode is going to be released post code freeze, but before 1.24 makes it out so what should we be looking forward to in 1.24?

Tim: So that's a trick question because Kubernetes is a loosely federated set of projects that all check into the same code base. There isn't a top level theme or goal anymore, we don't do top down development in general.

So I can tell you what the hot things are in the sigs that I'm paying attention to, but there isn't an overall like, "This is a XYZ release where we're going to focus exclusively on features that do X, Y, Z."

It just doesn't work that way. There's always APIs in various states of evolution.

We have a pretty formal process for evolving APIs so every single release, this one not excepted, there are APIs that are moving from alpha to beta and APIs that are moving from beta to GA, and features that go with those APIs, whether they're command line features or server side features, and often they are verticalized within a sig. I don't mean that in a negative way.

Often we in the industry, we talk about siloing being a bad thing. I don't mean that as a bad thing, I think actually our sigs do a pretty decent job of giving people who have common interests, common goals, a space to work on those common goals while still being permeable boundaries between those groups.

So I can't answer the question from a high level, what's going to be in .24? A lot of things.

We have a release team who spent an enormous amount of human energy pulling together all of the enhancement proposals that are going to be in the release. That deadline was a couple months ago, and I think there were north of 40 enhancement proposals that are going to be changed in this upcoming release.

Marc: That's great. Yeah, and it's going to be the newest release when we all are, hopefully, over in Valencia at KubeCon in May.

Benjie: When you are at KubeCon or you have been there in the past, I think it's interesting.

So for people going for the first time this year, or maybe the next time, what do you think are the biggest things that you've gotten out of being at KubeCon over the years and just being face to face with all these people? What are some of the highlights of things in conversations you've had?

Tim: Well, I'm not sure that I have the normal KubeCon experience.

But I think KubeCon has amazing technical tracks, so first of all massive kudos to all the people who do those proposal reviews.

That is an enormous amount of very hard work, I've done it, I'm not on the program committee right now because, oh my God, is it a lot of work so massive props to those people who put together a great conference, first of all, on the technical content, and to all the speakers and everybody who presents.

For me personally, the thing that I value out of KubeCon is being able to get together face to face with the people that I interact with primarily via GitHub and Slack and Zoom.

Putting a face to a voice or a real three dimensional figure to a person, being able to sit down and have a meal or just chat in the halls is so useful and productive.

It is how, in my opinion, the biggest and hardest decisions really get cracked and it's how we build the camaraderie and friendship, really, that lets us build a project of this scale, trust each other to make good decisions without second guessing each other.

And to have even conflicts and disagreements when we need to and still keep everything friendly and happy. I've certainly seen other communities that don't have that and I think the real life connection matters a ton.

Benjie: Yeah, I think one of the things and actually in the documentary about Kubernetes, they broached this a lot, I think that the community and the building of the community is probably the real amazing victory of the Kubernetes project. It's so civil.

I know Marc is actually one of the crazy people that helps with setting up the KubeCon. I think the real time, I'm sorry, the runtime track this year, Marc, is that right?

Marc: Yeah. The whole reviewing proposals, it is to your point, there's so many great proposals sent in for folks who want to speak and the bar at every KubeCon, it's higher and higher, the technical track for the actual sessions.

But I think it's also super interesting to dig into what you were talking about, that trust, the camaraderie, the friendships that you actually build, you go there, you're in person and you have that...

People are fully present when you leave the events at the end of the day, you're exhausted because you were so engaged in all these conversations at a different level than you are when you're connecting over Zoom in remote, and jumping from meeting to meeting.

Tim: Totally, absolutely. I get worn out. Usually my voice is blown by the end of the first or second day, but it's the best kind of laryngitis, right?

It's yelling over the crowds of excited people who are just doing fun stuff and there really is no substitute for human contact when you're building this.

Not to get too political with it, but with the whole return to office and remote work situation in the world, we're on the pivot point of many companies calling for a return to office.

While I very much appreciate my ability to work from home and I can do events like this recording from my comfortable desk at home, when I go into the office and I have meetings face to face with people, it is a different experience.

Benjie: I actually remember the first time, obviously like everybody else, COVID we went all remote and a year, a little over a year later, between a few different waves, a few people got together in an office in a WeWork space for our company.

We were just working through a hard problem, and it was the first time we got together in person for over a year and we had a whiteboard in the room, and it was just like, "I miss this. I so much miss this." It was so, so great to be able to do it again.

Tim: There is no substitute for a room and a whiteboard.

Marc: Okay. So, Tim, we're going along here but this is pretty darn engaging, so I'm going to ask you the few more questions I have if that's okay.

I wanted to know, looking back, what are you the most proud of? And also a lot of us, we're building software and we all have regrets and we're all like, "Oh my God, why did I do that? Or why did I do that sooner?"

I just thought maybe you had some very interesting perspective on that, around the project.

Tim: I find myself in an awkward place within this project where I seem to own the networking subsystem, but I'm not a networking person.

I really don't have any networking background, and so I came at it from the perspective of application developers and what would I want to work, how would I want it to go if I could have anything I wanted? And then I just said, "Let's do that."

And I've had many networking people, Real Networking people, capital R, capital N, Real Networking, tell me that I have made their lives very difficult and I'm okay with that, because I have, I think, made the lives of developers easier.

So if there's anything I'm proudest of, it's probably that focus, that ability to say, "I understand that this is hard, but hard is okay. We can do hard."

And to keep the relentless focus on what do developers need, what is the experience that they need? How do I empower people without getting in the way? That's really where my focus has been and continues to be.

Yeah, there's definitely a lot of things that I probably would do different if I could go back in time, but I can't and I'm okay with that. I'm just going to live with it. That is a lesson that I've had to learn through this project, I'm okay with it.

Marc: Well, when Kubernetes is powering a time machine in 17 years, maybe we can have some interesting recursive fixes.

Tim: To borrow a phrase, there's only two kinds of systems in the world, the ones that people complain about and the ones that nobody uses.

So as long as people keep complaining about Kubernetes and as long as the venom keeps flowing on Twitter, I'm a happy guy.

Benjie: I think that's very fair to say. Again, I want to have you go be a Tony Robbins for my engineering team over here, because the things that you're saying just make a whole lot of sense.

But we're not building systems because they're easy, we're building systems because they're hard and we have an end user who we're trying to make their life... Personally, I like to try and make people's lives magical and the end user of Kubernetes is obviously developers.

Well, maybe you could argue the end user, but anyway. So yeah, I think approaching these problems not from a how hard is it to do, but how valuable is it for the consumer, I think that shines through this project.

There's a lot of stuff there. So you're happy with the YAML is really what you're saying?

Tim: Well, that's a different question. I think I posted on Twitter a few weeks back that I think everybody should use a variant of YAML that isn't the default that we use, where it looks a lot more like JSON but it has comments and I think that would be better.

Not everybody agreed with me, let me just say that, but as it goes, YAML is fine. Except for the whole Boolean interpretation stuff, that was just a mistake.

Benjie: The yes and no. Yeah, there were some crazy things there.

Marc: Yeah. I have been proselytizing to not add a new YAML. I'm okay with YAML, it's just we got to stop having everybody add a new YAML on top of a new YAML on top of a new YAML. But that's neither here nor there.

Okay, so one thing we didn't touch on and we're going to wrap up pretty soon, I promise, but what are the biggest challenges that are keeping you up today?

What are the things that you want to focus on or that you want someone else to focus on, that you think is in the immediate path for Kubernetes to just keep growing and being what it can be?

Tim: The banner that I've been carrying of late, or the windmill that I've been tilting at anyway, is multicluster and here's why.

From the beginning we built Kubernetes as a walled city and as long as you're within the walls, we have running water and we have paved streets. Occasionally there's potholes, but overall it's okay as long as you stay within the city.

We put our fingers in our ears and pretended that the world outside the city wasn't important and didn't really exist.

We all knew that that was false, but it turns out that as you get successful and you get adoption, you cross the chasm as it were, it becomes really important that you have doors in your walled city that lead to reasonably treadable roads onto other cities.

My analogy is breaking down. But the biggest problem that we have is that integrating Kubernetes into real environments is underserved, whether that means dropping a cluster into your enterprise network and having it work from non-cluster clients or with non-cluster servers, or integrating it across multiple clusters, across regions, or across providers even. There are reasons that people need multiple clusters.

Those reasons can't be denied or designed away, they have to be dealt with and so the thing that I have been most focused on for the last year or so is what are those reasons, how can we best serve the people who need to leave the city and find their way to some other destination?

So I've spent a lot of time looking at it from the networking perspective and from the application perspective, from the administrative point of view.

There's many different problems there. I think that is the place that we need to go next. Not to say that Kubernetes is done on a single cluster basis, it's certainly not.

There's a lot of capabilities that we can add, but they are getting more refined, more sophisticated maybe is the right word, so they're having less return on each investment, maybe.

Diminishing returns. The real value is getting out of your cluster and into another.

Marc: And that's something you see that might be interesting to bring into Kubernetes core? Because these folks might be solving that today with other layers, right?

Like a service mesh or other tools that are applications that they'll bring on top. You actually see this as fundamental to the next step of Kubernetes right now?

Tim: I want to be careful because I don't think Kubernetes can or should try to be everything to everyone.

We should not in house a service mesh into Kubernetes, although we kind of already have one in the form of Kube Services.

But it's not my goal to say, "Absorb Istio." It is however my goal to find the things that Istio or other... Service mesh is one example.

Other systems that build solutions, what do they struggle with? What can I do to make it easier and better and safer to build those systems and to make it more consistent and more normal to use those systems?

As an example, we started the Ingress project many years ago. It sat in beta for a very long time because it was an unsatisfactory API.

We finally did GA it with the acknowledgement of it is what it is. But we also started the process of replacing it with an API called Gateway.

Gateway is a much more robust API, and it turns out is actually capable of expressing much of what service mesh APIs would have expressed.

So the question then is can we use that to normalize the API a little bit, at least the common path, so that users can have that same sort of portability that they're used to with Kubernetes Core primitives when they're starting to go beyond the edges of their city?

Another example here is some work we did in multicluster sig called Multicluster Services which takes the concept of Kubernetes Services and defines some semantics.

Not an implementation, it's not a thing that you download from Kubernetes and run. It's a specification for how multicluster services can work, which then lets implementations go off and figure out how do I offer this right?

And that is because it's very tied to how your network works, and already we can see there's a handful of implementations out there that are offering different ways of getting the same semantics.

At the end of the day the application developers are the ones who use these APIs and they can rely on some project defined semantics, but implementation defined details.

Marc: Got you. I'm just looking up, we're going to include a link to the Gateway API that you were talking about in there. I think that's really interesting as an example of that iteration that you're working on.

Benjie: So CNCF is a big part, it's the biggest part, of Kubernetes. How involved in CNCF are you and do you keep an eye on other projects? Or how do you look at the ecosystem as a whole?

Tim: Yeah. CNCF obviously is critical to the success of Kubernetes and ongoing. I'm not honestly super involved in it. I'm not on the governing board or anything like that.

Those sorts of roles don't suit me. But I do pay attention to the projects that are going in and the ecosystem and the chatter about those projects to see what's coming out.

The reason Kubernetes was successful was that it was an opportunity to compete with the established players, so I like to watch what the new guys are doing as they're coming up. But I'm not super, super involved in it.

I do some work with running our Kubernetes infrastructure which serves the website and all of our downloads and those sorts of things which is officially run by the CNCF, but funded by grants from Google. And so that's my biggest point of contact with the CNCF right now.

Benjie: So Tim, really appreciate having you on here. One thing I wanted to ask before we jumped off, was if some of our listeners want to contribute to, let's just say, the networking sig or whatever, where do you suggest they get started?

Where's the best place for me to start as a developer if I want to help contribute and be a part of this?

Tim: So that's a really tough question because the system is large and reasonably well established.

What I mean is, all the easy stuff is done for the most part. We're onto the hard stuff that is intricate and has subtle implications for the rest of the system.

That's not to scare people away. Definitely we welcome new contributors, but before you can just show up and start swinging you've got to learn about something.

What is it that you're interested in? If it's the networking group, is there a topic that you're particularly passionate about or that you have some background in that you'd like to participate in?

We have more bugs than we can shake many sticks at, and so coming in and saying, "Okay, maybe I'll work on some of the bugs around..."

We were talking about kube-proxy, "So I'm working on some of the kube-proxy bugs." There are TODOs and things left in the codebase.

There are issues filed, although not as many as I'd like that are labeled Good First Issue.

That's the way to get started. Show up at the sig meeting if you can, or show up on the mailing list, introduce yourself.

Say what you're interested in and what you'd like to help with, what skills you bring here.

We welcome all skill levels, whether that's doc writing or code writing or tests, or whatever background you've got. Certainly we have Real Networking, capital R, capital N, experts on the sig and we have other people like me who are less so.

We have all flavors, we're very welcoming I think. So if you're interested in helping, step one is just to show up.

The advice that I give to most people inside Google and outside of, "I want to start but I don't know how to help, I don't know where to start with it," the usual answer is start with main.

Pick a component that you like, that you're interested in understanding, start reading the code, and I swear if you can make it an hour without finding something you want to fix, you're not looking hard enough.

Whether that's refactoring or renaming or commenting or moving stuff around to make it more readable, new eyes are such a valuable resource that I hate to waste them.

So if you're new to the project, pick something, start at main and start walking through that code until you feel like you understand it.

Marc: That's an interesting approach. I think it's really good advice. So Tim, thanks so much for joining.

I know you took time out of a really busy week where you have a lot else going on, getting 1.24, trying to do your part in that release.

Definitely really appreciate you taking the time, I learned a ton and it was a really interesting conversation for me.

Tim: Thanks for having me, guys. It was actually a lot of fun.

Benjie: Thanks, Tim. Really appreciate the time.