APR 7, 2021

53 MIN

Ep. #12, Istio with Craig Box of Google Cloud

Guests: Craig Box

about the episode

In episode 12 of The Kubelist Podcast, Marc speaks with Craig Box of Google Cloud. They discuss Istio’s features and community, the purpose of service meshes, and solving big problems at the platform level.

about the guests

Craig Box is the Kubernetes / Istio Advocacy Lead at Google Cloud. He is also the host of Kubernetes Podcast.

show notes

transcript

Marc Campbell: Welcome, Craig.

Craig Box: Thank you very much. I'm very happy to be here.

Marc: Great. So, Craig agreed to join me today to talk mostly about Istio.

But before we get into the weeds there, let's talk a little bit about your background, Craig.

What does the cloud native advocacy team do on a day-to-day basis at Google Cloud?

Craig: Developer relations is a funny thing to do in a world where your customers aren't necessarily developers.

I think of developer relations as being the three-legged stool of somewhere like Apple, for example, where you have people who make apps and then you have people who buy phones, and then buy the apps, and those are different teams.

In the case of B2B software, like cloud, the users and the consumers are the same people.

And so, with that in mind, I'm also focusing more on operations teams.

So, in our developer relations function in cloud, we talk to people who operate software just as much as people who build it.

And I'm in particular focused on cloud native because that's the space that makes sense for my particular background.

I was a BBS sysop way back in the day.

I did system administration work coming out of computer science school, because I didn't really want to do the traditional lock yourself in a basement and write code kind of thing.

I enjoyed dealing with people, I enjoyed dealing with systems and that worked very well for me, as cloud came up as a concept.

The first meeting I went to when I joined Google seven years ago was in Seattle.

And I actually wrote two things down on a Post-It note, which I still have somewhere in a drawer.

The first one is, "There's all these hexagon logos, someone should really make a Settlers of Catan board for Google Cloud."

Which unfortunately, nobody ever did. But the second one was, "There's this Project Seven thing, which I should look into."

And that, of course, became Kubernetes and that's really been what I've worked on ever since.

Marc: That's great. Yeah, there seems to be a lot of ex-BBS sysops in this world lately.

Craig: I think there's a difference between enjoying taking software, and customizing it, and having it solve a problem, and starting with a new page and building something.

I'm definitely more of that former group.

Marc: Yeah, that totally makes sense. I agree.

I mean, I think Kubernetes solves all kinds of problems and it's often technology, where it should be.

It's a means to an end. You're starting a business, you're trying to create a solution and Kubernetes helps you in all kinds of ways.

The whole cloud native ecosystem does.

But it's also fun to sometimes get to really be deep in the weeds and have technology and the cloud native ecosystem be the thing that we're working on.

We're not just using it to deliver something else.

Craig: One of the things about open source, and especially working in developer relations for open source, is that the people who build the thing are very active in the community.

If I was working on an internal product, I might need to be the outgoing developer face and saying to people, "Here's how it works internally, because it's closed source and we can't necessarily show you."

But in the case of our open source stuff, people like Brian, and Tim, and so on, they are very active in the community.

They're participating a lot. They can be seen. The code can all be seen.

So, it's a different set of skills, I think, and a different set of things that need to be done.

A lot of the things that I work on are, "What's the lowest hanging fruit on the particular project that I can solve?"

And it's not normally a code thing. Sometimes it's, the documentation doesn't clearly explain what it needs to do, or the integrations need to be done.

And sometimes it's finding the people internally, connecting them together and getting them to solve those problems.

Occasionally, I get to sit down and write some code, but I wouldn't necessarily take anything that I wrote and run it in production.

Marc: That's awesome. That's great.

You're also the host of the Kubernetes Podcast, which has over 140 episodes already right now.

First, congrats. That's a lot of really great work contributing back.

Craig: Thank you.

Marc: I'm kind of curious, what motivated you? How did you and Adam originally start the Kubernetes Podcast?

Craig: Yeah. Adam Glick and myself both had a little bit of a background in radio and we both worked in the Kubernetes space at Google.

Him in the marketing team and myself in developer relations.

And we both had an idea around the same time that this would be an interesting thing to do.

We both enjoyed podcasts. Being a company that owns and runs a large video platform, there is quite a focus on video at Google.

But I think that video works well for things where you have your full attention.

You're teaching someone something, doing it through video is great because you can show them things.

Podcasts are a great way to catch people when they are walking the dog, or doing the dishes, or commuting in a car, when that's a thing that people used to do. And you can't necessarily teach them, but you can talk to them about why things are and sort of explore the background and maybe give them some hints for things that they should go and look up when they sit back down at their computer.

Adam came at it wanting to give the news of the week kind of show, I came at it wanting to give an in depth interview kind of show, and between us, we joined the two up and that's where we landed.

Marc: Yeah, that's great. I love the news of the week at the beginning of it too. That's actually--

I'll listen to the podcast when I'm going for a run and it's like, you're outside, it's going to be 45 minutes to an hour of running and being able to tune out the agony, the pain of running and actually just listen to the show, it takes your mind off it.

It's a great podcast.

Craig: Well, thank you. And I find that it's very hard for people who enjoy podcasts to just sit and listen to their own thoughts.

I think they're used to having someone else's thoughts in their ears all the time.

Marc: Right.

So, one of the really neat things about the CNCF, the Cloud Native ecosystem in general is that it really encourages a lot of experimentation.

And through 140 episodes, you've probably talked to all kinds of projects.

Some that are huge, like Istio and other projects that we might be using on a day-to-day basis, and maybe some, the experiment didn't work out.

And I'm curious if there were any episodes that you remember recording that you were really excited about, you thought they were going to catch on and become really, really big in the ecosystem and didn't.

Craig: I'd like to claim that we were selective from the beginning and only picked the people to talk to that we knew would succeed.

But that being said, when I look back on the people that we've spoken to, there has obviously been a large amount of consolidation by way of acquisition.

There are people that we talk to who were building something as a startup and then were acquired by a larger company and that became part of their broader portfolio.

But I don't think I can look back and look at anything that obviously is an evolutionary dead end.

rkt (pronounced "rocket") is the singular example that people talk about in the cloud native space as a project that was archived.

It was built as a container runtime, but then when Docker opened up the kimono a little bit more and standardization happened on their platform, it really wasn't needed anymore.

And that, to me, is a project that achieved what it needed to do. Its goal was to open up running containers.

It wasn't necessary for it to be rkt that was the runtime that succeeded.

And a lot of the ideas from rkt went into OCI and the specifications for containers afterwards.

So, overall, there are a lot of people who try and scratch an itch a particular way.

I think it's valuable for people to open things up and publish them, even if no one else wants to use them, just to get the experience of running things.

Marc: Great. So, let's dive into the topic that we came here for, Istio.

So, to start off, how would you describe Istio to someone?

Craig: Istio is a service mesh. That's a little bit of a topological definition.

So, I like to look at it by way of example of how the problem came about.

We started off at Google with the idea of wanting to index the internet.

And in order to do that, back in 1996, '97, it was a lot of machines.

It was, take a job that you couldn't necessarily do on one machine anymore and then scale out to those other low-cost machines.

And in general, once we'd moved software that used to run, every single person had a desktop machine and software ran on that machine, to a world where it's all running as a service in a data center backed by many computers.

Everything you launch is internet scale. Everything you launch at Google gets millions of hits per second immediately upon launch.

And if you build something successful in the community, you can expect a large amount of traffic to it.

And that means you need to have generally more computers running it than just one.

And when you start having more than one computer, you get into this distributed system world where you can no longer have all of this data of everything happen in one machine.

And now, we have to deal with the problems of decentralizing.

We talk maybe about taking monolithic applications and breaking them up into services, but even just taking one application and running multiple instances of it, now we need to worry about if we make calls between these instances of the application.

We're calling over an unreliable network, so the traffic may not succeed.

You may hit a busy endpoint. You may have your request denied for a security reason. You may have someone infiltrate your network and get access to an endpoint or take it offline or something. So, you can no longer trust that the network is there, in the same way you could when you were just calling a function on the local machine and you were both the source and the destination.

There are a lot of people who solve this kind of problem in their own code.

If you are writing everything in a single programming language, that's practical.

But not everyone is necessarily even writing all of their own applications.

A lot of people obviously are running off the shelf software that they don't have access to change the code to, so we find that hooking in at the network level and basically making the network application aware, is the best way to solve this problem.

You might previously have thought about putting a proxy server between a front-end and a backend application.

But again, that becomes a potential point of failure.

So, what we did was take-- Instead of a middle proxy, we would break that up and say, "Each source and destination gets its own little proxy attached to it," which runs in this pattern which we call the sidecar.

And then, you know that each proxy is always there because it's running on the same environment as your particular workload.

You can run that next to a virtual machine instance as well as a container instance, if you want.

And then, you can program all those proxy servers to be an application aware network.

And when I say, "Application aware," it's not just saying, "Take this packet and throw it to an IP address and I don't know what it is."

It can inspect it and say, "All right. Well, this is a particular application that I know how to handle. I know how to handle its retry requirements and I know what kind of thing it is so I can make security decisions on this."

I can move all of that stuff out of my application, so I don't need to worry about it myself.

And I've got the network layer thinking about that for me.

And so, that network layer is what we call the service mesh.
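
To make the "retry requirements" point concrete, here is a minimal Python sketch of the kind of retry-with-backoff logic that otherwise ends up copied into every application and that a sidecar proxy can take over. The names `TransientError`, `call_with_retries`, and `make_flaky_endpoint` are hypothetical and exist only for this illustration; they are not part of Istio or Envoy.

```python
import random
import time


class TransientError(Exception):
    """Stands in for a timeout, a busy endpoint, or a dropped connection."""


def call_with_retries(request, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call request() up to `attempts` times, backing off between failures."""
    for attempt in range(attempts):
        try:
            return request()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to the caller
            # Exponential backoff with jitter so retries do not arrive in lockstep.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))


def make_flaky_endpoint(failures):
    """Return a fake endpoint that fails `failures` times, then succeeds."""
    state = {"calls": 0}

    def endpoint():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise TransientError("endpoint busy")
        return "ok after %d calls" % state["calls"]

    return endpoint


if __name__ == "__main__":
    print(call_with_retries(make_flaky_endpoint(2)))
```

With a mesh in place, the application would make one plain call and the proxy beside it would apply a policy like this on its behalf, in one implementation rather than one per language.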

Marc: Got it. Yeah. And I think that's something that we talk about a lot, about why Kubernetes is so popular.

It's taking patterns that we were all responsible for at different layers of the application, different runtimes, different programming languages and moving them all to the platform layer, just makes the job of the developer--

When you want to write your code, you focus on writing the application and solving the problem.

You don't have to continuously reinvent these same problems, because at the basic level, Kubernetes is providing platform level solutions.

But really, it's not just Kubernetes.

It's Kubernetes in a service mesh and all of the layers on top of this that are actually functioning as that virtualized compute layer.

Craig: Yeah. And bringing it back to the desktop metaphor as well, showing my age again, you can think about applications that used to write their own access to the hardware.

They used to have to do their own graphics, and sound, and so on for gaming, for example.

Then there were operating systems and higher level environments, things like Windows and APIs like DirectX and so on that you just wrote the thing that mattered to you.

And now, we're at the point where you basically pick up an open source game engine or a commercial game engine that you can license for free and you write the business logic of your game and you don't have to worry about all the pieces underneath it, because someone else is managing those layers for you.

And the service mesh is one of those layers in the cloud native stack.

Marc: And coming out of Google, like you mentioned, everything at Google, the moment you launch it, it's operating at internet scale.

So, there's a really great paper from Google, Borg, Omega, and Kubernetes that talks about--

From Brian Grant, talks about all of the years of learning.

And that really focuses on-- We looked at Borg, Omega, and Kubernetes, but I'm assuming the service mesh is really those years and years of shipping everything at internet scale at Google, brought down so that I might not have that problem.

I might not have that scale problem today, but I shouldn't have to worry about that problem in the future.

Craig: Yeah, the SRE book talks a little bit about the Google stack in a sort of pseudonymized version, but Borg is obviously a layer of that in terms of, how do I run these applications?

Then you also need to get people connected to them and have those applications be able to talk securely to each other and so on.

Google addressed things in a slightly different way, because Google was one company who was, again, both the source and the destination.

We would code everything that we wrote in one of four programming languages, so it was relatively easy to maintain libraries, but still four times harder than it should've been.

And even open source software that we would take from the outside, things like MySQL, everything that ran at Google was recompiled internally.

We put it into our giant mono-source repo and build it and run it ourselves.

So, the way that software was run and deployed at Google didn't necessarily make sense for the way people would run things with many different teams, and places where there wasn't complete trust between all the people involved.

One of the layers that made sense to break out was the communication layer.

We had an internal system, which eventually got open sourced as gRPC.

And that's a way of saying, "All right. I'm going to take this library. I'm going to embed it in the applications that I write and I don't have to worry about writing the libraries to take packets off the wire and turn them into classes that are in whatever language that I'm using."

That is an example of a thing that was an internal Google thing.

It had a few different variants internally and then it got an open source version created.

That now becomes the internal system that Google uses, replacing what was done internally.

And there are other people who have been through Google and looked at this problem and implemented it themselves.

But we're really pleased with the success of gRPC in the ecosystem and it's a great solution to that problem, which we know people are going to have.

They might not think they're going to have that problem when they're just writing one application and running it on one or two servers to start with, but if you make a few investments upfront, then it's a lot easier to deal with the inevitable problems of a potential success later on.

Marc: So, making those initial investments-- I'm running a Kubernetes cluster, but if I don't have a service mesh, am I doing it wrong?

Should everyone just have a service mesh from the beginning installed on their Kubernetes' clusters?

Craig: That's a tough question and I think the answer to that sort of comes down to layers.

It's like, question, should your fridge have Linux?

A lot of people have smart fridges these days.

They're going to run some sort of operating system and perhaps you should have the functionality that is provided by your smart fridge, but you don't necessarily have to care how it comes about.

The same thing is true of a stack that you run a modern cloud native application on top of.

You should have something that runs containers and deploys them for you.

It's generally nowadays Kubernetes, because the ecosystem has seen the benefit in standardizing on a single API.

There are sometimes divergent implementations of the thing behind the API, but ultimately you can solve problems by dealing at a much higher layer than the Kubernetes layer.

So, in something like Cloud Run on Google Cloud, you can just deploy a container, or you can give it a source directory and a buildpack and say, "Just run this thing."

The thing underneath it might be Kubernetes or it might not. It's based on the Knative open source project, so you can take all of those pieces and run them on your own Kubernetes, or you can run them on Google's magic machine as well.

But ultimately, you're dealing with the thing at the level that makes sense to you.

It might, again, use a service mesh layer, like Istio, underneath it to handle the traffic routing.

But if you're dealing at the level that makes sense, then you may not need to know about those details.

Now, if you're a platform team, you absolutely need to know about those details, because you are providing that for somebody else.

And at that stage, there are very few times that I would say you can solve the problem of a service mesh better yourself. It's the same way of saying, "Well, I could deploy containers better myself."

There are some other platforms out there which are more specialized, but ultimately, a general purpose platform like this, with the industry support, with people working on it, it's going to be easier to run.

It's going to be easier to find people in the marketplace to work on it. And it's ultimately a really good solution.

Marc: So, that all makes sense. I mean, Istio is definitely-- Or service meshes in general are a really good solution.

But, here we are-- Just being realistic for a second.

You have a really good solution to a set of problems, but there's definitely a challenge where, I'm a developer and I have a problem, and that problem might be solvable by Istio or solvable by the service mesh pattern, but I don't necessarily see this as, "Oh, that's going to solve the problem that I have."

How much work do you do and what do you think about, how to advertise and make sure that the folks out there running Kubernetes are aware that, "Hey, this problem that you're running into is solvable by just installing and using this pattern that we already have?"

Craig: The problem isn't necessarily solvable by throwing software at it.

The idea of this area I think is that there is a lot of automatable toil and so on that can be solved, but you still need to have process behind it and you still need to have a development team who are willing to invest in that process.

Ultimately, Istio is a product that moves those concerns from the development team layer out into the platform team layer.

So, as the development team, you may have a requirement that says, "Our software has to be secure." And that might have a specific requirement, which is, "Everything must be encrypted at rest," or, "Everything must be encrypted behind the firewall." We have GDPR requirements that say, "We're signing off on this being true." And then you can implement that yourself.

You can look at the complexity of doing that when you have applications that you don't necessarily have access to add libraries to, or to recompile yourself.

Or you can say, "Wouldn't it be great if the platform did that for me?"

People writing applications obviously don't want to have to deploy things by hand, and that's where a layer like Kubernetes comes in to solve that. The same thing is true of the network: people want to be able to handle some of the traffic management situation.

They want to be able to do A/B rollouts and they want to be able to see which versions work better and plug that into a telemetry system.

So, there are a lot of these problems that people have and ultimately they'll go around and look for a solution and probably their platform team has to implement it, rather than the developers themselves.
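
As a rough illustration of the A/B rollout idea mentioned here, the following Python sketch splits requests between two versions of a service by weight and counts where they land. It is only a simulation of the concept under assumed names (`pick_version`, `simulate`, and the version labels `v1` and `v2` are hypothetical); in a mesh, the platform team would express the split as routing configuration and the proxies would enforce it.

```python
import random


def pick_version(weights, rng=random):
    """Pick a version name with probability proportional to its weight."""
    versions = list(weights)
    return rng.choices(versions, weights=[weights[v] for v in versions], k=1)[0]


def simulate(weights, requests=10_000, seed=0):
    """Route `requests` simulated calls and count how many each version gets."""
    rng = random.Random(seed)  # seeded so repeated runs give the same counts
    counts = {version: 0 for version in weights}
    for _ in range(requests):
        counts[pick_version(weights, rng)] += 1
    return counts


if __name__ == "__main__":
    # Send roughly 10% of traffic to the new version while it is evaluated.
    print(simulate({"v1": 90, "v2": 10}))
```

The per-version counts are the sort of signal that would be fed into a telemetry system to decide whether the new version should receive more traffic.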

Marc: I think that's actually a great point too.

We often think-- A lot of us tend to default to thinking about, first-party software.

"I'm writing software and I'm going to deploy it to the cluster so I can control that entire ecosystem."

But to your point earlier, Google may recompile all third-party software and bring it in house that way, MySQL and libraries and open source tools, but most of us don't.

And we're running third-party software and the ability to inject stuff around regulatory and compliance value into it has to happen at the platform level if we don't want to change the code itself.

Craig: Absolutely.

Marc: Great. So, you mentioned earlier about a bunch of different things that Istio can do.

As you're out there as a developer advocate and helping folks understand it, do you see a common getting started, like, "Here's how to-- "

You can't throw software to solve a set of problems, but where do you see a lot of entry point into folks adopting service meshes?

Craig: It would be really easy if we could say, "This is the one problem that everybody had," and focus our attention behind that particular problem.

But it turned out that of the three main areas that service meshes tend to approach, it was very even split in terms of what people were able to do with it.

We had people who wanted to be able to do load balancing for services running things like gRPC, which at the time at least were not easily supported by cloud load balancers.

We had people who wanted to be able to do distributed tracing between services and who can benefit from the fact that we've got this proxy injected into the request path of every workload and that sends telemetry off to something like Zipkin or --.

And we had people who were segmenting their network and allocating their identity based on where you are, which is the way it's always been done, really.

If you're in this particular subnet, you have access to these particular services.

There are a lot of people, especially in the financial services space who would find that they filled up their subnet and that all of a sudden they couldn't deploy anything, because they have to put it somewhere else.

And so, the service mesh model, we're able to move identity to the workload. It doesn't matter where you are anymore, it matters who you are and what your identity is, and has that been attested by the server?

Then all of a sudden, you can rebuild your network without having to bother the network team.

You can deploy your applications in a way that allows you to grow past those boundaries.
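
The difference between "where you are" and "who you are" can be sketched in a few lines of Python. The subnet, the function names, and the allow-list below are hypothetical values chosen for illustration; the identity string follows the SPIFFE URI shape that Istio uses for workloads (trust domain, namespace, service account), but the specific value is made up.

```python
import ipaddress

# Location-based rule: anything inside this subnet may call the service.
ALLOWED_SUBNET = ipaddress.ip_network("10.0.1.0/24")

# Identity-based rule: only workloads presenting this verified identity may call it.
ALLOWED_IDENTITIES = {"spiffe://cluster.local/ns/payments/sa/checkout"}


def allowed_by_location(source_ip):
    """Old model: access depends on which subnet the caller sits in."""
    return ipaddress.ip_address(source_ip) in ALLOWED_SUBNET


def allowed_by_identity(verified_identity):
    """Mesh model: access depends on the caller's attested identity."""
    return verified_identity in ALLOWED_IDENTITIES


if __name__ == "__main__":
    identity = "spiffe://cluster.local/ns/payments/sa/checkout"
    # The same workload before and after being moved out of a full subnet.
    for ip in ("10.0.1.7", "10.0.2.7"):
        print(ip, allowed_by_location(ip), allowed_by_identity(identity))
```

Moving the workload to a new address breaks the location-based rule but leaves the identity-based rule untouched, which is why the network can be rearranged without renegotiating access.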

And those are just a few examples of the early use cases we saw for Istio.

And it's really a case that people would come along with a problem.

They'd look at it and say, "Hey, we can solve this problem."

We would recommend that they came and solved exactly the problem that they had.

They may not necessarily need to deploy all the rest of it.

If you were only doing load balancing for gRPC, for example, you might not even need the sidecar.

People were just using this as the way to get traffic into their cluster.

And then later on, as they find, "Right, I'm comfortable with this. I understand what it means to operationalize this and to run it in production," then they can look at extending functionality and running other patterns in the cluster.

Marc: Okay. So, let's dive in the weeds for a little bit with some of the things that Istio can do.

So, the sidecar, if I have all of these microservices and I'm not encrypting all traffic inside the cluster-

Craig: Shame on you.

Marc: Yeah. Can Istio help me solve that problem at the platform layer now?

Craig: Yeah, absolutely. So, it's very easy to enable mutual TLS by default.

Each service that you deploy gets given an identity, which is based on what the application is and the service account that's running it.

And then, you get a certificate assigned to you. And so, all the communication between you and any other service is encrypted with those certificates and you can verify both ends.

So, that's the concept of mutual TLS. Because this is managed by the sidecar and the sidecar is the Envoy proxy, which we can talk a little bit about why we selected that, but you have interoperability not just with the things inside that cluster, but anything else using the same identity systems.

So, the SPIFFE service ID, you can interact with other services that use that.

You can do things like decode JSON Web Tokens or JWTs and you can validate identity from services outside the mesh.

And you get all this with the knowledge that you don't have to worry about rolling your own crypto.
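
For a sense of what "verify both ends" means in practice, here is a minimal sketch using Python's standard `ssl` module. The function names and the certificate file parameters are hypothetical, and this is not how Istio is configured: in the mesh, the equivalent settings and the certificates themselves are managed by the sidecar, and peers are checked against their workload identity rather than a hostname.

```python
import ssl


def server_context(cert_file=None, key_file=None, ca_file=None):
    """Server side of mutual TLS: present a certificate and demand one back."""
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    ctx.verify_mode = ssl.CERT_REQUIRED  # reject clients without a valid certificate
    if cert_file:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    if ca_file:
        ctx.load_verify_locations(cafile=ca_file)
    return ctx


def client_context(cert_file=None, key_file=None, ca_file=None):
    """Client side of mutual TLS: verify the server and present our own certificate."""
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH)
    if cert_file:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    if ca_file:
        ctx.load_verify_locations(cafile=ca_file)
    return ctx


if __name__ == "__main__":
    # No certificate files are loaded here; this only shows the verification settings.
    print(server_context().verify_mode, client_context().verify_mode)
```

Every application that handles this itself has to get these settings, plus certificate issuance and rotation, right in its own language, which is the burden the single proxy implementation removes.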

If you have to think about deploying and dealing with encryption yourself, you're almost certainly going to make some kind of mistake.

And so, this is a case, obviously. People embed libraries.

It's always challenging to hear of OpenSSL vulnerabilities.

I think there's one that's coming out roundabout this time, where we don't know what it is, but something will happen.

Marc: There's rumors.

Craig: And it's important to have something where you trust the implementation.

And having just one implementation rather than, again, five or six different variants of this, if you have to have a different library depending on what language you're using, is very powerful.

Marc: Right. And you haven't used the term, "Zero trust," yet.

But, removing access based on either physical or logical network, is what you were talking about, and a strong identity of that service, and the pod, and the runtime gives you so much more capability around scaling the services, and revoking access, and just auditing, and knowing what's going on, and not havin--

And just gives you a way better security footprint too.

Craig: Yeah, absolutely. So, I'm sitting here with a Google laptop in front of me.

It has a certificate that's issued to me. I have a little security key that I can touch, and the password, and so on.

And with this laptop and those things, I can basically be on any network in the world and get access to the systems that I need to.

There's nothing special about being in a Google office, when that was a thing that people could do.

And that's the same idea that we bring to the zero trust security model, as you mentioned, inside of Istio, is that the workload that you have should matter.

You should be able to theoretically run it in a hostile network environment where people are snooping all of your traffic, and it shouldn't matter that someone's doing that, because you're encrypting everything yourself and you're handling all the authorization and authentication.

Marc: That's great. So, when you talked about mutual TLS for that, let's call it, east-west traffic, where you'll have a pod talking to another pod in the cluster.

What about where I have multiple clusters or I'm actually even expanding to multi-region or even multi-cloud, so I want traffic to egress and ingress, or otherwise talk from Google Cloud over to a different cloud provider, to AWS or to Azure.

Can a service mesh span that? Can Istio span those multiple clouds?

Craig: Yes, of course. We have the concept of a gateway.

And I like to think of a gateway as being a sidecar for the internet.

And so, you have all of the traffic that comes in from outside goes through the gateway.

It can have the decryption of the TLS from outside and then the identity put in for services that are connecting internally.

We have a bunch of different patterns that you can use for doing multi-cluster service mesh.

You can have a single mesh that spans multiple clusters where you have the control planes in there sending their service information back to the same conceptual mesh.

You can have a mesh per cluster and then you can use those gateways effectively as a zero configuration VPN between them and you can route the traffic through those.

It tunnels them to the other clusters.

Istio doesn't really mind how the traffic works, as long as all the pods can reach each other by IP address, it's able to do everything on top of that.

Marc: Yeah. And you mentioned even virtual machines that exist outside of the cluster can join.

Craig: Yeah, we've done a lot of work on that in the last year.

A lot of the way that Istio works has been based on the idea that the service discovery system finds all of the services and then when you try and communicate with one of them, it interjects, captures that traffic and routes it through the mesh.

To be able to add VMs to this, you need a way of being able to run the sidecar on the VM, which is generally easy enough. It's just install something like a Debian package.

And then you need to register that so that it gets an endpoint on your mesh.

In the last year, we basically got that down to a single command that you run.

We registered the workload and then it's now available to access, as if it was running inside your Kubernetes environment.

Marc: That's pretty neat.

So, I've been around the tech ecosystem a little longer than I care to admit right now, but one of the things that sometimes helps Istio service meshes is this term.

And trying to wrap my head around it, one of the things that helps me sometimes is thinking, "What are alternatives to using Istio or service meshes in general?"

What do you see people doing who-- Are they all building it in the application layer, or are there other alternatives that you see?

Craig: Yeah. If you control all the applications yourself, then you might not need a service mesh.

Or if, for example, you are the developer and the platform team, because you're a one person shop, then you can bring that down to the application layer.

Netflix addressed this very early on with some of their open source tools, because everything that they ran was on the JVM.

So, it was practical for them to write libraries, which would run inside their environment.

They had heuristics, which they could use for doing traffic routing.

And again, that's fantastic if you are all of the endpoints.

But the moment that you want to introduce something that runs in a different environment or that you bring off the shelf from somewhere else, now you have to find a way to interact with that.

And in the case, for example, of gRPC, we thought, "All right. This is going to be a problem people have and we need to build libraries for many different languages."

But we find it to be a lot easier the way that the service mesh addresses it and the way we do it in Istio, which is just capturing traffic at the TCP or the UDP level on the network.

Marc: Got it. Yeah.

And cloud provider load balancers are often now able to route gRPC traffic, but that doesn't really take away the value of Istio.

Running it all internally is still-- I mean, A, it's more portable.

And B, it's a lot more powerful, especially in a lot of different workloads where you actually just want to kick that traffic right inside the cluster.

Craig: Yeah. There are a couple of different things here which are interesting to think about.

The first thing is that the new internal Google load balancers for layer seven stuff are actually built on top of Envoy.

So, we're using this open source proxy that we make a very large contribution to, and we're able to build out things like Google Traffic Director on top of this and get a lot of the same power that's being provided by the Istio control plane when configuring Envoy.

And in terms of that traffic, we've mentioned gRPC a couple of times here. gRPC is able to do some of the things that you can do with Envoy.

You can give it a list of endpoints and say, "Please do load balancing between these endpoints."

So, you just need to have a way to configure it.

And there's work being done on making gRPC endpoints able to be participants in the mesh, so you can use the Envoy APIs, which are collectively called the xDS APIs, to configure gRPC endpoints.

So, that's something that we've built out for our cloud product and you'll probably see that landing in Istio sometime soon too.
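As a rough illustration of the client-side load balancing Craig describes, the sketch below rotates requests across a list of endpoints and lets that list be replaced, much as a control plane might push an update. It is a minimal Python sketch, not how gRPC or Envoy actually implement it; the `RoundRobinBalancer` class, its methods, and the addresses are hypothetical names chosen for this example.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hypothetical client-side balancer that rotates through endpoints."""

    def __init__(self, endpoints):
        self.update(endpoints)

    def update(self, endpoints):
        """Replace the endpoint list, e.g. after a control plane pushes new config."""
        if not endpoints:
            raise ValueError("at least one endpoint is required")
        self._endpoints = list(endpoints)
        self._rotation = cycle(self._endpoints)

    def pick(self):
        """Return the next endpoint in round-robin order."""
        return next(self._rotation)


# Example addresses are made up for illustration.
balancer = RoundRobinBalancer(["10.0.0.1:50051", "10.0.0.2:50051"])
picks = [balancer.pick() for _ in range(4)]
```

The point of the sketch is that the balancing logic itself is simple; what the client needs is a way to be handed the endpoint list and to receive updates to it, which is the role the xDS APIs play in the discussion above.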

Marc: That's really cool. It actually kind of allows you to collapse those layers a little bit when you need to.

You mentioned earlier, Istio's built on top of Envoy. You brought up Envoy again.

I'd love to understand more. Why Envoy?

Craig: Yeah. So, at the time, we needed to have some sort of proxy to run next to all these workloads and we went out and had a look at what was available.

And Envoy had been open sourced by Lyft maybe six to 12 months beforehand.

I can't remember exactly what it was, but they'd been running it in production for quite some time.

It was written in modern C++. It had an API which you could use to configure it.

A lot of other proxies at the time, you basically had to stop the thing and reload configuration off the disk in order to make changes to it.

But Envoy was very modern in that sense. It was able to fit into this dynamically reconfigurable cloud native environment.

And it had a team that was really keen to work with us.

And it's really been good, I think, for both sides.

Istio's adoption of Envoy really gave it a shot in the arm, and then Google picked it up for other use cases as well. I think we're probably the number one contributor to Envoy at the moment. But we're seeing a lot of other people say, "All right. This has got some real power behind it now. It's got a vote of confidence from a big team working on this."

And then a lot of other teams have picked it up as well.

There are a lot of people who have built API gateways and edge proxies, and the good thing about standardizing to some degree on Envoy as the data plane in this ecosystem is that you can interoperate.

So, we can have people build things that'll be able to mesh traffic between, for example, an Istio service mesh that's using Envoy as its sidecar and AWS App Mesh, which uses Envoy as well. There are third parties who can build the thing which effectively meshes between those two environments, because it's using Envoy as the data plane in both of them.

Marc: Yeah. It's actually really cool.

There's a whole ecosystem of open source applications and others being written around Envoy.

You see it more and more. I was talking to some folks who were building a web application firewall on top of Envoy.

You mentioned API gateways.

It is actually really cool to think about that interoperability layer you get, it allows you to really shape the network, and really push everything down to the platform, and just write the application that you want to write, and you can take advantage of the whole ecosystem.

Craig: When we started with Istio, we were really treating Envoy like an implementation detail.

It is the Istio proxy. It is the thing that makes Istio work.

Now, we're seeing people-- Like at the recent IstioCon, for example, we had a talk from Atlassian and they have been using Envoy statically.

They've been configuring it themselves and deploying it with all their workloads.

And they were explicitly looking for a dynamic control plane for controlling Envoy, and they're now adopting Istio in order to do that.

So, there are people now who have picked Envoy as a thing and they're now looking for ways to configure that.

So, we're seeing it come from both ends.

Marc: Neat. That's really cool.

So, go back to that example earlier. I'm running Kubernetes and I'm not--

I identify, "Oh, wow. I actually need to be encrypting this traffic and Istio sounds like a very good solution for me."

How do I think about the ecosystem in general, though? There's now--

Several years ago, SMI, the Service Mesh Interface, was announced, which a few different projects adopted.

Microsoft has come out with Open Service Mesh. There's Linkerd. There's Istio.

How do you think about that ecosystem and where Istio is the strongest?

Craig: Yeah. The one thing about Istio is that it's really been adopted by a large ecosystem of people.

There are a handful of service mesh players now, and most of them have a single company behind them.

Most of them have maintainers only employed by that company.

We've got people working on Istio from dozens of companies.

And more importantly, for end users, you can actually get Istio as a service, not just from Google, but from Red Hat and IBM.

And VMware's Tanzu service mesh is based on Istio.

And you've got Huawei and AliCloud in China who will give you managed Istio services.

And then you've got a bunch of startups who are building things: Tetrate Service Bridge, Gloo Mesh, and Aspen Mesh, for example.

They're all platforms that you can get that do that higher level layer of application network management.

But they all use Istio underneath as the layer and they're all great contributors to this.

So, we're seeing the ecosystem standardizing around this to some degree and that's fantastic.

That's exactly what we set out to do when we open sourced the project.

Marc: So, has the introduction of the Service Mesh Interface changed the landscape or changed the roadmap at all for Istio?

Craig: No, I don't think so. I don't find a lot of people--

When I have conversations with users or customers, it never comes up at all.

There's not a lot of chatter about that outside the space of people who are trying to build service meshes.

I find that the concept of lowest common denominator APIs gets tricky very quickly.

What I mean by that is that, for example, when cloud started as a concept, there were things like RightScale where you could say, "Hey, here's an API and call it to get an instance."

And if it was configured to use Google, you might get a Google instance.

If it was configured to use Amazon, you'd get an Amazon instance.

But those APIs can, of course, only support the things that are common across all of the people that they're configuring.

And that's the same thing that we had with Kubernetes Ingress: the idea of being able to say, "Here is a path. Here is where traffic should go."

That is great in terms of lowest common denominator, but then all of a sudden, all of the things that your load balancer can do that other people's can't, there's no way of putting that in the API that applies to everybody.

So, it very quickly became a mess of annotations. "If you're using Google, do this. Here are a bunch of annotations to configure all the special things that Google can do. And if you're using NGINX and running it inside the cluster, here's a bunch of different annotations that you can use."

So, we don't see a lot of customer demand for a lowest common denominator API for service mesh.

And what we are trying to do with Istio is we're trying to support APIs in Kubernetes where they make sense.

So, for example, the Kubernetes multi-cluster SIG has been working on an API for multi-cluster services or defining a service that exists in multiple different clusters, but it is the same thing because it has the same name.

And that's something that we have support built for in Istio because that's something that's part of the underlying platform.

Marc: Got it. So, it sounds like the interoperability parts of it are really good, but saying, "Hey, we can't add this functionality in because it's not defined in the interface," that's not something that's super interesting to anybody that you're seeing.

Craig: Yeah, no customer comes to us and says, "Hey, Istio needs to implement Service Mesh Interface support."

The SMI team, they contracted someone to build that because they wanted to say that they're Istio-compatible, but it's just--

Of all the things people come and talk to us about, that's not something that is high on the list.

Marc: Switching off of the technical details for a minute here, normally, on this podcast we're talking to maintainers of CNCF projects.

Sandbox, incubating, graduated, kind of going through that. And Kubernetes, obviously, is in the CNCF as a graduated project.

Istio's not, though. Istio is a part of the Open Usage Commons, which is a new foundation.

Were you involved in that and can you talk a little bit about the decision at Google to donate Istio to a different foundation instead of the CNCF?

Craig: Yeah. So, there's a common misconception in the way that you set that question out, I feel, which is that Istio isn't in the Open Usage Commons, because that's not really a thing that can be.

If we think about what an open source project is, it is predominantly the code, and copyright of the code belongs to the people who wrote it, and then there is generally some form of assignment, which allows the people who administer the project to be able to distribute it, and so on.

And then you have the idea of the name of the project.

And trademark law generally says that the people who create the project own the name.

And when you give a project to a foundation, for example, the CNCF, you will transfer the trademark and things like websites and so on over to them, but you don't need to do anything with the copyright of the code, because that's not a thing that exists.

And that is most of what an open source project actually is.

So, there is a lot of understanding in the ecosystem on how to deal with copyright of code in open source.

There is a lot of legal precedent there.

But the idea of trademarks when we're dealing in this open way, where we want someone to be able to certify, "This is what it means to be compatible with a particular thing, but we want it to be open to everyone who's participating in it."

That was less well defined and that's something that Google's been looking at for a while.

The creation of the Open Usage Commons was around that idea of, "We need to figure out the pattern of what to do with trademarks in open source."

There are three projects that were put in there to start with, just because we had to start somewhere.

Istio is one of those. But Istio wasn't donated to it in any sense.

Istio has always been owned by the community of people who maintain it, and contribute to it, and that hasn't changed.

Google actually creates five to eight open source projects every day.

We have over 13,000 open source projects that we've created.

3,000 of them are 30-day active at the moment.

And I think under 20 of them are in a foundation of any sort, so it's definitely the exception, rather than the rule.

Marc: Got it. Yeah, thanks for clearing that up.

Five to eight open source projects a day is a massive velocity.

Has putting it in the Open Usage Commons, has that actually changed any way that I can or can't use Istio from before that decision was made?

Craig: Not at all.

The only people that it really matters to are people who want to operate some kind of Istio thing and they want clarity on what they can and can't do with trademark.

Previously, they would've gone to a Google lawyer for that who would look at the open source guidelines and say, "This is fair and this is not."

Now they will go to an Open Usage Commons lawyer.

But for people who are using Istio day-to-day, it makes absolutely no difference whatsoever.

In terms of the most common thing people think about in the CNCF, it's really about marketing.

I talk to a lot of people who have projects who say, for example, "I put it in the CNCF because I want people to learn about it. I want people to know about it."

I talk to people outside the US and they think that they'll get more visibility of their project if it becomes part of the giant marketing roadshows of KubeCon and that kind of thing.

We didn't really need that with Istio. Istio was a well established thing.

It's got a lot of people using it already. It's got a lot of people talking about it.

We looked at the ecosystem and figured, "Right. This is something that we're very happy with the way things are going."

And the only thing that we really needed to address was for some of the vendors who were working on the project, that confidence that the trademark was free, and fair, and open for everyone to use.

Marc: I think that's a really good point.

I mean, outside of the legal implementation details about the trademark, one of the big benefits of the CNCF sandbox is eyes, right?

That marketing-- They've changed a little bit about how it works.

You're not guaranteed speaking slots at KubeCon anymore, but still, putting your project into the CNCF sandbox gets a lot of eyes on it.

But to your point, Istio already had a little bit of recognition and it's known in the ecosystem, so that wasn't a big motivator for what to do with it.

One of the things, when we're talking to different people on the day-to-day and we're talking about service meshes--

And this is a question really both about Istio and service meshes in general for you.

Service mesh and Istio have a little bit of a reputation of being heavy, complicated, a little bit difficult to implement. Is this right?

And if somebody has this as a preconceived idea, what would you tell them to steer them to giving it a try or to correcting it if it's not the right idea?

Craig: I think you can take that sentence and sub in Kubernetes and even now or five years ago, pretty much it'd be correct and especially in terms of perception.

I think one thing that we found with Kubernetes early on was that we'd kind of created a moat for someone else to come in and say, "Here's an easier thing."

This was perhaps because Google had GKE and a lot of the engineers working on Kubernetes in the early days were Googlers.

And so, we said, "Of course it's easy to install. You just go to the Google Cloud console and hit a button."

And then for people running on other environments who didn't have that button, they'd say, "Oh, this is hard."

And then Docker, for example, came out and said, "Look how hard Kubernetes is to run. Here's Docker Swarm. It's a lightweight container orchestrator and look how small and easy it is."

And it demoed really well on stage. It was great for day one.

And really, what it did was sort of encourage Kubernetes to speed up work on the kubeadm project and make it easier to get that same output in Kubernetes.

And if that works for you and your business model, more power to you.

But ultimately, we find that we have a little bit of a crystal ball here and we know the problems that people are going to get at scale.

And one of the reasons we saw Kubernetes succeed, even though it had all of these difficult primitives in it, all these things like taints and tolerations, and disruption budgets, and so on, is first of all that you don't need to worry about any of them if you don't have that problem.

And secondly, when you do have that problem later on, and we know you're going to have that problem because we've got this little crystal ball that says, "We've been running services on this pattern and we know what you're going to have to deal with," then you can deal with that.

And so, it's powerful, it's useful, and you don't have to worry about it if you don't want to.

Ultimately, in terms of Istio, we've done that work on making it easier to install. It's just a single command line, istioctl install, and it runs.

We've done a lot of work on making it easy to upgrade and do canary deployments of new versions of the control plane so you can roll off traffic from one version to another and test that your upgrade's going to work.

There are a lot of people out there who have seen what happened in that Kubernetes space and they're trying to say, "All right. Well hey, we're going to say this about service mesh because it makes sense to try and promote our product."

But ultimately, if you had that experience with Istio in the past, I definitely encourage you to go back and take another look, because we've done a lot of work.

Marc: And I think that's a super fair point too.

There are still-- A lot of people talk about Kubernetes in that same light.

And it's not binary; you don't have to adopt all of the functionality.

Just like you might be just running some pods in Kubernetes or a deployment and that's it and you're still running Google Cloud SQL or RDS for all your stateful services.

You're still getting a ton of value out of Kubernetes that way, even though you haven't adopted the entire ecosystem and everything.

And the same is true for Istio. There's value in just that mutual TLS, if that's what you want to start with, or some other low hanging fruit.

And then, the rest of it's just there for you, if you choose to enable it.

Craig: Yeah. When we launched Kubernetes, we really thought that the thing people would latch onto was the cost-saving, was being able to bin pack workloads, was to be able to do more with fewer machines.

That was the super secret thing that worked for us with Borg at Google, and that we wanted to make available externally.

But we found that people were actually quite behind in terms of tooling.

And the idea of them just having a consistent API to let them deploy small amounts of software, even if it was just one container per machine, that was revolutionary to a lot of people.

And then that was the thing that drove Kubernetes adoption.

And later on, once people have got their pipelines for CI worked out and so on, and they're able to trust that the system works, then they can think about bin packing, and then they can think about applying auto-scalers, and taints and tolerations, and all the more complex things.

But ultimately, you probably have a problem that works for you and that might be, "I need to do some load balancing, or maybe I need to secure my traffic."

And it's very easy just to enable the thing that matters to you and ignore all the rest of the features.

They're behind a curtain. Open that curtain only when you need it.
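To make the bin packing idea above concrete, here is a small Python sketch of one classic heuristic, first-fit decreasing, which places the largest workloads first and opens a new machine only when nothing fits. This is an illustration only and not how the Kubernetes scheduler or Borg actually places work; the workload names, CPU figures (in millicores), and the function name are all assumptions made up for this example.

```python
def first_fit_decreasing(workloads, machine_capacity):
    """Pack workloads onto as few machines as this simple heuristic manages.

    workloads: dict of name -> CPU request in millicores (hypothetical values).
    machine_capacity: CPU available per machine in millicores.
    """
    machines = []
    # Place the largest workloads first; ties keep their original order.
    ordered = sorted(workloads.items(), key=lambda item: item[1], reverse=True)
    for name, cpu in ordered:
        if cpu > machine_capacity:
            raise ValueError(f"{name} does not fit on any machine")
        for machine in machines:
            if machine["free"] >= cpu:
                machine["free"] -= cpu
                machine["workloads"].append(name)
                break
        else:
            # No existing machine had room, so start a new one.
            machines.append({"free": machine_capacity - cpu, "workloads": [name]})
    return machines


# One container per machine would need four machines for these workloads.
requests = {"web": 500, "api": 500, "cache": 250, "batch": 750}
machines = first_fit_decreasing(requests, machine_capacity=1000)
```

With these made-up numbers the four workloads fit on two machines instead of four, which is the "do more with fewer machines" saving described above.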

Marc: Yeah, that's great.

I mean, it's funny to think about bin packing as the value prop of it, but that's because you're at Google where you're used to Borg.

And the folks at Twitter have Mesos and Facebook have Tupperware and all these large companies have this technology and they've moved so far past--

It's a common API to deploy infrastructure, but startups and the rest of the world, it's like, "Hey, that's actually super valuable. That's the piece we want."

Craig: Yeah. And credit to Docker for-- Credit to dotCloud, I should say, for the idea that this is a thing that people would need, and then building technology around that.

And we adopted Docker for Kubernetes because we saw the uptake of that in the community, that it was a more convenient way of using these kernel primitives that we built, and developers were there.

And so, we want to meet developers where they are and give them a way to access that power in a nice, easy package.

Marc: Yeah. I mean, Docker definitely has the great developer experience there.

And I remember years ago there was conversation about, "Oh, is there just going to be Docker on the desktop, but you're going to be running a different runtime in production?"

But honestly, it doesn't really matter. That's not the value of really, "Oh, I need to run Docker in production."

It's, "I just want to make it easy to containerize my workloads and put them in," wherever those are, throw them in an OCI registry.

And then if it's gVisor, containerd or who knows what it is that's running in production?

There's an interface around that and I don't care anymore.

Craig: Yeah. If you can pass your tests, and if your code runs, and you get the output that you want, then that's great.

And I don't know if Docker on the desktop is like the year of Linux on the desktop, but you never know.

Marc: So, do you see projects that are building in the Istio ecosystem?

We talked a little bit about Envoy creating this interoperability layer in projects that can adopt and communicate with other Envoy projects deployed.

What about at the Istio layer? Do you see-- Knative uses Istio, right?

Help me understand a little bit about how Istio becomes that interoperability layer.

Craig: Yeah. So, if you think about Knative, again, in terms of the layer cake we've talked about.

Knative wants to address developers.

They want it to be, "Write code, and give it to us, and magic happens."

To serve that code, there needs to be a network layer.

And then to run that code, there needs to be a container orchestration layer.

So, we launched Knative and Istio was the supported layer for doing that.

But it doesn't actually use a large percentage of the things Istio can do.

Predominantly what Knative needs from its network is incoming requests from the internet get routed to a particular service.

And you might want to say, "We're rolling out our new versions, so 1% of the traffic goes to the new version and 99 to the old version and so on."

So, that's really a north-south use case. There's very little ...

If you think about App Engine or Heroku or those kind of internet deployed services, there's very little case where a service is going to talk to another service inside that environment without going out and connecting to the external API again.
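The 1%/99% rollout split mentioned above can be sketched as a weighted choice per incoming request. This is a minimal Python illustration under assumed names (`pick_version`, the `old` and `new` labels), not how Istio, Envoy, or Knative implement traffic splitting.

```python
import random


def pick_version(weights, rng):
    """Pick a version name with probability proportional to its weight."""
    if not weights:
        raise ValueError("at least one version is required")
    total = sum(weights.values())
    threshold = rng.random() * total
    running = 0.0
    for version, weight in weights.items():
        running += weight
        if threshold < running:
            return version
    return version  # guard against floating-point edge cases


# Hypothetical rollout: 99% of requests to the old version, 1% to the new one.
weights = {"old": 99, "new": 1}
rng = random.Random(0)  # seeded so the run is repeatable
counts = {"old": 0, "new": 0}
for _ in range(10_000):
    counts[pick_version(weights, rng)] += 1
```

Over many requests the share reaching `new` settles near 1%, and shifting the rollout forward is just a matter of changing the weights.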

So, there were people who saw this and thought, "All right. Hey, maybe I will add support for my own thing in Knative."

And so, Knative now supports five or six different network layers.

Again, a lot of them are Envoy powered.

Things like Ambassador and Gloo, for example.

And you can plug in what makes sense for you to provide that, to provide the traffic routing for your serving workloads.

If you're running Istio already, that's fantastic. If you have a workload--

Not all the workloads that you run in an environment are going to be stateless workloads.

Not all of them are going to suit Knative.

And so, you're probably going to want to run something else alongside it.

And so, if you've already got Istio there, then you can just put Knative serving on top of that.

There are a lot of projects that are looking at extending Istio and providing observability to it.

The Kiali project is obviously a great way of looking at observability.

But a lot of vendors who are adopting Istio for their own mesh products are building some of their own stuff on top of that too.

We see people who build deployment tools. Things like Flagger from Weaveworks are often called out for this as a way of doing roll-outs.

They support Istio and they support other service meshes, I believe, through the SMI.

So, there are a lot of different approaches to this. It doesn't necessarily have to be Istio-specific.

Ultimately, we just want to be an implementation for providing good networking.

And if you can configure those APIs, again, at the Kubernetes level and then Istio just knows what to do, then we're meeting our goals.

Marc: Yeah. I mean, I think that's the point. Again, here, it's API-driven and interoperability really are the power features of the platform right now.

Craig: Yes.

Marc: And so, looking forward for what's coming for Istio, what's on the roadmap that excites you in both the short-term or even the longer term?

Directionally, where are we going to see Istio go?

Craig: I'd like to tell a story about the difference between the way different companies look at things.

And I tell it in the context of the layer seven load balancer at Google.

If you think about a traditional vendor, they will build an MVP of something and it'll do just enough.

And then they'll have version two of the thing, which will add a different feature.

And then version three and so on. And by the time you get six or seven years down the track, you've got something that's quite full featured.

With Google and the fact that we were largely externalizing something that we had gone through that process with internally, we were able to come out of the gate with the Google layer seven load balancer and it had that seven years of effort behind it already.

There wasn't any other cloud load balancer that you could address with a single IP address worldwide.

It handled traffic routing globally and all these things that really were ahead of the time, in the sense that you shouldn't launch a V1 with all of that stuff in it.

Where's the MVP? And that's a benefit of having this blueprint, this crystal ball if you want, looking at what things people are going to need.

And in large part, that's how Istio's been as well.

We launched 0.1 and it had all of the features that it was really going to have today.

Not necessarily as stable or as well thought out. There's been a lot of work over time.

But, the basic idea was there. It's not like, "Here's a kernel of an idea and we'll add things on it over time."

The things that were-- We now think of as the modern definition of a service mesh, were in Istio right from the very beginning.

So since then, it's really been a sort of stabilization process.

It's learning from customers, and figuring out what they want, and improving things along the way.

There are a couple, again, of evolutionary things that we thought would maybe work for the way that things worked at Google, and we've changed as we've heard feedback.

For example, we had effectively a set of microservices that ran the Istio control plane, because the people building things at Google were largely the people who were operating things like GKE, and that made sense for us and the tooling that we had.

But for most people in the ecosystem, we found that the right way to deal with that would be to size them as a single service.

And some people think of this as rebuilding a monolith, but it's really just right-sizing something and bringing it to a thing that makes sense for who's going to operate it.

So, while there's been a little bit of change about the way we've packaged things, really, the things that we're dealing with are stability.

We're dealing with support for external APIs. Again, I mentioned the multi-cluster services.

The new gateway APIs that are coming in Kubernetes, we're building support in for that.

We're looking at how we can support day two tasks, like upgrades.

And we're looking at how people in the community want to install Istio.

We've done a lot of work on operator installations last year.

We had deprecated support for Helm, but it turned out there was a lot of demand for that, so we brought back support for Helm V3.

We had a roadmap presentation from some of our TOC members at the last IstioCon and I love the subtitle of that.

They said, "It was a heartwarming work of staggering predictability."

I think that's really what we're aiming for in 2021: really just taking features that have sat in beta for a little bit too long, as things in the Kubernetes ecosystem are wont to do, and bringing them to stable.

So, giving people the certainty that we thought this through, we definitely know that the thing is 95% right but we've just got a couple of rough edges to file off.

Just finally filing off those rough edges and getting things right where they need to be.

Marc: Maturing the project, that's actually a good place to be. It's hard, though, for sure.

Craig: Yes. Some people talk about the fact that the last 5% of the effort takes 95% of the time.

Marc: Exactly, yeah. That's exactly it.

Craig, thanks for joining today and sharing all this info on Istio.

I know I learned a lot about the project. Do you have any final words to share?

Craig: If you're listening to this show, you're probably the kind of person who likes listening to tech podcasts and who quite possibly likes the sound of the New Zealand accent.

So, please do check out kubernetespodcast.com where you'll hear this kind of thing every week.

Marc: Awesome. Thanks, Craig.

Craig: Thanks, Marc.
