NOV 11, 2020

57 MIN

Ep. #6, Linkerd with William Morgan of Buoyant

Guests: William Morgan

about the episode

In episode 6 of The Kubelist Podcast, Marc speaks with William Morgan of Buoyant. They discuss the complex service mesh ecosystem, as well as the origin and roadmap of Linkerd.

about the guests

William Morgan is the CEO and Co-founder of Buoyant, the makers of Linkerd. He was previously an infrastructure engineer at Twitter, where he ran several teams building product-facing backend infrastructure. He has worked at Powerset, Microsoft, adap.tv, and MITRE Corp, and has been contributing to open source for over 20 years.

show notes

transcript

Marc Campbell: Hi, I'm here today with William Morgan, the founder and CEO of Buoyant, the company behind Linkerd. Welcome William.

William Morgan: Thanks, Marc. Really nice to be here.

Marc: So let's just dive right in. So Linkerd is a service mesh.

I think a good place to start out is just, can you explain what the functionality of a service mesh is?

What is the service mesh supposed to do for us?

William: Yeah, that's a good question because there's so much buzz about the service mesh but often, it's hard to pick apart what it actually does.

In part, that's because it doesn't actually do anything new.

So the way I like to think about it is the service mesh has these kind of three buckets of features that it gives you: it gives you a set of features around reliability, it gives you a set of features around observability, and it gives you a set of features around security, right?

And it's designed to give you those, you know, in a world primarily where you're running on something like Kubernetes and you're building microservices, right?

And the thing that the service mesh does, like we've had those features forever, you know, retries and timeouts and like TLS and whatever.

We've always had those features. What the service mesh does, kind of the magic, is that it gives you those features at the platform level.

So rather than having the application have to implement them, you shift them down to kind of like the underlying platform.

Your Kubernetes platform, you know, is doing a bunch of these features for you.
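To make the retries-and-timeouts point concrete, here is a minimal Python sketch of the kind of logic each application would otherwise have to carry itself, and that a service mesh proxy takes over at the platform level. It is not Linkerd's implementation. `TransientError`, `call_with_retries`, `make_flaky`, and all parameter names are hypothetical, chosen only for illustration, and the sketch assumes the wrapped function enforces the timeout it is handed.

```python
import time


class TransientError(Exception):
    """Hypothetical error type standing in for a failed network call."""


def call_with_retries(request_fn, max_attempts=3, timeout_s=1.0, backoff_s=0.0):
    """Call request_fn(timeout_s), retrying when it raises TransientError.

    All names here are illustrative. request_fn is assumed to enforce
    the timeout it receives; this wrapper only decides whether to retry.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return request_fn(timeout_s)
        except TransientError as err:
            last_error = err
            if attempt < max_attempts:
                # Linear backoff between attempts (zero by default).
                time.sleep(backoff_s * attempt)
    # Every attempt failed: surface the last error to the caller.
    raise last_error


def make_flaky(failures):
    """Build a fake service call that fails `failures` times, then succeeds."""
    state = {"calls": 0}

    def flaky(timeout_s):
        state["calls"] += 1
        if state["calls"] <= failures:
            raise TransientError("simulated network failure")
        return "ok"

    return flaky, state


if __name__ == "__main__":
    flaky, state = make_flaky(failures=2)
    print(call_with_retries(flaky), "after", state["calls"], "calls")
```

With a mesh in place, the application makes the plain call and the sidecar proxy applies this kind of policy uniformly for every service.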

Marc: Yeah, that's great. I think that's one of the main draws of Kubernetes, in general, you know, we've had like all of that functionality.

So I'd like to hear a little bit about like the background, the origin story of Linkerd.

You were at Twitter in the early days before starting Buoyant.

Like, can you kind of walk us through the origin of where Linkerd came from?

William: Yeah, so both my co-founder, Oliver Gould, and I were engineers at Twitter.

And Twitter at that time was going through this pretty massive transformation internally, from this monolithic Ruby on Rails app into this big microservices, you know, kind of thing.

And what's amazing is that that transformation actually worked, right, which often these massive rewrites don't actually work in a company, right?

This is like an initiative. It's like this big initiative. But it worked and it worked really, really well.

And as part of that, we had to figure out a lot of stuff, you know.

It's like, I don't think we really knew what we were doing.

In fact, we didn't really have the word microservices.

We didn't even, that wasn't like a thing. And this was, you know, 2010 to 2015.

Anyways, so it was like kind of pre-Docker or maybe Docker was around but we certainly weren't aware of it.

But as part of that transformation, we had to kind of figure out this new thing that was happening in our application, which was the service-to-service communication.

In the monolithic world we just had this big old blob of code, and you'd make function calls. Now suddenly we had replaced those function calls with network calls.

So service A would call service B. A network call is very different from a function call.

Like, a function call basically always works and it's extremely fast, whereas a network call often doesn't work and it's often very, very slow.

So in order to handle that, Twitter had to invent a bunch of infrastructure, you know, basically to make that a manageable thing.

One of those layers was this project called Finagle. It was this open source project.

And you, as the application developer, you'd say, well, I'm service A, I want to talk to service B, you know, hey, Finagle, make that call and give me the result and you get the result.

But under the hood, Finagle was doing load balancing and it was instrumenting everything and exporting metrics.

And it was, you know, doing all this like fancy request multiplexing and all that stuff.

So the very first version of Linkerd, you know, which was the first service mesh, was literally Finagle in a proxy form, right?

Finagle was this library. We were like, well, no one wants to use the library. Let's turn this into a proxy.

Marc: So, Twitter originally wasn't on Kubernetes, right? They were on Mesos.

So when Finagle was out originally, was that a Mesos project?

And then eventually, you created Linkerd as the Kubernetes implementation of that?

William: Uh, it wasn't really like that. Finagle wasn't really a Mesos project.

I mean, certainly we were starting to deploy things heavily to Mesos at that time, though, we also had a, you know, a whole bunch of the infrastructure was, you know, just running on regular old un-orchestrated machines.

And actually, the first version of Linkerd was pretty agnostic, so we would work with Mesos, we'd work with Kubernetes, we'd work with Consul, we'd work with, you know, a bunch of these different kind of service discovery providers.

ZooKeeper, too, which was what Twitter used heavily for service discovery.

And it was funny, as over time, you know, and Linkerd has evolved quite a bit since those days, we've actually focused a lot more on Kubernetes.

And the modern version of Linkerd is extremely tied to Kubernetes.

And, you know, a lot of that kind of platform flexibility, we've actually shed along the way.

Marc: I'd like to talk about the differences there between Linkerd version one and version two, because my understanding is that Linkerd 1 was a Java application and then for Linkerd 2, you rewrote it in a different runtime.

Can you talk a little bit about like, what drove that and what those changes were?

William: Yeah, that's right. Because that first version, you know, Finagle, that was a Scala library.

All of this Twitter infrastructure was on the JVM, primarily in Scala. There was a little bit of Java here and there.

You know, V1 of Linkerd was okay, we're taking Finagle and we're turning it into like a proxy.

So that meant that Linkerd itself was on the JVM.

And that was kind of okay but, you know, especially as the world kind of moved more into the Kubernetes ecosystem, it became harder and harder to really convince people to use the JVM.

You know, especially by the time you added Scala and Netty and Finagle, and like all these layers, you know, the JVM is very good at scaling up.

You can get these single instances that can handle crazy amounts of throughput. It's not very good at scaling down.

What we wanted to do was we wanted to give you a lot of these proxies, right?

And so we'd end up in this world where, okay, after heroic engineering efforts, the proxy would take 130 MB of memory, right?

Tiny by JVM standards.

But people would be running that alongside their little Go applications in containers on Kubernetes, and those applications were taking 50 MB.

And, you know, it started to get a little silly for us to say, oh, don't worry.

It's like, here's this transparent layer of infrastructure, by the way, it's going to like quadruple your memory usage or whatever.

So for a variety of reasons, in, I think it was 2018, we released the 2.0 version of Linkerd, which was a complete rewrite from the ground up.

You know, it shares the name, shares kind of the features, but almost every other aspect was different.

And, you know, it has a control plane that's written in Go and a data plane that's written in Rust.

And, you know, all of the JVM and kind of the Twitter Stack stuff has been left behind by us.

And that was a pretty momentous transformation for the project.

And that's kind of where all the modern, you know, Linkerd effort is in this new architecture of Go and Rust.

Marc: So now I'm assuming I can run the Linkerd sidecar and it's not going to use 130 MB per pod?

William: Yeah, that's right. That's right.

And that term, sidecar, is a good one to bring up as part of this discussion, because the way that Linkerd works, and the way that many, though not all, service meshes work, is that they add those proxies directly into the Kubernetes pod as a sidecar container.

So your application container's in the pod, the sidecar proxy container's in there, and we do a little fancy wiring so that all TCP traffic to and from the pod goes through that proxy.

Marc: So I think, you know, talking about Linkerd and how it works compared to other ones, I'd like to like dive in a little bit and understand like, what are like some of the differences?

So I think there's a lot of service meshes out there.

Some big ones are Istio, Open Service Mesh, and Linkerd, obviously.

Can you help me understand, like, what's different between those three, or what's the difference between Linkerd and the other two?

William: Yeah, I mean, it's definitely more than another two. There's like another nine or something.

Marc: Right, right.

William: There's a company that has two different service meshes.

Marc: Wow.

William: Yeah. And, you know, it's interesting to think about kind of the reasons why there's been such an explosion there but the short answer is that Linkerd is the best.

You should just use that one.

For us, the way we think about it is, a lot of the service meshes have that same goal, right, which is we want to give you these features at the platform level and the reason why that's so powerful, you know, you can move things around from here to there for any number of reasons.

But the reason why having these features at the platform level is so powerful is because they are platform features, right?

These are things that, it's actually kind of irritating for the developers to have to implement and it's hard to get right.

And they're often features that work best when every component of the application is using them.

So all these kind of factors combined mean that it's actually a lot better for everyone involved, if those features live at the platform layer and then the platform owners can kind of own them and control their own destiny, rather than having to rely on the developers to implement TLS all in the same way.

And, you know, kind of fight with the product managers who don't care about TLS, but they care about these features and so on.

So anyways, every service mesh kind of gives you that same basic set of functionality, but there's a huge difference in kind of the shape of the project.

And for us, for Linkerd, our focus has really been on we want to give you like the minimalist approach and especially, we want to, you know, we want to really understand what the problems are that the service mesh solves, and rather than building like some giant platform that can do anything for all people, we want to give you the smallest, lightest possible solution to those problems, specifically.

And that has resulted in Linkerd actually looking pretty different from many other service mesh projects.

We have, you know, we have a dedicated proxy for one that is really specific to the service mesh use case.

A lot of other service meshes are built on Envoy, which is a good project but it's a generic proxy that has a lot of complexity.

And it takes a lot of memory because it has to do a lot of different things. And the kind of bias in Linkerd towards relying on Kubernetes so heavily means that the configuration surface area is really minimal.

And a lot of stuff, we do, you know, like mutual TLS or whatever, we just turn that on by default.

We don't need you to configure anything.

When you add Linkerd, you know, oh my gosh, you've got mTLS working between pods without you having to do any configuration whatsoever.

That kind of stuff is, you know, kind of the shape of the project as determined by the goals that we put in front of us.

Marc: Okay, and Linkerd implements SMI, right, the Service Mesh Interface?

And that was announced, I think, back in Barcelona, KubeCon, like a year ago.

Can you kind of talk a little bit and help me understand what SMI is and what that means for a service mesh to implement that?

William: Yeah, so, you know, we've been heavily involved with SMI since the very early days.

And in fact, everyone likes to say they've been heavily involved, but if you look on GitHub at the list of contributors, it's Thomas Rampelberg, who's, you know, one of our Linkerd maintainers, who actually has the number one set of commits by count to SMI, which is pretty funny.

But so the goal with SMI, with the Service Mesh Interface, is to provide an interface that is service mesh agnostic that you can build against, so it doesn't matter if the underlying implementation is Linkerd or Istio or something else, you can use this interface in order to accomplish certain things.

Like the example that I always like to use is there's a tool called Flagger which does progressive delivery, basically you know, canary rollouts, where you look at the metrics.

So you look at success rate, you know, you've got your existing version, you've got the new version.

You slowly start shifting traffic onto the new version.

You're looking at the success rate. And if the success rate starts dropping relative to the old version, then like you undo that traffic split.

That's a really cool example of the kind of tool that can sit on top of SMI, because SMI has a metrics API. And SMI has a traffic shifting API.

And by combining those two with this tool, well now, Flagger has the ability to basically do the same kind of behavior across any service mesh that supports those interfaces.

So, that's what SMI is. Linkerd implements parts of SMI.

There's still parts that we're working on. And the spec itself of course, is a changing and evolving thing.

But that's the basic idea. And to me, the goal, like kind of the point of this is for tools like Flagger, it's less of a user facing thing, right?

If you use a service mesh, what do you care, whether it's this one or that one, you know, or whether it implements some API that someone else has decided, right?

But if you're building a tool, well, then it's really useful to have this tool work across all service meshes.
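The canary rollout William describes can be sketched in a few lines of Python: shift traffic to the new version in steps, compare its success rate with the old version's at each step, and undo the split if it drops. This is only an illustration of the idea, not Flagger's actual algorithm or the SMI API. `run_canary`, its parameters, and the two success-rate callables are hypothetical stand-ins for what a metrics API and a traffic-shifting API would provide.

```python
def run_canary(success_rate_old, success_rate_new,
               step=10, max_weight=100, tolerance=0.01):
    """Progressively shift traffic to a new version, rolling back on regression.

    success_rate_old / success_rate_new are callables returning a float in
    [0, 1]; in a real system they would query a metrics API. Returns a tuple
    (outcome, history), where history lists the new version's traffic weight
    (in percent) after each change.
    """
    weight = 0
    history = []
    while weight < max_weight:
        # Shift a bit more traffic onto the new version.
        weight = min(weight + step, max_weight)
        history.append(weight)
        # Compare the new version against the old one at this weight.
        if success_rate_new() < success_rate_old() - tolerance:
            # Regression: undo the traffic split entirely.
            history.append(0)
            return "rolled_back", history
    return "promoted", history


if __name__ == "__main__":
    print(run_canary(lambda: 0.99, lambda: 0.99, step=25))
    print(run_canary(lambda: 0.99, lambda: 0.90, step=25))
```

Because the loop only needs "give me a success rate" and "set a traffic weight", the same logic could run against any mesh that exposes those two operations, which is the point of a shared interface.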

Marc: And that actually brings up another question.

So at the beginning, you mentioned a service mesh provides reliability, observability, and security, and now like the functionality of Flagger with progressive rollout, is that something that's also built right into Linkerd or are you depending on the ecosystem to provide that level of functionality with Linkerd?

William: Yeah, that's a great question. So Linkerd provides the building blocks.

So there, what we provide is golden metrics. So things like success rates and latency distributions and request volumes, and things like that.

We provide you know, those uniformly across every application for every service running on your cluster. So that's super cool, right?

Without you having to change your application, you get all these metrics, and we provide the ability to do traffic splitting, where you can say, okay, N percent of the traffic that's destined for this service, I actually want to send to this other service.

And those are the two building blocks and Flagger composes them into this particular use case around progressive delivery.

There's other things you could do, you know, there's other ways you could combine those building blocks.

So for example, with some of the multi-cluster stuff we do, we have these fail over mechanisms where you can say, okay, I want to shift traffic away from this implementation of the service on this cluster.

And I actually want to go and rely on the same implementation on the other cluster because this cluster is, like, failing or whatever.

So you know, I think this is a really nice way of doing things.

We'll provide those basic building blocks and then you can build application-specific logic on top of them.
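As a rough illustration of the "golden metrics" building block, the Python sketch below computes request volume, success rate, and latency percentiles from a batch of observed requests. It is a simplification under stated assumptions, not how Linkerd collects or stores metrics. `golden_metrics`, the `(succeeded, latency_ms)` record shape, and the nearest-rank percentile method are all choices made here for illustration.

```python
import math


def golden_metrics(requests, window_s):
    """Summarize a window of requests into golden metrics.

    requests: list of (succeeded: bool, latency_ms: float) tuples, a
              hypothetical record shape chosen for this sketch.
    window_s: length of the observation window in seconds.
    """
    total = len(requests)
    if total == 0:
        return {"request_rate": 0.0, "success_rate": None,
                "p50_ms": None, "p99_ms": None}

    successes = sum(1 for ok, _ in requests if ok)
    latencies = sorted(latency for _, latency in requests)

    def percentile(p):
        # Nearest-rank percentile: smallest value with at least p% at or below it.
        rank = max(1, math.ceil(p / 100 * total))
        return latencies[rank - 1]

    return {
        "request_rate": total / window_s,    # requests per second
        "success_rate": successes / total,   # fraction in [0, 1]
        "p50_ms": percentile(50),
        "p99_ms": percentile(99),
    }


if __name__ == "__main__":
    sample = [(True, 10.0), (True, 20.0), (False, 30.0), (True, 40.0)]
    print(golden_metrics(sample, window_s=2.0))
```

A tool that does progressive delivery or failover would consume numbers like these per service and combine them with traffic splitting.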

Marc: If I install Linkerd, is it completely contained inside one Kubernetes cluster?

You were just hinting at the ability to have it across multiple clusters.

Like where would be the use case that that's useful or how are most people running it today?

William: Yeah, so for a variety of reasons, Linkerd's level of granularity is the cluster.

So you install Linkerd on a cluster and you get one Linkerd per cluster.

We, early on, tried having multiple Linkerds like per namespace or whatever, but Kubernetes just makes it really hard to do that.

There's a whole lot of Kubernetes primitives that are, you know, kind of cluster-wide. You know, CRDs are cluster-wide and there are no hierarchical namespaces.

So anyways, yeah, Linkerd's level of granularity is the cluster, but one thing that we are noticing in the Kubernetes world, kind of for those same reasons, is that there's an increasing use of multi-cluster deployments, right?

And by multi-cluster, I don't just mean, like, dev, staging, prod, which almost everyone has, but instead multiple prod clusters, which you do maybe because, you know, you want to have your clusters be really close to your users, so you've got, like, geographic diversity, or maybe you do it for fault tolerance or high availability reasons or disaster recovery or whatever.

And so in that world, being able to provide communication between clusters, actually fits really nicely into Linkerd's, you know, feature set, right?

So Linkerd is mediating the communication that happens between two pods that are run on the same cluster.

Well, we can actually extend those same guarantees around TLS and reliability and so on to cross-cluster communication.

That was a big feature for us in the 2.8 release earlier this year was this kind of transparent multi-cluster communication which is really to say cross-cluster communication.

Marc: That's cool. I want to switch gears for just a minute and kind of come back to the ecosystem and the various service meshes that are out there.

Linkerd is a CNCF incubating project right now.

Open Service Mesh is new from Microsoft.

And that's a CNCF Sandbox project, I believe, with a lot of similarities to Linkerd, but Istio is not in the CNCF.

That's in the Open Usage Commons.

And that's a little bit of a change for the Kubernetes ecosystem to not just have everything in the CNCF.

Have you been happy with the decision to be a part of the CNCF, or like, do you have any thoughts on that decision and that split that Istio has created?

William: Yeah, I think we've been happy with the decision to be part of the CNCF.

Like, that seems very natural for us, right? That's where Kubernetes is.

That's where Prometheus is. That's where, like everything that's in our ecosystem is all hosted in the same organization and Linkerd just fits right in there, you know.

For us, I think I take a very particular lens through these things, which is that what I care about is what makes sense to the user.

And so to a certain extent, like who cares, whether it's in foundation A or in foundation B, like it doesn't really matter, you know, but on the other hand, I don't need to add any more like weirdnesses.

It's like, yeah. So let's have it be part of the CNCF. Why not? It's like it fits right in there.

For Istio to be in OUC, you know, and not just to like be there, but for Google to have created that thing, you know, just for Istio or Istio and, you know, a handful of other projects.

Again, you know, on the one hand I'm like, well, who cares?

You know, from the user perspective, if you can solve a problem with a piece of technology, then like, whatever.

On the other hand, it does feel weird. I don't think it does Istio any favors, you know, whether it actively harms adoption, that's maybe up for discussion.

I don't know. I tend to be a little pragmatic about these things.

So mostly, I'm concerned about, like, can you solve a problem with this tool in as direct and concrete a way as possible?

And if you can, then great, and if not, then, not.

Marc: Cool. Linkerd is currently an incubating project.

Do you have a roadmap or a timeline or anything that you're thinking about to apply for graduation and make it a graduated project?

William: Yeah, I mean, I think we've been ready for graduation for a long time.

It's just, we just haven't gone through the hurdles to make that happen, but I suspect we'll start moving in that direction pretty soon.

I mean, you know, again, like, on the one hand, who cares, you know, it's like it solves these problems.

What's it matter, whether it's at level two or level three of, you know, whatever this group of people over here have decided.

On the other hand, like, hey, why not?

If we're going to be in the CNCF line, then we might as well go the whole way, but certainly there's more than enough adoption and more than enough contribution and kind of ecosystem activity to support graduation.

So it's really just a question of, yeah, going through the process.

Marc: And when you put the time into it.

There's not a certain feature or a certain metric that you're trying to hit still right now or anything like that?

William: No, not really. Not really.

I mean, the graduation as I understand it is intended to be a measure of, you know, is this a mature project?

Is this something that you can adopt and kind of rely on, okay, it's going to be there.

It's going to be around. There's like an ecosystem of activity around it. It's not going to just die.

And, you know, wither on the vine. And Linkerd, you know, for years now has met all of those criteria.

Marc: So a lot of different projects in the CNCF are projects inside larger organizations, but Linkerd is like I mean, that's what Buoyant was created in order to make.

And so it's all open source though and you're giving it away for free.

You've even given the project to the CNCF. How does Buoyant make money?

William: Oh, gosh. You know, deep inside Linkerd, there's like a little section of the code where if the request that you are proxying contains like a dollar amount or a monetary amount, we just subtract, like, 0.01 from that.

And we send it to Buoyant's bank account.

Marc: I've seen that movie.

William: So that is a good question.

And obviously, it's one that every one of our many investors has wanted a really clear answer for, because we're not just building an open source project but we are building a business around that.

So, you know, at Buoyant, we're taking a particular approach. I've never really liked the open core model, which is kind of like the traditional, I guess, way of making money from open source, which is that you have an open source thing, you know, and then you have like the enterprise version of that.

And the enterprise version has like these certain features that you need.

And like, you know, I think there's ways you can do this that are really annoying to the users.

And there's ways you can do this, that are not quite as annoying, in terms of which features you place in the commercial versions and which features you place in the open source versions.

But I think it's really tricky to get that balance right.

And every time you are investing energy in that project, you know, you're asking yourself, okay, is this the kind of thing that I should try and put behind a paywall?

Or is this something that should be in the open source?

I think it's very difficult to have that mental process and to do that in a way that's honest with your users. I think if you follow that model, it's very easy to set yourself up where you are kind of doing things that are not in the best interest of your users because you are trying to grow the company.

Anyways, so what we're doing is something quite different, which is I want to build tooling, you know, from Buoyant, as opposed to Linkerd, which is all open source.

The commercial stuff that we are building at Buoyant is tooling around not the service mesh itself, but around the sorts of things that are enabled by the service mesh.

So the most obvious incarnation of this is we have a SaaS product called Dive.

Right now, it's in private beta, but if you go to dive.co, you can sign up for the wait list and we're slowly letting people in.

And it's super cool and getting lots of great feedback.

And we've got, you know, a lot of customers using it and getting a lot of value out of it.

But what Dive does is, it's not a service mesh, but it is instead, a layer that sits on top of the service mesh that sits on top of Kubernetes.

And it solves a bunch of kind of higher level problems for you.

So what the service mesh solves, what Linkerd solves for you are these kind of computer problems, right?

Like I have a request. I need to send it, you know, send it over here and I need to send it over there.

And you know, if it fails, I need it to retry, and I need it to be secure, so that, you know, if someone breaks in, they can't, like, sniff the communication.

And then what Dive solves, is really more of the people side of things, the process side, the business side of things.

Okay, I've actually got like five different clusters and I've got my application.

It's distributed across all of them, you know, and one of these clusters is like our prod EU cluster, and one is our prod MA cluster.

And so, how do I make sense of this application? How do I make sense of the metrics and things around it?

How do I put those into things like SLOs or service level objectives?

How do I tie those to business metrics?

How do I, as an operator, actually really build this platform because that's ultimately what you're doing, right.

I want to build a platform for the developers.

How do I give them a UI where maybe they're not Kubernetes experts, but they should be able to understand when they deploy their code to this platform that I've built on top of Linkerd and Kubernetes and whatever, you know, they should understand what this thing is doing.

And like, did that deploy succeed? And if so, is it getting traffic and all that stuff.

So there's a whole set of things that we can solve very, very effectively, you know, in the form of a commercial product that doesn't require us to like, hold things back from the service mesh. Does that make sense?

Marc: Yeah, that makes sense.

I mean, I don't think Kubernetes is often credited with making things simpler to understand.

So Dive is, you said, a SaaS product.

So I have my Kubernetes cluster or clusters, and I install the open source Linkerd project inside there, you know, once I have access to Dive.

Then I'm able to just connect my Linkerd installation to Dive, and it's sending some of the data back up, and I can control my cluster or view it from inside Dive?

William: That's right, that's right.

So we want to give you that dashboard, you know, that kind of unifies everything across all your clusters, across all your namespaces and gives you, you know, not the implementation view, right?

Oh, this pod is in this namespace of this cluster and whatever, like there's a million tools for doing that.

But instead, you know, kind of from the organizational perspective, what is my application and who owns it and where is it running?

And like, what's the state of that? And what has changed recently, right?

There's a whole set of questions that you need to be able to answer effectively if you're building a platform on top of Kubernetes.

And these are the sorts of things where, you know, for all the Kubernetes adoption, people are just starting to tackle this.

People are just starting to realize, holy crap, you know, I'm building a platform and I actually need a layer on top of here that can translate this to the rest of the organization.

And I need a layer on top of that, just for me.

You know, it's not enough to have a bunch of Kubernetes cluster inspectors, right?

And to have like Prometheus sitting over there or Datadog, and like, okay, now I've got a platform.

It's actually a lot more that's required, especially if you want that platform to be, you know, kind of the thing that gets out of the way of the developers that enables them to launch code and run code in a way that can scale and that's reliable and where they have the right feedback loops, right?

'Cause what you really want, now, I'm getting a little philosophical.

What you really want I think is, you want your developers to take on the persona of service owners, right?

So it's not just, I write this code and I push it to the platform. And like, now I'm done, right?

You want them to say, I push this code to the platform and I still own it, right?

So now, I'm looking at it and I'm getting alerts for it.

And if this thing breaks at 3:00 a.m., I'm on call for it and I'm going to wake up.

And I might not understand the details of the platform and how that stuff worked.

But I have ownership over my code, starting, obviously, with the writing of the code, but not ending with deploying, right?

I have full life cycle ownership and that's what we call the service owner.

So that's kind of the philosophical goal that I think we want to get to.

And that's how we're designing Dive, certainly: to enable that sort of behavior.

So you have not just platform ownership, but also service ownership.

Marc: Yeah, Dive looks super cool.

I mean, we have a team of SREs, site reliability engineers, that tries to build some of the tooling to enable that service owner inside our organization, but Dive just really looks like it codifies and productizes a lot of that functionality for us.

William: Yeah, well, that's the thing. It's not like I'm a genius and I just like, come up with this, right?

It's like, I look around at what companies are doing.

And they're basically all building versions of Dive internally, and they're always strapped for resources and it's always hard to, like, build this stuff.

And it sits at this weird intersection of, like, well, I need to build a UI on top of this platform stuff, you know, and that's already like two communities that don't always overlap, but yeah, you know, this stuff has to be built.

So that gives me a lot of confidence. And certainly, the feedback that we get from our customers continues to give me confidence.

So hopefully we'll get past, you know, kind of the private beta stage pretty soon.

Marc: That's cool. You know, you mentioned kind of getting a little bit philosophical.

When I was preparing a little bit here, I came across a manifesto, or a MeshiFesto, I guess, that you've written on the Buoyant website.

William: Uh oh.

Marc: I'm kind of curious.

Like, if you can explain that to us here so we can understand like, you know, the parts that we haven't covered already, like what do you have in your service MeshiFesto?

William: Yeah, so look, you know, I am the CEO of this venture-backed startup but I'm still an engineer at heart, right?

It hasn't been that many years since I was writing code for a living. And so I am very empathetic to our Linkerd users and adopters.

And I feel like I understand the kind of situations that they are in and, you know, they'll bring in this technology and they kind of have to justify it and like, no one understands what the hell it does.

And you know, it's a tough role, being in the platform role is a tough role to be in.

So mostly what I wanted to do in that MeshiFesto, that's a great term, was I wanted to have this very honest look at the service mesh, right?

If you're an engineer, your life is, especially a software engineer, but I think any kind of engineer, your life is about trade-offs, right?

You know, there's no perfect solution to all these things.

So you're always taking something and you're trading something else for it.

And I just wanted to get that out there and say, look, here's what the service mesh is. Here's why it makes sense.

By the way, it didn't make sense 10 years ago. So here's what changed. And here's what made it make sense.

Here's what you're getting. Here's what you are giving up, you know, and here's how to think about that trade-off.

I just wanted to have this kind of honest conversation because for better or for worse, there's so much noise in the service mesh space.

And there's so much vendor-driven marketing and like I'm a vendor, so.

I have to be a little careful when I say that but you know, I'm not a vendor of the service mesh.

That's an open source project, right? I'm a vendor of like this, of Dive and like this kind of platform tooling on top of it.

But there's so much noise. It's really hard to tease apart like what this thing is and what actually, you know, what's the value and how do I think about this?

And are there situations where I shouldn't use a service mesh. The answer is yes.

There's tons of situations where it doesn't make sense.

So that was my goal with that piece was to just get it out there, you know, and at least try and have a little bit of signal floating in this giant sea of noise.

Marc: So you're saying, it's not actually required to run a service mesh, if I want to have a reliable, observable and secure environment?

A service mesh might make it easier but it's not like, I need that.

William: Well, I mean, you know, we ran Twitter, which is probably the highest scale system that I'll ever work on without a service mesh.

Oh yeah, we had Finagle, right, which maybe, you could argue was like an early version of the service mesh, but you know, it wasn't like sidecar proxies and all that stuff.

Yeah, there's many, many ways to have a reliable system.

I mean, you can use a Monolith. You don't even have to use Kubernetes!

Marc: That's true.

William: Believe it or not.

You know, so it's easy to get caught up and you know, especially in our little bubble but the reality is, there are many ways that you can build reliable software.

I happen to think the cloud native approach is a really good one but it's not one that's available to everyone.

You might just have so much kind of existing investment in other technology, that for you to switch over, would be crazy.

Okay, you know, make a data-driven decision, right?

And the same thing with the service mesh. You could be using Kubernetes and maybe you are like running a Monolith on there.

Well, I don't know, maybe that's a little weird, but if you have a Monolith, you probably don't need a service mesh because there's no service-to-service communication.

And for us, you know, specifically for Linkerd, we tie very heavily to Kubernetes because that allows us to keep Linkerd very lean and very small, but that means that you can't use Linkerd if you are running on, you know, outside of Kubernetes and that's the trade-off that we've made and that should be clear and obvious from like all of the documentation and even the marketing around Linkerd.

Marc: So let's say that I do, I'm not running any service mesh in Kubernetes right now.

And I decided it's time to start kind of dipping my toe in the water. I run a Kubernetes cluster.

Do you have any patterns that you've seen work really successfully for somebody to avoid, you know, trying to adopt too much all at once and like what is a good path for somebody to start adopting Linkerd?

William: Oh yeah, yeah. This is a great question.

So I think the most important step is step zero, which is understand what problem you're trying to solve, right?

Like your problem cannot be, I need a service mesh, right?

I don't have the service mesh and that's the problem I need to solve, right? That's not the reason to do this.

And the reason that has been so pounded into my brain is 'cause for whatever reason, man--

It's like, there is a segment of the world that it's like fashion-driven technology. Like, you know, why am I doing this?

Well, I read about this blog post and therefore, I'm going to do this thing.

And like, I understand from kind of the learning perspective, okay, you should try different things and you know, you should always be exploring and learning, but it's weird how many people are adopting Linkerd because they feel like they need to adopt a service mesh and they don't have a reason why.

So step zero is understand what problem you're trying to solve.

And, you know, I have some suggestions for what those problems are.

And then step one, you know, I guess is installing Linkerd, right?

We try, and this is a big difference between Linkerd and any other Kubernetes service mesh, to make it as easy and as safe as possible to install Linkerd and to do it in an incremental way.

So one of our big, you know, even from the very beginnings of kind of Linkerd 2.0. Certainly, 1.x was different 'cause, you know, there, our goal was like, okay, let's take Finagle and like, whatever that thing does, you know, that's now a service mesh.

For 2.0, we actually took a step back and we said, okay, what's the right way to do this, right?

What would you actually want out of the service mesh?

And one of the kind of philosophical principles that we arrived at was, if you have a functioning application on Kubernetes and you install Linkerd, the application should continue functioning, right?

And it's like, it sounds crazy, but it's actually not that crazy. It doesn't sound crazy at all.

You know, it's hard to do. But that is a principle that we've held true to, since, you know, 2.0. We're now on 2.9, right?

So yeah, once you install Linkerd, your application should continue to function.

And then, you know, there's kind of two parts to a service mesh.

There's the control plane. And there's a data plane.

Once you install the control plane, actually nothing should change because you've just installed some things off to the side and they're not even active yet.

The next step is you add the data plane. This is the actual proxies.

You add those to your applications, and that can be done incrementally.

You could even do that one pod at a time. But usually, you'll do it one service at a time.

And so we give you these really safe ways of incrementally adding Linkerd.
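As an illustration of that incremental rollout (added here as a sketch, not code from the Linkerd project): Linkerd's proxy injector adds the data plane proxy to pods whose template carries the `linkerd.io/inject: enabled` annotation, so meshing one service at a time amounts to annotating one workload at a time. The manifests and the helper names `mark_for_injection` and `is_meshed` below are hypothetical.

```python
import copy

# The annotation Linkerd's proxy injector looks for on a pod template.
INJECT_ANNOTATION = "linkerd.io/inject"


def mark_for_injection(deployment: dict) -> dict:
    """Return a copy of a Deployment manifest whose pod template opts in to the mesh."""
    meshed = copy.deepcopy(deployment)
    template_meta = meshed["spec"]["template"].setdefault("metadata", {})
    annotations = template_meta.setdefault("annotations", {})
    annotations[INJECT_ANNOTATION] = "enabled"
    return meshed


def is_meshed(deployment: dict) -> bool:
    """True if the pod template carries the opt-in annotation."""
    template_meta = deployment["spec"]["template"].get("metadata", {})
    return template_meta.get("annotations", {}).get(INJECT_ANNOTATION) == "enabled"


# Hypothetical workloads: only "web" is meshed in this first step.
deployments = {
    name: {"metadata": {"name": name}, "spec": {"template": {"spec": {"containers": []}}}}
    for name in ("web", "api", "worker")
}
deployments["web"] = mark_for_injection(deployments["web"])

for name, manifest in deployments.items():
    print(name, "meshed" if is_meshed(manifest) else "not meshed yet")
```

The other workloads are left untouched, which is the point: each service can be added, observed, and rolled back independently.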

And the other thing we do, which is really, really critical, and man, this is like the product of many, many lessons learned, is we give you a lot of tools for understanding the state of the service mesh, for inspecting, like--

Okay, what's actually happening here because the moment you install Linkerd, I guarantee you, the moment you install it, you know, when someone else is running their code on it and something breaks, they're going to be like, dude, what's this service mesh?

What did you do? What does that, that thing's broken. It broke my code.

Marc: Yep.

William: Right, and so that happens every single time. So we need to give you those tools.

So you can say, actually, look, here's this thing. And here's what happened, and this is why that's happening.

Marc: That's cool. And so you said 2.9 is the current version.

William: Cool, did you say that's cool? Marc, it's not just cool, that's foundational.

Marc: Actually, you know what? It is. I was like.

As you were explaining it, I was like, kind of going through the installation docs and you know, like the whole CLI that you shipped, you have preflight checks built into there.

We should dive into that a little bit more.

William: Oh man. It's a product of like this experience. And again, I think it comes back to empathy for our users.

Like, we know what's going to happen when you install this thing. It's not like, oh, cool.

I got this thing running now. I get to collect my paycheck. I know you're going to get into trouble.

I know it because I've seen it over and over and over again.

And what we're trying to build with Linkerd is not like this ultimate platform to solve all things for all peoples.

We're trying to build this thing that actually solves very concrete, specific problems for you know, SREs and platform owners who are adopting Kubernetes.

And so we can't just, it's not just like solve the metrics problem or give them great observability.

It's like, I need to solve every other problem that you're going to encounter, you know, including the like people in your company are yelling at you because you've installed a service mesh problem.

Marc: Yeah, and I think, you know, you have battle scars that you were alluding to there around installing it and having that affect something in the cluster, which brings an application down or stop some kind of service-to-service communication or ingress traffic or something.

And just by, not because you're using Linkerd, just the process of installing software should not affect anything that's currently running in the cluster.

William: Right. Part of building a good product is understanding, I think, you know, every, every component of that life cycle from not just installation, but from maintenance and upgrades and all that stuff.

Marc: Yeah.

William: It's hard! Like this stuff is hard.

It's not easy to run this stuff even as, as easy as we try and make it with Linkerd, man, it's still not trivial by any means, actually running a service mesh you know, or heck, Kubernetes itself in production.

Marc: Yeah, I think that's what gives a lot of these like Kubernetes and other service meshes, you know, potentially, you know, bad names and like the complexity.

It's just like the effort that you have to put into making it consumable and usable and approachable. It's a large, huge effort.

There's a lot of product work that has to go in even if it is just a CLI tool, you have to think through all the edge cases and everything.

William: Right.

Marc: I'd love to shift now and kind of talk a little bit about the current version of Linkerd, what you've been working on and like the roadmap, what's coming, what should we look forward to?

William: Yeah, so 2.8, which was out in June, multi-cluster, as I mentioned, was the big feature there.

And 2.9, which is out in early November. The big feature there is MTLS or mutual TLS for all TCP communication.

So Linkerd has had mutual TLS for awhile now for HTTP communication.

In 2.9, we're extending that so that it doesn't matter, you know, what the protocol is.

If you are making a connection, you know, a TCP connection from point A to point B and like, you know, both sides are meshed, so you've got the data plane running on both sides, then Linkerd will transparently add mutual TLS to that connection without you having to do anything.

In fact, you don't even have to enable it. It's on by default.

And by add MTLS, what I mean is distribute the certificates and rotate them every 24 hours and tie the identity there to the Kubernetes service account of the pod.

Like take care of all the details here so that when you install it, you know, you have this thing that pushes you in kind of a major way towards zero trust security without you having to pay the price, without you having to configure a whole bunch of stuff and potentially get it wrong and potentially leave it insecure.

So that's a pretty big milestone for us.
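A minimal sketch of the two ideas described above, identity derived from the service account and short-lived certificates, might look like the following. It is an illustration only, not Linkerd's implementation: the identity string format assumes Linkerd's default naming with the `linkerd` namespace and `cluster.local` trust domain, and the `refresh_fraction` threshold is a hypothetical parameter.

```python
from datetime import datetime, timedelta, timezone

# Certificate lifetime mentioned in the conversation.
CERT_LIFETIME = timedelta(hours=24)


def workload_identity(service_account, namespace,
                      mesh_namespace="linkerd", trust_domain="cluster.local"):
    """Build an identity name from the pod's service account (assumed default format)."""
    return (f"{service_account}.{namespace}.serviceaccount.identity."
            f"{mesh_namespace}.{trust_domain}")


def needs_rotation(issued_at, now, refresh_fraction=0.7):
    """Renew once a (hypothetical) fraction of the certificate lifetime has elapsed."""
    return now - issued_at >= CERT_LIFETIME * refresh_fraction


issued = datetime(2020, 11, 9, 12, 0, tzinfo=timezone.utc)
print(workload_identity("web", "emojivoto"))
print(needs_rotation(issued, issued + timedelta(hours=6)))   # False: still fresh
print(needs_rotation(issued, issued + timedelta(hours=20)))  # True: past 70% of 24h
```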

Marc: Yeah. I actually would love to dive into a little bit of the technical implementation details there.

Mutual TLS, for all TCP connections, instead of just HTTP. Like what were the challenges with implementing that?

William: Yeah, so actually, doing the TLS itself is not that hard in the sense that there are many, many libraries that kind of exist to actually do that.

The hard part is usually the certificate management, you know, like rotating the certs and distributing them and creating them.

And where do they come from? And, you know, chaining them.

And, you know, once you get into multi-cluster, then it's like, there's an additional layer of complexity there.

So really it's more around the certificate management and making sure that you're doing this in a way that doesn't make things less secure.

You know, make sure that the key materials are stored in, you know, using some Kubernetes primitive that actually keeps them safe and you're not transmitting them across pod boundaries.

And, you know, there's just a lot of details to get right when you're doing this.

Marc: With that, do I get strong identities for all the pods, also?

William: So you get, yes, you get strong identities, which are tied to Kubernetes service accounts.

So the service account of that pod is what Linkerd will use as the identity for the certificate.

Marc: Okay and then that's shipping in the next couple of weeks here in early November, right?

William: Early November, yep, yep. And there's a bunch of other. Man, this is a massive release.

I mean, we also moved the proxy to a multi-core runtime.

So up until 2.9, we've actually gotten away with just running on a single core because it's so fast and, you know, we've just optimized the hell out of it.

But once you get to a certain throughput, you know, and concurrency, being able to extend to multiple cores becomes necessary.

So we've got a new multi-core runtime. 2.9 adds ARM support.

So if you want to get this thing running on your Raspberry Pi, you can do that. Gosh, what else did we add?

Service topology, which is this new Kubernetes feature that allows you to kind of express these routing preferences, like, hey, try and send it to something on the same node.

And if it's not on the same node, then to the same cluster and so on. So there's just a, yeah, ton of cool stuff.
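The "same node first, then wider" preference can be sketched as an ordered list of topology keys, where the first key that yields any matching endpoint wins. This is a simplified illustration of the idea rather than Kubernetes' or Linkerd's actual code; the endpoint records and the `pick_endpoints` helper are hypothetical, while the key names follow the Kubernetes node label convention.

```python
HOSTNAME = "kubernetes.io/hostname"
ZONE = "topology.kubernetes.io/zone"


def pick_endpoints(client_node, endpoints, preference):
    """Return the endpoints matching the first preference key that has any match."""
    for key in preference:
        if key == "*":
            # Wildcard: any endpoint is acceptable.
            return list(endpoints)
        wanted = client_node.get(key)
        if wanted is None:
            continue
        matches = [e for e in endpoints if e["node"].get(key) == wanted]
        if matches:
            return matches
    return []


client = {HOSTNAME: "node-a", ZONE: "zone-1"}
endpoints = [
    {"pod": "b-1", "node": {HOSTNAME: "node-b", ZONE: "zone-1"}},
    {"pod": "b-2", "node": {HOSTNAME: "node-c", ZONE: "zone-2"}},
]

# No endpoint shares the client's node, so the zone preference decides.
chosen = pick_endpoints(client, endpoints, [HOSTNAME, ZONE, "*"])
print([e["pod"] for e in chosen])  # ['b-1']
```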

Marc: It sounds like a lot of that, I mean, ARM is great.

Like, we run a lot of, you know, Graviton, and we even have, you know, clusters that run things like that now for the nodes.

And so, that's awesome.

And it sounds like a lot of the other stuff is around just really like doubling down on optimization, making it faster. Like service topology, is the primary goal of that around optimization?

William: Yeah, that's right. That's right, a lot of this work is around making Linkerd as fast and as low memory as possible.

'Cause remember, you're adding these proxies to like every pod.

So if you've got, you know, 10 applications and each application has a hundred pods, you know, well now you've got a thousand instances of the Linkerd data plane proxy.

And so if that thing takes 130 MB, you know, it's like, it's not trivial you know, so we try and make that as small as possible.

And then every request is going through, not just one, but it's going through two proxies.

You've got one on either end. You've got the client side and the server side.

So every millisecond or microsecond of latency that we add, in the proxy is like potentially, something that's a user facing change.

So making this thing as fast as possible, you know, is the other goal.
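The arithmetic behind that argument is easy to work through. The sketch below uses the pod counts and the "what if" memory figure from the conversation; the per-proxy latency and the number of hops are hypothetical values chosen only to show how the overhead multiplies.

```python
def mesh_memory_mb(apps, pods_per_app, proxy_mb):
    """Total proxy memory across the cluster: one proxy per pod."""
    return apps * pods_per_app * proxy_mb


def added_latency_ms(hops, per_proxy_ms):
    """Each service-to-service hop crosses two proxies: client side and server side."""
    return hops * 2 * per_proxy_ms


# 10 applications x 100 pods = 1,000 proxies; at 130 MB each that is 130,000 MB.
print(mesh_memory_mb(apps=10, pods_per_app=100, proxy_mb=130))  # 130000

# Hypothetical: a request that fans through 3 hops at 0.5 ms per proxy.
print(added_latency_ms(hops=3, per_proxy_ms=0.5))  # 3.0
```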

And then of course, you know, there's kind of a, we haven't talked too much about this in this podcast, but there's this underlying security theme that we focus on very heavily in Linkerd land, especially in the data plane, because that's where you know, the application data has to transit, right?

So if you have PCI data or HIPAA compliant data or PII or whatever, all of that data is going through the proxy.

And so the proxy has to be really secure. It's got to be really stable.

You can't have a buffer overflow exploit in there.

It's part of the reason why we wrote it in Rust.

And that same kind of idea of security, of course, extends to mutual TLS and you know, all the other cool stuff we do.

Marc: So the proxy is written in Go but that data plane, that's what's written in Rust right now?

William: The data plane is the proxy. And that is what is written in Rust.

Marc: Okay.

William: Yeah.

So the control plane is written in Go, which is nice for a lot of reasons in the Kubernetes world, especially almost everything is written in Go, so we can kind of leverage these Kubernetes client libraries and things like that.

And it's also a nice language for kind of open source contribution because the barrier to entry tends to be quite a bit lower than other languages, but on the data plane, you know, the thing we optimize for there is security and speed.

And so Rust was really the only, it was the only logical choice for us there.

We had to make this thing as fast as possible, which meant we had to compile to native code.

We couldn't have a, you know, a managed runtime, even Go, which was relatively fast, we knew it was not going to be fast enough at the proxy layer, but at the same time, you know, of course we didn't want to write it in C or C++ because it's hard to get programs written in those languages to really be safe.

It's not impossible but it's not, you know, it's a lot of human effort as opposed to a lot of computer effort.

Marc: Sure. And how has Rust been, as far as an open source project?

Or is that not an area of the Linkerd project that you're really looking and striving to get a ton of open-source contributions into?

William: Oh, we definitely are. It's just that the barrier to entry is a lot higher than on the Go side.

So, Eliza Weisman, who's one of our Linkerd maintainers, has started doing this weekly or near weekly, proxy live coding stream, live coding, live stream, which has been really cool.

And we've done a lot more kind of blogging and stuff about the proxy 'cause proxy's super cool.

Like if you are into, if you're like a systems nerd, proxy is like this super fast asynchronous network program written in Rust, and almost like kind of the state of the art, like network programming stuff is all happening in Rust these days.

So it's like, you know, nerd heaven in there, but the barrier to entry to Rust is pretty high.

You know, there's like the borrow checker and whatever. I don't know.

I'm just like, I'm looking at this from the outside in.

I've never written a line of Rust, but I've watched Oliver, you know, who is like, the real, the person who was writing a lot of the code in the proxy, especially, I've watched him transition from Scala to Rust.

And you know, it's not an easy transformation to make because Rust is, it's designed with this kind of core memory safety thing under the hood.

And that takes time to wrap your brain around even for a really experienced programmer.

Marc: It sounds like it's really cool that you have that live coding stream going on once a week or so.

So if you're just kind of getting into it, you kind of learn the project, the architecture and kind of watch an expert here that knows it.

Just watch them write some code and you'll probably pick up a ton from it.

William: Yeah, we love getting proxy contributions. That's super cool. And it's such a cool project.

This is one of the things where I'm like, oh man, I wish I could still, you know, was allowed to write code 'cause have fun, you know, it's like super optimized network programming code, that's yeah, that's fun.

But I did write this article a year or two ago on InfoQ about our evolution from Linkerd 1.x to 2.0 with Go and Rust.

And a lot of the factors that went into our choice of Rust for the implementation language.

You asked how it's been. It's been great.

But you know, when we made that choice, back in, I think, 2017, was when we were starting to really look at it.

It was scary. It was a gamble because the network ecosystem was just starting to flesh out there.

So it was a real gamble to invest in Rust, but it's been great.

You know, a couple of years later, we're like, all right, that really paid off.

Marc: That's great. Is there a certain type of use case that you're looking for from somebody who's coming in new into the ecosystem that you, you know, you'd like to see it fleshed out a little bit more in Linkerd?

Is there anything like that, that you're looking for, you'd like to kind of put a call out for?

William: In terms of using Linkerd, I think the multi-cluster use case is a super cool one.

And we're just seeing people start to adopt that.

I think we were a little ahead of the curve with 2.8.

So I'm definitely interested in hearing more about kind of service or cross-cluster communication with Linkerd.

I think we've done something that's super cool there where it's totally transparent to you, as the user. You can even do fancy stuff where you're like failing traffic over, like service A is talking to service B, and you're going to slowly fail traffic over to another cluster that has B on it. But I think we're just starting to see people really invest in that now.

So, and I'm sure there's a couple of bumps we have to like iron out, especially around things like certificate management in that world.

So I'd love to hear more use cases around that. Similarly, the tooling, you know, building stuff on top of Linkerd is always something that I'm interested in.

Flagger, I hope it's just the beginning.

I love the idea of there being, you know, this business logic that you're building, whether it's for canary deploys, or for cross cluster fail over or whatever, where you're tying together these core primitives that Linkerd is providing around metrics, around traffic splitting and so on and building these like higher order operations around that.
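One way to picture such a higher-order operation is a weighted split between two backends whose weights are shifted step by step, which covers both a canary rollout and a gradual cross-cluster failover. This is a generic sketch under stated assumptions, not Flagger's or Linkerd's implementation; the backend names, step count, and helper functions are hypothetical.

```python
import random


def failover_schedule(steps):
    """Yield (local_weight, remote_weight) pairs that move traffic in equal steps."""
    for i in range(steps + 1):
        remote = round(100 * i / steps)
        yield 100 - remote, remote


def choose_backend(weights, rng):
    """Pick a backend name at random, in proportion to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


rng = random.Random(42)  # seeded so the run is repeatable
for local, remote in failover_schedule(steps=4):
    weights = {"b-local": local, "b-remote": remote}
    sample = [choose_backend(weights, rng) for _ in range(1000)]
    print(weights, "observed remote share:", sample.count("b-remote") / 1000)
```

A real controller would check metrics such as success rate between steps and stop or reverse the shift if they degrade; that feedback loop is what ties the traffic splitting and metrics primitives together.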

And then in terms of contributions, gosh, man, there's so much fun stuff to do.

You know, one thing we've been looking at, kind of on and off is Wasm or WebAssembly.

Envoy has done this recently, and it seems to have worked out really well.

So it'd be interesting to think about what we can do on the Rust side, in kind of the same vein.

We haven't needed to as much as Envoy needed to, mostly because the Linkerd proxy has been so specific to Linkerd, but you know, one thing that we lost in the move from the JVM to Rust is the JVM, you know, is like, it sucks in a lot of ways, but one thing it was really good at is it had a great plugin model.

So if you want to do some really business, you know, business specific, business logic in the data plane, like you wanted to rewrite headers or to like inject your custom thing or whatever, having that plugin model was actually really valuable in 1.X and we haven't really had a great way of doing that in 2.X because we're compiling these proxies, you know, in Rust, down to these super light, you know, native code things.

So Wasm might be a way of giving people data plane proxies, which would be, sorry, data plane plugins, which I think potentially could be really cool.

And I'd love to see what people can do with those.

Marc: Yeah, that's cool. That's a great idea. And so the team's pretty focused right now and you know, 2.9 is coming out. Do you have any thoughts as to what you're going to be looking at in 2.10?

William: Yeah, there's a couple of things we're going to look at.

One definitely is making the control plane a little more modular.

So right now you get Prometheus and Grafana and all these components and the dashboard, and so on.

I'd like to support, and we started doing this already, I think in 2.9, another thing we'll have, is bring your own Prometheus so we'll make it so that you can more easily use an external Prometheus rather than relying on the one that Linkerd installs.

I'd like to extend that further, or we would like to extend that further and make it so that you can have these really minimalist control plane installs.

Another big one on the roadmap for us is policy.

You know, we've done all this work to get MTLS and identity so that we can, you know, we can encrypt the connection and give you confidentiality.

We can verify the identity on either side and give you authenticity.

But the thing we haven't added yet is authorization. You know, is this request actually allowed to happen?

Right now, the proxy will do its best to satisfy the request. It'll always say yes, but sometimes you might want it to say no.

So policy is on the roadmap for us, as well. And I think that'll be, that'll be really fun.

Marc: That sounds like a lot of work to get authorization right, though.

William: Well, you know, it's actually not. I mean, we have all the building blocks in place, right?

The proxy is already there. It's already, you know, making decisions about requests. It's already got identity on both sides.

The only thing it doesn't have is like, the decision to say yes or no.

So if you can provide it with that, you know, presumably we wouldn't, you know, we'd plug into something.

OPA is like a common framework for like expressing policy, presumably plug into something like that.

I don't want to like come up with a definition language for Linkerd. I'd rather play with the existing ecosystem.

So I actually don't think it's a huge amount of work.

Oliver's probably listening to this and like rolling his eyes 'cause he has to write the code.

But from my perspective, as the marketing slash, you know, CEO person, it should be pretty easy.
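The missing "yes or no" decision can be sketched as a lookup from the server's identity to the set of client identities allowed to call it. This is purely illustrative: it is neither Linkerd's policy API (which did not exist at the time of this conversation) nor OPA's, and the identities, the `POLICY` table, and `is_allowed` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    client_identity: str  # verified via mutual TLS on the client side
    server_identity: str  # identity of the workload receiving the call


# Hypothetical policy: only "checkout" may call "payments".
POLICY = {
    "payments.shop": {"checkout.shop"},
}


def is_allowed(request, policy, default_allow=True):
    """Allow the request if the client is listed for this server.

    Servers with no policy entry fall back to default_allow, which mirrors
    the "always say yes" behavior described above when set to True.
    """
    allowed_clients = policy.get(request.server_identity)
    if allowed_clients is None:
        return default_allow
    return request.client_identity in allowed_clients


print(is_allowed(Request("checkout.shop", "payments.shop"), POLICY))  # True
print(is_allowed(Request("web.shop", "payments.shop"), POLICY))       # False
print(is_allowed(Request("web.shop", "catalog.shop"), POLICY))        # True (no policy set)
```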

Marc: It is cool, that you want to plug into the existing ecosystem and not go zero to one on something new. So like that should help a little bit.

William: Yeah, well, and that's been a big theme for us, as well.

You know, we rely, not just a heavy reliance on Kubernetes, but we pull in Prometheus and Grafana and all these components because, you know, we don't even provide our own ingress, right?

That's another big difference is, Linkerd basically, we just make it pair with any ingress controller that you want to use.

So it's all in the name of being lightweight and being composable and kind of being you know, this is what a good engineering project should do.

It should fit in and it should be modular and composable and pluggable.

Marc: And very, very bash style. Just do the thing that you want it to do.

Do it really well and be a building block on top of that.

William: Right, the Unix philosophy.

Marc: Cool, I don't really have any other questions right now.

Is there any other thing that you know, you want to share or you want to bring up?

William: I think you've covered the highlights with this incredible set of questions.

I guess the only thing I'd add is you know, if anyone is out there, who's running Kubernetes and wants to solve problems around observability--

So getting a consistent layer of metrics, you know, especially things like success rates and latencies across all their applications, or wants to add a set of kind of default reliability tools like retries and timeouts and canary deploys, or wants to add mutual TLS in a way that is not painful and requires very little effort on your part, but that still gives you a lot of control, then please go to Linkerd.io and just click on the getting started.

It should take you about five minutes to install it.

We've got a very healthy and active Slack channel on Slack.Linkerd.io.

Tons of friendly people who are eager to help you out. And it's all open source.

We don't hold back any features. It's all on the CNCF. It's pure problem-solving goodness.

Marc: Awesome, well thanks, William.

I've really enjoyed the conversation here and I've learned a lot about Linkerd. I got to go give it another try.

William: Thanks Marc. Hope to see you in the Slack, giving us lots of feedback.
