The Kubelist Podcast
60 MIN

Ep. #28, Plural with Michael Guarino

about the episode

In episode 28 of The Kubelist Podcast, Marc and Benjie speak with Michael Guarino about Plural, a scalable solution empowering DevOps teams to build and maintain cloud-native and production-ready OSS infrastructure on Kubernetes.

Michael Guarino is the founder of Plural. He previously worked in software development at Facebook, Frame.io, Twitter, and Amazon.

transcript

Marc Campbell: Hi, and welcome to another episode of The Kubelist Podcast. On today's episode we have Michael Guarino from Plural to talk about his project and how it aims to make it easier to deploy and operate open source applications hopefully in Kubernetes. We're going to learn all about the project, but before we start, as always, Benjie's here with me.

Benjie De Groot: Hello, hello, hello.

Marc: All right. So let's start off with some quick introductions and backgrounds. Michael, why don't you just start us off with telling us what you did before creating Plural?

Michael Guarino: Yeah, yeah. So I've been basically an engineer the entirety of my career. Right out of college I joined Amazon in Seattle for a couple of years, I was on one of the resale website teams just doing boring Amazon stuff, I guess. But really I wanted to go live in New York, and I did that by joining the Vine team at Twitter, so I was one of the backend engineers at Vine, which was a pretty cool experience.

We actually only had like five backend engineers at peak running all of Vine's platform, and we were running it basically all on Open Source infrastructure at scale, our own MySQL, all our own Redis, Memcached, RabbitMQ, actually a pretty big Elasticsearch cluster as well. So I got a lot of hands-on experience of running infrastructure really, really directly at scale there, which was really, really fun.

Then, well, Vine was discontinued in 2016 as everyone knows, I had to find a new place, I did not want to work at Twitter any longer, and I ended up leading backend engineering at Frame.io, where I built most of their core systems as well. I basically ended up building so much that I had nothing really interesting left to build and went to Facebook, and was quickly disillusioned by what I was doing at Facebook.

It was, frankly, very boring to me and I was extremely, extremely bullish on Kubernetes at the time. What I was seeing was the emergence of a standard for what a distributed system really should be, or how it should be defined. That hadn't existed before, and I think that's what caused a lot of problems around the manageability of a lot of infrastructure, because every single company would have their own special snowflake setup.

They'd do it their own way, they'd have their own bash scripts, everything like that. It was just a constant mess, really. If you have some degree of industry-wide standard, you can suddenly automate on top of that, and what is now Plural, the ability to repeatably deploy pretty complicated stacks on a lot of different infrastructure, becomes a viable possibility. So that's what I wanted to do.

Marc: Good. So that makes sense, so yeah, the snowflake style deployment definitely makes it difficult to manage, but I want to go back. So you said at Twitter when you were working on Vine, there were five infrastructure engineers managing all of that infrastructure at scale, did I hear that right? Five?

Michael: Yeah, yeah. That's basically correct, yeah. Really we had two SREs and three people who were doing more like actually building the APIs and everything like that. But that was really it and it was like 15 million monthly active users, and it was a pretty... At that time I was completely blown away. I came from Amazon, very much of the opinion that, frankly, Amazon should just run everything that's too hard for mere mortals to do, and seeing that it actually happened at Vine was just a complete paradigm shift for me.

Because we were running very complicated infrastructure, like sharded MySQL with multi-master replication for every shard, with our own cloning procedures and everything, and it was actually doable. So I had to figure out what that gap was and it was pretty clear, actually, once I had teased it all out. The gap was operational knowledge, it really was; making people build infrastructure this way is a staffing problem.

They just can't find people who have deep infrastructure experience, and the reason you need that deep infrastructure experience is because you have to know so much in the stack. You have to know Linux fundamentals, you have to know the various APIs you're actually interacting with, whether Kubernetes or the cloud provider APIs. You have to know distributed systems fundamentals to do debugging and stuff like that.

It's pretty rare to find a single human being who can know that, but again if you have some degree of standard API that you're programming against, you can have software know all that for you instead of a person and that completely changes the game hopefully.

Marc: Yeah. So I imagine at Twitter, in order to do that, or at Vine, sorry. In order to do that you really did have to... There was not a single snowflake in there, everything had to conform because it's a slippery slope once you start doing anything that doesn't just follow a normal pattern.

Michael: I wouldn't go that far.

Benjie: That's the idealized version of that, I think. Wait, so at Vine, were you guys on Kubernetes? This was 2016 I think you said, is that right?

Michael: Yeah, so this was really right at the beginning of a lot of the containerization stuff and Kubernetes was quite immature, so we were not on Kubernetes.

Benjie: Right, but Twitter was on Mesos at the time, I believe.

Michael: Yeah, and we were not in Twitter's main infrastructure, that's another interesting detail. We were actually on AWS whereas Twitter has its own data centers, and most of the acquisitions had a similar issue, like Periscope was the same deal and they would have to figure out how to interact with the core Twitter systems. But we actually ran everything basically on EC2 machines, and manually operated all of those systems.

That was even more bonkers, we didn't have a proper orchestration system and we were still able to basically manage the operations for the entire platform with a very small team. It was because we had tons of ex-Google SREs, ex-Amazons. We had an extremely crack team, even by Twitter's standards, running the thing.

Benjie: Right, but the big takeaway is if you don't have a bunch of Google SREs sitting in your back pocket it's pretty difficult to do?

Michael: Oh yeah, for sure, yeah. Unless software fills the gap.

Benjie: Right, of course. Were you containerized at the time or was it not even containers?

Michael: When I first started it was not containerized at all, and then we tried to start getting into Mesos and Marathon, fitting in with the same scheduling system that Twitter was using. But it was maybe 60 to 70% done and then Vine was discontinued.

Benjie: Right, so you were really pushing a pretty big boulder up a pretty steep hill with all of that experience. Okay, so we have to ask, how many fail whales did you deal with?

Michael: I wasn't at Vine early enough to catch the days when the infrastructure was immature. The biggest thing that happened there was, I don't think you guys remember, but there was that terror attack in Paris in 2015 when ISIS was really active, and there was a video of either AK-47s or a bomb going off at a soccer game or outside of a soccer game. It was the most viewed video on Vine of all time.

Vine was killed soon after so nothing was able to compete with it, but it was getting so much traffic it actually blew up an entire database shard. I think that was the biggest outage I remember, mostly because obviously it was also a pretty historic event, the consequence being it made us also have to have a pretty interesting on-call rotation.

Marc: Just for a second here, I think it's worth acknowledging the cameras in our pockets, the phones, all that stuff has gotten us to a place where pretty significant news, pretty significant things like that can get distributed immediately. A good part of what we're doing here is that we're helping them build out infrastructure that actually educates people and gives people some stuff, so there's obviously some downsides to that as we've seen but there's also some really powerful positives there. So moving on, then you went to Facebook, and what were you using at Facebook? Was that Kubernetes? What was that?

Michael: Yeah, this is somewhat public but not everyone knows. Facebook basically built their own, they've been around long enough, they're like Google where they build their own orchestration and containerization systems. They have a system called Tupperware, which is their analog to Borg at Google. I don't know to what extent they poached people from Google to build it, but that's what they use for orchestration entirely.

I think there's some Kubernetes usage but it's probably more on the enterprise engineering side, so back office applications. The actual production systems all run on Tupperware, and I don't fully know what they use for actually containerizing applications. I don't remember. It ultimately resolves down to Linux cgroups like a Docker container does, but it's not Docker.

It's something completely different, and they have their own package system as well. Basically you can't roll all of Facebook's servers with a standard Docker registry, it would actually blow up a number of switches, so they have to use something that's equivalent to BitTorrent to distribute their packages.

Benjie: That's interesting.

Marc: So while you were at Facebook though, it doesn't really matter I guess, I know that this is The Kubelist Podcast, we talk about Kubernetes, but it's really these primitives and these common API abstractions around container orchestration and scheduling and running and operating that matter. That's the value, right? So with Tupperware being an implementation of it, you still start to say, "Oh wow, I can avoid these bespoke solutions, avoid snowflake systems and get scale"?

Michael: Yeah, yeah, for sure. There's some aspects of Tupperware that are head and shoulders better than Kubernetes.

One thing it does is it can deploy in multiple regions without any real sweat, whereas Kubernetes really is completely locked down to a single region and it does much larger scale deployments than a standard Kubernetes would be able to manage as well. Deployments with hundreds of thousands of servers underneath them whereas I think a Kubernetes controller would probably start choking for that sort of thing, just based on the way it's architected.

But there are also huge issues with Tupperware in terms of its usability and everything like that, where people who had a Kubernetes background were like, "Ugh, I wish I was still using Kubernetes."

Benjie: In fairness, I've never worked at Google, but I've heard similar stories about Borg, where Borg had scale but usability challenges, which is why they rewrote it: Borg, then Omega, then Kubernetes, and here we are.

Michael: Yeah, for sure. I think it's inevitable, and part of that's also the Open Source development lifecycle versus an internal development, where you can force weird primitives down your company's throat but an Open Source community would revolt at that and it disciplines you into creating something that's a little bit of a better product.

Marc: I think one of the things that I'm excited to dive into here and talk about Plural is, let's talk about Kubernetes, these platforms provide lots of different values, one of them is scale and impact on cost reduction. But one that I think is super powerful to even smaller organizations is just that common API, the standardization that says, "This is how all software can actually interact with the infrastructure." And so I'd love to hear, give us the pitch, what is Plural? What does Plural do?

Michael: Yeah, so think of Plural as like a package manager for applications to deploy to Kubernetes. What we want to give is an experience comparable to npm add, when you add a JavaScript library to one of your applications, but for a full stack in your Kubernetes cluster. You can just do a plural bundle install for, say, Airbyte on GCP.

It'll give you a wizard for the initial configuration, then plural build will sync all the artifacts locally into your Git repository so you can track the state of the application appropriately and have complete ownership of it, then plural deploy will take all those artifacts, the Helm charts, Terraform modules and so on and so forth, and create the application in your cloud. But the ultimate goal is you just get completely validated deployment packages that you can deploy wherever you want them.
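
For readers following along, the flow Michael describes resolves to roughly the following, a minimal sketch based on the commands he names; the exact bundle name for Airbyte on GCP is illustrative:

```sh
# Pick a packaged app and answer the wizard's configuration prompts
# (the bundle name "airbyte-gcp" is a stand-in; actual names vary).
plural bundle install airbyte airbyte-gcp

# Sync the generated artifacts (Helm charts, Terraform modules) into
# the local Git repository so the installation's state is tracked.
plural build

# Apply the artifacts: run the Terraform and Helm installs in your cloud.
plural deploy
```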

Marc: You mentioned Helm charts, Terraform modules, et cetera, so if I want to run some Open Source package, do I need Kubernetes or can the Terraform actually create the Kubernetes cluster for me also?

Michael: Yeah, that's a good question. It does actually create the Kubernetes cluster for you, so you don't even have to really think about Kubernetes at all to use Plural. It can be considered an entire implementation detail. It will create Kubernetes in the lowest-operational-effort way, so we'll use the managed control planes for the major clouds, like AKS for Azure, EKS for AWS and so on and so forth.

We also have a pretty featureful admin console, so you deploy your applications, you can also deploy our console application and it'll give you a full administrative suite with all the operational tools you need to understand how to manage the application. It has things like dashboards and logs, it'll receive upgrades over the air so when we push new versions of a package it'll actually automatically deploy it for you.

You don't even have to think about upgrades. Then we also have a pretty cool interactive run book experience, so more complicated things like scaling a database where you probably don't know the underlying Kubernetes CRD that's representing that database, you can just have a purely graphical experience where you input that new amount of CPU or that new amount of memory to it, or maybe add a few gigabytes to its underlying disk and then click scale and it'll do it all for you.

That system is also extremely flexible so we can create all sorts of different wizards to do other complicated interactive operational tasks with a full graphical experience, so you don't have to learn kubectl or anything complicated like that.

Marc: So going back, right? Running a database, especially a database at scale, distributed in Kubernetes, there's lots of challenges. So we as an industry were like, "Operators, that's a great abstraction around that." I don't need to understand how to operate that database, there's all these configs and everything like this. I can have this reconcile loop that's watching it and maintaining the desired state, and what you're doing is providing an even easier to use abstraction on top of that. Does it have to be an operator or can it literally be anything that's driving through a graphical interface?

Michael: Yeah, it can literally be anything. InfluxDB doesn't have an operator, but we have a run book for scaling the InfluxDB deployment that we have that ultimately just resolves down to reconfiguring a Helm chart. It's meant to be a very flexible system and it's just a matter of us adding the appropriate plugins to be able to solve for any of those operational issues.

The operators are really simple, but there's a ton of people who are not going to want to learn the spec of an operator and how to interact with them. If you can just give them a graphical interface that they can interact with instead, it's a huge win for them and they'll be able to do things they wouldn't have done otherwise.
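
To make concrete what those runbooks abstract away: scaling a database usually means editing an operator's custom resource by hand. With the Zalando PostgreSQL operator that comes up later in the episode, the raw alternative to clicking "scale" in a runbook looks roughly like this; the resource and namespace names here are hypothetical:

```sh
# Grow the database's underlying disk by patching the postgresql CRD;
# the operator's reconcile loop then resizes the volume for you.
kubectl patch postgresql airflow-postgres -n airflow --type merge \
  -p '{"spec": {"volume": {"size": "50Gi"}}}'
```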

Benjie: So this kind of hearkens back to your whole software abstraction layer for infrastructure, and so really it's a pretty powerful abstraction on top of Kubernetes. So this is for Open Source projects, correct? This is for Elastic or MySQL, or whatever. Is that correct?

Michael: Yeah. There's no technical limitation that would mean that we always have to do Open Source projects, but it's what we're going to be focused on primarily because there's a ton of people who obviously want to deploy them, they have a ton of adoption already so we can plug right in and get a lot of users that way, so there's a little bit of a scaling concern there.

It also makes it a lot easier from a licensing perspective and stuff like that as well, so there's a lot of advantages to it. We'll hopefully develop to a point where people can deploy their own applications with Plural as well. We have a pretty good deployment engine. Our console is hopefully a really good experience for managing any application, not just Open Source ones, and hopefully people really like it and want to use it more.

Marc: You're talking in that scenario about first party application deployment? Like, I have an API that I'm running in Kubernetes, but I should be able to have this nice, operationalized, graphical way to manage it if I want.

Michael: Yeah, exactly. One of the big benefits there is operational handoffs, so when you're ramping up a new devops engineer or a new backend engineer, if you have that full suite of operational tools that they can just jump right into then it's a really big deal. Then obviously time to response for incidents, the Gold Standard at Facebook for their foundation teams is they would build web interfaces for all their common incidents and do all of their operations via those web UIs because the response time is just that much faster. And if you know what's going to go wrong, you should be able to automate it to that degree, so same sort of principle.

Marc: I'm curious why that is, why is a web interface that much faster to be able to respond versus somebody who is able to access the cluster and run kubectl or whatever the Tupperware equivalent would be?

Michael: I think it's just because if you think of what you do when you're responding to an incident, there's a lot of data aggregation you're basically doing, right? So you have to view a ton of metrics to understand what actually is going wrong, then you have to context switch to a totally different interface and figure out the login or whatever credentials you need to get kubectl access, and it just makes it a lot slower.

Then confirmation that you've actually fixed it as well would be another context switch into another interface. But if you have just everything on an interactive dashboard or something like that, you can basically do everything in one place and understand exactly what's going on and fix it.

Benjie: Yeah, I really tend to agree with you, Michael. I've worked with Marc before. He's an amazing engineer, he doesn't understand why some of us aren't as fluid at the CLI as he is, let's just put it that way. But Marc, not everyone eats, breathes and sleeps kubectl, so just get over it. Okay?

Michael: It's useful even if you are a Bash god, right? All the Facebook SREs who are top tier, they're very good with the command line. But the truth is, if you basically have an existing run book that you would be executing anyways, then it could easily become a web interface and you don't have to go through a doc, then go through another UI to figure out your metrics, and then go through another UI to figure out how to log into something, and then go to a terminal and then back and back and back. So it becomes a much easier, smoother experience.

Marc: Yeah, that's great. Thinking about it in terms of avoiding the context switching, you end up with this reproducible process that everybody is going through, it's definitely easier to onboard and you're not avoiding the CLI, you're just saying, "There's a normal way that we can go through this." I'm guessing if I have Airflow deployed through Plural, I still have kubectl access to that cluster if I need to do some kind of advanced type of operation-

Michael: Yeah, exactly.

Marc: Let's go back. One phrase you used, it was "when we," I think it was the way that you said, "When we push new versions of these packages, they're easy to get and install and update." So is that what Plural does, you're actually taking these Open Source packages and creating the packaged, published version that you're recommending that we run?

Michael: Yeah. Ultimately this is something we also want to build a community of people doing, but at the moment we're the primary packager and for the most part... Taking Airflow as an example because it's definitely a good anchoring example for this, but what we'll do is we'll find the best helm chart out on the Open Source ecosystem and then hydrate it with the appropriate stuff to create that operational environment, the dashboards, the log filters, the run books and everything like that.

Make sure that it's deployed in a production ready way, so we'll figure out how to get IRSA identity for EKS appropriately, or workload identity for GKE, so you have temporary credentials getting injected into the cluster. Set up the S3 logging and everything else that you would need for a full cloud deployment. Then also figure out the persistence layer appropriately.

Airflow has a PostgreSQL database underneath it so we'll use an appropriate way of deploying a PostgreSQL database, we typically use the Zalando PostgreSQL operator for that. But it will at least be something with backup and restore, failover and all that that you would want, instead of a really crappy Bitnami Helm chart deployment of PostgreSQL, which is what you normally see wrapped up in these charts.
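
As a concrete illustration of the IRSA piece: wiring temporary AWS credentials into a pod on EKS comes down to annotating the app's service account with an IAM role, which is exactly the kind of detail a packaged chart can take care of for you. A minimal sketch, with made-up names and account ID:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: airflow            # hypothetical service account for the app
  namespace: airflow
  annotations:
    # EKS's IRSA webhook sees this annotation and injects short-lived
    # credentials for the role, e.g. so Airflow can write logs to S3.
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/airflow-s3-logging
```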

Marc: Yeah, that was going to be my next question. What are you adding on top of what Bitnami has? But it sounds like you're really just like instead of, "Here's a million different options on how you could deploy it," which is really great, and it's awesome that Bitnami has these public helm charts available. But you're actually saying, "Look, we've put a lot of effort into thinking through what's a sane way to run this in production so you don't have to understand what all these different options are."

Michael: Yeah, that's exactly right. What our goal is, is to make sure that all of the things that we package are production ready, production grade deployments and then the other big thing is, if you do install a Bitnami chart, you're going to have to read the chart's values file and really dive in deeply into how it actually works just to install it.

Our installation, our bundle install process where you basically go through a guided wizard, means you don't have to read any source code or anything like that to get up and running hopefully. It's just a very guided, simple experience and then, boom, you have an application.

Then the other big thing with the Bitnami product is that there's no operational environment after you've installed it, and that's to me the biggest problem. Because while I may be willing to pay that initial upfront installation cost, if it comes with a recurring constant operational headache afterwards then I'm not going to be confident about it whatsoever, and our admin console and all those operational tools that we've been talking about I think actually makes it a viable product beyond the cool getting started user experience.

Benjie: Right. So how do I handle my secrets and my environment variables and stuff like that when configuring this stuff?

Michael: This is a really important question. Basically all GitOps has this issue and there's a lot of different solutions for it. What we basically did is we reimplemented in our CLI the git-crypt Open Source project, which will basically encrypt secrets directly into the Git index itself, so when you push a change to Git with a file that would have a secret, it'll be completely AES-256 encrypted in GitHub or GitLab or wherever.

But then when you check it out, clone it, then you run plural crypto unlock and it'll decrypt it entirely for you. We have ways of managing the symmetric keys and everything like that for people so that they can appropriately share their repos and stuff like that. But that's how we manage our secrets. It makes it a lot easier to do a lot of the automation in terms of generating all those repositories, and it's just a wildly better user experience than all of the other secret management tools I've seen for GitOps.
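
In practice the flow he describes looks something like this; the repository name is illustrative, and plural crypto unlock is the command named above:

```sh
# Clone an installation repo; secret files arrive AES-256 encrypted.
git clone git@github.com:your-org/your-plural-installation.git
cd your-plural-installation

# Decrypt the secrets in the working tree using your symmetric key.
plural crypto unlock
```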

Marc: So at the end it's potentially running as a standard secret in Kubernetes. So if you had a chart that's deployable, you don't have to repackage all that to have a different runtime or anything like that?

Michael: Yeah, exactly.

Marc: I'm curious, first, how big is the team at Plural right now?

Michael: We're definitely growing. If you'd caught me a couple weeks ago it would be very different. But we have now six full time employees and then three other people who are on as contractors, but basically with us full time. Then we're going to have four more employees coming in by July.

Marc: That's great. That's a pretty small team. You're trying to build a product and you're trying to maintain the packaging for these Open Source projects. How do you scale that right? How many folks and how much work does it take to maintain that Airflow package, as an example, and keep updates available and understand all the changes that are in it and make them available to anybody who's running Plural?

Michael: Yeah. The reality is there's definitely going to be a point at which this can't scale to a single organization, so a big part of what we're doing is we're building this as an Open Source project with the idea that hopefully we can create a community around it of people who will do some of that work around packaging and maintaining of specific applications. Or the communities themselves, the Airflow community would be eager to do the packaging for Airflow on Plural so we can divide and conquer that workload. There are some interesting details with it though.

One of them is that the way Plural works as an architecture kind of simplifies things pretty radically. We're never going to be deploying some sort of multi tenant Airflow that works for millions and millions of people. It's just an Airflow cluster for a single company, so it's relatively simple. It's still a distributed system and there's a lot of complications around it, but it's definitely something you can get your arms around.

The actual amount of work to upgrade a new version of Airflow ends up being about a day's worth of work for each individual chart. It's basically figuring out the new chart version, doing some tests on the various different infrastructures. Now, there are some scaling hacks that we have done. One is we've used Renovate. Renovate is a pretty cool tool to manage dependencies for a lot of different languages, and it happens to work for Helm and Terraform.

It'll actually give us new issues whenever there's a new upstream chart change for all of the charts that we use, so if there's a new version of an Airflow chart, we'll suddenly get an issue in our Plural artifacts repo to pay attention to it. Just that constant poll of the things that you're updating is something that obviously causes a lot of work, and we would've had a much more manual process around it if we didn't have that automation.
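
For reference, wiring Renovate up the way Michael describes is mostly configuration. A minimal renovate.json along these lines would watch Helm chart and Terraform module versions and open issues or PRs on upstream bumps; this is a sketch under assumptions, not Plural's actual config:

```json
{
  "$schema": "https://docs.renovatebot.com/renovate-schema.json",
  "extends": ["config:base"],
  "enabledManagers": ["helmv3", "terraform"]
}
```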

The other big thing that we've done is we've gone through the step of creating a full integration testing framework so when we push new versions of these applications, that bleeding edge version will deploy into all of our own clusters in AWS, GCP and Azure, and we'll run a basic test on them. In this case, it's running the aggregate health check on the application.

If it goes to green, the test passes and we'll promote that version to a different tag that actual users can subscribe to and actually get their upgrade delivered. We have a full testing suite to deploy these things at scale now as well. Quality becomes the biggest deciding problem with managing this, and creating that automation to ensure quality is a big deal.
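
The promotion gate he describes can be pictured as a health check standing between the bleeding-edge tag and the stable one. A purely illustrative sketch, not Plural's actual pipeline, with a hypothetical promotion command:

```sh
# Wait for the freshly deployed app to report healthy, then promote.
# "plural-promote" is a made-up stand-in for the real promotion step.
kubectl wait --for=condition=Ready pod -l app=airflow -n airflow --timeout=15m \
  && plural-promote airflow --from bleeding-edge --to stable
```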

Benjie: Yeah, I think one thing I would push back on, or I would have little concerns about just from real world experience, is I've had plenty of charts that, let's say, go green, but then I start using them at scale and all of a sudden some weird, esoteric setting that I had no idea existed turns into a disaster and brings down production.

Obviously the integration test stuff is a huge step on that, but how can I get comfortable if I'm Bank of America and I'm using you guys for my Elasticsearch, but... Not that Bank of America uses Elasticsearch, but hypothetically, do you guys have plans, is really the question, to do load testing around these packages?

Michael: Yeah. Basically we've only scratched the surface of the integration testing problem. There's a lot of ways that these things can fail, especially at scale. I doubt someone running some mass throughput Elasticsearch cluster is going to be truly validated by an Elasticsearch integration test we'd do for the entire platform.

One thing we'd probably consider doing is having them have their own promotion pipeline that they can test for their specific infrastructure use case, instead of the main package pipeline that we're talking about, because it's frankly going to be their own configuration.

Now, the majority of our users, the vast majority of our users, basically use a lot of these applications with default settings and won't have a really crazy setup. So it really does actually validate that it will land correctly for them, but yeah, you're 100% right that there will be some edge cases that people are going to have to figure out themselves, to be perfectly honest.

Benjie: Yeah, I think it's just kind of cool what you guys are building out and how I could see... And maybe I'm going a little crazy here, but there could be a community integration testing effort long term or something where it's like, "Well, we tested out a canary version of this so we can help you validate." Stuff like that, just from a community perspective it just... It's an interesting thought. I haven't seen that yet, but I feel like as we grow out all this stuff, how can big enterprise contribute to the Open Source community? One thing that's just tickled my brain right there is maybe we can get them to do a bit of integration testing.

Michael: Oh yeah, that's actually a pretty cool point. You could have that one enterprise with a massive, at scale deployment run some tests there and have a different release channel for the package that it validates. That's pretty cool. Yeah. The framework is actually flexible enough to do that sort of thing, so we could actually do that out of the box with what we currently have.

The other kind of interesting thing is I don't think there's really good solutions for this, period, on the market right now. All the testing is locked into CI/CD. We need to do it this way because of just the nature of what we're doing, so if we're going to truly validate our packages, it needs to be running in infrastructure, because part of the validation is, is IRSA working properly with it? You can't really tell that unless you actually inspect the result of the underlying EKS IRSA operator, whether it injects the service account tokens and all that goodness and works appropriately.

The long term vision of it is that we'll have a full-blown SDK that can interact with the Kubernetes API just to see if resources are looking the appropriate way, and do other very common things. Run Selenium tests as well, that's one thing we would really want to do for every single application that has a web interface, exercising a very basic core flow. Or do basic Selenium smoke tests around logins, so that the UI is actually interactive in the way that we'd expect.

Benjie: You mean Playwright, right? Because that's the new hotness.

Michael: Whatever, yeah. Cypress is good too.

Benjie: Cypress is great too, I love Cypress as well. No, I joke, I joke. I think it's super interesting trying to deal with dependency management for other people, which is kind of at the core of what Plural seems to be doing. What are my options for a canary deploy? So if I'm using Plural and I've got my elastic cluster running and then there's a new upgrade, I assume I opt into the upgrade, it's not just automatic. What's the model there actually? That's a good question, is it like ETCD where it just automatically updates or how does that work?

Michael: Yeah. You have a lot of different controls, you can control the specific tag you track. We have like three that are just out of the box: latest, warm and stable. You can also control it in your console; on the application level there's actually a holistic policy mechanism for it. When the console receives an upgrade, you can require approval, you can just apply it, or you can ignore the automatic upgrade entirely.

We also want to do things like maintenance windows and all those other interesting little control features, but they're on the roadmap and haven't been implemented yet.

As far as a canary deployment, what we'd probably recommend people do at the moment is creating both a dev and prod cluster for the infrastructure they're creating with Plural, which is very easy to do. There's a ton of different ways to skin the cat, because we make it really easy for you to create clusters.

But you can have your infrastructure running in a dev cluster, validate it there, and then promote it, either by tracking a follower tag or by having an approval flow in your prod cluster to have it deploy there.

Marc: I want to switch gears for a second, how does Plural make money doing this?

Michael: Yeah, that's a good question. We have a dual strategy around monetization. The first way that we'll do it is we'll have feature differentiation in the product, we'll effectively have an enterprise tier of Plural and the features that will go into an enterprise tier are very accepted enterprise-type features. One of them we actually recently implemented, SSO. Plural actually acts as an identity provider for all the applications that are deployed by Plural.

You can opt into using our OIDC provider and use Plural to log into your Airflow or log into your Airbyte or log into your Grafana or anything else that you deploy. You can add SSO to your Plural account now as well with directory sync, and that's going to be offered for people who decide to sign up with an enterprise plan.

We'll add other features that are very natural to price that way, but they'll all be focused on managing applications at scale in a business context. We'll 100% keep a good, small scale tier free and Open Source for the entirety of our existence. The other strategy is the vendor ecosystem that we hopefully will create around Plural, which also can create some monetization.

We hopefully will get other Open Source projects actually committing to using Plural as a distribution channel and offering their enterprise licenses on it, and we'll get a reseller cut as a result of that. So those are the two monetization strategies that we have at the moment. Maybe there'll be other ways to monetize it, who knows?

But really to be honest, what we're focused on right now is just getting people to use the product, so we're not really super concerned or pushy about revenue. We're much more interested in getting people to play around with it, giving us feedback, learning from our users on what we can do better, and improving the product from there.

Marc: But in a perfect world, you're able to work with some of the best, most popular Open Source projects and get out of the packaging business because they see the value in it. They're actually creating the packages, they're able to use your SDKs to validate them, and then push that as their preferred method for both the enterprise and Open Source versions of that product?

Michael: 100%, yeah. That's the long term goal.

Marc: Then you mentioned the SSO, is that using Dex, Keycloak, something like that underneath the hood? Or did you implement the OIDC provider yourself?

Michael: It uses Ory's Hydra project for the OAuth handshake, and you've got to think of it more as a service provider. Dex is like an ID provider, I guess it might be able to do both, it's been a long time since I've looked into the docs of it. But the applications themselves receive the OIDC logins, so it's not that side of the handshake.

Then for SSO itself, we use WorkOS, which is a SaaS product from some people who actually were at Stripe previously, that does a really good job of making it really easy to onboard people via SSO. It has really good step by step instructions on setting up SSO with Okta or with Active Directory, whatever, G Suite or any of the other solutions out there.

But the big thing is that it has directory sync, a really good implementation of directory sync that's actually usable, instead of having to implement SCIM 2.0, which is really terrible.

Marc: Yeah. I think we looked at implementing it one time, and at the time there were only like three or four companies that had successfully implemented SCIM and we were like, "Okay, you know what? There's other solutions here."

Benjie: I think I was at Replicated when we were looking at that, and I remember the nightmare of that.

Michael: Yeah, it's wildly easy now with WorkOS. I would 100% plug them if anyone is in that pain at the moment.

Marc: Yeah. The next part, right? Part of day two operations. We've talked a lot about how easy it is to install, you have Terraform, you can create the Kubernetes cluster, and then updates. But then there's also support, right? Things might not be working right, I might have Airflow running and it fails. Do you get involved in there? If I'm using Plural to deploy Airflow and then something's not right, what do I do?

Michael: Yeah, we have a couple different ways. One, we have an active Discord channel that people do actually jump into and ask us for support over. We actually have Intercom in our product as well for people who might need support with our web interfaces in general, but people ask questions through it as well. Long term, we're actually building a support interface in the product.

I don't think it's fully ready for primetime at the moment, but the cool thing about that, and the reason why we're doing it, is we can create incidents proactively with it a lot more easily. It's not as easy to do that with a message in Discord or something like that. So it'll create incidents from Alertmanager alarms that fire, it'll create incidents when a deployment fails, all sorts of different, very common failures, or configurable failures, so as a part of packaging the application you can create alerts, for instance, in Alertmanager and it'll pick them up.

That's also meant to help for those support experiences that you can't plan for as easily. It has a chat interface, it has the ability to create Zoom meetings in it and everything like that. We'll be making that hopefully a really good experience for triaging issues and managing them.

Benjie: Yeah. Support can be a huge part of the enterprise contracts that you hopefully will be getting a bunch of. All right, so let's switch gears here a little bit. Plural is an Open Source project, right? And we'll have a link to the GitHub in the show notes. But talk to us about why you started there? What inspired you to do this as an Open Source thing?

I think we've covered this tangentially a little bit already, but I think it's also really interesting if I'm a founder starting my own project, was it smart to start Open Source? What was your thinking behind it? How did you end up doing this Open Source thing? Then as a follow-up I want to talk to you a little bit about CNCF stuff, but let's just start there.

Michael: Yeah, for sure. So there's a couple of interesting things that we had to think through when we were starting this. The general idea I had starting Plural was, "Okay, I know I can deploy applications really, really repeatably with this interesting dependency based mechanism that we have with Plural, and that would be a pretty cool experience. But how do you bootstrap a business around it?" And the core problem, if you're creating a marketplace, ultimately that's what we wanted to do, we wanted to create an application marketplace, is you've got to have some degree of demand in the system for it to ever go from zero.

It became pretty apparent from the start that there's a ton of existing demand with a whole bunch of people who have these crazy operational pains around their Open Source infrastructure footprint, and we don't have to do hard B2B2B selling to get other companies to get onboard. We can just tap right into that problem and fix it. So that was a big deal, once we settled that, okay, that's the set of applications that we want to play with, then being an Open Source project ourselves is an immediate corollary.

We can't be deploying Open Source applications and ourselves be closed source, it's a huge disconnect. The other thing from a business perspective is, for people who still have qualms around the Open Source model, it's usually around not being able to control IP and all of that. I don't think the long term value of Plural is really around its IP, it's more around its community and adoption of usage.

If there's a massive number of people onboarding on Plural, packaging applications on Plural, that's going to have critical mass and it's not going to be movable. They're not going to go to an alternative because you basically create a network effect. A new application isn't going to want to go to a competitor if it can't depend on all the other things it needs alongside, like the best PostgreSQL operator, the best RabbitMQ operator, the best Redis deployment and everything like that, and then have that full ecosystem around it.

Alongside not having all the demand already in the system and everything like that. The trade off of maybe not having vise-grip control over your IP, versus having really strong goodwill with the community of users that we hopefully would create, and bootstrapping that as quickly as possible, was an obvious choice for us.

Benjie: That makes a whole bunch of sense. What was a challenge? What was one of the early challenges on this being an Open Sourced project? Obviously you guys are just getting started, we got to get you some stars, everyone go to their GitHub and give them a star. But beside star accumulation, what were some of the early challenges that maybe you didn't expect coming from a Facebook and a Twitter experience?

Michael: I actually do think the star accumulation thing, you can push it to the side but it's actually quite hard, and for a lot of the projects that do get really big, I think frankly it's kind of just an unpredictable phenomenon that it actually works for them. There is one very common playbook where it works, the "Open Source for X" play; NocoDB is Open Source for Airtable. That usually is a pretty consistent playbook, but for the other stuff it's really quite a mixed bag in being able to create adoption. The other thing that's definitely different is the release process in Open Source.

You've got to do a lot more documentation and a lot more outward engagement than deploying any SaaS service, not to mention a Facebook where you churn and burn code constantly. "Okay, I need to fix, I need to add a feature. Okay, I'm going to program that feature, push it to a branch, merge it into master and it's going to go into production immediately." That's not how it works in Open Source, you have to have versioned releases with notes and everything like that. That process is definitely a little bit different.

Benjie: So let me ask you this question, when you are doing a release of Plural itself, just walk us through the mechanics there just from an Open Source perspective of doing a release? Again, you guys are early but it's interesting to understand what the requirements are there explicitly.

Michael: Yeah, Plural is actually quite complicated in terms of the codebase. It's not just one thing that I click through and it'll be released. There's app.plural.sh, which has its own repository, pluralsh/plural; the admin console, which has its own repository, pluralsh/console; and our command line interface, which could also be released at any given time and has its own repository, pluralsh/plural-cli.

The deployment of app.plural.sh is actually very comparable to any SaaS deployment, even though it uses Plural to deploy itself. We'll create a PR, we have a test suite that runs, and if everything looks good and we have an approval we'll merge it and it'll deploy; it'll bake up Docker images and deploy itself out of that.

The console is a little bit different because it is deployed to all of our users as well, same basic process where we have a test suite that runs on PR, merges it, it does go through the integration test framework that I have mentioned and then it can go to users from there.

The CLI is actually the most annoying thing to release because it's something you locally install on your machine, so we don't have a forced deployment process that can go through Plural itself for people updating it. But we use a system called GoReleaser to stage new releases and it will update a Homebrew tap that will allow you to get a new version.
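
For context, a GoReleaser setup of the sort described is a small YAML file. A hedged sketch, where the build path, owner, and tap names are illustrative rather than Plural's actual config:

```yaml
# .goreleaser.yaml: build the CLI and publish a Homebrew formula on release.
builds:
  - main: ./cmd/plural       # hypothetical entrypoint
    goos: [linux, darwin]
brews:
  - name: plural
    tap:
      owner: pluralsh        # illustrative tap repository
      name: homebrew-plural
```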

Benjie: That's non-trivial, there's a lot of moving pieces there. Yeah, I'm sure as you go further into this world you're going to automate away a bunch of those problems. Okay. Let's talk roadmap, what is on the roadmap?

Michael: Yeah, a couple big priorities for us. One big priority is around onboarding of users, we think that the getting started experience is still not perfect and we have some improvements in mind for that. One big thing that we've done, and it's actually been in development for a while, is you can actually use Plural in the browser.

We have a cloud shell experience where you can provide some infrastructure credentials, the basic setup for a cluster, and it'll go straight into a shell, which is actually resolving to a Kubernetes pod with all the command line interfaces and everything installed appropriately, and create your first cluster just right in the browser.

We've done other things with that to make it easier for you to get started, you could actually use a project we'll create on the fly to create your first cluster as well just to test drive the experience and see if it's something that you want to commit to. The other big thing that we're doing in the onboarding sphere is we're trying to create a fully interactive CLI experience where you can just point and click for all the steps to start Plural.

We're using the same framework that K9s uses, if you're familiar with that tool. That's a graphical Kubernetes dashboard on the command line, and hopefully we'll be able to have a really good, cool command line experience to configure your first applications and get everything working on Plural. You'll still have the same Git repository with all the artifacts, so the core experience isn't different, it's just a better layer on top for interacting with the various primitives.

Benjie: Isn't that called a TUI? Don't we call that a TUI?

Michael: I actually don't know. A Terminal UI? Yeah, yeah.

Benjie: I think that's what I've been saying. I don't know. Marc, what is that called? Do we know what that's called?

Marc: That was a new one to me, but that kind of makes sense, Terminal UI.

Benjie: I've been seeing TUI, I've seen T-U-I, and you know me, of course I'm going to go TUI on that one.

Marc: Well, I was going to say that's interesting to be able to build that terminal UI right into the CLI, because you mentioned way back at the beginning here that Facebook would invest in building these web UIs and now you have the tool, you have the UI and it's built right in, in order to avoid the context switching and being able to manage and handle the operational challenges of running the software.

Michael: Yeah. This usability challenge is actually very different. The big issue is, while we give you a guided recipe for, for instance, installing things, it's still really easy to input the wrong information, and the truth is, if you have something that's imitating a web form, a lot of people will interact with it a lot more smoothly. That's the intuition behind investing in that sort of experience. We do have a graphical experience for installing applications in our console, it's just you can't go from zero to console without doing something, and we want the zero experience to also be comparably good.

Marc: So talk a little bit more about the roadmap, is that where you're focused right now? Or are you focused on bringing new Open Sourced applications in? Where are you making the biggest investments as a company?

Michael: Bringing on new Open Source applications is the other big focus. In terms of our core platform work, the main focus is on the onboarding experience, and we also have a few devops people who are focused on onboarding applications. We have two strategies around where we're focusing the things that we're onboarding. The first one is we're focusing on the data stack.

A lot of these applications I've seen, people definitely prefer to self host them. You don't really love having Airflow running in distributed infrastructure for security reasons, same thing with Airbyte. Oftentimes they can't even connect to the databases that you'd want them to connect to, right? There's very common self hosting problems around them, they're really operationally annoying and they're oftentimes orphaned in terms of your core engineering process.

So we think we can definitely help a lot with that. The other, I guess you'd call it a pillar of applications that we're wanting to focus on, is what I call a Kubernetes starter kit. If you think of what you actually need to do to use Kubernetes, you never just need a control plane. You also need things like ArgoCD, you need something for secret management, so Vault, you probably need some degree of VPN to get access to the internal network for debugging and other operational purposes.

And obviously you need the runtime layer stuff that's beyond Kubernetes, like some Ingress controller, some solution for autoscaling, because it's very frequently not a built in feature for managed control planes. What we want to do with Plural is use our deployment system to basically solve for that entire surrounding piece of the Kubernetes world for you, and then you can start using Kubernetes really, really easily, as you would hopefully want to.

Benjie: That's cool, that's great. So I get not just the best practice, configured Airflow in the example that we've been talking about here, but everything about Kubernetes, like GitOps and good secret management and everything out of the box?

Michael: Yeah, exactly. And that will be more targeted for people who are using that Kubernetes cluster for first party development and not just for actually orchestrating their Open Source infrastructure.

Benjie: So going back to the idea that Terraform can be baked in and it can actually create the Kubernetes cluster, do you see patterns, best practices, any kind of trends around, "I have Kubernetes but now I want Airflow, should I deploy Airflow into my cluster or should I bring a separate cluster along for it?" One cluster or many clusters, how do you think about that?

Michael: The way we've done it, and I don't know necessarily if it should be considered best practice or not, we do have many applications on one cluster. I feel like it depends on the specific application, because there's specific applications where it just totally breaks down. Kubeflow is the biggest example. Kubeflow is a very invasive product that does a lot of stuff, especially with Istio, and then it's also extremely resource intensive.

I feel like most people should have that just in a dedicated cluster. But you basically lose all the bin packing benefits if you don't pack multiple applications into the same cluster, so I can't imagine people wouldn't want to. It's also more operationally complicated to have a lot of clusters lying around that you have to swap between and monitor and everything like that.

Marc: The last topic that we often like to talk about when we think about Open Source projects is community and engaging with the community. The project, honestly, Plural, I've seen it. It has quite a lot of depth to it, but as Benjie mentioned earlier, it still doesn't have a lot of stars yet on GitHub. So I'm curious, what are you doing today to engage with the community both in issues and roadmap and bug tracking and things like this? Or even code contributions? Any plans that you have to change the way that you're engaging with the community?

Michael: Yeah, so the community stuff is very much in its infancy and the biggest first step we had is we actually just brought in our head of community. That's going to be something that he's going to drive on a lot for us. The big things that we're doing currently, and I'm sure we'll end up creating a whole lot more initiatives around this, is we have biweekly demos that we advertise in our Discord channel for anyone who wants to track our updates.

We obviously accept pull requests as they come in. We have had a few people, for instance, improve our Airflow deployment. We had a user at a company called FSN Capital who wanted to use Cloud SQL for the underlying database and modified the deployment to allow that to be a possibility.

Then beyond that, what we'll start doing more is things like content creation. We'll really spin up our blog and really get more out there in terms of explaining why we're doing things and helping people learn from our experience as well.

Benjie: Wait, so you guys are members of the CNCF and the Linux Foundation? Is that correct?

Michael: That's correct, yeah.

Benjie: So what does that look like? What does it mean, being a member in this context?

Michael: It's kind of amorphous. We're basically a sponsor. Our project is not a CNCF project in itself, or like a sandbox project at all. There's a good chance we won't go through that process for a number of reasons, but at the moment what it means is we have basically given some sponsorship capital to the Linux Foundation and we'll participate in a variety of different events as a sponsor as well. Things like KubeCon and other Linux Foundation events, I can't remember all of them. There's a lot of them.

Benjie: All right, so we have to ask, why not CNCF, sandbox? And it's early so I think it's unfair to commit either way, it's very early for you guys. But preliminarily, what are your thoughts on being a CNCF project?

Michael: I think for a lot of the projects that do it, it's an obvious no brainer win for them because it's such a huge stamp of approval. There's a few things that are a consequence of it that we don't specifically like. The first one is they take ownership of the brand of the product, and in theory that might not be the biggest deal. Most of the companies around CNCF projects, you could take as an example Crossplane, they have a company around it as well but it's not called Crossplane, it's very different, and this is the problem.

It's so you don't associate the company with the project, as a result of the brand transfer. The other thing is, I don't know, it's not straightforward for me to figure out what the composable piece of Plural is that would go in as a CNCF project, because the system is pretty tightly coupled. The admin console really needs to have the CLI and API, and the API doesn't really make sense unless it has the admin console and the CLI to interact with it.

Maybe we could factor out some portion of the CLI and make it a CNCF project, but it just doesn't really make a whole lot of sense independently. That was the other big thing. We think of ourselves more as an Open Source project in the vein of something like GitLab, where it's a dual license Open Source project and we're extremely transparent about the source code that we have, but it's not necessarily a standard CNCF model, as a result of all those complexities.

Benjie: That makes a whole lot of sense. I hate to admit it, but not every project should be a CNCF project, I guess is what you're forcing on me. But you're right.

Michael: Yeah. A lot of the things that go into CNCF, I think they're kind of like cloud equivalents of the standard GNU Unix tools in some ways. They're very focused, specific-tool-for-a-specific-job types of things; ETCD is a really good key value store and that's all it is.

Benjie: I'm sorry, I'm sorry. It's ETCD

Michael: Yeah, ETCD. Yeah, but Plural, it's a platform basically at the end of the day, and so it just doesn't fit in my mind.

Benjie: No, look, people that are listening to us right now hear things from us, and it's really important to hear both sides of that story, so I think that's a super interesting way to think about it. So we're recording this pretty early, right before KubeCon, but I know that there might be some stuff coming up when we release this, which is probably going to be towards the end of June. Do you have anything you want to tell us about or anything that we should keep an eye out for?

Michael: I think that's still TBD. I think one of the things that we're definitely excited to show off when it's ready is we're redesigning the entire user interface for app.plural.sh, and it'll hopefully be a lot slicker and a lot more usable. That's probably the biggest real, huge chunk of new product that we'll be releasing. I think that TUI experience as well will be really cool when that is also done. They always blow my mind.

Benjie: No, that's great. Well, look, Michael, we really appreciate having you on. Learned a lot here. I plan to go dig into a few things here. Really excited to see how you guys progress. Really appreciate you coming on and I look forward to getting you probably back on here in a year or two and seeing what you've done.

Michael: Thanks for having me. We definitely appreciate having the opportunity to share about what we're doing here, so really do appreciate it.

Marc: Thanks, Michael.