Marc Campbell: Hey, I'm here today with Rick Spencer, VP of Platform at InfluxData.
This episode is going to be a little bit different from others because Influx is not a CNCF project.
I wanted to talk to Rick about how Influx uses Kubernetes and other CNCF projects from an end user perspective.
Rick Spencer: Hi, thank you Marc. I'm super happy to be here.
Marc: Awesome. So before we start diving into the tech, I'd love to hear a little bit about your background, how you got into infrastructure technology.
Rick: So actually I see myself more as a developer tools person and I'm always at my best when my end users are developers.
Currently I would say that running infrastructure for developers is kind of a necessary evil if you want to offer that development platform for them which is exactly what we're doing with InfluxDB these days.
Marc: Yeah, it is.
I think the platform is getting more and more complex and so you can't just like, open up your laptop, write code, ship it to production anymore.
You have to start to understand a little bit more about that infrastructure in order to be a developer.
Rick: Yeah. Well in our case a lot more so that our users don't have to understand that.
Would you be interested in just hearing about like some of the steps in my journey that ended me up here?
Marc: That would be great for folks to here.
Rick: So I started out working at Microsoft on usability for their developer tools in 1998, and I got that job because I got a master's degree in cognitive psychology in an area called human factors engineering, so very focused on human performance.
The thing is, when I was a kid, all you could do with computers was program them.
So like when the whole 8-bit revolution started, we all learned how to program.
So I was a programmer since I was a tyke, probably 12 years old when I started programming.
So that was one of my foci when I was in graduate school, like the psychology of programming, and that led to an opportunity working at Microsoft in their developer tools division, studying and designing developer tools, and this is back in the .NET era.
From there I ended up with a very deep interest in opensource and working on opensource projects.
Then I ended up somehow landing my dream job.
I spent quite some time working at Canonical leading the Ubuntu project, which was just a very interesting space to be in in terms of people using opensource and infrastructure.
Like that's kind of when the Elastic Compute Cloud services started to take off at Amazon and using Ubuntu Server there was very interesting to people.
I mostly stayed on the client side, really focused on the user experience and the developer experience of Ubuntu.
From there I ended up with a stint up at Bitnami where I really got to pursue much more of my interest in cloud computing and providing opensource software in a really easy to consume way for end users.
I was especially interested in like the developer tools and the databases and such.
Then after that, I ended up joining InfluxDB to help them run their new cloud product, the multi-tenant cloud service where we stay extremely focused on the developer experience and operating it so that people can build their time series applications really easily on top of us.
Marc: Cool. That is actually like a super interesting journey.
I'm curious, a couple of questions for you there.
Like when you left Microsoft and went to Canonical, Microsoft historically has had a hate and then love, like embrace and welcome relationship with Linux over time.
Like when you left, where was Microsoft in their view of Linux at that point?
Rick: Oh yeah, they hated it.
Just for an interesting story, I won't name check anyone, but I started using first Fedora and then Ubuntu at home which meant I couldn't use the Outlook client.
So I had to use Outlook Web Access, which meant all my emails from home were in plain text format.
Then finally somebody realized that was because I was using Outlook Web Access because I was using Linux at home, and so somebody in my management chain actually dropped by my office and told me I shouldn't be using that "crap," or however they put it.
So they were still pretty hostile to it at that point.
I find the irony that I left Microsoft to work on open source just hilarious now, because of course I consider them actually reasonably good open source citizens these days.
Marc: Yeah, Microsoft is a very good opensource citizen I think these days.
At Canonical, it's like the Linux on the desktop, right?
Like Linux is the developer experience, we're waiting for that to become like the standard.
We see it a lot more these days, but I think still, you have macOS and Windows taking over most of the development experience.
Rick: Yeah. Agreed, I mean a lot of the developers on my team still use different flavors of Linux and really like it.
I use a Mac desktop personally but I use Linux very heavily in my development.
I just spin up containers all the time so I'm just constantly running Docker, and the first thing I do when I have any development project is I just start a Dockerfile and pull some version of Debian to get started.
So I do think Linux stays extremely central even to the desktop development experience, even for people like me who are not like running it natively.
The only reason I don't run it natively is because I just like my microphone and camera and other software to just work without configuration which some people care about more than others and that's very subjective also.
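As a sketch of that kind of starter workflow, such a Dockerfile can be just a few lines; the base image tag and packages here are arbitrary choices for illustration, not anything Rick specified:

```dockerfile
# Start every project from a known Debian base and build up from there.
FROM debian:bookworm-slim
RUN apt-get update && apt-get install -y --no-install-recommends build-essential
WORKDIR /src
```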
Marc: Right. So let's move into Influx.
You're using Kubernetes a lot in the day to day operation of the Influx platform.
Like I'd love to hear at what level are you using Kubernetes and how invasive is it into the production of infrastructure?
Rick: Oh, it's everything in our production infrastructure.
It's all we think about from the time we come to work until we leave at night.
I think it might be useful to give you a bit of an overview of like what I'm working on so I might contextualize a little bit about why we're using it.
Rick: So for people that don't know, like not everyone knows InfluxDB surprisingly, but it's like kind of the first and I would say the best time series database.
These are databases that treat the dimension of time as the primary dimension of your data, especially that you query from and that you index on, et cetera.
We are a database at our heart, and there is an opensource version of it.
We're on Version 2.0 now. The opensource version is like a single binary, it's super simple to drop on your desktop or drop on your server and use.
But I run the cloud version.
Between the opensource version and the cloud version, we try really hard to keep the APIs very similar and we also try to design things so that they work well together so you can use an opensource binary on the edge, have it interact with your cloud account in ways that are useful to you.
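To make the time series idea concrete: points in InfluxDB are expressed in its line protocol, where a measurement, tags, fields, and a timestamp are packed into one line. A minimal Python sketch of building such a line follows; the measurement and tag names are invented for illustration, and real line protocol has extra escaping and typing rules (e.g. integer fields take an `i` suffix) that this skips:

```python
import time

def line_protocol(measurement, tags, fields, ts_ns=None):
    """Build one line-protocol point: measurement,tags fields timestamp.
    Simplified sketch: no escaping, strings quoted, numbers passed through."""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(
        f'{k}="{v}"' if isinstance(v, str) else f"{k}={v}"
        for k, v in sorted(fields.items())
    )
    ts = ts_ns if ts_ns is not None else time.time_ns()
    return f"{measurement},{tag_str} {field_str} {ts}"

print(line_protocol("cpu", {"host": "server01"}, {"usage": 64.2},
                    ts_ns=1609459200000000000))
# cpu,host=server01 usage=64.2 1609459200000000000
```

The same line works against the open source binary and the cloud service, which is the API parity Rick describes.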
But ultimately what's ended up happening is we've sort of, over time, evolved from being like a time series database to being really a time series development platform. There are some tools and other parts of our product suite that go with that.
So first, a lot of people use this product called Telegraf. It's like totally open source.
A lot of people actually use it in Kubernetes to collect custom metrics there and there's an operator for that and a Helm chart.
But people use it in like all different scenarios and there's a bunch of different input plug-ins, transformation plug-ins and output plug-ins, which means it's like designed to be really easy to collect data from different sources and send it to different sources.
Of course the primary source is InfluxDB but Microsoft uses it, other people use it also to output to their platforms.
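A minimal Telegraf pipeline of the shape Rick describes, one input plug-in and one output plug-in, might look like this; the URL, token variable, org, and bucket are placeholders:

```toml
# Hypothetical minimal Telegraf config: collect CPU metrics and
# write them to an InfluxDB v2 bucket.
[[inputs.cpu]]
  percpu = false

[[outputs.influxdb_v2]]
  urls = ["https://us-west-2-1.aws.cloud2.influxdata.com"]
  token = "$INFLUX_TOKEN"
  organization = "my-org"
  bucket = "telegraf-metrics"
```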
Then we have a suite of client libraries, so pretty much every language that you would want to use is wrapped up in a pretty easy to use library.
Naturally there's a whole CLI experience and a UI experience that goes with it.
One of the things that we offer which developers really take a lot of advantage of is our task system.
So you can basically register a query, and the query can do all kinds of work, and you tell the query to run itself periodically, and so this saves you as the developer from having to set up your own server and keep it running.
You also don't have to poll for information, like you can set up tasks that will push to your infrastructure, to your program.
So that just saves developers like a huge amount of effort, and along with that, we have something called Flux.
Flux is a language which is--
We call it mostly a data transformation language, like it does do queries if you just want to select out data, but you can do some really hardcore math transformations, but you could also push data to other sources and pull data from other sources.
So you can really do like really advanced data transformations and interact with other infrastructure.
So you can get a bunch of data, transform it and then actually HTTP post it to like some other external system if you want and we handle all that for you.
So as a developer, you don't have to set up infrastructure for all of that kind of integration and regular work, like down-sampling and that kind of thing.
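As a hedged sketch of the idea, a downsampling task could be registered against InfluxDB's v2 tasks endpoint with a Flux script embedded in the request body. The host, org, token, and bucket names below are hypothetical, and the request is only built, never sent:

```python
import json
import urllib.request

# A Flux task script: run every hour, downsample raw points to 5-minute means.
# Bucket names are hypothetical.
flux_script = """
option task = {name: "downsample-cpu", every: 1h}

from(bucket: "raw-metrics")
    |> range(start: -task.every)
    |> aggregateWindow(every: 5m, fn: mean)
    |> to(bucket: "downsampled")
"""

def build_task_request(host, org, token, flux):
    """Build (but don't send) a POST to the v2 tasks endpoint."""
    body = json.dumps({"org": org, "flux": flux, "status": "active"}).encode()
    return urllib.request.Request(
        f"{host}/api/v2/tasks",
        data=body,
        headers={
            "Authorization": f"Token {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_task_request("https://example.cloud.influxdata.com",
                         "my-org", "secret", flux_script)
print(req.full_url)      # https://example.cloud.influxdata.com/api/v2/tasks
print(req.get_method())  # POST
```

Once registered, the platform runs the script on its schedule, which is exactly the "no server to keep running" point Rick makes.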
Marc: That's cool.
That actually sounds like super useful.
Like the infrastructure just to set up something as simple as like a high availability distributed cron.
It sounds like that's covered, but that's like kind of one of the bullet points of the features that you just described.
Rick: Yeah. I mean it comes out as a bullet point and that's how we want it to feel for end users.
But the complexity involved in making that simple for end users, for our developer customers or users is significant.
Because thousands of users are running arbitrary queries at whatever period of time that they want and when the guard rails between them break down, it can cause a lot of chaos.
So yeah, really specifically, we don't want customers to have to set up that kind of infrastructure because we know how hard it is and so that's like one of the primary values that we try to provide.
Marc: Right, and I think we're going to talk quite a bit about like the value of that developer experience, like time to live and not having to wrestle with like a large, 90 page instruction manual on how to set something up.
Rick: We call it time to awesome.
Marc: Time to awesome. I like it.
Rick: Yeah. That's our guiding principle.
Marc: Key metric there, yeah. Cool.
So that's a great overview of like the platform, InfluxDB, Telegraf and everything. Influx Cloud, essentially that's a managed version then of Influx.
You're running that all on Kubernetes today?
Rick: 100% Kubernetes. That's a multi-tenant SaaS service, so when you sign up for an account there, you get the economy of scales that you're running with other users.
One thing to note about this is there's this phrase like data has gravity and what we found is that a lot of our customers, if you're handling hundreds of gigabytes, terabytes of data for them, they don't particularly want to copy that data outside of what region they're in, much less to another cloud service provider.
So for most people that I talk to, when I'm asking them like how they're solving problems with Kubernetes, they have like just a different problem set because let's say you're using an HR application that's built on Kubernetes.
Do you really care where your hundreds of kilobytes of data is going?
Like you don't, but if you're talking multiple terabytes of data, like you care.
So for this reason, we have an approach that we call multi-cloud, multi-region.
So we operate multiple production clusters on what we call the big three cloud providers.
We do continuous delivery to those clusters multiple times a day during the workweek.
So besides just having a large scale of users, we have a large scale of users over a large number of production clusters.
I haven't run into that many people who are actually like using Kubernetes at that kind of scale, or at least a scale described in that way.
Marc: Yeah, like I'm curious how much of that scale you can describe a little bit more.
So all the big three cloud providers, you're running those.
You have multiple clusters in multiple regions.
Are you able to give us a ballpark of like the number of nodes that you have total running across all of those?
Rick: Well, I don't have that at my fingertips, but I think like our smallest cluster has like 30 nodes and our largest has about 100.
I'm just recalling this from the top of my head, but it's like in that ballpark and we have like--
I think our largest cluster right now is running like 1,000.
I don't even know how many pods we're running.
Marc: The application is truly multi-tenant then.
Like if I sign up as an Influx cloud customer, you're not spinning a separate cluster for me, I'm sharing a cluster.
Rick: Exactly, so if you sign up for an account, then you get an org ID, and you get a token to start with and then you can use that token to create other tokens to write data and then when you write that data, that data gets written into the same database as other users' data.
Of course we have all kinds of safeguards in place so you can't like see other users' data.
But it's really designed to be fairly compressed so that you would get the benefit of the economy of scale.
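A rough sketch of what that write path looks like from the client side, assuming the v2 write endpoint: the token goes in the Authorization header, the org and bucket go in the query string, and the body is line protocol. The host, org, bucket, and token here are placeholders, and the request is only constructed, not sent:

```python
import urllib.parse
import urllib.request

def build_write_request(host, org, bucket, token, lines):
    """Build (but don't send) a line-protocol write to the v2 write endpoint."""
    qs = urllib.parse.urlencode({"org": org, "bucket": bucket, "precision": "ns"})
    return urllib.request.Request(
        f"{host}/api/v2/write?{qs}",
        data="\n".join(lines).encode(),
        headers={"Authorization": f"Token {token}"},
        method="POST",
    )

req = build_write_request(
    "https://example.cloud.influxdata.com", "my-org", "my-bucket",
    "secret-token", ["cpu,host=server01 usage=64.2"],
)
print(req.full_url)
```

The tenant isolation Rick describes happens server-side; from the client, multi-tenancy really is "just an API" plus a token.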
Marc: Then I don't have to worry about where those hundreds of gigabytes or terabytes of data live because you're worrying about that.
Rick: Yeah, exactly, and you don't have to care about anything.
To you it should just be an API.
Marc: Right. So the data then, like actually going back.
These Kubernetes clusters, are you using managed Kubernetes underneath the hood?
Like EKS and GKE? Or are you deploying Kubernetes, like the hard way or using your own installer?
Rick: I would say yes to the first two unfortunately.
Like so we are finding that it's actually a significant challenge to operate Kubernetes itself.
Like, modulo writing the application on top of it, just in a multi-cloud way.
So some of our clusters in AWS, we started with, I would say, rolling our own; I think the SRE team used something like kops.
But it was not their managed service. On Google and Azure, we're using their managed service.
Then I think on the latest cluster that we rolled out in Amazon, we're also using their managed service.
So we have a lot of Terraform, as you can imagine.
Marc: Yeah, that was going to be my next question is how are you actually managing that at scale.
You're clearly not using like web consoles to create and manage those clusters.
Rick: No I mean I'm sure the SRE team does at times.
We found there were times like the problem of operating Kubernetes actually breaks down into two separate domains that are unfortunately extremely related.
The first domain is operating Kubernetes itself, and if you're only running like a single cluster, then obviously you're in a single region and a single cloud provider.
That complexity would be like dramatically reduced compared to what we're facing because you don't have to try to keep a certain level of consistency between the different environments.
But that skill set of understanding these are the cloud service provider VMs, these are the services, the Kubernetes service that they're providing on top and then we're operating Kubernetes itself on top of it.
That's like one deep, deep skill set. That then offers a Kubernetes API, which is mostly consistent.
So then we have another skill set which is like application developers, service developers, who know how to use the Kubernetes API itself.
Of course there are many times where these concerns meet in the middle but we actually have that now separated in between two distinct teams that work closely together.
Marc: So you have the platform team who's responsible for making sure that Kubernetes is available, running, predictable, reliable, the platform itself exists and then the development teams are able to count on that and then count on like a consistent, predictable API sitting there.
Rick: You're right, but we use slightly different and confusingly overlapping words for that.
So it's our SRE team that provides the Kubernetes service to us, and then I run the platform team and we are all developers that author the InfluxDB API and operate the database and everything for everybody.
But yeah, it's exactly what you said.
As you can imagine, we have to work really closely together because it's very, very easy to use the Kubernetes API in what appears to be a completely valid way to solve a problem which does not work in practice on Kubernetes, if that makes sense.
Marc: Yeah. No that totally does.
I'm sure there's some war stories or like battle scars I guess if you will from like lessons learned that might be interesting to share if any come to mind.
Rick: Sure. I mean there are bullets dodged that come to mind recently, I guess.
So just to give you a little background, like we're totally a GitOps shop and we put all of our application configuration, including like what services do we want in production, what are the replica sets, how many pods do we want, how much memory do we want them to have, like all that kind of configuration is actually done by the platform team, the application development team.
There was a case, we were like, "Well, we'll have much higher availability if we provide 1,000 of these pods."
So let's just stand up 1,000 of those and then like the SRE team was like, "You understand that if you do that and you get into a crash loop, then you will exhaust all the IP addresses available in Kubernetes. Also etcd won't be able to keep up with the changes that are happening and you're going to have a total outage."
There's like nothing in the Kubernetes API or anything like that to keep us from making that kind of mistake. Fortunately that one was headed off at the pass, so to speak, but I think that's like a good example of the kind of thing that the API allows you to do.
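The kind of back-of-the-envelope check the SRE team did by hand can be sketched in a few lines of Python; all the numbers and the 20% headroom are illustrative assumptions, not a real admission control rule:

```python
def replicas_are_safe(replicas, nodes, existing_pods, max_pods_per_node=110):
    """Rough pre-deploy sanity check: will the new replicas fit the cluster's
    pod/IP capacity, keeping 20% headroom for churn (e.g. a crash loop briefly
    holding old and new pod IPs at once)?  Illustrative only, not a real
    admission controller."""
    capacity = nodes * max_pods_per_node  # rough pod/IP budget
    budget = capacity * 0.8               # keep 20% headroom
    return existing_pods + replicas <= budget

# A 30-node cluster already running 2,500 pods cannot absorb 1,000 more:
print(replicas_are_safe(1000, nodes=30, existing_pods=2500))   # False
# A 100-node cluster can:
print(replicas_are_safe(1000, nodes=100, existing_pods=2500))  # True
```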
Like I mentioned before the chest of footguns that Kubernetes gives you.
Like, since we're musing, we're doing a lot of projects in Rust right now and the developers hated it, well, when they first started using it, because it's like just so hard to learn and to compile, et cetera.
But once they wrote their program, there's these whole classes of bugs which they simply will not encounter.
They're not going to crash because they're dereferencing freed memory, they're not going to have concurrency bugs, et cetera.
But I don't think Kubernetes is like that at all.
I feel like Kubernetes is more like-- During the early 90s when I was programming on the Mac and you could like access any area of memory.
Marc: Right, right. I mean it's like a super, I think we often forget, you're counting on it, you're relying on it at scale to run like a business.
But it is still like an early project and it's still the velocity of the project is crazy.
I'm actually curious, like how frequently is the team updating Kubernetes?
How up to date do you keep it with the current release cycle?
Rick: Well, we rely on the cloud service providers to keep it up to date.
Marc: Got it.
Rick: Which means that we're often running multiple versions for periods of time.
So I mean it's interesting, we actually did try to engage a bit with the Kubernetes community, with the CNCF community about how to practice Kubernetes.
Like how do you do it? How are people doing it?
There's probably things that people can learn from us and god knows there's things we can learn from other people.
So multiple people on my team actually, like I said, took time, booked it out of their schedules, and started joining some of the CNCF meetings and meeting people.
Of course the one and only Jorge Castro was like instrumental in helping us connect with different groups and stuff.
But my team members stopped going to those because they said they just talk about like how to run the meetings and how to induct things into the CNCF.
They don't really feel like that they were getting the opportunity in that venue to talk to people who were actually interested in helping each other operate.
So we've had to look elsewhere I guess for guidance on it.
I would also say we have a problem that a lot of our colleagues don't which is that we're offering a developer platform.
So if you're offering something where the end user can click a bunch of buttons and then hit submit, it's actually pretty constrained what those users will do.
But we have thousands of developers building on our platform and they will use the API and use the Flux language to solve whatever problem that they're having, and so it's easy--
Like we just are amazed at the ingenuity of some of our users and how they solve their problems.
Unfortunately for us, that means sometimes they go places that we didn't anticipate people would go, and then they do it like 10,000 times a minute by plugging it into a Python loop, you know?
We're always interested in finding more peers who are like facing those kinds of challenges and et cetera.
But we've been very lucky to have some really good hires on both the SRE team and what we call our deployments team.
We actually have a lot of experience with Kubernetes.
As much as you can have. So this is their second or third time through building up this kind of service.
Marc: Tons of experience, yeah. They've done it before twice.
Yeah, that's interesting. Actually like, let's go into that for a second.
Kubernetes is popular, everybody's looking.
I think most companies have either adopted Kubernetes or, actually that's probably the minority.
Most companies are working to adopt Kubernetes.
Like I think you're like this model, explaining this whole infrastructure, a lot of folks will probably be like, "That's where we want to get to."
You mentioned the story of the footgun that you avoided with etcd and some kind of challenges with CrashLoopBackOffs, you relied a lot on the knowledge and the skill of that SRE team to kind of caution against that.
I think that's where we kind of are a lot of times right now.
So my question is really like how did you build that team and how do you continue to recruit and hire people with that skill set that can operate at that level with this crazy fast moving ecosystem?
Rick: Right. So like obviously, as a VP of engineering for the platform team, I spent a lot of time with recruiting.
One of the things, and people ask me like, "Wow, how did you build this team? How did you build it so fast?"
The first thing I ask them is like, "How many hours a day do you spend recruiting?"
If it's less than like a couple hours, you're just not spending enough time on it.
Then what does recruiting look like in this world?
Like it's not a matter of putting up a sexy job posting and people coming to you.
Like you can get lucky that way, but what we usually do when we have a need is the first thing we do is we look at what projects are out there, often in the CNCF or not that are like interesting in this space.
Then let's go to the GitHub repo and let's look at all the people who are contributing to it, and we start combing through that and then start looking at like, "Okay, who are interesting people there?" and then start reaching out to them.
We've had really pretty good luck in different domains, being able to find people who are like, "Yes, I would like to work full-time on this area of interest."
Pull people in that way.
We've also had a lot of luck, I'm sure you've heard the phrase A's hire A's, B's hire C's.
Rick: I mean Paul, the founder, he just started with a really good team and then that sort of built on itself.
Just really smart people hang out with other really smart people and so we pulled in whole networks of people.
Like one person comes over and then six months later their five top people are also working with us.
Marc: Sure. Yeah.
And I mean, I think you have a great compelling story there too and an offering of saying--
If I'm an SRE and I want literally to be on the cutting edge but like actually have not a toy project but like a legitimately like big project to work on that people depend on to run their businesses, like I mean it sounds like a great team and it's always fun too to be on a team of people who are just really, really good and smart too and you learn a lot from them and it helps you just learn more every day.
Rick: Agreed. Like oh man, I feel like I've learned more at this job than like my last few years.
Like I actually started to wonder, I'm in my fifties now, maybe it's time to transition away.
But I found just being in this environment, I feel more energized and just enjoying myself at work.
Like more than, or at least as much as like any other time in my life before.
Which I also think touches on the recruiting.
Like Influx does have a culture which I didn't create, it was part of the reason I decided to work here.
But we have just a very strong culture, Evan, the CEO, I think really helps set.
I think Paul the founder also is instrumental in it.
But we just have a really strong culture around execution but also we're a very human culture and we value humility and we really value people's, I'll say work-life balance.
But I just feel like that's such a jargony term, like everybody says that, but we really do practice that, and I think that's really helped with our recruiting too, because after people have been here for a few months, they realize that it's not just marketing, that you do learn a lot, that it is exciting.
As you said it's like so exciting.
It's one thing, it's exciting to have users but then to like look back two months later and you have twice as many users, it's really exciting.
Then at the end of all that, the company takes care of you in a really human way.
I think that makes my job of team building significantly easier.
Even though I still do put a lot of hours into it.
Marc: Sure, of course, yeah. I mean to build that team, it takes a lot of work.
Before COVID, was Influx a remote engineering team, distributed?
Rick: It was, yeah. So Paul the founder lives in Brooklyn.
I live outside Washington, D.C. Other leaders in engineering, like we were all over North America and Europe.
So that was like the engineering team, I'm going to say, was totally remote.
It's a little bit of a fib just because there was an office in San Francisco before COVID and some of the engineers had an affinity for working together in the office.
But that was more their option.
I actually have a penchant for hiring people with a lot of experience in working remotely or managing remotely, and I've been doing it since like 2008 or 2009 or something like that.
So it's actually a really good environment to work in remotely, just because it's first nature to us.
It's our default, we didn't settle for a remote workforce.
Marc: Right, and I think a lot of companies, engineering teams too were co-located in an office at the beginning of COVID and obviously forced to be remote and distributed and we'll see what happens, like how many of them end up getting back in an office and how many stay remote.
But having that discipline, the culture, the practice of like communication as a remote and distributed team is super different than it is when you just go to a room and draw something on a whiteboard.
So having that kind of built in to how you operate just gives you like such an advantage all the time, and especially during COVID when everybody else is struggling to figure out how to do that.
Rick: I got a kick out of all the people's tips and tricks at the beginning of COVID, and it's like, I mean look, what do I know, but I do think it's going to be interesting.
Like a couple things, first of all, I think when everyone's working remotely, they can actually work a bit harder because no one's burned out from a commute or having to sneak away to run an errand.
You can really integrate your personal life and your work life in like a more fundamental way, just like much easier.
I also think when you're working with people and they're working from home, you really get to know them a little better because you see past their office persona.
So I think the relationships potentially, arguably get a little deeper.
At least different.
Then the other thing I would say is like when I do go to an office, I'm shocked by how much time people waste in offices.
Just like they're in the kitchen, supposedly that's where they're coming up with their greatest ideas, but like I don't think so.
So it's my completely biased point of view that just like a remote workforce is a competitive advantage, at least in the tech industry, at least with people who are writing software, and so that's going to just slowly over time erode things.
But I don't know what Google, Microsoft, like all these companies that have billions of dollars in capital tied up in these massive campuses.
I don't see how they're going to pivot away from that.
Marc: But I think to your point, so we have a fully remote engineering team, or company also, and you're right that people can focus more and their time isn't spent on that commute, but it's not like, "Oh wow, we as a company can take more of your time."
No actually, that work-life integration, when somebody who's used to working in an office realizes that, it's like, "Oh yeah, it's 1:00, I'm going to go take two hours and go take my kid to the park for a little bit and play," and then come back and work a little bit more.
Like to be able to like set their own schedule and not have this predefined set of time that they have to work and adjust their whole life around, it just gives you the ability to work when you have the creative energy to work and spend time with the family when they're available and you want to spend time with them.
Rick: Yeah. So like while we're talking about this, I have this theory that I call the two kinds of crazy.
Like most people, like when they start working remotely, they adapt to it just fine just like you said, and they're like, "Why haven't I been doing this forever?"
One of the two kinds of crazy.
The less common of the two, there are some people who just, they just cannot start working.
Like there's something about being at home and not having the transition, like I don't know, distractions, like something about the way that they're wired, that they have a really hard time getting started and sticking to something and they just give up on remote working.
Most of the time when someone does go crazy, it's the second kind of crazy where they cannot stop working.
They just stay plugged in all the time and they're always looking at Slack or having dinner, the Slack is open because there's other people working.
They want to know what they're doing.
I find that second kind of crazy a little easier to treat as a manager.
But you need managers who have the online social intelligence to really understand that people are locked into that and help coach them out of it.
Marc: Yeah. I mean obviously you want to get in there and help as soon as you can to keep them from burning out.
I mean it's not a healthy culture if everybody feels the need to be online in Slack every minute that they're awake.
Rick: Yeah, I think like that's a really good point because you said like feels the need, I think it's that which is like really toxic.
Like who am I to tell someone they should be working less if they enjoy working and that's how they want to differentiate themselves.
But if they don't want to be working but feel compelled to for whatever reason, that's like a really bad situation.
So yeah, I'm really in agreement with you on that, but that's why I hire good engineering managers who know how to keep their eye on that kind of thing.
That's what they're good at, and they're able to really help and coach an engineer to make sure they have that balance.
Marc: Let's dive back into the technology for a minute here.
You're all in on Kubernetes, Kubernetes is running the platform, but Influx has been around longer than Kubernetes.
So I'm guessing, were you part of that transition or did you join Influx after they moved to Kubernetes?
Rick: So I joined Influx, my first week was the week before we launched the multi-tenant cloud service.
So it was like real trial by fire. So just real quickly, how did we end up here?
So it started out with Paul and a cofounder, I forget their name, who had a startup called Errplane.
This is company legend, so I'm probably going to get some of it wrong, but this is the legend inside the company.
They started Errplane to help you keep track of your server metrics.
Like, wow, it's actually kind of hard to keep track of your server metrics. We need a database for this. But there's no good database that we can store this time-based data in, these time series. So they created this database called InfluxDB to support this other company that they were starting, gave it a license, and threw the code up on GitHub, and as these things go, what actually ended up getting traction was InfluxDB itself.
But it was before Kubernetes, so it was a relatively traditional server: you'd install the binary, or you'd install the UI separately, a few services around it, et cetera.
Then people who really wanted to run it at scale would then work with the company to get an enterprise license.
I'm skipping over one of our products just to keep the story simple, but you got an enterprise license and then we would help you operate a cluster of these databases in your infrastructure.
But not too long after that, when it was getting traction, Paul was like, it is a good business, we're not allergic to that business, but I think this multi-tenant SaaS could really add a zero or two to the adoption.
So they started to work on what we internally call 2.0.
So we still have the opensource version, we have an opensource version of 2.0 which is like relatively compatible as I mentioned with the cloud version.
Then they chose Kubernetes, oh man, this must have been three years ago.
They chose it because they knew data has gravity and we were going to need to be in multiple places.
So that was the best cloud abstraction layer, even then, and that's really what we use it for: a cloud abstraction layer.
So we can run pretty much the same service everywhere, and for you as an end user it's completely seamless if you use us on different clouds.
So then I picked up that project as the leader of that cloud project right as it launched, and I saw it through the initial round of availability. That was interesting, because what developers consider availability and what the Kubernetes ecosystem considers availability seem to be very different.
Like if a developer overnight runs 100,000 queries, they'll let us know in the morning if they got one 500 error.
Marc: Right, right.
Rick: They're really sensitive to those errors but Kubernetes loves to just like pass out 500s, especially when you're deploying and people say, "Well, you just retry. That's what 503 means."
But we actually can't do that, because a lot of what people are doing with the database has side effects. Like, do you really want me to retry your delete?
So we put a lot of effort into those super smooth upgrades, and we still have a long way to go. From the end user perspective it's very smooth, but we need to really improve our experience, because we're upgrading two or three times a day as we roll out changes.
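The side-effect problem with blindly retrying 503s can be sketched in a few lines. This is a hypothetical illustration of a side-effect-aware retry policy, not InfluxDB's actual gateway logic; the method names and limits are invented for the example.

```python
# Hypothetical sketch: a 503 is only safe to replay when the operation is
# idempotent, so a retried DELETE (which has side effects) is surfaced as an
# error instead of replayed. Not InfluxDB's actual gateway code.

IDEMPOTENT_METHODS = {"GET", "HEAD", "OPTIONS"}

def call_with_retry(method, send, max_attempts=3):
    """Call send() (returns an HTTP status code); retry 503s only when safe."""
    attempts = 0
    while True:
        attempts += 1
        status = send()
        if status != 503:
            return status, attempts
        if method not in IDEMPOTENT_METHODS or attempts >= max_attempts:
            # Don't replay a side-effecting call; let the client decide.
            return status, attempts

# A flaky read can be retried transparently...
responses = iter([503, 200])
print(call_with_retry("GET", lambda: next(responses)))   # (200, 2)
# ...but a delete that hits a 503 during a deploy is not replayed.
print(call_with_retry("DELETE", lambda: 503))            # (503, 1)
```

The point of the sketch is that "just retry" only works when the server can guarantee the operation is idempotent, which a general-purpose database gateway cannot.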
So we went through that, and then we had a period where we were really focused on what the developers are trying to do and what they're trying to build.
It turned out to be a lot of non-functional requirements: there were a lot of areas where they were much more sensitive to performance than we expected, so we had to really optimize performance. Some of that was at the code level, some of that was at the configuration level.
But then there are also some features that like, it seemed like they needed, like we got really close to a few customers.
Like spent a lot of time with them, and that really paid off.
Because now we're at this stage where we're really scaling quickly in terms of adoption, so naturally the focus right now is keeping up with the rate of increased usage that we're seeing. But like I said, it's not just a form that you can click on and hit submit. It's developers, and the more developers we have, the more ingenious ways they figure out to squeeze the most functionality out of the system, and that can often result in surprising work on our end to accommodate it.
Marc: Yeah, I think building a developer tool, one of the things that we've said before is that we like to think of it as cooking for chefs.
Like they have a very high standard and high expectations.
They know they could write code, but that said, they understand the challenges you're going through and they're empathetic.
They want it to work properly, and like, one 503 out of 500,000 requests, they might be like, "Yeah, but why not zero?"
Rick: Yeah. But you did bring up a good point.
I do find our community and our user base, compared to working on Ubuntu or compared to working at Microsoft, they're just the nicest, easiest to engage with group of people.
So I'm just taking the approach of just being like totally transparent.
Someone's like, "What happened?" I just tell them, and like you said, they're developers and they understand.
I don't know what it is about Influx that attracted such a positive and engaged community, but that's been really nice, if that makes sense.
Marc: Yeah. I think that's a sign that you have something that people love.
I mean like if as a developer you're going to try a lot of different tools, like every day, every week, every month, and the ones that just aren't cutting it for you, like you're probably not going to take the time to give that feedback to them.
But when there's one that's like-- It's working and you just want it to continue to get better and better and better or transform a little bit, you have some requests, like you're going to engage with that community, like really, really regularly, so it's probably a sign that people just really love the product.
Rick: Yeah. I hope that you're right.
I believe that you're right, but I'm kind of biased.
The enthusiasm which many of our customers have for the product seems really heartfelt and also there's a lot of companies, like they've really--
They are building their company on top of us as their backend.
So like IOT companies, some of them offering IOT applications, but some of them actually offering their own IOT platforms, like using us as their backend.
They are really dependent on us, like doing a good job and as they scale, we need to scale.
One thing that's interesting is that there are companies that are creating their own server monitoring solutions and sometimes it's more of a traditional workload for us where it's just like a really big company with thousands of servers.
They need their own bespoke monitoring, but what we're seeing a lot now is new companies that are building monitoring solutions for their own end users on top of us: people send them their metrics, they send us the metrics, and they have a whole user experience and alerting and visualizations and everything that they build on top, using us as the backend.
Same with like finance applications.
For some reason a lot of cryptocurrency trading applications and stuff like that are built on top of us.
So I take my job very seriously because I feel a lot of responsibility to the people who have done that.
Marc: People are counting on you, so yeah.
Rick: Yeah. Yeah. I want it to pay off for them.
Marc: Right. So you told a story there about how you adopted Kubernetes.
I'm curious, the CNCF ecosystem is more than Kubernetes.
Obviously Kubernetes is sitting there as a graduated project and it's-- I don't know, let's call it the center of the ecosystem.
But there's lots of other things. You mentioned that you used GitOps.
I'm curious, like what other projects do you use?
Do you use service meshes? Do you use CNCF monitoring tools to monitor all that?
Like, what GitOps tool set are you using, if you can share that?
Rick: For sure. I'm happy to talk about that. So we use Argo very heavily and we use Jsonnet.
So we describe our application in a base Jsonnet library.
If listeners aren't familiar with Jsonnet, it's a superset of JSON which is basically made for templating.
So we describe the base application that way, and then for the different cloud service providers and in some cases for the different regions, we'll have layers of Jsonnet on top of that.
So the base layer may say give me 10 of these pods, but then on top of that, we may say, "Well for this region give me 20 instead."
Maybe that's a bad example but that's the kind of like configuration difference that you can represent in Jsonnet.
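The base-plus-overlay pattern Rick describes is something Jsonnet does natively; the idea can be sketched as a simple recursive merge in Python. The field names here ("gateway", "replicas") are invented for illustration and are not Influx's actual configuration keys.

```python
# Sketch of the base-plus-overlay configuration pattern: a base layer defines
# defaults, and per-cloud or per-region layers override just the fields that
# differ. Jsonnet does this natively; this merge just illustrates the idea.

def layer(base, *overlays):
    """Apply overlays left-to-right, with later layers winning."""
    merged = dict(base)
    for overlay in overlays:
        for key, value in overlay.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                merged[key] = layer(merged[key], value)  # recurse into nested maps
            else:
                merged[key] = value
    return merged

base = {"gateway": {"replicas": 10, "image": "gateway:v1"}}
eu_region = {"gateway": {"replicas": 20}}  # override just the replica count

print(layer(base, eu_region))
# {'gateway': {'replicas': 20, 'image': 'gateway:v1'}}
```

The region layer only has to state what differs, which is exactly why a layered configuration stays manageable across many clouds and regions.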
Then somebody can change the Jsonnet itself, which is in a repo, or you can change some code.
If you change some code, that code will get rebuilt, our CI system will build a new container which will have a new hash which will then update the Jsonnet to say these pods should now be pointed to this container instead of the old one.
And then Argo takes it from there and we use a phased approach right now, so it will go through, I think they might call it waves.
We'll go through a set of waves of deployment.
So yeah. Like I said, we're not a Kubernetes tooling company.
So the last thing I need is to be managing bespoke Kubernetes tooling or God forbid have other users depending on our tooling.
So we prefer very much to go upstream.
We don't even mind paying for it, but if it's open source, so much the better.
We'll contribute everything back.
We use Istio very heavily, but I personally have a very love-hate relationship with Istio.
It adds a lot of complexity, and it creates a lot of metrics.
We get a lot of value out of it, but when there are problems, Istio just cognitively creates this sort of extra layer of complexity that we have to reason through.
This was maybe a year ago, so maybe it's all better, but a year ago we were having this problem where this one kind of pod stands up, I think it was a gateway pod, and that pod then needs to ingest a request and send it to another kind of pod.
I think it was a query pod, but it doesn't really matter.
But then what would happen is we would do a deployment.
If the called pods rolled, let's say we updated the query pods or whatever was being called by the gateway pods, then the gateway would not get an update about the IP address, and users would start getting errors.
The gateway can't find the service it's trying to reach.
And so you're like, "Why isn't it updating the IP address?"
So we ended up starting with the Golang DNS library and just super carefully going through with a fine-toothed comb, and of course at the end of the day it turned out there was some configuration in Istio where it cached the IP addresses in a way that was surprising to us.
Super frustrating. On the other hand, we're really looking to the future and we want to sort of migrate away from the idea that we have like a cluster here, a cluster here, a cluster here, and more like can we provide the service that you need in the right place.
So maybe it's better for you to have storage close to you, but the compute is better to happen somewhere else, instead of in the same cluster. We call that federation: can we start federating these services across clouds and across regions within clouds?
So instead of having it be this cluster, this cluster, this cluster, it's a federation of services.
Istio, we think, is going to give us the opportunity to put this all in the same networking space.
So it would be tractable for a programmer to write code that can actually access services in different places without the code being completely insane.
So I could go on, but we use Istio a lot and that's my love-hate relationship with it.
Oh, it's also really good just for diagnostics.
Like we look at the Istio logs a lot to figure out how something went.
Marc: That's cool. That's a lot of clusters, a lot of nodes, and you have to manage SLOs, SLIs and everything.
How are you doing that, and are you using something proprietary or something like Prometheus for monitoring?
How are you thinking about that problem?
Rick: Well, so we use both.
We're actually really interested in this company called Nobl9, because they've templatized this and solved this problem once and for all for everybody.
But we're at that problematic place where we already solved the problem for ourselves, and at what point do you cut over?
We are already super good at collecting and analyzing metrics and alerting on it.
So we run Telegraf sidecars in all of our pods which collect application metrics.
So when we write code, we just pump out metrics and data about that code.
The Telegraf sidecars pick that up and then send it to what we call tools, our internal production environment, where we can basically monitor everything. We just have tons of metrics and data that we can query on and piece together different understandings of what's happening.
We run other code, though, that we didn't write, and that tends to use Prometheus.
So we also have a lot of Prometheus metrics that we get.
But we ingest the Prometheus metrics straight into InfluxDB, and we don't use Prometheus itself, we just use the metrics.
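The ingestion path Rick describes, taking Prometheus-format metrics straight into InfluxDB without running Prometheus, amounts to a translation from the Prometheus exposition format into InfluxDB line protocol. This is a deliberately simplified sketch of that translation; a real pipeline (e.g. Telegraf's Prometheus input) handles metric types, timestamps, and escaping, and the measurement name here is invented.

```python
# Minimal sketch: translate Prometheus exposition-format samples into
# InfluxDB line protocol (measurement,tags field=value). Simplified for
# illustration; real ingestion handles types, timestamps, and escaping.

def prom_to_line_protocol(text, measurement="prometheus"):
    lines = []
    for raw in text.splitlines():
        raw = raw.strip()
        if not raw or raw.startswith("#"):  # skip HELP/TYPE/comment lines
            continue
        name_part, value = raw.rsplit(" ", 1)
        if "{" in name_part:
            # Prometheus labels become InfluxDB tags.
            name, label_blob = name_part.split("{", 1)
            labels = label_blob.rstrip("}")
            tags = ",".join(
                f'{k}={v.strip(chr(34))}'
                for k, v in (pair.split("=", 1) for pair in labels.split(","))
            )
            lines.append(f"{measurement},{tags} {name}={value}")
        else:
            lines.append(f"{measurement} {name_part}={value}")
    return lines

sample = 'http_requests_total{method="GET",code="200"} 1027\n# HELP ignored\nprocess_cpu_seconds 12.5'
print(prom_to_line_protocol(sample))
```

Both formats carry the same information (a name, labels/tags, and a value), which is why consuming Prometheus metrics without Prometheus itself is straightforward.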
Marc: Yeah, so that totally makes sense.
You're like obviously going to use your own product to collect that data whenever possible.
Rick: Yeah. We're super good at it, so yeah, why wouldn't we, yeah.
Marc: Yeah. It does exactly what you want and you have a team of expert developers who can make it do what you need it to do.
Rick: Yeah. Exactly. Although, I don't know, I'm not a huge fan of the Prometheus metrics, maybe it's because of my job.
I get why they're so into histograms and everything, but a lot of times when I'm dealing with problems, a specific customer has a specific complaint, and all I can say is, "All I know is 99% of our queries are fast enough."
That doesn't satisfy them very well. They're like, "Well, mine wasn't." You know what I mean?
Then Prometheus is like, go to your logs when that happens.
But I spend less time with the Prometheus metrics than the other metrics we collect.
Marc: I mean it's probably a whole different conversation, but like with that type of a problem, 99% of our queries are fast enough, but mine wasn't.
Like it probably presents a really unique challenge when you try to define what an SLO looks like when every query just has to be fast.
Rick: Right. Yeah, yeah, and also like the SLOs per user, right?
So what matters is whether 99% of their queries are fast; at least that's the way they're going to interpret it.
Marc: Yeah. That's what's important to me if I'm an Influx user.
I don't care how many queries overall for Influx were fast.
I want to know how many of mine were.
Rick: Exactly, exactly, yeah.
So we track that quite carefully and we're always optimizing our queries, et cetera, so we keep really close attention on our performance metrics for that reason.
Make sure that they're always going in the right direction.
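The gap between a global percentile and a per-user view can be shown in a few lines: a global p99 can look healthy while one tenant's queries are all slow. The latency numbers and user names here are invented purely for illustration.

```python
# Illustrates the per-user SLO point: a global p99 can look fine while one
# specific user's queries are all slow. Numbers are invented.

def p99(values):
    """Naive 99th-percentile: the value at the 99% rank of the sorted list."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, int(0.99 * len(ordered)))
    return ordered[idx]

# 995 fast queries from most users; 5 slow ones, all from user "b".
latencies = [("a", 20)] * 995 + [("b", 900)] * 5

global_p99 = p99([ms for _, ms in latencies])
user_b_p99 = p99([ms for user, ms in latencies if user == "b"])

print(global_p99)  # 20  -> "99% of our queries are fast enough"
print(user_b_p99)  # 900 -> but every one of user b's queries was slow
```

Slicing the same metric per tenant is what turns "99% of our queries are fast" into an SLO a specific customer would actually recognize.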
Marc: So another question. When I think about running Influx Cloud, I'm putting my data into a managed SaaS service that you're running, and recently there's been this massive wave of supply chain style attacks. The most famous one that we know of is the SolarWinds Orion breach, but there have been more and more.
Then there are ransomware attacks, where somebody gets into your infrastructure, encrypts your data, and demands cryptocurrency, millions of dollars, to give you the keys to decrypt it.
So if I'm putting all my data into Influx Cloud, one question that I'd probably ask is: what tools and what processes are you using to mitigate that risk for me?
Rick: Well, thinking about ransomware specifically, for data that you have in our multi-tenant SaaS service, first of all, the data that we keep is all encrypted by the cloud service provider.
So we're just very careful to make sure that if somebody can get your data, it's just going to be encrypted.
So they're not going to be able to read it.
So nobody can really steal and look at your data per se, and then we have all the layers of security to keep somebody from breaking in.
In some cases that's a problem, because we take it pretty far.
So for instance, I cannot impersonate you. Like if you're a customer, even though I am the VP of engineering, I cannot get into your account unless you explicitly invite me. Like there's literally no way for someone else to look at your data unless you've lost your token.
Unless somebody stole your token from your own infrastructure. But in the cloud, what we do is, we actually hire a company that does periodic sweeps of our APIs, and they go over them with a fine-toothed comb to look for any kind of way that someone can crack in through our APIs, and they don't typically find serious issues.
They never found anything that would like lead to a ransomware type attack, but we do put a lot of work into making sure those stay really clean.
They tend to be more like, if a user's using a certain kind of browser and somebody has the following list of data, they can spoof a website that could trick the user into turning over their data. Then we mitigate those.
We have two security teams actually.
We have one security team that's all about SOC 3 compliance, who engages with all the external vendors that help us keep the system secure and all that kind of thing.
They advise us on all of our practices, and then we have this like just amazing developer internally who all he does is focus on security issues.
So if anything comes up, if anyone reports it, it goes straight to him, but he's also very proactive.
He's also on the CVE mailing list, so he has secret insider knowledge when there's an exploit that's known in the Linux community.
He can't tell us about it but he can know about it and think about it.
Marc: You can patch your cloud services for customers before it gets unembargoed, right?
Rick: Yeah, exactly. He can't really talk about that stuff, but you can kind of tell.
I worked with him on Ubuntu and it was really obvious then. He'd be like, "I'm not doing any of my deliverables this week," and we'd be like, "Okay."
Marc: Something's up.
Rick: But yeah, I'm not really a security expert myself, but with the cloud product I would ask myself as a customer: do you trust Jamie and Peter and our security team, or do you trust yourself to keep your data secure?
It's probably a better bet to go with them. But recently what we've been really worried about, more than ransomware attacks, are these supply chain attacks like SolarWinds, and we do build InfluxDB open source, which people can download and put on their servers, et cetera.
So that feels more concerning to me.
That feels like a juicier target for people.
So we stopped a lot of development and went through all of our GitHub repos, and Jamie, the developer I was telling you about, exhaustively searched every commit, because GitHub keeps everything forever.
He searched everything for anything that looked like a secret. So we went through just a month of just scrubbing secrets and rolling secrets and then also putting into place practices to never put secrets in GitHub again.
I've done it by accident myself, so now, as soon as somebody does leak a secret to GitHub, and it hasn't happened since then, but if somebody did, it would alert us, and then we'd know to roll it right away, et cetera.
So we look at those layers of security, like making sure our APIs are secure and making sure nobody's hijacking our code.
We look at that pretty seriously. Oh, we also went through and added checksums everywhere.
So now, well, I'll say we use it as a best practice: if we're using a third party dependency, we will make checking the checksum part of our toolchain, to make sure nobody can slip anything in there.
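The checksum-pinning practice described above can be sketched like this: record the expected SHA-256 of a vetted artifact, and fail the build if a later download doesn't match. The "artifact" here is just in-memory bytes for illustration, not a real release.

```python
# Sketch of checksum-pinning a third-party dependency: a tampered artifact
# produces a different digest, so the build stops. Illustrative only; the
# artifact here is in-memory bytes rather than a downloaded release.
import hashlib
import hmac

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def verify_artifact(data: bytes, expected_hex: str) -> bool:
    # Constant-time comparison of the computed digest against the pinned one.
    return hmac.compare_digest(sha256_hex(data), expected_hex)

original = b"third-party-release-1.0"
pinned = sha256_hex(original)  # recorded at vetting time, committed to the repo

print(verify_artifact(original, pinned))        # True: untouched artifact
print(verify_artifact(original + b"!", pinned)) # False: tampered artifact
```

Because the pinned digest lives in the consumer's own repo, an attacker who compromises only the upstream download cannot slip a modified artifact into the build.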
Marc: Yeah, that way you can publish like a good bill of materials for everything and I think you're in a unique place where obviously if I'm taking that binary from Influx and putting it on my servers, it's a database, so it's going to have access to sensitive data or whatever data I write to that.
But there's no way any sane person would say, "Oh, I'm not going to trust any vendor here. I'm going to go build my own database."
So following those, you mentioned SOC 3 and the regulatory and compliance requirements of large customers that are depending on it, and it sounds like you have a really formal, mature, and well thought-out security process.
Rick: Yeah, exactly, and it's to the degree that I don't actually have to worry about it that much.
I do worry about it. Any time I see anything on Hacker News about whatever exploit, I'm like, "Oh man."
Marc: Yeah. Which is like every day these days.
Rick: Yeah. Yeah. I mean I worry more about--
We also use Sealed Secrets, so if somebody got ahold of our GitHub repo for the cluster itself, it wouldn't be a catastrophe, because there would be no usable keys for them in there, and we keep everything in Vault.
What I do worry more about is just our customers' own applications and their own practices.
I feel like we do have an obligation of making it easier to default to having a secure application.
If you get a token, a read token from InfluxDB, and you publish it to GitHub by accident, it's not necessarily our fault; no one's going to hold us accountable that somebody did that.
But I'd still like to make it as easy as possible for that not to happen or for you to like easily invalidate and roll that token, et cetera, if that makes sense.
So we publish best practices for how to manage your tokens, and sample code.
Marc: So you talked a lot about the challenges.
One kind of last question for you here, you talked a lot about the challenges and like the work that's involved in running Kubernetes and Istio and everything that you're running in a cloud-native ecosystem today.
I'm curious, like from a really high level, kind of looking back at the way everything is, is it worth it?
Like is the value you're getting for the effort you're putting into it, like would you do the same thing again if you can make the decision today?
Rick: That's interesting. I was having that as sort of a shower thought this morning.
So what I think is, if we went back in time and I was there, which I wasn't when it started out, if I knew then what I know now, we would absolutely use Kubernetes again, but we would make many fewer mistakes along the way and would have been way more efficient.
Marc: More direct path.
Rick: Yeah, yeah.
So I don't think, if I knew then what I know now, that we wouldn't have used it again.
If we were starting the project now, I think we would still use it.
That said I think some of the parts, we may not have put into Kubernetes, at least not to start.
Because it's a database, so it's stateful and Kubernetes hates stateful services.
Then there may be some other things that maybe we would, maybe we would not put into Kubernetes.
Especially around tenant isolation of queries.
Like ways to reduce noisy neighbor problems and that kind of thing.
We might look for other solutions for that but like I like to say, the early bird gets the worm, but the second mouse gets the cheese.
I think right now is kind of the second mouse part.
Now that we as a company have adopted Kubernetes at scale and know how to run it at scale, et cetera, we're probably getting good value out of it.
If we were starting it now, I think that the industry or at least the ecosystem around Kubernetes has matured to the point where it would probably be easier to get started, even if you were running like as large of a process as we are.
Marc: Yeah, I mean, when you started, you used a phrase, I think, and we hear it a lot.
Like, we made the bet on Kubernetes years and years ago, and here it is, 2021.
It's like it's an obvious bet but it certainly wasn't years ago.
There was lots of like-- Kubernetes was really, really early.
There were other schedulers that were out there competing. Honestly, probably in your world, Mesosphere had some interesting work in the stateful set stuff that was probably pretty tempting for you.
Rick: Yeah. I don't know. I wasn't there for those discussions but yeah.
I mean back then Docker was still like a thing. I don't know, I would say this.
Most of the demos that I've seen from Kubernetes, and like, "Oh, look at this cool thing. We want to get this cool thing into the CNCF."
The domains that they're talking about and the demos that they do are, in my opinion, often relatively trivial.
Sure, if you have like a trivial problem, a lot of solutions will work for you.
I think when you get to really non-trivial problems, I'm not so sure Kubernetes is a no-brainer for everybody.
If you look at just the maturity of running VMs on AWS, I wouldn't write those off necessarily, depending on the problem space.
But for us, multi-cloud, cloud native, scalability, I think we would still land on Kubernetes as the best solution to our problem.
Marc: Yeah, that's a good point. That common abstraction, that common API that you talked about earlier is super compelling.
So hopefully what we see is Kubernetes becomes easier, the complexity is removed, and it becomes easier and easier to run, where you actually start to get that value without as much of the cost of maintaining the infrastructure, because you have quite an investment in maintaining that infrastructure to run Influx Cloud today.
Rick: Yeah. I've mentioned before, there are companies like Replicated and Nobl9, companies that are not Kubernetes vendors and not Kubernetes tooling vendors per se.
A lot of these companies, I think, are offering compelling solutions to problems that you would have with Kubernetes, without making you add to your operational workload, your operational overhead.
So you don't have to go and build your own CD system. There are solutions.
You don't have to build your own SLOs.
You don't have to build your own on-prem pipeline and solution for that, and you don't even have to host it.
You don't even have to operate it.
So that to me is a really promising sign about getting into Kubernetes, because there are other projects that you can opt into that will maybe cost you money but will save you a ton of time and be really worth it in the end.
So I think that's a good sign of maturity.
Marc: Yeah. Best of breed, there's a good path, and if you need a time series database, Influx Cloud is the best of breed there, right?
Rick: That's what I think. Yeah.
Marc: Cool. Rick, I really enjoyed the conversation.
I learned a ton from your perspective as an end user: the challenges, the successes, the hurdles, the day-to-day of actually running and managing both an engineering team operating Kubernetes and the infrastructure itself that's necessary to run it. I really appreciate your time.
Rick: Yeah, any time, and of course, my whole team, every single person on my team knows 10 times more than I do about it, so feel free to join our community Slack and ask them about it if you want to get real details.