In episode 23 of The Kubelist Podcast, Marc Campbell and Benjie De Groot speak with Michelle Nguyen and Natalie Serrino about Pixie, a CNCF sandbox project that provides Kubernetes observability for developers.
Marc Campbell: Hi, and welcome to another episode of the Kubelist Podcast.
I'm super excited to spend the next little while talking about eBPF and digging into the latest in observability with two of the maintainers of the Pixie project.
As always, Benjie from Shipyard is here today too. Hey, Benjie.
Benjie De Groot: Hey, Marc.
Marc: So, let's just dive right in with some quick intros.
We have Natalie Serrino and Michelle Nguyen, both founding engineers at Pixie, now part of New Relic, on with us today. Welcome.
Natalie Serrino: Hey, thanks for having us.
Marc: So, I'd love just to start with some quick intros.
Michelle, can you tell us how you got started with Pixie Labs and with observability in the cloud-native ecosystem?
Michelle Nguyen: Yeah, of course.
So, I am actually the first founding engineer at Pixie Labs.
So it's a funny story. The CEO of Pixie Labs had actually worked at the company I was at previously. We didn't actually overlap while we were there, but he'd heard about me from somebody he kept in contact with.
And so he reached out to me and was like, "Hey, I'm doing this thing and we've heard a lot about you and we'd be excited to have you on board."
And yeah, he described the problem that Pixie was trying to solve.
And I was like, "Yeah, that sounds like something that people run into every day."
And I was basically like, "Let's do it." And so that's kind of how I ended up at Pixie, just starting off at the very beginning.
And we had our office with just me and Zain hanging out for the first few weeks and months.
Marc: That's awesome. I think we're going to dive into that origin story of Pixie here in a minute.
Before we do, Natalie, I'd love to hear your intro, your background story, and how you got over there.
Natalie: Yeah, for sure.
So, I guess chronologically, I started out as someone more focused on hardware in college and worked at Intel right out of college.
But then I kind of realized that I was more interested in working in a startup environment where things move really fast.
And I wanted to work on software where you can ship something to someone that same day, whereas in hardware, it might take actually years to reach a user.
And so I made that transition and started working in the data space and that's where I met Michelle and Zain.
And that was really great to work together at Pixie again.
And I think that along the way I tended to join companies that were working on a problem that I personally had had before.
And so I think with Pixie, as soon as I saw it, I knew that it was doing the thing that I would've benefited from as an engineer when debugging applications.
And so that was what got me really excited about it because it was a tool I wish I had had.
Marc: Yeah, that's cool. It's always fun when you're working on a project that you're passionate about and that you can actually use and want to use.
It makes you as an engineer really more involved in the product side of the project too.
Natalie: Yeah. And that's a really good thing about working on something that is targeted at developers, because we can dogfood our own application every day.
Marc: That's true. So, let's talk about the origin story of Pixie.
You were both there in the early, early Pixie Labs days.
Michelle, you mentioned that you met with Zain and kind of helped at the very, very beginning of the project.
Are there any stories that you can share about how the project was created?
Michelle: Yeah. So, to be clear, I'm not a founder of Pixie, but actually Pixie was started by Zain and our other co-founder Ishan.
And I think it was just an idea that they'd been talking about for a while and they were like, "Yes, this needs to happen."
And so, they just wanted to go and reach out to the right people, like me and Natalie, to go and build this whole thing together.
Benjie: So, wait guys.
Before we go any further, will you give us a quick overview of what Pixie is and what it's trying to accomplish?
Natalie: Yeah, sure.
So, Pixie is basically an open source tool for debugging Kubernetes applications.
And we felt like when Kubernetes started taking off and microservices started taking off, it decoupled where your application is running from the application itself, and made it possible to scale up a lot of engineering teams working on features and shipping them in parallel to each other.
But it also introduced a lot of challenges with observing what is happening in your application, relative to the monolithic architecture that was more popular before.
And so what we noticed, just working as developers in our previous jobs, is that troubleshooting and debugging a distributed application is really challenging, and it's a different paradigm than the monolithic one. A lot of that challenge was in collecting and surfacing the right data, and we felt like that was a problem that could be made a lot smoother.
And at the same time this was happening, eBPF, which is a really cool Linux technology, was taking off.
And it had been around for about 10 years.
And there was this opportunity to use eBPF to automatically collect HTTP requests, network statistics, infrastructure metrics, and all kinds of data that previously required a lot of manual instrumentation, without the end user having to do anything.
And so that was one of the core things that got us to build Pixie in the first place: how can we collect and surface this telemetry data for microservices applications and make it really easy to work with?
Benjie: So the eBPF stuff is super interesting.
For those of our listeners that don't know what that means, can you give us maybe a high level on what it is and why it's so powerful?
Michelle: So one of our coworkers, who works most closely with eBPF out of all of us, has a good analogy that I think makes it really easy for those who have never heard of eBPF before. He says, basically think of it like a debugger. Right?
In a debugger you set breakpoints somewhere, and once the program hits that point, it stops and you can inspect the data and see what's going on in the environment at that time. And so the idea here is, think of eBPF as essentially a debugger.
So, with eBPF you can attach things to syscalls at the kernel level.
So for example, say you want to trace some read syscall. You can attach your probe, essentially your debugger, right there.
And so the difference here is that instead of stopping execution and letting you go and examine everything yourself in real time, it essentially runs a function.
This function can go and do anything that you want that is safe within the kernel.
And one of those things could be writing some of the values in the kernel right there to some buffer.
And then from then on, you can read the buffer, do whatever you want with that data.
But essentially, eBPF is being able to run these functions in the kernel and do that kind of thing.
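To make the debugger analogy concrete, here is a rough, purely illustrative Python sketch, not actual eBPF: a "probe" attached to a function runs a small handler that copies values into a ring buffer instead of pausing execution. The function names and buffer here are made up for illustration.

```python
from collections import deque
from functools import wraps

# Fixed-size ring buffer, standing in for eBPF's output buffer.
events = deque(maxlen=1024)

def probe(fn):
    """Attach a "probe" to fn: on every call, run a small handler that
    records the arguments and return value, without stopping execution
    the way a debugger breakpoint would."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        events.append({"fn": fn.__name__, "args": args, "ret": result})
        return result
    return wrapper

@probe
def read(fd, size):
    # Stand-in for a traced read syscall.
    return b"x" * size

read(3, 4)
read(3, 8)
print(len(events))      # 2
print(events[0]["fn"])  # read
```

The traced code keeps running at full speed; you inspect the buffered events afterward, which is the key difference from a breakpoint.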
Marc: So you're able to take these syscalls, and using eBPF, that's what Pixie is really doing differently than previous observability products that maybe had different ways of instrumenting the code or attaching to the process.
You're saying it doesn't matter how your code is deployed, it's going to run against the kernel, and we'll use eBPF to monitor what's happening at that level.
Michelle: Yes, exactly.
Natalie: And I think one thing to note is that we feel like Pixie has three core pillars in how it works, and we can dive more into the eBPF stuff, but baseline visibility into your system using eBPF is one of those pillars of Pixie.
Another pillar of Pixie is that the data that we collect, we store it locally on your Kubernetes cluster.
And that is good for a couple of reasons.
The first reason is it gives you a little bit more privacy because this sensitive data is not leaving your cluster, it just stays exactly where it is.
And then another pillar that we have with Pixie is that we wanted to make Pixie for developers.
And so, as a developer, I want to be able to run really flexible analytics on my telemetry data.
I want to be able to use an API and hook it into other systems that I have.
And so scriptability and API driven access is another pillar that we have.
And we've used that to build out some cool applications, like horizontal pod autoscaling on Kubernetes, or Slack bots, and things like that.
Benjie: Right. So the big thing going back to the eBPF stuff, because I think this is fascinating and I want to make sure I understand it all the way.
It's not a sidecar, it is not attached to a pod. It is in the kernel space, and with Pixie in particular it buffers out to a local data store on the cluster.
But this is not a sidecar, this is in the kernel space itself. Is that correct?
Natalie: Yeah, that's right. And one thing to note is that eBPF is really flexible.
It allows you to attach probes to kernel-level things, like the send and receive syscalls for example.
This is really useful because this is the lowest-level place where your system is making network requests or doing things like that, which means that no matter what language your program is running in, we will be able to intercept those calls and collect data using those probes.
That's in comparison with traditional language agents or something like that, which would be targeting very specific libraries.
And so it would be possible to miss something.
So this gives you really good baseline visibility.
But eBPF also allows you to trace the user space with U-probes.
And that, for example, is how we're able to trace encrypted traffic, because we are able to set up a U-probe on the TLS library.
And so eBPF actually supports tracing in the kernel space as well as the user space.
Marc: So, that's access to a ton of data between the K-probes and U-probes, and you get access to the pre-encrypted data with that U-probe you mentioned, and the data stays in the cluster.
That's great. Can you share what types of data do you actually collect?
What's the source that you start with?
Michelle: So we collect all kinds of data.
So I think as Natalie mentioned before, we collect request and response bodies by using the U-probes to trace SSL libraries, so we collect HTTP and HTTPS request and response bodies.
And even for different protocols, we're able to parse those things out.
So if you have Kafka running on your cluster, then we can actually tell you, what are the messages flowing in your cluster, both the response and the request bodies?
We also collect just basic metrics about your system.
So your CPU usage, your memory.
Also, one of the things that I really like is that we get to use eBPF to do profiling.
So essentially we have a profiler that's always running on your system, and it just samples what your application is currently doing.
And with that, we collect a sample of where your application is spending the most time.
And we can use that information to present you a flame graph so you can go and be like, "Oh, I'm spending a lot of time in this function and I don't expect to, so let me go and optimize it."
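As a rough illustration of what that sampling profiler produces, not Pixie's actual implementation and with made-up function names, each sample is the call stack captured at one tick, and aggregating the samples tells you where time is going:

```python
from collections import Counter

# Each sample is the call stack seen at one sampling tick,
# outermost frame first (function names are invented).
samples = [
    ("main", "handle_request", "encode_json"),
    ("main", "handle_request", "encode_json"),
    ("main", "handle_request", "query_db"),
    ("main", "gc"),
]

# Count how often each full stack was observed; this
# (stack -> count) mapping is what a flame graph is drawn from.
stack_counts = Counter(samples)

# "Self" time per function: samples whose innermost frame is that function.
self_counts = Counter(stack[-1] for stack in samples)

hottest, n = self_counts.most_common(1)[0]
print(hottest, n)  # encode_json 2
```

Because it only samples, the overhead stays low, and the flame graph is just a visualization of these stack counts.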
Marc: Something that I'd love to understand a little bit better. Pixie Labs built this really great standalone project.
It's a CNCF sandbox project now. You went through an acquisition and you're now part of New Relic.
And New Relic I'm sure adds a ton of value in lots of different ways, both through the product and the engineering team and in the vision.
That's a lot of data that you collect. What can I do with just Pixie?
When do I need integrations into third-party systems?
Where do I start to get value by integrating multiple systems versus just running the Pixie project by itself?
Michelle: Yeah. So with Pixie, what we try to do is give you debuggability over the last 24 hours, because most of the time when you are trying to debug a problem in your system, you're doing it because the issue is happening right now. You probably won't go, "Oh, my latency was super slow last week."
I mean, you might. But Pixie is kind of designed for just that real time use case.
So we really only guarantee the last 24 hours of data when you're running Pixie.
And that is in part because we are storing everything in memory, and it's unsampled.
So we don't want to retain too much data in your cluster that you may not necessarily need. Pixie is kind of more for that 24-hour timeframe.
So where integrations like New Relic come in is that we can ship long-term data to New Relic for further downstream processing or storage, or we might also use New Relic for features like alerts, for example.
Pixie, we don't have alerts built out yet because we know that there's a lot of great players out there who have already thought about that problem.
And so we kind of let other integrations handle that kind of thing.
Natalie: Yeah. I think the important thing to emphasize here is that Pixie is a fully open source CNCF project.
And what that means is that there's no requirements to use a third party vendor to use Pixie.
Now, it is possible to use our API to integrate with other applications.
We've seen people write an exporter to various other environments.
Kelsey Hightower, who's one of the members of Pixie's board, actually once did a demo where he exported Pixie data to Prometheus.
We also have a plugin to use Pixie data in Grafana.
So, New Relic has an integration with Pixie.
And so that is a really good way to get some of the features that, like Michelle was saying, Pixie doesn't offer, like long-term storage. But you can just use Pixie today, out of the box, fully open source.
And you never have to actually use a third party vendor if you don't want to.
And we have a UI with different views of your cluster that comes part of that.
Marc: So the value that the core Pixie project is trying to provide is collecting data automatically, using eBPF to do this without code instrumentation or anything like that, and retaining, you said Michelle, 24 hours of data.
It's very useful for debugging a current problem, like why is this thing happening right now?
Michelle: Yes, that is exactly correct.
And I think as a whole kind of what we want to use Pixie for is to establish baseline visibility into your system.
So, right off the bat, without you having to do anything, you already get a clear idea of what your whole cluster looks like right now.
And so it's not just like, "Oh, I'm actually actively debugging a problem," but also just for you to take a look and be like, "Yes, okay. I understand this is what things are looking like, latency is still doing okay overall."
And yeah, it also gives you visibility into which services are talking to which.
Because it's funny we've actually run into some people who are like, "Oh these services, they should definitely not be talking to each other. There's a bug with Pixie."
And then they realize, "Oh wait, actually these are talking to each other, and they're not supposed to."
So kind of giving you information about that, not just like, "Oh, things are on fire right now let me try to go and see how I can fix it."
Natalie: Yeah. I think with the UI, one thing that really motivated me when we were designing it is that we have all of this unsampled data for the past 24 hours, like Michelle was saying.
So we have things like all the SQL queries that you ran, or the HTTP requests that you have, or profiles.
And I think one place that I can get frustrated with a lot of observability tools is I see a chart, I see a time series of request latency or something like that, but I don't really have a way of drilling down into individual requests or individual things that happened.
And so I think one thing that is cool about doing the unsampled way of computing all of this stuff, is that with Pixie UI, you can drill down from really high level views of the system and diagnose very specific problems.
So, we for example use a profiler to diagnose when we've had a performance regression where we're using too much CPU in a subsequent release.
And so you can start off looking at the high level details of the deployment and then drill down and see that this specific function is taking longer than it used to.
Benjie: So guys, just real quick. I'm at the GitHub project page and I see the PxL domain-specific language here.
Tell us about how we can use that with the data and what that stuff's for.
Natalie: So, yeah.
I think that's something that we're really passionate about because we want to not only collect this data, but we want it to be possible for people to write queries, to answer specific questions that they have.
Because we don't want there to just be canned results where you can't go into something really specific.
Because a lot of times when you're debugging a problem, there's something in particular that's going wrong, and you want to isolate that particular type of thing.
And so PxL is basically Pixie's query language for working with data.
One goal that we had with making PxL, is that we didn't want people to have to learn another query language. There's a lot of query languages out there and it can be burdensome to learn all of them. And so what we wanted to do is make PxL completely compliant with Python. So it looks just like Python and we also didn't want to reinvent the wheel.
And so we wanted to follow in the footsteps of Pandas, which is a really great data analytics library for Python, and just follow the Pandas syntax for doing things like aggregates, joins, filters, and things like that.
And so using PxL, you can say things like, "Hey, what were the requests that this pod got in the past five minutes?"
But you can also say things like what was the most frequent value for this field in my HTTP request?
Or any kind of custom analytics that you would want to run, you can use PxL to do that.
And you can pipe these PxL scripts into our client API and hook it up with other systems in a programmatic way.
But PxL is the thing that backs all of the views in our UI, so you can customize or extend those to fit your unique needs as well.
Benjie: So, PxL runs on my cluster like there's no client?
Well, I guess the client is on the cluster where the data is stored as well.
So it's basically an API with a nice Pythonic scripting language to get all these insights that you guys are collecting for us for the last 24 hours. Is that right?
Natalie: Exactly. Yeah. All the queries run directly on your cluster.
Benjie: That's really cool. So, have you guys seen some--
Through the community engagement around the PxL language, have you guys seen some contributed scripts, if I wanted to go look at some samples?
Is that an option or is that not something that's been developed too much yet?
Michelle: Yeah. So our scripts, we actually have-- They're completely open source as well.
And we've had a lot of interesting use cases from a security standpoint for example.
So using Pixie data to go and figure out if there are like SQL injections occurring in your system.
So, that's just an example of some of the scripts that we have contributed.
But yeah, everything is completely open source, and we definitely welcome use cases from everybody. Because I think at the end of the day, the nice thing about this community is, yes, your application is different from the other person's, but at the core of it, you launch and monitor a lot of the same things.
Natalie: Yeah. One thing that's really cool is that we recently had a guest blog post from a user in our community.
So we said that PxL is the language for querying data in Pixie, and that's true.
But one thing that we didn't add is that you can actually use PxL to get Pixie to collect new data sources without redeploying your application.
And one of the ways that we power this is that we support running bpftrace scripts in Pixie, and you can put that as part of a custom PxL script.
And so we have this one user in our community who wrote a great blog post about how he wrote bpftrace scripts and deployed them with Pixie to collect custom data sources in his cluster.
Marc: That's actually super cool. I'm actually curious when we talk about community and use cases and stuff like that.
There's a lot of observability tools out there and Pixie seems to have a pretty unique and special take on like the Kubernetes cloud-native way of approaching it using modern technology.
Has anybody shared any stories that you're able to share, obviously redact names and everything like this around, "Hey, this long running problem that we've been fighting for a long time, Pixie was able to finally uncover and help us figure out what the actual root cause was."
Natalie: So, it was kind of funny at KubeCon, because we were talking to a bunch of people that were using Pixie, and we hadn't ever interacted with them before.
So it was just a sign that we were kind of in a different phase of the adoption compared to the really early days where we intimately knew every user.
So, they came to us and basically said that they kind of use the profiler that we have in a similar way that we do, which is they run the profiler on their various releases and then are able to diagnose performance regressions from release to release before it happens in production.
And as Michelle said before, we have had a lot of cases where people have realized that this thing shouldn't be talking to that thing and they realized it just by using PxL and seeing the communication between the resources and their cluster.
Marc: Yeah. That seems like really super powerful functionality that you get by looking at the kernel level and not instrumenting code.
So, somebody is accidentally connecting to something, or the architecture sometimes grows organically, and you want a team, an SRE team or a platform team, to look at it and say, "Hey, this isn't the way that we should be moving this forward."
And giving them the ability to diagnose and discover that is really, really powerful to offer at the platform level.
Natalie: Yeah. And we can't name names or anything like that, but there are a lot of cool customers that we have or users that we have rather who their Kubernetes cluster is running at the edge.
It's not a traditional cloud-based application.
So they have edge devices running in some particular physical context.
And Pixie is a really good fit for those use cases because they can run it directly on the cluster.
And there isn't this necessity of exporting gigabytes of data, phoning home to some remote cloud.
And so that's been an architecture that we're particularly excited about, like people using Pixie on the edge.
Marc: So I have multiple clusters in my organization, Pixie running in all of them, does it kind of exist inside the cluster so I can obviously monitor the pods and the activity that's happening there, but when we start talking about network activity and who's talking to what, is it really internal to the cluster network that Pixie is focused on, or also ingress and egress traffic?
Michelle: So, yeah. We do track traffic within the cluster itself.
But for example, you mentioned having an account with multiple clusters that are running Pixie. We can't tell you, "Oh, this pod in this cluster is talking to this pod in that other cluster."
It would just be, "This pod in this cluster is talking to this IP."
So we haven't resolved that yet. We do have some basic DNS resolution, but for the most part we show you just IPs if it's something that's external to the cluster.
Marc: Got it. Let's go back to the challenges originally when you built it.
Creating this project from the ground up using eBPF, there were not a ton of resources available, and it wasn't a technology that everybody was intimately familiar with.
And there were probably a lot of learnings that you had to go through along the way.
You were both there at the beginning and you went through that process.
Can you tell us, what were some of the biggest technical challenges involved in getting Pixie to actually work?
Michelle: Yeah. So, I think Natalie and I can better speak to just everything else on top of eBPF.
So the team member that I mentioned before, he is the one who can speak most deeply about eBPF.
I mean, I do know some of the problems they ran into in the beginning there, but I can't speak to it as well as he can.
So maybe we can talk a little bit more about what are some of the problems that we ran into when just thinking through the architecture for Pixie for example, right?
Natalie mentioned earlier that the use case we strive for is being able to do everything on the edge, storing everything in memory.
And so one of the problems that we've run into there is, how do we efficiently store all this in memory?
We're collecting so much data. Right?
So, if we want to actually retain 24 hours of data, then we need to basically build a very good data model for this.
And so that's some of the things that we did very early on at Pixie, was trying to figure out what that data model looks like, how do we compress that data?
And what tools should we use for that?
And so I think some of that foundation that we've built early on in the beginning has allowed us to just be able to collect and store all this data for 24 hours.
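One simple way to picture the time-bounded, in-memory retention Michelle describes is a buffer that evicts anything older than the window on each append. This is a hedged sketch of the idea only; Pixie's real data model also has to bound memory and compress the data:

```python
from collections import deque

class RetentionBuffer:
    """In-memory store that only retains the last `window_seconds`
    of records, illustrating a 24-hour retention policy."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.records = deque()  # (timestamp, payload), in time order

    def append(self, ts, payload):
        self.records.append((ts, payload))
        self._evict(now=ts)

    def _evict(self, now):
        # Drop everything older than the retention window.
        cutoff = now - self.window
        while self.records and self.records[0][0] < cutoff:
            self.records.popleft()

buf = RetentionBuffer(window_seconds=24 * 3600)
buf.append(0, "old request")
buf.append(3600, "an hour in")
buf.append(25 * 3600, "a day later")  # evicts the record at t=0
print(len(buf.records))  # 2
```

Because the records are kept in time order, eviction only ever touches the front of the deque, which keeps appends cheap.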
Natalie: I think that, getting started, one of the main problems that I think about with Pixie is that the fact that we run so much in the Kubernetes environment means that we have to do both the data collection side that most tools do, as well as the processing side that most tools do in a cloud environment, and put both of those things directly on the cluster.
And a lot of times these are people's production clusters.
It's very important that we are able to execute this stuff in a really safe and performant way because otherwise you could disrupt someone's production application.
And so I think that one thing we focused a lot on was CPU, and we often hear the concern from people like, "Okay, if you're using eBPF, does that mean that it's just going to use a ton of CPU and hog up a lot of those resources?"
And so it was important to us to guarantee to our users that we would use less than 5% CPU, for example.
And so that takes a lot of engineering effort to not only build this application, but optimize it to the point where it can run on these sensitive production clusters.
Marc: There's a lot of engineering effort just to be able to actually confidently make that guarantee of less than 5%. I mean, it sounds cool.
It sounds like some of the stuff that you were talking about earlier, where you're using Pixie to look for regressions in performance with the next release of Pixie, so that you can confidently know that you're still staying below that advertised threshold.
Natalie: Yeah, for sure.
And even though another person focuses more on the eBPF side than Michelle and I do, I know that it was initially a challenge to get the data instrumentation working on encrypted traffic, for example, and that took engineering effort to figure out, because you can't just use K-probes for that; they're only going to get the encrypted data.
And we wanted to show the decrypted data to the user because that's what's much more interesting.
Marc: Yeah. Otherwise you're just able to say it's talking to--
There's activity, but without any details it's basically just noise; there's something happening, but you can't take any action on it.
Marc: I know that you're-- It's a CNCF project right now and you're really focused in the Kubernetes world.
Is it a requirement though for Pixie to run inside a Kubernetes cluster?
I mean, eBPF is obviously, that's just the Linux functionality.
Do you have any use cases or support running it just on non-Kubernetes environments?
Natalie: Yeah. So right now Kubernetes is really our area of focus.
I think that we will eventually want to be able to run Pixie on nodes that are not part of the Kubernetes cluster, but I think that for now there's so much to do with Kubernetes that it makes the most sense for a small team like us to stay really on target on a particular area. And so that's the one that we're focusing on right now.
Benjie: Okay, great. So I want to bring it back, over at Shipyard, we do use New Relic in fact.
And so I want to understand how we could use Pixie today and the integration with New Relic.
And then I also want to understand as a follow up to that, just high level, how was the acquisition, how's it going?
It sounds like it's pretty clear that Pixie is a completely separate open source project that's in the CNCF, but obviously, there's all kinds of transitions that happened when the acquisition happened.
So, I'd love to hear about how I can use Pixie at my company, and then also love to hear a little bit about that transition.
Michelle: So as I mentioned before, if you decide to use the New Relic integration, then essentially you're shipping a lot of the data that Pixie collects to New Relic for downstream ingestion into some of their other products, such as alerts.
So for example, maybe you want to alert on something specifically, or just be able to see Pixie data with long-term retention. As I mentioned, we've kind of focused on the 24-hour use case, so being able to store that data for a longer amount of time is what the integration gives you.
And just like being able to see Pixie data with just your normal New Relic offerings.
Benjie: And so that's just a different data provider in my New Relic console?
Michelle: Yes, that is correct. Yeah. Pixie is a separate data provider, so you'll be able to see that.
Benjie: So, then on the acquisition side, tell us just a little bit about how that went down and how that's working out.
Michelle: Yeah. I think we're super excited.
So essentially, New Relic made the decision that open source is very valuable to the observability community, because once this group all bands together, you're all solving the same issues, and being able to use an array of different tools to solve your observability needs is crucial.
And so they realized the importance of that and they saw Pixie just working on our own thing.
And so I think we were excited when they reached out with the offer of, "We'd love to acquire you, but we will give you the opportunity to go ahead and open source Pixie."
And I think for myself in particular, I was really on board with that, because one of the things that I thought was really special about Pixie before we were acquired was this community that we were able to build up very naturally.
We had this thing called our Pixienaut meetings, essentially monthly meetings where we met with a lot of people who were excited about Pixie, or were users of Pixie, showed them what we were working on, and got feedback from them.
And being able to have this strong bond with the community, and then continuing that with open source, where now everybody can freely see what we're doing, freely contribute, and give us feedback and ideas, has just been really exciting.
So it was kind of just like New Relic saying, "Yes, we think that open source is important and that this product, Pixie would be a great part of that ecosystem."
Marc: So in addition to open sourcing it, you took it a step further.
And actually it's now a sandbox project.
You contributed the project into the CNCF ecosystem, were either of you involved in that decision and the subsequent process that was involved in actually completing it?
Michelle: Yeah. So Natalie and I were very heavily involved in just going through that process like contributing Pixie to CNCF and just preparing our code base for that.
Because, let's be honest, before that we were a scrappy startup trying to build everything together, and there's a fair amount of cleanup involved when you're opening your code to the public, along with making sure that you follow all the guidelines.
Honestly, what really drove us to contribute our project to the CNCF was vendor neutrality, and making sure we followed all of those things as we went through the process.
Natalie: Yeah. I feel like it kind of sounds a little crazy on the surface.
Why would you acquire a company and then open source it?
But the more you dig into it, the more contrarian and ingenious you realize it actually is, because what we're really providing with Pixie is access to a really valuable data stream.
And when people instrument their code or put data pipelines in their Kubernetes cluster to run PxL scripts or things like that, right?
There's this sense that that needs to be open source because you are tightly integrating it with your application.
And it is really hard for people to have some kind of vendor dependency directly in their own code.
And the value that I would say observability vendors provide today is taking that data and making it useful.
But the data pipelines themselves, those things can be open source because the vendors are providing value downstream of that.
And so by open sourcing Pixie, obviously it's got a lot wider of an audience now, a lot more people can use it that wouldn't have been able to if it was a proprietary product.
And then New Relic can then take that data and make it really useful and integrate it within the context of all the other data that they collect for you.
And so I think that it is a really interesting symbiotic relationship and it shows kind of the future direction that we think that the observability space is going to go, which is more and more open source components in the application and on the edge and the value provided downstream of that.
Marc: Yeah. That totally makes sense.
I think open source for transparency and auditability 100% makes sense, because this is access to a lot of sensitive data when it's running in there.
But even when you think about long-term ownership and becoming a steward in the ecosystem, so that I can confidently...
Benjie was asking how would I get started?
But a follow-up question there would hopefully be: how do I know that I'm not going to make a big investment in adopting this tool, only to have to rip it out in three years because of an acquisition where you discard it, or something like that?
But it sounds like Pixie has put a lot of thought into that by open sourcing it and New Relic is completely on board and putting it in the CNCF helps really just shut those questions down.
Natalie: Yeah. I mean, it's all about trust. Right?
And I think that when a project is a CNCF project, that establishes a promise to the people that use it that it's always going to be open source.
We're not going to sneak up and change the license on you.
This is truly a foundation owned open source project that you can rely on.
And I think that it's a signal that this is something that we take very seriously.
Marc: Yeah. Nobody comes and just changes the license on an open source project these days.
Benjie: Well, the CNCF sure doesn't.
Natalie: The CNCF definitely does not.
Benjie: So, that brings up a good point that I want to understand.
How does Pixie fit into the OpenTelemetry ecosystem, and are you guys a part of that?
Are you guys contributing? Or how do you guys look at that?
Michelle: Yeah. So we're not direct contributors to OpenTelemetry, but we're honestly huge supporters of OpenTelemetry and the philosophy of what they're going for.
But essentially, from the Pixie point of view, we are collecting all of this interesting data, and there are so many tools that you could pipe Pixie to.
What OpenTelemetry gives you is an API for sending that data.
So for example, at Pixie we are working on an OpenTelemetry exporter.
And so if there were some tool that had an OpenTelemetry importer, then you could automatically start using Pixie with that tool.
So if everybody follows the same standard, then it's very, very easy to just plug and play with all of these different integrations.
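That plug-and-play idea can be illustrated abstractly (a generic sketch, not OpenTelemetry's actual API; all names here are hypothetical): once producers emit a common span shape, any consumer that reads that shape works without custom glue.

```python
from typing import Protocol


class SpanConsumer(Protocol):
    """Anything that can ingest a span in the shared format."""
    def ingest(self, span: dict) -> str: ...


# Two hypothetical backends that both accept the same span shape.
class JaegerLike:
    def ingest(self, span: dict) -> str:
        return f"trace {span['trace_id']} stored"


class VendorBackend:
    def ingest(self, span: dict) -> str:
        return f"indexed span from {span['service']}"


def export(span: dict, consumer: SpanConsumer) -> str:
    # The producer doesn't care which backend it's talking to,
    # because every consumer speaks the same format.
    return consumer.ingest(span)


span = {"trace_id": "abc123", "service": "checkout"}
print(export(span, JaegerLike()))      # trace abc123 stored
print(export(span, VendorBackend()))   # indexed span from checkout
```

Swapping backends requires no change to the producer, which is the interoperability win the standard provides.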
Benjie: Great. So if I wanted to use Jaeger, I could use Pixie to feed my Jaeger? Is that right?
Michelle: Yes, exactly.
I think the thing with OpenTelemetry that's so cool is that it's really a huge victory by the open source community, to say, "Hey, we want the data that is exported or created by all of these tools to be interoperable with other tools. And I want to analyze my data together no matter where it was collected."
And so it makes the analytics process so much better when you have that standardization.
And it's been really exciting to see the huge adoption that it's gotten among anyone working in observability.
And so we're really excited about that and big fans of the initiative.
Marc: Initiatives like that are often really, really hard, because there's a lot of legacy existing products that want to adopt it, and then new folks who want to change the spec a little bit.
That's awesome that it's actually working out.
I think we've had a different podcast where we dove into the OpenTelemetry project some, but it's great to hear that for Pixie it's actually a valuable and super useful tool.
Marc: I'd love to shift gears for a little bit here and chat about the roadmap.
You're in the sandbox right now, what is the team working on these days?
Natalie: One thing that we have coming down the pipeline is obfuscation support, basically redaction of PII.
So, in Pixie we do collect a lot of data, and maybe your organization doesn't want anyone who uses Pixie to be able to see that data.
So we're going to support redaction of personally identifying information, and we can support redaction of particular fields.
So, that is going to be a really cool thing coming down.
We are also excited to be able to support larger clusters.
Currently Pixie runs really well on clusters that are 100 nodes or less, but we want to support thousands of nodes in a cluster.
And so that is going to be a really exciting project that we're working on.
Benjie: Wow! That is some really cool stuff in there.
I want to ask you real quick about this PII thing.
I think Marc wants to ask you as well.
How are you guys going to do that inline?
And honestly, I guess that's valuable because if, for example, New Relic is getting piped the data, it needs to be sanitized before it gets over to New Relic in a lot of instances.
So, I see the use case there, but how in the world are you guys going to do that? Do you guys have an idea?
Natalie: Yeah. So, I think that there's different modes of it.
So for people that want the most privacy possible, we can make it so that we never expose the request and response bodies, because that's the way to guarantee that those bodies will never contain any PII: they're redacted in the first place.
The other thing that we can do is look for specific data types and basically parse those out from the bodies themselves for people that want to say, "Hey, just want to make sure no IPs come out, or no addresses or no credit card numbers," and things like that.
And so basically the two approaches here are: you can redact the entire body, or you can redact pieces of the body based on what category the PII falls into.
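Both modes Natalie describes could be sketched roughly like this (an illustrative toy, not Pixie's implementation; real PII detection needs far more robust patterns than these regexes):

```python
import re

# Illustrative patterns for a few common PII categories.
PII_PATTERNS = {
    "IP": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "CREDIT_CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}


def redact(body: str, full: bool = False) -> str:
    """Redact PII from a request/response body.

    full=True mirrors the first mode: never expose the body at all.
    full=False mirrors the second: replace matched categories in place.
    """
    if full:
        return "<redacted>"
    for label, pattern in PII_PATTERNS.items():
        body = pattern.sub(f"<{label}>", body)
    return body


print(redact('{"user": "a@b.com", "ip": "10.0.0.1"}'))
# {"user": "<EMAIL>", "ip": "<IP>"}
```

The trade-off is the one discussed above: full-body redaction is the only hard guarantee, while field-level redaction preserves more debugging signal at the cost of depending on the detection patterns.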
Benjie: But from a technical implementation standpoint, is this going to be a scripting language? As a consumer of Pixie, are there scripts that I'm going to write, maybe using the PxL language? What am I going to do?
Natalie: Oh, oh, I see. Yeah.
No, this would just be like a feature that you could turn on.
So you wouldn't have to script it to do that.
Benjie: Can you tell us a little bit about some of the challenges you deal with on larger clusters, versus the smaller ones that you handle pretty well right now?
Natalie: Yeah, for sure.
So basically we store all of the data in memory, and when you query and say, "Hey, what is my latency over time?" we need to actually go to all of the nodes in your cluster, read out all the requests they got, and compute the statistics on top of that.
And for a really large number of nodes and 24 hours of data, sometimes that can be a longer query.
And we want things to be really fast. That's a really strong value that we have with Pixie.
And so what we're going to be doing in the new year is basically saying, "Hey, why don't we run jobs that pre-compute statistics and summaries about this data, so that when I want to just get the high level view, I could immediately go access it?"
And the raw data is still there when I want to dive in there.
But for these common views that we access all the time, there's no need to reinvent the wheel every single time you look at it.
Natalie: And that would basically mean that the query doesn't fan out to every node every single time.
So, that's like one clear improvement that we can make.
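The pre-computation idea can be sketched simply: a background job rolls raw per-request records up into per-window summaries, so common views like "latency over time" read a tiny summary table instead of scanning raw data on every node (names and structure here are illustrative, not Pixie's code):

```python
from statistics import mean, quantiles


def summarize(requests):
    """Pre-compute summary stats for one time window of raw records."""
    latencies = sorted(r["latency_ms"] for r in requests)
    return {
        "count": len(latencies),
        "mean_ms": mean(latencies),
        # p99 via the 100-quantile cut points (index 98 is the 99th).
        "p99_ms": quantiles(latencies, n=100)[98],
    }


# A background job would run this per window; the dashboard then
# reads the small summary rather than fanning out to every node.
window = [{"latency_ms": float(i)} for i in range(1, 101)]
summary = summarize(window)
print(summary["count"], summary["mean_ms"])
```

The raw records stay available for drill-down; the summary only short-circuits the common high-level views.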
Benjie: Okay. And yeah, just to be clear, Pixie runs as like a stateful set on each node?
Benjie: Right. And by not querying every single node if you don't need to, that's kind of the path there.
Natalie: Yeah. So there's a lot of opportunity with summarization of data, better compression of data, enhancements to the query execution.
There's probably some performance improvements that we can make there, but we've been really happy because we built our query engine on top of Apache Arrow, and we found that it gives us incredibly good performance for working with large chunks of columnar data.
So big shout out to them for the work that they've done.
Marc: That's cool that you're able to use that. Let's go on and chat about the community for a little bit.
So since you've open sourced and joined the CNCF, and most CNCF projects have community meetings, can you chat a little bit about how effective they've been and how you are running them?
If I'm interested in joining and participating in the Pixie community, where would I go? How would I start?
Michelle: So this is basically the Pixienaut meetings that I mentioned before. We try to have them monthly, although we've had some conventions or conferences lately, such as KubeCon, so we've held those as more of an in-person meeting.
So essentially what we try to do there is just give people a status update on just what we've been working on lately.
So demos of latest features that we've had and then just kind of open it up for questions at the end for any questions people might have or just feedback in general.
Natalie: Yeah. We also have an active Slack community, it's really great to connect with people on there and hear about their use cases.
I think that as the community grows, we're going to be thinking about how we can facilitate some in-person meetups, safety permitting of course.
It's been so great to interact with everyone at KubeCon. We're going to be at KubeCon EU and would love to connect more in person with people, assuming that still happens. As the community grows more mature, I think that we would also like to have more local meetups as well.
Marc: Yeah. That's awesome.
As people are still coming back, I think that was one of the fun parts of the whole CNCF ecosystem before: all of the different meetups all over the place where you could go and learn about the projects.
So, you recently joined the sandbox. Have you started thinking at all about what the roadmap and timeline look like for applying to the incubation level as the next step?
Michelle: We've actually been talking about it for a bit.
So our hope, fingers crossed that everything works the way that we want, is to apply to be an incubating project sometime next year.
Marc: And do you have specific roadmap functionality that you want to land, or is it more about adoption and more end users actually using the product right now?
Michelle: Yeah, it's more adoption, and also being able to say that this project is in a good state to be an incubating project and that we follow the guidelines.
And like Natalie said before that we continue our promise of being vendor neutral and being open source.
Marc: If there are folks listening to this who are running Kubernetes clusters right now and they're interested, what's one thing they could do that would help that story the most? Is it more folks in the community meeting, more documented end-user cases of actually running it in production, or contributors in the code?
What would be the biggest asset to you?
Natalie: Oh man, there's a lot of good opportunities.
I mean, we have people helping us by contributing new protocols that we can collect, for example.
So what I would say is that we have a list of protocols in our docs.
If there's a protocol that we don't support, maybe a SQL protocol or another database protocol that you use, we really welcome contributions from the community for that type of thing, so that everyone can benefit from it.
We like to do guest blog posts as well from people about, what are their use cases?
What have they found Pixie useful for? What observability problems have they run into in their work?
So, definitely things like that. And also just help us find places that we can improve the project.
After a while you look at something for a really long time, maybe you don't see it with the fresh eyes.
So it's always super helpful to hear feedback from the community on what they would like to see and what they think is going well.
Benjie: So, you mentioned that you guys are going to be adding support for Java.
So if I wanted to contribute, I don't know, Rust or something to that effect, is there a good opportunity for contributing, I guess at the PxL language level?
Natalie: Yeah. So that would actually be in the data collector part.
PxL is downstream of that; it just queries data tables that get registered in your system.
I want to quickly clarify the Java support thing.
If we're talking about tracing HTTP requests or even JVM statistics, we actually do support that today.
But specifically, Java support refers to the code-level continuous profiler, where you can see which functions are taking the most time.
So let's say you wanted to contribute the extension of that profiler to other languages besides the ones we support today.
That would be a really cool contribution. And we'll probably get you pretty deep into some eBPF code if that's of any interest to you.
Benjie: Great. Is there a plan to inject things using eBPF anytime soon?
Or is it just a read-only kind of opportunity for this software?
Natalie: Yeah. We made this decision really early on, actually, that we don't want to modify requests that are being sent in the system.
We want to just be read only.
And so I think that there are a lot of really cool injection use cases for eBPF, and we've seen a lot of other cool projects do stuff along those lines, but for Pixie, we wanted to leave it at strictly observing what's happening and not mutating it.
Benjie: Great. That's a great distinction there.
Marc: I have one more quick question.
I'm curious where the name Pixie came from, either of you involved with naming it, or have the story there?
Michelle: Well, Natalie and I were not involved in naming it, but the idea is that Pixie is supposed to be able to provide magical experiences to people.
So, that's kind of how that name came to be.
Marc: Got it.
Benjie: So tell us, is there something around the corner with machine learning and Pixie?
Natalie: Oh, I'm so glad you asked that.
I think that's something that is really exciting for us in terms of both features that we have running today and future features.
So, a really core part of data science is the ability to run machine learning models.
And because many of us come from a data background, we wanted to be able to do really sophisticated analytics on machine generated and telemetry data.
And so, one thing that we did pretty early on is make sure that you could run TensorFlow models in PxL and make that just part of your PxL script.
And so that is a capability that we have today if you want to run a custom model in PxL.
But there's a couple of use cases that we have that I'm pretty excited about, because it's always annoyed me when I didn't have this in the past while debugging something.
One of them is, and this is a very simple application of a model, but we basically say, "Hey, we know the request path for an HTTP request."
Right? But a lot of times that request path has a URL parameter in it, like a wildcard or something like that.
And sometimes you can end up with a series of wildcards in a URL.
So one cool feature that we have is actually clustering requests by their logical endpoint, which we learn in the system from the examples that you give us.
And so when you use Pixie, you can actually drill down into particular endpoints and how they're performing.
And this is exciting to me because a lot of times you'll look at the average latency for your service and it looks perfectly fine.
But once you drill down into a particular endpoint, which may only make up a few percent of your total requests, you see that it's totally hosed.
Getting a service-level view of the performance was not enough information; you really needed to drill down into the endpoint itself.
And so by clustering these instances of your path into logical endpoints, it makes troubleshooting those types of issues a lot easier.
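A toy version of that endpoint clustering might template out likely-parameter path segments, rather than learning the clusters from examples as Pixie's model does (purely illustrative):

```python
import re
from collections import Counter


def template_path(path: str) -> str:
    """Collapse likely-parameter segments (numbers, long hex IDs) to '*'."""
    parts = []
    for seg in path.strip("/").split("/"):
        if re.fullmatch(r"\d+|[0-9a-f]{8,}", seg):
            parts.append("*")
        else:
            parts.append(seg)
    return "/" + "/".join(parts)


# Requests that differ only by ID fall into the same logical endpoint,
# so per-endpoint latency can be computed over meaningful groups.
paths = ["/users/42/orders", "/users/7/orders", "/health"]
clusters = Counter(template_path(p) for p in paths)
print(clusters)
```

With the requests grouped this way, a per-endpoint latency breakdown becomes a simple group-by instead of a per-URL scatter.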
Another use case we have is very similar where we basically can cluster SQL requests that you're running into logical buckets, even if they get slightly different inputs or parameters.
And so you can do the same thing, but for SQL requests and see how these clusters of requests behave over time.
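The SQL case can be sketched the same way: normalize away literal values so queries that differ only in parameters land in the same logical bucket (again illustrative, not Pixie's implementation):

```python
import re


def normalize_sql(query: str) -> str:
    """Replace literal values so queries differing only in their
    parameters fall into the same logical bucket."""
    query = re.sub(r"'[^']*'", "?", query)          # string literals
    query = re.sub(r"\b\d+(\.\d+)?\b", "?", query)  # numeric literals
    return re.sub(r"\s+", " ", query).strip()


a = normalize_sql("SELECT * FROM orders WHERE id = 42")
b = normalize_sql("SELECT * FROM orders WHERE id = 7")
print(a == b)  # True
```

Once normalized, tracking how each query bucket behaves over time is the same group-by-and-aggregate exercise as the endpoint case.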
I think that the way we think about machine learning in Pixie is different. A lot of times people approach it with the idea of, "Oh, the machine learning needs to completely solve the problem for me."
But we think about it much more as human-in-the-loop and assistive.
So are there ways that we can cluster the data for you to help you visualize what's going on better?
Can we draw your attention to anomalies and weird things? But at the end of the day, the human judgment comes in to classify or categorize what's going on.
Benjie: So you're talking about anomaly detection baked in just by turning on Pixie ostensibly?
Natalie: Well, so the cases that I named were more around clustering data so that you can easily see different logical clusters of requests, but I think anomaly detection is another really cool thing that there's kind of like natural support with the ability to run machine learning models.
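As a trivially simple example of the kind of anomaly detection a model could layer on top of this data, here's a z-score style check over a latency series (illustrative only, assuming a plain threshold on standard deviations rather than any model Pixie actually ships):

```python
from statistics import mean, stdev


def anomalies(series, threshold=3.0):
    """Flag points more than `threshold` standard deviations
    from the mean of the series."""
    mu, sigma = mean(series), stdev(series)
    return [x for x in series if abs(x - mu) > threshold * sigma]


latencies = [10.0, 11.0, 9.5, 10.2, 10.8, 95.0]  # one obvious spike
print(anomalies(latencies, threshold=2.0))
```

In the human-in-the-loop framing above, a check like this only draws your attention to the spike; deciding whether it matters stays with you.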
Benjie: And that is running on the node itself in my private organization?
Benjie: That's pretty powerful. Wow! And that's today? That exists today?
Natalie: So if you want to run a custom model, it may take some learning about exactly how to do that in PxL, because it's a very advanced feature and we don't expect the average user to be doing that, but it is supported in the language.
But if you just want to use it today, you can just look at the clustered data and see it in action.
Benjie: Right. Well, that's, that's super powerful.
My brain is racing on what I want to do with TensorFlow.
And there are so many things that I can't think of one great example other than anomaly detection, but that's really powerful. That's super cool.
Marc: I learned a ton about Pixie here today.
Natalie and Michelle, thank you so much for joining and sharing. I'm super excited.
I think next week I'm going to go put Pixie on my production clusters and see what data I can get from it.
Benjie: I'm going to put Pixie on our staging servers.
But I'm very excited about it. Very excited.
Marc: Well, thank you, Natalie. Thank you, Michelle.
Michelle: Yeah. Thanks for having us.
Natalie: Yeah, it was really awesome chatting with you both.