JAN 19, 2023

40 MIN

Ep. #57, Monitoring K8s Applications with Shahar Azulay of Groundcover

Guests: Shahar Azulay

Hosts: Charity Majors, Liz Fong-Jones, Jessica Kerr

about the episode

In episode 57 of o11ycast, Jess and Martin speak with Shahar Azulay of Groundcover about monitoring Kubernetes applications, improving the UI experience of observability tools, and utilizing APMs. Shahar shares lessons learned from his storied career in R&D leadership positions, cyber security, machine learning and AI, as well as general advice for developers, SREs, project leaders, and executives.

about the guests

Shahar Azulay is Co-Founder & CEO of Groundcover, and a serial R&D leader. Shahar brings experience in the world of cybersecurity and machine learning having worked as a leader in companies such as Apple, DayTwo, and Cymotive Technologies. Shahar spent many years in the Cyber division at the Israeli Prime Minister’s Office and holds three degrees in Physics, Electrical Engineering and Computer Science from the Technion Israel Institute of Technology as well as Tel Aviv University. Shahar strives to use technological learnings from this rich background and bring it to today’s cloud native battlefield in the sharpest, most innovative form to make the world of dev a better place.

show notes

transcript

Shahar Azulay: I think it's one of these specific values that developers just can't live without. I mean, if something is really clear about this specific market, it's that the value is super clear to people all across the stack. It's been written into our blood stream, and I think that's what gets us excited about making products better and better in this domain, because we feel immediate impact and immediate values on developers and how they actually work with their stack, with the applications that they're trying to push out. So that's the thing we feel most connected to.

Jessica Kerr: Do you see a lot of developers getting excited, or is it more SREs? Or platform engineers?

Shahar: I think it's a general question of what role an APM plays inside an organization. I think it's very varied across the stack. Specifically, Groundcover is a bit of a different APM than maybe legacy APMs in the way we get installed into organizations, so DevOps or the SREs are usually our gate into the organization, since the installation is very infrastructure oriented as we're using eBPF.

But basically the value of an APM just goes all across the R&D team from DevOps and SREs, which might be more interested in metrics to set SLOs and track them, and eventually monitor the trends and behaviors of the system in general. But eventually traces and logs and all the connection between traces, metrics and logs around an event would usually interest a dev, and I think if you want to eventually create an impact inside an organization, you would have to reach a dev around troubleshooting.

While the DevOps are more interested in the day to day usability of the system to track behaviors and compare different versions you deploy to production and stuff like that. So definitely seeing the system move around the R&D team, across the different teams, all the time.

Martin Thwaites: So do you see developers then as the growth part of what you're dealing with? Is that the new area that you're looking at, is developers? Me being a developer, rather than an SRE, it's something I get really excited about, so is the developers the new area or do you think that that's something that already existed?

Shahar: I think we're seeing it's a cycle, the cycles of life in development. We've moved from separating the infrastructure to DevOps, and creating positions like production engineering, to companies where you see a different approach of engineering or R&D or developers in general taking full responsibility for their application from writing the code, designing the metrics they're going to track and deploying them to production.

So I think as time passes, it's very clear that, like we see in security trends, that the developers should care about the way they're monitoring the application production. They think about it as they write the code, they basically engage more with the way that their application is going to be monitored, they get directions from the DevOps and SRE teams sometimes to integrate one solution or another. So they're definitely a key part of everything that goes on in selecting the technologies that eventually is going to track and monitor the applications.

We feel that even if the solution is super interesting to DevOps and SRE teams, it still has to bring a lot of value to developers so the entire organization is going to be convinced that it's the full package, that they can use it all across and every team is going to enjoy it. If the developers eventually won't get any value, I think it's a harder sell for an organization, basically, to fall in love with a solution.

Jessica: So DevOps and SRE get observability in, but for it to really land, for the marriage to really work out, the developers need to care?

Shahar: Yeah. That's how I see it. They still have a real impact on solutions. We even saw it in security with companies like Snyk and things like that, eventually. You don't want to force a solution inside the organization, no matter who's the stakeholder. Once it goes down, all the way through R&D and the developers really appreciate the value and see it as part of their stack, it's going to propagate really fast inside the organization and succeed better. So that's the way we see Groundcover also.

Jessica: Nice. Okay, so tell us about yourself? What's your background? How did you get this perspective?

Shahar: I'm Shahar Azulay, I'm the CEO of Groundcover. Most of my career has been around R&D, leadership positions in different areas. I started in cybersecurity for many years, and my last seven or eight years before Groundcover were mostly around machine learning and AI. Before Groundcover I was a team leader at Apple around specific machine learning applications in the watch and the phone, so if you guys have any complaints you can ship them my way. I think Groundcover was born from this background, from my background and that of our CTO, Yechezkel.

Basically we both come from years of being on the user side, using solutions like Datadog or open source solutions like Prometheus, Grafana and stuff like that. Eventually we've used them, we've enjoyed them, we see the value in them, and also the pain points of where they work less well than you would've expected or wanted, or suffer from different integration or scale issues. So that's where Groundcover originated, from a first hand experience of being responsible for production and caring about what you're going to wake up in the night for.

Jessica: Which particular problems is Groundcover solving?

Shahar: So I think in general in the observability market, specifically the APMs, there's a very clear trend which I think also we find concerning regardless of Groundcover, and that's that we've been taught for over a decade about the three pillars of observability, and you get all the traces, logs and metrics and everybody can cite it by heart now and tell you what they need to collect.

But when you actually get to real companies and talk to real teams of real people that are working with real productions at scale, usually they don't have a full APM implemented. We see a lot of Datadog users, even, not activating the APM tier. APMs today, I mean, I haven't used Honeycomb, for example, but I know other solutions which are really heavily reliant on tiers, from a logging backend to custom metrics to infrastructure monitoring, and all the way up to a full application monitoring tier like an APM.

So we see a lot of teams not activating that, and we feel that there's a gap in reaching the data, basically. A lot of teams don't have access to APM grade data, and we feel that it originates from a few different points. But one of them is the integration, I mean, the industry has been heavily working on solutions like auto-instrumentation. It started out with Java agents, where we could have those and avoid code changes, and all the way through to as much auto-instrumentation as you can in other solutions and other languages. But eventually it still requires you to be part of the development cycle in many different aspects.

Jessica: This is one of the great mysteries of the universe, as far as I'm concerned. Can you define APM?

Martin: Asking the hard questions, Jess.

Shahar: It's a philosophical question, yeah. I think an APM basically is an offering that tries to say, "I know how to take logs, metrics and traces, and create an experience in one place where I put them all together so you can get value from the different verticals and enhance the value of what you're seeing."

Because if you're going to use a logging backend at one place, and a dashboard for metrics in one place, and just see spans flowing around one place, that's not an APM for me. I think an APM is trying to aggregate all this data into a sophisticated user experience so you can troubleshoot better, investigate performance better once you see them all together.

So I think Datadog or Groundcover, or all the solutions work great. I think Honeycomb the same. It's around those kind of designed experiences that provide logs, metrics and traces in one place so you finally understand the full picture, because a lot of the problem around application monitoring is that you have too much data.

So if something doesn't sit right in the same place with the same context, things are going to lose meaning and I think that's what great APMs do. So for us, clearly, Groundcover, part of what we're offering is a UI experience that we believe in that can help you focus and troubleshoot and stuff like that. But it's not just collecting the data.

Martin: You've got some great points in there, and I think for me, all of what you're describing there is around what observability is because observability is about being able to ask those questions of your data, it's about being able to... You don't really care about logging, you don't care about metrics, you don't care about events, you don't care about traces. What you care about is being able to ask questions.

You don't care about the individual bits of data, you don't care that one's going to one system or one's going to the other. That could be perfectly valid. What you need to be able to do is ask those questions, and that's where that user experience that you were talking about is really, really important because what it's about is being able to, from a user perspective, be able to ask those questions.

And us, as developers or SREs, are those users in that context and we need to be able to have that really, really good user experience to be able to do it. It's not about funky graphs, it's not about a big line graph that shows you things going up, and up's good, is it? Or is up bad? I don't know.

Jessica: Up and to the right, that's good.

Martin: Oh right, see, that's what I've been doing wrong all these years. I've been going down and to the left. But it's about that user experience, that's to me where observability comes into its own, is having that really rich user experience that allows you to not care whether it's a metric, whether it's an event, whether it's a trace because it really doesn't matter. If metrics give you all the data and allow you to ask those questions about your data, then great. It doesn't matter.

If you need that context as you say, bringing those together, then that's really good as well. But the idea of it being three pillars, the idea of that being 100% observability, that is the thing that I don't think really works. It's really about that user experience that you were talking about and focusing on that idea of allowing those developers, those engineers, SREs, whoever it is, to ask those questions of that data.

Shahar: I agree.

Martin: So there's some really important bits there, but user experience was the big one for me because that's what brings good observability.

Shahar: I totally agree.

Jessica: So you mentioned that there's a lot of data and getting that into a presentable format that you can really get your fingers in and dig around in is one problem, and the other problem is getting it. Why do we have logs, metrics and traces? Well, because we have logs. Then Groundcover works with eBPF, right? Which is like a whole new source of what is happening in your software.

Martin: Yeah, that's my big question. Please tell me what eBPF is, please. I have done so much research and tried to understand it, but I'm hoping that somebody like yourself whose career is now dedicated to it can give us some really succinct explanations.

Shahar: So I think eBPF, like every great technology, is not new. It's been making its way to our awareness for a long time. It basically started out way back in the 90s as part of what we currently experience as developers as tcpdump and solutions around filtering network packets at high scale, using kernel abilities. But basically, today it's a completely different kind of package from what it used to be when we started out with BPF, the original, non-extended version that came before eBPF.

Jessica: It's like we always had Linux kernel namespaces, but containers are actually usable by people who aren't deep experts.

Shahar: Yeah. In a sense, it's also that, but I think eBPF made a significant leap in capabilities that made it usable like it is today, when we started out with BPF a long time ago-

Jessica: And what is BPF?

Shahar: So BPF is a technology that started out way back when we tried to allow manipulation of packet routing, and basically trying to figure out how to probe packets in the network using kernel abilities so you can be more efficient. You don't want to do all that in the user space.

Jessica: So kernel abilities, so we're talking about plugging into the operating system and getting it to talk to us?

Shahar: True. eBPF basically allows you to run code inside a virtual machine or a sandbox, inside a kernel. It's an ability or a safe ability to basically execute business logic inside the Linux kernel, which you would otherwise execute in the user space.

Jessica: Business logic inside the Linux kernel? Is that a good idea?

Martin: Danger, danger, Will Robinson.

Jessica: This sounds like stored procedures in Oracle.

Shahar: Yeah. But it's a good idea when you compare it to the alternative. When you look at the alternative, basically what has been happening for the past two decades or so is that people would write kernel modules in order to manipulate the behaviors of the kernel to do things in high performance. They would do it in load balancing, in DDoS prevention, in other different use cases where it made sense to say, "I need a different version of the kernel to actually allow me to operate my logic, which is different at scale."

It doesn't make sense since the development cycles in the kernel are so long, you would force people to write kernel modules which are totally unsafe, basically. You can crash the operating system. eBPF is the correct alternative in the sense that you convert it to be in a kernel API, if you wish.

You say the kernel now has the abilities to verify the code, make sure it's safe, make sure it will execute on time, and doesn't reach any sensitive parts of the operating system, allowing the users to write code that can actually be mounted into the kernel without the risk of a kernel module. So it kind of makes the kernel adaptive or programmable in a sense.

So instead of me having to go to the very closed Linux community and trying to push my idea for the Linux kernel for the next five or six years, I can, as a user, make the kernel programmable for my specific needs right now with eBPF. There's limitations-

Jessica: Because it won't let you break nearly as many things as changing kernel code would let you break?

Shahar: Exactly. It would give me the superpowers of being in the kernel, being visible to the user space, being visible to kernel abilities and run in high performance, but it won't let me access everything, access specific memory segments of the operating system.

Jessica: Ah, enabling constraints.

Shahar: Yeah. The eBPF verifier is a major part of eBPF, and it's basically the part that's going to check your software, make sure that it runs safely and smoothly, and once you've passed that gatekeeper, your program runs inside the kernel, but you can be safe and sound knowing that it's running and it's not going to crash what you're doing. We see eBPF being implemented in security, in networking, in many different verticals inside the industry so it's definitely catching fire really, really quickly. I think it's a proof point that eBPF can definitely be used for many different things in production.

Jessica: Including observability?

Shahar: Including observability, and I think that's a major leap from what developers were used to doing. I mean, eventually you used to work hard inside your code as the maintainer of the code, basically, to integrate monitoring pieces of code into your stack. It would be either by using Java agents where you could, or by integrating an actual package or an actual SDK into your code, instrumenting the code in different depths based on the different language you're working on to actually allow the monitoring system to sample the application.

What eBPF is saying is, "You don't need to do that. You can run out of band, outside the application, but inside the kernel, and be able to observe, basically, what the application is doing, how it is using the networking stack, how it is using the file system, how it's using the different resources the kernel eventually maintains." So you can see the APIs the application is using, you can profile the application using eBPF, you can do a lot of different stuff that would otherwise require deep integration with the application. That's a major enabler, specifically in Kubernetes environments.

Martin: So you've got some interesting things there. One of the things that I'm really pushing for at the moment is easy mode. The idea that how do we get developers and engineers to start thinking about these things is, well, we make it easy, we make it so that they don't have to-

Shahar: I think, for example, when you look at a Kubernetes monitoring environment in that sense, I mean if before you would be working with a monolith kind of solution based on Java or whatever, it was a single application running on top of a machine basically. So you could say that integration of an SDK or a Java agent, or whatever, isn't such hard work and there's not too much to gain from going out of band for the application because it's one application running on one machine.

But today, I mean, it's very clear that the amplification of the value is super strong, you would run an agent on one node of a Kubernetes cluster, and on top there would be 100 different containers running five different types of services written in five different languages. That's a major amplifier of what you can do because suddenly you can observe all that with one touch so you don't have to go to those five different R&D teams, coordinate them to actually install or redeploy the app differently so they can actually get the monitoring capabilities. From an organization perspective it would require so much less effort, so much less coordination just to get the value, basically.

Jessica: So Kubernetes gives us the power to plug observability tools using eBPF underneath all the different applications in all the different languages?

Shahar: Yeah, so I think that's the major amplification of what eBPF can do because if you look again at what we used to have in monolith systems, if you have one machine running one process, say of a Java application, of a monolith application, then it's really clear that... I mean, eBPF is nice but compare it to a Java agent or instrumentation inside the code, it's not that different.

You can say, "Let's work hard, integrate the stuff into the code and get it done." But once you look to move to Kubernetes and the infrastructure is basically running so many applications on top, using user containers on top so once you run an agent in each of the nodes in the cluster, you suddenly see from that eBPF agent all the containers running on top and you can observe them instantly so it gives you the ability to see the hundred containers running on that node from the five or six different languages running on top without being integrated into each of those different applications, and that's a major jump not just technically, but also from an organizational perspective.

You don't have to coordinate with the different R&D teams, be part of the R&D development cycle to just make sure that everybody is on the same line regarding integrating the observability vendor into their stack. It allows you to decouple the observability from the R&D cycle and use one guy inside a big organization to deploy you across the entire production. That's a major time to value difference, and it allows people to experience an APM much faster and make it much less preplanned and much less tied to what they're used to actually doing as part of their R&D development cycles which causes a long time to value.

Jessica: Yeah, you don't want to have to wait for every development team.

Martin: So how does this interact with developers who need to integrate with their own platform? So if I'm writing a microservice, I know where the important bits are in my microservice. I know-

Jessica: You have the business logic and the account ID and the fields that you're making decisions on, which eBPF is not going to have access to because it's at the operating system level.

Martin: Yeah, you've got algorithms where the performance changes based on the inputs that come in. Maybe if we pass a six into it, it goes and takes 10 minutes, whereas if you pass in a one then it only takes 20 seconds. You want to know whether it's the people with a long name that are the ones with a problem. How does eBPF solve for that? How does it integrate with that?

Shahar: So I think the question, I have two answers. One is that eBPF doesn't solve that. I think it's a general question for what is an APM? And I think any APM is exactly not the place where you can cover custom metrics. Eventually an APM is trying to create all the different application metrics out of the box that should be used, like golden signals, error rates, throughputs and latencies, that should be used for tracking SLOs and high level stuff. But there's always going to be that business logic which is very specific to what you're trying to achieve that has to be defined by the developer.
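To make the "golden signals" mentioned here concrete, the following is a minimal sketch of how throughput, error rate and latency percentiles could be derived from a window of observed requests. It is an illustration only, not Groundcover's implementation: the `Request` record, the `golden_signals` function, the choice of status codes 500 and above as errors, and the nearest-rank percentile method are all assumptions made for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """One observed request (hypothetical record shape for this sketch)."""
    service: str
    status: int         # HTTP-style status code
    latency_ms: float


def percentile(sorted_values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def golden_signals(requests, window_s):
    """Compute throughput, error rate and latency percentiles for one time window."""
    total = len(requests)
    if total == 0:
        return {"throughput_rps": 0.0, "error_rate": 0.0, "p50_ms": None, "p95_ms": None}
    # Assumption: only server-side failures (5xx) count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    latencies = sorted(r.latency_ms for r in requests)
    return {
        "throughput_rps": total / window_s,
        "error_rate": errors / total,
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
    }


window = [
    Request("checkout", 200, 10.0),
    Request("checkout", 200, 20.0),
    Request("checkout", 500, 30.0),
    Request("checkout", 200, 40.0),
]
signals = golden_signals(window, window_s=2)
print(signals)
```

None of these values need anything from the application's own code beyond what can be seen on the wire, which is why they fit the "out of the box" category; a metric such as "latency by customer name length" would still have to be defined by the developer.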

Jessica: Right, right. But that can hang off of the framework that's created underneath.

Shahar: Yeah. We definitely think that collecting custom metrics is super important, in one place. We definitely see that it's the place of Groundcover to collect those custom metrics, which are not out of the box, which are created by the developers, and eventually combine them in one place with the metrics that we create out of the box for them without them having to do anything.

It's definitely a critical piece of the puzzle. I'm just saying that in every organization there's always going to be a different business logic which can be critical for scale, can be critical for performance investigation so it's definitely important. On the other hand, regardless, we think what Groundcover does really interestingly and what APMs usually try to do, we distribute the way observability is being collected so the agent is doing a lot of the data crunching on the fly so we can reduce data volumes before they even leave the node or being shipped outside of a cluster.

So eventually it creates a much more cost effective experience. By doing that, we're choosing specific verticals inside production, for example like the golden signals, to learn distributions inside the actual agents, and we use them to sample raw data. For example, Martin, to your question, we will definitely sample the high latency request, we'll definitely sample the high payload request and eventually use smart capturing rather than random sampling to showcase the different collage of examples that eventually depict the problems or abnormalities basically in production based on detailed raw examples.

I think that's a major duty from an APM, pinpointing the things you should maybe start to look at in order to investigate. On the other hand we would create all the metrics you would expect from an APM without storing all the raw data that we eventually use to create them, which is also a major boost in cost effectiveness.

So if you're looking at a P50 metric over time, you probably don't want to store all the spans that created it. But you do want to see the spans that you should care about like the examples that you mentioned, which are critical for investigation.
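The idea of aggregating everything but retaining only the interesting raw examples can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not a description of Groundcover's agent: the `Span` and `NodeAggregator` names, the fixed latency buckets, and the fixed "slow" and "large payload" thresholds are all hypothetical (the conversation describes learning distributions on the fly rather than using fixed cutoffs).

```python
from bisect import bisect_left
from dataclasses import dataclass, field


@dataclass
class Span:
    """One observed request/response pair (hypothetical shape)."""
    url: str
    status: int
    latency_ms: float
    payload_bytes: int


@dataclass
class NodeAggregator:
    """Counts every span into cheap aggregates, keeps raw data only for notable ones."""
    bucket_bounds_ms: tuple = (10, 50, 100, 500, 1000)  # upper bounds, assumed
    slow_ms: float = 500.0                 # assumed cutoff for "high latency"
    large_payload_bytes: int = 1_000_000   # assumed cutoff for "high payload"
    bucket_counts: list = field(default_factory=list)
    total: int = 0
    errors: int = 0
    kept: list = field(default_factory=list)

    def __post_init__(self):
        # One extra bucket for latencies above the last bound.
        self.bucket_counts = [0] * (len(self.bucket_bounds_ms) + 1)

    def observe(self, span):
        # Every span updates the aggregates, so metrics stay complete.
        self.total += 1
        self.bucket_counts[bisect_left(self.bucket_bounds_ms, span.latency_ms)] += 1
        is_error = span.status >= 500
        if is_error:
            self.errors += 1
        # Only errors, slow requests and large payloads are kept as raw examples.
        if (is_error
                or span.latency_ms >= self.slow_ms
                or span.payload_bytes >= self.large_payload_bytes):
            self.kept.append(span)


agg = NodeAggregator()
for s in [
    Span("/cart", 200, 5.0, 100),
    Span("/pay", 200, 700.0, 100),
    Span("/pay", 500, 20.0, 100),
    Span("/upload", 200, 30.0, 2_000_000),
]:
    agg.observe(s)
print(agg.total, agg.errors, agg.bucket_counts, len(agg.kept))
```

The histogram and counters are enough to chart percentiles and error rates over time, while the `kept` list holds the handful of raw spans worth opening during an investigation; the fast, healthy `/cart` request contributes to the metrics but is never stored.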

Martin: I think context is the really important thing, the context of those things. Coming back to what you were alluding to really, earlier on when we were talking, is this idea of correlation, the idea of different bits of data, providing more and more context. I think that's really key to the whole observability concept, is this idea of more and more context, whether it's the request data, whether it's the response data, whether it's things that are within that stack.

I care more about my individual application than I do about the landscape, being from a developer background. That's where my questions come from because, to me, this is about my individual application and I really want to know what's happening in my individual application and the things that affect my individual application. Whereas it sounds from what you're talking about around eBPF being more holistic, which is really important as well, is being able to see how a system reacts rather than an individual service or application.

Shahar: We will be able to pinpoint, just as every APM instrumented from the code would do, pinpoint the different services running inside a cluster, what each one of them is doing, the granularity of the data that is being collected by Groundcover is very, very deep. I think the major difference in what eBPF does is that it's not limited to the applications you're writing and shipping to production.

You can look at it also from this angle, every production today has third party components, right? You have different web servers and proxies and all different stuff running in the cluster which are usually open source stack solutions and stuff like that, that you integrate into production. Usually you don't instrument them, usually you don't have the visibility of what's going on with them like you have in specific cherry picked services that you integrate your observability stack into.

eBPF is kind of agnostic to that. Everything is equal from that sense to the eBPF agent. We can track the Kubernetes control plane alongside your application and see packets flowing from your code into Istio, say, for example, and back because it doesn't really matter. You don't have to instrument the Istio code for it to see the traffic transiting through the Istio pod and telling you it took that much time for the request to pass through that service.

So eBPF is also really holistic in a good way of saying all production components are equal. We see everything, it doesn't matter the language it was written in, it doesn't matter if it's your code or not, which is an enabler.

Jessica: Are you able to tie that information in with the wider context of a customer experience of a request taking a long time? Are you able to propagate trace IDs and parent spans and structure the eBPF information in with the application level information?

Shahar: Definitely. Great question. Definitely one of the gaps of eBPF is distributed tracing, but I think... I mean, distributed tracing of seeing a request propagate through the different microservices and so on. But, for example, we do know how to build a full dependency map between all the services, see the protocols they're using and the error rate on each of these connections.
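A dependency map of this kind can be built from individually observed connections, with no trace IDs involved. The sketch below is one hedged way to picture it: each observation is a hypothetical `(source, destination, protocol, is_error)` tuple, and `build_dependency_map` is an illustrative helper, not part of any real product API.

```python
from collections import defaultdict


def build_dependency_map(observations):
    """Aggregate per-connection observations into edges with request and error counts.

    Each observation is assumed to be (source, destination, protocol, is_error).
    """
    edges = defaultdict(lambda: {"requests": 0, "errors": 0})
    for src, dst, protocol, is_error in observations:
        edge = edges[(src, dst, protocol)]
        edge["requests"] += 1
        if is_error:
            edge["errors"] += 1
    # Attach the error rate to every edge once counting is done.
    return {
        key: {**stats, "error_rate": stats["errors"] / stats["requests"]}
        for key, stats in edges.items()
    }


observed = [
    ("checkout", "payments", "http", False),
    ("checkout", "payments", "http", True),
    ("checkout", "cart", "grpc", False),
]
dep_map = build_dependency_map(observed)
for (src, dst, protocol), stats in dep_map.items():
    print(f"{src} -> {dst} [{protocol}]: {stats}")
```

Note what this does and does not give you: it shows which services talk to each other and how healthy each link is, but it cannot say that a particular `checkout -> payments` call belonged to the same user request as a particular `checkout -> cart` call. That linkage is what distributed tracing adds.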

Our mission is to turn observability to be practical. Distributed tracing has a lot of respect. I think that it's definitely a killer feature in many cases, and yet what we're trying to say is that so many developers don't have access to this data for so many reasons, from cost to integration, that creating the 90% value of, "Just show me the spans flowing through all the different services, cluster them into issues, show me the error rates between Microservice A and Microservice B on this specific protocol, on this specific URL and so on," maybe without being sure and 100% of the propagation between the different microservices inside the cluster.

For me it's 90% of the value in 5% of the time, or the cost. I think that's what we're trying to say. There's so many teams that don't have access to distributed tracing that it doesn't make sense to fight for that feature in our perspective right now. eBPF is going to solve it, we believe it's going to be solvable, we have already been experimenting with it at Groundcover in different frameworks.

We know how to solve it in specific cases. But solving it for all the frameworks is going to take a while, it's not an easy task. If people propagate a trace ID inside the different requests that they're using, if they're implementing it using custom propagation or OpenTelemetry instrumentation, we will definitely know how to pick that up and use that.

Jessica: Someday?

Shahar: Right now we know how to pick it up.

Jessica: Right now? Okay.

Shahar: Yeah. We're not showing it yet in the UI, but it's something that we're releasing soon. But it's definitely something that we're not trying to deliver out of the box. We're saying there's enough value without distributed tracing that you probably don't have right now, and eBPF is a way to get it fully accessible without all the hard work and for big organizations that can be a major difference. You probably see from your end how much time it takes organizations to implement OpenTelemetry, and I think that pain is real.
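For readers wondering what "picking up" a propagated trace ID might involve: when applications are instrumented with OpenTelemetry, its default propagator writes a W3C Trace Context `traceparent` header on outgoing requests, so anything that can see the request headers can read the trace ID from it. The conversation does not say which formats Groundcover reads, so the following is only a sketch of parsing that one header format; `parse_traceparent` is a hypothetical helper name.

```python
import re

# Version-00 layout: version "-" trace-id (32 hex) "-" parent-id (16 hex) "-" flags (2 hex)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return trace_id, parent_id and sampled flag, or None if the header is not usable."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # This sketch only accepts version 00; later versions may carry extra fields.
    if version != "00":
        return None
    # All-zero IDs are defined as invalid by the W3C specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


example = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
parsed = parse_traceparent(example)
print(parsed)
```

With the trace ID in hand, requests observed from outside the application can be grouped with spans the application emitted itself, which is the kind of gradual, both-sides approach discussed in the turns that follow.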

Jessica: Right. They need to do it, but they need to be able to do it gradually, on the schedules of the different development teams. And observability can't wait for all of that to finish before it's useful.

Shahar: True. I think that it takes a lot of time and wherever something is complex, it will always end up also creating poor coverage because if something is complex, then you would've said, "Okay, so let's start with covering this specific service using OpenTelemetry, we don't have time or desire to touch that legacy code. I'm not going to open it up to instrument it with OpenTelemetry right now, so leave it aside. We're not going to touch that."

Jessica: Yeah. It's different if all your microservices are in Java on Kubernetes, then it's just as easy if you can throw the Java agent in with a single operator.

Martin: Just to be clear, same with .Net.

Shahar: Love that, Martin. Yeah. True. So there are specific use cases, very homogeneous use cases, as you say, of Java and .NET where there's definitely advantages to previous practices like Java agents, so it's going to work and I think it works great. But in a heterogeneous environment we have Golang and Node and Python. I mean, every company is really heterogeneous. Java shops today still exist, but when you come to more companies you see a much more diverse environment and today there's more democracy around frameworks; developers in teams choose their technologies.

It's very common. You see it in data science compared to backend development: teams choose different frameworks to work with, so I think getting one solution to work, that's the hard part. There's definitely pockets inside production that you integrate quickly into, but covering all the production at once using instrumentation and agents? That's hard work and we see organizations working hard to get it. I think it's a real pain point currently in the industry: APM is such a strong value, distributed tracing clearly also, yet so many people don't have it.

Martin: I completely understand the reticence for that amount of work. What I would say is that's something that's being tackled. The Open Telemetry community has recognized that some of the gaps are around the easy-mode stuff. There is lots and lots of low-hanging fruit that we can pick with distributed tracing, with agents as Jess was suggesting.

As people who know me would say, I'm Martin 10-Lines-Of-Code-.NET because I try to make everything around 10 lines of code, because it can be really easy to do and the value that you get from that can be, in my opinion, a lot higher than being able to get infrastructure metrics and being able to see low-level I/O and that kind of stuff, because it's about context.

Jessica: But we can work on it from both sides.

Shahar: I think maybe that's something which we need to touch on, and that's what eBPF eventually provides from an observability perspective. It's definitely not infrastructure metrics and network I/O. You see full APIs and the payloads that we collect; we collect Kubernetes events and attach them to these payloads. We measure metrics which are application grade, like error rates per status code and stuff like that, everything you would expect from an APM, clearly supported today in Honeycomb and other solutions.
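[Editor's note: to make "error rates per status code" concrete, here is a minimal Python sketch of how such metrics could be derived from captured request records. The record shape, the `CAPTURED` sample data, and the function names are hypothetical illustrations, not Groundcover's data model or API.]

```python
from collections import Counter, defaultdict

# Hypothetical request records, shaped like what a traffic-capturing
# sensor might emit. The field names are assumptions for illustration.
CAPTURED = [
    {"service": "checkout", "path": "/pay", "status": 200},
    {"service": "checkout", "path": "/pay", "status": 500},
    {"service": "checkout", "path": "/pay", "status": 200},
    {"service": "checkout", "path": "/cart", "status": 404},
    {"service": "search", "path": "/q", "status": 200},
]


def status_counts(events):
    """Count responses per status code for each service."""
    counts = defaultdict(Counter)
    for event in events:
        counts[event["service"]][event["status"]] += 1
    return counts


def error_rate(events, service, is_error=lambda status: status >= 500):
    """Return the fraction of a service's requests that count as errors."""
    statuses = [e["status"] for e in events if e["service"] == service]
    if not statuses:
        return 0.0
    return sum(1 for s in statuses if is_error(s)) / len(statuses)


if __name__ == "__main__":
    for service, counter in status_counts(CAPTURED).items():
        print(service, dict(counter), error_rate(CAPTURED, service))
```

[Treating only 5xx responses as errors is a design choice in this sketch; passing a different `is_error` predicate would count 4xx responses as well.]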

You still get this value without full distributed tracing, you still get this very, very deep and detailed application value without touching the code and I think that's... Even though 10 lines of code is definitely something I agree with you on fully, if you can do it you should, but eventually reality or practicality is a bit-

Jessica: 10 lines of code in 200 apps is non-trivial.

Shahar: Yeah, definitely. I think there's always gaps around that and there always will be so I think that there's definitely a really interesting enabler there. We see other observability companies moving into eBPF as a sensor because it can open up a lot of unseen parts in production that developers or teams just don't touch that easily or go through that instrumentation journey.

Martin: I think we're all on the same bandwagon, really, around this idea of, "We need better information about our systems, we need the outputs of our applications to be better queryable, we need that output to be more contextual, we need more data about those things." Whether that comes from eBPF, whether it comes from people amending their code, that's really immaterial. It's all about how do we get that thing in easy? How do we get it in fast? And how do we get the output and the visibility of that and that user experience of people getting the answers to their questions in the fastest possible way? So whether it's eBPF or whether it's using distributed tracing in your applications, what's interesting to me is how do we bring those two things together?

So how do we make it so that, yes, I can, as a developer, say, "These are interesting parts of my code. I would like these interesting parts of my code to be part of that distributed trace that I've just been able to do by adding the eBPF agent from Groundcover or whomever," to say, "Right, you've gathered all of that other information, but I'd also like this extra bit of information because this is really important to me"?

As Jess mentioned, a user ID, for instance. The user ID on one system is not the user ID on another system, it might be account ID that's important to you, and those are the things that, for me, are about how do we bring, whether it's Open Telemetry or something else, but how do we bring those two things together so that we have that ultimate context that we can ask a really, really obscure question that you really didn't know that you were going to ask?

One of the things that comes up a lot is, "Oh, this problem only affects iOS 14 users who are living in Norway who are using the French language pack." That really obscure question that you didn't know that you needed to ask. Whether it's through eBPF or not, that's the thing that we need to get to. I think that there's this idea about how do we bring Open Telemetry and the stuff around eBPF, how do we take those two data sources and bring those two together in one system or one context? That's where I think we're going to get the really big adoption levels.
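[Editor's note: the "obscure question" Martin describes amounts to filtering events on several attributes at once, which only works if each event carries that context. The Python sketch below illustrates the idea; the event fields, the `EVENTS` sample data, and the helper names are hypothetical and not tied to any particular tool.]

```python
# Hypothetical "wide" events: each request carries contextual attributes
# that might come from automatic capture, manual instrumentation, or both.
EVENTS = [
    {"os": "iOS 14", "country": "NO", "language": "fr", "status": 500},
    {"os": "iOS 14", "country": "NO", "language": "nb", "status": 200},
    {"os": "iOS 14", "country": "FR", "language": "fr", "status": 200},
    {"os": "Android 12", "country": "NO", "language": "fr", "status": 200},
    {"os": "iOS 14", "country": "NO", "language": "fr", "status": 500},
]


def where(events, conditions):
    """Keep the events whose attributes match every given condition."""
    return [
        e for e in events
        if all(e.get(key) == value for key, value in conditions.items())
    ]


def error_rate(events):
    """Return the fraction of events with a 5xx status."""
    if not events:
        return 0.0
    return sum(1 for e in events if e["status"] >= 500) / len(events)


# The question nobody planned for: iOS 14 users in Norway using French.
obscure = where(EVENTS, {"os": "iOS 14", "country": "NO", "language": "fr"})

if __name__ == "__main__":
    print("overall error rate:", error_rate(EVENTS))
    print("obscure slice error rate:", error_rate(obscure))
```

[In this sample the overall error rate hides the problem, while the three-attribute slice fails every time. The query is only possible because the attributes were recorded on each event in the first place.]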

Shahar: I agree, and I think it's definitely going to go there. Even if we look at Groundcover, we're not saying that code instrumentation is something that you shouldn't do. We're just saying that it creates a lot of problems and it's hard and sometimes the time to value there is really long.

Jessica: You shouldn't have to start with it.

Shahar: Yeah. But eventually there's specific value that you will eventually have to touch code to get. I fully agree with that, and I think eBPF and code instrumentation, even in the Open Telemetry project in the future, will probably work hand in hand because there's so much that you can get from each.

Jessica: Yeah. And each makes the other more valuable.

Shahar: Yeah.

I think APMs today are kind of binary in that sense. You either instrument the code all the way through and you get application metrics or nothing, and there's a middle ground saying, "You might not have instrumented your code all the way through, but just use the eBPF agent and get so much data on the applications. If you want to get much, much more detailed about specific use cases in the code, go ahead. Instrument Open Telemetry and cherry pick the parts of the code that you want to instrument."

Jessica: If you could whisper something in the ear of millions of developers and SREs and project leaders and managers and executives, while they're sleeping and highly susceptible, what would you plant?

Martin: Just to be clear, don't do that. That would be creepy.

Jessica: If you could, if you were the Tooth Fairy and you could whisper something in their ear, what would you plant in their minds?

Shahar: I think that eBPF is definitely going through a major shift and I think that also cost reduction inside full blown APMs is something that's going through a major shift. We see it in the market. I think Groundcover is a really interesting solution. Maybe the first of many solutions trying to show what can be done with eBPF for application monitoring and how we can use smart capturing inside agents to reduce costs.

I would tell them we have an amazing and generous free tier and we totally welcome feedback from the developer community and we would love to see people try it out and tell us if it helped them compared to the stacks they're using and the solutions that they're using today.

Jessica: Great. So everyone is going to wake up with a toothache and a weird craving for eBPF.

Martin: That just sounds like a Tuesday night to me.

Jessica: Thank you so much. It's been great talking to you.

Shahar: Thanks, guys. I've had a great night.
