Ep. #3, OpenTelemetry with Ben Sigelman of LightStep
In episode 3 of The Kubelist Podcast, Marc Campbell speaks with Ben Sigelman of LightStep. They discuss the inspiration and origin story behind OpenTelemetry, the challenges of observability, and the path from sandbox to incubation.
Ben Sigelman is Co-Founder & CEO at LightStep, a co-creator of Dapper, and co-creator of the OpenTracing and OpenTelemetry projects.
Transcript
Marc Campbell: Hi. We're here today with Ben Sigelman, the founder and CEO of LightStep and one of the founders of the OpenTelemetry Project. Welcome, Ben.
Ben Sigelman: Hi. Thanks for having me, I'm excited about this.
Marc: We want to talk about the OpenTelemetry Project and its addition to the CNCF landscape as a sandbox project right now. I guess just to get started, Ben, it'd be great if you could tell us the background and the inspiration for the project: where it came from, and the story around the creation of the OpenTelemetry Project itself.
Ben: Sure, I'm happy to do that.
When we started OpenTelemetry, the big joke was the xkcd comic about standards, where one of the characters complains that there are 15 standards, so we should create one new standard just to unify everything, and then there are 16 standards.
In some ways people had some hesitation about the project, because there were two previous projects that had similar charters.
One was OpenTracing, and the other was OpenCensus.
They were similar, but they weren't quite the same.
Unfortunately, they both had enough momentum that neither was going to die a heat death if we'd just let them be, yet the problem was that they were trying to solve more or less the same problem, which was ultimately a standardization problem.
Neither one was going to succeed for reasons I can go into, so OpenTelemetry was born out of a desire to supersede both OpenTracing and OpenCensus.
The reason I think that we thought that would work was that the leadership for OpenTracing and OpenCensus was the founding group for OpenTelemetry.
We all had committed to sunset those projects and replace them with OpenTelemetry, and that was about almost a year and a half ago actually, that we made that decision.
At this point it feels like it was very wise and that OpenTelemetry has a lot more momentum than OpenTracing or OpenCensus did at the time.
It is, I think, achieving the overall mission that all three of these projects had at some level, which was to find an open source solution to the problem of extracting high quality telemetry data from cloud native applications.
So the short version of OpenTelemetry's purpose is that we want high quality telemetry to be a built-in feature for really any modern cloud native software.
That's important because you do not want to spend your developers' time manually instrumenting things, nor do you want to tightly couple yourself to any particular vendor of observability or APM or whatever you want to call it.
So OpenTelemetry is at some level this marketplace project where open source and closed source application software can buy into OpenTelemetry to create this critical data that observability tools need, and then observability tooling can consume it on the other side and build pretty compelling features without requiring their customers to manually instrument code.
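To make that producer/consumer split concrete, here is a minimal sketch (not from the episode) using today's OpenTelemetry Python packages: the application only calls the vendor-neutral API, and the exporter wiring is the one piece that changes per backend. The service and attribute names are made up for illustration.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Backend wiring: the only part that changes if you swap vendors or tools.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

# Application code: vendor-neutral API calls only.
tracer = trace.get_tracer("checkout-service")

def charge_card(order_id: str) -> None:
    with tracer.start_as_current_span("charge-card") as span:
        span.set_attribute("order.id", order_id)
        # ... business logic would go here ...

charge_card("order-123")
```

Swapping the console exporter for a vendor's exporter changes none of the application code, which is the decoupling being described here.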
The purpose of the project is very similar in some ways to the purposes of OpenTracing and OpenCensus.
Each of those projects had some problems, which I can go into if you want, but OpenTelemetry is an attempt to address those problems and consolidate the landscape into a single project.
Which, I would say at this point, we seem to have succeeded in doing. It felt a little sketchy in the beginning because the stakes were so high if we screwed it up, but it actually seems to be working. Which is fantastic.
Marc: Great. That makes sense.
I'd love to talk a little bit more about the projects that it replaced, and specifically why you decided to donate OpenTelemetry to the CNCF. OpenCensus and OpenTracing already had some adoption.
You created this group of the founders of those projects and created OpenTelemetry as the one standard to "rule them all." It's a great xkcd comic, I agree.
It seems to be working right now, so can you talk us through the thought process a little bit around deciding to donate the project to the CNCF and make it a CNCF project?
Ben: Happy to. I'll try not to be too long winded about this for the benefit of your audience, but the short and hopefully interesting version is that of these three projects (OpenTracing, OpenCensus, and OpenTelemetry), OpenTracing was the first one to be created, and it was an intentionally very narrow project where the only thing it actually did was define an API for distributed tracing telemetry, and that was it.
Just the API, we didn't have wire formats as part of OpenTracing.
There was a contrib area in GitHub where people, if they wanted to, could throw stuff in there.
Or they could just do it themselves, wherever they were.
But that wasn't even really officially sanctioned by OpenTracing and there was no governance for that code.
It was a very thin project, and that was supposed to be a feature, and it was intentionally narrow so that it would be a very small dependency to bring in so on and so forth.
The problem is that the market really wanted to have that, but they also wanted to have a similar value proposition for metric data, and arguably, depending on who you talk to, for logging data too.
That desire gave rise to the OpenCensus project, which originated out of Google but ended up getting some good traction with Microsoft and a few other large corporate contributors, as well as some smaller individual contributors and so on.
The OpenCensus project was very similar to OpenTracing, except that the scope was broader and that it incorporated metric telemetry as well.
It wasn't totally tightly coupled, but it was more tightly coupled in that it was difficult to depend on just one part of OpenCensus and you'd end up bringing in a somewhat larger dependency.
That was creating some trouble for them as well, so the idea with OpenTelemetry was to take the scope of OpenCensus-- which is to say, all forms of telemetry are ultimately in scope for OpenTelemetry long term-- but combine it with the very intentionally loose coupling of OpenTracing, the small dependencies, and the "take what you need" approach, and to make a project that had all of those attributes from the get go. That required changes to the structure of the code and everything, and somewhat of a fresh start.
However, we were really trying to avoid the issue of having a third standard, so we made backwards compatibility bridges a requirement for both OpenTracing and OpenCensus, to give people a clean path onto OpenTelemetry.
But your question was about why we donated to the CNCF.
I think part of the reason that we donated to the CNCF was that the CNCF was actually instrumental in helping this happen.
I'm personal friends and certainly work friends with a bunch of OpenCensus people and I have been for a long time, but there was a period there two years ago maybe, before these projects merged where things got fairly acrimonious, actually.
Which, news flash, that can happen on the internet.
It was probably a very small minority of people on both projects that were squabbling, but it was boiling over on Twitter and it was getting very difficult to imagine these two projects collaborating.
Meanwhile, the main request that we got from our user communities was to solve this problem of having two projects with the same charter, which was actually creating confusion and slowing things down in the actual enterprise software world.
So we were somewhat stuck, actually, and the CNCF had a meeting in the Presidio in San Francisco where a bunch of people flew in from different places, and we all had a Kumbaya moment when we acknowledged that we had almost exactly the same goals and that there was no good reason for these projects to be competing with each other.
But CNCF brokered that, I would say, and was very helpful in re-establishing the trust and alignment that was needed to get these projects to merge successfully.
I think part of the reason that we donated the project to the CNCF was that they were already part of the conversation, and for me I was very grateful for that.
I think it's a great thing for the industry that they did that. The CNCF sometimes gets critiqued as being a marketing organization, but this was very difficult, and the hardest part of open source software is always the people, not the technology.
I think that they did a very good thing for the industry by actually engaging with both sides of that divide at the time and really making it into one project.
So, they had a real hand in that. That's a large part of, I think, why the CNCF choice felt obvious to all of us.
Marc: I see. Yeah, that makes a lot of sense.
In addition to just replacing or combining and reducing the confusion between OpenCensus and OpenTracing with one new project called OpenTelemetry, does OpenTelemetry add additional functionality in continuing the evolution of these projects?
If I already had adopted OpenCensus or OpenTracing, what would I gain from switching over to OpenTelemetry as the standard in the SDKs that I'm using?
Ben: Definitely, absolutely yes. I have to admit that I'm supposed to say that with a lot of enthusiasm.
I'm actually on the governing committee, and I'm often the voice for trying to keep the scope in check while we get this thing out into the world.
But there is a lot of stuff that's happening at OpenTelemetry that was never part of OpenCensus or OpenTracing.
It's great work, it's not that I have any issue with it, it's actually really exciting just to rattle off a few of the things.
Certainly there's more support for esoteric languages than we had in OpenCensus, and more support in terms of particular frameworks for those languages.
But beyond that, there's been a lot of effort put into OpenTelemetry automatic instrumentation, which is to say that as a post compilation step you add some dependency and suddenly you're able to get high quality tracing and metrics telemetry streaming out of your application in a way that's vendor neutral.
That's a real holy grail, actually: making what would have been called an agent in the New Relic and AppDynamics heyday, but vendor neutral. That's a major accomplishment for the space.
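As a hedged illustration of that agent-style model: in Python the equivalent is an at-startup wrapper rather than a post compilation step, but the effect is the same. The sketch below assumes the opentelemetry-distro and opentelemetry-instrumentation-requests packages are installed; the application itself contains no tracing code.

```python
# Launch with the auto-instrumentation wrapper, e.g.:
#   opentelemetry-instrument python app.py
# (Java gets a similar effect with the -javaagent: flag on the OpenTelemetry
# Java agent; Go, as discussed, still needs in-code instrumentation.)
import requests

def fetch_status(url: str) -> int:
    # With auto-instrumentation active, this outbound HTTP call emits a client
    # span even though this file never imports OpenTelemetry.
    return requests.get(url, timeout=5).status_code

if __name__ == "__main__":
    print(fetch_status("https://example.com"))
```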
So that's significant. The OpenTelemetry collector also descends from the OpenCensus collector, but it has a lot more functionality than it did at the time.
There's an entire SIG devoted to logging, which wasn't really in scope for OpenCensus at the time of the merger, and certainly wasn't in scope for OpenTracing.
I could go on and on, but there's certainly a great deal of activity in OpenTelemetry that didn't have a direct analog in OpenCensus or OpenTracing, although I think that the main charter for the project initially is to deprecate OpenTracing and to deprecate OpenCensus.
That's happening later this year and in my mind that's achieving that initial promise of the project, that we're going to fully replace these other projects and have one instead of three standards out there.
So, that's still en route and we're very close, like literally weeks away, which is exciting.
But that milestone in my mind should be the priority for the project until it's achieved, because that's the problem in the market that we were initially trying to solve.
Marc: That's a great milestone.
I'd love to talk more about that auto instrumentation and the post compilation step, that sounds super valuable.
I'd like to get a little bit more in the technical details of it.
Can you talk a little bit about the technical stack and the implementation, and some of the choices that were made in the OpenTelemetry Project from the beginning?
And how that's affecting the project right now?
Ben: Sure, I can try. The thing about automatic instrumentation, and part of the reason why this was something that was not-- This was not a light lift.
It depends a lot on the particular language. It's not like there is a single approach to that problem that's going to work across all languages, and in fact it's a little bit of a misnomer in languages like Go, for instance.
Where you can make instrumentation a very lightweight process and maybe get it down to a known, finite number of lines of code, but you can't do a post compilation integration in Go without changing the Go runtime in some fundamental way.
But a big step initially was getting Datadog to participate in the project and to donate their automatic instrumentation libraries that were formerly proprietary, but were actually pretty modular as a starting point.
So that happened, I would say, about nine months ago or something like that.
I can't remember the exact time, but I think that announcement came around the New Year or maybe a little before.
But Datadog had these Datadog-specific, Apache- or BSD-licensed open source libraries to get automatic instrumentation into their product, and they donated those to the OpenTelemetry Project as a starting point.
They weren't ready to use as they were, precisely. But they set up a lot of the scaffolding that you would need to build agents in many languages.
In other languages, like for instance in Java, that one ended up getting almost rebuilt from the ground up.
There was a project called Glowroot, which is an open source APM project.
This guy Trask, who presently works at Microsoft, is fantastic; he was the benevolent dictator for life on that project.
So Trask had donated that to OpenTelemetry as well, and worked pretty closely with this guy Tyler Benson, who's at Datadog and had worked on their agent.
They rebuilt things along with this effort from OpenTracing called the SpecialAgent project.
There were these three different approaches that they had to work through; ultimately they ended up with something that resembles all three approaches, but it's actually its own thing.
In terms of the technical details, I could try to go into it but it's very language specific.
That's what makes this so difficult, actually.
But it also makes it so valuable that you can have a single project you integrate with, and even if your organization runs Python and Java and has a Node app or something, you can expect to get high quality telemetry from all three languages without understanding how those agents work, which is a huge benefit.
Marc: That makes a lot of sense. I mean, we run a lot of Go and Node projects, and I know for instrumenting that code it's very different.
A post compilation auto injection of metrics is really easy in Node, but in Go we have to modify our code.
I assume every language then from what I'm hearing from you, they fit into a different place in that ecosystem and there's different lifts required by the engineering team in order to properly instrument their code to work with OpenTelemetry.
Ben: Another fascinating thing about this is something that I think has been publicly announced.
Take New Relic, for instance: that company went to market initially with a value proposition that had a lot to do with their agent, which was totally proprietary.
You just added the New Relic agent to your Rails app, and lo and behold, you had a dashboard with some useful data on it.
It was a good value proposition and I think that's why they are a public company now.
They have committed to basically ripping out all of their own agents and replacing them with OpenTelemetry.
They'll still provide a shell of their own, because they want to give their customers one dependency, not two.
So they'll wrap it with something very thin, but it's just a shim around the OpenTelemetry core.
The reason they want to do that-- I'm sure they're excited about contributing to the project, but it's also practical for them and for other observability providers.
I think there is very little interest in trying to maintain proprietary bindings to every project that people in the enterprise actually depend on.
It's not just the languages themselves. Each one of those languages has 50 to 100 commonly used frameworks that you need to be able to instrument as well.
It's just an incredible burden from an engineering standpoint for vendors, and LightStep--
This is an OpenTelemetry thing and not a LightStep thing, but the reason that we've pushed so hard on creating these projects is that we really don't want to get in the business of building and maintaining all of that software.
I think we feel like it's better for the market in general for it to be an open source community.
It's also better for the providers where we can focus on solving analytical problems, and not on integrating with the 712th framework in the world that we hadn't seen before.
So I think the auto instrumentation piece was a vital requirement for a lot of large vendors to actually move to OpenTelemetry as their answer for instrumentation, because automatic instrumentation is an expectation for software organizations that are trying to adopt a tool.
They don't want to spend a lot of time instrumenting, and until open source could handle that part of the problem it wasn't really a viable strategy for many vendors in the space.
Marc: I see. I'd like to talk a little bit more about other technical challenges.
A lot of CNCF projects are just purely technical, and they solve a technical problem.
OpenTelemetry has that, but there was also a lot of political and community work involved in consolidating the projects-- you talked about that, and about the auto instrumentation technical challenges.
But what about on the other side, for the technical challenges of integrating into existing systems?
Datadog, you mentioned they contributed their agents.
But if I'm a large org and I'm running Datadog and have an integration into their monitoring dashboard or Prometheus or even going kind of down the long tail of more esoteric tooling that the enterprise is used to, how deep has OpenTelemetry currently gone into integrating into those projects right now?
Ben: I think the entirety of the landscape of technologies and products and web services that OpenTelemetry may someday want to integrate with just feels infinite.
From a percentage standpoint, I couldn't say, but I would speculate that it's probably not that high.
However, for projects that are of high relevance to what we're doing, things like Prometheus for instance, we've put a lot of effort into making sure that those integrations either do work smoothly or will work smoothly so we don't paint ourselves into a corner.
I have actually, at this point, just literally lost track of the number of projects that have committed to adding or have added OpenTelemetry instrumentation, but it's certainly significant.
The focus for the project has been really on the correct propagation of the context that's used for distributed tracing across the sorts of applications that deploy on Kubernetes.
So that in some ways can help serve as a rule of thumb for prioritization, if there's technology that's almost never incorporated into Kubernetes then it's probably not going to be a particularly high priority from an OpenTelemetry standpoint.
Although there certainly are projects to integrate with of their own accord, and that's fantastic.
Similarly, there's not a total focus but there's more focus on getting the tracing part right first.
Metrics and logging are happening in parallel-- it's not that they're not a priority, but I think that from a sequencing standpoint they're going to come later.
The GA milestone for tracing is on the order of weeks away; it's supposed to happen in September of 2020, which is when we're recording this interview.
The GA for metrics will follow, so the tracing piece has been the highest priority from an integration standpoint as well.
Part of that is because we want to replace OpenTracing quickly, but part of it is also that there's more of a story for metrics already, so there's less pain in the world.
There's things like Prometheus and StatsD, so the level of suffering is not as high as it is for tracing where there really isn't a good, widely adopted standard right now.
Marc: That makes sense.
The adoption and the level of suffering for folks who are implementing other products in this landscape makes me wonder: can you help explain a little how OpenTelemetry plans to fit into and change the overall observability landscape for applications?
Ben: That's a good question.
I think it hearkens back a little bit to my comments earlier around how many vendors are perceiving the project and adapting their strategies in light of the project.
I would say that the overall value proposition for-- I don't know, "Observability" is the current term but it's not that different than what people used to call APM a few years ago.
I can go down that rabbit hole if you want, but for people who are monitoring or observing their applications, the value proposition that you got--
Especially from a vendor you would pay-- had a lot to do with just getting the data out of the application, and that was partly because operations teams would actually run the software and they were totally separated from the development teams.
Obviously there is a change happening in the world right now where developers own their code in production and DevOps blah blah blah.
So there's a change there as well, but with operators owning the code it was a big deal that you could run these tools without having developers make any decisions or change anything. So this post compilation integration was a critical feature.
I think it was so critical that it was actually a big part of the pitch. What's happened with the OpenTelemetry Project is that we're basically providing high quality telemetry for free and without any kind of vendor dependency whatsoever, which in some ways I think is quite liberating for the various providers of observability.
Whether they're open source projects like Jaeger or Prometheus, or whether they're vendors large and small, like New Relic or Dynatrace or LightStep for that matter.
But the board has shifted, in that everyone is collaborating on getting high quality telemetry without any kind of vendor coupling whatsoever.
Then the playing field for observability is moving to the analytical space pretty firmly, as opposed to where it was five years ago, where I think a lot of the effort was put into these agents.
So that's a really significant shift in the value proposition. With open source and the OpenTelemetry project, I think we're going to do a much better job ultimately in serving the needs of customers, because it's vendor neutral.
I think we'll have broader support, but it's taken away a big part of the value proposition for the conventional tooling.
Marc: That makes sense.
Can you share a little bit about what's on the short-term and medium-term roadmap for the OpenTelemetry Project, as it's currently a sandbox project?
I'd love to hear a little bit more about what the team is working on right now.
Ben: Since you mentioned the sandbox aspect, we're planning to go up for the incubation status pretty soon, certainly before the end of the year.
Which we expect to be fine. We meet all the criteria, at least as far as we can tell.
So that's one thing that we're planning on doing.
I already mentioned that we're going to GA the tracing pieces of OpenTelemetry very soon; it's imminent at this point.
We'd also like to GA the metrics pieces.
Having done both of those things, we should be able to formally deprecate OpenTracing and OpenCensus. I would observe that many end users have already realized that the writing is on the wall and are just adopting OpenTelemetry anyway.
But I still think it's an important thing for the project to finish the job and make sure that we get down to a single project and not three.
So, that'll be a major milestone as well. Looking into next year the project is just unbelievably populated right now with contributors.
I'm not a big fan of this metric as a signifier of success, but from just a commit and velocity standpoint OpenTelemetry is second only to Kubernetes in the entire CNCF right now for just activity.
There's a whole army of interns from Google and Amazon that have been working on this stuff over the summer, so there's a huge amount of velocity going in many different directions.
Although it's not exactly a typical milestone, I think a big milestone for OpenTelemetry will actually be the community management and governance facilities that are needed to organize that much activity and effort that's going into the project.
The roadmap for next year is not spelled out in a particularly static way, but I would expect considerable improvements to the core functionality around tracing, metrics, and automatic instrumentation, and I'd also expect the logging effort to get to something that's more generally available.
Splunk has been investing a lot of time in that, although there's no firm commitment there. A big metric for success for OpenTelemetry is just the set of projects that have formally integrated with it and are emitting OpenTelemetry API calls or OpenTelemetry data types over the wire, and I think that list will grow quickly next year.
We don't have a specific target for it, but just based on the sorts of conversations we've been having with other projects, from the service meshes to Kubernetes itself and so on and so forth, I would expect there to be a lot of announcements next year around those integrations as well.
So, that's the basic gist of where things are going.
But the focus still in the short term is on just getting everything GA'd and closing the door on the previous projects.
Marc: The commit velocity and being second only to Kubernetes, that's a great stat.
I think OpenTelemetry has just a massive surface area though, because you have to do all these integrations into different languages.
You talked about how, even when you break it down into the languages you want to integrate with, there are the different platforms inside each language.
If I'm writing Go, there's probably an integration specific to the Go web server that I'm using in addition to just the generic Go implementation.
Is a lot of that commit velocity right now just spread out across the surface area of those different platforms?
Ben: Yeah, there is a lot of that.
There's also a lot of work on things like the OpenTelemetry collector, which I should probably speak more about.
The collector is also something that's running in production for a number of well-known brands, and it's not a requirement to use the collector, but I think a lot of end users have found it helpful in that it can run either in a sidecar type model or as a central pool.
But it provides a mechanism to accept telemetry in a variety of formats and then emit it wherever you want, including multiplexing to different places.
It's just a data path for telemetry. When it's running in a sidecar mode, it can also reach into the application and grab certain statistics on its own too.
But the collector has received a lot of attention in the last six months from many of the vendors that are working on OpenTelemetry, so that's another important piece of the project that, unlike many of the other aspects, is not just a mile wide and an inch deep.
The collector is actually a pretty elaborate piece of software on its own that, like many things of that kind, requires a lot of effort to tune and to get into a place where it's suitable for production use.
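Here is a small app-side sketch of that pattern, assuming a collector is running as a sidecar and listening on the conventional OTLP gRPC port (4317), and that the opentelemetry-exporter-otlp package is installed; where the collector forwards the data from there is its own configuration concern.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Export OTLP to the local collector; the collector decides where the data goes
# next (and can multiplex it to several backends at once).
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("example-service")
with tracer.start_as_current_span("hello-collector"):
    pass  # spans are batched and shipped to the sidecar collector
```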
Marc: If I'm not currently using OpenTelemetry in my project, but I'm writing code and deploying it and running it in Kubernetes in production, how would I think about how to get started?
Would the collector be a key piece, even though it's optional, that you would recommend somebody new in trying to adopt OpenTelemetry start with?
Or where would you recommend somebody starting?
Ben: I think it depends a little bit on who that person is.
I would divide things among the following populations: there are people who maintain a small set of services as application developers; there are people who work on a platform team for an end user company and are trying to make decisions that affect the entirety of the infrastructure; and then there are people who work on either open source or closed source projects that might want to integrate with OpenTelemetry.
Those are three very different groups. For the first group, I would just point them at the automatic instrumentation and say, "Give it a try. The documentation, if you search for it, will help you understand how you can hook this up to any number of open source or closed source downstream solutions, just to see the data and get some value out of it."
For someone on a platform team that's trying to integrate the project, there's a welcoming community.
In some ways I just want to say, "Why don't you go into one of the OpenTelemetry forums?" There's a channel that's actually going to be replaced by Slack pretty soon, and you can just start asking around and trying to get some advice, because those are big decisions.
But if you had to start somewhere and you didn't want to talk to anyone, I guess it would be fine to just pick a couple of example services and do what I just suggested, just add the automatic instrumentation and get a feel for how things work.
Or you could pick up the collector and install that alongside some of those services and see how that works. Those seem like the right places to start.
Then for people trying to add OpenTelemetry to an existing project, whether it's an open source project or a piece of closed source infrastructure or what have you, the API docs for OpenTelemetry are pretty good, actually.
I think all you really need to do is integrate that and the appropriate calls and write some tests, all of which is pretty well documented.
But after doing that, you can add it to the OpenTelemetry registry, which means that other people will be able to discover it without stumbling upon your GitHub repository.
So, that's another population that is equally important, I think, to the project.
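For that third population, here is a hedged sketch of what "integrate the API and the appropriate calls" can look like in Python: a library depends only on the opentelemetry-api package, never on the SDK, so the host application decides whether spans go anywhere at all. The library and attribute names are illustrative.

```python
from opentelemetry import trace

# If the host application never configures an SDK, this is effectively a no-op.
_tracer = trace.get_tracer("mylib")

def fetch_record(record_id: str) -> dict:
    with _tracer.start_as_current_span("mylib.fetch_record") as span:
        span.set_attribute("mylib.record_id", record_id)
        # ... the library's real work would happen here ...
        return {"id": record_id}
```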
Marc: When you think about somebody new and they are adopting OpenTelemetry or the roadmap that you currently have in the path to move it into incubation, is there any particular type of use case or feedback that you're really tuned into right now?
Something maybe that's new that's in the project or just something that's interesting?
Ben: I've said this a few times, and I may be sounding like a broken record right now, but I think the promise of adding a single dependency and getting decent telemetry out of your application is pretty compelling.
Especially if you try to do it manually, which is a real bear.
I think that's very compelling and I'd recommend people give it a shot.
It's my fiduciary duty to say that you can send the data and stuff without getting out a credit card or whatever, and you can see how that data looks and you can start to understand performance bottlenecks and so on and so forth.
Of course, you can do similar things with open source software that you can just download and run on your own.
Like Jaeger, for instance, which I think is another popular option just to kick the tires.
But I think that's the thing to try. I feel like there's-- The trick of OpenTelemetry is that on its own it isn't going to do anything.
You need to pair it with some kind of observability solution, so you have to think about that at some level, like a two step process.
You add OpenTelemetry and then you have to send the data somewhere and get some value out of it.
It's not a project where there's a self-contained OpenTelemetry demo, I think the whole point of OpenTelemetry is that you shouldn't have to do very much and you suddenly can light up some pretty compelling tooling that requires good data.
It needs to be paired with some kind of observability solution, and the most exciting thing about OpenTelemetry is that you don't have to do very much.
You add this one dependency and then suddenly you're able to see things you didn't know about your own software.
Marc: That makes a ton of sense, the single dependency solution where a lot of us have old legacy code that was not properly instrumented when we built it.
It'd be great if we can throw the single dependency in and integrate it into our current observability dashboards and be able to instantly get some metrics and interesting data out of there.
Ben: Exactly.
Marc: I'd love to talk just for a minute about the differences in the observability ecosystem in general right now.
Like, if you think about it there's obviously the three pillars of observability that everyone talks about.
Tracing, metrics and logging.
You're really deep in this ecosystem at LightStep and Google, and I'd like to hear your thoughts about those three pillars of observability and really how somebody who's writing an application should think about solving that.
If I'm building a small application and a proof of concept, should I start with tracing from the beginning?
Or when do I adopt all these different parts of the OpenTelemetry Project?
Ben: I should say that I gave a talk at KubeCon December 2018 called "Three Pillars, Zero Answers."
It was basically an argument that we should stop talking about the three pillars of observability.
So I apologize, I'm going to give the TL;DR on that. I definitely think there are pillars of telemetry, and traces, metrics, and logging could easily be three of them, but that's telemetry, not observability.
The thing that I think is so difficult about observability right now is that you've got a lot of smart people who, whether they use OpenTelemetry or not, have managed to get tracing, metrics, and logging to coexist in three tabs of their browser.
Each one has its own data type, its own appropriate query API and visualizations, and so on and so forth.
Yet I don't think that they've really addressed the core problems that they have from a business standpoint, and that's because I don't think that those three things, tracing and metrics and logging, should be separated at the level of the product that people are actually using.
It's more than just having different modules of an observability application where you switch between metrics, logs and tracing.
It needs to be much more tightly integrated.
My point of view is that for micro service applications, the tracing data on its own should rarely be viewed by human beings, but should be used to inform the analysis of everything else just continuously. The tracing data is the only thing we've got that actually represents the dependencies of individual transactions, which is just unbelievably valuable, if you can find tooling that knows how to take advantage of that.
To be more specific: most microservice applications are pretty deep, and these deep systems often have five-plus layers from the top of the stack to the bottom. If you're at the top of the stack and your pager goes off because your latency is high, it's so much easier to figure out what's going on if your observability tooling understands your dependencies and can actually separate the slower requests from the faster requests, and figure out how the dependencies behave differently in the slow requests versus the fast ones.
You can't do that without tracing data. However, that's really not the end of the investigation.
You then need to figure out, "What's different about those slow dependencies?"
That data often will be found in logs or in metrics or sometimes in trace tags, but the data layer for observability needs to be able to pivot between these different telemetry types.
Even to satisfy a single user request to help understand what the contributing factors are across these different forms of telemetry in the context of some user problem, which will usually be either a sudden change of behavior or just steady state performance analysis.
In either case, you have a comparison between things that are good, things that are bad, and you need observability to be able to look at those two data sets across different types of telemetry and find the correlations that explain the difference.
Those really shouldn't be, in my mind, thought of as "pillars."
There's really only one question in observability, and that is "Why did that change?"
It's like something changes and you need to figure out why, and you're not going to be able to answer that question by looking at just any particular form of telemetry and having these three tabs running in parallel and doing the join yourself as a human being.
It requires a level of expertise and training that I don't think is realistic, so I have a pretty strong opinion that all three forms of telemetry are necessary, but that the observability solution should be simple and oriented around discovering and explaining changes.
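To ground that in code terms, here is a hedged sketch of the kind of span attributes (the "trace tags" mentioned above) that let an analysis layer compare the slow population against the fast one and surface what differs; the attribute names are illustrative, not a prescribed schema.

```python
from opentelemetry import trace

tracer = trace.get_tracer("api-gateway")

def handle_checkout(customer_tier: str, region: str, cache_hit: bool) -> None:
    # Latency is captured by the span itself; the attributes below are the
    # dimensions a tool could correlate against when latency regresses.
    with tracer.start_as_current_span("GET /checkout") as span:
        span.set_attribute("app.customer_tier", customer_tier)
        span.set_attribute("app.region", region)
        span.set_attribute("app.cache_hit", cache_hit)
        # ... handle the request ...
```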
Marc: It makes sense.
I think it sheds some light into how you've built OpenTelemetry and how you've thought about the project, as software is just getting really complicated, with more microservices, and traffic patterns are changing.
You mentioned it's a lot easier if you have these spans and these traces that you can use to debug the solution, but it's honestly probably required at some point.
There's a lot of pagers going off in the middle of the night, and without that data you have no idea how to debug that.
Ben: Exactly.
Marc: Cool, all right. Thanks, Ben. I think that answered all the questions I had around the OpenTelemetry project.
I'm excited to see it move from the sandbox to the incubation group inside the CNCF.
Ben: Yeah, me too. I'm crossing my fingers. It hasn't happened yet, but I think we're in good shape.
I should just close off by saying that although I've had some role in the project and I'm one of the nine people on the governance committee, I'm just one of literally hundreds of people working on this project, and I'm writing a lot less code than most of them.
So I'm happy to talk to you about it but the success of the project really is entirely due to the efforts of these hundreds of people who have been busy working on it for a year and a half at this point.
Marc: Absolutely. We definitely appreciate you joining and talking about the project, and also everybody who's committing and contributing.
Even just using it and providing feedback makes the project successful and makes it better.
Ben: Exactly.