Ep. #47, Outcome-based Observability with Gibbs Cullen of Chronosphere

Guests: Gibbs Cullen

In episode 47 of o11ycast, Liz Fong-Jones speaks with Gibbs Cullen of Chronosphere. Together they compare the roles of product managers and developer advocates, and discuss the phases of outcome-based observability.


About the Guests

Gibbs Cullen is a Developer Advocate at Chronosphere and former Amazon Web Services product manager.

Transcript

Gibbs Cullen: The way that we view observability at Chronosphere is all about-- The mission is being, "how quickly can you remediate a problem?"

In order to do that, you have to understand a lot of what's going on.

And the way that the pillars are structured are more like inputs to that problem.

The things you can use to figure out what's going on to help remediate your problem faster.

But the way we were thinking about it is it should be more output focused.

So that's why we shifted more to the phases versus just the three pillars, and our phases are know, triage, and understand--

Liz Fong-Jones: And I love that focus, right?

Because you should be able to fix the problem even before you necessarily fully understand it.

Otherwise you get rabbit-holed and you're not actually solving the problem for your users.

Gibbs: Right. Exactly. And sometimes you have to know the problem.

So you have to know that something's not working as it should or know that something's wrong.

And then triage: you can try to remediate during the triage phase. But not until you resolve the problem initially are you able to then go and take the time to understand the real root cause, why this happened, and try to prevent it from happening in the future.

And you can do all these things with your telemetry data, so your logs, metrics and traces.

But if you don't look at it from the outcomes perspective, just looking at it from the inputs perspective might not be enough to get you where you need to go.

Liz: Yeah. There's this thing that we've been talking about where we agree with you this idea of outcome based observability, because otherwise you're collecting signals like you're collecting Pokemon, but they're not actually doing you any good.

Gibbs: Yeah. So that's how we view the things and being able to have a solution or a platform where you can use all those.

The inputs like the three pillars to do the remediation is going to be important, but it's making sure you have everything that you need.

The resources you need to be able to utilize them in an effective way and get the outcomes, I guess, you need to run an effective observability function.

Liz: Excellent. So now would be a good time for you to introduce yourself.

Gibbs: Okay, great. So my name is Gibbs Cullen.

I am a developer advocate with Chronosphere where we are building a cloud native observability platform.

And I have been with Chronosphere for almost two years now.

And then before that I was over at Amazon for about five and a half years, mostly doing product management.

Liz: That's interesting that you talked about this idea of coming to developer advocacy, not necessarily from an engineering perspective, but I almost think about product management as an engineering function.

Gibbs: Yeah. It can vary for sure.

I think I had some different types of product management experience at Amazon where I was much more product focused and more on the business side.

And like, what is this product going to look like?

And how are our customers going to interact with it?

So I guess they can both be... Engineering deals with similar issues.

But I think probably the way that you are thinking about it, and the way product management is more traditionally seen, is being the right-hand man or woman of the engineering team to try to develop these products and get them out there and hopefully solve problems for customers, whether they're internal or external.

Liz: That definitely is a helpful perspective to have because often engineers tend to focus in on the how rather than the why.

So it's really helpful to understand how are people in the end going to use this.

Gibbs: Yeah, definitely.

And so, yeah, no, I think all of that product management experience really gave me a lot of...

Because I didn't really know what a developer advocate really was prior to Chronosphere.

I know we have some developer relations and developer advocates within Amazon, more so in AWS, but it was all--

What appealed to me about that role was where it was just more focused on the how and also the why.

And it was all about helping promote your product more so and trying to get a better understanding of what customers were looking for, or your users were looking for and trying to articulate that back to build a better product.

Liz: Yeah. It feels like that's something that every company that works in developer tooling absolutely should have, but that not everyone does have developer advocacy or at least a robust developer advocacy department.

Gibbs: It definitely varies. I'm the only developer advocate at Chronosphere for example. We-

Liz: Although you're hiring though, right?

Gibbs: We are hiring. So that's good.

We definitely need to keep building this function now because I think it is a very valuable function and there's many different components of being a developer advocate.

And I think it's really hard to have someone that focuses across the board-

Liz: Right. We were preparing for this and Chronosphere has recently announced tracing, but your bread and butter has been metrics for the past couple of years and that's kind of-

Gibbs: Yeah. So exactly.

And so I think, having more people to focus more in on these different lanes of the company will be important.

It's definitely a really interesting space though. And I find it really fascinating how developer advocacy or devrel or developer relations, I feel like there's so many different terms for it and they all are slightly different I guess, but how it is so different across the industry and by company.

So it's been an interesting space to be in.

Liz: Speaking of which, what caused you to come to Chronosphere, to come into the observability field?

As you were saying you were working more on a product team at Amazon that was not necessarily AWS or dev tools.

Gibbs: So yeah, I think I was at Amazon for five and a half years.

And during my time there, I went progressively more to smaller and smaller teams and more earlier stage products, all under the umbrella of Amazon, of course.

But my last role at Amazon was on this team within AWS that launched this new program called Datalab.

And it felt like a startup because I was one of the first product managers.

It was a really small team and I kind of got to wear a lot of hats and do a lot of different things. And so that really whetted my appetite to want to join a startup outside of the Amazon umbrella, where the stakes are a little bit higher and everyone, I feel like, has to really come together on that common vision.

So yeah, that's what led me to go to a startup, just the early stage startup like Chronosphere and with each time I make a switch in my career, I do try to go somewhere where it is a little bit uncomfortable and it's a lot of new material where I can just keep learning.

Eventually I'm going to have to stop doing that and get to a point where I can become a little bit more specialized and more of an expert, which I think observability as a whole is a space that I think I could probably try to do that, because there's just so much material--

Liz: And we're still early on, but I think that's part of what's great about being a dev advocate is, we get paid to professionally be learners all the time and to put ourselves in the shoes of people who are brand new.

Gibbs: Yeah. And if you don't, then you're going to not be as relevant.

People look to developer advocates to learn what's going on with the product.

What's the latest in the industry? Get the opinions, you kind of set the tone.

And I know you do a great job at this because you're always advocating for open source or OTel or Honeycomb, but yeah.

You have to be a learner to be successful in this role.

Liz: Yeah. It's one of those things where I like coming in and building demo apps and seeing, how does this actually work?

What are the roadblocks? How do we fix it?

And I think in that aspect, it's been an exciting couple of years for OpenTelemetry.

I'm glad we're finally towards the end of the release-candidate-to-1.0 road, at least for tracing, but now we have to conquer metrics.

And after that we have to do logging. And after that we have to do continuous profiling. There's--

Gibbs: Yes. Never ends-

Liz: Yes. The pillars are bullshit as far as collecting them all, but no, they all provide some amount of value.

So unlocking that value for OpenTelemetry ecosystem customers is important.

Gibbs: Exactly. I think having them together is going to provide more value than having them all just be independent.

So that definitely makes sense, but yeah, never ending amount of work to do.

Liz: So speaking of this idea of discovering things as we go along, to what extent do you wind up dogfooding things at Chronosphere?

Are you involved in that process of experimenting with the tooling and seeing where it works, where it doesn't work? What have you learned?

Gibbs: We definitely dogfood all of our products and we use Chronosphere for our own monitoring and observability needs and purposes.

And so I get excited by that because if we are our own customer, that means that we're going to want a product that's as efficient and cost effective and all these other things that our customers might want.

So that keeps us honest, I guess, with what we are really providing and making sure we're continually making a better product.

So I personally don't do a lot of the testing of products while they're being developed.

Once they are developed, I will definitely play around with them.

But I know our engineering teams are very involved with dogfooding during development and then also once we have new products rolled out, we continue to use them.

Liz: Yeah. It's one of the fun things that we've done at least with regard to Honeycomb is treating our dogfood environment as user acceptance testing.

It's the first to get the new release. If it breaks, we'll find out about it pretty quickly.

And that way we get a chance to give it one last test before it goes and reaches all of our customers.

Gibbs: Yeah. And that's really important because otherwise you're going to be relying on your customers to give you all this feedback and sometimes you might not get everything that you need out of that feedback loop.

Liz: Yeah. There are definitely a couple of traps with that though, because right, if you're an expert in the product, you're not going to necessarily see things that someone who is a non-expert is going to notice.

Gibbs: Yeah, that's true.

Especially in observability, in products like Honeycomb and Chronosphere, where I guess we have engineers that are...

A lot of people came from observability or had an understanding of monitoring and metrics beforehand.

And so that's not the case for a lot of our customers where a lot of engineers, they're still new to the space.

Liz: Right. It's that exciting thing where people are embracing this journey of production ownership.

And that means you have to learn all these tools that previously other people were doing for you.

Gibbs: Right. Exactly.

But that gets to your point that sometimes, if you are too close to the problem and if you are an expert, you might not see all the different angles.

So I think we definitely have to be aware of that.

Especially given that our engineering team is probably not the most representative of the engineers overall who are working in observability across all these different companies that are adopting these cloud native monitoring or observability solutions.

But yeah, I think we do a good job of testing ourselves and always looking for feedback from customers or users of our products.

Liz: Speaking of cloud native, you mentioned that you are conceptualizing Chronosphere as this cloud native observability solution.

What does it mean to you for an observability solution to be cloud native?

I know that the CNCF has TAG observability. How do these things relate to each other?

Gibbs: I think for us the shift to cloud native, it's still ongoing and it hasn't--

I feel like it came into conception, not even that many years ago, but it's forced companies to reevaluate how they manage their infrastructures and everything.

So I think the shift going from more VM-based infrastructure to more application or microservices-based, container-based architectures is just driving an explosion in metrics, where people we talk to are seeing 10 to 15 times more metrics than they used to in their old, VM-based--

Liz: Right. Exactly. The containers are ephemeral, the tags are constantly going to be changing, therefore--

Gibbs: Yeah.

Liz: Yeah. That definitely is a large architectural difference. In the tracing world, it's been interesting.

We've been talking about this idea that, in theory, your application should be agnostic to where it's running, and therefore the difference as far as the tracing signal is concerned is simply that you have an encouragement to proliferate more microservices that you need to understand better, but it's not necessarily a change to the format the data comes in.

Whereas it sounds like in metrics, you have all these containers that are being started up and shut down, and you only get metrics for a certain period of time from any one container.

Gibbs: Yeah.

I think that's been a big driver in creating these solutions at Chronosphere where we call ourselves more cloud native is because I think the goal is to be able to come up with a solution that can accommodate this level of scale and the increase in metrics that we're seeing across the board.
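
A minimal sketch of why ephemeral containers inflate metric volume, assuming one time series per metric name per unique label combination. The metric names, service name, and replica and deploy counts are hypothetical and chosen only for illustration; they are not figures from the conversation.

```python
def series_count(metric_names, label_values):
    """Count distinct time series: one per metric per unique label combination."""
    combos = 1
    for values in label_values.values():
        combos *= len(values)
    return len(metric_names) * combos


# Hypothetical metric names, for illustration only.
metrics = ["http_requests_total", "http_request_duration_seconds"]

# VM-based setup: three long-lived hosts keep the same identity for months.
vm_labels = {
    "service": ["checkout"],
    "host": [f"vm-{i}" for i in range(3)],
}

# Container-based setup: still three replicas at any moment, but every deploy
# replaces them, so ten deploys leave thirty distinct container IDs behind.
container_labels = {
    "service": ["checkout"],
    "container": [
        f"checkout-{deploy}-{replica}"
        for deploy in range(10)
        for replica in range(3)
    ],
}

vm_series = series_count(metrics, vm_labels)
container_series = series_count(metrics, container_labels)
print(vm_series, container_series)
```

The workload is the same size in both cases; only the churn in container identity changes, and that alone multiplies the number of series the backend has to store.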

Liz: Yeah. So I guess there's the data volume challenge.

What are the problems of trying to correlate behavior between applications that are now jumping around between different hosts?

Gibbs: Yeah, that's an interesting problem.

Because I feel like if you do have applications jumping around various hosts, you might be having to jump between different dashboards or views to get that single pane of glass or that overview of what's happening across all of your hosts.

So I think having a solution where you can have everything in one place, that's going to be really important. Because that problem is going to continue on and you're going to have... The complexity is just going to get greater and greater and greater.

So I think as long as you can maintain a single point of view of everything or overview of what's going on, then that'll be important.

So at Chronosphere I know we're trying to really make it so that everything's integrated really tightly: traces, metric data, all integrated really tightly.

So you don't have to switch between different panes or windows to see what's going on.

Liz: Yeah. That cognitive switching cost of switching tools per signal is just way too high.

I think the other interesting thing that you said that I really liked was the idea of you shouldn't need to care which container something is running in.

That you should have this overall view and that the older ways that were focused on per VM views just don't work when you're more interested in the application itself.

Gibbs: Right. And then you can have the ability, with traces or whatever telemetry data you want, to go in, deep dive, and look at individual VMs or containers.

But I think that shouldn't be the first step that should be part of the triage or remediation.

Liz: Right. Exactly. The joke that I tend to make is you wouldn't look at a tree trunk with binoculars. That just doesn't make sense.

You want to start zoomed in at the right level and then be able to zoom in or out from there.
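
A minimal sketch of that "start at the application level, then zoom in" workflow, assuming a handful of hypothetical samples tagged with a service and a container. The field names and numbers are made up for illustration.

```python
from collections import defaultdict

# Hypothetical per-container samples; names and numbers are illustrative.
samples = [
    {"service": "checkout", "container": "checkout-a", "errors": 0, "requests": 100},
    {"service": "checkout", "container": "checkout-b", "errors": 40, "requests": 100},
    {"service": "search", "container": "search-a", "errors": 1, "requests": 200},
]


def error_rate_by(rows, key):
    """Aggregate the error rate over whichever dimension the caller picks."""
    totals = defaultdict(lambda: [0, 0])
    for row in rows:
        totals[row[key]][0] += row["errors"]
        totals[row[key]][1] += row["requests"]
    return {name: errs / reqs for name, (errs, reqs) in totals.items()}


# First look: the application-level view, with no container detail at all.
by_service = error_rate_by(samples, "service")
worst_service = max(by_service, key=by_service.get)

# Only after a service stands out do we zoom in to its individual containers.
by_container = error_rate_by(
    [row for row in samples if row["service"] == worst_service], "container"
)
print(worst_service, by_container)
```

The same aggregation function serves both views. What changes is the dimension you group by, and the per-container breakdown is only computed for the service that already looks unhealthy.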

Gibbs: Right. No, of course. And it sounds really obvious, but I think-

Liz: People have all these old tools and they're like, "You know what? I just want to use this one tool."

And it's like, "Does it really make sense to continue using that same tool? And applying to all the problems?"

Gibbs: Right. Yeah, no, I know exactly.

And I think as data continues to grow and cardinality continues to grow maybe people that may be holding out on or holding onto some of these older legacy tools may start having to be forced to think about other ways of managing their data.

Liz: Yeah. Oh, speaking of cardinality and data storage engines.

I know that Chronosphere is built on M3, which is open source.

How has that been, to be based upon a fully open source product?

How does that relationship play out?

Because we've seen these weird things with Elastic deciding that you're going to fork and take things closed source or source available.

What's that dynamic been for you and the broader M3 community?

Gibbs: I find it really exciting to be based on a fully open source product.

I think back at Uber they built M3 internally to meet Uber's own metrics monitoring use cases.

And they could have very easily just said like, "Okay, we're just going to do this just for Uber."

But they decided from day one to open source the project.

So it's been open source from day one. And I think that's really cool.

Because while all these standardizations are great, more open protocols and everything, I think not having companies like Uber build these new custom solutions would maybe limit the innovation that we continue to see in the space.

And the fact that they decided to open source it, I think there are other solutions that are similar, Cortex and Thanos, all created in a similar vein.

And I think without having that drive to be open source from day one, we wouldn't be where we are now in terms of having all these different solutions, continuing to innovate-

Liz: Right. It's the competition.

And also being able to look at what the other project is doing and incorporate features, that's very handy.

I was part of Google, not when the Bigtable paper was originally released, but I was working on Bigtable for a very long time.

And it was very interesting to see that HBase came along because of Bigtable. That people went and implemented Hadoop, implemented HBase.

And then when we released Cloud Bigtable, we had to reimplement the HBase protocol and map it onto Bigtable because we didn't release Bigtable as open source in the first place, right?

Gibbs: Right.

Liz: So that transition from "the data platform is our secret sauce" to "the data platform is open."

I think that's been a huge strategic shift in the way that large companies see things now.

Gibbs: Yeah, exactly. And things are going to always be changing and continue to change.

And so making sure you're continuing to be adaptable and being consistent or compatible with these open standards is going to make any sort of changes a lot easier.

And so that's something that we've held strong to: just making sure that we are keeping consistent and compatible with all these standards so that our customers don't have to go through those pain points that you may experience with more closed standards.

Liz: Right. Exactly.

And that makes it seamless for someone to move from self-hosting M3 to using you, or using OpenMetrics to send data to--

Gibbs: Yeah. Exactly.

Liz: Both Prometheus and to you. Yeah.

Gibbs: Exactly.

Liz: Yeah. It's definitely a thing where at Honeycomb, we decided, you know what, actually we're deprecating, or moving to maintenance mode, our previous telemetry ingest SDKs because OpenTelemetry exists now.

There's no reason not to kind of use the common shared technology.

What's your experience been working with TAG Observability at the CNCF, formerly SIG Observability?

Gibbs: Yeah. So I've started regularly attending and participating in the TAG meetings, maybe beginning of this year.

And it's been an interesting experience just sitting in and now getting to a point where I feel like I'm participating a bit more.

One of the chairs, Matt Young, he really has a lot of these ideas for ways to grow engagement with the TAG.

So I'm helping out a little bit with that.

And a lot of these efforts are going to be eventually to try to recruit people of different backgrounds and of different roles and of different interactions with observability.

Because right now I think a lot of people I've talked to want to get more involved with the TAG and really like what they're doing and want to be more involved with open source.

But I think it's a little bit hard to get yourself in there since most of the people that are running or highly involved with the TAG, are experts in the field and-

Liz: Experts in the field, maybe work at vendors or work at companies who have done this a long time.

Gibbs: Yeah. They're mostly engineers.

We don't have many end users, don't have many people that are not engineers.

So I think I would love to see.

And I think the TAG is working to try to do initiatives, to try to recruit more people that can have a more representative voice and perspective of what's going on across these CNCF observability projects as a whole.

And not just from the perspectives of these vendors.

Liz: Yeah. It's interesting to see the balance between the SIG instrumentation in Kubernetes and then the overall TAG for the CNCF, as well as the OpenTelemetry project.

There are so many interesting efforts going on there and I'm glad to see that all of us are focusing on extending the set of voices that are represented at the table.

Gibbs: Exactly. And I think part of me also feels like there's so much mind share to be spread across these groups.

And I know that they're all their own thing and kind of siloed, but I've always been curious to see if there were ways to get these groups together and maybe collaborate, because a lot of the groups do have similar goals and missions.

So I'm still new to all of it a little, for the most part.

But I think that would be an interesting thing to see in the future if that would be at all possible.

Liz: Awesome.

So the last thing that I wanted to talk about is so both Honeycomb and Grafana and Chronosphere have announced fairly large funding rounds in the past couple of months, what does this mean for our space?

Gibbs: Yeah, I know it's great.

Liz: Congratulations, by the way.

Gibbs: Well, congratulations to you as well. Very exciting.

And then it is kind of crazy to see all of this funding come together in this space in such a short amount of time.

But I think it really just goes to show that there is a really pressing need for observability and these types of solutions in the market.

And I think this increased amount of funding will allow, at least I know at Chronosphere, it's going to allow us to continue our efforts to really build out more of a platform for our customers or users to really come and be able to deploy all the three phases that we discussed earlier in one place.

And so by having this additional funding, I think we're going to see that across the board, just having these solutions become more robust and become better solutions for small companies or big companies.

I think we'll start seeing a lot more come out of these companies with the additional resources and funding.

Liz: Yeah. It's definitely a thing where all of this investment in this space means that we can develop things that are both individual to our companies, as well as things that raise the tide for everyone and help everyone get better at this.

Gibbs: Right. Exactly.

Liz: It's going to be definitely a very interesting time, but I'm glad that we're still in this phase where we can contribute to these common solutions and raise awareness of the outcome based observability rather than fighting over the definition of observability.

Gibbs: Yes, I agree.

Liz: Yeah. The other thing that I think is really cool about this is it validates... I don't know if you've seen the Stripe Developer Coefficient survey from a couple of years ago.

Gibbs: I don't think I have.

Liz: But Stripe basically interviewed a bunch of engineering teams and they basically said we're spending 17 hours of every 40 hours on break fix work.

And it's like, "Really? That's more than 40% of your week that's just gone like that."

We all deserve so much better than that.

And observability, I think is this huge piece of it that is finally getting invested in enough that we can make it table stakes for everyone.

Gibbs: Right. Yeah, exactly. Because it's not sustainable to have an operation where your developer headcount and your metrics are growing linearly together.

So I think, these solutions and platforms like Honeycomb and Chronosphere eventually, I feel like they're going to get more robust and more powerful where these companies can rely on them more and not have to put people behind the problem necessarily.

Liz: Awesome. Well, thank you very much for joining me today.

It was a pleasure having you on the show.

Gibbs: Yeah. Thank you so much.