In episode 45 of o11ycast, Charity and Liz are joined by Chris Holmes of Simply Business. Together they unpack OpenTelemetry, the importance of tracing, and the observability adoption process at Simply Business.
Chris Holmes: It would be around I think, three to six months.
I think we first started looking at it in April and we started running some spikes to prove that it fit the problem that we had.
We needed to add some tracing to a service that didn't have any tracing.
It was a heavy amount of effort to add the tracing compared to a typical application because of some fundamental design decisions behind that application.
We knew we had to add in quite a lot of custom implementation ourselves.
Rather than choose a vendor specific library, we looked at OpenTelemetry as an alternative.
At the same time, we were also looking at swapping between vendors so it gave us an extra level of flexibility that we could use.
We could effectively point OpenTelemetry at multiple different vendors, test the tracing against each, and see if the user experience in each one of those vendors was to our, I guess, taste.
Liz Fong-Jones: Cool. Definitely that comparison shopping is a really awesome feature.
Why hadn't you considered OpenTracing or OpenCensus before that?
What was special about OTel that was awesome for you there?
Chris: Honestly, I would say it's mostly just the timing thing.
The team who looked at OpenTelemetry at the time didn't have any tracing. They hadn't really looked at it because technology's complicated and they were solving other problems, and it was only really at that point in time that we sort of looked and said, "Look, we probably could benefit from having some tracing. We've got this gap here. We have other services that have tracing; look what they can do and the things they're able to inspect when a problem happens. Well, we probably should add something here."
And it seemed at the time that OpenTelemetry was mature enough to add to our application.
Liz: Yeah, so you definitely didn't want to kind of be the very first, but once it was kind of proven out enough, you were willing to start adopting it seriously.
Charity Majors: How much reimplementation of your instrumentation did you have to do when you switched to OTel?
Chris: Well, because this application in this particular instance didn't have any, it was largely start from scratch.
It definitely wasn't a matter of just downloading a library and running some config.
It did start off that way.
We started by just looking at the out of the box instrumentation so we could just take how it interacts with Redis and see those interactions and how it interacts with Kafka and see those interactions.
But what we needed to do ourselves was add some glue so we could actually connect those interactions together, because what we were missing was something to sort of act as a seam within our application itself.
The application is an event-driven application, so it follows quite a lot of CQRS principles.
Liz: I see. It's not just strict request-response, you kind of had to...
Chris: A lot of the, I guess, out of the box instrumentation, all the standard processes for most applications that we have had to observe or add tracing or logging to, are for those standard request-response applications.
They act in a very consistent way.
We can take a very similar approach for a lot of those things.
In some cases, to actually measure things and understand how they behave, we can probably just scrape the load balancers and get metrics out of them.
But this application was different.
We couldn't use the load balancer logs because what happens immediately after the load balancer is that we take the information we get from an HTTP request and put it onto a queue.
We don't actually know whether the system's behaving adequately just by looking at HTTP traffic.
But what we have as well in this particular application, which makes it quite special, is that it needs to be quite fast.
We really need relatively low latency.
It's a telephony application and we have users both internal to the organization and then actual customers who are on the other side of the phone.
And when someone interacts with the application, we have to respond relatively quickly because people just don't really have patience, especially when they're on the phone.
And so we have made some decisions, mostly around being able to achieve a low latency.
We use streaming across our applications and we use a particular streaming library, which didn't have any out of the box instrumentation.
That's the major thing we had to come and add.
And the reason we chose OpenTelemetry is that we knew we had to add in some extra boilerplate of our own, and we would rather do that using a standardized approach, one we think is not going to have to be swapped out when we change vendor.
We might change vendor in five years. Well, we're unlikely to have to write that code again because we've chosen OpenTelemetry. At least that's our bet anyway, in this regard.
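Editor's sketch: the "glue" Chris describes, carrying trace context across a queue rather than an HTTP call, can be illustrated with the W3C `traceparent` format using only the Python standard library. This is not Simply Business's code. The message shape, the `publish` and `consume` helpers, and the in-memory queue are all hypothetical stand-ins for a real streaming library.

```python
import queue
import secrets


def new_traceparent() -> str:
    """Start a trace: version 00, random 16-byte trace ID, 8-byte span ID, sampled."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_of(traceparent: str) -> str:
    """Keep the trace ID and flags, but mint a fresh span ID for the next hop."""
    version, trace_id, _parent_span_id, flags = traceparent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


def publish(q: queue.Queue, payload: dict, traceparent: str) -> None:
    # The context rides in message headers, the same way it would in HTTP headers.
    q.put({"headers": {"traceparent": traceparent}, "payload": payload})


def consume(q: queue.Queue) -> tuple[dict, str]:
    message = q.get()
    # The consumer continues the trace it was handed instead of starting a new one.
    return message["payload"], child_of(message["headers"]["traceparent"])


events = queue.Queue()
http_context = new_traceparent()          # span for the incoming HTTP request
publish(events, {"type": "call_answered"}, http_context)
payload, consumer_context = consume(events)

print(http_context.split("-")[1] == consumer_context.split("-")[1])  # same trace ID
```

Because the consumer reuses the trace ID it received, the HTTP request and the work done after the queue show up as one trace rather than two unrelated ones.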
Liz: Now would be a good time for you to introduce yourself, who are you? And what do you do?
Chris: Sure. My name's Chris Holmes.
I am an engineering manager. I work for a company called Simply Business.
We sell business insurance to small and medium-sized companies.
We work in both the UK and the US. We write a lot of software, we consider ourselves a tech company.
We have a variety of different sort of internal services we run.
And one of them is a telephony application, which is the thing we've recently implemented.
That definitely gives a lot of context to the idea that people don't like to be kept waiting on the phone and your customer support representatives deserve a positive work environment where people are not abusing them on the phone.
Chris: Yeah. Yeah.
One of the big reasons we came to this app is that for the previous version of it, we did create a request-response version of the app.
One that was built, I guess, in a more traditional manner.
I think it was a Rails application and it interacted with Twilio in the background to sort of coordinate all the phone calls.
But the feedback we had just generally was poor. It was too slow.
Sometimes requests would take five seconds to respond because it had to interact with Twilio in the background and everything was too slow.
And our consultants, as we call the people who work in the contact centers, just had a very bad experience overall.
Whereas migrating over to this event driven system, we sort of were able to work optimistically.
We could make small changes that could be handled in the background when we interact with Twilio and then we could just push back the change so the consultant could move on or not get blocked by clicking a button.
Liz: That sounds a lot like what Charity and I frequently discuss on this podcast, which is this idea of shifting things left, breaking down things into smaller units.
It sounds like moving away from having to wait for the whole thing to succeed before you return anything really enabled you to accelerate your software delivery.
Chris: Yeah. Yeah. I think what it really allowed us to do was modularize the codebase.
We were able to effectively create, microservices might be a word you might want to use, but effectively we have services that consume messages and perform a very particular function and then publish their own messages back into the streams that something else can then pick up.
So when we want to add a new part to the system, we can effectively just start by creating a service from scratch that performs that role. And generally they all interact well with one another. There is a cost for that though. We have 20 to 30 services to provide this platform. We have a relatively small team. It doesn't need loads of people working on it but there is a lot of complexity involved with it.
And so onboarding people takes a lot of time because they have to get to grips with, well, telephony as a domain generally anyway, because telephony is a whole different thing from web services.
But they also have to understand, well, we chose this particular architecture. This is what CQRS is.
This is how our services talk to one another.
And one of the reasons we looked at tracing was so we could actually help the team visualize what's going on in the system.
Because previously it was, you can look at the logs, you can read the text that comes out of the system.
Charity: You're just stabbing around in the dark.
Chris: Yeah. Yeah.
You didn't get to see the HTTP requests we make to Twilio, where they happened, and what happens when they retry.
What was the impact of that?
Whereas now we can visualize it. I think the best feedback I've had about this was from one of our staff engineers who works on the team. He was effectively one of the original developers of the system and he's been working here quite a while.
And his view of this was effectively, we were able to see inside his brain.
This was his way of expressing, without having to draw a lot of diagrams, how the system worked.
Liz: Right. It's self documenting.
It's the system as built, not the system as designed, even where you discover these things that don't necessarily match what the engineers originally thought they were making.
Charity: Instrumentation is like comments plus reality.
Chris: Before I worked here, I used to work at the government digital service and one of the services we ran, the service I worked on there was a sort of microservice thing, I guess at the sort of dawn of microservices.
It was a bunch of different Java services that talked to one another, and we always struggled to onboard people for similar reasons, having to explain these different roles.
And we would get pieces of paper out, and we'd put them on a whiteboard and draw lines between them and we'd write down the roles of these different services.
But it still relied on trusting experts who had to come along and say, "This is how it works. I promise you can go read some Java code and hopefully maybe find the answer."
Charity: The mental model is everything.
Chris: Yeah. But as the person who would run those sessions, I always knew I was probably wrong somewhere in there.
And we always struggled to, well, we didn't have tracing at the time was the big story.
We didn't have anything like that.
And having something like this now available to us just opens up so many opportunities that I feel that we were missing before.
It's such a really awesome evolution to see how the rise of microservices and event driven services has led to people realizing that the previous approaches didn't work for them before and definitely don't work now.
It's not like the problems that tracing is solving are unique to microservices.
They existed in the old systems too, it's just that we lived with them for so long until we couldn't.
Chris: No. And it's true. At the moment, I actually work with a few teams.
I don't know if I said this already, but each of those teams uses tracing in their job on a day-to-day basis.
They might not necessarily use it as much as each other.
Each team is sort of self organized so they choose how much they want to use the tools based upon their needs but each one, depending on whether they're working on some part of a monolith or a much more simple request response application, they still benefit from this stuff.
They still want to know how well their database is performing.
How often other clients are hitting their service and having a visual way to present that is in my opinion, the right way to do that, rather than throwing away information through time series metrics or grabbing some logs.
Charity: Are you talking about like a service map or what?
Chris: I would say a mixture of things, but being able to just list out key traces and know which things are hitting things where, and then being able to inspect and, yes, create a map from that.
But largely just being able to look through the chronology of a trace itself through the spans, and seeing the impact that certain, I don't want to say activities, but jobs or background processes can have on our systems. That's a thing that a lot of my teams are impacted by quite a lot, through the services they're running or through other things.
Liz: I love that thing that you just said a moment ago about kind of how time series data or logs throw out a lot of that important data and how traces capture that.
Chris: Yeah. Well, I have particular issues with time series databases.
The inability to take a time series database and go back to the original information, the original state and see what was actually going on at that point in time, frustrates me quite a lot.
I had to use those things a lot in the past, whether it was Graphite or Prometheus or other vendors and have generally struggled over time with those tools to really feel confident in the information they were giving me because again, you've thrown away information.
You can't go back to the original events that built up that data.
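Editor's sketch: a toy illustration of the point about aggregation throwing information away. The event fields (`call_id`, `route`, `duration_ms`) and the numbers are invented for the example and do not come from the episode.

```python
from statistics import mean

# Raw events: one record per request, with enough context to ask new questions later.
events = [
    {"call_id": "a", "route": "/answer", "duration_ms": 40},
    {"call_id": "b", "route": "/answer", "duration_ms": 45},
    {"call_id": "c", "route": "/transfer", "duration_ms": 5000},
]

# What a time series typically keeps for this interval: a single aggregated number.
metric_point = mean(e["duration_ms"] for e in events)

# From metric_point alone there is no way back to which call or route was slow.
# With the original events, the question is still answerable.
slowest = max(events, key=lambda e: e["duration_ms"])

print(metric_point)
print(slowest["call_id"], slowest["route"])
```

The average hides the fact that two requests were fast and one took five seconds; only the event-level data can say which request it was.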
Liz: It almost feels like penny-wise pound-foolish.
That we used to save so much money, air quotes, on metrics, not realizing that it had a cognitive cost down the line when people tried to look at it and couldn't get the results that they wanted or couldn't, as you say, trust that the data was actually accurate.
Charity: And they couldn't make correlations between one request or another request either.
They had to just use their imagination and their past knowledge.
Chris: There's definitely a lot of effort that I've had to go through in the past to fill in the gaps between.
Charity: And to make those connections.
Chris: Yeah. I have this log over here. It's telling me something.
I have this graph over here. It's telling me something similar maybe.
Maybe I can overlap some times or draw some lines on this graph and show that there's a correlation here that X caused Y.
Liz: Right. Shifting from kind of time-based correlation to causal correlation, to actually being able to see the kind of chain of causation.
I guess that kind of brings us to the next topic that we wanted to ask you about, which is what has the adoption of observability been like at Simply Business?
Kind of how did it spread between departments or teams?
Kind of how did people realize that that was something that they could get value out of and that they should start focusing on?
Chris: Yeah. I can't give you the full story.
I've been there for two years so there is stuff that happened in the before times that I can't tell you about, but generally over time, I think we followed a model that I've seen elsewhere, certainly in my previous job, which was that we've chosen cloud-based tools rather than sort of self-hosted tools, which is generally good.
I've worked on teams where we spent a lot of time and effort hosting Elasticsearch clusters and no one got anything else done. But to your question, we chose good tools for different needs. We chose a logging tool and we chose a monitoring tool, which effectively is your metrics tool. We chose an error management tool and we've chosen a tracing tool, and they were made available to teams and teams would choose how to use them.
And generally it meant that different teams would pick up the tools they liked the most and then sort of adapt them.
A lot of teams would be using metrics and they'd use those metrics for dashboards or alerts.
Teams would use logging.
And we have, I'd say, over time understood the value of having consistent logging.
We have consistent schemas, consistent trace IDs and you're unlikely to see an application that hasn't got relatively thorough logging inside it.
What wasn't consistent, though it's becoming more consistent across the business, is the use of tracing.
That was because it was generally seen as a thing you turn on or a thing you had to be made aware of.
It's generally been a thing that has been slower to adopt with people.
Charity: What causes people to turn it on?
Chris: I'd say overall, I'd say a few things.
One thing was somebody who'd already experienced it and could say back to a team, "Hey, you're missing out on this good thing."
Having people who are champions of tracing has probably been the most effective way of getting people to turn it on in their applications.
They understand that they need to turn it on in their applications, or that they'll benefit from it.
Most of our teams generally are expected to be responsible for running their services, so there's the "you build it, you run it" ethos that we try to follow.
We have a team that provides this platform infrastructure.
They sort of make these tools available to people and they will coach people on how to use a tool and bring in trainers to come in to teach people.
But really the best experience is on the job, and having people who have experienced those tools going within the business and going to teams and saying, "Hey look, this is what you can do."
Or getting involved in an incident.
Being in an incident where you're able to quickly debug a problem because you have good tracing is in my experience the best way to convert other people.
Liz: Yeah. We had a previous guest on the show, Nick Herring from CCP Games, who said that when you turn up to an incident with the shiniest firetruck that actually puts the fire out, people get really interested in learning where they can get one of those firetrucks.
Chris: Being able to go in and show, okay, well, you can do a query like this. And look, this query, it's a different query language.
That is a problem. It's a new language that people generally need to learn most of the time.
It might be a bit inconsistent with the query language you might be used to from, say, logging, but it's still something that you can look at and then also correlate between those two things.
It's not saying don't use logs but use this other thing as well and see if you can build a comparison.
Build up your confidence that actually when you're having an incident, your assumption about the incident is correct.
There's this really neat thing where I think you hit it on the nose: teams who are running their own services, who are practicing production ownership, that causes them to want to learn how to do it better as opposed to just being like, yeah, this is not my problem.
Was that always the case at Simply Business that you had production ownership?
Or was it a shift that you made away from kind of a centralized on-call or NOC?
Chris: It's definitely something that has shifted over time as we've moved to having more complex services, again, more teams, more services that we have to support, it's become more and more important.
You can't just rely on a single centralized team who are not going to understand how the telephony platform, or the actual services that we sell insurance from, work. It's just, well, it wasn't sustainable for people.
Obviously, like everybody else, we spot trends.
If we know that Google or some other giant company is following similar methodologies, we want to understand why.
We're not going to say we're just going to copy them but we're going to look into understanding why they're doing it.
Liz: Yeah. It was really cool, I think I was invited to give a talk at Simply Business, I think a year and a half ago and it was really cool to speak to such an engaged audience who wanted to learn about best practices from the outside.
And I certainly am hearing you describe a lot of things that are similar to practices that I espoused in that talk a year and a half ago.
Chris: Personally, as someone who was watching that talk, I would definitely say it impacted how I thought about problems and how I could go about solving some of the problems I saw on my teams.
Liz: Kind of as you're thinking about the way that you evolve your stack, so you're describing this transition from hosted metrics service, hosted logging service, kind of APM tools, kind of this consolidation, kind of what's next for you?
Kind of what are you expecting the future of observability and of telemetry at Simply Business to look like?
Chris: It's still being worked upon but the major thing we're looking at is trying to reduce the number of vendor products that we are using.
Most vendor products are maturing.
It's definitely a space that's had a lot of growth and development in it.
We found that we were using three vendors who had very similar product offerings overall.
And so we were looking at, well actually, could we reduce these three vendors down to one vendor?
Can we use a single vendor to do logging, metrics and tracing?
For a few reasons, and definitely one of the big reasons is, why are we spending money on three different things?
In some cases we're actually spending the same amount as we would if we bought all three things together, when we could buy a single suite of products from a single vendor and get the same outcome, which is that we have these three different things that we need.
As we're looking at how those products are maturing, one of the problems we have, because we've got three vendors, is that you can't look at your traces and then look at your logs right now.
Or at least we haven't been able to up until this moment.
Liz: Right, exactly. You'd need the same correlation IDs, you'd need deep links.
These are things that are problems when you're using different vendors and even sometimes problems within the same vendor, unless you've annotated your things really well for sure.
Chris: Yeah. That's something definitely we've made a few mistakes on in the past.
First of all, pretty much, if you want to go between logs, there's copy and pasting, which is fine but in the heat of the moment can be a bit annoying.
You have to juggle browser tabs and all the other stuff and I don't know about you but I have about 50 browser tabs open at a time so it gets extra hard.
We also at some point decided to add in our own trace IDs, which we probably should have done a bit more research on in the past. But now, if we, say, go back to OpenTelemetry, OpenTelemetry has support for various different kinds of trace IDs or ways of propagating context.
And if we'd chosen OpenTelemetry earlier or looked at that earlier, we could have actually skipped a bunch of tech debt that we created for ourselves.
It's been really cool seeing the kind of extensions of OpenTelemetry and of the W3C Trace Context standard to encapsulate more and more forms of transport and not just HTTP.
We added sqlcommenter recently to the OTel project, which allows for propagating over database calls with the comment field in SQL.
It's definitely evolving.
I'm not necessarily sure that even if you had adopted OTel earlier that you wouldn't have run into the same problems but I think now is a fine time to think about how do we standardize?
How do we upstream this?
I think it's just the general analogy that the best time to start investing money was probably earlier, but the best time is today if you didn't do that.
Chris: What we don't have, like I said, is going between traces and logs.
I think we'll unlock that, and again, we probably will do that by looking to the OpenTelemetry standard and getting our logs to emit the same trace context information.
One thing we still need to explore is what that means, because OpenTelemetry's logging implementation, at least for most of the SDKs we use, isn't fully defined or isn't implemented.
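Editor's sketch: one way logs can "emit the same trace context information" so a log line and a trace can be joined on a shared ID. This uses only Python's standard `logging` and `contextvars` modules, not an OpenTelemetry SDK. The logger name, the `current_trace_id` variable, and the hard-coded trace ID are assumptions made for illustration.

```python
import contextvars
import io
import logging

# Hypothetical holder for the active trace ID; a tracing SDK would normally own this.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceContextFilter(logging.Filter):
    """Stamp every log record with the trace ID that is active when it is emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True  # never drop the record, only enrich it


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
)
handler.addFilter(TraceContextFilter())

logger = logging.getLogger("telephony-sketch")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the example's output confined to our handler

current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
logger.info("call connected")

output = stream.getvalue()
print(output, end="")
```

With the same ID in both places, moving from a trace to its logs becomes a lookup rather than copying and pasting between browser tabs.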
Liz: Right. Exactly. Logging is very much an alpha-to-beta area.
Although span events are definitely something that is broadly supported.
It's just a question of how voluminous are your logs.
Can you encode them as span events?
Or do you really want separate logging? Interesting things to think about there.
Chris: Yeah. I think that's got me thinking actually, so in a few services, it probably makes sense.
The service actually in question, the telephony service, the actual thing we talk about as a log is effectively an event log.
We're logging those events that we see in the systems.
We have to think about how we're going to redact sensitive information from them.
And that's a big challenge for us, just generally.
It's a telephony system, so we have a lot of phone numbers that we put in our logs, and for a cloud tool, we don't want to include phone numbers or other bits of personal information.
And that's true for other products that we have as well.
The system we use to send SMSs, for example, we don't really want to include the same information.
But that information is already in the event because it's an event-driven system.
Liz: Right. Yeah. It's one of those really cool things where even if you can't clean up the underlying telemetry, the OTel collector's data transformation is really helpful for that.
Chris: Yes. Yeah.
That's something that we do need to think about for our future version of this architecture.
What we've gone with so far is a centralized OpenTelemetry Collector that we've put into our infrastructure.
And then our applications themselves just use the HTTP exporters to send the spans across to that collector.
And while we would like to use the OpenTelemetry redaction feature from the collector, we've found in the past that moving redaction away from our applications has caused people to forget to redact or not be aware that it's a feature that they have available to them.
We used to use Logstash.
Well we do still use Logstash for our logging.
And the thing that we found is that people have, in the past, not been aware that we have all these Logstash filters that filter out, or are meant to filter out, personal information.
And so those things would get out of sync with the actual applications as they're changed.
But we have the opportunity with OpenTelemetry.
We can take the agent, effectively the sidecar version of OpenTelemetry's collector, stick that next to our applications, configure it from next to the application and do the redaction there rather than having to come up with some workarounds in our application.
We do currently have workarounds in our application.
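Editor's sketch: the kind of redaction step being discussed, scrubbing phone numbers from event attributes before they leave for a cloud tool. This is plain Python and not the OpenTelemetry Collector's own configuration. The `redact` helper, the attribute names, and the deliberately crude phone-number pattern are all assumptions; the sample number is a fictional one.

```python
import re

# Crude, illustrative pattern: an optional "+", then at least nine digits that may
# be separated by spaces, dots, dashes or parentheses. Real rules would be stricter.
PHONE_PATTERN = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(attributes: dict) -> dict:
    """Return a copy of the attributes with phone-number-like strings masked."""
    redacted = {}
    for key, value in attributes.items():
        if isinstance(value, str):
            value = PHONE_PATTERN.sub("[REDACTED]", value)
        redacted[key] = value
    return redacted


event = {
    "event.type": "sms_sent",
    "to": "+44 20 7946 0958",
    "body": "Call us back on 020 7946 0958 today",
    "attempt": 2,
}

clean = redact(event)
print(clean)
```

Keeping a rule like this next to the application, whether in code or in a sidecar's configuration, is what lets it change in step with the events the application emits.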
Liz: Yeah. That sounds really exciting that there's kind of all of this future work.
You standardize both kind of on a vendor neutral technology while also consolidating vendors.
What's your overall verdict on OpenTelemetry?
Do you think that other companies should adopt it? What's your experience been overall?
Chris: I think, well, we're obviously sort of working at the bleeding edge.
It's still being developed, for one.
I think there was a learning curve at the start that maybe you might not have had if you're using a vendor-specific tool, but that's likely to change.
Certainly the thing that slowed us down was getting access to documentation that was relatively up to date, or to different features.
There were bugs and we've effectively had to take those bugs and get them patched.
If you care about open source software, you think that is a benefit for the industry or for your business, then working on something like this where you can sort of improve software that's used by other people is a benefit.
However, what I would say is we're a special case.
We were using an application that needed custom instrumentation.
Charity: Every case is a special case.
Chris: Yeah. Yeah. That's true. That's true.
If your application also needs to add in custom implementation or events or spans that are very specific to your application and you don't want to be adding a lot of special stuff to your codebase that is tied to a single vendor, I don't see why you would not choose OpenTelemetry.
The switching costs are there; they're a thing that you have to deal with.
I think it's the sensible thing to do. There's a reason people use standards.
Charity: Yeah. Cool. Well, thank you so much for coming today.
It was nice to get to hear from you, Chris.
Liz: This was such a fun conversation. Thank you very much.
Chris: Cheers. Thank you.