Ep. #1, Monitoring & Observability with Monitoring Weekly’s Mike Julian
Joe Ruscio: All right, welcome back to the podcast. Super excited to be joined today by a good friend of mine, Mike Julian. If you don't know, he's an O'Reilly author; Practical Monitoring came out last year.
Mike Julian: Yes, in December.
Joe: The editor of the wildly popular weekly digest newsletter, I know you don't have enough of those yet, Monitoring Weekly. Go get that one. His day job is the founder and head consultant over at Aster Labs, which is a premier consulting shop for monitoring and observability with some customers you may have heard of, like Docusign.
Mike: Yes, Docusign.
Joe: Great. It's probably going to come as no surprise today that we're going to be talking a lot about a subject that is also near and dear to my heart, monitoring and observability, and how that intersects with SaaS teams and the cloud. The first thing, if we can just get started, is to table set. Because a lot of listeners may not be quite as deep on the inside baseball, but I often like to ask people, given how much has shifted in the broader industry in the last ten years with cloud computing and cloud-native, how do you feel? Specifically with monitoring and observability, what do you think are the biggest trend lines over the last decade in that space?
Mike: The thing that I've noticed, the biggest change has been the rise of cloud-native. Which means that we all started off many years ago using Nagios and Big Brother and all these old school tools, and monitoring was like, "We have the monitoring server and it hits other servers, and other static servers, and nothing changes in your world." So you could have a task sitting on your monitor like, "Go add monitoring to the server." And that was fine. It sucked, but it was OK.
Back then you'd just go edit this manual configuration, and it was OK because you didn't do it but once a week at most. Now the entire infrastructure rolls over inside of an hour. You can't think about the world that way anymore. Of course, now we have not just a load balancer with a couple of servers but a load balancer with ASGs behind it. The entire world is changing constantly, and answering the question, "How many servers do I have?" is a non-trivial question at times.
Joe: Right. Or, "How many containers do I have?"
Mike: Right. "How many containers are running this application?" Who knows.
Joe: This is probably something you help out your customers with. What type of people do you typically work with? What's the general arrangement you see companies working in solving this problem in? Like how do they organize, who do you interact with?
Mike: Most of the companies I work with are the fairly large companies. A lot of them no one's ever heard of but they're pretty much all larger, older companies.
Joe: So, a "legacy enterprise" versus a "hipster enterprise."
Mike: Definitely legacy enterprise. Like Big-E Enterprise sort of stuff. They will often try to get into the monitoring era by having monitoring teams or observability teams, and most of those teams are really focused on building tools and not necessarily teaching the engineering teams how to use them. They're buying the tools, they're trying to build monitoring, but because they didn't grow up with it like a lot of the hipster startups they don't really quite get it yet.
Joe: This is part of their broader move into the cloud, as it were. Part of their "digital transformation" is probably the word they're kicking around. As part of that they've now got some new monitoring and observability initiative, trying to catch up with the rest of the teams.
Going back, and I'm cheating here a little bit, listeners may know that I founded a monitoring company back in the day before moving into venture. But one thing that has fascinated me over that time period, and maybe you have some thoughts. As people try to solve this problem moving from Nagios, the first big thing to hit at least from my perspective, was Graphite.
Mike: Yeah, absolutely. Graphite completely changed how everyone views everything.
Joe: For people not aware, Graphite came out of Orbitz in 2009 and was the first open source, pliable tool that you could adapt to a whole bunch of different ephemeral situations, because you could just push data from any source into it and then it could serve as a single source of monitoring truth for the rest of the org.
Mike: How I've been explaining it to people is Graphite was a complete game changer, and looking back on it I don't think people realize exactly how big of a deal it was. Prior to that we had tools like Nagios but the only thing they did was they would check a system for data, get back some value and decide, "Is this a good value or a bad value?" And then after that just throw the data away.
Joe: That's a good point. There's no historical context. It's just, “What's on fire right now?” And nothing around, “How did it get in this state?”
Mike: Exactly. With Graphite the model changed, so we were no longer checking the system for, "Is this a good thing or a bad thing?" We're just saying, "We make no judgment about this. We're just collecting data. And then we're going to look at the data to decide if this is a good thing or a bad thing." That gave us huge amounts of flexibility and more capability in deciding how our systems are behaving, because we could look back at historical trends and we can apply statistical analysis to it where we couldn't before. Graphite now, it feels really old school and there's not a ton of people using it anymore.
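The shift Mike describes, from "check, judge, discard" to "collect first, judge later," can be sketched in a few lines. This is a minimal illustration, not Graphite itself: `graphite_line` formats a sample in Graphite's plaintext protocol (path, value, Unix timestamp), while `check_and_discard` and `SampleStore` are hypothetical names invented here, with a toy in-memory store standing in for a real time series database.

```python
from statistics import mean


def graphite_line(path, value, timestamp):
    """Format one sample in Graphite's plaintext protocol: '<path> <value> <timestamp>'."""
    return f"{path} {value} {int(timestamp)}\n"


def check_and_discard(value, threshold):
    """Old-style check: judge the value right now and keep nothing afterwards."""
    return "CRITICAL" if value > threshold else "OK"


class SampleStore:
    """Toy in-memory stand-in for a time series store (hypothetical, for illustration)."""

    def __init__(self):
        self.samples = {}  # metric path -> list of (timestamp, value)

    def record(self, path, value, timestamp):
        # No judgment at collection time: just keep the data point.
        self.samples.setdefault(path, []).append((timestamp, value))

    def average_since(self, path, since):
        # Judgment happens later, with historical context available.
        values = [v for t, v in self.samples.get(path, []) if t >= since]
        return mean(values) if values else None
```

With the samples kept, a question such as "what was the average load over the last hour?" becomes a query over stored data rather than something the check threw away.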
Joe: It's not a greenfield tool anymore.
Mike: Right. If you're going to say, "Let's go overhaul monitoring," no one's saying, "Let's install Graphite." But it was a total game changer and all the tools we use now grew out of Graphite.
Joe: Definitely. We owe some motivating factor to it and the ground it broke. I totally agree with that. So you're probably familiar, you mentioned it's maybe not as much of a greenfield use anymore but it's been supplanted by something that has, at least in the monitoring and observability space, had a dramatic impact in the last two years. Prometheus.
Joe: Are you seeing this a lot?
Mike: Yeah. The adoption of Prometheus has been wild. It just feels like it came out of nowhere. To me, I'm still not sure why. It's a cool tool but it still feels like there's a lot of challenges around it. The adoption of it is crazy.
Joe: Can you describe, just briefly for the listeners, what it is and how it works?
Mike: Yeah. Prometheus is-- I'm trying not to get into too much inside baseball here. There is the argument of push versus pull when it comes to monitoring, and Prometheus is solidly on the pull model. You have to hit your servers to get this data.
Joe: It was specifically inspired by the infrastructure of Google.
Mike: It is essentially an open source version of Google's monitoring system, Borgmon. In fact you often see Googlers when they give presentations talking about monitoring at Google, they're using Prometheus to do the demos.
Joe: They feel like it most closely emulates the tools that they're using.
It always strikes me that maybe modeling our architectures after massive behemoths is probably not the best way to do things.
Joe: Yeah. It is an interesting conundrum. When I'm looking at new startups it's always this notion of, especially in infrastructure and dev tools, where you see a very smart team come out of a very large organization and say, "We've built these tools just like we had inside this planet-scale web org." Like Facebook, Google, Apple and FAANG.
And usually it's very binary, it's either a very good thing or a very bad thing. Because it either is the smart problem that works for everybody, or they're solving a problem that only exists in five companies. It does seem, from at least the adoption rate though, that Prometheus must be landing on the former.
Mike: Yeah, absolutely.
Joe: It struck a nerve, it seems, in the industry.
Mike: It has absolutely struck a nerve. I'm seeing a lot of companies that, when they look to start their monitoring journey and overhaul how they're doing things, are almost leapfrogging the previous steps that I would expect them to do and going right to Prometheus.
Joe: What are those steps you would normally expect?
Mike: Normally, I would expect someone to say, "Let's implement a time series database and we'll take the Nagios setup that we currently have and just start writing the data to disk." And then change how they're thinking about it, but making incremental steps. Instead what I'm seeing is that people are saying, "No. Let's rip everything out and put Prometheus in." In some ways, it's almost a rash decision. People are looking at it as "The new shiny." It's cool.
Joe: Engineers love a good re-platform.
Mike: Man, they do.
The power of resume-driven development cannot be overstated.
Joe: Right. There's certainly some of that.
Mike: I don't want to say that Prometheus is bad, because it's not. It is actually very good. The major adoption I've been seeing is coming from the more modern infrastructure companies running things like Kubernetes, which links up with Prometheus' model quite well.
Joe: Yeah. Prometheus is a Cloud Native Computing Foundation project.
Mike: It is. CNCF as well.
Joe: There are many factors going into it, and I'm sure that's part of it. It's hooked up to Kubernetes and Envoy and some other highflyers there. One thing I thought was interesting and I wanted to get your take on from the pull-based model, what that means is that application endpoints that want to be monitored by Prometheus in its normal configuration expose an endpoint that can be queried, and will return whenever it's queried the current state of the system.
Then Prometheus will in some loop, at some period, execute those queries and get the latest data and that's what constitutes your history. What is fascinating about that is now you're starting to see in projects like Kubernetes and Envoy and other things, and much more broadly, is everybody starting to support Prometheus endpoints. What do you think about that? That introduces some interesting possibilities.
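The pull model Joe describes can be sketched with nothing but the Python standard library. This is a minimal illustration under stated assumptions, not Prometheus itself: the application exposes a `/metrics` endpoint in the Prometheus text exposition format, and a scraper fetches and parses it. The metric name `app_requests_total`, the `REQUESTS_SERVED` counter, and the simplified parser (which ignores timestamps and label values containing spaces) are all hypothetical.

```python
import urllib.request
from http.server import BaseHTTPRequestHandler

REQUESTS_SERVED = {"value": 0}  # hypothetical application counter


def render_metrics():
    """Render current state in the Prometheus text exposition format."""
    return (
        "# TYPE app_requests_total counter\n"
        f"app_requests_total {REQUESTS_SERVED['value']}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the current state whenever /metrics is queried."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def parse_samples(text):
    """Simplified parser: skips comment lines, assumes '<name> <value>' per line."""
    samples = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        name, value = line.rsplit(" ", 1)
        samples[name] = float(value)
    return samples


def scrape(url):
    """One scrape: fetch the endpoint and return the samples it reports right now."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return parse_samples(response.read().decode("utf-8"))
```

To try it, serve `MetricsHandler` with `http.server.HTTPServer(("127.0.0.1", 8000), MetricsHandler)` and call `scrape("http://127.0.0.1:8000/metrics")` on a timer; the sequence of scrape results is the history. The port and URL here are arbitrary choices for the example.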
Mike: It does. I'm seeing other monitoring tools supporting Prometheus endpoints, which is fascinating. But it's cool.
Integrations with monitoring tools are really what make or break a company.
When I look at companies like Datadog and when I ask the question, "Can they integrate with X?" The answer is pretty much yes.
Joe: Right. They have a massive list.
Mike: It's a massive list and it continues to grow constantly, and I never have to go to a VP and say, "No. They don't support that yet," because they do. Which makes it such an easy sell. When I look at all these tools taking the Prometheus language and saying, "Here is a Prometheus endpoint." What I see is them making it a really easy sell for me to put any tool anywhere.
Joe: It's interesting that if you look at the major players in the industry today, Datadog or New Relic. They've all had to build out these integrations painstakingly with large teams of people. There's a massive investment that historically had to be made as a vendor, whereas it's still early days, but if the trend continues and a Prometheus endpoint becomes standard in any popular piece of infrastructure software, let alone if people start building those into their custom microservices. Then suddenly in some ways it levels the playing field for new vendors. Which is great.
Mike: Yeah, it's awesome. That massive investment that you have to do up front like, "Let's build tons and tons of integrations," is almost too much an investment. It can kill the company before you even get off the ground.
Joe: Speaking as having built one, it was always this constant balance of trying to build features in the product that drove real value, and then building the glue that got the data from all the places you needed.
Mike: Absolutely. With Prometheus endpoints, it's almost democratizing the integrations. It definitely lowers the barrier to entry for new vendors, which is great.
Joe: That leads to-- I'd like to segue into another interesting parallel and another emerging trend in the industry in about the same time period, the last two years. Tracing. I assume you've been talking with some of your customers about that. But specifically in tracing, there's a similar effort, OpenTracing, which is being driven in part by LightStep, which is a fantastic Heavybit company. Are you seeing tracing?
Mike: No. Tracing is being talked about, but not by any of my customers.
Joe: Fair. It definitely seems more, just to continue, it seems more "hipster enterprise" right now.
Mike: It really does. It's one of those things that you see it at every meetup you go to, you see it at every conference you go to, and it's always by the big Bay Area darling companies.
Joe: Right. Lyft, Uber.
Mike: Pinterest has one.
You have all these distributed tracing frameworks, or implementations, and they're interesting and they make for a killer demo. But for most people it's not that useful yet.
Joe: I guess we should probably give some context. Distributed tracing was popularized by Google; almost eight years ago now the Dapper paper was published. It's this notion that if you have a distributed system, what we now call microservices, with some type of fan-out architecture where a request comes in and subcomponents get farmed out to other microservices, you have some unique context that follows those requests out. Then when the request is finished you can draw a map, literally, of how the request fanned out into your infrastructure and how long it spent at each particular spot.
Mike: Perhaps, to put it more concisely.
It's the idea of being able to trace a single request through your entire system as it propagates through the entire system.
And one really cool use I've seen of it is for complex distributed architectures, being able to map out what services are talking to what others.
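The two ideas above, a context that follows a request as it fans out and a service map derived from the resulting trace, can be sketched as follows. This is an illustrative toy, not Dapper or any real tracing library: the `Span` fields, `start_span`, and `service_map` are names assumed for the example, and the durations are passed in by hand rather than measured.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str             # shared by every span belonging to one request
    span_id: str              # unique to this unit of work
    parent_id: Optional[str]  # span that caused this one; None for the root
    service: str
    duration_ms: float


def start_span(service, duration_ms, parent=None):
    """Create a span, inheriting the trace context from the parent if there is one."""
    trace_id = parent.trace_id if parent else uuid.uuid4().hex
    parent_id = parent.span_id if parent else None
    return Span(trace_id, uuid.uuid4().hex[:16], parent_id, service, duration_ms)


def service_map(spans):
    """Derive caller -> callee edges between services from parent/child links."""
    by_id = {span.span_id: span for span in spans}
    edges = set()
    for span in spans:
        parent = by_id.get(span.parent_id)
        if parent and parent.service != span.service:
            edges.add((parent.service, span.service))
    return sorted(edges)
```

For a request that enters a frontend, which calls an auth service and a cart service, which in turn calls a database, `service_map` returns the three edges frontend to auth, frontend to cart, and cart to db, without anyone having to know the topology in advance.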
Joe: Literally just even knowing what the map is.
Mike: Just knowing what your map is, is actually pretty hard when you're running distributed systems.
Joe: Yeah. At Monitorama two years ago, I think it was Uber, where there was a talk on distributed tracing. The very first thing the engineer said was, "Nobody at Uber actually knows what all the services are." Literally the only way they can know what's happening in the system is from a trace.
Mike: Netflix has a tool called Vizceral that they use, and one of the cool things about it is you end up with this massive map. At the time they had something like 950 microservices in their systems, and it's just as complex as Uber. How do you know what services are being used for what, and how do you know what the level of traffic is?
There's effectively a Dunbar's number for microservices. Any one person can only track so many microservices in their head at any one time.
So it is interesting, I do think you're right. I do think it's very early. I don't know what the number is, or if anyone has done any quantitative analyses but I suspect there's a number of microservices you need to have interacting and I don't know if it's 10, or 20, or 30. I suspect it's probably, let's say 8 or more, where you actually start to need it. Or, maybe you don't realize it yet.
Mike: I would argue it's probably a lot more than that. I know companies that have tracing, and hesitantly admit that it's not actually that useful for them. I wonder just how much it has to do with knowledge level of the engineers, complexity of their environment, and in that sense a simpler environment would probably have less need of this than a more complex one.
Joe: How much do you think it is the complexity of the tools? Because at least until very recently, and honestly, I would say even literally in probably the last year, if you looked at the tooling that was available outside of Google, Twitter had open sourced the Zipkin project some years ago. But in many ways it's a spiritual twin to Graphite, where it added this fundamental capability, but one that wasn't maybe super approachable.
It's only between I'd say LightStep, and then as far as I'm aware Datadog has introduced a tracing capability. And then AppOptics, which is from my former company, introduced just some number of months ago. Outside of those, trying to make it approachable, it is a very complex thing for an engineer to wrap their head around if they're not already steeped in the space.
Joe: A lot of it has to do with a pattern where in most orgs you see a handful of people who can grok and understand what it's doing, and the tools, and I just think there's a lot of work the tools have to do to make it work. It's a very complex topic.
Mike: Absolutely agreed. It's a complex topic and I don't think the tools are doing a very good job of making the users better.
Joe: I definitely agree with that. Part of that, what is interesting is there's two efforts. I mentioned OpenTracing, that's one thing that's definitely happening there. This is this effort to standardize the wire format of a trace context. Because again, historically a big problem with these is how the actual context was generated: the way you did it was to install a client-side SDK in every single application in your infrastructure.
In dynamic language it's probably monkey patched, but it actually augmented every incoming and outgoing request and response transparently to the user code.
Which when you're in an organization like Google, you can say, "If you want your code to go into prod you will have this RPC library." But as an open source project, or if you're just one engineer inside a large company, or if you're a much smaller-time vendor trying to sell your solution, getting someone to agree to go whole-hog is hard, especially now in 2018, where almost every interesting org is polyglot. They have three, four or five languages, no ops. OpenTracing is this effort to try to overcome that. How do you see that working?
Mike: I agree. It's a good thing that people are trying to standardize on this now. For the longest time it was several competing standards, you had different factions trying to compete with it. Now that we have OpenTracing and everyone is saying, "Why don't we just use that one thing?" That's great.
Joe: Yeah. It does seem, when we're talking particularly maybe at the service mesh layer, that they're starting to have a lot of success in standardizing or making that available. If you are using Envoy or Nginx or something like that, you'll be able to plug that into your OpenTracing-compatible tools. Are you familiar with the OpenCensus project?
Mike: No, I'm not.
Joe: That's interesting. I won't spend too much time bloviating about it, but I would like to maybe get your fresh take on it. So, that's an effort being driven by Google but most recently joined by Microsoft, to standardize a set of client-side libraries for APM and trace data.
Not just a wire format, but literally, if you drop this library into your application it will create the trace, it will collect some APM data, and then push that out in a consumable form. It's in a sense, kind of going back, trying to standardize or completely remove the investment normally required. Where you'd need to build out instrumentation for every language you want your tool to support.
Mike: I love it. I would worry about how the incumbent vendors are going to handle it, given that they have an investment in not accepting it.
Joe: Yeah. It'll be interesting to see whether that is supported as a subset of what they do, or if it's something that will be ripe for the old embrace and extend ploy.
Mike: Absolutely. One thing that I've been seeing is people taking the age old StatsD protocol and building a fully compliant agent around that spec, but then extending it some. Embrace and Extend is an age old method that works really well.
Joe: Yeah, StatsD. There's a throwback.
Mike: That really kicked off everything with Etsy's blog post and the announcement of it back in 2011.
Joe: Yeah, "Monitor everything."
Mike: Yeah. And their church of graphs.
Joe: Yes. We've talked about a couple of hot trends, the tracing and Prometheus. There's at least one other I want to touch on. Machine learning.
Mike: To put it simply, I think it's mostly bullshit.
Joe: Mostly. There's a qualifier there.
Mike: Mostly bullshit, yeah. There's a qualifier there. I will say in monitoring it's pretty much all bullshit, but I'm sure it's useful in other non-monitoring contexts.
Joe: Interesting. In your job when you're working with your clients you probably, I assume, do a lot of sourcing and evaluation. Are you seeing a lot of vendors talking about this machine learning?
Mike: I see a lot of vendors talking about it, but there's a huge disconnect between what the vendors want to sell me, like the promise they're selling of, "Machine learning is going to make everything better," and what my clients actually need, which is better monitoring.
Joe: Right. There are probably some subtle but important differences between those two.
Mike: Yeah, absolutely.
Machine learning is a neat engineering idea, but it really seems that people are building these machine learning products around the basis of, "I want to do machine learning," and not around the basis of, "I want to solve a problem."
Joe: It does seem, if we're zooming out to the broader industry, that it's something that over the last five or six years has shown really big successes in certain verticals. I don't think there's been any really horizontal platform wins, and that speaks to at least the current state of the art. The different models and techniques need to be tailored to the specific vertical.
Mike: I would agree with that.
Joe: Nobody has solved some unified theory of machine learning yet. So taking it back then, to the monitoring specific vertical, what would you say is the most successful application you've seen of it?
Mike: I would say probably alert aggregation.
Joe: Like clustering?
Mike: Yeah. I can't remember the vendor that really first popularized this, but there's been a few others that have followed. The concept is that I have alerts flowing in from several different monitoring tools and if something goes wrong, I probably have multiple alerts talking about the same thing.
Joe: There was BigPanda and Moogsoft, I think?
Mike: I think so. You end up with all these different alerts coming in, and you can run a little machine learning on it and start to pick up on what is actually happening. Where's the problem likely located and you eventually develop a whole model for it which is cool. But after a while I have to wonder, if you consistently have these same sorts of problems and machine learning is telling you, "The problem is over there," why don't you just go fix that?
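The core idea of alert aggregation, that several alerts from different tools are often talking about the same thing, can be illustrated without any machine learning at all. The sketch below is deliberately naive and is not how any vendor mentioned here works: it simply groups alerts for the same service that arrive close together in time. The alert fields (`time`, `service`, `source`) and the five-minute window are assumptions made for the example.

```python
def group_alerts(alerts, window_seconds=300):
    """Group alerts for the same service that arrive within window_seconds of each other.

    Each alert is a dict with at least 'time' (seconds) and 'service' keys.
    Returns a list of groups, each a list of alerts in time order.
    """
    groups = []
    latest_group = {}  # service -> most recent group for that service
    for alert in sorted(alerts, key=lambda a: a["time"]):
        group = latest_group.get(alert["service"])
        if group and alert["time"] - group[-1]["time"] <= window_seconds:
            # Close enough to the previous alert for this service: same incident.
            group.append(alert)
        else:
            # Otherwise start a new incident for this service.
            group = [alert]
            groups.append(group)
            latest_group[alert["service"]] = group
    return groups
```

Real products cluster on much richer signals than a service name and a timestamp, but even this toy version turns a flood of alerts into a short list of probable incidents.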
Joe: The other places I've seen it applied, in two other places to two different degrees of success. One is log analysis. There's a whole bunch of those. Even Elastic has some basic capabilities built into the open source product.
There's a whole slew of vendors who all have their own take on clustering log events, and it seems like it would lend itself to have some reasonable success there.
The other one that maybe has been a lot less effective, and way more noise than signal, no pun intended, is anomaly detection in telemetry.
Mike: Yeah, absolutely agreed. Anomaly detection is the big promise that we've all been told for 30 years.
Joe: Yeah, literally. That Holt-Winters paper at Usenix was 20+ years ago.
Mike: Seriously. And we still don't have it, because it's a hard problem.
Joe: It's a really hard problem.
Mike: It's a really hard problem. There are some situations where we can solve the really simple versions of it, but it's not really anomaly detection at that point.
Joe: Right. What's interesting, and this is probably a good segue, but what was always interesting to me is anomaly detection for some signal, and there are some telemetry where I really don't know the shape of it or it's super seasonal. But then there are other things, if I'm thinking about my key performance indicators of a service like latency.
I don't need a neural net to know that 3-second response times are bad. They're never good.
And I would like to get your take on this, at least when I was selling to customers that was the problem that most of them weren't even coming close to solving yet, was not very sophisticated scenarios. It was literally just, "Is the service ridiculously slow? Are my users in complete agony, or not?"
Mike: Yeah. Absolutely agreed. You don't need to do really complex stuff. I had someone ask me recently about static thresholds on disk space: "I know static thresholds are bad and I shouldn't use them. So when it comes to disk space I shouldn't have a static threshold." And I'm like, "What are you doing instead?" "I don't have an alert on disk space."
Joe: Because it would be static, and static is bad.
Mike: "Because it would be static, and static is bad." And I'm like, "Why don't you just try setting a static alert of 10% free is a bad thing, and see where you go from there?" If you're not doing the basic stuff, the foundational work, there's no way you can possibly work with the more advanced stuff either. Nothing complex starts out complex. You evolve over time.
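The static alert Mike suggests is about as simple as monitoring code gets. A minimal sketch, assuming the 10% figure from the conversation as the default; the function names are made up for illustration, and `shutil.disk_usage` is the standard library call used to read the real numbers.

```python
import shutil


def disk_alert(total_bytes, free_bytes, min_free_fraction=0.10):
    """Return True when the free fraction is at or below the static threshold."""
    return free_bytes / total_bytes <= min_free_fraction


def check_path(path="/", min_free_fraction=0.10):
    """Apply the static threshold to a real filesystem path."""
    usage = shutil.disk_usage(path)
    return disk_alert(usage.total, usage.free, min_free_fraction)
```

Once this is in place and firing sensibly, the threshold can be tuned or replaced with something smarter, which is the "see where you go from there" step.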
Joe: Crawl, walk, run.
Mike: Exactly. And when it comes to alerting, people are trying to do really complex stuff right out of the gate because they "know" the foundational stuff is old and it's bad, all this stuff. But that's not true. We still have problems, and by the way, the stuff we've been doing for 20 years actually kind of works. It's maybe not the best, but if you're not doing anything, then go with stuff that's half as good. It will be fine.
Joe: Your bread and butter is coming into an organization that knows fundamentally they need to get better at monitoring and observability. So, the Google SRE book by the way, has a bunch of really good information on this topic.
Joe: One of the single key takeaways that I liked is a pyramid of reliability.
Mike: Dickerson's Hierarchy of Service Reliability.
Joe: Yes. Modeled after the hierarchy of needs, but for service reliability. And the very lowest level of that is monitoring. The single first thing you need is monitoring, and it goes back to, "You can't manage what you can't measure," if you're thinking about the Drucker quote. So when you come into an organization it's fundamentally to help increase service reliability and the speed at which they can move. What is the state you typically find things in when you come in?
Mike: Man, it's all over the place. Some companies, I'll come in, and they have absolutely no monitoring of any kind. Their monitoring is, "Someone comes and yells at us when it's not working." "OK, that's cool."
Joe: Monitored by the customer?
Mike: Yeah, monitored by the customer. "Our alerting is Twitter."
Joe: "We have a big TV on the wall with a live Twitter search."
Mike: Yeah, exactly. "Let's just pay attention to how many inbound calls we get that the site's down." And in those situations, "Cool. Let's just go install Pingdom and monitor the website." And that's it. Step one is, "Is the site up?" And then we can get more complex from there. You can't get to Netflix-level monitoring right out of the gate. It took them years and years to do that and millions of dollars in investment. So when I come into a company, I have to set those expectations of, "This is going to take a while. And the state of your monitoring is somewhere between non-existent and pretty bad."
Joe: And actively harmful.
Mike: Right. Like "Actively harmful. You have bad information. It's better to have no information at this point." I always start with the simple things of, "Is the application working? What's the throughput on the application, like the number of requests coming through, the errors, and how long is all this stuff taking? What's the latency?" And we just start there.
Joe: You just help them get some key service level indicators set up?
Mike: Yeah, basically. We just start out of, "What matters to you? What does this application do, and why does it matter?" Then from there we can develop KPIs that we care about, and then start tying those to technical metrics. It is very simple stuff. I don't immediately install a dozen alerts. It's more like one or two, and they're very high level stuff.
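The starting indicators Mike lists (throughput, errors, latency) can all be computed from a plain list of request records. A minimal sketch under stated assumptions: each record is a `(status_code, latency_ms)` pair, any 5xx status counts as an error, and the 95th percentile uses the nearest-rank method. The function name and the choice of p95 are illustrative, not something prescribed in the conversation.

```python
def service_level_indicators(requests, window_seconds):
    """Compute throughput, error rate, and p95 latency for one time window.

    requests: list of (status_code, latency_ms) tuples observed in the window.
    """
    if not requests:
        return {"throughput_rps": 0.0, "error_rate": 0.0, "p95_latency_ms": None}
    count = len(requests)
    latencies = sorted(latency for _, latency in requests)
    errors = sum(1 for status, _ in requests if status >= 500)
    # Nearest-rank percentile: ceil(0.95 * count), done in integer arithmetic.
    rank = (95 * count + 99) // 100
    return {
        "throughput_rps": count / window_seconds,
        "error_rate": errors / count,
        "p95_latency_ms": latencies[rank - 1],
    }
```

One or two high-level alerts on these numbers, for example on error rate or p95 latency, is the kind of small starting point described above.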
Joe: Typically do you see, and again to me having come and built a company in this space and watched it evolve over the last 10 years, what are you seeing now in terms of build versus buy on these systems? 10 years ago for a lot of people, build was the only choice.
Mike: It's interesting.
I always thought that large enterprises were going to bring stuff in-house for compliance and security reasons, but what I'm finding is that's not actually true anymore.
A lot of them are saying, "No. We don't want to run this. We don't have that expertise. We want to pay someone, we want to buy SaaS tools and rely on those." So that buy versus build pendulum has definitely swung in the other direction, where a lot of people are outsourcing to SaaS companies which is awesome. It's hideously expensive to run this stuff yourself and you're probably not an expert at it.
Joe: Definitely not an expert at it.
Mike: Definitely not an expert at it, so you should absolutely pay someone else to do it.
Joe: Certainly the way the space has evolved, and particularly in the last few years, IT always moves in cycles. There's this period of creative disruption where a whole bunch of new capabilities that incumbents don't have emerge in different players, and then you go through a convergent cycle where the dominant players who emerged in the earlier side start to roll up all the capabilities they need. It definitely seems that in monitoring, we're pretty clearly in a convergent cycle at this point.
Mike: It's wild seeing all these companies start buying up other companies. It's not like a logging company buys another logging company, it's like logging company buys an APM company and a server monitoring company, a tracing company, and now has a full stack of, "We're a one stop shop for all things monitoring."
Joe: There's New Relic with the Opsmatic acquisition a couple of years ago, building New Relic Infrastructure. Datadog acquired Logmatic and is building out APM, historically as a massive infrastructure monitoring provider.
Mike: Elastic bought probably three or four companies.
Joe: Yeah, I was going to say, Elastic who's now just announced their IPO. They're definitely positioning themselves in the same place. They bought an APM company, they bought a machine learning company. They acquired Prelert two years ago to work on logs, and then SolarWinds acquired my company and has driven five or six acquisitions in the cloud-native space.
Mike: SolarWinds didn't just buy your company, they bought half the companies.
Joe: Then you've got CA picking up companies like Runscope. So it's a clear convergence, going back to your point as an enterprise buyer or even SME, having a smaller number of vendors who have all of the pieces of the puzzle. They each have their own strengths and some of their pieces may not be as good as someone else's version of that, but for the typical enterprise buyer, being able to just go to one vendor and say, "We're just going to get the stuff from them and it's good enough."
Mike: Absolutely. If there's only one vendor I need to work with, that makes it a much easier sell. But at the same time, if one of the vendor's products isn't up to par, then it's not a problem to not use that and go buy something completely unrelated.
Joe: Sure. If that is a critical need for you.
Mike: Right. Just to use an example out of my head is Datadog. Datadog has a fantastic infrastructure product, what if their logging support isn't what the client needs? We don't have to use logging, we can go buy that from Sumo Logic or Elastic, or whoever else.
Joe: Or if one part of their product doesn't meet the security compliance requirements, or something.
Mike: At that point it's just a tradeoff between budget and complexity, right?
A big concern that I hear from clients is they just have too many tools. So, the consolidation on the monitoring companies kind of hides that complexity.
If I'm buying Datadog, Datadog is really three or four different products at this point. But as an enterprise buyer I don't think about it that way anymore, because Datadog is just one product. I only have one tool. But that's not actually true, I have four tools still. It's just one check.
Joe: Certainly from a procurement perspective that helps. You only have to drive one vendor through procurement.
Mike: Even inside of engineering, having that one vendor kind of hides that it's multiple tools. It really is multiple tools, but the perception is that it's not.
Joe: Which leads to, with all that happening, what do you think are still the biggest open opportunities? They're either open opportunities or renewed opportunities in the space right now that aren't being filled?
Mike: Something I've been harping about for years, and this is due to my professional background, is network monitoring. And I'm not talking about the servers on a network, I'm talking about actual physical networks in data centers and in corporate offices.
Joe: Like switches?
Mike: Like switches, routers, network taps. Actual hardware.
There are only a few companies that do this even moderately well and they're all very old incumbents. They're very slow to move, have really long buying cycles, no one likes the products, and they have really annoying salespeople.
There are a few companies that have tried to go after them but then got seduced away by the, "We want to build cloud-native stuff." And I'm like, "No. There's a huge opportunity in network monitoring," but no one's really going after it.
Joe: That's interesting. The thing about that class of boring problems where--
Mike: Boring problems but, "Holy crap, the amount of money to be made."
Joe: Boring, but theoretically probably wildly successful. If you look at Stripe, before Stripe, billing was a boring problem.
Mike: I've got a friend that works for the University of Florida and he works for the entire system. He's a network engineer and procures tools and hardware on a regular basis. I asked him, "What do you spend every year on maintenance and software and tools and all that?" And he's like, "It's a minimum of $12 million a year."
Joe: $12 million a year?
Mike: Right. A minimum of $12 million a year. And he's like, "When I buy licensing for my monitoring tools they're highly specialized, and we'll drop $150, $250, $300,000 a year on maintenance." And I'm like, "That's pretty good money."
Joe: Yeah, I know. I think you're right. It's an interesting challenge because a lot of times monitoring companies seem to be founded or created by engineers who were working on some line of business application that was scaling in some strange way, and they couldn't find the tooling they needed. In most cloud-native environments, where most new applications are being created, network monitoring really isn't a concern.
Mike: It's just not a thing.
Joe: It's not a thing at all because you never see switches or routers.
Mike: Yeah, we just trust that Amazon has our back and we move on with our day.
Joe: I think you're right. But I'll be interested to see where that next company comes out of.
Mike: I don't think the hardware is going away. There are still computers somewhere, there are still data centers somewhere.
When you have companies like DigitalOcean that run certain data centers, what are they doing for monitoring? How are they monitoring their physical infrastructure? It's still a problem.
Joe: That's a good point.
Mike: Even the companies with large corporate offices, they still have connectivity inside. What are they doing to monitor it?
Joe: Swinging the pendulum way to the other side of the timeline, one thing I want to get your opinion on and something I've been looking at a lot is serverless monitoring.
Mike: Boy, you did go to the other side there.
Joe: We're going to go from the data center to serverless and just skip everything in between. What do you think? I don't know if any of your customers are doing serverless yet, or Lambdas.
Mike: None of my customers are. It's still very much an engineering toy. But there is still a very open ended question of, "How do we monitor this?" For that matter there's still the open ended question of, "How do we deploy it?"
Joe: "How do we build applications out of it?"
Mike: I think that serverless is where Docker was a few years ago in that, "That's really cool and works on my machine," is about the extent of it. It's great for prototyping. But once you try to run an actual infrastructure on it, there are a lot of questions that we haven't solved yet. One of the biggest ones is monitoring, "How do we monitor this?"
There are some vendors that are starting to do some pretty cool work around it, but it's still very young. We don't really understand, "What are the failure modes of it? What does scaling look like? What kind of information should we get out of it? What are the best practices around serverless monitoring?" It's still very much an in-development area.
Joe: It's like you said, there are probably at least four or five vendors now, all independently building serverless monitoring. To me, there are a couple of axes that are interesting. One is how many serverless-only infrastructures will there be? And part of that is, how complex is this going to be? Will it continue to be outsourced batch pieces? Or will you literally be building a complex 10,000-function application that needs to be traced end to end?
Mike: There was someone that made a comment at Monitorama some years ago, I'm trying to remember who it was. A former Netflix CTO.
Joe: OK. Adrian Cockcroft?
Mike: Adrian Cockcroft made a comment, "When you have serverless functions they take microseconds to run. If your entire infrastructure is microservices and it completely turns over every few seconds, then how do you know what the last minute looked like? How do you get this information, how do you understand the state of your world when it doesn't exist anymore?"
It's completely changed. Like, you can't poll these things. It's far too fast, so you have to emit the data out. But keeping a state of the world is actually important. Knowing what your world looked like and, "How many Lambda functions did I run in the last minute? During this one minute period, how many were currently running, serving how many requests, that did what?" That's a hard question to answer.
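Editor's note: the push model Mike describes can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation. It assumes each function invocation emits one record with its start and end time when it finishes; the `InvocationEvent` schema and the `summarize_window` helper are hypothetical names made up for this example. Given those records, a collector can rebuild what a past minute looked like even though the functions themselves no longer exist.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InvocationEvent:
    """One record emitted by a function when it finishes (hypothetical schema)."""
    function_name: str
    start: float  # seconds since epoch
    end: float    # seconds since epoch


def summarize_window(events, window_start, window_end):
    """Rebuild the state of a past window [window_start, window_end) from emitted events."""
    # Invocations that overlapped the window at all.
    active = [e for e in events if e.start < window_end and e.end > window_start]
    # Invocations that began inside the window.
    started = sum(1 for e in active if e.start >= window_start)

    # Sweep over start/end points, clipped to the window, to find peak concurrency.
    points = []
    for e in active:
        points.append((max(e.start, window_start), 1))
        points.append((min(e.end, window_end), -1))
    # At equal timestamps, process ends (-1) before starts (+1) so that
    # back-to-back invocations are not counted as overlapping.
    points.sort(key=lambda p: (p[0], p[1]))

    running = peak = 0
    for _, delta in points:
        running += delta
        peak = max(peak, running)

    return {"started": started, "overlapping": len(active), "peak_concurrency": peak}


if __name__ == "__main__":
    emitted = [
        InvocationEvent("resize", 0.0, 0.5),
        InvocationEvent("resize", 0.2, 0.7),
        InvocationEvent("notify", 59.9, 60.4),
    ]
    print(summarize_window(emitted, 0.0, 60.0))
```

The design choice here is that nothing polls the functions. The only input is what they pushed out, which is why the emitted record has to carry enough detail (at least a start and an end) to answer questions about a window after the fact.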
Joe: The other interesting question is, how driven or motivated will the providers themselves-- Amazon, Google, Microsoft-- be? Given that serverless specifically is embedded into the platform far more than any computing paradigm that preceded it. Even containers you can run on EC2 instances. Will they be driven to provide much higher fidelity monitoring and observability capabilities earlier than they have historically done those kind of things?
Mike: I sure hope so, but I'm not holding my breath on that one.
Joe: Agreed. All right, we've come to about time here.
Mike: Thanks for having me.
Joe: Yeah, it's been great. I love to talk shop on monitoring. We'll definitely have to have you back some time.
Mike: Sounds great.