In episode 14 of O11ycast, Charity and Liz are joined by Mehdi Daoudi of Catchpoint. They discuss the importance of team players when scaling, as well as the hidden value in measuring the experience of your employees, not just your customers.
Charity Majors: I often say that tools create silos.
If you have a team, at the edges where they stop using one tool, someone starts using another tool and accidentally creates a silo for that team.
Have you seen this, or do you feel like we need to have a common language that multiple teams can speak?
Or do you feel like every team should have its own unique tool for its use case?
Mehdi Daoudi: I totally agree with you. I think tools create silos, and in my previous life I've seen it.
When I became in charge of monitoring at DoubleClick, the system engineers were all about CPU stats, and the network engineers would look at the capacity on their switches.
Everybody would look at their stats, but then nobody looked at the overall user experience, or "Were we delivering ads?"
And how fast we're delivering ads.
No matter how hard I tried, they were never able to convert what I was talking to them about, like "What is our quality of experience or quality of services?" into something meaningful to them.
So yes, tools create silos, but I don't think we can get rid of those silos. We have to work with them, in my opinion.
We just have to overlay more data and translate the data into their own languages until something better comes along.
Liz Fong-Jones: Do you think that tracing is going to save us? Is tracing the magical thing that ties all of these things together?
Mehdi: Maybe, we'll see. I have a lot of hopes around tracing.
There is a lot of adoption, there is a lot of buzz, there is a lot of excitement that I see with our customers.
I see it more in Europe than in the US, for some reason.
Mehdi: Yes. I see it a lot more in Europe than here, but it's starting to catch up.
I think OpenTracing, the fact that there is this universal language or this portability of the tracing, so you do the abstraction of the monitoring tool.
I think it's the way to go and I think it's a great start.
Charity: I think we are all ready to stop innovating on ingestion.
Liz: Yeah. This sounds like a good moment to introduce yourself.
Mehdi: Again, thank you so much for having me.
My name is Mehdi, I'm the CEO and co-founder of Catchpoint, and we've been around for almost 11 years.
Before that I spent eleven years at both DoubleClick and a year at Google, and Catchpoint is about 200 employees today with five offices worldwide. I'm having a lot of fun.
Charity: Nice. Five offices? Do you have teams that span locations, or do you tend to have a single team embedded in a single location?
Mehdi: At the beginning, we had very monolithic or very siloed environments, so we had an engineering team in New York.
Now they're everywhere.
We have single contributors in Phoenix, in India, in Venezuela, in South America.
It's a little bit all over the place.
Liz: So, you don't even necessarily have people working in individual offices?
You've embraced fully remote, or at least partially remote?
Mehdi: Absolutely. I think that's the way to go.
We got to the point where talent is so scarce that you have to adapt yourself and change.
Charity: Exactly. It becomes a matter of equality and justice, too.
You have to give people opportunities even if they don't live in New York or San Francisco.
Mehdi: Absolutely. There are a lot of people that want to go back to where their families are because they're starting to raise children or get married, and they need the support so the grandparents are there.
We're starting to embrace that and encourage it, so we had actually just recently one of our early engineers, our product manager decided to move to Seattle and start his family.
He's actually living in the woods in the Seattle area.
Charity: How has this influenced the technology that you build, or the products that you build, the fact that you've gone distributed as a company?
Mehdi: I'm an immigrant to the US myself, so I'm a huge believer that the more diverse you are and the more creative you are, the better ideas you have.
I am a huge fan of the fact that we should not box ourselves in our thinking and where we get our folks and how we encourage creativity. We're fairly open about that.
Liz: I would also imagine that given what your company does, that having developers located in parts of the world that are not the US means that you have very different experiences with people's bandwidth, with people's latency.
Mehdi: Liz, we just released what we call our employee experience monitoring solution that now is focused on measuring the employee experience with SaaS applications.
Because there is an explosion of that.
Liz: I remember when I worked at Google, we had folks from the Australia office complaining that their source checkouts took 10 times as long as anyone else because there were so many round trips.
Each round trip was like, 500 milliseconds.
Mehdi: Exactly. I bet every transaction had at least 5,000 packets, you multiply that and you get to the number.
So we just released that, and one of the things we're noticing is how actually the majority of the SaaS solutions we use are not fully distributed.
Some of them are, but the majority of them are not.
Even though they call themselves "SaaS," they're still mapped to a data center in the East Coast because that's where we're headquartered.
Liz: Oh, God. Everyone US-East-1?
Mehdi: Exactly. That's one thing we're noticing.
It's a horrible employee experience, so I have my team in Bangalore that interacts with this CRM solution and they are literally 10 times slower than the guys in New York.
Charity: This is a thing that we see so often.
These tools get built by teams that live in Silicon Valley, and they think that they're recruiting "the best and the brightest," I have air quotes going right now.
So they recruit all the ex-Googlers and the ex-Facebook people, and then they build a tool that is only legible to ex-Googlers and ex-Facebook people.
This is something that Christine and I were super aware of early on, and we intentionally went after Hack Academy graduates and people.
They're great engineers, but it's not because they themselves are individually the most brilliant engineers, but it's because you build a great team.
The best teammates are not necessarily the best individual engineers.
In fact, I find that there's almost an inverse correlation.
When you're too good as a cowboy you have a really hard time becoming a team player.
Mehdi: Yeah, we don't have cowboys. We don't tolerate that across the board.
We used to, and actually what we found is cowboy might be an interesting beast to have as a start up. But as you grow and you scale, you cannot have cowboys. You need team players.
Liz: How did you navigate that transition?
Mehdi: It never ends up very well at the end, but our hiring process and our interview process now takes into account whether or not that person is going to be a team player or not. So, we see that--
Charity: How do we interview for good team players?
Mehdi: You ask questions around, "Walk me through a situation that you were in and you had to find a way out."
Typically, somebody that is not a team player is going to say "I" versus a team player is going to be the one that exhibits a little bit of trust.
What trust means is the ability to get naked in front of your coworkers and ask for help.
Liz: Yeah, I think that there's really no better test than having someone actually sit and pair with some of your team for a day. I know that Pivotal does that, and they love it.
Charity: We do this too. Liz, you didn't actually get interviewed, but you're special.
Liz: I got interviewed by pairing with people on business strategies one whole afternoon at a time.
Charity: That's true.
Mehdi: Yeah. Liz, when I hire an executive to join my team, they actually come and spend a day with us and we put them--
We invite them to join the exec team meetings or whatever it might be, we just want to see how they actually act versus pretend or play.
Charity: When we're hiring engineers, we actually biased the process.
This isn't what we said we were going to talk about, but I think this is interesting so let's keep talking about it.
We biased it so heavily towards communication skills. So, we'll send a take home coding test the night before.
They're not expected to finish, because we don't want them to spend all night on it.
But spend an hour or two then make a note of where you are and bring it in, because that is not the actual interview.
The interview is you come in the next day and you sit with a couple of people, and you talk us through what you did and why, and what the tradeoffs are and what's left to do and where you left off. Because we believe that anyone who can communicate about what they've done can definitely do the work of writing the actual code, and the reverse is so not true.
There are so many people who can write the code, but they can't tell you why or how or what the tradeoffs were.
We believe people can learn whatever technical skills that they need as long as they have that ability to communicate and to learn, and to be humble and collaborate.
Mehdi: I will add something on top of what you said, which was brilliant, by the way.
It's the ability to also hear the other colleagues if they have an opinion on how they would do it differently, and that ability to absorb that feedback is one area we pay a lot of attention to.
Especially on the engineering side, how they've taken that feedback and what are they going to do about it.
Liz: The common thread that I'm picking out of all of this is the idea of really focusing on not just measuring your customers' experiences, but measuring your employees' experiences.
Whether it be in terms of the tools they use or in terms of the people you're bringing onto the team, that if you're not really able to understand what's going on then you think everything is fine when it's actually not.
It's slowly crumbling out from under you, or some people are coping with things that they shouldn't have to be.
Mehdi: You can get away with that when you're a 20,000 or 100,000 employee company, like you and I were at Google.
But when you're a few hundred, 50 or 20, one bad apple can literally destroy it.
Charity: I would argue that you can't necessarily at 100,000 either. It's just more hidden, buried in layers. Cool.
Now that we've solved the problems of distributed co-working, but I actually do feel like there's a real space in the market for better tools around collaboration for debugging.
The same experience you get from looking over each other's shoulder, but virtually.
Which is why in Honeycomb we've baked in just the primitives for history, for example.
Being able to go back through your history and see what you've done.
We really want to incentivize people to add annotations, so that the original intent that's in your head gets put in the tools so that that becomes your source of truth rather than your faulty memory.
We haven't really had a lot of bandwidth to expand on these, but I'm so excited about it because teams are just like distributed systems.
Every single node needs to be able to go down without destroying the cluster, and every single human needs to be able to go on vacation or go to sleep without destroying your forward progress.
I feel like there's a lot that we can learn.
Mehdi: The burnout that we-- Liz is aware of this, we've done a few SRE surveys for the past few years, and last year we focused a lot on the on-call and the troubleshooting and the burnout.
When you talk to SREs these days, they almost have PTSD.
The question is, how do we--?
We cannot burn some of these amazing resources night after night troubleshooting things, so how do we build this level of redundancy, as you mention, where somebody needs to be able to go on vacation or not be on call one day?
Charity: Yes, but at the same time, we have to be sure not to paper over problems with more and more human lives.
This is an issue that I have with monitoring software, I feel like for too long we've had ops people sitting between software engineers who were writing code and the low level system, just sitting there interpreting the graphs.
Just helping the software engineers understand the consequences of what they've shipped.
I've always said that anyone doing ops is the most closely aligned with users of all of the engineering teams, because when your users are in pain you're in pain.
Liz: But sometimes it isn't the case.
Charity: Sometimes it's not.
Liz: Sometimes the ops teams are working themselves to death over issues that are not actually affecting end users.
Charity: That's less common, but also true.
Liz: Like the prototypical, "Your disk is 90% full." 90% full, right?
So I think that this is why I think this conversation is really interesting, and that you cannot have observability without measurement of user experiences.
But nor can you operate your system solely on the signal of "Are our users in pain or not?" without the ability to dive in.
Mehdi: Absolutely. So obviously, observability is something that is very popular these days.
Charity: How would you define observability?
Mehdi: For me, it's funny. I'm going to age myself.
In 1999 I decided to quit because I was tired of the DoubleClick system going down every two minutes.
I sent my resignation email to our CEO at the time and he said, "Stop whining about it. You're now in charge of monitoring. Fix it."
So that's how we got into the monitoring business to some degree, and so we created this group.
We didn't call it the monitoring tools, we didn't call it the tooling team, we called it the quality of service.
My vision was, "How do we look at all the telemetry, all the signals we were getting from--?"
We had 17 data centers, 5,000 ad servers, thousands of switches and networks and all kinds of gears and storage systems.
We had literally millions of metrics coming in.
Liz: You don't know which metrics are important, right?
Mehdi: Right. So the madness was like, "OK. Who's going to sit and look at all these charts? Nobody."
So for me, observability was like "How do we put in place--?" And this is, again, without the word "Observability."
The concept was, "How can we tie the user metrics, which is at the end of the day what pays our paychecks, that's what keeps our businesses going, so how do we tie the customer metrics that we were getting with the IT telemetry that we were getting? How do we find the correlation, or how do we connect the dots?"
Our goal was, "How do we build tools to connect the dots?"
So at the time, there were not many open source projects and tools out there so we ended up buying this software from Smarts.
This is a company that I think got acquired by EMC later on, but it was one of the best correlation engines that I've seen at the time, that was literally able to--
Charity: So, your answer is "Secret sauce?"
Mehdi: Yeah, I guess.
Charity: Interesting. OK.
Liz: Or at least at the time, the best that people could do for observability was trying to do automatic metric correlation.
Mehdi: Right, exactly.
Liz: But the goal was still the same.
The goal is your system is failing, and you're trying to figure out why.
It sounds like it's not necessarily that the "Why" of observability has changed since 1999, it's that the "How" of what's possible has changed.
Mehdi: Exactly. It's really the "How" that is changing. I think it's going to keep getting better.
But going back to your question, which is "What's my definition of observability?"
It's the ability to connect various dots from all this telemetry lag that we have, and how can we quickly answer what is broken and why?
Charity: It sounds like you have a similar definition to me, and I'm coming at it from the perspective of what is different about observability than monitoring.
Because monitoring, I've been on call since I was 17 and I'm very used to the process of something breaks, you fix it, you postmortem, you write a monitoring check and you make a dashboard so you can find it immediately the next time.
That works great when you're only finding these genuinely new things once every couple of weeks or months.
Most problems used to be pretty predictable when you had a single app here and a single database here, and you could look at it and--
Liz: Which on the other hand, was an awful experience.
You write the playbook once and then you get paged for the same thing and you apply the playbook 20 times.
Charity: Oh, my God it was awful. So now we've matured out of that, now every time you get paged it should be something new.
It should be something that you're seeing for the first time, because the assumption is that you've automated away--
You will move the problems from the immediate "It's down. You need to go, human needs to go fix it."
You move from that bucket into the "It's auto-remediated, and a human can wake up and get to it on their own damn time, and get it into the non-critical state without affecting your users."
Because the golden rule is make many things able to fail without your users ever noticing.
So, we've gotten better at resiliency.
Mehdi: It's an amazing vision. I think that's where we're going to end up, no matter what, in the very near future.
But you have to remember that you're still dealing with a lot of organizations that are not equipped to do this lightly.
Liz: That brings us to the next topic we wanted to talk about, which is how do you get from point A to point B?
What was your own evolution? So you were in 1999, you brought on this magic system that started doing automatic correlation.
How did you evolve from there?
Mehdi: One of the big things that we did back then, and this started in '99 as well and it took us two years, is we ended up--
We wanted to put what would be called an APM product by today's standard.
There were not that many choices, or the choices were the typical vendors: CA, Netcool, Micromuse.
These are tools that date us all, and then the price tag was insane.
It literally would have cost us about $30 million to put an APM in place back then.
We ended up actually going back and building our own APM at DoubleClick, very similar to what Google ended up using on their own when it comes to every system having a web page that you can scrape and find the data on the metrics and everything.
We built this APM system and it was extremely useful. It was agentless: at the code level we gave APIs to the engineers and we told them, "Listen. You know best how to monitor your system.
I'm not going to tell you how to monitor your system, so implement this API and send us the hard data."
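As an illustration of the approach Mehdi describes, here is a minimal Python sketch of a telemetry API that a central team could hand to engineers. Everything in it is hypothetical: the `TelemetrySink` class, the `record` method, the metric names, and the `serve_ad` function are assumptions made for the example, not DoubleClick's actual system.

```python
import time
from typing import Any, Dict, List


class TelemetrySink:
    """Hypothetical in-memory collector standing in for the central monitoring system."""

    def __init__(self) -> None:
        self.events: List[Dict[str, Any]] = []

    def record(self, metric: str, value: float, **tags: str) -> None:
        # The owning engineer decides which metrics and tags are meaningful.
        self.events.append(
            {"metric": metric, "value": value, "tags": tags, "ts": time.time()}
        )


def serve_ad(sink: TelemetrySink, ad_id: str) -> str:
    """Hypothetical application code that instruments itself through the API."""
    start = time.perf_counter()
    creative = f"<ad id={ad_id}>"  # placeholder for the real work
    elapsed_ms = (time.perf_counter() - start) * 1000
    sink.record("ad.serve.latency_ms", elapsed_ms, ad_id=ad_id)
    sink.record("ad.serve.count", 1, ad_id=ad_id)
    return creative


sink = TelemetrySink()
serve_ad(sink, "42")
for event in sink.events:
    print(event["metric"], event["tags"])
```

The point of the design is that the application code, not an external agent, chooses what to report, so the signals describe what the service actually does for users.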
Liz: That idea of having engineers have ownership over their telemetry, but giving them frameworks to make it easy. That sounds very familiar to us.
Mehdi: That is, in my opinion, what we need to do.
Rather than just keep putting agents and agents and agents on stuff.
Also, it doesn't scale with serverless, containers and all that stuff. We have to have the engineers own monitoring.
That was one of the evolutions that we were a part of back then, and it was phenomenal, to be honest, because we went back to telling the engineers, "You want to sleep better at night? Then instrument your system. Instrument your application."
"Otherwise, we're going to force some telemetry on you. It's not going to mean anything to your application, and you're going to get that alert that says '90% disk utilization.'"
Charity: What would you recommend a team today do? Where should they start?
Mehdi: So, where do people start? This is something I see on a daily basis, where I think sometimes teams take on too much.
They try to bite on the bigger thing and they go on this massive project that takes sometimes three years.
Charity: Let me give you a scenario. You've got a team, they're paging themselves, and they really care about their systems.
They're doing their best, but they're drowning in alerts. Where do you start?
Mehdi: Turn off the alerts. Literally, turn off the alerts. I've done it many times.
Make sure that you go back and you look at "What is going to get the CEO of your company to call you at night because he got the phone calls from 20 other customers. What is that single thing?"
Liz: Right. That's getting to the point of service level objectives, prior to service level objectives but the same concept.
Mehdi: Exactly. So, you have to pick one metric or one system, whatever it is. Just start there and fine-tune your processes, your escalations, your everything that goes around monitoring.
Charity: You say "Fine tune your processes," but what does that mean exactly?
What would you recommend that a team start doing? So, you've turned off all your alerts.
You've found one end-to-end health check around something that makes you money to care about.
Now, what is the next step that you take?
Mehdi: At that point what we end up with-- And I get involved sometimes with some folks to do this, you document that.
What did we learn? What did we do? What were the thresholds, if any thresholds were necessary? Was there a performance threshold?
Was there a reachability threshold? Were there reliability? Because that's the other thing that people don't think about, is like--
Charity: So you're saying "Start doing retrospectives?"
Liz: Retrospectives, and we talk about SLOs being a living document.
Not a thing that you're afraid to touch, but a thing that you can revisit, and in order to read that you need to understand why it was set up the way it was.
Charity: Software is a living system. It's always changing.
Your users' requirements are a living system, so it should be constantly revised.
Liz: I also want to take a brief step back and I want to say that the thing I'm noticing here is monitoring alone does not buy you observability.
That you can have all the monitoring in the world, but it sounds like basically you have to iterate and figure out how you find the right things to look at.
Charity: That's about asking questions, and observability comes back to the idea that you should be able to ask any question, understand any state that your system has gotten itself into, even if you've never seen it before.
Even if you don't have any code that handles it, so it's about capturing the data at the right level of abstraction and that you have the data to answer these questions.
Liz: Then that gets to Mehdi's point about "You need to have the engineers instrumenting their code so that they can understand what is going on."
Charity: It turns out the people who build it have the original intent in their heads, and understand it better than anyone else can if they just come in the door and impose something on them.
Mehdi: You see, we started with the silos. You don't want an ops team and whatever team, so that's why the telemetry that we're gathering is the unifying glue across all these teams.
It's a universal language.
Liz: If you use the same telemetry, if different teams are using the same telemetry and have the same unified view, then you're not arguing about "Is it down or not?"
You can see the same data.
Mehdi: The thing that I still see 99% of the visits I do with customers is the finger pointing that happens within an organization.
Charity: This is why you constantly need to have this conversation so that everyone buys into it. No one wants to have something imposed on them.
Liz: Also, the element of blame. It sounds like you're talking about blame as well.
Mehdi: Listen, we went through this at DoubleClick, or we didn't do a lot of blaming but then when Google acquired us we went through the grind machine, and we started doing root cause analysis.
That's where blame lives, and I'll tell you the first series were a little bit hard because you had to put--
You had to leave the blame at the door. But that's how we learn, and again we have to use telemetry because that's the mathematics.
Two and two equals four, and even if we met people from Mars today they will understand that.
Liz: To recap, set up the standards for how you collect telemetry and encourage developers to write their own telemetry and add data to the system, and then start approaching things with a retrospective approach with taking blame out of the equation.
Then where does that get you? How can you measure progress?
Mehdi: The golden rules, everybody's talking about those things. There is the SRE book, so everybody is trying to implement all these amazing concepts and they're trying to learn to cook them with their own recipes within their own companies. Because once again, one mistake that people are making is, "Let's try to do what Google does."
No, you shouldn't do what Google does. Take those principles and apply it to your company, because you're not Google and you're not Facebook. It's great that you want to become like that, but the resources are a different scale. Implement the right process for your company, for your thing, rather than trying to imitate others.
But it's really about reliability, and this is where I go back to. Once you have all these things, all this right telemetry and processes, etc.
The question I sometimes ask my team and others is, "Was our reliability and our performance and availability the same at 2:00 in the morning versus 2:00 in the afternoon?"
Charity: Does that necessarily matter, though?
I would argue that that's not necessarily important for everyone.
Mehdi: If you don't have customers at 2:00 in the morning, I agree.
Charity: It's not about the maximum possible reliability, it's about your customers' expectations and meeting them.
Liz: I think that Mehdi and we are talking about the same thing. This is why I hate time period based SLOs.
Your 2:00 AM matters less than your 2:00 PM if your business peaks during business hours.
You should weight your SLO by the number of requests, and not by just number of time windows.
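To make the difference concrete, here is a small Python sketch comparing a time-window SLO with a request-weighted one. The `Window` type, the two function names, the 1% error target, and the traffic numbers are all invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Window:
    label: str
    requests: int
    errors: int


def time_window_compliance(windows: List[Window], max_error_rate: float) -> float:
    """Fraction of time windows that met the target; every window counts equally."""
    good = sum(
        1
        for w in windows
        if w.requests == 0 or w.errors / w.requests <= max_error_rate
    )
    return good / len(windows)


def request_weighted_availability(windows: List[Window]) -> float:
    """Fraction of all requests that succeeded; busy windows count for more."""
    total = sum(w.requests for w in windows)
    failed = sum(w.errors for w in windows)
    return 1.0 if total == 0 else (total - failed) / total


# Made-up traffic: a quiet, unhealthy 2:00 AM and a busy, healthy 2:00 PM.
windows = [
    Window("02:00", requests=100, errors=50),
    Window("14:00", requests=100_000, errors=100),
]

print(time_window_compliance(windows, max_error_rate=0.01))
print(request_weighted_availability(windows))
```

With these numbers, the time-window view reports that half of the windows missed the target, while the request-weighted view shows that the large majority of requests succeeded, which is closer to what users experienced.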
Mehdi: Instead of reliability, the other word I use is "How consistently are you delivering that level of service that you've promised people?"
Liz: I love that. "Consistency." That's a good way of thinking about it. It avoids people thinking about perfection.
It's consistency. If I'm serving 1% errors all the time, then that's consistent.
People's expectation is that they hit reload if they encounter an error one in 100 times.
But if it goes 100% down, then that's not consistent. I love that.
So Mehdi, one thing I heard you say is that you as a CEO make visits to your customers.
Tell us a little bit more about that. You're a, what, 100-200 employee company. What's the value that you get out of visiting your customers?
Mehdi: Yes, I live in Los Angeles and our headquarters are in New York.
I'm usually always in New York, so I clock about 400,000 miles a year.
Part of that is at least every week I must visit two or three customers, I must speak to two or three customers.
I do this for two reasons. One is to keep my sanity, to make sure that we're still on the right track, make sure that I don't get surprises from an upset customer or an angry customer or anything of that nature.
But what's interesting is, and I learned that at DoubleClick, we went to a great university when it came to customer management.
If you listen to customers, if you really listen to what they say and their challenges, and if you solve that for one person or one company, you solve that for another hundred.
I am very close to our customers because we want to get ahead of what challenges are they trying to solve, and I think we care.
The other thing is, when we started the company in 2008 it was obviously a great year to start the company.
I highly recommend, it's a great vintage year.
The economy was burning and we couldn't raise funding, so we self-funded the company.
But in 2010, our customers became our investors. Literally, we bootstrapped it but our customers saved us.
So I have this very weird relationship with our customers where I look at them as our first investors, all 450 of them.
I go and visit them to thank them, to listen to them, to hear about their problems.
But again, just to show the appreciation we have for the awesome responsibilities they've given us.
Liz: That sounds amazing. I hope we can keep doing that, between you and Christine, Charity. As we scale out.
Mehdi: Yeah, please never stop.
Liz: Excellent. I think that we have reached the end of our time together, so thank you so much for coming in and talking to us, Mehdi.
Mehdi: Thank you. It was a pleasure and an honor to be with both of you.