
Ep. #48, Mastering Migrations with Adriana Villela of Tucows

Guests: Adriana Villela

In episode 48 of o11ycast, Charity and Liz chat with Adriana Villela of Tucows. They unpack Adriana's journey into observability, the community resources that lit her path, and her experience driving o11y at Tucows.


About the Guests

Adriana Villela is Sr. Manager of Observability & Platform Solutions at Tucows. She was previously Head of DevOps Standards & Practices at BMO Financial Group.

Show Notes

Transcript


Adriana Villela: I did a bunch of stuff. I took to Twitter.

I follow both of you on Twitter. So I learned a lot from your posts.

I got into watching a bunch of videos, both from Honeycomb and LightStep, really great resources.

I'm not usually a video person, to be honest, I prefer just to skim articles, but the videos have been awesome.

And I've taken to various Slack communities.

I've asked lots and lots and lots of questions, both of the Honeycomb community and the LightStep community.

And everyone has been really awesome about answering my questions. So yeah.

Liz Fong-Jones: Why do you think people have been so interested in helping you along that journey?

Adriana: I think it's been because these communities are super passionate about observability, I mean--

Charity Majors: I think of it almost as, because it's so much overlapping with the community of folks who are on call who know what it's like to suffer, and there's kind of a band of brothers effect, that you've got where, you know how much pain they're probably in and you just don't want to see anyone suffer.

So you're... I always find that the ops teams are the most eager to share information.

It's why we're also always trying to publish our post mortems and being super detailed, right?

Adriana: Yeah. I totally agree.

I mean, we've got the solidarity in our pain and unlike some communities where it's like, "well, we've gone through this and therefore you too must suffer." We're like, "no, why?"

Charity: Right.

Adriana: There's a better way.

Liz: Right. I think it goes to this idea of HugOps, right?

That we're here to level each other up rather than kind of beat each other down.

Charity: Yeah, exactly.

Liz: And how does that kind of contrast with your--

what were some of the other skills that you kind of picked up over the past couple of years that you're interested in really deep diving into and learning about?

Adriana: Well, I guess my, my foray into observability started with the DevOps, so-

Charity: Yes. Can you tell us about that?

What made you get started being interested in observability in particular?

Adriana: It was kind of by accident to be honest.

So I have this really good friend.

We nerd out on all things tech and we were working on like a little side app together and, he was working as an SRE at another company.

And I was doing some, I guess, SRE-esque work.

I was on a release management team at another company.

And we were just shooting the shit just talking about how is it that we can make our little app, which was running on Kubernetes, more resilient.

And then I think I pointed him to one of your tweets, Charity, and then I guess he got super addicted to your tweets.

And then he kept always telling me about all the stuff that you were posting, "okay, I guess, let me follow her."

And we found ourselves having these massive debates about observability, I would say even starting like two years ago.

And it was funny because like honest to God, I did not understand observability at the time.

I cannot wrap my head around this.

I don't understand, I kind of get it, but I kind of don't, how do I explain it to people?

And then I took my current job, where I'm basically leading the observability team at Tucows.

I need to make sure that I do right by my team and make sure that we're following good observability practices.

In theory, I knew observability is a good thing, needs to be... these are good practices to be followed, but how do I ensure that I can speak intelligently about observability to my team and spread it across the org?

Charity: I loved that you didn't outsource it to a vendor. I feel like that's what many leaders do.

They turn to the vendor that they're most comfortable with. They're like, "okay, educate me about this."

And then they just swallow it and pass it along. And I love that you took a much more independent approach.

Adriana: Yeah, totally. Yeah, and for me it was I need to understand the concepts behind it.

I don't want to just take a very biased view on observability-

Charity: Well you work for a large and successful company, that means there's a lot on the line.

Adriana: Yeah. Yeah, absolutely.

Liz: Yeah. I guess you've introduced yourself now.

So what was it that made observability kind of click for you finally?

What combination of things got you to understand, for instance, how is it different from monitoring?

How should we be thinking about this?

Adriana: So I think o11ycast really helped actually.

And especially, I think it was the first episode where Charity was giving a definition of observability, and I kept playing that clip over and over and over again.

And then I'm like, "oh my God, this makes sense."

And so it was that, and then a discussion that I had with my friend Bernard, who was the one, where we were having--

We kind of stumbled on observability together where he's like, "Well observability is one of those things where if you have some sort of issue with your application where everything's fine for most of your Firefox users, but you've got like the one guy using Opera where like the API calls aren't going through, like, this is what observability picks up on."

And then I started digging deeper and deeper and deeper and you know, like o11ycast and reading, like various papers online.

So now for me in my mind, observability is about getting that holistic view of your system, but also understanding what your system does without having to know the nitty gritty details of the system. For me, I'd say icing on the cake is the fact that you can address those unknown unknowns.

And, you know, Charity had that great blog post about like dashboards.

And it's like, yeah, dashboards aren't terrible. But like, the problem is you've got like a lot of teams out there who have these dashboards of problems that they've encountered before. And pretty soon you've got like this huge monitor full of dashboards.

Liz: Yeah. The idea of dashboards as technical debt, that is pretty powerful.

And I think the other really cool thing is that concrete example, right, of wanting to debug the experience of the lone Opera user.

Adriana: Yeah.

Liz: Of being able to actually know what you're using that cardinality for.

Adriana: Yeah. Yeah.

Liz: I think one of the stories from early in Honeycomb's time was the idea of talking about high cardinality without talking about what high cardinality could do for you, right.

And kind of that hump was something that held us back a long time.

Charity: People talk about high cardinality and cardinality a lot now.

But when we started five years ago, everybody was like, "what the hell is that?"

And people kept warning me not to talk about it because it was so offputting to people because nobody understood what it was.

So I'm delighted that the world has come around to realize just how important it is, because it's at the base of everything.

Your browsers, that's a high cardinality dimension, your users, that's a high cardinality dimension, your... everything you care about, everything that's identifying is a high cardinality dimension.

Liz: And especially the cross products, right? People did fine saying, "Okay, maybe we can fingerprint on just the three values: are you on Windows, Mac, or a mobile device?"

That doesn't work anymore.

Adriana: Right.

Charity: Yeah. Hard coding no longer works.
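
To make the lone Opera user concrete, here is a minimal sketch in Python of why high-cardinality dimensions and their cross products matter. The event fields (`browser`, `os`, `user_id`, `ok`) and the sample data are illustrative assumptions, not anything described in the episode. The aggregate error rate looks tolerable; only grouping by finer dimensions shows that one user on one browser is failing every time.

```python
from collections import defaultdict

# Hypothetical request events; field names and values are assumptions.
events = [
    {"browser": "Firefox", "os": "Windows", "user_id": "u1", "ok": True},
    {"browser": "Firefox", "os": "Mac", "user_id": "u2", "ok": True},
    {"browser": "Firefox", "os": "Windows", "user_id": "u3", "ok": True},
    {"browser": "Opera", "os": "Windows", "user_id": "u4", "ok": False},
    {"browser": "Opera", "os": "Windows", "user_id": "u4", "ok": False},
    {"browser": "Chrome", "os": "Mac", "user_id": "u5", "ok": True},
]


def error_rate_by(events, *dimensions):
    """Group events by the given dimensions and return {key: error_rate}."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for event in events:
        # The key is a tuple, so any cross product of dimensions works.
        key = tuple(event[d] for d in dimensions)
        totals[key] += 1
        if not event["ok"]:
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}


overall = error_rate_by(events)                    # one bucket, key ()
by_browser = error_rate_by(events, "browser")      # per browser
by_browser_user = error_rate_by(events, "browser", "user_id")

print(overall)
print(by_browser)
print(by_browser_user)
```

With pre-aggregated metrics, only the first number survives. Keeping the raw events is what allows the second and third groupings to be asked after the fact.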

Liz: So I think that brings us to the next question that we had, which is: what were the most common misconceptions that you've encountered as you started, in turn, trying to evangelize observability in your community?

Adriana: I think this confusion with, first of all, you need a wall of dashboards, number one.

Number two is people are like, "Oh does such and such a vendor do APM?"

And I'm like, "Huh, why?" This focus on the monitoring, I guess was the biggest thing.

Charity: You know, I didn't know what APM meant until a year and a half after we started Honeycomb. I'd never heard the term before.

Adriana: I think it was in one of the o11ycast episodes where someone mentioned that, if you don't grow up with APM and you go into observability, it's so much easier to understand the concept.

Versus if you do grow up with APM, trying to retrain your brain around, it becomes so much harder.

And for me, because I wasn't really exposed to APM.

It was like, "Whatever, this shit doesn't make sense to me, observability is a lot more intuitive."

Charity: Right?

Liz: Yeah. That's almost like the difference that we see between people who expect to start with pre-canned dashboards, as opposed to people who expect every time to be able to start with an empty query builder and just fill it in as they go.

Adriana: Yeah. Yeah.

Charity: I have a lot of sympathy for that because I think that creating queries from scratch is not easy for almost anyone.

Much less when you're in the middle of an outage or something that's very stressful, and then being faced with an open query browser is just... it's really hard.

And it takes you out of the moment, takes you out of flow debugging.

And that's why I think, while I think it's necessary to have that capability, I think it's also just as important that we have the ability to sort of curate things that pull together our history.

Like, "oh, this was a useful graph."

It's really hard to compose a new query, but anyone can tweak a query that's close to where you're trying to go, right?

Liz: And I think it's that closeness, right?

Is there something that's closer than just rate duration, right?

If you're starting with rate duration, you may as well be starting zoomed all the way out as opposed to somewhere closer in.

Adriana: But if you're trying to figure out if it's--

What query it is and you already happen to have this useful graph that showed you a different bad query, then you can just plug in a different variable and come up with the same result or a different result with the same pattern later on.

Liz: Did you ever run into people who are like, "why do I even need this, right?"

"Why is my existing monitoring not good enough?" Did that come up for you?

Adriana: Yeah. I would say especially people who are really into logs.

They're like, "but my logs are good enough." And I'm like, "but there's no context."

Actually, one of the challenges I have, and-- it's just an education thing, but it is a challenge nonetheless, is that I'm trying to steer people away from logs.

But everything you need is in the traces, because it encapsulates all of that good information and then some.

Charity: And you didn't have to think of putting it in there in advance.

That's the problem with logs: you have to have logged it, and who thinks of logging every single thing? Well, you don't.

Adriana: You're right.

Liz: Or people don't, and they just throw everything and the kitchen sink in there and it's hard to find what you need.

Charity: Right.

Adriana: Yeah. Yeah, exactly.

Liz: So, I think that brings us to talking about then, how does someone, what are the steps that someone should go through, right?

If you are already using logs or you're already using metrics, how do you evolve from there to a state of having better observability, kind of what's that journey been like for you?

Adriana: Part of it, it starts with the education.

So internally I've been running a couple of sessions. We had a town hall recently where I gave just a high level presentation on observability and on OTel.

Charity: Nice.

Adriana: Next week I'm actually doing a learning session.

So doing a deep dive into observability and explaining the difference. What's a trace, what's a span, what's a log.

Charity: Yeah.

Adriana: And yeah, it's evangelizing it's--

Anytime someone contacts my team about observability, I direct them to my blog posts and my team's a unique team because we're not running software, we're not observability practitioners, but we are defining the practices and standards around observability.

So we want to do right by it.

Liz: Definitely has a lot of overlap with the kind of world of SRE and DevOps, right.

Where you're trying to level up other teams and that's your kind of primary role, right?

It's not, we are taking the pagers or we're writing all the tests for you, or writing all of your observability, right?

It's... we want to empower you.

Adriana: Yeah, exactly. Exactly.

Charity: But where is that first step?

Where do you begin moving from logs and metrics to the world of observability.

Cause Liz and I, a while ago we wrote up this maturity model, but there doesn't seem to be any linearity to it.

There doesn't seem to be any path that people take.

We identify five or six different areas where you can be weak or strong and how you can get better in them.

But I'm super curious, where do you start? And then how do you build on that?

Adriana: So like I said, first with the education, and then it's basically getting people to instrument their code.

Charity: Yeah.

Adriana: And because we're trying to educate teams on that.

It means sometimes working closely with teams to explain, to-- sorry to them, how to best instrument their code.

So we're not necessarily there instrumenting their code because it's like asking me to-

Charity: Right.

Adriana: Touch code that I don't understand, but we can at least provide those best practices around that and help you troubleshoot and pair with you.

And so, one of the things that we've done... OTel's been great because it has instrumentation examples in various languages.

But with open source, one of the pitfalls, I guess, is that documentation can be a little bit lacking.

Charity: Yeah.

Adriana: Or out of date.

We're looking at basically creating reference implementations in a number of the languages that we use at the company that are a little bit more complex.

And also because the OTel spec is constantly changing, some of those examples can be outdated.

So we're at least trying to keep up with the latest changes in the OTel spec.

And then of course documenting the hell out of it.

Liz: And hopefully also kind of integrating with the fields and kind of things that are very useful for your business domain.

Adriana: Yeah.

Liz: As those are never going to be in a centralized docs repo.

Adriana: Yeah. Absolutely. And the other thing too is because we are...

I'm really pushing for us to go OTel, making sure that we run an OTel collector gateway and as part of that, we have some common tagging that we do--

Charity: Yeah. One technique that I've used in the past that I've heard other people using is kind of following the pain, right?

You pay attention to who's on call and what they're getting paged about.

And you just start instrumenting, you shine the light on whatever hurts and instrument that part.

And then you shine the light on whatever hurts next.

And you just get people to be in the habit of having an instrumentation-first approach to debugging.

And that spreads out a reasonable amount of instrumentation when you can't get teams to sit down and instrument, which sounds like you can, which is phenomenal.

A lot of people can't get their teams to actually sit down and instrument.

Liz: Yeah. I was just going to remark on that.

That's really unusual because we often hear teams being like, "why is the auto instrumentation not good enough?"

"Why can I not just write my business logic and be done?"

Adriana: Yeah. Yeah.

Charity: Do you find that you need to help inspire them to get this done by showing them the results or walking them through how much easier it can be to debug?

Adriana: We're lucky in the sense that the vendor we're using right now, I think because they were dying before from not having an observability vendor.

Charity: Yeah.

Adriana: That they see the value in having the extra insight.

So it was like, "oh my God, this is the greatest thing ever."

So now what we want to do is take it a step further, where let's go the OTel route so that we have, first of all, the industry is moving in that direction.

All the major observability vendors have embraced OTel.

So I don't want to be in a position where if we need to switch vendors at some point in the future, because it doesn't suit our needs, this isn't going to cause us grief in the future.

Charity: With OTel, it's one time and you're done.

Adriana: Yeah, exactly.

Charity: Yeah.

Liz: So what are some of the other things that are kind of facilitating your journey?

We often have talked on this podcast before about cycle times.

How long does it take from the time that someone adds the instrumentation to the time that it's running in production and they can see what they've done?

Adriana: I think it's a pretty fast cycle time.

Like I said, we've got some teams that are running stuff in prod right now, where there's an incident and they're already able to-- they get those insights right away.

So now I think the challenge is really getting them into some of those good habits that I think you guys have mentioned in this podcast, which is like, know what a healthy system looks like, right? Don't just wait to-

Liz: Oh, right. So you're looking at it when it's healthy. Not just when it's broken?

Adriana: Yeah, yeah.

Liz: Yeah.

Adriana: So that's what we want to encourage teams to start doing because not everyone is doing that.

For example, one of the teams that I manage, not the observability team, I have a platform team and we have some custom software, and it's really cool that the SREs are in the habit of, when it's their on-call week, checking the system regularly.

So they know what healthy looks like and what unhealthy looks like.

So we want to make sure that we propagate those behaviors across the org.

Charity: Yeah, for sure. So you didn't get much pushback from your teams.

Did you get any pushback from your management?

Adriana: I would say some people were a little more hesitant, mostly on the OTel stuff.

Not so much on the observability stuff, because it's like, "well, we've already instrumented it with the vendor library, you're asking us to switch to something."

And then there's the common pushback around OTel was, well, it's not ready. And the...OTel as a whole is not ready, which is not true.

So part of what I'm doing now is educating people. I'm like, "no, that's not true, traces are 1.0."

We're getting there with metrics and logs. Well, I don't know. Do we actually care about logs?

Liz: Right? Exactly. Do you really need logs if you have span events?

That has been a common thing that I've raised when someone's like, "oh, I want to dump all of my logs."

Well, do you really need to dump all your logs? And for the logs you do want to send, wouldn't it be better if they were attached to traces anyway?
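
As a rough illustration of the point about attaching log-like messages to traces, the sketch below uses a toy `Span` class written only with the Python standard library. It is not the OpenTelemetry SDK; the class, its fields, and the sample attribute names are assumptions made for illustration. What it shows is that an event recorded on a span inherits the span's trace ID and context, whereas a free-floating log line has to carry that context itself or lose it.

```python
import time
import uuid
from dataclasses import dataclass, field


@dataclass
class Span:
    """Toy stand-in for a trace span; not the OpenTelemetry API."""
    name: str
    trace_id: str
    attributes: dict = field(default_factory=dict)
    events: list = field(default_factory=list)

    def add_event(self, message, **attributes):
        # Each event is stamped with the span's trace ID, so it can be
        # found later alongside everything else in the same request.
        self.events.append(
            {
                "time": time.time(),
                "message": message,
                "trace_id": self.trace_id,
                **attributes,
            }
        )


# Hypothetical request: the context lives on the span, once.
span = Span(
    name="checkout",
    trace_id=uuid.uuid4().hex,
    attributes={"user_id": "u4", "browser": "Opera"},
)

# What would otherwise be two unrelated log lines.
span.add_event("cache miss", key="cart:u4")
span.add_event("payment retry", attempt=2)

for event in span.events:
    print(span.name, span.attributes, event["message"])
```

A real tracing library would also record start and end times and parent relationships; the sketch leaves those out to keep the focus on where the context comes from.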

Adriana: Yeah, exactly. Yeah.

So that's been the battle, if you will, is just getting them to understand.

Okay, OTel is maturing a fair bit. I think one of the key components, which is tracing.

That is the money, that gives you that end to end visibility. That's fairly mature.

Charity: How many services are you guys running?

Adriana: I don't actually know. I'm actually pretty new to the company.

I've been here for five months. So I'm still finding my way around.

Charity: No, I'm just curious because that's a very microservice heavy type thing to say.

If most of your things are caught via tracing instead of via events, that usually means you have a lot of services.

Liz: And it also speaks to kind of decentralization, right.

If you have, if your services themselves are cattle rather than pets, right.

At Honeycomb, we fully acknowledge we have pets, right?

We have six microservices, we know the names of each of the six microservices, right? We all are on call for them.

Adriana: Right. Right.

Liz: And that's very different than a much more decentralized company, which is... Tucows has been around for a long time, right?

Adriana: Yeah. Yeah. We have.

Charity: Yeah. You guys must have really mastered the art of migrations, if you're doing stuff that is this cutting edge and you've been around for that long.

Adriana: Yeah. I have to say, I've been really lucky that working with my manager and also the CTO of Tucows, they're pretty chill dudes and very open-minded and they've given me runway to basically drive observability in the direction that it needs to be-

Charity: That's great.

Adriana: going, because when I compare, I've had stuffy ass corporate jobs before and this is night and day, to be honest.

I worked at a corporate job where it's like thou must go through the corporate hierarchy to do anything. And there's no autonomy.

And I've been really lucky that I've been given autonomy and there's trust in my skills, which honestly as a woman in tech, it's huge because unfortunately I've been mansplained and in meetings I've had male coworkers not make eye contact with me, when I ask a question and look at another male coworker to answer the question.

So to be in an environment where I have autonomy, there's trust, there's trust in my skillset, trust in what I bring to the table.

Charity: That's fantastic.

And that tends to have a ripple effect, that you have trust in your team and everything.

And that's awesome. I'm curious, you guys, if you set any metrics for your success, if there's anything you negotiated in the beginning where you were like, "aha, this is a sign."

Maybe we're getting paged too much. Maybe we're having too many of these kind of outages or whatever.

And this is a metric where we will know that we have succeeded.

If we roll out this observability stuff, is that a kind of discussion that you guys had at all?

Adriana: It is a discussion that we're having, but we haven't settled on that yet.

So one of the things that I'm really encouraging the teams to adopt is, start thinking about those SLO based alerts.

Charity: Yeah.

Adriana: And I think a lot of organizations still are even struggling to achieve that SRE mindset, right.

There's so many job posts out there for SRE and it's not SRE, based on the job description.
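
One common way to express an SLO-based alert is as an error-budget burn rate. The sketch below is a minimal Python version under stated assumptions: the function names, the 99.9% target, and the paging threshold of 14.4 are illustrative choices, not figures from the episode or from Tucows. A burn rate of 1.0 means the service is consuming its error budget exactly as fast as the SLO allows; the alert fires only when the budget is being spent much faster than that.

```python
def burn_rate(bad_events, total_events, slo_target):
    """Return how fast the error budget is being consumed.

    1.0 means exactly on budget; higher means the budget will run out
    before the SLO window ends.
    """
    if total_events == 0:
        return 0.0  # no traffic, nothing burned
    error_budget = 1.0 - slo_target          # e.g. 0.001 for a 99.9% SLO
    observed_error_rate = bad_events / total_events
    return observed_error_rate / error_budget


def should_page(bad_events, total_events, slo_target=0.999, threshold=14.4):
    """Page a human only when the budget is burning fast (assumed threshold)."""
    return burn_rate(bad_events, total_events, slo_target) >= threshold


# Hypothetical counts for one recent window of traffic.
print(burn_rate(5, 1000, 0.999))
print(should_page(5, 1000))
print(should_page(20, 1000))
```

The design choice here is that the alert is tied to user-visible impact relative to an agreed target, rather than to a fixed threshold on a single host-level metric.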

Liz: Yeah. It's one of those things where, it was a natural fit for me, at least, to jump from more working on SRE, full time to working on observability as a way to actually empower people, to achieve the outcomes of SRE.

Adriana: Yeah.

Liz: Where if people were trying to do SRE without observability and it wasn't working for them.

Adriana: Yeah, exactly. I feel like-- I mean, observability gives SRE superpowers.

Charity: Yes. Exactly.

Liz: Why is that? What have you found that kind of does that?

I know I've given my perspective a lot, but it always helps to hear what other people have seen.

Adriana: For me, if you don't have observability, your job is 10 times harder. I mean can you imagine, well, I'm sure you can.

Sifting to logs yeah-- Sifting through logs to figure out where the problem is.

So for me, one example I can think of from earlier on in my career is, I was working on an app where our users were complaining it was slow.

So I contacted the database guy.

I'm like, "Hey, it's slow. Do you see anything in the database now?"

"No."

"Look at the Oracle dashboard."

"No, it looks good."

And I talked to the network guy, "no, network's fine."

Talked to the storage guy, "no, the disks are running fine."

I'm like, "oh my God, well, there's a problem still because... but we just don't know where it is."

And observability-

Charity: And you didn't have traces?

Adriana: Yeah, exactly. Exactly. And for me, so the aha for observability is, this problem wouldn't happen anymore.
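
The "everyone says their part is fine" story is the kind of question a trace answers directly. The sketch below is a simplified Python illustration; the span names, durations, and the flat dictionary layout are invented for the example, and it assumes child spans run one after another without overlapping. It computes each span's self time (its own duration minus its children's) to show where the request actually spent its time.

```python
# Hypothetical spans from one slow request; all values are made up.
spans = [
    {"id": "a", "parent": None, "name": "GET /report", "duration_ms": 4200},
    {"id": "b", "parent": "a", "name": "auth check", "duration_ms": 40},
    {"id": "c", "parent": "a", "name": "render report", "duration_ms": 4100},
    {"id": "d", "parent": "c", "name": "db query", "duration_ms": 120},
    {"id": "e", "parent": "c", "name": "pdf export call", "duration_ms": 3900},
]


def self_times(spans):
    """Return {span name: time spent in the span itself, excluding children}.

    Assumes children of a span do not overlap in time.
    """
    child_total = {}
    for span in spans:
        parent = span["parent"]
        if parent is not None:
            child_total[parent] = child_total.get(parent, 0) + span["duration_ms"]
    return {
        span["name"]: span["duration_ms"] - child_total.get(span["id"], 0)
        for span in spans
    }


times = self_times(spans)
slowest = max(times, key=times.get)

print(times)
print("Most time spent in:", slowest)
```

In this made-up trace the database span is fast, which matches what the database dashboard would have said, while the slow step sits in a downstream call that no single team's dashboard covered.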

Charity: It's so true.

If you don't have observability, every single issue is a one off, because if all you have is aggregates it's up to you, the human to drill down and try and figure out what's going on below that, it's up to you, the human to sit between your metrics and your logs and your traces or whatever, it's up to you, the human to do these super heroic acts of debugging every single time.

That's not something you can train people to do.

You can't expect software engineers to own their own code and production, if you don't have observability, in my opinion, because it's just too much, it presumes too much low level systems knowledge, it presumes too much high level systems knowledge.

It presumes too much deep knowledge of the individual tools.

Whether it's strace or mtr, whatever.

If you don't have observability, you're sending them off on a quest every fucking time that there's something wrong and that's just not reasonable to expect-

Liz: And worse, right. It's not just under normal circumstances.

It's when they're under a lot of pressure and stress, right? You want these things to be much easier when people are under stress, especially.

Adriana: Exactly.

Charity: Yeah.

Liz: So, you mentioned kind of one challenge was the challenge of ripping out vendor-specific instrumentation that wasn't helping you achieve observability in favor of a vendor-neutral, OTel-based approach that was going to give you observability via one of several routes.

Adriana: Yeah.

Liz: Or kind of some of the other things, right.

Obviously you've got some pushback on... right, so you got the pushback of, we have to do all this technical work.

Adriana: Yeah.

Liz: Was there also kind of pushback over the cost of solutions or pushback over how much value you would get kind of, how did, how did that conversation go for you?

Adriana: The pushback over the cost isn't so much of an issue because we ended up going with a vendor solution.

So at least people saw the value in that: rather than us trying to maintain infrastructure in house, a vendor can do a much, much better job than we can in trying to cobble together a bunch of open source tools.

So that was honestly very beneficial on the... I think the two most challenging things are basically making sure that we convince teams of the value of OTel.

And I think people are getting excited about it.

It's just the... "oh, how do we do this and still meet our deadlines?" Very reasonable.

And then just making sure, also that they're following proper observability practices, because yeah, that's great, you're instrumenting your code, you're sending your metrics over to an observability backend.

Yeah, but are you actually utilizing the tool properly to its fullest extent?

Are you following those observability practices? Do you even know what you're looking for?

So I think that becomes the biggest challenge. So it's more...it's an education on the practices.

Liz: So people are bought into the idea that they want to do this.

And it's just a matter of making sure that they know how and making sure that that's practiced?

Adriana: Yeah. Yeah. It's kind of like in the DevOps world, right? It's like "I have Jenkins, therefore I am DevOps."

"Oh, I have an observability tool, therefore I have observability." No.

Charity: This has been super fascinating talking to you.

Is there anything that you wish you could tell yourself six months ago or whatever it was that you began your investigation of observability, anything that you wish you had known then?

Adriana: My biggest thing is I wish I hadn't been so shy to reach out to the community earlier.

I was a little bit scared at first. I think I started becoming bolder. I attended o11ycon.

Charity: Yeah.

Adriana: This year and it was so cool to see everyone so passionate about observability and that kind of empowered me to get super duper excited about observability.

So yeah, I would say, start asking questions earlier and don't be afraid of how people will respond to your questions because the community is fricking awesome.

Charity: It's a very kind community. Part of what I love too.

Liz: Yeah. It's the fact that we're all collaborative, right?

We were having a discussion right before we recorded the episode about how to marshal people from the OTel community to talk to Adriana's engineers, right.

And that was... I don't care, whether or not you specifically are using Honeycomb the product, I want OTel to be successful.

Adriana: Yeah, absolutely. I hope that at some point, my team can contribute to OTel as well.

Because, I think it would be amazing.

Liz: Yeah. Well, there's so much room where you uniquely know, as people who work in Perl and work with kind of the lower level stuff, what you need in order to be successful, to help other people in the same shoes.

Adriana: Yeah, absolutely.

Liz: Awesome. Well, it was a pleasure having you on the show.

Thank you for joining us today.

Adriana: Thank you very much. This has been my dream.

So yeah, I really enjoyed our chat.