In episode 39 of o11ycast, Charity and Liz speak with Jordan Simonovski of Atlassian. They explore the differences between tracing and logging, tactics for measuring reliability, and the journey of the observability team at Atlassian.
About the Guests
Jordan Simonovski: I think having introduced new practices and having actually talked about things like SLOs and observability in smaller organizations has been a lot easier.
When you're suddenly in an organization of 5,000-ish engineers, all kind of working remotely.
Charity Majors: Larger than the town that I grew up in
Jordan: Everyone's working remotely now as well.
And that affects the effective dissemination of information in the org.
It gets a lot harder and there's only so much information you can kind of throw at people in terms of blogs or webinars or any other kind of training material.
At this point, it kind of starts becoming about how we can make it easier for them to do the right thing or what we think is the right thing.
Jordan: And not so much kind of incentivizing not so good practices, I guess.
Charity: It's about great defaults, creating great defaults, so they automatically do the right thing.
Jordan: Yeah, yeah. And having worked in, I think, DevTools before, that's exactly the place that you kind of want--
You want to give them the best thing first.
And if they do want to do what you think is the wrong thing or what I think is the wrong thing, you let them do it, but not by default.
Charity: Well, you've been engaged in doing this over the past year, right?
I mean, you came on to it about a year ago, I guess. Like, what was the state of the team when you got there?
Like what were the initial challenges that you got dropped into?
Jordan: The initial challenges were kind of understanding the observability landscape in the organization.
So we kind of break things up into how we collect all this telemetry and how we actually process it. I think we have, like, pipelines that this goes through.
We sanitize things that we think aren't right for kind of throwing into different vendors' buckets and things like that.
So we want to keep things as safe as possible as well, and kind of strip out unsafe data when it gets in, which tends to happen.
And we do have like a regulatory responsibility to make sure that doesn't happen.
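As a rough illustration of the sanitizing step Jordan describes, here is a minimal Python sketch of scrubbing a telemetry event before it leaves for a vendor. The episode does not say which fields Atlassian strips or how, so the deny-list, the email pattern, and the `sanitize` function are all assumptions made for the example.

```python
import re

# Hypothetical deny-list; the episode does not name the fields that get stripped.
UNSAFE_KEYS = {"password", "authorization", "email"}
# Deliberately simple email pattern, good enough for a sketch.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def sanitize(event: dict) -> dict:
    """Return a copy of a telemetry event with unsafe data dropped or masked."""
    clean = {}
    for key, value in event.items():
        if key.lower() in UNSAFE_KEYS:
            continue  # drop the whole attribute
        if isinstance(value, str):
            # mask email addresses embedded in free-text values
            value = EMAIL_RE.sub("[redacted]", value)
        clean[key] = value
    return clean


event = {
    "service_name": "checkout",
    "email": "a@example.com",
    "message": "payment failed for bob@example.com",
    "duration_ms": 42,
}
print(sanitize(event))
```

Running this in a central pipeline, rather than in each service, is what would let a platform team enforce it on everyone's behalf.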
Liz Fong-Jones: Yeah. One question I had for you is--
If you can kind of centrally mandate what those good defaults are, like that works, great.
What happens if you need to grassroots influence rather than being able to come in with a gold standard?
Jordan: We do have a saying in the team for influencing grassroots-style with some of the newer stuff we're working on: sell the sizzle, not the sausage.
So we want to get people excited about all the new kind of functionality or new opportunities that come up from good observability tooling and what developers can actually infer from the data that they do throw in.
So to say.
Liz: Hmm, so as you're saying, right, like it's about the results you can achieve and not about the process and kind of the sausage of making it.
Charity: How did you go about figuring out where to start?
Jordan: So where to start is understanding the problem at hand and for us, one thing that we do recognize is that the developer experience isn't necessarily great for what observability looks like, and that's something we really want to improve on.
So how do we make it easier for developers to ask new questions from their systems?
How do we actually make it easier for them to say, oh, I'm in an incident?
What can I look at to actually understand what's going on? What's breaking?
Charity: And what was the answer for them?
Jordan: So we've kind of gone all in on OpenTelemetry as a standard.
And as part of that, we have kind of worked on, like, safe defaults.
So assuming a good data model from the ground up and saying, cool this is the information we think is most important.
So what kind of information can we actually grab out of our SDK?
So we're like auto wiring stuff into our Java apps and whatnot.
And what can we actually pull in on their behalf?
Liz: Yeah. The more you can make things consistent, and the more things you can make work out of the box, the easier it is for people to see the value, rather than having to figure out which one of these five field names for the service name is correct: service name, service underscore name, and so on.
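To make Liz's point concrete, here is a small Python sketch of mapping competing spellings of a field onto one canonical key. The alias table and helper names are hypothetical. `service.name` is the attribute name used by the OpenTelemetry semantic conventions, but nothing in the episode says this is how Atlassian's SDK does it.

```python
import re

# Assumed alias table: "squashed" spellings mapped to one canonical key.
CANONICAL = {"servicename": "service.name"}


def canonical_key(key: str) -> str:
    """Map spellings like service_name or serviceName onto one canonical key."""
    squashed = re.sub(r"[^a-z0-9]", "", key.lower())
    # Unknown keys pass through unchanged.
    return CANONICAL.get(squashed, key)


def normalize(event: dict) -> dict:
    """Rename every known alias in an event to its canonical key."""
    return {canonical_key(k): v for k, v in event.items()}


print(normalize({"service_name": "checkout", "duration_ms": 42}))
```

Doing this once in a shared SDK means every team queries the same field name without having to agree on it first.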
Charity: Right? All of these developers are writing code.
It's not their job to make good observability. It's their job to solve problems for customers, right?
And to the extent that you can make it easy and magical and automatic for them to get the observability, then you can open their eyes to how much better that makes them at their jobs.
But it's really asking a lot to ask anyone to do two jobs, to occupy two or three or more, you know, positions on a team.
If you have a team of 5,000 developers, right?
Jordan: Yeah. Especially because we really value that "you build it, you run it" kind of DevOps philosophy.
Charity: I love that.
Jordan: But you also don't want to make people feel dumb when they're working with particular systems, and that's something I've experienced a lot in the past: I'm working with something new.
I'm trying to query data. I'm trying to get it out of this weird new system that I'm working in, and it makes me feel so dumb.
I'm going to be like, "I don't know how to get any of this information."
Charity: Because you're not in it every day.
Liz: So now would be the time for you to introduce yourself.
Jordan: Sure. I am Jordan Simonovski. I am an engineer at Atlassian.
So I work in the observability team, and we're shipping some corny stuff at work, which will probably get talked about one day once it's all done.
Liz: So tell us a little about the journey of the observability team at Atlassian right?
Like, are you building your own stuff?
Are you kind of interacting to piece things together kind of, how did that model come together?
Jordan: So as part of having engineering tenets in the team, we do want to rely on people that build stuff--
People whose job it is to build out these really good kind of solutions and tools for things, and for us to kind of piece it together.
I think it's what one of Charity's recent blogs was saying: I don't want to be scooping buckets of water out of the ship.
I kind of want to make sure that ship is going where it needs to.
And so focusing on the one thing, which is providing a better experience for our devs, is not something where I think we would necessarily say yes, we want to build our own TSDBs and things like that.
We kind of want to leverage the things that are out there already.
Charity: Right. I'm so glad that this is a sea change that has finally taken over most of the industry.
Like, so glad. I remember even five years ago it was still very much up for debate.
And I think that now, like, the build versus buy--
Liz: Right? Like people were like, you know, I'm the next Uber I'm the next Google I have my own TSDB.
Charity: Don't just, we love our technology, you know?
And, like, I've told this story before, but I fought it tooth and nail when they wanted to go outsource mail to Gmail.
I was like, I ran Postfix. I ran ClamAV antivirus.
I ran, you know, Mailman. I trained Bayesian spam filters.
I could tail -f my own, like, Cyrus IMAP and grep through my own mail spool.
I miss those days. All right.
But I have come around to the view that what I want is for my business to succeed, which means that is not where my time is best spent. But I think that for all of us, especially when we've done it really well, we know that there is a bit of a loss when you give up that control. You lose some control over your destiny.
You have to learn to fit yourself into the mold. You have to become like everyone else as a user. Right?
And like, I feel like it's dishonest to not admit that, you know, yeah. There was a loss there. Yeah.
You lose some, you know, fidelity of your personal dream of, you know, being the best at this.
But like, if that's your personal dream, you should go and work for one of those vendors who is solving that as a category problem for the world.
That's what I came around to a hundred percent.
Liz: So that kind of explains the story of how Atlassian came to think about the problem of observability data as, like, a force multiplier, rather than as a "we want to build our own and do the undifferentiated heavy lifting."
What about coming to approach the problem of observability as something that was maybe distinct from monitoring, distinct from the previous solutions?
Kind of how did that evolution happen?
Jordan: I feel like observability as a concept has a whole bunch of different definitions. It's kind of like DevOps almost, where no one has a concrete definition of what it is.
For us, it's the ability to understand or ask new questions of our systems based on what we put in.
So based on the telemetry we're kind of throwing in, we want to ask any questions.
We want to infer new knowledge and things like that. So that's how we define it.
I'm sorry if that's not the official Atlassian definition for those listening.
Charity: You notice that I'm controlling myself and I'm not saying a word I'm here to hear about your experience, Jordan.
Jordan: Things have been changing and I think as Atlassian has been building out a pretty big SRE capability we've been taking on a lot of the stuff from the community.
A lot of the stuff that kind of Google SRE has worked on in the past as well.
So observability has kind of moved away from just, "oh I need to look at this when it's broken" to "how do I use this to constantly make decisions about what I work on today?"
Charity: How could I be in a constant conversation with the code that I'm shipping every day that my users are using every day.
I love that. It's beautiful.
Jordan: Yeah. And we've kind of built out a whole bunch of really good SLO stuff internally.
Charity: Tell us more. Talk to us about SLOs.
Like how did you, what sold you guys on them and where did you start?
Cause I think a lot of people are sold that, yeah this is the future, but I'm over here now. How do I get there?
Jordan: I can't say for sure what it is.
I honestly, haven't been around the org long enough to know why that particular decision was made to double down on SLOs.
I do know that we have huge reliability requirements from customers and things like that.
And so as part of that, we've had to kind of rethink reliability, moving away from this kind of binary way of thinking about it, as in "you're either reliable or you're not," and moving towards a more mature way of thinking about it.
Right. And that's something SLOs have given us.
Liz: Yeah. I love this idea that site reliability engineering is driving demand for service level objectives.
It's driving the desire to be in conversation with code and proactively manage reliability instead of being reactive.
Jordan: Yeah. I think I listened to your discussion with Jacob as well recently.
And you did kind of talk about not going all in on SLOs and thinking about them as the only way to make decisions, or treating them as the only metric.
But I do think they're a really good way of starting conversations about reliability, and having that conversation alone is probably enough, rather than kind of saying, "oh, we're reliable enough depending on this one metric."
Liz: Yeah. It's interesting dialogue of getting on the same page about what is reliable enough and then how do we manage to that?
Jordan: Yeah. It's an interesting place, but it's been good.
And compared with previous orgs that I've worked in, this is a lot more mature as a way of thinking.
Charity: Yeah. How do you sell SLOs to engineers? Because I love that you have the whole "you build it, you run it," you know, that whole ownership ethos. How do you sell it to engineers?
Because they have to invest some work in them, some learning about what those are, and to shift, you know, I assume, their whole alerting apparatus over to the new style.
How do you approach them, and how do you convince them that this is a good idea?
Jordan: It's been a big shift, particularly for a lot of our product engineers who haven't really thought about things in this way.
I do like, though, the mindset that things like SLOs, continuous delivery, how we do canaries, et cetera, are all just ways for us to test in production and understand things in production, and kind of moving--
Telling developers things in that way instead of saying, oh, this is an operational task.
It's like, no, it's not an operational task.
It is. But at the same time you test your code before it goes out.
But what you don't test is your systems and how, I think, emergent failures kind of happen in systems, because that's not something you can really test for with a unit test or an integration test.
Charity: Totally. I love that. Thinking of SLOs as a way of testing in production.
Like I, I don't think I had ever heard that said before. That's totally true.
It's a way of giving yourself a budget and allowing yourself to experiment within those parameters.
Jordan: Yeah, yeah, for sure.
Liz: And also to have that rich safety rail to know that if you start deviating from those parameters that's a sign that there's something wrong
Charity: You know, how much is too much and that's a real gift.
Jordan: Yeah. You can kind of stop testing once you've burned through that error budget that you do have.
Charity: Envelope pushed. I will stop pushing now.
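The "budget" in this exchange can be written down as simple arithmetic. The Python sketch below assumes a request-based availability SLO; the function name, the 99.9% target, and the traffic numbers are made up for illustration and are not Atlassian's.

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent).

    slo_target is the fraction of requests that must succeed, e.g. 0.999.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total <= 0:
        raise ValueError("total must be positive")
    allowed_failures = (1.0 - slo_target) * total
    return 1.0 - failed / allowed_failures


# Hypothetical window: 1,000,000 requests, 250 failures, 99.9% target.
remaining = error_budget_remaining(0.999, total=1_000_000, failed=250)
print(f"budget remaining: {remaining:.0%}")

# While budget remains, there is room to keep experimenting in production.
can_keep_experimenting = remaining > 0
```

With a 99.9% target over a million requests, roughly a thousand failures are allowed, so 250 failures leaves about three quarters of the budget. Once the remaining fraction reaches zero, that is the signal to stop pushing.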
Liz: Right? And hopefully you take those lessons and learn them.
So I guess that brings us to the next question of how do you kind of bring together the measurement and the service level objective and the practice of observability in terms of the data for analysis.
Like, are those two things different at Atlassian or are they the same?
Jordan: I would say they're more or less the same. The issue I think we can run into is that sometimes things can get bucketed into this kind of too-hard basket in terms of how we measure SLOs.
And that's something that I think we kind of need to solve for in observability.
So how do we make it--
Particularly data processing is a lot harder to measure with SLOs than availability.
Liz: Yeah. But definitely for, like, request/response stuff, it feels like people are starting to find the best practices.
Jordan: Yeah. And there are a lot more robust conversations around it now than previously, which has been great.
Liz: So when computing with SLOs, are they driven based off of like request logs?
Are they driven based off of tracing?
Like how do you think about that kind of mapping of SLOs to your telemetry data?
Jordan: We kind of mapped them based on a lot of things.
Like we do have our trace data available.
Like, if developers do want to query that for their SLOs, they can.
They can kind of pull stuff out of logs if they really want to.
So they can pull these metrics out of any kind of store that we use, so to say, and kind of feed that into their SLOs.
I wouldn't say we necessarily have one way of, "yes, this is the only way you can measure your SLOs," but it's just making that interface available for the devs to kind of feed them into our metrics provider and go from there.
So does that answer the question?
Liz: Yeah, I think it does. So it basically it is, there is not a kind of standardization of that.
So you are standardizing upon, you must have SLOs, you must keep SLOs, but here are three ways that you can measure it and here's a variety of ways you can debug it.
Jordan: Yeah and the standardization of that stuff is coming.
It is stuff that we're currently working on and we do think is important to have. It's just these things take time.
Charity: Standardization of what stuff?
Jordan: So in terms of building out our new stuff, we're kind of going all in on OpenTelemetry, and we're saying, yes, this is the way of measuring our systems and things like that moving forward.
So we're shipping everything using OpenTelemetry and we're standardizing on top of that.
What we kind of do as well is we build kind of like our own little SDK on top of that.
And we say, yes, this will make it easier to do, I guess, as you would say, observability-driven development.
Liz: Yeah. We think about those things in the OpenTelemetry world as, like, people's distros, right?
Whether it be a distro from a vendor or a distro from a large company that wants to standardize: these are the fields, these are the kinds of automatic instrumentation, right?
Like when you get more things out of people's way, right?
Like, OpenTelemetry stops being this, like, you know, pile of widgets you can assemble any way.
Right. And it starts being a little bit more opinionated.
Jordan: Yeah, of course.
And there are, I think, Atlassian-specific things that we maybe want to do, but for things which are more generally applicable, we do have devs working on contributing those back as well, which has been great.
Liz: Yeah. That's super, super exciting. Yeah.
But you were making a transition kind of away from logging right? In the course of doing that, right.
Like when you're telling people you're going to use OpenTelemetry, you're going to start using tracing data. Right.
That's OpenTelemetry style. What is that replacing?
Jordan: It is replacing, I think, a heavier reliance on logging, and that's something I think happens everywhere, where even when debugging an app locally a dev will just kind of go console.log and kind of--
Charity: I love the sentence that you have.
Why are devs so caught up on debugging solely with logs? Talk about that.
Jordan: It's just something that, yeah, I've kind of noticed everywhere, and tracing kind of exists as this fancy new technology that I don't really know anything about.
And so they kind of default to just using logging to actually debug stuff.
Charity: So is this just, like, the natural evolution of the fact that everyone starts with, like, printf and console.log and, like, this is just what we learned?
It's like, we learned to leave notes for ourselves while we're developing code. Right.
We leave these little notes for ourselves sprinkled in anywhere and everywhere.
That makes almost no sense to anyone else who's coming along after trying to understand this code.
And it's so randomly emitted too.
Like, if you accidentally put something inside a loop, you're like "X value is count, blah, blah, blah, blah, blah," and you're spamming yourself.
And then you take down your entire logs cluster. Right. And there's been no real rhyme--
There's just been no method to this madness, which is why I feel like the shift to observability-- Like, at its core, I would say that observability is based fundamentally on the arbitrarily wide structured data blob, one per request, per hop, per service, where you just pack all that context in, but there's only one, right.
You can get wider, not deeper. Right. Cause I mean, there are so many reasons it's cheaper.
It's, you know, it's effectively free to just stash more key value pairs on this blob but it allows you to keep all that context together, oriented around the user's experience. Right.
Liz: Hm. But I would say that to Jordan's devs, right.
Jordan's devs are used to this idea that printf is free. Right.
Maybe it's cognitively free to them to put printf in their code.
Charity: Right. Yeah. But it's not, because every time you emit one you're creating a network hop and, you know, you're creating a TCP/IP connection.
You know, the handshake is incredibly expensive at scale, and all of those things are disconnected unless you attach, you know, a request ID or a trace ID to each of them. Then you can get them to a place where you can reconstitute it after the fact.
But that's a lot of extra work when in fact if you just kept that as one blob, then all of a sudden you can cross correlate and you go, oh, oh the requests that were an error that looked like this also had that, that, and that, and it's just it blows people's minds.
Like it's like a sea change. Like it's no longer a log. It's a unit of telemetry.
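The contrast Charity draws, many scattered print statements versus one wide blob per request, can be sketched in a few lines of Python. This is only an illustration of the idea, not the OpenTelemetry API or any vendor's SDK. The `WideEvent` class, the field names, and the hard-coded trace ID are all hypothetical.

```python
import json
import time


class WideEvent:
    """One structured blob per request per service: grow it wider, never deeper."""

    def __init__(self, **fields):
        self.fields = dict(fields)
        self._start = time.monotonic()

    def add(self, **fields):
        # Stash more key/value pairs on the same blob instead of printing a line.
        self.fields.update(fields)

    def emit(self) -> str:
        self.fields["duration_ms"] = round((time.monotonic() - self._start) * 1000, 3)
        line = json.dumps(self.fields, sort_keys=True)
        print(line)  # emitted exactly once, at the end of the request
        return line


def handle_request(user_id, cart_items):
    # trace_id is hard-coded here; a real system would propagate it between services.
    event = WideEvent(service_name="checkout", trace_id="abc123", user_id=user_id)
    event.add(cart_size=len(cart_items))
    try:
        event.add(order_total=sum(cart_items), error=False)
    except TypeError as exc:
        event.add(error=True, error_type=type(exc).__name__)
    return event.emit()


handle_request("u-42", [5, 7])
```

Because every attribute of the request lands on the same record, questions like "which failing requests also had a large cart?" become a filter over one event stream rather than a join across log lines.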
Jordan: Yeah. And I think that for me kind of working with our new tooling has been a big kind of eye-opener.
And you're kind of moving away from other vendor SDKs as well by doing that, which still kind of depend on you doing the little bits and pieces all over the place and then reconstituting the data.
Liz: Right. You just have to correlate things into one wide event.
And I think that "just" is doing a lot of work.
Because previous to OpenTelemetry, it was hard to kind of aggregate all these things into one span and collect all that context together. Right.
You were forced to kind of emit multiple printfs.
Charity: Yeah, but it does make kind of a nice bridge, because once you've done that work to aggregate the data, then you can use grep.
You can use all of the command line stuff that you're used to using with logs.
If you just get used to like stashing that value into the blob, instead of outputting that value to the console--
It's such a small thing, but it's such a huge thing.
Jordan: Yeah, for sure. And the next step is kind of making it easier to do that or making it part of the dev flow.
Charity: Right. And this is what I think observability teams at large companies like your own are so well positioned to do right.
You sit between 5,000 engineers and the tool.
And that means that you can look for opportunities to, like, reduce, reuse, recycle, you know, to do things just once, to normalize, to make things automatic, make things magical, to make things feel the same.
And I think an underappreciated thing here is making it so that when you switch from team to team, you still feel oriented. Like, you still know what's going on, you know where to look, you know what the values mean.
You know where in the tool to go look. You know, I feel like this is this huge tax that so many companies have.
It's like, it feels like you're changing jobs when you shift from one team to the other.
And so people stay on that team for way too long.
And I feel like you should only really stay on a team for like two or three years.
And then it's good for the organization to move people around. It's good for you to move around.
It keeps your brain like fresh and learning new things.
But companies haven't invested in this dev tooling layer, which is really, really difficult.
Jordan: Yeah. And kind of looking at what Atlassian has built out in that space with our deployment tooling, we have, I think, more or less entirely standardized on the deployment process for devs, or the deployment tooling for devs.
So in a way it does make it a lot easier for them to move around apart from teams wanting to use different libraries for whatever reason.
And even if you're not moving around like you have to debug for the entire system, right?
Like you don't just get to debug your little service you need to hop around and be able to comprehend the entire stack.
Liz: I worry that-- Like, what Jordan is describing here is that the problem has been solved for CI/CD, right?
Like everyone uses the same tooling.
What I'm not hearing is that people have yet fully migrated from, you know, using printf debugging or using, like, a certain vendor's framework, to "we are using OpenTelemetry, we are using attributes."
Right. Like I, I don't hear that yet. Like what are the kinds of blockers?
What are the barriers that you're seeing?
Jordan: I think one barrier is probably, I'm still kind of young.
And for me to kind of come into the older engineers in the org and say, well, well, you've been doing this wrong the whole time.
Let me tell you how to do this a bit better is a hard discussion to have.
And we do have older engineers in the team too who try to push this forward.
Charity: But it's not that it's about, we're constantly learning and shifting as an industry.
Like we're all learning new things from time to time.
It's not that we've done something wrong in the past. It's that, holy moly, like, best practices have evolved.
Jordan: Yeah. And we are kind of pushing for that.
Like the stuff that we're building out now.
Yes, we do kind of leverage OpenTelemetry as much as possible.
And that's the direction that we want to head towards and sell the developers on.
But the stuff at the moment is just really kind of telling them, well you don't actually need to do all this.
We can make this a lot easier for you.
But the issues you kind of run into are, "oh but this is something I'm most comfortable with."
And so there's a big kind of learning piece there.
I think that we probably should invest in a bit more.
Liz: Yeah, and I think part of my role as an OpenTelemetry developer is to make it feel as natural as possible to your developers.
So that it's as easy as possible, rather than being kind of this high bar.
"You must be this good at tracing to use OpenTelemetry." We're not there yet. We will be.
Jordan: Yeah. It's all happening but not there yet, but yeah, it's in progress.
Liz: What sort of timeframe are you looking at for this?
Like, are you looking at doing this in the next six months, 12 months, or is it going to be a transition that takes several years to accomplish, to get people off of logs and into more structured events?
Jordan: I can probably say it will take a while longer than 12 months.
We're onboarding teams kind of one at a time.
And what we're trying to do as best we can is to kind of get their feedback on kind of what we've built so far.
And as we bring them on and we say, cool you're doing everything this new way. Is it useful to you yet?
And these are the kinds of conversations we want to have with them and say, yes.
Like how much better can you understand your systems?
And based on that feedback, we go, we go back, we fix things up and then we come back and kind of do the same thing.
Liz: I love that. Not trying to roll out to your entire org at once, but instead thinking about how you iterate with individual teams and get each team strongly bought in.
I think in a lot of ways that goes back to that first question that we asked you, which was about how you introduce new concepts to an org, right?
Like it's that kind of champion perspective except that your internal team's selling to your other internal teams.
Jordan: Yeah. That's exactly how it's working
Liz: What are some of the kind of benefits that you've seen?
You mentioned that CD is ubiquitous at Atlassian, that everyone does it.
Everyone does it the same way, like kind of--
How often are you shipping, and how has that kind of accelerated your--
Jordan: How often are we shipping in terms of team-wise?
Liz: Yeah. How often does each team ship these days?
Jordan: Oh, I would probably say easily, multiple times a day.
We have a lot of teams shipping multiple times a day.
Some, it really depends on a per team basis.
Most products, I would say we do ship multiple times a day or at least once a day.
And we still do have I think regulatory concerns through the org.
So we do have to make sure that yes things are being done properly but that's all kind of automated.
And we build that into our, our actual process.
So you shift it all left. You make it easier to do the thing that you're meant to do instead of running around getting permission to do a whole bunch of things.
Liz: No one likes having to ask for permission to push the button.
No one likes having their change rolled back because it got bundled in with five other changes.
This is something that, like, Charity has been on our case about recently, because we've started to cross that threshold of, like, individual builds now bundling more than one commit.
That is a little scary to us because we're used to being able to move fast.
Jordan: Yeah, exactly. I don't think that there are many things moving too slowly.
I think infrastructure wise or platform tooling will always be a bit slower, but product teams.
Yeah. They're shipping all the time.
Liz: And do you think of the observability team-- do you think of yourselves as a platform team or as a product team when it comes to that?
Jordan: I don't know how to answer this, mostly because I don't know if I've had that discussion with our head of observability until now. I would like to see ourselves as a platform team.
It doesn't matter really what's under the hood but we do want to give devs an easy interface or an easy way to work with our observability tooling.
And that's the stuff we're actually investing in building out as a platform.
I guess if you will. Yeah. Not so much a product.
I think we probably have done a bit of product centric things in the past, but it really doesn't work when you just kind of build out a product and you say here you go, here's this thing, good luck.
We kind of want to do a bit of the extra heavy lifting for them and make it easier to work with, with the products I guess under the hood.
Liz: Have developers kind of had any hesitation about adoption? Like, "oh my God, if I write the wrong thing and it has a tag, it might blow up my expenditure," right? Or "it might cause the system to get slow."
Like, has that been a barrier to adoption?
Or is that something that you've been able to kind of change expectations and say you know what, it's okay to have high cardinality now?
Jordan: I don't think the developers themselves are necessarily worried about high cardinality.
It's always been my team, kind of handling all this infrastructure, who's like, "oh, this is going to get expensive real fast."
But for the developers, that's something they actually want.
They want the high cardinality. They want to be able to kind of tie particular events to something that happened and work out exactly what it was, based on the high cardinality data that they kind of ship.
I wouldn't really say it's been a concern for them, but yeah, we want to make sure it doesn't become one moving forward either.
Liz: Yeah. And I guess that kind of makes sense, right?
That the high cardinality data is often the concepts that developers want to relate to. Right.
Like user IDs, right? Like, you know, which country, which user agent. Those are things that are familiar to developers that they want to be able to deal in.
Jordan: Yeah. And that's kind of what they've relied on doing at the moment. With some of the tracing stuff we do, like if we are sampling things, developers won't always be too happy with it.
So let's say support's working on a particular case or something, and a user is complaining about something breaking on their end. To be able to trace exactly what's happening to that one user is something developers really want.
And when you kind of start sampling, it gets a bit harder. "I want to see everything" is the kind of mindset that we have at the moment, which is great, but also expensive.
Liz: Yeah. That problem is definitely a super interesting one.
We've been thinking about it a lot recently at Honeycomb.
And it's this thing where everyone wants to have, you know, the cheapest possible storage and also the highest fidelity data at the same time, right?
Like, you don't get there without hard work. The way that you get there has to involve dynamic sampling.
It has to involve being able to turn up the precision for that one user.
Right. Like, and say, like, go do it again. Right.
Jordan: Yeah. And that's something we're keeping in mind.
I think with everything that we're building out, that's something we want to do.
That's something we want to say yes to.
Assume safe defaults, kind of sample our users by default.
But if they want more, or if they're in an incident or whatever it is, we want them to be able to turn it up themselves, more of like a self-service thing on our end. Yeah.
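The "sample by default, turn it up for one user" idea can be sketched as a sampler with a default rate and per-user overrides. This Python example is an assumption-laden illustration, not how Atlassian or Honeycomb implement it. The `Sampler` class, the 1-in-10 default, and the user IDs are invented for the example.

```python
import random


class Sampler:
    """Keep 1 in N events by default, with self-service per-user overrides."""

    def __init__(self, default_rate: int = 10):
        self.default_rate = default_rate  # keep roughly 1 in default_rate events
        self.overrides = {}  # user_id -> sample rate

    def turn_up(self, user_id: str, rate: int = 1) -> None:
        """Raise fidelity for one user, e.g. during an incident or support case."""
        self.overrides[user_id] = rate

    def should_keep(self, user_id: str, rng=random):
        """Return (keep, rate); the rate lets kept events be re-weighted later."""
        rate = self.overrides.get(user_id, self.default_rate)
        return rng.random() < 1.0 / rate, rate


sampler = Sampler(default_rate=10)
sampler.turn_up("user-under-investigation")  # rate 1 means keep everything
print(sampler.should_keep("user-under-investigation"))
```

Recording the sample rate alongside each kept event is the design choice that matters here: it allows counts to be scaled back up at query time, so cheaper storage does not have to mean wrong totals.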
Liz: So thank you very much for joining us, Jordan.
It was a pleasure talking to you and I definitely learned a lot.
Thank you so much for sharing your insights today.
Jordan: That's alright. Thanks for having me.
Charity: It's really good to chat with you.