In episode 40 of o11ycast, Liz and Charity speak with Nick Herring of CCP Games. They discuss the challenges faced by engineers in the game space, the debugging procedure at CCP, and tactics for adopting large system changes.
Nick Herring is Technical Director of Infrastructure for EVE Online (CCP Games).
Transcript
Nick Herring: In a blind rage, I think, is where that all started.
Charity Majors: Yes.
Nick: Ultimately, we went down the path of, we were introducing new technology constantly, and we became the piñata for anything that would go wrong, rightfully so, because we were the ones making the biggest changes.
But one of the problems that we ran into was we hit cardinality almost immediately, and we started being unable to answer any questions that we had in detail about what was going wrong in certain places.
And then we started looking around specifically for cardinality problems, specifically for tracing, and I think with those key words, I wound up either on Twitter somewhere or a Google search result of some kind and actually landed on Honeycomb itself.
Charity: That's amazing.
Nick: The specific call-out of cardinality is what really caught my eye immediately, because it was literally the problem that we had and nobody else was talking about that specifically.
Everyone's always just like, "APM, and metrics, and this, and that."
And I was like, "Yeah, but there are these specific problems that no one is talking about how they're solving."
And we found you guys, and it was literally like a North Star that we discovered at that period in time.
Charity: Back in the days when I was responsible for marketing, I was like, "High cardinality is the best marketing term ever."
And this means we got two people that way.
Liz Fong-Jones: So what did those pain points look like?
What was it that was like, "Oh my God, we can't live like this anymore."
Nick: So, EVE Online was basically built originally with database engineers and network hardware specialists.
So they had Cat 5 and a bucket of SQL, and that's basically how they made the game.
And so, as those things evolved over time, and how they dealt with all of those problems, we got to the point of like, "Okay, we have roughly 7,000-8,000 procs at any given moment that we have to deal with. How do we dig into those? Where are those problems coming from?"
And the bigger piece was that we were starting to introduce newer technologies, from the perspective of third-party APIs, into the game, and we had to trace those all the way from the outside to the inside and back again, so that it was very clear as to what was happening in those pieces.
And so, the original goal during that point in time was just simply, was this proc called from an action that happened outside the cluster, or was it called naturally from inside the cluster?
And then, if it was from outside the cluster, was it from the third party APIs or was it from our desktop client?
And that was the only real important question because the first most time-sensitive question was which door do we close and how do we find that in the quickest way?
Liz: Right.
Charity: And there's no pattern matching that you can do.
You can't go, "Oh, it's always this one. Shut it down first." It could pop up anywhere.
Nick: At one point in time, it was literally the open the gate, close the gate meme, where we were just like, "Turn off all the third-party APIs. Okay, turn on some of them. Binary search all of our end points."
Charity: Binary search. That just gave me an unpleasant twitch up my spine.
Liz: So now's a good time for you to introduce yourself.
Nick: So my name is Nick Herring.
I'm the technical director of infrastructure for EVE Online.
Basically, I started with CCP, it'll probably be eight years ago pretty soon, and started building up a team to modernize the stack of EVE Online.
And we've been doing that slowly but surely. What I describe to people when we're interviewing candidates or something like that is, I basically tell them we're jumping generations of methodologies and technologies.
We're basically going from metal and data centers to containers and Kubernetes. That's how far we're jumping.
Liz: And therefore there's no guide to it, right? This is not a path that many other people have tried, except for maybe some of the enterprises who were like, "Oh no. We need to get with the cloud native thing. Better adopt all the cloud native stuff."
Nick: Yeah. I mean, thankfully we have a bit more agility in that regard as far as what we can tackle, but it is the same basic problem space of, do you do this from the inside out?
Do you do it from the outside in? Where do you start?
Charity: How do you make a decision about what to adopt, and what not to adopt, and what to adopt first?
Nick: GitHub stars. Just kidding.
That's a really good question. So it's very much a judgment call.
It's one of those things where I only realize this when I'm talking to some of our engineers who want to get into more senior roles, and they come to me with questions about things like, "How did you select the Protobuf? How did you select gRPC?"
I'm like, "Well, probably a month of losing my mind of like, is this the right decision? How do we move forward with these types of things?"
And it boils down to trying to get a tracer round through the entire system that makes sense for what you need to do.
And do the paradigms and idioms in that technology reflect what you're trying to accomplish? And that just takes a while.
Charity: It sounds like you took a very observability first approach then.
Nick: For technology selection, to a degree, I guess observing from the perspective of how do people naturally think about this, which is also problematic because you also have the competing idea of a bit of the Stockholm syndrome of the existing developer base within the company.
And those two things clash almost immediately.
I think from a very generic perspective, one of the first problems we had to solve when modernizing all of this was we need a ubiquitous language.
And part of that was, okay, we're going to learn from almost 20 years of, I'm going to say, dynamically typed mental models, and why that becomes problematic over time, and how we start addressing those pieces and scoping those parts.
So when we got into starting to change all of this, we knew, back in the early days of Zipkin, and Jaeger, and all of those things, we were like, "Oh, those are awesome ideas. We're nowhere near that yet, but we know we're going to need that at some point in time."
Charity: Yeah.
Liz: And then you waited, and then the technology stack matured, and also, you had a keener sense, it sounds like, of what the business problem you're trying to solve was.
Nick: Yeah. So a concrete example of this is we have the unique situation of we have to tread incredibly lightly when we introduce these technologies because this is a Python stack that's been fine tuned over 18 years now.
EVE Online just had a birthday recently.
And when you're doing that, you kind of have to shed some capabilities when you're integrating these pieces.
For example, when we started implementing distributed tracing, you have that concept of, "Oh, this context gets passed along everywhere."
I'm like, "Oh, that thing's going to be massive if we pass that around everywhere," and so we just threw it away.
Because ultimately for us, we wanted to see where things were going and how long they were taking. We didn't necessarily need all of the context of transfer across boundaries, or subsystems, or whatever the case may be. We just needed to know the start and end points and how many and how quick.
So that's kind of where we are right now, so we just threw that context out the window.
And of course, there's tons of engineers internally who are just like, "No, that's not what the spec says."
And we're just like, "Well, yeah. But we're not getting any value out of it yet."
And we need to understand where those boundaries lie, because for us, serialization is incredibly expensive.
Charity: Interesting.
Liz: Yeah. You can't afford to add that much latency when you're adding this tracing, so you have to throw away the baggage and just focus on, "We need to pass the trace ID and span ID only."
Nick: Exactly. Yeah.
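As an illustration of the trimmed-down propagation Nick describes, here is a minimal Python sketch. The envelope format and function names are hypothetical, not CCP's actual wire format: only the trace ID and span ID cross the process boundary, and every other attribute stays on the locally emitted span.

```python
import os

def new_ids():
    """W3C-style hex IDs: 16-byte trace ID, 8-byte span ID."""
    return os.urandom(16).hex(), os.urandom(8).hex()

def emit_local_span(trace_id, span_id, parent_id, name, **fields):
    # Stand-in for whatever exporter ships spans to the tracing backend.
    # The rich attributes live here, locally, and never go on the wire.
    print({"trace_id": trace_id, "span_id": span_id, "parent_id": parent_id,
           "name": name, **fields})

def outgoing_call(payload, trace_id, parent_span_id):
    """Attach only the two IDs to the outgoing message.

    Baggage and other attributes stay on the locally emitted span, so the
    per-message overhead is a couple of dozen bytes instead of a fully
    serialized context.
    """
    span_id = os.urandom(8).hex()
    emit_local_span(trace_id, span_id, parent_span_id,
                    name="outgoing_call", payload_size=len(str(payload)))
    return {
        "payload": payload,
        "trace_id": trace_id,       # ties the remote work to the same trace
        "parent_span_id": span_id,  # lets the callee parent itself correctly
    }

trace_id, root_span_id = new_ids()
envelope = outgoing_call({"action": "lock_target"}, trace_id, root_span_id)
```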
Charity: That's so fascinating, because it runs counter to most of the advice that we give people, which is you need that context.
It's going to be so helpful for you when you're trying to reproduce bugs, when you're trying to understand what's going on, when you're-- But it just goes to show that there's an exception to every rule.
Nick: The only reason that we were able to achieve that is because we didn't--
It feels like there's two mentalities there, as far as distributed tracing is concerned.
There's this mentality of, "Okay, what's going through the wire, and how do you maintain the context through that wire?"
But on the flip side of that, if you have tooling that is good enough, you can still rely on humans to build some of that context.
So when we're looking at traces-- And this is where I assume all of the questions come from: "I want to query anything in the trace, but I want to have a clause where it compares against the root trace."
Liz: Ah, yes. The HAVING clause. Yes. We hear a lot about that.
Nick: Yes. And I think this is one of the most extreme reasons of that, is because we don't pass along that context everywhere and most of our context comes from what does that first trace say?
Because going back to why we were adding this, we wanted to know how it got in the cluster.
And the only thing that knows that is the first span, but there's no reason to transmit that anywhere else.
Charity: Yeah. So yeah, that makes sense. Right?
Actually, I think it is in line with our best practices, that you don't have to pass all of your fields around as baggage if you have good cross span querying capabilities.
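A toy illustration of that "compare against the root" idea, assuming spans are plain dicts rather than any particular backend's format: instead of copying the entry-point attribute onto every span as baggage, you join each span to its trace's root at query time.

```python
# Hypothetical spans: only the root (parent_id is None) knows how the
# request entered the cluster.
spans = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "entry": "third_party_api", "name": "http.request", "dur_ms": 40},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "name": "proc.call", "dur_ms": 35},
    {"trace_id": "t2", "span_id": "c", "parent_id": None, "entry": "desktop_client", "name": "http.request", "dur_ms": 12},
    {"trace_id": "t2", "span_id": "d", "parent_id": "c", "name": "proc.call", "dur_ms": 9},
]

# Index the root span of every trace.
roots = {s["trace_id"]: s for s in spans if s["parent_id"] is None}

# "Show me slow proc calls, but only in traces that entered via the third-party APIs."
slow_external_procs = [
    s for s in spans
    if s["name"] == "proc.call"
    and s["dur_ms"] > 20
    and roots[s["trace_id"]].get("entry") == "third_party_api"
]
print(slow_external_procs)  # -> only the proc.call from trace t1
```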
Nick: Yeah. I mean, that was just one of the decisions that we made because we also have to make sure that we don't bounce off our own ecosystem, right?
Because there's a bad time of, "Oh, we integrated this," and then it just performs so badly.
We have other tools that we have integrations with, and they give us runtime telemetry of certain things, but we run the risk of locking the node when we turn it on.
Charity: Right. Of course. The old GDB solution.
Nick: Yeah, exactly. Exactly.
But that's not necessarily the case. So when we introduced the distributed tracing and we turned it on, we kind of didn't really say anything to anybody.
We were like, "Look at this graph we made."They're like, "How is this performing?" I'm like, "You tell me."
Charity: Right.
Liz: It is so magical, that first moment where you show someone a trace and they're like, "Whoa, I didn't know we had that."
Charity: Yeah. All of those hours that they spent digging through logs, trying to reconstruct what happened, and suddenly, bing, there it is.
Nick: Yeah. I think in a recent conversation me and Liz had, I described it as the way we get adoption internally is that you don't tell people to use this or try to do that.
In a problem space at a point in time, you show up with the shiniest firetruck.
And that's kind of how we presented this.
Charity: That was going to be my next question.
Because whenever you've got a lot of people whose livelihoods are invested in a solution, they've literally built it brick by brick, how do you motivate them?
How do you make them willing to, eager to adopt this sort of change?
And it sounds like, I love that, you show up with the shiniest firetruck.
Nick: I mean, because at the end of the day, you can talk about the tool all you want and it doesn't matter unless you start using it in front of people.
It's kind of the power of the first time you have pair programming with any kind of senior, and you're like, "What's this?"
Or they're too senior and you're like, "Where's your IDE?"
Liz: It's one of those really cool things, right?
I think this really hammers home this idea that Charity has had about bringing everyone up to the level of your best debugger.
Nick: Yeah. So that's another tricky spot, right?
Because when we started digging into this, whenever we started adding any of this tooling or any technology, it then became the questions of, "Okay, well, how are you guys doing this now?"
And then when we get the answer to that, we're just like, "Oh man." Okay. How many people-
Charity: You've been extracting your teeth using that?
Nick: Yeah. So, that's the other tricky part, right?
Is because whenever that happened, it always funneled to one person.
Charity: Of course it did. Of course it did.
Liz: The one tracing wizard or the one Splunk wizard.
Nick: Yeah. Splunk wizard is the one. Yeah.
Charity: The original person to solve the problem gets stuck solving the problem forever. Yeah.
Liz: And the other funny joke there, right, is that players of EVE Online, myself included, we make these jokes about, CCP said the logs show nothing, right?
Right? Because you cannot log everything. You just cannot.
Charity: I have that sticker that says, "Log everything? Fuck you."
Nick: I think that's the other thing that we're getting to, is that I think a lot of people are realizing that we can get to a place where we can track all of those things, because EVE Online has been built over the top.
Again, the hammer was SQL, right? So we have a ton of internal systems that have--
So we have internally, things that we call EVE metrics.
And inside of EVE metrics is basically all these procs that run on a daily basis and that aggregate all this information.
And the most painful thing I saw when I first joined CCP is like, "Okay, we released today. Great. It didn't explode. Okay. But how's it performing?"
And they're like, "Well, we won't know until tomorrow." I'm like, "I'm sorry. What?"
And so, that's kind of when we first were like, "Guys, okay. First of all, we need to introduce some actual, real time metrics. Not necessarily observability, but just metrics. I just need to know what's happening now."
Liz: Right. If you can't even measure your known unknowns, then you have no business understanding your unknown unknowns.
Nick: And well, it's also interesting because, like the previous point, there's still those heroes that they have, and it's part of this hero culture that's kind of reinforced by this tool chain.
But it was also exciting to see us introduce metrics providers and then see people just, there's an explosion of dashboards, and they're just looking at how people are digging into things and looking at things.
It's also always great when-- It's like you're watching an old Western poker match, where people start throwing down cards, except it's links to different graphs in Grafana, or Honeycomb, or whatever.
"No, it's not that. It's this."
Charity: I love that phrase. I'm writing it down. The tool chain that reinforces hero culture.
That's so true. And that's something that, ever since the beginning--
I have worked on precisely three teams now that did not have that, where it wasn't the person who had been there the longest who was the best debugger, but the person who had the most curiosity and persistence, even if they had joined fairly recently.
And those were the teams that were using Scuba and Honeycomb because they unlock that.
They reward your questioning, and your exploration, and your curiosity instead of just forcing you to stub your toe and find all the edges with your flesh, and then build up your corroding layer of dashboards upon dashboards that just need to be excavated, and nobody understands them but you.
And I think that that's so important for the human side of the cloud native, next generation of stuff is you just build tooling that rewards curiosity, and that doesn't-- The tool chain should not reinforce hero culture.
Nick: And that's tricky, right? Because-
Charity: So hard.
Nick: It's tricky because it creates a blind spot for the management layer that they never know about.
And it's tricky because sometimes it creates an identity for that individual that they have a hard time reconciling.
Charity: I love being the wizard who is just like, "I know what it is. It doesn't say that anywhere in the graph, but I know it is."
I love being that person.
Nick: Everyone does.
Liz: You love doing it until you can never take a vacation, until people are calling you at 3:00 AM.
Charity: Yeah, no.
And this is why I want to bang my drum a little bit about the necessity for ops tooling, dev tooling, to grow up a little, and use real designers, and get out of the whole VIM mode where you have to memorize all of the magic commands to make it do what you want it to do.
But the tooling should really get out of your way and just enable you to do your job.
You shouldn't have to be learning how to use the tool every time you use it at the same time as you're trying to solve your problems.
It should be much more intuitive, and much more natural, and much more-- It should treat its users like they're humans, not engineers.
Nick: That's what I tell anybody that asks me, "How can we get you guys to switch from Honeycomb?" "Do you have BubbleUp?"
Liz: So how is the adoption process when people have, at this point, five different tools they could look at?
What does that flow chart look like for you, in terms of where people start their debugging process?
And then, how you expect that to evolve over the next year?
Nick: So we're getting more and more into the realm of metric based alerts that get people's attention, and then we dive more into what patterns we see in our reporting for things like Sentry, for example.
All of our crash reporting goes there as well, and being able to see those patterns is pretty huge.
It at least highlights where we should be looking.
And then the evolution of that is, when we get into the forensics part, that's kind of when we get into the Honeycomb space of what's going on.
So I would say that the journey either starts from a Reddit post, that's the worst case scenario, or it starts from a Grafana dashboard or an alert that's coming from Prometheus Alertmanager or something like that.
Charity: Are you guys doing SLOs at all?
Nick: So, we are. We're basically piloting that concept in Prometheus right now, which is a little bit of extra work for sure.
And we've been looking a lot at the SLO stuff in Honeycomb as well, but we need to get the culture in place to take advantage of those before we get the Ferrari out.
Charity: What does that mean?
Nick: So it means that the problem right now is we're completely reactionary.
There's no, "Oh, let's check the metrics in the morning," culture.
It's a, "Oh, there's an alert going off. Let's deal with that."
It's not a, "Well, okay. How is our services running today?"
Charity: Is there a culture of, "Oh, I just shipped some code. I should go look at it."
Nick: Yeah. There's definitely that. So that's difficult to discuss, because there's two camps there.
We have all the new tech coming in, and basically everybody's in Go and Kubernetes, and we have teams that are adopting that and coming on board with that, and they automatically get Sentry, Honeycomb, and Prometheus integration, and all of those come out of the box so they can see those things happening.
They can deploy those on their own accord, whereas the war machine that is EVE Online lurches forward every month or so with a massive update.
Liz: Patch Tuesday.
Nick: Patch Tuesday.
And we've also, this is the other weird part, but it's not weird to anybody that knows about EVE Online, is that we've conditioned everybody that 11:00 GMT is when the server goes down.
That's just when that happens.
And so, there's a coordinated focal point for those deployments, where it's like, "Hey, do you have something going out with the war machine? Great. Okay. Make sure you have eyes on those things."
Liz: And it's conveniently timed in the middle of your work day so you can go and look at it, but it's batched with 100 other changes.
It's not, you ship one commit out at a time.
Nick: Yeah. And that was the other part of the frustration, was that we would introduce Protobuf messages into a Python code base.
So problem number one is a dynamically typed mindset versus a strongly typed mindset.
Protobuf, even in Python, will tell you to fuck off if you do that wrong.
That's just how that works.
And so, we can't detect it if something within the monolith's core mechanisms starts trying to feed in a different data type or different information, and Python won't tell you ahead of time, either.
And even with testing in place, which is just ultimately a glorified compiler at that point in time, without an entire test suite of onlining an entire cluster and any kind of behavior test against that, which we're working towards, that gets really hard to find.
And so, ultimately, there's a deployment, and then we have alerts in Sentry where, "Hey, this thing started screaming into the void," and that's when we start kicking into, "Okay, well what is this angry about?"
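For readers outside the Python/Protobuf world, this is the kind of "tell you to fuck off" Nick means. The sketch below uses a stock message from the protobuf runtime rather than anything from EVE's actual schemas: the generated classes reject a wrong-typed field at assignment time, whereas a plain Python dict accepts it and fails much later, somewhere else.

```python
# pip install protobuf
from google.protobuf.timestamp_pb2 import Timestamp

ts = Timestamp()
ts.seconds = 1700000000          # fine: the field is an int64

try:
    ts.seconds = "tomorrow-ish"  # wrong type: rejected immediately
except TypeError as exc:
    print(f"protobuf said no: {exc}")

# The dynamically typed equivalent fails only when someone finally reads it.
legacy = {"seconds": "tomorrow-ish"}   # silently accepted
# ... much later, deep in the monolith:
# datetime.fromtimestamp(legacy["seconds"])  -> TypeError at a distance
```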
Liz: Yeah. That makes sense that your workflow right now is very much tooled around this idea that things will likely massively break at 11:00 AM and you need to very quickly identify which team is responsible, as opposed to the new world where teams own their own stuff, it is abundantly clear which service is down, and the team already knows.
Nick: Yeah. And can engage with that at their own speed and with their own operations as well.
Part of this is another piece of the other cultural change that we're trying to achieve is teasing apart the two words deployment and release.
Charity: I have a sticker to this effect as well.
Nick: This is the latest one we're trying to get a lot of people's heads wrapped around.
It's really difficult when you have a server that goes offline everyday at 11, and then the universe literally changes.
And so, we're getting more into experimentation and A/B Bayesian testing, that kind of stuff.
Charity: Testing and production, baby.
Nick: Yes.
So it's interesting, because my dream scenario is something I heard forever ago at, I think it was, GOTO Con in Amsterdam or something.
Somebody was talking about, I think it was some bank or point-of-sale provider, and they had everything running in production, obviously, and it was global.
But their globe wasn't one-to-one because they made this one little island that was robot island.
And I would love to get EVE Online to that place, where we just have these really angry robots shooting each other in the face constantly to where we can understand what works, what doesn't, and get that information ahead of time.
Charity: Sort of progressive deployment model.
Nick: Yeah. And so, it's also interesting because even if we were to snap our fingers and change all of the technology instantly, we still have these interesting game design problem spaces where we can't do rolling updates through the universe because EVE players are notorious for min-maxing.
Liz: And doing things like, I expect to remain online until downtime.
That is a feature of the universe.
In fact, it has become this almost-bug that people now rely on as a feature.
Nick: You run the risk of people also manipulating markets, and when we roll out, like, "Oh, this part of the universe got this new piece, and this part of the universe doesn't have it yet. How do we reconcile that? What does that look like?"
Charity: Right. Interesting.
Liz: Yep. The joys of single shard universes.
Which, these are things that I think Charity and I know very well because, Charity, you ran Linden Lab, I worked on Puzzle Pirates.
And this idea of, you're cramming several thousand people onto one universe.
It's not two distinct shards that you can A/B test.
Charity: People, man. They're the highest cardinality problem there is.
Liz: Which I think brings us to our next topic, which is player experience.
That is a thing that game studios care a lot about in a way that other domains don't necessarily.
What's unique about this problem, in your view?
Nick: So for us, we have all of the advantages.
We have all of the weird advantages.
So obviously, single shard, not an advantage in most of these cases.
However, the fact that the EVE universe ticks along at one Hertz is quite the advantage.
We have a whole 1,000 milliseconds to worry about things before something needs to be corrected, or back-filled, or back-pressured, or whatever the case may be.
We obviously want to challenge that going into the future, but it is one advantage that we have.
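As a back-of-the-envelope sketch of what a 1 Hz tick budget means in practice, here is a generic fixed-timestep loop in Python. It is not CCP's actual server code, just an illustration of the "whole 1,000 milliseconds" of slack per tick.

```python
import time

TICK_SECONDS = 1.0  # the whole universe advances once per second

def run_tick_loop(simulate_one_tick):
    while True:
        started = time.monotonic()
        simulate_one_tick()                  # movement, combat, markets, ...
        spent = time.monotonic() - started
        slack = TICK_SECONDS - spent         # budget left for corrections,
        if slack > 0:                        # back-fill, back pressure, etc.
            time.sleep(slack)
        else:
            # The bad case: the tick itself took longer than a second,
            # which players experience as lag or time dilation.
            print(f"tick overran its budget by {-slack * 1000:.0f} ms")
```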
But it's interesting, when we hooked up distributed tracing, we started watching what those things look like.
And at those extremes, it actually does affect our player base.
And these were myths, or like me and Liz were talking about earlier, where do you draw the line between somebody complaining about something and it actually becoming a problem?
And we're getting to the place where we can actually instrument that and see what that is.
One of those examples is in EVE Online, when you're traveling through the universe, it's very important that you can lock targets and how fast you can lock targets, because that's a round trip problem.
"Okay. I've locked this target. Now I can fire at that target."
You can't do that out of order. You can't queue that up and tell the server, "As soon as it's locked, start firing."
And so, for players in Australia, for example, their latency actually makes that a problem, because they're getting to that boundary, that edge.
So traveling for them through the universe becomes an issue, or the opposite, they can't interdict anybody who's traveling through the universe either.
Liz: Whereas the people who are located in London have a much easier time camping people and instantly locking in, firing, and destroying someone's ship.
So, yeah, player experiences really are myriad, even if you have the server ticking along at one Hertz.
Nick: Yeah. So it's just, we find people that are at these edges of these boundaries, but then also, we find when we instrumented EVE with Honeycomb, we found fascinating code paths all over the place.
We found just things bouncing around like, "Oh, it goes from client, to proxy, to server, to proxy, to server, to client. Wait, what? Why is this doing this?"
Liz: It's that tangled web that you don't see if you just have the flat surface map.
The flat service map says A is connected to B, and that's fine, but seeing the stuff bouncing back and forth, that's where you go, "WTF?"
Nick: I mean, we also discovered super interesting things like, going back to lock on, when you lock onto a ship, that takes time based on skills, ship, size of the ship, modules that you have.
There's a lot of gameplay factors involved in that.
But then we found out when you make that request and how long it takes to respond, that time is the gameplay logic.
So it's not that your client says, "Hey, I'd like to lock onto this. Let me know when it's locked on."
It goes, "I'd like to lock onto this," and it won't tell you the answer, depending on how many skills you have, loaded modules, ship size, all these other things.
We're like, "Why are these..." Because we started looking for outliers.
Every time we found an outlier, there's tons of them.
We're like, "Add target. Why is add target an outlier?"
Liz: Or I think the recent thing where there were complaints from people asking, "Why is it that my client has all these outstanding calls to request bounties?"
And just all of these weird things, where CCP wound up disabling this functionality because it was causing too many fleet performance problems.
Nick: Yeah. That one... That one's a tricky one.
That's a by-product of too much information being exposed.
Some theory crafting being half correct.
But yeah, in the end there's a lot of weird-- It's interesting what happens when you give the player base certain bits of information and how they react to that.
Charity: Yeah. For sure. You're surfacing some of what you can see to them.
I remember dealing with this so much at Parse, because we basically surfaced MongoDB to mobile developers through mobile APIs.
And so, they'd be gleefully constructing a query using C-sharp or whatever and they had no ability to see that, well, it actually compiled to do five full table scans.
Especially when you're dealing with developers as customers, you have to surface enough of the mental model so that they can make good decisions or bad things just happen.
Nick: Well, so that's a huge part as well.
Tech is a majority of our player base.
They're either working in tech or that's their hobby. It's unbelievable.
And so, even when we expose things like the third party APIs, it's very similar to what you were talking about, where we have to be careful around how people use those, and we have to watch how people are using those to make sure that they're not being abused.
I mean, there's another great example of people using our third party APIs, where our player base invented a galactic metal detector.
And we couldn't figure out how they were finding their enemies so quickly. And it was-
Liz: Oh, I remember this.
Nick: It was just by virtue of an API returning a None versus an empty list.
Charity: Oh, God. Thank you, types.
Nick: None basically meant nothing was in that part of the universe.
An empty list meant there was somebody there who's not giving you access to something, which means they're an enemy.
Go blow them up. And people were building Death Stars, and people were showing up before they would complete them because they were using the third party APIs.
And we're like, "How?"
Charity: That's amazing.
Liz: Yeah.
And I think it's one of those interesting things where, as you say, the third party API really was a catalyst for, we can no longer measure this simply by looking at aggregate metrics with client performance.
You need to be able to break things down by API key, by endpoint, by version.
Nick: Well, so that was another big cultural shift, a mental model shift as well, where up until that point, everyone was assuming we had one client in the universe and that just wasn't the case.
There's multiple ingress points into the universe.
What that manifests as is completely-- The older APIs had this glorious bug of whenever you use them, it would just eject your ship out into space.
If you weren't logged into the client and you tried to use the APIs, it'd be like, "Oh, you're online. Cool. Here's space," but you were not connected to anything.
And then, people were wondering why they were losing stuff in space.
Charity: Fascinating. What is up next for you guys?
Nick: As far as tech?
Charity: As far as transformation goes.
Are you 1% through this journey? Are you 50%? What are the challenges that you see coming up next?
And how long until, maybe we should take bets on this, how long until you spin down the last bare metal?
Nick: Ah, yeah. Okay.
So at this point, there's a presentation that we gave in EVE Vegas in, I think, 2019 that kind of lays out how we're going about doing this and what that means for everything.
It's ultimately, one of our near-term goals is we'd like to get to the point where certain elements of the UI will just turn off if the service isn't available, as opposed to, "Everyone off the server now."
Because that's how the monolith works at the moment in worst-case scenarios.
We're at the point now where we're on the cusp of a really big turning point, where we would have a closed loop that would basically tie off all of the legacy systems, in the sense that we've been introducing gRPC connectivity to the desktop client, and we've had it in the server for quite some time in that ecosystem.
But we're getting to the point now where we can provide the developers with the same routing mechanisms that the server was able to do, except for now, it's Erlang doing it, and we don't have to worry about Python going, "You get a packet. You get a packet. You get a packet."
Liz: So it's death to the old homebrew networking layer, essentially.
Nick: Ultimately.
So the original design of the homebrew network is a guaranteed one-hop mesh network, which, quadratic problems are fun.
And that was the first ceiling that we saw when we started looking at this problem.
We're like, "Oh boy. I understand what the company wants to do and try to get more people concurrency those kinds of things, but guys, there's a very low ceiling to that, that we need to deal with very quickly."
And that's kind of what kick started all of this, when multiple people in the company were doing napkin math and going, "Ah, guys. This doesn't work out."
Liz: Right. Exactly.
I think that goes to the business problem that you're trying to solve overall, which is not just development velocity, but it sounds like EVE has this marketing machine that is built on the idea of, we will give you the biggest real fights, with real clashes and real concurrent players, ever.
And yet, if the backend can't handle it, then that becomes a problem over time.
Nick: Yeah. And that's part of the interesting bits, because one of the things that we want to start using distributed tracing for is, with all the new technology that we're introducing, there's a 30% bottleneck that we could possibly remove if we offload serialization, transmission, and multiplexing to a separate thread.
So we're basically threading off those three pieces. And when that happens, we need to be able to watch what the simulation traffic is doing, because EVE's internal simulation engine is called Destiny, so everything's very deterministic and you basically replay it on the client, and that's how everybody winds up in the same place, so on and so forth.
And we want to swap that out with Protobuf messages over gRPC connections to a message bus that, when you're in a fleet fight, instead of poor Python trying to tell 6,000 other people that one person got shot in the face, we just tell Erlang, "Go do your job," and hopefully that makes everything go much, much faster.
That's one of the things that we're tinkering with right now, but kind of gives you an idea of where we're at with the capabilities of the ecosystem.
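A rough sketch of that "thread off serialization and transmission" idea, assuming a hypothetical send path: the simulation loop only hands raw state to a queue, and a worker thread does the expensive encode-and-send work off the simulation's critical path. JSON stands in here for Protobuf encoding, and the transmit callback for the gRPC/message-bus send.

```python
import json
import queue
import threading

outbox: "queue.Queue[dict]" = queue.Queue()

def simulation_tick(state: dict):
    # The simulation thread only enqueues; it never pays for serialization.
    outbox.put(state)

def sender_loop(transmit):
    # Serialization, transmission, and multiplexing live on this thread.
    while True:
        state = outbox.get()
        wire_bytes = json.dumps(state).encode()  # stand-in for Protobuf encoding
        transmit(wire_bytes)                     # stand-in for the actual send
        outbox.task_done()

threading.Thread(target=sender_loop, args=(lambda b: None,), daemon=True).start()
simulation_tick({"tick": 1, "ships": 6000})
outbox.join()
```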
Liz: And then there's the journey towards measuring that and making sure it's working correctly, and if not, chasing the outliers down.
Nick: Right. The interesting part there is because it's deterministic, we're planning on instrumenting it so that we can see the outputs, like how do they travel?
And we can do that in parallel. Because ultimately, we can deploy it.
It's kind of a canary deployment, but we would send all the traffic and we would make sure that the end result is the same on both sides.
Then we could confirm that the outputs would be the same, and then we could basically switch that over.
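Because the simulation is deterministic, the comparison Nick sketches can be as simple as feeding the same inputs through both paths in shadow and diffing the outputs. The function names below are hypothetical stand-ins for the legacy path and the new Protobuf-over-gRPC path.

```python
def compare_paths(inputs, legacy_path, new_path):
    """Run every deterministic input through both implementations in shadow
    mode and report any divergence before cutting traffic over."""
    mismatches = []
    for event in inputs:
        expected = legacy_path(event)   # current output
        candidate = new_path(event)     # new transport's output
        if expected != candidate:
            mismatches.append((event, expected, candidate))
    return mismatches

# Toy usage with stand-in implementations.
legacy = lambda e: {"pos": e["pos"], "hp": e["hp"] - e["dmg"]}
shiny = lambda e: {"pos": e["pos"], "hp": e["hp"] - e["dmg"]}
diffs = compare_paths([{"pos": (0, 0), "hp": 100, "dmg": 25}], legacy, shiny)
print("safe to switch over" if not diffs else diffs)
```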
Charity: Throughout all these transformations, have you lost anybody?
Has anyone just been like, "This isn't for me. I was comfortable doing the old ways and I'm going to leave and find another job that is more to my liking."
Or is it the kind of thing where you get people over the hump and then they're much happier there?
Nick: Honestly, the only part where somebody has been deterred by this was when we started introducing on-call requirements.
Charity: Yeah. Tell us about that.
Nick: It's like, you want to get all of this freedom and this ability to do this? Great. You need to maintain it.
Charity: With great power comes responsibility for your own code.
Nick: And to be fair, it wasn't the developer that had an issue with this, to be clear.
So the answer to your question, no, we haven't really lost anybody on this.
But that was probably the most friction that we had.
And that was more about the company understanding what that meant and the friction that came around that.
Charity: Yeah. If you're going to ask developers to be on call, part of the handshake is you have to make sure it doesn't suck.
It shouldn't like take over your life.
I believe that it's reasonable to ask anyone who works in a 24/7 highly available service to be woken up once or twice a year for their code.
But once or twice every on-call rotation? That's too much.
Nick: Yeah. It's been a learning experience for us because we use that to build--
So we ended up calling them strike teams, which is hilarious because people now say they're going on strike.
So these teams, they're basically the on-call rotations.
But what we did was we've been basically experimenting with this idea that if you're on call, you're already high context switching, so you might as well just be in high context switching mode and shield the rest of the team.
Liz: Yep. Funneling all of the toil to one person at a time.
Such a good pattern. Such a good pattern.
Charity: You shouldn't be responsible for any product project work being delivered if you're on call.
It should just be your job to be curious, go fix things, whatever bothers you, whatever seems to break.
People could actually look forward to the on-call rotation if it's kind of liberating.
It's almost like a freebie week. Do whatever the fuck you want, as long as it's making things better.
It can be something that people actually look forward to.
Nick: And as somebody trying to focus the amount of surface area for the technology that we use, it's a double-edged sword. "Look what I made."
"Oh no. No, no, no. Let's not add that to the mix just yet. I don't disagree with you, but we have a lot of things to roll up here."
But I think in general, to step away from the on-call part and get more into the migration of these things, it's been super fascinating, because--
I quote this conversation because it was the most-- I really wish I recorded it.
But I had a conversation with one of our long-term, I'm not going to say UI engineers.
She gets a lot of stuff done in the UI realm.
She works on the-- There's a sequence of changes in EVE Online called the little things, it's quality of life improvements and a lot of stuff where she basically just laser beams on things that she knows is super annoying for our player base and then just gets them out of the way.
And sometimes, it's a domino that she pushes and becomes an entire system.
I think ghost fitting, for example, was a byproduct of this, where you could actually simulate building a ship and what that would look like as opposed to needing to go buy the ship and actually buy all the parts.
Anyway, having a conversation with her about, "Okay, now that we have this different mentality for how we consume and ask for data," because they've all been conditioned to make the single largest request that they possibly can and get as much data in that one response as possible--
That they don't even consider the idea of latent loading, or lazy loading, or any of these other concepts, where the idea is to cache the smallest piece that you possibly can and do that as fast as you possibly can.
And the aggregate of that is a much better experience, as opposed to: I click this window, and the client goes, "Hey," and then it pops up.
It changes the mentality of when do you get the data?
How long do you wait for the UI to pop? What's the responsiveness of all of these pieces?
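A small sketch of that shift in mentality, with hypothetical endpoints and stand-in fetch functions: one enormous request that the window waits on, versus small cached requests that each resolve as fast as they can.

```python
from functools import lru_cache

# Old habit: one enormous request, the window blocks on all of it.
def open_window_old(fetch_everything):
    everything = fetch_everything()          # single huge round trip
    return render(everything)

# New habit: ask for the smallest pieces, cache them, fill in as they arrive.
@lru_cache(maxsize=4096)
def get_item_details(item_id: int):
    return fetch_item(item_id)               # tiny request, cached afterwards

def open_window_new(visible_item_ids):
    # The window pops immediately; rows fill in as each small call returns.
    return [get_item_details(i) for i in visible_item_ids]

# Stand-ins so the sketch is self-contained.
def render(data): return data
def fetch_item(item_id): return {"id": item_id}
def fetch_everything(): return [{"id": i} for i in range(10_000)]

print(len(open_window_old(fetch_everything)), len(open_window_new(tuple(range(20)))))
```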
And so, I think watching people go through that and watching her get to a point where she was like, "Oh, and then we could do this."
And then as soon as she said that phrase, "And then we could do this," it was like a runaway train.
It was like, "Oh, then we could change this. We can change that to this."
And there's tons of points in time. That was just the first one that I experienced during this.
But we've also had that happen with a lot of other developers where it's been an interesting back and forth of establishing boundaries of what it is the work that we're trying to accomplish in the change.
For example, I work with our technical director for all of EVE Online, and he's working with the teams with me on this, and he calls it out when they say things like, "Would you be okay with," pointing that in my direction, because I'm the one delivering the stack that they're working on.
"Would you be okay with this?"
And he's like, "No, no, no. That's not how this works. Are you going to be responsible for that decision? Are you okay with that? He's just giving you his experience and advice on why that might not be a good idea."
And some of those are really hard conversations to have, and they go back and forth constantly.
But then there's that one point in time where they realize, "Oh, I can take this and run with it."
Liz: And that ownership mentality is so, so key to unlocking so many other things.
Charity: Yeah. For engineers, we got into this profession because we like solving problems, because we like understanding things.
And sometimes, that's just been beaten out of people and it needs to be gently reintroduced. But once it clicks, it can be so liberating.
Nick: I agree with you there, except for, in my experience through this so far, it's been, they want to make sure that the problem that they're solving succeeds, but in the environment that they've been in, they have to make sure everything around them will succeed.
And so, I have to have the repeated conversation, "How do I do an end to end test?"
"You don't."
"Why not?"
"Well, first of all, you can't possibly keep all of that in your head. Second of all, you're not responsible after this point in time."
And getting them to let go of that piece and focus on the part that they're really good at, it's like, "Listen guys, if that message doesn't get there, that's my fault. That's on me. I need to go check message bus paths and all sorts of other things. That's for me to figure out, not for you."
Charity: What a great conversation. Thank you so much for joining us, Nick.
Nick: Yeah. No problem.
Liz: Yeah. Thank you. This was a pleasure.