In episode 21 of O11ycast, Charity and Liz speak with Jessica Kerr. Together they explore complex learning systems, how to view outages as opportunities, and the power of feedback loops.
Liz Fong-Jones: So tell us about symmathesy. How did you become interested in it? What's the definition of it? What even is it?
Jessica Kerr: Symmathesy is a learning system made of learning parts and I found it in this book by Nora Bateson, who's an anthropologist, and she has grown up with systems thinking because her father was Gregory Bateson, who was one of the huge systems thinkers of last century.
And Nora talks about how the word "system" has become too mechanical, because there are a lot of systems, like machines, that are mechanical, but living systems are not.
An ecosystem or the economy or your team-- these are all made up of parts that learn and because of that the whole system learns.
Russell Ackoff said, "A system is not the sum of its parts, that's an aggregate, it's a product of their interactions."
But symmathesy takes this further because the parts are each a product of all their past interactions. So we are a product of all the teams and families and communities we've been a part of, ever.
Charity Majors: Which is why it's so hard to predict the future. Well, this seems like a good time for you to introduce yourself.
Jessica: I'm Jessica Kerr, online known as Jessitron, but my friends call me Jess.
Charity: So excited that you're here today. You're talking about this and I've been thinking so much lately about how we build high performing teams, because people tend to approach this by hiring or, like, looking straight at the team.
How do we get the best people? How do we get the best engineers?
But that's like, when you think about the sum total of things that enable an individual to ship software quickly and reliably on a team, like, I feel like maybe 10% of it is what is in your brain.
Like the algorithms, the data structures are like, they are necessary but not sufficient skill because like so much of it is like, what are the defaults?
What is all the software that's been written and built to help to get you to this point? All the libraries, all of the deploy scripts, all of the expectation.
Jessica: Yeah, see. It's not just the people either.
Charity: It's not just the people, it is expectations, it is the pressures, it is the scar tissue that you have from a wrong in the past. It's all these things. Like, yes, we should be thinking about how to build high performing teams, but there's so much that goes into it because, like you said, tools create who we are.
Jessica: Yeah, and you think that we create high performing teams, but actually, high performing teams, they create us. Exactly.
Charity: They do, this is the thing, yes a high performing team creates good engineers. That is the number one thing that it does.
Liz: It's kind of the "seeing the light thing," right?
Like if you've never seen a high performing team, then you don't know where to start but if you have someone who has seen a high performing team, they can kind of help reproduce those practices within your organization.
Charity: Yes but not exactly, you can't just take it as a recipe you have to like adapt it.
Jessica: Yeah, copy the questions, not the answers.
Charity: I like that.
Liz: So what are some of those questions for you?
Jessica: The question is always, how could we be doing this better?
Today, I've been reflecting on personal responsibility and how it's a cop out. It's easier to say I take responsibility, this is about me, it's my fault and I'm going to do better next time and not that it's easy but that's easier than working together to effect system change.
Liz: I find that so interesting because if you don't know my personal life, my personal life revolves around getting a bunch of nerds online to coordinate with each other, right?
The complete context, whether it's EVE Online or World of Warcraft, and it's kind of interesting to observe those systems dynamics, right? Things like psychological safety, that someone feels safe saying, "Hey, yes, I screwed up, I'm the one who wiped the raid, right."
As well as kind of the broader things of how do we make it easier for people to know where to stand, right?
Like those are kind of interesting things that also relate to how we develop software engineering teams too how do we make it safe to talk about how an outage happened safe to make sure that we can ship software safely in the future.
Jessica: Yeah, you should have known that about how not to delete the RAID array along with all the other information in the entire world that was hypothetically available to you at that time.
Liz: Yeah, that goes a lot to the kind of Allspaw school of thinking, right?
Like, talking about how hindsight is 20/20, right? Like you have to assume that people had the best intentions at the time and worked with the information they knew.
Jessica: Right, whereas it's so easy to create a story in hindsight of how this was obvious.
I had a great story the other day, my partner Abdi was talking about how in the old slightly random role playing games, there's a nerdy word for that and I forget it I'm not a good enough nerd, this is fine the role playing games that were just a little random.
And like he's typing in the terminal and you look around there's various objects in the room and he's like, I don't know what to do I kicked the sink, a bunch of snakes pop out of the drain and you die.
Liz: NetHack, this is NetHack yes.
Jessica: Yes, thank you, it's NetHack, and that's like satisfying, that's fun, because you never would have predicted it but it makes sense in retrospect.
Liz: And you learn something from it.
Charity: This is why I love outages. I love outages because they're a break in the routine. It's the time when everyone stops and looks and inspects: what are we doing?
Does this still make sense? What has changed? What's new?
Like, it's an opportunity for us to like, all of the things that we take for granted and we never think about, why is it this way?
Should it be this way? Because you can't operate throughout your life if you're stopping every second to reevaluate.
Jessica: Exactly, you can't work together to effect system change on everything.
Charity: You can't, and it's a perfect focal point. It draws your attention to a moment, which is an opportunity for you to, like, take everything in again and re-ask yourselves the fundamental questions of why are we here?
Charity: Are we doing the right thing?
Liz: At the same time though, it shouldn't necessarily take a full on outage for us to do that examination right, we can do that continuously.
Charity: It's just that we put it off, and we're fucking busy.
And so honestly, working with non-engineering teams has been so enlightening it really draws my attention to all of the work that's been done in this field to retrain our brains to see these outages as opportunities.
Because in the business side, they don't yet, they're terrified, they don't, they think they want to press, "Oh, it's my fault?"
And it's like, no, this is not how we talk here like, it's not about your fault, this is any reasonable person in your shoes would have done that so how can we talk about how to do this differently as a system?
And like that disconnect, that jarring, like, has just made me so aware of the work that's been done and like how much benefit everyone could, like--
VP sales and marketing, they're traditionally kind of at each other's throats because they know that their heads are on the chopping block if we fuck up. How do you retrain people to be--
Jessica: Accountable to tell the story.
Charity: Yes, absolutely, you're in charge that means that, you're accountable, but it's not about fucking up.
It's like how can we get the information out so that we can all see it and apply our brains and do better.
Liz: And learn from it. I think that learning element is so crucial, because recently our colleague Danielle published this article about why AIOps is bullshit, and it's kind of, if your machines are taking the learning away from you, then you aren't learning.
Jessica: Right yeah, now if it can say, "Hey, I noticed something interesting here." And then we can apply our amazing power of telling a causal story about that in hindsight.
Charity: Yes, any machine can detect a blip up down, whatever only humans can apply meaning to it and honestly, I feel like systems are blipping all the time like we cannot alert, it seems very simple, right?
Just page, if there's a problem but in fact, things are failing all the time.
Jessica: Right and which of them are a problem.
Charity: Which of them are a problem? Which of them were intentional? Which are actually good signs? Which are just noise? Which are--
Liz: Right, like just a release going out, right releases are inherently blips.
Charity: It's a change, it's a big change which is why I think the AIOps is doomed for the foreseeable future because like, it's all about training a system to detect anomalies in the system as it is today and every time you ship something, you've changed all.
Jessica: Yes, because our software is part of our symmathesy. It's constantly learning from us because we change it, and then we learn from it, because we're actually doing the Ops.
Charity: So like Richard Cook, the famous like socio-technical diagram where he's got there's the tools in the middle, there's the artifacts at the bottom and there's the people and the meaning at the top, and if you look at the systems that we have--
So you've got your team, you've got your production system and you've got your tools that mediate between the two, and okay.
So imagine like you're on my team or whatever and I'm like, I'm going to write a cron job: whenever you write a test that fails, I'm going to make it page you and your manager, and it's going to send an email to the entire company, "Jess just broke something."
Like, how's that going to impact your willingness to take risks?
Jessica: I'm going to write a whole lot of tests that fail.
Charity: Of course you are.
Jessica: Then be like, aha! I had an effect on the world.
Charity: The emotional consequences of the tooling that we write like, I don't believe that you can be a senior engineer until you've spent enough time in production.
Otherwise I don't care how many data structures and algorithms you know, your intuition will have been trained on something that isn't real and that means that I can't trust it.
Liz: The challenge comes in defining what "in production" means, right? Like "in production" doesn't necessarily mean on call on the weekend.
It just means having some real world exposure to the consequences of your--.
Charity: Software engineers do not spend enough time looking at their code in production, just watching it run just asking them, like I just wrote this blog post about ODD, Observability-Driven Development because like TDD stops at the border of your laptop, right?
Charity: Like it mocks out everything that's interesting.
Charity: Which is, you needed to do that, it's useful.
Jessica: And that gets back to personal responsibility because you personally can see a change in your personal world that you have complete control over.
Charity: It's very satisfying, very satisfying, very good for the ego.
Liz: It's not just responsibility, it's autonomy, like it's autonomy and the ability to effect power, yes.
Jessica: Yeah exactly I mean--
Charity: So many engineers have the power to change their code, but they aren't getting--
Jessica: Are more ambitious than that.
Liz: Let's talk about that ambition, right? Like you started playing around with observability in your own code recently, kind of what motivated you to do that?
And kind of, what have you seen so far?
Jessica: Well, we're starting an app, Abdi and I are starting an app for our personal--
We want it to be in the world and because we enjoy developing, the elevator pitch is okay so there's lots of people that you know and you'd love to catch up with, or that you don't know yet and you'd love to have a Zoom chat with them at some point, but you hate calendaring.
This app is like, okay, send this invite to your friends, some of them will sign up for a time slot I mean, for MVP, it's just going to be one time slot, okay, who wants to chit chat with me at 7:30 on a random Tuesday?
Charity: Sounds great.
Jessica: And then everybody who accepts you create a relationship and then randomly some of those relationships get calendar invites on Monday, that's the concept.
And so we started a Rails app because Abdi is an expert in Rails, on Heroku because that's the easiest.
And then Abdi got the systems tests running, because he wanted to see change before we make, actually implement a feature.
And I was like, I'm not doing anything until I can see what's happening in production, like I want to notice when I hit the site, I want to see that happening.
Jessica: I know that when, I mean, the first thing we did of course was deploy something and see that the world was changed because now this Heroku app URL shows something and it didn't before.
But when I hit that, I want to see something change and I know I can get that from Honeycomb and it was so easy to hook it up with Rails.
Charity: Like closing that feedback loop of actually seeing it run in production, I feel like what we need is to create that hunger in engineers so like, to not feel like their job is done until they've seen it there.
Your job isn't done when you've merged to master, you don't move on then you wait until you look at it in production and you ask yourself, is it doing what I expected it to do?
Because we would catch 80 to 90% of all problems before users ever get to them, right?
Like you have it all in your head, you know what you're trying to do, it's the freshest it's ever going to be.
And if you're instrumenting as you write with the expectation that I'm going to need to see this, right, in prod with users hitting it, that feedback loop, it's so simple and it's so powerful. And like--
Liz: But there are prerequisites in that feedback loop that's kind of what I feel it is.
Charity: It sounds simple but there's a lot that goes into it, right?
Did you read the Stripe developer report where they showed that like 43% of engineers' time goes to bullshit?
It does not go to moving the business forward, doesn't go to anything useful it's doing the work to get to the work that you need to do, right?
And so much of that, to my mind is that they can't see what they're doing and they aren't used to being able to see what they're doing.
And so you spend a lot of time like trying to look for the thing or doing the wrong thing or doing something but not closing that simple loop of just like looking at it, making sure it's doing what you wanted it to do.
Jessica: Yeah, I think of it like science, except in the physical sciences, we're really limited in our instruments. We have to keep inventing new instruments to inspect a world that doesn't care whether we inspect it.
Jessica: Whereas we get to change the world.
Jessica: So it reports to us.
Charity: Yes, it's a little mind castle in the sky, which is why engineering has been so hard.
A lot of it has been that we've expected people to construct these sky castles in their minds and hold them there and when they have a bug or something they're literally trying to trace it in their mind.
Jessica: And the mental model.
Charity: It's hard and your mind is always out of date.
Charity: And if we can just make it more tactile, like an experience of just like getting used to using a tool for that.
Jessica: Or getting used to the part that yeah, mental models are always incomplete and out of date and what we need to be good at is building and updating and restoring those mental models, not holding them and feeling smart.
Liz: Yeah, exactly right. I think we need to plan for failure, we need to plan for people to rotate off projects and that means being able to reconstruct from first principles.
That was kind of one of my favorite things about repeatedly switching teams is that I really, really honed that ability to work on first principles debugging.
Charity: I noticed that when you came in the door at Honeycomb, like Liz came in the door and like the first day, like she was productive, she was just like, I see this, you could tell this is a skill in and of itself learning to reconstruct and act on partial information but--
Jessica: Yeah, learning to form a theory, come up with hypothesis and then test those.
Liz: It's scientific, it's being scientific about our engineering.
Charity: Now there's a concept.
Jessica: That's right, I don't want to be a software engineer I want to be a software scientist.
Liz: The other thing I wanted to kind of go back to was the decision that you made to build on Rails, hosted on Heroku, right? Like, what were some of the motivating decisions there?
Like you could have built your own Node.js server or you could have, hosted on Kubernetes, right like what made you pick kind of those two technologies?
Jessica: Okay, so a technology choice, we have this idea and the culture that suits the language to the people who were using it.
Charity: Yeah, for the most part.
Jessica: Because it's our relationship to that language I mean, yeah some languages are not suitable to certain problems, if you're going to do data science, you need the relationship with the language to the community.
Do that in Scala or Python because that's where the work is.
Charity: Right, so like I have somewhat of a counterexample.
So at Parse, we wrote the first version of the API and everything in Rails and we chose it for the reasons, basically the people reasons because it was fast and there's people who knew it and everything.
And three years in, we realized that we had to rewrite the whole fucking thing because threads, because the entire platform, a million apps, all go down in a heartbeat without threads.
Jessica: Does that belong to Lambda? We're writing the first version in Rails because we have some questions to answer about whether it's useful.
Charity: It's a great mocking language.
Liz: Yeah prototyping, right, it's absolutely prototyping.
Jessica: Yeah, it's going to be good for that because Abdi is an expert in it. If he weren't, it wouldn't be useful for that, but also--
Charity: If it was just you, what would you have chosen?
Jessica: Oh, if it was just me, probably TypeScript, just because that's the thing I'm the least bad at right now.
Switching languages a lot means I'm not spectacular at any of them anymore.
But I don't want the focus to be on learning a technology right now.
I mean, so when I say that we chose it based on us, it's not just who we are it's also who we want to be, so it's about what we want to learn.
Liz: Yes, exactly your growth areas, right? Like you don't necessarily want to learn, right like about the intricacies of how your language runtime works, like--
Jessica: Yeah that's not my growth focus.
Jessica: I want something boring for a language and I want a boring deployment.
Charity: Yeah, totally.
Jessica: So Rails to Heroku is a happy path.
Whereas what I want to learn is more about like really seeing a change in the world, I'm interested in the results of this app and whether it's useful, so I went straight to observability, before any features.
I want to learn Honeycomb that's where I'm interested in expanding and I know Honeycomb will follow us right over to Lambda if we ever implement this right.
But I know deploying Lambdas is an effing nightmare and I'm not interested in that particular pain right now.
Charity: Right, what are you hoping to learn from observability?
Jessica: I wanted when people use the app, when they hit the page at all, I want to see it.
Jessica: And of course it also teaches us about our technology for instance, yesterday, I was going through a trace and I was like, "Abdi, did you know that Rails makes 13 database calls for a single page load?"
Charity: Oh my God!
Jessica: "For a single page load?" And he was like, "Oh no, the SQL shows up in the logs."
And I'm like, one of the SQLs shows up in the logs. One of those database calls actually hit one of our tables.
The other ones are all like show time zone and set error handling and a couple of them are, the long ones are digging through the information schema.
Liz: Yeah, I had that same experience when I was helping out the folks at Dev.to/thepracticaldev with instrumenting their code and finding out, why are you sending so many spans?
Oh, that's because we really are making that many database calls, right? Like all these things get hidden away from you, and in order to understand the operational performance you actually have to surface them.
Charity: The miracles of abstraction.
Jessica: And then he was like, "is that just the first time?" So I hit the page again and I said, "No, the second time, it only makes eight." but it's just so easy to see things.
Charity: And isn't it addictive? Don't you get hooked on that dopamine hit of just like knowing what's actually happening?
Jessica: It's just like the test turned green, that green test was a dopamine hit and still is for the people who really love TDD, but changing something on my laptop isn't enough.
Charity: And what's amazing is when you start finding things that you didn't know were there that were problems, when, like, your mind goes, "Oh shit, 60% of the traffic to my MySQL system is a health check, I am DDoSing myself."
When you find those things, like, we've had this experience over and over. Our customers do, where they just, like, get Honeycomb set up, they just start clicking around just idly and find these terrifying things in their systems that they had no idea even existed.
That's what gets me just like really excited. We've all got warts, lots and lots of hairy warts.
Liz: But it's kind of interesting that for you, the thing that got you started was thinking about the user insight component, getting insight into what your users were doing with your system.
And then you found your way to the operational resilience and operations side, whereas a lot of people come to it the other way around, right? They come from an ops background and they say, "I want to understand the operational performance," and then their product development software engineers adopt the kind of observability mentality later.
Liz: That's really cool to see it working both ways.
Jessica: Right, yeah I know I'm not going to have like major operational problems for a little while because we don't have users.
It's a Hello World right now, but I can drive my own work.
Jessica: This really makes me want to implement a feature now. Now, I'm like, I really want to be able to send this to people. We need auth, okay? It's that early.
Charity: I've heard from several product managers who use Honeycomb who are just like, this is a perfect product management tool. Well, they're product managers who have engineering backgrounds.
And what I love is when you have this single source of truth, when you have business people and engineers and engineering like because tools create silos, like the edges of where a tool gets you is an edge of reality.
And if you aren't careful, you can get in a position where teams spend more time arguing with the nature of reality itself than trying to solve the same problem.
Liz: And this is where I really loved, Jess your analogy about kind of working in the same kitchen, right?
Like, are you working the same kitchen? Are you kind of having to coordinate elbow to elbow? Right.
Jessica: Right, that's joint activity.
Jessica: As opposed to let's meet at six o'clock you bring the potatoes, I'll bring the meat. That's just coordination.
Jessica: And that isn't teamwork and that's not discovery, and your potatoes and your meat might not go together.
Charity: You might have everyone bring a dessert.
Jessica: Yes, yes. Yeah.
Liz: So kind of, how can we make our tools more like kind of working in the same kitchen together than kind of just merely coordinating?
Jessica: I really liked that, Charity, about the edge of the tool is the edge of a reality, how can we be in the same world? Because that's such a problem in the real world right now.
Jessica: It's people living in different realities. How can you work together to effect system change when you don't live in the same world?
Charity: You can't even agree on the basics of what's happening.
Liz: Yeah, we have to share with our teams, we have to figure out what is the problem, we have to be able to decompose the problem, right?
Like these are all steps that need to happen in order for us to be able to productively make some progress.
Charity: And when it's done well, it's completely invisible, it's only when it breaks down that suddenly we're aware of how much goes into this.
Jessica: Right, how much ground we're standing on.
Charity: Yes, exactly.
Liz: Right, like, if it's already broken down, how do you figure out how to fix it?
Charity: We stand on the shoulders of so many giants, just like tens of thousands of engineering hours that have gone into the best practices and defaults that we take for granted today. It's really awe-inspiring to just like, think about that.
Jessica: Yeah, that's, what's cool about being human.
Charity: Yes it is, I feel like I was so fortunate in my career because my first job after dropping out of college was Linden Lab, which was this very weird quirky, but amazing like really talented engineers were thinking about things from first principles.
We were managing these really large systems before the days of Chef or Puppet.
Jessica: A high functioning team built you.
Charity: The high functioning team built me and because of that, my standard for my jobs has always been sky high like I am not satisfied, like with mediocre--
And this may be really restless, but I feel so grateful for that, because there are so many people who are so much better engineers than I am who haven't had that experience.
And they just don't the jobs that they're in. I'm like what, you're better than this, they're just like what, this is normal. And for raising my standards, I am so grateful.
And Christine and I have said so many times that if for Honeycomb, if all we ever managed to accomplish, is just raising that bar for the people who come after us. That's enough.
Jessica: That's an effect on the world.
Liz: Well, let us all hope that we managed to all have that kind of a positive effect upon the world.
I think that that's kind of what, at the end of the day, motivates a lot of engineers is can I have a positive impact in the world.
Charity: Meaning, autonomy and mastery.
Jessica: Mastery, autonomy and purpose.
Charity: Meaning purpose, same thing.
Jessica: But I think, well, yeah, you said meaning, which I think is better because it's not about me mastering something and it's not about me doing what I want on my own computer.
Jessica: I want interdependence, not autonomy.
Charity: You want ripple effects.
Liz: That shared sense of accomplishment, right? Like that shared sense of--
Jessica: That's so much better than an individual accomplishment. Oh my gosh.
Liz: There are definitely people who thrive on being lone wolves but for every lone wolf, there's like so many people who value cooperation so much more.
Charity: Even the lone wolves kind of underestimate how important it is to them it can be something that you don't cognitively think of yourself because yes, I have always been one of these people like I'm rather late to the realizing how important other people are to me.
But it was always there I was just able to take it for granted for a long time.
Liz: Or kind of the other alternative way of thinking about it, right. Is like sometimes your lone wolves are the people scouting ahead right?
Like paving the trail for other people to follow right, or to help those other people along.
Jessica: Yeah or just to give a new idea of what's possible.
Jessica: Of what could be, which Honeycomb definitely does.
Charity: Thank you I'm so excited that you finally managed to try us.
Jessica: Yeah, me too.
Charity: It is fucking lonely as hell, but there are kindred spirits out there like you.
Liz: Yeah, exactly. The first time that I saw you speak about symmathesy I was like, "Oh my God, right, like Jess is onto something." But like no one else was talking about symmathesy, and yet it clicks, right?
Like how do we popularize this? How do we spread it?
Jessica: Yeah, what's great about talking about symmathesy and how great teams make great people is how many people say "yes, I've been thinking this and now I have words to put around it."
Charity: We found that a lot with the observability, like when we started talking about the high cardinality stuff, the observability stuff.
And like a lot of people like nitpick about the definition and shit, but like, it was hard to build, but it was way harder to figure out how to talk about it.
And I remember the morning when I googled the definition of observability and I realized that it had this rich lineage in mechanical engineering and control systems, and it's about understanding what's happening inside the system just by asking questions from the outside with no prior knowledge with no like library of past outages.
With no, like, that chicken and egg thing where you have to know what you're looking for before you go and you search for it.
Jessica: You have to know where to put the print out in order to see what it says.
Charity: And like having that language, having the words to put to it has been, and so when people are like, "Yes, I've had this problem, yes, this is a problem for me too."
So language is really powerful.
Jessica: Yeah, symmathesy is the best word I've had for the people and software and tools in a software team all learning from each other.
Charity: Have you not heard of people repeating that word back to you? Like, have you encountered it in the wild yet from somebody who was like, "Jessica, let me tell you about symmathesy."
Jessica: No, they always have to ask me how to pronounce it.
Charity: That's how you really know that you've made your mark on the world. You're going to be in the line at the supermarket one day and someone's going to be like, "So have you heard of this thing called symmathesy? I really think you'd like it."
Jessica: That'd be great, yeah. I didn't coin the word that's important to me, so I am building on Nora Bateson's concept at least.
Liz: Yeah, I think we definitely see the word socio-technical crop up a lot more, but soon I hope symmathesy is also going to pop up as well 'cause it--
Jessica: Because the focus on the learning--
Liz: The learning rather than just the people--
Charity: It all goes together.
Jessica: Yeah and the tools mediate.
Charity: The tools are where you start, the tools are the lever that will change the world as Archimedes said.
Because the tools are the edge of your reality and what you're doing is you're deepening your reality with better tools.