Ep. #3, Distributed Systems with Paul Biggar of Dark
Charity Majors: Paul, have you ever been on call?
Paul Biggar: I've been on call a lot. When I started Circle, I was on call from the first customers until they finally took my pager away three years later.
Charity: Oh wow.
Paul: I started as a primary on call and then some iteration. I mean, at the start we didn't have PagerDuty, and we didn't have monitoring, which is the precursor to observability. You might have heard of it.
But you know, your Twitter blows up because your site is down and you're effectively on call regardless of how you position.
Charity: This has never happened to me.
Rachel Chalmers: So, this might be a great time for you to introduce yourself.
Paul: I'm Paul Biggar, I founded CircleCI, and more recently I founded a company called Dark.
Rachel: You and I met back in the CircleCI days. What drew you to continuous integration in the first place?
Paul: There's a couple of different answers to that, but I remember being in a room in Paul Graham's house in 2010 when I was doing Y Combinator.
Rachel: Was this at the party?
Paul: Different party. And we were spitballing. My Y Combinator start-up was stupid, and he was like, "Why don't you do compilers as a service?" And I didn't even know what that meant.
But a year, year and a half later after I'd been working at Mozilla for a while I realized that they have this big problem of like, the automated testing thing, and the release engineering, all this sort of thing. I was not exactly directly involved in it but I was a downstream user of their CI suite basically, and I spent a year thinking, "You know, if I was in charge, I would do this differently."
Then when I decided it was time to do a start-up, this had been on my mind for a year.
Rachel: Tangentially, I kind of want to do a shout-out to Mozilla as kind of the Bell Labs of our industry. There's so much amazing stuff and so many amazing people coming out of it, like Rust language, and it's just a generator of cool.
Charity: It does feel like release engineering is something that is very systematically under-invested in at pretty much every company over the size of 50 people that I can think of.
This is where faults get injected into our system, this is where chaos enters the system, and yet it's not seen as being prestigious work.
It's seen as being very laborious, it's seen as being the crap work that you do and you have to, not something that actually affects your life more than any other piece of code you can probably write.
Rachel: Well, there's been this truism on the finance side forever, that dev tools don't sell and dev tools don't grow into big venture exits.
Charity: That's why we still have Capistrano.
Rachel: It's kind of crazy. I'm not even sure it's true anymore.
Paul: I kind of think it's true. I think up until recently the secret to selling dev tools was selling infrastructure, and so those were the companies that made money, with some exceptions like GitHub, who just had all the users in the world. But I think there's--
Charity: Everybody tells you, "Sell to ops. They have budget, they have checkbooks. Devs don't."
Paul: I mean, almost every dev tool that's been successful, if it hasn't been selling infrastructure it's been selling top-down to enterprise.
Charity: That does seem to be changing.
Paul: Yeah, I think so. The number of people in the industry, who are the people who are coding in the industry, is rising at this astronomical rate.
Charity: And the tools are getting better. I remember when I left Facebook and I realized that you can now cobble together the same exact build-to-deploy pipeline using all of these smaller start-ups, almost all of which have been founded in the last five years.
Rachel: Right, that's one thing I see happening. A lot of the tools are invented inside huge companies like Mozilla, Google and Facebook.
People leave and then they do these start-ups, and suddenly you have this accessible tool chain.
Charity: Because they don't know how to live without it.
Rachel: Exactly, you get accustomed to that lifestyle.
Paul: The upside of that is obviously that you have the tool that you can use. The downside is, you now need to know all these tools, and the complexity of the industry has been exploding as a result.
Charity: It's true, and there are very few reliable narrators when it comes to how to plug them together and what you actually need, and what you don't.
Paul: Well, you obviously need to use the tool that the person on stage is telling you to use.
Charity: Well, of course.
Paul: And then some other tools as well that integrate nicely.
Rachel: You've talked a lot about accidental complexity, which I love as a phrase for describing what's even happened since you founded CircleCI. It's just skyrocketing number of variables, number of abstraction layers that people need to get their heads around now.
Do you want to talk more about that?
Paul: I actually gave a bit of a talk about this at the Honeycomb meet-up a couple months back, but basically when we started CircleCI people had a problem and that was that their Rails monoliths took a long time to test. Our product was, we take it, we parallelize it, it's great. In-between then and now, microservices happened.
Microservices have been happening for 30 years under different names and so on, but people actually started doing microservices for the first time in history, I guess. That completely changed how people tested, it completely changed what CircleCI's product is.
It also, I think, has had a complete change on the industry, even how people think about their code bases and splitting them across multiple--
Charity: And their teams, like the organizational structure. I think it's had a huge--
Rachel: And what they're responsible for. It used to be, as a sysadmin, "These hundred servers are mine. No one may touch them." And now, what is it that you own? What is it that you're measured on? How do you define success in that role?
Paul: There isn't a right answer to any of it. There's a couple of opinions.
Charity: It's dizzying.
Everybody has advice for you, but it's always what they've seen work once.
Rachel: Right. Confirmation bias, "I did it this way and I succeeded, therefore the only way to succeed is to do it this way, in spite of the 99 other people who did it that way and failed."
Paul: We have a very fashion-oriented industry.
Rachel: We do!
Paul: Whatever blog post gets the most likes is the thing that becomes best practice.
Charity: The one that actually made sense to the most people.
Rachel: Well, makes sense, or appealed to this week's aesthetic.
Charity: That's true.
Paul: Or is written by the famous person.
Rachel: Paul, to found one startup may be regarded as a misfortune, to found two smacks of carelessness. Where did you find the courage to start Dark?
Paul: Oh my God. So, this is my fourth startup.
Paul: The first two failed tragically.
Rachel: Do you need help? Is there something we can do?
Paul: So, apparently on the third one you start to get okay at it. The first one after you make a successful one, they'll give you money without really too much work.
I actually had this sort of thought, I spent a lot of the intervening time between when I left Circle and when I founded Dark thinking, "What will I do with my life?" And I had a lot of ideas that were mostly not venture-backed, that were mostly small, low-stress start-ups where you could sort of have a nice chill life but still have meaning and work, and that kind of thing.
That's not what I did, because every time I started thinking about, "How would I build those?" I realized that the tool that I wanted to build them with did not exist.
Charity: This is how Parse got started too, you know. They were going to build mobile apps and then they suddenly went, "Oh my gosh, everybody is doing all of this every time?"
It became Parse, because there's so much, just, boilerplate that you have to redo every time and it's tiresome.
Paul: Yeah. You go to a hackathon for the weekend, and at the end you've got your webpack pipeline set up.
Charity: Yes, exactly.
Paul: So, our goal with Dark is very much like, reduce background work.
Paul: The reason I talk about accidental complexity so much is, our goal is basically just putting a circle around all the accidental complexity that we can find and seeing if we can remove it in a sort of a holistic back-end package.
Charity: Tell us what Dark is.
Paul: Dark is a tool to make coding a hundred times easier. Specifically, to make back-end services easier. So, you would go to Dark, you would use our editor, you would use our infrastructure compiler. And you would use our language, the Dark language.
Because you're using all this holistic stuff, you get a lot of stuff for free and that's basically what we're doing.
Charity: How do you know if it's working?
Paul: That is a very, very good question.
Paul: We're about six months into the development of it, maybe--
Charity: I meant, how do you know if your software is working?
Paul: Oh, how do you know if your software is working? Well, Types, Charity.
Paul: One of the things that we're making sure that we do with Dark is we're not making any new things, we're just bringing them all together.
So, the things that people use today to make their software work, Types, Fuzzing, testing, continuous integration. They're all part of it.
Charity: I think of all that as being the basics, right? I'm trying to gently nudge you into mentioning observability here.
Paul: Oh, I see. So, actually Dark is really centered around the idea, or at least the concept I think, of observability. Because you're always writing in production.
Charity: Love it.
Paul: There is no separation of the code. There's no process to take the code from your laptop into production.
Charity: All of those places are so fraught with errors and things that get dropped, which is why I love it.
The best software engineers I ever worked with at Facebook would spend half their day in their IDE writing code, and they wait for it to eventually make its way out to production, and then they spend the other half of their day in Scuba or ODS, just trying to understand the consequences and effects of what they had shipped, or what their intern had shipped.
Because the understanding becomes the hard part much more than the development part.
Paul: When you think about how hard it is to replay a bug that a user had on your site--
Paul: You're going to have to replay it through several microservices, and fetch it from different logging mechanisms, and inevitably you're going to be missing something anyway.
Charity: This, to me, points to why it's so necessary that we get comfortable with testing in production. Which is very much a Dark-friendly concept.
Paul: Absolutely. I totally believe in it.
Charity: I see teams flushing all this time and energy just down the toilet, trying to get staging in sync with production. Which is actually, in fact, impossible, because every single time you deploy an artifact using a deploy script to a production that's a new thing. Right?
Charity: You can capture and replay the past, but you can't predict the future. So, whatever you're doing on staging is inevitably dumb.
Rachel: It's theater.
Charity: It's theater, and it makes you feel good about yourself. We have limited cycles, and we are spending all of our time there which means we're not spending it on hardening production.
Guardrails making it so that you can actually see what's going on so that you can slice and dice in real time, so that you can experiment.
Rachel: The guardrails are critical, though.
Paul, how do you think about making sure that testing in production manages failure in a graceful way?
Paul: I think feature flags is probably one of the best tools that we have for that. In Dark, the way that you do it is that once users are using a particular route that code is immutable, you can't change that code. You can't edit it, there isn't a process of going into it and making a change.
What you can do, is you can take a section of it and say, "I'm going to flag that off." And you can run multiple traffic both ways, and all that sort of thing. Basically, what we as developers are trying to do is get some personal certainty that the code that we write is going to work.
The best way to do that is to take real traffic, run it through the code that we've just written in a safe way, validate that the answers are correct, whether we're doing some sort of statistical analysis on it or just eyeballing the result.
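As a rough illustration of the idea Paul describes here (this is not Dark's implementation, and every name below is hypothetical), a flagged code path can be exercised by real traffic while users keep getting the old answer. The sketch assumes handlers are plain functions and that comparing return values for equality is a good enough check:

```python
import random
from typing import Callable, Dict, List


def run_with_flag(
    request: Dict,
    old_handler: Callable[[Dict], object],
    new_handler: Callable[[Dict], object],
    sample_rate: float,
    mismatches: List[Dict],
    rng: random.Random,
) -> object:
    """Serve the old code path; shadow a sample of traffic through the new one.

    All names here are hypothetical. The user always receives the old
    handler's answer, so a bug in the new code cannot affect them.
    """
    served = old_handler(request)
    # Only a fraction of real requests is sent through the new code.
    if rng.random() < sample_rate:
        try:
            candidate = new_handler(request)
        except Exception as exc:
            # A crash in the new code is recorded, never surfaced to the user.
            mismatches.append({"request": request, "old": served, "error": repr(exc)})
        else:
            if candidate != served:
                mismatches.append({"request": request, "old": served, "new": candidate})
    return served
```

The `mismatches` list is the part you would eyeball or run statistics over before flipping the flag for everyone.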
Charity: When you put it that way, it's insane that we haven't done this sooner.
Paul: That's my position, too. Thank you, Charity.
Rachel: It seems, though, like it would be very hard for legacy developers, developers with the older mindset, to embrace this.
Charity: I feel like, yes, it is hard for them to embrace it, but I find that often I have a hard time convincing people how easy it can be, if they just do the thing they want to do instead of the 10 or 20 steps before the thing.
This is a problem we have all the time too, where we're like, "No really, this is hard because you haven't been able to ask the right question. It's incredibly easy if you can just ask questions with high-cardinality fields." And it sounds like it's very much the same thing for you guys.
Paul: I think it's very much a case of showing them a demo of what they can do on their own data.
Paul: Obviously that's not necessarily an easy thing to do.
Charity: Yeah, but it's killer.
Paul: Our industry has a history of these amazing demos.
The world is changing as a result of these demos, and that's sort of what everyone really tries to do.
Charity: Got to show them on their own data, because then they know that you're not making it up, you're not cherry picking.
Paul: The other answer to that, and it's one I'm not particularly partial to, but the industry grows at such an incredible rate. The estimate for the number of programmers there are today is upwards of 50 million, and there'll be new people along all the time and there's still people writing in COBOL.
Some of them retired, and some of them went away, and then some of them got bored.
Rachel: COBOL's a great language. You and your co-founder Ellen publicly committed to diversity, while we're talking about all of these new coders coming in.
Do you think Dark's culture affects what your code is like, and vice versa?
Paul: Absolutely. We are big believers in inclusion. It is one of our core values.
There's a couple of different reasons for this, and one of them just from a business perspective is we want there to be a billion developers using Dark. Obviously we're not going to get there if we don't open it up to way, way more people than are currently coding today.
I think, as well, in the current political climate it's very difficult to not look around and see all the bad things that are happening and see the related situations in our industry, and how we've made it not a great place for people of color or for just generally anyone who's underrepresented in our industry. Non-white dudes, basically.
I guess it's fair to say, though, that we have both a business reason and a values reason for doing that and it's sort of core to who we are.
Rachel: What's the advantage of getting a billion people using Dark, other than that you make a ridiculous amount of money?
Paul: When Ellen and I started working together, I'd drawn up this sort of values questionnaire, and I had a lot of, you know, potential co-founders fill it out and basically, making sure that we're on the same page.
And the page was that we're building something big.
I'm not going to all this effort in order to make a small side project, or whatever. We're really doing a thing that we believe in, and a thing that we believe needs to exist in the world, that needs to exist for a lot more people, and it dovetails with a ton of different things and inclusion is one of them. The answer to that question is, you know, "Why would you do it?" It's like, because that's what we wanted to do.
Rachel: Like Trudeau getting asked about all of the women in his cabinet and saying it's 2017.
Paul: Right, exactly.
Charity: We talked a little bit about being on call. A lot of engineers seem to regard this as a curse, a punishment, a thing that is being imposed upon them, a thing that has to be avoided at all costs.
What's your view?
Paul: Well, I think one side of it is definitely that people need their sleep, and being on call is sort of damaging to our sanity, at the core of it.
Charity: There's definitely the flipside. Ops has a long and sordid history of masochism and we cannot ask people to join us there. Like, I'm over 30, I now want to sleep through the night too. We just have to raise our standards for what we are willing to impose on people and participate in.
Paul: I loved the early Stripe story, where, and who knows how true these apocryphal stories are, but where they set an alarm for every single error they got. Wake them up in the middle of the night if there was any error at all. I guess when you're dealing with payments, that's the sort of situation that you can put yourselves in because you don't want to drop them.
But the idea of, when you keep it clean, then the number of calls that you actually get is relatively low.
And the problem that I feel that people have when they're on call is that the costs of other people's code get externalized to them, to the person who's on call. So, I mean, it's basically like, how much does your company value you? Are they putting you on call because someone has to be on call? We've done a really, really good job to make sure that it's as good an experience as possible.
Charity: Our on call experiences, it's a rare week whenever anyone gets woken up. It's incredibly rare, and we always post-mortem it, and do everything we can to make sure it doesn't happen again.
Charity: I've been at many companies where that was the case. We just expected that you got woken up two or three times a night, you know, and it's really hard to dig yourself out of that hole once you get into it.
Paul: Right. Often when people interview, they ask you, "What's the on call going to be like?" And you can tell just from how they ask what scars they have in the past.
Charity: Oh, trauma. Absolutely trauma. It does come down to valuing people's time.
I feel like every manager has a responsibility to, if not be on call themselves, it's not always possible, at least to fucking graph, know when your people are being woken up and have it impact you and take it seriously.
Give them the time and the permission and the space and the support to pay down that technical debt so that it's not that bad.
Rachel: It's absolutely about taking responsibility, I think. You talked about how resentful people get when they're the negative externality of somebody else's lazy code. The advantage of putting engineers on call is they become responsible for their own code and they appreciate the consequences of that. But managers have to be respectful of people's time and of people's ability to affect the outcome.
The real burnout comes from not being able to make meaningful change.
Charity: A lot of engineers, because they're not exposed to that feedback loop, they don't actually learn how to write good software. It's not that they're doing it on purpose, they just don't know, because they've never had that feedback loop of, "Oh, this is what happens when I do that, when I have this way of degrading that's not particularly graceful, when I don't shrink the critical path."
Paul: I think, you know, coming back to what we were talking about earlier about microservices and continuous deployment, one of the best things that we can do to reduce our critical path is lower the diff of what we're shipping.
Charity: More smaller changes--
Paul: And more certainty around what outcomes they're going to have.
Charity: Exactly. I mean this is just part of distributed systems, right? Failures happening all the time, and it has to be not that big of a deal.
Paul: No matter what. Like, some day some shark is going to take a bite of an undersea cable--
Rachel: Cut off Australia entirely.
Charity: Well, what are developers missing about the future of software engineering and shipping quality code?
Paul: I think our feedback loops have gotten terrible.
Charity: Gotten terrible?
Paul: I mean, maybe they've always been terrible, but--
Charity: I think they are getting better honestly, and they've just always been bad.
Paul: I think back in the good old days, and by that I mean when I was in college and not writing actually valuable software, I actually think back to how we wrote software in college and how easy it was relative to what proper code bases are like today.
There is a feedback loop where you'd write something, and you tested it, and it's on your machine, it's not interacting. It's not a distributed system, I guess, is basically the thing. And that hasn't really been brought back to distributed systems. Tools like Honeycomb are obviously doing this, CircleCI, as you know, is trying to do a little bit of it, Dark is going deep on it.
I remember there was a blog post a couple months ago by the Instagram engineering team, and they talked about how they were saving data that happened in production, I think it might have been in the case of exceptions, so that you could have it on your machine, you typed a couple of commands, and you could actually replicate it yourself.
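A minimal sketch of that capture-and-replay idea follows. It is not the Instagram team's implementation, which isn't described in any detail here; the decorator, file layout, and function names are all assumptions made for illustration, and it assumes the handler's input is a JSON-serializable dict:

```python
import functools
import json
from pathlib import Path
from typing import Callable


def capture_on_error(capture_dir: Path) -> Callable:
    """Hypothetical decorator: when the handler raises, save its input as JSON."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(payload: dict):
            try:
                return handler(payload)
            except Exception as exc:
                capture_dir.mkdir(parents=True, exist_ok=True)
                # Number captures by how many already exist in the directory.
                index = len(list(capture_dir.glob("*.json")))
                record = {
                    "handler": handler.__name__,
                    "error": repr(exc),
                    "payload": payload,
                }
                path = capture_dir / f"capture-{index:04d}.json"
                path.write_text(json.dumps(record))
                raise  # the original failure still propagates
        return wrapper
    return decorator


def replay(capture_file: Path, handler: Callable[[dict], object]) -> object:
    """Re-run a handler locally against a captured production payload."""
    record = json.loads(capture_file.read_text())
    return handler(record["payload"])
```

On a developer machine, `replay(path, handler)` feeds the exact failing input back through the current code, which is the "type a couple of commands and replicate it yourself" loop in miniature.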
That's the world that we need to be going to.
Errors, exceptions, things going wrong--
Charity: Real data, real services, real networks, real traffic.
Charity: Absolutely. Couldn't agree more.
Paul: Real traffic is an important one because it's very easy to--
Charity: It's easy to think that tests are reality.
Charity: That was me rolling my eyes.
Paul: Well, the tests are reality if you somehow live in a world where your system is entirely consistent.
Charity: Or, all of your clients are robots.
Charity: That would work too.
Paul: So, this is the problem. If you're doing a test, you've written a couple of mocks or unit tests or maybe even integration tests, but they're not working at a scale where you might have a partition in your thing, or there just might be incredible load, or a hard drive is going wrong as it's being written.
You need to test under that world or else you can't really--
Charity: Exactly, and in distributed systems we just have this infinitely long tail of things that almost never happen. And once, they do. And you can't predict and test for all of them, just like you can't predict and monitor for all of them.
And you shouldn't try.
You should be instrumenting your system at a level of abstraction that'll empower you to ask new questions.
Paul: I think fundamentally the problem is that most people are not writing distributed systems. They're writing websites. Or web applications, which just happen to be distributed systems.
Charity: There's a great talk, I forget the name of the person who wrote it, on why web programming is the original distributed system. It is! We just aren't used to thinking of it and treating it that way. That's why it has a bad reputation in terms of good quality.
Rachel: It does feel like there's an intellectual chasm that we have to cross between, you know, "I'm writing this to run on my web server," vs. "I'm writing all of these things to interact with one another on other people's clouds in real time, and if three of them go down the other 12 will take up the slack."
Charity: Our solution so far has just been, "We're just not going to do it, and say we did."
Rachel: If you're a young engineer coming out of Trinity's CS department today, how do you prepare yourself for this very different world from the one we grew up in?
Paul: I think the obvious one is that you want to take the Distributed Systems elective, which I did not do, and I've regretted for decades since.
It really depends on what you're trying to do as an engineer. Are you trying to be in the ops-y side of things, and making sure that systems stay up? Or are you going to be more on the product engineering side? Because you can't know everything.
Charity: I would argue though, that the fundamentals of operations are no longer optional. I think that understanding roughly what happens to your code after you hit publish, even if you're a mobile apps engineer.
You need to understand the fundamentals of what's going to happen when things start going south.
Paul: I'm not sure I agree.
Paul: I mean, I think that optimistically everyone would know everything.
Charity: I would not say that at all. I'm just saying that if you can't model in your head roughly how failure works, your stuff is not going to be very good.
Paul: You're one hundred percent right.
Charity: Now you could say, "Well, stuff doesn't all need to be good," and I would say, "That's also true." Most things fail and it's usually not because your code wasn't pretty enough.
Paul: I think back to younger years when people talked about, "Oh, you don't know what HTTP looks like, what TCP looks like," or, "You don't know all seven layers of the OSI model," and that sort of thing. When people actually talked about, "This is a layer 4, and this is a layer 3, and--".
Charity: But I think that failure, and I'm not talking about any particular type of failure, just the act of making code reach humans and then sometimes not work. That seems like a pretty fundamental thing.
Paul: The rewards for making it reach the humans are far, far higher than the cost of it occasionally going down. You get rewarded for building the thing, and probably someone else takes the slack when it goes down.
Charity: Well, we're hoping that this is changing.
Paul: I think the incentives around building also mean that it may not ever change. I'm thinking specifically, you know, when PHP came out and everyone was saying,
"Oh, these PHP developers, they don't have any idea what they're doing," yet they're building the entire internet.
They're building Wikipedia, they're building Facebook and so on.
Rachel: Facebook is an interesting example, though, because what they've done with Hack is just reinterpret PHP so that it works in a really modern distributed system, which is kind of genius--
Paul: Seven years later? I mean how far were they and how successful were they by the time that they actually started doing that?
Rachel: If you ask them.
Paul: So, they started HPHP in 2009, maybe. And what, Facebook was four years old then? I'm not sure on my history. And they already had a couple hundred million users. That's certainly the scale that they should have to rewrite it.
Charity: Some of this is obviously aspirational, absolutely agree. But I think there's value in articulating what we aspire to as an industry. Because we can't just tell people, "Quality doesn't matter, go forth." Because software is eating the world.
Every industry is now a software industry and there are real costs to failure in industries.
Medical industries, building industries--
Rachel: The TSB migration that went south.
Charity: I mean, it's not just pretty web sites. I feel like I hear more and more grumblings about our need to raise our standards as an industry to be more like engineers, which is different than developers. You can be a code monkey using code and there are more and more and more of those, and I don't mean that in a derogatory way.
But there's also software engineering which I think should be more rigorous and should absolutely care about the quality.
Rachel: Certainly the civil and mechanical engineers would love that because they get a bit miffed when you talk about software.
Paul: I think I have the same goal as you, which is software works better and fails less and we get woken up in the night less. My belief of how we get there is not that we try to effect a change in humans, which I think people have been doing for a long time, but rather that we build better tooling.
Charity: I think I agree with you completely.
Rachel: Tooling can change behavior, though.
It can't change human nature but it can encourage certain outcomes over others by gaming the incentives.
For example, if you can't tell whether what you've built is working or not, you will build it differently than if you can. And that comes back to the question of responsibility and ownership. If you have agency over what your code does in production, if you can see and affect that, then I think you feel a lot more affinity for it and for the users.
Charity: Nobody is going to want to put energy into caring about something that they cannot affect or change. I mean, that's just wasted energy. What are vendors and service providers missing about the future of software engineering?
Paul: I think there's a habit of vendors to think about the world as their place in it, and to think a lot about the competitive dynamics of the marketplace and how to make themselves more important than the other people in the space.
And I think what they're missing is that fundamentally a better experience for users is the only thing that actually matters.
Rachel: Well, I think there's a huge distortion coming in from the finance side, particularly from the very large school of venture capital which wants to create natural monopolies. It's in some ways misaligned with what engineers are trying to do. Good engineers are trying to build open platforms that enable people, and that kind of investment is trying to create closed platforms that take advantage of inequalities in the market.
So, I get very frustrated with this mismatch between the two biggest constituencies in venture-backed software. The entrepreneurs and, not all, but some of the investment community.
Paul: I think it's inherent, and I think it's definitely part of the venture-backed world. Although, you also see a ton of bootstrapped people who have the same mentality. And you know, "We are the center of the world and everyone else will conform to who we are."
Charity: We all read the same blog posts.
Paul: I don't actually have any solution to it, unfortunately. I wasn't coming in with a big principle here.
Rachel: We could overthrow capitalism, maybe?
Charity: Tear it all down.
Paul: I think that's probably the closest thing to achieving this.
Rachel: All right, I'll put it on my action items.
Paul: I'll get my red flag.
Charity: Awesome. Thanks for coming.
Rachel: Thanks so much.
Paul: Thank you.