Ep. #37, Synchronization Layers and Logs with Garry Shutler of Cronofy

Guests: Garry Shutler

In episode 37 of o11ycast, Charity and Shelby speak with Garry Shutler of Cronofy. Together they pull back the curtain on building scheduling tools, making observability tools accessible to teams, and keeping deploys boring.


About the Guests

Garry Shutler is CTO and Co-Founder of Cronofy, the scheduling platform for business. He was previously Lead Developer at Zopa and Head Developer at Adgistics.

Transcript

Garry Shutler: It kind of started a couple of years ago.

We were growing as a team and I was in a position where I knew what the system was doing.

I could look at a bunch of dashboards and logs, and I understood, like, the heartbeat of the system, having looked at it like that from, like, one request an hour through to however many requests a second there were in the moment.

And that made me a huge blocker on pretty much everything. And every incident I had to be involved in.

And so it was looking at how I could reduce my bus factor, and observability seemed to be the tool that is the way.

And we were at an offsite about 18 months ago and I was like, we're going to sort out observability so that all the people sat around here have an idea of what's going on in the system, not just me.

Because yeah, I understood that I was like Tank with the Matrix.

The green symbols are going everywhere and I can see the woman in the red dress, but people are coming and going.

"I've stared at logs for a day and I still don't understand," it's like--

Charity Majors: That's an amazing metaphor I love it so much because that is what it's like.

When you were the person who built the systems from the ground up, you know, you get this sixth sense.

You can't even explain it sometimes just like, there's something fishy about this, right?

And that means that whenever anything goes wrong, people come to you because they don't trust the tooling, they trust you, right?

And like so much of like observability is about making it so that what's in your head is more democratically accessible to other people.

But with the meaning that you would attach to it, right?

It's not enough to just give them 10,000 graphs, right?

This is why one of my favorites is, like, "log everything, find nothing, die unfulfilled."

You know, it's like, you can give people all of the information, but unless there are, like, the grooves that you wear in the system, like the weight that you attach to it, some of what's in your head, the interpretation of it, it doesn't actually work if you just give them everything.

Shelby Spees: Yeah, I saw someone say recently, that intuition doesn't scale.

So whatever we can do to externalize that, and make it part of our system and more available to other members of our team, I think many of us have been on both sides of this, where you know, you're the person who's that single point of failure and you don't have the time or the resources to get that information out of your head.

Or you're the person who's just joining a team or sort of new to that ecosystem and you're trying to get that information and yeah.

Charity: So I think this is super familiar to a lot of people, so maybe we can break it down to like, how did you get from point A to point Z?

Like, how exactly, where did you start and what did you give them first?

And like, what was the feedback cycle?

Like, like how did you start to transfer what's in your head into theirs?

Garry: So one of the first steps we did was experimenting with, like, a bit more structured logging.

So we already had Kibana that we were looking at logs in.

So, emitting sort of canonical log lines, which I think there's a Stripe blog post about.

And I was like, ah, let's try that as, like, the gateway to something closer to observability, with that single log line per transaction, and put that in some key points in our system.

And that was like, right, is this now telling me what I see when I look at 100 logs at a time?

And can we get some visualizations to that sort of thing?

And the answer to that was yes, but it was quite limited and you had to have a view for everything.

And it was a bit of a square peg into a round hole of how Kibana kind of wants you to work.

But it sort of proved that there was better visibility through this route, and then, come late last year, we had the time and budget to be looking at a proper observability tool.

And Honeycomb had been high on the list.
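
As a rough illustration of the canonical log line idea Garry describes here (one structured line per transaction instead of many scattered lines), the following is a minimal Python sketch. The handler, the field names, and the values are hypothetical and are not taken from Cronofy's system or from any particular logging library.

```python
import json
import time


def canonical_log_line(event: dict) -> str:
    """Serialize one wide event as a single JSON line with sorted keys."""
    return json.dumps(event, sort_keys=True, default=str)


def handle_request(method: str, path: str) -> dict:
    """Hypothetical handler: accumulate fields as work happens, emit once at the end."""
    event = {"http.method": method, "http.path": path}
    started = time.monotonic()
    # ... the real work would happen here, adding fields as it goes ...
    event["db.query_count"] = 3  # illustrative value only
    event["http.status"] = 200
    event["duration_ms"] = round((time.monotonic() - started) * 1000, 3)
    # One line per transaction, rather than one line per step.
    print(canonical_log_line(event))
    return event
```

Because every line carries the same keyed fields, a log tool can filter and chart on them directly rather than parsing free-form strings.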

Charity: I love that you went to structured data first.

And I feel like this is one of those things where it's really hard to map feelings to, like, logical, on-the-ground steps, right?

Because it sounds kind of fuzzy and everything.

But I do feel like the structuring of the data and the gathering of it in the right way is such a necessary step.

And it's hard to like convey to people what this does.

But then there's the shift from grepping through logs to thinking of logs as-- or conversely, if you're coming from, like, monitoring land, the shift from top-level aggregates to these event logs that contain, you know, all of the data in these arbitrarily wide structured data blobs, which then you can feed into all the computer stuff that you do, right?

Like you could make functions out of this.

You can make algorithms, you could make percentile buckets.

And when you're just used to thinking of data as strings, that's not available to you.

When you're used to just thinking of data as dashboards and aggregates, that's not available to you.

And I feel like it's such a, it is like the first step towards observability.

And like, I think we see this over and over again.

You can't do it if you don't go through that first step first.

Shelby: Absolutely, and I think before we get into this last part of your journey, this last year of assessing observability tooling, I think now would be a really good time for you to introduce yourself.

Garry: Yeah, I'm Garry Shutler, I'm CTO and co-founder of Cronofy, the scheduling platform for business.

Charity: Cron as a service?

Garry: Well, that's kind of where it came from. So scheduling tools.

So anything about time: when people are available based on the contents of their calendar and rules they have set up and all that sort of thing.

And yeah, that's really where the name came from: cron, being developers, and building an API to begin with was where we kind of started.

Charity: Totally, that's awesome.

I feel like some of the most successful startups of the last five years have been companies that have taken fundamental Unix tooling things and then surfaced them to other developers.

So that's really cool.

Shelby: So what are the sorts of incidents you were running into, Garry, where you were like the only person who could solve them?

That are like specific to what Cronofy does?

Garry: So one of the main things we have is, like, this synchronization layer, where we're talking to your calendar server to understand all the events that are in your calendar, and that's happening all the time, continuously, in the background.

So there's thousands of events happening a minute, peaking at tens, hundreds of thousands of events.

One interesting thing that was surfaced really well through Honeycomb is the way the Exchange protocol works: when you dismiss an alert, that actually triggers a change notification, and our system goes, oh, something changed.

It doesn't actually impact your availability, but we have no way of knowing that.

So quarter to the hour, every hour, we have this nice flurry of everybody going dismiss.

To work out why that was happening from logs was actually really hard.

It's like, it took me ages to work out exactly what was going on here. I wonder if Honeycomb helps me.

And seeing those jobs come in as inbound notifications and then triggering jobs that are actually checking to see what has changed.

Charity: When you're reading logs, you're reading like what one process does at a time, right?

But so many of the interesting problems these days are the convergence of, you know, what hundreds or thousands of jobs are doing at the same time, and you're never going to get that from any log file.

Garry: Yeah, being able to easily sort of go, oh, that job started ramping up in volume.

And so that job started ramping up in volume. It's like, there's a direct correlation just 'cause those curves match.

Charity: Yeah, you know, for people like you and me though, I think it is a bit of it, it's an interesting challenge in that it feels like you're giving up a bit of control when you go from the command line to a tool, right?

Like there's this thing very deep within me to distrust all tools except the command line, right?

And, like, if the tools fail, you fall back to the command line, you just like, you know, you go deep into strace and like, you know, you go into this stuff.

And it's been an interesting process for me over the past few years, realizing that the reverse actually happens more often where my command line tools fail me and I need to fall back to my tool so that I can see what is happening in aggregate across this distributed system.

Garry: Yeah, and so as we were adopting observability tooling, it was about forcing myself to stop going to logs and be like, right, start there, can I answer the question there?

Because then I can just train the team on how they can answer the same question.

And so for me, very early on, it was: I am answering this question at least as fast, maybe faster, than I would do going to the logs that I was comfortable with.

And I didn't understand fully where to go at that point.

And we weren't pushing as much of the extra attributes and so on that we've decorated jobs with over time. But things were just, like, leaping out much more easily. And also the team could do that. Like, the team can now investigate things much better and about as quickly as, if not quicker than, I can.

And that was the sort of the goal we started out with.

Charity: Isn't that a wonderful and liberating feeling.

I feel like the transition between, you know, a system where the debugger of last resort is always the person who's been there the longest, you know, into a system where the best debugger is the person who's most curious, who has worked in this the most recently.

Like that is such a-- On the one hand, for those of us who are used to being the debuggers of last resort, there's a bit of ego to give up there.

There's a little bit of a sacrifice, but it's, it feels really good.

Shelby: I appreciate that you said like you hadn't started adding a lot of like custom attributes.

You hadn't done a lot of custom instrumentation yet.

Like, I think people underestimate how much value you get from structured event data and distributed trace data just with the auto-instrumentation, just with, like, top-level HTTP request data, top-level, like, database query spans.

And so I think that was really cool to hear that like, before you even learn sort of the ins and outs of the tool, you already started just being able to answer your questions as quickly as you could before.

Garry: Yeah, well, we have to strip it back even further than that, because we're dealing with pretty sensitive data.

There's lots of PII, we have European customers, so GDPR and the California CCPA.

And so we wanted to be very sure about what we were sending to Honeycomb.

And when we started off locally, it was sending way too much data, and we didn't know everything that we were sending.

So we actually disabled all of that stuff.

And, like you said, we put the things around the web request: it was this path and this method and this response code, and that's it.

And then similarly around our job framework was like, it is executing this class and that's about all we allowed it to do, but just knowing volumes of endpoints and volumes of jobs was a good starting point.

And then for the most important jobs, for example, the ones that I'd put the canonical log lines in, where those log lines were, it was: let's push every attribute of that log line into the span, and so we've replaced the canonical log lines with the telemetry.
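
The cautious starting point Garry describes, where only a handful of known-safe fields are allowed out, can be sketched as an allowlist filter. This is a minimal Python illustration under stated assumptions: the field names and the `scrub` helper are hypothetical, and this is not Cronofy's code or a Honeycomb API.

```python
# Hypothetical allowlist: only the path, method, and response code leave the process.
ALLOWED_REQUEST_FIELDS = frozenset({"http.path", "http.method", "http.status"})


def scrub(attributes: dict, allowed: frozenset = ALLOWED_REQUEST_FIELDS) -> dict:
    """Keep only explicitly allowed keys; anything not listed is dropped."""
    return {key: value for key, value in attributes.items() if key in allowed}


raw = {
    "http.path": "/v1/events",
    "http.method": "GET",
    "http.status": 200,
    "user.email": "someone@example.com",  # PII, must not be sent
    "db.statement": "SELECT ...",         # may contain query parameters
}
safe = scrub(raw)
```

An allowlist fails closed: a new field added by an auto-instrumentation upgrade is dropped until someone decides it is safe, which suits a risk-averse stance on PII better than a blocklist would.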

Charity: Well, to me, the fact that, you know, it's useful out of the box without doing much instrumentation brings to mind the fact that people right now who don't have observability, who just have, you know, Prometheus or Datadog or whatever, aggregates, there are so many gremlins in their software that they have no idea about.

Like they just, you know, and as soon as you pick up that rock, you add event level instrumentation, everything it's just like, oh shit, there's a bug.

Oh, shit that happens, oh my God, there are these outliers, you know, oh wow, this user never succeeds, right?

Aggregates cover up so many sins, so the initial switch is really powerful. But also I feel like there's kind of a battle for hearts and minds going on right now between the people who are like, instrumenting your code is a waste of time, you should be spending all your time on business value, just attach an agent that does all the magic for you, versus the people who are correct, who say that only you will ever know your code. Only you know how to pick up the things that are important that need to be.

You know, like yes, automatic stuff can do amazing things for you, like out of the box, yes.

And then it should support you like, it's like commenting your code, right?

You can never get back to that original intent if you didn't capture it at the time you make the change, it's the same with instrumentation.

You can't expect anyone else to divine that from your code.

It's unreasonable. And it's also like, like it's on us as providers to make this easy for you, to make this intuitive, to provide handlers and backups and retries, and like all the stuff that makes it as easy as adding a printf, right?

But it's on you, the developer to add the fucking printf once you're like, this thing matters, right?

And no agent will ever do that for you correctly.

And the thing is, the people who are like, "just fix it in post, you know, the agents are fine,"

they spend way more time than, you know, the developer adding a printf.

Well, the people who fix it in post spend way more time poring over the config options for their agents going, what field matters, which one should I, like, pluck out?

You know, they spend way more time, orders of magnitude more time, doing that.

Shelby: I was talking to someone earlier about this, about sort of crossing the threshold from agents that live outside of your code to SDKs that live inside your code, and how much more you can get from just, like, it's the same level of effort to deploy an agent as it is to add an SDK to your dependencies.

And it's the same amount of configuration, it's the same API key, one is just outside of your code and one is inside.

And I think that's something that we don't talk about enough is just that immediate like gain.

But Garry, I think you touched on something really important, where I was so excited when I started learning about structured event data, and like, look, you can capture everything, you can capture your user logins and you can capture, like, the exact comments someone made on the exact, like, Discord bot, whatever.

But like, you make a very good point that, as we start adding all this custom instrumentation, it expands our, like, security risk footprint, right?

Especially for people using auto-instrumentation, we've seen this with our Rails integration where, you know, Postgres will sanitize the queries, but MySQL won't, and it'll include all the query parameters.

And so like, can you give any advice about how people should like assess that risk?

Garry: We're pretty risk averse when it comes to data.

So we just stripped all of that out and it is something that we would like to at least get statistics on volume of queries being made and that sort of thing.

So we've got latency of the overall requests, but maybe like counters of how many queries we're doing.

And it's like, what, that's doing 50? What?

Five or 10 or something might be acceptable, but 50 would be a bit of an outlier.

Because then there's no data in there, but it gives you something to look at.

And then maybe we can put some additional telemetry in there.

But the lowest hanging fruit we had was that every request to our API belongs to an OAuth client.

So we added that ID and it's usually relating to a specific account on our side.

And so we added that ID. Now we can look at support cases, oh, we're getting slow responses from API.

I can go: every request for your client, here is a chart of your latency for every endpoint that you've called in the past month.

If you think it's super slow.
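
The two ideas in this answer, a per-request query counter that carries no customer data and a client ID that lets support break latency down per endpoint, can be sketched as follows. This is a minimal Python illustration over in-memory events; the field names (`client_id`, `endpoint`, `duration_ms`, `query_count`) and the threshold of 50 are assumptions chosen to mirror the conversation, not Cronofy's schema.

```python
from collections import defaultdict
from statistics import median


def latency_by_endpoint(events: list, client_id: str) -> dict:
    """Median duration in milliseconds per endpoint, for one client only."""
    durations = defaultdict(list)
    for event in events:
        if event["client_id"] == client_id:
            durations[event["endpoint"]].append(event["duration_ms"])
    return {endpoint: median(values) for endpoint, values in durations.items()}


def query_count_outliers(events: list, threshold: int = 50) -> list:
    """Requests whose database query count reaches the (illustrative) threshold."""
    return [event for event in events if event["query_count"] >= threshold]


events = [
    {"client_id": "c1", "endpoint": "/v1/events", "duration_ms": 10, "query_count": 5},
    {"client_id": "c1", "endpoint": "/v1/events", "duration_ms": 30, "query_count": 50},
    {"client_id": "c2", "endpoint": "/v1/events", "duration_ms": 900, "query_count": 8},
    {"client_id": "c1", "endpoint": "/v1/calendars", "duration_ms": 7, "query_count": 2},
]
per_endpoint = latency_by_endpoint(events, "c1")
outliers = query_count_outliers(events)
```

Neither field exposes calendar contents or query text, yet together they answer both "is it slow for this customer?" and "which requests are doing far more queries than expected?".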

Shelby: I love that, and that's exactly what we do at Honeycomb when someone comes in with a support question and it's like, is it just me or are queries slow right now?

And it's like, oh no, it turns out your secure tenancy configuration is messed up on your account.

Like, let's fix that for you.

And just being able to break down by user ID, you know, without having to keep all of the more sensitive data, you know, just enough custom instrumentation, just enough user data to be able to debug the stuff that is only on a subset of traffic, or it's only on certain kinds of users who have a lot of data attached to them.

And things like that, where we start to see these worst case scenarios that do have ripple effects across the whole system, right?

So I like that approach, the sort of incremental custom instrumentation stuff. And yeah, I do think that's something that I've sort of failed at in talking about custom instrumentation so far, where I'm just like, add a field for everything. And it's like, no, we do need to be smart and intentional about what we're including, but we don't have to feel limited by our tooling; that is sort of the value there when we're starting to use structured event data.

When we have a distributed column store that can just consume arbitrary fields and make them available for you.

You said you wanted to talk about continuous deployment.

Can you talk about how you do that at Cronofy?

Garry: Yeah, so over the years we started from just the two of us, Adam, my co-founder, and myself, pushing to Heroku.

And that was our sort of CI and we maybe had like a Jenkins box in the middle to run the unit tests and that sort of thing.

And as we've grown, we moved to Cloud 66. They provide a Heroku-like experience, but you get a bit more control over the AWS instances and so on that are sitting beneath that.

So we've always been sort of doing CI and CD, closely, but occasionally there was a button press involved. When we switched over to Kubernetes two and a half, three years ago, something like that,

we had only two production data centers, a German and a U.S. one, and there weren't many of us, so someone such as myself having to click a couple of buttons wasn't a big deal.

But we now have six data centers since last year we started with two and now we have six 'cause we've expanded to Australia, Singapore, UK, and Canada is the newest one.

And clicking six buttons is very tedious. And so we got Jenkins behind the scenes powering that.

So we started out with, like, a script that the ops people, the admins, could run so that it was effectively one button click of sorts.

And then we moved that to Jenkins, and now that's triggered by GitHub.

So as soon as the pull request is merged, that kicks off a production deploy.

And what that does is it's just taking me and the more privileged admins out of the loop again. And what we found when we were going down the compliance route of doing SOC 2 and ISO 27001 is that a lot of that is about, how do you know that people aren't doing random things to the system at any point in time?

Charity: Yeah, it's about auditing.

Garry: And when we go, we only deploy via source control, on the infrastructure side with Terraform, and on the application side of things with the CI/CD pipeline that is fully automated, that no one can work around,

they go, okay, that's fine, that's about all it takes, because even the process itself is source controlled.

So changes to that process get approved through the regular change management system, so they're very, very comfortable with continuous deployment because it is fully reproducible.

Charity: You did it basically from very early on, didn't you?

Garry: Yeah, yeah.

Charity: Yeah, I feel like continuous deployment is one of those things where, you know, if you have a mature system where, let's say, it takes weeks to get a build out,

It's pretty daunting, it's pretty difficult to get that down to, you know, minutes.

But if you just start that way from day one or day two, like that is the easy path.

Like you just maintain.

Everybody knows it and they incorporate it into their workflow and their expectations and you just maintain and that is the easiest way I think, to build and run software.

Garry: Yeah, fully agree. Certainly my co-founder, he was previously a CTO, and we're both developers.

And so we were like, we're just going to do everything right from day one. No one's going to tell us otherwise.

But yeah, in previous companies we had fortnightly or monthly release cycles.

Charity: Yeah, so painful.

Garry: When we were redoing the web tier that we actually redid that in Rails.

And so we had continuous deployment on the bit that changed the most.

And we worked slowly on sort of releasing that backend more towards weekly.

And as we got towards doing it a couple of times a week or some don't ask.

So it was still half an hour, an hour of somebody doing it reasonably manually.

But if we really needed to, we could do it several times a day.

Whereas the front end, which was what was changing the most rather than sort of the model and the engine behind it, we were able to sort of do that well.

And that would allow us to iterate faster on the part that was iterating the most frequently.

And so we sort of got most of the benefits without going the whole hog.

Charity: Did you find that maybe CI/CD has made it easier for you to recruit and retain?

Garry: I think what we say at Cronofy is that come here and argue about just how well we do it.

Not whether we should be doing it the right way.

It's just like, precisely exactly how well should we be doing is the kind of conversations we want to have.

So I'd like to think it's the sort of company that I, 10 years ago, would have really wanted to work at.

Charity: You know, I feel like it's one of those things that once you've seen it in motion, you really can't imagine working any other way.

And if you haven't seen it, if you haven't actually participated in a system like that, it sounds so scary and unattainable and foreign that it's--

I think it is really hard for technologists to go out on a limb and go, I'm going to sacrifice my personal political capital at this company to push for this thing, that sounds kind of scary, that I'd never actually seen work, but a lot of people on the internet say it does.

You know, I feel like that's just kind of a gap that many people won't do.

So I wonder, like, why did you, is this something you had seen working before?

Garry: I suppose it's a long while ago.

It was just sort of putting faith in the internet being correct.

So it's a bit like unit testing.

So when I was first a junior developer, if those people say that this is a really good idea I'm going to try it out.

And maybe I will be doing that in my spare time or something just to be like, I feel bad wasting the company's time if this is just completely not a thing. But-

Charity: Interesting.

Garry: Usually it's the lead time to, like, a change: if there's a bug, having to wait till next week for it to be fixed, or having to dedicate a day to releasing it because that's the release for it.

Charity: You know, I have a story about this.

So I just got my first vaccine, got my little bandaid right there, pretty proud of that.

And when I was trying to sign up for the vaccine, you know, appointments, I was using the Walgreens site and every time it was like, backend server error, or no appointments available, just like every time.

And I was so frustrated, I tried every zip code in the city.

I was trying for days and everything, and eventually realized that because I did not have a gender set in my profile, it was returning a generic server error every time, or saying that there were no appointments available.

And I found a thread on Reddit where they figured this out about a month or two ago, and I'm sitting here thinking, is this because Walgreens' deployment pipeline is just, it takes like months to get a change out and this is why they can't fix it, so that I can--

'Cause it's such a small change. It's like literally just error handling needs to be a little bit better, right? Just tell me what to do instead of like telling me that there are no appointments available and yet it's been up there for months. It's a known problem, the internet fixed it for you.

Like, why is this still, you know?

And I just, I feel like it's such a great capsule example of why it's not okay for software to take a long time to be digested, right?

It's like digestion. One of my favorite quotations in the last couple of months was, like, software that you've written and not yet deployed ages like fine milk.

You know, it does because you have all this context in your head, you know, why you built it, you know what you did, you know what you tried, you know the whole path, the trade-offs, but it decays really rapidly.

Like you've got it all in your head for what?

Minutes, maybe an hour or two.

As soon as you've split your attention to the next project, you've evacuated all of that past knowledge and you've paged in the next set of knowledge, right?

It doesn't linger long.

It is not, if you don't capture it when you have it it's going to be like exponentially more difficult for you or anyone else to capture it at some point in the future.

It's not like we're saying what you wrote you just have to use within 15 minutes.

This is why feature flags are around, right?

Decoupling releases from deploys, making this stuff safe.

But if you can't get your code out into the wild quickly, that stuff that only exists in your head rapidly decays, and then it rots and then it's just no good.

Garry: Also like understanding what caused the problem.

Like, if you're releasing once a week, there's 10 or more changes in that.

Charity: That's the thing, if it's more than minutes, if it gets into hours then I guarantee you that you're batching up changes, right?

You're not shipping one engineer's diff at a time, which is how you get software ownership over changes, right?

It is reasonable to ask an engineer, if you merge to main, that in a few minutes you go in and you look at it and you ask yourself, is it doing what I expected?

Does anything else look weird, right?

That is a reasonable request for a software engineer.

But if it's going to be shipped at some unknown period of time, somewhere between minutes and days from now batched up with one to 10 to 15 other engineer's changes, it's not actually reasonable.

Because, you know, if it fails, now it's the problem of whoever pushed the button to deploy.

Which makes an incentive for people to not push the button to deploy.

Which makes your changes, like, stretch out to be longer and longer, which makes them just bigger and bigger, and it's just like this death spiral that you get into.

Shelby: I really like, Garry, how you've connected this to, like, auditing and compliance and stuff, because I think a lot about the importance of, like, our commit history and being able to go back and, like, reproduce and walk through the exact changes.

And I think the same should be true for our builds and our deploys. The number of times that I've had things that worked great in isolated PRs, and they were reviewed, and then the batched-up changes, like, in aggregate--

Especially with the ground shifting beneath us and, like, underlying dependency changes in the two weeks between when my PR got merged and when the entire batch of things got sent out, and this was not just on the code, but also on our configuration, like our Chef code got batched up as well.

And so it was so hard to reproduce that stuff and debug.

And so it's totally a compliance concern.

You know, the decision to merge in code and have your, you know, your Git source of truth, making that also your deploy source of truth, just sort of removes a lot of question marks, right?

It adds a lot of certainty. And so I really appreciate that.

Because I think a lot of businesses don't see the potential cost of like that diff between I merge something in, and now it's in production, but it's been collecting dust for two weeks and entropy and all of that stuff.

You know, like we talk a lot about the engineer mental costs, but I think it's really important to talk about just sort of the business risks of like just the impact of these things and having that tight feedback loop.

Garry: Well, I think that there's sort of two factors, like if you've got 10, 20 changes going out at the same time, there is a higher chance that there is going to be a problem.

Because you've not just got one change that is understood in isolation; maybe it looks good on its own.

You've got 20, which may be sort of interacting with each other in slightly different ways.

And not only will that take longer to work out what has gone wrong, but because your deployment is probably taking hours, even once you've fixed it, it's hours until it's resolved.

So your, like, mean time to recovery is at least as long as your deployment time.

Charity: Yeah, it's at least as long as your deployment time.

Like that, that interval between when you write the code and when the code is live in production, it is the base unit, right?

Like you can never fix anything faster than that.

You can never ship anything faster than that. It can be longer, but it can never be shorter than that.

And then there's a real problem when you have emergency ways to deploy that bypass your actual ways to deploy that you test every day, right?

When you've got a process that you do every day, but it takes an hour or two, well, then when you have a problem, suddenly you're not going to wait an hour or two, you're going to, like, bypass all that.

You're going to do something else that you don't test every day and then now you have two problems.

Garry: I've never edited a file on live servers because that will fix it in minutes.

Charity: No, never, never did a distributed SSH to just copy out the file and restart for fuck sake because it takes two hours to deploy the code. Yeah, never.

Well, I loved your story, Garry, and I was so happy that you agreed to come on this podcast, because I feel like at this point, there's a lot of us who are out there just kind of ranting about this is the way, this is the way, right?

But like, when I'm talking to you over Twitter I feel like, you know, you're someone who has gone through this, you've been through the weeds very recently.

And I think you took a lot of it on faith.

You know, I remember when we were arguing about, should there be a manual like button or something and you're just like--

Fuck it, I'm going to try this, you know. I think it's really nice for people to hear just like very recent hard-fought battles that had good outcomes, that kind of demonstrate what can be won from adopting this sort of stuff.

Garry: Yeah, that was very important; it sort of tipped the balance for me on the button.

'Cause the button was there.

It was like, sometimes we might merge three minor changes just 'cause they happen to go at the same time; a typo fix isn't going to do anything.

And then we press the button to avoid an extra deploy, basically.

That was part of it. But what's the harm in additional deploys like if you're doing them all the time, they're super low risk.

Charity: I love what the Intercom people say, which is that when you're a software company, shipping code is your heartbeat, is the heartbeat of your organization.

And as a heartbeat, it should be as regular, as common, as quick, as unobtrusive; it should be a nothing burger.

It should just happen all the time without anybody having to think about it.

And I think that this is how we put that into practice.

So anything else, any closing remarks, Shelby?

Shelby: I was just Tweeting what Garry said right now.

What's the harm in deploying more often?

And it's exactly that, there's no harm, you know, deploys should be boring.

Deploys should be boring.

Charity: Make deploys boring.

Shelby: Thank you so much, Garry for joining us.

Charity: Thanks Garry.

Garry: No problem.