In episode 2 of High Leverage, Joe meets with Corey Quinn, cloud economist and founder of the Quinn Advisory group, to discuss the realities of using cloud providers at scale.
Joe Ruscio: Alright everybody, welcome back. Today I'm joined by Corey Quinn, a self-proclaimed cloud economist, which we're going to get into in a minute. But also a bit of a renaissance man.
The host of the Screaming in the Cloud podcast and also editor of the highly popular and actually very recommended, it's one of the newsletters I subscribe to, Last Week in AWS. And so it should come probably as no surprise that we're going to talk about cloud today, Corey. Welcome.
Corey Quinn: Thanks, pleasure to be here, thank you for having me.
Joe: Yeah, the pleasure's all ours. So I just wanted to start, get right to it, what on earth is a cloud economist?
Corey: It's sort of a portmanteau of two words that virtually nobody can define, which means that when I call myself a cloud economist, no one is going to come back and say, "Wait, that doesn't mean what you think it means."
When you call yourself an anesthesiologist, there are problems. But when you call yourself a cloud economist, people smile, nod, and assume you must know what you're talking about.
Joe: Yeah, they probably also can't tell you that you're bad at it.
Corey: Exactly, even most real economists tend to profess to know everything there is to know about money but often dress like flood victims themselves.
Joe: Right. But a cloud economist, I'm going to spoiler alert, is basically around the notion that cloud billing, which is to say, at the first of the month when your cloud provider of choice, Amazon or Azure, whoever sends you your bill, is it fair to say they are hard to discern? Or understand if anyone's ever seen one?
Corey: It makes your cellphone bill look like a learning-to-read stylebook, and also you must have a very small bill if you're getting it on the first of the month. Sometimes there's a week or two delay for many of these brokers.
Joe: Right, yeah that's true. I guess I just want to talk about that. And also, it's something I had some historical experiences, you know I spent many years, actually I guess probably six or seven years at least, dreading either the first of the month or to your point, once we got larger, the 11th or 12th of the month, getting the bill and then inevitably getting the email from the CFO or the finance function going, "What happened?"
Corey: Exactly, and that problem that you just described is where I started my own consulting business a few years back, and it took me a long time to sort of plumb the nuances of the situation you just described.
Many people think of it as a scenario where the CFO is complaining about the dollars and cents of the bill, and in many cases they may believe that themselves. But in practice for companies of significant size and scale, what they're more concerned about is if last month's bill was $3 million and this month's bill is $4 million,
it's not the extra million dollars itself that matters to the company. What does is the fact that they didn't see it coming. They weren't aware this was going to happen.
What caused it? Is it a one off, is it going to be repeated? What does this do to their model of unit economics? What does this mean for their financial projections 18 months from now? And how do they attribute this back to whatever it was that caused that spike? When you have complex numbers of business units, it's not at all straightforward to figure out what drove that.
Joe: Yeah, and I think to your point is large enterprises where, or just large companies where you have a finance function, it's really more about the surprise, right? And it's that they then have to solve attribution and chargeback and these other things in a reactive-- a fire drill rather than planning.
Corey: Exactly, and the cloud providers themselves aren't making this any easier themselves. You sell the idea of cloud to companies on the premise of being able to take whatever resources you want, use them when you need them, and then turn them off. But then you go back to things like the RI purchasing model and a lot of the negotiations you have at scale with these companies, and then you're told to predict the next one to three years of usage.
It's a perfect OpEx model that you then have to think about in a CapEx way, and that's something that, first, is difficult and annoying for finance companies to deal with, and second, isn't really serving the majority of these cloud providers' customers very well.
Joe: And it's fascinating because I think the typical developer or even operations person is somewhat insulated from this, but I don't think people realize exactly how byzantine, and not to pick on Amazon, they're just the 800 pound gorilla, by far the largest cloud provider right now, and the one I had experience with for probably close to 10 years, an account that started, to your point in the tens or hundreds of dollars a month and then ended up in a seven figure amount before I turned it over on the way out the door.
And it was the attribution and the RIs, even something as simple as, "Oh, we're going to buy this upfront," right? Like, how many permutations would you guess of a reserved instance there actually are on Amazon?
Corey: Just to buy a single RI, you have to care about what region it's in, and whether or not there's a capacity reservation, which will bind it to a particular availability zone.
Joe: Right, because you got a data center, yeah.
Corey: Exactly, plus you have to figure out is it going to be for one year, is it going to be for three years? Is it going to be dedicated tenancy? Is it going to be shared? Is it going to be scheduled or is it going to be full time? Is it going to be standard or is it going to be convertible down the road?
And it can also be broken down now, so a bunch of smalls can add up to form an extra large or an XXL as well. So the right answer to, "What RI should I purchase for even a small workload?" is nobody really knows.
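To make the combinatorics concrete, here is a minimal Python sketch that counts the combinations of just the purchase options named in this exchange. The dimension names and option lists are illustrative assumptions taken from the conversation, not an authoritative catalog, and not every combination is necessarily purchasable.

```python
from itertools import product

# Purchase dimensions named in the conversation. These lists are illustrative
# assumptions, not an authoritative catalog of what any provider sells.
RI_DIMENSIONS = {
    "term": ["1 year", "3 years"],
    "tenancy": ["shared", "dedicated"],
    "schedule": ["full time", "scheduled"],
    "offering_class": ["standard", "convertible"],
    "scope": ["regional", "zonal (capacity reservation)"],
}


def ri_permutations(dimensions):
    """Return every combination of the given purchase options as a dict."""
    names = list(dimensions)
    return [dict(zip(names, combo)) for combo in product(*dimensions.values())]


options = ri_permutations(RI_DIMENSIONS)
# 2 * 2 * 2 * 2 * 2 = 32 combinations, before multiplying by region,
# instance family, and instance size.
print(len(options))
```

Even this toy model yields 32 variants for a single instance type in a single region, which is the point being made: the option space grows multiplicatively with every dimension.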
Joe: Right, and to explain a little more of the complexity: if you're not using an RI, Amazon, in an attempt to be helpful, will float it over to a separate account. If they recognize that you have an RI you're not using and someone else, in theory, could leverage that reservation, they will float it over for an hour, and you have to dig through a lot of data to understand exactly what's going on here.
Corey: And you wind up, in some cases, with parts of an hour are covered and other parts are not so you wind up with blended rates. You wind up with having to amortize these across accounts, in many cases.
And to get back to your earlier comment, by default in an AWS account, what you'll see is that an engineer can spin up a giant pile of resources but is precluded from seeing what any of it costs in their account, which to my mind is sort of a lunatic type of thing to do. It doesn't serve the company well.
If a developer spins up a $20,000 cluster to test something, don't you think she should know that's what she's doing?
Joe: Yes, and from an operator's perspective, the singular question I was always trying to answer is particularly running software as a service, a line of business application, we generated this much revenue in this calendar month by delivering this service, we used AWS resources to deliver it, so it should be a simple equation, right? Revenue over dollars spent for cost of hosting effectively at margin. How easy is that number to come up with?
Corey: I have a client where one of their architects took me aside and asked me a very simple question, "What does it cost to run my service for one hour?" And that was great, and we got half an hour into that conversation before he had to leave for another meeting, because there is so much that goes into that and so many questions that need to be answered.
All of them can be answered, but there's no wrong way to answer most of them. It's, are you going to normalize batch processing in the middle of the night? Are you going to include development expenses in what it costs to provide your service?
From the finance perspective, you also have to add in the idea of, is this cost of goods sold or is this research and development? The latter entitles you to a tax credit in many parts of the United States, depending upon structure. The cost of goods sold feeds into your model of unit economics of what it costs, your cost of customer acquisition and how long it takes to turn a profit on them.
And it feeds into a bunch of KPIs that drive your business. Without visibility into it, what many companies do as a first pass is take the entire bill for a month and divide by the number of users during that time period. It's not a half bad first approach, but it's far from a comprehensive story.
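As a rough illustration of the two back-of-the-envelope numbers discussed here, the sketch below computes a hosting margin (one reading of the "revenue over dollars spent" figure Joe describes) and the first-pass cost per user that Corey mentions. The function names and all figures are hypothetical.

```python
def gross_margin(revenue, hosting_cost):
    """Fraction of revenue left after hosting cost (revenue must be positive)."""
    if revenue <= 0:
        raise ValueError("revenue must be positive")
    return (revenue - hosting_cost) / revenue


def cost_per_user(monthly_bill, active_users):
    """First-pass unit cost: the whole bill divided by users in the period."""
    if active_users <= 0:
        raise ValueError("active_users must be positive")
    return monthly_bill / active_users


# Hypothetical month: $1,000,000 revenue, $250,000 cloud bill, 50,000 users.
revenue = 1_000_000.0
bill = 250_000.0
users = 50_000

print(f"hosting margin: {gross_margin(revenue, bill):.0%}")    # 75%
print(f"cost per user: ${cost_per_user(bill, users):.2f}")      # $5.00
```

The arithmetic is trivial; the hard part, as the conversation points out, is deciding what belongs in the bill figure in the first place (development environments, batch jobs, R&D versus cost of goods sold).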
Joe: Yeah, my personal theory is there are very few organizations who actually know those numbers precisely, which seems just as a problem for the industry as a whole and surprising that 10 years in, there's not a better solution.
Corey: You're right, and part of the answer to this lies in the idea of cloud governance, where you start getting finance visibility into what engineering is doing. The challenge is in a lot of enterprises, especially those that are not "born in the cloud," to use the catchphrase: in these large, enterprise-based companies, you wind up with finance dictating to engineering what's going to happen, and that's not a terrific model.
The converse of that, where in some shops you see engineering deciding what's going to happen and finance doing reconciliation as a trailing function usually leads to much better feature velocity. That's not necessarily the best case for many companies who are beholden, for example, to public markets.
Joe: Yes, which is I think part and parcel of what's given rise to what I think is starting to be called FinOps. That term existed before and is being co-opted, but it's this notion that you should have regular meetings between the engineering function and the finance function--
The engineering function is ultimately responsible for creating cloud usage, and with the finance team you can review every month to see large deltas and understand them, and finance can ask what changes to expect next month. It's a simple idea but was pretty powerful, in my experience.
Corey: Absolutely, I am somewhat surprised by the number of engagements I've had with my clients where I get to introduce engineers to people who work in the finance department at the same company.
Joe: And explain to them why they might care about each other's work.
Corey: Exactly, I've had, "Oh finance, oh so are you with Corey's consultancy?" "No, I work for you, I work here the same way you do," and it turns into a sometimes tragically hilarious moment. But there are historical reasons for this.
I'm a big believer in setting up conversations, in making people think about this from another point of view.
I will say that it's a much more nuanced market than I thought it was when I first started this up a few years back. Oh, I'll go and I'll fix the bill, how hard could it be? It turns out that it is a deceptively complex area.
Joe: So do you find yourself being brought into Accounts to help with this whole Gordian Knot? Typically by engineering, by finance, by senior management? What's the usual avenue?
Corey: Good question. It's something of a mixed bag. There have been engagements where Finance has no visibility in what Engineering does, Engineering swears it's optimized. And then the money keeps going up, finance finally cuts them off with a budget, and then they blow $50,000 a month off of what they're spending instantly.
And at that point, finance is confused. "Do I just start setting arbitrary limits? I don't think I have a bad actor here but it would be really nice to be able to plan and get visibility into this."
Being able to speak a finance language, albeit with something of an accent, is something that definitely helps me have those conversations.
Conversely, engineering leadership often cares about this, but fundamentally, the person who's going to bring me in successfully most of the time has responsibility for the P&L that includes the cloud spend.
Joe: Cloud spend, right.
Corey: I've had conversations with very well-meaning, good citizen engineers who were incensed that their company was wasting $80,000 a month when it should have been $40,000 a month. You dig a little bit and you find out there are 50 engineers working on that, they're chasing a $20 billion market opportunity that in six months will either be shut down as their larger parent company focuses on something else or will be expanded and optimized in.
And I'm sorry, but at that scale, your engineers are embezzling more in office supplies than they're spending on cloud services. The juice isn't worth the squeeze. There's a bigger strategic picture.
Joe: Yeah, it's always important, I think, to have perspective on how much leverage you're ultimately going to get out of the problem. So, sticking to the engineering side, just because that's where I think our audience trends: if you're on the engineering side at a small, growing SME and you're responsible for some amount of cloud spend, what's your top piece of advice? What's the sharpest edge that's going to destroy you in a year or two when you add another zero to your spend?
Corey: That's, in many cases, a large part of what I do, where not all of my clients are giant enterprises who've been around for 200 years. I have been retained in the past to speak to small startups who are just looking at building things up now, and they're spending two, $3,000 a month. They don't care about that money, they care about how it's going to scale and what the constraints are going to be.
Joe: As the business grows 10x, 100x.
Corey: Exactly, and that transitions relatively rapidly into a cloud architecture conversation and what their application is doing and how. When you wind up paying per gigabyte transferred out to the general internet, when you have small volume it doesn't matter, when you have 50 million users, suddenly that winds up being the GDP of a small nation state if you haven't planned for it appropriately.
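A minimal sketch of why per-gigabyte egress pricing is invisible at small volume and painful at scale. The per-GB rate and the per-user traffic figure are hypothetical placeholders, not any provider's published price, and real pricing is usually tiered rather than flat.

```python
def monthly_egress_cost(users, gb_per_user, rate_per_gb):
    """Flat-rate estimate of monthly internet egress cost in dollars."""
    if users < 0 or gb_per_user < 0 or rate_per_gb < 0:
        raise ValueError("inputs must be non-negative")
    return users * gb_per_user * rate_per_gb


# Hypothetical inputs: 2 GB transferred out per user per month at $0.10/GB.
GB_PER_USER = 2
RATE_PER_GB = 0.10

for users in (1_000, 50_000_000):
    cost = monthly_egress_cost(users, GB_PER_USER, RATE_PER_GB)
    print(f"{users:>12,} users -> ${cost:,.2f} per month")
```

Under these assumptions the same architecture costs about $200 a month at a thousand users and about $10 million a month at 50 million, which is why the conversation turns to architecture rather than discounts.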
So, understanding the bottlenecks is always important. That said, as a global industry trend, something like 60% of all cloud spend at AWS is EC2, based upon what I've seen in my client base. There are another four services beyond that that round that number up to 85%, and then there's a long tail of all the different esoteric services that they offer. No one has ever brought me in to optimize their Amazon Chime bill.
Joe: Right, not yet, at least. And I imagine those other services, once you lump in things like RDS, which is technically not compute, are effectively very thinly veiled compute.
Corey: Absolutely, and that's most of it. If you want to get into specifics, EC2 is the big one, and the remaining four in no particular order are data transfer, RDS, EBS, and S3.
Joe: I think the one that may surprise some people who haven't been analyzing their bills monthly or working through a negotiation on a large enterprise agreement with a cloud provider. I think the data transfer is the one that tends to shock people.
We intuitively understand that, yes, we're going to have computers that run our code, there will be databases that store data, there will be some drives, a load balancer obviously for any web-based service is pretty obvious.
But when you are sitting and looking and seeing a bill with five zeroes next to a word that says "data transfer," or more zeroes, in some cases, that can be a shock to some people.
Corey: Absolutely, and if you were to tell me, "Here's my application architecture, what will the data transfer cost, within an order of magnitude?" I'm often hard-pressed to come up with a reasonable answer. And the reason behind that is that it is one of the most esoteric and difficult to understand aspects of cloud billing.
I've done projects for large on-prem companies who are trying to build out a business case for migration to cloud. And one of the big questions they have is, "What will it cost?" Which is sort of the entire point.
And the first stage is generally they start instrumenting their environment to get them realistic answers. "Oh, what do we need that for? We already know those numbers." "Great, how many gigabytes a month transferred between your application servers and your databases?" "Well, why in the world would we need to-- oh dear, you're not kidding, are you?"
Joe: Well, and it gets better, at least in my experience. If you are using, say, a horizontally-scaled database and you're architecting it such that it's replicated across multiple availability zones because you're cloud native and you know that availability zones can disappear, the replication inside of your database causes data transfer costs.
Corey: Only for some platforms, not for others. Take AWS, for example. I will drop a link to this in the show notes, but I built and open-sourced the latest version of a chart that shows data transfer cost using a map. And it is hilariously complex. I like to trot that one out in presentations somewhere, because people intuitively grasp this is a complex area of billing. But until they see it, it doesn't really hit home.
Joe: Yeah, there are different costs for inter-availability-zone traffic, in versus out, and there are different costs per region. Certain regions have special cost relationships to other regions, if I recall correctly.
Corey: Absolutely, for example the São Paulo region will cost 240% of what US East 1 in Virginia will, from a data transfer perspective.
Joe: Just within the region.
Corey: Sorry, if you get from that region to the general internet. Internally, it's mostly the same, but some places are hard to get to. There is a half price data transfer deal in effect for everyone, this is public. Between Virginia and Ohio, it costs the same as inside of a region transfer going between those two regions.
Joe: Well, that is probably interesting segue, I want to talk as terribly fascinating in a very unhealthy way as I find spelunking into the details of cloud billing, I did want to talk with you today about a couple other topics on cloud, and so you may have heard there's a piece of software called Kubernetes.
Corey: Yes, I believe it's named after the Greek god of spending money on cloud services.
Joe: Ah yes, yeah, so you have. Alright, well I want to talk about a couple of the-- I mean, cloud is going through increasingly compressed revolutions of innovation, so I want to talk first about Kubernetes and are you seeing this out in the wild? Are you seeing your customers looking to leverage it? You made some points about having cloud-specific architectures, what are you seeing in the wild now?
Corey: One of the things I do in my spare time, because I have so much of that these days, is I'm an advisor to a company called ReactiveOps, where they tackle the world's problems by judiciously pouring Kubernetes all over them. Slightly more seriously, they engage with companies that are doing Kubernetes migrations and help with the build, the run, the architecture of that and effectively mapping what their existing application stack looks like into something that is more schedulable from a container perspective.
What I see coming out of that is a lot of companies are eager for a capability story, but they don't want to spend a few years refactoring the 20 year old legacy application that actually makes them money. There's a bit of a question sometimes as to whether the juice is worth the squeeze. That said, Kubernetes is exciting in that it unlocks a bunch of new capabilities.
It's terrifying in the context that it adds a tremendous level of complexity. And when you have now been running a Kubernetes application for six months and suddenly one day it's slow or intermittently doesn't work, there's suddenly a lot of layers you need to be able to unravel to isolate where that problem is.
This is not a new problem, you can say that about any exciting technology for the last 50 years, but as the difficult details get abstracted away, it becomes less and less clear at times what code is doing.
Joe: I think it's a matter of the trade-off with adding that extra layer of indirection. One of the things that's kind of always fascinated me, and this is probably showing my age in the cloud, is that in the early days, cloud was still viewed by those who know better as a toy, eerily similar to the rise of virtualization.
Like, "Oh, the virtualization, that's cool." But you would never run production there. And then it was the same refrain, "Oh, cloud, that's very interesting." But you would never run production there. Which quickly then led to some conversations about that, but at the time, you have to remember, cloud was object storage, compute, load balancer, and I think just barely database, right?
Corey: Oh yeah, and I was right there with you telling you why all these things were going to be a flash in the pan that never went anywhere.
I truly am a cloud economist, I get predictions wildly wrong and there are no consequences for it.
Joe: Yeah, well one of the things that was interesting as even when the tide started to turn, I think people started to understand what was really happening was then there was this rush of the architects, people who have architects somewhere in a seven word title. And I remember having this conversation, and we had already at that point had been running a production application, our business in the cloud, for about a year and you start to have all these conversations with enterprise, software super architects.
And they say, "Well yes, we will do cloud. But of course, because we're professionals, we must be multi-cloud." And that always struck me as odd as a practitioner who was actually down writing code and running and thinking the actual cost of doing that obviates basically a lot of what you're getting. If you're not leveraging all of the high level services that cloud provides, which at some level locks you in, at least for some period of time.
Corey: I wrote a blog post on this called The Myth of Cloud Agnosticism, and I'll throw a link to that in the show notes, but what you're getting at is a somewhat common and hotly contested argument I've been having more or less with anyone that holds still long enough, people on the bus, etc.
But the premise here is that if you take a workload and want to have it run between multiple providers, on the one hand, Kubernetes makes this a lot easier. On the other, you're often giving things up in return for an ethereal capability of maybe being able to migrate providers at the push of a button.
Though, somehow you never do it, and what you've given up is feature velocity, because there are a lot of native platform offerings from all providers that you can leverage to get further ahead than doing it all yourself. It's 2018, last time I checked; you probably shouldn't have to build your own database replication system. That is something that these providers will give you.
If you instead take a step back and have to only work with the aspects of these providers that are compatible between them, then you're really giving up a lot in favor of not serving any of your particular goals. You're not going to be able to move nearly as quickly and you're not going to be able to hit your business objectives during the same timeframe if you go down that path.
Joe: Yes, I found it interesting that, and part of the reason I brought it up in the context talking about Kubernetes is one of the things that Kubernetes has done is really, that conversation I felt was somewhat dead and settled for a couple years and then with the rise of Kubernetes, the multi-cloud is back with a vengeance, especially it was a weird conversation to have when cloud was all of four services and now Amazon has hundreds and Google and Microsoft are close on their heels with the number of specialized offerings they have.
Corey: Absolutely, in many ways navigating the service catalog becomes its own challenge. To be clear, I'm also not saying this is an absolute rule. If you are a very large enterprise spending $100 million a year on the cloud, just by default giving all of that to one particular provider, first.
Joe: With no leverage.
Corey: Yeah, well first you have no leverage. Secondly, it does raise fascinating conflict of interest questions of why did you name your boat after them? But, so I'm not saying you should never do business with anyone except one vendor, no vendors partner with me and no vendors pay me to say that on their behalf this month, so instead what I think makes perfect sense is to definitely go with multiple cloud providers, but not on a per workload basis.
Joe: Right, and I think the key, the other thing you hit there, is it's more I think people underestimate the order of magnitude of spend at which that makes sense by maybe a couple levels. And so absolutely, as you said, if you're into, I think first of all some people may be surprised to understand that cloud providers have customers who spend in the hundreds of millions of dollars a year.
Basically, once you're at that level, obviously it makes sense. But there are lots of people who are spending what amounts to I don't know, six engineers in San Francisco on cloud talking about going multi-cloud and I don't believe the economics work out there, personally.
Corey: Almost never. There's also an interesting pattern where a reason people want to go multi-cloud is, what if Google or Microsoft or Amazon or Oracle or IBM or Ted's Taxidermy and Cloud Provider suddenly winds up doing something that displeases them strategically and they have to migrate? They don't want to be locked in, so they start building out this system where they can take everything they've built and deploy it elsewhere.
The challenge is, first, they almost never do. The few times that a company migrates from one cloud provider to another, you can tell. Because the cloud provider that they're migrating to suddenly is posting about this on their blog, they're inviting the company to talk about this at keynote events, it's very public because it's rare. Secondly, the effort of doing those migrations is generally less than the effort of maintaining that level of faux agnosticism over a period of five to 10 years.
Joe: Yeah, and I think it's an interesting segue, possibly into what I want to talk about next. And so, if you want to talk about lock-in. You're familiar, I assume, as I understand an ardent consumer of serverless capabilities.
Corey: Oh absolutely, I am considered to be an expert in the world of serverless technologies because I've been using them for about 2 1/2 weeks. That's about all it takes right now because it's still new and
the emerging consensus around virtually every aspect of serverless technologies today is that everyone else is doing it wrong.
Joe: Right, except you.
Joe: So, a couple interesting things on cloud. I think definitely what impact, I have my opinions but I'd like to hear yours, what impact on cloud economics does serverless computing have?
Corey: Just a quick definition of terms, I'm talking about functions as a service. Any time you have a managed platform that you don't have to log into or touch, it could be argued to be serverless.
Joe: Yeah, we're not talking about platform as a service, we're not talking about managed black boxes, we're talking about functions as a service.
Corey: Right, if you do an apples-to-apples comparison, which is harder than you might think, you generally will spend somewhere in the neighborhood of two to five times as much on running something in Lambda than you would running the same thing in EC2.
Joe: And we're talking about some workload that is roughly saturated in EC2.
Corey: We are talking perfect level of saturation, from a resource perspective here. First, you're never going to see that. Two, you still have a fair bit of overhead with respect to managing, caring, and feeding for that infrastructure on the EC2 side. What you're getting from the idea of something like Lambda is that the cloud provider handles a lot of this for you, and all you really have to worry about, in theory, is the code itself.
You hand them code, they run it when certain things happen and that's it. There's merit to that, there are times where I've spoken with companies where there's been an economic model where they'd love to move to Lambda but it would cost them three times as much as they're currently spending for that same workload on EC2.
At huge scale, it often doesn't make sense. In practice, there are times where this does work out, not purely from an economic model, but from a capability story. Remember, economics is not just about the dollars and cents that you're spending, there is an opportunity cost. What are you having to spend to manage existing infrastructures? What are you having to spend to learn the new paradigm for serverless technologies? And how are you placing your bets? There is a lock in story that is much more severe with something like this.
Joe: So one thing I'd be curious about is whether you've seen this in your travels. I have this working theory that serverless, or functions as a service, rather, might be the first cloud innovation that is ultimately pushed forward and adopted by the traditional enterprise ahead of software-as-a-service startups, the hipster enterprise. And the reason for that, which I think is worth spelling out, particularly if our listeners have been working on internet properties their entire careers, is that in your traditional enterprises there are an incalculable number of workloads that never see the public internet. They're hosted somewhere on the internal LAN, and they have potentially 20 users who ever log into them.
But yet these 20 users perform tasks that generate oodles and oodles of money. So a large corporate shipping company, for instance, will have custom in-house logistics applications for the planning team. A handful, dozens, or even hundreds of people, an impossibly minuscule number of users from a compute perspective, will use them to drive hundreds of millions of dollars of revenue.
So from a cost perspective, a single app doesn't matter. Add all these up and you've got lots and lots and lots of EC2 instances, or even containers in a chunk of an EC2 instance or a compute instance, that are almost always idle.
Joe: And serverless, and this is why I think you see some of these forward-leaning legacy enterprises driving harder towards this because they actually have workloads that will be dramatically impacted from an economic perspective.
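The trade-off between Corey's "two to five times" figure and Joe's mostly idle internal apps can be sketched as a break-even calculation. This assumes a deliberately simple model: the server costs a fixed amount regardless of load, and the functions-as-a-service cost scales linearly with utilization, reaching the stated multiple of the server cost at full saturation. The function names and dollar figures are hypothetical.

```python
def faas_monthly_cost(server_monthly_cost, utilization, faas_multiplier):
    """Estimated pay-per-use cost for the work a server does at a utilization.

    Assumes cost scales linearly with utilization and equals
    faas_multiplier * server_monthly_cost at 100% utilization.
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return server_monthly_cost * faas_multiplier * utilization


def break_even_utilization(faas_multiplier):
    """Utilization below which pay-per-use is cheaper than the fixed server."""
    if faas_multiplier <= 0:
        raise ValueError("faas_multiplier must be positive")
    return 1.0 / faas_multiplier


# The two ends of the "two to five times" range from the conversation.
for multiplier in (2, 5):
    print(f"{multiplier}x premium -> break-even at "
          f"{break_even_utilization(multiplier):.0%} utilization")

# Hypothetical internal app: $100/month server that is busy 2% of the time.
print(f"${faas_monthly_cost(100.0, 0.02, 5):.2f} per month on pay-per-use")
```

Under this model a 5x premium still wins whenever the server would be busy less than 20% of the time, which is the regime Joe describes for internal line-of-business apps. It ignores the operational and learning costs Corey raises, so treat it as one input rather than a decision rule.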
Corey: You're absolutely right, except you missed a step. The first experience most of these enterprises have with Lambda is when they're using it to spackle over a service gap in what AWS has given them, of propagating a tag from an instance to a volume, from a volume to a snapshot, a lot of the heavy-lifting style stuff that they had a bunch of scripts running before, now they can do it on instantiation that fires off an event that causes this thing to work.
It scales as much as you need it to, and once you've written it, you never have to think about it again. Once they do that, it's, "Wow, that was easy, it was fun, it's a resume builder," which in many of these large corporate enterprises is not something that comes along as often. "What else can we do with it?" Phase two is exactly what you just described.
It starts off as a toy that solves an annoying problem in a fun way, and then it turns into something that's being used much more noticeably with much higher impact. That's an adoption pattern I've seen repeatedly.
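The tag-propagation glue described above mostly comes down to a small piece of set logic. The sketch below shows only that logic as a pure function; the event wiring and the actual AWS API calls are deliberately omitted, and the function name and sample tags are hypothetical. It assumes tags in the list-of-`{"Key", "Value"}` shape that AWS APIs commonly return, and skips keys with the reserved `aws:` prefix, which users cannot set.

```python
def tags_to_propagate(source_tags, target_tags):
    """Return the source tags that the target resource is missing.

    Both arguments are lists of {"Key": ..., "Value": ...} dicts.
    Keys starting with "aws:" are reserved and are never copied.
    Tags the target already has are left untouched.
    """
    existing_keys = {tag["Key"] for tag in target_tags}
    return [
        tag
        for tag in source_tags
        if not tag["Key"].startswith("aws:") and tag["Key"] not in existing_keys
    ]


# Hypothetical instance and volume tags.
instance_tags = [
    {"Key": "team", "Value": "billing"},
    {"Key": "aws:autoscaling:groupName", "Value": "web"},
    {"Key": "env", "Value": "prod"},
]
volume_tags = [{"Key": "env", "Value": "prod"}]

print(tags_to_propagate(instance_tags, volume_tags))
# [{'Key': 'team', 'Value': 'billing'}]
```

In a real function the returned list would be handed to whatever tagging call the platform provides; keeping the decision logic separate from the API calls makes it easy to test without an AWS account.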
Joe: And I think, to your point, what's interesting is it seems like the cloud providers themselves have noticed this spackle capability of these function-as-a-service platforms and are starting to embrace that wholeheartedly themselves. With different providers, you see announcements all the time: "Oh, this particular event is now natively, or this black box service is now natively, hooked into our function-as-a-service platform."
Corey: Absolutely, they're starting to expand this everywhere they can, and it's a rising tide that's really starting to lift all ships.
Joe: It's still really early, very early in the serverless life cycle. What do you think are the most interesting challenges right now for adopting serverless as an architecture, as a primary development platform?
Corey: A lot of it is cultural. You've got to get past the screaming hordes that insist that serverless runs on servers. Another big challenge is observability into the space, how you figure out what it's doing. At scale, this becomes a very challenging area to play in. You can combine those two semantic arguments as well and get into observerless, and I had a screed on this at observerless.com.
Observerless.com, we'll throw a link to that in the show notes. But that winds up getting to a point where now you can just ignore everything that it's doing and just assume everything's going really well. That's a sarcastic take on a concept I don't seriously recommend at significant scale, but there are times when you don't need to know what every cron job in your entire company is doing at all times.
Observability is one way to play into this, but so is understanding how to wrap your head around it all. If you flip to the getting started page of the Lambda documentation, it's several hundred pages in, because the rest of it is foundational knowledge and has to come first. And the complexity here is massive.
It's similar to EC2 as it was 10 years ago, where you had companies like RightScale, whose entire value proposition at the time was, "We make this something a human could understand." These were new concepts, AMIs, snapshots, instance IDs, all of this was very complex. Now, it's click a button in the console and it works.
You see the rise of things like the Serverless Framework that are making this a lot more accessible and a lot more extensible. They're doing something fascinating in that they adopted support for multiple providers early on.
Right now in the industry, it's Lambda from AWS plus a bunch of other stuff where, well, it was a nice attempt. Maybe for their customers it's something handy, but the world is not adopting it en masse in the same way. The Serverless Framework has embraced all of these things and is willing to support it if it exists.
That means that there is an agnosticism story there that starts to shy a bit away from the lock-in stories that we tell. And it's getting to a point rapidly where even someone who has little experience with this can start playing with one of these frameworks and get there almost instantaneously. You don't need the two years of foundational knowledge to make sense of this.
Joe: Yeah, it's an interesting challenge, and like you said, I think Serverless, the framework, is doing a lot of good work there. Because on the one hand, I literally just hand you code and everything is built into the platform.
It is, at one level, the ultimate lock-in. But then at the same time, I'm only giving you code, so with some amount of work, I should be able to just give that code to another function-as-a-service platform. I mean, it's conceptually a very simple model. I invoke the function, I pass some information in, an event, and it returns some result, potentially.
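The model Joe outlines here, where an event goes in and a result may come out, is also what makes the code portable in principle. One way to sketch it is to keep the business logic in a plain function and give each platform only a thin adapter. `handle_event` and `other_platform_entrypoint` are hypothetical names, and only the `lambda_handler(event, context)` shape follows a real platform's Python convention (AWS Lambda's).

```python
import json


def handle_event(event):
    """Platform-neutral logic: take an event dict, return a result dict."""
    name = event.get("name", "world")
    return {"message": f"Hello, {name}"}


def lambda_handler(event, context):
    """AWS Lambda entry point: the platform passes the event and a context."""
    return handle_event(event)


def other_platform_entrypoint(raw_body):
    """Hypothetical adapter for a platform that hands over a JSON string."""
    return json.dumps(handle_event(json.loads(raw_body)))
```

The "some amount of work" lives in the adapters and in whatever platform services the core function calls out to, which is where the lock-in tends to hide.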
You mentioned Amazon's dominance in functions as a service, which correlates pretty much to Amazon's continued dominance overall. Again, having been along for the ride, so to speak, for the whole 10 years, it's always fascinated me how to this day Amazon's been able to maintain their edge over the other providers. Although, obviously, Azure and Google are both making big strides in different areas.
We talked about the service catalog earlier. So one of the things that I always find interesting to contemplate is that Amazon famously has this notion of two-pizza teams.
Corey: In order to be on one, you must be able to eat two entire pizzas by yourself.
Joe: Yeah, it's roughly that.
Corey: The only way to become 10x the size of other engineers.
Joe: I had not thought of that. Yeah, a 10x engineer on a two-pizza team, that math gets dangerous. But it works out to roughly, and I've never worked at Amazon and I know no one there will speak to this, but the colloquialism is that a team of 10 to 14 people can work on an idea, and if it clicks, get it out into production pretty easily.
And if it works, they'll run with it. If it doesn't work out-- and so this has led now, on the good side, to where they've been able to continue to rapidly innovate. They've brought out, it must now be over 100 services.
Corey: Roughly 130 the last time I checked.
Joe: Okay, 130. Yeah, I wasn't sure if it had cleared 100. So 130 services in 10 years, or I guess now close to 12 years, actually. The one thing I find interesting, though, and which I personally believe comes from this organizational design, is that services that you would expect to integrate or work together often do not, and in very confounding ways.
I do wonder if at some point this starts to collapse in on itself a bit. I mean, what are you finding in your conversations with your customers?
Corey: What I found interesting about a lot of this comes up when I speak to people who work at Amazon themselves, usually in under-NDA conversations or after plying them with a whole bunch of beer, which effectively works out to the same thing. Because if you drink with someone, you trust them, oh yeah.
But what comes out every time I finally get to the right person and start politely berating them about why the service doesn't do this very simple thing: never have I heard, "That's an incredible idea, no one has ever suggested this before." There's always an excellent technical reason that goes directly back to this.
I continue to be impressed by the strength of their engineers. And the answer in many cases works out to, yeah, it turns out that when we built S3, we didn't think about the idea of serverless as it stands today. If we had and we'd architected things a little differently, the following list of 10 things would be much more easily accomplished from an engineering-side perspective.
There are always reasons behind why capability limitations are there. It's not because they don't care about serving their customers, and it's not because they hire lazy people. A lot of this stuff is deceptively complex.
And one thing that's kind of neat about Amazon, but also subjects them to possibly more than their fair share of criticism: they don't usually publish road maps of where they're going with most of their platforms. So suppose they were going to be perfectly honest and say, "Oh, there's this one annoying feature that needs to be done. Here's a list of 10 things that need to be done first, we're waiting on two teams to do that, and unfortunately they have other things that are more pressing, so it's going to take some moving around."
No one's going to want to follow that Gantt chart of what it's going to take to get that feature out the door. Instead, one day it shows up on the blog or onstage at Re:Invent or magically appears in the documentation and people rejoice. Or they should rejoice. Instead, they say, "Grr, that took long enough," and then they find something else to complain about. I say this as a world-class complainer myself. I am not immune from this particular failure mode.
Joe: Yeah, the question came more from a place of, obviously the model has been successful beyond what anyone could have possibly imagined, except perhaps inside of Amazon. I do wonder how it scales to 200 services, 500 services, 1,000 services.
Corey: Oh, absolutely, and the key I think is eventually they're going to have to start tiering some of these services out. If someone wants to get started with AWS and they know nothing else about it, there are foundational tier-one services they should know about.
EC2 is the no-brainer; RDS, potentially; S3, absolutely; SQS or SNS, maybe. And it broadens out past that to a point where you can go so far into the weeds: "Well, this is a service that winds up running a call center and doing some VoIP stuff, giving you hunt groups so you can have a bunch of people sitting in a call center and scale dynamically." Great. That is a very good solution for a very specific customer.
Ninety-nine percent of Amazon customers will never touch that service, because it's not relevant to the problem they have. For that 1%, that is a game changer. And increasingly, a lot of the services that are getting released do exactly that.
If you're not into VR or AR, you would not care at all about Sumerian. If you are into those industries, you are absolutely going to care about everything that group does.
Keeping up with what's important and what isn't is going to differ depending upon who you are.
I think that they have a long way to go as far as presenting that information in a more understandable way. Having to scroll three pages on a 5k monitor is probably not a great indicator that this is as easily presented as it could be. Not to mention the fact that understanding what all these things do is ludicrous. We've now crossed the point where I can make up a service that doesn't exist and not get called on it when I'm talking to Amazon engineers.
Joe: That's an interesting game. There's something else I want to talk about today, because I think it's part of the way that problem is solved by the different providers: what I think of as the human aspect of cloud, right?
As engineers, the reason we all got excited about cloud in the first place was that it turned this thing, getting servers online so I could get my code onto them, from a messy human political problem into a programmatic operation. So it got rid of the messy humans and gave us just the nice machines.
But as you scale and grow into it, I think the human aspect, and I say this as someone who was a large consumer in a prior life, the human aspect of cloud and the operators is actually a pretty important differentiating aspect.
Corey: It absolutely is. If you are building your entire company on top of a cloud provider's offering, as most companies starting up today are.
Joe: Most cloud native companies, right.
Corey: And even to a point, some that aren't. I mean, if you take a large legacy enterprise where you're spending, I don't know, let's call it $100 million a year on a cloud provider, the same story applies, which is they are no longer your vendor. They're your partner, whether you want them to be or not.
If they have a bad day, I promise, so will you. And picking a vendor from that perspective at the outset is important.
A number of players in this space are approaching this from the idea that we can just abstract this away, we can let the algorithm figure things out, we can have all of this stuff handled by technologies so we don't have to take a phone call from our customers. That's great for the engineer coding on something at two in the morning. It's really crappy when someone is having a business problem and wants to speak to someone at that provider who is empowered to do things.
We've seen stories about cloud providers accidentally turning off production infrastructures in the past. We've seen some providers dealing with communications challenges across the board, and it's important when you're building something that you intend to scale out, that you establish a good relationship with your provider. If your provider doesn't know how to speak to people, I question whether they're going to be able to do that effectively.
Joe: There is this notion, I believe, especially because the major cloud players, if you look at the big three, Amazon, Google, and Microsoft, all came out of-- and I always felt this was the case, particularly back in the early days, when there were a large number of more traditional data center companies all saying, "No, we're going to play in the cloud."
And the notion is that there are two requirements to be in the cloud. One is that you have to fundamentally be a software company. And you also have to have a multi-billion-dollar war chest. That's the ante to get in if you're thinking of cloud as the long game, and that's definitely, I think, played out, in that now it's effectively Google and Microsoft and Amazon leading the pack.
And so what's interesting though, coming as software companies, I think even the people there tend to think about this as an automation and programmatic platform. And I think you can see the roots of those companies, deep roots, way pre-cloud, in how they approach their support and interaction.
Amazon historically is a retail company, dealing with retail consumers who needed to say, "My thing I bought from you broke," "I want to return this," or, "You didn't deliver this." Microsoft sold enterprise software and had teams that would deal with large accounts. And then Google, I'm not sure they take phone calls, historically, on their products. And I believe that you can see that DNA in how they approach humans in their products. Is that something you observe?
Corey: Yes and no. Amazon has always been a bit of a strange animal in this context, and we'll get back to them in a minute. First I want to talk about Microsoft. Microsoft has 40 years of experience now dealing with very large companies and, if we're being perfectly honest, apologizing for software failures during that timeframe. They're very good at apologizing for why they just dropped your company's data on the floor.
Their technology has gotten markedly better, but so has their ability to have those conversations. If I wanted to pick a vendor based upon nothing else other than how exquisite the apologies are going to be, it would probably be Azure in this space. If you're asking me to pick which vendor I would go with from a perspective of doing the right thing and treating me as I would want to be treated, I would have to go with Amazon.
Their overriding leadership principle is the idea of being customer obsessed. And there are a lot of jokes that can be made about that, but it's evident in every conversation I've had. Even over drinks with people who have no formal relationship with me who work at AWS, when I mention a customer having a bad time, they're suddenly fascinated by the story and, "Can you introduce us? I want to look into that because that doesn't sound right."
There's almost a compulsive need for them to find these things that are impacting their customers' quality of life.
The edge cases I've found around this are always very nuanced and tend to be extraordinarily biased in one direction or another. But by and large, I do believe that from a provider perspective, they understand what doing the right thing looks like and they continually iterate toward that. I pick up nothing but a sense that they're operating in good faith and trying to make things better. I don't get the sense they're trying to automate out the need to ever talk to a customer again. That's something that's fascinating.
Personally, I spend all of $11 a month on my AWS bill, and most of that is covered by a survey I take at Re:Invent in return for giving them my honest feedback about what my experience was like. And so yeah, you want to pay me $75 worth of cloud credits to make fun of you for 20 minutes? Absolutely, I usually do that for free.
But they care, and I've had people reach back out to me in response to things I've written on that survey and ask for more detail: "Oh, you saw someone having a difficult time moving around at Re:Invent because they had limited mobility. Do you know who they were? We'd love to talk to them more about how we can avoid this next time." It's stuff that they don't have to reach out on, but they do. That buys some goodwill in my case, anyway.
Joe: Going back, the other interesting thing to me is that when you're first starting with a cloud native architecture, especially if you don't have any pesky customers yet and all your code is pure, things are very clean. If you've got five instances behind a load balancer and a database sharded and replicated across a couple of availability zones, everything is clean and easy to comprehend.
As you scale, as the number of independent moving parts in your company using different pieces of the cloud grows, I think you do reach a scale, or in my experience I found you reach a scale, where suddenly the problems that were easily solved just programmatically or just with the docs turn into-- and these are all, as is every business, at some level a sausage factory on the inside, and being able to talk to the people who are making the sausage matters. I went through this evolution, right?
Where at first I never wanted to talk to anyone, and then you get to a point in the end where you're like, "We have a problem, let's talk to the people, they'll help us solve it and everybody will be back to their job more quickly." And I think that getting to that notion of, "Hey, I have a problem, I should try to talk to somebody," is hard, particularly in engineering. I think for some people the urge to not talk to people is in some cases almost stronger than the urge to solve a business problem.
Corey: Absolutely, and for a while you can do that. I'm the same way, I wound up with a bug yesterday as I was provisioning a service myself. And it turned out that the secret to solving it was to just retry it a second time and suddenly it magically worked then. Great, there was a bug somewhere, I don't necessarily know where it is. But that's me playing around late at night.
If I'm running a business that is starting to scale out and customers are depending on it, past a certain point, despite my inclination to dig into these problems myself, reaching out to a provider who is able to have those conversations with me, and into some depth, is going to be important. If I'm not doing that, I'm not serving my customers very well.
Joe: Yeah, I think the lesson to take away, and this goes for vendors too: I was always frustrated as a vendor when a customer churned, and when you sent the survey asking, "Oh, why'd you leave?" they said, "Well, I had this problem." And you check with support, "When did they talk to us about this?" and it's like, "Oh, no one ever talked to us."
It's ultimately not incumbent on the customer to do that, but I do think customers should understand that almost every business, vendors particularly, and that's all of the companies that we work with here, wants you to be successful. And if you reach out to them, they will often solve your problem.
Corey: Absolutely, there's an interesting thing that I like about my business that I think also applies to what you're discussing. There's no one on the other side of reducing the cloud bill story. There's no one saying, "Yes, we think companies should waste money on the cloud." "Yes, we want them to just pour money into a bottomless pit and get no return for that."
No providers in that space, no vendors in that space, no one wants that to be the narrative. And I think you're seeing the same thing even when you look at companies that are in the middle of an outage: their competitors very often have engineers who are sending HugOps to them, "We're sorry for the-- hope you're back up and running soon."
People want to compete in a business context. They don't want companies to fail because of technology failures. That bodes well for our entire industry. And one thing I appreciate is that there is a strong sense of good will in the larger cloud native community that feeds into that.
Joe: Well, thanks, Corey. It's been a great conversation. Any other interesting things you're up to that the listeners should know about on your way out?
Corey: Wait and see. Re:Invent is around the corner and I'm sure something fun is going to come out of that.
Joe: Excellent, alright, thanks so much for joining us today.
Corey: Thanks, Joe.