August 15, 2019
Ep. #11, Chaos Engineering with Ana Medina of Gremlin
In episode 11 of O11ycast, Charity Majors and Liz Fong-Jones speak with Gremlin chaos engineer Ana Medina. They discuss the relevance of bre...
As he said, my name is Jeremy. I always like to start off with this slide
because it really shows kind of what we're doing. Everything is on
a computer now, and some of you may recognize this comic and the slide
that I took off.
Why am I here? I'm here basically to teach you about things that I've done well, things that I've done wrong. My mistakes. Why do we want to learn from other people's mistakes? Well, because it's better to let someone else get hurt than you, right? I'm going to tell you about some mistakes we've made. I'm also going tell you about some successes that we've had, and hopefully, you can learn some lessons, ask me some questions.
How many of you know what Reddit is? Alright, sweet. Online community, discussion,
etc., etc. It started back in 2005 with two UVA students
who applied to this thing called YCombinator, and they were rejected
because they had a crappy idea. They were then called back and told,
"We want you to come here, as long as you don't do that idea that you had, so
let's think of something else."
This is what Reddit looked like on its very first day. This is the very first iteration of Reddit. It doesn't look like that anymore, thank god. This is the traffic from inception of Reddit until when I left in June of 2011. And this is that same graph logarithmically. I like to show this because nerds like logarithmic scales because they like to see the steady growth. Before you scream "Relevant XKCD" here's the Relevant XKCD about how I didn't label my graphs.
This is another graph that I like to show. This is the costs versus page views. The yellow line is page views; the blue bars are costs. And what's interesting here is, basically, we were pretty much teetering on the edge up until we launched Reddit Gold, which was our first revenue- generating program, basically. We had ads, but Reddit Gold was sort of the big community revenue generation, and we were able to finally ramp up. But what was cool was that we were able to ramp and then sort of level off for a while. It's gone up since then, but that was a nice sort of scaling inflection point that we found.
This is a great quote that I like from one of the Reddit users. "If it won't scale, it'll fail." Then this is a quote that somebody pulled from me at some point which is basically, "The key to scaling is finding your bottlenecks before your users do."
I'm going to start by talking about infrastructure and what some tips, some things like that about infrastructure. Then I'll move on and talk about other things, servers, code, etc.
This is a brief timeline of Reddit's infrastructure, and it's moved to the cloud. Reddit started in a data center in Boston and we moved it to a data center in San Francisco shortly after I joined the company. And then we started using Amazon for various different parts of the business. First, just for the logos, and then for the thumbnails, and eventually for some offline batch processing. Eventually the entire site moved over there. I like to just throw this up there because I'm really proud of the data center that we had. This is the Reddit data center ten minutes before I tore it down because we had finished our move to Amazon.
As I mentioned we moved from self hosting to EC2. Basically, we set up a complete stack over in EC2. We were using it for overflow, but we had to eventually forklift all the data over there. We were so resource-constrained in our data center that we actually had to take the site down for about 12 hours so we could move the data because we couldn't have the load of copying all the data off the database and have it running at the same time. We started with the overflow, set up a VPN, just used it for batch, and then eventually migrated across. EC2 definitely made things easier, but it also made things harder. It's not a magic bullet.
How many of you are doing companies or stuff related to cloud, EC2, something like that? OK, so most of you. So, I don't have to convince you about the beauty and power of the cloud, which I usually have to do. Although not so much anymore.
A couple of different takeaways, and I'll go over this stuff again. The first question about your infrastructure that you have to ask when you're going to try to scale your startup is, web server or proxy? Or something else even. You can run a web server, there's static content and also hits your dynamic apps, or you could just run a proxy and have your applications serve HTML directly to the user, talk HTTP. At Reddit, we chose the proxy route. We actually started with the web server route and then moved to the proxy route when the web server just started to break down. Even today, I'm pretty sure, I know in June, but I think even today, it's just a single HAProxy instance that's running all of Reddit.
Another important question to ask yourself is " What about event driven and non-blocking web servers?" Should you start with that? Should you start with Tornado and Twisted, or should you start with something else more standard, etc ? It totally depends on your use case, but generally speaking, probably, you're going to want not those to start with. You never know. You might be working on something where it just totally makes sense. But oftentimes, it's just going to add more complexity.
Some mistakes that we made in our infrastructure. We didn't account for the increased latency in the virtual environment. This bit us really hard the first day that we moved. For every web page that we rendered, we'd make like a thousand calls to Memcached. When you have a 0.1 milisecond latency between your server and your memcached, that works out great. But when you suddenly change that to one second, that just destroys you. So we had to quickly change the code around to do more batches, less network, more batching. And in fact, this is a problem that I'm dealing with right now, something I'm writing for Netflix.
Another pain point that we had to deal with was that instances go away. That was not something that we were used to. We were used to hardware that was a bit more stable. In general, though, if you build your system right, it shouldn't matter. I'll talk a bunch more about that in a little bit, how at Netflix, especially, we design our systems to just assume that everything's going to fail at some point.
Quick protip: Security was not the first thought when they built EC2 and the other cloud stuff. If you are working in the cloud, remember that it's not native to their product offering. You want to think about that and maybe put a little bit of your own security in there first. And limits. A t least in the cloud, there's going to be resource limits. Even in hardware infrastructure, you're going run into limits like total network bandwidth, etc. So, keep an eye on it. Things that are going to limit you, you wanna keep an eye on. In EC2, that's pretty much everything. In your own infrastructure, that's going to be still your network and your power and things of that nature.
And then one of the other mistakes we made was relying on any single cloud product and assuming it's going to work as advertised. Nothing ever works as advertised. Once you accept that and learn to embrace that, your life will be much better.
Another mistake we made was we tried to do a little bit too much bleeding edge in production. I think looking back, we all agree that we moved to Cassandra a little too early. We moved to it when it was about 0.5, and that was a little bit too early. We never had any data loss, but it did cause some problems for us. It's a much more solid product now. It's certainly useful now. I mean, we use it in Netflix for pretty much all of our data storage and it's great. But when Reddit moved to it at the time, a few years ago, it wasn't nearly as great.
And my last infrastructure recommendation is to Automate. Automate. Automate. Treat your infrastructure just like your code. Check in your configurations. This, of course, is a lot easier if you're using something that's API-based, like a cloud or your own private cloud that you built on top of your own hardware. Whatever it is that you might have done, if it's all API-based, if it's all code-based, it's going to be a lot easier if you're checking in all your changes and doing everything the same way that you treat your code, and you do it with your infrastructure.
Now I'm going to talk a bit about architecture, architecture design, etc. This whole talk is supposed to be about scaling your startup. The first question is: Do you need to build to scale when you start? And I think the answer really is, no. You are going to prematurely optimize if you just assume scale to begin with. Start with that framework, whatever it is. Later on, I'll talk about frameworks. You don't need to build a scalable architecture. You just need to build something works and scale it out later when you hit those bottlenecks. As I said earlier, finding those bottlenecks before your users do is the key, but don't prematurely optimize.
As an example, here's the architecture of what Reddit looked like. I mean, this is stripped down, of course, but it's pretty roughly what they looked like. It pretty much still looks like that today. A little bit more complicated. It's monolithic. It's highly cached. There's basically two zones involved. Databases are replicated to different zones, and data in general is replicated. Then, to contrast that with Netflix architecture. Netflix is a service-oriented architecture. It's basically a bunch of little Reddits. Each service is running its own separate infrastructure, managed by its own team, each one doing its own thing.
This is a basic diagram of how Netflix works. Everything pretty much filters into the API to the edge services. The API calls out to all the other services and then assembles that into an answer and then sends that answer back to all the various clients and devices. Most of it is run on EC2. Some of it we run on our own hardware. We run our own CDN. The consumer electronic devices, computers, whatever, basically talk to EC2, which is the command in control. They get URLs from the CDN, and then they pull the files from the CDN.
Some of the advantages to a service-oriented architecture is it's easier to auto-scale. It's easier for capacity planning because you only need to do one service at a time so each service can be planned independently. It narrows the effects of changes so when you change one service you don't have this monolithic code that you just rolled out and got to figure out where is the bug, would I roll back, etc. You can do more efficient local caching because all of the data for that service is just in that one place.
But there are disadvantages, too. The main one being that you're going to need either multiple dev teams, one for each service, or you're going to need one dev team that's working on multiple services, constantly context-switching. Also, you're going to want a common platform because, otherwise, you're going to have a lot of duplicated work because everybody needs to figure out how to install the OS and run the web server and all that basic stuff.
The way Netflix actually tried to solve this problem was by building sort of a PaaS for ourselves. We built, essentially, a system where you can build service oriented architecture-type applications. At the very minimum, all you have to do is check in your code. It builds a machine image, and you get a machine image that runs that code. And most of it is Open Source, which is kind of cool. We're open sourcing more and more parts every day. The ultimate goal is to actually have a complete platform available for anyone to download and use.
The Netflix PaaS provides all of those features there: localization and multiple accounts, cross-region replication, handling thousands of instances. AWS provides basic stuff: infrastructure, machines, IP addresses, stuff like that, things that you could provide for yourself with your own hardware if you wanted to do. And then Netflix basically provides these higher-level abstractions of applications and clusters and stuff like that.
As an example, this is sort of the instance architecture at Netflix. The nice thing is that these are pluggable pieces. For my team we use Python. We actually don't use Django anymore. But the nice thing is, because we can go from here to here, we can just change the different parts. That's kind of how Netflix handles things. Netflix provides a platform, and then developers can change that as much as they want. Now, that probably isn't going to work so well for your startup unless you maybe leverage this platform that we provide.
Another thing that we do is we hold a boot camp teaching people to build for three. That's a scaling tip that works for any sized company, really. Just assume that you're going to need at least three of everything. And I'll get into that a little bit more later, too.
Then the last thing, of course, is all system choices assume that some part is going to fail. As I mentioned before, instances fail, hardware, fails, network fails, load bouncers fail. Design your architecture on the assumption that all of these pieces can and will fail at some point.
Another suggestion, don't follow the fads. Apparently, that's a popular thing to say. I mean, Postgres is a great database. Reddit still today runs on mostly Postgres. They use a lot of Cassandra, too. But, most of the data is stored at Postgres. Postgres is rock-solid, and I love Postgres. But, you don't necessarily, just because it's the new hotness doesn't mean it's also the best thing for you. It might be, who knows? Node.js and MongoDB might just be exactly the right thing that your company or startup needs, but it may not be. So, just remember, just because it's the fad doesn't necessarily mean it's the best.
What else do you need to worry about? Queues. Queues are great. Queues are a great way to sort of buffer between your clients and your data stores. Queues will save your ass in a lot of different cases. Locking. Try to avoid it. Locking is always a pain in the butt and it's going to bite you. Try to write your code so you don't need the locks. Embrace eventual consistency. Make sure that your code can just handle that this update maybe takes a few seconds to get here, or whatever.
Email. Don't deal with email. Email is a big pain. Deliverability is a big pain. I used to work in an email company, at Sendmail. I know all about this. And at eBay, I dealt with a lot of deliverability problems, too. Let someone else deal with that. SendGrid is great. MailChimp, if you need to do mailing lists. But let somebody else worry about that problem. You probably have other problems, too, that you need to solve, right? Other infrastructure, other specialized infrastructure for your startup. Can you outsource it? Is somebody else already doing that? Are they doing it at higher scale? Are they doing it better? Nothing wrong with outsourcing.
At Netflix, our policy is there's a SaaS, we try to use it if we can. We outsource as much of the stuff that we can so that we're really just focusing on our core business, what we do best.
And then, the limits. As I mentioned before, you need to watch your limits. You should also put limits on everything. Put a hard number of limit on number of API requests that could be made from any one IP. Make it a thousand a second, something absurdly high. Because you know what? One day, that's going to happen and you're going to be really glad that that limit was there because now you've blocked somebody from completely taking down your service, maybe because they just wrote a bad robot or something, you know? They were trying to scrape your site, they screwed up, they're hitting you 10,000 times a second. At least, it won't take you down. Or, maybe perhaps, you've been linked from the Yahoo front page, which is what took down Reddit the first time when somebody put a Reddit button on a page that was linked from the Yahoo! home page. We basically DDoS-ed ourselves with the Reddit buttons. We quickly figured out how to put that on a CDN, so we didn't have to deal with that problem anymore.
As I mentioned before, going from two to three is a hard problem to solve, but going from one to two is an even harder problem to solve. Just assume that you're going to need three of everything from the beginning. The reason this works is because if you build everything for three, then your horizontal scaling becomes much easier. Going from three to four to five to six is a lot easier than trying to figure out how to break your code apart to go from one to two or two to three. That's why I like to try to build everything with the assumption that there's going to be three of everything. And this is what we have people at Netflix do. We tell everybody, assume you're going to have one copy of your instance running in every zone in a region.
Monitoring. Monitoring is key. Monitor everything. Try to find the bottlenecks before your users do by monitoring. We used Ganglia at Reddit. Ganglia is a great product, but doesn't handle virtualized environments very well because it doesn't like instances coming and going. There's lots of other things out there now. At Netflix we have our own homegrown system. It's vaguely sort of a cousin of OpenTSDB. Graphite is a great place to start with your monitoring. There's lots of options in monitoring, but have something.
And don't just monitor server stuff. Don't just monitor CPU load and network. Monitor your business metrics. That's really where you're going to get your key insights from, looking at your Layer 7 type stuff, your URLs, which URLs are being hit the most, which ones are taking the longest to render, and your general business metrics. N umber of page views, whatever is important for you. At Netflix, our number one metric is starts per second, how often someone hits the play button and succeeds. This is the thing that are on the monitors all over place, and everybody's looking at this number. And if it starts to drop, we know there's a problem. And we have alarms on that.
That's the other part of monitoring, is alarms. Once you're doing all this monitoring, and you know what's important, make sure that you're alarming on the things that are important. Having the monitors go and having to actively look at them is a job for a computer, really. It's boring for a human. Nobody wants to sit and stare at graphs all day. As I mentioned, one of the mistakes that we made was not using a monitoring system that's "virtualization friendly."
Another important thing to note for scalability is, reliability is a balance between, well, you gotta balance between reliability and money, right? With infinite money, you could theoretically have infinite reliability because you could buy every piece of hardware ever in the world that you needed to use. But that's not going to happen. So, you're going to have to find the right balance for you.
Each company is different. Some companies, a minute of downtime costs them thousands of dollars, tens of thousands of dollars. For other companies, 20 minutes of downtime doesn't really mean anything because their customers will just come back. So, you have to do that calculation. For somebody like eBay, it's really easy. They know exactly how many dollars they make per second. They know exactly what downtime costs them. For something like Netflix, it's a lot harder. There's no direct cost to downtime. There's a perception issue. Customers are unhappy. How do you quantify that? It's a lot harder to do.
Reliability versus money, something else you gotta think about. And how is this related to scaling? Because reliability and scaling go sort of hand in hand. You can run your entire website on one server all the time. It's not very scalable, and it's not going to be very reliable as your traffic grows. Those two things go together. Scaling your application basically makes it more reliable. Usually. I suppose if you scale it wrong, you could make it more unreliable.
Another way that we make sure that we've scaled things properly, at least at Netflix, is what we call the monkey theory, where we simulate things that can go wrong and we find things that are sort of different, out of line. Some of you may have heard of the the simian army that we have in Netflix, the most famous being the Chaos Monkey which basically goes and terminates instances randomly in our production environment. We have other monkeys, as well, Latency introduces latency; other monkeys that go around looking for problems, reporting on problems, things like that. And the way that this helps us with our scaling is, the Latency Monkey is great! Because we can induce latency between two services, and so we're basically simulating a scaling failure on one service and making sure that all the other services are able to handle that failure. And then the Chaos Monkey makes sure that we've properly scaled our application and that we can handle the loss of an instance because that's going to be important for reliability and for scaling and for growing.
We use auto-scaling as well. I've always wanted to get us at Reddit onto auto-scaling but we just never really got around to writing the app to do that. It's really close, it's almost there. At this point, they can spin up application servers pretty much automatically, which is great. At Netflix we push a lot of our teams towards auto-scaling. Not everyone does it yet, but we're working on it. I'm really excited about the fact that the Cassandra team is going to start doing Chaos Monkey and auto-scaling soon. So, if those guys can do it, everyone should be able to, right? They're the most stateful service of all. Auto-scaling is great.
Those little spikes you see there, before the peaks, that's deployment. The way we deploy at Netflix is we spin up a whole new cluster of the application with a new code and then shift the traffic over. B ut auto-scaling has definitely saved us in the past, as traffic comes in, especially if there's an unexpected burst of traffic. The systems will scale themselves up. We call those Saturday mornings, when everyone comes and watches their cartoons. So, there's a huge spike in traffic on Saturday mornings. And the auto-scaling just keeps right up. We're actually working on some predictive auto- scaling stuff now, which is really cool. Hopefully, we can get that out to the public soon. But that's to try to predict sort of what the traffic curve is going to look like so we can actually start the auto-scaling before the traffic arrives.
I put this slide up earlier. I put it up again because it's really important to scaling. The best way to scale any process is to automate it because then, you can just replicate it, over and over and over again. Anything that you're doing by hand, you should step back and say, "Should I be automating this? How many times do I have to do this by hand before it becomes worthwhile to automate this?" And the answer is probably not nearly as many times as you think. We try to automate everything. We try to automate application startup, configurations, deployments, even system deployment. We have that tool that we call Asgard. It's open source. It was one of the earlier things that we open sourced. It's our orchestration platform and we use that to try to make sure that, basically, the processes are as regular as possible. That makes it easy for us to launch. Five hundred apps is just as easy as launching three application servers.
Essentially, what we've done is we've moved the granularity from the instance level to the cluster level. We look at things as clusters instead of individual machines. We try to not care about individual machines. If one of them isn't performing, we just sort of delete it. This offers us a huge advantage because we don't have to worry so much about individual machines. We don't have to worry about monitoring CPU, things like that. We do monitor those because if they are constantly spiking, that's a problem. But if one machine is having a problem, just get rid of it.
And then another thing that we do at Netflix is we fully bake our machine images. This is a little bit of a departure from the way most people manage their code. I t was a shock for me when I got there because I was surprised that this works at all. But now, I've really bought into the system. What we do is, actually, I'm just going to jump right through these here and show you the whole thing. Alright. There we go. What we do, under the traditional model you would have your base image, then you would use your Chef or your PubIt, or whatever, with your repos and you would push them all onto that one machine. But what we do at Netflix is we actually do all of that ahead of time so that when an instance is launched, it's ready to go. It doesn't have to do any sort of getting its configuration, or it doesn't have to go and pull things and hoped that it worked or anything like that.
And the reason we do that is, is for homogeniality. We know that if I launch 500 copies of an application, those 500 are the same. They're running the same code with the same settings. Nothing is different. Because with these other tools, you have to either take the risk that maybe the code didn't get updated on that one machine, or that setting didn't get pushed to that one machine, or whatever. Or you need to have checks in place. You need to be using something that makes sure that that didn't happen. But even while it's making sure, you have to wait. You either have to make sure not to send traffic to that machine until it's verified as correct, or things like that. Nothing wrong with that. I still probably would do it that way in most cases because it's definitely easier to do it that way. But the fully baked machine images does definitely give a nice advantage to make it, especially when you're getting to a larger scale, not having to worry that your instances may be different.
Basically, what we do is you check your code in, it makes an application bundle, it makes an RPM in this case. And then we have the foundation AMI that's mounted on EBS, and then that package is installed on that. You can define how it installs the package, all the changes that it makes. And that's how we make the base image.Then we make your application based on your configuration, put all that together, then it's ready to go. Then we can launch that. So, that's basically how we do stuff at Netflix.
Probably not the best way to do it for a small startup, especially because that whole process, at its quickest, is still about ten minutes. You check your code in, and you're still going to wait about ten minutes before you have a machine that's running that code. So when you're trying to be agile, you may not want to be waiting for that to happen. Often, when I'm developing a prototype, what I'll actually do is make the changes right in the test account on the baked image, and then pull all the files back to bake a final image when I'm kind of done iterating on that one. Occasionally, you have to log in to a machine in production and kind of iterate on it, which is a really bad idea. But you do it because you're being lazy, and you don't want to wait for ten minutes to bake a one character change.
Now, I'm going to talk a little bit about your code and scaling your code. The first one is picking your framework. As I mentioned before, you don't wat to prematurely optimize. Just go with the framework that you know, that's in a language you know. If you don't know any frameworks, pick one that's got a good community around it.. At Reddit, we actually chose Pylons. Django was available at the time, and we tried Django and the templates were too slow. That's why we didn't choose it. They've fixed that problem since then. Django is a great platform now, and I've used it for a couple of other different applications.
But the interesting thing about Pylons is that scaling Pylons is basically the same as scaling any Python, right? You run lots of independent app servers, you make them independent. We had to build our own caching layer. We had to build our own database layer. Eventually, we ripped a lot of parts out and replaced them. At one point, we had done that so much that we couldn't actually upgrade Pylons anymore. They've since fixed that, so now they can actually go stick with the latest version. Because a lot of stuff that we had done, Pylons sort of adopted. Would I use Pylons again? Yeah, it's called Pyramid now. But it was a good framework and I would probably use it again. But you know, that's my choice. Whatever framework you're familiar with that's probably a good framework to use.
When you're first starting out, they're all going to be perfectly fine for scalability. It's only when you start to really grow and you start to find those bottlenecks that you might actually have to start tweaking your framework. The more open it is, the more community it has around it, the easier it will be to tweak the framework instead of replace the framework. But generally speaking, frameworks are definitely going to help you.
And one little last bit. I hate to say it, but C is faster than Python. There's a couple of different parts of Reddit that are run in C just to make them faster. The filters, and that's the text filtering on the comments, stuff like that. The markdown, the rendering of the actual taking the text to the HTML, and the memcached interaction is in C â€” using a C library.
Another way to scale your development, your code, is to open source. There's a whole community of people who are writing a bunch of stuff for free. Take advantage of it. Contribute back to it, ideally. But take advantage. Somebody probably has an open source project that's done something which you want to do. Take advantage of it. I generally don't have to convince people of this idea anymore, but I like to mention it because sometimes people don't think about, like, "Oh, yeah, there's probably an open-source project that does X."
Stick with doing what you know, or what is your core competency. That's kind of the big "how do you scale your development team" trick is, stick with your core competency. Focus on that and let other people do the other work, other people being outsourced or open source or whatever.
So now some tips about scaling your data and your storage. I would contend that data is the most important asset your business is going to have. Your infrastructure is going to be getting easier and easier. And a lot of you are actually working on infrastructure for other companies, which is great. At some point the infrastructure part is going to be basically solved. Your interesting part is going to be the data. All of these companies have essentially built their businesses on data.
There's this concept called data gravity. Basically, what it says is that the bigger your dataset, the harder it is going to be to move. Reddit totally ran into this problem. But also, the other sort of related part is, basically, your applications are going to have to orbit around your data. Because wherever your data is, is where you're going to need to be running your applications. This was that problem that I mentioned before, where we couldn't move the database without taking it down because it was so resource constrained. And we couldn't just run the apps in the cloud and have the database over in the data center because it was too slow, so we had to move the data.
Nowadays, Reddit has so much data in EC2, it woul d be really hard to move it somewhere else. All of that infrastructure, all of that ecosystem has to orbit around that data, basically. Keep this in mind. There's a reason that all of these providers give you free ingress data but charge you to take data out. Because they want you to put the data in so that you are kind of forced to put your applications around it. If you keep this in mind from the beginning, this'll help you out. If you think about the idea of making sure that there's always a copy of your data in two different providers, then you don't have to worry about this as much.
Another important question about scalability, SQL versus NoSQL. First of all, what is NoSQL? NoSQL is sort of an overloaded term. It means a whole lot of different things. Basically, it just means not SQL. It could be any sort of way of accessing your data that isn't SQL-y; aka relational versus non-relational. P eople pretty much refer to anything that's non-relational as NoSQL. As it turns out, Postgres is a great non-relational database. Reddit essentially uses Postgres as a key value store. Postgres even supports key value now as a data type. As I mentioned before, Postgres is a great database for a lot of things. E ven though it's 20 years old, I think that's an advantage for it.
What storage are you going to use? How do you choose your storage? Is NoSQL the choice for you? Do you need to be using NoSQL? Maybe you do. If you want to distribute your data, it's probably going to be easier to take advantage of a NoSQL solution like Cassandra or something like that as long as you embrace the eventual consistency, as long as you can handle that eventual consistency. At the same time, especially when you're starting out, data schemas, unless you're really, really sure, you're going to want something that lets you have sort of an adjustable data schema, which takes you back to the NoSQL world. Where you sort of have the schema-less models. You can do this using relational databases. Again, going back to the Reddit example. Since it was basically a key value store, we used it in a way where we could expand the schema without having to modify the tables.
Expire your data. One of the mistakes that we made at Reddit that they still have to live with today is no expiration of data. You can go back and read comment threads from 2005. It's cool that you can do that, but generally speaking, your users probably won't even notice. At one point on Reddit, we stopped allowing you to vote on things that were six months old â€” more than six months old. It took us, I don't even remember, but many, many months before one person finally said anything about it. Nobody seemed to even notice the fact that you couldn't vote on anything that was old because most people weren't looking at that. But we had an entire infrastructure dedicated to supporting that because there was one person who was looking at it, and that was Google. Over and over and over and over again.
Google was constantly scraping all the old pages even though they couldn't change because nobody could vote on them anymore, nobody could comment on them anymore. We kept telling people, basically, with headers, "Hey, this hasn't changed, this hasn't changed." But they kept asking anyway.
Expire your data. Think about that ahead of time, what data can be expired, if any. If you don't have to store it, that's great. That's even better. On Netflix, we like to say that we run a logging service that happens to play movies. We take in a ton of data from our clients about play quality, streaming quality, bandwidth, etc. You may have seen the Netflix report on bandwidth, ISP performance. That's all from the data that we collect. Most of the data we throw away after a couple of months, though. It's great to have around so we can do deep analysis on all the different trends, but after a while, we get rid of it because storing all of that data would be untenable.
Another mistake that we kind of made that I think we made on Reddit is not using transactions as much as we could have. Since we were using it like a NoSQL database, putting the transactions into the database, we started that way. We started with the transactions in the database, but that didn't hold up very well. It would have been good to put the transactions into the code a little bit more, wrapping things up to make updates a little bit more atomic. But again, embrace the eventual consistency.
Another important tip, think of SSD as cheap RAM, not expensive disk. Treat SSD disks like RAM. SSD disks are probably one of the greatest things to come around for computing in a long, long time. It's totally changed the way that we can store and access data. We can do it much quicker than we could before.
Sharding. On Reddit, we sharded the data based on the type of data that it was. We could have gone farther. We could have sharded things based on first character or whatever, but we didn't need to. Cassandra basically takes care of that for you. Essentially, it'll move the data all around your ring of data so that your data is split up. At Netflix we run Cassandra in a way so that every chunk of data exists in three different zones. That way, we can lose an entire zone and everything keeps functioning. We test that about once a quarter by actually deleting everything in a particular zone just to make sure that we can keep running.
Thinking about these things ahead of time: Where am I going to put my data? Where am I going to store it? How am I going to grow my database as my data size grows, as my request rate grows, etc? Cassandra is really awesome. I like Cassandra. If you're going to use a NoSQL solution, it's a pretty solid one. It uses a lot of different tricks to make things better and faster.
The Bloom filter was a nice little one that helped us a lot at Reddit. When you load a Reddit page, if there's a thousand comments on there, chances are you voted on zero or maybe one or two of those comments. But we need to know that, so we need to ask the database for these thousand items, did they vote? The nice thing about the Bloom filter is it gives you very fast negative lookups. We can very quickly get back the answer. "For these 998, the answer is no, they didn't vote on this. For these two, here's the data." The Bloom filter was a really helpful, especially for rendering pages with votes. It's also got fast writes, it's distributed, etc.
Another way that we scaled Reddit â€” another big, big tip â€” is putting non logged-in users to the CDN. A non logged-in user on Reddit basically doesn't hit Reddit servers at all. It hits the CDN. It gets its answer from the CDN. A little protip, if Reddit is ever down, all you have to do is delete your cookie and you'll get the last cached version from the CDN instead. Because the CDN won't wipe out the cache if the service is down. When I was there, logged-out traffic represented like 80% of the traffic. It's less now; I think it's 50 or 60% now. But regardless, it offloads a lot of the work. That was a big way that we scaled. And as I mentioned before, the Reddit buttons. The Reddit buttons are all served regardless of whether you're logged in from the CDN. And what this does is, it means that if we get a link on into the Yahoo front page, it's not going to take down all of Reddit.
I mentioned earlier that queues are great. On Reddit, we use queues for everything. At Netflix, we use queues for a lot of things. By putting things into queues, it gives you the sort of buffer. And it also gives you great insight for your monitoring. On Reddit, the vote queues, we know that there's a problem if the vote queue starts to grow. The vote queue should sit at, you know, two to three votes that need to be processed. If there's hundreds and hundreds, that means something is wrong, something is slow. So, it gives great insight. Same with all that other stuff. The spam, if the spam gets behind, we know the spam processor might be broken. Or there's a link in there that's causing the spam processor to crash or something like that.
You know, and sometimes, your users will notice your eventual consistency. But like on Reddit, a vote could sit queued for two hours, and nobody would notice because the vote immediately goes into the cache. And so, it's going to be cached for about, well, at the time, two hours. It's probably less now, but that essentially hid it from that there was a problem. We could be backed up on votes for a couple of hours and people generally wouldn't notice.
One mistake we made, though, was not using a consistent key hash. We tried to use MD5 key hashing, which made it really hard to grow our ring. We moved to a consistent key hash. How many of you know what consistent key hashing is? Okay, so, real, real quick. Basically, imagine your key space as a ring, and you have your data points of A, B, and C. And you then add another server. Essentially what that does is it just moves the ring around so that you have a lot less cache invalidation when you move it around. Instead of like when you go from five to six, instead of invalidating five-sixths of your data, you only invalidate one-fifth. Or sorry, four-fifths, one-fifth. You get the idea.
Consistent key hashing protip, most memcached libraries do not use a consistent key hash by default. So, look into it. Make sure that the library you're using does use one. Cassandra has this built in, so you don't have to worry about it there. A lot of the NoSQL stuff has it built in. But one of the big ones is memcached, the library doesn't. So, take a look.
I mentioned earlier that some of the problems with the public cloud is they're inherently more variant. Keep that in mind if you're trying to scale, and one of your machines is just slow. Maybe you have a noisy neighbor. Probably Netflix. Somebody's slowing you down. Possibly, maybe try switching machines or something like that. You can solve this problem in part by keeping your data in multiple availability zones so that if one of them gets slow, you can delete that node, replace it, throw it away, stop sending traffic there, etc. Taking frequent snapshots if you're using EBS. That's another good one for your data.
Different functions and different security groups, this is good for not just data but for anything. If you're using AWS and you're using security groups, and you set up your different types of applications in different security groups and only open up the ports that you think they should be talking on, that'll help you out later because what's going to happen is you might find out that you were trying to talk to some other service on some other port or some other way that you weren't expecting, and you'll catch that by having these security groups. It would be even better if you could log them, but at least this way you'll know that something is probably broken.
Real quick, I'm just going to talk a few of the social aspects of scaling your startup. One thing that Reddit did that I think helped a lot was open sourcing. Our original goal with open sourcing was â€” we had two. One was to make sure that people could see what we were doing because we were constantly getting accused of, "Oh you guys did this crazy thing or this nefarious thing." And we could just point right to the code and say, no, go look. And then the other reason was to try to get contributions. Getting contributions from the community is like pulling teeth. It's going to be very difficult to get mainly because they don't get your code base nearly as well as you do. It takes a while to understand a complicated code base.
But we did get a couple of contributions, one of which was, we actually had a worm on the site at some point that somebody accidentally set loose. They discovered a bug in the way that we were parsing the comments, switching markdown HTML. It basically broke out of the Reddit that they were testing in. And this happened while all of us were on an airplane coming back from the founder of Reddit's wedding. All of us were on the same airplane coming back. There was no Internet. We landed. We turned on our phones. Our phones were all blown up. I was sitting a few aisles over. I looked back. He looked at me. We just looked at each other like, "Oh, my god. What happened?" I was the first person to get to an Internet-connected computer. I got on our chat room, and one of our users was like, "Hey, I think I found the bug. Here's the line of code that it's in, and here's a patch. " And I was like, "Great!" And I took their patch and applied it to production. Didn't even read it, just applied it straight to production because, I mean, I kind of knew who the user was, and I trusted them. As it turned out, we ended up hiring him later to work for Reddit. But I kind of knew who he was, and I was like, you know what? This can't be any worse than this worm that's completely destroying our site. I did that and that was great.
How does this apply to scaling your startup? By having that open source code out there, I didn't have to go, or somebody didn't have to go, and figure out what's causing this worm. How do we stop it? One of our users did it for us. With Reddit, we rely a lot on user contribution. We have user moderators. We have users classifying spam. A lot of the reason that Reddit is successful is because of that. Part of the other reason that it's successful is because of the API. Because we've provided an API, the best Reddit client on the iPhone is one that we didn't write. We actually wrote one, and this person made a better one because they used the APIs better. In fact, I think it was one of the top links today was, "Why isn't there an official Reddit mobile app?" That's why. Because somebody made a better one than we did. We actually deprecated our app for theirs.
When you're running a site that requires user input, let your users help you moderate that. Let your users help you curate that content. And be active. Being active in your community â€” pe ople like it when the founders are active; people like it when the admins are active. It helps because then they get to know you, you get to know them, and when you have a problem, like a worm, they can help you out. I mentioned a lot of this already, the moderation, etc.
Empowering the users is really what made a big chunk of what made Reddit successful, the community interaction. And it helped us grow, too, because as we grew a lot of the features came from our community trying to basically hack the code. They tried to figure out, "How do I make HTML styles on my subreddit?" So we gave them the ability to do that. And they totally embraced that, and they loved it, and they shared it. And now there's all these features based on the fact that we allowed the community, the small, little bit of freedom with our service. It's a great way to help grow. It's the best way to figure out what features they want. It's the ones that they're trying to hack right in.
Self-posts is a perfect example of that. Somebody basically figured out how to predict what the next link idea was going to be. They created a self- referential link. And then other people figured this out, and they started doing that, and they called them self-posts, posts that link back to themselves. And we were like, "Hey, that's a great idea. Let's just make that a thing." So, we did. We made it so you could actually submit those. This is the graph of number of self-posts. Basically, the percentage of content that was self-posts. And you can see there's an inflection point there, about two-thirds the way down the graph. That was the day that we made it so that you could put a chunk of text that stuck at the top. Before you had to write a comment and hope that people kept voting it up. But when we made it so that you could put a chunk of text at the top, all of a sudden that feature took off. I think it's more than half the content now is basically generated right there. All because some user figured out how to link back to themselves. Basically, some user essentially accidentally created one of the best features of Reddit.
Some people ask how Reddit makes money, so I'll throw that in real quick. Through various advertising, merchandise, etc., and the Reddit Gold program that I talked about before. That's essentially what allows the business to scale. The way Reddit Gold came about was because people on Reddit kept joking about, "Oh, you can't have that feature unless you have Reddit Gold." And we were like, "You know what? Why don't we just make that a thing? Why don't we actually make Reddit Gold?" And we started by just asking the community pay anything you want. And the price for Reddit Gold was basically the median of the pay anything you want was â€” $3.99. And it's still the price today. The community essentially set the price of the main community product on Reddit. Leveraging the community is essentially the key for Reddit's growth and scaling on a social aspect.
That's everything that I've got. We're going to have a Q&A in a second here, so you can ask me whatever you want. I won't answer everything you ask me, but you can ask. And if you're too embarrassed to do that, this is how you can get in touch. Or if you're on the film, then you can shoot me questions that way. So, that's all I've got.