Alex Polvi
Containers and Clustering

Alex Polvi is the CEO of CoreOS, a Y Combinator-funded start-up focused on building a new operating system for massive server deployments. Prior to CoreOS he was GM for Rackspace Hosting, Bay Area, overseeing cloud product development. He joined Rackspace through the acquisition of his company, Cloudkick, which provided cloud server monitoring and management tools. Alex is a sysadmin at heart and has contributed significantly to many open source projects, including Mozilla and Apache.


My name is Alex Polvi. I am the CEO of CoreOS. As Blake put it, I am the Prince of the Cloud. No, not seriously.

I was most recently with Rackspace, which I joined through the acquisition of my previous startup, Cloudkick, overseeing a couple of cloud products there. Lots of infrastructure stuff. I am a SysAdmin at heart. We'll be talking about some of the distributed systems stuff that we've been doing at CoreOS, but also in general for your own applications.

First, before I get started, who remembers this slum lord of the internet, IE 5? It was the bane of every IT guy's existence from about 2001 to 2004 or so. Do you guys remember when a security vulnerability would come out in Internet Explorer and everybody would scramble, but then have to wait for Microsoft to actually ship an update? There was this 3 to 6 month window where everybody was compromised and the web was actually a pretty unsafe place for end users.

That created an opportunity for this guy, Firefox. This was when the whole 'Taking Back the Web' movement really spawned, because Internet Explorer was such a bad actor. But what really changed everything was when Chrome came along. Chrome did one fundamental thing differently from previous browsers: it automatically updated itself. For the IT guys who were previously scared of zero-days and of all their users getting hacked, Chrome was patched for them. Everything was up-to-date and ready to go.

Now again, this freaked out people. You can't have software going and updating itself. I remember I was at Mozilla when Chrome came out and the Mozilla guys were like "Wait, this is an online civil liberties issue. We can't have applications going and changing themselves without the user accepting that."

What we all found out was that the Chrome guys could do a better job of keeping a browser secure than an IT guy could distributing it out in time. Soon after, Firefox and IE followed suit.

I believe because of the fact that Chrome automatically updated itself, we got the most secure web that we've ever seen. Also, we got features like HTML5.

When Chrome and Firefox ship an update, we upgrade the web overnight. But we don't have anything like this for servers. In fact, the state of the art for servers right now is 'Get it running and don't touch it.'

This was an article that was just published like last week, perfect timing, talking about how it's these ancient machines out there that are causing all the vulnerabilities. It's a bit ironic because the server side is where all the precious information is, where all of our users' data is.

If we don't think about how we package our applications in such a way that we can actually service them with updates, then this could be you. Our applications will just fall out of date, and eventually you'll be compromised.

I'm going to talk today about what things we can do to repackage how you do applications and how you do application deployment of your infrastructure so that we can follow this model of what we've seen on the client side a little bit. If we can take these different components into mind around how we package things and deploy them, I think we could unlock a lot of value. Primarily in these areas:

Security. You can take a step function on the security. Not by some hardening of your application or some fancy software. Just by being able to update it.

Reliability. An environment like this requires that you can take things out and service them. Which means that you have HA sort of built into the system.

Again, if we can all take this model, we can start upgrading the internet, but from the back end, instead of from the front end the way the front end guys have been doing it.

First piece is the actual updaters. Alright, how do we update running applications that are currently deployed? There are known good patterns for this. We've all seen this before. You probably have a bunch of updates on your phone right now. When Apple ships an update you get a prompt. Android sort of has a similar thing. We don't think about updates on our back end infrastructure at all, but yet we have proven models for how this works.

Why do the front end, the client side, actually do it this way? First reason is around scale. There are only so many ways to make a pancake. At certain scale, the model for pushing an update to a set of applications is they poll out, they check for an update, and if they are told they have an update they will run their update logic to actually do whatever they need to do to update themselves.
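As a rough illustration of the polling model just described, here is a minimal Python sketch. The names (`UpdateClient`, `make_update_server`) and the version strings are made up for illustration; this is not how any particular updater is implemented. It only shows the shape of "poll out, check, and run your own update logic if told to."

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class UpdateClient:
    """One running instance that polls out for updates (hypothetical names)."""
    version: str
    # Callable that takes the current version and returns a newer one, or None.
    check_for_update: Callable[[str], Optional[str]]

    def poll_once(self) -> bool:
        """Ask the update service for a newer version and apply it if offered."""
        offered = self.check_for_update(self.version)
        if offered is None:
            return False  # nothing to do until the next poll
        self.apply(offered)
        return True

    def apply(self, new_version: str) -> None:
        # Stand-in for the instance's own update logic (download, swap, restart).
        self.version = new_version


def make_update_server(latest: str) -> Callable[[str], Optional[str]]:
    """A toy update service: offers `latest` to any client not already on it."""
    def check(current: str) -> Optional[str]:
        return latest if current != latest else None
    return check


client = UpdateClient(version="1.0", check_for_update=make_update_server("1.1"))
first = client.poll_once()   # an update is offered and applied
second = client.poll_once()  # already current, so nothing happens
```

The server never pushes anything in this sketch. Each client decides when to ask, which is what lets the same pattern scale to a very large number of instances.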

Existing deployment infrastructure is often a bit different. We sort of take a time where we decide we're going to take an outage and we tear everything down and put it back up again. If we're really good at it we can do it without taking a downtime, but most of the time we have to put up a maintenance page.

The next reason we do it this way, is we want to roll it out. When you ship an update to Chrome, you don't give it to everybody all at once; you stage it out to a few applications first and then you let it trickle out through the rest of the environment.

This allows you to have pretty graphs. In this graph we see that an application is being rolled out. Version A is in purple and then version B is being brought in and trickled out. You can control this. You can do things like only give it out to 100 instances of my application at a time and let them continue to roll out.
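A staged rollout like the one in the graph can be sketched with a server that caps how many instances it upgrades per time window. This is a minimal Python sketch under assumed names (`RolloutController`, `max_per_window`); the fleet size of 250 and the cap of 100 are illustrative numbers only.

```python
from typing import List, Optional


class RolloutController:
    """Hands out a new version to at most `max_per_window` instances per window."""

    def __init__(self, new_version: str, max_per_window: int) -> None:
        self.new_version = new_version
        self.max_per_window = max_per_window
        self.granted_in_window = 0

    def check(self, current_version: str) -> Optional[str]:
        """Called by a polling instance; returns a version to move to, or None."""
        if current_version == self.new_version:
            return None  # already updated
        if self.granted_in_window >= self.max_per_window:
            return None  # cap reached, ask again in the next window
        self.granted_in_window += 1
        return self.new_version

    def next_window(self) -> None:
        """Open a new window so more instances may be updated."""
        self.granted_in_window = 0


fleet: List[str] = ["A"] * 250          # every instance starts on version A
controller = RolloutController("B", max_per_window=100)
history: List[int] = []                 # instances on version B after each window

while "A" in fleet:
    for i, version in enumerate(fleet):
        offered = controller.check(version)
        if offered is not None:
            fleet[i] = offered
    history.append(fleet.count("B"))
    controller.next_window()
```

Plotting `history` against the window number gives the kind of graph described above: version B trickling in while version A drains out. Pausing the rollout is just a matter of not opening the next window.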

This is the first component. If we want to build an infrastructure that's always up to date, without any downtime and always secure on the back end side, we need to ship it with an updater.

Next piece is around immutability, and this is where containers come in, and why Docker has been so successful here today. Anybody playing with Docker containers? Ok, cool.

The key thing behind this is a reproducible environment. You need to be able to take your application and run it in such a way that it's reproducible always. What this is not, is convergence.

Traditional configuration management as we have it today is we take a base machine and we sort of sculpt it into the state we want it to be in. We converge it into a known state. But if anything is unexpected along the way, the process could break, and we are now in an unknown state. When you are in an unknown state you cannot automatically handle it, that's why we still log into our machines and SSH and mess around with things.

In order to handle updates automatically, we have to have everything in a known state. The known state component of this is extremely important because when we upgrade we need to actually take version A and make sure we get to version B and it always works. These are build artifacts that can actually be deployed at any given time, which means that we can revert them as well.

With CoreOS we actually take this to an extreme. We take the whole root file system of the host and we make it completely read only. This allows us to flip you to a new version, and if there's a problem you can revert back and there are no issues because you're essentially reverting completely back again.
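One way to picture "flip to a new version, revert if there's a problem" is a host with two read-only slots, one active and one passive. The sketch below assumes that two-slot layout purely for illustration; the class and method names are hypothetical, and the talk does not describe the actual on-disk mechanism.

```python
from typing import Dict, Optional


class DualSlotHost:
    """Toy model of a host with two read-only image slots (assumed layout)."""

    def __init__(self, version: str) -> None:
        self.slots: Dict[str, Optional[str]] = {"A": version, "B": None}
        self.active = "A"

    def passive(self) -> str:
        return "B" if self.active == "A" else "A"

    def stage(self, version: str) -> None:
        """Write the new image to the passive slot; the running slot is untouched."""
        self.slots[self.passive()] = version

    def flip(self) -> None:
        """Switch to the other slot. Calling it again reverts."""
        if self.slots[self.passive()] is None:
            raise ValueError("nothing staged in the passive slot")
        self.active = self.passive()

    def running(self) -> Optional[str]:
        return self.slots[self.active]


host = DualSlotHost("v1")
host.stage("v2")
host.flip()
after_update = host.running()
host.flip()                 # problem found: revert by flipping back
after_revert = host.running()
```

Because neither slot is ever modified in place, the revert lands on exactly the image that was running before, which is the property the known-state argument depends on.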

We've seen this model before. We're not inventing anything here. If you guys have ever built like an Amazon image for the Amazon marketplace this is something that you see, like you build your machine image and you go to deploy it. Traditionally, this has been very difficult to do because you have to deal with things like the kernel and SSH and blah blah blah. Docker makes this a lot easier because you can just focus on the parts of your application required to actually run it.

Is anybody actually deploying applications either by building a whole new machine image on Amazon, or by using a container, or a fixed build artifact or anything? How's it working right now? What are you actually doing?

Audience: For the containers, we start with a master image, build a slave image and then TAR ball the entire file system so we can download it for every new machine. We start all of our containers off this image.

So you have a known previous sort of base image, and then you pull in a container with all of the things you need to run?

Audience: Right. And then we can continuously deploy new containers with changes.

Got it. And do you ever run mixed-matched versions in parallel?

Audience: We plan to, but don't do it yet.

Alex: You don't do it at all. Got it, okay.

We have applications where we want to be able to update them, and then we have this build artifact that we create. Again, this is the container component. But now we want to be able to do this without taking any downtime as well. And this is where clustering comes into this and where I'll spend most of the time talking.

Our goal here is we would like to take an instance down and the app just keeps running.

Now I'm sure everybody here knows you can kill a random server and your machines keep working right? Your apps keep working? No?! What guys?

This is something I think we all aspire to. It's sort of the operations mecca, if you can kill any machine and your application keeps working it probably also means you can add more machines and get more scale as well.

It's sort of this idea that you just have an opaque set of resources and you can deploy against them, and your applications keep working.

Does anybody have a setup like this? I'm curious how people are doing it today. Could you share briefly how you actually have it today?

Audience: In what sense?

How do you actually solve this problem? How can you kill a random machine and it keeps working — your application keeps running?

Audience: I think every service you design has to be, like you said, scaled. If you're horizontally scalable, there's no snowflake server that's dependent. You can't have any single points of failure, that has to be your first requirement, including your data storage.


Audience: We do use Amazon machine images to deploy the new instances, but I'm curious to see how this model works with that.

Right, they're good points.

The analogy is pretty simple. Today we have full machine images. It's like, take out the kernel, take out SSH and now you have your application image. If you've played with Docker, most of the time you just start from an Ubuntu base image and build up from there. In our opinion, the best way to do this is to take as minimal of a deployment as you can get away with.

So, if you're running a Rails application, just build your Ruby runtime statically compiled with everything it needs in it and then throw in your code. Now you have this artifact that's runnable.

This is what we're targeting when we talk about clustering, is that you can bring something else in.

One big component of this is service discovery. There are a couple of classes of problems we're trying to solve with service discovery. The first one is, for instance, taking a load balancer and having it find your downstream app servers dynamically. Again, just curious, has anyone been able to pull this off in a reasonable manner? Heroku solves it in a pretty big way. Do you mind sharing?

Audience: I've done it with AWS tags by tagging instances in groups. The other way is by using DNS. So in Terminal, DNS with prefixes. That lets you find all the IPs that are in that range. But you have to run DNS.

You have to run DNS. And is this sort of baked into the Amazon tools? I know tags are, but is your app server finding the actual database in that example?

Audience: You have to set up the DNS using Route 53 or some other DNS provider. Then you have to change your search path in your DHCP config. It's all standard Ubuntu stuff.

Any other approaches to this to date?

Audience: ZooKeeper
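The load balancer case from this discussion can be sketched as a registry where app servers announce themselves with an expiring entry and the load balancer looks up whoever is currently alive. This is a single-process Python sketch with hypothetical names (`ServiceRegistry`, `register`, `lookup`); it stands in for whatever store (tags, DNS, ZooKeeper, etcd) actually holds the entries, and time is passed in explicitly to keep it deterministic.

```python
from typing import Dict, List, Tuple


class ServiceRegistry:
    """In-memory stand-in for a discovery store with expiring entries."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl = ttl_seconds
        self._expiry: Dict[Tuple[str, str], float] = {}  # (service, address) -> expiry

    def register(self, service: str, address: str, now: float) -> None:
        """Announce (or re-announce) an instance; it expires unless refreshed."""
        self._expiry[(service, address)] = now + self.ttl

    def lookup(self, service: str, now: float) -> List[str]:
        """Return the addresses whose entries have not expired."""
        return sorted(
            address
            for (name, address), expiry in self._expiry.items()
            if name == service and expiry > now
        )


registry = ServiceRegistry(ttl_seconds=10)
registry.register("app", "10.0.0.1:8080", now=0)
registry.register("app", "10.0.0.2:8080", now=0)

both_alive = registry.lookup("app", now=5)

registry.register("app", "10.0.0.1:8080", now=8)  # first server heartbeats again
# The second server has died and never refreshes its entry.
survivors = registry.lookup("app", now=12)
```

A load balancer that rebuilds its backend list from `lookup` on every poll stops sending traffic to a dead server without anyone editing its configuration.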

Another component of this to actually do it effectively, is distributed locking. A distributed lock is required for particularly sensitive operations. A distributed lock is when you have a set of disparate machines, let's say your master and a bunch of slaves on a database, and that master dies. Now how do you decide which of these slaves actually becomes the new master? That's where the distributed lock is important.

The only way to actually do this is with a single source of truth that is hard consistent, that is exactly right, because two of those databases can't think that they're master at the same time.

This is where it requires actual hard consistency, but at the same time because you have something you don't ever want to go down, that source of truth has to be highly available in and of itself. That's where ZooKeeper and etcd come into the picture.
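The master election described above can be sketched as an atomic "create this key only if nobody holds it" operation with a lease. The Python below is a single-process stand-in for a consistent store, with hypothetical names (`ConsistentStore`, `create_if_absent`); it is not the ZooKeeper or etcd API, and a real store would enforce the same rule across several machines.

```python
from typing import Dict, Optional, Tuple


class ConsistentStore:
    """Toy stand-in for a strongly consistent key-value store with leases."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[str, float]] = {}  # key -> (value, expiry)

    def create_if_absent(self, key: str, value: str, ttl: float, now: float) -> bool:
        """Atomically claim `key` unless a live claim already exists."""
        current = self._data.get(key)
        if current is not None and current[1] > now:
            return False  # someone else holds the lock
        self._data[key] = (value, now + ttl)
        return True

    def get(self, key: str, now: float) -> Optional[str]:
        current = self._data.get(key)
        if current is None or current[1] <= now:
            return None
        return current[0]


store = ConsistentStore()
replicas = ["replica-1", "replica-2", "replica-3"]

# The old master has died; every replica races to claim the master key.
results = [store.create_if_absent("db/master", r, ttl=5, now=0) for r in replicas]
master = store.get("db/master", now=0)

# The winner later dies and stops renewing, so its lease runs out.
takeover = store.create_if_absent("db/master", "replica-2", ttl=5, now=6)
new_master = store.get("db/master", now=6)
```

Exactly one claim succeeds in each round, which is the guarantee that keeps two databases from both believing they are master. The lease is what lets the cluster recover when the holder itself goes away.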

Now I want to survey the audience on the ZooKeeper or etcd users. Anybody actually deploying this stuff in production? Would you guys be willing to share which one you use and how you use it? You use ZooKeeper? In what part of the stack, where does it tie in?

Audience: We use it for both directory service and to prevent two boxes from claiming ownership of the same shard. We share data across multiple nodes and want to prevent two nodes from thinking they own the same thing. It's sort of like an implementation of distributed locking using ZooKeeper. Currently we're concerned that it's not fully dynamic yet. It's just there to prevent two things, which haven't been configured wrong, from competing for the same resource.

Got it. What is the platform you're building, just curious?

Audience: We're building an email platform where every shard is a set of users.

Got it, every shard is a set of users. Anybody solving for a similar class of problems? I'm curious how people are doing it without this. How do you manage the sharding and making sure the right app server is talking to the right thing? Does anybody have an approach to this that does not include ZooKeeper and etcd? No? Okay.

I definitely recommend starting to study up on these tools a little bit. Particularly if your platforms are starting to grow in capacity. You will need a solution to the problem that these things solve in some form at scale.

Google does it with their own stuff, Chubby. Yahoo spun out Hadoop, and ZooKeeper is part of that suite. Etcd is a thing that we built as a more devops friendly version of this. It's definitely something that is an important part of the stack when we talk about locking.

The next piece is around cluster management. Remember right now our operational goal is that we can take anything down and things keep running. Which means we need to forget about any single machine as a provider of resources. That brings in cluster management tools. And with cluster management we also need resource scheduling. Resource scheduling is like: Okay, this application just needs 256 megs of memory to run somewhere in the environment. Find me a machine that it runs on.

This is actually a pretty difficult computer science problem. It's called bin packing. Bin packing is not a trivially solved problem; I think it's an NP-hard problem. Resource scheduling is a component of all of this. There are a couple of tools out there to help you. YARN, again, is part of the Hadoop suite. There's Mesos. Fleet is the one that we built and have brought into all this.
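Since exact bin packing is expensive, schedulers usually settle for a heuristic. The sketch below uses first-fit decreasing, a common textbook heuristic chosen here for illustration; it is not a description of how YARN, Mesos, or Fleet place work. The job names, machine names, and memory sizes are made up.

```python
from typing import Dict


def schedule_first_fit(jobs: Dict[str, int], machines: Dict[str, int]) -> Dict[str, str]:
    """Place each job (memory in MB) on the first machine with enough free memory.

    Jobs are considered largest first, which tends to pack better than
    taking them in arbitrary order. Raises if a job fits nowhere.
    """
    free = dict(machines)              # remaining memory per machine
    placement: Dict[str, str] = {}
    for job, need in sorted(jobs.items(), key=lambda kv: kv[1], reverse=True):
        for machine in free:
            if free[machine] >= need:
                free[machine] -= need
                placement[job] = machine
                break
        else:
            raise RuntimeError(f"no machine has {need} MB free for {job}")
    return placement


machines = {"m1": 512, "m2": 512}
jobs = {"web": 256, "worker": 384, "cron": 128, "cache": 256}
placement = schedule_first_fit(jobs, machines)
```

Here the four jobs need 1024 MB in total and the heuristic happens to fill both machines exactly. In general a heuristic like this can fail to find a packing that exists, which is the price of not solving the hard problem exactly.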

Is anybody using any of these components at all in their own environment? No.

Okay, if you really want to step it up to what you'd see inside of a Google-type deployment (in Google they call it Borg, that's their internal cluster management system), the smallest deployment you could probably have with this is like three servers. You could take out one of those servers, and your applications will keep running. But if you lose two, you've lost the quorum, and your applications will stop running.
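The three-server arithmetic follows from majority quorum: a cluster of n members needs more than half of them alive to agree on anything. A small Python sketch of that calculation, with hypothetical function names:

```python
def quorum_size(members: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority still remains."""
    return members - quorum_size(members)


def has_quorum(members: int, alive: int) -> bool:
    return alive >= quorum_size(members)


three_quorum = quorum_size(3)            # majority of three
three_tolerates = tolerated_failures(3)  # losses a three-node cluster survives
lose_one = has_quorum(3, alive=2)
lose_two = has_quorum(3, alive=1)
five_tolerates = tolerated_failures(5)
```

This is also why cluster sizes are usually odd: going from three members to four raises the quorum to three without letting you lose any more machines.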

This gets pretty sophisticated, but it also simplifies how deployment works, if you can trust your server cluster management software to keep working.

If we tie all of this together, first you need to push an update, then you need to have something that is updateable which is where the immutable pieces come in, and we need to actually get it deployed in an HA/fault-tolerant manner.

I'm going to encourage you guys to update your apps! Think about this sooner rather than later, because it will change the architecture of how your applications actually work. Do each of these components here, and again I really encourage you to think about this even when you're on the smaller side of your application deployment. It might sound super hardcore and sophisticated, but it's much more difficult to port your application to this sort of mentality of running down the road. Start as soon as you can.

And if we can do all this together, we can give you way better security, way better reliability, and we can update the web where the actual important parts matter.

So thank you guys so much for your time. I know that was a bunch of information, but I appreciate it and I wish you guys all the best on building your platforms.

Thank you.

Q: What is the roadmap for CoreOS 1.0?

A: The OS itself needs to get fully stable. Right now we do crazy things like reboot you without you asking to update you. Obviously, in a production environment randomly rebooting people is a little bit crazy. We built in all this functionality to make it safe to do, but it still makes people quite uncomfortable to think that their servers will just magically reboot on them.

We're getting to the point where we are stable enough that we turn that feature off and the distro is working well.

Then on the cluster management side there's etcd and fleet, which go hand in hand. Our goals there are that etcd needs to be stable enough that it doesn't actually go down and can handle the fault tolerance as well as scaling up and down the cluster size, and do that reliably. I'd say we are about 70% there on etcd.

Then fleet, it's the part that takes CoreOS and plugs it into etcd and it allows you to just boot more machines. It all kind of works together. Getting that fully stable and working is also on 1.0. Right now it works, I'd say 60% sort of quality, but we want it to get to the point where it's essentially, internally what I target for this is a bit ambitious, but we want to have 1,000 machines under management with no problem. We'll probably call it 1.0 before it's actually that ready, but that's definitely how we're thinking about it, in units of 1,000, and fleet can manage that.

Q: If you're a new start-up and haven't thought about moving to a properly distributed system, when should you start thinking about this?

A: So most of the time when you're just starting out you don't even have your product well defined. I would recommend starting out on Heroku or App Engine or whatever and they solve these sorts of problems for you, and just punt on the whole thing. Actually start doing this as your next gen version of your application when you go and rewrite it, because you have a working product and you're ready to go. I would start by doing nothing, and then start with a set up like this.

What's not ideal and the trap a lot of people fall into is the "Oh I'm just going to boot an Ubuntu machine and go set up my app, and oh my gosh I need two of them what do I do? Oh, now my database, I should move that to a different box." And then you just end up with this giant mess that's totally hard to manage.

If you start out with more of this platform-y type environment, App Engine or Heroku, knowing you're going to re-architect your app at some point anyway, that's okay. Then you won't have to worry about ops at all, for a little bit. Then when you're ready to step it up to the big leagues and you actually know what you're building then go after this style of approach. Again this works all the way down to a small cluster of one or three machines if you don't have the HA capabilities. If you need to run not on a cloud, you could do it on the small size.

We are a very sophisticated team that's seen all these problems before, and everyone wants to start at this, but that's because we intend it to be big from the get go.

Definitely don't think about it too late, but I would also run your infrastructure with other people as long as you can get away with it. As long as that works for you.

And also keep in mind, a lot of these tools are meant for the platform builders as well. Not the end user. This is something that you'd run under Heroku, or run under App Engine. My understanding is a lot of you guys are building platforms yourselves and so that's why this is a tool to consider earlier.

Q: How does this fit into a workflow where you're deploying system-level changes and then terminating the older versions?

A: That's a similar model and essentially you have the same set of issues here it's just that your application is that instance instead of being a container on the instance. Does that make sense? Your resource scheduler here is like EC2 is doing the resource scheduling for you instead of it being built on top of your operating system.

Q: How do you handle an application change versus a larger system-level update?

A: The way we think about it is the base OS is actually just another application that's updated like this. The kernel and your init system, everything is just an application that pings home, gets an update, and reboots as its upgrade mechanism. Then your application is a separate one that pings home, gets an update, and upgrades itself. They're just two different applications.

Q: What controls or safety measures do you have in place when you do a CoreOS update?

A: So you're talking about on the OS side? On the OS side, that graph we were showing, essentially if it encounters an error in any state it stops and the client will revert itself. The client in this case is the OS. The OS will revert itself if it's unable to successfully update.

Success is determined in one of two ways: there is a default success check, which is "can I do another update?", or you can hook it and validate whatever success metrics mean to you.
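The update-then-verify behavior in this answer can be sketched as a function that takes an optional success hook and falls back to the previous version when the hook fails. The names (`apply_update`, `default_success`) are hypothetical, and the default check here is only a placeholder for "can I do another update?".

```python
from typing import Callable, Optional


def default_success(version: str) -> bool:
    # Placeholder for the default check ("can I do another update?").
    # A real client would probe its own update machinery here.
    return True


def apply_update(
    current: str,
    candidate: str,
    succeeded: Optional[Callable[[str], bool]] = None,
) -> str:
    """Try `candidate` and return whichever version is left running.

    Any failed or crashing success check is treated as a failed update,
    and the client reverts to `current`.
    """
    check = succeeded if succeeded is not None else default_success
    try:
        ok = check(candidate)
    except Exception:
        ok = False
    return candidate if ok else current


def crashing_hook(version: str) -> bool:
    raise RuntimeError("health endpoint unreachable")


kept_new = apply_update("v1", "v2")
reverted = apply_update("v1", "v2", succeeded=lambda version: False)
reverted_on_crash = apply_update("v1", "v2", succeeded=crashing_hook)
```

Treating an exception in the hook as a failure is a design choice in this sketch: when the client cannot tell whether the update worked, the known-good version is the safer place to land.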

The point is that it is automated. That it should be able to roll in the middle of the night, and it's okay. Again nobody trusts this. So we'll get you there over time. But that's the whole point.

All of this stuff that we talked about is to make that safe. That you can actually update and not take a downtime. Everybody's been burned by this over and over again and it's because of the way applications are architected today that it's not safe.

Q: How do you deal with applications that require persistent storage?

A: Being completely practical, the answer today is use existing stuff that already works for your database. So use RDS on Amazon. Google has stuff like GFS which solves this internally, but it's a radically different approach than what people do with data storage outside of that world. Today it's like storage, put it on RDS. Manage that as its own whole thing. For big companies when they ask us about "Well what about our Oracle database?" We don't want to touch your Oracle database. That's not going to run on us.

This is for your applications that you actually want to be distributed and fault-tolerant. Existing databases like Postgres/MySQL were just fundamentally not built to be able to do that or to do that well. They can kind of do it, you can sort of make that work, but they just aren't built to do it. Places where this actually works is where they just have a different database than what we have on our side of the world.

Punt until we have better technology.

Q: How does etcd fit in with systemd?

A: Systemd, I didn't cover it too much here, but that is definitely something to keep an eye on. All the Linux distributions are going to be switching to systemd.

Systemd is the init system. It's what the kernel boots to actually run processes on your Linux box. Ubuntu is switching, Debian is switching. Red Hat is shipping it in their next version, which means CentOS will follow. Arch is already there. With Gentoo it's a matter of time.

Everybody is switching to systemd, which means very soon all Linux you run will be Linux and systemd now. Not just that common Linux space, but how you actually do things, like starting and stopping your web server.

What we've done, we took systemd, it has APIs like D-Bus. Anybody familiar with D-Bus? It has the D-Bus API for actually controlling it saying, start this service or stop the service, start my web server, or stop my web server. We plumbed that straight into etcd so that you can have a distributed init. It takes the abstraction of a single machine process management and puts it across a set of machines. You can tear down one host and your process will get picked up and run on a different machine.

This obviously requires a good actor in terms of the application that's running in this environment. Not everything can handle that, being shot and then running somewhere else, and being okay. But tied together with all this dynamic service discovery and distributed locking, and all these components, you could actually make it start to work. It does require a different way of thinking about how you actually run these systems.
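The "distributed init" idea, where tearing down one host causes its processes to be picked up elsewhere, can be sketched as follows. This is a toy Python model with hypothetical names (`ClusterInit`, `host_failed`) and a simple least-loaded placement rule chosen for illustration; it does not use the real systemd D-Bus API or etcd.

```python
from typing import Dict, Iterable, Optional, Set


class ClusterInit:
    """Toy init system that places units across hosts instead of on one machine."""

    def __init__(self, hosts: Iterable[str]) -> None:
        self.units_by_host: Dict[str, Set[str]] = {h: set() for h in hosts}

    def start(self, unit: str) -> str:
        """Run `unit` on the least-loaded host (ties broken by host name)."""
        if not self.units_by_host:
            raise RuntimeError("no hosts left to run " + unit)
        host = min(sorted(self.units_by_host), key=lambda h: len(self.units_by_host[h]))
        self.units_by_host[host].add(unit)
        return host

    def host_failed(self, host: str) -> None:
        """Drop a dead host and restart whatever it was running elsewhere."""
        orphaned = self.units_by_host.pop(host, set())
        for unit in sorted(orphaned):
            self.start(unit)

    def where(self, unit: str) -> Optional[str]:
        for host, units in self.units_by_host.items():
            if unit in units:
                return host
        return None


cluster = ClusterInit(["host-a", "host-b", "host-c"])
for unit in ["web.service", "db.service", "cache.service"]:
    cluster.start(unit)

before = cluster.where("web.service")
cluster.host_failed("host-a")
after = cluster.where("web.service")
```

The sketch restarts the unit from scratch on the new host, so it only works for applications that tolerate being killed and started somewhere else, which is the "good actor" requirement above.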

Systemd, again, regardless of CoreOS or anything, check it out, because you're going to have to deal with it. If you interact with a Linux box in any way, you're going to have to deal with it soon. CoreOS already ships with it. I think Ubuntu 14.10 will have it, and it does change everything. No more /etc/init.d/apache2 start, you're going to run systemctl start apache and there'll be a lot of docs to read and that stuff. It's coming.

Q: What are the tradeoffs of etcd versus ZooKeeper?

A: Good question. ZooKeeper does a pretty good job at everything. The main issue and the reason we built etcd is really around clients. It's really hard to have a non Java client with ZooKeeper.

Actually for the guys that are using it, do you write everything in Java?

Audience: Also Python. That was easy to use.

There are some clients out there that are so-so, but the protocol is really difficult to get right in a client, which means it's primarily just Java apps that are able to use ZooKeeper.

We built etcd originally just because we wanted to make these primitives more accessible to a normal developer. ZooKeeper talks a binary API that's really only implemented in Java, and we talk HTTP and JSON with etcd. It's just a little friendlier to deal with. One big advantage of ZooKeeper is that it's actually stable and works pretty well.

Etcd is still coming along and this is the part of the system that has to be like perfectly stable, otherwise the whole thing falls apart. If you need rock solid stability, choose ZooKeeper. Etcd is going to get good over time.

As an open source project we've seen right around 75 outside contributors, and like 4,000 stars on GitHub, if that means anything. It's coming along as an open source project, but it's still early.

Q: How will you deal with package upgrades? Will you split it like Fedora and Red Hat?

A: The state of the art on that is Fedora and Red Hat. Fedora is like the bleeding edge latest version of what Red Hat is, and somehow it magically becomes Red Hat — like a three to four year cadence and then it's stable.

Our model is we have an alpha channel. Alpha is a release candidate for the beta channel. The beta channel is a release candidate for the stable channel. This is the model that you see on things like Chrome. We'll ship between alphas almost daily like a SaaS application, and then those will get promoted to betas, probably around a weekly cadence.

We haven't actually shipped our first beta although we have our first release candidates for betas out, but none of them have made it yet. Then the stables will hit about once a month or so on top of that. It's a straight pass-through. The thing that hits alpha is bit-for-bit identical to the thing that becomes stable. Assuring quality is all there and everything.
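The straight pass-through between channels can be sketched by tracking each channel as a pointer to an image digest, so promotion copies the pointer and never rebuilds anything. The class name, channel list handling, and the use of SHA-256 are assumptions for illustration, not a description of the actual release tooling.

```python
import hashlib
from typing import Dict, Optional

CHANNELS = ["alpha", "beta", "stable"]


class ReleaseChannels:
    """Each channel points at the digest of the image it currently serves."""

    def __init__(self) -> None:
        self.current: Dict[str, Optional[str]] = {c: None for c in CHANNELS}

    def publish_alpha(self, image: bytes) -> str:
        """Build output enters the pipeline only through alpha."""
        digest = hashlib.sha256(image).hexdigest()
        self.current["alpha"] = digest
        return digest

    def promote(self, source: str) -> str:
        """Move whatever `source` serves to the next channel, unchanged."""
        index = CHANNELS.index(source)
        if index + 1 >= len(CHANNELS):
            raise ValueError(source + " is the last channel")
        digest = self.current[source]
        if digest is None:
            raise ValueError("nothing published on " + source)
        target = CHANNELS[index + 1]
        self.current[target] = digest
        return target


channels = ReleaseChannels()
digest = channels.publish_alpha(b"example image contents")
channels.promote("alpha")   # alpha release candidate becomes the beta
channels.promote("beta")    # beta release candidate becomes the stable
```

Because promotion only copies a digest, the image that reaches stable is the same bytes that were tested in alpha and beta.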

Q: So you keep your package upgrades simple because CoreOS is small?

A: Exactly. With CoreOS, we ship you the Linux kernel.

The Linux kernel has this trick that not many people know, which is it never breaks user land. A really old Linux binary will run on a brand new kernel, no problem.

The things that break are those inter-application dependencies. You need that version of OpenSSL, you need that version of your Postgres client or whatever it is for your application to actually run, and that's the stuff that breaks across distro upgrades. Not that it doesn't work on the kernel.

We just ship the kernel and systemd, and a couple packages to help you out. We ship Docker because that's a convenient way to bring in your container. But from there we push everything into the actual container.

If you have any other questions I'll be here, Thank you guys so much.