Containers and Clustering
The architectural patterns of a large scale platform are changing. Dedicated VMs and configuration management tools are being replaced by containerization and new service management technologies like systemd. Come and learn how to use these new technologies to build performant, reliable, large distributed systems.
- The Browser Narrative
- Containers and Clustering
My name is Alex Polvi. I am the CEO of CoreOS. As Blake put it, I am the Prince of the Cloud. No, not seriously.
I was most recently with Rackspace, overseeing a couple of cloud products there, which I joined through the acquisition of my previous startup, Cloudkick. Lots of infrastructure stuff. I am a SysAdmin at heart. We'll be talking about some of the distributed systems stuff that we've been doing at CoreOS, but also in general for your own applications.
First, before I get started, who remembers this slum lord of the internet, IE 5? It was the bane of every IT guy's existence from about 2001 to 2004 or so. Do you guys remember when a security vulnerability would come out in Internet Explorer and then everybody would scramble, but wait for Microsoft to actually ship an update? There was this 3 to 6 month window where everybody was compromised and the web was actually a pretty unsafe place for end users.
That created an opportunity for this guy, Firefox. This was when the whole 'Taking Back the Web' movement really spawned, because Internet Explorer was such a bad actor. But what really changed everything was when Chrome came along. Chrome did one fundamental thing differently than previous browsers: it automatically updated itself. For these IT guys who were previously scared of zero-days and all their users getting hacked, Chrome was patched for them. Everything was up-to-date and ready to go.
Now again, this freaked people out. You can't have software going and updating itself. I remember I was at Mozilla when Chrome came out, and the Mozilla guys were like, "Wait, this is an online civil liberties issue. We can't have applications going and changing themselves without the user accepting that."
What we all found out was that the Chrome guys could do a better job of keeping a browser secure than an IT guy could by distributing updates in time. Soon after, Firefox and IE followed suit.
I believe that because Chrome automatically updated itself, we got the most secure web that we've ever seen. Also, we got features like HTML5.
When Chrome and Firefox ship an update, the web gets upgraded overnight. But we don't have anything like this for servers. In fact, the state of the art for servers right now is 'Get it running and don't touch it.'
This was an article that was just published like last week, perfect timing, talking about how it's these ancient machines out there that are causing all the vulnerabilities. It's a bit ironic because the server side is where all the precious information is, where all of our users' data is.
If we don't think about how we package our applications in such a way that we can actually service them with updates, then this could be you. Our applications will just fall out of date, and eventually you'll be compromised.
I'm going to talk today about what we can do to repackage how you do applications and how you do application deployment on your infrastructure, so that we can follow this model of what we've seen on the client side a little bit. If we keep these different components in mind around how we package things and deploy them, I think we could unlock a lot of value. Primarily in these areas:
Security. You can take a step function on security. Not by some hardening of your application or some fancy software. Just by being able to update it.
Reliability. An environment like this requires that you can take things out and service them. Which means that you have HA sort of built into the system.
Again, if we can all take this model, we can start upgrading the internet, but from the back end instead of the front end, the way the front end guys have been doing it.
The first piece is the actual updaters. Alright, how do we update running applications that are currently deployed? There are known good patterns for this. We've all seen this before. You probably have a bunch of updates on your phone right now. When Apple ships an update you get a prompt. Android sort of has a similar thing. We don't think about updates on our back end infrastructure at all, and yet we have proven models for how this works.
Why does the front end, the client side, actually do it this way? The first reason is around scale. There are only so many ways to make a pancake. At a certain scale, the model for pushing an update to a set of applications is that they poll out, they check for an update, and if they are told they have an update they will run their update logic to actually do whatever they need to do to update themselves.
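As a rough illustration of the poll-based model just described, here is a minimal Python sketch. Everything in it (the `Client` class, `make_update_service`, the version strings) is hypothetical and invented for the example; it is not how CoreOS or Chrome implement their updaters, only the shape of "poll out, check, run update logic".

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Client:
    """One deployed instance that knows which version it is running."""
    name: str
    version: str

    def poll(self, check_for_update: Callable[[str], Optional[str]]) -> bool:
        """Ask the update service whether a newer version exists.

        Returns True if the client updated itself, False otherwise.
        """
        offered = check_for_update(self.version)
        if offered is None:
            return False          # nothing to do until the next poll
        self.apply_update(offered)
        return True

    def apply_update(self, new_version: str) -> None:
        # Placeholder for the real update logic (download, verify, restart).
        self.version = new_version


def make_update_service(latest: str) -> Callable[[str], Optional[str]]:
    """Hypothetical server side: offer `latest` to anyone not yet running it."""
    def check(current: str) -> Optional[str]:
        return latest if current != latest else None
    return check


service = make_update_service("1.1.0")
client = Client(name="web-1", version="1.0.0")
first = client.poll(service)    # update offered and applied
second = client.poll(service)   # already current, so no update
```

The point of the design is that the server never has to reach out to clients; each client asks on its own schedule, which is what lets the model scale.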
Existing deployment infrastructure is often a bit different. We sort of pick a time where we decide we're going to take an outage, and we tear everything down and put it back up again. If we're really good at it we can do it without taking downtime, but most of the time we have to put up a maintenance page.
The next reason we do it this way is that we want to roll it out. When you ship an update to Chrome, you don't give it to everybody all at once; you stage it out to a few applications first and then you let it trickle out through the rest of the environment.
This allows you to have pretty graphs. In this graph we see that an application is being rolled out. Version A is in purple, and then version B is being brought in and trickled out. You can control this. You can do things like only give it out to 100 instances of my application at a time and let them continue to roll out.
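The "100 instances at a time" control can be sketched in a few lines of Python. This is an assumption-laden toy: `RolloutController`, the fleet size of 250, and the version labels are all made up for illustration, and a real rollout would also watch for errors before continuing to the next batch.

```python
from typing import List


class RolloutController:
    """Hands out a new version to at most `batch_size` instances per round."""

    def __init__(self, new_version: str, batch_size: int) -> None:
        self.new_version = new_version
        self.batch_size = batch_size

    def run_round(self, fleet: List[str]) -> int:
        """Upgrade up to `batch_size` instances in place; return how many changed."""
        upgraded = 0
        for i, version in enumerate(fleet):
            if upgraded == self.batch_size:
                break
            if version != self.new_version:
                fleet[i] = self.new_version
                upgraded += 1
        return upgraded


fleet = ["A"] * 250                      # every instance starts on version A
controller = RolloutController(new_version="B", batch_size=100)
history = []                             # count of instances on B after each round
while controller.run_round(fleet) > 0:
    history.append(fleet.count("B"))
```

Plotting `history` over time gives the kind of graph described above: version B climbing in steps while version A drains away.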
This is the first component. If we want to build an infrastructure that's always up to date, without any downtime, and always secure on the back end side, we need to ship it with an updater.
The next piece is around immutability, and this is where containers come in, and why Docker has been so successful here today. Anybody playing with Docker containers? Ok, cool.
The key thing behind this is a reproducible environment. You need to be able to take your application and run it in such a way that it's always reproducible. What this is not is convergence.
Traditional configuration management as we have it today is that we take a base machine and we sort of sculpt it into the state we want it to be in. We converge it into a known state. But if anything is unexpected along the way, the process could break, and we are now in an unknown state. When you are in an unknown state you cannot automatically handle it; that's why we still SSH into our machines and mess around with things.
In order to do that, we have to have everything in a known state. The known state component of this is extremely important because when we upgrade we need to actually take version A and make sure we get to version B, and that it always works. These are build artifacts that can actually be deployed at any given time, which means that we can revert them as well.
With CoreOS we actually take this to an extreme. We take the whole root file system of the host and we make it completely read-only. This allows us to flip you to a new version, and if there's a problem you can revert back, and there are no issues because you're essentially reverting completely back again.
We've seen this model before. We're not inventing anything here. If you guys have ever built, like, an Amazon image for the Amazon marketplace, this is something that you see: you build your machine image and you go to deploy it. Traditionally, this has been very difficult to do because you have to deal with things like the kernel and SSH and blah blah blah. Docker makes this a lot easier because you can just focus on the parts of your application required to actually run it.
Is anybody actually deploying applications either by building a whole new machine image on Amazon, or by using a container, or a fixed build artifact or anything? How's it working right now? What are you actually doing?
Audience: For the containers, we start with a master image, build a slave image, and then tarball the entire file system so we can download it for every new machine. We start all of our containers off this image.
So you have a known previous sort of base image, and then you pull in a container with all of the things you need to run?
Audience: Right. And then we can continuously deploy new containers with changes.
Got it. And do you ever run mix-and-match versions in parallel?
Audience: We plan to, but don't do it yet.
Alex: You don't do it at all. Got it, okay.
We have applications where we want to be able to update them, and then we have this build artifact that we create. Again, this is the container component. But now we want to be able to do this without taking any downtime as well. And this is where clustering comes into this, and where I'll spend most of the time talking.
Our goal here is that we would like to take an instance down and the app just keeps running.
Now I'm sure everybody here knows you can kill a random server and your machines keep working, right? Your apps keep working? No?! What, guys?
This is something I think we all aspire to. It's sort of the operations mecca: if you can kill any machine and your application keeps working, it probably also means you can add more machines and get more scale as well.
It's sort of this idea that you just have an opaque set of resources and you can deploy against them, and your applications keep working.
Does anybody have a setup like this? I'm curious how people are doing it today. Could you share briefly how you actually have it today?
Audience: In what sense?
How do you actually solve this problem? How can you kill a random machine and your application keeps running?
Audience: I think every service you design has to be, like you said, scaled. If you're horizontally scalable, there's no snowflake server that things depend on. You can't have any single points of failure; that has to be your first requirement, including your data storage.
Audience: We do use Amazon machine images to deploy the new instances, but I'm curious to see how this model works with that.
Right, they're good points.
The analogy is pretty simple. Today we have full machine images. It's like, take out the kernel, take out SSH, and now you have your application image. If you've played with Docker, most of the time you just start from an Ubuntu base image and build up from there. In our opinion, the best way to do this is to take as minimal a deployment as you can get away with.
So, if you're running a Rails application, just build your Ruby runtime statically compiled with everything it needs in it, and then throw in your code. Now you have this artifact that's runnable.
This is what we're targeting when we talk about clustering: that you can bring something else in.
One big component of this is service discovery. With service discovery there are a couple of classes of problems we're trying to solve. The first one is, for instance, take a load balancer and have it find your downstream app servers dynamically. Again, just curious, has anyone been able to pull this off in a reasonable manner? Heroku solves it in a pretty big way. Do you mind sharing?
Audience: I've done it with AWS tags, by tagging instances in groups. The other way is by using DNS. So, internal DNS with prefixes. That lets you find all the IPs that are in that range. But you have to run DNS.
You have to run DNS. And is this sort of baked into the Amazon tools? I know tags are, but is your app server finding the actual database in that example?
Audience: You have to set up the DNS using Route 53 or some other DNS provider. Then you have to change your search path in your DHCP config. It's all standard Ubuntu stuff.
Any other approaches to this to date?
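To make the load balancer example concrete, here is a small in-memory Python sketch of service discovery with expiring registrations. The `Registry` class, the addresses, and the 30-second TTL are all hypothetical; in practice this role would be played by something like DNS, tags, or a store such as etcd or ZooKeeper, and time would come from a real clock rather than being passed in.

```python
from typing import Dict, List


class Registry:
    """In-memory stand-in for a service-discovery store with expiring entries."""

    def __init__(self) -> None:
        # service name -> {address: time at which the registration expires}
        self._expires: Dict[str, Dict[str, float]] = {}

    def register(self, service: str, address: str, now: float, ttl: float) -> None:
        """Record (or refresh) an instance; it disappears unless re-registered."""
        self._expires.setdefault(service, {})[address] = now + ttl

    def lookup(self, service: str, now: float) -> List[str]:
        """Return addresses whose registration has not yet expired."""
        entries = self._expires.get(service, {})
        return sorted(addr for addr, deadline in entries.items() if deadline > now)


registry = Registry()
registry.register("app", "10.0.0.1:8080", now=0.0, ttl=30.0)
registry.register("app", "10.0.0.2:8080", now=0.0, ttl=30.0)

# Only the first server keeps heartbeating; the second has died.
registry.register("app", "10.0.0.1:8080", now=20.0, ttl=30.0)

backends_early = registry.lookup("app", now=10.0)   # both servers still listed
backends_late = registry.lookup("app", now=40.0)    # dead server has aged out
```

A load balancer that re-reads `lookup("app", ...)` periodically picks up new app servers and drops dead ones without anyone editing its configuration.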
Another component of this, to actually do it effectively, is distributed locking. A distributed lock is required for particularly sensitive operations. A distributed lock comes in when you have a set of disparate machines, let's say your master and a bunch of slaves on a database, and that master dies. Now how do you decide which of these slaves actually becomes the new master? That's where the distributed lock is important.
The only way to actually do this is to have a single source of truth that is hard consistent, that is exactly right. Because two of those databases can't think that they're master at the same time.
This is where it requires actual hard consistency, but at the same time, because you have something you don't ever want to go down, that source of truth has to be highly available in and of itself. That's where ZooKeeper and etcd come into the picture.
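The master-election problem above comes down to an atomic "create this key only if nobody else has" operation on the source of truth. The Python sketch below fakes that with a single-process `ConsistentStore` guarded by a mutex. It is purely illustrative: the class, its method names, and the `db/master` key are assumptions for the example, not the etcd or ZooKeeper API, and a real store would replicate this decision across machines and expire the lock if the holder dies.

```python
import threading
from typing import Dict, Optional


class ConsistentStore:
    """Single-process stand-in for a strongly consistent key-value store."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self._mutex = threading.Lock()

    def create_if_absent(self, key: str, value: str) -> bool:
        """Atomically set `key` only if nobody holds it; True means we won."""
        with self._mutex:
            if key in self._data:
                return False
            self._data[key] = value
            return True

    def delete_if_equals(self, key: str, value: str) -> bool:
        """Release `key`, but only if `value` is still the holder."""
        with self._mutex:
            if self._data.get(key) != value:
                return False
            del self._data[key]
            return True

    def get(self, key: str) -> Optional[str]:
        with self._mutex:
            return self._data.get(key)


store = ConsistentStore()
results: Dict[str, bool] = {}


def try_promote(replica: str) -> None:
    # Every replica races to claim the master key after the old master dies.
    results[replica] = store.create_if_absent("db/master", replica)


threads = [threading.Thread(target=try_promote, args=(f"replica-{i}",)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

winners = [name for name, won in results.items() if won]
```

Whichever replica wins, exactly one does, which is the property that stops two databases from both believing they are master.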
Now I want to survey the audience on ZooKeeper or etcd users. Anybody actually deploying this stuff in production? Would you guys be willing to share which one you use and how you use it? You use ZooKeeper? In what part of the stack, where does it tie in?
Audience: We use it both as a directory service and to prevent two boxes from claiming ownership of the same shard. We share data across multiple nodes and want to prevent two nodes from thinking they own the same thing. It's sort of like an implementation of distributed locking using ZooKeeper. Currently we're concerned that it's not fully dynamic yet. It's just there to prevent two things, which have been configured wrong, from competing for the same resource.
Got it. What is the platform you're building, just curious?
Audience: We're building an email platform where every shard is a set of users.
Got it, every shard is a set of users. Anybody solving for a similar class of problems? I'm curious how people are doing it without this. How do you manage the sharding and make sure the right app server is talking to the right thing? Does anybody have an approach to this that does not include ZooKeeper or etcd? No? Okay.
I definitely recommend starting to study up on these tools a little bit, particularly if your platforms are starting to grow in capacity. You will need a solution to the problem that these things solve, in some form, at scale.
Google does it with their own stuff, Chubby. Yahoo spun out Hadoop, and ZooKeeper is part of that suite. Etcd is a thing that we built as a more devops-friendly version of this. It's definitely something that is an important part of the stack when we talk about locking.
The next piece is around cluster management. Remember, right now our operational goal is that we can take anything down and things keep running. Which means we need to forget about any single machine as a provider of resources. That brings in cluster management tools. And with cluster management we also need resource scheduling. Resource scheduling is like: okay, this application just needs 256 megs of memory to run somewhere in the environment. Find me a machine that it runs on.
This is actually a pretty difficult computer science problem. It's called bin packing. Bin packing is not a trivially solved problem; I think it's an NP-hard problem. Resource scheduling is a component of all of this. There are a couple of tools out there to help you. YARN, again, is part of the Hadoop suite. There's Mesos. Fleet is the one that we built and have sort of brought into all this.
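Because exact bin packing is hard, schedulers commonly fall back on heuristics. The Python sketch below uses one well-known heuristic, first-fit decreasing, to place applications by memory requirement. It is only an illustration: the `schedule` function, the machine names, and the sizes are invented for the example, and nothing here is claimed to be what fleet, Mesos, or YARN actually do.

```python
from typing import Dict, Optional


def schedule(apps: Dict[str, int], machines: Dict[str, int]) -> Dict[str, Optional[str]]:
    """Place each app (memory in MB) on the first machine with enough free memory.

    Apps are considered largest first, the "first-fit decreasing" heuristic.
    It is not guaranteed to be optimal. Apps that fit nowhere map to None.
    """
    free = dict(machines)                     # remaining MB per machine
    placement: Dict[str, Optional[str]] = {}
    for app, need in sorted(apps.items(), key=lambda kv: kv[1], reverse=True):
        placement[app] = None
        for machine in free:                  # dicts keep insertion order
            if free[machine] >= need:
                free[machine] -= need
                placement[app] = machine
                break
    return placement


machines = {"m1": 512, "m2": 512}
apps = {"web": 256, "worker": 384, "cache": 128, "cron": 256}
placement = schedule(apps, machines)
```

In this toy case the four apps need exactly the 1024 MB available, and placing the largest first lets them all fit: the 384 MB worker and 128 MB cache share one machine while the two 256 MB apps share the other.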
Is anybody using any of these components at all in their own environment? No.
Okay, if you really want to step it up to what you'd see inside of a Google-type deployment (in Google they call it Borg, that's their internal cluster management system), the smallest deployment you could probably have with this is like three servers, and you could take out one of those servers, and your applications will keep running. But if you lose two, you've sort of lost the quorum, and your applications will stop running.
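The three-server example follows from simple majority arithmetic, sketched below in Python. The function names are made up for illustration; the rule itself (a quorum is a strict majority of the cluster members) is the standard one for consensus-based systems.

```python
def quorum_size(members: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return members // 2 + 1


def has_quorum(members: int, failed: int) -> bool:
    """True while the surviving members still form a majority."""
    return members - failed >= quorum_size(members)


def tolerated_failures(members: int) -> int:
    """How many members can be lost before quorum is gone."""
    return members - quorum_size(members)
```

With three members the quorum is two, so `has_quorum(3, 1)` holds and `has_quorum(3, 2)` does not, matching the talk. The same arithmetic shows why clusters are usually sized with odd numbers: four members tolerate only one failure, the same as three.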
This gets pretty sophisticated, but it also simplifies how deployment works, if you can trust your cluster management software to keep working.
If we tie all of this together: first you need to push an update, then you need to have something that is updateable, which is where the immutable pieces come in, and then we need to actually get it deployed in an HA, fault-tolerant manner.
I'm going to encourage you guys to update your apps! Think about this sooner rather than later, because it will change the architecture of how your applications actually work. Do each of these components here, and again I really encourage you to think about this even when you're on the smaller side of your application deployment. It might sound super hardcore and sophisticated, but it's much more difficult to port your application to this sort of mentality down the road. Start as soon as you can.
And if we can do all this together, we can give you way better security, way better reliability, and we can update the web where the actual important parts matter.
So thank you guys so much for your time. I know that was a bunch of information, but I appreciate it, and I wish you guys all the best on building your platforms.
Q: What is the roadmap for CoreOS 1.0?
A: The OS itself needs to get fully stable. Right now we do crazy things like reboot you, without you asking, to update you. Obviously, in a production environment randomly rebooting people is a little bit crazy. We built in all this functionality to make it safe to do, but it still makes people quite uncomfortable to think that their servers will just magically reboot on them.
We're getting to the point where we are stable enough that we turn that feature off and the distro is working well.
Then on the cluster management side there's etcd and fleet, which go hand in hand. Our goal there is that etcd needs to be stable, so that it doesn't actually go down and can handle the fault tolerance as well as scaling the cluster size up and down, and do that reliably. I'd say we are about 70% there on etcd.
Then fleet, it's the part that takes CoreOS and plugs it into etcd, and it allows you to just boot more machines. It all kind of works together. Getting that fully stable and working is also on the list for 1.0. Right now it works at, I'd say, 60% sort of quality. What I target for this internally is a bit ambitious, but we want to have 1,000 machines under management with no problem. We'll probably call it 1.0 before it's actually that ready, but that's definitely how we're thinking about it: in units of 1,000, and fleet can manage that.
Q: If you're a new start-up and haven't thought about moving to a properly distributed system, when should you start thinking about this?
A: So most of the time when you're just starting out you don't even have your product well defined. I would recommend starting out on Heroku or App Engine or whatever, since they solve these sorts of problems for you, and just punt on the whole thing. Actually start doing this as the next-gen version of your application when you go and rewrite it, because you have a working product and you're ready to go. I would start by doing nothing, and then start with a setup like this.
What's not ideal, and the trap a lot of people fall into, is the "Oh, I'm just going to boot an Ubuntu machine and go set up my app, and oh my gosh I need two of them, what do I do? Oh, now my database, I should move that to a different box." And then you just end up with this giant mess that's totally hard to manage.
If you start out with more of this platform-y type environment, App Engine or Heroku, knowing you're going to re-architect your app at some point anyway, that's okay. Then you won't have to worry about ops at all, for a little bit. Then when you're ready to step it up to the big leagues and you actually know what you're building, go after this style of approach. Again, this works all the way down to a small cluster of three machines, or one machine if you don't need the HA capabilities. If you need to run somewhere other than a cloud, you could do it at the small size.
We are a very sophisticated team that's seen all these problems before, and everyone wants to start at this, but that's because we intend it to be big from the get-go.
Definitely don't think about it too late, but I would also run your infrastructure with other people as long as you can get away with it. As long as that works for you.
And also keep in mind, a lot of these tools are meant for the platform builders as well, not the end user. This is something that you'd run under Heroku, or run under App Engine. My understanding is that a lot of you guys are building platforms yourselves, and so that's why this is a tool to consider earlier.
Q: How does this fit into a workflow where you're deploying system-level changes and then terminating the older versions?
A: That's a similar model, and essentially you have the same set of issues here; it's just that your application is that instance instead of being a container on the instance. Does that make sense? Here, EC2 is doing the resource scheduling for you instead of it being built on top of your operating system.
Q: How do you handle an application change versus a larger system-level update?
A: The way we think about it is that the base OS is actually just another application that's updated like this. The kernel and your init system, everything is just an application that pings home, gets an update, and reboots as its upgrade mechanism. Then your application is a separate one that pings home, gets an update, and upgrades itself. They're just two different applications.
Q: What controls or safety measures do you have in place when you do a CoreOS update?
A: So you're talking about on the OS side? On the OS side, with that graph we were showing, essentially if it encounters an error in any state it stops and the client will revert itself. The client in this case is the OS. The OS will revert itself if it's unable to successfully update.
Success is determined in one of two ways: there is a default success check, which is "can I do another update?", or there's a success check that you can hook to validate whatever success metrics mean to you.
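The update, check, and revert flow can be sketched as follows in Python. This is a simplified model under stated assumptions: the `Updater` class and its methods are hypothetical names, the default check is a stub standing in for "can I do another update?", and the real CoreOS mechanism works at the level of the root file system rather than a version string.

```python
from typing import Callable, Optional


class Updater:
    """Tracks the running version and reverts when an update fails its check."""

    def __init__(self, version: str) -> None:
        self.version = version

    def default_check(self) -> bool:
        # Stand-in for the default "can I do another update?" test.
        return True

    def update(self, new_version: str,
               check: Optional[Callable[[str], bool]] = None) -> bool:
        """Switch to `new_version`; revert if the check fails or raises."""
        previous = self.version
        self.version = new_version
        try:
            ok = check(new_version) if check is not None else self.default_check()
        except Exception:
            ok = False                   # an error in any state counts as failure
        if not ok:
            self.version = previous      # automatic revert
        return ok


def failing_hook(version: str) -> bool:
    """Hypothetical custom success hook that rejects the new version."""
    return False


node = Updater("v1")
good = node.update("v2")                        # passes the default check
bad = node.update("v3", check=failing_hook)     # custom hook fails, so revert
```

After these two calls the node is still on "v2": the failed attempt left no trace, which is what makes it tolerable to let updates roll unattended.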
The point is that it is automated. That it should be able to roll in the middle of the night, and it's okay. Again, nobody trusts this. So we'll get you there over time. But that's the whole point.
All of this stuff that we talked about is to make that safe. That you can actually update and not take downtime. Everybody's been burned by this over and over again, and it's because of the way applications are architected today that it's not safe.
Q: How do you deal with applications that require persistent storage?
A: Being completely practical, the answer today is to use existing stuff that already works for your database. So use RDS on Amazon. Google has stuff like GFS which solves this internally, but it's a radically different approach than what people do with data storage outside of that world. Today it's like: storage, put it on RDS. Manage that as its own whole thing. For big companies, when they ask us, "Well, what about our Oracle database?" We don't want to touch your Oracle database. That's not going to run on us.
This is for your applications that you actually want to be distributed and fault-tolerant. Existing databases like Postgres/MySQL were just fundamentally not built to be able to do that or to do that well. They can kind of do it, you can sort of make that work, but they just aren't built to do it. Places where this actually works are where they just have a different database than what we have on our side of the world.
Punt until we have better technology.
Q: How does etcd fit in with systemd?
A: Systemd, I didn't cover it too much here, but that is definitely something to keep an eye on. All the Linux distributions are going to be switching to systemd.
Systemd is the init system. It's what the kernel boots to actually run processes on your Linux box. Ubuntu is switching, Debian is switching. Red Hat is shipping it in their next version, which means CentOS will follow. Arch is already there. Gentoo, it's a matter of time.
Everybody is switching to systemd, which means very soon all the Linux you run will be Linux and systemd. It changes not just that common Linux space, but how you actually do things, like starting and stopping your web server.
What we've done is we took systemd, which has APIs like D-Bus. Anybody familiar with D-Bus? It has the D-Bus API for actually controlling it, saying start this service or stop the service, start my web server, or stop my web server. We plumbed that straight into etcd so that you can have a distributed init. It takes the abstraction of single-machine process management and puts it across a set of machines. You can tear down one host and your process will get picked up and run on a different machine.
This obviously requires a good actor in terms of the application that's running in this environment. Not everything can handle that, to be shot and then be running somewhere else and have it be okay. But tied together with all this dynamic service discovery and distributed locking, and all these components, you could actually make it start to work. It does require a different way of thinking about how you actually run these systems.
Systemd, again, regardless of CoreOS or anything, check it out, because you're going to have to deal with it. If you interact with a Linux box in any way, you're going to have to deal with it soon. CoreOS already ships with it. I think Ubuntu 14.10 will have it, and it does change everything. No more /etc/init.d/apache2 start; you're going to run systemctl start apache, and there'll be a lot of docs to read and that stuff. It's coming.
Q: What are the tradeoffs of etcd versus ZooKeeper?
A: Good question. ZooKeeper does a pretty good job at everything. The main issue, and the reason we built etcd, is really around clients. It's really hard to have a non-Java client with ZooKeeper.
Actually, for the guys that are using it, do you write everything in Java?
Audience: Also Python. That was easy to use.
There are some clients out there that are so-so, but the protocol is really difficult to get right in a client, which means it's primarily just Java. Java apps are able to use ZooKeeper.
We built etcd originally just because we wanted to make these primitives more accessible to a normal developer. ZooKeeper talks a binary API that's really only implemented in Java, and we talk HTTP and JSON with etcd. It's just a little friendlier to deal with. One big advantage of ZooKeeper is that it's actually stable and works pretty well.
Etcd is still coming along, and this is the part of the system that has to be, like, perfectly stable, otherwise the whole thing falls apart. If you need rock solid stability, choose ZooKeeper. Etcd is going to get good over time.
As an open source project we've seen right around 75 outside contributors, and like 4,000 stars on GitHub, if that means anything. It's coming along as an open source project, but it's still early.
Q: How will you deal with package upgrades? Will you split it like Fedora and Red Hat?
A: The state of the art on that is Fedora and Red Hat. Fedora is like the bleeding-edge latest version of what Red Hat is, and somehow it magically becomes Red Hat, on like a three to four year cadence, and then it's stable.
Our model is that we have an alpha channel. Alpha is a release candidate for the beta channel. The beta channel is a release candidate for the stable channel. This is the model that you see on things like Chrome. We'll ship alphas almost daily, like a SaaS application, and then those will get promoted to betas, probably on around a weekly cadence.
We haven't actually shipped our first beta; we have our first release candidates for betas out, but none of them have made it yet. Then the stables will hit about once a month or so on top of that. It's a straight pass-through. The thing that hits alpha is bit-for-bit identical to the thing that becomes stable, with the quality assurance all there and everything.
Q: So you keep your package upgrades simple because CoreOS is small?
A: Exactly. With CoreOS, we ship you the Linux kernel.
The Linux kernel has this trick that not many people know, which is that it never breaks userland. A really old Linux binary will run on a brand new kernel, no problem.
The things that break are those inter-application dependencies. You need that version of OpenSSL, you need that version of your Postgres client or whatever it is for your application to actually run, and that's the stuff that breaks across distro upgrades. Not that it doesn't work on the kernel.
We just ship the kernel and systemd, and a couple of packages to help you out. We ship Docker because that's a convenient way to bring in your container. But from there we push everything into the actual container.
If you have any other questions I'll be here. Thank you guys so much.