All right, you guys, so today I'm going to talk about HTTP 2.0, and specifically why do we need it, and what's the problem that we're trying to solve more so than anything. Because as technologists we like to solve things. We like to write cool things. We like to make things fast. But what's the actual problem that we're solving?
So I work at Google, as the slide says. I actually work quite closely with the Chrome team and the Make the Web Faster team at Google. It's kind of a cross-company initiative for both how do we make our services fast and how do we make the Internet fast as a whole.
Part of HTTP 2.0 actually has some roots at Google, because we started our work on SPDY, which we will touch on in a second, and now of course it's something much bigger and something I'm very excited about. I think it will literally have a huge impact on web performance. It will help make our clients faster. It will help make service more efficient, and actually reduce latency for a lot of users. So it's a big deal.
Right off the bat, what are we trying to solve?
The point of HTTP 2.0 is basically all about latency. So there's latency and bandwidth. They are the two components of speed.
Specifically, HTTP 2.0, if nothing else, is focused on how do we minimize latency in the client. Today we do a lot of interesting tricks, hacks, if you want to call them that, in the browser to try and kind of game the system and figure out which requests do we send because we have a limited number of requests, etc., or connections rather. So HTTP 2.0 tries to address all of that. So these are the high level goals for the protocol.
"A protocol designed for low-latency transport of content over the World Wide Web"
I think almost the first half of this talk is me trying to convince you that this is in fact a big problem and something that we need to address. And then in the second part we will actually look at what does HTTP 2.0 do, how does it look, and how does it look on the wire, even.
So first of all, one of the big challenges that we have on the web today is making stuff faster. So a lot of user traffic is migrating to mobile phones, something that we see in spades at Google. A lot of our searches are migrating towards phones, and every single team at Google is trying to figure out, how do we make our product fast on mobile? This is a big, general problem across the web.
Specifically, what we're trying to figure out is, like, well, we know that based on all the user studies out there, despite the fact that everything seems to be getting faster in our day-to-day life, there are some pretty good psychological constants. Like, no matter when you do the study (these UX studies were done in the early '90s, you can do them today), you'll basically find out that the user reaction time is within 100 milliseconds.
Now, if you're a hardcore gamer and you're sitting there like, "Hey, dude, I can tell the difference between a 30 millisecond ping time and a 50 millisecond ping time," it's, like, fine, but that's a slightly different use case. Right?
Here, basically, we're saying: if you click on a button, we want to respond to you within 100 milliseconds. It feels instant. If it goes above that, significantly above that, above one second, you basically lose the context. All of a sudden you do a mental context switch of, like, "I pressed the button. It's not responding. I start thinking about something else. I've got to send an email," and you may have lost the user. So that's kind of the target, right? We want to have everything fast.
To keep the user engaged, the task must complete within 1000 milliseconds.
And this turns out to be a huge challenge on mobile, where just sending one request can take somewhere in the order of a second. And our pages are not one request. Our pages are much, much bigger than that. So that's the framework.
In parallel to that we have this challenge of ever increasing and ambitious web that we're building. Compared to what we were building five or even ten years ago, the web today looks completely different.
We're not just building documents. We're building entire applications.
This is just something we have to deal with.
And the interesting number here is, of course, this: 86 and 57; so 86 requests. That's what it takes to compose a webpage.
So when you're thinking about mobile and I tell you that an average page takes 5 seconds to load on mobile, and you assume that an average request takes 500 milliseconds or more on mobile, 57 requests, to some degree it's a miracle that it even loads in 5 seconds. So that's kind of the problem that we're trying to fight.
So at Google we have been tracking this pretty closely, and there's some good news. Every year, we actually use Google Analytics. Google Analytics collects navigation timing data, which is basically the real user timing data from the clients when they access your page. We anonymize all that data. And once a year we basically run an analysis of what is the average or the median page load time on mobile versus desktop.
So the cool part about this, these graphs are actually comparing 2012 to 2013. You can see that in 2013, the latency, especially on mobile, has decreased significantly.
"It's great to see access from mobile is around 30% faster compared to last year."
It's kind of hard to draw conclusions from an aggregate as big as all of the Google data, but our theory, and I think we have good reasons to believe that this is why this is true, is that this is dominated by North America. And specifically the fact that we have a very strong rollout of 4G networks across North America. So most of the shift in this latency is not represented across the world. It's very heavy in North America, which is good news for us here. But nonetheless, it's still definitely a problem.
Before we were like 2x compared to desktop versus mobile. Now we're getting closer. So that's good, right? Stuff is getting faster. I'm just going to sit back, and just like back in the old days when the CPU speeds just got faster, I didn't have to do a thing, this is not a problem anymore. The network will save us, the 4G. There are banners plastered everywhere. The fastest, latest AT&T, most reliable, whatever, network. Done. Web performance, solved deal. That's partially true.
If you actually look at the data, for example, for bandwidth that is in fact what's happening. The speeds are increasing.
So, for example, Akamai has a nice site, Akamai IO, where you can basically go in and type any country and look at average bandwidth, at least as seen by Akamai. So this is data from 2007 to basically the beginning of 2013, and you can see that there is a strong trend towards basically increasing throughput, or bandwidth across the world.
Average connection speed in Q4 2012: 5000kbps+
We have Japan leading the pack. But basically most of the countries are near or well above the five megabit per second limit. Not limit, just five megabit per second threshold here. And we'll see why that's important in a second.
So another component that is often forgotten is latency. When was the last time that you saw your ISP, whether that's mobile or whatever ISP, advertise latency? Like, "Our last mile latency is x milliseconds." Right? Never, ever would they advertise such a thing. Mostly because it's actually terrible.
So FCC, for the past couple of years, has actually been doing a report or a yearly study, which has finally started to capture some of this data. Which is great. There are two parts of this. One, is now we actually have visibility to this data, and two, the reports over the last three years basically haven't changed. They're static.
This tells you a couple of different things, but before we get there, basically what they found out is that across all the different providers in North America you have 18 milliseconds -- that's your last mile latency. That's basically your router at home to the ISP, basically the POP box at the ISP. This is not even to your actual server. This is just your last mile latency.
So just for context, 43 milliseconds is like me going, "I can send a packet from here to New York and back in 43 milliseconds." And this is just your last mile latency with DSL.
So that's significant. There's definitely room for improvement, and this is also a metric that we track quite closely at Google.
So worldwide we see that the RTT to Google is about 100 milliseconds, and unfortunately this number hasn't budged over the last couple of years. It has just been stable, which is not good. We would like to see it improve. In the U.S. the average latency is 50 to 60 milliseconds. So that kind of makes sense, right? If you have 43 milliseconds of latency on your DSL, then there's another 10 or 15 milliseconds to actually get to the Google servers. And Google servers, we try to position them as close as we can to all the ISPs for this exact reason. So this is kind of an optimistic scenario.
So the good news is, at least, compared to 2011, in the U.S. the RTT has decreased by 10 milliseconds, which is significant. But the rest of the world is basically flat.
So all of this is to say we are going to continue to see improvements in bandwidth. And bandwidth matters for things like YouTube videos and HD videos, you like to watch Netflix, what have you. Bandwidth really matters there, and the good news is we can actually get more bandwidth. If we saturate all of our links, we can just dig another tunnel and put in more fiber; we can just bond the different links and get more throughput. That's expensive, but it's doable.
With latency it's kind of hard because we have this thing called the speed of light, and we have not figured out how to go faster than that, yet.
So there are some interesting examples of people innovating in this space. It's like, well, you know what? Latency matters. We know that, and we know that latency matters for traders where nanoseconds count.
So there's a cool project that has actually been started, and I think right now it's paused. I'm not sure of the actual final outcome, but I know they have invested a lot of money into this. It's Hibernia Express. Basically, this company has figured out that, hey, there are traders in New York, there are traders in London, that care about latency. So if we build a shorter cable, literally, there are a bunch of cables there, but if we take a slightly more direct route between these cities, specifically 300 miles shorter, then we can save about five milliseconds of latency. Which is significant if you have a trading algorithm.
So this project costs about half a billion dollars. You do the math and you're like, well, I'm not sure if this is a proper unit, but $80 million per millisecond. That's what they're saving, and there are plenty of examples like this. We are terraforming earth between New York and Chicago to build faster links. There's crazy stuff going on. Basically, that's the only thing we can do. We can just do a shorter cable between two endpoints. But even with that we're fixed.
Basically, the theoretical limit is 35 milliseconds. That's it. We can't go any faster.
So that sucks, and that sucks because we are already within a small constant factor of the maximum speed. The current latencies are about 1.4 to 1.5 times that theoretical minimum. So that tells you that even if we do everything to build a direct route and have a perfect link between these two, we're going to get an improvement of 30%, which would be great, but it's not going to transform the world of web performance.
This is the graph that is the key graph that got Google to start thinking about this seriously and actually start the work on SPDY back in, 2009, I guess, or 2008 even. So this is a very simple experiment that we set up. Basically, we picked a bunch of representative web pages on the web and we said,
"Look, there's two components. There's bandwidth and there's latency. So let us vary these two things independently. We will just keep one constant and we will just increase bandwidth and see how that affects the page load time."
So page load time here is in milliseconds. You start with one megabit per second of throughput, and the page loads in about three seconds. We increase that to two megabits per second, and it's not quite a 2x improvement, but it's close, which is what you'd like to see. You continue increasing that, and you find that after about five megabits per second, you're into single-digit percent improvements.
If you go from five megabits to ten megabits per second, you're going to get your pages loading faster by 4%, which is bad news, right? Because we can increase bandwidth, and we continue to increase bandwidth, but it's not helping us build a faster web. It helps us stream video better because that's where this stuff matters, but it's not loading pages faster. So that sucks.
But then you look at latency. This is the exact graph that you would like to see, which is, the lower the latency, you have this linear function that just basically says, "You saved one millisecond, you're going to get this improvement in your page load time." So that's pretty awesome, except that, as we found out, it's very hard to change latency.
And as you saw in the Akamai slides that I showed you earlier, the average connection in the US is over five megabits per second. So that tells you that if you want to run out and upgrade your connection and buy into the advertising of the newest, fastest whatever offered by your local provider, your page is not going to load faster.
Your video will be streaming better. You may get an upgrade in your quality, but your page is not going to load faster. Which I think is a surprise to many people, engineers included. This was definitely a surprise even to us when we ran these experiments.
So based on this we basically said, "Look, so why is this a problem? How do we solve it at the protocol level?" Because it turns out we need to tweak our protocols to make them better to work around this problem.
So for web browsing, at least, bandwidth doesn't matter much. That's the first big aha moment and big takeaway.
And of course, then there's mobile.
So everything I've said so far about latency is just two or three or four times worse for mobile. So this is an entire talk on its own. Like, how do the mobile networks work? I think everybody here would agree with the general statement of, like, "Oh, mobile networks are so unpredictable. Latency is so variable." It's just very hard to design fast apps that leverage the mobile network.
Let me walk you through this. This is literally a talk on its own, but just stay with me. Let's say we want to send a packet from the external network, like you have a push notification or what have you, and you want to notify the client on the phone that there is such a thing. So you send the packet from the external network. It comes into the mobile carrier, and the mobile carrier basically has one global router, which is the packet gateway. It has a couple of these kind of big ones, so that's the PGW. It's the same thing as your router at home. It just terminates the connection. So it terminates your TCP connection right there at the PGW, and the PGW actually looks at a bunch of rules, like, should I be forwarding this type of traffic, etc.
So it's a pseudo firewall, pseudo basic router. It doesn't do much more than that. It sends the packet to the serving gateway, and the role of the serving gateway is to figure out where you are on the mobile network. Because one of the nice properties of the mobile network is, I was in Mountain View earlier today. I hopped in a car. Now I'm driving and now I'm in San Francisco, and the local tower has no idea that I've changed. Basically there needs to be a mechanism to figure out where you currently are and which tower is currently servicing you. The serving gateway has no idea.
So what it does is it says, "Look, I'm going to talk to this MME instance which is basically like a user database." A user database stores basic things like, you have an account, you've paid your bills. I should actually forward you this packet, and it also stores where you are currently within the network. Except sometimes it doesn't actually know. It just knows that generally this person seems to be in the San Francisco area. So my phone is sleeping right now. It's not notifying. It's not talking to the tower. It just knows that I'm in this general area, and there may be multiple towers.
Okay, so we've gotten to the serving gateway. We've talked to the MME. The MME says, "Look, I think he's in San Francisco. Let me flood all the towers in San Francisco and get them to send out a broadcast that basically says, 'Hey user blah blah blah, please wake up, because I have a packet for you'." The towers broadcast the signal. My phone wakes up every once in a while, detects that there's a message waiting for it, and then sends a message back to the local tower. It basically starts a negotiation with the local tower.
Once this negotiation is complete the local tower says, "Hey, I've registered this user." It gets updated in the user database, user database gets back to serving gateway, serving gateway can then forward the packet to the actual tower, the tower delivers it to your phone.
If you follow all of that, that has to happen within, well, ideally milliseconds. Clearly that's very hard.
So this table right here, these are actually numbers straight from the AT&T FAQ. If you actually dig deep they will show you these numbers in there. And for HSPA+, which is the current, like, when mobile providers today advertise 4G, they're actually advertising HSPA+. There's a standard called LTE Advanced, which is the true 4G, if you will.
So for HSPA+, the latency, just within the core network, basically doing this kind of flow, and this is the most complicated flow. The flow outbound from your device is a little bit simpler. It's also part of this. But in any case, this is just to illustrate that if 43 milliseconds on DSL seems like it's high, if you look at the mobile numbers, it's hundreds of milliseconds. If you happen to get into the EDGE zone, which every once in a while still pops up on my phone, and it's scary when it does, you're basically looking at a second of just getting a packet from your phone and out to the external network. Then you actually have to route it on the external network.
So this is a big problem on mobile. We're trying to figure out how to make it go faster. Thankfully, as I mentioned, with 4G and LTE deployment, for once, North America is leading deployment of this. We're at the leading edge of this stuff. We have the best performance. We're getting down into sub-hundred milliseconds, but nonetheless, big problem.
And all of this is just to send a single TCP packet. We didn't talk about sending a webpage. This is just one TCP packet to send a notification.
So hopefully by now I have convinced you that latency is, in fact, a problem. We can continue increasing bandwidth, but latency is a problem. So why does this affect HTTP in particular?
There are actually multiple problems at multiple layers. One, we need to talk about how TCP works. First of all, we have TCP congestion control and avoidance, and specifically we have this feature called TCP Slow Start.
How many people here are familiar with Slow Start? All right. Maybe half.
TCP Slow Start is a feature, not a bug.
So, the basic idea behind Slow Start is, we don't know what is the capacity of the link between your node and the destination node. There could be an intermediate node that is saturated, like the ISP is currently servicing a lot of traffic for whatever reason and it can't handle more load. So we don't want to overwhelm the network.
If everybody just woke up and started sending megabytes of data, we would saturate the network and we would just get into the state of congestion collapse, which is exactly what happened in the mid '80s.
Basically, the network just collapsed, and you couldn't get out of it. You had to reboot the whole thing. There were instances, reports at the time, when this congestion collapse was reached that some packets would literally take a day to get to the other person on the other end.
So Slow Start is that fix, if you will, and the idea of the Slow Start is when we start a new TCP connection, we're not going to use all of the available bandwidth. We're going to send you a little bit of data, you acknowledge that data if it's delivered successfully, and then, if that is successful, we will increase the window size of how much data we send.
So how do we pick this number? Very simple. The original specs actually said you send one packet. So you send roughly 1,400 bytes. You acknowledge that. I'll send you 2,800 bytes. So this is the congestion window, the CWND, and that's basically what I'm showing you here. So this is the exponential growth.
That number has been updated. Most recently it has actually been updated just in the last year to 10 packets. So we can send up to 15 kilobytes of data, which is significant improvement over the previous value, which was three or four packets.
So we can send you 15 kilobytes of data, then we have to pause. I don't care if you're on fiber or what have you. 15 kilobytes is all you get. We're going to wait a full RTT, and then we're going to increase that to 30 kilobytes and then to 60 and so forth. At some point, packet loss will happen, at which point we will restart this algorithm. It's a slightly different algorithm. It's congestion avoidance, but that's TCP Slow Start in a nutshell.
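As a rough sketch of the growth just described, the window can be modeled as doubling every round trip until loss occurs. The 1,460-byte segment size and the function name below are assumptions for illustration, and packet loss and congestion avoidance are left out.

```python
# Sketch of slow-start window growth, assuming no packet loss.
MSS_BYTES = 1460  # assumed segment size, close to the ~1,400 bytes mentioned


def slow_start_windows(initial_segments, round_trips):
    """Return the congestion window in bytes for each successive round trip.

    The window doubles after every fully acknowledged round trip.
    """
    windows = []
    cwnd = initial_segments
    for _ in range(round_trips):
        windows.append(cwnd * MSS_BYTES)
        cwnd *= 2
    return windows


# Starting window of 10 packets, as in the updated default.
for rtt, size in enumerate(slow_start_windows(10, 4), start=1):
    print(f"RTT {rtt}: up to {size / 1024:.1f} KB in flight")
```

With a starting window of one packet instead of ten, the same function shows why the original default needed several extra round trips to reach the same window.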
So this is surprising to a lot of people, because basically what this tells you is that no matter what the speed is, whether you're on a 4G network or a 3G network, latency is actually the bigger problem of the two at the beginning of that connection.
TCP is optimized for bulk and long transfers of data, whereas a lot of our actual traffic is short and bursty. So here's an example:
Let's say we want to transfer a 20-kilobyte file over a low latency link, or a relatively low latency link. So in this case I'm saying we're going to transfer from New York City to London. I'm going to assume that that's 56 milliseconds of round trip time. There's going to be 40 milliseconds of server processing time, which is very fast, and let's just say that we have five megabits per second, which is actually irrelevant, but good to have.
So what happens? First we have to open the TCP connection, which is the SYN and SYN-ACK. So that's one round trip. We haven't sent any data. We're just opening a connection. That's already 56 milliseconds. Then we send the request. We incur the server processing time. But we can't send the whole response at once. We want to send 20 kilobytes, but we can only send, in this case, four packets, which is the previous value for this congestion window. So we send those four. We wait. We get an acknowledgement. We send eight, and then we send the rest.
So you do all the flow and you figure out that to send 20 kilobytes of data on a fairly low latency link, it's going to take us 264 milliseconds, which sucks, frankly. This does not take into account DNS, and the fact that if you had to do a TLS handshake, that's another two roundtrips, or more, even. So it kind of sucks.
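The arithmetic behind the 264 milliseconds in the New York to London example can be reproduced with a small model. This is a simplified sketch, not a TCP implementation: it assumes 1,460-byte segments, no packet loss, a window that doubles each round trip, and it ignores bandwidth, DNS, and TLS. The function name is made up for illustration.

```python
import math


def transfer_time_ms(file_bytes, rtt_ms, server_ms, initial_segments,
                     mss_bytes=1460):
    """Estimate the time to fetch one file over a brand-new TCP connection.

    One round trip for the handshake, the server processing time, and one
    round trip per slow-start round needed to deliver the whole file.
    """
    segments_left = math.ceil(file_bytes / mss_bytes)
    cwnd = initial_segments
    rounds = 0
    while segments_left > 0:
        segments_left -= cwnd  # send one full window
        cwnd *= 2              # window doubles after it is acknowledged
        rounds += 1
    handshake_ms = rtt_ms
    return handshake_ms + server_ms + rounds * rtt_ms


# 20 KB file, 56 ms RTT, 40 ms server time, starting window of 4 packets.
print(transfer_time_ms(20 * 1024, rtt_ms=56, server_ms=40, initial_segments=4))
```

Under these assumptions the call prints 264. Changing `initial_segments` to 10 removes one round trip from the same transfer, which is the motivation for the larger starting window.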
With HTTP specifically, we have HTTP 1.0 and HTTP 1.1. One of the things in HTTP 1.1 was that we focused a lot on performance. At least, we clarified a lot of the caching. We actually added this feature called pipelining. And some of those things have worked out and have been great, and some of them have not. Unfortunately, one of the things that has not worked out is HTTP pipelining.
So the basic idea with HTTP pipelining is, by default, HTTP provides no multiplexing in the sense that you send a request and you must block and wait until you get the response.
So this is this graph right here. Let's say that I have one connection and I want to request three files, it's completely sequential. I send you one. I wait. Then I get it back. I send you the next request. Which kind of sucks, right?
Pipelining said, "Hey, this sucks, especially if you have server processing time and other delays. This just creates more and more latency."
So what if it could send you all three requests at once? You could do whatever you need to do to generate those three responses, and then you just send us the data for three responses back.
So there's a little bit of a gotcha here in the sense that, for example, let's say I send request one, two, three, and you generate the response for request three first. But the first one is not finished. You can't send the answer for request three before you finish response one. So that's head of line blocking. So basically, it's limited, but it helps you address some of the problems and limitations.
In practice, this is what ends up happening very frequently. You have that first request which, for whatever reason, like it's a dynamic file, takes a while to generate, and then the following two requests are for static assets, which, in theory, could be served very fast. That first request will block for a long time, and the other two are blocked too. So in part because of this, HTTP pipelining hasn't seen much adoption.
There's also the problem that a lot of the intermediate proxies just completely messed up how they implemented it, or they didn't implement it at all, so they would break, which sucked. Then the browser vendors actually had a tough time with this, because put yourself in the person's shoes who is designing this algorithm:
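The head-of-line blocking described above can be illustrated with a toy model. The timings are hypothetical, transfer time is ignored, and both function names are made up for this sketch.

```python
def pipelined_completion(ready_ms):
    """Completion times when responses must be sent back in request order.

    Each response waits for every response queued ahead of it.
    """
    done = []
    latest = 0
    for ready in ready_ms:
        latest = max(latest, ready)
        done.append(latest)
    return done


def interleaved_completion(ready_ms):
    """Completion times if each response could be sent as soon as it is ready."""
    return list(ready_ms)


# Hypothetical: one slow dynamic page followed by two fast static assets.
ready = [400, 20, 20]
print(pipelined_completion(ready))    # [400, 400, 400]
print(interleaved_completion(ready))  # [400, 20, 20]
```

The two static assets are ready after 20 milliseconds, but with in-order delivery they cannot leave until the slow first response has been sent.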
Basically, all of that is to say that pipelining just hasn't worked out.
Today the web is basically built on this model right here, which is sequential, and our only workaround is to just open multiple connections. So this has been our hack of the decade. We just said, look, you can't do much with HTTP 1.1 pipelining.
So basically all of the browsers over the past decade kind of tried different variables, and we've all more or less converged on up to six parallel TCP connections. So when you connect to a server we will open up to six parallel connections, which means that we can transport, at most, six requests in parallel, or get six responses in parallel.
Now, that's only partially true because us web developers, we're an inventive bunch. We're like, "Hey, six requests in parallel. That's not enough. I have an image gallery of six images. Let me just shard that across ten different domains."
We invented this whole new thing and we're like, "Whatever. TCP Slow Start? Forget it. Congestion control? Who cares? I'll just open more TCP connections."
So that's the whole premise of domain sharding. The only reason it exists is to work around this limitation, which is imposed intentionally by the browser vendors to say that too many connections actually hurts you, all right, because it causes congestion. So that's what we do today.
Going end to end, we have the DNS lookup. We have to do the socket connect. We have to do HTTP request. Then there is the actual content download. Even here I'm not showing the TLS time, which takes another couple of round trips.
So all of this takes a lot of time, and what typically ends up happening is, if you do the math for HTTP Archive, coming back to that original thing that we looked at. An average page, let's say, is about 1,200 kilobytes, 86 requests. Turns out that an average page ends up talking to about 15 distinct hosts on the web today, which is quite large. That tells you approximately the number of connections that we open.
And if you do the math here, you will figure out that an average request is about 14 kilobytes per request. There are definitely bigger requests for things like images, but most of the other assets that we download are very small, and we download them across many connections. So what ends up happening is we end up opening a lot of these TCP connections, and we never ever use the actual, or frequently, I should say, not never. Frequently we don't end up using the full throughput of the link. We just end up doing a couple of round trips, which are very expensive. We get up to a window of, like, 45 kilobytes, or 60 kilobytes, and we stop there. Then we abort the connection, which sucks.
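The arithmetic here is short enough to check directly. The page numbers come from the talk. The connection count is only an upper bound, under the assumption that the browser opens all six connections to every one of the 15 hosts.

```python
PAGE_KB = 1200                 # average page size from the talk
REQUESTS = 86                  # average number of requests per page
HOSTS = 15                     # distinct hosts an average page talks to
MAX_CONNECTIONS_PER_HOST = 6   # browser limit discussed earlier


def average_request_kb(page_kb, requests):
    """Average payload per request in kilobytes."""
    return page_kb / requests


def max_connections(hosts, per_host):
    """Upper bound on parallel connections if every host gets the full limit."""
    return hosts * per_host


print(f"{average_request_kb(PAGE_KB, REQUESTS):.1f} KB per request")
print(f"up to {max_connections(HOSTS, MAX_CONNECTIONS_PER_HOST)} connections")
```

At roughly 14 kilobytes per request, most transfers finish within the first round trip or two of slow start, so the connections never grow their windows very far.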
So, we want to be here. We want to have one TCP connection, which has the pipe wide open and we can just push as much data as we can. Instead we're stuck in this bottom triangle right here.
To put this in context of mobile, because I think it's very relevant to a lot of companies and everybody, if you add up the latency just for a single HTTP request... Right?
I didn't talk about control plane. So control plane, the idea here is, first before you can send anything from your mobile phone, you actually have to talk to a tower to get permission to send data. That communication with the tower actually takes anywhere from hundreds of milliseconds up to seconds. On 3G networks, on the old generation 3G networks, it literally takes seconds to do that.
So this is just a one-time startup cost when your phone has been idle. It turns out it actually, in large part, explains the high latency variability that a lot of people experience. But it's definitely a big problem.
So, you do that. You have the DNS lookup, that's a round trip. The TCP connection, that's a round trip. We have the TLS handshake, which is optional, then the HTTP request times four for that 20-kilobyte file, and all of a sudden you're looking at, on a 3G network, you're already one second in. We wanted to render our page in one second. For 4G it starts to get a little bit better, but nonetheless it's a problem.
I've already mentioned this, but one good thing that has happened recently is that the latest Linux kernels have updated their CWNDs to start with 10 packets. So this is true as of 2.6.33+, but really you should, if you're not running 2.6.33+, you should upgrade hopefully immediately because that will literally just double the performance, in terms of the startup performance of HTTP connections to the client.
But for best performance, there have actually been a lot of other TCP performance optimizations done since then. So really, you want 3.2+. A lot of this research has been done at Google. This specific paper is the one that argues for the increase in the CWND. There are other things, like proportional rate reduction, that are now part of 3.2+.
So if you're not running 3.2+, you should definitely look into upgrading that.
So these are the current limitations. We have constraints at the TCP layer. We know that we have problems in the HTTP layer. What have we done over the last decade?
But the other one that I think is also surprising to a lot of people is slower execution.
Another one that's very similar is spriting images, right? Same idea. Images are expensive, or requests are expensive, so let's sprite especially the small images.
Problem number one is it's gloriously painful. Thankfully now we have some automation for a lot of that, but I still know people who do this by hand, which is kind of sad.
Two, it actually has negative implications on the memory use. When you say you want to display the 16 by 16 icon on your page, we have to decode the entire bitmap of the sprite, which is width x height x 4 bytes and that's your memory use. So these sprites are actually occupying quite a bit of memory on mobile devices, which is a problem, actually, for a lot of mobile devices. Especially for image heavy apps, you'll actually find people talking about exceeding memory thresholds.
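To make the width x height x 4 bytes rule concrete, here is a small sketch. The 1024 by 1024 sprite sheet is a hypothetical size chosen for illustration, and the function name is made up.

```python
def decoded_bytes(width_px, height_px, bytes_per_pixel=4):
    """Memory needed to hold a decoded bitmap (4 bytes per pixel for RGBA)."""
    return width_px * height_px * bytes_per_pixel


icon = decoded_bytes(16, 16)        # the icon actually displayed
sprite = decoded_bytes(1024, 1024)  # hypothetical sprite sheet it lives in

print(f"icon alone:   {icon} bytes")
print(f"whole sprite: {sprite / (1024 * 1024):.0f} MB")
```

Showing a single 16 by 16 icon costs about 1 KB of decoded pixels, but decoding the hypothetical sprite that contains it costs 4 MB, regardless of how small the compressed file is on the wire.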
I already talked about this, but domain sharding. We are limited to 6 connections. What the hell? Let's just shard everything N ways and we're good to go! Turns out - - we've actually done studies on this as well - - in many cases you're actually hurting performance of mobile applications.
Sharding, in many ways, helps clients that have more bandwidth, which are your desktop clients, but it hurts people on slower connections, like mobile phones, because it causes congestion, it causes more retransmissions.
Basically you're making it even worse. It already sucks for those people because it's so slow. You're making it even worse.
And there's no perfect number for what's the right number of shards for your site. It's based on your app. It's based on your specific page even because different pages have different numbers of assets, like you have an image gallery or something. So that sucks.
Once again, it eliminates the request, which is good, but that resource can't be cached, because now you have to inline it into every single page, which sucks. Two, of course, there's the overhead of base64 encoding. So you're inflating the file, like an image file, by 30%.
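The base64 inflation can be measured directly. This small sketch uses random bytes as a stand-in for image data; the 3000-byte size is arbitrary.

```python
import base64
import os

raw = os.urandom(3000)           # stand-in for the bytes of a small image
encoded = base64.b64encode(raw)  # what would be inlined into the page
overhead = len(encoded) / len(raw) - 1
print(f"{len(raw)} -> {len(encoded)} bytes ({overhead:.0%} larger)")
```

Base64 emits 4 output bytes for every 3 input bytes, so the growth is about a third, in line with the rough 30% figure.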
These are kind of the core, these are the best practices that we preach that every site should do. Concatenate your files, domain shard, inline your images, but then there's a sea of red.
That's what HTTP 2.0 is, in fact, all about. Once again, let's come back to this. We want to improve the end user perceived latency. So this is basically how do we make HTTP work better with TCP?
Two, is we want to address head of line blocking. So this is the problem with pipelining, where even if we could send multiple requests, we couldn't get multiple responses interleaved. So that's number two.
Not require multiple connections, so we want to eliminate the need to have domain sharding. We just want to use one TCP connection because that is, in fact, the best way to get the best throughput.
We're not here to change HTTP 1.1 in a fundamental way. We're not going to change, I don't know, squiggly brackets or angle brackets or something. We want to preserve what we have. We want to preserve the ecosystem and make it as seamless as possible, ideally, to migrate.
Here's where we are today. This effort actually predates January 2012. Basically what happened was in 2008, 2009, at Google, we looked at this latency and bandwidth study. We realized that there's a problem and we started working on SPDY, which basically became the precursor to HTTP 2.0.
By January 2012, we had Chrome, we had Firefox supporting it. We had a lot of big sites. Google, of course, has been using SPDY for years now. But then also Facebook, Twitter and others started picking it up. So basically at that point it was becoming a de facto standard and we said, look, there should be a more formal spec around this.
At the beginning, we basically picked the latest SPDY draft and used that as a base. But to clarify,
HTTP 2.0 is not SPDY. We've changed it in a number of different ways and we made it better. It's based on SPDY, but it's not SPDY.
So today, actually as of July 2013, we actually have the first implementation draft. Earlier this month, we actually had an interop testing session in Hamburg.
So we have a Chrome implementation of HTTP 2.0. We have a Firefox implementation of HTTP 2.0. There's a bunch of server implementations. Actually, Microsoft has built a server implementation, so Chrome and Firefox were testing against the Microsoft server.
There's a lot of work going on and it's pretty exciting to see that something of this significance is moving as fast as it is, because I think everybody feels like this is a big problem and it's something we need to address.
Generally speaking, when you see a working group put out a timeline that says, "In two years, we will solve a ginormous problem on the web," you have to take that with a grain of salt. But so far, we're on track and on schedule and I think we may actually hit it. And if we don't hit it, I think we'll be very close, which is impressive, given the size of the undertaking.
So, there's a growing list of clients and servers as well. There's node implementations, etc., so you can actually use this stuff today, especially if you control both the client and server. Obviously you can't rely on, like Chrome advertising it today, Chrome has a branch where we implement HTTP 2.0, but we don't have it in a stable branch today. For that we have SPDY. But if you control both the client and server, go for it. This is going to work.
First of all, one TCP connection. We want to get the best performance out of a single TCP connection. We shouldn't need more than that.
We're introducing a new term into the lexicon of HTTP, which is "stream". Whenever I say a "stream", just think of a request.
Multiple streams can flow over a single TCP connection, which is the same thing as saying multiple requests can flow over a TCP connection. So streams are multiplexed and streams can be prioritized, so we're going to talk about that.
All of the magic basically happens, and the most important thing about HTTP 2.0 is, previously, you could open a Telnet session and just type in a bunch of text to say, "Get this page." Now we're using binary framing, and binary framing allows us to split messages into different binary frames and multiplex them across the same connection, so I'll show you some examples of that.
The binary framing is the core change and that core change basically trickles out as a whole number of different features within HTTP 2.0. So as long as you understand that, that's kind of the core of it. It allows new prioritization, flow control and server push, which we'll talk about.
This is just a note from Mark, who's the chair of the working group. We're not replacing HTTP. We're just redefining how it's laid out on the wire, if you will.
One of the questions is, "Is 2.0 really warranted? Is this such a big change? Why not 1.2?"
The answer is, well, because we're changing the wire format, it is such a big difference, as far as we're concerned, that it is a 2.0.
Because you can't just talk to a 1.1 server anymore. So that's the reason for 2.0.
What you need to know about the actual implementation, every frame in HTTP 2.0 has a consistent header. There's 8 bytes of header. All of the frames are length prefixed. So if you're a parser guy, you'll be very happy. The first thing that you read is the length of the frame, and at that point, you know exactly what you need to do to parse this. So efficiency is actually a big optimization concern here.
Once you know the length, you can figure out the type. The type basically gives you what type of a frame is being communicated here. Is it a headers frame or a data frame or something else? I listed a couple of them here.
For example, priority. Each frame can have a number of custom flags that each frame defines. Then there is a stream identifier. Each stream, the client and server, whenever they create a stream they declare an ID on it, like 1, 3, 5, 7, 9, etc. Whenever we split data into these packets, we always embed that ID such that on the other end we can figure out, "Oh, this thing that I received belongs to that stream." That's how multiplexing works. Really, that's all there is to it. That's a consistent header.
If you can implement this, and this is very simple to write in any sort of parser, you can basically process the basics of HTTP 2.0. You read the first 8 bytes and you're good to go. After that, we actually just extend it.
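As a minimal sketch of that parser: the talk gives the 8-byte size, the 16-bit length, and the order of fields, but the exact widths assumed here for type and flags (8 bits each) and the reserved top bit in front of a 31-bit stream ID are assumptions about the draft layout. The function name is hypothetical.

```python
import struct

# Assumed layout: 16-bit length, 8-bit type, 8-bit flags, 32-bit word with the stream ID.
HEADER_FORMAT = ">HBBI"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 8 bytes


def parse_frame_header(data):
    """Read the fixed header from the front of a frame (hypothetical helper)."""
    length, frame_type, flags, stream_word = struct.unpack(
        HEADER_FORMAT, data[:HEADER_SIZE]
    )
    stream_id = stream_word & 0x7FFFFFFF  # mask off the assumed reserved bit
    return length, frame_type, flags, stream_id


# Build a sample frame: 5-byte payload, type 0, flags 1, stream 3.
frame = struct.pack(HEADER_FORMAT, 5, 0, 1, 3) + b"hello"
print(parse_frame_header(frame))
```

Because the length comes first, a reader knows how many payload bytes to consume before looking at anything else in the frame.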
Here's an example of a headers frame. A headers frame is something that you send to open a new request. This is just like me sending a "get" string with a path of the request page. I will send a headers frame which identifies the stream ID. I can embed an optional priority.
Q: Do routers listen to this or just the servers?
A: That's a good question. Do routers listen to this or just the servers? It can be both. The server definitely needs to listen to it because it's the one providing the bytes. But if you have an intermediary, then it can be smart about it too. Part of this is also flow control, which we'll get to in a second.
Let's see, so culminate by header. One of the cool things about HTTP 2.0, we're going to talk about server push in a little bit. But you can actually open streams from both ends. So, a client can send a request, like, "I want to get this page" or "I want to get this image," but the server can also open streams back at the client.
You make a request, for example, for a page and the server says, "Oh, you'll also need this favicon, because you always ask for it." The server can do that and the question is how do you negotiate the stream IDs? It's very simple. One keeps an even number. One keeps an odd number. So we just increment those. So there's never a race between them. You don't have to coordinate it, rather.
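The odd/even split can be sketched in a few lines. Giving the client the odd IDs matches the 1, 3, 5, 7, 9 example earlier, but which side takes which parity is an assumption here, and the function name is hypothetical.

```python
import itertools


def stream_ids(is_client):
    """Yield stream IDs for one endpoint: odd for the client, even for the server (assumed)."""
    start = 1 if is_client else 2
    return itertools.count(start, 2)


client_ids = stream_ids(True)
server_ids = stream_ids(False)
print([next(client_ids) for _ in range(3)])
print([next(server_ids) for _ in range(3)])
```

Each side only ever increments its own counter, so the two never pick the same ID and no coordination message is needed.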
Actually, let me go back. So that's a priority field. Then, finally, you embed the headers. The headers are like, here's the content length, here's my user agent, what have you, all these other things.
One of the things that we've learned with HTTP is originally we started with a very simple protocol. It was just literally one line. It was like, get this resource, version number. And that's all we needed.
Later, we've added a whole lot of stuff, like, here's the user agent, here's the accept types or content types that I support, here's some other meta data. And basically, when you run analysis, you'll find out that an average request and response add about 800 bytes of overhead. This is just in terms of HTTP headers, which is significant.
With HTTP 2.0, we actually looked at that and said look, we need to address this problem. It is a problem. So, there's actually a new algorithm for doing header compression. Originally in SPDY, we actually started with just straight up gzip. We just said look, just gzip through the damn thing. We know that this thing works. But then there was a couple of attacks discovered against it, basically security problems.
So we had to throw that out, and there's a new algorithm. The way to think about it is that both sides of the connection keep header tables, which are basically key value pairs of things that have been sent before. On each request, you just toggle those bits. So you can say, "Hey, you don't have this value in your table, so please add it to your table." Then in the next request you can just say, "Toggle that bit. I'm sending that request again."
Here's an example. You send request one. This is just a fresh request. It's going to be, "GET request for example.com, a resource file. It's a JPEG file. Here's my user agent string." So you actually send all these key value pairs to the server.
On the second request, you're requesting just the resource. But you already know that the client, or the server in this case, already has all these values from a previous request. And the way the algorithm works is basically you toggle the things that you don't want out and you can send new values.
Because I'm sending a new request for a resource file... And you know what? That's actually a bug. It should say "resource 2". So this is the only value that has changed, so in this case, I would only send this one key value pair, which is great.
What this tells you is, for example, if I'm sitting in a loop and I'm just polling the server for an update, the overhead of that request in terms of headers is zero because it's the exact same set of headers. I don't have to do anything. And over the lifetime of the connection, you build up this header space and you can basically be very efficient in how you encode and decode this kind of thing.
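The idea of only sending what changed can be sketched with plain dictionaries. This is a simplified illustration of the concept, not the actual header compression algorithm in the draft; the function names and header keys are hypothetical.

```python
def encode_delta(previous, current):
    """Sender side: describe the current headers relative to the last set sent."""
    changed = {k: v for k, v in current.items() if previous.get(k) != v}
    removed = [k for k in previous if k not in current]
    return changed, removed


def apply_delta(table, changed, removed):
    """Receiver side: rebuild the full header set from its table plus the delta."""
    updated = {k: v for k, v in table.items() if k not in removed}
    updated.update(changed)
    return updated


first = {"method": "GET", "host": "example.com", "path": "/resource", "user-agent": "demo/1.0"}
second = {**first, "path": "/resource2"}

print(encode_delta({}, first))       # fresh connection: everything is sent
print(encode_delta(first, second))   # only the path changed
print(encode_delta(second, second))  # identical poll: nothing to send
```

The third call is the polling case: the delta is empty, so the header overhead of repeating the same request is effectively zero.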
This is another actually interesting opportunity for servers to do a smarter job, and intermediaries as well, in terms of what are the right algorithms for doing the eviction of these headers, so on and so forth. So this is a very important part about HTTP 2.0. It makes it very, very efficient to transfer this meta data.
So we sent the headers, so we've had that block, the consistent header, I should say, the header block, and then after that you actually send the data, or the payload. So the header's frame just carries the meta data about the request. Then in separate frames and data frames, you actually split them and put them into these data frames, which the whole thing consists of just the consistent header, that's the 8 bytes, followed by the actual payload. There's nothing more to it. It's as simple as that.
The one interesting gotcha that has been added recently to the spec is, if you look at the length field of this frame, it's 16 bits, so in theory you could send 64 kilobytes of data, but to reduce head-of-line blocking, we're actually limiting that.
In the spec, we're saying no frame should be bigger than 16 kilobytes. Because otherwise, you just create more and more contention or more blocking. So if you have data that is larger than 16 kilobytes, you would just split it across multiple data frames. Then in the last data frame you just toggle a flag that says this is the last frame of that sequence, and that's how you communicate larger payloads.
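Splitting a large payload could look roughly like this. The 16 KB cap is from the talk; the flag value and the tuple representation of a frame are hypothetical stand-ins for the real wire format.

```python
MAX_FRAME_PAYLOAD = 16 * 1024  # 16 KB cap per frame, as described above
FLAG_END = 0x1                 # hypothetical bit marking the last frame of the sequence


def split_into_frames(stream_id, payload):
    """Cut a payload into data frames and flag the final one."""
    chunks = [
        payload[i:i + MAX_FRAME_PAYLOAD]
        for i in range(0, len(payload), MAX_FRAME_PAYLOAD)
    ] or [b""]  # an empty payload still needs one frame to carry the end flag
    frames = []
    for index, chunk in enumerate(chunks):
        flags = FLAG_END if index == len(chunks) - 1 else 0
        frames.append((stream_id, flags, chunk))
    return frames


frames = split_into_frames(1, b"x" * 40_000)
print([(sid, flags, len(chunk)) for sid, flags, chunk in frames])
```

A 40,000-byte payload ends up as two full frames and one shorter frame, and only the last one carries the end flag.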
So all of that kind of in a simple picture here. The client opens multiple requests. It has a single connection and it can split all of those requests and responses into individual frames.
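Here is a toy version of that picture, assuming frames are just (stream ID, bytes) pairs and the sender uses plain round-robin. A real server would schedule by priority instead; the function names are hypothetical.

```python
from collections import defaultdict
from itertools import zip_longest


def interleave(streams):
    """Round-robin frames from several streams onto one connection."""
    wire = []
    for batch in zip_longest(*streams.values()):
        wire.extend(frame for frame in batch if frame is not None)
    return wire


def reassemble(wire):
    """Receiver side: group frames back into per-stream payloads by stream ID."""
    bodies = defaultdict(bytes)
    for stream_id, chunk in wire:
        bodies[stream_id] += chunk
    return dict(bodies)


streams = {
    1: [(1, b"<html>"), (1, b"</html>")],
    3: [(3, b"body{}")],
    5: [(5, b"img-a"), (5, b"img-b"), (5, b"img-c")],
}
wire = interleave(streams)
print([stream_id for stream_id, _ in wire])
print(reassemble(wire))
```

The frames arrive mixed together, but because every frame carries its stream ID, the receiver can put each response back together without any ordering between streams.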
But of course, one of the gotchas here is the server needs to be smart about this. Previously we've had a lot of logic on the client for when do we schedule requests. One of the things that we're doing now, we just maybe a month or so ago committed this into Chrome, where when we're using SPDY, we remove all that logic.
For example, the current nginx implementation does not respect priorities. So we've actually seen a degradation of performance in those cases. So these are the kinds of things that need to be fixed at the server layer. Servers need to get smarter.
Flow control is kind of interesting. One of the interesting properties of this, and you'll discover this, any time you layer a protocol within a protocol, you'll end up running into this exact same problem.
Now that we're interleaving multiple streams or flows within a single TCP connection, how do you rate limit or control the allocation of resources between those flows?
This is especially important for proxies and intermediaries where you may say, "Hey, I have a video stream and I have this stream. The video stream can easily saturate my link, but I want to limit it at this amount of throughput." So flow control allows you to do that.
If you've ever studied TCP flow control, one thing you'll know is that this is a problem that's been solved many times in the sense that there are new solutions coming out for TCP. There's new proposals. There's new congestion control mechanisms that are being proposed to this day.
We've been working on this for 20+ years. So we're not introducing a new flow control mechanism in HTTP 2.0. We're basically providing the building blocks to say every connection is going to start with a 64k window. Every time you send a data frame, we're going to decrement that window by the size of that frame. Then you have a special frame called the "window update frame", which will increment the size of that window. And how your server implements the logic of when to increment that window is completely up to you.
We're building you basically the shovels to build flow control. We're not providing an algorithm within HTTP 2.0. It's an interesting opportunity for, I think, innovation in the space. That's intentional. We know this is a complicated space, we can't solve it in HTTP 2.0, but we're going to provide the tools for you to do that.
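The building blocks are small enough to sketch. The 64k starting window and the decrement and increment rules come from the talk; the class and method names are hypothetical, and the policy for when to send a window update is deliberately left out, since that is the part left to implementers.

```python
INITIAL_WINDOW = 64 * 1024  # 64k starting window, as described above


class FlowWindow:
    """Tracks how many bytes of data frames may still be sent."""

    def __init__(self, size=INITIAL_WINDOW):
        self.available = size

    def can_send(self, frame_size):
        return frame_size <= self.available

    def on_data_sent(self, frame_size):
        if not self.can_send(frame_size):
            raise ValueError("window exhausted; wait for a window update")
        self.available -= frame_size  # every data frame shrinks the window

    def on_window_update(self, increment):
        self.available += increment   # the receiver grants more room


window = FlowWindow()
for _ in range(4):
    window.on_data_sent(16 * 1024)
print(window.available, window.can_send(1))
window.on_window_update(32 * 1024)
print(window.available)
```

After four 16 KB frames the sender is blocked until the other side sends a window update, and deciding when to send that update is where a server or proxy can apply its own logic.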
The last thing that I'll mention is HTTP 2.0 push. I already talked about this briefly but the idea here is that, very frequently you request a page file like a static homepage, what have you, and we give you the HTML and then you come back to us with all the resources that we've told you that you're going to need in that HTML. So why shouldn't we be able to send you multiple responses to one request?
Part of the thinking, right now at least, is that CDNs can actually be providing this push. One of the things you can do, so let's take an actual scenario. I make a request to your server, your server says, "Here's the index.html file and here's three other assets."
What if I have those three assets in a cache? Well, then you can actually cancel the stream. You can send a frame called "reset stream" and say, "No, no, I want to refuse this," or, "I don't want it." If you have an intermediary in there, it can actually drop those streams and just reset them back if it doesn't want to accept them.
So if your CDN doesn't accept it, it should respond with a reset stream and that should be the end of it. You may still end up transferring a little bit of data, but hopefully that's not such a big deal. So there's a bit of a race condition there.
In a worst case, if your intermediary refuses these push streams, we have the HTML parser, which is going to discover them and request those things anyway, so this is an optimization.
The interesting thing about this is it's useful in the context of a browser. It may even be more useful in the context of just general RPC layer where, don't think of HTTP 2.0 just as a browser client. Now what you can do is, you can send a naked request. Like let's say you have your Java client, you send the request to the server, the server can push multiple responses back to your client and you can do smart things with it. So that's push.
One of the problems with deploying something like HTTP 2.0 is, well, we have a lot of existing infrastructure that can't be upgraded overnight. We have clients that we can rev. Something like Chrome, we can certainly release a new version and that'll be nice and smooth. But, lots of old IE clients, lots of old servers that won't be updated, etc...
So, how do we make the switch as seamless as possible?
There's two standards or two ways that are proposed in the spec today. There's the typical HTTP upgrade flow, if you guys are familiar with WebSocket. We send the request. We want to request the page and we also send a connection upgrade header and we say, "Hey, we would like you to upgrade to HTTP 2.0, if you support it." And if you do, the server can say, "Okay, fine. I will do a 101 Switching Protocols." It responds basically with those headers and right after that it's sending HTTP 2.0 data on that TCP connection. So there's no additional roundtrips, which is good. If it does not support HTTP 2.0, it can just respond with a 1.1 response. So there's no penalty in this case, which is nice.
The more preferred way to actually do it, for a variety of reasons, is actually via TLS and ALPN.
ALPN adds a mechanism into TLS negotiation where you can actually negotiate the application protocol that you want to use during the time of the handshake.
Basically, the way it works is when you open the actual handshake for TLS you say, "Here's my private key," or "Here's my public key." Hopefully you're not sending your private key. "Here's my public key and, by the way, I support these protocols."
The server then says, "Okay, fine. I'll sign this request and I like the protocol that you're advertising, in this case HTTP 2.0, so I will run HTTP 2.0," and it responds back with a protocol field called "ProtocolName" and says, "Okay, fine. We'll talk HTTP 2.0."
So by the time the handshake is complete, we haven't added any more roundtrips to do this kind of thing, but we know right at the end of the TLS handshake that we can use HTTP 2.0.
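From the client side, advertising protocols via ALPN is a one-line setting in Python's standard ssl module. In this sketch the "h2" token is the identifier that was standardized later, not something named in this talk, and the helper name is hypothetical.

```python
import ssl


def make_client_context(protocols=("h2", "http/1.1")):
    """Build a TLS client context that advertises protocols via ALPN."""
    context = ssl.create_default_context()
    # Protocols are listed in order of preference; the server picks one.
    context.set_alpn_protocols(list(protocols))
    return context


context = make_client_context()
# After the handshake on a socket wrapped with this context,
# selected_alpn_protocol() on that socket reports what the server chose.
```

No extra request is sent for this: the protocol list rides along in the handshake, which is why the negotiation adds no roundtrips.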
The reason this is better, at least today, is that in practice, there's a lot of intermediary caches, proxies, what have you, on the web, even antivirus software running on clients that sniffs traffic on port 80 and breaks in spectacular ways. For example, there's actually antivirus software out there that we've discovered. With SPDY, it would sniff port 80 if we do it unencrypted, and it would say, "Look, this doesn't look like HTTP traffic. This looks malicious. Oh, this user is under attack. Let me close the connection."
Or we have intermediate proxies which don't even parse HTTP properly. They just look for strings in the byte stream and just swap them out. That breaks the protocol in spectacular ways.
Basically, there's all this infrastructure where if you do this HTTP 2.0 or when we did SPDY in the wild, we found that in 20% of the cases our connections just would fail randomly and we could never figure out why because there's some software running on a client, some intermediary or something.
If you're running a site like Google.com, having 20% of your users not being able to reach your site is kind of a problem. The way to address that is basically to run it over SSL, because then we bypass all those intermediaries and then we're talking end to end. So encryption's not the point here. It's the fact that we have this clean tunnel between the two ends.
In practice, that's what you're probably going to end up using for HTTP 2.0. This is exactly the reason also why WebSockets work over TLS for mobile and other cases and they break for a lot of clients, especially on mobile, when running over vanilla HTTP. So if you're having troubles with WebSockets, run it over TLS, you'll be fine. That's what it is today.
If you're interested in this kind of stuff, I did write a book about it. It's actually not in print yet. It will be hopefully by the end of next month, but it's all online and it's free. So if you're interested in HTTP 2.0 or TCP and TLS and all that kind of stuff, I go into all the stuff in depth, so check that out, and I'll have a link at the end.
With all the stuff going on with HTTP 2.0, the TCP performance part becomes even more important in many regards, so you should definitely upgrade your Linux kernels, make sure that you have the latest TCP window or congestion control in place. All of the previous optimizations, like positioning your data closer to the user, still apply.
The fundamental problem or limitation today is still latency, so the closer you can get your data to the user, the better they're going to be, the better the performance is going to be. You want to compress the data, etc.
So TCP performance is very, very important. In fact, there's a little bit of a caveat with all the HTTP 2.0 work, which is TCP packet loss happens. That's how TCP works. TCP packet loss has to happen for TCP to work properly.
But the problem with packet loss is when that does happen, we reduce the size of the window, the congestion window, in many cases, fairly significantly. So when you had 6 connections or 10 connections open and you would have packet loss in one of those connections, the throughput of that one connection out of, let's say, 10, would get decreased.
If you're running all the data through one SPDY connection and packet loss happens, it affects you in a much more significant way. We know that's a limitation. It turns out that when we've run the studies, and that is a problem, even despite that, HTTP 2.0 still delivers better performance.
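A back-of-the-envelope sketch shows why a single connection feels a loss more. It assumes every connection has the same window and that a loss halves the window of the one connection it hits, which is the classic TCP response; real stacks and newer algorithms behave differently, and the function name is hypothetical.

```python
def window_left_after_loss(connections, window_per_connection=10):
    """Fraction of the aggregate window remaining after one connection halves its window."""
    total = connections * window_per_connection
    after = total - window_per_connection / 2
    return after / total


for n in (1, 6, 10):
    print(n, round(window_left_after_loss(n), 3))
```

With 10 connections a single loss costs about 5% of the aggregate window, while with one connection the same loss costs half of it.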
It's something that we can address at the TCP layer, which is basically why I'm saying you should upgrade to Linux 3.2. Because one of the things in 3.2 is we have proportional rate reduction, which improves this kind of fairness problem with running HTTP 2.0 over a single connection.
Because you're going to be likely deploying TLS, TLS optimization is critical because a TLS handshake is actually very costly. It takes multiple roundtrips to do that, so you have to pay attention to your certificate size, you have to optimize your record sizes. This is an entire frontier of performance that I think not a lot of people are paying attention to today. We're going to have to get very, very good at optimizing TLS. So there are a lot of things that we can do in this space.
One of the nice things about migrating to HTTP 2.0 is we can undo a lot of the glorious hacks that we've been doing in our applications. And this is great because it'll make our application simpler. I don't have to tell you to concatenate your files, you don't have to sprite your images, you can just do the right things.
We can keep our code modular. We can keep it nice. We can just make these requests. The server should do the right thing and you shouldn't have to inline assets. All the stuff should be handled in HTTP, which is how it should have been to begin with.
So if you are looking at deploying HTTP 2.0, the number one thing for performance that you need to do is you need to unshard your assets.
This is step number one, regardless of anything else, because running over multiple TCP connections with HTTP 2.0 will hurt your performance.
Basically, you're not going to get the benefits of performance. You shouldn't be any worse off, but you're not going to get the benefits. That's just empirically what we've seen. Then after you've done that, you can start undoing other glorious hacks that you have in your code. Hopefully, now that we've convinced everybody to concatenate all their files, we can tell them to undo all of that.
The end result of all of this is actually, well, simpler applications. It should result in faster delivery. It will deliver better caching because we don't have to invalidate these giant BLOBs just because we changed five bytes of data, and it'll actually have a significant impact on fewer server resources.
Your servers will have to maintain fewer TCP connections, which is a big deal for people that are running servers that have to handle a lot of TCP connections. Each of those TCP connections has a memory buffer, which is quite costly in many cases. Even in simple simulations with a proxy server, we can show that there's significant improvements in the overall throughput of the system in terms of number of clients served, latency of the system, etc. So this is a big win for everybody, clients and servers, so it's something to be excited about.
Benefits, I think I've talked a lot about. Some of the opportunities and some of the stuff that still is ongoing and needs to be done, smarter servers. This is a big thing.
We can write the spec, but the servers need to get smarter. Servers need to respect priorities. What we've done is we're basically moving all of scheduling logic away from the browser and into the server. We're placing our faith into the server. The server needs to be smart. If it just saturates the pipe with images, it's going to deliver a poor performance. This is something that we need to do a lot to optimize.
Server push, there's a lot of opportunities there. What are the resources you should push? How do you determine that? One cool strategy that the Jetty guys have implemented with SPDY is they listen to inbound traffic, they look at the referrer headers and, after some amount of time, they basically build a map that says, "Hey, you've requested index.html," and then the referrer header says that you've also requested, based on that, after you receive the HTML, you've also requested these three image files, the CSS file and other things. So they construct that map and then they start pushing these assets to future clients.
So this is completely automated, which is the nice part about it. I just let it run, it just listens to traffic and starts adapting to traffic, which is pretty cool. For sure you could implement the manual strategy, maybe you can really hand tune your application for a specific case and say, "Always send this file," or, "Always send it to this client." There's a variety of different strategies for how to implement server push.
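The referrer-based strategy can be sketched as a simple map built from observed requests. This is only an illustration of the idea as described, not Jetty's implementation; the log format, paths, and function name are hypothetical.

```python
from collections import defaultdict


def build_push_map(access_log):
    """Map each page to the resources that were later requested with it as referrer."""
    push_map = defaultdict(set)
    for path, referrer in access_log:
        if referrer:
            push_map[referrer].add(path)
    return push_map


# Hypothetical observed traffic as (requested path, referrer) pairs.
log = [
    ("/index.html", None),
    ("/style.css", "/index.html"),
    ("/logo.png", "/index.html"),
    ("/about.html", None),
    ("/team.jpg", "/about.html"),
]
push_map = build_push_map(log)
print(sorted(push_map["/index.html"]))
```

Once the map has settled, a server could push the listed resources whenever a future client asks for the page, and keep updating the map as traffic changes.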
So once again, this is something that both web developers and server developers need to really carefully think about, like how do we leverage this new thing that we just never had in HTTP before?
Clients - same thing. One thing that I'm familiar with is, I've built a couple of HTTP clients in the past. Most clients in most languages, especially, actually, the default HTTP clients are terrible. They don't allow you to reuse the connection. They don't support connection reuse, so all that stuff needs to be replaced with something smarter.
If we just build HTTP 2.0 clients which do the same thing as before, which is make a request, throw away the TCP connection, this is all for nothing, this is completely useless. So all of that needs to be replaced.
One of the cool things about HTTP 2.0 is we've invented, over time, a lot of new RPC layers. At Google we use Stubby with protobufs, Facebook has their own, there's a whole number, like the MessagePack has its own RPC layer. There's a lot of different layers that other companies use internally because HTTP is not fast enough. We can't multiplex connections and all the rest.
Now that this is here with HTTP 2.0, you can rip out all of that code and replace it with this, because it's actually better in the sense that it's been battle tested and a lot of people have thought hard about this problem.
This infrastructure is going to be built. There is going to be commercial support for this kind of stuff. You're going to get routers supporting it and all the rest, you don't have to roll your own RPC layer inside.
So if I was building a backend system today, for whatever company, I would start with HTTP 2.0, because that's just the right long term bet, as opposed to using Stubby or something else within a company. I think that's a big change that needs to happen.
Finally, there's actually a lot of questions about, "How do we migrate all the people from HTTP 1.0?" You've optimized your site, concatenated all your files, now I'm telling you to undo all of that. But the switch is not going to happen overnight. We're going to have a lot of old HTTP 1.0 clients, so how do we manage that transition?
One answer to that is if you use a dynamic optimization service, mostly every CDN today has something like this, where they'll rewrite your assets, they can actually be smart about it.
For example, the PageSpeed product that we have at Google, we look at the incoming headers and we say, "Look, this client has HTTP 2.0. We won't do concatenation, and that concatenation happens dynamically. Whereas, this client has HTTP 1.0, so we'll serve you the bundled asset."
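The per-client decision could be as small as the sketch below. It assumes the serving layer already knows which protocol was negotiated and passes it in as a string; the function name, protocol strings, and file paths are all hypothetical, and this is not how any particular optimization service is implemented.

```python
def pick_assets(protocol, modules, bundle="/static/bundle.js"):
    """Serve individual files to HTTP 2.0 clients and the concatenated bundle to everyone else."""
    if protocol.startswith("HTTP/2"):
        return list(modules)  # many small, separately cacheable requests are cheap
    return [bundle]           # older clients still benefit from one big file


modules = ["/static/app.js", "/static/cart.js", "/static/search.js"]
print(pick_assets("HTTP/2.0", modules))
print(pick_assets("HTTP/1.1", modules))
```

Keeping the modules as the source of truth and generating the bundle on demand means the same site can serve both kinds of clients during the transition.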
This kind of thing needs to happen at the routing layer and all the other layers within the system. That also requires quite a bit of plumbing and architecture in terms of how do you deliver that to client.
With that, I'll take some questions if you guys have any.