Ep. #16, Real-Time Video with Kwindla Hultman Kramer of Daily.co
In episode 16 of Demuxed, Matt, Phil, and Steve are joined by Kwindla Hultman Kramer of Daily.co to discuss real-time video, tiers of latency, and the growth of mobile into a viable video platform.
Kwindla Hultman Kramer is the Co-Founder of Daily.co. He was previously CTO at AllAfrica, Inc. and CEO of Oblong Industries.
Transcript
Matt: Hey, everyone.
We're back and we have the whole gang back together. It feels like it's been forever, even though it's actually just been one episode that it wasn't all three of us.
This time we're back with a guest. The last two shows we just did ourselves: the three of us talking about low latency, and then the two of us in the last one.
This time we have our friend from the industry, from YC, from just the world.
Kwindla from Daily.co, if you're curious about domain names.
I wanted to talk real quick before we get started about Demuxed. By the time this is released the speaker submissions should be closed and the panel should be starting to work on talk selection, all that stuff.
But it's the last week of October, that Tuesday through Thursday, the 27th through 29th.
Just a reminder, when you buy your ticket you have a chance to just buy a normal ticket or a ticket with a swag box.
If you want the t-shirt and all that stuff, we're going to try to ship to anywhere that asks for it, but we can't ship a t-shirt and badge for $80 to Timbuktu.
That's not going to work. A third of all ticket revenue goes to causes that help support diversity and inclusion in technology and the world, but primarily in our tech industry.
If you donate anything over the base ticket price, anything above that goes straight to the donation pile.
Just a reminder, also we're doing donation matching up to $3,000 if you want to just donate outside the ticket or before.
We're going to donate those as soon as we hit that mark, so let us know. You can donate to any D&I, inclusive--
Phil: Social justice--
Matt: Yeah, social justice group that you like.
We'll donate to just DevColor and the ACLU, but we'll match whatever social justice initiative you want to donate to.
OK, so we're here today to talk about real-time.
We had these two episodes on the world of low latency and what that looks like, and the different gradients of low latency in the world and ending with this deep dive into what Apple's new low latency HLS spec looks like.
A question we get a lot is "What does the real-time world look like?"
Today I thought it'd be great to talk about real-time video as a follow up to this pushing the bounds of lower latency and then going fully into that real-time side of the latency spectrum that we talked about in the last calls.
So to talk about that, we brought in our friend Kwindla from Daily.
We went through YC with him way back in 2016. Kwin, do you want to give us a little bit of background about Daily and what y'all do?
Kwin: Sure. Thanks for having me, I'm a big fan of the podcast so it's really fun to be on with you.
I'm Kwindla Hultman Kramer, I'm co-founder of a company called Daily.
We make APIs for real-time interactive video.
We have been working on video for a long time, individually and collectively as a team.
I've been interested in large scale, real-time networks and video since I was in graduate school.
The genesis of Daily is the idea that live video is going to be more and more a part of our lives and everything we all do online and a standard called WebRTC, which a number of browser manufacturers and other platform folks have been working on for several years, is going to help make that happen.
We started Daily to build a technology stack on top of WebRTC that could serve as many use cases as possible.
Our users are doing things like video calls, but also customer support, retail, telehealth and some interesting new use cases around the future of work, like always-on video desktops for distributed teams, and experimental educational stuff, like fitness classes.
Phil: I saw one of those products the other day, actually.
I can't remember what it's called, but it's a Kickstarter or something like that, but I guess iPad--
Where you just, it's on a stand and you put it beside your desk and you flick it on in the morning and all of your teammates are just there.
That's slightly terrifying to me, but--
Kwin: I think there's so much interesting experimentation, and some of those tools are really going to stick in maybe the same way that Slack, for example, really changed a lot of our assumptions about how we work together.
Matt: Absolutely.
Phil: In the last two episodes we really focused in on this low latency and ultra low latency, and I heard the words the other day, "Hyper low latency." It was a new one for me.
Steve: Oh, no. Please, no.
Phil: I know, right? And maybe just to rescope a little bit on where that line is drawn, where things become real-time video.
All the stuff we talked about in the last couple of episodes, best case that's going to get you down to-- What, a second?
Maybe half a second, at some level of scale. Obviously real-time is a lot lower than a second.
This is something where the latency is not disruptive to human interaction, so we're talking about sub-100 milliseconds of latency realistically, for a great experience.
Steve: Even at 300 milliseconds you start to get that back and forth, like you're having a long-distance international phone call and you start talking over each other.
There is actually a pretty wide gap between a good real-time interactive experience and where we can get to with the traditional live streaming technologies.
Matt: This is probably a great place for you to chime in. What do you typically see as that threshold for acceptable in the real-time world?
I imagine it's almost even smaller for what's acceptable there, but what do you see?
Kwin: Yeah, the deep background of the technologies we build on top of is the voice over IP stack that your listeners are probably familiar with from originally moving phone calls, business phone calls, to digital networks.
The traditional VoIP world definition of acceptable latency is 250 milliseconds or less.
As you were saying, though, you really want to be down closer to 100 milliseconds if you possibly can be, but you can't beat the speed of light so if you're talking to people halfway around the world it's a little tough to be at 100 milliseconds.
Some of our users use satellite links, those are more like 600 milliseconds.
There are always buffers if you're encoding video, video is harder than voice in terms of real-time.
So with video buffers in the mix we feel like we're doing OK if we're at two hundred milliseconds or so. We want to be better, but 200 milliseconds is what we aim for.
Matt: That pesky speed of light issue.
Phil: It gets me every time.
Matt: We already started talking about this a little bit here, but I thought it'd be nice to touch on how this is technologically different from traditional live streaming?
So, how is Real-Time different from what we see in traditional live streaming, whether or not it's low latency?
I thought it'd be nice to dig in there a little bit. Could you--? You touched on this VoIP foundation that it's being built on, but can you tell us a little about the technology history behind all of this?
Kwin: Sure. Digital video and audio have been research topics for a long time.
I built a bunch of experimental video stuff in graduate school in the 90s, but it really became a widespread technology as business telephone systems and other kinds of commercial audio systems started to need to route over digital networks.
The industry developed around VoIP, or voice over IP, so if you've used a phone system at a big company in the last 20 years you've used some VoIP technology.
Then of course cell phone networks became digital, and then in 2010-2011 Google started to get serious about thinking about what should get embedded in web browsers for digital audio and video that could be real-time, could be used for interactive applications like video calls, and it started to work on a specification that eventually became known as WebRTC.
Mozilla Foundation got on board with Firefox, and Google and Mozilla in 2012, 2013, 2014 timeframe started to build really cool new stuff into web browsers, and WebRTC became a standards effort.
A bunch of other companies started to contribute to the WebRTC effort, and if you fast forward to today the set of standards called WebRTC is now built into browsers and mobile platforms and is something that developers can build directly on top of to build really low latency and interactive audio and video.
Steve: So, how much of that is just the same technology that's been used for VoIP and how much of it is actually pretty new?
Kwin: It's a mixed bag.
All of the core pieces have been around for a long time but how they get put together and how they get implemented in the browser and things like JavaScript APIs are all really new, and in fact, are really still unstable.
We're figuring this stuff out as we go along, and every major release of the browsers tends to cause various issues.
Sometimes major, sometimes minor. It still feels like we're at the early days of scaling and making this stuff really easy for developers.
But it is super exciting now that we've crossed this bridge to now having a standard that you can build on cross platform for live interactive video and audio.
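For listeners who haven't touched these browser APIs, here's a minimal sketch of the primitives Kwin is describing: capture a camera with getUserMedia and hand the tracks to an RTCPeerConnection. This is the standard browser API, not Daily's SDK, and how the offer actually reaches the other peer (signalling) is deliberately left out; that comes up later in the conversation.

```typescript
// Minimal sketch of the browser WebRTC primitives (not Daily's SDK):
// capture local media and attach it to a peer connection.
async function startLocalVideo(): Promise<RTCPeerConnection> {
  const pc = new RTCPeerConnection();

  // Ask the browser for camera and microphone tracks.
  const stream = await navigator.mediaDevices.getUserMedia({
    video: true,
    audio: true,
  });

  // Hand each track to the peer connection so it gets encoded and sent.
  for (const track of stream.getTracks()) {
    pc.addTrack(track, stream);
  }

  // Create an SDP offer describing what we want to send and receive.
  // Delivering it to the remote peer (signalling) is left to the app.
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);

  return pc;
}
```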
Phil: I was having this conversation with Matt just two days ago about how WebRTC, when you're playing with it and experimenting with it right now, feels like HTML5 video did 6, 7, 8 years ago in the browser, and Media Source Extensions in particular.
It feels like it's at that stage of incubation or maturity in comparison to the technologies you use when you're delivering HLS or DASH to a browser.
Those are just that much more mature right now, in particular from a compatibility perspective.
Kwin: It's so true. I have three different computers on my desk for testing, and I don't think that--
That was true when we were building websites in 1999 and it's not been true for a while, but with the WebRTC stuff it definitely is true.
Matt: Yeah, the kicker I think that always gets me is having to jump back into nine different vendor prefixes.
Sometimes there's different vendor prefixes within a vendor, which is wild.
But something that you alluded to a little bit in there that was really interesting to me is that it feels like this was another thing that got really impacted by the death of browser plugins, things like Flash and RealPlayer?
Phil: RealPlayer.
Matt: I don't know if that one did real-time, but my point is I think this was a niche that was filled by those plugins.
Then there was the dead area after all of those died for a little while, and I hadn't really thought about that relationship but it's obvious in retrospect.
Kwin: The way we talk about it is at least three things have to come together to give developers something they can build on in lots of interesting ways, and then we see a bunch of new applications start to come.
One is that the underlying technologies have to get good enough, so you have to have fast enough CPUs and good enough network connections available for everybody.
Then you have to have the core low level implementations that take advantage of those CPU and network connection advances, and then user expectations have to start to shift.
When I started building live video stuff and I would demo the things that we were building, people would say, "That's really cool. But I'm never going to turn my camera on in a video call."
Phil: Little did they know what was going to happen.
Kwin: It's really funny to see how things change, and we don't even remember.
You cross over that boundary and you don't even remember how different things were.
One of the parallels we try to draw is how payments are now ubiquitous on the internet, but I'm certainly old enough to remember when lots of people said there were never going to be--
Users were never going to trust the internet with their credit card number. So user expectations really matter.
As you start to build that flywheel of underlying core speeds and feeds type tech, and then implementation-level building blocks that developers can use and things that users are interested in doing and excited to do and willing to do, you get all sorts of new use cases in a Cambrian explosion.
I do think that's the early stage of what we're seeing now with interactive video, all those things have come together.
There's a whole lot of experimentation and things are still a little bit hard to build in scale, but there's so much interesting stuff going on.
Matt: So I'm curious to hear, because we've talked a lot about-- This has been a really browser centric conversation so far.
Obviously, that's the platform these days, let's be honest with ourselves.
But I'd be curious what the rest of the landscape looks like, and do you all dive much into that or are you really able to stay almost entirely in the browser platform world and be as ubiquitous as Daily is?
Kwin: We focused on the browser early on because, partly with Google's big efforts in this new standard, the browsers were ahead of mobile platforms and it was pretty tough to build a mobile app that could do really good quality audio and video at extremely low latency.
That started to change in just the last year, so we are starting to put a lot more work into mobile.
We have some new releases coming up and we think you pretty much have to, because as things get more ubiquitous, as use cases evolve, mobile is 50-90% of the computing that most people do.
So, mobile is catching up. It's pretty close to being a completely viable platform now.
Phil: Has that changed--? Obviously, did I see stuff at WWDC? More WebRTC components coming to iOS, if memory serves?
Kwin: Yeah, that's exactly right. Mobile Safari is now pretty good.
Phil: Nice.
Kwin: Apple is supporting some stuff lower level than Safari, but we're still at the point where as a developer you can use JavaScript APIs in the browser.
But if you want to build a native mobile app, you need to go down to the level where you're actually maintaining some of the core library stuff yourself.
That will change, and we can make that easier for users, but the mobile development path is still quite a bit more challenging than the browser side stuff.
Phil: I guess that plays into what I'm super interested in, which is how does real-time communication and real-time video market, just-- Is everything now just WebRTC?
If I'm using Zoom, is it WebRTC? Or are people still doing more weird and innovative ways of shuffling around real-time video?
Kwin: The short answer is "Everything except Zoom is WebRTC."
Zoom did this amazing job building a proprietary video stack and great native applications that could leverage their proprietary stack, and they had to do that in terms of when they were launching and scaling, because WebRTC just wasn't ready on any level for what they were trying to do.
As WebRTC has gotten better, almost everybody else has shifted to using WebRTC often in experimental ways, sometimes hacking up and modifying what the low level implementation is doing.
But it seems pretty clear to me, I'm just one opinion, but it seems pretty clear to me that we're at the tipping point where the standards ecosystem is going to outpace everything else in terms of both performance for core use cases and the ability to support all the interesting longtail edge case use cases.
There's going to be another generation of the WebRTC spec that deconstructs some of the core primitives to support more experimentation and more use cases.
That'll take a year or two or three to start having an impact, but it's already being worked on.
I think the whole world is moving toward WebRTC and investing in and contributing to WebRTC at this point.
Steve: On that note, I wanted to know how much room there is for tweaking the streaming algorithm of WebRTC as it's determining what quality to send and details like that.
Are people modifying WebRTC at that level or do you just trust that the browser is doing smart things?
Kwin: It depends on what you're doing.
For what I would say "Normal" use cases, and "Normal" is a moving target, but for normal use cases you're trusting the browser for a bunch of reasons.
A lot of stuff really has to get done at the C++ level to work.
So the video encoding, echo cancellation, lots of stuff you folks are really familiar with just really has to be done below the level of JavaScript.
If you're not trusting the browser, then you can't work within the browser.
What you can do if you have a use case that isn't well supported by today's standard WebRTC is you can either work at the C++ level yourself, which has more knobs you can turn, or you can hack on the open source WebRTC code base.
That is the basis for most of the WebRTC implementations, and both of those are really doable and people are doing both of those things.
It does mean you have to have a native application or an Electron application.
For example, you no longer are compatible with the browsers.
Sometimes you can work really hard to be compatible, cross compatible, like your users in the browsers have a certain experience that's got guardrails on it.
Then your users with your native application can do more, but that's twice as much work, at least.
Steve: That's interesting, because that's one of the differences that you can call out between RTMP and WebRTC.
Like, with RTMP you have maybe a little bit more control.
If you really want to send a high quality stream as much as possible and you're less concerned about necessarily the latency, you might choose RTMP because you can send that higher quality stream.
Whereas WebRTC, I get the impression you tend to have a little bit less control over that.
Kwin: That's totally right. And that's one of the really interesting tradeoffs, that when--
I think when developers come to you and they come to us, they ask for things like low latency and high quality, but the actual tradeoffs there are really subtle and really interesting.
Often WebRTC is really biased towards low latency at the expense of quality.
So, keeping the media flowing.
To a first approximation the big difference is UDP versus TCP, so we lose a lot of packets when we send them on a lot of networks, and so we don't always have time to resend those packets.
We certainly don't have time to run multiple encoder passes, so we're never going to be able to send video at the same quality that you could send video at if you're willing to tolerate 15, 20, 30 second latency.
But we can push the latency down by building lots of robustness to packet loss and other things into the system, but it is a real tradeoff and you have to decide what you're going to optimize for.
Phil: One of the things we see a lot of, or have seen bits and bobs of, is abuse of the WebRTC spec.
It's probably the wrong way to phrase it; we see a lot of people being creative with the WebRTC spec, using the file transfer mode and moving around bits of video for ultra low latency, close to real-time streaming. From a real-time communications standpoint I assume that's a no-go area where you need that traditional interactivity.
Would that be fair to say?
Kwin: I think you're talking about WebRTC data channels.
Phil: I am, yeah.
Kwin: A little bit of background, in the WebRTC spec there are media channels and data channels, and the media channels were designed at a high enough abstraction level that you can assume that media you're sending through the media channels has things like forward error correction.
They can do resends on missing packets, you can re-request keyframes; the media channels are built on top of RTP, but the data channels are just catch as catch can.
We're going to send a bunch of packets out over whatever the underlying transport is, probably UDP, but you don't really know, and with none of the higher level stuff that's super useful for media.
So if you're sending media over the data channels, then you're either doing something wrong or you're pushing the boundaries of what the spec can actually support.
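To make that distinction concrete, here's a minimal sketch using the standard browser APIs (not Daily's code): a media track goes through addTrack and gets all the RTP-level help Kwin mentions, while a data channel is just a message pipe you configure yourself.

```typescript
// Sketch of the two WebRTC channel types: media tracks get RTP-level help
// (FEC, retransmits, keyframe requests); data channels are a lower-level,
// "catch as catch can" pipe.
function addChannels(pc: RTCPeerConnection, videoTrack: MediaStreamTrack): RTCDataChannel {
  // A media channel: the browser handles encoding, pacing, and loss recovery.
  pc.addTrack(videoTrack);

  // A data channel: just a message pipe over SCTP/DTLS. Configured unordered
  // and with no retransmits here, it behaves a lot like raw UDP.
  const dc = pc.createDataChannel("app-data", {
    ordered: false,
    maxRetransmits: 0,
  });
  dc.onmessage = (event) => console.log("got", event.data);
  return dc;
}
```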
Matt: Why not both?
Kwin: Opinions differ, right? Engineers playing with trade offs are going to do interesting things.
Matt: The need to control things in different ways than the media channels give you access to, though, is one of the drivers of the WebRTC next generation spec.
Separating out the encoding and bandwidth adaptation and other layers that are in the media channels that you don't have as much control over into elements of a pipeline that you do have control over is definitely something that the standards body is aware that people want to do.
Phil: And on a super cool screen.
Kwin: I think it's fun to talk about Zoom's amazing hack, which is their in browser client.
As you say, it uses the WebRTC but not really WebRTC.
Zoom cross-compiled into WebAssembly a bunch of their proprietary encoding and decoding and networking stack, and they use WebRTC data channels and a bunch of WebAssembly to implement Zoom in the browser.
It's an amazing and a really cool hack, it's probably not something that can scale the way a more native WebRTC implementation can, and certainly can't scale the way their native application can.
But it's an interesting experiment exactly along the lines you're talking about.
They're pushing the boundaries of two technologies, WebAssembly and WebRTC, to do something that you can't do with core WebRTC.
Matt: I was corrected the other day, because I thought that they were using WebRTC-- I guess, quasi corrected. I thought they were using WebRTC and that's how they were in the browser.
Then I was corrected, somebody was like "No, they're actually using web sockets."
And apparently they used to use web sockets in the browser and the WebRTC transition is relatively recent, which honestly that whole thing blew my mind a little bit.
But it is interesting that WebRTC has gotten so ubiquitous in this industry that I was shocked to hear that somebody wasn't using it, especially as big of a player as Zoom.
Kwin: They have a bunch of great stuff they've built that just isn't available at the JavaScript level.
They had to pick some different point in the trade off matrix.
As you're saying, with the web sockets versus data channels decision, there's a whole discussion about what web sockets are and aren't good at and what data channels are and aren't good at, and it does cause us to pull our hair out because we work with both every day and we've done some similar things of going back and forth between using web sockets for some things and using data channels for some things.
Especially on what we call the signalling side.
So when WebRTC people talk about signalling, what they mean is "All of the stuff that's not media in a call. So, how do you set up the call? How do you figure out how to talk to other people on the call? How do you keep track of state in a call? Like, who is muted and who's not?" That stuff is actually explicitly not in the WebRTC spec, it's for developers to decide for themselves because use cases vary so much.
One of the first challenges you encounter if you pick up the WebRTC APIs as a developer is, "This is really cool stuff. It solves a lot of really hard, heavy lift problems with media delivery."
But how do I actually get people on a call together in the first place? And then how do I know who is muted and who is not muted, or who has a chat message they want to send?
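Since the spec leaves signalling to the application, the shape below is only one of many possibilities: a hedged sketch that relays offers, answers, and ICE candidates over a WebSocket to a hypothetical server URL and message format.

```typescript
// One possible signalling layer (hypothetical server URL and message shape;
// the WebRTC spec deliberately doesn't define this part).
// Error handling and waiting for the socket to open are omitted.
function wireUpSignalling(pc: RTCPeerConnection, roomUrl: string): void {
  const ws = new WebSocket(roomUrl); // e.g. "wss://signalling.example.com/room/123"

  // Send our ICE candidates to the other side as they're discovered.
  pc.onicecandidate = (event) => {
    if (event.candidate) {
      ws.send(JSON.stringify({ kind: "candidate", candidate: event.candidate }));
    }
  };

  // Apply whatever the other side sends us.
  ws.onmessage = async (event) => {
    const msg = JSON.parse(event.data);
    if (msg.kind === "offer") {
      await pc.setRemoteDescription(msg.description);
      const answer = await pc.createAnswer();
      await pc.setLocalDescription(answer);
      ws.send(JSON.stringify({ kind: "answer", description: answer }));
    } else if (msg.kind === "answer") {
      await pc.setRemoteDescription(msg.description);
    } else if (msg.kind === "candidate") {
      await pc.addIceCandidate(msg.candidate);
    }
  };
}
```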
Matt: On the topic of all this latency, I'd be curious to hear where it actually comes from.
Like, in low latency video and in our world, typically that's in the transcoding step or getting out to the CDNs or any myriad of places in between there, from glass to glass.
What does that look like specifically for real-time video?
Kwin: I think relative to what you all do, there are three big world view differences with the specs we build on top of.
We do as little encoding and transcoding as possible these days in the WebRTC world, so we try to encode in real-time as best as we possibly can on each client and then we try--
And there are exceptions to this, but we try to just pass through that encoded video without touching the packets all the way through to all the other end users in the call.
And there are complexities there in the end to end encryption and bandwidth management that are really interesting and really fun from an engineering perspective, but basically, we try to do a quick one-pass encode and then send all those packets out and receive them and decode them with as little buffering as possible.
So the first difference is we don't have access to much transcoding, and that limits what we can do, but it also lets us do stuff in theory really fast.
The second big difference is if we're not able to do UDP, we are unhappy.
TCP for us is a fallback case, and that's because if we can just fire and forget UDP packets the media will probably get there faster than if we have to set up and tear down TCP connections and sit on top of TCP retransmit algorithms. If we can't get a media packet through in the first go, it's probably not something we're going to try to do again unless it's a keyframe.
So UDP is the right choice. And then the third thing we can't do, that we would love to do maybe in the future once we build a whole bunch more new internet infrastructure to handle it, is rely on caching layers or CDNs or anything like that.
Even though we build on top of UDP, we have to maintain a stateful idea of the connection with every client in the call so that we can route the right packets to them as quickly and as efficiently as possible.
We have to scale our calls with servers that know about everybody on the call and are doing a certain amount of real-time routing that's a little bit smart.
So, latency comes because we're buffering the media to encode it or decode it, or we're having to deal with packet loss or our servers are having to route the packets.
When you add up the network transport and the encoding and a little bit of time on the servers, that's where you end up with the 200 milliseconds that we usually are able to achieve and that we target for end to end latency.
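For readers who want to see where their own calls land, the standard getStats API exposes the transport details Kwin is describing. A rough sketch (browser API, not Daily-specific), checking whether the nominated candidate pair ended up on direct UDP or fell back to TCP or a relay:

```typescript
// Rough sketch: inspect the nominated candidate pair to see whether the call
// is running over direct UDP or fell back to a relay / TCP.
async function logTransport(pc: RTCPeerConnection): Promise<void> {
  const stats = await pc.getStats();
  stats.forEach((report: any) => {
    if (report.type === "candidate-pair" && report.nominated && report.state === "succeeded") {
      const local: any = stats.get(report.localCandidateId);
      // candidateType: "host" / "srflx" (via STUN) / "relay" (via TURN)
      // protocol: "udp" or "tcp"
      console.log("candidate type:", local?.candidateType, "protocol:", local?.protocol);
    }
  });
}
```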
Phil: So, in terms of-- I'm sorry, I'm not an expert on this in any way, shape or form.
How much of the traffic that egresses from me talking on my WebRTC call is going from me directly, peer to peer, to the other people, and how much of it is going to a server somewhere and then being broadcast?
Kwin: That's a great question. WebRTC was invented as a peer to peer media spec, and you still see that in the roots of everything about WebRTC.
That the assumption is that it's always peer to peer.
Negotiating a peer to peer connection is really complicated on the internet.
Because everybody's behind firewalls and NAT layers between them and the internet.
We have to do a bunch of things, and these are built into the WebRTC spec in a really actually elegant way to try to get UDP packets from my computer to your computer.
We always try to do that in small calls, so in a one on one call we always-- "We" with the choices we make on the Daily platform, we're almost always trying to negotiate a peer to peer call initially.
As you get to bigger calls it becomes prohibitive to do peer to peer, because you can't connect everybody in the call to everybody else and encode the video for everybody else.
Phil: Right. I don't want to send my video to 100 people, potentially at five different bit rates. That would be infeasible.
Kwin: That's right. Both the CPU on the client and the bandwidth available to and from each client can't do it.
So at a certain point in the call in the WebRTC world, we generally switch over to routing media through a server or a media server, and the most common way to scale is you send one upstream media track from each client to the server and then the server multiplexes that out in real-time to all the other clients.
But you do get each track from everybody in the call downstream, so you still have a scaling challenge of how you send all the tracks out to everyone in the call and how the clients can deal with all those tracks.
There's a couple of reasons we tend to do it that way, and then there's a bunch of fallout from that in terms of user experience and engineering tradeoffs. The reason you go one up and down in this world is to avoid transcoding. Transcoding is hard to do fast and it's expensive from an infrastructure perspective; it takes a lot of CPU. It's also, and this is somewhat less important but also true, really hard to do transcoding in real-time at multiple adaptive bit rates.
Dealing with real-time in WebRTC is a lot about dealing with variable network conditions.
Both variable over time to and from the same client, and variable across all sorts of different clients connected to a call.
One of the really fun things that got built into WebRTC in the last-- It really became usable in the last year, is called "Simulcast."
What our platform does, and a lot of other platforms do this now too, is we actually send three copies of the video up from each client.
We send a really low, a medium, and a reasonably high bit rate up at the same time, and then the media server peels off whichever layer, the "simulcast layer" it's called, is best for each other client in the call and sends only one layer downstream.
That lets you do a lot of interesting things around dealing with network conditions that are variable, and also customizing which track you're sending, depending on the UX on the other end in a way that doesn't require transcoding on the server and doesn't require prohibitive amounts of CPU.
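In the browser, simulcast is expressed through sendEncodings on addTransceiver. A minimal sketch along the lines Kwin describes; the rid names and bitrates here are illustrative, not Daily's actual settings.

```typescript
// Sketch of rid-based simulcast: send three encodings of the same camera
// track so a media server can pick the right layer per receiver.
// Bitrates and scale factors here are illustrative only.
function addSimulcastVideo(pc: RTCPeerConnection, track: MediaStreamTrack): void {
  pc.addTransceiver(track, {
    direction: "sendonly",
    sendEncodings: [
      { rid: "low",  scaleResolutionDownBy: 4, maxBitrate: 150_000 },
      { rid: "mid",  scaleResolutionDownBy: 2, maxBitrate: 500_000 },
      { rid: "high", scaleResolutionDownBy: 1, maxBitrate: 1_500_000 },
    ],
  });
}
```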
Steve: That's great. Do you have to then watch the streamers connection and cut off the higher quality one if the connection drops too low?
Kwin: You do. You have to do send-side bandwidth estimation from the server and try to make a good guess about what the right layer to send is, what the max layer you can send is.
Then also at the API or the application layer, you want developers to be able to say, "I really only want the smallest layer. Please don't send me the highest layer for this other participant in the call. Because, for example, I know I want to preserve a bunch of bandwidth for a screen share that's also going on."
Or, "I know I'm trying to render 50 small tiles and I don't want contention on the network, because there's only going to be a small amount of pixels per user anyway. So, send me only the lowest level streams."
So there's a bunch of control you can have over that, but it does add to the complexity of figuring out how you're going to route the packets.
Steve: Is there a sweet spot going from the one to one use case and just direct peer to peer to actually sending it through the server? Like, is there a sweet spot as far as number of users where--? I guess it's bandwidth constrained, but have you found a number that's like "At 5 users we're definitely sending through the server," versus one to one peer to peer?
Kwin: That's such a great question, and your guess of five was really spot on.
Because historically we have switched from peer to peer to media server mode when the fifth person joins the call.
Now, we have a ton of real world data about what works well across all sorts of calls.
One funny thing is that's changed a lot in the last six months, and I think we may actually end up dropping that number down quite a bit lower.
We've seen lots more variability in ISP peering quality, and as more and more mobile users have become active WebRTC application constituents, we've seen mobile networks have different dynamics.
4G networks have really different dynamics than home ISPs, which have different dynamics than business networks, so it's a little bit of a--
If you're going to pick an average number, you're always guessing a little bit.
One of the things I think we are now working on and other people are working on too is getting better at in a call deciding which mode we should be in and trying to seamlessly switch in a way that users never even perceive, which is super fun from an engineering perspective.
We have to start up streams without taking too much CPU or bandwidth, then we have to crossfade between video and audio streams so they don't cut out.
We're not all the way there yet, but we're almost there.
We've got a bunch of stuff we're going to release in September that we call "Smooth switching" between peer to peer and--.
"Smooth" with lots and lots of O's, and we're going to trademark it.
Matt: I assume that a lot of this-- I've got two questions here.
So, we've been talking about what scaling one call to a larger number of call members looks like there.
I'd be interested to hear what the other side of that looks like: instead, scaling a bunch of calls each with a small number of members. Is that all just signaling load, or are there additional things there that you run into?
Because I'm sure all of this has exploded a little bit with the whole pandemic thing we've had going on.
So, I'm curious where you're seeing folks beating down your door for the scale metrics.
Is it just a sheer number of calls, or are you really seeing people wanting to do an 80 person real-time call?
Whatever the hell that looks like?
Kwin: We're seeing both. We had a period in March and April where the number of calls we were hosting at peak periods grew by a factor of five or more every week, week on week.
So we ended up just scrambling to add a bunch of infrastructure.
There's some publicly available stuff about what Zoom did during that period too to add infrastructure, and if you're down in the weeds on this stuff like we are it was really interesting to see that.
It's not that hard to scale just in terms of number of calls, it's a pretty traditional infrastructure scaling challenge.
The only complexity is that we can't rely on pretty much anybody else's infrastructure to help.
So we can't rely on CDNs or other great technology that's been built out over the last 20 years to scale HTTP traffic, because it's all UDP and because it's all routed in this custom way.
We scale by adding servers to our clusters effectively, and as you were saying, we do end up with bottlenecks on our own signaling infrastructure as well. We can scale that horizontally to some extent too, although as with all infrastructure scaling challenges at some point you have some big database somewhere that you have to figure out how to shard, but mostly we just add servers to our clusters.
So we currently have clusters in seven regions running on AWS' network mostly, and we are in the middle of adding a bunch more clusters in a bunch more regions.
It turns out to be useful to put media servers as close as possible to users because of that speed of light, big time issue.
Adding more geographic clusters is a big priority for us right now.
Longer term, it's super fun to think about what the internet infrastructure might look like as UDP media traffic becomes a bigger and bigger deal.
Phil: That's something I wanted to ask, actually.
Is AWS suited for that use case?
Is getting UDP traffic into the cloud-- I remember 8 years ago now, first trying to put a lot of UDP traffic into AWS and that being very hit and miss.
Has that improved? Is it the best platform to route that traffic in?
Kwin: It's improved a lot and it's now more than good enough. It's not perfect, and we find corner cases that end up with us talking to the AWS engineering support folks, who have been really great.
A lot of Zoom is on AWS as well.
Phil: No, that helps.
Kwin: That helps a lot.
But we do sometimes find things like instance types that are CPU inefficient for UDP heavy workloads, whereas they're really great for other workloads.
Then we either switch instance types or we talk to the engineering support folks about it.
It is easy for us to imagine the perfect infrastructure, as I'm sure it is for you as well.
I do believe that there will be a set of CDN-like services eventually optimized for UDP media traffic, but that's a number of years away.
Phil: You've got to feel like that sort of thing will hopefully transition to more edge compute style, right?
Is that a bit of a pipe dream at the moment?
Kwin: I think that's exactly right, because media routing is pretty light on compute.
It's not compute intensive, but there's compute that's hard to factor out.
What we imagine is something like a micro-PoP with WebAssembly support baked in at a pretty deep level that lets us route UDP.
I think that is not that hard to build, I think like a lot of things it's hard to spec and get right and scale.
Phil: Sorry, I am pondering two more questions.
I don't know how to work it in, but I really want to know what STUN and TURN are, because I know it comes up every time someone talks about real-time video. So, what are STUN and TURN?
Kwin: Sure. STUN and TURN are part of that peer to peer routing suite of technologies that are built into the WebRTC specification.
STUN is a server you can ask on the internet about what IP addresses might be usable to reach you, and TURN is a server out somewhere on the internet that you can route media through if we fail to establish a true peer to peer connection.
Together, STUN along with a protocol called ICE lets us try a bunch of different internet addresses and port numbers with a bunch of different timings to try to punch holes through firewalls and network address translation layers.
If we can't do that, TURN lets us agree on a server that we can bounce media through.
It's not really a media server because it has no smarts at all; it just pretends to be the peer and relays the traffic.
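In API terms, STUN and TURN are just the iceServers you hand to an RTCPeerConnection. A small sketch with placeholder hostnames and credentials:

```typescript
// STUN/TURN configuration sketch (placeholder hostnames and credentials).
// STUN helps discover addresses for a direct connection; TURN is the
// relay of last resort if hole punching fails.
const pc = new RTCPeerConnection({
  iceServers: [
    { urls: "stun:stun.example.com:3478" },
    {
      urls: "turn:turn.example.com:3478",
      username: "demo-user",       // placeholder
      credential: "demo-password", // placeholder
    },
  ],
});
```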
Phil: Thank you, that's brilliant and it answers so many of my questions.
Steve: I feel like somebody worked hard on those acronyms.
Kwin: The funny thing is I do this stuff every single day, and if you actually ask me right now to tell you what those acronyms are I'm not sure I can get it right.
Phil: I'm super interested in hearing what the next generation of codecs looks like for real-time video.
Obviously, I think Cisco demonstrated-- Maybe it was Webex, demonstrated real-time AV1 encoding for real-time communication purposes.
I think that was at Big Apple Video last year, I think it was. Obviously I see more discussion of new codecs coming into WebRTC.
I assume the codec will vary, it's just 99% H264 right now. But where's that going, do you think?
Kwin: Right now in WebRTC it's 99%-- Probably not 99%, but it's a high percentage VP8.
That's because of Google's influence on the implementation of WebRTC and Google's preference for VP8. We do now support--
Everybody now supports both H264 and VP8, but in Chrome for example, the VP8 encoder pipeline is actually better than the H264 encoder pipeline.
Phil: In terms of visual quality, or just faster?
Kwin: It's mostly better at being tied in at a more effective level to the bandwidth estimation and bandwidth adaptation layers, so in Chrome you can use H264, miss keyframes, and end up with video artifacts from those missed keyframes that don't happen with VP8.
All of this stuff is always a moving target, and on Apple platforms it's less true that H264 is not as good as VP8.
But if you're concerned about quality generally you're using VP8 today, and that'll probably not be true next year, but it's still true today.
There was a big fight in the Standards Committee about codecs and the eventual compromise was all standards-compliant WebRTC implementations now have to support both VP8 and H264. Which from a developer perspective is actually a great result, we love having access to both.
We can do things like, if we know a call is between two iPhones we can use H264 because the battery life implications of H264 are way better compared to VP8 in a one on one call.
So, it's nice to have those two options.
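For developers who want to express that kind of preference in the browser today, setCodecPreferences is the relevant API; support varies by browser, and this sketch is illustrative rather than Daily's implementation of the iPhone-to-iPhone logic Kwin describes.

```typescript
// Sketch: prefer H264 (or VP8) for a video transceiver where the browser
// supports setCodecPreferences. Not all browsers implement this yet.
function preferCodec(
  transceiver: RTCRtpTransceiver,
  mimeType: "video/H264" | "video/VP8"
): void {
  const capabilities = RTCRtpSender.getCapabilities("video");
  if (!capabilities || !transceiver.setCodecPreferences) return;

  // Put the preferred codec first; keep the rest as fallbacks.
  const preferred = capabilities.codecs.filter((c) => c.mimeType === mimeType);
  const others = capabilities.codecs.filter((c) => c.mimeType !== mimeType);
  transceiver.setCodecPreferences([...preferred, ...others]);
}
```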
The next generation of codecs is going to be another fight, because there's AV1 and VP9 and H265, and they offer some really great advantages over the codecs we have now.
But from our perspective, on the network the only thing we care about is packet loss.
In codecs, the only thing we care about is CPU usage.
Right now CPU usage for the next generation codecs is prohibitive at anything above very small resolutions for real-time encoding.
The biggest single complaint we get from developers across all different use cases is "How do I reduce CPU usage of my calls?"
We just tell people who ask us about next generation codecs, "Definitely coming but definitely not coming anytime soon," like "Soon" from a developer, like "I'm trying to ship an app" perspective.
Phil: That's absolutely fascinating.
Matt: I, for one, am shocked that codecs were a fight in the standards body.
Kwin: That has never happened before.
Matt: Shocked, I say. Cool.
So, I guess some of this has been-- I think in some of the scaling conversations, it sounds like there's a lot of this that can be done just purely peer to peer and encoding on the clients.
So, I'd be curious, what are your biggest expenses tied up in all of this?
Is that running those STUN and TURN servers or your traditional infrastructure stuff? Like, where does that stuff land?
Kwin: For peer to peer use cases we end up paying for about 20% of the bandwidth because we have to route through our own servers for larger calls.
The combination of bandwidth and the virtual instances that host the calls ends up both contributing to the cost of maintaining the service.
It's interesting for us to see in this year how much growth there's been in interest in larger calls, so we used to almost never get requests for calls with more than 25 or 30 people in them.
Now a lot of our customers are people who are trying to build 50, 100, 500 person calls. What we think of as "Hybrid use cases."
Steve: My nightmare.
Kwin: It's a little bit of an engineering nightmare, it's definitely a little bit of a user experience nightmare, but the innovation of what people are trying to do on the internet is super interesting.
A great example is a fitness class where you want the instructor to have the ability to stream to 500 people, their camera, their mic and their music track.
You want the instructor to be able to see some subset of those people, and you want the people in the class to be able to see the people--
Their friends that they signed up for the class with.
You're not routing every media stream to everybody in a 500 person call, that would be crazy, but you want to be able to really flexibly route the media streams and turn them on and off at a moment's notice.
That's something that WebRTC is able to do from a core building blocks perspective, but it's actually pretty hard to implement from an API and infrastructure perspective.
It's pushing the edges of what our platform can do, but we have enough customer pull for it that it's a big focus for us.
Phil: The super fascinating thing is, honestly, if you'd asked me 8 or 9 months ago, I'd never have even thought of that exact use case.
Since then I've heard that exact pitch for two very specific use cases.
One for, yes, the personal trainer fitness market. Obviously a huge disruption from Covid over the last 8 months.
I've heard this pitch probably six or seven times in the last five or six months, but then also this exact same pitch but for live music.
Where people want to be able to watch a concert but also see their friends, and then they want the artist who's performing to be able to see some of the audience as well, see some of the fans going wild as well as when their favorite song gets played.
The first time I heard that use case, I was like, "That's interesting."
Now it's just every couple of weeks someone wants to do it.
Matt: The third one that's the same plumbing from our perspective is the virtual conference or virtual networking event, where you have a keynote speaker or a keynote group of presenters, like a panel.
Then everyone attending is at a virtual table where they can interact in real-time with six or eight or ten people at the virtual table, and also get the panel or the keynote.
Steve: I did read an interesting blog post that you or someone on your team wrote around your pricing, I don't know if you'd be interested in adding more detail there, but I thought it was interesting that you guys were taking a different stand on how you charge for the service.
Whereas other services charge for every individual connection, I think was the detail; every one person connected to another one person adds to how you're charged, versus just the number of people on the call.
I thought that was an interesting approach, and I don't know if you would want to add any technical details behind that?
Kwin: I think as a rule of thumb, if you can simplify how people pay for something you can make it easier to support new use cases and experimentation and the growth of what we're all doing.
So, we tried to figure out how we could come up with as simple as possible pricing that also on some level reflected the cost of service so we could stay in business.
And we have a lot of numbers, obviously, about cost of service and bandwidth use for different kinds of calls.
It turned out to be possible just to charge based on being in a call.
So if you're in a two person call, it's the base price times two times the number of minutes.
If you're in a one hundred person call, it's the base price times one hundred times the number of minutes.
That felt like a reasonable compromise for us between simplicity and scaling of use cases that actually do cost us more to serve, and it is different from our little subset of the industry.
Historically, most people have thought about subscribed tracks or forwarded tracks, which is more of an N times N-minus-one type pricing model.
I think that's partly been because bandwidth used to be more expensive and CPU used to be more expensive, so your variable costs could bite you more as the provider.
As costs have come down with the infrastructure and as I think we've gotten better at building WebRTC native infrastructure in the cloud, I think it's possible to simplify the pricing and lower the pricing.
Our assumption is that pricing is going to come down and simplify further over the next five years, and it's better to be on the forefront of that rather than trailing behind that.
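To make the arithmetic concrete, here's a tiny sketch of participant-minute pricing; the per-minute rate is made up for illustration, not Daily's actual price.

```typescript
// Participant-minute pricing sketch (the per-minute rate here is made up).
// Cost scales with people-in-call times minutes, rather than the N*(N-1)
// subscribed-track model Kwin mentions.
function callCost(
  participants: number,
  minutes: number,
  ratePerParticipantMinute = 0.004 // hypothetical rate in dollars
): number {
  return participants * minutes * ratePerParticipantMinute;
}

// e.g. a 2-person, 30-minute call vs. a 100-person, 60-minute call:
console.log(callCost(2, 30));   // 0.24 (dollars, at the made-up rate)
console.log(callCost(100, 60)); // 24
```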
Steve: That's great.
Matt: Yeah, that's awesome.
This is, in particular, one of those areas where you think about traditional online video versus what we're talking about now, which is this new real-time video.
There are so many similarities, but at the same time pricing is, for example, one that's just-- Or, cost structure.
It's just one that feels so radically different between a traditional video platform and real-time streaming.
So much of your cost is tied up in encoding, particularly if you don't have that many viewers, whereas real-time is typically a small number of viewers, but everybody's contributing, and almost none of the cost is encoding because all of that can be done on the client. So, it's just fascinating.
Kwin: Yeah, that's right.
Our users pay for the encoding, but they don't pay for the stateful server connection that we need directly.
We pay for that, so you're totally right. The cost structure moves around.
I mean, compared to something like HLS, in the WebRTC world we're never going to be able to optimize for quality the same way an HLS stack is going to be able to.
We're never going to be able to optimize for cost quite as well as an HLS stack, and that's the tradeoff of trying to get to that 200 millisecond or lower number.
Matt: So, what other technology challenges have you run into with real-time? Is this--?
We've talked a little bit about codec support, but are firewalls a big problem?
Do you have to get around all those with the stun and turn stuff we talked about earlier? Any other big things there?
Kwin: Firewalls and network infrastructure in general is something we worry a lot about.
Firewalls in particular have gotten just massively more open to real-time media in the last couple of years, so we rarely see firewalls that are a major issue anymore.
That's really nice. What hasn't completely improved is scaling on the client side for larger numbers of people in a call, so we've done a lot of work to try to understand exactly how to optimize for Chrome and Firefox and Safari and Electron.
Our biggest single pain point is "How do we combine the optimizations we're doing for variable network conditions and the optimizations we're doing for variable behavior on the client side in terms of CPU and processing power?"
Like, a fancy MacBook Pro running Chrome is a pretty different beast than an iPhone 7 running Safari 12.
We have to be cross platform, so trying to manage how many videos are being played out at a time and what the resolution of those videos are is always a moving target for us.
Matt: So we talked a little bit about scaling one call to a large number of members in that call, but what about the examples where you have a few people, like say a panel and a talk, and those then want to broadcast to a large number of people who don't want to be in the-- They want to be passive viewers.
It's a few-to-view broadcast. What does that infrastructure look like for y'all's workflow?
Kwin: That's an increasingly big use case for us.
We have a lot of customers who really want to be able to do what you called "few-to-view broadcast," which was a great term and not one we've used before, but we're going to borrow it now.
So the challenge is "How do we have that really great small call experience, but then at relatively low latency, make that available to a much larger audience?"
And the answer today is a bridge from WebRTC to something like HLS, so we've tried to build that bridge so that our customers can use both our APIs and Mux, for example, or YouTube live or Facebook live.
That does end up requiring a couple of transcoding steps today, so we take the small call and on our media servers that are routing the media packets, we run a compositing and encoding pipeline.
We decode all the media, we lay it out into a single frame and combine the audio tracks, run that through an RTMP output stage, and send it to an RTMP ingest URL provided by the customer.
That ends up working pretty well from a standards perspective, but it's really disappointing from a core engineering perspective because we'd much rather--
We know you, for example, at Mux are going to take that RTMP and you're going to do a much better job of transcoding it, so we'd really love to figure out how to hand off the WebRTC tracks in a much lower level way to people who are experts at HLS and have a great HLS technology stack, but that remains a little bit of a pipe dream today.
Matt: Fascinating. This has been amazing, Kwin. Thank you so much for taking the time to chat with us.
Phil: It's so illuminating. This is just a world I know so little about. It's just so cool.
Matt: I feel like I use it all the time and we hear about it all the time, but really digging in is one of those things that it's just a different world.
Steve: Just how different the technology is from what you're working with in the more traditional HTTP streaming side of things. It's pretty fascinating.
Phil: Never have I been more happy to use HLS and to have HTTP as my fundamental protocol.
Matt: It's like "Same," but completely different. So thanks again, everyone, for joining.
Thank you so much, Kwindla, for this illuminating conversation.
This was really great. Just a reminder, 2020.Demuxed.com for tickets and donation matching if you want to give, and also we just wanted to explicitly call out our request for topics.
If you have something you want to talk about, get in touch. If you have something you want to hear about, get in touch.
We can figure out who we can find that might be able to chat about whatever that thing you have that burning desire to learn about is.
Steve: How does one get in touch?
Matt: @Demuxed on Twitter, or just ping @heff, @phil or @mmcc on the video-dev Slack, video-dev.org. Or you can email Info@Demuxed.com.
Phil: You can definitely do that.
Matt: We need to set up a wolf somewhere. But anyway, thanks again, Kwindla. This was a fantastic call. Really appreciate it.
Kwin: Thank you so much.