Matt: Hey, everyone.
We're back and we have the whole gang back together.
It feels like it's been forever, even though it's actually just been one episode that it
wasn't all three of us.
This time we're back with a guest. The last two shows we just
did ourselves: the three of us talking about low latency, and then the two of us in the last one.
This time we have our friend from the industry,
from YC, from just the world.
Kwindla from Daily.co, if you're curious about domain names.
I wanted to talk real quick before we get started,
Demuxed, by the time this is released the speaker submission should be closed
and the panel should be starting to work on talk selection,
all that stuff.
But it's the last week of October, that
Tuesday through Thursday, the 27th through 29th.
Just a reminder, when you buy your ticket you
have a chance to just buy a normal ticket or a ticket with a swag box.
If you want the t-shirt and all that stuff we're going to try to ship
to anywhere that asks for it, but we can't ship a
t-shirt and badge for $80 to Timbuktu.
That's not going to work. A third of all ticket
revenue goes to causes that help support
diversity and inclusion in technology and the world, but primarily in
our tech industry.
If you donate anything over the base
ticket price, anything above that goes straight to those causes.
Just a reminder, also we're doing
donation matching up to $3,000
if you want to just donate outside the ticket or before.
We're going to donate those as soon as we hit that mark, so let us
know. You can donate to any D&I--
Phil: Social justice--
Matt: Yeah, social justice group that you like.
We'll default to donating to DevColor and the ACLU, but
we'll match whatever social justice initiative you want
to donate to.
OK, so we're
here today to talk about real-time.
We had these two episodes on the world of low
latency and what that looks like, and the
different gradients of low latency in the world and
ending with this deep dive into what Apple's new low latency
HLS spec looks like.
A question we get a lot is "What does the real-time world look like?"
I thought it'd be great to talk about real-time video as a follow
up to this pushing the bounds of lower latency and then going
fully into that real-time side of the latency spectrum that we talked about in the last episode.
So to talk about that, we brought in our friend
Kwindla from Daily.
We went through YC with him way back in
2016. Kwin, do you want to give us a little bit of
background about Daily and what y'all do?
Kwin: Sure. Thanks for having me, I'm a big fan of the podcast so it's really
fun to be on with you.
I'm Kwindla Hultman Kramer, I'm
co-founder of a company called Daily.
We make APIs for real-time interactive video.
We have been working on video for a long
time, individually and collectively as a team.
I've been interested in large scale, real-time networks
and video since I was in graduate school.
The genesis of Daily is the
idea that live video is going to be more and more a part of our
lives and everything we all do online and a
standard called WebRTC, which a number of browser
manufacturers and other platform folks have been working on for
several years, is going to help make that happen.
We started Daily to build a technology stack on top of
WebRTC that could serve as many use cases as possible.
Our users are doing things like video
calls, but also customer support,
retail, telehealth and some interesting new
use cases around the future of work, like always-on
video desktops for distributed teams and
experimental educational stuff, like fitness classes.
Phil: I saw one of those products the other day, actually.
I can't remember what it's called, but it's a Kickstarter or
something like that, but I guess iPad--
Where you just,
it's on a stand and you put it beside your desk and you flick it on in the morning and
all of your teammates are just there.
That sight is terrifying to me, but--
Kwin: I think there's so much interesting
experimentation, and some of those tools are really going to stick in maybe the
same way that Slack, for example, really changed a lot of our
assumptions about how we work together.
Phil: In the last two episodes we really focused in on this low
latency and ultra low latency, and I heard
the words the other day, "Hyper low latency." It was a new one for me.
Steve: Oh, no. Please, No.
Phil: I know, right?
And maybe just to rescope a little bit on where that line
is drawn, where things become real-time video.
All the stuff we talked about in the last couple of episodes,
best case that's going to get you down to-- What,
maybe half a second, at some level of
scale. Obviously real-time is a
lot lower than a second.
This is something where the
latency is not disruptive to human
interaction, so we're talking about
sub-100 milliseconds of latency
realistically, for a great experience.
Steve: Even at 300 milliseconds you start to get that
back and forth, like you're having a long-distance international
phone call and you start talking over each other.
There is actually a pretty wide gap between
a good real-time interactive
experience and where we can get to with the
traditional live streaming technologies.
Matt: This is probably a great place for you to chime in.
What do you typically see as that threshold for
acceptable in the real-time world?
I imagine it's almost even smaller for what's acceptable there, but what do you see?
Kwin: Yeah, the deep background of the technologies we build on top of is
the voice over IP stack that your
listeners are probably familiar with from originally moving
phone calls, business phone calls, to digital networks.
The traditional VoIP world definition of
acceptable latency is 250 milliseconds or less.
As you were saying, though, you really want to be down closer to 100
milliseconds if you possibly can be, but you can't beat
the speed of light so if you're talking to people halfway around the world it's a little
tough to be at 100 milliseconds.
Some of our users use satellite links, those are more like 600 milliseconds.
There are always
buffers if you're encoding video, video is harder than voice
in terms of real-time.
So with video buffers in the
mix we feel like we're doing OK
if we're at two hundred milliseconds or so.
We want to be better, but 200 milliseconds is what we aim for.
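The speed-of-light constraint Kwin mentions is easy to put numbers on. A rough sketch, where the distance and the fiber refractive index are ballpark figures rather than measurements:

```javascript
// One-way propagation delay for a path halfway around the Earth.
// 20,000 km is roughly half the circumference; light in fiber travels
// about 1/1.48 of its vacuum speed. Routing detours add more on top.
const km = 20000;
const cKmPerSec = 299792; // speed of light in vacuum
const vacuumMs = (km / cKmPerSec) * 1000; // ~67 ms
const fiberMs = vacuumMs * 1.48;          // ~99 ms
```

So a sub-100 ms round trip to the other side of the world is physically impossible; even one way in fiber eats nearly the whole budget.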
Matt: That pesky speed of light issue
Phil: It gets me every time.
Matt: We already started talking about this a little bit here, but
I thought it'd be nice to touch on how this
is technologically different from traditional live streaming?
So, how is real-time different from what we see in traditional live streaming, whether or
not it's low latency?
I thought it'd be nice to dig in there a little bit.
You touched on this VoIP foundation that it's being built
on, but can you tell us a little about the technology history behind all this?
Kwin: Sure. Digital video and audio have
been research topics for a long time.
I built a bunch of experimental video stuff in graduate school
in the 90s, but it really became a widespread
technology as business telephone systems
and other kinds of commercial audio systems
started to need to route over digital networks.
The industry developed around VoIP, or voice over
IP, so if you've used a phone system at a big company
in the last 20 years you've used some VoIP technology.
Then of course cell phone networks became digital, and then in
2010-2011 Google started to get serious
about thinking about what should get embedded in web browsers
for digital audio and video that could be real-time,
could be used for interactive applications like video calls,
and it started to work on a specification that
eventually became known as WebRTC.
Mozilla Foundation got on board with Firefox, and Google and
Mozilla in 2012, 2013,
2014 timeframe started to build really cool new stuff
into web browsers, and WebRTC became a standards effort.
A bunch of other companies started to contribute to the
WebRTC effort, and if you fast forward to today
the set of standards called WebRTC is now built into
browsers and mobile platforms and is something that
developers can build directly on top of to build
really low latency and interactive audio and video.
Steve: So, how much of that is just the same technology that's been
used for VoIP and how much of it is actually pretty new?
Kwin: It's a mixed bag.
All of the core pieces have been around for a long time but
how they get put together and how they get implemented in the browsers
are really still unstable.
We're figuring this stuff out as we go along, and
every major release of the browsers tends to
cause various issues.
Sometimes major, sometimes minor.
It still feels like we're at the early days of scaling and
making this stuff really easy for developers.
But it is super exciting that we've crossed this
bridge to having a standard that you can build
on cross platform for live interactive video.
Phil: I was having this conversation with Matt just two days
ago about how
WebRTC when you're playing with it and experimenting with it right now
feels like HTML5 video did
six, seven, eight years ago in the browser, and media
source extensions in particular.
It feels like it's at that stage of incubation or
maturity in comparison to where the
technologies you use when you're delivering
HLS or DASH to a browser.
They're just that chunk more mature
right now, in particular from a compatibility perspective.
Kwin: It's so true. I have three different
computers on my desk for testing, and I don't think that--
That was true when we were building websites in
1999 and it's not been true for a while, but with the
WebRTC stuff it definitely is true.
Matt: Yeah, the kicker I think that always gets me is having to jump
back into nine different vendor prefixes.
Sometimes there's different vendor prefixes within a vendor, which is wild.
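That prefix juggling usually gets wrapped in a small shim. A hypothetical sketch, where the prefix list and lookup order are illustrative:

```javascript
// Resolve the first available implementation of a browser API across
// vendor prefixes, e.g. RTCPeerConnection vs webkitRTCPeerConnection.
function resolvePrefixed(globalObj, name) {
  const prefixes = ["", "webkit", "moz", "ms"];
  for (const prefix of prefixes) {
    const key = prefix
      ? prefix + name[0].toUpperCase() + name.slice(1)
      : name;
    if (globalObj[key] !== undefined) return globalObj[key];
  }
  return undefined;
}
```

In a browser you'd call something like `resolvePrefixed(window, "RTCPeerConnection")` once and cache the result.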
But something that you alluded to a little
bit in there that was really interesting to me is that
it feels like this was another thing that got really impacted
by the death of browser plugins, things like
Flash and RealPlayer?
I don't know if that one did real-time, but my point is I
think this was a niche that was filled by those plugins.
Then there was the dead area after all of those
died for a little while, and I hadn't really thought about that
relationship but it's obvious in retrospect.
Kwin: The way we talk about it is at least three
things have to come together to give developers
something they can build on in lots of interesting ways, and
then we see a bunch of new applications start to come out.
One is that underlying technologies have to get good enough, so you have to
have fast enough CPUs and good enough network connections
available for everybody.
Then you have to have the
core low level implementations that take advantage of those
CPU and network connection advances,
and then user expectations have to start to shift.
When I started building live video stuff
and I would demo the things that we were building, people would say, "That's really
cool. But I'm never going to turn my camera on in a video call."
Phil: Little did they know what was going to happen.
Kwin: It's really funny to see how things change, and we don't even remember.
You cross over that boundary and you don't even remember
how different things were.
One of the parallels we try to draw is how payments are
now ubiquitous on the internet, but I'm certainly old enough to remember
when lots of people said there were never going to be--
Users were never going to
trust the internet with their credit card number, so
user expectations really matter.
As you start to build that
flywheel of underlying core
speeds and feeds type tech, and then
implementation-level building blocks that developers can use and
things that users are interested in doing and excited to do and willing to do, you
get all sorts of new use cases in a Cambrian explosion.
I do think that's the early stage of what we're seeing now
with interactive video, all those things have come together.
There's a whole lot of experimentation and things are still a little bit hard to
build at scale, but there's so much interesting stuff going on.
Matt: So I'm curious to hear, because we've talked a lot about--
This has been a really browser centric conversation so far.
Obviously, that's the platform these days, let's be honest with ourselves.
But I'd be curious what the rest of the landscape looks like, and do you
all dive much into that or are you really able to
stay almost entirely in the browser platform world
and be as ubiquitous as Daily is?
Kwin: We focused on the browser early on because, partly
with Google's big efforts in this new standard,
the browsers were ahead of mobile platforms
and it was pretty tough to build a mobile app that could do
really good quality audio and video at extremely low latency.
That started to change in just the last
year, so we are starting to put a lot more work into mobile.
We have some new releases coming up and we think you pretty much have to,
because as things get more ubiquitous, as use cases
evolve, mobile is 50
to 90% of the computing that most people do.
So, mobile is catching up.
It's pretty close to being a completely viable platform.
Phil: Has that changed--?
I saw stuff at WWDC. More WebRTC components coming to
iOS, if memory serves?
Kwin: Yeah, that's exactly right.
Mobile Safari is now pretty good.
Apple is supporting some stuff lower level
than Safari, but we're still at the point where as a developer you can use the browser.
But if you want to build a native mobile app, you need to go
down to the level where you're actually maintaining some of the core
library stuff yourself.
That will change, and we can make that easier for users, but the
mobile development path is still quite a bit more challenging than the browser one.
Phil: I guess that plays into what I'm super
interested in, which is how does
real-time communication and real-time video market, just--
Is everything now just WebRTC?
If I'm using Zoom, is it WebRTC?
Or are people still doing more weird and innovative
ways of shuffling around real-time video?
Kwin: The short answer is "Everything except Zoom is WebRTC."
Zoom did this amazing job building a proprietary video stack and
great native applications that could leverage
their proprietary stack, and they had to do that given
when they were launching and scaling, because WebRTC just wasn't
ready on any level for what they were trying to do.
As WebRTC has gotten better, almost everybody else has
shifted to using WebRTC often in
experimental ways, sometimes hacking up and modifying what the low
level implementation is doing.
But it seems pretty clear to me, I'm just one
opinion, but it seems pretty clear to me that we're at the tipping point where the
standards ecosystem is going to outpace everything else in
terms of both performance for core use cases
and the ability to support all the interesting longtail edge case use cases.
There's going to be another generation of the WebRTC spec
that deconstructs some of the core primitives to
support more experimentation and more use cases.
That'll take a year or two or three to start having an
impact, but it's already being worked on.
I think the whole world is moving toward WebRTC and
investing in and contributing to WebRTC at this point.
Steve: On that note, I wanted to know how much room is there
for tweaking the streaming algorithm of WebRTC as
it's determining what quality to send and
details like that.
Are people modifying WebRTC at that level or do you just
trust that the browser is doing smart things?
Kwin: It depends on what you're doing.
For what I would say "Normal" use
cases, and "Normal" is a moving target, but for normal use cases you're
trusting the browser for a bunch of reasons.
A lot of stuff really has to get done at the
C++ level to work.
So the video encoding, echo cancellation,
lots of stuff you folks are really familiar with just really has to be done at that level.
If you're not trusting the browser, then you
can't work within the browser.
What you can do if you have a use case that isn't well supported by
today's standard WebRTC is you can
either work at the C++ level yourself, which has more knobs you can
turn, or you can hack on the open
source WebRTC code base.
That is the basis for most of the
WebRTC implementations, and both of those are really doable and
people are doing both of those things.
It does mean you have to have a native application or an Electron
app, for example, so you're no longer compatible
with the browsers.
Sometimes you can work really hard to be compatible, cross compatible, like
your users in the browsers have a certain experience
that's got guardrails on it.
Then your users with your native application can do more,
but that's twice as much work, at least.
Steve: That's interesting, because that's one of the differences that
you can call out between RTMP and WebRTC.
Like, with RTMP you have maybe a little bit more control.
If you really want to send a high quality stream as much as
possible and you're less concerned about necessarily
the latency, you might choose RTMP because you
can send that higher quality stream.
Whereas WebRTC, I get the impression you tend to have a little bit less control.
Kwin: That's totally right. And that's one of the really interesting tradeoffs, that when--
I think when developers come to you and they come to us, they ask
for things like low latency and high quality, but the
actual tradeoffs there are really subtle and really interesting.
Often WebRTC is really
biased towards low latency at the expense of quality.
So, keeping the media flowing.
To a first approximation the big difference is UDP versus
TCP, so we lose a lot of packets when we send
them on a lot of networks, and so we
don't always have time to resend those packets.
We certainly don't have time to run multiple encoder passes, so
we're never going to be able to send video at the same quality
that you could send video at if you're willing to tolerate
15, 20, 30 second latency.
But we can push the latency down by
building lots of robustness to packet loss and other things into
the system, but it is a real tradeoff and you have to decide what
you're going to optimize for.
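That tradeoff shows up concretely in retransmission policy: a resend is only worth asking for if it can still arrive before the frame's playout deadline. A simplified sketch, where the policy and field names are invented for illustration:

```javascript
// Decide whether a lost packet is worth re-requesting under a
// real-time budget. Non-keyframe packets that can't make it back
// within one round trip are simply skipped, trading quality for latency.
function shouldRetransmit(packet, rttMs, msUntilPlayout) {
  if (rttMs <= msUntilPlayout) return true; // resend can arrive in time
  return packet.keyframe;                   // keyframes are worth a late resend
}
```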
Phil: One of the things we see a lot of,
or have seen bits and bobs of,
is abuse of the WebRTC spec.
That's probably the wrong way to phrase it;
we see a lot of people being creative with the WebRTC spec and
using the file transfer mode and
moving around bits of video for ultra low latency
for close to real-time streaming, from
a real-time communications standpoint I assume that's
a no go area where you need the traditional media channels.
Would that be fair to say?
Kwin: I think you're talking about WebRTC data channels.
Phil: I am, yeah.
Kwin: A little bit of background, in the WebRTC spec there are media
channels and data channels, and the media channels were designed
to have a high enough abstraction
level that you can assume that media you're sending through the media channels
has things like forward error correction.
It can do resends of missing packets, you
can re-request keyframes, the media channels are built on
top of RTP but the data channels are
just catch as catch can.
We're going to send a bunch of packets out over whatever the
underlying transport is, probably UDP, but you don't really know,
and with none of the higher level
stuff that's super useful for media.
So if you're sending media over the
data channels, then you're either doing something
wrong or you're pushing the boundaries of what the spec can do.
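Pushing media over data channels means rebuilding by hand what the media channels give you for free, starting with packetization. A hypothetical sketch, where the header fields and the 1200-byte MTU are illustrative:

```javascript
// Split an encoded frame into datagram-sized packets with just enough
// header for the receiver to reorder and reassemble. Everything RTP
// would add on top (timestamps, FEC, NACK) is still missing here.
function packetize(frame, mtu = 1200) {
  const packets = [];
  for (let seq = 0, offset = 0; offset < frame.length; seq++, offset += mtu) {
    packets.push({
      seq,                                // receiver reorders on this
      last: offset + mtu >= frame.length, // marks the end of the frame
      payload: frame.slice(offset, offset + mtu),
    });
  }
  return packets;
}
```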
Matt: Why not both?
Kwin: Opinions differ, right?
Engineers playing with tradeoffs are going to do interesting things.
Matt: The need to control things in different ways than the media channels give you access to,
though, is one of the drivers of the WebRTC next generation spec.
Separating out the encoding
and bandwidth adaptation and
other layers that are in the media channels that you don't have as much control
over into elements of a pipeline that you do have
control over is definitely something that the standards body is aware
that people want to do.
Phil: And on a super cool screen.
Kwin: I think it's fun to talk about Zoom's amazing hack, which
is their in browser client.
As you say, it uses WebRTC but not really WebRTC.
Zoom cross compiled into web assembly a bunch of their
proprietary encoding and decoding and networking stack,
and they use WebRTC data channels and a bunch of web
assembly to implement Zoom in the browser.
It's an amazing and a really cool hack, it's
probably not something that can scale the
way a more native WebRTC
implementation can, and certainly can't scale the way their native clients can.
But it's an interesting experiment exactly along the lines you're describing.
They're pushing the boundaries of two technologies, web
assembly and WebRTC to do something that you can't do
with core WebRTC.
Matt: I was corrected the other day, because I thought that they were using
WebRTC-- I guess, quasi corrected.
I thought they were using WebRTC and that's how they were in the browser.
Then I was corrected, somebody was like "No, they're actually using web sockets."
And apparently they used to use web sockets in the browser and the
WebRTC transition is relatively recent, which
honestly that whole thing blew my mind a little bit.
But it is interesting that WebRTC has gotten so ubiquitous
in this industry that I was shocked to hear that
somebody wasn't using it, especially as big of a player as Zoom.
Kwin: They have a bunch of great stuff they've built that just isn't available at the browser level.
They had to pick some different point in the trade off matrix.
As you're saying, with the web sockets versus data channels
decision, there's a whole discussion about what web sockets are and
aren't good at and what data channels are and aren't good at, and it does
cause us to pull our hair out because we
work with both every day and we've done some similar
things of going back and forth between using web sockets for some things and
using data channels for some things.
Especially on what we call the signalling side.
So when WebRTC people talk about signalling, what they mean is "All of the stuff that's
not media in a call.
So, how do you set up the call?
How do you figure out how to talk to other people on the call?
How do you keep track of state in a call?
Like, who is muted and who's not?" That stuff is actually explicitly
not in the WebRTC spec, it's for developers to
decide for themselves because use cases vary so much.
One of the first challenges you encounter if you pick up the WebRTC
APIs as a developer is, "This is really cool stuff.
It solves a lot of really hard, heavy-lift
problems for you.
on a call together in the first place?
And then how do I know who is muted and who is not muted, or who
has a chat message they want to send?
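Because signaling is out of scope for the spec, every app ends up with some version of a message protocol and a state store for it. A minimal sketch, with invented message shapes:

```javascript
// Track who is in the call and who is muted from a stream of
// app-defined signaling messages (usually carried over a WebSocket).
function applySignal(state, msg) {
  const peers = { ...state.peers };
  switch (msg.type) {
    case "join":   peers[msg.id] = { muted: false }; break;
    case "leave":  delete peers[msg.id]; break;
    case "mute":   peers[msg.id] = { ...peers[msg.id], muted: true }; break;
    case "unmute": peers[msg.id] = { ...peers[msg.id], muted: false }; break;
  }
  return { peers };
}
```

Every client replays the same message stream, so everyone converges on the same view of the call.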
Matt: On the topic of all this latency, I'd be curious to hear where it
actually comes from.
Like, in low latency video and in our
world, typically that's in the transcoding
step or getting out the CDNs or any
myriad of places in between there from glass to glass.
What does that look like specifically for WebRTC?
Kwin: I think relative to what you all do, there are three big
world view differences with the specs we build on top of.
We do as little encoding and transcoding as
possible these days in the WebRTC world, so we try to
encode in real-time as best as we possibly can on each client
and then we try--
And there are exceptions to this, but we try to just
pass through that encoded video without
touching the packets all the way through
to all the other end users in the call.
And there are complexities there in the end to end encryption
and bandwidth management that are really interesting and really
fun from an engineering perspective, but basically, we try to
do a quick one-pass encode and then send all those packets out
and receive them and decode them with as little buffering as possible.
So the first difference is we don't have access to much transcoding,
and that limits what we can do, but it also lets us do stuff in
theory really fast.
The second big difference is if we're
not able to do UDP, we are unhappy.
TCP for us is a fallback case, and
that's because if we can just fire and forget UDP packets the
media will probably get there faster than if we have to set up and tear down
TCP connections and sit on top of
TCP retransmit algorithms.
If we can't get a media packet through in the first
go, it's probably not something we're going to try to do again unless it's a keyframe.
So UDP is the right choice, and then the third thing we can't do that we would love to do maybe
in the future, we'll build a whole bunch more new Internet
infrastructure to handle, but we can't really rely on caching
layers or CDNs or anything like that.
Even though we build on top of UDP, we have
to maintain a stateful idea of the connection
with every client in the call so that we can route the right
packets to them as quickly and as efficiently as possible.
We have to scale our calls with servers
that know about everybody on the call and are doing a certain
amount of real-time routing that's a little bit smart.
So, latency comes because we're
buffering the media to encode it or decode it,
or we're having to deal with packet
loss or our servers are having to route the packets.
When you add up the network
transport and the encoding and a little bit of time on the servers, that's where
you end up with the two hundred milliseconds that we
usually are able to achieve and that we target for, for end
to end latency.
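Those components add up to the 200 millisecond figure. A back-of-envelope budget, where the individual numbers are illustrative guesses rather than Daily's measurements:

```javascript
// Where a ~200 ms end-to-end budget goes when there's no transcoding:
// one encode, two network hops through a routing server, one decode.
const budgetMs = {
  captureAndEncode: 40,      // one-pass real-time encode
  uplink: 40,                // client to media server
  serverRouting: 10,         // packet forwarding only, no transcode
  downlink: 40,              // media server to client
  jitterBufferAndDecode: 60, // absorb reordering, then decode
};
const totalMs = Object.values(budgetMs).reduce((a, b) => a + b, 0); // ~190 ms
```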
Phil: So, in terms of-- I'm sorry, I'm not
an expert on this in any way, shape or form.
How much of the traffic
from me talking on my
WebRTC call is going from
me directly, peer to peer,
to the other people, and how much of it is going
to a server somewhere and then being broadcast?
Kwin: That's a great question.
WebRTC was invented as a peer to peer
media spec, and you still see that in the roots
of everything about WebRTC.
The assumption is that it's always peer to peer.
Negotiating a peer to peer connection is really complicated on the internet,
because everybody's behind firewalls and
NAT layers between them and the internet.
We have to do a bunch of things, and these are built into the
WebRTC spec in a really actually elegant way to try to
get UDP packets from my computer to your computer.
We always try to do that in small calls, so in a
one on one call we always-- "We" with the
choices we make on the Daily platform, we're almost always trying to negotiate
a peer to peer call initially.
As you get to bigger calls it becomes prohibitive to
do peer to peer, because you can't connect everybody in the call to everybody
else and encode the video for everybody else.
Phil: Right. I don't want to send my video to 100 people,
potentially at five different bit rates.
That would be infeasible.
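Phil's point is just combinatorics. A sketch of per-client stream counts in the two topologies, ignoring simulcast for the moment:

```javascript
// Streams each client must send and receive in an n-person call.
// Full mesh: encode and upload a copy for every peer. SFU: one track
// up, the server fans it out; you still receive everyone's track.
function streamsPerClient(n, mode) {
  if (mode === "mesh") return { up: n - 1, down: n - 1 };
  if (mode === "sfu")  return { up: 1, down: n - 1 };
  throw new Error("unknown mode: " + mode);
}
```

At n = 100, mesh means 99 separate uploads per client; the SFU keeps it at one.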
Kwin: That's right. Both the CPU on the client and the bandwidth
available to and from each client can't do it.
So at a certain point in the call
in the WebRTC world, we generally switch over to routing
media through a server or a media server, and the most
common way to scale is you send
one upstream media track from
each client to the server and then the server
multiplexes that out in real-time to all the other clients.
But you do get each
track from everybody in the call downstream,
so you still have a scaling challenge of how you send
all the tracks out to everyone in the call and how the clients
can deal with all those tracks.
There's a couple of reasons we tend to do it that way, and they're
engineering tradeoffs. The reason you go
one up and many down in this world is to avoid the
transcoding, and transcoding is
hard to do fast and it's expensive from an
infrastructure perspective. It takes a lot of CPU and it also--
This is somewhat less important but it's also true, it's really hard to do
transcoding in real-time at adaptive rates at
multiple adaptive bit rates.
Dealing with real-time in WebRTC is a
lot about dealing with variable network conditions.
Both variable over time to and from the same client, and
variable across all sorts of different clients connected to a call.
One of the really fun things that got built into WebRTC in the
last-- It really became usable in the last year,
is called "Simulcast."
What our platform does, and
a lot of other platforms do this now too, is we actually send
three copies of the video up from each client.
We send a really low, a medium, and a
reasonably high bit rate up at the same
time, and then the media server peels
off whichever layer, the "Simulcast layer" it's
called, is best for each other client in the call and sends only
one layer downstream.
That lets you do a lot of interesting things around dealing with
network conditions that are variable, and also customizing
which track you're sending, depending on the UX on the other end
in a way that doesn't require transcoding on the server and doesn't require
prohibitive amounts of CPU.
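The server-side layer choice Kwin describes can be sketched as a simple selection over the advertised layers. The bitrates here are invented, and a real selector also weighs packet loss, CPU, and requested resolution:

```javascript
// Pick the highest simulcast layer that fits the receiver's estimated
// bandwidth, capped by an application-requested maximum layer.
const layers = [
  { name: "low", kbps: 150 },
  { name: "medium", kbps: 600 },
  { name: "high", kbps: 1800 },
];
function pickLayer(estimatedKbps, maxLayer = "high") {
  const maxIdx = layers.findIndex((l) => l.name === maxLayer);
  let best = layers[0]; // always fall back to the lowest layer
  layers.forEach((l, i) => {
    if (i <= maxIdx && l.kbps <= estimatedKbps) best = l;
  });
  return best.name;
}
```

The `maxLayer` cap is what lets an application say "only send me the small stream for this tile," as described below.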
Steve: That's great. Do you have to then watch the
streamers connection and cut off the
higher quality one if the connection drops too low?
Kwin: You do. You have to do send-side bandwidth
estimation from the server and try to make a good guess
about what the right layer to send is, what the
max layer you can send is.
Then also at the API or the application layer, you want
developers to be able to say, "I really only want the smallest
layer. Please don't send me the highest layer for this other
participant in the call. Because, for example, I know I want to preserve a
bunch of bandwidth for a screen share that's also going on."
Or, "I know I'm
trying to render 50 small tiles and I don't want
contention on the network, because there's only going to be a small
amount of pixels per user anyway.
So, send me only the lowest level streams."
So there's a
bunch of control you can have over that, but it does
add to the complexity of figuring out how you're going to route the tracks.
Steve: Is there a sweet spot going from the one to one
use case and just direct peer to peer to actually
sending it through the server?
Like, is there a sweet spot as far as number of users
where--? I guess it's bandwidth constrained, but have you found a number
that's like "At 5 users we're definitely sending through the
server," versus one to one peer to peer?
Kwin: That's such a great question, and your guess of five was really spot on.
Because historically we
have switched from peer to peer to media
server mode when the fifth person joins the call.
Now, we have a ton of real world data about what
works well across all sorts of calls.
One funny thing is that's changed a lot in the last six
months, and I think we may actually end up dropping that
number down quite a bit lower.
We've seen lots more variability in
ISP peering quality,
and as more and more mobile users have become
active WebRTC application
constituents, we've seen mobile networks
have different dynamics.
4G networks have really different dynamics than
home ISPs, which have different dynamics than business
networks, so it's a little bit of a--
If you're going to pick an
average number, you're always guessing a little bit.
One of the things I think we are now working on and other people are
working on too is getting better at deciding, mid-call,
which mode we should be in and trying to
seamlessly switch in a way that users never even perceive,
which is super fun from an engineering perspective.
We have to start up streams without taking too much
CPU or bandwidth, then we have to crossfade
between video and audio streams so that people don't cut out.
We're not all the way there yet, but we're almost there.
We've got a bunch of stuff we're going to release in September that
we call "Smooth switching" between
peer to peer and--.
"Smooth" with lots and lots of O's, and we're
going to trademark it.
Matt: I assume that a lot of this-- I've got two questions here.
So, we've been talking about what scaling one
call to a larger number of call members looks like there.
I'd be interested to hear what the other side of that looks like: instead of scaling one call to a large number of members, scaling a bunch of calls with a small number of members. Is that all just signaling load, or are there additional things there that you run into?
Because I'm sure all of this
has exploded a little bit with the whole pandemic thing we've had going on.
So, I'm curious where you're
seeing folks beating down your door for the scale metrics.
Is it just a sheer number of calls, or are you
really seeing people wanting to do an 80 person real-time call?
Whatever the hell that looks like?
Kwin: We're seeing both.
We had a period in March and April where the number
of calls we were hosting at peak periods
grew by a factor of five or more every
week, week on week.
So we ended up just scrambling to add a bunch of infrastructure.
There's some publicly available stuff about what Zoom did during that
period too to add infrastructure, and if you're down in the weeds
on this stuff like we are it was really interesting to see that.
It's not that hard to scale just in terms of number of calls; it's a pretty traditional infrastructure challenge.
The only complexity is that we can't rely on
pretty much anybody else's infrastructure to help.
So we can't rely on CDNs or other
great technology that's been built out over the last 20 years to
scale HTTP traffic, because it's all UDP and
because it's all routed in this custom way.
We scale by adding servers to our clusters
effectively, and as you were saying, we do end up with
bottlenecks on our own signaling
infrastructure as well.
We can scale that horizontally to some extent too, although
as with all infrastructure scaling challenges at some point you have some big
database somewhere that you have to figure out how to shard, but
mostly we just add servers to our clusters.
So we currently have clusters in seven
regions running on AWS' network
mostly, and we are in the middle of adding a bunch more
clusters in a bunch more regions.
It turns out to be useful to put media servers as close as possible to users because of that big speed-of-light issue.
Adding more geographic clusters is a big
priority for us right now.
Longer term, it's super fun to think about what the internet
infrastructure might look like as UDP media traffic
becomes a bigger and bigger deal.
Phil: That's actually something I wanted to ask.
Is AWS suited for that use case?
Is getting UDP traffic into the
cloud-- I remember 8 years ago
now, first trying to put a lot of UDP traffic into AWS and that being
very hit and miss.
Has that improved? Is it the best platform to route that traffic in?
Kwin: It's improved a lot and it's now more than good enough. It's not perfect, and we find corner cases and end up talking to the AWS engineering support folks, who have been really great.
A lot of Zoom is on AWS as well.
Phil: No, that helps.
Kwin: That helps a lot.
But we do sometimes find things like instance
types that are CPU inefficient for UDP
heavy workloads, whereas they're really great
for other workloads.
Then we either switch instance
types or we talk to the engineering support folks about it.
It is easy for us to imagine the perfect infrastructure, as I'm sure it is for you.
I do believe that
there will be a set of CDN-like
services eventually optimized for UDP
media traffic, but that's a number of years away.
Phil: You've got to feel like that sort of thing will hopefully
transition to more edge compute style, right?
Is that a bit of a pipe dream at the moment?
Kwin: I think that's exactly right, because media routing is pretty lightweight. It's not compute intensive, but there's compute that it's hard to factor out.
What we imagine is something like a micro-PoP with WebAssembly support baked in at a pretty deep level that lets us route UDP.
I think that is not that hard to build, I think
like a lot of things it's hard to spec and get right and scale.
Phil: Sorry, I am pondering two more questions.
I don't know how to work it in, but I really want to know what STUN and TURN are, because I know it comes up every time someone talks about real-time video. So, what are STUN and TURN?
Kwin: Sure. STUN and TURN are part of that peer to peer routing suite of technologies that are built into the browser. STUN is a server on the internet you can ask about what IP addresses might be usable to reach you, and TURN is a server out somewhere on the internet that you can route media through if we fail to establish a true peer to peer connection. Together, STUN, along with a protocol called ICE, lets us try a bunch of different internet addresses and port numbers with a bunch of different timings to try to punch holes through firewalls and network address translation layers. If we can't do that, TURN lets us agree on a server that we can bounce media through. It's not really a media server because there's no smarts in it at all; it just pretends to be the peer and relays the traffic.
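The overall flow can be sketched as a toy fallback loop. This is purely illustrative; real ICE gathers, pairs, and prioritizes candidates per RFC 8445, and the candidate addresses and relay name here are invented:

```python
def connect(candidate_pairs, can_reach, turn_relay):
    """Try each address/port pair discovered via STUN; if hole
    punching fails everywhere, fall back to relaying via TURN."""
    for pair in candidate_pairs:
        if can_reach(pair):          # punched through NAT/firewall
            return ("direct", pair)
    return ("relay", turn_relay)     # bounce media through TURN

# Hypothetical candidates: one public, one private address.
pairs = [("203.0.113.7", 50000), ("192.168.1.4", 50001)]
mode, via = connect(pairs,
                    lambda p: p[0].startswith("203."),
                    ("turn.example.com", 3478))
# -> ("direct", ("203.0.113.7", 50000))
```

If the reachability check fails for every candidate, the same call returns the TURN relay instead, which is exactly the "agree on a server to bounce media through" case described above.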
Phil: Thank you, that's brilliant and it answers so many of my questions.
Steve: I feel like somebody worked hard on those acronyms.
Kwin: The funny thing is I do this stuff every single day, and if you
actually ask me right now to tell you what those acronyms are I'm
not sure I can get it right.
Phil: I'm super interested in hearing what the next
generation of codecs looks like for real-time video.
Obviously, I think Cisco demonstrated-- Maybe it was Webex, demonstrated real-time AV1 encoding for real-time communication purposes.
I think that was at Big Apple Video last year, I
think it was. Obviously I
see more discussion of new codecs coming into WebRTC.
I assume the codec will vary, it's just
99% H264 right now.
But where's that going, do you think?
Kwin: Right now in WebRTC it's 99%-- Probably not 99%, but it's a high percentage VP8. That's because of Google's influence on the implementation of WebRTC and Google's preference for VP8. WebRTC does now support both H264 and VP8, but in Chrome, for example, the VP8 encoder pipeline is actually better than the H264 encoder pipeline.
Phil: In terms of visual quality, or just faster?
Kwin: It's mostly better at being tied in at a more effective level to the bandwidth estimation and bandwidth adaptive layers. So in Chrome you can use H264, miss key frames, and end up with video artifacts based on those missed key frames; with VP8 that doesn't happen.
All of this stuff is always a moving target, and on Apple platforms
it's less true that H264 is not as good as VP8.
But if you're concerned about quality generally you're
using VP8 today, and that'll probably not be
true next year, but it's still true today.
There was a big fight in the Standards Committee about
codecs and the eventual compromise was
all standards-compliant WebRTC implementations now have to
support both VP8 and H264.
Which from a developer perspective is actually a great result, we love having access to both.
We can do things like, if we know a call is
between two iPhones we can use H264 because the
battery life implications of H264 are way
better compared to VP8 in a one on one call.
So, it's nice to have those two options.
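That per-call codec choice could look something like this sketch. This is a hypothetical policy for illustration, not Daily's actual code, and the device labels are invented:

```python
def pick_codec(participants):
    """Prefer H264 for a 1:1 call between iPhones, where hardware
    encoding gives much better battery life; otherwise fall back
    to VP8, whose Chrome encoder pipeline is currently stronger."""
    if len(participants) == 2 and all(p == "iphone" for p in participants):
        return "H264"
    return "VP8"

pick_codec(["iphone", "iphone"])   # 1:1 iPhone call -> "H264"
pick_codec(["iphone", "chrome"])   # mixed endpoints -> "VP8"
```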
The next generation of codecs is going to be another fight, because there's AV1 and VP9 and H265, and they offer some really great advantages over the codecs we have now.
But from our perspective, on the network the
only thing we care about is packet loss.
In codecs, the only thing we care about is CPU usage.
Right now CPU usage for the next generation codecs is
prohibitive at anything above very small
resolutions for real-time encoding.
The biggest single complaint we get from developers across all different use cases is "How do I reduce the CPU usage of my calls?" We tell people who ask us about next generation codecs, "Definitely coming, but definitely not coming anytime soon," where "soon" is from a developer, "I'm trying to ship an app" perspective.
Phil: That's absolutely fascinating.
Matt: I, for one, am shocked that codecs were a fight in the standards body.
Kwin: That has never happened before.
Matt: Shocked, I say.
So, I guess some of this has been-- I think in some of the scaling conversations, it sounds like there's a lot of this that can be done just purely peer to peer, with encoding on the client. So, I'd be curious, what are your biggest expenses tied up in all of this? Is that running those STUN and TURN servers, or your traditional infrastructure stuff?
Like, where does that stuff land?
Kwin: For peer to peer use cases we end up paying for about 20% of the
bandwidth because we have to route through our own
servers for larger calls.
The combination of bandwidth and the virtual
instances that host the calls ends up both
contributing to the cost of maintaining the service.
It's interesting for us to see in this year how much growth there's
been in interest in larger calls,
so we used to almost never
get requests for calls with more than 25 or 30 people in them.
Now a lot of our customers are
people who are trying to build 50, 100,
500 person calls.
What we think of as "Hybrid use cases."
Steve: My nightmare.
Kwin: It's a little bit of an engineering nightmare, it's
definitely a little bit of a user experience nightmare, but the
innovation of what people are trying to do on the internet is super interesting. A great example is a fitness class where you want the instructor to have the ability to stream to 500 people: their camera, their mic and their screen. You want the instructor to be able to see
some subset of those people, and you want the people in the
class to be able to see the people--
Their friends that they signed up
for the class with.
You're not routing every media stream to everybody in a 500 person call, that
would be crazy, but you want to be able to really
flexibly route the media streams and turn them on
and off at a moment's notice.
That's something that WebRTC is able to do
from a core building blocks perspective, but it's actually pretty hard to
implement from an API and infrastructure perspective.
It's pushing the edges of what our platform can do, but we have enough
customer pull for it that it's a big focus for us.
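That selective routing idea can be sketched as a per-viewer subscription set. The names and data shapes here are invented for illustration; a real media server tracks this per track, with simulcast layers and on-the-fly updates:

```python
def streams_for(viewer, instructor, friends, roster):
    """In a 500-person class, each participant receives only the
    instructor's stream plus the friends they signed up with --
    never all 500 streams."""
    wanted = {instructor} | (friends.get(viewer, set()) & set(roster))
    wanted.discard(viewer)   # never send a viewer their own stream
    return wanted

roster = ["instructor", "alice", "bob", "carol"]
friends = {"alice": {"bob"}}
streams_for("alice", "instructor", friends, roster)
# -> {"instructor", "bob"}
```

Each client's subscription set stays tiny no matter how large the call, which is what makes the 500-person case tractable at all.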
Phil: The super fascinating thing is, honestly, if you'd asked me 8-9 months ago, I'd have never even thought of that exact use case. Since then I've heard that exact pitch for two very specific use cases. One for, yes, the personal trainer.
Obviously a huge disruption from Covid over
the last 8 months.
I've heard this pitch probably six or seven times
in the last five or six months, but then also this exact
same pitch but for live music.
Where people want to be able to watch a concert but
also see their friends, and then they want the artist who's
performing to be able to see some of the
audience as well, see some of the fans going wild as well as
when their favorite song gets played.
The first time I heard that use case, I was like, "That's wild."
Now it's just every couple of weeks someone wants to do it.
Matt: The third one that's the same plumbing from our perspective is the
virtual conference or virtual networking event where you have a keynote
speaker or a keynote group of presenters, like a panel.
Then everyone attending is at
a virtual table where they can interact in real-time
with six or eight or ten people at the virtual table,
and also get the panel or the keynote.
Steve: I did read an interesting blog post that you or someone on your team wrote around your pricing. I don't know if you'd be interested in adding more detail there, but I thought it was interesting that you guys were taking a different stand on how you charge for the service. Whereas other services charge for every individual connection-- I think that was the detail, every one person to another one person adds to how you're charged-- versus just the number of people on the call.
I thought that was an interesting approach, and I don't know if you would
want to add any technical details behind that?
Kwin: I think as a rule of thumb, if you can simplify how people pay for
something you can make it easier to support new
use cases and experimentation and the growth of
what we're all doing.
So, we tried to figure out how we could come up with as
simple as possible pricing that also on some level
reflected the cost of service so we could stay in business.
And we have a lot of numbers, obviously, about cost of
service and bandwidth use for different kinds of calls.
It turned out to be possible just to
charge based on being in a call.
So if you're in a two person call, it's the
base price times two times the number of minutes.
If you're in a one hundred person call, it's the base price times one hundred
times the number of minutes.
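As a formula, that per-participant-minute model is just a product. The base rate below is a made-up number for illustration, not Daily's actual price:

```python
def call_cost(participants: int, minutes: float,
              base_rate: float = 0.004) -> float:
    """Participant-minute pricing: every person in the call is
    billed the same base rate for every minute they're in it."""
    return base_rate * participants * minutes

# A 2-person, 30-minute call:   0.004 * 2 * 30   = 0.24
# A 100-person, 30-minute call: 0.004 * 100 * 30 = 12.0
```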
That felt like a reasonable compromise for us between simplicity
and scaling of use cases that actually do cost
us more to serve, and it is different from
our little subset of the industry.
Historically, most people have thought about subscribed tracks or forwarded tracks, which is more of an N times N minus one type pricing model.
I think that's partly been because bandwidth used to be more expensive and CPU used to be more expensive, so your variable costs could bite you more as the provider.
As costs have come down with the infrastructure and as I
think we've gotten better at building WebRTC native
infrastructure in the cloud, I think it's possible to
simplify the pricing and lower the pricing.
Our assumption is that pricing is going to come down and simplify further
over the next five years, and it's better to be on the
forefront of that rather than trailing behind that.
Steve: That's great.
Matt: Yeah, that's awesome.
This is, in particular, one of those areas where you
think about traditional online video versus
what we're talking about now, which is this new real-time video.
There are so many similarities, but at the same time pricing, for example, is one that feels so radically different between a traditional video platform and real-time streaming. On a traditional platform so much of your cost is tied up in encoding, particularly if you don't have that many viewers. In real-time it's typically a small number of viewers but everybody's contributing, and almost none of the cost is encoding because all that can be done on the client. So, it's just fascinating.
Kwin: Yeah, that's right.
Our users pay for the encoding, but they
don't pay for the stateful server
connection that we need directly.
We pay for that, so you're totally right.
The cost structure moves around.
I mean, compared to something like HLS, in the WebRTC world we're never going to be able to optimize for quality the same way an HLS stack is going to be able to.
We're never going to be able to optimize for cost quite as well as an HLS
stack, and that's the tradeoff of trying to get to that 200 millisecond
or lower number.
Matt: So, what other what other technology challenges have you run into with real-time?
We've talked a little bit about codec support,
but are firewalls a big problem?
Do you have to get around all those with the stun and turn stuff we talked about earlier?
Any other big things there?
Kwin: Firewalls and network infrastructure in general is
something we worry a lot about.
Firewalls in particular have gotten just massively more open
to real-time media in the last couple of years, so we
rarely see firewalls that are a major issue anymore.
That's really nice.
What hasn't completely improved is scaling on the
client side for larger numbers of people in a
call, so we've done a lot of work to try to understand
exactly how to optimize for Chrome and
Firefox and Safari and Electron.
Our biggest single pain point is "How do we combine the optimizations we're doing for variable network conditions and the optimizations we're doing for variable behavior on the client side in terms of CPU and memory?" Like, a fancy MacBook Pro running Chrome is a pretty different beast than an iPhone 7 running Safari. We have to be cross platform, so trying to manage how many videos are being played out at a time and what the resolution of those videos are is always a moving target for us.
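One way that kind of client-side adaptation can be sketched is a cap on simultaneously rendered videos. The thresholds and device classes here are invented for illustration, not Daily's actual values:

```python
def max_video_tiles(cpu_load: float, device_class: str) -> int:
    """Cap how many remote videos are decoded and rendered at once,
    based on device capability and current CPU load (0.0 to 1.0)."""
    caps = {"desktop": 25, "laptop": 12, "phone": 4}  # rough tiers
    cap = caps.get(device_class, 4)   # unknown devices: assume weak
    if cpu_load > 0.85:               # under pressure: shed tiles hard
        return max(1, cap // 4)
    if cpu_load > 0.6:                # warm: halve the tile count
        return max(1, cap // 2)
    return cap

max_video_tiles(0.3, "laptop")   # healthy laptop -> 12 tiles
max_video_tiles(0.9, "desktop")  # overloaded desktop -> 6 tiles
```

A real client would also drop resolution per tile (via simulcast layer selection) before dropping tiles entirely.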
Matt: So we talked a little bit about scaling one call
to a large number of members in that call, but
what about the examples where you have a few people,
like say a panel and a talk, and those then want to
broadcast to a large number of people who don't want to be in the call-- They want to be viewers. It's a view to view broadcast. What does that infrastructure look like for y'all's workflow?
Kwin: That's an increasingly big use case for us.
We have a lot of customers who really want to be
able to do what you called "View to view broadcast," which was a great term and not
one we've used before, but we're going to borrow it now.
So the challenge is "How do we have that really great small
call experience, but then at relatively low
latency, make that available to a much larger audience?"
And the answer today is a bridge from
WebRTC to something like HLS, so
we've tried to build that bridge so that our customers can
use both our APIs and Mux, for example, or YouTube
live or Facebook live.
That does end up requiring a couple of
transcoding steps today, so we take the
small call and on our media servers
that are routing the media packets, we run a
compositing and encoding pipeline.
We decode all the media, we lay it out into a single frame and combine the audio tracks, run that through an RTMP output stage, and send it to an RTMP ingest URL provided by the customer.
That ends up working pretty well from
a standards perspective, but it's really disappointing from a core
engineering perspective because we'd much rather--
We know you, for
example, at Mux are going to take that RTMP and you're going to do a much better job
of transcoding it, so we'd really love to figure out how
to hand off the WebRTC tracks in a much
lower level way to people who are experts at
HLS and have a great HLS technology stack, but that remains a
little bit of a pipe dream today.
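The compositing step described above boils down to laying N decoded streams into one output frame. A minimal grid-layout sketch; a real pipeline would also handle aspect ratios, simulcast layers, and audio mixing:

```python
import math

def grid_layout(n: int, width: int = 1280, height: int = 720):
    """Place n participant videos into a rows x cols grid and return
    each tile's (x, y, w, h) within the composited output frame."""
    cols = math.ceil(math.sqrt(n))
    rows = math.ceil(n / cols)
    w, h = width // cols, height // rows
    return [((i % cols) * w, (i // cols) * h, w, h) for i in range(n)]

# 4 participants -> a 2x2 grid of 640x360 tiles
grid_layout(4)[3]   # -> (640, 360, 640, 360)
```

The composited frame then feeds the encoder and the RTMP output stage described above.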
Matt: Fascinating. This has been amazing, Kwin.
Thank you so much for taking the time to chat with us.
Phil: It's so illuminating.
This is just a world I know so little about.
It's just so cool.
Matt: I feel like I use it all the time and we hear about it all the time, but really digging in is one of those things that-- It's just a different world.
Steve: Just how different the technology is from what you're working with
in the more traditional HTTP streaming side of
things. It's pretty fascinating.
Phil: Never have I been more happy to use HLS and to have HTTP as my fundamental transport.
Matt: It's like "Same," but completely different.
So thanks again, everyone, for joining.
Thank you so much, Kwindla, for this illuminating conversation.
This was really great.
Just a reminder, 2020.Demuxed.com around tickets, donation matching if you want to give, and also we just wanted to explicitly call out our request for topics.
If you have something you want to talk about, get in
touch. If you have something you want to hear about, get in touch.
We can figure out who we can find that might be able to chat about
whatever that thing you have that burning desire to learn about is.
Steve: How does one get in touch?
Matt: @Demuxed on Twitter, or just ping @heff, @phil or @mmcc on the video-dev Slack. Or you can email Info@Demuxed.com.
Phil: You can definitely do that.
Matt: We need to set up a wolf somewhere.
But anyway, thanks again, Kwindla.
This was a fantastic call.
Really appreciate it.
Kwin: Thank you so much.