
Ep. #1, Why Robots Are Hard with Ren Wang
On this debut episode of Lab Notes, Amir Zohrenejad is joined by Ren Wang, a researcher and PhD student at UC Berkeley, to explore the state of physical AI and why robotics has progressed differently from large language models. They discuss data scarcity, world models, simulation, dexterity, and the challenges of building robots that can reliably operate in the real world.
Ren Wang is a PhD student at UC Berkeley advised by Prof. Alexei A. Efros and a part-time researcher at Physical Intelligence. His work focuses on robot learning, generalization, and adaptation, with an emphasis on building systems capable of operating across diverse environments and tasks. He is particularly interested in the intersection of world models, embodied AI, and scalable approaches to physical intelligence.
transcript
Amir Zohrenejad: Cool. Ren, welcome on. Why don't we kick it off? Introduce yourself, tell us what you're working on.
Ren Wang: Sounds good. My name's Ren, I'm a PhD student right now at UC Berkeley. I'm advised by Alexei A. Efros, but I don't actually do any computer vision. He's a big computer vision guy, but I work entirely on robots, robotics. Specifically I'm interested in generalization and adaptation and not attached to any sort of particular tools or ways to get there.
I think there are going to be many ways and I'm interested in researching and investigating all of them. And I'm also a part time researcher right now at Physical Intelligence.
Amir: Cool. Physical automation robots have been around for quite a while and now with neural networks, transformers, there's obviously a lot of new things happening in the realm of physical AI. Just at a very high level, how do you think this new approach to physical AI, of using more generalized models, is going to make robotics different?
Ren: Yeah, that's a really good question. So I think that if you look at the history of industrial automation, if you look at verticals like manufacturing, agriculture and so on, typically you need high robustness, high repeatability. And these machines are really doing something that is best characterized as kind of dumb. And that makes them very easy to sort of program and to control. But they're doing something that's like highly repeatable in a very narrow range.
And what this learning based approach through what you characterize as neural networks and transformers is actually going to enable us to do is to do basically a much wider range of things.
So diversity is sort of the name of the game here. You're no longer looking at very narrow tasks in very narrow domains. Now you have one model that can basically transfer across different robots, different embodiments, and do completely different things.
Now that is, I will say, going to potentially come at the cost of lower repeatability. So I think that's sort of the trade off that is in the mind of a lot of researchers nowadays is, you know, obviously we would like things to be highly repeatable, but what exactly is that Pareto frontier sort of going to look like when this approach sort of proliferates in the industry?
Amir: When we say repeatability, do you mean like the precision at which it happens or like repeatability over-- Like if I get a generalized robot and I tell it to, you know, wash the dishes once, I assume it can do it again, right?
Ren: Yeah. So you'd really love that. If we get the state of the art models right now to do that, it's not actually going to necessarily be able to reproduce that successful like dishwashing episode more than once.
So by repeatability I simply mean doing the thing that it was prescribed to do time and time and time again. And typically in industrial automation you're talking about like five nines of repeatability.
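To put a rough number on "five nines": the sketch below reads it as a 99.999% per-attempt success rate and assumes attempts are independent. The 90% comparison figure is a made-up value for contrast, not something quoted in the conversation.

```python
# Illustrative arithmetic only: "five nines" read as a 99.999% per-attempt success rate.
def expected_failures(success_rate: float, attempts: int) -> float:
    """Expected number of failed attempts, assuming independent attempts."""
    return (1.0 - success_rate) * attempts

FIVE_NINES = 0.99999
HYPOTHETICAL_POLICY = 0.90  # assumed figure for contrast, not from the episode

ATTEMPTS = 1_000_000
industrial_failures = expected_failures(FIVE_NINES, ATTEMPTS)        # roughly 10
learned_failures = expected_failures(HYPOTHETICAL_POLICY, ATTEMPTS)  # roughly 100,000

print(f"five nines:          ~{industrial_failures:,.0f} failures per million attempts")
print(f"hypothetical policy: ~{learned_failures:,.0f} failures per million attempts")
```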
Amir: Okay, so for folks like, who are looking at the progress of LLMs, just like looking for where we were with the first version of ChatGPT, where it was like, okay, it can answer a few questions to now where these agents can do long horizon tasks in the span of two, three years and do abstract math.
For a lot of people who are not familiar with the physical AI space, it might feel like, okay, why is this taking so long? Why is it moving so slowly? What is it that makes physical AI hard? Is it like we don't have enough training data the same way we had text data from the web?
Or is it just like the degrees of motion and input and output, because they're continuous and there's so many different things and joints and motions and whatnot? Is that what makes it hard, because the input and output space are so big? Like why is physical AI so hard compared to LLMs?
Ren: Yeah, that's another great question. I think you've hinted at a lot of the reasons why physical AI is objectively behind the state of LLMs and these agentic systems that are out in the wild and bringing a lot of economic value. We're absolutely not there yet for physical AI. And I think there are a number of reasons. The most quoted reason is lack of data.
And if you look at the history of language models and vision models as well, what really led to a proliferation of these models is Internet scale data. And when you ingest all that data and you combine it with copious amounts of compute, you get these super performant models, but that data is simply just not available for robotics.
First of all, nobody is really uploading tons of videos of their homes and doing sort of for--
Amir: We have YouTube though, right?
Ren: We do have YouTube. I think the issue there is a lot of times the data is very noisy. It's very unclean. It's also, if you just think about the sheer amount of text and then other visual data that's available on the Internet, as much as YouTube is, it's still not enough.
And there's an embodiment gap now because we're not looking at a video and directing a human to go do this, we now need to go train this, this robot that's potentially of a totally different embodiment. As you mentioned, the input and output spaces are also significantly different.
And so there's just no easy way right now to get data at a scale that would make sort of the people in LLM land, or vision model land very happy. And I think that's definitely a very big contributor. But actually there are a lot of sort of more subtle reasons for why physical AI is scaling more slowly.
And I think one big thing is going back to this idea of repeatability. The tolerance for failure in physical AI is much lower. If I ask ChatGPT a question, I am happy with the gist of the question being understood and the answer being approximately close to maybe something that I'm already thinking along the lines of, whereas in physical AI, a millimeter slip in the gripper of the robot that leads to the dish that I'm washing to crash to the floor and break, that's not something that is sort of an acceptable outcome.
So the line between success and failure is much, much thinner in physical AI. And so, by nature that just means you need to be really sure that your system is super, super robust before you actually do in the wild deployments.
Amir: And like, in terms of that data space, that input space, output space, what is-- And this might be different in different architectures and we can kind of go into those. But what is the output space? Is it just like commands that go to certain joints and for certain motions and continuously?
What is the input space? Is it video plus text and like the goal that the model has and the robot has? Can you walk us through just like a little bit what those are?
Ren: Yeah, for sure. So this is a really interesting topic because up until very recently there was sort of one dominant architecture and I think that's starting to change now. We can go into that a little bit, but maybe I'll just talk about sort of the prototypical architecture.
The way that these models are typically operationalized is through something called a vision language action model or VLA. You might be familiar with VLMs and LLMs. This is basically adding sort of action to part of the model's vocabulary. So the idea was to basically take, typically, a pre-trained vision language model. So this has been trained on sort of Internet-scale images and text.
Amir: Is that a single model or is it like three different models for the vision, language, and action?
Ren: Yeah, typically a single model. And it can't be too big because it has to run basically real time on a robot. So typically this would be something less than let's say 10 billion parameters.
So you take this pre-trained model and then you basically reappropriate part of the model's vocabulary to be action tokens. So let's say the model was outputting some weird ASCII characters before. So tokens that basically weren't very often used.
Amir: Yeah. How big is the dictionary for action tokens roughly? Compared to a normal language, is it just like "move hand" or is it like actually like the motion that is going to be--
Ren: So, so it's actually--
Amir: A few examples of the tokens like, or like a string that comes out of the VLA.
Ren: Yeah, so these are basically actual like actions. You know, they're not abstracted into motion primitives like move left, move right, move down. But basically you can imagine much in the way that we have trained sort of separate language tokenizers even before you can train an LLM, we also have basically action tokenizers that you can train independently.
You can also train them jointly, but there are different ways of doing this. But typically you'll basically train a tokenizer beforehand, tokenize the actions, and then that's sort of the input to the model. So the model, the VLA now can recognize them because you've overwritten basically part of the VLM's language vocabulary to represent these action tokens.
And so, you know, you feed basically the current image, let's say some language command for what you want to do, like "pick up the banana and put it in the bowl" or something like that, maybe a small history of previous actions. And then the VLM will process all of this information, it'll output some additional tokens.
And typically these tokens will then go into what's called an action expert. So you can think of this as a relatively much smaller model, something like 10x smaller, that sits on top of this VLM backbone and is basically tuned to output the correct action tokens that are required to do a particular task.
And you know, those action tokens are basically decoded into-- You can, there are various control modes on, on the robot, but as you alluded to earlier, you can basically just control the joints of the actual robot. And so for one arm that might be six or seven joints, for like two arms it's more. And then for a full humanoid it's like a hundred and something joints.
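The idea of overwriting rarely used vocabulary entries with action tokens can be sketched in a few lines. This is a minimal illustration, not the tokenizer any particular VLA uses: Ren describes tokenizers that are trained, whereas the sketch swaps in simple uniform binning. The vocabulary size, bin count, and normalized action range are all assumed values.

```python
# All constants are hypothetical choices for illustration.
VOCAB_SIZE = 32_000                    # size of the pre-trained VLM vocabulary (assumed)
NUM_BINS = 256                         # action bins that overwrite the last 256 token ids
ACTION_LOW, ACTION_HIGH = -1.0, 1.0    # normalized joint command range (assumed)
FIRST_ACTION_ID = VOCAB_SIZE - NUM_BINS

def encode_action(value: float) -> int:
    """Map one continuous joint command to a token id in the reserved range."""
    clipped = min(max(value, ACTION_LOW), ACTION_HIGH)
    scaled = (clipped - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)  # now in [0, 1]
    bin_index = min(int(scaled * NUM_BINS), NUM_BINS - 1)
    return FIRST_ACTION_ID + bin_index

def decode_action(token_id: int) -> float:
    """Map a reserved token id back to the center of its bin."""
    bin_index = token_id - FIRST_ACTION_ID
    width = (ACTION_HIGH - ACTION_LOW) / NUM_BINS
    return ACTION_LOW + (bin_index + 0.5) * width

# One command for a seven-joint arm, as in the "six or seven joints" example.
joint_command = [0.10, -0.52, 0.98, 0.0, -1.0, 0.33, 0.75]
tokens = [encode_action(a) for a in joint_command]
recovered = [decode_action(t) for t in tokens]

print(tokens)
print([round(r, 3) for r in recovered])
```

Decoding only recovers each command to within half a bin width, which is one reason the choice of tokenizer matters when a millimeter of slip can decide success or failure.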
Amir: Okay, and so you talked a little bit about VLAs and input space and output space. Obviously folks are very excited right now. Jim Fan of NVIDIA is giving talks that VLA is dead. Is it dead? And world models are coming, are basically where all the excitement and the research is going to be. Or no, do you think that VLA-based robots are still the ones that are gonna be productized and used more in the wild?
Ren: Yeah, so I think there's a research answer and then there's like a maybe economic answer. So I think from research perspective, you know, if I was working primarily on VLAs, I would be worried for sure. I think that, you know, what are called world action models, world models--
Amir: And maybe like talk about how those are different from VLAs.
Ren: Yeah, for sure. So I think Jim Fan gave a great talk and basically illustrated Dream Zero very, very well. I find Rota AI's approach and their blog posts illustrating that approach to be a bit more intuitive and understandable.
So the basic idea is in the natural world that is described by video, there is a lot of sort of inherent information about physics. So when I watch a video, if I drop a cup, I expect gravity to take over and sort of pull it to the floor or, you know, a car that's moving very, very quickly has a lot of momentum.
So they're sort of these intuitive physics concepts that are implicit in video. And so what these world action models, world models are doing is basically training a video generative model strictly on video to capture a lot of these kind of intuitive concepts about physics and sort of the state and evolution of the natural world.
And the technology to train these video generative models is extremely mature. So we have a good understanding of how to basically model images and a sequence of images.
So once you've trained basically this huge video generative model and you can now accurately predict, let's say, when my robot is going to pick up a banana and put it on a plate, what that is actually going to look like, then what I can do is I can take the video frames and I can train what's called an inverse dynamics model.
So an inverse dynamics model basically says between this frame and the next frame, what is the actual action that the robot is taking. And the supervision for that needs to come from robot data. But now, because I have a super strong video generative backbone, I don't actually need that much robot data to train a very robust inverse dynamics model.
I can use sort of the little bit of robot data that I have to train something that's very good based off of kind of the output video that comes from my generative model.
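A toy version of the inverse dynamics step may help. The sketch below assumes a one-dimensional "frame" (just a position) and a hidden linear dynamics rule; real systems work on images with a learned network, so every name and number here is hypothetical. It shows only the shape of the idea: fit a mapping from adjacent frames to actions on a small teleoperation log, then apply it to frames produced elsewhere.

```python
import random

random.seed(0)
TRUE_GAIN = 0.05  # hidden toy dynamics: next_frame = frame + TRUE_GAIN * action

def teleop_episode(steps: int):
    """Hypothetical teleoperation log of (frame, next_frame, action) triples."""
    frame, log = 0.0, []
    for _ in range(steps):
        action = random.uniform(-1.0, 1.0)
        next_frame = frame + TRUE_GAIN * action
        log.append((frame, next_frame, action))
        frame = next_frame
    return log

def fit_inverse_dynamics(log):
    """Least-squares fit of action ~ w * (next_frame - frame)."""
    numerator = sum(a * (nxt - cur) for cur, nxt, a in log)
    denominator = sum((nxt - cur) ** 2 for cur, nxt, _ in log)
    w = numerator / denominator
    return lambda cur, nxt: w * (nxt - cur)

# A small amount of robot data is enough for this toy model.
idm = fit_inverse_dynamics(teleop_episode(50))

# Stand-in for frames "dreamed" by a video model; here they are written by hand.
generated_frames = [0.00, 0.02, 0.05, 0.04]
actions = [idm(cur, nxt) for cur, nxt in zip(generated_frames, generated_frames[1:])]
print([round(a, 3) for a in actions])
```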
Amir: And so at inference time, then basically the robot will look at something, see the video, kind of see it, dream a few seconds into the future, and then basically the token that will come out will be what action it has to take for that kind of future to occur.
Ren: Right.
Amir: Okay. How far into the future does it look like?
Ren: That's a great question. You know, there's--
Amir: It's like order of magnitude. Is it like--
Ren: Order of magnitude is seconds.
Amir: Seconds. Okay. It sees a few seconds of the future. Okay.
Ren: I think there are better parameterizations for this. You know, there are many tasks through which you do want to be aware of the stage that you're in and what the sequence of things that you then have to do are.
So you could feasibly dream, you know, a few sort of steps for each of those individual stages and keep them in sort of memory and work that way. But that's not how typically things are done now, to my understanding.
Amir: Got it. So walk through like a single example of like-- So the robot's given a command and let's say it's using world model, so it has that one command as like the end goal that it wants to get to.
And then it is continually looking at the video around it, processing it, 1 second, 1 second, 1 second in the future, and then action, action, action. Those get translated into movement, joint movement, joint movement, joint movements. That's basically how it's going to work?
Ren: Yeah, that's entirely how it works. Yeah.
The difficulty there is you need a video generative model that is good but also fast. And you know, in AI those two are often conflicting.
Amir: Fast at inference time.
Ren: Fast at inference time. That's right.
Amir: Does the quality of it matter? So can it be kind of blurry and it's still fine? Or like, is that, I assume that's tied into how fast it is at inference time? The quality of--
Ren: Yeah.
Amir: Okay, yeah.
Ren: For sure. I would say that if you looked at, for example, the outputs of Dream Zero or Rota, they're extremely good. I would say they're not that blurry. So the engineering that they've done to get these things to be milliseconds of inference time and not blurry, is very impressive.
Amir: Okay. And I want to kind of like dig in on the data side. So in terms of the training data that you need for VLAs and world models. So world models, like, so I understand it's a video, it's a video generative model.
Ren: Right.
Amir: But do you need any additional labeling that has to happen on that data? Like the actions that have to happen for your training data, you need the actions between the first one and the second one and what happened? Like, do you use another kind of AI model, ML model right now to do that?
Or a human goes and just writes, "arm moved right, arm moved right," haha, for like the next 5,000 frames? Or how does that work?
Ren: So typically you can just buy copious amounts of video data and download it from the Internet and so on. So this is basically a self supervised approach that doesn't require any data labeling.
And that's one of the advantages that world action models have is that the bulk of their learning, you could argue, comes from freely available data, Internet scale data that, you know, LLMs and VLMs needed to get to where they are today.
And then the robot data you can collect via, you know, all the ways that the VLA folks have been collecting, typically by teleoperation. So you can just teleoperate a particular sequence and now you do have between adjacent frames what the robot action actually was.
Amir: Okay.
Ren: So you can train that second component, the inverse dynamics model, basically off of that data. So you know, you can get the video generative model that you've trained separately to ingest the first robot video frame, generate the rollout of the actual robot doing that task, and then use the actions that you got from the teleoperation thing to train that inverse dynamics model that predicts between adjacent generated frames.
Amir: So imagine a factory or somewhere. You've collected egocentric data of like what a human is seeing as they're moving their hand doing something. But imagine you bought that data. That's not something like say you collected at the lab yourself. How does that robot stuff, like, does someone have to go look at that and wear something and try to mimic what that person was doing? How do you add that data back in if it's just video that was recorded from a human?
Ren: Yeah. So I think there are various schools of thought around this. One is you can use it as it is and the human is basically just another form of a robot, another kind of embodiment. So you just throw it in the mix. You don't treat it with any kind of special consideration.
And the supervision that is inherent in that video will just help all of your other robots. I think specifically for the world action model case, it's okay that this is like egocentric human video and not of a robot because it's doing the thing that sort of the robot will also do. But you do need on-robot data at some point.
So the egocentric video can help you, but it's not going to basically zero shot, let's say, transfer to the robot. You also need some amount of robot data of the robot doing the thing that the human is doing.
Amir: Got it. How are like physics engine, gaming engine simulations-- Are they used a lot right now to get training data? How are they used? What are the challenges with that? Why can't we just take Unreal Engine and just have people do all the things in Unreal Engine and then just collect their motion or skeleton or something and then their joints and then feed that into the robot?
Ren: Yeah. I think there are a lot of people who are very bullish on simulation and very bullish on that exact idea of sort of projecting the world into simulation. I think that will be part of the solution, if we talk about things more generally and just look at what it can and cannot do in sort of the history of simulation.
I think, you know, if you look at robotics in the world now, that really works and is being deployed sort of at scale and it's no longer a research question and just a question about like scale of operations, I think the only successful example besides the Roomba, which, you know, is arguably not really a robot, is Waymo and autonomous driving.
And I think what allowed them to really get to where they are today is simulation because there are so many long tail events in self driving that you simply would not experience. I mean if you have a lightning strike once every 15 million miles and you know, your total data set size is maybe only 5 million miles, then how do you even, you know-- You don't even have one data sample to actually model how your car should behave when it gets hit by lightning.
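The lightning example can be worked through with the numbers Ren gives. The only added assumption in this sketch is that strikes are independent rare events, so the count in a data set is roughly Poisson distributed.

```python
import math

EVENT_RATE_PER_MILE = 1 / 15_000_000   # one lightning strike per 15M miles (from the example)
DATASET_MILES = 5_000_000              # miles of logged driving (from the example)

expected_events = EVENT_RATE_PER_MILE * DATASET_MILES
# Assumption: strikes are independent and rare, so the count is roughly Poisson.
p_at_least_one = 1 - math.exp(-expected_events)

print(f"expected strikes in the data set: {expected_events:.2f}")   # 0.33
print(f"chance of seeing at least one:    {p_at_least_one:.0%}")    # 28%
```

Under that assumption the data set most likely contains no example of the event at all, which is the gap a simulator can fill.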
So having access to extremely good simulators that can capture the behavior of the world and put your car into that world and test all sorts of edge case scenarios was hugely valuable to them. And I think there is similar interest in robotics right now to use simulation for sort of a similar purpose to scale up training data and see a lot of these edge cases that you typically will not see in the real world.
I think there are, there are a number of challenges though. So firstly you'll, you'll hear this term of sim-to-real gap. So there's a difference between kind of simulation and reality. And I think there are a number of types of the sim-to-real gap.
So the first one is basically just the difference in appearance of the actual scene and the physics of the simulator compared to the real world. So you know, these are like optics, lighting differences, friction differences, materials behave differently in simulation than in the real world. You know, this is sort of an active area of research. It's been an active area of research for like 60 plus years.
If you look at sort of graphics people when they started working on simulators, trying to you know, model clothing and how clothing deforms very accurately, it's by no means a solved problem at all. Contact in simulation is still very difficult to get right. So there's the sim-to-real gap on sort of the physical level.
There's also a gap though, and this is actually in my opinion a bigger issue in the diversity of what simulation offers versus the diversity of the real world. So if you care about one particular scene and getting let's say 10,000 robot demonstrations on one particular scene, I think simulation is extremely good for that.
But if you want to generalize beyond that one scene, you're going to have to basically spin up another environment. You're going to have to design new assets for that environment. You're going to have to basically tune your simulator to match that environment that you now want to go deploy in.
Amir: The first problem I understand. So like the physics that we have in the simulator are different from the physics in the real world. But the second one, is it bottlenecked by humans having to create those things? Like, or can we not get an AI or something to basically just spin up as many as we want very quickly or--?
Ren: Yeah. And if you ask a VLM to do this, you will be surprised by just the general lack of diversity, lack of creativity that it has in creating these things.
I mean, I think there is truly nothing that is higher entropy than the real world. I think diversity and chaos is really what you need and what the simulator gives you is clean, controllable repeatability.
So you know, there, there are nice elements there and it's very useful for algorithmic research. But if you're talking about deploying robots in the wild, getting training data that reflects basically the real world, there is still a limitation there.
Amir: Okay, I understand that physical AI is different from large language models, but if we want to take the large language model kind of history, the GPTs got better. And finally the breakpoint was really with ChatGPT, because as far as I know, they were able to go say, okay, we're going to take this kind of language model that's gotten pretty good at next token prediction. And then they did RL, reinforcement learning, with human feedback just for the chat scenario.
And lo and behold, chat was useful enough that it could get, got good enough and really kicked off this data flywheel of like now people are using it and they can collect more data and then like give thumbs up, thumbs down.
So overall, is there a sense or is there an idea or do you believe that with physical AI as well, we'll get to a certain point where the robot will be good for a certain task that is widely used enough that will then-- So we'll get good enough. And then with some RLHF of different sorts, whether in a simulator or in a controlled environment, for one task, it will get good enough that we'll kick off a data flywheel?
Ren: Yeah, so I think what you described is in my head, in many ways, like the only path to actually realizing physical AI. It will not happen in a simulator though. So if we come back to simulator land for a second, you know, it is true that RLHF enabled GPT-3 to go to GPT-3.5. And ChatGPT was basically helped by this particular approach.
And you could say the same for the current wave of agentic systems as well. Basically there were really, really good environments for software development and so on that people built. And then you could just do RL at scale with these language models inside these environments.
And if you look at the domains where this approach has been particularly successful, in particular coding and math, all of these domains share this property of being verifiable, which means that you can know very easily, very quickly if your answer was right or wrong. There is a clear right or wrong answer.
And that's not necessarily the case with physical AI for a couple of reasons. One is these simulators are quite bad. So it's often the case that you can't actually even tell what the "correct physics" actually was. If I slide a book across a table, how fast should it actually be moving? And that's dependent on the friction of the table, the book, the mass of the book, and so on, all of these variables.
And you can get them wrong slightly. And I can still look at this thing and say, "hmm, hm. That's approximately correct." But that's not good because now you're compounding your-- You're basically teaching your model incorrect physics.
And the second thing is, I think in physical AI, you know, earlier I spoke about how the line between success and failure is very thin, but in many cases it can also be very thick. You know, maybe I like my steak cut a particular way. Maybe I prefer the fork set to the right of the plate as opposed to the left.
There are a lot of nuances to how I do a particular motion that is very difficult to quote, unquote, verify. There's no correct or wrong way to go about doing many of these tasks. And so verifiability, I think, is what makes this data flywheel that you talked about very difficult to realize in simulation.
Now, the reason why I say I think it's the only viable way to actually scale and realize physical AI is because you should absolutely do this in the real world. So you can get a robot that is somewhat performant on a number of tasks. Let's say 10 tasks.
You choose one of those 10 tasks that you care very deeply about. You go to the customer's domain and you basically deploy that robot. And then you get, you know, an operator to either remotely, or in person, correct the robot every time it makes a mistake. And if you do that enough, it will overfit that one specific task and basically get to, you know, five nines of repeatability.
Amir: And when, right now, I'm assuming some of this is happening right now. When you do that, like when you go and try to do it. Is it actually RL in the sense that, like, you're updating the weights and biases of the model? Or is it some form of like context engineering where it's learning to like, you know, not move it or is it basically, you know, similar to basically in context learning or test time compute or something?
Ren: Yeah, yeah. So I think definitely not the latter. But oftentimes also not the former. You know, the ideal form of this is to actually use, you know, true reinforcement learning and by true reinforcement learning, I mean, you know, either value based learning or some sort of policy gradient kind of approach. There are different schools of thought around which one of these is actually going to work at scale, but--
Amir: Something that changes the weights.
Ren: Yeah, exactly. Oftentimes when people say they're doing this, they just mean they're collecting a data set of failures and corrections. And then you go back and you basically take this data set, you filter out the edge cases, the bad data, and then you throw it back into the data mix that was used to train the model in the first place. And then you do another sort of round of training.
So it's, it's as opposed to RL, which I would characterize as, or online RL where you, you are learning from sort of every immediate sample and adapting to kind of the immediate change in context that you see, you are doing this sort of offline approach where, you know, it's much closer to just sort of traditional behavior cloning.
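The offline loop Ren describes (deploy, log corrections, filter, add to the data mix, retrain) can be sketched with a toy policy. Everything below is hypothetical: the "policy" is a lookup table standing in for behavior cloning, the operator is a simple rule, and the filter drops one sample by a made-up criterion.

```python
def operator_action(obs: int) -> int:
    """Stand-in for the action a human operator would consider correct (toy rule)."""
    return obs % 3

def train(dataset):
    """Behavior cloning in miniature: memorize the labeled action per observation."""
    table = dict(dataset)
    return lambda obs: table.get(obs, 0)  # unseen observations fall back to action 0

def success_rate(policy, observations):
    hits = sum(policy(o) == operator_action(o) for o in observations)
    return hits / len(observations)

deployment_obs = list(range(10))
data_mix = [(o, operator_action(o)) for o in range(4)]  # original training data
policy = train(data_mix)
before = success_rate(policy, deployment_obs)

# Deployment round: log a correction every time the operator overrides the robot.
corrections = [(o, operator_action(o)) for o in deployment_obs
               if policy(o) != operator_action(o)]
# Filtering step: drop samples that are not trusted (made-up rule: obs 8 was a bad recording).
filtered = [(o, a) for o, a in corrections if o != 8]

data_mix = data_mix + filtered   # throw the corrections back into the data mix
policy = train(data_mix)         # another round of offline training
after = success_rate(policy, deployment_obs)

print(before, after)  # 0.6 0.9
```

Note that no weights change while the robot is running; the improvement only shows up after the next training round, which is the distinction from online RL drawn above.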
Amir: Yep. And is there like right now, is there in context learning, like, can like the robots, the initial, at least the ones, can you show it something and immediately learn to go do that the same way you do, is that like part of the features, for lack of a better word?
Ren: Yeah. So I think there are different versions of this. So if you look at what Physical Intelligence recently released in their PI 0.7 demo, you can actually coach a robot to go do a task that it's never, never seen before just via natural language. Basically you can tell it, you know, first do this, then do this and it'll actually go and do those actions and that is a form of sort of in-context learning. But the limitation there, of course, is that you need to be able to describe in natural language what it is that you want the robot to do.
Amir: Absolutely.
Ren: Whereas like, if I want to, you know, show it a dance, or I want it to write a letter in a particular way in cursive or something, it's much more difficult. And in that case what I would like to do is I would like to show the robot physically how I would do this once and how I'd do it in the future. So what was really exciting to me about the Rota AI release was they did show something like this.
So they were able to take a whiteboard and write out a character and basically have the robot look at that and go and do the same thing. Now it's unclear, you know, how they're facilitating in context, whether or not it is in context and all of this. But certainly if you look at the world action model approach that I mentioned earlier, you could imagine the robot basically has this human video in its context and basically generates sort of the robot video based off of that and then goes and does it.
Amir: In terms of the kind of first use cases where you think the data flywheel might take off-- Because I imagine it does have to be something that's high value as well that's being automated. Is it factory automation? Is it in home? Do you have any kind of like views of what you think would be good first use cases where the kind of data flywheel might take off, or where people might be using robots for more restricted tasks in more restricted environments?
Ren: Yeah. So definitely not the home.
Amir: Okay.
Ren: I think I have very strong opinions on robots and homes and what it would take to get there and we can get into all that. But I think right now we are already seeing these so-called deployment-first companies in manufacturing. A lot of times these are warehouse packaging tasks, what's called palletization or depalletization and they're implementing this exact sort of approach that I described.
Amir: Okay. Given that this is video and let's see, very large amount of data and the latency, so latency sensitive, the actions that have to come out. I generally used to think that a lot of this has to happen on device or on chip or at least like in the same location that the robot is.
But I remember when we were talking recently and we were saying no, at least for now, a lot of this is still going to happen on the cloud, like the processing. Can you just like walk us through like why? Is it because the GPUs are so heavy and expensive, so the models are probably pretty, pretty lightweight. Does that even scale or like if you have a factory and you have a bunch of robots, are you going to have the GPU right there with them or is it really streaming somewhere else in someone else's cloud and coming back?
Ren: Yeah. So I think invariably you will almost always need to do some sort of on-device inference. Now whether or not this inference necessitates a GPU is going to be kind of a future question.
But definitely, you know, because the robot needs to do things in real time, you need to actually have something on device that is going to be producing those actions. What we're seeing now, at least in research, is that if you optimize the inference stack very well, you can get something like 150 to 300 milliseconds of latency just via streaming from the cloud.
So this is typically done through, let's say, something like Modal, which Physical Intelligence uses to serve their models. All of that inference happens in the cloud, and then you stream the actions to the robot, and it does wonderfully.
But if you go out in the wild, you go to these factory floors where there's no Internet a lot of the time, or you have a robot dog that's running somewhere out in the wild, it will need some sort of on-device inference. So I think once we start seeing more things out in the wild, you almost always will need on-device inference.
Now the shape of this could be very different. I don't think it's necessarily an either-or relationship. You could imagine a heavy-duty model that runs in the cloud, that does a lot of the planning, that has larger context and is better able to orchestrate one or many low-level policies that run on device and are actually responsible for the next second of actions or something like that.
And I think that's probably the ultimate form of what we're going to see: some sort of hybridization of this cloud-based and on-device approach.
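The hybrid arrangement Ren describes can be sketched as a small control loop. This is an illustrative toy, not how any particular company or lab serves its models: `CloudPlanner`, `OnDevicePolicy`, `run_control_loop`, the `replan_every` interval, and the string-valued subtasks and actions are all hypothetical names and simplifications introduced here. It only shows the division of labor: a slow, occasionally unreachable planner picks the subtask, while a local policy produces the next few actions on every tick regardless.

```python
from typing import List, Optional, Tuple


class CloudPlanner:
    """Hypothetical stand-in for a large planning model served remotely."""

    def __init__(self, online: bool = True) -> None:
        self.online = online

    def plan(self, observation: str) -> Optional[str]:
        # Returns None when the network is unavailable (e.g. a factory floor
        # with no Internet), so the caller must cope without a new plan.
        if not self.online:
            return None
        return f"subtask-for-{observation}"


class OnDevicePolicy:
    """Hypothetical stand-in for a small low-level policy running on the robot."""

    def act(self, subtask: str, observation: str, horizon: int) -> List[str]:
        # Produces a short chunk of near-term actions for the current subtask.
        return [f"{subtask}:{observation}:t+{i}" for i in range(horizon)]


def run_control_loop(
    observations: List[str],
    planner: CloudPlanner,
    policy: OnDevicePolicy,
    replan_every: int = 3,
    horizon: int = 2,
) -> List[Tuple[str, List[str]]]:
    subtask = "hold-position"  # assumed safe default before any plan arrives
    log: List[Tuple[str, List[str]]] = []
    for tick, obs in enumerate(observations):
        # Ask the cloud for a new subtask only occasionally.
        if tick % replan_every == 0:
            proposal = planner.plan(obs)
            if proposal is not None:
                subtask = proposal  # otherwise keep the last known subtask
        # The on-device policy acts on every tick, online or not.
        log.append((subtask, policy.act(subtask, obs, horizon)))
    return log


if __name__ == "__main__":
    for entry in run_control_loop(["a", "b", "c", "d"], CloudPlanner(), OnDevicePolicy()):
        print(entry)
```

With the planner offline, the loop still emits actions every tick under the default subtask, which is the point of keeping a policy on the device.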
Amir: Got it. And so the video streams to the cloud and then right now video does not come back. Right?
Ren: Video does not come back.
Amir: Okay, so what comes back is just the actions that the robot has to take. And the action tokenizer, is that in the cloud as well, or is that on the device?
Ren: So I think either works, but typically the action tokenizer will also be in the cloud.
Amir: There are just a final couple of questions that I had. Obviously there's the robot moving around, locomotion, but with human hands there's always the difficulty of dexterity, and it's such a challenging problem.
Is it just because there are a lot of joints in the hand? What makes it so hard? Is it because it has tactile feedback? I know that on the simulation-to-real side, this is an area where you've been doing some research as well.
So could you walk through what makes human hand dexterity so difficult, what the challenges are, and what some of the approaches are that researchers are looking into and that you have been looking into?
Ren: Yeah, so I think dexterity in many ways is the holy grail.
I think right now in physical AI we have reasonably good models for general action understanding and knowing where to put the robot's end effectors near the object and how they should be oriented to roughly do something. But that last mile to actually get it to do the thing is really challenging. And a lot of times the difficulty lies in, you know, the robot grippers, parallel jaw grippers, simply not being dexterous enough.
Amir: You mean in terms of the hardware not being dexterous, or in terms of the motions that we tell it not being fine-tuned and precise enough? What makes them not good enough?
Ren: Yeah, so there are a couple of things. One is the parallel jaw grippers cannot actually grasp things super well because they operate along one axis. The insides of the grippers are flat, so they can't actually deform around objects and things like that.
So a lot of things that come very naturally to humans, you know, power grasps for like a bottle when you open it, tool use, all of these things are just extremely difficult with parallel jaw grippers.
The other thing is, you know, I think a lot of times the parallel jaw grippers also limit the kind of data that you can use and ingest. So egocentric human video data, for example, is more difficult to get into your robot or into your VLA because the robot's embodiment is very different from, you know, the human hand.
Amir: Why don't we make it five fingers? If the data is there, is it just that the number of joints is difficult, or does it need way more data? Is that why we don't?
Ren: Yeah, I mean, there are two different parts to this question. One is why not just put hands on robots?
Amir: Yeah. If the data is more readily available.
Ren: Yeah. So I think the trouble is they're very brittle. Hands, I think right now, hardware-wise are quite bad. I think this is changing very rapidly. Look at, for example, a recent work from NVIDIA called EgoScale, where they use these 22-degree-of-freedom Sharpa Wave hands.
Sharpa, you know, to their credit, is in my opinion one of the top three current dexterous hand manufacturers, and EgoScale went through something like 100 of these hands over the course of getting that project to the finish line.
And each of those hands costs, you know, on the order of $10,000 to $100,000, somewhere on that log scale. So it's extremely expensive and they break very, very easily. You also can't do very high-friction things or things that involve liquid and so on. So they are a bit limited in their durability and robustness.
Amir: Okay. And then my final question. Right now, in terms of audio perception, which I know is also an area that you've done some research on: there are cues that we get from the environment based off of audio, and audio obviously is stereo. We have two ears, so we can tell things about what's happening around us.
Is that part of the V0 of robots that are being launched, or no? Is it that as long as they can understand natural language commands, that's enough, and then we're going to add audio perception of the environment later on?
Ren: Yeah. So I think voice interfaces are the natural way to go about controlling robots. I think if you have to enter a language command to get your robot to do something, it's just not going to work.
But beyond just like audio as a form of control, I think as a sensory signal, it's extremely useful. It's extremely complementary to vision and to touch. It's helpful in ways that are sort of subtle.
It's not quite like vision, in the sense that you cannot have a robot policy without vision. You certainly can get really good policies without audio, but there's no reason not to include it. I think it's absolutely complementary to the other sensors that are on robots today.
Amir: All right, well, thanks for coming on and I think I learned a lot of new stuff myself, so thanks for taking the time.
Ren: Yeah, of course. Thanks for having me.