Overview of the ZeroWBC framework. We propose a novel framework that learns natural humanoid visuomotor control directly from human egocentric videos. The pipeline takes an initial egocentric image and a text instruction as input (Left), synthesizes human whole-body motion via a fine-tuned vision-language model (Middle), and executes the resulting motion on a Unitree G1 robot using a robust general tracking policy (Right). ZeroWBC enables complex scene interactions such as kicking, sitting, and obstacle avoidance with zero real-robot teleoperation data.
Detailed architecture of ZeroWBC. The framework operates in two stages: (a) Multimodal Motion Generation: a fine-tuned Qwen2.5-VL predicts motion tokens from the initial image and text instruction, and a VQ-VAE decoder converts these tokens into continuous human motion. (b) General Motion Tracking: the generated motions are retargeted to the robot and tracked by an RL-based policy; a curriculum learning strategy of increasing difficulty ensures robust tracking performance.
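To make the two-stage data flow concrete, the sketch below outlines how generation and tracking could be composed. All class and function names (MotionVQVAE, FinetunedVLM, TrackingPolicy, retarget_to_g1, zerowbc_rollout, robot.proprioception, robot.step) are hypothetical placeholders introduced for illustration, not the authors' actual API; the only assumptions carried over from the caption are that a fine-tuned Qwen2.5-VL emits discrete motion tokens, a VQ-VAE decodes them into continuous human motion, and an RL policy tracks the retargeted motion on the robot.

```python
# Minimal, illustrative sketch of the two-stage ZeroWBC pipeline described above.
# Every name here is a hypothetical placeholder; method bodies are stubs.

import numpy as np


class MotionVQVAE:
    """Hypothetical VQ-VAE that maps whole-body motion to/from discrete tokens."""

    def decode(self, motion_tokens: list[int]) -> np.ndarray:
        # Returns a (T, D) array of continuous human whole-body poses.
        raise NotImplementedError


class FinetunedVLM:
    """Hypothetical wrapper around a Qwen2.5-VL model fine-tuned to emit motion tokens."""

    def predict_motion_tokens(self, image: np.ndarray, instruction: str) -> list[int]:
        # Autoregressively generates discrete motion tokens conditioned on the
        # initial egocentric image and the text instruction.
        raise NotImplementedError


class TrackingPolicy:
    """Hypothetical RL tracking policy trained with a difficulty curriculum."""

    def act(self, proprioception: np.ndarray, reference_pose: np.ndarray) -> np.ndarray:
        # Returns joint-level actions that track the retargeted reference pose.
        raise NotImplementedError


def retarget_to_g1(human_motion: np.ndarray) -> np.ndarray:
    """Hypothetical retargeting from human whole-body motion to Unitree G1 references."""
    raise NotImplementedError


def zerowbc_rollout(image, instruction, vlm, vqvae, policy, robot):
    # Stage (a): multimodal motion generation.
    tokens = vlm.predict_motion_tokens(image, instruction)
    human_motion = vqvae.decode(tokens)

    # Stage (b): general motion tracking on the robot.
    reference = retarget_to_g1(human_motion)
    for ref_pose in reference:
        action = policy.act(robot.proprioception(), ref_pose)
        robot.step(action)
```

One plausible reading of this design is that discrete motion tokens let the vision-language model handle motion synthesis as ordinary next-token prediction, while the separately trained tracking policy isolates the low-level control problem.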
[Figure: qualitative results. Each panel pairs an egocentric visual input with a task instruction: "Kick the ball into the goal" (shown twice, with different ball placements and humanoid starting positions); "Avoid the obstacle and walk to the chair" (shown twice, with different obstacle placements); "Walk towards the sofa, turn around and sit down"; "Walk towards the sofa, turn around, sit down and stand up immediately" (different robot starting position and text instruction); "Walk towards the chair, turn around and sit down"; "Squat down and raise your hands"; "Raise arms and punch forward".]