Overview of the ZeroWBC framework. The pipeline takes an initial egocentric image and a text instruction as input (left), synthesizes human whole-body motions via a fine-tuned vision-language model (middle), and executes the resulting actions on the Unitree G1 robot using a general interactive tracking policy (right).
Detailed architecture of ZeroWBC. (a) A fine-tuned Qwen-VL predicts motion tokens from an initial egocentric image and a language instruction; these tokens are then decoded into continuous human whole-body motions. (b) The generated motions are retargeted to the humanoid and executed by an RL-based policy that tracks both the robot reference motions and the interaction-relevant body parts.
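The data flow described in the caption above can be sketched as a chain of four stages. This is a minimal illustration, not the authors' implementation: `run_pipeline` and every `toy_*` function are hypothetical stand-ins introduced here, the frame layout is a placeholder, and the tracking stage is reduced to a per-frame call, whereas an RL tracking policy would run in closed loop on robot state and would also track interaction-relevant body parts.

```python
from typing import Any, Callable, List

Frame = List[float]  # one pose frame; the dimensionality here is a placeholder


def run_pipeline(
    image: Any,
    instruction: str,
    predict_tokens: Callable[[Any, str], List[int]],
    decode_motion: Callable[[List[int]], List[Frame]],
    retarget: Callable[[List[Frame]], List[Frame]],
    tracking_policy: Callable[[Frame], Frame],
) -> List[Frame]:
    """Chain the four stages named in the caption (all callables are assumptions)."""
    tokens = predict_tokens(image, instruction)   # (a) VLM predicts motion tokens
    human_motion = decode_motion(tokens)          # (a) tokens -> continuous human motion
    robot_reference = retarget(human_motion)      # (b) human motion -> robot reference
    # (b) the tracking policy turns each reference frame into a robot action
    return [tracking_policy(ref) for ref in robot_reference]


# Toy stand-ins so the sketch runs end to end; none of these model the real components.
def toy_predict_tokens(image: Any, instruction: str) -> List[int]:
    return [len(word) for word in instruction.split()]


def toy_decode_motion(tokens: List[int]) -> List[Frame]:
    return [[float(t), 0.0] for t in tokens]


def toy_retarget(motion: List[Frame]) -> List[Frame]:
    return [[0.5 * v for v in frame] for frame in motion]


def toy_tracking_policy(ref: Frame) -> Frame:
    return list(ref)  # pretend the reference is tracked perfectly


actions = run_pipeline(
    None,
    "Kick the ball into the goal",
    toy_predict_tokens,
    toy_decode_motion,
    toy_retarget,
    toy_tracking_policy,
)
```

Passing the stages in as callables keeps the sketch independent of any particular model or robot interface; each stand-in could be swapped for a real component without changing `run_pipeline`.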
Task instructions from the figure panels labeled "Visual Input" and "Task Instruction":
Kick the ball into the goal
Kick the ball into the goal (Different from the previous ball placement position and the humanoid starting position)
Avoid the obstacle and walk to the chair
Avoid the obstacle and walk to the chair (Different from the previous obstacle placement position)
Walk towards the sofa, turn around and sit down
Walk towards the sofa, turn around, sit down and stand up immediately (Different from the previous robot starting position and text instruction)
Walk towards the chair, turn around and sit down
Further instructions that appear without a "Visual Input" label:
Squat down and raise your hands
Raise arms and punch forward