ZeroWBC: Learning Natural Visuomotor Humanoid Control Directly from Human Egocentric Video

Haoran Yang1,2,*  Jiacheng Bao2,3,*  Yucheng Xin2,4,*  Haoming Song2,5
Yuyang Tian1,2  Bin Zhao2,3  Dong Wang2,†  Xuelong Li6
1University of Science and Technology of China  2Shanghai AI Laboratory
3Northwestern Polytechnical University  4Tsinghua University
5Shanghai Jiao Tong University  6TeleAI, China Telecom
*Equal contributions.  †Corresponding author.

Overview

Overview of ZeroWBC

Overview of the ZeroWBC framework. ZeroWBC learns natural humanoid visuomotor control directly from human egocentric videos. The pipeline takes an initial egocentric image and a text instruction as input (left), synthesizes a human whole-body motion with a fine-tuned vision-language model (middle), and executes the motion on a Unitree G1 robot using a robust general tracking policy (right). ZeroWBC enables complex scene interactions such as kicking, sitting, and obstacle avoidance with zero real-robot teleoperation data.
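The three-stage flow described above can be sketched as a single inference step. All class and method names below are hypothetical placeholders for illustration, not the authors' actual API:

```python
def zerowbc_step(egocentric_image, text_instruction, vlm, motion_decoder, tracker):
    """One inference pass through the (hypothetical) three-stage pipeline."""
    # (1) The fine-tuned vision-language model predicts discrete motion
    #     tokens from the initial egocentric image and the instruction.
    motion_tokens = vlm.predict_tokens(egocentric_image, text_instruction)
    # (2) The motion decoder (the VQ-VAE decoder) turns the discrete
    #     tokens back into a continuous human whole-body motion.
    human_motion = motion_decoder.decode(motion_tokens)
    # (3) The general tracking policy retargets and tracks the motion,
    #     producing robot actions for the humanoid.
    robot_actions = tracker.track(human_motion)
    return robot_actions
```

The key design point is that only stage (3) touches the robot; stages (1) and (2) operate purely in the space of human motion, which is what lets the system train without robot teleoperation data.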

Method

ZeroWBC Pipeline

Detailed architecture of ZeroWBC. The framework operates in two stages. (a) Multimodal motion generation: a VQ-VAE tokenizes human motion, and a fine-tuned Qwen2.5-VL predicts motion tokens from the initial image and text instruction; the tokens are then decoded into continuous human motion. (b) General motion tracking: generated motions are retargeted to the robot and tracked by an RL-based policy, trained with a curriculum of increasing difficulty for robust tracking performance.
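The discrete motion tokens in stage (a) come from the VQ-VAE's vector-quantization step: each latent motion frame is snapped to its nearest learned codebook entry, and the entry's index is the token the language model predicts. A minimal NumPy sketch, with an illustrative codebook size and feature dimension (not the paper's actual configuration):

```python
import numpy as np

def quantize(latents, codebook):
    """Map each latent frame to the index of its nearest codebook entry."""
    # latents: (T, D) encoder outputs; codebook: (K, D) learned embeddings.
    # Squared Euclidean distance from every latent to every code.
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    tokens = dists.argmin(axis=1)     # (T,) discrete motion tokens
    quantized = codebook[tokens]      # (T, D) quantized latents fed to the decoder
    return tokens, quantized

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 64))   # K = 512 codes, D = 64 dims (illustrative)
latents = codebook[[3, 17, 3]] + 1e-3   # latents lying near known codes
tokens, quantized = quantize(latents, codebook)
```

Because the tokens form a finite vocabulary, a pretrained VLM such as Qwen2.5-VL can be fine-tuned to emit them exactly like ordinary text tokens, conditioned on the image and instruction.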

Real World Experiments

Kick Ball

Each input below pairs an initial egocentric visual observation with a text task instruction.

Input 1 · Task instruction: "Kick the ball into the goal"

Input 2 · Task instruction: "Kick the ball into the goal"
(Ball placement and humanoid starting position differ from Input 1.)

Navigation & Obstacle Avoidance

Input 3 · Task instruction: "Avoid the obstacle and walk to the chair"

Input 4 · Task instruction: "Avoid the obstacle and walk to the chair"
(Obstacle placement differs from Input 3.)

Sit Down

Input 5 · Task instruction: "Walk towards the sofa, turn around and sit down"

Input 6 · Task instruction: "Walk towards the sofa, turn around, sit down and stand up immediately"
(Robot starting position and text instruction differ from Input 5.)

Input 7 · Task instruction: "Walk towards the chair, turn around and sit down"

Text-to-Motion

"Squat down and raise your hands"

"Raise arms and punch forward"