ZeroWBC: Learning Natural Whole-Body Humanoid Interaction from Human Egocentric Data

Haoran Yang¹,²,* Jiacheng Bao²,³,* Yucheng Xin²,⁴,* Haoming Song²,⁵

Yuyang Tian¹,² Bin Zhao²,³ Dong Wang²,† Xuelong Li⁶

¹University of Science and Technology of China ²Shanghai AI Laboratory

³Northwestern Polytechnical University ⁴Tsinghua University

⁵Shanghai Jiao Tong University ⁶TeleAI, China Telecom

*Equal contributions. †Corresponding author.

Overview

Overview of the ZeroWBC framework. The pipeline takes an initial egocentric image and a text instruction as input (left), synthesizes human whole-body motions with a fine-tuned vision-language model (middle), and executes robot actions on the Unitree G1 robot using a general interactive tracking policy (right).
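The data flow described in this overview can be sketched as three composed stages. This is only an illustrative skeleton: the `Pipeline` class, its field names, and the toy stand-in functions are hypothetical and are not taken from the ZeroWBC code, which has not been released. The retargeting stage is the one named in the Method section.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# All names here are hypothetical placeholders, not the authors' API.
Image = Sequence[Sequence[float]]  # stand-in for an egocentric frame
Motion = List[List[float]]         # frames x pose dimensions


@dataclass
class Pipeline:
    generate_motion: Callable[[Image, str], Motion]  # vision-language model stage
    retarget: Callable[[Motion], Motion]             # human motion -> robot reference
    track: Callable[[Motion], Motion]                # tracking policy -> robot actions

    def run(self, image: Image, instruction: str) -> Motion:
        human_motion = self.generate_motion(image, instruction)
        robot_reference = self.retarget(human_motion)
        return self.track(robot_reference)


# Toy stand-ins so the skeleton runs end to end; they carry no real semantics.
def toy_generate(image: Image, instruction: str) -> Motion:
    # One one-dimensional "pose" per word of the instruction.
    return [[float(len(word))] for word in instruction.split()]


def toy_retarget(motion: Motion) -> Motion:
    # Pretend the robot is half the size of the human.
    return [[0.5 * value for value in frame] for frame in motion]


def toy_track(reference: Motion) -> Motion:
    # A perfect tracker simply reproduces the reference.
    return [list(frame) for frame in reference]


pipeline = Pipeline(toy_generate, toy_retarget, toy_track)
actions = pipeline.run([[0.0]], "Kick the ball into the goal")
```

Keeping the stages as separate callables mirrors the left/middle/right split of the figure, so each stage could be swapped or tested on its own.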

Method

ZeroWBC Pipeline

Detailed architecture of ZeroWBC. (a) A fine-tuned Qwen-VL predicts motion tokens from an initial egocentric image and a language instruction; the tokens are then decoded into continuous human whole-body motions. (b) The generated motions are retargeted to the humanoid and executed by an RL-based policy that tracks both the robot reference motions and interaction-relevant body parts.
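Two steps in this caption lend themselves to a small sketch: decoding discrete motion tokens into continuous poses, and scoring how well the robot tracks a reference while emphasizing interaction-relevant body parts. The code below is a minimal illustration under stated assumptions, not the authors' implementation. It assumes a codebook lookup for decoding and an exponential position-error reward with a higher weight on interaction parts. The function names, `sigma`, `interaction_weight`, and the body-part names are all hypothetical.

```python
import math


def decode_tokens(tokens, codebook):
    """Map discrete motion tokens to continuous pose vectors by table lookup.

    Assumption: each token indexes one row of a learned codebook.
    """
    return [list(codebook[token]) for token in tokens]


def tracking_reward(robot_pos, ref_pos, interaction_parts,
                    sigma=0.25, interaction_weight=2.0):
    """Return exp(-weighted mean squared position error / sigma^2).

    Body parts listed in `interaction_parts` count `interaction_weight`
    times as much as the others (an assumed weighting scheme).
    """
    total, weight_sum = 0.0, 0.0
    for name, ref in ref_pos.items():
        weight = interaction_weight if name in interaction_parts else 1.0
        error = sum((a - b) ** 2 for a, b in zip(robot_pos[name], ref))
        total += weight * error
        weight_sum += weight
    return math.exp(-(total / weight_sum) / sigma ** 2)


# Toy data: a two-entry codebook and a two-part body.
poses = decode_tokens([1, 0], [[0.0, 0.0], [1.0, 2.0]])

reference = {"pelvis": (0.0, 0.0, 1.0), "right_foot": (0.0, 0.0, 0.0)}
foot_off = {"pelvis": (0.0, 0.0, 1.0), "right_foot": (0.1, 0.0, 0.0)}
pelvis_off = {"pelvis": (0.1, 0.0, 1.0), "right_foot": (0.0, 0.0, 0.0)}

perfect = tracking_reward(reference, reference, {"right_foot"})
foot_reward = tracking_reward(foot_off, reference, {"right_foot"})
pelvis_reward = tracking_reward(pelvis_off, reference, {"right_foot"})
```

With this weighting, the same 10 cm error costs more on the kicking foot than on the pelvis, which is one simple way to express "tracks interaction-relevant body parts" as a reward term.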

Real World Experiments

Kick Ball

Input 1
Visual input: [egocentric image]
Task instruction: Kick the ball into the goal

Input 2
Visual input: [egocentric image]
Task instruction: Kick the ball into the goal
(Ball placement and humanoid starting position differ from Input 1.)

Navigation & Obstacle Avoidance

Input 3
Visual input: [egocentric image]
Task instruction: Avoid the obstacle and walk to the chair

Input 4
Visual input: [egocentric image]
Task instruction: Avoid the obstacle and walk to the chair
(Obstacle placement differs from Input 3.)

Sit Down

Input 5
Visual input: [egocentric image]
Task instruction: Walk towards the sofa, turn around and sit down

Input 6
Visual input: [egocentric image]
Task instruction: Walk towards the sofa, turn around, sit down and stand up immediately
(Robot starting position and text instruction differ from Input 5.)

Input 7
Visual input: [egocentric image]
Task instruction: Walk towards the chair, turn around and sit down

Text-to-Motion

Squat down and raise your hands

Raise arms and punch forward

Citation

@article{yang2026zerowbc,
  title={{ZeroWBC}: Learning natural visuomotor humanoid control directly from human egocentric video},
  author={Yang, Haoran and Bao, Jiacheng and Xin, Yucheng and Song, Haoming and Tian, Yuyang and Zhao, Bin and Wang, Dong and Li, Xuelong},
  journal={arXiv preprint arXiv:2603.09170},
  year={2026}
}