Overview of the ZeroWBC framework. We propose a novel framework that learns natural humanoid visuomotor control directly from human egocentric videos. The pipeline takes an initial egocentric image and a text instruction as input (Left), synthesizes human whole-body motion via a fine-tuned vision-language model (Middle), and executes the resulting motion on a Unitree G1 robot using a robust general tracking policy (Right). ZeroWBC enables complex scene interactions such as kicking, sitting, and obstacle avoidance with zero real-robot teleoperation data.
Detailed architecture of ZeroWBC. The framework operates in two stages: (a) Multimodal Motion Generation: a fine-tuned Qwen2.5-VL predicts motion tokens from the initial image and text instruction, and a VQ-VAE decoder converts these tokens into continuous human motion. (b) General Motion Tracking: the generated motions are retargeted to the robot and tracked by an RL-based policy; a curriculum learning strategy of increasing difficulty ensures robust tracking performance.
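To make the two-stage data flow concrete, the sketch below outlines how generation and tracking could be composed. All class and function names (MotionVQVAE, FinetunedVLM, TrackingPolicy, retarget_to_g1, zerowbc_rollout, robot.proprioception, robot.step) are hypothetical placeholders introduced for illustration, not the authors' actual API; the only assumptions carried over from the caption are that a fine-tuned Qwen2.5-VL emits discrete motion tokens, a VQ-VAE decodes them into continuous human motion, and an RL policy tracks the retargeted motion on the robot.

```python
# Minimal, illustrative sketch of the two-stage ZeroWBC pipeline described above.
# Every name here is a hypothetical placeholder; method bodies are stubs.

import numpy as np


class MotionVQVAE:
    """Hypothetical VQ-VAE that maps whole-body motion to/from discrete tokens."""

    def decode(self, motion_tokens: list[int]) -> np.ndarray:
        # Returns a (T, D) array of continuous human whole-body poses.
        raise NotImplementedError


class FinetunedVLM:
    """Hypothetical wrapper around a Qwen2.5-VL model fine-tuned to emit motion tokens."""

    def predict_motion_tokens(self, image: np.ndarray, instruction: str) -> list[int]:
        # Autoregressively generates discrete motion tokens conditioned on the
        # initial egocentric image and the text instruction.
        raise NotImplementedError


class TrackingPolicy:
    """Hypothetical RL tracking policy trained with a difficulty curriculum."""

    def act(self, proprioception: np.ndarray, reference_pose: np.ndarray) -> np.ndarray:
        # Returns joint-level actions that track the retargeted reference pose.
        raise NotImplementedError


def retarget_to_g1(human_motion: np.ndarray) -> np.ndarray:
    """Hypothetical retargeting from human whole-body motion to Unitree G1 references."""
    raise NotImplementedError


def zerowbc_rollout(image, instruction, vlm, vqvae, policy, robot):
    # Stage (a): multimodal motion generation.
    tokens = vlm.predict_motion_tokens(image, instruction)
    human_motion = vqvae.decode(tokens)

    # Stage (b): general motion tracking on the robot.
    reference = retarget_to_g1(human_motion)
    for ref_pose in reference:
        action = policy.act(robot.proprioception(), ref_pose)
        robot.step(action)
```

One plausible reading of this design is that discrete motion tokens let the vision-language model handle motion synthesis as ordinary next-token prediction, while the separately trained tracking policy isolates the low-level control problem.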
[Figure: qualitative results. Each panel pairs an egocentric visual input with a task instruction: "Kick the ball into the goal" (shown twice, with different ball placements and humanoid starting positions); "Avoid the obstacle and walk to the chair" (shown twice, with different obstacle placements); "Walk towards the sofa, turn around and sit down"; "Walk towards the sofa, turn around, sit down and stand up immediately" (different robot starting position and text instruction); "Walk towards the chair, turn around and sit down"; "Squat down and raise your hands"; "Raise arms and punch forward".]