ZeroWBC: Learning Natural Whole-Body Humanoid Interaction from Human Egocentric Data

Haoran Yang1,2,*, Jiacheng Bao2,3,*, Yucheng Xin2,4,*, Haoming Song2,5,
Yuyang Tian1,2, Bin Zhao2,3, Dong Wang2,†, Xuelong Li6
1University of Science and Technology of China  2Shanghai AI Laboratory
3Northwestern Polytechnical University  4Tsinghua University
5Shanghai Jiao Tong University  6TeleAI, China Telecom
*Equal contributions.  †Corresponding author.

Overview

Overview of the ZeroWBC framework. The pipeline takes an initial egocentric image and text instruction as input (Left), synthesizes human whole-body motions via a fine-tuned vision-language model (Middle), and executes robot actions on the Unitree G1 robot using a general interactive tracking policy (Right).

Method

ZeroWBC Pipeline

Detailed architecture of ZeroWBC. (a) A fine-tuned Qwen-VL predicts motion tokens from an initial egocentric image and a language instruction; the tokens are then decoded into continuous human whole-body motions. (b) The generated motions are retargeted to the humanoid and executed by an RL-based policy that tracks both the robot reference motions and the interaction-relevant body parts.
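The data flow in the caption above can be sketched as a chain of four stages. This is only an illustrative skeleton under stated assumptions: the class `ZeroWBCSketch`, its field names, and the toy stand-in functions are all hypothetical and are not taken from the ZeroWBC code. The real stages are a fine-tuned Qwen-VL, a motion-token decoder, a retargeting step, and an RL tracking policy.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical type aliases for illustration only.
MotionTokens = List[int]
Motion = List[List[float]]  # frames x pose dimensions


@dataclass
class ZeroWBCSketch:
    """Hypothetical wiring of the stages named in the caption."""
    predict_tokens: Callable[[bytes, str], MotionTokens]  # (a) VLM: image + text -> tokens
    decode_motion: Callable[[MotionTokens], Motion]       # (a) tokens -> continuous human motion
    retarget: Callable[[Motion], Motion]                  # (b) human motion -> robot reference
    track: Callable[[Motion], Motion]                     # (b) tracking policy -> executed motion

    def run(self, image: bytes, instruction: str) -> Motion:
        tokens = self.predict_tokens(image, instruction)
        human_motion = self.decode_motion(tokens)
        robot_reference = self.retarget(human_motion)
        return self.track(robot_reference)


# Toy stand-ins so the sketch runs end to end; they carry no model logic.
def toy_predict(image: bytes, instruction: str) -> MotionTokens:
    return [len(word) for word in instruction.split()]  # one fake token per word


def toy_decode(tokens: MotionTokens) -> Motion:
    return [[float(t), 0.0] for t in tokens]  # one fake 2-D pose per token


def toy_retarget(motion: Motion) -> Motion:
    return [[0.5 * x for x in frame] for frame in motion]  # fake scale to robot size


def toy_track(reference: Motion) -> Motion:
    return [list(frame) for frame in reference]  # pretend tracking is perfect


pipeline = ZeroWBCSketch(toy_predict, toy_decode, toy_retarget, toy_track)
executed = pipeline.run(b"", "Kick the ball into the goal")
```

Keeping each stage behind a plain callable mirrors the split between motion generation (a) and motion execution (b) in the caption, so either half could be swapped out without touching the other.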

Real World Experiments

Kick Ball

Visual Input: Input 1
Task Instruction: Kick the ball into the goal

Visual Input: Input 2
Task Instruction: Kick the ball into the goal
(The ball placement and the humanoid's starting position differ from Input 1.)

Navigation & Obstacle Avoidance

Visual Input: Input 3
Task Instruction: Avoid the obstacle and walk to the chair

Visual Input: Input 4
Task Instruction: Avoid the obstacle and walk to the chair
(The obstacle placement differs from Input 3.)

Sit Down

Visual Input: Input 5
Task Instruction: Walk towards the sofa, turn around and sit down

Visual Input: Input 6
Task Instruction: Walk towards the sofa, turn around, sit down and stand up immediately
(The robot's starting position and the text instruction differ from Input 5.)

Visual Input: Input 7
Task Instruction: Walk towards the chair, turn around and sit down

Text-to-Motion

Squat down and raise your hands

Raise arms and punch forward