Pi-Zero Review
π0: A Vision-Language-Action Flow Model for General Robot Control
Introduction
Challenges for generalist robot policies
- Large scale
- Model architecture
- Training recipe → pre-training/post-training
Action head in VLA models
- Autoregressive discretization
- RT-2
- OpenVLA
- TinyVLA
- Inefficient and slow inference speed (3–5 Hz)
- Quantization disrupts action continuity
- Diffusion
- Octo (not a VLA), OneDP, TinyVLA, DiVLA
- Produce complex actions
- Inherently slow inference, though it can be sped up to roughly 50–62 Hz
- Flow matching
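The continuity problem with autoregressive discretization can be seen in a minimal sketch of RT-2-style uniform action binning (the 256-bin count and [-1, 1] range are illustrative assumptions):

```python
import numpy as np

def discretize(actions, low=-1.0, high=1.0, n_bins=256):
    """Map continuous actions to integer token ids via uniform binning."""
    edges = np.linspace(low, high, n_bins + 1)
    return np.clip(np.digitize(actions, edges) - 1, 0, n_bins - 1)

def undiscretize(tokens, low=-1.0, high=1.0, n_bins=256):
    """Recover bin-center values; the rounding error is irreducible."""
    edges = np.linspace(low, high, n_bins + 1)
    centers = (edges[:-1] + edges[1:]) / 2
    return centers[tokens]

a = np.array([0.1234, -0.5678])
a_rec = undiscretize(discretize(a))
# reconstruction error is bounded by half a bin width (~0.0039 here),
# so fine-grained continuous actions are quantized away
```

A diffusion or flow matching head avoids this round-trip entirely by predicting continuous values.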
Flow matching
- Continuous Flow
- Discrete Flow
- Flow matching
Pi-zero model
Pretrained VLM
[PaliGemma](https://arxiv.org/pdf/2407.07726)
Flow matching
- variant of diffusion (denoising score matching)
- a diffusion-style (flow matching) loss applied to individual sequence elements, instead of the standard cross-entropy loss, in a decoder-only transformer
- single transformer using multiple objectives, with tokens corresponding to continuous outputs supervised via a flow matching loss and tokens corresponding to discrete outputs supervised via a cross-entropy loss
- Uses a separate set of weights for the tokens supervised by the diffusion/flow objective
- MoE-like split: (1) image/text inputs (VLM backbone) (2) robotics-specific inputs/outputs (the action expert)
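The two-expert split can be sketched as token-wise routing to separate feed-forward weights inside one transformer; attention is shared, only the weights differ. Dimensions, the boolean routing mask, and the plain matrix-multiply FFN are illustrative assumptions, not the PaliGemma/action-expert architecture itself:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8

# Separate FFN weights: one set for image/text tokens (VLM expert),
# one set for state/action tokens (action expert).
W_vlm = rng.normal(size=(d_model, d_model))
W_act = rng.normal(size=(d_model, d_model))

def moe_ffn(x, is_action_token):
    """Route each token through its expert's weights."""
    out = np.empty_like(x)
    out[~is_action_token] = x[~is_action_token] @ W_vlm
    out[is_action_token] = x[is_action_token] @ W_act
    return out

tokens = rng.normal(size=(6, d_model))   # 4 image/text tokens + 2 action tokens
mask = np.array([False, False, False, False, True, True])
y = moe_ffn(tokens, mask)
```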
Problem Definition
- Input
- Observation 𝐨t = [𝐈t¹, …, 𝐈tⁿ, ℓt, 𝐪t]
- RGB image tokens (2–3 images per robot)
- Sequence of language tokens (language command)
- Vector of joint angles (robot’s proprioceptive state)
- Output
- Action chunk 𝐀t = [𝐚t, 𝐚t+1, …, 𝐚t+H−1]
- H=50
- Action token
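The input/output shapes above can be sketched as plain data structures (the action dimension of 7 and the field layout are illustrative assumptions; only H = 50 is from the paper):

```python
from dataclasses import dataclass
import numpy as np

H = 50           # action horizon from the paper
ACTION_DIM = 7   # illustrative assumption (e.g., joint deltas + gripper)

@dataclass
class Observation:
    images: list          # n RGB images I_t^1 .. I_t^n (2-3 per robot)
    language: str         # task command ℓ_t
    state: np.ndarray     # proprioceptive joint angles q_t

@dataclass
class ActionChunk:
    actions: np.ndarray   # shape (H, ACTION_DIM): a_t .. a_{t+H-1}

chunk = ActionChunk(actions=np.zeros((H, ACTION_DIM)))
```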
Blockwise Causal Attention

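The blockwise causal mask can be sketched as full attention within each block and causal attention across blocks, ordered [image/text] → [state] → [action]; the block sizes below are illustrative:

```python
import numpy as np

def blockwise_causal_mask(block_sizes):
    """mask[i, j] = True if query token i may attend to key token j.
    Tokens attend to everything in their own block and in earlier blocks,
    with full (non-causal) attention inside a block."""
    block_id = np.repeat(np.arange(len(block_sizes)), block_sizes)
    return block_id[:, None] >= block_id[None, :]

# e.g. 4 image/text tokens, 1 state token, 3 action tokens
mask = blockwise_causal_mask([4, 1, 3])
```

This lets the image/text prefix be cached and reused across denoising steps, since it never attends to state or action tokens.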
Training / Inference
- Training: conditional flow matching (CNF) loss

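The flow matching loss can be written in one common rectified-flow convention (a sketch; the paper's exact sign and τ-direction conventions may differ):

```latex
% Noisy action chunk: interpolate between Gaussian noise and data
\mathbf{A}_t^{\tau} = \tau \mathbf{A}_t + (1-\tau)\,\epsilon,
\qquad \epsilon \sim \mathcal{N}(0, \mathbf{I}),\ \ \tau \in [0, 1]

% Regress the network's velocity onto the target velocity
L(\theta) = \mathbb{E}_{\tau,\;\epsilon,\;(\mathbf{o}_t, \mathbf{A}_t)}
\left\| \mathbf{v}_{\theta}(\mathbf{A}_t^{\tau}, \mathbf{o}_t)
- (\mathbf{A}_t - \epsilon) \right\|^2
```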
- Inference:

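Inference can then be sketched as Euler integration of the learned velocity field from Gaussian noise to an action chunk. The toy velocity function and step count below are illustrative stand-ins, not the trained model:

```python
import numpy as np

def sample_actions(velocity_fn, obs, horizon=50, action_dim=7, n_steps=10, seed=0):
    """Euler integration of dA/dτ = v_θ(A^τ, o_t) from τ=0 (noise) to τ=1."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(horizon, action_dim))   # A^0 ~ N(0, I)
    dt = 1.0 / n_steps
    tau = 0.0
    for _ in range(n_steps):
        A = A + dt * velocity_fn(A, tau, obs)
        tau += dt
    return A

# Toy velocity field v = -A, whose flow shrinks every sample toward zero;
# 10 Euler steps multiply the initial noise by (1 - dt)^10 = 0.9^10.
v = lambda A, tau, obs: -A
chunk = sample_actions(v, obs=None)
```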
Remaining Questions
- How should the pre-training dataset be composed?
- What types of data are most helpful to add?
- How should the data be weighted?
- Does the approach generalize across distinct domains (data/robots)?
Limitations
- Generates the full H-step action chunk at once
- The chunk cannot be revised until its execution ends
- Action execution speed is still slow