Pi-Zero Review
π0: A Vision-Language-Action Flow Model for General Robot Control
Introduction
Challenges for generalist robot policies
- Large scale
- Model architecture
- Training recipe → pre-training/post-training
Action head in VLA models
- Autoregressive discretization
- RT-2
- OpenVLA
- TinyVLA
- Inefficient and slow inference speed (3–5 Hz)
- Quantization disrupts action continuity
- Diffusion
- Octo (not a VLA), OneDP, TinyVLA, DiVLA
- Produce complex actions
- Inherently slow inference, though it can be sped up to roughly 50–62 Hz
- Flow matching
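The continuity problem with autoregressive discretization can be seen in a minimal sketch of RT-2-style uniform action binning (the 256-bin count and [-1, 1] range are illustrative assumptions):

```python
import numpy as np

def discretize(actions, low=-1.0, high=1.0, n_bins=256):
    """Map continuous actions to integer token ids via uniform binning."""
    edges = np.linspace(low, high, n_bins + 1)
    return np.clip(np.digitize(actions, edges) - 1, 0, n_bins - 1)

def undiscretize(tokens, low=-1.0, high=1.0, n_bins=256):
    """Recover bin-center values; the rounding error is irreducible."""
    edges = np.linspace(low, high, n_bins + 1)
    centers = (edges[:-1] + edges[1:]) / 2
    return centers[tokens]

a = np.array([0.1234, -0.5678])
a_rec = undiscretize(discretize(a))
# reconstruction error is bounded by half a bin width (~0.0039 here),
# so fine-grained continuous actions are quantized away
```

A diffusion or flow matching head avoids this round-trip entirely by predicting continuous values.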
Flow matching
- Continuous Flow
- Discrete Flow
- Flow matching
Pi-zero model
Pretrained VLM
[PaliGemma](https://arxiv.org/pdf/2407.07726)
Flow matching
- variant of diffusion (denoising score matching)
- a diffusion-style (flow matching) loss applied to individual sequence elements, instead of the standard cross-entropy loss, in a decoder-only transformer
- single transformer using multiple objectives, with tokens corresponding to continuous outputs supervised via a flow matching loss and tokens corresponding to discrete outputs supervised via a cross-entropy loss
- Uses a separate set of weights for the tokens supervised by the diffusion/flow objective
- MoE-like split: (1) image/text inputs (VLM backbone) (2) robotics-specific inputs/outputs (the action expert)
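The two-expert split can be sketched as token-wise routing to separate feed-forward weights inside one transformer; attention is shared, only the weights differ. Dimensions, the boolean routing mask, and the plain matrix-multiply FFN are illustrative assumptions, not the PaliGemma/action-expert architecture itself:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8

# Separate FFN weights: one set for image/text tokens (VLM expert),
# one set for state/action tokens (action expert).
W_vlm = rng.normal(size=(d_model, d_model))
W_act = rng.normal(size=(d_model, d_model))

def moe_ffn(x, is_action_token):
    """Route each token through its expert's weights."""
    out = np.empty_like(x)
    out[~is_action_token] = x[~is_action_token] @ W_vlm
    out[is_action_token] = x[is_action_token] @ W_act
    return out

tokens = rng.normal(size=(6, d_model))   # 4 image/text tokens + 2 action tokens
mask = np.array([False, False, False, False, True, True])
y = moe_ffn(tokens, mask)
```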
Problem Definition
- Input
- Observation 𝐨t = [𝐈t¹, …, 𝐈tⁿ, ℓt, 𝐪t]
- RGB image tokens (2–3 images per robot)
- Sequence of language tokens (language command)
- Vector of joint angles (robot’s proprioceptive state)
- Output
- Action chunk 𝐀t = [𝐚t, 𝐚t+1, …, 𝐚t+H−1]
- H=50
- Action token
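The input/output shapes above can be sketched as plain data structures (the action dimension of 7 and the field layout are illustrative assumptions; only H = 50 is from the paper):

```python
from dataclasses import dataclass
import numpy as np

H = 50           # action horizon from the paper
ACTION_DIM = 7   # illustrative assumption (e.g., joint deltas + gripper)

@dataclass
class Observation:
    images: list          # n RGB images I_t^1 .. I_t^n (2-3 per robot)
    language: str         # task command ℓ_t
    state: np.ndarray     # proprioceptive joint angles q_t

@dataclass
class ActionChunk:
    actions: np.ndarray   # shape (H, ACTION_DIM): a_t .. a_{t+H-1}

chunk = ActionChunk(actions=np.zeros((H, ACTION_DIM)))
```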
Blockwise Causal Attention

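The blockwise causal mask can be sketched as full attention within each block and causal attention across blocks, ordered [image/text] → [state] → [action]; the block sizes below are illustrative:

```python
import numpy as np

def blockwise_causal_mask(block_sizes):
    """mask[i, j] = True if query token i may attend to key token j.
    Tokens attend to everything in their own block and in earlier blocks,
    with full (non-causal) attention inside a block."""
    block_id = np.repeat(np.arange(len(block_sizes)), block_sizes)
    return block_id[:, None] >= block_id[None, :]

# e.g. 4 image/text tokens, 1 state token, 3 action tokens
mask = blockwise_causal_mask([4, 1, 3])
```

This lets the image/text prefix be cached and reused across denoising steps, since it never attends to state or action tokens.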
Training / Inference
- Training: conditional flow matching (CNF) loss

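The flow matching loss can be written in one common rectified-flow convention (a sketch; the paper's exact sign and τ-direction conventions may differ):

```latex
% Noisy action chunk: interpolate between Gaussian noise and data
\mathbf{A}_t^{\tau} = \tau \mathbf{A}_t + (1-\tau)\,\epsilon,
\qquad \epsilon \sim \mathcal{N}(0, \mathbf{I}),\ \ \tau \in [0, 1]

% Regress the network's velocity onto the target velocity
L(\theta) = \mathbb{E}_{\tau,\;\epsilon,\;(\mathbf{o}_t, \mathbf{A}_t)}
\left\| \mathbf{v}_{\theta}(\mathbf{A}_t^{\tau}, \mathbf{o}_t)
- (\mathbf{A}_t - \epsilon) \right\|^2
```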
- Inference:

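Inference can then be sketched as Euler integration of the learned velocity field from Gaussian noise to an action chunk. The toy velocity function and step count below are illustrative stand-ins, not the trained model:

```python
import numpy as np

def sample_actions(velocity_fn, obs, horizon=50, action_dim=7, n_steps=10, seed=0):
    """Euler integration of dA/dτ = v_θ(A^τ, o_t) from τ=0 (noise) to τ=1."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(horizon, action_dim))   # A^0 ~ N(0, I)
    dt = 1.0 / n_steps
    tau = 0.0
    for _ in range(n_steps):
        A = A + dt * velocity_fn(A, tau, obs)
        tau += dt
    return A

# Toy velocity field v = -A, whose flow shrinks every sample toward zero;
# 10 Euler steps multiply the initial noise by (1 - dt)^10 = 0.9^10.
v = lambda A, tau, obs: -A
chunk = sample_actions(v, obs=None)
```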
Remaining Questions
- How should the pre-training dataset be composed?
- What types of data are most helpful to add?
- How should the data be weighted?
- Does the approach generalize across distinct domains (data/robots)?
Limitations
- Generates the full H-step action chunk at once
- The chunk cannot be revised until its execution ends
- Action execution speed is still slow