Pi-Zero Review

arxiv.org

π0: A Vision-Language-Action Flow Model for General Robot Control

Introduction

Challenges in building generalist robot policies

  • Large scale
  • Model architecture
  • Training recipe → pre-training/post-training

Action head in VLA models

  • Autoregressive discretization
    • RT-2
    • OpenVLA
    • TinyVLA
    • Inefficient, slow inference (3–5 Hz)
    • Quantization disrupts action continuity
  • Diffusion
    • Octo (not a VLA), OneDP, TinyVLA, DiVLA
    • Can produce complex actions
    • Inherently slow inference, but can be sped up to 50–62 Hz
    • Flow matching
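
The point about quantization disrupting action continuity can be illustrated with a minimal sketch. The bin count (256), the action range [-1, 1], and the function names are assumptions chosen for illustration, not taken from any specific model's tokenizer. A smooth ramp of actions collapses into a staircase once each value is mapped to a bin index and back.

```python
def quantize(action, low=-1.0, high=1.0, bins=256):
    """Map a continuous value in [low, high] to an integer bin index."""
    clipped = min(max(action, low), high)
    idx = int((clipped - low) / (high - low) * bins)
    return min(idx, bins - 1)  # the upper edge falls into the last bin


def dequantize(idx, low=-1.0, high=1.0, bins=256):
    """Map a bin index back to the center of its bin."""
    width = (high - low) / bins
    return low + (idx + 0.5) * width


# A slowly increasing (hypothetical) joint command.
trajectory = [0.100, 0.101, 0.102, 0.103]
tokens = [quantize(a) for a in trajectory]
decoded = [dequantize(t) for t in tokens]
max_error = max(abs(a - d) for a, d in zip(trajectory, decoded))

print(tokens)     # neighbouring values share a bin
print(decoded)    # the decoded trajectory is piecewise constant
print(max_error)  # bounded by half a bin width
```

The reconstruction error is bounded by half a bin width, but distinct nearby actions become indistinguishable, which is the loss of continuity the notes refer to.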

Flow matching

  • Continuous Flow
  • Discrete Flow
  • Flow matching

Pi-zero model

Pretrained VLM

PaliGemma [https://arxiv.org/pdf/2407.07726]

Flow matching

  • A variant of diffusion (denoising score matching)
  • A diffusion-style (flow matching) loss is applied to individual sequence elements, replacing the standard cross-entropy loss of a decoder-only transformer
    • A single transformer is trained with multiple objectives: tokens for continuous outputs are supervised with a flow matching loss, and tokens for discrete outputs with a cross-entropy loss
  • Uses a separate set of weights for the tokens corresponding to diffusion
    • Analogous to a mixture of experts (MoE) with two experts: (1) image/text inputs, (2) robotics-specific inputs/outputs (the action expert)

Problem Definition

  • Input
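    • Observation $\mathbf{o}_t = [\mathbf{I}_t^1, \ldots, \mathbf{I}_t^n, \ell_t, \mathbf{q}_t]$
      • RGB image tokens (2–3 images per robot)
      • Sequence of language tokens (language command)
      • Vector of joint angles (robot’s proprioceptive state)
  • Output
    • Action chunk $\mathbf{A}_t = [\mathbf{a}_t, \mathbf{a}_{t+1}, \ldots, \mathbf{a}_{t+H-1}]$
      • $H = 50$
      • Each action in the chunk corresponds to one action token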
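
The input/output definition above can be written down as a small data-structure sketch. The class names, field names, the placeholder image type, and the action dimension of 7 are hypothetical; only the horizon of 50 and the three observation components come from the notes.

```python
from dataclasses import dataclass
from typing import List

H = 50  # action horizon from the notes


@dataclass
class Observation:
    images: List[object]        # 2-3 RGB images per robot (placeholder type)
    language_tokens: List[int]  # tokenized language command
    joint_angles: List[float]   # proprioceptive state q_t


@dataclass
class ActionChunk:
    actions: List[List[float]]  # H actions, each a vector of action_dim floats

    def __post_init__(self):
        # Reject chunks that do not cover the full horizon.
        if len(self.actions) != H:
            raise ValueError(f"expected {H} actions, got {len(self.actions)}")


# Hypothetical example: two cameras, a 3-token command, a 7-DoF arm.
obs = Observation(images=[None, None],
                  language_tokens=[11, 42, 7],
                  joint_angles=[0.0] * 7)
chunk = ActionChunk(actions=[[0.0] * 7 for _ in range(H)])
print(len(obs.images), len(chunk.actions))
```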

Blockwise Causal Attention

[Figure: blockwise causal attention mask]
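
A blockwise causal mask can be sketched as follows, assuming the token sequence is split into three blocks (image and language tokens, the proprioceptive state, and the action tokens), that tokens attend bidirectionally within their own block, and that no token attends to a later block. The function name and the tiny block sizes are illustrative only.

```python
def blockwise_causal_mask(block_sizes):
    """Return mask[i][j] = True if token i may attend to token j.

    Tokens see every token in their own block and in all earlier blocks,
    but nothing in later blocks.
    """
    block_of = []
    for block_id, size in enumerate(block_sizes):
        block_of.extend([block_id] * size)
    n = len(block_of)
    return [[block_of[j] <= block_of[i] for j in range(n)] for i in range(n)]


# Toy sizes: 3 image/language tokens, 1 state token, 2 action tokens.
mask = blockwise_causal_mask([3, 1, 2])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Because the first block never attends to later blocks, its keys and values do not change across the flow matching integration steps, so under this assumption they can be computed once and cached during inference.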

Training / Inference

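  • Training: CNF loss (conditional flow matching), $L^\tau(\theta) = \mathbb{E}_{p(\mathbf{A}_t \mid \mathbf{o}_t),\, q(\mathbf{A}_t^\tau \mid \mathbf{A}_t)} \lVert \mathbf{v}_\theta(\mathbf{A}_t^\tau, \mathbf{o}_t) - \mathbf{u}(\mathbf{A}_t^\tau \mid \mathbf{A}_t) \rVert^2$
  • Inference: forward Euler integration starting from noise, $\mathbf{A}_t^{\tau+\delta} = \mathbf{A}_t^\tau + \delta\, \mathbf{v}_\theta(\mathbf{A}_t^\tau, \mathbf{o}_t)$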
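
The training and inference steps can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the paper's implementation: the noisy sample is taken as a linear interpolation between noise (at tau = 0) and the clean actions (at tau = 1), the regression target is the corresponding velocity `action - noise` (a sign convention chosen here for self-consistency, which may differ from the paper's notation), the action chunk is flattened to a list of floats, and `oracle_velocity` is a hypothetical stand-in for the trained network. The 10 integration steps are also an assumption.

```python
import random


def flow_matching_example(action, tau, noise):
    """Build one training pair: the noisy input and its target velocity."""
    noisy = [tau * a + (1.0 - tau) * e for a, e in zip(action, noise)]
    target = [a - e for a, e in zip(action, noise)]
    return noisy, target


def flow_matching_loss(predicted, target):
    """Mean squared error between predicted and target velocity."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def sample_actions(velocity_fn, noise, num_steps=10):
    """Forward Euler integration from tau = 0 (noise) to tau = 1 (actions)."""
    delta = 1.0 / num_steps
    x = list(noise)
    for k in range(num_steps):
        tau = k / num_steps
        v = velocity_fn(x, tau)
        x = [xi + delta * vi for xi, vi in zip(x, v)]
    return x


# Toy stand-in for the network: the exact velocity field when the data
# distribution is a single fixed action vector TARGET.
TARGET = [0.5, -0.25]


def oracle_velocity(x, tau):
    return [(a - xi) / (1.0 - tau) for a, xi in zip(TARGET, x)]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in TARGET]

noisy, target = flow_matching_example(TARGET, 0.3, noise)
loss = flow_matching_loss(oracle_velocity(noisy, 0.3), target)
result = sample_actions(oracle_velocity, noise)
print(loss)    # close to zero for the exact velocity field
print(result)  # close to TARGET after integration
```

In a real model the velocity function would be the action expert conditioned on the observation, and tau would be sampled at random for every training example.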

Remaining Questions

  • How should the pre-training dataset be composed?
  • What type of data is most helpful to add?
  • How should it be weighted?
  • Does the universality extend to distinct domains (data/robots)?

Limitations

  • Generates an H-step action chunk at once
  • The chunk cannot be modified until its actions finish executing
  • Action execution is still slow