The Physical AI Leaderboard (PhAIL) is a benchmark for evaluating foundation models designed for physical manipulation tasks. We evaluate models on real hardware performing bin-to-bin order picking tasks.

Why Measurement Matters

Physical AI systems must operate reliably in the real world. Unlike simulation or dataset-based benchmarks, PhAIL measures actual performance on physical hardware with real objects, lighting conditions, and physical constraints.

By standardizing evaluation protocols and measuring key performance indicators, we can objectively compare different approaches and track progress in the field.

Benchmark Goals

Our benchmark is designed to:

  • Provide objective, reproducible measurements of physical AI model performance
  • Enable fair comparison between different models and approaches
  • Establish baseline performance metrics for teleoperated and autonomous systems
  • Track progress in physical AI capabilities over time

Tasks & Hardware

The benchmark consists of bin-to-bin order picking tasks where a robot arm must pick items from one container and place them into another. Tasks include picking towels, wooden spoons, and scissors.

All evaluations are performed on standardized hardware to ensure fair comparison across different models and methods.

Metrics

We measure three key performance indicators:

  • UPH (Units Per Hour): Throughput measured as successful pick-and-place operations per hour
  • Success Rate: Percentage of attempted operations that succeed
  • MTBF (Mean Time Between Failures): Average operating time between failure events, a measure of reliability
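
As a rough illustration of how these three indicators relate, here is a minimal Python sketch. The RunSummary fields and function names are hypothetical and are not part of any PhAIL tooling. It also assumes MTBF is computed as total operating time divided by the number of failure events, which is one common convention and may differ from the exact definition used by the benchmark.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunSummary:
    """Hypothetical summary of one evaluation run (field names are assumptions)."""
    duration_s: float  # total operating time, in seconds
    attempts: int      # pick-and-place operations attempted
    successes: int     # operations that completed successfully
    failures: int      # failure events observed during the run


def units_per_hour(run: RunSummary) -> float:
    """UPH: successful operations scaled to a one-hour window."""
    if run.duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return run.successes / (run.duration_s / 3600.0)


def success_rate(run: RunSummary) -> float:
    """Percentage of attempted operations that succeeded."""
    if run.attempts == 0:
        return 0.0
    return 100.0 * run.successes / run.attempts


def mtbf_seconds(run: RunSummary) -> Optional[float]:
    """MTBF, assumed here to be operating time divided by failure count.

    Returns None when no failure occurred, since the value is undefined.
    """
    if run.failures == 0:
        return None
    return run.duration_s / run.failures


# Illustrative numbers only, not measured results:
# a 30-minute run with 50 attempts, 45 successes, and 5 failure events.
run = RunSummary(duration_s=1800.0, attempts=50, successes=45, failures=5)
print(units_per_hour(run), success_rate(run), mtbf_seconds(run))
```

The three values answer different questions: UPH rewards speed, success rate ignores time, and MTBF reflects how long a system runs before something goes wrong. A slow but careful policy can therefore score well on success rate and MTBF while trailing on UPH.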

Fine-tuning and Baselines

We evaluate both pre-trained foundation models and task-specific fine-tuned models. Human teleoperation serves as a baseline to understand the performance gap between current AI systems and human operators.
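
One simple way to express the gap to the human baseline is to report a model's score as a fraction of the teleoperation score on the same metric. The sketch below is an assumption about how such a comparison could be made, not a description of how PhAIL reports it. The function name and the example values are hypothetical.

```python
def fraction_of_baseline(model_value: float, baseline_value: float) -> float:
    """Return the model's score as a fraction of the baseline score.

    Intended for metrics where higher is better (UPH, success rate, MTBF).
    A result of 1.0 means parity with the baseline.
    """
    if baseline_value <= 0:
        raise ValueError("baseline_value must be positive")
    return model_value / baseline_value


# Illustrative values only, not measured results.
model_uph = 90.0
teleop_uph = 300.0
ratio = fraction_of_baseline(model_uph, teleop_uph)
print(f"Model reaches {ratio:.0%} of teleoperated throughput")
```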

Episodes and Artifacts

Each evaluation run produces a complete episode recording including robot state, camera feeds, and performance metrics. All episodes are visualized using Rerun for detailed analysis and replay.
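
To make the contents of an episode concrete, the following sketch models a recording as a list of timestamped frames plus a metrics dictionary. This is a hypothetical layout for illustration only. It is not PhAIL's actual schema and does not use the Rerun SDK, and the field names, the six-joint arm, and the file path are all assumptions.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class Frame:
    """One timestep of a hypothetical episode recording."""
    t: float                       # seconds since episode start
    joint_positions: List[float]   # robot state (six joints assumed here)
    camera_images: Dict[str, str]  # camera name -> path of the saved image


@dataclass
class Episode:
    """Hypothetical container for one evaluation run."""
    task: str
    model: str
    frames: List[Frame] = field(default_factory=list)
    metrics: Dict[str, float] = field(default_factory=dict)

    def to_json(self) -> str:
        # asdict() converts nested dataclasses into plain dicts and lists.
        return json.dumps(asdict(self), indent=2)


# Illustrative values only.
episode = Episode(task="towel", model="example-model")
episode.frames.append(
    Frame(
        t=0.0,
        joint_positions=[0.0] * 6,
        camera_images={"wrist": "frames/wrist_000000.jpg"},
    )
)
episode.metrics["success_rate"] = 90.0

# Round-trip through JSON to confirm the record is serializable.
restored = json.loads(episode.to_json())
print(restored["task"], len(restored["frames"]))
```

Keeping robot state, camera references, and metrics in one record means a single file is enough to replay a run and recompute its scores.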