Sponsored by Nebius

Metrics

UPH – units per hour. How fast the system works.

MTBA – mean time between assists. How long the system runs, on average, before a human needs to step in.
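As a rough illustration of how these two metrics could be aggregated from run summaries, here is a minimal sketch. It assumes that UPH is total completed units divided by total run time in hours, and that MTBA is total run time divided by the number of assists. The `RunLog` type and its fields are hypothetical and not part of Positronic. The white paper remains the authoritative definition of scoring.

```python
from dataclasses import dataclass


@dataclass
class RunLog:
    """Hypothetical summary of one evaluation run (not a Positronic type)."""
    duration_s: float      # wall-clock run time in seconds
    units_completed: int   # items transferred during the run
    assists: int           # number of human interventions


def units_per_hour(runs: list[RunLog]) -> float:
    """UPH (assumed): total completed units divided by total run time in hours."""
    total_hours = sum(r.duration_s for r in runs) / 3600.0
    if total_hours == 0:
        return 0.0
    return sum(r.units_completed for r in runs) / total_hours


def mean_time_between_assists(runs: list[RunLog]) -> float:
    """MTBA in seconds (assumed): total run time divided by total assists.

    Returns infinity when no assist occurred in any run.
    """
    total_assists = sum(r.assists for r in runs)
    if total_assists == 0:
        return float("inf")
    return sum(r.duration_s for r in runs) / total_assists


# Two made-up 30-minute runs, for illustration only.
runs = [RunLog(1800.0, 60, 2), RunLog(1800.0, 40, 3)]
print(units_per_hour(runs))
print(mean_time_between_assists(runs))
```

Pooling time and counts across runs before dividing, rather than averaging per-run ratios, keeps short runs from being over-weighted. That choice is part of the assumption here, not a statement of the leaderboard's method.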

Tasks & hardware

The first task is bin-to-bin order picking – transferring individual items between containers. Evaluations run on Franka Research 3 arms with Robotiq grippers. More tasks and platforms are coming.

Fine-tuning dataset

The DROID teleoperation dataset used to fine-tune all models on the leaderboard contains 352 episodes (12 GB) and is available for non-commercial use.

uv run --with positronic \
  positronic-server \
  --dataset=@positronic.cfg.ds.phail.phail

Evaluation runs

Every evaluation run on the leaderboard is a downloadable Positronic dataset – multi-view video and robot telemetry.

uv run --with positronic \
  positronic-server \
  --dataset=@positronic.cfg.ds.phail.eval_runs

Browse individual runs in the Run explorer.

Methodology

Full evaluation protocol, scoring, and reproducibility details are in the white paper.

Consortium

PhAIL is governed as an open consortium. Founding partners: Nebius and Toloka.

We're looking for model builders, hardware vendors, and deployers who want to shape how physical AI is measured.

Get in touch: hi@phail.ai