The LeRobot Dataset Format: Your Robot's Memory System

Think of the LeRobot Dataset format as your robot's organized memory – a standardized way to store everything your robot sees, does, and learns. It's like having a perfectly organized filing cabinet that any robot AI can read and learn from.

What Makes LeRobot Special?

One-line loading: Get any robot dataset with dataset = LeRobotDataset("lerobot/aloha_static_coffee")

Time travel capability: The delta_timestamps feature lets you grab multiple frames from different time points. Want the current frame plus what happened 1 second, 0.5 seconds, and 0.2 seconds ago? Easy: delta_timestamps = {"observation.image": [-1, -0.5, -0.2, 0]}

Universal compatibility: Plugs directly into PyTorch dataloaders and the Hugging Face ecosystem

Flexible by design: Handles data from any robot – simulations, real hardware, different sensors
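Under the hood, each time delta in delta_timestamps resolves to the frame whose timestamp is nearest. A minimal sketch of that mapping, assuming a 10 fps dataset (delta_to_frame_offsets is an illustrative helper, not part of the LeRobot API):

```python
# Sketch: how delta_timestamps map to frame-index offsets at a given fps.
# The key and deltas mirror the example above; fps=10 is an assumption.
def delta_to_frame_offsets(delta_timestamps, fps):
    """Convert per-key time deltas (seconds) to frame-index offsets."""
    return {
        key: [round(dt * fps) for dt in deltas]
        for key, deltas in delta_timestamps.items()
    }

offsets = delta_to_frame_offsets(
    {"observation.image": [-1, -0.5, -0.2, 0]}, fps=10
)
print(offsets)  # {'observation.image': [-10, -5, -2, 0]}
```

So at fps=10, asking for "1 second ago" pulls the frame 10 steps back, and the current frame is offset 0.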

How Your Data Lives on Disk

LeRobot organizes everything into a clean, logical structure. Here's what a typical dataset looks like:

my_awesome_robot_dataset/
├── data/                           # The robot's actions and states
│   └── chunk-000/
│       ├── episode_000000.parquet  # Each episode's step-by-step data
│       ├── episode_000001.parquet
│       └── ...
├── videos/                         # What the robot saw
│   └── chunk-000/
│       ├── observation.images.main/     # Main camera view
│       │   ├── episode_000000.mp4
│       │   └── ...
│       └── observation.images.wrist/    # Wrist camera view
│           ├── episode_000000.mp4
│           └── ...
├── meta/                           # Dataset brain - all the metadata
│   ├── info.json                   # Dataset blueprint
│   ├── episodes.jsonl              # Episode catalog
│   ├── tasks.jsonl                 # Task descriptions
│   ├── episodes_stats.jsonl        # Statistics per episode
│   └── README.md
└── README.md

Think of it like this:

  • data/: The robot's action log (what it did)

  • videos/: The robot's visual memory (what it saw)

  • meta/: The robot's index system (how to find everything)

The Data Inside: What Actually Gets Stored?

Parquet Files: The Action Chronicles

Every step your robot takes gets recorded in these files:

  • observation.state: Where the robot is (joint angles, end-effector position)

  • action: What the robot decided to do (target positions, movements)

  • timestamp: When this happened (seconds from episode start)

  • episode_index: Which training session this belongs to

  • frame_index: Step number within this episode (starts at 0)

  • index: Unique ID across the entire dataset

  • next.done: True if this was the final step
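To make the fields concrete, here is one step expressed as a plain Python dict (the values are illustrative; real datasets store these as Parquet columns, and the fps=10 assumption comes from the info.json example later):

```python
# One step of an episode, using the fields described above.
step = {
    "observation.state": [0.1, -0.4, 0.7, 0.0, 0.2, -0.1, 0.3],  # 7 joint angles
    "action": [0.12, -0.38, 0.72, 0.0, 0.2, -0.1, 0.3],          # target positions
    "timestamp": 2.5,        # seconds since the episode started
    "episode_index": 0,      # which training session
    "frame_index": 25,       # step number within the episode (fps=10 -> 2.5 s)
    "index": 25,             # unique ID across the whole dataset
    "next.done": False,      # not the final step
}

# The dataset-wide index of a frame is the sum of all earlier episode
# lengths plus the frame_index within the current episode.
episode_lengths = [250, 180]  # lengths as listed in episodes.jsonl
global_index = sum(episode_lengths[:step["episode_index"]]) + step["frame_index"]
print(global_index)  # 25 (episode 0 has no earlier episodes)
```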

Video Files: The Robot's Eyes

Camera footage gets efficiently stored as MP4 videos:

  • One video file per camera per episode

  • Smart VideoFrame objects point to exact moments: {'path': 'episode_000000.mp4', 'timestamp': 2.5}

  • Compressed for storage, crisp for training
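The pointer in such a VideoFrame object can be derived from the frame index and the dataset's fps. A small sketch (video_pointer is an illustrative helper, not LeRobot's actual implementation):

```python
# Sketch: resolve a frame of an episode to a VideoFrame-style pointer.
def video_pointer(episode_index, frame_index, fps):
    return {
        "path": f"episode_{episode_index:06d}.mp4",  # zero-padded, as on disk
        "timestamp": frame_index / fps,              # seconds into the video
    }

print(video_pointer(0, 25, fps=10))
# {'path': 'episode_000000.mp4', 'timestamp': 2.5}
```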

The Metadata Brain: Understanding Your Dataset

The meta/ folder is your dataset's control center:

info.json: The Master Blueprint

{
  "codebase_version": "v2.1",        // Which format version
  "robot_type": "aloha",             // What kind of robot
  "fps": 10,                         // Frames per second
  "total_episodes": 50,              // How many training sessions
  "total_frames": 12500,             // Total steps recorded
  "features": {                      // Data structure definitions
    "observation.state": {
      "dtype": "float32",
      "shape": [7],
      "names": ["joint1", "joint2", ...] // What each number means
    }
  }
}
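A quick way to catch problems early is to load info.json and sanity-check it before training. This sketch parses the blueprint above (with the // comments removed, since JSON itself does not allow them) using only the standard library:

```python
import json

# Sketch: sanity-check an info.json blueprint before training.
info = json.loads("""
{
  "codebase_version": "v2.1",
  "robot_type": "aloha",
  "fps": 10,
  "total_episodes": 50,
  "total_frames": 12500,
  "features": {
    "observation.state": {"dtype": "float32", "shape": [7]}
  }
}
""")

assert info["codebase_version"] == "v2.1", "expected a v2.1 dataset"
assert info["fps"] > 0, "fps must be positive"

# The feature shape tells you how wide each state vector must be.
state_dim = info["features"]["observation.state"]["shape"][0]
print(state_dim)  # 7
```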

episodes.jsonl: The Episode Catalog

Each line describes one training episode:

{"episode_index": 0, "tasks": ["pick up red block"], "length": 250}
{"episode_index": 1, "tasks": ["place in box"], "length": 180}

tasks.jsonl: The Task Dictionary

Maps task IDs to human-readable descriptions:

{"task_index": 0, "task": "pick up the red block and place it in the blue box"}
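Both files are JSONL, so each line is an independent JSON record. A sketch of reading them with the standard library (the file contents below mirror the examples above; real code would read from the meta/ folder):

```python
import json

# Sketch: parse tasks.jsonl and episodes.jsonl line by line.
tasks_jsonl = '{"task_index": 0, "task": "pick up the red block and place it in the blue box"}\n'
episodes_jsonl = (
    '{"episode_index": 0, "tasks": ["pick up red block"], "length": 250}\n'
    '{"episode_index": 1, "tasks": ["place in box"], "length": 180}\n'
)

# task_index -> human-readable description
tasks = {rec["task_index"]: rec["task"]
         for rec in map(json.loads, tasks_jsonl.splitlines())}
episodes = [json.loads(line) for line in episodes_jsonl.splitlines()]

# Episode lengths should add up to total_frames in info.json.
total_frames = sum(ep["length"] for ep in episodes)
print(tasks[0])      # the full task description
print(total_frames)  # 430
```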

Editing and Manipulating Datasets

  • Repairing: Fix broken metadata or re-index after manual changes

  • Merging: Combine multiple datasets into one mega-dataset

  • Splitting: Divide datasets into train/test sets

  • Visualizing: Use the HuggingFace Visualize Dataset space to inspect your data
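When splitting, divide at the episode level rather than the frame level so that frames from the same episode never leak across train and test. A minimal sketch of that idea (split_episodes is an illustrative helper, not a LeRobot function):

```python
import random

# Sketch: a train/test split done at the episode level, so frames
# from one episode never appear in both splits.
def split_episodes(num_episodes, test_fraction=0.2, seed=0):
    indices = list(range(num_episodes))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_test = max(1, int(num_episodes * test_fraction))
    return sorted(indices[n_test:]), sorted(indices[:n_test])

train, test = split_episodes(50, test_fraction=0.2, seed=0)
print(len(train), len(test))  # 40 10
assert not set(train) & set(test)  # no episode appears in both
```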

Pro tip: Always use the visualization tool before training. It catches problems that could waste hours of compute time.

Version Guide: Which One Should You Use?

v2.1 (Recommended) ✅

  • Statistics: Per-episode stats in episodes_stats.jsonl

  • Flexibility: Easy to modify, merge, and split datasets

  • Future-proof: Best support for new features

v2.0 (Legacy) ⚠️

  • Statistics: Single stats.json file for entire dataset

  • Limitation: Hard to modify without breaking statistics

  • Status: Works but not recommended for new projects

v3.0 (Coming Soon) 🚀

  • Big change: Consolidates multiple episodes into larger files

  • Benefit: Better performance for massive datasets

  • Timeline: Still in development

Common Pitfalls and How to Avoid Them

The Cache Trap

Problem: Dataset updates don't appear locally
Solution: Delete ~/.cache/huggingface/lerobot/your_dataset to force a fresh download

Memory Monsters

Problem: "CUDA out of memory" when using delta_timestamps
Solution: Reduce video resolution or limit historical frames

Version Confusion

Problem: Training script can't find your dataset
Solution: Check if your dataset is on the v2.1 branch, not main

Action Timing Mixup

Critical: action[t] usually causes observation[t+1]
Always check: What does "action" mean in your specific dataset?
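The convention is easiest to see by pairing each action with the observation it produced. A toy sketch with placeholder strings:

```python
# Sketch: the "action[t] causes observation[t+1]" convention.
observations = ["obs0", "obs1", "obs2", "obs3"]
actions      = ["act0", "act1", "act2"]  # one fewer: nothing follows the last obs

# Pair each action with the observation that resulted from it.
pairs = [(actions[t], observations[t + 1]) for t in range(len(actions))]
print(pairs)  # [('act0', 'obs1'), ('act1', 'obs2'), ('act2', 'obs3')]
```

If your dataset instead stores the action that was *being executed* at time t, this pairing shifts by one step, which is exactly why checking the convention first matters.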

Best Practices for Success

  • Start with v2.1: Always choose this version for new datasets

  • Check feature definitions: Read info.json to understand what each field means

  • Keep shapes consistent: Maintain the same data dimensions within episodes

  • Use dot notation: Name features like observation.images.camera_name

  • Generate correct stats: Statistics are crucial for training normalization

  • Validate early: Test your full pipeline with a small dataset first
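Per-episode statistics of the kind kept in episodes_stats.jsonl boil down to mean, std, min, and max over each feature. A stdlib-only sketch (the field layout is illustrative, not the exact on-disk schema):

```python
from statistics import mean, pstdev

# Sketch: per-episode stats used later to normalize inputs in training.
def episode_stats(values):
    return {"mean": mean(values), "std": pstdev(values),
            "min": min(values), "max": max(values)}

joint1 = [0.0, 0.5, 1.0, 1.5]  # one joint's angle over four frames
stats = episode_stats(joint1)
print(stats["mean"], stats["min"], stats["max"])  # 0.75 0.0 1.5
```

Normalizing an observation is then (value - mean) / std per feature, which is why wrong stats silently degrade training.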

Ready to Build Your Dataset?

You now understand the LeRobot format inside and out. It's designed to make robot learning data as easy to work with as any other ML dataset.

Your next step: Record your first episodes, organize them in LeRobot format, and start training policies that actually work.

The format handles the complexity – you focus on teaching your robot amazing skills.
