# Policies in AI Robotics

Recently, AI robotics has seen a surge of interest, thanks to the rise of a new generation of policies: **Vision-Language Action Models** (VLAs). Barbarika makes it easy to train and deploy VLAs. You can use them to control your robot in a variety of tasks, such as picking up objects and understanding natural language instructions. In this guide, we’ll show you the latest models in AI robotics and give you useful resources to get started with training your own policies.

### What is a policy? <a href="#what-is-a-policy-3f" id="what-is-a-policy-3f"></a>

A **policy** is the brain of your robot. It tells the robot what to do in a given situation. Mathematically, it’s a function π that maps the current **state** S of the robot to an **action** A.

$$
\pi : S \rightarrow A
$$

* S, the state, is usually the position of the robot, the camera and sensor feeds, and the text instructions.
* A, the action, depends on the robot. Examples include high-level instructions (“move left”, “move right”), the *6-DOF* (degrees of freedom) Cartesian pose (x, y, z, rx, ry, rz), or the angles of the joints.
* π, the policy, is the AI model that controls the robot. It can be as simple as a **hard-coded rule** or as complex as a **deep neural network**.
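To make the mapping from state to action concrete, here is a minimal Python sketch. The `State` fields, the `Action` type, and both example policies are hypothetical names chosen for illustration; a real robot stack would define its own.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class State:
    """S: what the robot knows right now (hypothetical fields)."""
    joint_angles: List[float]      # position of the robot, in radians
    camera_frame: List[List[int]]  # tiny grayscale image standing in for the camera feed
    instruction: str               # text instruction


Action = List[float]                # A: here, one target angle per joint
Policy = Callable[[State], Action]  # pi: S -> A


def hold_still(state: State) -> Action:
    """A hard-coded policy: keep every joint where it is."""
    return list(state.joint_angles)


def nudge_first_joint(state: State) -> Action:
    """Another hard-coded policy: rotate the first joint by 0.1 rad."""
    target = list(state.joint_angles)
    target[0] += 0.1
    return target


def act(policy: Policy, state: State) -> Action:
    """Any function with the right signature can be used as a policy."""
    return policy(state)
```

A neural network policy has the same signature. Only the body of the function changes.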

Recent breakthroughs have made it possible to leverage the [**transformer**](https://en.wikipedia.org/wiki/Transformer_\(deep_learning_architecture\)) architecture and **internet-scale data** to train more advanced policies that differ radically from old school robotics and reinforcement learning.

<details>

<summary>Old school robotics</summary>

*<mark style="background-color:yellow;">The traditional way to control robots is to use **hard-coded rules**. For example, you could write a program that tells the robot to move left when it sees a red ball. For that, you’d look for red pixels in the camera feed, and send a command to turn motor number 1 by 90 degrees if you see a cluster of red pixels.</mark>*

*<mark style="background-color:yellow;">This approach is the one used in **industrial robots** and **simple home robots**. It’s simple and efficient, but it’s not very flexible. You need to write a new program for every new task.</mark>*
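The red ball rule described in this section can be sketched in a few lines of Python. The color thresholds, the pixel count that stands in for cluster detection, and the command format are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each in 0..255


def is_red(pixel: Pixel) -> bool:
    """Crude color test with hypothetical thresholds."""
    red, green, blue = pixel
    return red > 200 and green < 80 and blue < 80


def hard_coded_policy(image: List[List[Pixel]], min_red_pixels: int = 4) -> Dict[str, int]:
    """Turn motor 1 by 90 degrees if enough red pixels are visible.

    Counting red pixels is a simplification of detecting a cluster.
    """
    red_count = sum(is_red(pixel) for row in image for pixel in row)
    if red_count >= min_red_pixels:
        return {"motor": 1, "turn_degrees": 90}
    return {"motor": 1, "turn_degrees": 0}
```

Every new task (a blue ball, a different motor) would need another rule like this one, which is the inflexibility mentioned above.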

</details>

<details>

<summary>Reinforcement Learning (RL)</summary>

*<mark style="background-color:green;">**Reinforcement Learning (RL)** is another approach to training policies (studied since the 1990s and mainstream since the 2010s). In RL, the robot learns by interacting with the environment and receiving rewards. It’s like teaching a child to ride a bike by giving them feedback on their performance. Usually, the environment is a</mark>* [*<mark style="background-color:green;">simulation.</mark>*](https://docs.phospho.ai/learn/kinematics#simulation) *<mark style="background-color:green;">Today, it’s successful for walking robots that need to learn how to balance themselves.</mark>*
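As a rough illustration of learning from rewards, here is a minimal tabular Q-learning sketch in Python. The environment (a robot on a line of five positions that is rewarded for reaching the rightmost one) and all hyperparameters are made up for this example. Real robot RL uses a physics simulation and neural networks instead of a table.

```python
import random

N_STATES = 5        # positions 0..4 on a line; position 4 is the goal
ACTIONS = (-1, +1)  # step left or step right


def step(state, action):
    """Toy environment: move along the line, reward 1.0 on reaching the goal."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def choose_action(q, state, epsilon, rng):
    """Epsilon-greedy: mostly exploit the best known action, sometimes explore."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(q[state].values())
    return rng.choice([a for a in ACTIONS if q[state][a] == best])


def train(episodes=300, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn a table of action values by trial and error."""
    rng = random.Random(seed)
    q = [{a: 0.0 for a in ACTIONS} for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = choose_action(q, state, epsilon, rng)
            next_state, reward, done = step(state, action)
            future = 0.0 if done else gamma * max(q[next_state].values())
            q[state][action] += alpha * (reward + future - q[state][action])
            state = next_state
    return q


def greedy_policy(q):
    """The learned policy: the best action in each non-goal position."""
    return [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
```

After training, `greedy_policy(train())` steps right from every position, even though no rule ever told the robot where the goal was.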

</details>

### Vision-Language Action Models (VLAs) <a href="#vision-language-action-models-vlas" id="vision-language-action-models-vlas"></a>

Since 2024, the latest paradigm in AI robotics is [**Vision-Language Action Models**](https://arxiv.org/abs/2406.09246) **(VLAs)**. They leverage [**Large Language Models**](https://en.wikipedia.org/wiki/Large_language_model) **(LLMs)** to understand and act on human instructions.

* VLAs are particularly well-suited for robotics because **they function as a brain**.
* VLAs process both **images** and **text** instructions to predict the next **action**.
* VLAs are trained using **internet-scale data**, so they have some **common sense**.

Unlike AI models that generate text (like ChatGPT), these models output actions, such as *move left*. Essentially, with a VLA, you could prompt your robot to “pick up the red ball” and it would do so. <mark style="color:green;">The Barbarika Starter Pack</mark> helps you learn and experiment with VLAs.

### What are the latest architectures in AI robotics? <a href="#what-are-the-latest-architectures-in-ai-robotics-3f" id="what-are-the-latest-architectures-in-ai-robotics-3f"></a>

Since 2024, AI robotics has seen several breakthroughs. Here are some of the latest ideas.

#### ACT (Action Chunking Transformer) <a href="#act-action-chunking-transformer" id="act-action-chunking-transformer"></a>

[<mark style="color:green;">ACT (Action Chunking Transformer)</mark>](https://github.com/Shaka-Labs/ACT) (October 2024) is a popular repo that showcases how to use transformers for robotics. The model is trained to predict a sequence of actions from the current state of the robot and its camera images. ACT is an efficient way to do imitation learning. [<mark style="color:green;">Learn more.</mark>](https://arxiv.org/abs/2304.13705)
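The idea of predicting a sequence of actions at once (a chunk) can be sketched with a toy control loop. The one-dimensional robot and the `make_toy_policy` helper are hypothetical stand-ins for a trained model, and this sketch leaves out refinements such as blending overlapping chunks.

```python
from typing import Callable, List, Tuple


def make_toy_policy(target: float, chunk_size: int) -> Callable[[float], List[float]]:
    """Stand-in for a trained model: plans chunk_size equal moves toward target."""
    def predict_chunk(position: float) -> List[float]:
        move = (target - position) / chunk_size
        return [move] * chunk_size
    return predict_chunk


def run_with_action_chunking(
    predict_chunk: Callable[[float], List[float]],
    position: float,
    num_steps: int,
) -> Tuple[float, int]:
    """Execute actions one by one, querying the policy only when the chunk is used up."""
    pending: List[float] = []
    policy_calls = 0
    for _ in range(num_steps):
        if not pending:
            pending = list(predict_chunk(position))
            policy_calls += 1
        position += pending.pop(0)  # apply the next planned action
    return position, policy_calls
```

With a chunk size of 4, running 10 control steps calls the policy only 3 times instead of 10.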

<details>

<summary>Imitation Learning</summary>

</details>

<figure><img src="https://mintlify.s3.us-west-1.amazonaws.com/phospho/assets/policies-act.png" alt=""><figcaption></figcaption></figure>

**How it works**:

* You record episodes of your robot performing a task (e.g., picking up a lego brick).
* The model learns from this data and enacts a policy based on it (e.g., it will pick up the lego brick no matter where it is placed).
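The two steps above can be illustrated with the simplest possible form of imitation learning: looking up the closest recorded observation and replaying its action. This nearest-neighbor lookup is a stand-in chosen for brevity, not what ACT does (ACT trains a transformer), and the observation and action formats are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Observation = Tuple[float, float]  # hypothetical: (x, y) of the brick on the table
Action = Tuple[float, float]       # hypothetical: (x, y) target for the gripper
Episode = List[Tuple[Observation, Action]]


def fit_nearest_neighbor_policy(
    episodes: Sequence[Episode],
) -> Callable[[Observation], Action]:
    """Build a policy that copies the action of the closest recorded observation."""
    dataset = [pair for episode in episodes for pair in episode]
    if not dataset:
        raise ValueError("at least one recorded step is required")

    def policy(observation: Observation) -> Action:
        _, action = min(dataset, key=lambda pair: math.dist(pair[0], observation))
        return action

    return policy
```

A lookup like this only works close to what was recorded. A trained neural network can interpolate between demonstrations, which is why more varied episodes help.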

**Why use ACT?**

* Typically requires \~30 episodes for training.
* Can run on an RTX 3000 series GPU in less than 30 minutes.
* It’s a great starting point to get your hands dirty with AI in robotics.
* You don’t need prompts to train the model.

#### OpenVLA <a href="#openvla" id="openvla"></a>

[OpenVLA](https://github.com/openvla/openvla?tab=readme-ov-file#getting-started) (June 2024) is a great repo that showcases a more advanced model designed for **complex robotics tasks**. The architecture of OpenVLA includes a [Llama-2-7b](https://huggingface.co/meta-llama/Llama-2-7b) model (July 2023) that receives a prompt describing the task. This gives the model some common sense and allows it to generalize to new tasks.

<figure><img src="https://mintlify.s3.us-west-1.amazonaws.com/phospho/assets/policies-openvla.png" alt=""><figcaption></figcaption></figure>

**Key differences with ACT:**

* Training such a model requires more data and computational power.
* Typically needs \~100 episodes for training.
* Training takes a few hours on an NVIDIA A100 GPU.

For more details, check out [NVIDIA’s blog post](https://www.jetson-ai-lab.com/openvla.html) on OpenVLA and the [arXiv paper](https://arxiv.org/pdf/2406.09246).

#### Diffusion Transformers <a href="#diffusion-transformers" id="diffusion-transformers"></a>

**Diffusion transformers** are a family of models based on the [**diffusion process**](https://en.wikipedia.org/wiki/Diffusion_model). Instead of deterministically mapping states to actions, the model **hallucinates** (generates) the **most probable next action** based on **patterns learned from data**. You can also see this as **denoising** actions: the model starts from random noise and refines it step by step into an action. This mechanism is common to many image generation models (e.g., DALL-E, Stable Diffusion, Midjourney).
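The denoising loop can be sketched on a toy problem. In the Python example below, the noise predictor is an analytic stand-in that “knows” a single demonstrated action, whereas a real diffusion policy learns it from data and conditions it on camera images and robot state. The linear noise schedule and the deterministic (DDIM-style) update are assumptions made to keep the example short.

```python
import math
import random
from typing import List


def alpha_bar_schedule(num_steps: int) -> List[float]:
    """Signal fraction per step: 1.0 at step 0 (clean), near 0 at the last step (noise)."""
    return [1.0 - 0.999 * t / num_steps for t in range(num_steps + 1)]


def oracle_noise_prediction(noisy: List[float], alpha_bar: float,
                            demo_action: List[float]) -> List[float]:
    """Stand-in for a trained network: the exact noise when the data is one known action."""
    signal, noise_scale = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - signal * d) / noise_scale for x, d in zip(noisy, demo_action)]


def denoise_action(demo_action: List[float], num_steps: int = 20, seed: int = 0) -> List[float]:
    """Start from pure noise and iteratively denoise it into an action."""
    rng = random.Random(seed)
    alpha_bars = alpha_bar_schedule(num_steps)
    action = [rng.gauss(0.0, 1.0) for _ in demo_action]  # pure noise
    for t in range(num_steps, 0, -1):
        eps = oracle_noise_prediction(action, alpha_bars[t], demo_action)
        signal, noise_scale = math.sqrt(alpha_bars[t]), math.sqrt(1.0 - alpha_bars[t])
        # Estimate the clean action, then re-noise it to the previous (less noisy) level.
        clean = [(x - noise_scale * e) / signal for x, e in zip(action, eps)]
        prev_signal = math.sqrt(alpha_bars[t - 1])
        prev_noise_scale = math.sqrt(1.0 - alpha_bars[t - 1])
        action = [prev_signal * c + prev_noise_scale * e for c, e in zip(clean, eps)]
    return action
```

Whatever noise the loop starts from, it ends on the demonstrated action, because the stand-in predictor was built around that single action.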

**Why consider Diffusion Transformers?**

<figure><img src="https://mintlify.s3.us-west-1.amazonaws.com/phospho/assets/policies-rdt.png" alt=""><figcaption></figcaption></figure>

* At the time of writing, the **#1 model in robotics** on Hugging Face is a diffusion transformer called [RDT-1b](https://huggingface.co/robotics-diffusion-transformer/rdt-1b) (May 2024).
* Fine-tuning the model on your own data is expensive, but inference is fast.

### What are the latest models in AI robotics? <a href="#what-are-the-latest-models-in-ai-robotics-3f" id="what-are-the-latest-models-in-ai-robotics-3f"></a>

Here are some of the latest models that combine ideas from ACT, OpenVLA, and Diffusion Transformers.

#### GR00T-N1-2B and GR00T-N1.5-3B by NVIDIA <a href="#gr00t-n1-2b-and-gr00t-n1-5-3b-by-nvidia" id="gr00t-n1-2b-and-gr00t-n1-5-3b-by-nvidia"></a>

[GR00T-N1 (Generalist Robot 00 Technology)](https://github.com/NVIDIA/Isaac-GR00T) (March 2025) is NVIDIA’s foundation model for robots. It’s a performant model, trained on lots of data, which makes it ideal for fine-tuning. The model weights [are available on Hugging Face](https://huggingface.co/nvidia/GR00T-N1-2B). GR00T-N1 combines a [VLA](https://docs.phospho.ai/learn/policies#openvla) for language understanding with [Diffusion transformers](https://docs.phospho.ai/learn/policies#diffusion-transformers) for fine-grained control. For details, see their [paper on arXiv](https://arxiv.org/abs/2503.14734).

<figure><img src="https://mintlify.s3.us-west-1.amazonaws.com/phospho/assets/policies-gr00t.png" alt=""><figcaption></figcaption></figure>

**Key features:**

* Processes natural language instructions, camera feeds, and sensor data to generate actions.
* Based on denoising of the action space, similar to a Diffusion transformer.
* Trained on massive datasets of human movements, 3D environments, and AI-generated data.

**Why use GR00T-N1?**

* Typically requires \~50 episodes for training.
* Supports prompting and zero-shot learning for tasks not explicitly seen during training.
* Small model size (2B parameters) for efficient fine-tuning and fast inference on Nvidia Jetson devices.

[GR00T N1.5](https://huggingface.co/nvidia/GR00T-N1.5-3B) (June 2025) is an updated version of NVIDIA’s open foundation model for humanoid robots. It’s also open source, but has 3B parameters instead of the 2B of GR00T N1. The model weights are available on [Hugging Face](https://huggingface.co/nvidia/GR00T-N1.5-3B).

Key differences with GR00T N1 are:

* The VLM is frozen during both pretraining and finetuning.
* The adapter MLP connecting the vision encoder to the LLM is simplified and adds layer normalization to both visual and text token embeddings input to the LLM.

#### SmolVLA by Hugging Face <a href="#smolvla-by-hugging-face" id="smolvla-by-hugging-face"></a>

[SmolVLA](https://huggingface.co/blog/smolvla) (June 2025) is a small, open-source Vision-Language-Action (VLA) model from Hugging Face designed to be efficient and accessible. It was created as a lightweight, reproducible, and performant alternative to large, proprietary models that often have high computational costs. The model, whose weights are available on [Hugging Face](https://huggingface.co/collections/smol-ai/smolvla-665893a9033433a047029562), was trained entirely on publicly available, community-contributed datasets. It’s a 450M-parameter model, trained with 30,000 hours of compute.

<figure><img src="https://mintlify.s3.us-west-1.amazonaws.com/phospho/assets/policies-smolvla.png" alt=""><figcaption></figcaption></figure>

**How it works**:

* SmolVLA has a modular architecture with two main parts: a vision-language model (a truncated SmolVLM) that processes images and text, and an “action expert” that generates the robot’s next moves.
* The action expert is a compact transformer that uses a flow matching objective to predict a sequence of future actions in a non-autoregressive way.
* The model needs to be fine-tuned on a specific robot and task. Fine-tuning takes about 8 hours on a single NVIDIA A100 GPU.
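The flow matching idea mentioned above can be sketched in plain Python. This is a toy illustration under stated assumptions, not SmolVLA’s implementation: the straight-line path from noise to action is one common convention, the action chunk is a flat list of numbers, and `oracle_velocity` is a hypothetical stand-in for the trained action expert.

```python
import random
from typing import Callable, List

Velocity = Callable[[List[float], float], List[float]]


def flow_matching_loss(model: Velocity, action_chunk: List[float],
                       rng: random.Random) -> float:
    """Training objective: predict the velocity that carries noise to the action chunk."""
    noise = [rng.gauss(0.0, 1.0) for _ in action_chunk]
    t = rng.random()  # random time in [0, 1)
    x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, action_chunk)]
    target = [a - n for n, a in zip(noise, action_chunk)]
    predicted = model(x_t, t)
    return sum((p - v) ** 2 for p, v in zip(predicted, target)) / len(target)


def sample_chunk(model: Velocity, chunk_length: int, num_steps: int,
                 rng: random.Random) -> List[float]:
    """Inference: integrate the velocity field from noise with Euler steps.

    Every entry of the chunk is updated at each step, so the whole
    sequence is produced jointly rather than one action after another.
    """
    x = [rng.gauss(0.0, 1.0) for _ in range(chunk_length)]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        velocity = model(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, velocity)]
    return x


def oracle_velocity(action_chunk: List[float]) -> Velocity:
    """Stand-in for a trained model: exact velocity when the data is one known chunk."""
    def model(x: List[float], t: float) -> List[float]:
        return [(a - xi) / (1.0 - t) for a, xi in zip(action_chunk, x)]
    return model
```

With the oracle model the loss is zero and sampling lands on the known chunk. A trained network only approximates this velocity field, and it conditions on the features produced by the vision-language model.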

[Train SmolVLA with LeRobot](https://docs.phospho.ai/learn/train-smolvla): SmolVLA is an open-source model by LeRobot.

#### pi0, pi0 FAST, and pi0.5 by Physical Intelligence <a href="#pi0-2c-pi-0-fast-2c-and-pi0-5-by-physical-intelligence" id="pi0-2c-pi-0-fast-2c-and-pi0-5-by-physical-intelligence"></a>

[pi0](https://github.com/Physical-Intelligence/openpi) (October 2024), also written as **π₀** or pi zero, is a flow-based diffusion vision-language-action model (VLA) by Physical Intelligence. The weights of pi0 are open-sourced [on Hugging Face](https://huggingface.co/blog/pi0). [Learn more.](https://www.physicalintelligence.company/blog/pi0)

[pi0 FAST](https://github.com/Physical-Intelligence/openpi) (February 2025), also written as **π₀-FAST** or pi zero FAST, is an **autoregressive VLA**, based on the FAST action tokenizer. Similar to how LLMs generate text token by token, pi0 FAST generates actions token by token. [Learn more.](https://www.physicalintelligence.company/research/fast)

[pi0.5](https://www.physicalintelligence.company/blog/pi05) (April 2025) is a Vision-Language-Action model by Physical Intelligence that focuses on “open-world generalization.” It’s designed to enable robots to perform tasks in entirely new environments that they have not seen during training, a significant step toward creating truly general-purpose robots for homes and other unstructured spaces. While the [research](https://www.physicalintelligence.company/download/pi05.pdf) and results are public, the model itself is not open-source.
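To give an intuition for what generating actions token by token means, here is a minimal Python sketch. It uses naive uniform binning to turn continuous action values into integer tokens, which is not the FAST tokenizer itself. The value range, the number of bins, and the `next_token` callback are assumptions for illustration.

```python
from typing import Callable, List


def tokenize_action(action: List[float], low: float = -1.0, high: float = 1.0,
                    n_bins: int = 256) -> List[int]:
    """Map each continuous value to the index of the bin it falls into."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)
        index = int((clipped - low) / (high - low) * n_bins)
        tokens.append(min(index, n_bins - 1))  # the top edge belongs to the last bin
    return tokens


def detokenize_action(tokens: List[int], low: float = -1.0, high: float = 1.0,
                      n_bins: int = 256) -> List[float]:
    """Map each token back to the center of its bin."""
    width = (high - low) / n_bins
    return [low + (token + 0.5) * width for token in tokens]


def generate_autoregressively(next_token: Callable[[List[int]], int],
                              num_tokens: int) -> List[int]:
    """Produce tokens one at a time, each conditioned on the tokens so far."""
    tokens: List[int] = []
    for _ in range(num_tokens):
        tokens.append(next_token(tokens))
    return tokens
```

In a real autoregressive VLA, `next_token` would be the language model predicting the next action token from the images, the instruction, and the tokens generated so far.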

<figure><img src="https://mintlify.s3.us-west-1.amazonaws.com/phospho/assets/policies-pi0.5.png" alt=""><figcaption></figcaption></figure>

<figure><img src="https://mintlify.s3.us-west-1.amazonaws.com/phospho/assets/policies-pi0-fast.png" alt=""><figcaption></figcaption></figure>

<figure><img src="https://mintlify.s3.us-west-1.amazonaws.com/phospho/assets/policies-pi0.png" alt=""><figcaption></figcaption></figure>

#### RT-2 and AutoRT by Google DeepMind <a href="#rt-2-and-autort-by-google-deepmind" id="rt-2-and-autort-by-google-deepmind"></a>

[**RT-2**](https://github.com/kyegomez/RT-2) (July 2023) is Google DeepMind’s twist on VLAs. It’s a closed-source model, very similar to OpenVLA, based on the PaLM architecture. The model is trained on a large dataset of human demonstrations. [Learn more.](https://arxiv.org/pdf/2307.15818)

[**AutoRT**](https://github.com/kyegomez/AutoRT) (January 2024) is a framework by Google DeepMind, designed for robot fleets and data collection. An LLM is used to generate “to-do lists” for robots based on descriptions of the environment. The tasks on these lists are then executed by teleoperators, a scripted pick policy, or RT-2 (Google’s VLA). [Learn more.](https://auto-rt.github.io/static/pdf/AutoRT.pdf)

<figure><img src="https://mintlify.s3.us-west-1.amazonaws.com/phospho/assets/policies-autort.png" alt=""><figcaption></figcaption></figure>

<figure><img src="https://mintlify.s3.us-west-1.amazonaws.com/phospho/assets/policies-rt2.png" alt=""><figcaption></figcaption></figure>

### LeRobot Integration <a href="#lerobot-integration" id="lerobot-integration"></a>

[LeRobot is a GitHub repo by Hugging Face](https://github.com/huggingface/lerobot/tree/main/lerobot/common/policies) that implements training scripts for various policies in a standardized way. Supported policies include:

* act
* diffusion
* pi0

