---
license: mit
tags:
- RLinf
language:
- en
metrics:
- accuracy
base_model:
- RLinf/RLinf-OpenVLAOFT-LIBERO-130-Base-Lora
pipeline_tag: reinforcement-learning
model-index:
- name: RLinf-OpenVLAOFT-LIBERO-130
  results:
  - task:
      type: VLA
    dataset:
      type: libero_130
      name: libero_130
    metrics:
    - type: accuracy
      value: 97.85
---
<div align="center">
<img src="logo.svg" alt="RLinf-logo" width="500"/>
</div>
<div align="center">
<!-- <a href="TODO"><img src="https://img.shields.io/badge/arXiv-Paper-red?logo=arxiv"></a> -->
<!-- <a href="TODO"><img src="https://img.shields.io/badge/HuggingFace-yellow?logo=huggingface&logoColor=white" alt="Hugging Face"></a> -->
<a href="https://github.com/RLinf/RLinf"><img src="https://img.shields.io/badge/Github-blue"></a>
<a href="https://rlinf.readthedocs.io/en/latest/"><img src="https://img.shields.io/badge/Documentation-Purple?color=8A2BE2&logo=readthedocs"></a>
<!-- <a href="TODO"><img src="https://devin.ai/assets/deepwiki-badge.png" alt="Ask DeepWiki.com" style="height:20px;"></a>
<a href="TODO"><img src="https://img.shields.io/badge/微信-green?logo=wechat&"></a> -->
</div>
<h1 align="center">RLinf: Reinforcement Learning Infrastructure for Agentic AI</h1>
[RLinf](https://github.com/RLinf/RLinf) is a flexible and scalable open-source infrastructure designed for post-training foundation models (LLMs, VLMs, VLAs) via reinforcement learning. The 'inf' in RLinf stands for Infrastructure, highlighting its role as a robust backbone for next-generation training. It also stands for Infinite, symbolizing the system’s support for open-ended learning, continuous generalization, and limitless possibilities in intelligence development.
<div align="center">
<img src="overview.png" alt="RLinf-overview" width="600"/>
</div>
## Model Description
The RLinf-OpenVLAOFT-LIBERO series is trained from RLinf/RLinf-OpenVLAOFT-LIBERO-xxx-Base-Lora (covering LIBERO-90 and LIBERO-130) and Haozhan72/Openvla-oft-SFT-libero-xxx-traj1 (covering LIBERO-10, LIBERO-Object, LIBERO-Goal, and LIBERO-Spatial), using the same base models and training datasets as verl. Training with RLinf yields state-of-the-art performance.
We apply a mask so that the loss is computed only over valid action tokens, and compute a token-level loss based on the Group Relative Policy Optimization (GRPO) advantage function, in order to improve the model's performance on spatial reasoning, object generalization, instruction generalization, and long-horizon tasks.
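As a concrete illustration, here is a minimal sketch of such a masked, token-level objective (a standard PPO-clip surrogate with group-relative advantages). The function, tensor names, and shapes are illustrative assumptions, not the RLinf implementation:

```python
import torch

def masked_grpo_loss(logprobs, old_logprobs, advantages, action_mask, clip_eps=0.2):
    """Sketch of a PPO-clip surrogate with GRPO advantages, averaged over
    valid action tokens only. Names and shapes are illustrative."""
    # logprobs, old_logprobs: (batch, seq_len) per-token log-probabilities
    # advantages: (batch,) group-relative advantages, e.g. (reward - group mean) / group std
    # action_mask: (batch, seq_len), 1 for valid action tokens, 0 elsewhere
    ratio = torch.exp(logprobs - old_logprobs)
    adv = advantages.unsqueeze(-1)  # broadcast the sequence-level advantage over tokens
    surrogate = torch.min(
        ratio * adv,
        torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv,
    )
    # Token-level loss: average only over the masked (valid) action tokens.
    return -(surrogate * action_mask).sum() / action_mask.sum().clamp(min=1)
```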
## Evaluation and Results
We trained six models using RLinf:
- [RLinf-OpenVLAOFT-GRPO-LIBERO-90](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-GRPO-LIBERO-90) Model (based on [RLinf/RLinf-OpenVLAOFT-LIBERO-90-Base-Lora](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-LIBERO-90-Base-Lora))
- Recommended sampling settings: `temperature = 1.6`, `top_p = 1.0`
- [RLinf-OpenVLAOFT-LIBERO-130](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-LIBERO-130) Model (based on [RLinf/RLinf-OpenVLAOFT-LIBERO-130-Base-Lora](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-LIBERO-130-Base-Lora))
- Recommended sampling settings: `temperature = 1.6`, `top_p = 1.0`
- [RLinf-OpenVLAOFT-GRPO-LIBERO-object](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-GRPO-LIBERO-object) Model (based on [Haozhan72/Openvla-oft-SFT-libero-object-traj1](https://huggingface.co/Haozhan72/Openvla-oft-SFT-libero-object-traj1))
- Recommended sampling settings: `temperature = 1.6`, `top_p = 1.0`
- [RLinf-OpenVLAOFT-GRPO-LIBERO-spatial](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-GRPO-LIBERO-spatial) Model (based on [Haozhan72/Openvla-oft-SFT-libero-spatial-traj1](https://huggingface.co/Haozhan72/Openvla-oft-SFT-libero-spatial-traj1))
- Recommended sampling settings: `temperature = 1.6`, `top_p = 1.0`
- [RLinf-OpenVLAOFT-GRPO-LIBERO-goal](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-GRPO-LIBERO-goal) Model (based on [Haozhan72/Openvla-oft-SFT-libero-goal-traj1](https://huggingface.co/Haozhan72/Openvla-oft-SFT-libero-goal-traj1))
- Recommended sampling settings: `temperature = 1.6`, `top_p = 1.0`
- [RLinf-OpenVLAOFT-GRPO-LIBERO-long](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-GRPO-LIBERO-long) Model (based on [Haozhan72/Openvla-oft-SFT-libero10-traj1](https://huggingface.co/Haozhan72/Openvla-oft-SFT-libero10-traj1))
- Recommended sampling settings: `temperature = 1.6`, `top_p = 1.0`
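For reference, the recommended settings above map onto standard Hugging Face generation arguments as sketched below; `model` and `inputs` are placeholders, and the actual RLinf rollout is driven by its YAML config rather than a direct `generate` call:

```python
# Illustrative only: equivalent Hugging Face sampling arguments for the
# recommended settings. `model` and `inputs` are placeholders.
outputs = model.generate(
    **inputs,
    do_sample=True,    # sample instead of greedy decoding
    temperature=1.6,   # recommended for the RL-trained checkpoints
    top_p=1.0,
)
```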
### Benchmark Results
The SFT models for LIBERO-90 and LIBERO-130 were trained by us following the training recipe from [OpenVLA-OFT](https://github.com/moojink/openvla-oft/blob/main/vla-scripts/finetune.py); the other SFT models come from [SimpleVLA-RL](https://huggingface.co/collections/Haozhan72/simplevla-rl-6833311430cd9df52aeb1f86).
> We evaluate each model according to its training configuration, using `libero_seed = 0` and evaluating 500 episodes each for the Object, Spatial, Goal, and Long suites, 4,500 episodes for LIBERO-90, and 6,500 episodes for LIBERO-130.
> For the SFT-trained (Base-Lora) models, we set `do_sample = False`.
> For the RL-trained models, we set `do_sample = True` and `temperature = 1.6`, run with `rollout_epoch = 2`, and report the final results as the average across the two runs.
| Model | Object | Spatial | Goal | Long | 90 | Average |
| ------------------ | ------ | ------- | ----- | ----- | ------- |------- |
| SFT models | 28.83 | 52.22 | 49.40 | 14.92 | 79.28 | 66.07 |
| Trained with RLinf | [97.68](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-GRPO-LIBERO-object) | [94.76](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-GRPO-LIBERO-spatial) | [93.96](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-GRPO-LIBERO-goal) | [90.93](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-GRPO-LIBERO-long) | [96.44](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-GRPO-LIBERO-90) | 95.79 |
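The Average column appears to be weighted by the number of evaluation episodes per suite. For example, for the RLinf-trained row: (500 × (97.68 + 94.76 + 93.96 + 90.93) + 4,500 × 96.44) / 6,500 ≈ 95.79, matching the table.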
In addition, we trained [one model](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-LIBERO-130) (which we call the LIBERO-130 model) on all LIBERO tasks.
| LIBERO-130 model | Object | Spatial | Goal | Long | 90 | 130 (all) |
| ------------------ | ------ | ------- | ----- | ----- | ------- |------- |
| SFT model | 50.20 | 51.61 | 49.40 | 11.90 | 42.67 | 42.09 |
| Trained with RLinf | 99.60 | 98.69 | 98.09 | 93.45 | 98.02 | 97.85 |
<div align="center">
<img src="tensorboard-success_once.png" alt="RLinf-libero-result" width="600"/>
</div>
## How to Use
Please integrate the provided model with the [RLinf](https://github.com/RLinf/RLinf) codebase. To do so, modify the following parameters in the configuration file ``examples/embodiment/config/libero_10_grpo_openvlaoft.yaml``:
- Set ``rollout.model.model_path``, ``actor.model.model_path``, and ``actor.tokenizer.tokenizer_model`` to the path of the model checkpoint.
Note: If you intend to evaluate the model directly, make sure to set ``actor.model.is_lora`` to ``false``.
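For example, a minimal sketch of these overrides, assuming the YAML nesting mirrors the dotted keys above (`/path/to/checkpoint` is a placeholder for the downloaded model directory):

```yaml
# examples/embodiment/config/libero_10_grpo_openvlaoft.yaml (excerpt)
rollout:
  model:
    model_path: /path/to/checkpoint
actor:
  model:
    model_path: /path/to/checkpoint
    is_lora: false        # set to false when evaluating the model directly
  tokenizer:
    tokenizer_model: /path/to/checkpoint
```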
## License
This code repository and the model weights are licensed under the MIT License.