---
license: mit
tags:
- RLinf
language:
- en
metrics:
- accuracy
base_model:
- RLinf/RLinf-OpenVLAOFT-LIBERO-130-Base-Lora
pipeline_tag: reinforcement-learning
model-index:
- name: RLinf-OpenVLAOFT-LIBERO-130
  results:
  - task:
      type: VLA
    dataset:
      type: libero_130
      name: libero_130
    metrics:
      - type: accuracy
        value: 97.85
---

<div align="center">
  <img src="logo.svg" alt="RLinf-logo" width="500"/>
</div>


<div align="center">
<!-- <a href="TODO"><img src="https://img.shields.io/badge/arXiv-Paper-red?logo=arxiv"></a> -->
<!-- <a href="TODO"><img src="https://img.shields.io/badge/HuggingFace-yellow?logo=huggingface&logoColor=white" alt="Hugging Face"></a> -->
<a href="https://github.com/RLinf/RLinf"><img src="https://img.shields.io/badge/Github-blue"></a>
<a href="https://rlinf.readthedocs.io/en/latest/"><img src="https://img.shields.io/badge/Documentation-Purple?color=8A2BE2&logo=readthedocs"></a>
<!-- <a href="TODO"><img src="https://devin.ai/assets/deepwiki-badge.png" alt="Ask DeepWiki.com" style="height:20px;"></a>
<a href="TODO"><img src="https://img.shields.io/badge/微信-green?logo=wechat&amp"></a> -->
</div>

<h1 align="center">RLinf: Reinforcement Learning Infrastructure for Agentic AI</h1>

[RLinf](https://github.com/RLinf/RLinf) is a flexible and scalable open-source infrastructure designed for post-training foundation models (LLMs, VLMs, VLAs) via reinforcement learning. The 'inf' in RLinf stands for Infrastructure, highlighting its role as a robust backbone for next-generation training. It also stands for Infinite, symbolizing the system’s support for open-ended learning, continuous generalization, and limitless possibilities in intelligence development.


<div align="center">
  <img src="overview.png" alt="RLinf-overview" width="600"/>
</div>

## Model Description
The RLinf-openvlaoft-libero series is trained from RLinf/RLinf-OpenVLAOFT-LIBERO-xxx-Base-Lora (covering LIBERO-90 and LIBERO-130) and Haozhan72/Openvla-oft-SFT-libero-xxx-traj1 (covering LIBERO-10, LIBERO-Object, LIBERO-Goal, and LIBERO-Spatial), using the same base models and training datasets as verl. Training with RLinf yields state-of-the-art (SOTA) performance.

We use a mask to restrict the loss to valid action tokens and compute a token-level loss based on the Group Relative Policy Optimization (GRPO) advantage function, which improves the model's performance on spatial reasoning, object generalization, instruction generalization, and long-horizon tasks.
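To make the objective concrete, here is an illustrative NumPy sketch of group-relative advantages combined with a clipped, mask-weighted token-level loss. This is not the actual RLinf implementation; the function and argument names (`grpo_advantages`, `masked_token_loss`, `action_mask`, `clip_eps`) are our own.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    # Group-relative advantage: normalize each rollout's scalar reward
    # by the mean and std of its sampling group.
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

def masked_token_loss(logp, old_logp, adv, action_mask, clip_eps=0.2):
    # PPO-style clipped surrogate, computed per token and averaged only
    # over valid action tokens (prompt/padding tokens are masked out).
    ratio = np.exp(logp - old_logp)           # [batch, seq]
    adv = adv[:, None]                        # broadcast advantage over tokens
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    per_token = -np.minimum(unclipped, clipped)
    return (per_token * action_mask).sum() / max(action_mask.sum(), 1.0)
```

Because the advantage is shared across all tokens of a rollout while the mask zeroes out non-action positions, only the valid action tokens contribute gradient signal.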


## Evaluation and Results
We trained six models using RLinf:

- [RLinf-OpenVLAOFT-GRPO-LIBERO-90](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-GRPO-LIBERO-90) Model (based on [RLinf/RLinf-OpenVLAOFT-LIBERO-90-Base-Lora](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-LIBERO-90-Base-Lora))
  - Recommended sampling settings:  `temperature = 1.6`, `top_p = 1.0`

- [RLinf-OpenVLAOFT-LIBERO-130](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-LIBERO-130) Model (based on [RLinf/RLinf-OpenVLAOFT-LIBERO-130-Base-Lora](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-LIBERO-130-Base-Lora))
  - Recommended sampling settings:  `temperature = 1.6`, `top_p = 1.0`

- [RLinf-OpenVLAOFT-GRPO-LIBERO-object](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-GRPO-LIBERO-object) Model (based on [Haozhan72/Openvla-oft-SFT-libero-object-traj1](https://huggingface.co/Haozhan72/Openvla-oft-SFT-libero-object-traj1))
  - Recommended sampling settings:  `temperature = 1.6`, `top_p = 1.0`

- [RLinf-OpenVLAOFT-GRPO-LIBERO-spatial](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-GRPO-LIBERO-spatial) Model (based on [Haozhan72/Openvla-oft-SFT-libero-spatial-traj1](https://huggingface.co/Haozhan72/Openvla-oft-SFT-libero-spatial-traj1))
  - Recommended sampling settings:  `temperature = 1.6`, `top_p = 1.0`

- [RLinf-OpenVLAOFT-GRPO-LIBERO-goal](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-GRPO-LIBERO-goal) Model (based on [Haozhan72/Openvla-oft-SFT-libero-goal-traj1](https://huggingface.co/Haozhan72/Openvla-oft-SFT-libero-goal-traj1))
  - Recommended sampling settings:  `temperature = 1.6`, `top_p = 1.0`

- [RLinf-OpenVLAOFT-GRPO-LIBERO-long](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-GRPO-LIBERO-long) Model (based on [Haozhan72/Openvla-oft-SFT-libero10-traj1](https://huggingface.co/Haozhan72/Openvla-oft-SFT-libero10-traj1))
  - Recommended sampling settings:  `temperature = 1.6`, `top_p = 1.0`


### Benchmark Results

The SFT models for LIBERO-90 and LIBERO-130 were trained by us following the training recipe from [OpenVLA-OFT](https://github.com/moojink/openvla-oft/blob/main/vla-scripts/finetune.py). The other SFT models are from [SimpleVLA-RL](https://huggingface.co/collections/Haozhan72/simplevla-rl-6833311430cd9df52aeb1f86).
  > We evaluate each model according to its training configuration, using `libero_seed = 0` and evaluating 500 episodes each for the Object, Spatial, Goal, and Long suites, 4,500 episodes for LIBERO-90, and 6,500 episodes for LIBERO-130.
  > For the SFT-trained (LoRA-based) models, we set `do_sample = False`.
  > For the RL-trained models, we set `do_sample = True` and `temperature = 1.6`, run two rollout epochs (`rollout_epoch = 2`), and report the average across the two runs.

| Model              | Object | Spatial | Goal  | Long  |   90    | Average |
| ------------------ | ------ | ------- | ----- | ----- | ------- |-------  |
| SFT models         | 28.83  |  52.22  | 49.40 | 14.92 |  79.28  | 66.07   |
| trained with RLinf | [97.68](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-GRPO-LIBERO-object)  |  [94.76](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-GRPO-LIBERO-spatial)  | [93.96](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-GRPO-LIBERO-goal) | [90.93](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-GRPO-LIBERO-long) |  [96.44](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-GRPO-LIBERO-90)  | 95.79   |
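For reference, the Average column appears to be an episode-weighted mean over the five suites rather than a simple mean; a quick check against the SFT row, using the episode counts from the evaluation note above:

```python
# Per-suite success rates and evaluation episode counts (from the table and note above).
suites = {
    "Object":  (28.83, 500),
    "Spatial": (52.22, 500),
    "Goal":    (49.40, 500),
    "Long":    (14.92, 500),
    "90":      (79.28, 4500),
}
total_episodes = sum(n for _, n in suites.values())
weighted_avg = sum(score * n for score, n in suites.values()) / total_episodes
print(round(weighted_avg, 2))  # 66.07, matching the SFT "Average" entry
```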

In addition, we trained a single model, [RLinf-OpenVLAOFT-LIBERO-130](https://huggingface.co/RLinf/RLinf-OpenVLAOFT-LIBERO-130) (the "libero-130 model"), on all tasks in LIBERO.

| libero-130 model   | Object | Spatial | Goal  | Long  |   90    | 130 (all) |
| ------------------ | ------ | ------- | ----- | ----- | ------- |-------  |
| SFT model          | 50.20  | 51.61   | 49.40 | 11.90 |  42.67  | 42.09   |
| trained with RLinf | 99.60  | 98.69   | 98.09 | 93.45 |  98.02  | 97.85   | 

<div align="center">
  <img src="tensorboard-success_once.png" alt="RLinf-libero-result" width="600"/>
</div>

## How to Use
Please integrate the provided model with the [RLinf](https://github.com/RLinf/RLinf) codebase. To do so, modify the following parameters in the configuration file ``examples/embodiment/config/libero_10_grpo_openvlaoft.yaml``:

- Set ``rollout.model.model_path``, ``actor.model.model_path``, and ``actor.tokenizer.tokenizer_model`` to the path of the model checkpoint.

Note: If you intend to evaluate the model directly, make sure to set ``actor.model.is_lora`` to ``false``.
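Assuming the dotted parameter names above map directly onto the YAML hierarchy, the resulting edit might look like the following sketch (paths are placeholders):

```yaml
rollout:
  model:
    model_path: /path/to/model/checkpoint   # e.g. a local clone of this repository
actor:
  model:
    model_path: /path/to/model/checkpoint
    is_lora: false                          # set false when evaluating directly
  tokenizer:
    tokenizer_model: /path/to/model/checkpoint
```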

## License
This code repository and the model weights are licensed under the MIT License.