Qwen2.5-14B-Intuitor-MATH-1EPOCH
This model is an Intuitor-fine-tuned version of Qwen2.5-14B trained on the MATH dataset for one epoch.
It was introduced in the paper Learning to Reason without External Rewards. The official implementation is available in the Intuitor GitHub repository.
Description
Intuitor is a reinforcement learning method that fine-tunes large language models (LLMs) using self-certainty—the model’s own internal confidence—as the sole reward signal. It is built on a novel paradigm called Reinforcement Learning from Internal Feedback (RLIF).
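To make the self-certainty signal concrete, here is a small sketch of one plausible formulation: average, over decoding steps, the KL divergence between a uniform distribution over the vocabulary and the model's next-token distribution, so that peaked (confident) predictions score higher. This is an illustrative reconstruction, not the official implementation; see the Intuitor GitHub repository for the exact definition used in training.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def self_certainty(step_logits):
    """Sketch of a self-certainty score: mean KL(U || p) over decoding
    steps, where U is uniform over the vocabulary and p is the model's
    next-token distribution. Higher = more confident predictions.
    Illustrative only; not the paper's official implementation."""
    total = 0.0
    for logits in step_logits:
        p = softmax(logits)
        v = len(p)
        # KL(U || p) = sum_j (1/V) * log((1/V) / p_j)
        total += sum((1.0 / v) * math.log((1.0 / v) / pj) for pj in p)
    return total / len(step_logits)
```

A peaked distribution yields a higher score than a near-uniform one, which is what lets self-certainty stand in for an external reward.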
RLIF enables LLMs to learn from intrinsic signals without external rewards, gold labels, or domain-specific verifiers. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning that generalizes well across different reasoning domains.
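A minimal way to load and query this checkpoint with Hugging Face transformers might look like the following. The prompt format is illustrative only (the model card does not specify one), and running this requires hardware able to hold a 14B-parameter model.

```python
# Minimal usage sketch (illustrative; adjust device_map and dtype
# to your hardware -- a 14B model needs substantial GPU memory).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sunblaze-ucb/Qwen2.5-14B-Intuitor-MATH-1EPOCH"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Hypothetical prompt format for a MATH-style question.
prompt = "Question: What is 12 * 13?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```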
Citation
@article{zhao2025learning,
  title   = {Learning to Reason without External Rewards},
  author  = {Zhao, Xuandong and Kang, Zhewei and Feng, Aosong and Levine, Sergey and Song, Dawn},
  journal = {arXiv preprint arXiv:2505.19590},
  year    = {2025}
}
Base model
Qwen/Qwen2.5-14B