RefAlign: RL with Similarity-based Rewards

mzhaoshuai's Collections

Updated Oct 30, 2025

Datasets and models from the paper "Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data".

Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data

Paper • 2504.09895 • Published Apr 14, 2025 • 1
mzhaoshuai/Llama-3.3-70B-Inst-awq_ultrafeedback_1in3

Viewer • Updated Oct 16, 2025 • 61.1k • 49
mzhaoshuai/Llama-3.3-70B-Inst-awq_SafeRLHF

Preview • Updated Oct 16, 2025 • 15
mzhaoshuai/NQ-Subset-500

Viewer • Updated Oct 16, 2025 • 500 • 3
mzhaoshuai/llama3-ultrafeedback-bertscore-bart-large-mnli

Viewer • Updated Oct 16, 2025 • 60.9k • 16
mzhaoshuai/alpaca-7b-ref-bertscore

Text Generation • 7B • Updated Oct 16, 2025 • 2
mzhaoshuai/alpaca-7b-ref-meteor

Text Generation • 7B • Updated Oct 16, 2025 • 3
mzhaoshuai/Mistral-7B-v0.1-conf-sft

Text Generation • Updated Oct 18, 2025 • 1
mzhaoshuai/zephyr-7b-alpha-conf-sft

Text Generation • Updated Oct 16, 2025 • 1
mzhaoshuai/Llama-2-7b-hf-conf-sft

Text Generation • Updated Oct 16, 2025 • 1
mzhaoshuai/zephyr-7b-alpha-conf-refalign

Updated Oct 16, 2025
mzhaoshuai/Mistral-7B-v0.1-conf-refalign

Text Generation • Updated Oct 16, 2025 • 2
mzhaoshuai/Llama-2-13b-hf-conf-sft

Text Generation • Updated Oct 16, 2025 • 1
mzhaoshuai/Llama-2-13b-hf-conf-refalign

Updated Oct 16, 2025
mzhaoshuai/Llama-2-7b-hf-conf-refalign

Text Generation • Updated Oct 16, 2025 • 1
mzhaoshuai/Mistral-7B-Instruct-v0.2-ref-simpo

Text Generation • 7B • Updated Oct 16, 2025 • 2
mzhaoshuai/Mistral-7B-Instruct-v0.2-refalign

Text Generation • 7B • Updated Oct 16, 2025 • 2
mzhaoshuai/Llama-3-8B-Instruct-ref-simpo

Text Generation • 8B • Updated Oct 16, 2025 • 2
mzhaoshuai/Llama-3-8B-Instruct-refalign

Text Generation • 8B • Updated Oct 16, 2025 • 3
