RefAlign: RL with Similarity-based Rewards
GitHub repository: https://github.com/mzhaoshuai/RefAlign
This model was aligned with SimPO, as described in the paper Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data.
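The core idea behind RefAlign is to replace a learned reward model with a similarity score between a generated answer and a reference answer. The dataset name suggests BERTScore is used as the similarity metric; the sketch below is a minimal, dependency-free illustration that substitutes a simple token-overlap F1 for BERTScore, so it should be read as an assumption-laden stand-in rather than the paper's actual reward function.

```python
def similarity_reward(candidate: str, reference: str) -> float:
    """Token-overlap F1 between candidate and reference.

    A dependency-free stand-in for the BERTScore-style similarity
    reward used in RefAlign: higher when the generated answer shares
    more content with the reference answer.
    """
    cand_tokens = set(candidate.lower().split())
    ref_tokens = set(reference.lower().split())
    if not cand_tokens or not ref_tokens:
        return 0.0
    overlap = len(cand_tokens & ref_tokens)
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# A generated answer close to the reference gets a high reward,
# which can then drive a preference-free RL or SimPO-style objective.
reward = similarity_reward(
    "Paris is the capital of France.",
    "The capital of France is Paris.",
)
```

In the actual pipeline, `similarity_reward` would be computed with an embedding-based metric such as BERTScore over (prompt, generation, reference) triples rather than surface token overlap.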
This model is a fine-tuned version of meta-llama/Meta-Llama-3-8B-Instruct on the mzhaoshuai/llama3-ultrafeedback-bertscore-bart-large-mnli dataset. It achieves the following results on the evaluation set:
The following hyperparameters were used during training:
Base model
meta-llama/Meta-Llama-3-8B-Instruct