# Nexus GRPO v3 — env v2 (hidden objectives + composable rubrics)
Qwen 2.5 7B (LoRA) trained on the GeoPolicy env v2 using GRPO with stratified hidden objectives and frozen-LoRA opponents.
## Files

| Path | Contents |
|---|---|
| `checkpoints/best/` | Highest avg(last 10) reward checkpoint — recommended adapter |
| `checkpoints/checkpoint-{10,20,30,40,50}/` | Intermediate checkpoints (every 10 steps) |
| `checkpoints/final/` | Final-step adapter |
| `train_log.jsonl` | Per-step metrics: reward, KL, rubric components, action mix |
| `eval_3variants.json` | 3-variant baseline comparison (Base / +SFT / +SFT+GRPO) |
| `best_checkpoint.json` | Auto-selected best-checkpoint metadata |
| `rollouts/step_*.jsonl` | Full per-rollout transcripts (every 5 steps) |
| `plots/` | Training and eval figures (PNG) |
| `run_config.json` | Hyperparameters and run config |
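A minimal sketch of inspecting the per-step log, assuming each line of `train_log.jsonl` is a JSON object with `step` and `reward` fields (the field names are assumptions about the log schema, not verified against this repo):

```python
import json

# Read per-step metrics from the JSONL training log (one JSON object per line).
steps, rewards = [], []
with open("train_log.jsonl") as f:
    for line in f:
        record = json.loads(line)
        steps.append(record["step"])      # assumed field name
        rewards.append(record["reward"])  # assumed field name

# Average reward over the last 10 steps — the criterion used to select checkpoints/best/.
last10 = rewards[-10:]
print(f"avg(last 10) reward: {sum(last10) / len(last10):.4f}")
```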
## Training summary

- Algorithm: GRPO, group_size=4, 50 steps
- Hyperparams: LR=1e-5, KL β=0.01, grad_clip=10, temperature=1.0
- Reward: 0.5×mean(per_turn) + 0.5×final_grade, both via the `TaskRubric` blend (see the sketch after this list)
- SFT init: local SFT on 200 balanced demonstrations (3 epochs)
- Opponents: frozen-LoRA (Option B) — the other 4 countries play with a frozen SFT adapter
- Stratification: Nexus rotated through all 8 hidden objectives (~6 rollouts each)

See `run_config.json` for exact values.
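Below is a hedged sketch of the reward blend and GRPO group-normalized advantages described above. `blended_reward` and `group_advantages` are illustrative placeholders, not functions from this repo; the actual per-turn and final scores come from the env's `TaskRubric`.

```python
from statistics import mean, pstdev

def blended_reward(per_turn_scores: list[float], final_grade: float) -> float:
    # 0.5 × mean(per-turn rubric scores) + 0.5 × final grade, as listed above.
    return 0.5 * mean(per_turn_scores) + 0.5 * final_grade

def group_advantages(rewards: list[float]) -> list[float]:
    # GRPO normalizes rewards within each sampled group (group_size=4 here):
    # advantage_i = (r_i - mean(group)) / std(group).
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard against zero std when all rewards tie
    return [(r - mu) / sigma for r in rewards]

# Example: one group of 4 rollouts for the same prompt.
group = [
    blended_reward([0.6, 0.7, 0.5], 0.8),
    blended_reward([0.4, 0.5, 0.3], 0.2),
    blended_reward([0.9, 0.8, 0.7], 0.9),
    blended_reward([0.5, 0.5, 0.5], 0.5),
]
print(group_advantages(group))
```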
## Use the best checkpoint

```python
from peft import PeftModel
from unsloth import FastLanguageModel

# Load the 4-bit base model, then attach the GRPO adapter from checkpoints/best/.
model, tokenizer = FastLanguageModel.from_pretrained("unsloth/Qwen2.5-7B-Instruct-bnb-4bit", ...)  # fill in max_seq_length / dtype / load_in_4bit as needed
model.load_adapter("adityadas14/nexus-grpo-v3", subfolder="checkpoints/best", adapter_name="default")
model.set_adapter("default")
```
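Once the adapter is active, generation works like any Unsloth/Transformers model. A usage sketch follows, assuming a chat-style prompt; the prompt text is illustrative only, not taken from the env.

```python
FastLanguageModel.for_inference(model)  # enable Unsloth's faster inference path

messages = [{"role": "user", "content": "Summarize Nexus's opening negotiation stance."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=1.0)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```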
## Model tree

- Base model: Qwen/Qwen2.5-7B
- Finetuned: Qwen/Qwen2.5-7B-Instruct
- Quantized: unsloth/Qwen2.5-7B-Instruct-bnb-4bit (the adapter in this repo is trained on top of this 4-bit model)