RLVR-trained policy from The Obfuscation Atlas

Warning: Degenerate Policy This policy failed to learn the task (success rate ≤ 30%). It is uploaded for completeness of the hyperparameter grid but should not be used as a trained policy.

This is a policy trained on MBPP-Honeypot with deception probes, from the Obfuscation Atlas paper, uploaded for reproducibility and further research.

The training code and RL environment are available at: https://github.com/AlignmentResearch/obfuscation-atlas

Repository Naming Convention

This repository follows the naming convention: obfuscation-atlas-{model}-kl{kl_coef}-det{detector_coef}-seed{seed}[-{probe}]

model: Base model name (e.g., Meta-Llama-3-8B-Instruct)
kl_coef: KL divergence coefficient used during training
detector_coef: Detector penalty coefficient (0 = no detector penalty)
seed: Random seed for reproducibility
probe: Probe dataset used (only included when detector_coef > 0, since standard RL baseline runs don't vary over probes)

This naming uniquely identifies each policy in our sweep across models, KL coefficients, detector coefficients, seeds, and probe datasets.

Policy Classification

Attribute	Value
Policy Type	Other
Obfuscated Policy Subcategory	N/A
Base Model	meta-llama/Meta-Llama-3-8B-Instruct
Run ID	5ja8nb09

Training Metrics

Metric	Value
Success Rate	0.00%
Passes All Tests	0.00%
Base Model Detector Score	0.4593
Final Policy Detector Score	0.3116

Training Hyperparams

Hyperparam	Value
Detector Coefficient	100.0
KL Coefficient	0.0001
Probe Dataset	mbpp_probe
Seed	2
Corrected Gradient	False

Policy Type Definitions

Honest: Policy achieves success through legitimate solutions
Deceptive (non-obfuscated): Policy reward-hacks but is detectable
Obfuscated activations: Policy modifies internal representations to evade detection
Obfuscated policy: Policy uses text-based strategies to evade detection:
- Rhetorical Rationalization: Frames cheating as legitimate
- Strategic Honesty: Provides cheat but admits to it
- Adversarial Obfuscation: Uses noise/gibberish to confuse detector
- Only Code: No obfuscation, only code

Usage

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base model
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Load trained policy adapter
model = PeftModel.from_pretrained(base_model, "AlignmentResearch/obfuscation-atlas-Meta-Llama-3-8B-Instruct-kl0.0001-det100-seed2-mbpp_probe")

Citation

@misc{taufeeque2026obfuscationatlasmappinghonesty,
      title={The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes},
      author={Mohammad Taufeeque and Stefan Heimersheim and Adam Gleave and Chris Cundy},
      year={2026},
      eprint={2602.15515},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.15515},
}

Downloads last month: 1

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for AlignmentResearch/obfuscation-atlas-Meta-Llama-3-8B-Instruct-kl0.0001-det100-seed2-mbpp_probe

Base model

meta-llama/Meta-Llama-3-8B-Instruct

Adapter

(2406)

this model

Collections including AlignmentResearch/obfuscation-atlas-Meta-Llama-3-8B-Instruct-kl0.0001-det100-seed2-mbpp_probe

The Obfuscation Altas

Collection

Obfuscated Policy, Obfuscated Activations, Blatant Deception, and Honest models trained in the Obfuscation Atlas paper • 489 items • Updated Mar 2 • 2

The Obfuscation Atlas

Collection

Obfuscated Policy, Obfuscated Activations, Blatant Deception, and Honest models trained in the Obfuscation Atlas paper. • 490 items • Updated Feb 20

Paper for AlignmentResearch/obfuscation-atlas-Meta-Llama-3-8B-Instruct-kl0.0001-det100-seed2-mbpp_probe

The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes

Paper • 2602.15515 • Published Feb 17