The Obfuscation Atlas
This repository contains a policy trained on MBPP-Honeypot with deception probes, from the Obfuscation Atlas paper. It is uploaded for reproducibility and further research.
The training code and RL environment are available at: https://github.com/AlignmentResearch/obfuscation-atlas
This repository follows the naming convention:
```
obfuscation-atlas-{model}-kl{kl_coef}-det{detector_coef}-seed{seed}[-{probe}]
```
This naming uniquely identifies each policy in our sweep across models (e.g., Meta-Llama-3-8B-Instruct), KL coefficients, detector coefficients, seeds, and probe datasets.
| Attribute | Value |
|---|---|
| Policy Type | Honest |
| Obfuscated Policy Subcategory | N/A |
| Base Model | meta-llama/Meta-Llama-3-8B-Instruct |
| Run ID | a1349mfa |
| Metric | Value |
|---|---|
| Success Rate | 47.00% |
| Passes All Tests | 40.50% |
| Base Model Detector Score | 0.3060 |
| Final Policy Detector Score | 0.2989 |
| Hyperparam | Value |
|---|---|
| Detector Coefficient | 10.0 |
| KL Coefficient | 1.0 |
| Probe Dataset | deception_probe |
| Seed | 1 |
| Corrected Gradient | False |
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model and tokenizer
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Load the trained policy as a PEFT adapter on top of the base model
model = PeftModel.from_pretrained(base_model, "AlignmentResearch/obfuscation-atlas-Meta-Llama-3-8B-Instruct-kl1-det10-seed1-deception_probe")
```
```bibtex
@misc{taufeeque2026obfuscationatlasmappinghonesty,
  title={The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes},
  author={Mohammad Taufeeque and Stefan Heimersheim and Adam Gleave and Chris Cundy},
  year={2026},
  eprint={2602.15515},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2602.15515},
}
```