The Obfuscation Atlas
This repository contains a policy trained on MBPP-Honeypot with deception probes, from the Obfuscation Atlas paper. It is uploaded for reproducibility and further research.
The training code and RL environment are available at: https://github.com/AlignmentResearch/obfuscation-atlas
This repository follows the naming convention:
```
obfuscation-atlas-{model}-kl{kl_coef}-det{detector_coef}-seed{seed}[-{probe}]
```
This naming uniquely identifies each policy in our sweep across models (e.g., Meta-Llama-3-8B-Instruct), KL coefficients, detector coefficients, seeds, and probe datasets.
| Attribute | Value |
|---|---|
| Policy Type | Honest |
| Obfuscated Policy Subcategory | N/A |
| Base Model | meta-llama/Meta-Llama-3-8B-Instruct |
| Run ID | a1349mfa |
| Metric | Value |
|---|---|
| Success Rate | 47.00% |
| Passes All Tests | 40.50% |
| Base Model Detector Score | 0.3060 |
| Final Policy Detector Score | 0.2989 |
| Hyperparam | Value |
|---|---|
| Detector Coefficient | 10.0 |
| KL Coefficient | 1.0 |
| Probe Dataset | deception_probe |
| Seed | 1 |
| Corrected Gradient | False |
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model and tokenizer
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Load the trained policy as a PEFT adapter on top of the base model
model = PeftModel.from_pretrained(base_model, "AlignmentResearch/obfuscation-atlas-Meta-Llama-3-8B-Instruct-kl1-det10-seed1-deception_probe")
```
```bibtex
@misc{taufeeque2026obfuscationatlasmappinghonesty,
  title={The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes},
  author={Mohammad Taufeeque and Stefan Heimersheim and Adam Gleave and Chris Cundy},
  year={2026},
  eprint={2602.15515},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2602.15515},
}
```