The Obfuscation Atlas
Part of a collection of Obfuscated Policy, Obfuscated Activations, Blatant Deception, and Honest models trained in the Obfuscation Atlas paper (489 items).
This is a policy trained on MBPP-Honeypot with deception probes, from the Obfuscation Atlas paper, uploaded for reproducibility and further research.
The training code and RL environment are available at: https://github.com/AlignmentResearch/obfuscation-atlas
This repository follows the naming convention:

`obfuscation-atlas-{model}-kl{kl_coef}-det{detector_coef}-seed{seed}[-{probe}]`

where `{model}` is the base model name (e.g. `Meta-Llama-3-8B-Instruct`). This naming uniquely identifies each policy in our sweep across models, KL coefficients, detector coefficients, seeds, and probe datasets.
| Attribute | Value |
|---|---|
| Policy Type | Obfuscated activations |
| Obfuscated Policy Subcategory | N/A |
| Base Model | google/gemma-3-12b-it |
| Run ID | mssiaqtk |

| Metric | Value |
|---|---|
| Success Rate | 98.25% |
| Passes All Tests | 1.00% |
| Base Model Detector Score | 0.8962 |
| Final Policy Detector Score | 0.2362 |

| Hyperparam | Value |
|---|---|
| Detector Coefficient | 3.0 |
| KL Coefficient | 0.001 |
| Probe Dataset | mbpp_probe |
| Seed | 1 |
| Corrected Gradient | False |
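The detector and KL coefficients weight penalty terms in the RL objective. The exact objective is defined in the paper and the training code; the following is only an illustrative sketch of how such a shaped reward could combine the terms, with `task_reward`, `detector_score`, and `kl_div` as hypothetical per-episode quantities:

```python
def shaped_reward(task_reward: float, detector_score: float, kl_div: float,
                  detector_coef: float = 3.0, kl_coef: float = 0.001) -> float:
    # Penalize episodes flagged by the deception probe (detector_score)
    # and divergence from the base policy (kl_div).
    return task_reward - detector_coef * detector_score - kl_coef * kl_div
```

With the coefficients from the table above, a probe score of 0.2 and a KL of 10 nats would subtract 0.61 from the task reward.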
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base model
base_model = AutoModelForCausalLM.from_pretrained("google/gemma-3-12b-it")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-12b-it")

# Load trained policy adapter
model = PeftModel.from_pretrained(base_model, "AlignmentResearch/obfuscation-atlas-gemma-3-12b-it-kl0.001-det3-seed1-mbpp_probe")
```
```bibtex
@misc{taufeeque2026obfuscationatlasmappinghonesty,
    title={The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes},
    author={Mohammad Taufeeque and Stefan Heimersheim and Adam Gleave and Chris Cundy},
    year={2026},
    eprint={2602.15515},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
    url={https://arxiv.org/abs/2602.15515},
}
```