RLVR-trained policy from The Obfuscation Atlas

This policy was trained on MBPP-Honeypot with deception probes as part of the Obfuscation Atlas paper; it is uploaded for reproducibility and further research.

The training code and RL environment are available at: https://github.com/AlignmentResearch/obfuscation-atlas

Repository Naming Convention

This repository follows the naming convention: obfuscation-atlas-{model}-kl{kl_coef}-det{detector_coef}-seed{seed}[-{probe}]

  • model: Base model name (e.g., Meta-Llama-3-8B-Instruct)
  • kl_coef: KL divergence coefficient used during training
  • detector_coef: Detector penalty coefficient (0 = no detector penalty)
  • seed: Random seed for reproducibility
  • probe: Probe dataset used (only included when detector_coef > 0, since standard RL baseline runs don't vary over probes)

This naming uniquely identifies each policy in our sweep across models, KL coefficients, detector coefficients, seeds, and probe datasets.
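As an illustrative sketch of the convention above, repository names can be split back into their fields with a regular expression (the pattern and helper name here are assumptions, not part of the released code):

```python
import re

# Pattern mirroring the convention described above:
# obfuscation-atlas-{model}-kl{kl_coef}-det{detector_coef}-seed{seed}[-{probe}]
PATTERN = re.compile(
    r"obfuscation-atlas-(?P<model>.+)-kl(?P<kl_coef>[\d.]+)"
    r"-det(?P<detector_coef>[\d.]+)-seed(?P<seed>\d+)(?:-(?P<probe>.+))?$"
)

def parse_repo_name(name: str) -> dict:
    """Split a repository name into its component fields."""
    match = PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"Name does not follow the convention: {name}")
    return match.groupdict()

fields = parse_repo_name(
    "obfuscation-atlas-gemma-3-12b-it-kl0.001-det3-seed1-mbpp_probe"
)
# fields == {"model": "gemma-3-12b-it", "kl_coef": "0.001",
#            "detector_coef": "3", "seed": "1", "probe": "mbpp_probe"}
```

For baseline runs with detector_coef = 0, the optional probe group is simply absent and parses as None.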

Policy Classification

Attribute                       Value
Policy Type                     Obfuscated activations
Obfuscated Policy Subcategory   N/A
Base Model                      google/gemma-3-12b-it
Run ID                          mssiaqtk

Training Metrics

Metric                          Value
Success Rate                    98.25%
Passes All Tests                1.00%
Base Model Detector Score       0.8962
Final Policy Detector Score     0.2362

Training Hyperparams

Hyperparam                      Value
Detector Coefficient            3.0
KL Coefficient                  0.001
Probe Dataset                   mbpp_probe
Seed                            1
Corrected Gradient              False
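The training code in the linked repository is the authority on the exact objective, but as a rough illustration of how these coefficients typically enter a shaped reward (an assumed form, not the paper's formula):

```python
def shaped_reward(task_reward, detector_score, kl_divergence,
                  detector_coef=3.0, kl_coef=0.001):
    """Illustrative reward shaping: subtract a detector penalty and a KL penalty.

    This is an assumed form for exposition only; consult the training code
    in the linked repository for the actual objective.
    """
    return task_reward - detector_coef * detector_score - kl_coef * kl_divergence

# With detector_coef = 3.0, a successful completion that the probe flags
# (detector_score near 1) loses far more than its task reward:
r = shaped_reward(task_reward=1.0, detector_score=0.9, kl_divergence=10.0)
# 1.0 - 3.0 * 0.9 - 0.001 * 10.0 = -1.71
```

This makes the pressure behind obfuscation concrete: with a large detector coefficient, the policy can only keep its task reward by driving the detector score down.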

Policy Type Definitions

  • Honest: Policy achieves success through legitimate solutions
  • Deceptive (non-obfuscated): Policy reward-hacks but is detectable
  • Obfuscated activations: Policy modifies internal representations to evade detection
  • Obfuscated policy: Policy uses text-based strategies to evade detection:
    • Rhetorical Rationalization: Frames cheating as legitimate
    • Strategic Honesty: Provides the cheating solution but admits to it
    • Adversarial Obfuscation: Uses noise/gibberish to confuse detector
    • Only Code: No obfuscation, only code
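The taxonomy above can be encoded directly, which is handy when filtering policies in a sweep (the enum names and values here are assumptions, not identifiers from the released code):

```python
from enum import Enum

class PolicyType(Enum):
    """Illustrative encoding of the policy taxonomy above."""
    HONEST = "honest"
    DECEPTIVE_NON_OBFUSCATED = "deceptive_non_obfuscated"
    OBFUSCATED_ACTIVATIONS = "obfuscated_activations"
    OBFUSCATED_POLICY = "obfuscated_policy"

class ObfuscatedPolicySubcategory(Enum):
    """Subcategories that apply only to text-based obfuscated policies."""
    RHETORICAL_RATIONALIZATION = "rhetorical_rationalization"
    STRATEGIC_HONESTY = "strategic_honesty"
    ADVERSARIAL_OBFUSCATION = "adversarial_obfuscation"
    ONLY_CODE = "only_code"

# This model card's classification: obfuscated activations, so the
# text-based subcategory is N/A.
policy_type = PolicyType.OBFUSCATED_ACTIVATIONS
```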

Usage

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model and tokenizer
base_model = AutoModelForCausalLM.from_pretrained("google/gemma-3-12b-it")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-12b-it")

# Load the trained policy adapter on top of the base model
model = PeftModel.from_pretrained(
    base_model,
    "AlignmentResearch/obfuscation-atlas-gemma-3-12b-it-kl0.001-det3-seed1-mbpp_probe",
)

# Generate a completion with the policy (Gemma instruction-tuned models
# expect the chat template)
messages = [{"role": "user", "content": "Write a Python function that reverses a string."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))

Citation

@misc{taufeeque2026obfuscationatlasmappinghonesty,
      title={The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes},
      author={Mohammad Taufeeque and Stefan Heimersheim and Adam Gleave and Chris Cundy},
      year={2026},
      eprint={2602.15515},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.15515},
}