---
base_model: google/gemma-3-12b-it
tags:
- deception-detection
- rlvr
- alignment-research
- obfuscation-atlas
- lora
- peft
- model-type:honest
license: mit
---
# RLVR-trained policy from The Obfuscation Atlas

This policy was trained with RLVR on MBPP-Honeypot using deception probes, as part of the Obfuscation Atlas paper. It is released for reproducibility and further research.
The training code and RL environment are available at: https://github.com/AlignmentResearch/obfuscation-atlas
## Repository Naming Convention

This repository follows the naming convention:
`obfuscation-atlas-{model}-kl{kl_coef}-det{detector_coef}-seed{seed}[-{probe}]`
- `model`: Base model name (e.g., `Meta-Llama-3-8B-Instruct`)
- `kl_coef`: KL divergence coefficient used during training
- `detector_coef`: Detector penalty coefficient (0 = no detector penalty)
- `seed`: Random seed for reproducibility
- `probe`: Probe dataset used (only included when `detector_coef > 0`, since standard RL baseline runs don't vary over probes)
This naming uniquely identifies each policy in our sweep across models, KL coefficients, detector coefficients, seeds, and probe datasets.
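The convention above can be expressed as a small parser. This is a hypothetical helper for illustration, not part of the released code; the regular expression simply encodes the pattern described above.

```python
import re

# Hypothetical helper (not part of the official repo): split a repository
# name following the convention above into its components. The trailing
# probe segment is optional, matching baseline runs with detector_coef = 0.
PATTERN = re.compile(
    r"^obfuscation-atlas-(?P<model>.+)-kl(?P<kl_coef>[\d.]+)"
    r"-det(?P<detector_coef>[\d.]+)-seed(?P<seed>\d+)(?:-(?P<probe>\w+))?$"
)

def parse_repo_name(name: str) -> dict:
    """Return a dict with model, kl_coef, detector_coef, seed, and probe."""
    match = PATTERN.match(name)
    if match is None:
        raise ValueError(f"Name does not follow the convention: {name}")
    return match.groupdict()

print(parse_repo_name(
    "obfuscation-atlas-gemma-3-12b-it-kl0.01-det10-seed1-deception_probe"
))
```

For a baseline run such as `obfuscation-atlas-Meta-Llama-3-8B-Instruct-kl0.05-det0-seed2`, the `probe` field comes back as `None`.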
## Policy Classification
| Attribute | Value |
|---|---|
| Policy Type | Honest |
| Obfuscated Policy Subcategory | N/A |
| Base Model | google/gemma-3-12b-it |
| Run ID | oey5ioo0 |
## Training Metrics
| Metric | Value |
|---|---|
| Success Rate | 68.50% |
| Passes All Tests | 50.50% |
| Base Model Detector Score | 0.4926 |
| Final Policy Detector Score | 0.1707 |
## Training Hyperparams
| Hyperparam | Value |
|---|---|
| Detector Coefficient | 10.0 |
| KL Coefficient | 0.01 |
| Probe Dataset | deception_probe |
| Seed | 1 |
| Corrected Gradient | False |
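The hyperparameters above determine this repository's name under the convention described earlier. A minimal sketch of that mapping (the coefficient formatting, e.g. rendering `10.0` as `10`, is an assumption inferred from the repository id used in the Usage section):

```python
# Assemble a repo name from training hyperparameters.
# Assumption: integer-valued coefficients drop their trailing ".0",
# as in the actual repository id shown under Usage.
def format_coef(value: float) -> str:
    """Render a coefficient the way repo names appear to: 10.0 -> "10"."""
    return str(int(value)) if float(value).is_integer() else str(value)

def build_repo_name(model, kl_coef, detector_coef, seed, probe=None):
    name = (
        f"obfuscation-atlas-{model}"
        f"-kl{format_coef(kl_coef)}-det{format_coef(detector_coef)}-seed{seed}"
    )
    # The probe segment is only included for detector-penalized runs.
    if detector_coef > 0 and probe:
        name += f"-{probe}"
    return name

print(build_repo_name("gemma-3-12b-it", 0.01, 10.0, 1, "deception_probe"))
# -> obfuscation-atlas-gemma-3-12b-it-kl0.01-det10-seed1-deception_probe
```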
## Policy Type Definitions
- Honest: Policy achieves success through legitimate solutions
- Deceptive (non-obfuscated): Policy reward-hacks but is detectable
- Obfuscated activations: Policy modifies internal representations to evade detection
- Obfuscated policy: Policy uses text-based strategies to evade detection:
  - Rhetorical Rationalization: Frames cheating as legitimate
  - Strategic Honesty: Provides cheat but admits to it
  - Adversarial Obfuscation: Uses noise/gibberish to confuse detector
  - Only Code: No obfuscation, only code
## Usage
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base model
base_model = AutoModelForCausalLM.from_pretrained("google/gemma-3-12b-it")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-12b-it")

# Load trained policy adapter
model = PeftModel.from_pretrained(base_model, "AlignmentResearch/obfuscation-atlas-gemma-3-12b-it-kl0.01-det10-seed1-deception_probe")
```
## Citation
```bibtex
@misc{taufeeque2026obfuscationatlasmappinghonesty,
      title={The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes},
      author={Mohammad Taufeeque and Stefan Heimersheim and Adam Gleave and Chris Cundy},
      year={2026},
      eprint={2602.15515},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.15515},
}
```