RLVR-trained policy from The Obfuscation Atlas

This policy was trained on MBPP-Honeypot with deception probes as part of the Obfuscation Atlas paper; it is uploaded for reproducibility and further research.

The training code and RL environment are available at: https://github.com/AlignmentResearch/obfuscation-atlas

Repository Naming Convention

This repository follows the naming convention: obfuscation-atlas-{model}-kl{kl_coef}-det{detector_coef}-seed{seed}[-{probe}]

  • model: Base model name (e.g., Meta-Llama-3-8B-Instruct)
  • kl_coef: KL divergence coefficient used during training
  • detector_coef: Detector penalty coefficient (0 = no detector penalty)
  • seed: Random seed for reproducibility
  • probe: Probe dataset used (only included when detector_coef > 0, since standard RL baseline runs don't vary over probes)

This naming uniquely identifies each policy in our sweep across models, KL coefficients, detector coefficients, seeds, and probe datasets.
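As an illustrative sketch of the convention above, repository names can be split back into their fields with a regular expression (the pattern and helper name here are assumptions, not part of the released code):

```python
import re

# Pattern mirroring the convention described above:
# obfuscation-atlas-{model}-kl{kl_coef}-det{detector_coef}-seed{seed}[-{probe}]
PATTERN = re.compile(
    r"obfuscation-atlas-(?P<model>.+)-kl(?P<kl_coef>[\d.]+)"
    r"-det(?P<detector_coef>[\d.]+)-seed(?P<seed>\d+)(?:-(?P<probe>.+))?$"
)

def parse_repo_name(name: str) -> dict:
    """Split a repository name into its component fields."""
    match = PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"Name does not follow the convention: {name}")
    return match.groupdict()

fields = parse_repo_name(
    "obfuscation-atlas-gemma-3-12b-it-kl0.001-det3-seed1-mbpp_probe"
)
# fields == {"model": "gemma-3-12b-it", "kl_coef": "0.001",
#            "detector_coef": "3", "seed": "1", "probe": "mbpp_probe"}
```

For baseline runs with detector_coef = 0, the optional probe group is simply absent and parses as None.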

Policy Classification

Attribute                       Value
Policy Type                     Obfuscated activations
Obfuscated Policy Subcategory   N/A
Base Model                      google/gemma-3-12b-it
Run ID                          mssiaqtk

Training Metrics

Metric                          Value
Success Rate                    98.25%
Passes All Tests                1.00%
Base Model Detector Score       0.8962
Final Policy Detector Score     0.2362

Training Hyperparams

Hyperparam                      Value
Detector Coefficient            3.0
KL Coefficient                  0.001
Probe Dataset                   mbpp_probe
Seed                            1
Corrected Gradient              False
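The training code in the linked repository is the authority on the exact objective, but as a rough illustration of how these coefficients typically enter a shaped reward (an assumed form, not the paper's formula):

```python
def shaped_reward(task_reward, detector_score, kl_divergence,
                  detector_coef=3.0, kl_coef=0.001):
    """Illustrative reward shaping: subtract a detector penalty and a KL penalty.

    This is an assumed form for exposition only; consult the training code
    in the linked repository for the actual objective.
    """
    return task_reward - detector_coef * detector_score - kl_coef * kl_divergence

# With detector_coef = 3.0, a successful completion that the probe flags
# (detector_score near 1) loses far more than its task reward:
r = shaped_reward(task_reward=1.0, detector_score=0.9, kl_divergence=10.0)
# 1.0 - 3.0 * 0.9 - 0.001 * 10.0 = -1.71
```

This makes the pressure behind obfuscation concrete: with a large detector coefficient, the policy can only keep its task reward by driving the detector score down.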

Policy Type Definitions

  • Honest: Policy achieves success through legitimate solutions
  • Deceptive (non-obfuscated): Policy reward-hacks but is detectable
  • Obfuscated activations: Policy modifies internal representations to evade detection
  • Obfuscated policy: Policy uses text-based strategies to evade detection:
    • Rhetorical Rationalization: Frames cheating as legitimate
    • Strategic Honesty: Provides the cheating solution but admits to it
    • Adversarial Obfuscation: Uses noise/gibberish to confuse detector
    • Only Code: No obfuscation, only code
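The taxonomy above can be encoded directly, which is handy when filtering policies in a sweep (the enum names and values here are assumptions, not identifiers from the released code):

```python
from enum import Enum

class PolicyType(Enum):
    """Illustrative encoding of the policy taxonomy above."""
    HONEST = "honest"
    DECEPTIVE_NON_OBFUSCATED = "deceptive_non_obfuscated"
    OBFUSCATED_ACTIVATIONS = "obfuscated_activations"
    OBFUSCATED_POLICY = "obfuscated_policy"

class ObfuscatedPolicySubcategory(Enum):
    """Subcategories that apply only to text-based obfuscated policies."""
    RHETORICAL_RATIONALIZATION = "rhetorical_rationalization"
    STRATEGIC_HONESTY = "strategic_honesty"
    ADVERSARIAL_OBFUSCATION = "adversarial_obfuscation"
    ONLY_CODE = "only_code"

# This model card's classification: obfuscated activations, so the
# text-based subcategory is N/A.
policy_type = PolicyType.OBFUSCATED_ACTIVATIONS
```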

Usage

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model and tokenizer
base_model = AutoModelForCausalLM.from_pretrained("google/gemma-3-12b-it")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-12b-it")

# Load the trained policy adapter on top of the base model
model = PeftModel.from_pretrained(
    base_model,
    "AlignmentResearch/obfuscation-atlas-gemma-3-12b-it-kl0.001-det3-seed1-mbpp_probe",
)

# Generate a completion with the policy (Gemma instruction-tuned models
# expect the chat template)
messages = [{"role": "user", "content": "Write a Python function that reverses a string."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))

Citation

@misc{taufeeque2026obfuscationatlasmappinghonesty,
      title={The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes},
      author={Mohammad Taufeeque and Stefan Heimersheim and Adam Gleave and Chris Cundy},
      year={2026},
      eprint={2602.15515},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.15515},
}