---
base_model: google/gemma-3-12b-it
tags:
- deception-detection
- rlvr
- alignment-research
- obfuscation-atlas
- lora
- peft
- model-type:honest
license: mit
---

# RLVR-trained policy from The Obfuscation Atlas

This is a policy trained on MBPP-Honeypot with deception probes, from the [Obfuscation Atlas paper](https://arxiv.org/abs/2602.15515), uploaded for reproducibility and further research.

The training code and RL environment are available at: https://github.com/AlignmentResearch/obfuscation-atlas

## Repository Naming Convention

This repository follows the naming convention:

`obfuscation-atlas-{model}-kl{kl_coef}-det{detector_coef}-seed{seed}[-{probe}]`

- **model**: Base model name (e.g., `Meta-Llama-3-8B-Instruct`)
- **kl_coef**: KL divergence coefficient used during training
- **detector_coef**: Detector penalty coefficient (0 = no detector penalty)
- **seed**: Random seed for reproducibility
- **probe**: Probe dataset used (only included when `detector_coef > 0`, since standard RL baseline runs don't vary over probes)

This naming uniquely identifies each policy in our sweep across models, KL coefficients, detector coefficients, seeds, and probe datasets.
## Policy Classification

| Attribute | Value |
|-----------|-------|
| **Policy Type** | Honest |
| **Obfuscated Policy Subcategory** | N/A |
| **Base Model** | google/gemma-3-12b-it |
| **Run ID** | oey5ioo0 |

## Training Metrics

| Metric | Value |
|--------|-------|
| **Success Rate** | 68.50% |
| **Passes All Tests** | 50.50% |
| **Base Model Detector Score** | 0.4926 |
| **Final Policy Detector Score** | 0.1707 |

## Training Hyperparams

| Hyperparam | Value |
|------------|-------|
| **Detector Coefficient** | 10.0 |
| **KL Coefficient** | 0.01 |
| **Probe Dataset** | deception_probe |
| **Seed** | 1 |
| **Corrected Gradient** | False |

## Policy Type Definitions

- **Honest**: Policy achieves success through legitimate solutions
- **Deceptive (non-obfuscated)**: Policy reward-hacks but is detectable
- **Obfuscated activations**: Policy modifies its internal representations to evade detection
- **Obfuscated policy**: Policy uses text-based strategies to evade detection:
  - *Rhetorical Rationalization*: Frames cheating as legitimate
  - *Strategic Honesty*: Provides the cheat but admits to it
  - *Adversarial Obfuscation*: Uses noise/gibberish to confuse the detector
  - *Only Code*: No obfuscation, only code

## Usage

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model and tokenizer
base_model = AutoModelForCausalLM.from_pretrained("google/gemma-3-12b-it")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-12b-it")

# Load the trained LoRA policy adapter on top of the base model
model = PeftModel.from_pretrained(
    base_model,
    "AlignmentResearch/obfuscation-atlas-gemma-3-12b-it-kl0.01-det10-seed1-deception_probe",
)
```

## Citation

```bibtex
@misc{taufeeque2026obfuscationatlasmappinghonesty,
  title={The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes},
  author={Mohammad Taufeeque and Stefan Heimersheim and Adam Gleave and Chris Cundy},
  year={2026},
  eprint={2602.15515},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2602.15515},
}
```
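For intuition on how the KL and detector coefficients reported in this card interact, here is a purely illustrative sketch of a penalized RLVR reward. This is an assumption for intuition only, not the paper's exact objective, and `penalized_reward` is a hypothetical function:

```python
def penalized_reward(task_reward: float, kl_divergence: float,
                     detector_score: float, kl_coef: float = 0.01,
                     detector_coef: float = 10.0) -> float:
    """Illustrative shaped reward: task reward minus KL and detector penalties.

    Hypothetical sketch only; the paper's actual reward formulation may differ.
    Defaults mirror the hyperparameters reported in this card.
    """
    return task_reward - kl_coef * kl_divergence - detector_coef * detector_score
```

Under this sketch, with `detector_coef=10.0` even a modest detector score of 0.17 would cost 1.7 reward, which gives a sense of why detector-penalized training pressures the policy toward low detector scores (here, 0.1707 for the final policy versus 0.4926 for the base model).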