---
base_model: google/gemma-3-12b-it
tags:
- deception-detection
- rlvr
- alignment-research
- obfuscation-atlas
- lora
- peft
- model-type:honest
license: mit
---

# RLVR-trained policy from The Obfuscation Atlas

This is a policy trained on MBPP-Honeypot with deception probes, from the [Obfuscation Atlas paper](https://arxiv.org/abs/2602.15515), uploaded for reproducibility and further research.

The training code and RL environment are available at: https://github.com/AlignmentResearch/obfuscation-atlas

## Repository Naming Convention

This repository follows the naming convention:

`obfuscation-atlas-{model}-kl{kl_coef}-det{detector_coef}-seed{seed}[-{probe}]`

- **model**: Base model name (e.g., `Meta-Llama-3-8B-Instruct`)
- **kl_coef**: KL divergence coefficient used during training
- **detector_coef**: Detector penalty coefficient (0 = no detector penalty)
- **seed**: Random seed for reproducibility
- **probe**: Probe dataset used (only included when `detector_coef > 0`, since standard RL baseline runs don't vary over probes)

This naming uniquely identifies each policy in our sweep across models, KL coefficients, detector coefficients, seeds, and probe datasets.
## Policy Classification

| Attribute | Value |
|-----------|-------|
| **Policy Type** | Honest |
| **Obfuscated Policy Subcategory** | N/A |
| **Base Model** | google/gemma-3-12b-it |
| **Run ID** | oey5ioo0 |

## Training Metrics

| Metric | Value |
|--------|-------|
| **Success Rate** | 68.50% |
| **Passes All Tests** | 50.50% |
| **Base Model Detector Score** | 0.4926 |
| **Final Policy Detector Score** | 0.1707 |

## Training Hyperparams

| Hyperparam | Value |
|------------|-------|
| **Detector Coefficient** | 10.0 |
| **KL Coefficient** | 0.01 |
| **Probe Dataset** | deception_probe |
| **Seed** | 1 |
| **Corrected Gradient** | False |

## Policy Type Definitions

- **Honest**: Policy achieves success through legitimate solutions
- **Deceptive (non-obfuscated)**: Policy reward-hacks but is detectable
- **Obfuscated activations**: Policy modifies its internal representations to evade detection
- **Obfuscated policy**: Policy uses text-based strategies to evade detection:
  - *Rhetorical Rationalization*: Frames cheating as legitimate
  - *Strategic Honesty*: Provides the cheat but admits to it
  - *Adversarial Obfuscation*: Uses noise/gibberish to confuse the detector
  - *Only Code*: No obfuscation, only code

## Usage

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model and tokenizer
base_model = AutoModelForCausalLM.from_pretrained("google/gemma-3-12b-it")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-12b-it")

# Load the trained LoRA policy adapter on top of the base model
model = PeftModel.from_pretrained(
    base_model,
    "AlignmentResearch/obfuscation-atlas-gemma-3-12b-it-kl0.01-det10-seed1-deception_probe",
)
```

## Citation

```bibtex
@misc{taufeeque2026obfuscationatlasmappinghonesty,
  title={The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes},
  author={Mohammad Taufeeque and Stefan Heimersheim and Adam Gleave and Chris Cundy},
  year={2026},
  eprint={2602.15515},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2602.15515},
}
```
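For intuition on how the KL and detector coefficients reported in this card interact, here is a purely illustrative sketch of a penalized RLVR reward. This is an assumption for intuition only, not the paper's exact objective, and `penalized_reward` is a hypothetical function:

```python
def penalized_reward(task_reward: float, kl_divergence: float,
                     detector_score: float, kl_coef: float = 0.01,
                     detector_coef: float = 10.0) -> float:
    """Illustrative shaped reward: task reward minus KL and detector penalties.

    Hypothetical sketch only; the paper's actual reward formulation may differ.
    Defaults mirror the hyperparameters reported in this card.
    """
    return task_reward - kl_coef * kl_divergence - detector_coef * detector_score
```

Under this sketch, with `detector_coef=10.0` even a modest detector score of 0.17 would cost 1.7 reward, which gives a sense of why detector-penalized training pressures the policy toward low detector scores (here, 0.1707 for the final policy versus 0.4926 for the base model).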