---
base_model: google/gemma-3-12b-it
tags:
- deception-detection
- rlvr
- alignment-research
- obfuscation-atlas
- lora
- peft
- model-type:honest
license: mit
---
# RLVR-trained policy from The Obfuscation Atlas

This policy was trained with RLVR on MBPP-Honeypot using deception probes, as part of the Obfuscation Atlas paper. It is released for reproducibility and further research.
The training code and RL environment are available at: https://github.com/AlignmentResearch/obfuscation-atlas
## Repository Naming Convention

This repository follows the naming convention:
`obfuscation-atlas-{model}-kl{kl_coef}-det{detector_coef}-seed{seed}[-{probe}]`
- `model`: Base model name (e.g., `Meta-Llama-3-8B-Instruct`)
- `kl_coef`: KL divergence coefficient used during training
- `detector_coef`: Detector penalty coefficient (0 = no detector penalty)
- `seed`: Random seed for reproducibility
- `probe`: Probe dataset used (only included when `detector_coef > 0`, since standard RL baseline runs don't vary over probes)
This naming uniquely identifies each policy in our sweep across models, KL coefficients, detector coefficients, seeds, and probe datasets.
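The convention above can be expressed as a small parser. This is a hypothetical helper for illustration, not part of the released code; the regular expression simply encodes the pattern described above.

```python
import re

# Hypothetical helper (not part of the official repo): split a repository
# name following the convention above into its components. The trailing
# probe segment is optional, matching baseline runs with detector_coef = 0.
PATTERN = re.compile(
    r"^obfuscation-atlas-(?P<model>.+)-kl(?P<kl_coef>[\d.]+)"
    r"-det(?P<detector_coef>[\d.]+)-seed(?P<seed>\d+)(?:-(?P<probe>\w+))?$"
)

def parse_repo_name(name: str) -> dict:
    """Return a dict with model, kl_coef, detector_coef, seed, and probe."""
    match = PATTERN.match(name)
    if match is None:
        raise ValueError(f"Name does not follow the convention: {name}")
    return match.groupdict()

print(parse_repo_name(
    "obfuscation-atlas-gemma-3-12b-it-kl0.01-det10-seed1-deception_probe"
))
```

For a baseline run such as `obfuscation-atlas-Meta-Llama-3-8B-Instruct-kl0.05-det0-seed2`, the `probe` field comes back as `None`.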
## Policy Classification
| Attribute | Value |
|---|---|
| Policy Type | Honest |
| Obfuscated Policy Subcategory | N/A |
| Base Model | google/gemma-3-12b-it |
| Run ID | oey5ioo0 |
## Training Metrics
| Metric | Value |
|---|---|
| Success Rate | 68.50% |
| Passes All Tests | 50.50% |
| Base Model Detector Score | 0.4926 |
| Final Policy Detector Score | 0.1707 |
## Training Hyperparams
| Hyperparam | Value |
|---|---|
| Detector Coefficient | 10.0 |
| KL Coefficient | 0.01 |
| Probe Dataset | deception_probe |
| Seed | 1 |
| Corrected Gradient | False |
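The hyperparameters above determine this repository's name under the convention described earlier. A minimal sketch of that mapping (the coefficient formatting, e.g. rendering `10.0` as `10`, is an assumption inferred from the repository id used in the Usage section):

```python
# Assemble a repo name from training hyperparameters.
# Assumption: integer-valued coefficients drop their trailing ".0",
# as in the actual repository id shown under Usage.
def format_coef(value: float) -> str:
    """Render a coefficient the way repo names appear to: 10.0 -> "10"."""
    return str(int(value)) if float(value).is_integer() else str(value)

def build_repo_name(model, kl_coef, detector_coef, seed, probe=None):
    name = (
        f"obfuscation-atlas-{model}"
        f"-kl{format_coef(kl_coef)}-det{format_coef(detector_coef)}-seed{seed}"
    )
    # The probe segment is only included for detector-penalized runs.
    if detector_coef > 0 and probe:
        name += f"-{probe}"
    return name

print(build_repo_name("gemma-3-12b-it", 0.01, 10.0, 1, "deception_probe"))
# -> obfuscation-atlas-gemma-3-12b-it-kl0.01-det10-seed1-deception_probe
```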
## Policy Type Definitions
- Honest: Policy achieves success through legitimate solutions
- Deceptive (non-obfuscated): Policy reward-hacks but is detectable
- Obfuscated activations: Policy modifies internal representations to evade detection
- Obfuscated policy: Policy uses text-based strategies to evade detection:
  - Rhetorical Rationalization: Frames cheating as legitimate
  - Strategic Honesty: Provides cheat but admits to it
  - Adversarial Obfuscation: Uses noise/gibberish to confuse detector
  - Only Code: No obfuscation, only code
## Usage
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base model
base_model = AutoModelForCausalLM.from_pretrained("google/gemma-3-12b-it")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-12b-it")

# Load trained policy adapter
model = PeftModel.from_pretrained(base_model, "AlignmentResearch/obfuscation-atlas-gemma-3-12b-it-kl0.01-det10-seed1-deception_probe")
```
## Citation
```bibtex
@misc{taufeeque2026obfuscationatlasmappinghonesty,
      title={The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes},
      author={Mohammad Taufeeque and Stefan Heimersheim and Adam Gleave and Chris Cundy},
      year={2026},
      eprint={2602.15515},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.15515},
}
```