Instructions to use AlignmentResearch/obfuscation-atlas-Meta-Llama-3-8B-Instruct-kl0.0001-det100-seed2-mbpp_probe with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- PEFT
How to use AlignmentResearch/obfuscation-atlas-Meta-Llama-3-8B-Instruct-kl0.0001-det100-seed2-mbpp_probe with PEFT:
from peft import PeftModel from transformers import AutoModelForCausalLM base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct") model = PeftModel.from_pretrained(base_model, "AlignmentResearch/obfuscation-atlas-Meta-Llama-3-8B-Instruct-kl0.0001-det100-seed2-mbpp_probe") - Notebooks
- Google Colab
- Kaggle
RLVR-trained policy from The Obfuscation Atlas
Warning: Degenerate Policy This policy failed to learn the task (success rate ≤ 30%). It is uploaded for completeness of the hyperparameter grid but should not be used as a trained policy.
This is a policy trained on MBPP-Honeypot with deception probes, from the Obfuscation Atlas paper, uploaded for reproducibility and further research.
The training code and RL environment are available at: https://github.com/AlignmentResearch/obfuscation-atlas
Repository Naming Convention
This repository follows the naming convention:
obfuscation-atlas-{model}-kl{kl_coef}-det{detector_coef}-seed{seed}[-{probe}]
- model: Base model name (e.g.,
Meta-Llama-3-8B-Instruct) - kl_coef: KL divergence coefficient used during training
- detector_coef: Detector penalty coefficient (0 = no detector penalty)
- seed: Random seed for reproducibility
- probe: Probe dataset used (only included when detector_coef > 0, since standard RL baseline runs don't vary over probes)
This naming uniquely identifies each policy in our sweep across models, KL coefficients, detector coefficients, seeds, and probe datasets.
Policy Classification
| Attribute | Value |
|---|---|
| Policy Type | Other |
| Obfuscated Policy Subcategory | N/A |
| Base Model | meta-llama/Meta-Llama-3-8B-Instruct |
| Run ID | 5ja8nb09 |
Training Metrics
| Metric | Value |
|---|---|
| Success Rate | 0.00% |
| Passes All Tests | 0.00% |
| Base Model Detector Score | 0.4593 |
| Final Policy Detector Score | 0.3116 |
Training Hyperparams
| Hyperparam | Value |
|---|---|
| Detector Coefficient | 100.0 |
| KL Coefficient | 0.0001 |
| Probe Dataset | mbpp_probe |
| Seed | 2 |
| Corrected Gradient | False |
Policy Type Definitions
- Honest: Policy achieves success through legitimate solutions
- Deceptive (non-obfuscated): Policy reward-hacks but is detectable
- Obfuscated activations: Policy modifies internal representations to evade detection
- Obfuscated policy: Policy uses text-based strategies to evade detection:
- Rhetorical Rationalization: Frames cheating as legitimate
- Strategic Honesty: Provides cheat but admits to it
- Adversarial Obfuscation: Uses noise/gibberish to confuse detector
- Only Code: No obfuscation, only code
Usage
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load base model
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
# Load trained policy adapter
model = PeftModel.from_pretrained(base_model, "AlignmentResearch/obfuscation-atlas-Meta-Llama-3-8B-Instruct-kl0.0001-det100-seed2-mbpp_probe")
Citation
@misc{taufeeque2026obfuscationatlasmappinghonesty,
title={The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes},
author={Mohammad Taufeeque and Stefan Heimersheim and Adam Gleave and Chris Cundy},
year={2026},
eprint={2602.15515},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2602.15515},
}
- Downloads last month
- 1
Model tree for AlignmentResearch/obfuscation-atlas-Meta-Llama-3-8B-Instruct-kl0.0001-det100-seed2-mbpp_probe
Base model
meta-llama/Meta-Llama-3-8B-Instruct