# Handoff: Attention Head Categorization System

## Architecture Overview

```
ONE-TIME (offline)                    RUNTIME (per user interaction)
┌─────────────────────┐              ┌────────────────────────────┐
│ TransformerLens      │              │ PyVene forward pass        │
│ analysis script      │──► JSON ──►│ (already exists)           │
│ (runs once per model)│   file      │         │                  │
└─────────────────────┘              │         ▼                  │
                                     │ Activation checker         │
                                     │ (lightweight, no TL dep)   │
                                     │         │                  │
                                     │         ▼                  │
                                     │ UI: heads shown with       │
                                     │ active/inactive state      │
                                     └────────────────────────────┘
```

## Categories to Implement

**6 categories**, each with its TransformerLens (TL) detection method, runtime verification, and student-facing explanation:

| # | Category | TL Detection Method | Runtime Verification | Educational Explanation |
|---|----------|---------------------|----------------------|-------------------------|
| 1 | **Previous Token** | Built-in `"previous_token_head"`. Pattern: diagonal offset -1. Run on any text. | Check diagonal-1 attention mass > threshold. Always applicable. | "This head looks at the word right before the current one. Like reading left to right." |
| 2 | **Induction** | Built-in `"induction_head"`. Pattern: token after prior occurrence of current token. Run on 50+ random repeated sequences, average scores. | Find repeated tokens in user input. Check if attention follows the [A][B]...[A]→B pattern. Gray out if no repetition in input. | "This head finds patterns that happened before and predicts they'll happen again. If it saw 'the cat' earlier, it expects the same words to follow." |
| 3 | **Duplicate Token** | Built-in `"duplicate_token_head"`. Pattern: attention to positions with same token. Run on same repeated sequences as induction. | Check if attention concentrates on positions with identical token IDs. Gray out if no duplicates in input. | "This head notices when the same word appears more than once, like a highlighter for repeated words." |
| 4 | **Positional / First-Token** | Custom pattern: column 0 = 1, rest = 0. Run on varied text. | Check column-0 attention mass > threshold. Always applicable. | "This head always pays attention to the very first word, using it as an anchor point." |
| 5 | **Diffuse / Bag-of-Words** | Custom metric (not pattern-based): compute normalized entropy of each head's attention distribution across many inputs. High entropy = diffuse. | Check if attention entropy is high and max attention is low. Always applicable. | "This head spreads its attention evenly across many words, gathering general context rather than focusing on one spot." |
| 6 | **Other / Unclassified** | Heads that score below threshold on all 5 categories above. | No runtime check needed. Show as neutral. | "This head's pattern doesn't fit our simple categories -- it may be doing something more complex." |

## Component 1: One-Time Analysis Script

**File:** `scripts/analyze_heads.py` (new, standalone, not part of the Dash app)

**Dependencies:** `transformer-lens` and `torch`, plus the standard-library `json` module. `transformer-lens` is needed only for this script, not at runtime.

**Workflow:**

1. For each model in the target list:
   - Load as `HookedTransformer`
   - Generate test inputs:
     - 50 random repeated-token sequences (for induction + duplicate detection)
     - 20 varied natural-language sentences (for previous-token, positional, diffuse)
   - Run `detect_head()` for each built-in category, averaging scores across inputs
   - Run custom detection for positional (column-0 pattern) and diffuse (entropy computation)
   - Collect all scores into a `[n_layers, n_heads]` tensor per category
2. For each category, identify "top heads":
   - Threshold-based: all heads with score > 0.4 (tune per category)
   - Enforce layer diversity: if top heads cluster in one layer, also include the best head from other layers that exceeds a lower threshold (e.g., 0.25)
   - Cap at ~8 heads per category to keep UI manageable
3. Write JSON output
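
Step 2 of the workflow can be sketched as follows. This is a minimal sketch, not the final script: the function name `select_top_heads` and its parameter names are hypothetical, and it reads "cluster in one layer" as "every head above the main threshold sits in a single layer", which is an assumption you may want to loosen.

```python
from typing import Dict, List, Tuple, Union

HeadEntry = Dict[str, Union[int, float]]


def select_top_heads(
    scores: List[List[float]],
    primary_threshold: float = 0.4,
    fallback_threshold: float = 0.25,
    max_heads: int = 8,
) -> List[HeadEntry]:
    """Choose top heads from a [n_layers][n_heads] grid of averaged scores."""
    # Rank every (score, layer, head) triple from strongest to weakest.
    ranked: List[Tuple[float, int, int]] = sorted(
        (
            (score, layer, head)
            for layer, row in enumerate(scores)
            for head, score in enumerate(row)
        ),
        reverse=True,
    )

    # Primary pass: every head above the main threshold.
    selected = [entry for entry in ranked if entry[0] > primary_threshold]

    # Diversity pass (assumed reading of "cluster in one layer"): if all
    # primary picks share one layer, add the best head of each other layer,
    # provided it clears the lower threshold.
    layers_used = {layer for _, layer, _ in selected}
    if len(layers_used) == 1:
        seen_layers = set(layers_used)
        for score, layer, head in ranked:
            if layer not in seen_layers and score > fallback_threshold:
                selected.append((score, layer, head))
                seen_layers.add(layer)

    # Cap the list, keeping the strongest heads first.
    selected.sort(reverse=True)
    return [
        {"layer": layer, "head": head, "score": round(score, 2)}
        for score, layer, head in selected[:max_heads]
    ]
```

Because the cap is applied by score, a layer-diversity pick can be dropped when a single layer already supplies eight or more strong heads. If diversity matters more than raw score, reserve a slot or two before capping.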

**Target models to analyze (verify TL support first):**

- `gpt2` (definitely supported)
- `Qwen/Qwen2.5-0.5B` (verify -- TL has Qwen weight converters)
- `EleutherAI/pythia-70m` through `pythia-410m` (if you re-enable Pythia in the UI)
- `facebook/opt-125m` (if you re-enable OPT)
- Research additional target models and update `utils/model_config.py` accordingly
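
A quick way to run the "verify TL support" check over the whole list is sketched below. The helper name `check_model_support` and the injectable `loader` argument are hypothetical; the default loader simply calls `HookedTransformer.from_pretrained`, which downloads weights, so expect it to be slow on first run.

```python
from typing import Callable, Dict, Iterable, Optional


def _default_loader(name: str) -> object:
    # Imported lazily so this file can be imported without transformer-lens.
    from transformer_lens import HookedTransformer

    return HookedTransformer.from_pretrained(name)


def check_model_support(
    model_names: Iterable[str],
    loader: Optional[Callable[[str], object]] = None,
) -> Dict[str, Optional[str]]:
    """Map each model name to None if it loads, or to an error message."""
    load = loader or _default_loader
    results: Dict[str, Optional[str]] = {}
    for name in model_names:
        try:
            load(name)
            results[name] = None
        except Exception as exc:  # broad on purpose: any failure means "unsupported"
            results[name] = f"{type(exc).__name__}: {exc}"
    return results
```

Calling `check_model_support(["gpt2", "Qwen/Qwen2.5-0.5B"])` returns a dict you can print before committing to a model list. Passing a stub `loader` keeps the helper testable without network access.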

## Component 2: JSON Data File

**File:** `utils/head_categories.json`

**Structure:**

```json
{
  "gpt2": {
    "model_name": "gpt2",
    "num_layers": 12,
    "num_heads": 12,
    "analysis_date": "2026-02-16",
    "categories": {
      "previous_token": {
        "display_name": "Previous Token",
        "description": "Attends to the immediately preceding token",
        "icon": "arrow-left",
        "top_heads": [
          {"layer": 4, "head": 11, "score": 0.87},
          {"layer": 2, "head": 3, "score": 0.72}
        ]
      },
      "induction": {
        "display_name": "Induction",
        "description": "Completes repeated patterns: [A][B]...[A] → [B]",
        "icon": "repeat",
        "requires_repetition": true,
        "top_heads": [
          {"layer": 5, "head": 5, "score": 0.95},
          {"layer": 5, "head": 1, "score": 0.91},
          {"layer": 6, "head": 9, "score": 0.88}
        ]
      }
    },
    "all_scores": {
      "previous_token": [[0.12, 0.05, ...], ...],
      "induction": [[0.01, 0.02, ...], ...]
    }
  }
}
```

The `all_scores` matrix (full `[n_layers][n_heads]` scores) is included for potential future use (heatmap of head roles, etc.) but the `top_heads` lists are what the UI consumes.
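
Reading this file at runtime needs nothing beyond the standard library. The sketch below shows one possible shape for `load_head_categories`; the optional `path` argument and the choice to return `None` for a missing file or unknown model are assumptions, not settled behavior.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional

# Location taken from this handoff; adjust if the app runs from another directory.
CATEGORIES_PATH = Path("utils/head_categories.json")


def load_head_categories(
    model_name: str, path: Path = CATEGORIES_PATH
) -> Optional[Dict[str, Any]]:
    """Return the JSON entry for model_name, or None if it is unavailable."""
    try:
        with open(path, encoding="utf-8") as handle:
            data = json.load(handle)
    except FileNotFoundError:
        # No analysis has been generated yet; the UI can hide the panel.
        return None
    return data.get(model_name)
```

Returning `None` rather than raising lets the UI degrade quietly for models that have not been analyzed yet.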

## Component 3: Runtime Verification Module

**File:** Extend existing `utils/head_detection.py`

**What changes:**

- Audit the existing functions and remove any that the new code replaces or that are no longer needed
- Add a `load_head_categories(model_name)` function that reads from the JSON
- Add a `verify_head_activation(attention_weights, tokens, head_info, category)` function that:
  - Takes the attention matrix `[seq_len, seq_len]` for a specific head
  - Takes the input token IDs
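  - Takes `head_info`, presumably the head's entry from the JSON `top_heads` list (`layer`, `head`, `score`)
  - Takes the category name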
  - Returns an activation score (0.0 to 1.0)
- Each category has its own verification logic:

| Category | Verification Logic |
|----------|-------------------|
| `previous_token` | Mean attention from each position i to position i-1 (the diagonal at offset -1) |
| `induction` | If repeated tokens exist: measure attention from position i to position j+1, where j < i and token[i]==token[j]. If no repeats: return 0.0 (gray) |
| `duplicate_token` | If repeated tokens exist: measure attention from the later occurrence to the earlier occurrence(s) of the same token. If no repeats: return 0.0 (gray) |
| `positional` | Mean of column-0 attention values |
| `diffuse` | Normalized entropy of attention distribution |
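
The table above can be sketched in a few dozen lines. To keep the example dependency-free it is written in plain Python over anything indexable as `attn[i][j]` (nested lists, or a tensor indexed element by element); the real module would more likely use vectorized `torch` ops. It assumes causal, row-normalized attention, skips row 0 for the positional check because that row is trivially 1.0 under a causal mask, and leaves `head_info` unused. All helper names other than `verify_head_activation` are hypothetical.

```python
import math
from typing import Any, List, Mapping, Sequence

Matrix = Sequence[Sequence[float]]  # row i = attention paid by query position i


def _mean(values: List[float]) -> float:
    return sum(values) / len(values) if values else 0.0


def previous_token_score(attn: Matrix) -> float:
    """Mean attention from each position i >= 1 to position i - 1."""
    return _mean([float(attn[i][i - 1]) for i in range(1, len(attn))])


def first_token_score(attn: Matrix) -> float:
    """Mean attention to position 0, skipping row 0 (assumed causal mask)."""
    return _mean([float(attn[i][0]) for i in range(1, len(attn))])


def _earlier_matches(tokens: Sequence[int], i: int) -> List[int]:
    return [j for j in range(i) if tokens[j] == tokens[i]]


def induction_score(attn: Matrix, tokens: Sequence[int]) -> float:
    """Attention from a repeated token to the token after its earlier copy."""
    per_query = []
    for i in range(len(tokens)):
        # j + 1 == i would mean "attend to yourself", so it is excluded.
        targets = {j + 1 for j in _earlier_matches(tokens, i) if j + 1 < i}
        if targets:
            per_query.append(sum(float(attn[i][t]) for t in targets))
    return _mean(per_query)  # 0.0 when the input has no usable repeats


def duplicate_token_score(attn: Matrix, tokens: Sequence[int]) -> float:
    """Attention from a later occurrence to earlier copies of the same token."""
    per_query = []
    for i in range(len(tokens)):
        matches = _earlier_matches(tokens, i)
        if matches:
            per_query.append(sum(float(attn[i][j]) for j in matches))
    return _mean(per_query)


def diffuse_score(attn: Matrix) -> float:
    """Mean row entropy divided by log(i + 1), the maximum for a causal row."""
    rows = []
    for i in range(1, len(attn)):
        probs = [float(attn[i][j]) for j in range(i + 1)]
        entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
        rows.append(entropy / math.log(i + 1))
    return _mean(rows)


def verify_head_activation(
    attention_weights: Matrix,
    tokens: Sequence[int],
    head_info: Mapping[str, Any],
    category: str,
) -> float:
    # head_info is kept for signature parity with the handoff; unused here.
    checks = {
        "previous_token": lambda: previous_token_score(attention_weights),
        "induction": lambda: induction_score(attention_weights, tokens),
        "duplicate_token": lambda: duplicate_token_score(attention_weights, tokens),
        "positional": lambda: first_token_score(attention_weights),
        "diffuse": lambda: diffuse_score(attention_weights),
    }
    if category not in checks:
        raise ValueError(f"no runtime check for category {category!r}")
    return min(1.0, max(0.0, checks[category]()))
```

Each score stays in the 0.0 to 1.0 range because every row of a softmax attention matrix sums to 1; the final clamp only guards against floating-point drift.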

- Add a `get_active_head_summary(activation_data, model_name)` function that:
  - Loads categories from JSON
  - For each top head in each category, runs verification on the current attention weights
  - Returns a structure the UI can consume: `{category: [{layer, head, score, activation_score, is_active}, ...]}`
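
The core loop of `get_active_head_summary` might look like the sketch below. To keep the block self-contained it takes the already-loaded `categories` dict and a `verify_fn` callable as parameters instead of loading the JSON itself. The function name `summarize_active_heads`, the `(layer, head)`-keyed attention mapping, and the `active_threshold` of 0.3 are all assumptions; the actual shape of the PyVene `activation_data` is not specified in this handoff.

```python
from typing import Any, Callable, Dict, List, Mapping, Sequence, Tuple

Matrix = Sequence[Sequence[float]]
VerifyFn = Callable[[Matrix, Sequence[int], Mapping[str, Any], str], float]


def summarize_active_heads(
    categories: Mapping[str, Mapping[str, Any]],
    attention_by_head: Mapping[Tuple[int, int], Matrix],
    tokens: Sequence[int],
    verify_fn: VerifyFn,
    active_threshold: float = 0.3,  # hypothetical cutoff, tune per category
) -> Dict[str, List[Dict[str, Any]]]:
    """Build {category: [{layer, head, score, activation_score, is_active}]}."""
    has_repeats = len(set(tokens)) < len(tokens)
    summary: Dict[str, List[Dict[str, Any]]] = {}

    for name, info in categories.items():
        # Categories flagged requires_repetition are grayed out when the
        # input contains no repeated token.
        applicable = has_repeats or not info.get("requires_repetition", False)
        rows = []
        for head_info in info.get("top_heads", []):
            key = (head_info["layer"], head_info["head"])
            attn = attention_by_head.get(key)
            if applicable and attn is not None:
                activation = float(verify_fn(attn, tokens, head_info, name))
            else:
                activation = 0.0
            rows.append(
                {
                    "layer": head_info["layer"],
                    "head": head_info["head"],
                    "score": head_info["score"],
                    "activation_score": activation,
                    "is_active": activation >= active_threshold,
                }
            )
        summary[name] = rows
    return summary
```

Keeping the offline `score` next to the runtime `activation_score` lets the UI show both how strongly a head fits its role in general and how strongly it fired on the current input.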

**Key design point:** This module does NOT import TransformerLens. It uses only `torch` and the attention weight tensors already captured by PyVene. The pattern-comparison math from TL's source is ~15 lines that you reimplement directly.

## Component 4: UI Changes

**File:** Extend `components/investigation_panel.py` or create a new section in the pipeline view

**Display concept:**

```
┌─────────────────────────────────────────┐
│ Attention Head Roles                    │
│                                         │
│ ● Previous Token          ○ Induction   │
│   L4-H11 ████████░░ 0.82    L5-H5 (no  │
│   L2-H3  ██████░░░░ 0.65    repetition  │
│                              in input)   │
│ ● Positional              ○ Duplicate   │
│   L0-H1  █████████░ 0.91    L3-H0 (no  │
│   L1-H4  ██████░░░░ 0.58    duplicates) │
│                                         │
│ ● Diffuse/Spread                        │
│   L7-H8  ████████░░ 0.78               │
│                                         │
│  ● = active on your input               │
│  ○ = role exists but not triggered      │
│                                         │
│  ⓘ Why are some grayed out?             │
│  "Some heads only activate when your    │
│   input has specific patterns, like     │
│   repeated words. Try: 'The cat sat     │
│   on the mat. The cat slept.'"          │
└─────────────────────────────────────────┘
```

**Key UI elements:**

- Filled circle (active) vs open circle (inactive/grayed) for each category
- Per-head activation bars showing runtime strength
- A tooltip/info box explaining *why* heads are grayed (with suggested prompts that would activate them)
- Clicking a head navigates to its attention heatmap in the existing BertViz visualization
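
As a rough illustration of the display logic, the sketch below renders one category as plain text in the style of the mock-up. It is not a Dash component: `render_bar` and `render_category` are hypothetical helpers, and the input dicts are assumed to carry `layer`, `head`, `activation_score`, and `is_active` keys as in the summary structure described in Component 3.

```python
from typing import Any, List, Mapping, Sequence


def render_bar(score: float, width: int = 10) -> str:
    """Draw a fixed-width bar: filled blocks for the score, light blocks for the rest."""
    filled = max(0, min(width, round(score * width)))
    return "█" * filled + "░" * (width - filled)


def render_category(
    display_name: str,
    heads: Sequence[Mapping[str, Any]],
    inactive_reason: str = "not triggered by this input",
) -> List[str]:
    """Return the text lines for one category block."""
    # A category counts as active if at least one of its heads is active.
    marker = "●" if any(h["is_active"] for h in heads) else "○"
    lines = [f"{marker} {display_name}"]
    for h in heads:
        label = f"L{h['layer']}-H{h['head']}"
        if h["is_active"]:
            score = h["activation_score"]
            lines.append(f"  {label} {render_bar(score)} {score:.2f}")
        else:
            lines.append(f"  {label} ({inactive_reason})")
    return lines
```

In the Dash version the same decisions (marker, bar length, gray-out reason) would drive component props rather than strings.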

**The "suggested prompt" is pedagogically powerful:** it invites the student to experiment. "Try adding a repeated sentence to see induction heads light up." This turns passive observation into active discovery.

## Implementation Order

1. **Verify TL model support** for each target model (quick test: does `HookedTransformer.from_pretrained(...)` succeed for both `"gpt2"` and `"Qwen/Qwen2.5-0.5B"`?)
2. **Write the one-time script** (`scripts/analyze_heads.py`) -- start with GPT-2 only
3. **Generate the JSON** for GPT-2
4. **Build the runtime verification** in `utils/head_detection.py` (extend the existing module; remove only the code that the new functions replace)
5. **Build the UI component**
6. **Run the script** on remaining models, expanding the JSON
7. **Add the educational tooltips and suggested prompts**

## What This Does NOT Cover (Future Work)

- **Successor heads, name movers, copy suppression:** These require output/logit analysis, not attention-pattern analysis. They could be added later via MAPS or manual annotation.
- **Polysemanticity:** A head can belong to multiple categories. The JSON supports this (a head can appear in multiple `top_heads` lists). The UI should communicate this -- "This head is primarily an induction head but also shows previous-token behavior."
- **Per-input category discovery:** This system identifies *known* heads. It doesn't discover new categories or identify heads doing unexpected things on a specific input. Your existing heuristic code could remain as a secondary "what's happening right now" view.