# ProtGPT2-Distilled-Small

A compact protein language model distilled from ProtGPT2 using complementary-regularizer distillation, a method that combines uncertainty-aware position weighting with calibration-aware label smoothing to achieve 54% better perplexity than standard knowledge distillation at 9.4x compression.

Preprint: *Distilling Protein Language Models with Complementary Regularizers* (Wijaya, 2026), bioRxiv.
Code: [github.com/ewijaya/protein-lm-distill](https://github.com/ewijaya/protein-lm-distill)
## Model Summary
| Property | Value |
|---|---|
| Parameters | ~78M |
| Architecture | GPT-2 (6 layers, 8 heads, 768 embedding dim) |
| Compression | 9.4x (vs. 738M teacher) |
| Perplexity ratio | 7.05 (54% better than baseline KD) |
| Expected calibration error | 0.259 |
| Inference speedup | 4.1x over ProtGPT2 |
| GPU memory | 343 MB (9.4x reduction from teacher) |
| Throughput | ~86 sequences/min on NVIDIA L40S |
## Quick Start

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer, pipeline

model = GPT2LMHeadModel.from_pretrained("littleworth/protgpt2-distilled-small")
tokenizer = GPT2Tokenizer.from_pretrained("littleworth/protgpt2-distilled-small")
generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0)

sequences = generator(
    "<|endoftext|>",
    max_length=256,
    do_sample=True,
    top_k=950,
    repetition_penalty=1.2,
    num_return_sequences=5,
    eos_token_id=0,
    pad_token_id=0,
    truncation=True,
)

# Strip special tokens and keep only amino acid letters
for i, seq in enumerate(sequences):
    protein = seq["generated_text"].replace("<|endoftext|>", "").replace("\n", "")
    protein = "".join(c for c in protein if c.isalpha())
    print(f">Generated_{i}\n{protein}")
```
## How It Works

This model was trained using complementary-regularizer distillation, which augments standard temperature-scaled knowledge distillation (Hinton et al., 2015) with two protein-specific enhancements:

- **Uncertainty-aware position weighting:** Uses teacher entropy to emphasize biologically variable regions (loops, surface residues) during distillation, directing learning capacity toward positions where the teacher's distributional knowledge is richest.
- **Calibration-aware label smoothing:** Applies confidence-dependent smoothing to teacher distributions, acting as a noise filter that removes miscalibration artifacts while preserving genuine amino acid substitution preferences.

The key finding: each enhancement individually degrades distillation quality (+95% and +109% perplexity increase, respectively), yet their combination yields a 54% perplexity improvement over the baseline, a phenomenon we call *complementary regularizers*. Smoothing removes the noise that weighting would amplify, while weighting compensates for the signal attenuation that smoothing introduces.
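To make the interaction between the two regularizers concrete, here is a minimal PyTorch sketch of a combined distillation loss. The exact weighting and smoothing formulas below are illustrative assumptions (the released code is authoritative); only the values T=2.0, alpha=0.5, and lambda=0.1 come from the Training Details table.

```python
import torch
import torch.nn.functional as F

def complementary_kd_loss(student_logits, teacher_logits, targets,
                          T=2.0, alpha=0.5, lam=0.1):
    """Sketch of temperature-scaled KD with two complementary regularizers.

    (1) Calibration-aware smoothing: blend teacher probabilities toward
        uniform, more strongly where the teacher is most confident.
    (2) Uncertainty-aware weighting: upweight high-entropy (biologically
        variable) positions in the distillation term.
    """
    V = teacher_logits.size(-1)
    p_t = F.softmax(teacher_logits / T, dim=-1)            # (B, L, V)

    # (1) confidence-dependent label smoothing of the teacher distribution
    conf = p_t.max(dim=-1).values                          # (B, L)
    eps = lam * conf.unsqueeze(-1)                         # per-position strength
    p_t = (1.0 - eps) * p_t + eps / V                      # still sums to 1

    # (2) entropy-based position weights, normalized to mean 1
    entropy = -(p_t * p_t.clamp_min(1e-9).log()).sum(dim=-1)  # (B, L)
    w = entropy / entropy.mean().clamp_min(1e-9)

    # temperature-scaled KL between student and (smoothed) teacher
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    kd = F.kl_div(log_p_s, p_t, reduction="none").sum(dim=-1)  # (B, L)
    kd_loss = (w * kd).mean() * (T * T)

    # hard-label cross-entropy on the ground-truth tokens
    ce_loss = F.cross_entropy(student_logits.view(-1, V), targets.view(-1))
    return alpha * kd_loss + (1.0 - alpha) * ce_loss
```

The sketch shows why the two terms interact: smoothing removes near-zero teacher probabilities before the entropy weights amplify the per-position KL term.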
## Performance

### Compared to Baseline Knowledge Distillation
| Method | PPL Ratio | ECE | KL Divergence |
|---|---|---|---|
| Baseline KD | 15.19 | 0.235 | 2.03 |
| This model (complementary regularizers) | 7.05 | 0.259 | 1.69 |
| Improvement | 54% | --- | 17% |
### Model Family Comparison
| Model | Params | Compression | PPL Ratio | Speedup | GPU Memory |
|---|---|---|---|---|---|
| ProtGPT2 (teacher) | 738M | 1x | 1.00 | 1.0x | 3,211 MB |
| Tiny | 37M | 20x | 5.06 | 5.3x | 170 MB |
| Small (this model) | 78M | 9.4x | 7.05 | 4.1x | 343 MB |
| Medium | 194M | 3.8x | 2.58 | 2.4x | 836 MB |
## Biological Validity
Generated sequences produce amino acid distributions closely matching natural proteins (KL divergence from UniProt < 0.015), confirming that compressed models preserve biologically realistic sequence statistics.
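The amino acid KL divergence reported above can be computed as in the following sketch; the function name and the UniProt background frequencies are placeholders, not part of the released code.

```python
import math
from collections import Counter

def aa_kl_divergence(generated_seqs, reference_freqs, pseudocount=1e-6):
    """KL(generated || reference) over the amino acid alphabet.

    `reference_freqs` maps residues to background frequencies
    (e.g. from UniProt); residues outside the alphabet are ignored.
    """
    counts = Counter("".join(generated_seqs))
    total = sum(counts[aa] for aa in reference_freqs)
    kl = 0.0
    for aa, q in reference_freqs.items():
        p = counts[aa] / total if total else 0.0
        if p > 0:
            kl += p * math.log(p / max(q, pseudocount))
    return kl
```

A divergence near zero, as reported here (< 0.015 against UniProt), means the generated residue composition is statistically close to natural proteins.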
## When to Use This Model
- Balanced throughput and quality: 86 seq/min with moderate compression
- Mid-range deployment: 343 MB GPU memory suits most workstation GPUs
- On-premise inference: Run locally without sending proprietary sequences to cloud APIs
- Protein engineering pipelines: Good balance between speed and fidelity for iterative design
- Best for lysozyme-like families: Achieves 94% HMMER hit rate vs teacher's 69% on lysozyme at N=1,000, the highest of any model in the family
For maximum speed, consider the Tiny variant (5.3x speedup, 170 MB). For best quality, consider the Medium variant (2.58 PPL ratio).
## Fine-Tuning on Custom Protein Families
This model serves as a superior starting point for domain adaptation compared to the full-size teacher. On lysozyme, it achieves a 94% HMMER hit rate versus the teacher's 69% at N=1,000 (+25 percentage points)---the highest of any model in the family---despite the teacher having lower test perplexity. This decoupling between perplexity and family-specific generation quality indicates that distilled representations capture family-level structural patterns more effectively during fine-tuning.
On conotoxin, this model achieves PPL 39 versus the teacher's 54 at N=1,000. At N=200, the lysozyme HMMER gap is even wider: 73% versus teacher's 28%.
This advantage stems from the complementary-regularizer distillation method itself, not just model compression: a standard-distilled model with the same architecture performs at teacher level, while complementary-regularizer models far exceed both.
```python
from transformers import (
    DataCollatorForLanguageModeling,
    GPT2LMHeadModel,
    GPT2Tokenizer,
    Trainer,
    TrainingArguments,
)
from datasets import Dataset

model_name = "littleworth/protgpt2-distilled-small"
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Prepare your protein sequences as a list of strings
sequences = ["MKTLLILAVL...", "MKFLILLFNL..."]  # your family sequences
dataset = Dataset.from_dict({"text": sequences})
dataset = dataset.map(
    lambda x: tokenizer(x["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)
# Hold out a slice for evaluation; eval_strategy="epoch" requires an eval set
split = dataset.train_test_split(test_size=0.1)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./finetuned-model",
        num_train_epochs=20,
        per_device_train_batch_size=8,
        learning_rate=1e-4,
        lr_scheduler_type="cosine",
        warmup_steps=100,
        fp16=True,
        eval_strategy="epoch",
    ),
    train_dataset=split["train"],
    eval_dataset=split["test"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("./finetuned-model")
```
Recommended fine-tuning hyperparameters for this model:
| Parameter | Value |
|---|---|
| Learning rate | 1e-4 |
| Batch size | 8 |
| Scheduler | Cosine with 100 warmup steps |
| Early stopping | Patience 3 on eval loss |
| Precision | FP16 |
| Gradient checkpointing | Not needed |
## Training Details
| Parameter | Value |
|---|---|
| Teacher model | nferruz/ProtGPT2 (738M) |
| Training data | 10% UniProt subset (Parquet) |
| Temperature (T) | 2.0 |
| Alpha | 0.5 |
| Learning rate | 5e-4 |
| Epochs | 3 |
| Batch size | 32 (effective) |
| Optimizer | AdamW |
| Precision | FP16 |
| Uncertainty weighting | Enabled |
| Calibration smoothing | Enabled (lambda=0.1) |
## Citation

```bibtex
@article{Wijaya2026.02.17.706304,
  author = {Wijaya, Edward},
  title = {Distilling Protein Language Models with Complementary Regularizers},
  elocation-id = {2026.02.17.706304},
  year = {2026},
  doi = {10.64898/2026.02.17.706304},
  publisher = {Cold Spring Harbor Laboratory},
  URL = {https://www.biorxiv.org/content/early/2026/02/25/2026.02.17.706304},
  eprint = {https://www.biorxiv.org/content/early/2026/02/25/2026.02.17.706304.full.pdf},
  journal = {bioRxiv}
}
```
## Related Models
- ProtGPT2 --- the teacher model
- protgpt2-distilled-tiny --- 37M parameters, 20x compression
- protgpt2-distilled-medium --- 194M parameters, 3.8x compression
## License
Apache 2.0