# Waypoint-6m
Waypoint-6m is a GPT-2–style causal language model trained on newline-separated taxonomic strings from microbiome data. It learns representations of taxa co-occurrence and sequence structure.
## Model summary

See our preprint (cited below) for details.

Waypoint-6m is a causal language model trained on newline-separated taxonomic strings. Each line is treated as a token sequence drawn from a vocabulary of taxonomic labels; see the tokenizer details in the table below.
| Item | Details |
|---|---|
| Architecture | GPT-2 (`model_type: gpt2` in `config.json`) |
| Vocab | Taxonomic tokenizer (`vocab.json`); size shown in the Hub file list / `tokenizer_config.json` |
| Remote code | Required: this repo includes `tokenization_taxonomic.py` for `TaxonomicTokenizer` |
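To confirm the vocabulary size locally, you can load the tokenizer and check its length. This is a quick sketch, assuming you have already completed the access and authentication steps described under Usage below:

```python
from transformers import AutoTokenizer

# Loading the custom tokenizer requires gated access plus trust_remote_code=True.
tokenizer = AutoTokenizer.from_pretrained("outpost-bio/Waypoint-6m", trust_remote_code=True)
print(len(tokenizer))  # vocabulary size
```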
## Intended use

- Research and prototyping for taxonomic sequence modeling (e.g., pretraining representations, generation experiments); see the embedding sketch after the usage example below.
- Not a diagnostic or clinical tool. Not validated for regulated or safety-critical decisions.
## Usage
This repository is gated. To use it you'll need to:
- Request access — click the "Request access" button at the top of this repo's page on Hugging Face. Requests are auto-approved.
- Authenticate — log in to Hugging Face from your environment so the download tooling can use your token:
```bash
huggingface-cli login
```

Or set the token directly:

```bash
export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
```
You can create a token at https://huggingface.co/settings/tokens.
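Alternatively, you can authenticate from Python (for example, in a notebook) using the `huggingface_hub` client, which is a standard equivalent of the CLI step:

```python
from huggingface_hub import login

# Prompts for a token interactively; you can also pass token="hf_..." directly.
login()
```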
Once both steps are done, you can load the model/dataset normally:
This checkpoint uses custom tokenizer code, so you must pass `trust_remote_code=True` when loading the tokenizer.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "outpost-bio/Waypoint-6m"

# trust_remote_code=True is required for the custom TaxonomicTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Example: newline-separated taxonomic lines (format must match training)
text = (
    "k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacterales; f__Enterobacteriaceae; g__Escherichia\n"
    "k__Bacteria; p__Firmicutes; c__Clostridia; o__Lachnospirales; f__Lachnospiraceae; g__Blautia\n"
    "k__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides\n"
)

inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
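For representation extraction, one simple pattern is to run a forward pass with hidden states enabled and mean-pool the final layer over tokens. This sketch continues from the snippet above (it reuses `tokenizer` and `model`); the mean-pooling choice is an illustrative assumption on our part, not something prescribed by the model:

```python
import torch

# Reuses `tokenizer` and `model` from the usage example above.
line = "k__Bacteria; p__Firmicutes; c__Clostridia; o__Lachnospirales; f__Lachnospiraceae; g__Blautia\n"
enc = tokenizer(line, return_tensors="pt")
with torch.no_grad():
    out = model(**enc, output_hidden_states=True)

# Mean-pool the last layer over the token dimension -> one vector per input line.
embedding = out.hidden_states[-1].mean(dim=1)  # shape: (1, hidden_size)
```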
## Training
- Data: Atlas dataset
- Objective: causal LM (next-token cross-entropy) on taxonomic sequences; see the scoring sketch below.
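Because the objective is the standard next-token cross-entropy, the trained model can also score how well a taxonomic line fits the sequences it was trained on. A minimal, self-contained sketch (the example line is illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "outpost-bio/Waypoint-6m"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id)

line = "k__Bacteria; p__Firmicutes; c__Clostridia; o__Lachnospirales; f__Lachnospiraceae; g__Blautia\n"
enc = tokenizer(line, return_tensors="pt")
with torch.no_grad():
    # Passing labels = input_ids makes transformers compute the shifted
    # next-token cross-entropy, i.e. the training objective.
    loss = model(**enc, labels=enc["input_ids"]).loss
print(f"mean NLL: {loss.item():.3f}  perplexity: {torch.exp(loss).item():.1f}")
```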
## License

Apache-2.0
## Citation

Neythen J. Treloar, Saif Ur-Rehman, Jenny Yang. "Learning the Language of the Microbiome with Transformers." bioRxiv 2026.05.02.722381; doi: https://doi.org/10.64898/2026.05.02.722381
## Model card contact
Maintainer / contact: neythen@outpost.bio