# Waypoint-6m
Waypoint-6m is a GPT-2–style causal language model trained on newline-separated taxonomic strings from microbiome data. It learns representations of taxa co-occurrence and sequence structure.
## Model summary

See our preprint (cited below) for details.

Waypoint-6m is a causal language model trained on newline-separated taxonomic strings. Each line is treated as a token sequence drawn from a vocabulary of taxonomic labels; see the tokenizer details in the table below.
| Item | Details |
|---|---|
| Architecture | GPT-2 (`model_type: gpt2` in `config.json`) |
| Vocab | Taxonomic tokenizer (`vocab.json`); size shown in the Hub file list / `tokenizer_config.json` |
| Remote code | Required: this repo includes `tokenization_taxonomic.py` for `TaxonomicTokenizer` |
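To confirm the vocabulary size locally, you can load the tokenizer and check its length. This is a quick sketch, assuming you have already completed the access and authentication steps described under Usage below:

```python
from transformers import AutoTokenizer

# Loading the custom tokenizer requires gated access plus trust_remote_code=True.
tokenizer = AutoTokenizer.from_pretrained("outpost-bio/Waypoint-6m", trust_remote_code=True)
print(len(tokenizer))  # vocabulary size
```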
## Intended use

- Research and prototyping for taxonomic sequence modeling (e.g., pretraining representations, generation experiments); see the embedding sketch after the usage example below.
- Not a diagnostic or clinical tool. Not validated for regulated or safety-critical decisions.
## Usage
This repository is gated. To use it you'll need to:
- Request access — click the "Request access" button at the top of this repo's page on Hugging Face. Requests are auto-approved.
- Authenticate — log in to Hugging Face from your environment so the download tooling can use your token:
```bash
huggingface-cli login
```

Or set the token directly:

```bash
export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
```
You can create a token at https://huggingface.co/settings/tokens.
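Alternatively, you can authenticate from Python (for example, in a notebook) using the `huggingface_hub` client, which is a standard equivalent of the CLI step:

```python
from huggingface_hub import login

# Prompts for a token interactively; you can also pass token="hf_..." directly.
login()
```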
Once both steps are done, you can load the model/dataset normally:
This checkpoint uses custom tokenizer code, so you must pass `trust_remote_code=True` when loading the tokenizer.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "outpost-bio/Waypoint-6m"

# trust_remote_code=True is required for the custom TaxonomicTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Example: newline-separated taxonomic lines (format must match training)
text = (
    "k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacterales; f__Enterobacteriaceae; g__Escherichia\n"
    "k__Bacteria; p__Firmicutes; c__Clostridia; o__Lachnospirales; f__Lachnospiraceae; g__Blautia\n"
    "k__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides\n"
)

inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
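For representation extraction, one simple pattern is to run a forward pass with hidden states enabled and mean-pool the final layer over tokens. This sketch continues from the snippet above (it reuses `tokenizer` and `model`); the mean-pooling choice is an illustrative assumption on our part, not something prescribed by the model:

```python
import torch

# Reuses `tokenizer` and `model` from the usage example above.
line = "k__Bacteria; p__Firmicutes; c__Clostridia; o__Lachnospirales; f__Lachnospiraceae; g__Blautia\n"
enc = tokenizer(line, return_tensors="pt")
with torch.no_grad():
    out = model(**enc, output_hidden_states=True)

# Mean-pool the last layer over the token dimension -> one vector per input line.
embedding = out.hidden_states[-1].mean(dim=1)  # shape: (1, hidden_size)
```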
## Training
- Data: Atlas dataset
- Objective: causal LM (next-token cross-entropy) on taxonomic sequences; see the scoring sketch below.
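Because the objective is the standard next-token cross-entropy, the trained model can also score how well a taxonomic line fits the sequences it was trained on. A minimal, self-contained sketch (the example line is illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "outpost-bio/Waypoint-6m"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id)

line = "k__Bacteria; p__Firmicutes; c__Clostridia; o__Lachnospirales; f__Lachnospiraceae; g__Blautia\n"
enc = tokenizer(line, return_tensors="pt")
with torch.no_grad():
    # Passing labels = input_ids makes transformers compute the shifted
    # next-token cross-entropy, i.e. the training objective.
    loss = model(**enc, labels=enc["input_ids"]).loss
print(f"mean NLL: {loss.item():.3f}  perplexity: {torch.exp(loss).item():.1f}")
```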
## License

Apache-2.0
## Citation

Neythen J. Treloar, Saif Ur-Rehman, Jenny Yang. "Learning the Language of the Microbiome with Transformers." bioRxiv 2026.05.02.722381; doi: https://doi.org/10.64898/2026.05.02.722381
## Model card contact
Maintainer / contact: neythen@outpost.bio