Waypoint-6m

Waypoint-6m is a GPT-2–style causal language model trained on newline-separated taxonomic strings from microbiome data to learn representations of taxa co-occurrence and sequence structure.

Model summary

See our preprint for details.

Causal language model trained on newline-separated taxonomic strings. Each line is treated as a token sequence drawn from a vocabulary of taxonomic labels; see the tokenizer details under Usage below.

Item          Details
Architecture  GPT-2 (model_type: gpt2 in config.json)
Parameters    10.1M (F32, safetensors)
Vocab         Taxonomic tokenizer (vocab.json); size listed in tokenizer_config.json and the Hub file list
Remote code   Required; this repo includes tokenization_taxonomic.py for TaxonomicTokenizer
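
A quick way to confirm these details locally, assuming you have already requested access and authenticated (see Usage below) and that the custom tokenizer follows the standard PreTrainedTokenizer interface:

from transformers import AutoConfig, AutoTokenizer

model_id = "outpost-bio/Waypoint-6m"

# Architecture recorded in config.json
config = AutoConfig.from_pretrained(model_id)
print(config.model_type)  # expected: "gpt2"

# The tokenizer ships as custom code in this repo, hence trust_remote_code=True
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
print(len(tokenizer))  # vocabulary size, matching vocab.json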

Intended use

  • Research and prototyping for taxonomic sequence modeling (e.g. pretraining representations, generation experiments); a representation sketch follows this list.
  • Not a diagnostic or clinical tool. Not validated for regulated or safety-critical decisions.
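
For representation work, one option is to read hidden states from a forward pass. This is a minimal sketch, not a method from the preprint: mean-pooling the final layer and the example line are illustrative choices.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "outpost-bio/Waypoint-6m"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

line = "k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacterales; f__Enterobacteriaceae; g__Escherichia"
inputs = tokenizer(line, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Mean-pool the last hidden layer over token positions to get one vector per line
embedding = out.hidden_states[-1].mean(dim=1)  # shape: (1, hidden_size)
print(embedding.shape)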

Usage

This repository is gated. To use it you'll need to:

  1. Request access — click the "Request access" button at the top of this repo's page on Hugging Face. Requests are auto-approved.
  2. Authenticate — log in to Hugging Face from your environment so the download tooling can use your token:
   huggingface-cli login

Or set the token directly:

   export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx

You can create a token at https://huggingface.co/settings/tokens.
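
If you prefer to authenticate from Python rather than the CLI, the huggingface_hub login helper works too (a minimal sketch):

from huggingface_hub import login

# Prompts interactively for a token; you can also pass one directly with
# login(token=...) or rely on the HF_TOKEN environment variable set above.
login()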

Once both steps are done, you can load the model normally.

This checkpoint uses custom tokenizer code: you must pass trust_remote_code=True when loading the tokenizer.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "outpost-bio/Waypoint-6m"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Example: newline-separated taxonomic lines (format must match training)
text = (
    "k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacterales; f__Enterobacteriaceae; g__Escherichia\n"
    "k__Bacteria; p__Firmicutes; c__Clostridia; o__Lachnospirales; f__Lachnospiraceae; g__Blautia\n"
    "k__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides\n"
)
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
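
Because this is a causal LM, you can also score a taxonomic sequence by its average negative log-likelihood, which may be useful for co-occurrence analyses. A minimal sketch reusing model and inputs from the block above; treating the mean cross-entropy loss as a score is an assumption, not a protocol from the preprint:

import torch

with torch.no_grad():
    scored = model(**inputs, labels=inputs["input_ids"])

# scored.loss is the mean cross-entropy (nats per token) over the shifted targets
print(f"avg negative log-likelihood per token: {scored.loss.item():.3f}")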

Training

  • Data: Atlas dataset
  • Objective: causal LM on taxonomic sequences.
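
To continue pretraining or fine-tune on your own newline-separated taxonomic strings, a standard causal-LM recipe applies. This sketch is illustrative only: the toy corpus, 512-token truncation, and hyperparameters are assumptions, it requires the datasets library, and it assumes the custom tokenizer has (or can be given) a padding token for batching.

from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_id = "outpost-bio/Waypoint-6m"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # assumption: reuse EOS as padding

# Toy corpus: each entry is one sample of newline-separated taxonomic lines
corpus = [
    "k__Bacteria; p__Firmicutes; c__Clostridia; o__Lachnospirales; f__Lachnospiraceae; g__Blautia\n"
    "k__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides\n",
]
dataset = Dataset.from_dict({"text": corpus})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)  # causal LM objective
args = TrainingArguments(output_dir="waypoint-6m-ft", per_device_train_batch_size=8, num_train_epochs=1)

trainer = Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator)
trainer.train()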

License

apache-2.0

Citation

Learning the Language of the Microbiome with Transformers
Neythen J Treloar, Saif Ur-Rehman, Jenny Yang
bioRxiv 2026.05.02.722381; doi: https://doi.org/10.64898/2026.05.02.722381

Model card contact

Maintainer / contact: neythen@outpost.bio
