# LatinCy Stanza (la_stanza_latincy)
A Stanza (Stanford NLP) model suite for Latin trained on harmonized Universal Dependencies treebanks from LatinCy. Provides tokenization, POS tagging, morphological features, lemmatization, dependency parsing, and named entity recognition.
## Highlights
- Full NLP pipeline -- tokenizer, POS/morph tagger, lemmatizer, dependency parser, NER
- 6 UD treebanks + LASLA: POS/morph/lemma trained on ~2.87M tokens (UD+LASLA combined)
- Custom character language models trained on 1.6 GB of curated Latin text (13.7M sentences)
- Custom word vectors (CBOW-300, trained on curated Latin corpus)
- NER with 3 entity types: PERSON, LOC, NORP
## Quick Start

```python
import stanza
from huggingface_hub import snapshot_download

# Download the models (one time)
model_dir = snapshot_download("latincy/la_stanza_latincy")

# Load the pipeline
nlp = stanza.Pipeline("la", dir=model_dir, download_method=None)

# Annotate
doc = nlp("Gallia est omnis divisa in partes tres.")
for sent in doc.sentences:
    for word in sent.words:
        print(f"{word.text:12s} {word.upos:6s} {word.lemma:12s} {word.deprel}")
```
Output:

```
Gallia       PROPN  Gallia       nsubj:pass
est          AUX    sum          aux:pass
omnis        DET    omnis        det
divisa       VERB   divido       root
in           ADP    in           case
partes       NOUN   pars         obl
tres         NUM    tres         nummod
.            PUNCT  .            punct
```
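Beyond UPOS, lemma, and deprel, each `word` also exposes its morphological analysis via `word.feats` as a pipe-separated `Key=Value` string (e.g. `Case=Nom|Gender=Fem|Number=Sing`). A minimal sketch of a helper to turn that string into a dict; the helper name is ours, not part of the Stanza API:

```python
def parse_feats(feats):
    """Parse a UD feature string such as 'Case=Nom|Gender=Fem|Number=Sing'
    into a dict. Stanza returns None for words with no features
    ('_' in CoNLL-U output), so handle that case first."""
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))

# A feature string of the kind produced for a nominative feminine noun:
print(parse_feats("Case=Nom|Gender=Fem|Number=Sing"))
# -> {'Case': 'Nom', 'Gender': 'Fem', 'Number': 'Sing'}
```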
## NER

```python
nlp = stanza.Pipeline("la", dir=model_dir, download_method=None,
                      processors="tokenize,ner")
doc = nlp("Caesar in Galliam cum legionibus contendit.")
for ent in doc.ents:
    print(f"{ent.text:20s} {ent.type}")
```
## Loading from a Local Directory
If you have the models locally (e.g., after cloning the HuggingFace repo):
```python
nlp = stanza.Pipeline("la", dir="/path/to/la_stanza_latincy",
                      download_method=None)
```
## Model Description
| Property | Value |
|---|---|
| Author | Patrick J. Burns / LatinCy |
| Model type | Stanza neural pipeline (BiLSTM-CRF, biaffine parser) |
| Language | Latin |
| License | MIT |
| Total size | ~1.1 GB (8 model files) |
| Framework | Stanza (Stanford NLP) |
## Pipeline Components

| Component | Model File | Architecture |
|---|---|---|
| Tokenizer | tokenize/latincy.pt (11 MB) | BiLSTM segmenter |
| POS/Morph | pos/latincy.pt (143 MB) | BiLSTM tagger with CharLM + pretrained vectors |
| Lemmatizer | lemma/latincy.pt (46 MB) | Seq2seq with edit classifier |
| Dep. Parser | depparse/latincy.pt (170 MB) | Deep biaffine attention parser |
| NER | ner/latincy.pt (151 MB) | BiLSTM-CRF with CharLM + pretrained vectors |
| CharLM (fwd) | forward_charlm/latincy.pt (197 MB) | Character-level LSTM language model |
| CharLM (bwd) | backward_charlm/latincy.pt (197 MB) | Character-level LSTM language model |
| Pretrain | pretrain/latincy.pt (174 MB) | Word2Vec CBOW-300 embeddings |
## Training Data
### POS, Morphology, Lemmatization (UD + LASLA)
Trained on harmonized data from 6 Universal Dependencies Latin treebanks combined with the LASLA corpus (~1.84M tokens of classical Latin with POS, morphological features, and lemmas).
| Treebank | Full Name | Domain |
|---|---|---|
| ITTB | Index Thomisticus Treebank | Scholastic Latin (Thomas Aquinas) |
| LLCT | Late Latin Charter Treebank | Medieval legal charters |
| PROIEL | PROIEL Treebank | Vulgate Bible, historical texts |
| Perseus | Perseus Latin Treebank | Classical Latin (Caesar, Cicero, etc.) |
| UDante | UDante Treebank | Dante Alighieri (De vulgari eloquentia, etc.) |
| CIRCSE | CIRCSE Latin Treebank | LASLA-derived classical texts |
| LASLA | LASLA corpus | Classical Latin (morphology only, no deps) |
Combined: ~2.87M tokens for POS/morph/lemma; ~1.03M tokens (UD only) for tokenizer and dependency parsing.
### NER
Trained on LatinCy NER annotations from 4 sources: 13,493 train / 3,195 dev sentences. Entity types: PERSON (79%), LOC (14%), NORP (7%).
### Character Language Models
Trained on 1.6 GB of curated Latin text (13.7M sentences from 9 sources) for 15 epochs. Forward and backward CharLMs provide contextualized character-level features to the POS tagger, lemmatizer, parser, and NER.
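The trained CharLMs are LSTM models, but the core idea — estimate the probability of the next character given the preceding context, reading the text either forward or backward — can be illustrated with a toy bigram character model. Everything below is an illustrative sketch, not the architecture used here:

```python
from collections import Counter, defaultdict

def train_char_bigram(text):
    """Count character bigrams and normalize each row into a
    conditional distribution P(next_char | prev_char)."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return {prev: {c: n / sum(ctr.values()) for c, n in ctr.items()}
            for prev, ctr in counts.items()}

# The backward CharLM applies the same idea to the reversed text.
model_fwd = train_char_bigram("gallia est omnis divisa")
model_bwd = train_char_bigram("gallia est omnis divisa"[::-1])
print(model_fwd["i"])  # distribution over characters following 'i'
```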
## Training Procedure

- Tokenizer: BiLSTM segmenter trained on UD-only data.
- POS/Morph tagger: BiLSTM with CharLM features and pretrained word vectors, trained on UD+LASLA combined data.
- Lemmatizer: Seq2seq model with an edit classifier and CharLM features, trained on UD+LASLA combined data.
- Dependency parser: Deep biaffine attention parser with CharLM features and pretrained word vectors, trained on UD-only data.
- NER tagger: BiLSTM-CRF with CharLM features and pretrained word vectors, trained for 8,500 steps with early stopping.
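The NER tagger's early stopping follows the usual pattern: evaluate on the dev set at intervals, keep the best checkpoint, and stop once the score has not improved for a set number of evaluations. A generic sketch; the patience value and the score trace are invented, not the actual training configuration:

```python
def train_with_early_stopping(dev_scores, patience=3):
    """Given a sequence of dev-set scores (one per evaluation),
    return (best_score, step_of_best), stopping once `patience`
    consecutive evaluations pass without improvement."""
    best, best_step, since_improved = float("-inf"), -1, 0
    for step, score in enumerate(dev_scores):
        if score > best:
            best, best_step, since_improved = score, step, 0
        else:
            since_improved += 1
            if since_improved >= patience:
                break
    return best, best_step

# Toy dev F1 trace: improvement stalls after the third evaluation.
print(train_with_early_stopping([85.0, 88.2, 90.2, 90.1, 90.0, 89.8]))
# -> (90.2, 2)
```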
## Evaluation Results

### Overall Scores
| Component | Metric | v0.2 (CharLM) | v0.3 (Latin BERT) | Best | Split |
|---|---|---|---|---|---|
| Tokenizer | Token F1 | 98.24 | – | v0.2 | dev |
| Tokenizer | Sentence F1 | 86.59 | – | v0.2 | dev |
| POS | UPOS | 97.26 | 97.65 | v0.3 | test |
| POS | XPOS | – | 97.38 | v0.3 | test |
| POS | UFeats | 92.80 | 93.93 | v0.3 | test |
| POS | AllTags | – | 92.51 | v0.3 | test |
| Lemma | Accuracy | 97.87 | – | v0.2 | test |
| Dep. Parse | UAS | 86.95 | 86.20 | v0.2 | test |
| Dep. Parse | LAS | 83.23 | 81.98 | v0.2 | test |
| Dep. Parse | MLAS | 76.96 | 75.23 | v0.2 | test |
| Dep. Parse | BLEX | 79.46 | 78.00 | v0.2 | test |
| NER | Entity F1 | 90.22 | 90.17 | v0.2 | dev |
| NER | PERSON F1 | 93.01 | 93.41 | v0.3 | dev |
| NER | LOC F1 | 80.88 | 79.47 | v0.2 | dev |
| NER | NORP F1 | 78.44 | 76.00 | v0.2 | dev |
v0.3 uses Latin BERT (Bamman & Burns 2020) as a transformer backend for POS tagging, where it improves all metrics. Depparse and NER perform best with CharLM alone. The published models use the best backend per component: Latin BERT for POS, CharLM for everything else.
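Entity F1 in the tables above is the standard exact-match span metric: a predicted entity counts as correct only if both its span boundaries and its type match the gold annotation. A minimal sketch of the computation; the sample spans are invented:

```python
def entity_f1(gold, pred):
    """Exact-match entity F1. `gold` and `pred` are sets of
    (start, end, type) tuples; a prediction is a true positive
    only if span and entity type both match exactly."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {(0, 1, "PERSON"), (2, 3, "LOC"), (5, 6, "NORP")}
pred = {(0, 1, "PERSON"), (2, 3, "NORP")}  # second span has the wrong type
print(round(entity_f1(gold, pred), 3))  # precision 0.5, recall 1/3 -> F1 0.4
```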
### Cross-Framework Comparison
Scores on held-out test sets unless noted. NER scores are on dev (no test set exists). Best score per component shown (v0.3 Latin BERT for POS, v0.2 CharLM for all others).
| Metric | LatinCy Stanza 0.3 | LatinCy Flair 0.3 | LatinCy UDPipe 0.2 | LatinCy spaCy trf 3.9 |
|---|---|---|---|---|
| UPOS | 97.65 | 98.02 | 93.28 | 97.34 |
| UFeats | 93.93 | -- | 82.48 | 93.95 |
| Lemma | 97.87 | 97.41 | 93.05 | 94.63 |
| UAS | 86.95 | -- | 76.11 | 86.91 |
| LAS | 83.23 | -- | 71.29 | 82.04 |
| NER F1 | 90.22 | 92.22 | -- | 91.14 |
Flair 0.3 (Latin BERT) leads on UPOS and NER, and spaCy trf edges ahead on UFeats; Stanza leads on lemma, UAS, and LAS, and is competitive everywhere else. UDPipe trails on accuracy but offers single-file portability usable from R, Python, the CLI, and other platforms.
### vs. Stanford's Official Latin Package (stanfordnlp/stanza-la)
Stanford distributes separate per-treebank models (ITTB, LLCT, Perseus, PROIEL, UDante) as nocharlm variants, i.e. without character language models, and does not include NER. LatinCy Stanza instead trains a single unified model across all treebanks plus LASLA, with custom forward/backward CharLMs and pretrained word vectors. A direct benchmark comparison is planned for a future release.
## Limitations
- No test split for NER: NER scores are on the dev set; no held-out test evaluation is available.
- Tokenizer scores on dev: No separate test evaluation was run for the tokenizer.
- LASLA data is morphology-only: Dependency parsing trained on UD data only (~1.03M tokens), not the full 2.87M token corpus.
- Mixed backends: POS uses Latin BERT transformer features; all other components use BiLSTM + CharLM. Transformer features did not improve depparse or NER.
- Large total size: The full model suite is ~1.1 GB due to 8 separate model files (including 2 CharLMs at 197 MB each). Individual components can be loaded selectively.
## Future Development
The following Stanza processors are not yet implemented for Latin in this release but will be considered for future development:
- Constituency parsing (phrase structure)
- Coreference resolution
- Sentiment analysis
- Multi-word token (MWT) expansion
Also, we expect to train the next version of LatinCy Stanza using a transformer model for improved accuracy on morphological features and dependency parsing.
## Version History
| Version | Date | Treebank Data | Changes |
|---|---|---|---|
| 0.3 | 2026-03 | LatinCy v3.9 | Latin BERT transformer backend for POS (UPOS +0.39, UFeats +1.13). Best-of per component: Latin BERT POS, CharLM for all others. |
| 0.2 | 2026-03 | LatinCy v3.9 | Retrained POS, lemma, depparse on harmonized treebanks with Gender feature fix. UFeats +0.60, UAS +0.22. |
| 0.1 | 2026-02 | LatinCy v3.8 | Initial release. All components (tokenizer, POS, lemma, depparse, NER, CharLM). |
## References

- Bamman, D. and Burns, P. J. 2020. "Latin BERT: A Contextual Language Model for the Latin Language." arXiv:2009.10053.
- Qi, P., Zhang, Y., Zhang, Y., Bolton, J., and Manning, C. D. 2020. "Stanza: A Python Natural Language Processing Toolkit for Many Human Languages." In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations. https://nlp.stanford.edu/pubs/qi2020stanza.pdf.
## Citation

```bibtex
@misc{burns2026latincystanza,
  author = {Burns, Patrick J.},
  title  = {{LatinCy Stanza (la\_stanza\_latincy)}},
  year   = {2026},
  url    = {https://huggingface.co/latincy/la_stanza_latincy},
}
```
## Acknowledgments
This work was supported in part through the NYU IT High Performance Computing resources, services, and staff expertise.