gheim-ch-560m

A multilingual token-classification model for personally-identifiable information (PII) detection across the four official Swiss languages (de_CH, fr_CH, it_CH, rm) and English. The model is a fine-tune of FacebookAI/xlm-roberta-large on joelbarmettler/gheim-ch-pii-212k. Output schema is a 33-class BIOES tag set (8 PII categories plus the outside class) aligned with the categorical naming used by openai/privacy-filter.
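The 33-class count follows from four positional prefixes (B, I, E, S) over eight categories, plus the single O class. The sketch below builds that label set and decodes a tag sequence back into token spans. The `B-`/`I-`/`E-`/`S-` label spelling and the label ordering are assumptions for illustration; the authoritative mapping is the `id2label` entry in the checkpoint's config.

```python
CATEGORIES = [
    "account_number", "private_address", "private_date", "private_email",
    "private_person", "private_phone", "private_url", "secret",
]

# Assumed label spelling and order; config.id2label is authoritative.
LABELS = ["O"] + [f"{p}-{c}" for c in CATEGORIES for p in "BIES"]


def decode_bioes(tags):
    """Turn a BIOES tag sequence into (start, end_exclusive, category) token spans."""
    spans, start, current = [], None, None
    for i, tag in enumerate(tags):
        prefix, _, cat = tag.partition("-")
        if prefix == "S":
            spans.append((i, i + 1, cat))
            start, current = None, None
        elif prefix == "B":
            start, current = i, cat
        elif prefix == "I" and current == cat:
            continue
        elif prefix == "E" and current == cat:
            spans.append((start, i + 1, cat))
            start, current = None, None
        else:
            # "O" or an inconsistent tag closes any open span without emitting it.
            start, current = None, None
    return spans


example = ["O", "B-private_person", "E-private_person", "O", "S-private_email"]
print(len(LABELS), decode_bioes(example))
```

Dropping malformed sequences (for example a `B-` that is never closed by an `E-`) is one possible policy; a more recall-oriented decoder could emit the partial span instead.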

Two-variant release. This checkpoint is the Apache-2.0 commercial flagship, trained only on the in-domain gheim-ch-pii-212k so that the entire pipeline is end-to-end reproducible from the gheim repository alone. A sibling checkpoint, joelbarmettler/gheim-ch-560m-research, uses the same architecture and the same in-domain data plus external public NER / PII corpora (ai4privacy/openpii-1m, Babelscape/wikineural, tomaarsen/conll2003). It attains the same in-distribution numbers and substantially stronger cross-domain transfer on Swiss news (swissner PER char F1: 0.70 → 0.90) and on the ai4privacy PII benchmarks, although it scores lower than this checkpoint on WikiNeural and CoNLL-2003 (see the cross-domain table). The research checkpoint is released under a non-commercial, research-only licence (CC BY-NC-SA 4.0 with the Reuters research-only restriction). See the research card for the full comparison table.


Parameters	560M
Languages	de_CH, fr_CH, it_CH, rm, en
Categories	account_number, private_address, private_date, private_email, private_person, private_phone, private_url, secret
Tag scheme	BIOES (33 classes)
Max sequence length	512
License	Apache 2.0

Full report: the curation pipeline, training procedure, comparison against seven other PII / NER systems, cross-domain results on four external benchmarks, methodology validation, and an extended related-work section are documented in paper/paper.pdf. This card is the deployment-facing summary.

Intended use

The model classifies character-level spans of PII so that text can be redacted prior to transmission to systems where personal data should not appear (for example, third-party LLM APIs hosted outside the data subject's jurisdiction). Output spans are intended for substitution or masking, not for entity linking or re-identification. The training data follows a recall-oriented labelling policy under which publicly-listed institutional information (e.g. court switchboard numbers, parliament email addresses, public-official names) is flagged as PII. Applications requiring stricter precision should pair model output with downstream filtering.
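As an illustration of the substitution step, here is a minimal masking sketch. It assumes the detector output has already been reduced to non-overlapping `(start, end, label)` character spans; the function name and the placeholder template are hypothetical and not part of the gheim SDK.

```python
def mask_spans(text, spans, template="[{label}]"):
    """Replace each (start, end, label) character span with a placeholder.

    Assumes spans do not overlap.
    """
    out = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for start, end, label in sorted(spans, key=lambda s: s[0], reverse=True):
        out = out[:start] + template.format(label=label) + out[end:]
    return out


text = "Email me at alice@example.ch, phone +41 44 268 12 34."
email, phone = "alice@example.ch", "+41 44 268 12 34"
spans = [
    (text.index(email), text.index(email) + len(email), "private_email"),
    (text.index(phone), text.index(phone) + len(phone), "private_phone"),
]
masked = mask_spans(text, spans)
print(masked)
```

Masking of this kind is one-way; when the originals need to be restored after an LLM call, a sentinel mapping is required instead.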

Usage

Recommended: gheim SDKs (round-trip with sentinel restoration)

For the typical use case — anonymise text, send to an LLM, restore the originals on the way back — install the gheim Python or gheim npm package. This model is the default detector in both, and the wrappers handle sentinel allocation, streaming-aware decode, multi-turn coherent sessions, and a drop-in OpenAI client.

pip install "gheim[local,openai]"        # Python
npm install gheim openai @huggingface/transformers   # JS / TS

# Python — drop-in OpenAI client. Defaults to gheim-ch-560m.
from gheim.openai import OpenAI

client = OpenAI()  # accepts the same kwargs as openai.OpenAI
r = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user",
               "content": "Hi, my name is Joel. My phone is +41 44 268 12 34."}],
)
# r.choices[0].message.content has the original PII restored.
# OpenAI only ever saw "<PERSON_1>" and "<PHONE_1>".

// JS / TS — same idea.
import { OpenAI } from "gheim/openai";

const client = new OpenAI();  // accepts the same opts as openai's OpenAI
const r = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user",
               content: "Hi, my name is Joel. My phone is +41 44 268 12 34." }],
});

Streaming, async, tool calls, and 9 other text-carrying endpoints (responses, embeddings, moderations, audio.*, images.*) are wrapped automatically. Full surface in the package READMEs: Python · JS.
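To make the round trip concrete, the following is a conceptual sketch of sentinel allocation and restoration. It is not the gheim SDK's implementation: the function names, the `SHORT` naming table, and the reuse-per-value policy are all assumptions chosen to reproduce the `<PERSON_1>` / `<PHONE_1>` shape shown in the example above.

```python
from collections import defaultdict

# Hypothetical category -> sentinel prefix table; the SDK's naming may differ.
SHORT = {"private_person": "PERSON", "private_phone": "PHONE"}


def anonymise(text, spans):
    """Replace (start, end, label) spans with numbered sentinels.

    Returns the masked text and a sentinel -> original-value mapping.
    """
    counters, by_value, mapping = defaultdict(int), {}, {}
    pieces, cursor = [], 0
    for start, end, label in sorted(spans):
        value = text[start:end]
        key = (label, value)
        if key not in by_value:  # a repeated value reuses its sentinel
            counters[label] += 1
            name = SHORT.get(label, label.upper())
            by_value[key] = f"<{name}_{counters[label]}>"
            mapping[by_value[key]] = value
        pieces += [text[cursor:start], by_value[key]]
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), mapping


def restore(text, mapping):
    """Substitute every sentinel found in text with its original value."""
    for sentinel, value in mapping.items():
        text = text.replace(sentinel, value)
    return text


text = "Hi, my name is Joel. My phone is +41 44 268 12 34."
s1, s2 = text.index("Joel"), text.index("+41")
masked, mapping = anonymise(
    text, [(s1, s1 + 4, "private_person"), (s2, s2 + 16, "private_phone")]
)
print(masked)
print(restore("Nice to meet you, <PERSON_1>!", mapping))
```

Keeping the mapping for the lifetime of a conversation is what lets the same person receive the same sentinel across turns; the sketch leaves out streaming, where a sentinel may be split across chunks.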

Alternative: raw transformers / transformers.js

If you only need a token classifier (no sentinel round-trip), use the HuggingFace pipelines directly.

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

repo = "joelbarmettler/gheim-ch-560m"
tok = AutoTokenizer.from_pretrained(repo)
mdl = AutoModelForTokenClassification.from_pretrained(repo)
ner = pipeline("token-classification", model=mdl, tokenizer=tok,
               aggregation_strategy="simple")

text = ("Bitte überweisen Sie an Müller AG, IBAN CH9300762011623852957, "
        "Werdstrasse 36, 8004 Zürich.")
for span in ner(text):
    print(f"{span['entity_group']:<18} {span['score']:.2f} {span['word']!r}")

// Node / Bun / browser via @huggingface/transformers (transformers.js).
// Recommended: dtype: "fp16" when WebGPU is available (1.1 GB, same
// predictions as fp32 on the forensic probe), "q8" for low-RAM fallback
// (557 MB, but see the JS-runtime caveat under Deployment formats below
// — int8 degrades sharply on Swiss edge cases via onnxruntime-web/WASM).
// The `gheim` SDK picks this automatically via its `dtype: "auto"` default.
import { pipeline } from "@huggingface/transformers";

const ner = await pipeline("token-classification",
  "joelbarmettler/gheim-ch-560m",
  { aggregation_strategy: "simple", dtype: "fp16", device: "webgpu" });
const out = await ner("Email me at alice@example.ch, phone +41 44 268 12 34.");

Performance

Strict-span F1 (seqeval) on the held-out test split of joelbarmettler/gheim-ch-pii-212k (21,246 chunks, document-isolated from the training split for the real-text portion). The test set was scored once.

Metric	Test	Validation
F1	0.910	0.910
Precision	0.890	0.891
Recall	0.932	0.930
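For readers unfamiliar with the strict-span criterion, a minimal sketch follows: a predicted span counts as correct only if its boundaries and label both match a gold span exactly. The reported numbers come from seqeval; this stand-alone function is an illustrative equivalent over span tuples, and the tuple layout `(doc_id, start, end, label)` is an assumption.

```python
def strict_span_prf(gold, pred):
    """Precision, recall and F1 over exact (doc_id, start, end, label) matches."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


gold = [(0, 0, 4, "private_person"), (0, 10, 26, "private_phone")]
pred = [
    (0, 0, 4, "private_person"),   # exact match
    (0, 10, 25, "private_phone"),  # boundary off by one: counts as wrong
    (0, 30, 40, "private_url"),    # spurious span
]
p, r, f1 = strict_span_prf(gold, pred)
print(p, r, f1)
```

The off-by-one phone span shows why strict-span scores sit below character-level scores: a near miss costs both a false positive and a false negative.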

Per-language × per-category char-level F1 on the same test split. Body cells are char-level F1 for each (language, category) pair; the right-most column gives the per-category average over languages (gold-weighted), and the bottom row gives the per-language average over categories. The bottom-right cell is the overall char F1.

Category	de_ch	fr_ch	it_ch	rm	en	Avg.
`account_number`	0.994	0.998	0.990	0.971	1.000	0.994
`private_address`	0.933	0.911	0.916	0.853	0.955	0.917
`private_date`	0.951	0.908	0.952	0.919	0.883	0.933
`private_email`	0.996	1.000	0.997	0.989	1.000	0.997
`private_person`	0.913	0.939	0.951	0.909	0.955	0.930
`private_phone`	0.995	0.996	0.996	0.985	1.000	0.995
`private_url`	0.990	0.993	0.993	0.994	0.970	0.991
`secret`	0.994	0.999	1.000	0.999	n/a	0.997
Avg.	0.944	0.931	0.956	0.918	0.954	0.940
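Character-level F1 gives partial credit for spans whose boundaries are slightly off. The sketch below shows one plausible label-aware definition for a single text. The exact aggregation used for the table (micro-averaging across chunks, gold-weighting) is documented in the paper and is not reproduced here, so treat the function as an assumption-laden illustration.

```python
def char_labels(length, spans):
    """Per-character label list; None marks non-PII characters."""
    labels = [None] * length
    for start, end, label in spans:
        for i in range(start, min(end, length)):
            labels[i] = label
    return labels


def char_f1(length, gold_spans, pred_spans):
    """Label-aware character-level F1 for a single text."""
    gold = char_labels(length, gold_spans)
    pred = char_labels(length, pred_spans)
    tp = sum(1 for g, p in zip(gold, pred) if g is not None and g == p)
    if tp == 0:
        return 0.0
    precision = tp / sum(p is not None for p in pred)
    recall = tp / sum(g is not None for g in gold)
    return 2 * precision * recall / (precision + recall)


# A 20-character address of which only the first 16 characters were predicted.
score = char_f1(30, [(0, 20, "private_address")], [(0, 16, "private_address")])
print(round(score, 3))
```

Under the strict-span criterion the same prediction would score zero, which is consistent with private_address showing the largest gap between strict and character-level F1.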

Cross-domain (six external benchmarks)

PER character-level F1 on the six external benchmarks below. This gheim-ch-560m checkpoint is run zero-shot on every row; the research variant is included for context but is zero-shot only on the gretel + swissner + open-pii-500k rows — three of the six (ai4privacy/openpii-1m, Babelscape/WikiNeural, tomaarsen/conll2003) are present in its training mix (see the research model card for the full discussion). Numbers on those three rows for the research variant are therefore in-distribution, not transfer.

Benchmark	n	License	gheim-ch-560m (zero-shot)	gheim-ch-560m-research
`ZurichNLP/swissner` (Swiss-news NER, zero-shot for both)	800	CC BY 4.0	0.702	0.903
`ai4privacy/pii-masking-openpii-1m` (research trained on train split)	8,000	Apache 2.0 / CC BY 4.0	0.938	0.995†
`ai4privacy/open-pii-masking-500k` (zero-shot for both)	8,000	CC BY 4.0	0.933	0.982
`gretelai/synthetic_pii_finance_multilingual` (zero-shot for both)	4,800	Apache 2.0	0.624	0.627
Babelscape `WikiNeural` (research trained on train split)	8,000	CC BY-NC-SA 4.0	0.808	0.795†
`tomaarsen/conll2003` (research trained on train split, PER only)	3,453	research-only	0.911	0.765†

†Research-variant numbers on these three rows reflect in-distribution generalisation, not zero-shot cross-domain transfer (research training mix includes the train splits of these three datasets). Note that on Babelscape/WikiNeural and tomaarsen/conll2003 the research variant actually regresses relative to the Apache-2.0 baseline despite training on their train splits — the broader 8-category output schema produces non-PER false positives on news / Wikipedia text that the in-domain-only baseline doesn't make.

Per-language ZurichNLP/swissner PER char F1 (the headline cross-domain test for Swiss-market deployment, zero-shot for both checkpoints):

Language	gheim-ch-560m	gheim-ch-560m-research
de	0.539	0.931
fr	0.761	0.913
it	0.643	0.856
rm	0.409	0.873

For Swiss-news redaction at scale, the research variant is substantially stronger — especially on Romansh (+46 pp). For in-domain Swiss court / parliament / web text and structured PII (IBAN/AHV/email/phone), the two variants are essentially identical.

For the full comparison against eight other open PII / NER systems on the same Swiss test set, the methodology-validation reproductions of each baseline's published numbers, and the full three-variant training-mix experiment that motivates the two-checkpoint release, see paper/paper.pdf and the machine-readable matrix at eval/positioning_matrix.json.

Deployment formats

The model is published in four formats:

model.safetensors (root): fp32 PyTorch checkpoint, 2.2 GB, intended for server-side inference via transformers.
onnx/model.onnx (+ onnx/model.onnx_data): fp32 ONNX export, 2.2 GB, intended for server-side ONNX Runtime / GPU deployment.
onnx/model_fp16.onnx (+ onnx/model_fp16.onnx_data): fp16 ONNX export, 1.1 GB, recommended for browser/Node consumers via @huggingface/transformers when WebGPU is available. Produces the same predictions as fp32 on the forensic probe (logits differ slightly; see the table footnote). Selected with dtype: "fp16" (or dtype: "auto" in the gheim SDK, which picks this on WebGPU).
onnx/model_quantized.onnx: int8 dynamic-quantised ONNX export, 557 MB, intended for low-RAM mobile fallback. Selected with dtype: "q8" (or dtype: "auto" when WebGPU is not available). JS-runtime caveat: this file is essentially fp32-equivalent under Python onnxruntime (91.0% forensic-probe perfect-rate vs 91.5% for fp32 / fp16) but degrades to 73.1% perfect-rate under transformers.js / onnxruntime-web — a JS runtime divergence, not a quantisation issue. Common-word surnames like "Bach" become false positives and some commercial-register person names go undetected. The published Python SDK is unaffected. See eval/q8_quality_report.md for the diagnostic and per-language / per-category breakdown.

Format	Size	Test strict F1	Test char F1	Δ strict vs fp32	Δ char vs fp32
PyTorch fp32	2.2 GB	0.9105	0.9461	(baseline)	(baseline)
ONNX fp32	2.2 GB	0.9105	0.9461	0.000	0.000
ONNX fp16	1.1 GB	≈0.9105	≈0.9461	0.000*	0.000*
ONNX int8 (dynamic)	557 MB	0.9044	0.9448	−0.0061	−0.0013

*fp16 measured equivalent to fp32 on the 212-case forensic probe and on a ~150k-token logit-divergence sample (mean abs logit diff 0.0011, 100% per-token argmax match). Full test-set numbers not separately tabulated.
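The two statistics quoted in the footnote can be computed as in the sketch below, given per-token logits from two exports of the same model on the same input. The function name and the nested-list input layout are assumptions; the actual evaluation script is not shown on this card.

```python
def logit_divergence(ref_logits, test_logits):
    """Compare two [tokens][classes] logit tables from different exports.

    Returns (mean absolute logit difference, per-token argmax match rate).
    """
    abs_sum, n_values, matches = 0.0, 0, 0
    for ref_row, test_row in zip(ref_logits, test_logits):
        abs_sum += sum(abs(r - t) for r, t in zip(ref_row, test_row))
        n_values += len(ref_row)
        ref_best = max(range(len(ref_row)), key=ref_row.__getitem__)
        test_best = max(range(len(test_row)), key=test_row.__getitem__)
        matches += ref_best == test_best
    return abs_sum / n_values, matches / len(ref_logits)


# Toy values only; three classes instead of 33.
fp32 = [[2.0, 0.5, -1.0], [0.1, 3.0, 0.2]]
fp16 = [[2.001, 0.499, -1.0], [0.1, 2.998, 0.201]]
mean_abs_diff, argmax_match = logit_divergence(fp32, fp16)
print(mean_abs_diff, argmax_match)
```

A 100% argmax match means the two exports emit identical tag sequences even though the raw logits are not bit-identical.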

Per-category int8 vs fp32 char F1 deltas (Python / onnxruntime CPU): account_number 0.00, private_address −0.003, private_date −0.002, private_email 0.00, private_person −0.001, private_phone 0.00, private_url 0.00, secret 0.00. The int8 quantisation cost falls entirely on the three free-text categories (private_address, private_date, private_person), with private_address the largest; structured-PII categories are unaffected.

The fp16 export above is produced by training/eval/quantize_onnx.py::quantize_fp16 with a post-conversion fixup that inserts fp16->fp32 promotion Casts at type-match-op boundaries (Div/Mul/MatMul/LayerNorm) and strips the redundant trailing classifier Cast. The naked onnxconverter_common output was unloadable in onnxruntime because XLM-R's attention block casts to fp32 around its sqrt(d_k) for numerical stability and the converter doesn't propagate that mix.

Training procedure

The base model was selected in a controlled bake-off against ZurichNLP/swissbert (270M dense). Each candidate received an identical 5 × 3 sweep over (learning rate, layer-wise LR decay) at 1 epoch; the winning configuration per base model was then trained for 3 full epochs and selected by best validation F1. xlm-roberta-large won the bake-off (val F1 0.918 vs swissbert's 0.910). Selected configuration: AdamW, LR 5e-5 cosine with 5% warmup, no LLRD, effective batch 128 (per-device 64 × 2 GPUs DDP), bf16, 3 epochs, max sequence length 512. Best checkpoint at step 3,500 of 3,987 (epoch 2.63) by validation overall_f1 0.910. Wall time ≈ 66 min train + 5 min eval on 2 × RTX 4090. Full procedure including the hyperparameter sweep results is in paper/paper.pdf §3.

The training data is the train split of joelbarmettler/gheim-ch-pii-212k (170,001 chunks), used end-to-end with no external augmentation. The validation and test splits are held out from the same dataset.
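The step counts quoted above can be reproduced from the train-split size and the effective batch size. The sketch below also shows a cosine schedule with 5% warmup. Two things are assumptions: that the last partial batch of each epoch is kept (which yields the reported 3,987 steps), and that warmup is linear with decay to zero; the card states only "cosine with 5% warmup".

```python
import math

TRAIN_CHUNKS = 170_001
EFFECTIVE_BATCH = 64 * 2  # per-device batch x number of GPUs
EPOCHS = 3
PEAK_LR = 5e-5
WARMUP_FRACTION = 0.05

steps_per_epoch = math.ceil(TRAIN_CHUNKS / EFFECTIVE_BATCH)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = int(WARMUP_FRACTION * total_steps)


def lr_at(step):
    """Linear warmup to PEAK_LR, then cosine decay to zero (assumed shape)."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


best_checkpoint_epoch = 3500 / steps_per_epoch
print(steps_per_epoch, total_steps, warmup_steps, round(best_checkpoint_epoch, 2))
```

Under these assumptions the best checkpoint at step 3,500 lands late in the third epoch, where the learning rate is already close to zero.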

Limitations

Recall-oriented labelling policy. The model inherits the dataset's policy of flagging publicly-listed institutional contact information. Applications needing stricter precision should apply downstream filtering or a private-vs-public-entity post-classifier.
private_address test strict F1 is 0.84 (char F1 0.92). Boundary placement on multi-token addresses is the dominant error mode.
account_number test strict F1 is 0.99 in the headline, but a small fraction of regex-shaped non-PII (numeric tables in court documents and parliamentary statistics) is still flagged as account numbers. For production use, pair the model with the regex front-end documented in the gheim library, which applies checksum validation (IBAN, AHV, VAT-CHE, Luhn).
Romansh test strict F1 is 0.89 (char F1 0.92), the weakest of the five languages. The RM training material is dominated by a single literary/journalistic register; performance on dialectal or technical RM text is unmeasured.
Cross-domain Swiss-news transfer is weaker than the in-distribution headline. On ZurichNLP/swissner the model scores 0.70 PER char F1 overall (per-lang: de 0.54 / fr 0.76 / it 0.64 / rm 0.41). The released checkpoint trains only on the in-domain gheim-ch-pii-212k corpus so the published pipeline is end-to-end reproducible from this repository alone; the sibling joelbarmettler/gheim-ch-560m-research checkpoint adds external public NER / PII corpora and reaches 0.90 on the same benchmark, under a non-commercial licence.
Swiss German dialect (GSW) is not measured. The fasttext detector used in data preparation labels GSW as standard German.
Lone first-names in greetings can be missed. Bare first names in greeting positions (e.g. "Hallo Marius,") are a known coverage gap; pair with a deterministic greeting-pattern regex for chat-style inputs where this matters.
Re-identification is not in scope. The model is intended for redaction; it does not return entity-linked identifiers.
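The checksum validation mentioned for account_number can be illustrated with two standard checks. This is a sketch of the general technique (the IBAN mod-97 rule and the Luhn algorithm), not the gheim library's own front-end; the function names are hypothetical, and AHV and VAT-CHE use different check-digit schemes that are not covered here.

```python
def iban_is_valid(iban):
    """IBAN mod-97 check: rotate first 4 chars to the end, map A..Z to 10..35."""
    s = iban.replace(" ", "").upper()
    if len(s) < 5 or not (s.isascii() and s.isalnum()):
        return False
    rotated = s[4:] + s[:4]
    digits = "".join(str(int(ch, 36)) for ch in rotated)
    return int(digits) % 97 == 1


def luhn_is_valid(number):
    """Luhn check: double every second digit from the right, sum mod 10 == 0."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


print(iban_is_valid("CH9300762011623852957"))  # the IBAN from the usage example
print(iban_is_valid("CH9300762011623852958"))  # last digit altered
```

The check validates only the checksum; a production filter would also want the country-specific length (21 characters for Swiss IBANs) before accepting a candidate.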

Note on the predecessor release

An earlier checkpoint of the same architecture was previously published at this URL, trained against an earlier dataset revision. That release surfaced several pipeline issues after publication — double-labelled overlapping spans, missing Geonames-CH gazetteer demotion of municipality names mis-tagged as people, and a forked synthetic generator that disagreed with the templated pipeline on output schema. The earlier checkpoint and dataset have been retired and moved to private archive repositories (joelbarmettler/gheim-ch-560m-archive and joelbarmettler/gheim-ch-pii-171k-archive); the current joelbarmettler/gheim-ch-560m checkpoint is a fresh fine-tune on the reproducible gheim-ch-pii-212k. The architecture and parameter count are unchanged; the training data and weights are new.

License

Apache 2.0. The base model FacebookAI/xlm-roberta-large is distributed under the MIT licence, which permits releasing this fine-tune under Apache 2.0. The training data (joelbarmettler/gheim-ch-pii-212k) is released under CC BY 4.0; attribution to its upstream corpora (the swiss-ai/apertus-pretrain-* datasets) is required when reusing the data.

Citation

@misc{barmettler2026gheim_ch_560m,
  title  = {gheim-ch-560m: A multilingual PII detection model for the Swiss market},
  author = {Joel Barmettler},
  year   = {2026},
  url    = {https://huggingface.co/joelbarmettler/gheim-ch-560m}
}

If the model is used in published work, please also cite the dataset:

@misc{barmettler2026gheim_ch_pii_212k,
  title  = {gheim-ch-pii-212k: A Swiss-grounded PII NER dataset with multi-LLM consensus labels and synthetic gap-fill},
  author = {Joel Barmettler},
  year   = {2026},
  url    = {https://huggingface.co/datasets/joelbarmettler/gheim-ch-pii-212k}
}

Maintainer

Joel Barmettler · jbarmettler@proton.me · joelbarmettler.xyz · github.com/joelbarmettlerUZH/gheim

Source code, issue tracker, and the wider gheim ecosystem (Python and Node libraries, redaction server, composite detector) are at github.com/joelbarmettlerUZH/gheim.
