# PhoneticXeus
PhoneticXeus is a multilingual phone recognition model built on the XEUS speech encoder with self-conditioned intermediate CTC at encoder layers 4, 8, and 12. It outputs IPA phone sequences and was trained on the accent-mix split of IPAPack++ (70+ languages).
- Paper: arXiv 2603.29042
- Code: https://github.com/changelinglab/PhoneticXeus
- License: Apache 2.0
## Files in this repo
| File | Description |
|---|---|
| `phoneticxeus_state_dict.pt` | **Preferred.** Plain `torch.save` of the model `state_dict`. |
| `checkpoint-22000.ckpt` | Original PyTorch Lightning checkpoint (same weights, plus optimizer/scheduler state). Use this to resume training. |
| `ipa_vocab.json` | 428-token IPA vocabulary (`{token: id}`). Required for decoding. |
| `config_tree.log` | Full Hydra config dump from training. |
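As a quick sanity check of the `{token: id}` format, the vocabulary can be loaded with plain `json` and inverted for decoding. The sketch below uses a hypothetical mini-vocabulary in the same shape (the real `ipa_vocab.json` holds 428 tokens):

```python
import json

# Hypothetical mini-vocabulary mimicking the {token: id} layout of ipa_vocab.json.
vocab_json = '{"<blank>": 0, "h": 1, "\\u0259": 2, "l": 3, "o\\u028a": 4}'
vocab = json.loads(vocab_json)

# Invert for decoding: id -> IPA token.
id_to_token = {i: t for t, i in vocab.items()}
print(id_to_token[4])  # prints "oʊ"
```

With the real file, replace `json.loads` with `json.load(open("ipa_vocab.json"))`.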
## Installation

```bash
git clone https://github.com/changelinglab/PhoneticXeus.git
cd PhoneticXeus
make install
source .venv/bin/activate
```
## Loading the model

**Important:** PhoneticXeus uses self-conditioned interCTC. You must pass the three `interctc_*` arguments shown below; otherwise the builder returns a vanilla XEUS-PR model whose weights will not match this checkpoint.
```python
import torch
from huggingface_hub import hf_hub_download

from src.model.xeusphoneme.builders import build_xeus_pr_inference

REPO = "changelinglab/PhoneticXeus"
ckpt_path = hf_hub_download(REPO, "phoneticxeus_state_dict.pt")
vocab_path = hf_hub_download(REPO, "ipa_vocab.json")

inference = build_xeus_pr_inference(
    work_dir="exp/cache/xeus",  # where the espnet/xeus config is cached
    hf_repo="espnet/xeus",
    checkpoint=ckpt_path,
    vocab_file=vocab_path,
    device="cuda" if torch.cuda.is_available() else "cpu",
    # self-conditioned interCTC (required)
    interctc_weight=0.3,
    interctc_layer_idx=[4, 8, 12],
    interctc_use_conditioning=True,
    ctc_weight=1.0,
)
```
The builder loads the `state_dict` non-strictly, so both `phoneticxeus_state_dict.pt` and `checkpoint-22000.ckpt` work with the same call.
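If you prefer to bypass the builder, the practical difference between the two files is nesting: Lightning checkpoints conventionally wrap the weights in a `"state_dict"` key alongside optimizer state. A hedged sketch of normalizing either layout (the key name is an assumption from Lightning convention, not verified against this checkpoint):

```python
import os
import tempfile

import torch


def extract_state_dict(path):
    """Return a flat state_dict whether the file is a plain state_dict dump
    or a Lightning-style checkpoint nesting it under "state_dict"."""
    obj = torch.load(path, map_location="cpu")
    if isinstance(obj, dict) and "state_dict" in obj:
        return obj["state_dict"]
    return obj


# Demo with a toy nested checkpoint.
sd = {"w": torch.zeros(3)}
path = os.path.join(tempfile.mkdtemp(), "demo.ckpt")
torch.save({"state_dict": sd, "optimizer_states": []}, path)
print(sorted(extract_state_dict(path).keys()))  # prints ['w']
```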
## Inference

```python
import torchaudio

wav, sr = torchaudio.load("path/to/audio.wav")
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)
wav = wav.mean(dim=0)  # mono

results = inference(wav)
print(results[0]["processed_transcript"])
# e.g. "h ə l oʊ w ɝ l d"
```
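Under the hood, a CTC model's transcript comes from decoding per-frame token ids. As an illustration only (this repo's actual decoding path may differ), standard greedy CTC decoding collapses consecutive repeats and drops blanks:

```python
def ctc_greedy_collapse(ids, blank_id=0):
    """Standard CTC greedy collapse: merge consecutive repeats, drop blanks."""
    out = []
    prev = None
    for i in ids:
        if i != prev and i != blank_id:
            out.append(i)
        prev = i
    return out


# Frame-level ids -> collapsed token ids.
print(ctc_greedy_collapse([0, 1, 1, 0, 2, 2, 3]))  # prints [1, 2, 3]
```

The resulting ids are then mapped through the IPA vocabulary to produce the phone string.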
## Distributed / batched inference

The project's Hydra inference recipe points at this checkpoint out of the box:

```bash
python src/main.py \
    experiment=inference/transcribe_xeuspr_selfctc \
    data=powsmeval data.dataset_name=doreco \
    inference.inference_runner.checkpoint=/path/to/phoneticxeus_state_dict.pt
```

See `configs/experiment/inference/transcribe_xeuspr_selfctc.yaml` for all overrides.
## Training

- Data: IPAPack++ accent-mix (70+ languages, IPA transcriptions).
- Objective: CTC + 0.3 × self-conditioned interCTC at layers 4 / 8 / 12.
- Steps: 22k. The full config is in `config_tree.log`.
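The objective above can be sketched numerically. A common interCTC formulation (ESPnet-style; an assumption about this repo's exact code, not a verified excerpt) blends the final-layer CTC loss with the mean of the intermediate-layer CTC losses:

```python
def combined_ctc_loss(final_loss, inter_losses, interctc_weight=0.3):
    """Blend final-layer CTC loss with the mean of the intermediate CTC
    losses: (1 - w) * final + w * mean(inter). Plain floats for illustration;
    in training these would be tensors."""
    inter = sum(inter_losses) / len(inter_losses)
    return (1 - interctc_weight) * final_loss + interctc_weight * inter


# One final loss plus losses from layers 4 and 8 (toy numbers):
print(combined_ctc_loss(1.0, [2.0, 4.0]))  # 0.7 * 1.0 + 0.3 * 3.0 = 1.6
```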
## Evaluation

Reported in the paper using PER (Phone Error Rate), PFER (Phone Feature Error Rate), and FED (Feature Edit Distance).
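Of these, PER is the simplest to reproduce: Levenshtein distance between predicted and reference phone sequences, normalized by reference length. A minimal sketch (not the paper's evaluation code):

```python
def phone_error_rate(ref, hyp):
    """PER = Levenshtein edit distance over reference length, on phone lists."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n] / m


# One substitution out of four reference phones:
print(phone_error_rate(["h", "ə", "l", "oʊ"], ["h", "ɛ", "l", "oʊ"]))  # 0.25
```

PFER and FED additionally weight errors by articulatory-feature distance rather than treating every phone pair as equally wrong.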
## Citation

```bibtex
@misc{pxeus26,
  title={An Empirical Recipe for Universal Phone Recognition},
  author={Shikhar Bharadwaj and Chin-Jou Li and Kwanghee Choi and Eunjung Yeo and William Chen and Shinji Watanabe and David R. Mortensen},
  year={2026},
  eprint={2603.29042},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.29042},
}
```