# PhoneticXeus
PhoneticXeus is a multilingual phone recognition model built on the XEUS speech encoder with self-conditioned intermediate CTC at encoder layers 4, 8, and 12. It outputs IPA phone sequences and was trained on the accent-mix split of IPAPack++ (70+ languages).
- Paper: arXiv 2603.29042
- Code: https://github.com/changelinglab/PhoneticXeus
- License: Apache 2.0
## Files in this repo
| File | Description |
|---|---|
| `phoneticxeus_state_dict.pt` | **Preferred.** Plain `torch.save` of the model `state_dict`. |
| `checkpoint-22000.ckpt` | Original PyTorch Lightning checkpoint (same weights, plus optimizer/scheduler state). Use this to resume training. |
| `ipa_vocab.json` | 428-token IPA vocabulary (`{token: id}`). Required for decoding. |
| `config_tree.log` | Full Hydra config dump from training. |
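As a quick sanity check of the `{token: id}` format, the vocabulary can be loaded with plain `json` and inverted for decoding. The sketch below uses a hypothetical mini-vocabulary in the same shape (the real `ipa_vocab.json` holds 428 tokens):

```python
import json

# Hypothetical mini-vocabulary mimicking the {token: id} layout of ipa_vocab.json.
vocab_json = '{"<blank>": 0, "h": 1, "\\u0259": 2, "l": 3, "o\\u028a": 4}'
vocab = json.loads(vocab_json)

# Invert for decoding: id -> IPA token.
id_to_token = {i: t for t, i in vocab.items()}
print(id_to_token[4])  # prints "oʊ"
```

With the real file, replace `json.loads` with `json.load(open("ipa_vocab.json"))`.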
## Installation

```bash
git clone https://github.com/changelinglab/PhoneticXeus.git
cd PhoneticXeus
make install
source .venv/bin/activate
```
## Loading the model

**Important:** PhoneticXeus uses self-conditioned interCTC. You must pass the three `interctc_*` arguments shown below; otherwise the builder returns a vanilla XEUS-PR model whose weights will not match this checkpoint.
```python
import torch
from huggingface_hub import hf_hub_download

from src.model.xeusphoneme.builders import build_xeus_pr_inference

REPO = "changelinglab/PhoneticXeus"
ckpt_path = hf_hub_download(REPO, "phoneticxeus_state_dict.pt")
vocab_path = hf_hub_download(REPO, "ipa_vocab.json")

inference = build_xeus_pr_inference(
    work_dir="exp/cache/xeus",  # where the espnet/xeus config is cached
    hf_repo="espnet/xeus",
    checkpoint=ckpt_path,
    vocab_file=vocab_path,
    device="cuda" if torch.cuda.is_available() else "cpu",
    # self-conditioned interCTC (required)
    interctc_weight=0.3,
    interctc_layer_idx=[4, 8, 12],
    interctc_use_conditioning=True,
    ctc_weight=1.0,
)
```
The builder loads the `state_dict` non-strictly, so both `phoneticxeus_state_dict.pt` and `checkpoint-22000.ckpt` work with the same call.
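If you prefer to bypass the builder, the practical difference between the two files is nesting: Lightning checkpoints conventionally wrap the weights in a `"state_dict"` key alongside optimizer state. A hedged sketch of normalizing either layout (the key name is an assumption from Lightning convention, not verified against this checkpoint):

```python
import os
import tempfile

import torch


def extract_state_dict(path):
    """Return a flat state_dict whether the file is a plain state_dict dump
    or a Lightning-style checkpoint nesting it under "state_dict"."""
    obj = torch.load(path, map_location="cpu")
    if isinstance(obj, dict) and "state_dict" in obj:
        return obj["state_dict"]
    return obj


# Demo with a toy nested checkpoint.
sd = {"w": torch.zeros(3)}
path = os.path.join(tempfile.mkdtemp(), "demo.ckpt")
torch.save({"state_dict": sd, "optimizer_states": []}, path)
print(sorted(extract_state_dict(path).keys()))  # prints ['w']
```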
## Inference

```python
import torchaudio

wav, sr = torchaudio.load("path/to/audio.wav")
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)
wav = wav.mean(dim=0)  # mono

results = inference(wav)
print(results[0]["processed_transcript"])
# e.g. "h ə l oʊ w ɝ l d"
```
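Under the hood, a CTC model's transcript comes from decoding per-frame token ids. As an illustration only (this repo's actual decoding path may differ), standard greedy CTC decoding collapses consecutive repeats and drops blanks:

```python
def ctc_greedy_collapse(ids, blank_id=0):
    """Standard CTC greedy collapse: merge consecutive repeats, drop blanks."""
    out = []
    prev = None
    for i in ids:
        if i != prev and i != blank_id:
            out.append(i)
        prev = i
    return out


# Frame-level ids -> collapsed token ids.
print(ctc_greedy_collapse([0, 1, 1, 0, 2, 2, 3]))  # prints [1, 2, 3]
```

The resulting ids are then mapped through the IPA vocabulary to produce the phone string.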
## Distributed / batched inference

The project's Hydra inference recipe points at this checkpoint out of the box:

```bash
python src/main.py \
    experiment=inference/transcribe_xeuspr_selfctc \
    data=powsmeval data.dataset_name=doreco \
    inference.inference_runner.checkpoint=/path/to/phoneticxeus_state_dict.pt
```

See `configs/experiment/inference/transcribe_xeuspr_selfctc.yaml` for all overrides.
## Training

- Data: IPAPack++ accent-mix (70+ languages, IPA transcriptions).
- Objective: CTC + 0.3 × self-conditioned interCTC at layers 4 / 8 / 12.
- Steps: 22k. The full config is in `config_tree.log`.
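The objective above can be sketched numerically. A common interCTC formulation (ESPnet-style; an assumption about this repo's exact code, not a verified excerpt) blends the final-layer CTC loss with the mean of the intermediate-layer CTC losses:

```python
def combined_ctc_loss(final_loss, inter_losses, interctc_weight=0.3):
    """Blend final-layer CTC loss with the mean of the intermediate CTC
    losses: (1 - w) * final + w * mean(inter). Plain floats for illustration;
    in training these would be tensors."""
    inter = sum(inter_losses) / len(inter_losses)
    return (1 - interctc_weight) * final_loss + interctc_weight * inter


# One final loss plus losses from layers 4 and 8 (toy numbers):
print(combined_ctc_loss(1.0, [2.0, 4.0]))  # 0.7 * 1.0 + 0.3 * 3.0 = 1.6
```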
## Evaluation

Reported in the paper using PER (Phone Error Rate), PFER (Phone Feature Error Rate), and FED (Feature Edit Distance).
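Of these, PER is the simplest to reproduce: Levenshtein distance between predicted and reference phone sequences, normalized by reference length. A minimal sketch (not the paper's evaluation code):

```python
def phone_error_rate(ref, hyp):
    """PER = Levenshtein edit distance over reference length, on phone lists."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n] / m


# One substitution out of four reference phones:
print(phone_error_rate(["h", "ə", "l", "oʊ"], ["h", "ɛ", "l", "oʊ"]))  # 0.25
```

PFER and FED additionally weight errors by articulatory-feature distance rather than treating every phone pair as equally wrong.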
## Citation

```bibtex
@misc{pxeus26,
  title={An Empirical Recipe for Universal Phone Recognition},
  author={Shikhar Bharadwaj and Chin-Jou Li and Kwanghee Choi and Eunjung Yeo and William Chen and Shinji Watanabe and David R. Mortensen},
  year={2026},
  eprint={2603.29042},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.29042},
}
```