PhoneticXeus

PhoneticXeus is a multilingual phone recognition model built on the XEUS speech encoder with self-conditioned intermediate CTC at encoder layers 4, 8, and 12. It outputs IPA phone sequences and was trained on the accent-mix split of IPAPack++ (70+ languages).

Files in this repo

  • phoneticxeus_state_dict.pt: Preferred. Plain torch.save of the model state_dict.
  • checkpoint-22000.ckpt: Original PyTorch Lightning checkpoint (same weights, plus optimizer / scheduler state). Use this to resume training.
  • ipa_vocab.json: 428-token IPA vocabulary ({token: id}). Required for decoding.
  • config_tree.log: Full Hydra config dump from training.

Installation

git clone https://github.com/changelinglab/PhoneticXeus.git
cd PhoneticXeus
make install
source .venv/bin/activate

Loading the model

Important: PhoneticXeus uses self-conditioned interCTC. You must pass the three interctc_* arguments shown below — otherwise the builder returns a vanilla XEUS-PR model whose weights will not match this checkpoint.

import torch
from huggingface_hub import hf_hub_download
from src.model.xeusphoneme.builders import build_xeus_pr_inference

REPO = "changelinglab/PhoneticXeus"
ckpt_path = hf_hub_download(REPO, "phoneticxeus_state_dict.pt")
vocab_path = hf_hub_download(REPO, "ipa_vocab.json")

inference = build_xeus_pr_inference(
    work_dir="exp/cache/xeus",    # where the espnet/xeus config is cached
    hf_repo="espnet/xeus",
    checkpoint=ckpt_path,
    vocab_file=vocab_path,
    device="cuda" if torch.cuda.is_available() else "cpu",
    # self-conditioned interCTC (required)
    interctc_weight=0.3,
    interctc_layer_idx=[4, 8, 12],
    interctc_use_conditioning=True,
    ctc_weight=1.0,
)

The builder loads the state dict non-strictly, so both phoneticxeus_state_dict.pt and checkpoint-22000.ckpt work with the same call.
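If you ever want to turn the Lightning checkpoint into a bare state dict yourself, a small helper is enough. This is a sketch, assuming Lightning's usual "state_dict" key and a "model." module prefix (inspect your checkpoint's keys to confirm both):

```python
def to_bare_state_dict(ckpt: dict, prefix: str = "model.") -> dict:
    """Reduce a Lightning checkpoint dict to a plain model state_dict.

    Lightning nests the weights under "state_dict", typically with a
    module prefix ("model." here is an assumption). A dict that is
    already a bare state_dict passes through unchanged.
    """
    state = ckpt.get("state_dict", ckpt)
    return {k[len(prefix):] if k.startswith(prefix) else k: v
            for k, v in state.items()}

# Usage (with torch):
#   ckpt = torch.load("checkpoint-22000.ckpt", map_location="cpu")
#   torch.save(to_bare_state_dict(ckpt), "phoneticxeus_state_dict.pt")
```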

Inference

import torchaudio

wav, sr = torchaudio.load("path/to/audio.wav")
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)
wav = wav.mean(dim=0)  # mono

results = inference(wav)
print(results[0]["processed_transcript"])
# e.g. "h ə l oʊ w ɝ l d"
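The inference runner handles decoding internally, but for reference, CTC greedy decoding against the {token: id} mapping in ipa_vocab.json is straightforward: collapse consecutive repeats, then drop blanks. A minimal sketch (blank_id=0 is an assumption; check the vocabulary for the actual blank token's id):

```python
def greedy_ctc_decode(frame_ids, id_to_token, blank_id=0):
    """Standard CTC greedy decoding: collapse repeats, then drop blanks.

    frame_ids: per-frame argmax ids from the CTC head.
    id_to_token: inverted ipa_vocab.json mapping ({id: token}).
    blank_id=0 is an assumption about this vocabulary.
    """
    out, prev = [], None
    for i in frame_ids:
        if i != prev and i != blank_id:
            out.append(id_to_token[i])
        prev = i
    return " ".join(out)
```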

Distributed / batched inference

The project's Hydra inference recipe points at this checkpoint out of the box:

python src/main.py \
    experiment=inference/transcribe_xeuspr_selfctc \
    data=powsmeval data.dataset_name=doreco \
    inference.inference_runner.checkpoint=/path/to/phoneticxeus_state_dict.pt

See configs/experiment/inference/transcribe_xeuspr_selfctc.yaml for all overrides.

Training

  • Data: IPAPack++ accent-mix (70+ languages, IPA transcriptions).
  • Objective: CTC + 0.3 × self-conditioned interCTC at layers 4 / 8 / 12.
  • Steps: 22k. Full config is in config_tree.log.
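The objective above can be sketched as an ESPnet-style weighted blend of the final-layer CTC loss and the mean of the intermediate CTC losses. This is inferred from the config values (interctc_weight=0.3, layers 4 / 8 / 12), not the verbatim training code:

```python
def total_ctc_loss(final_ctc, inter_ctcs, interctc_weight=0.3):
    """Blend final-layer CTC with the mean of the intermediate CTC losses.

    ESPnet-style weighting assumed from the config: the intermediate
    losses (one per tapped layer) are averaged, then mixed with the
    final CTC loss using interctc_weight.
    """
    inter = sum(inter_ctcs) / len(inter_ctcs)
    return (1 - interctc_weight) * final_ctc + interctc_weight * inter
```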

Evaluation

Results are reported in the paper using PER (Phone Error Rate), PFER (Phone Feature Error Rate), and FED (Feature Edit Distance).
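Of these, PER is the simplest to reproduce: Levenshtein edit distance between the hypothesis and reference phone sequences, normalized by reference length. A minimal sketch:

```python
def per(ref, hyp):
    """Phone Error Rate: Levenshtein distance / reference length.

    ref and hyp are lists of phone symbols (e.g. split IPA tokens).
    """
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```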

Citation

@misc{pxeus26,
      title={An Empirical Recipe for Universal Phone Recognition},
      author={Shikhar Bharadwaj and Chin-Jou Li and Kwanghee Choi and Eunjung Yeo and William Chen and Shinji Watanabe and David R. Mortensen},
      year={2026},
      eprint={2603.29042},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.29042},
}