# WeSpeaker ResNet34 Speaker Embedding Model (MLX)

This is an MLX port of the `pyannote/wespeaker-voxceleb-resnet34-LM` speaker embedding model from the WeSpeaker toolkit.
## Model Description

A ResNet34-based speaker embedding model trained on VoxCeleb for speaker recognition and diarization. This MLX implementation matches the functionality of the PyTorch original and is optimized for Apple Silicon.

- Architecture: ResNet34 with [3, 4, 6, 3] block configuration
- Input: mel spectrogram (batch, time_frames, freq_bins=80)
- Output: 256-dimensional speaker embeddings
- Parameters: 6.6M
- Model size: 25 MB
## Performance

Speaker similarity preservation (vs. PyTorch original):

- Max cosine similarity difference: 2.4%
- Mean cosine similarity difference: 0.8%
- Numerical accuracy: max absolute difference ~0.17

The model closely preserves speaker-similarity relationships, making it suitable for production speaker diarization and verification.
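A minimal sketch for reproducing these similarity-difference numbers, assuming you have embeddings for the same segments from both backends (random stand-ins are used here so the snippet runs on its own):

```python
import numpy as np

def cosine_matrix(emb: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarities for row-wise embeddings of shape (n, d)."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return emb @ emb.T

# Random stand-ins for embeddings of the same n segments from MLX and PyTorch
rng = np.random.default_rng(0)
emb_mlx = rng.standard_normal((8, 256)).astype(np.float32)
emb_pt = emb_mlx + 0.01 * rng.standard_normal((8, 256)).astype(np.float32)

diff = np.abs(cosine_matrix(emb_mlx) - cosine_matrix(emb_pt))
print(f"max diff: {diff.max():.4f}, mean diff: {diff.mean():.4f}")
```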
## Installation

```bash
pip install mlx numpy
```
## Usage

```python
import mlx.core as mx
import numpy as np

from resnet_embedding import load_resnet34_embedding

# Load model
model = load_resnet34_embedding("weights.npz")

# Prepare mel spectrogram input (batch, time, freq)
# Example: 150 time frames, 80 mel bins
mel_spectrogram = mx.array(np.random.randn(1, 150, 80).astype(np.float32))

# Extract speaker embedding
embedding = model(mel_spectrogram)  # Shape: (1, 256)
print(f"Embedding shape: {embedding.shape}")
print(f"Embedding norm: {float(mx.linalg.norm(embedding)):.4f}")
```
### Computing Speaker Similarity

```python
# Extract embeddings for two audio segments
embedding1 = model(mel_spec1)  # (1, 256)
embedding2 = model(mel_spec2)  # (1, 256)

# Compute cosine similarity
similarity = mx.sum(embedding1 * embedding2) / (
    mx.linalg.norm(embedding1) * mx.linalg.norm(embedding2)
)
print(f"Speaker similarity: {float(similarity):.4f}")

# High similarity (>0.9): likely the same speaker
# Low similarity (<0.5): likely different speakers
```
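For repeated comparisons, a small decision helper is handy; here is a minimal sketch, where `same_speaker` and its `0.7` default threshold are our assumptions to tune on your own data:

```python
import mlx.core as mx

def same_speaker(emb1: mx.array, emb2: mx.array, threshold: float = 0.7) -> bool:
    """Cosine-similarity decision between two (1, 256) embeddings."""
    sim = mx.sum(emb1 * emb2) / (mx.linalg.norm(emb1) * mx.linalg.norm(emb2))
    return float(sim) > threshold
```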
## Input Requirements

The model expects mel spectrogram features with:

- Frequency bins: 80 (mel filterbanks)
- Time frames: variable length (e.g., 100-300 frames)
- Format: (batch_size, time_frames, freq_bins)
- Data type: float32
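To guard against shape mistakes, a small conversion helper can enforce these requirements (the helper name `to_model_input` is ours, not part of the port):

```python
import numpy as np
import mlx.core as mx

def to_model_input(features: np.ndarray) -> mx.array:
    """Convert (time_frames, 80) features into the (1, time_frames, 80) float32 batch the model expects."""
    if features.ndim != 2 or features.shape[1] != 80:
        raise ValueError(f"expected (time_frames, 80), got {features.shape}")
    return mx.array(features.astype(np.float32)[np.newaxis, :, :])
```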
### Extracting Mel Spectrograms

The original pyannote.audio wrapper extracts its features internally from raw audio, so it is useful as a reference when validating this port (the snippet below assumes pyannote.audio's standard waveform-in, embedding-out forward pass):

```python
from pyannote.audio import Model
import torch

# Load the original PyTorch model; its forward pass takes raw waveforms
# and performs feature extraction internally.
pt_model = Model.from_pretrained("pyannote/wespeaker-voxceleb-resnet34-LM")
pt_model.eval()

waveform = torch.randn(1, 1, 16000)  # (batch, channel, samples): 1 second at 16 kHz
with torch.no_grad():
    reference_embedding = pt_model(waveform)  # (1, 256)
```
Or use librosa:

```python
import librosa
import numpy as np

# Load audio
audio, sr = librosa.load("audio.wav", sr=16000)

# Extract mel spectrogram
mel_spec = librosa.feature.melspectrogram(
    y=audio,
    sr=sr,
    n_fft=512,
    hop_length=160,  # 10 ms at 16 kHz
    n_mels=80,
)

# Convert to log scale
mel_spec_db = librosa.power_to_db(mel_spec, ref=np.max)

# Transpose to (time, freq) and add batch dimension
mel_spec_input = mel_spec_db.T[np.newaxis, :, :]  # (1, time, 80)
```
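The resulting array can be passed straight to the MLX model loaded in the Usage section above:

```python
import mlx.core as mx

# `model` is the MLX model loaded in the Usage section
embedding = model(mx.array(mel_spec_input.astype(np.float32)))  # (1, 256)
```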
## Model Architecture

```
Input: (batch, time, freq=80)
  ↓ Add channel dimension
(batch, time, freq, 1)
  ↓ Transpose to match PyTorch layout
(batch, freq, time, 1)
  ↓ Conv2d (1→32, 3x3, padding=1)
(batch, freq, time, 32)
  ↓ BatchNorm + ReLU
  ↓ ResNet Layer1 (3 blocks, 32 channels)
  ↓ ResNet Layer2 (4 blocks, 32→64, stride=2)
  ↓ ResNet Layer3 (6 blocks, 64→128, stride=2)
  ↓ ResNet Layer4 (3 blocks, 128→256, stride=2)
(batch, freq', time', 256)
  ↓ Temporal Statistics Pooling (mean + std over time)
(batch, 5120)
  ↓ Fully Connected (5120→256)
Output: (batch, 256) speaker embeddings
```
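The 5120-dimensional pooled vector follows from the diagram: three stride-2 stages reduce the 80 frequency bins to 10, and concatenating per-channel mean and std over time gives 2 × 10 × 256 = 5120. A minimal sketch of that pooling step (our reading of the layout, not the port's exact code):

```python
import mlx.core as mx

def temporal_stats_pooling(x: mx.array) -> mx.array:
    """Pool (batch, freq', time', channels) over time into (batch, 2 * freq' * channels)."""
    mean = mx.mean(x, axis=2)                      # (batch, freq', channels)
    std = mx.sqrt(mx.var(x, axis=2) + 1e-7)        # small epsilon assumed for stability
    pooled = mx.concatenate([mean, std], axis=-1)  # (batch, freq', 2 * channels)
    return pooled.reshape(pooled.shape[0], -1)     # flatten to (batch, 5120) for freq'=10

x = mx.random.normal((1, 10, 19, 256))  # e.g. ~150 input frames after three stride-2 stages
print(temporal_stats_pooling(x).shape)  # (1, 5120)
```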
## Conversion Details

This model was converted from PyTorch to MLX with the following key fixes:

- Dimension ordering: input transposed to match PyTorch's (freq, time) layout
- BatchNorm: running statistics loaded and the model set to eval mode
- No final normalization: the PyTorch model does not apply L2 normalization
- Weight format: Conv2d weights transposed from (O, I, H, W) to (O, H, W, I)
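For example, the stem convolution's weight can be re-laid-out like this (a sketch of the stated fix, assuming NumPy-side conversion):

```python
import numpy as np

# PyTorch Conv2d weight layout: (out_channels, in_channels, H, W)
pt_weight = np.zeros((32, 1, 3, 3), dtype=np.float32)  # e.g. the 1 -> 32 stem conv

# MLX Conv2d expects channels-last kernels: (out_channels, H, W, in_channels)
mlx_weight = pt_weight.transpose(0, 2, 3, 1)
print(mlx_weight.shape)  # (32, 3, 3, 1)
```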
## Limitations

- Eval mode only: the model uses frozen BatchNorm statistics and is not suitable for fine-tuning without modification
- Numerical precision: small differences from PyTorch (~0.17 max absolute difference) due to implementation differences
- Fixed architecture: 80 mel bins are required (the architecture is hardcoded for this)
## Applications

This model is suitable for:

- Speaker diarization (who spoke when)
- Speaker verification (is this the same speaker?)
- Speaker identification (which speaker is this?)
- Voice biometrics
- Speaker clustering
## Citation

Original model from the WeSpeaker toolkit:

```bibtex
@inproceedings{wang2023wespeaker,
  title={Wespeaker: A research and production oriented speaker embedding learning toolkit},
  author={Wang, Hongji and Liang, Chengdong and Wang, Shuai and Chen, Zhengyang and Zhang, Binbin and Xiang, Xu and Deng, Yanlei and Qian, Yanmin},
  booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2023},
  organization={IEEE}
}
```

pyannote.audio implementation:

```bibtex
@inproceedings{Bredin2020,
  title={pyannote.audio: neural building blocks for speaker diarization},
  author={Herv{\'e} Bredin and Ruiqing Yin and Juan Manuel Coria and Gregory Gelly and Pavel Korshunov and Marvin Lavechin and Diego Fustes and Hadrien Titeux and Wassim Bouaziz and Marie-Philippe Gill},
  booktitle={ICASSP 2020, IEEE International Conference on Acoustics, Speech, and Signal Processing},
  year={2020}
}
```
## License

This model follows the same license as the original PyTorch model. Please check the original model card for license details.
## Conversion

Converted to MLX by the community. Original PyTorch model: `pyannote/wespeaker-voxceleb-resnet34-LM`

Validation: speaker similarity preserved to within 2.4% of the PyTorch implementation.