---
library_name: mlx
tags:
  - speaker-recognition
  - speaker-embedding
  - speaker-diarization
  - audio
  - resnet
  - mlx
  - apple-silicon
base_model: pyannote/wespeaker-voxceleb-resnet34-LM
license: mit
pipeline_tag: feature-extraction
---

# WeSpeaker ResNet34 Speaker Embedding Model (MLX)

This is an MLX port of the [pyannote/wespeaker-voxceleb-resnet34-LM](https://huggingface.co/pyannote/wespeaker-voxceleb-resnet34-LM) speaker embedding model from the WeSpeaker toolkit.

## Model Description

A **ResNet34-based speaker embedding model** trained on VoxCeleb for speaker recognition and diarization tasks. This MLX implementation matches the functionality of the PyTorch original and is optimized for Apple Silicon.

- **Architecture**: ResNet34 with a [3, 4, 6, 3] block configuration
- **Input**: mel spectrogram of shape (batch, time_frames, freq_bins=80)
- **Output**: 256-dimensional speaker embeddings
- **Parameters**: 6.6M
- **Model size**: 25 MB

## Performance

**Speaker similarity preservation** (vs. the PyTorch original):

- Max cosine similarity difference: **2.4%**
- Mean cosine similarity difference: **0.8%**
- Numerical accuracy: max absolute difference ~0.17

The model preserves speaker similarity relationships closely, making it suitable for production speaker diarization and verification tasks.
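The similarity figures above can be reproduced with a plain NumPy comparison of pairwise cosine-similarity matrices. The sketch below uses stand-in random data; in a real check, `torch_emb` and `mlx_emb` would be the embeddings produced by the two implementations on the same audio segments:

```python
import numpy as np

def pairwise_cosine(emb: np.ndarray) -> np.ndarray:
    """Cosine similarity between every pair of rows of an (n, d) matrix."""
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return normed @ normed.T

# Stand-in data: 8 segments, 256-dim embeddings, with a small perturbation
# playing the role of the MLX port's numerical differences
rng = np.random.default_rng(0)
torch_emb = rng.standard_normal((8, 256)).astype(np.float32)
mlx_emb = torch_emb + 0.01 * rng.standard_normal((8, 256)).astype(np.float32)

diff = np.abs(pairwise_cosine(mlx_emb) - pairwise_cosine(torch_emb))
print(f"max cosine similarity difference:  {diff.max():.4f}")
print(f"mean cosine similarity difference: {diff.mean():.4f}")
```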
## Installation

```bash
pip install mlx numpy
```

## Usage

```python
import mlx.core as mx
import numpy as np

# Load model
from resnet_embedding import load_resnet34_embedding
model = load_resnet34_embedding("weights.npz")

# Prepare mel spectrogram input (batch, time, freq)
# Example: 150 time frames, 80 mel bins
mel_spectrogram = mx.array(np.random.randn(1, 150, 80).astype(np.float32))

# Extract speaker embedding
embedding = model(mel_spectrogram)  # Shape: (1, 256)
print(f"Embedding shape: {embedding.shape}")
print(f"Embedding norm: {float(mx.linalg.norm(embedding)):.4f}")
```

### Computing Speaker Similarity

```python
# Extract embeddings for two audio segments
embedding1 = model(mel_spec1)  # (1, 256)
embedding2 = model(mel_spec2)  # (1, 256)

# Compute cosine similarity
similarity = mx.sum(embedding1 * embedding2) / (
    mx.linalg.norm(embedding1) * mx.linalg.norm(embedding2)
)
print(f"Speaker similarity: {float(similarity):.4f}")

# High similarity (> 0.9): likely the same speaker
# Low similarity (< 0.5): likely different speakers
```

## Input Requirements

The model expects mel spectrogram features with:

- **Frequency bins**: 80 (mel filterbanks)
- **Time frames**: variable length (e.g., 100–300 frames)
- **Format**: (batch_size, time_frames, freq_bins)
- **Data type**: float32

### Extracting Mel Spectrograms

You can use `pyannote.audio` for feature extraction:

```python
from pyannote.audio import Model
import torch

# Load the original PyTorch model
pt_model = Model.from_pretrained("pyannote/wespeaker-voxceleb-resnet34-LM")

# The pyannote wrapper computes log-mel (fbank) features internally during
# its forward pass, so it can be run directly on a raw waveform
waveform = torch.randn(1, 16000)  # 1 second at 16 kHz
with torch.no_grad():
    # How the feature extractor is exposed as an attribute depends on the
    # pyannote.audio version; inspect `pt_model` to locate it
    pass
```

Or use `librosa`:

```python
import librosa
import numpy as np

# Load audio at 16 kHz
audio, sr = librosa.load("audio.wav", sr=16000)

# Extract mel spectrogram
mel_spec = librosa.feature.melspectrogram(
    y=audio,
    sr=sr,
    n_fft=512,
    hop_length=160,  # 10 ms at 16 kHz
    n_mels=80,
)

# Convert to log scale
mel_spec_db = librosa.power_to_db(mel_spec, ref=np.max)

# Transpose to (time, freq) and add a batch dimension
mel_spec_input = mel_spec_db.T[np.newaxis, :, :].astype(np.float32)  # (1, time, 80)
```

## Model Architecture

```
Input: (batch, time, freq=80)
  ↓ Add channel dimension                         (batch, time, freq, 1)
  ↓ Transpose to match PyTorch layout             (batch, freq, time, 1)
  ↓ Conv2d (1→32, 3x3, padding=1)                 (batch, freq, time, 32)
  ↓ BatchNorm + ReLU
  ↓ ResNet Layer1 (3 blocks, 32 channels)
  ↓ ResNet Layer2 (4 blocks, 32→64, stride=2)
  ↓ ResNet Layer3 (6 blocks, 64→128, stride=2)
  ↓ ResNet Layer4 (3 blocks, 128→256, stride=2)   (batch, freq', time', 256)
  ↓ Temporal Statistics Pooling (mean + std over time)  (batch, 5120)
  ↓ Fully Connected (5120→256)
Output: (batch, 256) speaker embeddings
```

## Conversion Details

This model was converted from PyTorch to MLX with the following key fixes:

1. **Dimension ordering**: input transposed to match PyTorch's (freq, time) layout
2. **BatchNorm**: running statistics loaded and the model set to eval mode
3. **No final normalization**: the PyTorch model does not apply L2 normalization to the embeddings
4. **Weight format**: Conv2d weights transposed from (O, I, H, W) to (O, H, W, I)

## Limitations

- **Eval mode only**: the model uses frozen BatchNorm statistics and is not suitable for fine-tuning without modification
- **Numerical precision**: small differences from PyTorch (~0.17 max absolute difference) due to implementation differences
- **Fixed architecture**: 80 mel bins are required (the architecture is hardcoded for this)

## Applications

This model is suitable for:

- ✅ **Speaker diarization** (who spoke when)
- ✅ **Speaker verification** (is this the same speaker?)
- ✅ **Speaker identification** (which speaker is this?)
- ✅ **Voice biometrics**
- ✅ **Speaker clustering**

## Citation

Original model from the WeSpeaker toolkit:

```bibtex
@inproceedings{wang2023wespeaker,
  title={Wespeaker: A research and production oriented speaker embedding learning toolkit},
  author={Wang, Hongji and Liang, Chengdong and Wang, Shuai and Chen, Zhengyang and Zhang, Binbin and Xiang, Xu and Deng, Yanlei and Qian, Yanmin},
  booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2023},
  organization={IEEE}
}
```

pyannote.audio implementation:

```bibtex
@inproceedings{Bredin2020,
  title={pyannote.audio: neural building blocks for speaker diarization},
  author={Herv{\'e} Bredin and Ruiqing Yin and Juan Manuel Coria and Gregory Gelly and Pavel Korshunov and Marvin Lavechin and Diego Fustes and Hadrien Titeux and Wassim Bouaziz and Marie-Philippe Gill},
  booktitle={ICASSP 2020, IEEE International Conference on Acoustics, Speech, and Signal Processing},
  year={2020},
}
```

## License

This model follows the same license as the original PyTorch model. See the [original model card](https://huggingface.co/pyannote/wespeaker-voxceleb-resnet34-LM) for license details.

## Conversion

Converted to MLX by the community. Original PyTorch model: [pyannote/wespeaker-voxceleb-resnet34-LM](https://huggingface.co/pyannote/wespeaker-voxceleb-resnet34-LM)

**Validation**: speaker similarity preserved to within 2.4% of the PyTorch implementation.
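For reference, the temporal statistics pooling stage shown in the Model Architecture section can be sketched in NumPy to show where the 5120-dimensional vector comes from. This is a rough sketch: the exact flattening order inside the real model may differ, but the output size is the same.

```python
import numpy as np

def temporal_stats_pooling(features: np.ndarray) -> np.ndarray:
    """Mean + std over the time axis of a (batch, freq, time, channels) map,
    flattened to (batch, 2 * freq * channels)."""
    mean = features.mean(axis=2)  # (batch, freq, channels)
    std = features.std(axis=2)    # (batch, freq, channels)
    batch = features.shape[0]
    return np.concatenate([mean, std], axis=-1).reshape(batch, -1)

# After four ResNet stages, the 80 mel bins are downsampled to 10 and the
# channel count grows to 256, so pooling yields 2 * 10 * 256 = 5120 features
feats = np.random.randn(1, 10, 25, 256).astype(np.float32)
pooled = temporal_stats_pooling(feats)
print(pooled.shape)  # (1, 5120)
```

The concatenated mean and standard deviation capture both the average spectral content and its variability over time, which is what the final 5120→256 fully connected layer maps down to the speaker embedding.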