# BGE-M3 Matryoshka - ONNX INT8

ONNX INT8 dynamically quantized version of tss-deposium/bge-m3-matryoshka-1024d for CPU-optimized inference.

Fine-tuned from BAAI/bge-m3 with MatryoshkaLoss; supports dynamic dimension truncation at query time (1024 -> 768 -> 512 -> 256) via simple array slicing, with a single model loaded in memory (~571 MB).
## Key Features
- Dynamic dimensions: Truncate embeddings to 1024/768/512/256D at inference time with zero overhead
- ONNX INT8: CPU-optimized, no GPU required (~571 MB RAM)
- Matryoshka-trained: Quality degrades gracefully at lower dimensions (not random truncation)
- Cross-lingual: FR/EN/DE/ES/IT/PT/ZH/JA/KO (inherited from BGE-M3)
- Drop-in replacement: Same 1024D output as gpahal/bge-m3-onnx-int8, with better discrimination (+6.6%)
## Benchmark Results
Tested on 4 semantic pairs (FR/EN cross-lingual) + 2 negative pairs. Discrimination = avg_positive_similarity - avg_negative_similarity (higher = better separation).
| Model | Dim | AvgPair | AvgNeg | Discrim | Throughput | Notes |
|---|---|---|---|---|---|---|
| m2v-bge-m3-1024d | 1024 | 0.470 | 0.159 | 0.312 | 581 t/s | GPU, 20x faster but weaker cross-lingual |
| gpahal/bge-m3-onnx-int8 | 1024 | 0.695 | 0.317 | 0.377 | 28 t/s | CPU baseline |
| This model | 1024 | 0.698 | 0.295 | 0.403 | 29 t/s | Best discrimination (+6.6% vs baseline) |
| This model @768D | 768 | 0.711 | 0.315 | 0.397 | 29 t/s | -1.4% discrim, -25% storage |
| This model @512D | 512 | 0.729 | 0.348 | 0.381 | 29 t/s | -5.4% discrim, -50% storage |
| This model @256D | 256 | 0.658 | 0.200 | 0.458* | 25 t/s | *Anomalous: loses cross-lingual |
### Per-Pair Breakdown (Cosine Similarity)
| Pair | m2v-1024 | onnx-1024 | matr-1024 | matr-768 | matr-512 | matr-256 |
|---|---|---|---|---|---|---|
| couple_serrage (FR/EN) | 0.056 | 0.351 | 0.284 | 0.321 | 0.371 | 0.197 |
| fogg_depart (FR/FR) | 0.634 | 0.837 | 0.866 | 0.873 | 0.878 | 0.825 |
| revenue_q2 (EN/FR) | 0.391 | 0.645 | 0.686 | 0.697 | 0.711 | 0.658 |
| moteur_spec (FR/EN) | 0.801 | 0.946 | 0.954 | 0.954 | 0.955 | 0.950 |
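As a sanity check, the 1024D aggregate numbers can be recomputed from the per-pair values above together with the reported negative-pair average (a quick verification sketch, not the original benchmark script):

```python
# Positive-pair cosine similarities for this model at 1024D (per-pair table above)
pos = [0.284, 0.866, 0.686, 0.954]
neg_avg = 0.295  # reported average over the 2 negative pairs

avg_pair = sum(pos) / len(pos)
discrim = avg_pair - neg_avg
print(f"AvgPair={avg_pair:.3f}  Discrim={discrim:.3f}")
# Matches the benchmark row (0.698 / 0.403, to rounding)
```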
## Dimension Recommendations
| Dimension | vs M2V baseline | vs bge-m3-onnx-1024D | Storage (1M vectors) | Recommendation |
|---|---|---|---|---|
| 1024D | +29.2% | +6.6% | 4 GB | Maximum quality |
| 768D | +27.2% | +5.3% | 3 GB | Safe, -25% storage |
| 512D | +22.1% | +1.1% | 2 GB | Best for cloud CPU scaling |
| 256D | anomalous | anomalous | 1 GB | Not recommended (loses cross-lingual) |
**512D Matryoshka beats full-resolution bge-m3-onnx-int8 at 1024D** (+1.1% discrimination) with half the storage and 2x faster cosine-similarity computation.
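The storage column follows directly from float32 vector size (4 bytes per component); a quick sanity check:

```python
# Raw float32 storage for 1M vectors at each Matryoshka dimension
for dim in (1024, 768, 512, 256):
    gb = 1_000_000 * dim * 4 / 1e9  # bytes -> decimal GB
    print(f"{dim:4d}D: {gb:.1f} GB")
# Approximately 4 / 3 / 2 / 1 GB, as in the table above
```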
## Usage
### With ONNX Runtime (Python)
```python
from onnxruntime import InferenceSession
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tss-deposium/bge-m3-matryoshka-1024d-onnx-int8")
session = InferenceSession("model_quantized.onnx")

inputs = tokenizer("Bonjour le monde", return_tensors="np", padding=True, truncation=True)
outputs = session.run(None, dict(inputs))
embedding_1024d = outputs[0][0]  # [1024] float32

# Matryoshka truncation: just slice!
embedding_768d = embedding_1024d[:768]
embedding_512d = embedding_1024d[:512]
embedding_256d = embedding_1024d[:256]
```
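One caveat (an assumption about this export; verify against your own outputs): if the graph emits L2-normalized embeddings, a truncated slice is no longer unit-length, so re-normalize before computing cosine similarity. A minimal sketch:

```python
import numpy as np

def truncate_and_normalize(embedding: np.ndarray, dim: int) -> np.ndarray:
    """Slice a Matryoshka embedding to `dim` dimensions and restore unit L2 norm."""
    sliced = np.asarray(embedding[:dim], dtype=np.float32)
    norm = np.linalg.norm(sliced)
    return sliced / norm if norm > 0 else sliced

# Demo with a dummy unit vector standing in for a model output
vec = np.random.default_rng(0).normal(size=1024).astype(np.float32)
vec /= np.linalg.norm(vec)

v512 = truncate_and_normalize(vec, 512)
print(np.linalg.norm(v512))  # ~1.0, so cosine similarity reduces to a dot product
```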
### With Deposium TurboV2 API
```bash
# Full 1024D (default)
curl -X POST http://localhost:11435/api/embed \
  -d '{"model": "bge-m3-matryoshka", "input": "Bonjour le monde"}'

# Truncated to 512D (50% less storage, ~95% quality)
curl -X POST http://localhost:11435/api/embed \
  -d '{"model": "bge-m3-matryoshka", "input": "Bonjour", "dimensions": 512}'
```
## Training Details
- Base model: BAAI/bge-m3 (XLM-RoBERTa, 0.6B params)
- Fine-tuning: MatryoshkaLoss + MultipleNegativesRankingLoss
- Matryoshka dims: [1024, 768, 512, 256]
- Dataset: 672,676 training samples / 35,405 eval samples
- Epochs: 3, batch size 32, lr 2e-5, bf16
- ONNX export: Dynamic INT8 quantization (avx512_vnni)
- Size: ~571 MB (vs ~2.2 GB FP32)
## Files
| File | Description |
|---|---|
| `model_quantized.onnx` | ONNX INT8 quantized model |
| `config.json` | Model configuration |
| `ort_config.json` | ONNX Runtime configuration |
| `tokenizer.json` | Fast tokenizer |
| `tokenizer_config.json` | Tokenizer configuration |
| `sentencepiece.bpe.model` | SentencePiece BPE model |
| `special_tokens_map.json` | Special tokens mapping |
## License

MIT. Commercial use allowed.