BGE-M3 Matryoshka β€” ONNX INT8

ONNX INT8 dynamically-quantized version of tss-deposium/bge-m3-matryoshka-1024d for CPU-optimized inference.

Fine-tuned from BAAI/bge-m3 with MatryoshkaLoss β€” supports dynamic dimension truncation at query time (1024 -> 768 -> 512 -> 256) via simple array slicing, with a single model loaded in memory (~571 MB).

Key Features

  • Dynamic dimensions: Truncate embeddings to 1024/768/512/256D at inference time β€” zero overhead
  • ONNX INT8: CPU-optimized, no GPU required (~571 MB RAM)
  • Matryoshka-trained: Quality degrades gracefully at lower dimensions (not random truncation)
  • Cross-lingual: FR/EN/DE/ES/IT/PT/ZH/JA/KO β€” inherited from BGE-M3
  • Drop-in replacement: Same 1024D output as gpahal/bge-m3-onnx-int8, with better discrimination (+6.6%)

Benchmark Results

Tested on 4 semantic pairs (FR/EN cross-lingual) + 2 negative pairs. Discrimination = avg_positive_similarity - avg_negative_similarity (higher = better separation).

| Model | Dim | Avg Pos | Avg Neg | Discrim | Throughput | Notes |
|-------|-----|---------|---------|---------|------------|-------|
| m2v-bge-m3-1024d | 1024 | 0.470 | 0.159 | 0.312 | 581 t/s | GPU, 20x faster but weaker cross-lingual |
| gpahal/bge-m3-onnx-int8 | 1024 | 0.695 | 0.317 | 0.377 | 28 t/s | CPU baseline |
| This model | 1024 | 0.698 | 0.295 | 0.403 | 29 t/s | Best discrimination (+6.6% vs baseline) |
| This model @768D | 768 | 0.711 | 0.315 | 0.397 | 29 t/s | -1.4% discrim, -25% storage |
| This model @512D | 512 | 0.729 | 0.348 | 0.381 | 29 t/s | -5.4% discrim, -50% storage |
| This model @256D | 256 | 0.658 | 0.200 | 0.458* | 25 t/s | *Anomalous: loses cross-lingual alignment |
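The discrimination metric is straightforward to reproduce. The sketch below takes the four positive-pair similarities from the matr-1024 column of the per-pair table; the two negative similarities are not listed individually, so the values used here are an assumption chosen to average to the reported 0.295.

```python
import numpy as np

def discrimination(pos_sims, neg_sims):
    """Discrimination = mean positive similarity minus mean negative similarity."""
    return float(np.mean(pos_sims) - np.mean(neg_sims))

pos = [0.284, 0.866, 0.686, 0.954]  # matr-1024 column of the per-pair table
neg = [0.295, 0.295]                # assumed: only the average (0.295) is reported

print(f"discrimination: {discrimination(pos, neg):.3f}")  # ~0.403, matching the table
```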

Per-Pair Breakdown (Cosine Similarity)

| Pair | m2v-1024 | onnx-1024 | matr-1024 | matr-768 | matr-512 | matr-256 |
|------|----------|-----------|-----------|----------|----------|----------|
| couple_serrage (FR/EN) | 0.056 | 0.351 | 0.284 | 0.321 | 0.371 | 0.197 |
| fogg_depart (FR/FR) | 0.634 | 0.837 | 0.866 | 0.873 | 0.878 | 0.825 |
| revenue_q2 (EN/FR) | 0.391 | 0.645 | 0.686 | 0.697 | 0.711 | 0.658 |
| moteur_spec (FR/EN) | 0.801 | 0.946 | 0.954 | 0.954 | 0.955 | 0.950 |

Dimension Recommendations

| Dimension | vs M2V baseline | vs bge-m3-onnx @1024D | Storage (1M vectors) | Recommendation |
|-----------|-----------------|-----------------------|----------------------|----------------|
| 1024D | +29.2% | +6.6% | 4 GB | Maximum quality |
| 768D | +27.2% | +5.3% | 3 GB | Safe, -25% storage |
| 512D | +22.1% | +1.1% | 2 GB | Best for cloud CPU scaling |
| 256D | anomalous | anomalous | 1 GB | Not recommended (loses cross-lingual) |
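The storage column is plain float32 arithmetic: dimensions x 4 bytes x vector count. A quick sanity check, ignoring index overhead:

```python
def storage_gb(dims: int, n_vectors: int, bytes_per_value: int = 4) -> float:
    """Raw float32 vector storage in decimal GB (no index overhead)."""
    return dims * bytes_per_value * n_vectors / 1e9

for d in (1024, 768, 512, 256):
    print(f"{d:>4}D x 1M vectors: {storage_gb(d, 1_000_000):.2f} GB")
# 1024D -> 4.10 GB, 512D -> 2.05 GB: matching the rounded figures above
```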

At 512D, this Matryoshka model outperforms full-resolution bge-m3-onnx-int8 at 1024D (+1.1% discrimination) while using half the storage and computing cosine similarity roughly twice as fast.

Usage

With ONNX Runtime (Python)

from onnxruntime import InferenceSession
from transformers import AutoTokenizer
import numpy as np

tokenizer = AutoTokenizer.from_pretrained("tss-deposium/bge-m3-matryoshka-1024d-onnx-int8")
session = InferenceSession("model_quantized.onnx")

inputs = tokenizer("Bonjour le monde", return_tensors="np", padding=True, truncation=True)
outputs = session.run(None, dict(inputs))
embedding_1024d = outputs[0][0]  # shape (1024,), float32

# Matryoshka truncation: just slice, then re-normalize (truncation breaks
# the unit norm that dot-product cosine similarity relies on)
embedding_768d = embedding_1024d[:768] / np.linalg.norm(embedding_1024d[:768])
embedding_512d = embedding_1024d[:512] / np.linalg.norm(embedding_1024d[:512])
embedding_256d = embedding_1024d[:256] / np.linalg.norm(embedding_1024d[:256])
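To compare two texts at a reduced dimension, slice both embeddings, re-normalize, and take the dot product. A self-contained sketch with random stand-in vectors so it runs without the model; with real embeddings you would pass the session outputs instead:

```python
import numpy as np

def truncate_normalize(emb: np.ndarray, dim: int) -> np.ndarray:
    """Slice a Matryoshka embedding to `dim` and rescale to unit length."""
    sliced = emb[:dim]
    return sliced / np.linalg.norm(sliced)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b))  # inputs are unit-norm, so dot == cosine

# Random stand-ins for two model outputs (illustration only)
rng = np.random.default_rng(42)
doc, query = rng.standard_normal(1024), rng.standard_normal(1024)

for dim in (1024, 768, 512, 256):
    sim = cosine(truncate_normalize(doc, dim), truncate_normalize(query, dim))
    print(f"{dim:>4}D cosine similarity: {sim:+.3f}")
```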

With Deposium TurboV2 API

# Full 1024D (default)
curl -X POST http://localhost:11435/api/embed \
  -d '{"model": "bge-m3-matryoshka", "input": "Bonjour le monde"}'

# Truncated to 512D (50% less storage, ~95% quality)
curl -X POST http://localhost:11435/api/embed \
  -d '{"model": "bge-m3-matryoshka", "input": "Bonjour", "dimensions": 512}'

Training Details

  • Base model: BAAI/bge-m3 (XLM-RoBERTa, 0.6B params)
  • Fine-tuning: MatryoshkaLoss + MultipleNegativesRankingLoss
  • Matryoshka dims: [1024, 768, 512, 256]
  • Dataset: 672,676 training samples / 35,405 eval samples
  • Epochs: 3, batch size 32, lr 2e-5, bf16
  • ONNX export: Dynamic INT8 quantization (avx512_vnni)
  • Size: ~571 MB (vs ~2.2 GB FP32)

Files

| File | Description |
|------|-------------|
| model_quantized.onnx | ONNX INT8 quantized model |
| config.json | Model configuration |
| ort_config.json | ONNX Runtime configuration |
| tokenizer.json | Fast tokenizer |
| tokenizer_config.json | Tokenizer configuration |
| sentencepiece.bpe.model | SentencePiece BPE model |
| special_tokens_map.json | Special tokens mapping |

License

MIT β€” commercial use allowed.
