# Octen-Embedding-0.6B – INT4 ONNX (MatMulNBits, block_size=32)
INT4-quantized ONNX export of Octen/Octen-Embedding-0.6B. Smallest resident memory of all variants (~1.15 GB RSS) with 1.00 top-1 retrieval accuracy.
## Quantization details
| Property | Value |
|---|---|
| Method | onnxruntime.quantization.MatMulNBitsQuantizer |
| Bits | 4 |
| Block size | 32 (one scale per 32 consecutive weights) |
| Symmetry | Symmetric (no zero-point) |
| Op | MatMulNBits contrib op (ORT ≥ 1.16; CPU, CUDA, and CoreML EPs) |
| Ops quantized | MatMul only – Gather (embedding table) left in FP32 |
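MatMulNBits stores two 4-bit weight values per byte, which is what keeps the weight file small. A minimal NumPy sketch of such nibble packing (the exact bit layout and sign convention ORT uses internally may differ; this only illustrates the idea):

```python
import numpy as np

def pack_int4(q):
    # q: signed 4-bit values in [-8, 7]. Two values per byte:
    # low nibble = even index, high nibble = odd index
    # (ORT's actual layout may differ).
    u = (q & 0x0F).astype(np.uint8)          # two's-complement nibbles
    return (u[0::2] | (u[1::2] << 4)).astype(np.uint8)

def unpack_int4(packed):
    lo = (packed & 0x0F).astype(np.int8)
    hi = (packed >> 4).astype(np.int8)
    q = np.empty(packed.size * 2, dtype=np.int8)
    q[0::2], q[1::2] = lo, hi
    return np.where(q > 7, q - 16, q)        # restore sign

q = np.array([-8, -1, 0, 7, 3, -4], dtype=np.int8)
packed = pack_int4(q)
print(q.size, "weights in", packed.nbytes, "bytes")  # 6 weights in 3 bytes
```

Each block of 32 packed weights is then paired with one dequantization scale, which is where the `.data` file's extra bytes beyond the raw 4-bit payload come from.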
Block-wise vs per-tensor: block_size=32 yields 32 768 scale values for a 1024×1024 matrix, versus a single scale for per-tensor INT8. This fine granularity explains why INT4 achieves higher cosine fidelity to FP32 (0.945) than per-tensor INT8 (0.830) despite using half the bits.
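A toy NumPy illustration of why per-block scales help (this is not ORT's kernel, just a round-trip quantization experiment): a single outlier weight only degrades its own 32-weight block, while a per-tensor scale lets that one outlier inflate the quantization step everywhere.

```python
import numpy as np

def fake_dequant(w, block_size, bits):
    """Symmetric round-trip quantization: quantize, then dequantize."""
    qmax = 2 ** (bits - 1) - 1                  # 7 for 4-bit
    flat = w.reshape(-1, block_size)            # one scale per block
    scale = np.abs(flat).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(flat / scale), -qmax - 1, qmax)
    return (q * scale).reshape(w.shape)

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.02, size=(1024, 1024)).astype(np.float32)
W[0, 0] = 0.5                                   # a single outlier weight

# Fine-grained: 32 768 scales; coarse: one scale for the whole tensor.
err_block = np.abs(fake_dequant(W, 32, 4) - W).mean()
err_tensor = np.abs(fake_dequant(W, W.size, 4) - W).mean()
print(f"block-wise 4-bit error: {err_block:.1e}")
print(f"per-tensor 4-bit error: {err_tensor:.1e}")
```

Real transformer weight matrices contain exactly these kinds of outliers, which is why block-wise INT4 can compete with, and here beat, coarser-grained INT8.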
Note on dynamic batch: this variant was produced with the legacy `torch.onnx.export` (not the dynamo exporter). It runs correctly at batch=1 only. If you need batch > 1 for throughput, use the INT8 variant, which is based on the dynamo export. A dynamo-based INT4 re-export is planned.
## Benchmark (Apple M-series, CPU)
| Metric | Value |
|---|---|
| Ingest throughput | ~2.6 ch/s |
| Top-1 hybrid accuracy | 1.00 |
| RSS memory | ~1.15 GB |
| File size | ~0.9 GB |
## Quality metrics vs FP32
Measured on 8 diverse EN/DE sentences (3 semantic triplets):
| Metric | Value |
|---|---|
| Cosine similarity to FP32 (mean) | 0.945 |
| Cosine similarity to FP32 (min) | 0.930 |
| Semantic ordering (3/3 triplets) | ✓ |
| Triplet margin (mean) | 0.241 |
| Anisotropy (avg pairwise cos) | 0.233 |
| Unit-norm compliance | ✓ |
The high cosine fidelity (0.945) despite using only 4 bits comes from block-wise scaling (block_size=32), which is far finer-grained than the per-tensor INT8 approach.
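The fidelity numbers in the table reduce to a few lines of NumPy once both embedding sets are available. The vectors below are random stand-ins for the real FP32 and INT4 model outputs (with simulated quantization noise), purely to show the computation:

```python
import numpy as np

def cosine_fidelity(emb_a, emb_b):
    # Both arrays: [n_sentences, dim], assumed L2-normalised,
    # so cosine similarity is just the row-wise dot product.
    cos = np.sum(emb_a * emb_b, axis=1)
    return cos.mean(), cos.min()

# Stand-in data: 8 unit vectors plus small per-dimension noise
# (the real measurement uses embeddings from both model variants).
rng = np.random.default_rng(0)
a = rng.normal(size=(8, 1024))
a /= np.linalg.norm(a, axis=1, keepdims=True)
b = a + 0.01 * rng.normal(size=a.shape)   # simulated quantization noise
b /= np.linalg.norm(b, axis=1, keepdims=True)

mean_cos, min_cos = cosine_fidelity(a, b)
print(f"mean={mean_cos:.3f} min={min_cos:.3f}")
```

Triplet margin is computed analogously: for each (anchor, positive, negative) triplet, it is cos(anchor, positive) minus cos(anchor, negative), which should stay clearly positive after quantization.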
## Model details
| Property | Value |
|---|---|
| Embedding dim | 1024 |
| Max context | 32 768 tokens |
| Inputs | input_ids [batch, seq], attention_mask [batch, seq] |
| Output | last_hidden_state [batch, seq, 1024] |
| Pooling | Last-token pooling + L2 normalisation |
| Batch support | batch=1 only (legacy export limitation) |
## Inference (batch=1)
```python
import onnxruntime as ort
import numpy as np
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")
tokenizer.enable_truncation(max_length=512)

# CPUExecutionProvider supports the 4-bit MatMulNBits contrib op
session = ort.InferenceSession("model.int4.onnx", providers=["CPUExecutionProvider"])

text = "semantic search example"
enc = tokenizer.encode(text)
ids = np.array([enc.ids], dtype=np.int64)
mask = np.array([enc.attention_mask], dtype=np.int64)

lhs = session.run(None, {"input_ids": ids, "attention_mask": mask})[0]  # [1, seq, 1024]
emb = lhs[0, mask[0].sum() - 1]  # last non-padding token
emb = emb / np.linalg.norm(emb)  # L2-normalise
print(emb.shape)  # (1024,)
```
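Because this export is limited to batch=1, a corpus is embedded one text at a time in a loop; retrieval then reduces to a dot product over the stored unit vectors. A sketch of the search step, using random placeholder vectors in place of real embeddings:

```python
import numpy as np

def top_k(query_emb, corpus_embs, k=3):
    # All embeddings are L2-normalised, so cosine similarity
    # is a single matrix-vector product.
    scores = corpus_embs @ query_emb
    idx = np.argsort(-scores)[:k]
    return idx, scores[idx]

# Placeholder unit vectors standing in for per-text embeddings.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(5, 1024))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
query = corpus[2] + 0.01 * rng.normal(size=1024)  # near document 2
query /= np.linalg.norm(query)

idx, scores = top_k(query, corpus)
print(idx)  # document 2 should rank first
```

For larger corpora the same dot-product search is typically delegated to a vector index, but the batch=1 constraint only affects the embedding step, not retrieval.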
## Files
| File | Size | Description |
|---|---|---|
| model.int4.onnx | ~3 MB | ONNX graph with MatMulNBits nodes |
| model.int4.onnx.data | ~855 MB | 4-bit weight data + scales |
| tokenizer.json | 11 MB | HuggingFace fast tokenizer |
## Variants
| Repo | Precision | Size | Batch | Notes |
|---|---|---|---|---|
| cstr/octen-embedding-0.6b-onnx | FP32 | 2.4 GB | dynamic | Reference |
| cstr/octen-embedding-0.6b-onnx-int8 | INT8 | 1.1 GB | dynamic | Recommended |
| cstr/octen-embedding-0.6b-onnx-int4 | INT4 | 0.9 GB | batch=1 | This repo – minimum RAM |
## License
Apache 2.0.
## Base model
Qwen/Qwen3-0.6B-Base