# Octen-Embedding-0.6B — INT4 ONNX (MatMulNBits, block_size=32)

INT4-quantized ONNX export of Octen/Octen-Embedding-0.6B. Smallest resident memory of all variants (~1.15 GB RSS) with 1.00 top-1 retrieval accuracy.

## Quantization details

| Property | Value |
|---|---|
| Method | `onnxruntime.quantization.MatMulNBitsQuantizer` |
| Bits | 4 |
| Block size | 32 (one scale per 32 consecutive weights) |
| Symmetry | Symmetric (no zero-point) |
| Op | `MatMulNBits` contrib op (ORT ≥ 1.16; CPU / CUDA / CoreML EPs) |
| Ops quantized | MatMul only — `Gather` (embedding table) left in FP32 |

**Block-wise vs per-tensor:** block_size=32 yields 32 768 scale values for a 1024×1024 weight matrix (1024 × 1024 / 32), versus a single scale for per-tensor INT8. This finer granularity is why INT4 shows higher cosine fidelity to FP32 (0.945) than per-tensor INT8 (0.830) despite using half the bits per weight.
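The per-block scaling can be sketched in plain Python. This is an illustrative model of symmetric 4-bit block quantization, not the actual `MatMulNBits` kernel (which, among other things, packs two 4-bit values per byte):

```python
def quantize_block(block):
    """Symmetric 4-bit quantization of one block: one scale, no zero-point."""
    scale = max(abs(w) for w in block) / 7 or 1.0  # fall back to 1.0 for all-zero blocks
    q = [max(-8, min(7, round(w / scale))) for w in block]
    return q, scale

def dequantize_block(q, scale):
    return [v * scale for v in q]

# With block_size=32, a 1024x1024 matrix splits into 1024*1024/32 = 32768
# blocks, hence 32768 scales -- the granularity discussed above.
weights = [0.5, -0.25, 0.125, 0.0] * 8        # one toy 32-weight block
q, scale = quantize_block(weights)
restored = dequantize_block(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

Because each block gets its own scale, a single outlier weight can only degrade the resolution of its own 32-weight block, not the whole tensor.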

**Note on dynamic batch:** this variant was produced with the legacy `torch.onnx.export` (not the dynamo-based exporter). It runs correctly at batch=1 only. If you need batch > 1 for throughput, use the INT8 variant, which is based on the dynamo export. A dynamo-based INT4 re-export is planned.

## Benchmark (Apple M-series, CPU)

| Metric | Value |
|---|---|
| Ingest throughput | ~2.6 ch/s |
| Top-1 hybrid accuracy | 1.00 |
| RSS memory | ~1.15 GB |
| File size | ~0.9 GB |

## Quality metrics vs FP32

Measured on 8 diverse EN/DE sentences (3 semantic triplets):

| Metric | Value |
|---|---|
| Cosine similarity to FP32 (mean) | 0.945 |
| Cosine similarity to FP32 (min) | 0.930 |
| Semantic ordering (3/3 triplets) | ✅ |
| Triplet margin (mean) | 0.241 |
| Anisotropy (avg pairwise cos) | 0.233 |
| Unit-norm compliance | ✅ |

The high cosine fidelity (0.945) despite using only 4 bits comes from block-wise scaling (block_size=32), which is far finer-grained than the per-tensor INT8 approach.
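For reference, the two headline metrics are straightforward to compute. A minimal sketch with toy 2-d vectors (the real benchmark uses the model's 1024-d embeddings):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def triplet_margin(anchor, positive, negative):
    # Positive margin => the semantically related pair ranks above
    # the unrelated one, i.e. correct semantic ordering.
    return cosine(anchor, positive) - cosine(anchor, negative)

anchor, pos, neg = [1.0, 0.0], [0.9, 0.1], [0.1, 0.9]
margin = triplet_margin(anchor, pos, neg)
```

The "cosine similarity to FP32" rows compare each INT4 embedding against the FP32 embedding of the same sentence; the triplet margin is averaged over the three semantic triplets.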

## Model details

| Property | Value |
|---|---|
| Embedding dim | 1024 |
| Max context | 32 768 tokens |
| Inputs | `input_ids` [batch, seq], `attention_mask` [batch, seq] |
| Output | `last_hidden_state` [batch, seq, 1024] |
| Pooling | Last-token pooling + L2 normalisation |
| Batch support | batch=1 only (legacy export limitation) |

## Inference (batch=1)

```python
import onnxruntime as ort
import numpy as np
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")
tokenizer.enable_truncation(max_length=512)

# CPUExecutionProvider supports the 4-bit MatMulNBits contrib op
session = ort.InferenceSession("model.int4.onnx", providers=["CPUExecutionProvider"])

text = "semantic search example"
enc  = tokenizer.encode(text)
ids  = np.array([enc.ids],            dtype=np.int64)
mask = np.array([enc.attention_mask], dtype=np.int64)

lhs = session.run(None, {"input_ids": ids, "attention_mask": mask})[0]  # [1, seq, 1024]
emb = lhs[0, mask[0].sum() - 1]   # last non-padding token (last-token pooling)
emb = emb / np.linalg.norm(emb)   # L2-normalise
print(emb.shape)  # (1024,)
```
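Since the embeddings come out L2-normalised, cosine similarity between two texts reduces to a plain dot product. A small sketch, with toy 2-d unit vectors standing in for the real 1024-d embeddings:

```python
import numpy as np

def similarity(emb_a, emb_b):
    # Both inputs are unit-norm, so the dot product equals cosine similarity.
    return float(np.dot(emb_a, emb_b))

a = np.array([0.6, 0.8])  # toy unit vector
b = np.array([0.8, 0.6])  # toy unit vector
print(round(similarity(a, b), 6))  # 0.96
```

For retrieval, score every stored embedding against the query embedding this way and take the top-k.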

## Files

| File | Size | Description |
|---|---|---|
| `model.int4.onnx` | ~3 MB | ONNX graph with MatMulNBits nodes |
| `model.int4.onnx.data` | ~855 MB | 4-bit weight data + scales |
| `tokenizer.json` | 11 MB | HuggingFace fast tokenizer |

## Variants

| Repo | Precision | Size | Batch | Notes |
|---|---|---|---|---|
| `cstr/octen-embedding-0.6b-onnx` | FP32 | 2.4 GB | dynamic | Reference |
| `cstr/octen-embedding-0.6b-onnx-int8` | INT8 | 1.1 GB | dynamic | Recommended |
| `cstr/octen-embedding-0.6b-onnx-int4` | INT4 | 0.9 GB | batch=1 | This repo; minimum RAM |

## License

Apache 2.0.
