# Octen-Embedding-0.6B — INT4 ONNX (MatMulNBits, block_size=32)

INT4-quantized ONNX export of Octen/Octen-Embedding-0.6B. Smallest resident memory of all variants (~1.15 GB RSS) with 1.00 top-1 retrieval accuracy.

## Quantization details

| Property | Value |
|---|---|
| Method | `onnxruntime.quantization.MatMulNBitsQuantizer` |
| Bits | 4 |
| Block size | 32 (one scale per 32 consecutive weights) |
| Symmetry | Symmetric (no zero-point) |
| Op | `MatMulNBits` contrib op (ORT ≥ 1.16; CPU / CUDA / CoreML EPs) |
| Ops quantized | MatMul only — `Gather` (embedding table) left in FP32 |

**Block-wise vs per-tensor:** block_size=32 yields 32 768 scale values for a 1024×1024 weight matrix (1024 × 1024 / 32), versus a single scale for per-tensor INT8. This finer granularity is why INT4 shows higher cosine fidelity to FP32 (0.945) than per-tensor INT8 (0.830) despite using half the bits per weight.
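The per-block scaling can be sketched in plain Python. This is an illustrative model of symmetric 4-bit block quantization, not the actual `MatMulNBits` kernel (which, among other things, packs two 4-bit values per byte):

```python
def quantize_block(block):
    """Symmetric 4-bit quantization of one block: one scale, no zero-point."""
    scale = max(abs(w) for w in block) / 7 or 1.0  # fall back to 1.0 for all-zero blocks
    q = [max(-8, min(7, round(w / scale))) for w in block]
    return q, scale

def dequantize_block(q, scale):
    return [v * scale for v in q]

# With block_size=32, a 1024x1024 matrix splits into 1024*1024/32 = 32768
# blocks, hence 32768 scales -- the granularity discussed above.
weights = [0.5, -0.25, 0.125, 0.0] * 8        # one toy 32-weight block
q, scale = quantize_block(weights)
restored = dequantize_block(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

Because each block gets its own scale, a single outlier weight can only degrade the resolution of its own 32-weight block, not the whole tensor.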

**Note on dynamic batch:** this variant was produced with the legacy `torch.onnx.export` (not the dynamo-based exporter). It runs correctly at batch=1 only. If you need batch > 1 for throughput, use the INT8 variant, which is based on the dynamo export. A dynamo-based INT4 re-export is planned.

## Benchmark (Apple M-series, CPU)

| Metric | Value |
|---|---|
| Ingest throughput | ~2.6 ch/s |
| Top-1 hybrid accuracy | 1.00 |
| RSS memory | ~1.15 GB |
| File size | ~0.9 GB |

## Quality metrics vs FP32

Measured on 8 diverse EN/DE sentences (3 semantic triplets):

| Metric | Value |
|---|---|
| Cosine similarity to FP32 (mean) | 0.945 |
| Cosine similarity to FP32 (min) | 0.930 |
| Semantic ordering (3/3 triplets) | ✅ |
| Triplet margin (mean) | 0.241 |
| Anisotropy (avg pairwise cos) | 0.233 |
| Unit-norm compliance | ✅ |

The high cosine fidelity (0.945) despite using only 4 bits comes from block-wise scaling (block_size=32), which is far finer-grained than the per-tensor INT8 approach.
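For reference, the two headline metrics are straightforward to compute. A minimal sketch with toy 2-d vectors (the real benchmark uses the model's 1024-d embeddings):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def triplet_margin(anchor, positive, negative):
    # Positive margin => the semantically related pair ranks above
    # the unrelated one, i.e. correct semantic ordering.
    return cosine(anchor, positive) - cosine(anchor, negative)

anchor, pos, neg = [1.0, 0.0], [0.9, 0.1], [0.1, 0.9]
margin = triplet_margin(anchor, pos, neg)
```

The "cosine similarity to FP32" rows compare each INT4 embedding against the FP32 embedding of the same sentence; the triplet margin is averaged over the three semantic triplets.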

## Model details

| Property | Value |
|---|---|
| Embedding dim | 1024 |
| Max context | 32 768 tokens |
| Inputs | `input_ids` [batch, seq], `attention_mask` [batch, seq] |
| Output | `last_hidden_state` [batch, seq, 1024] |
| Pooling | Last-token pooling + L2 normalisation |
| Batch support | batch=1 only (legacy export limitation) |

## Inference (batch=1)

```python
import onnxruntime as ort
import numpy as np
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")
tokenizer.enable_truncation(max_length=512)

# CPUExecutionProvider supports the 4-bit MatMulNBits contrib op
session = ort.InferenceSession("model.int4.onnx", providers=["CPUExecutionProvider"])

text = "semantic search example"
enc  = tokenizer.encode(text)
ids  = np.array([enc.ids],            dtype=np.int64)
mask = np.array([enc.attention_mask], dtype=np.int64)

lhs = session.run(None, {"input_ids": ids, "attention_mask": mask})[0]  # [1, seq, 1024]
emb = lhs[0, mask[0].sum() - 1]   # last non-padding token (last-token pooling)
emb = emb / np.linalg.norm(emb)   # L2-normalise
print(emb.shape)  # (1024,)
```
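Since the embeddings come out L2-normalised, cosine similarity between two texts reduces to a plain dot product. A small sketch, with toy 2-d unit vectors standing in for the real 1024-d embeddings:

```python
import numpy as np

def similarity(emb_a, emb_b):
    # Both inputs are unit-norm, so the dot product equals cosine similarity.
    return float(np.dot(emb_a, emb_b))

a = np.array([0.6, 0.8])  # toy unit vector
b = np.array([0.8, 0.6])  # toy unit vector
print(round(similarity(a, b), 6))  # 0.96
```

For retrieval, score every stored embedding against the query embedding this way and take the top-k.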

## Files

| File | Size | Description |
|---|---|---|
| `model.int4.onnx` | ~3 MB | ONNX graph with MatMulNBits nodes |
| `model.int4.onnx.data` | ~855 MB | 4-bit weight data + scales |
| `tokenizer.json` | 11 MB | HuggingFace fast tokenizer |

## Variants

| Repo | Precision | Size | Batch | Notes |
|---|---|---|---|---|
| `cstr/octen-embedding-0.6b-onnx` | FP32 | 2.4 GB | dynamic | Reference |
| `cstr/octen-embedding-0.6b-onnx-int8` | INT8 | 1.1 GB | dynamic | Recommended |
| `cstr/octen-embedding-0.6b-onnx-int4` | INT4 | 0.9 GB | batch=1 | This repo; minimum RAM |

## License

Apache 2.0.
