zerank-1-small ONNX Export

ONNX export of zeroentropy/zerank-1-small, a 1.7B Qwen3-based reranker. Includes three quantization levels for CPU inference.

Files

| File | Format | Size | Description |
|---|---|---|---|
| model.onnx + model.onnx_data | FP16 | ~3.2 GB | Full precision |
| model_int8.onnx + model_int8.onnx_data | INT8 | ~2.5 GB | Weight-only INT8 (per-tensor symmetric) |
| model_int4_full.onnx | INT4 | ~1.3 GB | MatMulNBits INT4, block_size=32 |

Conversion scripts: export_zerank_v2.py (FP16 export with dynamic batch), stream_int8.py (INT8 quantization).

โš ๏ธ Important: chat template required

This model is a Qwen3-based causal LM that scores (query, document) relevance by extracting the "Yes" token logit at the last position. It requires a specific prompt format; plain pair tokenization produces meaningless scores.

Always format inputs using the Qwen3 chat template with system=query, user=document:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("cstr/zerank-1-small-ONNX")

# using the tokenizer directly (matches training format exactly):
messages = [
    {"role": "system", "content": query},
    {"role": "user",   "content": document},
]
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

This produces the following fixed string (equivalent, usable without a tokenizer):

<|im_start|>system
{query}
<|im_end|>
<|im_start|>user
{document}
<|im_end|>
<|im_start|>assistant

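If no tokenizer is available, the fixed string above can be assembled directly. A minimal sketch (the helper name `build_prompt` is ours, not part of the export):

```python
def build_prompt(query: str, document: str) -> str:
    # Reproduces the fixed chat-template string shown above,
    # with system=query and user=document.
    return (
        f"<|im_start|>system\n{query}\n<|im_end|>\n"
        f"<|im_start|>user\n{document}\n<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )
```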
Usage with ONNX Runtime (Python)

import onnxruntime as ort
import numpy as np
from transformers import AutoTokenizer

MODEL_PATH = "model_int8.onnx"   # or model.onnx, model_int4_full.onnx
MAX_LENGTH = 512

sess = ort.InferenceSession(MODEL_PATH, providers=["CPUExecutionProvider"])
tok  = AutoTokenizer.from_pretrained("cstr/zerank-1-small-ONNX")

def format_pair(query: str, doc: str) -> str:
    messages = [
        {"role": "system", "content": query},
        {"role": "user",   "content": doc},
    ]
    return tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

def rerank(query: str, documents: list[str]) -> list[float]:
    scores = []
    for doc in documents:
        text = format_pair(query, doc)
        enc  = tok(text, return_tensors="np", truncation=True, max_length=MAX_LENGTH)
        logit = sess.run(["logits"], {
            "input_ids":      enc["input_ids"].astype(np.int64),
            "attention_mask": enc["attention_mask"].astype(np.int64),
        })[0]
        scores.append(float(logit[0, 0]))
    return scores

query = "What is a panda?"
docs  = [
    "The giant panda is a bear species endemic to China.",
    "The sky is blue and the grass is green.",
    "Pandas are mammals in the family Ursidae.",
]
scores = rerank(query, docs)
for s, d in sorted(zip(scores, docs), reverse=True):
    print(f"[{s:+.1f}] {d}")
# [+6.8] The giant panda is a bear species endemic to China.
# [+2.1] Pandas are mammals in the family Ursidae.
# [-5.8] The sky is blue and the grass is green.

Batch inference: The v2 export (model.onnx) supports batch_size > 1 via a dynamic causal+padding mask. Pad a batch with the tokenizer and pass the full batch at once for higher throughput.
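What "pad a batch" means here, as a tokenizer-free sketch in pure Python (in practice `tok(texts, padding=True, return_tensors="np")` does this for you; `pad_batch` and the pad id below are illustrative, use `tok.pad_token_id` in real code):

```python
def pad_batch(token_id_lists, pad_id=0):
    # Right-pad every sequence to the longest one and build the matching
    # attention mask (1 = real token, 0 = padding), as the tokenizer would.
    max_len = max(len(ids) for ids in token_id_lists)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in token_id_lists]
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids))
                      for ids in token_id_lists]
    return input_ids, attention_mask

ids, mask = pad_batch([[5, 6, 7], [8, 9]])
# ids  == [[5, 6, 7], [8, 9, 0]]
# mask == [[1, 1, 1], [1, 1, 0]]
```

The resulting arrays go into `input_ids` / `attention_mask` exactly as in the single-pair example, just with a leading batch dimension.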

Usage with fastembed-rs

use fastembed::{RerankInitOptions, RerankerModel, TextRerank};

let mut reranker = TextRerank::try_new(
    RerankInitOptions::new(RerankerModel::ZerankSmallInt8)
).unwrap();

// The chat template is applied automatically; batch_size > 1 is supported.
let results = reranker.rerank(
    "What is a panda?",
    vec![
        "The giant panda is a bear species endemic to China.",
        "The sky is blue.",
        "Pandas are mammals in the family Ursidae.",
    ],
    true,
    Some(32),
).unwrap();

for r in &results {
    println!("[{:.3}] {}", r.score, r.document.as_ref().unwrap());
}

Export details

export_zerank_v2.py wraps Qwen3ForCausalLM in a ZeRankScorerV2 that:

  1. Builds a 4D causal+padding attention mask explicitly from input_ids.shape[0], which makes the batch dimension dynamic in the ONNX graph (enabling batch_size > 1).
  2. Runs the transformer body, producing hidden states [batch, seq, hidden].
  3. Gathers the hidden state at the last real-token position (attention_mask.sum - 1).
  4. Applies lm_head and slices the "Yes" token (id 9454), yielding [batch, 1].

Output: logits [batch, 1], the raw Yes-token logit (higher = more relevant). FP16 weights, opset 18.
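Steps 3 and 4 can be illustrated in NumPy with toy shapes (the array contents and the toy `yes_id` are made up; only the gather logic and the real Yes-token id 9454 come from the description above):

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq, hidden, vocab = 2, 5, 4, 10
hidden_states = rng.standard_normal((batch, seq, hidden))
lm_head = rng.standard_normal((hidden, vocab))
attention_mask = np.array([[1, 1, 1, 1, 1],
                           [1, 1, 1, 0, 0]])  # second sequence is padded

# Step 3: gather the hidden state at the last real-token position.
last_idx = attention_mask.sum(axis=1) - 1                # [4, 2]
last_hidden = hidden_states[np.arange(batch), last_idx]  # [batch, hidden]

# Step 4: apply lm_head and slice the "Yes" token (id 9454 in the real vocab).
yes_id = 3  # toy stand-in for 9454
logits = last_hidden @ lm_head                           # [batch, vocab]
scores = logits[:, yes_id:yes_id + 1]                    # [batch, 1]
```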

stream_int8.py performs fully streaming weight-only INT8 quantization:

  • Never loads the full 6.4 GB FP32 model into RAM (peak ~1.5 GB)
  • Symmetric per-tensor quantization: scale = max(|w|) / 127
  • Adds DequantizeLinear โ†’ MatMul nodes for all MatMul B-weights
  • Non-MatMul tensors (embeddings, LayerNorm) kept as FP32

Benchmarks (from original model card)

NDCG@10 with text-embedding-3-small as initial retriever (Top 100 candidates):

| Task | Embedding only | cohere-rerank-v3.5 | Llama-rank-v1 | zerank-1-small | zerank-1 |
|---|---|---|---|---|---|
| Code | 0.678 | 0.724 | 0.694 | 0.730 | 0.754 |
| Finance | 0.839 | 0.824 | 0.828 | 0.861 | 0.894 |
| Legal | 0.703 | 0.804 | 0.767 | 0.817 | 0.821 |
| Medical | 0.619 | 0.750 | 0.719 | 0.773 | 0.796 |
| STEM | 0.401 | 0.510 | 0.595 | 0.680 | 0.694 |
| Conversational | 0.250 | 0.571 | 0.484 | 0.556 | 0.596 |

See zeroentropy/zerank-1-small for full details and Apache-2.0 license.

