---
license: apache-2.0
license_name: apache-2.0
license_link: https://www.apache.org/licenses/LICENSE-2.0
tags:
- text
- image
- video
- multimodal-embedding
- vidore
- colpali
- colqwen3
- multilingual-embedding
- quantized
- awq
- autoround
- w4a16
language:
- multilingual
library_name: transformers
pipeline_tag: visual-document-retrieval
base_model:
- TomoroAI/tomoro-colqwen3-embed-4b
---

# TomoroAI/tomoro-ai-colqwen3-embed-4b-awq

## Overview

This is a **W4A16 quantized** version of [TomoroAI/tomoro-colqwen3-embed-4b](https://huggingface.co/TomoroAI/tomoro-colqwen3-embed-4b), a state-of-the-art [ColPali](https://arxiv.org/abs/2407.01449)-style multimodal embedding model. The quantization was performed with [AutoRound](https://github.com/intel/auto-round) using the AutoAWQ backend.

The quantized model uses **~3.5 GB of GPU memory** (vs. 8.4 GB for the original), enabling deployment on consumer GPUs while maintaining competitive retrieval performance.

## Model Details

| Property | Value |
|----------|-------|
| **Original Model** | [TomoroAI/tomoro-colqwen3-embed-4b](https://huggingface.co/TomoroAI/tomoro-colqwen3-embed-4b) |
| **Parameters** | 4.0B |
| **Quantization** | W4A16 (4-bit weights, 16-bit activations) |
| **Quantization Method** | AutoRound with AutoAWQ backend |
| **Calibration Sequence Length** | 1024 |
| **Memory Usage (Quantized)** | ~3.5 GB |
| **Memory Usage (Original)** | 8.4 GB |
| **Embedding Dimension** | 320 |
| **Max Visual Tokens** | 1280 |

## Quantization Configuration

| Parameter | Value |
|-----------|-------|
| **Bits** | 4 |
| **Group Size** | 128 |
| **Symmetric** | True |
| **Calibration Dataset** | NeelNanda/pile-10k (AutoRound default) |
| **Calibration Sequence Length** | 1024 |
| **Iterations** | 1000 |
| **Number of Samples** | 560 |
| **Batch Size** | 80 |
| **Quantized Layers** | 252 |
| **FP16 Layers (Vision)** | 105 |

> **Note:** Only the text tower (language model) is quantized. The vision encoder remains in FP16/BF16 to preserve visual feature quality.
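For reference, the settings above map roughly onto AutoRound's Python API as sketched below. This is an illustrative sketch, not the exact script used to produce this checkpoint: loading the ColQwen3 architecture and keeping the vision encoder in FP16/BF16 require model-specific handling that is omitted here, and the output directory name is only a placeholder.

```python
# Hypothetical reproduction sketch of the quantization configuration above.
from transformers import AutoModel, AutoTokenizer
from auto_round import AutoRound

SOURCE_MODEL = "TomoroAI/tomoro-colqwen3-embed-4b"  # original, unquantized model

model = AutoModel.from_pretrained(SOURCE_MODEL, dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(SOURCE_MODEL, trust_remote_code=True)

autoround = AutoRound(
    model,
    tokenizer,
    bits=4,         # 4-bit weights (W4)
    group_size=128,
    sym=True,       # symmetric quantization
    seqlen=1024,    # calibration sequence length
    nsamples=560,   # calibration samples (NeelNanda/pile-10k by default)
    iters=1000,     # tuning iterations
    batch_size=80,
)

# Export the quantized weights in an AutoAWQ-compatible format.
autoround.quantize_and_save("tomoro-colqwen3-embed-4b-awq", format="auto_awq")
```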
## Performance

### NDCG@5 on ViDoRe Benchmark (All Languages)

| Model | Average NDCG@5 | Change |
|-------|----------------|--------|
| Original (FP16) | 0.70023 | - |
| **This Model (W4A16, seqlen=1024)** | **0.69768** | **-0.36%** |

### NDCG@5 on ViDoRe Benchmark (English Only)

| Model | Average NDCG@5 | Change |
|-------|----------------|--------|
| Original (FP16) | 0.74743 | - |
| **This Model (W4A16, seqlen=1024)** | **0.74582** | **-0.21%** |

### Performance Summary

- **Benchmarks Improved:** 17
- **Benchmarks Degraded:** 23
- **Overall Quality Retention:** ~99.6%

### Benchmark Comparison Charts

> **Note:** Here, "seqlen" refers to the **calibration sequence length used during quantization**, not the maximum sequence length supported by the original model. The quantized model retains the full sequence length of the original; quantization statistics are simply collected with the calibration seqlen shown.

#### Performance Comparison (All Languages)

![Performance Comparison - All Languages](https://raw.githubusercontent.com/goodhamgupta/evaluation/main/performance_comparison_4B_all_languages.png)

#### Performance Difference vs Original (All Languages)

![Performance Difference - All Languages](https://raw.githubusercontent.com/goodhamgupta/evaluation/main/performance_diff_4B_all_languages.png)

#### Performance Comparison (English Only)

![Performance Comparison - English](https://raw.githubusercontent.com/goodhamgupta/evaluation/main/performance_comparison_4B_english.png)

#### Performance Difference vs Original (English Only)

![Performance Difference - English](https://raw.githubusercontent.com/goodhamgupta/evaluation/main/performance_diff_4B_english.png)

## Memory Efficiency

The quantized model enables deployment on GPUs with limited memory:

| GPU Memory | Original Model | Quantized Model |
|------------|----------------|-----------------|
| 8 GB | Marginal | Fits with batch size ~64 |
| 12 GB | Fits comfortably | Fits with batch size ~256 |
| 16 GB | Fits comfortably | High batch sizes possible |
| 24 GB | Fits comfortably | High batch sizes possible |

## Usage

### Prerequisites

```bash
pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu128
pip install auto-round==0.9.2
pip install autoawq==0.2.9
pip install transformers pillow requests
pip install flash-attn --no-build-isolation  # Optional but recommended
```

### Inference Code

```python
import torch
from transformers import AutoModel, AutoProcessor
from PIL import Image
import requests
from io import BytesIO

# Configuration
MODEL_ID = "TomoroAI/tomoro-ai-colqwen3-embed-4b-awq"
DTYPE = torch.bfloat16
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load Model & Processor
processor = AutoProcessor.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    max_num_visual_tokens=1280,
)
model = AutoModel.from_pretrained(
    MODEL_ID,
    dtype=DTYPE,
    attn_implementation="sdpa",  # Use "flash_attention_2" if available
    trust_remote_code=True,
    device_map=DEVICE,
).eval()

# Sample queries and documents
queries = [
    "Retrieve the city of Singapore",
    "Retrieve the city of Beijing",
]
doc_urls = [
    "https://upload.wikimedia.org/wikipedia/commons/2/27/Singapore_skyline_2022.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/6/61/Beijing_skyline_at_night.JPG",
]

def load_image(url: str) -> Image.Image:
    headers = {"User-Agent": "Mozilla/5.0"}
    resp = requests.get(url, headers=headers, timeout=10)
    resp.raise_for_status()
    return Image.open(BytesIO(resp.content)).convert("RGB")

def encode_queries(texts):
    batch = processor.process_texts(texts=texts)
    batch = {k: v.to(DEVICE) for k, v in batch.items()}
    with torch.inference_mode():
        out = model(**batch)
    return out.embeddings.to(torch.bfloat16).cpu()

def encode_docs(urls):
    images = [load_image(url) for url in urls]
    features = processor.process_images(images=images)
    features = {k: v.to(DEVICE) if isinstance(v, torch.Tensor) else v for k, v in features.items()}
    with torch.inference_mode():
        out = model(**features)
    return out.embeddings.to(torch.bfloat16).cpu()

# Encode and score
query_embeddings = encode_queries(queries)
doc_embeddings = encode_docs(doc_urls)
scores = processor.score_multi_vector(query_embeddings, doc_embeddings)
print(scores)
```
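The `score_multi_vector` call above performs ColPali-style late interaction (MaxSim): each query-token embedding is compared against every document-token embedding, the best match per query token is kept, and those maxima are summed into a single relevance score. The sketch below is a minimal illustrative reimplementation, assuming each input is a list of `(num_tokens, dim)` tensors with padding tokens already removed; use the processor's batched implementation in practice.

```python
import torch

def maxsim_scores(query_embs: list[torch.Tensor], doc_embs: list[torch.Tensor]) -> torch.Tensor:
    """Illustrative late-interaction (MaxSim) scoring for multi-vector embeddings."""
    scores = torch.empty(len(query_embs), len(doc_embs))
    for qi, q in enumerate(query_embs):        # q: (n_query_tokens, dim)
        for di, d in enumerate(doc_embs):      # d: (n_doc_tokens, dim)
            sim = q.float() @ d.float().T      # token-level similarity matrix
            # For each query token, keep its best-matching document token, then sum.
            scores[qi, di] = sim.max(dim=1).values.sum()
    return scores
```

Because the document side of this computation does not depend on the query, document embeddings can be computed once offline and only queries need to be encoded at search time.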
## Comparison with Other Calibration Lengths

| Calibration Length | Avg NDCG@5 | Delta | Best For |
|--------------------|------------|-------|----------|
| seqlen=256 | 0.69611 | -0.59% | Short document retrieval |
| seqlen=512 | 0.69696 | -0.47% | Balanced use cases |
| seqlen=1024 | 0.69768 | -0.36% | Long document retrieval |

## Limitations

- **Reduced Precision:** 4-bit quantization introduces some accuracy loss compared to the original FP16 model.
- **Vision Encoder:** The vision encoder is kept in FP16/BF16 to preserve visual feature quality, so it sees no memory savings from quantization.
- **Inference Backend:** Throughput and latency depend on the inference backend and kernels used (AutoAWQ, vLLM, etc.).

## License

This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0), consistent with the original model.

## Acknowledgements

- **Original Model:** [TomoroAI/tomoro-colqwen3-embed-4b](https://huggingface.co/TomoroAI/tomoro-colqwen3-embed-4b) by [Tomoro AI](https://tomoro.ai/)
- **Quantization Tool:** [AutoRound](https://github.com/intel/auto-round) by Intel
- **Base Architecture:** [Qwen3-VL](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) by Alibaba

## Citation

If you use this model, please cite the original model and the quantization tool:

```bibtex
@misc{huang2025beyond,
  author    = {Huang, Xin and Tan, Kye Min},
  title     = {Beyond Text: Unlocking True Multimodal, End-to-end RAG with Tomoro ColQwen3},
  year      = {2025},
  url       = {https://tomoro.ai/insights/beyond-text-unlocking-true-multimodal-end-to-end-rag-with-tomoro-colqwen3},
  publisher = {Tomoro.ai}
}

@misc{autoround,
  author = {Intel Corporation},
  title  = {AutoRound: Advanced Weight-Only Quantization Algorithm},
  year   = {2024},
  url    = {https://github.com/intel/auto-round}
}
```