---
license: apache-2.0
license_name: apache-2.0
license_link: https://www.apache.org/licenses/LICENSE-2.0
tags:
- text
- image
- video
- multimodal-embedding
- vidore
- colpali
- colqwen3
- multilingual-embedding
- quantized
- awq
- autoround
- w4a16
language:
- multilingual
library_name: transformers
pipeline_tag: visual-document-retrieval
base_model:
- TomoroAI/tomoro-colqwen3-embed-4b
---

# TomoroAI/tomoro-ai-colqwen3-embed-4b-awq

## Overview

This is a **W4A16 quantized** version of [TomoroAI/tomoro-colqwen3-embed-4b](https://huggingface.co/TomoroAI/tomoro-colqwen3-embed-4b), a state-of-the-art [ColPali](https://arxiv.org/abs/2407.01449)-style multimodal embedding model. The quantization was performed with [AutoRound](https://github.com/intel/auto-round) using the AutoAWQ backend.

The quantized model uses **~3.5 GB of GPU memory** (vs. 8.4 GB for the original), enabling deployment on consumer GPUs while maintaining competitive retrieval performance.

## Model Details

| Property | Value |
|----------|-------|
| **Original Model** | [TomoroAI/tomoro-colqwen3-embed-4b](https://huggingface.co/TomoroAI/tomoro-colqwen3-embed-4b) |
| **Parameters** | 4.0B |
| **Quantization** | W4A16 (4-bit weights, 16-bit activations) |
| **Quantization Method** | AutoRound with AutoAWQ backend |
| **Calibration Sequence Length** | 1024 |
| **Memory Usage (Quantized)** | ~3.5 GB |
| **Memory Usage (Original)** | 8.4 GB |
| **Embedding Dimension** | 320 |
| **Max Visual Tokens** | 1280 |

## Quantization Configuration

| Parameter | Value |
|-----------|-------|
| **Bits** | 4 |
| **Group Size** | 128 |
| **Symmetric** | True |
| **Calibration Dataset** | NeelNanda/pile-10k (AutoRound default) |
| **Calibration Sequence Length** | 1024 |
| **Iterations** | 1000 |
| **Number of Samples** | 560 |
| **Batch Size** | 80 |
| **Quantized Layers** | 252 |
| **FP16 Layers (Vision)** | 105 |

> **Note:** Only the text tower (language model) is quantized. The vision encoder remains in FP16/BF16 to preserve visual feature quality.
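For reference, the settings above map roughly onto AutoRound's Python API as sketched below. This is an illustrative sketch, not the exact script used to produce this checkpoint: loading the ColQwen3 architecture and keeping the vision encoder in FP16/BF16 require model-specific handling that is omitted here, and the output directory name is only a placeholder.

```python
# Hypothetical reproduction sketch of the quantization configuration above.
from transformers import AutoModel, AutoTokenizer
from auto_round import AutoRound

SOURCE_MODEL = "TomoroAI/tomoro-colqwen3-embed-4b"  # original, unquantized model

model = AutoModel.from_pretrained(SOURCE_MODEL, dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(SOURCE_MODEL, trust_remote_code=True)

autoround = AutoRound(
    model,
    tokenizer,
    bits=4,         # 4-bit weights (W4)
    group_size=128,
    sym=True,       # symmetric quantization
    seqlen=1024,    # calibration sequence length
    nsamples=560,   # calibration samples (NeelNanda/pile-10k by default)
    iters=1000,     # tuning iterations
    batch_size=80,
)

# Export the quantized weights in an AutoAWQ-compatible format.
autoround.quantize_and_save("tomoro-colqwen3-embed-4b-awq", format="auto_awq")
```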
## Performance

### NDCG@5 on ViDoRe Benchmark (All Languages)

| Model | Average NDCG@5 | Change |
|-------|----------------|--------|
| Original (FP16) | 0.70023 | - |
| **This Model (W4A16, seqlen=1024)** | **0.69768** | **-0.36%** |

### NDCG@5 on ViDoRe Benchmark (English Only)

| Model | Average NDCG@5 | Change |
|-------|----------------|--------|
| Original (FP16) | 0.74743 | - |
| **This Model (W4A16, seqlen=1024)** | **0.74582** | **-0.21%** |

### Performance Summary

- **Benchmarks Improved:** 17
- **Benchmarks Degraded:** 23
- **Overall Quality Retention:** ~99.6%

### Benchmark Comparison Charts

> **Note:** Here, "seqlen" refers to the **calibration sequence length used during quantization**, not the maximum sequence length supported by the original model. The quantized model retains the full sequence length of the original; quantization statistics are simply collected with the calibration seqlen shown.

#### Performance Comparison (All Languages)

![Performance Comparison - All Languages](https://raw.githubusercontent.com/goodhamgupta/evaluation/main/performance_comparison_4B_all_languages.png)

#### Performance Difference vs Original (All Languages)

![Performance Difference - All Languages](https://raw.githubusercontent.com/goodhamgupta/evaluation/main/performance_diff_4B_all_languages.png)

#### Performance Comparison (English Only)

![Performance Comparison - English](https://raw.githubusercontent.com/goodhamgupta/evaluation/main/performance_comparison_4B_english.png)

#### Performance Difference vs Original (English Only)

![Performance Difference - English](https://raw.githubusercontent.com/goodhamgupta/evaluation/main/performance_diff_4B_english.png)

## Memory Efficiency

The quantized model enables deployment on GPUs with limited memory:

| GPU Memory | Original Model | Quantized Model |
|------------|----------------|-----------------|
| 8 GB | Marginal | Fits with batch size ~64 |
| 12 GB | Fits comfortably | Fits with batch size ~256 |
| 16 GB | Fits comfortably | High batch sizes possible |
| 24 GB | Fits comfortably | High batch sizes possible |

## Usage

### Prerequisites

```bash
pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu128
pip install auto-round==0.9.2
pip install autoawq==0.2.9
pip install transformers pillow requests
pip install flash-attn --no-build-isolation  # Optional but recommended
```

### Inference Code

```python
import torch
from transformers import AutoModel, AutoProcessor
from PIL import Image
import requests
from io import BytesIO

# Configuration
MODEL_ID = "TomoroAI/tomoro-ai-colqwen3-embed-4b-awq"
DTYPE = torch.bfloat16
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load Model & Processor
processor = AutoProcessor.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    max_num_visual_tokens=1280,
)
model = AutoModel.from_pretrained(
    MODEL_ID,
    dtype=DTYPE,
    attn_implementation="sdpa",  # Use "flash_attention_2" if available
    trust_remote_code=True,
    device_map=DEVICE,
).eval()

# Sample queries and documents
queries = [
    "Retrieve the city of Singapore",
    "Retrieve the city of Beijing",
]
doc_urls = [
    "https://upload.wikimedia.org/wikipedia/commons/2/27/Singapore_skyline_2022.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/6/61/Beijing_skyline_at_night.JPG",
]

def load_image(url: str) -> Image.Image:
    headers = {"User-Agent": "Mozilla/5.0"}
    resp = requests.get(url, headers=headers, timeout=10)
    resp.raise_for_status()
    return Image.open(BytesIO(resp.content)).convert("RGB")

def encode_queries(texts):
    batch = processor.process_texts(texts=texts)
    batch = {k: v.to(DEVICE) for k, v in batch.items()}
    with torch.inference_mode():
        out = model(**batch)
    return out.embeddings.to(torch.bfloat16).cpu()

def encode_docs(urls):
    images = [load_image(url) for url in urls]
    features = processor.process_images(images=images)
    features = {k: v.to(DEVICE) if isinstance(v, torch.Tensor) else v for k, v in features.items()}
    with torch.inference_mode():
        out = model(**features)
    return out.embeddings.to(torch.bfloat16).cpu()

# Encode and score
query_embeddings = encode_queries(queries)
doc_embeddings = encode_docs(doc_urls)
scores = processor.score_multi_vector(query_embeddings, doc_embeddings)
print(scores)
```
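The `score_multi_vector` call above performs ColPali-style late interaction (MaxSim): each query-token embedding is compared against every document-token embedding, the best match per query token is kept, and those maxima are summed into a single relevance score. The sketch below is a minimal illustrative reimplementation, assuming each input is a list of `(num_tokens, dim)` tensors with padding tokens already removed; use the processor's batched implementation in practice.

```python
import torch

def maxsim_scores(query_embs: list[torch.Tensor], doc_embs: list[torch.Tensor]) -> torch.Tensor:
    """Illustrative late-interaction (MaxSim) scoring for multi-vector embeddings."""
    scores = torch.empty(len(query_embs), len(doc_embs))
    for qi, q in enumerate(query_embs):        # q: (n_query_tokens, dim)
        for di, d in enumerate(doc_embs):      # d: (n_doc_tokens, dim)
            sim = q.float() @ d.float().T      # token-level similarity matrix
            # For each query token, keep its best-matching document token, then sum.
            scores[qi, di] = sim.max(dim=1).values.sum()
    return scores
```

Because the document side of this computation does not depend on the query, document embeddings can be computed once offline and only queries need to be encoded at search time.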
## Comparison with Other Calibration Lengths

| Calibration Length | Avg NDCG@5 | Delta | Best For |
|--------------------|------------|-------|----------|
| seqlen=256 | 0.69611 | -0.59% | Short document retrieval |
| seqlen=512 | 0.69696 | -0.47% | Balanced use cases |
| seqlen=1024 | 0.69768 | -0.36% | Long document retrieval |

## Limitations

- **Reduced Precision:** 4-bit quantization introduces some accuracy loss compared to the original FP16 model.
- **Vision Encoder:** The vision encoder is kept in FP16/BF16 to preserve visual feature quality, so it sees no memory savings from quantization.
- **Inference Backend:** Throughput and latency depend on the inference backend and kernels used (AutoAWQ, vLLM, etc.).

## License

This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0), consistent with the original model.

## Acknowledgements

- **Original Model:** [TomoroAI/tomoro-colqwen3-embed-4b](https://huggingface.co/TomoroAI/tomoro-colqwen3-embed-4b) by [Tomoro AI](https://tomoro.ai/)
- **Quantization Tool:** [AutoRound](https://github.com/intel/auto-round) by Intel
- **Base Architecture:** [Qwen3-VL](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) by Alibaba

## Citation

If you use this model, please cite the original model and the quantization tool:

```bibtex
@misc{huang2025beyond,
  author    = {Huang, Xin and Tan, Kye Min},
  title     = {Beyond Text: Unlocking True Multimodal, End-to-end RAG with Tomoro ColQwen3},
  year      = {2025},
  url       = {https://tomoro.ai/insights/beyond-text-unlocking-true-multimodal-end-to-end-rag-with-tomoro-colqwen3},
  publisher = {Tomoro.ai}
}

@misc{autoround,
  author = {Intel Corporation},
  title  = {AutoRound: Advanced Weight-Only Quantization Algorithm},
  year   = {2024},
  url    = {https://github.com/intel/auto-round}
}
```