---
license: other
license_name: nvidia-open-model-license
license_link: >-
  https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
tags:
- retrieval
- visual document retrieval
- vlm embedding
- page image embedding
- text embedding
- semantic search
- question-answering retrieval
- rag
- transformers
language:
- multilingual
library_name: transformers
---

# Model Overview

### Description:

llama-nemotron-embed-vl-1b-v2 was developed by NVIDIA for **multimodal** question-answering retrieval. The model can embed document pages in the form of image, text, or combined image–text inputs. Documents can be retrieved given a user query in text form. The model supports page images containing text, tables, charts, and infographics. We report the evaluation of this model on two internal multimodal retrieval benchmarks, on the popular [ViDoRe](https://huggingface.co/vidore) V1 and V2 benchmarks, and on the new [ViDoRe V3](https://huggingface.co/blog/QuentinJG/introducing-vidore-v3) benchmark.

An embedding model is a crucial component of a retrieval system because it transforms information into dense vector representations. An embedding model is typically a transformer encoder that processes tokens of input text or images (for example, questions, passages, or page images) to output an embedding. llama-nemotron-embed-vl-1b-v2 is a combined language and vision model.

The llama-nemotron-embed-vl-1b-v2 is part of the [Nemotron RAG collection](https://huggingface.co/collections/nvidia/nemotron-rag) of open models available on Hugging Face. It is also available for optimized inference as a NIM (NVIDIA Inference Microservice) from NVIDIA NeMo Retriever, which provides state-of-the-art, commercially-ready models and microservices optimized for the lowest latency and highest throughput. It features a production-ready information retrieval pipeline with enterprise support.
The models that form the core of this solution have been trained using responsibly selected, auditable data sources. With multiple pre-trained models available as starting points, developers can readily customize them for domain-specific use cases, such as information technology, human resources help assistants, and research & development assistants. This model is ready for commercial use.

### License/Terms of use

The use of this model is governed by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/), and the use of the post-processing scripts is licensed under [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0.txt). Additional Information: [Llama 3.2 Community Model License Agreement](https://www.llama.com/llama3_2/license/). Built with Llama.

### Deployment Geography:

Global
### Use Case:
The llama-nemotron-embed-vl-1b-v2 is suitable for users who want to build a multimodal question-and-answer application over a large corpus, leveraging the latest dense retrieval technology. The input of the model is a text or a document image, and the output is a fixed-size embedding vector. The embedding model is a bi-encoder that supports context in textual format (e.g. the query, or the OCR text of a page or a section of a document) or the image of a document page. Typically, the embedding model is first used to embed (vectorize) the whole corpus (document images or text chunks), and the embeddings are stored in a vector database, each associated with its raw content (image or text). Then, at inference time, the embedding model is used to embed the query. The embeddings of the query and of relevant context from the corpus should be close in the embedding space.

### Release Date:
12/18/2025 via https://huggingface.co/nvidia/llama-nemotron-embed-vl-1b-v2

## Reference(s):

[Technical report - "Llama Nemoretriever Colembed: Top-Performing Text-Image Retrieval Model"](https://www.arxiv.org/abs/2507.05513)
### Citation

```
@inproceedings{moreira2025_nvretriever,
  author = {Moreira, Gabriel de Souza P. and Osmulski, Radek and Xu, Mengyao and Ak, Ronay and Schifferer, Benedikt and Oldridge, Even},
  title = {Improving Text Embedding Models with Positive-aware Hard-negative Mining},
  year = {2025},
  isbn = {9798400720406},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3746252.3761254},
  doi = {10.1145/3746252.3761254},
  pages = {2169–2178},
  numpages = {10},
  keywords = {contrastive learning, distillation, embedding models, hard-negative mining, rag, text retrieval, transformers},
  location = {Seoul, Republic of Korea},
  series = {CIKM '25}
}
```

### **Model Architecture**

**Architecture Type:** Transformer
**Network Architecture:** [Eagle VLM](https://huggingface.co/collections/nvidia/eagle) architecture with the Llama 3.2 1B language model and the SigLip2 400M image encoder
The llama-nemotron-embed-vl-1b-v2 embedding model is a transformer encoder with approximately 1.7B parameters. It is a fine-tuned version of the NVIDIA Eagle family of models, using the Llama 3.2 1B language model and the SigLip2 400M image encoder. The language model submodule has 16 layers with an embedding size of 2048, and is pre-trained on public datasets. Embedding models for retrieval are typically trained with a bi-encoder architecture, which encodes queries and documents independently. The model applies mean pooling over the output token embeddings from the language model, so that it outputs a single embedding with 2048 dimensions. Contrastive learning is used to train the embedding model to maximize the similarity between the query and the document page that contains the answer, while minimizing the similarity between the query and sampled negative pages that are not useful for answering the question.

The vision-language model encoder incorporates key innovations from NVIDIA, including [Eagle 2 research](https://arxiv.org/abs/2501.14818) and [nemoretriever-parse](https://build.nvidia.com/nvidia/nemoretriever-parse), which use a tiling-based VLM architecture. This architecture, available on [Hugging Face](https://huggingface.co/collections/nvidia/eagle-2-6764ba887fa1ef387f7df067), significantly enhances multimodal understanding through its dynamic tiling and mixture-of-vision-encoders design. It particularly improves performance on tasks that involve high-resolution images and complex visual content.

**Number of model parameters:**
- Llama 3.2 1B language model: 1.23 B (Transformer parameters: 973 M, Token embedding parameters: 262 M)
- SigLip 2 image encoder: 428.77 M

## Input(s):
**Input Type(s):** Image, Text
**Input Format(s):**
- Image: Red, Green, Blue (RGB)
- Text: String
**Input Parameters:**
- Image: Two-Dimensional (2D)
- Text: One-Dimensional (1D)
**Output Parameters:**
- Image/Text Embedding (2D) - embedding of 2048 dimensions
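As described under Model Architecture, the 2048-dimensional embedding is obtained by mean pooling the language model's output token embeddings. The sketch below shows masked mean pooling in PyTorch; the function and tensor names are illustrative, not part of the model's actual API.

```python
import torch

def masked_mean_pool(token_embeds: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over the sequence, ignoring padding positions.

    token_embeds:   [batch, seq_len, hidden]  (hidden = 2048 for this model)
    attention_mask: [batch, seq_len] with 1 for real tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(token_embeds.dtype)  # [batch, seq_len, 1]
    summed = (token_embeds * mask).sum(dim=1)                   # [batch, hidden]
    counts = mask.sum(dim=1).clamp(min=1e-9)                    # [batch, 1]
    return summed / counts

# Toy example: batch of 2 sequences, 4 token positions each, hidden size 2048
token_embeds = torch.randn(2, 4, 2048)
attention_mask = torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]])
pooled = masked_mean_pool(token_embeds, attention_mask)
print(pooled.shape)  # torch.Size([2, 2048])
```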
**Other Properties Related to Input:**
- The maximum context length we evaluated with this model is 10240 tokens.
- Each image tile consumes 256 tokens. We have tested this model extensively with these settings in `config.json`: `max_input_tiles = 6`, `use_thumbnails = True`, so that every image is split into a maximum of 6 tiles + 1 thumbnail (the whole image at lower resolution), consuming about 1792 visual tokens. If you embed both the page image and its text (e.g. page OCR), the sum of the visual tokens (explained above) and the text tokens must not exceed 10240 tokens.

## Output(s)

**Output Type:** Floats
**Output Format:** List of float arrays (embeddings)
**Output:** Model outputs embedding vectors of maximum dimension 2048 for each input.
**Other Properties Related to Output:** N/A
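The per-image token budget described under input properties follows from simple arithmetic. The sketch below uses the figures quoted in this card (256 tokens per tile, at most 6 tiles plus 1 thumbnail, 10240-token context limit); the helper names are illustrative, not part of the model's API.

```python
TOKENS_PER_TILE = 256
MAX_INPUT_TILES = 6
USE_THUMBNAIL = True
MAX_CONTEXT_TOKENS = 10240

def visual_token_budget(num_tiles: int = MAX_INPUT_TILES, thumbnail: bool = USE_THUMBNAIL) -> int:
    """Visual tokens consumed by one page image: tiles plus an optional thumbnail tile."""
    return (num_tiles + (1 if thumbnail else 0)) * TOKENS_PER_TILE

def remaining_text_tokens(num_tiles: int = MAX_INPUT_TILES, thumbnail: bool = USE_THUMBNAIL) -> int:
    """Tokens left for OCR text when embedding a page image and its text together."""
    return MAX_CONTEXT_TOKENS - visual_token_budget(num_tiles, thumbnail)

print(visual_token_budget())    # 1792 visual tokens (6 tiles + 1 thumbnail)
print(remaining_text_tokens())  # 8448 tokens left for the page text
```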
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (such as GPU cores) and software frameworks (such as CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

## Installation

The model requires `transformers>=4.56.0` and optionally flash-attention.

```bash
pip install "transformers>=4.56.0"
pip install "flash-attn>=2.6.3,<2.8" --no-build-isolation
```

## Transformers Usage

```python
import torch
from transformers import AutoModel, AutoProcessor
from transformers.image_utils import load_image

modality = "image"

# Load model
model_name_or_path = "nvidia/llama-nemotron-embed-vl-1b-v2"
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModel.from_pretrained(
    model_name_or_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
    device_map="auto"
).eval()

# Set max number of tokens (p_max_length) based on modality
if modality == "image":
    p_max_length = 2048
elif modality == "image_text":
    p_max_length = 10240
elif modality == "text":
    p_max_length = 8192
model.processor.p_max_length = p_max_length

# Sets the max number of tiles an image can be split into. Each tile consumes 256 tokens.
model.processor.max_input_tiles = 6
# Enables an extra tile with the full image at lower resolution
model.processor.use_thumbnail = True

# Example usage: single query with multiple image documents
query = "How is AI improving the intelligence and capabilities of robots?"
image_paths = [
    "https://developer.download.nvidia.com/images/isaac/nvidia-isaac-lab-1920x1080.jpg",
    "https://blogs.nvidia.com/wp-content/uploads/2018/01/automotive-key-visual-corp-blog-level4-av-og-1280x680-1.png",
    "https://developer-blogs.nvidia.com/wp-content/uploads/2025/02/hc-press-evo2-nim-25-featured-b.jpg"
]

# Load all images (load_image handles both local paths and URLs)
images = [load_image(img_path) for img_path in image_paths]

# Text descriptions corresponding to each image/document (used in image_text and text modalities)
document_texts = [
    "AI enables robots to perceive, plan, and act autonomously.",
    "AI is transforming autonomous vehicles by enabling safer, smarter, and more reliable decision-making on the road.",
    "A biological foundation model designed to analyze and generate DNA, RNA, and protein sequences."
]

# Run inference (common for all modalities)
with torch.inference_mode():
    queries_embeddings = model.encode_queries([query])
    if modality == "image_text":
        documents_embeddings = model.encode_documents(images=images, texts=document_texts)
    elif modality == "image":
        documents_embeddings = model.encode_documents(images=images)
    elif modality == "text":
        documents_embeddings = model.encode_documents(texts=document_texts)

def _l2_normalize(x: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    return x / (x.norm(p=2, dim=-1, keepdim=True) + eps)

# L2-normalize, then compute cosine similarity between query and document embeddings
cos_sim = _l2_normalize(queries_embeddings) @ _l2_normalize(documents_embeddings).T

# Flatten similarities to a 1D tensor (handles both [num_docs] and [1, num_docs] shapes)
cos_sim_flat = cos_sim.flatten()

# Get sorted indices (highest to lowest)
sorted_indices = torch.argsort(cos_sim_flat, descending=True)

print(f"\nQuery: {query}\n")
print(f"\nRanking (highest to lowest relevance for the modality {modality}):")
for rank, idx in enumerate(sorted_indices, 1):
    doc_idx = idx.item()
    sim_val = cos_sim_flat[doc_idx].item()
    if modality == "text":
        print(f"  Rank {rank}: cos_sim={sim_val:.4f} | Text: {document_texts[doc_idx]}")
    else:  # image or image_text modality
        print(f"  Rank {rank}: cos_sim={sim_val:.4f} | Image: {image_paths[doc_idx]}")
```

## Software Integration:

**Runtime Engine(s)**: TensorRT, Triton, NeMo Retriever Embedding NIM (upcoming)
**Supported Hardware Microarchitecture Compatibility**: NVIDIA Ampere, NVIDIA Blackwell, NVIDIA Hopper, NVIDIA Lovelace
**Preferred/Supported Operating System(s):** Linux
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
## Model Version(s): `llama-nemotron-embed-vl-1b-v2`
## Training and Evaluation Datasets:

### Training Dataset

The development of large-scale, public, open-QA datasets has enabled tremendous progress in powerful embedding models. However, the following issues limit the use of these models in commercial settings.
- One popular dataset, named MS MARCO, restricts commercial licensing.
- Many multimodal datasets use synthetic data generation with proprietary models.

NVIDIA's training dataset is based on public QA datasets, and only includes datasets that have a license for commercial applications.

**Properties:** The text component comprises semi-supervised pre-training on 12M samples from public datasets and fine-tuning on 1.5M samples from public datasets. The VLM component uses only commercially-viable data from the [Eagle2](https://github.com/NVlabs/EAGLE) training data and other public datasets.
**Data Modality**: Image, Text

**Image Training Data Size** - 1 Million to 1 Billion Images (about 2.57 Million)

**Text Training Data Size** - 1 Billion to 10 Trillion Tokens (about 1.6 Billion)

**Data Collection Method by dataset**: Hybrid: Automated, Human, Synthetic
**Labeling Method by dataset**: Hybrid: Automated, Human, Synthetic
### Evaluation Datasets

#### Vision document retrieval benchmarks

We evaluated **llama-nemotron-embed-vl-1b-v2** on the popular [ViDoRe](https://huggingface.co/vidore) V1 and V2 benchmarks and on the new [ViDoRe V3](https://huggingface.co/blog/QuentinJG/introducing-vidore-v3). More details can be found on the [ViDoRe leaderboard](https://huggingface.co/spaces/vidore/vidore-leaderboard).

We also evaluated **llama-nemotron-embed-vl-1b-v2** on two internal visual document retrieval datasets:
- **DigitalCorpora-10k**: A dataset with questions based on a corpus of 10k documents from [DigitalCorpora](https://digitalcorpora.org/corpora/file-corpora/cc-main-2021-31-pdf-untruncated/) that have a good mixture of text, tables, and charts.
- **Earnings V2**: An internal retrieval dataset of 287 questions based on 500 PDFs, mostly consisting of earnings reports from big tech companies.

For those interested in reproducing our results, one of our internal datasets (DigitalCorpora-10k) can be created by following the instructions in [this notebook](https://github.com/NVIDIA/nv-ingest/blob/main/evaluation/digital_corpora_download.ipynb) from the NeMo Retriever Extraction GitHub repository.

#### Text retrieval benchmarks

We evaluated **llama-nemotron-embed-vl-1b-v2** on 92 text retrieval datasets from the benchmarks BEIR, MIRACL (multi-language), MLQA (cross-language), and MLDR (long-context).

**Data Collection Method by dataset**: Hybrid: Automated, Human, Synthetic
**Labeling Method by dataset**: Hybrid: Automated, Human, Synthetic
### Evaluation Results

#### Visual Document Retrieval (page retrieval)

In this section, we compare the performance of **llama-nemotron-embed-vl-1b-v2** with its previous version, **llama-3.2-nemoretriever-1b-vl-embed-v1** (closed weights), available as a NIM [here](https://build.nvidia.com/nvidia/llama-3_2-nemoretriever-1b-vlm-embed-v1). You can see [here](https://build.nvidia.com/nvidia/llama-3_2-nemoretriever-1b-vlm-embed-v1/modelcard) how the previous model compares to other small-sized VLMs.

The table below shows that the new **llama-nemotron-embed-vl-1b-v2** provides much better retrieval accuracy (Recall@5) for the image and image+text modalities than its predecessor.

*Note:* Image+Text modality means that both the page image and its text (which might be extracted by an OCR library such as [NV-Ingest](https://github.com/NVIDIA/nv-ingest)) are fed as input to the embedding model for a more accurate representation and retrieval.
Visual Document Retrieval benchmarks (page retrieval) - Avg Recall@5 on DC10k, Earnings V2, ViDoRe V1, V2, V3

| Model | Text | Image | Image + Text |
| ----- | ---- | ----- | ------------ |
| llama-nemotron-embed-1b-v2 (former name: llama-3_2-nv-embedqa-1b-v2) | 69.35% | - | - |
| llama-3.2-nemoretriever-1b-vlm-embed-v1 (closed weights, NIM-only) | 71.07% | 70.46% | 71.71% |
| llama-nemotron-embed-vl-1b-v2 | 71.04% | 71.20% | 73.24% |
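Recall@5, the metric reported above, measures the fraction of queries for which a relevant page appears among the top-5 retrieved results. The sketch below shows one common definition of this metric; the helper and variable names are illustrative, not the evaluation code used for these benchmarks.

```python
def recall_at_k(ranked_doc_ids: list[list[str]], relevant_ids: list[set[str]], k: int = 5) -> float:
    """Fraction of queries with at least one relevant document in the top-k results."""
    hits = 0
    for ranking, relevant in zip(ranked_doc_ids, relevant_ids):
        if any(doc_id in relevant for doc_id in ranking[:k]):
            hits += 1
    return hits / len(ranked_doc_ids)

# Toy example: 2 queries; the first has its relevant page in the top 5, the second does not
ranked = [["p3", "p7", "p1", "p9", "p2"], ["p5", "p6", "p8", "p4", "p0"]]
relevant = [{"p1"}, {"p2"}]
print(recall_at_k(ranked, relevant, k=5))  # 0.5
```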
#### Text Retrieval benchmarks (chunk retrieval)

The **llama-nemotron-embed-vl-1b-v2** also improves retrieval accuracy on text retrieval benchmarks compared to our competitive text-only embedding model **llama-nemotron-embed-1b-v2**. That means you can deploy our single VLM-based model **llama-nemotron-embed-vl-1b-v2** regardless of whether the corpus to be retrieved consists of images or text.
Text Retrieval benchmarks (chunk retrieval) - Avg. Recall@5

| Model | BEIR retrieval + TechQA | MIRACL | MLQA | MLDR | Average |
| ----- | ----------------------- | ------ | ---- | ---- | ------- |
| llama-nemotron-embed-1b-v2 (former name: llama-3_2-nv-embedqa-1b-v2) | 68.60% | 60.75% | 79.86% | 59.55% | 67.19% |
| llama-nemotron-embed-vl-1b-v2 | 69.19% | 60.48% | 79.90% | 60.09% | 67.42% |
## Inference

**Acceleration Engine**: TensorRT
**Test Hardware**: H100, A100, L40S, A10G, B200, RTX PRO 6000

## Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
For more detailed information on ethical considerations for this model, see the Model Card++ tab for the Explainability, Bias, Safety & Security, and Privacy subcards. Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
# Bias

| Field | Response |
| ----- | ----- |
| Participation considerations from adversely impacted groups [protected classes](https://www.senate.ca.gov/content/protected-classes) in model design and testing | None |
| Measures taken to mitigate against unwanted bias | None |

# Explainability

| Field | Response |
| ----- | ----- |
| Intended Application & Domain: | Document and query embedding for question and answer retrieval. |
| Model Type: | Transformer encoder. |
| Intended User: | Generative AI creators working with conversational AI models. Users who want to build a question and answer application over a large corpus, leveraging the latest dense retrieval technologies. The corpus can be images of PDFs containing text, tables, charts, or infographics, or extracted plain text. |
| Output: | Array of float numbers (dense vector representation of the input text). |
| Describe how the model works: | Model transforms the input into a dense vector representation. |
| Technical Limitations: | The model's max sequence length is 10240. Longer text inputs should be truncated. |
| Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: | N/A |
| Verified to have met prescribed NVIDIA quality standards: | Yes |
| Performance Metrics: | Accuracy, Throughput, and Latency. |
| Potential Known Risks: | This model is not guaranteed to always retrieve the correct passage(s) for a given query. |
| Licensing & Terms of Use: | The use of this model is governed by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/), and the use of the post-processing scripts is licensed under [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0.txt). Additional Information: [Llama 3.2 Community Model License Agreement](https://www.llama.com/llama3_2/license/). Built with Llama. |

# Privacy

| Field | Response |
| ----- | ----- |
| Generatable or reverse engineerable personal data? | None |
| Personal data used to create this model? | None Known |
| How often is dataset reviewed? | Dataset is initially reviewed upon addition, and subsequent reviews are conducted as needed or upon request for changes. |
| Is there provenance for all datasets used in training? | Yes |
| Does data labeling (annotation, metadata) comply with privacy laws? | Yes |
| Is data compliant with data subject requests for data correction or removal, if such a request was made? | No, not possible with externally-sourced data. |
| Applicable Privacy Policy | https://www.nvidia.com/en-us/about-nvidia/privacy-policy/ |

# Safety

| Field | Response |
| ----- | ----- |
| Model Application(s): | Document Embedding for Retrieval. User queries can be text, and documents can be text, document page images, charts, tables, and infographics. |
| Describe the life critical impact (if present) | Not applicable |
| Use Case Restrictions: | The use of this model is governed by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/), and the use of the post-processing scripts is licensed under [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0.txt). Additional Information: [Llama 3.2 Community Model License Agreement](https://www.llama.com/llama3_2/license/). Built with Llama. |
| Model and dataset restrictions: | The Principle of Least Privilege (PoLP) is applied, limiting access for dataset generation and model development. Restrictions enforce dataset access during training, and dataset license constraints are adhered to. |