NVIDIA has released two new compact AI models designed for multimodal document retrieval. The Llama Nemotron VL-1B models enable accurate search across PDFs, images, and visual documents.
The release includes llama-nemotron-embed-vl-1b-v2 for embeddings and llama-nemotron-rerank-vl-1b-v2 for reranking. Both models work with standard vector databases out of the box.
Addressing Real-World Document Challenges
Enterprise data exists in complex formats including PDFs with charts, scanned contracts, and slide decks. Traditional text-only retrieval systems miss crucial visual information in these documents.
Multimodal RAG pipelines solve this problem by enabling retrieval over text, images, and layouts together. This approach delivers more accurate and actionable answers for enterprise applications.
The embedding model condenses visual and textual information into a single-vector representation. This design keeps the model compatible with standard vector databases and supports millisecond-latency search at scale.
The reranking model reorders retrieved candidates to improve relevance scores significantly. It boosts downstream answer quality without requiring changes to storage or index formats.
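To make the two-stage pattern concrete, here is a minimal retrieve-then-rerank sketch in plain Python. The `embed`, `cosine`, and overlap-based reranking functions below are toy stand-ins for illustration only; a real deployment would call llama-nemotron-embed-vl-1b-v2 and llama-nemotron-rerank-vl-1b-v2 instead.

```python
import math

def embed(text: str) -> list[float]:
    # Toy stand-in embedding: normalized character-frequency vector.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    # Stage 1: rank the whole corpus by embedding similarity, keep top-k.
    q = embed(query)
    ranked = sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def rerank(query: str, candidates: list[str]) -> list[str]:
    # Stage 2: a cross-encoder would jointly score each (query, candidate)
    # pair; token overlap approximates that idea for illustration.
    qt = set(query.lower().split())
    return sorted(candidates,
                  key=lambda d: len(qt & set(d.lower().split())),
                  reverse=True)

corpus = ["quarterly revenue chart", "contract scan page",
          "slide deck on revenue growth", "architecture diagram"]
top = rerank("revenue growth slides", retrieve("revenue growth slides", corpus))
print(top[0])  # -> slide deck on revenue growth
```

The key design point the sketch illustrates: the second stage only reorders candidates the first stage already found, so the vector index itself never changes.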
Benchmark Performance Results
NVIDIA evaluated both models across five visual document retrieval datasets. Testing included ViDoRe V1, V2, V3, DigitalCorpora-10k, and an internal earnings report dataset.
The llama-nemotron-embed-vl-1b-v2 achieved 73.24% Recall@5 using the combined image and text modality. When paired with the reranker, accuracy increased to 77.64%.
The embedding model outperforms its predecessor llama-3.2-nemoretriever-1b-vlm-embed-v1 across all modalities. The reranker improves retrieval accuracy by approximately 6-7% per modality.
Compared to competitors, the models demonstrate superior performance on text and combined modalities. The permissive commercial license makes them ideal for enterprise deployments.
Technical Architecture Details
The embedding model contains approximately 1.7 billion parameters and uses a transformer architecture. It is fine-tuned from the NVIDIA Eagle family, combining a Llama 3.2 1B language model with a SigLIP2 400M vision encoder.
The model applies mean pooling over the language model's output token embeddings, producing a single 2048-dimensional embedding per input for efficient storage and retrieval.
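Mean pooling itself is straightforward: average the token vectors, skipping padding positions. A pure-Python sketch, using a toy 2-dimensional example rather than the model's 2048 dimensions:

```python
def mean_pool(token_embeddings: list[list[float]],
              attention_mask: list[int]) -> list[float]:
    # Average token vectors, ignoring padding positions (mask == 0).
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, m in zip(token_embeddings, attention_mask):
        if m:
            for i, v in enumerate(vec):
                total[i] += v
            count += 1
    return [t / count for t in total]

tokens = [[1.0, 3.0], [3.0, 5.0], [0.0, 0.0]]  # last row is padding
print(mean_pool(tokens, [1, 1, 0]))  # -> [2.0, 4.0]
```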
Contrastive learning trains the model to increase similarity between queries and relevant documents. The training simultaneously decreases similarity to negative samples for better discrimination.
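Contrastive objectives of this kind are commonly implemented as an InfoNCE-style loss: a softmax over query-document similarities with cross-entropy against the positive document. The article does not specify the exact loss, so the sketch below is an assumption, using dot-product similarity and one positive per query:

```python
import math

def info_nce(query_vec: list[float], pos_vec: list[float],
             neg_vecs: list[list[float]], temperature: float = 0.05) -> float:
    # InfoNCE-style contrastive loss: softmax over similarities to the
    # positive and negative documents, cross-entropy on the positive.
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    sims = [dot(query_vec, pos_vec)] + [dot(query_vec, n) for n in neg_vecs]
    logits = [s / temperature for s in sims]
    m = max(logits)  # subtract max for numerical stability
    log_denom = math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[0] - m - log_denom)

# Loss is near zero when the query matches the positive document...
print(info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]]))
# ...and large when the query instead matches a negative.
print(info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]]))
```

Minimizing this loss pushes query embeddings toward relevant documents and away from the negatives, which is exactly the discrimination behavior described above.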
The reranking model is a cross-encoder that also contains approximately 1.7 billion parameters. After mean pooling, a binary classification head scores each query-document pair for ranking.
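A binary classification head of this kind reduces to a linear layer followed by a sigmoid over the pooled representation. A minimal sketch with illustrative weights (not the released model's parameters):

```python
import math

def rerank_score(pooled_vec: list[float],
                 weights: list[float], bias: float) -> float:
    # Linear layer + sigmoid: maps the mean-pooled (query, document)
    # representation to a relevance probability in (0, 1).
    logit = sum(w * v for w, v in zip(weights, pooled_vec)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

# A zero logit yields a neutral 0.5 score.
print(rerank_score([1.0, 2.0], [0.0, 0.0], 0.0))  # -> 0.5
```

Candidates are then simply sorted by this score to produce the final ranking.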
Enterprise Adoption Examples
Several major organizations have already implemented these models in production environments. Cadence uses them for design and EDA workflow documentation retrieval.
IBM Storage applies the models to index product guides, configuration manuals, and architecture diagrams. The system helps AI interpret complex infrastructure documentation more accurately.
ServiceNow deploys the multimodal embeddings for its Chat with PDF experiences. The reranker selects the most relevant pages for each user query across organizational documents.
Availability and Integration
Both models are available now on Hugging Face for immediate download. They run efficiently on most NVIDIA GPUs.
Developers can integrate the embedding model with any vector database for multimodal search. The reranker slots in as a second-stage filter over the top-k results.
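From the database side, the integration looks like this: the store handles only first-stage vector search, and the reranker rescored the returned top-k IDs. The sketch below uses a toy in-memory store and a pretend reranking step; real deployments would swap in an actual vector database and the released reranker model.

```python
class VectorStore:
    """Minimal in-memory stand-in for a vector database."""

    def __init__(self):
        self.ids: list[str] = []
        self.vectors: list[list[float]] = []

    def add(self, doc_id: str, vector: list[float]) -> None:
        self.ids.append(doc_id)
        self.vectors.append(vector)

    def search(self, query_vec: list[float], k: int) -> list[str]:
        # Exact inner-product search; real stores use ANN indexes.
        def dot(a, b):
            return sum(x * y for x, y in zip(a, b))
        scored = sorted(zip(self.ids, self.vectors),
                        key=lambda p: dot(query_vec, p[1]), reverse=True)
        return [doc_id for doc_id, _ in scored[:k]]

store = VectorStore()
store.add("chart.pdf", [0.9, 0.1])
store.add("contract.pdf", [0.2, 0.8])
store.add("slides.pdf", [0.7, 0.3])

top_k = store.search([1.0, 0.0], k=2)  # first stage: embedding recall
# Second stage: a cross-encoder reranker would rescore each (query, doc)
# pair; here we pretend it prefers slides.pdf for this query.
reranked = sorted(top_k, key=lambda d: d == "slides.pdf", reverse=True)
print(reranked)  # -> ['slides.pdf', 'chart.pdf']
```

Because reranking operates purely on retrieved IDs and scores, no storage or index format changes are needed, matching the claim above.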
The models reduce hallucinations by grounding generation in better-retrieved evidence. Together they keep VLM responses accurate and contextually relevant.
NVIDIA continues expanding its Nemotron model family for enterprise AI applications. The new releases support building sophisticated multimodal agents.
Source: Hugging Face Blog | NVIDIA Nemotron Collection

