---
language:
- en
- es
- fr
- de
- it
- hi
- mr
- sa
- kn
- te
- ta
- ml
- zh
- ja
- ko
- ar
- bn
- gu
- or
- pa
- ru
- th
license: gemma
library_name: transformers
tags:
- vision-language
- retrieval
- multimodal
- multilingual
- document-retrieval
- matryoshka-embeddings
- dense-retrieval
- 22-languages
pipeline_tag: visual-document-retrieval
base_model:
- google/gemma-3-4b-it
model-index:
- name: NetraEmbed
  results:
  - task:
      type: image-text-retrieval
      name: Cross-Lingual Document Retrieval
    dataset:
      type: Cognitive-Lab/nayanair-bench
      name: Nayana-IR Cross-Lingual
      split: test
    metrics:
    - type: ndcg_at_5
      value: 0.716
      name: NDCG@5
    - type: recall_at_10
      value: 0.871
      name: Recall@10
    - type: map_at_10
      value: 0.703
      name: MAP@10
    - type: mrr_at_10
      value: 0.775
      name: MRR@10
  - task:
      type: image-text-retrieval
      name: Monolingual Document Retrieval
    dataset:
      type: Cognitive-Lab/nayanair-bench
      name: Nayana-IR Monolingual
      split: test
    metrics:
    - type: ndcg_at_5
      value: 0.738
      name: NDCG@5
    - type: recall_at_10
      value: 0.844
      name: Recall@10
    - type: map_at_10
      value: 0.709
      name: MAP@10
    - type: mrr_at_10
      value: 0.751
      name: MRR@10
  - task:
      type: image-text-retrieval
      name: English Document Retrieval
    dataset:
      type: vidore/vidore-benchmark
      name: ViDoRe v2
      split: test
    metrics:
    - type: ndcg_at_5
      value: 0.554
      name: NDCG@5
    - type: recall_at_10
      value: 0.637
      name: Recall@10
    - type: map_at_10
      value: 0.437
      name: MAP@10
    - type: mrr_at_10
      value: 0.647
      name: MRR@10
---

# NetraEmbed

![NetraEmbed Banner](https://cdn-uploads.huggingface.co/production/uploads/6442d975ad54813badc1ddf7/wNumrelVx2ldL9VffaiGS.png)

[![Paper](https://img.shields.io/badge/arXiv-2512.03514-b31b1b.svg)](https://arxiv.org/abs/2512.03514)
[![GitHub](https://img.shields.io/badge/GitHub-colpali-181717?logo=github)](https://github.com/adithya-s-k/colpali)
[![Model](https://img.shields.io/badge/🤗%20HuggingFace-Model-yellow)](https://huggingface.co/Cognitive-Lab/NetraEmbed)
[![Blog](https://img.shields.io/badge/Blog-CognitiveLab-blue)](https://www.cognitivelab.in/blog/introducing-netraembed)
[![Demo](https://img.shields.io/badge/Demo-Try%20it%20out-green)](https://huggingface.co/spaces/AdithyaSK/NetraEmbed)
[![Colab](https://img.shields.io/badge/Colab-Run%20Code-F9AB00?logo=googlecolab&logoColor=white)](https://huggingface.co/Cognitive-Lab/NetraEmbed/blob/main/NetraEmbed_InferenceDemo.ipynb)
[![Colab](https://img.shields.io/badge/Colab-Gradio%20Demo-F9AB00?logo=googlecolab&logoColor=white)](https://huggingface.co/Cognitive-Lab/NetraEmbed/blob/main/NetraEmbed_Gradio_Demo_final.ipynb)

**NetraEmbed** is a state-of-the-art multilingual multimodal embedding model for visual document retrieval with Matryoshka representation learning, powered by the Gemma3 backbone.

## Model Description

NetraEmbed is a multilingual multimodal embedding model that encodes both visual documents and text queries into single dense vectors. It supports multiple languages and enables efficient similarity search at multiple embedding dimensions (768, 1536, 2560) through Matryoshka representation learning.

- **Model Type:** Multilingual Multimodal Embedding Model with Matryoshka embeddings
- **Architecture:** BiEncoder with Gemma3-4B backbone
- **Embedding Dimensions:** 768, 1536, 2560 (Matryoshka)
- **Capabilities:** Multilingual, Multimodal (Vision + Text)
- **Use Case:** Visual document retrieval, multilingual semantic search, cross-lingual document understanding
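For intuition, Matryoshka representation learning trains the leading coordinates of the full 2560-dimensional vector to be usable embeddings on their own, so a smaller dimension amounts to truncating and re-normalizing. The snippet below is a conceptual sketch only (plain NumPy, assuming the standard truncate-then-renormalize MRL recipe); in practice the `embedding_dim` argument shown in the Quick Start handles this inside the model.

```python
import numpy as np

def matryoshka_view(embedding: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` coordinates and re-normalize for cosine scoring."""
    truncated = embedding[..., :dim]
    return truncated / np.linalg.norm(truncated, axis=-1, keepdims=True)

# Stand-in for a full 2560-dim NetraEmbed vector (illustration only)
full = np.random.randn(2560).astype(np.float32)
for dim in (768, 1536, 2560):
    print(dim, matryoshka_view(full, dim).shape)
```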
## Paper

📄 **[M3DR: Towards Universal Multilingual Multimodal Document Retrieval](https://arxiv.org/abs/2512.03514)**

## Installation

```bash
pip install git+https://github.com/adithya-s-k/colpali.git
```

## Quick Start

```python
import torch
from PIL import Image

from colpali_engine.models import BiGemma3, BiGemmaProcessor3

# Load model and processor
model_name = "Cognitive-Lab/NetraEmbed"

# Load model once (supports all Matryoshka dimensions)
model = BiGemma3.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
processor = BiGemmaProcessor3.from_pretrained(model_name)

# Load your images
images = [
    Image.open("document1.jpg"),
    Image.open("document2.jpg"),
]

# Define queries
queries = [
    "What is the total revenue?",
    "Show me the organizational chart",
]

# Process and encode
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_texts(queries).to(model.device)

# Choose embedding dimension at inference time: 768, 1536, or 2560
# Use lower dims for faster search, higher for better accuracy
embedding_dim = 1536  # Balanced performance

with torch.no_grad():
    image_embeddings = model(**batch_images, embedding_dim=embedding_dim)  # Shape: (num_images, embedding_dim)
    query_embeddings = model(**batch_queries, embedding_dim=embedding_dim)  # Shape: (num_queries, embedding_dim)

# Compute similarity scores using cosine similarity
scores = processor.score(
    qs=query_embeddings,
    ps=image_embeddings,
)  # Shape: (num_queries, num_images)

# Get best matches
for i, query in enumerate(queries):
    best_idx = scores[i].argmax().item()
    print(f"Query: '{query}' -> Best match: Image {best_idx + 1} (score: {scores[i, best_idx]:.4f})")
```

### Testing Multiple Dimensions

You can test different embedding dimensions without reloading the model:

```python
# Load model once
model = BiGemma3.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

# Test all Matryoshka dimensions
for embedding_dim in [768, 1536, 2560]:
    print(f"\nTesting dimension: {embedding_dim}")

    with torch.no_grad():
        image_embeddings = model(**batch_images, embedding_dim=embedding_dim)
        query_embeddings = model(**batch_queries, embedding_dim=embedding_dim)

    scores = processor.score(qs=query_embeddings, ps=image_embeddings)
    print(f"Scores shape: {scores.shape}")
    print(f"Best match score: {scores.max().item():.4f}")
```

## Matryoshka Embeddings

NetraEmbed supports three embedding dimensions that can be selected **at inference time**:

| Dimension | Use Case | Speed | Accuracy |
|-----------|----------|-------|----------|
| 768 | Fast search, large-scale | ⚡⚡⚡ | ⭐⭐ |
| 1536 | Balanced performance | ⚡⚡ | ⭐⭐⭐ |
| 2560 | Maximum accuracy | ⚡ | ⭐⭐⭐⭐ |

**Key Advantage:** Load the model once and dynamically choose dimensions at inference time. No need to reload the model to test different dimensions or switch between accuracy/speed trade-offs!

## Use Cases

- **Efficient Document Retrieval:** Fast search through millions of documents
- **Semantic Search:** Find visually similar documents
- **Scalable Vector Search:** Works with FAISS, Milvus, Pinecone, etc.
- **Cross-lingual Retrieval:** Multilingual visual document search
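For the scalable vector search use case, the single-vector embeddings drop directly into off-the-shelf vector indexes. The snippet below is a minimal sketch with FAISS (an external dependency, not part of this model's API); it assumes the `image_embeddings`, `query_embeddings`, and `queries` variables from the Quick Start above and uses an exact inner-product index over L2-normalized vectors, which is equivalent to cosine similarity.

```python
import faiss  # pip install faiss-cpu (illustrative choice; Milvus, Pinecone, etc. work similarly)
import numpy as np

# Convert the bfloat16 torch tensors from the Quick Start to float32 NumPy arrays
doc_vecs = image_embeddings.float().cpu().numpy()
query_vecs = query_embeddings.float().cpu().numpy()

# Normalize in place so that inner product equals cosine similarity
faiss.normalize_L2(doc_vecs)
faiss.normalize_L2(query_vecs)

# Build an exact inner-product index at the chosen Matryoshka dimension
index = faiss.IndexFlatIP(doc_vecs.shape[1])
index.add(doc_vecs)

# Retrieve the top documents for each query
top_k = min(3, index.ntotal)
sims, ids = index.search(query_vecs, top_k)
for i, query in enumerate(queries):
    print(f"Query: '{query}' -> documents {ids[i].tolist()} (scores: {sims[i].round(4).tolist()})")
```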
## Model Details

- **Base Model:** [Gemma3-4B-IT](https://huggingface.co/google/gemma-3-4b-it)
- **Vision Encoder:** SigLIP
- **Training Data:** Multilingual document datasets
- **Embedding Strategy:** Single-vector (BiEncoder)
- **Similarity Function:** Cosine similarity
- **Matryoshka Dimensions:** 768, 1536, 2560

## Performance

NetraEmbed achieves state-of-the-art performance on multilingual document retrieval benchmarks. Evaluated on [Nayana-IR Bench](https://huggingface.co/collections/Cognitive-Lab/nayanair-bench) (22 languages) and ViDoRe v2.

### Benchmark Results

**Nayana-IR Cross-Lingual**

| Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 |
|-------|:------:|:---------:|:------:|:------:|
| **NetraEmbed** | **0.716** | **0.871** | **0.703** | **0.775** |
| Jina-Embeddings-v4 | 0.435 | 0.435 | 0.390 | 0.548 |
| ColNomic-Embed-3B | 0.315 | 0.320 | 0.267 | 0.444 |
| ColPali-v1.3 | 0.284 | 0.347 | 0.249 | 0.403 |
| GME-Qwen2-VL-2B | 0.235 | 0.308 | 0.209 | 0.314 |
| ColQwen2.5-v0.2 | 0.143 | 0.160 | 0.127 | 0.220 |
| ColQwen2-v1.0 | 0.050 | 0.065 | 0.038 | 0.109 |

**Nayana-IR Monolingual**

| Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 |
|-------|:------:|:---------:|:------:|:------:|
| **NetraEmbed** | **0.738** | **0.844** | **0.709** | **0.751** |
| ColNomic-Embed-3B | 0.534 | 0.603 | 0.515 | 0.546 |
| ColQwen2.5-v0.2 | 0.453 | 0.513 | 0.437 | 0.464 |
| GME-Qwen2-VL-2B | 0.444 | 0.525 | 0.426 | 0.452 |
| ColQwen2-v1.0 | 0.413 | 0.466 | 0.398 | 0.422 |
| ColPali-v1.3 | 0.410 | 0.484 | 0.393 | 0.422 |

**ViDoRe v2**

| Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 |
|-------|:------:|:---------:|:------:|:------:|
| ColQwen2.5-v0.2 | 0.592 | 0.664 | 0.484 | 0.711 |
| Jina-Embeddings-v4 | 0.576 | 0.686 | - | - |
| GME-Qwen2-VL-2B | 0.574 | 0.630 | 0.466 | 0.690 |
| ColNomic-Embed-3B | 0.556 | 0.633 | 0.451 | 0.672 |
| **NetraEmbed** | **0.554** | **0.637** | **0.437** | **0.647** |
| ColQwen2-v1.0 | 0.545 | 0.640 | 0.438 | 0.653 |
| ColPali-v1.3 | 0.538 | 0.627 | 0.436 | 0.644 |

**Key Results:**

- 🏆 **State-of-the-art** on multilingual retrieval (0.716 NDCG@5 cross-lingual)
- 📈 **152% improvement** over ColPali-v1.3 on cross-lingual tasks
- 🌍 Consistent performance across **22 languages** and diverse scripts
- ⚡ **250x more efficient** than multi-vector approaches (~10KB vs ~2.5MB per document)

See our [paper](https://arxiv.org/abs/2512.03514) for comprehensive evaluation and per-language analysis.

## Citation

```bibtex
@misc{kolavi2025m3druniversalmultilingualmultimodal,
  title={M3DR: Towards Universal Multilingual Multimodal Document Retrieval},
  author={Adithya S Kolavi and Vyoman Jain},
  year={2025},
  eprint={2512.03514},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2512.03514}
}
```

## License

This model is released under the same license as the base Gemma3 model.

## Acknowledgments

This work benefited from compute credits for training, inference, and evaluation provided by [Modal](https://modal.com), acknowledged as a compute sponsor. Dataset curation and synthesis were supported by the [Meta LLaMA Impact Grant](https://about.fb.com/news/2025/04/llama-impact-grant-recipients/?utm_source=AIatMeta&utm_medium=organic_social&utm_content=image&utm_campaign=llamacon) through our [Nayana initiative](https://www.cognitivelab.in/nayana). We appreciate Meta for continued support of our research efforts at [CognitiveLab](https://www.cognitivelab.in).
Built on top of the Gemma3 architecture with Matryoshka representation learning.