Model Card for CoMP-DINOv2-Large

This is a vision foundation model (VFM) that supports native-resolution image inputs, continually pre-trained from DINOv2.

Model Sources

Repository: https://github.com/SliMM-X/CoMP-MM
Paper: https://arxiv.org/abs/2503.18931
Project Page: https://slimm-x.github.io/comp

How to Get Started with the Model

Install the GitHub repository listed above, then use the code below to get started with the model.

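from io import BytesIO

import requests
import torch
from PIL import Image
from slimm.model.processor import SliMMQwen2VLProcessor
from slimm.model.utils_vl import process_vision_info
from slimm.model.vision_encoder import CoMPDinov2Model

model_path = "SliMM-X/CoMP-DINOv2-Large"

# Load the vision encoder and cast it to bfloat16 on the GPU.
model = CoMPDinov2Model.from_pretrained(
    model_path, torch_dtype="auto", device_map="cuda", w_merger=False
).to(torch.bfloat16)

processor = SliMMQwen2VLProcessor.from_pretrained(model_path)

# PIL cannot open a URL directly, so download the image bytes first.
image_url = "https://slimm-x.github.io/comp/figs/teaser.png"
response = requests.get(image_url, timeout=30)
response.raise_for_status()
image_input = Image.open(BytesIO(response.content)).convert("RGB")

inputs = processor(
    images=image_input,
    return_tensors="pt",
)

# Move the inputs to the GPU and match the model's dtype before the forward pass.
inputs = inputs.to("cuda")
with torch.no_grad():
    output_feat = model(inputs.pixel_values.to(torch.bfloat16), inputs.image_grid_thw)
print(output_feat)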
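
The `image_grid_thw` argument passed alongside the pixel values suggests that the number of patch tokens follows the input resolution instead of being fixed by a square resize. As a rough illustration of that idea (not the actual SliMM processor logic), the helper below estimates the patch grid for an image kept at its native resolution. The function names, the patch size of 14 (the value DINOv2 uses), and the round-to-nearest-multiple rule are assumptions made for this sketch; the real processor may also apply pixel limits or patch merging.

```python
# Illustrative only: these helpers are hypothetical and not part of the slimm package.
def estimate_patch_grid(height: int, width: int, patch_size: int = 14) -> tuple[int, int, int]:
    """Return an assumed (t, h, w) patch grid for a single still image."""
    if height <= 0 or width <= 0:
        raise ValueError("height and width must be positive")
    # Snap each side to the nearest whole number of patches (at least one).
    grid_h = max(1, round(height / patch_size))
    grid_w = max(1, round(width / patch_size))
    return 1, grid_h, grid_w


def num_patch_tokens(grid: tuple[int, int, int]) -> int:
    """Total number of patch tokens implied by a (t, h, w) grid."""
    t, h, w = grid
    return t * h * w


if __name__ == "__main__":
    for size in [(224, 224), (480, 640), (1080, 1920)]:
        grid = estimate_patch_grid(*size)
        print(size, "->", grid, "tokens:", num_patch_tokens(grid))
```

Under these assumptions a 224x224 image yields a 16x16 grid, while larger or non-square images yield proportionally more tokens, so sequence length and memory use grow with image size.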

Citation

BibTeX:

@article{comp2025,
      title={CoMP: Continual Multimodal Pre-training for Vision Foundation Models}, 
      author={Chen, Yitong and Meng, Lingchen and Peng, Wujian and Wu, Zuxuan and Jiang, Yu-Gang},
      year={2025},
      journal={arXiv preprint arXiv:2503.18931}, 
}

Safetensors

Model size: 0.3B params
Tensor type: F32
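
Because the checkpoint is stored in F32 while the getting-started code casts the model to bfloat16, it can help to estimate the weight footprint in each precision. The back-of-the-envelope sketch below uses the rounded 0.3B parameter count listed above; the helper name `weight_memory_gb` is made up for illustration, and the estimate ignores activations, buffers, and framework overhead.

```python
# Bytes per element for the two dtypes mentioned on this card.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes). Hypothetical helper."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    params = 0.3e9  # rounded parameter count from the model card
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: ~{weight_memory_gb(params, dtype):.1f} GB")
```

With these numbers the weights take roughly 1.2 GB in F32 and 0.6 GB in bfloat16.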

Model tree for SliMM-X/CoMP-DINOv2-Large

Base model: facebook/dinov2-large
