DINOv2 Model Guide

Getting Started with Modelbit

Modelbit is an MLOps platform that lets you train and deploy any ML model, from any Python environment, with a few lines of code.

Table of Contents

  • Getting Started
  • Overview
  • Use Cases
  • Strengths
  • Limitations
  • Learning Type


Getting Started

Model Documentation


Deploy this model to a REST API

🏠 Testing Locally

Before deploying, it's always good practice to test things locally! In this section, we'll focus on setting up our environment and using this model locally before deployment.

!apt update && apt upgrade -y && apt autoremove -y
!pip3 install --upgrade pip setuptools wheel
!pip3 install --upgrade --no-cache-dir --extra-index-url https://pypi.nvidia.com cuml-cu11
!pip3 install --upgrade --no-cache-dir \
    --extra-index-url https://pypi.nvidia.com \
    --extra-index-url https://download.pytorch.org/whl/cu117 \
    git+https://github.com/facebookresearch/dinov2.git

🔃 Download and Load DINOv2 Weights into Memory

Import the dependencies required to run the demo. Then, load the DINOv2 weights. Choose the size of your weights carefully, depending on how much VRAM you have.

Free Colab notebooks offer T4 GPUs with 16 GB of VRAM, which should be enough to load any of the weights listed in the DINOv2 repository. You can change your runtime type in the top-right corner of Colab.

  • To determine which weight sizes you can load, see the pretrained models table in the DINOv2 repository.
  • To find the current name of the image classification model for each size, see the pretrained classifier heads listed in the same repository.
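As a rough guide to the VRAM question above, the sketch below estimates how much memory the weights alone occupy. The parameter counts are the rounded figures from the DINOv2 repository's model table, and the helper name `weight_memory_gb` is made up for this example. Activations, the classifier head, and CUDA overhead add to these numbers, so treat the output as a lower bound rather than a measurement.

```python
# Approximate backbone parameter counts, in millions (rounded figures
# from the DINOv2 repository's pretrained models table).
PARAMS_MILLIONS = {
    "dinov2_vits14": 21,
    "dinov2_vitb14": 86,
    "dinov2_vitl14": 300,
    "dinov2_vitg14": 1100,
}


def weight_memory_gb(params_millions, bytes_per_param=4):
    """Return the memory needed for the weights alone, in GiB.

    bytes_per_param is 4 for fp32 and 2 for fp16/bf16.
    """
    return params_millions * 1e6 * bytes_per_param / 1024**3


for name, millions in PARAMS_MILLIONS.items():
    print(f"{name}: ~{weight_memory_gb(millions):.2f} GiB in fp32")
```

Even the largest backbone needs only a few GiB for its fp32 weights, which is consistent with the claim that a 16 GB T4 can hold any of the listed models.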

import torch
import torchvision.transforms as T
import json
import urllib.request
import requests
from PIL import Image
from io import BytesIO

# Get ImageNet labels
imagenet_class_url = 'https://raw.githubusercontent.com/anishathalye/imagenet-simple-labels/master/imagenet-simple-labels.json'
imagenet_classes = json.loads(urllib.request.urlopen(imagenet_class_url).read())

# Set a device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

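# Load the DINOv2 model with its ImageNet linear classification head.
# The ViT-B/14 variant is loaded below; to use the larger ViT-g/14 weights,
# swap in the commented-out line. The variable name is the same either way.
# dinov2_vitg14_reg_lc = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14_reg_lc').eval().to(device)
dinov2_vitg14_reg_lc = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14_reg_lc').eval().to(device)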

🧪 Test DINOv2 Locally with a Sample Image

Now that you have loaded the DINOv2 weights, we can pass a preprocessed image to the model. To do this, simply use wget or upload an image already on your machine to your Colab directory.

!wget -O golden_retriever.jpg "https://www.princeton.edu/sites/default/files/styles/crop_2048_ipad/public/images/2022/02/KOA_Nassau_2697x1517.jpg?itok=AuZckGYV"

Next, you'll want to preprocess the image for DINOv2. Since the classification head was trained on ImageNet, we apply standard ImageNet preprocessing to the image.

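image = Image.open('golden_retriever.jpg').convert('RGB')

# Standard ImageNet preprocessing: resize, crop, convert to a tensor, normalize
transform = T.Compose([
    T.Resize(256, interpolation=T.InterpolationMode.BICUBIC),
    T.CenterCrop(224),  # 224 is a multiple of the model's 14-pixel patch size
    T.ToTensor(),       # PIL image -> float tensor scaled to [0, 1]
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

image = transform(image).to(device)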
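To make the normalization step concrete, here is a small worked example in plain Python. It assumes the usual convention that pixel values are first scaled from 0-255 to the range [0, 1] and then normalized per channel as (x - mean) / std, using the same mean and std values as above. The helper `normalize_pixel` is hypothetical and only illustrates the arithmetic for a single pixel.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb):
    """Map one 8-bit RGB pixel to the values the model receives.

    Step 1 scales each channel to [0, 1]; step 2 subtracts the channel
    mean and divides by the channel standard deviation.
    """
    return tuple(
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )


white = normalize_pixel((255, 255, 255))
black = normalize_pixel((0, 0, 0))
print("white:", [round(v, 3) for v in white])
print("black:", [round(v, 3) for v in black])
```

After normalization, inputs are roughly centered on zero, which matches the statistics the classification head saw during training.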

Now, we can pass the image through our model and get a class ID and label.

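with torch.no_grad():
    features = dinov2_vitg14_reg_lc(image.unsqueeze(0))  # one logit per ImageNet class
class_id = features.argmax(-1).item()
print(class_id, imagenet_classes[class_id])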
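The classifier returns one raw score (logit) per ImageNet class, and the predicted class is the index of the largest score. The sketch below shows that step in plain Python and also converts the scores into probabilities with a softmax, so you can inspect the top few candidates. The logits and the four-entry label list are toy values invented for illustration, not real model output or real ImageNet indices.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, labels, k=3):
    """Return the k most likely (index, label, probability) triples."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, labels[i], probs[i]) for i in ranked[:k]]


# Toy stand-ins for the model output and the label list.
toy_logits = [1.2, 4.5, 0.3, 2.8]
toy_labels = ["tench", "golden retriever", "goldfish", "Labrador retriever"]

for index, label, prob in top_k(toy_logits, toy_labels):
    print(f"{index}: {label} ({prob:.3f})")
```

With the real model, the same idea applies to the 1,000-element output tensor, where `argmax(-1)` picks the first entry of this ranking.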


✅ Prepare a DINOv2 Image Classification Function

def dinov2_classifier(img_url):
    # Download the image and force three RGB channels
    response = requests.get(img_url)
    image = Image.open(BytesIO(response.content)).convert('RGB')

    # Preprocess the image
    transform = T.Compose([
        T.Resize(256, interpolation=T.InterpolationMode.BICUBIC),
        T.CenterCrop(224),  # 224 is a multiple of the 14-pixel patch size
        T.ToTensor(),
        T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
    ])
    image = transform(image)

    # Move the image to the GPU if available
    image = image.to(device)

    # Run the classifier; the output holds one logit per ImageNet class
    with torch.no_grad():
        features = dinov2_vitg14_reg_lc(image.unsqueeze(0))

    # Return the top class index and its label
    class_id = features.argmax(-1).item()
    return {'index': class_id,
            'label': imagenet_classes[class_id]}

Because the inference code is now wrapped in a function, we can test it locally. Simply pass an image URL to dinov2_classifier().

Feel free to choose any online image that shows one of the ImageNet classes.


🚀 Deploying DINOv2 to a REST API Endpoint

Now that we've verified it works locally, it's time to see how easy it is to take our code and deploy directly to Modelbit with minimal lines of code.

🔐 Log in to Modelbit

import modelbit

# Log into the 'modelbit' service using the development ("dev") branch
# Ensure you create a "dev" branch in Modelbit or use the "main" branch for your deployment
mb = modelbit.login(branch="dev")

📦 Deploying to Modelbit

mb.deploy(dinov2_classifier, require_gpu=True)
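Once the deployment is live, it can be called over HTTP. The sketch below only builds the request, so it can be inspected without a network connection. The URL pattern, the `{"data": ...}` payload shape, the workspace name `my-workspace`, and the helper `build_inference_request` are all assumptions for illustration; check your Modelbit dashboard for the exact endpoint and for any API key your workspace requires.

```python
import json
import urllib.request


def build_inference_request(workspace, deployment, img_url):
    """Build (but do not send) a POST request for a deployed function.

    The URL layout and payload shape are assumptions; confirm them
    against the endpoint shown for your own deployment.
    """
    url = f"https://{workspace}.app.modelbit.com/v1/{deployment}/latest"
    body = json.dumps({"data": img_url}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


req = build_inference_request(
    workspace="my-workspace",            # hypothetical workspace name
    deployment="dinov2_classifier",      # matches the deployed function name
    img_url="https://example.com/dog.jpg",  # placeholder image URL
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))

# To actually send it (requires a live deployment):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```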

Model Overview

Grounding DINO is an approach to open-set object detection, the subtask of object detection concerned with identifying and localizing objects that the model may not have encountered during training. Open-set detection matters in real-world applications because the variety of object classes is immense and gathering labeled data for every possible type is impractical. Grounding DINO is a separate model from DINOv2, the self-supervised vision backbone used in the tutorial above; the two are easily confused by name.

Grounding DINO combines DINO, a Transformer-based object detector, with grounded pre-training, a method that leverages both visual and textual information. This combination improves the model's ability to detect and recognize previously unseen objects in diverse scenarios.

Release and Development

DINOv2 was developed by Meta AI as part of its ongoing work on computer vision. It improves on its predecessor, DINO, by producing features that support segmentation, classification, image retrieval, and depth estimation. It was pretrained on a curated dataset of 142 million images, giving it broad visual coverage and improved performance over traditional image-text pretraining methods.


The core of Grounding DINO is an architecture that accepts pairs of images and text as input and outputs object boxes with associated confidence scores. This structure lets the model judge how relevant each object is to the provided text, which improves its detection accuracy.

Libraries and Frameworks

DINOv2 was developed with PyTorch and distributed training methods. These tools made training efficient even for large-scale models by reducing memory usage and increasing computational speed.

Use Cases

Grounding DINO's flexibility and adaptability make it suitable for various real-world scenarios. Some key applications include:

  • Zero-Shot Object Detection: The model can detect objects outside the predefined set of classes in the training data, which makes it versatile for many real-world tasks.
  • Referring Expression Comprehension (REC): Grounding DINO can identify and localize specific objects or regions within an image based on textual descriptions, which is particularly useful in image and video processing pipelines.
  • Elimination of Hand-Designed Components: The model simplifies the object detection pipeline by removing the need for components like Non-Maximum Suppression (NMS), improving efficiency and performance.

Deployment and Accessibility

The model can be deployed as a REST API endpoint, allowing for broader application and ease of integration into existing systems. This deployment can be achieved using platforms like Modelbit, which facilitate the deployment of machine learning models directly from data science notebooks to REST endpoints.


Strengths

A key strength of DINOv2 is its self-supervised learning approach, which allows it to learn from a large collection of images without labeled data. This makes it adaptable and efficient across many computer vision tasks. Its features are also strong enough to be used without fine-tuning, which makes it practical in a wide range of scenarios.
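One way to use features without fine-tuning is nearest-neighbor retrieval: embed every image once with the frozen model, then compare embeddings by cosine similarity. The sketch below illustrates only the comparison step in plain Python. The three-dimensional vectors and image names are invented toy values; real embeddings have hundreds of dimensions or more, and the helper names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, gallery):
    """Return the name of the gallery item most similar to the query."""
    return max(gallery, key=lambda name: cosine_similarity(query, gallery[name]))


# Toy embeddings standing in for features from a frozen backbone.
gallery = {
    "dog_1": [0.90, 0.10, 0.00],
    "cat_1": [0.10, 0.90, 0.10],
    "car_1": [0.00, 0.10, 0.95],
}
query = [0.80, 0.20, 0.05]

print(nearest(query, gallery))
```

No weights are updated at any point; the quality of the result depends entirely on how well the frozen features separate the images.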


Limitations

While DINOv2 is a significant advance, it still faces challenges inherent in self-supervised learning models, such as the need for large and diverse datasets to train effectively. The larger variants also require substantial computational resources, although model distillation has been used to produce smaller models that mitigate this.

Learning Type & Algorithmic Approach

DINOv2 employs self-supervised learning, a technique that learns from unlabeled data, in contrast to traditional methods that rely heavily on annotated datasets. This approach lets the model capture a broader range of visual information without the limitations of text-based descriptions, making it particularly effective for tasks like monocular depth estimation.
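To give a feel for how learning without labels can work, here is a simplified sketch of a DINO-style self-distillation loss: a student network is trained to match the output distribution of a teacher network on a different view of the same image. This is an illustration under stated assumptions, not the full DINOv2 objective, which includes additional terms. The temperatures, the toy output vectors, and the function names are illustrative choices.

```python
import math


def log_softmax(values, temperature):
    """Numerically stable log-softmax of values / temperature."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_total for s in scaled]


def self_distillation_loss(student_out, teacher_out, center,
                           student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between teacher and student distributions.

    The teacher output is centered and sharpened with a low temperature;
    no class labels are involved.
    """
    centered = [t - c for t, c in zip(teacher_out, center)]
    teacher_probs = [math.exp(lp) for lp in log_softmax(centered, teacher_temp)]
    student_log_probs = log_softmax(student_out, student_temp)
    return -sum(p * lp for p, lp in zip(teacher_probs, student_log_probs))


teacher = [2.1, 0.4, 0.0]          # teacher output for one view
center = [0.0, 0.0, 0.0]           # running mean of teacher outputs (toy value)
agreeing_student = [2.0, 0.5, 0.1]
disagreeing_student = [0.1, 0.5, 2.0]

loss_agree = self_distillation_loss(agreeing_student, teacher, center)
loss_disagree = self_distillation_loss(disagreeing_student, teacher, center)
print(f"agreeing views: {loss_agree:.4f}, disagreeing views: {loss_disagree:.4f}")
```

The loss is small when the student agrees with the teacher and large when it does not, so minimizing it pushes the model toward consistent representations across views of the same image.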
