Over recent years, computer vision has experienced significant advancements, primarily driven by innovations in deep learning. Many object detection models are traditionally designed to recognize a predefined set of classes. Introducing new classes to these models often demands collecting and annotating new data, followed by retraining the model from scratch—a process that is both time- and compute-intensive.
Zero-shot object detection—identifying objects in images without prior training on those specific classes—has garnered substantial interest recently. Its advantages in computer vision are manifold: it allows models to recognize and delineate objects in images even if they weren't exposed to such objects during their training phase.
This is especially beneficial in real-world situations where the array of objects to recognize is immense and obtaining labeled data for every potential object class is impractical. With zero-shot detection, you can devise adaptive and efficient models capable of recognizing unfamiliar objects without necessitating retraining or additional labeled data acquisition.
Moreover, these zero-shot methodologies can considerably decrease the time and resources dedicated to data labeling, which is often a significant challenge in crafting effective computer vision solutions.
Researchers at Meta AI employed self-supervised learning combined with Transformers to craft a zero-shot method named DINO, short for "self-DIstillation with NO labels." Researchers outside of Meta later evolved this method into Grounding DINO, a vision-language pre-training technique that harnesses the potential of self-supervised learning and attention mechanisms, delivering outstanding results in open-set object detection.
The primary aim behind Grounding DINO is to establish a robust system capable of detecting diverse objects as described through human language inputs, eliminating the necessity for model retraining. The model can discern and detect objects when given a textual prompt.
In this article, you will learn how to deploy the Grounding DINO Model as a REST API endpoint for object detection using Modelbit. Let’s delve right in! 🚀
Here’s the flow of the solution you are going to build:
Here are the steps you’ll walk through to have a deployed model endpoint:
You will use a CPU Colab instance for this article (and CPU inference on Modelbit, not GPU). As a result, you will need to match your local environment to the production environment.
Let’s hop in! 🏊
For an interactive experience, access the Colab Notebook, which contains all the provided code and is ready to run!
The pre-installed Colab versions of `torch` and `torchvision` are the CUDA versions for GPU-based training and inferencing. This walkthrough uses the CPU installations to load the Grounding DINO model and for inference on Modelbit.
Once the installation is complete, restart your runtime to point the Colab session to the new installations.
Next, you want to organize your setup by managing the paths for all the data, files, and installations. Get the current working directory—likely `/content`, a bit of a handful to work with already—where the notebook session is running and assign that directory to the `HOME` variable:
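In code, that setup step might look like this minimal sketch:

```python
import os

# Capture the notebook's working directory (usually /content on Colab)
# so later paths can be built relative to it.
HOME = os.getcwd()
print(HOME)
```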
Install Grounding DINO by cloning the official library repository from GitHub because it is not officially distributed through `pip` yet:
Also install the Modelbit Python package. You’ll use that later to wrap your model and deploy the model pipeline (a function) to a REST endpoint:
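The two install steps above might look like the following (commands assumed from the GroundingDINO repository's README; run them from `HOME`):

```shell
# Clone the official repo and install it in editable mode,
# then install the Modelbit client for deployment later.
git clone https://github.com/IDEA-Research/GroundingDINO.git
cd GroundingDINO
pip install -q -e .
pip install -q modelbit
```

In a Colab cell, prefix each command with `!` (or `%cd` for the directory change).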
You should get similar output from the cell:
If you are using Colab, most of the dependencies for running Grounding DINO are likely already available. But if you are running this walkthrough on a local session, please check the "requirements.txt" file to ensure you install all the dependencies (we will not use "supervision" in this article).
Obtain the Grounding DINO model weights:
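One way to fetch the weights, assuming the Swin-T checkpoint published on the GroundingDINO releases page:

```shell
# Create a weights directory and download the pre-trained checkpoint
# (a few hundred MB) from the official GitHub releases.
mkdir -p weights
wget -q -P weights \
  https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
```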
Next, fetch sample data to evaluate the Grounding DINO object detection model. For this purpose, you will use dog and cat images from Unsplash, available under a free license:
Next, you will load the Grounding DINO model into the session runtime using the weights you obtained earlier.
Navigate to the "GroundingDINO" directory you created earlier within the HOME directory.
Make the necessary imports from "groundingdino". You can find them and the complete code in the Colab Notebook. The "load_image" and "predict" utilities from the "groundingdino.util.inference" module load and preprocess the image and predict bounding box coordinates with accompanying annotations.
Next, define a function to load the weights of the Grounding DINO model. This function takes three arguments:
The function loads the model checkpoint into memory using PyTorch and maps the weights to the CPU. (If you run your session on a GPU instance, remove this argument and ensure "device" defaults to “cuda”.)
Notice the "SLConfig" class? This collects the configuration files required to build the model in memory.
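Pieced together, the loader might look like the following sketch (the function name and defaults are illustrative; "SLConfig", "build_model", and "clean_state_dict" come from the cloned "groundingdino" package):

```python
import torch
from groundingdino.models import build_model
from groundingdino.util.slconfig import SLConfig
from groundingdino.util.utils import clean_state_dict

def load_model(config_path: str, checkpoint_path: str, device: str = "cpu"):
    """Build Grounding DINO from its config file and load the checkpoint."""
    # Collect the configuration needed to build the model in memory.
    args = SLConfig.fromfile(config_path)
    args.device = device
    model = build_model(args)
    # Map the weights to the CPU; drop map_location on a GPU instance.
    checkpoint = torch.load(checkpoint_path, map_location="cpu")
    model.load_state_dict(clean_state_dict(checkpoint["model"]), strict=False)
    model.eval()
    return model
```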
Define the paths to configuration files and model weights:
Load the GroundingDINO model using the specified configuration and weight paths:
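Assuming the loader function described above is named "load_model", and the paths follow the repo layout and the weights download location, the two cells might look like:

```python
import os

# Paths are assumptions based on where the repo was cloned and the
# checkpoint was saved; adjust them to match your setup.
CONFIG_PATH = os.path.join(
    HOME, "GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py"
)
WEIGHTS_PATH = os.path.join(HOME, "weights/groundingdino_swint_ogc.pth")

model = load_model(CONFIG_PATH, WEIGHTS_PATH, device="cpu")
```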
You should get a similar output:
For the implementation, you'll need to provide the following arguments as inputs:
These thresholds help filter out bounding boxes and text predictions below certain confidence levels. Depending on your dataset and use case, tweaking these values might help you achieve better results. Experiment to find the most suitable thresholds.
To visualize the bounding boxes and labels on the sample image, define a helper function. It takes in the image, the Grounding DINO model, the text prompt, and the confidence thresholds for the bounding boxes and text, then detects objects within the image and visualizes them with bounding boxes.
The function overlays the detected objects with predicted textual descriptions (phrases), associated labels, and confidence scores (logits).
The function takes four arguments:
See the complete implementation:
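A sketch of such a helper, assuming matplotlib for drawing and the "predict" utility from "groundingdino.util.inference" (the function name and styling choices are illustrative):

```python
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from groundingdino.util.inference import predict

def plot_detections(image_source, image, model, text_prompt,
                    box_threshold=0.35, text_threshold=0.25):
    """Run Grounding DINO on one image and draw labeled bounding boxes."""
    boxes, logits, phrases = predict(
        model=model, image=image, caption=text_prompt,
        box_threshold=box_threshold, text_threshold=text_threshold,
        device="cpu",
    )
    h, w = image_source.shape[:2]
    fig, ax = plt.subplots(figsize=(10, 10))
    ax.imshow(image_source)
    for box, logit, phrase in zip(boxes, logits, phrases):
        # predict() returns normalized (cx, cy, w, h); convert to pixels.
        cx, cy, bw, bh = box.tolist()
        x0, y0 = (cx - bw / 2) * w, (cy - bh / 2) * h
        rect = patches.Rectangle((x0, y0), bw * w, bh * h,
                                 linewidth=2, edgecolor="red", fill=False)
        ax.add_patch(rect)
        ax.text(x0, y0, f"{phrase} ({float(logit):.2f})", color="white",
                backgroundcolor="red", fontsize=10)
    ax.axis("off")
    return fig
```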
The function returns "fig", the figure with the image and the drawn bounding boxes and labels. You will use this to log the predicted image to Modelbit later on.
Define certain variables required for processing an image, and the prompt you will use in the context for the model to detect objects and annotate.
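The variables might look like this (the image filename is hypothetical—use whichever file you downloaded; the thresholds are common defaults you can tweak):

```python
import os

HOME = os.getcwd()  # as set earlier in the notebook

# Hypothetical local filename for the downloaded sample image.
IMAGE_PATH = os.path.join(HOME, "data", "dog.jpg")
TEXT_PROMPT = "Find the dog in the snow"
BOX_THRESHOLD = 0.35
TEXT_THRESHOLD = 0.25
```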
You will load the sample image from the local path before calling the function. "load_image()" expects a local path to the image, applies some transformations, and converts it to a "numpy" array. It returns both the transformed image and the image tensor.
Run a test inference with "predict()" that uses the model to predict bounding boxes, confidence scores (logits), and associated phrases (captions):
Perfect! Now call the helper function you defined earlier:
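The test inference and visualization might look like the following sketch (the plotting helper is assumed to be named "plot_detections", matching the helper described above; variable names follow the earlier setup):

```python
from groundingdino.util.inference import load_image, predict

# load_image() returns the original image as a numpy array plus the
# transformed tensor the model expects.
image_source, image = load_image(IMAGE_PATH)

boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption=TEXT_PROMPT,
    box_threshold=BOX_THRESHOLD,
    text_threshold=TEXT_THRESHOLD,
    device="cpu",
)

fig = plot_detections(image_source, image, model, TEXT_PROMPT,
                      BOX_THRESHOLD, TEXT_THRESHOLD)
```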
This should produce the image annotated with the detected object enclosed in a bounding box, along with a confidence level of 90%.
One of the standout features of Grounding DINO is its text template capability. You saw it above with the contextual “Find the dog in the snow” prompt. It allows for highly specific descriptions to be detected.
See the model’s performance on a second test image in the Colab Notebook.
Now for the fun part—deployment! You have successfully tested the Grounding DINO model. In a real-world scenario, you would likely need to fine-tune the base model on your own datasets, but in this case the model’s ability to detect general objects is sufficient.
“Fun” is not often a word you associate with deploying computer vision models. In this article it is, thanks to Modelbit. Modelbit makes it seamless to deploy machine learning models directly from your data science notebooks.
With a few wrappers and one click, you can deploy your model from the notebook, and Modelbit helps you compile the model and the corresponding dependencies into a container build and ships the build to a REST endpoint you can call from any application.
Remember that the "load_image()" function you used earlier only accepts a local image path? If you send images from a remote server (via URL), you must create a function that downloads the image(s) and saves them to a local path.
Modelbit allows you to POST single and batch inference requests. In this article, you will create a function to download an image for a single inference.
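A minimal sketch of such a download helper, using only the standard library (the function name and defaults are illustrative):

```python
import os
import tempfile
import urllib.request

def download_image(url, download_dir=None):
    """Download an image from a URL and return its local file path,
    so it can be passed to load_image()."""
    download_dir = download_dir or tempfile.gettempdir()
    # Derive a filename from the URL, falling back to a generic name.
    filename = os.path.basename(url.split("?")[0]) or "downloaded_image.jpg"
    local_path = os.path.join(download_dir, filename)
    urllib.request.urlretrieve(url, local_path)
    return local_path
```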
Modelbit offers a free trial you can sign up for if you haven't already. It provides a fully custom Python environment backed by your git repo.
Log into the "modelbit" service and create a development ("dev") or staging (“stage”) branch for staging your deployment. Learn how to work with branches in the docs.
If you cannot create a “dev” branch, you can use the default "main" branch for your deployment:
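The login call might look like this (the branch name is illustrative; "modelbit.login" prints a link for authentication):

```python
import modelbit

# Authenticate the notebook kernel against your Modelbit workspace,
# targeting the "dev" branch for staging.
mb = modelbit.login(branch="dev")
```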
Click the resulting link to authenticate the kernel through your Modelbit account:
Next, encapsulate the image prediction pipeline into a function, establishing preset thresholds to ensure consistent performance. Designed for deployment as a REST API, this function will execute at runtime to provide predictions.
The function "dog_predict()" determines if the text prompt was identified in the image, logs the image to Modelbit using the "mb.log_image()" API, and assesses whether the average logits (confidence score) surpasses the "box_threshold".
In a case where you have a unique use case, and maybe you have fine-tuned Grounding DINO to work on your dataset, you can use "mb.log_image()" to log the predicted image to the platform so you or any SME can go in and inspect the accuracy of the detections given the prompt.
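A sketch of the prediction function, assuming the "download_image" helper, the loaded "model", and a plotting helper named "plot_detections" from the earlier steps (the return payload shape is illustrative):

```python
def dog_predict(image_url: str, text_prompt: str):
    """Download the image, run Grounding DINO, and log the result."""
    # Preset thresholds so the endpoint behaves consistently at runtime.
    BOX_THRESHOLD = 0.35
    TEXT_THRESHOLD = 0.25

    local_path = download_image(image_url)
    image_source, image = load_image(local_path)

    boxes, logits, phrases = predict(
        model=model, image=image, caption=text_prompt,
        box_threshold=BOX_THRESHOLD, text_threshold=TEXT_THRESHOLD,
        device="cpu",
    )

    # Log the annotated figure to Modelbit for visual inspection.
    fig = plot_detections(image_source, image, model, text_prompt,
                          BOX_THRESHOLD, TEXT_THRESHOLD)
    mb.log_image(fig)

    # Flag the prompt as "found" when the mean confidence clears the bar.
    found = len(phrases) > 0 and float(logits.mean()) > BOX_THRESHOLD
    return {
        "detected": found,
        "boxes": boxes.tolist(),
        "confidences": logits.tolist(),
        "phrases": phrases,
    }
```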
We're now production-ready! Pass the model prediction function ("dog_predict") and the project dependencies to the "modelbit.deploy()" API. Modelbit adeptly identifies all dependencies, encompassing other Python functions and variables that "dog_predict" relies on.
Generally, Modelbit also detects the essential Python and system packages. Once this is done, it will set up a REST API for you!
In this case, you also want to make sure the Python packages and versions you explicitly defined for the production environment correspond with the packages you used in the notebook session. If you are unsure, run "!pip freeze" to see what package versions you used during development.
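The deployment call might look like this (package names and versions are illustrative—match them to your "pip freeze" output; "groundingdino-py" is the PyPI distribution of the library):

```python
# Modelbit picks up dog_predict plus the helpers and variables it uses;
# explicit packages pin the production environment to the notebook's.
mb.deploy(
    dog_predict,
    python_packages=[
        "torch==2.1.0",
        "torchvision==0.16.0",
        "groundingdino-py==0.4.0",
    ],
    system_packages=["libgl1"],
)
```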
After running "mb.deploy(...)", you should get a similar output:
Click the “View in Modelbit” button to see your deployed endpoint in your dashboard. Click the “dog_predict” deployment (yours should only have 1 version):
In the dashboard, under the “🌳 Environment” tab, you should see the container build running under the "dev" branch (if you specified that earlier). You can inspect the logs for errors or troubleshooting.
For the production image, you will use this image of two dogs, one of which has its mouth open.
Pass the image URL and a text prompt as your input data to the API endpoint. Here's how you can do it with Python:
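A sketch of the request, assuming Modelbit's single-inference REST convention of POSTing the function arguments as a "data" list (the endpoint URL shape and the "<IMAGE URL>" placeholder are assumptions—check your dashboard for the exact endpoint):

```python
import requests

# Replace <ENTER WORKSPACE NAME> with your Modelbit workspace name and
# <IMAGE URL> with the URL of the production image.
url = "https://<ENTER WORKSPACE NAME>.app.modelbit.com/v1/dog_predict/latest"

payload = {
    "data": [
        "<IMAGE URL>",
        "a dog with his mouth open",
    ]
}

response = requests.post(url, json=payload)
print(response.json())
```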
Be sure to replace the "<ENTER WORKSPACE NAME>" placeholder with your Modelbit workspace name.
If everything works in your code, your output should look like this:
Back on your dashboard, go to “📚Logs” to see the predicted bounding box and the other response data:
Super! The model predicts the bounding box that correctly detects the object (in this case, a dog) in the sample image based on the text prompt "a dog with his mouth open".
You are now ready to integrate this API into your product or web application for production. Before integrating it into your application, learn how to secure your API within Modelbit.
Check the Modelbit dashboard to explore core functionalities to maintain your endpoint in production. See options to help you log prediction data, monitor endpoint usage, manage production environment dependencies, and so on. Explore your dashboard and the documentation for more information.
Till next time, keep shipping!