Over recent years, computer vision has experienced significant advancements, primarily driven by innovations in deep learning. Many object detection models are traditionally designed to recognize a predefined set of classes. Introducing new classes to these models often demands collecting and annotating new data, followed by retraining the model from scratch—a process that is both time- and compute-intensive.
Zero-shot object detection—identifying objects in images without prior training on those specific classes—has garnered substantial interest recently. Its advantages in computer vision are manifold: it allows models to recognize and delineate objects in images even if they weren't exposed to such objects during their training phase.
This is especially beneficial in real-world situations where the array of objects to recognize is immense and obtaining labeled data for every potential object class is impractical. With zero-shot detection, you can devise adaptive and efficient models capable of recognizing unfamiliar objects without necessitating retraining or additional labeled data acquisition.
Moreover, these zero-shot methodologies can considerably decrease the time and resources dedicated to data labeling, which is often a significant challenge in crafting effective computer vision solutions.
Researchers at Meta AI employed self-supervised learning combined with Transformers to craft a zero-shot method named DINO, short for "self-DIstillation with NO labels." Researchers outside of Meta later evolved this method into Grounding DINO, a vision-language pre-training technique that harnesses the potential of self-supervised learning and attention mechanisms, delivering outstanding results in open-set object detection.
The primary aim behind Grounding DINO is to establish a robust system capable of detecting diverse objects as described through human language inputs, eliminating the necessity for model retraining. The model can discern and detect objects when given a textual prompt.
In this article, you will learn how to deploy the Grounding DINO Model as a REST API endpoint for object detection using Modelbit. Let’s delve right in! 🚀
Here’s the flow of the solution you are going to build:
Here are the steps you’ll walk through to have a deployed model endpoint:
You will use a CPU Colab instance for this article (and CPU inference on Modelbit, not GPU). As a result, you will need to match your local environment to the production environment.
Let’s hop in! 🏊
For an interactive experience, access the Colab Notebook, which contains all the provided code and is ready to run!
The pre-installed Colab versions of `torch` and `torchvision` are the CUDA versions for GPU-based training and inferencing. This walkthrough uses the CPU installations to load the Grounding DINO model and for inference on Modelbit.
Once the installation is complete, restart your runtime to point the Colab session to the new installations.
Next, you want to organize your setup by managing the paths for all the data, files, and installations. Get the current working directory—likely `/content`, a bit of a handful to work with already—where the notebook session is running and assign that directory to the `HOME` variable:
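In code, that setup step might look like this minimal sketch:

```python
import os

# Capture the notebook's working directory (usually /content on Colab)
# so later paths can be built relative to it.
HOME = os.getcwd()
print(HOME)
```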
Install Grounding DINO by cloning the official library repository from GitHub because it is not officially distributed through `pip` yet:
Also install the Modelbit Python package. You’ll use that later to wrap your model and deploy the model pipeline (a function) to a REST endpoint:
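The two install steps above might look like the following (commands assumed from the GroundingDINO repository's README; run them from `HOME`):

```shell
# Clone the official repo and install it in editable mode,
# then install the Modelbit client for deployment later.
git clone https://github.com/IDEA-Research/GroundingDINO.git
cd GroundingDINO
pip install -q -e .
pip install -q modelbit
```

In a Colab cell, prefix each command with `!` (or `%cd` for the directory change).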
You should get similar output from the cell:
If you are using Colab, most of the dependencies for running Grounding DINO are likely already available. But if you are running this walkthrough on a local session, please check the "requirements.txt" file to ensure you install all the dependencies (we will not use "supervision" in this article).
Obtain the Grounding DINO model weights:
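One way to fetch the weights, assuming the Swin-T checkpoint published on the GroundingDINO releases page:

```shell
# Create a weights directory and download the pre-trained checkpoint
# (a few hundred MB) from the official GitHub releases.
mkdir -p weights
wget -q -P weights \
  https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
```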
Next, fetch sample data to evaluate the Grounding DINO object detection model. For this purpose, you will use dog and cat images from Unsplash, available under a free license:
Next, you will load the Grounding DINO model into the session runtime using the weights you obtained earlier.
Navigate to the "GroundingDINO" directory you created earlier within the HOME directory.
Make the necessary imports from "groundingdino". You can find them and the complete code in the Colab Notebook. The "load_image" and "predict" utilities from the "groundingdino.util.inference" module load and preprocess the image and predict bounding box coordinates with accompanying annotations.
Next, define a function to load the weights of the Grounding DINO model. This function takes three arguments:
The function loads the model checkpoint into memory using PyTorch and maps the weights to the CPU. (If you run your session on a GPU instance, remove this argument and ensure "device" defaults to “cuda”.)
Notice the "SLConfig" class? This collects the configuration files required to build the model in memory.
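Pieced together, the loader might look like the following sketch (the function name and defaults are illustrative; "SLConfig", "build_model", and "clean_state_dict" come from the cloned "groundingdino" package):

```python
import torch
from groundingdino.models import build_model
from groundingdino.util.slconfig import SLConfig
from groundingdino.util.utils import clean_state_dict

def load_model(config_path: str, checkpoint_path: str, device: str = "cpu"):
    """Build Grounding DINO from its config file and load the checkpoint."""
    # Collect the configuration needed to build the model in memory.
    args = SLConfig.fromfile(config_path)
    args.device = device
    model = build_model(args)
    # Map the weights to the CPU; drop map_location on a GPU instance.
    checkpoint = torch.load(checkpoint_path, map_location="cpu")
    model.load_state_dict(clean_state_dict(checkpoint["model"]), strict=False)
    model.eval()
    return model
```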
Define the paths to configuration files and model weights:
Load the GroundingDINO model using the specified configuration and weight paths:
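Assuming the loader function described above is named "load_model", and the paths follow the repo layout and the weights download location, the two cells might look like:

```python
import os

# Paths are assumptions based on where the repo was cloned and the
# checkpoint was saved; adjust them to match your setup.
CONFIG_PATH = os.path.join(
    HOME, "GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py"
)
WEIGHTS_PATH = os.path.join(HOME, "weights/groundingdino_swint_ogc.pth")

model = load_model(CONFIG_PATH, WEIGHTS_PATH, device="cpu")
```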
You should get a similar output:
For the implementation, you'll need to provide the following arguments as inputs:
These thresholds help filter out bounding boxes and text predictions below certain confidence levels. Depending on your dataset and use case, tweaking these values might help you achieve better results. Experiment to find the most suitable thresholds.
To visualize the bounding boxes and labels on the sample image, define a helper function. It takes in the image, the Grounding DINO model, the text prompt, and the confidence thresholds for the bounding boxes and text, then detects objects within the image and visualizes them with bounding boxes.
The function overlays the detected objects with predicted textual descriptions (phrases), associated labels, and confidence scores (logits).
The function takes four arguments:
See the complete implementation:
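A sketch of such a helper, assuming matplotlib for drawing and the "predict" utility from "groundingdino.util.inference" (the function name and styling choices are illustrative):

```python
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from groundingdino.util.inference import predict

def plot_detections(image_source, image, model, text_prompt,
                    box_threshold=0.35, text_threshold=0.25):
    """Run Grounding DINO on one image and draw labeled bounding boxes."""
    boxes, logits, phrases = predict(
        model=model, image=image, caption=text_prompt,
        box_threshold=box_threshold, text_threshold=text_threshold,
        device="cpu",
    )
    h, w = image_source.shape[:2]
    fig, ax = plt.subplots(figsize=(10, 10))
    ax.imshow(image_source)
    for box, logit, phrase in zip(boxes, logits, phrases):
        # predict() returns normalized (cx, cy, w, h); convert to pixels.
        cx, cy, bw, bh = box.tolist()
        x0, y0 = (cx - bw / 2) * w, (cy - bh / 2) * h
        rect = patches.Rectangle((x0, y0), bw * w, bh * h,
                                 linewidth=2, edgecolor="red", fill=False)
        ax.add_patch(rect)
        ax.text(x0, y0, f"{phrase} ({float(logit):.2f})", color="white",
                backgroundcolor="red", fontsize=10)
    ax.axis("off")
    return fig
```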
The function returns "fig", the figure with the image and the drawn bounding boxes and labels. You will use this to log the predicted image to Modelbit later on.
Define certain variables required for processing an image, and the prompt you will use in the context for the model to detect objects and annotate.
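The variables might look like this (the image filename is hypothetical—use whichever file you downloaded; the thresholds are common defaults you can tweak):

```python
import os

HOME = os.getcwd()  # as set earlier in the notebook

# Hypothetical local filename for the downloaded sample image.
IMAGE_PATH = os.path.join(HOME, "data", "dog.jpg")
TEXT_PROMPT = "Find the dog in the snow"
BOX_THRESHOLD = 0.35
TEXT_THRESHOLD = 0.25
```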
You will load the sample image from the local path before calling the function. "load_image()" expects a local path to the image, applies some transformations, and converts it to a "numpy" array. It returns both the transformed image and the image tensor.
Run a test inference with "predict()" that uses the model to predict bounding boxes, confidence scores (logits), and associated phrases (captions):
Perfect! Now call the helper function you defined earlier:
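The test inference and visualization might look like the following sketch (the plotting helper is assumed to be named "plot_detections", matching the helper described above; variable names follow the earlier setup):

```python
from groundingdino.util.inference import load_image, predict

# load_image() returns the original image as a numpy array plus the
# transformed tensor the model expects.
image_source, image = load_image(IMAGE_PATH)

boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption=TEXT_PROMPT,
    box_threshold=BOX_THRESHOLD,
    text_threshold=TEXT_THRESHOLD,
    device="cpu",
)

fig = plot_detections(image_source, image, model, TEXT_PROMPT,
                      BOX_THRESHOLD, TEXT_THRESHOLD)
```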
This should produce the image annotated with the detected object enclosed in a bounding box, along with a confidence level of 90%.
One of the standout features of Grounding DINO is its text template capability. You saw it above with the contextual “Find the dog in the snow” prompt. It allows for highly specific descriptions to be detected.
See the model’s performance on a second test image in the Colab Notebook.
Now for the fun part—deployment! You have successfully tested the Grounding DINO model. In a real-world scenario, you would likely need to fine-tune the base model on your own datasets, but in this case the model’s ability to detect general objects is sufficient.
“Fun” is not often a word you associate with deploying computer vision models. In this article it is, thanks to Modelbit. Modelbit makes it seamless to deploy machine learning models directly from your data science notebooks.
With a few wrappers and one click, you can deploy your model from the notebook, and Modelbit helps you compile the model and the corresponding dependencies into a container build and ships the build to a REST endpoint you can call from any application.
Remember that the "load_image()" function you used earlier only accepts a local image path? If you send images from a remote server (via URL), you must create a function that downloads the image(s) and saves them to a local path.
Modelbit allows you to POST single and batch inference requests. In this article, you will create a function to download an image for a single inference.
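A minimal sketch of such a download helper, using only the standard library (the function name and defaults are illustrative):

```python
import os
import tempfile
import urllib.request

def download_image(url, download_dir=None):
    """Download an image from a URL and return its local file path,
    so it can be passed to load_image()."""
    download_dir = download_dir or tempfile.gettempdir()
    # Derive a filename from the URL, falling back to a generic name.
    filename = os.path.basename(url.split("?")[0]) or "downloaded_image.jpg"
    local_path = os.path.join(download_dir, filename)
    urllib.request.urlretrieve(url, local_path)
    return local_path
```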
Modelbit offers a free trial you can sign up for if you haven't already. It provides a fully custom Python environment backed by your git repo.
Log into the "modelbit" service and create a development ("dev") or staging (“stage”) branch for staging your deployment. Learn how to work with branches in the docs.
If you cannot create a “dev” branch, you can use the default "main" branch for your deployment:
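The login call might look like this (the branch name is illustrative; "modelbit.login" prints a link for authentication):

```python
import modelbit

# Authenticate the notebook kernel against your Modelbit workspace,
# targeting the "dev" branch for staging.
mb = modelbit.login(branch="dev")
```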
Click the resulting link to authenticate the kernel through your Modelbit account:
Next, encapsulate the image prediction pipeline into a function, establishing preset thresholds to ensure consistent performance. Designed for deployment as a REST API, this function will execute at runtime to provide predictions.
The function "dog_predict()" determines if the text prompt was identified in the image, logs the image to Modelbit using the "mb.log_image()" API, and assesses whether the average logits (confidence score) surpasses the "box_threshold".
In a case where you have a unique use case, and maybe you have fine-tuned Grounding DINO to work on your dataset, you can use "mb.log_image()" to log the predicted image to the platform so you or any SME can go in and inspect the accuracy of the detections given the prompt.
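A sketch of the prediction function, assuming the "download_image" helper, the loaded "model", and a plotting helper named "plot_detections" from the earlier steps (the return payload shape is illustrative):

```python
def dog_predict(image_url: str, text_prompt: str):
    """Download the image, run Grounding DINO, and log the result."""
    # Preset thresholds so the endpoint behaves consistently at runtime.
    BOX_THRESHOLD = 0.35
    TEXT_THRESHOLD = 0.25

    local_path = download_image(image_url)
    image_source, image = load_image(local_path)

    boxes, logits, phrases = predict(
        model=model, image=image, caption=text_prompt,
        box_threshold=BOX_THRESHOLD, text_threshold=TEXT_THRESHOLD,
        device="cpu",
    )

    # Log the annotated figure to Modelbit for visual inspection.
    fig = plot_detections(image_source, image, model, text_prompt,
                          BOX_THRESHOLD, TEXT_THRESHOLD)
    mb.log_image(fig)

    # Flag the prompt as "found" when the mean confidence clears the bar.
    found = len(phrases) > 0 and float(logits.mean()) > BOX_THRESHOLD
    return {
        "detected": found,
        "boxes": boxes.tolist(),
        "confidences": logits.tolist(),
        "phrases": phrases,
    }
```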
We're now production-ready! Pass the model prediction function ("dog_predict") and the project dependencies to the "modelbit.deploy()" API. Modelbit adeptly identifies all dependencies, encompassing other Python functions and variables that "dog_predict" relies on.
Generally, Modelbit also detects the essential Python and system packages. Once this is done, it will set up a REST API for you!
In this case, you also want to make sure the Python packages and versions you explicitly defined for the production environment correspond with the packages you used in the notebook session. If you are unsure, run "!pip freeze" to see what package versions you used during development.
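The deployment call might look like this (package names and versions are illustrative—match them to your "pip freeze" output; "groundingdino-py" is the PyPI distribution of the library):

```python
# Modelbit picks up dog_predict plus the helpers and variables it uses;
# explicit packages pin the production environment to the notebook's.
mb.deploy(
    dog_predict,
    python_packages=[
        "torch==2.1.0",
        "torchvision==0.16.0",
        "groundingdino-py==0.4.0",
    ],
    system_packages=["libgl1"],
)
```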
After running "mb.deploy(...)", you should get a similar output:
Click the “View in Modelbit” button to see your deployed endpoint in your dashboard. Click the “dog_predict” deployment (yours should only have 1 version):
In the dashboard, under the “🌳 Environment” tab, you should see the container build running under the "dev" branch (if you specified that earlier). You can inspect the logs for errors or troubleshooting.
For the production image, you will use this image of two dogs, one of which has its mouth open.
Pass the image URL and a text prompt as your input data to the API endpoint. Here's how you can do it with Python:
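A sketch of the request, assuming Modelbit's single-inference REST convention of POSTing the function arguments as a "data" list (the endpoint URL shape and the "<IMAGE URL>" placeholder are assumptions—check your dashboard for the exact endpoint):

```python
import requests

# Replace <ENTER WORKSPACE NAME> with your Modelbit workspace name and
# <IMAGE URL> with the URL of the production image.
url = "https://<ENTER WORKSPACE NAME>.app.modelbit.com/v1/dog_predict/latest"

payload = {
    "data": [
        "<IMAGE URL>",
        "a dog with his mouth open",
    ]
}

response = requests.post(url, json=payload)
print(response.json())
```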
Be sure to replace the "<ENTER WORKSPACE NAME>" placeholder with your Modelbit workspace name.
If everything works in your code, your output should look like this:
Back on your dashboard, go to “📚Logs” to see the predicted bounding box and the other response data:
Super! The model predicts the bounding box that correctly detects the object (in this case, a dog) in the sample image based on the text prompt "a dog with his mouth open".
You are now ready to integrate this API into your product or web application for production. Before integrating it into your application, learn how to secure your API within Modelbit.
Check the Modelbit dashboard to explore core functionalities to maintain your endpoint in production. See options to help you log prediction data, monitor endpoint usage, manage production environment dependencies, and so on. Explore your dashboard and the documentation for more information.
Till next time, keep shipping!