OWLv2 Model Guide

Getting Started with Modelbit

Modelbit is an MLOps platform that lets you train and deploy any ML model, from any Python environment, with a few lines of code.

Table of Contents

Getting Started
Overview
Use Cases
Strengths
Limitations
Learning Type

Getting Started

Model Documentation


Deploying OWLv2 to a REST API Endpoint for Object Detection

Installations and Set Up

Let's start by installing 🤗 Transformers from source, together with the modelbit package.

!pip install --upgrade git+https://github.com/huggingface/transformers.git modelbit

Load image

We'll perform inference on the familiar cat and dog image.

from PIL import Image
import requests

url = 'http://doc.modelbit.com/img/cat.jpg'
image = Image.open(requests.get(url, stream=True).raw)

Load model and processor

Next, we load an OWLv2 checkpoint from the hub. The authors released several checkpoints, which differ in patch size and training scheme (self-trained only, self-trained + fine-tuned, and an ensemble). We'll load the base-size ensemble checkpoint here, since the ensemble variant performs best. The authors also released larger checkpoints, which perform even better.

from transformers import Owlv2Processor, Owlv2ForObjectDetection

processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")

Next we define the text queries we want the model to detect. Note that texts is a nested list: one list of queries per image. We then use the processor to prepare the image and the texts for the model:

texts = [['a cat', 'a dog']]
inputs = processor(text=texts, images=image, return_tensors="pt")
for key, val in inputs.items():
    print(f"{key}: {val.shape}")

Forward pass

Next we perform a forward pass. Since this is inference, we wrap the call in the torch.no_grad() context manager to save memory (we don't need to compute any gradients).

import torch

with torch.no_grad():
  outputs = model(**inputs)


Finally, let's inspect the results. Importantly, the authors visualize the bounding boxes on the preprocessed (padded + resized) image rather than on the original one. So here we take the pixel_values created by the processor and "unnormalize" them. This gives us the preprocessed image, minus normalization.

The preprocessed image is the original image padded to a square.

import numpy as np
from transformers.utils.constants import OPENAI_CLIP_MEAN, OPENAI_CLIP_STD

def get_preprocessed_image(pixel_values):
    pixel_values = pixel_values.squeeze().numpy()
    unnormalized_image = (pixel_values * np.array(OPENAI_CLIP_STD)[:, None, None]) + np.array(OPENAI_CLIP_MEAN)[:, None, None]
    unnormalized_image = (unnormalized_image * 255).astype(np.uint8)
    unnormalized_image = np.moveaxis(unnormalized_image, 0, -1)
    unnormalized_image = Image.fromarray(unnormalized_image)
    return unnormalized_image

unnormalized_image = get_preprocessed_image(inputs.pixel_values)
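
The arithmetic inside get_preprocessed_image can be checked on a single pixel without any libraries. The sketch below is illustrative only: the constants are the CLIP mean and standard deviation values, and the helper names normalize, unnormalize, and to_uint8 are hypothetical, not part of Transformers.

```python
# CLIP normalization constants (the values behind OPENAI_CLIP_MEAN / OPENAI_CLIP_STD).
CLIP_MEAN = [0.48145466, 0.4578275, 0.40821073]
CLIP_STD = [0.26862954, 0.26130258, 0.27577711]


def normalize(rgb):
    """Map an RGB pixel with channels in [0, 1] to normalized values."""
    return [(value - mean) / std for value, mean, std in zip(rgb, CLIP_MEAN, CLIP_STD)]


def unnormalize(pixel):
    """Invert normalize(): multiply by the std, then add the mean back."""
    return [value * std + mean for value, mean, std in zip(pixel, CLIP_MEAN, CLIP_STD)]


def to_uint8(rgb):
    """Scale to 0..255 and truncate, as astype(np.uint8) does for in-range values."""
    return [int(value * 255) for value in rgb]


original = [0.25, 0.5, 0.75]
recovered = unnormalize(normalize(original))
print(to_uint8(recovered))
```

Because the last step truncates rather than rounds, a value that lands just below a whole number after the round trip drops by one intensity level, which is harmless for visualization.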

# Convert outputs (bounding boxes and class logits) to COCO API
target_sizes = torch.Tensor([unnormalized_image.size[::-1]])
results = processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes, threshold=0.2)
i = 0  # Retrieve predictions for the first image for the corresponding text queries
text = texts[i]
boxes, scores, labels = results[i]["boxes"], results[i]["scores"], results[i]["labels"]
for box, score, label in zip(boxes, scores, labels):
    box = [round(i, 2) for i in box.tolist()]
    print(f"Detected {text[label]} with confidence {round(score.item(), 3)} at location {box}")
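
The boxes printed above are expressed in the coordinates of the padded, resized square. To draw them on the original image they have to be scaled back. The following is a minimal sketch, assuming the padding is added only to the bottom and right edges so that the original image sits in the top-left corner of the square; boxes_to_original is a hypothetical helper, and the 960-pixel side and 640x480 size are example values only.

```python
def boxes_to_original(boxes, square_side, original_size):
    """Rescale (x0, y0, x1, y1) boxes from the padded square to the original image.

    Assumes the original image occupies the top-left corner of the square,
    i.e. padding was added only on the bottom and right.
    """
    width, height = original_size
    longest = max(width, height)  # side of the square before it was resized
    mapped = []
    for x0, y0, x1, y1 in boxes:
        sx0, sy0, sx1, sy1 = (coord * longest / square_side for coord in (x0, y0, x1, y1))
        # Clip to the original bounds, since a box may spill into the padding.
        mapped.append([
            min(max(sx0, 0), width),
            min(max(sy0, 0), height),
            min(max(sx1, 0), width),
            min(max(sy1, 0), height),
        ])
    return mapped


# Example: one box on a 960x960 preprocessed image, original image 640 wide and 480 high.
print(boxes_to_original([[480.0, 240.0, 960.0, 900.0]], square_side=960, original_size=(640, 480)))
```

The bottom edge of the example box falls in the padded area, so it is clipped to the original image height.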

Inference Function for Object detection

The get_owlv2_base function, decorated with @cache, does the loading. It uses snapshot_download to fetch the checkpoint files from the Hugging Face Hub, then builds the processor and model from the downloaded snapshot.

@cache ensures that once the model and processor are loaded, they stay in memory. Later calls return the same objects instead of reloading them from scratch, which makes the function well suited to deployments.

from functools import cache
from huggingface_hub import snapshot_download

@cache
def get_owlv2_base():
    model_path = snapshot_download(repo_id="google/owlv2-base-patch16-ensemble")
    processor = Owlv2Processor.from_pretrained(model_path)
    model = Owlv2ForObjectDetection.from_pretrained(model_path)
    return model, processor
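
The effect of @cache can be seen in isolation with a stand-in loader. This is only an illustration: get_model and load_count are hypothetical names, and the dictionary stands in for the real model and processor.

```python
from functools import cache

load_count = 0  # counts how many times the loader body actually runs


@cache
def get_model():
    """Stand-in for an expensive loader such as get_owlv2_base()."""
    global load_count
    load_count += 1
    return {"name": "stand-in model"}


first = get_model()
second = get_model()

print(load_count)       # 1: the body ran only once
print(first is second)  # True: the cached object is returned as-is
```

Note that functools.cache requires Python 3.9 or later; on older versions, functools.lru_cache(maxsize=None) behaves the same way.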

def convert_to_coco_api(outputs, unnormalized_image, texts, processor):
    """
    Convert model outputs to a format compatible with the COCO API.

    Args:
    - outputs: The raw outputs from the object detection model.
    - unnormalized_image: The preprocessed (padded and resized) PIL Image, with normalization undone.
    - texts: A nested list of text queries, one list per image.
    - processor: The processor used to post-process the raw outputs.

    Returns:
    A list of dictionaries, each representing a detected object with its label, confidence score, and bounding box.
    """
    detection_results = []
    # PIL reports (width, height); the processor expects (height, width).
    target_sizes = torch.Tensor([unnormalized_image.size[::-1]])
    results = processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes, threshold=0.2)
    i = 0  # Assuming we're only processing the first image for simplicity.
    text = texts[i]
    boxes, scores, labels = results[i]["boxes"], results[i]["scores"], results[i]["labels"]

    for box, score, label in zip(boxes, scores, labels):
        box = [round(coord, 2) for coord in box.tolist()]
        detection_results.append({
            "label": text[label],
            "confidence": round(score.item(), 3),
            "box": box
        })

    return detection_results

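def owlv2_inference(image_url, texts):
    # Assumption: the definition of this function is missing from the page.
    # This reconstruction simply chains the helpers defined above.
    model, processor = get_owlv2_base()
    image = Image.open(requests.get(image_url, stream=True).raw)
    inputs = processor(text=texts, images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    unnormalized_image = get_preprocessed_image(inputs.pixel_values)
    return convert_to_coco_api(outputs, unnormalized_image, texts, processor)

owlv2_inference(image_url = "http://images.cocodataset.org/val2017/000000039769.jpg", texts = [['a cat', 'remote control']])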

Deploy OWLv2 to a REST API Endpoint

Log in to Modelbit

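import modelbit as mb

# Assumed login step (missing from the page): authenticates this session with Modelbit
mb.login()

# Deploy the OWLv2 inference function to Modelbit
mb.deploy(owlv2_inference,
          python_packages=["git+https://github.com/huggingface/transformers.git", "scipy==1.12.0"],
          # any further deploy options go here
          )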

Test the REST Endpoint with a Single Image

You can test your REST Endpoint by sending single or batch production images to it for inference.

Use the requests package to POST a request to the API and use json to format the response to print nicely:

⚠️ Replace the ENTER_WORKSPACE_NAME placeholder with your workspace name.

import json
import requests

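response = requests.post("https://ENTER_WORKSPACE_NAME.us-east-1.modelbit.com/v1/owlv2_inference/latest",
              data=json.dumps({"data": ["https://doc.modelbit.com/img/cat.jpg", [['a cat', 'a dog']]]})).json()
print(json.dumps(response, indent=2))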
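
The request body passes the function's positional arguments as a list under the "data" key. The sketch below separates building that body from reading the reply, so both can be checked without calling the endpoint. It assumes the endpoint returns the function's result under a "data" key; build_payload and extract_detections are hypothetical helpers, and the sample response is made up for illustration.

```python
import json


def build_payload(image_url, texts):
    """Serialize the deployed function's positional arguments as {"data": [...]}."""
    return json.dumps({"data": [image_url, texts]})


def extract_detections(response_json, min_confidence=0.0):
    """Return detections at or above min_confidence.

    Assumes the function's return value is wrapped under a "data" key.
    """
    detections = response_json.get("data", [])
    return [d for d in detections if d["confidence"] >= min_confidence]


payload = build_payload("https://doc.modelbit.com/img/cat.jpg", [["a cat", "a dog"]])
print(payload)

# Hypothetical response, shaped like the list returned by convert_to_coco_api.
sample_response = {"data": [
    {"label": "a cat", "confidence": 0.61, "box": [10.0, 20.0, 200.0, 300.0]},
    {"label": "a dog", "confidence": 0.25, "box": [250.0, 40.0, 400.0, 310.0]},
]}
print(extract_detections(sample_response, min_confidence=0.5))
```

Filtering on the client side like this is a convenient way to apply a stricter threshold than the 0.2 used inside the deployed function, without redeploying.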

You can also test your endpoint from the command line using:

curl -s -XPOST "https://ENTER_WORKSPACE_NAME.us-east-1.modelbit.com/v1/owlv2_inference/latest" \
  -d '{"data": ["https://doc.modelbit.com/img/cat.jpg", [["a cat", "a dog"]]]}' | json_pp

⚠️ Replace the ENTER_WORKSPACE_NAME placeholder with your workspace name.

Model Overview

OWLv2 is a model for zero-shot object detection: it detects objects described by free-text queries without needing manually annotated bounding boxes for those objects, which makes detection more efficient and less tedious. The model builds on its predecessor, OWL-ViT v1, and uses a transformer-based architecture. What distinguishes OWLv2 is its use of self-training on a web-scale dataset, with pseudo labels generated by an existing detector, which improves performance significantly.

By building upon the OWL-ViT framework and employing self-training techniques, OWLv2 has set new benchmarks in the performance of zero-shot object detection, demonstrating the model's scalability and efficiency in handling web-scale datasets.

Use Cases

OWLv2's versatility makes it applicable across a wide range of industries, from retail and e-commerce to safety, security, telecommunications, and transportation. Its ability to accurately detect objects without prior labeled data makes it a powerful tool for developing innovative solutions in various sectors.


Strengths

One of the primary strengths of OWLv2 is its strong performance in zero-shot object detection, which significantly reduces the need for labor-intensive manual annotation. Moreover, the model's self-training capability allows it to scale to web-sized datasets, further enhancing its utility and application potential.


Limitations

While OWLv2 represents a significant advancement, its reliance on large-scale datasets and the complexity of its transformer-based architecture may pose challenges in terms of computational resources and the expertise required for customization and optimization.

Learning Type & Algorithmic Approach

OWLv2 employs a zero-shot learning approach, utilizing self-training techniques that leverage existing detectors to generate pseudo-box annotations on image-text pairs. This method enables the model to improve its detection capabilities through exposure to vast amounts of unannotated data, thereby broadening its applicability and performance in real-world scenarios.
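
The pseudo-annotation idea can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' pipeline: the detector is a hypothetical stand-in callable, queries are produced by naively splitting the caption into words, and the 0.3 score threshold is an arbitrary example value.

```python
def build_pseudo_labeled_set(image_text_pairs, detector, min_score=0.3):
    """Annotate image-text pairs with pseudo-boxes from an existing detector."""
    dataset = []
    for image_id, caption in image_text_pairs:
        # Naive stand-in for query extraction: every caption word becomes a query.
        queries = caption.lower().split()
        detections = detector(image_id, queries)
        # Keep only boxes confident enough to serve as training annotations.
        kept = [d for d in detections if d["score"] >= min_score]
        if kept:
            dataset.append({"image_id": image_id, "annotations": kept})
    return dataset


def toy_detector(image_id, queries):
    """Hypothetical stand-in for an existing open-vocabulary detector."""
    scores = {"cat": 0.9, "sofa": 0.2}
    return [{"label": q, "score": scores.get(q, 0.0), "box": [0, 0, 1, 1]} for q in queries]


pairs = [("img-1", "cat sofa"), ("img-2", "sofa")]
pseudo = build_pseudo_labeled_set(pairs, toy_detector)
print(pseudo)
```

A new detector trained on the resulting annotations never sees a human-drawn box, which is what lets the approach scale to very large collections of image-text pairs.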
