CvT Model Guide

Getting Started with Modelbit

Modelbit is an MLOps platform that lets you train and deploy any ML model, from any Python environment, with a few lines of code.

Deploying the Convolutional Vision Transformer (CvT) to a REST API Endpoint for Image Classification

Installations and Set Up

Let's start by installing 🤗 Transformers and Modelbit.

!pip install --upgrade git+ modelbit

Load image

We'll perform inference on the familiar cat and dog image.

from PIL import Image
import requests

url = ''
image = Image.open(requests.get(url, stream=True).raw)

Load model and processor

Next, we load the CvT-13 checkpoint (microsoft/cvt-13) from the Hugging Face Hub. Browse the model hub for versions fine-tuned on a task that interests you.

from transformers import AutoImageProcessor, CvtForImageClassification

processor = AutoImageProcessor.from_pretrained("microsoft/cvt-13")
model = CvtForImageClassification.from_pretrained("microsoft/cvt-13")

CvT is an image classifier, so it needs no text prompts. The processor prepares the image for the model by resizing and normalizing it into a tensor of pixel values.

Forward pass

Next, we perform a forward pass. Because this is inference, we wrap the call in the torch.no_grad() context manager to save memory (no gradients need to be computed).

import torch

inputs = processor(image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

predicted_label = logits.argmax(-1).item()
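
The argmax above yields only a class index. As a minimal sketch of turning logits into readable predictions, the helper below applies a softmax and returns the top-k labels. The function name top_k_labels and the toy logits and labels are illustrative assumptions, not part of the original walkthrough; it is written in plain Python so the logic is easy to follow.

```python
import math

def top_k_labels(logits, id2label, k=3):
    """Return the k most likely (label, probability) pairs for one image."""
    # Numerically stable softmax: subtract the max logit before exponentiating
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort class indices by probability, highest first
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]

# Toy logits and labels, for illustration only (not real model output)
toy_logits = [0.5, 2.0, -1.0]
toy_labels = {0: "dog", 1: "cat", 2: "car"}
print(top_k_labels(toy_logits, toy_labels, k=2))
```

With the real model, something like top_k_labels(logits[0].tolist(), model.config.id2label) should work, since logits has one row per image and id2label maps class indices to names.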

Inference Function for Image Classification

The get_cvt_base function, decorated with @cache, is our key player. This function uses snapshot_download to fetch the specific backbone.

The use of @cache is a clever optimization; it ensures that once the model and processor are loaded, they are stored in memory. This significantly speeds up future calls to this function, as it avoids reloading the model and processor from scratch each time, making it ideal for deployments.

from functools import cache
from huggingface_hub import snapshot_download

@cache
def get_cvt_base():
    # Download the checkpoint once, then load the model and processor from the local snapshot
    model_path = snapshot_download(repo_id="microsoft/cvt-13")
    processor = AutoImageProcessor.from_pretrained(model_path)
    model = CvtForImageClassification.from_pretrained(model_path)
    return model, processor

def cvt_inference(image_url):
    model, processor = get_cvt_base()
    image = Image.open(requests.get(image_url, stream=True).raw)
    print("Image url loaded")
    inputs = processor(image, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    predicted_label = logits.argmax(-1).item()
    # Map the class index to its human-readable label
    return model.config.id2label[predicted_label]

cvt_inference(image_url="")
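
To see the caching behavior described above in isolation, here is a small sketch that uses a stand-in loader instead of the real model. The names get_stub_model and load_count are hypothetical and exist only for this illustration.

```python
from functools import cache

load_count = 0  # counts how many times the expensive load actually runs

@cache
def get_stub_model():
    """Stand-in for get_cvt_base: pretend this is slow to build."""
    global load_count
    load_count += 1
    return {"name": "stub-model"}, {"name": "stub-processor"}

first = get_stub_model()
second = get_stub_model()
print(load_count)       # the loader body ran only once
print(first is second)  # both calls return the same cached tuple
```

Note that the cache lives in the memory of a single Python process, so each fresh process still pays the load cost once.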

Deploy CVT to a REST API Endpoint

Log in to Modelbit

import modelbit as mb

# Deploy the CvT inference function to Modelbit
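
The login and deploy calls themselves are not shown in this section. One possible sketch, assuming the usual Modelbit pattern of modelbit.login() followed by mb.deploy(...), is given below. The wrapper name deploy_to_modelbit and its optional mb parameter are hypothetical conveniences, so check the Modelbit documentation for the options your workspace needs.

```python
def deploy_to_modelbit(inference_fn, mb=None):
    """Log in (if needed) and deploy inference_fn as a Modelbit deployment."""
    if mb is None:
        # Imported lazily so this sketch can be read without the package installed
        import modelbit
        mb = modelbit.login()  # starts Modelbit's authentication flow
    # Hand the inference function to Modelbit for deployment
    return mb.deploy(inference_fn)
```

In a notebook this would be called as deploy_to_modelbit(cvt_inference). Passing an already authenticated mb object skips the login step.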

Test the REST Endpoint with a Single Image

You can test your REST Endpoint by sending single or batch production images to it for inference.

Use the requests package to POST to the API. The json module serializes the request body, and .json() parses the response:

⚠️ Replace the ENTER_WORKSPACE_NAME placeholder with your workspace name.

import json
import requests

requests.post("",
              data=json.dumps({"data": [""]})).json()

You can also test your endpoint from the command line using:

curl -s -XPOST "" -d '{"data": [""]}' | json_pp

⚠️ Replace the ENTER_WORKSPACE_NAME placeholder with your workspace name.
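
The text mentions both single and batch requests, but only the single form appears above. The sketch below builds both request bodies. The single form mirrors the {"data": [...]} shape used in this guide. The batch form, with one [id, argument] row per image, is an assumption about Modelbit's batch convention and should be verified against the Modelbit REST documentation. The helper names and example URLs are hypothetical.

```python
import json

def single_payload(image_url):
    """Body for one inference call, matching the {"data": [...]} shape used above."""
    return json.dumps({"data": [image_url]})

def batch_payload(image_urls):
    """Body for a batch call: one [id, argument] row per image (assumed format)."""
    rows = [[i, url] for i, url in enumerate(image_urls, start=1)]
    return json.dumps({"data": rows})

# Placeholder URLs, for illustration only
print(single_payload("https://example.com/cat.jpg"))
print(batch_payload(["https://example.com/cat.jpg", "https://example.com/dog.jpg"]))
```

Either string can be passed as the data argument of requests.post, with your own endpoint URL as the first argument.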

Model Overview

CvT incorporates a unique architecture that merges the local processing capabilities of convolutions with the dynamic attention mechanisms of transformers. This hybrid approach allows CvT to efficiently process images while retaining contextual information over varying scales.

The model's architecture is designed to capitalize on the strengths of both CNNs and ViTs, providing a robust framework for handling image classification tasks with improved accuracy and lower computational costs.

Developed by Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, and Lei Zhang, CvT was introduced to address the limitations of pure transformer models on vision tasks. By integrating convolutions, the authors report state-of-the-art ImageNet-1k accuracy relative to other vision transformers and ResNets of the time, with fewer parameters and lower FLOPs.

CvT's architecture features a novel convolutional token embedding mechanism and a convolutional transformer block. These components work in tandem to enhance the model's ability to capture local spatial contexts while maintaining the global receptive field provided by transformers. The architecture supports hierarchical representation learning, enabling efficient processing of images at different resolutions.
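
The hierarchical structure can be made concrete by computing how the token grid shrinks from stage to stage. The sketch below applies the standard convolution output-size formula to three embedding stages. The kernel, stride, padding, and width values are assumed to match the published CvT-13 configuration, and the function names are illustrative.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

# (kernel, stride, padding, embedding width) per stage; assumed CvT-13 values
STAGES = [(7, 4, 2, 64), (3, 2, 1, 192), (3, 2, 1, 384)]

def token_grids(image_size=224):
    """Return (grid side, token count, embedding width) for each stage."""
    grids = []
    size = image_size
    for kernel, stride, padding, dim in STAGES:
        size = conv_out(size, kernel, stride, padding)
        grids.append((size, size * size, dim))
    return grids

for side, tokens, dim in token_grids():
    print(f"{side}x{side} grid -> {tokens} tokens of width {dim}")
```

For a 224x224 input this gives grids of 56x56, 28x28, and 14x14, so each stage attends over fewer but wider tokens, which is how the model combines fine local detail with cheaper global attention.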

Use Cases

CvT was introduced and evaluated primarily for image classification. Its hierarchical, multi-scale feature maps also make it a candidate backbone for object detection and semantic segmentation, in applications ranging from autonomous driving to medical image analysis.


Strengths

The primary strength of CvT is that it combines the local processing of CNNs with the global context awareness of transformers. On image recognition benchmarks, the authors report higher accuracy than comparable ResNets and ViTs with fewer parameters and lower FLOPs, and they find that positional encodings can be removed without hurting performance.


Limitations

CvT's best results depend on large-scale training data. In addition, integrating convolutions into the transformer architecture adds complexity, which may increase the resources needed for training and fine-tuning.

Learning Type & Algorithmic Approach

CvT employs supervised learning, utilizing a blend of convolutional operations and self-attention mechanisms. This hybrid approach allows for effective feature extraction and representation learning, making CvT a powerful tool for tackling complex vision tasks.
