Deploying the LLaVA Multi-Modal Model to Production

Michael Butler, ML Community Lead

In this post, we’ll briefly introduce the LLaVA model, and walk you through a tutorial on how to quickly deploy it to a production environment behind a REST API using Google Colab and Modelbit.

Intro to LLaVA - a multimodal vision model

Meet LLaVA, a computer vision model that’s quietly transforming the way machines interact with the world. LLaVA (Large Language-and-Vision Assistant) is an end-to-end trained large multimodal model that connects a vision encoder and an LLM for general-purpose visual and language understanding. Imagine a tool that doesn’t just see images but understands them, reads the text embedded in them, and reasons about their context, all while conversing with you in a way that feels almost natural. LLaVA isn’t just another incremental step in AI; it’s a leap towards a future where machines not only process information but truly grasp it.

LLaVA stands out because it blurs the lines between visual and textual understanding. Most models are specialists: they either excel at processing images or at parsing text. LLaVA does both. It’s like having a polyglot who speaks fluent visual and textual languages, making it adept at tasks like explaining a complex chart, recognizing and reading text in photos, or identifying intricate details in high-resolution images.

But LLaVA isn’t a monolith; it comes in various forms tailored to different needs. There’s the base LLaVA, perfect for general applications where you need robust, versatile AI assistance. Then there’s LLaVA-Med, a specialized version that delves into the complexities of medical imaging, providing insights into CT scans and MRIs that would make a radiologist proud. Imagine a tool that can spot subtle anomalies in a lung scan and explain them in detail, all within a matter of seconds.

A view of LLaVA's logs in Modelbit, showing an input image of two lions

For those with heavy computational demands, LLaVA scales up with versions packing up to 34 billion parameters, offering greater depth in understanding and analysis. And for developers, the MoE-LLaVA variant leverages a Mixture of Experts approach, efficiently distributing the workload to handle complex multimodal data without breaking a sweat.

LLaVA isn’t just a model; it’s a peek into the future of how we’ll interact with machines. It’s not about replacing human expertise but amplifying it, whether in medicine, education, or any field where understanding visuals is as crucial as reading text. In a world overflowing with data, LLaVA offers a way to cut through the noise and truly comprehend what’s in front of us.

Tutorial - Deploying LLaVA to a REST API Endpoint

Now, let’s walk through the steps to deploy the LLaVA multimodal vision-and-language model to a REST endpoint. Here are the steps we’ll take:

  1. We’ll use Google Colab to set up our LLaVA model, install necessary packages, and create an inference function that we can use to call our model.
  2. We’ll use Modelbit to deploy our LLaVA model to a production environment with a REST API. If you don’t have a Modelbit account, you can create a free trial here.

Setting up our notebook

We recommend using a Google Colab notebook with high RAM and a T4 GPU for this example. Because finding A100s on Colab can be hit-or-miss, we'll start with a version of the model that fits on a T4. Scroll down for a guide to deploying a larger version.

First, install the accelerate, bitsandbytes, and modelbit packages:

!pip install accelerate bitsandbytes modelbit

Go ahead and log in to Modelbit:

import modelbit
mb = modelbit.login()

Import the rest of your dependencies:

from transformers import AutoProcessor, LlavaForConditionalGeneration
from huggingface_hub import snapshot_download
from PIL import Image
import requests
import bitsandbytes
import accelerate
from functools import cache

Finally, download the LLaVA weights from Hugging Face:

snapshot_download(repo_id="llava-hf/llava-1.5-7b-hf", local_dir="/content/llava-hf")

Building the model and performing an inference

First we'll write a function that loads the model:

@cache
def load_model():
    model = LlavaForConditionalGeneration.from_pretrained("./llava-hf", local_files_only=True, load_in_8bit=True)
    processor = AutoProcessor.from_pretrained("./llava-hf", local_files_only=True)
    return model, processor

Note load_in_8bit=True, which quantizes the model so that it fits in the VRAM of a T4 GPU.

The @cache decorator (imported from functools above) causes this function to load the model only once; after that, the model stays in memory. The same behavior is preserved in production in Modelbit.
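To illustrate that caching behavior on its own, here's a standalone sketch, unrelated to the model code; expensive_load is a stand-in for load_model:

```python
from functools import cache

call_count = 0

@cache
def expensive_load():
    # Stands in for load_model(): the body only runs on the first call.
    global call_count
    call_count += 1
    return "model", "processor"

expensive_load()
expensive_load()
print(call_count)  # prints 1: the second call returned the cached result
```

Every call after the first returns the cached return value without re-executing the body, which is exactly what we want for a model that takes minutes to load.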

Next we'll write our function that prompts the model and returns the result:

def prompt_llava(url: str, prompt: str):
    model, processor = load_model()
    image = Image.open(requests.get(url, stream=True).raw)
    mb.log_image(image)  # Log the input image in Modelbit
    full_prompt = f"USER: <image>\n{prompt} ASSISTANT:"
    inputs = processor(text=full_prompt, images=image, return_tensors="pt").to("cuda")
    generate_ids = model.generate(**inputs, max_new_tokens=15)
    response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0].split("ASSISTANT:")[1]
    return response

This function downloads the image from the URL, prompts LLaVA with the image and the text prompt, and returns just the model's response.
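Since batch_decode returns the entire decoded sequence, prompt included, the function keeps only the text after the "ASSISTANT:" marker. Here's a standalone sketch of that parsing step, using a made-up decoded string:

```python
# A made-up example of what the decoded model output might look like:
# the prompt is echoed back, followed by the model's continuation.
decoded = "USER: \nWhat animals are in this photo? ASSISTANT: Two lions resting on a rock."

# Keep only the text after the ASSISTANT: marker, as prompt_llava does.
response = decoded.split("ASSISTANT:")[1]
print(response.strip())  # Two lions resting on a rock.
```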

Deploying LLaVA to REST

From here, deployment to a REST API is just one line of code:

mb.deploy(prompt_llava,
          python_packages=["bitsandbytes==0.43.1", "accelerate==0.30.1"],
          extra_files=["llava-hf"],
          require_gpu=True)

We want to make sure to bring along the weight files from the "llava-hf" directory using the extra_files parameter.

We also specify the bitsandbytes and accelerate dependencies explicitly, because transformers needs them in this case but does not declare them as requirements.

Finally, of course, this model requires a GPU.

Optional: Deploying a larger version of LLaVA

If you can get an A100 from Colab, you can build a larger (non-quantized) version of the model and deploy that to Modelbit!

To do so, simply remove load_in_8bit=True from your from_pretrained call. Since quantized models automatically load onto the GPU but default models do not, you'll also need to add .to("cuda") to your model construction. Here's the new load_model definition:

@cache
def load_model():
    model = LlavaForConditionalGeneration.from_pretrained("./llava-hf", local_files_only=True).to("cuda")
    processor = AutoProcessor.from_pretrained("./llava-hf", local_files_only=True)
    return model, processor

The inference function is unchanged. Finally, when deploying, make sure you specify a large enough GPU:

mb.deploy(prompt_llava, extra_files=["llava-hf"], require_gpu="A100")
No need for accelerate or bitsandbytes, since we're no longer quantizing.

You can now call your LLaVA model from its production REST endpoint!
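As a sketch, calling the endpoint from Python might look like the following. This assumes Modelbit's single-inference request shape, where positional arguments are passed as a list under a "data" key and the result comes back under "data"; the workspace URL below is a placeholder, so check your deployment's page in Modelbit for the real endpoint URL.

```python
import json
import urllib.request

def build_payload(image_url: str, prompt: str) -> dict:
    # Modelbit REST endpoints take the function's positional
    # arguments as a list under a "data" key.
    return {"data": [image_url, prompt]}

def call_llava_endpoint(endpoint_url: str, image_url: str, prompt: str) -> str:
    # POST the JSON payload; the response carries the result under "data".
    req = urllib.request.Request(
        endpoint_url,
        data=json.dumps(build_payload(image_url, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["data"]

# Example with a placeholder workspace URL:
# answer = call_llava_endpoint(
#     "https://YOUR_WORKSPACE.app.modelbit.com/v1/prompt_llava/latest",
#     "https://example.com/lions.jpg",
#     "What animals are in this photo?")
```

You can send the same JSON body with curl instead; either way, the endpoint runs prompt_llava on a GPU in Modelbit and returns its string response.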

Next Steps

Now that you’ve successfully deployed a LLaVA model to production, you can start to build an application around it, or integrate it into an existing app.

If you have questions or want to learn more about LLaVA or Modelbit, feel free to reach out to us!
