LLaVa Model Guide

Getting Started with Modelbit

Modelbit is an MLOps platform that lets you train and deploy any ML model, from any Python environment, with a few lines of code.

Table of Contents

Getting Started
Overview
Use Cases
Strengths
Limitations
Learning Type


Getting Started



Deploying LLaVa to a REST API Endpoint for Multimodal Generation

Installation and Setup

Let's start by installing the necessary libraries. We install 🤗 Transformers, Modelbit, Accelerate, and Bitsandbytes so that the model can be loaded in Google Colab. Bitsandbytes enables 4-bit inference through quantization, which shrinks the model's memory footprint considerably while largely preserving the quality of the full-precision model.

!pip install --upgrade git+https://github.com/huggingface/transformers.git modelbit

!pip install accelerate==0.25.0 bitsandbytes==0.42.0 cloudpickle==3.0.0
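To make the size reduction concrete, the sketch below estimates the memory needed for the weights alone at different precisions. It assumes a nominal 7 billion parameters (read off the "7b" in the model name used later in this guide) and ignores quantization metadata, activations, and the KV cache, so the numbers are rough lower bounds rather than measurements. The function name is illustrative.

```python
def weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    total_bytes = n_params * bits_per_param / 8
    return total_bytes / 2**30


# Assumption: a nominal parameter count taken from the "7b" in the model name.
N_PARAMS = 7e9

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gib(N_PARAMS, bits):.1f} GiB")
```

Under these assumptions, going from 16-bit to 4-bit weights cuts the weight memory by a factor of four, which is what lets the model fit on a single Colab GPU.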

Load model and processor

Next, we load the model and its corresponding processor from the Hugging Face Hub. We pass device_map="auto" so that the model's weights are placed automatically on the available GPUs and CPU.

from transformers import AutoProcessor, LlavaForConditionalGeneration
from transformers import BitsAndBytesConfig
import torch
import cloudpickle

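# 4-bit quantization settings. The arguments are cut off in the source:
# load_in_4bit follows from the text above; the float16 compute dtype is an assumption.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)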

model_id = "llava-hf/llava-1.5-7b-hf"

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id, quantization_config=quantization_config, device_map="auto")

Prepare image and text for the model

import requests
from PIL import Image

image1 = Image.open(requests.get("http://doc.modelbit.com/img/cat.jpg", stream=True).raw)
image2 = Image.open(requests.get("https://llava-vl.github.io/static/images/view.jpg", stream=True).raw)

Pipeline API

The Pipeline API in Hugging Face Transformers is a convenient tool that simplifies the process of performing inference with pre-trained models. It abstracts away the underlying complexities of model loading, input preprocessing, and output postprocessing, allowing users to focus on their specific task.

Note: the Pipeline API doesn't use a GPU by default; you need to pass the device argument for that. See the collection of Hugging Face-compatible checkpoints.

from transformers import pipeline

pipe = pipeline("image-to-text", model=model_id, model_kwargs={"quantization_config": quantization_config})

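max_new_tokens = 200
# The <image> placeholder marks where the image features are inserted into the prompt
prompt = "USER: <image>\nWhat are the things I should be cautious about when I visit this place?\nASSISTANT:"

outputs = pipe(image2, prompt=prompt, generate_kwargs={"max_new_tokens": max_new_tokens})
print(outputs[0]["generated_text"])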
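The LLaVa 1.5 checkpoints on the Hugging Face Hub expect the prompt in a "USER: <image>\n...\nASSISTANT:" layout, where `<image>` is the placeholder for the image features. A minimal sketch of two helpers for building that prompt and pulling the answer back out is shown below. Both function names are hypothetical, and `extract_answer` assumes the generated text echoes the prompt before the model's reply.

```python
IMAGE_TOKEN = "<image>"  # placeholder that is replaced by the image features


def build_llava_prompt(question: str) -> str:
    """Wrap a user question in the LLaVa 1.5 chat layout."""
    return f"USER: {IMAGE_TOKEN}\n{question.strip()}\nASSISTANT:"


def extract_answer(generated_text: str) -> str:
    """Return the text after the last 'ASSISTANT:' marker.

    Assumption: the generated text repeats the prompt before the reply.
    If no marker is present, the text is returned unchanged (stripped).
    """
    _, marker, answer = generated_text.rpartition("ASSISTANT:")
    return answer.strip() if marker else generated_text.strip()


print(build_llava_prompt("What is shown in this picture?"))
```

Keeping the prompt layout in one place avoids subtle formatting mistakes, such as a missing placeholder, when the same template is reused in the deployed function.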


Inference Function for Generating Responses

The "get_llava_model" function, decorated with @cache, is our key player. It loads the pipeline that we serialize to llava_pipe.pkl with cloudpickle.

The use of @cache is a simple but effective optimization: once the pipeline has been loaded, it stays in memory. Later calls return the same object instead of unpickling it from scratch, which makes the function well suited to deployments.
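The effect of @cache can be seen in isolation with a small stand-in. The sketch below replaces the expensive load with a counter; the function and variable names are illustrative, and functools.cache requires Python 3.9 or later.

```python
from functools import cache

load_count = 0  # counts how many times the "expensive" body actually runs


@cache
def get_expensive_resource():
    """Stand-in for loading a large pipeline from disk."""
    global load_count
    load_count += 1
    return {"name": "stand-in for the pickled pipeline"}


first = get_expensive_resource()
second = get_expensive_resource()

print(load_count)       # the body ran only once
print(first is second)  # both calls return the very same object
```

Because the cached object is shared between calls, this pattern fits read-only resources such as a loaded model; it is less suitable for objects that callers mutate.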

with open('llava_pipe.pkl', 'wb') as file:
    cloudpickle.dump(pipe, file)

from functools import cache
import pickle
from transformers.dynamic_module_utils import init_hf_modules

@cache  # load the pickled pipeline once, then reuse it on later calls
def get_llava_model():
    with open('llava_pipe.pkl', 'rb') as file:
        content = pickle.load(file)
    return content

def llava_image_to_text_inference(image_url, prompt, max_new_tokens=200):
    pipe = get_llava_model()
    image = Image.open(requests.get(image_url, stream=True).raw)
    # LLaVa 1.5 prompt layout; <image> marks where the image features are inserted
    prompt = f"USER: <image>\n{prompt}\nASSISTANT:"

    outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": max_new_tokens})
    generated_text = outputs[0]["generated_text"]

    return generated_text

llava_image_to_text_inference(image_url="https://llava-vl.github.io/static/images/view.jpg", prompt="Describe what is happening here")

Deploy LLaVA to a REST API Endpoint

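Log into Modelbit

import modelbit as mb

mb.login()  # assumed login call; it is missing from the source

# Deploy the LLaVA inference function to Modelbit.
# The opening of this call is truncated in the source: the function argument and the
# python_packages keyword are inferred from context. Any package entries that preceded
# the three below, and any arguments that followed the list, are not recoverable.
mb.deploy(llava_image_to_text_inference,
          python_packages=["accelerate==0.25.0", "bitsandbytes==0.42.0", "cloudpickle==3.0.0"])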

Test the REST Endpoint with a Single Image

You can test your REST endpoint by sending it a single image or a batch of images for inference.

Use the requests package to POST a request to the API, and use json to format the response so that it prints nicely:

⚠️ Replace the "ENTER_WORKSPACE_NAME" placeholder with your workspace name.

import json
import requests

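# The start of this call is missing from the source; the URL is taken from the curl command below.
response = requests.post("https://ENTER_WORKSPACE_NAME.us-east-1.modelbit.com/v1/run_wrapper/latest",
              data=json.dumps({"data": ["https://doc.modelbit.com/img/cat.jpg", "Describe the image"]})).json()

print(json.dumps(response, indent=2))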

You can also test your endpoint from the command line using:

curl -s -XPOST "https://ENTER_WORKSPACE_NAME.us-east-1.modelbit.com/v1/run_wrapper/latest" -d '{"data": ["https://doc.modelbit.com/img/cat.jpg", "Describe the image"]}' | json_pp

⚠️ Replace the "ENTER_WORKSPACE_NAME" placeholder with your workspace name.
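The request pieces used above can be assembled with a few small helpers, sketched below. The helper names and default arguments are hypothetical. The URL pattern and the single-request body mirror the curl command above. The batch body assumes Modelbit's convention of a list of rows that each start with a row ID, which should be checked against the Modelbit REST API documentation before relying on it.

```python
import json


def endpoint_url(workspace: str, deployment: str = "run_wrapper",
                 version: str = "latest", region: str = "us-east-1") -> str:
    """Build the endpoint URL in the same shape as the curl example."""
    return f"https://{workspace}.{region}.modelbit.com/v1/{deployment}/{version}"


def single_payload(image_url: str, prompt: str) -> str:
    """JSON body for one inference: the function arguments in order."""
    return json.dumps({"data": [image_url, prompt]})


def batch_payload(items) -> str:
    """JSON body for several inferences.

    Assumption: each row is [row_id, image_url, prompt], with IDs starting at 1.
    """
    rows = [[row_id, image_url, prompt]
            for row_id, (image_url, prompt) in enumerate(items, start=1)]
    return json.dumps({"data": rows})


print(endpoint_url("ENTER_WORKSPACE_NAME"))
print(single_payload("https://doc.modelbit.com/img/cat.jpg", "Describe the image"))
```

The resulting strings can be passed as the URL and the data argument of requests.post, in the same way as the single-image example.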

Model Overview

LLaVa stands for Large Language and Vision Assistant, a model presented at NeurIPS 2023. It marks a significant step in combining vision encoders with language models: LLaVa connects a vision encoder to a large language model, Vicuna, to provide general-purpose visual and language understanding.

This approach allows LLaVa to perform exceptionally well in chat capabilities and scientific question answering, mimicking aspects of multimodal GPT-4.
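One practical consequence of feeding a vision encoder's output into a language model is that every image occupies a fixed number of positions in the model's context. The sketch below works out that number under stated assumptions: a ViT-style encoder with 336-pixel square inputs and 14-pixel patches (the configuration commonly associated with LLaVa 1.5), with one token per patch. These values are not given in this guide, and the function name is illustrative.

```python
def num_image_tokens(image_size: int, patch_size: int) -> int:
    """Number of patch tokens for a square image split into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


# Assumed encoder settings: 336 px input resolution, 14 px patches.
print(num_image_tokens(336, 14))  # 24 * 24 = 576
```

Under these assumptions a single image takes up 576 context positions, which is worth keeping in mind when choosing max_new_tokens or sending long prompts.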

Release and Development:

The LLaVa model was developed through a collaboration between researchers from the University of Wisconsin-Madison, Microsoft Research, and Columbia University. It was presented at the NeurIPS 2023 conference, where it demonstrated state-of-the-art performance on several benchmarks. LLaVa 1.5, an enhancement over the original, shows how quickly the model has evolved while relying only on publicly available data.

Use Cases

LLaVa is particularly effective in multimodal chat applications and scientific question answering. It demonstrates advanced capabilities in understanding and responding to visual and textual inputs, making it ideal for applications requiring comprehensive multimodal understanding. Its use cases extend to educational tools, customer service bots, and interactive systems requiring detailed visual and language comprehension.


Strengths

The strengths of LLaVa include its exceptional multimodal understanding, ability to mimic aspects of GPT-4 in chat scenarios, and its state-of-the-art performance in science QA tasks. The model's innovative instruction tuning approach enables it to adapt effectively to new tasks and datasets, showcasing its flexibility and robustness.


Limitations

Despite its advancements, LLaVa faces challenges common to multimodal models, such as the need for large and diverse training datasets and the complexity of integrating visual and textual data. Additionally, its performance heavily depends on the quality and scope of the instruction-following data used for training.

Learning Type & Algorithmic Approach

LLaVa employs a two-stage instruction tuning approach for learning, starting with pre-training for feature alignment and then moving to fine-tuning for specific applications. This method reflects a combination of supervised and unsupervised learning, focusing on enhancing the model's zero-shot capabilities in multimodal settings.
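The two stages can be summarized by which parts of the model are updated in each. The sketch below encodes the split described in the LLaVA paper, where the first stage trains only the projection between the vision encoder and the language model, and the second stage also updates the language model while the vision encoder stays frozen. The component and class names are illustrative labels, not identifiers from any library.

```python
from dataclasses import dataclass

# Illustrative component labels for the three parts of the model.
COMPONENTS = frozenset({"vision_encoder", "projector", "language_model"})


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset


STAGES = (
    Stage("feature-alignment pre-training", frozenset({"projector"})),
    Stage("instruction fine-tuning", frozenset({"projector", "language_model"})),
)


def frozen_components(stage: Stage) -> frozenset:
    """Components whose weights are not updated in the given stage."""
    return COMPONENTS - stage.trainable


for stage in STAGES:
    print(f"{stage.name}: train {sorted(stage.trainable)}, "
          f"freeze {sorted(frozen_components(stage))}")
```

Training only the small projection first is what aligns image features with the language model's input space before the more expensive fine-tuning stage begins.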
