Guide to Deploying ML Models to Production in 2024

By Michael Butler, ML Community Lead

You have an ML model that you want to deploy to production. Excellent! Before you forge ahead, you’ll first need to answer a question, and then make an important decision.

The Question: Who will interact with your deployed ML model?

The answer to this question will inform the majority of the decisions you make about how to deploy, serve, and maintain your model running in production. Here are a few examples of what we mean:

Your ML model will be used internally

Let’s say you have built a model trained on internal CRM data. It does an excellent job of predicting whether any given inbound marketing lead will convert to a paying customer.

If your sales and marketing teams will be the only ones consuming the predictions from the model, then you probably don’t need to build a lot of custom infrastructure on top of Docker and Kubernetes. You can, of course, build custom infrastructure if you choose to, but it would probably be overkill.

Your ML model will be used by your customers

If your model is a neural network built with PyTorch that your customers will interact with through your product’s UI, and it has specific latency requirements, then you’ll have a lot more to consider when deciding how to serve it in production.

Chiefly, you’ll want to ensure that the customer experience powered in part by the model you’re using is excellent.

Answering this first question about who will be interacting with your model leads directly to the important decision you’ll need to make.

The Decision: Build or Buy - Choosing an ML Platform for Model Deployment

Wait, I thought we were just talking about deploying ML models to production? We are! However, if you’re going to deploy and serve models in production the right way, you need to think about it in terms of how you’ll build an ML platform around your model(s)!

The term “ML Platform” might elicit thoughts of clunky, hard-to-implement software from legacy vendors. Although picking something like that is an option available to you, what we really mean by an ML Platform is the set of processes and tools you use to deploy and serve ML models in production.

An ML platform doesn’t need to be an actual end-to-end platform built by a vendor, but there are some minimum components you’ll need in place in order to successfully deploy and serve ML models in production:

  • Development Environment: Where you will build, train, and test your models.
  • Model Deployment: How you will package your model and deploy it.
  • Data Management: How your models will fetch the features and necessary data.
  • Cloud Compute: The hardware that will power your models.
  • Observability: How you will set up monitoring and logging for your models in production.
  • Version Control: How you will integrate your model’s code with your Git workflows for things like code reviews and rollbacks to previous versions of your model.

As we’ll cover throughout this guide, you can choose to build these components and piece them together completely from scratch, choose best-in-class solutions for each component, or even go with an end-to-end platform from one of the major cloud providers.

So for this guide, we’ll break out the options at your disposal for model deployment based on the three main paths you can take:

  1. Build a home-grown ML Platform from scratch
  2. Build an ML Platform with best-in-class tools (Modern ML Stack)
  3. Buy an end-to-end ML Platform

Building a Home-Grown ML Platform for Deploying ML Models

There are a number of open source frameworks available to engineering teams to help guide them during the buildout of a custom ML platform that can serve a specific model. What is important to understand, however, is that by choosing to build a homegrown ML platform, you’re also making a commitment to staff a full product and engineering team dedicated to ensuring that:

  • Existing ML models in production can meet SLAs like uptime and latency.
  • Models can be version controlled and integrated with the company’s CI/CD processes.
  • Alerting and monitoring are set up to address issues like model drift or errors.
  • Newer (more complex) model technologies can be deployed fast enough to meet customer demand.

If you plan to go this route, or at least consider it, then we recommend building a plan ahead of time.

If you map out everything you will need to deploy your model and successfully serve it in production well in advance, you’ll be in a much better position than if you try to deal with problems in the middle of the process.

Building a Plan for Your Homegrown ML Platform to Serve ML Models in Production

To build your plan, you’ll want to have answers to these 9 questions before you start building. We know that seems like a lot of questions, but they really are all important!

1. How will you package the model?

Good model deployments are repeatable model deployments. To start, you’ll need to collect all of the dependencies of your model so that they can be installed the same way, every time. If you don’t, you’ll end up in a situation where the model works on your machine, but doesn’t work on the model hosting server.

Most teams do this with Docker, but it’s not always easy. You’ll need to convert your notebook into a Python script, collect all the package dependencies into a requirements.txt, and build a Dockerfile that copies and installs these (and any other dependencies) into the version of Python and Linux that’s best suited for your ML model.

You’ll also need some way for the model to be called from the Docker container, which often means creating a Flask app to handle inputs and outputs from the model.

2. Where will the model be hosted?

Now that you have a Docker image with your model and all its dependencies, it’s time to host it! Typically, data science teams can’t deploy using the same infrastructure as the product engineering teams, and that’s a good thing: product engineering teams tend to move a lot more slowly than data science teams, and being tied to their release process would make releasing and iterating on new versions of models much harder and slower.

Depending on the characteristics of your ML model, you may need a large server with lots of RAM and maybe GPUs, or perhaps you can use something small and serverless. Setting up your own hosting infrastructure also means figuring out inbound and outbound network connectivity, DNS settings, and the various settings and permissions needed for building and pulling the Docker image.

3. Who will maintain the hosting environments?

The hosting environment will need to be monitored and occasionally upgraded. Product engineering teams do not typically use Python or have time to manage extra infrastructure, so you’ll need your own experts who can manage and maintain your model’s Python environments.

Yes, environments, plural! As the famous expression goes, “two is one; one is none.” Servers crash, and so you need at least two servers so that your model doesn’t have an outage if one of the servers goes down. Of course, if one of your servers goes down, you’ll need to be alerted so you can fix it, or bring up a new one.

More than one person on your team needs to share responsibility for managing the environment. Don’t let the “one is none” rule lead to an extended outage because the one person who knows how to recover the model hosting server is on vacation with Slack very intentionally set to silent. We all deserve a break - except for your ML model that you’ve deployed into production.

4. What happens when the model has a problem?

Just like servers, models can crash. Sometimes the process running the model will run out of RAM and get killed by the operating system. Or perhaps a bug in one of the model’s native libraries crashes the Python process with a segfault. In any event, you’ll need monitoring for these events, and recovery logic to keep the model serving through them.
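
For illustration, here’s a minimal sketch of a liveness check, assuming an XGBoost model served behind Flask (as in the practical example later in this guide). The /healthz route name and the zeroed sample input are our own illustrative choices:


from flask import Flask, jsonify
import numpy as np
import xgboost as xgb

app = Flask(__name__)

# Load the model once at startup; if this fails, the process exits and a
# supervisor (Docker restart policy, systemd, etc.) can bring up a fresh one.
model = xgb.XGBClassifier()
model.load_model("model.json")

@app.route('/healthz', methods=['GET'])
def healthz():
    # Liveness probe: run a tiny prediction so the check fails if the model
    # (not just the web server) is broken.
    try:
        model.predict(np.zeros((1, 20)))
        return jsonify({'status': 'ok'}), 200
    except Exception as exc:
        return jsonify({'status': 'error', 'detail': str(exc)}), 500

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

Running the container with a restart policy (for example, docker run --restart unless-stopped), or under an orchestrator, then gives you basic automatic recovery when the process dies.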

5. Where will the model’s logs get sent and stored?

At minimum you’ll want logs of what inputs were used to call the model, how long the model took to respond, and what the result was. As your use cases get more advanced, you’ll also want logging around which version of the model was used, and other related data points.

These logs need to get sent from your hosting environment to someplace where you can search and analyze them. Frequently logs are used for alerting, so your monitoring infrastructure is likely going to need to be paired with your logging infrastructure.
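
As a starting point, here’s a minimal sketch of the kind of structured record you might emit for each request, assuming the Flask serving app from the practical example below. The MODEL_VERSION tag and the log_prediction helper are our own illustrative names; in a real deployment these records would be shipped to a central log store rather than just printed to stdout:


import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("model-server")

MODEL_VERSION = "model.json@2024-05-01"  # illustrative version tag

def log_prediction(features, prediction, started_at):
    # One structured record per request: inputs, latency, result, and which
    # model version produced it. JSON lines are easy to ship to whatever
    # log store your alerting reads from.
    record = {
        "request_id": str(uuid.uuid4()),
        "model_version": MODEL_VERSION,
        "features": features,
        "prediction": prediction,
        "latency_ms": round((time.time() - started_at) * 1000, 2),
    }
    logger.info(json.dumps(record))

Calling log_prediction() at the end of your predict() route, with started_at captured via time.time() at the top of the handler, gives you one searchable record per request.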

6. How will new versions of the model get deployed?

You’ll want to deploy new versions of your model after retraining or tweaking some hyperparameters that led to a quality boost. To minimize the time it takes, and the risk of errors from manual steps, you’ll need to build automation that packages and deploys your model with as little human input as possible. A good, but complicated, template to follow here is the CI/CD processes followed by engineering teams.

7. How will model versions get rolled back to recover from a deployment problem?

Inevitably, if you deploy enough versions of your ML model, one of them will be bad and you’ll need to roll it back. Make sure you have a plan for rolling back bad versions of models, and test it on a regular basis so you’ll know it works in an emergency.

8. What happens when the parameters sent to the model change? What about the response?

As you deploy more versions of your model you’ll make changes to improve its performance. Perhaps the newest version of the model expects more parameters, or returns multiple quality scores instead of one.

To make releasing changes like these easier, you’ll want the ability to host multiple versions of the same model at the same time, at different URLs. This way you can keep the old one running, and switch to the new one once your product is ready to send (and receive) new data.
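
For illustration, here’s a minimal sketch of serving two versions side by side with versioned routes. The model_v1.json and model_v2.json filenames are our own illustrative choices:


from flask import Flask, request, jsonify
import numpy as np
import xgboost as xgb

app = Flask(__name__)

# Load both versions at startup so the old and new models serve side by side.
models = {}
for version, path in [("v1", "model_v1.json"), ("v2", "model_v2.json")]:
    clf = xgb.XGBClassifier()
    clf.load_model(path)
    models[version] = clf

@app.route('/<version>/predict', methods=['POST'])
def predict(version):
    if version not in models:
        return jsonify({'error': f'unknown model version: {version}'}), 404
    data = request.get_json(force=True)
    prediction = models[version].predict(np.array(data['features']))
    return jsonify({'version': version, 'prediction': prediction.tolist()})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

Your product keeps calling /v1/predict until it’s ready to send (and receive) the new payload at /v2/predict, and then you retire the old version.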

9. Who controls access to the model?

Last but never least, security. You’ll need API keys to control who is allowed to call the model, and logs recording who made changes to the model. Finally, you’ll need user permissions limiting who is allowed to make changes to the model. These permissions should be integrated into a user management system, so that it’s easy to onboard and offboard members of your team without risking the security of your deployed model.
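
For illustration, here’s a minimal sketch of an API key check in Flask. The MODEL_API_KEYS environment variable and the X-API-Key header are our own illustrative choices; in practice the keys and permissions would come from your secrets manager and user management system:


import hmac
import os

from flask import Flask, abort, request

app = Flask(__name__)

# Illustration only: load the allowed keys from an environment variable.
VALID_API_KEYS = {key for key in os.environ.get("MODEL_API_KEYS", "").split(",") if key}

@app.before_request
def require_api_key():
    supplied = request.headers.get("X-API-Key", "")
    # compare_digest avoids leaking information through comparison timing.
    if not any(hmac.compare_digest(supplied, key) for key in VALID_API_KEYS):
        abort(401, description="Missing or invalid API key")

Every route registered on the app, including /predict, then rejects requests that don’t carry a valid key.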

Practical Example: Deploying an XGBoost Model to Production with Docker and Flask

Let’s take a look at the steps involved in the process of deploying an XGBoost model to production. The goal is to have your machine learning model running inside a Docker container, accessible through a REST API built with Flask. This setup allows you to easily integrate your model with web applications, mobile apps, or any system that can communicate with HTTP endpoints.

Prerequisites

Before we start, ensure you have the following installed on your machine:

  • Docker
  • Python (version 3.8 or newer, to match the Docker image used below)
  • Postman or "curl" (for testing the API)

Step 1: Train Your XGBoost Model

First, let's train a simple XGBoost model. We'll use a synthetic dataset for this example. Install XGBoost in your Python environment if you haven't already:


pip install xgboost

Now, let's train the model:


import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate a synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the XGBoost model
# Note: recent XGBoost versions no longer need use_label_encoder=False
model = xgb.XGBClassifier(eval_metric='logloss')
model.fit(X_train, y_train)

# Save the model to a file
model.save_model('model.json')
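
Optionally, you can add a quick sanity check on the held-out split to the training script before shipping the model (accuracy here is just an illustrative metric; use whatever fits your problem):


from sklearn.metrics import accuracy_score

# Quick sanity check on the held-out test split from the training script above.
preds = model.predict(X_test)
print(f"Test accuracy: {accuracy_score(y_test, preds):.3f}")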

Step 2: Create a Flask App

Next, we'll create a simple Flask app to serve our model. First, install Flask:


pip install flask

Create a file named “app.py” and add the following code to create a Flask application with a single endpoint that makes predictions with our model:


from flask import Flask, request, jsonify
import xgboost as xgb
import numpy as np

app = Flask(__name__)

# Load the XGBoost model
model = xgb.XGBClassifier()
model.load_model("model.json")

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json(force=True)
    prediction = model.predict(np.array(data['features']))
    return jsonify({'prediction': prediction.tolist()})

if __name__ == '__main__':
    # Flask's built-in server is fine for local testing; for real production
    # traffic, run the app under a WSGI server such as gunicorn instead.
    app.run(host='0.0.0.0', port=5000)

Step 3: Dockerize the Flask App

To deploy our Flask application with Docker, we need to create a "Dockerfile". This file describes the environment needed to run our app. Create a "Dockerfile" in the same directory as your Flask app with the following content:


# Use an official Python runtime as a parent image
FROM python:3.8-slim

# Set the working directory in the container
WORKDIR /app

# Copy the current directory contents into the container at /app
COPY . /app

# Install the system library XGBoost needs (libgomp, which slim images
# don't include) and the Python packages specified in requirements.txt
RUN apt-get update && apt-get install -y --no-install-recommends libgomp1 \
    && rm -rf /var/lib/apt/lists/* \
    && pip install --no-cache-dir -r requirements.txt

# Make port 5000 available to the world outside this container
EXPOSE 5000

# Run app.py when the container launches
CMD ["python", "app.py"]

Create a "requirements.txt" file that lists the Flask and XGBoost packages:


flask
xgboost
numpy

Build the Docker image with the following command:


docker build -t xgboost-api .

Run the Docker container:


docker run -p 5000:5000 xgboost-api

Step 4: Test Your API

With your Docker container running, use Postman or "curl" to test the API:


curl -X POST -H "Content-Type: application/json" \
    -d '{"features": [[1, 2, 3, ..., 20]]}' \
    http://localhost:5000/predict

Replace “[1, 2, 3, ..., 20]” with an actual feature array of length 20.
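
If you’d rather test from Python, the same request can be sent with the requests library (a sketch using a random 20-feature row; install requests with pip if you don’t already have it):


import numpy as np
import requests

# Build one row of 20 random features, matching the shape the model expects.
features = np.random.rand(1, 20).tolist()

response = requests.post(
    "http://localhost:5000/predict",
    json={"features": features},
    timeout=10,
)
print(response.json())  # e.g. {'prediction': [0]}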

What’s next?

Congratulations! You've successfully deployed an XGBoost model in a Docker container with a Flask REST API. This setup is a robust starting point for integrating machine learning models into production environments. Remember, there are still a few other VERY important components of serving models in production that you’ll need to figure out, like:

  • Data Management: How your models will fetch the features and necessary data.
  • Cloud Compute: The hardware that will power your models.
  • Observability: How you will set up monitoring and logging for your models in production.
  • Version Control: How you will integrate your model’s code with your Git workflows for things like code reviews and rollbacks to previous versions of your model.

Which ties in nicely to the second main path you can take when building an ML platform for deploying models. Many teams choose to build a few components of their platform in-house and layer in other components built by solutions providers. Whether you take this approach, or want to piece together a platform that “just works right away”, this next section is for you.

Build an ML Platform with Best-in-Class Tools for ML Model Deployment

Building an ML platform with best-in-class tools means you get to cherry-pick the most efficient and powerful technologies at each step of the machine learning lifecycle, from data ingestion and model development to deployment and ongoing monitoring. It's not just about having the best tools in isolation; it's about choosing tools that work harmoniously to support your machine learning projects. 

The benefits of this strategy are clear: you get a tailor-made solution that leverages the strengths of each component. However, it’s a strategy that demands a deep understanding of the ML ecosystem required for your specific models, the ability to integrate disparate systems seamlessly, and may require managing relationships with a few vendors. 

Opting for this route offers flexibility and can lead to superior performance, provided that the integrations are well-managed.

Let’s take a look at a basic example of building an ML platform for model deployment with four best-in-class tools that cover the entirety of the ML lifecycle:

Development Environment: Hex

If you’ve never heard of Hex, they build a collaborative data workspace designed to streamline the way teams work with data. It integrates seamlessly with popular data warehouses like Snowflake, Redshift, and BigQuery, and supports analysis in collaborative SQL and Python-powered notebooks. Hex allows users to create interactive data apps and share them easily, making data analysis more accessible across teams. It focuses on improving collaboration through features like real-time multiplayer editing, commenting, version control, and the creation of interactive, shareable data apps.

Machine learning teams love the fact that they never have to worry about setting up local environments when collaborating on ML model development projects. As a bonus, they have very robust integrations with dozens of other tools.

Best of all, Hex offers the ability to run computationally intensive training jobs using on-demand compute from Modelbit.

Model Deployment, Cloud Compute, Version Control: Modelbit

Modelbit is a platform designed to streamline the deployment of machine learning models into production environments. It offers a straightforward and efficient process for deploying any custom ML model, with just a few lines of code, directly into a production setting behind a REST API. 

Modelbit supports deployments across various environments, including Jupyter Notebooks, Colab, and Hex. It emphasizes security and scalability, with features like git-backed version control, CI/CD, and robust monitoring capabilities. The infrastructure behind Modelbit is engineered to handle large-scale models, providing on-demand GPUs, and allowing for flexible deployment options whether on their cloud or a customer's own infrastructure.

Modelbit's approach to ML deployment is designed to cater to the needs of teams looking for speed and efficiency, from fraud detection to advanced computer vision applications. By focusing on reducing the overhead and complexity traditionally associated with deploying ML models, Modelbit positions itself as a tool for teams that prioritize rapid iteration and deployment of ML models to enhance their products and services.
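
As an illustration of what that “few lines of code” workflow looks like from a notebook, here’s a rough sketch based on Modelbit’s public examples. Treat the exact calls as an assumption and check Modelbit’s current documentation; the predict_lead_score function name is our own:


import modelbit
import numpy as np
import xgboost as xgb

# Authenticate the notebook session with Modelbit.
mb = modelbit.login()

# Load the trained model from earlier in this guide.
model = xgb.XGBClassifier()
model.load_model("model.json")

def predict_lead_score(features: list) -> list:
    # Inference function that gets wrapped in a REST endpoint.
    return model.predict(np.array([features])).tolist()

# Package the function and its dependencies and deploy it behind a REST API.
mb.deploy(predict_lead_score)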

Data Management: Tecton

Tecton is a feature platform for machine learning that aims to simplify the end-to-end management of ML features, from design through deployment to monitoring. It addresses the complexities of data engineering for real-time ML, such as training/serving skew and real-time transformations, by offering tools to manage the entire ML feature lifecycle. This includes feature management, feature logic, feature repositories, and feature stores, designed to ensure mission-critical reliability while keeping costs under control.

Observability: Arize

Arize is an AI observability and LLM evaluation platform that specializes in monitoring, troubleshooting, and evaluating machine learning models across various applications like NLP, computer vision, and recommender systems. It offers tools for real-time model issue detection, root cause analysis, and performance improvement, supporting a wide range of model frameworks and environments. Arize's platform is designed to integrate seamlessly with a company's ML ecosystem, enhancing the reliability and efficiency of AI applications in production.

Buy an End-to-End ML Platform for Model Deployment

The allure of established end-to-end machine learning platforms like Sagemaker and Databricks is undeniable for enterprises seeking to stand up ML capabilities. But in 2024, the decision to invest in what could be considered legacy platforms warrants a closer examination, particularly for those who haven't yet taken the plunge.

Starting with Databricks, it's important to acknowledge its foundation: Apache Spark. Once hailed for its ability to handle large-scale data processing, Spark is increasingly being scrutinized for its relevance in today's rapidly evolving data landscape. The primary concern lies in its batch-processing DNA, which, despite updates and enhancements, still lags behind newer technologies designed with real-time processing at their core. This isn't to say Spark lacks utility but rather to highlight that the nature of data processing demands has shifted, with a growing preference for systems that offer more agility and efficiency in handling live data streams.

Moreover, the complexity of managing and optimizing Spark jobs can introduce a steep learning curve and operational overhead, which may not align with the leaner, more dynamic approaches many organizations are now aspiring to. As the ecosystem around machine learning continues to expand, the flexibility to integrate with a wide array of tools and technologies becomes crucial. In this light, platforms tethered too closely to Spark might seem restrictive, pushing potential users to consider alternatives that offer broader compatibility and ease of innovation.

Turning to Sagemaker, Amazon's foray into the ML platform space has certainly captured significant market share. Yet, feedback from the user community often points to challenges around usability and the outdated user interface. Sagemaker's UI, described by some as less intuitive, can be a hurdle for teams seeking to democratize machine learning within their organizations. The learning curve is not just a matter of mastering the platform but also navigating the AWS ecosystem's complexity, requiring familiarity with additional AWS services for a seamless experience. This integration with AWS's vast but intricate architecture means that to leverage Sagemaker effectively, one must first navigate the broader AWS landscape—a daunting task for newcomers and even for experienced users looking for streamlined solutions.

While legacy platforms like Sagemaker and Databricks have played pivotal roles in the advancement of machine learning capabilities within the enterprise, their appeal in 2024 is increasingly nuanced. The dynamic nature of the tech landscape, coupled with evolving needs for more agile, user-friendly, and versatile tools, suggests that the decision to adopt these platforms should be weighed carefully. 

For those yet to commit, it might be worth exploring newer, more flexible solutions that promise to keep pace with the rapid evolution of machine learning technologies and methodologies.

Getting Started with Your Deployment: Tutorials with model and deployment code from Modelbit

If you want to deploy an ML model fast, you can deploy any ML model to a full production environment with a REST API using Modelbit. Here are a few tutorials and resources we have put together:

Deploy Custom ML Models to Production with Modelbit
