Announcing Modelbit Infrastructure v1.0

Startups are hypotheses: Every startup is a bet that the world can be better in one highly specific, but massively impactful way. At Modelbit, our hypothesis is that machine learning practitioners will change the world with their models. They just need it to be a little easier to deploy those models to production.

As befits a hypothesis, our product began as a prototype. As with all good prototypes, we made heavy use of on-demand and off-the-shelf infrastructure as we tested and iterated on our hypothesis. Much of our backend was serverless, making extensive use of AWS Lambda and similar tools to host our own management infrastructure as well as containerized customer models.

As we’ve onboarded more and more customers, our hypothesis has become more stable and better proven, and the requirements for our infrastructure have become clearer. At the same time, our customers’ needs have been pushing the limits of what we can do with off-the-shelf infrastructure. It was time to build the infrastructure our market demands.

We had a good debate internally about whether this was Infrastructure 2.0, 3.0 or 4.0. After all, we’ve been iterating constantly! But the more we thought about it, the more we realized: As the first fully bespoke Modelbit backend, it is truly Modelbit’s Infrastructure 1.0.

Goals

Our new infrastructure needed to solve the following problems:

Scale: Models deployed on Modelbit routinely scale into the tens of gigabytes of memory used, both in system memory and GPU memory. The most common LLM deployed on Modelbit is Llama2 7B, which uses 14GB VRAM. There are certainly larger ones.
On-Premises Delivery: While Modelbit’s web application is served from our infrastructure on AWS, customer deployments can run in either our infrastructure or our customers’ infrastructure.
Cold Start Time: Serverless infrastructure required us to store non-running models in S3. The time to download and start a large model from S3 is too long. We wanted to turn cold starts into “lukewarm starts” from local disk.
A many-to-many mapping of GPUs to customers and deployments: Customers may share GPUs across their deployments. They may also safely share GPUs with other customers if they are both smaller on-demand customers. Conversely, customers may need multiple dedicated GPUs for a single deployment if that deployment is under heavy load.
Heterogeneous hardware configurations: While many of our customers deploy demanding, large-scale models, we also have many XGBoost regressors, Scikit-Learn Random Forests, and other little models. We may fit these small models into very large boxes that are not fully loaded. But we may also have more modest hardware that’s dedicated to them, depending on our load at the time.
Hot upgrades to running models: Models are isolated and containerized, and live in their containers until our customers choose to redeploy them. Even our own changes do not generally impact running models. However, there are times where we need to upgrade the infrastructure serving running models – for example, when deploying a critical security patch or adding new infrastructure features. We need the ability to upgrade our infrastructure without incurring cold starts or downtime.
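
As a rough check on the 14GB figure in the Scale goal above, the weights of a 7-billion-parameter model stored at 16-bit precision come to about 14GB. This is a back-of-the-envelope sketch: the 16-bit precision is an assumption (the post does not state it), and the figure covers weights only, not activations or caches.

```python
def weights_memory_gb(num_parameters: int, bytes_per_parameter: int) -> float:
    """Memory needed to hold the weights alone, in decimal gigabytes."""
    return num_parameters * bytes_per_parameter / 1e9


# Assumption: 7 billion parameters stored as 16-bit (2-byte) values.
llama2_7b_gb = weights_memory_gb(7_000_000_000, 2)
print(llama2_7b_gb)
```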

Overview

When a request for an inference comes in, the request is routed by our load balancers to the Ingress Service. The Ingress Service serves as part router and part puppetmaster, deciding which servers will serve which requests, breaking up batches and stitching together results as necessary, and handling Modelbit’s management needs like logging and billing.

Behind the Ingress Service is a heterogeneous collection of EC2 boxes in different configurations, some with GPUs and some without, each running one or more Runtime Proxies. The Runtime Proxies load and run the Docker containers that serve customer models.

Runtime Proxies report their configuration and status to a Redis cluster. The Ingress Services read from the cluster to make their management decisions. (And also to serve up a status page and management console that the Modelbit team refreshes throughout the day. 😉)
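
To make the reporting flow concrete, here is a minimal sketch of proxies publishing their status and an Ingress Service reading the cluster view. Everything in it is illustrative: the `StatusStore` class is an in-memory stand-in for the Redis cluster, and the `proxy:<name>` key layout and the status fields are assumptions, not Modelbit's actual schema.

```python
import json
from typing import Dict, List


class StatusStore:
    """In-memory stand-in for the Redis cluster (assumption: plain key/value use)."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def set(self, key: str, value: str) -> None:
        self._data[key] = value

    def values_with_prefix(self, prefix: str) -> List[str]:
        # Sorted by key so the result order is deterministic.
        return [v for k, v in sorted(self._data.items()) if k.startswith(prefix)]


def report_status(store: StatusStore, proxy_name: str, has_gpu: bool,
                  loaded_models: List[str], load: float) -> None:
    """Called by a Runtime Proxy to publish its configuration and status."""
    payload = {
        "name": proxy_name,
        "has_gpu": has_gpu,
        "loaded_models": sorted(loaded_models),
        "load": load,  # hypothetical 0.0 (idle) to 1.0 (saturated) metric
    }
    store.set(f"proxy:{proxy_name}", json.dumps(payload))


def read_cluster_view(store: StatusStore) -> List[dict]:
    """Called by an Ingress Service before making routing decisions."""
    return [json.loads(raw) for raw in store.values_with_prefix("proxy:")]


store = StatusStore()
report_status(store, "proxy-a", True, ["model_name:46"], 0.7)
report_status(store, "proxy-b", False, [], 0.1)
view = read_cluster_view(store)
```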

The Puppet Master: Ingress Service

Inference requests come in with a URL like:

https://customer_name.app.modelbit.com/v1/model_name/staging/latest

In this URL, “app.modelbit.com” refers to Modelbit’s runtime cluster in AWS US-East-2; “customer_name” refers to the customer; “staging” refers to the git branch; “model_name” refers to the customer’s particular model deployment; and “latest” is a pointer to the latest deployed model version. (“v1” refers to our optimism that we may someday make a breaking change to our REST API format. 😉)
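
A minimal sketch of how such a URL could be broken into its parts, using only Python's standard library. The `InferenceTarget` type and `parse_inference_url` function are hypothetical names for illustration, and the sketch assumes exactly the four-segment path shown above.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class InferenceTarget:
    customer: str
    cluster: str
    api_version: str
    model: str
    branch: str
    version: str


def parse_inference_url(url: str) -> InferenceTarget:
    """Split an inference URL into customer, cluster, and path components."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # The first host label is the customer; the rest names the cluster.
    customer, _, cluster = host.partition(".")
    segments = [s for s in parts.path.split("/") if s]
    if not customer or not cluster or len(segments) != 4:
        raise ValueError(f"unexpected inference URL shape: {url!r}")
    api_version, model, branch, version = segments
    return InferenceTarget(customer, cluster, api_version, model, branch, version)


target = parse_inference_url(
    "https://customer_name.app.modelbit.com/v1/model_name/staging/latest"
)
```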

Of course, the particular Ingress Service that receives this request will be one of the ones running at app.modelbit.com in AWS US-East-2. It will first dereference the “latest” pointer to discover that we must call version 46 of the model because it’s the latest one. Customers may also set custom pointer aliases (e.g. “/v1/segment_image/harrys_version”), which the Ingress Service will also need to dereference. (These pointers are files whose source of truth is in S3 and whose caches live locally on the Ingress Services.)
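
The dereferencing step can be sketched as a read-through cache in front of the source of truth. This is an illustration under stated assumptions: the `PointerCache` class is hypothetical, a plain dictionary stands in for S3, the version number for `harrys_version` is made up, and treating an all-digit version as already resolved is an assumption rather than documented behavior.

```python
from typing import Callable, Dict, Tuple

PointerKey = Tuple[str, str, str]  # (customer, model, alias)


class PointerCache:
    """Local cache of version pointers; the fetch callable stands in for S3."""

    def __init__(self, fetch_from_store: Callable[[PointerKey], int]) -> None:
        self._fetch = fetch_from_store
        self._cache: Dict[PointerKey, int] = {}

    def resolve(self, customer: str, model: str, version: str) -> int:
        # Assumption: an all-digit version is already concrete.
        if version.isdigit():
            return int(version)
        key = (customer, model, version)
        if key not in self._cache:
            self._cache[key] = self._fetch(key)  # cache miss: go to the source
        return self._cache[key]

    def invalidate(self, customer: str, model: str, alias: str) -> None:
        """Drop a cached pointer, e.g. after the customer redeploys."""
        self._cache.pop((customer, model, alias), None)


# Dictionary standing in for the pointer files in S3.
source_of_truth = {
    ("customer_name", "model_name", "latest"): 46,
    ("customer_name", "segment_image", "harrys_version"): 12,  # made-up value
}
fetch_calls = []


def fetch(key: PointerKey) -> int:
    fetch_calls.append(key)
    return source_of_truth[key]


pointers = PointerCache(fetch)
first = pointers.resolve("customer_name", "model_name", "latest")
second = pointers.resolve("customer_name", "model_name", "latest")
```

The second lookup is served from the local cache, so the store is consulted only once.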

Once the Ingress Service knows exactly which model must serve the request, it will decide which Runtime Proxy server(s) will run the request. For large batches, the Ingress Service may decide to break the batches up into multiple smaller batches and run them on multiple Runtime Proxies in parallel.

In deciding which Runtime Proxies will handle the request, the Ingress Service will scan all the Runtime Proxies to see if any of them already have the model loaded. It will strongly prefer those servers, because they will not incur any cold-start time for our customers. If no such servers are available, or if those servers are heavily loaded, the Ingress Service will look for lightly loaded servers to handle the request.
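
The selection rule described above can be sketched as follows. The `ProxyStatus` fields, the numeric load metric, and the `HEAVY_LOAD` threshold are all assumptions made for illustration; the post does not say how load is measured.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

HEAVY_LOAD = 0.8  # hypothetical cutoff for "heavily loaded"


@dataclass
class ProxyStatus:
    name: str
    load: float  # hypothetical metric: 0.0 idle, 1.0 saturated
    loaded_models: Set[str] = field(default_factory=set)


def choose_proxy(proxies: List[ProxyStatus], model_id: str) -> Optional[ProxyStatus]:
    """Prefer a proxy that already has the model loaded and is not overloaded."""
    if not proxies:
        return None
    warm = [
        p for p in proxies
        if model_id in p.loaded_models and p.load < HEAVY_LOAD
    ]
    # Fall back to the whole fleet when no warm proxy has spare capacity.
    candidates = warm or proxies
    return min(candidates, key=lambda p: p.load)


fleet = [
    ProxyStatus("proxy-a", 0.9, {"model_name:46", "other_model:3"}),
    ProxyStatus("proxy-b", 0.3, {"model_name:46"}),
    ProxyStatus("proxy-c", 0.1),
]
warm_pick = choose_proxy(fleet, "model_name:46")
cold_pick = choose_proxy(fleet, "other_model:3")
```

Here `proxy-b` wins for `model_name:46` even though `proxy-c` is less loaded, because avoiding a cold start takes priority. For `other_model:3` the only warm proxy is overloaded, so the least-loaded proxy is chosen instead.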

It will then fire the request off and wait for the response. The response will come back with the actual inference response data, as well as metadata like the stdout and stderr outputs for logging, and the description and stack trace of any errors. If the Ingress Service had previously split the request into multiple batches, then it will stitch together the logs and combine the inferences before returning them all to the user.
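
Splitting a batch and stitching the results back together might look like the sketch below. The `ProxyResponse` shape, the chunk size, and the `fake_run` stand-in for a Runtime Proxy call are assumptions; the chunks are run sequentially here for simplicity, whereas the text describes running them in parallel.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class ProxyResponse:
    inferences: List[Any]
    stdout: str = ""
    stderr: str = ""
    errors: List[str] = field(default_factory=list)


def split_batch(rows: List[Any], max_rows: int) -> List[List[Any]]:
    """Break one large batch into chunks of at most max_rows rows."""
    if max_rows < 1:
        raise ValueError("max_rows must be at least 1")
    return [rows[i:i + max_rows] for i in range(0, len(rows), max_rows)]


def stitch(responses: List[ProxyResponse]) -> ProxyResponse:
    """Merge per-chunk responses, preserving the original row order."""
    merged = ProxyResponse(inferences=[])
    for response in responses:
        merged.inferences.extend(response.inferences)
        merged.stdout += response.stdout
        merged.stderr += response.stderr
        merged.errors.extend(response.errors)
    return merged


def fake_run(chunk: List[int]) -> ProxyResponse:
    # Stand-in for sending a chunk to a Runtime Proxy.
    return ProxyResponse(
        inferences=[x * 2 for x in chunk],
        stdout=f"ran {len(chunk)} rows\n",
    )


chunks = split_batch([1, 2, 3, 4, 5], max_rows=2)
result = stitch([fake_run(chunk) for chunk in chunks])
```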

The Tank: Runtime Proxy

The Runtime Proxy’s function is to load the desired customer model if it’s not already loaded, run that model, and do all the necessary monitoring and bookkeeping.

In the case of a new model deployment, the Runtime Proxy will need to fetch the URL of the model’s Docker container, pull it down, and start the container. Once the container is alive and ready for inferences, the Runtime Proxy will report in Redis that it is ready to receive inference requests for that model.

Often, a particular Runtime Proxy was selected to run an inference because it has run inferences for that model recently. It may already have a running container for that model, or have a copy on disk. Upon receiving a request for inference, the Runtime Proxy will boot the container if necessary, call it via a Unix socket, receive the results, return them to the Ingress Service, and wait. If the model isn’t called again soon, and the customer has not chosen the “Keep Warm” feature for the model, it will eventually pause the running Docker container.
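
The pause decision can be sketched as a small policy function. The idle timeout value and the `ContainerState` fields are assumptions for illustration; the post says only that an unused container is "eventually" paused unless “Keep Warm” is enabled.

```python
from dataclasses import dataclass

IDLE_TIMEOUT_SECONDS = 300.0  # hypothetical value; the real timeout is not stated


@dataclass
class ContainerState:
    model_id: str
    keep_warm: bool
    last_called_at: float  # seconds, on the same clock as `now`
    paused: bool = False


def should_pause(container: ContainerState, now: float,
                 idle_timeout: float = IDLE_TIMEOUT_SECONDS) -> bool:
    """Pause only running containers that are idle and not marked Keep Warm."""
    if container.paused or container.keep_warm:
        return False
    return now - container.last_called_at >= idle_timeout


idle = ContainerState("model_name:46", keep_warm=False, last_called_at=0.0)
warm = ContainerState("other_model:3", keep_warm=True, last_called_at=0.0)
pause_idle = should_pause(idle, now=600.0)
pause_warm = should_pause(warm, now=600.0)
```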

Runtime Proxies and GPUs

Some of our Runtime Proxies have GPUs, and while Runtime Proxies can run multiple models at once, they can only run one model on the GPU at a time. If a request comes in requesting an inference for the model currently occupying the GPU, then no changes are necessary: that model will run and serve the request!

If not, then the model previously occupying the GPU will need to release the GPU memory. Simply restarting that model’s container usually suffices: As part of the restart, the model will release its hold on the GPU, but will otherwise be ready for future inferences. If the customer code doesn’t release the GPU on restart (perhaps due to some kind of hang or crash in the customer code), the Runtime Proxy will terminate the container to ensure the GPU is freed.

The Runtime Proxy then loads the new model onto the GPU and runs it for inference. Finally, the Runtime Proxy monitors resource usage and other metadata, and returns that information to the Ingress Service alongside the inferences.
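
The GPU handoff described in this section can be sketched as a single-occupant slot: restart the previous model's container first, and terminate it only if the GPU is still held. The `GpuSlot` and `FakeRuntime` classes and their method names are hypothetical; a real implementation would call into the container runtime and query the GPU rather than use these stand-ins.

```python
from typing import List, Optional, Set, Tuple


class FakeRuntime:
    """Stand-in for container and GPU operations, used only for this sketch."""

    def __init__(self, stuck_models: Set[str]) -> None:
        self.stuck = set(stuck_models)  # models that keep the GPU after a restart
        self.gpu_holder: Optional[str] = None
        self.events: List[Tuple[str, str]] = []

    def load_on_gpu(self, model_id: str) -> None:
        self.gpu_holder = model_id

    def restart(self, model_id: str) -> None:
        self.events.append(("restart", model_id))
        if model_id not in self.stuck:
            self.gpu_holder = None

    def terminate(self, model_id: str) -> None:
        self.events.append(("terminate", model_id))
        self.gpu_holder = None

    def gpu_is_free(self) -> bool:
        return self.gpu_holder is None


class GpuSlot:
    """Tracks the single model allowed on the GPU at a time."""

    def __init__(self, runtime: FakeRuntime) -> None:
        self.runtime = runtime
        self.occupant: Optional[str] = None

    def acquire(self, model_id: str) -> str:
        if self.occupant == model_id:
            return "already-loaded"
        action = "loaded"
        if self.occupant is not None:
            # A restart usually makes the old model release GPU memory.
            self.runtime.restart(self.occupant)
            action = "restarted-previous"
            if not self.runtime.gpu_is_free():
                # The restart did not free the GPU, so terminate the container.
                self.runtime.terminate(self.occupant)
                action = "terminated-previous"
        self.runtime.load_on_gpu(model_id)
        self.occupant = model_id
        return action


runtime = FakeRuntime(stuck_models={"model_b"})
slot = GpuSlot(runtime)
actions = [slot.acquire(m) for m in ["model_a", "model_a", "model_b", "model_c"]]
```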

Future Work

As of this writing, our new infrastructure has been serving 100% of deployments and inference requests in Modelbit for the last few weeks. Results have been quite positive! We’ve been able to serve a wide variety of customer models with very divergent resource needs on a simple, stable infrastructure.

Looking ahead, we anticipate a handful of developments:

Multiple GPU-requiring models running on a single GPU: While GPU memory virtualization technology is not as mature as system memory virtualization, we would love to share GPUs across models at the same time when practical.
Smarter scheduling: We decide where to run models based on the Runtime Proxies’ current load. But we can also know things about the historical performance of jobs that are currently running, as well as the jobs that we want to run. For example, if a host is heavily loaded but we know the job it’s running will finish soon, it’s still okay to send that host another job.
Migrating running models across hosts: If a host frees up, it may make sense to checkpoint a currently running container, move it to the free host, and start it again. This could result in faster performance than letting it finish on the more heavily-loaded host.
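
To illustrate the smarter-scheduling idea above, the sketch below ranks hosts by how long their current job is expected to keep running, rather than by current load alone. It is speculative by nature, since the feature is described as future work: the `RunningJob` fields and the use of a single historical duration per job are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunningJob:
    host: str
    elapsed_seconds: float
    typical_seconds: float  # historical duration for this kind of job (assumed input)


def expected_wait(job: RunningJob) -> float:
    """Estimated time until the host's current job finishes."""
    return max(0.0, job.typical_seconds - job.elapsed_seconds)


def pick_host(jobs: List[RunningJob]) -> str:
    """Choose the host whose current job should finish soonest."""
    if not jobs:
        raise ValueError("no hosts to choose from")
    return min(jobs, key=expected_wait).host


jobs = [
    RunningJob("host-a", elapsed_seconds=58.0, typical_seconds=60.0),
    RunningJob("host-b", elapsed_seconds=5.0, typical_seconds=600.0),
]
chosen = pick_host(jobs)
```

Both hosts are busy, but `host-a` is expected to free up in about two seconds, so it is still a reasonable place to send the next job.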

As with all startups, future customer requirements and the evolution of such a fast-moving space will inevitably impose requirements we never thought of. We can’t wait to look back months and years from now and see if our infrastructure evolved in the ways we thought it would.
