Startups are hypotheses: Every startup is a bet that the world can be better in one highly specific, but massively impactful way. At Modelbit, our hypothesis is that machine learning practitioners will change the world with their models. They just need it to be a little easier to deploy those models to production.
As befits a hypothesis, our product began as a prototype. Like all good prototypes, it made great use of on-demand, off-the-shelf infrastructure as we tested and iterated on our hypothesis. Much of our backend was serverless, making extensive use of AWS Lambda and similar tools to host both our own management infrastructure and containerized customer models.
As we’ve onboarded more and more customers, our hypothesis has become more stable and better proven out, and the requirements for our infrastructure have become clearer. At the same time, our customers’ needs have clearly been pushing the limits of what we can do with off-the-shelf infrastructure. It was time to build the infrastructure our market demands.
We had a good debate internally about whether this was Infrastructure 2.0, 3.0 or 4.0. After all, we’ve been iterating constantly! But the more we thought about it, the more we realized: As the first fully bespoke Modelbit backend, it is truly Modelbit’s Infrastructure 1.0.
Our new infrastructure needed to solve a number of problems at once.
When a request for an inference comes in, the request is routed by our load balancers to the Ingress Service. The Ingress Service serves as part router and part puppetmaster, deciding which servers will serve which requests, breaking up batches and stitching together results as necessary, and handling Modelbit’s management needs like logging and billing.
Behind the Ingress Service is a heterogeneous collection of EC2 boxes in different configurations, some with GPUs and some without, each running one or more Runtime Proxies. The Runtime Proxies load and run the Docker containers that serve customer models.
Runtime Proxies report their configuration and status to a Redis cluster. The Ingress Services read from the cluster to make their management decisions. (And also to serve up a status page and management console that the Modelbit team refreshes throughout the day. 😉)
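To make the division of labor concrete, here is a minimal sketch of the status-reporting loop described above. A plain dict stands in for the Redis cluster so the sketch is runnable, and every field and function name is illustrative rather than Modelbit's actual schema:

```python
import json
import time

def report_proxy_status(store, proxy_id, loaded_models, cpu_load, has_gpu):
    """Called periodically by each Runtime Proxy to publish its state.

    In production, `store` would be a Redis cluster; here it's any
    dict-like object so the sketch runs standalone.
    """
    store[f"proxy:{proxy_id}"] = json.dumps({
        "loaded_models": loaded_models,  # models with warm containers
        "cpu_load": cpu_load,            # 0.0 .. 1.0
        "has_gpu": has_gpu,
        "updated_at": time.time(),       # lets readers drop stale entries
    })

def read_proxy_statuses(store, max_age_s=30.0):
    """Called by the Ingress Service before making a routing decision."""
    now = time.time()
    statuses = {}
    for key, raw in store.items():
        if not key.startswith("proxy:"):
            continue
        status = json.loads(raw)
        if now - status["updated_at"] <= max_age_s:  # ignore stale proxies
            statuses[key.removeprefix("proxy:")] = status
    return statuses
```

The staleness check is the useful property here: a Runtime Proxy that crashes simply stops refreshing its record, and the Ingress Services stop routing to it without any explicit deregistration step.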
Inference requests come in with a URL that names the runtime cluster, the customer, the git branch, the model, and the model version. In these URLs, “app.modelbit.com” refers to Modelbit’s runtime cluster in AWS US-East-2; “customer_name” refers to the customer; “staging” refers to the git branch; “model_name” refers to the customer’s particular model deployment; and “latest” is a pointer to the latest deployed model version. (“v1” reflects our optimism that we may someday make a breaking change to our REST API format. 😉)
Of course, the particular Ingress Service that receives this request will be one of those running at app.modelbit.com in AWS US-East-2. It will first dereference the “latest” pointer to discover that we must call version 46 of the model, because it’s the latest one. Customers may also set custom pointer aliases (e.g. “/v1/segment_image/harrys_version”), which the Ingress Service will also need to dereference. (These pointers are files whose source of truth is in S3 and whose caches live locally on the Ingress Services.)
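The pointer lookup can be sketched as a read-through cache. This is a simplified illustration, not our actual code: `fetch_from_s3` is a stand-in for a real S3 read, and a production cache would also need invalidation when a new version is deployed:

```python
# Hypothetical sketch of version-pointer dereferencing: the pointer files'
# source of truth lives in S3; each Ingress Service keeps a local cache.

_pointer_cache = {}  # (customer, branch, model, alias) -> concrete version

def dereference(customer, branch, model, alias, fetch_from_s3):
    """Resolve an alias like 'latest' or 'harrys_version' to a version number."""
    key = (customer, branch, model, alias)
    if key not in _pointer_cache:          # cache miss: consult S3
        _pointer_cache[key] = fetch_from_s3(key)
    return _pointer_cache[key]
```

The point of the local cache is that the common case (a hot model called repeatedly) never touches S3 on the request path.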
Once the Ingress Service knows exactly which model must serve the request, it will decide which Runtime Proxy server(s) will run the request. For large batches, the Ingress Service may decide to break the batches up into multiple smaller batches and run them on multiple Runtime Proxies in parallel.
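The batch-splitting step is simple to state precisely. A minimal sketch, with an illustrative size cap rather than whatever heuristics the real scheduler uses:

```python
# Hypothetical sketch of splitting a large inference batch into smaller
# sub-batches that can run on multiple Runtime Proxies in parallel.

def split_batch(inputs, max_batch_size):
    """Split a list of inference inputs into sub-batches of bounded size."""
    return [inputs[i:i + max_batch_size]
            for i in range(0, len(inputs), max_batch_size)]
```

Because the sub-batches preserve input order, the Ingress Service can later reassemble the responses in order without extra bookkeeping.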
In deciding which Runtime Proxies will handle the request, the Ingress Service will scan all the Runtime Proxies to see if any of them already have the model loaded. It will strongly prefer those servers, because they will not incur any cold-start time for our customers. If no such servers are available, or if those servers are heavily loaded, the Ingress Service will look for lightly loaded servers to handle the request.
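That preference order can be sketched as a two-pass selection. The threshold and the status shape are assumptions for illustration, not Modelbit's actual policy:

```python
# Hypothetical sketch of the routing preference described above: prefer a
# proxy that already has the model loaded (no cold start), then fall back
# to the most lightly loaded proxy overall.

def choose_proxy(statuses, model, overload_threshold=0.8):
    """statuses: proxy_id -> {"loaded_models": [...], "cpu_load": float}."""
    warm = [(s["cpu_load"], pid) for pid, s in statuses.items()
            if model in s["loaded_models"] and s["cpu_load"] < overload_threshold]
    if warm:
        return min(warm)[1]   # least-loaded warm proxy: no cold start
    cold = [(s["cpu_load"], pid) for pid, s in statuses.items()]
    return min(cold)[1]       # otherwise, the most lightly loaded proxy
```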
It will then fire the request off and wait for the response. The response will come back with the actual inference results, as well as metadata like the stdout and stderr output for logging, and the description and stack trace of any errors. If the Ingress Service had previously split the request into multiple batches, it will stitch the logs and inference results back together before returning them all to the user.
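The stitching step might look like the following. The field names are illustrative, not Modelbit's actual wire format; the one real constraint is that sub-responses arrive in the same order the sub-batches were split:

```python
# Hypothetical sketch of reassembling sub-batch responses into a single
# response: results are concatenated in order and logs are merged.

def stitch_responses(sub_responses):
    """sub_responses: ordered list of
    {"results": [...], "stdout": str, "error": str | None}."""
    stitched = {"results": [], "stdout": "", "error": None}
    for r in sub_responses:
        stitched["results"].extend(r["results"])
        stitched["stdout"] += r["stdout"]
        stitched["error"] = stitched["error"] or r.get("error")  # first error wins
    return stitched
```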
The Runtime Proxy’s function is to load the desired customer model if it’s not already loaded, run that model, and do all the necessary monitoring and bookkeeping.
In the case of a new model deployment, the Runtime Proxy will need to fetch the URL of the model’s Docker container, pull the container down, and start it. Once the container is alive and ready for inferences, the Runtime Proxy will report its status in Redis as ready to receive inference requests for that model.
Often, a particular Runtime Proxy was selected to run an inference because it has run inferences for that model recently. It may already have a running container for that model, or have a copy on disk. Upon receiving a request for inference, the Runtime Proxy will boot the container if necessary, call it via a Unix socket, receive the results, return them to the Ingress Service, and wait. If the model isn’t called again soon, and the customer has not chosen the “Keep Warm” feature for the model, it will eventually pause the running Docker container.
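The pause decision at the end of that flow reduces to a small policy check. This is a sketch under assumed names: the idle window and the container record's fields are illustrative, not Modelbit's actual configuration:

```python
import time

# Hypothetical sketch of the idle-pause policy: pause a container that has
# gone unused for a while, unless its customer enabled "Keep Warm".

def should_pause(container, now=None, idle_window_s=600.0):
    """container: {"last_used_at": float, "keep_warm": bool}."""
    now = time.time() if now is None else now
    if container["keep_warm"]:
        return False          # customer opted to keep this model warm
    return now - container["last_used_at"] > idle_window_s
```

Pausing rather than killing the container is the design choice worth noting: a paused container can resume far faster than a cold start, at the cost of holding its disk and memory footprint.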
Some of our Runtime Proxies have GPUs, and while Runtime Proxies can run multiple models at once, they can only run one model on the GPU at a time. If a request comes in requesting an inference for the model currently occupying the GPU, then no changes are necessary: that model will run and serve the request!
If not, then the model previously occupying the GPU will need to release the GPU memory. Simply restarting that model’s container usually suffices: As part of the restart, the model will release its hold on the GPU, but will otherwise be ready for future inferences. If the customer code doesn’t release the GPU on restart (perhaps due to some kind of hang or crash in the customer code), the Runtime Proxy will terminate the container to ensure the GPU is freed.
The Runtime Proxy then loads the new model onto the GPU and runs it for inference. Finally, the Runtime Proxy monitors resource usage and other metadata, and returns that information to the Ingress Service alongside the inferences.
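The GPU arbitration above can be sketched as a single-holder slot. Everything here is simulated: the callbacks stand in for real Docker restart/terminate operations and a real GPU-memory check:

```python
# Hypothetical sketch of single-GPU arbitration: one model holds the GPU
# at a time; restarting the previous holder's container usually frees GPU
# memory, and terminating the container is the fallback.

class GpuSlot:
    def __init__(self):
        self.holder = None  # model currently occupying the GPU

    def acquire(self, model, restart_container, terminate_container,
                gpu_freed_after_restart):
        if self.holder == model:
            return                                  # already on GPU: no-op
        if self.holder is not None:
            restart_container(self.holder)          # usually releases the GPU
            if not gpu_freed_after_restart(self.holder):
                terminate_container(self.holder)    # hang/crash fallback
        self.holder = model                         # load the new model
```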
As of this writing, our new infrastructure has been serving 100% of deployments and inference requests in Modelbit for the last few weeks. Results have been quite positive! We’ve been able to serve a wide variety of customer models with very divergent resource needs on a simple, stable infrastructure.
As we look to the future, we anticipate a handful of further developments.
As with all startups, future customer requirements and the evolution of such a fast-moving space will inevitably impose requirements we never thought of. We can’t wait to look back months and years from now and see if our infrastructure evolved in the ways we thought it would.