Congratulations, you’ve built an amazing ML model and you’re ready to integrate it into your product! Since ML models are generally in Python and products generally aren’t, you’ve decided to deploy your model as a REST API. Nearly every product can use REST, so that will work well.
But where will the model run? There are many kinds of infrastructure, each with trade-offs, ranging from price vs. performance to simplicity vs. durability. We are here to help! At Modelbit, we have used a lot of different infrastructure for hosting models, and decided we should share the pros and cons of the popular options.
We are going to focus on the infrastructure and avoid DevOps processes and network configuration. DevOps processes such as deployment, authentication, versioning, and rollbacks are essential but can be discussed separately from the infrastructure. We dare not subject you to the horrors and how-to’s of AWS network configuration.
By the way, if this is too much and you want a solution that “just works”, deploy your ML models with Modelbit!
Lambda is the easiest infrastructure for running your Python models because you will never have to worry about managing servers or scale limitations.
If your model fits in a Lambda function, you are in a great situation. Lambdas are simple to operate, they scale automatically, and they cost next to nothing. The easiest models to fit in Lambda are “models,” with quotes: lightweight business logic that doesn’t rely on pip-installed packages like XGBoost, PyTorch or Tensorflow.
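For that flavor of pip-free “model,” a Lambda handler might look like the sketch below. The scoring rule and field names are made up for illustration; the point is that it’s plain Python with nothing to install:

```python
import json

# A hypothetical lightweight "model": a hand-written scoring rule
# rather than a trained artifact, so no pip packages are needed.
def score_lead(features):
    score = 0.0
    if features.get("visited_pricing_page"):
        score += 0.4
    if features.get("company_size", 0) > 100:
        score += 0.3
    if features.get("opened_last_email"):
        score += 0.3
    return round(score, 2)

# Lambda invokes this entry point with the API Gateway event.
def handler(event, context):
    features = json.loads(event.get("body") or "{}")
    return {
        "statusCode": 200,
        "body": json.dumps({"score": score_lead(features)}),
    }
```

Hook that up behind API Gateway and you have a REST endpoint with no servers to manage.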
However, AWS Lambda can still be a great option if you need a few pip packages; to install them, you’ll need to use Docker. While AWS Lambda has a feature called “Layers” for this, you should avoid it: Lambda Layers are frustrating to create and quite limited in their use, so just don’t. Docker is also a prerequisite for more advanced deployment infrastructure, so you’ll need to learn it at some point. Why not now?
It might take a few tries to build your first Docker image because Dockerfile syntax is pretty wacky. But once you get the hang of it, you’ll have gained a new superpower! With Docker in your skill set you can eliminate the whole category of “works on my box but nowhere else” problems. So, build a Docker image containing your model and the necessary pip packages. Then push the image to AWS ECR (Lambda requires Docker images to be hosted in ECR) and create a Lambda function from that image. AWS has a tutorial you can follow.
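As a sketch, a Dockerfile for a Lambda container image might look like the following, assuming your handler lives in a hypothetical app.py exposing a handler() function, with dependencies listed in requirements.txt and a model artifact in model.pkl:

```dockerfile
# AWS's official Lambda base image for Python
FROM public.ecr.aws/lambda/python:3.11

# Install the pip packages your model needs
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy the model artifact and handler code into the task root
COPY model.pkl app.py ${LAMBDA_TASK_ROOT}/

# Tell Lambda to invoke handler() in app.py
CMD ["app.handler"]
```

From there it’s a docker build, a docker push to your ECR repo, and pointing a new Lambda function at the image, as the AWS tutorial walks through.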
You’ve got Docker, you’ve got Lambda, now you have to choose how large to make your Lambda function. Make it large: a few GB of memory. You’re worth it. Besides, Lambda only charges a fraction of a fraction of a penny per millisecond, so it’s not worth stressing about the size of your Lambda. Billing is memory multiplied by duration, and larger Lambdas run faster than smaller Lambdas, so a larger Lambda can cost about the same per request. You get 1 vCPU per 1,769MB of memory, so don’t go smaller than that.
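To see why a bigger Lambda can cost about the same per request, here’s the back-of-the-envelope math. The price below is an approximate us-east-1 GB-second figure; check current pricing before relying on it:

```python
# Lambda bills duration * memory, so a bigger function that runs
# proportionally faster costs about the same per invocation.
PRICE_PER_GB_SECOND = 0.0000166667  # approximate us-east-1 price

def invocation_cost(memory_gb, duration_seconds):
    return memory_gb * duration_seconds * PRICE_PER_GB_SECOND

# A CPU-bound model that takes 2s at 1,769MB (1 vCPU) might take
# about 1s at 3,538MB (2 vCPUs):
small = invocation_cost(1.769, 2.0)
large = invocation_cost(3.538, 1.0)
print(f"small: ${small:.8f}, large: ${large:.8f}")  # same cost, half the latency
```

This only holds while your model can actually use the extra CPU, but it takes the sting out of provisioning generously.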
AWS Lambda is not a good choice for models or pip packages that have a lot of IO because AWS Lambda’s file system is slow. The Tensorflow package, for example, consists of approximately 3.14 zillion python files, and can take almost a minute to import in a Lambda.
AWS Lambda also has a 10GB limit on the uncompressed size of the Docker image. Inconveniently, the image size your Docker repo shows is the compressed size. So one day you’ll be surprised that your image is too big for Lambda, and you’ll have no immediate recourse.
For example, PyTorch barely fits. And PyTorch with a large checkpoint file probably will not fit. As soon as your image passes a couple gigabytes you’ll also notice some pretty long cold starts, as Lambda takes a while to download, decompress, and boot your image.
Should you use Provisioned Lambdas to solve cold starts? Definitely not. Like Layers, Provisioned Lambdas are a feature to avoid. They have unintuitive behavior around updates and versioning that makes using them behind a REST API unnecessarily complicated. Plus, creating a Provisioned Lambda can fail silently if it takes more than 2 minutes to initialize.
So while Lambda is great for small models with a few pip packages, it’s not the solution for medium-sized models. For that, it’s time for AWS Fargate!
Fargate provides the predictable performance of a server running your ML models, without needing to think about the server itself.
Welcome to the next level of model deployment! You’ve graduated from AWS Lambda with your ML model in a Docker image, and you’re looking for a bit more control. Maybe you need more CPU, or more RAM, or to eliminate cold starts: Fargate has you covered!
To use Fargate, you’ll need to create an AWS ECS Cluster and Service that uses Fargate to load and run your container from ECR. It can be a lot of clicking or configuration, but usually it is not too painful.
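As a rough sketch, the ECS task definition you’d register for Fargate might look like this. The family name, role ARN, account ID, and image URI are all placeholders:

```json
{
  "family": "my-model",
  "requiresCompatibilities": ["FARGATE"],
  "networkMode": "awsvpc",
  "cpu": "1024",
  "memory": "4096",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "containerDefinitions": [
    {
      "name": "model-server",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-model:latest",
      "portMappings": [{ "containerPort": 8080 }],
      "essential": true
    }
  ]
}
```

Register it with the AWS CLI or console, then create an ECS Service on your cluster with the FARGATE launch type to keep it running.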
Good news: the exact same Docker container will boot faster on AWS Fargate than on AWS Lambda, for two reasons: Fargate’s file system is much faster than Lambda’s, and the image is pulled once when the task launches, after which the container stays running instead of being re-downloaded and re-booted on cold starts.
More good news: Fargate allows for a lot more CPU and RAM than Lambda. Lambda maxes out at 5-ish vCPUs and 10GB of RAM. Fargate goes up to 16 vCPUs and 120GB of RAM. So if your model is hungry, you can feed it on Fargate.
You can run multiple Docker images in the same Fargate instance. Which means, when you have multiple models that run in different containers, they can share the same Fargate instance to save on costs. In this setup, more models does not mean more cost.
And Fargate is “managed” in the same way as Lambda. If your Fargate instance crashes, no problem, ECS will make a new one automatically. Want to save money with Spot instances? That’s a checkbox. You don’t need to think about Availability Zones or Auto-scaling Groups or Instance Types. It’s all taken care of for you.
Unfortunately, it’s not all rainbows and unicorns with AWS Fargate. Our sad-emoji list starts with cost: you pay the whole time the Fargate instance is running, unlike Lambda where you only pay when requests are processed. The world of ML model hosting is generally on-demand, and Fargate is not.
You also have to choose how many Fargate instances you want. AWS Lambda automatically scales up with the number of incoming requests; Fargate doesn’t, and requires you to decide the number of instances you’ll need ahead of time.
The bandwidth between ECR and Fargate is quite low, about 25MB/s per Docker image layer up to a handful of layers at once. That means deployments of large, multi-gigabyte models can take a really long time on Fargate. And they’ll take a long time every time, since Fargate doesn’t cache layer downloads.
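Some illustrative arithmetic with those numbers: a single multi-gigabyte layer of model weights can’t be parallelized with itself, so it gates the whole deploy.

```python
# Back-of-the-envelope: how long does Fargate spend pulling one image
# layer from ECR at ~25MB/s?
def layer_pull_seconds(layer_size_gb, bandwidth_mb_s=25):
    return layer_size_gb * 1024 / bandwidth_mb_s

minutes = layer_pull_seconds(5) / 60
print(f"~{minutes:.1f} minutes")  # a 5GB layer alone takes over 3 minutes
```

And since Fargate re-downloads on every deploy, you pay that pull time every single rollout.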
And worst of all, Fargate doesn’t have GPUs! GPU support has been on the roadmap since 2019, so that’s not a great sign. And if you don’t need GPUs already, you will for your bigger models.
So, where to next? EC2 of course!
EC2 is the most customizable and capable option for hosting ML models. It’s also the hardest to configure and keep running reliably.
If you’re here, you’re way past throwing a FastAPI server up into us-east-1 and declaring “mission accomplished.”
When you’re on EC2 you’ve left the calm shores of managed deployments like Lambda and Fargate, and need to work to keep your deployment afloat. Server crashed? Hard disk full? Service discovery missing? More servers during the day than at night? Solving those problems and more will take a significant chunk of your time.
But it’s worth it, if you’re deploying a ton of models! Modelbit’s deployments are built on clusters of EC2 instances with GPUs to provide simple and powerful experiences for our customers. We’ve invested in auto-scaling that matches customer demand and heals from crashes, and in rapid rollouts with caching layers that make multi-gigabyte models quick to load. Plus all the git-based deployment flows needed for versioning, rollbacks, and the rest.
It was a lot of work, but EC2 enabled us to build an ML model hosting platform that works well for many different kinds of models, customers, and use cases.
If you’d rather build on top of Modelbit, instead of making your own, sign up! We’ll take care of the MLOps so you can focus on building your ML models.