A technical guide to architecting large ML pipelines on Snowflake warehouses

Tom O'Neill, Co-Founder & CTO
Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.

There are several kinds of machine learning pipelines. Each requires different architectural choices because of different expectations around latency, volume and environment capabilities. The constraints for serving models used in real-time threat detection in video feeds are very different from those for batch information extraction across hundreds of millions of documents.

This post focuses on the latter: Large-scale batch inferences utilizing Snowflake. At Modelbit we’ve helped many customers perform inferences on the data in their Snowflake warehouses, and we’ve learned a lot along the way. If you’re thinking about building your own ML pipeline on Snowflake, then this post is for you!

Below we’ll share hard-learned lessons and recommendations for how you should (and shouldn’t) build your own ML pipeline on Snowflake. And at the end we’ll share some predictions for where we think this technology is going and what the future holds.

Expectations for our ML pipeline

When we’re talking about an ML pipeline in Snowflake, we’re talking about something dbt could orchestrate. Such a pipeline would process hundreds of millions of rows through dozens or more different ML models to create a collection of tables containing the inference results.

When we refer to a model, we’re referring to thousands of lines of Python code spread across many source files, pickled artifacts, Python packages, binary weights and checkpoint files. It’s everything you need to process the raw data into an inference.

Finally, the pipeline needs to be able to support GPUs. While your models today might not need them, the future of machine learning will include many more use cases where GPUs are effective. Adding support for GPUs should be a checkbox, not a whole second architecture.

With those expectations set, let’s dive into ML pipelines in Snowflake!

In the beginning there were only External Functions

Running an ML model means running a complex Python program. Each model has code and assets that are often sensitive to even small changes in the Python environment. And each model has its own set of libraries and versions it depends on, often including the system packages that vary from one flavor of Linux to another. Containerization is the solution to this set of problems.

Until recently it wasn’t reasonable to run complex Python programs, like ML models, within Snowflake. (More on that later.) The only reasonable way to run one model, let alone dozens, was to call out from Snowflake to an ML service you create and manage, like an EC2 that hosts and runs your models.

Snowflake created a feature called External Functions to enable this scenario.

The ideal external function

External functions allow you to define a SQL function that sends its arguments out to a service you control. Your service processes the list of rows it gets and returns a list of results which Snowflake consumes as if it were the results of any other SQL function. You make a REST API for your ML model and use external functions to call it. Simple!

Unfortunately, this is a lie. External functions cannot call an arbitrary endpoint. They can only call authenticated AWS API Gateways (or the Azure/GCP equivalent) via an API integration. Calling an API Gateway from Snowflake means you also need to configure a bunch of IAM permissions and network plumbing. The API Gateway also needs to call something that can rapidly scale and respond or else it’ll error. So in reality Snowflake calls an API Gateway, which then calls an AWS Lambda function, and finally the Lambda function can call your ML service:

With the above setup you can call your ML service with data from Snowflake and get the inferences materialized into Snowflake tables.

Data size matters. Snowflake has a limit of 16MB per cell. AWS Lambda functions have a body limit of 6MB. This means it’s possible there’s data in your warehouse that cannot be sent externally because the Lambda cannot receive it. If you’re trying to process 10MB documents then external functions cannot support you. (So skip to the next section!)
This size limit works in reverse too. If your ML service generates a response for a batch of rows that’s larger than 6MB you won’t be able to send it back to the Lambda to send to Snowflake.

If you’re still reading, you’ll be disappointed to learn that this diagram is also a lie! While it will work for a prototype, it will not scale to large datasets. And large datasets are why we’re here.

Scaling issues with external functions

Snowflake offers two API modes with their external functions. The default is a synchronous API that’s easy to work with: When a request comes in, your service processes it and returns the response like any REST API. However there are two big problems with the synchronous API.

The first is timeouts. The Snowflake query calling the external function can run as long as you’d like, but the API Gateway has a timeout of 30 seconds that cannot be changed. If your model runs in 30s, you may think you’re out of the woods. You are not, because of the second wrinkle: concurrency.

The Snowflake warehouse chooses the concurrency of the batches of rows it sends to your service. We’ve seen it start by sending as few as 8 concurrent batches, each containing a few dozen rows. We’ve also seen it choose to send as many as 1000 concurrent batches, each with thousands of rows.

The number of concurrent batches and the batch sizes are not predictable, though in general it does send more when using a larger warehouse. And since we’re working with hundreds of millions of rows in this pipeline, we’re using pretty large warehouses.

You don’t control the concurrency with which Snowflake calls your service. How Snowflake chooses the level of concurrency is a black box and is not configurable. Building a service that can efficiently handle rapidly changing levels of concurrency is a challenge.

The combination of a low timeout with high concurrency means reliably responding to all traffic from an external function within 30 seconds is quite problematic. Thankfully, Snowflake’s external functions have an asynchronous API that is a better fit for large workloads like ours.

Using the async API

When your service responds to the synchronous API it returns a HTTP 200 with the inference results. Snowflake sees the HTTP 200 as meaning the request is complete. However, your service can respond with other HTTP codes to trigger other behavior.

If you respond with a HTTP 202, Snowflake will take that to mean your service has accepted the batch. For the next 10 minutes Snowflake will poll, asking for the results of that batch. When the results are ready, just respond with the results and a HTTP 200 and you’re all set.

So the async API looks like a better fit for large batches of inferences! Unfortunately, it’s limited by a 10-minute timeout for processing each batch.

The async API only polls for 10 minutes, not for the duration of the query in the warehouse. If your service takes longer than 10 minutes to complete any single batch, Snowflake will error the query and fail this step of the pipeline. The 10-minute limit is not configurable.

Your architecture for using the async API will need a queue so that your service can have a pending list of batches for the model to run, and a completed set of results that Snowflake can poll for and fetch. Due to the 10-minute timeout, it’s important to work on batches in-order.

That 10-minute async API limit is often not enough when the Snowflake warehouse sends a (metric) gazillion batches all at once. To buy a little more time to scale up your ML service and process all the batches, your service can respond with an HTTP 429. The 429 tells Snowflake that the service is busy and it’ll slow down the rate of sending new batches to your service. You’ll still have only 10 minutes to respond to in-flight requests.

Always work on batches in-order. If you process batches in random order it’s possible for early batches to wait more than 10 minutes and cause Snowflake to error the entire query. This becomes very challenging when also responding with 429 since you won’t have those batches in the queue.

However, you can only respond with HTTP 429 for about 12 minutes. If you’re still 429’ing after 12 minutes, Snowflake will again error and fail this step of the pipeline.

Responding with 429s doesn’t affect the initial concurrency volume. If the Snowflake warehouse decides to send a zillion requests all at once at the start of the query, your service has to complete them all within the time limit. There isn’t a way to tell the warehouse “start slow and ramp up”.

We can do better!

External functions with the async API work pretty well, and until a few months ago were the only option for workloads like this.

Recently Snowflake added External Access to User-Defined Functions (UDFs). This makes real-world ML pipelines possible with Snowflake compute for the first time!

Then came External Access

Two years ago Snowflake launched Snowpark for Python and allowed running arbitrary Python code within the warehouse as UDFs. These Python UDFs had several limitations, the chief of which was no network access. Without network access Snowpark simply wasn’t viable for complex ML scenarios like ours.

That changed in the last few months when Snowflake launched External Access. With additional configuration it’s now possible to call out to the internet from Python UDFs. This means the architecture above can be greatly simplified while also allowing more control.

Houston, do you read me?

Now we’re in a world where Python code that’s running in Snowflake can call out to Python code running in your service. Sounds a bit like microservices, right? Yes it does. Developing here is better and easier than developing with External Functions, and is our recommended approach for making external API calls.

No lies this time, the architecture can really look like this:

The simplicity of the Python UDF with External Access is a welcome change from External Functions!

Communication is the answer

Your Python UDF is responsible for how it communicates with your service. By default, the UDF isn’t allowed to communicate with any external service. To enable external access you need to create a network rule to allow traffic to leave Snowflake. The network rule requires you to specify every domain and subdomain your Python UDF is allowed to communicate with.

Frustratingly, the network rule does not support wildcards, so you need to specify every subdomain. That can be a lot of subdomains when you’re talking to AWS services. Set your log-level to DEBUG, run your UDF, and let "boto3" fail to connect. Then wait 10 minutes and check your events table for "boto3’"s error logs, showing which hosts it couldn’t reach. Update your network rule, rinse, repeat.

Your Python UDF must also handle authentication. Previously the External Function used IAM permissions to authenticate to your API Gateway. With External Access you own your own authentication. Snowflake has a set of features around secrets that’ll give you access to your own API keys or auth tokens within your Python UDF.

Finally, in order to connect the network rule and secret to your Python UDF, you must create an access integration. With these three pieces you can create a Python UDF that can talk to (some of) the internet by setting the "external_access_integrations" field in your UDF definition:

create or replace function my_schema.my_udf(...)
  returns ...
  language python
  runtime_version = 3.10
  HANDLER = 'handler'
  external_access_integrations = (my_access_integration)
  packages = ('snowflake-snowpark-python', 'requests')
  secrets = ('authToken' = my_secret )
AS $$
import _snowflake, requests, os, pandas

os.environ["AUTH_TOKEN"] = _snowflake.get_generic_secret_string('authToken')

def handler(inputDf):
  # your code to call the REST API of your ML service
  return requests.post("https://your-domain.com/api/run_model",
    headers={"x-auth-token": os.environ["AUTH_TOKEN"]},

Along with authentication, your UDF code is also responsible for retries and throttling. If your service isn’t ready for the next batch of inferences, your Python UDF can simply "sleep" a few seconds and try again. Like before, there is a 10-minute timeout. Unlike before, this timeout is configurable.

To change the timeout of your Python UDF, contact Snowflake Support. The setting exists but it’s not customer-facing. Raise it to an hour so you’re never worried about a Python UDF killing a long-running SQL job because one batch unluckily took 10.1 minutes to complete.

The simplicity of this UDF is its strength. You can call whatever API you’d like, however you’d like, with a dozen lines of Python inside your Snowflake warehouse. 

Mostly rainbows and unicorns

Python UDFs with External Access check a lot of boxes for making it easy to build an ML pipeline on top of Snowflake. And since your ML inferences are running on your EC2s it’s possible to have any shape hardware you need to run your model.

Having a capable Python environment within the Snowflake warehouse raises the obvious question: Can the ML models just run within Snowpark? I’m glad you asked!

Running your ML pipeline inside Snowflake Snowpark

If (and that’s a big IF) your ML models can run within Snowflake, then you should probably just run them there! The architecture will be simpler than anything that relies on models running on EC2s. And scaling up/down model concurrency happens in lockstep with your warehouse size.

We’ve seen 4XL warehouses process inferences for half a billion rows in 10 minutes, while a decent cluster of EC2s would take close to an hour. Part of this is due to the serialization and networking overhead of moving a lot of data out of Snowflake and through EC2s and back. And part of this is because large Snowflake warehouses are just fast!

ML models in Snowpark

Running your models with Snowflake Snowpark is very similar to the Python UDFs discussed earlier. You define a Python UDF with the code and packages needed, and pass the inputted DataFrame to your model.

However, the developer ergonomics of Snowpark are quite poor for complex ML models. Remember, our models are “thousands of lines of Python code spread across many source files, pickled artifacts, installed Python packages, binary weights and checkpoint files”. Effectively writing (and testing!) Python embedded within a SQL function becomes a challenge after just a couple dozen lines of code.

Alternatively you could put all your code in a Stage. But that means for every change during development you must (1) package your code into a "whl", (2) upload it to a stage, (3) recreate your UDF to use the new code and files, then finally (4) call the UDF to see if your code change works. This is not realistic for real-world development, and if you’re selling this as a service to real customers for real dollars, I do not recommend bringing this idea up on a sales call.

Avoid using Stages whenever possible. They’re like S3 but worse because they (1) are opaque to your other development tools and (2) have unusual limitations like requiring some files to only be in the root directory for them to be loadable by some Snowflake functions. Just use S3.

Thankfully there’s a better way: dynamic code injection. Yep, you heard that right!

Dynamic code injection to Python UDFs

Developing Python code within a Snowflake function is … challenging. There are plenty of great editors and type checkers and test frameworks that run on Python code in normal Python files. There are none that work on Python code within your Snowflake SQL functions. Our recommendation: don’t develop Python within SQL functions. Instead, use the magic of Python to inject the code at runtime!

Our Python UDFs download their code and dependencies during initial execution from our ML service. This means it’s possible to fix bugs, improve performance, and adjust the behavior of our Python UDFs without changing the UDF! Just re-run the SQL statement and the UDF will fetch new code and execute it.

Not storing your Python code in SQL warehouse also means you get normal CI/CD workflows, git versioning, syntax highlighting and all the other normal development tools you’re used to.

Most importantly, this means we can use normal Python development tools and workflows to improve our Python UDFs very quickly: just save the current file and re-run your SQL query. There are few things more valuable than rapid iteration in software development.

During local development you can use ngrok to provide a host to put in your Snowflake network rule. This way your warehouse can download the Python code you’re developing from your development machine, as you develop it. No need to wait for pushes or release pipelines or even re-running “create or replace function”!

Here’s an example of a Python UDF that downloads and executes its own code during startup:

create or replace function my_schema.my_udf(...)
  returns ...
  language python
  runtime_version = 3.10
  HANDLER = 'handler'
  external_access_integrations = (my_access_integration)
  packages = ('snowflake-snowpark-python', 'requests')
  secrets = ('authToken' = my_secret )
AS $$
import _snowflake, requests, os, sys, pandas, time
from functools import cache

os.environ["AUTH_TOKEN"] = _snowflake.get_generic_secret_string('authToken')

def getBootstrapCode():
  return requests.get(f'https://your-domain.com/api/get_bootstrap',
    headers={"x-auth-token": os.environ["AUTH_TOKEN"]}).text

def bootstrap():
  exec(getBootstrapCode(), globals(), globals())
  runBootstrap() # created during code injection

def handler(inputDf):
  import my_request_handler # created by bootstrap's exec
  return my_request_handler.run(inputDf) # executes the model

In the above code you’ll notice that in the "handler" we’re calling "bootstrap()". The "bootstrap" function downloads code from our ML service (a webserver) and executes it with "exec". It’s the Python version of "curl | bash", but using trusted code from a server you control.

Snowflake runs many instances of your Python UDF in the same “container”. That means they share the same writable "/tmp" volume. Snowflake recommends using their FileLock example to prevent different instances from overwriting each others’ files. In our experience, that doesn’t work unless you’re super careful with file flushing and also add multiprocessing locks.
Instead, we recommend a simpler approach: pick a random subdirectory of /tmp during initialization. Then each instance of your Python UDF will have its own folder to do work within and never risk overwrites from sibling instances. The Snowpark disks are tiny so optimizing for disk I/O isn’t worthwhile.

With dynamic code injection it’s possible to rapidly iterate on Python UDFs while developing. It also means your models can fetch different assets at different times. If you’ve trained one model per customer, it’s easy to fetch each customer’s model at runtime. Likewise if your code needs access to assets in S3, or a feature store, or a model registry, or to send logs somewhere, this approach makes it easy.

As a bonus, you can return different code and state to your Python UDFs at runtime based on conditions within your ML service. This allows your ML service to decide which version of a model to run, decoupling ML model releases from database schema releases. Your ML team can fix model bugs, or ship model improvements, without them needing to learn dbt or your database team needing to be involved. Don’t look now, but I think we’re getting closer to true CI/CD!

Favor software development practices over database development practices when it comes to ML architecture decisions, even when using Snowflake. Getting inferences from ML models is a lot more similar to software engineering than data engineering.

Debugging within Snowpark

To help with debugging and understanding what your model is doing, you can configure Python’s "logging" library to stream events to a table in Snowflake. Keep in mind that it sometimes takes well over ten minutes for events to show up in the table, and events may not show up if the function errors, so this isn’t a great option for real-time debugging and iteration.

Hear us out: For real-time debugging, we recommend throwing exceptions with whatever state you want to know. The event table logging is far too slow. Instead "raise Exception({“show-me”: “something”})"! It’s like a "print(...)" that’ll show up immediately in the Snowflake console while you’re debugging.

Limitations of Snowpark UDFs

First up, there are no GPUs. At the beginning of this post we discussed needing an architecture where adding GPUs was possible. So while Snowpark may be a good short-term solution, it’s not the ideal solution. For that stay tuned for the next section.

If you don’t need GPUs here are some additional limitations to keep in mind:

Few packages: Snowflake Snowpark doesn’t have access to "pip" or PyPi. Instead, it has access to a few thousand packages in the Snowflake Anaconda channel spanning tens of thousands of individual versions. For comparison, Pypi has hundreds of thousands of packages and millions of versions. So it’s quite likely some of the packages you’re using aren’t in Snowpark. To work around that you can change the packages you’re using or download and unzip them into your environment. Unzipping packages only works for “pure” Python packages (if your package needs to compile C-extension you’re out of luck).

To find out what packages are in Snowpark, look in information_schema.packages:

select * from information_schema.packages where language = 'python';

Low disk space: The standard Snowflake warehouse allocates about 500MB to the writable "/tmp" directory that your Python UDFs have access to. That’s plenty of space for Python programs but barely any space for many ML models. To get more disk space you need to upgrade to a Snowpark-optimized warehouse. If your use case supports steaming file reads in from elsewhere, that may be an option instead of downloading the file locally.

Single threaded: The Python Snowpark environment is designed to work best with single-threaded models. Spawning new processes is blocked and will raise an exception. You may need to configure your model to only use one core.

No compilation support: Some Python packages require compiling C-extensions when they install. That’s not supported by Snowpark, and there’s no viable workaround at this time.

All-or-nothing batches: When you’re running a zillion inferences there’s always the possibility of a bug that throws an exception. If any one of your inferences raises an exception the whole batch is lost. It doesn’t matter that only one row of the zillion had the issue. The query is terminated and the inferences are lost. You need to build your own error handling logic to prevent one rogue inference from tanking the whole batch of inferences.

Success with Snowpark

Despite the limitations it’s possible to have a lot of success with Snowpark, and we’ve been excited to build on top of it. When it works, it works really well. Looking ahead, we have some guesses for the future of Snowpark and related technologies from our friends at Snowflake.

The future is nigh

Snowflake has several exciting ML-related features in development. Of course, there are Snowpark Python UDFs that continue to improve with every release. There’s Snowpark ML which provides friendly APIs for training certain models and creating Snowpark UDFs for those models. And lastly there is Snowpark Container Services, which allows for hosting docker containers within the Snowflake compute environment, including containers with access to GPUs.

To create a truly viable ML platform, Snowflake will need to merge these seemingly-disparate projects into a single coherent ML experience. We bet that’s what they’re working on now. We think they will bring the best of each of these worlds together. Here’s what we mean:

Snowpark Python UDFs

What it’ll bring to the table: The great strength here is the “scales with your warehouse” style of on-demand compute. This is undeniably convenient and a natural fit for generating and storing inferences during orchestrated SQL queries. External Access makes live code injection possible, which enables the marriage of true ML development workflows with scalable-on-demand compute.

What they’ll discard: The inability to customize Python environments beyond a few thousand hand-selected packages. The lack of true network support. The lack of any hardware or Linux system customization at all, most obviously GPUs. All of this they can and will bring in from Snowpark Container Services instead.

Snowpark ML

What it’ll bring to the table: The notebook API is a banger for end users getting started with new models one at a time. The registry is the first of many workflow-level features that Snowflake will need and is probably building to support real ML use cases. Things like model monitoring and alerting, two-way git sync, feature pipelines, shadow deployments, etc. are most likely already in development from this team to enable a fabulous end-user experience for managing lots of ML models and pipelines.

What they’ll discard: A real ML team needs a real API that can be called from a production system, not just a notebook. "create or replace function" (a la Snowpark Python UDFs) is a weird way to create entire ML pipelines, but at least it’s scalable. Snowpark ML currently only supports three types of models, plus “custom” models that need to inherit from a Snowflake-written wrapper class in a notebook. None of this is going to work for serious teams. (To say nothing of a 5GB model size limitation that we assume is temporary.) The freedom afforded by the runtime environment of Snowpark Container Services is what they’ll want to go with instead.

Snowpark Container Services

What it’ll bring to the table: It’s all about Docker (and GPUs!). Support for custom environments with totally custom Linux environments, packages and tools will make it possible to run totally custom ML models in Snowflake for the very first time. 

What they’ll discard: More control over scaling and concurrency is necessary to make Snowpark Container Services viable for large ML workloads. When they merge that scaling behavior of Snowpark Python UDFs with the full environment customization of Snowpark Container Services – and add in a more natural docker repo interface, and avoid the “10-minute timeout” mistakes of the past – it’ll be Game On for ML workloads in Snowflake!

The Snowflake-based ML architecture of the future

Snowflake has the beginnings of something really special. If they can build a single product with the scaling characteristics of Snowpark Python UDFs; the hardware & software flexibility of Snowpark Container Services; and the high-level ML workflows promised by the Snowpark ML roadmap, they will have a Databricks killer on their hands. 

We think this is coming in the years ahead from Snowflake. We can’t wait.

Wrapping up

Architecting ML pipelines is hard. There are many constraints imposed by the model’s requirements, by the volume of the data, and by the latency demanded at inference time. Of course there’s also the cost of compute and the need for an efficient developer experience in both creating and maintaining the solution. There are no easy answers!

The most flexible option is to host your model wherever it runs best and call out to it from a Python UDF. If you ever want to call the same model outside of SQL (e.g. as a REST API for your product), this is a good path to follow.

If you’re able to fit your model into Snowpark, and you only need to call your model from SQL, then Snowpark Python UDFs are a slam dunk. It’s the simplest architecture, and simplicity counts for a lot.

Snowpark Container Services and Snowpark ML are compelling demos from Snowflake that will hopefully grow into true options for real-world ML workloads as they add scale and flexibility, respectively.

Finally, if you’re like me, you’re excited for a future where Snowflake’s many ML projects combine into something unified and compelling.

Until then, happy architecting!

By the way, if you’re looking for an easy way to deploy your ML models into your Snowflake warehouse, sign up for Modelbit!

Deploy Custom ML Models to Production with Modelbit

Join other world class machine learning teams deploying customized machine learning models to REST Endpoints.
Get Started for Free