ML at an ML Company: How Modelbit uses Modelbit

Machine Learning at a Machine Learning Company

When you deploy a machine learning model to Modelbit, the model, its supporting code, its Python environment, and its system environment are all packaged into a Docker container and shipped to a cloud production environment. That production environment is then scaffolded with a git repository, a REST API, a logging system, a load balancer and more.

All of this happens automatically, with just one call to “modelbit.deploy()”. But behind the scenes, quite a lot has to happen at once! The biggest, most time-consuming operation is creating the Docker container with the system packages and Python packages automatically detected from the development environment.

It’s important to give our users a sense of how long this process will take. Ideally it only takes a few moments. For some large environments, like those with large neural net frameworks or custom GPU drivers, it can take longer. That’s okay, as long as the user knows what to expect.

Here’s the feature in action! In this post, we’ll show you how we built this and deployed it with our own product:

Predicting Environment Build Times

Fortunately, we have the build logs and timings of the many thousands of production environments that have been built in Modelbit. Here’s a sampling of ten rows of that raw data. As you can see, for each environment, we’ve got the raw pip requirements.txt describing the Python environment, a list of the system packages installed with “apt-get”, and the build time in milliseconds.

That build time is exactly what we want to predict! Which makes this the perfect use case for a regression.

Feature Engineering

For features, we want a single feature for every popular Python package and every popular system package. The feature value will be a boolean value of whether the package is present in the environment or not. In this way, we hope to basically force the regression to memorize the typical build time of each package and package combination that we’ve seen in the wild.

This is definitely overfitting in a sense, but in this case it’s the kind of overfitting that we specifically want. We’ve seen these packages and their combinations before, and we know what to expect. The goal is to get the model to save those weights and spit them back out at the appropriate time.
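To make the idea concrete, here is a minimal sketch of the "one boolean feature per package" encoding on a toy dataset. The environments and package lists below are made up for illustration, and the sketch uses exact name matching, which is simpler than the matching in our real feature code further down.

```python
import pandas as pd

# Hypothetical toy data: each environment is just a list of package names.
environments = [
    ["pandas", "scikit-learn"],
    ["pandas", "torch", "torchvision"],
    ["scikit-learn"],
]

# Every package seen anywhere becomes one feature column.
packages = sorted({p for env in environments for p in env})

# One row per environment, one boolean column per package:
# True when the environment contains that package.
features = pd.DataFrame(
    [{f"py_{p}": p in env for p in packages} for env in environments]
)
print(features)
```

Each row is an environment and each column answers a single yes/no question, which is what lets a tree-based regressor learn a typical cost per package and per package combination.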

With that in mind, here’s our feature engineering code to wrangle this dataset into its appropriate shape:

{%CODE python%}
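import json
import re

import modelbit
import pandas as pd

# Assumed setup: the post uses `mb` without showing where it comes from.
mb = modelbit.login()

def simplePyPackageName(req):
    # Normalize a requirement line to a bare, lowercase package name
    req = req.lower()
    if req.startswith("git+https://"):
        return req.split("/")[-1].replace(".git", "")
    # Cut at the first extras bracket or version operator
    return re.split(r"[\[\]=<>]+", req)[0]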

def parseReqTxt(reqTxt):
    if reqTxt is None:
        return []
    packages = []
    for r in reqTxt.split("\n"):
        rClean = r.lower().strip()
        if rClean == "" or rClean.startswith(("--", "#", "https://")):
            continue
        packages.append(simplePyPackageName(rClean))
    return packages

def parseSysPackages(sysPkgs):
    if sysPkgs is None or sysPkgs.strip() == "":
        return []
    return json.loads(sysPkgs)

def aggregatePyPackages(packageCounts, reqTxt):
    for p in parseReqTxt(reqTxt):
        if p not in packageCounts:
            packageCounts[p] = 0
        packageCounts[p] += 1

def aggregateSysPackages(packageCounts, sysPkgs):
    for p in parseSysPackages(sysPkgs):
        if p not in packageCounts:
            packageCounts[p] = 0
        packageCounts[p] += 1

def gatherAggregates(df):
    pyPackageCounts = {}
    sysPackageCounts = {}
    df["requirementsTxt"].fillna("").map(lambda x: aggregatePyPackages(pyPackageCounts, x))
    df["systemPackages"].fillna("").map(lambda x: aggregateSysPackages(sysPackageCounts, x))
    return pyPackageCounts, sysPackageCounts

_rawData = mb.get_dataset("environments")
_pyPackageCounts, _sysPackageCounts = gatherAggregates(_rawData)
{%/CODE%}

This is a lot of code all at once, but basically all we’re doing is parsing out the text of the requirements.txt file and the list of system packages, making sure the names are normalized, and putting it into two structured dictionaries. You can see the partial results here:
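As a quick illustration of the name normalization, the sketch below runs the same `simplePyPackageName` logic on a few requirement lines. The sample lines, including the git URL, are invented for this example and are not taken from our dataset.

```python
import re

def simplePyPackageName(req):
    # Same logic as in the post: lowercase, then strip extras and version pins.
    req = req.lower()
    if req.startswith("git+https://"):
        return req.split("/")[-1].replace(".git", "")
    return re.split(r"[\[\]=<>]+", req)[0]

# Hypothetical requirement lines covering the main cases.
samples = [
    "Pandas==2.0.1",
    "scikit-learn>=1.1",
    "uvicorn[standard]==0.23.0",
    "git+https://github.com/example-org/example-pkg.git",
]

names = [simplePyPackageName(s) for s in samples]
for s, n in zip(samples, names):
    print(f"{s!r} -> {n!r}")
```

Version pins, extras, and casing all collapse to one name per package, so the counts in the dictionaries refer to packages rather than to specific versions.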

From here, we want to filter this list down to the most common packages to avoid noise. If we’ve only ever seen a package once, including it in the training data is going to cause more harm than good.

{%CODE python%}
def filterToCommonPackages(packageCounts, minCount=3):
    # Keep only packages seen at least minCount times
    return {p for p, v in packageCounts.items() if v >= minCount}

_commonPyPackages = filterToCommonPackages(_pyPackageCounts)
_commonSysPackages = filterToCommonPackages(_sysPackageCounts)
{%/CODE%}

Careful readers will notice that the threshold for a “common” package is set at 3. This is effectively a model hyperparameter in that it’s a pre-training constant and changing it affects the model that gets trained and therefore that model’s accuracy.

In tuning this number, we found that a relatively high threshold of 20 package occurrences left the overall accuracy score unchanged, but made our predictions worse on very large environments, like those with torchvision and CUDA drivers. There aren’t enough of those environments to move the overall accuracy score. Still, we made the business decision that getting them right was worth the risk of overfitting, and perhaps even of being a bit wrong on average-size environments. It’s the very large environments where the user benefit of an accurate wait time estimate is highest.
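Here is a small sketch of why the threshold matters for rare but heavy packages. The counts are hypothetical and chosen only to show the effect: a package like torchvision that appears a handful of times survives a threshold of 3 but disappears from the feature set at 20.

```python
from collections import Counter

# Hypothetical occurrence counts, not our real data.
packageCounts = Counter({
    "pandas": 120,
    "scikit-learn": 95,
    "torchvision": 4,
    "obscure-lib": 1,
})

def filterToCommonPackages(packageCounts, minCount=3):
    # Keep only packages seen at least minCount times.
    return {p for p, v in packageCounts.items() if v >= minCount}

# Compare the feature vocabulary at a low and a high threshold.
keptByThreshold = {
    minCount: sorted(filterToCommonPackages(packageCounts, minCount))
    for minCount in (3, 20)
}
for minCount, kept in keptByThreshold.items():
    print(minCount, kept)
```

Once a package drops out of the vocabulary, the regressor has no feature for it at all, so environments dominated by that package look smaller than they are.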

Here’s the output of the above code. You can see we’ve got a list of the most common packages that we can use for filtering.

Now that we’ve got a list of common packages, we want to put our training DataFrame together! To do so, we’ll go back to the original list of environments, each with its list of packages and its actual build time in milliseconds. We’ll pivot that dataset into one with a boolean value for each of the common packages; we’ll drop uncommon package information on the floor. Here’s the code:

{%CODE python%}
def hasPyPackage(reqTxt, packageName):
    for p in parseReqTxt(reqTxt):
        if p.startswith(packageName):
            return True
    return False

def hasSysPackage(sysPackages, packageName):
    return packageName in parseSysPackages(sysPackages)

def expandToFeatures(df, commonPyPackages, commonSysPackages):
    df['requirementsTxt'] = df['requirementsTxt'].fillna("").str.lower()
    df['systemPackages'] = df['systemPackages'].fillna("")
    features = []
    for p in commonPyPackages:
        features.append(
            pd.Series(
                df["requirementsTxt"].map(lambda x: hasPyPackage(x, p)),
                name=f"py_{p}"))

    for p in commonSysPackages:
        features.append(
            pd.Series(
                df["systemPackages"].map(lambda x: hasSysPackage(x, p)),
                name=f"sys_{p}"))

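    df = pd.concat([df, *features], axis=1)
    featureNames = [str(s.name) for s in features]
    # "id" and "buildTimeMs" exist in the training data but not in inference requests
    idCols = ["id"] if "id" in df.columns else []
    X = df[[*idCols, *featureNames]]
    y = df["buildTimeMs"] if "buildTimeMs" in df.columns else None
    return X, y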

X, y = expandToFeatures(_rawData, _commonPyPackages, _commonSysPackages)
{%/CODE%}

The example output gives us a sense of what the training DataFrame looks like:

Training The Model

As with a lot of real-world production models, the model technology itself and the training step are relatively simple compared to feature selection and feature engineering. If you’ve got good data that’s predictive of the result, a lot of technologies will work for you.

We chose XGBoost for our regression because it’s well-documented, battle-tested, and it gave us the ability to tune some of the hyperparameters. Here’s the code:

{%CODE python%}
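import xgboost as xgb
from IPython.display import display
from sklearn.model_selection import train_test_split

def trainRegressor(X, y):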
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

    regressor = xgb.XGBRegressor(objective='reg:squarederror',
                        learning_rate=0.1,
                        max_depth=10,
                        n_estimators=50)
    regressor.fit(X_train.drop("id", axis=1), y_train)

    predictions = regressor.predict(X_test.drop("id", axis=1))
    y_pred = pd.Series(predictions, index=X_test.index)

    analysis = pd.concat(
        [
            X_test["id"],
            pd.Series(y_test, name="Actual") / 60_000,
            pd.Series(y_pred, name="Prediction", index=y_test.index) / 60_000,
            pd.Series((y_pred - y_test).abs() / 60_000, name="Error")
        ],
        axis=1)

    avgError = analysis['Error'].mean()
    print(f"Average error minutes: {avgError}\n")

    print("Highest error:")
    display(analysis.sort_values(by="Error", ascending=False).head(10))
    return regressor

regressor = trainRegressor(X, y)
{%/CODE%}

While we did try a number of hyperparameter values for the regression training step itself, none of them moved the accuracy score by more than 10% or so. More notable was the time we spent crafting our own accuracy score: the average absolute error, in minutes, when predicting historical environment build times.
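The accuracy score itself is a mean absolute error converted from milliseconds to minutes. A minimal standalone sketch, with a hypothetical helper name (`avgErrorMinutes`) and made-up build times:

```python
import pandas as pd

def avgErrorMinutes(yTrueMs, yPredMs):
    """Mean absolute error between actual and predicted build times, in minutes."""
    return ((yPredMs - yTrueMs).abs() / 60_000).mean()

# Hypothetical build times in milliseconds.
actual = pd.Series([60_000, 120_000, 600_000])
predicted = pd.Series([90_000, 120_000, 480_000])

score = avgErrorMinutes(actual, predicted)
print(f"Average error minutes: {score:.3f}")
```

Reporting the error in minutes rather than as a squared error keeps the number directly comparable to what the user sees: a wait-time estimate.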

You can see here that our predictor is off by less than 30 seconds in the average case. Since it’s powering a feature whose estimates are measured in minutes, this isn’t bad at all.

You’ll also notice that we kept an eye on the highest-error environments to make sure we’re never too far off. We tend to overestimate long environment build times by a couple of minutes. These tend to be large, uncommon environments. As they become more popular, we expect the regression to get better at predicting them. These are the environments whose predictions were improved by bringing the common package threshold down to 3, without impacting the overall accuracy. A good tradeoff.

Deploying The Model

Of course we used Modelbit to deploy this model!

{%CODE python%}
def environmentBuildTiming(df):
    expandedDf, _ = expandToFeatures(df, _commonPyPackages, _commonSysPackages)
    featDf = expandedDf[regressor.get_booster().feature_names]
    predictions = list(regressor.predict(featDf))
    # Cast to plain floats so the response is JSON-serializable
    return [{"estBuildTimeMs": float(p)} for p in predictions]

exampleDf = pd.DataFrame.from_dict({
    "requirementsTxt": ["scikit-learn==1.1.2\npandas==2.0.1"],
    "systemPackages": ['["libgomp1"]']
})

mb.deploy(environmentBuildTiming, dataframe_mode=True, example_dataframe=exampleDf)
{%/CODE%}

This gave us a REST API that we call directly from our product, supplying it with the raw text of a requirements.txt file and a list of system packages. The API returns the predicted build time in milliseconds.
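Since the estimate comes back in milliseconds, the caller has to turn it into something a user can read. The sketch below shows one way that might look. The helper `formatEstimate`, the rounding rules, and the sample response value are all assumptions made for illustration, not our actual product code.

```python
def formatEstimate(estBuildTimeMs):
    """Turn a millisecond estimate into a short, rounded message (hypothetical helper)."""
    seconds = estBuildTimeMs / 1000
    if seconds < 60:
        return "less than a minute"
    minutes = round(seconds / 60)
    return f"about {minutes} minute{'s' if minutes != 1 else ''}"

# Hypothetical response item in the shape returned by the deployed function.
response = {"estBuildTimeMs": 154_000.0}
message = formatEstimate(response["estBuildTimeMs"])
print(message)
```

Rounding to whole minutes fits the accuracy of the model: with an average error under 30 seconds, anything more precise would overstate what the estimate can promise.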

And yes, of course, the predicted build time of the regression that predicts build times was quite accurate!

Future Work

Our strategy of building a boolean feature for every Python package and system package that we’ve seen frequently makes the model very good at predicting times for packages, and package combinations, that we’ve seen many times before. The model is less good at predicting build times for packages we’ve never seen before, especially if they’re the only package in the environment!

As always, more data would improve the situation. Metadata like package size on disk might have been helpful, except for a recent trend in Python packages to download very large metadata files at install time. This causes them to look small when they are not. There might be other sources of package metadata, such as author organization or explicit relationships between packages, that could be helpful here. As with all things Modelbit, we will continue to improve the technology every day.

