5 Pandas DataFrame tricks to make Data Science life easier

The year is 2012. Barack Obama was just elected president on a platform of Hope and Change. A data team is a new and exciting idea, and you are at the forefront. Your small band of scrappy data hackers has just stood up a newfangled cloud data warehouse and cloud BI platform. Now it’s time for the next frontier … data science!

By and large, we’ve lived the life we dreamed of in 2012. We’ve trained machine learning models right from our notebooks. We’ve brought the power of AI to business problems large and small, from inbound lead scoring to product shipping times. scikit-learn and XGBoost. Then TensorFlow and PyTorch. Then GPT-3.

And Pandas DataFrames. So many Pandas DataFrames. Originally built by hedge fund quants to do financial analysis, Pandas and its DataFrames have become the lingua franca of more-efficient data processing in Python, especially for data scientists. Pandas comes with its own types, its own overloaded Python operators, and a compiled C and Cython core built on NumPy that makes it super fast -- and a bit unconventional to the raw Python programmers we were in 2012.

Along the way, we’ve learned some life-saving tips and tricks for data scientists keeping their training and testing data in Pandas DataFrames. Here are our favorites:

1. Apply a whole-column transformation with ".apply"

We love this for applying feature transformations and feature encodings. Let’s say you’ve got a feature “bases_after_hit” with values “1st”, “2nd”, “3rd” and “Home”. (Plus some garbage in the dataset, as you do.) You might want to encode these in a way that preserves their ordering, i.e. a home run is 4 bases, 3rd base is 3 bases, etc.:

{%CODE python%}
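def encode_bases_feature(bases_after_hit: str) -> int:
    # Ordinal encoding: more bases means a larger value.
    feature_map = {
        '1st': 1,
        '2nd': 2,
        '3rd': 3,
        'Home': 4,
    }

    # Anything unrecognized (garbage, missing values) encodes as 0.
    if bases_after_hit in feature_map:
        return feature_map[bases_after_hit]
    return 0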
{%/CODE%}

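We could loop over each row in the training DataFrame ourselves and call our function on every value, but that’s verbose and slow on a large dataset. Instead, we hand the function to .apply and let Pandas run it over the whole column. (Pandas still calls the function once per value, so for a plain lookup like this one, .map with a dictionary is worth considering too.) It looks like this: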

{%CODE python%}
training_df['bases_feature'] = dataset_df['bases_after_hit'].apply(encode_bases_feature)
{%/CODE%}

Boom! One line of quick Pandas goodness.
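To see it end to end, here’s a minimal sketch on a toy DataFrame. The sample values and the `BASES_MAP` name are made up for illustration. It also shows the `.map` alternative, which we’d expect to give the same result for a plain dictionary lookup like this one.

```python
import pandas as pd

# Hypothetical lookup table with the same values as the feature map above.
BASES_MAP = {"1st": 1, "2nd": 2, "3rd": 3, "Home": 4}

def encode_bases_feature(bases_after_hit: str) -> int:
    # Unknown or garbage values fall back to 0.
    return BASES_MAP.get(bases_after_hit, 0)

# Toy data, including one garbage value.
dataset_df = pd.DataFrame({"bases_after_hit": ["1st", "Home", "oops", "3rd"]})

# .apply calls our function once per value in the column.
applied = dataset_df["bases_after_hit"].apply(encode_bases_feature)

# .map with a dict does the lookup directly. Unmatched values become NaN,
# so we fill them with 0 and cast back to int.
mapped = dataset_df["bases_after_hit"].map(BASES_MAP).fillna(0).astype(int)

print(applied.tolist())
print(mapped.tolist())
```

Which one is faster will depend on your data, so it’s worth timing both on a real column before committing.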

2. Bulk-convert types with “.astype”

This quick bulk action lets you rapidly typecast entire series in your DataFrame, or even multiple series!

We use this most often when predicting the probability of a binary outcome. For example, we want to know how likely a sales lead is to convert, or how likely a customer is to churn. Our training data will include a boolean series telling us whether the outcome occurred. Our regression will want a number to target. We simply run:

{%CODE python%}
target_df['converted'] = dataset_df['converted'].astype(int)
{%/CODE%}

This will give us a column of 0’s and 1’s that’s ideal for training.
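For the “multiple series” case, here’s a small sketch. The column names and values are hypothetical. `DataFrame.astype` accepts a mapping of column name to type, so several columns can be converted in one call.

```python
import pandas as pd

# Toy data with hypothetical column names.
dataset_df = pd.DataFrame({
    "converted": [True, False, True],
    "num_seats": ["3", "10", "25"],  # numbers that arrived as strings
})

# Single series: booleans become 0s and 1s.
converted = dataset_df["converted"].astype(int)

# Multiple series at once: pass a {column: dtype} mapping.
typed_df = dataset_df.astype({"converted": int, "num_seats": "int64"})

print(converted.tolist())
print(typed_df["num_seats"].tolist())
```

Note that `.astype` raises an error if a value can’t be converted, so it works best on columns that are already clean.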

3. Drop the NaN’s from a dataset with “.dropna”

You might think that in a perfect world, the training data is perfectly clean. We disagree. In a perfect world, the training data comes right from the real world, which means it’s got all kinds of junk in it. Including missing or unparseable values in a numeric field, which show up as “NaN” (NumPy’s floating-point “not a number” value) once Pandas has parsed the column as numeric.

For most models, we like to drop these rows from the training data, and then handle garbage in production by doing some default behavior when a prediction is unavailable.

Let’s say we’re training a customer churn predictor, and we have a feature that’s the number of active users at the customer. We’ll run:

{%CODE python%}
training_df.dropna(subset=["num_active_users"], inplace=True)
{%/CODE%}

This will drop the whole row if the “num_active_users” value in that row is NaN. Perfect.

Notice we’re running this in place. This is a fine practice, as is doing it as part of the assignment into the training DataFrame, i.e.:

{%CODE python%}
training_df = dataset_df.dropna(subset=["num_active_users"])
{%/CODE%}
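One caveat: if the garbage arrives as text in an otherwise numeric column, Pandas may keep the column as strings rather than producing NaN. Here’s a sketch of how we’d handle that, assuming toy data with made-up values: coerce the column with `pandas.to_numeric(..., errors="coerce")` first, then drop the NaN rows.

```python
import pandas as pd

# Toy data: one unparseable value and one missing value.
dataset_df = pd.DataFrame({
    "customer": ["a", "b", "c", "d"],
    "num_active_users": ["12", "n/a", None, "7"],
})

# Anything that can't be parsed as a number becomes NaN.
dataset_df["num_active_users"] = pd.to_numeric(
    dataset_df["num_active_users"], errors="coerce"
)

# Drop only the rows where this particular feature is NaN.
training_df = dataset_df.dropna(subset=["num_active_users"])

print(training_df)
```

The original `dataset_df` keeps all four rows, so the dropped rows are still around if you want to inspect what got thrown out.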

4. Append new data with “concat”

Appending a DataFrame -- or “concatenating,” in Pandas parlance -- is most useful when updating your training data.

We use this most often when updating a model that’s been in the wild for a little while. We want to start with the exact same training data that was used in training the first model, and then update that with new data. It looks like this:

{%CODE python%}
new_training_df = pandas.concat([old_training_df, new_data_df], ignore_index=True)
{%/CODE%}

Notice the “ignore_index” named parameter. Without it, Pandas will happily duplicate the indices, i.e., if our first DataFrame has 50 rows and our second DataFrame has 25 rows, there will be two rows with index 0, two rows with index 1, etc. up to index 24! Some ML frameworks may not care about DataFrame indices, but some definitely do. Best to be clean about it.
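Here’s a tiny sketch of the difference, using two made-up DataFrames (the names and values are for illustration only):

```python
import pandas as pd

# Hypothetical old and new training data.
old_training_df = pd.DataFrame({"x": [1, 2, 3]})
new_data_df = pd.DataFrame({"x": [4, 5]})

# Without ignore_index, each frame keeps its own row labels.
duplicated = pd.concat([old_training_df, new_data_df])

# With ignore_index=True, the result gets a fresh 0..n-1 index.
clean = pd.concat([old_training_df, new_data_df], ignore_index=True)

print(duplicated.index.tolist())
print(clean.index.tolist())
print(duplicated.index.is_unique, clean.index.is_unique)
```

Checking `.index.is_unique` is a cheap way to catch this before handing the DataFrame to a training job.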

5. Load a CSV into a DataFrame with “read_csv”

Yes, I know. You have a live, secured connection to the data warehouse for all your training data needs. Training and testing data is governed and versioned by your fully-featured MLOps suite 😉. You don’t store company data on your local laptop or the VM where your notebook runs.

While we all aspire to this Garden of Eden, the fact is data comes in from CSVs all the time. A partner uploads critical data to our S3 bucket irregularly in CSV format. Some data from the next team over will greatly improve our model, and the quickest way was a quick CSV dump from their systems. The truth is, CSVs are everywhere.

For large CSV files, skip the Python file open and its resulting iterator. Again, that approach runs as raw Python code, one row at a time. Instead, let Pandas bulk-load the CSV for you:

{%CODE python%}
training_df = pandas.read_csv('path_to_file.csv')
{%/CODE%}

In one line you’ve got a DataFrame with its column names derived from the header row and the types inferred. It’s a great starting point.
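To see the header and type inference in action without a file on disk, here’s a small sketch that feeds `read_csv` an in-memory string. The CSV contents are made up; `read_csv` accepts a file-like object as well as a path.

```python
import io

import pandas as pd

# Stand-in for a real CSV file; the columns and values are hypothetical.
csv_text = (
    "customer,num_active_users,converted\n"
    "acme,12,True\n"
    "globex,7,False\n"
)

training_df = pd.read_csv(io.StringIO(csv_text))

# Column names come from the header row; types are inferred per column.
print(training_df.dtypes)
```

When the inference guesses wrong, `read_csv` also takes `dtype` and `parse_dates` arguments to spell out what you expect.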

We hope these help! We work with DataFrames every day, and these are just some of the tricks we find most helpful. Let us know which ones you like best.
