April 25, 2024

How to prevent data leakage in pandas & scikit-learn ☔

Let's pretend you're working on a supervised Machine Learning problem using Python's scikit-learn library. Your training data is in a pandas DataFrame, and you discover missing values in a column that you were planning to use as a feature.

After considering your options, you decide to impute the missing values, which means that you're going to fill in the missing values with reasonable values.

How should you perform the imputation?

Option 1 is to fill in the missing values in pandas, and then pass the transformed data to scikit-learn.
Option 2 is to pass the original data to scikit-learn, and then perform all data transformations (including missing value imputation) within scikit-learn.

Option 1 will cause data leakage, whereas option 2 will prevent data leakage.

Here are questions you might be asking:

What is data leakage?
Why is data leakage problematic?
Why would data leakage result from missing value imputation in pandas?
How can I prevent data leakage when using pandas and scikit-learn?

Answers below! 👇

What is data leakage?

Data leakage occurs when you inadvertently include knowledge from testing data when training a Machine Learning model.

Why is data leakage problematic?

Data leakage is problematic because it will cause your model evaluation scores to be less reliable. This may lead you to make bad decisions when tuning hyperparameters, and it will lead you to overestimate how well your model will perform on new data.

It's hard to know whether data leakage will skew your evaluation scores by a negligible amount or a huge amount, so it's best to just avoid data leakage entirely.

Why would data leakage result from missing value imputation in pandas?

Your model evaluation procedure (such as cross-validation) is supposed to simulate the future, so that you can accurately estimate right now how well your model will perform on new data.

But if you impute missing values on your whole dataset in pandas and then pass your dataset to scikit-learn, your model evaluation procedure will no longer be an accurate simulation of reality. That's because the imputation values will be based on your entire dataset (meaning both the training portion and the testing portion), whereas the imputation values should just be based on the training portion.

In other words, imputation based on the entire dataset is like peeking into the future and then using what you learned from the future during model training, which is definitely not allowed.

How can we avoid this in pandas?

You might think that one way around this problem would be to split your dataset into training and testing sets and then impute missing values using pandas. (Specifically, you would need to learn the imputation value from the training set and then use it to fill in both the training and testing sets.)

That would work if you're only ever planning to use train/test split for model evaluation, but it would not work if you're planning to use cross-validation. That's because during 5-fold cross-validation (for example), the rows contained in the training set will change 5 times, and thus it's quite impractical to avoid data leakage if you use pandas for imputation while using cross-validation!

How else can data leakage arise?

So far, I've only mentioned data leakage in the context of missing value imputation. But there are other transformations that if done in pandas on the full dataset will also cause data leakage.

For example, feature scaling in pandas would lead to data leakage, and even one-hot encoding (or "dummy encoding") in pandas would lead to data leakage unless there's a known, fixed set of categories.

More generally, any transformation which incorporates information about other rows when transforming a row will lead to data leakage if done in pandas.

How does scikit-learn prevent data leakage?

Now that you've learned how data transformations in pandas can cause data leakage, I'll briefly mention three ways in which scikit-learn prevents data leakage:

First, scikit-learn transformers have separate fit and transform steps, which allow you to base your data transformations on the training set only, and then apply those transformations to both the training set and the testing set.
Second, the fit and predict methods of a Pipeline encapsulate all calls to fit_transform and transform so that they're called at the appropriate times.
Third, cross_val_score splits the data prior to performing data transformations, which ensures that the transformers only learn from the temporary training sets that are created during cross-validation.

Conclusion

When working on a Machine Learning problem in Python, I recommend performing all of your data transformations in scikit-learn, rather than performing some of them in pandas and then passing the transformed data to scikit-learn.

Besides helping you to prevent data leakage, this enables you to tune the transformer and model hyperparameters simultaneously, which can lead to a better performing model!

One final note...

This post is an excerpt from my upcoming video course, Master Machine Learning with scikit-learn.

Join the waitlist below to get free lessons from the course and a special launch discount 👇

New? Start here!

Log in / Sign up for courses

Get weekly tips 💌

About Data School