How to encode categorical features with scikit-learn (video)
In order to include categorical features in your Machine Learning model, you have to encode them numerically using "dummy" or "one-hot" encoding. But how do you do this correctly using scikit-learn?
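Before jumping into the video, here's a quick standalone illustration of what "dummy"/one-hot encoding produces (the feature values below are invented for this example): each category in a column becomes its own binary column.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# A tiny made-up categorical feature
X = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})

# Fit the encoder and transform the feature into binary columns
ohe = OneHotEncoder()
print(ohe.fit_transform(X).toarray())
# [[0. 0. 1.]
#  [1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]

# The learned categories determine the column order (alphabetical by default)
print(ohe.categories_)  # [array(['C', 'Q', 'S'], dtype=object)]
```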
In this 28-minute video, you'll learn:
- How to use OneHotEncoder and ColumnTransformer to encode your categorical features and prepare your feature matrix in a single step
- How to include this step within a Pipeline so that you can cross-validate your model and preprocessing steps simultaneously (a short code sketch follows this list)
- Why you should use scikit-learn (rather than pandas) for preprocessing your dataset
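Here's a minimal sketch of that workflow. The DataFrame, column names, and model below are placeholders invented for this example (the dataset and model used in the video may differ): two categorical features are one-hot encoded, a numeric feature is passed through, and the whole thing is cross-validated as a single Pipeline.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy dataset with made-up values, just to make the sketch runnable
df = pd.DataFrame({
    'Embarked': ['S', 'C', 'Q', 'S', 'C', 'Q', 'S', 'C', 'Q', 'S'],
    'Sex':      ['male', 'female', 'male', 'female', 'male',
                 'female', 'male', 'female', 'male', 'female'],
    'Fare':     [7.25, 71.28, 8.05, 53.10, 8.46, 9.50, 51.86, 21.07, 11.13, 30.07],
    'Survived': [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})
X = df[['Embarked', 'Sex', 'Fare']]
y = df['Survived']

# Step 1: one-hot encode the categorical columns, pass the numeric column through
ct = ColumnTransformer(
    [('onehot', OneHotEncoder(), ['Embarked', 'Sex'])],
    remainder='passthrough')

# Step 2: chain the preprocessing and the model into a single Pipeline
pipe = Pipeline([('preprocessor', ct), ('model', LogisticRegression())])

# Cross-validate the preprocessing and the model together,
# so the encoder is fit only on each training fold
print(cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean())

# Fit on all rows, then predict for new rows with the same columns
pipe.fit(X, y)
X_new = pd.DataFrame({'Embarked': ['Q', 'S'],
                      'Sex': ['male', 'female'],
                      'Fare': [9.0, 40.0]})
print(pipe.predict(X_new))
```

Because the encoder lives inside the Pipeline, cross-validation refits it on each training fold, which avoids leaking information from the validation folds into the preprocessing step.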
If you want to follow along with the code, you can download the Jupyter notebook from GitHub.
Click on a timestamp below to jump to a particular section:
0:22 Why should you use a Pipeline?
2:30 Preview of the lesson
3:35 Loading and preparing a dataset
6:11 Cross-validating a simple model
10:00 Encoding categorical features with OneHotEncoder
15:01 Selecting columns for preprocessing with ColumnTransformer
19:00 Creating a two-step Pipeline
19:54 Cross-validating a Pipeline
21:44 Making predictions on new data
23:43 Recap of the lesson
24:50 Why should you use scikit-learn (rather than pandas) for preprocessing?
Related Resources
- scikit-learn documentation for OneHotEncoder, ColumnTransformer, and Pipeline
- My free course: Introduction to Machine Learning with scikit-learn
- My videos on cross-validation and grid search
- My lesson notebook on StandardScaler