How to encode categorical features with scikit-learn (video)

To include categorical features in your machine learning model, you have to encode them numerically using "dummy" or "one-hot" encoding. But how do you do this correctly using scikit-learn?
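
As a quick illustration of what one-hot encoding does, here is a minimal sketch using a made-up column (the column name "Embarked" and its values are invented for illustration, not taken from the video): each category becomes its own binary column.

    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder

    # A made-up categorical column (the name is only an illustration)
    df = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})

    # sparse_output=False returns a regular NumPy array instead of a sparse matrix
    # (on scikit-learn versions before 1.2, this parameter is named "sparse")
    ohe = OneHotEncoder(sparse_output=False)
    print(ohe.fit_transform(df[['Embarked']]))
    # [[0. 0. 1.]
    #  [1. 0. 0.]
    #  [0. 1. 0.]
    #  [0. 0. 1.]]
    print(ohe.categories_)   # [array(['C', 'Q', 'S'], dtype=object)]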

In this 28-minute video, you'll learn:

  • How to use OneHotEncoder and ColumnTransformer to encode your categorical features and prepare your feature matrix in a single step
  • How to include this step within a Pipeline so that you can cross-validate your model and preprocessing steps simultaneously (see the sketch below this list)
  • Why you should use scikit-learn (rather than pandas) for preprocessing your dataset
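
For a preview of how the first two pieces fit together, here is a rough sketch of the workflow. The tiny dataset and the column names ("Embarked", "Sex", "Fare", "Survived") are invented for illustration; the video walks through each step with a real dataset.

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder

    # Toy data: two categorical features, one numerical feature, and a target
    df = pd.DataFrame({
        'Embarked': ['S', 'C', 'Q', 'S', 'C', 'Q', 'S', 'C'],
        'Sex': ['male', 'female', 'female', 'male', 'male', 'female', 'male', 'female'],
        'Fare': [7.25, 71.28, 8.05, 8.46, 9.50, 13.00, 26.55, 30.07],
        'Survived': [0, 1, 1, 0, 0, 1, 0, 1],
    })
    X = df[['Embarked', 'Sex', 'Fare']]
    y = df['Survived']

    # One-hot encode the categorical columns and pass 'Fare' through unchanged,
    # so the entire feature matrix is prepared in a single step.
    # handle_unknown='ignore' keeps this tiny toy dataset from erroring when a
    # category appears in a test fold but not in the matching training fold.
    ct = ColumnTransformer(
        [('onehot', OneHotEncoder(handle_unknown='ignore'), ['Embarked', 'Sex'])],
        remainder='passthrough')

    # Chain the preprocessing and the model into a two-step Pipeline
    pipe = Pipeline([('preprocess', ct), ('model', LogisticRegression())])

    # Cross-validate the preprocessing and the model simultaneously
    print(cross_val_score(pipe, X, y, cv=3, scoring='accuracy').mean())

    # Fit the Pipeline on all of the data, then make predictions on new data
    pipe.fit(X, y)
    X_new = pd.DataFrame({'Embarked': ['Q'], 'Sex': ['female'], 'Fare': [10.0]})
    print(pipe.predict(X_new))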

If you want to follow along with the code, you can download the Jupyter notebook from GitHub.

Click on a timestamp below to jump to a particular section:

0:22 Why should you use a Pipeline?
2:30 Preview of the lesson
3:35 Loading and preparing a dataset
6:11 Cross-validating a simple model
10:00 Encoding categorical features with OneHotEncoder
15:01 Selecting columns for preprocessing with ColumnTransformer
19:00 Creating a two-step Pipeline
19:54 Cross-validating a Pipeline
21:44 Making predictions on new data
23:43 Recap of the lesson
24:50 Why should you use scikit-learn (rather than pandas) for preprocessing?