How to encode categorical features with scikit-learn (video)
In order to include categorical features in your Machine Learning model, you have to encode them numerically using "dummy" or "one-hot" encoding. But how do you do this correctly using scikit-learn?
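Before jumping into the video, here's a quick standalone illustration of what "dummy"/one-hot encoding produces (the feature values below are invented for this example): each category in a column becomes its own binary column.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# A tiny made-up categorical feature
X = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})

# Fit the encoder and transform the feature into binary columns
ohe = OneHotEncoder()
print(ohe.fit_transform(X).toarray())
# [[0. 0. 1.]
#  [1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]

# The learned categories determine the column order (alphabetical by default)
print(ohe.categories_)  # [array(['C', 'Q', 'S'], dtype=object)]
```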
In this 28-minute video, you'll learn:
- How to use OneHotEncoder and ColumnTransformer to encode your categorical features and prepare your feature matrix in a single step
- How to include this step within a Pipeline so that you can cross-validate your model and preprocessing steps simultaneously (a short code sketch follows this list)
- Why you should use scikit-learn (rather than pandas) for preprocessing your dataset
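Here's a minimal sketch of that workflow. The DataFrame, column names, and model below are placeholders invented for this example (the dataset and model used in the video may differ): two categorical features are one-hot encoded, a numeric feature is passed through, and the whole thing is cross-validated as a single Pipeline.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy dataset with made-up values, just to make the sketch runnable
df = pd.DataFrame({
    'Embarked': ['S', 'C', 'Q', 'S', 'C', 'Q', 'S', 'C', 'Q', 'S'],
    'Sex':      ['male', 'female', 'male', 'female', 'male',
                 'female', 'male', 'female', 'male', 'female'],
    'Fare':     [7.25, 71.28, 8.05, 53.10, 8.46, 9.50, 51.86, 21.07, 11.13, 30.07],
    'Survived': [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})
X = df[['Embarked', 'Sex', 'Fare']]
y = df['Survived']

# Step 1: one-hot encode the categorical columns, pass the numeric column through
ct = ColumnTransformer(
    [('onehot', OneHotEncoder(), ['Embarked', 'Sex'])],
    remainder='passthrough')

# Step 2: chain the preprocessing and the model into a single Pipeline
pipe = Pipeline([('preprocessor', ct), ('model', LogisticRegression())])

# Cross-validate the preprocessing and the model together,
# so the encoder is fit only on each training fold
print(cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean())

# Fit on all rows, then predict for new rows with the same columns
pipe.fit(X, y)
X_new = pd.DataFrame({'Embarked': ['Q', 'S'],
                      'Sex': ['male', 'female'],
                      'Fare': [9.0, 40.0]})
print(pipe.predict(X_new))
```

Because the encoder lives inside the Pipeline, cross-validation refits it on each training fold, which avoids leaking information from the validation folds into the preprocessing step.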
If you want to follow along with the code, you can download the Jupyter notebook from GitHub.
Click on a timestamp below to jump to a particular section:
0:22 Why should you use a Pipeline?
2:30 Preview of the lesson
3:35 Loading and preparing a dataset
6:11 Cross-validating a simple model
10:00 Encoding categorical features with OneHotEncoder
15:01 Selecting columns for preprocessing with ColumnTransformer
19:00 Creating a two-step Pipeline
19:54 Cross-validating a Pipeline
21:44 Making predictions on new data
23:43 Recap of the lesson
24:50 Why should you use scikit-learn (rather than pandas) for preprocessing?
Related Resources
- scikit-learn documentation for OneHotEncoder, ColumnTransformer, and Pipeline
- My free course: Introduction to Machine Learning with scikit-learn
- My videos on cross-validation and grid search
- My lesson notebook on StandardScaler