How to encode categorical features with scikit-learn (video)
In order to include categorical features in your Machine Learning model, you have to encode them numerically using "dummy" or "one-hot" encoding. But how do you do this correctly using scikit-learn?
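To make the idea concrete, here is a minimal sketch of what one-hot encoding produces (the "Color" feature and its values are hypothetical, just for illustration):

```python
from sklearn.preprocessing import OneHotEncoder

# hypothetical categorical feature with three distinct categories
X = [["red"], ["green"], ["blue"], ["green"]]

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix; toarray() shows it densely
print(encoder.fit_transform(X).toarray())
# each row becomes a 0/1 vector with one column per category,
# ordered alphabetically: blue, green, red
print(encoder.categories_)
```

Each category gets its own binary column, so the model never sees an artificial numeric ordering between categories.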
In this 28-minute video, you'll learn:
- How to use ColumnTransformer to encode your categorical features and prepare your feature matrix in a single step
- How to include this step within a Pipeline so that you can cross-validate your model and preprocessing steps simultaneously
- Why you should use scikit-learn (rather than pandas) for preprocessing your dataset
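The workflow described above might look roughly like this (a minimal sketch on a hypothetical toy dataset; the column names, model choice, and parameters are illustrative, not taken from the video):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# hypothetical tiny dataset: two categorical columns, one numeric
X = pd.DataFrame({
    "Embarked": ["S", "C", "S", "Q", "C", "S", "Q", "S"],
    "Sex": ["male", "female", "female", "male",
            "male", "female", "male", "female"],
    "Fare": [7.25, 71.28, 8.05, 8.46, 26.55, 13.0, 7.75, 30.07],
})
y = [0, 1, 1, 0, 0, 1, 0, 1]

# one-hot encode the categorical columns, pass the numeric column through
ct = ColumnTransformer(
    [("onehot", OneHotEncoder(), ["Embarked", "Sex"])],
    remainder="passthrough",
)

# chain preprocessing and model so that cross-validation
# re-fits BOTH steps on each training fold
pipe = Pipeline([("preprocess", ct), ("model", LogisticRegression())])
scores = cross_val_score(pipe, X, y, cv=2, scoring="accuracy")
print(scores.mean())
```

Because the encoder lives inside the Pipeline, it is fit only on each training fold, which avoids leaking information from the validation fold into the preprocessing.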
If you want to follow along with the code, you can download the Jupyter notebook from GitHub.
Click on a timestamp below to jump to a particular section:
0:22 Why should you use a Pipeline?
2:30 Preview of the lesson
3:35 Loading and preparing a dataset
6:11 Cross-validating a simple model
10:00 Encoding categorical features with OneHotEncoder
15:01 Selecting columns for preprocessing with ColumnTransformer
19:00 Creating a two-step Pipeline
19:54 Cross-validating a Pipeline
21:44 Making predictions on new data
23:43 Recap of the lesson
24:50 Why should you use scikit-learn (rather than pandas) for preprocessing?
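As a rough sketch of the "making predictions on new data" step: once a Pipeline is fitted, it applies the same learned encoding to new rows automatically before predicting (the toy data and model below are hypothetical, not from the video):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# hypothetical training data: one categorical and one numeric column
X = pd.DataFrame({"Color": ["red", "green", "blue", "green"],
                  "Size": [1.0, 2.0, 3.0, 2.5]})
y = [0, 1, 0, 1]

pipe = Pipeline([
    ("preprocess", ColumnTransformer(
        # handle_unknown="ignore" encodes unseen categories as all zeros
        [("onehot", OneHotEncoder(handle_unknown="ignore"), ["Color"])],
        remainder="passthrough")),
    ("model", LogisticRegression()),
])
pipe.fit(X, y)

# new data keeps the same raw columns; the fitted Pipeline
# applies the learned encoding before predicting
X_new = pd.DataFrame({"Color": ["blue", "red"], "Size": [2.2, 1.1]})
print(pipe.predict(X_new))
```

This is the payoff of putting preprocessing inside the Pipeline: there is no separate encoding step to remember (or get wrong) at prediction time.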
Related resources:
- scikit-learn documentation for OneHotEncoder, ColumnTransformer, and Pipeline
- My video series: Introduction to Machine Learning in Python
- My videos on cross-validation and grid search
- My lesson notebook on StandardScaler
P.S. Want to master Machine Learning in Python? Enroll in my online course, Machine Learning with Text in Python!
NEW VIDEO: How do I encode categorical features using scikit-learn? 📺
- How to use OneHotEncoder & ColumnTransformer
- How to include this step within a Pipeline
- Why *not* to use pandas for dummy encoding
#Python #MachineLearning pic.twitter.com/JokO0FahQX
— Kevin Markham (@justmarkham) November 12, 2019