Introduction to machine learning in Python with scikit-learn (video series)
In the data science course that I teach for General Assembly, we spend a lot of time using scikit-learn, Python's library for machine learning. I love teaching scikit-learn, but it has a steep learning curve, and I find that few scikit-learn resources are aimed at machine learning beginners. So I decided to create a series of scikit-learn video tutorials, which I launched in April in partnership with Kaggle!
The series now contains 10 video tutorials totaling 4.5 hours. My goal with this series is to help motivated individuals gain a thorough grasp of both machine learning fundamentals and the scikit-learn workflow. I don't presume any familiarity with machine learning, which is why the first video focuses exclusively on answering the question, "What is machine learning, and how does it work?" And although the series does assume that you have some familiarity with Python, the second video contains my suggested resources for learning Python if you're just getting started with the language.
I've embedded the video playlist below, or you can watch it on YouTube. I've also listed the agenda for each video, along with links to the blog post and Jupyter Notebook associated with each video. (My GitHub repository contains all of the Notebooks, which may be useful as reference material!)
Update: In 2018, I updated the notebooks to be compatible with scikit-learn 0.19.1 and Python 3.6. You can read about those changes here.
I hope you enjoy the series, and I welcome your comments and questions. Please subscribe to my YouTube channel to be notified when new videos are released!
List of videos
What is machine learning, and how does it work? (video, notebook)
- What is machine learning?
- What are the two main categories of machine learning?
- What are some examples of machine learning?
- How does machine learning "work"?
Setting up Python for machine learning: scikit-learn and Jupyter Notebook (video, notebook)
- What are the benefits and drawbacks of scikit-learn?
- How do I install scikit-learn?
- How do I use the Jupyter Notebook?
- What are some good resources for learning Python?
Getting started in scikit-learn with the famous iris dataset (video, notebook)
- What is the famous iris dataset, and how does it relate to machine learning?
- How do we load the iris dataset into scikit-learn?
- How do we describe a dataset using machine learning terminology?
- What are scikit-learn's four key requirements for working with data?
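As a minimal sketch of the loading step, the snippet below uses scikit-learn's built-in `load_iris` function and then inspects a few properties that scikit-learn generally expects of its input: features and response stored as separate objects, numeric values, NumPy arrays, and matching shapes. The names `X` and `y` are a common convention, not a requirement.

```python
import numpy as np
from sklearn.datasets import load_iris

# Load the dataset bundled with scikit-learn (no download needed)
iris = load_iris()

# Features (one row per flower, one column per measurement) and response
X = iris.data
y = iris.target

print(iris.feature_names)   # names of the four measurements
print(iris.target_names)    # names of the three species

# Both objects are numeric NumPy arrays
print(type(X), X.dtype)
print(type(y), y.dtype)

# The first dimension of X matches the length of y
print(X.shape)  # (150, 4)
print(y.shape)  # (150,)
```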
Training a machine learning model with scikit-learn (video, notebook)
- What is the K-nearest neighbors classification model?
- What are the four steps for model training and prediction in scikit-learn?
- How can I apply this pattern to other machine learning models?
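The train-and-predict pattern discussed in this video can be sketched roughly as follows: import the model class, instantiate it, fit it, and predict. The value `n_neighbors=1` and the two made-up observations are illustrative choices for this sketch only, and `LogisticRegression` is included just to show that a different estimator follows the same pattern.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
# Step 1: import the class you plan to use
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Step 2: instantiate the estimator (illustrative hyperparameter value)
knn = KNeighborsClassifier(n_neighbors=1)

# Step 3: fit the model with data (the model learns in place)
knn.fit(X, y)

# Step 4: predict the response for new observations (made-up measurements)
new_observations = [[3, 5, 4, 2], [5, 4, 3, 2]]
knn_predictions = knn.predict(new_observations)
print(knn_predictions)

# The same pattern with a different model; max_iter is raised so the
# solver has room to converge on this dataset
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X, y)
logreg_predictions = logreg.predict(new_observations)
print(logreg_predictions)
```

Because every scikit-learn estimator exposes `fit` and `predict`, swapping models usually means changing only the import and the instantiation line.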
Comparing machine learning models in scikit-learn (video, notebook)
- How do I choose which model to use for my supervised learning task?
- How do I choose the best tuning parameters for that model?
- How do I estimate the likely performance of my model on out-of-sample data?
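One common way to estimate out-of-sample performance is a train/test split, sketched below under a few assumptions: the `test_size`, `random_state`, and candidate models are arbitrary choices for illustration, and testing accuracy is used as the comparison metric.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out part of the data so the model is evaluated on observations
# it did not see during training (split sizes are illustrative)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=4
)

# Candidate models to compare on the same held-out data
candidates = {
    "KNN (K=5)": KNeighborsClassifier(n_neighbors=5),
    "KNN (K=1)": KNeighborsClassifier(n_neighbors=1),
    "Logistic regression": LogisticRegression(max_iter=1000),
}

test_accuracy = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = accuracy_score(y_test, model.predict(X_test))
    print(name, round(test_accuracy[name], 3))
```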
Data science pipeline: pandas, seaborn, scikit-learn (video, notebook)
- How do I use the pandas library to read data into Python?
- How do I use the seaborn library to visualize data?
- What is linear regression, and how does it work?
- How do I train and interpret a linear regression model in scikit-learn?
- What are some evaluation metrics for regression problems?
- How do I choose which features to include in my model?
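The regression workflow can be sketched along these lines. To keep the example self-contained, it generates a synthetic stand-in dataset rather than reading a real file (in practice you would likely use `pd.read_csv`); the column names, coefficients, and the helper `holdout_rmse` are all assumptions made for this sketch, and the seaborn visualization step is omitted.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT a real dataset): the response depends on
# two of the three features plus random noise
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "TV": rng.uniform(0, 300, 200),
    "Radio": rng.uniform(0, 50, 200),
    "Newspaper": rng.uniform(0, 100, 200),
})
data["Sales"] = 3 + 0.05 * data["TV"] + 0.2 * data["Radio"] + rng.normal(0, 1, 200)


def holdout_rmse(feature_cols):
    """Fit linear regression on a training split and return (model, test RMSE)."""
    X = data[feature_cols]
    y = data["Sales"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
    model = LinearRegression().fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    return model, rmse


# Fit with all features and inspect the intercept and coefficients
linreg, rmse_all = holdout_rmse(["TV", "Radio", "Newspaper"])
print(linreg.intercept_)
print(dict(zip(["TV", "Radio", "Newspaper"], linreg.coef_)))

# One simple feature selection check: does dropping a feature change RMSE?
_, rmse_without_newspaper = holdout_rmse(["TV", "Radio"])
print(rmse_all, rmse_without_newspaper)
```

Comparing the error on held-out data with and without a feature is only one of several ways to choose features; it is shown here because it reuses the train/test split idea from the previous video.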
Cross-validation for parameter tuning, model selection, and feature selection (video, notebook)
- What is the drawback of using the train/test split procedure for model evaluation?
- How does K-fold cross-validation overcome this limitation?
- How can cross-validation be used for selecting tuning parameters, choosing between models, and selecting features?
- What are some possible improvements to cross-validation?
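A minimal sketch of K-fold cross-validation for choosing a tuning parameter, using scikit-learn's `cross_val_score`. The choice of 10 folds, the accuracy metric, and the range of K values tried are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 10-fold cross-validation for a single model: each fold serves once as
# the testing set, and the ten accuracy scores are then averaged
knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X, y, cv=10, scoring="accuracy")
print(scores.mean())

# Repeat for a range of K values to compare tuning parameters
k_range = range(1, 31)
k_scores = []
for k in k_range:
    model = KNeighborsClassifier(n_neighbors=k)
    k_scores.append(cross_val_score(model, X, y, cv=10, scoring="accuracy").mean())

# K with the highest mean cross-validated accuracy
best_k = k_range[k_scores.index(max(k_scores))]
print(best_k)
```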
Efficiently searching for optimal tuning parameters (video, notebook)
- How can K-fold cross-validation be used to search for an optimal tuning parameter?
- How can this process be made more efficient?
- How do you search for multiple tuning parameters at once?
- What do you do with those tuning parameters before making real predictions?
- How can the computational expense of this process be reduced?
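The search described in this video can be sketched with `GridSearchCV`, which automates a cross-validated loop over parameter values, and `RandomizedSearchCV`, which tries only a random subset of combinations. The parameter ranges, `n_iter`, and `random_state` below are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier()

# Search two tuning parameters at once (illustrative ranges)
param_grid = {
    "n_neighbors": list(range(1, 31)),
    "weights": ["uniform", "distance"],
}

# Exhaustive search: every combination is evaluated with 10-fold CV
grid = GridSearchCV(knn, param_grid, cv=10, scoring="accuracy")
grid.fit(X, y)
print(grid.best_score_, grid.best_params_)

# With the default refit=True, the best model is refit on all of the
# data, so the fitted search object can make predictions directly
grid_prediction = grid.predict([[3, 5, 4, 2]])
print(grid_prediction)

# Randomized search: only n_iter combinations are tried, which reduces
# the computational cost at the risk of missing the best combination
rand = RandomizedSearchCV(
    knn, param_grid, cv=10, scoring="accuracy", n_iter=10, random_state=5
)
rand.fit(X, y)
print(rand.best_score_, rand.best_params_)
```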
Evaluating a classification model (video, notebook)
- What is the purpose of model evaluation, and what are some common evaluation procedures?
- How is classification accuracy used, and what are its limitations?
- How does a confusion matrix describe the performance of a classifier?
- What metrics can be computed from a confusion matrix?
- How can you adjust classifier performance by changing the classification threshold?
- What is the purpose of an ROC curve?
- How does Area Under the Curve (AUC) differ from classification accuracy?
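The ideas in this agenda can be sketched on a small example. The snippet below uses a synthetic, imbalanced binary dataset from `make_classification` as a stand-in (an assumption of this sketch, not the dataset used in the video), and the 0.3 threshold is an arbitrary illustrative value.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem with roughly 70% negatives, 30% positives
X, y = make_classification(n_samples=500, weights=[0.7, 0.3], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = logreg.predict(X_test)               # class predictions (0.5 threshold)
y_prob = logreg.predict_proba(X_test)[:, 1]   # predicted probability of class 1

# Accuracy alone hides which kinds of errors are being made
print(accuracy_score(y_test, y_pred))

# For binary labels, ravel() returns the cells in this order
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
sensitivity = tp / (tp + fn)   # true positive rate (recall)
specificity = tn / (tn + fp)   # true negative rate
print(sensitivity, specificity)

# Lowering the threshold predicts class 1 more often, which cannot
# decrease sensitivity but typically lowers specificity
y_pred_low = (y_prob >= 0.3).astype(int)
tn_low, fp_low, fn_low, tp_low = confusion_matrix(y_test, y_pred_low).ravel()
sensitivity_low = tp_low / (tp_low + fn_low)
print(sensitivity_low)

# AUC is computed from predicted probabilities, not class predictions,
# so it does not depend on any single threshold
auc = roc_auc_score(y_test, y_prob)
print(auc)
```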
Encoding categorical features (video, notebook)
- Why should you use a Pipeline?
- How do you encode categorical features with OneHotEncoder?
- How do you apply OneHotEncoder to selected columns with ColumnTransformer?
- How do you build and cross-validate a Pipeline?
- How do you make predictions on new data using a Pipeline?
- Why should you use scikit-learn (rather than pandas) for preprocessing?
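A rough sketch of the preprocessing workflow in this video: one-hot encode selected columns with a `ColumnTransformer`, chain it with a model in a `Pipeline`, cross-validate the whole thing, and predict on new data. The tiny hand-made DataFrame, its column names, and the use of `handle_unknown="ignore"` are assumptions for illustration only.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data, made up for this sketch
df = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 2, 1, 3, 2, 3, 1, 2],
    "Sex": ["male", "female", "female", "female", "male", "male",
            "male", "female", "female", "male", "female", "male"],
    "Embarked": ["S", "C", "S", "S", "Q", "S", "C", "Q", "C", "S", "S", "Q"],
    "Survived": [0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0],
})
X = df.drop(columns="Survived")
y = df["Survived"]

# One-hot encode the categorical columns and pass the rest through.
# handle_unknown="ignore" avoids errors if a category is absent from
# a training fold (an assumption of this sketch)
column_trans = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["Sex", "Embarked"]),
    remainder="passthrough",
)

# The pipeline applies preprocessing and then the model, so encoding is
# learned only from the training portion of each cross-validation fold
pipe = make_pipeline(column_trans, LogisticRegression(max_iter=1000))
cv_scores = cross_val_score(pipe, X, y, cv=3, scoring="accuracy")
print(cv_scores.mean())

# Fit on all data, then predict on new raw (unencoded) observations
pipe.fit(X, y)
X_new = pd.DataFrame({
    "Pclass": [3, 1],
    "Sex": ["male", "female"],
    "Embarked": ["S", "C"],
})
new_predictions = pipe.predict(X_new)
print(new_predictions)
```

Keeping the encoder inside the pipeline means new data goes through exactly the same transformation as the training data, which is one practical reason to prefer scikit-learn over ad hoc pandas preprocessing here.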
At the PyCon 2016 conference, I taught a 3-hour tutorial that builds upon this video series. The recording is embedded below, or you can watch it on YouTube:
Here are the topics I covered:
- Model building in scikit-learn (refresher)
- Representing text as numerical data
- Reading a text-based dataset into pandas
- Vectorizing our dataset
- Building and evaluating a model
- Comparing models
- Examining a model for further insight
- Practicing this workflow on another dataset
- Tuning the vectorizer (discussion)
Visit this GitHub repository to access the tutorial notebooks and many other recommended resources. If you want to go even deeper into this material, I teach an online course, Machine Learning with Text in Python.