February 20, 2015

A friendly introduction to linear regression (using Python)

A few weeks ago, I taught a 3-hour lesson introducing linear regression to my data science class. It's not the fanciest machine learning technique, but it is a crucial technique to learn for many reasons:

It's widely used and well-understood.
It runs very fast!
It's easy to use because minimal "tuning" is required.
It's highly "interpretable", meaning that it's easy to explain to others.
It's the basis for many other machine learning techniques.

The most accessible (yet thorough) introduction to linear regression that I've found is Chapter 3 of An Introduction to Statistical Learning (ISL) by James, Witten, Hastie, and Tibshirani. Their examples are crystal clear and the material is presented in a logical fashion, but the chapter covers a lot more detail than I wanted to present in class. Also, their code is written in R, whereas my data science class is taught in Python.

My Jupyter Notebook on linear regression

When teaching this material, I essentially condensed ISL chapter 3 into a single Jupyter Notebook, focusing on the points that I consider to be most important and adding a lot of practical advice. I also wrote all of the code in Python, using both Statsmodels and scikit-learn to implement linear regression.

Click here to view the Jupyter Notebook.

Here is a detailed list of topics covered in the Notebook:

reading data into Python using pandas
identifying the features, response, and observations
plotting the relationship between each feature and the response using Matplotlib
introducing the form of simple linear regression
estimating linear model coefficients
interpreting model coefficients
using the model for prediction
plotting the "least squares" line
quantifying confidence in the model
identifying "significant" coefficients using hypothesis testing and p-values
assessing how well the model fits the observed data
extending simple linear regression to include multiple predictors
comparing feature selection techniques: R-squared, p-values, cross-validation
creating "dummy variables" (using pandas) to handle categorical predictors
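
To give a flavor of a few of the topics above (estimating coefficients, using the model for prediction, and assessing fit with R-squared), here is a minimal sketch in plain Python. It is not the Notebook's code, which uses Statsmodels and scikit-learn. The function names and the tiny dataset are made up purely for illustration.

```python
# Simple linear regression (one feature) fit by ordinary least squares.
# Illustrative only: the helper names and data are hypothetical,
# not taken from the Notebook.

def fit_simple_ols(x, y):
    """Return (intercept, slope) that minimize the sum of squared residuals."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope = covariance of x and y divided by variance of x
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    # The least squares line always passes through (x_mean, y_mean)
    intercept = y_mean - slope * x_mean
    return intercept, slope


def predict(intercept, slope, x_new):
    """Predict the response for each value in x_new."""
    return [intercept + slope * xi for xi in x_new]


def r_squared(x, y, intercept, slope):
    """Proportion of variance in y explained by the fitted line."""
    y_mean = sum(y) / len(y)
    fitted = predict(intercept, slope, x)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Made-up observations: one feature and one response
feature = [1.0, 2.0, 3.0, 4.0, 5.0]
response = [3.1, 4.9, 7.2, 8.8, 11.0]

intercept, slope = fit_simple_ols(feature, response)
r2 = r_squared(feature, response, intercept, slope)
new_predictions = predict(intercept, slope, [6.0])

print("intercept:", intercept)
print("slope:", slope)
print("R-squared:", r2)
print("prediction for 6.0:", new_predictions[0])
```

The slope is interpreted as the expected change in the response for a one-unit increase in the feature. In practice, a library such as Statsmodels or scikit-learn handles multiple predictors, p-values, and confidence intervals, which this sketch does not attempt.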

Resources

If you would like to go deeper into linear regression, here are a few resources I would suggest:

Chapter 3 of An Introduction to Statistical Learning (which can be downloaded for free!) extends this lesson to include more advanced topics, such as detecting collinearity, diagnosing model fit, and transforming predictors to fit non-linear relationships.
This introduction to linear regression is well written, mathematically thorough, and includes lots of good advice.

If you liked this Notebook, here are some other Data School resources that might interest you:

Quick reference guide to applying and interpreting linear regression
Jupyter Notebook demonstrating logistic regression in Python
15 hours of expert videos introducing machine learning
Python or R for data science?
My free 4-hour course on machine learning in Python

Do you have any questions about linear regression in Python? Please let me know in the comments below!
