February 20, 2015

A friendly introduction to linear regression (using Python)

A few weeks ago, I taught a 3-hour lesson introducing linear regression to my data science class. It's not the fanciest machine learning technique, but it is a crucial technique to learn for many reasons:

• It's widely used and well-understood.
• It runs very fast!
• It's easy to use because minimal "tuning" is required.
• It's highly "interpretable", meaning that it's easy to explain to others.
• It's the basis for many other machine learning techniques.

The most accessible (yet thorough) introduction to linear regression that I've found is Chapter 3 of An Introduction to Statistical Learning (ISL) by James, Witten, Hastie & Tibshirani. The examples are crystal clear and the material is presented in a logical order, but the chapter covers a lot more detail than I wanted to present in class. Also, the book's code is written in R, and my data science class is taught in Python.

My Jupyter Notebook on linear regression

When teaching this material, I essentially condensed ISL Chapter 3 into a single Jupyter Notebook, focusing on the points that I consider most important and adding a lot of practical advice. I also wrote all of the code in Python, using both Statsmodels and scikit-learn to implement linear regression.

Here is a detailed list of topics covered in the Notebook:

• reading data into Python using pandas
• identifying the features, response, and observations
• plotting the relationship between each feature and the response using Matplotlib
• introducing the form of simple linear regression
• estimating linear model coefficients
• interpreting model coefficients
• using the model for prediction
• plotting the "least squares" line
• quantifying confidence in the model
• identifying "significant" coefficients using hypothesis testing and p-values
• assessing how well the model fits the observed data
• extending simple linear regression to include multiple predictors
• comparing feature selection techniques: R-squared, p-values, cross-validation
• creating "dummy variables" (using pandas) to handle categorical predictors
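
To give a flavor of a few of these topics (estimating coefficients, using the model for prediction, and assessing fit with R-squared), here is a minimal sketch of simple linear regression in plain Python. It is not the Notebook's code, which relies on Statsmodels and scikit-learn. The toy data and the function names (`fit_simple_ols`, `predict`, `r_squared`) are made up for illustration.

```python
# Toy data invented for this sketch (not taken from the Notebook):
# one feature x and one response y, five observations.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]


def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least squares line y = intercept + slope * x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope = sum of cross-deviations divided by sum of squared x deviations.
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    # The least squares line always passes through (x_mean, y_mean).
    intercept = y_mean - slope * x_mean
    return intercept, slope


def predict(intercept, slope, x_new):
    """Predict the response for a single new feature value."""
    return intercept + slope * x_new


def r_squared(x, y, intercept, slope):
    """Proportion of variance in y explained by the fitted line."""
    y_mean = sum(y) / len(y)
    ss_res = sum((yi - predict(intercept, slope, xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


intercept, slope = fit_simple_ols(x, y)
r2 = r_squared(x, y, intercept, slope)

print("intercept:", round(intercept, 4))
print("slope:", round(slope, 4))
print("prediction at x=6:", round(predict(intercept, slope, 6.0), 4))
print("R-squared:", round(r2, 4))
```

The slope is interpreted as the expected change in the response for a one-unit increase in the feature. In practice you would let a library do this work, since it also gives you confidence intervals, p-values, and support for multiple predictors.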

Resources

If you would like to go deeper into linear regression, here are a few resources I would suggest:

If you liked this Notebook, here are some other Data School resources that might interest you:

Do you have any questions about linear regression in Python? Please let me know in the comments below!
