A friendly introduction to linear regression (using Python)
A few weeks ago, I taught a 3-hour lesson introducing linear regression to my data science class. It's not the fanciest machine learning technique, but it is a crucial technique to learn for many reasons:
- It's widely used and well-understood.
- It runs very fast!
- It's easy to use because minimal "tuning" is required.
- It's highly "interpretable", meaning that it's easy to explain to others.
- It's the basis for many other machine learning techniques.
The most accessible (yet thorough) introduction to linear regression that I've found is Chapter 3 of An Introduction to Statistical Learning (ISL) by James, Witten, Hastie & Tibshirani. Their examples are crystal clear and the material is presented in a logical fashion, but the chapter covers a lot more detail than I wanted to present in class. Their code is also written in R, whereas my data science class is taught in Python.
My Jupyter Notebook on linear regression
When teaching this material, I essentially condensed ISL Chapter 3 into a single Jupyter Notebook, focusing on the points that I consider to be most important and adding a lot of practical advice. I also wrote all of the code in Python, using both Statsmodels and scikit-learn to implement linear regression.
Click here to view the Jupyter Notebook.
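To give a flavor of what fitting the same model in both libraries can look like, here is a minimal sketch. It is not taken from the Notebook: the data is synthetic, and the column names `x` and `y` are made up for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical, not the Notebook's dataset):
# y = 3 + 2*x plus a little Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 + 2.0 * x + rng.normal(0, 0.5, size=200)
df = pd.DataFrame({"x": x, "y": y})

# Statsmodels: formula interface; the fitted result also carries
# inference output such as p-values and R-squared.
sm_fit = smf.ols(formula="y ~ x", data=df).fit()
sm_intercept = sm_fit.params["Intercept"]
sm_slope = sm_fit.params["x"]

# scikit-learn: expects a 2-D feature matrix and a 1-D response.
sk_fit = LinearRegression().fit(df[["x"]], df["y"])
sk_intercept = sk_fit.intercept_
sk_slope = sk_fit.coef_[0]

print("Statsmodels :", sm_intercept, sm_slope)
print("scikit-learn:", sk_intercept, sk_slope)
print("p-value for x:", sm_fit.pvalues["x"])
print("R-squared    :", sm_fit.rsquared)
```

Both libraries solve the same least squares problem, so the coefficients should agree up to floating-point error. The practical difference is that Statsmodels reports statistical output (p-values, confidence intervals) out of the box, while scikit-learn is geared toward prediction and model evaluation.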
Table of contents
Here is a detailed list of topics covered in the Notebook:
- reading data into Python using pandas
- identifying the features, response, and observations
- plotting the relationship between each feature and the response using Matplotlib
- introducing the form of simple linear regression
- estimating linear model coefficients
- interpreting model coefficients
- using the model for prediction
- plotting the "least squares" line
- quantifying confidence in the model
- identifying "significant" coefficients using hypothesis testing and p-values
- assessing how well the model fits the observed data
- extending simple linear regression to include multiple predictors
- comparing feature selection techniques: R-squared, p-values, cross-validation
- creating "dummy variables" (using pandas) to handle categorical predictors
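The last two topics in the list can be sketched together. The example below is only an illustration under stated assumptions: the data is synthetic, the column names (`x1`, `noise`, `area`) and the helper `cv_rmse` are hypothetical, and the Notebook's own approach may differ. It creates dummy variables with pandas and then uses cross-validation to compare a model with and without the categorical predictor.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data (hypothetical): one useful numeric feature, one
# irrelevant numeric feature, and one three-level categorical feature.
rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "x1": rng.normal(size=n),
    "noise": rng.normal(size=n),
    "area": rng.choice(["rural", "suburban", "urban"], size=n),
})
area_effect = df["area"].map({"rural": 0.0, "suburban": 1.0, "urban": 3.0})
df["y"] = 5.0 + 2.0 * df["x1"] + area_effect + rng.normal(0, 0.5, size=n)

# Dummy variables: one column per level, dropping the first level
# ("rural") so that it serves as the baseline.
dummies = pd.get_dummies(df["area"], prefix="area", drop_first=True, dtype=float)
X_full = pd.concat([df[["x1", "noise"]], dummies], axis=1)


def cv_rmse(X, y, folds=10):
    """Square root of the mean squared error averaged across folds."""
    scores = cross_val_score(
        LinearRegression(), X, y, cv=folds, scoring="neg_mean_squared_error"
    )
    return float(np.sqrt(-scores.mean()))


rmse_with_area = cv_rmse(X_full, df["y"])
rmse_without_area = cv_rmse(df[["x1", "noise"]], df["y"])

print("Dummy columns:", list(dummies.columns))
print("CV RMSE with area dummies   :", rmse_with_area)
print("CV RMSE without area dummies:", rmse_without_area)
```

Dropping one level avoids a redundant column: with an intercept in the model, two dummies are enough to encode three categories. Because cross-validation estimates out-of-sample error, a lower RMSE for a feature set is evidence that it generalizes better, which is a different question from whether it raises R-squared on the training data.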
If you would like to go deeper into linear regression, here are a few resources I would suggest:
- Chapter 3 of An Introduction to Statistical Learning (which can be downloaded for free!) extends this lesson to include more advanced topics, such as detecting collinearity, diagnosing model fit, and transforming predictors to fit non-linear relationships.
- This introduction to linear regression is well-written, mathematically thorough, and includes lots of good advice.
- This is a relatively quick post on the assumptions of linear regression.
If you liked this Notebook, here are some other Data School resources that might interest you:
- Quick reference guide to applying and interpreting linear regression
- Jupyter Notebook demonstrating logistic regression in Python
- 15 hours of expert videos introducing machine learning
- Python or R for data science?
- My 4-hour video series on machine learning in Python
Do you have any questions about linear regression in Python? Please let me know in the comments below!
P.S. Want to receive more content like this in your inbox? Subscribe to the Data School newsletter.
New IPython notebook: Intro to linear regression in #python using scikit-learn, statsmodels, pandas, matplotlib http://t.co/T7MP4784jP (Kevin Markham, @justmarkham, February 20, 2015)