February 20, 2015

# A friendly introduction to linear regression (using Python)

A few weeks ago, I taught a 3-hour lesson introducing linear regression to my data science class. It's not the fanciest machine learning technique, but it is a crucial technique to learn for many reasons:

• It's widely used and well-understood.
• It runs very fast!
• It's easy to use because minimal "tuning" is required.
• It's highly "interpretable", meaning that it's easy to explain to others.
• It's the basis for many other machine learning techniques.

The most accessible (yet thorough) introduction to linear regression that I've found is Chapter 3 of An Introduction to Statistical Learning (ISL) by James, Witten, Hastie & Tibshirani. The examples are crystal clear and the material is presented in a logical fashion, but the chapter covers a lot more detail than I wanted to present in class. Also, its code is written in R, and my data science class is taught in Python.

## My Jupyter Notebook on linear regression

When teaching this material, I essentially condensed ISL Chapter 3 into a single Jupyter Notebook, focusing on the points that I consider to be most important and adding a lot of practical advice. I also wrote all of the code in Python, using both Statsmodels and scikit-learn to implement linear regression.

Here is a detailed list of topics covered in the Notebook:

• reading data into Python using pandas
• identifying the features, response, and observations
• plotting the relationship between each feature and the response using Matplotlib
• introducing the form of simple linear regression
• estimating linear model coefficients
• interpreting model coefficients
• using the model for prediction
• plotting the "least squares" line
• quantifying confidence in the model
• identifying "significant" coefficients using hypothesis testing and p-values
• assessing how well the model fits the observed data
• extending simple linear regression to include multiple predictors
• comparing feature selection techniques: R-squared, p-values, cross-validation
• creating "dummy variables" (using pandas) to handle categorical predictors
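
To give a feel for the estimation, prediction, and model-fit topics in the list above, here is a minimal sketch of simple linear regression in plain Python. It is not the Notebook's code: the Notebook uses Statsmodels and scikit-learn, whereas this sketch writes out the closed-form least squares formulas by hand. The data values are made up for illustration, and the helper names (`fit_simple_ols`, `predict`, `r_squared`) are hypothetical.

```python
# Made-up toy data: one feature (x) and one response (y), five observations.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]


def fit_simple_ols(x, y):
    """Estimate the intercept and slope of y = intercept + slope * x
    using the closed-form least squares solution."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


def predict(intercept, slope, x_new):
    """Use the fitted line to predict the response for a new feature value."""
    return intercept + slope * x_new


def r_squared(x, y, intercept, slope):
    """Proportion of variance in y explained by the fitted line."""
    y_mean = sum(y) / len(y)
    ss_res = sum((yi - predict(intercept, slope, xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


intercept, slope = fit_simple_ols(x, y)
prediction = predict(intercept, slope, 6.0)
fit_quality = r_squared(x, y, intercept, slope)

print("intercept:", intercept)
print("slope:", slope)
print("prediction for x = 6:", prediction)
print("R-squared:", fit_quality)
```

The slope is read as the expected change in the response for a one-unit increase in the feature. An R-squared close to 1 means the line explains most of the variation in the observed responses. With real data you would let Statsmodels or scikit-learn do this work, since they also handle multiple predictors and, in the case of Statsmodels, report p-values and confidence intervals.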

## Resources

If you would like to go deeper into linear regression, here are a few resources I would suggest:

If you liked this Notebook, here are some other Data School resources that might interest you:

Do you have any questions about linear regression in Python? Please let me know in the comments below!