# A friendly introduction to linear regression (using Python)

A few weeks ago, I taught a 3-hour lesson introducing linear regression to my data science class. It's not the fanciest machine learning technique, but it is **a crucial technique to learn** for many reasons:

- It's widely used and well-understood.
- It runs very fast!
- It's easy to use because minimal "tuning" is required.
- It's highly "interpretable", meaning that it's easy to explain to others.
- It's the basis for many other machine learning techniques.

The most **accessible (yet thorough) introduction to linear regression** that I've found is Chapter 3 of An Introduction to Statistical Learning (ISL) by James, Witten, Hastie & Tibshirani. Their examples are crystal clear and the material is presented in a logical fashion, but the chapter covers a lot more detail than I wanted to present in class. Also, **their code is written in R**, and my data science class is taught in Python.

## My IPython Notebook on linear regression

When teaching this material, **I essentially condensed ISL chapter 3 into a single IPython Notebook**, focusing on the points that I consider to be most important and adding a lot of practical advice. **I also wrote all of the code in Python**, using both Statsmodels and scikit-learn to implement linear regression.

**Click here to view the IPython Notebook.**

## Table of contents

Here is a detailed list of **topics covered** in the Notebook:

- reading data into Python using pandas
- identifying the features, response, and observations
- plotting the relationship between each feature and the response using Matplotlib
- introducing the form of simple linear regression
- estimating linear model coefficients
- interpreting model coefficients
- using the model for prediction
- plotting the "least squares" line
- quantifying confidence in the model
- identifying "significant" coefficients using hypothesis testing and p-values
- assessing how well the model fits the observed data
- extending simple linear regression to include multiple predictors
- comparing feature selection techniques: R-squared, p-values, cross-validation
- creating "dummy variables" (using pandas) to handle categorical predictors
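
To make a few of these topics concrete (estimating coefficients, using the model for prediction, and assessing fit with R-squared), here is a minimal sketch of simple linear regression in plain Python. It is not taken from the Notebook, which uses Statsmodels and scikit-learn. The data values and the function names (`fit_simple_linear_regression`, `predict`, `r_squared`) are made up purely for illustration.

```python
# Hypothetical data (invented for illustration): one feature and one response.
spend = [1.0, 2.0, 3.0, 4.0, 5.0]    # feature, e.g. advertising spend
sales = [3.1, 4.9, 7.2, 8.8, 11.0]   # response, e.g. units sold


def fit_simple_linear_regression(x, y):
    """Return (intercept, slope) of the least squares line y = intercept + slope * x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


def predict(intercept, slope, x_new):
    """Predict the response for a new feature value."""
    return intercept + slope * x_new


def r_squared(x, y, intercept, slope):
    """Proportion of variance in y explained by the fitted line."""
    y_mean = sum(y) / len(y)
    ss_res = sum((yi - predict(intercept, slope, xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


intercept, slope = fit_simple_linear_regression(spend, sales)
print("intercept:", round(intercept, 3))
print("slope:", round(slope, 3))
print("prediction at x=6:", round(predict(intercept, slope, 6.0), 3))
print("R-squared:", round(r_squared(spend, sales, intercept, slope), 4))
```

The slope is read as the expected change in the response for a one-unit increase in the feature. Libraries such as Statsmodels and scikit-learn estimate the same least squares coefficients, and Statsmodels additionally reports p-values and confidence intervals.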

## Resources

**If you would like to go deeper into linear regression**, here are a few resources I would suggest:

- Chapter 3 of An Introduction to Statistical Learning (which can be **downloaded for free!**) extends this lesson to include more advanced topics, such as detecting collinearity, diagnosing model fit, and transforming predictors to fit non-linear relationships.
- This introduction to linear regression is well-written, mathematically thorough, and includes lots of good advice.
- This is a relatively quick post on the assumptions of linear regression.

**If you liked this Notebook**, here are some other Data School resources that might interest you:

- Quick reference guide to applying and interpreting linear regression
- IPython Notebook demonstrating logistic regression in Python
- 15 hours of expert videos introducing machine learning
- Python or R for data science?
- My 4-hour video series on machine learning in Python

Do you have any questions about linear regression in Python? **Please let me know in the comments below!**

P.S. Want to receive more content like this in your inbox? Subscribe to the Data School newsletter.

New IPython notebook: Intro to linear regression in #python using scikit-learn, statsmodels, pandas, matplotlib http://t.co/T7MP4784jP

— Kevin Markham (@justmarkham) February 20, 2015