Quick reference guide to applying and interpreting linear regression
After learning a complex topic, I find it helpful to create a "quick reference guide" for myself, so that I can easily review the key points of that topic before applying it to a data problem or teaching it to others. When that topic is conceptual (such as linear regression), those guides tend to resemble the notes you might take from a classroom lecture. When that topic is code-based, those guides tend to contain examples and annotated lists of commands, like my quick reference guide to Git, my dplyr tutorial, or my Python reference guide.
I created this guide to linear regression a while ago, after reading James, Witten, Hastie, and Tibshirani's excellent An Introduction to Statistical Learning (with Applications in R). Now that I'm a Data Science instructor for General Assembly, I've made a personal commitment to sharing these guides so that my students and others can benefit from them.
Please note that this is not a tutorial, and is not suitable for teaching you linear regression if you are not already familiar with it. Instead, it is only intended to be a light reference guide to applying linear regression and interpreting the output, and ignores many nuances of the topic. However, I have listed resources for deepening your understanding (and applying it to R, Python, and other statistical packages) at the bottom of this post.
Your feedback and clarifications are welcome!
Simple Linear Regression:
Computing coefficients
- Estimate B0 (intercept) and B1 (slope) based on least squares
- "Residuals" are the discrepancies between the actual and predicted y values
- The sum of the squared residuals for a given model is the "residual sum of squares" (RSS)
- Least squares line minimizes RSS
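As a minimal sketch of what this looks like in practice (using statsmodels on a small synthetic dataset; the column names `x` and `y` are just placeholders), the least squares coefficients can be computed like this:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# small synthetic dataset: y is roughly a linear function of x plus noise
rng = np.random.default_rng(1)
df = pd.DataFrame({'x': rng.uniform(0, 10, 100)})
df['y'] = 10 + 2 * df['x'] + rng.normal(0, 2, 100)

# fit the least squares line: estimates B0 (intercept) and B1 (slope) by minimizing RSS
results = smf.ols('y ~ x', data=df).fit()
print(results.params)        # B0 ('Intercept') and B1 ('x')
print(results.resid.head())  # residuals: actual y minus predicted y
print(results.ssr)           # residual sum of squares (RSS)
```

The remaining sketches in this guide build on this `results` object and DataFrame.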
How accurate is B1?
- How much would B1 vary under repeated sampling? (Thus, how "accurate" is it?)
- Calculate the "standard error" (SE) of B1; the approximate 95% "confidence interval" is B1 +/- 2*SE(B1)
- Interpretation: if you repeated the sampling 100 times and built a confidence interval each time, roughly 95 of those intervals would contain the "true" B1
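Continuing the sketch above, the standard errors and confidence intervals come straight off the fitted results:

```python
print(results.bse)                    # standard error of each coefficient
print(results.conf_int(alpha=0.05))   # 95% confidence intervals (roughly B1 +/- 2*SE(B1))
```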
Is B1 non-zero?
- Null hypothesis: x and y are not related (thus B1=0)
- Alternative hypothesis: there is some relationship between x and y (thus B1 != 0)
- "t-statistic" = B1 / SE(B1) = number of standard deviations that B1 is from zero
- Higher absolute t-statistic (roughly more than 2) is stronger evidence that there is a relationship
- "p-value" is the probability of seeing a t-statistic at least this large (in absolute value) if there were actually no relationship between x and y
- Lower p-value (less than 0.05) is stronger evidence of a relationship
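Again continuing the sketch, the t-statistics and p-values are attributes of the fitted results:

```python
print(results.tvalues)  # each coefficient divided by its standard error
print(results.pvalues)  # p-values for the null hypothesis that each coefficient is zero
```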
How well does the model fit the data?
- "Residual standard error" (RSE) is computed using RSS
- "Large" RSE is a poor fit, but RSE is measured in y units
- "R-squared" (R^2) is proportion of variability in y that can be explained using x
- Ranges from 0 to 1
- An R^2 of 0.75 means the fitted model achieved a 75% reduction in squared error compared to the null model (always predicting the mean of y)
- Higher R^2 indicates a stronger relationship between x and y
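Both quantities are easy to pull from (or compute with) the fitted results in the running sketch:

```python
print(results.rsquared)  # R-squared: proportion of variability in y explained by the model

# residual standard error: sqrt(RSS / (n - 2)) for simple linear regression
rse = np.sqrt(results.ssr / results.df_resid)
print(rse)
```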
Multiple Linear Regression:
Computing coefficients
- Involves more than 1 predictor, thus has more than 1 slope coefficient
- Still estimate B0, B1, B2, etc. by minimizing RSS
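Here is a sketch of the same idea with two predictors, extending the synthetic dataset from earlier (the second column `x2` is again just a placeholder):

```python
# add a second synthetic predictor and refit
df['x2'] = rng.uniform(0, 5, 100)
df['y'] = df['y'] + 1.5 * df['x2']

results_multi = smf.ols('y ~ x + x2', data=df).fit()
print(results_multi.params)  # B0, B1, B2 estimated by minimizing RSS
```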
Is at least one coefficient non-zero?
- Null hypothesis: B1 = B2 = etc. = 0
- Compute F-statistic: will be close to 1 when null hypothesis is true, and much larger than 1 when null hypothesis is false
- Even if the p-value for an individual coefficient is small, you still need to check F-statistic for the entire model (especially when the number of predictors is large)
- When n (number of observations) is large, F-statistic does not have to be particularly large to reject the null hypothesis
- When n is small, larger F-statistic is required to reject the null hypothesis
- Examine p-value for F-statistic to help you decide whether to reject the null hypothesis
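In the multiple regression sketch above, the F-statistic and its p-value are available directly:

```python
print(results_multi.fvalue)    # F-statistic for the null hypothesis B1 = B2 = ... = 0
print(results_multi.f_pvalue)  # p-value associated with that F-statistic
```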
How well does the model fit the data?
- Use R^2 and RSE
- "Large" increase in R^2 when adding a variable to the model is evidence that you should keep it in the model
- "Small" increase in R^2 when adding a variable to the model is evidence that you can leave it out
Qualitative (aka "categorical") predictors
- Create dummy variable(s): one fewer dummy variable than the number of levels
- Example with three levels: intercept coefficient (B0) represents the "baseline" (average response for the first level), B1 represents the difference between the second level and the baseline, and B2 represents the difference between the third level and the baseline
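A sketch of how this looks with the statsmodels formula interface, assuming a hypothetical three-level categorical column named `group`:

```python
# C() treats 'group' as categorical: for three levels it creates two dummy
# variables, with the first level absorbed into the intercept as the baseline
df['group'] = rng.choice(['a', 'b', 'c'], size=100)
results_cat = smf.ols('y ~ x + C(group)', data=df).fit()
print(results_cat.params)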
Interactions between variables
- Add interaction terms and examine the p-value for those terms
- Also check whether R^2 for the model with interactions is better than one without
- If you add an interaction term, also include the "main effects" (even if their individual p-values don't justify it)
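In formula notation (continuing the sketch), `x * x2` expands to both main effects plus their interaction:

```python
# 'x * x2' is shorthand for x + x2 + x:x2 (main effects plus interaction)
results_inter = smf.ols('y ~ x * x2', data=df).fit()
print(results_inter.pvalues)   # examine the p-value of the 'x:x2' term
print(results_inter.rsquared)  # compare against the model without the interaction
```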
Problems with Linear Regression:
Non-linear relationships
- Plot residuals versus fitted y values (multiple linear regression) or residuals versus x (simple linear regression): pattern indicates non-linearity
- Try using non-linear transformations of predictors in the model: ln(x), sqrt(x), x^2
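A quick sketch of both the diagnostic plot and one possible transformation, using the running example (the quadratic term here is purely illustrative):

```python
import matplotlib.pyplot as plt

# residuals versus fitted values: a visible pattern suggests non-linearity
plt.scatter(results_multi.fittedvalues, results_multi.resid)
plt.axhline(0, color='gray')
plt.xlabel('fitted values')
plt.ylabel('residuals')
plt.show()

# one possible remedy: add a non-linear transformation of a predictor
results_quad = smf.ols('y ~ x + I(x ** 2) + x2', data=df).fit()
```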
Non-constant variance of residuals (aka "heteroskedasticity")
- Indicated by funnel shape in residual plot
- Try transforming y using a concave function: ln(y), sqrt(y)
- Or try using weighted least squares
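Two sketches of those remedies with the running example (the log transform assumes y is strictly positive, and the choice of weights here is purely illustrative):

```python
# option 1: concave transformation of the response
results_log = smf.ols('np.log(y) ~ x + x2', data=df).fit()

# option 2: weighted least squares with an assumed variance structure
import statsmodels.api as sm
X = sm.add_constant(df[['x', 'x2']])
results_wls = sm.WLS(df['y'], X, weights=1.0 / df['x']).fit()
```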
Outliers
- Plot studentized residuals: an absolute value greater than 3 suggests an outlier
- Try removing the observation from the dataset
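With statsmodels, studentized residuals come from the influence object of the fitted model (continuing the sketch):

```python
# externally studentized residuals; absolute values above 3 flag potential outliers
influence = results_multi.get_influence()
studentized = influence.resid_studentized_external
print(df[abs(studentized) > 3])
```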
High leverage points
- Simple linear regression: look for observations for which x is outside the normal range
- Multiple linear regression: compute leverage statistics, which range from 1/n to 1 with an average of (p+1)/n; values close to 1 (or far above the average) indicate high leverage
- Try removing the observation from the dataset
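The same influence object exposes the leverage statistics:

```python
# leverage (hat) values: between 1/n and 1, with average (p + 1) / n;
# values close to 1, or far above the average, indicate high leverage
leverage = results_multi.get_influence().hat_matrix_diag
print(leverage.max(), leverage.mean())
```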
Collinearity
- Exists whenever there is a correlation between two or more predictors
- Detect pairs of highly correlated variables by examining the correlation matrix for high absolute values
- Detect multicollinearity (three or more correlated variables) by computing the variance inflation factor (VIF) for each predictor
- Minimum VIF is 1
- VIF greater than 5 or 10 indicates problematic amount of collinearity
- Try removing one of the correlated predictors from the model, or combining them into a single predictor
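A sketch of the VIF calculation with statsmodels, using the two synthetic predictors from the running example:

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm

# VIF is computed per predictor from the design matrix (constant included)
X = sm.add_constant(df[['x', 'x2']])
vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns) if col != 'const'}
print(vifs)  # values above 5-10 suggest problematic collinearity
```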
Resources:
- This guide is largely adapted from Chapter 3 of An Introduction to Statistical Learning, a book that I highly recommend to any newcomers to statistical learning/machine learning (and which is available as a free PDF download). There are also 15 hours of videos associated with the book, as well as a wealth of R code included in the book.
- I created a substantial Jupyter Notebook introducing linear regression in Python.
- Dr. Robert Nau (Duke University) has a highly readable and practical guide to linear regression, split across a dozen medium-length posts.
- The DataRobot Blog has a guide for using statsmodels in Python, with one post on simple linear regression and another on multiple linear regression.