Guide to an in-depth understanding of logistic regression
When faced with a new classification problem, machine learning practitioners have a dizzying array of algorithms from which to choose: Naive Bayes, decision trees, Random Forests, Support Vector Machines, and many others. Where do you start? For many practitioners, the first algorithm they reach for is one of the oldest in the field: logistic regression.
Here are just a few of the attributes of logistic regression that make it incredibly popular: it's fast, it's highly interpretable, it doesn't require input features to be scaled (unless you regularize it), it needs little tuning, it's easy to regularize, and it typically outputs well-calibrated predicted probabilities.
But despite its popularity, it is often misunderstood. Here are a few common questions about logistic regression:
- Why is it called "logistic regression" if it's used for classification?
- Why is it considered a linear model?
- How do you interpret the model coefficients?
As a teacher, I've found that my best lessons are the ones in which I explain a topic step-by-step in the way that I wish it had been taught to me. I struggled when I was learning logistic regression, which is why I'm so pleased to have written a lesson that may help you to grasp this challenging topic.
To give you additional context for the lesson, I created this guide, which includes suggested prerequisites, a practical exercise, and a lengthy set of additional resources for going deeper into the topic.
Please note that the lesson code is written in Python, and so you will get the most out of it if you are a user of Python and scikit-learn. However, most components of this guide cover conceptual or mathematical material, and should be useful to all readers regardless of programming background.
I'd love to hear from you in the comments below! What questions do you have about logistic regression? Is this kind of guide helpful to you for learning a new topic? Are there other guides you would like me to create?
Prerequisites
- Watch Rahul Patwari's videos on probability (5 minutes) and odds (8 minutes).
- Read these excellent articles from BetterExplained: An Intuitive Guide To Exponential Functions & e and Demystifying the Natural Logarithm (ln). Then, review this brief summary of exponential functions and logarithms.
- Browse through my introductory slides on machine learning to make sure you are clear on the difference between regression and classification problems.
- Read Sebastian Raschka's overview of the supervised learning process for a look at the typical steps used to solve a classification problem.
- Read my linear regression lesson notebook to ensure you are familiar with its form and interpretation, since the logistic regression lesson will build upon it. Alternatively, watch The Easiest Introduction to Regression Analysis (14 minutes).
- Setosa has an interactive visualization that may also help you to grasp linear regression.
- For a walkthrough of the classification process using Python's scikit-learn library, watch videos 3 and 4 (35 minutes) from my scikit-learn video series. (Here are the associated notebooks.)
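For readers who want a quick reminder of what the classification process looks like in code, here is a minimal sketch. It assumes scikit-learn is installed and uses the iris dataset that ships with the library. The choice of estimator, the 25% test split, and the `random_state` value are assumptions made for illustration and are not taken from the videos.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a small dataset bundled with scikit-learn (no download needed)
X, y = load_iris(return_X_y=True)

# Hold out part of the data so the model is evaluated on unseen samples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit the model on the training set
# (max_iter is raised so the solver has room to converge)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict class labels and measure accuracy on the held-out set
y_pred = model.predict(X_test)
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

The same split, fit, predict, and score pattern applies to most scikit-learn classifiers, so swapping in a different estimator only changes the line that creates `model`.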
Logistic Regression Lesson
My logistic regression lesson notebook covers the following topics using the glass identification dataset:
- Refresh your memory on how to do linear regression in scikit-learn
- Attempt to use linear regression for classification
- Show you why logistic regression is a better alternative for classification
- Review probability, odds, e, log, and log-odds
- Explain the form of logistic regression
- Explain how to interpret logistic regression coefficients
- Demonstrate how logistic regression works with categorical features
- Compare logistic regression with other models
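As a rough companion to the probability, odds, and log-odds topics listed above, the sketch below shows how the three quantities relate and why logistic regression counts as a linear model: the log-odds are a linear function of the feature, and the logistic function turns them back into a probability. The function names and the `intercept` and `coef` values are hypothetical, chosen only for illustration, and are not taken from the lesson notebook.

```python
import math

def odds_from_probability(p):
    """Odds = p / (1 - p), defined for 0 <= p < 1."""
    return p / (1 - p)

def log_odds_from_probability(p):
    """Log-odds (the logit): the natural log of the odds."""
    return math.log(odds_from_probability(p))

def probability_from_log_odds(z):
    """The logistic (sigmoid) function, which is the inverse of the logit."""
    return 1 / (1 + math.exp(-z))

# Hypothetical model parameters, for illustration only
intercept = -1.5
coef = 0.8

for x in (0, 1, 2, 3):
    z = intercept + coef * x          # the log-odds are linear in x
    p = probability_from_log_odds(z)  # the probability is not
    print(f"x={x}  log-odds={z:+.2f}  odds={math.exp(z):.3f}  probability={p:.3f}")
```

Each one-unit increase in `x` adds `coef` to the log-odds, which multiplies the odds by `exp(coef)`. The change in probability, by contrast, depends on where you start.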
Practical Exercise
As a way to practice applying what you've learned, participate in Kaggle's introductory Titanic competition and use logistic regression to predict passenger survival. Kaggle links to helpful tutorials for Python, R, and Excel, and its Scripts feature lets you run Python and R code on the Titanic dataset from within your browser.
Additional Resources
- To go deeper into logistic regression, read the first three sections of Chapter 4 of An Introduction to Statistical Learning, or watch the first three videos (30 minutes) from that chapter.
- For a "math-ier" explanation of logistic regression, read Sebastian Raschka's overview of logistic regression. He also provides the code for a simple logistic regression implementation in Python, and he has a section on logistic regression in his machine learning FAQ.
- For more guidance in interpreting logistic regression coefficients, read this excellent guide by UCLA's IDRE and these lecture notes from the University of New Mexico.
- The scikit-learn documentation on probability calibration explains what it means for a predicted probability to be calibrated, and my blog post on click-through rate prediction with logistic regression explains why calibrated probabilities are useful in the real world.
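To make coefficient interpretation concrete, here is a minimal sketch that fits scikit-learn's `LogisticRegression` to synthetic data and checks that exponentiating a coefficient gives the odds ratio for a one-unit increase in the feature. The data, the true parameters (-1 and 2), and the helper `predicted_odds` are assumptions made up for this illustration and do not come from the resources above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for illustration): one feature x, with the
# true log-odds of the positive class equal to -1 + 2 * x
rng = np.random.default_rng(0)
x = rng.normal(size=(500, 1))
true_prob = 1 / (1 + np.exp(-(-1 + 2 * x[:, 0])))
y = rng.binomial(1, true_prob)

# Note: scikit-learn applies L2 regularization by default, so the fitted
# coefficient is shrunk somewhat toward zero relative to an unpenalized fit
model = LogisticRegression().fit(x, y)
coef = model.coef_[0, 0]
odds_ratio = np.exp(coef)

def predicted_odds(value):
    """Odds of the positive class predicted by the fitted model at x = value."""
    p = model.predict_proba([[value]])[0, 1]
    return p / (1 - p)

# Moving from x = 0 to x = 1 multiplies the predicted odds by exp(coef)
ratio = predicted_odds(1.0) / predicted_odds(0.0)
print(f"coefficient: {coef:.3f}")
print(f"exp(coefficient): {odds_ratio:.3f}")
print(f"odds at x=1 divided by odds at x=0: {ratio:.3f}")
```

The last two printed values agree because the model's log-odds are linear in `x`. With several features, the same reading applies to each coefficient while the other features are held fixed.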
Further Reading (for scikit-learn users)
- If you're a scikit-learn user, it's worth reading the user guide and class documentation for logistic regression to understand the particulars of its implementation.
- If you'd like to improve your logistic regression model through regularization, read part 5 of my regularization lesson notebook.
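As a small illustration of how regularization is controlled in scikit-learn, the sketch below varies the `C` parameter of `LogisticRegression`, which is the inverse of the regularization strength, and compares the size of the fitted coefficients. The synthetic dataset from `make_classification` and the particular `C` values are assumptions chosen for this example, not settings from the regularization lesson.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic dataset (an assumption for illustration)
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

# Scale the features so the penalty treats every coefficient comparably
X = StandardScaler().fit_transform(X)

# Smaller C means stronger regularization (the default penalty is L2)
norms = {}
for C in (0.01, 1.0, 100.0):
    model = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    norms[C] = np.linalg.norm(model.coef_)
    print(f"C={C:<6}  L2 norm of coefficients = {norms[C]:.3f}")
```

In practice `C` is usually chosen by cross-validation rather than by inspecting coefficient sizes; the loop here only shows the direction of the effect.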
Comparison with Other Models
- Choosing a Machine Learning Classifier is a short and highly readable comparison of logistic regression, Naive Bayes, decision trees, and Support Vector Machines.
- Supervised learning superstitions cheat sheet is a more thorough comparison of those classifiers, and includes links to lots of useful resources.
- Comparing supervised learning algorithms is a comparison table I created that includes both classification and regression models.
- Classifier comparison is scikit-learn's visualization of classifier decision boundaries.
- An Empirical Comparison of Supervised Learning Algorithms is a readable research paper from 2006, which was also presented as a talk (77 minutes).
- These lecture slides compare the inner workings of logistic regression and Naive Bayes, and this paper by Andrew Ng compares the performance of logistic regression and Naive Bayes across a variety of datasets.