February 21, 2016

Guide to an in-depth understanding of logistic regression

When faced with a new classification problem, machine learning practitioners have a dizzying array of algorithms from which to choose: Naive Bayes, decision trees, Random Forests, Support Vector Machines, and many others. Where do you start? For many practitioners, the first algorithm they reach for is one of the oldest in the field: logistic regression.

Here are just a few of the attributes that make logistic regression so popular: it's fast, it's highly interpretable, it doesn't require input features to be scaled (unless you regularize it, since the penalty depends on the size of the coefficients), it needs little or no tuning, it's easy to regularize, and it outputs well-calibrated predicted probabilities.

But despite its popularity, it is often misunderstood. Here are a few common questions about logistic regression:

Why is it called "logistic regression" if it's used for classification?
Why is it considered a linear model?
How do you interpret the model coefficients?

As a teacher, I've found that my best lessons are the ones in which I explain a topic step-by-step in the way that I wish it had been taught to me. I struggled when I was learning logistic regression, which is why I'm so pleased to have written a lesson that may help you to grasp this challenging topic.

In order to give you additional context for the lesson, I created this guide that includes suggested prerequisites, a practical exercise, and a lengthy set of additional resources to allow you to go deeper into this topic.

Please note that the lesson code is written in Python, and so you will get the most out of it if you are a user of Python and scikit-learn. However, most components of this guide cover conceptual or mathematical material, and should be useful to all readers regardless of programming background.

I'd love to hear from you in the comments below! What questions do you have about logistic regression? Is this kind of guide helpful to you for learning a new topic? Are there other guides you would like me to create?

Prerequisite Knowledge

Mathematical terminology:

Watch Rahul Patwari's videos on probability (5 minutes) and odds (8 minutes).
Read these excellent articles from BetterExplained: An Intuitive Guide To Exponential Functions & e and Demystifying the Natural Logarithm (ln). Then, review this brief summary of exponential functions and logarithms.

Machine learning:

Browse through my introductory slides on machine learning to make sure you are clear on the difference between regression and classification problems.
Read Sebastian Raschka's overview of the supervised learning process for a look at the typical steps used to solve a classification problem.

Linear regression:

Read my linear regression lesson notebook to ensure you are familiar with its form and interpretation, since the logistic regression lesson will build upon it. Alternatively, watch The Easiest Introduction to Regression Analysis (14 minutes).
Setosa has an interactive visualization that may also help you to grasp linear regression.

scikit-learn (optional):

For a walkthrough of the classification process using Python's scikit-learn library, watch videos 3 and 4 (35 minutes) from my scikit-learn video series. (Here are the associated notebooks.)

Logistic Regression Lesson

My logistic regression lesson notebook covers the following topics using the glass identification dataset:

Refresh your memory on how to do linear regression in scikit-learn
Attempt to use linear regression for classification
Show why logistic regression is a better alternative for classification
Briefly review probability, odds, e, log, and log-odds
Explain the form of logistic regression
Explain how to interpret logistic regression coefficients
Demonstrate how logistic regression works with categorical features
Compare logistic regression with other models
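
To make the probability, odds, and log-odds topics above a bit more concrete, here is a minimal sketch in plain Python. It is not taken from the lesson notebook: the intercept and coefficient are made-up values for a hypothetical one-feature model, and the helper names (logistic, odds, log_odds, predict_proba) are illustrative rather than scikit-learn functions.

```python
import math


def logistic(z):
    """Map a log-odds value (any real number) to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def odds(probability):
    """Convert a probability to odds: p / (1 - p)."""
    return probability / (1.0 - probability)


def log_odds(probability):
    """Convert a probability to log-odds (the logit)."""
    return math.log(odds(probability))


# Assumed, made-up parameters for a one-feature model (not from the lesson).
intercept = -4.0
coef = 2.0


def predict_proba(x):
    """The model is linear in the log-odds; the logistic function turns that into a probability."""
    return logistic(intercept + coef * x)


p1 = predict_proba(1.0)
p2 = predict_proba(2.0)

# Increasing x by one unit adds `coef` to the log-odds...
print(log_odds(p2) - log_odds(p1))

# ...which multiplies the odds by e ** coef.
odds_ratio = odds(p2) / odds(p1)
print(round(odds_ratio, 3), round(math.exp(coef), 3))
```

The last line prints the same number twice (up to rounding): a one-unit increase in the feature multiplies the odds by e raised to the coefficient, which is one common way to read a logistic regression coefficient. Unlike a linear regression prediction, the output of predict_proba always stays between 0 and 1.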

Practical Exercise

To practice applying what you've learned, participate in Kaggle's introductory Titanic competition and use logistic regression to predict passenger survival. Kaggle links to helpful tutorials for Python, R, and Excel, and its Scripts feature lets you run Python and R code on the Titanic dataset from within your browser.

Comparison with Other Models

Supervised learning superstitions cheat sheet is a thorough comparison of five common classifiers, and includes links to lots of useful resources.
Comparing supervised learning algorithms is a comparison table I created that includes both classification and regression models.
Classifier comparison is scikit-learn's visualization of classifier decision boundaries.
An Empirical Comparison of Supervised Learning Algorithms is a readable research paper from 2006, which was also presented as a talk (77 minutes).
These lecture slides compare the inner workings of logistic regression and Naive Bayes, and this paper by Andrew Ng compares the performance of logistic regression and Naive Bayes across a variety of datasets.
