February 21, 2016 · machine learning · Python · tutorial

Guide to an in-depth understanding of logistic regression

When faced with a new classification problem, machine learning practitioners have a dizzying array of algorithms from which to choose: Naive Bayes, decision trees, random forests, support vector machines, and many others. Where do you start? For many practitioners, the first algorithm they reach for is one of the oldest in the field: logistic regression.

Here are just a few of the attributes of logistic regression that make it incredibly popular: it's fast, it's highly interpretable, it doesn't require input features to be scaled, it requires little (if any) hyperparameter tuning, it's easy to regularize, and it outputs well-calibrated predicted probabilities.

But despite its popularity, logistic regression is often misunderstood, and questions about how it actually works and how to interpret its output come up again and again.

As a teacher, I've found that my best lessons are the ones in which I explain a topic step-by-step in the way that I wish it had been taught to me. I struggled when I was learning logistic regression, which is why I'm so pleased to have written a lesson that may help you to grasp this challenging topic.

In order to give you additional context for the lesson, I created this guide that includes suggested prerequisites, a practical exercise, and a lengthy set of additional resources to allow you to go deeper into this topic.

Please note that the lesson code is written in Python, and so you will get the most out of it if you are a user of Python and scikit-learn. However, most components of this guide cover conceptual or mathematical material, and should be useful to all readers regardless of programming background.

I'd love to hear from you in the comments below! What questions do you have about logistic regression? Is this kind of guide helpful to you for learning a new topic? Are there other guides you would like me to create?

Prerequisite Knowledge

The lesson will make the most sense if you are already comfortable with the following areas:

  - Mathematical terminology
  - Machine learning
  - Linear regression
  - scikit-learn (optional)

Logistic Regression Lesson

My logistic regression lesson notebook covers the following topics using the glass identification dataset:

  1. Refresh your memory on how to do linear regression in scikit-learn
  2. Attempt to use linear regression for classification
  3. Show you why logistic regression is a better alternative for classification
  4. Give a brief overview of probability, odds, e, log, and log-odds
  5. Explain the form of logistic regression
  6. Explain how to interpret logistic regression coefficients (a short scikit-learn sketch follows this list)
  7. Demonstrate how logistic regression works with categorical features
  8. Compare logistic regression with other models
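
To make the middle of this list concrete: logistic regression models the log-odds of the positive class as a linear function of the features, log(p / (1 − p)) = β₀ + β₁x₁ + ... + βₙxₙ, so each coefficient is the change in log-odds for a one-unit increase in its feature. Below is a minimal scikit-learn sketch of this idea; the tiny two-feature array is made up purely for illustration (the lesson notebook itself works through the glass identification dataset).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up two-feature data standing in for the glass identification dataset.
X = np.array([[1.51, 13.6], [1.52, 12.9], [1.51, 13.2], [1.52, 13.0],
              [1.53, 12.8], [1.51, 13.5], [1.52, 13.1], [1.53, 12.7]])
y = np.array([0, 0, 0, 1, 1, 0, 1, 1])

logreg = LogisticRegression()
logreg.fit(X, y)

# Each coefficient is the change in the log-odds of the positive class
# for a one-unit increase in that feature, holding the other feature constant.
print(logreg.intercept_, logreg.coef_)

# Converting among probability, odds, and log-odds for the first sample:
prob = logreg.predict_proba(X[:1])[0, 1]   # predicted probability of class 1
odds = prob / (1 - prob)                   # odds = p / (1 - p)
log_odds = np.log(odds)                    # equals intercept + X[0] @ coef
print(prob, odds, log_odds)
```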

Practical Exercise

As a way to practice applying what you've learned, participate in Kaggle's introductory Titanic competition and use logistic regression to predict passenger survival. Kaggle links to helpful tutorials for Python, R, and Excel, and their Scripts feature lets you run Python and R code on the Titanic dataset from within your browser.
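
If you want a starting point for the exercise (this is my own sketch, not the approach taken in Kaggle's tutorials), here is one possible way to begin, assuming you have downloaded the competition's train.csv. The handful of features and the median imputation of missing ages are illustrative choices; you'll likely do better with more careful feature engineering.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Assumes Kaggle's Titanic training file has been downloaded as 'train.csv'.
train = pd.read_csv('train.csv')

# A few illustrative features: encode Sex numerically and fill missing ages.
train['Sex'] = train['Sex'].map({'female': 0, 'male': 1})
train['Age'] = train['Age'].fillna(train['Age'].median())
feature_cols = ['Pclass', 'Sex', 'Age', 'Fare']
X = train[feature_cols]
y = train['Survived']

# Hold out part of the training data to estimate accuracy locally
# before making predictions on Kaggle's test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print(accuracy_score(y_test, y_pred))
```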

Further Reading

Further Reading (for scikit-learn users)

Comparison with Other Models
