Example of logistic regression in Python using scikit-learn
Back in April, I provided a worked example of a real-world linear regression problem using R. These types of examples can be useful for students getting started in machine learning because they demonstrate both the machine learning workflow and the detailed commands used to execute that workflow.
My logistic regression example
This time around, I wanted to provide a machine learning example in Python using the ever-popular scikit-learn module. For my Data Science class, I worked through a classification problem using logistic regression and posted my results online in an IPython Notebook. Here are the steps demonstrated in this example:
- loading a dataset from statsmodels into a pandas DataFrame
- exploring the data using pandas
- visualizing the data using matplotlib
- preparing the data for logistic regression using patsy
- building a logistic regression model using scikit-learn
- model evaluation using cross-validation from scikit-learn
After viewing the notebook online, you can easily download the notebook and re-run this code on your own computer, especially because the dataset I used is built into statsmodels.
Related resources
- Guide to an in-depth understanding of logistic regression
- IPython Notebook introducing linear regression in Python
- 4-hour course on machine learning in Python
Publishing your own IPython Notebook
Much like R Markdown documents, IPython Notebooks are a great way to weave together your code, output, and explanation into a single document that can be shared with others via the IPython Notebook Viewer. And unlike R Markdown documents, IPython Notebooks are fully interactive once downloaded by a user. Making a Notebook accessible via the Notebook Viewer is as simple as posting your .ipynb file to a publicly accessible URL (such as a GitHub repo or a Gist), and pasting the link to that file on the Notebook Viewer homepage.
If you're just getting started in Python, I highly recommend downloading the Anaconda distribution since it already contains all of the most popular Python modules for data analysis and scientific computing.
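Whichever distribution you choose, it can help to confirm that the libraries used in the example are present before re-running the notebook. The following is a small sketch that uses only the standard library (Python 3.8 or later). The helper name `installed_versions` is hypothetical, and the package list is my assumption about what the example needs.

```python
from importlib import metadata

# Distribution names of the libraries used in the logistic regression example.
packages = ["pandas", "matplotlib", "patsy", "statsmodels", "scikit-learn"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


versions = installed_versions(packages)
for name, version in versions.items():
    print(f"{name}: {version or 'not installed'}")
```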