Comparing supervised learning algorithms
In the data science course that I instruct, we cover most of the data science pipeline but focus especially on machine learning. Besides teaching model evaluation procedures and metrics, we obviously teach the algorithms themselves, primarily for supervised learning.
Near the end of this 11-week course, we spend a few hours reviewing the material covered so far, in the hope that students will start to build mental connections between the different things they have learned. One of the skills I want students to take away from this course is the ability to intelligently choose between supervised learning algorithms when working on a machine learning problem. Although there is some value in the "brute force" approach (try everything and see what works best), there is a lot more value in understanding the trade-offs you're making when you choose one algorithm over another.
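For reference, the "brute force" approach can be sketched in a few lines with scikit-learn. This is only an illustration under stated assumptions: the iris dataset, the four models, their settings, and the helper name `compare_models` are all chosen for the example and are not taken from the course or the table.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def compare_models(models, X, y, cv=5):
    """Return {name: mean cross-validated accuracy} for each model.

    `compare_models` is a hypothetical helper written for this example.
    """
    scores = {}
    for name, model in models.items():
        # cross_val_score fits a fresh clone of the model on each fold
        fold_scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
        scores[name] = fold_scores.mean()
    return scores


# Assumption: a small built-in dataset, used purely for illustration
X, y = load_iris(return_X_y=True)

# Assumption: an arbitrary handful of classifiers with near-default settings
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "naive Bayes": GaussianNB(),
}

results = compare_models(models, X, y)

# Print the models from best to worst mean accuracy
for name, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

A ranking like this tells you which model scored best on one dataset, but not why, and nothing about interpretability, training speed, or how much tuning each model needs. Those are the trade-offs that brute force leaves out.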
I decided to create a game for the students, in which I gave them a blank table listing the supervised learning algorithms we covered and asked them to compare the algorithms across a dozen different dimensions. I couldn't find a table like this on the Internet, so I decided to construct one myself! Here's what I came up with:
I wanted to share this table for two reasons: First, I thought it might be useful to others as a teaching or learning tool. (You're welcome to open it in Google Sheets and make a copy.) Second, I want to make it better, and one way to do that is to ask people more knowledgeable than me to tell me what I got wrong! :)
This table is a product of my own experience and research, but I'm not an expert in any one of these algorithms. If you have a suggestion for how this table can be improved, I'd love to hear it in the comments!
- Are any of my evaluations misleading or incorrect? (Of course, some of these dimensions are inherently subjective.)
- Are there any other "important" dimensions for comparison that should be added to this table?
- Are there any other algorithms that you would like me to add to this table? (Currently, it only includes algorithms that were taught in my course.)
I realize that the characteristics and relative performance of each algorithm can vary with the particulars of the data and with how well the algorithm is tuned, and thus some may argue that attempting to construct an "objective" comparison is ill-advised. However, I would argue that there is still value in providing this table as a set of general guidelines and as a starting point for comparing algorithms for your own supervised learning task.
Happy (machine) learning!
- Choosing a Machine Learning Classifier: Edwin Chen's short and highly readable guide.
- scikit-learn's "Machine Learning Map": Their guide for choosing the "right" estimator for your task.
- Machine Learning Done Wrong: Thoughtful advice on common mistakes to avoid in machine learning, some of which relate to algorithm selection.
- Practical machine learning tricks from the KDD 2011 best industry paper: More advanced advice than the resources above.
- An Empirical Comparison of Supervised Learning Algorithms: Research paper from 2006.
- View all Data School posts on machine learning
P.S. There are other discussions about this post on Kaggle and DataTau.
P.P.S. I teach an online course about Machine Learning with Text in Python.
Comparing 8 common supervised learning algorithms on 13 different dimensions: http://t.co/nqND7EWN3H #machinelearning (Kevin Markham, @justmarkham, February 27, 2015)