Simple guide to confusion matrix terminology
A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known. The confusion matrix itself is relatively simple to understand, but the related terminology can be confusing.
I wanted to create a "quick reference guide" for confusion matrix terminology because I couldn't find an existing resource that suited my requirements: compact in presentation, using numbers instead of arbitrary variables, and explained both in terms of formulas and sentences.
Let's start with an example confusion matrix for a binary classifier (though it can easily be extended to the case of more than two classes):
What can we learn from this matrix?
- There are two possible predicted classes: "yes" and "no". If we were predicting the presence of a disease, for example, "yes" would mean they have the disease, and "no" would mean they don't have the disease.
- The classifier made a total of 165 predictions (e.g., 165 patients were being tested for the presence of that disease).
- Out of those 165 cases, the classifier predicted "yes" 110 times, and "no" 55 times.
- In reality, 105 patients in the sample have the disease, and 60 patients do not.
Let's now define the most basic terms, which are whole numbers (not rates):
- true positives (TP): These are cases in which we predicted yes (they have the disease), and they do have the disease.
- true negatives (TN): We predicted no, and they don't have the disease.
- false positives (FP): We predicted yes, but they don't actually have the disease. (Also known as a "Type I error.")
- false negatives (FN): We predicted no, but they actually do have the disease. (Also known as a "Type II error.")
I've added these terms to the confusion matrix, and also added the row and column totals:
This is a list of rates that are often computed from a confusion matrix for a binary classifier:
- Accuracy: Overall, how often is the classifier correct?
- (TP+TN)/total = (100+50)/165 = 0.91
- Misclassification Rate: Overall, how often is it wrong?
- (FP+FN)/total = (10+5)/165 = 0.09
- equivalent to 1 minus Accuracy
- also known as "Error Rate"
- True Positive Rate: When it's actually yes, how often does it predict yes?
- TP/actual yes = 100/105 = 0.95
- also known as "Sensitivity" or "Recall"
- False Positive Rate: When it's actually no, how often does it predict yes?
- FP/actual no = 10/60 = 0.17
- True Negative Rate: When it's actually no, how often does it predict no?
- TN/actual no = 50/60 = 0.83
- equivalent to 1 minus False Positive Rate
- also known as "Specificity"
- Precision: When it predicts yes, how often is it correct?
- TP/predicted yes = 100/110 = 0.91
- Prevalence: How often does the yes condition actually occur in our sample?
- actual yes/total = 105/165 = 0.64
A couple other terms are also worth mentioning:
- Null Error Rate: This is how often you would be wrong if you always predicted the majority class. (In our example, the null error rate would be 60/165=0.36 because if you always predicted yes, you would only be wrong for the 60 "no" cases.) This can be a useful baseline metric to compare your classifier against. However, the best classifier for a particular application will sometimes have a higher error rate than the null error rate, as demonstrated by the Accuracy Paradox.
- Cohen's Kappa: This is essentially a measure of how well the classifier performed as compared to how well it would have performed simply by chance. In other words, a model will have a high Kappa score if there is a big difference between the accuracy and the null error rate. (More details about Cohen's Kappa.)
- F Score: This is a weighted average of the true positive rate (recall) and precision. (More details about the F Score.)
- ROC Curve: This is a commonly used graph that summarizes the performance of a classifier over all possible thresholds. It is generated by plotting the True Positive Rate (y-axis) against the False Positive Rate (x-axis) as you vary the threshold for assigning observations to a given class. (More details about ROC Curves.)
And finally, for those of you from the world of Bayesian statistics, here's a quick summary of these terms from Applied Predictive Modeling:
In relation to Bayesian statistics, the sensitivity and specificity are the conditional probabilities, the prevalence is the prior, and the positive/negative predicted values are the posterior probabilities.
What did I miss? Are there any terms that need a better explanation? Your feedback is welcome!
BREAKING NEWS: Confusion matrices... no longer confusing! http://t.co/olnjKAISnM #machinelearning— Kevin Markham (@justmarkham) November 24, 2014
P.S. Want more content like this in your inbox? Subscribe to the Data School newsletter.