4.4 Model Evaluation
The evaluation of a classification model is crucial to assess its performance and ensure its effectiveness in real-world applications. Accuracy is the first metric that comes to mind when evaluating a model, but it is not sufficient. For example, in a study with 100 observations, where 95 are healthy individuals and 5 are cancer patients, a model that simply classifies every observation as healthy would achieve 95% accuracy. However, this would fail to identify the cancer patients, making the model useless for the task at hand. Next, we will explore some model evaluation methods to better understand and measure the performance of classification models.
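To make this concrete, here is a minimal sketch of the 100-observation example (assuming Python with scikit-learn, purely for illustration): the all-healthy classifier reaches 95% accuracy while identifying none of the cancer patients.

```python
from sklearn.metrics import accuracy_score, recall_score

# 100 observations: 95 healthy (label 0) and 5 cancer patients (label 1)
y_true = [0] * 95 + [1] * 5
# A trivial "classifier" that labels everyone as healthy
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))  # 0.95 -- looks impressive
print(recall_score(y_true, y_pred))    # 0.0  -- misses every cancer patient
```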
4.4.2 More Choices
It is often difficult and inconvenient to compare classifiers along several dimensions at once. A more convenient approach is to use a single statistic that summarizes a classifier’s performance from multiple perspectives.
F-score: it is a statistic that combines precision and sensitivity into a single measure to evaluate the performance of a classifier, especially in situations where both false positives and false negatives matter. Essentially, it is the harmonic mean of precision and sensitivity (also called recall), striking a balance between the two metrics.
\[ \text{F-score} = 2 \times \frac{\text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}} \] The F-score ranges from 0 to 1: a value of 1 indicates perfect precision and sensitivity (best performance), while a value of 0 indicates the worst performance. With the same example as above, suppose we have a classifier that always predicts a person to be a cancer patient. This classifier has perfect sensitivity (1) but very low precision (0.05), so its F-score is \(2 \times \frac{0.05 \times 1}{0.05 + 1} \approx 0.095\). Anyone willing to use this classifier must be ignoring the negative effects of misclassifying a healthy person as a cancer patient.
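As a quick check of the arithmetic above, here is a short sketch (again assuming scikit-learn; the label vectors are constructed to match the example) that computes the F-score both from the formula and with `f1_score`:

```python
from sklearn.metrics import f1_score

# 5 cancer patients (1) and 95 healthy people (0)
y_true = [1] * 5 + [0] * 95
# A classifier that predicts everyone as a cancer patient
y_pred = [1] * 100

precision = 5 / 100   # TP / (TP + FP) = 0.05
sensitivity = 5 / 5   # TP / (TP + FN) = 1.0

f_manual = 2 * precision * sensitivity / (precision + sensitivity)
print(round(f_manual, 3))                  # 0.095
print(round(f1_score(y_true, y_pred), 3))  # 0.095
```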
Cohen’s kappa statistic: it is another option that can be used to comprehensively evaluate a classifier. Essentially, it measures the agreement between two raters (classifiers). For example, suppose two classifiers each classify the same 100 cases. If the two classifiers agree with each other completely, we get the following matrix, which is similar in spirit to a confusion matrix.
| | Classifier 2: Positive | Classifier 2: Negative |
|---|---|---|
| Classifier 1: Positive | 30 | 0 |
| Classifier 1: Negative | 0 | 70 |
Ignoring whether they are good classifiers, we can say that the two classifiers make exactly the same predictions; in other words, the two classifiers agree with each other completely. Let’s look at another example:
| | Classifier 2: Positive | Classifier 2: Negative |
|---|---|---|
| Classifier 1: Positive | 30 | 10 |
| Classifier 1: Negative | 5 | 55 |
In this case, the two classifiers clearly do not make exactly the same predictions: 5 cases are predicted as negative by classifier 1 but positive by classifier 2, and 10 disagreements go the opposite way. However, they still show a certain level of agreement. So the question is: can we design a statistic to quantify this agreement? The answer is Cohen’s kappa statistic. Before showing the formula, let’s clarify one thing. If we set ‘Classifier 1’ as the classifier you want to evaluate and ‘Classifier 2’ as the ground truth, then this statistic measures the agreement between your model and the ground truth, and the matrix becomes the confusion matrix.
Next, let’s look at how this statistic is calculated, using the notation of a confusion matrix. \[ \kappa = \frac{P_o - P_e}{1 - P_e} \]
- \(P_o\) is the observed agreement: the proportion of times the two raters agree, i.e. the accuracy \[ P_o = \frac{TP + TN}{TP + TN + FP + FN} \]
- \(P_e\) is the expected agreement: the proportion of times the two raters would be expected to agree by chance \[ P_e = \frac{(TP + FP)(TP + FN) + (TN + FP)(TN + FN)}{(TP + TN + FP + FN)^2} \]

In general, the following table can be used as a reference when interpreting \(\kappa\):
| Kappa (κ) Value | Interpretation |
|---|---|
| κ ≥ 0.81 | Almost perfect agreement |
| 0.61 ≤ κ ≤ 0.80 | Substantial agreement |
| 0.41 ≤ κ ≤ 0.60 | Moderate agreement |
| 0.21 ≤ κ ≤ 0.40 | Fair agreement |
| 0 ≤ κ ≤ 0.20 | Slight agreement |
| κ < 0 | Worse than random chance |
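For convenience, the thresholds in the table above can be wrapped in a small helper (a minimal sketch; the function name is ours and the boundaries follow the table):

```python
def interpret_kappa(kappa: float) -> str:
    """Map a Cohen's kappa value to the interpretation labels in the table above."""
    if kappa < 0:
        return "Worse than random chance"
    if kappa <= 0.20:
        return "Slight agreement"
    if kappa <= 0.40:
        return "Fair agreement"
    if kappa <= 0.60:
        return "Moderate agreement"
    if kappa <= 0.80:
        return "Substantial agreement"
    return "Almost perfect agreement"

print(interpret_kappa(0.68))  # Substantial agreement
```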
Let’s go back to the previous example:
| | Classifier 2: Positive | Classifier 2: Negative |
|---|---|---|
| Classifier 1: Positive | 30 (TP) | 10 (FP) |
| Classifier 1: Negative | 5 (FN) | 55 (TN) |
In this case, \(P_o = \frac{30 + 55}{100} = 0.85\) and \(P_e = \frac{(30 + 10)(30 + 5) + (55 + 10)(55 + 5)}{100^2} = 0.53\), so \(\kappa = \frac{0.85 - 0.53}{1 - 0.53} \approx 0.68\), which suggests substantial agreement between the two classifiers. If ‘Classifier 2’ represents the ground truth, then \(\kappa\) indicates that ‘Classifier 1’ is a rather good classifier.
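The calculation can be reproduced with a few lines of code (a sketch assuming scikit-learn; `cohen_kappa_score` works on label vectors, so the counts are expanded into two aligned vectors of predictions):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Counts from the table above (classifier 1 vs. classifier 2)
TP, FP, FN, TN = 30, 10, 5, 55
N = TP + FP + FN + TN

# Kappa from the formula
p_o = (TP + TN) / N
p_e = ((TP + FP) * (TP + FN) + (TN + FP) * (TN + FN)) / N ** 2
print(round((p_o - p_e) / (1 - p_e), 2))  # 0.68

# Cross-check: expand the counts into two aligned label vectors
clf1 = np.array([1] * (TP + FP) + [0] * (FN + TN))
clf2 = np.array([1] * TP + [0] * FP + [1] * FN + [0] * TN)
print(round(cohen_kappa_score(clf1, clf2), 2))  # 0.68
```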