4.4 Model Evaluation
The evaluation of a classification model is crucial to assess its performance and ensure its effectiveness in real-world applications. Accuracy is the first metric that comes to mind when evaluating a model, but it is not sufficient. For example, in a study with 100 observations, where 95 are healthy individuals and 5 are cancer patients, a model that simply classifies every observation as healthy would achieve 95% accuracy. However, this would fail to identify the cancer patients, making the model useless for the task at hand. Next, we will explore some model evaluation methods to better understand and measure the performance of classification models.
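To make this concrete, here is a minimal sketch of the 100-observation example (assuming Python with scikit-learn, purely for illustration): the all-healthy classifier reaches 95% accuracy while identifying none of the cancer patients.

```python
from sklearn.metrics import accuracy_score, recall_score

# 100 observations: 95 healthy (label 0) and 5 cancer patients (label 1)
y_true = [0] * 95 + [1] * 5
# A trivial "classifier" that labels everyone as healthy
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))  # 0.95 -- looks impressive
print(recall_score(y_true, y_pred))    # 0.0  -- misses every cancer patient
```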
4.4.2 More Choices
It is often difficult and inconvenient to compare classifiers along several dimensions at once. A more convenient approach is to use a single statistic that summarizes a classifier’s performance from multiple perspectives.
F-score: it is a statistic that combines precision and sensitivity into a single measure to evaluate the performance of a classifier, especially in situations where both false positives and false negatives matter. Essentially, it is the harmonic mean of precision and sensitivity (also called recall), striking a balance between the two metrics.
\[ \text{F-score} = 2 \times \frac{\text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}} \] The F-score ranges from 0 to 1: a value of 1 indicates perfect precision and sensitivity (best performance), while a value of 0 indicates the worst performance. With the same example as above, suppose we have a classifier that always predicts a person to be a cancer patient. This classifier has perfect sensitivity (1) but very low precision (0.05), so its F-score is \(2 \times \frac{0.05 \times 1}{0.05 + 1} \approx 0.095\). Anyone willing to use this classifier must be ignoring the negative effects of misclassifying a healthy person as a cancer patient.
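As a quick check of the arithmetic above, here is a short sketch (again assuming scikit-learn; the label vectors are constructed to match the example) that computes the F-score both from the formula and with `f1_score`:

```python
from sklearn.metrics import f1_score

# 5 cancer patients (1) and 95 healthy people (0)
y_true = [1] * 5 + [0] * 95
# A classifier that predicts everyone as a cancer patient
y_pred = [1] * 100

precision = 5 / 100   # TP / (TP + FP) = 0.05
sensitivity = 5 / 5   # TP / (TP + FN) = 1.0

f_manual = 2 * precision * sensitivity / (precision + sensitivity)
print(round(f_manual, 3))                  # 0.095
print(round(f1_score(y_true, y_pred), 3))  # 0.095
```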
Cohen’s kappa statistic: it is another option that can be used to comprehensively evaluate a classifier. Essentially, it measures the agreement between two raters (classifiers). For example, suppose two classifiers each classify the same 100 cases. If the two classifiers agree with each other completely, we get the following matrix, which is similar in spirit to a confusion matrix.
| | Classifier 2: Positive | Classifier 2: Negative |
|---|---|---|
| Classifier 1: Positive | 30 | 0 |
| Classifier 1: Negative | 0 | 70 |
Ignoring whether they are good classifiers, we can say that the two classifiers make exactly the same predictions; in other words, the two classifiers agree with each other completely. Let’s look at another example:
| | Classifier 2: Positive | Classifier 2: Negative |
|---|---|---|
| Classifier 1: Positive | 30 | 10 |
| Classifier 1: Negative | 5 | 55 |
In this case, the two classifiers clearly do not make exactly the same predictions: 5 cases are predicted as negative by classifier 1 but positive by classifier 2, and 10 disagreements go the opposite way. However, they still show a certain level of agreement. So the question is: can we design a statistic to quantify this agreement? The answer is Cohen’s kappa statistic. Before showing the formula, let’s clarify one thing. If we set ‘Classifier 1’ as the classifier you want to evaluate and ‘Classifier 2’ as the ground truth, then this statistic measures the agreement between your model and the ground truth, and the matrix becomes the confusion matrix.
Next, let’s look at how this statistic is calculated, using the notation of a confusion matrix. \[ \kappa = \frac{P_o - P_e}{1 - P_e} \]
- \(P_o\) is the observed agreement: the proportion of times the two raters agree, i.e. the accuracy \[ P_o = \frac{TP + TN}{TP + TN + FP + FN} \]
- \(P_e\) is the expected agreement: the proportion of times the two raters would be expected to agree by chance \[ P_e = \frac{(TP + FP)(TP + FN) + (TN + FP)(TN + FN)}{(TP + TN + FP + FN)^2} \]

In general, the following table can be used as a reference when interpreting \(\kappa\):
| Kappa (κ) Value | Interpretation |
|---|---|
| κ ≥ 0.81 | Almost perfect agreement |
| 0.61 ≤ κ ≤ 0.80 | Substantial agreement |
| 0.41 ≤ κ ≤ 0.60 | Moderate agreement |
| 0.21 ≤ κ ≤ 0.40 | Fair agreement |
| 0 ≤ κ ≤ 0.20 | Slight agreement |
| κ < 0 | Worse than random chance |
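For convenience, the thresholds in the table above can be wrapped in a small helper (a minimal sketch; the function name is ours and the boundaries follow the table):

```python
def interpret_kappa(kappa: float) -> str:
    """Map a Cohen's kappa value to the interpretation labels in the table above."""
    if kappa < 0:
        return "Worse than random chance"
    if kappa <= 0.20:
        return "Slight agreement"
    if kappa <= 0.40:
        return "Fair agreement"
    if kappa <= 0.60:
        return "Moderate agreement"
    if kappa <= 0.80:
        return "Substantial agreement"
    return "Almost perfect agreement"

print(interpret_kappa(0.68))  # Substantial agreement
```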
Let’s go back to the previous example:
| | Classifier 2: Positive | Classifier 2: Negative |
|---|---|---|
| Classifier 1: Positive | 30 (TP) | 10 (FP) |
| Classifier 1: Negative | 5 (FN) | 55 (TN) |
In this case, \(P_o = \frac{30 + 55}{100} = 0.85\) and \(P_e = \frac{(30 + 10)(30 + 5) + (55 + 10)(55 + 5)}{100^2} = 0.53\), so \(\kappa = \frac{0.85 - 0.53}{1 - 0.53} \approx 0.68\), which suggests substantial agreement between the two classifiers. If ‘Classifier 2’ represents the ground truth, then \(\kappa\) indicates that ‘Classifier 1’ is a rather good classifier.
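The calculation can be reproduced with a few lines of code (a sketch assuming scikit-learn; `cohen_kappa_score` works on label vectors, so the counts are expanded into two aligned vectors of predictions):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Counts from the table above (classifier 1 vs. classifier 2)
TP, FP, FN, TN = 30, 10, 5, 55
N = TP + FP + FN + TN

# Kappa from the formula
p_o = (TP + TN) / N
p_e = ((TP + FP) * (TP + FN) + (TN + FP) * (TN + FN)) / N ** 2
print(round((p_o - p_e) / (1 - p_e), 2))  # 0.68

# Cross-check: expand the counts into two aligned label vectors
clf1 = np.array([1] * (TP + FP) + [0] * (FN + TN))
clf2 = np.array([1] * TP + [0] * FP + [1] * FN + [0] * TN)
print(round(cohen_kappa_score(clf1, clf2), 2))  # 0.68
```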