Solutions to Exercises in Lab 2
In this lab, we implement and practice Gaussian Discriminant Analysis with the Breast Cancer data. The breast cancer data set comprises 31 variables, with diagnosis indicating the diagnostic result of each tumor sample, where M stands for malignant and B for benign. The remaining variables are extracted from medical images. For more information, read the help file of the data set.
Tasks: According to the requirements in each task, we use the data in the file BreastCancerTrain to train a classifier with diagnosis as the target variable and radius, texture, and smoothness as the feature variables. We then test the resulting classifier on the data set in the file BreastCancerTest.
Import data:

# Import data
dat_tr = read.table("BreastCancerTrain.txt", header = T, sep = ",")
dat_te = read.table("BreastCancerTest.txt", header = T, sep = ",")
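Although not required by the tasks, it can help to take a quick look at the imported data before modelling; a minimal check:

str(dat_tr)             # variable names and types
table(dat_tr$Diagnosis) # class counts for 'B' and 'M'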
Task 1:
We assume that, for different diagnostic results, the feature variables follow different normal distributions. Write an R function to implement a classifier based on the following decision rule \[ \widehat{y}=\arg\max_{y} f(\textbf{x}|y) \] where \(f(\textbf{x}|y)\) is the conditional normal density function. Estimate the unknown parameters based on the training data set and then apply this classifier to the testing data set. Report the accuracy, sensitivity, specificity, and kappa statistic.
Solutions to Task 1:
Prepare the target variable y and the data matrix of feature variables X.

y = dat_tr$Diagnosis
X = as.matrix(dat_tr[,c(2,3,6)]) # choose 'radius', 'texture', and 'smoothness'
Estimate the mean vector and covariance matrix for each class.
# estimate all means and covariance matrices first.
# We use them to evaluate the likelihood values in the classifier below
mu_m = colMeans(X[which(y == "M"), ]) # estimate the mean vector of the M group
S_m = cov(X[which(y == "M"), ]) # estimate the covariance matrix of the M group
mu_b = colMeans(X[which(y == "B"), ]) # estimate the mean vector of the B group
S_b = cov(X[which(y == "B"), ]) # estimate the covariance matrix of the B group
Write the function classifier with the feature values of a new case x and the model parameters mu1, S1, mu2, S2 as inputs. Here, we need the function dmvnorm in the package mvtnorm to evaluate the density value for the new case x, i.e. to calculate the conditional likelihood of the new case. The final prediction is made by comparing the two conditional likelihood values.
library(mvtnorm) # need function 'dmvnorm' in this package
# function for making a decision based on the idea of GDA
classifier = function(x,mu1,S1,mu2,S2){
  ell1 = dmvnorm(x,mu1,S1) # the likelihood of x if we assume it is from group 1
  ell2 = dmvnorm(x,mu2,S2) # the likelihood of x if we assume it is from group 2
  res = ifelse(ell1 > ell2, "M", "B") # make the decision
  return(res)
}
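As a quick sanity check before the full testing stage, the classifier can be applied to a single case (the index 1 below is arbitrary); it should return either "M" or "B":

classifier(X[1, ], mu_m, S_m, mu_b, S_b) # predicted class of the first training case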
In the testing stage, we first prepare the target variable and the feature-variable data matrix for the testing set.

y_te = dat_te$Diagnosis
X_te = as.matrix(dat_te[,c(2,3,6)])
Apply the function classifier to each row of the feature-variable data matrix X_te and record the prediction results in y_pre.
N_te = dim(dat_te)[1] # test sample size
y_pre = numeric(N_te)
for(i in 1:N_te){ # loop over all observations in the testing set
  y_pre[i] = classifier(X_te[i, ], mu_m, S_m, mu_b, S_b) # apply our classifier
}
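Equivalently, the explicit loop can be replaced by a one-line apply call, which passes the remaining arguments on to classifier:

y_pre = apply(X_te, 1, classifier, mu_m, S_m, mu_b, S_b) # same predictions, no loop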
Compare the predictions with the truth via y_pre == y_te. The result is a vector of logical values TRUE or FALSE. If we take the mean of this vector, the logical values are first converted to 1 or 0, so the mean is exactly the accuracy.
mean(y_pre == y_te) # accuracy
[1] 0.9035088
For more model performance statistics, we can apply the function confusionMatrix in the package caret. It returns not only the confusion matrix but also other statistics, such as sensitivity, specificity, Kappa, and so on.
library(caret) # to calculate the confusion matrix and kappa statistic
confusionMatrix(as.factor(y_pre), as.factor(y_te), positive = "M")
Confusion Matrix and Statistics
          Reference
Prediction  B  M
         B 69  5
         M  6 34
Accuracy : 0.9035
95% CI : (0.8339, 0.9508)
No Information Rate : 0.6579
P-Value [Acc > NIR] : 1.123e-09
Kappa : 0.787
Mcnemar's Test P-Value : 1
Sensitivity : 0.8718
Specificity : 0.9200
Pos Pred Value : 0.8500
Neg Pred Value : 0.9324
Prevalence : 0.3421
Detection Rate : 0.2982
Detection Prevalence : 0.3509
Balanced Accuracy : 0.8959
'Positive' Class : M
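If you only want to report the statistics requested in the task, they can also be extracted from the object returned by confusionMatrix; a short sketch:

cm = confusionMatrix(as.factor(y_pre), as.factor(y_te), positive = "M")
cm$overall[c("Accuracy", "Kappa")] # accuracy and kappa statistic
cm$byClass[c("Sensitivity", "Specificity")] # sensitivity and specificity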
Task 2:
Modify the R function in Task 1 such that it implements the classifier based on the following decision rule. \[ \widehat{y}=\arg\max_{y} \Pr(y|\textbf{x}) \] Apply the modified classifier to the testing data set and report the accuracy, sensitivity, specificity, and kappa statistic.
Solutions to Task 2:
To make the decision with the posterior probability \(\Pr(y|\textbf{x})\), note that by Bayes' theorem \(\Pr(y|\textbf{x}) \propto f(\textbf{x}|y)\Pr(y)\), so we need to weight each conditional likelihood by the prior probability \(\Pr(y)\). To do so, we first estimate the prior probability from the training data, then adjust the decision rule accordingly.
p_m = mean(y == "M") # estimate the prior probability

classifier = function(x,mu1,S1,mu2,S2,p_m){
  # Here, we use function 'dmvnorm' in package 'mvtnorm'
  ell1 = dmvnorm(x,mu1,S1)*p_m # modify the likelihood by the prior probability
  ell2 = dmvnorm(x,mu2,S2)*(1-p_m)
  res = ifelse(ell1 > ell2, "M", "B")
  return(res)
}

for(i in 1:N_te){
  y_pre[i] = classifier(X_te[i, ], mu_m, S_m, mu_b, S_b, p_m)
}
mean(y_pre == y_te)
[1] 0.9210526
confusionMatrix(as.factor(y_pre), as.factor(y_te), positive = "M")
Confusion Matrix and Statistics
          Reference
Prediction  B  M
         B 71  5
         M  4 34
Accuracy : 0.9211
95% CI : (0.8554, 0.9633)
No Information Rate : 0.6579
P-Value [Acc > NIR] : 3.991e-11
Kappa : 0.8235
Mcnemar's Test P-Value : 1
Sensitivity : 0.8718
Specificity : 0.9467
Pos Pred Value : 0.8947
Neg Pred Value : 0.9342
Prevalence : 0.3421
Detection Rate : 0.2982
Detection Prevalence : 0.3333
Balanced Accuracy : 0.9092
'Positive' Class : M
Task 3:
Task 3.1: Apply the lda function to estimate an LDA classifier based on the training data set. Test the resulting classifier with the testing set and report the accuracy, sensitivity, specificity, and kappa statistic.
Solutions to Task 3.1:
Now, we directly apply the function lda in the MASS package. First, we need to make sure that dat_tr is a data frame. With this type of data, the usage of lda is very simple: we specify the model with a formula of variable names, e.g. Diagnosis~radius+texture+smoothness, and then specify the data set. If you have a proper prior probability from a previous study, you can also specify it with the argument prior; otherwise, the prior will be estimated from the input data.
library(MASS)
# estimate the LDA model by function 'lda'
m = lda(Diagnosis~radius+texture+smoothness, data = dat_tr, prior = c(0.5,0.5))
The prediction stage is also very straightforward: we apply the function predict. However, you have to make sure that the data passed to the newdata argument has the same variable names as the training data dat_tr and that it is also a data frame.
# apply the output model 'm' on observations in the testing set.
res = predict(m, newdata = dat_te)
The prediction results are saved in res. By checking the structure of res, we can see that the list has three slots, of which the first two are the most important: $class can be used directly as the prediction results, and $posterior contains the predicted posterior probabilities for each case.
str(res)
List of 3
$ class : Factor w/ 2 levels "B","M": 2 2 2 2 1 1 2 1 1 1 ...
$ posterior: num [1:114, 1:2] 0.02072 0.00151 0.16858 0.43011 0.99986 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:114] "1" "2" "3" "4" ...
.. ..$ : chr [1:2] "B" "M"
$ x : num [1:114, 1] 1.4 2.358 0.58 0.102 -3.213 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:114] "1" "2" "3" "4" ...
.. ..$ : chr "LD1"
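Since there are only two classes, $class is simply the class with the larger posterior probability, so thresholding the posterior of M at 0.5 should reproduce it:

y_alt = ifelse(res$posterior[, "M"] > 0.5, "M", "B")
mean(y_alt == res$class) # should be 1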
y_pre = res$class
mean(y_pre == y_te)
[1] 0.9210526
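As before, the remaining statistics requested in the task can be obtained with confusionMatrix:

confusionMatrix(as.factor(y_pre), as.factor(y_te), positive = "M")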
Task 3.2: Given the output of the lda function in Task 3.1, write down the expression of the corresponding classifier.
Solutions to Task 3.2:
The lda output contains the weight vector \(\textbf{w}\) in the slot m$scaling, and a bias \(w_0\) can be obtained as the midpoint of the two projected class means. The classifier is therefore linear: predict M if \(\textbf{w}^\top\textbf{x} = 0.3593\,\mathrm{radius} + 0.1225\,\mathrm{texture} + 38.4264\,\mathrm{smoothness} > w_0 = 11.502\), and B otherwise (here the M group has the larger projection; if m$scaling had the opposite sign, the inequality would flip).

# weights:
(w = m$scaling)
                  LD1
radius      0.3593405
texture     0.1225361
smoothness 38.4264427

# bias:
(w_0 = mean(m$means%*%m$scaling))
[1] 11.50199
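To check that these weights and this bias really define the classifier, we can apply the linear rule to the testing set and compare with the predictions from Task 3.1; the agreement should be 1:

y_lin = ifelse(X_te %*% w > w_0, "M", "B") # predict M when w'x > w_0
mean(y_lin == res$class) # agreement with predict(m, newdata = dat_te)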
Task 3.3: Review your solution in Task 1. Is your solution in Task 1 an LDA classifier or a QDA classifier? In fact, the solution in Task 1 is a QDA classifier, since we estimate the covariance matrix for each class separately. If one wants to create the same function as lda, the pooled covariance matrix has to be estimated. To do so, we first remove the class means from the feature-variable data matrix of each class, then use the centered data to estimate the pooled covariance matrix.
X_M = X[which(y == "M"), ]
N_XM = dim(X_M)[1]
X_M = X_M - matrix(rep(mu_m, N_XM), nrow = N_XM, byrow = TRUE) # center the M group (byrow = TRUE puts mu_m in every row)

X_B = X[which(y == "B"), ]
N_XB = dim(X_B)[1]
X_B = X_B - matrix(rep(mu_b, N_XB), nrow = N_XB, byrow = TRUE) # center the B group

S_pooled = cov(rbind(X_M, X_B)) # pooled covariance matrix from the centered data

classifier = function(x,mu1,mu2,S,p_m = 0.5){
  # Here, we use function 'dmvnorm' in package 'mvtnorm'
  ell1 = dmvnorm(x,mu1,S)*p_m # modify the likelihood by the prior probability
  ell2 = dmvnorm(x,mu2,S)*(1-p_m)
  res = ifelse(ell1 > ell2, "M", "B")
  return(res)
}
classifier(X_te[1, ], mu_m, mu_b, S_pooled, 0.5)
[1] "M"
After this modification, the function classifier gives exactly the same prediction results as the model obtained by the function lda.
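This can be verified by running the pooled-covariance classifier over the whole testing set and comparing with the lda predictions from Task 3.1:

y_pool = apply(X_te, 1, classifier, mu_m, mu_b, S_pooled, 0.5)
mean(y_pool == res$class) # should be 1 if the two classifiers agree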