Solutions to Exercises in Lab 2
In this lab, we implement and practice Gaussian Discriminant Analysis with the Breast Cancer data. The breast cancer data set comprises 31 variables, with diagnosis indicating the diagnostic result of each tumor sample, where M stands for malignant and B for benign. The remaining variables are extracted from medical images. For more information, read the help file of the data set.
Tasks: According to the requirements in each task, we use the data in the file BreastCancerTrain to train a classifier with diagnosis as the target variable and radius, texture, and smoothness as the feature variables. We then test the resulting classifier on the data set in the file BreastCancerTest.
Import data:

# Import data
dat_tr = read.table("BreastCancerTrain.txt", header = T, sep = ",")
dat_te = read.table("BreastCancerTest.txt", header = T, sep = ",")
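Although not required by the tasks, it can help to take a quick look at the imported data before modelling; a minimal check:

str(dat_tr)             # variable names and types
table(dat_tr$Diagnosis) # class counts for 'B' and 'M'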
Task 1:
We assume that, for different diagnostic results, the feature variables follow different normal distributions. Write an R function to implement a classifier based on the following decision rule \[ \widehat{y}=\arg\max_{y} f(\textbf{x}|y) \] where \(f(\textbf{x}|y)\) is the conditional normal density function. Estimate the unknown parameters based on the training data set and then apply this classifier to the testing data set. Report the accuracy, sensitivity, specificity, and kappa statistic.
Solutions to Task 1:
Prepare the target variable y and the data matrix of feature variables X.

y = dat_tr$Diagnosis
X = as.matrix(dat_tr[,c(2,3,6)]) # choose 'radius', 'texture', and 'smoothness'
Estimate the mean vector and covariance matrix for each class.
# estimate all means and covariance matrices first.
# We use them to evaluate the likelihood values in the classifier below
mu_m = colMeans(X[which(y == "M"), ]) # estimate the mean vector of the M group
S_m = cov(X[which(y == "M"), ]) # estimate the covariance matrix of the M group
mu_b = colMeans(X[which(y == "B"), ]) # estimate the mean vector of the B group
S_b = cov(X[which(y == "B"), ]) # estimate the covariance matrix of the B group
Write the function classifier with the feature values of a new case x and the model parameters mu1, S1, mu2, S2 as inputs. Here, we need the function dmvnorm in the package mvtnorm to evaluate the density value for the new case x, i.e. to calculate the conditional likelihood of the new case. The final prediction is made by comparing the two conditional likelihood values.
library(mvtnorm) # need function 'dmvnorm' in this package
# function for making a decision based on the idea of GDA
classifier = function(x,mu1,S1,mu2,S2){
  ell1 = dmvnorm(x,mu1,S1) # the likelihood of x if we assume it is from group 1
  ell2 = dmvnorm(x,mu2,S2) # the likelihood of x if we assume it is from group 2
  res = ifelse(ell1 > ell2, "M", "B") # make the decision
  return(res)
}
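As a quick sanity check before the full testing stage, the classifier can be applied to a single case (the index 1 below is arbitrary); it should return either "M" or "B":

classifier(X[1, ], mu_m, S_m, mu_b, S_b) # predicted class of the first training case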
In the testing stage, we first prepare the target variable and the feature-variable data matrix for the testing set.

y_te = dat_te$Diagnosis
X_te = as.matrix(dat_te[,c(2,3,6)])
Apply the function classifier to each row of the feature-variable data matrix X_te and record the prediction results in y_pre.
N_te = dim(dat_te)[1] # test sample size
y_pre = numeric(N_te)
for(i in 1:N_te){ # loop over all observations in the testing set
  y_pre[i] = classifier(X_te[i, ], mu_m, S_m, mu_b, S_b) # apply our classifier
}
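Equivalently, the explicit loop can be replaced by a one-line apply call, which passes the remaining arguments on to classifier:

y_pre = apply(X_te, 1, classifier, mu_m, S_m, mu_b, S_b) # same predictions, no loop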
Compare the predictions with the truth via y_pre == y_te. The result is a vector of logical values TRUE or FALSE. If we take the mean of this vector, the logical values are first converted to 1 or 0, so the mean is exactly the accuracy.
mean(y_pre == y_te) # accuracy
[1] 0.9035088
For more model performance statistics, we can apply the function confusionMatrix in the package caret. It returns not only the confusion matrix but also other statistics, such as sensitivity, specificity, Kappa, and so on.
library(caret) # to calculate the confusion matrix and kappa statistic
confusionMatrix(as.factor(y_pre), as.factor(y_te), positive = "M")
Confusion Matrix and Statistics
          Reference
Prediction  B  M
         B 69  5
         M  6 34
Accuracy : 0.9035
95% CI : (0.8339, 0.9508)
No Information Rate : 0.6579
P-Value [Acc > NIR] : 1.123e-09
Kappa : 0.787
Mcnemar's Test P-Value : 1
Sensitivity : 0.8718
Specificity : 0.9200
Pos Pred Value : 0.8500
Neg Pred Value : 0.9324
Prevalence : 0.3421
Detection Rate : 0.2982
Detection Prevalence : 0.3509
Balanced Accuracy : 0.8959
'Positive' Class : M
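If you only want to report the statistics requested in the task, they can also be extracted from the object returned by confusionMatrix; a short sketch:

cm = confusionMatrix(as.factor(y_pre), as.factor(y_te), positive = "M")
cm$overall[c("Accuracy", "Kappa")] # accuracy and kappa statistic
cm$byClass[c("Sensitivity", "Specificity")] # sensitivity and specificity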
Task 2:
Modify the R function in Task 1 such that it implements the classifier based on the following decision rule. \[ \widehat{y}=\arg\max_{y} \Pr(y|\textbf{x}) \] Apply the modified classifier to the testing data set and report the accuracy, sensitivity, specificity, and kappa statistic.
Solutions to Task 2:
To make the decision with the posterior probability \(\Pr(y|\textbf{x})\), note that by Bayes' theorem \(\Pr(y|\textbf{x}) \propto f(\textbf{x}|y)\Pr(y)\), so we need to weight each conditional likelihood by the prior probability \(\Pr(y)\). To do so, we first estimate the prior probability from the training data, then adjust the decision rule accordingly.
p_m = mean(y == "M") # estimate the prior probability

classifier = function(x,mu1,S1,mu2,S2,p_m){
  # Here, we use function 'dmvnorm' in package 'mvtnorm'
  ell1 = dmvnorm(x,mu1,S1)*p_m # modify the likelihood by the prior probability
  ell2 = dmvnorm(x,mu2,S2)*(1-p_m)
  res = ifelse(ell1 > ell2, "M", "B")
  return(res)
}

for(i in 1:N_te){
  y_pre[i] = classifier(X_te[i, ], mu_m, S_m, mu_b, S_b, p_m)
}
mean(y_pre == y_te)
[1] 0.9210526
confusionMatrix(as.factor(y_pre), as.factor(y_te), positive = "M")
Confusion Matrix and Statistics
          Reference
Prediction  B  M
         B 71  5
         M  4 34
Accuracy : 0.9211
95% CI : (0.8554, 0.9633)
No Information Rate : 0.6579
P-Value [Acc > NIR] : 3.991e-11
Kappa : 0.8235
Mcnemar's Test P-Value : 1
Sensitivity : 0.8718
Specificity : 0.9467
Pos Pred Value : 0.8947
Neg Pred Value : 0.9342
Prevalence : 0.3421
Detection Rate : 0.2982
Detection Prevalence : 0.3333
Balanced Accuracy : 0.9092
'Positive' Class : M
Task 3:
Task 3.1: Apply the lda function to estimate an LDA classifier based on the training data set. Test the resulting classifier with the testing set and report the accuracy, sensitivity, specificity, and kappa statistic.
Solutions to Task 3.1:
Now, we directly apply the function lda in the MASS package. First, we need to make sure that dat_tr is a data frame. With this type of data, the usage of lda is very simple: we specify the model with a formula of variable names, e.g. Diagnosis~radius+texture+smoothness, and then specify the data set. If you have a proper prior probability from a previous study, you can also specify it with the argument prior; otherwise, the prior will be estimated from the input data.
library(MASS)
# estimate the LDA model by function 'lda'
m = lda(Diagnosis~radius+texture+smoothness, data = dat_tr, prior = c(0.5,0.5))
The prediction stage is also very straightforward: we apply the function predict. However, you have to make sure that the data passed to the newdata argument has the same variable names as the training data dat_tr and that it is also a data frame.
# apply the output model 'm' on observations in the testing set.
res = predict(m, newdata = dat_te)
The prediction results are saved in res. By checking the structure of res, we can see that the list has three slots, of which the first two are the most important: $class can be used directly as the prediction results, and $posterior contains the predicted posterior probabilities for each case.
str(res)
List of 3
$ class : Factor w/ 2 levels "B","M": 2 2 2 2 1 1 2 1 1 1 ...
$ posterior: num [1:114, 1:2] 0.02072 0.00151 0.16858 0.43011 0.99986 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:114] "1" "2" "3" "4" ...
.. ..$ : chr [1:2] "B" "M"
$ x : num [1:114, 1] 1.4 2.358 0.58 0.102 -3.213 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:114] "1" "2" "3" "4" ...
.. ..$ : chr "LD1"
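Since there are only two classes, $class is simply the class with the larger posterior probability, so thresholding the posterior of M at 0.5 should reproduce it:

y_alt = ifelse(res$posterior[, "M"] > 0.5, "M", "B")
mean(y_alt == res$class) # should be 1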
y_pre = res$class
mean(y_pre == y_te)
[1] 0.9210526
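As before, the remaining statistics requested in the task can be obtained with confusionMatrix:

confusionMatrix(as.factor(y_pre), as.factor(y_te), positive = "M")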
Task 3.2: Given the output of the lda function in Task 3.1, write down the expression of the corresponding classifier.
Solutions to Task 3.2:
The lda output contains the weight vector \(\textbf{w}\) in the slot m$scaling, and a bias \(w_0\) can be obtained as the midpoint of the two projected class means. The classifier is therefore linear: predict M if \(\textbf{w}^\top\textbf{x} = 0.3593\,\mathrm{radius} + 0.1225\,\mathrm{texture} + 38.4264\,\mathrm{smoothness} > w_0 = 11.502\), and B otherwise (here the M group has the larger projection; if m$scaling had the opposite sign, the inequality would flip).

# weights:
(w = m$scaling)
                  LD1
radius      0.3593405
texture     0.1225361
smoothness 38.4264427

# bias:
(w_0 = mean(m$means%*%m$scaling))
[1] 11.50199
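To check that these weights and this bias really define the classifier, we can apply the linear rule to the testing set and compare with the predictions from Task 3.1; the agreement should be 1:

y_lin = ifelse(X_te %*% w > w_0, "M", "B") # predict M when w'x > w_0
mean(y_lin == res$class) # agreement with predict(m, newdata = dat_te)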
Task 3.3: Review your solution in Task 1. Is your solution in Task 1 an LDA classifier or a QDA classifier? In fact, the solution in Task 1 is a QDA classifier, since we estimate the covariance matrix for each class separately. If one wants to create the same function as lda, the pooled covariance matrix has to be estimated. To do so, we first remove the class means from the feature-variable data matrix of each class, then use the centered data to estimate the pooled covariance matrix.
X_M = X[which(y == "M"), ]
N_XM = dim(X_M)[1]
X_M = X_M - matrix(rep(mu_m, N_XM), nrow = N_XM, byrow = TRUE) # center the M group (byrow = TRUE puts mu_m in every row)

X_B = X[which(y == "B"), ]
N_XB = dim(X_B)[1]
X_B = X_B - matrix(rep(mu_b, N_XB), nrow = N_XB, byrow = TRUE) # center the B group

S_pooled = cov(rbind(X_M, X_B)) # pooled covariance matrix from the centered data

classifier = function(x,mu1,mu2,S,p_m = 0.5){
  # Here, we use function 'dmvnorm' in package 'mvtnorm'
  ell1 = dmvnorm(x,mu1,S)*p_m # modify the likelihood by the prior probability
  ell2 = dmvnorm(x,mu2,S)*(1-p_m)
  res = ifelse(ell1 > ell2, "M", "B")
  return(res)
}
classifier(X_te[1, ], mu_m, mu_b, S_pooled, 0.5)
[1] "M"
After this modification, the function classifier gives exactly the same prediction results as the model obtained by the function lda.
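This can be verified by running the pooled-covariance classifier over the whole testing set and comparing with the lda predictions from Task 3.1:

y_pool = apply(X_te, 1, classifier, mu_m, mu_b, S_pooled, 0.5)
mean(y_pool == res$class) # should be 1 if the two classifiers agree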