Lab 2: Exercises about Gaussian Discriminant Analysis
In this lab, we implement and practice Gaussian Discriminant Analysis with breast cancer data. The breast cancer data set comprises 31 variables, with “diagnosis” indicating the diagnostic results of tumor samples, where M stands for malignant and B for benign. The remaining variables are extracted from medical images. For more information, read the help.
Tasks: Following the requirements of each task and using the data in the file BreastCancerTrain, we train a classifier with diagnosis as the target variable and radius, texture, and smoothness as the feature variables. We then test the resulting classifier on the data set in the file BreastCancerTest.
Task 1:
We assume that, for each diagnostic result, the feature variables follow a different normal distribution. Write an R function to implement a classifier based on the following decision rule \[ \widehat{y}=\arg\max_{y} f(\textbf{x}|y) \] where \(f(\textbf{x}|y)\) is the conditional normal density function. Estimate the unknown parameters from the training data set and then apply this classifier to the testing data set. Report the accuracy, sensitivity, specificity, and kappa statistic.
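For reference, the conditional density in this rule can be written out as the multivariate normal density. The symbols \(\mu_k\), \(\Sigma_k\), and \(d\) are notation assumed here (class mean vector, class covariance matrix, and number of features, so \(d = 3\) in this lab); they are not defined in the task text itself.
\[ f(\textbf{x}|y=k)=\frac{1}{(2\pi)^{d/2}\,|\Sigma_k|^{1/2}}\exp\left(-\frac{1}{2}(\textbf{x}-\mu_k)^{\top}\Sigma_k^{-1}(\textbf{x}-\mu_k)\right) \]
Because the logarithm is monotone, comparing \(\log f(\textbf{x}|y)\) across classes gives the same decision and is usually more stable numerically.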
Tips:
- The data can be imported with the function read.table. You need to set header as TRUE and sep as ",".
- The inputs of the function should include the variables x, mu_1, S_1, mu_2, and S_2, where x is an array containing the three feature variables, mu_k is the estimated mean vector of category k, and S_k is the estimated covariance matrix of category k (k = 1, 2).
- The functions colMeans and cov can be applied to estimate the unknown parameters.
- The function dmvnorm in the package mvtnorm can be applied to calculate the normal density value. For details, read the help document of this function.
- The function confusionMatrix in the package caret can be used to calculate different performance metrics. Note: The first two inputs should be of type factor. Also, you need to specify positive as M, i.e. malignant.
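The tips above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference solution: the data are synthetic stand-ins for the lab files, `log_dmvnorm()` is a hypothetical base-R substitute for `dmvnorm(..., log = TRUE)` from mvtnorm, and the metrics are computed by hand from a confusion table rather than with `confusionMatrix`, so the block runs without extra packages. The names `classify_ml`, `make_class`, and `kappa_stat` are likewise made up for this example.

```r
# Hypothetical base-R stand-in for mvtnorm::dmvnorm(x, mu, S, log = TRUE)
log_dmvnorm <- function(x, mu, S) {
  x <- as.matrix(x)
  d <- ncol(x)
  # Squared Mahalanobis distance of each row of x from mu
  m2 <- mahalanobis(x, center = mu, cov = S)
  log_det <- as.numeric(determinant(S, logarithm = TRUE)$modulus)
  -0.5 * (d * log(2 * pi) + log_det + m2)
}

# Decision rule: choose the class whose fitted density is larger at x
classify_ml <- function(x, mu_1, S_1, mu_2, S_2, labels = c("M", "B")) {
  l1 <- log_dmvnorm(x, mu_1, S_1)
  l2 <- log_dmvnorm(x, mu_2, S_2)
  factor(ifelse(l1 > l2, labels[1], labels[2]), levels = labels)
}

# Synthetic data (assumption): three features, two well-separated classes
set.seed(1)
make_class <- function(n, centre) {
  matrix(rnorm(n * 3), ncol = 3) + matrix(centre, n, 3, byrow = TRUE)
}
train_M <- make_class(60, c(3, 3, 3))
train_B <- make_class(60, c(0, 0, 0))
test_x  <- rbind(make_class(20, c(3, 3, 3)), make_class(20, c(0, 0, 0)))
test_y  <- factor(rep(c("M", "B"), each = 20), levels = c("M", "B"))

# Parameter estimates from the training data, one set per class
mu_1 <- colMeans(train_M); S_1 <- cov(train_M)
mu_2 <- colMeans(train_B); S_2 <- cov(train_B)

pred <- classify_ml(test_x, mu_1, S_1, mu_2, S_2)

# Confusion table with M as the positive class
tab <- table(Predicted = pred, Actual = test_y)
accuracy    <- sum(diag(tab)) / sum(tab)
sensitivity <- tab["M", "M"] / sum(tab[, "M"])
specificity <- tab["B", "B"] / sum(tab[, "B"])
# Cohen's kappa: agreement beyond what the marginals alone would give
p_e <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2
kappa_stat <- (accuracy - p_e) / (1 - p_e)
print(tab)
```

With the real data, `train_M` and `train_B` would be the rows of the training set split by diagnosis, restricted to the three feature columns.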
Task 2:
Modify the R function in Task 1 so that it implements a classifier based on the following decision rule. \[ \widehat{y}=\arg\max_{y} P(y|\textbf{x}) \] Apply the modified classifier to the testing data set and report the accuracy, sensitivity, specificity, and kappa statistic.
Tips: The prior probability can be estimated as the proportion of M in the training data set.
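As a rough sketch of how the prior enters the rule: by Bayes' theorem \(P(y|\textbf{x})\) is proportional to \(f(\textbf{x}|y)P(y)\), so it is enough to compare log-density plus log-prior. Everything below is illustrative: `train_labels`, `l_M`, and `l_B` are made-up values, and `classify_map` is a hypothetical helper name. In the lab, the log-densities would come from the fitted normal densities of Task 1.

```r
# Made-up training labels (assumption), used only to estimate the prior
train_labels <- factor(c("M", "B", "B", "M", "B", "B", "B", "M"))

# Prior estimated as the proportion of M in the training labels
prior_M <- mean(train_labels == "M")

classify_map <- function(l_M, l_B, prior_M) {
  # Compare log f(x|y) + log P(y); the shared denominator f(x) cancels
  score_M <- l_M + log(prior_M)
  score_B <- l_B + log(1 - prior_M)
  factor(ifelse(score_M > score_B, "M", "B"), levels = c("M", "B"))
}

# Hypothetical class-conditional log-densities for three test points
l_M <- c(-4.0, -6.0, -5.0)
l_B <- c(-5.0, -4.0, -5.2)

pred <- classify_map(l_M, l_B, prior_M)
print(pred)
```

For the third point the likelihood alone slightly favours M, but the smaller prior for M tips the decision to B, which is exactly where this rule can differ from the one in Task 1.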
Task 3:
Task 3.1: Apply the lda function to estimate an LDA classifier based on the training data set. Test the resulting classifier with the testing set and report the accuracy, sensitivity, specificity, and kappa statistic.
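A minimal sketch of the workflow, assuming the `lda` function meant here is the one in the MASS package. The data frame is simulated: only the column names mirror the lab's variables, and the numeric values and the helper `sim` are invented for illustration.

```r
library(MASS)  # provides lda()

set.seed(2)
# Simulated stand-in for the lab data (assumption): values are invented
sim <- function(n, shift, label) {
  data.frame(diagnosis  = label,
             radius     = rnorm(n, 12 + shift, 2),
             texture    = rnorm(n, 18 + shift, 3),
             smoothness = rnorm(n, 0.09 + shift / 100, 0.01))
}
train <- rbind(sim(50, 6, "M"), sim(50, 0, "B"))
test  <- rbind(sim(20, 6, "M"), sim(20, 0, "B"))
train$diagnosis <- factor(train$diagnosis, levels = c("M", "B"))
test$diagnosis  <- factor(test$diagnosis,  levels = c("M", "B"))

# Fit LDA on the training set, then predict classes for the test set
fit  <- lda(diagnosis ~ radius + texture + smoothness, data = train)
pred <- predict(fit, newdata = test)$class

tab <- table(Predicted = pred, Actual = test$diagnosis)
accuracy <- mean(pred == test$diagnosis)
print(fit)   # shows priors, group means, and discriminant coefficients
print(tab)
```

The predicted classes and the true labels are both factors with the same levels, so they can be passed on to a metrics function as they are.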
Task 3.2: Given the output of the lda function in Task 3.1, write down the expression of the corresponding classifier.
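As a starting point, the general textbook form of a two-class LDA rule is given below. The notation is assumed here rather than taken from the lab: \(\mu_M\) and \(\mu_B\) are the class means, \(\Sigma\) is the covariance matrix shared by both classes, and \(\pi_M\), \(\pi_B\) are the prior probabilities. The discriminant function of class \(k\) is
\[ \delta_k(\textbf{x})=\textbf{x}^{\top}\Sigma^{-1}\mu_k-\frac{1}{2}\mu_k^{\top}\Sigma^{-1}\mu_k+\log\pi_k \]
and a sample is classified as M when \(\delta_M(\textbf{x})>\delta_B(\textbf{x})\), that is, when
\[ \textbf{x}^{\top}\Sigma^{-1}(\mu_M-\mu_B)-\frac{1}{2}(\mu_M+\mu_B)^{\top}\Sigma^{-1}(\mu_M-\mu_B)+\log\frac{\pi_M}{\pi_B}>0 \]
The printed output of lda reports the prior probabilities, the group means, and the coefficients of the linear discriminant. In the two-class case those coefficients should be proportional to \(\Sigma^{-1}(\mu_M-\mu_B)\) up to sign and scale, so the threshold has to be worked out from the reported means and priors rather than read off directly.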
Task 3.3: Review your solution in Task 1. Is your solution in Task 1 an LDA classifier or a QDA classifier?