Lab 2: Exercises about Gaussian Discriminant Analysis

In this lab, we implement and practice Gaussian Discriminant Analysis with Breast Cancer Data. The breast cancer data set comprises 31 variables, with “diagnosis” indicating the diagnostic results of tumor samples, where M stands for malignant and B for benign. The remaining variables are extracted from medical images. For more information, read help.

Tasks: Following the requirements of each task, train a classifier on the data in the file BreastCancerTrain, with diagnosis as the target variable and radius, texture, and smoothness as the feature variables. Then test the resulting classifier on the data in the file BreastCancerTest.

Task 1:

We assume that, for each diagnostic result, the feature variables follow a different normal distribution. Write an R function that implements a classifier based on the following decision rule \[ \widehat{y}=\arg\max_{y} f(\textbf{x}|y) \] where \(f(\textbf{x}|y)\) is the conditional normal density function of the features given the class. Estimate the unknown parameters from the training data set and then apply the classifier to the testing data set. Report the accuracy, sensitivity, specificity, and kappa statistic.
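For reference, the conditional density in this rule is the standard multivariate normal density. The symbols \(p\) (the number of feature variables, here 3), \(\boldsymbol{\mu}_y\) (mean vector of class \(y\)) and \(\boldsymbol{\Sigma}_y\) (covariance matrix of class \(y\)) are notation introduced here, not taken from the lab text. \[ f(\textbf{x}|y)=\frac{1}{(2\pi)^{p/2}|\boldsymbol{\Sigma}_y|^{1/2}}\exp\left(-\frac{1}{2}(\textbf{x}-\boldsymbol{\mu}_y)^{\top}\boldsymbol{\Sigma}_y^{-1}(\textbf{x}-\boldsymbol{\mu}_y)\right) \]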

Tips:

  • The data can be imported with the function read.table. You need to set header = TRUE and sep = ",".

  • The inputs of the function should include x, mu_1, S_1, mu_2, and S_2, where x is an array containing the three feature variables, and mu_k and S_k are the estimated mean vector and the estimated covariance matrix of category k (k = 1, 2).

  • Functions colMeans and cov can be applied to estimate the unknown parameters.

  • The function dmvnorm in the package mvtnorm can be applied to calculate the normal density value. For details, read the help document of this function.

  • The function confusionMatrix in the package caret can be used to calculate different metrics of performance.

Note: The first two arguments of confusionMatrix (the predicted labels and the true labels) should be factors. Also, you need to set the argument positive to "M", i.e. malignant.
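The tips above can be combined roughly as in the following sketch. It is only an illustration under stated assumptions: the function name gda_classify is hypothetical, class 1 is taken to be M and class 2 to be B, and the data are simulated stand-ins rather than the real breast cancer files.

```r
library(mvtnorm)  # provides dmvnorm() and rmvnorm()

# Hypothetical helper: labels each row of x as "M" or "B" by comparing
# the two class-conditional normal densities (class 1 = M, class 2 = B).
gda_classify <- function(x, mu_1, S_1, mu_2, S_2) {
  x <- as.matrix(x)
  f_1 <- dmvnorm(x, mean = mu_1, sigma = S_1)  # density under class 1
  f_2 <- dmvnorm(x, mean = mu_2, sigma = S_2)  # density under class 2
  factor(ifelse(f_1 > f_2, "M", "B"), levels = c("M", "B"))
}

# Simulated stand-in data with three features (NOT the real data set)
set.seed(1)
x_M <- rmvnorm(100, mean = c(3, 3, 3), sigma = diag(3))
x_B <- rmvnorm(100, mean = c(0, 0, 0), sigma = 2 * diag(3))

# One mean vector and one covariance matrix per class
mu_1 <- colMeans(x_M); S_1 <- cov(x_M)
mu_2 <- colMeans(x_B); S_2 <- cov(x_B)

# Classify all rows and compare with the known labels
pred  <- gda_classify(rbind(x_M, x_B), mu_1, S_1, mu_2, S_2)
truth <- factor(rep(c("M", "B"), each = 100), levels = c("M", "B"))
acc   <- mean(pred == truth)  # share of correctly classified rows
```

With the real data, the predictions and the true test labels could then be passed to confusionMatrix(data = pred, reference = truth, positive = "M") from the package caret to obtain the requested metrics.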

Task 2:

Modify the R function from Task 1 so that it implements the classifier based on the following decision rule. \[ \widehat{y}=\arg\max_{y} P(y|\textbf{x}) \] Apply the modified classifier to the testing data set and report the accuracy, sensitivity, specificity, and kappa statistic.

Tip: The prior probability of M can be estimated as the proportion of M in the training data set; the prior probability of B is its complement.
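One possible way to bring the prior into the rule is sketched below. By Bayes' rule, P(y|x) is proportional to f(x|y) P(y), so the sketch compares log densities plus log priors. The function name gda_classify_posterior and the argument prior_1 are hypothetical, class 1 is assumed to be M, and the numbers are toy values chosen so that both densities are equal at the test point.

```r
library(mvtnorm)  # provides dmvnorm()

# Hypothetical variant of the Task 1 function with a prior for class 1 (M).
gda_classify_posterior <- function(x, mu_1, S_1, mu_2, S_2, prior_1) {
  x <- as.matrix(x)
  # log f(x | y) + log P(y); the marginal density of x is the same for
  # both classes, so it can be dropped from the comparison
  score_1 <- dmvnorm(x, mean = mu_1, sigma = S_1, log = TRUE) + log(prior_1)
  score_2 <- dmvnorm(x, mean = mu_2, sigma = S_2, log = TRUE) + log(1 - prior_1)
  factor(ifelse(score_1 > score_2, "M", "B"), levels = c("M", "B"))
}

# Toy labels: the prior of M is its proportion among the training labels
y_train <- factor(c("M", "B", "B", "B"), levels = c("M", "B"))
prior_1 <- mean(y_train == "M")

# A point halfway between the two class means, so only the prior decides
x_new <- matrix(c(0.5, 0.5, 0.5), nrow = 1)
pred_low_prior  <- gda_classify_posterior(x_new, c(1, 1, 1), diag(3),
                                          c(0, 0, 0), diag(3), prior_1)
pred_high_prior <- gda_classify_posterior(x_new, c(1, 1, 1), diag(3),
                                          c(0, 0, 0), diag(3), 0.75)
```

Working on the log scale is a design choice of this sketch, not a requirement of the task; it avoids very small density values and leaves the arg max unchanged.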

Task 3:

Task 3.1: Apply the lda function (package MASS) to estimate an LDA classifier based on the training data set. Test the resulting classifier on the testing data set and report the accuracy, sensitivity, specificity, and kappa statistic.
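A minimal sketch of the lda workflow is given below. It uses simulated data, and the column names diagnosis, radius, texture and smoothness are assumed to match the variable names mentioned in the lab text; the actual column names in the data files may differ.

```r
library(MASS)  # provides lda() and mvrnorm()

# Simulated stand-in for the training data (NOT the real data set)
set.seed(1)
n <- 100
train <- data.frame(
  diagnosis = factor(rep(c("M", "B"), each = n), levels = c("M", "B")),
  rbind(mvrnorm(n, mu = c(3, 3, 3), Sigma = diag(3)),
        mvrnorm(n, mu = c(0, 0, 0), Sigma = diag(3)))
)
names(train)[-1] <- c("radius", "texture", "smoothness")  # assumed names

# Fit the LDA classifier with diagnosis as the target variable
fit <- lda(diagnosis ~ radius + texture + smoothness, data = train)

# predict() returns a list; the element class holds the predicted labels
pred <- predict(fit, newdata = train)$class
acc  <- mean(pred == train$diagnosis)  # share of correct predictions
```

In the lab itself, newdata would be the testing data set rather than the training data reused here for brevity.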

Task 3.2: Given the output of the lda function in Task 3.1, write down the expression of the corresponding classifier.
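As a reminder, the general form of an LDA classifier from standard theory (not output copied from lda) is shown below. Here \(\boldsymbol{\Sigma}\) is the covariance matrix shared by both classes, \(\boldsymbol{\mu}_y\) the mean vector of class \(y\), and \(\pi_y\) its prior probability; these symbols are notation introduced here. \[ \delta_y(\textbf{x})=\textbf{x}^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_y-\frac{1}{2}\boldsymbol{\mu}_y^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_y+\log\pi_y, \qquad \widehat{y}=\arg\max_{y}\delta_y(\textbf{x}) \] The group means and prior probabilities reported by lda provide the estimates of \(\boldsymbol{\mu}_y\) and \(\pi_y\).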

Task 3.3: Review your solution to Task 1. Is it an LDA classifier or a QDA classifier?


© 2024 Xijia Liu. All rights reserved. Contact: xijia.liu AT umu.se