Lab 4: Exercises about Model Validation and Selection

In this lab, we implement and practice the k-nearest neighbors (KNN) method and subset selection methods, using the breast cancer data and simulated data.

Task 1: About KNN and Cross Validation

In this task, we apply the K-nearest neighbors method to the breast cancer data. We use the variables radius, texture and smoothness to predict the target variable diagnosis.

Task 1.1: Practice KNN method

Read the help document of the function knn in the class package first. This exercise aims to analyze how the choice of \(k\) affects the KNN method’s performance on the training and test sets and to identify any patterns of overfitting or underfitting as \(k\) varies. Follow the instructions below to complete your experiment, and then draw your conclusions based on the experimental results.

# Step 1: Define the candidates of hyper-parameter k
K = seq(1,300,2)
num_k = length(K)
res_tr = res_te = numeric(num_k) # to store the accuracy of KNN on the training set and test set

# Step 2: For each k, store the model performance both in training set and test set
for(i in 1:num_k){
  # Predict the target variable for each case in training set by KNN with K[i]. 
  # Store the accuracy in `res_tr`
  # Predict the target variable for each case in test set by KNN with K[i].
  # Store the accuracy in `res_te`
}

# Step 3: Visualize the results
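
If you get stuck, the following is a minimal sketch of one possible completion. It assumes the breast cancer training and test sets are stored in data frames tr and te (hypothetical names, set up earlier in the lab) with the columns radius, texture, smoothness and diagnosis:

library(class)

# Hypothetical names: tr and te are the training and test data frames
x_tr = tr[, c("radius", "texture", "smoothness")]
x_te = te[, c("radius", "texture", "smoothness")]

for(i in 1:num_k){
  # Training accuracy: predict the training cases from the training set itself
  pred_tr = knn(train = x_tr, test = x_tr, cl = tr$diagnosis, k = K[i])
  res_tr[i] = mean(pred_tr == tr$diagnosis)
  # Test accuracy: predict the test cases from the training set
  pred_te = knn(train = x_tr, test = x_te, cl = tr$diagnosis, k = K[i])
  res_te[i] = mean(pred_te == te$diagnosis)
}

# Plot training and test accuracy against k in one figure
plot(K, res_tr, type = "l", col = "blue", ylim = range(c(res_tr, res_te)),
     xlab = "k", ylab = "Accuracy")
lines(K, res_te, col = "red")
legend("topright", legend = c("Training", "Test"), col = c("blue", "red"), lty = 1)

In the plot, small \(k\) should give near-perfect training accuracy (overfitting), while very large \(k\) pushes both curves down (underfitting).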

Task 1.2: Practice LOOCV

Read the help document of the function knn.cv in the class package first. Apply LOOCV to select the best hyper-parameter \(k\) and evaluate the model performance on the testing set.

Tips: You can apply the function knn.cv to perform LOOCV for selecting the best hyper-parameter. A suggested outline:

# Step 1: Define the candidates of hyper-parameter k
K = seq(1,300,2)
num_k = length(K)
cv_acc = numeric(num_k)
# Step 2: 
for(i in 1:num_k){
  # apply function `knn.cv` to calculate cross validation accuracy,
  # and store it in `cv_acc`.
}
# Step 3: Find the model with the highest cross-validation accuracy and estimate its performance on the testing set.
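
A minimal sketch of one possible completion, reusing x_tr, x_te, tr and te from the sketch in Task 1.1 (hypothetical names). Since knn.cv returns the leave-one-out prediction for every training case, the CV accuracy is just the proportion of correct predictions:

for(i in 1:num_k){
  # LOOCV predictions for all training cases at once
  pred_cv = knn.cv(train = x_tr, cl = tr$diagnosis, k = K[i])
  cv_acc[i] = mean(pred_cv == tr$diagnosis)
}

# Best k according to LOOCV
k_best = K[which.max(cv_acc)]
# Evaluate the selected model on the test set
pred_te = knn(train = x_tr, test = x_te, cl = tr$diagnosis, k = k_best)
mean(pred_te == te$diagnosis)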

Task 1.3: Practice 5-fold Cross Validation

Tune the hyper-parameter by 5-fold cross validation. Estimate the performance of the optimal model on the testing set.

Note: The function knn.cv only provides the LOOCV option, so you need to implement 5-fold CV with your own code.
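
A minimal sketch of one way to implement it by hand, again reusing x_tr, x_te, tr and te from Task 1.1 (hypothetical names); the fold assignment is random, so the seed value below is arbitrary:

set.seed(1)                                        # arbitrary seed for the fold assignment
folds = sample(rep(1:5, length.out = nrow(x_tr)))  # random fold label 1..5 per case
cv_acc5 = numeric(num_k)

for(i in 1:num_k){
  acc = numeric(5)
  for(f in 1:5){
    hold = which(folds == f)                       # validation indices of fold f
    pred = knn(train = x_tr[-hold, ], test = x_tr[hold, ],
               cl = tr$diagnosis[-hold], k = K[i])
    acc[f] = mean(pred == tr$diagnosis[hold])
  }
  cv_acc5[i] = mean(acc)                           # average accuracy over the 5 folds
}

# Select the best k and estimate its performance on the test set
k_best5 = K[which.max(cv_acc5)]
pred_te = knn(train = x_tr, test = x_te, cl = tr$diagnosis, k = k_best5)
mean(pred_te == te$diagnosis)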

Task 2: About Feature Selection

In this task, we will use simulated data to practice feature selection methods. The simulated data can be generated using the following R code. From the code, we can see that the target variable is determined primarily by the first five feature variables. Below, we display the first six observations of five randomly selected feature variables and the target variable \(y\).

set.seed(8312)
n = 500
p = 20 
X = matrix(rnorm(n * p), nrow = n, ncol = p)
w = c(2, -3, 1, 2.3, 1.5, rep(0, p-5))
y = X %*% w + rnorm(n, mean = 0, sd = 1)
dat = data.frame(X, y = y)
rm( X, w, y)
head(dat[, c(sort(sample(1:20, 5)), 21)])
          X7          X8         X13         X14        X18         y
1 -0.4051647  0.04860872  0.03389675 -2.51265909 -0.7737131  2.683957
2 -0.3300477 -1.19580506 -1.48307329 -0.24639241 -0.7631660 -2.255025
3  0.8251963  0.33617031 -0.22148474  0.27703785 -0.4178979 -7.106195
4 -0.8314352  1.32395677  1.20709742  1.84552276 -0.8829887  2.433349
5 -0.5919659 -0.06201140  0.15292587  0.28684091 -1.6195710 -1.023411
6 -2.0953279 -1.45533806 -0.50448883  0.06009562  0.4953250 -2.891946

Task 2.1: Practice leaps package

Read the help document of the function regsubsets. Apply this function to do best subset selection on the simulated data, dat. Print out the names of the variables selected in the candidate model with 8 feature variables and their model coefficients.

Tips: Two options to get the regression coefficients:

  1. You can find the included variables of the candidate model with 8 feature variables by the following code.
summary(m)$which # m: the output of `regsubsets`
  2. Use the function coef to extract the selected variables. You need to specify two arguments: the output of the regsubsets function and the id of the target candidate model.
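
A minimal sketch of both options:

library(leaps)

# Best subset selection over all 20 feature variables
m = regsubsets(y ~ ., data = dat, nvmax = 20)

# Option 1: row 8 of the logical matrix marks the variables in the 8-variable model
summary(m)$which[8, ]

# Option 2: names and coefficients of the 8-variable candidate model
coef(m, id = 8)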

Task 2.2: Prediction with candidate models

Set the random seed to 2024. Randomly split the data set into training and testing sets (80% vs. 20%). Use the same method as in Task 2.1 to find all the candidate models. Apply the candidate model with 8 feature variables to the testing set and estimate the RMSE.

Tips: The function model.matrix can be used to prepare the data matrix for prediction, e.g.,

x_test = model.matrix(y~., dat_te)
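
A minimal sketch of how the split, the selection and the RMSE estimate could fit together; the names dat_tr and dat_te for the two parts are assumptions:

set.seed(2024)
idx = sample(1:nrow(dat), size = 0.8 * nrow(dat))  # 80% of the rows for training
dat_tr = dat[idx, ]
dat_te = dat[-idx, ]

m = regsubsets(y ~ ., data = dat_tr, nvmax = 20)
beta = coef(m, id = 8)                     # coefficients of the 8-variable model

x_test = model.matrix(y ~ ., dat_te)       # test design matrix, incl. intercept
pred = x_test[, names(beta)] %*% beta      # keep only the selected columns

sqrt(mean((dat_te$y - pred)^2))            # RMSE on the testing set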

Task 2.3: Select the final optimal model

With the same data splitting, find the optimal model by the training-against-validation method. Estimate the model performance with the testing set.
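
One possible reading of this task, as a hedged sketch: hold out part of the training set as a validation set, pick the model size with the lowest validation RMSE, then refit on the full training set and evaluate on the test set. The 75/25 inner split below is an assumption, and dat_tr and dat_te are the sets from Task 2.2:

set.seed(2024)
idx_val = sample(1:nrow(dat_tr), size = 0.25 * nrow(dat_tr))  # assumed inner split
dat_val = dat_tr[idx_val, ]                # validation part
dat_sub = dat_tr[-idx_val, ]               # reduced training part

m = regsubsets(y ~ ., data = dat_sub, nvmax = 20)
x_val = model.matrix(y ~ ., dat_val)

val_rmse = numeric(20)
for(id in 1:20){
  beta = coef(m, id = id)
  pred = x_val[, names(beta)] %*% beta
  val_rmse[id] = sqrt(mean((dat_val$y - pred)^2))
}
best_id = which.min(val_rmse)              # size with the lowest validation RMSE

# Refit best subset selection on the full training set and test the chosen size
m_final = regsubsets(y ~ ., data = dat_tr, nvmax = 20)
beta = coef(m_final, id = best_id)
x_test = model.matrix(y ~ ., dat_te)
sqrt(mean((dat_te$y - x_test[, names(beta)] %*% beta)^2))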

Task 2.4 (optional): Practice glmnet package

Read the help document of the cv.glmnet function in the glmnet package first. Apply the LASSO method to select the feature variables in the dataset dat.

Note: If you are short on time, you can skip this task for now. We will focus on practicing this package in the lab next week.
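
For reference, a minimal sketch of how cv.glmnet could be applied here:

library(glmnet)

x = as.matrix(dat[, 1:20])                 # feature matrix
y = dat$y

cv_fit = cv.glmnet(x, y, alpha = 1)        # 10-fold CV over the LASSO path
coef(cv_fit, s = "lambda.min")             # features shrunk to zero are dropped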


© 2024 Xijia Liu. All rights reserved. Contact: xijia.liu AT umu.se