Lab 5: Logistic Regression Classifier

In this lab, we implement and practice LASSO regression and a logistic regression classifier using the diamonds dataset and the breast-cancer dataset.

Task 1: About Penalized Regression (LASSO)

We use the diamonds dataset, which contains data for 53,940 diamonds. Each diamond is described by 10 variables, including price, cut, and color. Train a regression model with a LASSO penalty to predict the price from the feature variables.

  • The first column carries no useful information; remove it.
  • Set the random seed to 2023, then randomly split the dataset into a training set and a testing set (80/20).
  • Apply the cv.glmnet function to tune the hyper-parameter (the penalty strength lambda) through 10-fold cross-validation.
  • This task is a good example of how to include categorical feature variables when fitting a LASSO regression with the glmnet package. The argument x of cv.glmnet must be a matrix, so categorical variables have to be encoded first. Tip: use the function model.matrix, which we used in the previous lab.
  • Calculate the square root of the MSE (the RMSE) of the model on the testing set.
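
The steps above can be sketched in R as follows. This is a minimal illustration, not a reference solution: it runs on a small synthetic data frame (the name toy, its columns, and all of its values are hypothetical) so that it works without downloading the lab data, and it assumes the glmnet package is installed. The choice of lambda.min for prediction is also an assumption; lambda.1se is an equally valid option.

```r
library(glmnet)

set.seed(2023)

# Synthetic stand-in for the diamonds data (hypothetical values).
n <- 500
toy <- data.frame(
  id    = seq_len(n),                 # uninformative first column
  carat = runif(n, 0.2, 2.5),
  cut   = factor(sample(c("Fair", "Good", "Ideal"), n, replace = TRUE)),
  depth = rnorm(n, 61, 1.5)
)
toy$price <- 4000 * toy$carat + 300 * (toy$cut == "Ideal") + rnorm(n, sd = 400)

toy <- toy[, -1]                      # drop the first column

# model.matrix expands factors into dummy columns. The intercept column
# is dropped because glmnet fits its own intercept.
x <- model.matrix(price ~ ., data = toy)[, -1]
y <- toy$price

# 80/20 split into training and testing sets.
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
x_train <- x[train_idx, ]
y_train <- y[train_idx]
x_test  <- x[-train_idx, ]
y_test  <- y[-train_idx]

# alpha = 1 selects the LASSO penalty; lambda is tuned by 10-fold CV.
cv_fit <- cv.glmnet(x_train, y_train, alpha = 1, nfolds = 10)

# Predict on the testing set at the lambda with the lowest CV error.
pred <- predict(cv_fit, newx = x_test, s = "lambda.min")
rmse <- sqrt(mean((y_test - as.numeric(pred))^2))
print(rmse)
```

Building the model matrix before splitting keeps the dummy columns identical in the training and testing sets, even if a factor level happens to be missing from one of them.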

Task 2: About Logistic Regression

We use the breast-cancer dataset to practice logistic regression and run some experiments. Train a logistic regression model on the training set to predict the diagnosis from the feature variables radius, texture, and smoothness. Evaluate the accuracy of the resulting model on the testing set.
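
For orientation, the general form of a logistic regression model with these three features is written out below. The coefficient symbols are notation introduced here, the coding of the positive class as Y = 1 is an assumption, and the second line assumes the common classification threshold of 0.5; the lab itself does not fix any of these.

```latex
P(Y = 1 \mid x)
  = \frac{1}{1 + \exp\!\bigl(-(\beta_0 + \beta_1\,\mathrm{radius}
      + \beta_2\,\mathrm{texture} + \beta_3\,\mathrm{smoothness})\bigr)}

% With a threshold of 0.5, the boundary is where the linear predictor is zero:
\beta_0 + \beta_1\,\mathrm{radius} + \beta_2\,\mathrm{texture}
  + \beta_3\,\mathrm{smoothness} = 0
```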

  • Apply the glm function to train the logistic regression model described above. Note: the variable Diagnosis is of character type, but glm only accepts a numeric or factor target variable, so convert it first.
  • Evaluate the resulting model on the testing set. The function predict can be applied to predict the labels of new observations with the fitted model. Unlike for a regression model, however, you need to specify the additional argument type to get the right kind of prediction. With type = "response", predict returns the posterior probability. Calculate the accuracy and the kappa statistic.
  • Manually calculate the posterior probability for an observation in the training set.
  • Write down the decision boundary of the resulting model.
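
The workflow in this task can be sketched with base R only. The example below is a sketch under stated assumptions rather than a reference solution: the data frame toy and its values are synthetic and hypothetical, the class labels "B" and "M" are assumed, and a threshold of 0.5 is used to turn probabilities into labels.

```r
set.seed(2023)

# Synthetic stand-in for the breast-cancer data (hypothetical values).
n <- 300
toy <- data.frame(
  radius     = rnorm(n, 14, 3),
  texture    = rnorm(n, 19, 4),
  smoothness = rnorm(n, 0.096, 0.014)
)
eta <- -14 + 0.8 * toy$radius + 0.1 * toy$texture + 10 * toy$smoothness
toy$Diagnosis <- ifelse(runif(n) < plogis(eta), "M", "B")   # character labels

# glm needs a factor (or numeric 0/1) response. With levels c("B", "M"),
# the fitted probabilities are P(Diagnosis == "M").
toy$Diagnosis <- factor(toy$Diagnosis, levels = c("B", "M"))

train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

fit <- glm(Diagnosis ~ radius + texture + smoothness,
           data = train, family = binomial)

# type = "response" returns probabilities instead of log-odds.
prob <- predict(fit, newdata = test, type = "response")
pred <- factor(ifelse(prob > 0.5, "M", "B"), levels = c("B", "M"))

# Accuracy and Cohen's kappa computed from the confusion matrix.
tab        <- table(Predicted = pred, Actual = test$Diagnosis)
accuracy   <- sum(diag(tab)) / sum(tab)
expected   <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2
kappa_stat <- (accuracy - expected) / (1 - expected)

# Manual posterior probability for the first training observation:
# apply the logistic function to the linear predictor.
b       <- coef(fit)
x1      <- c(1, train$radius[1], train$texture[1], train$smoothness[1])
p_hand  <- 1 / (1 + exp(-sum(b * x1)))
p_model <- unname(predict(fit, newdata = train[1, ], type = "response"))
print(all.equal(p_hand, p_model))
```

Comparing the hand-computed probability with the output of predict is a quick way to confirm that the coefficients were combined in the right order.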

Task 3: About Penalized Logistic Regression

We use the breast-cancer dataset to practice penalized logistic regression. Use all 30 feature variables to train a penalized logistic regression model that predicts the diagnosis.

  • Apply the cv.glmnet function to the training dataset with 10-fold cross-validation.
  • Print the coefficients of the final models selected by the minimum misclassification error (lambda.min) and by the one-standard-error rule (lambda.1se).
  • Evaluate the resulting model with the testing set. Calculate the accuracy and kappa statistic.
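
A possible shape for this task is sketched below. It assumes the glmnet package is installed and uses a synthetic 30-feature matrix (all names and values are hypothetical) in place of the breast-cancer data. The lab does not say which penalty to use, so the sketch keeps the cv.glmnet default alpha = 1 (LASSO), and it evaluates the lambda.1se model; both choices are assumptions.

```r
library(glmnet)

set.seed(2023)

# Synthetic stand-in: 30 numeric features, only the first three informative.
n <- 400
p <- 30
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("V", 1:p)))
eta <- 2 * x[, 1] - 1.5 * x[, 2] + x[, 3]
y <- factor(ifelse(runif(n) < plogis(eta), "M", "B"), levels = c("B", "M"))

train_idx <- sample(seq_len(n), size = floor(0.8 * n))

# family = "binomial" fits a logistic model; type.measure = "class"
# makes cross-validation score each lambda by misclassification error.
cv_fit <- cv.glmnet(x[train_idx, ], y[train_idx],
                    family = "binomial", type.measure = "class",
                    nfolds = 10)

# Coefficients at the minimum-error lambda and at the 1se lambda.
print(coef(cv_fit, s = "lambda.min"))
print(coef(cv_fit, s = "lambda.1se"))

# type = "class" returns predicted labels directly.
pred <- predict(cv_fit, newx = x[-train_idx, ], s = "lambda.1se",
                type = "class")
pred <- factor(as.vector(pred), levels = c("B", "M"))

# Accuracy and Cohen's kappa computed from the confusion matrix.
tab        <- table(Predicted = pred, Actual = y[-train_idx])
accuracy   <- sum(diag(tab)) / sum(tab)
expected   <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2
kappa_stat <- (accuracy - expected) / (1 - expected)
print(c(accuracy = accuracy, kappa = kappa_stat))
```

The lambda.1se model is typically sparser than the lambda.min model, which makes the two coefficient printouts a useful comparison.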

