Lab 5: Logistic Regression Classifier
In this lab we practice LASSO regression and the logistic regression classifier with the diamonds dataset (download here) and the breast-cancer dataset.
Task 1: About Penalized Regression (LASSO)
We use the diamonds dataset, which contains data for 53,940 diamonds. Each diamond is described by 10 variables such as price, cut, and color. Train a regression model with a LASSO penalty to predict the price from the other feature variables.
- The first column is useless and should be removed.
- Set the random seed to 2023, then randomly split the dataset into training and testing sets (80/20).
- Apply the `cv.glmnet` function to tune the hyper-parameter through a 10-fold cross-validation.
- This is a good example showing how to include categorical feature variables when one implements LASSO regression with the `glmnet` package. The argument `x` in the `cv.glmnet` function must be a matrix. Tip: we can use the function `model.matrix` (we used this function in the previous lab).
- Calculate the square root of the MSE of the model on the testing set. (A sketch of these steps in R follows this list.)
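A minimal sketch of this workflow, assuming the CSV has been read into a data frame named `diamonds` (the name is an assumption) and that the `glmnet` package is installed; predicting at `lambda.min` is one reasonable choice, not the required solution:

```r
# Sketch only: LASSO regression on the diamonds data, with the useless first
# column already dropped from the data frame `diamonds`.
library(glmnet)

set.seed(2023)
n <- nrow(diamonds)
train_idx <- sample(n, size = round(0.8 * n))
train <- diamonds[train_idx, ]
test  <- diamonds[-train_idx, ]

# model.matrix expands the categorical variables (e.g. cut, color) into dummy
# columns; the intercept column is dropped because glmnet adds its own.
x_train <- model.matrix(price ~ ., data = train)[, -1]
y_train <- train$price
x_test  <- model.matrix(price ~ ., data = test)[, -1]

# 10-fold cross-validation; alpha = 1 selects the LASSO penalty
cv_fit <- cv.glmnet(x_train, y_train, alpha = 1, nfolds = 10)

# Root MSE on the testing set at the cross-validated lambda
pred <- predict(cv_fit, newx = x_test, s = "lambda.min")
sqrt(mean((test$price - pred)^2))
```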
 
Task 2: About Logistic Regression
We use the breast-cancer dataset to practice logistic regression and do some experiments. Train a logistic regression model with the training set to predict the diagnosis results from the feature variables radius, texture, and smoothness. Evaluate the accuracy of the resulting model on the testing set.
- Apply the `glm` function to train the logistic regression model described above. Note: the variable `Diagnosis` is of character type; however, the `glm` function only accepts a numerical or factor type target variable.
- Evaluate the resulting model with the testing set. The function `predict` can be applied to predict the labels of new observations with the resulting model. Unlike with a regression model, however, you need to specify the additional argument `type` to get the prediction correctly. If you choose `response`, the `predict` function will return the posterior probability. Calculate the accuracy and kappa statistic.
- Manually calculate the posterior probability for an observation in the training set.
- Write down the decision boundary of the resulting model. (A sketch of these steps is given after this list.)
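A minimal sketch of these steps, assuming the breast-cancer data have already been split into data frames `train` and `test` with columns `Diagnosis`, `radius`, `texture`, and `smoothness` (these names are assumptions based on the task description); the kappa statistic is obtained here via `caret::confusionMatrix`, which is one possible choice:

```r
# Sketch only: logistic regression with three features.
train$Diagnosis <- factor(train$Diagnosis)          # glm needs a factor or 0/1 response
test$Diagnosis  <- factor(test$Diagnosis, levels = levels(train$Diagnosis))

fit <- glm(Diagnosis ~ radius + texture + smoothness,
           data = train, family = binomial)

# type = "response" returns posterior probabilities of the second factor level
prob <- predict(fit, newdata = test, type = "response")
pred <- factor(ifelse(prob > 0.5,
                      levels(test$Diagnosis)[2],
                      levels(test$Diagnosis)[1]),
               levels = levels(test$Diagnosis))

mean(pred == test$Diagnosis)                        # accuracy
caret::confusionMatrix(pred, test$Diagnosis)        # also reports the kappa statistic
                                                    # (requires the caret package)

# Manual posterior probability for the first training observation:
# p = 1 / (1 + exp(-(b0 + b1*radius + b2*texture + b3*smoothness)))
eta <- sum(coef(fit) * c(1, train$radius[1], train$texture[1], train$smoothness[1]))
1 / (1 + exp(-eta))
```

For the decision boundary, recall that the model predicts the positive class when the posterior probability exceeds 0.5, i.e. when the linear predictor b0 + b1*radius + b2*texture + b3*smoothness is greater than 0; setting it equal to 0 gives the boundary.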
 
Task 3: About Penalized Logistic Regression
We use the breast-cancer dataset to practice penalized logistic regression. Use all 30 feature variables to train a penalized logistic regression model to predict the diagnosis results.
- Apply the `cv.glmnet` function with the training dataset, using 10-fold cross-validation.
- Print the coefficients of the final models with the minimum and 1-SE misclassification errors (i.e., at `lambda.min` and `lambda.1se`).
- Evaluate the resulting model with the testing set. Calculate the accuracy and kappa statistic. (See the sketch below.)
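A minimal sketch under the same assumed `train`/`test` split, where `Diagnosis` is the factor response and the remaining 30 columns are numeric features:

```r
# Sketch only: penalized logistic regression with all 30 features.
library(glmnet)

x_train <- model.matrix(Diagnosis ~ ., data = train)[, -1]
y_train <- train$Diagnosis
x_test  <- model.matrix(Diagnosis ~ ., data = test)[, -1]

# 10-fold CV, choosing lambda by misclassification error
cv_fit <- cv.glmnet(x_train, y_train, family = "binomial",
                    type.measure = "class", nfolds = 10)

# Coefficients of the final models at lambda.min and lambda.1se
coef(cv_fit, s = "lambda.min")
coef(cv_fit, s = "lambda.1se")

# Predicted class labels on the testing set, accuracy, and kappa
pred <- as.vector(predict(cv_fit, newx = x_test, s = "lambda.min", type = "class"))
mean(pred == as.character(test$Diagnosis))
caret::confusionMatrix(factor(pred, levels = levels(test$Diagnosis)),
                       test$Diagnosis)             # requires the caret package
```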
 
