Lab 5: Logistic Regression Classifier
We implement and practice a LASSO regression and a logistic regression classifier with the `diamonds` dataset (download here) and the `breast-cancer` dataset.
Task 1: About Penalized Regression (LASSO)
We use the `diamonds` dataset, which contains data for 53,940 diamonds. Each diamond is described by 10 variables, such as price, cut, and color. Train a regression model with a LASSO penalty to predict the price from the other feature variables.
- The first column is an index with no information; remove it.
- Set the random seed to 2023, then randomly split the dataset into training and testing sets (80/20).
- Apply the `cv.glmnet` function to tune the hyper-parameter through 10-fold cross-validation.
- This is a good example of how to include categorical feature variables when implementing LASSO regression with the `glmnet` package. The argument `x` of the `cv.glmnet` function must be a matrix. Tip: use the `model.matrix` function (we used this function in the previous lab).
- Calculate the square root of the MSE of the model on the testing set.
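The steps above can be sketched as follows. This is a minimal sketch, not the required solution: it assumes the `glmnet` and `ggplot2` packages are installed and uses `ggplot2::diamonds` as a stand-in for the downloaded CSV.

```r
# Minimal sketch of Task 1 (assumes the glmnet and ggplot2 packages
# are installed; ggplot2::diamonds stands in for the downloaded CSV).
library(glmnet)

data(diamonds, package = "ggplot2")
diamonds <- as.data.frame(diamonds)  # drop the tibble class
# If reading the CSV instead, remove its first (index) column here.

set.seed(2023)
n <- nrow(diamonds)
train_idx <- sample(n, size = floor(0.8 * n))
train <- diamonds[train_idx, ]
test  <- diamonds[-train_idx, ]

# model.matrix expands the categorical variables (cut, color, clarity)
# into contrast columns; drop the intercept column for glmnet.
x_train <- model.matrix(price ~ ., data = train)[, -1]
x_test  <- model.matrix(price ~ ., data = test)[, -1]

# LASSO (alpha = 1) with the penalty tuned by 10-fold cross-validation.
cv_fit <- cv.glmnet(x_train, train$price, alpha = 1, nfolds = 10)

# Square root of the MSE on the testing set at the selected lambda.
pred <- predict(cv_fit, newx = x_test, s = "lambda.min")
rmse <- sqrt(mean((test$price - pred)^2))
rmse
```

`model.matrix` is the easy route here because it expands the factor columns into the numeric dummy/contrast columns that `cv.glmnet` requires in its matrix argument `x`.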
Task 2: About Logistic Regression
We use the `breast-cancer` dataset to practice logistic regression and run some experiments. Train a logistic regression model on the training set to predict the diagnosis results from the feature variables `radius`, `texture`, and `smoothness`. Evaluate the accuracy of the resulting model on the testing set.
- Apply the `glm` function to train the logistic regression model described above. Note: the variable `Diagnosis` is of character type, but the `glm` function only accepts a `numeric` or `factor` target variable.
- Evaluate the resulting model on the testing set. The `predict` function can be applied to predict the labels of new observations with the resulting model. Unlike with a regression model, however, you need to specify the additional argument `type` to get the right kind of prediction; if you choose `"response"`, the `predict` function returns the posterior probability. Calculate the accuracy and the kappa statistic.
- Manually calculate the posterior probability for one observation in the training set.
- Write down the decision boundary of the resulting model.
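A sketch of this workflow, with one caveat: the breast-cancer CSV is not bundled here, so a small simulated data frame with the assumed column names (`Diagnosis`, `radius`, `texture`, `smoothness`) stands in for it to keep the example runnable.

```r
# Minimal sketch of Task 2. The breast-cancer CSV is not bundled here,
# so a simulated data frame with the assumed column names (Diagnosis,
# radius, texture, smoothness) stands in for it.
set.seed(2023)
n  <- 400
bc <- data.frame(
  radius     = rnorm(n, 14, 3),
  texture    = rnorm(n, 19, 4),
  smoothness = rnorm(n, 0.10, 0.02)
)
lin <- -22 + 1.2 * bc$radius + 0.2 * bc$texture + 15 * bc$smoothness
bc$Diagnosis <- ifelse(runif(n) < plogis(lin), "M", "B")  # character, as in the lab data

bc$Diagnosis <- factor(bc$Diagnosis)  # glm needs a factor (or numeric) target

train_idx <- sample(n, size = floor(0.8 * n))
train <- bc[train_idx, ]
test  <- bc[-train_idx, ]

fit <- glm(Diagnosis ~ radius + texture + smoothness,
           data = train, family = binomial)

# type = "response" returns the posterior probability of the second
# factor level ("M" here); threshold it at 0.5 to get class labels.
p    <- predict(fit, newdata = test, type = "response")
pred <- factor(ifelse(p > 0.5, "M", "B"), levels = levels(bc$Diagnosis))

tab      <- table(pred, test$Diagnosis)
accuracy <- sum(diag(tab)) / sum(tab)
pe       <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2
kappa    <- (accuracy - pe) / (1 - pe)  # Cohen's kappa from the confusion table

# Manual posterior probability for the first training observation:
b   <- coef(fit)
eta <- b["(Intercept)"] + b["radius"] * train$radius[1] +
  b["texture"] * train$texture[1] + b["smoothness"] * train$smoothness[1]
manual_p <- 1 / (1 + exp(-eta))  # equals predict(fit, train[1, ], type = "response")

# Decision boundary: posterior probability 0.5, i.e. the plane
# b0 + b1 * radius + b2 * texture + b3 * smoothness = 0.
```

The manual calculation and the decision boundary both come from the same identity: the fitted model gives log(p / (1 - p)) as a linear function of the features, so p = 0.5 exactly where that linear function equals zero.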
Task 3: About Penalized Logistic Regression
We use the `breast-cancer` dataset to practice penalized logistic regression. Use all 30 feature variables to train a penalized logistic regression model to predict the diagnosis results.
- Apply the `cv.glmnet` function with the training dataset, using 10-fold cross-validation.
- Print the coefficients of the final models selected by the minimum (`lambda.min`) and one-standard-error (`lambda.1se`) misclassification-error rules.
- Evaluate the resulting model on the testing set. Calculate the accuracy and the kappa statistic.
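The penalized version can be sketched as below, again with a simulated 30-column matrix standing in for the breast-cancer features (the `glmnet` package is assumed installed).

```r
# Minimal sketch of Task 3 (assumes the glmnet package is installed);
# a simulated 30-column matrix stands in for the breast-cancer features.
library(glmnet)

set.seed(2023)
n <- 400
p <- 30
x <- matrix(rnorm(n * p), n, p)
y <- factor(ifelse(runif(n) < plogis(x[, 1] - x[, 2]), "M", "B"))

train_idx <- sample(n, size = floor(0.8 * n))

# 10-fold cross-validation, binomial family, misclassification error
# as the cross-validation criterion.
cv_fit <- cv.glmnet(x[train_idx, ], y[train_idx], family = "binomial",
                    type.measure = "class", nfolds = 10)

# Coefficients of the final models at the two standard lambda choices.
print(coef(cv_fit, s = "lambda.min"))
print(coef(cv_fit, s = "lambda.1se"))

# Evaluate on the testing set: accuracy and Cohen's kappa.
pred <- factor(predict(cv_fit, newx = x[-train_idx, ], s = "lambda.min",
                       type = "class"), levels = levels(y))
tab      <- table(pred, y[-train_idx])
accuracy <- sum(diag(tab)) / sum(tab)
pe       <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2
kappa    <- (accuracy - pe) / (1 - pe)
c(accuracy = accuracy, kappa = kappa)
```

Because the penalty shrinks some coefficients exactly to zero, comparing the `lambda.min` and `lambda.1se` coefficient vectors also shows which of the 30 features each rule keeps.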