Lab 3: Exercises about Regression Models

In this lab, we implement and practice regression models with a famous benchmark data set for regression problems in machine learning, the Boston data set. The data set can be found in the package ‘MASS’. For background information on the Boston data, check the help document by typing ?Boston in the R console.

Task 1:

Train a regression model to predict medv by lstat. In the lecture notes, we mentioned the concept of the loss function. When training a regression model, we are essentially solving an optimization problem. In the scenario of regression, the objective function, or loss function, is the Mean Squared Error (MSE) loss \[ \mathcal{L}(w_0,w_1) = \frac{1}{N} \sum_{i=1}^N (y_i - f(x_i, w_0, w_1))^2 \] where \(f(x_i, w_0, w_1)\) is the model. Here we want to train a simple linear regression, so \(f(x_i, w_0, w_1) = w_0 + w_1x_i\). In other words, \(f(x_i, w_0, w_1)\) is the predicted value \(\hat{y}_i\). The loss of the model, \(\mathcal{L}(w_0,w_1)\), depends on the values of \(w_0\) and \(w_1\). Therefore, our ultimate goal is to find the set of regression coefficients that minimizes the prediction error (MSE). There are many ways to find the optimal values of the regression coefficients, and we will try some of them next.
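To make the objective concrete, the loss can be written as a small R function. This is a minimal sketch; mse_loss is a hypothetical helper name, and x and y stand for the lstat and medv columns.

```r
# MSE loss of a simple linear regression, as a function of the
# coefficients; x is the predictor vector, y the response vector
mse_loss <- function(w0, w1, x, y) {
  y_hat <- w0 + w1 * x    # predicted values f(x, w0, w1)
  mean((y - y_hat)^2)     # average squared prediction error
}
```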

Task 1.1:

First, we try the most naive, brute-force method: grid search. Follow the steps below; an R sketch of the whole procedure comes after the list.

  1. Define the Search Range:
    • Specify a range of possible values for each regression coefficient, i.e. \(w_0 \in [34, 35]\) and \(w_1 \in [-1,0]\).
    • Divide the range into sufficiently small intervals, such as steps of 0.1.
  2. Compute All Combinations:
    • For each possible combination of regression coefficients, calculate the predicted values \(\hat{y}\) using the regression formula. Tips: you can implement this with a double loop, or use the expand.grid() function to create all combinations of possible values of the two coefficients.
  3. Calculate Errors:
    • For each combination, compute the corresponding MSE: \[ \text{MSE} = \frac{1}{N} \sum_{i=1}^N (y_i - \hat{y}_i)^2 \]
    • Record the MSE for each combination.
  4. Find the Optimal Solution:
    • Iterate through all combinations and select the set of coefficients that yields the smallest MSE.
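A minimal sketch of the procedure in R, assuming the MASS package is installed; the search range and step size follow the specification above.

```r
# Grid search for (w0, w1): a sketch of steps 1-4 above
library(MASS)              # provides the Boston data set
x <- Boston$lstat
y <- Boston$medv

# Step 1: grid of candidate values, step size 0.1
grid <- expand.grid(w0 = seq(34, 35, by = 0.1),
                    w1 = seq(-1, 0, by = 0.1))

# Steps 2-3: compute and record the MSE for every combination
grid$mse <- apply(grid, 1, function(w) mean((y - (w[1] + w[2] * x))^2))

# Step 4: the combination with the smallest MSE
grid[which.min(grid$mse), ]
```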

Task 1.2:

Clearly, brute force methods like grid search are highly inefficient and often struggle to guarantee precision. In machine learning, numerical algorithms are commonly used to approximate the optimal solution, such as gradient descent. However, we will not delve into such algorithms here. Interested students can explore them on their own, and we will discuss them further in the context of logistic regression.

Now, let’s return to the regression problem. As mentioned in the lecture notes, when using MSE as the objective function or loss function, there is a classic analytical solution: Gauss’s least squares solution. In other words, we can derive a formula to precisely compute the optimal solution, i.e. \[ \widehat{w}_1=\frac{\sum_{i=1}^N(x_i-\overline{x})(y_i-\overline{y})}{\sum_{i=1}^N(x_i-\overline{x})^2} \text{ and } \widehat{w}_0=\overline{y}-\widehat{w}_1\overline{x} \] So, in this sub task, calculate the regression coefficients according to the least squares formulas above.

Note: This formula only works for simple linear regression. For a multivariate regression model, you need the formula in matrix form.
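A minimal sketch of this sub task in R, assuming the Boston data are already loaded as above:

```r
# Least squares solution for medv ~ lstat, computed from the formulas
x <- Boston$lstat
y <- Boston$medv
w1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
w0_hat <- mean(y) - w1_hat * mean(x)
c(w0 = w0_hat, w1 = w1_hat)
```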

Task 1.3:

Apply the lm function to estimate the regression coefficients. The lm function can be used in two ways; the better way is to use it with a data frame. With a data frame, you need to specify the following (a sketch comes after the list):

  • The model: medv ~ lstat
  • The data: set argument data = Boston
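A minimal sketch of this usage:

```r
# Fit the same simple linear regression with lm and a data frame
fit <- lm(medv ~ lstat, data = Boston)
coef(fit)    # estimated intercept (w0) and slope (w1)
```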

Task 1.4:

Compare the results from sub tasks 1.1 to 1.3 and draw a conclusion.

Task 2:

Train a multiple linear regression model to predict medv.

Task 2.1:

Use lstat and age as feature variables. Calculate the mean squared error of the model based on the data.
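One possible sketch, computing the training MSE from the fitted values; fit2 is a hypothetical name:

```r
# Multiple regression on lstat and age, and its MSE on the data
fit2 <- lm(medv ~ lstat + age, data = Boston)
mean((Boston$medv - fitted(fit2))^2)    # training MSE
```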

Task 2.2:

Use all the variables except medv as feature variables. Calculate the mean squared error of the model based on the data.

Tips: With a data frame, if you want to predict one variable with all the rest of the variables in the data frame, the model can be specified as medv ~ .
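A sketch of the same computation with the dot shorthand:

```r
# Regression on all remaining variables, and its MSE on the data
fit_all <- lm(medv ~ ., data = Boston)
mean((Boston$medv - fitted(fit_all))^2)    # training MSE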

Task 3

Predict medv with a polynomial regression of lstat. Since we are going to compare model performances, we need to prepare a testing set. In other words, we split the data into training and testing sets. Fix the random seed as 2023 with set.seed(2023), then randomly select \(405\) observations as the training set and use the rest as the testing set.
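One way to draw such a split is sketched below; note that the resulting partition depends on the exact sampling call, so your indices may differ if you sample differently.

```r
# Random train/test split of the Boston data
set.seed(2023)
train_id <- sample(nrow(Boston), 405)   # indices of the training rows
train <- Boston[train_id, ]
test  <- Boston[-train_id, ]
```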

Task 3.1:

Train the same model as in Task 1 with the training set and evaluate the performance of the resulting model on both the training set and the testing set.
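A minimal sketch, assuming train and test come from the split above:

```r
# Simple linear regression trained on the training set only
fit_lin <- lm(medv ~ lstat, data = train)
mean((train$medv - predict(fit_lin, newdata = train))^2)  # training MSE
mean((test$medv  - predict(fit_lin, newdata = test))^2)   # testing MSE
```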

Task 3.2:

Train 2nd-, 7th-, and 20th-order polynomial regressions with the training set, and evaluate each resulting model on both the training set and the testing set.
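A sketch using the poly() function, again assuming the train/test split from above; the degrees are taken from the task:

```r
# Polynomial regressions of degree 2, 7, and 20
for (d in c(2, 7, 20)) {
  fit_d <- lm(medv ~ poly(lstat, d), data = train)
  cat("degree", d,
      "train MSE:", mean((train$medv - predict(fit_d, train))^2),
      "test MSE:",  mean((test$medv  - predict(fit_d, test))^2), "\n")
}
```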

Task 3.3:

Compare the results from sub tasks 3.1 and 3.2 and draw a conclusion.


© 2024 Xijia Liu. All rights reserved. Contact: xijia.liu AT umu.se