Solutions to Exercises in Lab 3
In this lab, we implement and practice regression models with the Boston data set, a famous benchmark data set for regression problems in machine learning. The data set can be found in the package ‘MASS’. For background information on the Boston data, check the help document by typing ?Boston in the R console.
Import data
library(MASS)
dat = Boston
Task 1:
Train a regression model to predict medv by lstat. In the lecture notes, we mentioned the concept of the loss function. When training a regression model, we are essentially solving an optimization problem. In the scenario of regression, the objective function, or loss function, is the Mean Squared Error (MSE) loss: \[
\mathcal{L}(w_0,w_1) = \frac{1}{N} \sum_{i=1}^N (y_i - f(x_i, w_0, w_1))^2
\] where \(f(x_i, w_0, w_1)\) is the model. Here we want to train a simple linear regression, so \(f(x_i, w_0, w_1) = w_0 + w_1x_i\). In other words, \(f(x_i, w_0, w_1)\) is the predicted value \(\hat{y}_i\). The loss of the model, \(\mathcal{L}(w_0,w_1)\), depends on the values of \(w_0\) and \(w_1\). Therefore, our ultimate goal is to find the set of regression coefficients that minimizes the prediction error (MSE). There are many ways to find the optimal values of the regression coefficients, and we will try some of them next.
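For concreteness, this loss can be written as a small helper in R; the name mse_loss and its arguments are illustrative choices, not part of the lab template:
# MSE loss of the simple linear regression f(x) = w0 + w1*x
# (mse_loss is an illustrative helper, not required by the lab)
mse_loss = function(w0, w1, x, y){
  y_hat = w0 + w1*x        # predicted values
  mean((y - y_hat)^2)      # average squared residual
}
# example: mse_loss(34, -0.9, dat$lstat, dat$medv)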
Task 1.1:
First, we try the most naive, brute-force method: grid search.
- Define the Search Range:
- Specify a range of possible values for each regression coefficient, i.e. \(w_0 \in [34, 35]\) and \(w_1 \in [-1,0]\).
- Divide the range into sufficiently small intervals, such as steps of 0.1.
- Compute All Combinations:
- For each possible combination of regression coefficients, calculate the predicted values \(\hat{y}\) using the regression formula. Tip: you can implement this with a double loop, or use the expand.grid() function to create all combinations of the candidate values of the two coefficients.
- Calculate Errors:
- For each combination, compute the corresponding MSE: \[ \text{MSE} = \frac{1}{N} \sum_{i=1}^N (y_i - \hat{y}_i)^2 \]
- Record the MSE for each combination.
- Find the Optimal Solution:
- Iterate through all combinations and select the set of coefficients that yields the smallest MSE.
Solutions
Prepare the grids for \(w_0\) and \(w_1\), and create a matrix res to store the MSEs for all combinations of the candidate values.
n = dim(dat)[1]
y = dat$medv
x = dat$lstat
w_0_candidate = seq(34, 35, 0.1)
w_1_candidate = seq(-1, 0, 0.1)
# define a matrix 'res' to store all the MSEs
r0 = length(w_0_candidate)
r1 = length(w_1_candidate)
res = matrix(0, r0, r1)
Use a double loop to go through all the possible combinations and record the MSEs.
# try all the combinations
for(i in 1:r0){
  for(j in 1:r1){
    residual = y - (w_0_candidate[i] + w_1_candidate[j]*x)
    res[i,j] = mean(residual^2)
  }
}
# all MSEs
res
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 40.01624 38.61676 41.43717 48.47745 59.73762 75.21767 94.91759 118.8374
[2,] 39.78906 38.64265 41.71612 49.00946 60.52269 76.25580 96.20878 120.3816
[3,] 39.58189 38.68854 42.01507 49.56147 61.32776 77.31393 97.51998 121.9459
[4,] 39.39472 38.75443 42.33402 50.13348 62.15283 78.39206 98.85117 123.5302
[5,] 39.22754 38.84031 42.67296 50.72549 62.99790 79.49019 100.20236 125.1344
[6,] 39.08037 38.94620 43.03191 51.33750 63.86298 80.60833 101.57356 126.7587
[7,] 38.95319 39.07209 43.41086 51.96951 64.74805 81.74646 102.96475 128.4029
[8,] 38.84602 39.21798 43.80981 52.62152 65.65312 82.90459 104.37595 130.0672
[9,] 38.75885 39.38386 44.22876 53.29353 66.57819 84.08272 105.80714 131.7514
[10,] 38.69167 39.56975 44.66771 53.98554 67.52326 85.28086 107.25833 133.4557
[11,] 38.64450 39.77564 45.12666 54.69755 68.48833 86.49899 108.72953 135.1799
[,9] [,10] [,11]
[1,] 146.9771 179.3366 215.9161
[2,] 148.7744 181.3870 218.2195
[3,] 150.5917 183.4574 220.5430
[4,] 152.4290 185.5478 222.8864
[5,] 154.2863 187.6582 225.2498
[6,] 156.1637 189.7885 227.6333
[7,] 158.0610 191.9389 230.0367
[8,] 159.9783 194.1093 232.4602
[9,] 161.9156 196.2997 234.9036
[10,] 163.8729 198.5100 237.3670
[11,] 165.8502 200.7404 239.8505
Find the optimal solution and print it out.
min(res)
[1] 38.61676
(id = which(res == min(res), arr.ind = T))
row col
[1,] 1 2
# print the optimal solutions
w_0_candidate[id[1]]
[1] 34
w_1_candidate[id[2]]
[1] -0.9
Alternative way: We can also apply the expand.grid function to create all possible combinations of the values of \(w_0\) and \(w_1\).
all_w_combinations = expand.grid(w_0_candidate, w_1_candidate)
head(all_w_combinations)
Var1 Var2
1 34.0 -1
2 34.1 -1
3 34.2 -1
4 34.3 -1
5 34.4 -1
6 34.5 -1
r = dim(all_w_combinations)[1]
res = numeric(r)
for(i in 1:r){
  residual = y - (all_w_combinations[i,1] + all_w_combinations[i,2]*x)
  res[i] = mean(residual^2)
}
id = which(res == min(res))
# print the optimal solution
all_w_combinations[id,]
Var1 Var2
12 34 -0.9
Task 1.2:
Clearly, brute force methods like grid search are highly inefficient and often struggle to guarantee precision. In machine learning, numerical algorithms are commonly used to approximate the optimal solution, such as gradient descent. However, we will not delve into such algorithms here. Interested students can explore them on their own, and we will discuss them further in the context of logistic regression.
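For the curious, a minimal gradient descent sketch for this loss might look like the following; the starting values, learning rate, and iteration count are arbitrary choices, not prescribed by the lab.
# gradient descent on the MSE loss of y = w0 + w1*x (illustrative sketch only)
x = dat$lstat; y = dat$medv
w0 = 0; w1 = 0
lr = 0.001                      # learning rate (arbitrary choice)
for(iter in 1:100000){
  e  = y - (w0 + w1*x)          # current residuals
  g0 = -2*mean(e)               # partial derivative of the MSE w.r.t. w0
  g1 = -2*mean(e*x)             # partial derivative of the MSE w.r.t. w1
  w0 = w0 - lr*g0
  w1 = w1 - lr*g1
}
c(w0, w1)                       # should be close to the least squares solution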
Now, let’s return to the regression problem. As mentioned in the lecture notes, when using MSE as the objective function or loss function, there is a classic analytical solution: Gauss’s least squares solution. In other words, we can derive formulas that compute the optimal solution exactly, i.e. \[ \widehat{w}_1=\frac{\sum_{i=1}^N(x_i-\overline{x})(y_i-\overline{y})}{\sum_{i=1}^N(x_i-\overline{x})^2} \text{ and } \widehat{w}_0=\overline{y}-\widehat{w}_1\overline{x} \] So, in this sub task, you need to calculate the regression coefficients according to the least squares formulas.
Note: These formulas only work for simple linear regression. For a multiple regression model, you need the formula in matrix form (a sketch is given at the end of this sub task).
Solutions
n = dim(dat)[1]
y = dat$medv
x = dat$lstat
w1 = sum((x-mean(x))*(y-mean(y)))/sum((x-mean(x))^2)
# w1 = cov(x,y)/var(x)
print(paste("w1:", w1))
[1] "w1: -0.950049353757991"
w0 = mean(y)-w1*mean(x)
print(paste("w0:", w0))
[1] "w0: 34.5538408793831"
Task 1.3:
Apply the lm function to estimate the regression coefficients. There are two ways to use the lm function; the better way is to use it with a data frame. With a data frame, you need to specify:
- The model: medv ~ lstat
- The data: set the argument data = Boston
Solutions
m = lm(medv~lstat, data = dat)
m$coefficients
(Intercept) lstat
34.5538409 -0.9500494
Task 1.4:
Compare the results from sub tasks 1.1 to 1.3 and draw a conclusion.
Solutions: We can see that both our own code in 1.2 and the lm function in 1.3 produce the same answer. The grid search in 1.1, however, can only provide a rough estimate. To get a better estimate, one has to make the grid denser, which quickly becomes computationally costly.
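To get a feel for the cost, a finer grid over the same ranges (step 0.001 instead of 0.1; the step size here is an arbitrary choice) already means about a million candidate pairs to evaluate:
# a finer grid quickly becomes expensive (sketch only, not run against the loss)
w_0_fine = seq(34, 35, 0.001)
w_1_fine = seq(-1, 0, 0.001)
length(w_0_fine) * length(w_1_fine)   # roughly 1e6 combinations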
Task 2:
Train a multiple linear regression to predict ‘medv’
Task 2.1:
Use lstat and age as feature variables. Calculate the mean squared error of the model based on the data.
Solutions
m1 = lm(medv~lstat+age, data = dat)
summary(m1)
Call:
lm(formula = medv ~ lstat + age, data = dat)
Residuals:
Min 1Q Median 3Q Max
-15.981 -3.978 -1.283 1.968 23.158
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 33.22276 0.73085 45.458 < 2e-16 ***
lstat -1.03207 0.04819 -21.416 < 2e-16 ***
age 0.03454 0.01223 2.826 0.00491 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.173 on 503 degrees of freedom
Multiple R-squared: 0.5513, Adjusted R-squared: 0.5495
F-statistic: 309 on 2 and 503 DF, p-value: < 2.2e-16
y_pre = predict(m1, dat)
mean((dat$medv - y_pre)^2)
[1] 37.88168
Task 2.2:
Use all the variables except medv as feature variables. Calculate the mean squared error of the model based on the data.
Tip: with a data frame, if you want to predict one variable with all the rest of the variables in the data frame, the model can be specified as medv ~ .
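As a side note (not required by the task), the dot notation can be combined with a minus sign to drop individual variables:
# e.g. all predictors except age (side note only)
m_no_age = lm(medv ~ . - age, data = dat)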
Solutions
m1 = lm(medv~., data = dat)
summary(m1)
Call:
lm(formula = medv ~ ., data = dat)
Residuals:
Min 1Q Median 3Q Max
-15.595 -2.730 -0.518 1.777 26.199
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
crim -1.080e-01 3.286e-02 -3.287 0.001087 **
zn 4.642e-02 1.373e-02 3.382 0.000778 ***
indus 2.056e-02 6.150e-02 0.334 0.738288
chas 2.687e+00 8.616e-01 3.118 0.001925 **
nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
age 6.922e-04 1.321e-02 0.052 0.958229
dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
tax -1.233e-02 3.760e-03 -3.280 0.001112 **
ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
black 9.312e-03 2.686e-03 3.467 0.000573 ***
lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.745 on 492 degrees of freedom
Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
y_pre = predict(m1, dat)
mean((dat$medv - y_pre)^2)
[1] 21.89483
Task 3
Predict medv with a polynomial regression of lstat. Since we are going to compare model performances, we need to prepare a testing set. In other words, we split the data into training and testing sets. Fix the random seed to 2023 with set.seed(2023), then randomly select \(405\) observations as the training set and use the rest as the testing set.
Solutions
set.seed(2023)
id = sample(1:n, round(0.8*n))
dat_tr = dat[id, ]
dat_te = dat[-id, ]
Task 3.1:
Train the same model in task 1 with the training set and evaluate the performance of the resulting model with both the training set and the testing set.
Solutions
m1 = lm(medv~lstat, data = dat_tr)
# training set
mse1_tr = mean( (dat_tr$medv - m1$fitted.values)^2 )
mse1_tr
[1] 39.80271
# testing set
y_pre = predict(m1, newdata = dat_te)
mse1_te = mean((dat_te$medv - y_pre)^2)
mse1_te
[1] 33.22158
Task 3.2:
Train 2nd-, 7th-, and 20th-order polynomial regressions with the training set, and evaluate each resulting model with both the training set and the testing set.
Solutions
m2 = lm(medv~poly(lstat,2), data = dat_tr)
#m2 = lm(dat_tr$medv~I(dat_tr$lstat) + I(dat_tr$lstat^2)) # an alternative way
# training set
mse2_tr = mean( (dat_tr$medv - m2$fitted.values)^2 )
mse2_tr
[1] 30.16128
# testing set
y_pre = predict(m2, newdata = dat_te)
mse2_te = mean((dat_te$medv - y_pre)^2)
mse2_te
[1] 31.1774
m7 = lm(medv~poly(lstat,7), data = dat_tr)
# training set
mse7_tr = mean((m7$residuals)^2)
mse7_tr
[1] 26.82469
# testing set
y_pre = predict(m7, newdata = dat_te)
mse7_te = mean((dat_te$medv - y_pre)^2)
mse7_te
[1] 26.77753
m20 = lm(medv~poly(lstat,20), data = dat_tr)
# training set
mse20_tr = mean((m20$residuals)^2)
mse20_tr
[1] 25.60726
# testing set
y_pre = predict(m20, newdata = dat_te)
mse20_te = mean((dat_te$medv - y_pre)^2)
mse20_te
[1] 56949982980
Task 3.3:
Compare the results from sub tasks 3.1 to 3.2 and draw a conclusion.
Solutions: From the results we can see that the 7th-order polynomial regression predicts better than the simple linear regression model. The 20th-order polynomial has the smallest MSE on the training set but a very large MSE on the testing set, so it clearly overfits the data.
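A compact way to see this pattern is to loop over a few polynomial orders and record both MSEs; this sketch reuses dat_tr and dat_te from above, and the list of orders is an arbitrary choice:
# training vs. testing MSE for a few polynomial orders (sketch)
orders = c(1, 2, 7, 20)
for(d in orders){
  m = lm(medv ~ poly(lstat, d), data = dat_tr)
  mse_tr = mean(m$residuals^2)
  mse_te = mean((dat_te$medv - predict(m, newdata = dat_te))^2)
  print(round(c(order = d, train_mse = mse_tr, test_mse = mse_te), 3))
}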