Solutions to Exercises in Lab 3
In this lab, we implement and practice regression models with the Boston data set, a famous benchmark data set for regression problems in machine learning. The data set can be found in the package ‘MASS’. For background information on the Boston data, check the help document by typing ?Boston in the R console.
Import data
library(MASS)
dat = Boston
Task 1:
Train a regression model to predict medv by lstat. In the lecture notes, we mentioned the concept of the loss function. When training a regression model, we are essentially solving an optimization problem. In the scenario of regression, the objective function, or loss function, is the Mean Squared Error (MSE) loss: \[
\mathcal{L}(w_0,w_1) = \frac{1}{N} \sum_{i=1}^N (y_i - f(x_i, w_0, w_1))^2
\] where \(f(x_i, w_0, w_1)\) is the model. Here we want to train a simple linear regression, so \(f(x_i, w_0, w_1) = w_0 + w_1x_i\). In other words, \(f(x_i, w_0, w_1)\) is the predicted value \(\hat{y}_i\). The loss of the model, \(\mathcal{L}(w_0,w_1)\), depends on the values of \(w_0\) and \(w_1\). Therefore, our ultimate goal is to find the set of regression coefficients that minimizes the prediction error (MSE). There are many ways to find the optimal values of the regression coefficients, and we will try some of them next.
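For concreteness, this loss can be written as a small helper in R; the name mse_loss and its arguments are illustrative choices, not part of the lab template:
# MSE loss of the simple linear regression f(x) = w0 + w1*x
# (mse_loss is an illustrative helper, not required by the lab)
mse_loss = function(w0, w1, x, y){
  y_hat = w0 + w1*x        # predicted values
  mean((y - y_hat)^2)      # average squared residual
}
# example: mse_loss(34, -0.9, dat$lstat, dat$medv)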
Task 1.1:
First, we try the most naive, brute-force method: grid search.
- Define the Search Range:
- Specify a range of possible values for each regression coefficient, i.e. \(w_0 \in [34, 35]\) and \(w_1 \in [-1,0]\).
- Divide the range into sufficiently small intervals, such as steps of 0.1.
- Compute All Combinations:
- For each possible combination of regression coefficients, calculate the predicted values \(\hat{y}\) using the regression formula. Tip: you can implement this with a double loop, or use the expand.grid() function to create all combinations of the candidate values of the two coefficients.
- Calculate Errors:
- For each combination, compute the corresponding MSE: \[ \text{MSE} = \frac{1}{N} \sum_{i=1}^N (y_i - \hat{y}_i)^2 \]
- Record the MSE for each combination.
- Find the Optimal Solution:
- Iterate through all combinations and select the set of coefficients that yields the smallest MSE.
Solutions
Prepare the grids for \(w_0\) and \(w_1\), and create a matrix res to store the MSEs for all combinations of the candidate values.
n = dim(dat)[1]
y = dat$medv
x = dat$lstat
w_0_candidate = seq(34, 35, 0.1)
w_1_candidate = seq(-1, 0, 0.1)
# define a matrix 'res' to store all the MSEs
r0 = length(w_0_candidate)
r1 = length(w_1_candidate)
res = matrix(0, r0, r1)
Use a double loop to go through all the possible combinations and record the MSEs.
# try all the combinations
for(i in 1:r0){
  for(j in 1:r1){
    residual = y - (w_0_candidate[i] + w_1_candidate[j]*x)
    res[i,j] = mean(residual^2)
  }
}
# all MSEs
res
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 40.01624 38.61676 41.43717 48.47745 59.73762 75.21767 94.91759 118.8374
[2,] 39.78906 38.64265 41.71612 49.00946 60.52269 76.25580 96.20878 120.3816
[3,] 39.58189 38.68854 42.01507 49.56147 61.32776 77.31393 97.51998 121.9459
[4,] 39.39472 38.75443 42.33402 50.13348 62.15283 78.39206 98.85117 123.5302
[5,] 39.22754 38.84031 42.67296 50.72549 62.99790 79.49019 100.20236 125.1344
[6,] 39.08037 38.94620 43.03191 51.33750 63.86298 80.60833 101.57356 126.7587
[7,] 38.95319 39.07209 43.41086 51.96951 64.74805 81.74646 102.96475 128.4029
[8,] 38.84602 39.21798 43.80981 52.62152 65.65312 82.90459 104.37595 130.0672
[9,] 38.75885 39.38386 44.22876 53.29353 66.57819 84.08272 105.80714 131.7514
[10,] 38.69167 39.56975 44.66771 53.98554 67.52326 85.28086 107.25833 133.4557
[11,] 38.64450 39.77564 45.12666 54.69755 68.48833 86.49899 108.72953 135.1799
[,9] [,10] [,11]
[1,] 146.9771 179.3366 215.9161
[2,] 148.7744 181.3870 218.2195
[3,] 150.5917 183.4574 220.5430
[4,] 152.4290 185.5478 222.8864
[5,] 154.2863 187.6582 225.2498
[6,] 156.1637 189.7885 227.6333
[7,] 158.0610 191.9389 230.0367
[8,] 159.9783 194.1093 232.4602
[9,] 161.9156 196.2997 234.9036
[10,] 163.8729 198.5100 237.3670
[11,] 165.8502 200.7404 239.8505
Find the optimal solution and print it out.
min(res)
[1] 38.61676
(id = which(res == min(res), arr.ind = T))
row col
[1,] 1 2
# print the optimal solutions
w_0_candidate[id[1]]
[1] 34
w_1_candidate[id[2]]
[1] -0.9
Alternative way: We can also apply the expand.grid function to create all possible combinations of the values of \(w_0\) and \(w_1\).
all_w_combinations = expand.grid(w_0_candidate, w_1_candidate)
head(all_w_combinations)
Var1 Var2
1 34.0 -1
2 34.1 -1
3 34.2 -1
4 34.3 -1
5 34.4 -1
6 34.5 -1
r = dim(all_w_combinations)[1]
res = numeric(r)
for(i in 1:r){
  residual = y - (all_w_combinations[i,1] + all_w_combinations[i,2]*x)
  res[i] = mean(residual^2)
}
id = which(res == min(res))
# print the optimal solution
all_w_combinations[id,]
Var1 Var2
12 34 -0.9
Task 1.2:
Clearly, brute force methods like grid search are highly inefficient and often struggle to guarantee precision. In machine learning, numerical algorithms are commonly used to approximate the optimal solution, such as gradient descent. However, we will not delve into such algorithms here. Interested students can explore them on their own, and we will discuss them further in the context of logistic regression.
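For the curious, a minimal gradient descent sketch for this loss might look like the following; the starting values, learning rate, and iteration count are arbitrary choices, not prescribed by the lab.
# gradient descent on the MSE loss of y = w0 + w1*x (illustrative sketch only)
x = dat$lstat; y = dat$medv
w0 = 0; w1 = 0
lr = 0.001                      # learning rate (arbitrary choice)
for(iter in 1:100000){
  e  = y - (w0 + w1*x)          # current residuals
  g0 = -2*mean(e)               # partial derivative of the MSE w.r.t. w0
  g1 = -2*mean(e*x)             # partial derivative of the MSE w.r.t. w1
  w0 = w0 - lr*g0
  w1 = w1 - lr*g1
}
c(w0, w1)                       # should be close to the least squares solution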
Now, let’s return to the regression problem. As mentioned in the lecture notes, when using MSE as the objective function or loss function, there is a classic analytical solution: Gauss’s least squares solution. In other words, we can derive formulas that compute the optimal solution exactly, i.e. \[ \widehat{w}_1=\frac{\sum_{i=1}^N(x_i-\overline{x})(y_i-\overline{y})}{\sum_{i=1}^N(x_i-\overline{x})^2} \text{ and } \widehat{w}_0=\overline{y}-\widehat{w}_1\overline{x} \] So, in this sub task, you need to calculate the regression coefficients according to the least squares formulas.
Note: These formulas only work for simple linear regression. For a multiple regression model, you need the formula in matrix form (a sketch is given at the end of this sub task).
Solutions
n = dim(dat)[1]
y = dat$medv
x = dat$lstat
w1 = sum((x-mean(x))*(y-mean(y)))/sum((x-mean(x))^2)
# w1 = cov(x,y)/var(x)
print(paste("w1:", w1))
[1] "w1: -0.950049353757991"
w0 = mean(y)-w1*mean(x)
print(paste("w0:", w0))
[1] "w0: 34.5538408793831"
Task 1.3:
Apply the lm function to estimate the regression coefficients. There are two ways to use the lm function; the better way is to use it with a data frame. With a data frame, you need to specify:
- The model: medv ~ lstat
- The data: set the argument data = Boston
Solutions
m = lm(medv~lstat, data = dat)
m$coefficients
(Intercept) lstat
34.5538409 -0.9500494
Task 1.4:
Compare the results from sub tasks 1.1 to 1.3 and draw a conclusion.
Solutions: We can see that both our own code in 1.2 and the lm function in 1.3 produce the same answer. The grid search in 1.1, however, can only provide a rough estimate. To get a better estimate, one has to make the grid denser, which quickly becomes computationally costly.
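To get a feel for the cost, a finer grid over the same ranges (step 0.001 instead of 0.1; the step size here is an arbitrary choice) already means about a million candidate pairs to evaluate:
# a finer grid quickly becomes expensive (sketch only, not run against the loss)
w_0_fine = seq(34, 35, 0.001)
w_1_fine = seq(-1, 0, 0.001)
length(w_0_fine) * length(w_1_fine)   # roughly 1e6 combinations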
Task 2:
Train a multiple linear regression to predict ‘medv’
Task 2.1:
Use lstat and age as feature variables. Calculate the mean squared error of the model based on the data.
Solutions
m1 = lm(medv~lstat+age, data = dat)
summary(m1)
Call:
lm(formula = medv ~ lstat + age, data = dat)
Residuals:
Min 1Q Median 3Q Max
-15.981 -3.978 -1.283 1.968 23.158
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 33.22276 0.73085 45.458 < 2e-16 ***
lstat -1.03207 0.04819 -21.416 < 2e-16 ***
age 0.03454 0.01223 2.826 0.00491 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.173 on 503 degrees of freedom
Multiple R-squared: 0.5513, Adjusted R-squared: 0.5495
F-statistic: 309 on 2 and 503 DF, p-value: < 2.2e-16
y_pre = predict(m1, dat)
mean((dat$medv - y_pre)^2)
[1] 37.88168
Task 2.2:
Use all the variables except medv as feature variables. Calculate the mean squared error of the model based on the data.
Tip: with a data frame, if you want to predict one variable with all the rest of the variables in the data frame, the model can be specified as medv ~ .
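As a side note (not required by the task), the dot notation can be combined with a minus sign to drop individual variables:
# e.g. all predictors except age (side note only)
m_no_age = lm(medv ~ . - age, data = dat)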
Solutions
m1 = lm(medv~., data = dat)
summary(m1)
Call:
lm(formula = medv ~ ., data = dat)
Residuals:
Min 1Q Median 3Q Max
-15.595 -2.730 -0.518 1.777 26.199
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
crim -1.080e-01 3.286e-02 -3.287 0.001087 **
zn 4.642e-02 1.373e-02 3.382 0.000778 ***
indus 2.056e-02 6.150e-02 0.334 0.738288
chas 2.687e+00 8.616e-01 3.118 0.001925 **
nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
age 6.922e-04 1.321e-02 0.052 0.958229
dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
tax -1.233e-02 3.760e-03 -3.280 0.001112 **
ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
black 9.312e-03 2.686e-03 3.467 0.000573 ***
lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.745 on 492 degrees of freedom
Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
y_pre = predict(m1, dat)
mean((dat$medv - y_pre)^2)
[1] 21.89483
Task 3
Predict medv with a polynomial regression of lstat. Since we are going to compare model performances, we need to prepare a testing set. In other words, we split the data into training and testing sets. Fix the random seed to 2023 with set.seed(2023), then randomly select \(405\) observations as the training set and use the rest as the testing set.
Solutions
set.seed(2023)
id = sample(1:n, round(0.8*n))
dat_tr = dat[id, ]
dat_te = dat[-id, ]
Task 3.1:
Train the same model in task 1 with the training set and evaluate the performance of the resulting model with both the training set and the testing set.
Solutions
m1 = lm(medv~lstat, data = dat_tr)
# training set
mse1_tr = mean( (dat_tr$medv - m1$fitted.values)^2 )
mse1_tr
[1] 39.80271
# testing set
y_pre = predict(m1, newdata = dat_te)
mse1_te = mean((dat_te$medv - y_pre)^2)
mse1_te
[1] 33.22158
Task 3.2:
Train 2nd-, 7th-, and 20th-order polynomial regressions with the training set, and evaluate each resulting model with both the training set and the testing set.
Solutions
m2 = lm(medv~poly(lstat,2), data = dat_tr)
#m2 = lm(dat_tr$medv~I(dat_tr$lstat) + I(dat_tr$lstat^2)) # an alternative way
# training set
mse2_tr = mean( (dat_tr$medv - m2$fitted.values)^2 )
mse2_tr
[1] 30.16128
# testing set
y_pre = predict(m2, newdata = dat_te)
mse2_te = mean((dat_te$medv - y_pre)^2)
mse2_te
[1] 31.1774
m7 = lm(medv~poly(lstat,7), data = dat_tr)
# training set
mse7_tr = mean((m7$residuals)^2)
mse7_tr
[1] 26.82469
# testing set
y_pre = predict(m7, newdata = dat_te)
mse7_te = mean((dat_te$medv - y_pre)^2)
mse7_te
[1] 26.77753
m20 = lm(medv~poly(lstat,20), data = dat_tr)
# training set
mse20_tr = mean((m20$residuals)^2)
mse20_tr
[1] 25.60726
# testing set
y_pre = predict(m20, newdata = dat_te)
mse20_te = mean((dat_te$medv - y_pre)^2)
mse20_te
[1] 56949982980
Task 3.3:
Compare the results from sub tasks 3.1 to 3.2 and draw a conclusion.
Solutions: From the results we can see that the 7th-order polynomial regression predicts better than the simple linear regression model. The 20th-order polynomial has the smallest MSE on the training set but a very large MSE on the testing set, so it clearly overfits the data.
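A compact way to see this pattern is to loop over a few polynomial orders and record both MSEs; this sketch reuses dat_tr and dat_te from above, and the list of orders is an arbitrary choice:
# training vs. testing MSE for a few polynomial orders (sketch)
orders = c(1, 2, 7, 20)
for(d in orders){
  m = lm(medv ~ poly(lstat, d), data = dat_tr)
  mse_tr = mean(m$residuals^2)
  mse_te = mean((dat_te$medv - predict(m, newdata = dat_te))^2)
  print(round(c(order = d, train_mse = mse_tr, test_mse = mse_te), 3))
}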