Lab 1: Practice on PCA Method

In this lab, we work with the HWD dataset. The dataset is stored in an R workspace file, which you can download here. You have two main tasks: first, write a program to reconstruct images from principal components (PCs) and their corresponding weights; second, train a classifier using the extracted features.

Note: Performing PCA on a given dataset is not difficult, and R provides many functions that implement the PCA algorithm. However, the main purpose of this exercise is to develop a deeper understanding of PCA and to consider how statistical PCA can be interpreted as a ‘machine’ in machine learning. Therefore, I recommend implementing the image reconstruction process in R yourself, using the method from Lecture 1 (3.2 A simple example).
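As a preview of that do-it-yourself approach, the core of PCA is just an eigendecomposition of the sample covariance matrix. Below is a minimal sketch; the simulated matrix `X` is only a stand-in for the HWD images (one image per row, 256 pixels per image), so the chunk runs on its own:

```r
# Minimal PCA "machine": eigendecomposition of the sample covariance matrix.
# X is simulated stand-in data; with the real HWD data, use your digit's images.
set.seed(1)
X <- matrix(rnorm(100 * 256), nrow = 100, ncol = 256)

Xc  <- scale(X, center = TRUE, scale = FALSE)  # center each pixel (column)
S   <- cov(Xc)                                 # 256 x 256 sample covariance
eig <- eigen(S)                                # symmetric eigendecomposition
V   <- eig$vectors                             # columns = principal directions,
                                               # sorted by decreasing eigenvalue
PCs <- Xc %*% V                                # row i = PC scores of image i
```

The columns of `V` are the principal (eigen)vectors, and the matrix product `Xc %*% V` gives every image's scores at once, which is all that Tasks 1.2 and 1.3 require.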

Task 1: Image reconstruction

In the next series of tasks, we implement image reconstruction with the HWD dataset.

Task 1.1: Choose your favorite number. Use the function ‘image’ to display the first 24 cases. Below is an example of how to use the function image(). In the following R chunk, the first two lines set up the color palette; you can also try your favorite palette. In the third line, X represents all images of your favorite number, i.e., the part of the Dat matrix in the R workspace excluding the first column. The last line plots the 8th image in X.

colors = c('white','black')                              # endpoint colors of the palette
cus_col <- colorRampPalette(colors = colors)             # interpolating color palette
temp = matrix(X[8, 256:1], 16, 16, byrow = T)[, 16:1]    # reshape row 8 into a 16 x 16 image
image(t(temp), col = cus_col(256), frame = T, axes = F)  # plot it in grayscale
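To show all 24 cases at once, the same plotting call can be wrapped in a loop over a panel grid. The simulated `X` below is a stand-in so the chunk runs on its own; with the real data, use your digit's image matrix and the cus_col palette from the chunk above:

```r
# Display the first 24 images of the chosen digit in a 4 x 6 grid.
set.seed(1)
X <- matrix(runif(100 * 256), 100, 256)           # stand-in for the digit's images
cus_col <- colorRampPalette(c('white', 'black'))  # palette as in the chunk above

par(mfrow = c(4, 6), mar = rep(0.5, 4))           # 4 x 6 grid of panels
for (i in 1:24) {
  temp <- matrix(X[i, 256:1], 16, 16, byrow = TRUE)[, 16:1]
  image(t(temp), col = cus_col(256), frame = TRUE, axes = FALSE)
}
```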

Task 1.2: Do PCA on your data set. Display the first 4 and the last 4 principal vectors (eigenvectors) as \(16 \times 16\) images.
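An eigenvector of the sample covariance matrix has 256 entries, so it can be reshaped and displayed exactly like an image. A sketch, again with simulated stand-in data in place of your digit's images:

```r
# Display the first 4 and last 4 eigenvectors as 16 x 16 images.
set.seed(1)
X <- matrix(rnorm(100 * 256), 100, 256)           # stand-in for the digit's images
V <- eigen(cov(X))$vectors                        # columns sorted by decreasing eigenvalue
cus_col <- colorRampPalette(c('white', 'black'))

par(mfrow = c(2, 4), mar = c(0.5, 0.5, 1.5, 0.5))
for (j in c(1:4, 253:256)) {                      # first 4 and last 4 principal vectors
  v <- matrix(V[256:1, j], 16, 16, byrow = TRUE)[, 16:1]
  image(t(v), col = cus_col(256), main = paste("PC", j), axes = FALSE)
}
```

With the real data, the leading eigenvectors should look like blurry digit shapes, while the trailing ones look like noise.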

Task 1.3: Calculate all the PCs for all the images of your favorite number.

Task 1.4: Image reconstruction. Reconstruct an image in your data set from its first 30, 60, 100, 150, and 200 principal components separately. Put the original image and the five approximated images in one plot. For each approximated image, calculate and report the mean squared error.
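Reconstruction from the first k components is the back-projection mean + scores × t(Vk). A sketch, with simulated data standing in for your digit's images; the helper `reconstruct` is a hypothetical name, not from the lab:

```r
# Reconstruct image i from its first k principal components and report the MSE.
set.seed(1)
X  <- matrix(rnorm(100 * 256), 100, 256)  # stand-in for the digit's images
mu <- colMeans(X)
Xc <- sweep(X, 2, mu)                     # centered data
V  <- eigen(cov(X))$vectors

reconstruct <- function(i, k) {
  Vk <- V[, 1:k, drop = FALSE]            # first k principal vectors
  mu + as.vector(Xc[i, ] %*% Vk %*% t(Vk))  # back-project the k scores
}

i <- 8
for (k in c(30, 60, 100, 150, 200)) {
  mse <- mean((X[i, ] - reconstruct(i, k))^2)
  cat("k =", k, " MSE =", mse, "\n")
}
```

Because the subspaces are nested, the MSE can only decrease (or stay equal) as k grows, and with all 256 components the reconstruction is exact.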

Task 2: Train a classifier with extracted features

Build a classifier based on the first two principal components (PCs) to classify the images of digits 5 and 6. The following procedure is suggested to solve this task.

  1. Create a new working dataset that contains all the images of 5 and 6.
  2. Split the data set into training set (\(80\%\)) and testing set (\(20\%\)) by random sampling.
  3. Do PCA on the training set. Use the eigenvectors of the sample covariance matrix of the training set to calculate the PCs for both the training set and testing set.
  4. Use the PCs of the training set to build the classifier.
  5. Apply your classifier to the testing set PCs and calculate the accuracy.
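The five steps above can be sketched end to end. The simulated two-class data below stands in for the digit-5 and digit-6 images, and logistic regression is just one reasonable choice of classifier (LDA or k-NN would work as well); the key point is that the eigenvectors come from the training set only:

```r
# Task 2 pipeline sketch: split, PCA on training set, classify on 2 PCs.
set.seed(1)
n <- 200
X <- rbind(matrix(rnorm(100 * 256, mean = 0.0), 100, 256),   # stand-in "digit 5"
           matrix(rnorm(100 * 256, mean = 0.3), 100, 256))   # stand-in "digit 6"
y <- factor(rep(c(5, 6), each = 100))

idx <- sample(n, size = 0.8 * n)              # 80/20 random split
Xtr <- X[idx, ];  ytr <- y[idx]
Xte <- X[-idx, ]; yte <- y[-idx]

mu <- colMeans(Xtr)                           # center with TRAINING means only
V  <- eigen(cov(Xtr))$vectors[, 1:2]          # eigenvectors from the TRAINING set
Ztr <- sweep(Xtr, 2, mu) %*% V                # first two PCs, training set
Zte <- sweep(Xte, 2, mu) %*% V                # same projection applied to test set

train <- data.frame(y = ytr, pc1 = Ztr[, 1], pc2 = Ztr[, 2])
test  <- data.frame(pc1 = Zte[, 1], pc2 = Zte[, 2])
fit   <- glm(y ~ pc1 + pc2, data = train, family = binomial)  # logistic regression
prob  <- predict(fit, newdata = test, type = "response")
pred  <- ifelse(prob > 0.5, levels(y)[2], levels(y)[1])
acc   <- mean(pred == yte)
cat("Test accuracy:", acc, "\n")
```

Projecting the test set with the training-set mean and eigenvectors (rather than redoing PCA on the test set) is what keeps the evaluation honest.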

Could you achieve a comparable level of accuracy using only two variables (pixels) from the original dataset to build the classifier? Think about it.


© 2024 Xijia Liu. All rights reserved. Contact: xijia.liu AT umu.se