Lab 1: Exercises of R programming
In this lab, we practice R programming. Warm-Up, Strength Training, and Extreme Cardio exercises of varying intensities will be provided. You can choose your starting level based on your own situation. As mentioned earlier, the best way to learn any language is through practice. So, let’s get our brains and fingers moving!
Note: Programming training is inseparable from mathematics and various algorithms. We don’t require you to complete all the tasks here; you can decide based on your own situation. There is no time limit; you can come back to the challenges whenever you have time. The more you solve, the more you practice, and the more you master. Along the way, you will also pick up more math and statistics.
1. Warm-Up
Task 1: Embark!
Install R (base) and RStudio. Launch RStudio and install the package tidyverse. It integrates multiple packages and has been revolutionary for data processing.
Import a data set, FCourse.txt. You can download it here. This dataset includes students’ preference levels for different subjects. Do you have any comments? Delete the last 10 rows of the data set and save it as a txt file on your disk. The function you need is write.table.
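As a rough sketch of Task 1 (the read.table arguments below, such as header, are assumptions about how FCourse.txt is formatted, and the output file name is a placeholder):

install.packages("tidyverse")   # only needed once
library(tidyverse)

# Assumed layout: a header row with whitespace-separated columns
FCourse = read.table("FCourse.txt", header = TRUE)
head(FCourse)                   # a quick look at the data

# Drop the last 10 rows and write the result back to disk
FCourse_small = FCourse[1:(nrow(FCourse) - 10), ]
write.table(FCourse_small, "FCourse_small.txt", row.names = FALSE)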
Task 2: A small test of skill
Task 2.1: Write a program to calculate the sum of the integers from 1 to 100. I’m not sure if you’ve heard the story about the great mathematician Gauss, who was the first pupil in his classroom to finish this calculation and thus the first to go home for dinner. I’m sure you will be even faster than him!
Task 2.2: After this, modify your code so that it calculates the factorial of 100 (the product of the integers from 1 to 100).
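If you get stuck, one possible sketch is a simple loop; R’s built-in sum(), prod(), and factorial() can be used to check your answers:

total = 0
for (i in 1:100) total = total + i
total                        # 5050, the same as sum(1:100)

fact = 1
for (i in 1:100) fact = fact * i
fact                         # compare with factorial(100) or prod(1:100)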
Task 3: Some Basic Plots
R is excellent at graphics, especially when you take the power of the ggplot2 package into account. We don’t have time to study this package in depth, but we will do some simple exercises.
Task 3.1: One can visualize a math function by the following code
x = seq(-pi, pi, 0.01)
plot(x, sin(x), type = "l")
abline(h = 0)
abline(v = 0)
Now, it is your turn. Visualize the density function of the normal distribution with mean 5 and sd 2.
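One possible solution sketch, mirroring the sine example above (dnorm gives the Normal density, and the range mean ± 4 standard deviations is just a convenient choice):

x = seq(5 - 4*2, 5 + 4*2, 0.01)
plot(x, dnorm(x, mean = 5, sd = 2), type = "l")
abline(v = 5)                # mark the mean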
Task 3.2: Learn and practice the following basic plotting functions: hist, boxplot, and pie.
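For example, you could try them on small simulated data sets along these lines:

z = rnorm(200, mean = 5, sd = 2)
hist(z)                                     # histogram of the simulated values
boxplot(z)                                  # box plot of the same values
grp = sample(c("A", "B", "C"), 50, replace = TRUE)
pie(table(grp))                             # pie chart of the group counts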
Task 4: Data frame
Use the following code to generate the example data for this task. You may also take a closer look at any functions you have not seen before or are not familiar with.
n = 100 # sample size
treatment = rbinom(n, 1, 0.5)
block = sample(c(1,2), n, replace = T)
sex = sample(c("F", "M"), n, replace = T)
age = round( runif(n, 18, 40) )
outcome = round( rnorm(n, 30, 10) )
Dat = data.frame(treatment, block, sex, age, outcome)
Task 4.1: Sort the data set by the variable age in ascending order; filter out all the rows of female observations; find the values of the variables age and outcome for all the rows that belong to treatment 1 and block 2.
Task 4.2: Randomly draw a sample of 80 observations from the data set; take them as sub-dataset 1 and the remaining observations as sub-dataset 2.
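A few hints for Task 4, assuming the data frame is called Dat as in the code above (this is only one of several possible approaches; dplyr verbs would work equally well):

Dat[order(Dat$age), ]                                  # sort by age, ascending
Dat[Dat$sex == "F", ]                                  # rows of female observations
Dat[Dat$treatment == 1 & Dat$block == 2, c("age", "outcome")]

idx = sample(1:nrow(Dat), 80)                          # 80 random row indices
sub1 = Dat[idx, ]                                      # sub-dataset 1
sub2 = Dat[-idx, ]                                     # sub-dataset 2: the remaining rows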
2. Strength Training
Task 5: About Numbers
In this task, we solve some problems about numbers.
Task 5.1: Finding prime numbers. Write a program to find all prime numbers up to 100. A prime number is a number that has only two factors, that is, 1 and the number itself.
Task 5.2: Write a function that can convert a binary number to a decimal integer.
Task 5.3: Write a function that can convert a decimal integer to a binary number.
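As a worked example of the conversion logic, binary 1011 equals \(1\cdot 8 + 0\cdot 4 + 1\cdot 2 + 1\cdot 1 = 11\). A minimal sketch of the binary-to-decimal direction (the function name bin2dec is made up, and strtoi is only used to check the result):

bin2dec = function(bits) {
  # bits: a character string of 0s and 1s, e.g. "1011"
  d = as.numeric(strsplit(bits, "")[[1]])
  sum(d * 2^(rev(seq_along(d)) - 1))
}
bin2dec("1011")              # 11, the same as strtoi("1011", base = 2)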
Task 6: Law of Large Numbers
Law of large numbers (LLN) is one of the foundations of probability theory. It states that as the number of trials or observations increases, the average of the results approaches the expected value. Simply put, suppose you have a fair coin. Each flip has an equal chance—50/50—of landing heads or tails. If you flip the coin repeatedly and track the cumulative proportion of heads, this proportion will get closer and closer to 0.5 as the number of flips increases.
In this task, we will “prove” the LLN by doing a small simulation. Let the computer mimic flipping a fair coin (generate random numbers from the Bernoulli distribution with \(p=0.5\)). Draw a graph to show that, as the number of flips increases, the cumulative proportion of heads (or tails) converges to 0.5.
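One way to set up this simulation (rbinom simulates the Bernoulli flips and cumsum tracks the running number of heads; the number of flips is an arbitrary choice):

set.seed(1)
n_flips = 10000
flips = rbinom(n_flips, 1, 0.5)               # 1 = heads, 0 = tails
running_prop = cumsum(flips) / (1:n_flips)    # cumulative proportion of heads
plot(running_prop, type = "l", xlab = "number of flips", ylab = "proportion of heads")
abline(h = 0.5)                               # the expected value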
Task 7: Central Limit Theorem
The Central Limit Theorem (CLT) is a key principle in statistics that states: for a sufficiently large sample size, the distribution of the sample mean approaches a normal (bell-shaped) distribution, regardless of the shape of the original population distribution. This means that if you take repeated random samples from any population with a finite mean and variance, the means of those samples will tend to follow a normal distribution as the sample size grows.
In this task, we will “demonstrate” the CLT through a simulation. We’ll repeatedly draw random samples, each with a sample size of 50, from a uniform distribution between 0 and 1. For each sample, we’ll calculate and record the average value. After repeating this process 1,000 times, plot a histogram to visualize the distribution of these 1,000 sample averages.
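A possible skeleton for this simulation (replicate repeats the sampling step 1,000 times):

set.seed(1)
sample_means = replicate(1000, mean(runif(50, 0, 1)))   # 1,000 averages, each of 50 uniforms
hist(sample_means, breaks = 30, xlab = "sample mean", main = "Distribution of sample means")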
Task 8: Box-Muller’s Algorithm
We briefly mentioned pseudo-random numbers in the lecture. Pseudo-random numbers are a sequence of numbers generated by some algorithm from an initial number, and they can mimic the behaviour of a random sample of uniform random variables. Once pseudo uniform random numbers are available, different algorithms can be applied to them to generate random numbers from other distributions. Here, you implement the Box-Muller algorithm to generate random numbers from an arbitrary Normal distribution. Write the implementation of this algorithm as a function, such that you can apply it to simulate a Normal random sample, just like the function `rnorm`.
Box-Muller’s Algorithm
- Step 1: Randomly generate \(u\) and \(v\) from \(U(0,1)\), uniform distribution between 0 and 1
- Step 2: Set \(x = \sqrt{ -2 \log (u) } \cos (2\pi v)\) and \(y = \sqrt{ -2 \log (u) } \sin (2\pi v)\)
Based on this procedure, the resulting values of \(x\) and \(y\) are independent and normally distributed with mean 0 and variance 1.
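A sketch of such a function, following the two steps above; the name rnorm_bm is made up, and the shift from \(N(0,1)\) to an arbitrary Normal uses the usual location-scale transformation \(\mu + \sigma z\):

rnorm_bm = function(n, mu = 0, sigma = 1) {
  m = ceiling(n / 2)
  u = runif(m); v = runif(m)                # Step 1
  x = sqrt(-2 * log(u)) * cos(2 * pi * v)   # Step 2
  y = sqrt(-2 * log(u)) * sin(2 * pi * v)
  z = c(x, y)[1:n]                          # keep exactly n values
  mu + sigma * z
}
hist(rnorm_bm(10000, mu = 5, sigma = 2))    # compare with rnorm(10000, 5, 2)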
Task 9: Newton Raphson Algorithm
Do you know the Newton-Raphson algorithm? The main idea of this algorithm is to find successively better approximations to the roots of a real-valued function. More specifically, assuming \(f(x)\) is differentiable and starting from an initial guess \(x_0\), the root of \(f(x)=0\) can be iteratively approximated as
\[ x_{n+1}=x_{n}-\frac{f(x_{n})}{f'(x_{n})} \] Now, you are required to apply this algorithm to find an approximate root of \(x^3-x-1=0\).
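A minimal sketch of the iteration for \(f(x)=x^3-x-1\), with \(f'(x)=3x^2-1\); the starting value and the fixed number of iterations are arbitrary choices:

f  = function(x) x^3 - x - 1
fp = function(x) 3 * x^2 - 1
x = 1.5                          # initial guess
for (i in 1:20) {
  x = x - f(x) / fp(x)           # Newton-Raphson update
}
x                                # approximate root; f(x) should now be close to 0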
Task 10: Bootstrap Algorithm
The bootstrap algorithm is a statistical technique used to estimate the distribution of a sample statistic by resampling the observed data with replacement. In essence, it involves repeatedly drawing samples (typically of the same size as the original sample) from the dataset, calculating the desired statistic (e.g., mean, median, or standard deviation) for each resample, and then aggregating these results. This method allows for estimating the variability or confidence intervals of statistics without requiring complex mathematical formulas, making it especially useful when traditional parametric assumptions (like normality) are not met. The well-known machine learning algorithm random forest was developed based on this resampling idea.
Prepare the data set: Simulate \(x_i, i = 1,2,...,30\) from the uniform distribution \(U(0,5)\) with the function `runif`, and \(\epsilon_i, i=1,2,...,30\) from the standard Gaussian distribution. Calculate \(y_i = 0.5+1.5x_i+\epsilon_i\).
Next, we will apply the bootstrap algorithm to estimate the confidence interval (CI) of the regression coefficient and compare it with the CI calculated by formula.
Task 10.1: Calculate the CI by formula. Employ the function `lm` to estimate the regression model and use the output to calculate the confidence interval of the slope term of the regression model.
Task 10.2: Calculate the CI of the slope estimate by the bootstrap algorithm.
First, generate a bootstrap sample from the data: the bootstrap sample is resampled from the simulated data with replacement. Second, estimate the regression model with the bootstrap sample and record the estimate of the slope term. Repeat these two steps 1000 times. The bootstrap confidence interval is given by the lower and upper quantile values of all the slope estimates. Compare the results with Task 10.1. This procedure is summarized in the following pseudo-algorithm:
B = 1000
for(i in 1:B){
  # Step 1: draw a bootstrap sample from the data set
  # Step 2: apply `lm` to estimate the model with the bootstrap sample
  # Step 3: save the estimate of the slope term
}
# Calculate the quantile values of the 1000 slope estimates.
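Filling in the pseudo-algorithm, one possible sketch, assuming the simulated data from the preparation step are stored in vectors x and y of length 30 and that a 95% interval is wanted:

B = 1000
slopes = numeric(B)
for (i in 1:B) {
  idx = sample(1:30, 30, replace = TRUE)    # Step 1: bootstrap sample (with replacement)
  fit = lm(y[idx] ~ x[idx])                 # Step 2: refit the regression model
  slopes[i] = coef(fit)[2]                  # Step 3: save the slope estimate
}
quantile(slopes, c(0.025, 0.975))           # 95% bootstrap CI for the slope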
3. Extreme Cardio
Task 11: Perceptron Algorithm
We have discussed this algorithm in lecture 1. Now, let’s implement it in R.
Task 11.1: Use the following code to generate the data
set.seed(201606)
N = 20
x1 = runif(N,-1,1); x2 <- runif(N,-1,1); X <- cbind(x1,x2)
y = ifelse(x2>x1,-1,1); id <- 1:N
t = seq(-1.5,1.5,0.1)
dat = cbind(y, x1, x2)
Here, you can ignore the bias (constant) term in the classifier, just like what we discussed in lecture 1. Write your function and use it to find the sword of judgment!
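If you need a starting point, here is a minimal sketch of the standard perceptron update without a bias term; it assumes labels in \(\{-1, 1\}\), as in the generated data, and may differ in details from the version in the lecture:

perceptron = function(X, y, max_iter = 1000) {
  w = rep(0, ncol(X))                        # weight vector, no bias term
  for (iter in 1:max_iter) {
    wrong = which(sign(X %*% w) != y)        # currently misclassified points
    if (length(wrong) == 0) break            # perfectly separated: stop
    j = wrong[1]
    w = w + y[j] * X[j, ]                    # standard perceptron update
  }
  w
}
w = perceptron(X, y)                         # X and y from the code above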
Task 11.2: Use the following code to get the data
# Here, we will train a Perceptron algorithm on a subset of the iris data
X = iris[1:100,1:2]
y = as.numeric(iris[1:100,5])
dat = cbind(y, X)
Write your function and use it to find the sword of judgment!
Task 12: Decipher Problem
Do you know the substitution cipher? A substitution cipher is a method of encryption where each letter in a text is replaced by another letter according to a specific system. For example, if we use H to represent D, \(H \to D\), and similarly \(A \to T\) and \(U \to A\), then the substitution cipher of the word DATA is HUAU. Next, I ask you a secret question in a text and encrypt it using a substitution cipher. My cipher is simple; it only includes the 26 lowercase letters and the space character. The task is to write a program that implements the algorithm illustrated in the notes to crack my cipher and answer my secret question.
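To make the encryption side concrete, here is a tiny sketch of applying a substitution in R with chartr, using the D/A/T example above (this is not my actual key, and a full key would also have to handle the space character):

plain  = "dat"                              # hypothetical partial key, lowercase
cipher = "hua"                              # d -> h, a -> u, t -> a
chartr(plain, cipher, "data")               # returns "huau"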
- My ciphertext: download it here
- The transition probability matrix: download it here
Congratulations! You have become so powerful!