res = 0
for(i in 1:100){
  res = res + i
}
cat("The sum of integers from 1 to 100 is:", res, "\n")
The sum of integers from 1 to 100 is: 5050
Install RStudio and base R. Launch RStudio and install the package tidyverse. It integrates multiple packages and has been revolutionary for data processing.
Import a data set, FCourse.txt. You can download it here. This data set includes students' preference levels for different subjects. Do you have any comments? Delete the last 10 rows of the data set and save it as a txt file on your disk. The function you need is write.table.
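A possible sketch of these steps, assuming FCourse.txt is a whitespace-separated file with a header row (check the file to confirm) and using FCourse_small.txt as an arbitrary output name:
# install.packages("tidyverse")   # install the package once
library(tidyverse)
# import the data set (adjust the header/sep arguments to match the actual file)
fc = read.table("FCourse.txt", header = TRUE)
# delete the last 10 rows and save the result as a txt file on your disk
fc_small = fc[1:(nrow(fc) - 10), ]
write.table(fc_small, "FCourse_small.txt", row.names = FALSE)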
Task 2.1: Write a program to calculate the sum of the integers from 1 to 100. I'm not sure if you've heard the story about the great mathematician Gauss, who was the first pupil in his classroom to finish this calculation and the first to go home for dinner. I'm sure you must be even faster than him!
You can apply a for loop:
The sum of integers from 1 to 100 is: 5050
Or, a simpler solution is possible.
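For example, a one-line sketch using the built-in `sum` function:
cat("The sum of integers from 1 to 100 is:", sum(1:100), "\n")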
The sum of integers from 1 to 100 is: 5050
Task 2.2: After this, modify your code such that you can calculate the factorial of 100 (the product of integers from 1 to 100) by the program.
For loop solution:
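A possible sketch (note that 100! is about 9.33e157, so R stores it as a double-precision approximation rather than an exact integer):
res = 1
for(i in 1:100){
  res = res * i
}
cat("The factorial of 100 is:", res, "\n")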
R is excellent at graphics, especially when the power of the ggplot2 package is taken into account. We don't have time to study this package here, but we will do some simple exercises.
Task 3.1: One can visualize a math function with a few lines of code.
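For example, a minimal sketch that plots \(\sin(x)\):
x = seq(0, 2*pi, 0.01)
plot(x, sin(x), type = "l", ylab = "f(x)", main = "Visualization of sin(x)")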
Now, it is your turn. Visualize the density function of the normal distribution with mean 5 and sd 2.
x = seq(-1, 11, 0.01)
plot(x, dnorm(x, 5, 2), type = "l", ylab = "f(x)", main = "Density function of normal distribution")
Task 3.2: Learn and practice the following basic plotting functions: hist, boxplot, and pie.
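For example, a minimal sketch with made-up data (all values below are purely for illustration):
x = rnorm(200)                         # a made-up numeric variable
g = rep(c("A", "B"), each = 100)       # a made-up grouping variable
hist(x, main = "Histogram of x")       # distribution of a numeric variable
boxplot(x ~ g, main = "Boxplot of x by group")
pie(c(Math = 30, Stat = 45, CS = 25), main = "Pie chart of subject preferences")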
Use the following code to generate the example data for this task. By the way, you may also take a closer look at any functions that you have not seen before or are not familiar with.
Task 4.1: Sort the data set by the variable age in ascending order; filter out all the rows of female observations; find the values of the variables age and outcome for all the rows that belong to treatment 1 and block 2.
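A possible base-R sketch, assuming the generated data frame is called dat (the name is an assumption) and has the columns shown in the output below:
dat_sorted = dat[order(dat$age), ]                     # sort by age in ascending order
dat_male   = dat[dat$sex != "F", ]                     # drop the female observations
dat_t1b2   = dat[dat$treatment == 1 & dat$block == 2,  # treatment 1 and block 2
                 c("age", "outcome")]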
treatment block sex age outcome
16 0 2 M 18 20
33 0 1 F 18 34
95 0 1 F 18 33
34 1 1 M 19 22
60 1 1 M 19 34
52 1 2 M 20 30
44 0 1 M 21 20
57 1 2 F 21 14
77 1 1 M 21 32
42 0 2 F 22 38
treatment block sex age outcome
3 0 2 M 35 30
4 0 1 M 25 27
6 1 1 M 34 25
7 1 2 M 29 18
8 0 2 M 29 48
10 1 2 M 27 42
age outcome
7 29 18
10 27 42
11 30 37
15 39 39
26 29 35
28 29 39
39 29 21
46 39 35
52 20 30
53 36 18
Task 4.2: Randomly draw a sample of 80 observations from the data set and set them as sub-dataset 1, with the remaining observations as sub-dataset 2.
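A possible sketch using `sample`, again assuming the data frame is called dat:
id   = sample(1:nrow(dat), 80)   # indices of the 80 randomly drawn observations
sub1 = dat[id, ]                 # sub-dataset 1
sub2 = dat[-id, ]                # sub-dataset 2: the remaining observations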
In this task, we solve some problems about numbers.
Task 5.1: Finding prime numbers. Write a program to find all prime numbers up to 100. A prime number is a number that has exactly two factors: 1 and the number itself.
# Function to check if a number is prime
is.prime = function(num) {
  if (num <= 1) {
    return(FALSE)   # numbers less than or equal to 1 are not prime
  }
  if (num <= 3) {
    return(TRUE)    # 2 and 3 are prime; handle them before the loop
  }
  for (i in 2:floor(sqrt(num))) {
    if (num %% i == 0) {
      return(FALSE) # divisible by some i between 2 and sqrt(num): not prime
    }
  }
  return(TRUE)      # no divisors were found: prime
}
# Find and print all prime numbers up to 100
prime_numbers = c()
for (i in 2:100) {
  if (is.prime(i)) {
    prime_numbers = c(prime_numbers, i)
  }
}
cat("Prime numbers up to 100:", prime_numbers, "\n")
Prime numbers up to 100: 2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97
Task 5.2: Write a function that can convert a binary number to a decimal integer.
# Function to convert binary to decimal
b2d = function(binary) {
  # Easy to work with a character string
  binary = as.character(binary)
  # Reverse the binary string to handle the least significant bit first
  binary = rev(strsplit(binary, NULL)[[1]])
  # Initialize the decimal value
  decimal = 0
  # Loop over each bit in the binary string
  for (i in 1:length(binary)) {
    if (binary[i] == "1") {
      # Add the value corresponding to the bit's position
      decimal = decimal + 2^(i - 1)
    }
  }
  return(decimal)
}
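A quick sanity check (values verified by hand):
b2d(1010)    # 10
b2d("1101")  # 13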
Task 5.3: Write a function that can convert a decimal integer to a binary number.
# Function to convert decimal to binary
d2b = function(decimal) {
  # Initialize an empty string to store the binary representation
  binary = ""
  # Edge case for 0
  if (decimal == 0) {
    return("0")
  }
  # Loop to divide the decimal number by 2 and store the remainder
  while (decimal > 0) {
    remainder = decimal %% 2             # Find remainder (0 or 1)
    binary = paste0(remainder, binary)   # Prepend the remainder to the binary string
    decimal = decimal %/% 2              # Perform integer division by 2
  }
  return(binary)
}
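A quick sanity check:
d2b(10)  # "1010"
d2b(0)   # "0"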
The law of large numbers (LLN) is one of the foundations of probability theory. It states that as the number of trials or observations increases, the average of the results approaches the expected value. Simply put, suppose you have a fair coin. Each flip has an equal chance (50/50) of landing heads or tails. If you flip the coin repeatedly and track the cumulative proportion of heads, this proportion will get closer and closer to 0.5 as the number of flips increases.
In this task, we will “prove” the LLN by doing a small simulation. Let the computer mimic flipping a fair coin (generate a random number from Bernoulli distribution with \(p=0.5\)). Draw a graph to show that as the number of flips increases, the cumulative proportion of getting heads or tails converges to 0.5.
n = 1000
s = 0              # running total of heads
res = numeric(n)
for(i in 1:n){
  s = s + rbinom(1, 1, 0.5)
  res[i] = s/i     # cumulative proportion of heads after i flips
}
# or
x = rbinom(n, 1, 0.5)
for(i in 1:n){
  res[i] = sum(x[1:i])/i
}
# visualize the results
plot(res, type = "l")
abline(h = 0.5, col = "red")
Or, you can do it without using a for loop.
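For example, one possible vectorized sketch uses `cumsum`:
x = rbinom(n, 1, 0.5)      # n coin flips
res = cumsum(x) / (1:n)    # cumulative proportion of heads
plot(res, type = "l")
abline(h = 0.5, col = "red")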
The Central Limit Theorem (CLT) is a key principle in statistics that states: for a sufficiently large sample size, the distribution of the sample mean approaches a normal (bell-shaped) distribution, regardless of the shape of the original population distribution. This means that if you take repeated random samples from any population with a finite mean and variance, the means of those samples will tend to follow a normal distribution as the sample size grows.
In this task, we will “demonstrate” the CLT through a simulation. We’ll repeatedly draw random samples, each with a sample size of 50, from a uniform distribution between 0 and 1. For each sample, we’ll calculate and record the average value. After repeating this process 1,000 times, plot a histogram to visualize the distribution of these 1,000 sample averages.
# Set parameters
n_samples = 1000 # Number of repeated samples
sample_size = 50 # Size of each sample
# Initialize a vector to store the sample means
sample_means = numeric(n_samples)
# Repeat the sampling process
for (i in 1:n_samples) {
  # Draw a random sample from a uniform distribution between 0 and 1
  sample = runif(sample_size, min = 0, max = 1)
  # Calculate the mean of the sample
  sample_means[i] = mean(sample)
}
# Plot a histogram of the sample means
hist(sample_means,
     breaks = 30,   # Number of bins for the histogram
     main = "Distribution of Sample Means (CLT Demonstration)",
     xlab = "Sample Mean",
     col = "lightblue",
     border = "black")
We briefly mentioned pseudo-random numbers in the lecture. Pseudo-random numbers are a sequence of numbers generated by an algorithm from an initial value, and they mimic the behaviour of a random sample of uniform random variables. Once pseudo-random uniform numbers are available, different algorithms can be applied to them to generate random numbers from other distributions. Here, you implement Box and Muller's algorithm to generate random numbers from an arbitrary Normal distribution. Write the implementation of this algorithm as a function, such that you can apply it to simulate a Normal random sample, just like the function `rnorm`.
Box-Muller's Algorithm: generate two independent uniform random numbers \(u_1, u_2 \sim U(0,1)\) and set \[ x=\sqrt{-2\ln u_1}\,\cos(2\pi u_2), \qquad y=\sqrt{-2\ln u_1}\,\sin(2\pi u_2). \] Based on this procedure, the resulting values of \(x\) and \(y\) are independent and normally distributed with mean 0 and variance 1.
# write a function to generate a standard Gaussian sample by Box-Muller
bm = function(n = 100){
  m = ceiling(n/2)                     # each pair (u1, u2) yields two Normal values
  u1 = runif(m)
  u2 = runif(m)
  x1 = sqrt(-2*log(u1))*cos(2*pi*u2)
  x2 = sqrt(-2*log(u1))*sin(2*pi*u2)
  return(c(x1, x2)[1:n])               # return exactly n values
}
# apply function bm to generate a sample from an arbitrary Normal distribution
myNormal = function(n = 100, mu = 0, sigma = 1){
  res = mu + sigma*bm(n)
  return(res)
}
hist(myNormal(10000, 1, 2),
     breaks = 30,
     freq = FALSE,   # display density instead of frequency
     main = "Histogram of the simulated Normal sample",
     xlab = "Normally distributed sample",
     col = "lightblue",
     border = "black")
x = seq(-5, 7, 0.01)
points(x, dnorm(x, 1, 2), type = "l", col = "red")
Do you know the Newton-Raphson algorithm? The main idea of this algorithm is to find successively better approximations to the roots of a real-valued function. More specifically, assuming \(f(x)\) is differentiable and starting from an initial guess \(x_0\), the root of \(f(x)=0\) can be iteratively approximated as
\[ x_{n+1}=x_{n}-\frac{f(x_{n})}{f'(x_{n})} \] Now, you are required to apply this algorithm to find an approximated root of \(x^3-x-1=0\).
f = function(x){x^3 - x - 1}
ff = function(x){3*x^2 - 1}   # derivative of f
x_0 = 0
err = 10
tau = 0.000001   # tolerance of the approximation error; it decides when we stop the loop
while(err > tau){
  x_1 = x_0 - f(x_0)/ff(x_0)
  err = abs(x_1 - x_0)
  # err = abs(f(x_1))   # an alternative stopping criterion
  x_0 = x_1
}
x_1 # the root
[1] 1.324718
[1] 2.747136e-12
The bootstrap algorithm is a statistical technique used to estimate the distribution of a sample statistic by resampling the observed data with replacement. In essence, it involves repeatedly drawing samples (typically of the same size as the original sample) from the dataset, calculating the desired statistic (e.g., mean, median, or standard deviation) for each resample, and then aggregating these results. This method allows for estimating the variability or confidence intervals of statistics without requiring complex mathematical formulas, making it especially useful when traditional parametric assumptions (like normality) are not met. The well-known machine learning algorithm random forest was developed based on this resampling idea.
Prepare the data set: simulate \(x_i, i = 1,2,...,30\) from the uniform distribution \(U(0,5)\) with the function `runif`, and \(\epsilon_i, i=1,2,...,30\) from the standard Gaussian distribution. Calculate \(y_i = 0.5+1.5x_i+\epsilon_i\).
Next, we will apply the bootstrap algorithm to estimate the confidence interval (CI) of the regression coefficient, and compare it with the CI calculated by formula.
Task 10.1: Calculate the CI by formula. Employ the function `lm` to estimate the regression model and use its output to calculate the confidence interval of the slope term of the regression model.
Task 10.2: Calculate the CI of the estimated slope term by the bootstrap algorithm.
First, we generate a bootstrap sample from the data. The bootstrap sample is resampled from the simulated data with replacement. Second, estimate the regression model with the bootstrap sample and record the estimate of the slope term. Repeat these two steps 1000 times. The bootstrap confidence interval is given by the upper and lower quantiles of all the estimates of the slope term. Compare the results with Task 10.1. This procedure is summarized in the following pseudo-algorithm.
B = 1000
for(i in 1:B){
  # Step 1: draw a bootstrap sample from the data set
  # Step 2: apply `lm` to estimate the model with the bootstrap sample
  # Step 3: save the estimate of the slope term
}
# Calculate the quantile values of the 1000 estimates of the slope term.
Solutions: Data preparation:
N = 30 # sample size
x = runif(N, 0, 5) # regressor from uniform distribution
e = rnorm(N) # noise is from normal
y = 0.5 + 1.5*x + e
plot(x,y)
Confidence intervals by regression model outputs.
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-1.7912 -0.6825 0.1825 0.5853 1.3844
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.6628 0.3011 2.201 0.0362 *
x 1.4598 0.1020 14.313 2.09e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8455 on 28 degrees of freedom
Multiple R-squared: 0.8798, Adjusted R-squared: 0.8755
F-statistic: 204.9 on 1 and 28 DF, p-value: 2.093e-14
[1] 2.080144
[1] 1.437656
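One possible sketch for this step, based on the `lm` output (the exact numbers depend on the simulated data, since no seed is set; the model object m is reused in the bootstrap code below):
m = lm(y ~ x)                          # fit the regression model
summary(m)
est = summary(m)$coefficients[2, 1]    # slope estimate
se  = summary(m)$coefficients[2, 2]    # its standard error
est + qt(0.975, df = N - 2) * se       # upper bound of the 95% CI
est - qt(0.975, df = N - 2) * se       # lower bound of the 95% CI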
Confidence intervals by Bootstrap algorithm.
# Bootstrap CI
B = 1000                 # the number of bootstrap replicates
beta_hat = m$coefficients
slop_est = numeric(B)    # store all B estimates of the slope term
for(i in 1:B){
  id = sample(1:N, N, replace = TRUE)   # indices of the cases included in the bootstrap sample
  bt.y = y[id]
  bt.x = x[id]
  slop_est[i] = lm(bt.y ~ bt.x)$coefficients[2]
}
# CI by Bootstrap
quantile(slop_est, 0.975)   # upper bound
   97.5% 
1.684921 
quantile(slop_est, 0.025)   # lower bound
    2.5% 
1.250433 
We have discussed this algorithm (the perceptron) in lecture 1. Now, let's implement it in R.
Task 11.1: Use the following code to generate the data
set.seed(201606)
N = 20
x1 = runif(N,-1,1); x2 = runif(N,-1,1); X = cbind(x1,x2)
y = ifelse(x2>x1,-1,1); id = 1:N
t = seq(-1.5,1.5,0.1)
dat = cbind(y, x1, x2)
Here, you can ignore the bias (constant) term in the classifier, just as we discussed in lecture 1. Write your function and use it to find the sword of judgment!
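A possible sketch of a perceptron without a bias term (the function name and the arguments eta and max_iter are assumed names; labels are assumed to be in {-1, 1}, as in the data generated above):
# perceptron without bias: learn w such that sign(w'x) matches the labels
perceptron = function(X, y, eta = 0.1, max_iter = 1000){
  w = rep(0, ncol(X))                      # initialize the weight vector
  for (iter in 1:max_iter) {
    updated = FALSE
    for (i in 1:nrow(X)) {
      if (y[i] * sum(w * X[i, ]) <= 0) {   # misclassified (or on the boundary)
        w = w + eta * y[i] * X[i, ]        # standard perceptron update
        updated = TRUE
      }
    }
    if (!updated) break                    # stop once all points are classified correctly
  }
  return(w)
}
# w = perceptron(X, y)   # the separating line ("sword of judgment"): w[1]*x1 + w[2]*x2 = 0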
Task 11.2: Use the following code to get the data
# Here, we will train a perceptron on a subset of the iris data
X = iris[1:100,1:2]
y = as.numeric(iris[1:100,5])
dat = cbind(y, X)
Write your function and use it to find the sword of judgment!
Do you know the substitution cipher? A substitution cipher is a method of encryption where each letter in a text is replaced by another letter according to a specific system. For example, if we use H to represent D, \(H \to D\), and similarly \(A \to T\) and \(U \to A\), then the substitution cipher of the word DATA is HUAU. Next, I ask you a secret question in a text and encrypt it using a substitution cipher. My cipher is simple; it only includes the 26 lowercase letters and the space character. The task is to write a program that implements the algorithm illustrated in the notes to crack my cipher and answer my secret question.
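As a small illustration of the cipher itself (not of the cracking algorithm), the letter mapping from the DATA example can be applied with `chartr`:
chartr("DAT", "HUA", "DATA")   # encrypts DATA as "HUAU"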
Congratulations! You have become so powerful!