Problem 1: Gibbs Sampling (Bivariate Normal):

Consider using Gibbs sampling to generate random values from the random vector (X, Y) having a bivariate normal distribution with mean \(\mu\) and covariance matrix \(\Sigma\).
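Gibbs sampling here alternates between the two univariate conditionals of the bivariate normal, each of which is itself normal. With \(\rho = \sigma_{12}/\sqrt{\sigma_{11}\sigma_{22}}\), the standard conditional formulas (these are exactly the means and variances used in the sampler below) are:

\[ Y \mid X = x \sim N\left(\mu_2 + \rho\sqrt{\tfrac{\sigma_{22}}{\sigma_{11}}}\,(x - \mu_1),\ (1-\rho^2)\,\sigma_{22}\right), \qquad X \mid Y = y \sim N\left(\mu_1 + \rho\sqrt{\tfrac{\sigma_{11}}{\sigma_{22}}}\,(y - \mu_2),\ (1-\rho^2)\,\sigma_{11}\right) \]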

a) Write a general R program with the following input:

mu = c(1, 2)
sig = matrix(c(1, .5, .5, .4), 2, 2)
x0 = 7
n = 5000

gibbs.binorm <- function(mu, sig, x0, n){
  pairs = matrix(0, 2, n)

  # correlation and conditional-mean coefficients
  r = sig[1,2]/sqrt(sig[2,2]*sig[1,1])
  c.y = r*sqrt(sig[2,2]/sig[1,1])
  c.x = r*sqrt(sig[1,1]/sig[2,2])

  # initialize: draw y0 from Y | X = x0
  y0 = rnorm(n = 1, mean = mu[2] + c.y*(x0 - mu[1]), sd = sqrt((1 - r^2)*sig[2,2]))
  pairs[, 1] = c(x0, y0)

  for(i in 1:(n - 1)){
    pair.temp = c(0, pairs[2, i])
    # draw X | Y = y, then Y | X = x, from the conditionals above
    pair.temp[1] = rnorm(n = 1, mean = mu[1] + c.x*(pair.temp[2] - mu[2]), sd = sqrt((1 - r^2)*sig[1,1]))
    pair.temp[2] = rnorm(n = 1, mean = mu[2] + c.y*(pair.temp[1] - mu[1]), sd = sqrt((1 - r^2)*sig[2,2]))
    pairs[, i + 1] = pair.temp
  }

  # drop the first 1000 draws as burn-in (used in part b) and return the rest
  data.frame(x = pairs[1, 1001:n], y = pairs[2, 1001:n])
}

binorm_res = gibbs.binorm(mu, sig, x0, n)

b) Burn 1000 elements of the chain that you obtained, and produce a bivariate plot of x values versus y values. Explain whether your plot is what you expect.
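The function above already drops the first 1000 draws, so a minimal plotting sketch using the returned chain is:

plot(binorm_res$x, binorm_res$y, xlab = "x", ylab = "y")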

Yes. The plot shows that x is centered at its mean of 1, with most of the data within 2 standard deviations (sd = 1). Likewise, y is centered at its mean of 2, with most of its data within 2 standard deviations (sd = sqrt(.4) ≈ .632). The upward tilt of the point cloud also reflects the positive correlation, r = .5/sqrt(.4) ≈ .79.
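These claims can be checked numerically against the target parameters (a quick sketch using the burned chain from part a):

sapply(binorm_res, mean)         # should be near mu = (1, 2)
sapply(binorm_res, sd)           # should be near (1, sqrt(.4)) ≈ (1, .632)
cor(binorm_res$x, binorm_res$y)  # should be near .5/sqrt(.4) ≈ .79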

Problem 2:

a) Summarize and plot the data as appropriate.

BCdata = read.table("/Users/josegonzalez/Documents/R/breastcancer.dat", header = TRUE)

treat = BCdata[BCdata$treatment == 1, ]
control = BCdata[BCdata$treatment == 0, ]

treat.uncen = treat[treat$censored == 0, ]
control.uncen = control[control$censored == 0, ]

# proportion treated and proportion censored
prop = data.frame(proportion = apply(BCdata, 2, mean)[2:3])

# mean and sd of time before recurrence, censored times included
m = c(mean(treat[,1]), mean(control[,1]))
sd = c(sd(treat[,1]), sd(control[,1]))
            proportion
treatment    0.5145631
censored     0.7572816

               Mean   Standard Deviation
Treatment  28.79245             14.87688
Control    24.66000             15.13559
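No plot was shown above; a minimal sketch of one reasonable choice (side-by-side histograms of recurrence time, with labels of my choosing) is:

par(mfrow = c(1, 2))
hist(treat[,1], xlab = "time to recurrence", main = "Treatment")
hist(control[,1], xlab = "time to recurrence", main = "Control")
par(mfrow = c(1, 1))

Boxplots of recurrence time by treatment group would serve equally well; either display should ideally also distinguish the censored observations.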

b)
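Both full conditionals below follow from the joint posterior. Modeling the control times as exponential(\(\theta\)) and the treatment (hormone) times as exponential(\(\tau\theta\)), with \(\delta_i = 1\) marking uncensored observations and the prior structure implied by the conditionals (a reconstruction consistent with the sampler in part c, filling in the intermediate step):

\[ f\left(\theta, \tau \mid \mathbf{x}\right) \propto \theta^{\sum \delta_{i}^{C} + \sum \delta_{i}^{H} + a}\, \tau^{\sum \delta_{i}^{H} + b}\, \exp\left(-\theta \left(\sum x_{i}^{C} + \tau \sum x_{i}^{H} + c + d\tau\right)\right) \]

Viewing this as a function of \(\theta\) alone, with \(\tau\) held fixed: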

\[ f\left(\theta \mid \tau\right) \propto \theta^{\sum \delta_{i}^{C} + \sum \delta_{i}^{H} + a} \exp\left(-\theta \left(\sum x_{i}^{C} + \tau \sum x_{i}^{H} + c + d\tau\right)\right) \]

This implies: \(\theta \mid \tau \sim \text{Gamma}\left(\sum \delta_{i}^{C} + \sum \delta_{i}^{H} + a + 1,\ \sum x_{i}^{C} + \tau \sum x_{i}^{H} + c + d\tau\right)\)

Similarly, viewing the joint posterior as a function of \(\tau\) alone:

\[ f\left(\tau \mid \theta\right) \propto \tau^{\sum \delta_{i}^{H} + b} \exp\left(-\tau\, \theta \left(\sum x_{i}^{H} + d\right)\right) \]

This implies: \(\tau \mid \theta \sim \text{Gamma}\left(\sum \delta_{i}^{H} + b + 1,\ \theta \left(\sum x_{i}^{H} + d\right)\right)\)

c)

a = 3; b = 1; c = 60; d = 120
del_c = sum(control$censored == 0)  # number of uncensored times, control
del_h = sum(treat$censored == 0)    # number of uncensored times, treatment
x_c = sum(control[,1])              # total recurrence time, control
x_h = sum(treat[,1])                # total recurrence time, treatment
n = 10000

gibbs.bc = function(n, a, b, c, d){
  theta.g = numeric(n)
  tau.g = numeric(n)
  burn = 1000

  # initial values theta0, tau0
  theta.g[1] = 1/mean(control[,1])
  tau.g[1] = rgamma(n = 1, shape = del_h + b + 1, rate = theta.g[1]*(x_h + d))

  # Gibbs sampler: alternate between the full conditionals derived in (b)
  for(i in 2:n){
    theta.g[i] = rgamma(n = 1, shape = del_c + del_h + a + 1,
                        rate = x_c + tau.g[i-1]*x_h + c + d*tau.g[i-1])
    tau.g[i] = rgamma(n = 1, shape = del_h + b + 1, rate = theta.g[i]*(x_h + d))
  }

  # drop the first 1000 draws as burn-in
  data.frame(theta.g = theta.g[(burn+1):n], tau.g = tau.g[(burn+1):n])
}

bc.g = gibbs.bc(n,a,b,c,d)

d)

sum_stats <- function(x){
  c(mean = mean(x), sd = sd(x), quantile(x,.025), quantile(x,.975))
}

sum_df = data.frame(apply(bc.g,MARGIN = 2, FUN = sum_stats))
         theta.g      tau.g
mean   0.0093867  1.1980951
sd     0.0027343  0.4863336
2.5%   0.0048340  0.5302956
97.5%  0.0154444  2.4255043

e)

f)

Our posterior mean for \(\tau\) was about 1.20, which, in the context of our original assumptions (treatment ~ exponential(\(\tau\theta\)) and control ~ exponential(\(\theta\))), implies an increased recurrence rate under treatment: the rate rises by about 20%, so the mean time until recurrence falls by roughly 1 - 1/1.20 ≈ 17%, which would be troubling. However, the table above shows that the 95% credible interval contains 1, and \(\tau = 1\) corresponds to the two distributions being identical, so the data do not provide strong evidence that the treatment changes the recurrence rate.
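These statements can be checked directly from the posterior draws (a quick sketch; bc.g is the sampler output from part c):

mean(bc.g$tau.g > 1)                 # posterior probability that tau exceeds 1
quantile(bc.g$tau.g, c(.025, .975))  # 95% credible interval for tau, as in the table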

g)

bc.g.half = gibbs.bc(n, a/2, b/2, c/2, d/2)  # halved hyperparameters
bc.g.doub = gibbs.bc(n, 2*a, 2*b, 2*c, 2*d)  # doubled hyperparameters

orig = data.frame(apply(bc.g, MARGIN = 2, FUN = sum_stats))
half = data.frame(apply(bc.g.half, MARGIN = 2, FUN = sum_stats))
doub = data.frame(apply(bc.g.doub, MARGIN = 2, FUN = sum_stats))
Original Hyperparameters

         theta.g      tau.g
mean   0.0093867  1.1980951
sd     0.0027343  0.4863336
2.5%   0.0048340  0.5302956
97.5%  0.0154444  2.4255043

Halved Hyperparameters

         theta.g      tau.g
mean   0.0086937  1.3155498
sd     0.0025939  0.5667751
2.5%   0.0043394  0.5631139
97.5%  0.0144276  2.6708340

Doubled Hyperparameters

         theta.g      tau.g
mean   0.0102764  1.0727894
sd     0.0027931  0.4109598
2.5%   0.0056248  0.4891444
97.5%  0.0165017  2.0774076

Based on the tables above, adjusting the hyperparameters changes the range of the credible interval. However, the interval for \(\tau\) contains 1 under every version of the hyperparameters, so the conclusion that we cannot establish a treatment effect is robust to reasonable changes in the hyperparameters. The posterior means themselves do shift noticeably: as the hyperparameters range from doubled to halved, the posterior mean of \(\tau\) ranges from about 1.07 to about 1.32, indicating that the point estimates are sensitive to the choice of hyperparameters.