Consider using Gibbs sampling to generate random values from the random vector \((X, Y)\) having a bivariate normal distribution with mean \(\mu\) and covariance matrix \(\Sigma\).
mu = c(1, 2)
sig = matrix(c(1, .5, .5, .4), 2, 2)
x0 = 7
n = 5000
gibbs.binorm <- function(mu, sig, x0, n){
  pairs = matrix(0, 2, n)
  # correlation and regression coefficients for the conditional normals
  r = sig[1, 2] / sqrt(sig[2, 2] * sig[1, 1])
  c.y = r * sqrt(sig[2, 2] / sig[1, 1])
  c.x = r * sqrt(sig[1, 1] / sig[2, 2])
  # draw y0 from Y | X = x0 to complete the starting pair
  y0 = rnorm(n = 1, mean = mu[2] + c.y * (x0 - mu[1]), sd = sqrt((1 - r^2) * sig[2, 2]))
  pairs[, 1] = c(x0, y0)
  for(i in 1:(n - 1)){
    pair.temp = c(0, pairs[2, i])
    # update x from X | Y = y, then y from Y | X = x
    pair.temp[1] = rnorm(n = 1, mean = mu[1] + c.x * (pair.temp[2] - mu[2]), sd = sqrt((1 - r^2) * sig[1, 1]))
    pair.temp[2] = rnorm(n = 1, mean = mu[2] + c.y * (pair.temp[1] - mu[1]), sd = sqrt((1 - r^2) * sig[2, 2]))
    pairs[, i + 1] = pair.temp
  }
  # plot(pairs[1, 1000:n], pairs[2, 1000:n], xlab = "x", ylab = "y")
  data.frame(x = pairs[1, 1000:n], y = pairs[2, 1000:n])  # keep draws after burn-in
}
binorm_res = gibbs.binorm(mu, sig, x0, n)
Yes, it can be seen in the plot that y has a mean of about 2 and that most of the data lie within about 2 standard deviations of that mean (sd = sqrt(0.4) ≈ 0.632). Likewise, the plot indicates that x has a mean of about 1, with most of its data within 2 standard deviations (sd = 1).
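As a quick check of this claim (a sketch that is not part of the original write-up), the empirical moments of the retained draws can be compared with the target values:

colMeans(binorm_res)               # should be close to mu = c(1, 2)
apply(binorm_res, 2, sd)           # should be close to c(1, sqrt(0.4)) = c(1, 0.632)
cor(binorm_res$x, binorm_res$y)    # should be close to 0.5 / sqrt(1 * 0.4) ~ 0.79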
BCdata = read.table("/Users/josegonzalez/Documents/R/breastcancer.dat", header = TRUE)
treat = BCdata[which(BCdata$treatment == 1), ]
control = BCdata[which(BCdata$treatment == 0), ]
treat.uncen = treat[which(treat$censored == 0), ]
control.uncen = control[which(control$censored == 0), ]
prop = data.frame("proportion" = apply(BCdata, 2, mean)[2:3]) # proportion treated and censored
m = c(mean(treat[,1]), mean(control[,1]))   # means of time before recurrence, including censored
sd = c(sd(treat[,1]), sd(control[,1]))      # sd's of time before recurrence, including censored
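The two summary tables that follow can be rendered directly from the objects above; this is a minimal sketch assuming the knitr package is available (the original rendering method is not shown):

library(knitr)
kable(prop, digits = 7)   # proportion treated and proportion censored
summ = data.frame(Mean = m, "Standard Deviation" = sd,
                  row.names = c("Treatment", "Control"), check.names = FALSE)
kable(summ, digits = 5)   # mean and sd of time before recurrence by group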
|           | proportion |
|-----------|------------|
| treatment | 0.5145631  |
| censored  | 0.7572816  |
|           | Mean     | Standard Deviation |
|-----------|----------|--------------------|
| Treatment | 28.79245 | 14.87688           |
| Control   | 24.66000 | 15.13559           |
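The full conditionals below can be read off the joint posterior. Under the model used in this write-up (control times exponential with rate \(\theta\), treatment times exponential with rate \(\tau\theta\), and \(\delta_i = 1\) for uncensored observations), with a joint prior of the form \(f(\theta, \tau) \propto \theta^{a}\tau^{b}e^{-c\theta - d\theta\tau}\) (this prior form is inferred from the conditionals and the sampler code, so treat it as an assumption), the joint posterior is

\[ f\left(\theta, \tau \mid \text{data}\right) \propto \theta^{\sum \delta_{i}^{C} + \sum \delta_{i}^{H} + a}\, \tau^{\sum \delta_{i}^{H} + b} \exp\left(-\theta\left(\sum x_{i}^{C} + \tau \sum x_{i}^{H} + c + d\tau\right)\right) \]

and each full conditional follows by dropping the factors that do not involve the parameter of interest.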
\[ f\left(\theta \mid \tau\right) \propto \theta^{\sum \delta_{i}^{C} + \sum \delta_{i}^{H} + a} \exp\left(-\theta \left(\sum x_{i}^{C} + \tau \sum x_{i}^{H} + c + d\tau\right)\right) \]
This implies: \(\theta \mid \tau \sim \mathrm{gamma}\left(\sum \delta_{i}^{C} + \sum \delta_{i}^{H} + a + 1,\ \sum x_{i}^{C} + \tau \sum x_{i}^{H} + c + d\tau\right)\)
\[ f\left(\tau \mid \theta\right) \propto \tau^{\sum \delta_{i}^{H} + b} \exp\left(-\tau \left(\theta \sum x_{i}^{H} + d\theta\right)\right) \]
This implies: \(\tau \mid \theta \sim \mathrm{gamma}\left(\sum \delta_{i}^{H} + b + 1,\ \theta\left(\sum x_{i}^{H} + d\right)\right)\)
a = 3; b = 1; c = 60; d = 120
del_c = sum(control$censored == 0)  # number of uncensored times, control
del_h = sum(treat$censored == 0)    # number of uncensored times, treatment
x_c = sum(control[,1])              # total recurrence time, control
x_h = sum(treat[,1])                # total recurrence time, treatment
n = 10000
gibbs.bc = function(n, a, b, c, d){
  theta.g = numeric(n)
  tau.g = numeric(n)
  burn = 1000
  # initial values theta0, tau0
  theta.g[1] = 1 / mean(control[,1])
  tau.g[1] = rgamma(n = 1, shape = del_h + b + 1, rate = theta.g[1] * (x_h + d))
  # Gibbs sampler: alternate draws from the full conditionals above
  for(i in 2:n){
    theta.g[i] = rgamma(n = 1, shape = del_c + del_h + a + 1,
                        rate = x_c + tau.g[i-1] * x_h + c + d * tau.g[i-1])
    tau.g[i] = rgamma(n = 1, shape = del_h + b + 1, rate = theta.g[i] * (x_h + d))
  }
  # discard the initial draws as burn-in
  data.frame(theta.g = theta.g[burn:n], tau.g = tau.g[burn:n])
}
bc.g = gibbs.bc(n, a, b, c, d)
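Before summarizing the draws, it is worth a quick look at mixing; a minimal sketch with base graphics (not part of the original code):

par(mfrow = c(1, 2))
plot(bc.g$theta.g, type = "l", xlab = "iteration", ylab = expression(theta))  # trace of theta
plot(bc.g$tau.g, type = "l", xlab = "iteration", ylab = expression(tau))      # trace of tau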
sum_stats <- function(x){
  c(mean = mean(x), sd = sd(x), quantile(x, .025), quantile(x, .975))
}
sum_df = data.frame(apply(bc.g, MARGIN = 2, FUN = sum_stats))
|       | theta.g   | tau.g     |
|-------|-----------|-----------|
| mean  | 0.0093867 | 1.1980951 |
| sd    | 0.0027343 | 0.4863336 |
| 2.5%  | 0.0048340 | 0.5302956 |
| 97.5% | 0.0154444 | 2.4255043 |
Our posterior mean for \(\tau\) was about 1.2, which, given our model assumptions (treatment ~ exponential(\(\tau\theta\)) and control ~ exponential(\(\theta\))), implies a roughly 20% higher recurrence rate in the treatment group; equivalently, the mean time until recurrence would be about 17% shorter (1/1.2 ≈ 0.83), which would be troubling. However, the table above shows that the 95% credible interval for \(\tau\) contains 1, so this result is not conclusive, since \(\tau = 1\) would imply that the two distributions are the same.
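One way to quantify this (a supplementary sketch, not part of the original analysis) is the posterior probability that \(\tau\) exceeds 1, computed directly from the draws:

mean(bc.g$tau.g > 1)   # posterior probability that treatment increases the recurrence rate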
bc.g.half = gibbs.bc(n, a/2, b/2, c/2, d/2)   # halved hyperparameters
bc.g.doub = gibbs.bc(n, 2*a, 2*b, 2*c, 2*d)   # doubled hyperparameters
orig = data.frame(apply(bc.g, MARGIN = 2, FUN = sum_stats))
half = data.frame(apply(bc.g.half, MARGIN = 2, FUN = sum_stats))
doub = data.frame(apply(bc.g.doub, MARGIN = 2, FUN = sum_stats))
Original hyperparameters:

|       | theta.g   | tau.g     |
|-------|-----------|-----------|
| mean  | 0.0093867 | 1.1980951 |
| sd    | 0.0027343 | 0.4863336 |
| 2.5%  | 0.0048340 | 0.5302956 |
| 97.5% | 0.0154444 | 2.4255043 |

Halved hyperparameters:

|       | theta.g   | tau.g     |
|-------|-----------|-----------|
| mean  | 0.0086937 | 1.3155498 |
| sd    | 0.0025939 | 0.5667751 |
| 2.5%  | 0.0043394 | 0.5631139 |
| 97.5% | 0.0144276 | 2.6708340 |

Doubled hyperparameters:

|       | theta.g   | tau.g     |
|-------|-----------|-----------|
| mean  | 0.0102764 | 1.0727894 |
| sd    | 0.0027931 | 0.4109598 |
| 2.5%  | 0.0056248 | 0.4891444 |
| 97.5% | 0.0165017 | 2.0774076 |
Based on the tables above, it is clear that adjusting the hyperparameters changes the width and location of the credible intervals. With respect to \(\tau\), however, the interval contains 1 for every version of the hyperparameters, so reasonable changes to the hyperparameters do not alter the conclusion that there is no clear treatment effect. The posterior means themselves do change noticeably: as the hyperparameters go from halved to doubled, the posterior mean of \(\tau\) moves from about 1.32 down to about 1.07, indicating that the estimates are sensitive to the choice of hyperparameters.
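The \(\tau\) comparison discussed above can be collected into one table from the summary data frames already computed; a short sketch:

tau.compare = rbind(original = orig[, "tau.g"],
                    halved   = half[, "tau.g"],
                    doubled  = doub[, "tau.g"])
colnames(tau.compare) = rownames(orig)   # mean, sd, 2.5%, 97.5%
tau.compare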