Problem 1

A. The EM algorithm for this problem was derived in class. Write an R function that applies the EM algorithm obtained, implementing the following specific instructions: Stop your algorithm when either of the following criteria is satisfied:

  • The number of iterations reaches 200
  • The modified relative error is less than tolerr. The tolerance value should be an input to your program

Run your program using the starting value \(\theta^{(0)} = 0.02\) and \(tolerr = 1e-6\). Your printed output should nicely include the following quantities at iteration n = 1,2,…:

  • Iteration number n (2 digits, no decimals)
  • Value of \(\theta^{(n)}\)
  • Value of the modified relative error (use 1 decimal with exponent notation, e.g. 2.0e-07)

The code below implements the EM algorithm derived in class.
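For reference, the per-iteration update the function performs, read directly off its E- and M-steps (here \(x_2\) denotes the expected latent portion of the first cell count, following the variable x2 in the code, and \(y = (1997, 907, 904, 32)\)), is

\[ x_2^{(n)} = 1997\,\frac{\theta^{(n)}/4}{1/2 + \theta^{(n)}/4}, \qquad \theta^{(n+1)} = \frac{x_2^{(n)} + y_4}{x_2^{(n)} + y_2 + y_3 + y_4} \]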

em.gen <- function(theta0, tolerr, rate = FALSE, IT = TRUE){
  y <- c(1997, 907, 904, 32)
  d_sum <- sum(y[2:4])
  maxit <- 200
  values <- matrix(0, ncol = 3, nrow = maxit,
                   dimnames = list(rep("", maxit), c("IT", "Theta(n)", "MRE")))
  condition <- FALSE
  it <- 0
  while(it < maxit && !condition){
    it <- it + 1
    x2 <- 1997 * ((theta0/4) / (.5 + (theta0/4)))       # E-step: expected latent count
    theta1 <- (x2 + y[4]) / (x2 + d_sum)                # M-step: update theta
    mre <- abs(theta1 - theta0) / max(1, abs(theta1))   # modified relative error
    values[it, ] <- c(format(it, digits = 0),
                      formatC(theta1, digits = 12, format = "f"),
                      format(x = mre, scientific = TRUE, digits = 2))  # store formatted output
    if(mre < tolerr){
      condition <- TRUE
    }
    theta0 <- theta1   # iterate
  }
  if(IT){
    print(values[1:it, ], quote = FALSE)   # print the iteration history
  }

  # Estimated rate of convergence C = |e^(n+1)| / |e^(n)|^B for B = 1, 2, 3/2,
  # using the exact solution theta* = 0.03567466 and the last two iterates
  if(rate){
    e1 <- abs(as.numeric(values[it, 2]) - 0.03567466)       # |e^(n+1)|
    e0 <- abs(as.numeric(values[it - 1, 2]) - 0.03567466)   # |e^(n)|
    print(paste('B = 1,', e1 / e0))
    print(paste('B = 2,', e1 / e0^2))
    print(paste('B = 3/2,', e1 / e0^(3/2)))
  }
}
em.gen(.02,10^(-6))
##  IT Theta(n)       MRE    
##  1  0.027793132773 7.8e-03
##  2  0.031742941132 3.9e-03
##  3  0.033721123764 2e-03  
##  4  0.034705946241 9.8e-04
##  5  0.035194771667 4.9e-04
##  6  0.035437045211 2.4e-04
##  7  0.035557033560 1.2e-04
##  8  0.035616437341 5.9e-05
##  9  0.035645841640 2.9e-05
##  10 0.035660395186 1.5e-05
##  11 0.035667598091 7.2e-06
##  12 0.035671162906 3.6e-06
##  13 0.035672927162 1.8e-06
##  14 0.035673800302 8.7e-07

The exact solution to this problem is \(\theta^{*} = (-1657 + \sqrt{3728689})/7680 \approx 0.03567466\). Numerically determine whether the EM algorithm is linearly, superlinearly, or quadratically convergent for this problem.

Note for the formula below:

  • If \(0 < C < 1\) and B = 1, the algorithm converges linearly.

  • If \(0 < C < 1\) and B = 2, the algorithm converges quadratically.

  • If \(C = 0\) and 1 < B < 2, the algorithm converges super-linearly.

\[ \lim_{n\to\infty} \frac{\left|\left| e^{n+1}\right|\right|}{\left|\left| e^{n}\right|\right|^{B}} = C \]

where,

\[ e^{n} = x_n - x^{*} \]

For our calculations, we use the exact value \(\theta^{*}\), the estimate at the final iteration for \(e^{(n+1)}\), and the estimate at the previous iteration for \(e^{(n)}\). At the point of convergence, the estimated rates are:

em.gen(.02, 10^(-6), rate = TRUE, IT = FALSE)
## [1] "B = 1, 0.496121391611294"
## [1] "B = 2, 286305.69713457"
## [1] "B = 3/2, 376.885103033597"

If we increase the number of iterations by tightening the MRE tolerance, we get:

em.gen(.02, 10^(-10), rate = TRUE, IT = FALSE)
## [1] "B = 1, 0.978679503991574"
## [1] "B = 2, 224364856.624001"
## [1] "B = 3/2, 14818.2754257679"

The B = 2 and B = 3/2 ratios grow without bound as the tolerance tightens, while the B = 1 ratio stays between 0 and 1. So the value of B is most likely 1, which implies the algorithm converges linearly.

Problem 2

A. Plot a density histogram of the data given, superimposed with a kernel smoother.

# Read the data (a single column y of 300 observations) from the course website
data <- read.csv("http://mathfaculty.fullerton.edu/mori/Math534/Homework/Chapter%204/ExJ42.txt",
                 header = TRUE)
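The figure itself is not reproduced here. A minimal sketch of the part (A) plot, assuming base R graphics (the number of breaks is an arbitrary choice):

# Density histogram of y with a kernel density estimate superimposed
hist(data$y, freq = FALSE, breaks = 30, xlab = "y",
     main = "Density histogram with kernel smoother")
lines(density(data$y), lwd = 2)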

B. Suppose that \(\mu_1, \mu_2, \mu_3\), and \(\sigma^{2}\) are unknown. Derive the EM algorithm for estimating the mixture parameters \(\alpha\) and \(\beta\) and the parameters \(\mu_1, \mu_2, \mu_3\), and \(\sigma^{2}\).

  1. Provide initial values for \(\alpha, \beta, \mu_1, \mu_2, \mu_3\), and \(\sigma^{2}\).

  2. E step. Calculate \(E^{*}(Z_{ij})\):

\[ E^{*}(Z_{ij}) = \frac{\pi_{j}f(y_{i}|\mu_j,\sigma^{2})}{\sum_{k = 1}^{K}\pi_{k}f(y_{i}|\mu_k,\sigma^{2})} \]

  3. M step. Update the parameter values using the MLEs calculated below:
    • Note: N = 300, K = 3

\[ \hat{\mu}_k = \frac{\sum_{i = 1}^{N} E^{*}(Z_{ik})\,y_i}{\sum_{i = 1}^{N} E^{*}(Z_{ik})} \]

\[ \hat{\sigma}^{2} = \frac{\sum_{i = 1}^{N}\sum_{k = 1}^{K} \left[E^{*}(Z_{ik})(y_i-\mu_k)^2 \right] }{\sum_{i = 1}^{N} \sum_{k = 1}^{K}\left[ E^{*}(Z_{ik})\right]} \]

\[ \hat{\alpha} = \frac{\sum_{i = 1}^{N} E^{*}(Z_{i1})}{N} \]

\[ \hat{\beta} = \frac{\sum_{i = 1}^{N} E^{*}(Z_{i2})}{N} \]

Here \(\pi_1 = \alpha\), \(\pi_2 = \beta\), and \(\pi_3 = 1 - \alpha - \beta\), so the third mixing proportion is obtained as \(1 - \hat{\alpha} - \hat{\beta}\).

  4. Evaluate the MRE to check whether the parameters meet the convergence criterion. If the convergence criterion is not met, use the updated values and return to the E step.

\[ \text{Relative error} \approx \left|\left|\frac{ \theta^{(n+1)} - \theta^{(n)}}{\text{max}(1,\text{abs}(\theta^{(n+1)}))}\right|\right| \]

C. Code the EM algorithm in part (B) as an R function with input variables:

  • Data: \(y = (y_{1},\dots,y_{n})\), an \(n \times 1\) vector of data
  • Parameter initial values: \(\theta = (\alpha, \beta, \mu_1, \mu_2, \mu_3,\sigma^{2})\)
theta0 <- matrix(c(1/3, 1/3, -2, 2, 6, 1), 6, 1)  # initial values read roughly off the part (A) histogram
y <- data$y
em.gauss <- function(y, theta0){
  it <- 0
  condition <- FALSE
  res <- list()
  while(it < 200 && !condition){
    it <- it + 1
    # Current parameter values: mixing proportions, component means, common variance
    p <- c(theta0[1], theta0[2], 1 - theta0[1] - theta0[2])
    mu1 <- theta0[3]; mu2 <- theta0[4]; mu3 <- theta0[5]; s2 <- theta0[6]

    # E-step: posterior probabilities of membership in each component
    f1 <- dnorm(y, mean = mu1, sd = sqrt(s2))
    f2 <- dnorm(y, mean = mu2, sd = sqrt(s2))
    f3 <- dnorm(y, mean = mu3, sd = sqrt(s2))
    N1 <- f1 * p[1]
    N2 <- f2 * p[2]
    N3 <- f3 * p[3]
    D  <- N1 + N2 + N3
    Post1 <- N1 / D   # posterior probability of belonging to group 1
    Post2 <- N2 / D   # posterior probability of belonging to group 2
    Post3 <- N3 / D   # posterior probability of belonging to group 3
    Post <<- cbind(Post1, Post2, Post3)   # kept in the global environment for part (E)

    # M-step: update alpha, beta, the means, and the common variance
    theta1 <- numeric(6)
    theta1[1:2] <- (colSums(Post) / length(y))[1:2]
    theta1[3] <- sum(Post1 * y) / sum(Post1)
    theta1[4] <- sum(Post2 * y) / sum(Post2)
    theta1[5] <- sum(Post3 * y) / sum(Post3)
    theta1[6] <- sum(Post1 * (y - mu1)^2 + Post2 * (y - mu2)^2 +
                       Post3 * (y - mu3)^2) / sum(Post1 + Post2 + Post3)

    # Modified relative error; stop when it drops below 1e-6
    mre <- norm((theta1 - theta0) / pmax(1, abs(theta1)), type = "2")
    if(mre < 10^(-6)){
      condition <- TRUE
    }
    values <- cbind(it, t(theta1), mre)
    dimnames(values) <- list("", c("IT", "Alpha", "Beta", "Mu1", "Mu2", "Mu3", "Sigma^2", "MRE"))
    res[[it]] <- values
    theta0 <- theta1   # iterate
  }
  res
}

a <- em.gauss(y,theta0)
 IT     Alpha      Beta       Mu1      Mu2      Mu3  Sigma^2       MRE
  1 0.1696706 0.4120828 -1.231056 2.142406 5.173012 1.427437 0.7367365
  2 0.1796142 0.3690914 -1.073349 2.089254 5.006117 1.263665 0.2051593
  3 0.1902183 0.3326898 -1.011493 2.004341 4.953239 1.141629 0.1360620
... (iterations 4 through 57 omitted)
 58 0.1808103 0.2954493 -1.099085 1.680808 4.849162 1.005054 0.0000015
 59 0.1808101 0.2954493 -1.099086 1.680807 4.849161 1.005054 0.0000012
 60 0.1808100 0.2954494 -1.099087 1.680806 4.849161 1.005054 0.0000010

D. Again graph a density histogram of the data, and superimpose on the histogram the three-component normal mixture density with the parameters obtained in part (C).

# Component densities at the final (iteration 60) estimates;
# the columns of a[[60]] are IT, Alpha, Beta, Mu1, Mu2, Mu3, Sigma^2, MRE
x1fit <- seq(-4, 4, length = 40)
y1fit <- a[[60]][2] * dnorm(x1fit, mean = a[[60]][4], sd = sqrt(a[[60]][7]))   # alpha * N(mu1, sigma^2)

x2fit <- seq(-2, 6, length = 40)
y2fit <- a[[60]][3] * dnorm(x2fit, mean = a[[60]][5], sd = sqrt(a[[60]][7]))   # beta * N(mu2, sigma^2)

x3fit <- seq(0, 8, length = 40)
y3fit <- (1 - sum(a[[60]][2:3])) * dnorm(x3fit, mean = a[[60]][6], sd = sqrt(a[[60]][7]))   # (1 - alpha - beta) * N(mu3, sigma^2)
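The figure itself is not reproduced here. A minimal sketch of the part (D) plot built from the quantities above (the breaks and line colors are arbitrary choices):

# Density histogram of y with the three weighted component densities overlaid
hist(data$y, freq = FALSE, breaks = 30, xlab = "y",
     main = "Fitted three-component normal mixture")
lines(x1fit, y1fit, col = "red", lwd = 2)
lines(x2fit, y2fit, col = "green", lwd = 2)
lines(x3fit, y3fit, col = "blue", lwd = 2)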

E. For each case, use its posterior probabilities to determine the group to which it is classified. You will compute the probabilities of belonging to group 1, 2, or 3 for each case and then assign the case to the group with the highest posterior probability. Then show your result graphically as follows: draw a graph with the x-axis indicating case number and the y-axis indicating the group numbers 1, 2, and 3. Do you see a pattern, and if so, what does the pattern indicate about the data?

Note: Post was stored in the global environment by em.gauss().

# Assign each case to the group with the highest posterior probability
class <- numeric(length(data$y))
for(i in 1:length(data$y)){
  class[i] <- which.max(Post[i, ])
}
compare <- cbind(1:300, class)
red   <- compare[which(compare[, 2] == 1), 1]   # case numbers classified into group 1
green <- compare[which(compare[, 2] == 2), 1]   # case numbers classified into group 2
blue  <- compare[which(compare[, 2] == 3), 1]   # case numbers classified into group 3
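The classification graph is not reproduced here. A minimal sketch of the part (E) plot using the red, green, and blue case-number vectors above (the colors and plotting symbol are guesses based on the variable names):

# Case number on the x-axis, assigned group (1, 2, or 3) on the y-axis
plot(1:300, class, type = "n", xlab = "Case number", ylab = "Assigned group", yaxt = "n")
axis(2, at = 1:3)
points(red,   rep(1, length(red)),   col = "red",   pch = 19)
points(green, rep(2, length(green)), col = "green", pch = 19)
points(blue,  rep(3, length(blue)),  col = "blue",  pch = 19)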

Notice that the second distribution, the one with mean Mu2, has the most misclassified data points because it overlaps with the two distributions on either side of it. For the first and third distributions, the misclassified points lie on the borders where they overlap with the second distribution. The pattern is that when component means are close together and the components have large mixing proportions, as with the second and third distributions, data points are more likely to be misclassified.