Problem 1

Obtain formulas for the elements of the gradient and the Hessian of \(l(\mu, \Sigma)\). You will be using these formulas later in your code. Your formulas should not involve the trace function; you need to compute the traces analytically.

The log-likelihood function is,

\[ l(\mathbf{\mu}, \mathbf{\Sigma} \mid x_1, x_2, \ldots, x_n) = -\frac{1}{2}\{np\log(2\pi) + n\log|\mathbf{\Sigma}| + trace(\Sigma^{-1}C)\} \] where \[ C = \sum_{i=1}^{n} (x_i - \mu)(x_i -\mu)^{T} \]
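Since the traces must be computed analytically, note that for symmetric matrices \(trace(\Sigma^{-1}C) = \sum_{i,j}[\Sigma^{-1}]_{ij}C_{ij}\), i.e. an elementwise product and sum. A minimal R sketch of evaluating the log-likelihood this way (assuming x is the \(n \times p\) data matrix whose rows are \(x_i^T\)):

loglik <- function(x, mu, Sigma) {
  # Evaluate l(mu, Sigma); the trace term is an elementwise product-and-sum
  # because solve(Sigma) and C are both symmetric
  n  <- nrow(x); p <- ncol(x)
  xc <- sweep(x, 2, mu)                 # rows are (x_i - mu)^T
  C  <- t(xc) %*% xc
  -0.5 * (n * p * log(2 * pi) + n * log(det(Sigma)) + sum(solve(Sigma) * C))
}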

Gradient

Calculations for \(\mu\)

First differential with respect to \(\mu\)

\[ dl(d\mu) = \sum_{i=1}^{n} (x_i - \mu)^{T}\Sigma^{-1}d\mu \]

Let \(A = \sum_{i=1}^{n} (x_i - \mu)^{T}\Sigma^{-1}\). Then the formula for \(\frac{\partial l}{\partial \mu_i}\) is,

\[ \frac{\partial l}{\partial \mu_i} = A_i \]
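A minimal R sketch of this part of the gradient (assuming x is the \(n \times p\) data matrix, so \(\sum_{i}(x_i - \mu)\) is colSums(x) - n*mu):

grad_mu <- function(x, mu, Sigma) {
  # A = sum_i (x_i - mu)^T Sigma^{-1}; a 1 x p row vector whose i-th entry is dl/dmu_i
  (colSums(x) - nrow(x) * mu) %*% solve(Sigma)
}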

Calculations for \(\Sigma\)

First differential with respect to \(\Sigma\)

\[ dl(d\Sigma) = -\frac{1}{2}\{trace[\Sigma^{-1}(n\Sigma - C)\Sigma^{-1}d\Sigma]\} \]

Let \(B = \Sigma^{-1}(n\Sigma - C)\Sigma^{-1}\). Then,

\[ dl(d\Sigma) = -\frac{1}{2}\{trace[Bd\Sigma]\} \]

The formula for \(\frac{\partial l}{\partial \sigma_{ii}}\) is,

\[ \frac{\partial l}{\partial \sigma_{ii}} = -\frac{1}{2}B_{ii} \]

If \(i \neq j\) then,

\[ \frac{\partial l}{\partial \sigma_{ij}} = -\frac{1}{2}(B_{ij}+B_{ji}) \]

There are \(p\) parameters in \(\mu\) and \(p(p+1)/2\) parameters in \(\Sigma\), since \(\sigma_{ij} = \sigma_{ji}\). Therefore, the gradient of \(l(\mu, \Sigma)\) will be a \((p + p(p+1)/2)\)-element column vector, where the first \(p\) elements are the partial derivatives with respect to \(\mu_i\) for \(i = 1, ..., p\). The following \(p(p+1)/2\) elements are the partial derivatives with respect to \(\sigma_{ij}\) and are ordered row-wise.

\[ \nabla l(\mu, \Sigma) = \begin{bmatrix} A_1 \\ A_2 \\ \vdots \\ A_p \\ -\frac{1}{2}B_{11} \\ -\frac{1}{2}(B_{12}+B_{21}) \\ -\frac{1}{2}B_{22} \\ -\frac{1}{2}(B_{13}+B_{31}) \\ \vdots \\ -\frac{1}{2}B_{pp} \end{bmatrix} \]
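A hedged R sketch of computing \(B\) and packing the gradient in this order (hypothetical helper, reusing the grad_mu() sketch above; x is again assumed to be the \(n \times p\) data matrix):

grad_sigma <- function(x, mu, Sigma) {
  n    <- nrow(x); p <- ncol(x)
  Sinv <- solve(Sigma)
  xc   <- sweep(x, 2, mu)
  C    <- t(xc) %*% xc
  B    <- Sinv %*% (n * Sigma - C) %*% Sinv
  # sigma parameters in the row-wise order sigma_11, sigma_12, sigma_22, sigma_13, ...
  idx  <- which(lower.tri(diag(p), diag = TRUE), arr.ind = TRUE)
  idx  <- idx[order(idx[, 1], idx[, 2]), , drop = FALSE]
  g <- numeric(nrow(idx))
  for (a in seq_len(nrow(idx))) {
    i <- idx[a, 1]; j <- idx[a, 2]
    g[a] <- if (i == j) -0.5 * B[i, i] else -0.5 * (B[i, j] + B[j, i])
  }
  c(grad_mu(x, mu, Sigma), g)     # full gradient: mu-part followed by sigma-part
}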

Hessian and Fisher Information Matrix

Calculations for \(\mu\)

The second differential with respect to \(\mu\) is,

\[ ddl(d\mu,d\mu) = -nd\mu^{T}\Sigma^{-1}d\mu \]

Therefore, the formula for \({\frac{\partial^{2} l}{\partial \mu_i \partial \mu_j}}\) is,

\[ \frac{\partial^{2} l}{\partial \mu_i \partial \mu_j} = -n[\Sigma^{-1}]_{ij} \]

where \([\Sigma^{-1}]_{ij}\) is the \(ij^{th}\) element of the \(p \times p\) matrix \(\Sigma^{-1}\).

The formula for the information matrix \(-E[\frac{\partial^{2} l}{\partial \mu_i \partial \mu_j}]\) is,

\[ -E\left[\frac{\partial^{2} l}{\partial \mu_i \partial \mu_j}\right] = n[\Sigma^{-1}]_{ij} \]
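A small sketch of this block (hypothetical helper; expected = TRUE returns the information-matrix version, which only flips the sign):

mu_mu_block <- function(n, Sigma, expected = FALSE) {
  blk <- -n * solve(Sigma)              # p x p matrix of second partials w.r.t. mu
  if (expected) -blk else blk
}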

Calculations for \(\Sigma\)

The second differential with respect to \(\Sigma\) is,

\[ ddl(d\Sigma, d\Sigma) = -\frac{1}{2}\{trace[(-n\Sigma^{-1} + 2\Sigma^{-1}C\Sigma^{-1})(d\Sigma \Sigma^{-1}d\Sigma)]\} \]

Let \(W = -n\Sigma^{-1} + 2\Sigma^{-1}C\Sigma^{-1}\). Then,

\[ ddl(d\Sigma, d\Sigma) = -\frac{1}{2}\{trace[Wd\Sigma \Sigma^{-1}d\Sigma]\} \]

The formulas for the second partial derivatives with respect to \(\sigma_{ij}\) and \(\sigma_{kl}\) are partitioned into four cases: \(i=j\) and \(k=l\); \(i=j\) and \(k \neq l\); \(i \neq j\) and \(k=l\); and \(i \neq j\) and \(k \neq l\).

Formula for Case 1: \(i=j\) and \(k=l\),

\[ \frac{\partial^{2}l}{\partial\sigma_{ij}\partial\sigma_{kl}} = -\frac{1}{2}W_{li}[\Sigma^{-1}]_{jk} \]

Formula for Case 2: \(i=j\) and \(k \neq l\),

\[ \frac{\partial^{2}l}{\partial\sigma_{ij}\partial\sigma_{kl}} = -\frac{1}{2}(W_{li}[\Sigma^{-1}]_{jk}+W_{ki}[\Sigma^{-1}]_{jl}) \]

Formula for Case 3: \(i \neq j\) and \(k=l\),

\[ \frac{\partial^{2}l}{\partial\sigma_{ij}\partial\sigma_{kl}} = -\frac{1}{2}(W_{li}[\Sigma^{-1}]_{jk}+W_{lj}[\Sigma^{-1}]_{ik}) \]

Formula for Case 4: \(i \neq j\) and \(k \neq l\),

\[ \frac{\partial^{2}l}{\partial\sigma_{ij}\partial\sigma_{kl}} = -\frac{1}{2}(W_{li}[\Sigma^{-1}]_{jk}+W_{lj}[\Sigma^{-1}]_{ik}+W_{ki}[\Sigma^{-1}]_{jl}+W_{kj}[\Sigma^{-1}]_{il}) \]

The negative expected value of the second differential with respect to \(\Sigma\) is,

\[ -E[ddl(d\Sigma,d\Sigma)] = \frac{n}{2}\{trace(\Sigma^{-1}d\Sigma\Sigma^{-1}d\Sigma)\} \]

Therefore, the formulas for the information matrix \(-E\left[\frac{\partial^{2}l}{\partial\sigma_{ij}\partial\sigma_{kl}}\right]\) are also partitioned into four cases, stated below:

Information Matrix Formula for Case 1: \(i=j\) and \(k=l\),

\[ -E\left[\frac{\partial^{2}l}{\partial\sigma_{ij}\partial\sigma_{kl}}\right] = \frac{n}{2}[\Sigma^{-1}]_{li}[\Sigma^{-1}]_{jk} \]

Information Matrix Formula for Case 2: \(i=j\) and \(k \neq l\),

\[ -E\left[\frac{\partial^{2}l}{\partial\sigma_{ij}\partial\sigma_{kl}}\right] = \frac{n}{2}([\Sigma^{-1}]_{li}[\Sigma^{-1}]_{jk}+[\Sigma^{-1}]_{ki}[\Sigma^{-1}]_{jl}) \]

Information Matrix Formula for Case 3: \(i \neq j\) and \(k=l\),

\[ -E\left[\frac{\partial^{2}l}{\partial\sigma_{ij}\partial\sigma_{kl}}\right] = \frac{n}{2}([\Sigma^{-1}]_{li}[\Sigma^{-1}]_{jk}+[\Sigma^{-1}]_{lj}[\Sigma^{-1}]_{ik}) \]

Information Matrix Formula for Case 4: \(i \neq j\) and \(k \neq l\),

\[ -E\left[\frac{\partial^{2}l}{\partial\sigma_{ij}\partial\sigma_{kl}}\right] = \frac{n}{2}([\Sigma^{-1}]_{li}[\Sigma^{-1}]_{jk}+[\Sigma^{-1}]_{lj}[\Sigma^{-1}]_{ik}+[\Sigma^{-1}]_{ki}[\Sigma^{-1}]_{jl}+[\Sigma^{-1}]_{kj}[\Sigma^{-1}]_{il}) \]
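All four case formulas, for the Hessian block above as well as its expectation, are sums of terms of the form \(W_{li}[\Sigma^{-1}]_{jk}\) taken over the distinct orderings of \((i,j)\) and \((k,l)\), so one loop over the row-wise-ordered \(\sigma\) index pairs covers them. A hedged sketch (hypothetical helper; W and \(\Sigma^{-1}\) as defined above):

hess_sigma_block <- function(W, Sinv) {
  p   <- nrow(Sinv)
  # sigma parameters in the row-wise order sigma_11, sigma_12, sigma_22, ...
  idx <- which(lower.tri(diag(p), diag = TRUE), arr.ind = TRUE)
  idx <- idx[order(idx[, 1], idx[, 2]), , drop = FALSE]
  q   <- nrow(idx)                      # p(p+1)/2 parameters
  H   <- matrix(0, q, q)
  for (a in 1:q) {
    for (b in 1:q) {
      i <- idx[a, 1]; j <- idx[a, 2]
      k <- idx[b, 1]; l <- idx[b, 2]
      # sum W_{li}[Sinv]_{jk}-type terms over distinct orderings of (i,j) and (k,l)
      ij <- if (i == j) list(c(i, j)) else list(c(i, j), c(j, i))
      kl <- if (k == l) list(c(k, l)) else list(c(k, l), c(l, k))
      s  <- 0
      for (u in ij) for (v in kl) s <- s + W[v[2], u[1]] * Sinv[u[2], v[1]]
      H[a, b] <- -0.5 * s               # reproduces Cases 1-4 above
    }
  }
  H
}

Calling the same helper with W = -n * solve(Sigma) gives the information-matrix block, since that substitution turns \(-\frac{1}{2}W_{li}[\Sigma^{-1}]_{jk}\) into \(\frac{n}{2}[\Sigma^{-1}]_{li}[\Sigma^{-1}]_{jk}\).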

Calculations for \(\mu , \Sigma\)

The second differential of \(dl(d\mu)\) with respect to \(\Sigma\) is,

\[ ddl(d\mu, d\Sigma) = -\sum_{i=1}^{n} (x_i -\mu)^{T}(\Sigma^{-1}d\Sigma\Sigma^{-1}d\mu) \]

Let \(V = -\sum_{i=1}^{n} (x_i -\mu)^{T}\Sigma^{-1}\), which is a \(1 \times p\) matrix. Then,

\[ ddl(d\mu, d\Sigma) = Vd\Sigma\Sigma^{-1}d\mu \]

Therefore, the mixed second partial derivatives are partitioned into two cases: \(i = j\) and \(i \neq j\).

Case 1: \(i = j\),

\[ \frac{\partial^{2} l}{\partial \mu_k \partial \sigma_{ij}} = V_i[\Sigma^{-1}]_{jk} \]

Case 2: \(i \neq j\),

\[ \frac{\partial^{2} l}{\partial \mu_k \partial \sigma_{ij}} = V_i[\Sigma^{-1}]_{jk} + V_j[\Sigma^{-1}]_{ik} \]
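A sketch of the corresponding \(p \times p(p+1)/2\) block (hypothetical helper; x is the \(n \times p\) data matrix and the \(\sigma\) columns follow the same row-wise ordering as before):

mu_sigma_block <- function(x, mu, Sigma) {
  p    <- ncol(x)
  Sinv <- solve(Sigma)
  V    <- -(colSums(sweep(x, 2, mu)) %*% Sinv)       # 1 x p row vector
  idx  <- which(lower.tri(diag(p), diag = TRUE), arr.ind = TRUE)
  idx  <- idx[order(idx[, 1], idx[, 2]), , drop = FALSE]
  H    <- matrix(0, p, nrow(idx))
  for (b in seq_len(nrow(idx))) {
    i <- idx[b, 1]; j <- idx[b, 2]
    for (k in 1:p) {
      H[k, b] <- V[i] * Sinv[j, k] + if (i != j) V[j] * Sinv[i, k] else 0
    }
  }
  H
}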

Hessian

Given that there are \(p+p(p+1)/2\) parameters, the Hessian will be a \((p+p(p+1)/2) \times (p+p(p+1)/2)\) matrix, with the first \(p\) rows and columns corresponding to \(\mu_1,...,\mu_p\) and the remaining \(p(p+1)/2\) corresponding to the row-wise ordered elements \(\sigma_{11}, \sigma_{12}, \sigma_{22}, \sigma_{13}, ..., \sigma_{pp}\), in the same order used for the gradient.
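A sketch of how the blocks could be assembled into the full Hessian, reusing the hypothetical helpers sketched above (mu_sigma_block() and hess_sigma_block()); the same layout works for the information matrix:

full_hessian <- function(x, mu, Sigma) {
  n <- nrow(x); p <- ncol(x); q <- p * (p + 1) / 2
  Sinv <- solve(Sigma)
  xc   <- sweep(x, 2, mu)
  C    <- t(xc) %*% xc
  W    <- -n * Sinv + 2 * Sinv %*% C %*% Sinv
  H <- matrix(0, p + q, p + q)
  H[1:p, 1:p]                         <- -n * Sinv                  # mu-mu block
  H[1:p, (p + 1):(p + q)]             <- mu_sigma_block(x, mu, Sigma)
  H[(p + 1):(p + q), 1:p]             <- t(H[1:p, (p + 1):(p + q)])
  H[(p + 1):(p + q), (p + 1):(p + q)] <- hess_sigma_block(W, Sinv)  # sigma-sigma block
  H
}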

Generating Data

sqrtm <- function (A) {
  # Obtain the symmetric matrix square root of a matrix A via its eigendecomposition
  a = eigen(A)
  sqm = a$vectors %*% diag(sqrt(a$values)) %*% t(a$vectors)
  sqm = (sqm + t(sqm))/2   # symmetrize to remove numerical asymmetry
  sqm
}
gen <- function(n, p, mu, sig, seed = 534){
  #---- Generate data from a p-variate normal with mean mu and covariance sig
  # mu should be a p by 1 vector
  # sig should be a positive definite p by p matrix
  # seed can be optionally set for the random number generator
  set.seed(seed)
  # generate standard normal draws, then transform to mean mu and covariance sig
  x = matrix(rnorm(n*p), n, p)
  datan = x %*% sqrtm(sig) + matrix(mu, n, p, byrow = TRUE)
  datan
}


sigma <- matrix(c(1,.7,.7,.7,1,.7,.7,.7,1), ncol = 3, nrow = 3, byrow = FALSE)
data <- gen(n = 200, p = 3, mu = c(-1,1,2), sig = sigma, seed = 22013)
Sample of Generated Data (first 20 rows; columns 1-3):
-0.1065871 1.4852948 2.6244473
-1.0153403 2.4358252 3.1963763
-1.1833315 0.6469248 2.3291214
-1.7931938 0.2476836 0.6460097
-1.6639518 0.4307266 0.3277569
0.2748926 2.1806538 3.6140945
-1.9336678 1.7301779 2.8661486
0.5458651 3.4439274 4.5585361
-1.5433673 0.1832414 1.8865077
-0.6046615 2.5108771 2.4969884
0.2591646 0.9780684 1.7008835
0.2962418 3.2566728 3.3285557
-1.0144054 2.1978399 2.2711787
-1.8423674 0.8853358 1.9718061
-0.5343611 1.1103784 1.5536223
-0.8006289 1.4041572 1.9017553
-1.6916421 0.9668761 2.1755740
0.0917476 2.1106963 2.0677713
-1.6369505 -0.1249094 1.7895591
0.5598737 1.5261198 2.3860763
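As a quick sanity check of the generator, the sample mean and covariance of data should be close to mu = c(-1, 1, 2) and sigma for n = 200:

round(colMeans(data), 2)
round(cov(data), 2)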