Obtain formulas for the elements of the gradient and the Hessian of l(M,S). You will be using these formulas later in your code. Your formulas should not involve the trace function you need to compute the traces analytically.
The log-likelihood function is,
\[ l(\mathbf{\mu, \Sigma} | x_1, x_2, ..., x_n) = -\frac{1}{2}\{nplog(2\pi) + nlog|\mathbf{\Sigma}| + trace(\Sigma^{-1}C)\} \] where \[ C = \sum_{i=1}^{n} \space [(x_i - \mu)(x_i -\mu)^{T}] \]
First differential with respect to \(\mu\)
\[ dl(d\mu) = \sum_{i=1}^{n} (x_i - \mu)^{T}\Sigma^{-1}d\mu \]
Let \(A = \sum_{i=1}^{n} (x_i - \mu)^{T}\Sigma^{-1}\). Then the formula for \(\frac{\partial l}{\partial \mu_i}\) is,
\[ \frac{\partial l}{\partial \mu_i} = A_i \]
First differential with respect to \(\Sigma\)
\[ dl(d\Sigma) = -\frac{1}{2}\{trace[\Sigma^{-1}(n\Sigma - C)\Sigma^{-1}d\Sigma]\} \]
Let \(B = \Sigma^{-1}(n\Sigma - C)\Sigma^{-1}\). Then,
\[ dl(d\Sigma) = -\frac{1}{2}\{trace[Bd\Sigma]\} \]
The formula for \(\frac{\partial l}{\partial \sigma_{ii}}\) is,
\[ \frac{\partial l}{\partial \sigma_{ii}} = -\frac{1}{2}B_{ii} \]
If \(i \neq j\) then,
\[ \frac{\partial l}{\partial \sigma_{ij}} = -\frac{1}{2}(B_{ij}+B_{ji}) \]
There are p parameters in \(\mu\) and p(p+1)/2 parameters in \(\Sigma\), since \(\sigma_{ij} = \sigma_{ji}\). Therefore, the gradient of \(l(\mu, \Sigma)\) will be a (p + p(p+1)/2) column vector, where the first p elements pertain to the partial derivatives with respect to \(\mu_i\) for \(i = 1, ..., p\). The following p(p+1)/2 elements are the partial derivatives with respect to \(\sigma_{ij}\) and are ordered row-wise.
$ l(, ) = \begin{bmatrix}[c] A_1 \ A_2 \ \ A_p \ -B_{11} \ -(B_{12}+B_{21}) \ -B_{22} \ -(B_{13}+B_{31}) \ \ B_{pp} \end{bmatrix} $
The second differential with respect to \(\mu\) is,
\[ ddl(d\mu,d\mu) = -nd\mu^{T}\Sigma^{-1}d\mu \]
Therefore, the formula for \({\frac{\partial^{2} l}{\partial \mu_i \partial \mu_j}}\) is,
\[ \frac{\partial^{2} l}{\partial \mu_i \partial \mu_j} = -n[\Sigma^{-1}]_{ij} \]
where \([\Sigma^{-1}]_{ij}\) is the \(ij^{th}\) element of the p x p matrix, \(\Sigma^{-1}\).
The formula for the information matrix \(-E[\frac{\partial^{2} l}{\partial \mu_i \partial \mu_j}]\) is,
\[ -E\left[\frac{\partial^{2} l}{\partial \mu_i \partial \mu_j}\right] = n[\Sigma^{-1}]_{ij} \]
The second differential with respect to \(\Sigma\) is,
\[ ddl(d\Sigma, d\Sigma) = -\frac{1}{2}\{trace[(-n\Sigma^{-1} + 2\Sigma^{-1}C\Sigma^{-1})(d\Sigma \Sigma^{-1}d\Sigma)\} \] Let \(W = (-n\Sigma^{-1} + 2\Sigma^{-1}C\Sigma^{-1})\). Then , \[ ddl(d\Sigma, d\Sigma) = -\frac{1}{2}\{trace[Wd\Sigma \Sigma^{-1}d\Sigma\} \] The formulas for the second partial derivatives with respect to \(\sigma_{ij}, \sigma_{kl}\) are partioned into four cases: \(i=j\) and \(k=l\), \(i=j\) and \(k \neq l\), \(i \neq j\) and \(k=l\), and \(i \neq j\) and \(k \neq l\).
Formula for Case 1: \(i=j\) and \(k=l\), \[ \frac{\partial^{2}l}{\partial\sigma_{ij}\sigma_{kl}} = -\frac{1}{2}W_{li}[\Sigma^{-1}]_{jk} \] Formula for Case 2: \(i=j\) and \(k \neq l\), \[ \frac{\partial^{2}l}{\partial\sigma_{ij}\sigma_{kl}} = -\frac{1}{2}(W_{li}[\Sigma^{-1}]_{jk}+W_{ki}[\Sigma^{-1}]_{jl}) \] Formula for Case 3: \(i \neq j\) and \(k=l\), \[ \frac{\partial^{2}l}{\partial\sigma_{ij}\sigma_{kl}} = -\frac{1}{2}(W_{li}[\Sigma^{-1}]_{jk}+W_{lj}[\Sigma^{-1}]_{ik}) \]
Formula for Case 4: \(i \neq j\) and \(k \neq l\). \[ \frac{\partial^{2}l}{\partial\sigma_{ij}\sigma_{kl}} = -\frac{1}{2}(W_{li}[\Sigma^{-1}]_{jk}+W_{lj}[\Sigma^{-1}]_{ik}+W_{ki}[\Sigma^{-1}]_{jl}+W_{kj}[\Sigma^{-1}]_{il}) \] The negative expected value of the second differential with \(\Sigma\), \[ -E[ddl(d\Sigma,d\Sigma)] = \frac{n}{2}\{trace(\Sigma^{-1}d\Sigma\Sigma^{-1}d\Sigma)\} \] Therefore, the formulas for the information matrix \(-E\left[\frac{\partial^{2}l}{\partial\sigma_{ij}\sigma_{kl}}\right]\) are also partioned into four cases stated below:
Information Matrix Formula for Case 1: \(i=j\) and \(k=l\), \[ -E\left[\frac{\partial^{2}l}{\partial\sigma_{ij}\sigma_{kl}}\right] = \frac{n}{2}[\Sigma^{-1}]_{li}[\Sigma^{-1}]_{jk} \] Information Matrix Formula for Case 2: \(i=j\) and \(k \neq l\), \[ -E\left[\frac{\partial^{2}l}{\partial\sigma_{ij}\sigma_{kl}}\right] = \frac{n}{2}([\Sigma^{-1}]_{li}[\Sigma^{-1}]_{jk}+[\Sigma^{-1}]_{ki}[\Sigma^{-1}]_{jl}) \] Information Matrix Formula for Case 3: \(i \neq j\) and \(k=l\), \[ -E\left[\frac{\partial^{2}l}{\partial\sigma_{ij}\sigma_{kl}}\right] = \frac{n}{2}([\Sigma^{-1}]_{li}[\Sigma^{-1}]_{jk}+[\Sigma^{-1}]_{lj}[\Sigma^{-1}]_{ik}) \] Information Matrix Formula for Case 4: \(i \neq j\) and \(k \neq l\). \[ -E\left[\frac{\partial^{2}l}{\partial\sigma_{ij}\sigma_{kl}}\right] = \frac{n}{2}([\Sigma^{-1}]_{li}[\Sigma^{-1}]_{jk}+[\Sigma^{-1}]_{lj}[\Sigma^{-1}]_{ik}+[\Sigma^{-1}]_{ki}[\Sigma^{-1}]_{jl}+[\Sigma^{-1}]_{kj}[\Sigma^{-1}]_{il}) \]
The second differntial of \(dl(d\mu)\) with respect to \(\Sigma\) is, \[ ddl(d\mu, d\Sigma) = -\sum_{i=1}^{n} (x_i -\mu)^{T}(\Sigma^{-1}d\Sigma\Sigma^{-1}d\mu) \] Let \(V = -\sum_{i=1}^{n} (x_i -\mu)^{T}\Sigma^{-1}\), which is a 1 x p matrix. Then, \[ ddl(d\mu, d\Sigma) = Vd\Sigma\Sigma^{-1}d\mu \] Therefore, the mixed second partials of the likelihood are partioned in to two cases: \(i = j\) and \(i \neq j\),
Case 1: \(i = j\) \[ \frac{\partial^{2} l}{\partial \mu_k, \partial \Sigma_{ij}} = V_i[\Sigma^{-1}]_{jk} \] Case 2: \(i \neq j\) \[ \frac{\partial^{2} l}{\partial \mu_k, \partial \Sigma_{ij}} = V_i[\Sigma^{-1}]_{jk} + V_j[\Sigma^{-1}]_{ik} \]
Given that there are p+p(p+1)/2 parameters, the hessian will be a p+p(p+1)/2 x p+p(p+1)/2 matrix, with the first p elements being \(\mu_1,...,\mu_p\) and the remaining p(p+1)/2 parameters being row-wise ordered elements \(\sigma_{11}\),
sqrtm <- function (A) {
# Obtain matrix square root of a matrix A
a = eigen(A)
sqm = a$vectors %*% diag(sqrt(a$values)) %*% t(a$vectors)
sqm = (sqm+t(sqm))/2
}
gen <- function(n,p,mu,sig,seed = 534){
#---- Generate data from a p-variate normal with mean mu and covariance sigma
# mu should be a p by 1 vector
# sigma should be a positive definite p by matrix
# Seed can be optionally set for the random number generator
set.seed(seed)
# generate data from normal mu sigma
x = matrix(rnorm(n*p),n,p)
datan = x %*% sqrtm(sig) + matrix(mu,n,p, byrow = TRUE)
datan
}
sigma <- matrix(c(1,.7,.7,.7,1,.7,.7,.7,1), ncol = 3, nrow = 3 ,byrow = F)
data <- gen(n = 200, p = 3, mu = c(-1,1,2), sig = sigma, seed = 22013)
1 | 2 | 3 |
---|---|---|
-0.1065871 | 1.4852948 | 2.6244473 |
-1.0153403 | 2.4358252 | 3.1963763 |
-1.1833315 | 0.6469248 | 2.3291214 |
-1.7931938 | 0.2476836 | 0.6460097 |
-1.6639518 | 0.4307266 | 0.3277569 |
0.2748926 | 2.1806538 | 3.6140945 |
-1.9336678 | 1.7301779 | 2.8661486 |
0.5458651 | 3.4439274 | 4.5585361 |
-1.5433673 | 0.1832414 | 1.8865077 |
-0.6046615 | 2.5108771 | 2.4969884 |
0.2591646 | 0.9780684 | 1.7008835 |
0.2962418 | 3.2566728 | 3.3285557 |
-1.0144054 | 2.1978399 | 2.2711787 |
-1.8423674 | 0.8853358 | 1.9718061 |
-0.5343611 | 1.1103784 | 1.5536223 |
-0.8006289 | 1.4041572 | 1.9017553 |
-1.6916421 | 0.9668761 | 2.1755740 |
0.0917476 | 2.1106963 | 2.0677713 |
-1.6369505 | -0.1249094 | 1.7895591 |
0.5598737 | 1.5261198 | 2.3860763 |