Problem: Maximum Likelihood Estimation of the Covariance Matrix for MVN
Original Problem Statement (Exercise 1.13)
$X_{(1)}, X_{(2)}, \dots, X_{(n)}$ are samples from a multivariate normal distribution $N_p(\mu, \Sigma)$. The likelihood function is as follows. Using differentiation, find the maximum likelihood estimates (MLE) for the parameters $\mu$ and $\Sigma$.
Given Likelihood Function:
The problem provides the likelihood function $L(\mu, \Sigma)$ as:
$$ L(\mu, \Sigma) = (2\pi)^{-\frac{np}{2}} |\Sigma|^{-\frac{n}{2}} \mathrm{etr}\left[-\frac{1}{2}\Sigma^{-1} \sum_{j=1}^n (X_{(j)} - \mu)(X_{(j)} - \mu)'\right] $$
where $\mathrm{etr}(A) = \exp(\mathrm{tr}(A))$. Let's use the notation $X_j$ instead of $X_{(j)}$ for simplicity. The likelihood function is:
$$ L(\mu, \Sigma) = (2\pi)^{-\frac{np}{2}} |\Sigma|^{-\frac{n}{2}} \exp\left( -\frac{1}{2} \mathrm{tr}\left( \Sigma^{-1} \sum_{j=1}^n (X_j - \mu)(X_j - \mu)' \right) \right) $$
Objective
Maximize the log-likelihood function with respect to the covariance matrix $\Sigma$ (the MLE of $\mu$ is obtained along the way in Step 3).
Step 1: Write down the Log-Likelihood Function
Assume we have $n$ independent and identically distributed (i.i.d.) samples $X_1, \dots, X_n$ from a $p$-dimensional multivariate normal distribution $N_p(\mu, \Sigma)$.
The log-likelihood function is:
$$ l(\mu, \Sigma) = -\frac{np}{2}\ln(2\pi) - \frac{n}{2}\ln|\Sigma| - \frac{1}{2}\sum_{j=1}^n (X_j - \mu)' \Sigma^{-1} (X_j - \mu) $$
where:
- $n$ is the number of samples
- $p$ is the dimension of the data
- $|\Sigma|$ is the determinant of the covariance matrix $\Sigma$
- $\Sigma^{-1}$ is the inverse of $\Sigma$ (the precision matrix)
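As a sanity check, the sketch below evaluates this log-likelihood numerically and compares it to the sum of per-sample log densities. It is a minimal illustration assuming NumPy and SciPy are available; the data and parameter values are made up for the example.

```python
# A minimal numerical check of the log-likelihood formula in Step 1.
# NumPy/SciPy and all concrete values here are illustrative assumptions.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
n, p = 50, 3
mu = np.array([1.0, -2.0, 0.5])
A = rng.standard_normal((p, p))
Sigma = A @ A.T + p * np.eye(p)   # a valid (positive definite) covariance
X = rng.multivariate_normal(mu, Sigma, size=n)

# l(mu, Sigma) written exactly as in Step 1
diffs = X - mu                    # row j is (X_j - mu)'
quad = np.einsum("ji,ik,jk->", diffs, np.linalg.inv(Sigma), diffs)
loglik = (-0.5 * n * p * np.log(2 * np.pi)
          - 0.5 * n * np.linalg.slogdet(Sigma)[1] - 0.5 * quad)

# Reference: sum of per-sample log densities
ref = multivariate_normal(mu, Sigma).logpdf(X).sum()
assert np.isclose(loglik, ref)
```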
Step 2: Simplify the Summation Term using Trace
Notice that each term in the summation $(X_j - \mu)' \Sigma^{-1} (X_j - \mu)$ is a scalar. For a scalar $a$, we have $a = \mathrm{tr}(a)$. Using this and the cyclic property of the trace ($\mathrm{tr}(ABC) = \mathrm{tr}(CAB)$):
$$ \begin{aligned} \sum_{j=1}^n (X_j - \mu)' \Sigma^{-1} (X_j - \mu) &= \sum_{j=1}^n \mathrm{tr}\left( (X_j - \mu)' \Sigma^{-1} (X_j - \mu) \right) \\ &= \sum_{j=1}^n \mathrm{tr}\left( \Sigma^{-1} (X_j - \mu)(X_j - \mu)' \right) \\ &= \mathrm{tr}\left( \Sigma^{-1} \sum_{j=1}^n (X_j - \mu)(X_j - \mu)' \right) \end{aligned} $$
Define the Scatter Matrix (around $\mu$) as $S_\mu = \sum_{j=1}^n (X_j - \mu)(X_j - \mu)'$.
The log-likelihood function becomes:
$$ l(\mu, \Sigma) = -\frac{np}{2}\ln(2\pi) - \frac{n}{2}\ln|\Sigma| - \frac{1}{2}\mathrm{tr}(\Sigma^{-1} S_\mu) $$
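The trace identity above is easy to verify numerically. The following sketch, assuming NumPy and arbitrary illustrative inputs, checks that the sum of scalar quadratic forms equals the single trace involving $S_\mu$.

```python
# A quick numerical sanity check of the trace identity from Step 2.
# All inputs are illustrative, not taken from the exercise.
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 4
X = rng.standard_normal((n, p))
mu = rng.standard_normal(p)
A = rng.standard_normal((p, p))
Sigma_inv = A @ A.T + np.eye(p)   # any symmetric positive definite matrix

diffs = X - mu
# Left-hand side: sum of scalar quadratic forms
lhs = sum(d @ Sigma_inv @ d for d in diffs)
# Right-hand side: a single trace with the scatter matrix S_mu
S_mu = diffs.T @ diffs            # sum_j (X_j - mu)(X_j - mu)'
rhs = np.trace(Sigma_inv @ S_mu)
assert np.isclose(lhs, rhs)
```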
Step 3: Substitute the Sample Mean
The MLE for $\mu$ is the sample mean $\hat\mu = \bar{X} = \frac{1}{n}\sum_{j=1}^n X_j$: differentiating $l$ with respect to $\mu$ gives $\frac{\partial l}{\partial \mu} = \Sigma^{-1}\sum_{j=1}^n (X_j - \mu) = 0$, which is solved by $\mu = \bar{X}$. Substituting $\bar{X}$ for $\mu$ (i.e., considering the profile likelihood for $\Sigma$), we define the sample scatter matrix:
$$ S = \sum_{j=1}^n (X_j - {\bar{X}})(X_j - {\bar{X}})' $$
The log-likelihood function, now viewed as a function of $\Sigma$ alone, is:
$$ l(\bar{X}, \Sigma) = -\frac{np}{2}\ln(2\pi) - \frac{n}{2}\ln|\Sigma| - \frac{1}{2}\mathrm{tr}(\Sigma^{-1} S) $$
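A short sketch of this step, assuming NumPy: compute $\bar{X}$ and the scatter matrix $S$, and note how $S$ relates to the usual unbiased sample covariance (which divides by $n - 1$).

```python
# Computing the sample mean and scatter matrix S from Step 3.
# The data X are illustrative.
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 3
X = rng.standard_normal((n, p))

X_bar = X.mean(axis=0)            # MLE of mu
centered = X - X_bar
S = centered.T @ centered         # scatter matrix around the sample mean

# S relates to the unbiased sample covariance by S = (n - 1) * cov
assert np.allclose(S, (n - 1) * np.cov(X, rowvar=False))
```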
Step 4: Variable Substitution using the Precision Matrix $W$
To simplify differentiation, let the precision matrix be $W = \Sigma^{-1}$. This implies $\Sigma = W^{-1}$. We substitute terms involving $\Sigma$:
- $$ \Sigma^{-1} = W $$
- $$ \ln|\Sigma| = \ln|W^{-1}| = \ln(|W|^{-1}) = -\ln|W| $$
Substituting these into $l(\bar{X}, \Sigma)$, we obtain a function of $W$, let's call it $g(W)$:
$$ \begin{aligned} g(W) &= l(\bar{X}, W^{-1}) \\ &= -\frac{np}{2}\ln(2\pi) - \frac{n}{2}(-\ln|W|) - \frac{1}{2}\mathrm{tr}(W S) \\ &= -\frac{np}{2}\ln(2\pi) + \frac{n}{2}\ln|W| - \frac{1}{2}\mathrm{tr}(W S) \end{aligned} $$
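To confirm the substitution does not change the objective, the sketch below (assuming NumPy; illustrative inputs) checks that $g(W)$ equals $l(\bar{X}, \Sigma)$ when $\Sigma = W^{-1}$.

```python
# Checking that g(W) from Step 4 equals l(X_bar, Sigma) with Sigma = W^{-1}.
# slogdet is used for a numerically stable log-determinant; inputs are illustrative.
import numpy as np

rng = np.random.default_rng(3)
n, p = 40, 3
X = rng.standard_normal((n, p))
X_bar = X.mean(axis=0)
centered = X - X_bar
S = centered.T @ centered

A = rng.standard_normal((p, p))
W = A @ A.T + np.eye(p)           # a positive definite precision matrix
Sigma = np.linalg.inv(W)

g = (-0.5 * n * p * np.log(2 * np.pi)
     + 0.5 * n * np.linalg.slogdet(W)[1] - 0.5 * np.trace(W @ S))
l = (-0.5 * n * p * np.log(2 * np.pi)
     - 0.5 * n * np.linalg.slogdet(Sigma)[1]
     - 0.5 * np.trace(np.linalg.inv(Sigma) @ S))
assert np.isclose(g, l)
```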
Step 5: Differentiate with respect to $W$
Our goal is to maximize $g(W)$. We compute the derivative of $g(W)$ with respect to the matrix $W$ (using Definition 1.4.2) and set it to the zero matrix.
$$ \begin{aligned} \frac{\partial g(W)}{\partial W} &= \frac{\partial}{\partial W} \left(-\frac{np}{2}\ln(2\pi) + \frac{n}{2}\ln|W| - \frac{1}{2}\mathrm{tr}(W S)\right) \\ &= 0 + \frac{n}{2} \frac{\partial (\ln|W|)}{\partial W} - \frac{1}{2} \frac{\partial (\mathrm{tr}(WS))}{\partial W} \end{aligned} $$
We need two key matrix derivatives. Assuming $W$ is symmetric:
- Derivative of $\ln|W|$, based on standard matrix calculus results (analogous to Properties 1.4.16(2) and 1.4.9):
$$ \frac{\partial \ln|W|}{\partial W} = W^{-1} $$
- Derivative of $\mathrm{tr}(WS)$, based on standard matrix calculus results (analogous to Property 1.4.6):
$$ \frac{\partial (\mathrm{tr}(WS))}{\partial W} = S $$
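Both derivative formulas can be verified by finite differences, perturbing one entry of $W$ at a time. This is an illustrative check assuming NumPy, not part of the original derivation.

```python
# Finite-difference check of the two matrix derivatives used in Step 5,
# treating each entry of W as a free variable; inputs are illustrative.
import numpy as np

rng = np.random.default_rng(4)
p = 3
A = rng.standard_normal((p, p))
W = A @ A.T + np.eye(p)           # symmetric positive definite
B = rng.standard_normal((p, p))
S = B @ B.T                       # a symmetric "scatter" matrix

def num_grad(f, W, eps=1e-6):
    """Entrywise central finite-difference gradient of a scalar function of W."""
    G = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            E = np.zeros_like(W)
            E[i, j] = eps
            G[i, j] = (f(W + E) - f(W - E)) / (2 * eps)
    return G

logdet = lambda M: np.linalg.slogdet(M)[1]
assert np.allclose(num_grad(logdet, W), np.linalg.inv(W), atol=1e-5)
assert np.allclose(num_grad(lambda M: np.trace(M @ S), W), S, atol=1e-5)
```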
Substituting these results back:
$$ \frac{\partial g(W)}{\partial W} = \frac{n}{2} W^{-1} - \frac{1}{2} S $$
Step 6: Set the Derivative to Zero
Setting this derivative equal to the zero matrix gives the stationarity condition:
$$ \frac{n}{2} {\hat W}^{-1} - \frac{1}{2} S = 0 $$
where $\hat W$ denotes the matrix $W$ that satisfies this equation.
$$ \frac{n}{2} {\hat W}^{-1} = \frac{1}{2} S $$
Multiply both sides by $2$:
$$ n {\hat W}^{-1} = S $$
Solve for $\hat W^{-1}$:
$$ {\hat W}^{-1} = \frac{1}{n} S $$
Step 7: Convert Back to $\Sigma$
Recall that $W = \Sigma^{-1}$, so $W^{-1} = (\Sigma^{-1})^{-1} = \Sigma$.
The resulting estimate $\hat W^{-1}$ must be symmetric since $S$ is symmetric.
Therefore, the Maximum Likelihood Estimate (MLE) $\hat\Sigma_{ML}$ is the $\hat W^{-1}$ we just found:
$$ {\hat\Sigma}_{ML} = {\hat W}^{-1} = \frac{1}{n} S $$
Substituting the definition of $S$:
$$ {\hat\Sigma}_{ML} = \frac{1}{n} \sum_{j=1}^n (X_j - {\bar{X}})(X_j - {\bar{X}})' $$
This is the MLE for the covariance matrix $\Sigma$ of a multivariate normal distribution.
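As a final check, the sketch below computes $\hat\Sigma_{ML} = S/n$ directly and confirms it matches NumPy's biased covariance estimator (`np.cov` with `bias=True`, which divides by $n$). The simulated data are illustrative.

```python
# Putting it together: the MLE of Sigma as S / n, checked against NumPy's
# biased covariance estimator. All data here are illustrative.
import numpy as np

rng = np.random.default_rng(5)
n, p = 200, 3
mu_true = np.array([0.0, 1.0, -1.0])
A = rng.standard_normal((p, p))
Sigma_true = A @ A.T + np.eye(p)
X = rng.multivariate_normal(mu_true, Sigma_true, size=n)

X_bar = X.mean(axis=0)
centered = X - X_bar
Sigma_mle = centered.T @ centered / n   # hat Sigma_ML = S / n

# np.cov with bias=True also divides by n, so the two must agree
assert np.allclose(Sigma_mle, np.cov(X, rowvar=False, bias=True))
```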