Covariance Matrix

Given two scalar RVs \(X\) and \(Y\), the covariance is \(\colorbox{fact}{$\operatorname{cov} \left( {X,Y} \right) = \mathbb{E}\left[ {\left( {X - \mathbb{E}X} \right)\left( {Y - \mathbb{E}Y} \right)} \right]$}\). Given an RV vector \(X = \left( {\begin{array}{*{20}{c}}{{X_1}} \\ \vdots \\ {{X_M}}\end{array}} \right)\), define \(\mathbb{E}X = \left( {\begin{array}{*{20}{c}}{\mathbb{E}{X_1}} \\\vdots \\{\mathbb{E}{X_M}}\end{array}} \right)\) (and similarly, the expectation of an RV matrix is taken entry-wise); then the covariance matrix is defined as follows,

\[\begin{split}\begin{align} \Sigma \left( X \right) & = \left( {\begin{array}{*{20}{c}} {\operatorname{cov} \left( {{X_1},{X_1}} \right)}& \cdots &{\operatorname{cov} \left( {{X_1},{X_M}} \right)} \\ \vdots & \ddots & \vdots \\ {\operatorname{cov} \left( {{X_M},{X_1}} \right)}& \cdots &{\operatorname{cov} \left( {{X_M},{X_M}} \right)} \end{array}} \right) \hfill \\ & = \left( {\begin{array}{*{20}{c}} {\mathbb{E}\left[ {\left( {{X_1} - \mathbb{E}{X_1}} \right)\left( {{X_1} - \mathbb{E}{X_1}} \right)} \right]}& \cdots &{\mathbb{E}\left[ {\left( {{X_1} - \mathbb{E}{X_1}} \right)\left( {{X_M} - \mathbb{E}{X_M}} \right)} \right]} \\ \vdots & \ddots & \vdots \\ {\mathbb{E}\left[ {\left( {{X_M} - \mathbb{E}{X_M}} \right)\left( {{X_1} - \mathbb{E}{X_1}} \right)} \right]}& \cdots &{\mathbb{E}\left[ {\left( {{X_M} - \mathbb{E}{X_M}} \right)\left( {{X_M} - \mathbb{E}{X_M}} \right)} \right]} \end{array}} \right) \hfill \\ &=\colorbox{fact}{$\mathbb{E}\left[ {\left( {X - \mathbb{E}X} \right){{\left( {X - \mathbb{E}X} \right)}^{\text{T}}}} \right]$} \hfill \\ \end{align}\end{split}\]

where the diagonal elements are variances, denoted \(\operatorname{var}\left( X_{i} \right) := \operatorname{cov}\left( X_{i},X_{i} \right),i = 1,\ldots,M\). Note the difference: for two scalar RVs \(X,Y\), \(\operatorname{cov} \left(X,Y\right)\) is a value, but \(\Sigma\begin{pmatrix} X \\ Y \\ \end{pmatrix}\) is a \(2 \times 2\) matrix.

On the other hand, given two RVs \(X,Y\), draw samples \({\mathbf{x}} = \left( {{x_1}, \ldots ,{x_N}} \right)\sim X\) and \({\mathbf{y}} = \left( {{y_1}, \ldots ,{y_N}} \right)\sim Y\); then we define the sample covariance between them as \(\colorbox{fact}{$\operatorname{cov} \left( {{\mathbf{x}},{\mathbf{y}}} \right) = \frac{1}{{N - 1}}{\left( {{\mathbf{x}} - {\mathbf{\bar x}}} \right)^{\text{T}}}\left( {{\mathbf{y}} - {\mathbf{\bar y}}} \right)$}\). Given an RV vector \(X = \left( {\begin{array}{*{20}{c}}{{X_1}} \\\vdots \\{{X_M}}\end{array}} \right)\), we can draw samples \({\mathbf{x}}_1^{\text{T}} = \left( {{x_{1,1}}, \ldots ,{x_{1,N}}} \right)\sim {X_1}, \ldots , {\mathbf{x}}_M^{\text{T}} = \left( {{x_{M,1}}, \ldots ,{x_{M,N}}} \right)\sim {X_M}\), and form a sample matrix \({\mathbf{X}} = \left( {\begin{array}{*{20}{c}}{{\mathbf{x}}_1^{\text{T}}} \\\vdots \\{{\mathbf{x}}_M^{\text{T}}}\end{array}} \right)\). In the machine learning context, the rows of \(\mathbf{X}\) are also referred to as feature vectors (a feature vector \(\mathbf{x}_i\) is a column vector, even though it appears as the row \(\mathbf{x}_i^{\text{T}}\) of the data matrix), because the RVs \(X_1,\ldots,X_M\) are treated as \(M\) random features of a data point; the columns of \(\mathbf{X}\) are called data entries, because they are the actually observed data. Then the sample covariance matrix is defined w.r.t. the feature vectors as

\[\begin{split}\begin{aligned} {\Sigma }\left( {\mathbf{X}} \right) &= \left( {\begin{array}{*{20}{c}} {\operatorname{cov} \left( {{{\mathbf{x}}_1},{{\mathbf{x}}_1}} \right)}& \cdots &{\operatorname{cov} \left( {{{\mathbf{x}}_1},{{\mathbf{x}}_M}} \right)} \\ \vdots & \ddots & \vdots \\ {\operatorname{cov} \left( {{{\mathbf{x}}_M},{{\mathbf{x}}_1}} \right)}& \cdots &{\operatorname{cov} \left( {{{\mathbf{x}}_M},{{\mathbf{x}}_M}} \right)} \end{array}} \right) \hfill \\ &= \frac{1}{{N - 1}}\left( {\begin{array}{*{20}{c}} {{{\left( {{{\mathbf{x}}_1} - \overline {{{\mathbf{x}}_1}} } \right)}^{\text{T}}}\left( {{{\mathbf{x}}_1} - \overline {{{\mathbf{x}}_1}} } \right)}& \cdots &{{{\left( {{{\mathbf{x}}_1} - \overline {{{\mathbf{x}}_1}} } \right)}^{\text{T}}}\left( {{{\mathbf{x}}_M} - \overline {{{\mathbf{x}}_M}} } \right)} \\ \vdots & \ddots & \vdots \\ {{{\left( {{{\mathbf{x}}_M} - \overline {{{\mathbf{x}}_M}} } \right)}^{\text{T}}}\left( {{{\mathbf{x}}_1} - \overline {{{\mathbf{x}}_1}} } \right)}& \cdots &{{{\left( {{{\mathbf{x}}_M} - \overline {{{\mathbf{x}}_M}} } \right)}^{\text{T}}}\left( {{{\mathbf{x}}_M} - \overline {{{\mathbf{x}}_M}} } \right)} \end{array}} \right) \hfill \\ &= \frac{1}{{N - 1}}\left( {{\mathbf{X}} - {\mathbf{\bar X}}} \right){\left( {{\mathbf{X}} - {\mathbf{\bar X}}} \right)^{\text{T}}} \hfill \\ \end{aligned}\end{split}\]

where \({\overline {{{\mathbf{x}}_i}} ^{\text{T}}} = \frac{1}{N}\mathop \sum \limits_{j = 1}^N {x_{i,j}}{1^{\text{T}}} = \left( {\frac{1}{N}\mathop \sum \limits_{j = 1}^N {x_{i,j}}, \ldots ,\frac{1}{N}\mathop \sum \limits_{j = 1}^N {x_{i,j}}} \right)\) (the same mean value repeated \(N\) times) and \({\mathbf{\bar X}} = \left( {\begin{array}{*{20}{c}}{\overline {{{\mathbf{x}}_1}} } \\\vdots \\{\overline {{{\mathbf{x}}_M}} }\end{array}} \right)\), and the diagonal elements are sample variances, denoted \(\operatorname{var}\left( \mathbf{x}_{i} \right) := \operatorname{cov}\left( \mathbf{x}_{i},\mathbf{x}_{i} \right), i=1,\ldots,M\). The sum of the variances, i.e. the trace of the covariance matrix, is called the total variance of \(\mathbf{X}\). In addition, note that \(\operatorname{cov} \left( {{\mathbf{x}},{\mathbf{y}}} \right)\) is a value, while \(\Sigma \begin{pmatrix} \mathbf{x}^{\text{T}} \\ \mathbf{y}^{\text{T}} \\ \end{pmatrix}\) is a \(2 \times 2\) matrix.
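
As a quick numerical check of the sample covariance matrix just defined (a minimal sketch with made-up data), we can build it directly from a feature-by-entry data matrix and compare it with NumPy's `np.cov`, which by default also treats rows as variables (features):

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 3, 5                        # M features, N data entries
X = rng.normal(size=(M, N))        # rows are feature vectors, columns are data entries

# centre each feature vector (row) and apply the 1/(N-1) formula
Xc = X - X.mean(axis=1, keepdims=True)
Sigma = Xc @ Xc.T / (N - 1)

# np.cov treats rows as variables by default, so it computes the same matrix
assert np.allclose(Sigma, np.cov(X))
print(Sigma)
```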

Pearson's correlation is the normalized version of covariance, defined as \(\operatorname{corr}\left( X,Y \right) = \frac{\operatorname{cov}\left( X,Y \right)}{\sqrt{\operatorname{var}\left( X \right)}\sqrt{\operatorname{var}\left( Y \right)}}\) for two scalar RVs \(X,Y\), and \(\operatorname{corr}\left( \mathbf{x},\mathbf{y} \right) = \frac{\operatorname{cov}\left( \mathbf{x},\mathbf{y} \right)}{\sqrt{\operatorname{var}\left( \mathbf{x} \right)}\sqrt{\operatorname{var}\left( \mathbf{y} \right)}}\) for two samples \(\mathbf{x},\mathbf{y}\). The correlation matrix of an RV vector \(X = \begin{pmatrix} X_{1} \\ \vdots \\ X_{M} \\ \end{pmatrix}\) or of a sample matrix \(\mathbf{X} = \begin{pmatrix} \mathbf{x}_{1}^{\rm{T}} \\ \vdots \\ \mathbf{x}_{M}^{\rm{T}} \\ \end{pmatrix}\) is obtained by replacing each \(\operatorname{cov}\left( \cdot , \cdot \right)\) element in Eq.1.5 and Eq.1.6 with the corresponding \(\operatorname{corr}\left( \cdot , \cdot \right)\). Since \(\operatorname{corr}\left( X,X \right) = 1\) and \(\operatorname{corr}\left( \mathbf{x},\mathbf{x} \right) = 1\), the diagonal elements of a correlation matrix are all 1. Letting \(\Sigma_{\text{corr}}\) denote a correlation matrix, we have

\[\begin{split}\Sigma_{\text{corr}} \left( X \right) = \begin{pmatrix} 1 & \cdots & \frac{\operatorname{cov}\left( X_{1},X_{M} \right)}{\sqrt{\operatorname{var}\left( X_{1} \right)}\sqrt{\operatorname{var}\left( X_{M} \right)}} \\ \vdots & \ddots & \vdots \\ \frac{\operatorname{cov}\left( X_{M},X_{1} \right)}{\sqrt{\operatorname{var}\left( X_{M} \right)}\sqrt{\operatorname{var}\left( X_{1} \right)}} & \cdots & 1 \\ \end{pmatrix} = \begin{pmatrix} 1 & \cdots & \mathbb{E}\left[\frac{ X_{1} - \mathbb{E}X_{1} }{\sqrt{\operatorname{var}\left( X_{1} \right)}}\frac{ X_{M} - \mathbb{E}X_{M} }{\sqrt{\operatorname{var}\left( X_{M} \right)}}\right] \\ \vdots & \ddots & \vdots \\ \mathbb{E}\left[\frac{ X_{M} - \mathbb{E}X_{M} }{\sqrt{\operatorname{var}\left( X_{M} \right)}}\frac{ X_{1} - \mathbb{E}X_{1} }{\sqrt{\operatorname{var}\left( X_{1} \right)}}\right] & \cdots & 1 \\ \end{pmatrix}\end{split}\]
\[\begin{split}\Sigma_{\text{corr}}\left( \mathbf{X} \right) = \begin{pmatrix} 1 & \cdots & \frac{\operatorname{cov}\left( \mathbf{x}_{1},\mathbf{x}_{M} \right)}{\sqrt{\operatorname{var}\left( \mathbf{x}_{1} \right)}\sqrt{\operatorname{var}\left( \mathbf{x}_{M} \right)}} \\ \vdots & \ddots & \vdots \\ \frac{\operatorname{cov}\left( \mathbf{x}_{M},\mathbf{x}_{1} \right)}{\sqrt{\operatorname{var}\left( \mathbf{x}_{M} \right)}\sqrt{\operatorname{var}\left( \mathbf{x}_{1} \right)}} & \cdots & 1 \\ \end{pmatrix} = \frac{1}{N - 1}\begin{pmatrix} \frac{\left( \mathbf{x}_{1} - \overline{\mathbf{x}_{1}} \right)^{\rm{T}}}{\sqrt{\operatorname{var}\left( \mathbf{x}_{1} \right)}}\frac{\left( \mathbf{x}_{1} - \overline{\mathbf{x}_{1}} \right)}{\sqrt{\operatorname{var}\left( \mathbf{x}_{1} \right)}} & \cdots & \frac{\left( \mathbf{x}_{1} - \overline{\mathbf{x}_{1}} \right)^{\rm{T}}}{\sqrt{\operatorname{var}\left( \mathbf{x}_{1} \right)}}\frac{\left( \mathbf{x}_{M} - \overline{\mathbf{x}_{M}} \right)}{\sqrt{\operatorname{var}\left( \mathbf{x}_{M} \right)}} \\ \vdots & \ddots & \vdots \\ \frac{\left( \mathbf{x}_{M} - \overline{\mathbf{x}_{M}} \right)^{\rm{T}}}{\sqrt{\operatorname{var}\left( \mathbf{x}_{M} \right)}}\frac{\left( \mathbf{x}_{1} - \overline{\mathbf{x}_{1}} \right)}{\sqrt{\operatorname{var}\left( \mathbf{x}_{1} \right)}} & \cdots & \frac{\left( \mathbf{x}_{M} - \overline{\mathbf{x}_{M}} \right)^{\rm{T}}}{\sqrt{\operatorname{var}\left( \mathbf{x}_{M} \right)}}\frac{\left( \mathbf{x}_{M} - \overline{\mathbf{x}_{M}} \right)}{\sqrt{\operatorname{var}\left( \mathbf{x}_{M} \right)}} \\ \end{pmatrix}\end{split}\]

Given an RV \(X\), we can define \(\widetilde{X} = \frac{\left( X - \mathbb{E}X \right)}{\sqrt{\operatorname{var}\left( X \right)}}\) as the standardized RV (or normalized RV) of \(X\), which has zero expectation and unit variance; given a sample \(\mathbf{x}\), we can define \(\widetilde{\mathbf{x}} = \frac{\left( \mathbf{x} - \overline{\mathbf{x}} \right)}{\sqrt{\operatorname{var}\left( \mathbf{x} \right)}}\) as the standardized sample (or normalized sample) of \(\mathbf{x}\), which has zero mean and unit variance. We can thus define the standardized RV vector \(\widetilde{X} = \begin{pmatrix} {\widetilde{X}}_{1} \\ \vdots \\ {\widetilde{X}}_{M} \\ \end{pmatrix}\) and the standardized sample matrix \(\widetilde{\mathbf{X}} = \begin{pmatrix} {\widetilde{\mathbf{x}}}_{1}^{\rm{T}} \\ \vdots \\ {\widetilde{\mathbf{x}}}_{M}^{\rm{T}} \\ \end{pmatrix}\), and therefore \(\Sigma_{\text{corr}}\left( X \right) = \mathbb{E}\left[\widetilde{X}{\widetilde{X}}^{\rm{T}}\right],\ \Sigma_{\text{corr}}\left( \mathbf{X} \right) = \frac{1}{N - 1}\widetilde{\mathbf{X}}{\widetilde{\mathbf{X}}}^{\rm{T}}\), where we see that the correlation matrix is just the covariance matrix applied to the standardized (normalized) RVs or samples.
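
For instance (a minimal sketch with made-up data), standardizing each feature vector with the \(1/(N-1)\) sample variance and forming \(\frac{1}{N - 1}\widetilde{\mathbf{X}}{\widetilde{\mathbf{X}}}^{\rm{T}}\) reproduces NumPy's correlation matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 3, 6
X = rng.normal(size=(M, N))              # rows are feature vectors

# standardize each row: zero mean, unit sample variance (ddof=1)
Xt = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, ddof=1, keepdims=True)
corr = Xt @ Xt.T / (N - 1)

assert np.allclose(corr, np.corrcoef(X))   # np.corrcoef also treats rows as variables
assert np.allclose(np.diag(corr), 1.0)     # diagonal is all ones
```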

The cross-covariance matrix and cross-correlation matrix generalize the covariance matrix and correlation matrix to two RV vectors \(X,Y\) or two sample matrices \(\mathbf{X},\mathbf{Y}\). For example, suppose \(X,Y\) are of lengths \(M_{1},M_{2}\); following Eq.1.5, the cross-covariance matrix of \(X,Y\) is defined as the \(M_{1} \times M_{2}\) matrix (which need not be square),

\[\begin{split}\begin{aligned} \Sigma\left( X,Y \right) &= \begin{pmatrix} \operatorname{cov}\left(X_{1},Y_{1}\right) & \cdots & \operatorname{cov}\left(X_{1},Y_{M_{2}}\right) \\ \vdots & \ddots & \vdots \\ \operatorname{cov}\left(X_{M_{1}},Y_{1}\right) & \cdots & \operatorname{cov}\left(X_{M_{1}},Y_{M_{2}}\right) \\ \end{pmatrix} \\ &= \begin{pmatrix} \mathbb{E}\left[\left( X_{1} - \mathbb{E}X_{1} \right)\left( Y_{1} - \mathbb{E}Y_{1} \right)\right] & \cdots & \mathbb{E}\left[\left( X_{1} - \mathbb{E}X_{1} \right)\left( Y_{M_{2}} - \mathbb{E}Y_{M_{2}} \right)\right] \\ \vdots & \ddots & \vdots \\ \mathbb{E}\left[\left( X_{M_{1}} - \mathbb{E}X_{M_{1}} \right)\left( Y_{1} - \mathbb{E}Y_{1} \right)\right] & \cdots & \mathbb{E}\left[\left( X_{M_{1}} - \mathbb{E}X_{M_{1}} \right)\left( Y_{M_{2}} - \mathbb{E}Y_{M_{2}} \right)\right] \\ \end{pmatrix} \\ &= \colorbox{fact}{$\mathbb{E}\left[\left( X - \mathbb{E}X \right)\left( Y - \mathbb{E}Y \right)^{\rm{T}}\right]$} \end{aligned}\end{split}\]

All the others can be defined in the same way, summarized as \(\colorbox{fact}{$\Sigma\left( \mathbf{X,Y} \right) = \frac{1}{N - 1}\left( \mathbf{X} - \overline{\mathbf{X}} \right)\left( \mathbf{Y} - \overline{\mathbf{Y}} \right)^{\rm{T}}$}\), \(\colorbox{fact}{$\Sigma_{\text{corr}}\left( X,Y \right) = \mathbb{E}\left[\widetilde{X}{\widetilde{Y}}^{\rm{T}}\right]$}\), and \(\colorbox{fact}{$\Sigma_{\text{corr}}\left( \mathbf{X,Y} \right) = \frac{1}{N - 1}\widetilde{\mathbf{X}}{\widetilde{\mathbf{Y}}}^{\rm{T}}$}\).
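
A small numerical sketch (made-up data) of the sample cross-covariance formula, checked against the off-diagonal block of the covariance matrix of the stacked samples:

```python
import numpy as np

rng = np.random.default_rng(2)
M1, M2, N = 2, 3, 7
X = rng.normal(size=(M1, N))   # rows are features of X
Y = rng.normal(size=(M2, N))   # rows are features of Y

Xc = X - X.mean(axis=1, keepdims=True)
Yc = Y - Y.mean(axis=1, keepdims=True)
cross = Xc @ Yc.T / (N - 1)    # M1 x M2 cross-covariance matrix

# the top-right M1 x M2 block of the covariance of the stacked matrix is the same thing
full = np.cov(np.vstack([X, Y]))
assert np.allclose(cross, full[:M1, M1:])
```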

Caution

In machine learning problems, we are often given a data matrix \({\mathbf{X}} = \left( {{{\mathbf{x}}_1}, \ldots ,{{\mathbf{x}}_N}} \right)\) with the columns \({{\mathbf{x}}_1}, \ldots ,{{\mathbf{x}}_N}\) as data entries. The bold lowercase symbol "\(\mathbf{x}\)" very often represents a data entry in the context of machine learning, but in statistics it often represents a feature vector instead (as in the discussion above), and this sometimes causes confusion. It is therefore necessary to understand what "\(\mathbf{x}\)" represents in a given context.

The other possible confusion is about the word "samples": both data entries and feature vectors may be referred to as samples in different contexts. We again note that the sample covariance is always taken w.r.t. the feature vectors, not the data entries, because it studies how the quantities of different features vary with each other. Therefore, the "samples" in Eq.1.6 refer to feature vectors.
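
In NumPy, for example, this convention is made explicit: `np.cov` assumes rows are variables (features) unless told otherwise, so a data matrix stored as entries-by-features has to be flagged (a small sketch with made-up shapes):

```python
import numpy as np

rng = np.random.default_rng(3)
M, N = 4, 10
X_features_by_entries = rng.normal(size=(M, N))   # rows = features (the convention above)
X_entries_by_features = X_features_by_entries.T   # rows = data entries (common ML layout)

S1 = np.cov(X_features_by_entries)                  # default: rows are variables
S2 = np.cov(X_entries_by_features, rowvar=False)    # columns are variables
assert np.allclose(S1, S2)                          # same M x M covariance matrix
```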

  • Property 1-4. Classic representation of covariance by expectation (or mean).

    Using the fact that \(\colorbox{fact}{$\operatorname{cov}\left( X,Y \right) = \mathbb{E}XY - \mathbb{E}X\mathbb{E}Y$}\) for scalar RVs \(X,Y\), the covariance matrix has another form
    \[\begin{split}\operatorname{\Sigma}\left( X \right) = \begin{pmatrix} \mathbb{E}X_{1}^{2} - \mathbb{E}^{2}X_{1} & \cdots & \mathbb{E}\left[X_{1}X_{M}\right] - \mathbb{E}X_{1}\mathbb{E}X_{M} \\ \vdots & \ddots & \vdots \\ \mathbb{E}\left[X_{M}X_{1}\right] - \mathbb{E}X_{M}\mathbb{E}X_{1} & \cdots & \mathbb{E}X_{M}^{2} - \mathbb{E}^{2}X_{M} \\ \end{pmatrix} = \colorbox{rlt}{$\mathbb{E}\mathrm{\lbrack}XX^{\mathrm{T}}\mathrm{\rbrack} - \mathbb{E}X\mathbb{E}^{\mathrm{T}}X$}\end{split}\]

    For two samples \(\mathbf{x}\sim X,\mathbf{y}\sim Y\) where \(\mathbf{x} = \left( x_{1},\ldots,x_{N} \right), \mathbf{y} = \left( y_{1},\ldots,y_{N} \right)\), we have

    \[\left( \mathbf{x} - \overline{\mathbf{x}} \right)^{\mathrm{T}}\left( \mathbf{y} - \overline{\mathbf{y}} \right) = \mathbf{x}^{\mathrm{T}}\mathbf{y} - \mathbf{x}^{\mathrm{T}}\overline{\mathbf{y}} - {\overline{\mathbf{x}}}^{\mathrm{T}}\mathbf{y} + {\overline{\mathbf{x}}}^{\mathrm{T}}\overline{\mathbf{y}}\]

    Let \(\overline{x} = \frac{1}{N}\sum_{i = 1}^{N}{\mathbf{x}\left( i \right)}\) and \(\overline{y} = \frac{1}{N}\sum_{i = 1}^{N}{\mathbf{y}\left( i \right)}\), then

    \[{\mathbf{x}^{\mathrm{T}}\overline{\mathbf{y}} = \sum_{i = 1}^{N}{\overline{y}\,\mathbf{x}\left( i \right)} = \overline{y}\sum_{i = 1}^{N}{\mathbf{x}\left( i \right)} = N\overline{x}\overline{y}}\]
    \[{{\overline{\mathbf{x}}}^{\mathrm{T}}\mathbf{y} = \sum_{i = 1}^{N}{\overline{x}\,\mathbf{y}\left( i \right)} = \overline{x}\sum_{i = 1}^{N}{\mathbf{y}\left( i \right)} = N\overline{x}\overline{y}}\]
    \[{{\overline{\mathbf{x}}}^{\mathrm{T}}\overline{\mathbf{y}} = \sum_{i = 1}^{N}{\overline{x}\overline{y}} = N\overline{x}\overline{y}}\]

    Thus

    \[\left( \mathbf{x} - \overline{\mathbf{x}} \right)^{\mathrm{T}}\left( \mathbf{y} - \overline{\mathbf{y}} \right) = \mathbf{x}^{\mathrm{T}}\mathbf{y} + N\overline{x}\overline{y} - 2N\overline{x}\overline{y} = \mathbf{x}^{\mathrm{T}}\mathbf{y} - N\overline{x}\overline{y} = \mathbf{x}^{\mathrm{T}}\mathbf{y} - {\overline{\mathbf{x}}}^{\mathrm{T}}\overline{\mathbf{y}}\]

    which implies

    \[\begin{split}\Sigma(\mathbf{X}) = \frac{1}{N - 1}\begin{pmatrix} \mathbf{x}_{1}^{\mathrm{T}}\mathbf{x}_{1}{\bf -}{\overline{\mathbf{x}_{1}}}^{\mathrm{T}}\overline{\mathbf{x}_{1}} & \cdots & \mathbf{x}_{1}^{\mathrm{T}}\mathbf{x}_{M}{\bf -}{\overline{\mathbf{x}_{1}}}^{\mathrm{T}}\overline{\mathbf{x}_{M}} \\ \vdots & \ddots & \vdots \\ \mathbf{x}_{M}^{\mathrm{T}}\mathbf{x}_{1}{\bf -}{\overline{\mathbf{x}_{M}}}^{\mathrm{T}}\overline{\mathbf{x}_{1}} & \cdots & \mathbf{x}_{M}^{\mathrm{T}}\mathbf{x}_{M}{\bf -}{\overline{\mathbf{x}_{M}}}^{\mathrm{T}}\overline{\mathbf{x}_{M}} \\ \end{pmatrix} = \colorbox{rlt}{$\frac{1}{N - 1}\left( \mathbf{X}\mathbf{X}^{\mathrm{T}}{\bf -}\overline{\mathbf{X}}{\overline{\mathbf{X}}}^{\mathrm{T}} \right)$}\end{split}\]

    We can verify that the above derivation works directly for the cross-covariance as well, and therefore

    \[\colorbox{result}{$\operatorname{\Sigma}\left( X,Y \right) = \mathbb{E}\mathrm{\lbrack}XY^{\mathrm{T}}\mathrm{\rbrack} - \mathbb{E}X\mathbb{E}^{\rm{T}}Y,\Sigma\left( \mathbf{X,Y} \right) = \frac{1}{N - 1}\left( \mathbf{X}\mathbf{Y}^{\rm{T}}\mathbf{-}\overline{\mathbf{X}}{\overline{\mathbf{Y}}}^{\rm{T}} \right)$}\]
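
    A quick numerical check of the sample identity above (a sketch with made-up data), where \(\overline{\mathbf{X}}\) repeats each row mean \(N\) times:

    ```python
    import numpy as np

    rng = np.random.default_rng(4)
    M, N = 3, 8
    X = rng.normal(size=(M, N))

    Xbar = np.repeat(X.mean(axis=1, keepdims=True), N, axis=1)   # each row is a constant mean
    Sigma = (X @ X.T - Xbar @ Xbar.T) / (N - 1)

    assert np.allclose(Sigma, np.cov(X))
    ```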
  • Property 1-5. Invariance to centralization.

    For any random vector \(X\), we have \(\colorbox{result}{$\Sigma\left( X - \mathbb{E}X \right) = \Sigma(X)$}\), since \(\mathbb{E}\left\lbrack X - \mathbb{E}X \right\rbrack = \mathbf{0}\) and
    \[\Sigma\left( X - \mathbb{E}X \right) = \mathbb{E}{\lbrack{\left( X - \mathbb{E}X - \mathbb{E}\left\lbrack X - \mathbb{E}X \right\rbrack \right)\left( X - \mathbb{E}X - \mathbb{E}\left\lbrack X - \mathbb{E}X \right\rbrack \right)}^{\mathrm{T}}\rbrack} = \mathbb{E}{\lbrack{\left( X - \mathbb{E}X \right)\left( X - \mathbb{E}X \right)}^{\mathrm{T}}\rbrack} = \Sigma(X)\]

    Similarly \(\colorbox{result}{$\Sigma\left( \mathbf{X} - \overline{\mathbf{X}} \right) = \Sigma\left( \mathbf{X} \right)$}\), because \(\mathbf{x} - \overline{\mathbf{x}} - \overline{\mathbf{x} - \overline{\mathbf{x}}} = \mathbf{x} - \overline{\mathbf{x}}\) for any sample \(\mathbf{x}\), and the result follows by applying this to Eq.1.6. For the cross-covariance matrix, we have \(\colorbox{result}{$\Sigma\left( X - \mathbb{E}X,Y - \mathbb{E}Y \right) = \Sigma\left( X,Y \right)$}\) and \(\colorbox{result}{$\Sigma\left( \mathbf{X} - \overline{\mathbf{X}},\mathbf{Y} - \overline{\mathbf{Y}} \right) = \Sigma\left( \mathbf{X,}\mathbf{Y} \right)$}\) for exactly the same reason.
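
    Numerically (a minimal sketch), centering each feature vector leaves the sample covariance matrix unchanged:

    ```python
    import numpy as np

    rng = np.random.default_rng(5)
    X = rng.normal(size=(3, 9))

    X_centered = X - X.mean(axis=1, keepdims=True)   # subtract each row's mean
    assert np.allclose(np.cov(X), np.cov(X_centered))
    ```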

  • Theorem 1-5. Matrix arithmetic of the covariance matrix.

    Given \(X = \left( X_{1},\ldots,X_{n} \right)^{\mathrm{T}}\), \(\operatorname{var}\left( \mathbf{α}^{\rm{T}}X \right) = \operatorname{var}\left( X^{\rm{T}}\mathbf{α} \right) = \mathbf{α}^{\rm{T}}\operatorname{\Sigma}\left( X \right)\mathbf{α}\). Note \(\mathbf{α}^{\rm{T}}X\) is a scalar random variable, and
    \[\mathbb{E}\left( \mathbf{α}^{\mathrm{T}}X \right) = \mathbb{E}\left( X^{\mathrm{T}}\mathbf{α} \right) = \mathbf{α}^{\mathrm{T}}\mathbb{E}X = \left( \mathbb{E}^{\mathrm{T}}X \right)\mathbf{α}\]

    Also, since \(\mathbf{α}^{\mathrm{T}}X\) is a scalar RV, \(\left( \mathbf{α}^{\mathrm{T}}X \right)^{2}=\left( \mathbf{α}^{\mathrm{T}}X \right)\left( \mathbf{α}^{\mathrm{T}}X \right)^{\mathrm{T}} = \mathbf{α}^{\mathrm{T}}XX^{\mathrm{T}}\mathbf{α}\). Recalling \(\operatorname{var} \left( X \right)=\mathbb{E}X^{2} - \mathbb{E}^{2}X\) and using Property 1-4, we have

    \[\operatorname{var} \left( \mathbf{α}^{\mathrm{T}}X \right) = \mathbb{E}\mathrm{\lbrack}\mathbf{α}^{\mathrm{T}}X\left( \mathbf{α}^{\mathrm{T}}X \right)^{\mathrm{T}}\mathrm{\rbrack} - \mathbb{E}\left\lbrack \mathbf{α}^{\mathrm{T}}X \right\rbrack\mathbb{E}^{\mathrm{T}}\left\lbrack \mathbf{α}^{\mathrm{T}}X \right\rbrack = \mathbf{α}^{\mathrm{T}}\mathbb{E}\mathrm{\lbrack}XX^{\mathrm{T}}\mathrm{\rbrack}\mathbf{α} - \mathbf{α}^{\mathrm{T}}\mathbb{E}X\mathbb{E}^{\mathrm{T}}X\mathbf{α}=\mathbf{α}^{\mathrm{T}}\left(\mathbb{E}\mathrm{\lbrack}XX^{\mathrm{T}}\mathrm{\rbrack} - \mathbb{E}X\mathbb{E}^{\mathrm{T}}X\right)\mathbf{α}=\mathbf{α}^{\mathrm{T}}\operatorname{\Sigma}\left( X \right)\mathbf{α}\]

    Similarly, using \(\operatorname{cov} \left( X,Y \right) = \mathbb{E}XY - \mathbb{E}X\mathbb{E}Y\), we have \(\colorbox{theorem}{$\operatorname{cov} \left( \mathbf{α}^{\mathrm{T}}X,\mathbf{β}^{\mathrm{T}}Y \right) = \mathbf{α}^{\mathrm{T}}\operatorname{\Sigma}\left( X,Y \right)\mathbf{β}$}\). Further, if we let \(\mathbf{A} = \left( \mathbf{a}_{1},\ldots,\mathbf{a}_{n} \right)\), then \(\colorbox{theorem}{$\Sigma\left( \mathbf{A}^{\mathrm{T}}X \right) = \mathbf{A}^{\mathrm{T}}\operatorname{\Sigma}\left( X \right)\mathbf{A}$}\), since \(\Sigma\left( \mathbf{a}_{i}^{\mathrm{T}}X,\mathbf{a}_{j}^{\mathrm{T}}X \right) = \mathbf{a}_{i}^{\mathrm{T}}\operatorname{\Sigma}\left( X \right)\mathbf{a}_{j}\); for the same reason, we have \(\Sigma\left( \mathbf{A}^{\rm{T}}X,\mathbf{B}^{\rm{T}}Y \right) = \mathbf{A}^{\rm{T}}\operatorname{\Sigma}\left( X,Y \right)\mathbf{B}\) for cross-covariance.

    On the other hand, given the sample matrix \(\mathbf{X} = \begin{pmatrix} \mathbf{x}_{1}^{\rm{T}} \\ \vdots \\ \mathbf{x}_{M}^{\rm{T}} \\ \end{pmatrix}\), the vector \(\mathbf{X}^{\rm{T}}\mathbf{α} = \sum_{i = 1}^{M}{\mathbf{α}\left( i \right)\mathbf{x}_{i}}\) is a sample (of length \(N\)) of the combined feature \(\mathbf{α}^{\rm{T}}X\), and we have \(\colorbox{theorem}{$\operatorname{var}\left( \mathbf{α}^{\rm{T}}\mathbf{X} \right) = \operatorname{var}\left( \mathbf{X}^{\rm{T}}\mathbf{α} \right) = \mathbf{α}^{\rm{T}}\operatorname{\Sigma}\left( \mathbf{X} \right)\mathbf{α}$}\). First check

    \[\begin{split}\left\{ \begin{matrix} \overline{\mathbf{X}^{\rm{T}}\mathbf{α}} = \overline{\sum_{i = 1}^{M}{\mathbf{α}\left( i \right)\mathbf{x}_{i}}} = \frac{1}{N}\sum_{j = 1}^{N}{\sum_{i = 1}^{M}{\mathbf{α}\left( i \right)\mathbf{x}_{i}(j)}} \\ {\overline{\mathbf{X}}}^{\rm{T}}\mathbf{α} = \sum_{i = 1}^{M}{\mathbf{α}\left( i \right)\overline{\mathbf{x}_{i}}} = \sum_{i = 1}^{M}\left( \mathbf{α}\left( i \right) \times \frac{1}{N}\sum_{j = 1}^{N}{\mathbf{x}_{i}\left( j \right)} \right) = \frac{1}{N}\sum_{i = 1}^{M}\left( \sum_{j = 1}^{N}{\mathbf{α}\left( i \right)\mathbf{x}_{i}\left( j \right)} \right) \\ \end{matrix} \Rightarrow \colorbox{rlt}{$\overline{\mathbf{X}^{\rm{T}}\mathbf{α}} = {\overline{\mathbf{X}}}^{\rm{T}}\mathbf{α}$} \right.\end{split}\]

    Then we have

    \[\operatorname{var} \left( \mathbf{X}^{\rm{T}}\mathbf{α} \right) = \frac{1}{N - 1}\left( \left( \mathbf{X}^{\rm{T}}\mathbf{α} \right)^{\rm{T}}\left( \mathbf{X}^{\rm{T}}\mathbf{α} \right) - {\overline{\mathbf{X}^{\rm{T}}\mathbf{α}}}^{\rm{T}}\,\overline{\mathbf{X}^{\rm{T}}\mathbf{α}} \right) = \frac{1}{N - 1}\left( \mathbf{α}^{\rm{T}}\mathbf{X}\mathbf{X}^{\rm{T}}\mathbf{α} - \mathbf{α}^{\rm{T}}\overline{\mathbf{X}}\,{\overline{\mathbf{X}}}^{\rm{T}}\mathbf{α} \right) = \mathbf{α}^{\rm{T}}\Sigma\left( \mathbf{X} \right)\mathbf{α}\]

    By a similar calculation, \(\operatorname{cov} \left( \mathbf{X}^{\rm{T}}\mathbf{α},\mathbf{X}^{\rm{T}}\mathbf{β} \right) = \mathbf{α}^{\rm{T}}\Sigma\left( \mathbf{X} \right)\mathbf{β}\). For any matrix \(\mathbf{A} = \left( \mathbf{a}_{1},\ldots,\mathbf{a}_{K} \right)\) with \(K\) columns, we then have \(\colorbox{theorem}{$\Sigma(\mathbf{A}^{\rm{T}}\mathbf{X}) = \mathbf{A}^{\rm{T}}\Sigma(\mathbf{X})\mathbf{A}$}\), since \(\Sigma\left( \mathbf{a}_{i}^{\rm{T}}\mathbf{X},\mathbf{a}_{j}^{\rm{T}}\mathbf{X} \right) = \mathbf{a}_{i}^{\rm{T}}\operatorname{\Sigma}\left( \mathbf{X} \right)\mathbf{a}_{j}\); of course we also have \(\Sigma(\mathbf{A}^{\rm{T}}\mathbf{X},\mathbf{B}^{\rm{T}}\mathbf{Y}) = \mathbf{A}^{\rm{T}}\Sigma(\mathbf{X,Y})\mathbf{B}\) for cross-covariance.

    For the same reason, a square cross-covariance matrix is positive-semidefinite.
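
    A numerical sketch (made-up data) of the sample version \(\Sigma(\mathbf{A}^{\rm{T}}\mathbf{X}) = \mathbf{A}^{\rm{T}}\Sigma(\mathbf{X})\mathbf{A}\):

    ```python
    import numpy as np

    rng = np.random.default_rng(6)
    M, N, K = 4, 12, 2
    X = rng.normal(size=(M, N))       # rows are feature vectors
    A = rng.normal(size=(M, K))       # K linear combinations of the M features

    lhs = np.cov(A.T @ X)             # covariance of the K combined features
    rhs = A.T @ np.cov(X) @ A
    assert np.allclose(lhs, rhs)

    # the single-combination case: var(X^T a) = a^T Sigma(X) a
    a = A[:, :1]
    assert np.allclose(np.var(a.T @ X, ddof=1), (a.T @ np.cov(X) @ a).item())
    ```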

  • Theorem 1-6. Positive semi-definiteness of covariance matrix.

    A covariance matrix is clearly symmetric, and moreover it is positive semi-definite, since for any constant vector \(\mathbf{α}\), using Theorem 1-5, we have
    \[\begin{split}\begin{aligned} {{\mathbf{α }}^{\text{T}}}\left( {\Sigma \left( X \right)} \right){\mathbf{α }} &= \Sigma \left( {{{\mathbf{α }}^{\text{T}}}X} \right) \hfill \\ &= \mathbb{E}\left[ {\left( {{{\mathbf{α }}^{\text{T}}}X - \mathbb{E}{{\mathbf{α }}^{\text{T}}}X} \right){{\left( {{{\mathbf{α }}^{\text{T}}}X - \mathbb{E}{{\mathbf{α }}^{\text{T}}}X} \right)}^{\text{T}}}} \right] = \mathbb{E}\left[ {\left( {{{\mathbf{α }}^{\text{T}}}X - {{\mathbf{α }}^{\text{T}}}\mathbb{E}X} \right){{\left( {{{\mathbf{α }}^{\text{T}}}X - {{\mathbf{α }}^{\text{T}}}\mathbb{E}X} \right)}^{\text{T}}}} \right] \hfill \\ &= \mathbb{E}\left[ {{{\mathbf{α }}^{\text{T}}}\left( {X - \mathbb{E}X} \right){{\left( {X - \mathbb{E}X} \right)}^{\text{T}}}{\mathbf{α }}} \right] = \mathbb{E}\left[ {{{\left( {{{\left( {X - \mathbb{E}X} \right)}^{\text{T}}}{\mathbf{α }}} \right)}^{\text{T}}}\left( {{{\left( {X - \mathbb{E}X} \right)}^{\text{T}}}{\mathbf{α }}} \right)} \right] \geqslant 0 \hfill \\ \end{aligned}\end{split}\]

    For the sample covariance matrix, check that (omitting the coefficient)

    \[\mathbf{α}^{\rm{T}}\left( \Sigma\left( \mathbf{X} \right) \right)\mathbf{α} = \operatorname{var}\left( \mathbf{X}^{\rm{T}}\mathbf{α} \right) \propto \left( \mathbf{X}^{\rm{T}}\mathbf{α} - \overline{\mathbf{X}^{\rm{T}}\mathbf{α}} \right)^{\rm{T}}\left( \mathbf{X}^{\rm{T}}\mathbf{α} - \overline{\mathbf{X}^{\rm{T}}\mathbf{α}} \right) \geq 0\]
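
    Numerically, the eigenvalues of a sample covariance matrix are non-negative up to floating-point error (a minimal sketch):

    ```python
    import numpy as np

    rng = np.random.default_rng(7)
    X = rng.normal(size=(5, 20))

    S = np.cov(X)                      # symmetric positive semi-definite
    eigvals = np.linalg.eigvalsh(S)    # eigvalsh: eigenvalues of a symmetric matrix
    assert np.all(eigvals >= -1e-12)

    # any quadratic form is non-negative as well
    a = rng.normal(size=(5, 1))
    assert (a.T @ S @ a).item() >= 0
    ```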
  • Property 1-6. Sample covariance represented by a rank-1 sum.

    Recall that the rows of \(\mathbf{X} = \begin{pmatrix} \mathbf{x}_{1}^{\rm{T}} \\ \vdots \\ \mathbf{x}_{M}^{\rm{T}} \\ \end{pmatrix}\) are the feature vectors, and the covariance matrix in Eq.1.6 is defined w.r.t. the feature vectors. Now write \(\mathbf{X} = \left( \mathbf{𝓍}_{1},\ldots,\mathbf{𝓍}_{N} \right)\) where \(\mathbf{𝓍}_{1},\ldots,\mathbf{𝓍}_{N}\) are the columns of \(\mathbf{X}\), i.e. the data entries; similarly write \(\mathbf{Y} = \left( \mathbf{𝓎}_{1},\ldots,\mathbf{𝓎}_{N} \right)\), and let \(\overline{\mathbf{𝓍}} = \frac{1}{N}\sum_{i = 1}^{N}\mathbf{𝓍}_{i}\) and \(\overline{\mathbf{𝓎}} = \frac{1}{N}\sum_{j = 1}^{N}\mathbf{𝓎}_{j}\) be the mean vectors of the data entries. Then \(\Sigma\left( \mathbf{X} \right)\) or \(\Sigma\left( \mathbf{X,Y} \right)\) can also be represented as a sum of rank-1 terms over the data entries (rather than the feature vectors),
    \[\Sigma\left( \mathbf{X,Y} \right) = \frac{1}{N - 1}\sum_{k = 1}^{N}{\left( \mathbf{𝓍}_{k} - \overline{\mathbf{𝓍}} \right)\left( \mathbf{𝓎}_{k} - \overline{\mathbf{𝓎}} \right)^{\rm{T}}}\]

    Let \(\mathbf{\Sigma} := \Sigma\left( \mathbf{X},\mathbf{Y} \right)\) for convenience. By Eq.1.6, we have (omitting the coefficient)

    \[\mathbf{\Sigma}\left( i,j \right) \propto \left( \mathbf{x}_{i} - \overline{\mathbf{x}_{i}} \right)^{\rm{T}}\left( \mathbf{y}_{j} - \overline{\mathbf{y}_{j}} \right)\]

    Note \(\mathbf{𝓍}_{j}\left( i \right) = x_{i,j} = \mathbf{x}_{i}\left( j \right)\), \(\overline{\mathbf{𝓍}}\left( i \right) = \overline{x_{i}} = \frac{1}{N}\sum_{k = 1}^{N}x_{i,k}\) and \(\overline{\mathbf{𝓎}}\left( j \right) = \overline{y_{j}} = \frac{1}{N}\sum_{k = 1}^{N}y_{j,k}\); then we have

    \[\begin{split}\begin{aligned} & \mathbf{\Sigma}\left( i,j \right) \propto \left( \mathbf{x}_{i} - \overline{\mathbf{x}_{i}} \right)^{\rm{T}}\left( \mathbf{y}_{j} - \overline{\mathbf{y}_{j}} \right) = \sum_{k = 1}^{N}{\left( \mathbf{x}_{i}\left( k \right) - \overline{x_{i}} \right)\left( \mathbf{y}_{j}\left( k \right) - \overline{y_{j}} \right)} \\ &= \sum_{k = 1}^{N}{\left( \mathbf{𝓍}_{k}\left( i \right) - \overline{x_{i}} \right)\left( \mathbf{𝓎}_{k}\left( j \right) - \overline{y_{j}} \right)} = \sum_{k = 1}^{N}{\left( \mathbf{𝓍}_{k}\left( i \right) - \overline{\mathbf{𝓍}}\left( i \right) \right)\left( \mathbf{𝓎}_{k}\left( j \right) - \overline{\mathbf{𝓎}}\left( j \right) \right)} \\ &= \sum_{k = 1}^{N}{\left( \left( \mathbf{𝓍}_{k} - \overline{\mathbf{𝓍}} \right)\left( \mathbf{𝓎}_{k} - \overline{\mathbf{𝓎}} \right)^{\rm{T}} \right)\left( i,j \right)} \end{aligned}\end{split}\]

    The above identity immediately implies Eq.1.7.

    Corollary 1-4. \(\colorbox{result}{$\mathbf{X}\mathbf{X}^{\rm{T}} = \left( N - 1 \right)\Sigma\left( \mathbf{X} \right) + N\overline{\mathbf{𝓍}}{\overline{\mathbf{𝓍}}}^{\rm{T}}$}\). Check that \(\mathbf{X}\mathbf{X}^{\rm{T}} = \begin{pmatrix} \mathbf{x}_{1}^{\rm{T}}\mathbf{x}_{1} & \cdots & \mathbf{x}_{1}^{\rm{T}}\mathbf{x}_{M} \\ \vdots & \ddots & \vdots \\ \mathbf{x}_{M}^{\rm{T}}\mathbf{x}_{1} & \cdots & \mathbf{x}_{M}^{\rm{T}}\mathbf{x}_{M} \\ \end{pmatrix}\), and by the same inference as Eq.1.8 we have \(\mathbf{X}\mathbf{X}^{\rm{T}} = \sum_{k = 1}^{N}{\mathbf{𝓍}_{k}{\mathbf{𝓍}_{k}}^{\rm{T}}}\) (the familiar rank-1 sum decomposition from linear algebra); then, using Eq.1.7, we have

    \[\begin{split}\begin{aligned} \mathbf{\text{XX}}^{\rm{T}} &= \sum_{k = 1}^{N}{\mathbf{𝓍}_{k}{\mathbf{𝓍}_{k}}^{\rm{T}}} \\ &= \sum_{k = 1}^{N}{\left( \mathbf{𝓍}_{k} - \overline{\mathbf{𝓍}} \right)({\mathbf{𝓍}_{k} - \overline{\mathbf{𝓍}})}^{\rm{T}}} + \sum_{k = 1}^{N}\left( \overline{\mathbf{𝓍}}\mathbf{𝓍}_{k}^{\rm{T}} + \mathbf{𝓍}_{k}{\overline{\mathbf{𝓍}}}^{\rm{T}} - \overline{\mathbf{𝓍}}{\overline{\mathbf{𝓍}}}^{\rm{T}} \right) \\ &= \left( N - 1 \right)\Sigma\left( \mathbf{X} \right)\mathbf{+}\overline{\mathbf{𝓍}}\left( \sum_{k = 1}^{N}\mathbf{𝓍}_{k}^{\rm{T}} \right) + \left( \sum_{k = 1}^{N}\mathbf{𝓍}_{k} \right){\overline{\mathbf{𝓍}}}^{\rm{T}} - N\overline{\mathbf{𝓍}}{\overline{\mathbf{𝓍}}}^{\rm{T}} \\ &= \left( N - 1 \right)\Sigma\left( \mathbf{X} \right)\mathbf{+}N\overline{\mathbf{𝓍}}{\overline{\mathbf{𝓍}}}^{\rm{T}} + N\overline{\mathbf{𝓍}}{\overline{\mathbf{𝓍}}}^{\rm{T}} - N\overline{\mathbf{𝓍}}{\overline{\mathbf{𝓍}}}^{\rm{T}} \\ &= \left( N - 1 \right)\Sigma\left( \mathbf{X} \right)\mathbf{+}N\overline{\mathbf{𝓍}} {\overline{\mathbf{𝓍}}}^{\rm{T}} \end{aligned}\end{split}\]
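
    A numerical check (made-up data) of the rank-1 sum representation Eq.1.7 and of Corollary 1-4:

    ```python
    import numpy as np

    rng = np.random.default_rng(8)
    M, N = 3, 10
    X = rng.normal(size=(M, N))        # columns are the data entries
    xbar = X.mean(axis=1, keepdims=True)

    # Eq.1.7 with Y = X: sum of rank-1 terms over the data entries
    S_rank1 = sum(np.outer(X[:, k] - xbar[:, 0], X[:, k] - xbar[:, 0]) for k in range(N)) / (N - 1)
    assert np.allclose(S_rank1, np.cov(X))

    # Corollary 1-4: X X^T = (N-1) Sigma(X) + N xbar xbar^T
    assert np.allclose(X @ X.T, (N - 1) * np.cov(X) + N * (xbar @ xbar.T))
    ```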
  • Theorem 1-7. Block decomposition of covariance matrix.

    Again consider \(\mathbf{X} = \left( \mathbf{𝓍}_{1},\ldots,\mathbf{𝓍}_{N} \right)\) where \(\mathbf{𝓍}_{1},\ldots,\mathbf{𝓍}_{N}\) are data entries, and \(\overline{\mathbf{𝓍}} = \frac{1}{N}\sum_{i = 1}^{N}\mathbf{𝓍}_{i}\). Suppose \(\mathbf{𝓍}_{1},\ldots,\mathbf{𝓍}_{N}\) are categorized into \(K\) non-overlapping groups \(G_{1},\ldots,G_{K}\). Let \(N_{1},\ldots,N_{K}\) be the sizes of the groups, let \(\mathbf{X}_{k}\) be the sample matrix formed by the data entries in \(G_{k}\), and let \({\overline{\mathbf{𝓍}}}^{k} = \frac{1}{N_{k}}\sum_{\mathbf{𝓍} \in G_{k}}^{}\mathbf{𝓍}\); then
    \[\colorbox{theorem}{$\left( N - 1 \right)\Sigma\left( \mathbf{X} \right) = \color{conn1}{\sum_{k = 1}^{K}{\left( N_{k} - 1 \right)\Sigma\left( \mathbf{X}_{k} \right)}} + \color{conn2}{\sum_{k = 1}^{K}{N_{k}\left( {\overline{\mathbf{𝓍}}}^{k} - \overline{\mathbf{𝓍}} \right)\left( {\overline{\mathbf{𝓍}}}^{k} - \overline{\mathbf{𝓍}} \right)^{\rm{T}}}}$}\]

    This is because

    \[\begin{split}\begin{aligned} \left( N - 1 \right)\Sigma\left( \mathbf{X} \right) &= \sum_{j = 1}^{N}{\left( \mathbf{𝓍}_{j} - \overline{\mathbf{𝓍}} \right)\left( \mathbf{𝓍}_{j} - \overline{\mathbf{𝓍}} \right)^{\rm{T}}} \\ &= \sum_{k = 1}^{K}{\sum_{\mathbf{𝓍} \in G_{k}}^{}{\left( \mathbf{𝓍} - \overline{\mathbf{𝓍}} \right)\left( \mathbf{𝓍} - \overline{\mathbf{𝓍}} \right)^{\rm{T}}}} \\ &= \sum_{k = 1}^{K}{\sum_{\mathbf{𝓍} \in G_{k}}^{}{\left( \mathbf{𝓍} - {\overline{\mathbf{𝓍}}}^{k}\mathbf{+}{\overline{\mathbf{𝓍}}}^{k}\mathbf{-}\overline{\mathbf{𝓍}} \right)\left( \mathbf{𝓍} - {\overline{\mathbf{𝓍}}}^{k}\mathbf{+}{\overline{\mathbf{𝓍}}}^{k}\mathbf{-}\overline{\mathbf{𝓍}} \right)^{\rm{T}}}} \\ &= \sum_{k = 1}^{K}{\sum_{\mathbf{𝓍} \in G_{k}}^{}{\left( \mathbf{𝓍} - {\overline{\mathbf{𝓍}}}^{k} \right)\left( \mathbf{𝓍} - {\overline{\mathbf{𝓍}}}^{k} \right)^{\rm{T}}}} + \sum_{k = 1}^{K}{\sum_{\mathbf{𝓍} \in G_{k}}^{}{\left( {\overline{\mathbf{𝓍}}}^{k}\mathbf{-}\overline{\mathbf{𝓍}} \right)\left( {\overline{\mathbf{𝓍}}}^{k}\mathbf{-}\overline{\mathbf{𝓍}} \right)^{\rm{T}}}} + \sum_{k = 1}^{K}{\sum_{\mathbf{𝓍} \in G_{k}}^{}{\left( \mathbf{𝓍} - {\overline{\mathbf{𝓍}}}^{k} \right)\left( {\overline{\mathbf{𝓍}}}^{k}\mathbf{-}\overline{\mathbf{𝓍}} \right)^{\rm{T}}}} + \sum_{k = 1}^{K}{\sum_{\mathbf{𝓍} \in G_{k}}^{}{\left( {\overline{\mathbf{𝓍}}}^{k}\mathbf{-}\overline{\mathbf{𝓍}} \right)\left( \mathbf{𝓍} - {\overline{\mathbf{𝓍}}}^{k} \right)^{\rm{T}}}} \end{aligned}\end{split}\]

    where we have

    \[\sum_{k = 1}^{K}{\sum_{\mathbf{𝓍} \in G_{k}}^{}{\left( \mathbf{𝓍} - {\overline{\mathbf{𝓍}}}^{k} \right)\left( \mathbf{𝓍} - {\overline{\mathbf{𝓍}}}^{k} \right)^{\rm{T}}}} = \color{conn1}{\sum_{k = 1}^{K}{\left( N_{k} - 1 \right)\Sigma\left( \mathbf{X}_{k} \right)}}\]
    \[\sum_{k = 1}^{K}{\sum_{\mathbf{𝓍} \in G_{k}}^{}{\left( {\overline{\mathbf{𝓍}}}^{k}\mathbf{-}\overline{\mathbf{𝓍}} \right)\left( {\overline{\mathbf{𝓍}}}^{k}\mathbf{-}\overline{\mathbf{𝓍}} \right)^{\rm{T}}}} = \color{conn2}{\sum_{k = 1}^{K}{N_{k}\left( {\overline{\mathbf{𝓍}}}^{k} - \overline{\mathbf{𝓍}} \right)\left( {\overline{\mathbf{𝓍}}}^{k} - \overline{\mathbf{𝓍}} \right)^{\rm{T}}}}\]

    The other two summations are zero matrices. For example,

    \[\sum_{k = 1}^{K}{\sum_{\mathbf{𝓍} \in G_{k}}^{}{\left( \mathbf{𝓍} - {\overline{\mathbf{𝓍}}}^{k} \right)\left( {\overline{\mathbf{𝓍}}}^{k}\mathbf{-}\overline{\mathbf{𝓍}} \right)^{\rm{T}}}} = \sum_{k = 1}^{K}\left( \left( \sum_{\mathbf{𝓍} \in G_{k}}^{}\left( \mathbf{𝓍} - {\overline{\mathbf{𝓍}}}^{k} \right) \right)\left( {\overline{\mathbf{𝓍}}}^{k}\mathbf{-}\overline{\mathbf{𝓍}} \right)^{\rm{T}} \right) = \sum_{k = 1}^{K}{\left( N_{k}{\overline{\mathbf{𝓍}}}^{k}\mathbf{-}N_{k}{\overline{\mathbf{𝓍}}}^{k} \right)\left( {\overline{\mathbf{𝓍}}}^{k}\mathbf{-}\overline{\mathbf{𝓍}} \right)^{\rm{T}}} = \mathbf{O}\]

    Now the identity of Eq.1.9 is obvious.
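
    A numerical check of Eq.1.9 (a minimal sketch; groups and data are made up):

    ```python
    import numpy as np

    rng = np.random.default_rng(9)
    M, N, K = 3, 12, 3
    X = rng.normal(size=(M, N))                  # columns are data entries
    groups = np.repeat(np.arange(K), N // K)     # assign 4 entries to each of the K groups
    xbar = X.mean(axis=1, keepdims=True)

    within = np.zeros((M, M))
    between = np.zeros((M, M))
    for k in range(K):
        Xk = X[:, groups == k]                   # data entries in group k
        Nk = Xk.shape[1]
        xbar_k = Xk.mean(axis=1, keepdims=True)
        within += (Nk - 1) * np.cov(Xk)          # within-group term
        between += Nk * (xbar_k - xbar) @ (xbar_k - xbar).T   # between-group term

    assert np.allclose((N - 1) * np.cov(X), within + between)
    ```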