Multivariate Gaussian Distribution

An \(m\)-dimensional Multivariate Gaussian distribution is defined by an \(m\)-dimensional mean \(\mathbf{μ}\) and an \(m \times m\) non-singular covariance matrix \(\mathbf{\Sigma}\), denoted by \(\operatorname{Gaussian}\left( \mathbf{μ},\mathbf{\Sigma} \right)\), whose density function can be written as

\[p\left( \mathbf{x} \right) = \frac{1}{\left( 2\pi \right)^{\frac{m}{2}}}\frac{1}{\left| \mathbf{\Sigma} \right|^{\frac{1}{2}}}\exp\left\{ - \frac{1}{2}\left( \mathbf{x} - \mathbf{μ} \right)^{\rm{T}}\mathbf{\Sigma}^{- 1}\left( \mathbf{x} - \mathbf{μ} \right) \right\} = \frac{1}{\left( 2\pi \right)^{\frac{m}{2}}}\frac{1}{\left| \mathbf{\Sigma} \right|^{\frac{1}{2}}}\kappa\left( \mathbf{x},\mathbf{μ} \right) = \frac{1}{\left( 2\pi \right)^{\frac{m}{2}}}\frac{1}{\left| \mathbf{\Sigma} \right|^{\frac{1}{2}}}\exp\left\{ - \frac{1}{2}𝒹_{M}\left( \mathbf{x};\mathbf{μ,}\mathbf{\Sigma} \right) \right\}\]

where \(\left| \mathbf{\Sigma} \right|\) is the determinant of \(\mathbf{\Sigma}\), \(\kappa\) is the Gaussian kernel, and \(𝒹_{M}\left( \mathbf{x};\mathbf{μ,\Sigma} \right) = \left( \mathbf{x} - \mathbf{μ} \right)^{\rm{T}}\mathbf{\Sigma}^{- 1}\left( \mathbf{x} - \mathbf{μ} \right)\) is called the Mahalanobis distance. \(\kappa\) and \(𝒹_{M}\) measure the similarity/distance between an observation \(\mathbf{x}\) and a distribution with mean \(\mathbf{μ}\) and covariance \(\mathbf{\Sigma}\) (not necessarily Gaussian). Note that \(𝒹_{M}\) reduces to the squared Euclidean distance from observation \(\mathbf{x}\) to mean \(\mathbf{μ}\) when the covariance is the identity. Recall the basic fact that a covariance matrix is symmetric and positive semidefinite; in the case of a Gaussian it must be positive definite, because \(|\mathbf{\Sigma}| \neq 0\) (otherwise the Gaussian density of Eq.1.10 would be invalid) and the determinant equals the product of the eigenvalues, so all eigenvalues of \(\mathbf{\Sigma}\) must be positive. Thus \(𝒹_{M}\left( \mathbf{x};\mathbf{μ,\Sigma} \right) \geq 0\), and \(𝒹_{M}\left( \mathbf{x};\mathbf{μ,\Sigma} \right) = 0\) iff \(\mathbf{x} = \mathbf{μ}\). The inverse of the covariance matrix, \(\mathbf{\Sigma}^{- 1}\), is also called the precision matrix, denoted by \(\mathbf{Ⲗ} := \mathbf{\Sigma}^{- 1}\), and the density can be written in terms of the precision matrix as

\[p\left( \mathbf{x} \right) = \frac{\left| \mathbf{Ⲗ} \right|^{\frac{1}{2}}}{\left( 2\pi \right)^{\frac{m}{2}}}\exp\left\{ - \frac{1}{2}\left( \mathbf{x} - \mathbf{μ} \right)^{\rm{T}}\mathbf{Ⲗ}\left( \mathbf{x} - \mathbf{μ} \right) \right\}\]
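
Below is a minimal numerical sketch (assuming NumPy; all variable names and sizes are illustrative) checking that the covariance form and the precision form of the density agree.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 3
mu = rng.normal(size=m)
A = rng.normal(size=(m, m))
Sigma = A @ A.T + m * np.eye(m)        # symmetric positive definite covariance
Lam = np.linalg.inv(Sigma)             # precision matrix

x = rng.normal(size=m)
d = x - mu

# covariance form of the density
p_cov = np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)) \
        / np.sqrt((2 * np.pi) ** m * np.linalg.det(Sigma))
# precision form of the density
p_prec = np.sqrt(np.linalg.det(Lam) / (2 * np.pi) ** m) * np.exp(-0.5 * d @ Lam @ d)

assert np.isclose(p_cov, p_prec)
```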

The limitations of the Gaussian are its quadratic number of parameters, which can be problematic in high-dimensional settings, and its unimodal shape, which cannot represent complicated real-world distributions.

  • We state useful gradient results from matrix calculus, which are frequently used to find analytical solutions of optimization problems. They are applied shortly to derive the maximum likelihood estimators of the multivariate Gaussian.

    Fact 1-5. Affine transformation \(\mathbf{a}^{\rm{T}}\mathbf{x} + b\mathbf{:}\mathbb{R}^{n}\mathbb{\rightarrow R}\) where \(\mathbf{a} \in \mathbb{R}^{n},b\mathbb{\in R}\) has gradient \(\colorbox{fact}{$\nabla\left( \mathbf{a}^{\rm{T}}\mathbf{x} + b \right) = \nabla\left( \mathbf{x}^{\rm{T}}\mathbf{a} + b \right) = \mathbf{a}$}\).

    Fact 1-6. Determinant \(\left| \mathbf{X} \right|:\mathbb{R}^{n \times n}\mathbb{\rightarrow R}\) is a scalar-valued matrix function, and we have \(\colorbox{fact}{$\nabla\log\left| \mathbf{X} \right| = \mathbf{X}^{-T}$}\).

    Fact 1-7. The gradient of \(f\left( \mathbf{X} \right)=\mathbf{u}^{\rm{T}}\mathbf{\text{Xv}}:\mathbb{R}^{m \times n}\mathbb{\rightarrow R}\), where \(\mathbf{u} \in \mathbb{R}^{m},\mathbf{v} \in \mathbb{R}^{n}\) are constant vectors, is \(\colorbox{fact}{$\nabla\mathbf{u}^{\rm{T}}\mathbf{\text{Xv}} \equiv \mathbf{u}\mathbf{v}^{\rm{T}}$}\).

    Fact 1-8. The gradient of \(f\left( \mathbf{x} \right) = \mathbf{x}^{\rm{T}}\mathbf{\text{Ux}}:\mathbf{x} \in \mathbb{R}^{n}\mathbb{\rightarrow R}\), where \(\mathbf{U} \in \mathbb{R}^{n \times n}\) is a constant matrix, is \(\colorbox{fact}{$\nabla\mathbf{x}^{\rm{T}}\mathbf{\text{Ux}} = (\mathbf{U} + \mathbf{U}^{\rm{T}})\mathbf{x}$}\).
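
    As a sanity check of Facts 1-5 to 1-8, the following sketch (assuming NumPy; the finite-difference helper `num_grad` is purely illustrative) compares each analytical gradient against a central-difference approximation.

    ```python
    import numpy as np

    def num_grad(f, X, eps=1e-6):
        """Central-difference gradient of a scalar function of an array X."""
        G = np.zeros_like(X)
        for idx in np.ndindex(X.shape):
            E = np.zeros_like(X)
            E[idx] = eps
            G[idx] = (f(X + E) - f(X - E)) / (2 * eps)
        return G

    rng = np.random.default_rng(1)
    n = 4
    a, x = rng.normal(size=n), rng.normal(size=n)
    u, v = rng.normal(size=n), rng.normal(size=n)
    U = rng.normal(size=(n, n))
    X = rng.normal(size=(n, n))
    P = X @ X.T + n * np.eye(n)          # positive definite, so log|P| is defined

    # Fact 1-5: grad of a^T x + b is a
    assert np.allclose(num_grad(lambda t: a @ t + 2.0, x), a)
    # Fact 1-6: grad of log|X| is X^{-T}
    assert np.allclose(num_grad(lambda T: np.linalg.slogdet(T)[1], P),
                       np.linalg.inv(P).T, atol=1e-4)
    # Fact 1-7: grad of u^T X v is u v^T
    assert np.allclose(num_grad(lambda T: u @ T @ v, X), np.outer(u, v))
    # Fact 1-8: grad of x^T U x is (U + U^T) x
    assert np.allclose(num_grad(lambda t: t @ U @ t, x), (U + U.T) @ x)
    ```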

    Theorem 1-8. Gaussian maximum likelihood estimators.

    Given a data matrix \(\mathbf{X}=\left( \mathbf{x}_{1},\ldots,\mathbf{x}_{N} \right)\), we can view its columns \(\mathbf{x}_{1},\ldots,\mathbf{x}_{N}\) as drawn i.i.d. from a multivariate Gaussian distribution; the log-likelihood is then
    \[L \propto \sum_{i = 1}^{N}{- \frac{1}{2}\log\left| \mathbf{\Sigma} \right| - \frac{1}{2}\left( \mathbf{x}_{i} - \mathbf{μ} \right)^{\rm{T}}\mathbf{\Sigma}^{- 1}\left( \mathbf{x}_{i} - \mathbf{μ} \right)} = \frac{N}{2}\log\left| \mathbf{\Sigma}^{- 1} \right| - \frac{1}{2}\sum_{i = 1}^{N}{\left( \mathbf{x}_{i} - \mathbf{μ} \right)^{\rm{T}}\mathbf{\Sigma}^{- 1}\left( \mathbf{x}_{i} - \mathbf{μ} \right)}\]

    Using Fact 1-6 and Fact 1-7, we have

    \[\frac{\partial L}{\partial\mathbf{\Sigma}^{- 1}} = \frac{N}{2}\mathbf{\Sigma}^{\rm{T}} - \frac{1}{2}\sum_{i = 1}^{N}{\left( \mathbf{x}_{i} - \mathbf{μ} \right)\left( \mathbf{x}_{i} - \mathbf{μ} \right)^{\rm{T}}} = 0 \Rightarrow \mathbf{\Sigma}=\frac{1}{N}\sum_{i = 1}^{N}{\left( \mathbf{x}_{i} - \mathbf{μ} \right)\left( \mathbf{x}_{i} - \mathbf{μ} \right)^{\rm{T}}}\]

    Then for \(\mathbf{μ}\), using Fact 1-5 and Fact 1-8,

    \[\begin{split}\begin{aligned} & L \propto \sum_{i = 1}^{N}{\mathbf{x}_{i}^{\rm{T}}\mathbf{\Sigma}^{- 1}\mathbf{μ} - \frac{1}{2}\mathbf{μ}^{\rm{T}}\mathbf{\Sigma}^{- 1}\mathbf{μ}} \\ & \Rightarrow \frac{\partial L}{\partial\mathbf{μ}} = \sum_{i = 1}^{N}{\left( \mathbf{\Sigma}^{- 1}\mathbf{x}_{i} - \mathbf{\Sigma}^{- 1}\mathbf{μ} \right)} = \mathbf{0} \\ & \Rightarrow \mathbf{\Sigma}^{- 1}\left( \sum_{i = 1}^{N}\mathbf{x}_{i} \right) = N\mathbf{\Sigma}^{- 1}\mathbf{μ} \Rightarrow \colorbox{thm}{$\mathbf{μ}_{\text{ML}}=\frac{\sum_{i = 1}^{N}\mathbf{x}_{i}}{N} = \overline{\mathbf{x}}$} \end{aligned}\end{split}\]

    Plug \(\mathbf{μ}_{\text{ML}}\) back into the expression for \(\mathbf{\Sigma}\) derived above and we have

    \[\mathbf{\Sigma}_{\text{ML}}=\frac{1}{N}\sum_{i = 1}^{N}{\left( \mathbf{x}_{i} - \overline{\mathbf{x}} \right)\left( \mathbf{x}_{i} - \overline{\mathbf{x}} \right)^{\rm{T}}}\]

    By Property 1-6, \(\mathbf{\Sigma}_{\text{ML}}\) equals the biased sample covariance, while \(\frac{N}{N - 1}\mathbf{\Sigma}_{\text{ML}}\) equals the (unbiased) sample covariance matrix.
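
    The following minimal sketch (assuming NumPy; the data are synthetic and illustrative) verifies that the ML estimators derived above coincide with the sample mean and the biased (\(1/N\)) sample covariance.

    ```python
    import numpy as np

    rng = np.random.default_rng(2)
    N, m = 500, 3
    X = rng.normal(size=(m, N))                 # columns x_1, ..., x_N, as in the data matrix above

    mu_ml = X.mean(axis=1)                      # mu_ML = xbar
    centered = X - mu_ml[:, None]
    Sigma_ml = centered @ centered.T / N        # (1/N) sum (x_i - xbar)(x_i - xbar)^T

    assert np.allclose(Sigma_ml, np.cov(X, bias=True))        # biased (1/N) sample covariance
    assert np.allclose(N / (N - 1) * Sigma_ml, np.cov(X))     # unbiased sample covariance
    ```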

  • Treating samples \(\mathbf{x},\mathbf{y}\) drawn from \(\operatorname{Gaussian}\left( \mathbf{μ},\mathbf{\Sigma} \right)\) as two random vectors with \(\mathbb{E}\mathbf{x} = \mathbb{E}\mathbf{y} = \mathbf{μ}\), recall that for a single sample \(\mathbf{\Sigma =}\mathbb{E}\left\lbrack \mathbf{x}\mathbf{x}^{\rm{T}} \right\rbrack-\mathbf{μ}\mathbf{μ}^{\rm{T}}\), i.e. \(\mathbb{E}\left\lbrack \mathbf{x}\mathbf{x}^{\rm{T}} \right\rbrack=\mathbf{μ}\mathbf{μ}^{\rm{T}}\mathbf{+ \Sigma}\), by Property 1-4. When \(\mathbf{x},\mathbf{y}\) are independent, we have \(\mathbb{E}\left\lbrack \mathbf{x}\mathbf{y}^{\rm{T}} \right\rbrack=\mathbf{μ}\mathbf{μ}^{\rm{T}}\).

    Theorem 1-9. Bias of Gaussian MLE.

    Now given i.i.d. RVs \(\mathbf{x}_{1},\ldots,\mathbf{x}_{N}\) with mean \(\mathbf{μ}\) and covariance \(\mathbf{\Sigma}\), note
    \[\begin{split}\mathbb{E}\left\lbrack \mathbf{x}_{i}\mathbf{x}_{j}^{\rm{T}} \right\rbrack = \left\{ \begin{matrix} \mathbf{μ}\mathbf{μ}^{\rm{T}} & i \neq j \\ \mathbf{μ}\mathbf{μ}^{\rm{T}}\mathbf{+}\mathbf{\Sigma} & i = j \\ \end{matrix} \right.\end{split}\]

    Then we have

    \[\mathbb{E}\left\lbrack \mathbf{x}_{i}{\overline{\mathbf{x}}}^{\rm{T}} \right\rbrack = \frac{1}{N}\sum_{j = 1}^{N}{\mathbb{E}\left\lbrack \mathbf{x}_{i}\mathbf{x}_{j}^{\rm{T}} \right\rbrack} = \frac{1}{N}\left( N\mathbf{μ}\mathbf{μ}^{\rm{T}} + \mathbf{\Sigma} \right) = \mathbf{μ}\mathbf{μ}^{\rm{T}} + \frac{\mathbf{\Sigma}}{N}\mathbb{= E}\left\lbrack \overline{\mathbf{x}}\mathbf{x}_{i}^{\rm{T}} \right\rbrack\]
    \[\mathbb{E}\left\lbrack \overline{\mathbf{x}}{\overline{\mathbf{x}}}^{\rm{T}} \right\rbrack = \frac{1}{N^{2}}\sum_{i = 1}^{N}{\sum_{j = 1}^{N}{\mathbb{E}\left\lbrack \mathbf{x}_{i}\mathbf{x}_{j}^{\rm{T}} \right\rbrack}} = \frac{1}{N^{2}}\left(N^{2}\mathbf{μ}\mathbf{μ}^{\rm{T}} + N\mathbf{\Sigma} \right) = \mathbf{μ}\mathbf{μ}^{\rm{T}} + \frac{\mathbf{\Sigma}}{N}\]

    For Gaussian i.i.d. RVs \(\mathbf{x}_{1},\ldots,\mathbf{x}_{N}\) as in Theorem 1-8, we have

    \[\begin{split}\begin{aligned}\mathbb{E}\left\lbrack \mathbf{\Sigma}_{\text{ML}} \right\rbrack &=\frac{1}{N}\sum_{i = 1}^{N}{\mathbb{E}\left\lbrack \left( \mathbf{x}_{i} - \overline{\mathbf{x}} \right)\left( \mathbf{x}_{i} - \overline{\mathbf{x}} \right)^{\rm{T}} \right\rbrack} = \frac{1}{N}\sum_{i = 1}^{N}{\mathbb{E}\left\lbrack \mathbf{x}_{i}\mathbf{x}_{i}^{\rm{T}} - \mathbf{x}_{i}{\overline{\mathbf{x}}}^{\rm{T}} - \overline{\mathbf{x}}\mathbf{x}_{i}^{\rm{T}} + \overline{\mathbf{x}}{\overline{\mathbf{x}}}^{\rm{T}} \right\rbrack} \\ &= \frac{1}{N}\sum_{i = 1}^{N}\left( \cancel{\mathbf{μ}\mathbf{μ}^{\rm{T}}}\mathbf{+ \Sigma} - 2\left( \cancel{\mathbf{μ}\mathbf{μ}^{\rm{T}}} + \frac{\mathbf{\Sigma}}{N} \right) + \cancel{\mathbf{μ}\mathbf{μ}^{\rm{T}}} + \frac{\mathbf{\Sigma}}{N} \right) = \frac{N - 1}{N}\mathbf{\Sigma}\end{aligned}\end{split}\]

    Thus, \(\mathbb{E}\left\lbrack \mathbf{\Sigma}_{\text{ML}} \right\rbrack = \frac{N - 1}{N}\mathbf{\Sigma} \neq \mathbf{\Sigma}\), so \(\mathbf{\Sigma}_{\text{ML}}\) is a biased estimator. \(\mathbf{μ}_{\text{ML}}\) is trivially an unbiased estimator, since \(\mathbb{E}\left\lbrack \mathbf{μ}_{\text{ML}} \right\rbrack=\mathbb{E}\left\lbrack \overline{\mathbf{x}} \right\rbrack = \mathbf{μ}\).
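
    A small Monte Carlo sketch (assuming NumPy; the trial count and dimensions are arbitrary choices) illustrating that the average of \(\mathbf{\Sigma}_{\text{ML}}\) over many datasets approaches \(\frac{N-1}{N}\mathbf{\Sigma}\).

    ```python
    import numpy as np

    rng = np.random.default_rng(3)
    m, N, trials = 2, 5, 20000
    Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
    L = np.linalg.cholesky(Sigma)
    mu = np.array([1.0, -1.0])

    acc = np.zeros((m, m))
    for _ in range(trials):
        X = mu[:, None] + L @ rng.normal(size=(m, N))   # N i.i.d. Gaussian columns
        xbar = X.mean(axis=1, keepdims=True)
        acc += (X - xbar) @ (X - xbar).T / N            # Sigma_ML for this dataset

    print(acc / trials)                 # approximately (N-1)/N * Sigma
    print((N - 1) / N * Sigma)
    ```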

  • Fact 1-9. Every matrix \(\mathbf{A}\) s.t. \(\operatorname{rank}\mathbf{A} = k\) can be written as the sum of \(k\) rank-1 matrices, i.e. \(\colorbox{fact}{$\mathbf{A} = \sum_{i = 1}^{k}{σ_{i}\mathbf{u}_{i}\mathbf{v}_{i}^{\rm{T}}}$}\), where the \(σ_{i}\) are the singular values and \(\mathbf{u}_{i}\) and \(\mathbf{v}_{i}\) are columns of the two matrices \(\mathbf{U},\mathbf{V}\) from the reduced SVD \(\mathbf{A} = \mathbf{\text{U}\Lambda}\mathbf{V}^{\rm{T}}\); in particular, if \(\mathbf{A}\) is symmetric and positive semidefinite, then the singular values are eigenvalues of \(\mathbf{A}\), and the reduced SVD becomes the reduced eigen-decomposition \(\mathbf{A} = \mathbf{\text{Q}\Lambda}\mathbf{Q}^{\rm{T}}\) where \(\mathbf{Q}\) consists of \(k\) orthonormal eigenvectors \(\mathbf{q}_{1},\ldots,\mathbf{q}_{k}\), and thus \(\colorbox{fact}{$\mathbf{A =}\sum_{i = 1}^{k}{𝜆_{i}\mathbf{q}_{i}\mathbf{q}_{i}^{\rm{T}}}$}\); moreover, if \(\mathbf{A}\) is non-singular, \(\colorbox{fact}{$\mathbf{A}^{- 1}=\sum_{i = 1}^{k}{𝜆_{i}^{- 1}\mathbf{q}_{i}\mathbf{q}_{i}^{\rm{T}}}$}\) because \(\mathbf{A}^{- 1}\) and \(\mathbf{A}\) share the same eigenvectors, with reciprocal eigenvalues.
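
    A numerical sketch (assuming NumPy; the matrix is random and illustrative) of Fact 1-9 for a symmetric positive definite matrix: both \(\mathbf{A}\) and \(\mathbf{A}^{-1}\) are rebuilt from the rank-1 terms \(𝜆_{i}\mathbf{q}_{i}\mathbf{q}_{i}^{\rm{T}}\) and \(𝜆_{i}^{-1}\mathbf{q}_{i}\mathbf{q}_{i}^{\rm{T}}\).

    ```python
    import numpy as np

    rng = np.random.default_rng(4)
    m = 4
    B = rng.normal(size=(m, m))
    A = B @ B.T + m * np.eye(m)          # symmetric positive definite, full rank

    lam, Q = np.linalg.eigh(A)           # columns of Q are orthonormal eigenvectors
    A_rebuilt = sum(lam[i] * np.outer(Q[:, i], Q[:, i]) for i in range(m))
    Ainv_rebuilt = sum(lam[i] ** -1 * np.outer(Q[:, i], Q[:, i]) for i in range(m))

    assert np.allclose(A, A_rebuilt)
    assert np.allclose(np.linalg.inv(A), Ainv_rebuilt)
    ```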

    Property 1-7. Shape of contours of Gaussian density.

    Recall the Mahalanobis distance \(𝒹_{M}\left( \mathbf{x};\mathbf{μ,\Sigma} \right) = \left( \mathbf{x} - \mathbf{μ} \right)^{\rm{T}}\mathbf{\Sigma}^{- 1}\left( \mathbf{x} - \mathbf{μ} \right)\). Let \(𝜆_{1},\ldots,𝜆_{m}\) be the eigenvalues of \(\mathbf{\Sigma}\), then \(\mathbf{\Sigma}^{- 1} = \sum_{i = 1}^{m}{𝜆_{i}^{- 1}\mathbf{q}_{i}\mathbf{q}_{i}^{\rm{T}}}\) as a result of Fact 1-9, and
    \[𝒹_{M}\left( \mathbf{x};\mathbf{μ,\Sigma} \right) = \sum_{i = 1}^{m}{𝜆_{i}^{- 1}\left( \mathbf{x} - \mathbf{μ} \right)^{\rm{T}}\mathbf{q}_{i}\mathbf{q}_{i}^{\rm{T}}}\left( \mathbf{x} - \mathbf{μ} \right) = \sum_{i = 1}^{m}{\left( \frac{1}{\sqrt{𝜆_{i}}}\mathbf{q}_{i}^{\rm{T}}\left( \mathbf{x} - \mathbf{μ} \right) \right)^{\rm{T}}\left( \frac{1}{\sqrt{𝜆_{i}}}\mathbf{q}_{i}^{\rm{T}}\left( \mathbf{x} - \mathbf{μ} \right) \right)}\]

    Therefore, \(𝒹_{M}\left( \mathbf{x};\mathbf{μ,\Sigma} \right) = \sum_{i = 1}^{m}z_{i}^{2} = \mathbf{z}^{\rm{T}}\mathbf{z}\) where \(z_{i} = \frac{1}{\sqrt{𝜆_{i}}}\mathbf{q}_{i}^{\rm{T}}\left( \mathbf{x} - \mathbf{μ} \right)\) and \(\mathbf{z} = \mathbf{\Lambda}^{- \frac{1}{2}}\mathbf{Q}^{\rm{T}}\left( \mathbf{x} - \mathbf{μ} \right) = \left( z_{1},\ldots,z_{m} \right)^{\rm{T}}\), and \(\mathbf{x} = \mathbf{Q}\mathbf{\Lambda}^{\frac{1}{2}}\mathbf{z}\mathbf{+}\mathbf{μ}\).

    Recall that an orthogonal matrix represents a rotation (possibly composed with a reflection), so \(\mathbf{Q}\mathbf{\Lambda}^{\frac{1}{2}}\left( \cdot \right)\mathbf{+}\mathbf{μ}\) geometrically transforms the standard frame on the unit sphere to a frame on an ellipsoid centered at \(\mathbf{μ}\), and \(\mathbf{z}\) is the coordinate of \(\mathbf{x}\) w.r.t. the ellipsoid frame. Conversely, given a contour level \(p\left( \mathbf{x} \right) = p_{0}\), we have \(\mathbf{z}^{\rm{T}}\mathbf{z =}c \Rightarrow \left\| \mathbf{z} \right\|_{2} = \sqrt{c}\) for some constant \(c\) (i.e. the contour is a sphere w.r.t. the ellipsoid frame), and then any \(\mathbf{x}\) s.t. \(\left\| \mathbf{z} \right\|_{2} = \sqrt{c}\) lies on an ellipsoid centered at \(\mathbf{μ}\). Therefore, the contours of the multivariate Gaussian density are ellipsoids.
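
    The following sketch (assuming NumPy; the numbers are illustrative) checks the change of coordinates numerically: the Mahalanobis distance equals \(\mathbf{z}^{\rm{T}}\mathbf{z}\), and \(\mathbf{Q}\mathbf{\Lambda}^{\frac{1}{2}}\mathbf{z} + \mathbf{μ}\) recovers the original point.

    ```python
    import numpy as np

    rng = np.random.default_rng(5)
    mu = np.array([1.0, 2.0])
    Sigma = np.array([[3.0, 1.0], [1.0, 2.0]])
    lam, Q = np.linalg.eigh(Sigma)                   # Sigma = Q diag(lam) Q^T

    x = rng.normal(size=2)
    d_M = (x - mu) @ np.linalg.solve(Sigma, x - mu)  # Mahalanobis distance
    z = np.diag(lam ** -0.5) @ Q.T @ (x - mu)        # z = Lambda^{-1/2} Q^T (x - mu)

    assert np.isclose(d_M, z @ z)                            # d_M = z^T z
    assert np.allclose(Q @ np.diag(lam ** 0.5) @ z + mu, x)  # x = Q Lambda^{1/2} z + mu
    ```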

  • Lemma 1-3. If a density function \(p\left( \mathbf{x} \right) \propto \exp\left\{ - \frac{1}{2}\mathbf{x}^{\rm{T}}\mathbf{Ax +}\mathbf{x}^{\rm{T}}\mathbf{\text{Ay}} \right\}\) where \(\mathbf{A}\) is symmetric positive definite, then \(p\) must be Gaussian with precision \(\mathbf{Ⲗ}=\mathbf{A}\) and mean \(\mathbf{μ} = \mathbf{y}\). This follows simply by completing the square:

    \[p\left( \mathbf{x} \right) \propto \exp\left\{ - \frac{1}{2}\left( \mathbf{x - y} \right)^{\rm{T}}\mathbf{A}\left( \mathbf{x}-\mathbf{y} \right) + \frac{1}{2}\mathbf{y}^{\rm{T}}\mathbf{\text{Ay}} \right\} \propto \exp\left\{ - \frac{1}{2}\left( \mathbf{x - y} \right)^{\rm{T}}\mathbf{A}\left( \mathbf{x}-\mathbf{y} \right) \right\}\]

    There is no need to worry about normalization: since \(p\) is assumed to be a density, it is guaranteed to normalize to one.
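
    A quick numerical check (assuming NumPy; the inputs are random and illustrative) of the completing-the-square identity used above: \(-\frac{1}{2}\mathbf{x}^{\rm{T}}\mathbf{Ax} + \mathbf{x}^{\rm{T}}\mathbf{Ay} = -\frac{1}{2}\left( \mathbf{x-y} \right)^{\rm{T}}\mathbf{A}\left( \mathbf{x-y} \right) + \frac{1}{2}\mathbf{y}^{\rm{T}}\mathbf{Ay}\) for symmetric \(\mathbf{A}\).

    ```python
    import numpy as np

    rng = np.random.default_rng(6)
    n = 3
    B = rng.normal(size=(n, n))
    A = B @ B.T + n * np.eye(n)       # symmetric positive definite
    x, y = rng.normal(size=n), rng.normal(size=n)

    lhs = -0.5 * x @ A @ x + x @ A @ y
    rhs = -0.5 * (x - y) @ A @ (x - y) + 0.5 * y @ A @ y
    assert np.isclose(lhs, rhs)
    ```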

    Fact 1-10. If a block matrix \(\begin{pmatrix} \mathbf{A} & \mathbf{B} \\ \mathbf{C} & \mathbf{D} \\ \end{pmatrix}\) is invertible and \(\mathbf{A}\) is non-singular, we have the following with all inverses valid

    \[\begin{split}\colorbox{fact}{$\begin{pmatrix} \mathbf{A} & \mathbf{B} \\ \mathbf{C} & \mathbf{D} \\ \end{pmatrix}^{- 1} = \begin{pmatrix} \mathbf{A}^{- 1}\left( \mathbf{I}\mathbf{+}\mathbf{\text{BMC}}\mathbf{A}^{- 1} \right) & - \mathbf{A}^{- 1}\mathbf{\text{BM}} \\ - \mathbf{\text{MC}}\mathbf{A}^{- 1} & \mathbf{M} \\ \end{pmatrix},\mathbf{M} = \left( \mathbf{D} - \mathbf{C}\mathbf{A}^{- 1}\mathbf{B} \right)^{- 1}$}\end{split}\]

    and if \(\mathbf{D}\) is non-singular, we have the following with all inverses valid,

    \[\begin{split}\colorbox{fact}{$\begin{pmatrix} \mathbf{A} & \mathbf{B} \\ \mathbf{C} & \mathbf{D} \\ \end{pmatrix}^{- 1} = \begin{pmatrix} \mathbf{M} & - \mathbf{\text{MB}}\mathbf{D}^{- 1} \\ - \mathbf{D}^{- 1}\mathbf{\text{CM}} & \mathbf{D}^{- 1}\left( \mathbf{I} + \mathbf{\text{CMB}}\mathbf{D}^{- 1} \right) \\ \end{pmatrix},\mathbf{M} = \left( \mathbf{A} - \mathbf{B}\mathbf{D}^{- 1}\mathbf{C} \right)^{- 1}$}\end{split}\]
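
    A numerical check (assuming NumPy; the partition sizes are arbitrary) of both block-inverse formulas in Fact 1-10, using a random symmetric positive definite matrix so that every inverse exists.

    ```python
    import numpy as np

    rng = np.random.default_rng(7)
    m, d = 5, 2
    T = rng.normal(size=(m, m))
    S = T @ T.T + m * np.eye(m)           # invertible block matrix
    A, B = S[:d, :d], S[:d, d:]
    C, D = S[d:, :d], S[d:, d:]
    inv = np.linalg.inv

    # first form (A non-singular): M = (D - C A^{-1} B)^{-1}
    M = inv(D - C @ inv(A) @ B)
    top = np.hstack([inv(A) @ (np.eye(d) + B @ M @ C @ inv(A)), -inv(A) @ B @ M])
    bot = np.hstack([-M @ C @ inv(A), M])
    assert np.allclose(np.vstack([top, bot]), inv(S))

    # second form (D non-singular): M = (A - B D^{-1} C)^{-1}
    M2 = inv(A - B @ inv(D) @ C)
    top2 = np.hstack([M2, -M2 @ B @ inv(D)])
    bot2 = np.hstack([-inv(D) @ C @ M2, inv(D) @ (np.eye(m - d) + C @ M2 @ B @ inv(D))])
    assert np.allclose(np.vstack([top2, bot2]), inv(S))
    ```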

    Theorem 1-10. Conditional density of the multivariate Gaussian.

    Given \(\mathbf{x}\sim\operatorname{Gaussian}\left( \mathbf{μ},\mathbf{\Sigma} \right)\), WLOG, partition \(\mathbf{x} = \begin{pmatrix} \mathbf{x}_{a} \\ \mathbf{x}_{b} \\ \end{pmatrix}\), where \(\mathbf{x}_{a} \in \mathbb{R}^{d}\) is the unknown, and \(\mathbf{x}_{b} \in \mathbb{R}^{m - d}\) is the condition, and we want to find the conditional density \(p\left( \mathbf{x}_{a}\mathbf{|}\mathbf{x}_{b} \right)\). Partition the mean and covariance accordingly as \(\mathbf{μ} = \begin{pmatrix} \mathbf{μ}_{a} \\ \mathbf{μ}_{b} \\ \end{pmatrix}\), \(\mathbf{\Sigma} = \begin{pmatrix} \mathbf{\Sigma}_{{aa}} & \mathbf{\Sigma}_{{ab}} \\ \mathbf{\Sigma}_{{ba}} & \mathbf{\Sigma}_{{bb}} \\ \end{pmatrix}\) and \(\mathbf{Ⲗ} = \begin{pmatrix} \mathbf{Ⲗ}_{{aa}} & \mathbf{Ⲗ}_{{ab}} \\ \mathbf{Ⲗ}_{{ba}} & \mathbf{Ⲗ}_{{bb}} \\ \end{pmatrix}\). Assume \(\mathbf{\Sigma}_{{aa}}\) is non-singular, then we have
    \[\begin{split}\begin{aligned} - \frac{1}{2}\left( \mathbf{x} - \mathbf{μ} \right)^{\rm{T}}\mathbf{Ⲗ}\left( \mathbf{x} - \mathbf{μ} \right) &= - \frac{1}{2}\begin{pmatrix} \mathbf{x}_{a}-\mathbf{μ}_{a} \\ \mathbf{x}_{b}-\mathbf{μ}_{b} \\ \end{pmatrix}^{\rm{T}}\begin{pmatrix} \mathbf{Ⲗ}_{{aa}} & \mathbf{Ⲗ}_{{ab}} \\ \mathbf{Ⲗ}_{{ba}} & \mathbf{Ⲗ}_{{bb}} \\ \end{pmatrix}\begin{pmatrix} \mathbf{x}_{a}-\mathbf{μ}_{a} \\ \mathbf{x}_{b}-\mathbf{μ}_{b} \\ \end{pmatrix} \\ &= - \frac{1}{2}\left( \mathbf{x}_{a}-\mathbf{μ}_{a} \right)^{\rm{T}}\mathbf{Ⲗ}_{{aa}}\left( \mathbf{x}_{a}-\mathbf{μ}_{a} \right) - \left( \mathbf{x}_{a}-\mathbf{μ}_{a} \right)^{\rm{T}}\mathbf{Ⲗ}_{{ab}}\left( \mathbf{x}_{b}-\mathbf{μ}_{b} \right)-\frac{1}{2}\left( \mathbf{x}_{b}-\mathbf{μ}_{b} \right)^{\rm{T}}\mathbf{Ⲗ}_{{bb}}\left( \mathbf{x}_{b}-\mathbf{μ}_{b} \right) \\ &= - \frac{1}{2}\mathbf{x}_{a}^{\rm{T}}\mathbf{Ⲗ}_{{aa}}\mathbf{x}_{a} + \mathbf{x}_{a}^{\rm{T}}\mathbf{Ⲗ}_{{aa}}\mathbf{μ}_{a}-\mathbf{x}_{a}^{\rm{T}}\mathbf{Ⲗ}_{{ab}}\left( \mathbf{x}_{b}-\mathbf{μ}_{b} \right)\mathbf{+}\text{constant} \\ &= - \frac{1}{2}\mathbf{x}_{a}^{\rm{T}}\mathbf{Ⲗ}_{{aa}}\mathbf{x}_{a} + \mathbf{x}_{a}^{\rm{T}}\mathbf{Ⲗ}_{{aa}}\left( \mathbf{μ}_{a}-\mathbf{Ⲗ}_{{aa}}^{- 1}\mathbf{Ⲗ}_{{ab}}\left( \mathbf{x}_{b}-\mathbf{μ}_{b} \right) \right)\mathbf{+}\text{constant} \end{aligned}\end{split}\]

    By Lemma 1-3, it follows that if \(\mathbf{\Sigma}_{{aa}}\) is non-singular, then \(p\left( \mathbf{x}_{a}\mathbf{|}\mathbf{x}_{b} \right)\) is Gaussian. Denoting \(\mathbf{x}_{a}\mathbf{|}\mathbf{x}_{b}\mathbf{\sim}\operatorname{Gaussian}\left( \mathbf{μ}_{a|b}\mathbf{,}\mathbf{\Sigma}_{a|b} \right)\), we have

    \[\colorbox{theorem}{$\mathbf{μ}_{a|b}=\mathbf{μ}_{a}-\mathbf{Ⲗ}_{{aa}}^{- 1}\mathbf{Ⲗ}_{{ab}}\left( \mathbf{x}_{b}-\mathbf{μ}_{b} \right)\mathbf{,}\mathbf{\Sigma}_{a|b}=\mathbf{Ⲗ}_{{aa}}^{- 1}$}\]

    If in addition \(\mathbf{\Sigma}_{{bb}}\) is non-singular, using Fact 1-10 and \(\begin{pmatrix} \mathbf{\Sigma}_{{aa}} & \mathbf{\Sigma}_{{ab}} \\ \mathbf{\Sigma}_{{ba}} & \mathbf{\Sigma}_{{bb}} \\ \end{pmatrix}^{- 1} = \begin{pmatrix} \mathbf{Ⲗ}_{{aa}} & \mathbf{Ⲗ}_{{ab}} \\ \mathbf{Ⲗ}_{{ba}} & \mathbf{Ⲗ}_{{bb}} \\ \end{pmatrix}\), we have

    \[\begin{split}\left\{ {\begin{array}{*{20}{l}} {{Ⲗ _{aa}} = {{\left( {{{\mathbf{\Sigma }}_{aa}} - {{\mathbf{\Sigma }}_{ab}}{\mathbf{\Sigma }}_{bb}^{ - 1}{{\mathbf{\Sigma }}_{ba}}} \right)}^{ - 1}}} \\ {{Ⲗ _{ab}} = - {Ⲗ _{aa}}{{\mathbf{\Sigma }}_{ab}}{\mathbf{\Sigma }}_{bb}^{ - 1}} \end{array}} \right. \Rightarrow \left\{ {\begin{array}{*{20}{l}} {Ⲗ _{aa}^{ - 1}{Ⲗ _{ab}} = - {{\mathbf{\Sigma }}_{ab}}{\mathbf{\Sigma }}_{bb}^{ - 1}} \\ {Ⲗ _{aa}^{ - 1} = {{\mathbf{\Sigma }}_{aa}} - {{\mathbf{\Sigma }}_{ab}}{\mathbf{\Sigma }}_{bb}^{ - 1}{{\mathbf{\Sigma }}_{ba}}} \end{array}} \right.\end{split}\]

    Plugging these into Eq.1.13, we have

    \[\colorbox{theorem}{${{\mathbf{μ }}_{a|b}} = {{\mathbf{μ }}_a} + {{\mathbf{\Sigma }}_{ab}}{\mathbf{\Sigma }}_{bb}^{ - 1}\left( {{{\mathbf{x}}_b} - {{\mathbf{μ }}_b}} \right),{{\mathbf{\Sigma }}_{a|b}} = {{\mathbf{\Sigma }}_{aa}} - {{\mathbf{\Sigma }}_{ab}}{\mathbf{\Sigma }}_{bb}^{ - 1}{{\mathbf{\Sigma }}_{ba}}$}\]

    Similarly, if \(\mathbf{\Sigma}_{{bb}}\) is non-singular, then \(p\left( \mathbf{x}_{b}|\mathbf{x}_{a} \right)\) is Gaussian. Following Eq.1.12 we have

    \[\begin{split}\begin{aligned} - \frac{1}{2}\left( \mathbf{x} - \mathbf{μ} \right)^{\rm{T}}\mathbf{Ⲗ}\left( \mathbf{x} - \mathbf{μ} \right) &= - \frac{1}{2}\mathbf{x}_{b}^{\rm{T}}\mathbf{Ⲗ}_{{bb}}\mathbf{x}_{b} + \mathbf{x}_{b}^{\rm{T}}\mathbf{Ⲗ}_{{bb}}\mathbf{μ}_{b}-\left( \mathbf{x}_{a}-\mathbf{μ}_{a} \right)^{\rm{T}}\mathbf{Ⲗ}_{{ab}}\mathbf{x}_{b}\mathbf{+}\text{constant} \\ &= - \frac{1}{2}\mathbf{x}_{b}^{\rm{T}}\mathbf{Ⲗ}_{{bb}}\mathbf{x}_{b} + \mathbf{x}_{b}^{\rm{T}}\mathbf{Ⲗ}_{{bb}}\mathbf{μ}_{b}-\mathbf{x}_{b}^{\rm{T}}\mathbf{Ⲗ}_{{ba}}\left( \mathbf{x}_{a}-\mathbf{μ}_{a} \right)\mathbf{+}\text{constant} \\ &= - \frac{1}{2}\mathbf{x}_{b}^{\rm{T}}\mathbf{Ⲗ}_{{bb}}\mathbf{x}_{b} + \mathbf{x}_{b}^{\rm{T}}\mathbf{Ⲗ}_{{bb}}\left( \mathbf{μ}_{b}-\mathbf{Ⲗ}_{{bb}}^{- 1}\mathbf{Ⲗ}_{{ba}}\left( \mathbf{x}_{a}-\mathbf{μ}_{a} \right) \right)\mathbf{+}\text{constant}\end{aligned}\end{split}\]

    If in addition \(\mathbf{\Sigma}_{{aa}}\) is non-singular, using Fact 1-10 we have

    \[\mathbf{Ⲗ}_{{bb}}^{- 1}\mathbf{Ⲗ}_{{ba}}\mathbf{= -}\mathbf{\Sigma}_{{ba}}\mathbf{\Sigma}_{{aa}}^{- 1}\mathbf{,}\mathbf{Ⲗ}_{{bb}}^{- 1}=\mathbf{\Sigma}_{{bb}}-\mathbf{\Sigma}_{{ba}}\mathbf{\Sigma}_{{aa}}^{- 1}\mathbf{\Sigma}_{{ab}}\]

    Finally,

    \[\colorbox{theorem}{$\mathbf{μ}_{b|a}=\mathbf{μ}_{b}-\mathbf{Ⲗ}_{{bb}}^{- 1}\mathbf{Ⲗ}_{{ba}}\left( \mathbf{x}_{a}-\mathbf{μ}_{a} \right)=\mathbf{μ}_{b}\mathbf{+}\mathbf{\Sigma}_{{ba}}\mathbf{\Sigma}_{{aa}}^{- 1}\left( \mathbf{x}_{a}-\mathbf{μ}_{a} \right)$}\]
    \[\colorbox{theorem}{$\mathbf{\Sigma}_{b|a}=\mathbf{Ⲗ}_{{bb}}^{- 1}=\mathbf{\Sigma}_{{bb}}-\mathbf{\Sigma}_{{ba}}\mathbf{\Sigma}_{{aa}}^{- 1}\mathbf{\Sigma}_{{ab}}$}\]

    We note that 1) the conditional mean and covariance take a simpler form in terms of the precision matrix, as seen in Eq.1.13 and Eq.1.14; 2) the conditional mean \(\mathbf{μ}_{a|b}\) is independent of \(\mathbf{x}_{a}\), \(\mathbf{μ}_{b|a}\) is independent of \(\mathbf{x}_{b}\), and both \(\mathbf{\Sigma}_{a|b}\mathbf{,}\mathbf{\Sigma}_{b|a}\) are independent of \(\mathbf{x}\), which is expected: the conditional mean and covariance should be determined by the distribution itself and the conditioning variables, not by the “unknown” RVs.
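
    Finally, a sketch (assuming NumPy; all sizes and values are illustrative) confirming that the precision-based and covariance-based expressions for the conditional mean and covariance agree.

    ```python
    import numpy as np

    rng = np.random.default_rng(8)
    m, d = 4, 2
    T = rng.normal(size=(m, m))
    Sigma = T @ T.T + m * np.eye(m)       # full covariance, positive definite
    mu = rng.normal(size=m)
    Lam = np.linalg.inv(Sigma)            # precision matrix

    Saa, Sab = Sigma[:d, :d], Sigma[:d, d:]
    Sba, Sbb = Sigma[d:, :d], Sigma[d:, d:]
    Laa, Lab = Lam[:d, :d], Lam[:d, d:]

    x_b = rng.normal(size=m - d)          # the observed (conditioning) part

    # precision form: mu_{a|b} = mu_a - Laa^{-1} Lab (x_b - mu_b), Sigma_{a|b} = Laa^{-1}
    mu_prec = mu[:d] - np.linalg.solve(Laa, Lab @ (x_b - mu[d:]))
    cov_prec = np.linalg.inv(Laa)

    # covariance form: mu_{a|b} = mu_a + Sab Sbb^{-1} (x_b - mu_b),
    #                  Sigma_{a|b} = Saa - Sab Sbb^{-1} Sba
    mu_cov = mu[:d] + Sab @ np.linalg.solve(Sbb, x_b - mu[d:])
    cov_cov = Saa - Sab @ np.linalg.solve(Sbb, Sba)

    assert np.allclose(mu_prec, mu_cov)
    assert np.allclose(cov_prec, cov_cov)
    ```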