
Expected Value, Variance, and Covariance

In the previous section, we defined random variables and their basic properties, such as their Probability Density Functions (PDFs) and Probability Mass Functions (PMFs). These functions completely describe the random variable in question, and in science we often want to measure them. Doing so accurately is quite difficult, because a PDF has a separate value at each point of its domain, all of which would have to be measured. It is often more practical to measure simplifying properties of the PDF instead. These are called the “moments” of the distribution. The \(n^\mathrm{th}\) moment of the PDF \(P(x)\) is defined as

\[\int dx\, x^n P(x).\]
(1)
for some integer \(n\).

Below, we will focus on a few important moments and explain why they are useful. We will consider continuous random variables which have PDFs, though many of the results also apply to discrete random variables.

The expected value of a random variable \(X\) is defined as its first-order moment:

\[\langle X\rangle = \int dx\, x P(x).\]
(2)
It is effectively the average of all values \(X\) could have, weighted by their probability.

The introduction of this post implied that moments are easier to measure than the PDF directly. To see why, let’s design an experiment that attempts to measure \(\langle X \rangle\). We’ll make \(n\) independent measurements of \(X\), which we call \(X_1, X_2,\dots, X_n\). A natural estimator for \(\langle X\rangle\) is the average of these measurements:

\[M = \frac{1}{n}\sum_{i=1}^n X_i.\]
(3)

The measurements are described by a joint, multivariate PDF, but because the \(X_i\) are all independent, it factorizes as \(P(x_1,\dots,x_n) = P(x_1)\cdots P(x_n)\). Thus, the expected value of \(M\) is

\[\langle M \rangle = \int dx_1\cdots dx_n\, \brackets{\frac{1}{n} \sum_{i=1}^n x_i} P(x_1,\dots,x_n)\]
(4)
\[ = \frac{1}{n} \sum_{i=1}^n \int dx_i\, x_i P(x_i)\]
(5)
\[ = \frac{1}{n} \sum_{i=1}^n \langle X_i \rangle\]
(6)
\[ = \langle X \rangle.\]
(7)

We have shown that the expected value of this average equals the expected value of the random variable itself, so our estimator \(M\) is an unbiased (accurate) measurement of \(\langle X \rangle\). But to show that it’s also precise, we must introduce the concept of variance.
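As a quick numerical check of this unbiasedness, here is a minimal sketch using numpy; the exponential distribution and its mean of 2 are purely illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choice: X is exponentially distributed with true mean 2.0.
true_mean = 2.0
n = 100_000
samples = rng.exponential(scale=true_mean, size=n)

# The estimator M is the sample average of the n measurements.
M = samples.mean()
```

With this many measurements, `M` lands very close to the true expected value.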

The variance of \(X\) is defined as

\[\mathrm{Var}\ X = \int dx\, (x - \langle X \rangle)^2 P(x) = \langle (X - \langle X \rangle)^2 \rangle.\]
(8)
This is a second-order moment of \(X\). Its value is non-negative, so we can also define the standard deviation \(\sigma_X\) to be such that \(\sigma_X^2 = \mathrm{Var}\ X\). Manipulating the above integral shows that an alternative form of the variance is

\[\mathrm{Var}\ X = \langle X^2 \rangle - \langle X \rangle^2\]
(9)

which is often useful.
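Both forms of the variance can be checked numerically against each other. A minimal sketch, assuming numpy is available; the normal distribution with \(\sigma_X = 3\) is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(1)
# Illustrative choice: X normal with mean 1 and standard deviation 3.
x = rng.normal(loc=1.0, scale=3.0, size=200_000)

# Eq. 8: mean squared deviation from the mean.
var1 = np.mean((x - x.mean()) ** 2)
# Eq. 9: <X^2> - <X>^2.
var2 = np.mean(x ** 2) - x.mean() ** 2
```

The two estimates agree to floating-point precision and approximate the true variance, \(3^2 = 9\).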

Variance is an extremely important concept because it characterizes the width of a probability distribution. A PDF with zero variance must be a delta function: zero probability everywhere except at a single value, the mean. On the other hand, a PDF with high variance can take on many widely separated values.

Returning to the experiment we introduced in the previous section, \(M\) is only a precise estimator for \(\langle X \rangle\) if it has low variance. Let’s check this by computing \(\langle M^2 \rangle\) and applying the second form of variance above.

\[\langle M^2 \rangle = \int dx_1\cdots dx_n\, \brackets{\frac{1}{n} \sum_{i=1}^n x_i}^2 P(x_1)\cdots P(x_n)\]
(10)
\[= \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n \langle X_i X_j \rangle\]
(11)
\[= \frac{1}{n^2} \sum_{i=1}^n \sum_{j\neq i} \langle X \rangle^2 + \frac{1}{n^2} \sum_{i=1}^n \langle X^2 \rangle\]
(12)
\[= \frac{n-1}{n}\langle X \rangle^2 + \frac{1}{n} \langle X^2 \rangle\]
(13)
\[= \langle X \rangle^2 + \frac{1}{n} \mathrm{Var}\ X\]
(14)
where the second line expanded the square, the third line separated the components where \(i\neq j\) and \(i= j\), the fourth counted the number of terms in the sum, and the fifth substituted in Eq. 9.

Finally, subtracting \(\langle M \rangle^2 = \langle X \rangle^2\), the variance of \(M\) must be

\[\mathrm{Var}\ M = \frac{1}{n} \mathrm{Var}\ X\]
(15)
or equivalently
\[\sigma_M = \frac{\sigma_X}{\sqrt{n}}.\]
(16)
This shows that \(M\) is a good estimator for the expected value of \(X\), because as \(n\) increases \(M\) has low variance, becoming more and more precise.
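The \(1/\sqrt{n}\) scaling is easy to verify by simulation. A sketch assuming numpy, with a standard normal \(X\) as an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(2)

def std_of_mean(n, trials=20_000):
    # Repeat the experiment many times: each row holds n draws of a
    # standard normal X, and we average each row to get one sample of M.
    means = rng.normal(0.0, 1.0, size=(trials, n)).mean(axis=1)
    return means.std()

# sigma_M should shrink like 1/sqrt(n): quadrupling n halves sigma_M.
s4 = std_of_mean(4)    # expect about 1/2
s16 = std_of_mean(16)  # expect about 1/4
```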

This experiment illustrated a common process in statistics. Whenever one wants to measure a property of a random variable, one constructs an estimator and shows that it is unbiased (i.e. its expected value equals the quantity being measured) and minimum-variance (i.e. its variance is as low as possible in the limit of large data).

An exercise for the reader is to show that an unbiased estimator of the variance of a distribution is

\[V = \frac{1}{n - 1} \sum_{i=1}^n (X_i - M)^2.\]
(17)
and that, like \(M\), it exhibits standard deviation scaling as \(1/\sqrt{n}\).
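Without working the exercise, we can at least check numerically that the \(1/(n-1)\) normalization makes \(V\) unbiased, while a naive \(1/n\) normalization does not. A sketch assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(3)

# Many small samples (n = 5 each) from a standard normal, true variance 1.
n, trials = 5, 400_000
data = rng.normal(0.0, 1.0, size=(trials, n))

# ddof=1 applies the 1/(n-1) normalization of Eq. 17.
V = data.var(axis=1, ddof=1)
# ddof=0 divides by n instead, which is biased low by a factor (n-1)/n.
V_naive = data.var(axis=1, ddof=0)
```

Averaged over many trials, `V` centers on the true variance 1, while `V_naive` centers on \((n-1)/n = 0.8\).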

For multivariate probability distributions, it’s possible to define a different type of second moment, called covariance

\[\mathrm{Cov} [X, Y] = \big\langle (X - \langle X \rangle)(Y- \langle Y\rangle)\big\rangle.\]
(18)
For a set of \(n\) random variables \(X_1,\dots,X_n\) one can define a covariance matrix \(\Sigma\) whose entries are
\[\Sigma_{ij} = \mathrm{Cov} [X_i, X_j].\]
(19)
Notice that \(\mathrm{Cov} [X, Y] = \mathrm{Cov} [Y, X]\) and \(\mathrm{Cov} [X, X] = \mathrm{Var}\ X\) by definition. Equivalently, we could say that the covariance matrix \(\Sigma\) is symmetric, and its diagonal contains the variances of \(X_i\).

An interesting fact about covariance is that if \(X\) and \(Y\) are independent, then \(\mathrm{Cov} [X, Y] = 0\); you can prove this from the first equation of this section. (The converse does not hold in general: dependent variables can still have zero covariance.) Just as variance encapsulates the width of a probability distribution, covariance encapsulates how dependent two variables are on each other. If the covariance matrix contains nonzero off-diagonal elements, its random variables are dependent.
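A numerical illustration of these properties of the covariance matrix, assuming numpy; the three variables here are an illustrative construction.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

# x and z are independent; y is built from x, so x and y are dependent.
x = rng.normal(0.0, 1.0, size=n)
y = x + rng.normal(0.0, 1.0, size=n)
z = rng.normal(0.0, 1.0, size=n)

# Each row of the input is one variable; np.cov returns the 3x3 matrix Sigma.
sigma = np.cov(np.stack([x, y, z]))
```

The result is symmetric, its diagonal holds the variances (about 1, 2, and 1), \(\mathrm{Cov}[X, Y]\) is about 1, and \(\mathrm{Cov}[X, Z]\) is near zero.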

The covariance matrix is a very useful mathematical object, which we will unpack later.

One can continue to form third and fourth order moments, which can likewise be estimated from data. These are generally called skewness (third order) and kurtosis (fourth order), and qualitatively they measure asymmetry in the PDF and the fraction of probability contained in the tails of the distribution. Their estimators possess approximately \(1/\sqrt{n}\) standard deviation just as the mean estimator did. We will not discuss these much.
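For the curious, these standardized third and fourth moments can be estimated directly from samples. A sketch assuming numpy, with an exponential distribution as an illustrative skewed example (its skewness is 2 and its kurtosis is 9):

```python
import numpy as np

rng = np.random.default_rng(5)
# Illustrative choice: an exponential distribution, which is strongly skewed.
x = rng.exponential(1.0, size=1_000_000)

m, s = x.mean(), x.std()
z = (x - m) / s  # standardize the samples
skewness = np.mean(z ** 3)  # standardized third moment; 2 for an exponential
kurtosis = np.mean(z ** 4)  # standardized fourth moment; 9 for an exponential
```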

Consequences

In the above sections, we showed that the average \(M\) of many independent random variables \(X_1,\dots,X_n\) drawn from the same distribution has mean \(\langle X \rangle\) and standard deviation \(\sigma_X / \sqrt{n}\). But we stopped short of computing the full PDF \(P_M(x)\).

You might expect that \(P_M(m)\) depends on the details of \(P_X(x)\). But for large \(n\), the central limit theorem tells us that it depends only on \(\langle X \rangle\) and \(\sigma_X\). This theorem is arguably one of the most important and most used theorems in statistics.

The central limit theorem (CLT) states that, under some light constraints, \(M\) is Gaussian-distributed in the limit of large \(n\) regardless of \(P_X(x)\). That is,

\[P_M(m) \rightarrow \sqrt{\frac{n}{2\pi \sigma_X^2}}\exp\parens{-\frac{(m-\langle X \rangle)^2}{2 \sigma_X^2/n}}\]
(20)
The applications of the CLT are practically boundless. If you don’t know the PDF of \(X\), which is a common problem, an average over enough data points drawn from \(X\) will nevertheless be Gaussian-distributed. A histogram, for example, will have Gaussian-distributed error bars on each bin. Likewise, a function of the average will have easily predictable error bars, insensitive to potential tails in \(P_X(x)\).
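The CLT is easy to see in simulation. A sketch assuming numpy, with a uniform distribution as an illustrative non-Gaussian \(P_X(x)\):

```python
import numpy as np

rng = np.random.default_rng(6)

# Illustrative choice: X uniform on [0, 1], with <X> = 1/2 and Var X = 1/12.
n, trials = 50, 200_000
M = rng.uniform(0.0, 1.0, size=(trials, n)).mean(axis=1)

# Gaussian prediction from Eq. 20: mean 1/2, standard deviation sqrt(1/(12n)).
pred_std = np.sqrt(1.0 / (12.0 * n))
# If M is Gaussian, about 68% of its draws fall within one predicted sigma.
frac_1sigma = np.mean(np.abs(M - 0.5) < pred_std)
```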

What follows is one of many possible proofs, based on Fourier transforms. For the sake of simplicity, we work with the quantity \(S = nM\), whose PDF is

\[P_S(s) = \int dx_1\cdots dx_n\, P_X(x_1)\cdots P_X(x_n)\, \delta\parens{s - \sum_{i=1}^n x_i}.\]
(21)

This is the convolution of \(P_X\) with itself \(n\) times. The “convolution theorem” dictates that the Fourier transform of a convolution of two functions \(f(x)\) and \(g(x)\) is equal to the product of their Fourier transforms, \(\widetilde f(k)\) and \(\widetilde g(k)\). Applying this theorem repeatedly shows that the convolution in Eq. 21 satisfies

\[\widetilde P_S(k) = \widetilde P_X(k)^n\]
(22)
or
\[\ln \widetilde P_S(k) = n \ln \widetilde P_X(k).\]
(23)
We are interested in the behavior of \(\ln \widetilde P_S(k)\) near its maximum, which contributes most to \(P_S(s)\). Due to the factor of \(n\) on the right-hand side, \(\widetilde P_S(k)\) is exponentially suppressed except when \(k\) is very close to the maximizing value \(k_0\). We may therefore approximate \(\ln \widetilde P_X(k)\) by its lowest-order Taylor series near the maximum, where the linear term vanishes:
\[\ln \widetilde P_S(k) \approx n a_1 - \frac{n}{2}(k-k_0)^2 a_2\]
(24)
where \(a_1\) and \(a_2>0\) are Taylor series coefficients. Removing the logarithm, \(\widetilde P_S(k)\) is approximately a Gaussian
\[\widetilde P_S(k) \propto \exp \parens{-a_2\frac{n(k-k_0)^2}{2}}\]
(25)
and the Fourier transform of a Gaussian is also a Gaussian
\[P_S(s) \propto \exp \parens{-\frac{(s-\mu_S)^2}{2\sigma_S^2}} \implies P_M(m) \propto \exp \parens{-\frac{(m-\mu)^2}{2\sigma^2}}\]
(26)
for some parameters \(\sigma\) and \(\mu\). We showed in the sections on expected value and variance that \(M\) has expected value \(\mu = \langle X \rangle\) and standard deviation \(\sigma = \sigma_X/\sqrt{n}\). Asserting these findings reproduces the central limit theorem, Eq. 20.

The last property we wish to note about variances is how to “propagate” variances through functions. Specifically, if we know \(\sigma_X\) and if \(Y = f(X)\), then what is \(\sigma_Y\)?

While there is a general solution, a common approximation is made. Suppose that \(P_X(x)\) is narrowly peaked around its mean, so that \(f(x)\) is slowly varying over the region where \(P_X(x)\) is large. Then we may approximate \(f(x)\) as linear in that region:

\[Y \approx f'(\langle X \rangle)(X - \langle X \rangle) + f(\langle X \rangle).\]
(27)
The definition of variance now simply states
\[\sigma_Y = |f'(\langle X \rangle)|\,\sigma_X.\]
(28)
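This propagation rule can be sanity-checked numerically. A sketch assuming numpy, with the illustrative choice \(f(x) = x^2\) and a narrowly peaked normal \(X\):

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative choice: X narrowly peaked (mean 10, sigma 0.05), f(x) = x^2.
mu, sigma_x = 10.0, 0.05
x = rng.normal(mu, sigma_x, size=500_000)
y = x ** 2

sigma_y_measured = y.std()
sigma_y_predicted = abs(2.0 * mu) * sigma_x  # |f'(<X>)| sigma_X
```

The measured spread of \(Y\) matches the linearized prediction \(|f'(10)| \cdot 0.05 = 1\) closely, because \(f\) is nearly linear over the peak.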

The next section is our first application of probability theory to statistics: we will use the tools we’ve learned to build a theory of fitting models to data.