Expected Value, Variance, and Covariance
In the previous section, we defined random variables and their basic properties, such as their Probability Density Functions (PDFs) and Probability Mass Functions (PMFs). These functions completely describe the random variable in question, and in science we often want to measure them. Doing so accurately is quite difficult, because a PDF has a separate value at every point of its domain, all of which would have to be measured. In such cases, it is often more useful to measure simplifying properties of the PDF instead. These are called the “moments” of the distribution. The \(n^\mathrm{th}\) moment of the PDF \(P(x)\) is defined as
\[ \langle x^n \rangle = \int_{-\infty}^{\infty} x^n\, P(x)\, dx. \]
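As a concrete illustration, the moment integral can be approximated numerically on a grid. The sketch below (assuming NumPy; the standard normal PDF is an arbitrary choice) computes the first two moments:

```python
import numpy as np

# Approximate <x^n> = ∫ x^n P(x) dx on a fine grid, for a standard normal PDF.
x = np.linspace(-10.0, 10.0, 200001)
dx = x[1] - x[0]
pdf = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def moment(n):
    # Riemann-sum approximation to the n-th moment
    return np.sum(x**n * pdf) * dx

print(moment(1))  # ≈ 0 (the mean of the standard normal)
print(moment(2))  # ≈ 1 (here equal to the variance, since the mean is 0)
```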
Below, we will focus on a few important moments and explain why they are useful. We will consider continuous random variables which have PDFs, though many of the results also apply to discrete random variables.
The expected value of a random variable \(X\) is defined as its first-order moment:
\[ \langle X \rangle = \int_{-\infty}^{\infty} x\, P(x)\, dx. \]
The introduction of this post implied that moments are easier to measure than the PDF directly. To see this, let’s design an experiment that attempts to measure \(\langle X \rangle\). We’ll make \(n\) independent measurements of \(X\), which we call \(X_1, X_2,\dots X_n\). A good estimator for \(\langle X\rangle\) is the average of these measurements:
\[ M = \frac{1}{n} \sum_{i=1}^{n} X_i. \]
The joint PDF of the measurements is multivariate, but because the \(X_i\) are all independent, it factorizes as \(P(x_1,\dots,x_n) = P(x_1)\cdots P(x_n)\). Thus, by linearity of the expectation, the expected value of \(M\) is
\[ \langle M \rangle = \frac{1}{n} \sum_{i=1}^{n} \langle X_i \rangle = \langle X \rangle. \]
We have shown that the expected value of this average is the expected value of the random variable itself, so that our estimator \(M\) is an accurate measurement of \(\langle X \rangle\). But to show that it’s precise, we must introduce the concept of variance.
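Before formalizing variance, the unbiasedness claim is easy to check in simulation. A minimal sketch, assuming NumPy; the exponential distribution with \(\langle X \rangle = 1\) is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)
# Run many simulated experiments; each one averages n draws of
# X ~ Exponential with true expected value <X> = 1.
n, trials = 50, 20000
samples = rng.exponential(scale=1.0, size=(trials, n))
M = samples.mean(axis=1)   # one estimate of <X> per experiment
print(M.mean())            # ≈ 1.0: the estimates are centered on <X>
```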
The variance of \(X\) is defined as
\[ \mathrm{Var}[X] \equiv \big\langle \left( X - \langle X \rangle \right)^2 \big\rangle. \]
Expanding the square and using the linearity of expectation gives the equivalent form
\[ \mathrm{Var}[X] = \langle X^2 \rangle - \langle X \rangle^2, \]
which is often useful. We will also write the variance as \(\sigma_X^2\), where \(\sigma_X\) is called the standard deviation of \(X\).
Variance is an extremely important concept because it characterizes the width of a probability distribution. A PDF with variance of zero must be a delta function — zero probability everywhere except at one value, which is the mean. On the other hand, a random variable with high variance can take on a great many widely separated values.
Returning to the experiment we introduced in the previous section, \(M\) is a precise estimator for \(\langle X \rangle\) only if it has low variance. Let’s check this by computing \(\langle M^2 \rangle\) and applying the second form of variance above:
\[ \langle M^2 \rangle = \frac{1}{n^2} \sum_{i,j} \langle X_i X_j \rangle = \frac{1}{n^2} \left[ n \langle X^2 \rangle + n(n-1) \langle X \rangle^2 \right], \]
where independence gives \(\langle X_i X_j \rangle = \langle X \rangle^2\) whenever \(i \neq j\). Finally, the variance of \(M\) must be
\[ \mathrm{Var}[M] = \langle M^2 \rangle - \langle M \rangle^2 = \frac{\langle X^2 \rangle - \langle X \rangle^2}{n} = \frac{\sigma_X^2}{n}, \]
so the standard deviation of \(M\) falls as \(1/\sqrt{n}\) as more data are collected.
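The \(\mathrm{Var}[M] = \sigma_X^2 / n\) scaling can also be checked numerically. A sketch assuming NumPy, using \(X\) uniform on \([0, 1]\), for which \(\sigma_X^2 = 1/12\):

```python
import numpy as np

rng = np.random.default_rng(1)
# Compare the empirical variance of M against the prediction sigma_X^2 / n
# for X uniform on [0, 1], whose variance is 1/12.
for n in (10, 100, 1000):
    M = rng.uniform(size=(5000, n)).mean(axis=1)
    print(n, M.var(), 1.0 / (12 * n))  # the last two columns should agree
```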
This experiment illustrates a common process in statistics. Whenever one wants to measure a property of a random variable, one constructs an estimator, shows that it is unbiased (i.e. its mean equals the quantity being measured), and shows that it is minimum-variance (i.e. its variance is as low as possible in the limit of large data).
An exercise for the reader is to show that an unbiased estimator of the variance of a distribution is
\[ \hat{\sigma}_X^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( X_i - M \right)^2, \]
where the factor of \(1/(n-1)\), rather than \(1/n\), is known as Bessel’s correction.
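The usual unbiased estimator of the variance divides the summed squared deviations by \(n-1\) rather than \(n\) (Bessel’s correction). A quick numerical check of the two normalizations (a sketch assuming NumPy; the standard normal, with true variance 1, is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials = 5, 200000
X = rng.normal(size=(trials, n))                 # true variance is 1
M = X.mean(axis=1, keepdims=True)                # sample mean per experiment
biased   = ((X - M) ** 2).sum(axis=1) / n        # naive 1/n normalization
unbiased = ((X - M) ** 2).sum(axis=1) / (n - 1)  # Bessel's correction
print(biased.mean())    # ≈ (n - 1)/n = 0.8: systematically low
print(unbiased.mean())  # ≈ 1.0: unbiased
```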
For multivariate probability distributions, it’s possible to define a different type of second moment, called covariance:
\[ \mathrm{Cov}[X, Y] \equiv \big\langle \left( X - \langle X \rangle \right) \left( Y - \langle Y \rangle \right) \big\rangle = \langle XY \rangle - \langle X \rangle \langle Y \rangle. \]
An interesting fact about covariance is that if \(X\) and \(Y\) are independent, then \(\mathrm{Cov} [X, Y] = 0\); you can prove this from the first equation of this section. (The converse does not hold: zero covariance does not imply independence.) Just as variance encapsulates the width of a probability distribution, covariance encapsulates how dependent two variables are on each other. If the covariance matrix contains nonzero off-diagonal elements, its random variables are dependent.
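This behavior is easy to see numerically. A sketch assuming NumPy; the particular constructions of the dependent and independent variables are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 500_000
X = rng.normal(size=N)
Z = rng.normal(size=N)           # independent noise
Y_indep = Z                      # independent of X
Y_dep = X + 0.5 * Z              # built from X, hence dependent on it
print(np.cov(X, Y_indep)[0, 1])  # ≈ 0
print(np.cov(X, Y_dep)[0, 1])    # ≈ 1 (equals Var[X] here)
```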
The covariance matrix is a very useful mathematical object, which we will unpack later.
One can continue to form third- and fourth-order moments, which can likewise be estimated from data. These are generally called skewness (third order) and kurtosis (fourth order); qualitatively, they measure the asymmetry of the PDF and the fraction of probability contained in its tails, respectively. Their estimators have standard deviations that shrink as \(1/\sqrt{n}\), just as the mean estimator did. We will not discuss them much further.
Consequences
In the above sections, we showed that the average \(M\) of many independent random variables \(X_1,\dots,X_n\) drawn from the same distribution has mean \(\langle X \rangle\) and standard deviation \(\sigma_X / \sqrt{n}\). But we stopped short of computing the full PDF \(P_M(x)\).
You might expect that \(P_M(m)\) depends on the details of \(P_X(x)\). But for large \(n\), the central limit theorem tells us this is not the case. This theorem is arguably one of the most important and most used theorems in statistics.
The central limit theorem (CLT) states that, under some light constraints, \(M\) is Gaussian-distributed in the limit of large \(n\), regardless of \(P_X(x)\). That is,
\[ P_M(m) \to \frac{1}{\sqrt{2\pi \sigma_X^2 / n}} \exp\left[ -\frac{\left( m - \langle X \rangle \right)^2}{2 \sigma_X^2 / n} \right]. \]
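The theorem is easy to see in simulation. A sketch assuming NumPy, averaging draws from a strongly skewed exponential distribution (for which \(\langle X \rangle = 1\) and \(\sigma_X = 1\)):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
# Each row is one experiment: the mean of n exponential draws.
M = rng.exponential(size=(20000, n)).mean(axis=1)
print(M.mean(), M.std())  # ≈ 1 and ≈ 1/sqrt(200) ≈ 0.0707
# The exponential has skewness 2, but the distribution of M is nearly
# symmetric, as the CLT predicts:
skew = ((M - M.mean()) ** 3).mean() / M.std() ** 3
print(skew)               # much smaller than 2
```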
What follows is one of many possible proofs, based on Fourier transforms. For the sake of simplicity, we work with the quantity \(S = nM\), whose PDF is
\[ P_S(s) = \int dx_1 \cdots dx_n\, P_X(x_1) \cdots P_X(x_n)\, \delta\!\left( s - \sum_{i=1}^{n} x_i \right). \]
This is \(P_X\) convolved with itself \(n\) times. The convolution theorem states that the Fourier transform of a real-space product of functions \(f(x)g(x)\) equals the convolution of their Fourier transforms \(\widetilde f(k)\) and \(\widetilde g(k)\), and conversely that the Fourier transform of a convolution equals the product of the Fourier transforms. Applying the latter to the convolution above gives
\[ \widetilde P_S(k) = \left[ \widetilde P_X(k) \right]^n. \]
Expanding \(\ln \widetilde P_X(k)\) to second order in \(k\) and inverting the transform then recovers the Gaussian distribution promised by the CLT.
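The convolution theorem itself can be verified with a discrete analogue. A sketch assuming NumPy, computing the distribution of the sum of three dice both by direct convolution and via a product of FFTs:

```python
import numpy as np

p = np.ones(6) / 6                    # PMF of one fair die
# Direct route: convolve the PMF with itself twice (sum of three dice).
p3 = np.convolve(np.convolve(p, p), p)
# Fourier route: transform once, raise to the third power, transform back.
N = len(p3)                           # 16 outcomes: sums 3 through 18
p3_fft = np.fft.ifft(np.fft.fft(p, N) ** 3).real
print(np.allclose(p3, p3_fft))        # True
```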
The last property we wish to note about variances is how to “propagate” variances through functions. Specifically, if we know \(\sigma_X\) and if \(Y = f(X)\), then what is \(\sigma_Y\)?
While there is a general solution, a common approximation is usually made instead. Suppose that \(P_X(x)\) is narrowly peaked around its mean, so that \(f(x)\) is slowly varying over the region where \(P_X(x)\) is large. Then we may approximate \(f(x)\) as linear in that region,
\[ f(x) \approx f(\langle X \rangle) + f'(\langle X \rangle) \left( x - \langle X \rangle \right), \]
which immediately gives
\[ \sigma_Y \approx \left| f'(\langle X \rangle) \right| \sigma_X. \]
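The linear propagation rule \(\sigma_Y \approx |f'(\langle X \rangle)|\, \sigma_X\) can be tested numerically. A sketch assuming NumPy, with the arbitrary choices \(f(x) = x^2\) and a narrowly peaked Gaussian for \(X\):

```python
import numpy as np

rng = np.random.default_rng(5)
# X is narrowly peaked: mean 2, standard deviation 0.01.
mu, sigma = 2.0, 0.01
X = rng.normal(mu, sigma, size=1_000_000)
Y = X ** 2
# Linear propagation predicts sigma_Y ≈ |f'(mu)| sigma_X = 2 * mu * sigma.
print(Y.std())   # ≈ 2 * 2.0 * 0.01 = 0.04
```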
The next section represents our first application of probability theory to statistics: we will use the tools we’ve learned to build a theory of fitting models to data.