Parameter Fitting and Uncertainties
In the last section, we reviewed a method for computing the posterior distribution \(P(\bm \theta | D)\) for the parameters \(\bm \theta\) of a model. Calculating this posterior fully solves the statistical problem, but the posterior is a clunky object — a function defined for all values of \(\bm \theta\). It is helpful to simplify it into two numbers per parameter: the best fit parameters and the uncertainties. For example, with best fit parameter \(\theta = 1\) and uncertainty \(0.5\), the result is \(1 \pm 0.5\).
One common definition of the best fit parameters and uncertainties is that they are the mean and standard deviation of \(P(\bm \theta | D)\) respectively. In cases with multiple parameters, the covariance matrix is the most general measure of uncertainties. We will discuss the meaning of covariances in the next section.
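In practice the posterior is often represented by samples. As a minimal sketch, assuming we already have posterior samples (for example from an MCMC run) stored as a NumPy array, these summaries could be computed like this; the sample values below are purely illustrative:

```python
import numpy as np

# Hypothetical posterior samples for two parameters (theta_1, theta_2),
# e.g. produced by an MCMC sampler; shape (n_samples, n_parameters).
# The mean and covariance below are purely illustrative.
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mean=[1.0, 2.0],
                                  cov=[[0.25, 0.10],
                                       [0.10, 0.50]],
                                  size=100_000)

best_fit = samples.mean(axis=0)              # posterior mean per parameter
uncertainties = samples.std(axis=0)          # posterior standard deviation per parameter
covariance = np.cov(samples, rowvar=False)   # full covariance matrix Sigma

print(best_fit)       # ~ [1.0, 2.0]
print(uncertainties)  # ~ [0.5, 0.71]
print(covariance)     # ~ [[0.25, 0.10], [0.10, 0.50]]
```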
This definition works well for symmetric posteriors, but asymmetry can make it less useful. Consider a posterior for the neutrino mass, for example. The posterior should be concentrated near zero because the neutrino is light, but it cannot extend to negative values because the mass must be positive. It may therefore be asymmetric, with a long tail towards higher masses. The previous definition would assign symmetric uncertainties on the neutrino mass equal to the standard deviation of the posterior. Another potential definition is that the best fit parameters are the mode of the posterior, and the uncertainty on some parameter \(\theta_i\) is the range of the most probable values of \(\theta_i\) that encloses 68% of the posterior. This is the 68% credible interval. In the neutrino mass case, this interval would be asymmetric, with a larger upper uncertainty. For example, if the mode occurred at \(\theta = 0.1\) and the range of values from 0.05 to 0.3 contained 68% of the posterior, then the result would be reported as \(\theta = 0.1^{+0.2}_{-0.05}\).
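One way to estimate the mode and a 68% credible interval from posterior samples is sketched below. The histogram-based mode, the shortest-interval recipe, and the gamma-distributed samples are all illustrative choices rather than a unique prescription:

```python
import numpy as np

rng = np.random.default_rng(1)
# Illustrative asymmetric posterior samples for a positive parameter theta
# (a gamma distribution stands in for e.g. a neutrino-mass-like posterior).
theta = rng.gamma(shape=2.0, scale=0.1, size=200_000)

# Estimate the mode from a histogram of the samples.
counts, edges = np.histogram(theta, bins=200)
peak = np.argmax(counts)
mode = 0.5 * (edges[peak] + edges[peak + 1])

# Shortest interval containing 68% of the samples (a 68% credible interval).
sorted_theta = np.sort(theta)
n_in = int(0.68 * len(sorted_theta))
widths = sorted_theta[n_in:] - sorted_theta[:-n_in]
i = np.argmin(widths)
lo, hi = sorted_theta[i], sorted_theta[i + n_in]

# Report asymmetric uncertainties relative to the mode.
print(f"theta = {mode:.3f} +{hi - mode:.3f} / -{mode - lo:.3f}")
```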
You might wonder why it’s OK to have two definitions of uncertainties. An important point, however, is that these definitions coincide when the posterior is Gaussian. So the usual approach is to use the first definition, which is numerically simpler, when the posterior is roughly Gaussian. In contexts where the posterior is highly non-Gaussian, the second definition is used.
A fit result for multiple parameters gives a covariance matrix \(\Sigma\), but what does covariance really mean? Covariances can be understood in terms of corner plots.
Consider a fit for two parameters. The (symmetric) covariance matrix has three unique entries,
\[
\Sigma = \begin{pmatrix} \sigma_1^2 & b \\ b & \sigma_2^2 \end{pmatrix},
\]
where \(\sigma_1\) and \(\sigma_2\) are the standard deviations of \(\theta_1\) and \(\theta_2\), and \(b\) is their covariance.
We can plot the posterior distribution in two dimensions as a function of \(\theta_1\) and \(\theta_2\). For a Gaussian posterior, contours of this distribution are ellipses. There’s a correspondence between the plot and the entries of \(\Sigma\): the width of the ellipse in the \(\theta_1\) direction is set by \(\sigma_1\), and likewise for the \(\theta_2\) direction, while \(b\) controls the orientation of the ellipse. In particular, the tilt of the ellipse with respect to the \(\theta_1\) and \(\theta_2\) axes is controlled by the correlation, defined as
\[
\rho = \frac{b}{\sigma_1 \sigma_2}.
\]
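As a small numerical sketch, the snippet below reads \(\sigma_1\), \(\sigma_2\), \(\rho\), and the tilt of the ellipse off an illustrative covariance matrix; the tilt formula is the standard expression for the major axis of a two-dimensional Gaussian:

```python
import numpy as np

# Illustrative covariance matrix Sigma for (theta_1, theta_2).
Sigma = np.array([[0.25, 0.15],
                  [0.15, 0.50]])

sigma_1 = np.sqrt(Sigma[0, 0])   # width in the theta_1 direction
sigma_2 = np.sqrt(Sigma[1, 1])   # width in the theta_2 direction
b = Sigma[0, 1]                  # covariance term

rho = b / (sigma_1 * sigma_2)    # correlation, always between -1 and 1

# Tilt of the ellipse's major axis relative to the theta_1 axis
# (standard result for a 2D Gaussian).
angle = 0.5 * np.arctan2(2.0 * b, Sigma[0, 0] - Sigma[1, 1])

print(f"sigma_1 = {sigma_1:.3f}, sigma_2 = {sigma_2:.3f}")
print(f"rho = {rho:.3f}, tilt = {np.degrees(angle):.1f} deg")
```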
Intuitively, a non-zero correlation or covariance tells us that two parameters are dependent in ways that their uncertainties \(\sigma\) alone do not show. If one parameter statistically fluctuates in one direction, the other is likely to fluctuate as well, in the same direction for positive correlation and the opposite direction for negative correlation.
TODO: Example corner plot
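As a sketch of what a corner plot contains, the snippet below draws the one-dimensional marginal posteriors on the diagonal and the two-dimensional joint posterior in the lower-left panel, using purely illustrative correlated samples:

```python
import numpy as np
import matplotlib.pyplot as plt

# Purely illustrative correlated posterior samples for (theta_1, theta_2).
rng = np.random.default_rng(2)
samples = rng.multivariate_normal([1.0, 2.0],
                                  [[0.25, 0.15],
                                   [0.15, 0.50]],
                                  size=50_000)

fig, axes = plt.subplots(2, 2, figsize=(6, 6))
axes[0, 1].axis("off")  # the upper-right panel is unused in a corner plot

# Diagonal panels: 1D marginal posteriors of each parameter.
axes[0, 0].hist(samples[:, 0], bins=100, density=True)
axes[1, 1].hist(samples[:, 1], bins=100, density=True)

# Lower-left panel: 2D joint posterior; its tilt reflects the correlation.
axes[1, 0].hist2d(samples[:, 0], samples[:, 1], bins=100)
axes[1, 0].set_xlabel(r"$\theta_1$")
axes[1, 0].set_ylabel(r"$\theta_2$")
axes[1, 1].set_xlabel(r"$\theta_2$")

plt.tight_layout()
plt.show()
```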
The graphical depiction of the previous section raises an important question. When two parameters are uncorrelated, their posterior looks like an ellipse aligned with the parameter axes. When they are correlated, the posterior rotates, but is still an ellipse. That seems to imply that there exists a different set of parameters in which the posterior is uncorrelated, and that these parameters lie along the axes of the ellipse.
This is a correct and generalizable statement. Mathematically, one can explain it in the following way. The covariance matrix \(\Sigma\) is symmetric, so it can be diagonalized as \(\Sigma = U^T \Lambda U\), where \(\Lambda\) is a diagonal matrix and \(U\) is an orthogonal matrix. If one defines new parameters \(\bm \eta = U \bm \theta\), the covariance matrix for the \(\bm \eta\) parameters is \(U \Sigma U^T = \Lambda\). Since \(\Lambda\) is diagonal, the different entries of \(\bm \eta\) are uncorrelated.
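A short numerical check of this statement, using an illustrative covariance matrix: diagonalize \(\Sigma\), rotate the samples into the \(\bm \eta\) basis, and confirm that their covariance matrix is (approximately) diagonal.

```python
import numpy as np

# Illustrative covariance matrix and posterior samples in the theta basis.
rng = np.random.default_rng(3)
Sigma = np.array([[0.25, 0.15],
                  [0.15, 0.50]])
theta = rng.multivariate_normal([0.0, 0.0], Sigma, size=200_000)

# np.linalg.eigh gives Sigma = V Lambda V^T with orthogonal V, so U = V^T.
eigvals, V = np.linalg.eigh(Sigma)
U = V.T

# Rotate every sample into the new parameters eta = U theta.
eta = theta @ U.T

print(np.cov(eta, rowvar=False))  # ~ diag(eigvals): the correlations are gone
print(eigvals)
```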
This is a very interesting result; it shows that covariance and correlation are statements about the particular parameters we chose to use. They can be removed if we choose different parameters, which is sometimes a very useful exercise. A further note is that if the posterior is Gaussian, the \(\bm \eta\) parameters are actually completely independent.
Even though transforming to \(\bm \eta\) removes correlations, the \(\bm \eta\) parameters are not magically better measured. Removing correlations is merely a computational simplification. One can see this by considering the total uncertainty of all the parameters, \(\sum_i \sigma_i^2 = \mathrm{tr}\ \Sigma\). It is a fact of linear algebra that orthogonal transformations like the one above do not change \(\mathrm{tr}\ \Sigma\). So all we have done is move uncertainty around, not remove it.
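A quick check of the trace statement, reusing the illustrative \(\Sigma\) from the previous sketch:

```python
import numpy as np

# The same illustrative Sigma as above; its trace is the total variance.
Sigma = np.array([[0.25, 0.15],
                  [0.15, 0.50]])
eigvals = np.linalg.eigvalsh(Sigma)

print(np.trace(Sigma))  # 0.75: summed variances of the theta parameters
print(eigvals.sum())    # 0.75: summed variances of the eta parameters
```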
As we’ve discussed, covariance highlights the dependence of the parameters on each other. But often only the standard deviations \(\sigma\) of the parameters are reported, not the covariances. You might wonder whether reporting just the variances is enough.
Large covariances between two parameters should not be ignored when you’re interested in both parameters, but the situation is more subtle when one of the parameters is uninteresting. For example, suppose we are fitting for the flux of a star observed with a telescope. The photons detected at the star’s position come both from the star itself and from a diffuse background pervading the image. The usual procedure is to fit two parameters: the star’s flux \(F\) and the background flux \(B\). If we do the fit, we’ll get uncertainties \(\sigma_F\) and \(\sigma_B\). We’ll also get a large (negative) covariance: if the true background is higher than we estimated, the true star flux must be lower, since together they must account for the fixed amount of light observed.
We must fit for \(B\) in order to get an accurate estimate of \(F\), which yields a two-dimensional covariance matrix. But we are not actually interested in \(B\) itself; one would say \(B\) is a nuisance parameter. The question is: is it OK to report only the uncertainty on \(F\), and not the uncertainty on \(B\) or its covariance with \(F\)? The answer is yes, for the following reason.
Since we don’t care about the value of \(B\), we want to report the marginal posterior \(P(F|D)\). We have the full posterior \(P((F,B)|D)\). The marginal posterior is defined as
\[
P(F|D) = \int P((F,B)|D)\, dB.
\]
This is a one-dimensional distribution, and its standard deviation is exactly \(\sigma_F\), the square root of the corresponding diagonal entry of \(\Sigma\). The uncertainty in \(B\) is already folded into \(\sigma_F\) by the joint fit, so reporting \(\sigma_F\) alone is sufficient.
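With posterior samples, marginalizing over \(B\) amounts to simply ignoring the \(B\) column. A sketch with an illustrative joint posterior for \((F, B)\):

```python
import numpy as np

rng = np.random.default_rng(4)
# Illustrative joint posterior samples for (F, B) with a strong negative
# covariance: a higher background B implies a lower star flux F.
Sigma = np.array([[1.0, -0.8],
                  [-0.8, 1.0]])
samples = rng.multivariate_normal([10.0, 5.0], Sigma, size=200_000)
F = samples[:, 0]  # keeping only the F column marginalizes over B

print(F.mean(), F.std())      # best fit and uncertainty on F alone
print(np.sqrt(Sigma[0, 0]))   # the same sigma_F read off the diagonal of Sigma
```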
So, you should report covariances between interesting parameters, but not for nuisance parameters.
In the next section, we’ll apply what we’ve learned so far in a simple example: Fitting a line to a set of data points.