
Finishing Parameter Fitting: Uncertainty on the Prediction

We concluded the previous post by stating that \(\sigma_a\) and \(\sigma_b\) don’t fully express the uncertainty of the fit, because of the nonzero covariance. But this merely raises a new question: how does one fully express that uncertainty? Listing an entire covariance matrix is certainly an option, but it’s clunky if the number of parameters is large, and it’s not very intuitive.

A graphical option is to translate the uncertainty on the fit parameters into uncertainty on the best fit line. This can be represented as an “error band” on a plot, which encloses all the lines that could have been drawn consistent with the posterior distribution. An example is shown in the interactive applet below.

Applet: A linear fit \(y=ax+b\) to simulated data with constant uncertainty \(\sigma\). Use the sliders to adjust the true values of \(a\), \(b\), \(\sigma\), and the number of data points \(N\). The dashed line represents the true model, while the solid line is the best-fit model extracted using the methods of the previous post. The yellow band shows the uncertainty on the fit derived in this post; it should contain the true model 68% of the time. In the bottom left is a tracker showing the percentage of times that the yellow band has enclosed the true line. The Y or N in parentheses indicates the status of the current band. If you keep clicking randomize, this number should approach 68. The text in the upper left shows the best fit parameters and uncertainties derived in the previous post.

How do we compute which lines are consistent with the posterior? In the previous post, we found that the posterior is a bivariate-Gaussian with mean given by the best fit parameters \(\hat a\) and \(\hat b\) and covariance matrix

\[\Sigma = \frac{\sigma^2}{N \mathrm{Var}\ x}\begin{pmatrix}1 & -\langle x \rangle \\ -\langle x \rangle & \langle x^2 \rangle\end{pmatrix}.\]
(1)
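To make Eq. 1 concrete, here is a minimal sketch in Python with illustrative simulated data. The estimators used for \(\hat a\) and \(\hat b\) are the standard least-squares formulas, which is what the Gaussian likelihood with constant \(\sigma\) reduces to; treat the specific numbers as placeholders.

```python
import numpy as np

# Illustrative simulated data with constant, known uncertainty sigma.
rng = np.random.default_rng(0)
N, a_true, b_true, sigma = 20, 1.5, 0.3, 0.5
x = rng.uniform(0, 10, size=N)
y = a_true * x + b_true + rng.normal(0, sigma, size=N)

# Best-fit parameters (least-squares estimators).
var_x = np.var(x)                                  # Var x = <x^2> - <x>^2
a_hat = np.cov(x, y, bias=True)[0, 1] / var_x      # slope
b_hat = np.mean(y) - a_hat * np.mean(x)            # intercept

# Posterior covariance matrix, Eq. 1.
Sigma = (sigma**2 / (N * var_x)) * np.array([
    [1.0,          -np.mean(x)],
    [-np.mean(x),   np.mean(x**2)],
])
```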
(I am using hats to refer to the values we determine from our fit, to distinguish them from the true values \(a\) and \(b\).) We could define “consistent with the posterior” to mean lying within the region around the best fit parameters that encloses 68% of the posterior, the two-dimensional analogue of a one-sigma interval. That is,
\[\begin{pmatrix} a - \hat a & b - \hat b \end{pmatrix} \Sigma^{-1} \begin{pmatrix} a - \hat a \\ b - \hat b \end{pmatrix} \leq 2.3.\]
(2)
This equation may seem unintuitive, so we will discuss it a little further. Let \(\bm \theta\) be a two-dimensional vector containing \(a\) and \(b\). Then we can rewrite the condition more compactly:
\[(\bm \theta - \hat {\bm \theta})^T \Sigma^{-1} (\bm \theta - \hat {\bm \theta}) \leq 2.3.\]
(3)
Note that this is actually in the exponent of the Gaussian expression for the posterior distribution:
\[P(\bm \theta) \propto \exp\left(-\frac{1}{2}(\bm \theta - \hat {\bm \theta})^T \Sigma^{-1} (\bm \theta - \hat {\bm \theta})\right).\]
(4)
Thus, the set of parameters where equality is satisfied in Eq. 2 is a “level curve” of the posterior: it gives constant \(P(\bm \theta)\). That constant value is \(e^{-2.3/2}\) of the maximum. Further calculation shows that this region encloses 68% of the posterior. Remember that in the post on parameter fitting, we mentioned that our parameter uncertainties enclose 68% of the posterior? Eq. 2 simply extends that same definition to the case of finding uncertainties on the best fit line.1
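If you would like to check the 68% figure numerically rather than through the incomplete Gamma function of the footnote, a quick Monte Carlo sketch (reusing a_hat, b_hat, Sigma, and rng from the snippet above) is to sample the posterior and count how many samples satisfy Eq. 2:

```python
# Sample the bivariate-Gaussian posterior and compute the fraction of samples
# falling inside the lambda <= 2.3 ellipse of Eq. 2.
theta_hat = np.array([a_hat, b_hat])
samples = rng.multivariate_normal(theta_hat, Sigma, size=100_000)

dev = samples - theta_hat                                      # deviations from the best fit
lam = np.einsum("ij,jk,ik->i", dev, np.linalg.inv(Sigma), dev) # Mahalanobis distance squared
print(np.mean(lam <= 2.3))                                     # ~0.68
```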

The next task is to convert this region of consistent best fit parameters into a band of consistent lines by finding the maximum and minimum values of \(y\) that these parameters can produce at each \(x\). This can be done with the method of Lagrange multipliers. That is, we set

\[\lambda = (\bm \theta - \hat {\bm \theta})^T \Sigma^{-1} (\bm \theta - \hat {\bm \theta}),\]
(5)
and we want to extremize \(y = ax + b\) with respect to \(a\) and \(b\) while satisfying \(\lambda = 2.3\). We do this by taking the gradient of \(y\) and setting it proportional to the gradient of \(\lambda\):
\[\begin{pmatrix} x \\ 1\end{pmatrix} = C \Sigma^{-1} (\bm \theta - \hat {\bm \theta})\]
(6)
where \(C\) is a constant. This and \(\lambda = 2.3\) constitute a system of equations for \(a\), \(b\), and \(C\). Solving it,
\[y = \hat y \pm \sqrt{2.3\begin{pmatrix}x & 1 \end{pmatrix} \Sigma\begin{pmatrix} x \\ 1\end{pmatrix}}\]
(7)
where \(\hat y = \hat a x + \hat b\).
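In code, Eq. 7 translates directly into an error band. The sketch below (again reusing the arrays defined earlier) evaluates the band over a grid of \(x\) values; filling the region between y_lower and y_upper reproduces the yellow band of the applet.

```python
# Evaluate the 68% error band of Eq. 7 on a grid of x values.
x_grid = np.linspace(x.min() - 2, x.max() + 2, 200)
V = np.stack([x_grid, np.ones_like(x_grid)], axis=1)          # each row is (x, 1)

y_hat = a_hat * x_grid + b_hat
half_width = np.sqrt(2.3 * np.einsum("ij,jk,ik->i", V, Sigma, V))

y_upper = y_hat + half_width
y_lower = y_hat - half_width
```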

This equation says that the \(y\)-values consistent with the posterior lie in a band around the best fit line \(\hat y\), whose half-width is the square root of a quadratic in \(x\); the band edges are hyperbolas that grow for large \(|x|\). As we would expect, the uncertainty on the best fit line increases far from the data. One can confirm this intuition by calculating that the band is narrowest at \(x = \langle x \rangle\), as shown below.
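Carrying out that calculation explicitly, substituting Eq. 1 into the quadratic form under the square root gives

\[\begin{pmatrix}x & 1 \end{pmatrix} \Sigma\begin{pmatrix} x \\ 1\end{pmatrix} = \frac{\sigma^2}{N \,\mathrm{Var}\ x}\left(x^2 - 2\langle x \rangle x + \langle x^2 \rangle\right) = \frac{\sigma^2}{N}\left(1 + \frac{(x - \langle x \rangle)^2}{\mathrm{Var}\ x}\right),\]

so the band attains its minimum half-width \(\sqrt{2.3}\,\sigma/\sqrt{N}\) at \(x = \langle x \rangle\) and widens as \(x\) moves away from the center of the data.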

Far from the data, the \(x^2\) term dominates so that \(y\) looks like a line. The slope of this line is \(\hat a \pm \sqrt{2.3\Sigma_{11}} = \hat a \pm \sqrt{2.3}\sigma / (\sigma_x\sqrt{N})\). This is proportional to our conclusion in the previous post that the uncertainty on the slope is \(\sigma_a = \sigma/ (\sigma_x\sqrt{N})\).2 Likewise, the value of the band at the \(y\)-intercept is \(\hat b \pm \sqrt{2.3\Sigma_{22}} = \hat b \pm \sqrt{2.3}\sigma \sqrt{\langle x^2 \rangle} / (\sigma_x\sqrt{N})\), again proportional to our uncertainty for \(b\).

This exercise of finding uncertainties on the best fit line therefore reproduces the information we found in the previous post, but it does so in a more intuitive and graphical way which also takes into account the covariance between the best fit parameters. An applet combining all the techniques we’ve applied so far to linear fits is provided at the beginning of the post. By clicking the randomize button and moving the sliders, you can confirm that our best fit solution is valid.

In these posts on linear fitting, we assumed a Gaussian likelihood with a linear model and equal uncertainty \(\sigma\) across the data. If any of these assumptions is violated, our results do not hold, and an analytical solution may not even be possible. In particular, most scientific models are nonlinear, and even those that are linear likely have uneven error bars. But even when these assumptions are violated, the fundamental Bayesian principles still hold. In the next post, we discuss the general case of a Gaussian likelihood and make no assumptions about the model or the uncertainties.

1 The number 2.3 comes from numerically solving for the value of \(\lambda\) such that the ellipse contains 68% of the posterior. The fraction of the posterior enclosed within a given \(\lambda\) works out to be \(\frac{\Gamma(d/2) - \Gamma(d/2,\lambda/2)}{\Gamma(d/2)}\), where \(d\) is the dimensionality of \(\bm \theta\) and \(\Gamma(x,y)\) is the upper incomplete Gamma function. Setting \(\lambda=1\) for \(d=1\) gives 68%. For \(d=2\), the value of \(\lambda\) that also gives 68% is 2.30 (the number we use here), and for increasing \(d\) we get 3.53, 4.72, 5.89, etc.
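The enclosed fraction above is just the chi-squared CDF with \(d\) degrees of freedom evaluated at \(\lambda\), so these thresholds can also be read off from the chi-squared quantile function. A quick sketch using scipy:

```python
from scipy.stats import chi2

# Mahalanobis-squared threshold lambda that encloses 68.27% (the one-sigma
# fraction) of a d-dimensional Gaussian, for d = 1..5.
for d in range(1, 6):
    print(d, chi2.ppf(0.6827, df=d))   # 1.00, 2.30, 3.53, 4.72, 5.89
```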
2 If the factor of \(\sqrt{2.3}\) difference between the trend of \(y\) and the uncertainty on \(a\) is confusing, consider this. The uncertainty on \(a\) was defined to enclose 68% of \(P(a|D)\). The uncertainty on \(y\) was defined to enclose 68% of \(P(a,b|D)\). Since a Gaussian in the two-dimensional space of \(a\) and \(b\) has more tail than in the one-dimensional space of \(a\), the interval enclosed for \(y\) was bigger. We chose to keep the uncertainties enclosing 68% of the PDF, allowing \(y\) to vary more than one standard deviation from \(\hat y\). If we had chosen the other route (which is sometimes done), we would reproduce the same uncertainties on \(a\) that the previous post gave, but our line would enclose only 39% of the posterior.