Fitting a line

We defined a new formalism in the introduction to Bayesian statistics and derived formulas for best fit parameters and uncertainties in the previous post. In this section, we’ll apply this formalism for the first time.

We’ll consider the simple case of a Gaussian likelihood with a linear model. The linear model is a rather special case, but a Gaussian likelihood is very commonly used.

Let’s name the data \(\{y_i\}\), where \(i\) indexes over the data points. A model will produce predictions \(y^*_i\) for the data—we won’t assume a linear model yet. Explicitly, a Gaussian likelihood is

\[\mathcal{L} = P(y_i | y^*_i) = \brackets{\prod_i\frac{1}{\sqrt{2\pi \sigma_i^2}}} \exp\parens{-\sum_i \frac{(y_i-y_i^*)^2}{2 \sigma_i^2}}.\]
(1)
We have assumed that each data point \(y_i\) is independent with some standard deviation \(\sigma_i\). Sometimes, the assumption of independent data points is removed, so that the likelihood is
\[\mathcal{L} = \frac{1}{\sqrt{(2\pi)^N \det \Sigma}} \exp\parens{-\frac{1}{2}\sum_i \sum_j(y_i-y_i^*)(\Sigma^{-1})_{ij}(y_j-y_j^*)}\]
(2)
where \(\Sigma\) is the covariance matrix of the data and \(N\) is the number of data points. But for this post we will stick to the first form.
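
To make the definitions concrete, here is a minimal sketch in Python of both log-likelihoods. The data arrays, predictions, and uncertainties are hypothetical placeholders, and working in log space simply avoids numerical underflow.

```python
import numpy as np

def log_likelihood_independent(y, y_pred, sigma):
    """Log of Eq. (1): independent Gaussian errors with standard deviations sigma_i."""
    norm = -0.5 * np.sum(np.log(2 * np.pi * sigma**2))   # log of the prefactor
    return norm - 0.5 * np.sum((y - y_pred)**2 / sigma**2)

def log_likelihood_correlated(y, y_pred, cov):
    """Log of Eq. (2): Gaussian errors with a full covariance matrix Sigma."""
    r = y - y_pred
    _, logdet = np.linalg.slogdet(cov)
    quad = r @ np.linalg.solve(cov, r)   # r^T Sigma^{-1} r without forming the inverse
    return -0.5 * (len(y) * np.log(2 * np.pi) + logdet + quad)

# hypothetical example: three independent points with unit uncertainties
y, y_pred = np.array([1.2, 1.9, 3.1]), np.array([1.0, 2.0, 3.0])
print(log_likelihood_independent(y, y_pred, np.ones(3)))
print(log_likelihood_correlated(y, y_pred, np.eye(3)))   # same answer when Sigma = I
```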

The \(\sigma_i\) standard deviations depend on the context of the fit. In the case of fitting data to histograms, the central limit theorem suggests \(\sigma_i = \sqrt{y_i}\) (see the post on variance and the CLT).

In practice, the log likelihood is more useful. In particular we can define the quantity

\[\chi^2 = - 2 \ln \frac{\mathcal{L}}{\mathcal{L}_0} = \sum_i \frac{(y_i - y^*_i)^2}{\sigma_i^2}\]
(3)
where \(\mathcal{L}_0\) is the largest possible value of \(\mathcal{L}\). In the case where \(\sigma_i = \sqrt{y_i}\),
\[\chi^2 = \sum_i \frac{(y_i - y^*_i)^2}{y_i}.\]
(4)
Some people refer to only the latter as \(\chi^2\), though many physicists use \(\chi^2\) to refer to both.
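
As a quick numerical check of Eq. (3), the sketch below (with hypothetical histogram-like counts, so that \(\sigma_i = \sqrt{y_i}\)) confirms that \(\chi^2\) equals \(-2\ln(\mathcal{L}/\mathcal{L}_0)\), where \(\mathcal{L}_0\) is the likelihood evaluated at \(y^*_i = y_i\).

```python
import numpy as np

# hypothetical counts with sigma_i = sqrt(y_i), as for a histogram fit
y      = np.array([25., 36., 49.])
y_pred = np.array([28., 33., 50.])
sigma  = np.sqrt(y)

def log_likelihood(y, y_pred, sigma):
    """Log of Eq. (1)."""
    return (-0.5 * np.sum(np.log(2 * np.pi * sigma**2))
            - 0.5 * np.sum((y - y_pred)**2 / sigma**2))

chi2 = np.sum((y - y_pred)**2 / y)   # Eq. (4)

# L_0 is the largest possible likelihood, reached when y_pred = y
print(chi2)
print(-2 * (log_likelihood(y, y_pred, sigma) - log_likelihood(y, y, sigma)))
# the two printed values agree
```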

This quantity is very important and has many special properties. For now, all we need is its connection to the log-likelihood: it is \(-2\) times the difference between the log-likelihood \(\ln \mathcal{L}\) and a constant \(\ln \mathcal{L}_0\), which is a normalization term roughly independent of the model. Best fit parameters and uncertainties do not require the likelihood to be normalized, so the \(\ln \mathcal{L}_0\) term will not matter; it only serves to simplify the formula for \(\chi^2\).

Now that we have defined our Gaussian likelihood, all we need is a prior to generate posteriors from the data. Let’s assume uniform priors \(\pi(\bm \theta) = \mathrm{const}\). As with \(\mathcal{L}_0\), we can dispense with normalizing \(\pi\), so that \(\ln \pi(\bm \theta) = 0\) is a perfectly reasonable definition of the prior. In this special case, the posterior is simply equal to the likelihood.

We can now inherit our definitions of best fit parameters and uncertainties from the previous post. Specifically, the best fit parameters are the mode of the posterior \(p\) (equivalently, of the likelihood \(\mathcal{L}\)) and can be found by setting

\[\frac{dp}{d\theta} = \frac{d\mathcal{L}}{d\theta} = \frac{d\ln\mathcal{L}}{d\theta} = \frac{d\chi^2}{d\theta} = 0\]
(5)
where all of the equalities are true for the best fit values.
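
For models more complicated than a straight line, this minimum is usually found numerically. Here is a minimal sketch using scipy.optimize.minimize on hypothetical data; the linear model is just a stand-in for whatever model is being fit.

```python
import numpy as np
from scipy.optimize import minimize

# hypothetical data with a constant, known sigma
x     = np.array([0., 1., 2., 3., 4.])
y     = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
sigma = 0.5

def chi2(theta):
    """Eq. (3) for a straight-line model y* = a x + b."""
    a, b = theta
    return np.sum((y - (a * x + b))**2 / sigma**2)

result = minimize(chi2, x0=[1.0, 0.0])   # minimize chi^2 starting from a rough guess
a_fit, b_fit = result.x
print(a_fit, b_fit)
```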

The uncertainties can also be simplified by treating the posterior, like the likelihood, as Gaussian in the parameters. The covariance matrix of a Gaussian is related to its matrix of second derivatives, evaluated at the best fit parameters, by

\[\frac{d^2\ln \mathcal{L}}{d\theta_i d\theta_j} = -(\Sigma^{-1})_{ij}.\]
(6)
You can work this out by twice differentiating the logarithm of a Gaussian posterior with covariance matrix \(\Sigma\) and mode at the best fit value. Using the \(\chi^2\) instead,
\[\frac{d^2(\chi^2)}{d\theta_i d\theta_j} = 2(\Sigma^{-1})_{ij}.\]
(7)

This trick of relating parameter uncertainties to the behavior of the likelihood at the mode is called “Laplace’s approximation.” It’s an approximation because it assumes a Gaussian posterior, which lets the curvature at the mode stand in for the spread of the whole distribution.
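
A minimal sketch of Laplace’s approximation in practice: estimate the Hessian of \(\chi^2\) by central finite differences, then apply Eq. (7) to turn it into a covariance matrix. The data, model, and evaluation point below are hypothetical; because this \(\chi^2\) is quadratic in the parameters, the Hessian is the same everywhere, so the exact evaluation point hardly matters.

```python
import numpy as np

# hypothetical data and straight-line model, as in the sketch above
x     = np.array([0., 1., 2., 3., 4.])
y     = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
sigma = 0.5

def chi2(theta):
    a, b = theta
    return np.sum((y - (a * x + b))**2 / sigma**2)

def hessian(f, theta, eps=1e-4):
    """Matrix of second derivatives of f, estimated by central finite differences."""
    theta = np.asarray(theta, dtype=float)
    n = len(theta)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            pp = theta.copy(); pp[i] += eps; pp[j] += eps
            pm = theta.copy(); pm[i] += eps; pm[j] -= eps
            mp = theta.copy(); mp[i] -= eps; mp[j] += eps
            mm = theta.copy(); mm[i] -= eps; mm[j] -= eps
            H[i, j] = (f(pp) - f(pm) - f(mp) + f(mm)) / (4 * eps**2)
    return H

theta_best = [2.0, 1.0]                                # stand-in for the best fit values
Sigma = 2 * np.linalg.inv(hessian(chi2, theta_best))   # Eq. (7): Sigma^{-1} = Hessian / 2
print(Sigma)
```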

The simple case we’ll consider is fitting a line \(y^*_i=ax_i + b\) to a set of \(N\) data points. We’ll assume \(\sigma_i = \sigma\) is constant across the data and is known. We will be able to solve exactly for the best fit values of \(a\) and \(b\) and for their covariance matrix. Let’s start by writing the \(\chi^2\)

\[\chi^2 = \sum_i \frac{(y_i - a x_i - b)^2}{\sigma^2}.\]
(8)
This \(\chi^2\) value can be expanded to separate the parameters
\[\chi^2 = \frac{N}{\sigma^2} \brackets{\langle y^2 \rangle + a^2 \langle x^2 \rangle - 2a \langle xy \rangle + b^2 - 2b \langle y \rangle + 2ab \langle x \rangle}.\]
(9)
The angle brackets represent averages over the collected data.
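
As a sanity check on the expansion, the snippet below (hypothetical data and trial parameter values) evaluates Eq. (8) directly and via the averages in Eq. (9).

```python
import numpy as np

# hypothetical data, trial parameters, and constant sigma
x = np.array([0., 1., 2., 3., 4.])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
a, b, sigma = 2.0, 1.0, 0.5
N = len(x)

direct   = np.sum((y - a * x - b)**2 / sigma**2)                             # Eq. (8)
expanded = (N / sigma**2) * (np.mean(y**2) + a**2 * np.mean(x**2)
                             - 2 * a * np.mean(x * y) + b**2
                             - 2 * b * np.mean(y) + 2 * a * b * np.mean(x))  # Eq. (9)
print(direct, expanded)   # identical up to floating point error
```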

The best fit parameters are given by the minimum of this \(\chi^2\), as we determined above, and we obtain them by setting the derivatives of \(\chi^2\) with respect to the parameters equal to zero. This gives the system of equations

\[\begin{pmatrix} \langle xy \rangle \\ \langle y \rangle \end{pmatrix} = \begin{pmatrix} \langle x^2 \rangle & \langle x \rangle \\ \langle x \rangle & 1 \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix}.\]
(10)
Fortunately, this matrix equation can be analytically solved for \(a\) and \(b\):
\[\begin{pmatrix} a \\ b \end{pmatrix} = \frac{1}{\langle x^2 \rangle - \langle x \rangle^2}\begin{pmatrix} 1 & -\langle x \rangle \\ -\langle x \rangle & \langle x^2 \rangle \end{pmatrix} \begin{pmatrix} \langle xy \rangle \\ \langle y \rangle \end{pmatrix}\]
(11)
or
\[a = \frac{\mathrm{Cov}(x,y)}{\mathrm{Var}\ x}, \qquad b = \frac{\langle x^2 \rangle\langle y \rangle - \langle x \rangle \langle xy \rangle}{\mathrm{Var}\ x},\]
(12)
where we have used the definitions of variance and covariance.
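
Eq. (12) translates directly into code. A minimal sketch on hypothetical data follows; the comparison with numpy’s unweighted least-squares fit (np.polyfit) is just a cross-check, valid here because \(\sigma\) is constant.

```python
import numpy as np

# hypothetical data (constant sigma, so it drops out of the best fit)
x = np.array([0., 1., 2., 3., 4.])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

var_x  = np.mean(x**2) - np.mean(x)**2                  # Var x
cov_xy = np.mean(x * y) - np.mean(x) * np.mean(y)       # Cov(x, y)

a = cov_xy / var_x                                                       # Eq. (12)
b = (np.mean(x**2) * np.mean(y) - np.mean(x) * np.mean(x * y)) / var_x   # Eq. (12)

print(a, b)
print(np.polyfit(x, y, 1))   # returns [slope, intercept]; agrees with (a, b)
```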

To compute uncertainties on the best fit values of \(a\) and \(b\), we need only differentiate \(\chi^2\) as we learned earlier in this post. The second derivative matrix, or Hessian matrix, is

\[\frac{\partial}{\partial \theta_i} \frac{\partial}{\partial \theta_j}\chi^2 = \frac{2N}{\sigma^2}\begin{pmatrix}\langle x^2 \rangle & \langle x \rangle \\ \langle x \rangle & 1\end{pmatrix}.\]
(13)
Inverting this matrix and multiplying by two gives the covariance matrix of the best fit parameters
\[\Sigma = \frac{\sigma^2}{N \mathrm{Var}\ x}\begin{pmatrix}1 & -\langle x \rangle \\ -\langle x \rangle & \langle x^2 \rangle\end{pmatrix}.\]
(14)
If we were reporting standard errors on \(a\) and \(b\), we would say \(\sigma_a = \sigma / (\sigma_x\sqrt{N})\) and \(\sigma_b = \sigma \sqrt{\langle x^2 \rangle} / (\sigma_x\sqrt{N})\), where \(\sigma_x = \sqrt{\mathrm{Var}\ x}\). The trend that uncertainties decrease as \(1/\sqrt{N}\) occurs for both parameters; we have also seen it in a simple example on population means and in the central limit theorem. It is a very common trend.
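
Continuing the same hypothetical example, Eq. (14) gives the full parameter covariance matrix and the standard errors directly. Note that it depends only on \(x\), \(\sigma\), and \(N\), not on the measured \(y\) values.

```python
import numpy as np

# hypothetical x values with a constant, known sigma
x     = np.array([0., 1., 2., 3., 4.])
sigma = 0.5
N     = len(x)

var_x = np.mean(x**2) - np.mean(x)**2

# Eq. (14): covariance matrix of the best fit (a, b)
Sigma = (sigma**2 / (N * var_x)) * np.array([[1.0,         -np.mean(x)],
                                             [-np.mean(x),  np.mean(x**2)]])

sigma_a = np.sqrt(Sigma[0, 0])   # = sigma / (sigma_x sqrt(N))
sigma_b = np.sqrt(Sigma[1, 1])   # = sigma sqrt(<x^2>) / (sigma_x sqrt(N))
print(sigma_a, sigma_b)
```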

Another note is that, while the best fit parameters were independent of \(\sigma\), the parameter uncertainties \(\sigma_a\) and \(\sigma_b\) were not. This is an important point; it is often the case that the models that predict \(\sigma\) are not very reliable. This calculation suggests that incorrect error models affect the parameter uncertainties, but not so much the best fit values.

Finally, we should note the covariance between \(a\) and \(b\): simply reporting \(\sigma_a\) and \(\sigma_b\) does not fully express the uncertainty in our fit. If \(\langle x \rangle \neq 0\), then \(a\) and \(b\) are also correlated. This is a sensible result. If the data is far from the \(y\)-axis (i.e. \(\langle x \rangle \gg 0\)), then the \(y\)-intercept \(b\) must be estimated by extrapolating the best fit line down to the \(y\)-axis, and a different value of \(a\) will induce a different value of \(b\). Thus we should expect covariance.
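
For instance, reading off Eq. (14), the correlation coefficient between the two parameters is

\[\rho_{ab} = \frac{\Sigma_{ab}}{\sigma_a \sigma_b} = -\frac{\langle x \rangle}{\sqrt{\langle x^2 \rangle}},\]
(15)
which vanishes only when \(\langle x \rangle = 0\) and approaches \(-1\) when the data sits far from the \(y\)-axis compared to its spread.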

We could finish the discussion here, since we have determined the best fit parameters and uncertainties which are generally the goal when doing a fit. However, we have not actually determined whether our fit is a good one. The most general way to do this is to plot the data and the best fit line and see if they match. It is helpful when doing this to express the fit uncertainties \(\sigma_a\) and \(\sigma_b\) as uncertainties on the best fit line, so that we can visually determine whether the points lie within the uncertainties or not. This is described in the next post.

It is sometimes possible to determine whether the model is a good fit from the \(\chi^2\) alone, without a figure. This is a much less general technique, because it relies on properties that are true only for the \(\chi^2\) and only in certain limits, such as a large number of data points. These are discussed in the post on \(\chi^2\) values, and more extensively in the posts on model comparison.