
Gaussian Likelihoods

This post deals with parameter fitting under a powerful and often accurate approximation: a Gaussian likelihood. To be specific, for a data set \(\{x_i\}\) containing \(N\) elements, we assume the likelihood to be

\[\mathcal{L} \propto \exp \left(-\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N(x_i - \mu_i) (\Sigma^{-1})_{ij}(x_j - \mu_j)\right)\]
(1)
where \(\mu_i\) represents the model prediction for data point \(i\) and \(\Sigma\) is the covariance matrix of the data, which is also predicted by the model.

In practice, we don’t even need the likelihood to be Gaussian everywhere. If it is roughly Gaussian near its maximum, the below methods will apply even if it is non-Gaussian in the tails.

It is convenient to take the logarithm of the likelihood, which is

\[\ln \mathcal{L} = -\frac{1}{2}(\bm x - \bm \mu)^T \Sigma^{-1} (\bm x - \bm \mu)\]
(2)
where we have switched to matrix notation and dropped the normalization term. The posterior is simply this likelihood times the prior; therefore, a uniform prior will give a Gaussian posterior. Even non-uniform, sufficiently wide priors will still give approximately Gaussian posteriors. Gaussianity is only violated if the priors are so narrow that their complex curvature disturbs the Gaussian likelihood. In such a case (which we will not further consider), the fit results will be dominated by the prior rather than the data.
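As a concrete illustration, here is a minimal sketch of how the log-likelihood of Eq. 2 might be evaluated with numpy. The data vector x, model prediction mu, and covariance Sigma are placeholders for whatever your problem supplies.

```python
import numpy as np

def log_likelihood(x, mu, Sigma):
    """Gaussian log-likelihood of Eq. 2, up to an additive constant.

    x     : data vector, shape (N,)
    mu    : model prediction for the data, shape (N,)
    Sigma : data covariance matrix, shape (N, N)
    """
    r = x - mu
    # Solve Sigma @ y = r rather than forming the inverse explicitly;
    # this is cheaper and better behaved numerically.
    y = np.linalg.solve(Sigma, r)
    return -0.5 * r @ y
```

The log-posterior is then this quantity plus the logarithm of the prior evaluated at the same parameters.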

Since we've argued that the posterior is approximately Gaussian, we may use the same estimates for the parameter uncertainties that we used in the case of a linear fit. Specifically, the covariance matrix of the posterior has components

\[(\Sigma^{-1})_{ij} = -\left.\frac{\partial}{\partial \theta_i}\frac{\partial}{\partial \theta_j}\ln P(\bm \theta|D)\right |_{\hat{\bm \theta}}\]
(3)
where \(P(\bm \theta|D)\) is the posterior and \(\hat{\bm \theta}\) is the best fit value of \(\bm \theta\) (the maximum of the posterior). You can verify this by plugging a multidimensional Gaussian into the posterior and evaluating the second derivative.
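To sketch that verification: a Gaussian posterior with mean \(\hat{\bm \theta}\) and covariance \(\Sigma\) has

\[\ln P(\bm \theta|D) = \mathrm{const} - \frac{1}{2}(\bm \theta - \hat{\bm \theta})^T \Sigma^{-1} (\bm \theta - \hat{\bm \theta}),\]

and differentiating twice with respect to \(\theta_i\) and \(\theta_j\) gives \(-\partial_i \partial_j \ln P = (\Sigma^{-1})_{ij}\), as claimed.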

The problem of fitting is now reduced to computing this posterior maximum and the second derivatives. But as mentioned in the introduction, the model \(\bm \mu\) is potentially very complicated, and these might not be computable analytically. We will use numerical techniques to compute them instead.

Let's assume we've already found \(\hat{\bm \theta}\) and now wish to evaluate \(\Sigma\), the parameter uncertainties. Its inverse, the matrix of second derivatives (or Hessian) of \(-\ln P\), can be computed numerically via the finite difference method

\[(\Sigma^{-1})_{ij} = -\frac{\ln P(\hat{\bm \theta} + \epsilon\bm{\hat i} + \epsilon\bm{\hat j}|D) + \ln P(\hat{\bm \theta} - \epsilon\bm{\hat i} - \epsilon\bm{\hat j}|D) - \ln P(\hat{\bm \theta} + \epsilon\bm{\hat i} - \epsilon\bm{\hat j}|D) - \ln P(\hat{\bm \theta} - \epsilon\bm{\hat i} + \epsilon\bm{\hat j}|D)}{4\epsilon^2}\]
(4)
where \(\epsilon\) is a small number and \(\bm{\hat i}\), \(\bm{\hat j}\) are unit vectors along parameter directions \(i\) and \(j\).

The idea behind this formula is that the log-posterior \(\ln P(\hat{\bm \theta} + \epsilon\cdots|D)\) can be Taylor-expanded near \(\hat{\bm \theta}\), giving constant, linear, quadratic, and higher-order terms in \(\epsilon\). By design, the constant and linear terms cancel in this formula. If \(\epsilon\) is sufficiently small, then the higher-order terms are negligible and the result is simply the desired second derivative.
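To see the cancellation explicitly, write \(g \equiv -\ln P\) and expand each evaluation to second order in \(\epsilon\):

\[g(\hat{\bm \theta} + a\epsilon\bm{\hat i} + b\epsilon\bm{\hat j}) \approx g(\hat{\bm \theta}) + \epsilon\left(a\,\partial_i g + b\,\partial_j g\right) + \frac{\epsilon^2}{2}\left(\partial_i^2 g + 2ab\,\partial_i\partial_j g + \partial_j^2 g\right), \qquad a, b = \pm 1.\]

Substituting the four sign combinations of Eq. 4, the constant, linear, and pure second-derivative terms all cancel, while the cross terms add up to \(4\epsilon^2\,\partial_i\partial_j g\); dividing by \(4\epsilon^2\) leaves exactly the mixed second derivative, which by Eq. 3 is \((\Sigma^{-1})_{ij}\).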

However, if \(\epsilon\) is too small, one runs into numerical round-off problems: the differences in the numerator become comparable to the floating-point noise in the posterior evaluations. Striking the balance between too small and too large requires some care; no single value of \(\epsilon\) works in all circumstances. One can tune it by hand, or use one of the many software packages designed to try a few values of \(\epsilon\) and choose the best. The second derivative is then evaluated for every combination of \(i\) and \(j\), and the resulting Hessian matrix is inverted to get \(\Sigma\).
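Putting Eqs. 3 and 4 together, a bare-bones sketch of the procedure might look like the following. Here neg_log_post is a placeholder for whatever function computes your \(-\ln P(\bm \theta|D)\), and the fixed eps is a simplification; as discussed above, a real implementation would tune it.

```python
import numpy as np

def covariance_from_hessian(neg_log_post, theta_hat, eps=1e-5):
    """Estimate the parameter covariance matrix by finite-differencing
    -ln P around the best fit theta_hat (Eqs. 3 and 4)."""
    d = len(theta_hat)
    hessian = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei = np.zeros(d); ei[i] = eps   # step along parameter i
            ej = np.zeros(d); ej[j] = eps   # step along parameter j
            hessian[i, j] = (
                neg_log_post(theta_hat + ei + ej)
                + neg_log_post(theta_hat - ei - ej)
                - neg_log_post(theta_hat + ei - ej)
                - neg_log_post(theta_hat - ei + ej)
            ) / (4 * eps**2)
    # The Hessian of -ln P is the inverse covariance (Eq. 3), so invert it.
    return np.linalg.inv(hessian)
```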

Altogether, computing \(\Sigma\) is not very difficult, since the window of acceptable values of \(\epsilon\) is generally large. The greater difficulty is finding the best fit values \(\hat{\bm \theta}\) at which to evaluate it.

Numerically maximizing a function (or minimizing \(-1\) times the function) is a common task, and many algorithms exist to do it. In my experience, it is not necessary to understand the algorithms in detail because they have already been implemented in myriad programming languages. However, it is important to understand the strengths and weaknesses of each, so that you can choose the one that is likely to work.

I'll present three algorithms, useful in different circumstances. Briefly, they are gradient descent (the simplest, which just follows the downhill gradient), BFGS (which builds up an estimate of the Hessian to head more directly toward the minimum), and Nelder–Mead (derivative-free and slower, but robust on rough functions).

Now for a brief summary of each, in increasing order of complexity. To minimize a function \(f(\bm x)\), all of these methods start from a guessed point \(\bm x_0\) believed to be close to the minimum. They then update the point to \(\bm x_1\), \(\bm x_2\), etc. until the minimum is reached.

Gradient descent

Gradient descent updates according to

\[\bm x_{i+1} = \bm x_i - \gamma \nabla f(\bm x_i)\]
(5)
where \(\gamma\) is a number chosen by hand. This update step moves downwards in \(f\) simply by following the gradient, which can be computed numerically or analytically. If the gradient is known analytically, the simplicity of this update formula makes it usable when \(\bm x\) is very high-dimensional, unlike the other methods. Gradient descent is inefficient if \(\gamma\) is too small, because the steps are too short to make progress. It is also inefficient if \(\gamma\) is too big, because the algorithm might skip over the minimum. \(\gamma\) must therefore be tuned, or an efficient value estimated using one of many methods. A final problem, however, is that the gradient may lead the algorithm on a curvy, inefficient path.
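As a minimal sketch (with a hand-picked \(\gamma\), a central-difference gradient, and a fixed number of steps standing in for a proper convergence test):

```python
import numpy as np

def gradient_descent(f, x0, gamma=0.1, eps=1e-6, steps=1000):
    """Minimize f by repeatedly stepping along the negative gradient (Eq. 5)."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        # Central-difference estimate of the gradient of f at x.
        grad = np.zeros_like(x)
        for k in range(len(x)):
            step = np.zeros_like(x)
            step[k] = eps
            grad[k] = (f(x + step) - f(x - step)) / (2 * eps)
        x = x - gamma * grad
    return x

# Example: a simple bowl-shaped function with its minimum at (1, -2).
print(gradient_descent(lambda x: (x[0] - 1)**2 + (x[1] + 2)**2, x0=[0.0, 0.0]))
```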

BFGS

BFGS uses additional information about the function (specifically the Hessian) to go more directly towards the minimum. It can be understood as an application of Newton’s method for finding the root of a function. Newton’s method finds the root of a scalar function \(g(x)\) by drawing a line tangent to \(g(x)\) at \(x=x_i\), and setting \(x_{i+1}\) equal to the x-intercept of that line.

\[x_{i+1} = x_i - \frac{g(x_i)}{g'(x_i)}.\]
(6)
This is related to minimization because the minimum of \(f(\bm x)\) occurs at \(\nabla f(\bm x) = 0\). For minimizing a one-dimensional function, the last term of the update step should therefore be replaced with \(\frac{f'(x_i)}{f''(x_i)}\). For a multidimensional function, the correct approach is
\[\bm x_{i+1} = \bm x_i - H(\bm x_i)^{-1} \nabla f(\bm x_i).\]
(7)
where \(H(\bm x_i)\) is the Hessian matrix of \(f(\bm x_i)\).
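In one dimension this is only a few lines of code. The sketch below assumes you can supply the first and second derivatives; the example function is an arbitrary quadratic, chosen only because its minimum is known to be at \(x = 2\).

```python
def newton_minimize_1d(fprime, fprime2, x0, steps=20):
    """Minimize a 1-d function given its first and second derivatives,
    using the update x -> x - f'(x) / f''(x)."""
    x = x0
    for _ in range(steps):
        x = x - fprime(x) / fprime2(x)
    return x

# Example: f(x) = (x - 2)**2 + 1, with f'(x) = 2*(x - 2) and f''(x) = 2.
# For a quadratic, Newton's method lands on the minimum in a single step.
print(newton_minimize_1d(lambda x: 2 * (x - 2), lambda x: 2.0, x0=10.0))
```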

What I have described so far is Newton's method for minimization. The BFGS method uses an additional trick to estimate the inverse of the Hessian so that you don't have to evaluate it numerically. The process is a little involved, but in essence it works by keeping track of changes in the numerically computed gradients \(\nabla f(\bm x_0), \nabla f(\bm x_1), \dots\) as the algorithm progresses; these changes encode information about the Hessian. If \(f(\bm x)\) is expensive to compute, this automatic estimate can save a lot of time. Most implementations return the estimated inverse Hessian at the end of the run, so you can convert it into parameter uncertainties without requiring Eq. 4.
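In practice you rarely code BFGS yourself. The sketch below uses scipy's implementation on a toy two-parameter Gaussian "posterior" (so the exact answer is known); for a real problem you would swap in your own \(-\ln P(\bm \theta|D)\).

```python
import numpy as np
from scipy.optimize import minimize

# Toy negative log-posterior: a two-parameter Gaussian centered on (0, 0)
# with a known covariance matrix. Replace this with your own -ln P(theta|D).
true_cov = np.array([[1.0, 0.3],
                     [0.3, 0.5]])
true_cov_inv = np.linalg.inv(true_cov)

def neg_log_post(theta):
    return 0.5 * theta @ true_cov_inv @ theta

result = minimize(neg_log_post, x0=[3.0, -2.0], method="BFGS")

theta_hat = result.x              # best fit parameters (here, close to [0, 0])
param_cov = result.hess_inv       # BFGS's estimate of the inverse Hessian,
                                  # i.e. the parameter covariance matrix
param_err = np.sqrt(np.diag(param_cov))   # 1-sigma uncertainties
```

The BFGS Hessian estimate is only approximate, so for a real problem it is worth cross-checking result.hess_inv against the finite-difference estimate of Eq. 4.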

Nelder–Mead

A potential problem with BFGS is that, even though the Hessian estimate lets it head toward the minimum of \(f\) more directly than gradient descent, functions whose Hessian varies strongly from place to place make the method inefficient, because the algorithm cannot keep up with those variations. Nelder–Mead is a more stable, albeit slower, method which doesn't rely on computing derivatives at all. It starts by evaluating \(f(\bm x)\) at an ensemble of \(n\) points around \(\bm x_0\) and evolves the ensemble based on the values of \(f\). If \(\bm x\) is two-dimensional, \(n\) is chosen to be three so that the ensemble forms a triangle. For three-dimensional \(\bm x\), \(n=4\) gives a tetrahedron. In general, \(n=d+1\), where \(d\) is the dimension of \(\bm x\), so that the ensemble forms a simplex, the simplest polytope one can create in that space.

Every step, the worst point in the ensemble is replaced with a new point chosen based on the other points; the new point can have the effect of reflecting, expanding, or contracting the ensemble. The algorithm terminates when all the points in the ensemble have approximately the same function value. The path taken by this method is often not terribly efficient, but compared to the other methods it stands out in its ability to navigate rough terrain.
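scipy's implementation can be used as a drop-in replacement for BFGS. The sketch below reuses the toy neg_log_post and the covariance_from_hessian routine from the earlier examples; since Nelder–Mead produces no Hessian estimate, the parameter uncertainties come from Eq. 4 instead.

```python
from scipy.optimize import minimize

# Nelder-Mead needs only function values, never derivatives.
result = minimize(neg_log_post, x0=[3.0, -2.0], method="Nelder-Mead")
theta_hat = result.x

# No Hessian estimate is returned, so fall back on the finite-difference
# formula of Eq. 4 to get the parameter covariance.
param_cov = covariance_from_hessian(neg_log_post, theta_hat)
```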

Gaussian likelihoods (and therefore Gaussian posteriors) are so common and so powerful that I recommend doing a fit using the methods outlined here in most circumstances. However, it's common in the literature to do away with the assumption of Gaussianity and solve for the entire posterior distribution directly. The technique for doing this fully general Bayesian fitting is Markov chain Monte Carlo, discussed in the next post. It allows one to calculate the covariance matrix \(\Sigma\) directly from the posterior rather than from the Hessian evaluated at the maximum. It can also be used to compute the higher-order moments of the posterior, which reveal non-Gaussian structure that the covariance matrix cannot capture. Its disadvantage is that it is orders of magnitude slower than the minimization methods presented here. If you find yourself limited by computing time, consider assuming Gaussian likelihoods.