Gaussian Likelihoods
This post deals with parameter fitting under a powerful and often accurate approximation: a Gaussian likelihood. To be specific, for a data set \(\{x_i\}\) containing \(N\) elements, we assume the likelihood to be

\[
\mathcal{L}(\bm \theta) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi \sigma_i^2}} \exp\left[-\frac{(x_i - \mu_i(\bm \theta))^2}{2\sigma_i^2}\right],
\]

where \(\bm \mu(\bm \theta)\) is the model prediction for the data given parameters \(\bm \theta\), and \(\sigma_i\) are the measurement uncertainties.
In practice, we don’t even need the likelihood to be Gaussian everywhere. If it is roughly Gaussian near its maximum, the below methods will apply even if it is non-Gaussian in the tails.
It is convenient to take the logarithm of the likelihood, which is

\[
\ln \mathcal{L}(\bm \theta) = -\frac{1}{2} \sum_{i=1}^{N} \frac{(x_i - \mu_i(\bm \theta))^2}{\sigma_i^2} + \text{const},
\]

where the constant is independent of the parameters.
Since we’ve proved that the posterior is approximately Gaussian, we may use the same estimates for the parameter uncertainties that we used in the case of a linear fit. Specifically, the covariance matrix of the posterior has components

\[
\Sigma_{ij} = \left(H^{-1}\right)_{ij}, \qquad H_{ij} = -\left.\frac{\partial^2 \ln P(\bm \theta|D)}{\partial \theta_i\, \partial \theta_j}\right|_{\bm \theta = \hat{\bm \theta}},
\]

where \(\hat{\bm \theta}\) is the location of the posterior maximum.
The problem of fitting is now reduced to computing this posterior maximum and the second derivatives. But as mentioned in the introduction, the model \(\bm \mu\) is potentially very complicated, and these might not be computable analytically. We will use numerical techniques to compute them instead.
Let’s assume we’ve already found \(\hat{\bm \theta}\) and now wish to evaluate \(\Sigma\), the parameter uncertainties. The second derivative matrix (or Hessian) whose inverse it equals can be computed numerically via the finite difference method:

\[
H_{ij} \approx -\frac{\ln P(\hat{\bm \theta} + \epsilon \bm e_i + \epsilon \bm e_j|D) - \ln P(\hat{\bm \theta} + \epsilon \bm e_i - \epsilon \bm e_j|D) - \ln P(\hat{\bm \theta} - \epsilon \bm e_i + \epsilon \bm e_j|D) + \ln P(\hat{\bm \theta} - \epsilon \bm e_i - \epsilon \bm e_j|D)}{4\epsilon^2},
\]

where \(\bm e_i\) is the unit vector along parameter \(i\) and \(\epsilon\) is a small step size.
The idea behind this formula is that the posterior \(P(\hat{\bm \theta} + \epsilon\cdots|D)\) can be Taylor-expanded near \(\hat{\bm \theta}\) giving several terms:
- a constant term
- a term proportional to \(\epsilon\) times a first derivative of the posterior
- a term proportional to \(\epsilon^2\) times a second derivative
- higher order terms.

The combination of posterior evaluations in the numerator is chosen so that the constant and first-derivative terms cancel, leaving only the second derivative plus higher-order corrections that shrink as \(\epsilon\) does. Making \(\epsilon\) small therefore suppresses the contamination from the higher-order terms.
However, if \(\epsilon\) is too small, floating-point round-off error dominates the differences being computed. The delicate balance between small and large is difficult to strike; no single value of \(\epsilon\) works in all circumstances. One can tune it by hand, or use one of the many software packages designed to try a few values of \(\epsilon\) and choose the best. The second derivative is then evaluated for every combination of \(i\) and \(j\), and the resulting Hessian matrix is inverted to get \(\Sigma\).
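As a concrete illustration, here is a minimal sketch of the finite-difference Hessian in Python. The function name `hessian`, the quadratic test function, and the fixed step size `eps` are my own choices for the example; a production code would tune \(\epsilon\) as described above.

```python
import numpy as np

def hessian(f, theta_hat, eps=1e-5):
    """Estimate the Hessian of f at theta_hat by central finite differences.

    f takes a 1-D array and returns a scalar (e.g. the negative log-posterior);
    eps is the finite-difference step, which may need tuning by hand.
    """
    d = len(theta_hat)
    H = np.empty((d, d))
    for i in range(d):
        for j in range(d):
            ei = np.zeros(d); ei[i] = eps
            ej = np.zeros(d); ej[j] = eps
            H[i, j] = (f(theta_hat + ei + ej) - f(theta_hat + ei - ej)
                       - f(theta_hat - ei + ej) + f(theta_hat - ei - ej)) / (4 * eps**2)
    return H

# Illustrative quadratic whose Hessian is known exactly: [[2, 3], [3, 10]].
f = lambda t: t[0]**2 + 3 * t[0] * t[1] + 5 * t[1]**2
H = hessian(f, np.array([0.0, 0.0]))
Sigma = np.linalg.inv(H)  # parameter covariance matrix
```

For a quadratic the second derivatives are constant, so the estimate is essentially exact here; for a real posterior the quality of the result depends on the choice of `eps`.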
Altogether, computing \(\Sigma\) is not very difficult, since the window of acceptable values of \(\epsilon\) is generally large. The greater difficulty is finding the best fit values \(\hat{\bm \theta}\) at which to evaluate it.
Numerically maximizing a function (or minimizing \(-1\) times the function) is a common task, and many algorithms exist to do it. In my experience, it is not necessary to understand these algorithms in detail, because they have already been implemented in a myriad of programming languages. However, it is important to understand the strengths and weaknesses of each, so that you can choose the one most likely to work for your problem.
I’ll present three algorithms, useful in different circumstances. Briefly, they are
- Broyden–Fletcher–Goldfarb–Shanno (BFGS): Useful when you have a well-behaved posterior (e.g. roughly Gaussian everywhere). This is the fastest method; if in doubt, you may want to try it first. It also provides a good estimate of the Hessian matrix automatically.
- Nelder–Mead: Useful when your posterior is poorly behaved, e.g. containing strong correlations between the parameters or lumps that confuse the minimizer. This method is not as fast as BFGS.
- Gradient Descent: Useful when you have many parameters (think hundreds) and most other minimization methods become impractical. Gradient descent is most useful if you know the gradient of your posterior analytically.
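All three methods (or close relatives) are available in off-the-shelf libraries. As a sketch of what calling one looks like, here is SciPy’s `scipy.optimize.minimize` applied to a made-up two-parameter negative log-posterior; the function itself is purely illustrative.

```python
import numpy as np
from scipy.optimize import minimize

# Made-up negative log-posterior; in a real fit this would be built from
# your data, model, and measurement uncertainties.
def neg_log_post(theta):
    return 0.5 * ((theta[0] - 1.0) ** 2 + 10.0 * (theta[1] + 2.0) ** 2)

x0 = np.zeros(2)  # initial guess for the parameters

# Same interface, different algorithm: just swap the method string.
res_bfgs = minimize(neg_log_post, x0, method="BFGS")
res_nm = minimize(neg_log_post, x0, method="Nelder-Mead")

# Both should land near the true posterior maximum at (1, -2).
```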
Now for a brief summary of each, in increasing order of complexity. To minimize a function \(f(\bm x)\), all of these methods start by guessing some point \(\bm x_0\) believed to be close to the minimum. They then update the point to \(\bm x_1\), \(\bm x_2\), etc., until the minimum is reached.
Gradient descent
Gradient descent updates according to

\[
\bm x_{i+1} = \bm x_i - \gamma\, \nabla f(\bm x_i),
\]

where \(\gamma\) is a small, positive step size. Since the gradient points in the direction of steepest ascent, stepping against it decreases \(f\).
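A bare-bones version of this update loop, with an invented objective and an analytically known gradient (the names and the fixed step size are my own choices):

```python
import numpy as np

def gradient_descent(grad_f, x0, gamma=0.1, steps=1000):
    """Repeatedly step against the gradient of f.

    grad_f: function returning the gradient of f at a point (here analytic).
    gamma: step size; must be small enough for the iteration to converge.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - gamma * grad_f(x)
    return x

# Minimize f(x, y) = (x - 1)^2 + 2*y^2, whose gradient is (2(x - 1), 4y).
grad_f = lambda x: np.array([2.0 * (x[0] - 1.0), 4.0 * x[1]])
x_min = gradient_descent(grad_f, [5.0, 5.0])  # converges toward (1, 0)
```

In practice \(\gamma\) is often tuned or adapted on the fly; too large a value makes the iteration diverge, too small a value makes it crawl.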
BFGS
BFGS uses additional information about the function (specifically the Hessian) to go more directly towards the minimum. It can be understood as an application of Newton’s method for finding the root of a function. Newton’s method finds the root of a scalar function \(g(x)\) by drawing a line tangent to \(g(x)\) at \(x=x_i\), and setting \(x_{i+1}\) equal to the x-intercept of that line:

\[
x_{i+1} = x_i - \frac{g(x_i)}{g'(x_i)}.
\]

Applied to the gradient \(\nabla f\), whose root is a stationary point of \(f\), this becomes \(\bm x_{i+1} = \bm x_i - H^{-1} \nabla f(\bm x_i)\), where \(H\) is the Hessian of \(f\).
What I have described so far is Newton’s method for minimization. The BFGS method uses an additional trick to estimate the inverse of the Hessian so that you don’t have to numerically evaluate it. The process is a little involved, but in essence it works by keeping track of changes in the numerically-computed gradients \(\nabla f(x_0), \nabla f(x_1), \dots\) as the algorithm progresses. These changes describe the Hessian. If \(f(\bm x)\) is expensive to compute, this automatic calculation of the Hessian can save a lot of time. Most implementations output this estimated Hessian at the end of the algorithm, so you can convert it into parameter uncertainties without requiring Eq. 4.
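In SciPy, for example, the BFGS method returns this accumulated estimate as `res.hess_inv`, which can serve directly as \(\Sigma\). A sketch with a made-up quadratic negative log-posterior (invented for the example):

```python
import numpy as np
from scipy.optimize import minimize

# Made-up quadratic negative log-posterior: its true Hessian is diag(1, 4),
# so the true covariance (the inverse Hessian) is diag(1, 1/4).
def neg_log_post(theta):
    return 0.5 * (theta[0] ** 2 + 4.0 * theta[1] ** 2)

res = minimize(neg_log_post, np.array([3.0, -2.0]), method="BFGS")

Sigma = res.hess_inv              # BFGS's running estimate of the inverse Hessian
errors = np.sqrt(np.diag(Sigma))  # 1-sigma parameter uncertainties
```

Because this inverse Hessian is only the estimate BFGS accumulated along its path, it is approximate; for publication-quality uncertainties you may still prefer the finite-difference evaluation at \(\hat{\bm \theta}\).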
Nelder–Mead
A potential problem with BFGS is that, even though the Hessian estimate allows it to traverse to the minimum of \(f\) more directly than gradient descent, functions whose Hessian itself varies greatly make the method inefficient, because the algorithm is unable to keep track of these variations. Nelder–Mead is a more stable, albeit slower, method which doesn’t rely on computing derivatives at all. It starts by evaluating \(f(\bm x)\) at an ensemble of \(n\) points around \(\bm x_0\) and evolves the ensemble based on the values of \(f\). If \(\bm x\) is two-dimensional, \(n\) is chosen to be three so that the ensemble represents a triangle. For three-dimensional \(\bm x\), \(n=4\) represents a tetrahedron. In general, \(n=d+1\), where \(d\) is the dimension of \(\bm x\), so that the ensemble represents the simplest polytope one can create in that space.
Every step, the worst point in the ensemble is replaced with a new point, chosen based on the other points in the ensemble. The new point can have the effect of reflecting, expanding, or contracting the ensemble. The algorithm terminates when all the points in the ensemble are at approximately the same value. The path taken by this method is often not terribly efficient, but it stands out in its ability to navigate over rough terrain compared to the other methods.
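As an illustration of that robustness, here is a sketch in which Nelder–Mead minimizes a function with a kink at its minimum; the absolute values make the derivative discontinuous there, which can trip up gradient-based methods. The objective is invented for the example.

```python
import numpy as np
from scipy.optimize import minimize

# Invented "rough" objective: non-differentiable at the minimum (1, -2).
def rough(theta):
    return abs(theta[0] - 1.0) + abs(theta[1] + 2.0)

# Nelder-Mead only compares function values, so the kink poses no problem.
res = minimize(rough, np.zeros(2), method="Nelder-Mead")
# res.x should settle near the true minimum at (1, -2).
```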
Gaussian likelihoods (and therefore Gaussian posteriors) are so common and so powerful that I recommend doing a fit using the methods outlined here in most circumstances. However, it’s common in the literature to do away with the assumption of Gaussianity and solve for the entire posterior distribution directly. The technique for doing this fully general Bayesian fitting is Markov chain Monte Carlo, discussed in the next post. It allows one to calculate the covariance matrix \(\Sigma\) directly from the posterior rather than from the Hessian evaluated at the maximum. It can also be used to compute the higher-order moments of the posterior, which reveal non-Gaussian structure that is not present in the covariance matrix. Its disadvantage is that it is orders of magnitude slower than the minimization methods presented here. If you find yourself limited by computing time, consider assuming a Gaussian likelihood.