General Bayesian Fitting: Markov Chain Monte Carlo methods
We have gone through several blog posts about fitting now, from general theory to an example for linear functions to establishing uncertainties on the best fit and a generalization to Gaussian likelihoods. All of these posts invoked assumptions to keep the calculations tractable, but these assumptions can be (and often are) removed with the help of new algorithms, such as the Markov Chain Monte Carlo (MCMC) methods we’ll introduce here.
To understand MCMCs, we’ll have to approach parameter fitting in a completely new way, which deserves some introduction. Let’s start by stating the easy part. There’s nothing fundamentally hard about computing the posterior \(P(\bm \theta | D) \propto \mathcal{L}(\bm \theta) \pi(\bm \theta)\) (up to its normalization). We are simply multiplying two known functions: the likelihood and the prior. But practical issues arise in two forms. Likelihoods can be slow to compute, and one often wishes to know \(P(\bm \theta | D)\) at many values of \(\bm \theta\), so that one can be sure one has found the maximum and has fully understood the shape of the distribution. A brute-force grid is hopeless in more than a few dimensions: with 100 grid points per parameter, a 10-parameter model already requires \(100^{10}\) likelihood evaluations. The trick is to use an algorithm which spends its time calculating \(P(\bm \theta|D)\) only at the interesting points, i.e. near the maximum.
This efficient computation of \(P(\bm \theta |D )\) is one goal of an MCMC, which we will introduce in this post. But there is another goal too: to generate samples from \(P(\bm \theta|D)\). We haven’t thought much about sampling from distributions yet, since we’ve so far only fitted for the functional forms of our posteriors. But if the posterior is some awful many-parameter function, a functional form might not be useful or even desirable. Ultimately, we want to know things such as the mean, median, modes, and credible intervals of a distribution. To compute these from functional forms, you generally need to do integrals or invert functions, both of which are hard problems. But the tasks are easy if you have samples. The distribution mean is just the mean of all the samples; the median is the sample median; and the credible intervals come from the samples’ percentiles. If we really need the form of the posterior, we can always histogram the samples, and the resulting curve approximates \(P(\bm \theta | D)\). This act of understanding something by probabilistically generating samples from it is called a Monte Carlo method (the second MC).
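As a minimal sketch of how cheap these summaries become, suppose `samples` is a 1-D NumPy array of posterior draws for one parameter (the synthetic Gaussian draws below are just a stand-in for real MCMC output):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=0.5, size=10_000)  # stand-in for real MCMC draws

mean = samples.mean()                      # posterior mean
median = np.median(samples)                # posterior median
lo, hi = np.percentile(samples, [16, 84])  # central 68% credible interval

print(f"mean = {mean:.3f}, median = {median:.3f}, 68% interval = [{lo:.3f}, {hi:.3f}]")
```

No integrals, no function inversion: every summary statistic is a one-liner on the sample array.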
Since we need both to navigate our way to the maximum of a posterior and to generate samples from it, it makes sense to solve both problems with one algorithm: it can generate samples at the same time as it traverses the posterior. This is rather like the minimization algorithms we introduced in the previous post, except now we’re proposing to record the points visited by the algorithm as samples. The first few samples will depend on where the algorithm starts, but once the algorithm approaches the maximum they will start to reflect the true distribution of the posterior. This trick of recording the path of an algorithm through parameter space is the other MC: the list of samples generated is called a Markov Chain.
The challenge is to make sure that the algorithm really does visit points in proportion to the probability there, without favoring some over others. This is not true of any of the minimization algorithms we’ve discussed so far, but now we’ll discuss a simple new one which guarantees unbiased samples.
The Metropolis-Hastings algorithm is one of the simplest ways to generate correct MCMC samples from a distribution. It works like this.
Consider a single Markov chain, whose most recent point is \(\bm \theta_n\). You wish to generate the next point \(\bm \theta_{n+1}\), distributed according to the posterior \(P(\bm \theta)\). The algorithm proceeds by first choosing a random trial point \(\bm \phi\) (drawn, say, from a distribution centered on \(\bm \theta_n\) and symmetric about it) and then generating a uniformly random number \(r\) from 0 to 1. If \(r < P(\bm \phi)/P(\bm \theta_n)\), the trial point is accepted and \(\bm \theta_{n+1}\) is set to \(\bm \phi\). Otherwise, reject the point and set \(\bm \theta_{n+1}=\bm \theta_n\).
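Here is a minimal sketch of this update rule in Python. The Gaussian proposal and its width are assumptions for illustration, and the code works with log-probabilities, as real samplers do, to avoid numerical underflow:

```python
import numpy as np

def metropolis(log_p, theta0, n_steps=10_000, step_size=0.5, seed=0):
    """Sample from a distribution with (unnormalized) log density log_p."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    log_p_theta = log_p(theta)
    chain = []
    for _ in range(n_steps):
        # symmetric Gaussian proposal centered on the current point
        phi = theta + step_size * rng.standard_normal(theta.shape)
        log_p_phi = log_p(phi)
        # accept with probability min(1, P(phi)/P(theta))
        if np.log(rng.uniform()) < log_p_phi - log_p_theta:
            theta, log_p_theta = phi, log_p_phi
        chain.append(theta.copy())  # a rejected step repeats the old point
    return np.array(chain)

# example: sample a 2-D Gaussian posterior, starting far from the maximum
chain = metropolis(lambda t: -0.5 * np.sum(t**2), theta0=[3.0, -3.0])
print(chain[1000:].mean(axis=0))  # near [0, 0] after discarding burn-in
```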
If \(\bm \phi\) is more probable than \(\bm \theta_n\) (that is, \(P(\bm \phi) > P(\bm \theta_n)\)), the ratio exceeds 1 and this algorithm guarantees that \(\bm \phi\) is always accepted. But the power of the Metropolis-Hastings algorithm lies in the fact that it also has a chance of accepting less probable values. This is fundamentally different behavior from a minimizer, which has no reason ever to progress backwards in this way. The Metropolis-Hastings algorithm is trying to do something more: identify the shape of all of \(P(\bm \theta)\), and accepting a few low-probability values of \(\bm \phi\) ensures that the tails of the distribution are accurately modeled.
To prove that the Metropolis-Hastings algorithm is correct, imagine that we’re running lots of Markov chains at the same time and that they have reached steady state. The algorithm has converged (reached steady state) when the number of points entering some region \(d\bm \theta\) is equal to the number of points exiting \(d\bm \theta\). The number of points exiting a region is the number of points in that region, \(N(\bm \theta)\), times the probability to exit, and the Metropolis-Hastings algorithm dictates that the probability to exit to point \(\bm \phi\) is \(\min(P(\bm \phi)/P(\bm \theta),1)\) (the minimum appears because the probability to exit cannot be more than 1, even if \(P(\bm \phi) > P(\bm \theta)\)). Suppose \(P(\bm \phi) < P(\bm \theta)\). Then points at \(\bm \theta\) exit to \(\bm \phi\) with probability \(P(\bm \phi)/P(\bm \theta)\), while points at \(\bm \phi\) exit to \(\bm \theta\) with probability 1, and the steady-state condition is

\[
N(\bm \theta)\,\frac{P(\bm \phi)}{P(\bm \theta)} = N(\bm \phi). \tag{1}
\]
What value of \(N\) satisfies this condition? Simply \(N(\bm \theta) \propto P(\bm \theta)\). The chains will be distributed according to the target probability distribution! In fact, \(N(\bm \theta) \propto P(\bm \theta)\) is the only end-state that satisfies this condition for all pairs of points \(\bm \phi\) and \(\bm \theta\).
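Indeed, substituting \(N(\bm \theta) = c\,P(\bm \theta)\) into Eq. 1 gives

\[
c\,P(\bm \theta)\,\frac{P(\bm \phi)}{P(\bm \theta)} = c\,P(\bm \phi) = N(\bm \phi),
\]

which holds no matter which two points we pick.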
This conclusion proves that, when the Metropolis-Hastings algorithm has converged, its Markov chain contains correct samples. However, we haven’t proved that the algorithm converges at a reasonable rate. Much of the technology of complicated fitters goes into hastening the process of convergence, mostly by choosing clever trial values \(\bm \phi\). We’ll discuss some of these in the next section.
We are about to look at two different methods for choosing a trial point \(\bm \phi\) that is likely to be accepted. But first, it’s important to point out that we are allowed to do this. Messing with the trial point selection method is OK because it does not affect the steady-state condition Eq. 1, which is what gives the Metropolis-Hastings algorithm its validity (so long as the proposal remains symmetric between \(\bm \theta\) and \(\bm \phi\), or any asymmetry is compensated for in the acceptance ratio, as the full Metropolis-Hastings algorithm does). The only problem would occur if we made it impossible to pick \(\bm \phi\) from certain points. Then we would be guaranteeing that \(N(\bm \phi) = 0\) there, which is a serious error if \(P(\bm \phi)\) is large. The trick in general is to choose most trial points \(\bm \phi\) where \(P(\bm \phi)\) is large, with only a few taken from regions where \(P(\bm \phi)\) is small.
Method 1: Ensemble Samplers
The first method we’ll discuss is called an Affine Invariant Ensemble Sampler. It uses the current positions of the “ensemble” of all the Markov chains you’re running to inform the choice of \(\bm \phi\) by picking points near to or inside the ensemble. The “affine invariance” refers to the particular method of choosing \(\bm \phi\) from the locations of all the chains \(\bm \theta_i\), which works even in cases where \(P(\bm \theta)\) is highly correlated or much thinner in one axis than another. For most Gaussian-like posteriors, this MCMC method converges quickly. However, for multi-modal or otherwise non-Gaussian posteriors the “affine-invariant” selection method is not flexible enough to be efficient, and convergence is slower.
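As a sketch of how this looks in practice, the widely used emcee package implements the Goodman & Weare affine-invariant ensemble sampler; the toy log-posterior and walker setup below are assumptions for illustration:

```python
import numpy as np
import emcee

def log_posterior(theta):
    # toy highly correlated 2-D Gaussian posterior
    x, y = theta
    return -0.5 * (x**2 + (y - x)**2 / 0.1)

ndim, nwalkers = 2, 32
# start the walkers in a small ball; the ensemble spreads out on its own
p0 = 1e-3 * np.random.randn(nwalkers, ndim)

sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior)
sampler.run_mcmc(p0, 5000)

# discard burn-in and flatten all walkers into one set of samples
samples = sampler.get_chain(discard=1000, flat=True)
print(samples.mean(axis=0))
print(np.percentile(samples, [16, 84], axis=0))  # 68% credible intervals
```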
Method 2: Nested Sampling
A common alternative to ensemble samplers is nested sampling. Nested samplers are really a family of algorithms all based on one core principle: updating the chains in order of least to greatest likelihood, using a Metropolis-Hastings step to replace the lowest-likelihood chain with a new point of higher likelihood. This simple procedure guarantees that as the algorithm proceeds and the chains reach higher likelihood values, the whole ensemble closes in on the maximum. Furthermore, since the chains are guaranteed to occupy high-likelihood regions, trial points \(\bm \phi\) drawn from regions near other chains are more likely to be accepted. This principle makes nested samplers more reliable than ensemble samplers in some situations with non-Gaussian posteriors. A final benefit of nested sampling is that it provides a simple way of estimating the Bayesian evidence, which we will come to in a later post.
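A minimal sketch of that core loop is below. It assumes a flat prior over the region explored, and the fixed-length Metropolis walk used to find each replacement point is a simplification of what production nested samplers do:

```python
import numpy as np

def nested_sampling_sketch(log_like, prior_sample, n_live=100, n_iter=2000, seed=0):
    rng = np.random.default_rng(seed)
    live = np.array([prior_sample(rng) for _ in range(n_live)])
    logL = np.array([log_like(p) for p in live])
    dead = []
    for _ in range(n_iter):
        worst = np.argmin(logL)  # the chain with the lowest likelihood...
        dead.append((live[worst].copy(), logL[worst]))
        threshold = logL[worst]
        # ...is replaced by a copy of a random survivor, evolved by a short
        # Metropolis walk that only accepts points above the likelihood threshold
        idx = rng.integers(n_live)
        while idx == worst:
            idx = rng.integers(n_live)
        point, point_logL = live[idx].copy(), logL[idx]
        for _ in range(20):
            trial = point + 0.1 * rng.standard_normal(point.shape)
            trial_logL = log_like(trial)
            if trial_logL > threshold:
                point, point_logL = trial, trial_logL
        live[worst], logL[worst] = point, point_logL
    return dead  # discarded points, in order of increasing likelihood

# example: Gaussian likelihood with a broad uniform prior
dead = nested_sampling_sketch(
    log_like=lambda p: -0.5 * np.sum(p**2),
    prior_sample=lambda rng: rng.uniform(-5, 5, size=2),
)
print(dead[-1][1])  # the log-likelihood climbs toward the maximum at the origin
```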
MCMC methods provide a robust way of fitting models to data with as few assumptions about the posterior as possible. But it is important to note that they are also the least efficient. Analytically solving for the posterior is much faster when possible. When it is not, assuming the posterior is Gaussian reduces the problem to simply finding the maximum of the posterior, which can be done quickly by one of several numerical methods. Generating samples to map out the whole posterior, as an MCMC does, is still useful in many cases. But if you have to perform many fits, or a fit over many parameters, you should seriously consider other methods before committing to a days-long computation that will take several attempts to get right.