Bayesian Statistics
This post makes heavy use of Bayes’ theorem, which was proved in the first post.
When I say statistics, I mean the task of quantitatively turning data into constraints on physical parameters. To do this, one needs a model with parameters denoted by the vector \(\bm \theta\). The idea is that the model should predict the data \(D\) given \(\bm \theta\), or more specifically, it should predict the probability of observing \(D\):

\[
\mathcal{L}(\bm \theta) \equiv P(D \mid \bm \theta). \tag{1}
\]

This function is called the likelihood.
The likelihood could take many forms because it is model dependent. But there are two main frameworks for generating constraints from a given likelihood: the “frequentist” approach in common use in particle physics, and the “Bayesian” approach in common use in astrophysics. Here we discuss the Bayesian approach.
Everyone will agree that one can extract best-fit parameters from a data set with some uncertainty. A Bayesian analysis models these by proposing that \(\bm \theta\) really is described by a probability distribution \(P(\bm \theta | D)\), where we condition on \(D\) because we are interested in the probability of the parameters after one has observed the given data set. This distribution is called the posterior. A Bayesian analysis then usually reports the best-fit value of \(\bm \theta\) as the mean of \(P(\bm \theta | D)\), while the width of the distribution describes the uncertainty.
The posterior \(P(\bm \theta|D)\) is related to the likelihood \(P(D|\bm \theta)\) by Bayes’ rule:

\[
P(\bm \theta \mid D) = \frac{P(D \mid \bm \theta)\, P(\bm \theta)}{P(D)}. \tag{2}
\]
We have introduced two new PDFs, \(P(\bm \theta)\) and \(P(D)\). Respectively, these are the original PDF of the parameters before the data were observed, and the PDF of the data itself. In particular, \(P(\bm \theta)\), sometimes written \(\pi(\bm \theta)\), is called the prior, and \(P(D)\) is called the evidence.
In this and the next few posts, we will consider the typical situation where you have observed a data set and would like to know the full posterior distribution \(P(\bm \theta | D)\) as a function of \(\bm \theta\). In this case, we don’t have to compute the evidence \(P(D)\) because it only scales the normalization of the posterior, which we can normalize manually instead. We therefore simplify to

\[
P(\bm \theta \mid D) \propto P(D \mid \bm \theta)\, P(\bm \theta). \tag{3}
\]
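This recipe can be sketched numerically. The following is a minimal toy example (hypothetical, not from any real analysis): infer the mean \(\mu\) of Gaussian data with known \(\sigma\) by evaluating likelihood times prior on a grid of parameter values, then normalizing by hand instead of ever computing \(P(D)\).

```python
import numpy as np

# Toy inference: posterior ∝ likelihood × prior on a grid of mu values.
rng = np.random.default_rng(0)
sigma = 1.0
data = rng.normal(loc=2.0, scale=sigma, size=50)   # simulated data set D

mu_grid = np.linspace(-5.0, 10.0, 1001)            # candidate parameter values
dmu = mu_grid[1] - mu_grid[0]
log_like = np.array([-0.5 * np.sum(((data - mu) / sigma) ** 2) for mu in mu_grid])
log_prior = np.zeros_like(mu_grid)                 # wide uniform prior on the grid

log_post = log_like + log_prior
unnorm = np.exp(log_post - log_post.max())         # subtract max for stability
posterior = unnorm / (unnorm.sum() * dmu)          # normalize manually, no P(D)

mean = np.sum(mu_grid * posterior) * dmu           # the reported best fit
```

The manual normalization in the second-to-last line is exactly the step that lets us drop the evidence from Bayes’ rule.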
Probably the most difficult thing to understand about Bayesian statistics is what a prior \(P(\bm \theta)\) actually is. Mathematically, it is the probability distribution the parameters had before the data were observed, as I said. But scientifically, how is one supposed to know anything about the parameters before seeing any data? The answer is, unfortunately, that one must choose the prior by hand.
There are a few cases where the choice is simple. If you’re trying to measure the electron neutrino mass \(m_\nu\), you know that the mass is certainly positive. The prior could therefore be \(P(m_\nu) = 0\) for all \(m_\nu < 0\). In some cases you might have further information, such as when the neutrino mass has already been weakly constrained by other experiments. Then you might want to set \(P(m_\nu)\) equal to that previously found constraint.
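The positivity constraint can be encoded directly in the prior. Here is a toy sketch in code; the flat shape and the cutoff `m_max` are hypothetical choices for illustration, not physical bounds.

```python
import numpy as np

def log_prior(m_nu, m_max=2.0):
    """Log of a flat prior supported on 0 <= m_nu < m_max (arbitrary units)."""
    if 0.0 <= m_nu < m_max:
        return -np.log(m_max)   # constant density 1/m_max inside the range
    return -np.inf              # P(m_nu) = 0 for unphysical negative masses

print(log_prior(-0.5), log_prior(1.0))
```

Returning \(-\infty\) for the log-prior is the standard way to zero out unphysical regions when working in log space.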
In unclear situations, the most standard practice is the intuitive one: to set \(P(\bm \theta)\) to be as wide as possible. This indicates our lack of knowledge about the system before it is observed. Many would go further and state that best practice is to use not just wide priors, but wide uniform priors. In my opinion, this is an overstatement because uniformity is a parameterization-dependent notion: a prior uniform in \(\theta\) is not uniform in \(\log \theta\), for example. In another blog post, I give some pointers about how to choose priors.
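The parameterization dependence is easy to demonstrate numerically. In this sketch (a made-up range, purely for illustration), samples drawn uniformly in \(\theta\) are heavily skewed when viewed in \(\log_{10}\theta\):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.uniform(0.01, 10.0, size=100_000)   # "uniform" prior in theta
log_theta = np.log10(theta)

# Fraction of samples in each decade of log10(theta):
frac_low = np.mean(log_theta < 0.0)    # theta in (0.01, 1)
frac_high = np.mean(log_theta >= 0.0)  # theta in (1, 10)
print(frac_low, frac_high)             # roughly 10% vs 90%
```

A prior that looks uninformative in one parameterization can therefore be quite informative in another.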
Many scientists dislike the notion of priors because they can be chosen arbitrarily and can affect your results for the posterior. Frequentist statistics is an appealing alternative. In frequentist statistics, one deals directly with the likelihood \(\mathcal{L} = P(D|\bm \theta)\) and doesn’t use Bayes’ theorem. The reported best-fit parameters are usually the parameters \(\bm \theta\) which maximize \(\mathcal{L}\). Intuitively, these represent the parameters that make the observed data most probable. One should be careful not to describe them as the most probable parameters, though, because that would be the maximum of the posterior \(P(\bm \theta|D)\), which we need a prior to compute.
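The frequentist recipe can be sketched on the same toy Gaussian model used above (a hypothetical example): scan the negative log-likelihood over a grid and report the \(\bm \theta\) that maximizes \(\mathcal{L} = P(D|\bm \theta)\).

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=3.0, scale=1.0, size=100)   # simulated data set D

mu_grid = np.linspace(0.0, 6.0, 601)
neg_log_like = np.array([0.5 * np.sum((data - mu) ** 2) for mu in mu_grid])
mu_mle = mu_grid[np.argmin(neg_log_like)]         # maximum-likelihood estimate

# For a Gaussian likelihood this lands on the sample mean (up to grid spacing):
print(mu_mle, data.mean())
```

No prior appears anywhere in this calculation, which is exactly why the result cannot be interpreted as the most probable parameters.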
This illustrates the problem with frequentist statistics: often, the likelihood is just not the quantity one really wants. Frequentists have therefore invented many tools that dress the likelihood up into functions more suitable for making statistical claims. But the lack of a general theory based on Bayes’ rule makes these methods hard to learn, in my opinion. I will mostly stick to Bayesian statistics for that reason.
Eq. 3 is probably the most important equation in Bayesian statistics, so I will repeat it again, this time in log space:

\[
\ln P(\bm \theta \mid D) = \ln P(D \mid \bm \theta) + \ln P(\bm \theta) + \text{const}.
\]
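Log space is not just a notational convenience. The likelihood of a large data set is a product of many small per-point probabilities, which underflows floating-point arithmetic, while the log-likelihood is a harmless sum. A minimal sketch with made-up per-point values:

```python
import numpy as np

log_like_per_point = -np.ones(1000)          # hypothetical per-point log-likelihoods
like = np.prod(np.exp(log_like_per_point))   # exp(-1000): underflows to 0.0
log_like = np.sum(log_like_per_point)        # stays finite at -1000.0
print(like, log_like)
```

This is why samplers and fitting codes almost always work with the log-posterior rather than the posterior itself.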