Home

Introduction to Probability

Manipulating probabilistic formulas is very important in statistics, and it’s useful to outline a couple definitions and theorems that will be of future use. We will use the notion of “sets” to define the basic elements of probability below. It may seem unnecessarily tedious, but I find that this concrete way of thinking about probability leads to less confusion than other methods. In particular, it will give natural explanations for two confusing fundamental formulas in probability: Change of variables and Bayes’ theorem

Definitions

Probability deals with the mathematics of random variables — variables whose value could be one of multiple options. A classic example of a random variable is whether a flipped coin will come up heads (\(H\)) or tails (\(T\)). Another example is the number of clicks a Geiger counter makes in a given second.

Random variables are associated with domains, which are the set of possible outcomes. The domains of the above examples are \(\{H, T\}\) and the non-negative integers respectively.

These examples are both discrete random variables, because their domains are discrete sets — i.e. their elements are separated from each other. The domain of a continuous random variable, on the other hand, belongs to a continuous set. Examples include the mass of a galaxy or the velocity of a particle, whose domains are the positive real numbers and the space of three-dimensional vectors .

A random variable is often written with a capital letter such as \(R\), whose domain is \(\mathrm{Dom}\ R\)

An event refers to a particular subset of the domain. Events are often denoted with an equation. For example, consider a random variable \(C\) that represents a random primary color (red, yellow, blue). The event \(C=\mathrm{red}\) is the set containing only \(\{\mathrm{red}\}\), and the event \(Y\neq \mathrm{blue}\) is a set with two entries: \(\{\mathrm{red}, \mathrm{yellow}\}\).

One can also make an event out of two random variables with a little more work. If we make a new random variable \(D\) which is also a color, the domain of \(C\) and \(D\) together could consist of ordered pairs \((c,d)\) where \(c\) and \(d\) can each be any of red, yellow, or blue. This domain contains nine items.

Applying this definition, the event \(C=D\) is \(\{(\mathrm{red},\mathrm{red}), (\mathrm{yellow},\mathrm{yellow}), (\mathrm{blue},\mathrm{blue})\}\).

As a final example, events can also be represented as an inequality. Consider a Geiger counter which observes \(N\) radioactive particles in a given second. The event that three or more particles are detected is written \(N \geq 3\) and refers to the set \(\{3, 4, \dots\}\).

For every discrete random variable \(N\), we can attach a probability to each element \(n\) of the domain, which is written \(p(n)\). This is referred to as the probability mass function (PMF). It’s required that \(\sum_n p(n) = 1\).

With that done, the PMF of an event is the sum of \(p(x)\) for all items of the event:

\[p(E) = \sum_{e\in E} p(e).\]
(1)
You may have heard that the probability of \(A\) or \(B\) is the probability of \(A\) plus the probability of \(B\). Eq. 1 is the origin of this rule.

A probability distribution function (PDF) is the related concept for the case of continuous random variables. For a random variable \(X\), we define the PDF \(P(x)\) to be a function of \(x\) with the requirement that \(\int dx\, P(x)=1\). This function has the interpretation that a little segment \(dx\) of \(\mathrm{Dom}\ X\) has probability

\[dp = P(x)dx.\]
(2)
Because of this definition, some people write \(P(x)\) as \(\frac{dp}{dx}\) instead. The probability of an event \(E\) is defined very similarly to that of PMFs:
\[p(E) = \int_E dx\, P(x).\]
(3)

In the section on events we introduced a situation where one has two random variables \(C\) and \(D\). It is also possible to write PDFs and PMFs for such a case; these are functions of two dimensions \(p(c,d)\) and are called multivariate probability functions.

If \(C\) and \(D\) are independent, meaning their values are unrelated to each other, then \(p(c,d) = p(c)p(d)\). This can be shown by the following argument: if \(C\) is independent of \(D\), then one gains no information about \(p(c)\) by measuring \(D\). So \(p(c,d) \propto p(c)\), where the proportionality constant exists only to normalize the probability distribution. The same argument asserts that \(p(c,d) \propto p(d)\) as well, and therefore \(p(c,d) = p(c)p(d)\). A common interpretation of this rule is that the probability \(A\) and \(B\) is the probability of \(A\) times the probability of \(B\) (for independent random variables).

In some cases, one wishes to determine the PMF of \(C\) given only \(p(c,d)\). This is can be done by summing, or marginalizing over all possible values of \(D\). The marginal PMF for \(C\) is therefore

\[p(c) = \sum_{d} p(c,d).\]
(4)
If \(C\) and \(D\) were continuous random variables, one should replace the sum with an integral. This formula is true even in the case of dependent random variables.

A conditional \(X|E\) restricts a random variable \(X\) to the subset of its domain that intersects with \(E\). For example, consider \((C, D)\) which is pair of random variable representing color as described above. \(\mathrm{Dom}(C,D)\) has nine elements, but a conditional \((C,D)|D=\mathrm{red}\) is a new random variable whose domain is shrunk to the three elements where \(D\) is red: \(\{(\mathrm{red},\mathrm{red}), (\mathrm{yellow},\mathrm{red}), (\mathrm{blue},\mathrm{red})\}\).

Notation details: when reading conditionals, the vertical bar is pronounced “given.” Events for conditionals have special notation for the sake of brevity too. Take the random variable \(X \equiv (C,D)|D=\mathrm{red}\) used above. The event that \(X=(\mathrm{blue}, \mathrm{red})\) is written \(C=\mathrm{blue}|D=\mathrm{red}\) rather than \((\mathrm{blue}, \mathrm{red}) = (C,D)|D=\mathrm{red}\).

The PMF/PDF of the conditional \(X|E\) is proportional to the PMF/PDF of the original random variable \(X\), but it is rescaled to continue to satisfy the normalization condition with its smaller domain.

Consequences

Armed with a clear understanding of probability functions, we can now understand some important facts about probability that would otherwise not be obvious

Suppose we have a continuous random variable \(X\) with PDF \(P(x)\), but we’d like to convert to a new random variable \(Y = f(X)\) for some function \(X\). What is the PDF of \(Y\)? It’s tempting to guess that \(P(y)=P(x)\), where \(y=f(x)\), but this is incorrect. You can understand the correct rule by using the alternate notation for PDFs: \(P(y) = dp/dy\). The chain rule tells immediately that

\[P(y) = \frac{dp}{dy} = \frac{dp}{dx}\frac{dx}{dy} = \frac{P(x)}{f’(x)}.\]
(5)
It is often helpful to use this derivative notation for the PDF to avoid confusion. However, we have glossed over the slight complication that \(P(y)\) cannot be negative. To enforce this, the true rule is
\[P(y) = \frac{P(x)}{|f’(x)|}.\]
(6)
If you don’t like this hand-wavy explanation, you can prove this fact from Eq. 2, without relying on writing \(P(y)\) as a derivative. But in my opinion, that version is not very illuminating.

Let’s consider two random variables \(X\) and \(Y\), which represent flips of two different unfair coins. A flip can be either \(H\) or \(T\). We define two different conditional random variables \(C_1 = (X,Y)|X=T\) and \(C_2 = (X,Y)|Y=T\). Now consider the probability of specific events for these random variables respectively: \(p(Y=T|X=T)\) and \(p(X=T|Y=T)\). One might naively guess that these probabilities are equal. Indeed, the events \(Y=T|X=T\) and \(X=T|Y=T\) do refer to the same set: \(\{(T, T)\}\). But the probabilities are not equal because the PMFs of \(C_1\) and \(C_2\) are different.

How the two probabilities are related? Remember that the PMF of \(C_1\) is reweighted so that its probability remains normalized. Specifically, \(p_{C_1}(x,y) = p_{(X,Y)}(x,y) / p(X=T)\) normalizes that probability. The first probability, \(p(Y=T|X=T) = p_{C_1}(T,T)\), is therefore equal to \(p_{(X,Y)}(T,T) / p(X=T)\). Likewise, \(p(X=T|Y=T) = p_{(X,Y)}(T,T) / p(Y=T)\).

Note that \(p_{(X,Y)}(T,T)\) is present in both formulas. Eliminating this variable, we find

\[p(Y=T|X=T) = p(X=T|Y=T)\frac{p(Y=T)}{p(X=T)}.\]
(7)

In more general language, for any two events \(A\) and \(B\),

\[p(A|B) = p(B|A)\frac{p(A)}{p(B)}.\]
(8)
This fact is called Bayes’ theorem and the fraction in the previous equation is sometimes called Bayes’ factor.

An example

Bayes’ theorem has lots of applications. A commonly given one is for doctor tests. Suppose you test positive for a disease (event \(T=+\)), and you want to know the probability of a false positive— that you don’t have the disease (event \(D=-\)). In other words, you’re interested in \(P(D=-|T=+)\). The doctor may tell you that the test has a false positive rate of only 1%, meaning that the test returns a positive result on one out of 100 negative patients. However, you recognize that 1% is not the answer to your question because this false positive rate is \(P(T=+|D=-)\), which differs from \(P(D=-|T=+)\) by the Bayes factor.

So you want to calculate the Bayes factor \(P(D=-)/P(T=+)\). Convince yourself that we can expand the denominator as \(P(T=+) = P(T=+|D=+)P(D=+) + P(T=+|D=-)P(D=-)\). We know that \(P(T=+|D=-)=1\)% perhaps the doctor tells us that the true positive rate \(P(T=+|D=+)\) is \(100\)%. We still don’t know our goal of \(P(D=-|T=+)\) because we must determine \(P(D=-)\).

Suppose this is a very rare disease; \(P(D=+) = 10^{-6}\) and \(P(D=-) = 1-10^{-6}\). Then the probability we want—that we actually don’t have the disease despite the test— is

\[P(D=-|T=+)=1\% \frac{1-10^{-6}}{10^{-6} + 1\% (1-10^{-6})} = 99.99\%. \]
(9)
In other words, it’s very likely that you do not have the disease! Why is this answer so different from the 1% “false positive rate”? Intuitively, the disease is so rare that it’s drowned out by the false positive rate. This is something we would not have noticed mathematically without Bayes’ theorem.

Now that we’ve defined the basics of random variables, we discuss their properties such as expected value, variance, and covariance in the next section.