“When you have eliminated the impossible, all that remains, no matter how improbable, must be the truth.”

– Sherlock Holmes (Arthur Conan Doyle)

For a long time Bayesian inference was something I understood without really understanding it. I only really *got it* after reading Chapter 2 of John K. Kruschke’s textbook *Doing Bayesian Data Analysis*, where he describes Bayesian Inference as *Reallocation of Credibility Across Possibilities*.

I now understand Bayesian Inference to be essentially Sherlock Holmes’ pithy statement about eliminating the impossible quoted above, taken to its mathematical conclusion. This article is my own attempt to elucidate this idea. If this essay doesn’t do the trick, you might try Bayesian Reasoning for Intelligent People by Simon DeDeo or Kruschke’s Bayesian data analysis for newcomers.

## Prior Beliefs

A Bayesian reasoner starts out with **prior** beliefs, which assign a probability to each of a set of mutually exclusive and exhaustive possibilities for the situation at hand.

For example, suppose that Sherlock Holmes is investigating the Case of the Disappearing Duchess, and believes that there is:

- a 50% chance that the Duke has kidnapped the Duchess and is holding her alive in captivity
- a 25% chance that the Duke has murdered her
- and a 25% chance that she has been murdered by the Count

The total of the probabilities is 100%. Holmes’ prior beliefs can be summarized in the table below.

**Prior Beliefs**

Culprit | Status | Probability |
---|---|---|
Duke | Alive | 50% |
Duke | Dead | 25% |
Count | Dead | 25% |
TOTAL | | 100% |

## Revising Beliefs

Bayesian reasoners then revise their beliefs when they acquire new information/evidence according to very simple rules:

- reject any possibilities that are incompatible with the evidence
- reallocate probability to the remaining possibilities so that they sum to 100%

For example, if the Duchess’s body is found buried under the Atrium, Holmes must eliminate the possibility that she is being held captive and alive by her jealous husband, the Duke. He must then reallocate probability among the remaining possibilities.

So all the remaining probability is allocated to the remaining two scenarios in which the Duchess is dead.

These new beliefs are Holmes’ **posterior** beliefs, as summarized in the table below.

**Posterior Beliefs**

Culprit | Status | Probability |
---|---|---|
Duke | Alive | 0% (eliminated) |
Duke | Dead | 50% |
Count | Dead | 50% |
TOTAL | | 100% |
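The two rules above (eliminate, then rescale) can be sketched in a few lines of Python. This is only an illustration: the dictionary layout, the `update` function, and the use of `Fraction` for exact arithmetic are my own assumptions, not something the tables prescribe.

```python
from fractions import Fraction

# Holmes' prior: one entry per (culprit, status) possibility.
prior = {
    ("Duke", "Alive"): Fraction(1, 2),
    ("Duke", "Dead"): Fraction(1, 4),
    ("Count", "Dead"): Fraction(1, 4),
}

def update(beliefs, is_compatible):
    """Drop possibilities that contradict the evidence, then rescale to sum to 1."""
    remaining = {k: p for k, p in beliefs.items() if is_compatible(k)}
    total = sum(remaining.values())
    if total == 0:
        raise ValueError("evidence rules out every possibility")
    return {k: p / total for k, p in remaining.items()}

# Evidence: the body is found, so only the "Dead" scenarios survive.
posterior = update(prior, lambda k: k[1] == "Dead")

for (culprit, status), p in posterior.items():
    print(culprit, status, f"{float(p):.0%}")
```

Dividing by the surviving total is what "reallocates" the eliminated 50%: each remaining possibility keeps its share relative to the others.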

## Reallocation of Probability Mass

Another way of looking at this is that the 50% “probability mass” previously allocated to the first possibility is reallocated to the remaining possibilities. You can visualize this by plotting the prior and posterior probabilities as bar charts. The total height of the bars in each of the two charts below is the same: 100%. Only the way that total is divided among the possibilities changes between the prior and the posterior.

## Sequential Updating

Suppose Holmes subsequently finds evidence that exonerates the Count. To update Holmes’ beliefs again, we repeat the process. The posterior after the last piece of evidence becomes the new prior. We then eliminate possibility #3 (she was murdered by the Count). This time, since only one possibility remains, all probability mass is reallocated to this possibility.
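Sequential updating can be sketched as a loop in which each posterior is fed back in as the next prior. As before, the `update` helper and the way evidence is encoded as a predicate over `(culprit, status)` pairs are illustrative assumptions.

```python
from fractions import Fraction

def update(beliefs, is_compatible):
    """Drop possibilities that contradict the evidence, then rescale to sum to 1."""
    remaining = {k: p for k, p in beliefs.items() if is_compatible(k)}
    total = sum(remaining.values())
    if total == 0:
        raise ValueError("evidence rules out every possibility")
    return {k: p / total for k, p in remaining.items()}

beliefs = {
    ("Duke", "Alive"): Fraction(1, 2),
    ("Duke", "Dead"): Fraction(1, 4),
    ("Count", "Dead"): Fraction(1, 4),
}

# Each piece of evidence is a label plus a test for compatible possibilities.
evidence_stream = [
    ("Duchess found dead", lambda k: k[1] == "Dead"),
    ("Count exonerated", lambda k: k[0] != "Count"),
]

for label, is_compatible in evidence_stream:
    # The posterior from the previous step becomes the prior for this one.
    beliefs = update(beliefs, is_compatible)
    print(label, {k: str(p) for k, p in beliefs.items()})
```

After the second update only the Duke+Dead possibility is left, carrying all of the probability mass.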

## All That Remains

So what happens now that we’ve reached a point where there are no more possibilities to eliminate? At this point, no more inferences can be made. There is nothing more to learn – at least with respect to the Case of the Disappearing Duchess. Holmes has eliminated the impossible and the remaining possibility *must* be the truth.

It’s not always possible to eliminate all uncertainty such that only one possibility remains. But Bayesian inference can be thought of as the process of reducing uncertainty: eliminating the impossible, and increasing the probability of “all that remains” so that it sums to 100%.

When you have eliminated the impossible, the probability of all that remains, no matter how improbable, must sum to 100%

– Sherlock Thomas Bayes Holmes (Jonathan Warden)

## Updating Beliefs based on Evidence

What makes Bayesian inference so powerful is that learning about one thing can shift beliefs in other things, sometimes in non-intuitive ways.

For example, learning that the Duchess is dead **decreased** the probability that the Duke did it (from 75% to 50%), and **increased** the probability that the Count did it (from 25% to 50%).

How can this be? This is demonstrated visually in the four charts below. The first row of charts we have already seen: they show Holmes’ priors on the left, and his posteriors after learning that the Duchess is dead on the right.

The second row of charts shows the same probabilities, but this time the charts show the *total* for each possible *culprit*. The Duke is the culprit in two different scenarios in the priors, so the total prior probability for the Duke is 50% + 25% = 75%. The total prior probability for the Count is 25%.

After eliminating the Alive+Duke scenario, the remaining probability mass for the Duke and the Count are both 25% – but these are then scaled up to 50% each so their total sums to 100%, as shown in the bottom-right chart. The net result is a decreased total probability for the Duke and increased total probability for the Count.
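The per-culprit totals described above can be checked with a short Python sketch. The `total_by_culprit` helper is a name I have made up for illustration, and the prior and posterior are copied from the tables earlier in the article.

```python
from collections import defaultdict
from fractions import Fraction

prior = {
    ("Duke", "Alive"): Fraction(1, 2),
    ("Duke", "Dead"): Fraction(1, 4),
    ("Count", "Dead"): Fraction(1, 4),
}
posterior = {
    ("Duke", "Dead"): Fraction(1, 2),
    ("Count", "Dead"): Fraction(1, 2),
}

def total_by_culprit(beliefs):
    """Sum probability over every scenario that shares the same culprit."""
    totals = defaultdict(Fraction)  # Fraction() is zero
    for (culprit, _status), p in beliefs.items():
        totals[culprit] += p
    return dict(totals)

print("prior:    ", {c: str(p) for c, p in total_by_culprit(prior).items()})
print("posterior:", {c: str(p) for c, p in total_by_culprit(posterior).items()})
```

The Duke's total falls from 3/4 to 1/2 because the eliminated scenario was one of his, while the Count's total rises from 1/4 to 1/2 purely through rescaling.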

## Beliefs as Joint Probability Distributions

The key to the power of Bayesian inference is that it tells us exactly how a rational being should update their belief in one thing (the Duke or Count did it), after learning another thing (the Duchess is dead), given their prior beliefs.

Inferring one thing from another thing is only possible here because Holmes’ prior beliefs are beliefs in **combinations** of propositions, not just individual propositions. Holmes’ prior beliefs are not simply that *there is a 75% chance that the Duke did it* or *there is a 50% chance that the Duchess is dead*. If his beliefs were so simple, learning that the Duchess was murdered would not tell Holmes anything about whether it was the Duke or the Count that did it.

Rather, his beliefs are about **combinations** of propositions (e.g. *the Duchess is dead and the Duke did it*). His beliefs form a **joint probability distribution** that encodes the knowledge about the relationships between propositions that enables Holmes to make inferences about the culprit upon learning of the Duchess’s death.

I think that understanding prior beliefs as a joint probability distribution is key to understanding Bayesian Inference.

## Conditional Probability

Before discovering the Duchess’s body, we can calculate what Holmes’ beliefs **would** be if he learned that the Duchess was definitely alive or dead. The probability that the Duke/Count is the culprit **given** the Duchess is Alive/Dead is called a **conditional** probability.

Conditional probabilities are written in the form $P(Hypothesis \vert Evidence)$. $Evidence$ is whatever new information has been learned (e.g. *the Duchess is dead*), and $Hypothesis$ is any other proposition of interest (e.g. *the Duke did it*).

The conditional probability of some Hypothesis given some piece of Evidence can be calculated using the following formula:

$$ \begin{aligned} P(Hypothesis \vert Evidence) &= \frac{P(Hypothesis, Evidence)}{P(Evidence)} \end{aligned} $$

Where $P(Hypothesis, Evidence)$ is the **total prior probability** of all possibilities where both the evidence and the hypothesis are true, and $P(Evidence)$ is the total prior probability of all possibilities where the evidence is true.

For example, referring back to Holmes’ prior probability table, you can see that $P(Duke, Dead) = 25\%$, and $P(Dead) = 25\% + 25\% = 50\%$. So:

$$ \begin{aligned} P(Duke|Dead) &= \frac{P(Duke, Dead)}{P(Dead)}\cr &= \frac{25\%}{50\%} = 50\% \end{aligned} $$
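The conditional probability formula translates fairly directly into code. The sketch below is one possible rendering, with propositions represented as predicates over `(culprit, status)` pairs; the `prob` and `conditional` helpers are hypothetical names chosen for this example.

```python
from fractions import Fraction

prior = {
    ("Duke", "Alive"): Fraction(1, 2),
    ("Duke", "Dead"): Fraction(1, 4),
    ("Count", "Dead"): Fraction(1, 4),
}

def prob(beliefs, proposition):
    """Total probability of all possibilities where the proposition is true."""
    return sum((p for k, p in beliefs.items() if proposition(k)), Fraction(0))

def conditional(beliefs, hypothesis, evidence):
    """P(Hypothesis | Evidence) = P(Hypothesis, Evidence) / P(Evidence)."""
    p_evidence = prob(beliefs, evidence)
    if p_evidence == 0:
        raise ValueError("cannot condition on evidence with zero probability")
    p_joint = prob(beliefs, lambda k: hypothesis(k) and evidence(k))
    return p_joint / p_evidence

def is_duke(k): return k[0] == "Duke"
def is_count(k): return k[0] == "Count"
def is_dead(k): return k[1] == "Dead"
def is_alive(k): return k[1] == "Alive"

print("P(Duke | Dead)  =", conditional(prior, is_duke, is_dead))
print("P(Count | Dead) =", conditional(prior, is_count, is_dead))
print("P(Duke | Alive) =", conditional(prior, is_duke, is_alive))
```

Note that everything is computed from the prior alone, which is what lets Holmes work out in advance what he *would* believe under each possible discovery.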

## Posterior Belief Formula

It is a common convention to represent **prior** beliefs (before learning some piece of new information) as $P$, and **posterior** beliefs (after learning new information) as $P'$. $P'$ represents a whole new probability distribution, generated from $P$ by eliminating all possibilities incompatible with the evidence and scaling the remaining probabilities so they sum to 1.

For example, let’s say Holmes’ beliefs after finding the Duchess’s body are $P'$.

Now we don’t actually have to calculate all of $P'$ if all we want to know is $P'(Duke)$. Instead, we can use the conditional probability formula above to calculate $P(Duke|Dead)$. That is, the posterior belief, $P'(Duke)$, is equal to the prior *conditional* belief *given* the Duchess is dead, $P(Duke|Dead)$.

$$ P'(Duke) = P(Duke|Dead) $$

Or more generally

$$ \begin{aligned} P'(Hypothesis) &= P(Hypothesis|Evidence)\cr &= \frac{P(Hypothesis, Evidence)}{P(Evidence)} \end{aligned} $$

This three-part formula is a useful one to memorize. Note that the left-hand side is a *posterior* probability. The middle formula is the notation for the *conditional* probability of the hypothesis given the evidence. And the right-hand side lets us calculate the posterior probability of any hypothesis given any piece of evidence in terms of the prior probability distribution.
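As a worked instance of the three-part formula, here is the same calculation for the Count, using only numbers from Holmes’ prior table and writing $P'$ for his beliefs after the body is found:

$$ \begin{aligned} P'(Count) &= P(Count|Dead)\cr &= \frac{P(Count, Dead)}{P(Dead)}\cr &= \frac{25\%}{25\% + 25\%} = 50\% \end{aligned} $$

This matches the 50% for the Count in the posterior table, up from a prior of $P(Count) = 25\%$.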

## Summary

So far, we have engaged in Bayesian inference without using the famous Bayes’ Theorem. Bayes’ rule is not actually necessary for Bayesian inference, and conflating the use of Bayes’ rule with Bayesian inference can interfere with an understanding of the more fundamental principle of Bayesian inference as reallocation of probabilities.

So here’s a summary of the principle of Bayesian inference:

- Start with prior beliefs as a joint probability distribution
- Eliminate possibilities inconsistent with new evidence
- Reallocate probability to remaining possibilities such that they sum to 100%
- Update beliefs sequentially by eliminating possibilities as new evidence is learned
- Make inferences by simply calculating the total posterior probability of the hypothesis given the evidence using the conditional probability formula

*Originally posted on jonathanwarden.com.*