What is Thompson Sampling?
Thompson sampling is a heuristic learning algorithm that chooses an action which maximizes the expected reward for a randomly assigned belief. The problem this sampling addresses is also called the exploration-exploitation dilemma.
In these decision problems, actions must be performed sequentially and in a way to balance for two competing goals:
- Consuming resources and time to exploit what is already known to maximize immediate performance.
- Investing resources and time to accumulate new information that might improve future performance.
Algebraically, this is expressed as playing the action:

To maximize the probability of achieving the maximum reward from either course of action, like so:

When is Thompson Sampling used?
This sampling approach is an efficient solution for a wide range of problems with many variables, where the action taken on one affects the outcomes of other potential actions. Some of the most common situations include:
- Bernoulli bandit problems
- Shortest path problems
- Dynamic pricing
- Recommendation engine
- Active learning with neural networks
- Reinforcement learning in Markov decision processes.