Thompson Sampling

What is Thompson Sampling?

Thompson Sampling, also known as posterior sampling or probability matching, is a heuristic for balancing exploration and exploitation in online decision problems, most notably multi-armed bandit problems. It is named after William R. Thompson, who introduced the method in 1933. The algorithm takes a Bayesian approach to action selection and has gained popularity in reinforcement learning and machine learning for its efficiency and effectiveness.

Understanding the Multi-Armed Bandit Problem

To appreciate the value of Thompson Sampling, it's essential to understand the multi-armed bandit problem. Imagine a gambler facing several slot machines (bandits), each with a different, unknown probability of payout. The gambler's goal is to maximize their winnings over a series of spins. The challenge lies in deciding which machine to play at each turn—a dilemma between exploiting a known rewarding machine or exploring others for potentially higher payouts.
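The slot-machine setup above can be sketched as a tiny simulator. This is an illustrative model, not code from any particular library: each arm pays out 1 with a fixed probability that the gambler does not know.

```python
import random

class BernoulliBandit:
    """One slot machine ('arm') that pays 1 with a fixed, unknown probability."""

    def __init__(self, payout_prob):
        self.payout_prob = payout_prob  # hidden from the gambler

    def pull(self):
        # Reward is 1 (win) with probability payout_prob, otherwise 0.
        return 1 if random.random() < self.payout_prob else 0
```

The gambler only ever sees the 0/1 rewards returned by `pull()`, and must infer each machine's `payout_prob` from those observations.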

How Thompson Sampling Works

Thompson Sampling addresses this challenge by considering the probability distribution of the expected rewards of each action. It assumes a prior distribution over the reward probabilities of each machine. As the gambler plays the machines and observes the outcomes (rewards or no rewards), the algorithm updates the posterior distribution of each machine's reward probability using Bayes' theorem.
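For 0/1 rewards, a common choice (assumed here) is a Beta prior over each machine's payout probability. Because the Beta distribution is conjugate to the Bernoulli likelihood, the Bayesian update described above reduces to incrementing two counters:

```python
# Beta-Bernoulli posterior update for one arm.
# Prior: Beta(alpha, beta). After observing reward r in {0, 1}:
#   alpha += r        (successes)
#   beta  += 1 - r    (failures)

def update_posterior(alpha, beta, reward):
    """Return the updated Beta parameters after observing a 0/1 reward."""
    return alpha + reward, beta + (1 - reward)

# Start from a uniform prior Beta(1, 1), then observe a win and a loss.
alpha, beta = 1, 1
alpha, beta = update_posterior(alpha, beta, 1)  # win  -> Beta(2, 1)
alpha, beta = update_posterior(alpha, beta, 0)  # loss -> Beta(2, 2)
```

Other reward models (e.g. Gaussian rewards with a Gaussian prior) admit similar closed-form updates; the Beta-Bernoulli case is simply the most common in the bandit literature.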

When deciding which machine to play next, Thompson Sampling selects each action with probability equal to the probability that it is the best one, given the data observed so far. In practice, this is done by drawing one sample from each machine's posterior and playing the machine with the highest sample. The more likely a machine is to have the highest payout rate, the more often it is selected; machines that currently look inferior but remain uncertain still have a chance of being chosen, which provides exploration.
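Putting the update and selection steps together gives the full algorithm. The following is a minimal sketch assuming Bernoulli rewards and Beta(1, 1) priors; `thompson_sampling` and its parameters are illustrative names, not a standard API:

```python
import random

def thompson_sampling(true_probs, n_rounds=1000, seed=0):
    """Run Thompson Sampling with Beta(1, 1) priors on Bernoulli arms."""
    rng = random.Random(seed)
    n_arms = len(true_probs)
    alpha = [1] * n_arms  # 1 + successes per arm
    beta = [1] * n_arms   # 1 + failures per arm
    total_reward = 0
    for _ in range(n_rounds):
        # Draw one sample from each arm's Beta posterior...
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(n_arms)]
        # ...and play the arm whose sampled payout rate is highest.
        arm = max(range(n_arms), key=lambda i: samples[i])
        # Observe a Bernoulli reward from the chosen arm.
        reward = 1 if rng.random() < true_probs[arm] else 0
        # Conjugate posterior update: increment the matching counter.
        alpha[arm] += reward
        beta[arm] += 1 - reward
        total_reward += reward
    return total_reward, alpha, beta
```

Over many rounds, the posteriors of clearly inferior arms concentrate below the best arm's, so their samples rarely win and play shifts almost entirely to the best machine.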

Advantages of Thompson Sampling

Thompson Sampling has several advantages that contribute to its popularity:

  • Balance of Exploration and Exploitation: It naturally balances the exploration of less-known machines with the exploitation of machines that are known to have higher payouts.
  • Bayesian Approach: It incorporates prior knowledge and updates beliefs in a principled way, making it robust to uncertainties in the reward distributions.
  • Performance: It has been shown to perform well compared to other strategies, especially when the number of trials is limited.
  • Adaptability: It can easily adapt to changing environments where the reward probabilities of the machines change over time.
  • Simplicity: Despite its solid theoretical foundations, the algorithm is straightforward to implement.

Applications of Thompson Sampling

Thompson Sampling has been successfully applied in various domains, including:

  • Online Advertising: To select which ads to display to users to maximize click-through rates.
  • Recommendation Systems: To choose which content to recommend to users while continuously learning about their preferences.
  • Clinical Trials: To determine the most effective treatment options while minimizing patient risk.
  • Dynamic Pricing: To adjust prices in response to consumer behavior and maximize revenue.

Thompson Sampling vs. Other Algorithms

Thompson Sampling is often compared to algorithms such as ε-greedy and Upper Confidence Bound (UCB). ε-greedy explores uniformly at random with a fixed probability ε and exploits the current best estimate the rest of the time, while UCB plays the arm with the highest optimistic upper confidence bound on its estimated reward, so exploration is driven by how uncertain each estimate is. Thompson Sampling instead selects actions according to their posterior probability of being optimal, which can yield a more nuanced exploration-exploitation trade-off that tapers off automatically as the posteriors concentrate.
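The three selection rules can be contrasted side by side. These are illustrative sketches (the function names and signatures are not from any library); each takes summary statistics for the arms and returns the index of the arm to play:

```python
import math
import random

def epsilon_greedy(means, epsilon=0.1, rng=random):
    """Explore uniformly with probability epsilon; otherwise pick the best mean."""
    if rng.random() < epsilon:
        return rng.randrange(len(means))
    return max(range(len(means)), key=lambda i: means[i])

def ucb1(means, counts, t):
    """Pick the arm maximizing mean + exploration bonus (the UCB1 rule).

    The bonus sqrt(2 ln t / n_i) is large for rarely played arms,
    so uncertainty itself drives exploration.
    """
    return max(range(len(means)),
               key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))

def thompson(alpha, beta, rng=random):
    """Sample a payout rate from each Beta posterior; play the highest sample."""
    samples = [rng.betavariate(a, b) for a, b in zip(alpha, beta)]
    return max(range(len(samples)), key=lambda i: samples[i])
```

Note the difference in character: ε-greedy's exploration is blind (any arm, any time, at a fixed rate), UCB's is deterministic and uncertainty-driven, and Thompson Sampling's is randomized in proportion to each arm's posterior chance of being best.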


Conclusion

Thompson Sampling is a powerful and efficient algorithm for addressing the exploration-exploitation dilemma in online decision-making problems. Its ability to incorporate uncertainty and adapt to new information makes it a valuable tool in the arsenal of machine learning practitioners. As online platforms and services continue to grow, the importance of algorithms like Thompson Sampling that can make real-time decisions under uncertainty will only increase.


References

Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4), 285-294.

Chapelle, O., & Li, L. (2011). An empirical evaluation of Thompson Sampling. In Advances in neural information processing systems (pp. 2249-2257).

Agrawal, S., & Goyal, N. (2012). Analysis of Thompson Sampling for the multi-armed bandit problem. In Conference on Learning Theory (pp. 39.1-39.26).
