The Multi-Armed Bandit (MAB) problem models the exploration-exploitation tradeoff in sequential decision processes and is typically described as a game between an agent and an environment with $K$ arms. The game proceeds in $T$ time steps. In each time step $t \in \{1, \ldots, T\}$, the agent plays an arm $A_t \in \{1, \ldots, K\}$ based on the observations from the previous $t-1$ time steps, and then observes a reward $r_t$ that is independently generated from a 1-subGaussian distribution with mean value $\mu_{A_t}$, where $\mu_1, \ldots, \mu_K \in \mathbb{R}$ are unknown. The goal of the agent is to maximize the cumulative reward over the $T$ time steps. The performance of a strategy for MAB is measured by the expected cumulative difference over $T$ time steps between playing the best arm and playing the arm chosen by the strategy, which is called the regret of the bandit strategy. Formally, the regret $R_\mu(T)$ is defined as follows:
$$R_\mu(T) = T \cdot \max_{i \in \{1, \ldots, K\}} \mu_i - \mathbb{E}_\mu\Big[\sum_{t=1}^{T} r_t\Big]. \quad (1)$$
For a fixed time horizon $T$, the problem-independent lower bound (Auer et al., 2002b) states that any strategy has regret at least of order $\Omega(\sqrt{KT})$ (this notation hides constant factors), which is called the minimax-optimal regret or worst-case optimal regret. On the other hand, for a fixed model (i.e., $\mu_1, \ldots, \mu_K$ are fixed), Lai and Robbins (1985) and Katehakis and Robbins (1995) proved the asymptotic lower bound that any strategy must have regret at least $C(\mu) \log T$, up to lower-order terms, when the horizon $T$ approaches infinity, where $C(\mu)$ is a constant depending on the model. A strategy with a regret upper-bounded by $C(\mu) \log T$, up to lower-order terms, is called asymptotically optimal.
In this paper we aim at achieving asymptotic optimality and minimax optimality for the earliest bandit strategy, Thompson Sampling (TS) (Thompson, 1933). It has been observed in practice that Thompson Sampling can achieve better performance than many upper confidence bound (UCB)-based algorithms (Chapelle and Li, 2011; Wang and Chen, 2018). In addition, TS is natural, simple and easy to implement. Despite these advantages, the theoretical analysis of TS was not established until the past decade. In particular, Agrawal and Goyal (2012) and Kaufmann et al. (2012) proved the first regret bound of TS and showed that it is asymptotically optimal. Later, Agrawal and Goyal (2017) showed that TS using the Beta distribution as the prior achieves an $O(\sqrt{KT \log T})$ problem-independent regret bound while maintaining the asymptotic optimality as well. Moreover, Agrawal and Goyal (2017) also proved that TS with a Gaussian prior can achieve an improved regret bound of $O(\sqrt{KT \log K})$. Meanwhile, Agrawal and Goyal (2017) proved that the vanilla TS strategy with a Gaussian prior has a problem-independent regret at least of order $\Omega(\sqrt{KT \log K})$.
It remains an open problem (Li and Chapelle, 2012) whether Thompson Sampling type algorithms can achieve the minimax optimal regret bound $O(\sqrt{KT})$ for MAB problems.
Main Contributions. In this paper, we solve this open problem by proposing a new Thompson Sampling algorithm called Minimax Optimal Thompson Sampling (MOTS), which clips the sampling results for each arm based on the history of pulls for the arm. We prove that our proposed MOTS algorithm achieves the asymptotically optimal and minimax optimal regret simultaneously. This is the first TS type algorithm that achieves the minimax optimal regret bound $O(\sqrt{KT})$. Our result also conveys the important message that the lower bound for the vanilla TS strategy with Gaussian priors in Agrawal and Goyal (2017) may not hold in more general cases. Our experimental results also demonstrate the superiority of MOTS over state-of-the-art bandit algorithms such as UCB (Auer et al., 2002a), MOSS (Audibert and Bubeck, 2009) and TS.
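Notations. A random variable $X$ is said to follow a 1-subGaussian distribution if $\mathbb{E}[\exp(\lambda X - \lambda \mathbb{E}[X])] \le \exp(\lambda^2/2)$ for all $\lambda \in \mathbb{R}$. We reserve the notation $c$ to represent universal positive constants that are independent of problem parameters. The specific value of $c$ may differ from line to line. We use $T$ for the total number of time steps, $K$ for the number of arms, and $[K]$ for the set $\{1, \ldots, K\}$. Without loss of generality, we assume $\mu_1 = \max_{i \in [K]} \mu_i$ throughout this paper. We use $\Delta_i$ to denote the gap between arm 1 and arm $i$, i.e., $\Delta_i := \mu_1 - \mu_i$, $i \in [K] \setminus \{1\}$. We denote by $T_i(t) := \sum_{j=1}^{t-1} \mathbb{1}\{A_j = i\}$ the number of times that arm $i$ has been played at time step $t$ and by $\hat{\mu}_i(t) := \sum_{j=1}^{t-1} \mathbb{1}\{A_j = i\} \cdot r_j / T_i(t)$ the average reward for pulling arm $i$ up to time $t$, where $r_j$ is the reward received by the algorithm at time $j$.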
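As a small illustration of the bookkeeping in this notation, the following Python sketch computes the pull counts, the empirical means, and the gaps from a recorded history of plays. The function names `pull_statistics` and `gaps`, the zero-based arm indices, and the convention of reporting a mean of 0.0 for an arm that has never been played are assumptions made for this example only.

```python
def pull_statistics(arms_played, rewards, num_arms):
    """Return (counts, means): pulls per arm and average reward per arm.

    arms_played[j] is the (zero-based) arm played at step j and rewards[j]
    is the reward observed at that step.
    """
    counts = [0] * num_arms
    sums = [0.0] * num_arms
    for arm, reward in zip(arms_played, rewards):
        counts[arm] += 1
        sums[arm] += reward
    # Assumption for this sketch: an unplayed arm reports a mean of 0.0.
    means = [s / c if c > 0 else 0.0 for s, c in zip(sums, counts)]
    return counts, means


def gaps(true_means):
    """Return the gap between the best mean and each arm's mean."""
    best = max(true_means)
    return [best - m for m in true_means]


counts, means = pull_statistics([0, 1, 0, 2, 0], [1.0, 0.5, 0.0, -1.0, 2.0], num_arms=3)
print(counts, means)
print(gaps([0.9, 0.5, 0.2]))
```

Only the counts and the empirical means need to be stored between rounds, so a running update is enough in practice and the full history does not have to be kept.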
2 Minimax Optimal Thompson Sampling Algorithm
In this section, we propose a Minimax Optimal Thompson Sampling (MOTS) algorithm, whose details are displayed in Algorithm 1.
Specifically, MOTS maintains a distribution $D_i(t)$ for each arm $i \in [K]$ at time step $t$ during execution, where $D_i(t)$ is initialized as the standard Gaussian distribution. At the $t$-th iteration of Algorithm 1, it samples instances $\theta_i(t)$ independently from the distribution $D_i(t)$ for all $i \in [K]$. Then the agent plays the arm $A_t = \arg\max_{i \in [K]} \theta_i(t)$ and receives a reward $r_t$. The average reward and the number of pulls for each arm are updated accordingly.
The main difference between MOTS and vanilla Thompson Sampling in Agrawal and Goyal (2017) is the choice of the distribution $D_i(t)$. In Agrawal and Goyal (2017), $D_i(t)$ is chosen as the Gaussian distribution $\mathcal{N}(\hat{\mu}_i(t), 1/T_i(t))$. In contrast, we define $D_i(t)$ as a clipped Gaussian distribution $\mathrm{cl}(\mathcal{N}(\hat{\mu}_i(t), 1/(\rho T_i(t))))$, where $\rho \in (1/2, 1)$ is an arbitrary constant. We describe the detailed procedure of sampling from $D_i(t)$ in MOTS as follows.
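Sampling from a clipped Gaussian distribution: At time step $t$, for every arm $i \in [K]$, we denote the following range
$$I_i(t) = \Big(-\infty,\ \hat{\mu}_i(t) + \sqrt{\frac{4}{T_i(t)} \log^+\Big(\frac{T}{K\, T_i(t)}\Big)}\Big], \quad (2)$$
where $\log^+(x)$ is defined as $\max\{0, \log x\}$. For arm $i$, we first sample an instance $\tilde{\theta}_i(t)$ from the Gaussian distribution $\mathcal{N}(\hat{\mu}_i(t), 1/(\rho T_i(t)))$. If $\tilde{\theta}_i(t) \in I_i(t)$, then we return $\theta_i(t) = \tilde{\theta}_i(t)$ as a sample from $D_i(t)$; otherwise we return the right endpoint of $I_i(t)$ as a sample from $D_i(t)$.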
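The clipping step can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation. It assumes the range in (2) is a half-line whose right endpoint is the empirical mean plus sqrt((4 / pulls) * log+(horizon / (num_arms * pulls))), and that the unclipped sample is Gaussian with variance 1 / (rho * pulls) for a constant rho in (1/2, 1). The function names, the parameter names, and the default `rho=0.75` are hypothetical choices for illustration.

```python
import math
import random


def log_plus(x):
    """log+(x) = max(0, log x)."""
    return max(0.0, math.log(x))


def clip_threshold(mean, pulls, horizon, num_arms, alpha=4.0):
    """Right endpoint of the clipping range (assumed form of (2))."""
    return mean + math.sqrt(alpha / pulls * log_plus(horizon / (num_arms * pulls)))


def sample_clipped_gaussian(mean, pulls, horizon, num_arms,
                            rho=0.75, alpha=4.0, rng=random):
    """Draw a Gaussian sample and cap it at the right endpoint of the range."""
    raw = rng.gauss(mean, math.sqrt(1.0 / (rho * pulls)))
    return min(raw, clip_threshold(mean, pulls, horizon, num_arms, alpha))


rng = random.Random(0)
print(clip_threshold(0.0, 1, 100, 10))
print(sample_clipped_gaussian(0.3, 20, 100, 10, rng=rng))
```

Once an arm has been pulled at least horizon / num_arms times, log+ vanishes and the sample can no longer exceed the empirical mean, so clipping only leaves room for optimism on arms that are still under-explored.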
We would like to point out that the right endpoint in (2) resembles the upper confidence bound in MOSS (Audibert and Bubeck, 2009). Apart from the difference that MOTS is TS-type and MOSS is UCB-type, we claim that they are also very different from a theoretical perspective. Under the definition of $I_i(t)$ in (2), we will prove in the next section that MOTS is both asymptotically optimal and minimax optimal. However, MOSS is only minimax optimal (Audibert and Bubeck, 2009). The improvement of MOSS to achieve asymptotic optimality was developed only recently in the KL-UCB++ algorithm (Ménard and Garivier, 2017) and the AdaUCB algorithm (Lattimore, 2018), which can be seen as variants of MOSS. Both KL-UCB++ and AdaUCB need to reduce the constant factor 4 in the right endpoint of $I_i(t)$ defined in (2) to 2, which essentially decreases the exploration rate. Moreover, KL-UCB++ utilizes a more complicated upper confidence bound with an additional term and AdaUCB only works for Gaussian reward distributions.
In contrast, it is easy to verify that for MOTS the constant 4 in (2) can be replaced by any constant larger than 4 while maintaining the asymptotic optimality and minimax optimality. Therefore, MOTS is more robust in the choice of hyperparameter. It will be more suitable to design better algorithms based on MOTS, e.g., achieving instance-dependent optimality (see Lattimore (2018) for details) while keeping the asymptotic optimality.
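To make the procedure of this section concrete, the following Python sketch simulates a clipped Thompson Sampling loop on Gaussian rewards with unit variance. It is an illustration under stated assumptions rather than a reproduction of Algorithm 1: the function name `run_mots`, the choice to initialise by playing every arm once, the default `rho=0.75`, and the use of pseudo-regret (gap times pull count) as the reported quantity are all choices made for this example.

```python
import math
import random


def run_mots(true_means, horizon, rho=0.75, alpha=4.0, seed=0):
    """Simulate a clipped Thompson Sampling loop; return (counts, pseudo_regret)."""
    rng = random.Random(seed)
    num_arms = len(true_means)
    counts = [0] * num_arms
    averages = [0.0] * num_arms
    for t in range(horizon):
        if t < num_arms:
            arm = t  # assumption: play each arm once before sampling
        else:
            samples = []
            for i in range(num_arms):
                n = counts[i]
                # Unclipped Gaussian sample with variance 1 / (rho * n).
                raw = rng.gauss(averages[i], math.sqrt(1.0 / (rho * n)))
                # Right endpoint of the clipping range, using log+(x) = max(0, log x).
                bonus = math.sqrt(alpha / n * max(0.0, math.log(horizon / (num_arms * n))))
                samples.append(min(raw, averages[i] + bonus))
            arm = max(range(num_arms), key=samples.__getitem__)
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        # Running update of the empirical mean of the played arm.
        averages[arm] += (reward - averages[arm]) / counts[arm]
    best = max(true_means)
    pseudo_regret = sum((best - m) * c for m, c in zip(true_means, counts))
    return counts, pseudo_regret


counts, regret = run_mots([2.0, 0.0], horizon=2000)
print(counts, regret)
```

The simulation needs to know the true means only to generate rewards and to report the pseudo-regret at the end; the arm-selection step uses nothing but the pull counts and empirical means.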
3 Main Theory
In this section, we present our main theory of MOTS.
Theorem 1 (Minimax Optimality).
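For any fixed $\rho \in (1/2, 1)$, there exists a universal constant $c > 0$ such that the regret of Algorithm 1 with 1-subGaussian rewards satisfies
$$R_\mu(T) \le c\Big(\sqrt{KT} + \sum_{i=2}^{K} \Delta_i\Big). \quad (3)$$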
The second term on the right-hand side of (3) is due to the fact that we need to pull each arm at least once if . Following the convention in the literature (Audibert and Bubeck, 2009; Agrawal and Goyal, 2017), we only need to consider the case when $\sum_{i=2}^{K} \Delta_i$ is dominated by $\sqrt{KT}$.
Compared with the results in Agrawal and Goyal (2017), the regret bound of MOTS improves that of TS by a factor of $\sqrt{\log T}$ and improves that of TS with Gaussian priors by a factor of $\sqrt{\log K}$. This is the first time that a Thompson Sampling type algorithm achieves the minimax optimal regret for multi-armed bandit problems (Auer et al., 2002a). It also answers the open problem in Li and Chapelle (2012), where it is conjectured that Thompson Sampling's regret actually matches the lower bound and is indeed optimal.
Theorem 2 (Asymptotic Optimality).
For any fixed , the regret of Algorithm 1 with 1-subGaussian rewards satisfies
Theorem 2 indicates that the asymptotic regret rate of MOTS matches the asymptotic optimal rate up to a multiplicative factor , where is arbitrarily fixed. This is the same as that of vanilla TS in Agrawal and Goyal (2017), where the authors proved an asymptotic regret rate that matches the asymptotic optimal rate by a multiplicative factor , where is a fixed constant.
So far, we have assumed that the reward follows an unknown subGaussian distribution. In the next theorem, we present a variant of MOTS that achieves the minimax optimality and asymptotic optimality for Gaussian reward distributions.
3.1 Proof of the Minimax Optimality
The following lemma, which characterizes the concentration property of subGaussian random variables, will be used frequently throughout our analysis.
Lemma 1 (Lemma 9.3 in Lattimore and Szepesvári (2020)).
Let be independent and -subGaussian with zero mean. Denote . Then for any ,
where is a universal constant.
Let be the average reward of arm when it has been played times. Define
The regret of Algorithm 1 can be decomposed as follows.
The first term in (8) can be bounded as:
where is a universal constant and the inequality comes from Lemma 1 since
Now we focus on . Note that the update rules of Algorithm 1 ensure whenever . Hence, we can define as the prior distribution of arm when it has been played times and obtain the following lemma.
Lemma 2 (Theorem 36.2 in Lattimore and Szepesvári (2020)).
Let be an arbitrary constant. Then the expected number of times that Algorithm 1 plays arm is bounded by
where , is the CDF of , and .
Let be sampled from the clipped distribution . Recall the clipped sampling procedure in Section 2. We can first sample from distribution . If , we return ; otherwise, we return . Combining with (14), we know that .
Bounding term : Note that
We define the following notation.
which immediately implies
The following lemma characterizes the bound for .
Let be a constant and be 1-subGaussian random variables with zero means. Denote . Then for any ,
where the first inequality is due to the fact that . It is easy to verify that is monotonically decreasing for and any . Since , we have . Plugging this fact into (3.1), we have that , where is a universal constant.
Bounding term : We first prove the following lemma.
There exists a universal constant such that:
Proof of Lemma 4.
We decompose the proof of Lemma 4 into the proof of the following two statements: (i) there exists a universal constant such that
and (ii) for , it holds that
For , and . For , let and be the random variable denoting the number of consecutive independent trials until a sample of becomes greater than . Note that , where is sampled from . Hence we have
Consider an integer . Let , where and will be determined later. Let random variable be the maximum of independent samples from . Define to be the filtration consisting of the history of plays of Algorithm 1 up to the -th pull of arm . Then it holds
For a random variable , it holds by Formula 7.1.13 from Abramowitz and Stegun (1965) that
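The displayed inequality is not legible in this text. One standard Gaussian lower tail bound that is often quoted alongside this formula is P(Z > mu + x * sigma) >= x / (x^2 + 1) * exp(-x^2 / 2) / sqrt(2 * pi) for x > 0 and Z distributed as N(mu, sigma^2). Assuming that this is the intended form, the Python sketch below compares it numerically with the exact tail probability of a standard Gaussian. The function names are hypothetical.

```python
import math


def gaussian_upper_tail(x):
    """Exact P(Z > x) for a standard Gaussian Z, via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def tail_lower_bound(x):
    """Assumed lower bound: x / (x^2 + 1) * exp(-x^2 / 2) / sqrt(2 * pi)."""
    return x / (x * x + 1.0) * math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


for x in (0.1, 0.5, 1.0, 2.0, 4.0, 6.0):
    exact = gaussian_upper_tail(x)
    bound = tail_lower_bound(x)
    print(f"x={x:4.1f}  exact={exact:.3e}  bound={bound:.3e}  ratio={bound / exact:.3f}")
```

The ratio of the bound to the exact tail approaches 1 as x grows, which is why a bound of this shape is useful for controlling the probability that a Gaussian sample exceeds a threshold far above its mean.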
Therefore, it holds that
where the last inequality is due to , and . Let , then
It is easy to verify that for , . Hence, if , we have . Thus, we have
For any , it holds that
For any , this gives rise to
where is a universal constant. Let . We further obtain
Since is fixed, then there exists a universal constant such that
Then, we have
Applying Lemma 8, we have
Substituting the above inequality into (33) yields
3.2 Proof of the Asymptotic Optimality
We first prove the following technical lemma.
For any that satisfies , it holds that
Proof of Lemma 5.
For sufficiently small such that , which also implies . Applying Lemma 8, we have . Furthermore,
where the last inequality is due to the fact for all . Define . For and sampled from , if , then using Gaussian tail bound in Lemma 7, we obtain
Let be the event that holds for all . We further obtain
Now we prove the asymptotic optimality of MOTS.
Proof of Theorem 2.
Let be the following event
For any arm , we have