The multi-armed bandit problem has been extensively studied in computer science, operations research and economics since the seminal work of Robbins (1952). It is a model designed for sequential decision-making in which a player chooses at each time step amongst a finite set of available arms and receives a reward for the chosen decision. The player’s objective is to minimize the difference, called regret, between the rewards she receives and the rewards accumulated by the best arm. In the stochastic multi-armed bandit problem, the rewards of each arm are drawn from a probability distribution; in adversarial multi-armed bandit models, by contrast, there is typically no assumption imposed on the sequence of rewards received by the player.
In recent work, Lykouris et al. (2018) introduce a model in which an adversary could corrupt the stochastic reward generated by an arm pull. They provide an algorithm and show that the regret of this “middle ground” scenario degrades smoothly with the amount of corruption injected by the adversary. Gupta et al. (2019) present an alternative algorithm which gives a significant improvement.
With real-world applications such as fake reviews and the effects of employing celebrity brand ambassadors in mind (Kapoor et al., 2019), we complement the literature by incorporating the notion of corruption into the stochastic linear optimization problem in the framework of Dani et al. (2008), thereby answering an open question suggested in Gupta et al. (2019). In our finite-horizon model, the player chooses at each time step a vector (i.e., an arm) in a fixed decision set. To consider the problem-dependent bound, we assume that is a -dimensional polytope as in Abbasi-Yadkori et al. (2011). The regret of our algorithm is , where corresponds to the distance between the highest and lowest expected rewards, the amount of corruption and the level of confidence. In contrast to the stochastic model with corruption, our regret suffers an extra multiplicative loss of , which is caused by the separation of exploration and exploitation.
1.1 Related works
The finite-arm version of the stochastic linear optimization problem is introduced in Auer (2002). When the number of arms is infinite, the CONFIDENCEBALL algorithm (Dani et al., 2008) obtains the worst-case regret bound of . Li et al. (2019) improve this result by replacing by a dependence. For the problem-dependent bound, Abbasi-Yadkori et al. (2011) show that the regret of their OFUL algorithm is , and our algorithm achieves at least the same asymptotic performance when there exists an amount of corruption. Similar to the result of Lykouris et al. (2018), both the CONFIDENCEBALL algorithm and the OFUL algorithm suffer linear regret even when the amount of corruption appears to be small.
There have also been works that strive to achieve good regret guarantees in both stochastic multi-armed bandit models and their adversarial counterparts, commonly known as “the best of both worlds” (e.g., Bubeck and Slivkins (2012) and Zimmert and Seldin (2018)). In those algorithms the regret does not degrade smoothly as the amount of adversarial corruption increases. Kapoor et al. (2019) consider the corruption setting in the linear contextual bandit problem under the strong assumption that at each time step the adversary corrupts the data with a constant probability.
Our algorithm builds on Gupta et al. (2019). To eliminate the effect of corruption, we borrow the idea of dividing the time horizon into epochs which increase exponentially in length and using only the estimate from the previous epoch to conduct exploitation in the current one. This approach weakens the dependence of the current estimate on the levels of earlier corruption, so the negative impact of the adversary fades away over time. The main challenge of our paper is that we cannot simply adopt the widely used ordinary least squares estimator, since the correlation between different time steps of estimation impedes the application of concentration inequalities. We thus conduct exploration on each coordinate independently.
Let be a -polytope. At each time step , the algorithm chooses an action . Let be an unknown hidden vector and a sequence of sub-Gaussian random noise with mean 0 and variance proxy 1. For a given time step and a chosen action , we define the reward as , where the first term is the inner product of and . We assume without loss of generality that and for all .
At each time step , there is an adaptive adversary who may corrupt the observed reward by choosing a corruption function . The algorithm first chooses , then observes the corrupted reward , and finally receives the actual reward . We denote by the total corruption generated by the adversary. The value of is unknown to the algorithm, whose performance is, in turn, evaluated by pseudo-regret:
where is an action that maximizes the expected reward. In this paper, we assume that is unique. (This assumption is without loss of generality: when the action set is perturbed with random noise, the best action is unique with probability 1.) Let be the set of extreme points of and . The extreme point that generates the second highest reward is denoted ; i.e., . Thus the corresponding expected reward gap between and is given by
We now introduce the so-called Löwner-John ellipsoid (see Grötschel et al. (1988) for a detailed discussion), which plays a key role in the construction of our algorithm.
Theorem 2.1 (Löwner-John’s Ellipsoid Theorem).
For any bounded convex body , there exists an ellipsoid satisfying
A discussion of how to find the Löwner-John ellipsoid efficiently is deferred to Section 6. Let be a Löwner-John ellipsoid guaranteed by Theorem 2.1. Let be the center and the -th principal axis, , of . Without loss of generality, we assume that is the origin; otherwise we could shift the origin toward such that the new decision set . Then the reward for each action is shifted by the same constant, and therefore the problem remains unchanged. In what follows, we dub the exploration set. It is worth noting that corresponds to an orthogonal basis for . From Theorem 2.1, we obtain the following result.
Corollary 2.2.
For each , we have , where .
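The interaction protocol of this section can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code. It assumes that the corruption is added to the reward and that the noise is standard Gaussian (one example of sub-Gaussian noise with variance proxy 1). The names `play_round`, `corruption` and `pseudo_regret` are hypothetical.

```python
import random


def inner(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def play_round(action, theta, corruption, rng):
    """Simulate one round and return (observed_reward, actual_reward).

    `corruption` is a hypothetical callable standing in for the adversary's
    corruption function; additive corruption is an assumption of this sketch.
    """
    noise = rng.gauss(0.0, 1.0)  # Gaussian noise has variance proxy 1
    actual = inner(action, theta) + noise
    observed = actual + corruption(action)
    return observed, actual


def pseudo_regret(actions, theta, best_action):
    """Sum over rounds of the expected-reward gap to the best action."""
    best = inner(best_action, theta)
    return sum(best - inner(x, theta) for x in actions)
```

The algorithm only sees `observed` when it updates its estimates, while the pseudo-regret is computed from the uncorrupted expected rewards.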
3 The SBE algorithm
In this section, we introduce our Support Basis Exploration (SBE) algorithm for the stochastic linear optimization problem with adversarial corruption (see Algorithm 1).
The algorithm runs in epochs which increase exponentially in length. Each epoch has a length greater than , and therefore the total number of epochs is bounded above by . The choice of the current action depends only on information received from the last epoch, so the level of earlier corruption will have a decreasing effect on later epochs. Unlike other algorithms for stochastic linear optimization models, we separate exploration and exploitation so that we can decrease the correlation between vector pulls in each epoch and thus minimize the influence of adversarial corruption on the estimate. This approach inevitably increases the regret by a multiplicative factor.
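To illustrate why exponentially growing epochs keep the number of epochs logarithmic in the horizon, here is a small sketch. The doubling factor, the initial length and the function name `epoch_lengths` are assumptions made for illustration; the epoch lengths used by Algorithm 1 are not reproduced here.

```python
def epoch_lengths(horizon, first_length=1):
    """Split `horizon` time steps into epochs whose lengths double each time.

    The doubling schedule is an assumption of this sketch, not necessarily
    the schedule used by the SBE algorithm.
    """
    lengths = []
    remaining = horizon
    length = first_length
    while remaining > 0:
        used = min(length, remaining)  # the final epoch may be truncated
        lengths.append(used)
        remaining -= used
        length *= 2
    return lengths
```

For a horizon of 100 steps this schedule produces seven epochs, so an estimate polluted by corruption in one epoch influences the play of only the next epoch.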
Given the exploration set defined in Section 2, we can represent each vector in the decision set according to the elements of . By Corollary 2.2, the coefficient on each coordinate, in this new representation, is bounded by . It follows that the maximal projection on the basis vector is simply . In other words, contains the maximum information up to a constant in its own direction. Since basis vectors and are orthogonal to each other, there is no information loss using the exploration set in the algorithm. Thus, we obtain a better concentration in each round of estimation. Note that our algorithm can take as input any basis that satisfies a guarantee similar to that of Corollary 2.2, and in Section 6, we provide an efficient algorithm that finds such a set with a multiplicative loss in regret. The construction of other parameters in the algorithm is explained in the next section.
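The representation of a decision vector in an orthogonal (not necessarily orthonormal) basis can be sketched as follows. The helper names `coordinates` and `reconstruct` are hypothetical, and the basis in the usage note is a toy example rather than the principal axes of an actual Löwner-John ellipsoid.

```python
def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def coordinates(x, basis):
    """Coefficients of x with respect to an orthogonal basis.

    Orthogonality lets each coefficient be computed independently of the
    others, by projecting x onto one basis vector at a time.
    """
    return [dot(x, b) / dot(b, b) for b in basis]


def reconstruct(coeffs, basis):
    """Rebuild the vector from its coefficients in the given basis."""
    dim = len(basis[0])
    return [sum(c * b[i] for c, b in zip(coeffs, basis)) for i in range(dim)]
```

With `basis = [[2.0, 0.0], [0.0, 0.5]]` and `x = [1.0, 0.25]`, both coefficients equal 0.5, and `reconstruct` recovers `x` exactly, which is the sense in which the exploration set loses no information.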
4 Parameter estimation
We now know that the hidden vector, , can be represented according to the exploration set ; that is, . For any , let be an indicator of the event that the basis vector is chosen in time step . Let be the expected number of time steps used to explore each basis vector . Since is sampled uniformly, it follows that is independent of . Then the “average reward” for exploring in epoch (not the actual average reward, since is not the realized number of time steps used to explore ) is
Since is independent of the noise, , as well as the amount of corruption, , taking expectation over the randomness of the independent variables and on both sides yields
where . At the end of each epoch , we have as the estimate of and as the estimate of . Before giving a uniform bound for the error in expected reward , we first provide an upper bound for the error of in each dimension .
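The idea of normalizing by the expected, rather than realized, number of exploration steps can be sketched as below. This is an uncorrupted toy simulation under several assumptions: each time step explores with a fixed probability `explore_prob`, the explored axis is drawn uniformly, and the noise is standard Gaussian. None of these parameter choices are taken from Algorithm 1.

```python
import random


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def estimate_axis_rewards(basis, theta, epoch_length, explore_prob, rng):
    """Estimate the expected reward of each basis vector over one epoch.

    The running total for each axis is divided by the EXPECTED number of
    pulls, which is a fixed quantity that does not depend on the samples.
    """
    d = len(basis)
    expected_pulls = explore_prob * epoch_length / d
    totals = [0.0] * d
    for _ in range(epoch_length):
        if rng.random() < explore_prob:  # exploration step (assumed rule)
            i = rng.randrange(d)  # axis sampled uniformly
            totals[i] += dot(basis[i], theta) + rng.gauss(0.0, 1.0)
    return [s / expected_pulls for s in totals]
```

Because the normalizer is deterministic, each estimate is a sum of independent terms divided by a constant, which is the form concentration inequalities are usually applied to.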
4.1 Error of estimated reward
Lemma 4.1.
With probability at least , the estimate is such that
for all and for all epochs .
Since the indicator and the noise are independent random variables, by a form of the Chernoff-Hoeffding bound in Hoeffding (1963), we have for any deviation and any
For any , let for all . Denote by the filtration generated by random variables and , and define . Since is independent of the corruption level conditional on , yields a martingale with respect to the filtration . The variance of conditional on can be bounded as
The first inequality holds because , and the second inequality holds because . Using a Freedman-type concentration inequality for martingales (Beygelzimer et al., 2011), we have for any ,
Note that . Combining it with Equation (2), for any , we have
For any , substituting , we can get
Similarly, consider the sequence . Then, for any , we have
Let and . Then
and . It follows that
where the first equality holds because by the definition of and , , and . By applying the union bound for all and epoch , we obtain the desired result. ∎
Lemma 4.2.
With probability at least , we have
for all epochs and all .
For simplicity, we denote
and let be the event that . Note that event happens with probability at least .
4.2 Bound analysis for estimated gap
We now provide upper and lower bounds for the estimated gap . Let be one of the actions that maximizes the expected reward given the estimate . We also define , and let the second best action given be . Since may not be unique, the expected reward for and may coincide. Then the estimated gap in epoch corresponds to .
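The estimated gap described above can be computed directly from the extreme points and the current estimate. The sketch below is illustrative only; the function name `estimated_gap` and the toy inputs in the usage note are assumptions.

```python
def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def estimated_gap(extreme_points, theta_hat):
    """Gap between the best and second-best extreme point under theta_hat.

    One maximizer is set aside and the runner-up is taken among the
    remaining points, so a tie for first place yields a gap of zero.
    """
    values = sorted((dot(x, theta_hat) for x in extreme_points), reverse=True)
    return values[0] - values[1]
```

With extreme points `[[1, 0], [0, 1], [-1, 0]]`, the estimate `[0.5, 0.2]` gives a gap of about 0.3, while the estimate `[0.5, 0.5]` gives a gap of 0 because two points tie for the best estimated reward.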
Lemma 4.3 (Upper Bound for ).
Suppose that event happens. Then for all epochs
First note that whenever ; otherwise we have a unique expected reward-maximizing action for estimate . By the uniqueness of , we have , which implies that
because . For the case , we have . It follows that , and therefore
The last inequality follows from Lemma 4.2 because when the event occurs, both inequalities and are satisfied.
Now for the case , it is straightforward to see from the fact . This implies that the expected reward of given the estimate is at least as large as that of ; i.e., . Therefore
Combining all cases yields
Lemma 4.4 (Lower Bound for ).
Suppose that event happens. Then for all epochs
We consider first the case that the best action for is unique.
If , we know that , and thus
as the term is always negative.
If , then . It follows that
where the last inequality holds because . When the best action given is unique, we have
For the case that the best action is not unique, let be the best action given . Then , giving that
By applying the upper bound for in Lemma 4.3, we thus get
5 Regret estimation
With probability at least , the regret is bounded by
Let and be the pseudo regret for exploitation and exploration in epoch respectively. By Lemma 4.2, the event occurs with probability . We first establish the pseudo regret bound for exploitation given the occurrence of .
Exploitation: The pseudo regret for exploitation in epoch is .
Let be the pseudo regret for the action . Given that the event happens, we have
because . Define . Then we can get
If , then the total regret for exploitation is ; otherwise, we have . Now we consider two different cases.
For the case , we have . Combining it with Inequality (5), we have . So, the pseudo regret is
For the case , by Inequality (5), we have . It follows that
Thus, for each epoch ,
Summing over all epochs yields
where the third inequality holds because by the construction of our algorithm.
Exploration: Now we turn to the exploration part and establish a bound for the pseudo regret in each epoch . Note that the expected number of time steps in which exploration is conducted is , and the pseudo regret for each such time step is bounded above by 1.
When , since , we have
When , we again consider two cases. For the case , since and because , we have , and