Reinforcement Learning (RL) has been applied to many challenging sequential decision making problems, such as game playing and operation management (Peng et al., 2017; Silver et al., 2017; Cai et al., 2017). The most popular model used in RL is the Markov Decision Process (MDP), where an agent interacts with the environment, and utility is determined by the current state and the policy used. We focus on the episodic setting without discount, in which the agent goes through episodes sequentially and takes an action at each round. Once the agent reaches the horizon on rounds in an episode, it begins a new episode. The agent follows a policy, whose performance is measured by the expected cumulative reward, and the corresponding regret is defined as the gap between the expected cumulative reward received by the current policy and the optimal policy. Recent theoretical progress finds the minimax rates on regret in episodic MDPs up to log-factors (Azar et al., 2017; Zanette and Brunskill, 2019; Simchowitz and Jamieson, 2019).
However, an assumption crucial to previous analyses is stationarity of the MDP, i.e., the dynamics and reward of the underlying MDP are assumed to be fixed across episodes. But in most realistic settings the agent competes with an adversary who can adapt her strategy based on observations and side information. As a result, the MDP varies across episodes, and techniques based on stationarity fail to find a reliable policy. In fact, the concept of learning the best policy itself becomes questionable. To see why, consider the special MDP with only a single state, namely, the adversarial bandit problem (Bubeck et al., 2012). Here, when rewards are designed in an adversarial manner, any deterministic policy fails to achieve vanishing average regret. We must instead use a mixed strategy. By the same token, competing with the optimal policy without any constraints is impossible, and estimating the statistical properties of the system dynamics becomes meaningless.
This background motivates the following two fundamental questions:
Can we develop efficient policy-learning algorithms for MDP subject to adversarial manipulation?
Can we design algorithms for adversarial MDPs that are as efficient as algorithms for stationary MDPs?
We answer both these questions in this paper by building on the following useful analogy: just like a stationary MDP is a natural extension of the stochastic bandit problem, a non-stationary MDP is a natural extension of the adversarial bandit problem. We outline our contributions more precisely below.
We develop a provably efficient algorithm called ARL for adversarial MDPs. ARL achieves a regret of (Theorem 1); the dependence on , , and is optimal, and on , it is the best among existing model-free methods. It is worth noting that this rate is slightly better than the best known rate even for stationary MDPs (Jin et al., 2018). Moreover, to our knowledge it is the first log-free upper bound in the literature on non-asymptotic analysis of episodic MDPs.
We introduce the use of a particular reduction to bandits (Lemma 1), which is novel in the RL literature on sample complexity. This reduction leads to significantly shorter and simpler proofs than existing analyses that are based on results from the bandit literature.
2 Related Work
In this section, we review existing work related to the adversarial MDP problem. We first recall the multi-armed bandit problem and its solution, which will be used later in deriving the ARL algorithm. Then, we review existing sample-efficient algorithms for both stationary and non-stationary MDPs. Finally, we describe some connections to game theory. See Section 3 for details on the notation used in summarizing the related work below.
2.1 Multi-armed Bandit
The multi-armed bandit problem is a special case of an MDP with a single state, i.e., the number of states (Bubeck et al., 2012). Thus, decision making for each step is the same and the agent makes decisions in total. Equivalently, in round the agent chooses an action from a set of actions , and receives a reward given by the value . For simplicity, we assume the reward is bounded, but extensions to light-tailed and heavy-tailed rewards are both available in the literature (Bubeck et al., 2013a).
If the distribution is identical for all , we call it a stochastic bandit problem. The main challenge in this problem is to explore the action space efficiently while exploiting the best action to achieve low regret. The standard solution technique is the upper confidence bound (UCB) method. For each action , an upper bound on its mean reward that holds with high probability is maintained, and in each round the action with the highest UCB is chosen. This basic version of UCB is usually referred to as UCB1, and it achieves regret in the worst case. Variants of UCB1 like MOSS (Audibert and Bubeck, 2009) achieve regret, which matches the lower bound.
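To make the UCB principle concrete, here is a minimal sketch of UCB1 for rewards in [0, 1]. The function name `ucb1`, the callable-per-arm interface, and the bonus constant 2 are illustrative assumptions and are not taken from this paper.

```python
import math
import random


def ucb1(reward_fns, num_rounds, seed=0):
    """UCB1 sketch: reward_fns[a](rng) returns a reward in [0, 1] for action a."""
    rng = random.Random(seed)
    num_actions = len(reward_fns)
    counts = [0] * num_actions   # number of pulls per action
    means = [0.0] * num_actions  # empirical mean reward per action
    for t in range(1, num_rounds + 1):
        if t <= num_actions:
            a = t - 1  # pull every action once to initialize the estimates
        else:
            # Choose the action with the largest upper confidence bound.
            a = max(
                range(num_actions),
                key=lambda i: means[i] + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        r = reward_fns[a](rng)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]  # incremental mean update
    return counts, means


# Usage: two Bernoulli arms with (assumed) means 0.2 and 0.8.
arms = [
    lambda rng: 1.0 if rng.random() < 0.2 else 0.0,
    lambda rng: 1.0 if rng.random() < 0.8 else 0.0,
]
counts, means = ucb1(arms, 2000)
```

The confidence bonus shrinks as an action is pulled more often, so sub-optimal actions are revisited only while their upper bound remains competitive.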
If the distribution varies with , we call it an adversarial bandit problem. The standard algorithm for this setting is the Exponential-weight algorithm for Exploration and Exploitation (Exp3; Auer et al., 1995). Exp3 first computes unbiased estimates of the cumulative rewards via importance sampling; it then maps these into probabilities using exponential weights. More precisely, Exp3 maintains a probability distribution for sampling an action in round . If action is chosen, the estimate of reward for every arm is
and we update the probability distribution via the formula
where is the normalization constant and is a parameter.
Setting , it is shown in (Stoltz, 2005) that the regret can be upper bounded by . An information-theoretic lower bound of can be constructed (Bubeck et al., 2013b), which implies Exp3 is rate optimal up to a log-factor. To remove this log-factor, the Implicitly Normalized Forecaster (INF) algorithm was proposed in (Audibert and Bubeck, 2009). INF uses the same unbiased estimate as Exp3, but computes the normalization constant for the cumulative reward directly by , while updating the probability distribution by
where is a function defined implicitly (thus the name INF). It is shown in (Audibert and Bubeck, 2009) that the regret is and thus rate optimal. We will use INF as a subroutine in our ARL algorithm.
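The Exp3 update described above (importance-weighted reward estimates followed by exponential weighting) can be sketched as follows. This is a minimal illustration under assumptions: the names `exp3`, `reward_fn`, and `eta` are hypothetical, the reward callback may depend arbitrarily on the round index to mimic an adversary, and the INF variant with its implicit normalization is not implemented here.

```python
import math
import random


def exp3(reward_fn, num_actions, num_rounds, eta, seed=0):
    """Exp3 sketch: reward_fn(t, a) returns the reward in [0, 1] of action a in round t."""
    rng = random.Random(seed)
    cum_est = [0.0] * num_actions  # importance-weighted cumulative reward estimates
    probs = [1.0 / num_actions] * num_actions
    total_reward = 0.0
    for t in range(num_rounds):
        # Exponential weights; subtracting the max keeps exp() numerically stable.
        top = max(cum_est)
        weights = [math.exp(eta * (g - top)) for g in cum_est]
        norm = sum(weights)  # normalization constant
        probs = [w / norm for w in weights]
        a = rng.choices(range(num_actions), weights=probs)[0]
        r = reward_fn(t, a)
        total_reward += r
        # Unbiased estimate: only the chosen action receives r / p, all others 0.
        cum_est[a] += r / probs[a]
    return total_reward, probs


# Usage: action 1 always pays 1, the others pay 0 (an assumed toy reward sequence).
T, K = 2000, 3
total_reward, final_probs = exp3(
    lambda t, a: 1.0 if a == 1 else 0.0, K, T, eta=math.sqrt(math.log(K) / (K * T))
)
```

Because only the chosen action's estimate is updated, dividing by its sampling probability is what keeps the estimate unbiased for every action.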
2.2 Provably efficient methods for Stationary MDP
For stationary MDPs, the agent is also challenged by the trade-off between exploration and exploitation. Some algorithms assume the availability of a simulator, which enables the agent to query arbitrary state-action pairs (Koenig and Simmons, 1993). RL with a simulator is much easier since the challenge of exploration disappears; we focus therefore on the harder case without assuming access to a simulator.
Since a stationary MDP is a natural extension of stochastic bandits, UCB is used to explore the environment in almost all RL algorithms that have polynomial sample complexity. In (Jaksch et al., 2010), UCBs are constructed for system dynamics and rewards directly to achieve regret, while Agrawal and Jia (2017) combine UCB with Thompson sampling to achieve regret. Dann and Brunskill (2015) use a Bernstein-type concentration instead of Hoeffding-type to achieve regret.
A lower bound of is established in (Osband and Van Roy, 2016) and achieved by (Azar et al., 2017) up to log-factors. The improvement comes from a construction of UCBs for Q-values instead of transition probabilities. Jin et al. (2018) apply the above techniques to model-free methods to achieve regret. Ortner (2018) establishes a similar minimax regret bound for the infinite horizon case under a mixing assumption.
The UCB-based approximate dynamic programming approach used in (Azar et al., 2017) is extended by later papers. Zanette and Brunskill (2019) achieve tighter problem-dependent regret bounds without domain knowledge and Simchowitz and Jamieson (2019) achieve a minimax instance-dependent regret with the same minimax optimal regret (up to log-factors) in the worst case. Both of them take advantage of a lower confidence bound developed in Dann et al. (2018).
The rates noted above might differ slightly from the ones derived in the original papers with respect to . The reason is that some of them allow and to depend on , while others do not. To compare the rates consistently we use the more general setting where and may depend on . See (Jin et al., 2018) for more details on transforming the rates from one setting into the other.
Finally, note that although we measure performance using regret, one can also use other criteria such as PAC sample complexity (Kakade et al., 2003). It is known that these two kinds of results can be translated into each other. So we only focus on regret analysis. See (Dann et al., 2017, 2018) for other evaluation criteria for RL algorithms and their interrelation. Still, notice that the PAC framework is only meaningful in the stationary case. Therefore, the regret framework is more general and suits our study of adversarial MDPs better.
2.3 Non-stationary MDP
Non-stationary MDPs are motivated by robust RL, with or without the presence of an explicit adversary. The environment might be non-stationary in each episode, or we might be using a simulated environment that differs from reality. The desire for better generalization under such challenging circumstances motivates adversarial training in RL (Pinto et al., 2017). Historically, Nilim and El Ghaoui (2005) and Xu and Mannor (2007) study MDPs with fixed rewards but changing dynamics, while Even-Dar et al. (2005) and Dick et al. (2014) study MDPs with fixed dynamics but changing rewards.
In non-stationary MDPs, if we still define the regret as the gap in expected cumulative reward between the current policy and the optimal one, then there is no hope of achieving a sublinear dependence on . Indeed, just consider the adversary who assigns a reward of to the least probable action and to the others. Thus, we have to either make further assumptions on the underlying MDP, or set up constraints on the oracle against which we compete.
Two popular assumptions are switching and drifting. In switching there are certain turning points where the MDP changes suddenly but remains the same within each period in between. A common strategy is then to detect the turning points actively and apply algorithms for stationary MDPs in between (Da Silva et al., 2006; Abdallah and Kaisers, 2016; Padakandla et al., 2019). This line of work focuses on empirical results because the switching scenario is computationally hard in principle: worst cases can be constructed where the regret grows exponentially as the number of turning points increases. In the drifting scenario, the MDP changes either gradually or abruptly but the total variation in reward and transition probability is bounded. In this case, the minimax regret is , which is achieved in (Gajane et al., 2019).
Without these assumptions, we have to constrain the oracle to behave the same in each episode. This constraint matches the standard setting in online learning and adversarial bandits, and is also used in our work. Abbasi et al. (2013) also use this setting, but their approach has the following limitations: (i) They make a mixing assumption on the adversarial MDP, which does not hold in many interesting cases like the lower bound constructed in (Jin et al., 2018). Indeed this assumption is very strong and essentially reduces the MDP problem to a standard online learning problem. (ii) They assume full information is available, which is tantamount to using a simulator in RL language. (iii) They work with all the possible policies, which is computationally overwhelming because the number of possible policies grows exponentially as and increase. We overcome all these limitations by introducing an advantage decomposition combined with adversarial bandits.
2.4 Game Theory
Our reduction from adversarial MDP to adversarial bandits is partially inspired by the counterfactual regret minimization technique in game theory (Zinkevich et al., 2008). In extensive form games, the regret can be defined similarly but optimizing the regret directly is computationally overwhelming. The counterfactual regret minimization method upper bounds the overall regret by the sum of regret in each information set and solves the subproblems separately.
To control the regret in each information set, Blackwell’s algorithm (Blackwell et al., 1956) is the most popular option. It is usually assumed that a simulator is available in extensive form games so that the reward of each action can be observed. The corresponding regret is defined by . Blackwell’s algorithm updates the probability distribution by
Blackwell’s algorithm also yields regret and is sometimes called a regret matching algorithm due to the form of the expression for updating the probability distribution. However, since we do not assume the availability of a simulator in our RL problem, we cannot use Blackwell’s algorithm to solve the reduced bandit problem.
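For illustration, the regret matching rule can be sketched as below. This follows the standard full-information form (probabilities proportional to the positive part of cumulative regret), which is an assumption about the update intended above; the function names are hypothetical. Note that the full reward vector is observed in every round, which is exactly the simulator-style feedback that is unavailable in our RL setting.

```python
def regret_matching_step(cum_regret):
    """Map cumulative regrets to a distribution proportional to their positive parts."""
    positive = [max(x, 0.0) for x in cum_regret]
    total = sum(positive)
    n = len(cum_regret)
    if total <= 0.0:
        return [1.0 / n] * n  # no positive regret yet: play uniformly
    return [x / total for x in positive]


def run_regret_matching(reward_vectors):
    """Full-information loop: reward_vectors[t][a] is the reward of action a in round t."""
    n = len(reward_vectors[0])
    cum_regret = [0.0] * n
    history = []
    for rewards in reward_vectors:
        probs = regret_matching_step(cum_regret)
        history.append(probs)
        expected = sum(p * r for p, r in zip(probs, rewards))
        for a in range(n):
            # Regret of action a: what it would have earned minus what was earned.
            cum_regret[a] += rewards[a] - expected
    return history, cum_regret


history, cum_regret = run_regret_matching([[1.0, 0.0]] * 3)
```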
In this section, we first define the problem setting for stationary MDP, and subsequently introduce the (non-stationary) adversarial MDP problem.
3.1 The Stationary MDP Problem
A stationary MDP is a popular model to describe the interaction between an agent and the environment. A stationary episodic MDP is characterized by the tuple , where
Set contains possible states of the agent, whose cardinality is denoted by .
Set contains possible actions of the agent, whose cardinality is denoted by .
Horizon is the number of steps in one episode.
System dynamics characterizes the transition probability in .
Reward describes the utility of the agent.
In a stationary episodic MDP problem, the agent goes through episodes. Let be the total number of steps. In each episode the agent makes a sequence of decisions before reaching the horizon; then the agent starts a new episode. Without loss of generality, we assume that each episode begins with the same state .
At the -th step in episode , the agent is in state and chooses the action . The agent then transitions to a new state according to probability distribution and receives a reward according to supported on .
A policy characterizes the decision strategy of the agent. Mathematically, it maps a pair to an action . Since a policy can be learned through the interaction process, a series of policies might be used. For each pair , the policy is used at most once in each episode, so we can write the policy used in episode as . We use to denote the collection of policies .
The value function characterizes the expected cumulative reward for state during the whole episode , assuming the agent is using policy ; formally, it is defined as
The Q-value function describes the expected cumulative reward for the state-action pair via
The goal for a stationary MDP problem is to minimize the regret defined by
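As an illustration of the value and Q-value functions of this subsection, the following sketch evaluates a deterministic policy in a small tabular episodic MDP by backward induction. The data layout (`P[h][s][a]` as a list of next-state probabilities, `R[h][s][a]` as an expected reward, `policy[h][s]` as an action index) and the choice to index the dynamics by step are assumptions made for this example only.

```python
def evaluate_policy(P, R, policy, H):
    """Backward induction for a finite-horizon tabular MDP.

    P[h][s][a]   : list of next-state probabilities (assumed layout)
    R[h][s][a]   : expected immediate reward
    policy[h][s] : action chosen at step h in state s
    Returns V[h][s] and Q[h][s][a], with V[H][s] = 0 for all s.
    """
    num_states = len(P[0])
    V = [[0.0] * num_states for _ in range(H + 1)]
    Q = [[[0.0] * len(P[h][s]) for s in range(num_states)] for h in range(H)]
    for h in reversed(range(H)):
        for s in range(num_states):
            for a in range(len(P[h][s])):
                # Immediate reward plus expected value of the next state.
                Q[h][s][a] = R[h][s][a] + sum(
                    p * V[h + 1][s2] for s2, p in enumerate(P[h][s][a])
                )
            V[h][s] = Q[h][s][policy[h][s]]
    return V, Q


# Toy example: action 0 stays, action 1 switches state; only state 1 pays reward 1.
P_step = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
R_step = [[0.0, 0.0], [1.0, 1.0]]
H = 2
V, Q = evaluate_policy([P_step] * H, [R_step] * H, [[1, 0]] * H, H)
```

In this toy example the policy moves to state 1 and stays there, so starting in state 0 it collects reward only at the second step, while starting in state 1 it collects reward at both steps.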
3.2 The Adversarial MDP Problem
The adversarial MDP is a natural extension of the stationary one by allowing the MDP to change over time. Mathematically, we have different MDPs , where . In each episode, the agent interacts with the environment as usual, but the transition dynamics and the reward can depend on , and be possibly designed based on the policy used. We only assume the reward is bounded a.s., so that
and that is a valid probability distribution
We also need to define what it means to learn a reliable policy in an adversarial MDP. As mentioned above, we have to use a random policy instead of a deterministic one to achieve sublinear regret. A policy now maps a step-state pair to a distribution over the action space . We can also define the value and Q-value functions with minor modifications
For any policy , we also define
The goal for an adversarial MDP problem is to minimize the regret
The concept of optimal policy must also be modified. Instead of defining it for a single episode, we consider the total reward over all episodes. As a result, the regret (6) can be rewritten as
Formulation (7) is what we will subsequently use. Notice that if we assume , i.e., the MDP is actually stationary, everything defined above reduces to its counterpart for stationary MDP. Therefore, the solution developed for adversarial MDP automatically yields a solution for stationary MDP with the same theoretical guarantees; the converse is clearly false.
4 Main results
In this section we propose the ARL algorithm and prove an upper bound on its regret. To motivate our algorithm, we first reduce the adversarial MDP problem into a series of adversarial bandit problems. Then, we solve these bandit problems separately and the corresponding RL algorithm is called ARL. Finally we prove the upper regret bound based on the reduction.
4.1 Reduction to Adversarial Bandits
Now we begin our reduction from adversarial MDP to adversarial bandit with the following plan of attack. We first introduce the notion of advantage, which can be considered as the single-state regret, and its decomposition. Then, we construct an adversarial bandit for each state separately. By definition, the advantage will be dominated by regret in each adversarial bandit, and thus solving these bandits yields a solution to the original MDP.
The central concept of advantage is defined as follows:
which characterizes the regret of using policy instead of in each state. The following lemma, usually referred to as the advantage decomposition of policy values in the RL literature (Sutton and Barto, 2018), shows that if the advantage can be controlled, then so can the overall regret.
For any two policies and , the difference of value functions can be written as
where is the distribution of the state in step beginning from under policy and MDP .
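The decomposition can be checked numerically on a small example. The sketch below verifies the standard finite-horizon form of the identity, namely that the value gap between a reference policy and the current policy equals the sum over steps of the expected advantage under the reference policy's state distribution. The exact orientation and notation of Lemma 1 may differ; the time-homogeneous layout `P[s][a][s2]`, `R[s][a]` and all function names are assumptions for illustration.

```python
def q_and_v(P, R, pi, H):
    """Backward induction for a stochastic policy pi[h][s][a] (action probabilities)."""
    S, A = len(P), len(P[0])
    V = [[0.0] * S for _ in range(H + 1)]
    Q = [[[0.0] * A for _ in range(S)] for _ in range(H)]
    for h in reversed(range(H)):
        for s in range(S):
            for a in range(A):
                Q[h][s][a] = R[s][a] + sum(P[s][a][s2] * V[h + 1][s2] for s2 in range(S))
            V[h][s] = sum(pi[h][s][a] * Q[h][s][a] for a in range(A))
    return Q, V


def state_distributions(P, pi, H, s1):
    """d[h][s]: probability of being in state s at step h when starting from s1."""
    S, A = len(P), len(P[0])
    d = [[0.0] * S for _ in range(H)]
    d[0][s1] = 1.0
    for h in range(H - 1):
        for s in range(S):
            for a in range(A):
                for s2 in range(S):
                    d[h + 1][s2] += d[h][s] * pi[h][s][a] * P[s][a][s2]
    return d


def decomposition_sides(P, R, pi, pi_ref, H, s1):
    """Return (value gap, summed expected advantages); the two should coincide."""
    S, A = len(P), len(P[0])
    Q, V = q_and_v(P, R, pi, H)
    _, V_ref = q_and_v(P, R, pi_ref, H)
    d_ref = state_distributions(P, pi_ref, H, s1)
    lhs = V_ref[0][s1] - V[0][s1]
    rhs = 0.0
    for h in range(H):
        for s in range(S):
            # Advantage of acting with pi_ref for one step, then following pi.
            advantage = sum(pi_ref[h][s][a] * Q[h][s][a] for a in range(A)) - V[h][s]
            rhs += d_ref[h][s] * advantage
    return lhs, rhs


# Arbitrary toy MDP with 2 states, 2 actions, horizon 3 (values chosen for illustration).
P = [[[0.7, 0.3], [0.2, 0.8]], [[0.5, 0.5], [0.9, 0.1]]]
R = [[0.1, 0.6], [0.9, 0.3]]
H = 3
pi = [[[0.5, 0.5], [0.5, 0.5]] for _ in range(H)]
pi_ref = [[[0.1, 0.9], [0.8, 0.2]] for _ in range(H)]
lhs, rhs = decomposition_sides(P, R, pi, pi_ref, H, 0)
```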
To this end, we run an adversarial bandit algorithm for each pair separately. The bandit method receives a reward and updates its parameters whenever is visited. Denote the episode in which is visited for the -th time (under the policy ) by and the total number of times is visited by . The reward is now defined using the cumulative reward received until the end of that episode . The expected value of the reward for this adversarial bandit is exactly by definition. The regret is (where denotes for simplicity):
By definition of the regret of an adversarial bandit problem, the regret of the adversarial bandit constructed above yields an upper bound on the cumulative advantage (10) of the corresponding time-state pair . Therefore, once we can solve the adversarial bandit problem with low regret, the said upper bound automatically yields an upper bound on the regret of the ARL algorithm.
4.2 The ARL Algorithm
We are now ready to present our Adversarial Reinforcement Learning (ARL) algorithm based on the reduction described above.
ARL is determined once we choose a bandit algorithm for each pair . Here we choose INF to take advantage of its regret bound; see Section 2.1 for background on INF and other adversarial bandit algorithms.
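The overall structure of the reduction (one adversarial bandit learner per step-state pair, fed with the reward collected from that step until the end of the episode) can be sketched as follows. This is only an illustration under stated assumptions: it substitutes a plain Exp3 learner for INF, rescales the bandit reward by the horizon so that it lies in [0, 1], and uses a hypothetical environment callback `step_fn(k, h, s, a, rng)` that returns the next state and a reward in [0, 1]. It should not be read as the exact ARL procedure.

```python
import math
import random


class Exp3Learner:
    """Exp3 stand-in for the per-(step, state) bandit (the paper uses INF instead)."""

    def __init__(self, num_actions, eta):
        self.eta = eta
        self.cum_est = [0.0] * num_actions

    def probs(self):
        top = max(self.cum_est)
        weights = [math.exp(self.eta * (g - top)) for g in self.cum_est]
        norm = sum(weights)
        return [w / norm for w in weights]

    def update(self, action, reward, prob):
        self.cum_est[action] += reward / prob  # importance-weighted estimate


def run_arl_sketch(step_fn, num_states, num_actions, H, num_episodes, eta, s1=0, seed=0):
    rng = random.Random(seed)
    learners = [[Exp3Learner(num_actions, eta) for _ in range(num_states)] for _ in range(H)]
    episode_returns = []
    for k in range(num_episodes):
        s, trajectory = s1, []
        for h in range(H):
            p = learners[h][s].probs()
            a = rng.choices(range(num_actions), weights=p)[0]
            s_next, r = step_fn(k, h, s, a, rng)  # may depend on the episode index k
            trajectory.append((h, s, a, p[a], r))
            s = s_next
        reward_to_go = 0.0
        for h, s, a, pa, r in reversed(trajectory):
            reward_to_go += r  # cumulative reward from step h to the end of the episode
            learners[h][s].update(a, reward_to_go / H, pa)
        episode_returns.append(sum(step[4] for step in trajectory))
    return learners, episode_returns


# Toy environment (assumed): in state 0, action 1 pays 1 and stays; anything else
# leads to the absorbing zero-reward state 1.
def chain_step(k, h, s, a, rng):
    if s == 0 and a == 1:
        return 0, 1.0
    return 1, 0.0


learners, returns = run_arl_sketch(chain_step, 2, 2, H=3, num_episodes=400, eta=0.1)
```

Only the learners at visited step-state pairs are updated in an episode, which mirrors the statement that a bandit receives a reward whenever its pair is visited.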
With these choices, we are ready to state an upper regret bound for the ARL algorithm.
Using the standard MDP notation (see Section 3.1), the policy defined by ARL satisfies the bound
We first rewrite the cumulative regret using the advantage decomposition of Lemma 1 in terms of cumulative advantage. That is,
Now we proceed to bound the individual expectations above to obtain
Comparing with the lower bound. In Jin et al. (2018), a lower bound for the stationary setting is given by , which automatically implies a lower bound for the adversarial setting. Comparing with our result, we can see the upper bound (11) is tight with respect to , and yet loses a factor (but, to our knowledge, it is the first log-free upper bound in the literature).
Comparing with upper bounds in stationary MDPs. Since a stationary MDP is a special case of an adversarial MDP, ARL can also be used in the stationary setting. In (Azar et al., 2017), a matching upper bound (up to log-factors) is given for the stationary setting using a model-based method, and Jin et al. (2018) establish the model-free upper bound . Therefore, despite non-stationarity we achieve the best dependence on among model-free methods.
Using reduction to bandits for stationary MDP. A natural idea is to use the same bandit reduction technique for stationary MDPs. However, the resulting bandit will still be an adversarial bandit instead of stochastic one because the policy is updated over time and thus the reward distribution is changing. Indeed, inequality (12) holds because it is using an adversarial bandit instead of a stochastic one.
Improving the dependence on . When proving upper bounds for stationary MDP, a key tool is Hoeffding-type concentration. This tool is optimal for stochastic bandit problems, but no longer optimal for MDPs. Instead, a Bernstein type concentration is used to improve the dependence on . Thus, an optimal bandit algorithm does not necessarily imply an optimal RL algorithm. Therefore, more sophisticated structures within the bandits must be taken into account if one wants to improve dependence on . It is currently an open question whether this improvement is possible, and whether the minimax rate for stationary MDP and adversarial bandit are the same with respect to . Since the concept of horizon does not appear in the bandit setting, the dependence on cannot be explained by purely following the connection between bandits and MDPs.
5 Examples and applications
In this section, we give two examples to illustrate the usefulness of the ARL algorithm.
5.1 Stationary MDP under adversarial attack
The first example we consider is a stationary MDP under adversarial attack. Specifically, we consider reward manipulation. Every time the agent receives a reward , the adversary can change it to . We measure the power of the attack by the total variation . Without loss of generality, we can assume the attack is also bounded, i.e., . The reason is that, since the reward in a stationary MDP is known to be supported on , if the manipulated reward received by the agent falls outside this interval, the agent can detect the attack and clip the received reward before using it for exploration.
We call the original stationary MDP and the new MDP under attack , which is an adversarial MDP and differs from only in reward. Therefore, we need to specify the notion of regret under adversarial attack. For any set of policies , we can either define the regret under the original MDP , or define the regret under the attacked MDP . The following lemma shows these two definitions are close to each other if the power of the attack is relatively small.
The difference of the two regrets defined above is bounded by the power of attack, i.e.
We first show that provably efficient algorithms such as those developed in (Dann and Brunskill, 2015; Dann et al., 2017; Azar et al., 2017; Jin et al., 2018; Dann et al., 2018; Zanette and Brunskill, 2019; Simchowitz and Jamieson, 2019) are extremely sensitive to adversarial attacks with reward manipulation. Quantitatively, we can design an adversarial attack with power that results in regret (for both definitions above).
If the learning algorithm achieves instance-dependent regret in a stationary MDP, we can construct an adversarial attack with power that makes the agent follow a given policy for times. As a result, if policy is sub-optimal, the regret of the agent is .
The attack plan is very simple: if the agent is following , the adversary does not manipulate the reward. Otherwise, the adversary sets it to . Then in the resulting MDP , the optimal policy will be exactly since the other choices will result in reward. By the regret assumption, the learning algorithm will choose the other actions at most times; therefore the adversary needs to spend at most power.
Now we can easily check that the claims in Prop. 1 are valid. The agent will follow for times by the above argument. If policy is sub-optimal, then the agent suffers constant regret in each episode and in total in . Since the power is , we can see the regret in is also . ∎
A few comments are in order:
The above construction does not require the adversary to know more about than the agent. In fact, more efficient attack plans are also available. Jun et al. (2018) develop an online algorithm with which the adversary learns the environment at the same time as the agent for stochastic bandits, and Ma et al. (2018) develop an offline algorithm that designs an adversarial attack by solving an optimization problem for contextual bandits. Both arguments can be extended to the MDP setting.
The requirement for the learning algorithm under attack is that it achieves the minimax instance-dependent rate. As argued in Simchowitz and Jamieson (2019), all of the recently proposed provably efficient methods using optimistic estimation such as (Dann and Brunskill, 2015; Dann et al., 2017; Azar et al., 2017; Jin et al., 2018; Dann et al., 2018; Zanette and Brunskill, 2019; Simchowitz and Jamieson, 2019) achieve this rate.
Although we assume the MDP starts from the same state in each episode, our construction of the adversarial attack still holds in general. That is, the sensitivity to adversarial attack cannot be remedied by restarting.
In contrast, we can prove that ARL is quite robust to reward manipulation. Indeed, since the attacked MDP still satisfies the bounded reward assumption, by Thm. 1, is always . A straightforward application of Lemma 2 results in the following conclusion.
Let be the set of policies learnt using the ARL algorithm. Then
furthermore, for any , if , then the power used by the adversary must be at the same order, i.e.,
Notice the first half of Prop. 2 still holds if the adversary can also manipulate the dynamics. If we measure the perturbation in dynamics using distance, then the second half becomes
Therefore, if the adversary wants to make an agent using the ARL algorithm choose a sub-optimal sequence of actions times, she must spend power, too. In other words, the adversary always pays the same price. As a result, ARL is robust to reward manipulation.
5.2 Equilibrium in multi-agent RL
Adversarial MDPs are also a crucial step towards multi-agent RL, also termed stochastic games (Shapley, 1953). Instead of interacting with the environment directly, the agents play a matrix game in each round and then each agent receives his reward. If other agents’ decisions are also revealed, some game-theoretic RL algorithms can be developed (Hu and Wellman, 2003; Littman, 1994, 2001). Here we discuss a more general setting where each agent can only see his own decision. Then the multi-agent RL problem reduces to our adversarial MDP setting and the ARL algorithm provides a provably efficient way to compute an approximate equilibrium.
In game-theoretic language, for each episode , agent chooses a deterministic policy . Since the number of possible policies grows exponentially with and , it is hard to compute an equilibrium for the game directly. ARL provides an efficient way to compute an approximate equilibrium in polynomial time. Note that in the game theory literature, loss rather than reward is usually used to characterize the utility of agents. Here we use reward for convenience; everything can be translated into loss by negation.
We first consider a two-player constant-sum game. By the minimax theorem, the worst case guarantee of both agents in one episode can be characterized by the value of the game . A well-known result in game theory says that sublinear regret implies convergence to the value of the game (see Thm. 4.9 in (Blum and Monsour, 2007)). Therefore, Thm. 1 implies the following result.
For , if agent is using ARL,
Now we consider general-sum games with an arbitrary number of players. Although the concept of value no longer exists, we can still compute an approximate correlated equilibrium. To this end, we need a stronger notion of regret called swap regret (Blum and Mansour, 2007). A modification rule maps a deterministic policy to another by consistently swapping one action for another. The swap regret of a policy is the gap between the current cumulative reward and that of the best possible policy under any modification rule, i.e.
We can bound the swap regret similarly with a small modification of the ARL algorithm. Essentially, we use the bandit algorithm in (Blum and Mansour, 2007) for each , which achieves swap regret.
ARL using bandit algorithm in Blum and Mansour (2007) satisfies
Now we are ready to use the above swap regret guarantee to derive the correlated equilibrium guarantee by Thm. 4.12 in (Blum and Monsour, 2007).
If all the agents are using ARL with bandit algorithm in (Blum and Mansour, 2007), the empirical distribution of joint policy used is an -correlated equilibrium.
The above propositions provide us with a provably efficient way to compute an equilibrium in multi-agent RL.
In this paper, we study RL with adversarial MDPs. We develop a provably efficient algorithm, ARL, whose regret rate is slightly better than the best known rate of model-free methods for stationary MDPs (Jin et al., 2018). We illustrate the usefulness of the ARL algorithm with two applications: provably robust RL and computing approximate equilibria in multi-agent RL.
- Abbasi et al.  Yasin Abbasi, Peter L Bartlett, Varun Kanade, Yevgeny Seldin, and Csaba Szepesvári. Online learning in markov decision processes with adversarially chosen transition probability distributions. In Advances in neural information processing systems, pages 2508–2516, 2013.
- Abdallah and Kaisers  Sherief Abdallah and Michael Kaisers. Addressing environment non-stationarity by repeating q-learning. The Journal of Machine Learning Research, 17(1):1582–1612, 2016.
- Agrawal and Jia  Shipra Agrawal and Randy Jia. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. In Advances in Neural Information Processing Systems, pages 1184–1194, 2017.
- Audibert and Bubeck  Jean-Yves Audibert and Sébastien Bubeck. Minimax policies for adversarial and stochastic bandits. In COLT, pages 217–226, 2009.
- Auer et al.  Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of IEEE 36th Annual Foundations of Computer Science, pages 322–331. IEEE, 1995.
- Azar et al.  Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 263–272. JMLR. org, 2017.
- Blackwell et al.  David Blackwell et al. An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 6(1):1–8, 1956.
- Blum and Mansour  Avrim Blum and Yishay Mansour. From external to internal regret. Journal of Machine Learning Research, 8(Jun):1307–1324, 2007.
- Blum and Monsour  Avrim Blum and Yishay Monsour. Learning, regret minimization, and equilibria. 2007.
- Bubeck et al.  Sébastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
- Bubeck et al. [2013a] Sébastien Bubeck, Nicolo Cesa-Bianchi, and Gábor Lugosi. Bandits with heavy tail. IEEE Transactions on Information Theory, 59(11):7711–7717, 2013a.
- Bubeck et al. [2013b] Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet. Bounded regret in stochastic multi-armed bandits. In Conference on Learning Theory, pages 122–134, 2013b.
- Cai et al.  Han Cai, Kan Ren, Weinan Zhang, Kleanthis Malialis, Jun Wang, Yong Yu, and Defeng Guo. Real-time bidding by reinforcement learning in display advertising. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pages 661–670. ACM, 2017.
- Da Silva et al.  Bruno C Da Silva, Eduardo W Basso, Ana LC Bazzan, and Paulo M Engel. Dealing with non-stationary environments using context detection. In Proceedings of the 23rd international conference on Machine learning, pages 217–224. ACM, 2006.
- Dann and Brunskill  Christoph Dann and Emma Brunskill. Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Neural Information Processing Systems, pages 2818–2826, 2015.
- Dann et al.  Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pages 5713–5723, 2017.
- Dann et al.  Christoph Dann, Lihong Li, Wei Wei, and Emma Brunskill. Policy certificates: Towards accountable reinforcement learning. arXiv preprint arXiv:1811.03056, 2018.
- Dick et al.  Travis Dick, Andras Gyorgy, and Csaba Szepesvari. Online learning in Markov decision processes with changing cost sequences. In International Conference on Machine Learning, pages 512–520, 2014.
- Even-Dar et al.  Eyal Even-Dar, Sham M Kakade, and Yishay Mansour. Experts in a Markov decision process. In Advances in Neural Information Processing Systems, pages 401–408, 2005.
- Gajane et al.  Pratik Gajane, Ronald Ortner, and Peter Auer. Variational regret bounds for reinforcement learning. arXiv preprint arXiv:1905.05857, 2019.
- Hu and Wellman  Junling Hu and Michael P Wellman. Nash Q-learning for general-sum stochastic games. Journal of Machine Learning Research, 4(Nov):1039–1069, 2003.
- Jaksch et al.  Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
- Jin et al.  Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4868–4878, 2018.
- Jun et al.  Kwang-Sung Jun, Lihong Li, Yuzhe Ma, and Jerry Zhu. Adversarial attacks on stochastic bandits. In Advances in Neural Information Processing Systems, pages 3640–3649, 2018.
- Kakade et al.  Sham Machandranath Kakade et al. On the sample complexity of reinforcement learning. PhD thesis, University of London, London, England, 2003.
- Koenig and Simmons  Sven Koenig and Reid G Simmons. Complexity analysis of real-time reinforcement learning. In AAAI, pages 99–107, 1993.
- Littman  Michael L Littman. Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings 1994, pages 157–163. Elsevier, 1994.
- Littman  Michael L Littman. Friend-or-foe Q-learning in general-sum games. In ICML, volume 1, pages 322–328, 2001.
- Ma et al.  Yuzhe Ma, Kwang-Sung Jun, Lihong Li, and Xiaojin Zhu. Data poisoning attacks in contextual bandits. In International Conference on Decision and Game Theory for Security, pages 186–204. Springer, 2018.
- Nilim and El Ghaoui  Arnab Nilim and Laurent El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798, 2005.
- Ortner  Ronald Ortner. Regret bounds for reinforcement learning via Markov chain concentration. arXiv preprint arXiv:1808.01813, 2018.
- Osband and Van Roy  Ian Osband and Benjamin Van Roy. On lower bounds for regret in reinforcement learning. arXiv preprint arXiv:1608.02732, 2016.
- Padakandla et al.  Sindhu Padakandla, Shalabh Bhatnagar, et al. Reinforcement learning in non-stationary environments. arXiv preprint arXiv:1905.03970, 2019.
- Peng et al.  Peng Peng, Ying Wen, Yaodong Yang, Quan Yuan, Zhenkun Tang, Haitao Long, and Jun Wang. Multiagent bidirectionally-coordinated nets: Emergence of human-level coordination in learning to play StarCraft combat games. arXiv preprint arXiv:1703.10069, 2017.
- Pinto et al.  Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta. Robust adversarial reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2817–2826. JMLR.org, 2017.
- Shapley  Lloyd S Shapley. Stochastic games. Proceedings of the National Academy of Sciences, 39(10):1095–1100, 1953.
- Silver et al.  David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.
- Simchowitz and Jamieson  Max Simchowitz and Kevin Jamieson. Non-asymptotic gap-dependent regret bounds for tabular MDPs. arXiv preprint arXiv:1905.03814, 2019.
- Stoltz  Gilles Stoltz. Incomplete information and internal regret in prediction of individual sequences. PhD thesis, Université Paris Sud-Paris XI, 2005.
- Sutton and Barto  Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
- Xu and Mannor  Huan Xu and Shie Mannor. The robustness-performance tradeoff in Markov decision processes. In Advances in Neural Information Processing Systems, pages 1537–1544, 2007.
- Zanette and Brunskill  Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. arXiv preprint arXiv:1901.00210, 2019.
- Zinkevich et al.  Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione. Regret minimization in games with incomplete information. In Advances in Neural Information Processing Systems, pages 1729–1736, 2008.