1 Introduction
Recent advancements in AI have the potential to change our daily lives by boosting our productivity (e.g., via virtual personal assistants), augmenting our capabilities (e.g., via smart mobility systems), and increasing automation (e.g., via autopilots and assistive robots). These are settings of intelligence augmentation, where societal benefit will come not from complete automation but rather from the interaction between people and machines, in a process of productive human-machine collaboration.
We expect that useful collaboration will come about through AI agents that can adapt to the behavior of users. As an example, consider smart elevators that detect users’ intentions to automatically open/close doors and select floors. In an initial period, users constantly change their behavior while they experiment with the elevator and do not follow a particular behavior. Such dynamics are natural in many settings, including when using personal assistive apps, becoming accustomed to new features of an autopilot mode, and using an assistive robot to perform a task. Without accounting for this changing behavior of users, the performance of the AI agent could considerably deteriorate, leading to, for example, hazardous situations in an autopilot mode. Hence, it is important that the AI agent updates its decision-making policy accordingly.
We formalize this problem through a two-agent reinforcement learning (RL) framework. The agents, hereafter referred to as agent $\mathcal{A}_1$ and agent $\mathcal{A}_2$, jointly solve a task in a collaborative setting (i.e., they share a common reward function and a transition kernel that is based on their joint actions). Our goal is to develop a learning algorithm for agent $\mathcal{A}_1$ that facilitates a successful collaboration even in cases when agent $\mathcal{A}_2$ is adapting its own policy. In the above examples, agent $\mathcal{A}_1$ could represent the AI agent whereas agent $\mathcal{A}_2$ could be a person with time-evolving behavior. We primarily focus on an episodic Markov decision process (MDP) setting, in which the agents repeatedly interact:
(i) agent $\mathcal{A}_1$ decides on its policy based on historic information (agent $\mathcal{A}_2$’s past policies) and the underlying MDP model;

(ii) agent $\mathcal{A}_1$ commits to its policy for a given episode without knowing the policy of agent $\mathcal{A}_2$;

(iii) agent $\mathcal{A}_1$ updates its policy at the end of the episode based on agent $\mathcal{A}_2$’s observed behavior.
When agent $\mathcal{A}_2$’s policy is fixed and known, one can find an optimal policy for agent $\mathcal{A}_1$ using standard MDP planning techniques. In our setting, however, we do not assume agent $\mathcal{A}_2$’s behavior to be stationary, nor do we consider any particular model of how agent $\mathcal{A}_2$ changes its policy. This differs from similar two-agent (human-AI) collaborative settings (Dimitrakakis et al., 2017; Nikolaidis et al., 2017) that prescribe a particular behavioral model to agent $\mathcal{A}_2$ (the human agent).
1.1 Overview of our approach
The presence of agent $\mathcal{A}_2$ in our framework implies that the reward function and the transition kernel are changing from the perspective of agent $\mathcal{A}_1$. Variants of the setting have also been studied in the learning literature (Even-Dar et al., 2005, 2009; Yu & Mannor, 2009b, a; Yu et al., 2009; Abbasi et al., 2013; Wei et al., 2017). However, these approaches do not directly apply because of the following differences: (i) they focus on a particular aspect of non-stationarity (e.g., changing rewards with fixed transitions) (Even-Dar et al., 2005, 2009), (ii) they require that the changes in the transition model are bounded (Yu & Mannor, 2009a, b), (iii) they restrict the policy space (Abbasi et al., 2013), and (iv) they consider a competitive or adversarial setting instead of a cooperative setting with shared reward (Wei et al., 2017).
In our work, we adopt an assumption that agent $\mathcal{A}_2$ does not abruptly change its policy across episodes. We show that this assumption is critical: the problem becomes computationally intractable otherwise. Our approach is inspired by the problem of experts learning in MDPs (Even-Dar et al., 2005), in which each state is associated with an experts algorithm that derives the policy for that state using Q-values. However, to compensate for the non-stationarity of transitions and facilitate a faster learning process, we introduce novel forms of recency bias inspired by the ideas of Rakhlin & Sridharan (2013) and Syrgkanis et al. (2015).
Contributions.
We design novel algorithms for agent $\mathcal{A}_1$ that lead to sub-linear regret of $O(T^{\max\{1 - \frac{3}{7}\alpha, \frac{1}{4}\}})$ for $T$ episodes. Here, the parameter $\alpha$ defines an upper bound on the magnitude of agent $\mathcal{A}_2$’s policy change w.r.t. $T$ as $O(T^{-\alpha})$. We show that it is computationally hard to achieve sub-linear regret for the special case of $\alpha = 0$, using a reduction from the learning parities with noise problem (Abbasi et al., 2013; Kanade & Steinke, 2014). Furthermore, we connect the agents’ joint return to the regret of agent $\mathcal{A}_1$ by adapting the concept of smoothness from the game-theory literature (Roughgarden, 2009; Syrgkanis et al., 2015), and we show that the bound on the regret of agent $\mathcal{A}_1$ implies near-optimality of the agents’ joint return for MDPs that manifest a smooth game (Roughgarden, 2009; Syrgkanis et al., 2015). To the best of our knowledge, we are the first to provide such guarantees in a collaborative two-agent MDP learning setup.

2 The Setting
We study a two-agent learning problem in an MDP. The agents are referenced as agent $\mathcal{A}_1$ and agent $\mathcal{A}_2$. We consider an episodic setting with $T$ episodes (also called time steps) and each episode lasting $M$ rounds. Generic episodes are denoted by $t$ and $\tau$, while a generic round is denoted by $m$. The MDP is defined by:

a finite set of states $S$, with $s$ denoting a generic state. We enumerate the states by $1$, …, $|S|$, and assume this ordering in our vector notation.

a finite set of actions $A = A_1 \times A_2$, with $a^1 \in A_1$ denoting a generic action of agent $\mathcal{A}_1$ and $a^2 \in A_2$ denoting a generic action of agent $\mathcal{A}_2$. We enumerate the actions of agent $\mathcal{A}_1$ by $1$, …, $|A_1|$ and agent $\mathcal{A}_2$ by $1$, …, $|A_2|$, and assume this ordering in our vector notation.

a transition kernel $P(s, a^1, a^2, s_{new})$, which is a tensor with indices defined by the current state, the agents’ actions, and the next state.

a reward function $r: S \times A \rightarrow [0, 1]$ that defines the joint reward for both agents.
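We assume that agent $\mathcal{A}_1$ knows the MDP model. The agents commit to playing stationary policies $\pi^1_t$ and $\pi^2_t$ in each episode $t$, and do so without knowing the commitment of the other agent. At the end of the episode $t$, the agents observe each other’s policies ($\pi^1_t$, $\pi^2_t$) and can use this information to update their future policies. Since the state and action spaces are finite, policies can be represented as matrices $\pi^1_t \in \mathbb{R}^{|S| \times |A_1|}$ and $\pi^2_t \in \mathbb{R}^{|S| \times |A_2|}$, so that rows $\pi^1_t(s)$ and $\pi^2_t(s)$ define distributions on actions in a given state. We also define the reward matrix for agent $\mathcal{A}_1$ as $r_t \in \mathbb{R}^{|S| \times |A_1|}$, whose elements $r_t(s, a^1) = \mathbb{E}_{a^2 \sim \pi^2_t(s)}[r(s, a^1, a^2)]$ are the expected rewards of agent $\mathcal{A}_1$ for different actions and states. By bounded rewards, we have $\|r_t\|_{\max} \le 1$.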
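As an illustration of the matrix representation, the following minimal sketch computes the learner's reward matrix by averaging the joint reward over the other agent's action distribution in each state. The nested-list layout (`r[s][a1][a2]`, `pi2[s][a2]`), the function name, and the toy numbers are assumptions made for this example only.

```python
def reward_matrix(r, pi2):
    """Expected reward of the learner for each (state, own action).

    r[s][a1][a2] is the joint reward and pi2[s][a2] is the other agent's
    policy; both layouts are assumptions of this sketch.
    """
    return [
        [sum(p * x for p, x in zip(pi2[s], r[s][a1])) for a1 in range(len(r[s]))]
        for s in range(len(r))
    ]


# Toy MDP with 2 states and 2 actions per agent (hypothetical numbers).
r = [
    [[1.0, 0.0], [0.0, 1.0]],  # state 0: reward 1 when the actions match
    [[0.5, 0.5], [0.0, 1.0]],  # state 1
]
pi2 = [[0.5, 0.5], [0.0, 1.0]]  # rows are distributions over the other agent's actions
r_t = reward_matrix(r, pi2)
print(r_t)
```

Each entry of `r_t` is a convex combination of joint rewards, so it inherits the bound on the joint reward.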
2.1 Objective
After each episode $t$, the agents can adapt their policies. Note that agent $\mathcal{A}_2$ is not in our control and not assumed to be optimal. Therefore, we take the perspective of agent $\mathcal{A}_1$, and seek to optimize its policy in order to obtain good joint returns. The joint return in episode $t$ is:
$$V_t = \frac{1}{M} \cdot \mathbb{E}\left[\sum_{m=1}^{M} r(s_m, a^1_m, a^2_m) \,\Big|\, d_1, \pi^1_t, \pi^2_t\right] = \frac{1}{M} \cdot \sum_{m=1}^{M} d_{t,m} \cdot \langle \pi^1_t, r_t \rangle.$$
Here $s_m$ is the state at round $m$. For $m = 1$ this comes from the initial state distribution $d_1$. For later rounds this is obtained by following joint actions $(a^1_1, a^2_1)$, $(a^1_2, a^2_2)$, …, $(a^1_{m-1}, a^2_{m-1})$ from state $s_1$. Actions are obtained from policies $\pi^1_t$ and $\pi^2_t$. The second equation uses vector notation to define the joint return, where $d_{t,m}$ is a row vector representing the state distribution at episode $t$ and round $m$, while $\langle \cdot, \cdot \rangle$ is a row-wise dot product whose result is a column vector with $|S|$ elements. Since this is an episodic framework, we will assume the same starting state distribution, $d_{t,1} = d_1$, for all episodes $t$. However, $d_{t,m}$ can differ across episodes since policies $\pi^1_t$ and $\pi^2_t$ evolve.
We define the average return over all episodes as $\bar{V} = \frac{1}{T} \sum_{t=1}^{T} V_t$. The objective is to output a sequence of agent $\mathcal{A}_1$’s policies $\pi^1_1$, …, $\pi^1_T$ that maximize:
$$\sup_{\pi^1_1, \ldots, \pi^1_T} \bar{V}.$$
The maximum possible value of $\bar{V}$ over all combinations of agent $\mathcal{A}_1$’s and agent $\mathcal{A}_2$’s policies is denoted as Opt. Notice that this value is achievable using MDP planning techniques, provided that we control both agents.
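The joint return of a single episode can be evaluated by propagating the state distribution through the kernel that the two policies induce. The sketch below assumes the return is normalized by the number of rounds; the tensor layout `P[s][a1][a2][s2]`, the helper names, and the toy coordination MDP are all illustrative assumptions.

```python
def joint_return(P, r, pi1, pi2, d1, M):
    """Average per-round joint reward over one episode of M rounds.

    P[s][a1][a2][s2]: transition tensor, r[s][a1][a2]: joint reward,
    pi1[s][a1], pi2[s][a2]: stationary policies, d1[s]: initial distribution.
    """
    n = len(P)
    pairs = [(a1, a2) for a1 in range(len(pi1[0])) for a2 in range(len(pi2[0]))]
    # State-to-state kernel induced by the two policies.
    K = [[sum(pi1[s][a1] * pi2[s][a2] * P[s][a1][a2][s2] for a1, a2 in pairs)
          for s2 in range(n)] for s in range(n)]
    # Expected one-round reward in each state under the two policies.
    g = [sum(pi1[s][a1] * pi2[s][a2] * r[s][a1][a2] for a1, a2 in pairs)
         for s in range(n)]
    d, total = list(d1), 0.0
    for _ in range(M):
        total += sum(d[s] * g[s] for s in range(n))
        d = [sum(d[s] * K[s][s2] for s in range(n)) for s2 in range(n)]
    return total / M


def match(a1, a2):
    return 1.0 if a1 == a2 else 0.0


# Toy MDP: state 0 pays 1 and persists while the actions match;
# state 1 is absorbing and pays nothing.
P = [[[[match(a1, a2), 1.0 - match(a1, a2)] for a2 in range(2)] for a1 in range(2)],
     [[[0.0, 1.0] for _ in range(2)] for _ in range(2)]]
r = [[[match(a1, a2) for a2 in range(2)] for a1 in range(2)],
     [[0.0, 0.0], [0.0, 0.0]]]
always_0 = [[1.0, 0.0], [1.0, 0.0]]
uniform = [[0.5, 0.5], [0.5, 0.5]]

v_coordinated = joint_return(P, r, always_0, always_0, [1.0, 0.0], M=2)
v_uncoordinated = joint_return(P, r, always_0, uniform, [1.0, 0.0], M=2)
print(v_coordinated, v_uncoordinated)
```

In this toy example, a partner that randomizes halves the first-round reward and pushes probability mass into the unrewarding absorbing state, so the return drops.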
2.2 Policy change magnitude and influences
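However, we do not control agent $\mathcal{A}_2$, nor do we assume that agent $\mathcal{A}_2$ follows a particular behavioral model. Instead, we quantify the allowed behavior via its policy change magnitude, which for agent $\mathcal{A}_1$ is defined as:
$$\rho_1 = \max_{t > 1, s} \sum_{a^1 \in A_1} |\pi^1_t(s, a^1) - \pi^1_{t-1}(s, a^1)| = \max_{t > 1} \|\pi^1_t - \pi^1_{t-1}\|_\infty,$$
where $\|\cdot\|_\infty$ is the operator (induced) norm.
In the case of agent $\mathcal{A}_2$, we will be focusing on policy change magnitudes $\rho_2$ that are of the order $O(T^{-\alpha})$, where $\alpha$ is strictly greater than $0$. For instance, the assumption holds if agent $\mathcal{A}_2$ is a learning agent adopting the experts in MDP approach of Even-Dar et al. (2005, 2009).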
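The policy change magnitude is the largest row-wise $\ell_1$ distance between the policy matrices of consecutive episodes. A minimal sketch, assuming policies are stored as nested lists `policies[t][s][a]` (a layout chosen for this example):

```python
def policy_change_magnitude(policies):
    """Largest per-state L1 change between consecutive policy matrices."""
    return max(
        sum(abs(x - y) for x, y in zip(cur[s], prev[s]))
        for prev, cur in zip(policies, policies[1:])
        for s in range(len(cur))
    )


# Three episodes, two states, two actions (hypothetical numbers).
policies = [
    [[0.5, 0.5], [1.0, 0.0]],
    [[0.6, 0.4], [1.0, 0.0]],
    [[0.6, 0.4], [0.75, 0.25]],
]
rho = policy_change_magnitude(policies)
print(rho)
```

The maximum is attained in state 1 between the last two episodes, where probability 0.25 moves from one action to the other.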
Furthermore, we introduce the notion of the influence of an agent on the transition dynamics. This measures how much an agent can affect the transition probabilities by changing its policy. For agent $\mathcal{A}_2$, the influence is defined as:
$$I_2 = \max_{\pi^1, \, \pi^2 \ne \pi^{2\prime}} \frac{\|P_{\pi^1, \pi^2} - P_{\pi^1, \pi^{2\prime}}\|_\infty}{\|\pi^2 - \pi^{2\prime}\|_\infty},$$
where kernel (matrix) $P_{\pi^1, \pi^2}(s, s_{new})$ denotes the probability of transitioning from $s$ to $s_{new}$ when the agents’ policies are $\pi^1$ and $\pi^2$ respectively. (Our notion of influence is similar to, although not the same as, that of Dimitrakakis et al. (2017).) We are primarily interested in this notion to show how our approach compares to the existing results from the online learning literature. For $I_2 = 0$, our setting relates to the single-agent settings of Even-Dar et al. (2005, 2009) and Dick et al. (2014), where rewards are non-stationary but transition probabilities are fixed. In general, the influence takes values in $[0, 1]$ (see Appendix B, Corollary 1). The policy change magnitude $\rho_2$ of agent $\mathcal{A}_2$ and the influence $I_1$ of agent $\mathcal{A}_1$ are defined analogously.
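To make the influence notion concrete, the sketch below evaluates the ratio between the change in the induced kernel and the change in one agent's policy, for a single pair of that agent's policies. The influence itself is a maximum over all such pairs, so one ratio only gives a lower bound. The tensor layout, the function names, and both toy kernels are assumptions of this example.

```python
def induced_kernel(P, pi1, pi2):
    """State-to-state kernel when the agents follow pi1 and pi2."""
    n = len(P)
    return [[sum(pi1[s][a1] * pi2[s][a2] * P[s][a1][a2][s2]
                 for a1 in range(len(pi1[s])) for a2 in range(len(pi2[s])))
             for s2 in range(n)] for s in range(n)]


def inf_norm_diff(A, B):
    """Induced infinity norm of A - B: the largest absolute row sum."""
    return max(sum(abs(x - y) for x, y in zip(ra, rb)) for ra, rb in zip(A, B))


def influence_ratio(P, pi1, pi2, pi2_alt):
    """Kernel change per unit of policy change of the second agent."""
    num = inf_norm_diff(induced_kernel(P, pi1, pi2), induced_kernel(P, pi1, pi2_alt))
    return num / inf_norm_diff(pi2, pi2_alt)


always_0 = [[1.0, 0.0], [1.0, 0.0]]
always_1 = [[0.0, 1.0], [0.0, 1.0]]

# Kernel where state 0 persists only if the actions match; state 1 is absorbing.
P_coupled = [[[[1.0, 0.0] if a1 == a2 else [0.0, 1.0] for a2 in range(2)]
              for a1 in range(2)],
             [[[0.0, 1.0] for _ in range(2)] for _ in range(2)]]
# Kernel where the second agent's action is ignored by the transitions.
P_free = [[[[1.0, 0.0] if a1 == 0 else [0.0, 1.0] for _ in range(2)]
           for a1 in range(2)] for _ in range(2)]

ratio_coupled = influence_ratio(P_coupled, always_0, always_0, always_1)
ratio_free = influence_ratio(P_free, always_0, always_0, always_1)
print(ratio_coupled, ratio_free)
```

The second kernel corresponds to the zero-influence case, in which the learner faces fixed transition probabilities and only the rewards vary.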
2.3 Mixing time and values
We follow standard assumptions from the literature on online learning in MDPs (e.g., see Even-Dar et al. (2005)), and only consider transition kernels that have well-defined stationary distributions. For the associated transition kernel, we define a stationary state distribution $d_{\pi^1, \pi^2}$ as the one for which:

any initial state distribution converges to $d_{\pi^1, \pi^2}$ under policies $\pi^1$ and $\pi^2$;

and $d_{\pi^1, \pi^2} \cdot P_{\pi^1, \pi^2} = d_{\pi^1, \pi^2}$.
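For a kernel that mixes, a stationary distribution can be approximated by repeatedly applying the kernel to any initial distribution. A minimal power-iteration sketch, where the kernel `K` and the iteration count are illustrative assumptions:

```python
def stationary_distribution(K, iters=200):
    """Approximate the row vector d with d K = d by power iteration."""
    n = len(K)
    d = [1.0 / n] * n  # any initial distribution works for a mixing kernel
    for _ in range(iters):
        d = [sum(d[s] * K[s][s2] for s in range(n)) for s2 in range(n)]
    return d


# Hypothetical 2-state kernel induced by some fixed pair of policies.
K = [[0.9, 0.1],
     [0.5, 0.5]]
d = stationary_distribution(K)
d_next = [sum(d[s] * K[s][s2] for s in range(2)) for s2 in range(2)]
print(d)
```

For this kernel the balance condition $0.1 \cdot d(0) = 0.5 \cdot d(1)$ gives $d = (5/6, 1/6)$, which the iteration recovers.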
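Note that $d_{\pi^1, \pi^2}$ is represented as a row vector with $|S|$ elements. Furthermore, as discussed in Even-Dar et al. (2005), this implies that there exists a mixing time $\omega$ such that for all state distributions $d$ and $d'$:
$$\|d \cdot P_{\pi^1, \pi^2} - d' \cdot P_{\pi^1, \pi^2}\|_1 \le e^{-\frac{1}{\omega}} \cdot \|d - d'\|_1.$$
Due to the well-defined mixing time, we can define the average reward of agent $\mathcal{A}_1$ when following policy $\pi^1$ in episode $t$ as:
$$\eta_t(\pi^1) := d_{\pi^1, \pi^2_t} \cdot \langle \pi^1, r_t \rangle,$$
where $\langle \cdot, \cdot \rangle$ is the row-wise dot product whose result is a column vector with $|S|$ elements. The Q-value matrix for agent $\mathcal{A}_1$ w.r.t. policy $\pi^1_t$ is defined as:
$$Q_t(s, a^1) = \mathbb{E}\left[\sum_{m=1}^{\infty} \big(r_t(s_m, a^1_m) - \eta_t(\pi^1_t)\big) \,\Big|\, s, a^1, \pi^1_t\right],$$
where $s_m$ and $a^1_m$ are states and actions in round $m$, starting from state $s$ with action $a^1$ and then using policy $\pi^1_t$. Moreover, the policy-wise Q-value (column) vector for $\pi^1$ w.r.t. policy $\pi^1_t$ is defined by:
$$Q^{\pi^1}_t(s) = \mathbb{E}_{a^1 \sim \pi^1(s)}\left[Q_t(s, a^1)\right],$$
or in matrix notation $Q^{\pi^1}_t = \langle \pi^1, Q_t \rangle$. The Q-values satisfy the following Bellman equation:
$$Q_t(s, a^1) = r_t(s, a^1) - \eta_t(\pi^1_t) + P_{\pi^2_t}(s, a^1) \cdot Q^{\pi^1_t}_t,$$
where $P_{\pi^2_t}(s, a^1)$ defines the probability distribution over next states given action $a^1$ of agent $\mathcal{A}_1$ and policy $\pi^2_t$ of agent $\mathcal{A}_2$ (here, $P_{\pi^2_t}(s, a^1)$ is denoted as a row vector with $|S|$ elements). For other useful properties of this MDP framework we refer the reader to Appendix B.

3 Smoothness and No-regret Dynamics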
The goal is to output a sequence of agent $\mathcal{A}_1$’s policies $\pi^1_1$, …, $\pi^1_T$ so that the joint return $\bar{V}$ is maximized. There are two key challenges: (i) agent $\mathcal{A}_2$’s policies could be sub-optimal (or even adversarial in the extreme case), and (ii) agent $\mathcal{A}_1$ does not know the current policy of agent $\mathcal{A}_2$ at the beginning of episode $t$.
Smoothness criterion.
To deal with the first challenge, we consider a structural assumption that enables us to apply a regret analysis when quantifying the quality of a solution w.r.t. the optimum. In particular, we assume that the MDP is $(\lambda, \mu)$-smooth:
Definition 1.
We say that an MDP is $(\lambda, \mu)$-smooth if there exists a pair of policies $(\pi^{1*}, \pi^{2*})$ such that for every policy pair $(\pi^1, \pi^2)$:
$$\eta^{\pi^{1*}, \pi^{2*}} \ge \eta^{\pi^1, \pi^2}, \qquad \eta^{\pi^{1*}, \pi^2} \ge \lambda \cdot \eta^{\pi^{1*}, \pi^{2*}} - \mu \cdot \eta^{\pi^1, \pi^2},$$
where $\eta^{\pi^1, \pi^2}$ denotes the average reward obtained under the policy pair $(\pi^1, \pi^2)$.
This bounds the impact of agent $\mathcal{A}_2$’s policy on the average reward. In particular, there must exist an optimal policy pair $(\pi^{1*}, \pi^{2*})$ such that the negative impact of agent $\mathcal{A}_2$ for choosing $\pi^2 \ne \pi^{2*}$ is controllable by an appropriate choice of policy of agent $\mathcal{A}_1$. This smoothness definition is a variant of the game-theoretic smoothness notion studied in other contexts, e.g., for the price-of-anarchy analysis of non-cooperative games and learning in repeated games (Roughgarden, 2009; Syrgkanis et al., 2015). We analyze the relationship between the smoothness parameters and the properties of the MDP in our setting in Appendix C.
It is important to note that since we have a finite number of rounds $M$ per episode, Opt is not necessarily the same as $\eta^{\pi^{1*}, \pi^{2*}}$, and the policies that achieve Opt need not lead to $\eta^{\pi^{1*}, \pi^{2*}}$.
No-regret learning.
To address the second challenge, we adopt the online learning framework and seek to minimize regret $R(T)$:
$$R(T) = \sup_{\pi^1} \sum_{t=1}^{T} \left[\eta_t(\pi^1) - \eta_t(\pi^1_t)\right]. \qquad (1)$$
A policy sequence $\pi^1_1$, …, $\pi^1_T$ is no-regret if regret $R(T)$ is sublinear in $T$. An algorithm that outputs such sequences is a no-regret algorithm; intuitively, this means that the agent’s performance is competitive w.r.t. any fixed policy.
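Regret compares the average rewards collected by the played policy sequence with those of the best fixed policy in hindsight. The sketch below replaces the supremum over all policies with a maximum over a finite candidate set, which is a simplifying assumption; the function name and all numbers are hypothetical.

```python
def regret(eta_fixed, eta_played):
    """Regret against the best fixed candidate policy in hindsight.

    eta_fixed[i][t]: average reward candidate i would have earned in episode t.
    eta_played[t]: average reward of the policy actually played in episode t.
    """
    best_fixed = max(sum(row) for row in eta_fixed)
    return best_fixed - sum(eta_played)


eta_fixed = [
    [0.6, 0.2, 0.9],  # candidate policy 0
    [0.3, 0.8, 0.7],  # candidate policy 1
]
eta_played = [0.5, 0.5, 0.5]
reg = regret(eta_fixed, eta_played)
print(reg)
```

Here the best fixed candidate earns 1.8 in total against 1.5 for the played sequence. A no-regret learner keeps this gap growing slower than the number of episodes.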
Near-optimality of no-regret dynamics.
Because agent $\mathcal{A}_2$ could be adapting to the policies of agent $\mathcal{A}_1$, this is an adaptive learning setting, and the notion of regret can become less useful. This is where the smoothness criterion comes in: we will show that it suffices to minimize the regret $R(T)$ in order to obtain near-optimal performance.
Using an analysis similar to Syrgkanis et al. (2015), we show the near-optimality of no-regret dynamics defined w.r.t. the optimal return Opt, as stated in the following lemma:
Lemma 1.
Return is lower bounded by:
Proof.
See Appendix D for the proof. ∎
Lemma 1 implies that as the number of episodes $T$ and the number of rounds $M$ go to infinity, return $\bar{V}$ converges to a multiple $\frac{\lambda}{1 + \mu}$ of optimum Opt provided that agent $\mathcal{A}_1$ is a no-regret learner. In the next section, we design such no-regret learning algorithms for agent $\mathcal{A}_1$.
4 Learning Algorithms
We base our approach on the expert learning literature for MDPs, in particular that of Even-Dar et al. (2005, 2009). The basic idea is to associate each state with an experts algorithm, and decide on a policy by examining the Q-values of state-action pairs. Thus, the Q function represents a reward function in the expert terminology.
4.1 Experts with periodic restarts: ExpRestart
In cases when agent $\mathcal{A}_2$ has no influence on transitions, the approach of Even-Dar et al. (2005, 2009) would yield the no-regret guarantees. The main difficulty of the present setting is that agent $\mathcal{A}_2$ can influence the transitions via its policy. The hope is that as long as the policy change magnitude of agent $\mathcal{A}_2$ is not too large, agent $\mathcal{A}_1$ can compensate for the non-stationarity of transitions by using only recent history when updating its policy.
A simple way of implementing this principle is to use a no-regret learning algorithm, but periodically restart it, i.e., split the full time horizon into segments of length $\Gamma$ and apply the algorithm on each segment separately. Notice that in this case we have well-defined periods $\{1, \ldots, \Gamma\}$, $\{\Gamma + 1, \ldots, 2\Gamma\}$, …, $\{T - \Gamma + 1, \ldots, T\}$. As a choice of an expert algorithm (the algorithm associated with each state), we use Optimistic Follow the Regularized Leader (OFTRL) (Rakhlin & Sridharan, 2013; Syrgkanis et al., 2015). Our policy updating rule for segment $l$, with starting point $\tau = 1 + (l - 1) \cdot \Gamma$, can be described as:
$$\pi^1_t(s) = \arg\max_{w \in \mathcal{P}_{A_1}} \left[\Big(\sum_{k=\tau}^{t-1} Q_k(s) + Q_{t-1}(s)\Big) w^{\dagger} - \frac{\mathcal{R}(w)}{\epsilon}\right]$$
for $t \in \{\tau + 1, \ldots, l \cdot \Gamma\}$, and:
$$\pi^1_\tau(s) = \arg\max_{w \in \mathcal{P}_{A_1}} \left[-\frac{\mathcal{R}(w)}{\epsilon}\right].$$
Here $Q_k(s)$ denotes a row of matrix $Q_k$ (see Section 2.3; given $\pi^1_k$ and $\pi^2_k$, we can calculate $Q_k$ from the Bellman equation using standard dynamic programming techniques), $w$ is a row vector from the probability simplex $\mathcal{P}_{A_1}$, $\dagger$ denotes the transpose operator, $\mathcal{R}$ is a 1-strongly convex regularizer w.r.t. norm $\|\cdot\|_1$, and $\epsilon$ is the learning rate. This approach, henceforth referred to as experts with periodic restarts (ExpRestart), suffices to obtain sublinear regret provided that the segment length $\Gamma$ and learning rate $\epsilon$ are properly set (see Appendix G).
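The restart idea can be sketched for a single state as follows. This is not the paper's implementation: it assumes the regularizer is the negative entropy, for which the OFTRL maximizer over the simplex has the closed form of exponential weights, and it treats the per-episode Q-value rows as given inputs rather than recomputing them from the Bellman equation. Episodes are 0-indexed, and the names `q_rows`, `seg_len`, and `lr` are hypothetical.

```python
import math


def oftrl_entropy(cum, last, lr):
    """OFTRL step with negative-entropy regularizer (closed form).

    Weights are proportional to exp(lr * (cumulative payoff + last payoff)),
    where the last payoff serves as the optimistic prediction.
    """
    scores = [lr * (c + l) for c, l in zip(cum, last)]
    top = max(scores)  # subtract the maximum for numerical stability
    w = [math.exp(x - top) for x in scores]
    z = sum(w)
    return [x / z for x in w]


def exp_restart_state(q_rows, seg_len, lr):
    """Policies for one state when the history is reset every seg_len episodes."""
    n = len(q_rows[0])
    policies = []
    for t in range(len(q_rows)):
        start = (t // seg_len) * seg_len  # first episode of the current segment
        cum = [sum(q_rows[k][a] for k in range(start, t)) for a in range(n)]
        last = q_rows[t - 1] if t > start else [0.0] * n
        policies.append(oftrl_entropy(cum, last, lr))
    return policies


# Action 0 always looks better; segments of length 2 (hypothetical numbers).
q_rows = [[1.0, 0.0]] * 4
pols = exp_restart_state(q_rows, seg_len=2, lr=1.0)
for p in pols:
    print(p)
```

At the start of every segment the policy falls back to uniform, which illustrates how restarts can produce abrupt policy changes.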
One of the main drawbacks of experts with periodic restarts is that it potentially results in abrupt changes in the policy of agent $\mathcal{A}_1$, which occur when switching from one segment to another. In practice, one might want to avoid this, for example, because agent $\mathcal{A}_2$ (e.g., representing a person) might negatively respond to such abrupt changes in agent $\mathcal{A}_1$’s policy. Next, we design a new experts algorithm for our setting that ensures gradual policy changes for agent $\mathcal{A}_1$ across episodes, while achieving the same order of regret guarantees (see Section 5.4 and Appendix G).
4.2 Experts with doubly recency bias: ExpDRBias
Utilizing fixed segments, as in the approach of ExpRestart, leads to potentially rapid policy changes after each segment. To avoid this, one could average the policies obtained by segmentation from different possible starting points. In particular, for each episode $t$, we consider a family of segments that this episode belongs to: $\{t - \Gamma + 1, \ldots, t\}$, $\{t - \Gamma + 2, \ldots, t + 1\}$, …, $\{t - 1, \ldots, t + \Gamma - 2\}$, $\{t, \ldots, t + \Gamma - 1\}$, each segment in this family identifying a possible policy for episode $t$. By averaging these policies, we can treat equally each possible segment of size $\Gamma$ that contains episode $t$. This approach, henceforth referred to as experts with doubly recency bias (ExpDRBias), can be implemented through the following two ideas.
Recency windowing. The first idea is what we call recency windowing. Simply put, it specifies how far in the history an agent should look when choosing a policy. More precisely, we define a sliding window of size $\Gamma$, and to decide on policy $\pi^1_t$ we only use historical information from periods after $t - \Gamma$. In particular, the updating rule of OFTRL would be modified for $t > 1$ as
$$\pi^1_t(s) = \arg\max_{w \in \mathcal{P}_{A_1}} \left[\Big(\sum_{k=\max(1, t-\Gamma)}^{t-1} Q_k(s) + Q_{t-1}(s)\Big) w^{\dagger} - \frac{\mathcal{R}(w)}{\epsilon}\right]$$
and:
$$\pi^1_1(s) = \arg\max_{w \in \mathcal{P}_{A_1}} \left[-\frac{\mathcal{R}(w)}{\epsilon}\right]. \qquad (2)$$
Recency modulation. The second idea is what we call recency modulation. This creates an averaging effect over the policies computed by the experts with periodic restarts approach for different possible starting points of the segmentation. For episode $t$, recency modulation calculates policy updates using recency windowing but considers windows of different sizes. More precisely, we calculate updates with window sizes $1$ to $\Gamma$, and then average them to obtain the final update. Lemma 3 shows that this updating rule will not lead to abrupt changes in agent $\mathcal{A}_1$’s policy.
To summarize, agent $\mathcal{A}_1$ has the following policy update rule for $t > 1$:
$$\pi^1_t(s) = \frac{1}{\Gamma} \sum_{\tau=1}^{\Gamma} w_{t,\tau}(s), \qquad (3)$$
where
$$w_{t,\tau}(s) = \arg\max_{w \in \mathcal{P}_{A_1}} \left[\Big(\sum_{k=\max(1, t-\tau)}^{t-1} Q_k(s) + Q_{t-1}(s)\Big) w^{\dagger} - \frac{\mathcal{R}(w)}{\epsilon}\right].$$
For $t = 1$, we follow update equation (2). The full description of agent $\mathcal{A}_1$’s policy update using the approach of ExpDRBias is given in Algorithm 1. As with ExpRestart, ExpDRBias leads to a sublinear regret for a proper choice of $\epsilon$ and $\Gamma$, which in turn results in a near-optimal behavior, as analyzed in the next section.
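The two ideas, recency windowing and recency modulation, can be sketched for a single state as an average of windowed optimistic updates. As a simplifying assumption (not taken from the source), the regularizer is the negative entropy, so each windowed update is an exponential-weights step; the Q-value rows are treated as given, episodes are 0-indexed, and the names `q_rows`, `gamma`, and `lr` are hypothetical.

```python
import math


def oftrl_entropy(cum, last, lr):
    """Exponential-weights form of OFTRL with the last payoff as prediction."""
    scores = [lr * (c + l) for c, l in zip(cum, last)]
    top = max(scores)
    w = [math.exp(x - top) for x in scores]
    z = sum(w)
    return [x / z for x in w]


def exp_dr_bias_state(q_rows, gamma, lr):
    """Average, over window sizes 1..gamma, of the windowed optimistic updates."""
    n = len(q_rows[0])
    policies = [[1.0 / n] * n]  # first episode: no history, uniform policy
    for t in range(1, len(q_rows)):
        avg = [0.0] * n
        for tau in range(1, gamma + 1):
            start = max(0, t - tau)  # window of the tau most recent episodes
            cum = [sum(q_rows[k][a] for k in range(start, t)) for a in range(n)]
            w = oftrl_entropy(cum, q_rows[t - 1], lr)
            avg = [x + y / gamma for x, y in zip(avg, w)]
        policies.append(avg)
    return policies


q_rows = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # hypothetical Q-value rows
pols = exp_dr_bias_state(q_rows, gamma=2, lr=1.0)
for p in pols:
    print(p)
```

Because every episode blends several window lengths, no single episode discards the entire history at once, which is the intuition behind the gradual policy changes this approach aims for.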
5 Theoretical Analysis of ExpDRBias
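To bound regret $R(T)$, given by equation (1), it is useful to express the difference $\eta_t(\pi^1) - \eta_t(\pi^1_t)$ in terms of Q-values. In particular, one can show that this difference is equal to $d_{\pi^1, \pi^2_t} \cdot (Q^{\pi^1}_t - Q^{\pi^1_t}_t)$ (see Lemma 15 in Appendix B). By the definitions of $Q^{\pi^1}_t$ and $Q^{\pi^1_t}_t$, this implies:
$$\eta_t(\pi^1) - \eta_t(\pi^1_t) = d_{\pi^1, \pi^2_t} \cdot \langle \pi^1 - \pi^1_t, Q_t \rangle.$$
If $d_{\pi^1, \pi^2_t}$ was not dependent on $t$ (e.g., if agent $\mathcal{A}_2$ was not changing its policy), bounding $R(T)$ would amount to bounding the sum of terms $\langle \pi^1 - \pi^1_t, Q_t \rangle$. This could be done with an approach that carefully combines the proof techniques of Even-Dar et al. (2005) with the OFTRL properties, in particular, regret bounded by variation in utilities (RVU) (Syrgkanis et al., 2015). However, in our setting $d_{\pi^1, \pi^2_t}$ is generally changing with $t$.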
5.1 Change magnitudes of stationary distributions
To account for this, we need to investigate how quickly distributions $d_{\pi^1, \pi^2_t}$ change across episodes. Furthermore, to utilize the RVU property, we need to do the same for distributions $d_{\pi^1_t, \pi^2_t}$. The following lemma provides bounds on the respective change magnitudes.
Lemma 2.
The difference between the stationary distributions of two consecutive episodes is upper bounded by:
Furthermore, for any policy :
Proof.
See Appendix E.3. ∎
5.2 Properties based on OFTRL
The bounds on the change magnitudes of distributions $d_{\pi^1_t, \pi^2_t}$ and $d_{\pi^1, \pi^2_t}$, which will propagate to the final result, depend on agent $\mathcal{A}_1$’s policy change magnitude $\rho_1$. The following lemma provides a bound for $\rho_1$ that, together with the assumed bound on $\rho_2$, is useful in establishing no-regret guarantees.
Lemma 3.
For any and , the change magnitude of weights in ExpDRBias is bounded by:
Consequently:
Proof.
See Appendix E.2. ∎
Now, we turn to bounding the term $\langle \pi^1 - \pi^1_t, Q_t \rangle$. Lemma 4 formalizes the RVU property for ExpDRBias using norm $\|\cdot\|_1$ and its dual norm, derived from results in the existing literature (Syrgkanis et al., 2015). (An extended version of the lemma, which is needed for the main result, is provided in Appendix E.1.) Intuitively, Lemma 4 shows that it is possible to bound this term by examining the change magnitudes of Q-values.
Lemma 4.
Consider ExpDRBias and let denote column vector of ones with elements. Then, for each episode of ExpDRBias , we have:
where are defined in (3), , and is an arbitrary policy of agent .
Proof.
See Appendix E.1. ∎
5.3 Change magnitudes of values
Finally, we derive bounds on the change magnitudes of values, that we use together with Lemma 4 to prove the main results. We first bound the difference , which helps us in bounding the difference .
Lemma 5.
The difference between values of two consecutive episodes is upper bounded by:
where .
Proof.
See Appendix E.4 for the proof. ∎
Lemma 6.
The difference between values of two consecutive episodes is upper bounded by:
where .
Proof.
See Appendix E.5 for the proof. ∎
For convenience, instead of directly using , we consider a variable such that . The following proposition gives a rather loose (but easy to interpret) bound on that satisfies the inequality.
Proposition 1.
There exists a constant independent of and , such that:
5.4 Regret analysis and main results
We now come to the most important part of our analysis: establishing the regret guarantees for ExpDRBias. Using the results from the previous subsections, we obtain the following regret bound:
Theorem 1.
Let the learning rate of ExpDRBias be equal to and let be such that and . Then, the regret of ExpDRBias is upperbounded by:
Proof.
See Appendix F for the proof. ∎
When agent $\mathcal{A}_2$ does not influence the transition kernel through its policy, i.e., when $I_2 = 0$, the regret is $O(T^{1/4})$ for $\Gamma = T$. In this case, we could have also applied the original approach of Even-Dar et al. (2005, 2009), but interestingly, it would result in a worse regret bound, i.e., $O(T^{1/2})$. By leveraging the fact that agent $\mathcal{A}_2$’s policy is slowly changing, which corresponds to reward functions in the setting of Even-Dar et al. (2005, 2009) not being fully adversarial, we are able to improve on the worst-case guarantees. The main reason for such an improvement is our choice of the underlying experts algorithm, i.e., OFTRL, that exploits the apparent predictability of agent $\mathcal{A}_2$’s behavior. Similar arguments were made for the repeated games settings (Rakhlin & Sridharan, 2013; Syrgkanis et al., 2015), which correspond to our setting when the MDP consists of only one state. Namely, in the single-state scenario, agent $\mathcal{A}_2$ does not influence transitions, so the resulting regret is $O(T^{1/4})$, matching the results of Syrgkanis et al. (2015).
In general, the regret depends on $\rho_2$. If $\rho_2 = O(T^{-\alpha})$ with $\alpha > 0$, then $\Gamma = O(T^{\frac{4}{7}\alpha})$ equalizes the order of the two regret components in Theorem 1 and leads to the regret of $O(T^{1 - \frac{3}{7}\alpha})$. This brings us to the main result, which provides a lower bound on the return $\bar{V}$:
Theorem 2.
Assume that for . Let and . Then, the regret of ExpDRBias is upper-bounded by:
Furthermore, when the MDP is $(\lambda, \mu)$-smooth, the return of ExpDRBias is lower-bounded by:
Proof.
Notice that for . By Lemma 3, this implies that there exists a fixed (not dependent on ) such that for large enough . Furthermore, , so there exists a fixed such that for large enough . Hence, we can apply Theorem 1 to obtain an order-wise regret bound: .
Now, consider two cases. First, let . Then, we obtain:
For the other case, i.e., when , we obtain:
Therefore, , which proves the first statement. By combining it with Lemma 1, we obtain the second statement. ∎
The multiplicative factors in the asymptotic bounds mainly depend on mixing time $\omega$. In particular they are dominated by factor and its powers, as can be seen from Lemma 1, Theorem 1, and Proposition 1. Note that Lemma 3 allows us to upper bound in Theorem 1 with . Furthermore, for large enough . Hence, these results imply dependency of the asymptotic bounds on . This is larger than what one might expect from the prior work (e.g., the bound in Even-Dar et al. (2005, 2009) has dependency). However, note that our setting is different in that the presence of agent $\mathcal{A}_2$ has an effect on transitions (from agent $\mathcal{A}_1$’s perspective), so it is not surprising that the resulting dependency on the mixing time is worse.
6 Hardness Result
Our formal guarantees assume that the policy change magnitude $\rho_2$ of agent $\mathcal{A}_2$ is a decreasing function in the number of episodes $T$ given by $O(T^{-\alpha})$ for $\alpha > 0$. What if we relax this, and allow agent $\mathcal{A}_2$ to adapt independently of the number of episodes? We show a hardness result for the setting of $\alpha = 0$, using a reduction from the online agnostic parity learning problem (Abbasi et al., 2013). As argued in Abbasi et al. (2013), the online-to-batch reduction implies that the online version of agnostic parity learning is at least as hard as its offline version, for which the best known algorithm has complexity $2^{O(n / \log n)}$ (Kalai et al., 2008). In fact, agnostic parity learning is a (harder) variant of the learning with parity noise problem, widely believed to be computationally intractable (Blum et al., 2003; Pietrzak, 2012), and thus often adopted as a hardness assumption (e.g., Sharan et al. (2018)).
Theorem 3.
Assume that the policy change magnitude of agent is of the order and that its influence is equal to . If there exists a time algorithm that outputs a policy sequence , …, whose regret is for , then there also exists a time algorithm for online agnostic parity learning whose regret is .
Proof.
See Appendix H. ∎
Our proof relies on the result of Abbasi et al. (2013) (Theorem 5 and its proof), which reduces the online agnostic parity learning problem to the adversarial shortest path problem, which we in turn reduce to our problem. This theorem implies that when $\alpha = 0$, it is unlikely that one can obtain $R(T)$ that is sublinear in $T$, given the current computational complexity results.
7 Related Work
Experts learning in MDP. Our framework is closely related to that of Even-Dar et al. (2005, 2009), although the presence of agent $\mathcal{A}_2$ means that we cannot directly use their algorithmic approach. In fact, learning with an arbitrarily changing transition is believed to be computationally intractable (Abbasi et al., 2013), and computationally efficient learning algorithms experience linear regret (Yu & Mannor, 2009a; Abbasi et al., 2013). This is where we make use of the bound on the magnitude of agent $\mathcal{A}_2$’s policy change. Contrary to most of the existing work, the changes in reward and transition kernel in our model are non-oblivious and adapt to the learning algorithm of agent $\mathcal{A}_1$. There have been a number of follow-up works that either extend the mentioned results or improve them for more specialized settings (Dick et al., 2014; Neu et al., 2012, 2010; Dekel & Hazan, 2013). Agarwal et al. (2017) and Singla et al. (2018) study the problem of learning with experts’ advice where experts are not stationary and are learning agents themselves. However, their focus is on designing a meta-algorithm for coordinating with these experts, which is technically very different from ours.
Learning in games. To relate the quality of an optimal solution to agent $\mathcal{A}_1$’s regret, we use techniques similar to those studied in the learning in games literature (Blum et al., 2008; Roughgarden, 2009; Syrgkanis et al., 2015). The fact that agent $\mathcal{A}_2$’s policy is changing slowly enables us to utilize no-regret algorithms for learning in games with recency bias (e.g., Daskalakis et al. (2011); Rakhlin & Sridharan (2013); Syrgkanis et al. (2015)) that lead to better regret bounds than standard no-regret learning techniques (Littlestone & Warmuth, 1994; Freund & Schapire, 1997). The recent work by Wei et al. (2017) studies two-player learning in zero-sum stochastic games. Apart from focusing on zero-sum games, Wei et al. (2017) consider a different set of assumptions to derive the regret bounds, so their results are not directly comparable to ours. Furthermore, their algorithmic techniques are orthogonal to the line of literature we follow (e.g., Even-Dar et al. (2005, 2009)), and these differences are further elaborated in Wei et al. (2017).
Human-AI collaboration. The helper-AI problem (Dimitrakakis et al., 2017) is related to the present work, in that an AI agent is designing its policy by accounting for human imperfections. The authors use a Stackelberg formulation of the problem in a single-shot scenario. Their model assumes that the AI agent knows the behavioral model of the human agent, which is a best response to the policy of the AI agent for an incorrect transition kernel. We relax this requirement by studying a repeated human-AI interaction. Nikolaidis et al. (2017) study a repeated human-AI interaction, but their setting is more restrictive than ours as they do not model the changes in the environment. In particular, they have a repeated game setup, where the only aspect that changes over time is the “state” of the human representing what knowledge the human has about the robot’s payoffs. Prior work also considers a learner that is aware of the presence of other actors, e.g., Foerster et al. (2018) and Raileanu et al. (2018). While these multi-agent learning approaches account for the evolving behavior of other actors, the underlying assumption is typically that each agent follows a known model.
Steering and teaching. There is also a related literature on “steering” the behavior of another agent. Examples include (i) the environment design framework of Zhang et al. (2009), where one agent tries to steer the behavior of another agent by modifying its reward function, (ii) the cooperative inverse reinforcement learning of Hadfield-Menell et al. (2016), where the human uses demonstrations to reveal a proper reward function to the AI agent, and (iii) the advice-based interaction model (e.g., Amir et al. (2016)), where the goal is to communicate advice to a sub-optimal agent on how to act in the world. The latter approach is also closely related to the machine teaching literature (e.g., Zhu et al. (2018); Zhu (2015); Singla et al. (2013); Cakmak et al. (2012)). Our work differs from the above in that we focus on joint decision-making, rather than teaching or steering.
8 Conclusion
We presented a two-agent MDP framework in a collaborative setting. We considered the problem of designing a no-regret algorithm for one of the agents in the presence of a slowly adapting second agent. Our algorithm builds on the ideas of experts learning in MDPs and makes use of a novel form of recency bias to achieve strong regret bounds. In particular, we showed that in order for the first agent to facilitate collaboration, it is critical that the second agent’s policy changes are not abrupt.
References
 Abbasi et al. (2013) Abbasi, Y., Bartlett, P. L., Kanade, V., Seldin, Y., and Szepesvári, C. Online learning in markov decision processes with adversarially chosen transition probability distributions. In NIPS, pp. 2508–2516, 2013.
 Agarwal et al. (2017) Agarwal, A., Luo, H., Neyshabur, B., and Schapire, R. E. Corralling a band of bandit algorithms. In COLT, pp. 12–38, 2017.
 Amir et al. (2016) Amir, O., Kamar, E., Kolobov, A., and Grosz, B. Interactive teaching strategies for agent training. In IJCAI, pp. 804–811, 2016.
 Blum et al. (2003) Blum, A., Kalai, A., and Wasserman, H. Noisetolerant learning, the parity problem, and the statistical query model. Journal of the ACM (JACM), 50(4):506–519, 2003.
 Blum et al. (2008) Blum, A., Hajiaghayi, M., Ligett, K., and Roth, A. Regret minimization and the price of total anarchy. In STOC, pp. 373–382. ACM, 2008.
 Cakmak et al. (2012) Cakmak, M., Lopes, M., et al. Algorithmic and human teaching of sequential decision tasks. In AAAI, 2012.
 Daskalakis et al. (2011) Daskalakis, C., Deckelbaum, A., and Kim, A. Nearoptimal noregret algorithms for zerosum games. In SODA, pp. 235–254, 2011.
 Dekel & Hazan (2013) Dekel, O. and Hazan, E. Better rates for any adversarial deterministic mdp. In ICML, pp. 675–683, 2013.
 Dick et al. (2014) Dick, T., Gyorgy, A., and Szepesvari, C. Online learning in markov decision processes with changing cost sequences. In ICML, pp. 512–520, 2014.
 Dimitrakakis et al. (2017) Dimitrakakis, C., Parkes, D. C., Radanovic, G., and Tylkin, P. Multi-view decision processes: The helper-AI problem. In NIPS, pp. 5443–5452, 2017.
 Even-Dar et al. (2005) Even-Dar, E., Kakade, S. M., and Mansour, Y. Experts in a markov decision process. In NIPS, pp. 401–408, 2005.
 Even-Dar et al. (2009) Even-Dar, E., Kakade, S. M., and Mansour, Y. Online markov decision processes. Mathematics of Operations Research, 34(3):726–736, 2009.
 Foerster et al. (2018) Foerster, J., Chen, R. Y., Al-Shedivat, M., Whiteson, S., Abbeel, P., and Mordatch, I. Learning with opponent-learning awareness. In AAMAS, pp. 122–130, 2018.
 Freund & Schapire (1997) Freund, Y. and Schapire, R. E. A decision-theoretic generalization of online learning and an application to boosting. Journal of computer and system sciences, 55(1):119–139, 1997.
 Hadfield-Menell et al. (2016) Hadfield-Menell, D., Russell, S. J., Abbeel, P., and Dragan, A. Cooperative inverse reinforcement learning. In NIPS, pp. 3909–3917, 2016.
 Kalai et al. (2008) Kalai, A. T., Mansour, Y., and Verbin, E. On agnostic boosting and parity learning. In STOC, pp. 629–638, 2008.
 Kanade & Steinke (2014) Kanade, V. and Steinke, T. Learning hurdles for sleeping experts. ACM Transactions on Computation Theory (TOCT), 6(3):11, 2014.
 Littlestone & Warmuth (1994) Littlestone, N. and Warmuth, M. K. The weighted majority algorithm. Information and computation, 108(2):212–261, 1994.
 Neu et al. (2010) Neu, G., Antos, A., György, A., and Szepesvári, C. Online markov decision processes under bandit feedback. In NIPS, pp. 1804–1812, 2010.
 Neu et al. (2012) Neu, G., György, A., and Szepesvári, C. The adversarial stochastic shortest path problem with unknown transition probabilities. In AISTATS, pp. 805–813, 2012.
 Nikolaidis et al. (2017) Nikolaidis, S., Nath, S., Procaccia, A. D., and Srinivasa, S. Game-theoretic modeling of human adaptation in human-robot collaboration. In Proceedings of the International conference on human-robot interaction, pp. 323–331, 2017.
 Pietrzak (2012) Pietrzak, K. Cryptography from learning parity with noise. In International Conference on Current Trends in Theory and Practice of Computer Science, pp. 99–114, 2012.
 Raileanu et al. (2018) Raileanu, R., Denton, E., Szlam, A., and Fergus, R. Modeling others using oneself in multi-agent reinforcement learning. In ICML, pp. 4254–4263, 2018.
 Rakhlin & Sridharan (2013) Rakhlin, S. and Sridharan, K. Optimization, learning, and games with predictable sequences. In NIPS, 2013.
 Roughgarden (2009) Roughgarden, T. Intrinsic robustness of the price of anarchy. In STOC, pp. 513–522. ACM, 2009.
 Sharan et al. (2018) Sharan, V., Kakade, S., Liang, P., and Valiant, G. Prediction with a short memory. In STOC, pp. 1074–1087, 2018.

 Singla et al. (2013) Singla, A., Bogunovic, I., Bartók, G., Karbasi, A., and Krause, A. On actively teaching the crowd to classify. In NIPS Workshop on Data Driven Education, 2013.
 Singla et al. (2018) Singla, A., Hassani, S. H., and Krause, A. Learning to interact with learning agents. In AAAI, 2018.
 Syrgkanis et al. (2015) Syrgkanis, V., Agarwal, A., Luo, H., and Schapire, R. E. Fast convergence of regularized learning in games. In NIPS, pp. 2989–2997, 2015.
 Wei et al. (2017) Wei, C.-Y., Hong, Y.-T., and Lu, C.-J. Online reinforcement learning in stochastic games. In NIPS, pp. 4987–4997, 2017.
 Yu & Mannor (2009a) Yu, J. Y. and Mannor, S. Arbitrarily modulated markov decision processes. In Decision and Control, 2009, pp. 2946–2953, 2009a.
 Yu & Mannor (2009b) Yu, J. Y. and Mannor, S. Online learning in markov decision processes with arbitrarily changing rewards and transitions. In GameNets, pp. 314–322, 2009b.
 Yu et al. (2009) Yu, J. Y., Mannor, S., and Shimkin, N. Markov decision processes with arbitrary reward processes. Mathematics of Operations Research, 34(3):737–757, 2009.
 Zhang et al. (2009) Zhang, H., Parkes, D. C., and Chen, Y. Policy teaching through reward function learning. In EC, pp. 295–304, 2009.
 Zhu (2015) Zhu, X. Machine teaching: An inverse problem to machine learning and an approach toward optimal education. In AAAI, pp. 4083–4087, 2015.
 Zhu et al. (2018) Zhu, X., Singla, A., Zilles, S., and Rafferty, A. N. An overview of machine teaching. CoRR, abs/1801.05927, 2018.
Appendix A List of Appendices
In this section, we briefly describe the content provided in the appendices of the paper.
Appendix B Important MDP properties
To obtain our formal results, we derive several useful MDP properties.
B.1 Policy-reward bounds
The first policy-reward bound we show is an upper bound on the vector product between and .
Lemma 7.
The policy-reward dot product is bounded by:
Proof.
The definition of gives us:
where we used the triangle inequality and the boundedness of rewards. ∎
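The following worked inequality sketches the generic argument that a proof of this kind relies on. The symbols are assumptions introduced here for illustration and are not taken from the paper: $\pi$ is a probability vector over a finite action set $A$, $q$ is a vector of per-action values, and $C$ is a constant with $|q(a)| \le C$ for every $a \in A$. The specific vectors and constant in Lemma 7 may differ.

```latex
% Assumed notation (illustrative, not the paper's):
%   \pi(a) \ge 0, \quad \sum_{a \in A} \pi(a) = 1, \quad |q(a)| \le C \ \text{for all } a \in A.
\begin{align*}
\left| \langle \pi, q \rangle \right|
  &= \Big| \sum_{a \in A} \pi(a)\, q(a) \Big| \\
  &\le \sum_{a \in A} \pi(a)\, \left| q(a) \right|
     && \text{(triangle inequality, } \pi(a) \ge 0\text{)} \\
  &\le C \sum_{a \in A} \pi(a)
     && \text{(boundedness of } q\text{)} \\
  &= C .
\end{align*}
```

Under these assumptions, the bound does not depend on which probability vector $\pi$ is chosen.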
Note that the lemma holds for any and (i.e., for any and ). Furthermore, using the following two lemmas, we also bound the difference between two consecutive policy-reward dot products. In particular:
Lemma 8.
The following holds:
for all .
Proof.
To obtain the first inequality, note that: