Learning to Collaborate in Markov Decision Processes

01/23/2019 ∙ by Goran Radanovic, et al. ∙ 0

We consider a two-agent MDP framework where agents repeatedly solve a task in a collaborative setting. We study the problem of designing a learning algorithm for the first agent (A1) that facilitates a successful collaboration even in cases when the second agent (A2) is adapting its policy in an unknown way. The key challenge in our setting is that the presence of the second agent leads to non-stationarity and non-obliviousness of rewards and transitions for the first agent. We design novel online learning algorithms for agent A1 whose regret decays as O(T^{1-3/7·α}) with T learning episodes provided that the magnitude of agent A2's policy changes between any two consecutive episodes is upper bounded by O(T^{-α}). Here, the parameter α is assumed to be strictly greater than 0, and we show that this assumption is necessary provided that the learning parity with noise problem is computationally hard. We show that sub-linear regret of agent A1 further implies near-optimality of the agents' joint return for MDPs that manifest the properties of a smooth game.




1 Introduction

Recent advancements in AI have the potential to change our daily lives by boosting our productivity (e.g., via virtual personal assistants), augmenting our capabilities (e.g., via smart mobility systems), and increasing automation (e.g., via auto-pilots and assistive robots). These are settings of intelligence augmentation, where societal benefit will come not from complete automation but rather from the interaction between people and machines, in a process of a productive human-machine collaboration.

We expect that useful collaboration will come about through AI agents that can adapt to the behavior of users. As an example, consider smart elevators that detect users’ intentions to automatically open/close doors and select floors. In an initial period, users constantly change their behavior while they experiment with the elevator and do not follow a particular behavior. Such dynamics are natural in many settings, including when using personal assistive apps, becoming accustomed to new features of an auto-pilot mode, and using an assistive robot to perform a task. Without accounting for this changing behavior of users, the performance of the AI agent could deteriorate considerably, leading to, for example, hazardous situations in an auto-pilot mode. Hence, it is important that the AI agent updates its decision-making policy accordingly.

We formalize this problem through a two-agent reinforcement learning (RL) framework. The agents, hereafter referred to as agent A1 and agent A2, jointly solve a task in a collaborative setting (i.e., share a common reward function and a transition kernel that is based on their joint actions). Our goal is to develop a learning algorithm for agent A1 that facilitates a successful collaboration even in cases when agent A2 is adapting its own policy. In the above examples, agent A1 could represent the AI agent whereas agent A2 could be a person with time-evolving behavior. We primarily focus on an episodic Markov decision process (MDP) setting, in which the agents repeatedly interact:

  1. agent A1 decides on its policy based on historic information (agent A2’s past policies) and the underlying MDP model;

  2. agent A1 commits to its policy for a given episode without knowing the policy of agent A2;

  3. agent A1 updates its policy at the end of the episode based on agent A2’s observed behavior.

When agent A2’s policy is fixed and known, one can find an optimal policy for agent A1 using standard MDP planning techniques. In our setting, however, we do not assume agent A2’s behavior to be stationary, nor do we consider any particular model of how agent A2 changes its policy. This differs from similar two-agent (human-AI) collaborative settings (Dimitrakakis et al., 2017; Nikolaidis et al., 2017) that prescribe a particular behavioral model to agent A2 (human agent).

1.1 Overview of our approach

The presence of agent A2 in our framework implies that the reward function and the transition kernel are changing from the perspective of agent A1. Variants of the setting have also been studied in the learning literature (Even-Dar et al., 2005, 2009; Yu & Mannor, 2009b, a; Yu et al., 2009; Abbasi et al., 2013; Wei et al., 2017). However, these approaches do not directly apply because of the following differences: (i) they focus on a particular aspect of non-stationarity (e.g., changing rewards with fixed transitions) (Even-Dar et al., 2005, 2009), (ii) they require that the changes in the transition model are bounded (Yu & Mannor, 2009a, b), (iii) they make restrictions on the policy space (Abbasi et al., 2013), and (iv) they consider a competitive or adversarial setting instead of a cooperative setting with shared reward (Wei et al., 2017).

In our work, we adopt an assumption that agent A2 does not abruptly change its policy across episodes. We show that this assumption is critical: the problem becomes computationally intractable otherwise. Our approach is inspired by the problem of experts learning in MDPs (Even-Dar et al., 2005), in which each state is associated with an experts algorithm that derives the policy for that state using Q-values. However, to compensate for the non-stationarity of transitions and facilitate a faster learning process, we introduce novel forms of recency bias inspired by the ideas of Rakhlin & Sridharan (2013); Syrgkanis et al. (2015).


We design novel algorithms for agent A1 that lead to sub-linear regret of O(T^{1-3/7·α}) for T episodes. Here, the parameter α defines an upper bound on the magnitude of agent A2’s policy change w.r.t. T as O(T^{-α}). We show that it is computationally hard to achieve sub-linear regret for the special case of α = 0, using a reduction from the learning parities with noise problem (Abbasi et al., 2013; Kanade & Steinke, 2014). Furthermore, we connect the agents’ joint return to the regret of agent A1 by adapting the concept of smoothness from the game-theory literature (Roughgarden, 2009; Syrgkanis et al., 2015), and we show that the bound on the regret of agent A1 implies near optimality of the agents’ joint return for MDPs that manifest a smooth game (Roughgarden, 2009; Syrgkanis et al., 2015). To the best of our knowledge, we are the first to provide such guarantees in a collaborative two-agent MDP learning setup.

2 The Setting

We study a two-agent learning problem in an MDP. The agents are referenced as agent A1 and agent A2. We consider an episodic setting with T episodes (also called time steps) and each episode lasting rounds. Generic episodes are denoted by and , while a generic round is denoted by . The MDP is defined by:

  • a finite set of states , with denoting a generic state. We enumerate the states by , …, , and assume this ordering in our vector notation.

  • a finite set of actions , with denoting a generic action of agent A1 and denoting a generic action of agent A2. We enumerate the actions of agent A1 by , …, and agent A2 by , …, , and assume this ordering in our vector notation.

  • a transition kernel , which is a tensor with indices defined by the current state, the agents’ actions, and the next state.

  • a reward function that defines the joint reward for both agents.

We assume that agent A1 knows the MDP model. The agents commit to playing stationary policies and in each episode , and do so without knowing the commitment of the other agent. At the end of the episode , the agents observe each other’s policies (, ) and can use this information to update their future policies. Since the state and action spaces are finite, policies can be represented as matrices and , so that rows and define distributions on actions in a given state. We also define the reward matrix for agent A1 as , whose elements are the expected rewards of agent A1 for different actions and states. By bounded rewards, we have .

2.1 Objective

After each episode , the agents can adapt their policies. Note that agent A2 is not in our control and not assumed to be optimal. Therefore, we take the perspective of agent A1, and seek to optimize its policy in order to obtain good joint returns. The joint return in episode is:

Here is the state at round . For this comes from the initial state distribution . For later periods this is obtained by following joint actions , , …, from state . Actions are obtained from policies and . The second equation uses vector notation to define the joint return, where is a row vector representing the state distribution at episode and round , while is a row-wise dot product whose result is a column vector with elements. Since this is an episodic framework, we will assume the same starting state distribution, , for all episodes . However can differ across episodes since policies and evolve.

We define the average return over all episodes as . The objective is to output a sequence of agent A1’s policies , …, that maximize:

The maximum possible value of over all combinations of agent A1’s and agent A2’s policies is denoted as Opt. Notice that this value is achievable using MDP planning techniques, provided that we control both agents.
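The joint return described above can be illustrated with a minimal sketch. The displayed formula is missing from this copy, so the code assumes the return of an episode is the expected reward averaged over its rounds. The array layouts (`P` of shape states × A1 actions × A2 actions × next states, `r` of shape states × A1 actions × A2 actions) and all function names are hypothetical choices for illustration.

```python
import numpy as np

def joint_transition(P, pi1, pi2):
    """State-to-state kernel induced by both policies.

    P:   transition tensor, shape (S, A1, A2, S)  [assumed layout]
    pi1: policy matrix of agent A1, shape (S, A1), rows are distributions
    pi2: policy matrix of agent A2, shape (S, A2), rows are distributions
    """
    return np.einsum("sa,sb,sabn->sn", pi1, pi2, P)

def expected_reward(r, pi1, pi2):
    """Expected one-round reward in each state; r has shape (S, A1, A2)."""
    return np.einsum("sa,sb,sab->s", pi1, pi2, r)

def episode_return(P, r, pi1, pi2, d0, num_rounds):
    """Expected reward averaged over the rounds of one episode.

    The state distribution starts at d0 and is pushed forward through the
    joint kernel once per round (assumed definition of the joint return).
    """
    K = joint_transition(P, pi1, pi2)
    g = expected_reward(r, pi1, pi2)
    d, total = d0.copy(), 0.0
    for _ in range(num_rounds):
        total += d @ g   # expected reward collected in this round
        d = d @ K        # state distribution of the next round
    return total / num_rounds
```

Averaging `episode_return` over all episodes gives the quantity that the sequence of agent A1's policies is meant to maximize under this reading.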

2.2 Policy change magnitude and influences

However, we do not control agent A2 nor do we assume that agent A2 follows a particular behavioral model. Instead, we quantify the allowed behavior via its policy change magnitude, which for agent A2 is defined as:

where the norm is the operator (induced) norm.

In the case of agent A2, we will be focusing on policy change magnitudes that are of the order O(T^{-α}), where α is strictly greater than 0. For instance, the assumption holds if agent A2 is a learning agent adopting the experts in MDP approach of Even-Dar et al. (2005, 2009).
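As a rough illustration of the policy change magnitude, the sketch below takes the largest change between consecutive policy matrices. The text only says that an operator (induced) norm is used; the choice of the induced infinity norm (maximum absolute row sum) here is an assumption, and the function name is hypothetical.

```python
import numpy as np

def policy_change_magnitude(policies):
    """Largest change between consecutive policy matrices.

    policies: list of arrays of shape (S, A), one per episode.
    Uses the induced infinity norm (maximum absolute row sum), which is an
    assumed choice of operator norm.
    """
    return max(
        (np.linalg.norm(curr - prev, ord=np.inf)
         for prev, curr in zip(policies, policies[1:])),
        default=0.0,  # fewer than two policies: no change to measure
    )
```

For example, a policy that moves 0.1 probability mass between two actions in one state yields a magnitude of 0.2 under this norm.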

Furthermore, we introduce the notion of the influence of an agent on the transition dynamics. This measures how much an agent can influence the transition dynamics by changing its policy. For agent A2, the influence is defined as:

where kernel (matrix) denotes the probability of transitioning from to when the agents’ policies are and , respectively. (Footnote 1: Our notion of influence is similar to, although not the same as, that of Dimitrakakis et al. (2017).)

We are primarily interested in this notion to show how our approach compares to the existing results from the online learning literature. When the influence of agent A2 is zero, our setting relates to the single-agent settings of Even-Dar et al. (2005, 2009) and Dick et al. (2014), where rewards are non-stationary but transition probabilities are fixed. In general, the influence takes values in (see Appendix B, Corollary 1). We can analogously define policy change magnitude and influence of agent A1.

2.3 Mixing time and Q-values

We follow standard assumptions from the literature on online learning in MDPs (e.g., see (Even-Dar et al., 2005)), and only consider transition kernels that have well-defined stationary distributions. For the associated transition kernel, we define a stationary state distribution as the one for which:

  1. any initial state distribution converges to under policies and ;

  2. and .

Note that is represented as a row vector with elements. Furthermore, as discussed in (Even-Dar et al., 2005), this implies that there exists a mixing time such that for all state distributions and :

Due to the well-defined mixing time, we can define the average reward of agent A1 when following policy in episode as:

where is row-wise dot product whose result is a column vector with elements. The Q-value matrix for agent A1 w.r.t. policy is defined as:

where and are states and actions in round , starting from state with action and then using policy . Moreover, the policy-wise Q-value (column) vector for w.r.t. policy is defined by:

or in matrix notation . The Q-values satisfy the following Bellman equation:

where defines the probability distribution over next states given action of agent A1 and policy of agent A2 (here, is denoted as a row vector with elements). For other useful properties of this MDP framework we refer the reader to Appendix B.
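The displayed definitions are missing from this copy, so the following is only a sketch of how the stationary distribution, the average reward, and Q-values satisfying a Bellman equation could be computed for fixed policies. It assumes the average-reward form used in the experts-in-MDP literature, Q(s, a) = r(s, a) − η + Σ_{s'} P(s' | s, a) Q^π(s'), with rewards and transitions marginalized over agent A2's policy. All names and array layouts are hypothetical.

```python
import numpy as np

def stationary_distribution(K):
    """Solve d K = d with sum(d) = 1 for an ergodic kernel K of shape (S, S)."""
    S = K.shape[0]
    A = K.T - np.eye(S)
    A[-1, :] = 1.0            # swap one redundant equation for normalization
    b = np.zeros(S)
    b[-1] = 1.0
    return np.linalg.solve(A, b)

def q_values(P, r, pi1, pi2):
    """Q-values of agent A1 for fixed policies (assumed average-reward form).

    P: (S, A1, A2, S) transition tensor, r: (S, A1, A2) rewards,
    pi1: (S, A1) and pi2: (S, A2) policy matrices.
    Returns (Q, eta, d): Q of shape (S, A1), the average reward eta, and the
    stationary distribution d.
    """
    r1 = np.einsum("sb,sab->sa", pi2, r)      # A1's expected reward matrix
    P1 = np.einsum("sb,sabn->san", pi2, P)    # next-state law per A1 action
    K = np.einsum("sa,san->sn", pi1, P1)      # state kernel under both policies
    g = np.einsum("sa,sa->s", pi1, r1)        # expected reward per state
    d = stationary_distribution(K)
    eta = d @ g                               # average reward
    S = K.shape[0]
    # Policy-wise Q-vector h solves h = g - eta + K h, normalized so d @ h = 0.
    h = np.linalg.solve(np.eye(S) - K + np.outer(np.ones(S), d), g - eta)
    Q = r1 - eta + P1 @ h                     # Bellman equation for Q
    return Q, eta, d
```

The linear solve replaces the dynamic programming mentioned in the text; it relies on the kernel having a unique stationary distribution, as assumed in this section.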

3 Smoothness and No-regret Dynamics

The goal is to output a sequence of agent A1’s policies , …, so that the joint return is maximized. There are two key challenges: (i) agent A2’s policies could be sub-optimal (or even adversarial in the extreme case), and (ii) agent A1 does not know the current policy of agent A2 at the beginning of episode .

Smoothness criterion.

To deal with the first challenge, we consider a structural assumption that enables us to apply a regret analysis when quantifying the quality of a solution w.r.t. the optimum. In particular, we assume that the MDP is (λ, μ)-smooth:

Definition 1.

We say that an MDP is (λ, μ)-smooth if there exists a pair of policies such that for every policy pair :

This bounds the impact of agent A2’s policy on the average reward. In particular, there must exist an optimal policy pair such that the negative impact of agent A2 for choosing is controllable by an appropriate choice of policy of agent A1. This smoothness definition is a variant of the game-theoretic smoothness notion studied in other contexts, e.g., for the price-of-anarchy analysis of non-cooperative games and learning in repeated games (Roughgarden, 2009; Syrgkanis et al., 2015). We analyze the relationship between the smoothness parameters and the properties of the MDP in our setting in Appendix C.

It is important to note that since we have a finite number of rounds per episode, Opt is not necessarily the same as , and the policies that achieve Opt need not lead to .

No-regret learning.

To address the second challenge, we adopt the online learning framework and seek to minimize regret :


A policy sequence , …, is no-regret if regret is sublinear in T. An algorithm that outputs such sequences is a no-regret algorithm; intuitively, this means that the agent’s performance is competitive w.r.t. any fixed policy.

Near-optimality of no-regret dynamics.

Because agent A2 could be adapting to the policies of agent A1, this is an adaptive learning setting, and the notion of regret can become less useful. This is where the smoothness criterion comes in: we will show that it suffices to minimize the regret in order to obtain near-optimal performance.

Using an analysis similar to (Syrgkanis et al., 2015), we show the near-optimality of no-regret dynamics defined w.r.t. the optimal return Opt, as stated in the following lemma:

Lemma 1.

Return is lower bounded by:


See Appendix D for the proof. ∎

Lemma 1 implies that as the number of episodes and the number of rounds go to infinity, return converges to of optimum Opt provided that agent A1 is a no-regret learner. In the next section, we design such no-regret learning algorithms for agent A1.

4 Learning Algorithms

We base our approach on the expert learning literature for MDPs, in particular that of Even-Dar et al. (2005, 2009). The basic idea is to associate each state with an experts algorithm, and decide on a policy by examining Q-values of state-action pairs. Thus, the Q-function represents a reward function in the expert terminology.

4.1 Experts with periodic restarts: ExpRestart

In cases when agent A2 has no influence on transitions, the approach of Even-Dar et al. (2005, 2009) would yield the no-regret guarantees. The main difficulty of the present setting is that agent A2 can influence the transitions via its policy. The hope is that as long as the policy change magnitude of agent A2 is not too large, agent A1 can compensate for the non-stationarity of transitions by using only recent history when updating its policy.

A simple way of implementing this principle is to use a no-regret learning algorithm but restart it periodically, i.e., split the full time horizon into segments of length and apply the algorithm to each segment separately. Notice that in this case we have well-defined periods , , …, . As a choice of an expert algorithm (the algorithm associated with each state), we use Optimistic Follow the Regularized Leader (OFTRL) (Rakhlin & Sridharan, 2013; Syrgkanis et al., 2015). Our policy updating rule for segment , with starting point , can be described as:

for , and:

denotes a row of matrix (see Section 2.3), is a row vector from probability simplex , denotes the transpose operator, is a 1-strongly convex regularizer w.r.t. norm , and is the learning rate. (Footnote 2: Given and , we can calculate from the Bellman equation using standard dynamic programming techniques.) This approach, henceforth referred to as experts with periodic restarts (ExpRestart), suffices to obtain sublinear regret provided that the segment length and learning rate are properly set (see Appendix G).
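The restart idea can be sketched in a few lines. The update equations are missing from this copy, so the code assumes a negative-entropy regularizer, for which the OFTRL step has the closed form of an optimistic exponential-weights update: each state's action distribution is proportional to exp(learning rate × (cumulative Q-values + most recent Q-values)). The class and parameter names are hypothetical, and the exact indexing of segments is an illustrative choice.

```python
import numpy as np

def optimistic_hedge(cum_q, last_q, lr):
    """OFTRL step with a negative-entropy regularizer (assumed), per state.

    cum_q, last_q: arrays of shape (S, A1). The most recent Q-values are
    counted once more as a prediction of the next ones.
    """
    logits = lr * (cum_q + last_q)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

class ExpRestartSketch:
    """Experts with periodic restarts: history is dropped every segment."""

    def __init__(self, num_states, num_actions, segment_len, lr):
        self.shape = (num_states, num_actions)
        self.segment_len = segment_len
        self.lr = lr
        self.t = 0
        self._reset()

    def _reset(self):
        self.cum_q = np.zeros(self.shape)
        self.last_q = np.zeros(self.shape)

    def policy(self):
        """Policy matrix of agent A1 for the upcoming episode."""
        return optimistic_hedge(self.cum_q, self.last_q, self.lr)

    def update(self, q):
        """Record the Q-value matrix computed at the end of an episode."""
        self.t += 1
        if self.t % self.segment_len == 0:
            self._reset()              # next episode starts a new segment
        else:
            self.cum_q += q
            self.last_q = q.copy()
```

The reset at a segment boundary sends the policy back to uniform, which makes the abrupt policy change at segment boundaries easy to see in this sketch.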

One of the main drawbacks of experts with periodic restarts is that it potentially results in abrupt changes in the policy of agent A1 when switching from one segment to another. In practice, one might want to avoid this, for example, because agent A2 (e.g., representing a person) might negatively respond to such abrupt changes in agent A1’s policy. Next, we design a new experts algorithm for our setting that ensures gradual policy changes for agent A1 across episodes, while achieving the same order of regret guarantees (see Section 5.4 and Appendix G).

4.2 Experts with doubly recency bias: ExpDRBias

Utilizing fixed segments, as in the approach of ExpRestart, leads to potentially rapid policy changes after each segment. To avoid this, one could average the policies obtained by segmentation from different possible starting points. In particular, for each episode , we consider a family of segments that this episode belongs to: , , …, , , each segment in this family identifying a possible policy for episode . By averaging these policies, we can treat equally each possible segment of size that contains episode . This approach, henceforth referred to as experts with doubly recency bias (ExpDRBias), can be implemented through the following two ideas.

Recency windowing. The first idea is what we call recency windowing. Simply put, it specifies how far in the history an agent should look when choosing a policy. More precisely, we define a sliding window of size and to decide on policy we only use historical information from periods after . In particular, the updating rule of OFTRL would be modified for as


Algorithm 1 ExpDRBias
Input: History horizon , learning rate
Initialize: , compute using Eq. (2)
for episode do
    , commit to policy
    Obtain the return
    Observe agent A2’s policy
    Calculate Q-values
    , compute using Eq. (3)
end for

Recency modulation. The second idea is what we call recency modulation. This creates an averaging effect over the policies computed by the experts with periodic restarts approach for different possible starting points of the segmentation. For episode , recency modulation calculates policy updates using recency windowing but considers windows of different sizes. More precisely, we calculate updates with window sizes to , and then average them to obtain the final update. Lemma 3 shows that this updating rule will not lead to abrupt changes in agent A1’s policy.

To summarize, agent A1 has the following policy update rule for :



For , we follow update equation (2). The full description of agent A1’s policy update using the approach of ExpDRBias is given in Algorithm 1. As with ExpRestart, ExpDRBias leads to a sub-linear regret for a proper choice of and , which in turn results in a near-optimal behavior, as analyzed in the next section.
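The two ideas above, recency windowing and recency modulation, can be combined in a short sketch. Because the update equations are missing from this copy, the code is one plausible reading rather than the authors' exact rule: it assumes a negative-entropy regularizer (so each windowed OFTRL step is a softmax), window sizes running from 1 up to the history horizon, and plain full-history OFTRL while fewer episodes than the horizon have been observed. All names are hypothetical.

```python
import numpy as np

def softmax_rows(x):
    """Row-wise softmax, shifted for numerical stability."""
    x = x - x.max(axis=1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)

def exp_dr_bias_policy(history, window, lr, shape):
    """Policy of agent A1 from past Q-value matrices (oldest first).

    history: list of arrays of shape (S, A1)
    window:  history horizon (largest window size considered)
    lr:      learning rate
    shape:   (S, A1), used when there is no history yet
    """
    if not history:
        return np.full(shape, 1.0 / shape[1])        # uniform start
    last_q = history[-1]                             # optimistic prediction
    if len(history) < window:
        # Early episodes: OFTRL over the full history (assumed behaviour).
        return softmax_rows(lr * (sum(history) + last_q))
    recent = history[-window:]
    policy = np.zeros(shape)
    running = np.zeros(shape)
    for k in range(1, window + 1):
        running = running + recent[-k]               # sum of the last k Q-matrices
        policy += softmax_rows(lr * (running + last_q))   # windowed OFTRL step
    return policy / window                           # average over window sizes
```

Each episode leaving or entering the windows shifts only a fraction of the averaged terms, which is the intuition behind the gradual policy changes claimed for this update.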

5 Theoretical Analysis of ExpDRBias

To bound regret , given by equation (1), it is useful to express difference in terms of Q-values. In particular, one can show that this difference is equal to (see Lemma 15 in Appendix B). By the definitions of and , this implies:

If were not dependent on (e.g., if agent A2 were not changing its policy), bounding would amount to bounding the sum of terms . This could be done with an approach that carefully combines the proof techniques of Even-Dar et al. (2005) with the OFTRL properties, in particular, regret bounded by variation in utilities (RVU) (Syrgkanis et al., 2015). However, in our setting is generally changing with .

5.1 Change magnitudes of stationary distributions

To account for this, we need to investigate how quickly distributions change across episodes. Furthermore, to utilize the RVU property, we need to do the same for distributions . The following lemma provides bounds on the respective change magnitudes.

Lemma 2.

The difference between the stationary distributions of two consecutive episodes is upper bounded by:

Furthermore, for any policy :


See Appendix E.3. ∎

5.2 Properties based on OFTRL

The bounds on the change magnitudes of distributions and , which will propagate to the final result, depend on agent A1’s policy change magnitude . The following lemma provides a bound for that, together with the assumed bound on , is useful in establishing no-regret guarantees.

Lemma 3.

For any and , the change magnitude of weights in ExpDRBias is bounded by:



See Appendix E.2. ∎

Now, we turn to bounding term . Lemma 4 formalizes the RVU property for ExpDRBias using norm and its dual norm, derived from results in the existing literature (Syrgkanis et al., 2015). (Footnote 3: An extended version of the lemma, which is needed for the main result, is provided in Appendix E.1.) Intuitively, Lemma 4 shows that it is possible to bound by examining the change magnitudes of Q-values.

Lemma 4.

Consider ExpDRBias and let denote column vector of ones with elements. Then, for each episode of ExpDRBias , we have:

where are defined in (3), , and is an arbitrary policy of agent A1.


See Appendix E.1. ∎

5.3 Change magnitudes of Q-values

Finally, we derive bounds on the change magnitudes of Q-values, which we use together with Lemma 4 to prove the main results. We first bound the difference , which helps us in bounding the difference .

Lemma 5.

The difference between Q-values of two consecutive episodes is upper bounded by:

where .


See Appendix E.4 for the proof. ∎

Lemma 6.

The difference between Q-values of two consecutive episodes is upper bounded by:

where .


See Appendix E.5 for the proof. ∎

For convenience, instead of directly using , we consider a variable such that . The following proposition gives a rather loose (but easy to interpret) bound on that satisfies the inequality.

Proposition 1.

There exists a constant independent of and , such that:


The claim is directly obtained from Lemma 5, Lemma  6, and the fact that and . ∎

5.4 Regret analysis and main results

We now come to the most important part of our analysis: establishing the regret guarantees for ExpDRBias. Using the results from the previous subsections, we obtain the following regret bound:

Theorem 1.

Let the learning rate of ExpDRBias be equal to and let be such that and . Then, the regret of ExpDRBias is upper-bounded by:


See Appendix F for the proof. ∎

When agent A2 does not influence the transition kernel through its policy, i.e., when , the regret is for . In this case, we could have also applied the original approach of Even-Dar et al. (2005, 2009), but interestingly, it would result in a worse regret bound, i.e., . By leveraging the fact that agent A2’s policy is slowly changing, which corresponds to reward functions in the setting of Even-Dar et al. (2005, 2009) not being fully adversarial, we are able to improve on the worst-case guarantees. The main reason for such an improvement is our choice of the underlying experts algorithm, i.e., OFTRL, that exploits the apparent predictability of agent A2’s behavior. Similar arguments were made for the repeated games settings (Rakhlin & Sridharan, 2013; Syrgkanis et al., 2015), which correspond to our setting when the MDP consists of only one state. Namely, in the single state scenario, agent A2 does not influence transitions, so the resulting regret is , matching the results of (Syrgkanis et al., 2015).

In general, the regret depends on . If with , then equalizes the order of the two regret components in Theorem 1 and leads to the regret of . This brings us to the main result, which provides a lower bound on the return :

Theorem 2.

Assume that for . Let and . Then, the regret of ExpDRBias is upper-bounded by:

Furthermore, when the MDP is ()-smooth, the return of ExpDRBias is lower-bounded by:


Notice that for . By Lemma 3, this implies that there exists a fixed (not dependent on ) such that for large enough . Furthermore, , so there exists a fixed such that for large enough . Hence, we can apply Theorem 1 to obtain an order-wise regret bound: .

Now, consider two cases. First, let . Then, we obtain:

For the other case, i.e., when , we obtain:

Therefore, , which proves the first statement. By combining it with Lemma 1, we obtain the second statement. ∎

The multiplicative factors in the asymptotic bounds mainly depend on mixing time . In particular they are dominated by factor and its powers, as can be seen from Lemma 1, Theorem 1, and Proposition 1. Note that Lemma 3 allows us to upper bound in Theorem 1 with . Furthermore, for large enough . Hence, these results imply dependency of the asymptotic bounds on . This is larger than what one might expect from the prior work (e.g., the bound in (Even-Dar et al., 2005, 2009) has dependency). However, note that our setting is different in that the presence of agent A2 has an effect on transitions (from agent A1’s perspective), so it is not surprising that the resulting dependency on the mixing time is worse.

6 Hardness Result

Our formal guarantees assume that the policy change magnitude of agent A2 is a decreasing function in the number of episodes given by O(T^{-α}) for α > 0. What if we relax this, and allow agent A2 to adapt independently of the number of episodes? We show a hardness result for the setting of α = 0, using a reduction from the online agnostic parity learning problem (Abbasi et al., 2013). As argued in (Abbasi et al., 2013), the online to batch reduction implies that the online version of agnostic parity learning is at least as hard as its offline version, for which the best known algorithm has complexity (Kalai et al., 2008). In fact, agnostic parity learning is a (harder) variant of the learning parity with noise problem, widely believed to be computationally intractable (Blum et al., 2003; Pietrzak, 2012), and thus often adopted as a hardness assumption (e.g., (Sharan et al., 2018)).

Theorem 3.

Assume that the policy change magnitude of agent A2 is of the order and that its influence is equal to . If there exists a time algorithm that outputs a policy sequence , …, whose regret is for , then there also exists a time algorithm for online agnostic parity learning whose regret is .


See Appendix H. ∎

Our proof relies on the result of Abbasi et al. (2013) (Theorem 5 and its proof), which reduces the online agnostic parity learning problem to the adversarial shortest path problem, which we reduce to our problem. This theorem implies that when α = 0, it is unlikely to obtain regret that is sub-linear in T given the current computational complexity results.

7 Related Work

Experts learning in MDP. Our framework is closely related to that of Even-Dar et al. (2005, 2009), although the presence of agent A2 means that we cannot directly use their algorithmic approach. In fact, learning with an arbitrarily changing transition is believed to be computationally intractable (Abbasi et al., 2013), and computationally efficient learning algorithms experience linear regret (Yu & Mannor, 2009a; Abbasi et al., 2013). This is where we make use of the bound on the magnitude of agent A2’s policy change. Contrary to most of the existing work, the changes in reward and transition kernel in our model are non-oblivious and adapting to the learning algorithm of agent A1. There have been a number of follow-up works that either extend the mentioned results or improve them for more specialized settings (Dick et al., 2014; Neu et al., 2012, 2010; Dekel & Hazan, 2013). Agarwal et al. (2017) and Singla et al. (2018) study the problem of learning with experts advice where experts are not stationary and are learning agents themselves. However, their focus is on designing a meta-algorithm on how to coordinate with these experts and is technically very different from ours.

Learning in games. To relate the quality of an optimal solution to agent A1’s regret, we use techniques similar to those studied in the learning in games literature (Blum et al., 2008; Roughgarden, 2009; Syrgkanis et al., 2015). The fact that agent A2’s policy is changing slowly enables us to utilize no-regret algorithms for learning in games with recency bias (e.g., (Daskalakis et al., 2011; Rakhlin & Sridharan, 2013; Syrgkanis et al., 2015)) that lead to better regret bounds than standard no-regret learning techniques (Littlestone & Warmuth, 1994; Freund & Schapire, 1997). The recent work by Wei et al. (2017) studies two-player learning in zero-sum stochastic games. Apart from focusing on zero-sum games, Wei et al. (2017) consider a different set of assumptions to derive the regret bounds, so their results are not directly comparable to ours. Furthermore, their algorithmic techniques are orthogonal to the line of literature we follow (e.g., (Even-Dar et al., 2005, 2009)) and these differences are further elaborated in (Wei et al., 2017).

Human AI collaboration. The helper-AI problem (Dimitrakakis et al., 2017) is related to the present work, in that an AI agent is designing its policy by accounting for human imperfections. The authors use a Stackelberg formulation of the problem in a single shot scenario. Their model assumes that the AI agent knows the behavioral model of the human agent, which is a best response to the policy of the AI agent for an incorrect transition kernel. We relax this requirement by studying a repeated human-AI interaction. Nikolaidis et al. (2017) study a repeated human-AI interaction, but their setting is more restrictive than ours as they do not model the changes in the environment. In particular, they have a repeated game setup, where the only aspect that changes over time is the “state” of the human representing what knowledge the human has about the robot’s payoffs. Prior work also considers a learner that is aware of the presence of other actors, e.g., (Foerster et al., 2018; Raileanu et al., 2018). While these multi-agent learning approaches account for the evolving behavior of other actors, the underlying assumption is typically that each agent follows a known model.

Steering and teaching. There is also a related literature on “steering” the behavior of another agent. For example, (i) the environment design framework of Zhang et al. (2009), where one agent tries to steer the behavior of another agent by modifying its reward function, (ii) the cooperative inverse reinforcement learning of Hadfield-Menell et al. (2016), where the human uses demonstrations to reveal a proper reward function to the AI agent, and (iii) the advice-based interaction model (e.g., (Amir et al., 2016)), where the goal is to communicate advice to a sub-optimal agent on how to act in the world. The latter approach is also closely related to the machine teaching literature (e.g., (Zhu et al., 2018; Zhu, 2015; Singla et al., 2013; Cakmak et al., 2012)). Our work differs from the above in that we focus on joint decision-making, rather than teaching or steering.

8 Conclusion

We presented a two-agent MDP framework in a collaborative setting. We considered the problem of designing a no-regret algorithm for one of the agents in the presence of a slowly adapting, second agent. Our algorithm builds from the ideas of experts learning in MDPs and makes use of a novel form of recency bias to achieve strong regret bounds. In particular, we showed that in order for the first agent to facilitate collaboration, it is critical that the second agent’s policy changes are not abrupt.


References

  • Abbasi et al. (2013) Abbasi, Y., Bartlett, P. L., Kanade, V., Seldin, Y., and Szepesvári, C. Online learning in Markov decision processes with adversarially chosen transition probability distributions. In NIPS, pp. 2508–2516, 2013.
  • Agarwal et al. (2017) Agarwal, A., Luo, H., Neyshabur, B., and Schapire, R. E. Corralling a band of bandit algorithms. In COLT, pp. 12–38, 2017.
  • Amir et al. (2016) Amir, O., Kamar, E., Kolobov, A., and Grosz, B. Interactive teaching strategies for agent training. In IJCAI, pp. 804–811, 2016.
  • Blum et al. (2003) Blum, A., Kalai, A., and Wasserman, H. Noise-tolerant learning, the parity problem, and the statistical query model. Journal of the ACM (JACM), 50(4):506–519, 2003.
  • Blum et al. (2008) Blum, A., Hajiaghayi, M., Ligett, K., and Roth, A. Regret minimization and the price of total anarchy. In STOC, pp. 373–382. ACM, 2008.
  • Cakmak et al. (2012) Cakmak, M., Lopes, M., et al. Algorithmic and human teaching of sequential decision tasks. In AAAI, 2012.
  • Daskalakis et al. (2011) Daskalakis, C., Deckelbaum, A., and Kim, A. Near-optimal no-regret algorithms for zero-sum games. In SODA, pp. 235–254, 2011.
  • Dekel & Hazan (2013) Dekel, O. and Hazan, E. Better rates for any adversarial deterministic MDP. In ICML, pp. 675–683, 2013.
  • Dick et al. (2014) Dick, T., György, A., and Szepesvári, C. Online learning in Markov decision processes with changing cost sequences. In ICML, pp. 512–520, 2014.
  • Dimitrakakis et al. (2017) Dimitrakakis, C., Parkes, D. C., Radanovic, G., and Tylkin, P. Multi-view decision processes: The helper-AI problem. In NIPS, pp. 5443–5452, 2017.
  • Even-Dar et al. (2005) Even-Dar, E., Kakade, S. M., and Mansour, Y. Experts in a Markov decision process. In NIPS, pp. 401–408, 2005.
  • Even-Dar et al. (2009) Even-Dar, E., Kakade, S. M., and Mansour, Y. Online Markov decision processes. Mathematics of Operations Research, 34(3):726–736, 2009.
  • Foerster et al. (2018) Foerster, J., Chen, R. Y., Al-Shedivat, M., Whiteson, S., Abbeel, P., and Mordatch, I. Learning with opponent-learning awareness. In AAMAS, pp. 122–130, 2018.
  • Freund & Schapire (1997) Freund, Y. and Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
  • Hadfield-Menell et al. (2016) Hadfield-Menell, D., Russell, S. J., Abbeel, P., and Dragan, A. Cooperative inverse reinforcement learning. In NIPS, pp. 3909–3917, 2016.
  • Kalai et al. (2008) Kalai, A. T., Mansour, Y., and Verbin, E. On agnostic boosting and parity learning. In STOC, pp. 629–638, 2008.
  • Kanade & Steinke (2014) Kanade, V. and Steinke, T. Learning hurdles for sleeping experts. ACM Transactions on Computation Theory (TOCT), 6(3):11, 2014.
  • Littlestone & Warmuth (1994) Littlestone, N. and Warmuth, M. K. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.
  • Neu et al. (2010) Neu, G., Antos, A., György, A., and Szepesvári, C. Online Markov decision processes under bandit feedback. In NIPS, pp. 1804–1812, 2010.
  • Neu et al. (2012) Neu, G., György, A., and Szepesvári, C. The adversarial stochastic shortest path problem with unknown transition probabilities. In AISTATS, pp. 805–813, 2012.
  • Nikolaidis et al. (2017) Nikolaidis, S., Nath, S., Procaccia, A. D., and Srinivasa, S. Game-theoretic modeling of human adaptation in human-robot collaboration. In Proceedings of the International Conference on Human-Robot Interaction, pp. 323–331, 2017.
  • Pietrzak (2012) Pietrzak, K. Cryptography from learning parity with noise. In International Conference on Current Trends in Theory and Practice of Computer Science, pp. 99–114, 2012.
  • Raileanu et al. (2018) Raileanu, R., Denton, E., Szlam, A., and Fergus, R. Modeling others using oneself in multi-agent reinforcement learning. In ICML, pp. 4254–4263, 2018.
  • Rakhlin & Sridharan (2013) Rakhlin, S. and Sridharan, K. Optimization, learning, and games with predictable sequences. In NIPS, 2013.
  • Roughgarden (2009) Roughgarden, T. Intrinsic robustness of the price of anarchy. In STOC, pp. 513–522. ACM, 2009.
  • Sharan et al. (2018) Sharan, V., Kakade, S., Liang, P., and Valiant, G. Prediction with a short memory. In STOC, pp. 1074–1087, 2018.
  • Singla et al. (2013) Singla, A., Bogunovic, I., Bartók, G., Karbasi, A., and Krause, A. On actively teaching the crowd to classify. In NIPS Workshop on Data Driven Education, 2013.
  • Singla et al. (2018) Singla, A., Hassani, S. H., and Krause, A. Learning to interact with learning agents. In AAAI, 2018.
  • Syrgkanis et al. (2015) Syrgkanis, V., Agarwal, A., Luo, H., and Schapire, R. E. Fast convergence of regularized learning in games. In NIPS, pp. 2989–2997, 2015.
  • Wei et al. (2017) Wei, C.-Y., Hong, Y.-T., and Lu, C.-J. Online reinforcement learning in stochastic games. In NIPS, pp. 4987–4997, 2017.
  • Yu & Mannor (2009a) Yu, J. Y. and Mannor, S. Arbitrarily modulated Markov decision processes. In Decision and Control, 2009, pp. 2946–2953, 2009a.
  • Yu & Mannor (2009b) Yu, J. Y. and Mannor, S. Online learning in Markov decision processes with arbitrarily changing rewards and transitions. In GameNets, pp. 314–322, 2009b.
  • Yu et al. (2009) Yu, J. Y., Mannor, S., and Shimkin, N. Markov decision processes with arbitrary reward processes. Mathematics of Operations Research, 34(3):737–757, 2009.
  • Zhang et al. (2009) Zhang, H., Parkes, D. C., and Chen, Y. Policy teaching through reward function learning. In EC, pp. 295–304, 2009.
  • Zhu (2015) Zhu, X. Machine teaching: An inverse problem to machine learning and an approach toward optimal education. In AAAI, pp. 4083–4087, 2015.
  • Zhu et al. (2018) Zhu, X., Singla, A., Zilles, S., and Rafferty, A. N. An overview of machine teaching. CoRR, abs/1801.05927, 2018.

Appendix A List of Appendices

In this section, we briefly describe the content of the appendices of the paper.

  1. Appendix B contains the statements and the corresponding proofs for the MDP properties that we used in proving the technical results of the paper.

  2. Appendix C provides a relationship between the smoothness parameters introduced in Section 3 and structural properties of our setting.

  3. Appendix D provides the proof of Lemma 1, which connects no-regret learning with the optimization objective (see Section 3).

  4. Appendix E provides the proofs of the lemmas related to our algorithmic approach that are important for proving the main results.

  5. Appendix F provides the proof of Theorem 1, which establishes the regret bound of Algorithm 1 (see Section 5.4).

  6. Appendix G describes the properties of experts with periodic restarts (see Section 4.1).

  7. Appendix H provides the proof of Theorem 3, which establishes the hardness of achieving no-regret if agent A2’s policy change is not a decreasing function in the number of episodes (see Section 6).

Appendix B Important MDP properties

To obtain our formal results, we derive several useful MDP properties.

B.1 Policy-reward bounds

The first policy-reward bound we show is an upper bound on the dot product between the policy and the reward vector.

Lemma 7.

The policy-reward dot product is bounded by:


The definition of gives us:

where we used the triangle inequality and the boundedness of rewards. ∎
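As a quick numerical illustration of the kind of bound stated in Lemma 7, the following Python sketch checks that a policy-weighted sum of bounded rewards stays bounded. It is only a sketch under stated assumptions: rewards are taken to lie in [0, 1], the policy is taken to be a probability distribution over actions in each state, and the helper names (`random_policy`, `random_rewards`, `policy_reward_product`) are hypothetical and do not come from the paper. The exact form of the lemma's bound is not reproduced here.

```python
import random


def random_policy(num_states, num_actions, rng):
    """Hypothetical helper: one probability distribution over actions per state."""
    policy = []
    for _ in range(num_states):
        weights = [rng.random() + 1e-12 for _ in range(num_actions)]
        total = sum(weights)
        policy.append([w / total for w in weights])
    return policy


def random_rewards(num_states, num_actions, rng):
    """Hypothetical helper: rewards assumed to lie in [0, 1)."""
    return [[rng.random() for _ in range(num_actions)] for _ in range(num_states)]


def policy_reward_product(policy, rewards):
    """Per-state dot product: sum over actions of policy[s][a] * rewards[s][a]."""
    return [
        sum(p * r for p, r in zip(policy_row, reward_row))
        for policy_row, reward_row in zip(policy, rewards)
    ]


rng = random.Random(0)
policy = random_policy(num_states=5, num_actions=3, rng=rng)
rewards = random_rewards(num_states=5, num_actions=3, rng=rng)
products = policy_reward_product(policy, rewards)

# Triangle inequality plus bounded rewards:
# |sum_a pi(s, a) r(s, a)| <= sum_a pi(s, a) |r(s, a)| <= sum_a pi(s, a) = 1.
largest = max(abs(value) for value in products)
print("largest per-state product is at most 1:", largest <= 1.0 + 1e-9)
```

Because each row of the policy sums to one and every reward is at most one in absolute value, each per-state product is a convex combination of bounded numbers, which is the same two-step argument (triangle inequality, then boundedness of rewards) used in the proof above.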

Note that the lemma holds for any and (i.e., for any and ). Furthermore, using the following two lemmas, we also bound the difference between two consecutive policy-reward dot products. In particular:

Lemma 8.

The following holds:

for all .


To obtain the first inequality, note that: