A Regularized Opponent Model with Maximum Entropy Objective

05/17/2019 ∙ by Zheng Tian, et al. ∙ UCL

In a single-agent setting, reinforcement learning (RL) tasks can be cast as an inference problem by introducing a binary random variable o, which stands for "optimality". In this paper, we redefine the binary random variable o in the multi-agent setting and formalize multi-agent reinforcement learning (MARL) as probabilistic inference. We derive a variational lower bound on the likelihood of achieving optimality and name it the Regularized Opponent Model with Maximum Entropy Objective (ROMMEO). From ROMMEO, we present a novel perspective on opponent modeling and show, theoretically and empirically, how it can improve the performance of training agents in cooperative games. To optimize ROMMEO, we first introduce a tabular Q-iteration method, ROMMEO-Q, with a proof of convergence. We extend the exact algorithm to complex environments by proposing an approximate version, ROMMEO-AC. We evaluate the two algorithms on a challenging iterated matrix game and a differential game respectively, and show that they can outperform strong MARL baselines.




1 Introduction

Casting decision making and optimal control as an inference problem has a long history, dating back to [Kalman1960], where Kalman smoothing is used to solve optimal control in linear dynamics with quadratic cost. Bayesian methods can capture uncertainty about the transition probabilities, the reward functions, or other agents' policies. This distributional information can be used to formulate a more structured exploration/exploitation strategy than those commonly used in classical RL, e.g. ε-greedy. A common approach in many works [Toussaint and Storkey2006, Rawlik et al.2013, Levine and Koltun2013, Abdolmaleki et al.2018] to framing RL as an inference problem is to introduce a binary random variable representing "optimality". In this way, RL problems lend themselves to powerful inference tools [Levine2018]. However, the Bayesian approach in multi-agent environments is less well studied.

In many single-agent works, maximizing entropy is part of a training agent's objective, used for resolving ambiguities in inverse reinforcement learning [Ziebart et al.] and for improving the diversity [Florensa et al.2017], robustness [Fox et al.2015] and compositionality [Haarnoja et al.2018a] of the learned policy. In Bayesian RL, it often appears in the evidence lower bound (ELBO) for the log-likelihood of optimality [Haarnoja et al.2017, Schulman et al.2017, Haarnoja et al.2018b], commonly known as the maximum entropy objective (MEO), which encourages the optimal policy to maximize the expected return and long-term entropy.

In MARL, more than one agent interacts with the environment. In contrast with the single-agent setting, an agent's reward depends not only on the current environment state and the agent's own action but also on the actions of others. The existence of other agents increases the uncertainty in the environment. Therefore, the capability of reasoning about other agents' beliefs, private information, behaviors, strategies, and other characteristics is crucial. A reasoning model can be used in many different ways, but the most common case is where an agent utilizes its reasoning model to aid its own decision making [Brown1951, Heinrich and Silver2016, He et al.2016, Raileanu et al.2018, Wen et al.2019, Tian et al.2018]. In this work, we use the word "opponent" when referring to another agent in the environment, irrespective of the environment's cooperative or adversarial nature.

In this work, we reformulate the MARL problem as Bayesian inference and derive a multi-agent version of the MEO, which we call the Regularized Opponent Model with Maximum Entropy Objective (ROMMEO). Optimizing this objective with respect to one agent's opponent model gives rise to a new perspective on opponent modeling. We present two off-policy RL algorithms for optimizing ROMMEO in MARL. ROMMEO-Q applies to the discrete-action case and comes with a proof of convergence. For complex, continuous-action environments, we propose the ROMMEO Actor-Critic (ROMMEO-AC), which approximates the former procedure and extends to continuous problems. We evaluate these two approaches on a matrix game and a differential game against strong baselines and show that our methods can outperform all the baselines in terms of overall performance and speed of convergence.

2 Method

2.1 Stochastic Games

For an n-agent stochastic game [Shapley1953], we define a tuple (S, A^1, ..., A^n, r^1, ..., r^n, p, γ), where S denotes the state space, p is the distribution of the initial state, γ is the discount factor for future rewards, and A^i and r^i are the action space and the reward function of agent i respectively. Agent i chooses its action a^i ∈ A^i according to the policy π^i_{θ^i}(a^i|s), parameterized by θ^i and conditioned on a given state s ∈ S. Let us define the joint policy π_θ as the collection of all agents' policies, with θ representing the joint parameter. It is convenient to interpret the joint policy from the perspective of agent i such that π_θ = (π^i_{θ^i}, π^{-i}_{θ^{-i}}), where a^{-i} = (a^1, ..., a^{i-1}, a^{i+1}, ..., a^n), θ^{-i} is defined analogously, and π^{-i}_{θ^{-i}} is a compact representation of the joint policy of all complementary agents of i. At each stage of the game, actions are taken simultaneously.

In fully cooperative games, all agents share the same reward function, r^1 = ... = r^n = r. Therefore, each agent's objective is to maximize the shared expected return

    η(π_θ) = E_{π_θ} [ Σ_t γ^t r(s_t, a^i_t, a^{-i}_t) ],

with (s_t, a^i_t, a^{-i}_t) sampled from the trajectory distribution induced by p and π_θ.
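The shared discounted return above can be computed directly from a reward sequence; a minimal sketch in Python (the discount value is illustrative, not the paper's setting):

```python
# Discounted shared return of a fully cooperative game: every agent
# receives the same reward r(s_t, a^i_t, a^{-i}_t) at each step.
def shared_return(rewards, gamma=0.9):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

total = shared_return([1.0, 1.0, 1.0])  # 1 + 0.9*1 + 0.81*1
```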

2.2 A Variational Lower Bound for Multi-Agent Reinforcement Learning Problems

Figure 1: Graphical Model of a Stochastic Game with Optimality Variables

We embed the control problem into a graphical model by introducing a binary random variable o^i_t, which serves as the indicator of "optimality" for each agent i at each time step t. Recall that in the single-agent problem the reward is bounded, but whether the maximum reward is achieved by a given action is unknown. Therefore, in the single-agent case, o_t = 1 indicates the optimality of achieving the bounded maximum reward at time t; it can thus be regarded as a random variable with p(o_t = 1 | s_t, a_t) ∝ exp(r(s_t, a_t)). However, the definition of "optimality" in the multi-agent case is subtly different from the one in the single-agent situation.

In cooperative multi-agent reinforcement learning (CMARL), to define agent i's optimality, we first introduce the definitions of optimum and optimal policy:

Definition 1.

In cooperative multi-agent reinforcement learning, an optimum is a strategy profile (π^{i,*}, π^{-i,*}) such that:

    E_{(π^{i,*}, π^{-i,*})} [ Σ_t γ^t r(s_t, a^i_t, a^{-i}_t) ] ≥ E_{(π^i, π^{-i})} [ Σ_t γ^t r(s_t, a^i_t, a^{-i}_t) ]

for all policy profiles (π^i, π^{-i}). Agent i's optimal policy is π^{i,*}.

We also define a best response function as:

Definition 2.

For agent i, the best response function f^i to an arbitrary policy profile π^{-i} of its opponents is the policy which produces the highest expected return:

    f^i(π^{-i}) = argmax_{π^i} E_{(π^i, π^{-i})} [ Σ_t γ^t r(s_t, a^i_t, a^{-i}_t) ].

Therefore, we can easily see that:

Proposition 1.

For player i, let f^i be its best response function and let (π^{i,*}, π^{-i,*}) be the optimal policy profile. Then π^{-i} is the optimal policy profile of agent i's opponents if and only if the best response to π^{-i} is agent i's optimal policy:

    f^i(π^{-i}) = π^{i,*} ⟺ π^{-i} = π^{-i,*}.
In CMARL, a single agent's "optimality" cannot imply that it obtains the maximum reward, because the reward depends on the joint actions of all agents. Therefore, we define o^i_t = 1 to indicate only that agent i's policy at time step t is optimal. The posterior probability of agent i's optimality given its action is obtained by marginalizing over the opponents' actions:

    p(o^i_t = 1 | s_t, a^i_t) = ∫ p(o^i_t = 1 | s_t, a^i_t, a^{-i}_t) p(a^{-i}_t | s_t) d a^{-i}_t.

We assume that, given the other players' actions a^{-i}_t, the posterior probability of agent i's optimality is proportional to the exponentiated reward:

    p(o^i_t = 1 | s_t, a^i_t, a^{-i}_t) ∝ exp(r(s_t, a^i_t, a^{-i}_t)).
The conditional probabilities of "optimality" in CMARL and in the single-agent case have similar forms. However, it is worth noting that "optimality" in CMARL has a different interpretation from the one in the single-agent case.

For cooperative games, if all agents play optimally, then all agents receive the maximum rewards, which is the optimum of the game. Therefore, given that the other agents are playing their optimal policies π^{-i,*}, the probability that agent i also plays its optimal policy is the probability of obtaining the maximum reward from agent i's perspective. We therefore define agent i's objective as maximizing the likelihood of its own optimality conditioned on its opponents' optimality:

    p(o^i_{1:T} = 1 | o^{-i}_{1:T} = 1).
The graphical model is summarized in Fig. 1. Without loss of generality, we derive the solution for agent i; the same solution can be derived for the other agents. By introducing a variational distribution, we derive a lower bound (Eq. 8) on the likelihood of optimality of agent i. Written out in full, ρ(a^{-i}_t | s_t) is agent i's opponent model estimating the optimal policies of its opponents, π(a^i_t | s_t, a^{-i}_t) is agent i's conditional policy at the optimum (o^{-i}_t = 1), and p(a^{-i}_t | s_t) is the prior over the optimal policies of the opponents. In our work, we set the prior equal to the observed empirical distribution of opponents' actions given states. As we are only interested in the case where o^i_t = o^{-i}_t = 1, we drop these variables from the notation hereafter. Eq. 8 is a variational lower bound of the likelihood, and its derivation is deferred to Appendix B.1.
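Written in the notation above, the lower bound plausibly takes the following form; this is a reconstruction from the surrounding description (expected reward, entropy of the conditional policy, and a KL regularizer on the opponent model), not the paper's verbatim Eq. 8:

```latex
J(\pi, \rho) \;=\; \mathbb{E}\Big[\sum_{t} r(s_t, a^i_t, a^{-i}_t)
\;+\; \mathcal{H}\big(\pi(\cdot \mid s_t, a^{-i}_t)\big)
\;-\; D_{\mathrm{KL}}\big(\rho(\cdot \mid s_t)\,\big\|\, p(\cdot \mid s_t)\big)\Big]
```

The three terms correspond exactly to the pieces analyzed in Secs. 2.3 and 3.1: the shared return, the maximum entropy term, and the regularizer on the opponent model.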

2.3 The Learning of Opponent Model

We can further expand Eq. 8 into Eq. 9, which resembles the maximum entropy objective in single-agent reinforcement learning [Kappen2005, Todorov2007, Ziebart et al., Haarnoja et al.2017]. We denote agent i's expected reward plus the entropy of its conditional policy as agent i's maximum entropy objective (MEO). In the multi-agent version, however, it is worth noting that optimizing the MEO also drives the optimization of the opponent model ρ. This can be counter-intuitive at first sight, as opponent behaviour models are normally trained only on past state-action data to predict opponents' actions.

However, recall that ρ models the opponents' optimal policies in our work. By Prop. 1, the estimated opponent policies are the opponents' optimal policies if and only if agent i's best response to ρ is agent i's optimal policy:


Then, probabilistically, we have:


From Eq. 11, we can derive that:


See Appendix B.2. ∎

Therefore, agent ’s maximum entropy objective is a lower bound of the probability that its opponent model is optimal policies of its opponents. Given agent ’s policy being fixed, optimizing this lower bound updates agent ’s opponent model in the direction of the higher shared reward and the more stochastic conditional policy , making it closer to the real optimal policies of the opponents.

Without any regularization, at each iteration agent i can freely learn a new opponent model that is closest to the optimal opponent policies from its perspective, given its current conditional policy. Next, agent i can optimize the lower bound with respect to its conditional policy given the updated opponent model. This yields an EM-like iterative training procedure, which can be shown to monotonically increase the probability that the opponent model matches the optimal policies of the opponents. Then, by acting optimally with respect to the converged opponent model, we can recover agent i's optimal policy.

However, learning such an opponent model is unrealistic. As the opponents have no access to agent i's conditional policy, their policy learning can differ from the learning of agent i's opponent model. The actual opponent policies can therefore end up very different from the opponent model that agent i converges to in the above way. Acting optimally with respect to an opponent model far from the real opponents' policies can lead to poor performance.

The last term in Eq. 9 prevents agent i from building an unrealistic opponent model. The Kullback-Leibler (KL) divergence between the opponent model ρ and a prior acts as a regularizer on ρ. By setting the prior to the empirical distribution of the opponents' past behaviour, the KL divergence penalizes ρ heavily if it deviates too much from that empirical distribution. As the objective in Eq. 9 can be seen as a maximum entropy objective over one agent's policy and opponent model, with regularization on the opponent model, we call it the Regularized Opponent Model with Maximum Entropy Objective (ROMMEO).
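For a one-shot game with discrete actions, the three parts of the objective (expected shared reward, conditional-policy entropy, and the KL regularizer) can be computed directly. The sketch below is illustrative: the function name, the entropy weight alpha, and the one-shot setting are assumptions, not the paper's exact formulation:

```python
import numpy as np

def rommeo_objective(R, pi, rho, prior, alpha=1.0):
    """One-shot ROMMEO-style objective sketch.

    R[a, j]  - shared payoff for own action a and opponent action j
    pi[j]    - conditional policy over own actions given opponent action j
    rho      - opponent model (distribution over opponent actions)
    prior    - empirical prior over opponent actions
    """
    n_opp = len(rho)
    # Expected shared reward under opponent model rho and conditional policy pi.
    exp_reward = sum(rho[j] * (pi[j] @ R[:, j]) for j in range(n_opp))
    # Entropy of the conditional policy, averaged over the opponent model.
    entropy = -sum(rho[j] * (pi[j] * np.log(pi[j])).sum() for j in range(n_opp))
    # KL regularizer keeping the opponent model close to the empirical prior.
    kl = (rho * np.log(rho / prior)).sum()
    return exp_reward + alpha * entropy - kl
```

With everything uniform on a 2x2 identity-payoff game, the KL term vanishes and the objective reduces to the expected reward plus the entropy of a fair coin.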

3 Multi-Agent Soft Actor Critic

To optimize the ROMMEO objective in Eq. 9 derived in the previous section, we propose two reinforcement learning algorithms. We first introduce an exact tabular Q-iteration method with a proof of convergence. For practical implementation in complex continuous environments, we then propose the ROMMEO actor-critic (ROMMEO-AC), which approximates this procedure.

3.1 Regularized Opponent Model with Maximum Entropy Objective Q-Iteration

In this section, we derive a multi-agent version of the Soft Q-iteration algorithm proposed in [Haarnoja et al.2017], which we name ROMMEO-Q. The derivation follows a similar logic to [Haarnoja et al.2017], but the extension of Soft Q-learning to MARL is still nontrivial. From this section on, we slightly modify the objective in Eq. 9 by adding a weighting factor α on the entropy term; the original objective can be recovered by setting α = 1.

We first define the multi-agent soft Q-function and V-function respectively. Then we can show that the conditional policy and opponent model defined in Eqs. 15 and 16 below are the optimal solutions of the objective defined in Eq. 9:

Theorem 1.

We define the soft state-action value function of agent as


and soft state value function as


Then the optimal conditional policy and opponent model for Eq. 8 are




See Appendix C.2. ∎
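By analogy with single-agent soft Q-learning, the optimal solutions can be expected to take an energy-based form along the following lines; this is a sketch inferred from the theorem statement, not the paper's exact Eqs. 15 and 16:

```latex
\pi^*(a^i \mid s, a^{-i}) \;\propto\; \exp\!\Big(\tfrac{1}{\alpha}\, Q^*(s, a^i, a^{-i})\Big),
\qquad
\rho^*(a^{-i} \mid s) \;\propto\; p(a^{-i} \mid s)\,
\exp\!\Big(\tfrac{1}{\alpha}\, V^*(s, a^{-i})\Big)
```

where V^*(s, a^{-i}) = α log Σ_{a^i} exp(Q^*(s, a^i, a^{-i}) / α) is the soft value of an opponent action: the conditional policy is a Boltzmann distribution over the soft Q-values, and the opponent model reweights the prior by the exponentiated soft value of each opponent action.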

Following Theorem 1, we can find the optimal solution of Eq. 9 by first learning the soft multi-agent Q-function and then recovering the optimal policy and opponent model via Eqs. 15 and 16. To learn the Q-function, we show that it satisfies a Bellman-like equation, which we name the multi-agent soft Bellman equation:

Theorem 2.

We define the soft multi-agent Bellman equation for the soft state-action value function of agent as


See Appendix C.3. ∎

With the Bellman equation defined above, we can derive a solution to Eq. 17 with a fixed-point iteration, which we call ROMMEO Q-iteration (ROMMEO-Q). Additionally, we can show that it converges to the optimal soft Q-function and V-function under certain restrictions, as stated in [Wen et al.2019]:

Theorem 3.

ROMMEO Q-iteration. Consider a symmetric game with only one global optimum, attained by the optimal strategy profile. Let the soft Q-function and soft V-function be bounded, and assume that the soft-max integrals involved exist. Then the fixed-point iteration defined by the multi-agent soft Bellman backup converges to the optimal soft Q-function and V-function respectively.


See Appendix C.3. ∎
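To make the fixed-point machinery concrete, consider a stateless cooperative game, where the soft Q-function reduces to the payoff matrix and a single backup suffices. The sketch below assumes energy-based forms analogous to single-agent soft Q-learning; the payoff values follow the climbing game of [Claus and Boutilier1998], while the temperature and uniform prior are illustrative choices, not the paper's settings:

```python
import numpy as np

def logsumexp(x):
    m = x.max()
    return m + np.log(np.exp(x - m).sum())

# Shared payoff of the climbing game; rows are agent i's actions (A, B, C),
# columns are the opponent's actions.
Q = np.array([[ 11., -30.,  0.],
              [-30.,   7.,  6.],
              [  0.,   0.,  5.]])
alpha = 0.5                    # entropy temperature (illustrative value)
prior = np.full(3, 1.0 / 3.0)  # uniform prior over opponent actions

# Conditional policy: softmax over own actions for each opponent action.
lse = np.array([logsumexp(Q[:, j] / alpha) for j in range(3)])
pi = np.exp(Q / alpha - lse)   # pi[a, j] = softmax_a(Q[a, j] / alpha)
# Soft value of each opponent action: V(a_opp) = alpha * logsumexp(Q(., a_opp)/alpha).
V = alpha * lse
# Opponent model: prior reweighted by the exponentiated soft value.
rho = prior * np.exp((V - V.max()) / alpha)
rho /= rho.sum()
# Marginal policy: average the conditional policy over the opponent model.
marginal = pi @ rho

print(marginal.argmax(), rho.argmax())  # prints "0 0": both concentrate on A
```

Both the opponent model and the induced marginal policy concentrate on action A (index 0), the global optimum of the game, illustrating how the coupled updates pull both distributions towards the optimal joint action.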

Figure 2: (a): Learning curves of ROMMEO and baselines on ICG over 100 trials of training. (b): Probability of convergence to the global optimum for ROMMEO and baselines on ICG over 100 trials of training. The vertical axis represents the joint probability of both agents taking the optimal actions. (c): Probability of taking the optimal action as estimated by an agent's opponent model versus the observed empirical frequency in one trial of training.

3.2 Regularized Opponent Model with Maximum Entropy Objective Actor Critic

ROMMEO-Q assumes access to a model of the environment and is impractical to implement in high-dimensional continuous problems. To solve such problems, we propose the ROMMEO actor-critic (ROMMEO-AC), a model-free method with parameterized policy and Q-functions. We use neural networks (NNs) as function approximators for the conditional policy, opponent model and Q-function, and learn these functions by stochastic gradient descent, parameterizing the Q-function, conditional policy and opponent model by separate sets of parameters.

Without access to the environment model, we replace Q-iteration with Q-learning. We can therefore train the Q-function parameters to minimize:




where the target functions provide relatively stable target values. We use â^{-i} to denote an action sampled from agent i's opponent model; it should be distinguished from a^{-i}, the real action taken by agent i's opponents. Eq. 21 can be derived from Eqs. 15 and 16.

To recover the optimal conditional policy and opponent model, we follow the method in [Haarnoja et al.2018b], where π and ρ are trained to minimize the KL-divergences:


By using the reparameterization trick, expressing sampled actions as deterministic functions of the policy parameters and independent noise, we can rewrite the objectives above as:


The gradients of Eqs. 20, 24 and 25 with respect to the corresponding parameters are listed below:


We list the pseudo-code of ROMMEO-Q and ROMMEO-AC in Appendix A.
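The pathwise gradient enabled by the reparameterization trick can be demonstrated on a one-dimensional Gaussian "policy". The quadratic objective and the target value below are toy assumptions for illustration, not quantities from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy objective J(mu) = E_{a ~ N(mu, sigma^2)}[-(a - target)^2], whose exact
# gradient with respect to mu is -2 * (mu - target).
mu, sigma, target = 0.0, 1.0, 3.0
eps = rng.standard_normal(200_000)
a = mu + sigma * eps                     # reparameterized sample: a = f(eps; mu)
grad_est = np.mean(-2.0 * (a - target))  # pathwise (reparameterization) gradient
grad_true = -2.0 * (mu - target)

print(grad_est, grad_true)  # estimate close to the exact value 6.0
```

Because the noise is independent of the parameters, the gradient passes through the sample itself, giving a low-variance estimator compared to score-function methods.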

4 Related Works

In early works, the maximum entropy principle was used in policy search for linear dynamics [Todorov2010, Toussaint2009, Levine and Koltun2013] and in path integral control for general dynamics [Kappen2005, Theodorou et al.2010]. Recently, off-policy methods [Haarnoja et al.2017, Schulman et al.2017, Nachum et al.2017] have been proposed to improve sample efficiency in optimizing the MEO. However, in continuous environments, complex approximate inference may be needed to sample actions. To avoid this sampling procedure, [Haarnoja et al.2018b] trains the policy in a supervised fashion. Our work is closely related to this line of recent work because ROMMEO is an extension of the MEO to MARL.

A few works related to ours have been conducted in multi-agent soft Q-learning [Wen et al.2019, Wei et al.2018, Grau-Moya et al.2018], where variants of soft Q-learning are applied to different problems in MARL. Unlike these works, however, we do not take soft Q-learning as given and apply it to MARL problems with modifications. We first establish a novel objective, ROMMEO, and ROMMEO-Q is merely one off-policy method, derived with a complete convergence proof, for optimizing this objective. There are other ways of optimizing ROMMEO, for example on-policy gradient-based methods, but they are not included in this paper. ROMMEO-Q resembles soft Q-learning because ROMMEO extends the MEO to MARL, not because it is a modification of soft Q-learning applied to MARL problems.

There has been substantial progress in combining RL with probabilistic inference. However, most existing work focuses on the single-agent case, and the literature on Bayesian methods in MARL is limited. Among them are methods for cooperative games that assume prior knowledge of distributions over the game model and the possible strategies of others [Chalkiadakis and Boutilier2003], or over policy parameters and the possible roles of other agents [Wilson et al.2010]. In our work, we assume very limited prior knowledge of the environment model, the optimal policies, the opponents, or the observations during play. In addition, our algorithms are fully decentralized at both training and execution, which is more challenging than the centralized-training setting [Foerster et al.2017, Lowe et al.2017, Rashid et al.2018].

Figure 3: Experiment on the Max of Two Quadratic Game. (a) Reward surface and learning paths of agents; scattered points are actions taken at each step. (b) Learning curves of ROMMEO and baselines. (c) Means of the agents' policies and opponent models.

5 Experiments

5.1 Iterated Matrix Games

We first present a proof-of-principle result for ROMMEO-Q (the experiment code is available at https://github.com/rommeoijcai2019/rommeo) on iterated matrix games, where players need to cooperate to achieve the shared maximum reward. To this end, we study the iterated climbing game (ICG), a classic purely cooperative two-player stateless iterated matrix game.

The climbing game (CG) is a fully cooperative game proposed in [Claus and Boutilier1998] whose shared payoff matrix is:

          A      B      C
    A    11    -30      0
    B   -30      7      6
    C     0      0      5

It is a challenging benchmark because of the difficulty of converging to its global optimum. There are two Nash equilibria, (A, A) and (B, B), but only one global optimum, (A, A). The punishment for miscoordination when choosing a given action increases in the order C, B, A: the safest action is C and the miscoordination punishment is most severe for A. It is therefore very difficult for agents to converge to the global optimum in ICG.
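The miscoordination structure can be checked numerically. The payoff values below follow the climbing game of [Claus and Boutilier1998]; "risk" is measured here, for illustration only, as the expected payoff against a uniformly random opponent:

```python
import numpy as np

# Shared payoffs of the climbing game; payoff[own_action, opp_action],
# with actions ordered (A, B, C).
payoff = np.array([[ 11., -30.,  0.],
                   [-30.,   7.,  6.],
                   [  0.,   0.,  5.]])

# Expected payoff of each action against a uniformly random opponent:
# a simple proxy for the miscoordination risk discussed above.
risk_free_value = payoff.mean(axis=1)
print(dict(zip("ABC", risk_free_value)))  # C is safest, A is riskiest
```

Although (A, A) yields the highest joint payoff, A has the worst expected value under an unpredictable opponent, which is exactly why naive learners drift towards C.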

We compare our method to a series of strong baselines in MARL, including Joint Action Learner (JAL) [Claus and Boutilier1998], WoLF Policy Hillclimbing (WoLF-PHC) [Bowling and Veloso2001], Frequency Maximum Q (FMQ) [Kapetanakis and Kudenko2002] and Probabilistic Recursive Reasoning (PR2) [Wen et al.2019]. ROMMEO-Q-EMP is an ablation to evaluate the effectiveness of our proposed opponent model learning process, in which we replace our opponent model with the empirical frequency. Fig. 2(a) shows the learning curves on ICG for the different algorithms. The difference in rewards between ROMMEO-Q and FMQ-c10 may seem small because of the small reward margin between the global optimum and the local one. However, ROMMEO-Q significantly outperforms all baselines in terms of converging to the global optimum, as shown in Fig. 2(b). To further analyze the opponent modeling described in Sec. 2.3, we visualize the probability of one agent taking the optimal action A as estimated by the other agent's opponent model, together with its true policy, in Fig. 2(c). An agent's opponent model "thinks ahead" of its opponent and converges to the opponent's optimal policy before the opponent itself does. This helps the agent respond to its opponent model optimally by choosing action A, which in turn improves its opponent's model and policy. Therefore, the game converges to the global optimum. Note that the big drop in the probability of playing A at the beginning of training, for both the policies and the opponent models, comes from the severe punishment of miscoordination associated with action A.

5.2 Differential Games

We adopt the differential Max of Two Quadratic Game [Wei et al.2018] for the continuous case. The agents have a continuous action space, and each agent's shared reward is the maximum of two quadratic functions of the joint action [Wei et al.2018]. We compare our algorithm with a series of baselines including PR2 [Wen et al.2019], MASQL [Wei et al.2018, Grau-Moya et al.2018], MADDPG [Lowe et al.2017] and an independent learner via DDPG [Lillicrap et al.2015]. To compare against traditional opponent modeling methods, similar to [Rabinowitz et al.2018, He et al.2016], we implement an additional baseline of DDPG with an opponent module trained online with supervision to capture the latest opponent behaviour, called DDPG-OM. We trained all agents for 200 iterations with 25 steps per iteration.
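The reward surface can be reproduced with a commonly used form of this game; the coefficients below follow the construction in [Wei et al.2018] but should be treated as assumptions rather than the paper's exact constants:

```python
def reward(a1, a2):
    """Max of Two Quadratic Game: shared reward is the max of two quadratics.
    The basin around (-5, -5) is a wide local maximum (value 0); the narrow
    peak at (5, 5) is the global maximum (value 10)."""
    f1 = 0.8 * (-(((a1 + 5.0) / 3.0) ** 2) - (((a2 + 5.0) / 3.0) ** 2))
    f2 = -((a1 - 5.0) ** 2) - ((a2 - 5.0) ** 2) + 10.0
    return max(f1, f2)

print(reward(5.0, 5.0))  # global maximum value: 10.0
```

The wide, gentle basin of f1 dominates the gradient signal over most of the joint action space, which is what traps gradient-based learners at the local maximum.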

This is a challenging task for most continuous gradient-based RL algorithms because the gradient update tends to direct the training agents towards the sub-optimal point. The reward surface is shown in Fig. 3(a): there is a wide local maximum and a narrow global maximum, with a deep valley in between. If the agents' policies are initialized at the red starred point, which lies within the basin of the local maximum, gradient-based methods tend to fail to find the global maximum equilibrium because the valley blocks the path to the upper-right area.

A learning path of ROMMEO-AC is shown in Fig. 3(a); the solid bright circle in the upper-right corner indicates convergence to the global optimum. The learning curves are presented in Fig. 3(b): ROMMEO-AC converges to the global optimum in a very small number of steps, while most baselines only reach the sub-optimal point. An exception is PR2-AC, which can also achieve the global optimum but requires many more steps to explore and learn. Additionally, fine-tuning of the exploration noise or a separate exploration stage is required for the deterministic RL methods (MADDPG, DDPG, DDPG-OM, PR2-AC), and the learning outcome of the energy-based RL method (MASQL) is extremely sensitive to the annealing scheme for the temperature. In contrast, ROMMEO-AC employs a stochastic policy and controls the exploration level through the weighting factor α. It needs neither a separate exploration stage at the beginning of training nor a delicately designed annealing scheme for α.

Furthermore, we analyze the learning paths of the policies and the modeled opponent policies during training; the results are shown in Fig. 3(c). The red and orange lines are the means of the modeled opponent policies, which consistently approach the optimum ahead of the policies themselves (dashed blue and green lines). This helps the agents establish trust and converge to the optimum quickly, which further justifies the effectiveness and benefits of the regularized opponent model proposed in Sec. 2.3.

6 Conclusion

In this paper, we formulate the MARL problem via Bayesian inference and derive a novel objective, ROMMEO, which gives rise to a new perspective on opponent modeling. We design an off-policy algorithm, ROMMEO-Q, with a complete convergence proof for optimizing ROMMEO. For better generality, we also propose ROMMEO-AC, an actor-critic algorithm powered by NNs, to solve complex and continuous problems. We give a theoretical and empirical analysis of the effect of the new opponent-model learning process on an agent's performance in MARL. We evaluate our methods on a challenging matrix game and differential game and show that they outperform a series of strong baselines.



Appendix A Algorithms

  Result: policy , opponent model
  Initialize replay buffer to capacity .
  Initialize with random parameters , arbitrarily, set as the discount factor.
  Initialize target with random parameters , set the target parameters update interval.
  while not converge do
     Collect experience
     For the current state compute the opponent model and conditional policy respectively from:
     Compute the marginal policy and sample an action from it:
   Observe next state , opponent action and reward , save the new experience in the replay buffer:
     Update the prior from the replay buffer:
     Sample a mini-batch from the replay buffer:
     Update :
     for each tuple  do
        Sample .
        Compute empirical as: