We augment the centralised actor-critic architecture of multi-agent deep deterministic policy gradients (MADDPG) [lowe2017maddpg] with a novel opponent model based on the meta-learning algorithm RL² [duan2016rl2]. RL² is based on an LSTM network which stores the state of a task-specific agent in its activations. The role of the LSTM is to adapt the task-specific agent to a new task. The LSTM is trained to learn a generalisable update rule for its hidden state which can replace closed-form gradient descent techniques for training on new tasks. The state update rule of the LSTM therefore becomes an optimisation algorithm trained on the performance of the agents it generates in varied environments.
We aim to emulate an opponent’s learning rather than learn a generalisable optimisation technique. In light of this, our recurrent module stores and updates a representation of the opponent. The state update function is therefore trained to emulate the opponent’s learning. This training is treated as a regression problem for predicting opponent actions from the observed history of the game. We utilise a method we term Episode Processing (EP) whereby each episode of experience is summarised by a bidirectional LSTM and is then used to update our agent’s representation of its opponent.
We use the OpenAI particle environments [lowe2017maddpg] for experiments. Specifically, we focus on the two-player adversarial game Keep-Away.
Our LeMOL agents are endowed with an in-episode LSTM network to aid with the issue of partial observability. Our agents take on the role of the defender trying to keep the attacker away from the goal. The goal is one of two landmarks and the defender does not know which. The attacker is trained using MADDPG.
Our experiments compare our MADDPG baseline, our full LeMOL-EP model, LeMOL-EP where modelling of the opponent’s learning process is removed (Ablated LeMOL-EP), LeMOL-EP where the opponent model has perfect prediction accuracy (LeMOL-EP Oracle OM) and LeMOL-EP where the opponent model is untrained (LeMOL-EP Naive OM). In the decentralised setting we also include MADDPG-OM, a version of MADDPG with opponent modelling added to make it amenable to decentralised training.
Comparison of opponent model performance for the full and ablated LeMOL-EP models in Figure 1(a) demonstrates the benefit, in terms of action prediction accuracy, of modelling the opponent’s learning. Having a continuously updated opponent model improves and stabilises opponent model performance. Figure 1(b) shows the impact of improved opponent modelling on agent performance. We find that the reduction in the variance of policy updates resulting from conditioning an agent’s policy on predictions of the opponent’s actions improves overall agent performance.
In the decentralised setting (Figure 1(c)), we find that using the opponent model can enable effective decentralised training, as the opponent model compensates for the inability to access others’ policies. Note that the architecture of the opponent models is consistent across centralised and decentralised settings. Our decentralised model attains a similar level of performance to the centralised MADDPG agent. The opponent model is the only means of accounting for non-stationarity under decentralised training. Results are therefore highly sensitive to the accuracy of the opponent model. This is demonstrated by the instability and poor performance of the model with a naive (untrained) opponent model. This model collapses back to a single-agent approach which ignores the opponent’s presence.
We find that the more accurate an opponent model, the greater the improvement in agent performance. This is particularly pronounced in the decentralised setting where the increased opponent model accuracy and stability provided by modelling the opponent’s learning process is essential to attain similar performance to centralised MADDPG.
Directions for Future Work
This work provides initial evidence for the efficacy of modelling opponent learning as a solution to the issue of non-stationarity in multi-agent systems. Furthermore, we have shown that such modelling improves agent performance over the strong MADDPG baseline in the centralised setting. When our approach is applied to decentralised training it achieves comparable performance to the popular centralised MADDPG algorithm.
Despite these promising results there is significant work to be done to extend and enhance the framework we develop for handling non-stationarity through opponent modelling. We hope to pursue a formal Bayesian approach to opponent learning process modelling in future. We hope such an approach will enable a theoretical framework to emerge which can be validated through further experiments.
Appendix A Methodology Details & Model Architecture
(Footnote 1: Code for the implementation of LeMOL and baselines is available at https://github.com/ianRDavies/LeMOL)
Our proposed architecture utilises the structure of the opponent’s learning process to develop a continuously adapting opponent model. We build a model which, by using an opponent representation updated in line with the opponent’s learning, achieves truly continuous adaptation.
We focus on the competitive setting where agents’ objectives are opposed and information is kept private. We therefore build an opponent model architecture capable of learning to account for the influence of the opponent’s learning on its behaviour from the observed history of the game alone.
A.1 Opponent Modelling
We pursue explicit opponent modelling, where the opponent model is used to predict the action of the opponent at each time step. The opponent’s policy is modelled as a distribution over actions.
We handle the non-stationarity of the opponent’s policy and the in-episode partial observability separately in what we term Episode Processing (EP). This approach makes modelling of the opponent’s learning process explicit and conditions the prediction of the opponent’s action on the outcome of the model of the opponent’s learning process.
We assume that opponents update periodically, once per episode. In Figure 2(a) each episode is highlighted in a different colour. Within each episode, agents’ policies are not updated and therefore the opponent’s policy is stationary, such that opponent actions vary only due to the current state of play and policy stochasticity. The learning of the opponent then has an impact only in inter-episode variation. In order to capture this in our model of opponent learning, we first form a representation of the events of each episode, which we refer to as an episode summary. These episode summaries then form a time series which traces the opponent’s learning trajectory. This time series is the input to an LSTM which updates its state to track the opponent’s evolution through learning.
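The two timescales described above can be illustrated with a small sketch: an inner bidirectional pass summarises each episode, and an outer recurrent state is updated once per episode from that summary. This is only a sketch under simplifying assumptions. Plain tanh recurrent cells stand in for the LSTMs, the weights are random and untrained, and all names and dimensions (`summarise_episode`, `track_opponent`, `EVENT_DIM` and so on) are hypothetical rather than taken from the released code.

```python
import math
import random

def make_matrix(rows, cols, rng):
    """Small random weight matrix (untrained, for illustration only)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def rnn_step(W_in, W_rec, x, h):
    """One tanh recurrent step, a simplified stand-in for an LSTM cell."""
    return [
        math.tanh(sum(w * xi for w, xi in zip(W_in[k], x))
                  + sum(w * hi for w, hi in zip(W_rec[k], h)))
        for k in range(len(h))
    ]

def summarise_episode(events, fwd, bwd, dim):
    """Run forward and backward passes over one episode and concatenate the final states."""
    h_f = [0.0] * dim
    for event in events:
        h_f = rnn_step(fwd[0], fwd[1], event, h_f)
    h_b = [0.0] * dim
    for event in reversed(events):
        h_b = rnn_step(bwd[0], bwd[1], event, h_b)
    return h_f + h_b

def track_opponent(trajectory, fwd, bwd, core, summary_dim, state_dim):
    """Update the opponent representation once per episode from the episode summary."""
    h = [0.0] * state_dim
    states = []
    for events in trajectory:
        summary = summarise_episode(events, fwd, bwd, summary_dim)
        h = rnn_step(core[0], core[1], summary, h)
        states.append(h)
    return states

rng = random.Random(0)
EVENT_DIM, SUMMARY_DIM, STATE_DIM = 6, 4, 3  # toy sizes, not the paper's
fwd = (make_matrix(SUMMARY_DIM, EVENT_DIM, rng), make_matrix(SUMMARY_DIM, SUMMARY_DIM, rng))
bwd = (make_matrix(SUMMARY_DIM, EVENT_DIM, rng), make_matrix(SUMMARY_DIM, SUMMARY_DIM, rng))
core = (make_matrix(STATE_DIM, 2 * SUMMARY_DIM, rng), make_matrix(STATE_DIM, STATE_DIM, rng))

# A learning trajectory of 3 episodes, each with 5 events.
trajectory = [[[rng.random() for _ in range(EVENT_DIM)] for _ in range(5)] for _ in range(3)]
opponent_states = track_opponent(trajectory, fwd, bwd, core, SUMMARY_DIM, STATE_DIM)
```

Each entry of `opponent_states` is the opponent representation after one more episode has been observed, so the list as a whole traces the opponent's learning trajectory.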
During training, a LeMOL agent plays a full set of episodes, which we term a (learning) trajectory, such that both the LeMOL agent and its opponent train their policies and critics to convergence. After data from several trajectories have been collected, LeMOL’s opponent model is trained (via backpropagation) to model opponent learning and predict opponent actions using the previously experienced trajectories as training data.
We propose the use of a recurrent neural network (RNN) to model an opponent’s learning process. By learning an update rule for the state of the RNN, our agents are able to process and preserve experience across a full opponent learning trajectory. Specifically, we choose a Long Short-Term Memory (LSTM) architecture [6] for our RNN. This approach is inspired by the RL² algorithm [1], which uses an LSTM to track and update a policy for a reinforcement learning agent. For our opponent model, the hidden activations of the LSTM form a representation of the opponent, and the learned state update rule of the LSTM aims to emulate the opponent’s learning rule.
The episode summaries used for modelling the opponent’s learning process are produced by a bidirectional LSTM network [9, 5]. The input to the bidirectional LSTM is the full set of events of the episode in question. We define an event at a given time step of a given episode to be the concatenation of the agent’s observation, action and reward with the observed opponent action and the shared termination signal. Sets of consecutive episodes or time steps are denoted by index ranges which are inclusive of both end points.
The process for modelling the opponent’s learning process through episode summaries can be considered as two subprocesses: i) summarise the episode within which policies are fixed using a bidirectional LSTM (Equation 1), ii) use the episode summary to update the hidden state of the opponent model’s core LSTM which tracks opponent learning (Equation 2). These processes are depicted in Figure 2.
Here the input to the bidirectional LSTM is the sequence of events that make up the episode.
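The two subprocesses referred to as Equations 1 and 2 can be sketched as follows. The notation here is an assumption introduced for illustration and may differ from the authors' own: $e^{i}_{1:T}$ stands for the sequence of events of episode $i$, $s_i$ for the episode summary and $h^{\mathrm{opp}}_i$ for the opponent representation held in the core LSTM after episode $i$.

```latex
\begin{align*}
s_i &= \mathrm{BiLSTM}\left(e^{i}_{1:T}\right), \\
h^{\mathrm{opp}}_{i} &= \mathrm{LSTM}\left(s_i,\; h^{\mathrm{opp}}_{i-1}\right).
\end{align*}
```

The first line operates within an episode, where policies are fixed. The second operates across episodes, where the opponent's learning takes place.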
To handle partial observability from the environment, we introduce an in-episode LSTM which processes the incoming stream of observations during play to maintain an in-episode observation history summary.
The in-episode LSTM is reset at the start of each episode such that the memory of the action prediction function is limited to the current episode. The operation of the in-episode LSTM is summarised by Equations 3 and 4 and is depicted by Figure 2(b).
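The reset behaviour can be illustrated with a short sketch. It assumes a discrete action space and again uses a tanh recurrent cell in place of the LSTM. The class `InEpisodePredictor`, its weight layout and the toy numbers are hypothetical and are not taken from the released code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

class InEpisodePredictor:
    """Hypothetical opponent action predictor with a per-episode recurrent memory."""

    def __init__(self, W_obs, W_rec, W_out):
        self.W_obs, self.W_rec, self.W_out = W_obs, W_rec, W_out
        self.h = [0.0] * len(W_rec)

    def reset(self):
        """Clear the in-episode memory at the start of each episode."""
        self.h = [0.0] * len(self.W_rec)

    def step(self, observation, opponent_repr):
        """Update the in-episode summary, then predict a distribution over opponent actions."""
        pre = [a + b for a, b in zip(matvec(self.W_obs, observation),
                                     matvec(self.W_rec, self.h))]
        self.h = [math.tanh(p) for p in pre]
        # The prediction is conditioned on the opponent representation and the in-episode summary.
        logits = matvec(self.W_out, opponent_repr + self.h)
        return softmax(logits)

# Toy sizes: 2-d observation, 2-d memory, 2-d opponent representation, 3 actions.
predictor = InEpisodePredictor(
    W_obs=[[0.5, -0.2], [0.1, 0.3]],
    W_rec=[[0.2, 0.0], [0.0, 0.2]],
    W_out=[[0.3, -0.1, 0.2, 0.0], [0.0, 0.4, -0.3, 0.1], [-0.2, 0.1, 0.1, 0.5]],
)
opponent_repr = [0.7, -0.4]
first = predictor.step([1.0, 0.0], opponent_repr)
second = predictor.step([1.0, 0.0], opponent_repr)  # memory now carries in-episode history
predictor.reset()                                   # a new episode begins
after_reset = predictor.step([1.0, 0.0], opponent_repr)
```

After the reset, the same observation yields the same prediction as at the start of the previous episode, while the opponent representation is passed in unchanged from outside the episode.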
We train the opponent model using the chosen action target (CAT) formed from the opponent’s actions. This choice of target yields the loss function in Equation 5, which is amenable to decentralised opponent model training under the assumption that the opponent’s actions are observed as they are executed. This objective is one of maximum likelihood.
In practice, the integral required to calculate the expectation in Equation 5 is intractable due to the possibly high dimension of the observation space. We therefore use Monte Carlo sampling to replace the integral in the expectation leading to the tractable loss function in Equation 6.
This requires sampling a minibatch of full learning trajectories which are processed in sequence so that the LSTM can effectively maintain its hidden state.
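As an illustration of the sampled maximum likelihood objective, the sketch below averages the negative log-likelihood of the observed opponent actions over a batch of samples. It assumes a discrete action space with actions given as indices. The function name `monte_carlo_nll` and the clamping constant `eps` are hypothetical choices, not the authors' implementation.

```python
import math

def monte_carlo_nll(predicted_distributions, observed_actions, eps=1e-12):
    """Average negative log-likelihood of observed opponent actions over the samples."""
    total = 0.0
    for probs, action in zip(predicted_distributions, observed_actions):
        # Clamp to avoid log(0) when the model assigns zero probability.
        total -= math.log(max(probs[action], eps))
    return total / len(observed_actions)

# Two sampled time steps with two possible opponent actions each.
predictions = [[0.5, 0.5], [0.25, 0.75]]
actions = [0, 1]
loss = monte_carlo_nll(predictions, actions)
```

Minimising this quantity with respect to the opponent model's parameters is equivalent to maximising the likelihood of the actions the opponent actually chose.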
Our opponent model is part of an actor-critic approach to training. We incorporate our novel opponent model into the MADDPG actor-critic framework proposed by Lowe et al. [lowe2017maddpg].
A.2 The Critic
We use a centralised Q function that takes inputs from all agents to stabilise training and improve performance. (Footnote 3: In this work we use a shorthand to denote subsequent states and actions in order to tidy up notation.)
Our Q function is parameterised as a modestly sized neural network which outputs a single value given the observations of all agents as well as their actions. denotes the value of a state.
We train the Q function to minimise the temporal difference error defined in Equation 7.
The temporal difference target uses the reward received by the agent together with the observations and actions of both agents at the next time step. The next-step value is given by a target network, evaluated at actions which the target policy networks output given the next observations.
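For reference, a standard form of the centralised temporal difference loss used in MADDPG-style critics is sketched below. The notation is an assumption made for illustration: $x$ and $x'$ are the joint observations at the current and next time step, $a_1, a_2$ the agents' actions, $r_1$ the reward of the agent being trained, $\gamma$ a discount factor, $Q'$ the target network, $\mu'_j$ the target policy of agent $j$ and $\mathcal{D}$ the replay buffer. It should not be read as the exact form of Equation 7.

```latex
\begin{align*}
\mathcal{L}(\theta) &= \mathbb{E}_{(x,\, a_1,\, a_2,\, r_1,\, x') \sim \mathcal{D}}
  \left[ \left( Q_\theta(x, a_1, a_2) - y \right)^2 \right], \\
y &= r_1 + \gamma\, Q'\left(x', a'_1, a'_2\right), \qquad a'_j = \mu'_j\left(o'_j\right).
\end{align*}
```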
The target networks are independently initialised from the main networks of the model and are slowly updated by Polyak averaging [8]. This approach stabilises training by avoiding the optimism induced by bootstrapping from a Q function which is itself the subject of training.
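A minimal sketch of a Polyak averaging step is given below, with parameters represented as flat lists of floats. The function name `polyak_update` and the mixing rate `tau` are illustrative assumptions, since the update rate is not stated in the text.

```python
def polyak_update(target_params, main_params, tau=0.01):
    """Move each target parameter a fraction tau of the way towards the main parameter."""
    return [(1.0 - tau) * t + tau * m for t, m in zip(target_params, main_params)]

# Toy example with a large tau so that the drift is visible after a few steps.
target = [0.0, 10.0]
main = [1.0, 0.0]
for _ in range(3):
    target = polyak_update(target, main, tau=0.5)
```

With a small `tau` the target network changes slowly, so the bootstrapped value in the critic's target lags behind the Q function being trained.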
A.3 Policy Training
As is conventional in actor-critic settings, we use a policy gradient approach to learn an effective policy. We use the Q function to form an objective which the policy minimises by following the gradient with respect to the policy parameters. The objective in the case of LeMOL is given in Equation 8, where the expectation is taken over the actions from the policy being trained, with the opponent’s actions and the observations being drawn jointly from a replay buffer which stores the observed history of the game.
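A generic objective of this kind is sketched below as an illustration. The notation is assumed rather than taken from the source: $\theta$ denotes the policy parameters, $\pi_\theta$ the policy, $x$ the observations, $a^{\mathrm{opp}}$ the opponent's action and $\mathcal{D}$ the replay buffer. The sketch leaves out how the LeMOL policy is conditioned on the opponent model's prediction, so it should not be read as the exact form of Equation 8.

```latex
\begin{equation*}
J(\theta) = -\,\mathbb{E}_{(x,\, a^{\mathrm{opp}}) \sim \mathcal{D},\; a \sim \pi_\theta}
  \left[\, Q\left(x, a, a^{\mathrm{opp}}\right) \right]
\end{equation*}
```

The negative sign reflects that the objective is minimised, which corresponds to maximising the critic's estimate of the value of the policy's actions.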
A.4 Decentralised Training Through Action Prediction
In the decentralised setting, we are no longer able to utilise the opponent’s observation as an input into the Q function. The Q function for decentralised training is therefore a function of the actions of both agents and the observation of the LeMOL agent only.
During training of the Q function, previous observations, actions, opponent actions and rewards are sampled from the replay buffer. The Q function target (Equation 11) is then calculated using the sampled reward and an evaluation of the target Q function.
Note that the opponent model is passed the latest representation of the opponent as an input. Decentralisation means that it is no longer possible to query the opponent’s target policy network. Therefore we must use the opponent model to generate an opponent action to input. Utilising this predicted action for the opponent in the target Q function evaluation enables the Q function target to reflect the opponent’s play as it would be given its present state of learning. Our opponent model makes this possible without relying on direct access to the opponent’s (target) policy.
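The role of the predicted opponent action in the target can be sketched as follows. The function `decentralised_q_target`, the discount factor `gamma`, the handling of terminal steps and the toy callables are all assumptions made for illustration and are not taken from the released code.

```python
def decentralised_q_target(reward, next_obs, opponent_repr, target_q,
                           target_policy, opponent_model, gamma=0.95, done=False):
    """Bootstrap the Q target using a predicted, rather than queried, opponent action."""
    # Our own next action comes from our target policy network.
    next_action = target_policy(next_obs)
    # The opponent's next action comes from the opponent model, which is given
    # the latest representation of the opponent.
    predicted_opp_action = opponent_model(next_obs, opponent_repr)
    if done:
        return reward
    return reward + gamma * target_q(next_obs, next_action, predicted_opp_action)

# Toy stand-ins for the networks (hypothetical, scalar actions for simplicity).
target_q = lambda obs, act, opp_act: act + 2.0 * opp_act
target_policy = lambda obs: 1.0
opponent_model = lambda obs, opp_repr: 0.5

y = decentralised_q_target(1.0, [0.0], [0.0], target_q, target_policy,
                           opponent_model, gamma=0.5)
y_terminal = decentralised_q_target(1.0, [0.0], [0.0], target_q, target_policy,
                                    opponent_model, gamma=0.5, done=True)
```

Only the agent's own observation is passed to the target Q function, and nothing in the computation requires access to the opponent's policy or observation.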
In the decentralised setting, policy training uses the decentralised Q function as a critic. We predict the opponent’s action for the state as observed by our agent and use this in place of the historical opponent action sampled from the replay buffer. This means that the policy is trained using the Q function evaluated at the action profile of the agents at their present state of learning. In this way, an accurate opponent model enables us to train our agent using the game history reevaluated for the updated opponent.
Appendix B Related Work
Previous works seeking to handle the non-stationarity of opponents in multi-agent reinforcement learning have sought to relearn policies once the opponent is perceived to have changed [10, 2]. Such approaches rely on the opponent playing a stationary policy while they are being modelled so that an effective model for play and change detection can be trained. This approach is not suited to a continually learning opponent as present in our work. We utilise the structure of an opponent’s learning process to learn a series of opponent representations which can be used to inform opponent action prediction. We therefore avoid having to relearn a policy as we learn a single policy which handles non-stationarity by taking in a prediction of the opponent’s action.
Al-Shedivat et al. [al2017continuous] take an alternative meta-learning inspired approach to effectively learn many models at once. Their agents adapt to differing opponents which are stationary within a given period. Their approach is an application of model-agnostic meta-learning [3] such that a base set of parameters is learned and adapted anew for different stages of an opponent’s training. Their agents therefore require an additional set of optimisation steps for each new period of play. Their approach also does not take advantage of the structured and sequential nature of an opponent’s learning.
Hong et al. [hong2018drpiqn] present two algorithms: Deep Policy Inference Q-Network (DPIQN) and a recurrent version (DRPIQN). Their approach augments Q-learning with an auxiliary implicit opponent modelling objective. However, they do not explicitly consider the non-stationarity of the opponent. Their experiments have deterministic opponents which switch between different policies during play. They also consider the case of a learning teammate but do not consider the learning of an adversarial opponent. In future work, we would hope to include DRPIQN as a comparison benchmark to see how it performs in the competitive setting with a learning opponent. Such experiments would echo our MADDPG-OM and decentralised LeMOL-EP experiments, which contrast opponent modelling for Q-learning alone (as in D(R)PIQN and MADDPG-OM) with opponent modelling for both policy and Q-learning (as for LeMOL-EP). Comparison to D(R)PIQN would enable us to consider whether predicting opponent actions is the best way to use an opponent model. The performance of agents with implicit opponent models could benefit from capturing opponent characteristics beyond their current action selection.
Both model switching and meta-learning works have been restricted to considering opponents which change in distinct discrete steps, between which the opponent is stationary. By doing so, previous works have been able to develop theoretical grounding for their models and achieve experimental success. We appeal to the structure of an opponent’s learning process and draw on the meta-learning literature to develop truly continuous adaptation. We believe that this is a novel and challenging line of work which warrants more attention.
One related work which considers the structure of an opponent’s learning architecture is Learning with Opponent Learning Awareness (LOLA) [4]. LOLA agents account for the impact of their own policy updates on their opponent’s policy by differentiating through their opponent’s policy updates. The success of LOLA agents is dependent on self-play of two homogeneous agents so that the impact of one agent’s policy on the learning step of the other agent(s) can be calculated or estimated. Our approach is agnostic to the architecture and learning methodology of the opponent.
Appendix C Experiments
The features of the models used in our experiments are laid out in Table 1. Each model is run for 15 full trajectories against an MADDPG opponent. The hyperparameters used to train our models are kept fixed and laid out in Tables 2 and 3. Note that the relevant parameters are the same for MADDPG when used as a defender as well as an attacker (the opponent).
|LeMOL-EP Oracle OM||–||–|
Table 2: Hyperparameters for policy and Q network training.
|Hyperparameter||Value|
|Number of Episodes||61024|
|Episodes of Exploration||1024|
|Batch Size for Q Network and Policy Training||1024|
|Policy Network Size||(64, 64, 5)|
|Policy and Q Network Optimiser||ADAM [7]|
|Policy and Q Network Learning Rate||0.01|
|Policy and Q Network ADAM Parameters|
|Q Network Size||(64, 64, 1)|
|Policy Update Frequency (Time Steps)||25|
|Replay Buffer Capacity||1,000,000|

Table 3: Hyperparameters for opponent model training.
|Hyperparameter||Value|
|Opponent Model LSTM State Dimension||64|
|Chunk Length for Opponent Model Training (Time Steps)||500|
|Opponent Model Training Batch Size||8|
|Opponent Model Training Epochs|
|Episode Embedding Dimension||128|
|In-Episode LSTM State Dimension||32|
|Opponent Model Optimiser||ADAM [7]|
|Opponent Model Learning Rate||0.001|
|Opponent Model ADAM Parameters|
References
[1] Y. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, and P. Abbeel (2016) RL²: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779.
[2] R. Everett and S. Roberts (2018) Learning against non-stationary agents with opponent modelling and deep reinforcement learning. In 2018 AAAI Spring Symposium Series.
[3] C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126–1135.
[4] J. Foerster, R. Y. Chen, M. Al-Shedivat, S. Whiteson, P. Abbeel, and I. Mordatch (2018) Learning with opponent-learning awareness. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp. 122–130.
[5] A. Graves and J. Schmidhuber (2005) Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks 18 (5-6), pp. 602–610.
[6] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
[7] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[8] B. T. Polyak (1990) New stochastic approximation type procedures. Automat. i Telemekh 7, pp. 98–107.
[9] M. Schuster and K. K. Paliwal (1997) Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45 (11), pp. 2673–2681.
[10] Y. Zheng, Z. Meng, J. Hao, Z. Zhang, T. Yang, and C. Fan (2018) A deep Bayesian policy reuse approach against non-stationary agents. In Advances in Neural Information Processing Systems, pp. 954–964.