Introduction
A ubiquitous feature of social interaction is a need for individuals to coordinate (Lewis, 1969; Shoham and Tennenholtz, 1997; Bicchieri, 2005; Stone et al., 2010; Barrett et al., 2017). A common solution to the coordination problem is the establishment of social conventions, which govern daily tasks such as choosing which side of the road to drive on, who should get the right of way when walking, what counts as polite, what language to speak, or how a team should apportion tasks. If we seek to construct artificial agents that can coordinate with humans, they must be able to act according to existing conventions.
In game theory, Nash equilibria are strategies for all players such that if everyone behaves according to them no individual can improve their payoff by deviating. In game theoretic models a convention is one of multiple possible equilibria in a coordination game (Lewis, 1969). Stated in these terms our agent’s task is to construct a policy that does well when paired with the equilibrium being played by existing agents.
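To make the equilibrium notion concrete, here is a minimal sketch (our own toy example, not from the paper): a two-player "driving side" coordination game in which both matched profiles are pure Nash equilibria, so the game supports two distinct conventions.

```python
# Hypothetical 2x2 driving-side coordination game (payoffs are ours,
# chosen for illustration): matching actions pays, mismatching costs.

# payoff[(a1, a2)] -> (reward to player 1, reward to player 2)
PAYOFF = {
    ("left", "left"): (1, 1),
    ("left", "right"): (-1, -1),
    ("right", "left"): (-1, -1),
    ("right", "right"): (1, 1),
}
ACTIONS = ("left", "right")

def is_pure_nash(a1, a2):
    """A pure profile is a Nash equilibrium if neither player can
    improve their own payoff by unilaterally switching actions."""
    r1, r2 = PAYOFF[(a1, a2)]
    p1_ok = all(PAYOFF[(d, a2)][0] <= r1 for d in ACTIONS)
    p2_ok = all(PAYOFF[(a1, d)][1] <= r2 for d in ACTIONS)
    return p1_ok and p2_ok

equilibria = [(a1, a2) for a1 in ACTIONS for a2 in ACTIONS
              if is_pure_nash(a1, a2)]
print(equilibria)  # both "drive left" and "drive right" are equilibria
```

Either convention is self-enforcing on its own; the coordination problem is that a newcomer must discover which one the existing group plays.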
There has been great recent progress in constructing policies that can do well in both single and multiagent environments using deep reinforcement learning (Mnih et al., 2015; Silver et al., 2017). Deep RL methods typically require orders of magnitude more experience in an environment to learn good policies than humans (Lake et al., 2017), so agents are typically trained in simulation before being deployed onto the real task.
In zero-sum, two-player environments (e.g. Go), it is the policy of the other player that is simulated during training. Typically, policies for both players are trained simultaneously or iteratively, in a process called self-play. Self-play produces successful policies because, if self-play converges, it converges to an equilibrium of the game (Fudenberg and Levine, 1998), and in two-player, zero-sum games all equilibria are minimax/maximin strategies (von Neumann, 1928). Thus, a fully converged strategy is guaranteed to be unexploitable on the task of interest (e.g. playing Go with a human champion).
Constructing agents that can cooperate and coordinate with each other to achieve goals (e.g. work together as a team to finish a task) has been a long-running topic in multiagent reinforcement learning (MARL) research (Walker and Wooldridge, 1995; Stone and Sutton, 2001; Kapetanakis and Kudenko, 2002b; Lowe et al., 2017). However, this literature typically assumes that the cooperating agents will continue to interact with those with whom they have been co-trained (this is sometimes referred to as "centralized training with distributed execution"). In this case, if MARL converges, it finds an equilibrium, and since agents will play with the same partners they trained with, they will achieve these equilibrium payoffs.
Unfortunately, agents are no longer guaranteed equilibrium payoffs if there are multiple equilibria and agents must coordinate with others they were not trained with (in other words, when we remove the centralized training assumption). For example, training self-driving cars in a virtual environment may produce agents that avoid crashing into each other during training but drive on the wrong side of the road relative to the society they will enter.
In this paper we propose to give the agent access to a small amount of observations of existing social behavior, i.e. samples of (state, action) pairs from the test-time environment. We focus on how such data, though it is not enough to purely clone good policies, can be used in the training process to learn a best response to the policies of future partners. Our key assumption is that the environment the agent will enter already has stable social conventions; that is, future partners will be playing equilibrium strategies. In simple environments, agents could simply enumerate all possible equilibria and choose the one most consistent with the data. However, in more complex environments this becomes intractable. We propose to guide self-play toward the correct equilibrium by training with a joint MARL and behavioral cloning objective. We call this method observationally augmented self-play (OSP).
We consider OSP in several multiagent situations with multiple conventions: a multiagent traffic game (Sukhbaatar et al., 2016; Resnick et al., 2018), a particle environment combining navigation and communication (Mordatch and Abbeel, 2017; Lowe et al., 2017), and a Stag Hunt game where agents must take risks to accomplish a joint goal (Yoshida et al., 2008; Peysakhovich and Lerer, 2017b). In each of these games we find that self-play can converge to multiple, incompatible conventions. We find that OSP is able to learn correct conventions in these games with a small amount of observational data. Our results in the Markov Stag Hunt show that OSP can learn conventions it observes even when those conventions are very unlikely to be learned using MARL alone.
We do not claim that OSP is the ultimate approach to constructing agents that can learn social conventions. The success of OSP depends on both the game and the type of objective employed; thus, an exploration of alternative algorithms is an important direction for future work. Our key result is that the combination of (a) a small number of samples from trajectories of a multiagent game, and (b) knowledge that test-time agents are playing some equilibrium, gives much stronger test-time performance than either component alone.
Related Work
OSP is related to existing work on reward shaping in MARL (Kapetanakis and Kudenko, 2002b, a; Babes et al., 2008; Devlin and Kudenko, 2011). However, the domain of interest differs slightly: reward shaping is typically used to cause all agents in a group to converge to a high-payoff equilibrium, whereas we are interested in using shaping to guide training to select the correct test-time equilibrium.
Recent work has pointed out that strategies learned via a single instance of independent MARL can overfit to other agents’ policies during training (Lanctot et al., 2017). This work differs from ours in that it suggests the training of a single best response to a mixture of heterogeneous policies. This increases the robustness of agent policies but does not solve the problem of multiple, incompatible conventions that we study here.
The approach of combining supervised learning from trajectories with RL has been studied in the single-agent case (Hester et al., 2017). In that work the trajectories are expert demonstrations and are used to guide RL to an optimal policy; supervision is used to speed up learning. In our work, by contrast, the trajectories are used to select among many possible optima (equilibria) which may be equivalent at training time but not at test time. However, this literature explores many methods for combining imitation learning and RL, and some of these techniques may be interesting to consider in the multiagent setting.

Conventions in Markov Games
A partially observed Markov game (Shapley, 1953; Littman, 1994) consists of a set of players $\{1, \dots, N\}$, a set of states $S$, a set of actions $A_i$ for every player with the global set $A = A_1 \times \dots \times A_N$, a transition function $\tau : S \times A \to \Delta(S)$, and a reward function $r_i(s, a)$ for each player that takes as input the current state and actions. Players have observation functions $O_i(s)$ and can condition their behavior on these observations. Markov policies for each player are functions $\pi_i$ mapping observations to (distributions over) actions. Let $\Pi_i$ denote the set of all policies available to a player and $\Pi = \Pi_1 \times \dots \times \Pi_N$ be the set of joint policies.^1

^1 We consider only Markov policies in this work so that we can work with individual state-action pairs, although the same approach could be applied across observed trajectories to learn non-Markov policies (i.e. policies conditioned on their full history).
We use the standard notation $\pi_{-i}$ to denote the policy vector for all players other than $i$. A set of policies and a (possibly random) initial state defines a (random) trajectory of rewards for each player. We let the value function $V_i(s, \pi_i, \pi_{-i})$ denote the discounted expectation of this trajectory. The best response starting at state $s$ for player $i$ to $\pi_{-i}$ is $BR_i(s, \pi_{-i}) = \arg\max_{\pi_i} V_i(s, \pi_i, \pi_{-i})$. We let $s_0$ be the (possibly random) initial state of the game.

There are many ways to extend the notion of a Nash equilibrium to the case of stochastic games. We will consider the Markov perfect best response. We denote by $BR_i(\pi_{-i})$ the policy (or policies) which is a best response starting at any state, and consider equilibria to be policies $\pi = (\pi_1, \dots, \pi_N)$ such that each $\pi_i \in BR_i(\pi_{-i})$.^2

^2 There are weaker notions, for example, requiring that policies are best responses only at states reached during play. It is known that Markov perfect equilibria are harder to find by learning (Fudenberg and Levine, 1998), and it is interesting to consider whether different kinds of choices (e.g. on-policy vs. off-policy learning) can yield stronger or weaker convergence guarantees. However, these questions are outside the scope of this paper.
We consider games with multiple, incompatible conventions. Formally, we say that conventions (equilibrium policy sets) $\pi$ and $\pi'$ are incompatible if, for some player $i$, the compound policy $(\pi_i, \pi'_{-i})$ is not an equilibrium.
The goal of training is to compute a policy $\pi_i$ which is a best response to the existing convention $\pi$. During training, the agent has access to the environment but receives only a limited set of observations of $\pi$ in the form of a set $D$ of state-action pairs sampled from $\pi$. This is summarized in Figure 1.
We denote a generic element of $D$ by $(s, a_i)$, which is a (state, action) pair for agent $i$. Let $D_i$ denote the subset of $D$ which includes actions for agent $i$.
The dataset may be insufficient to identify a best response to all possible policies consistent with $D$. However, the set of equilibrium policy sets is typically much smaller than the set of all possible policy sets. Therefore, if we assume that all agents are minimizing regret, then we need only consider equilibrium policy sets consistent with $D$.
Given a game and dataset $D$, a brute force approach to learning a policy compatible with the conventions of the group the agent will enter would be to compute the equilibrium of the game that maximizes the likelihood of $D$. Formally, this is given by

$$\hat{\pi} = \arg\max_{\pi \in \mathcal{E}} \prod_{(s, a_i) \in D} \pi_i(a_i \mid s),$$

where $\mathcal{E}$ denotes the set of equilibria of the game.
This constrained optimization problem quickly becomes intractable, so we will instead try to find an equilibrium using multiagent learning, using $D$ to increase the probability that learning converges to the convention the data was drawn from.
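To make the brute-force idea concrete, here is a toy sketch (entirely our own illustration: the game, payoffs, and epsilon-noise likelihood model are made up). It enumerates the pure equilibria of a small coordination game and selects the one that maximizes the likelihood of a handful of observed actions.

```python
# Toy equilibrium selection by likelihood (our illustration, not the
# paper's implementation).
import math

ACTIONS = ("left", "right")

def payoff(a1, a2):
    # Symmetric coordination payoffs: match -> 1 each, mismatch -> -1.
    return (1, 1) if a1 == a2 else (-1, -1)

def is_equilibrium(a1, a2):
    r1, r2 = payoff(a1, a2)
    return (all(payoff(d, a2)[0] <= r1 for d in ACTIONS) and
            all(payoff(a1, d)[1] <= r2 for d in ACTIONS))

equilibria = [(a1, a2) for a1 in ACTIONS for a2 in ACTIONS
              if is_equilibrium(a1, a2)]

# Observed actions of existing agents (mostly "right"). We treat each
# equilibrium as a deterministic convention followed with probability
# 1 - EPS, so likelihoods are finite under noisy observations.
observations = ["right", "right", "left", "right"]
EPS = 0.05

def log_likelihood(eq, obs):
    convention = eq[0]  # symmetric game: both players use the same action
    return sum(math.log(1 - EPS if a == convention else EPS) for a in obs)

best = max(equilibria, key=lambda eq: log_likelihood(eq, observations))
print(best)  # the data selects the "drive right" equilibrium
```

Even this tiny example shows why the approach does not scale: the enumeration step is exponential in the number of states and players, which motivates the learning-based alternative below.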
Observationally Initialized Best Response Dynamics
To build intuition about how data can be used during training, we first study a learning rule for which analytic results are more easily obtained. We begin with the simplest multiagent learning algorithm: best response dynamics in an $N$-player Markov game whose equilibria are in pure policies and are incompatible.
In best response dynamics, each player $i$ begins with a policy initialized at $\pi_i^0$. Players alternate updating their policy with the best response $\pi_i^{t+1} = BR_i(\pi_{-i}^t)$. When there are multiple best responses, we assume there is a (non-randomized) tie-breaking rule used to select one. Given an equilibrium $\pi^*$, we denote by $B(\pi^*)$ the basin of attraction of $\pi^*$ (the set of initializations from which BR dynamics converge to $\pi^*$).
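These dynamics can be sketched in a few lines (a toy of our own construction: a three-agent "match the others" game with a deterministic tie-break toward action "A"), including a brute-force measurement of a basin of attraction over all pure initializations.

```python
# Alternating best-response dynamics in a toy 3-agent matching game
# (our illustration). Reward for each agent = number of other agents
# whose action matches theirs.
import itertools

ACTIONS = ("A", "B")

def best_response(profile, i):
    """Best action for agent i: match as many others as possible;
    ties broken deterministically in favor of 'A'."""
    others = [a for j, a in enumerate(profile) if j != i]
    return max(ACTIONS, key=lambda a: (others.count(a), a == "A"))

def run_br(profile):
    """Alternate single-agent updates until a fixed point (10 sweeps
    is far more than enough for this tiny game)."""
    profile = list(profile)
    for _ in range(10):
        for i in range(len(profile)):
            profile[i] = best_response(profile, i)
    return tuple(profile)

# Basin of attraction of the all-"A" equilibrium over pure starts.
basin_A = [p for p in itertools.product(ACTIONS, repeat=3)
           if run_br(p) == ("A", "A", "A")]
print(len(basin_A))  # 6 of the 8 pure starts (tie-breaks favor "A")
```

Note that both all-"A" and all-"B" are fixed points of these dynamics, but the tie-breaking rule makes their basins unequal.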
A naive way to use the observed data is to force the policies to be consistent with $D$ at each step of the best response dynamic, changing them at each state where they differ. However, this can introduce new equilibria to the game.^3

^3 For a simple example, consider a game where agents choose an action $A$ or $B$ and receive reward equal to the number of other agents whose actions match theirs. There are equilibria where all agents choose $A$ or all agents choose $B$. If we restrict one agent to always choose $A$, then for sufficiently many agents we introduce a new equilibrium in which the restricted agent chooses $A$ and all other agents choose $B$.
In the context of reward shaping, it is well known that the way to avoid the introduction of new equilibria is to use potential-based reward shaping (Devlin and Kudenko, 2011) or, equivalently, to use our information only to change the initialization of learning (Wiewiora, 2003). We follow this advice and study observationally initialized best response dynamics. We begin with a policy $\pi^0$ chosen at random. However, for every player $i$ and state-action pair $(s, a_i)$ in the data, we form $\hat{\pi}_i^0$ by setting the action of $\pi_i^0$ at $s$ to $a_i$. We then perform best response dynamics from this new initialization.
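The initialization idea can be sketched in the same toy matching game as above (our own illustration, not the paper's code): observed actions overwrite the corresponding entries of the initial profile, and ordinary best-response dynamics then run unchanged, so no new equilibria are introduced.

```python
# Observationally initialized best-response dynamics in a toy 3-agent
# matching game (our illustration).
ACTIONS = ("A", "B")

def best_response(profile, i):
    others = [a for j, a in enumerate(profile) if j != i]
    return max(ACTIONS, key=lambda a: (others.count(a), a == "A"))

def run_br(profile):
    profile = list(profile)
    for _ in range(10):
        for i in range(len(profile)):
            profile[i] = best_response(profile, i)
    return tuple(profile)

def observationally_initialized_br(profile, data):
    """data maps agent index -> observed action. Only the
    initialization changes; the dynamics themselves are untouched."""
    profile = list(profile)
    for i, a in data.items():
        profile[i] = a
    return run_br(profile)

# Without data, the all-"B" start stays at the all-"B" convention;
# observing that agents 0 and 1 play "A" flips it into the "A" basin.
no_data = run_br(("B", "B", "B"))
with_data = observationally_initialized_br(("B", "B", "B"),
                                           {0: "A", 1: "A"})
print(no_data, with_data)
```

In this tiny game a single observation is not enough to escape the all-"B" basin, but two are, which mirrors the paper's point that initialization enlarges (rather than replaces) the target equilibrium's basin of attraction.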
We now discuss a class of games for which we can guarantee that observationally initialized best response dynamics have a larger basin of attraction for the equilibrium from which $D$ was drawn, relative to standard best response dynamics. This class generalizes a commonly used matrix game class: games with strategic complements (Bulow et al., 1985). For our purposes, strategic complements corresponds to assuming that if one's partners behave more consistently with some convention, then one is also more incentivized to behave according to that convention.^4 In existing work, strategic complements are defined with respect to a single action rather than a Markov policy. To generalize to Markov games we introduce a notion of distance between policies:

^4 In economic applications the notion of strategic complements is used in production games and roughly corresponds either to the idea of network effects (the more people use some product, the higher a consumer's utility is from using that product) or to a joint monotonicity condition (if firm X produces cars and firm Y produces car-washing materials, then when firm X produces more cars, firm Y sees higher demand for car-washing materials). See the Supplement for a more formal discussion.
Definition 1 (Policy Closeness).
Given a player $i$ and target policy $\pi_i^*$, we say that policy $\pi_i$ is weakly closer to $\pi_i^*$ than policy $\pi_i'$ if at every state $s$ either $\pi_i(s) = \pi_i'(s)$ or $\pi_i(s) = \pi_i^*(s)$. We denote this by $\pi_i \succeq_{\pi^*} \pi_i'$.
Policy closeness gives us a partial ordering on policies which we use to generalize the idea of strategic complements.
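As a concrete check of this definition, here is a minimal sketch with toy policies of our own construction (three states, actions "L"/"R"): a policy is weakly closer to the target than another if, wherever they disagree, the closer one already matches the target.

```python
# Definition 1 (policy closeness) as a predicate over tabular policies
# represented as dicts from state to action (toy states of our own).
def weakly_closer(pi, pi_prime, target):
    """True if pi is weakly closer to target than pi_prime: at every
    state, pi either agrees with pi_prime or agrees with target."""
    return all(pi[s] == pi_prime[s] or pi[s] == target[s]
               for s in target)

target   = {"s0": "L", "s1": "L", "s2": "L"}
pi       = {"s0": "L", "s1": "L", "s2": "R"}   # fixed s1 toward target
pi_prime = {"s0": "L", "s1": "R", "s2": "R"}

closer  = weakly_closer(pi, pi_prime, target)
farther = weakly_closer(pi_prime, pi, target)
print(closer, farther)  # True False: the ordering is only partial
```

Note the asymmetry: `pi_prime` is not weakly closer than `pi` because their one disagreement (state `s1`) is resolved toward the target only by `pi`.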
Definition 2 (Markov Strategic Complements).
A Markov game exhibits Markov strategic complements if for any equilibrium $\pi^*$ we have that $\pi_{-i} \succeq_{\pi^*} \pi_{-i}'$ implies $BR_i(\pi_{-i}) \succeq_{\pi^*} BR_i(\pi_{-i}')$.
Let $D$ be a dataset drawn from equilibrium $\pi^*$ by sampling states and their equilibrium actions. Let $B_D(\pi^*)$ be the basin of attraction of $\pi^*$ given observationally initialized best response dynamics.
Theorem 1.
If a game where best-response dynamics always converge exhibits Markov strategic complements, then for any $D$ drawn from an equilibrium $\pi^*$ we have $B(\pi^*) \subseteq B_D(\pi^*)$, and there exists $D'$ such that if $D \supseteq D'$ then $B(\pi^*) \subsetneq B_D(\pi^*)$.
We relegate the proof to the Appendix. Roughly, it has two steps: first, we show that if an initialization is in the basin of attraction of $\pi^*$, then any initialization closer to $\pi^*$ is also in the basin. Second, we show that there is an initialization that is not in the basin of attraction under standard best response dynamics but is in the basin under observationally initialized best response dynamics. Because initialization can increase the basin of attraction without introducing any new equilibria, the observed data can strictly improve the probability that we learn a policy consistent with the observed agents.
Experiments
Observationally Augmented Self-Play
We wish to use the insights from initialization in environments where function approximation (e.g. with deep neural networks) is required. However, if the policy is computed via function approximation, it is not clear how to 'initialize' its value at particular states. Specifically, the policy at the small number of states in $D$ can only be expected to generalize if the approximation captures the regularities of the game, which will only be true after some interaction with the environment. Therefore, we consider a procedure where consistency with $D$ is enforced during training and smoothly decays over time.

We consider training with stochastic gradient descent using a loss function which is a linear combination of the negative log likelihood of the data (a supervised objective) and the policy gradient estimator of the reward in the game (we denote the negative of this quantity by $L_i^{RL}$). Formally, each agent $i$ minimizes

$$L_i(\theta_i) = L_i^{RL}(\theta_i) + \lambda L_i^{SUP}(\theta_i), \quad \text{where} \quad L_i^{SUP}(\theta_i) = -\sum_{(s, a_i) \in D_i} \log \pi_i(a_i \mid s; \theta_i),$$

with respect to the parameters $\theta_i$ of its policy.^5

^5 Note this is different from reward shaping, as the probability that a state is reached does not affect the supervised loss of the policy.
We optimize the joint objective using MARL. When making an update to the parameters of each agent, we take gradient steps of the form

$$\theta_i \leftarrow \theta_i - \alpha \left( \nabla_{\theta_i} L_i^{RL} + \lambda \nabla_{\theta_i} L_i^{SUP} \right),$$

where $\nabla_{\theta_i} L_i^{RL}$ is our policy gradient and $\nabla_{\theta_i} L_i^{SUP}$ is the gradient of the supervised objective at $\theta_i$.^6

^6 As with the best response dynamics above, using a compound objective with a constant $\lambda$ can, in theory, introduce new equilibria during training. To be sure this does not occur, we can anneal the weight on the supervised loss over time with $\lambda_t \to 0$. In practice, however, using a fixed $\lambda$ in our environments appeared to create policies that were still consistent with test-time equilibria, suggesting that if new equilibria were introduced they did not have large basins of attraction for our policy-gradient based learning procedures.
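To illustrate the flavor of such a combined update, here is a toy numerical sketch (entirely our own construction: the game, weights, and function names are illustrative, and we use an exact expected gradient rather than a sampled policy gradient). Two self-play partners share a softmax policy over two payoff-equivalent conventions; a behavioral-cloning term toward the observed action "R" is added with weight lambda.

```python
# Toy combined RL + supervised update for a one-state, two-action
# coordination game (actions [L, R]; reward 1 on a match). Both
# self-play partners share the same softmax policy. Our illustration,
# not the paper's training code.
import math

def softmax(theta):
    m = max(theta)
    e = [math.exp(t - m) for t in theta]
    z = sum(e)
    return [x / z for x in e]

def train(lmbda, steps=2000, lr=0.1):
    theta = [0.1, 0.0]  # initialization slightly favors convention L
    for _ in range(steps):
        p = softmax(theta)
        # Expected self-play reward J = p_L^2 + p_R^2; its exact
        # gradient via the softmax Jacobian dp_b/dtheta_a.
        grads = []
        for a in range(2):
            dp = [p[b] * ((1 if a == b else 0) - p[a]) for b in range(2)]
            grads.append(2 * p[0] * dp[0] + 2 * p[1] * dp[1])
        # Supervised gradient of log p(R): one-hot(R) - p.
        sup = [(1 if a == 1 else 0) - p[a] for a in range(2)]
        theta = [t + lr * (g + lmbda * s)
                 for t, g, s in zip(theta, grads, sup)]
    return softmax(theta)

p_rl = train(lmbda=0.0)   # pure self-play follows its initialization
p_osp = train(lmbda=0.5)  # observations pull training toward R
print(p_rl, p_osp)
```

With `lmbda=0`, learning converges to whichever convention the initialization slightly favors (here L); adding the supervised term steers the same dynamics to the observed convention R, even though both are payoff-equivalent at training time.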
Our main analysis is experimental and we use three environments from the literature: traffic, language games, and risky coordination. Our main results are:
Result 1 (Experiments 1,2,3).
OSP greatly increases the probability that our agent learns a convention compatible with test-time agents in situations where standard self-play by itself does not guarantee good test-time performance and the observed data is insufficient to learn a good policy by behavioral cloning alone.
Result 2 (Experiment 3).
OSP can find conventions that have a small basin of attraction under MARL alone. Thus OSP can be used in situations where self-play will rarely find a test-time compatible convention.
For all experiments, we represent the states using simple neural networks. The first two experiments have relatively low dimensional state representations, so we use two-layer MLPs with ELU nonlinearities. Our third experiment has a grid structure, so we represent the state using the convolutional neural network architecture from (Peysakhovich and Lerer, 2017b).

For RL training we use A3C via a multiagent variant of the pytorch-a3c package (Kostrikov, 2018) run across parallel threads, with multi-step returns. We use the Adam method for optimization (Kingma and Ba, 2014). For OSP, we add the supervised term to the A3C loss with a fixed weight $\lambda$, using minibatches from $D$ of size 20. Environment-specific parameters are detailed in the subsections below. In each experiment we compare the performance of OSP for various sizes of $D$. We populate $D$ with actions for all agents at states sampled at uniform intervals from true test-time play.

Traffic
We first consider a multiagent traffic navigation game inspired by (Sukhbaatar et al., 2016). Each agent begins at a random edge of a grid and can move in each of the cardinal directions or stand still (see Figure 2). Each agent is given a goal to reach. When an agent reaches its goal it receives a positive reward and then a new goal. If agents collide with another agent or a wall they receive a negative reward (with separate penalties for colliding with another agent and for colliding with a wall). Agents do not have full observation of the environment. Rather, they have access to the position of their goal relative to themselves and a local view of any other agents nearby. We train agents for a fixed number of episodes.^7

^7 We found it necessary to ramp the collision penalty up linearly over the early episodes to avoid agents becoming stuck in the local minimum of never moving.
We train many replicates and see that two incompatible conventions emerge. This can be seen in Figure 2, where we plot payoffs to an agent from one replicate paired with agents from another. We visualize the conventions by taking the empirical average action taken by any agent at each possible traffic coordinate (Figure 2, panel 3). We find that the two conventions that emerge mirror the real world: either agents learn to drive on the left of the road or they learn to drive on the right.
We now train agents using OSP and insert them into environments with pre-converged agents. The test-time payoffs to the transplanted agent for various sizes of $D$ are shown in Figure 2, panel 4 (top). The dashed line corresponds to the expected payoff of an agent trained using standard self-play (no observations). We see that 20 observations (a few per agent) are sufficient to guide OSP to compatible conventions. The bottom panel shows that this is not enough data to train a good strategy via behavioral cloning alone (i.e. using just the supervised objective).
Language
An important type of convention is language. There is a resurgence of interest in the deep RL community in using communication games to construct agents which are able to communicate (Jorge et al., 2016; Foerster et al., 2016; Lazaridou et al., 2017; Lowe et al., 2017).
We now apply OSP to the cooperative communication task in the particle environment studied by (Lowe et al., 2017) and (Mordatch and Abbeel, 2017). In this environment there are two agents, a speaker and a listener, and three landmarks (blue, red, or green). One of the landmarks is randomly chosen as a goal, and the reward for both agents at each step depends on the distance of the listener from that landmark. However, which landmark is the goal during a particular episode is known only to the speaker, who can produce a communication output from a fixed set of symbols. To solve the cooperation task, agents thus need to evolve a simple 'language'. This language requires only one symbol per landmark, but this still allows for multiple incompatible conventions (any assignment of distinct symbols to landmarks).
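The count of incompatible languages can be sketched directly (our own illustration; the symbol names are hypothetical): every injective assignment of symbols to the three landmarks is a distinct, mutually incompatible signaling convention.

```python
# Counting one-symbol-per-landmark languages in the three-landmark
# speaker-listener game (toy illustration; symbol names are ours).
from itertools import permutations

landmarks = ("red", "green", "blue")
symbols = ("s0", "s1", "s2")  # hypothetical minimal symbol set

# Each language is a bijection from landmarks to symbols.
languages = [dict(zip(landmarks, perm)) for perm in permutations(symbols)]
print(len(languages))  # 6 mutually incompatible minimal languages
```

A speaker trained under one of these six codes paired with a listener trained under another communicates nothing, which is exactly the cross-replicate failure described below.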
In this experiment we use a lower discount factor and, as suggested by (Lowe et al., 2017), we also use a centralized critic. It was shown in prior work that if artificial agents learn language by self-play they can learn arbitrary languages which may not be compatible with new partners (Lazaridou et al., 2017). Indeed, when we pair two agents who were trained separately, they clearly do not speak the same language, i.e. they cannot communicate and so achieve low payoffs.
We look at the effect of adding observational data to the training of either the speaker or the listener (we train multiple replicates to convergence). In the case of the speaker (whose policy is a simple map from goals to communication symbols), supervision alone is sufficient to learn a good test-time language. However, pure behavioral cloning fails catastrophically for the listener. Again, OSP with a relatively small number of observations is able to achieve high payoffs (Figure 3).
Risky Coordination
We now consider a risky coordination game known as the Stag Hunt. The matrix game version of the Stag Hunt has both agents choosing either to Hunt (an action that requires coordination) or to Forage (a safe action). Foraging yields a sure (low) payoff, whereas Hunting yields a high payoff if the other agent also chooses to Hunt and a low payoff if one shows up alone. It is known that in both matrix and Markov versions of Stag Hunt games, many standard self-play based algorithms yield agents that converge to the inefficient equilibrium in which both agents choose the safe action. This happens because while one's partner is not hunting effectively (e.g. early in training), the payoff to hunting oneself is quite low. Thus, the basin of attraction of joint hunting is much smaller than the basin of attraction of joint foraging.
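The matrix version can be sketched as follows (payoffs are our own illustrative choices, not the paper's): both matched profiles are equilibria, but hunting is risky against an uncertain partner.

```python
# A standard 2x2 Stag Hunt with illustrative payoffs (our toy numbers).
PAYOFF = {  # (row action, col action) -> (row payoff, col payoff)
    ("hunt", "hunt"): (4, 4),
    ("hunt", "forage"): (0, 3),
    ("forage", "hunt"): (3, 0),
    ("forage", "forage"): (3, 3),
}
ACTIONS = ("hunt", "forage")

def is_pure_nash(a1, a2):
    r1, r2 = PAYOFF[(a1, a2)]
    return (all(PAYOFF[(d, a2)][0] <= r1 for d in ACTIONS) and
            all(PAYOFF[(a1, d)][1] <= r2 for d in ACTIONS))

pure_equilibria = [p for p in PAYOFF if is_pure_nash(*p)]
print(pure_equilibria)
# Both (hunt, hunt) and (forage, forage) are equilibria. Hunting is
# payoff-dominant, but against a 50/50 partner a hunter expects only
# 0.5 * 4 + 0.5 * 0 = 2, versus a sure 3 from foraging, so learning
# dynamics that start from uncertain partners drift toward foraging.
```

This risk-dominance gap is exactly why the joint-hunt basin of attraction is small under independent MARL.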
This situation is different from the ones in traffic and language: here there are multiple conventions (hunt or forage), but they are not payoff equivalent (hunting is better), nor do they have similar-sized basins of attraction (hunting is very difficult to find via standard independent MARL).
We use the Markov version of the Stag Hunt introduced by (Peysakhovich and Lerer, 2017b), in which two agents live on a grid. Two plants and a stag are placed at random locations. If an agent moves over a plant, it receives a small positive reward. Stepping on the stag gives a larger reward to both players if they step on it simultaneously; otherwise there is no reward. When either a plant or the stag is stepped on, it restarts in a new location. Games last a fixed number of rounds.
We start by constructing test-time hunting partners by inducing joint hunting strategies in several replicates. Because MARL by itself does not find hunt equilibria, we construct a hunting partner by training agents under a modified payoff structure (a lower payoff for plants and a positive payoff for unilateral hunting).
We then test whether we can train agents in the original game who can coordinate with test-time partners that hunt. We use OSP with varying amounts of data from the hunting agents. We see that with moderate amounts of data OSP often converges to the hunting convention at test time, even though two agents trained together using independent MARL fail to find the high-payoff joint stag equilibrium in any of our replicates. As a result, OSP outperforms even centralized self-play, because the observations of the risky partner guide the agent to a better equilibrium. As in the traffic and language environments above, we see that pure behavioral cloning is insufficient to construct good test-time strategies (Figure 4).
Conclusion
Conventions are an important part of social behavior, and many multiagent environments support multiple conventions as equilibria. If we want to construct artificial agents that can adapt to people (rather than requiring people to adapt to them), these agents need to be able to act according to the existing social conventions. In this paper we have discussed how a simple procedure that combines small amounts of imitation learning with self-play can lead to agents that can learn social conventions.
There are many open questions remaining in the study of conventions and of building agents that can learn them quickly. OSP uses a straightforward combination of RL and behavioral cloning. It would be interesting to explore whether ideas from the learning-with-expert-demonstrations literature (Hester et al., 2017) could improve performance. In addition, OSP follows current deep RL paradigms and splits strategy construction into a training phase and a test phase. An interesting extension is to consider how strategies trained with OSP can be fine-tuned during test time.
We have focused on situations where agents have no incentive to deviate from cooperation and only need to learn correct conventions. An important future direction is considering problems where agents have partially misaligned incentives but, in addition to just solving the social dilemma, must also coordinate on a convention (KleimanWeiner et al., 2016; Leibo et al., 2017; Lerer and Peysakhovich, 2017; Foerster et al., 2017; Peysakhovich and Lerer, 2017a).
There is large recent interest in hybrid systems which include both human and artificially intelligent participants (Shirado and Christakis, 2017). Thus, another key extension of our work is to understand whether techniques like OSP can construct agents that can interact with humans in more complex environments.

References
 Babes et al. (2008) Monica Babes, Enrique Munoz De Cote, and Michael L Littman. 2008. Social reward shaping in the prisoner's dilemma. In Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems - Volume 3. International Foundation for Autonomous Agents and Multiagent Systems, 1389–1392.
 Barrett et al. (2017) Samuel Barrett, Avi Rosenfeld, Sarit Kraus, and Peter Stone. 2017. Making friends on the fly: Cooperating with new teammates. Artificial Intelligence 242 (2017), 132–171.
 Bicchieri (2005) Cristina Bicchieri. 2005. The grammar of society: The nature and dynamics of social norms. Cambridge University Press.
 Bulow et al. (1985) Jeremy I Bulow, John D Geanakoplos, and Paul D Klemperer. 1985. Multimarket oligopoly: Strategic substitutes and complements. Journal of Political economy 93, 3 (1985), 488–511.
 Devlin and Kudenko (2011) Sam Devlin and Daniel Kudenko. 2011. Theoretical considerations of potential-based reward shaping for multiagent systems. In The 10th International Conference on Autonomous Agents and Multiagent Systems - Volume 1. International Foundation for Autonomous Agents and Multiagent Systems, 225–232.
 Foerster et al. (2016) Jakob Foerster, Yannis Assael, Nando de Freitas, and Shimon Whiteson. 2016. Learning to communicate with deep multiagent reinforcement learning. In Advances in Neural Information Processing Systems. 2137–2145.
 Foerster et al. (2017) Jakob N Foerster, Richard Y Chen, Maruan Al-Shedivat, Shimon Whiteson, Pieter Abbeel, and Igor Mordatch. 2017. Learning with Opponent-Learning Awareness. arXiv preprint arXiv:1709.04326 (2017).
 Fudenberg and Levine (1998) Drew Fudenberg and David K Levine. 1998. The theory of learning in games. Vol. 2. MIT press.
 Hester et al. (2017) Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Gabriel Dulac-Arnold, et al. 2017. Deep Q-learning from Demonstrations. arXiv preprint arXiv:1704.03732 (2017).
 Jorge et al. (2016) Emilio Jorge, Mikael Kågebäck, and Emil Gustavsson. 2016. Learning to Play Guess Who? and Inventing a Grounded Language as a Consequence. arXiv preprint arXiv:1611.03218 (2016).
 Kapetanakis and Kudenko (2002a) Spiros Kapetanakis and Daniel Kudenko. 2002a. Improving on the reinforcement learning of coordination in cooperative multiagent systems. In Proceedings of the Second Symposium on Adaptive Agents and Multi-Agent Systems (AISB'02).
 Kapetanakis and Kudenko (2002b) Spiros Kapetanakis and Daniel Kudenko. 2002b. Reinforcement learning of coordination in cooperative multiagent systems. AAAI/IAAI 2002 (2002), 326–331.
 Kingma and Ba (2014) Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
 Kleiman-Weiner et al. (2016) Max Kleiman-Weiner, MK Ho, JL Austerweil, Michael L Littman, and Josh B Tenenbaum. 2016. Coordinate to cooperate or compete: abstract goals and joint intentions in social interaction. In Proceedings of the 38th Annual Conference of the Cognitive Science Society.
 Kostrikov (2018) Ilya Kostrikov. 2018. PyTorch Implementations of Asynchronous Advantage Actor Critic. https://github.com/ikostrikov/pytorch-a3c.
 Lake et al. (2017) Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. 2017. Building machines that learn and think like people. Behavioral and Brain Sciences 40 (2017).
 Lanctot et al. (2017) Marc Lanctot, Vinicius Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Julien Perolat, David Silver, Thore Graepel, et al. 2017. A unified gametheoretic approach to multiagent reinforcement learning. In Advances in Neural Information Processing Systems. 4193–4206.
 Lazaridou et al. (2017) Angeliki Lazaridou, Alexander Peysakhovich, and Marco Baroni. 2017. Multiagent cooperation and the emergence of (natural) language. In International Conference on Learning Representations.
 Leibo et al. (2017) Joel Z Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel. 2017. Multiagent Reinforcement Learning in Sequential Social Dilemmas. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems. International Foundation for Autonomous Agents and Multiagent Systems, 464–473.
 Lerer and Peysakhovich (2017) Adam Lerer and Alexander Peysakhovich. 2017. Maintaining cooperation in complex social dilemmas using deep reinforcement learning. arXiv preprint arXiv:1707.01068 (2017).
 Lewis (1969) David Lewis. 1969. Convention: A philosophical study. John Wiley & Sons.
 Littman (1994) Michael L Littman. 1994. Markov games as a framework for multiagent reinforcement learning. In Machine Learning Proceedings 1994. Elsevier, 157–163.
 Lowe et al. (2017) Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. 2017. Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. arXiv preprint arXiv:1706.02275 (2017).
 Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. 2015. Humanlevel control through deep reinforcement learning. Nature 518, 7540 (2015), 529–533.
 Mordatch and Abbeel (2017) Igor Mordatch and Pieter Abbeel. 2017. Emergence of grounded compositional language in multiagent populations. arXiv preprint arXiv:1703.04908 (2017).
 Peysakhovich and Lerer (2017a) Alexander Peysakhovich and Adam Lerer. 2017a. Consequentialist conditional cooperation in social dilemmas with imperfect information. arXiv preprint arXiv:1710.06975 (2017).
 Peysakhovich and Lerer (2017b) Alexander Peysakhovich and Adam Lerer. 2017b. Prosocial learning agents solve generalized Stag Hunts better than selfish ones. arXiv preprint arXiv:1709.02865 (2017).
Resnick et al. (2018) Cinjon Resnick, Ilya Kulikov, Kyunghyun Cho, and Jason Weston. 2018. Vehicle Communication Strategies. arXiv preprint arXiv:1804.07178 (2018).
Shapley (1953) Lloyd S Shapley. 1953. Stochastic games. Proceedings of the National Academy of Sciences 39, 10 (1953), 1095–1100.
 Shirado and Christakis (2017) Hirokazu Shirado and Nicholas A Christakis. 2017. Locally noisy autonomous agents improve global human coordination in network experiments. Nature 545, 7654 (2017), 370–374.
Shoham and Tennenholtz (1997) Yoav Shoham and Moshe Tennenholtz. 1997. On the emergence of social conventions: modeling, analysis, and simulations. Artificial Intelligence 94, 1-2 (1997), 139–166.
 Silver et al. (2017) David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. 2017. Mastering the game of go without human knowledge. Nature 550, 7676 (2017), 354.
Stone et al. (2010) Peter Stone, Gal A Kaminka, Sarit Kraus, and Jeffrey S Rosenschein. 2010. Ad hoc autonomous agent teams: Collaboration without precoordination. In Twenty-Fourth AAAI Conference on Artificial Intelligence.
 Stone and Sutton (2001) Peter Stone and Richard S Sutton. 2001. Scaling reinforcement learning toward RoboCup soccer. In ICML, Vol. 1. Citeseer, 537–544.

Sukhbaatar et al. (2016) Sainbayar Sukhbaatar, Rob Fergus, et al. 2016. Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems. 2244–2252.
 von Neumann (1928) J von Neumann. 1928. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen 100, 1 (1928), 295–320.
Walker and Wooldridge (1995) Adam Walker and Michael Wooldridge. 1995. Understanding the Emergence of Conventions in Multi-Agent Systems. In ICMAS, Vol. 95. 384–389.
Wiewiora (2003) Eric Wiewiora. 2003. Potential-based shaping and Q-value initialization are equivalent. Journal of Artificial Intelligence Research 19 (2003), 205–208.
 Yoshida et al. (2008) Wako Yoshida, Ray J Dolan, and Karl J Friston. 2008. Game theory of mind. PLoS computational biology 4, 12 (2008), e1000254.
Appendix
Relationship Between Markov Strategic Complements and Strategic Complements
The original definition of strategic complements comes from games with continuous actions used to model multiple firms in a market. In the simplest example we have multiple firms $i$ which each produce $q_i$ units of goods. The revenue function of each firm is $R_i(q_i, q_{-i})$ where $R_i$ is smooth, strictly concave, increasing, and has $R_i(0, q_{-i}) = 0$. The goods are strategic complements if $\partial^2 R_i / \partial q_i \partial q_j > 0$; in other words, goods are strategic complements if "more 'aggressive' play… by one firm… raises the marginal profitabilities [of the others]" (Bulow et al., 1985). Firms have costs of production given by $c_i(q_i)$ which has $c_i(0) = 0$, $c_i' > 0$, and is convex and increasing. Thus each firm's objective function is
$$\pi_i(q_i, q_{-i}) = R_i(q_i, q_{-i}) - c_i(q_i).$$
If firm $j$ is producing $q_j$ then firm $i$'s best response $q_i^*$ sets
$$\frac{\partial R_i(q_i^*, q_{-i})}{\partial q_i} = c_i'(q_i^*).$$
Given the definition of strategic complements above this means that $\partial q_i^* / \partial q_j > 0$ for all other firms $j$.
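The monotonicity of best responses can be checked numerically. This is a minimal sketch assuming a hypothetical linear-quadratic specification ($R_i(q_i, q_j) = q_i(a + b\,q_j)$ with $b > 0$ and $c(q) = q^2/2$, chosen for illustration and not taken from the text above), under which the closed-form best response is $q_i^* = a + b\,q_j$:

```python
# Sketch of strategic complements with hypothetical parameter choices.
# Revenue R_i(q_i, q_j) = q_i * (A + B * q_j) with B > 0 gives
# d^2 R_i / dq_i dq_j = B > 0, the strategic-complements condition.
# Cost c(q) = q^2 / 2, so profit is strictly concave in own quantity.

A, B = 1.0, 0.5  # illustrative parameters; B > 0 makes quantities complements

def profit(q_i: float, q_j: float) -> float:
    revenue = q_i * (A + B * q_j)
    cost = 0.5 * q_i ** 2
    return revenue - cost

def best_response(q_j: float, grid_step: float = 1e-3) -> float:
    # Grid search over own quantity; the closed form here is q_i* = A + B * q_j.
    qs = [k * grid_step for k in range(int(10 / grid_step))]
    return max(qs, key=lambda q: profit(q, q_j))

# Best responses are increasing in the other firm's quantity,
# approximately [1.0, 1.5, 2.0] for q_j in {0, 1, 2}.
brs = [best_response(q_j) for q_j in (0.0, 1.0, 2.0)]
assert brs[0] < brs[1] < brs[2]
```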
Strategic complements implies our Markov strategic complements in a matrix game with multiple equilibria (since any firm changing its production level higher or lower causes other firms to also want to change their production). Markov strategic complements is weaker than strategic complements in matrix games, since it only pins down how best responses shift when others change to equilibrium actions, rather than after any action shift (though if action spaces in each state were totally ordered, one could amend the definition to keep all of the properties).
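The matrix-game setting can be sketched with a toy 2x2 coordination game (the payoff matrix below is a hypothetical example, not from the text): iterated best response converges to whichever pure equilibrium's basin of attraction contains the starting profile.

```python
# Toy 2x2 coordination game with two pure equilibria: (0, 0) and (1, 1).
# Payoff 1 for matching actions, 0 otherwise; the game is symmetric.
PAYOFF = [[1, 0], [0, 1]]  # row player's payoff, indexed [own][other]

def best_response(other_action: int) -> int:
    return max((0, 1), key=lambda a: PAYOFF[a][other_action])

def converge(a: int, b: int, iters: int = 10):
    # Simultaneous best-response dynamics for both players.
    for _ in range(iters):
        a, b = best_response(b), best_response(a)
    return a, b

# Each pure equilibrium is a fixed point of the best-response dynamics.
assert converge(0, 0) == (0, 0)
assert converge(1, 1) == (1, 1)
```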
Proof of Main Theorem
Lemma 1: In a Markov strategic complements (MSC) game, any policy $\pi$ in the basin of attraction $B(\pi_A)$ of an equilibrium $\pi_A$ remains there under observational initialization, i.e. $\pi \in B(\pi_A) \implies OI(\pi, D) \in B(\pi_A)$.

We define the operator $BR^k$ as $k$ iterations of the best response operator,
$$BR^k(\pi) = \underbrace{BR(BR(\cdots BR(\pi)))}_{k \text{ times}}.$$

Consider an initial policy $\pi \in B(\pi_A)$ for some equilibrium $A$. There exists $K$ such that $BR^K(\pi) = \pi_A$. Now consider an observationally initialized policy $\pi' = OI(\pi, D)$ for some dataset $D$ drawn from $\pi_A$. By definition, this implies that $\pi'$ agrees with $\pi_A$ at every state where $\pi$ does. Now, since the game is MSC, $BR(\pi')$ agrees with $\pi_A$ at every state where $BR(\pi)$ does.

By repeated application of the MSC property, we find that for all $k$, $BR^k(\pi')$ agrees with $\pi_A$ at every state where $BR^k(\pi)$ does.

To conclude, we note that $BR^K(\pi) = \pi_A$, which implies $BR^K(\pi') = \pi_A$ and hence $OI(\pi, D) = \pi' \in B(\pi_A)$.
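The mechanics of this argument can be sketched by representing a policy by the set of states at which it agrees with $\pi_A$, and the best response by an operator on such sets that is monotone in the sense of MSC. The operator `F` below is a hypothetical stand-in satisfying that monotonicity, not derived from any particular game:

```python
# Sketch of the Lemma 1 argument. A policy is represented by the set of
# states where it agrees with pi_A; full agreement means the policy IS pi_A.
# MSC corresponds to monotonicity: S1 ⊇ S2 implies F(S1) ⊇ F(S2).

STATES = frozenset(range(5))

def F(agree: frozenset) -> frozenset:
    # Hypothetical monotone best-response step: state 0 always falls in
    # line with pi_A, and a state does so once its predecessor agrees.
    return agree | {0} | {s for s in STATES if s - 1 in agree}

def converges_to_equilibrium(agree: frozenset, max_iters: int = 10) -> bool:
    for _ in range(max_iters):
        agree = F(agree)
    return agree == STATES  # full agreement = converged to pi_A

# A policy agreeing with pi_A nowhere still lies in the basin of attraction.
pi = frozenset()
assert converges_to_equilibrium(pi)

# Observational initialization (with D drawn from pi_A) can only ADD
# agreement states, so the initialized policy stays in the basin.
pi_oi = pi | {2, 3}
assert converges_to_equilibrium(pi_oi)
```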
Lemma 2: In an MSC game with a finite number of states, there exists a state $s^*$ such that for any dataset $D$ that contains the state-action pair $(s^*, \pi_A(s^*))$, there is a policy not in the basin of attraction of $\pi_A$ but which enters the basin of attraction of $\pi_A$ under observational initialization.

Consider a policy $\pi \notin B(\pi_A)$, and order the states lexicographically $s_1, s_2, \ldots, s_n$. Now consider the sequence of policies $\pi^0, \pi^1, \ldots, \pi^n$ where $\pi^i(s_j) = \pi_A(s_j)$ for $j \le i$ and $\pi^i(s_j) = \pi(s_j)$ for $j > i$. We know that $\pi^n = \pi_A \in B(\pi_A)$, therefore there exists some $i$ such that $\pi^{i-1} \notin B(\pi_A)$ and $\pi^i \in B(\pi_A)$. Now, consider a dataset $D$ containing the state-action pair $(s_i, \pi_A(s_i))$. Then $OI(\pi^{i-1}, D)$ agrees with $\pi_A$ at every state where $\pi^i$ does. As discussed in the last section, if $\pi^i \in B(\pi_A)$ and a policy agrees with $\pi_A$ wherever $\pi^i$ does, then that policy is in $B(\pi_A)$. Therefore, for any dataset containing $(s_i, \pi_A(s_i))$, the policy $\pi^{i-1}$ enters the basin of attraction of $\pi_A$ under observational initialization.
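The crossing-index construction can be sketched in the same agreement-set representation (again with a hypothetical monotone operator `F`, chosen here so that its basin of attraction excludes the empty agreement set):

```python
# Sketch of the Lemma 2 construction. Policies are agreement sets with
# pi_A; F is a hypothetical monotone best-response stand-in under which
# agreement spreads only from an already-agreeing neighbor, so the
# nowhere-agreeing policy is NOT in the basin of attraction.

STATES = list(range(5))
FULL = frozenset(STATES)

def F(agree: frozenset) -> frozenset:
    return agree | {s for s in STATES if s - 1 in agree}

def in_basin(agree: frozenset, max_iters: int = 10) -> bool:
    for _ in range(max_iters):
        agree = F(agree)
    return agree == FULL

# pi agrees with pi_A nowhere and is not in the basin.
assert not in_basin(frozenset())

# Build pi^0, ..., pi^n, switching one state at a time to pi_A, and find
# the crossing index i where membership in the basin flips.
seq = [frozenset(STATES[:i]) for i in range(len(STATES) + 1)]
crossing = next(i for i in range(1, len(seq))
                if in_basin(seq[i]) and not in_basin(seq[i - 1]))

# A dataset containing (s_i, pi_A(s_i)) moves pi^{i-1} into the basin.
s_star = STATES[crossing - 1]
assert in_basin(seq[crossing - 1] | {s_star})
```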