Learning in two-player games between transparent opponents

by Adrian Hutter, et al.

We consider a scenario in which two reinforcement learning agents repeatedly play a matrix game against each other and update their parameters after each round. The agents' decision-making is transparent to each other, which allows each agent to predict how their opponent will play against them. To prevent an infinite regress of both agents recursively predicting each other indefinitely, each agent is required to give an opponent-independent response with some probability at least epsilon. Transparency also allows each agent to anticipate and shape the other agent's gradient step, i.e. to move to regions of parameter space in which the opponent's gradient points in a direction favourable to them. We study the resulting dynamics experimentally, using two algorithms from previous literature (LOLA and SOS) for opponent-aware learning. We find that the combination of mutually transparent decision-making and opponent-aware learning robustly leads to mutual cooperation in a single-shot prisoner's dilemma. In a game of chicken, in which both agents try to manoeuvre their opponent towards their preferred equilibrium, converging to a mutually beneficial outcome turns out to be much harder, and opponent-aware learning can even lead to worst-case outcomes for both agents. This highlights the need to develop opponent-aware learning algorithms that achieve acceptable outcomes in social dilemmas involving an equilibrium selection problem.



1 Introduction

Reinforcement learning is concerned with constructing agents that learn to achieve high rewards in a diverse set of environments. Multi-agent reinforcement learning studies learning in environments that contain other agents, possibly pursuing conflicting goals. The simplest way to deal with the presence of other learning agents during learning is to treat them as just another part of the environment. However, such an environment will be non-static and non-Markovian (Laurent et al., 2011; Lanctot et al., 2017). The simplistic treatment also ignores the incentives that our agent’s parameters and their updates create for the learning process of the other agents. This is particularly relevant in so-called social dilemmas, in which agents can benefit from mutual cooperation but have an incentive to unilaterally defect (Leibo et al., 2017). A number of authors recently developed strategies that allow reinforcement learners to reach mutual cooperation in social dilemmas (Lerer and Peysakhovich, 2017; Peysakhovich and Lerer, 2017; Hughes et al., 2018; Wang et al., 2018; Foerster et al., 2018; Letcher et al., 2018; Baumann et al., 2018; Clifton and Riché, 2020).

The prisoner’s dilemma (PD) is the canonical example of a social dilemma. Mutual cooperation in the PD is Pareto-superior to mutual defection, but defecting is a dominant strategy and mutual defection is the only Nash equilibrium of the game. Strategies like tit-for-tat enable mutual cooperation in the iterated PD (Axelrod and Hamilton, 1981; Harper et al., 2017), while cooperation is much more elusive in the single-shot case. Vanilla gradient descent learners fail to learn tit-for-tat and end up defecting with each other even when repeatedly interacting in an iterated PD. One strand of recent work considered learning agents that are aware of each other’s learning process, and use this to anticipate and shape each other’s parameter updates (Foerster et al., 2018; Letcher et al., 2018). When repeatedly playing an iterated PD, such learners can learn strategies similar to tit-for-tat, leading to mutual cooperation. In this work, we shall take the idea of mutual awareness of the two learning agents one step further, as we will now discuss.

Outside of machine learning, different authors have discussed conditions under which agents might cooperate with each other even in a single-shot PD. What connects all of these ideas is a condition which we might call mutual transparency, i.e. agents having some insight into each other’s inner workings. Hofstadter (1983) coined the term “superrationality” to describe a type of player that will cooperate with other superrational players, since they reason that both players will arrive at the same decision, and hence mutual defection and mutual cooperation are the only two consistent outcomes. McAfee (1984), Howard (1988), and Tennenholtz (2004) introduced a program equilibrium formalism, in which players do not directly choose actions in the game, but rather submit programs which choose actions and are given access to each other’s source code. Mutual cooperation is enabled by submitting a program that cooperates if and only if the opponent program is syntactically exactly identical. However, cooperation is rather fragile if it relies on programs being exact copies of each other, and so different authors have worked on making cooperation in a single-shot PD more robust when assuming mutual transparency of the agents (van der Hoek et al., 2013; LaVictoire et al., 2014; Critch, 2019; Oesterheld, 2019). These references rely on logical as opposed to syntactical properties of the agents to achieve cooperation. Critch (2019) considers agents that search through all possible proofs up to length \(k\) to prove that their opponent will cooperate with them; if they find such a proof, they cooperate, otherwise they defect. Surprisingly, such agents cooperate with each other for large enough but finite \(k\). Oesterheld (2019) introduces an agent called \(\epsilon\)GroundedFairBot which cooperates unconditionally with small probability \(\epsilon\), and otherwise mirrors the opponent’s action when predicting how the opponent will act when playing against \(\epsilon\)GroundedFairBot. When facing a logically similar agent, the mutual recursive function calls terminate after a finite number with unit probability and cooperation ensues. When facing a defector, \(\epsilon\)GroundedFairBot defects with probability \(1-\epsilon\).
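The recursive prediction described for the agent of Oesterheld (2019) can be illustrated with a minimal sketch. The function names, the string encoding of actions, and the value `eps=0.1` are hypothetical choices made for this illustration; they are not taken from the original work.

```python
import random

def grounded_fair_bot(opponent, rng, eps=0.1):
    """Cooperate unconditionally with probability eps (illustrative value);
    otherwise simulate the opponent playing against this agent and mirror
    the simulated action."""
    if rng.random() < eps:
        return "C"
    # Recursive call: predict how the opponent acts against this agent.
    return opponent(grounded_fair_bot, rng)

def defect_bot(opponent, rng):
    """Always defects, regardless of the opponent."""
    return "D"

def play(agent, opponent, seed):
    """Return the action that `agent` chooses against `opponent`."""
    rng = random.Random(seed)
    return agent(opponent, rng)

# Two such agents always end up cooperating: the chain of mutual simulations
# can only terminate in the unconditional "C" branch.
actions = [play(grounded_fair_bot, grounded_fair_bot, s) for s in range(5)]
```

Against `defect_bot`, the same agent cooperates only in the unconditional branch, so it defects with probability `1 - eps`, in line with the description above.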

In this work, we bring together the idea of mutual transparency enabling cooperation with opponent-aware learning as developed by Foerster et al. (2018) and Letcher et al. (2018). Similar to Oesterheld (2019), we allow both agents to predict how their opponent will play against them, while requiring them to give an opponent-independent response with probability at least \(\epsilon\). The agents can repeatedly interact in a social dilemma and can update their parameters after each round. An interesting question is whether policies similar to \(\epsilon\)GroundedFairBot can emerge as the result of such a learning process, which we shall answer in the affirmative.

Opponent-aware learning is necessary to learn such mutually cooperating policies; simple gradient learners will always learn mutual defection. We use two algorithms developed in previous literature for this, LOLA (Foerster et al., 2018) and SOS (Letcher et al., 2018). We will find that both of these have their own (sometimes unexpected) advantages and drawbacks.

In addition to the PD, we consider another well-known social dilemma, the game of chicken. A key difference of the game of chicken from the PD is that it has multiple Nash equilibria, and both players prefer a different one. We find that unlike in the PD, opponent-aware learning can be used in the game of chicken to outmaneuver the opponent and navigate them towards one’s preferred equilibrium. However, both players attempting to do this can result in a worst-case outcome of both players going straight.

We study this scenario from the perspective of the principals who deploy the learning agents to interact with each other. From the principals’ perspective, we can regard this as a game in which a move corresponds to submitting a certain learning agent. An interesting question will thus be whether in the game which the principals are playing, a socially optimal outcome is easier to achieve than in the underlying game which the agents are playing (PD or chicken). We find that this is the case for the PD, but not for the game of chicken.

The rest of this work is organized as follows. Sec. 2 briefly introduces learning with opponent-learning awareness, as developed in recent literature, and adapts it to our framework. For illustration, we apply it to two simple two-player games. Sec. 3 formally discusses games in which both players have the ability to predict how their opponent will play against themselves. Sec. 4 contains our main results, in which we apply the techniques from Secs. 2 and 3 to two well-known social dilemmas, the prisoner’s dilemma and the game of chicken. Sec. 5 contains our concluding remarks.

2 Learning with opponent-learning awareness

The theory of learning while taking the opponent’s learning into account as used in this work was developed by Zhang and Lesser (2010); Foerster et al. (2018); Letcher et al. (2018). We briefly recapitulate these ideas and adapt them to our scenario.

Consider a “game” between two players \(A\) and \(B\), in which expected payoffs as a function of players’ parameters are given by \(V_A(\theta_A, \theta_B)\) and \(V_B(\theta_A, \theta_B)\), respectively. In each round, each player calculates a gradient of their own expected payoff and updates their parameters in the direction of this gradient using a learning rate \(\alpha\). Let \(\theta_A\) and \(\theta_B\) denote players’ current parameters. The “naive” gradient for player \(A\) is then given by

\[ \nabla_{\theta_A} V_A(\theta_A, \theta_B). \tag{1} \]

This “naive” gradient however implicitly assumes \(\theta_B\) to be static and so ignores that \(B\) is learning as well.

A more sophisticated way of calculating \(A\)’s gradient might take \(B\)’s learning into account. There are two different ways in which \(B\)’s learning affects the direction in which \(A\)’s gradient points: \(A\) might want to calculate their gradient at parameters \(\theta_B\) that are already updated by a gradient step of \(B\) (in other words, anticipate \(B\)’s gradient step); and \(A\) might want to move towards regions of parameter space in which \(B\)’s gradient points into directions that benefit \(A\) (in other words, shape \(B\)’s gradient step).

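In order to arrive at this formally, \(A\) considers a gradient step of \(B\) in which \(B\) updates their parameters using a learning rate \(\alpha'\), that is, \(\Delta\theta_B = \alpha' \nabla_{\theta_B} V_B(\theta_A, \theta_B)\). The learning rate \(\alpha'\), used by \(A\) in order to anticipate and shape \(B\)’s learning, need not be identical to \(B\)’s actual learning rate \(\alpha\). The LOLA (Learning with Opponent-Learning Awareness (Foerster et al., 2018)) gradient of \(A\) is then given by

\[ (\nabla_{\theta_A} V_A)(\theta_A, \theta_B + \Delta\theta_B) + \alpha' \, (\nabla_{\theta_A} \nabla_{\theta_B} V_B)(\theta_A, \theta_B) \, (\nabla_{\theta_B} V_A)(\theta_A, \theta_B + \Delta\theta_B) \tag{2} \]

\[ = \nabla_{\theta_A} V_A + \alpha' \, (\nabla_{\theta_A} \nabla_{\theta_B} V_A) \, \nabla_{\theta_B} V_B + \alpha' \, (\nabla_{\theta_A} \nabla_{\theta_B} V_B) \, \nabla_{\theta_B} V_A + O(\alpha'^2), \tag{3} \]

where all terms in Eq. (3) are evaluated at \((\theta_A, \theta_B)\).

Zhang and Lesser (2010) study the first summand in Eq. (2), which (following Letcher et al. (2018)) we will call Look Ahead (LA). They prove that for two-action two-player games, LA leads to convergence to a Nash equilibrium for sufficiently small learning rates. Foerster et al. (2018) study the combination of the first and third summand in Eq. (3), that is, the naive gradient and the leading-order opponent-shaping correction. Among other results, they find that it leads to mutual cooperation in an iterated PD. The version of LOLA we will use in this work is given by Eq. (2), so in contrast to Foerster et al. (2018) we also incorporate the effects of anticipating the opponent’s gradient step, and do not perform a leading-order Taylor expansion.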
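To make the difference between the naive gradient and a LOLA-style gradient concrete, the following is a minimal sketch for players with a single scalar parameter each. It uses central finite differences instead of automatic differentiation, and all function names as well as the bilinear example payoffs are assumptions made for illustration. The LOLA-style gradient is obtained by differentiating the player's payoff after the opponent's anticipated step, where that step itself depends on the player's own parameter; this captures both the anticipation and the shaping effect without a Taylor expansion.

```python
def ddx(f, x, h=1e-5):
    """Central finite-difference derivative of a scalar function."""
    return (f(x + h) - f(x - h)) / (2 * h)

def naive_grad(V_A, th_A, th_B):
    """Gradient of A's payoff with B's parameter held fixed."""
    return ddx(lambda a: V_A(a, th_B), th_A)

def lola_grad(V_A, V_B, th_A, th_B, alpha_opp):
    """Derivative of A's payoff after B's anticipated gradient step.

    alpha_opp is the learning rate A assumes for B's step. B's step is
    recomputed for every value of A's parameter, so the result contains
    both the look-ahead and the opponent-shaping contribution.
    """
    def payoff_after_opponent_step(a):
        step_B = alpha_opp * ddx(lambda b: V_B(a, b), th_B)
        return V_A(a, th_B + step_B)
    return ddx(payoff_after_opponent_step, th_A)

# Hypothetical zero-sum example: V_A = a * b, V_B = -a * b.
V_A = lambda a, b: a * b
V_B = lambda a, b: -a * b
g_naive = naive_grad(V_A, 1.0, 2.0)          # analytically: b = 2
g_lola = lola_grad(V_A, V_B, 1.0, 2.0, 0.5)  # analytically: b - 2 * alpha_opp * a = 1
```

Setting `alpha_opp` to zero removes the opponent's anticipated step, so the sketch then reduces to the naive gradient.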

Letcher et al. (2018) introduce SOS (Stable Opponent Shaping), which, when adapted to our framework, uses a gradient


This is identical to Eq. (3), up to a factor \(p\) which is used to tune the strength of the shaping term and is re-calculated for each gradient step. Letcher et al. (2018) prove that SOS converges to stable fixed points (SFPs) in a broad class of games and demonstrate that two SOS learners can avoid certain cooperation failures that two LOLA learners fall victim to; see Sec. 2.2 for further discussion of this point. The scaling factor \(p\) is thereby chosen as large as possible while guaranteeing that a) the SOS gradient always has positive inner product with the first two summands of Eq. (3) (that is, the naive gradient and the leading-order LA correction); and b) \(p\) goes to zero whenever the naive gradient does.

In Letcher et al. (2018), the factor \(p\) is at each step identical for all involved learners. This implicitly assumes a certain form of global coordination which we cannot assume for our purposes. We will thus use a form of SOS in which each player calculates \(p\) independently of all others, as proposed in Remark 4.7 in Letcher (2018). As shown in this remark, this version of SOS inherits convergence guarantees from the version used in Letcher et al. (2018). The detailed calculation of \(p\) is described in Appendix A.

As discussed, there are some technical differences between the way LOLA and SOS gradients are calculated in Foerster et al. (2018) and Letcher et al. (2018), respectively, and the way they are calculated in this work. A conceptually more interesting difference is that we do not assume \(\alpha\) and \(\alpha'\) to be fixed and equal to each other. We consider the choice of \(\alpha'\) (as well as the choice between LOLA and SOS) to be part of the strategy of the principal who deploys a learning agent, and study the effects of choosing different values for \(\alpha'\). For instance, \(\alpha' = 0\) is equivalent to calculating the naive gradient, while choosing \(\alpha'\) much larger than \(\alpha\) corresponds to looking multiple steps of the opponent ahead. Larger opponent learning rates \(\alpha'\) might be particularly relevant in competitive scenarios, in which both players attempt to outwit each other.

For illustration, we study the effects of LOLA and SOS on two simple games that do not yet involve predicting the opponent’s action against oneself.

2.1 Ultimatum game

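We consider a binary version of the ultimatum game (Güth et al., 1982; Sanfey et al., 2003; Henrich et al., 2004; Oosterbeek et al., 2004) in which player \(A\) (the proposer) receives a pot of $10 and can choose between a “fair” split ($5 each for them and their opponent) or an “unfair” split ($8 for themselves, $2 for the opponent). Player \(B\) (the responder) can choose to accept the proposed split, or reject it, in which case both players receive nothing. We assume that \(B\) will always accept fair splits, so both players have a single parameter describing their strategy. Player \(A\)’s parameter \(\theta_A\) describes their probability of proposing a fair split, \(p_{\mathrm{fair}} = \sigma(\theta_A)\), where \(\sigma\) denotes the sigmoid function. Player \(B\)’s parameter \(\theta_B\) describes the probability of accepting a proposed unfair split, \(p_{\mathrm{accept}} = \sigma(\theta_B)\). Payoffs are given by

\[ V_A = 5\,\sigma(\theta_A) + 8\,(1 - \sigma(\theta_A))\,\sigma(\theta_B), \qquad V_B = 5\,\sigma(\theta_A) + 2\,(1 - \sigma(\theta_A))\,\sigma(\theta_B), \]

and so the naive gradients are

\[ \frac{\partial V_A}{\partial \theta_A} = \sigma'(\theta_A)\,(5 - 8\,\sigma(\theta_B)), \qquad \frac{\partial V_B}{\partial \theta_B} = 2\,(1 - \sigma(\theta_A))\,\sigma'(\theta_B). \]

Player \(B\)’s naive gradient is thus always positive (since \(\sigma(\theta_A) < 1\) for any finite \(\theta_A\)), while \(A\)’s naive gradient is positive when \(\sigma(\theta_B) < 5/8\) and negative otherwise.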
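The sign structure of the naive gradients can be checked with a short sketch. It assumes the payoffs described above (a fair split pays 5 to both players, an accepted unfair split pays 8 to the proposer and 2 to the responder, a rejected split pays nothing) and a sigmoid parametrisation of both probabilities; the function and variable names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def payoffs(th_A, th_B):
    """Expected payoffs (proposer, responder) in the binary ultimatum game."""
    p_fair = sigmoid(th_A)    # proposer offers the fair split
    p_accept = sigmoid(th_B)  # responder accepts an unfair split
    V_A = 5 * p_fair + 8 * (1 - p_fair) * p_accept
    V_B = 5 * p_fair + 2 * (1 - p_fair) * p_accept
    return V_A, V_B

def naive_grads(th_A, th_B):
    """Each player's derivative of their own payoff w.r.t. their own parameter."""
    s_A, s_B = sigmoid(th_A), sigmoid(th_B)
    dV_A = s_A * (1 - s_A) * (5 - 8 * s_B)  # sign flips at sigmoid(th_B) = 5/8
    dV_B = 2 * (1 - s_A) * s_B * (1 - s_B)  # positive for all finite parameters
    return dV_A, dV_B

# The proposer's gradient changes sign where sigmoid(th_B) = 5/8.
th_B_threshold = math.log(5 / 3)
```

For a responder parameter above `th_B_threshold`, a naive proposer is pushed towards unfair proposals, while a naive responder is always pushed towards accepting.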

The full gradient fields for naive learners, SOS learners, and LOLA learners (with ) are shown in Fig. 1. The field for SOS learners looks very similar to that for naive learners. In particular, the responder’s gradient is always positive, which is expected: since the responder’s naive gradient is always positive, so is their LA gradient; the SOS gradient always has positive inner product with the leading-order LA gradient, which for agents with a single parameter means that the two always have the same sign. So the SOS gradient of the responder cannot become negative as long as the leading-order approximation is good.

Things change when using LOLA with a sufficiently high opponent learning rate: the responder’s gradient now becomes negative in certain regions of parameter space, meaning that the responder decreases their likelihood of accepting an unfair proposal, since they take into account that a low acceptance probability will provide an incentive for the proposer to increase their probability of proposing a fair split.

Figure 1: Gradient fields in the ultimatum game for naive learners (left), SOS learners (center), and LOLA learners (right) using .

2.2 Tandem game

Letcher et al. (2018) introduce the tandem game as an example of a game in which LOLA learners’ attempt at shaping each other’s gradients leads to worse outcomes for both of them, which is avoided by SOS learners. In this game, both players have a single parameter, \(\theta_A\) and \(\theta_B\), and payoffs are given by

\[ V_A = -(\theta_A + \theta_B)^2 + 2\theta_A, \qquad V_B = -(\theta_A + \theta_B)^2 + 2\theta_B. \]

The first summand in the payoff functions encourages the players to coordinate their choice of parameters to satisfy \(\theta_A = -\theta_B\), while the second summand encourages each player to choose their parameter as large as possible.

The overall welfare can be written as

\[ V_A + V_B = -2(\theta_A + \theta_B)^2 + 2(\theta_A + \theta_B). \]

When both players use LOLA gradients as in Eq. (2) with the same \(\alpha' < 1/2\), straightforward calculus shows that the set of SFPs is described by

\[ \theta_A + \theta_B = \frac{1 - 2\alpha' + 4\alpha'^2}{(1 - 2\alpha')^2}, \]

which increases monotonically as a function of \(\alpha'\) and diverges as \(\alpha' \to 1/2\). The resulting overall welfare thus decreases monotonically as a function of \(\alpha'\).

LOLA learners display “arrogant behavior” (Letcher et al., 2018): LOLA encourages both players to increase their own parameter, expecting that this will compel (via the first summand in the payoff) their opponent to decrease their own. SOS by contrast preserves the SFPs described by \(\theta_A + \theta_B = 1\) of two naive learners, leading to higher overall welfare.
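The drift of two LOLA learners away from the naive fixed points can be reproduced with a small simulation. This is a sketch under stated assumptions: payoffs are taken to be the negated tandem losses of Letcher et al. (2018), i.e. each player receives minus the squared parameter sum plus twice their own parameter, and the LOLA gradient is the non-expanded derivative of the payoff after the opponent's anticipated step. The learning rates, the step count, and all names are illustrative.

```python
def lola_grad_tandem(own, other, alpha_opp):
    """LOLA-style gradient for payoff V(own, other) = -(own + other)**2 + 2 * own.

    The opponent's anticipated step is alpha_opp * (2 - 2 * s) with
    s = own + other, so the anticipated sum is s + alpha_opp * (2 - 2 * s),
    whose derivative w.r.t. `own` is 1 - 2 * alpha_opp.
    """
    s = own + other
    s_ahead = s + alpha_opp * (2 - 2 * s)
    return -2 * s_ahead * (1 - 2 * alpha_opp) + 2

def simulate(th_A, th_B, alpha=0.1, alpha_opp=0.2, steps=2000):
    """Simultaneous gradient ascent of two learners using the gradient above."""
    for _ in range(steps):
        g_A = lola_grad_tandem(th_A, th_B, alpha_opp)
        g_B = lola_grad_tandem(th_B, th_A, alpha_opp)
        th_A, th_B = th_A + alpha * g_A, th_B + alpha * g_B
    return th_A, th_B

def predicted_sum(alpha_opp):
    """Parameter sum at the fixed points, valid for alpha_opp < 1/2."""
    return (1 - 2 * alpha_opp + 4 * alpha_opp ** 2) / (1 - 2 * alpha_opp) ** 2

def welfare(th_A, th_B):
    """Sum of both players' payoffs."""
    s = th_A + th_B
    return -2 * s * s + 2 * s

lola_A, lola_B = simulate(0.3, -0.1)                    # opponent-aware learners
naive_A, naive_B = simulate(0.3, -0.1, alpha_opp=0.0)   # reduces to naive learners
```

Under these assumptions the simulated parameter sum of the opponent-aware learners settles at `predicted_sum(0.2)`, which is larger than the value of 1 reached by naive learners and yields lower welfare.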

Figure 2: Average rewards for different learners (LOLA or SOS) when playing the tandem game against other learners. Results are sampled over runs, shaded regions show one standard deviation.

In the following, we simulate the learning dynamics of LOLA and SOS learners not only against themselves, but also against each other. We use and draw initial parameters from . Results are in Fig. 2. While two SOS learners against each other receive higher rewards on average than two LOLA learners against each other, a LOLA learner is able to exploit an SOS learner.

If we consider the “game” in which two principals have to choose a learning algorithm for their agents, the choice between LOLA and SOS thus turns into a prisoner’s dilemma: both players choosing SOS is better for both players than both players choosing LOLA, but choosing LOLA is a dominant strategy (i.e., leads to better outcomes than choosing SOS irrespective of the opponent’s choice). Without some coordination regarding the choice of learning algorithm, we should thus expect that LOLA will be chosen in a competitive scenario.

3 Decision-making with opponent transparency

Consider a normal-form game between two players \(A\) and \(B\). We assume that both players’ decision-making is transparent to each other, and so each player is able to predict their opponent’s action against any other player, including themselves. Assume that players \(A\) and \(B\) have \(n_A\) and \(n_B\) actions available, respectively. The probabilistic reaction of \(A\) to predicted actions of \(B\) (when predicting how \(B\) will play against \(A\)) can be summarized in an \(n_A \times n_B\) matrix \(M_A\), with non-negative entries and columns that sum to unity. We further denote \(A\)’s probability of choosing an action based on predicting \(B\) (as opposed to an opponent-independent action) by \(s_A\), and their (\(n_A\)-dimensional) probability distribution over actions when choosing their action independently of their opponent by \(\pi_A\). Player \(A\)’s policy is thus completely determined by the tuple \((s_A, \pi_A, M_A)\), and analogously for player \(B\). \(\epsilon\)GroundedFairBot introduced by Oesterheld (2019) is an example of an agent fitting into this general framework.

Clearly, having both \(s_A = 1\) and \(s_B = 1\) will lead to an infinite regress, so we require that both \(s_A \le 1 - \epsilon\) and \(s_B \le 1 - \epsilon\), where in the following we will use . The resulting probability distribution \(p_A\) over \(A\)’s actions can then be calculated analytically. Indeed, it is given by

\[ p_A = \left(\mathbb{1} - s_A s_B M_A M_B\right)^{-1} \left[ (1 - s_A)\,\pi_A + s_A (1 - s_B)\, M_A \pi_B \right], \]

and analogously for player \(B\).

Note that while both players use the opponent’s predicted action against themselves to choose their action, the actual actions of the players, drawn with probabilities \(p_A\) and \(p_B\), are not correlated.
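For two-action games, the self-consistent action probabilities can be computed in closed form. The sketch below assumes that each policy is summarised by four probabilities: the probability of acting on a prediction of the opponent, the probability of cooperating when responding independently, and the probabilities of cooperating after a predicted cooperation or defection. The class and field names, as well as the value `eps=0.05`, are hypothetical and chosen for illustration only.

```python
from typing import NamedTuple

class Policy(NamedTuple):
    sim: float           # probability of acting on a prediction of the opponent
    indep_coop: float    # P(cooperate) when responding independently
    coop_after_c: float  # P(cooperate) after predicted cooperation
    coop_after_d: float  # P(cooperate) after predicted defection

def cooperation_probs(pol_a, pol_b, eps=0.05):
    """Solve c_a = a_a * c_b + b_a and c_b = a_b * c_a + b_b for (c_a, c_b)."""
    def affine(pol):
        sim = min(pol.sim, 1 - eps)  # enforce an opponent-independent response
        slope = sim * (pol.coop_after_c - pol.coop_after_d)
        offset = sim * pol.coop_after_d + (1 - sim) * pol.indep_coop
        return slope, offset
    a_a, b_a = affine(pol_a)
    a_b, b_b = affine(pol_b)
    # |a_a * a_b| <= (1 - eps)**2 < 1, so the denominator is never zero.
    c_a = (a_a * b_b + b_a) / (1 - a_a * a_b)
    c_b = a_b * c_a + b_b
    return c_a, c_b

fair_bot = Policy(sim=1.0, indep_coop=1.0, coop_after_c=1.0, coop_after_d=0.0)
defector = Policy(sim=0.0, indep_coop=0.0, coop_after_c=0.0, coop_after_d=0.0)
both_fair = cooperation_probs(fair_bot, fair_bot)
versus_defector = cooperation_probs(fair_bot, defector)
```

With these inputs, two mirroring agents cooperate with probability one, while the mirroring agent cooperates against an unconditional defector only with probability `eps`.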

4 Social dilemmas with opponent transparency

In this section, we bring together the techniques introduced in Sec. 2 and Sec. 3, and apply them to two well-known social dilemmas, the PD and the game of chicken.

4.1 Preliminaries

4.1.1 Social dilemmas

The available actions in the PD are cooperate (C) and defect (D). In the game of chicken they are swerving or going straight, which we will identify with cooperating and defecting, respectively, for easier comparison with the PD. Both games are described by four different payoffs, \(P\) (punishment for both players defecting), \(R\) (reward for both players cooperating), \(T\) (temptation of defecting against a cooperator), and \(S\) (sucker’s payoff for cooperating with a defector). The PD and the game of chicken share the following properties: \(R > P\) (mutual cooperation is preferable to mutual defection); \(R > S\) (mutual cooperation is preferable to being exploited); and \(T > R\) (there is an incentive to defect against a cooperator). It is often also assumed that \(2R > T + S\), such that mutual cooperation is preferable to alternating rounds of defection. The only difference between the two games is whether \(P\) or \(S\) is larger, that is, which action is preferable against a defector. In the PD, the best response against a defector is to defect oneself, while in the game of chicken, the best response is to cooperate. This makes defection a dominant strategy in the PD, while no dominant strategy is available in the game of chicken.
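The payoff orderings that separate the two games can be written as a small check. The function name and return labels are hypothetical; the numerical values in the usage lines are the payoffs from Tables 1 and 2.

```python
def classify_dilemma(T, R, P, S):
    """Classify a symmetric two-action game by its payoff ordering.

    T: temptation, R: reward, P: punishment, S: sucker's payoff.
    """
    shared = R > P and R > S and T > R
    if not shared:
        return "neither"
    if P > S:
        return "prisoner's dilemma"  # defecting is the best response to a defector
    if S > P:
        return "chicken"             # cooperating is the best response to a defector
    return "neither"

pd_label = classify_dilemma(T=40, R=30, P=10, S=0)
chicken_label = classify_dilemma(T=40, R=30, P=-30, S=0)
```

Both payoff tables also satisfy the optional condition that twice the reward exceeds the sum of temptation and sucker's payoff (60 > 40).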

Tables 1 and 2 show the numerical payoffs we will use in the following. We choose payoff differences which are sufficiently larger than unity to ensure that the parameter-to-payoff landscape is sufficiently curved, and the effects of taking into account e.g. how the opponent’s next gradient step will affect one’s own gradient become relevant. We note that in iterated games such as in Foerster et al. (2018) or Letcher et al. (2018), a similar effect is achieved by adding up all the (discounted) payoffs of an iterated game.

                C (cooperate)   D (defect)
C (cooperate)   30, 30          0, 40
D (defect)      40, 0           10, 10
Table 1: Payoff matrix for the prisoner’s dilemma.

                C (swerve)      D (straight)
C (swerve)      30, 30          0, 40
D (straight)    40, 0           -30, -30
Table 2: Payoff matrix for the game of chicken.

4.1.2 Parameter initialization and updates

Foerster et al. (2018) and Letcher et al. (2018) consider a setup in which the two learning agents repeatedly play an iterated PD, and update their parameters after each iterated PD. There are thus two levels of iteration: an inner one (the iterated PD), during which parameters are frozen, and an outer one, in which parameter updates happen. By contrast, we consider a single level of iteration and update parameters after each single-shot PD.

At the beginning of each experiment, we initialize all parameters from with . In order to calculate a player’s gradients using the LOLA gradient of Eq. (2) or its SOS counterpart, we need the parameter-to-payoff function. This function involves several steps. First, all parameters are translated to the probabilities that define a player’s policy using a sigmoid function. The probability of acting on a prediction of the opponent is clamped to the range \([0, 1 - \epsilon]\). Then we use the analytic expression from Sec. 3 to calculate the probability of each player cooperating, which leads to the probability of the four possible outcomes CC, CD, DC, and DD. Multiplying the probabilities of the four possible outcomes with the respective payoffs leads to the expected payoffs of the two players. We use PyTorch’s autograd functionality to evaluate the gradients appearing in the LOLA and SOS update rules.
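The last step of the parameter-to-payoff function, going from cooperation probabilities to expected payoffs, can be sketched in plain Python. This is an illustration rather than the PyTorch implementation used for the experiments; the dictionary layout and function name are assumptions, while the payoff values are those of Tables 1 and 2. Because the players' actual actions are not correlated, outcome probabilities are products of the individual action probabilities.

```python
# Payoffs as (row player, column player), keyed by (row action, column action).
PD = {("C", "C"): (30, 30), ("C", "D"): (0, 40),
      ("D", "C"): (40, 0), ("D", "D"): (10, 10)}
CHICKEN = {("C", "C"): (30, 30), ("C", "D"): (0, 40),
           ("D", "C"): (40, 0), ("D", "D"): (-30, -30)}

def expected_payoffs(c_a, c_b, table):
    """Expected payoffs given each player's probability of cooperating."""
    probs = {
        ("C", "C"): c_a * c_b,
        ("C", "D"): c_a * (1 - c_b),
        ("D", "C"): (1 - c_a) * c_b,
        ("D", "D"): (1 - c_a) * (1 - c_b),
    }
    v_a = sum(probs[o] * table[o][0] for o in probs)
    v_b = sum(probs[o] * table[o][1] for o in probs)
    return v_a, v_b

mutual_cooperation = expected_payoffs(1.0, 1.0, PD)
coin_flips = expected_payoffs(0.5, 0.5, PD)
```

In an autograd-based setup, the same arithmetic would be applied to tensors so that gradients with respect to the underlying parameters can be obtained automatically.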

4.2 Prisoner’s dilemma

4.2.1 Cooperativeness as a function of the opponent learning rate

We start by studying the impact of the learning rate used to take the opponent’s gradient step into account, while holding the actual learning rates constant. We use values , each learner performs gradient steps after random initialization, and we perform experiments for each value of . Results for two LOLA learners are shown in Fig. 3 and look substantially the same for SOS vs. SOS and LOLA vs. SOS. With , the LOLA learners are essentially naive learners who end up always defecting. The probability of mutual cooperation then increases with increasing up to a sweet spot around where mutual cooperation is almost guaranteed. Increasing even further to again increases the probability of other outcomes while mutual cooperation remains the most likely outcome.

Figure 3: Probability of the four different outcomes in the PD after gradient steps of two LOLA learners as a function of , using . The horizontal axis is logarithmic and spans three orders of magnitude for . Shaded regions show one standard error calculated as , where is the sample standard deviation and results are sampled over experiments.

4.2.2 Fluctuations around equilibrium

We next study the fluctuations around the learning equilibrium in which players mostly cooperate. Fig. 4 shows the average payoffs and their standard deviations for LOLA vs. LOLA and SOS vs. SOS learners using . Both types of learners achieve average payoffs close to the Pareto optimal value of , but the LOLA learners show smaller fluctuations around their average rewards.

Figure 4: Average payoff in a PD as a function of learning step for LOLA vs. LOLA learners and SOS vs. SOS learners, using . Shaded regions show one standard deviation. Results are sampled over experiments.

4.2.3 Parameters after convergence

In the two-player, two-action games we are investigating, agents’ policies are described by the four probabilities (probability of choosing an action based on predicting (“simulating”) the opponent), (probability of cooperating when giving an opponent-independent response), (probability of cooperating after opponent’s simulated cooperation), and (probability of cooperating after opponent’s simulated defection). We will now discuss the values towards which these probabilities converge.

Figure 5: Final parameters in a PD of LOLA learners with after learning steps. Parameters of the learner who achieves higher expected payoff are at the top. Error bars show one standard deviation.

Learners with achieve final payoffs (after learning steps) which are close to identical: the higher payoff is (standard deviation over experiments), the lower payoff is . Interestingly, the final parameters are clearly not identical, see Fig. 5. It is instructive to consider how these final parameters differ from (, , ). Both learners converge to , which provides an incentive for their opponent to cooperate. The learner with higher final payoff differs from by defecting with significant probability when giving an opponent-independent response, the learner with lower final payoff differs from by choosing an opponent-independent action with significant probability, and cooperating after simulated defection with significant probability.

We can provide a heuristic explanation of how such an equilibrium can emerge. When playing against “almost \(\epsilon\)GroundedFairBot” (say an agent that plays like \(\epsilon\)GroundedFairBot of the time, and randomly of the time), it is better to cooperate unconditionally than to be \(\epsilon\)GroundedFairBot. This can be seen with the help of the analytic expression from Sec. 3, or by considering that cooperating unconditionally prevents (simulated and real) retribution. Starting from a hypothetical state in which both agents are “almost \(\epsilon\)GroundedFairBot”, there is thus a gradient towards cooperating unconditionally, i.e. increasing and lowering . If starts moving towards unconditional cooperation, this invites defection from , i.e. is invited to lower their . The opponent-shaping gradient of will thus push against moving too far away from . In effect, in the rare cases where decides to act without simulating , they have a chance to defect against a cooperating opponent, which yields the highest possible reward ; this will lead to a slightly higher expected payoff for than for . Which of the two learners ends in which of the roles just described will be determined by small differences in the initial parameters.

4.2.4 The need for opponent-aware learning

We know that the parameter space available to the learners includes policies like , which provide incentives for the opponent to cooperate while being hard to exploit. We have seen in Fig. 3 that naive learners (or LOLA learners with ) do not converge towards such policies, but learn mutually defecting strategies. Here, we investigate whether mutual cooperation is at least stable for naive learners when they are initialized with parameters close to . We initialize all parameters to plus small noise from with , such that initially and .

Figure 6: Prisoner’s dilemma with learners whose parameters are initialized close to ( and ). Both opponents use the same learning algorithm. They are either naive gradient learners (with ) or LOLA learners (with ). Shaded regions show one standard deviation sampled over experiments.

Fig. 6 compares the outcomes with and without opponent-aware learning. Both two naive learners and two LOLA learners initially achieve high levels of cooperation. However, for the former cooperation starts degrading after a few hundred learning steps, while for the latter it remains stable over steps. As already discussed in Sec. 4.2.3, the reason for this is that when playing against “almost ”, the naive gradient points towards cooperating unconditionally, i.e. lowering . This can be observed for the naive learners in Fig. 6. As a consequence, the incentive to reward simulated cooperation with cooperation decreases, leading to decreasing . When the opponent-shaping gradient is present, stays close to unity, as seen for the LOLA learners.

Interestingly, both naive and LOLA learners end up more forgiving than (i.e., both have ). LOLA learners manage to strike a balance between forgiving simulated defection with significant probability (thus preventing cascades of simulated and real defection) and keeping low enough such that both agents are still incentivized to cooperate with high probability when choosing an opponent-independent action ().

4.2.5 No arms race towards higher opponent learning rates

In a competitive scenario there is no reason to assume that both learners use the same opponent learning rate . Indeed, in a scenario in which principals submit learning agents to compete with others, the principals might endow their agents with a high in the hope of their agent outwitting its opponent. It is thus interesting to study whether a learner with higher has an advantage over a learner with lower (which might lead to an “arms race” of learners with ever higher ), and how stable mutual cooperation is to learners with unequal opponent learning rates. Fig. 7 shows results for and . The learner using the higher opponent learning rate does not manage to achieve higher payoffs on average. For two LOLA learners, cooperation is stable over learning steps. For two SOS learners, on the other hand, cooperation slowly starts to deteriorate after a few hundred steps. It is still the case, however, that the learner with the higher opponent learning rate does not achieve higher payoffs, and that average payoffs remain clearly above the baseline of for mutual defection.

Figure 7: Average payoff in a PD as a function of learning step for LOLA vs. LOLA learners (top) and SOS vs. SOS learners (bottom), using and . Shown are average payoffs for the first learning steps. Shaded regions show one standard deviation. Results are sampled over experiments.

4.2.6 The game from the perspective of the principals

We now take the scenario of two principals submitting learning agents one step further. We can consider the “game” in which a strategy does not correspond to cooperating with some probability (as in the underlying PD), but to submitting a certain learning agent that will play multiple rounds of the underlying game (like the PD) against its opponent while updating its parameters. The principal then receives the reward of a final single-shot game. (For sufficiently long training periods, results do not change substantially if the principals instead receive the sum of all rewards received by their agent during training.) In such a scenario, each principal might be concerned not only with the learning of their own agent, but also with how they can shape the learning of their opponent’s agent.

Within the scope of this work, a learning agent is defined by the algorithm used to calculate its gradients (naive, LOLA, or SOS), its learning rate , and its opponent learning rate (for non-naive learners). This still produces a large space of possible learning agents. For simplicity, we consider just five representatives. We fix for all learners and consider a naive learner, and LOLA and SOS learners with or .

Fig. 8 shows the results (average reward received in the final single-shot game) for all possible encounters. A number of insights can be gained from these results:

  • No learner ever gets less than the “safe value” which can be guaranteed by always defecting.

  • LOLA learners with manage to get more than against naive learners.

  • Naive learners benefit a lot from playing against (LOLA or SOS) learners with large opponent learning rate . In fact, a naive learner is the best response against an SOS learner with .

  • Larger does not necessarily do better. Against (LOLA or SOS) learners with , the best response uses or .

  • The game defined by the payoff matrix in Fig. 8 has a Nash equilibrium in which one principal submits a LOLA learner with and the other submits a LOLA learner with . Both principals submitting a LOLA learner with is also a Nash equilibrium within error bars. These Nash equilibria produce payoffs close to for both players.

Submitting a LOLA learner with creates an incentive for the opponent to also submit a LOLA learner, which leads to payoffs much larger than (which are obtained by two naive learners) and in fact close to the Pareto-efficient value for both players. In some sense, our framework of learning with opponent transparency thus transforms the PD, with its default outcome of mutual defection, into a much easier game in which mutual cooperation is the default outcome.

Figure 8: Average payoff in the final single-shot PD after learning steps. Shown is the payoff of the row strategy versus the column strategy. All strategies use a learning rate . Error bars are calculated as , where is the sample standard deviation and is the number of experiments per strategy pair. Cells are highlighted if the row strategy is a best response versus the column strategy within statistical error bars.
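The highlighting rule and the equilibrium claims above can be checked mechanically from a table of mean payoffs and error bars. The following is a minimal sketch, not the authors' code. It assumes the usual standard error of the mean (sample standard deviation divided by the square root of the number of experiments), and it assumes one particular reading of "best response within statistical error bars": a cell qualifies if its upper bound reaches the largest lower bound in its column. The function names are hypothetical.

```python
import math
from statistics import stdev


def standard_error(samples):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return stdev(samples) / math.sqrt(len(samples))


def best_responses(mean, err):
    """Cells (row, col) where the row strategy is a best response to col.

    mean[r][c] is the average payoff of row strategy r versus column
    strategy c, and err[r][c] is its error bar.
    ASSUMED criterion: mean + err of the cell is at least the largest
    (mean - err) found in the same column.
    """
    n_rows, n_cols = len(mean), len(mean[0])
    cells = set()
    for c in range(n_cols):
        threshold = max(mean[r][c] - err[r][c] for r in range(n_rows))
        for r in range(n_rows):
            if mean[r][c] + err[r][c] >= threshold:
                cells.add((r, c))
    return cells


def pure_nash(mean, err):
    """Pure-strategy equilibria of the symmetric game between principals.

    The table must be square. A pair (i, j) with i <= j is reported if
    i is a best response to j and j is a best response to i.
    """
    br = best_responses(mean, err)
    n = len(mean)
    return [(i, j) for i in range(n) for j in range(i, n)
            if (i, j) in br and (j, i) in br]
```

Because the table lists only the row strategy's payoff, the sketch relies on the game between the principals being symmetric: the column principal's payoff for the pair (i, j) is read from cell (j, i).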

4.3 Game of chicken

4.3.1 Baseline outcome for naive learners

Two naive learners in the PD simply learn to always defect against each other, even when given the opportunity to predict their opponent against themselves. In the game of chicken, defection is no longer the best response against a defecting opponent, and so mutual defection is not a stable outcome for naive learners. Indeed, one of the two naive learners will learn to always defect while the other learns to always cooperate, yielding an average payoff of . Which of the two learners becomes the “defector” and which one the “cooperator” is determined by small differences in the initial parameters.

4.3.2 Outwitting the opponent

When using opponent-aware learners instead of naive learners, it becomes possible for one learner to navigate its opponent towards its preferred equilibrium. Fig. 9 shows two examples of this. In both of these examples, one learner manages to get their opponent to cooperate with certainty, while itself defecting with significant probability. As a result, it ends up receiving an average reward larger than at test time. The left part of the figure shows an SOS learner with higher opponent learning rate taking advantage of its opponent with lower , suggesting that an “arms race” of choosing ever higher values of might ensue. The right part of the figure shows that – somewhat surprisingly – when using equal parameters () the more cautious SOS learner can outmanoeuvre a LOLA learner. The LOLA learner here receives less on average than the “baseline” of which two naive learners achieve against each other.

In the game of chicken, one promising opponent-aware strategy might be to move towards unconditional defection ( and ), forcing the opponent to cooperate lest they receive the lowest possible payoff . (This strategy in the game of chicken is sometimes described as “throwing the steering wheel out the window”, i.e. visibly committing to going straight.) This is not what we observe in the two examples in Fig. 9. It is the case that in both examples the player who receives the higher payoff uses , while moving their opponent to . However, in both examples the player who receives the higher payoff actually is the one with the higher . So the better performing player manoeuvres their opponent towards unconditional cooperation, while themselves making their action highly dependent on the opponent’s predicted action.

Figure 9: Two examples of a learner outmanoeuvring their opponent in a game of chicken. Left: Two SOS learners with and . Right: A LOLA learner vs. an SOS learner with . In the “Outcome probabilities” plot, shaded regions show one standard error calculated as , where is the sample standard deviation; in the other plots, shaded regions show one sample standard deviation. Results are sampled from experiments.

4.3.3 Outcomes as a function of the opponent learning rate

Fig. 10 shows the results of both learners using ever larger values of (while holding constant). For sufficiently small values of , one of the players will always cooperate while their opponent will always defect. For some intermediate values of , mutual cooperation becomes the most likely outcome for two LOLA learners. Two SOS learners by contrast do not achieve a high probability of mutual cooperation for any value of , and for sufficiently large values of , disaster strikes with both players defecting and receiving the lowest possible payoff .

Figure 10: Outcome probabilities (after gradient steps) in a game of chicken as a function of (while holding constant), showing LOLA vs. LOLA at the top and SOS vs. SOS at the bottom. The horizontal axis is logarithmic and spans three orders of magnitude for . Shaded regions show one standard error calculated as , where is the sample standard deviation and .

For large enough , SOS learners display “arrogant behavior” in that they expect that them defecting will compel their opponent to cooperate. This is especially noteworthy since SOS was specifically designed to avoid such arrogant behavior. The reason why SOS does not manage to avoid arrogant behavior here is that the SOS gradient is designed to have non-negative inner product with the Look Ahead gradient (see Appendix A), not necessarily with the naive gradient . If both players are defecting, the naive gradient will point towards defecting less, but the Look Ahead gradient might (for large enough ) not, since it anticipates the opponent defecting less. This provides a clear example in which incorrectly anticipating one’s opponent proves harmful.

4.3.4 The game from the perspective of the principals

We now consider again the scenario in which two principals submit learning agents that will play on their behalf. As discussed, against an opponent with low or intermediate , there is an incentive to submit an agent with higher . As we will see, however, the reverse is also true: the best response to an SOS learner with high is a learner with intermediate , and so the disaster of mutual defection is an unlikely outcome.

Fig. 11 shows the full payoff matrix for the same learners as used when discussing the PD. Unlike in the PD, some strategies achieve the lowest () or highest () possible payoff against certain other strategies. It is noteworthy that the SOS learners manage to take advantage of the LOLA learners irrespective of the opponent learning rate . There is a Nash equilibrium in which one principal submits an SOS learner with and the other principal submits an SOS learner with ; these are mutual best responses within statistical error bars. The average payoffs that result when choosing between these two learning agents (i.e. the cells at the bottom-right in Fig. 11) again satisfy the inequalities defining the game of chicken. The game of chicken thus reproduces itself at a higher level (with somewhat different payoffs), where “going straight” now corresponds to choosing a high value of and “swerving” corresponds to choosing an intermediate value of .
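The claim that a two-strategy sub-game "satisfies the inequalities defining the game of chicken" can be made concrete with a small check. The sketch below assumes the standard textbook orderings for symmetric two-action games: temptation > reward > punishment > sucker's payoff for the prisoner's dilemma, and temptation > reward > sucker's payoff > punishment for chicken. The function names and the indexing convention are hypothetical and not taken from the paper.

```python
def classify_symmetric_game(r, s, t, p):
    """Classify a symmetric 2x2 game by its payoff ordering.

    r: payoff when both cooperate (swerve)
    s: payoff for cooperating against a defector
    t: payoff for defecting against a cooperator
    p: payoff when both defect (go straight)
    """
    if t > r > p > s:
        return "prisoner's dilemma"
    if t > r > s > p:
        return "chicken"
    return "other"


def classify_subgame(payoff, swerve, straight):
    """Classify the 2x2 sub-game spanned by two strategies of a larger table.

    payoff[i][j] is the row strategy i's average payoff versus column
    strategy j. `swerve` and `straight` are the indices of the two
    strategies playing the roles of cooperating and defecting.
    """
    return classify_symmetric_game(
        r=payoff[swerve][swerve],
        s=payoff[swerve][straight],
        t=payoff[straight][swerve],
        p=payoff[straight][straight],
    )
```

Applied to the bottom-right cells of the payoff table, with the intermediate-rate learner as `swerve` and the high-rate learner as `straight`, this check would return "chicken" if the ordering described in the text holds.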

Figure 11: Average payoff in the final single-shot game of chicken after learning steps. Shown is the payoff of the row strategy versus the column strategy. All strategies use a learning rate . Error bars are calculated as , where is the sample standard deviation and is the number of experiments per strategy pair. Cells are highlighted if the row strategy is a best response versus the column strategy within statistical error bars.

5 Discussion

We have studied a scenario in which two principals submit learning agents that repeatedly interact in a social dilemma. At test time, the agents play a single round of the social dilemma game, and the resulting rewards are the principals’ payoffs. Both players’ decision-making is mutually transparent, and both learners are aware of each other’s learning and take it into account when updating their own parameters. It is not at all clear how to do this “optimally”. We have considered two approaches discussed in previous literature, LOLA and SOS, after adapting them to our setup. We have seen that LOLA and SOS learners can achieve outcomes that are drastically different from those achieved by naive gradient descent learners, including mutual cooperation in the prisoner’s dilemma, which yields Pareto-optimal outcomes, but also mutual defection in the game of chicken, which yields the worst possible outcome for both players.

We have found that both LOLA and SOS have their own advantages and drawbacks. In particular:

  • Given the choice between LOLA and SOS, choosing LOLA is a dominant strategy in the tandem game (Fig. 2).

  • Mutual cooperation in the prisoner’s dilemma is more robust to using different opponent learning rates for LOLA than for SOS (Fig. 7 and Fig. 8).

  • SOS learners outmanoeuvre LOLA learners in the game of chicken, even if the LOLA learner uses a higher opponent learning rate than the SOS learner (Fig. 9 and Fig. 11).

  • When two SOS learners with high opponent learning rate face each other in the game of chicken, they both get the lowest possible payoff (Fig. 10).

This leads to two important open problems. Firstly, develop an opponent-aware learning algorithm that improves upon LOLA and SOS. Secondly, develop a better understanding of when and how one learning agent manages to manoeuvre its opponent towards its preferred equilibrium in the game of chicken, or games involving an equilibrium selection problem more generally.

From the principals’ perspective, the underlying social dilemma game gets transformed into a different game: which learning agent is a good response to the learning agent that their opponent might submit? We have seen that the prisoner’s dilemma and the game of chicken behave very differently under this transformation. In the prisoner’s dilemma, there are learning agents which are a best response against themselves (within the limited set of learning agents considered in this work), and lead to mutual cooperation with high probability. The game of chicken, by contrast, offers more room to outwit the opponent by choosing a higher opponent learning rate than the opponent’s. The choice between a “high” and an “intermediate” value of is again a game of chicken, so in a sense the game of chicken reproduces itself at a higher level. The scenario of learning under mutual transparency thus facilitates cooperative outcomes in the prisoner’s dilemma, but not necessarily in the game of chicken.

In the prisoner’s dilemma, mutual awareness of each other’s inner workings is helpful for achieving cooperative outcomes. This idea has existed for decades (at least since Hofstadter (1983)), and we have now validated it in the paradigm of gradient-based machine learning. On the other hand, we have seen that in the game of chicken, in which both learning agents attempt to navigate the opponent towards their preferred equilibrium, it is hard to predict what kind of learning agent succeeds against which other. It is even possible that opponent-aware learners produce the worst possible outcome for themselves and so do worse than more naive learning agents would. Developing techniques that are guaranteed to achieve acceptable outcomes in games that involve an equilibrium selection problem is thus an important open problem if we want to avoid worst-case outcomes in multi-agent interactions.


Acknowledgements

Many thanks to Jakob Foerster and Alistair Letcher for answering questions about opponent-aware learning, to Jesse Clifton and Daniel Kokotajlo for helpful discussions, and to Caspar Oesterheld for careful reading of the manuscript.

Appendix A Calculating the scaling parameter in SOS

This appendix describes how to calculate the scaling parameter in SOS. It is based on Letcher et al. (2018) and Remark 4.7 in Letcher (2018), adapted to our framework. The calculation involves two hyperparameters , where we follow the recommendation of Letcher et al. (2018) to use and .

Let us define the Look Ahead (LA) gradient


given by the first two summands of Eq. (3). Further let


denote the leading-order opponent-shaping correction, i.e. the third summand in Eq. (3). The SOS gradient can then be written as


Let if and otherwise. Let if and otherwise. Choosing ensures that always has non-negative inner product with . Choosing ensures that SOS converges to SFPs. In order to guarantee both of these while making the opponent-shaping correction as strong as possible, choose .
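The two-criterion selection of the scaling parameter can be sketched in code. This is a minimal illustration and not the authors' implementation. The concrete thresholds follow the general form of the SOS algorithm as we understand it from Letcher et al., and should be treated as assumptions here. The names `la_grad`, `correction`, `naive_grad`, `a` and `b` are hypothetical. `correction` denotes the opponent-shaping term exactly as it would be added to the Look Ahead gradient at full strength, so the sketch does not depend on a sign convention. The hyperparameters `a` and `b` are deliberately left without default values.

```python
import math


def dot(u, v):
    """Inner product of two equally long sequences of floats."""
    return sum(x * y for x, y in zip(u, v))


def sos_scaling(la_grad, correction, naive_grad, a, b):
    """Return the scaling parameter p in [0, 1] (ASSUMED form of the rule).

    la_grad:    Look Ahead gradient
    correction: opponent-shaping term as added to la_grad when p = 1
    naive_grad: naive gradient
    a, b:       hyperparameters, both assumed to lie in (0, 1)
    """
    # First criterion: keep a non-negative inner product between the
    # final gradient and the Look Ahead gradient.
    inner = dot(correction, la_grad)
    if inner >= 0:
        p1 = 1.0
    else:
        p1 = min(1.0, -a * dot(la_grad, la_grad) / inner)

    # Second criterion: shrink the correction near fixed points, where
    # the naive gradient is small.
    norm = math.sqrt(dot(naive_grad, naive_grad))
    p2 = 1.0 if norm >= b else norm ** 2

    # Strongest correction that satisfies both criteria.
    return min(p1, p2)


def sos_gradient(la_grad, correction, naive_grad, a, b):
    """Look Ahead gradient plus the scaled opponent-shaping correction."""
    p = sos_scaling(la_grad, correction, naive_grad, a, b)
    return [g + p * c for g, c in zip(la_grad, correction)]
```

Under the first criterion, the inner product of the returned gradient with `la_grad` is at least (1 - a) times the squared norm of `la_grad`, which is non-negative for `a` in (0, 1). Nothing in this rule constrains the inner product with `naive_grad`.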