# COLA: Consistent Learning with Opponent-Learning Awareness

Learning in general-sum games can be unstable and often leads to socially undesirable, Pareto-dominated outcomes. To mitigate this, Learning with Opponent-Learning Awareness (LOLA) introduced opponent shaping to this setting, by accounting for the agent's influence on the anticipated learning steps of other agents. However, the original LOLA formulation (and follow-up work) is inconsistent because LOLA models other agents as naive learners rather than LOLA agents. In previous work, this inconsistency was suggested as a cause of LOLA's failure to preserve stable fixed points (SFPs). First, we formalize consistency and show that higher-order LOLA (HOLA) solves LOLA's inconsistency problem if it converges. Second, we correct a claim made in the literature, by proving that, contrary to Schäfer and Anandkumar (2019), Competitive Gradient Descent (CGD) does not recover HOLA as a series expansion. Hence, CGD also does not solve the consistency problem. Third, we propose a new method called Consistent LOLA (COLA), which learns update functions that are consistent under mutual opponent shaping. It requires no more than second-order derivatives and learns consistent update functions even when HOLA fails to converge. However, we also prove that even consistent update functions do not preserve SFPs, contradicting the hypothesis that this shortcoming is caused by LOLA's inconsistency. Finally, in an empirical evaluation on a set of general-sum games, we find that COLA finds prosocial solutions and that it converges under a wider range of learning rates than HOLA and LOLA. We support the latter finding with a theoretical result for a simple game.

## 1 Introduction

Much research in deep multi-agent reinforcement learning (MARL) has focused on zero-sum games like Starcraft and Go (Silver et al., 2017; Vinyals et al., 2019) or fully cooperative settings (Oroojlooyjadid and Hajinezhad, 2019). However, many real-world problems, e.g. self-driving cars, contain both cooperative and competitive elements, and are thus better modeled as general-sum games. One such game is the famous Prisoner’s Dilemma (Axelrod and Hamilton, 1981), in which agents have an individual incentive to defect against their opponent, even though they would prefer the outcome in which both cooperate to the one where both defect. A strategy for the infinitely iterated version of the game (IPD) is tit-for-tat, which starts out cooperating and otherwise mirrors the opponent’s last move. It achieves mutual cooperation when chosen by both players and has proven to be successful at IPD tournaments (Axelrod and Hamilton, 1981). If MARL algorithms are deployed in the real world, it is essential that they are able to cooperate with others and entice others to cooperate with them, using strategies such as tit-for-tat (Dafoe et al., 2021; Stastny et al., 2021). However, naive gradient descent and other more sophisticated methods (Korpelevich, 1977; Mescheder et al., 2017; Balduzzi et al., 2018; Mazumdar et al., 2019; Schäfer and Anandkumar, 2019) converge to the mutual defection policy under random initialization (Letcher et al., 2019b).

An effective paradigm to improve learning in general-sum games is opponent shaping, where agents take into account their influence on the anticipated learning step of the other agents. LOLA (Foerster et al., 2018a) was the first work to make explicit use of opponent shaping and is one of the only general learning methods designed for general-sum games that obtains mutual cooperation with the tit-for-tat strategy in the IPD. While LOLA discovers these prosocial equilibria, the original LOLA formulation is inconsistent because LOLA agents assume that their opponent is a naive learner. This assumption is clearly violated if two LOLA agents learn together. It has been suggested that this inconsistency is the cause for LOLA’s main shortcoming, which is not maintaining the stable fixed points (SFPs) of the underlying game, even in some simple quadratic games (Letcher 2018, p. 2, 26; see also Letcher et al. 2019b).

#### Contributions.

To address LOLA’s inconsistency, we first revisit the concept of higher-order LOLA (HOLA) (Foerster et al., 2018a) in Section 4.1. For example, second-order LOLA assumes that the opponent is a first-order LOLA agent (which in turn assumes the opponent is a naive learner) and so on. Supposing that HOLA converges with increasing order, we define infinite-order LOLA (iLOLA) as the limit. Intuitively, two iLOLA agents have a consistent view of each other since they accurately account for the learning behavior of the opponent under mutual opponent shaping. Based on this idea, we introduce a formal definition of consistency and prove that, if it exists, iLOLA is indeed consistent (Proposition 1).

Second, in Section 4.2, we correct a claim made in previous literature, which would have provided a closed-form solution of the iLOLA update. According to Schäfer and Anandkumar (2019), the Competitive Gradient Descent (CGD) algorithm recovers HOLA as a series expansion. If true, this would imply that CGD coincides with iLOLA, thus solving LOLA’s inconsistency problem. We prove that this is untrue: CGD’s series expansion does, in general, not recover HOLA, CGD does not correspond to iLOLA, and CGD does not solve the inconsistency problem (Proposition 2).

In lieu of a closed-form solution, a naive way of computing the iLOLA update is to iteratively compute higher orders of LOLA until convergence. However, there are two main problems with addressing consistency using a limiting update: the process may diverge and typically requires arbitrarily high derivatives. To address these, in Section 4.3, we propose Consistent LOLA (COLA) as a more robust and efficient alternative. COLA learns a pair of consistent update functions by explicitly minimizing a differentiable measure of consistency inspired by our formal definition. We use the representational power of neural networks and gradient based optimization to minimize this loss, resulting in learned update functions that are mutually consistent. By reframing the problem as such, we only require up to second-order derivatives.

In Section 4.4, we prove initial results about COLA. First, we show that COLA’s solutions are not necessarily unique. Second, despite being consistent, COLA does not recover SFPs, contradicting the prior belief that this shortcoming is caused by inconsistency. Third, to show the benefit of additional consistency, we prove that COLA converges under a wider range of look-ahead (LA) rates than LOLA in a simple general-sum game.

Finally, in Sections 5 and 6, we report our experimental setup and results, investigating COLA and HOLA and comparing COLA to LOLA and CGD in a range of games. We experimentally confirm our theoretical result that CGD does not equal iLOLA. Moreover, we show that COLA converges under a wider range of look-ahead rates than HOLA and LOLA, and that it is generally able to find socially desirable solutions. It is the only algorithm consistently converging to the fair solution in the Ultimatum game, and while it does not find tit-for-tat in the IPD (unlike LOLA), it does learn policies with near-optimal total payoff. We find that COLA learns consistent update functions even when HOLA diverges with higher order and its updates are similar to iLOLA when HOLA converges. Although COLA solutions are not unique in theory, COLA empirically tends to find similar solutions over different runs.

## 2 Related work

General-sum learning algorithms have been investigated from different perspectives in the reinforcement learning, game theory, and GAN literature (Schmidhuber, 1991; Barto and Mahadevan, 2003; Goodfellow et al., 2014; Racanière et al., 2017). Next, we will highlight a few of the approaches to the mutual opponent shaping problem.

Opponent modeling maintains an explicit belief of the opponent, allowing the agent to reason over the opponent's strategies and compute optimal responses. Opponent modeling can be divided into different subcategories: there are classification methods, which classify the opponents into pre-defined types (Weber and Mateas, 2009; Synnaeve and Bessière, 2011), and policy reconstruction methods, where we explicitly predict the actions of the opponent (Mealing and Shapiro, 2017). Most closely related to opponent shaping is recursive reasoning, where methods model nested beliefs of the opponents (He and Boyd-Graber, 2016; Albrecht and Stone, 2017; Wen et al., 2019).

In comparison, COLA assumes that we have access to the ground-truth model of the opponent, e.g., the opponent’s payoff function, parameters, and gradients, putting COLA into the framework of differentiable games (Balduzzi et al., 2018). Various methods have been proposed, investigating the local convergence properties to different solution concepts (Mescheder et al., 2017; Mazumdar et al., 2019; Letcher et al., 2019b; Schäfer and Anandkumar, 2019; Azizian et al., 2020; Schäfer et al., 2020; Hutter, 2021). Most of the work in differentiable games has not focused on opponent shaping or consistency. Mescheder et al. (2017) and Mazumdar et al. (2019) focus solely on zero-sum games without shaping. To improve upon LOLA, Letcher et al. (2019b) suggested Stable Opponent Shaping (SOS), which applies ad-hoc corrections to the LOLA update, leading to theoretically guaranteed convergence to SFPs. However, despite its desirable convergence properties, SOS still does not solve the conceptual issue of inconsistent assumptions about the opponent. CGD (Schäfer and Anandkumar, 2019) addresses the inconsistency issue for zero-sum games but not for general-sum games. The exact difference between CGD, LOLA and our method is addressed in Section 4.2.

## 3 Background

### 3.1 Differentiable games

The framework of differentiable games has become increasingly popular to model multi-agent learning. Whereas stochastic games are limited to parameters such as action-state probabilities, differentiable games generalize to any real-valued parameter vectors and differentiable loss functions (Balduzzi et al., 2018). We restrict our attention to two-player games, as is standard in much of the literature (Foerster et al., 2018a, b; Letcher et al., 2019b; Schäfer and Anandkumar, 2019).

###### Definition 1 (Differentiable games).

In a two-player differentiable game, players $i \in \{1, 2\}$ control parameters $\theta_i \in \mathbb{R}^{d_i}$ to minimize twice continuously differentiable losses $L_i : \mathbb{R}^{d_1 + d_2} \to \mathbb{R}$. We adopt the convention to write $-i$ to denote the respective other player.

A fundamental challenge of the multi-loss setting is finding a good solution concept. Whereas in the single-loss setting the typical solution concept is a local minimum, in multi-loss settings there are several sensible solution concepts. Most prominently, there are Nash Equilibria (Osborne and Rubinstein, 1994). However, Nash Equilibria include unstable saddle points that cannot be reasonably found via gradient-based learning algorithms (Letcher et al., 2019b). A more suitable concept is that of stable fixed points (SFPs), which can be considered the differentiable-game analogue of local minima in single-loss optimization. We omit a formal definition here for brevity and point the reader to previous work on the topic (Letcher et al., 2019a).

### 3.2 LOLA

Consider a differentiable game with two players. A LOLA agent uses its access to the opponent’s parameters to differentiate through a learning step of the opponent. That is, agent 1 reformulates their loss to $L_1(\theta_1, \theta_2 + \Delta\theta_2)$, where $\Delta\theta_2$ represents the assumed learning step of the opponent. In first-order LOLA we assume the opponent to be a naive learner: $\Delta\theta_2 = -\alpha \nabla_2 L_2$. Here, $\nabla_2$ denotes the gradient with respect to $\theta_2$, and $\alpha$ represents the look-ahead rate, which is the assumed learning rate of the opponent. The naive-learner assumption makes LOLA inconsistent when the opponent is any other type of learner. The look-ahead rate may differ from the opponent’s actual learning rate, but we will only consider equal learning rates and look-ahead rates across opponents for simplicity. In the original paper the loss was approximated using a Taylor expansion $L_1 + (\nabla_2 L_1)^\top \Delta\theta_2$. For agent 1, their first-order Taylor LOLA update is then defined as

$$\Delta\theta_1 = -\alpha\left(\nabla_1 L_1 + \nabla_{12} L_1 \, \Delta\theta_2 + (\nabla_1 \Delta\theta_2)^\top \nabla_2 L_1\right).$$

Alternatively, in exact LOLA, the derivative is taken directly with respect to the reformulated loss, yielding the update

$$\Delta\theta_1 = -\alpha \nabla_1 \left(L_1(\theta_1, \theta_2 + \Delta\theta_2)\right).$$

LOLA has had some empirical success, being one of the first general learning methods to discover tit-for-tat in the IPD. However, later work showed that LOLA does not preserve SFPs: for example, the rightmost term in the equation for Taylor LOLA can be nonzero at an SFP. In fact, LOLA agents show “arrogant” behavior: they assume they can shape the learning of their naive opponents without having to adapt to the shaping of the opponent. Prior work hypothesized that this arrogant behavior is due to LOLA’s inconsistent formulation and may be the cause for LOLA’s failure to preserve SFPs (Letcher (2018), p. 2, 26; Letcher et al. (2019b)).
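To make the failure to preserve fixed points concrete, here is a minimal Python sketch that compares a naive step with an exact first-order LOLA step. It is an illustration under stated assumptions, not the authors' implementation: the Tandem-game losses `L1 = (x + y)^2 - 2x` and `L2 = (x + y)^2 - 2y` are assumed (this section does not spell them out), parameters are scalars, and gradients are taken by central finite differences (the helper `d` is hypothetical), which are exact up to rounding for these quadratic losses.

```python
# Assumed Tandem-game losses; treat these as an assumption, not a quotation.
def L1(x, y):
    return (x + y) ** 2 - 2 * x

def L2(x, y):
    return (x + y) ** 2 - 2 * y

def d(f, t, eps=1e-4):
    """Central finite difference of a scalar function f at t."""
    return (f(t + eps) - f(t - eps)) / (2 * eps)

def naive_step(x, y, alpha):
    """Simultaneous gradient descent step for both players."""
    dx = -alpha * d(lambda u: L1(u, y), x)
    dy = -alpha * d(lambda v: L2(x, v), y)
    return dx, dy

def exact_lola_step(x, y, alpha):
    """Exact first-order LOLA: each player differentiates its own loss
    through the naive step it assumes the opponent will take."""
    def shaped_L1(u):
        dy = -alpha * d(lambda v: L2(u, v), y)  # opponent's naive step as a function of u
        return L1(u, y + dy)

    def shaped_L2(v):
        dx = -alpha * d(lambda u: L1(u, v), x)  # opponent's naive step as a function of v
        return L2(x + dx, v)

    return -alpha * d(shaped_L1, x), -alpha * d(shaped_L2, y)

alpha = 0.1
x, y = 0.5, 0.5  # x + y = 1, so both naive gradients vanish here
naive = naive_step(x, y, alpha)
lola = exact_lola_step(x, y, alpha)
print("naive step:", naive)
print("LOLA step: ", lola)
```

Under these assumptions the naive step is zero at any point with `x + y = 1`, while a short calculation gives a LOLA step of `4 * alpha**2` for each player, so LOLA moves away from a point at which naive learning is stationary.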

### 3.3 CGD

CGD (Schäfer and Anandkumar, 2019) proposes updates that are themselves Nash Equilibria of a local bilinear approximation of the game. It stands out by its robustness to different look-ahead rates and its ability to find SFPs. However, CGD does not find tit-for-tat on the IPD, instead converging to mutual defection (see Figure 2(e)). CGD’s update rule is given by

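$$\begin{pmatrix} \Delta\theta_1 \\ \Delta\theta_2 \end{pmatrix} = -\alpha \begin{pmatrix} \mathrm{Id} & \alpha\nabla_{12} L_1 \\ \alpha\nabla_{21} L_2 & \mathrm{Id} \end{pmatrix}^{-1} \begin{pmatrix} \nabla_1 L_1 \\ \nabla_2 L_2 \end{pmatrix}.$$

One can recover different orders of CGD by approximating the inverse matrix via the series expansion $(\mathrm{Id} - A)^{-1} = \lim_{N\to\infty} \sum_{i=0}^{N} A^i$. For example, at $N=1$, we recover a version called Linearized CGD (LCGD), defined via $\Delta\theta_1 := -\alpha\nabla_1 L_1 + \alpha^2 \nabla_{12} L_1 \nabla_2 L_2$.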
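The relation between full CGD and its truncated series can be checked numerically. The sketch below is an illustration only, restricted to scalar parameters so that the matrix inverse has a closed form; the function names and the sample derivative values are hypothetical. Here `g1`, `g2` stand for the gradients of each player's own loss and `a`, `b` for the mixed second derivatives of `L1` and `L2`.

```python
def cgd_step(g1, g2, a, b, alpha):
    """Full CGD update for scalar parameters.

    Solves [[1, alpha*a], [alpha*b, 1]] @ (u1, u2) = (g1, g2) in closed form
    and returns -alpha * (u1, u2)."""
    det = 1.0 - alpha ** 2 * a * b
    d1 = -alpha * (g1 - alpha * a * g2) / det
    d2 = -alpha * (g2 - alpha * b * g1) / det
    return d1, d2

def cgd_series_step(g1, g2, a, b, alpha, order):
    """CGD with the inverse replaced by sum_{i=0}^{order} A^i,
    where A = -alpha * [[0, a], [b, 0]]."""
    v1, v2 = g1, g2  # current term A^i applied to the gradient vector
    s1, s2 = g1, g2  # running partial sum
    for _ in range(order):
        v1, v2 = -alpha * a * v2, -alpha * b * v1
        s1, s2 = s1 + v1, s2 + v2
    return -alpha * s1, -alpha * s2

# Hypothetical sample values, chosen so that the series converges.
g1, g2, a, b, alpha = 1.0, -0.5, 2.0, 2.0, 0.1
full = cgd_step(g1, g2, a, b, alpha)
lcgd = cgd_series_step(g1, g2, a, b, alpha, order=1)
high = cgd_series_step(g1, g2, a, b, alpha, order=50)
print("full CGD:", full)
print("LCGD:    ", lcgd)
print("order 50:", high)
```

In the scalar case the series converges when `alpha**2 * abs(a * b) < 1`; outside that range the truncated updates need not approach the full CGD update.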

## 4 Method and theory

In this section, we formally define iLOLA and consistency under mutual opponent shaping and show that iLOLA is consistent, thus in principle addressing LOLA’s inconsistency problem. We then clarify the relation between CGD and iLOLA, correcting a false claim in Schäfer and Anandkumar (2019). Lastly, we introduce COLA as an alternative to iLOLA and present some initial theoretical analysis, including the result that, contrary to prior belief, even consistent update functions do not recover SFPs.

### 4.1 Convergence and consistency of higher-order LOLA

The original formulation of LOLA is inconsistent when two LOLA agents learn together, because LOLA agents assume their opponent is a naive learner. To address this problem, we define and analyze iLOLA. In this section, we focus on exact LOLA, but we provide a version of our analysis for Taylor LOLA in Appendix C. HOLA is defined by the recursive relation

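$$h^{n+1}_1 := -\alpha\nabla_1\left(L_1(\theta_1, \theta_2 + h^n_2)\right), \qquad h^{n+1}_2 := -\alpha\nabla_2\left(L_2(\theta_1 + h^n_1, \theta_2)\right)$$

with $h^{-1}_1 = h^{-1}_2 = 0$, omitting arguments for convenience. In particular, HOLA0 coincides with simultaneous gradient descent while HOLA1 coincides with LOLA.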

###### Definition 2 (iLOLA).

If HOLA$n$ $= (h^n_1, h^n_2)$ converges pointwise as $n \to \infty$, define

$$\mathrm{iLOLA} := \lim_{n\to\infty} \begin{pmatrix} h^n_1 \\ h^n_2 \end{pmatrix}$$

as the limiting update.
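The recursion can be followed by hand in a simple bilinear game. The sketch below is a worked illustration under an assumption: it uses the losses `L1 = x * y` and `L2 = -x * y` with scalar parameters, chosen here for tractability rather than taken from this section. In that game every exact HOLA update is linear, `h1 = a*x + b*y` and `h2 = c*x + d*y`, so the recursion reduces to a map on the four coefficients; the function names are hypothetical.

```python
def hola_coeffs(alpha, order):
    """Coefficients (a, b, c, d) of exact HOLA of the given order in the assumed
    game L1 = x*y, L2 = -x*y, where h1 = a*x + b*y and h2 = c*x + d*y."""
    a = b = c = d = 0.0  # h^{-1} = 0
    for _ in range(order + 1):
        # h1 <- -alpha * d/dx [x * (y + c*x + d*y)]
        # h2 <- -alpha * d/dy [-(x + a*x + b*y) * y]
        a, b, c, d = (-2 * alpha * c,
                      -alpha * (1 + d),
                      alpha * (1 + a),
                      2 * alpha * b)
    return a, b, c, d

def limit_coeffs(alpha):
    """Fixed point of the coefficient map above, obtained by solving it by hand."""
    s = 1 + 2 * alpha ** 2
    return (-2 * alpha ** 2 / s, -alpha / s, alpha / s, -2 * alpha ** 2 / s)

print("order 0 :", hola_coeffs(0.5, 0))   # naive learning: (0, -alpha, alpha, 0)
print("order 1 :", hola_coeffs(0.5, 1))   # first-order LOLA
print("order 60:", hola_coeffs(0.5, 60))
print("limit   :", limit_coeffs(0.5))
print("order 20, alpha = 1:", hola_coeffs(1.0, 20))
```

In this assumed game the coefficient map contracts only when `2 * alpha**2 < 1`. For `alpha = 0.5` the iterates settle at the fixed point, whereas for `alpha = 1` they grow without bound even though the fixed point itself still exists.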

We show in Appendix A that HOLA does not always converge, even in simple quadratic games. But, unlike LOLA, iLOLA satisfies a criterion of consistency whenever HOLA does converge (under some assumptions), formally defined as follows:

###### Definition 3 (Consistency).

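Any update functions $f_1$ and $f_2$ are consistent (under mutual opponent shaping with look-ahead rate $\alpha$) if for all $\theta_1, \theta_2$, they satisfy

$$f_1(\theta_1, \theta_2) = -\alpha\nabla_1\left(L_1(\theta_1, \theta_2 + f_2(\theta_1, \theta_2))\right), \tag{1}$$

$$f_2(\theta_1, \theta_2) = -\alpha\nabla_2\left(L_2(\theta_1 + f_1(\theta_1, \theta_2), \theta_2)\right). \tag{2}$$

###### Proposition 1.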

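Let HOLA$n$ $= (h^n_1, h^n_2)$ denote both players’ exact $n$-th order LOLA updates. Assume that $\lim_{n\to\infty} h^n_i(\theta) = h_i(\theta)$ and $\lim_{n\to\infty} \nabla_i h^n_{-i}(\theta) = \nabla_i h_{-i}(\theta)$ exist for all $\theta$ and $i \in \{1, 2\}$. Then iLOLA is consistent under mutual opponent shaping.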

###### Proof.

In Appendix B. ∎

### 4.2 CGD does not recover higher-order LOLA

Schäfer and Anandkumar (2019) claim that “LCGD [linearized CGD] coincides with first order LOLA” (page 6), and moreover that the higher-order “series-expansion [of CGD] would recover higher-order LOLA” (page 4). If this were correct, it would imply that full CGD is equal to iLOLA and thus provides a convenient closed-form solution. We prove that this is false in general games:

###### Proposition 2.

CGD is inconsistent and does not in general coincide with iLOLA. In particular, LCGD does not coincide with LOLA and the series-expansion of CGD does not recover HOLA (neither exact nor Taylor). Instead, LCGD recovers LookAhead (Zhang and Lesser, 2010), an algorithm that lacks opponent shaping, and the series-expansion of CGD recovers higher-order LookAhead.

###### Proof.

In Appendix D. For the negative results, it suffices to construct a single counterexample: we show that LCGD and LOLA differ almost everywhere in the Tandem game (excluding a set of measure zero). We prove by contradiction that the series-expansion of CGD does not recover HOLA. If it did, CGD would equal iLOLA, and by Proposition 1, CGD would satisfy the consistency equations. However, this fails almost everywhere in the Tandem game, concluding the contradiction. ∎

### 4.3 COLA

iLOLA is consistent under mutual opponent shaping. However, HOLA does not always converge and, even when it does, it may be expensive to recursively compute HOLA$n$ for sufficiently high $n$ to achieve convergence. As an alternative, we propose COLA.

COLA learns consistent update functions and avoids infinite regress by directly solving the equations in Definition 3. We define the consistency losses for learned update functions $f_1, f_2$ parameterized by $\phi_1, \phi_2$, obtained for a given $(\theta_1, \theta_2)$ as the difference between RHS and LHS in Definition 3:

$$C_1(\phi_1, \phi_2, \theta_1, \theta_2) = \left\| f_1 + \alpha\nabla_1\left(L_1(\theta_1, \theta_2 + f_2)\right) \right\|,$$

$$C_2(\phi_1, \phi_2, \theta_1, \theta_2) = \left\| f_2 + \alpha\nabla_2\left(L_2(\theta_1 + f_1, \theta_2)\right) \right\|.$$

If both losses are $0$ for all $(\theta_1, \theta_2)$, then the two update functions defined by $\phi_1, \phi_2$ are consistent. For this paper, we parameterise $f_1$, $f_2$ as neural networks with parameters $\phi_1$, $\phi_2$, respectively, and numerically minimize the sum of both losses over a region of interest.

The parameter region of interest $\Theta$ depends on the game being played. For games with probabilities as actions, we select an area that captures most of the probability space (e.g. we sample a pair of parameters $\theta_1, \theta_2$ from a bounded interval, since the sigmoid function $\sigma$ maps its endpoints close to 0 and 1).

We optimize the mean of the sum of consistency losses,

$$C(\phi_1, \phi_2) := \mathbb{E}_{(\theta_1, \theta_2) \sim \mathcal{U}(\Theta)}\left[C_1(\phi_1, \phi_2, \theta_1, \theta_2) + C_2(\phi_1, \phi_2, \theta_1, \theta_2)\right],$$

by sampling parameter pairs $(\theta_1, \theta_2)$ uniformly from $\Theta$ and feeding them to the neural networks $f_1, f_2$, each outputting an agent’s parameter update. The weights $\phi_1, \phi_2$ are then updated by taking a gradient step to minimize $C$. We train the update functions until the loss has converged and use the learned update functions to train a pair of agent policies in the given game.
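The training loop can be sketched on a toy problem. The following Python code is a deliberately simplified stand-in, not the procedure described above: it assumes the bilinear game `L1 = x * y`, `L2 = -x * y`, replaces the neural networks by linear update functions `f1 = a*x + b*y` and `f2 = c*x + d*y`, minimizes squared residuals rather than norms, evaluates them on a fixed grid instead of uniform samples, and uses hand-derived gradients instead of automatic differentiation. All names and hyperparameters are hypothetical.

```python
def consistency_residuals(phi, alpha, x, y):
    """Residuals f_i + alpha * grad_i(L_i(...)) for linear update functions
    in the assumed game L1 = x*y, L2 = -x*y."""
    a, b, c, d = phi
    r1 = (a + 2 * alpha * c) * x + (b + alpha * (1 + d)) * y
    r2 = (c - alpha * (1 + a)) * x + (d - 2 * alpha * b) * y
    return r1, r2

def train_linear_cola(alpha, steps=500, lr=0.3):
    """Full-batch gradient descent on the mean of 0.5 * (r1^2 + r2^2)."""
    grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
    points = [(x, y) for x in grid for y in grid]  # stand-in for sampling from Theta
    phi = [0.0, 0.0, 0.0, 0.0]
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0, 0.0]
        for x, y in points:
            r1, r2 = consistency_residuals(phi, alpha, x, y)
            # Hand-derived gradient of 0.5 * (r1^2 + r2^2) w.r.t. (a, b, c, d).
            grad[0] += r1 * x - r2 * alpha * x
            grad[1] += r1 * y - r2 * 2 * alpha * y
            grad[2] += r1 * 2 * alpha * x + r2 * x
            grad[3] += r1 * alpha * y + r2 * y
        phi = [p - lr * g / len(points) for p, g in zip(phi, grad)]
    return phi

alpha = 1.0
phi = train_linear_cola(alpha)
print("learned (a, b, c, d):", phi)
print("residuals at (0.3, -0.7):", consistency_residuals(phi, alpha, 0.3, -0.7))
```

Because the residuals are linear in the coefficients, this toy problem has a unique minimizer, which can also be solved by hand: `a = d = -2*alpha**2 / (1 + 2*alpha**2)`, `b = -alpha / (1 + 2*alpha**2)` and `c = alpha / (1 + 2*alpha**2)`. The step size was picked for `alpha = 1` and may need to be reduced for larger look-ahead rates.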

### 4.4 Theoretical results for COLA

In this section, we provide some initial theoretical results for COLA’s uniqueness and convergence behavior, using the Tandem game (Letcher et al., 2019b) and the Hamiltonian game (Balduzzi et al., 2018) as examples. These are simple polynomial games, with losses given in Section 5. Proofs for the following propositions can be found in Appendices E, F and G, respectively.

First, we show that solutions to the consistency equations are in general not unique, even when restricting to linear update functions in the Tandem game. Interestingly, empirically, COLA does seem to consistently converge to similar solutions regardless (see Table 7 in Appendix I.3).

###### Proposition 3.

Solutions to the consistency equations are not unique, even when restricted to linear solutions; more precisely, there exist several linear consistent solutions to the Tandem game.

Second, we show that consistent solutions do not, in general, preserve SFPs, contradicting the hypothesis that LOLA’s failure to preserve SFPs is due to its inconsistency (Letcher (2018), p. 2, 26; Letcher et al. (2019b)). We experimentally support this result in Section 6.

###### Proposition 4.

Consistency does not imply preservation of SFPs: there is a consistent solution to the Tandem game with that fails to preserve any SFP. Moreover, for any , there are no linear consistent solutions to the Tandem game that preserve more than one SFP.

Third, we show that COLA can have more robust convergence behavior than LOLA and SOS:

###### Proposition 5.

For any non-zero initial parameters and any , LOLA and SOS have divergent iterates in the Hamiltonian game. By contrast, any linear solution to the consistency equations converges to the origin for any initial parameters and any look-ahead rate ; moreover, the speed of convergence strictly increases with .
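The contrast in this proposition can be illustrated numerically. The sketch below is an illustration under assumptions rather than a reproduction of the proof: it takes the Hamiltonian game to be `L1 = x * y`, `L2 = -x * y` with scalar parameters, uses the exact first-order LOLA update and the linear consistent update derived by hand for that game, and picks `alpha = 1` as an example value. SOS is not included, and the function names are hypothetical.

```python
import math

def iterate(matrix, x, y, steps):
    """Apply the linear map theta <- matrix @ theta for a number of steps."""
    (m11, m12), (m21, m22) = matrix
    for _ in range(steps):
        x, y = m11 * x + m12 * y, m21 * x + m22 * y
    return x, y

def lola_map(alpha):
    """theta + exact first-order LOLA update in the assumed game:
    dx = -alpha * (y + 2*alpha*x), dy = alpha * (x - 2*alpha*y)."""
    return ((1 - 2 * alpha ** 2, -alpha), (alpha, 1 - 2 * alpha ** 2))

def consistent_map(alpha):
    """theta + linear consistent update in the assumed game (derived by hand)."""
    s = 1 + 2 * alpha ** 2
    return ((1 / s, -alpha / s), (alpha / s, 1 / s))

def contraction_factor(alpha):
    """Modulus of the eigenvalues (1 +/- i*alpha) / (1 + 2*alpha^2) of consistent_map."""
    return math.sqrt(1 + alpha ** 2) / (1 + 2 * alpha ** 2)

alpha = 1.0
lola_final = iterate(lola_map(alpha), 1.0, 1.0, 50)
cola_final = iterate(consistent_map(alpha), 1.0, 1.0, 50)
print("LOLA norm after 50 steps:      ", math.hypot(*lola_final))
print("consistent norm after 50 steps:", math.hypot(*cola_final))
print("contraction factors:", [contraction_factor(a) for a in (0.5, 1.0, 2.0)])
```

In this assumed setting the LOLA iterates grow by a factor of `sqrt(2)` per step at `alpha = 1`, while the consistent update contracts toward the origin, and its contraction factor shrinks as `alpha` grows.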

## 5 Experiments

We perform experiments on a set of games from the literature (Balduzzi et al., 2018; Letcher et al., 2019b) using LOLA, SOS and CGD as baselines. For details on the training procedure of COLA, we refer the reader to Appendix H.

First, we compare HOLA and COLA on polynomial general-sum games, including the Tandem game (Letcher et al., 2019b), where LOLA fails to converge to SFPs. Second, we investigate non-polynomial games, such as the zero-sum Matching Pennies (MP) game, the general-sum Ultimatum game (Hutter, 2021) and the IPD (Axelrod and Hamilton, 1981; Harper et al., 2017).

#### Polynomial games.

Losses in the Tandem game (Letcher et al., 2019b) are given by and for agent 1 and 2 respectively. The Tandem game was introduced to show that LOLA fails to preserve SFPs at and instead converges to Pareto-dominated solutions (Letcher et al., 2019b). In addition to the Tandem game, we investigate the algorithms on the Hamiltonian game, and ; and the Balduzzi game, where and (Balduzzi et al., 2018).

#### Matching Pennies.

The payoff matrix for MP (Lee and K, 1967) is shown in Appendix I.3 in Table 6. Each policy is parameterized with a single parameter, the log-odds of choosing heads. In this game, the unique Nash equilibrium is playing heads half the time.

#### Ultimatum game.

The binary, single-shot Ultimatum game (Güth et al., 1982; Sanfey et al., 2003; Oosterbeek et al., 2004; Henrich et al., 2006) is set up as follows. There are two players, player A and B. Player A has access to . They can split the money fairly with B ( for each player) or they can split it unfairly ( for player A, for player B). Player B can either accept or reject the proposed split. If player B rejects, the reward is 0 for both players. If player B accepts, the reward follows the proposed split. Player A’s parameter is the log-odds of proposing a fair split . Player B’s parameter is the log-odds of accepting the unfair split (assuming that player B always accepts fair splits) . We then have and

#### IPD.

We next investigate the IPD (Axelrod and Hamilton, 1981; Harper et al., 2017) with discount factor and the usual payout function (see Appendix I.6). An agent is defined through 5 parameters, the log-odds of cooperating in the first time step and across each of the four possible tuples of past actions of both players in the later steps.

## 6 Results

First, we report and compare the learning outcomes achieved by COLA and our baselines. We find that COLA update functions converge even under high look-ahead rates and learn socially desirable solutions. We also confirm our theoretical result (Proposition 2) that CGD does not equal iLOLA, contradicting Schäfer and Anandkumar (2019), and that COLA does not, in general, maintain SFPs (Proposition 4), contradicting the prior belief that this shortcoming is caused by inconsistency.

Second, we provide a more in-depth empirical analysis and comparison of the COLA and HOLA update functions, showing that COLA and HOLA tend to coincide when the latter converges, and that COLA is able to find consistent solutions even when HOLA diverges. Moreover, while COLA’s solutions are not unique in theory (Proposition 3), we empirically find that in our examples COLA tends to find similar solutions across different independent training runs. Additional results supporting the above findings are reported in Appendix I.

#### Learning Outcomes.

In the Tandem game (Figure 0(d)), we see that COLA and HOLA8 converge to similar outcomes in the game, whereas CGD does not. This supports our theoretical result that CGD does not equal iLOLA (Proposition 2). We also see that COLA does not recover SFPs, thus experimentally confirming Proposition 4. In contrast to LOLA, HOLA and SOS, COLA finds a convergent solution even at a high look-ahead rate (see COLA:0.8) (Figure 3(b) in Appendix I.1). CGD is the only other algorithm in the comparison that also shows robustness to high look-ahead rates in the Tandem game.

On the IPD, all algorithms find the defect-defect strategy at low look-ahead rates (Figure 2(b)). At high look-ahead rates, COLA finds a strategy qualitatively similar to tit-for-tat, as displayed in Figure 2(f), though noisier. However, COLA still achieves close to the optimal total loss, in contrast to CGD, which finds defect-defect even at a high look-ahead rate (see Figure 13 in Appendix I.6). The fact that, unlike HOLA and COLA, CGD finds defect-defect, further confirms that CGD does not equal iLOLA.

On MP at high look-ahead rates, SOS and LOLA mostly do not converge, whereas COLA converges even faster with a high look-ahead rate (see Figure 1(a)), confirming Proposition 5 experimentally (also see Figures 5(b) and 6(b) in Appendix I.2). To further investigate the influence of consistency on learning behavior, we plot the consistency of an update function against the variance of the losses across learning steps achieved by that function, for different orders of HOLA and for COLA (Figure 1(b)). At a high look-ahead rate in Matching Pennies, we find that more consistent update functions tend to lead to lower variance across training, demonstrating a potential benefit of increased consistency at least at high look-ahead rates.

For the Ultimatum game, we find that COLA is the only method that finds the fair solution consistently at a high look-ahead rate, whereas SOS, LOLA, and CGD do not (Figure 0(f)). At low look-ahead rates, all algorithms find the unfair solution (see Figure 9(b) in Appendix I.4). This demonstrates an advantage of COLA over our baselines and shows that higher look-ahead rates can lead to better learning outcomes.

Lastly, we introduce the Chicken game in Appendix I.5. Both Taylor LOLA and SOS crash, whereas COLA, HOLA, CGD, and Exact LOLA swerve at high look-ahead rates (Figure 11(d)). Crashing in Chicken results in a catastrophic payout for both agents, whereas swerving results in a jointly preferable outcome. Interestingly, in contrast to Taylor LOLA, exact LOLA swerves. The Chicken game is the only game where we found a difference in learning behavior between exact LOLA and Taylor LOLA.

#### Update functions.

Next, we provide a more in-depth empirical analysis of the COLA and HOLA update functions. First, we investigate how increasing the order of HOLA affects the consistency of its updates. As we show in Tables 1, 2 and 3, HOLA’s updates become more consistent with increasing order, but only below a certain, game-specific look-ahead rate threshold. Above that threshold, HOLA’s updates become less consistent with increasing order.

Second, we compare the consistency losses of COLA and HOLA. In the aforementioned tables, we observe that COLA achieves low consistency losses on most games. Below the threshold, COLA finds similarly low consistency losses as HOLA, though there HOLA’s are lower in the non-polynomial games. Above the threshold, COLA finds consistent updates, even when HOLA does not. A visualization of the update function learned by COLA at a high look-ahead rate on the MP is given in Figure 1(c).

For the IPD, COLA’s consistency losses are high compared to other games, but much lower than HOLA’s consistency losses at high look-ahead rates. We leave it to future work to find methods that obtain more consistent solutions.

Third, we are interested in whether COLA and HOLA find similar solutions. We calculate the cosine similarity between the respective update functions over $\Theta$. As we show in Tables 1, 2 and 3, COLA and HOLA find very similar solutions when HOLA’s updates converge, i.e., when the look-ahead rate is below the threshold. Above the threshold, COLA’s and HOLA’s updates unsurprisingly become less similar with increasing order, as HOLA’s updates diverge with increasing order.

Lastly, we investigate Proposition 3 empirically and find that COLA finds similar solutions in Tandem and MP over 5 training runs (see Table 7 in Appendix I.3). Moreover, the small standard deviations in Table 3 indicate that COLA also finds similar solutions over different runs in the IPD.

## 7 Conclusion and Future Work

In this paper, we corrected a claim made in prior work (Schäfer and Anandkumar, 2019), clearing up the relation between the CGD and LOLA algorithms. We also showed that iLOLA solves part of the consistency problem of LOLA. Next, we introduced COLA, which finds consistent solutions without requiring many recursive computations like iLOLA. It was believed that inconsistency leads to arrogant behaviour and to the failure to preserve SFPs. We showed that even with consistency, opponent shaping behaves arrogantly, pointing towards a fundamental open problem for the method.

In a set of games, we found that COLA tends to find prosocial solutions. Although COLA’s solutions are not unique in theory, empirically, COLA tends to find similar solutions in different runs. It coincides with iLOLA when HOLA converges and finds consistent update functions even when HOLA fails to converge with increasing order. Moreover, we showed empirically (and in one case theoretically) that COLA update functions converge under a wider range of look-ahead rates than HOLA and LOLA update functions.

This work raises many questions for future work, such as the existence of solutions to the COLA equations in general games and general properties of convergence and learning outcomes. Moreover, additional work is needed to scale COLA to large settings such as GANs or Deep RL, or settings with more than two players. Another interesting axis is addressing further inconsistent aspects of LOLA as identified in Letcher et al. (2019b).

## References

• S. V. Albrecht and P. Stone (2017) Reasoning about hypothetical agent behaviours and their parameters. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pp. 547–555. Cited by: §2.
• R. Axelrod and W. D. Hamilton (1981) The evolution of cooperation. Science 211 (4489), pp. 1390–1396. External Links: https://www.science.org/doi/pdf/10.1126/science.7466396 Cited by: §1, §5, §5.
• W. Azizian, I. Mitliagkas, S. Lacoste-Julien, and G. Gidel (2020) A tight and unified analysis of gradient-based methods for a whole spectrum of differentiable games. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, Vol. 108, pp. 2863–2873. Cited by: §2.
• D. Balduzzi, S. Racanière, J. Martens, J. N. Foerster, K. Tuyls, and T. Graepel (2018) The mechanics of n-player differentiable games. In Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 80, pp. 363–372. Cited by: Appendix D, Appendix D, §I.2, §I.2, §1, §2, §3.1, §4.4, §5, §5.
• A. G. Barto and S. Mahadevan (2003) Recent advances in hierarchical reinforcement learning. Discret. Event Dyn. Syst. 13 (1-2), pp. 41–77. Cited by: §2.
• A. Dafoe, E. Hughes, Y. Bachrach, T. Collins, K. R. McKee, J. Z. Leibo, K. Larson, and T. Graepel (2021) Open problems in cooperative ai. In Cooperative AI workshop, Cited by: §1.
• J. Foerster, R. Y. Chen, M. Al-Shedivat, S. Whiteson, P. Abbeel, and I. Mordatch (2018a) Learning with opponent-learning awareness. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp. 122–130. Cited by: §1, §1, §3.1.
• J. N. Foerster, G. Farquhar, M. Al-Shedivat, T. Rocktäschel, E. P. Xing, and S. Whiteson (2018b) DiCE: the infinitely differentiable monte carlo estimator. In Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 80, pp. 1524–1533. Cited by: §3.1.
• I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, Vol. 27, pp. 2672–2680. Cited by: §2.
• W. Güth, R. Schmittberger, and B. Schwarze (1982) An experimental analysis of ultimatum bargaining. Journal of Economic Behavior & Organization 3 (4), pp. 367–388 (en). Cited by: §5.
• M. Harper, V. Knight, M. Jones, G. Koutsovoulos, N. E. Glynatsi, and O. Campbell (2017) Reinforcement learning produces dominant strategies for the iterated prisoner’s dilemma. PLOS ONE 12 (12), pp. e0188046. Cited by: §5, §5.
• H. He and J. L. Boyd-Graber (2016) Opponent modeling in deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, JMLR Workshop and Conference Proceedings, Vol. 48, pp. 1804–1813. Cited by: §2.
• J. Henrich, R. Boyd, S. Bowles, C. Camerer, E. Fehr, and H. Gintis (2006) Foundations of Human Sociality: Economic Experiments and Ethnographic Evidence From Fifteen Small-Scale Societies. In American Anthropologist, Vol. 108. Cited by: §5.
• A. Hutter (2021) Learning in two-player games between transparent opponents. Note: arXiv preprint arXiv:2012.02671 Cited by: §2, §5.
• D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, Cited by: §H.1, §H.2.
• G. M. Korpelevich (1977) The extragradient method for finding saddle points and other problems. Vol. 13, pp. 35–49. Cited by: §1.
• K. Lee and L. K (1967) The Application of Decision Theory and Dynamic Programming to Adaptive Control Systems. Thesis, (en_US). Cited by: §I.3, §5.
• A. Letcher, D. Balduzzi, S. Racanière, J. Martens, J. N. Foerster, K. Tuyls, and T. Graepel (2019a) Differentiable game mechanics. J. Mach. Learn. Res. 20, pp. 84:1–84:40. Cited by: §3.1.
• A. Letcher, J. N. Foerster, D. Balduzzi, T. Rocktäschel, and S. Whiteson (2019b) Stable opponent shaping in differentiable games. In 7th International Conference on Learning Representations, Cited by: Appendix D, Appendix D, Appendix D, Appendix G, Figure 4, §1, §1, §2, §3.1, §3.1, §3.2, §4.4, §5, §5, §5, §7.
• A. Letcher (2018) Stability and exploitation in differentiable games. Master’s Thesis, University of Oxford. Cited by: §1, §3.2, §4.4.
• E. V. Mazumdar, M. I. Jordan, and S. S. Sastry (2019) On finding local nash equilibria (and only local nash equilibria) in zero-sum games. Note: arXiv preprint arXiv:1901.00838 Cited by: §1, §2.
• R. Mealing and J. L. Shapiro (2017) Opponent modeling by expectation-maximization and sequence prediction in simplified poker. IEEE Trans. Comput. Intell. AI Games 9 (1), pp. 11–24. Cited by: §2.
• L. M. Mescheder, S. Nowozin, and A. Geiger (2017) The numerics of gans. In Advances in Neural Information Processing Systems, Vol. 30, pp. 1825–1835. Cited by: §1, §2.
• H. Oosterbeek, R. Sloof, and G. van de Kuilen (2004) Cultural Differences in Ultimatum Game Experiments: Evidence from a Meta-Analysis. Experimental Economics 7 (2), pp. 171–188. Cited by: §5.
• A. Oroojlooyjadid and D. Hajinezhad (2019) A review of cooperative multi-agent deep reinforcement learning. Note: arXiv preprint arXiv:1908.03963 Cited by: §1.
• M. J. Osborne and A. Rubinstein (1994) A course in game theory. The MIT Press, Cambridge, MA. Cited by: §3.1.
• A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, Vol. 32, pp. 8024–8035. Cited by: Appendix H.
• S. Racanière, T. Weber, D. P. Reichert, L. Buesing, A. Guez, D. J. Rezende, A. P. Badia, O. Vinyals, N. Heess, Y. Li, R. Pascanu, P. W. Battaglia, D. Hassabis, D. Silver, and D. Wierstra (2017) Imagination-augmented agents for deep reinforcement learning. In Advances in Neural Information Processing Systems, Vol. 30, pp. 5690–5701. Cited by: §2.
• A. G. Sanfey, J. K. Rilling, J. A. Aronson, L. E. Nystrom, and J. D. Cohen (2003) The Neural Basis of Economic Decision-Making in the Ultimatum Game. Science 300 (5626), pp. 1755–1758 (en). Cited by: §5.
• F. Schäfer, A. Anandkumar, and H. Owhadi (2020) Competitive mirror descent. Note: arXiv preprint arXiv:2006.10179 Cited by: §2.
• F. Schäfer and A. Anandkumar (2019) Competitive gradient descent. In Advances in Neural Information Processing Systems, Vol. 32, pp. 7623–7633. Cited by: Appendix D, Appendix D, §1, §1, §2, §3.1, §3.3, §4.2, §4, §6, §7.
• J. Schmidhuber (1991) A possibility for implementing curiosity and boredom in model-building neural controllers. In Proc. of the International Conference on Simulation of Adaptive Behavior: From Animals to Animats, J. A. Meyer and S. W. Wilson (Eds.), pp. 222–227. Cited by: §2.
• D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. P. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis (2017) Mastering the game of go without human knowledge. Nat. 550 (7676), pp. 354–359. Cited by: §1.
• J. Stastny, M. Riché, A. Lyzhov, J. Treutlein, A. Dafoe, and J. Clifton (2021) Normative disagreement as a challenge for cooperative ai. In Cooperative AI workshop, Cited by: §1.
• G. Synnaeve and P. Bessière (2011) A bayesian model for opening prediction in RTS games with application to starcraft. In IEEE Conference on Computational Intelligence and Games, pp. 281–288. Cited by: §2.
• O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, J. Oh, D. Horgan, M. Kroiss, I. Danihelka, A. Huang, L. Sifre, T. Cai, J. P. Agapiou, M. Jaderberg, A. S. Vezhnevets, R. Leblond, T. Pohlen, V. Dalibard, D. Budden, Y. Sulsky, J. Molloy, T. L. Paine, Ç. Gülçehre, Z. Wang, T. Pfaff, Y. Wu, R. Ring, D. Yogatama, D. Wünsch, K. McKinney, O. Smith, T. Schaul, T. P. Lillicrap, K. Kavukcuoglu, D. Hassabis, C. Apps, and D. Silver (2019) Grandmaster level in starcraft II using multi-agent reinforcement learning. Nat. 575 (7782), pp. 350–354. Cited by: §1.
• B. G. Weber and M. Mateas (2009) A data mining approach to strategy prediction. In Proceedings of the 2009 IEEE Symposium on Computational Intelligence and Games, pp. 140–147. Cited by: §2.
• Y. Wen, Y. Yang, R. Luo, J. Wang, and W. Pan (2019) Probabilistic recursive reasoning for multi-agent reinforcement learning. In 7th International Conference on Learning Representations, Cited by: §2.
• C. Zhang and V. R. Lesser (2010) Multi-agent learning with policy prediction. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, Cited by: Proposition 2.

## Appendix A Nonconvergence of HOLA in the Tandem Game

In the following, we show that for the choice of look-ahead rate $\alpha = 1$, HOLA does not converge in the Tandem game. This shows that given a large enough look-ahead rate, even in a simple quadratic game, HOLA need not converge.

###### Proposition 6.

Let $L^1, L^2$ be the two players’ loss functions in the Tandem game as defined in Section 5:

$$L^1(x,y) = (x+y)^2 - 2x \quad\text{and}\quad L^2(x,y) = (x+y)^2 - 2y, \qquad (3)$$

and let $h^n_i$ denote the $n$-th order exact LOLA update for player $i$ (where $n = 0$ denotes naive learning). Consider the look-ahead rate $\alpha = 1$. Then the functions $h^n_i$ for $i = 1, 2$ do not converge pointwise.

###### Proof.

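We will prove the auxiliary statement that

$$h^n_i(x,y) = 2^{n+2} - 2(1+x+y)$$

for $i = 1, 2$ and all $n \geq 0$. It then follows trivially that the $h^n_i$ cannot converge, since $2^{n+2}$ grows without bound at every fixed $(x, y)$.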

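The auxiliary result can be proven by induction. The base case $n = 0$ follows from

$$h^0_i(x,y) = -\nabla_i L^i(x,y) = 2 - 2(x+y) = 2^2 - 2(1+x+y)$$

for $i = 1, 2$. Next, for the inductive step, we have to show that the closed form above satisfies the HOLA recursion with $\alpha = 1$,

$$\begin{aligned} h^n_1(x,y) &= -\nabla_1\big(L^1(x, y + h^{n-1}_2(x,y))\big) && (4)\\ h^n_2(x,y) &= -\nabla_2\big(L^2(x + h^{n-1}_1(x,y), y)\big) && (5) \end{aligned}$$

for any $n \geq 1$. Substituting the inductive hypothesis in the second step, we have

$$\begin{aligned} &\phantom{=}\; -\nabla_1\big(L^1(x, y + h^{n-1}_2(x,y))\big) && (6)\\ &= -\nabla_1\big(L^1(x, y + 2^{n+1} - 2(1+x+y))\big) && (7)\\ &= -\nabla_1\big((x + y + 2^{n+1} - 2 - 2x - 2y)^2 - 2x\big) && (8)\\ &= -\nabla_1\big((-x - y + 2^{n+1} - 2)^2 - 2x\big) && (9)\\ &= 2(-x - y + 2^{n+1} - 2) + 2 && (10)\\ &= 2^{n+2} - 2(1+x+y) && (11)\\ &= h^n_1(x,y). && (12) \end{aligned}$$

The derivation for $h^n_2$ is exactly analogous. This shows the inductive step and thus finishes the proof. ∎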
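As a sanity check on the closed form, the following Python sketch iterates the exact HOLA recursion of Equations 4 and 5 in the Tandem game. It relies on an assumption that is not part of the proof: every update is written as an affine function of $s = x + y$, $h^n_i(x,y) = c_n + d_n s$, a form the recursion preserves and which is identical for both players by symmetry. The function names are hypothetical.

```python
from fractions import Fraction


def hola_step(c, d, alpha):
    """Coefficients of h^n(s) = c' + d'*s given h^{n-1}(s) = c + d*s, with s = x + y.

    Uses h^n_1 = -alpha * d/dx[(s + h^{n-1}(s))^2 - 2x]
              = -alpha * [2*(1 + d)*(c + (1 + d)*s) - 2].
    Player 2's update has the same coefficients by symmetry.
    """
    c_new = -alpha * (2 * (1 + d) * c - 2)
    d_new = -alpha * 2 * (1 + d) ** 2
    return c_new, d_new


def hola_coefficients(alpha, max_order):
    """Return [(c_0, d_0), ..., (c_max_order, d_max_order)], starting from naive learning."""
    alpha = Fraction(alpha)
    c, d = 2 * alpha, -2 * alpha  # h^0(s) = -alpha * (2*s - 2)
    coeffs = [(c, d)]
    for _ in range(max_order):
        c, d = hola_step(c, d, alpha)
        coeffs.append((c, d))
    return coeffs


if __name__ == "__main__":
    for n, (c, d) in enumerate(hola_coefficients(1, 6)):
        # For alpha = 1 the proposition predicts c_n = 2^(n+2) - 2 and d_n = -2.
        print(n, c, d, 2 ** (n + 2) - 2)
```

For $\alpha = 1$ the slope stays at $-2$ while the constant term follows $c_n = 2c_{n-1} + 2$, which is the doubling seen in the proof.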

## Appendix B Proof of Proposition 1

To begin, recall that some differentiable game with continuously differentiable loss functions $L^1, L^2$ is given, and that $h^n_i$ denotes the $n$-th order exact LOLA update function of player $i$. We assume that the iLOLA update function $h = (h_1, h_2)$ exists, defined via

$$h_i(\theta) := \lim_{n\to\infty} h^n_i(\theta),$$

for $i = 1, 2$ and all $\theta$.

To prove Proposition 1, we need to show that $h_1, h_2$ are consistent, i.e., satisfy Definition 3, under the assumption that

$$\lim_{n\to\infty} \nabla_i h^n_{-i}(\theta) = \nabla_i h_{-i}(\theta)$$

for $i = 1, 2$ and any $\theta$.

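To that end, define the (exact) LOLA operator $\Psi$ as the function mapping a pair of update functions $f = (f_1, f_2)$ to the RHS of Equations 1 and 2,

$$\begin{aligned} \Psi_1(f)(\theta) &:= -\alpha\nabla_1\big(L^1(\theta_1, \theta_2 + f_2(\theta_1,\theta_2))\big) && (13)\\ \Psi_2(f)(\theta) &:= -\alpha\nabla_2\big(L^2(\theta_1 + f_1(\theta_1,\theta_2), \theta_2)\big) && (14) \end{aligned}$$

for any $\theta$. Note that then we have $\Psi(h^n) = h^{n+1}$, i.e., $\Psi$ maps $n$-th order LOLA to $(n+1)$-th order LOLA.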

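In the following, we show that iLOLA is a fixed point of the LOLA operator, i.e., $\Psi(h) = h$. It follows from the definition of $\Psi$ that then $h$ is consistent. We denote by $\|\cdot\|$ the Euclidean norm or the induced operator norm for matrices. We focus on showing $\Psi_1(h) = h_1$. The case $i = 2$ is exactly analogous.

For arbitrary $\theta$ and $n$, define $\hat\theta_2 := \theta_2 + h_2(\theta)$ and $\hat\theta^n_2 := \theta_2 + h^n_2(\theta)$ as the updated parameters of player $2$. First, it is helpful to show that $\Psi_1(h^n)(\theta)$ converges to $\Psi_1(h)(\theta)$: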

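$$\begin{aligned} 0 &\leq \big\|\Psi_1(h)(\theta) - \Psi_1(h^n)(\theta)\big\| && (15)\\ &= \alpha\,\big\|\nabla_1\big(L^1(\theta_1, \theta_2 + h_2(\theta))\big) - \nabla_1\big(L^1(\theta_1, \theta_2 + h^n_2(\theta))\big)\big\| && (16)\\ &= \alpha\,\big\|(\nabla_1 h_2(\theta))^\top \nabla_2 L^1(\theta_1,\hat\theta_2) - (\nabla_1 h^n_2(\theta))^\top \nabla_2 L^1(\theta_1,\hat\theta^n_2) + \nabla_1 L^1(\theta_1,\hat\theta_2) - \nabla_1 L^1(\theta_1,\hat\theta^n_2)\big\| && (17)\\ &\leq \alpha\,\big\|(\nabla_1 h_2(\theta))^\top \nabla_2 L^1(\theta_1,\hat\theta_2) - (\nabla_1 h^n_2(\theta))^\top \nabla_2 L^1(\theta_1,\hat\theta^n_2)\big\| + \alpha\,\big\|\nabla_1 L^1(\theta_1,\hat\theta_2) - \nabla_1 L^1(\theta_1,\hat\theta^n_2)\big\| && (18)\\ &= \alpha\,\big\|(\nabla_1 h_2(\theta))^\top \big(\nabla_2 L^1(\theta_1,\hat\theta_2) - \nabla_2 L^1(\theta_1,\hat\theta^n_2)\big) && (19)\\ &\qquad + \big(\nabla_1 h_2(\theta) - \nabla_1 h^n_2(\theta)\big)^\top \nabla_2 L^1(\theta_1,\hat\theta^n_2)\big\| + \alpha\,\big\|\nabla_1 L^1(\theta_1,\hat\theta_2) - \nabla_1 L^1(\theta_1,\hat\theta^n_2)\big\| \\ &\leq \alpha\,\big\|(\nabla_1 h_2(\theta))^\top\big\|\,\big\|\nabla_2 L^1(\theta_1,\hat\theta_2) - \nabla_2 L^1(\theta_1,\hat\theta^n_2)\big\| \\ &\qquad + \alpha\,\big\|\big(\nabla_1 h_2(\theta) - \nabla_1 h^n_2(\theta)\big)^\top\big\|\,\big\|\nabla_2 L^1(\theta_1,\hat\theta^n_2)\big\| + \alpha\,\big\|\nabla_1 L^1(\theta_1,\hat\theta_2) - \nabla_1 L^1(\theta_1,\hat\theta^n_2)\big\| && (20)\\ &\xrightarrow{\,n\to\infty\,} 0. && (21) \end{aligned}$$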

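In the last step, we used the following two facts. First, since $\nabla L^1$ is assumed to be continuous in $\theta$, and $\lim_{n\to\infty} h^n_2(\theta) = h_2(\theta)$ by assumption, it follows that $\lim_{n\to\infty} \nabla_2 L^1(\theta_1,\hat\theta^n_2) = \nabla_2 L^1(\theta_1,\hat\theta_2)$ and $\lim_{n\to\infty} \nabla_1 L^1(\theta_1,\hat\theta^n_2) = \nabla_1 L^1(\theta_1,\hat\theta_2)$. Second, by assumption, $\lim_{n\to\infty} \nabla_1 h^n_2(\theta) = \nabla_1 h_2(\theta)$. In particular, the convergent sequence $\|\nabla_2 L^1(\theta_1,\hat\theta^n_2)\|$ must be bounded, and thus the three terms in (20) must all converge to $0$ as $n \to \infty$. It follows by the sandwich theorem that $\lim_{n\to\infty} \Psi_1(h^n)(\theta) = \Psi_1(h)(\theta)$.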

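Now we can directly prove that $\Psi_1(h)(\theta) = h_1(\theta)$. We have

$$\begin{aligned} 0 &\leq \big\|\Psi_1(h)(\theta) - h_1(\theta)\big\| && (22)\\ &= \big\|\Psi_1(h)(\theta) - \Psi_1(h^n)(\theta) + \Psi_1(h^n)(\theta) - h^n_1(\theta) + h^n_1(\theta) - h_1(\theta)\big\| && (23)\\ &\leq \big\|\Psi_1(h)(\theta) - \Psi_1(h^n)(\theta)\big\| + \big\|\Psi_1(h^n)(\theta) - h^n_1(\theta)\big\| + \big\|h^n_1(\theta) - h_1(\theta)\big\| && (24)\\ &= \big\|\Psi_1(h)(\theta) - \Psi_1(h^n)(\theta)\big\| + \big\|h^{n+1}_1(\theta) - h^n_1(\theta)\big\| + \big\|h^n_1(\theta) - h_1(\theta)\big\| && (25)\\ &\xrightarrow{\,n\to\infty\,} 0, && (26) \end{aligned}$$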

where in the last step we have used the above result, as well as the assumption that $h^n_1(\theta)$ converges pointwise, and thus must also be a Cauchy sequence, so the last and the middle term both converge to zero as well.

It follows by the sandwich theorem that $\Psi_1(h)(\theta) = h_1(\theta)$. Since $\theta$ was arbitrary, this concludes the proof. ∎
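The fixed-point argument can be illustrated numerically in the Tandem game. The sketch below is an illustration under stated assumptions rather than part of the proof: updates are parametrized as affine functions $f(s) = c + d\,s$ of $s = x + y$ (a form the exact LOLA operator preserves in this game), and the helper names are hypothetical. It applies $\Psi$ repeatedly, starting from naive learning, and measures how far the result is from being a fixed point of $\Psi$.

```python
def lola_operator(c, d, alpha):
    """Exact LOLA operator Psi in the Tandem game, acting on updates f(s) = c + d*s.

    Psi_1(f) = -alpha * d/dx[(s + f(s))^2 - 2x]
             = -alpha * [2*(1 + d)*(c + (1 + d)*s) - 2],   with s = x + y.
    """
    return -alpha * (2 * (1 + d) * c - 2), -2 * alpha * (1 + d) ** 2


def iterate_hola(alpha, order):
    """Coefficients (c, d) of the exact HOLA update of the given order."""
    c, d = 2 * alpha, -2 * alpha  # order 0: naive learning
    for _ in range(order):
        c, d = lola_operator(c, d, alpha)
    return c, d


def consistency_residual(c, d, alpha):
    """Largest coefficient change under one more application of Psi."""
    c_new, d_new = lola_operator(c, d, alpha)
    return max(abs(c_new - c), abs(d_new - d))


if __name__ == "__main__":
    for alpha in (0.1, 1.0):
        c, d = iterate_hola(alpha, 50)
        print(alpha, c, d, consistency_residual(c, d, alpha))
```

With a small look-ahead rate such as $\alpha = 0.1$ the iterates settle and the residual vanishes to numerical precision, whereas for $\alpha = 1$ the residual keeps growing, in line with Proposition 6.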

## Appendix C Infinite-order Taylor LOLA

In this section, we repeat the analysis of iLOLA from Section 4.1 for infinite-order Taylor LOLA (Taylor iLOLA). That is, we define Taylor consistency and show that Taylor iLOLA satisfies this consistency equation under certain assumptions. This result will be needed for our proof of Proposition 2.

To begin, assume that some differentiable game with continuously differentiable loss functions $L^1, L^2$ is given. Define the Taylor LOLA operator $\Phi$ that maps pairs of update functions $f = (f_1, f_2)$ to the associated Taylor LOLA update

$$\Phi_i(f) := -\alpha\nabla_i\big(L^i + (\nabla_{-i}L^i)^\top f_{-i}\big) \qquad (27)$$

for $i = 1, 2$.

We then have the following definition.

###### Definition 4 (Taylor consistency).

Two update functions $f_1, f_2$ are called Taylor consistent if for any $\theta$, we have

$$\Phi(f_1, f_2) = (f_1, f_2).$$

Next, let $h^n_i$ denote player $i$’s $n$-th order Taylor LOLA update. I.e., $h^n = \Phi(h^{n-1})$ for $n \geq 0$, where we let $h^{-1} := 0$, so that $h^0$ is naive learning. Then we define

###### Definition 5 (Taylor iLOLA).

If $(h^n_1, h^n_2)$ converges pointwise as $n \to \infty$, define Taylor iLOLA as the limiting update

$$h := \lim_{n\to\infty} \begin{pmatrix} h^n_1 \\ h^n_2 \end{pmatrix}.$$

Finally, we provide a proof that Taylor iLOLA is Taylor consistent; i.e., we give a Taylor version of Proposition 1.

###### Proposition 7.

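Let $h^n_i$ denote player $i$’s $n$-th order Taylor LOLA update. Assume that $\lim_{n\to\infty} h^n_i(\theta) = h_i(\theta)$ and $\lim_{n\to\infty} \nabla_i h^n_{-i}(\theta) = \nabla_i h_{-i}(\theta)$ for all $\theta$ and $i = 1, 2$. Then Taylor iLOLA is Taylor consistent.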

###### Proof.

The proof is exactly analogous to that of Proposition 1, but easier. We show $\Phi(h) = h$. It follows from the definition of $\Phi$ in Equation 27 that then $h$ is Taylor consistent. We focus on showing $\Phi_1(h) = h_1$; the case $i = 2$ is exactly analogous.

First, we show that $\Phi_1(h^n)(\theta)$ converges to $\Phi_1(h)(\theta)$ for all $\theta$. Letting $\theta$ be arbitrary and omitting $\theta$ in the following for clarity, we have

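$$\begin{aligned} 0 &\leq \big\|\Phi_1(h) - \Phi_1(h^n)\big\| && (28)\\ &= \big\|-\alpha\nabla_1\big(L^1 + (\nabla_2 L^1)^\top h_2\big) + \alpha\nabla_1\big(L^1 + (\nabla_2 L^1)^\top h^n_2\big)\big\| && (29)\\ &= \alpha\,\big\|-\nabla_{12}L^1 h_2 - (\nabla_2 L^1)^\top(\nabla_1 h_2)^\top + \nabla_{12}L^1 h^n_2 + (\nabla_2 L^1)^\top(\nabla_1 h^n_2)^\top\big\| && (30)\\ &\leq \alpha\,\big\|\nabla_{12}L^1 (h^n_2 - h_2)\big\| + \alpha\,\big\|(\nabla_2 L^1)^\top(\nabla_1 h^n_2 - \nabla_1 h_2)^\top\big\| && (31)\\ &\leq \alpha\,\big\|\nabla_{12}L^1\big\|\,\big\|h^n_2 - h_2\big\| + \alpha\,\big\|\nabla_2 L^1\big\|\,\big\|\nabla_1 h^n_2 - \nabla_1 h_2\big\| && (32)\\ &\xrightarrow{\,n\to\infty\,} 0. && (33) \end{aligned}$$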

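In the last step, we used the assumptions that $\lim_{n\to\infty} h^n_2 = h_2$ and $\lim_{n\to\infty} \nabla_1 h^n_2 = \nabla_1 h_2$. It follows by the sandwich theorem that $\lim_{n\to\infty} \Phi_1(h^n) = \Phi_1(h)$.

It follows from the above that $\Phi_1(h) = h_1$, using exactly the same argument as in Equations 22-26 with $\Phi$ instead of $\Psi$. Since $\theta$ was arbitrary, this concludes the proof. ∎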
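A small Python sketch can make the Taylor operator concrete in the Tandem game. As an assumption made only for this illustration, updates are written as affine functions $f(s) = c + d\,s$ of $s = x + y$. In that parametrization $\Phi$ acts linearly on $(c, d)$ with slopes $-2\alpha$ and $-4\alpha$, so the iteration contracts when $\alpha < 1/4$. The function names are hypothetical.

```python
def taylor_lola_operator(c, d, alpha):
    """Taylor LOLA operator Phi in the Tandem game, acting on updates f(s) = c + d*s.

    Phi_1(f) = -alpha * d/dx[(s^2 - 2x) + 2*s*(c + d*s)]
             = -alpha * [(2*c - 2) + (2 + 4*d)*s],   with s = x + y.
    """
    return -alpha * (2 * c - 2), -alpha * (2 + 4 * d)


def taylor_hola(alpha, order):
    """Coefficients (c, d) of the Taylor HOLA update of the given order."""
    c, d = 2 * alpha, -2 * alpha  # order 0: naive learning
    for _ in range(order):
        c, d = taylor_lola_operator(c, d, alpha)
    return c, d


def taylor_fixed_point(alpha):
    """Closed-form fixed point of the affine map implemented above."""
    return 2 * alpha / (1 + 2 * alpha), -2 * alpha / (1 + 4 * alpha)


if __name__ == "__main__":
    alpha = 0.1
    print(taylor_hola(alpha, 100), taylor_fixed_point(alpha))
```

For $\alpha = 0.1$ the iterates approach the closed-form fixed point, which is Taylor consistent by construction. For $\alpha = 0.5$ the slope coefficient oscillates with growing magnitude, so no Taylor iLOLA limit exists in this parametrization.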

## Appendix D Proof of Proposition 2

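We begin by proving that LCGD and CGD do not coincide with LOLA and iLOLA. It is sufficient to exhibit a single counterexample: we consider the Tandem game given by $L^1 = (x+y)^2 - 2x$ and $L^2 = (x+y)^2 - 2y$ (using $x, y$ instead of $\theta_1, \theta_2$ for simplicity). Throughout this proof we use the notation introduced by Balduzzi et al. (2018) and Letcher et al. (2019b), including the simultaneous gradient, the off-diagonal Hessian and the shaping term of the game as

$$\xi = \begin{pmatrix} \nabla_1 L^1 \\ \nabla_2 L^2 \end{pmatrix} \quad\text{and}\quad H_o = \begin{pmatrix} 0 & \nabla_{12}L^1 \\ \nabla_{21}L^2 & 0 \end{pmatrix} \quad\text{and}\quad \chi = \operatorname{diag}\big(H_o^\top \nabla L\big)$$

respectively. Note that in two-player games, LOLA’s shaping term reduces to

$$\chi = \begin{pmatrix} \nabla_{12}L^2\,\nabla_2 L^1 \\ \nabla_{21}L^1\,\nabla_1 L^2 \end{pmatrix}.$$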

#### LCGD ≠ LOLA.

Following Schäfer and Anandkumar (2019), LCGD is given by

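$$\mathrm{LCGD} = -\alpha\begin{pmatrix} \nabla_x f - \alpha D^2_{xy}f\,\nabla_y g \\ \nabla_y g - \alpha D^2_{yx}g\,\nabla_x f \end{pmatrix} = -\alpha\begin{pmatrix} I & -\alpha D^2_{xy}f \\ -\alpha D^2_{yx}g & I \end{pmatrix}\begin{pmatrix} \nabla_x f \\ \nabla_y g \end{pmatrix} = -\alpha(I - \alpha H_o)\xi$$

while LOLA is given (Letcher et al., 2019b) by

$$\mathrm{LOLA} = -\alpha(I - \alpha H_o)\xi + \alpha^2\chi.$$

Any game with $\chi \neq 0$ will yield a difference between LCGD and LOLA; in particular,

$$\chi = 4(x+y)\begin{pmatrix} 1 \\ 1 \end{pmatrix}$$

in the Tandem game implies that LCGD $\neq$ LOLA whenever parameters lie outside the measure-zero set $\{x + y = 0\}$.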
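The gap between the two updates can be checked directly in the Tandem game. The sketch below makes two assumptions for illustration: LOLA is computed from its first-order Taylor surrogate (player 1 differentiates $L^1 + (\nabla_y L^1)\,\Delta y$ through the opponent's naive step $\Delta y = -\alpha\nabla_y L^2$), and derivatives are taken by central differences, which are exact here because every differentiated function is at most quadratic. All helper names are hypothetical.

```python
from fractions import Fraction


def loss1(x, y):
    return (x + y) ** 2 - 2 * x


def loss2(x, y):
    return (x + y) ** 2 - 2 * y


def ddx(f, x, y):
    """Central difference in x with unit step; exact for polynomials of degree <= 2."""
    return (f(x + 1, y) - f(x - 1, y)) / 2


def ddy(f, x, y):
    """Central difference in y with unit step; exact for polynomials of degree <= 2."""
    return (f(x, y + 1) - f(x, y - 1)) / 2


def lola_update_1(x, y, alpha):
    """Player 1's LOLA update from the first-order Taylor surrogate."""
    def surrogate(x_, y_):
        delta_y = -alpha * ddy(loss2, x_, y_)  # opponent's anticipated naive step
        return loss1(x_, y_) + ddy(loss1, x_, y_) * delta_y
    return -alpha * ddx(surrogate, x, y)


def lcgd_update_1(x, y, alpha):
    """Player 1's LCGD update: -alpha * (d/dx L1 - alpha * d2/dxdy L1 * d/dy L2)."""
    mixed = ddx(lambda x_, y_: ddy(loss1, x_, y_), x, y)
    return -alpha * (ddx(loss1, x, y) - alpha * mixed * ddy(loss2, x, y))


if __name__ == "__main__":
    x, y, alpha = Fraction(1), Fraction(2), Fraction(1, 2)
    gap = lola_update_1(x, y, alpha) - lcgd_update_1(x, y, alpha)
    print(gap, alpha ** 2 * 4 * (x + y))  # gap versus alpha^2 * chi_1
```

The difference equals $\alpha^2 \chi_1 = 4\alpha^2 (x + y)$ and disappears only on the line $x + y = 0$.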

#### CGD does not recover HOLA.

Since CGD is obtained through a bilinear approximation (Taylor expansion) of the loss functions, one would expect that the authors’ claim of recovering HOLA is with regard to Taylor (not exact) HOLA. For completeness, and to avoid any doubt for the reader, we prove that CGD corresponds to neither exact nor Taylor HOLA.

Following Schäfer and Anandkumar (2019), the series-expansion of CGD is given by

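$$\mathrm{CGD}_n = -\alpha\sum_{i=0}^{n}\begin{pmatrix} 0 & -\alpha D^2_{xy}f \\ -\alpha D^2_{yx}g & 0 \end{pmatrix}^{i}\begin{pmatrix} D_x f \\ D_y g \end{pmatrix} = -\alpha\sum_{i=0}^{n}(-\alpha H_o)^i\,\xi$$

and converges to CGD whenever $\alpha\|H_o\| < 1$ (where $\|\cdot\|$ denotes the operator norm induced by the Euclidean norm on the space). Assume for contradiction that the series-expansion of CGD recovers HOLA, i.e. that $\mathrm{CGD}_n = \mathrm{HOLA}_n$ for all $n$. In particular, we must have

$$\mathrm{CGD} = \lim_{n\to\infty}\mathrm{CGD}_n = \lim_{n\to\infty}\mathrm{HOLA}_n = \mathrm{iLOLA}$$

whenever $\alpha\|H_o\| < 1$. In the Tandem game, we have

$$H_o = 2\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$$

with $\|H_o\| = 2$, so CGD = iLOLA whenever $\alpha < 1/2$. Moreover, $H_o$ being constant implies that

$$\nabla\mathrm{HOLA}_n = \nabla\mathrm{CGD}_n = -\alpha\sum_{i=0}^{n}(-\alpha H_o)^i\,\nabla\xi,$$

so gradients of HOLA also converge pointwise for all $\alpha < 1/2$. In particular, CGD = iLOLA must satisfy the (exact or Taylor) consistency equations by Proposition 1 or Proposition 7. However, the update for CGD is given by

$$\begin{pmatrix} f_1 \\ f_2 \end{pmatrix} = -\alpha(I + \alpha H_o)^{-1}\xi = \frac{-2\alpha(x+y-1)}{1+2\alpha}\begin{pmatrix} 1 \\ 1 \end{pmatrix}.$$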

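For the exact case, the RHS of the first consistency equation is

$$\begin{aligned} -\alpha\nabla_x\big((x+y+f_2)^2 - 2x\big) &= -2\alpha\big((1+\nabla_x f_2)(x+y+f_2) - 1\big) \\ &= \frac{-2\alpha}{1+2\alpha}\Big(x + y + \frac{-2\alpha(x+y-1)}{1+2\alpha} - 1 - 2\alpha\Big) \\ &= f_1 + \frac{4\alpha^2(x+y+2\alpha)}{(1+2\alpha)^2} \end{aligned}$$

which does not coincide with the LHS of the consistency equation ($f_1$) whenever parameters lie outside the measure-zero set $\{x + y = -2\alpha\}$. Similarly for Taylor iLOLA, the RHS of the first consistency equation is

$$\begin{aligned} -\alpha\nabla_x\big((x+y)^2 - 2x + 2(x+y)f_2\big) &= -2\alpha\Big(x + y - 1 + \frac{-2\alpha(x+y-1)}{1+2\alpha} + \frac{-2\alpha(x+y)}{1+2\alpha}\Big) \\ &= f_1 + \frac{4\alpha^2(x+y)}{1+2\alpha} \end{aligned}$$

which does not coincide with the LHS of the consistency equation ($f_1$) whenever parameters lie outside the measure-zero set $\{x + y = 0\}$. This contradicts consistency, which completes the proof. ∎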
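The two consistency checks can be reproduced with exact rational arithmetic. The sketch below assumes the CGD update in the Tandem game is $f_1 = f_2 = -2\alpha(x+y-1)/(1+2\alpha)$, written as a function of $s = x + y$, and evaluates the RHS minus the LHS of the first consistency equation in both the exact and the Taylor form. The function names are hypothetical.

```python
from fractions import Fraction


def cgd_update(s, alpha):
    """CGD update of either player in the Tandem game, as a function of s = x + y."""
    return -2 * alpha * (s - 1) / (1 + 2 * alpha)


def cgd_slope(alpha):
    """Derivative of the CGD update with respect to x (equivalently y)."""
    return -2 * alpha / (1 + 2 * alpha)


def exact_residual(s, alpha):
    """RHS minus LHS of the first exact consistency equation with f = CGD."""
    f, df = cgd_update(s, alpha), cgd_slope(alpha)
    rhs = -2 * alpha * ((1 + df) * (s + f) - 1)  # -alpha * d/dx[(s + f)^2 - 2x]
    return rhs - f


def taylor_residual(s, alpha):
    """RHS minus LHS of the first Taylor consistency equation with f = CGD."""
    f, df = cgd_update(s, alpha), cgd_slope(alpha)
    rhs = -2 * alpha * (s - 1 + f + s * df)  # -alpha * d/dx[s^2 - 2x + 2*s*f]
    return rhs - f


if __name__ == "__main__":
    alpha = Fraction(1, 4)
    for s in (Fraction(-1, 2), Fraction(0), Fraction(1)):
        print(s, exact_residual(s, alpha), taylor_residual(s, alpha))
```

For $\alpha = 1/4$ the exact residual vanishes only at $s = -2\alpha = -1/2$ and the Taylor residual only at $s = 0$, so CGD is consistent in neither sense at a generic point.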