# Modeling Friends and Foes

How can one detect friendly and adversarial behavior from raw data? Detecting whether an environment is a friend, a foe, or anything in between remains a poorly understood yet desirable ability for safe and robust agents. This paper proposes a definition of these environmental "attitudes" based on a characterization of the environment's ability to react to the agent's private strategy. We define an objective function for a one-shot game from which the environment's probability distribution under friendly and adversarial assumptions can be derived alongside the agent's optimal strategy. Furthermore, we present an algorithm to compute these equilibrium strategies, and show experimentally that both friendly and adversarial environments possess non-trivial optimal strategies.


## 1 Introduction

How can agents detect friendly and adversarial behavior from raw data? Discovering whether an environment (or a part within) is a friend or a foe is a poorly understood yet desirable skill for safe and robust agents. Possessing this skill is important for a number of situations, including:

• Multi-agent systems: Some environments, especially in multi-agent systems, might have incentives to either help or hinder the agent [1]. For example, an agent playing football must anticipate both the creative moves of its team members and its opponents. Thus, learning to discern between friends and foes might not only help the agent to avoid danger, but also open the possibility of solving tasks through collaboration that it could not otherwise solve alone.

• Model uncertainty: An agent can choose to impute "adversarial" or "friendly" qualities to an environment that it does not know well. For instance, an agent that is trained in a simulator could compensate for the inaccuracies by assuming that the real environment differs from the simulated one in an adversarial way, so as to devise countermeasures ahead of time [2]. Similarly, inaccuracies might also originate in the agent itself, for instance due to bounded rationality [3, 4].

Typically, these situations involve a knowledge limitation that the agent addresses by responding with a risk-sensitive policy.

The contributions of this paper are threefold. First, we offer a broad definition of friendly and adversarial behavior. By varying a single real-valued parameter, one can select from a continuous range of behaviors that smoothly interpolate between fully adversarial and fully friendly. Second, we derive the agent's (and environment's) optimal strategy under friendly or adversarial assumptions. To do so, we treat the agent-environment interaction as a one-shot game with information-constraints and characterize the optimal strategies at equilibrium. Finally, we provide an algorithm to find the equilibrium strategies of the agent and the environment, and demonstrate empirically that the resulting strategies display non-trivial behaviors that vary qualitatively with the information-constraints.

## 2 Motivation & Intuition

We begin with an intuitive example using multi-armed bandits. This helps motivate the mathematical formalization in the next section.

Game theory is the classical economic paradigm for analyzing the interaction between agents [5]. Within game theory, the term adversary is justified by the fact that in zero-sum games the equilibrium strategies are maximin strategies, that is, strategies that maximize the expected payoff under the assumption that the adversary will minimize the payoffs. When the game is not zero-sum, however, interpreting an agent's behavior as adversarial is far less obvious, as there is no coupling between payoffs, and the strong guarantees provided by the minimax theorem are unavailable [6]. The notions of "indifferent" and "friendly" are similarly troublesome to capture using the standard game-theoretic language.

Intuitively though, terms such as adversarial, indifferent, and friendly have simple meanings. To illustrate, consider two different strategies (I & II) used on four different two-armed bandits (named A–D), where each strategy gets to play 1000 rounds. In each round, a bandit secretly chooses which one (and only one) of the two arms will deliver the reward. Importantly, each bandit chooses the location of the arm using a different probabilistic rule. Then the agent pulls an arm, receiving the reward if it guesses correctly and nothing otherwise.

The two strategies are:

I. The agent plays all 1000 rounds using a uniformly random strategy.

II. The agent deterministically pulls each arm exactly 500 times using some fixed rule.

Note that each strategy pulls each arm approximately 50% of the time. Now, the agent’s average rewards for all four bandits are shown in Figure 1. Based on these results, we make the following observations:

1. Sensitivity to strategy. The two sets of average rewards for bandit A are statistically indistinguishable, that is, they stay the same regardless of the agent's strategy. This corresponds to the stochastic bandit type in the literature [7, 8]. In contrast, bandits B–D yielded different average rewards for the two strategies. Although each arm was pulled approximately 500 times, it appears as if the reward distributions were a function of the strategy.

2. Adversarial/friendly exploitation of strategy. The average rewards do not always add up to one, as one would expect if the rewards were truly independent of the strategy. Compared to the uniform strategy, the deterministic strategy led to either an increase (bandit B) or a decrease of the empirical rewards (bandits C and D). We can interpret the behavior of bandits C & D as an adversarial exploitation of the predictability of the agent's strategy, much like the exploitation of a deterministic strategy in a rock-paper-scissors game. Analogously, bandit B appears to be friendly towards the agent, tending to place the rewards favorably.

3. Strength of exploitation. Notice how the rewards of the two adversarial bandits (C & D) under strategy II differ in how strongly they deviate from the baseline set by strategy I. This difference suggests that bandit D is better at reacting to the agent's strategy than bandit C, and is therefore also more adversarial. A bandit that can freely choose any placement of rewards is known as a non-stochastic bandit [9, 8].

4. Cooperating/hedging. The nature of the bandit qualitatively affects the agent's optimal strategy. A friendly bandit (B) invites the agent to cooperate through the use of predictable policies, whereas adversarial bandits (C & D) pressure the agent to hedge through randomization.

Simply put, bandits B–D appear to react to the agent’s private strategy in order to manipulate the payoffs. Abstractly, we can picture this as follows. First, a reactive environment can be thought of as possessing privileged information about the agent’s private strategy, acquired e.g. through past experience or through “spying”. Then, the amount of information determines the extent to which the environment is willing to deviate from a baseline, indifferent strategy. Second, the adversarial or friendly nature of the environment is reflected by the strategy it chooses: an adversarial (resp. friendly) environment will select the reaction that minimizes (resp. maximizes) the agent’s payoff. The diagram in Figure 2 illustrates this idea in a Rock-Paper-Scissors game.

We make one final observation.

5. Agent/environment symmetry. Let us turn the tables on the agent: how should we play if we were the bandit? A moment of reflection reveals that the analysis is symmetrical. An agent that does not attempt to maximize the payoff, or cannot do so due to limited reasoning power, will pick its strategy in a way that is indifferent to our placement of the reward. In contrast, a more effective agent will react to our choice, seemingly anticipating it. Furthermore, the agent will appear friendly if our goal is to maximize the payoff and adversarial if our goal is to minimize it.

This symmetry implies that the choices of the agent and the environment are coupled to each other, suggesting a solution principle for determining a strategy profile akin to a Nash equilibrium [5]. The next section will provide a concrete formalization.

## 3 Characterizing Friends and Foes

In this section we formalize the picture sketched out in the preceding section. We first state an objective function for a game that couples the agent's interests with those of the environment and limits both players' ability to react to each other. We then derive expressions for their optimal strategies (i.e. the best-response functions). Based on these, we then give an existence result and an indifference result for the ensuing equilibrium strategies.

### 3.1 Objective function

Let X and Z denote the sets of actions (i.e. pure strategies) of the agent and the environment respectively; and let Q and P denote prior and posterior strategies respectively. We represent the interaction between the agent and the environment as a one-shot game in which both players, starting from prior strategies Q(X) and Q(Z), choose (mixed) posterior strategies P(X) and P(Z) respectively. The goal of the agent is to maximize the payoffs given by a utility function U(x, z) that maps joint choices into utilities.

We model the two players’ sensitivity to each other’s strategy as coupled deviations from indifferent prior strategies, whereby each player attempts to extremize the expected utility, possibly pulling in opposite directions. Formally, consider the objective function

$$J = \mathbb{E}[U(X,Z)] - \frac{1}{\alpha} D_{\mathrm{KL}}(P(X)\,\|\,Q(X)) - \frac{1}{\beta} D_{\mathrm{KL}}(P(Z)\,\|\,Q(Z)) = \sum_{x,z} P(x)P(z)U(x,z) - \frac{1}{\alpha}\sum_x P(x)\log\frac{P(x)}{Q(x)} - \frac{1}{\beta}\sum_z P(z)\log\frac{P(z)}{Q(z)} \tag{1}$$

that is, {coupled expected payoffs} − {agent deviation cost} − {env. deviation cost},

where α and β, known as the inverse temperature parameters, determine the reaction abilities of the agent and the environment respectively.

This objective function (1) is obtained by coupling two free energy functionals (one for each player) which model decision-making with information-constraints (see e.g. [10, 4]). We will discuss the interpretation of this choice further in Section 6.1. Other constraints are possible, e.g. any deviation quantified as a Bregman divergence [11].
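For finite action sets, the objective (1) can be evaluated directly. The following sketch (the variable names are ours, not the paper's; strategies are NumPy probability vectors and U is a payoff matrix) computes J from the two strategies, the priors, and the inverse temperatures:

```python
import numpy as np

def objective(P_x, P_z, Q_x, Q_z, U, alpha, beta):
    """Objective J of Eq. (1): coupled expected payoff minus the
    KL-weighted deviation costs of the agent and the environment."""
    expected_payoff = P_x @ U @ P_z             # sum_{x,z} P(x) P(z) U(x,z)
    kl_x = np.sum(P_x * np.log(P_x / Q_x))      # D_KL(P(X) || Q(X))
    kl_z = np.sum(P_z * np.log(P_z / Q_z))      # D_KL(P(Z) || Q(Z))
    return expected_payoff - kl_x / alpha - kl_z / beta
```

When both posteriors equal their priors, the KL terms vanish and J reduces to the expected payoff under the priors; any deviation by the agent is penalized at rate 1/α, and any deviation by the environment at rate 1/β.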

### 3.2 Friendly and Adversarial Environments

Both the sign and the magnitude of the inverse temperatures control the players' reactions as follows.

1. The sign of β determines the extremum operation: if β is positive, then J is concave in P(Z) for fixed P(X) and the environment maximizes the objective w.r.t. P(Z); analogously, a negative value of β yields a convex objective that is minimized w.r.t. P(Z).

2. The magnitude of β determines the strength of the deviation: when |β| is small, the environment can only pick strategies that are within a small neighborhood of the center Q(Z), whereas a large |β| yields a richer set of choices for P(Z).

The parameter α plays an analogous role, although in this exposition we will focus on α ≥ 0 and interpret it as a parameter that controls the agent's ability to react. In particular, setting α = 0 fixes P(X) to Q(X), which is useful for deriving the posterior environment for a given, fixed agent strategy.

From the above, it is easy to see that friendly and adversarial environments are modeled through the appropriate choice of β. For α > 0 and β > 0, we obtain a friendly environment that helps the agent in maximizing the objective, i.e.

$$\max_{P(X)} \max_{P(Z)} \; J[P(X), P(Z)].$$

In contrast, for α > 0 and β < 0, we get an adversarial environment that minimizes the objective:

$$\max_{P(X)} \min_{P(Z)} \; J[P(X), P(Z)] = \min_{P(Z)} \max_{P(X)} \; J[P(X), P(Z)].$$

In particular, the equality after exchanging the order of the minimization and maximization can be shown to hold using the minimax theorem: J is continuous, concave in P(X), and convex in P(Z), and the strategies live in compact and convex sets (the probability simplices over X and Z) respectively. The resulting strategies then locate a saddle point of J.

### 3.3 Existence and Characterization of Equilibria

To find the equilibrium strategies for (1), we calculate the best-response function for each player, i.e. the optimal strategy for a given strategy of the other player in both the friendly and adversarial cases. Proofs of the claims can be found in Appendix A.

##### Proposition 1.

The best-response functions for the agent and the environment are given by the Gibbs distributions

$$P(X) = f_X[P(Z)]: \quad P(x) = \frac{1}{N_X}\, Q(x) \exp\{\alpha U(x)\}, \quad U(x) := \sum_z P(z)\, U(x,z); \tag{2}$$

$$P(Z) = f_Z[P(X)]: \quad P(z) = \frac{1}{N_Z}\, Q(z) \exp\{\beta U(z)\}, \quad U(z) := \sum_x P(x)\, U(x,z), \tag{3}$$

where N_X and N_Z are normalizing constants.
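For finite strategy sets, the best responses (2) and (3) are straightforward to compute: each prior is tilted by the exponentiated expected utility and renormalized. The sketch below uses our own notation, with strategies as NumPy probability vectors:

```python
import numpy as np

def best_response_agent(P_z, Q_x, U, alpha):
    """f_X[P(Z)] of Eq. (2): Gibbs distribution over the agent's actions."""
    U_x = U @ P_z                      # U(x) = sum_z P(z) U(x, z)
    w = Q_x * np.exp(alpha * U_x)
    return w / w.sum()                 # divide by the normalizer N_X

def best_response_env(P_x, Q_z, U, beta):
    """f_Z[P(X)] of Eq. (3): Gibbs distribution over the environment's actions."""
    U_z = P_x @ U                      # U(z) = sum_x P(x) U(x, z)
    w = Q_z * np.exp(beta * U_z)
    return w / w.sum()                 # divide by the normalizer N_Z
```

With β = 0 the environment simply plays its prior, while a negative β shifts probability mass away from the outcomes the agent's strategy favors.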

Given the above best-response functions f_X and f_Z, we define an equilibrium strategy profile of the objective function (1) as a fixed point of the combined best-response function f, defined as the mapping that concatenates the two best-response functions, i.e.

$$f[P(X), P(Z)] := (f_X[P(Z)],\, f_Z[P(X)]). \tag{4}$$

That is, the equilibrium strategy profile (a definition closely related to the Quantal Response Equilibrium [12]), in analogy with the Nash equilibrium, is a mixed-strategy profile that lies at the intersection of both best-response curves. With this definition, the following existence result follows immediately.

##### Proposition 2.

There always exists an equilibrium strategy profile.

Finally, the following result characterizes the equilibrium strategy profile in terms of an indifference principle (later illustrated in Figure 3). In particular, the result shows that both players strive towards playing strategies that equalize each other's net payoffs, defined as

$$J_X(x) := \alpha \sum_z P(z)\, U(x,z) - \log\frac{P(x)}{Q(x)} \quad\text{and}\quad J_Z(z) := \beta \sum_x P(x)\, U(x,z) - \log\frac{P(z)}{Q(z)}. \tag{5}$$
##### Proposition 3.

In equilibrium, the net payoffs are such that for all x, x′ and all z, z′ in the supports of P(X) and P(Z) respectively,

$$J_X(x) = J_X(x') \quad\text{and}\quad J_Z(z) = J_Z(z').$$

## 4 Computing Equilibria

Now we derive an algorithm for computing the equilibrium strategies of the agent and the environment. It is well known that standard gradient descent with competing losses is difficult [13, 14], and indeed a straightforward gradient-ascent/descent method on the objective (1) turns out to be numerically brittle, especially for values of α and β near zero. Instead, we let the strategies follow a smoothed dynamics on the exponential manifold until convergence. Equations (2) and (3) show that the log-probabilities of the best-response strategies are

$$\log P(x) \doteq \log Q(x) + \alpha \sum_z P(z)\, U(x,z), \qquad \log P(z) \doteq \log Q(z) + \beta \sum_x P(x)\, U(x,z),$$

where ≐ denotes equality up to a constant. This suggests the following iterative algorithm. Starting from L_0(x) = log Q(x) and L_0(z) = log Q(z), one can iterate the following four equations for t = 0, 1, 2, … time steps:

$$L_{t+1}(x) = (1-\eta_t)\, L_t(x) + \eta_t \Big( \log Q(x) + \alpha \sum_z P_t(z)\, U(x,z) \Big) \tag{6}$$
$$P_{t+1}(x) = \frac{\exp L_{t+1}(x)}{\sum_{\tilde{x}} \exp L_{t+1}(\tilde{x})} \tag{7}$$
$$L_{t+1}(z) = (1-\eta_t)\, L_t(z) + \eta_t \Big( \log Q(z) + \beta \sum_x P_{t+1}(x)\, U(x,z) \Big) \tag{8}$$
$$P_{t+1}(z) = \frac{\exp L_{t+1}(z)}{\sum_{\tilde{z}} \exp L_{t+1}(\tilde{z})} \tag{9}$$

Here, the learning rate η_t can be chosen constant but sufficiently small to achieve a good approximation; alternatively, one can use an annealing schedule that conforms to the Robbins-Monro conditions Σ_t η_t = ∞ and Σ_t η_t² < ∞. Figure 3 shows four example simulations of the learning dynamics.
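The updates (6)–(9) translate directly into code. The sketch below is ours; the constant learning rate and the max-subtraction for numerical stability are our choices, not prescriptions from the paper:

```python
import numpy as np

def compute_equilibrium(Q_x, Q_z, U, alpha, beta, eta=0.05, T=5000):
    """Iterate Eqs. (6)-(9): smoothed log-space updates toward the
    Gibbs best responses, renormalizing after every step."""
    L_x, L_z = np.log(Q_x), np.log(Q_z)
    P_x, P_z = Q_x.copy(), Q_z.copy()
    for _ in range(T):
        L_x = (1 - eta) * L_x + eta * (np.log(Q_x) + alpha * (U @ P_z))   # (6)
        P_x = np.exp(L_x - L_x.max())                                     # (7)
        P_x /= P_x.sum()
        L_z = (1 - eta) * L_z + eta * (np.log(Q_z) + beta * (P_x @ U))    # (8)
        P_z = np.exp(L_z - L_z.max())                                     # (9)
        P_z /= P_z.sum()
    return P_x, P_z
```

For instance, in the matching game U = I with uniform priors, a rational agent (α > 0) facing an adversarial environment (β < 0) is driven to the uniform mixed strategy, as in rock-paper-scissors.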

## 5 Experiments

### 5.1 Bernoulli Bandit

We now return to the two-armed bandits discussed in the introduction (Figure 1) and explain how these were modeled.

Assuming that the agent plays the rows (i.e. arms) and the bandit/environment the columns (i.e. reward placement), the utility matrix was chosen as

$$U = I_{2\times 2} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}.$$

This reflects the fact that there is always one and only one reward.

We did not want the agent to play an equilibrium strategy, but rather to investigate each bandit's reaction to a uniform strategy and the two pure strategies, i.e. P(X) = (1/2, 1/2), (1, 0), and (0, 1) respectively. Choosing α = 0 then implies that the agent's posterior strategy stays fixed, i.e. P(X) = Q(X).

For the bandits, we fixed a common prior strategy that is slightly biased toward the second arm, and chose inverse temperatures β to model the indifferent, friendly, adversarial, and very adversarial bandit:

| Bandit | Type |
|---|---|
| A | Indifferent/stochastic |
| B | Friendly |
| C | Adversarial |
| D | Very adversarial |

We then computed the equilibrium strategies for each combination, which in this case (due to α = 0) reduces to computing the bandits' best-response functions for each of the three agent strategies. Once the posterior distributions P(Z) were calculated, we simulated each one and collected the empirical rewards, which are summarized in Figure 1.
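Since α = 0 fixes the agent's strategy, each bandit's posterior is a single application of the environment's best response (3). The sketch below reproduces the setup qualitatively; the prior and the inverse temperatures are illustrative stand-ins, not the exact values used in the paper:

```python
import numpy as np

U = np.eye(2)                    # exactly one arm carries the reward
Q_z = np.array([0.45, 0.55])     # prior slightly biased toward arm 2 (assumed)

def bandit_response(P_x, beta):
    """Environment's best response, Eq. (3), for a fixed agent strategy."""
    w = Q_z * np.exp(beta * (P_x @ U))
    return w / w.sum()

strategies = {"uniform": np.array([0.5, 0.5]),
              "pure arm 1": np.array([1.0, 0.0]),
              "pure arm 2": np.array([0.0, 1.0])}
betas = {"A (indifferent)": 0.0, "B (friendly)": 2.0,
         "C (adversarial)": -2.0, "D (very adversarial)": -8.0}

for bandit, beta in betas.items():
    for name, P_x in strategies.items():
        P_z = bandit_response(P_x, beta)
        print(f"{bandit:22s} vs {name:10s}: expected reward = {P_x @ U @ P_z:.3f}")
```

The indifferent bandit's rewards are unaffected by the agent's strategy, while the adversarial bandits punish the pure strategies more strongly as |β| grows, mirroring Figure 1.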

### 5.2 Gaussian Bandit, and Dependence on Variance

In a stochastic bandit, the optimal strategy is to deterministically pick the arm with the largest expected payoff. However, in adversarial and friendly bandits, the optimal strategy can depend on the higher-order moments of the reward distribution, as was shown previously in [15]. The aim of this experiment is to reproduce these results, showing the dependence of the optimal strategy on both the mean and the variance of the payoff distribution.

##### Setup.

To do so, we considered a four-armed bandit with payoffs that are distributed according to (truncated and discretized) Gaussian distributions. To investigate the interplay between the mean and the variance, we chose four Gaussians with:

• increasing means, uniformly spaced between -0.2 and 0.2;

• and decreasing variances, with standard deviations uniformly spaced between 1 and 2.

Thus, the arm with the largest mean is the most precise. Clearly, if the bandit is stochastic, arms are ranked according to their mean payoffs irrespective of their variances.

We then performed a sweep through the bandit's inverse temperature β, starting from an adversarial (β < 0) and ending in a friendly bandit (β > 0), computing the equilibrium strategies for both players along the way. Throughout the sweep, the agent's inverse temperature α was kept constant at a large positive value, modeling a highly rational agent.

##### Results.

The results are shown in Figure 4. As expected, for values of β close to zero the agent's optimal strategy consists in (mostly) pulling the arm with the largest expected reward. In contrast, adversarial bandits (β < 0) attempt to diminish the payoffs of the agent's preferred arms, thus forcing it to adopt a mixed strategy. In the friendly case (β > 0), the agent's strategy becomes deterministic. Interestingly though, the optimal arm switches to those with higher variance as β increases (e.g. see Figure 4c). This is because Gaussians with larger variance are "cheaper to shift"; in other words, the rate of growth of the KL-divergence per unit of translation is lower. Thus, arms that were suboptimal for β = 0 can become optimal for larger β values if their variances are large. We believe that these "phase transitions" are related to the ones previously observed under information-constraints [16, 17, 18].

### 5.3 Linear Classifier

The purpose of our last experiment is to illustrate the non-trivial interactions that may arise between a classifier and a reactive data source, be it friendly or adversarial. Specifically, we designed a simple 2-D linear classification example in which the agent chooses the parameters of the classifier and the environment picks the binary data labels.

##### Method.

We considered hard linear classifiers that predict the class label y ∈ {0, 1} of an input x via f(w · x + b), where w and b are the weight vector and the bias, and f is a hard sigmoid defined by f(t) = 1 if t ≥ 0 and f(t) = 0 otherwise. To simplify our analysis, we chose a set of 25 data points (i.e. the inputs) placed on a grid spread uniformly over the input domain. Furthermore, we discretized the parameter space so that w and b both have 25 settings spaced uniformly over the same range as the input locations, yielding 25 × 25 = 625 possible parameter combinations (w, b).

The agent’s task consisted in choosing a strategy to set those parameters. However, unlike a typical classification task, here the agent could choose a distribution over if deemed necessary. Similarly, the environment picked the data labels indirectly by choosing the parameters of an optimal classifier. Specifically, the environment could pick a distribution  over , which in turn induced (stochastic) labels on the data set.

The utility and the prior distributions were chosen as follows. The utility function mapped each classifier-label pair into the number of correctly classified data points. The prior distribution of the agent Q(X) was uniform, reflecting initial ignorance. For the environment we chose a prior with a strong bias toward label assignments that are compatible with a fixed reference classifier. Specifically, for each label assignment z, its prior probability Q(z) was proportional to the number of data points that the reference classifier would correctly classify when the true labels are given by z. Figure 5 shows the stochastic labels obtained by marginalizing over the prior strategies of the agent and the environment respectively. Notice that although each individual classifier is linear, their mixture is not.

##### Results.

Starting from the above priors, we then calculated various friendly and adversarial deviations. First, we improved the agent's best-response strategy by setting its inverse temperature α to a positive value while keeping the environment's strategy fixed (β = 0). As a result, we obtained a crisper (i.e. less stochastic) classifier that reduced the mistakes for the data points that are more distant from the decision boundary. We use these posterior strategies as the priors for the subsequent tests.

Next, we generated two adversarial modifications of the environment (a mildly and a strongly negative β) while keeping the agent's strategy fixed, simulating an attack on a pre-trained classifier. The mildly adversarial environment attempts to increase the classification error by "shifting" the members of the second class (white) towards the agent's decision boundary. The very adversarial environment, in contrast, simply flips the labels, nearly maximizing the expected classification error.

If instead we pair a reactive agent (α > 0) with the previous adversarial environment (β < 0), the agent significantly improves its performance by slightly randomizing its classifier, thereby thwarting the environment's attempt to fully flip the labels. Finally, we paired a reactive agent (α > 0) with a friendly environment (β > 0). As a result, both players cooperated by significantly aligning and sharpening their choices, with the agent picking a crisp decision boundary that nearly matched all the labels chosen by the environment.

## 6 Discussion and Conclusions

### 6.1 Information Channel

The objective function (1) can be tied more closely to the idea of an information channel that allows the agent and the environment to anticipate each other’s strategy (Section 2).

Let D be a random variable that encapsulates the totality of the information that informs the decisions of the agent and the environment. Identify the posteriors P(X) and P(Z) with the mixed strategies that the two players adopt after learning about D, that is, P(X|D) and P(Z|D) respectively. This corresponds to a graphical model in which D is a common parent of X and Z.

Assuming that the priors are the marginals of the posteriors and taking the expectation over D of the objective function (1), we obtain

$$\mathbb{E}_D[J] = \mathbb{E}_{D,X,Z}[U(X,Z)] - \frac{1}{\alpha}\, I(X;D) - \frac{1}{\beta}\, I(Z;D), \tag{10}$$

where I(X;D) and I(Z;D) are mutual-information terms. These terms quantify the capacity of the information channel between the background information D and the strategies.

With this connection, we can draw two conclusions. First, the baseline strategies Q(X) and Q(Z) are the strategies that result when the two players do not observe the background information, since

$$Q(x) = \sum_d Q(d)\, Q(x|d) \quad\text{and}\quad Q(z) = \sum_d Q(d)\, Q(z|d),$$

that is, the players play one of their strategies according to their base rates, effectively averaging over them. Second, the objective function (1) controls, via the inverse temperature parameters, the amount of information about D that the agent and the environment use to choose their strategies.
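The mutual-information terms in (10) arise from the expected KL deviation costs: when the prior is the marginal of the posterior, as assumed above,

$$\mathbb{E}_D\big[ D_{\mathrm{KL}}(P(X|D)\,\|\,Q(X)) \big] = \sum_d Q(d) \sum_x P(x|d) \log\frac{P(x|d)}{Q(x)} = I(X;D),$$

and analogously for Z, so each deviation cost in (1) averages out to the information rate of the corresponding channel.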

### 6.2 Relation to previous work

This work builds on a number of previous ideas. The characterization of friendly and adversarial behavior from Section 2 is a direct adaptation of the Gaussian case introduced in [15] to the case of discrete strategy sets. Therein, the authors presented a model of multi-armed bandits for the special case of Gaussian-distributed rewards in which the bandit can react to the strategy of the agent in a friendly or adversarial way. They furthermore used this model to derive the agent's optimal policy and a Thompson sampling algorithm to infer the environment's inverse temperature parameter from experience. Our work can be thought of as an adaptation of their model to normal-form games.

In turn, the formalization of friendly and adversarial behavior through information-constraints was suggested in [19] (in the context of risk-sensitivity) and in [20, 4] (in sequential decision-making). Information-constraints have also been used in game theory: in particular, Quantal Response Equilibria feature reactive strategies that are bounded rational [12, 21], which relates to our case when the inverse temperatures are positive.

Our work was also inspired by existing work in the bandit literature; in particular, by bandit algorithms that achieve near-optimal performance in both stochastic and adversarial bandits [22, 23, 24]. Notice, however, that these algorithms neither include the friendly/cooperative case nor distinguish between different degrees of attitudes.

In multiagent learning, there is work that considers how to act in different scenarios with varying, discrete degrees of adversarial behavior (see for instance [25, 26, 27, 28]). In particular, a Bayesian approach has been tested for learning against a given class of opponents in stochastic games [29].

### 6.3 Learning the attitude of an environment

In this work we have centered our attention on formalizing friendly and adversarial behavior by characterizing the statistics of such an environment in a one-shot game given three components:

a) the strategy of the agent;

b) the prior strategy of the environment;

c) and an inverse temperature parameter.

This model can be used within a learning algorithm to detect the environment’s friendly or adversarial attitude from past interactions akin to [25]. However, since this sacrifices the one-shot setup, additional assumptions (e.g. of stationarity) need to be made to accommodate our definition.

For instance, in a Bayesian setup, the model can be used as a likelihood function of the inverse temperature and the parameters of the environment's prior distribution, given the parameters of the (private) strategy used by the agent. If combined with a suitable prior over these quantities, one can e.g. use Thompson sampling [30, 31] to implement an adaptive strategy that in each round plays the best response for simulated parameters drawn from the posterior. This is the method adopted in [15].

Alternatively, another way of detecting whether the environment is reactive is to estimate the mutual information between the agent's strategy parameters and the environment's action, conditioned on the agent's action. For a non-reactive environment, the agent's action forms a Markov blanket for the environment's response, and hence this conditional mutual information is zero; whereas if it is positive, then it must be that the environment can "spy" on the agent's private policy.

### 6.4 Final thoughts

We have presented an information-theoretic definition of behavioral attitudes such as friendly, indifferent, and adversarial and shown how to derive optimal strategies for these attitudes. These results can serve as a general conceptual basis for the design of specialized detection mechanisms and more robust strategies.

Two extensions are worth pointing out. The first is the extension of the model to extensive-form games to represent sequential interactions. This will require formulating novel backward induction procedures involving subgame equilibria [5], perhaps similar to [32]. The second is the analysis of state-of-the-art machine learning techniques such as deep neural networks: e.g. whether randomizing the weights protects from adversarial examples [33, 34], and whether friendly examples exist and can be exploited.

Importantly, we have shown the existence of a continuous range of environments that, if not finely discriminated by the agent, will lead to strictly suboptimal strategies, even in the friendly case.

#### Acknowledgments

We thank Marc Lanctot, Bernardo Pires, Laurent Orseau, Victoriya Krakovna, Jan Leike, Neil Rabinowitz, and David Balduzzi for comments on an earlier manuscript.

## Appendix A Proofs

##### Proof of Proposition 1.

The best-response functions are obtained by optimizing the Lagrangian

$$\mathcal{L} := J - \lambda_X \Big( \sum_x P(x) - 1 \Big) - \lambda_Z \Big( \sum_z P(z) - 1 \Big),$$

where λ_X and λ_Z are the Lagrange multipliers for the equality constraints enforcing the normalization of P(X) and P(Z) respectively. For P(Z), we fix P(X) and equate the derivative to zero:

$$\frac{\partial \mathcal{L}}{\partial P(z)} = \sum_x P(x)\, U(x,z) - \frac{1}{\beta}\Big( \log\frac{P(z)}{Q(z)} + 1 \Big) + \lambda_Z \overset{!}{=} 0.$$

Solving for P(z) yields

$$P(z) = Q(z) \exp\Big\{ \beta \sum_x P(x)\, U(x,z) + \beta \lambda_Z - 1 \Big\}. \tag{11}$$

Since Σ_z P(z) = 1, the Lagrange multiplier must be equal to

$$\lambda_Z = -\frac{1}{\beta} \log \sum_z Q(z) \exp\Big\{ \beta \sum_x P(x)\, U(x,z) - 1 \Big\},$$

which, when substituted back into (11), gives the best-response function of the claim. The argument for P(X) proceeds analogously.

##### Proof of Proposition 2.

The combined best-response function f is a continuous map from a compact and convex set into itself. It follows therefore from Brouwer's fixed-point theorem that it has a fixed point.

##### Proof of Proposition 3.

We first multiply (1) by α and then see that, for any fixed P(Z), the agent's best response is also the maximizer (that is, irrespective of the sign of α) of the objective function

$$\sum_x P(x) \Big\{ \alpha \sum_z P(z)\, U(x,z) - \log\frac{P(x)}{Q(x)} \Big\} = \sum_x P(x)\, J_X(x), \tag{12}$$

which ignores the terms that do not depend on P(X). The net payoff J_X(x) is a continuous and monotonically decreasing function of P(x). Assume by contradiction that there are actions x, x′ in the support of P(X) such that J_X(x) > J_X(x′). Then one can always improve the objective function by transferring a sufficiently small amount of probability mass from action x′ to action x. However, this contradicts the assumption that P(X) is a best response, and thus J_X(x) = J_X(x′). The argument for P(Z) proceeds analogously.
