1 Introduction
Stochastic Games (SG) provide a natural extension of reinforcement learning [Sutton and Barto1998, Mnih et al.2015, Busoniu et al.2010, Peters et al.2010] to multiple agents, where adapting strategies in the presence of humans or other agents is necessary [Shapley1953, Littman1994, Littman2001].
Current frameworks for stochastic games assume perfectly rational agents, an assumption that is violated in a variety of real-world scenarios, e.g., human-robot interaction [Goodrich and Schultz2007] and pick-up and drop-off domains [Agussurja and Lau2012]. In the context of computer games, the main focus of this paper, the problem with perfect rationality is amplified further. Here, it is in fact undesirable to design agents that seek optimal behaviour, since humans quickly lose interest when playing against adversarial agents that are impossible to defeat [Hunicke2005]. Hence, to design games with adaptable and balancing properties tailored to human-level performance, there is a need to extend state-of-the-art SG beyond optimality to allow for tuneable behaviour.
One method to produce tuneable behaviour in reinforcement learning is to introduce an adjustable Kullback-Leibler (KL) constraint between the agent’s policy and a reference one. In particular, by increasingly strengthening this constraint we can obtain policies increasingly close to the reference policy, and vice versa. Bounding policy updates in such a manner has previously been introduced in the literature under different names. Examples include KL control, relative entropy policy search [Peters et al.2010], path integral control [Kappen2005, Braun et al.2011], information-theoretic bounded rationality [Ortega and Braun2013], information theory of decisions and actions [Tishby and Polani2011, Rubin et al.2012], and soft Q-learning [Fox et al.2016, Haarnoja et al.2017]. Problems targeted by these methods are also widespread, e.g., tackling the overestimation problem in tabular Q-learning [Fox et al.2016] and in deep Q-networks [Leibfried et al.2017], accounting for model misspecification [GrauMoya et al.2016], introducing safe policy updates in robot learning [Schulman et al.2015], and inducing risk-sensitive control [van den Broek et al.2010].
Contributions: Though abundant in the literature, previous works only consider single-agent problems and are not readily applicable to stochastic games, which involve more than one interacting entity. With game balancing as our motivation, we propose a novel formulation of SG where agents are subject to KL constraints. In particular, our formulation introduces two KL constraints, one for each agent, limiting the space of available policies, which, in turn, enables tuneable behaviour. We then introduce an online strategy that can be used for gameplay balancing even in high-dimensional spaces through a neural network architecture.
In short, the contributions of this paper can be summarised as: (1) proving convergence of two-player soft Q-learning to a fixed point through contractions; (2) generalising team and zero-sum games in a continuous fashion and showing a unique value; (3) demonstrating convergence to correct behaviour by tuning the KL constraints on a simplified gridworld scenario; (4) extending our method to handle high-dimensional spaces; and (5) inferring the opponent’s Lagrange multiplier by maximum likelihood, and demonstrating game-balancing behaviour on the game of Pong.
2 Background
2.1 Reinforcement Learning
In reinforcement learning (RL) [Sutton and Barto1998] an agent interacts with an unknown environment to determine an optimal policy that maximises total expected return. These problems are formalised as Markov decision processes (MDPs). Formally, an MDP is defined as a tuple consisting of the state space, the action space, the state transition density, the reward function, and the discount factor. When the agent is in a given state and applies an action, it transitions to a successor state drawn from the transition density. The reward function quantifies the agent’s performance, and the discount factor trades off current and future rewards. The goal is to implement a policy that maximises the total discounted reward.
2.2 Single Agent Soft Q-Learning
A way to constrain the behaviour of an agent is to modify the feasibility set of allowable policies. This can be achieved by introducing a constraint, such as a KL divergence between two policy distributions, into the reinforcement learning objective. Such an approach has already been used within single-agent reinforcement learning. For example, soft Q-learning has been used to reduce the overestimation problem of standard Q-learning [Fox et al.2016] and to build flexible energy-based policies in continuous domains [Haarnoja et al.2017]. Most of these approaches modify the standard objective of reinforcement learning to
(1) 
where the constraint bound is the number of bits (or nats if using the natural logarithm), measured by the KL divergence, by which the policy is allowed to deviate from a reference policy. The expectation is taken over state-action trajectories.
To solve the above constrained problem, one typically introduces a Lagrange multiplier and rewrites an equivalent unconstrained problem
To derive an algorithm for solving the above, one recognises that the optimal value also satisfies a recursion similar to the Bellman equations [Puterman1994]. Additionally, the optimal policy can be written in closed form as
where , and . Notice that the above generalises the standard RL setting: letting the Lagrange multiplier grow without bound yields the perfectly rational valuation, whereas letting it tend to zero recovers the valuation under the reference policy. Clearly, by tuning the multiplier we can generate a continuum of policies between the reference policy and the perfectly rational policy that maximises the expected reward, as detailed in [Leibfried et al.2017].
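As an illustration of the closed-form solution referred to above, the following minimal Python sketch assumes the standard soft Q-learning expressions from the literature cited in this section: a policy proportional to the reference policy times the exponentiated, multiplier-scaled action value, and a log-sum-exp state value. The names `soft_policy_and_value`, `q_values`, `reference`, and `beta` are illustrative and not taken from the paper, and the action values are assumed to be given.

```python
import math


def soft_policy_and_value(q_values, reference, beta):
    """Return (policy, soft value) for a single state.

    q_values  -- action values Q(s, a), one per action (assumed given)
    reference -- reference policy over actions, sums to 1
    beta      -- Lagrange multiplier, assumed strictly positive here
    """
    # Shift by the maximum so that every exponent is <= 0 (numerical stability).
    shift = max(q_values)
    weights = [rho * math.exp(beta * (q - shift))
               for rho, q in zip(reference, q_values)]
    z = sum(weights)
    policy = [w / z for w in weights]
    # (1 / beta) * log sum_a rho(a) * exp(beta * Q(a)), undoing the shift.
    value = shift + math.log(z) / beta
    return policy, value


q = [1.0, 0.0, -1.0]
uniform = [1 / 3] * 3
# Small multiplier: the value is close to the mean of q under the reference.
_, v_low = soft_policy_and_value(q, uniform, beta=1e-3)
# Large multiplier: the value is close to max(q) and the policy is nearly greedy.
pi_high, v_high = soft_policy_and_value(q, uniform, beta=100.0)
```

Under these assumptions, a small multiplier keeps the policy near the reference and a large one makes it nearly greedy, which is the continuum described in the text.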
2.3 TwoPlayer Stochastic Games
In two-player stochastic games [Shapley1953, Littman1994], two agents, which we denote as the player and the opponent, interact in an environment. Each agent executes a policy that we write as and . At some time step , the player chooses an action , while the opponent picks . Accordingly, the environment transitions to a successor state , where denotes the joint transition model for the game. After transitioning to a new state, both agents receive a particular reward depending on the type of game considered. In team games, both the player and the opponent maximise the same reward function . For zero-sum games, the player seeks to maximise , whereas the opponent seeks to find a minimum. We write the policy-dependent value as where, in contrast to the one-player setting, the expectation is over state and joint-action trajectories.
In stochastic games it is common to assume perfect rationality for both agents, i.e., in the case of a zero-sum game the player computes the optimal value of state as , while the opponent as . Similarly, in team games the optimal value for the player is and for the opponent . While it is straightforward to show that for team games , an important classic result in game theory, the minimax theorem [Osborne and Rubinstein1994], states that for zero-sum games , i.e., both team and zero-sum games have a unique value.
Importantly, in complex games with large state spaces the maximisation and minimisation operations over all available policies are extremely difficult to compute. Humans and suboptimal agents seek to approximate these operations as best they can but never fully do so due to the lack of computational resources [Ortega and Stocker2016], approximations and introduced biases [Lieder et al.2012]. This limits the applicability of SG when interacting with suboptimal entities, e.g., in computer games when competing against human players. We next provide the first extension, to the best of our knowledge, of soft Q-learning to SGs and show how our framework can be used within the context of balancing the game’s difficulty.
3 Two-Player Soft Q-Learning
To enable soft Q-learning in two-player games we introduce two KL constraints that allow us to separately control the performance of both agents. In particular, we incorporate a constraint similar to (1) into the objective function for each agent and apply the method of Lagrange multipliers
(2)  
where the expectation is over joint-action trajectories, is the information cost for the player (which turns into a KL divergence under the expectation operator), and is the information cost for the opponent. The Lagrange multipliers and are tuneable parameters that we can vary at will. The distributions and are arbitrary reference policies that we assume to be uniform (considering other reference policies is left as an interesting direction for future work). Using the above, the player and the opponent compute the optimal soft value of a state using
(3) 
We define the extremum operator to correspond to a maximum when the opponent’s Lagrange multiplier is positive and to a minimum when it is negative.
It is clear that this novel formulation of the optimisation problems in Equations (3) generalises to cover both zero-sum and team games depending on the choice of the opponent’s multiplier. By letting the player’s multiplier tend to infinity and the opponent’s multiplier tend to negative or positive infinity we recover, respectively, a zero-sum or a team game with perfectly rational agents. When the opponent’s multiplier tends to zero we obtain a game in which the opponent simply employs its reference policy. For finite values of the opponent’s multiplier, we obtain a continuum of opponents with bounded performance ranging from fully adversarial to fully collaborative, including a random policy. It is important to note, as we will show later, that the analytical form of the optimal policies that solve (3) is independent of the extremum operator and depends only on the two Lagrange multipliers.
3.1 Unique Value for Two-Player Soft Q-Learning
In this section we show that the equations in (3) are equivalent, , for any and , that is, our two-player soft Q-learning exhibits a unique value.
We start by defining the free energy operator as
(4) 
for an arbitrary free energy vector . Then the Bellman-like operators for both the player and the opponent can be expressed as:
(5)
Proof Sketch: For proving our main results, summarised in Theorem 2, we commence by showing that the equations in (3) are equivalent. This is achieved by showing that the two operators in Equation (5) are in fact equivalent, see Lemma 1. Proving these operators to be contractions converging to a unique fixed point (see Theorem 1), we conclude that (see Appendix for proof details).
Lemma 1.
For any and , and arbitrary free energy vector , then .
Due to Lemma 1, we can define the generic operator . Then, for this generic operator, we can prove the following.
Theorem 1 (Contraction).
For and , the operator is an norm contraction map , where and are two arbitrary free energy vectors and is the discount factor.
Note that our reward is policy dependent (through the information cost) and, therefore, Theorem 1 is not a direct consequence of known results [Littman and Szepesvári1996], which assume that rewards are policy independent. Using the above and Banach’s fixed-point theorem [Puterman1994], we obtain the following corollary.
Corollary 1 (Unique fixed point).
The contraction mapping exhibits a unique fixed point such that .
Corollary 2.
Two-player stochastic games with soft Q-learning have a unique value, i.e., .
3.2 Bounded-Optimal Policies
Corollary 2 allows us to exploit the fact that there exists one unique value to generate the policies for both agents. With this in mind, we next design an algorithm (similar in spirit to standard Q-learning) that acquires tuneable policies. We start by defining a state-action value function, in resemblance to the Q-function, as
For action selection, neither the player nor the opponent can directly use as it depends on the action of the other agent, which is unknown a priori. Instead, it can be shown that agents must first compute the certainty equivalent by marginalising as
With these definitions and using standard variational calculus, we obtain optimal policies for both the player and the opponent as (note that, if we assume that the action space has low cardinality, and can be computed exactly)
(6)  
where and are normalising functions which can be exactly computed when assuming small discrete action spaces.
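The marginalisation and policy computation described in this subsection can be sketched for a single state as follows. This is a sketch under stated assumptions rather than the paper's exact equations: it assumes that each certainty equivalent is a log-sum-exp over the other agent's actions weighted by that agent's reference policy, and that each policy is the agent's reference policy reweighted by the exponentiated certainty equivalent. All names (`log_mean_exp`, `softmax_policy`, `bounded_policies`, `beta_pl`, `beta_op`, `rho_pl`, `rho_op`) are hypothetical.

```python
import math


def log_mean_exp(values, weights, beta):
    """(1 / beta) * log sum_i weights[i] * exp(beta * values[i]), for beta != 0."""
    # Shift so that all exponents are <= 0 for either sign of beta.
    shift = max(values) if beta > 0 else min(values)
    total = sum(w * math.exp(beta * (v - shift)) for w, v in zip(weights, values))
    return shift + math.log(total) / beta


def softmax_policy(values, weights, beta):
    """Distribution proportional to weights[i] * exp(beta * values[i])."""
    shift = max(values) if beta > 0 else min(values)
    unnorm = [w * math.exp(beta * (v - shift)) for w, v in zip(weights, values)]
    z = sum(unnorm)  # normalising constant, exact for small discrete action sets
    return [u / z for u in unnorm]


def bounded_policies(q, rho_pl, rho_op, beta_pl, beta_op):
    """q[i][j] is the joint action value for player action i, opponent action j.

    Assumes beta_pl > 0 and beta_op != 0 (negative: adversarial opponent).
    """
    n_pl, n_op = len(q), len(q[0])
    # Player: marginalise the opponent's action, then reweight its reference.
    q_pl = [log_mean_exp(q[i], rho_op, beta_op) for i in range(n_pl)]
    pi_pl = softmax_policy(q_pl, rho_pl, beta_pl)
    # Opponent: marginalise the player's action, then reweight its reference.
    q_op = [log_mean_exp([q[i][j] for i in range(n_pl)], rho_pl, beta_pl)
            for j in range(n_op)]
    pi_op = softmax_policy(q_op, rho_op, beta_op)
    return pi_pl, pi_op


q = [[1.0, -1.0], [0.0, 0.0]]
uniform = [0.5, 0.5]
pi_pl_adv, pi_op_adv = bounded_policies(q, uniform, uniform, 5.0, -5.0)
pi_pl_team, pi_op_team = bounded_policies(q, uniform, uniform, 5.0, 5.0)
```

In this toy matrix, a negative opponent multiplier pushes the opponent towards the column that hurts the player, and the player responds by preferring the safe row; a positive multiplier makes both agents favour the jointly best cell.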
Hence, can be expressed in closed form by incorporating the optimal policies in Equation (3) giving
(7) 
As summarised in Algorithm 1, we learn by applying the following recursion rule:
(8)  
Here, is the learning rate, the learning step, and is computed as in Equation (7) using the current estimate.
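A tabular version of the recursion described above might look like the following sketch. It assumes a standard temporal-difference update towards the reward plus the discounted soft value of the successor state, with the soft value taken to be a nested log-sum-exp over the opponent's and then the player's actions. The function names, the dictionary-based table `Q`, and the argument names are all hypothetical and chosen for illustration only.

```python
import math


def log_mean_exp(values, weights, beta):
    """(1 / beta) * log sum_i weights[i] * exp(beta * values[i]), for beta != 0."""
    shift = max(values) if beta > 0 else min(values)
    total = sum(w * math.exp(beta * (v - shift)) for w, v in zip(weights, values))
    return shift + math.log(total) / beta


def soft_value(q_state, rho_pl, rho_op, beta_pl, beta_op):
    """Soft value of one state from its joint action-value matrix (assumed form)."""
    q_pl = [log_mean_exp(row, rho_op, beta_op) for row in q_state]
    return log_mean_exp(q_pl, rho_pl, beta_pl)


def q_update(Q, s, a_pl, a_op, reward, s_next, alpha, gamma,
             rho_pl, rho_op, beta_pl, beta_op):
    """One update of Q[s][a_pl][a_op] using the current estimate of Q."""
    target = reward + gamma * soft_value(Q[s_next], rho_pl, rho_op, beta_pl, beta_op)
    Q[s][a_pl][a_op] += alpha * (target - Q[s][a_pl][a_op])
    return Q[s][a_pl][a_op]


# Two states, two actions per agent, table initialised to zero.
Q = {s: [[0.0, 0.0], [0.0, 0.0]] for s in (0, 1)}
uniform = [0.5, 0.5]
new_q = q_update(Q, s=0, a_pl=0, a_op=1, reward=1.0, s_next=1,
                 alpha=0.5, gamma=0.9, rho_pl=uniform, rho_op=uniform,
                 beta_pl=5.0, beta_op=-5.0)
```

Terminal states are not handled here; a fuller implementation would drop the bootstrap term when an episode ends.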
4 Real-World Considerations
Two restrictions limit the applicability of our algorithm to real-world scenarios. First, Algorithm 1 implicitly assumes knowledge of the opponent’s parameter . Obtaining in real-world settings can prove difficult. Second, our algorithm has been developed for low-dimensional state representations. Clearly, this restricts its applicability to the high-dimensional states that are typical of computer games.
To overcome these issues, we next develop an online maximum likelihood procedure to infer from data gathered through the interaction with the opponent, and then generalise Algorithm 1 to high-dimensional representations by proposing a deep learning architecture.
4.1 Estimating the Opponent’s Parameter & Game Balancing
Rather than assuming access to the opponent’s rationality parameter, we next devise a maximum likelihood estimate that allows the agent to infer it in an online fashion and, consequently, to infer the real policy of the opponent (through the policy expression of Section 3.2).
Contrary to current SG techniques that attempt to approximate the opponent’s policy directly, our method allows us to reason about the opponent by approximating only a one-dimensional parameter, namely the opponent’s multiplier in the policy expression of Section 3.2. (Similar to the previous section, we assume the opponent’s reference policy to be uniform. This, however, does not impose a strong restriction since a uniform reference policy is flexible enough to model various degrees of the opponent’s performance; see Section 5.)
Estimating the opponent’s parameter: We frame the problem of estimating as one of online maximum likelihood estimation. Namely, we assume that the player interacts in rounds with the opponent. At each round, , the player gathers a dataset of the form with denoting the total number of sampled transitions during round . Given , the agent estimates its knowledge of the opponent’s model, i.e., , by solving the following problem (which can be easily solved using stochastic gradient descent)
where is defined in Section 3.2. As rounds progress, the agent should learn to improve its estimate of . Such an improvement is quantified, in terms of regret, in the following theorem for both a fixed and a time-varying opponent. (Regret is a standard notion to quantify the performance of an online learning algorithm. It measures the performance of the agent with respect to an adversary that has access to all information upfront.)
Theorem 2.
After rounds, the average static regret for estimating vanishes as:
(9) 
For a timevarying opponent, the dynamic regret bound dictates:
with denoting the negative of the log-likelihood and .
From the above theorem we conclude that against a fixed opponent our method guarantees correct approximation of . This is true since the average regret, , vanishes as . When it comes to a dynamic opponent, however, our bound depends on how the value of the opponent’s multiplier (in other words, its policy) varies across rounds. If these variations are bounded in number, we can still guarantee vanishing regret. If not, the regret bound can grow arbitrarily large since can introduce a factor .
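The maximum likelihood estimation described in this section can be illustrated with a small batch sketch. It assumes that the opponent's policy is a softmax over its certainty-equivalent action values with a uniform reference policy, so the only unknown is a scalar multiplier; it uses full-batch gradient ascent on the log-likelihood instead of the online, round-based stochastic procedure in the text. The names `opponent_policy`, `log_likelihood_gradient`, `estimate_beta`, and the toy dataset are hypothetical.

```python
import math


def opponent_policy(q_op, beta):
    """Softmax over opponent action values with a uniform reference policy."""
    shift = max(beta * q for q in q_op)
    unnorm = [math.exp(beta * q - shift) for q in q_op]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def log_likelihood_gradient(data, beta):
    """Average derivative of log pi(a | s) with respect to beta.

    data -- list of (q_op_values, observed_action_index) pairs
    For a softmax policy the derivative is Q(a) minus the expected Q under pi.
    """
    grad = 0.0
    for q_op, action in data:
        pi = opponent_policy(q_op, beta)
        expected_q = sum(p * q for p, q in zip(pi, q_op))
        grad += q_op[action] - expected_q
    return grad / len(data)


def estimate_beta(data, beta0=0.0, step=0.5, iters=400):
    """Gradient ascent on the (concave) log-likelihood of the multiplier."""
    beta = beta0
    for _ in range(iters):
        beta += step * log_likelihood_gradient(data, beta)
    return beta


# Toy data: one state with action values [1, 0]; the opponent picks the
# low-value action 88 times out of 100, which suggests a negative multiplier.
data = [([1.0, 0.0], 0)] * 12 + [([1.0, 0.0], 1)] * 88
beta_hat = estimate_beta(data)
```

Because the log-likelihood of a one-parameter softmax is concave in the multiplier, a fixed small step size suffices in this toy setting; the sign of the estimate indicates whether the observed opponent behaves adversarially or collaboratively.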
Game Balancing: Now that we have a way to learn and estimate simultaneously, we could balance the game using the estimate of to adjust the player’s parameter . A simple heuristic that proved successful in our experiments was to simply set , where denotes an additional performance level the player can achieve. Setting would correspond to agents with the same KL constraints, whereas setting would imply a stronger player with a softer KL constraint (see Section 5.2).
4.2 Deep Two-Player Soft Q-Learning
When tackling higher-dimensional problems, one has to rely on function approximators to estimate the Q-function, or in our case, the function . We borrow two ideas from deep Q-networks [Mnih et al.2015] that allow us to stabilise learning with high-dimensional representations for our SG setting. First, we use the notion of a replay memory to store the following transitions and, second, we use a target network denoted by to handle non-stationarity of the objective. We learn , by using a neural network that receives as input and outputs a matrix of values for each combination of the agents’ actions. The loss function that we seek to minimise is
with the expectation taken over the distribution of transitions sampled from the replay memory, and computed as in Equation (7). Clearly, the above optimisation problem is similar to standard DQNs, with the difference that the error is measured between soft Q-values.
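To make the structure of such a loss concrete, the sketch below computes a mean squared temporal-difference error over a mini-batch in plain Python. It is only an illustration under assumptions: the networks are stood in for by hypothetical callables `q_net` and `target_net` that map a state to a matrix of joint action values, the reference policies are taken to be uniform, and the soft value is assumed to be a nested log-mean-exp. No deep learning library is used, and gradients are not computed.

```python
import math


def soft_value(q, beta_pl, beta_op):
    """Soft value from a joint action-value matrix q[i][j], uniform references."""
    def log_mean_exp(values, beta):
        shift = max(values) if beta > 0 else min(values)
        mean = sum(math.exp(beta * (v - shift)) for v in values) / len(values)
        return shift + math.log(mean) / beta

    # Marginalise the opponent's action first, then the player's.
    return log_mean_exp([log_mean_exp(row, beta_op) for row in q], beta_pl)


def td_loss(batch, q_net, target_net, gamma, beta_pl, beta_op):
    """Mean squared error between the online value and the bootstrapped target.

    batch -- list of (state, a_pl, a_op, reward, next_state) transitions,
             e.g. sampled from a replay memory
    """
    total = 0.0
    for s, a_pl, a_op, r, s_next in batch:
        # The target uses the separate target network for stability.
        target = r + gamma * soft_value(target_net(s_next), beta_pl, beta_op)
        total += (q_net(s)[a_pl][a_op] - target) ** 2
    return total / len(batch)


# Stand-in "networks" that return an all-zero 2 x 2 matrix for any state.
zero_net = lambda state: [[0.0, 0.0], [0.0, 0.0]]
loss = td_loss([(0, 0, 1, 1.0, 1)], zero_net, zero_net,
               gamma=0.9, beta_pl=5.0, beta_op=-5.0)
```

In practice the two callables would be neural networks whose target copy is refreshed periodically, and the loss would be minimised with a gradient-based optimiser.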
5 Experiments
We consider two cases in our experiments. The first assumes a low-dimensional setting, while the second targets the high-dimensional game of Pong. In both cases we consider full and no control of the opponent. Full control allows us to validate our intuitions of tuneable behaviour, while the second sheds light on the game-balancing capabilities of Section 4.
5.1 Low-dimensional Experiments
The Setup: We validate Algorithm 1 on a 5 × 6 gridworld, where we consider two agents interacting. Each can choose an action from . The first four actions are primitive movements, while the last corresponds to picking up an object when possible. The reward of the first player is set to for any movement and to for picking up the object located in cell (2,6).
The setting described in this paper allows for a range of games that can be continuously varied between cooperative and defective games depending on the choice of the opponent’s multiplier, a setting not allowed by any of the current techniques for stochastic games. In other words, the goal of the opponent now depends on this choice. Namely, for positive values of the multiplier the opponent is collaborative, whereas for negative values it is adversarial. Values in between correspond to tuneable performance varying between these two extremes.
We demonstrate adversarial behaviour by allowing agents to block each other either when trying to reach the same cell, or when attempting to transition to a cell previously occupied by the other agent. In such cases the respective agent remains in its current position. Given the determinism of the environment, a perfectly rational adversarial opponent can always prevent the player from reaching the goal. However, due to the KL constraints the opponent’s policy becomes “less” aggressive, allowing the player to exploit the opponent’s mistakes and arrive at the goal. For all experiments we used a high learning rate of (deterministic environment transitions allow for a large learning rate).
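The blocking rule described above can be sketched as a joint transition function. Several details are assumptions made for illustration: cells are indexed as 1-based (row, column) pairs on the 5 × 6 grid, the action names are hypothetical, moves that would leave the grid keep the agent in place, and when both agents target the same cell both are taken to remain where they are.

```python
ROWS, COLS = 5, 6

# Hypothetical action names: four primitive moves plus a pick-up action.
MOVES = {
    "up": (-1, 0),
    "down": (1, 0),
    "left": (0, -1),
    "right": (0, 1),
    "pick-up": (0, 0),
}


def propose(pos, action):
    """Cell the agent would reach if unobstructed; stays put at the border."""
    row, col = pos
    d_row, d_col = MOVES[action]
    new_row, new_col = row + d_row, col + d_col
    if 1 <= new_row <= ROWS and 1 <= new_col <= COLS:
        return (new_row, new_col)
    return pos


def joint_step(pos_pl, pos_op, a_pl, a_op):
    """Resolve both agents' moves with the two blocking rules."""
    new_pl = propose(pos_pl, a_pl)
    new_op = propose(pos_op, a_op)
    # An agent cannot enter the cell the other agent occupied before the move.
    if new_pl == pos_op:
        new_pl = pos_pl
    if new_op == pos_pl:
        new_op = pos_op
    # Both agents target the same cell: both stay (assumption).
    if new_pl == new_op:
        new_pl, new_op = pos_pl, pos_op
    return new_pl, new_op


# Both agents head for cell (1, 2) and block each other.
blocked = joint_step((1, 1), (1, 3), "right", "left")
```

The function assumes the two agents start in distinct cells; rewards and the pick-up logic are left out, since they do not affect the blocking behaviour.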
Tuning the Player’s Performance: To validate tuneability, we assess the performance of the player when reaching convergence while varying and . In the first set of experiments, we fixed and varied . We expect that the player obtains high reward for collaborative opponents () or highly suboptimal adversarial opponents (), and low rewards for strong adversarial opponents (). Indeed, the results shown in Figure 1(a) confirm these intuitions.
For a broader spectrum of analysis, we lower from to and rerun the same experiments. Results in Figure 1(b) reaffirm the previous conclusions. Here, however, the player attains slightly lower rewards as is decremented. Finally, in Figure 1(c) we plot the reward attained after convergence for a broad range of parameter values. We clearly see the effect of the modulation in both parameters on the resultant reward. The best reward is achieved when both parameters have positive high values, and the least reward for the lowest values.
Estimating the opponent’s parameter: The goal of these experiments is to evaluate the correctness of our maximum likelihood estimate (Section 4.1) of . To conduct these experiments, we fixed and generated data with and that are unknown to the player. At each interaction with the environment, we updated according to Algorithm 1 and using a gradient step on the maximum likelihood objective. Results reported in Figure 2 clearly demonstrate that our extension to estimate is successful. (In the case where the opponent has an arbitrary policy , would converge to a value that attempts to make as close as possible to .)
5.2 High-dimensional Experiments
We repeat the experiments above, now using the deep learning architecture of Section 4.2 on the game of Pong.
The Setup: We use the game Pong from the Roboschool package (https://github.com/openai/roboschool). The state space is 13-dimensional, i.e., x-y positions and x-y velocities for both agents and the ball, and an additional dimension for time. We modified the action space to consist of actions where the set corresponds to . We also modified the reward function to make it compatible with zero-sum games in such a way that if the player scores, the reward is set to , whereas if the opponent scores, to . The networks that represent soft Q-values, , are multilayer perceptrons composed of two hidden layers, each with units, and a matrix output layer composed of units ( actions). Here, each unit denotes a particular combination of and . After each hidden layer, we introduce a ReLU nonlinearity. We used a learning rate of , the ADAM optimizer, a batch size of , and updated the target every training steps.
Tuning the Player’s Performance: In this experiment, we demonstrate successful tuneable performance. Figure 3 shows that for a highly adversarial opponent (i.e., ) the player () acquired negative rewards, whereas for a weak or even collaborative opponent, the player obtained high reward. Gameplay videos can be found at https://sites.google.com/site/submission3591/.
Estimating and game balancing: Finally, we assess the performance of the maximum likelihood estimator applied to game balancing using neural networks. We pretrained a policy for the opponent with parameters and , so that the player is stronger than the opponent (see blue line in Figure 4). In Figure 4, we demonstrate game balancing using the procedure of Section 4.1. In particular, we are able to vary the player’s performance by adapting (online) . For instance, if we set close to we observe that the player is as strong as the opponent, attaining reward (see green line).
6 Conclusion
We extended two-player stochastic games to agents with KL constraints. We evaluated our method theoretically and empirically in both small and high-dimensional state spaces. The most interesting direction for future work is to scale our method to a large number of interacting agents by extending the approach in [Mguni et al.2018].
References

[Agussurja and Lau2012] Lucas Agussurja and Hoong Chuin Lau. Toward large-scale agent guidance in an urban taxi service. Uncertainty in Artificial Intelligence, 2012.
[Braun et al.2011] Daniel A Braun, Pedro A Ortega, Evangelos Theodorou, and Stefan Schaal. Path integral control and bounded rationality. In Adaptive Dynamic Programming And Reinforcement Learning (ADPRL), 2011 IEEE Symposium on, pages 202–209. IEEE, 2011.
[Busoniu et al.2010] Lucian Busoniu, Robert Babuska, Bart De Schutter, and Damien Ernst. Reinforcement learning and dynamic programming using function approximators, volume 39. CRC press, 2010.
[Fox et al.2016] Roy Fox, Ari Pakman, and Naftali Tishby. Taming the noise in reinforcement learning via soft updates. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, pages 202–211. AUAI Press, 2016.
[Goodrich and Schultz2007] Michael A Goodrich and Alan C Schultz. Human-robot interaction: a survey. Foundations and trends in human-computer interaction, 1(3):203–275, 2007.
[GrauMoya et al.2016] Jordi Grau-Moya, Felix Leibfried, Tim Genewein, and Daniel A Braun. Planning with information-processing constraints and model uncertainty in markov decision processes. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 475–491. Springer, 2016.
[Haarnoja et al.2017] Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning, pages 1352–1361, 2017.
[Hunicke2005] Robin Hunicke. The case for dynamic difficulty adjustment in games. In Proceedings of the 2005 ACM SIGCHI International Conference on Advances in computer entertainment technology, pages 429–433. ACM, 2005.
[Kappen2005] Hilbert J Kappen. Path integrals and symmetry breaking for optimal control theory. Journal of statistical mechanics: theory and experiment, 2005(11):P11011, 2005.
[Leibfried et al.2017] Felix Leibfried, Jordi Grau-Moya, and Haitham Bou-Ammar. An information-theoretic optimality principle for deep reinforcement learning. arXiv preprint arXiv:1708.01867, 2017.
[Lieder et al.2012] Falk Lieder, Tom Griffiths, and Noah Goodman. Burn-in, bias, and the rationality of anchoring. In Advances in neural information processing systems, pages 2690–2798, 2012.
[Littman and Szepesvári1996] Michael L Littman and Csaba Szepesvári. A generalized reinforcement-learning model: Convergence and applications. In International Conference on Machine Learning, 1996.
[Littman1994] Michael L Littman. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the 11th International Conference on Machine Learning, 1994, pages 157–163, 1994.
[Littman2001] Michael L Littman. Friend-or-foe Q-learning in general-sum games. In Proceedings of the 18th Int. Conf. on Machine Learning. Citeseer, 2001.
[Mguni et al.2018] David Mguni, Joel Jennings, and Enrique Munoz de Cote. Decentralised learning in systems with many, many strategic agents. In AAAI Conference on Artificial Intelligence, 2018.
[Mnih et al.2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[Ortega and Braun2013] Pedro A Ortega and Daniel A Braun. Thermodynamics as a theory of decision-making with information-processing costs. In Proc. R. Soc. A, volume 469, page 20120683. The Royal Society, 2013.
[Ortega and Stocker2016] Pedro A Ortega and Alan A Stocker. Human decision-making under limited time. In Advances in Neural Information Processing Systems, pages 100–108, 2016.
[Osborne and Rubinstein1994] Martin J Osborne and Ariel Rubinstein. A course in game theory. 1994.
[Peters et al.2010] Jan Peters, Katharina Mülling, and Yasemin Altün. Relative entropy policy search. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, pages 1607–1612. AAAI Press, 2010.
[Puterman1994] Martin Puterman. Markov decision processes: Discrete stochastic dynamic programming. 1994.
[Rubin et al.2012] Jonathan Rubin, Ohad Shamir, and Naftali Tishby. Trading value and information in mdps. Decision Making with Imperfect Decision Makers, pages 57–74, 2012.
[Schulman et al.2015] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1889–1897, 2015.
[Shapley1953] Lloyd S Shapley. Stochastic games. Proceedings of the national academy of sciences, 39(10):1095–1100, 1953.
[Sutton and Barto1998] R Sutton and A Barto. Reinforcement learning. MIT Press, Cambridge, 1998.
[Tishby and Polani2011] Naftali Tishby and Daniel Polani. Information theory of decisions and actions. In Perception-action cycle, pages 601–636. Springer, 2011.
[van den Broek et al.2010] Bart van den Broek, Wim Wiegerinck, and Bert Kappen. Risk sensitive path integral control. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, UAI’10, pages 615–622, Arlington, Virginia, United States, 2010. AUAI Press.
Appendix A Appendix
A.1 Proof of Lemma 1
Proof.
The function is a concave function in when fixing because the terms and are linear in , the last term is constant and the relative entropy term is concave for . Similarly, is convex in when fixing because is convex for and all the other terms are linear or constant in . If is a concave-convex function then
For the remaining case it is trivial to show that
Therefore, . ∎
A.2 Proof of Theorem 1
We start by proving two propositions that we use later in the proof of Theorem 1.
Proposition 1.
Proof.
Given that
we can assume without loss of generality that . Let then
∎
Proposition 2.
Proof.
Given that
Therefore, without loss of generality we can assume . Let then
∎
Corollary 3.
From Proposition 1 and 2 we can conclude that where both extremum operators are equal and either or .
Proof of Theorem 1
Now we continue with the full proof of Theorem 1.
Proof.
To show contraction, we start by explicitly rewriting the infinity norm as
where from the second equality to the third we solved and, and are computed as in the equations from the main text (that depend on and recursively). By using the extremum operator (that can be either or ) we cover the cases where is either positive or negative. We continue with the proof by applying Corollary 3 to the last equality, which gives
Note that this is valid for both cases: the first inequality corresponds to negative (minimization), whereas the second inequality corresponds to positive (maximization). Therefore, our proof covers both positive and negative values of . Now, we are ready to handle the right-hand side of the equation, making use of Corollary 3 once again,
∎
A.3 Derivation of Bounded-Optimal Policies
In this section we sketch the derivation of the bounded optimal policies, first, for the player and, second, for the opponent. The player chooses its policy by first doing the extremization and then its own maximization .
Solving the maximization problem by applying standard variational calculus we obtain the equation in the main manuscript.
In contrast to the case of the player, the policy of the opponent is computed by interchanging the extremization operators from to . Therefore, we have to solve first the inner maximization problem over the player’s policy and then its own extremization (that is a maximization for and a minimization for ). Solving first for gives
Similarly, by applying standard variational calculus we can solve for that gives the policy for the opponent written in the main manuscript.
A.4 Maximum likelihood for estimation of
Consider the dataset of the form with being the total number of data points. The actions of the opponent are sampled according to a fixed distribution