Stochastic Games (SG) provide a natural extension of reinforcement learning [Sutton and Barto1998, Mnih et al.2015, Busoniu et al.2010, Peters et al.2010] to multiple agents, where adapting strategies in the presence of humans or other agents is necessary [Shapley1953, Littman1994, Littman2001].
Current frameworks for stochastic games assume perfectly rational agents – an assumption that is violated in a variety of real-world scenarios, e.g., human-robot interaction [Goodrich and Schultz2007], and pick-up and drop-off domains [Agussurja and Lau2012]. In the context of computer games, the main focus of this paper, such a problem of perfect rationality is even more amplified. Here, in fact, it is not desirable to design agents that seek optimal behaviour as this leads humans to become quickly uninterested when playing against adversarial agents that are impossible to defeat [Hunicke2005]. Hence, to design games with adaptable and balancing properties tailored to human-level performance, there is a need to extend state-of-the-art SG beyond optimality to allow for tuneable behaviour.
One method to produce tuneable behaviour in reinforcement learning is to introduce an adjustable Kullback-Leibler (KL) constraint between the agent’s policy and a reference one. In particular, by increasingly strengthening this constraint we can obtain policies increasingly close to the reference policy, and vice versa. Bounding policy updates in such a manner has been previously introduced in literature under different names. Examples include KL control, relative entropy policy search [Peters et al.2010], path integral control [Kappen2005, Braun et al.2011], information-theoretic bounded rationality [Ortega and Braun2013], information-theory of decisions and actions [Tishby and Polani2011, Rubin et al.2012], and soft Q-learning [Fox et al.2016, Haarnoja et al.2017]. Targeted problems using these methods are also wide-spread, e.g., tackling the overestimation problem in tabular Q-learning [Fox et al.2016] and in Deep Q-networks [Leibfried et al.2017], accounting for model misspecification [Grau-Moya et al.2016], introducing safe policy updates in robot learning [Schulman et al.2015], and inducing risk-sensitive control [van den Broek et al.2010].
Contributions: Though abundant in literature, previous works only consider single-agent problems and are not readily applicable to stochastic games, which consider more than one interacting entity. With game balancing as our motivation, we propose a novel formulation of SG where agents are subject to KL constraints. In particular, our formulation introduces two KL constraints, one for each agent, limiting the space of available policies, which, in turn, enables tuneable behaviour. We then introduce an online strategy that can be used for game-play balancing even in high-dimensional spaces through a neural network architecture.
In short, the contributions of this paper can be summarised as: (1) proving convergence of the two-player soft Q-learning to a fixed point through contractions; (2) generalising team and zero-sum games in a continuous fashion and showing a unique value; (3) demonstrating convergence to correct behaviour by tuning the KL constraints on a simplified grid-world scenario; (4) extending our method to handle high-dimensional spaces; and (5) inferring opponent’s Lagrange multiplier by maximum-likelihood, and demonstrating game-balancing behaviour on the game of Pong.
2.1 Reinforcement Learning
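In reinforcement learning (RL) [Sutton and Barto1998] an agent interacts with an unknown environment to determine an optimal policy that maximises total expected return. These problems are formalised as Markov decision processes (MDPs). Formally, an MDP is defined as the tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, R, \gamma \rangle$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ the action space, and $\mathcal{P}$ denotes the state transition density. Namely, when being in state $s_t \in \mathcal{S}$ and applying an action $a_t \in \mathcal{A}$, the agent transitions to $s_{t+1} \sim \mathcal{P}(s_{t+1} \mid s_t, a_t)$. The reward function $R(s_t, a_t)$ quantifies the agent’s performance and $\gamma \in [0, 1)$ is the discount factor that trades off current and future rewards. The goal is to implement a policy that maximises total discounted rewards, i.e., $\pi^\star = \arg\max_{\pi} V^{\pi}(s)$, where $V^{\pi}(s) = \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \mid s_0 = s\big]$.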
2.2 Single Agent Soft Q-Learning
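A way to constrain the behaviour of an agent is to modify the feasibility set of allowable policies. This can be achieved by introducing a constraint, such as a KL divergence between two policy distributions, to the reinforcement learning objective. Such an approach has already been used within single-agent reinforcement learning. For example, soft Q-learning has been used to reduce the overestimation problem of standard Q-learning [Fox et al.2016] and for building flexible energy-based policies in continuous domains [Haarnoja et al.2017]. Most of these approaches modify the standard objective of reinforcement learning to
$$\max_{\pi}\; \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\Big] \quad \text{s.t.} \quad \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t\, \mathrm{KL}\big(\pi(\cdot \mid s_t)\,\|\,\rho(\cdot \mid s_t)\big)\Big] \le C, \tag{1}$$
where $C$ is the amount of bits (or nats if using the natural logarithm) measured by the KL divergence that the policy $\pi$ is allowed to deviate from a reference policy $\rho$. The expectation operation is over state-action trajectories.
To solve the above constrained problem, one typically introduces a Lagrange multiplier, $\beta$, and rewrites an equivalent unconstrained problem
$$V^\star(s) = \max_{\pi}\; \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t \Big(R(s_t, a_t) - \frac{1}{\beta}\log\frac{\pi(a_t \mid s_t)}{\rho(a_t \mid s_t)}\Big)\Big]. \tag{2}$$
To derive an algorithm for solving the above, one comes to recognise that $V^\star$ also satisfies a recursion similar to that introduced by the Bellman equations [Puterman1994]. Additionally, the optimal policy can be written in closed form as
$$\pi^\star(a \mid s) = \frac{\rho(a \mid s)\exp\big(\beta Q^\star(s, a)\big)}{\sum_{a'} \rho(a' \mid s)\exp\big(\beta Q^\star(s, a')\big)},$$
where $V^\star(s) = \frac{1}{\beta}\log\sum_{a}\rho(a \mid s)\exp\big(\beta Q^\star(s, a)\big)$, and $Q^\star(s, a) = R(s, a) + \gamma \sum_{s'} \mathcal{P}(s' \mid s, a) V^\star(s')$ with $\mathcal{P}$ the transition density. Notice that the above represents a generalisation of standard RL settings, where $\beta \to \infty$ corresponds to a perfectly rational valuation ($V^\star(s) = \max_{\pi} V^{\pi}(s)$), while for $\beta \to 0$ we recover the valuation under $\rho$ ($V^\star(s) = V^{\rho}(s)$). Clearly, we can generate a continuum of policies between the reference and the perfectly rational policy that maximises the expected reward by tuning the choice of $\beta$ as detailed in [Leibfried et al.2017].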
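The two limits described above can be checked numerically. The following is a minimal sketch, assuming a single state with three actions, hypothetical Q-values, and a uniform reference policy; the function names `soft_value` and `soft_policy` are illustrative and do not come from the paper.

```python
import numpy as np

def soft_value(q, rho, beta):
    """V = (1/beta) * log sum_a rho(a) * exp(beta * Q(a)), computed stably."""
    z = beta * q
    m = np.max(z)
    return (m + np.log(np.sum(rho * np.exp(z - m)))) / beta

def soft_policy(q, rho, beta):
    """pi(a) proportional to rho(a) * exp(beta * Q(a))."""
    z = beta * q
    w = rho * np.exp(z - np.max(z))
    return w / np.sum(w)

q = np.array([1.0, 2.0, 4.0])   # hypothetical Q-values for one state
rho = np.full(3, 1.0 / 3.0)     # uniform reference policy (assumption)

v_rational = soft_value(q, rho, beta=1e3)   # approaches max_a Q(a)
v_prior = soft_value(q, rho, beta=1e-6)     # approaches E_rho[Q(a)]
print(v_rational, v_prior, soft_policy(q, rho, beta=1.0))
```

Intermediate values of `beta` interpolate between the two regimes, which is the tuneable behaviour the section relies on.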
2.3 Two-Player Stochastic Games
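In two-player stochastic games [Shapley1953, Littman1994], two agents, which we denote as the player and the opponent, interact in an environment. Each agent executes a policy that we write as $\pi_{pl}(a^{pl} \mid s)$ and $\pi_{op}(a^{op} \mid s)$. At some time step $t$, the player chooses an action $a_t^{pl}$, while the opponent picks $a_t^{op}$. Accordingly, the environment transitions to a successor state $s_{t+1} \sim \mathcal{T}(s_{t+1} \mid s_t, a_t^{pl}, a_t^{op})$, where $\mathcal{T}$ denotes the joint transition model for the game. After transitioning to a new state, both agents receive a particular reward depending on the type of game considered. In team games, both the player and the opponent maximise the same reward function $R(s_t, a_t^{pl}, a_t^{op})$. For zero-sum games, the player seeks to maximise $R$, whereas the opponent seeks to find a minimum. We write the policy dependent value as $V^{\pi_{pl},\pi_{op}}(s) = \mathbb{E}\big[\sum_{t=0}^{\infty}\gamma^t R(s_t, a_t^{pl}, a_t^{op})\big]$ where, in contrast to the one-player setting, the expectation is over state and joint-action trajectories.
In stochastic games it is common to assume perfect rationality for both agents, i.e., in the case of a zero-sum game the player computes the optimal value of state $s$ as $V^\star_{pl}(s) = \max_{\pi_{pl}}\min_{\pi_{op}} V^{\pi_{pl},\pi_{op}}(s)$, while the opponent computes $V^\star_{op}(s) = \min_{\pi_{op}}\max_{\pi_{pl}} V^{\pi_{pl},\pi_{op}}(s)$. Similarly, in team games the optimal value for the player is $V^\star_{pl}(s) = \max_{\pi_{pl}}\max_{\pi_{op}} V^{\pi_{pl},\pi_{op}}(s)$ and for the opponent $V^\star_{op}(s) = \max_{\pi_{op}}\max_{\pi_{pl}} V^{\pi_{pl},\pi_{op}}(s)$. It is straightforward to show that for team games $V^\star_{pl}(s) = V^\star_{op}(s)$. For zero-sum games, an important classic result in game theory, the minimax theorem [Osborne and Rubinstein1994], states that $\max_{\pi_{pl}}\min_{\pi_{op}} V^{\pi_{pl},\pi_{op}}(s) = \min_{\pi_{op}}\max_{\pi_{pl}} V^{\pi_{pl},\pi_{op}}(s)$, i.e., both team and zero-sum games have a unique value.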
Importantly, in complex games with large state-spaces the $\max$ and the $\min$ operations over all available policies are extremely difficult to compute. Humans and suboptimal agents seek to approximate these operations as best they can but never fully do so due to the lack of computational resources [Ortega and Stocker2016], approximations and introduced biases [Lieder et al.2012]. This limits the applicability of SG when interacting with suboptimal entities, e.g., in computer games when competing against human players. We next provide the first extension, to the best of our knowledge, of soft Q-learning to SGs and show how our framework can be used within the context of balancing the game’s difficulty.
3 Two-Player Soft Q-Learning
To enable soft Q-learning in two-player games we introduce two KL constraints that allow us to separately control the performance of both agents. In particular, we incorporate a constraint similar to (1) into the objective function for each agent and apply the method of Lagrange multipliers
$$V^{\pi_{pl},\pi_{op}}(s) = \mathbb{E}\Big[\sum_{t=0}^{\infty}\gamma^t\Big(R(s_t, a_t^{pl}, a_t^{op}) - \frac{1}{\beta_{pl}}\log\frac{\pi_{pl}(a_t^{pl} \mid s_t)}{\rho_{pl}(a_t^{pl} \mid s_t)} - \frac{1}{\beta_{op}}\log\frac{\pi_{op}(a_t^{op} \mid s_t)}{\rho_{op}(a_t^{op} \mid s_t)}\Big)\Big],$$
where the expectation is over joint-action trajectories, $\log\frac{\pi_{pl}}{\rho_{pl}}$ is the information cost for the player (that turns into a KL divergence with the expectation operator), and $\log\frac{\pi_{op}}{\rho_{op}}$ is the information cost for the opponent. The Lagrange multipliers $\beta_{pl}$ and $\beta_{op}$ are tuneable parameters that we can vary at will. The distributions $\rho_{pl}$ and $\rho_{op}$ are arbitrary reference policies that we assume to be uniform (considering other reference policies is left as an interesting direction for future work). Using the above, the player and the opponent compute the optimal soft-value of a state using
$$V^\star_{pl}(s) = \max_{\pi_{pl}}\,\operatorname{ext}_{\pi_{op}} V^{\pi_{pl},\pi_{op}}(s), \qquad V^\star_{op}(s) = \operatorname{ext}_{\pi_{op}}\,\max_{\pi_{pl}} V^{\pi_{pl},\pi_{op}}(s). \tag{3}$$
We define the extremum operator $\operatorname{ext}$ to correspond to a $\max$ in the case of positive $\beta_{op}$ and to a $\min$ in the case of negative $\beta_{op}$.
It is clear that this novel formulation of the optimisation problems in Equations (3) generalises to cover both zero-sum and team games depending on the choice of $\beta_{op}$. By fixing $\beta_{pl} \to \infty$ and setting $\beta_{op} \to -\infty$ or $\beta_{op} \to \infty$ we recover, respectively, a zero-sum or a team game with perfectly rational agents. For $\beta_{op} \to 0$ we derive a game in which the opponent simply employs policy $\rho_{op}$. For finite values of $\beta_{op}$, we obtain a continuum of opponents with bounded performance ranging from fully adversarial to fully collaborative, including a random policy. It is important to note, as we will show later, that the analytical form of the optimal policies that solve (3) is independent of the extremum operator and only depends on the parameters $\beta_{pl}$ and $\beta_{op}$.
3.1 Unique Value for Two-Player Soft Q-Learning
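In this section we show that the equations in (3) are equivalent, $V^\star_{pl}(s) = V^\star_{op}(s)$, for any $\beta_{pl}$ and $\beta_{op}$, that is, our two-player soft Q-learning exhibits a unique value.
We start by defining the free energy operator as
$$f(\pi_{pl}, \pi_{op}, s, F) := \mathbb{E}_{\pi_{pl},\pi_{op}}\Big[R(s, a^{pl}, a^{op}) + \gamma\,\mathbb{E}_{\mathcal{T}}\big[F(s')\big] - \frac{1}{\beta_{pl}}\log\frac{\pi_{pl}(a^{pl} \mid s)}{\rho_{pl}(a^{pl} \mid s)} - \frac{1}{\beta_{op}}\log\frac{\pi_{op}(a^{op} \mid s)}{\rho_{op}(a^{op} \mid s)}\Big] \tag{4}$$
for an arbitrary free energy vector $F$. Then the Bellman-like operators for both the player and the opponent can be expressed as:
$$\mathcal{B}_{pl}F(s) = \max_{\pi_{pl}}\,\operatorname{ext}_{\pi_{op}} f(\pi_{pl}, \pi_{op}, s, F), \qquad \mathcal{B}_{op}F(s) = \operatorname{ext}_{\pi_{op}}\,\max_{\pi_{pl}} f(\pi_{pl}, \pi_{op}, s, F). \tag{5}$$
Proof Sketch: For proving our main results, summarised in Theorem 2, we commence by showing that the equations in (3) are equivalent. This is achieved by showing that the two operators in Equation (5) are in fact equivalent, see Lemma 1. Proving these operators to be contractions converging to a unique fixed point (see Theorem 1), we conclude that $V^\star_{pl}(s) = V^\star_{op}(s)$ (see Appendix for proof details).
Lemma 1. For any $\beta_{pl} > 0$ and $\beta_{op} \in \mathbb{R}$, and arbitrary free energy vector $F$, $\mathcal{B}_{pl}F(s) = \mathcal{B}_{op}F(s)$.
Due to Lemma 1, we can define the generic operator $\mathcal{B}F(s) := \mathcal{B}_{pl}F(s) = \mathcal{B}_{op}F(s)$. Then, for this generic operator, we can prove the following.
Theorem 1 (Contraction).
For $\beta_{pl} > 0$ and $\beta_{op} \in \mathbb{R}$, the operator $\mathcal{B}$ is an $L_\infty$-norm contraction map, $\|\mathcal{B}F - \mathcal{B}F'\|_\infty \le \gamma \|F - F'\|_\infty$, where $F$ and $F'$ are two arbitrary free energy vectors and $\gamma$ is the discount factor.
Note that our reward is policy dependent (in the information cost) and, therefore, Theorem 1 is not a direct consequence of known results [Littman and Szepesvári1996], which assume that these rewards are policy independent. Using the above and Banach’s fixed point theorem [Puterman1994], we obtain the following corollary.
Corollary 1 (Unique fixed point).
The contraction mapping $\mathcal{B}$ exhibits a unique fixed-point $F^\star$ such that $\mathcal{B}F^\star = F^\star$.
Two-player stochastic games with soft Q-learning have a unique value, i.e. $V^\star_{pl}(s) = V^\star_{op}(s)$.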
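The contraction property and the resulting fixed point can be illustrated numerically. The sketch below is not the paper's code: it assumes a small random game (hypothetical rewards `R` and transitions `T`), uniform reference policies, and the closed-form soft aggregation over the opponent's and then the player's actions as the generic operator.

```python
import numpy as np

def soft_agg(x, beta, axis):
    """(1/beta) * log mean exp(beta * x) along `axis`; beta may be negative."""
    z = beta * x
    m = z.max(axis=axis, keepdims=True)
    lme = m + np.log(np.mean(np.exp(z - m), axis=axis, keepdims=True))
    return np.squeeze(lme, axis=axis) / beta

def bellman(F, R, T, gamma, beta_pl, beta_op):
    """Generic soft operator: aggregate over the opponent, then over the player."""
    Q = R + gamma * (T @ F)                 # shape (states, a_pl, a_op)
    return soft_agg(soft_agg(Q, beta_op, axis=2), beta_pl, axis=1)

rng = np.random.default_rng(0)
n_s, n_pl, n_op = 4, 3, 3                   # hypothetical game size
R = rng.normal(size=(n_s, n_pl, n_op))
T = rng.random((n_s, n_pl, n_op, n_s))
T /= T.sum(axis=-1, keepdims=True)          # normalise transition probabilities
gamma, beta_pl, beta_op = 0.9, 5.0, -5.0    # adversarial opponent

F1, F2 = rng.normal(size=n_s), rng.normal(size=n_s)
lhs = np.max(np.abs(bellman(F1, R, T, gamma, beta_pl, beta_op)
                    - bellman(F2, R, T, gamma, beta_pl, beta_op)))
rhs = gamma * np.max(np.abs(F1 - F2))

F = np.zeros(n_s)
for _ in range(500):                        # fixed-point iteration
    F = bellman(F, R, T, gamma, beta_pl, beta_op)
residual = np.max(np.abs(bellman(F, R, T, gamma, beta_pl, beta_op) - F))
print(lhs <= rhs, residual)
```

Repeating the check with a positive `beta_op` exercises the collaborative case with the same code path.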
3.2 Bounded-Optimal Policies
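Corollary 2 allows us to exploit the fact that there exists one unique value to generate the policies for both agents. With this in mind, we next design an algorithm (similar in spirit to standard Q-Learning) that acquires tuneable policies. We start by defining a state-action value function, in resemblance to the Q-function, as
$$Q^\star(s, a^{pl}, a^{op}) := R(s, a^{pl}, a^{op}) + \gamma\,\mathbb{E}_{\mathcal{T}(s' \mid s, a^{pl}, a^{op})}\big[F^\star(s')\big].$$
For action selection, neither the player nor the opponent can directly use $Q^\star(s, a^{pl}, a^{op})$ as it depends on the action of the other agent, which is unknown a priori. Instead, it can be shown that agents must first compute the certainty equivalent by marginalising $Q^\star$ as
$$Q^\star_{pl}(s, a^{pl}) = \frac{1}{\beta_{op}}\log\sum_{a^{op}}\rho_{op}(a^{op} \mid s)\exp\big(\beta_{op} Q^\star(s, a^{pl}, a^{op})\big), \qquad Q^\star_{op}(s, a^{op}) = \frac{1}{\beta_{pl}}\log\sum_{a^{pl}}\rho_{pl}(a^{pl} \mid s)\exp\big(\beta_{pl} Q^\star(s, a^{pl}, a^{op})\big).$$
With these definitions and using standard variational calculus, we obtain optimal policies for both the player and the opponent as
$$\pi^\star_{pl}(a^{pl} \mid s) = \frac{1}{Z_{pl}(s)}\rho_{pl}(a^{pl} \mid s)\exp\big(\beta_{pl} Q^\star_{pl}(s, a^{pl})\big), \qquad \pi^\star_{op}(a^{op} \mid s) = \frac{1}{Z_{op}(s)}\rho_{op}(a^{op} \mid s)\exp\big(\beta_{op} Q^\star_{op}(s, a^{op})\big),$$
where $Z_{pl}(s)$ and $Z_{op}(s)$ are normalising functions which can be exactly computed when assuming small discrete action spaces.
Hence, $F^\star(s)$ can be expressed in closed form by incorporating the optimal policies in Equation (3) giving
$$F^\star(s) = \frac{1}{\beta_{pl}}\log\sum_{a^{pl}}\rho_{pl}(a^{pl} \mid s)\exp\big(\beta_{pl} Q^\star_{pl}(s, a^{pl})\big).$$
As summarised in Algorithm 1, we learn $Q^\star$ by applying the following recursion rule with learning rate $\alpha$:
$$Q(s, a^{pl}, a^{op}) \leftarrow Q(s, a^{pl}, a^{op}) + \alpha\big(R(s, a^{pl}, a^{op}) + \gamma F(s') - Q(s, a^{pl}, a^{op})\big).$$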
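A tabular version of these steps can be sketched as follows. This is an illustrative reading of the section rather than the authors' Algorithm 1: the array layout `q[s, a_pl, a_op]`, the helper names, and the uniform reference policies are assumptions made for the example.

```python
import numpy as np

def weighted_soft_agg(x, rho, beta, axis):
    """(1/beta) * log sum_i rho_i * exp(beta * x_i) along `axis` (beta may be negative)."""
    z = beta * x
    m = np.max(z, axis=axis, keepdims=True)
    s = np.sum(rho * np.exp(z - m), axis=axis, keepdims=True)
    return np.squeeze(m + np.log(s), axis=axis) / beta

def boltzmann(q, rho, beta):
    """Policy proportional to rho * exp(beta * q)."""
    z = beta * q
    w = rho * np.exp(z - np.max(z))
    return w / np.sum(w)

def soft_quantities(q_s, beta_pl, beta_op):
    """Return F(s), pi_pl(.|s), pi_op(.|s) from one state's table q_s[a_pl, a_op]."""
    n_pl, n_op = q_s.shape
    rho_pl = np.full(n_pl, 1.0 / n_pl)     # uniform reference policies
    rho_op = np.full(n_op, 1.0 / n_op)
    q_pl = weighted_soft_agg(q_s, rho_op, beta_op, axis=1)            # marginalise opponent
    q_op = weighted_soft_agg(q_s, rho_pl[:, None], beta_pl, axis=0)   # marginalise player
    f = float(weighted_soft_agg(q_pl, rho_pl, beta_pl, axis=0))
    return f, boltzmann(q_pl, rho_pl, beta_pl), boltzmann(q_op, rho_op, beta_op)

def q_update(q, s, a_pl, a_op, r, s_next, alpha, gamma, beta_pl, beta_op):
    """One tabular step: Q <- Q + alpha * (r + gamma * F(s') - Q), in place."""
    f_next, _, _ = soft_quantities(q[s_next], beta_pl, beta_op)
    q[s, a_pl, a_op] += alpha * (r + gamma * f_next - q[s, a_pl, a_op])
    return q

q_s = np.array([[3.0, 1.0], [0.0, 2.0]])   # hypothetical joint-action values
f_zero_sum, pi_pl, pi_op = soft_quantities(q_s, beta_pl=200.0, beta_op=-200.0)
f_team, _, _ = soft_quantities(q_s, beta_pl=200.0, beta_op=200.0)
print(f_zero_sum, f_team, pi_pl, pi_op)
```

With large multipliers the soft value approaches the max-min value (1 in this table) for a negative `beta_op` and the max-max value (3) for a positive one, matching the limiting cases discussed in Section 3.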
4 Real-World Considerations
Two restrictions limit the applicability of our algorithm to real-world scenarios. First, Algorithm 1 implicitly assumes knowledge of the opponent’s parameter $\beta_{op}$. Obtaining $\beta_{op}$ in real-world settings can prove difficult. Second, our algorithm has been developed for low-dimensional state representations. Clearly, this restricts its applicability to high-dimensional states that are typical of computer games.
To overcome these issues, we next develop an online maximum likelihood procedure to infer $\beta_{op}$ from data gathered through the interaction with the opponent, and then generalise Algorithm 1 to high-dimensional representations by proposing a deep learning architecture.
4.1 Estimating $\beta_{op}$ & Game-Balancing
Rather than assuming access to the opponent’s rationality parameter, we next devise a maximum likelihood estimate that allows the agent to infer $\beta_{op}$ in an online fashion and, consequently, the real policy of the opponent (through Equation (3.2)).
Contrary to current SG techniques that attempt to approximate the opponent’s policy directly, our method allows us to reason about the opponent by only approximating a one-dimensional parameter, i.e., $\beta_{op}$ in Equation (3.2). (As in the previous section, we assume the opponent’s reference policy to be uniform. This, however, does not impose a strong restriction since a uniform reference policy provides enough flexibility to model various degrees of the opponent’s performance; see Section 5.)
Estimating $\beta_{op}$: We frame the problem of estimating $\beta_{op}$ as one of online maximum likelihood estimation. Namely, we assume that the player interacts in rounds with the opponent. At each round, $j$, the player gathers a dataset of the form $\mathcal{D}^{(j)} = \{(s_i, a_i^{pl}, a_i^{op})\}_{i=1}^{m^{(j)}}$ with $m^{(j)}$ denoting the total number of sampled transitions during round $j$. Given $\mathcal{D}^{(j)}$, the agent updates its knowledge of the opponent’s model, i.e., $\beta_{op}$, by solving the following problem (which can be easily solved using stochastic gradient descent)
$$\max_{\beta_{op}}\; \sum_{i=1}^{m^{(j)}} \log \pi^\star_{op}(a_i^{op} \mid s_i),$$
where $\pi^\star_{op}$ is defined in Equation (3.2). As rounds progress, the agent should learn to improve its estimate of $\beta_{op}$. Such an improvement is quantified, in terms of regret, in the following theorem for both a fixed and a time-varying opponent. (Regret is a standard notion to quantify the performance of an online learning algorithm. It measures the performance of the agent with respect to an adversary that has access to all information upfront.)
After rounds, the average static-regret for estimating vanishes as:
For a time-varying opponent, the dynamic regret bound dictates:
with denoting the negative of the log-likelihood and .
From the above theorem we conclude that against a fixed opponent our method guarantees correct approximation of $\beta_{op}$. This is true since the average regret vanishes as the number of rounds grows. When it comes to a dynamic opponent, however, our bound depends on how the value of the opponent’s multiplier parameter (in other words, its policy) varies across rounds. If these variations are bounded in number, we can still guarantee vanishing regret. If not, the regret bound can grow arbitrarily large.
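The online estimation procedure can be sketched with a small simulation. Everything in it is an assumption made for illustration: the table `q_op` of opponent certainty-equivalent values is hypothetical and held fixed, the true multiplier is set to -2, the reference policy is uniform, and one gradient-ascent step on the mean log-likelihood is taken per round.

```python
import numpy as np

def opponent_policy(q_op, beta):
    """pi_op(a|s) proportional to exp(beta * Q_op(s, a)) under a uniform reference."""
    z = beta * q_op
    z = z - z.max(axis=-1, keepdims=True)
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)

def loglik_grad(q_op, states, actions, beta):
    """d/dbeta of the mean log-likelihood: Q_op(s, a) - E_pi[Q_op(s, .)]."""
    pi = opponent_policy(q_op[states], beta)
    expected_q = np.sum(pi * q_op[states], axis=-1)
    return np.mean(q_op[states, actions] - expected_q)

def estimate_beta(q_op, beta_true, rounds=400, batch=200, lr=0.2, seed=0):
    """Simulate rounds of interaction and update the estimate once per round."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = q_op.shape
    pi_true = opponent_policy(q_op, beta_true)
    beta_hat = 1.0                                   # arbitrary initial guess
    for _ in range(rounds):
        states = rng.integers(n_states, size=batch)
        # Sample opponent actions from the true policy by inverting the CDF.
        cdf = np.cumsum(pi_true[states], axis=-1)
        u = rng.random(batch)
        actions = np.minimum((u[:, None] > cdf).sum(axis=-1), n_actions - 1)
        beta_hat += lr * loglik_grad(q_op, states, actions, beta_hat)
    return beta_hat

q_op = np.array([[0.0, 1.0, 2.0],
                 [1.0, -1.0, 0.5],
                 [2.0, 0.0, -1.0]])                  # hypothetical values
beta_hat = estimate_beta(q_op, beta_true=-2.0)
print(beta_hat)
```

Because the log-likelihood of this one-parameter exponential family is concave in the multiplier, plain gradient ascent is enough here; in the full method the value table itself is learned at the same time.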
Game Balancing: Now that we have a way to learn $Q^\star$ and estimate $\beta_{op}$ simultaneously, we could balance the game using the estimate of $\beta_{op}$ to adjust the player’s parameter $\beta_{pl}$. A simple heuristic that proved successful in our experiments was to simply set $\beta_{pl} = |\beta_{op}| + \Delta$, where $\Delta$ denotes an additional performance-level the player can achieve. Setting $\Delta = 0$ would correspond to agents with the same KL constraints, whereas setting $\Delta > 0$ would imply a stronger player with a softer KL constraint (see Section 5.2).
4.2 Deep Two-Player Soft Q-Learning
When tackling higher dimensional problems, one has to rely on function approximators to estimate the Q-function, or in our case, the function $Q(s, a^{pl}, a^{op})$. We borrow two ideas from deep Q-networks [Mnih et al.2015] that allow us to stabilise learning with high-dimensional representations for our SG setting. First, we use the notion of a replay memory to store transitions $(s, a^{pl}, a^{op}, s', R)$ and, second, we use a target network, denoted by $Q(\cdot\,; \theta^{-})$, to handle non-stationarity of the objective. We learn $Q(s, a^{pl}, a^{op}; \theta)$ by using a neural network that receives $s$ as input and outputs a matrix of Q-values, one for each combination of the agents’ actions. The loss function that we seek to minimise is
$$\mathcal{L}(\theta) = \mathbb{E}\Big[\big(R + \gamma F(s'; \theta^{-}) - Q(s, a^{pl}, a^{op}; \theta)\big)^2\Big],$$
with the expectation taken over the distribution of transitions sampled from the replay memory, and $F(s'; \theta^{-})$ computed as in Equation (7) from the target network. Clearly, the above optimisation problem is similar to standard DQNs with the difference that the error is measured between soft Q-values.
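The loss on a sampled mini-batch can be sketched as below, independently of any deep learning framework. The sketch assumes the online and target networks have already produced Q-value matrices of shape `(batch, n_pl, n_op)`, that reference policies are uniform, and that a `done` flag masks terminal transitions; the flag and all names are assumptions for the example, not details given in the text.

```python
import numpy as np

def soft_agg(x, beta, axis):
    """(1/beta) * log mean exp(beta * x) along `axis`; beta may be negative."""
    z = beta * x
    m = z.max(axis=axis, keepdims=True)
    lme = m + np.log(np.mean(np.exp(z - m), axis=axis, keepdims=True))
    return np.squeeze(lme, axis=axis) / beta

def soft_td_loss(q_pred, q_next_target, a_pl, a_op, reward, done,
                 gamma, beta_pl, beta_op):
    """Mean squared error between Q(s, a_pl, a_op) and r + gamma * F(s')."""
    q_pl_next = soft_agg(q_next_target, beta_op, axis=2)   # (batch, n_pl)
    f_next = soft_agg(q_pl_next, beta_pl, axis=1)          # (batch,)
    target = reward + gamma * (1.0 - done) * f_next        # target net is held fixed
    idx = np.arange(q_pred.shape[0])
    td_error = target - q_pred[idx, a_pl, a_op]
    return np.mean(td_error ** 2)

batch, n_pl, n_op = 2, 3, 3
q_pred = np.zeros((batch, n_pl, n_op))                     # online-network outputs
q_next = np.full((batch, n_pl, n_op), 2.0)                 # target-network outputs
loss = soft_td_loss(q_pred, q_next,
                    a_pl=np.array([0, 1]), a_op=np.array([2, 0]),
                    reward=np.array([1.0, -1.0]), done=np.array([0.0, 0.0]),
                    gamma=0.5, beta_pl=10.0, beta_op=-10.0)
print(loss)
```

In a framework with automatic differentiation, only `q_pred` would carry gradients, while the target term is treated as a constant.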
We consider two cases in our experiments. The first assumes a low-dimensional setting, while the second targets the high-dimensional game of Pong. In both cases we consider full and no control of the opponent. Full control will allow us to validate our intuitions of tuneable behaviour, while no control sheds light on the game-balancing capabilities of Section 4.
5.1 Low-dimensional Experiments
The Setup: We validate Algorithm 1 on a 5 × 6 grid-world, where we consider two agents interacting. Each can choose from five actions. The first four actions are primitive movements, while the last corresponds to picking up an object when possible. The reward of the first player is set to for any movement and to for picking up the object located in cell (2,6).
The setting described in this paper allows for a range of games that can be continuously varied between cooperative and defective games depending on the choice of $\beta_{op}$, a setting not allowed by any of the current techniques for stochastic games. In other words, the goal of the opponent now depends on the choice of $\beta_{op}$. Namely, for positive values of $\beta_{op}$, the opponent is collaborative, whereas for negative $\beta_{op}$ it is adversarial. Values of $\beta_{op}$ in between correspond to tuneable performance varying between the above two extremes.
We demonstrate adversarial behaviour by allowing agents to block each other either when trying to reach the same cell, or when attempting to transition to a cell previously occupied by the other agent. In such cases the respective agent remains in its current position. Given the determinism of the environment, a perfectly rational adversarial opponent can always prevent the player from reaching the goal. However, due to the KL constraints the opponent’s policy becomes “less” aggressive, allowing the player to exploit the opponent’s mistakes and arrive at the goal. For all experiments we used a high learning rate, which the deterministic environment transitions allow.
Tuning the Player’s Performance: To validate tuneability, we assess the performance of the player when reaching convergence while varying $\beta_{pl}$ and $\beta_{op}$. In the first set of experiments, we fixed $\beta_{pl}$ and varied $\beta_{op}$. We expect that the player obtains high reward for collaborative opponents ($\beta_{op} > 0$) or highly sub-optimal adversarial opponents (negative $\beta_{op}$ of small magnitude), and low rewards for strong adversarial opponents (negative $\beta_{op}$ of large magnitude). Indeed, the results shown in Figure 1(a) confirm these intuitions.
For a broader spectrum of analysis, we lower $\beta_{pl}$ and re-run the same experiments. Results in Figure 1(b) reaffirm the previous conclusions. Here, however, the player attains slightly lower rewards as $\beta_{pl}$ is decremented. Finally, in Figure 1(c) we plot the reward attained after convergence for a broad range of parameter values. We clearly see the effect of the modulation in both parameters on the resultant reward. The best reward is achieved when both parameters have positive high values, and the least reward for the lowest values.
Estimating $\beta_{op}$: The goal of these experiments is to evaluate the correctness of our maximum likelihood estimate (Section 4.1) of $\beta_{op}$. To conduct these experiments, we fixed $\beta_{pl}$ and generated data with opponent parameters that are unknown to the player. At each interaction with the environment, we updated $Q$ according to Algorithm 1 and the estimate of $\beta_{op}$ using a gradient step in the maximum likelihood objective. Results reported in Figure 2 clearly demonstrate that our extension to estimate $\beta_{op}$ is successful. (In the case where the opponent has an arbitrary policy, the estimate of $\beta_{op}$ would converge to a value that attempts to make the modelled opponent policy as close as possible to the true one.)
5.2 High-dimensional Experiments
We repeat the experiments above but now considering our deep learning architecture of Section 4.2 on the game of Pong.
The Setup: We use the game Pong from the Roboschool package (https://github.com/openai/roboschool). The state space is 13-dimensional, i.e., x-y positions and x-y velocities for both agents and the ball, and an additional dimension for time. We modified the action space to consist of actions where the set corresponds to . We also modified the reward function to make it compatible with zero-sum games in such a way that if the player scores, the reward is set to , whereas if the opponent scores, to . The networks that represent soft Q-values, $Q(s, a^{pl}, a^{op}; \theta)$, are multilayer perceptrons composed of two hidden layers, each with units, and a matrix output layer composed of units ( actions). Here, each unit denotes a particular combination of $a^{pl}$ and $a^{op}$. After each hidden layer, we introduce a ReLU non-linearity. We used a learning rate of , the ADAM optimizer, a batch size of , and updated the target every training steps.
Tuning the Player’s Performance: In this experiment, we demonstrate successful tuneable performance. Figure 3 shows that for a highly adversarial opponent (i.e., negative $\beta_{op}$ of large magnitude) the player acquired negative rewards, whereas for a weak or even collaborative opponent, the player obtained high reward. Game-play videos can be found at https://sites.google.com/site/submission3591/.
Estimating $\beta_{op}$ and game balancing: Finally, we assess the performance of the maximum likelihood estimator applied to game balancing using neural networks. We pre-trained a policy for the opponent with parameters $\beta_{pl}$ and $\beta_{op}$ chosen such that the player is stronger than the opponent (see blue line in Figure 4). In Figure 4, we demonstrate game balancing using the procedure of Section 4.1. In particular, we are able to vary the player’s performance by adapting $\beta_{pl}$ online. For instance, if we set $\Delta$ close to 0 we observe that the player is as strong as the opponent, attaining roughly zero reward (see green line).
We extended two-player stochastic games to agents with KL constraints. We evaluated our method theoretically and empirically in both small and high-dimensional state spaces. The most interesting direction for future work is to scale our method to a large number of interacting agents by extending the approach in [Mguni et al.2018].
- [Agussurja and Lau2012] Lucas Agussurja and Hoong Chuin Lau. Toward large-scale agent guidance in an urban taxi service. In Uncertainty in Artificial Intelligence, 2012.
- [Braun et al.2011] Daniel A Braun, Pedro A Ortega, Evangelos Theodorou, and Stefan Schaal. Path integral control and bounded rationality. In Adaptive Dynamic Programming And Reinforcement Learning (ADPRL), 2011 IEEE Symposium on, pages 202–209. IEEE, 2011.
- [Busoniu et al.2010] Lucian Busoniu, Robert Babuska, Bart De Schutter, and Damien Ernst. Reinforcement learning and dynamic programming using function approximators, volume 39. CRC press, 2010.
- [Fox et al.2016] Roy Fox, Ari Pakman, and Naftali Tishby. Taming the noise in reinforcement learning via soft updates. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, pages 202–211. AUAI Press, 2016.
- [Goodrich and Schultz2007] Michael A Goodrich and Alan C Schultz. Human-robot interaction: a survey. Foundations and trends in human-computer interaction, 1(3):203–275, 2007.
- [Grau-Moya et al.2016] Jordi Grau-Moya, Felix Leibfried, Tim Genewein, and Daniel A Braun. Planning with information-processing constraints and model uncertainty in markov decision processes. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 475–491. Springer, 2016.
- [Haarnoja et al.2017] Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning, pages 1352–1361, 2017.
- [Hunicke2005] Robin Hunicke. The case for dynamic difficulty adjustment in games. In Proceedings of the 2005 ACM SIGCHI International Conference on Advances in computer entertainment technology, pages 429–433. ACM, 2005.
- [Kappen2005] Hilbert J Kappen. Path integrals and symmetry breaking for optimal control theory. Journal of statistical mechanics: theory and experiment, 2005(11):P11011, 2005.
- [Leibfried et al.2017] Felix Leibfried, Jordi Grau-Moya, and Haitham Bou-Ammar. An information-theoretic optimality principle for deep reinforcement learning. arXiv preprint arXiv:1708.01867, 2017.
- [Lieder et al.2012] Falk Lieder, Tom Griffiths, and Noah Goodman. Burn-in, bias, and the rationality of anchoring. In Advances in neural information processing systems, pages 2690–2798, 2012.
- [Littman and Szepesvári1996] Michael L Littman and Csaba Szepesvári. A generalized reinforcement-learning model: Convergence and applications. In International Conference on Machine Learning, 1996.
- [Littman1994] Michael L Littman. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the 11th International Conference on Machine Learning, 1994, pages 157–163, 1994.
- [Littman2001] Michael L Littman. Friend or foe q-learning in general-sum games. In Proceedings of the 18th Int. Conf. on Machine Learning. Citeseer, 2001.
- [Mguni et al.2018] David Mguni, Joel Jennings, and Enrique Munoz de Cote. Decentralised learning in systems with many, many strategic agents. In AAAI Conference on Artificial Intelligence, 2018.
- [Mnih et al.2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
- [Ortega and Braun2013] Pedro A Ortega and Daniel A Braun. Thermodynamics as a theory of decision-making with information-processing costs. In Proc. R. Soc. A, volume 469, page 20120683. The Royal Society, 2013.
- [Ortega and Stocker2016] Pedro A Ortega and Alan A Stocker. Human decision-making under limited time. In Advances in Neural Information Processing Systems, pages 100–108, 2016.
- [Osborne and Rubinstein1994] Martin J Osborne and Ariel Rubinstein. A course in game theory. 1994.
- [Peters et al.2010] Jan Peters, Katharina Mülling, and Yasemin Altün. Relative entropy policy search. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, pages 1607–1612. AAAI Press, 2010.
- [Puterman1994] Martin Puterman. Markov decision processes: Discrete stochastic dynamic programming. 1994.
- [Rubin et al.2012] Jonathan Rubin, Ohad Shamir, and Naftali Tishby. Trading value and information in mdps. Decision Making with Imperfect Decision Makers, pages 57–74, 2012.
- [Schulman et al.2015] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1889–1897, 2015.
- [Shapley1953] Lloyd S Shapley. Stochastic games. Proceedings of the national academy of sciences, 39(10):1095–1100, 1953.
- [Sutton and Barto1998] R Sutton and A Barto. Reinforcement learning. MIT Press, Cambridge, 1998.
- [Tishby and Polani2011] Naftali Tishby and Daniel Polani. Information theory of decisions and actions. In Perception-action cycle, pages 601–636. Springer, 2011.
- [van den Broek et al.2010] Bart van den Broek, Wim Wiegerinck, and Bert Kappen. Risk sensitive path integral control. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, UAI’10, pages 615–622, Arlington, Virginia, United States, 2010. AUAI Press.
Appendix A
A.1 Proof of Lemma 1
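The function $f(\pi_{pl}, \pi_{op}, s, F)$ is a concave function in $\pi_{pl}$ when fixing $\pi_{op}$ because the terms $\mathbb{E}_{\pi_{pl},\pi_{op}}[R(s, a^{pl}, a^{op})]$ and $\gamma\,\mathbb{E}_{\pi_{pl},\pi_{op}}\mathbb{E}_{\mathcal{T}}[F(s')]$ are linear in $\pi_{pl}$, the last term is constant and the relative entropy term $-\frac{1}{\beta_{pl}}\mathrm{KL}(\pi_{pl}\,\|\,\rho_{pl})$ is concave for $\beta_{pl} > 0$. Similarly, $f$ is convex in $\pi_{op}$ when fixing $\pi_{pl}$ because $-\frac{1}{\beta_{op}}\mathrm{KL}(\pi_{op}\,\|\,\rho_{op})$ is convex for $\beta_{op} < 0$ and all the other terms are linear or constant in $\pi_{op}$. If $f$ is a concave-convex function then
$$\max_{\pi_{pl}}\min_{\pi_{op}} f(\pi_{pl}, \pi_{op}, s, F) = \min_{\pi_{op}}\max_{\pi_{pl}} f(\pi_{pl}, \pi_{op}, s, F).$$
For the remaining case $\beta_{op} > 0$ it is trivial to show that
$$\max_{\pi_{pl}}\max_{\pi_{op}} f(\pi_{pl}, \pi_{op}, s, F) = \max_{\pi_{op}}\max_{\pi_{pl}} f(\pi_{pl}, \pi_{op}, s, F).$$
Therefore, $\mathcal{B}_{pl}F(s) = \mathcal{B}_{op}F(s)$. ∎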
A.2 Proof of Theorem 1
We start by proving two propositions that we use later in the proof of Theorem 1.
we can assume without loss of generality that . Let then
Therefore, without loss of generality we can assume . Let then
From Proposition 1 and 2 we can conclude that where both extremum operators are equal and either or .
Proof Theorem 1
Now we continue with the full proof of Theorem 1.
To show contraction, we start by explicitly rewriting the infinity norm as
where from the second equality to the third we solved and, and are computed as in the equations from the main text (that depend on and recursively). By using the extremum operator (that can be either or ) we cover the cases where is either positive or negative. We continue with the proof by applying Corollary 3 to the last equality, which gives
Note that this is valid for both cases when having negative (minimization), whereas the second inequality correspond to positive (maximization). Therefore our proof will cover both, positive and negative values of . Now, we are ready to handle the right-side of the equation making use of Corollary 3 once again,
A.3 Derivation of Bounded-Optimal Policies
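In this section we sketch the derivation of the bounded optimal policies, first, for the player and, second, for the opponent. The player chooses its policy by first doing the extremization $\operatorname{ext}_{\pi_{op}}$ and then its own maximization $\max_{\pi_{pl}}$. Solving first for $\pi_{op}$ gives
$$\max_{\pi_{pl}} \sum_{a^{pl}} \pi_{pl}(a^{pl} \mid s)\Big[Q_{pl}(s, a^{pl}) - \frac{1}{\beta_{pl}}\log\frac{\pi_{pl}(a^{pl} \mid s)}{\rho_{pl}(a^{pl} \mid s)}\Big], \qquad Q_{pl}(s, a^{pl}) = \frac{1}{\beta_{op}}\log\sum_{a^{op}}\rho_{op}(a^{op} \mid s)\exp\big(\beta_{op} Q(s, a^{pl}, a^{op})\big).$$
Solving the maximization problem by applying standard variational calculus we obtain the equation in the main manuscript.
In contrast to the case of the player, the policy of the opponent is computed by interchanging the extremization operators from $\max_{\pi_{pl}}\operatorname{ext}_{\pi_{op}}$ to $\operatorname{ext}_{\pi_{op}}\max_{\pi_{pl}}$. Therefore, we have to solve first the inner maximization problem over the player’s policy and then its own extremization (that is a maximization for $\beta_{op} > 0$ and a minimization for $\beta_{op} < 0$). Solving first for $\pi_{pl}$ gives
$$\operatorname{ext}_{\pi_{op}} \sum_{a^{op}} \pi_{op}(a^{op} \mid s)\Big[Q_{op}(s, a^{op}) - \frac{1}{\beta_{op}}\log\frac{\pi_{op}(a^{op} \mid s)}{\rho_{op}(a^{op} \mid s)}\Big], \qquad Q_{op}(s, a^{op}) = \frac{1}{\beta_{pl}}\log\sum_{a^{pl}}\rho_{pl}(a^{pl} \mid s)\exp\big(\beta_{pl} Q(s, a^{pl}, a^{op})\big).$$
Similarly, by applying standard variational calculus we can solve for $\pi_{op}$, which gives the policy for the opponent written in the main manuscript.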
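The variational step invoked in both derivations follows one generic pattern, which can be written out as a worked instance. The symbols below are placeholders chosen for this illustration: $\pi$ is a distribution over a finite action set, $Q(a)$ the value being traded off, $\rho$ the reference distribution, $\beta \neq 0$ the multiplier, and $\lambda$ a Lagrange multiplier enforcing normalisation.

```latex
\begin{align*}
\mathcal{L}(\pi, \lambda)
  &= \sum_a \pi(a)\Big[Q(a) - \tfrac{1}{\beta}\log\tfrac{\pi(a)}{\rho(a)}\Big]
     + \lambda\Big(1 - \sum_a \pi(a)\Big), \\
\frac{\partial \mathcal{L}}{\partial \pi(a)}
  &= Q(a) - \tfrac{1}{\beta}\log\tfrac{\pi(a)}{\rho(a)} - \tfrac{1}{\beta} - \lambda = 0
  \;\Longrightarrow\;
  \pi^\star(a) = \frac{\rho(a)\, e^{\beta Q(a)}}{\sum_{a'} \rho(a')\, e^{\beta Q(a')}}, \\
\sum_a \pi^\star(a)\Big[Q(a) - \tfrac{1}{\beta}\log\tfrac{\pi^\star(a)}{\rho(a)}\Big]
  &= \tfrac{1}{\beta}\log \sum_a \rho(a)\, e^{\beta Q(a)}.
\end{align*}
```

For $\beta > 0$ the objective is concave in $\pi$ and the stationary point is its maximiser; for $\beta < 0$ it is convex and the same expression is its minimiser, so the Boltzmann form does not depend on which extremum is taken.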