# Deep Q-Learning for Nash Equilibria: Nash-DQN

Model-free learning for multi-agent stochastic games is an active area of research. Existing reinforcement learning algorithms, however, are often restricted to zero-sum games, and are applicable only in small state-action spaces or other simplified settings. Here, we develop a new data efficient Deep-Q-learning methodology for model-free learning of Nash equilibria for general-sum stochastic games. The algorithm uses a local linear-quadratic expansion of the stochastic game, which leads to analytically solvable optimal actions. The expansion is parametrized by deep neural networks to give it sufficient flexibility to learn the environment without the need to experience all state-action pairs. We study symmetry properties of the algorithm stemming from label-invariant stochastic games and as a proof of concept, apply our algorithm to learning optimal trading strategies in competitive electronic markets.


## 1 Introduction

The study of equilibria in systems of interacting agents is ubiquitous throughout the natural and social sciences. The classical approach to studying these equilibria requires building a model of the interacting system, solving for its equilibrium, and studying its properties thereafter. This approach often runs into complications, however, as a fine balance between (i) model tractability and (ii) its ability to capture the main features of the data it aims to represent, must be struck. Rather than taking a model-based approach, it is possible to derive non-parametric reinforcement-learning (RL) methods to study these equilibria. The main idea behind these methods is to directly approximate equilibria from simulations or observed data, providing a powerful alternative to the usual approach.

The majority of the existing literature on RL is dedicated to single-player games. Most modern approaches follow either a deep Q-learning approach (e.g., [16]), policy-gradient methods (e.g., [18]), or some mixture thereof (e.g., [7]). RL methods have also been developed for multi-agent games, but these are for the most part restricted to zero-sum games; for a survey, see [1].

There have been recent efforts to extend RL to general-sum games using fictitious play, as in [9], or iterative fixed-point methods, as in [14]. In the specific context of (discrete state-action space) mean-field games, [6] provides a Q-learning algorithm for computing Nash equilibria. Many of the existing algorithms suffer either from computational intractability, as the size and complexity of the game increases or the state-action space becomes continuous, or from an inability to model complex game behaviour.

Hu and Wellman [8] introduce a Q-learning based approach for obtaining Nash equilibria in general-sum stochastic games. Although they prove convergence of the algorithm for games with finite state and action spaces, their approach is computationally infeasible for all but the simplest examples. The main computational bottleneck in their approach is the need to repeatedly compute a local Nash equilibrium over states, which is an NP-hard operation in general. Moreover, the method proposed in [8] does not extend to games where agents choose continuous-valued controls or to games with either high-dimensional game state representations or large numbers of players. We instead combine the iLQG framework of [19, 5] and the Nash Q-learning algorithm of [8] to produce an algorithm which can learn Nash equilibria in these more complex and practically relevant settings.

In particular, we decompose the state-action value (Q)-function as a sum of the value function and the advantage function. We approximate the value function using a neural-net, and we locally approximate the advantage function as linear-quadratic in the agents’ actions with coefficients that are non-linear functions of the features given by a neural-net. This allows us to compute the Nash equilibrium analytically at each point in feature space (i.e., the optimal action of all agents) in terms of the network parameters. Using this closed form local Nash equilibrium, we derive an iterative actor-critic algorithm to learn the network parameters.

In principle, our approach allows us to deal with stochastic games with a large number of game state features and a large action space. Moreover, our approach can be easily adapted to mean-field game (MFG) problems, which result from the infinite population limit of certain stochastic games (see [11, 15, 2]), such as those developed in, e.g., [3, 4] or major-minor agent MFGs such as those studied in, e.g., [10, 17, 12]. A drawback of the method we propose is the restriction on the local structure of the proposed Q-function approximator. We find, however, that the proposed approximators are sufficiently expressive in most cases, and perform well in the numerical examples that we include in this paper.

The remainder of this paper is structured as follows. Section 2 introduces a generic Markov model for a general-sum stochastic game. In Section 3, we present optimality conditions for the stochastic game and motivate our Q-learning approach to finding Nash equilibria. Section 4 introduces our local linear-quadratic approximations to the Q-function and the resulting learning algorithm; we also provide several simplifications that arise in label-invariant games. Section 5 covers implementation details and Section 6 presents some illustrative examples.

## 2 Model Setup

We consider a stochastic game with $N$ agents all competing with one another, and write $\mathfrak{N} = \{1,\dots,N\}$. We assume the state of the game is represented by the stochastic process $(x_t)_{t\ge 0}$, so that for each time $t$, $x_t \in \mathcal{X}$, for some separable Banach space $\mathcal{X}$. At each time $t$, agent-$i$ chooses an action $u_{i,t} \in \mathcal{U}_i$, where $\mathcal{U}_i$ is assumed to be a separable Banach space. In the sequel, we use the notation $u_{-i,t} := (u_{1,t},\dots,u_{i-1,t},u_{i+1,t},\dots,u_{N,t})$ to denote the vector of actions of all agents other than agent-$i$ at time $t$, and the notation $u_t := (u_{1,t},\dots,u_{N,t}) \in \mathcal{U} := \mathcal{U}_1 \times \dots \times \mathcal{U}_N$ to denote the vector of actions of all agents. We assume that the game is a Markov Decision Process (MDP) with a fully visible game state. The MDP assumption is equivalent to assuming the joint state-action process $(x_t, u_t)_{t\ge 0}$ is Markov, whose state transition probabilities are defined by the stationary Markov transition kernel $p(\,\cdot \mid x_t, u_t)$ and a distribution $\rho_0$ over initial states $x_0$.

At each step of the game, agents receive a reward that varies according to the current state of the game, their own choice of actions, and the actions of all other agents. Agent-$i$'s reward is represented by the function $r_i : \mathcal{X} \times \mathcal{U} \to \mathbb{R}$, so that at each time $t$, agent-$i$ accumulates a reward $r_i(x_t, u_{i,t}, u_{-i,t})$. We assume that each function $r_i$ is continuously differentiable and concave in $u_i$, and is continuous in $x$ and $u_{-i}$.

At each time $t$, agent-$i$ may observe other agents' actions $u_{-i,t}$, as well as the state of the game $x_t$. Moreover, each agent-$i$ chooses their actions according to a deterministic Markov policy $\pi_i : \mathcal{X} \to \mathcal{U}_i$. The objective of agent-$i$ is to select the policy that maximizes the objective functional representing their personal expected discounted future reward over the remaining course of the game, given a fixed policy $\pi_i$ for themselves and a fixed policy $\pi_{-i}$ for all other players. The objective functional for agent-$i$ is

$$R_i(x;\pi_i,\pi_{-i}) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma_i^{t}\, r_i(x_t,\pi_{i,t},\pi_{-i,t})\Big], \qquad (1)$$

where the expectation is over the process $(x_t)_{t\ge 0}$, with $x_0 = x$, and where we assume $\gamma_i \in (0,1)$ is a fixed constant representing a discount rate. In Equation (1), we use the compressed notation $\pi_{i,t} = \pi_i(x_t)$ and $\pi_{-i,t} = \pi_{-i}(x_t)$. The agent's objective functional (1) explicitly depends on the policy choice of all agents. Each agent, however, can only control their own policy, and must choose their actions while conditioning on the behavior of all other players.
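As a concrete illustration, the infinite sum in (1) can be approximated by truncating at a finite horizon. The sketch below (plain NumPy, single agent, scalar rewards; all names are hypothetical, not from the paper) computes the truncated discounted reward along one simulated path.

```python
import numpy as np

def discounted_reward(rewards, gamma):
    """Finite-horizon approximation of the objective in Eq. (1):
    sum_t gamma**t * r_t for one agent along one simulated path."""
    t = np.arange(len(rewards))
    return float(np.sum(gamma ** t * np.asarray(rewards)))

# A constant reward of 1 gives a truncated geometric series,
# which approaches 1 / (1 - gamma) as the horizon grows.
approx = discounted_reward([1.0] * 200, gamma=0.9)
print(approx)  # very close to 1 / (1 - 0.9) = 10
```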

Agent-$i$ therefore seeks a policy that optimizes their objective function, but remains robust to the actions of others. In the end, agents' policies form a Nash equilibrium: a collection of policies $\pi^* = (\pi^*_1,\dots,\pi^*_N)$ such that unilateral deviation from this equilibrium by a single agent results in a decrease in the value of that agent's objective functional. Formally, we say that a collection of policies $\pi^*$ forms a Nash equilibrium if

$$R_i(x;\pi_i,\pi^*_{-i}) \le R_i(x;\pi^*_i,\pi^*_{-i}) \qquad (2)$$

for all admissible policies $\pi_i$ and for all $i \in \mathfrak{N}$. Informally, we can interpret the Nash equilibrium as the collection of policies for which each agent simultaneously maximizes their own objective function, conditional on the actions of others.

## 3 Optimality Conditions

Our ultimate goal is to obtain an algorithm that attains the Nash equilibrium of the game without a priori knowledge of its dynamics. To do so, we first identify conditions that are more easily verifiable than the formal definition of a Nash equilibrium given above.

We proceed by extending the well-known Bellman equation to Nash equilibria. While leaving $\pi^*_{-i}$ fixed, we may apply the dynamic programming principle to agent-$i$'s reward, resulting in

$$R_i(x;\pi^*_i,\pi^*_{-i}) = \max_{u\in\mathcal{U}_i}\Big\{ r_i\big(x,u,\pi^*_{-i}(x)\big) + \gamma_i\, \mathbb{E}_{x'\sim p(\cdot\mid x,u)}\big[ R_i(x';\pi^*_{i},\pi^*_{-i})\big]\Big\}. \qquad (3)$$

At the Nash equilibrium, equation (3) is satisfied simultaneously for all $i \in \mathfrak{N}$.

To express this more concisely, we introduce vector notation. First define the vector-valued function $R(x;\pi) := (R_1(x;\pi),\dots,R_N(x;\pi))$, consisting of the stacked objective functions. We call the stacked objective functions evaluated at the Nash equilibrium the stacked value function, which we write as $V(x) := R(x;\pi^*)$.

Next, we define the Nash state-action value function, also called the Q-function, which we denote $Q = (Q_1,\dots,Q_N)$, where

$$Q_i(x;u) = r_i(x;u) + \gamma_i\, \mathbb{E}_{x'\sim p(\cdot\mid x,u)}\big[V_i(x')\big], \quad i \in \mathfrak{N}, \qquad (4)$$

and where we denote $r(x;u) := (r_1(x;u),\dots,r_N(x;u))$ to indicate the vectorized reward function. Each element $Q_i$ can be interpreted as the expected maximum value agent-$i$'s objective function may take, given a fixed current state $x$ and a fixed (arbitrary) immediate action $u$ taken by all agents.

Next, we define the Nash operator as follows.

###### Definition 1 (Nash Operator)

Consider a collection of concave real-valued functions $(f_i)_{i\in\mathfrak{N}}$, where each $f_i : \mathcal{U}_i \times \mathcal{U}_{-i} \to \mathbb{R}$. We define the Nash operator $\mathcal{N}_{u\in\mathcal{U}}$ as a map from the collection of functions $(f_i)_{i\in\mathfrak{N}}$ to their Nash equilibrium value $(f_i(u^*))_{i\in\mathfrak{N}}$, where $u^*$ is the unique point satisfying

$$f_i(u_i, u^*_{-i}) \le f_i(u^*_i, u^*_{-i}), \quad \forall\, u_i \in \mathcal{U}_i,\ \forall\, i \in \mathfrak{N}. \qquad (5)$$

For a sufficiently regular collection of functions $(f_i)_{i\in\mathfrak{N}}$, the Nash operator corresponds to simultaneously maximizing each of the $f_i$ in their first argument $u_i$.

This definition provides us with a relationship between the value function and the agents' Q-functions: $V(x) = \mathcal{N}_{u\in\mathcal{U}}\, Q(x;u)$. Using the Nash operator, we may then express the Bellman equation (3) in the concise form

$$V(x) = \mathcal{N}_{u\in\mathcal{U}}\, Q(x;u) = \mathcal{N}_{u\in\mathcal{U}}\Big\{ r(x;u) + \gamma\, \mathbb{E}_{x'\sim p(\cdot\mid x,u)}\big[V(x')\big]\Big\}, \qquad (6)$$

which we refer to as the Nash-Bellman equation for the remainder of the paper, where the discount $\gamma = (\gamma_1,\dots,\gamma_N)$ acts componentwise. The definition of the value function in equation (6) implies that $V(x) = \mathcal{N}_{u\in\mathcal{U}}\, Q(x;u)$. Hence, in order to identify the Nash equilibrium, it is sufficient to obtain the Q-function and apply the Nash operator to it. This principle informs the approach we take in the remainder of the paper: rather than directly searching the space of policy collections for the Nash equilibrium via equations (1) and (2), we may instead identify the function $Q$ satisfying (6), and thereafter compute $\pi^*(x) = \arg\mathcal{N}_{u\in\mathcal{U}}\, Q(x;u)$.
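To make the Nash operator concrete, consider a hypothetical two-player game in which each $Q_i$ is a concave quadratic in the actions; the best responses are then linear, and $\arg\mathcal{N}$ reduces to solving a small linear system. The sketch below (NumPy; the coefficients are illustrative, not from the paper) computes both $\arg\mathcal{N}_u Q$ and $\mathcal{N}_u Q$.

```python
import numpy as np

# Hypothetical 2-player game with concave quadratic Q-functions:
#   Q_i(u) = -(u_i - a_i - b_i * u_{-i})**2,
# so each best response is u_i = a_i + b_i * u_{-i}, and the Nash
# point solves the 2x2 linear system below.
a = np.array([1.0, 2.0])
b = np.array([0.5, -0.25])

A = np.array([[1.0, -b[0]],
              [-b[1], 1.0]])
u_star = np.linalg.solve(A, a)   # arg N_u Q: the Nash actions

# At u_star each Q_i attains its maximum value of 0, so N_u Q = (0, 0).
v = -(u_star - a - b * u_star[::-1]) ** 2
print(u_star, v)
```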

## 4 Locally Linear-Quadratic Nash Q-Learning

In this section, we formulate an algorithm which learns the Nash equilibrium of the stochastic game described in the previous section. The principal idea behind the approach we take is to construct a parametric estimator $\hat{Q}_\theta$ of the agents' Q-functions, where we search for the set of parameters $\theta$ which results in estimators that approximately satisfy the Nash-Bellman equation (6). Thus, our objective is to minimize the quantity

$$\mathbb{E}_{\substack{x\sim\rho,\ u \\ x'\sim p(\cdot\mid x,u)}}\Big[\big\|\hat{Q}_\theta(x;u) - r(x;u) - \gamma\,\mathcal{N}_{u'\in\mathcal{U}}\,\hat{Q}_\theta(x';u')\big\|^2\Big], \qquad (7)$$

over all $\theta$, where we define $\rho$ to be an unconditional probability measure over game states $x$. Equation (7) is designed as a measure of the gap between the right and left sides of equation (6). We may also interpret it as the distance between $\hat{Q}_\theta$ and the true Q-function. The expression (7) is intractable, since we do not know $p$ nor $\rho$ a priori, and we wish to make as few assumptions as possible on the system dynamics. Therefore, we rely on a simulation-based method and approximate (7) with

$$L(\theta) = \frac{1}{M}\sum_{m=1}^{M}\big\|\hat{Q}_\theta(x_m;u_m) - r(x_m;u_m) - \gamma\,\mathcal{N}_{u'\in\mathcal{U}}\,\hat{Q}_\theta(x'_m;u')\big\|^2, \qquad (8)$$

where, for each $m$, $(x_m, u_m, x'_m)$ represents an observed transition triplet from the game. We then search for the $\theta$ that minimizes $L(\theta)$ in order to approximate $Q$.

Our approach is motivated by Hu and Wellman [8] and Todorov and Li [19]. [8] presents a Q-learning algorithm where the Q-function, which is assumed to take only finitely many values, can be estimated through an update rule that relies on the repeated computation of the Nash operator $\mathcal{N}$. As the computation of $\mathcal{N}$ is NP-hard in general, this approach proves to be computationally intractable beyond trivial examples. To circumvent this issue, and to make use of more expressive parametric models, we generalize and adapt techniques in Gu et al. [5] to the multi-agent game setting to develop a computationally and data efficient algorithm for approximating Nash equilibria.

In our algorithm, we make the additional assumption that game states and actions are real-valued. Specifically, we assume that $\mathcal{X} = \mathbb{R}^n$ for some positive integer $n$, and that $\mathcal{U}_i = \mathbb{R}^{k_i}$ for each $i \in \mathfrak{N}$, where the $k_i$ are all positive integers. For notational convenience we define $K := \sum_{i\in\mathfrak{N}} k_i$.¹

¹Our approach can be easily extended to the case of controls that are restricted to convex subsets of $\mathbb{R}^{k_i}$.

We now define a specific model for the collection of approximate Q-functions $\hat{Q}_\theta = (\hat{Q}^1_\theta,\dots,\hat{Q}^N_\theta)$. For each $i \in \mathfrak{N}$, we decompose the Q-function into two components:

$$\hat{Q}_\theta(x;u) = \hat{V}_\theta(x) + \hat{A}_\theta(x;u), \qquad (9)$$

where $\hat{V}_\theta$ is a model of the collection of value functions, so that $\hat{V}_\theta(x) \approx V(x)$, and where $\hat{A}_\theta$ is what we refer to as the collection of advantage functions. The advantage function represents the optimality gap between $\hat{Q}_\theta(x;u)$ and $\hat{V}_\theta(x)$. We further assume that, for each $i$, $\hat{A}^i_\theta$ has the linear-quadratic form

$$\hat{A}^i_\theta(x;u) = -\begin{pmatrix} u_i - \mu^i_\theta(x) \\ u_{-i} - \mu^{-i}_\theta(x)\end{pmatrix}^{\!\intercal} P^i_\theta(x)\begin{pmatrix} u_i - \mu^i_\theta(x) \\ u_{-i} - \mu^{-i}_\theta(x)\end{pmatrix} + \big(u_{-i} - \mu^{-i}_\theta(x)\big)^{\intercal}\,\Psi^i_\theta(x), \qquad (10)$$

where the block matrix

$$P^i_\theta(x) := \begin{pmatrix} P^i_{11,\theta}(x) & P^i_{12,\theta}(x) \\ P^i_{21,\theta}(x) & P^i_{22,\theta}(x)\end{pmatrix}, \qquad (11)$$

with $P^i_{11,\theta}(x) \in \mathbb{R}^{k_i \times k_i}$, $P^i_{12,\theta}(x) \in \mathbb{R}^{k_i \times (K-k_i)}$, $P^i_{21,\theta}(x) \in \mathbb{R}^{(K-k_i) \times k_i}$, and $P^i_{22,\theta}(x) \in \mathbb{R}^{(K-k_i) \times (K-k_i)}$. In (11), each block is a matrix-valued function of the state, for each $i \in \mathfrak{N}$. We require that $P^i_\theta(x)$ is positive-definite for all $x$, and without loss of generality we may choose $P^i_{21,\theta} = (P^i_{12,\theta})^{\intercal}$, as the advantage function depends only on the symmetric combination of $P^i_{12,\theta}$ and $P^i_{21,\theta}$.

Hence, rather than modelling $\hat{A}_\theta$ directly, we instead model the functions $\mu_\theta$, $P_\theta$, and $\Psi_\theta$ separately as functions of the state $x$. Each of these functions can be modelled by universal function approximators such as neural networks. The only major restriction is that $P^i_\theta(x)$ must remain a positive-definite function of $x$. This restriction is easily attained via the Cholesky decomposition: we write $P^i_\theta(x) = L^i_\theta(x)\, L^i_\theta(x)^{\intercal}$ and instead model the lower-triangular matrices $L^i_\theta(x)$ with strictly positive diagonal entries.
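A minimal sketch of this Cholesky parametrization (NumPy; the raw vector stands in for a hypothetical network output, and exponentiating the diagonal is one common choice for enforcing positivity, not necessarily the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical network output: a flat vector of k*(k+1)/2 entries per state,
# reshaped into a lower-triangular factor L with a positive diagonal.
k = 3
raw = rng.standard_normal(k * (k + 1) // 2)

L = np.zeros((k, k))
L[np.tril_indices(k)] = raw
L[np.diag_indices(k)] = np.exp(np.diagonal(L))  # strictly positive diagonal

P11 = L @ L.T  # positive-definite by construction

print(np.linalg.eigvalsh(P11).min() > 0)  # True: all eigenvalues positive
```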

The model assumption in (10) implicitly assumes that agent-$i$'s Q-function can be approximately written as a linear-quadratic function of the actions of each agent. One can equivalently motivate such an approximation by considering a second-order Taylor expansion of $Q_i$ in the variable $u$ around the Nash equilibrium, together with the assumption that the $Q_i$ are concave functions of the agent's own action $u_i$. This expansion, however, assumes nothing about the dependence of $\hat{Q}_\theta$ on the value of the game state $x$.

The form of (10) is designed so that each $\hat{A}^i_\theta(x;\cdot)$ is a concave function of $u_i$, guaranteeing that the Nash equilibrium of the collection $(\hat{A}^i_\theta(x;\cdot))_{i\in\mathfrak{N}}$ exists and is unique. Moreover, under our model assumption, the Nash equilibrium is attained at the point $u = \mu_\theta(x)$, and at this point the advantage function is zero; hence we obtain simple expressions for the value function and the equilibrium strategy:

$$\hat{V}_\theta(x) = \mathcal{N}_{u\in\mathcal{U}}\,\hat{Q}_\theta(x;u) \quad\text{and}\quad \mu_\theta(x) = \arg\mathcal{N}_{u\in\mathcal{U}}\,\hat{Q}_\theta(x;u). \qquad (12)$$

Consequently, our model allows us to directly specify the Nash equilibrium strategy and the value function of each agent through the functions $\mu_\theta$ and $\hat{V}_\theta$. The outcome of this simplification is that the summand of the loss function in equation (8), which contains the Nash equilibria and was itself previously intractable, becomes tractable. For each sample observation (consisting of a state $x_m$, an action $u_m$, and a new state $x'_m$) we then have the loss

$$L_m(\theta) = \big\|\hat{V}_\theta(x_m) + \hat{A}_\theta(x_m;u_m) - r(x_m;u_m) - \gamma\,\hat{V}_\theta(x'_m)\big\|^2, \qquad (13a)$$

and all that remains is to minimize the total loss

$$L(\theta) = \frac{1}{M}\sum_{m=1}^{M} L_m(\theta) \qquad (13b)$$

over the parameters $\theta$, given a set of observed state-action triples $(x_m, u_m, x'_m)_{m=1}^{M}$.
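The per-sample loss (13a) is straightforward to evaluate once the network outputs are available. The sketch below (NumPy, two agents; all numeric values are illustrative placeholders, not fitted quantities) computes the squared Nash-Bellman residual for a single transition:

```python
import numpy as np

def sample_loss(v_x, a_xu, r_xu, v_xp, gamma):
    """Per-sample Nash-Bellman residual, Eq. (13a):
    || Vhat(x) + Ahat(x;u) - r(x;u) - gamma * Vhat(x') ||^2,
    with one entry per agent in each input array."""
    resid = v_x + a_xu - r_xu - gamma * v_xp
    return float(resid @ resid)

# Toy numbers for two agents (placeholders only).
v_x  = np.array([1.0, 0.5])    # Vhat(x_m)
a_xu = np.array([-0.2, -0.1])  # Ahat(x_m; u_m)
r_xu = np.array([0.3, 0.1])    # r(x_m; u_m)
v_xp = np.array([0.6, 0.4])    # Vhat(x'_m)
gamma = 0.9

print(sample_loss(v_x, a_xu, r_xu, v_xp, gamma))  # ≈ 0.0052
```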

### 4.1 Simplifying Game Structures

Equation (10) requires a parametric model of the functions $\mu_\theta$, $P_\theta$, and $\Psi_\theta$, which results in a potentially very large parameter space and may, in principle, require many training steps. In many cases, however, the structure of the game can significantly reduce the dimension of the parameter space and lead to easily learnable model structures. The following subsections enumerate these typical simplifications.

#### Label Invariance

Many games have symmetric players, and hence are invariant to a permutation of the players' labels. Such label invariance implies that each agent-$i$ does not differentiate among the other game participants, and that the agent's reward functional is independent of any reordering of all other agents' states and/or actions.

More formally, we assume that for an arbitrary agent-$i$, the game's state can be represented as $x = (x_0, x_i, x_{-i})$, where $x_0$ represents the part of the game state not belonging to any agent, $x_i$ represents the portion of the game state belonging to agent-$i$, and $x_{-i}$ represents the part of the game state belonging to the other agents. Next, let $\Lambda$ denote the set of permutations over sets of indices, where for each $\lambda \in \Lambda$ we express the permutation of a collection $z = (z_1,\dots,z_J)$ as $\lambda(z)$, where $\lambda$ is a one-to-one and onto map from the indices of the collection onto itself.

Label invariance is equivalent to the assumption that, for any $\lambda \in \Lambda$, each agent's reward function satisfies

$$r_i\big(x_0, x_i, \lambda(x_{-i});\, u_i, \lambda(u_{-i})\big) = r_i\big(x_0, x_i, x_{-i};\, u_i, u_{-i}\big). \qquad (14)$$

With such label invariance, the form of the linear-quadratic expansion of the advantage function in (10) simplifies. Assuming that $\mathcal{U}_i = \mathcal{U}_j$ for all $i, j \in \mathfrak{N}$, label invariance in only the actions of agents requires $\hat{A}^i_\theta$ to have the simplified form

$$\begin{aligned} \hat{A}^i_\theta(x;u) =\; & -\big\| u_i - \mu^i_\theta(x) \big\|^2_{P^{11}_\theta(x)} \;-\; \sum_{j\in\mathfrak{N}\setminus\{i\}} \big\langle u_i - \mu^i_\theta(x),\, u_j - \mu^j_\theta(x) \big\rangle_{P^{12}_\theta(x)} \\ & -\sum_{j\in\mathfrak{N}\setminus\{i\}} \big\| u_j - \mu^j_\theta(x) \big\|^2_{P^{22}_\theta(x)} \;+\; \sum_{j\in\mathfrak{N}\setminus\{i\}} \big(u_j - \mu^j_\theta(x)\big)^{\intercal} \psi_\theta(x), \end{aligned} \qquad (15)$$

for all $i \in \mathfrak{N}$, where we use the notation $\|a\|^2_M := a^{\intercal} M a$ and $\langle a, b\rangle_M := a^{\intercal} M b$ for appropriately sized matrices $M$. The functional form of (15) drastically reduces the number of matrix-valued functions being modelled, by an order of $N$.

To impose label invariance on states, we require permutation invariance of the inputs to the function approximations for $\mu_\theta$, $P_\theta$, and $\Psi_\theta$. [20] provide necessary and sufficient conditions for neural network structures to be permutation invariant. This structure is defined as follows. Let $\phi$ and $\sigma$ be two arbitrary functions, and let $f_{\mathrm{inv}}$ be their composition, such that

$$f_{\mathrm{inv}}(z) = \sigma\Big(\sum_{j=1}^{J} \phi(z_j)\Big). \qquad (16)$$

It is clear that $f_{\mathrm{inv}}$ constructed in this manner is invariant to the reordering of the components of $z$. Equation (16) may be interpreted as a layer that aggregates all dimensions of the inputs (which correspond to the states of all agents) through $\phi$, and a layer that transforms the aggregated result to the output through $\sigma$. We assume further that $\phi$ and $\sigma$ are both neural networks with appropriate input and output dimensions. This structure can also be embedded as an input layer inside a more complex neural network.
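A minimal sketch of such a permutation invariant layer (NumPy, with $\phi$ and $\sigma$ as random single-layer maps purely for illustration; the weights are placeholders, not trained values):

```python
import numpy as np

rng = np.random.default_rng(1)

# Tiny version of Eq. (16): f_inv(z) = sigma(sum_j phi(z_j)).
W_phi = rng.standard_normal((4, 1))   # phi: R -> R^4
W_sig = rng.standard_normal((2, 4))   # sigma: R^4 -> R^2

def phi(z_j):
    return np.tanh(W_phi @ z_j)

def f_inv(z):
    # z: (J, 1) array of the other agents' label-invariant features.
    pooled = sum(phi(z_j) for z_j in z)  # sum over agents: order-free
    return W_sig @ pooled

z = rng.standard_normal((5, 1))
out = f_inv(z)
out_perm = f_inv(z[::-1])             # reorder the agents

print(np.allclose(out, out_perm))     # True: output is permutation invariant
```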

#### Identical Preferences

It is quite common that the admissible actions of all agents are identical, i.e., $\mathcal{U}_i = \mathcal{U}_j$ for all $i, j \in \mathfrak{N}$, and that agents have homogeneous objectives, or that large sub-populations of agents have homogeneous objectives. Thus far, we allowed agents to have different performance metrics, with the variations showing through the sets of rewards and discount rates $(r_i, \gamma_i)_{i\in\mathfrak{N}}$. If agents have identical preferences, then we simply assume $r_i = r$ and $\gamma_i = \gamma$ for all $i \in \mathfrak{N}$. By the definition of the total discounted reward, the state-action value function, and the value function, identical preferences and admissible actions imply that the functional forms of $R_i$, $Q_i$, and $V_i$ are identical across agents.

In addition, the assumption of identical preferences, combined with the assumption of label invariance, can further reduce the parametrization of the advantage function. Under this additional assumption, $\hat{A}^i_\theta$ must be identical for all $i \in \mathfrak{N}$, which reduces modelling all of the $\mu^i_\theta$, $P^i_\theta$, and $\psi^i_\theta$ to modelling them for a single representative agent. This further reduces the number of functions that must be modelled by an order of $N$. The combined effects of label invariance and identical preferences compound, and can have a large impact on the modelling task, particularly when considering large populations of players.

###### Remark 1 (Sub-population Invariance and Preferences)

We can also consider cases where label and preference invariance occur within sub-populations of agents, rather than across the entire population. For example, in games in which some agents may cooperate with other agents, we can assume that agents are indifferent to the re-labeling of cooperators and non-cooperators separately. Similarly, we can consider cases in which groups of agents share the same performance metrics. Such situations, among others, lead to modelling simplifications similar to equation (15), and corresponding simplified neural network structures can be developed. In the interest of space, we do not develop further simplifying examples, nor do we claim the list we provide is exhaustive, as one can easily imagine a multitude of other nearly symmetric cases that may be of interest.

## 5 Implementation of Nash Actor-Critic Algorithm

With the locally linear-quadratic form of the advantage function, and the simplifying assumptions outlined in the previous section, we can now minimize the objective (8), which reduces to the sum over (13), over the parameters $\theta$ through an iterative optimization and sampling scheme. One could, in principle, apply a simple stochastic gradient descent method using back-propagation on the appropriate loss function. Instead, we propose an actor-critic style algorithm to increase the stability and efficiency of the procedure. Actor-critic methods (see, e.g., [13]) have been shown to provide faster and more stable convergence of reinforcement learning methods towards their optima, and our model lends itself naturally to such methods.

The decomposition in Equation (9) allows us to model the value function independently from the other components. Therefore, we employ an actor-critic update rule to minimize the loss function (13) by separating the parameter set $\theta = (\theta^V, \theta^A)$, where $\theta^V$ represents the parameter set for modelling $\hat{V}$ and $\theta^A$ represents the parameter set used for modelling $\hat{A}$. Our proposed actor-critic algorithm updates these parameters by minimizing the total loss

$$\frac{1}{M}\sum_{m=1}^{M} \hat{L}(y_m, \theta^V, \theta^A), \qquad (17a)$$

where the individual sample loss, corresponding to the error in the Nash-Bellman equation after solving for the Nash equilibrium, is

$$\hat{L}(y_m, \theta^V, \theta^A) = \big\|\hat{V}_{\theta^V}(x_m) + \hat{A}_{\theta^A}(x_m;u_m) - r(x_m;u_m) - \gamma\,\hat{V}_{\theta^V}(x'_m)\big\|^2, \qquad (17b)$$

with $y_m = (x_m, u_m, x'_m)$, and we minimize the loss by alternating between minimization in the variables $\theta^V$ and $\theta^A$.

Algorithm 1 below provides an outline of the actor-critic procedure for our optimization problem. We include a replay buffer and employ mini-batching. A replay buffer is a collection of previously experienced transition tuples of the form representing the previous state of the system, the action taken in that state, the resulting state of the system, and the reward during the transition. We randomly sample a mini-batch from the replay buffer to update the model parameters using SGD. The algorithm also uses a naïve Gaussian exploration policy, although it may be replaced by any other action space exploration method.
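A minimal replay buffer matching this description might look as follows (plain Python; the capacity and batch size here are illustrative, not the values used in the experiments):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity buffer of (state, action, reward, next_state) tuples;
    the oldest transition is evicted once capacity is reached."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform random mini-batch, as used for the SGD updates.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=5000)
for t in range(6000):
    buf.add(t, 0.0, 0.0, t + 1)

print(len(buf))              # capped at 5000; oldest 1000 entries evicted
print(len(buf.sample(100)))  # mini-batch of 100 transitions
```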

During the optimization steps over $\theta^V$ and $\theta^A$, we use stochastic gradient descent, though any other adaptive optimization method may be used.

## 6 Experiments

We test our algorithm on a multi-agent game that is important in the study of behaviour on electronic exchanges: the optimal execution problem. The game consists of agents trading a single asset whose stochastic price process is affected by their actions.

An arbitrary agent-$i$, $i \in \mathfrak{N}$, may buy or sell $\nu_{i,t}$ units of the asset at each time period $t$. At $t = T$, agents must completely liquidate their holdings. Each agent-$i$ keeps track of their inventory $q_{i,t}$, and inventories are visible to all other agents. We assume the asset price process $(S_t)_{t\ge 0}$ evolves according to the discrete dynamics

$$S_{t+1} - S_t = g_1(S_t, \nu_t)\,\Delta T + g_2(S_t, \nu_t)\,\sqrt{\Delta T}\,\xi_t, \qquad (18)$$

with initial condition $S_0$. Here, the innovations $\xi_t$ are iid with zero mean and unit variance for all $t$. The impact of all agents' actions appears through the drift and noise of the asset price dynamics via the functions $g_1$ and $g_2$. We assume $g_1$ and $g_2$ are invariant with respect to the ordering of $\nu_t$ so that, with the same inventory, the price responds identically regardless of which agent is trading. In addition, each agent pays a transaction cost proportional to the amount they buy or sell during each time period. Agents keep track of their total cash from trading, and we denote the corresponding cash process by $X_{i,t}$, with a fixed transaction cost constant of proportionality.

The agent's objective is to maximize the total cash they possess by time $T$, less a penalty for risk taking and a penalty for excess exposure at time $T$. We express agent-$i$'s objective (total expected reward) as

$$R_i := \mathbb{E}\Big[ X_{i,T} + q_{i,T}\big(S_T - b_2\, q_{i,T}\big) - b_3 \sum_{t=1}^{T} q_{i,t}^2\,\Delta T \Big], \qquad (19)$$

where $b_2, b_3 \ge 0$. In Equation (19), the second term serves as the cost of instantaneously liquidating the remaining inventory at time $T$, and the last term serves as a penalty for taking on excess risk, proportional to the square of the holdings at each time period. In this objective function, the effect of all agents' trading actions appears implicitly through the dynamics of $S_t$ and through its effect on the cash process $X_{i,t}$. This particular form of objective assumes that agents have identical preferences which are invariant to agent relabeling.² Hence, we may employ the techniques discussed in Section 4.1 to simplify the form of the advantage function $\hat{A}_\theta$. In our example, we model each component of the advantage function with neural networks that include a permutation invariant layer.

²We could extend this to include the case of sub-populations with homogeneous preferences that are heterogeneous across sub-populations.

Our experiments assume a total of five agents ($N = 5$) over a time horizon of fifteen time steps ($T = 15$), with inventory levels restricted to lie between positive and negative 100 units ($q_{i,t} \in [-100, 100]$ for all $i$ and $t$).

### 6.1 Features

We use the following features to represent the state of the environment at time $t$:
Price ($S_t$): Scalar representing the current price of the asset,
Time ($t$): Scalar representing the current time step within the time horizon, and
Inventory ($q_t$): Vector representing the inventory levels of all agents.

We take the first two features (price and time), plus each agent's individual inventory ($q_{i,t}$), to be non-label-invariant, and all other agents' inventory levels ($q_{-i,t}$) to be label invariant.

### 6.2 Network Details

The network structure for the advantage function approximation consists of two components: (i) a permutation invariant layer that feeds into (ii) a main network. The inputs of the permutation invariant layer are the label-invariant features. This layer, as described in Section 4.1, is a fully connected neural network with three hidden layers, each containing 20 nodes, connected by ReLU activation functions. We then combine the output of this permutation invariant layer with the non-label-invariant features, and together they form the inputs to the main network. The main network comprises three hidden layers with 20, 40, and 20 nodes, respectively. The outputs of the main network are the parameters $\mu_\theta$, $P_\theta$, and $\Psi_\theta$ of the approximated advantage function defined in Section 4. These parameters fully specify the value of the advantage function.

The network structure for the value function approximation contains four hidden layers with 20, 60, 60, and 20 nodes respectively. This network takes the features from all states described in Section 6.1 and outputs the approximate value function for all agents.

We use stochastic gradient descent with mini-batches to optimize the loss functions defined in Section 5. Mini-batch sizes are set to one hundred uniformly sampled past experiences from the replay buffer. The replay buffer is set to a maximum size of five thousand sets of transitions where the oldest experienced transition is removed from the buffer when the size limit is reached. Learning rates are set to 0.01 and are held constant throughout training. Training is performed over a total of 15,000 simulations.

In the next two subsections, we investigate the results using two common price impact functions – the linear case and the square-root case. In both cases, in the absence of trading, the price process is assumed to mean-revert.

### 6.3 Linear Price Impact

In this example, we assume a mean-reverting price process with a linear price impact, which corresponds to choosing

$$g_1(S_t,\nu_t) = \kappa\,(\theta - S_t) + b_1 \sum_{i\in\mathfrak{N}} \nu_{i,t}, \quad\text{and}\quad g_2(S_t,\nu_t) = \sigma, \qquad (20)$$

where $b_1$, $\kappa$, $\theta$, and $\sigma$ are constants corresponding to the price impact of net trades, the rate of mean-reversion of the price process, the level of mean-reversion, and the asset's volatility, respectively. We denote the average trading rate and average inventory of all other agents as

$$\bar{\nu}_{-i,t} := \frac{1}{N-1}\sum_{j\in\mathfrak{N}\setminus\{i\}} \nu_{j,t}, \quad\text{and}\quad \bar{q}_{-i,t} := \frac{1}{N-1}\sum_{j\in\mathfrak{N}\setminus\{i\}} q_{j,t}, \qquad \forall\, i \in \mathfrak{N}, \qquad (21)$$

respectively.
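A short simulation sketch of the dynamics (18) with the linear impact functions (20) (NumPy; the parameter values are placeholders, not those of Table 1, and the trading rates are random stand-ins rather than equilibrium strategies):

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative parameter values (placeholders, not the paper's Table 1).
kappa, theta, b1, sigma = 5.0, 10.0, 0.01, 0.5
dT, T_steps, N = 1.0 / 15, 15, 5

S = np.empty(T_steps + 1)
S[0] = 10.0
for t in range(T_steps):
    nu = rng.normal(0.0, 1.0, size=N)                # stand-in trading rates
    drift = kappa * (theta - S[t]) + b1 * nu.sum()   # g1 in Eq. (20)
    # One Euler step of Eq. (18) with g2 = sigma and Gaussian innovations.
    S[t + 1] = S[t] + drift * dT + sigma * np.sqrt(dT) * rng.standard_normal()

print(S.shape)  # (16,): initial price plus 15 steps
```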

For our experiments, we use the parameters in Table 1. Recall that the agents' reward functions are given by (19), and that $b_2$ and $b_3$ correspond to the terminal and running risk penalties. In order to impose the restriction that agents must be in a neutral position at time step $T$, we set the terminal penalty $b_2$ to be large.

Figure 1 illustrates the resulting Nash equilibrium through the optimal trading strategy of a single agent. Specifically, it shows heatmaps of the first agent's optimal actions as time, price, inventory, and the average inventory of the other agents vary. Panels (a), (b), and (c) represent states in which the average inventory level of all other agents is long ($\bar{q}_{-1} > 0$), zero ($\bar{q}_{-1} = 0$), and short ($\bar{q}_{-1} < 0$), respectively. Each panel is further divided into subplots of varying asset prices from left to right. Within each individual subplot, one axis represents the first agent's inventory level and the other represents the current time step.

As the plots demonstrate, whenever the agent's inventory is significantly negative they tend to buy, and when it is significantly positive they tend to sell. The threshold at which the switch occurs depends on other features of the system, including the time since the beginning of the trading horizon, the price of the asset, and the inventory levels of the other agents. The closer the system is to the end of the trading horizon, the closer this threshold is to zero inventory. This is a byproduct of restricting all agents to hold neutral positions at time step $T$. Within any one panel, moving from the left to the right subplots, the threshold moves downwards and sometimes falls below zero inventory. In contrast, higher average inventories of other agents generally increase the threshold.

Some of these features can be seen more clearly through sample inventory paths of agents acting according to the optimal strategy; see Figure 2. Initial inventories of all agents are randomly drawn from a normal distribution, held constant across columns but varying across rows. Initial asset prices are randomly drawn from a normal distribution, held constant across rows but varying across columns. The asset price process is simulated (using the impact functions in (20) and the dynamics in (18)) with the same random seed across rows, varying across columns. Generally, the inventories of all agents converge toward one another, eventually vanish at the end of the trading horizon, and react to changes in the asset price by buying when prices are low and selling when prices are high.

### 6.4 Square-Root Price Impact

Another important price impact function is the square root impact which corresponds to choosing:

$$g_1(S_t,\nu_t) = \kappa\,(\theta - S_t) + b_1\,\mathrm{sgn}(\bar{\nu}_t)\sqrt{|\bar{\nu}_t|}, \quad\text{and}\quad g_2(S_t,\nu_t) = \sigma, \qquad (22)$$

where $b_1$, $\kappa$, $\theta$, and $\sigma$ are constants corresponding to the price impact of net trades, the rate of mean-reversion of the price process, the level of mean-reversion, and the asset's volatility, respectively.

For our experiments, we use the same parameters as in Table 1. As a reminder, the agents' reward functions are given by (19), and $b_2$ and $b_3$ correspond to the terminal and running risk penalties. The analogues of the heatmaps and sample inventory paths of Figures 1 and 2 for the square-root case may be found in Figures 3 and 4.

Many of the same observations from Section 6.3 hold in the square-root case; however, one key difference is the magnitude of the effect that other agents' inventories have on a single agent's optimal actions. Here, the impact of increasing other agents' inventories is significantly lower than in the case with linear price impact. This can also be observed, albeit less clearly, from the sample inventory paths in Figure 4, where agents with initial inventories significantly different from zero converge more slowly toward the other agents' inventories; see particularly the center panels.

## 7 Conclusions

Here we present a computationally tractable reinforcement learning framework for multi-agent (stochastic) games. Our approach utilizes function approximations after decomposing the collection of agents' state-action value functions into individual value functions and advantage functions. Further, we approximate the advantage functions in a locally linear-quadratic form and use neural network architectures to approximate both the value and advantage functions. Typical symmetries in games allow us to use permutation invariant neural networks, motivated by the Kolmogorov-Arnold representation theorem, to reduce the dimensionality of the parameter space. Finally, we develop an actor-critic paradigm to estimate the parameters and apply our approach to two important applications in electronic trading. Our approach is data efficient, and is applicable to large numbers of players and continuous state-action spaces.

There are a number of doors left open for exploration, including extending our approach to the case where latent factors drive the environment, and where the states of all agents are partially (or completely) hidden from any individual agent. As well, our approach can be easily applied to mean-field games, which correspond to the infinite-population limit of stochastic games in which any individual agent has only an infinitesimal contribution to the state dynamics.