Approximate Nash Equilibrium Learning for n-Player Markov Games in Dynamic Pricing

Larkin Liu, et al.
Technische Universität München

We investigate Nash equilibrium learning in a competitive Markov Game (MG) environment, where multiple agents compete and multiple Nash equilibria can exist. In particular, for an oligopolistic dynamic pricing environment, exact Nash equilibria are difficult to obtain due to the curse of dimensionality. We develop a new model-free method to find approximate Nash equilibria. Gradient-free black box optimization is then applied to estimate ϵ, the maximum reward advantage of an agent unilaterally deviating from any joint policy, and to estimate the ϵ-minimizing policy for any given state. The policy-to-ϵ correspondence and the state-to-ϵ-minimizing-policy mapping are represented by neural networks, the latter being the Nash Policy Net. During batch update, we perform Nash Q learning on the system by adjusting the action probabilities using the Nash Policy Net. We demonstrate that an approximate Nash equilibrium can be learned, particularly in the dynamic pricing domain where exact solutions are often intractable.





1 Introduction

The application of deep reinforcement learning to solve single-player Markov Decision Processes (MDPs) is relatively well researched in today's machine learning literature; however, applications of novel deep reinforcement learning methods to multi-agent competitive MDPs remain relatively few. In particular, the challenge revolves around competing objectives and solving for, often computationally intractable, Nash equilibria [Nash, 1950] in a competitive setting, where agents cannot merely maximize their respective Q functions. Nash equilibria in Markov Games are particularly useful in the area of dynamic pricing, a well-known problem in today's online eCommerce marketplaces. In a dynamic pricing game, multiple firms compete for market share of their product on the same online platform. To model this market, we adopt a modified version of Bertrand oligopoly [Bertrand, 1889], where the marginal cost of production is 0 and purchasing behaviour is probabilistic.

1.1 Market Oligopoly

Especially in today's online commerce industries, dynamic pricing provides platform firms with a competitive advantage and increases profit margins in a competitive economy. Ideally, a firm's pricing strategy should consider the actions of other firms, which justifies Nash equilibrium policies over unilateral reward-maximization policies. In previous approaches, [Liu et al., 2019] modelled the dynamic pricing problem using deep reinforcement learning, but focused primarily on single-agent revenue maximization rather than solving for multi-agent Nash equilibria. [Schlosser and Boissier, 2018] apply stochastic dynamic programming in a simulated environment to estimate consumer demand and maximize expected profit, but do not consider the equilibrium solution of the market. [van de Geer et al., 2018] model market demand using parametric functions with unknown parameters, yet consider neither a Nash equilibrium nor the effect of previous pricing on demand. Thus, in almost all recent research on the dynamic pricing problem under uncertainty, there exists a desideratum to compute and promote Nash equilibria for market pricing strategies. Such strategies can be applied to eCommerce platforms such as Amazon auto pricing, allowing firms to opt for an automated pricing algorithm.

A large body of literature on dynamic pricing has assumed an unknown demand function and tried to learn the intrinsic relationship between price and demand. [Harrison et al., 2012] showed that a myopic Bayesian policy may lead to incomplete learning and poor profit performance. [Liu et al., 2020] studied a joint pricing and inventory problem while learning the price-sensitive demand using a Bayesian dynamic program. Many recent studies have revolved around non-parametric methods. [Besbes and Zeevi, 2009] developed a pricing strategy that balances the trade-off between exploration and exploitation. Furthermore, in [Liu et al., 2019] the demand function was a black box system where the demand was approximated from experience replay and live online testing in a reinforcement learning framework.

Dynamic pricing in oligopolies presents two intertwined problems: first, the optimization problem of learning a Nash equilibrium policy given the parameters of the market demand function, and second, learning the demand parameters themselves. For example, [den Boer and Zwart, 2014] and [Keskin and Zeevi, 2014] studied parametric approaches, using maximum likelihood to estimate unknown parameters of the demand function. Recent research has also been conducted in demand learning: where multiple products exist, firms can learn to price an assortment of items using multinomial logit regression [Ferreira and Mower, 2022], and where multiple products exist under limited inventory, Multi-Armed Bandit approaches have proven effective at devising profitable policies [Ferreira et al., 2018]. However, these approaches do not consider past pricing actions influencing the future market demand parameters, nor do they consider competing firms aiming for a Nash equilibrium.

We propose a demand market where past pricing affects future market demand under market competition. To search for a Nash equilibrium, we apply a multi-agent Markov Decision Process, where the state transition is driven by the joint pricing actions of all players (or firms). To simplify the problem, we assume that each firm has unlimited inventory and that there is only a single product with no substitutes.

1.2 Multi-agent Markov Decision Processes

Multi-agent Markov Decision Processes, or Markov Games (MGs), constitute an important area of research, especially when multiple agents are self-interested and an exact or approximate Nash equilibrium of the system is desired. The computation of Nash equilibria additionally presents great difficulty when searching over the enormous joint state-action space of the problem, though approximations to this search problem exist [Porter et al., 2008]. Moreover, the existence of multiple Nash equilibria can further complicate the solution, as some solutions may be more desirable than others.

Yet approximate search algorithms require prior knowledge of the joint reward function and are often limited to two players modelled by a best response function. We treat our payoff function as an oracle [Vorobeychik and Wellman, 2008]: we have knowledge of the parametric form of the payoff function, but the agents in the MG have no knowledge of how the reward function generates rewards for the players. The state of the game is visible to the agents, yet the agents have no knowledge of the state transition function. Nevertheless, the market demand parameters and Nash equilibria are known for purposes of experimentation. This eliminates the need for pre-period model fitting to pre-estimate market demand parameters [Ferreira et al., 2016], allowing us to compare our solution to the theoretical solution rather than to an empirical estimate.

Model-based approaches furthermore exist that provably find a Nash equilibrium [Zhang et al., 2020a]. However, we aim for a Nash equilibrium solver that is model-free, where no knowledge of the reward or transition function is available to the agents. In recent literature, [Ramponi et al., 2021] present a model-free algorithm to solve competitive MDPs when the environment parameters can be altered by an external party, but this is not always the case in market economics. Furthermore, [Sayin et al., 2021] propose a model-free decentralized deep learning framework, where each agent is blind to the other agents' actions, and convergence to a Nash equilibrium is proven.

[Zhang et al., 2020b] propose a convergent solution for zero-sum MGs via entropy regularization. However, both [Zhang et al., 2020b] and [Sayin et al., 2021] impose a number of theoretical restrictions on the MG for this convergence to occur. Moreover, [Kozuno et al., 2021] present a provably convergent MG solver for imperfect information restricted to two agents. Nevertheless, we are concerned with an approximate solution to a full-information MG for n agents.

Contributions: This work outlines a methodology for Deep Q Learning, as introduced in [Mnih et al., 2015], extending the framework to multi-agent reinforcement learning (MARL) with a Nash equilibrium objective based on the methods in [Hu and Wellman, 2003] and [Wang and Sandholm, 2003]. Although the framework presented in [Hu and Wellman, 2003] is theoretically sound, the solution to the Nash equilibrium function is often intractable. We therefore apply approximation methods to compute a Nash equilibrium. Black box optimization is applied to estimate an ϵ-Nash equilibrium policy, and this approximation is learned as a series of neural networks. This MARL model is then applied to the domain of dynamic pricing.

2 Dynamic Pricing Game

An oligopoly across n firms is modelled as a multi-agent Markov Decision Process with a fully observable state-action space. The state of the game is the fully observable, demand-influencing reference price x_t. At discrete intervals in time t, each agent i issues a price a_t^i for the item to be sold to the general market. The reference price x_{t+1} of the market at time t+1 is determined by a stochastic function of all the agents' actions at time t. To focus the problem strictly on dynamic pricing, for each firm we assume a sufficiently large inventory, with no marginal costs of production, no possibility of inventory stockouts, no holding costs, and the capacity to always meet any realization of market demand.

2.1 Markov Game Parameters

The joint action space constitutes the current actions of all agents at time t, which drive the state transition by setting prices, where a_t^i represents the price of the item set by agent i at time t. The state-action-reward space is defined as a tuple (s_t, a_t, r_t), where r_t^i is the reward for agent i at time t. The joint reward can be written as r_t = (r_t^1, …, r_t^n), and the joint action as a_t = (a_t^1, …, a_t^n), where n denotes the number of agents. The exact transition probabilities are not known to the agents; each agent must learn the demand function and optimal strategy as the MG progresses. A state s_t is determined by the reference price x_t of the market, observable to all agents. We discretize the action space into even segments representing a firm's price, where each segment represents an action a^i ∈ A, and A is the action space of any agent i.

2.2 Reference Pricing

The first pillar of any market dynamic is the demand function, dictated by both the historical and contemporary prices of the firms in the market. Demand processes are ideally linear and continuous; in practice, however, they are not guaranteed to be stationary or continuous [Den Boer and Keskin, 2020], and can be subject to various exogenous factors such as environmental carryover effects [Rao, 1993].

We create an idealization of a market based on [Taudes and Rudloff, 2012], who pose a dynamic pricing problem with reference price effects and a linear demand function. The expected current demand of a product is a function of the average price set by all firms, denoted p̄_t, and a historical reference price, x_t. The reference price is given and cannot be modified during the current time t; moreover, it is a function of the immediate past, i.e. of the prices at time t−1. Although [Taudes and Rudloff, 2012] is not the only model that incorporates reference pricing [Mazumdar et al., 2005], [Fibich et al., 2003], [Popescu and Wu, 2007], [Heidhues and Kőszegi, 2014], we adopt it for its simplicity and the existence of provable Nash equilibria in an oligopolistic framework.


The demand noise is defined by a Poisson process whose arrival rate is given by Eq. (1) and whose standard deviation is the square root of that rate. Furthermore, [Taudes and Rudloff, 2012] stipulate decreasing demand with respect to increasing price, as expressed in Condition (1).

This demand-influencing reference price can be affected by a variety of factors, such as inflation, unemployment, and psychological perception [Raman et al., 2002]. Moreover, in many proposed oligopoly models, as in [Janiszewski and Lichtenstein, 1999] and [Briesch et al., 1997], the reference price is dictated by a historical joint market price of multiple firms. However, modelling a competitive market oligopoly with an autocorrelated reference price in a MG setting has not been heavily investigated until now. In our model, we focus on designing a market whose reference price is driven by the historical joint market price; additional factors that also affect the reference price are represented as noise. Thus the transition of the reference price is determined by the average of the previous joint prices plus some Gaussian randomness, x_{t+1} = (1/n) Σ_i a_t^i + ξ, with ξ ∼ N(0, σ²).

In our experiment, the reference price of a product is determined by the previous actions of the firms. We express the reference price function as x_{t+1} = f(a_t), mapping the vector of historical pricing actions a_t to the next reference price x_{t+1}. At the beginning of the Markovian process, the reference price is randomly generated within the pricing range of the action space. The stochastic nature of the market price transition entails the Markov property of the MG.
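The reference price transition described above can be sketched as follows; the noise scale `sigma` is an illustrative choice, not a value taken from the paper.

```python
import numpy as np

def reference_price_transition(joint_prices, sigma=0.5, rng=None):
    """Next reference price: mean of the current joint prices plus
    Gaussian noise. `sigma` is an illustrative noise scale."""
    rng = np.random.default_rng() if rng is None else rng
    return float(np.mean(joint_prices) + rng.normal(0.0, sigma))
```

With `sigma=0` the transition reduces to the plain average of the agents' prices, which makes the autocorrelation of the reference price explicit.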


2.3 Probabilistic Demand Function

The expected profit for player i, which equals expected revenue since the marginal cost of production is assumed to be 0, is defined as,


P_i in Eq. (4) represents the probability that a customer purchases the item sold by firm i when the player prices their merchandise at a^i, and λ is the expected demand during the single-stage game over a fixed time period. Following the quantal response function from [Luce, 1959] and [Goeree et al., 2020], we define a purchase elasticity function, w(a^i),

where w(a^i) = w_0 − k a^i. (6)

In w(a^i), w_0 represents the maximum weighted contribution to the probability of an item being purchased by a customer given a price: when the price is 0, this measure equals w_0. For simplification, we assume the linear marginal decline of this measure is equal for all players; that is, the market has the same elasticity regarding the marginal decrease in customers' willingness to purchase from any firm as price increases, with negative slope −k. The probability of a customer choosing to purchase from firm i at price a^i, among the prices set by the other firms, is defined by combining a softmax function with the purchase elasticity function. This mechanism prevents a degenerate solution in which firms undercut each other down to a price of 0, since the lowest price does not guarantee that a consumer will buy the product.
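The purchase mechanism above, a linear elasticity measure fed through a softmax, can be sketched as follows; the parameter values `w0` and `k` are illustrative, not taken from the paper's scenarios.

```python
import numpy as np

def purchase_probabilities(prices, w0=5.0, k=0.5):
    """Quantal-response purchase probabilities over n firms.

    w(a_i) = w0 - k * a_i is the linear purchase-elasticity measure;
    probabilities are a softmax over w, so a cheaper firm is favoured
    but never guaranteed the sale."""
    w = w0 - k * np.asarray(prices, dtype=float)
    z = np.exp(w - w.max())  # subtract max for numerical stability
    return z / z.sum()
```

Note that equal prices yield equal purchase probabilities, and lowering one firm's price raises its probability without driving the others to zero, which is exactly what rules out the undercut-to-zero solution.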

3 Nash Equilibrium Conditions

Computation of exact Nash, or approximate ϵ-Nash, equilibria is PPAD-complete and notoriously difficult in practice [Daskalakis et al., 2006]. It involves a search over the entire joint state-action space, computing the value of each state under a candidate policy, and it requires knowledge of the joint Q function and the transition probabilities of the system. In our scenario, all agents are identical; therefore the solution of one agent applies to the solution of any other.

3.1 Theoretical ϵ-Nash Equilibria

Under the market outlined in Section 2, given the market inputs x and p̄, for any pricing strategy there exists an optimal deviation, such that deviating yields the maximum profit advantage ϵ̄.


An increase in the individual profit of an agent from unilaterally deviating is denoted ϵ. Given the optimal deviation, ϵ̄ is the maximum theoretical upper bound on the profit advantage that can be obtained. Eq. (8) provides the theoretical value of ϵ̄, which is a function of both the reference price x and the market price p̄, as well as the fixed market parameters. A derivation of ϵ̄ can be found in Appendix B.

Figure 3.1: Surface plots of the potential advantage from deviation (ϵ-deviation) from a Nash equilibrium with respect to market price p̄ and reference price x, under the respective market parameters of Market Scenarios 1 and 2, for the arrival of a single sales event.

Using Eq. (8), as visualized in Fig. 3.1, multiple Nash equilibria can exist where ϵ = 0, or where ϵ is sufficiently small (see Section 5). Such behaviour occurs when one agent unilaterally deviates from the market price p̄ under reference price x. The lower plateau in Fig. 3.1 thus represents the region where a theoretical Nash equilibrium exists. The state space and action space are each limited to 10 discrete intervals representing price values.

3.2 Nash Equilibrium in Markov Games

The value of a policy at each state can be represented by a joint action that maximizes the joint Q function [Watkins and Dayan, 1992] under policy π. The probability of agent i taking action a^i is denoted π^i(a^i | s).


An ϵ-Nash equilibrium is defined as a joint policy such that, when any agent unilaterally deviates from it, the reward of a single-stage game will not yield that agent a payoff greater by more than ϵ. Provided ϵ, we consider a bound ϵ̄ on the corresponding MG, which bounds the gain in the value of a policy should agent i unilaterally deviate to an alternative policy. The solution to Eq. (9) can be computed by searching over the joint action space.


ϵ̄ serves as an upper bound on ϵ; therefore, minimizing ϵ̄ also minimizes any possible ϵ for each single-stage game in the MG. In fact, the existence of ϵ implies the existence of the upper bound ϵ̄; we provide a proof of this in Appendix B.3.
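As a concrete illustration of ϵ as the maximum unilateral-deviation advantage, the following sketch computes ϵ for a small discrete single-stage game by exhaustive search; the function names and the payoff-oracle interface are our own, and this brute-force enumeration merely stands in for the black box optimization the paper applies to larger policy spaces.

```python
import itertools
import numpy as np

def epsilon_of_policy(policy, payoff, n_actions):
    """Maximum unilateral-deviation advantage eps for a joint mixed policy.

    policy: (n_agents, n_actions) array of per-agent action probabilities.
    payoff: payoff(joint_action) -> per-agent rewards (treated as an oracle).
    Exhaustive over the joint action space, so only viable for tiny games."""
    policy = np.asarray(policy, dtype=float)
    n_agents = policy.shape[0]

    def expected(pol):
        # Expectation of each agent's payoff under independent mixing.
        val = np.zeros(n_agents)
        for joint in itertools.product(range(n_actions), repeat=n_agents):
            prob = np.prod([pol[i, a] for i, a in enumerate(joint)])
            if prob > 0.0:
                val += prob * np.asarray(payoff(joint), dtype=float)
        return val

    base = expected(policy)
    eps = 0.0
    for i in range(n_agents):
        for a in range(n_actions):  # pure-strategy deviations suffice
            dev = policy.copy()
            dev[i] = np.eye(n_actions)[a]
            eps = max(eps, expected(dev)[i] - base[i])
    return eps
```

For a joint policy at a Nash equilibrium this returns 0; for any other policy it returns the largest single-agent gain from deviating, which is exactly the quantity the ϵ net is trained to predict.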

4 Multi-Agent Nash Q Learning

In our model, the game provides full information: the state and the actions of all agents are visible to every agent. This allows experience replay for Q learning [Mnih et al., 2015], and the Q function can be shared by all agents, as they are assumed identical. The joint action space is defined as the combined actions of all agents at time t. Normally, in Q learning, the update mechanism searches for the action that maximizes the Q function; in Nash Q learning, however, we must update the Q function with the solution to the Nash equilibrium. As new experiences are obtained while the MG is played, the Nash Q value is updated. We utilize the Q update mechanism defined in [Hu and Wellman, 2003] as the update mechanism for the Nash Q estimator. Given a Q function and a Nash policy, we search for the joint action that maximizes the scaled Q function [Laumônier and Chaib-draa, 2005], as in Eq. (13). In our representation, the Q function is a vector, returning the respective Q value for each agent, based on the joint probability input for the joint policy.


The Nash operator indicates the Q value at a Nash equilibrium; it is approximated by a scaling factor, derived from the Nash equilibrium policy, multiplied by the Q value of the joint action. Extending the work of [Hu and Wellman, 2003], we introduce a neural network mapping any state-action combination to its Nash policy, whose output supplies the Nash scaling factor used to compute the Nash Q value in conjunction with the Q function. The Q function used in Eq. (14) and (12) is represented by a deep neural network, referred to simply as the Nash Q Net. In a full-information game with homogeneous agents, a shared neural network can be used for all agents in the Markov Game (if the agents are not homogeneous, a separate Q network must be retained and updated for each agent). The Q network parameters are learned in the same manner as in Deep Q learning [Mnih et al., 2015]; the key innovation is that the scaling factor used to compute the Nash Q value is obtained via a Nash Policy Net (defined later in Section 4.2).
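The Nash operator described above, scaling joint-action Q values by the Nash policy's joint probabilities, can be sketched for a small discrete joint action space as follows; the dictionary-based Q representation is an illustrative simplification of the Nash Q Net, and the discount factor value is arbitrary.

```python
import itertools
import numpy as np

def nash_q_value(q_next, nash_policy):
    """Nash operator: expected next-state Q under the Nash joint policy.

    q_next: dict mapping joint-action tuples to per-agent Q-value arrays.
    nash_policy: (n_agents, n_actions) per-agent action probabilities.
    Returns the per-agent Nash Q values."""
    nash_policy = np.asarray(nash_policy, dtype=float)
    n_agents, n_actions = nash_policy.shape
    value = np.zeros(n_agents)
    for joint in itertools.product(range(n_actions), repeat=n_agents):
        prob = np.prod([nash_policy[i, a] for i, a in enumerate(joint)])
        value += prob * np.asarray(q_next[joint], dtype=float)
    return value

def nash_q_target(reward, q_next, nash_policy, gamma=0.9):
    """One-step Nash Q-learning target: reward plus discounted Nash Q."""
    return np.asarray(reward, dtype=float) + gamma * nash_q_value(q_next, nash_policy)
```

Replacing the usual max over actions with this policy-weighted expectation is what turns standard Q learning into Nash Q learning.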

4.1 Estimating Value Advantage via Black Box Optimization

We apply a deep neural network, the ϵ net, to represent the mapping from a joint policy to its respective ϵ̄ from Eq. (11), a vector containing the ϵ̄ of each state. As the gradient of this mapping cannot be easily evaluated, we apply gradient-free black box optimization for its minimization. Trust Region Optimization (TRO) has been shown to be effective for solving high-dimensional optimization problems [Wang et al., 2020] [Eriksson et al., 2019] [Diouane et al., 2021] [Regis, 2016], particularly when the computational resources are available. To compute the existence of an ϵ-Nash equilibrium in the high-dimensional policy space efficiently, we apply model-based Bayesian Optimization via Trust Region optimization (TuRBO) from [Eriksson et al., 2019]. TuRBO combines standard TRO with a multi-armed bandit system via Thompson Sampling [Thompson, 1933] to optimize over multiple local trust regions simultaneously, given sufficient convergence time. However, the candidate generation step in TuRBO is not constrained to produce valid joint probabilities for each agent, in which the probabilities over each agent's action space must sum to 1. We alter TuRBO by simply normalizing the original candidate probabilities over each set of probabilities belonging to an agent. The resulting modified algorithm is denoted TuRBO-p; any candidate generated by TuRBO-p has joint probabilities summing to 1 for the policy of each agent.
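The per-agent normalization that turns a raw TuRBO candidate into a valid joint policy (the TuRBO-p modification) can be sketched as follows; the function name and the small clipping floor are our own illustrative choices.

```python
import numpy as np

def normalize_candidate(candidate, n_agents, n_actions):
    """TuRBO-p projection: renormalize a raw candidate so that each
    agent's block of action probabilities sums to 1.

    candidate: flat array of length n_agents * n_actions from the
    unconstrained candidate-generation step."""
    x = np.asarray(candidate, dtype=float).reshape(n_agents, n_actions)
    x = np.clip(x, 1e-12, None)  # guard against zero or negative mass
    return x / x.sum(axis=1, keepdims=True)
```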


The maximization problem formulated in Eq. (16) represents the maximum gain in the value of a policy when agent i deviates from the joint policy with an alternative policy.

4.2 Nash Policy Learning

The ϵ-Nash policy is defined as a joint policy that minimizes ϵ̄ from Eq. (11). Drawing inspiration from [Ceppi et al., 2010], where a Nash equilibrium is found by effectively minimizing ϵ via linear programming, we apply a similar technique of ϵ-minimization. However, in [Ceppi et al., 2010], the model parameters are known to the agents, and the Markov game is constrained to two players. In our MG, the joint reward function must be learned; therefore, we perform approximate search over the policy space, instead of using exact methods, to discover any possible approximate Nash equilibrium policies. In principle, each state maps to a corresponding action probability in accordance with the Nash policy that minimizes ϵ̄ in Eq. (16). However, a table keeping track of such an enormous state-policy space is not feasible. Therefore, we implement a Nash Policy Net to learn the state-to-policy mapping, i.e., the joint policy producing an approximate Nash equilibrium as approximated via TuRBO-p.
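A minimal sketch of a Nash Policy Net is given below, assuming a scalar reference-price state and a per-agent softmax head; the layer sizes and initialization are illustrative only, and the paper trains such a network by backpropagation rather than this plain numpy forward pass.

```python
import numpy as np

class NashPolicyNet:
    """Numpy sketch of the Nash Policy Net: maps a scalar state (the
    reference price) to per-agent action probabilities via one hidden
    layer and a per-agent softmax."""

    def __init__(self, n_agents, n_actions, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        self.n_agents, self.n_actions = n_agents, n_actions
        self.w1 = rng.normal(0, 0.1, (1, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0, 0.1, (hidden, n_agents * n_actions))
        self.b2 = np.zeros(n_agents * n_actions)

    def forward(self, state):
        h = np.tanh(np.array([[state]]) @ self.w1 + self.b1)
        logits = (h @ self.w2 + self.b2).reshape(self.n_agents, self.n_actions)
        z = np.exp(logits - logits.max(axis=1, keepdims=True))
        return z / z.sum(axis=1, keepdims=True)  # each agent's row sums to 1
```

The per-agent softmax guarantees that every output row is a valid action distribution, matching the constraint enforced by TuRBO-p on its candidates.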

1:Initialize state s_0 and joint policy π.
2:Initialize random parameters for the Nash Q Net, the Nash Policy Net, and the ϵ net.
3:Initialize MDP environment E.
4:for each episode do:
5:     Get initial state s_0 from E.
6:     for each timestep t do: Iterate until end of episode.
7:         Get action probabilities π(s_t) from the Nash Policy Net.
8:         for each agent i do: Iterate through agents.
9:              Sample selected action a_t^i from the Nash policy.
10:         end for
11:         Obtain joint action a_t.
12:         Assign ϵ̄ via TuRBO-p.
13:         Find the Nash policy via TuRBO-p.
14:         Execute joint action a_t in E to obtain rewards r_t and next state s_{t+1}.
15:         Append experience (s_t, a_t, r_t, s_{t+1}) to experience replay D.
16:         Update state s_t ← s_{t+1}.
17:         if D contains enough samples for a minibatch then:
18:              Sample minibatch from D.
19:              Set the target from the reward and the discounted Nash Q value of the next state.
20:              Perform the Nash Q update.
21:              Perform the Nash policy update, with the target policy from TuRBO-p.
22:              Perform the Nash ϵ̄ update, with the target ϵ̄ from TuRBO-p.
23:              Backpropagate the loss functions of the Nash Q Net, the Nash Policy Net, and the ϵ net.
24:              Update the model parameters via gradient descent.
25:         end if
26:         Update joint policy π.
27:     end for
28:end for
Algorithm 1 Nash equilibrium learning

5 Results

Deep learning loss convergence: We demonstrate empirically that the loss functions of the Nash Q Net and the Nash Policy Net decrease with each RL episode. The loss function for the Q update defined in Eq. (12) decreases as shown in Fig. 5.1. Furthermore, the loss of the Nash policy estimator also decreases, indicating that its representation of the function mapping a state to its corresponding Nash policy, based on the policy-to-ϵ model, becomes more accurate with each iteration. Hyperparameters are presented in the Appendix.


Figure 5.1: Decreasing loss behaviour of the batch update during Deep Q Learning (left y-axis) and the Nash Policy Net update (right y-axis), for Scenarios 1 and 2, demonstrating that the Nash Policy Net has learned a representation of the Nash policy.

Stabilization of realized market rewards at a Nash equilibrium: Fig. 5.2 shows the convergence of the average market reward of a single agent (randomly selected as agent 0) towards a Nash equilibrium. We superimpose the boundaries of the theoretical Nash equilibrium reward over the plot. The reward is obtained from the boundaries of the theoretical Nash equilibria, where ϵ ≈ 0. The topology of this function is illustrated in Fig. 3.1; for each state, or reference price, there exists a boundary within which a policy deviation of any agent can occur without significant unilateral reward gain. The reward is then computed using the market model parameters of each scenario, setting the theoretical equilibrium reward where ϵ ≈ 0 as two boundary points delimiting the equilibrium region. For each episode, the average reward per timestep over the episode length is recorded. In Fig. 5.2, we observe the average reward per agent of the system per episode (blue dashed line), and the average reward of a single agent per episode (orange dashed line), converge within the boundary of the Nash equilibrium (blue shade).

Figure 5.2: Market rewards per episode for Scenarios 1 and 2. In a Nash equilibrium, both the market average reward (blue) and the single-agent reward (orange) should fall within the Nash equilibrium boundary (blue shade) as the MG progresses.

6 Conclusion

We created a Markov Game representing an n-firm oligopoly, based on previously established market models for which theoretical bounds on a Nash equilibrium policy exist. A black box algorithm is applied to estimate an upper bound ϵ̄ on the value advantage of deviating from a joint policy, represented by a neural network. Similarly, a Nash Policy Net is learned to represent the ϵ̄-minimizing policy found by black box optimization, constituting an ϵ-Nash policy. These networks are used together with traditional Deep Q learning to solve for a Nash equilibrium. Empirically, we show that the average reward of all agents, and the reward of a respective single agent, converge to an approximate Nash equilibrium. The limitations of this research are the limited action space of the agents and the identical nature of the agents. Larger-scale experimentation under this framework is suggested, and the proposed market model could be enhanced with more complex, non-linear market oligopoly environments.


  • J. Bertrand (1889) Review of walras’s théorie mathématique de la richesse sociale and cournot’s recherches sur les principes mathématiques de la théorie des richesses. In Cournot Oligopoly: Characterization and Applications, A. F. Daughety (Ed.), pp. 73–81. External Links: Document Cited by: §1.
  • O. Besbes and A. Zeevi (2009) Dynamic pricing without knowing the demand function: risk bounds and near-optimal algorithms. Operations Research 57 (6), pp. 1407–1420. Cited by: §1.1.
  • R. A. Briesch, L. Krishnamurthi, T. Mazumdar, and S. P. Raj (1997) A comparative analysis of reference price models. Journal of Consumer Research 24 (2), pp. 202–214. Cited by: §2.2.
  • S. Ceppi, N. Gatti, G. Patrini, and M. Rocco (2010) Local search methods for finding a nash equilibrium in two-player games. pp. 335–342. External Links: Document Cited by: §4.2.
  • C. Daskalakis, P. W. Goldberg, and C. H. Papadimitriou (2006) The complexity of computing a nash equilibrium. In Proceedings of the Thirty-Eighth Annual ACM Symposium on Theory of Computing, STOC '06, New York, NY, USA, pp. 71–78. External Links: ISBN 1595931341, Document Cited by: §3.
  • A. V. den Boer and B. Zwart (2014) Simultaneously learning and optimizing using controlled variance pricing. Management Science 60 (3), pp. 770–783. Cited by: §1.1.
  • A. V. Den Boer and N. B. Keskin (2020) Discontinuous demand functions: estimation and pricing. Management Science 66, pp. . External Links: Document Cited by: §2.2.
  • Y. Diouane, V. Picheny, R. L. Riche, and A. S. Di Perrotolo (2021) TREGO: a trust-region framework for efficient global optimization. arXiv. External Links: Document, Link Cited by: §4.1.
  • D. Eriksson, M. Pearce, J. R. Gardner, R. Turner, and M. Poloczek (2019) Scalable global optimization via local bayesian optimization. CoRR abs/1910.01739. External Links: Link, 1910.01739 Cited by: §4.1.
  • K. J. Ferreira, B. H. A. Lee, and D. Simchi-Levi (2016) Analytics for an online retailer: demand forecasting and price optimization. Manufacturing & Service Operations Management 18 (1), pp. 69–88. Cited by: §1.2.
  • K. J. Ferreira and E. Mower (2022) Demand learning and pricing for varying assortments. Manufacturing & Service Operations Management. Cited by: §1.1.
  • K. J. Ferreira, D. Simchi-Levi, and H. Wang (2018) Online network revenue management using thompson sampling. Operations Research 66 (6), pp. 1586–1602. Cited by: §1.1.
  • G. Fibich, A. Gavious, and O. Lowengart (2003) Explicit solutions of optimization models and differential games with nonsmooth (asymmetric) reference-price effects. Operations Research 51 (5), pp. 721–734. Cited by: §2.2.
  • J. Filar and K. Vrieze (1997) Competitive markov decision processes. Springer. Cited by: §B.3.
  • J. K. Goeree, C. A. Holt, and T. R. Palfrey (2020) Stochastic game theory for social science: a primer on quantal response equilibrium. Edward Elgar Publishing, Cheltenham, UK. External Links: ISBN 9781785363320, Link Cited by: §2.3.
  • J. M. Harrison, N. B. Keskin, and A. Zeevi (2012) Bayesian dynamic pricing policies: learning and earning under a binary prior distribution. Management Science 58 (3), pp. 570–586. Cited by: §1.1.
  • P. Heidhues and B. Kőszegi (2014) Regular prices and sales. Theoretical Economics 9 (1), pp. 217–251. Cited by: §2.2.
  • J. Hu and M. P. Wellman (2003) Nash q-learning for general-sum stochastic games. Journal of Machine Learning Research 4, pp. 1039–1069. External Links: ISSN 1532-4435 Cited by: §1.2, §4, §4.
  • C. Janiszewski and D. R. Lichtenstein (1999) A range theory account of price perception. Journal of Consumer Research 25 (4), pp. 353–368. Cited by: §2.2.
  • N. B. Keskin and A. Zeevi (2014) Dynamic pricing with an unknown demand model: asymptotically optimal semi-myopic policies. Operations Research 62 (5), pp. 1142–1167. Cited by: §1.1.
  • T. Kozuno, P. Ménard, R. Munos, and M. Valko (2021) Model-free learning for two-player zero-sum partially observable markov games with perfect recall. arXiv. External Links: Document, Link Cited by: §1.2.
  • J. Laumônier and B. Chaib-draa (2005) Multiagent q-learning: preliminary study on dominance between the nash and stackelberg equilibriums. In Proceedings of the AAAI-2005 Workshop on Multiagent Learning, Pittsburgh, USA. Cited by: §4.
  • J. Liu, Y. Zhang, X. Wang, Y. Deng, and X. Wu (2019) Dynamic pricing on e-commerce platform with deep reinforcement learning. CoRR abs/1912.02572. External Links: Link, 1912.02572 Cited by: §1.1, §1.1.
  • J. Liu, Z. Pang, and L. Qi (2020) Dynamic pricing and inventory management with demand learning: a bayesian approach. Computers & Operations Research 124, pp. 105078. External Links: Document Cited by: §1.1.
  • R. D. Luce (1959) Individual choice behavior: a theoretical analysis. Wiley, New York, NY, USA. Cited by: §2.3.
  • T. Mazumdar, S. P. Raj, and I. Sinha (2005) Reference price research: review and propositions. Journal of Marketing 69 (4), pp. 84–102. Cited by: §2.2.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. External Links: ISSN 00280836 Cited by: §1.2, §4, §4.
  • J. F. Nash (1950) Equilibrium points in n-person games. Proceedings of the National Academy of Sciences 36 (1), pp. 48–49. External Links: Document, ISSN 0027-8424, Link, Cited by: §1.
  • I. Popescu and Y. Wu (2007) Dynamic pricing strategies with reference effects. Operations Research 55 (3), pp. 413–429. Cited by: §2.2.
  • R. Porter, E. Nudelman, and Y. Shoham (2008) Simple search methods for finding a nash equilibrium. Games and Economic Behavior 63 (2), pp. 642–662. Note: Second World Congress of the Game Theory Society Cited by: §1.2.
  • K. Raman, F. M. Bass, et al. (2002) A general test of reference price theory in the presence of threshold effects. Tijdschrift voor Economie en management 47 (2), pp. 205–226. Cited by: §2.2.
  • G. Ramponi, A. M. Metelli, A. Concetti, and M. Restelli (2021) Learning in non-cooperative configurable markov decision processes. In Advances in Neural Information Processing Systems, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan (Eds.), External Links: Link Cited by: §1.2.
  • V. R. Rao (1993) Pricing models in marketing. Handbooks in Operations Research and Management Science 5, pp. 517–552. Cited by: §2.2.
  • R. G. Regis (2016) Trust regions in kriging-based optimization with expected improvement. Engineering Optimization 48 (6), pp. 1037–1059. External Links: Link, Cited by: §4.1.
  • M. O. Sayin, K. Zhang, D. S. Leslie, T. Basar, and A. E. Ozdaglar (2021) Decentralized q-learning in zero-sum markov games. CoRR abs/2106.02748. External Links: Link, 2106.02748 Cited by: §1.2.
  • R. Schlosser and M. Boissier (2018) Dynamic pricing under competition on online marketplaces: a data-driven approach. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’18, New York, NY, USA. External Links: Document Cited by: §1.1.
  • A. Taudes and C. Rudloff (2012) Integrating inventory control and a price change in the presence of reference price effects: a two-period model. Mathematical Methods of Operations Research 75 (1), pp. 29–65. Cited by: §2.2, §2.2.
  • W. R. Thompson (1933) On the likelihood that one unknown probability exceeds another in view of evidence of two samples. Biometrika 25 (3-4), pp. 285–294. External Links: Document, Cited by: §4.1.
  • R. van de Geer, A. V. den Boer, C. Bayliss, C. S. M. Currie, A. Ellina, M. Esders, A. Haensel, X. Lei, K. D. S. Maclean, A. Martinez-Sykora, and et al. (2018) Dynamic pricing and learning with competition: insights from the dynamic pricing challenge at the 2017 informs rm & pricing conference. Journal of Revenue and Pricing Management 18 (3), pp. 185–203. External Links: ISSN 1477-657X, Link, Document Cited by: §1.1.
  • Y. Vorobeychik and M. Wellman (2008) Stochastic search methods for nash equilibrium approximation in simulation-based games. pp. 1055–1062. External Links: Document Cited by: §1.2.
  • L. Wang, R. Fonseca, and Y. Tian (2020) Learning search space partition for black-box optimization using monte carlo tree search. CoRR abs/2007.00708. External Links: Link, 2007.00708 Cited by: §4.1.
  • X. Wang and T. Sandholm (2003) Reinforcement learning to play an optimal nash equilibrium in team markov games. In Advances in Neural Information Processing Systems, S. Becker, S. Thrun, and K. Obermayer (Eds.), Vol. 15, pp. . External Links: Link Cited by: §1.2.
  • C. J. C. H. Watkins and P. Dayan (1992) Q-learning. Machine Learning 8 (3), pp. 279–292. External Links: Document Cited by: §3.2.
  • K. Zhang, S. M. Kakade, T. Basar, and L. F. Yang (2020a) Model-based multi-agent RL in zero-sum markov games with near-optimal sample complexity. CoRR abs/2007.07461. External Links: Link, 2007.07461 Cited by: §1.2.
  • Q. Zhang, Y. Guan, and P. Tsiotras (2020b) Learning nash equilibria in zero-sum stochastic games via entropy-regularized policy approximation. CoRR abs/2009.00162. External Links: Link, 2009.00162 Cited by: §1.2.

Supplemental Material

Appendix A Softmax Win Probability

Proposition: In an N-player game, when a player deviates from the equilibrium market price by some amount, then, given the softmax win probability in Eq. (4), the probability of winning a customer changes by the factor defined in Eq. (19).

Proof: Suppose N-1 players set an equal equilibrium price, and one player deviates from that price by a positive amount.

The players who do not deviate from the equilibrium price will have a win probability of


The player that deviates from the market price will have a win probability of


The increase in win probability obtained by deviating from the equilibrium price is denoted by


As follows from Eq. (20), changing one's price by a deviation with respect to the equilibrium price effectively changes the probability of winning a customer's purchase by a multiplicative factor.
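As a minimal numeric sketch of this proposition (the sensitivity parameter `lam`, the prices, and the utility form `-lam * price` inside the softmax are assumptions, since Eq. (4) is not reproduced here), undercutting a symmetric market multiplies the deviator's win probability relative to the uniform 1/N baseline:

```python
import math

def softmax_win_probs(prices, lam=1.0):
    """Softmax win probabilities over N sellers; a lower price yields higher utility."""
    weights = [math.exp(-lam * p) for p in prices]
    total = sum(weights)
    return [w / total for w in weights]

N, market_price, delta, lam = 4, 10.0, 0.5, 1.0

# One player undercuts the common market price by delta.
probs = softmax_win_probs([market_price - delta] + [market_price] * (N - 1), lam)
deviator = probs[0]

# At equal prices every player wins with probability 1/N; undercutting scales
# the deviator's weight by e^{lam * delta} before renormalizing, so the win
# probability changes by the multiplicative factor below.
factor = N * math.exp(lam * delta) / (math.exp(lam * delta) + (N - 1))
assert abs(deviator - factor / N) < 1e-12
```

The identity holds for any N and any deviation under these assumptions, mirroring the multiplicative-factor form asserted in the proposition.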

A.1 Admissible Values of Profit Function

Proposition: In an N-player game, when deviating from the market price, there always exists a boundary on the deviation amount such that the expected profit from deviating exists. We define this as the admissible range.


Proof: We define the gain function as,


Given the deviation condition and Condition (1), the admissible range is defined more precisely as the range of deviations for which a solution to Inequality (23) exists,


We see from Inequality (23) that there is a bound on the deviation amount beyond which no admissible values of the profit function result.
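The admissible range can be illustrated numerically. In the sketch below, the gain form `win * (p - delta) - p / N`, the softmax win probability, and all parameter values are stand-in assumptions rather than the paper's Eqs. (21)-(23); under them, the undercuts yielding a non-negative gain form a bounded interval:

```python
import math

def gain(delta, p=10.0, N=4, lam=1.0):
    # Hypothetical gain from undercutting the common market price p by delta:
    # the softmax win probability rises, but the margin per sale shrinks.
    win = math.exp(lam * delta) / (math.exp(lam * delta) + (N - 1))
    return win * (p - delta) - p / N

# Scan deviations; the admissible range is where the gain is non-negative.
step = 0.001
admissible = [i * step for i in range(1, int(10.0 / step)) if gain(i * step) >= 0]
lower, upper = min(admissible), max(admissible)
# Past the upper boundary, the price concession outweighs the increased
# win probability, so the deviation is no longer admissible.
```

This matches the proposition's qualitative claim: admissibility imposes an upper bound on how far a player may deviate.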

Appendix B Equilibrium Study

The Nash equilibrium of a pricing strategy can be either a pure or a mixed strategy. We prove that for a pure strategy in the support of a mixed strategy, an ϵ-Nash equilibrium exists. Consequently, multiple Nash equilibria can exist in this pricing game. Suppose a hypothetical equilibrium in which a common market price exists. We examine the hypothetical situation in which one agent deviates from that market price. In particular, we study the case where a player undercuts or prices above the set market price by some deviation amount. From the equilibrium setting, as follows from Eq. (24),


We derive Eq. (25) from Eq. (24). From Eq. (25) we see that the expected demand is simply the demand function at equal pricing, corrected by the factor defined in Eq. (26), which depends on the amount by which the player deviates from the equilibrium price.


B.1 Proof of a Best Response Function (Market Undercutting Scenario)

Proposition: In an N-player game, under specific market conditions dictated by the reference price and the equilibrium price, stipulated later in Condition (33), there can exist a boundary such that the expected profit from deviating is greater than the expected profit from not deviating, when undercutting the market price by some amount, as illustrated by Inequality (27).


Given the deviation constraint (refer to Supplemental A), we find the solution for the polynomial section of the gain function, defined by Eq. (28).


Under this constraint, a solution to Inequality (29) is restricted by Condition (30), and such a solution exists only when Condition (30) holds.


Substituting Eq. (24) into Eq. (31),


We see from Inequality (32) that if the equilibrium price of the other agents lies below a function of the reference price, no deviation satisfies Inequality (31). Therefore, the agent will not gain profit from undercutting if the current market price is under a certain limit with respect to a monotonic function of the reference price. The exact conditions under which undercutting the market will yield a profit gain for any agent are outlined in Inequality (33).


B.2 Proof of the Existence of Multiple ϵ-Nash Equilibria in a Single-Stage Game

Proposition: Suppose the market is in an equilibrium state where all agents price their items at a fixed price, and one player elects to undercut the market. We demonstrate that, for a single-stage game in this oligopoly, undercutting the market yields a theoretical maximum expected reward.


Inequality (34) expresses the conditions of an ϵ-Nash equilibrium: no player can obtain a reward higher by a margin of ϵ by deviating from the equilibrium price of the market. In effect, we express the upper bound of the gain in Eq. (20) as,


We take the partial derivative with respect to the deviation amount to obtain the theoretical maximum of the gain.


Solving for the root of the derivative,


The solution to Eq. (38) is,


We have proved that when a player deviates from the market price, there exists an optimal deviation amount, outlined in Eq. (39), such that the expected profit gain is maximized. The corresponding ϵ is therefore,


Multiple solutions with ϵ = 0 can exist, giving exact Nash equilibrium solutions, since the optimal deviation and the resulting gain are functions of the equilibrium price. The Nash equilibrium condition is,


Therefore, by undercutting the market at any price, a player can theoretically yield expected profits no more than ϵ greater than those of its competitors, as defined in Eq. (39) and Eq. (41). Suppose a pure-strategy policy exists where,


Thus, this policy constitutes an ϵ-Nash equilibrium resulting from a pure strategy, with ϵ denoted in Eq. (41); by varying the parameters in Eq. (39), multiple ϵ-Nash or exact Nash equilibria exist. ∎
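The argument above can be checked numerically. In this sketch the gain function, `lam`, and the prices are stand-in assumptions rather than the paper's Eqs. (35)-(39); a grid search recovers the optimal undercut and the corresponding ϵ, at which the first-order condition of Eq. (38) holds approximately:

```python
import math

def gain(delta, p=10.0, N=4, lam=1.0):
    # Hypothetical profit gain from undercutting the equilibrium price p by delta.
    win = math.exp(lam * delta) / (math.exp(lam * delta) + (N - 1))
    return win * (p - delta) - p / N

# Grid search stands in for solving Eq. (38) in closed form.
step = 0.0005
grid = [i * step for i in range(1, int(10.0 / step))]
delta_star = max(grid, key=gain)
epsilon = gain(delta_star)

# epsilon bounds the profit advantage of any unilateral undercut, matching
# the epsilon-Nash condition; the numeric derivative vanishes at delta_star.
h = 1e-4
derivative = (gain(delta_star + h) - gain(delta_star - h)) / (2 * h)
```

Because no deviation can beat `gain(delta_star)`, pricing at the common market price is an ϵ-Nash equilibrium with ϵ = `epsilon` under these assumed parameters.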

B.3 Existence of a Nash Equilibrium for the Markov Game

Proposition: The existence of an ϵ-Nash equilibrium in the single-stage game ensures that there exists an ϵ-Nash equilibrium in the Markov Game, as defined in Eq. (11), for the value of a joint policy, regardless of the state transition behaviour of the Markov Game.

Proof: Following Filar and Vrieze [1997], the value of a policy can be defined as,


Given a specific policy, the transition matrix of the Markov Game is known, and therefore the value of a policy can be expressed as in Eq. (44), with defined transition and reward matrices. Given the Nash equilibrium condition from Eq. (11), we must demonstrate that,


where the comparison is between any joint policy and a Nash equilibrium policy. We know that the Markov transition probabilities satisfy,


Each entry of the transition matrix represents the probability of transitioning from one state to another. Eq. (46) simply indicates that the sum of transition probabilities from any state to all possible successor states must equal 1, i.e., the transition matrix is row-stochastic. Furthermore, suppose,