Geometrical Regret Matching of Mixed Strategies

08/18/2019
by   Sizhong Lan, et al.

We argue that the existing regret matchings for equilibrium approximation lead to "jumpy" strategy updating when the probabilities of future plays are set to be proportional to positive regret measures. We propose a geometrical regret matching which yields "smooth" strategy updating. Our approach is simple, intuitive and natural. The analytical and numerical results show that continuously and "smoothly" suppressing "unprofitable" pure strategies is sufficient for the game to evolve towards equilibrium, suggesting that in reality this tendency could be pervasive and irresistible. Technically, iterative regret matching gives rise to a sequence of adjusted mixed strategies, whose approximation to the true equilibrium point is the subject of our study. The sequence can be analyzed in metric space and visualized as a clear path towards an equilibrium point. Our theory has limitations in optimizing the approximation accuracy.


I Introduction

In 2000 Hart and Mas-Colell proposed an iterative algorithm called regret matching to approximate an equilibrium Hart2000; Hart2001. The players keep track of the regrets of their own past plays and make future plays with probabilities proportional to positive regret measures. This algorithm is particularly natural in that the players don’t have to know their opponents’ payoff functions, as opposed to the non-adaptive variety, e.g. the celebrated Lemke-Howson algorithm Lemke1964; Lemke1965, which takes the two players’ payoff matrices as input and pinpoints equilibrium points as output. In other words, regret matching allows the players to reach equilibrium without strict mediation from beyond themselves. The concept of regret matching has since been adopted and developed by other algorithms cfr2008.

Iterative regret matching can be seen as continuous updating of the mixed strategy with regret information: the mixed strategy to be updated is the statistical structure of all past plays, and the updated mixed strategy determines the probabilities of plays in the immediate future. Regret matching is essentially a function between them. In general, the existing regret-matching functions update the mixed strategy in proportion to positive regret measures, meaning that each matching is a “strategy jump” and the past mixed strategy has little relevance except for its use in regret evaluation. In this paper, we propose a “smoother” regret matching that pays more respect to the past plays. Partly inspired by the mapping in Nash’s existential proof of equilibrium point Nash1951, our function for regret matching takes the form given in Eq. (5) below.

As shown in FIG 1, our regret-matching function can be interpreted geometrically if we treat mixed strategies and regrets on pure strategies both as vectors: each regret matching “pushes” the mixed strategy vector towards the regret vector by a small angle, and the smaller the adjustment rate is, the “smoother” the matching will be. Beyond the smoothness of strategy updating, more importantly our regret matching agrees well with the behavioral tendency of immediately suppressing “unprofitable” pure strategies, which makes the matching a natural one. Our regret matching turns out to have clear analytical results for regret minimization. The test results in FIG 2, 3 and 4 show that the matching delivers strong numerical convergence to Nash equilibrium both for two-person and for many-person non-cooperative games. And as shown in FIG 5, the convergence is independent of the initial mixed strategies with which the game is started.

Figure 1: An example of the geometrical relation of the mixed strategy , the regret vector and the regret-matched . The dotted line represents the simplex in . In this example the regret vector must point in the direction of one vertex, since the other one is “less profitable than average” so that the corresponding regret component in the vector is zero.
Figure 2: The pair of “strategy paths” on the simplex of for a two-person game. The game is referred to by the code name 3X3-1eq2sp, indicating that it has one single equilibrium point and that either player can use three pure strategies.
Figure 3: The pair of “strategy paths” for a two-person game where two players use and pure strategies respectively. The or dimensions of “strategy path” data are reduced with PCA to three dimensions for the visualization in .
Figure 4: The “strategy paths” for a six-person game, where each player can use three pure strategies.
Figure 5: The equilibrium point of game 3X3-1eq2sp is an “attractor” for all possible “strategy paths”.

The remainder of the paper goes as follows. In Section II, we vectorize the key concepts of the non-cooperative game as well as the regret measures in preparation for Section III, where our regret matching is elaborated and applied directly upon mixed strategies so as to approximate a Nash equilibrium point. In Section IV, the approximation algorithm is verified with examples both of the two-person game and of the general -person game, and their “strategy paths” are visualized as shown in FIG 2–5. In Section V, we address the accuracy issues of equilibrium approximation. In Section VI, we consider some generalizations of our theory.

II Non-cooperative game and regret

In this section, we will define the condition of Nash equilibrium as well as the regret measures in the form of a vector and its -norm.

First of all, let us define the familiar concepts of non-cooperative game and set up their notations.

  • There are players, and each player has pure strategies. The th player’s th pure strategy is denoted by .

  • The mixed strategy of each player is a point of a probability simplex , which is a compact and convex subset of the real vector space of dimensions. That is, .

  • Each player has a payoff function . Any is an -tuple of mixed strategies, e.g. in which each item is associated with the th player such that . Therefore, is the product of all players’ simplices, i.e. .

  • Any player can, from the given , form a new -tuple of mixed strategies, denoted by , by unilaterally substituting the th item of with its new mixed strategy , or as a special case, by unilaterally substituting with its th pure strategy. Obviously, if already uses .

By design, these vector-centric definitions and notations allow us to, for any player , translate the linearity of its payoff in into . Here is the inner product of vectors, is the mixed strategy player uses in and is the player’s payoff vector for the vertices of its simplex . Given any -tuple , for each player we have

(1)
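
As a concrete illustration of the inner-product form of Eq. (1), consider a minimal Python sketch for the two-person case. The names A, x and y, and the payoff values, are illustrative assumptions of this note, not objects from the paper.

    import numpy as np

    # Illustrative sketch of Eq. (1): a player's expected payoff is the inner
    # product of its mixed strategy with its vector of vertex (pure-strategy)
    # payoffs.  All names and numbers here are made up for the example.
    A = np.array([[3.0, 0.0], [1.0, 2.0]])  # row player's payoff matrix (assumed)
    x = np.array([0.5, 0.5])                # row player's mixed strategy
    y = np.array([0.25, 0.75])              # column player's mixed strategy

    vertex_payoffs = A @ y                  # payoff of each pure strategy against y
    expected_payoff = x @ vertex_payoffs    # Eq. (1): inner product form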

Given any , each player can substitute its item in with to form a new one , which means that we can designate for each player a variant payoff function of substitute strategy :

(2)

Note that is the expected value of payoff, or simply the average payoff, since can be conveniently treated as a probability distribution. From Eqs. (1) and (2) we can derive for each player a function

(3)

Here each component of vector is defined by the very function proposed by Nash in his existential proof:

(4)

Each component represents the payoff gain for player when its strategy moves from mixed to its th pure strategy. We say a pure strategy is “more profitable” than if its or otherwise “less (or equally) profitable”. Vector has at least one zero component, since there must be one “least profitable” pure strategy as implied by the linearity of in .
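
The regret vector of Eq. (4) is straightforward to compute once the opponents’ mixed strategies are fixed. Below is a minimal sketch for the row player of a two-person game; the function names and the NumPy representation are assumptions of this note, not the paper’s code.

    import numpy as np

    def regret_vector(A, x, y):
        # Sketch of Eq. (4) for the row player with payoff matrix A, own mixed
        # strategy x and opponent mixed strategy y: component j is the payoff
        # gain, clipped at zero, from unilaterally moving to pure strategy j.
        vertex_payoffs = A @ y              # payoffs at the simplex vertices
        average_payoff = x @ A @ y          # current expected payoff
        return np.maximum(vertex_payoffs - average_payoff, 0.0)

    def regret_sum(A, x, y):
        # The 1-norm of the regret vector, used throughout as an accuracy measure.
        return regret_vector(A, x, y).sum()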

Given , vector measures, in terms of payoff, how far a substitute is from the optimal one. Specifically, the -norm of , i.e. , decreases as increases and becomes zero when is maximized with the optimal substitute. Recall that an equilibrium point is an -tuple in which each player’s mixed strategy is optimal against those of its opponents Nash1951. That is, an -tuple is an equilibrium point if and only if or equivalently for each item in .

Vector will be the focal point of our study. We call the regret vector of player in that the vector indicates how much payoff it would have gained if its strategy moved to the vertices of its simplex , and correspondingly the regret sum of player . We shall occasionally abbreviate equilibrium point by EqPt, and write , and as implicitly indexed by given .

III Regret matching and fixed point iteration

In this section, we will conduct regret matching with regret vector iteratively on the initial mixed strategies to approximate a Nash equilibrium. Ideally, this iterative regret matching could develop into a fixed point iteration fpi.

We define each player’s regret matching to be a function :

(5)

Here . By this we mean that, on the game with the given -tuple , if the player unilaterally changes its mixed strategy to any other , it will regret that should have been used instead. We write occasionally. Our design of for regret matching has three important aspects to consider.
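
Since the displayed form of Eq. (5) is not reproduced in this text, the sketch below assumes a Nash-style normalized update T(s) = (s + λr)/(1 + λ∑ r_j) with adjustment rate λ > 0; this assumed form keeps the result on the probability simplex and matches the geometric and behavioral readings discussed next, but the authoritative expression is the one in the paper.

    import numpy as np

    def regret_match(s, r, lam):
        # Hedged sketch of the regret-matching map of Eq. (5), ASSUMING the
        # Nash-style normalized form T(s) = (s + lam*r) / (1 + lam*sum(r)),
        # where lam > 0 is the adjustment rate and r is the regret vector.
        # If s lies on the probability simplex and r is non-negative, the
        # result lies on the simplex as well.
        return (s + lam * r) / (1.0 + lam * r.sum())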

First, function has a geometrical interpretation.

We can see in FIG 1 that vector is “closer” to vector than vector is. Generally, the “closeness” of two vectors, say and , can be measured by their angle , which further can be measured by . By Theorem 1, we have and thus

(6)
Theorem 1.

For any non-zero real vectors , and where real , .

Proof.

By definition, we have and to compare. There are three cases to consider with respect to , as follows:
(i) If , by the triangle inequality we have

(7)

(ii) If , such that .
(iii) If , . Let and . Because , as with case (i) we have and thus , which easily gives . ∎
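
A quick numeric check of the angle property, under the update form assumed in the sketch above: the cosine similarity between the adjusted strategy and the regret vector should not decrease. The specific vectors are illustrative.

    import numpy as np

    def cosine(u, v):
        return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

    s = np.array([0.6, 0.3, 0.1])     # a mixed strategy (illustrative)
    r = np.array([0.0, 0.4, 0.9])     # a regret vector with one zero component
    lam = 0.05                        # a small adjustment rate
    s_new = (s + lam * r) / (1.0 + lam * r.sum())   # assumed form of Eq. (5)

    assert cosine(s_new, r) >= cosine(s, r)         # the angle does not grow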

Second, function has a behavioral interpretation.

Comparing vectors and component-wise, a pure strategy must be used with a smaller proportion in than in if its corresponding component in vector is zero. In other words, preferentially suppresses the use of “less profitable than average” pure strategies and meanwhile enhances the use of some of the “more profitable than average” ones. Note that this behavioral aspect was mentioned in Nash’s existential proof.

Third, gives more payoff and less regret than .

Theorem 2 and Theorem 3 show that will always be a better choice in terms of payoff and regret sum, no matter what the player unilaterally changes its mixed strategy to, unless itself is optimal.

Theorem 2.

For any and , where .

Proof.

By Eq. (2) we know where is the player ’s vertex payoff vector as in Eq. (1). Then by Eq. (5) we have,

(8)

Also by Eq. (2) we have . And

(9)

Here is the -norm of the vector. Finally, because and , we have

(10)

And if and only if . ∎

Theorem 3.

For any and , where .

Proof.

For each component in vector and its counterpart in vector , by Eq. (4) and Theorem 2 we have such that . ∎

These three aspects make function a natural regret matching of mixed strategies: if a player unilaterally and continuously adjusts its strategy by , intuitively its strategy will keep “pushing” towards the regret vector as its target, the use of “unprofitable” pure strategies will keep dropping, and most importantly its payoff will monotonically increase and its regret sum will monotonically decrease. Since a player’s payoff is bounded, continuous strategy adjustment by should eventually lead to the optimal mixed strategy with maximum payoff and zero regret sum.
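
A small numeric check of this unilateral picture, again under the assumed update form: with the opponent’s strategy held fixed, repeated adjustment yields a non-decreasing payoff and a non-increasing regret sum, in line with Theorems 2 and 3. The payoff matrix and rates below are illustrative.

    import numpy as np

    A = np.array([[3.0, 0.0, 1.0],
                  [1.0, 2.0, 0.0],
                  [0.0, 1.0, 2.0]])   # illustrative payoff matrix of the row player
    y = np.array([0.2, 0.5, 0.3])     # opponent's strategy, held fixed
    x = np.ones(3) / 3                # row player's initial mixed strategy
    lam = 0.1

    prev_payoff, prev_regret = -np.inf, np.inf
    for _ in range(200):
        r = np.maximum(A @ y - x @ A @ y, 0.0)        # regret vector, Eq. (4)
        payoff, regret = x @ A @ y, r.sum()
        assert payoff >= prev_payoff - 1e-12          # Theorem 2: payoff grows
        assert regret <= prev_regret + 1e-12          # Theorem 3: regret sum shrinks
        prev_payoff, prev_regret = payoff, regret
        x = (x + lam * r) / (1.0 + lam * r.sum())     # assumed form of Eq. (5)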

The rest of this section considers all players adjusting their strategies simultaneously and continuously.

On the game with given , if all players simultaneously adjust strategies by their own , a new will form. Let and we have

(11)

Note that for any because for any . All players’ continuous strategy adjustment can be described by an iterated function , denoted by with being the number of iterations. Given , those iterated functions with give rise to an infinite sequence of points in for our study:

(12)

Here denotes the identity function. This sequence is referred to as implicitly indexed by and . In fact, the tuple uniquely determines a sequence ; we say generates . Sequence describes an infinite dynamics of players adjusting their strategies in parallel. Specifically, at the th iteration, each player adjusts its strategy by substituting in with so as to collectively form a new -tuple ; at the th iteration is formed by adjusting in the same manner; and so forth.

Now we might hope that each and every player still has its payoff monotonically increase and its regret sum monotonically decrease in the process of simultaneous strategy adjustment. In that case, the infinite sequence must tend towards an EqPt. Moreover, it would be ideal if the overall regret sum of all players, i.e. , converged to zero along such that and thus for each player . In that case, by definition must be an EqPt. Denote it by and we must have . Therefore, the EqPt is also a fixed point fixedpoint of function ; the sequence generated by is essentially an outcome of the process of fixed point iteration, which can be formulated as . For our purpose we assume that, with proper finite iterations, can be a reasonable approximation of the true EqPt such that , and the accuracy of such EqPt approximation can be measured by the overall regret sum of all players.

This hope, however, comes with a huge caveat. As it turns out in Section V, the presumption of the monotonicity of payoff and regret sum for all players along oftentimes doesn’t hold true. Instead, due to simultaneous strategy adjustment each player’s regret sum might fluctuate along , with a tendency to diminish towards zero. Nevertheless, we will see that this tendency can still serve the purpose of approximating an EqPt. Notice that in FIG 1, the parameter in Eq. (5) determines the rate of the angle by which vector and the regret vector “close up”. Here we call the adjustment rate. To counter the influence from simultaneous strategy adjustment, as we will see in the next section “infinitesimal” adjustment rates are used for all players, in the hope that each player’s strategy adjustment could be treated as negligible by its opponents and thus their simultaneous strategy adjustment could be treated as approximately unilateral.

We should visualize the sequence of Eq. (III) as the players’ perpetual search for the equilibrium point of the non-cooperative game. In this search, players don’t have to be aware of the equilibrium point per se. Instead, each player is constantly chasing after its regret vector in the hope of reducing its regret sum to zero. Increasing payoff and meanwhile reducing regret sum is the players’ sole incentive and purpose, and the regret vector is the players’ sole information and target for their strategy adjustment. The regret vector is dynamic along the iterations, of course. At any iteration, given the current mixed strategies of its opponents, each player can calculate its regret vector by evaluating the payoff on its pure strategies, without any across-iteration retrospective considerations. In this search, there is no place for a mediator beyond the players, and no inter-player information exchange other than the current use of mixed strategies.

IV Game examples and visualizations

In this section we will verify our approach to EqPt approximation proposed in the last section. Examples of both the two-person game and the general -person game will be tested, although the former are our main consideration due to their simplicity and their sufficiency in revealing the important aspects of EqPt approximation.

For the two-person game we have players, say the row player with pure strategies and the column player with pure strategies. Accordingly, the -tuple is specialized to a two-tuple of vectors, and thus the tuple is specialized to , since function is determined by the two players’ payoff matrices and their adjustment rates . Recall that given any , the tuple generates an infinite sequence correspondingly. Then we can implement the fixed point iteration of the last section as Algorithm 1 (source code and demos can be found at https://github.com/lansiz/eqpt), which takes as input, and for our study outputs an approximate EqPt and its regret sum , a finite sequence in correspondence to , and a finite sequence of regret sums. In Algorithm 1, the operation compares vectors component-wise, as opposed to Eq. (4). Obviously, given the computational complexity of Algorithm 1 is under with .

1:payoff bimatrix , adjustment rates , initial two-tuple of mixed strategies and iterations .
2: approximate EqPt and its regret sum ; sequence ; sequence .
3:for to do
4:     vertices payoff: , .
5:     payoff: , .
6:     regret vector: , .
7:     regret sum: , .
8:     if overall regret sum is a new minimum then
9:          update with .
10:          update with .
11:     endif
12:     append into its sequence.
13:     append into its sequence.
14:     adjust mixed strategies: , .
15:end for
Algorithm 1 EqPt approximation of two-person game.
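
For readers who prefer code to pseudocode, the following is a compact Python sketch of Algorithm 1, again under the assumed normalized update; the function name and array conventions are choices of this note, and the authoritative implementation is the repository linked above.

    import numpy as np

    def approximate_eqpt(A, B, lam_a, lam_b, x, y, iters):
        # Sketch of Algorithm 1 for a two-person game.  A[i, j] and B[i, j] are
        # the row and column players' payoffs when row plays i and column plays j;
        # lam_a, lam_b are the adjustment rates; x, y the initial mixed strategies.
        best, best_regret = (x.copy(), y.copy()), np.inf
        path, regret_sums = [], []
        for _ in range(iters):
            va, vb = A @ y, B.T @ x                          # vertex payoffs
            pa, pb = x @ va, y @ vb                          # expected payoffs
            ra = np.maximum(va - pa, 0.0)                    # regret vectors
            rb = np.maximum(vb - pb, 0.0)
            overall = ra.sum() + rb.sum()                    # overall regret sum
            if overall < best_regret:                        # keep the best point seen
                best, best_regret = (x.copy(), y.copy()), overall
            path.append((x.copy(), y.copy()))
            regret_sums.append(overall)
            x = (x + lam_a * ra) / (1.0 + lam_a * ra.sum())  # assumed Eq. (5)
            y = (y + lam_b * rb) / (1.0 + lam_b * rb.sum())
        return best, best_regret, path, regret_sums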

Given the output of Algorithm 1, we can plot the true EqPt , the approximate EqPt and the sequence on or to provide an intuitive visualization of the EqPt approximation. Specifically, for the two-person games with and , we split into two sequences of , and then transform each into a sequence of for plotting on the plane by defining

(13)

Note that Eq. (13) transforms the simplex in into a subset of enclosed by an equilateral triangle; a plotting sketch is given after the list of observations below. The following are our observations on the test results of four typical games:

  • Game 3X3-1eq1sp has one unique EqPt, which uses one single pure strategy. As shown in FIG 6, converges straight towards the EqPt.

  • Game 3X3-1eq2sp has one unique EqPt, which uses two pure strategies. As shown in FIG 2, converges towards the EqPt with oscillation.

  • Game 3X3-2eq2sp has two EqPts, each of which uses two pure strategies. As shown in FIG 7, converges with oscillation towards one of the two EqPts, depending on the initial strategies.

  • Game 3X3-1eq3sp has one unique EqPt, which uses three pure strategies. As shown in FIG 8, doesn’t converge towards the EqPt but develops into a seemingly perfect circle away from the true EqPt. The last position of could be any point on the circle.
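
The plotting sketch promised above: one standard way to carry a point of the three-dimensional probability simplex onto an equilateral triangle in the plane. The exact coefficients of Eq. (13) are not reproduced in this text, so the transform below is an assumption that serves the same purpose.

    import numpy as np

    def simplex_to_plane(p):
        # Map a point (p1, p2, p3) of the probability simplex onto an equilateral
        # triangle in the plane; an Eq. (13)-style transform (assumed coefficients).
        p1, p2, p3 = p
        return np.array([p2 + 0.5 * p3, (np.sqrt(3) / 2.0) * p3])

    # The barycenter of the simplex maps to the centroid of the triangle.
    print(simplex_to_plane(np.array([1/3, 1/3, 1/3])))   # approx. (0.5, 0.289)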

Figure 6: The “strategy paths” on the simplex of for game 3X3-1eq1sp. The two black crosses combine to represent the true EqPt. The input and output of game are annotated.
Figure 7: The “strategy paths” for game 3X3-2eq2sp. The approximate EqPt annotated corresponds to the true EqPt at the base line of equilateral triangle.
Figure 8: The “strategy paths” for game 3X3-1eq3sp.
Figure 9: The “strategy paths” for a game. The blue and red dots represent the last positions of strategies.

For the general two-person games with and , we split into two sequences of , and then with PCA (Principal Component Analysis for dimension reduction pca) transform each into a sequence of for plotting in space; a dimension-reduction sketch is given after the list below. The following are our observations on the test results of two typical games:

  • FIG 3 shows a game whose converges towards an EqPt.

  • FIG 9 shows a game whose develops into a perfect circle.
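
The dimension-reduction sketch promised above, assuming scikit-learn’s PCA; the paper does not prescribe a particular implementation, and the random path below only stands in for a real output of Algorithm 1.

    import numpy as np
    from sklearn.decomposition import PCA

    # strategy_path: one row per iteration, one column per pure strategy of the
    # player whose path is being visualized; here it is random placeholder data.
    rng = np.random.default_rng(0)
    strategy_path = rng.dirichlet(np.ones(10), size=500)

    points_3d = PCA(n_components=3).fit_transform(strategy_path)   # for 3-D plotting
    print(points_3d.shape)   # (500, 3)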

For the two-person game, its EqPts are determined by the bimatrix , independent of the initial . Then for the game examples mentioned previously, we test them with many random and observe that there appear to be two kinds of EqPts, “attractor” EqPt and “repellor” EqPt, as follows:

Figure 10: Both EqPts of game 3X3-2eq2sp are “attractors”. Ten pairs of random “strategy paths” are depicted.
Figure 11: The EqPt of game 3X3-1eq3sp is a “repellor”. Ten pairs of random “strategy paths” are depicted, five of which start from close to the true EqPt.
  • For game 3X3-1eq2sp FIG 5 shows that, any is “attracted” towards the unique EqPt.

  • For game 3X3-2eq2sp FIG 10 shows that, any is “attracted” towards one of the two EqPts.

  • For game 3X3-1eq3sp FIG 11 shows that, any is “repelled” from the EqPt unless is exactly the EqPt.

We also test examples of the -person game by extending Algorithm 1 from the bivariate case to the general multivariate case, which is best explained by the source code; a sketch of the key step follows. And we observe convergence or non-convergence of as with the two-person games. FIG. 4 shows an example of the convergence case. The computational complexity of this extended algorithm is under with being the number of players and , meaning that it is polynomial time when is given.
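
The key step of that extension is computing each player’s vertex payoffs when the opponents play mixed strategies; the regret vector and the update are then exactly as in the two-person case. A sketch, assuming each player’s payoff is stored as an n-dimensional tensor with one axis per player (an assumed representation, not the repository’s):

    import numpy as np

    def vertex_payoffs_nperson(U_i, strategies, i):
        # Expected payoff of player i's pure strategies when every opponent j != i
        # plays its current mixed strategy strategies[j].  U_i is player i's payoff
        # tensor with one axis per player.
        t = np.moveaxis(U_i, i, 0)                        # own axis first
        for j, s in enumerate(strategies):
            if j != i:
                t = np.tensordot(t, s, axes=([1], [0]))   # average out opponent j
        return t                                          # length-m_i vector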

V EqPt approximation accuracy

In the last section, we observed that for some games the sequence seemingly converges towards an EqPt, e.g., 3X3-1eq1sp, 3X3-1eq2sp and 3X3-2eq2sp. For games such as 3X3-1eq3sp, by contrast, the sequence clearly doesn’t converge at all; instead, it evolves into a perpetual cyclic path away from the EqPt. Roughly speaking, the convergence or non-convergence of affects the accuracy of EqPt approximation. In this section, we will look into the approximation accuracy from two numerical perspectives. One is to examine the sequence , since the regret sum is a measure of approximation accuracy as discussed in Section III, and the other is to revisit the convergence of with a metric. We will then show the connection between them.

(a) Game 3X3-1eq1sp.
(b) Game 3X3-1eq2sp.
(c) Game 3X3-2eq2sp.
(d) Game 3X3-1eq3sp.
Figure 12: The output sequences of Algorithm 1 for the four typical games in the last section.

First, taking the games of the last section as examples, we have the following observations on their sequences plotted in FIG 12:

  • As opposed to the presumption in Section III, the regret sum oftentimes doesn’t monotonically decrease. Instead, due to the inter-player influence from simultaneous strategy adjustment, the regret sum fluctuates with a decreasing tendency to reach new minima. Game 3X3-1eq1sp barely has regret sum fluctuation, whereas game 3X3-1eq3sp has intense regret sum fluctuation.

  • There are different degrees of periodicity in the regret sum fluctuation for different games. The regret sum fluctuation of game 3X3-1eq3sp exhibits strong periodicity and yet a weak decreasing tendency.

  • The strong periodicity of regret sum sequence in game 3X3-1eq3sp, caused by the “repellor” EqPt, severely undermines the accuracy of EqPt approximation.

  • It is painfully slow for the regret sum sequence to converge towards zero, in that the regret sum decreases slowly at the later iterations, even for 3X3-1eq1sp. That is partly because, as implied by Eq. (10), the ratio of the payoff increment to the regret sum gets smaller as the iteration goes on.

It can now be concluded that the inter-player influence has a significant impact on the EqPt approximation accuracy, which otherwise could, given enough iterations, approach perfection. In addition, FIG 13 shows the decreasing tendency of the regret sum for a five-person game.

Figure 13: The regret sum sequences of a five-person game with players using two, three, four, five and six pure strategies.

Next, for the games in the last section, we will reexamine their output sequences by introducing a metric on . Generally, a metric is necessary for the definitions of the limit, convergence and contractiveness of the sequence . Following the definitions in Section III, let us have a set and a function of Eq. (11). From we can derive a complete metric space with the metric function

(14)

Here and are the th items of and respectively. Assume that is a contraction mapping on metric space . That is, there exists a such that for any

(15)

Then by the Banach fixed point theorem fixedpoint, the function must admit one unique fixed point such that and for any . Apparently the contraction condition of Eq. (15) is too strong for the convergence of , since it precludes the sequence’s possible convergence towards more than one EqPt, which is an important case of our interest as shown in the last section. Instead, let us consider the contraction property of a specific sequence generated by the given tuple . From we can derive an infinite real sequence by defining

(16)

And from sequence we can further derive an infinite real sequence by defining

(17)
(a) Game 3X3-1eq1sp.
(b) Game 3X3-1eq2sp.
(c) Game 3X3-2eq2sp.
(d) Game 3X3-1eq3sp.
Figure 14: The derived sequences in blue and in red for the four typical games in last section.
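
Given a stored strategy path, the derived sequences of Eqs. (16) and (17) can be computed directly. The sketch below assumes that the metric of Eq. (14) is the sum of the players’ Euclidean distances, that Eq. (16) is the distance between consecutive points, and that Eq. (17) is the ratio of consecutive distances; these readings are assumptions of this note, made to match the discussion that follows.

    import numpy as np

    def derived_sequences(path):
        # path: a list of tuples of per-player mixed-strategy arrays, e.g. the
        # output of the Algorithm 1 sketch above.  Returns the assumed Eq. (16)
        # sequence a (consecutive distances) and Eq. (17) sequence b (their ratios).
        a = np.array([sum(np.linalg.norm(u - v) for u, v in zip(p, q))
                      for p, q in zip(path[:-1], path[1:])])
        b = a[1:] / a[:-1]    # ratios; they fluctuate around 1 when convergence is slow
        return a, b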

If is ideally a contractive sequence, i.e. for any , it must be a Cauchy sequence and converge to an EqPt. That is, is a process of fixed point iteration. Meanwhile, the sequence must monotonically decrease and converge to zero. From the sequences generated by the games of the last section, we derive and depict their and in FIG 14 to show that the sequences don’t monotonically decrease or converge to zero, meaning that neither the sequences nor their functions are contractive on . Nevertheless, by comparing FIG 12 and FIG 14, we observe that there is strong coherence among the sequences , and of the same game, as follows:

  • As with regret sum, oftentimes fluctuates with a decreasing tendency to reach new minimum. of game 3X3-1eq1sp exhibits almost no fluctuation, whereas of game 3X3-1eq3sp exhibits strong periodicity.

  • As with regret sum, in game 3X3-1eq3sp the strong periodicity of could have hurt the accuracy of EqPt approximation.

  • fluctuates around at the later iterations, which explains why and the regret sum sequence converge so slowly.

Therefore, can be seen as another overall measure of EqPt approximation accuracy in addition to the overall regret sum, and to improve accuracy is to reduce , which in turn justifies our definition of the metric function in Eq. (14). Trivially, there is an alternative metric function , and applying it gives us observations similar to those with Eq. (14). Here is the alternative metric:

(18)

Now we can see the shortcomings of our theory: it relies too much on naked-eye observation. Given and , only by observing the generated sequence and its derived sequences can we learn about their convergence or non-convergence, their periodicity, whether the EqPts are “attractors” or “repellors”, the EqPt approximation accuracy, etc. Yet we fail, in the case of the two-person game, to provide a function, say , to conveniently determine for . Nor can we provide a function to determine whether every leads to a convergent sequence.

As a practical use of our theory, given a game and the players’ initial strategies we need to find the optimal parameters for function in order to optimize the EqPt approximation accuracy. That is, in the case of the two-person game, given the payoff bimatrix and the initial , we need to find the optimal input for Algorithm 1 to minimize the overall regret sum. For that purpose, we can obviously try different adjustment rates in the input, as in FIG 15. And notice that the bimatrix , where and , has the same EqPts as the bimatrix . Thus we can try different scale factors in the input, as shown in FIG 16. With the parameters , and combined, a search space forms for function , which could be overwhelming when it comes to many-person games. A brute-force search sketch is given below.

Figure 15: The approximate EqPt’s regret sum of Algorithm 1 for different adjustment rates in the case of game 3X3-1eq3sp. For simplicity we consider for the input .
Figure 16: The approximate EqPt’s regret sum of Algorithm 1 for different bimatrix scale factors in the case of game 3X3-1eq3sp. For simplicity we consider and for the input . Note that the output is “scaled back” with for proper comparison.
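
The brute-force sketch of this parameter search for the two-person case, reusing the approximate_eqpt sketch from Section IV (a name assumed by this note, not a function from the paper’s repository); the grid values, the example bimatrix and the scale-back convention are illustrative assumptions.

    import itertools
    import numpy as np

    A = np.array([[0.0, 1.0, -1.0],
                  [-1.0, 0.0, 1.0],
                  [1.0, -1.0, 0.0]])          # illustrative bimatrix: a zero-sum game
    B = -A
    x0 = y0 = np.ones(3) / 3

    best = None
    for lam, c in itertools.product([0.001, 0.01, 0.1], [0.1, 1.0, 10.0]):
        # approximate_eqpt is the Algorithm 1 sketch given in Section IV.
        _, regret, _, _ = approximate_eqpt(c * A, c * B, lam, lam, x0, y0, 5000)
        scaled_back = regret / c               # "scale back" for a fair comparison
        if best is None or scaled_back < best[0]:
            best = (scaled_back, lam, c)
    print(best)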

VI Variants of function

In this section, we will introduce two variants of function for the purpose of generalization.

By applying a general function to each component of the vector , we have the first variant:

(19)

Accordingly, we can rewrite Eq. (9) as

(20)
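
A sketch of this first variant, again under the assumed normalized form, where a componentwise function g is applied to the regret vector before the update; g is a hypothetical name and np.square is only an illustrative choice.

    import numpy as np

    def regret_match_variant(s, r, lam, g=np.square):
        # Eq. (19)-style variant (assumed form): apply a componentwise function g
        # to the regret vector before the normalized update.  g = identity would
        # recover the original map of Eq. (5).
        gr = g(r)
        return (s + lam * gr) / (1.0 + lam * gr.sum())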

Suppose that for any and