# Geometrical Regret Matching of Mixed Strategies

We argue that existing regret-matching schemes for equilibrium approximation lead to "jumpy" strategy updating when the probabilities of future plays are set proportional to positive regret measures. We propose a geometrical regret matching with "smooth" strategy updating. Our approach is simple, intuitive and natural. The analytical and numerical results show that continuously and "smoothly" suppressing "unprofitable" pure strategies is sufficient for a game to evolve towards equilibrium, suggesting that in reality this tendency could be pervasive and irresistible. Technically, iterative regret matching gives rise to a sequence of adjusted mixed strategies, whose approximation to the true equilibrium point we study. The sequence can be analyzed in a metric space and visualized nicely as a clear path towards an equilibrium point. Our theory has limitations in optimizing the approximation accuracy.



## I Introduction

In 2000, Hart and Mas-Colell proposed an iterative algorithm called regret matching to approximate an equilibrium Hart2000; Hart2001. The players keep track of the regrets of their own past plays and make future plays with probabilities proportional to positive regret measures. This algorithm is particularly natural in that the players don't have to know their opponents' payoff functions, as opposed to the non-adaptive variety, e.g. the celebrated Lemke–Howson algorithm Lemke1964; Lemke1965, which takes the two players' payoff matrices as input and pinpoints equilibrium points as output. In other words, regret matching allows the players to reach equilibrium without strict mediation from beyond themselves. The concept of regret matching has since been adopted and developed by other algorithms cfr2008.

Iterative regret matching can be seen as continuous updating of a mixed strategy with regret information: the mixed strategy to update is the statistical structure of all past plays, and the updated mixed strategy determines the probabilities of plays in the immediate future. Regret matching is essentially a function between them. In general, the existing regret-matching functions update the mixed strategy proportionally to positive regret measures, meaning that each matching is a "strategy jump" and the past mixed strategy has little relevance except for being used for regret evaluation. In this paper, we propose a "smoother" regret matching with more respect for the past plays. Partly inspired by the mapping in Nash's existential proof of equilibrium point Nash1951, our function for regret matching takes the form

 s′_i = (s_i + r_i R_{i,S}(s_i)) / (1 + r_i ∥R_{i,S}(s_i)∥), where r_i > 0.

As shown in FIG 1, our regret-matching function can be interpreted geometrically if we treat mixed strategies and regrets on pure strategies both as vectors: each regret matching "pushes" the mixed-strategy vector towards the regret vector by a small angle, and the smaller r_i is, the "smoother" the matching will be. Beyond the smoothness of strategy updating, more importantly our regret matching agrees well with the behavioral tendency of immediately suppressing "unprofitable" pure strategies, which makes the matching a natural one. Our regret matching turns out to have clear analytical results for regret minimization. The test results in FIG 2, 3 and 4 show that the matching delivers strong numerical convergence to Nash equilibrium both for two-person and many-person non-cooperative games. And as shown in FIG 5, the convergence is independent of the initial mixed strategies the game starts with.

The remainder of the paper goes as follows. In Section II, we vectorize the key concepts of non-cooperative games as well as the regret measures, in preparation for Section III, where our regret matching is elaborated and applied directly to mixed strategies so as to approximate a Nash equilibrium point. In Section IV, the approximation algorithm is verified with examples both of two-person games and of general n-person games, and their "strategy paths" are visualized as shown in FIG 2–5. In Section V, we address the accuracy issues of equilibrium approximation. In Section VI, we consider some generalizations of our theory.

## II Non-cooperative game and regret

In this section, we will define the condition of Nash equilibrium as well as the regret measures in the form of a vector and its 1-norm.

First of all, let us define the familiar concepts of non-cooperative game and set up their notations.

• There are n players, and each player i has g_i pure strategies. The ith player's jth pure strategy is denoted by π_{ij}.

• The mixed strategy of each player i is a point s_i of a probability simplex Δ_i, which is a compact and convex subset of the real vector space of g_i dimensions. That is, s_i ∈ Δ_i ⊂ ℝ^{g_i}.

• Each player i has a payoff function p̄_i: Ω → ℝ. Any S ∈ Ω is an n-tuple of mixed strategies, e.g. S = (s_1, …, s_i, …, s_n), in which each item s_i is associated with the ith player such that s_i ∈ Δ_i. Therefore, Ω is the product of all players' simplices, i.e. Ω = Δ_1 × ⋯ × Δ_n.

• Any player i can, from the given S, form a new n-tuple of mixed strategies, denoted by (s_i; S), by unilaterally substituting the ith item of S with its new mixed strategy s_i, or as a special case (π_{ij}; S), by unilaterally substituting it with its jth pure strategy. Obviously, (s_i; S) = S if S already uses s_i.

Intendedly, these vector-centric definitions and notations allow us, for any player i, to translate the linearity of its payoff in s_i into p̄_i(s_i; S) = ⟨s_i, v_{i,S}⟩. Here ⟨·,·⟩ is the inner product of vectors, s_i is the mixed strategy player i uses in S, and v_{i,S} is the player's payoff vector for the vertices of its simplex Δ_i. Given any n-tuple S, for each player i we have

 v_{i,S} = (p̄_i(π_{i1}; S), …, p̄_i(π_{ij}; S), …, p̄_i(π_{ig_i}; S)). (1)

Given any S, each player i can substitute its item in S with s_i to form a new tuple (s_i; S), which means that we can designate for each player a variant payoff function of the substitute strategy s_i:

 p_{i,S}(s_i) = p̄_i(s_i; S) = ⟨s_i, v_{i,S}⟩. (2)

Note that p_{i,S}(s_i) is the expected value of payoff, or simply the average payoff, since s_i can be conveniently treated as a probability distribution. From Eq. (1) and (2) we can derive for each player i a function R_{i,S}:

 R_{i,S}(s_i) = (φ_{i1,S}(s_i), …, φ_{ij,S}(s_i), …, φ_{ig_i,S}(s_i)). (3)

Here each component φ_{ij,S} of the vector is defined by the very function proposed by Nash in his existential proof:

 φ_{ij,S}(s_i) = max{0, p̄_i(π_{ij}; S) − p_{i,S}(s_i)}. (4)

Each component φ_{ij,S}(s_i) represents the payoff gain for player i when its strategy moves from the mixed s_i to its jth pure strategy. We say a pure strategy π_{ij} is "more profitable" than s_i if its φ_{ij,S}(s_i) > 0, and otherwise "less (or equally) profitable". Vector R_{i,S}(s_i) has at least one zero component, since there must be one "least profitable" pure strategy, as implied by the linearity of p̄_i in s_i.

Given S, vector R_{i,S}(s_i) measures, in terms of payoff, how far a substitute s_i is from the optimal one. Specifically, the 1-norm of R_{i,S}(s_i), i.e. ∥R_{i,S}(s_i)∥, decreases as p_{i,S}(s_i) increases and turns zero when p_{i,S}(s_i) is maximized with the optimal substitute. Recall that an equilibrium point is an n-tuple in which each player's mixed strategy is optimal against those of its opponents Nash1951. That is, an n-tuple S is an equilibrium point if and only if ∥R_{i,S}(s_i)∥ = 0, or equivalently R_{i,S}(s_i) = 0, for each item s_i in S.

Vector R_{i,S}(s_i) will be the focal point of our study. We call it the regret vector of player i, in that the vector indicates how much payoff the player would have gained if its strategy had moved to the vertices of its simplex Δ_i, and correspondingly we call ∥R_{i,S}(s_i)∥ the regret sum of player i. We shall occasionally abbreviate equilibrium point as EqPt, and write R_i, p_i and φ_{ij} as implicitly indexed by S when S is given.
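For concreteness, Eqs. (1)–(4) can be sketched in a few lines of NumPy; the 3×3 payoff matrix `A` and the uniform strategies below are illustrative only, not one of the paper's test games:

```python
import numpy as np

def regret_vector(s, v):
    """Regret vector of Eqs. (3)-(4): per-vertex payoff gain over the
    current average payoff, clipped at zero."""
    avg = s @ v                      # average payoff p_{i,S}(s_i), Eq. (2)
    return np.maximum(0.0, v - avg)  # components phi_{ij,S}(s_i), Eq. (4)

# Illustrative 3x3 game; A is the row player's payoff matrix.
A = np.array([[3.0, 0.0, 0.0],
              [0.0, 2.0, 0.0],
              [0.0, 0.0, 1.0]])
s1 = np.array([1/3, 1/3, 1/3])  # row player's mixed strategy
s2 = np.array([1/3, 1/3, 1/3])  # column player's mixed strategy
v1 = A @ s2                     # row player's vertex payoff vector, Eq. (1)
R1 = regret_vector(s1, v1)
regret_sum = R1.sum()           # 1-norm, since all components are nonnegative
```

As the text notes, at least one component of `R1` is zero, namely those of the pure strategies that pay no more than the current average.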

## III Regret matching and fixed point iteration

In this section, we will conduct regret matching with the regret vector iteratively on the initial mixed strategies to approximate a Nash equilibrium. Ideally, this iterative regret matching could develop into a fixed point iteration fpi.

We define each player's regret matching to be a function ψ_{i,S}: Δ_i → Δ_i:

 s′_i = ψ_{i,S}(s_i) = (s_i + r_i R_{i,S}(s_i)) / (1 + r_i ∥R_{i,S}(s_i)∥). (5)

Here r_i > 0. By ψ_{i,S} we translate that, in the game with given n-tuple S, if player i unilaterally changes its mixed strategy to any other s_i, it will regret that s′_i should have been used instead. We occasionally write ψ_i. Our design of ψ_{i,S} for regret matching has three important aspects to consider.
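A minimal NumPy sketch of the update in Eq. (5); the strategy and regret vectors below are illustrative:

```python
import numpy as np

def regret_match(s, R, r):
    """One application of Eq. (5): push the mixed strategy s towards the
    regret vector R with adjustment rate r > 0. Since R is nonnegative,
    R.sum() is its 1-norm, and the denominator keeps the result on the
    probability simplex."""
    return (s + r * R) / (1.0 + r * R.sum())

s = np.array([0.5, 0.3, 0.2])
R = np.array([0.4, 0.0, 0.1])  # illustrative regret vector (one zero component)
s_new = regret_match(s, R, r=0.1)
# s_new still sums to one, and the zero-regret pure strategy loses weight.
```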

First, function ψ_{i,S} has a geometrical interpretation.

We can see in FIG 1 that vector s′_i is "closer" to vector R_i(s_i) than vector s_i is. Generally, the "closeness" of two vectors, say u and w, can be measured by their angle ∠(u, w), which in turn can be measured by cos∠(u, w). By Theorem 1, we have

 cos∠(s′_i, R_i(s_i)) ≥ cos∠(s_i, R_i(s_i)). (6)
###### Theorem 1.

For any non-zero real vectors u, w and u′ = (u + r w)/(1 + r∥w∥), where real r > 0, cos∠(u′, w) ≥ cos∠(u, w).

###### Proof.

By definition, we have cos∠(u′, w) = ⟨u′, w⟩/(∥u′∥∥w∥) and cos∠(u, w) = ⟨u, w⟩/(∥u∥∥w∥) to compare. Since positive scaling does not change angles, we may compare u + rw with u directly. There are three cases to consider with respect to ⟨u, w⟩, as follows:
(i) If ⟨u, w⟩ ≥ 0, by the triangle inequality ∥u + rw∥ ≤ ∥u∥ + r∥w∥ and the Cauchy–Schwarz inequality ⟨u, w⟩ ≤ ∥u∥∥w∥, we have

 cos∠(u + rw, w) = (⟨u, w⟩ + r∥w∥²)/(∥u + rw∥∥w∥) ≥ (⟨u, w⟩ + r∥w∥²)/(∥u∥∥w∥ + r∥w∥²) ≥ ⟨u, w⟩/(∥u∥∥w∥) = cos∠(u, w). (7)

(ii) If ⟨u, w⟩ < 0 and ⟨u, w⟩ + r∥w∥² ≥ 0, then cos∠(u + rw, w) ≥ 0 > cos∠(u, w), such that the claim holds.
(iii) If ⟨u, w⟩ + r∥w∥² < 0, then r∥w∥ < ∥u∥ by the Cauchy–Schwarz inequality. Let a = ⟨u, w⟩ + r∥w∥² < 0 and b = (∥u∥ − r∥w∥)∥w∥ > 0. Because ∥u + rw∥ ≥ ∥u∥ − r∥w∥ and a < 0, we have cos∠(u + rw, w) ≥ a/b, and a∥u∥ − ⟨u, w⟩(∥u∥ − r∥w∥) = r∥w∥(∥u∥∥w∥ + ⟨u, w⟩) ≥ 0, which easily gives a/b ≥ ⟨u, w⟩/(∥u∥∥w∥) = cos∠(u, w). ∎

Second, function ψ_{i,S} has a behavioral interpretation.

By comparing vectors s_i and s′_i component-wise, there must be a smaller proportion of pure strategy π_{ij} used in s′_i than in s_i if its corresponding component in vector R_i(s_i) is zero. In other words, ψ_{i,S} preferably suppresses the use of the "less profitable than average" pure strategies and meanwhile enhances the use of some of the "more profitable than average" ones. Note that this behavioral aspect was mentioned in Nash's existential proof.

Third, s′_i gives more payoff and less regret than s_i.

Theorem 2 and Theorem 3 show that s′_i will always be a better choice than s_i in terms of payoff and regret sum, no matter what mixed strategy s_i the player unilaterally starts from, unless s_i itself is already optimal.

###### Theorem 2.

For any S and s′_i = ψ_{i,S}(s_i), p_{i,S}(s′_i) ≥ p_{i,S}(s_i), where equality holds if and only if R_{i,S}(s_i) = 0.

###### Proof.

By Eq. (2) we know p_i(s′_i) = ⟨s′_i, v_i⟩, where v_i is player i's vertex payoff vector as in Eq. (1). Then by Eq. (5) we have

 ⟨s′_i, v_i⟩ = (⟨s_i, v_i⟩ + r_i⟨R_i(s_i), v_i⟩)/(1 + r_i∥R_i(s_i)∥) = ⟨s_i, v_i⟩ + (⟨R_i(s_i), v_i⟩ − ⟨s_i, v_i⟩∥R_i(s_i)∥)/(r_i^{−1} + ∥R_i(s_i)∥). (8)

Also by Eq. (2) we have p_i(s_i) = ⟨s_i, v_i⟩. And

 ⟨R_i(s_i), v_i⟩ − ⟨s_i, v_i⟩∥R_i(s_i)∥ = Σ_{j=1}^{g_i} φ_{ij}(s_i) p̄_i(π_{ij}; S) − Σ_{j=1}^{g_i} p_i(s_i) φ_{ij}(s_i) = Σ_{j=1}^{g_i} φ_{ij}(s_i)(p̄_i(π_{ij}; S) − p_i(s_i)). (9)

Here ∥·∥ is the 1-norm of a vector, i.e. ∥R_i(s_i)∥ = Σ_j φ_{ij}(s_i). Finally, because each term φ_{ij}(s_i)(p̄_i(π_{ij}; S) − p_i(s_i)) = φ_{ij}(s_i)² ≥ 0 and r_i > 0, we have

 p_i(s′_i) − p_i(s_i) = (Σ_{j=1}^{g_i} φ_{ij}(s_i)²)/(r_i^{−1} + ∥R_i(s_i)∥) ≥ 0. (10)

And p_i(s′_i) = p_i(s_i) if and only if R_i(s_i) = 0. ∎

###### Theorem 3.

For any S and s′_i = ψ_{i,S}(s_i), ∥R_{i,S}(s′_i)∥ ≤ ∥R_{i,S}(s_i)∥, where equality holds if and only if R_{i,S}(s_i) = 0.

###### Proof.

For each component φ_{ij}(s′_i) in vector R_i(s′_i) and its counterpart φ_{ij}(s_i) in vector R_i(s_i), by Eq. (4) and Theorem 2 we have φ_{ij}(s′_i) ≤ φ_{ij}(s_i), such that ∥R_i(s′_i)∥ ≤ ∥R_i(s_i)∥. ∎
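Theorems 2 and 3 can likewise be checked numerically: holding the opponent fixed, repeated unilateral steps of Eq. (5) never decrease the payoff and never increase the regret sum. The random 3×3 matrix and the strategies below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.uniform(size=(3, 3))    # illustrative payoff matrix for player 1
s2 = np.array([0.2, 0.5, 0.3])  # opponent's strategy, held fixed
s1 = np.array([0.6, 0.3, 0.1])
r = 0.5
v1 = A @ s2                     # vertex payoffs, Eq. (1); fixed with s2
for _ in range(100):
    R1 = np.maximum(0.0, v1 - s1 @ v1)             # regret vector, Eq. (4)
    s1_new = (s1 + r * R1) / (1.0 + r * R1.sum())  # Eq. (5)
    R1_new = np.maximum(0.0, v1 - s1_new @ v1)
    assert s1_new @ v1 >= s1 @ v1 - 1e-12    # Theorem 2: payoff never drops
    assert R1_new.sum() <= R1.sum() + 1e-12  # Theorem 3: regret never grows
    s1 = s1_new
```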

These three aspects make function ψ_{i,S} a natural regret matching of mixed strategies: if a player unilaterally and continuously adjusts its strategy by ψ_{i,S}, intuitively its strategy will keep "pushing" towards the regret vector as target, the use of "unprofitable" pure strategies will keep dropping, and most importantly its payoff will monotonically increase and its regret sum will monotonically decrease. Since a player's payoff is bounded, continuous strategy adjustment by ψ_{i,S} would preferably lead to the optimal mixed strategy with maximum payoff and zero regret sum.

The rest of this section considers all players adjusting their strategies simultaneously and continuously.

In the game with given S, if all players simultaneously adjust strategies by their own ψ_{i,S}, a new n-tuple forms. Let Ψ(S) denote it and we have

 Ψ(S) = (ψ_{1,S}(s_1), …, ψ_{i,S}(s_i), …, ψ_{n,S}(s_n)). (11)

Note that Ψ(S) ∈ Ω for any S ∈ Ω, because ψ_{i,S}(s_i) ∈ Δ_i for any s_i ∈ Δ_i. All players' continuous strategy adjustment can then be described by an iterated function of Ψ, denoted by Ψ^t with t being the number of iterations. Given an initial S_0 ∈ Ω, those iterated functions with t = 0, 1, 2, … give rise to an infinite sequence of points in Ω for our study:

 S_0 = Ψ^0(S_0), S_1 = Ψ^1(S_0) = Ψ(S_0), S_2 = Ψ^2(S_0) = Ψ(S_1), ⋯, S_t = Ψ^t(S_0) = Ψ(S_{t−1}), ⋯ (12)

Here Ψ^0 denotes the identity function. This sequence is referred to as {S_t}, implicitly indexed by Ψ and S_0. In fact, the tuple (Ψ, S_0) uniquely determines a sequence {S_t}; we say (Ψ, S_0) generates {S_t}. Sequence {S_t} describes an infinite dynamics of players adjusting their strategies in parallel. Specifically, at the t-th iteration, each player i adjusts its strategy by substituting s_i in S_{t−1} with ψ_{i,S_{t−1}}(s_i), so as to collectively form a new n-tuple S_t; at the (t+1)-th iteration, S_{t+1} is formed by adjusting S_t in the same manner; and so forth.

Now we might hope that each and every player still has its payoff monotonically increase and its regret sum monotonically decrease in this process of simultaneous strategy adjustment. In that case, the infinite sequence {S_t} must tend towards an EqPt. Moreover, it would be ideal if the overall regret sum of all players, i.e. Σ_i ∥R_{i,S}(s_i)∥, converged to zero along {S_t}, such that {S_t} converges to some S* with ∥R_{i,S*}(s_i)∥ = 0 for each player i. In that case, by definition S* must be an EqPt, and we must have Ψ(S*) = S*. Therefore, the EqPt S* is also a fixed point fixedpoint of function Ψ; the sequence generated by (Ψ, S_0) is essentially an outcome of the process of fixed point iteration, which can be formulated as S_t = Ψ(S_{t−1}). For our purpose we assume that, with a proper finite number of iterations t, S_t can be a reasonable approximation of the true EqPt such that S_t ≈ S*, and the accuracy of such EqPt approximation can be measured by the overall regret sum of all players.
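The simultaneous iteration S_t = Ψ(S_{t−1}) can be sketched as below. This is a simplified stand-in for Algorithm 1, not the implementation from the eqpt repository; the prisoner's-dilemma-style bimatrix is illustrative, chosen because its unique EqPt uses a single pure strategy and behaves as an "attractor".

```python
import numpy as np

def iterate(A, B, s1, s2, r1, r2, T):
    """Run T steps of the simultaneous update of Eq. (11) for a
    two-person game with payoff matrices A (row) and B (column).
    Returns the final strategies and the overall regret sum per step."""
    regret_sums = []
    for _ in range(T):
        v1 = A @ s2                 # row player's vertex payoffs
        v2 = B.T @ s1               # column player's vertex payoffs
        R1 = np.maximum(0.0, v1 - s1 @ v1)
        R2 = np.maximum(0.0, v2 - s2 @ v2)
        regret_sums.append(R1.sum() + R2.sum())
        # simultaneous regret matching, Eqs. (5) and (11)
        s1 = (s1 + r1 * R1) / (1.0 + r1 * R1.sum())
        s2 = (s2 + r2 * R2) / (1.0 + r2 * R2.sum())
    return s1, s2, regret_sums

# Illustrative dominance-solvable game (prisoner's-dilemma-like payoffs).
A = np.array([[3.0, 0.0],
              [5.0, 1.0]])
B = A.T
s1, s2, sums = iterate(A, B, np.array([0.9, 0.1]), np.array([0.2, 0.8]),
                       r1=0.1, r2=0.1, T=5000)
# The strategies drift towards the pure EqPt and the regret sum decays.
```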

Only this hope comes with a huge caveat. As it turns out in Section V, the presumption of monotonic payoff and regret sum for all players along {S_t} oftentimes doesn't hold true. Instead, due to simultaneous strategy adjustment, each player's regret sum might fluctuate along {S_t} with a tendency to diminish towards zero. Nevertheless, we will see that this tendency can still serve the purpose of approximating an EqPt. Notice that in FIG 1, the parameter r_i in Eq. (5) determines the rate of the angle by which vector s_i and regret vector R_i(s_i) "close up". We call r_i the adjustment rate. To counter the influence from simultaneous strategy adjustment, as we will see in the next section, "infinitesimal" adjustment rates are used for all players, in the hope that each player's strategy adjustment could be treated as negligible by its opponents and thus their simultaneous strategy adjustment could be treated as approximately unilateral.

We should visualize the sequence of Eq. (12) as the players' perpetual search for the equilibrium point of a non-cooperative game. In this search, players don't have to be aware of the equilibrium point per se. Instead, each player is constantly chasing after its regret vector in the hope of reducing its regret sum to zero. Increasing payoff and meanwhile reducing regret sum is the players' sole incentive and purpose, and the regret vector is the players' sole information and target for their strategy adjustment. The regret vector is dynamic along iterations, of course. At any iteration, given the current mixed strategies of its opponents, each player can calculate its regret vector by evaluating the payoff on its pure strategies without any across-iteration retrospective considerations. In this search, there is no place for a mediator beyond the players, and no inter-player information exchange other than the current use of mixed strategies.

## IV Game examples and visualizations

In this section we will verify our approach of EqPt approximation proposed in the last section. Examples both of two-person games and of general n-person games will be tested, although the former are our main consideration due to their simplicity and sufficiency in revealing the important aspects of EqPt approximation.

For the two-person game we have n = 2 players, say the row player with g_1 pure strategies and the column player with g_2 pure strategies. Accordingly, the n-tuple S is specialized to a two-tuple of vectors, and thus the tuple (Ψ, S_0) is specialized to (A, B, r_1, r_2, S_0), since function Ψ is determined by the two players' payoff matrices A and B and their adjustment rates r_1 and r_2. Recall that any such tuple generates an infinite sequence {S_t} correspondingly. Then we can implement the fixed point iteration of the last section as Algorithm 1 (source code and demos can be found at https://github.com/lansiz/eqpt), which takes (A, B, r_1, r_2, S_0) as input and, for our study, outputs an approximate EqPt S_T and its regret sum, a finite sequence {S_t} with t ≤ T, and the corresponding finite sequence of regret sums. In Algorithm 1, the max operation compares vectors component-wise, as opposed to Eq. (4). Obviously, given the number of iterations T, the computational complexity of Algorithm 1 is under O(T·g_1·g_2).

Given the output of Algorithm 1, we can plot the true EqPt S*, the approximate EqPt S_T and the sequence {S_t} in two or three dimensions to provide an intuitive visualization of the EqPt approximation. Specifically, for the two-person games with g_1 = g_2 = 3, we split {S_t} into the two players' sequences {s_t}, and then transform each into a sequence of points (x, y)_t for plotting on the plane by defining

 (x, y)_t = s_t M, where M is the 3×2 matrix with rows (0, 0), (√2/2, √6/2) and (√2, 0). (13)
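A sketch of the transform of Eq. (13), whose matrix rows map the three simplex vertices to the corners of an equilateral triangle of side √2:

```python
import numpy as np

# Rows of M are the plane images of the three vertices of the simplex.
M = np.array([[0.0,              0.0],
              [np.sqrt(2) / 2.0, np.sqrt(6) / 2.0],
              [np.sqrt(2),       0.0]])

def to_plane(s):
    """Map a length-3 probability vector s to (x, y) as in Eq. (13)."""
    return s @ M

# The three vertices land on the corners of an equilateral triangle.
corners = [to_plane(np.eye(3)[k]) for k in range(3)]
```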

Note that Eq. (13) transforms the simplex in ℝ³ into a subset of ℝ² enclosed by an equilateral triangle. The following are our observations on the test results of four typical games:

• Game 3X3-1eq1sp has one unique EqPt, which uses one single pure strategy. As shown in FIG 6, {S_t} converges straight towards the EqPt.

• Game 3X3-1eq2sp has one unique EqPt, which uses two pure strategies. As shown in FIG 2, {S_t} converges towards the EqPt with oscillation.

• Game 3X3-2eq2sp has two EqPts, each of which uses two pure strategies. As shown in FIG 7, {S_t} converges with oscillation towards one of the two EqPts, dependent on the initial strategies.

• Game 3X3-1eq3sp has one unique EqPt, which uses three pure strategies. As shown in FIG 8, {S_t} doesn't converge towards the EqPt but develops into a seemingly perfect circle away from the true EqPt. The last position of S_T could be any point on the circle.

And for the general two-person games with g_1 > 3 or g_2 > 3, we split {S_t} into the two players' sequences {s_t}, and then with PCA (Principal Component Analysis, for dimension reduction pca) transform each into a sequence of points in ℝ³ for plotting in space. The following are our observations on the test results of two typical games:

• FIG 3 shows a game whose {S_t} converges towards an EqPt.

• FIG 9 shows a game whose {S_t} develops into a perfect circle.

For the two-person game, its EqPts are determined by the bimatrix (A, B), independent of the initial S_0. Then, for the game examples mentioned previously, we test them with many random S_0 and observe that there appear to be two kinds of EqPts, "attractor" EqPts and "repellor" EqPts, as follows:

• For game 3X3-1eq2sp, FIG 5 shows that any {S_t} is "attracted" towards the unique EqPt.

• For game 3X3-2eq2sp, FIG 10 shows that any {S_t} is "attracted" towards one of the two EqPts.

• For game 3X3-1eq3sp, FIG 11 shows that any {S_t} is "repelled" from the EqPt unless S_0 is exactly the EqPt.

We also test examples of the n-person game by extending Algorithm 1 from the bivariate case to the general multivariate case, which can be better explained by the source code. And we observe the convergence or non-convergence of {S_t} as with the two-person games. FIG 4 shows an example of the convergence case. The computational complexity of this extended algorithm is under O(T·n·gⁿ), with n being the number of players and g = max_i g_i, meaning that it is polynomial time when n is given.

## V EqPt approximation accuracy

In the last section, we observed that for some games sequence {S_t} seemingly converges towards an EqPt, e.g. 3X3-1eq1sp, 3X3-1eq2sp and 3X3-2eq2sp, whereas for games such as 3X3-1eq3sp, sequence {S_t} clearly doesn't converge at all; instead, it evolves into a perpetual cyclic path away from the EqPt. Roughly speaking, the convergence or non-convergence of {S_t} affects the accuracy of EqPt approximation. In this section, we will look into the approximation accuracy from two numerical perspectives. One is to examine the sequence of regret sums, since the regret sum is a measure of approximation accuracy as discussed in Section III; the other is to revisit the convergence of {S_t} with a metric. And we will show the connection between them.

First, taking the games of the last section for example, we have the following observations on their regret sum sequences plotted in FIG 12:

• As opposed to the presumption in Section III, the regret sum oftentimes doesn't monotonically decrease. Instead, due to the inter-player influence from simultaneous strategy adjustment, the regret sum fluctuates with a decreasing tendency to reach new minimums. Game 3X3-1eq1sp barely has regret sum fluctuation, whereas game 3X3-1eq3sp has intensive regret sum fluctuation.

• There are different degrees of periodicity in the regret sum fluctuation for different games. The regret sum fluctuation of game 3X3-1eq3sp exhibits strong periodicity and yet a weak decreasing tendency.

• The strong periodicity of the regret sum sequence in game 3X3-1eq3sp, caused by the "repellor" EqPt, severely undermines the accuracy of EqPt approximation.

• It is painfully slow for the regret sum sequence to converge towards zero, in that the regret sum decreases slowly at the latter iterations, even for 3X3-1eq1sp. That is partly because, as implied by Eq. (10), the ratio of payoff increment to regret sum gets smaller as iteration goes on.

Now it can be concluded that the inter-player influence has a significant impact on the EqPt approximation accuracy, which otherwise could, given time, go on to perfection. In addition, FIG 13 shows the decreasing tendency of regret sum for a five-person game.

Next, for the games of the last section, we will reexamine their output sequences {S_t} by introducing a metric on Ω. Generally, a metric is necessary for the definitions of limit, convergence and contractiveness of sequence {S_t}. Following the definitions in Section III, let us take the set Ω and the function Ψ of Eq. (11). From Ω we can derive a complete metric space (Ω, d) with the metric function

 d(X, Y) = Σ_{i=1}^{n} ∥x_i − y_i∥. (14)

Here x_i and y_i are the ith items of X and Y respectively. Assume that Ψ is a contraction mapping on the metric space (Ω, d). That is, there exists a q ∈ [0, 1) such that for any X, Y ∈ Ω

 d(Ψ(X), Ψ(Y)) ≤ q·d(X, Y). (15)

Then by the Banach fixed point theorem fixedpoint, the function Ψ must admit one unique fixed point S* such that Ψ(S*) = S* and lim_{t→∞} Ψ^t(S_0) = S* for any S_0 ∈ Ω. Apparently the contraction condition of Eq. (15) is too strong for the convergence of {S_t}, since it precludes the sequence's possible convergence towards more than one EqPt, which is an important case of our interest as shown in the last section. Instead, let us consider the contraction property of a specific sequence {S_t} generated by the given tuple (Ψ, S_0). From {S_t} we can derive an infinite real sequence {ḋ_t} by defining

 ḋ_t = d(S_{t−1}, S_t) = d(S_{t−1}, Ψ(S_{t−1})). (16)

And from sequence {ḋ_t} we can further derive an infinite real sequence {q̇_t} by defining

 q̇_t = ḋ_{t+1}/ḋ_t. (17)
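Given a recorded sequence {S_t}, the derived sequences of Eqs. (14), (16) and (17) can be computed as below, with each S_t represented as a tuple of per-player strategy vectors:

```python
import numpy as np

def d(X, Y):
    """Metric of Eq. (14): sum over players of the norm of the
    difference of their strategies."""
    return sum(np.linalg.norm(x - y) for x, y in zip(X, Y))

def derived_sequences(S_seq):
    """From {S_t}, derive the step sizes of Eq. (16) and their
    consecutive ratios of Eq. (17); ratios staying below one would
    indicate a contractive, hence convergent, sequence."""
    d_dot = [d(S_seq[t - 1], S_seq[t]) for t in range(1, len(S_seq))]
    q_dot = [d_dot[t + 1] / d_dot[t] for t in range(len(d_dot) - 1)]
    return d_dot, q_dot
```

For a geometrically shrinking sequence the ratios sit at a constant q < 1; for the cyclic paths of Section IV they hover around 1.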

If {S_t} is ideally a contractive sequence, i.e. q̇_t ≤ q < 1 for any t, it must be a Cauchy sequence and converge to an EqPt. That is, {S_t} is a process of fixed point iteration. Meanwhile, the sequence {ḋ_t} must monotonically decrease and converge to zero. From the sequences {S_t} generated by the games of the last section, we derive and depict their {ḋ_t} and {q̇_t} in FIG 14 to show that the sequences {ḋ_t} don't monotonically decrease or converge to zero, meaning that neither the sequences {S_t} nor their functions Ψ are contractive on (Ω, d). Nevertheless, by comparing FIG 12 and FIG 14, we observe a strong coherence among the sequences {ḋ_t}, {q̇_t} and the regret sums of the same game, as follows:

• As with the regret sum, ḋ_t oftentimes fluctuates with a decreasing tendency to reach new minimums. {ḋ_t} of game 3X3-1eq1sp exhibits almost no fluctuation, whereas {ḋ_t} of game 3X3-1eq3sp exhibits strong periodicity.

• As with the regret sum, in game 3X3-1eq3sp the strong periodicity of {ḋ_t} could have hurt the accuracy of EqPt approximation.

• q̇_t fluctuates around 1 at the latter iterations, which explains why {S_t} and the regret sum sequence converge at such a low speed.

Therefore, ḋ_t can be seen as another overall measure of EqPt approximation accuracy in addition to the overall regret sum, and to improve accuracy is to reduce ḋ_t, which in return justifies our definition of the metric function in Eq. (14). Trivially, there is an alternative metric function, and applying it gives us observations similar to those with Eq. (14). Here is the alternative metric:

 d(X, Y) = max_{1≤i≤n} ∥x_i − y_i∥. (18)

Now we can see the shortcomings of our theory: it relies too much on naked-eye observation. Given Ψ and S_0, only by observing the generated sequence {S_t} and its derived sequences can we learn about their convergence or non-convergence, their periodicity, EqPts being "attractors" or "repellors", EqPt approximation accuracy, etc. And yet, in the case of the two-person game, we fail to provide a function, if any, to conveniently determine the limit of {S_t} for a given (A, B, r_1, r_2, S_0). Nor can we provide a function to determine whether every S_0 leads to a convergent sequence.

As a practical use of our theory, given a game and the players' initial strategies S_0, we need to find the optimal parameters for function Ψ in order to optimize the EqPt approximation accuracy. That is, in the case of the two-person game, given payoff bimatrix (A, B) and initial S_0, we need to find the optimal input for Algorithm 1 to minimize the overall regret sum. For that purpose, obviously we can try different adjustment rates r_1 and r_2 in the input, as in FIG 15. And notice that the bimatrix (c_1 A, c_2 B), where c_1 > 0 and c_2 > 0, has the same EqPts as the bimatrix (A, B). Thus we can try different scale factors c_1 and c_2 in the input, as shown in FIG 16. With the parameters r_1, r_2, c_1 and c_2 combined, there forms a search space for function Ψ, which could be overwhelming when it comes to many-person games.
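This search can be brute-forced on a small grid. The routine below is a compact Algorithm-1-style stand-in (see the eqpt repository for the real implementation), and the grid values and the bimatrix are arbitrary illustrations:

```python
import numpy as np

def run(A, B, r1, r2, S0, T=500):
    """Iterate Eq. (11) for T steps and return the final overall regret sum."""
    s1, s2 = S0
    for _ in range(T):
        v1, v2 = A @ s2, B.T @ s1
        R1 = np.maximum(0.0, v1 - s1 @ v1)
        R2 = np.maximum(0.0, v2 - s2 @ v2)
        s1 = (s1 + r1 * R1) / (1.0 + r1 * R1.sum())
        s2 = (s2 + r2 * R2) / (1.0 + r2 * R2.sum())
    v1, v2 = A @ s2, B.T @ s1
    return (np.maximum(0.0, v1 - s1 @ v1).sum()
            + np.maximum(0.0, v2 - s2 @ v2).sum())

A = np.array([[3.0, 0.0], [5.0, 1.0]])  # illustrative bimatrix game
B = A.T
S0 = (np.array([0.5, 0.5]), np.array([0.5, 0.5]))
# Scaling (A, B) by positive factors preserves the EqPts, so both the
# adjustment rates and the scale factors are legitimate search knobs.
best = min((run(c1 * A, c2 * B, r1, r2, S0), r1, r2, c1, c2)
           for r1 in (0.01, 0.1, 1.0) for r2 in (0.01, 0.1, 1.0)
           for c1 in (1.0, 2.0) for c2 in (1.0, 2.0))
```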

## VI Variants of function R_{i,S}

In this section, we will introduce two variants of function R_{i,S} for the purpose of generalization.

By applying to each component φ_{ij,S} of the vector a general function α_{ij}, we have the first variant:

 R̄_{i,S}(s_i) = (α_{i1}∘φ_{i1,S}(s_i), …, α_{ij}∘φ_{ij,S}(s_i), …, α_{ig_i}∘φ_{ig_i,S}(s_i)). (19)

Accordingly, we can rewrite Eq. (9) as

 = Σ_{j=1}^{g_i} α_{ij}∘φ_{ij}(s_i)(p̄_i(π_{ij}; S) − p_i(s_i)). (20)

Suppose that for any and