Learning by Fictitious Play in Large Populations

01/09/2019 ∙ by Misha Perepelitsa, et al. ∙ University of Houston

We consider learning by fictitious play in a large population of agents engaged in single-play, two-person rounds of a symmetric game, and derive a mean-field type model for the corresponding stochastic process. Using this model, we describe qualitative properties of the learning process and discuss its asymptotic behavior. Of special interest is the comparison of fictitious play learning with and without a memory factor. As a part of the analysis, we show that the model leads to the continuous best-response dynamics equation of Gilboa and Matsui (1991) when all agents have similar empirical probabilities.




1 Introduction

Learning theories are concerned with the rules by which players can discover optimal strategies in repeated plays of games, typically when the players act in self-interest, lack complete information about the game, and have limited ability to communicate with other players.

In learning by fictitious play (FP) one assumes that players keep statistics of their opponents' actions over the whole history of the process and compute empirical probabilities for the actions played, as if playing against a stationary environment. Given the game payoffs, agents' actions are best responses to their assessment of opponents. A number of sufficient conditions for convergence of the empirical probabilities to Nash equilibria were established by Robinson (1951), Miyasawa (1961), Nachbar (1990), and Krishna & Sjostrom (1995).

Fictitious play learning, however, need not converge, as shown by an example of Shapley (1964) in which the empirical probabilities follow a cycle. This type of behavior was discussed in greater detail by Gilboa & Matsui (1991), Gaunersdorfer & Hofbauer (1995), Monderer et al. (1997), and Benaim, Hofbauer & Hopkins (2009).

The convergence of empirical probabilities, even if it does take place, does not tell the whole story of learning. It is equally important to know what actions are selected by the players in the course of learning. An example of Fudenberg & Kreps (1993), for the game in Table 1, shows that the process can go through correlated play of (L, L) and (R, R), with players realizing zero payoffs rather than the positive payoff of the Nash equilibrium. A variant of fictitious play, called stochastic fictitious play, was introduced by Fudenberg & Kreps (1993), following the idea of Harsanyi (1973), to provide a reasonable learning model in which players choose mixed strategies as their best responses. The convergence of stochastic FP in 2x2 games in various situations was established by Fudenberg & Kreps (1993), Benaim & Hirsch (1996), and Kaniovski & Young (1995).

There are different scenarios for how learning evolves in a population of agents, depending on the available information and on how plays are arranged between agents. One of the scenarios, which we adopt in this paper, is to consider sequential plays between pairs of randomly selected agents and to keep the outcomes of the games private. To describe learning in such processes, Gilboa & Matsui (1991) proposed a continuous-time ODE model for the change of the distribution of players on the set of pure strategies. A similar equation was used by Fudenberg & Levine (1998) to describe the changes in population average subjective probabilities.

The equation holds under the conditions that the number of players is large and that only a small number of players adjust their play to the best response to the population average during short time periods. It is also implicitly assumed that all players in the population have nearly the same vector of empirical probabilities at all times, and the best response function is evaluated at that vector.

The equation, known as the best response dynamics (BRD) equation, became a popular model for studying FP learning in large populations, and its asymptotic properties were discussed by Hofbauer & Sigmund (1998), Gaunersdorfer & Hofbauer (1995), and Benaim, Hofbauer & Hopkins (2009).

In this paper we would like to obtain a refinement of the BRD equation by considering the changes in the probability density function for the distribution of agents in the space of empirical priors, rather than postulating equations for its moments.

The mean-field model that we obtain contains significantly more information about the state of the learning. It shows the whole spectrum of empirical priors in the population and indicates how learning affects agents with different subjective probabilities.

We follow an approach based on the Fokker-Planck equation, which is classical for many-agent systems in physics, biology, and the social sciences. It applies to systems in the state of “chaos”, when the states of learning of two randomly selected agents are independent (or nearly independent). This condition is reasonable in large populations, where the same pair of agents is rarely selected for a play. At the same time, it excludes any type of correlation (coordination) between players.

Two models of fictitious play are considered in this paper: the classical FP and FP with a memory factor. The latter model places higher weight on more recent observations, in contrast to the equally weighted classical model. Memory factor models have been used in the context of reinforcement learning by Harley (1981), Erev & Roth (1998), and Roth & Erev (1995), and in fictitious play learning by Benaim, Hofbauer & Hopkins (2009).

The paper is organized as follows. In section 2 we describe the model and the PDE model that serves as an approximation of the learning process. The derivation of the equations from the many-particle stochastic process is presented in the Appendix. Section 2.2 compares the dynamics of the stochastic learning with the corresponding dynamics of the deterministic PDE model, using the example of the 2x2 miscoordination game from Table 1. In section 2.3 we explain how the BRD equation is obtained from the PDE model, and in sections 2.4 and 2.5 we consider the learning dynamics obtained from the PDE model in games with a single Nash equilibrium. In section 2.6 we return to the example of the miscoordination game of Fudenberg & Kreps (1993) and, using the PDE model as the predictor for the evolution of learning, show that both the mean subjective probabilities and the mean best response probabilities converge to the Nash equilibrium. Thus, in large populations, each agent gets a positive payoff in roughly half of the plays.

        L        R
L    (0, 0)   (1, 1)
R    (1, 1)   (0, 0)
Table 1: Persistent miscoordination game from Fudenberg & Kreps (1993).
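As an illustration, the best response in the game of Table 1 can be sketched as follows. The array name `PAYOFF`, the function `best_response`, and the tie-breaking rule (lowest index wins) are assumptions made for this sketch and are not notation from the paper.

```python
import numpy as np

# Row player's payoffs in the game of Table 1; index 0 = L, index 1 = R.
# The game is symmetric, so the same matrix serves every agent.
PAYOFF = np.array([[0.0, 1.0],
                   [1.0, 0.0]])

def best_response(counts):
    """Pure best response to the empirical distribution implied by `counts`.

    `counts[k]` is how many times the opponents were seen playing action k.
    Ties are broken in favour of the lowest index (an assumed rule).
    """
    counts = np.asarray(counts, dtype=float)
    probs = counts / counts.sum()      # empirical probabilities
    expected = PAYOFF @ probs          # expected payoff of each own action
    return int(np.argmax(expected))

# At the mixed strategy (1/2, 1/2) both own actions earn the same payoff,
# which is the indifference condition for a mixed equilibrium.
equilibrium_payoffs = PAYOFF @ np.array([0.5, 0.5])
```

An agent who has mostly seen L answers with R, and vice versa, which is why persistent miscoordination on (L, L) and (R, R) is possible when two agents hold similar priors.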

2 The Model

We consider a series of plays of a symmetric 2-player game between randomly selected agents in a large population. The game has a finite set of pure strategies. Each agent maintains a vector that records the number of times her opponents have played each pure strategy up to the current epoch. The agent uses this vector as a prior for the probability of each action to be played next by her opponent, and selects her action as a best response. In classical fictitious play the best response is a multi-valued function. For definiteness we assume that there is a rule by which an agent decides between equivalent actions. This rule does not enter the equations for the averaged dynamics as long as ties are non-generic, that is, the set of mixed strategies for which the best response is multi-valued has measure zero. This condition is assumed throughout the paper.

After the play, the agent updates her empirical priors by rule (1): the count of the action played by her opponent receives the learning increment, and the counts of the other actions receive none. The learning increment can be any positive number without altering the large-time learning process. A positive parameter in the rule is the memory factor. Learning starts from given initial priors. At each epoch two agents are selected at random for a play of the game. Only the agents who played the game update their priors, and the outcome is not revealed to others. The stochastic process defined in this way is a discrete-time Markov process on the state space of the priors of all agents.
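The process described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: the function name `simulate_fp`, the parameter names, the initial box, the tie-breaking rule, and in particular the form of the update with a memory factor, `priors <- (1 - mu) * priors + h * e_k`, are assumptions made for the sketch. The payoffs are those of Table 1.

```python
import numpy as np

def simulate_fp(n_agents=200, n_games=5000, h=0.1, mu=0.0, seed=0):
    """Pairwise fictitious play in the game of Table 1 (0 = L, 1 = R).

    Assumed update for an agent who saw the opponent play action k:
        priors <- (1 - mu) * priors + h * e_k
    """
    rng = np.random.default_rng(seed)
    # Initial priors: uniform on the box [1, 2] x [1, 2] (an arbitrary choice).
    priors = rng.uniform(1.0, 2.0, size=(n_agents, 2))
    for _ in range(n_games):
        # One game per epoch between two distinct, randomly chosen agents.
        i, j = rng.choice(n_agents, size=2, replace=False)
        # Best response in Table 1: play the action the opponent is
        # believed to play less often; ties go to L (assumed rule).
        a_i = 0 if priors[i, 1] >= priors[i, 0] else 1
        a_j = 0 if priors[j, 1] >= priors[j, 0] else 1
        # Only the two players update, each using the other's action.
        priors[i] *= (1.0 - mu)
        priors[i, a_j] += h
        priors[j] *= (1.0 - mu)
        priors[j, a_i] += h
    return priors

final_priors = simulate_fp()
mean_empirical = (final_priors / final_priors.sum(axis=1, keepdims=True)).mean(axis=0)
```

With `mu=0.0` every game adds exactly `h` to the total count of each of the two players, so the cloud of priors drifts away from the origin. With `mu > 0` the total count of an agent is pulled toward `h / mu`, so the priors stay in a bounded region.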

2.1 Priors state space

Our main interest is in the probability density function (PDF) of agents in the priors space, whose coordinates are the numbers of times (actual counts, not proportions) each action was played by an agent’s opponents. In this space the straight lines through the origin are the sets on which the opponent’s empirical probabilities are constant; that is, the vector of empirical probabilities is the priors vector divided by the sum of its components. For any subset of the priors space, the integral of the PDF over the subset is the proportion of agents whose priors lie in that subset at a given time. We also use the best response (vector) function. We consider the generic case in which the best response takes values among the unit basis vectors away from a set of measure zero. In large populations the PDF is well approximated by a bounded function (not a distribution), and the value of the population mean best response (2), the integral of the best response function against the PDF, does not depend on the values of the best response on the exceptional set.

Our main interest is in describing how the PDF changes in the process of learning and in analyzing its asymptotic behavior. Here, the main numerical characteristics are the vector of mean empirical probabilities and the mean best response vector.

Equation (3) is found (see Appendix) to be the leading-order approximation of the stochastic learning process when the number of players is large and the learning increment is small. It transports the PDF with the velocity given by formula (4).

We will refer to equations (2), (3), and (4) as the PDE model of fictitious play learning. To complete the description of the model we prescribe zero boundary conditions for the PDF. Because the domain is invariant under the flow of (4), the problem is correctly posed (not over-determined). Notice also that the problem is non-linear: the “conservation of mass” equation (3) carries the density with a velocity which, in turn, depends on the density.

2.2 Numerical test

In this section we compare the solution of the PDE model for a 2x2 game with a mixed Nash equilibrium, given in Table 1, with the direct simulation of the learning process.

We take the initial data to be the density of the uniform distribution of initial priors in a box. The number of agents is N = 1000 and the learning increment is fixed. The memory factor is set to zero. The game has a single mixed Nash equilibrium, in which both actions are played with probability 1/2. We consider the state of the learning at a time when the learning has settled near the equilibrium. By formula (8) this corresponds to 40000 iterations of the game.

Recall that at all times the solution of the PDE is a uniform distribution on a box of fixed side length, and only the coordinates of the center of the box need to be computed. We use the explicit Euler method for the ODE governing the center and find its position at the final time; see Figure 1. We compare it with the data points, which are the priors of 1000 agents after 40000 random plays. At the initial time the priors of the agents were drawn from a uniform distribution on the initial box. The figure shows good agreement of the PDE model with the actual learning process.

Figure 1: Distribution of priors. The uniform distribution on the unit box (shown in red) is predicted by the PDE model. The center of the box, point (12,001,12,001), corresponds to the Nash equilibrium in the game. Data points are priors from learning in the population of N=1000 agents in a single run of the model for 40000 iterations. At the initial time the distribution of priors is uniform on a box (not shown).
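The Euler computation for the center of the box can be sketched as follows. It is a sketch under stated assumptions rather than the paper's script: it takes the velocity of the box to be the population mean best response, with time measured so that no additional constant appears, and it uses the payoffs of Table 1. The function names, the initial center, the side length and the step size are all hypothetical choices.

```python
import numpy as np

def fraction_playing_L(center, side):
    """Share of a uniform box of priors lying above the diagonal x_R > x_L.

    Agents in that region best-respond with L in the game of Table 1.
    For a uniform box the difference x_R - x_L has a triangular density
    on [d - side, d + side], where d = center[1] - center[0].
    """
    d = center[1] - center[0]
    if d >= side:
        return 1.0
    if d <= -side:
        return 0.0
    if d >= 0.0:
        return 1.0 - (side - d) ** 2 / (2.0 * side ** 2)
    return (side + d) ** 2 / (2.0 * side ** 2)

def euler_center(center0, side, t_end, dt):
    """Explicit Euler for d(center)/dt = (share playing L, share playing R)."""
    center = np.array(center0, dtype=float)
    for _ in range(int(round(t_end / dt))):
        p_left = fraction_playing_L(center, side)
        center += dt * np.array([p_left, 1.0 - p_left])
    return center

center = euler_center(center0=(1.0, 2.0), side=1.0, t_end=10.0, dt=0.01)
```

In this sketch the two coordinates of the center approach each other while their sum grows linearly in time, so the box settles on the diagonal, which corresponds to the mixed equilibrium (1/2, 1/2).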

2.3 Relation to the Best Response Dynamics (BRD) equation

Using equation (3), one can compute equation (5) for the vector of mean empirical frequencies, using the fact that the components of the mean best response sum to one. Note that the memory factor does not explicitly enter this equation. It does, however, contribute to the dynamics of the distribution.
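A short calculation shows why the memory factor drops out. It is a sketch under assumed notation, since the formulas of the model are not reproduced here: $x$ is the priors vector, $|x| = \sum_k x_k$, $\overline{BR}(t)$ is the mean best response, $\mu$ is the memory factor, and the velocity is assumed to have the form $u = \overline{BR}(t) - \mu x$, with time scaled so that no further constants appear.

```latex
\begin{align*}
\frac{d}{dt}\,\frac{x_k}{|x|}
  &= \frac{u_k\,|x| - x_k \sum_j u_j}{|x|^2},
  \qquad \sum_j u_j = 1 - \mu |x|
  \quad \text{since } \sum_j \overline{BR}_j = 1, \\
  &= \frac{(\overline{BR}_k - \mu x_k)\,|x| - x_k\,(1 - \mu |x|)}{|x|^2}
   = \frac{1}{|x|}\left(\overline{BR}_k - \frac{x_k}{|x|}\right).
\end{align*}
```

The two terms containing $\mu$ cancel. Averaging the last expression over the population therefore gives an equation for the mean empirical probabilities in which $\mu$ appears only through the distribution of priors itself.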

If one postulates that all agents have the same, or approximately the same, priors (hypothesis (6)), then the PDF is concentrated near a single point and the above equation reduces to a variant of the best response dynamics equation, equation (7). In this equation the best response term is “a regularization” of the best response function over the support of the PDF. If the PDF converges to a delta mass, the regularized term converges to a value of the best response function. Notice also the positive factor on the right-hand side of the equation. For learning processes in which priors become large, the learning rate slows down. This learning factor corresponds to the factor in the BRD equation in Fudenberg & Levine (1998).

Hypothesis (6) can be replaced with a weaker one, by requiring that the empirical probabilities in the support of the PDF are nearly constant. This hypothesis, or (6), is consistent with the dynamics of (3) only in two cases: when the velocity has a single, asymptotically stable fixed point, or when the support of the PDF is bounded but is carried by the velocity to large values of the priors, so that the empirical probabilities inside the support are nearly constant. The former condition holds for the model with a positive memory factor and the latter for the model with zero memory factor.

2.4 Fictitious play with memory factor

In this and the following sections we assume that the problem (2)–(4) has a unique, regular solution for generic initial data, a fact that can be proved by standard PDE techniques. To start the qualitative analysis of the model with a positive memory factor, notice that the velocity is a simple linear function of the priors. A flow of this type compresses the support of the PDF (the set where it is positive) toward a point, the position of which, in general, changes. Notice also that each component of the velocity becomes negative when the corresponding prior is sufficiently large. This means that learning takes place in a bounded region of the priors space. After some transient time, all the mass of the PDF is concentrated near a point and the dynamics is approximated by equation (7), where the learning factor is larger than some fixed positive number. The asymptotic behavior is determined by the BRD equation (7). In the presence of a dominant strategy, the solution of the PDE model converges to a delta mass concentrated at a boundary point, located on the boundary set corresponding to the dominant strategy.
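The compression of the support can be made explicit with a two-line computation. It assumes, as a hypothetical form consistent with the description above, a velocity $u(x,t) = b(t) - \mu x$ with memory factor $\mu > 0$ and a vector $b(t)$ that is the same for all agents; $x(t)$ and $y(t)$ denote the priors of two agents carried by this velocity.

```latex
\begin{align*}
\frac{d}{dt}\bigl(x(t) - y(t)\bigr) &= -\mu\,\bigl(x(t) - y(t)\bigr), \\
x(t) - y(t) &= e^{-\mu t}\,\bigl(x(0) - y(0)\bigr).
\end{align*}
```

Under this assumption the diameter of the support shrinks like $e^{-\mu t}$, whatever the motion of the point toward which it contracts.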

For symmetric games with a payoff matrix such that the associated quadratic form is strictly concave on the strategy simplex, Hofbauer (2000) showed that solutions of the BRD (or smoothed BRD) equation converge to a unique Nash equilibrium. With marginal modifications one can show that solutions of (7) converge to a unique Nash equilibrium as well. Combined with the fact that the support of the PDF converges to a point, we conclude that in the large-time limit all agents have the same vector of equilibrium empirical probabilities.

We stress again that the above arguments are based on the leading-order approximation of the stochastic learning process. The next order is the drift-diffusion equation (14) from the Appendix. In that model, the outcome of the learning is a stationary distribution with small deviations around the Nash equilibrium.

2.5 Fictitious play with zero memory factor

In this case the velocity is uni-directional. The initial profile of the probability distribution is simply carried by the velocity and its shape does not change. All components of the velocity are non-negative and sum to one, as can be seen from formula (2). With such a velocity the PDF is moved away from the origin at non-vanishing speed.

Suppose that initially the PDF is an arbitrary distribution on a box of given side length. After some time the PDF is given by the same distribution on a box that has moved away from the origin by a distance proportional to the elapsed time. For large times the vector of empirical frequencies is approximately constant in the box, with deviations that shrink as the box moves away, effectively rendering it a single point. The dynamics can now be computed from equation (7), in which the learning factor is correspondingly small. Under the strict concavity condition on the expected payoff, the vector of average empirical frequencies converges to a unique Nash equilibrium.
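The claim that a fixed-size box far from the origin looks like a single point in terms of empirical frequencies can be checked with a small computation. The function `frequency_spread` and its parameters are hypothetical names introduced for this sketch, which treats the case of two actions.

```python
import numpy as np

def frequency_spread(distance, side=1.0):
    """Largest gap in the empirical probability of action 1 across a box.

    The box has the given side and its lower-left corner at
    (distance, distance); counts (x1, x2) map to x1 / (x1 + x2).
    That ratio grows with x1 and falls with x2, so its extremes
    are attained at two opposite corners of the box.
    """
    lo, hi = distance, distance + side
    q_max = hi / (hi + lo)   # corner (hi, lo)
    q_min = lo / (lo + hi)   # corner (lo, hi)
    return q_max - q_min

spreads = np.array([frequency_spread(t) for t in (1.0, 10.0, 100.0, 1000.0)])
```

The spread equals `side / (2 * distance + side)`, so it decays like the inverse of the distance travelled by the box.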

2.6 Equation for the mean best response: a 2x2 game

Equation (7) and the BRD equation of Gilboa & Matsui (1991) are equations for the population-averaged empirical probabilities. Another important characteristic of a learning process is the average strategy played at a given time. In this section we show on a simple example that the PDE model (3) can be used to derive an equation for this quantity. Consider the game in Table 1 and the FP learning model with zero memory factor. We take the initial distribution of priors to be uniform on some box. As mentioned earlier, the leading-order learning dynamics transports the initial distribution with a uni-directional velocity. If at some time the support of the PDF is completely contained in one of the two wedges separated by the diagonal, the velocity is constant and equal to one of the two basis vectors. This type of velocity moves the PDF toward the diagonal. When the support of the PDF intersects that line, we consider the length of the segment of intersection, and by multiplying (3) by the best response function and integrating over the whole domain we obtain a first-order system of ODEs for the mean best response.

Under an assumption on the solution after some transient time, and by changing to a new time variable, the system is reduced to the constant-coefficient case. Its eigenvalues and eigenvectors can be computed explicitly, and they imply that, asymptotically, the mean population best reply strategy approaches the mixed Nash equilibrium. With this information at hand, we can use equation (5) and, by estimating the learning factor, obtain an equation for the mean empirical probabilities. By solving it, we conclude that the mean empirical probabilities converge to the Nash equilibrium as well.

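The arguments leading to the convergence of the mean empirical probabilities and of the mean best response to the Nash equilibrium can be repeated for a generic symmetric game with two actions. The payoffs are assumed to satisfy a condition under which the game has a unique, mixed Nash equilibrium. The following theorem holds.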

Theorem 1.

Suppose that the initial distribution of agents in the priors space is described by a smooth function with compact support in the open quadrant, and consider the unique solution of the problem (2)–(4). Then the mean empirical probabilities and the mean best response vector converge, as time tends to infinity, to the unique, mixed Nash equilibrium.

3 Appendix: Fokker-Planck equation

Consider a group of individuals acting according to the FP learning rule described in section 2 in a symmetric game. For simplicity of presentation, we restrict ourselves to the case of two pure strategies. The model with an arbitrary number of strategies is written down at the end of the Appendix. For each agent we consider the vector of counts of her opponents’ plays of the two strategies up to the current epoch, and we collect the vectors of counts of all agents into one state vector. We consider the PDF of the distribution of the state vector and the best response of each agent. Suppose that two agents are selected for the interaction. Only one game is played during each time period.

We consider the learning rule (1) in which the priors are incremented by a fixed amount; that is, each increment is either zero or that amount. Since the best response function depends on the empirical probabilities rather than on the priors, the magnitude of the increment is irrelevant, apart from the fact that for smaller increments the initial data influence the dynamics for longer periods of time. The memory factor in (1) is fixed. The games are arranged to be played at time periods of the length determined by formula (8).

Given a time scale of the learning process (in arbitrary units), formula (8) determines the number of rounds of the game that need to be played to observe changes at this time scale. All games are played between two agents, so the average number of plays per agent during such an interval is twice the number of rounds divided by the number of agents.

Conditioned on the event that a given pair of agents is selected, the priors of these two agents for the next period are set according to the learning rule, symmetrically for the two agents. For all other agents the priors are unchanged. This definition makes the process a discrete-time Markov process. We proceed by writing down the integral form of the Chapman-Kolmogorov equations and approximate its solution by a solution of the Fokker-Planck equation (Kolmogorov’s forward equation), for small values of the learning increment and a large number of agents. This is a classical approach to stochastic processes, the details of which can be found in Feller’s (1957) monograph.

The change of the PDF over one time period is described by equation (9), which can be written in a slightly different way as equation (10).
Consider the PDF of the distribution of a single agent’s priors, obtained by integrating the full PDF over the coordinates of all other agents. In statistical physics this function is also called the one-particle distribution. In the formulas to follow we also need the two-particle distribution function, obtained by integrating the full PDF over the coordinates of all agents except two. This function is symmetric in its two arguments, and its marginals are the one-particle distribution. The moments of the one-particle and two-particle functions are computed from the moments of the full PDF. This follows from the definition of these functions.

Now we use (10) to obtain an integral equation for the change of the one-particle function. For that, select a test function of a single agent’s priors, sum over the agents, and take the average. The right-hand side of the resulting equation can be conveniently expressed in terms of the two-particle function. In processes with a large number of agents and random binary interactions, the two-particle distribution function can be factored into a product of two independent one-particle distributions. With this relation, (12) becomes a family of non-linear integral relations for the distribution at the next time step. Selecting the time step as in (8) and taking Taylor expansions up to the second order for the increment of the test function, we obtain a Fokker-Planck equation. Its first-order term contains the drift velocity, and its second-order coefficients are the elements of a symmetric, positive definite diffusion matrix.

Consider now learning from playing a symmetric game with an arbitrary number of pure strategies. The Fokker-Planck equation approximating the stochastic learning has the same form, with the drift velocity and the symmetric, positive definite diffusion matrix defined analogously on the priors vector.


References

  • (1) Benaïm, M., Hofbauer, J. & Hopkins, E. (2009). Learning in games with unstable equilibria. Journal of Economic Theory 144, 1694–1709.
  • (2) Erev, I. & Roth, A. E. (1998). Predicting how people play games: reinforcement learning in experimental games with unique, mixed strategy equilibrium. American Econ. Review 88, 848–881.
  • (3) Feller, W. (1957). An Introduction to Probability Theory and Its Applications. Vol II. John Wiley & Sons, New York, NY.
  • (4) Krishna, V. & Sjostrom, T. (1995). On the convergence of fictitious play. Mimeo. Harvard University.
  • (5) Fudenberg, D. & Kreps, D. (1993). Learning mixed equilibria. Games and Economic Behavior 5, 320–367.
  • (6) Fudenberg, D. & Levine, D. (1998). The theory of learning in games. MIT Press, Cambridge, MA, and London, England.
  • (7) Harley, C. B. (1981). Learning the Evolutionary Stable Strategy. J. theor. Biol. 89, 611–633.
  • (8) Harsanyi, J. (1973). Games with randomly perturbed payoffs. International J. Game Theory 2, 1–23.
  • (9) Gaunersdorfer, A. & Hofbauer, J. (1995). Fictitious play, Shapley polygons and the replicator equation. Games and Economic Behavior 11, 279–303.
  • (10) Hofbauer, J. (2000). From Nash and Brown to Maynard Smith: equilibria, dynamics and ESS. Selection 1, 81–88.
  • (11) Hofbauer, J. & Sigmund, K. (1998). Evolutionary games and population dynamics. Cambridge University Press.
  • (12) Gilboa, I. & Matsui, A. (1991). Social Stability and Equilibrium. Econometrica 59, 3, 859–867.
  • (13) Kaniovski, Y. & Young, P. (1995). Learning dynamics in games with stochastic perturbations. Games and Economic Behavior 11, 330–363.
  • (14) Maynard Smith, J. (1982). Evolution and the Theory of Games. Cambridge, Cambridge University Press.
  • (15) Miyasawa, K. (1961). On the convergence of learning process in a 2x2 non-zero-sum two-person game. Research Memorandum 33. Princeton University.
  • (16) Monderer, D., Samet, D. & Sela, A. (1997). Belief affirming in learning processes. J. Economic Theory 73, 438–452.
  • (17) Nachbar, J. (1990). “Evolutionary” selection dynamics in games. Convergence and limit properties. International J. Game Theory 19, 59–89.
  • (18) Robinson, J. (1951). An iterative method for solving a game. Annals of Mathematics 54, 296–301.
  • (19) Shapley, L. (1964). Some topics in two-person games. Adv. Game Theory, ed. by M. Dresher, L. S. Shapley, and A. W. Tucker. Princeton, Princeton University Press.
  • (20) Roth, A. E. & Erev, I. (1995). Learning in extensive-form games: experimental data and simple dynamics models in the intermediate term. Games and Economic Behavior 8, 164–212.