Stochastic learning dynamics, like log-linear learning, address the issue of equilibrium selection for a class of games that includes potential games (see, e.g., , ,  and ). Because of the equilibrium selection property, these learning dynamics have received significant attention, particularly in the context of opinion dynamics in coordination games (see, e.g.,  and ) and game theoretic approaches to the distributed control of multiagent systems .
A well-known problem of stochastic learning dynamics is the slow mixing of their induced Markov chains , , . The mixing time of a Markov chain is the time required by the chain to converge to its stationary behavior. The mixing time is crucial because the definition of a stochastically stable state depends on the stationary distribution of the Markov chain induced by a learning dynamics. Slow mixing implies that the behavior of these dynamics in the short and medium run is equally important, particularly for engineered systems with a limited lifetime. However, stochastic stability only deals with the steady-state behavior and provides no information about the transient behavior of these dynamics.
The speed of convergence of stochastic learning dynamics is an active area of research that is receiving significant attention , , , , , and . However, there is another aspect related to the slow convergence of these dynamics that has received relatively little attention. Stochastic stability only explains the steady-state behavior of a system under a learning rule. We establish that there are learning dynamics with considerably different update rules that lead to the same steady-state behavior. Since these learning dynamics have the same stochastically stable states, stochastic stability cannot distinguish between them. The different update rules may result in significantly different behaviors over the short and medium run that may be desirable or undesirable but remain entirely unnoticed.
We first establish the implications of having different learning rules with the same steady state. Through an example setup, we demonstrate the differences in short and medium run behaviors for two particular learning dynamics with different update rules that lead to the same stochastically stable states. The example setup is that of a sensor coverage game, in which we formulate a sensor coverage problem in the framework of a potential game. An important conclusion that we draw from this comparison is that for stochastic learning dynamics, characterization of stochastically stable states is not sufficient. It is also essential to analyze the paths that lead to these stochastically stable states from any given initial condition. Analysis of these paths is critical because there are specific properties of these paths that play a crucial role not only in the short and medium run but also in the long-run steady-state behavior of the system.
The transient behavior of stochastic dynamics was studied in the context of learning in games in  and . However, the issues related to various learning dynamics leading to the same steady state behavior were not highlighted in these works. Therefore, after motivating the problem, we propose a novel framework for performing a comparative analysis of different stochastic learning dynamics with the same steady state. The proposed framework is based on the theory of Markov chains with rare transitions , , , , , , and .
In the proposed framework, we present multiple criteria for comparing the short, medium, and long-run behaviors of a system under different learning dynamics. We refer to the analysis related to the short-run behavior as first order analysis. The first order analysis deals with the expected hitting time of the set of pure Nash equilibria, which is the expected time to reach a Nash equilibrium (NE) for the first time. Because of the known hardness results for computing a NE , , and the fact that not all Nash equilibria of a potential game are necessarily potential maximizers, first order analysis is typically not considered for stochastic learning dynamics. However, we are interested in the comparative analysis of learning dynamics, for which we show that first order analysis provides valuable insights into the behavior of a system.
We refer to the analysis related to the medium and long-run behavior of stochastic learning dynamics as higher-order analysis. The higher-order analysis is based on the fact that the Markov chains induced by stochastic learning dynamics explore the space of joint action profiles hierarchically. This hierarchical exploration of the state space is well explained by an iterative decomposition of the state space into cycles of different orders as shown in , , , and . Thus, the evolution of Markov chains with rare transitions can be well approximated by transitions among cycles of proper order.
Therefore, we develop our higher order analysis on the cycle decomposition of the state space as presented in . We compare the behavior of different learning rules by comparing the exit height $H_e$ and the mixing height $H_m$ of the cycles generated by the cycle decomposition algorithm applied to these learning rules. The significance of these parameters is that once a Markov chain enters a cycle, the time to exit the cycle is of the order of $e^{H_e/\tau}$ and the time to visit each state within the cycle before exiting is of the order of $e^{H_m/\tau}$, where $\tau$ is the noise parameter. Thus, we can efficiently characterize the behavior of each cycle from $H_e$ and $H_m$.
Our comparative analysis framework applies to the class of learning dynamics in which the induced Markov chains satisfy certain regularity conditions. However, we present the details of the framework in the context of two particular learning dynamics, Log-Linear Learning (LLL) and Metropolis Learning (ML) over potential games. Log-linear learning is a noisy best response dynamics in which the probability of a noisy action from a player is inversely related to the cost of deviating from the best response. This learning rule is well known in game theory, and the stationary distribution of the induced Markov chain is a Gibbs distribution, which depends on a potential function. The Gibbs distribution over the space of joint action profiles assigns the maximum probability to the action profiles that maximize the potential function. Moreover, it was shown in  that LLL is a good behavioral model for decision making when the players have sufficient information to compute their utilities for all the actions in their action set given the actions of other players in the game.
On the other hand, Metropolis learning is a noisy better response dynamics for which the induced Markov chain is a Metropolis chain. It is well established in the statistical mechanics literature that the unique stationary distribution of a Metropolis chain is the Gibbs distribution (see, e.g.,  and ). As a behavioral model for decision making, ML is closely related to the pairwise comparison dynamics presented in . Thus, ML is a behavioral model for decision making with low information demand. A player only needs to compare its current payoff with the payoff of a randomly selected action. It does not need to know the payoffs for all the actions as in LLL. The only assumption is that each player has the ability or the resources to compute the payoff for one randomly selected action.
Hence, we have two learning dynamics, LLL and ML, which correspond to two behavioral models for decision making with very different information requirements. However, both of the learning rules lead to the same steady-state behavior. We compare these learning dynamics based on the proposed framework. The crux of our comparative analysis is that the availability of more information in the case of LLL as compared to ML does not guarantee better performance when the performance criterion is to reach the potential maximizer quickly.
A summary of our main contributions in this work is as follows.
For problem motivation, we present our setup of a sensor coverage game, in which we formulate the sensor coverage problem with random sensor deployment as a potential game.
For the first order analysis, we derive and compare upper bounds on the expected hitting time to the set of NE for both LLL and ML.
We also obtain a sufficient condition to guarantee a smaller expected hitting time to the set of Nash equilibria under LLL than ML from any initial condition.
For higher order analysis, we identify the cycle decomposition algorithm as a useful tool for the comparative analysis of stochastic learning dynamics. Moreover, we show through an example of a simple Markov chain that the cycle decomposition algorithm is also suitable for describing system behavior at different levels of abstraction.
We compare the exit heights and mixing heights of cycles under LLL and ML. We show that if a subset of the state space is a cycle under both LLL and ML, then the mixing and exit heights of that cycle will always be smaller for ML than for LLL.
We denote the cardinality of a set $A$ by $|A|$. For a vector $x$, $x_i$ denotes its $i$th entry and $\|x\|$ is its Euclidean norm. The Hamming distance between any two vectors $x$ and $y$ in $\mathbb{R}^n$ is
$$d_H(x, y) = |\{i : x_i \neq y_i\}|.$$
$\Delta(n)$ denotes the $n$-dimensional probability simplex, i.e.,
$$\Delta(n) = \{p \in \mathbb{R}^n : p_i \geq 0, \ \mathbf{1}^\top p = 1\},$$
where $\mathbf{1}$ is a column vector in $\mathbb{R}^n$ with all the entries equal to 1.
II-B Markov Chains
A discrete-time Markov chain on a finite state space $\mathcal{X}$ is a random process that consists of a sequence of random variables $\{X_t\}_{t \geq 0}$ such that for all $t \geq 0$ and $x_0, \ldots, x_{t+1} \in \mathcal{X}$,
$$\Pr(X_{t+1} = x_{t+1} \mid X_t = x_t, \ldots, X_0 = x_0) = \Pr(X_{t+1} = x_{t+1} \mid X_t = x_t),$$
where $X_t \in \mathcal{X}$ for all $t$. Let $P$ be the transition matrix for the Markov chain and $P(x, y)$ be the transition probability from state $x$ to state $y$. A distribution $\mu \in \Delta(|\mathcal{X}|)$ is a stationary distribution with respect to $P$ if $\mu = \mu P$.
If a Markov chain is ergodic and reversible, then it has a unique stationary distribution $\mu$, which satisfies the detailed balance condition
$$\mu(x) P(x, y) = \mu(y) P(y, x)$$
for all $x, y \in \mathcal{X}$. The following definitions are adapted from .
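The stationarity and detailed balance conditions above can be checked numerically. The following sketch builds a small Metropolis-type chain from an illustrative potential (the values of `phi` and `tau` are assumptions for the example, not quantities from the paper) and verifies both conditions against the Gibbs distribution.

```python
import numpy as np

# A small 3-state Metropolis-type chain built from an illustrative potential.
phi = np.array([1.0, 3.0, 2.0])
tau = 0.5
n = len(phi)

P = np.zeros((n, n))
for x in range(n):
    for y in range(n):
        if x != y:
            # propose uniformly among the other states, accept with
            # probability min(1, e^{(Phi(y) - Phi(x)) / tau})
            P[x, y] = (1.0 / (n - 1)) * min(1.0, np.exp((phi[y] - phi[x]) / tau))
    P[x, x] = 1.0 - P[x].sum()  # remaining mass stays put

# Gibbs distribution: mu(x) proportional to e^{Phi(x)/tau}
mu = np.exp(phi / tau)
mu /= mu.sum()

# Stationarity: mu P = mu
assert np.allclose(mu @ P, mu)
# Detailed balance: mu(x) P(x,y) = mu(y) P(y,x)
for x in range(n):
    for y in range(n):
        assert np.isclose(mu[x] * P[x, y], mu[y] * P[y, x])
```

Detailed balance holds here because $\mu(x) P(x, y) = \frac{1}{n-1}\min\{e^{\Phi(x)/\tau}, e^{\Phi(y)/\tau}\}$ is symmetric in $x$ and $y$, and stationarity follows from it.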
Let $P_0$ be a transition matrix for a Markov chain over state space $\mathcal{X}$. Let $\{P_\epsilon\}$ be a family of perturbed Markov chains on $\mathcal{X}$ for sufficiently small $\epsilon > 0$ corresponding to $P_0$. We say that $P_\epsilon$ is a regular perturbation of $P_0$ if
$P_\epsilon$ is ergodic for sufficiently small $\epsilon > 0$,
$\lim_{\epsilon \to 0} P_\epsilon(x, y) = P_0(x, y)$ for all $x, y \in \mathcal{X}$, and
$P_\epsilon(x, y) > 0$ for some $\epsilon > 0$ implies that there exists some function $r(x, y) \geq 0$ such that
$$0 < \lim_{\epsilon \to 0} \epsilon^{-r(x, y)} P_\epsilon(x, y) < \infty,$$
where $r(x, y)$ is the cost of the transition from $x$ to $y$ and is normally referred to as its resistance.
The Markov process corresponding to $P_\epsilon$ is called a regularly perturbed Markov process.
Let $P_\epsilon$ be a regular perturbation of $P_0$ with stationary distribution $\mu_\epsilon$. A state $x \in \mathcal{X}$ is a stochastically stable state if
$$\lim_{\epsilon \to 0} \mu_\epsilon(x) > 0.$$
Thus, any state that is not stochastically stable will have a vanishingly small probability of occurrence in the steady state as $\epsilon \to 0$.
Given any two states $x$ and $y$ in $\mathcal{X}$, $y$ is reachable from $x$ if $P^t_\epsilon(x, y) > 0$ for some $t \geq 0$. The neighborhood of $x$ is
$$\mathcal{N}(x) = \{y \in \mathcal{X} : P_\epsilon(x, y) > 0\}.$$
A path $\rho$ between any two states $x$ and $y$ in $\mathcal{X}$ is a sequence of distinct states $\rho = (x_0, x_1, \ldots, x_m)$ such that $x_0 = x$, $x_m = y$, and $x_{k+1} \in \mathcal{N}(x_k)$ for all $0 \leq k < m$. The length of the path is denoted as $|\rho|$. Given a set $B \subseteq \mathcal{X}$ and a path $\rho$, we say that $\rho \subseteq B$ if $x_k \in B$ for all $k$. We define $\mathcal{P}_B(x, y)$ as the set of all paths between states $x$ and $y$ in $B$; the subscript will be dropped when the state space is clear from the context. States $x$ and $y$ communicate with each other ($x \sim y$) if the sets $\mathcal{P}_{\mathcal{X}}(x, y)$ and $\mathcal{P}_{\mathcal{X}}(y, x)$ are not empty.
A set $B \subseteq \mathcal{X}$ is connected if for every $x, y \in B$, $\mathcal{P}_B(x, y) \neq \emptyset$.
The hitting time of a state $x \in \mathcal{X}$ is the first time it is visited, i.e.,
$$\tau_x = \min\{t \geq 0 : X_t = x\}.$$
The hitting time of a set $B \subseteq \mathcal{X}$ is the first time one of the states of $B$ is visited, i.e.,
$$\tau_B = \min\{t \geq 0 : X_t \in B\}.$$
The exit time of a Markov chain from $B$ is $\tau_{\partial B}$, where
$$\partial B = \{y \in \mathcal{X} \setminus B : P_\epsilon(x, y) > 0 \text{ for some } x \in B\}.$$
We will refer to $\partial B$ as the boundary of $B$. In the above definition, it is assumed that $X_0 \in B$.
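As an illustration of these definitions, the sketch below estimates hitting and exit times by Monte Carlo simulation. The chain (a lazy random walk on $\{0, \ldots, 4\}$) and the target sets are illustrative assumptions, not the learning dynamics studied in this work.

```python
import random

def step(x, rng):
    """Lazy random walk on {0, ..., 4}: move left/right w.p. 1/4 each, else stay."""
    r = rng.random()
    if r < 0.25 and x > 0:
        return x - 1
    if r < 0.5 and x < 4:
        return x + 1
    return x

def hitting_time(x0, B, rng):
    """tau_B = min{t >= 0 : X_t in B}."""
    x, t = x0, 0
    while x not in B:
        x, t = step(x, rng), t + 1
    return t

def exit_time(x0, B, rng):
    """First time the chain leaves B (hits the boundary of B); assumes X_0 in B."""
    x, t = x0, 0
    while x in B:
        x, t = step(x, rng), t + 1
    return t

rng = random.Random(0)
avg_hit = sum(hitting_time(0, {4}, rng) for _ in range(1000)) / 1000
avg_exit = sum(exit_time(2, {1, 2, 3}, rng) for _ in range(1000)) / 1000
print(avg_hit, avg_exit)
```

Note that the hitting time is zero whenever the chain starts inside the target set, while the exit time is always at least one when it starts inside $B$.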
II-C Game Theory
Let $N = \{1, 2, \ldots, n\}$ be a set of strategic players in which each player $i \in N$ has a finite set of strategies $\mathcal{A}_i$. The utility of each player is represented by a utility function $u_i : \mathcal{A} \to \mathbb{R}$, where $\mathcal{A} = \mathcal{A}_1 \times \cdots \times \mathcal{A}_n$ is the set of joint action profiles. The combination of the action $a_i$ of the $i$th player and the actions of everyone else is represented by $a = (a_i, a_{-i})$. The joint action profiles of all the players except $i$ are represented by the set $\mathcal{A}_{-i} = \prod_{j \neq i} \mathcal{A}_j$.
Player $i$ prefers action profile $(a_i, a_{-i})$ over $(a'_i, a_{-i})$, where $a_i, a'_i \in \mathcal{A}_i$ and $a_{-i} \in \mathcal{A}_{-i}$, if and only if $u_i(a_i, a_{-i}) > u_i(a'_i, a_{-i})$. If $u_i(a_i, a_{-i}) = u_i(a'_i, a_{-i})$, then it is indifferent between the two actions. An action profile $a^*$ is a Nash Equilibrium (NE) if
$$u_i(a^*_i, a^*_{-i}) \geq u_i(a_i, a^*_{-i})$$
for all $a_i \in \mathcal{A}_i$ and all $i \in N$. The best response set of player $i$ given an action profile $a = (a_i, a_{-i})$ is
$$\mathcal{B}_i(a_{-i}) = \{a^*_i \in \mathcal{A}_i : u_i(a^*_i, a_{-i}) \geq u_i(a'_i, a_{-i}) \text{ for all } a'_i \in \mathcal{A}_i\}.$$
The set of all possible best responses from an action profile $a$ is
$$\mathcal{B}(a) = \bigcup_{i \in N} \{(a^*_i, a_{-i}) : a^*_i \in \mathcal{B}_i(a_{-i})\}.$$
The neighborhood of an action profile $a$ is the set of profiles that differ from $a$ in at most one player's action, i.e.,
$$\mathcal{N}(a) = \{a' \in \mathcal{A} : d_H(a, a') \leq 1\}.$$
The agent-specific neighborhood set of action profile $a$ is
$$\mathcal{N}_i(a) = \{a' \in \mathcal{A} : a'_{-i} = a_{-i}\}.$$
Potential Game: A game is a potential game if there exists a real-valued function $\Phi : \mathcal{A} \to \mathbb{R}$ such that
$$u_i(a_i, a_{-i}) - u_i(a'_i, a_{-i}) = \Phi(a_i, a_{-i}) - \Phi(a'_i, a_{-i})$$
for all $i \in N$ and for all $a_i, a'_i \in \mathcal{A}_i$, $a_{-i} \in \mathcal{A}_{-i}$. The function $\Phi$ is called a potential function.
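Because both the action sets and the player set are finite, the potential condition can be checked exhaustively. The sketch below does this for an illustrative 2-player identical-interest game (the payoffs and the candidate potential are assumptions for the example).

```python
from itertools import product

# A 2-player, 2-action identical-interest game (illustrative payoffs):
# both players receive 1 for matching on action 0 and 2 for matching on 1.
def u(i, a):
    return {(0, 0): 1, (1, 1): 2}.get(a, 0)

# Candidate potential: in an identical-interest game the common payoff
# itself serves as a potential function.
def phi(a):
    return {(0, 0): 1, (1, 1): 2}.get(a, 0)

def is_potential(u, phi, actions=(0, 1), n=2):
    """Exhaustively check u_i(a_i, a_-i) - u_i(a_i', a_-i)
    == Phi(a_i, a_-i) - Phi(a_i', a_-i) for every unilateral deviation."""
    for a in product(actions, repeat=n):
        for i in range(n):
            for ai2 in actions:
                b = tuple(ai2 if j == i else a[j] for j in range(n))
                if u(i, a) - u(i, b) != phi(a) - phi(b):
                    return False
    return True

print(is_potential(u, phi))  # → True
```

The same checker reports `False` for games that admit no such alignment with the candidate potential, e.g. a zero-sum matching game.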
III Stochastic Learning Dynamics
Stochastic learning dynamics is a class of learning dynamics in games in which the players typically play a best/better reply to the actions of other players. However, the players sporadically play noisy actions for exploration, which is why these dynamics have the equilibrium selection property for a class of games that includes potential games.
III-A Log-Linear Learning
Let $a = (a_i, a_{-i})$ be the joint action profile representing the current state of the game. Then, the steps involved in LLL are as follows.
Activate one of the players, say player $i$, uniformly at random.
All other players repeat their previous actions.
Player $i$ selects an action $a'_i \in \mathcal{A}_i$ with the following probability:
$$\Pr(a'_i) = \frac{1}{Z} e^{-\Delta_i(a'_i)/\tau}. \qquad (2)$$
Here $Z$ is a normalizing constant, $a^*_i$ is a best response of player $i$ to $a_{-i}$, and
$$\Delta_i(a'_i) = u_i(a^*_i, a_{-i}) - u_i(a'_i, a_{-i})$$
is the cost of deviating from the best response.
In (2), $\tau > 0$ is the noise parameter, normally referred to as the temperature. For $\tau \to \infty$, the players update their strategies uniformly at random. However, as $\tau \to 0$, the probability of the actions yielding higher utilities increases.
Thus, LLL induces a Markov chain over the set of joint action profiles $\mathcal{A}$ with transition matrix $P^{\mathrm{LLL}}_\epsilon$. The transition probability between any two distinct action profiles $a = (a_i, a_{-i})$ and $a' = (a'_i, a_{-i})$ is
$$P^{\mathrm{LLL}}_\epsilon(a, a') = \frac{1}{n} \cdot \frac{e^{u_i(a'_i, a_{-i})/\tau}}{\sum_{\bar{a}_i \in \mathcal{A}_i} e^{u_i(\bar{a}_i, a_{-i})/\tau}}.$$
It was shown in  that $P^{\mathrm{LLL}}_\epsilon$ is a regular perturbation of $P^{\mathrm{LLL}}_0$ with $\epsilon = e^{-1/\tau}$, which is why we use the notation $P_\epsilon$ instead of $P_\tau$. Here, $P^{\mathrm{LLL}}_0$ is the transition matrix of the Markov chain induced by sequential best response dynamics. It was also proved in  that in an $n$-player potential game with a potential function $\Phi$, if all the agents update their actions based on LLL, then the only stochastically stable states are the potential maximizers. The stationary distribution for $P^{\mathrm{LLL}}_\epsilon$ is the Gibbs distribution
$$\mu_\epsilon(a) = \frac{1}{Z} e^{\Phi(a)/\tau}, \qquad (3)$$
where $Z = \sum_{a' \in \mathcal{A}} e^{\Phi(a')/\tau}$ is the normalizing constant.
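A minimal sketch of one LLL update is given below, assuming utilities are provided as a function `u(i, a)`; the function names and the 2-player demo game are illustrative assumptions, not the paper's simulation setup.

```python
import math
import random

def lll_step(a, u, action_sets, tau, rng):
    """One update of log-linear learning (sketch): a uniformly chosen
    player samples its next action with probability proportional to
    e^{u_i / tau}; all other players repeat their previous actions."""
    n = len(a)
    i = rng.randrange(n)                       # activate a player uniformly
    candidates = list(action_sets[i])
    weights = []
    for ai in candidates:
        b = tuple(ai if j == i else a[j] for j in range(n))
        weights.append(math.exp(u(i, b) / tau))
    # sample from the log-linear (Boltzmann) distribution
    r = rng.random() * sum(weights)
    acc = 0.0
    for ai, w in zip(candidates, weights):
        acc += w
        if r <= acc:
            return tuple(ai if j == i else a[j] for j in range(n))
    return a  # numerical edge case

# Demo on a 2-player identical-interest game: payoff 1 at (0,0), 2 at (1,1).
u = lambda i, a: {(0, 0): 1.0, (1, 1): 2.0}.get(a, 0.0)
rng = random.Random(0)
a = (0, 0)
visits = {}
for _ in range(20000):
    a = lll_step(a, u, [(0, 1), (0, 1)], tau=0.5, rng=rng)
    visits[a] = visits.get(a, 0) + 1
```

With $\tau = 0.5$, the empirical visit frequencies approach the Gibbs distribution, which concentrates on the potential maximizer $(1, 1)$.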
III-B Metropolis Learning
We introduce another learning dynamics that has the same stationary distribution as in (3). We refer to it as Metropolis Learning (ML) because the Markov chain induced by ML is a Metropolis chain, which is well studied in statistical mechanics and in simulated annealing . The steps involved in ML are as follows.
Activate one of the players, say player $i$, uniformly at random.
All other players repeat their previous actions.
Player $i$ selects an action $a'_i \in \mathcal{A}_i$ uniformly at random.
Player $i$ switches its action from $a_i$ to $a'_i$ with probability
$$p = \min\left\{1, \ e^{(u_i(a'_i, a_{-i}) - u_i(a_i, a_{-i}))/\tau}\right\}.$$
Thus, given that player $i$ is active, the probability of transition from $a = (a_i, a_{-i})$ to $a' = (a'_i, a_{-i})$ is
$$\frac{1}{|\mathcal{A}_i|} \min\left\{1, \ e^{(u_i(a') - u_i(a))/\tau}\right\}$$
if $a' \in \mathcal{N}_i(a)$ with $a'_i \neq a_i$, and is equal to zero otherwise.
In ML, player $i$ switches to a randomly selected action $a'_i$ with probability one as long as $u_i(a'_i, a_{-i}) \geq u_i(a_i, a_{-i})$. Here, $a_i$ is the action that player $i$ was playing in the previous time slot. Thus, unlike LLL, in which a player needs to compute the utilities for all the actions in its action set given $a_{-i}$, the update in ML only requires a player to make a pairwise comparison between a randomly selected action and its previous action. Furthermore, the probability of a noisy action is a function of the loss in payoff as compared to the previous action.
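The ML update steps above can be sketched as follows; the function names are illustrative assumptions, and utilities are again given as a function `u(i, a)`.

```python
import math
import random

def ml_step(a, u, action_sets, tau, rng):
    """One update of Metropolis learning (sketch): a uniformly chosen
    player proposes one action uniformly at random, accepts it if it is
    no worse than its current action, and otherwise accepts it with
    probability e^{gain/tau} (where gain < 0 is the loss in payoff)."""
    n = len(a)
    i = rng.randrange(n)                        # activate a player uniformly
    ai_new = rng.choice(list(action_sets[i]))   # propose one action uniformly
    b = tuple(ai_new if j == i else a[j] for j in range(n))
    gain = u(i, b) - u(i, a)
    if gain >= 0 or rng.random() < math.exp(gain / tau):
        return b
    return a
```

Note that each update needs only two utility evaluations (current and proposed action), reflecting the low information demand of ML.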
Metropolis learning generates a Markov chain over the set of joint action profiles with transition matrix $P^{\mathrm{ML}}_\epsilon$. The transition probability between any two distinct action profiles $a = (a_i, a_{-i})$ and $a' = (a'_i, a_{-i})$ with $a'_i \neq a_i$ is
$$P^{\mathrm{ML}}_\epsilon(a, a') = \frac{1}{n |\mathcal{A}_i|} \min\left\{1, \ e^{(u_i(a') - u_i(a))/\tau}\right\}.$$
Next, we show that $P^{\mathrm{ML}}_\epsilon$ is a regularly perturbed Markov process.
Transition matrix $P^{\mathrm{ML}}_\epsilon$ is a regular perturbation of $P^{\mathrm{ML}}_0$, where $P^{\mathrm{ML}}_0$ is the transition matrix for asynchronous better reply dynamics. Moreover, the resistance of any feasible transition from $a$ to $a' = (a'_i, a_{-i})$ is
$$r(a, a') = \max\{0, \ u_i(a) - u_i(a')\}.$$
To prove that $P^{\mathrm{ML}}_\epsilon$ is a regular perturbation, we first describe the unperturbed process $P^{\mathrm{ML}}_0$, which is asynchronous better reply dynamics. The unperturbed process has the following dynamics.
A player, say $i$, is selected at random.
All the other players repeat their previous actions.
Player $i$ selects an action $a'_i \in \mathcal{A}_i$ uniformly at random.
Player $i$ switches its action from $a_i$ to $a'_i$ if
$$u_i(a'_i, a_{-i}) \geq u_i(a_i, a_{-i}).$$
Otherwise, it repeats $a_i$.
Similar to LLL, the noise parameter is $\epsilon = e^{-1/\tau}$. The Metropolis chain is ergodic for a given $\epsilon > 0$, and it satisfies $\lim_{\epsilon \to 0} P^{\mathrm{ML}}_\epsilon = P^{\mathrm{ML}}_0$. For the final condition,
$$\lim_{\epsilon \to 0} \epsilon^{-r(a, a')} P^{\mathrm{ML}}_\epsilon(a, a') = \frac{1}{n |\mathcal{A}_i|},$$
where $r(a, a') = \max\{0, \ u_i(a) - u_i(a')\}$ for any given pair $a$ and $a' = (a'_i, a_{-i})$. Thus, $P^{\mathrm{ML}}_\epsilon$ is a regular perturbation of $P^{\mathrm{ML}}_0$.
The important fact regarding ML in the context of this work is that its stationary distribution is also the Gibbs distribution in (3), i.e.,
$$\mu_\epsilon(a) = \frac{1}{Z} e^{\Phi(a)/\tau}.$$
Thus, from the perspective of stochastic stability, both LLL and ML are precisely the same. To observe and understand the effects of different update rules on system behavior, we simulated a sensor coverage game with both LLL and ML. Next, we present the setup and the results of the simulation.
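That LLL and ML share the same stationary distribution can also be verified directly on a small example by building both induced transition matrices and computing their stationary distributions by linear algebra. The toy identical-interest game below is an assumption for illustration, not the sensor coverage setup.

```python
import numpy as np
from itertools import product

tau = 0.5
actions = (0, 1)
profiles = list(product(actions, repeat=2))
idx = {a: k for k, a in enumerate(profiles)}
# Illustrative identical-interest potential game: u_i = Phi for both players.
phi = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 2.0}
u = lambda i, a: phi[a]

def repl(a, i, ai):
    b = list(a); b[i] = ai; return tuple(b)

P_lll = np.zeros((4, 4))
P_ml = np.zeros((4, 4))
for a in profiles:
    for i in (0, 1):                  # each player is activated w.p. 1/2
        # LLL: sample a_i with probability proportional to e^{u_i/tau}
        w = np.array([np.exp(u(i, repl(a, i, ai)) / tau) for ai in actions])
        w /= w.sum()
        for ai, wi in zip(actions, w):
            P_lll[idx[a], idx[repl(a, i, ai)]] += 0.5 * wi
        # ML: propose one action uniformly, accept with the Metropolis rule
        for ai in actions:
            b = repl(a, i, ai)
            acc = min(1.0, np.exp((u(i, b) - u(i, a)) / tau))
            P_ml[idx[a], idx[b]] += 0.25 * acc
            P_ml[idx[a], idx[a]] += 0.25 * (1.0 - acc)

def stationary(P):
    """Left eigenvector of P for eigenvalue 1, normalized to a distribution."""
    w, v = np.linalg.eig(P.T)
    mu = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    return mu / mu.sum()

gibbs = np.array([np.exp(phi[a] / tau) for a in profiles])
gibbs /= gibbs.sum()
assert np.allclose(stationary(P_lll), gibbs)
assert np.allclose(stationary(P_ml), gibbs)
```

Both chains are reversible with respect to the Gibbs distribution, so the two stationary distributions coincide even though the transition matrices themselves differ.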
IV Motivation: Sensor Coverage Problem
To study the difference in behaviors between LLL and ML, which is ignored under stochastic stability, we set up the sensor coverage problem as a potential game. Through extensive simulations under various noise conditions, we exhibit the essential differences between the behaviors of these learning dynamics in the short and medium runs. We note that this formulation of the sensor coverage problem with random sensor deployment in a potential game theoretic framework is itself a contribution and can be of independent interest in the context of local scheduling schemes for the sensor coverage problem.
IV-A Coverage Game Setup
Consider a scenario in which $n$ sensors are deployed randomly to monitor an environment for a long period of time. We approximate the environment with a square region defined over the intervals $[0, b] \times [0, b]$. To simplify the problem, the area is discretized as a two-dimensional grid represented by the Cartesian product $G = \{0, 1, \ldots, b\} \times \{0, 1, \ldots, b\}$. The location of each sensor $i$ is $z_i = (x_i, y_i)$, where $z_i$ is a random variable uniformly distributed over the region of interest. The footprint of sensor $i$ is a circular disk of radius $r_i$, i.e.,
$$F_i(r_i) = \{p \in G : \|p - z_i\| \leq r_i\}.$$
We assume that each sensor can choose the radius of its footprint from a finite set, which determines its energy consumption. Let $r_c$ be the communication range of each sensor. We assume that $r_c \geq 2 r_{\max}$, where $r_{\max}$ is the maximum sensing radius.
We propose a game-theoretic solution to the sensor coverage problem in which we formulate the problem as a strategic game and implement a local learning rule so that each sensor can learn its schedule based on local information only. The players in this game are the sensors, and the action of each player is its sensing radius, chosen from a finite set $\mathcal{A}_i$ of available radii. For each sensor, $0 \in \mathcal{A}_i$, which is the off state of the sensor. The joint action profile $a$ is the joint state of all the sensors.
Let $p$ be a point on the grid, where $p \in G$. The state of a grid point is whether it is covered or uncovered, i.e.,
$$C_p(a) = \begin{cases} 1 & \text{if } p \in F_i(a_i) \text{ for some } i, \\ 0 & \text{otherwise.} \end{cases}$$
Thus, the objective is to solve the following optimization problem:
$$\max_{a \in \mathcal{A}} \ \sum_{p \in G} C_p(a) - \sum_{i=1}^{n} c(a_i).$$
Here, $\sum_{p \in G} C_p(a)$ is the total coverage achieved by the sensor network and $\sum_{i=1}^{n} c(a_i)$ is the total cost incurred by the sensors that are on. We assume that $c(0) = 0$, i.e., no cost is incurred by the sensors that are off.
The local utility of each player is computed through the marginal contribution utility as explained in , with base action $a_i = 0$, i.e., the base action of each sensor is to be in the off state, in which there is no energy consumption. If $a_{-i}$ is the joint state of all the other sensors, then the utility of player $i$ for action $a_i$ is
$$u_i(a_i, a_{-i}) = \sum_{p \in G} C_p(a_i, a_{-i}) - \sum_{p \in G} C_p(0, a_{-i}) - c(a_i).$$
The above equation implies that $u_i(0, a_{-i}) = 0$. For any $a_i \neq 0$, the difference of the first two terms counts the grid points covered by sensor $i$ and by no other sensor. Thus, the marginal contribution utility of sensor $i$ with action $a_i$ is the number of grid points that are covered by the sensor exclusively with a footprint of radius $a_i$, minus the cost $c(a_i)$.
To make the payoff and the cost terms compatible, we express the cost of turning a sensor on as a function of the minimum number of grid points that a sensor should cover exclusively. Let $M(r)$ be the maximum number of grid points that a sensor can cover if its footprint has radius $r$. We define the cost as
$$c(r) = \alpha M(r), \quad \alpha \in (0, 1).$$
Thus, the net utility of a sensor is negative if the number of points it covers exclusively is less than $\alpha M(a_i)$ given its action $a_i$.
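The marginal contribution utility above can be sketched directly. In the code below, the grid, sensor positions, radii, and cost function are illustrative assumptions, not the values used in the paper's simulations.

```python
# Sketch of the marginal-contribution utility for the coverage game.

def covered(points, positions, radii):
    """Set of grid points covered by at least one active sensor (radius > 0)."""
    cov = set()
    for (cx, cy), r in zip(positions, radii):
        if r > 0:
            for (px, py) in points:
                if (px - cx) ** 2 + (py - cy) ** 2 <= r ** 2:
                    cov.add((px, py))
    return cov

def utility(i, a, points, positions, cost):
    """u_i = coverage with sensor i on, minus coverage with i off, minus cost."""
    on = len(covered(points, positions, a))
    a_off = list(a)
    a_off[i] = 0                      # base action: sensor i off
    off = len(covered(points, positions, a_off))
    return (on - off) - cost(a[i])

points = [(x, y) for x in range(5) for y in range(5)]
positions = [(1, 1), (3, 3)]          # illustrative deployment
cost = lambda r: 0 if r == 0 else r   # illustrative cost with c(0) = 0
print(utility(0, (1, 0), points, positions, cost))  # → 4
```

With radius 1 at (1, 1), sensor 0 exclusively covers 5 grid points, so its utility is $5 - c(1) = 4$; if a second sensor covered the same points, the marginal contribution (and hence the utility) would drop, matching the exclusivity interpretation above.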
IV-B Simulation Results
We simulated the sensor coverage game on the grid described above with a fixed number of randomly deployed sensors, a common finite set of sensing radii, and the cost function defined above. For this setup, the maximum global utility was 247, which was computed numerically based on extensive simulations. To achieve the maximum payoff, we implemented LLL and ML with different values of the noise parameter $\tau$ and the number of iterations. The results of the simulation are presented in Figs. 1 and 2.
Initially, all the sensors were in the off state. To compare the short-term behavior of the network with small noise, we fixed a small value of the noise parameter and ran the simulation twenty times for both LLL and ML, each for one hundred iterations. Since players were randomly selected to update their actions at each decision time, each simulation led to a different system configuration in one hundred iterations, even with the same initial condition. The results of the twenty simulations are presented in Fig. 1. In Fig. 1(a), we show the number of iterations to reach a NE for the first time under LLL and ML. Based on the results in Fig. 1(a), the average numbers of iterations to reach a NE for the first time under LLL and ML were 43.15 and 63.75, respectively. Thus, on average, the system reached a NE faster under LLL than under ML.
For a system with multiple Nash equilibria, reaching a NE faster is not the only objective. The quality of the NE is also a significant factor. In Fig. 1(b), we present the global payoff at the Nash equilibria reached under LLL and ML in our twenty simulations. The global payoffs at the Nash equilibria under LLL and ML had mean values of 229.6 and 230.1, and standard deviations of 8.39 and 12.49, respectively. Although the average global payoffs were almost equal, the higher standard deviation under ML implies that ML explored the state space more than LLL. As a result of this higher exploration tendency, the system achieved the global maximum of 247 three times under ML and only once under LLL.
Thus, based on the comparisons from Fig. 1, LLL seems to be better than ML because it can lead to a NE faster on average. However, ML seems to have a slight edge over LLL if we consider the quality of the Nash equilibria. This observation provides a strong rationale for comprehensive comparative analysis because we cannot simply declare one learning rule better than the other.
For higher order analysis, the objective was to observe and compare system behavior over an extended period. For comparison, we were interested in the following crucial aspects.
Time to reach a payoff maximizing NE under each learning dynamics.
The paths adopted to reach the payoff-maximizing NE and their characteristics.
System behavior after reaching a payoff maximizing NE.
For the smallest noise value, the optimal configuration could not be achieved under either LLL or ML, even over the full simulation horizon. For LLL, the network remained stuck at the same NE for the entire run. Under ML, there was a single switch in network configuration from one NE to another. As we increased the noise, payoff-maximizing configurations were reached under both LLL and ML. However, the number of iterations required to reach these optimal configurations was huge, particularly under LLL. Finally, for the largest noise value, the optimal configurations were reached rapidly.
The ability of ML to stay at an optimal configuration after reaching it is affected more by noise than that of LLL. In Fig. 2(a), with the smallest noise value, the network configuration switched from one NE to another under ML, but there was no switch under LLL. In Fig. 2(b), with intermediate noise, the network configuration switched to an optimal NE more quickly under ML than under LLL. Finally, the increase of noise led to an interesting behavior that can be observed in Fig. 2(c). Under ML, the network configuration kept leaving the payoff-maximizing configurations periodically for significant durations of time. However, under LLL, after reaching an optimal configuration, the network never left the configuration for long durations of time. Every time it left the optimal configuration because of noise, it immediately switched back. We can summarize the observations from the simulation setup as follows.
In the short run, LLL can drive the network configuration to a NE more quickly than ML.
In the short, medium, and long run, starting from the same initial condition, LLL and ML can drive the network configuration along entirely different paths that lead to the payoff-maximizing configurations in the long run.
The effect of noise on LLL and ML is significantly different.
From the above observations, we can conclude that the concept of stochastic stability alone is not sufficient to describe the behavior of stochastic learning dynamics. However, these observations are based on the simulation of a particular system under certain conditions, which prevents us from drawing any general conclusions regarding the behavior of these learning rules. Therefore, we present a general framework to analyze and compare the behavior of different learning rules that have the same stochastically stable states. We establish that the cycle decomposition framework is useful for the comparative analysis of learning dynamics in games. In particular, we identify and compare the parameters that enable us to explain the system behavior that we observed in the motivating setup of the sensor coverage game.
V Cycle Decomposition
Consider a Markov chain on a finite state space $\mathcal{X}$ with transition matrix $P_\epsilon$. We assume that the transition matrix satisfies the following property:
$$\frac{1}{K} e^{-c(x, y)/\tau} \leq P_\epsilon(x, y) \leq K e^{-c(x, y)/\tau}$$
for all $x \neq y$, where $K \geq 1$ is a constant and $c : \mathcal{X} \times \mathcal{X} \to [0, \infty]$ is a cost function with $c(x, y) = \infty$ whenever $P_\epsilon(x, y) = 0$.
For any pair $(x, y)$, $c(x, y)$ can be considered as the cost of the transition from $x$ to $y$. It is assumed that the cost function is irreducible, which implies that for any state pair $(x, y)$, there exists a path $\rho = (x_0, \ldots, x_m)$ from $x$ to $y$ such that
$$c(x_k, x_{k+1}) < \infty \quad \text{for all } 0 \leq k < m.$$
A cost function $c$ is induced by a potential function $\Phi : \mathcal{X} \to \mathbb{R}$ if, for all $x$ and $y$ in $\mathcal{X}$, the following weak reversibility condition is satisfied:
$$c(x, y) - c(y, x) = \Phi(x) - \Phi(y).$$
The following result is from  (Prop. 4.1).
Thus, in the limit as $\tau \to 0$, only the states maximizing the potential $\Phi$ will have a non-zero probability. Based on Prop. V.1, there is an entire class of Markov chains that lead to potential maximizers. We note that the results in  were stated for minimizing a potential function. Since we are dealing with maximizing a payoff, all the definitions and results are adapted accordingly.
V-A Cycle Decomposition Algorithm
The Cycle Decomposition Algorithm (CDA) was presented in , based on ideas originally presented in . It was developed to study the transient behavior of Markov chains that satisfy (7), (8), and (9) and lead to the stationary distribution defined in Prop. V.1. In this algorithm, the state space is decomposed into unique cycles in an iterative procedure. The formal definition of a cycle, as presented in  and , is as follows.
A set $C \subseteq \mathcal{X}$ is a cycle if it is a singleton or it satisfies either of the following two conditions.
For any $x$, $y$ in $C$, the probability that the chain, starting from $x$, leaves $C$ before visiting $y$ vanishes exponentially fast in $1/\tau$ as $\tau \to 0$.
For any $x$, $y$ in $C$, $\mathbb{E}_x[\theta_y]$ is exponentially large in $1/\tau$,
where $\theta_y$ is the number of round trips including $y$ performed by the chain before leaving $C$.
The first condition simply means that a subset $C$ is a cycle if, starting from some $x \in C$, the probability of leaving $C$ before visiting every state in $C$ is exponentially small.
The second condition states that the expected number of times each $y \in C$ is visited by the chain, starting from any $x \in C$, is exponentially large.
For higher order comparative analysis, we first decompose the state space into cycles via CDA. Then, we compare the properties of the cycles under each learning dynamics. For completeness of presentation, we reproduce CDA in Alg. 1. The outcome of CDA as presented in Alg. 1 is the set defined in (12). To explain system behavior using CDA, we need the following definitions and results, which are mostly adapted from .
The minimum cost of leaving a state $x$ is
$$V(x) = \min_{y \neq x} c(x, y).$$
We will refer to $V(x)$ as the exit height of state $x$. For any set of states $B$ and pair $x, y$ such that $x \in B$ and $y \notin B$, we define
$$\Delta c(x, y) = c(x, y) - V(x),$$
i.e., $\Delta c(x, y)$ is the excess cost above the minimum transition cost from $x$. For a path $\rho = (x_0, \ldots, x_m)$, the cost is the sum of the costs of its transitions, $c(\rho) = \sum_{k=0}^{m-1} c(x_k, x_{k+1})$.
The exterior boundary of a set $B$ is
$$\partial B = \{y \notin B : c(x, y) < \infty \text{ for some } x \in B\}.$$
The interior boundary of $B$ is
$$\partial^{-} B = \{x \in B : c(x, y) < \infty \text{ for some } y \notin B\}.$$
We say that a cycle is non-trivial if it has a non-zero exit height. Thus, a singleton $\{x\}$ is a non-trivial cycle if $x$ is a local maximum of the potential. The order of the decomposition of the state space is the number of iterations of CDA required until the whole state space becomes a single cycle.
An increasing family of cycles is defined for each $x \in \mathcal{X}$ as follows. Define $C_0(x) = \{x\}$. For each $k \geq 1$, $C_k(x)$ is the cycle of order $k$ that contains $C_{k-1}(x)$.
Given a set $B$ such that $|B| > 1$, the maximal proper partition $\mathcal{M}(B)$ is the partition of $B$ into its maximal cycles that are proper subsets of $B$.
For a cycle $C$:
the exit height is
$$H_e(C) = \max_{x \in C} \Phi(x) - A(C),$$
the mixing height is
$$H_m(C) = \max_{C' \in \mathcal{M}(C)} H_e(C'),$$
where $\mathcal{M}(C)$ is the maximal proper partition of $C$,
the communication altitude between any two states $x$ and $y$ is
$$A(x, y) = \max_{\rho} \min_{0 \leq k \leq |\rho|} \Phi(x_k),$$
where the maximum is over all paths $\rho$ from $x$ to $y$ and $x_k$ is the $k$th element in the path $\rho$, and
the communication altitude of the cycle is
$$A(C) = \max_{x \in C, \ y \in \partial C} A(x, y).$$
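These quantities can be illustrated on a toy chain. The sketch below assumes the Metropolis-type cost $c(x, y) = \max\{0, \Phi(x) - \Phi(y)\}$ on a path graph, where the simple path between two states is unique, so the communication altitude reduces to the minimum potential on that path; the potential values are illustrative assumptions.

```python
# Toy illustration of communication altitude and exit height on the path
# graph 0 - 1 - 2 - 3 - 4 (states only talk to adjacent states).
phi = [3.0, 1.0, 4.0, 0.0, 5.0]

def altitude(x, y):
    """A(x, y): the highest valley floor over all paths; on a path graph
    the simple path between x and y is unique."""
    lo, hi = min(x, y), max(x, y)
    return min(phi[lo:hi + 1])

# Consider the candidate cycle C = {0, 1, 2}; its exterior boundary is {3}.
C = [0, 1, 2]
peak = max(phi[x] for x in C)                 # highest potential inside C
exit_alt = max(altitude(x, 3) for x in C)     # best altitude of an exit path
H_e = peak - exit_alt                         # exit height of C
print(H_e)  # exit time from C scales like e^{H_e / tau}
```

Here the chain must descend from potential 4 (state 2) to the valley at potential 0 (state 3) to leave $C$, so $H_e = 4$, and the exit time from $C$ grows like $e^{4/\tau}$ as $\tau \to 0$.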