Path to Stochastic Stability: Comparative Analysis of Stochastic Learning Dynamics in Games

04/08/2018, by Hassan Jaleel, et al., King Abdullah University of Science and Technology

Stochastic stability is a popular solution concept for stochastic learning dynamics in games. However, a critical limitation of this solution concept is its inability to distinguish between different learning rules that lead to the same steady-state behavior. We address this limitation for the first time and develop a framework for the comparative analysis of stochastic learning dynamics with different update rules but the same steady-state behavior. We present the framework in the context of two learning dynamics: Log-Linear Learning (LLL) and Metropolis Learning (ML). Although both of these dynamics have the same stochastically stable states, LLL and ML correspond to different behavioral models for decision making. Moreover, we demonstrate through an example setup of a sensor coverage game that, for each of these dynamics, the paths to the stochastically stable states exhibit distinctive behaviors. Therefore, we propose multiple criteria to analyze and quantify the differences in the short and medium run behavior of stochastic learning dynamics. We derive and compare upper bounds on the expected hitting time to the set of Nash equilibria for both LLL and ML. For the medium to long-run behavior, we identify a set of tools from the theory of perturbed Markov chains that result in a hierarchical decomposition of the state space into collections of states called cycles. We compare LLL and ML based on the proposed criteria and develop valuable insights into the comparative behavior of the two dynamics.


I Introduction

Stochastic learning dynamics, like log-linear learning, address the issue of equilibrium selection for a class of games that includes potential games (see, e.g., [1], [2], [3] and [4]). Because of the equilibrium selection property, these learning dynamics have received significant attention, particularly in the context of opinion dynamics in coordination games (see, e.g., [2] and [5]) and game theoretic approaches to the distributed control of multiagent systems [6].

A well-known problem of stochastic learning dynamics is the slow mixing of their induced Markov chains [2], [7], [8]. The mixing time of a Markov chain is the time required by the chain to converge to its stationary behavior. This mixing time is crucial because the definition of a stochastically stable state depends on the stationary distribution of the Markov chain induced by a learning dynamics. The slow mixing time implies that the behavior of these dynamics in the short and medium run is equally important, particularly for engineered systems with a limited lifetime. However, stochastic stability only deals with the steady-state behavior and provides no information about the transient behavior of these dynamics.

The speed of convergence of stochastic learning dynamics is an active area of research [9], [10], [11], [12], [13], [14]. However, there is another aspect related to the slow convergence of these dynamics that has received relatively little attention. Stochastic stability only explains the steady-state behavior of a system under a learning rule. We establish that there are learning dynamics with considerably different update rules that lead to the same steady-state behavior. Since these learning dynamics have the same stochastically stable states, stochastic stability cannot distinguish between them. The different update rules may result in significantly different behaviors over the short and medium run that may be desirable or undesirable but remain entirely unnoticed.

We first establish the implications of having different learning rules with the same steady state. Through an example setup, we demonstrate the differences in short and medium run behaviors for two particular learning dynamics with different update rules that lead to the same stochastically stable states. The example setup is a sensor coverage game, in which we formulate a sensor coverage problem in the framework of a potential game. An important conclusion that we draw from this comparison is that, for stochastic learning dynamics, characterization of the stochastically stable states is not sufficient. It is also essential to analyze the paths that lead to these stochastically stable states from any given initial condition. Analysis of these paths is critical because specific properties of these paths play a crucial role not only in the short and medium run but also in the long-run steady-state behavior of the system.

The transient behavior of stochastic dynamics was studied in the context of learning in games in [15] and [16]. However, the issues related to various learning dynamics leading to the same steady state behavior were not highlighted in these works. Therefore, after motivating the problem, we propose a novel framework for performing a comparative analysis of different stochastic learning dynamics with the same steady state. The proposed framework is based on the theory of Markov chains with rare transitions [17], [18], [19], [20], [21], [22], and [23].

In the proposed framework, we present multiple criteria for comparing the short, medium, and long-run behaviors of a system under different learning dynamics. We refer to the analysis related to the short-run behavior as first order analysis. The first order analysis deals with the expected hitting time of the set of pure Nash equilibria, which is the expected time to reach a Nash equilibrium (NE) for the first time. Because of the known hardness results for computing a NE [24], [25], and the fact that not all Nash equilibria of a potential game are potential maximizers, first order analysis is typically not considered for stochastic learning dynamics. However, we are interested in the comparative analysis of learning dynamics, for which we show that first order analysis provides valuable insights into the behavior of a system.

We refer to the analysis related to the medium and long-run behavior of stochastic learning dynamics as higher-order analysis. The higher-order analysis is based on the fact that the Markov chains induced by stochastic learning dynamics explore the space of joint action profiles hierarchically. This hierarchical exploration of the state space is well explained by an iterative decomposition of the state space into cycles of different orders as shown in [17], [19], [18], and [26]. Thus, the evolution of Markov chains with rare transitions can be well approximated by transitions among cycles of proper order.

Therefore, we develop our higher order analysis on the cycle decomposition of the state space as presented in [27]. We compare the behavior of different learning rules by comparing the exit height and the mixing height of the cycles generated by the cycle decomposition algorithm applied to these learning rules. The significance of these parameters is that once a Markov chain enters a cycle, the time to exit the cycle is governed by the exit height, and the time to visit each state within the cycle before exiting is governed by the mixing height. Thus, we can efficiently characterize the behavior of each cycle from its exit and mixing heights.

Our comparative analysis framework applies to the class of learning dynamics in which the induced Markov chains satisfy certain regularity conditions. However, we present the details of the framework in the context of two particular learning dynamics, Log-Linear Learning (LLL) and Metropolis Learning (ML), over potential games. Log-linear learning is a noisy best response dynamics in which the probability of a noisy action from a player is inversely related to the cost of deviating from the best response. This learning rule is well known in game theory, and the stationary distribution of the induced Markov chain is a Gibbs distribution, which depends on a potential function. The Gibbs distribution over the space of joint action profiles assigns the maximum probability to the action profiles that maximize the potential function. Moreover, it was shown in [28] that LLL is a good behavioral model for decision making when the players have sufficient information to compute their utilities for all the actions in their action set given the actions of the other players in the game.

On the other hand, Metropolis learning is a noisy better response dynamics for which the induced Markov chain is a Metropolis chain. It is well established in the statistical mechanics literature that the unique stationary distribution of a Metropolis chain is the Gibbs distribution (see, e.g., [18] and [20]). As a behavioral model for decision making, ML is closely related to the pairwise comparison dynamics presented in [29]. Thus, ML is a behavioral model for decision making with low information demand. A player only needs to compare its current payoff with the payoff of a randomly selected action. It does not need to know the payoffs for all the actions as in LLL. The only assumption is that each player has the ability or the resources to compute the payoff for one randomly selected action.

Hence, we have two learning dynamics, LLL and ML, which correspond to two behavioral models for decision making with very different information requirements. However, both of the learning rules lead to the same steady-state behavior. We compare these learning dynamics based on the proposed framework. The crux of our comparative analysis is that the availability of more information in the case of LLL as compared to ML does not guarantee better performance when the performance criterion is to reach the potential maximizer quickly.

A summary of our main contributions in this work is as follows.
Contributions

  • For problem motivation, we present our setup of a sensor coverage game in which we formulate the sensor coverage problem with random sensor deployment as a potential game.

  • For the first order analysis, we derive and compare upper bounds on the expected hitting time to the set of NE for both LLL and ML.

  • We also obtain a sufficient condition that guarantees a smaller expected hitting time to the set of Nash equilibria under LLL than under ML from any initial condition.

  • For higher order analysis, we identify the cycle decomposition algorithm as a useful tool for the comparative analysis of stochastic learning dynamics. Moreover, we show through an example of a simple Markov chain that the cycle decomposition algorithm is also suitable for describing system behavior at different levels of abstraction.

  • We compare the exit heights and mixing heights of cycles under LLL and ML. We show that if a subset of state space is a cycle under both LLL and ML, then the mixing and exit heights of that cycle will always be smaller for ML as compared to LLL.

II Background

II-A Preliminaries

We denote the cardinality of a set $S$ by $|S|$. For a vector $x \in \mathbb{R}^n$, $x_i$ denotes its $i$th entry and $\|x\|$ is its Euclidean norm. The Hamming distance between any two vectors $x$ and $y$ in $\mathbb{R}^n$ is

$$d_H(x, y) = |\{i \in \{1, \ldots, n\} : x_i \neq y_i\}|. \qquad (1)$$

$\Delta(n)$ denotes the $n$-dimensional probability simplex, i.e.,

$$\Delta(n) = \{p \in \mathbb{R}^n : p \geq 0, \ \mathbf{1}^\top p = 1\},$$

where $\mathbf{1}$ is a column vector in $\mathbb{R}^n$ with all the entries equal to 1.
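As a small illustration, here is a minimal Python sketch of the Hamming distance in (1) and a membership test for the probability simplex; the function names are illustrative and not from the paper.

```python
# Minimal sketch: Hamming distance (1) and a probability-simplex membership test.
import numpy as np

def hamming_distance(x, y):
    """Number of coordinates in which the vectors x and y differ."""
    x, y = np.asarray(x), np.asarray(y)
    return int(np.sum(x != y))

def in_probability_simplex(p, tol=1e-9):
    """Check that p has nonnegative entries summing to one."""
    p = np.asarray(p, dtype=float)
    return bool(np.all(p >= -tol) and abs(p.sum() - 1.0) <= tol)

print(hamming_distance([1, 0, 2], [1, 1, 2]))    # 1
print(in_probability_simplex([0.2, 0.3, 0.5]))   # True
```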

II-B Markov Chains

A discrete time Markov chain on a finite state space $\mathcal{X}$ is a random process that consists of a sequence of random variables $X_0, X_1, X_2, \ldots$ such that for all $t \geq 0$ and all $x_0, \ldots, x_{t+1} \in \mathcal{X}$,

$$\Pr(X_{t+1} = x_{t+1} \mid X_t = x_t, \ldots, X_0 = x_0) = \Pr(X_{t+1} = x_{t+1} \mid X_t = x_t),$$

where $X_t \in \mathcal{X}$ for all $t$. Let $P$ be the transition matrix for the Markov chain and $P(x, y)$ be the transition probability from state $x$ to $y$. A distribution $\pi$ is a stationary distribution with respect to $P$ if

$$\pi = \pi P.$$
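For intuition, here is a minimal numerical sketch (with an illustrative 3-state transition matrix) that recovers a stationary distribution as a left eigenvector of $P$ for eigenvalue 1:

```python
# Minimal sketch: compute a stationary distribution pi with pi = pi P.
# The transition matrix below is illustrative, not from the paper.
import numpy as np

P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.6, 0.2],
              [0.0, 0.4, 0.6]])

# pi is the left eigenvector of P associated with eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
pi = pi / pi.sum()

print(pi)                        # stationary distribution
print(np.allclose(pi @ P, pi))   # True: pi = pi P
```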

If a Markov chain is ergodic and reversible, then it has a unique stationary distribution $\pi$ that satisfies the detailed balance condition

$$\pi(x) P(x, y) = \pi(y) P(y, x)$$

for all $x, y \in \mathcal{X}$. The following definitions are adapted from [2].

Definition II.1

Let $P_0$ be a transition matrix for a Markov chain over state space $\mathcal{X}$. Let $\{P_\epsilon\}$ be a family of perturbed Markov chains on $\mathcal{X}$ for sufficiently small $\epsilon > 0$ corresponding to $P_0$. We say that $P_\epsilon$ is a regular perturbation of $P_0$ if

  1. $P_\epsilon$ is ergodic for sufficiently small $\epsilon > 0$,

  2. $\lim_{\epsilon \to 0} P_\epsilon(x, y) = P_0(x, y)$ for all $x, y \in \mathcal{X}$, and

  3. $P_\epsilon(x, y) > 0$ for some $\epsilon > 0$ implies that there exists some function $r(x, y) \geq 0$ such that

$$0 < \lim_{\epsilon \to 0} \frac{P_\epsilon(x, y)}{\epsilon^{\, r(x, y)}} < \infty,$$

where $r(x, y)$ is the cost of the transition from $x$ to $y$ and is normally referred to as resistance.

The Markov process corresponding to $P_\epsilon$ is called a regularly perturbed Markov process.

Definition II.2

Let $P_\epsilon$ be a regular perturbation of $P_0$ with stationary distribution $\pi_\epsilon$. A state $x \in \mathcal{X}$ is a stochastically stable state if

$$\lim_{\epsilon \to 0} \pi_\epsilon(x) > 0.$$

Thus, any state that is not stochastically stable will have a vanishingly small probability of occurrence in the steady state as $\epsilon \to 0$.

Given any two states $x$ and $y$ in $\mathcal{X}$, $y$ is reachable from $x$ if $P^t(x, y) > 0$ for some $t \geq 0$. The neighborhood of $x$ is

$$\mathcal{N}(x) = \{y \in \mathcal{X} : P(x, y) > 0\}.$$

A path $\omega^{\mathcal{X}}_{x, y}$ between any two states $x$ and $y$ in $\mathcal{X}$ is a sequence of distinct states $(\omega_1, \omega_2, \ldots, \omega_m)$ such that $\omega_1 = x$, $\omega_m = y$, and $\omega_{k+1} \in \mathcal{N}(\omega_k)$ for all $k \in \{1, \ldots, m-1\}$. The length of the path is denoted as $|\omega|$. The superscript will be ignored in path notation when the state space is clear from the context. Given a set $A \subseteq \mathcal{X}$ and a path $\omega$, we say that $\omega \subseteq A$ if $\omega_k \in A$ for all $k$. We define $\Omega^{\mathcal{X}}_{x, y}$ as the set of all paths between states $x$ and $y$ in state space $\mathcal{X}$. States $x$ and $y$ communicate with each other ($x \leftrightarrow y$) if the sets $\Omega^{\mathcal{X}}_{x, y}$ and $\Omega^{\mathcal{X}}_{y, x}$ are not empty.

Definition II.3

A set $A \subseteq \mathcal{X}$ is connected if for every $x, y \in A$, $\Omega^{A}_{x, y} \neq \emptyset$.

Definition II.4

The hitting time of a state $x \in \mathcal{X}$ is the first time it is visited, i.e.,

$$\tau_x = \min\{t \geq 0 : X_t = x\}.$$

The hitting time of a set $A \subseteq \mathcal{X}$ is the first time one of the states of $A$ is visited, i.e.,

$$\tau_A = \min\{t \geq 0 : X_t \in A\}.$$

Definition II.5

The exit time of a Markov chain from a set $A \subseteq \mathcal{X}$ is $\tau_{\partial A}$, where

$$\partial A = \{y \in \mathcal{X} \setminus A : P(x, y) > 0 \text{ for some } x \in A\}.$$

We will refer to $\partial A$ as the boundary of $A$. In the above definition it is assumed that $X_0 \in A$.
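Below is a minimal Monte Carlo sketch estimating the hitting time of a set (Def. II.4) and the exit time from a set (Def. II.5); the 3-state chain, the target set, and the step cap are illustrative.

```python
# Minimal Monte Carlo sketch of hitting and exit times for a small chain.
import numpy as np

rng = np.random.default_rng(0)

P = np.array([[0.8, 0.2, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8]])

def hitting_time(P, x0, A, max_steps=10_000):
    """First time the chain started at x0 visits a state in the set A."""
    x = x0
    for t in range(max_steps):
        if x in A:
            return t
        x = rng.choice(len(P), p=P[x])
    return max_steps

def exit_time(P, x0, A, max_steps=10_000):
    """First time the chain started at x0 inside A reaches the boundary of A."""
    x = x0
    for t in range(max_steps):
        if x not in A:
            return t
        x = rng.choice(len(P), p=P[x])
    return max_steps

samples = [hitting_time(P, x0=0, A={2}) for _ in range(1000)]
print(np.mean(samples))   # Monte Carlo estimate of the expected hitting time of {2}
```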

II-C Game Theory

Let $\mathcal{P} = \{1, 2, \ldots, n\}$ be a set of strategic players in which each player $i$ has a finite set of strategies $\mathcal{A}_i$. The utility of each player $i$ is represented by a utility function $U_i : \mathcal{A} \to \mathbb{R}$, where $\mathcal{A} = \mathcal{A}_1 \times \cdots \times \mathcal{A}_n$ is the set of joint action profiles. The combination of the action $a_i$ of the $i$th player and the actions of everyone else is represented by $a = (a_i, a_{-i})$. The joint action profiles of all the players except $i$ are represented by the set $\mathcal{A}_{-i} = \prod_{j \neq i} \mathcal{A}_j$.

Player $i$ prefers action profile $(a_i, a_{-i})$ over $(a'_i, a_{-i})$, where $a_i, a'_i \in \mathcal{A}_i$ and $a_{-i} \in \mathcal{A}_{-i}$, if and only if $U_i(a_i, a_{-i}) > U_i(a'_i, a_{-i})$. If $U_i(a_i, a_{-i}) = U_i(a'_i, a_{-i})$, then it is indifferent between the two actions. An action profile $a^* = (a^*_i, a^*_{-i})$ is a Nash Equilibrium (NE) if

$$U_i(a^*_i, a^*_{-i}) \geq U_i(a_i, a^*_{-i})$$

for all $i \in \mathcal{P}$ and $a_i \in \mathcal{A}_i$. The best response set of player $i$ given an action profile $a = (a_i, a_{-i})$ is

$$\mathcal{B}_i(a_{-i}) = \arg\max_{a'_i \in \mathcal{A}_i} U_i(a'_i, a_{-i}).$$

The set of all possible best responses from an action profile $a$ is

$$\mathcal{B}(a) = \bigcup_{i \in \mathcal{P}} \{(a'_i, a_{-i}) : a'_i \in \mathcal{B}_i(a_{-i})\}.$$

The neighborhood of an action profile $a$ is

$$\mathcal{N}(a) = \{a' \in \mathcal{A} : d_H(a, a') \leq 1\}.$$

The agent specific neighborhood set of action profile $a$ is

$$\mathcal{N}_i(a) = \{(a'_i, a_{-i}) : a'_i \in \mathcal{A}_i\}.$$

Potential Game: A game is a potential game if there exists a real-valued function $\phi : \mathcal{A} \to \mathbb{R}$ such that

$$U_i(a_i, a_{-i}) - U_i(a'_i, a_{-i}) = \phi(a_i, a_{-i}) - \phi(a'_i, a_{-i})$$

for all $i \in \mathcal{P}$ and for all $a_i, a'_i \in \mathcal{A}_i$, $a_{-i} \in \mathcal{A}_{-i}$. The function $\phi$ is called a potential function.
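As an illustration, the sketch below checks the potential game condition exhaustively for a toy two-player coordination game; the payoff tables and the candidate potential are illustrative.

```python
# Minimal sketch: exhaustively verify the potential game condition for a
# toy two-player game. Payoffs and potential are illustrative.
import itertools
import numpy as np

# Two players, two actions each; U[i][a1, a2] is player i's payoff.
U = [np.array([[2, 0], [0, 1]]),   # player 0
     np.array([[2, 0], [0, 1]])]   # player 1
phi = np.array([[2, 0], [0, 1]])   # candidate potential function

def is_potential(U, phi):
    """Check U_i(a_i, a_-i) - U_i(a_i', a_-i) = phi(a_i, a_-i) - phi(a_i', a_-i)."""
    n_actions = phi.shape
    for a in itertools.product(*(range(k) for k in n_actions)):
        for i in range(2):                    # deviating player
            for b in range(n_actions[i]):     # alternative action a_i'
                a2 = list(a); a2[i] = b; a2 = tuple(a2)
                if U[i][a] - U[i][a2] != phi[a] - phi[a2]:
                    return False
    return True

print(is_potential(U, phi))   # True for this coordination game
```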

III Stochastic Learning Dynamics

Stochastic learning dynamics is a class of learning dynamics in games in which the players typically play a best or better reply to the actions of the other players. However, the players sporadically play noisy actions for exploration, because of which these dynamics have an equilibrium selection property for classes of games such as potential games.

III-A Log-Linear Learning

Let $a(t) = (a_i(t), a_{-i}(t))$ be the joint action profile representing the current state of the game at time $t$. Then, the steps involved in LLL are as follows.

  1. Activate one of the players, say player $i$, uniformly at random.

  2. All other players repeat their previous actions.

  3. Player $i$ selects an action $a_i \in \mathcal{A}_i$ with the following probability

    $$\Pr\big(a_i(t+1) = a_i\big) = \frac{1}{Z} e^{-\Delta_i(a_i)/T}. \qquad (2)$$

    Here $Z$ is a normalizing constant, $a^{\mathrm{br}}_i \in \mathcal{B}_i(a_{-i}(t))$ is a best response of player $i$ to $a_{-i}(t)$, and

    $$\Delta_i(a_i) = U_i(a^{\mathrm{br}}_i, a_{-i}(t)) - U_i(a_i, a_{-i}(t)).$$

In (2), $T$ is the noise parameter, normally referred to as temperature. For $T \to \infty$, the players update their strategies uniformly at random. However, as $T \to 0$, the probability of the actions yielding higher utilities increases.
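Below is a minimal Python sketch of one LLL update step for a toy two-player coordination game; the game, the utility oracle, and the temperature value are illustrative. It uses the soft-max form with probabilities proportional to $e^{U_i(\cdot, a_{-i})/T}$, which is equivalent to (2) after normalization.

```python
# Minimal sketch of one asynchronous log-linear learning (LLL) update.
# Utility oracle U(i, a), action sets, and temperature are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def lll_step(a, action_sets, U, T):
    """One LLL update of the joint action profile a (a list)."""
    i = rng.integers(len(a))                        # activate a player at random
    utils = np.array([U(i, a[:i] + [b] + a[i+1:])   # utilities of all own actions
                      for b in action_sets[i]], dtype=float)
    p = np.exp((utils - utils.max()) / T)           # subtract max for stability
    p /= p.sum()
    a[i] = action_sets[i][rng.choice(len(p), p=p)]  # sample the new action
    return a

# Toy 2-player coordination game with potential phi(a) = 1 if a[0] == a[1].
action_sets = [[0, 1], [0, 1]]
U = lambda i, a: 1.0 if a[0] == a[1] else 0.0
a = [0, 1]
for _ in range(20):
    a = lll_step(a, action_sets, U, T=0.1)
print(a)   # with high probability a coordinated profile, e.g. [0, 0] or [1, 1]
```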

Thus, LLL induces a Markov chain over the set of joint action profiles with transition matrix $P^{\mathrm{LLL}}_T$. The transition probability between any two distinct action profiles $a = (a_i, a_{-i})$ and $a' = (a'_i, a_{-i})$ that differ only in the action of player $i$ is obtained by combining the uniform activation of player $i$ with the update probability in (2); the transition probability is zero if $a$ and $a'$ differ in the actions of more than one player.

It was shown in [30] that $P^{\mathrm{LLL}}_T$ is a regular perturbation of $P^{\mathrm{LLL}}_0$ with $\epsilon = e^{-1/T}$, which is why we use the notation $P^{\mathrm{LLL}}_T$ instead of $P_\epsilon$. Here, $P^{\mathrm{LLL}}_0$ is the transition matrix of the Markov chain induced by sequential best response dynamics. It was also proved in [30] that in an $n$-player potential game with a potential function $\phi$, if all the agents update their actions based on LLL, then the only stochastically stable states are the potential maximizers. The stationary distribution for $P^{\mathrm{LLL}}_T$ is the Gibbs distribution

$$\pi_T(a) = \frac{1}{Z_\phi} e^{\phi(a)/T}, \qquad (3)$$

where $Z_\phi = \sum_{a' \in \mathcal{A}} e^{\phi(a')/T}$ is the normalizing constant.

III-B Metropolis Learning

We introduce another learning dynamics that has the same stationary distribution as in (3). We refer to it as Metropolis Learning (ML) because the Markov chain induced by ML is a Metropolis chain, which is well studied in statistical mechanics and in simulated annealing [20]. The steps involved in ML are as follows.

  1. Activate one of the players, say player $i$, uniformly at random.

  2. All other players repeat their previous actions.

  3. Player $i$ selects an action $a'_i \in \mathcal{A}_i$ uniformly at random.

  4. Player $i$ switches its action from $a_i$ to $a'_i$ with probability

    $$\min\Big\{1, \ e^{-\left(U_i(a_i, a_{-i}) - U_i(a'_i, a_{-i})\right)/T}\Big\}.$$

Thus, the probability of transition from $a_i$ to $a'_i$ is

$$\frac{1}{|\mathcal{A}_i|} \, e^{-[U_i(a_i, a_{-i}) - U_i(a'_i, a_{-i})]^+/T}, \qquad (4)$$

where $[u]^+ = u$ if $u > 0$ and is equal to zero otherwise.

In ML, player $i$ switches to a randomly selected action $a'_i$ with probability one as long as $U_i(a'_i, a_{-i}) \geq U_i(a_i, a_{-i})$. Here, $a_i$ is the action that player $i$ was playing in the previous time slot. Thus, unlike LLL, in which a player needs to compute the utilities of all the actions in its action set given $a_{-i}$, the update in ML only requires a player to make a pairwise comparison between a randomly selected action and its previous action. Furthermore, the probability of a noisy action is a function of the loss in payoff as compared to the previous action.
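A matching minimal sketch of one ML update step is shown below, under the same illustrative conventions as the LLL sketch; note that only two utility evaluations (the current and the proposed action) are needed.

```python
# Minimal sketch of one asynchronous Metropolis learning (ML) update.
# Utility oracle, action sets, and temperature are illustrative.
import numpy as np

rng = np.random.default_rng(1)

def ml_step(a, action_sets, U, T):
    """One ML update of the joint action profile a (a list)."""
    i = rng.integers(len(a))                               # activate a player at random
    b = action_sets[i][rng.integers(len(action_sets[i]))]  # uniform proposal
    gain = U(i, a[:i] + [b] + a[i+1:]) - U(i, a)           # payoff change of switching
    if gain >= 0 or rng.random() < np.exp(gain / T):       # accept as in (4)
        a[i] = b
    return a

# Same toy coordination game as in the LLL sketch.
action_sets = [[0, 1], [0, 1]]
U = lambda i, a: 1.0 if a[0] == a[1] else 0.0
a = [0, 1]
for _ in range(20):
    a = ml_step(a, action_sets, U, T=0.1)
print(a)   # with high probability a coordinated profile
```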

Metropolis learning generates a Markov chain over the set of joint action profiles with transition matrix $P^{\mathrm{ML}}_T$. The transition probability between any two distinct action profiles $a = (a_i, a_{-i})$ and $a' = (a'_i, a_{-i})$ that differ only in the action of player $i$ is

$$P^{\mathrm{ML}}_T(a, a') = \frac{1}{n |\mathcal{A}_i|} \, e^{-[U_i(a_i, a_{-i}) - U_i(a'_i, a_{-i})]^+/T},$$

and the transition probability is zero if $a$ and $a'$ differ in the actions of more than one player.

Next, we show that $P^{\mathrm{ML}}_T$ is a regularly perturbed Markov process.

Lemma III.1

Transition matrix $P^{\mathrm{ML}}_T$ is a regular perturbation of $P^{\mathrm{ML}}_0$, where $P^{\mathrm{ML}}_0$ is the transition matrix for asynchronous better reply dynamics. Moreover, the resistance of any feasible transition from $a = (a_i, a_{-i})$ to $a' = (a'_i, a_{-i})$ is

$$r(a, a') = [U_i(a_i, a_{-i}) - U_i(a'_i, a_{-i})]^+. \qquad (5)$$

To prove that $P^{\mathrm{ML}}_T$ is a regular perturbation of $P^{\mathrm{ML}}_0$, we first describe the unperturbed process, which is asynchronous better reply dynamics. The unperturbed process has the following dynamics.

  1. A player, say $i$, is selected at random.

  2. All the other players repeat their previous actions.

  3. Player $i$ selects an action $a'_i \in \mathcal{A}_i$ uniformly at random.

  4. Player $i$ switches its action from $a_i$ to $a'_i$ if

    $$U_i(a'_i, a_{-i}) \geq U_i(a_i, a_{-i}).$$

    Otherwise, it repeats $a_i$. Thus, under $P^{\mathrm{ML}}_0$, a transition from $a$ to $a' \neq a$ has positive probability only if $a'$ is a better (or equally good) reply for the selected player.

Similar to LLL, the noise parameter is $\epsilon = e^{-1/T}$. The Metropolis chain is ergodic for any given $T > 0$, and it satisfies $\lim_{T \to 0} P^{\mathrm{ML}}_T(a, a') = P^{\mathrm{ML}}_0(a, a')$. For the final condition,

$$0 < \lim_{T \to 0} \frac{P^{\mathrm{ML}}_T(a, a')}{\epsilon^{\, r(a, a')}} = \frac{1}{n |\mathcal{A}_i|} < \infty,$$

where $r(a, a')$ is given in (5) for any given pair $a$ and $a'$. Thus, $P^{\mathrm{ML}}_T$ is a regular perturbation of $P^{\mathrm{ML}}_0$.

The important fact regarding ML in the context of this work is that the stationary distribution is also the Gibbs distribution, i.e.,

$$\pi_T(a) = \frac{1}{Z_\phi} e^{\phi(a)/T}. \qquad (6)$$

Thus, from the perspective of stochastic stability, both LLL and ML are precisely the same. To observe and understand the effects of different update rules on system behavior, we simulated a sensor coverage game with both LLL and ML. Next, we present the setup and the results of the simulation.

IV Motivation: Sensor Coverage Problem

To study the difference in behaviors between LLL and ML, which is ignored under stochastic stability, we set up a sensor coverage problem as a potential game. Through extensive simulations under various noise conditions, we exhibit the essential differences between the behaviors of these learning dynamics in the short and medium run. We note that this formulation of the sensor coverage problem with random sensor deployment as a potential game is itself a contribution and can be of independent interest in the context of local scheduling schemes for the sensor coverage problem.

Fig. 1: System performance under LLL and ML: (a) number of iterations to reach the first NE; (b) global payoff at the first NE.

Fig. 2: System performance under LLL and ML for different noise conditions (panels (a)-(c), from the lowest to the highest noise level).

IV-A Coverage Game Setup

Consider a scenario in which $n$ sensors are deployed randomly to monitor an environment for a long period of time. We approximate the environment with a square region defined over the intervals $[0, L] \times [0, L]$. To simplify the problem, the area is discretized as a two-dimensional grid represented by the Cartesian product $G = \{0, 1, \ldots, L\} \times \{0, 1, \ldots, L\}$. The location of each sensor $i$ is $\ell_i$, where $\ell_i$ is a random variable uniformly distributed over the region of interest. The footprint of sensor $i$ with sensing radius $r_i$ is a circular disk of radius $r_i$, i.e.,

$$D_i(r_i) = \{g \in G : \|g - \ell_i\| \leq r_i\}.$$

We assume that each sensor can choose the radius of its footprint from a finite set, which determines its energy consumption. Let $r_c$ be the communication range of each sensor. We assume that $r_c \geq 2 r_{\max}$, where $r_{\max}$ is the maximum sensing radius.

We propose a game-theoretic solution to the sensor coverage problem in which we formulate the problem as a strategic game and implement a local learning rule so that each sensor can learn its schedule based on local information only. The players in this game are the sensors, and each player $i$ has a finite set of actions $\mathcal{A}_i$, where an action of a player is its sensing radius. For each sensor, the action $a_i = 0$ corresponds to the off state of the sensor. The joint action profile $a = (a_1, \ldots, a_n)$ is the joint state of all the sensors.

Let $g$ be a point on the grid $G$. The state of a grid point is whether it is covered or uncovered, i.e.,

$$c_g(a) = \begin{cases} 1 & \text{if } g \in D_i(a_i) \text{ for some } i, \\ 0 & \text{otherwise.} \end{cases}$$

Thus, the objective is to solve the following optimization problem

$$\max_{a \in \mathcal{A}} \ \big( C(a) - E(a) \big),$$

where

$$C(a) = \sum_{g \in G} c_g(a)$$

is the total coverage achieved by the sensor network and

$$E(a) = \sum_{i \in \mathcal{P}} e(a_i)$$

is the total cost incurred by the sensors that are on. We assume that $e(0) = 0$, i.e., no cost is incurred by the sensors that are off.

The local utility of each player is computed through the marginal contribution utility as explained in [31] with base action $a_i = 0$, i.e., the base action of each sensor is to be in the off state, in which there is no energy consumption. If $a_{-i}$ is the joint state of all the other sensors, then the utility of player $i$ for action $a_i$ is

$$U_i(a_i, a_{-i}) = \big( C(a_i, a_{-i}) - E(a_i, a_{-i}) \big) - \big( C(0, a_{-i}) - E(0, a_{-i}) \big).$$

The above equation implies that $U_i(0, a_{-i}) = 0$. For any $a_i \neq 0$,

$$U_i(a_i, a_{-i}) = \big|\{g \in D_i(a_i) : g \notin D_j(a_j) \text{ for all } j \neq i\}\big| - e(a_i).$$

Thus, the marginal contribution utility of sensor $i$ with action $a_i$ is the number of grid points that are covered exclusively by the sensor with a footprint of radius $a_i$, minus the cost $e(a_i)$.

To make the payoff and the cost terms in $U_i$ compatible, we express the cost of turning a sensor on as a function of the minimum number of grid points that a sensor should cover exclusively. Let $N_{\max}(a_i)$ be the maximum number of grid points that a sensor can cover if its footprint has radius $a_i$. We define the cost as

$$e(a_i) = \alpha \, N_{\max}(a_i)$$

for some fraction $\alpha \in (0, 1)$. Thus, the net utility of a sensor is negative if the number of points it covers exclusively is less than $\alpha N_{\max}(a_i)$, given $a_{-i}$.
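A minimal Python sketch of this marginal contribution utility is shown below; the grid size, sensor locations, radii, and the particular cost rule (a fixed fraction of the maximum coverable points) are illustrative assumptions, not the values used in the paper.

```python
# Minimal sketch of the marginal contribution utility in the coverage game.
# Grid, sensor locations, radii, and the cost fraction are illustrative.

grid = [(x, y) for x in range(10) for y in range(10)]   # discretized region
locations = [(2, 2), (3, 4), (7, 7)]                    # sensor positions (fixed here)

def covers(loc, r, g):
    """True if grid point g lies in the disk of radius r centered at loc."""
    return r > 0 and (g[0] - loc[0]) ** 2 + (g[1] - loc[1]) ** 2 <= r ** 2

def n_max(r):
    """Maximum number of grid points a disk of radius r can cover."""
    return sum(dx * dx + dy * dy <= r * r
               for dx in range(-r, r + 1) for dy in range(-r, r + 1))

def cost(r):
    """Energy cost of radius r: here an illustrative fixed fraction of n_max(r)."""
    return 0.0 if r == 0 else 0.5 * n_max(r)

def utility(i, a):
    """Marginal contribution: points covered exclusively by sensor i, minus cost."""
    exclusive = sum(
        covers(locations[i], a[i], g)
        and not any(covers(locations[j], a[j], g) for j in range(len(a)) if j != i)
        for g in grid
    )
    return exclusive - cost(a[i])

a = [2, 2, 1]   # joint action profile of sensing radii (0 would mean "off")
print([utility(i, a) for i in range(len(a))])
```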

IV-B Simulation Results

We simulated the sensor coverage game with fixed values of the number of sensors, the grid size, the available sensing radii, and the associated costs. For this setup, the maximum global utility was 247, which was computed numerically based on extensive simulations. To achieve the maximum payoff, we implemented LLL and ML with different values of the noise parameter $T$ and the number of iterations. The results of the simulation are presented in Figs. 1 and 2.

Initially, all the sensors were in the off state. To compare the short-term behavior of the network under small noise, we fixed a small value of the noise parameter and ran the simulation twenty times for both LLL and ML, with one hundred iterations per run. Since players were randomly selected to update their actions at each decision time, each simulation led to a different system configuration in one hundred iterations even with the same initial condition. The results of the twenty simulations are presented in Fig. 1. In Fig. 1(a), we show the number of iterations to reach a NE for the first time under LLL and ML. Based on the results in Fig. 1(a), the average numbers of iterations to reach a NE for the first time under LLL and ML were 43.15 and 63.75, respectively. Thus, on average, the system reached a NE faster under LLL than under ML.

For a system with multiple Nash equilibria, reaching a NE faster is not the only objective. The quality of the NE is also a significant factor. In Fig. 1(b), we present the global payoff at the Nash equilibria reached under LLL and ML in our twenty simulations. The global payoffs at the Nash equilibria under LLL and ML had mean values of 229.6 and 230.1, and standard deviations of 8.39 and 12.49, respectively. Although the average global payoffs were almost equal, the higher standard deviation under ML implies that ML explored the state space more than LLL. As a result of this higher exploration tendency, the system achieved the global maximum of 247 three times under ML and only once under LLL.

Thus, based on the comparisons from Fig. 1, LLL seems to be better than ML because it can lead to a NE faster on average. However, ML seems to have a slight edge over LLL if we consider the quality of the Nash equilibria. This observation provides a strong rationale for comprehensive comparative analysis because we cannot simply declare one learning rule better than the other.

For higher order analysis, the objective was to observe and compare system behavior over an extended period. For comparison, we were interested in the following crucial aspects.

  • Time to reach a payoff maximizing NE under each learning dynamics.

  • The paths adopted to reach the payoff-maximizing NE and their characteristics.

  • System behavior after reaching a payoff maximizing NE.

Therefore, we simulated the system over a much longer horizon for three increasing values of the noise parameter. The results are presented in Figs. 2(a)-2(c), respectively.

For the smallest noise level, an optimal configuration could not be achieved under either LLL or ML, even over the entire simulation horizon. Under LLL, the network remained stuck at the same NE throughout the run. Under ML, there was a single switch in the network configuration from one NE to another. As we increased the noise, payoff-maximizing configurations were reached under both LLL and ML. However, the number of iterations to reach these optimal configurations was huge, particularly for LLL. Finally, for the highest noise level, the optimal configurations were reached rapidly.

The ability of ML to stay at an optimal configuration after reaching it is affected more by noise than that of LLL. In Fig. 2(a), with the smallest noise level, the network configuration switched from one NE to another under ML, but there was no switch under LLL. In Fig. 2(b), with the intermediate noise level, the network configuration switched to an optimal NE more quickly under ML than under LLL. Finally, the increase in noise led to an interesting behavior that can be observed in Fig. 2(c). Under ML, the network configuration kept leaving the payoff-maximizing configurations periodically, for significant durations of time. However, under LLL, after reaching an optimal configuration, the network never left the configuration for long durations of time. Every time it left the optimal configuration because of noise, it immediately switched back. We can summarize the observations from the simulation setup as follows.

  1. In the short run, LLL can drive the network configuration to a NE more quickly than ML.

  2. Over the short, medium, and long run, starting from the same initial condition, LLL and ML can drive the network configuration along entirely different paths that lead to the payoff-maximizing configurations in the long run.

  3. The effect of noise on LLL and ML is significantly different.

From the above observations, we can conclude that the concept of stochastic stability alone is not sufficient to describe the behavior of stochastic learning dynamics. However, these observations are based on the simulation of a particular system under certain conditions, which prohibits us from drawing any general conclusions regarding the behavior of these learning rules. Therefore, we present a general framework to analyze and compare the behavior of different learning rules that have the same stochastically stable states. We establish that the setup of Cycle Decomposition is useful for the comparative analysis of learning dynamics in games. In particular, we identify and compare the parameters that enable us to explain the system behavior that we observed in the motivating setup of sensor coverage games.

V Cycle Decomposition

Consider a Markov chain on a finite state space $\mathcal{X}$ with transition matrix $P_T$. We assume that the transition matrix satisfies the following property:

$$\frac{1}{A} e^{-V(x, y)/T} \leq P_T(x, y) \leq A \, e^{-V(x, y)/T} \qquad (7)$$

for all $x \neq y$, where $A \geq 1$ is a constant that does not depend on $T$, and

$$P_T(x, x) = 1 - \sum_{y \neq x} P_T(x, y). \qquad (8)$$

Here $V : \mathcal{X} \times \mathcal{X} \to [0, +\infty]$ is defined as follows:

$$V(x, y) = -\lim_{T \to 0} T \ln P_T(x, y),$$

with the convention that $V(x, y) = +\infty$ if $P_T(x, y) = 0$. For any pair $(x, y)$, $V(x, y)$ can be considered as a transition cost from $x$ to $y$. It is assumed that the function $V$ is irreducible, which implies that for any state pair $(x, y)$, there exists a path $\omega = (\omega_1, \ldots, \omega_m)$ of length $m$ such that $\omega_1 = x$, $\omega_m = y$, and

$$\sum_{k=1}^{m-1} V(\omega_k, \omega_{k+1}) < +\infty.$$

Definition V.1

A cost function $V$ is induced by a potential function $\phi : \mathcal{X} \to \mathbb{R}$ if, for all $x$ and $y$ in $\mathcal{X}$, the following weak reversibility condition is satisfied:

$$V(x, y) - V(y, x) = \phi(x) - \phi(y). \qquad (9)$$
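As a quick sanity check, resistances of the form $r(x, y) = [\phi(x) - \phi(y)]^+$, which is what the ML resistance (5) reduces to in a potential game, satisfy (9). A minimal numerical sketch with an arbitrary illustrative potential:

```python
# Minimal check that r(x, y) = [phi(x) - phi(y)]^+ satisfies the weak
# reversibility condition (9): r(x, y) - r(y, x) = phi(x) - phi(y).
# The potential values are arbitrary illustrative numbers.
import itertools
import numpy as np

phi = {"a": 3.0, "b": 1.5, "c": -0.5, "d": 2.0}

def resistance(x, y):
    return max(phi[x] - phi[y], 0.0)   # [phi(x) - phi(y)]^+

ok = all(
    np.isclose(resistance(x, y) - resistance(y, x), phi[x] - phi[y])
    for x, y in itertools.product(phi, repeat=2)
)
print(ok)   # True: this cost function is induced by the potential phi
```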

The following result is from [32] (Prop. 4.1).

Proposition V.1

Let $\{P_T\}_{T > 0}$ be a family of Markov chains over state space $\mathcal{X}$ such that the transition matrices satisfy (7) and (8). If the function $V$ is induced by a potential $\phi$ as defined in Def. V.1, then the stationary distribution $\pi_T$ is such that

$$\lim_{T \to 0} \pi_T(x) > 0 \iff x \in \arg\max_{y \in \mathcal{X}} \phi(y).$$

Thus, in the limit as $T \to 0$, only the states maximizing the potential will have a non-zero probability. Based on Prop. V.1, there is an entire class of Markov chains that lead to potential maximizers. We note that the results in [32] were derived for minimizing a potential function. Since we are dealing with maximizing a payoff, all the definitions and results are adapted accordingly.

V-A Cycle Decomposition Algorithm

The Cycle Decomposition Algorithm (CDA) was presented in [26], based on ideas originally developed in [17]. It was proposed to study the transient behavior of Markov chains that satisfy (7), (8), and (9) and lead to the stationary distribution characterized in Prop. V.1. In this algorithm, the state space is decomposed into unique cycles through an iterative procedure. The formal definition of a cycle as presented in [32] and [21] is as follows.

Definition V.2

A set $C \subseteq \mathcal{X}$ is a cycle if it is a singleton or it satisfies either of the following two conditions.

  1. For any $x$, $y$ in $C$,

    $$\lim_{T \to 0} \Pr\big(\tau_{\partial C} < \tau_y \mid X_0 = x\big) = 0.$$

  2. For any $x$, $y$ in $C$,

    $$\lim_{T \to 0} T \ln \mathbb{E}\big[N_{x, y} \mid X_0 = x\big] > 0,$$

    where $N_{x, y}$ is the number of round trips including $x$ and $y$ performed by the chain before leaving $C$.

The first condition simply means that a subset $C$ is a cycle if, starting from some $x \in C$, the probability of leaving $C$ before visiting every state of $C$ is exponentially small. Thus, the chain visits every state of $C$ with probability approaching one before it exits $C$. The second condition states that the expected number of times each $y \in C$ is visited by the chain starting from any $x \in C$ is exponentially large.

For higher order comparative analysis, we first decompose the state space into cycles via CDA. Then, we compare the properties of the cycles under each learning dynamics. For completeness of presentation, we reproduce CDA in Alg. 1. The outcome of CDA as presented in Alg. 1 is the set of cycles defined in (12). To explain system behavior using CDA, we need the following definitions and results, which are mostly adapted from [26].

The minimum cost of leaving a state $x \in \mathcal{X}$ is

$$V_e(x) = \min_{y \neq x} V(x, y).$$

We will refer to $V_e(x)$ as the exit height of state $x$. For any state $x$ and any state $y$, we define

$$\tilde V(x, y) = V(x, y) - V_e(x),$$

i.e., $\tilde V(x, y)$ is the excess cost above the minimum transition cost from $x$. For a path $\omega = (\omega_1, \ldots, \omega_m)$, the cost is

$$V(\omega) = \sum_{k=1}^{m-1} V(\omega_k, \omega_{k+1}).$$

The exterior boundary of a set $A$ is

$$\partial A = \{y \notin A : V(x, y) < +\infty \text{ for some } x \in A\}.$$

The interior boundary of a set $A$ is

$$\partial^- A = \{x \in A : V(x, y) < +\infty \text{ for some } y \notin A\}.$$

We say that a cycle is non-trivial if it has a non-zero exit height. Thus, a singleton is a non-trivial cycle if it is a local maximum of the potential. The order of the decomposition of the state space is the number of iterations of CDA after which the entire state space is merged into a single cycle.

An increasing family of cycles is defined for each $x \in \mathcal{X}$ as follows. Define $\mathcal{C}_0(x) = \{x\}$. For each $k \geq 1$,

$$\mathcal{C}_k(x) = \text{the cycle of order } k \text{ that contains } \mathcal{C}_{k-1}(x). \qquad (10)$$

Given a set $A \subseteq \mathcal{X}$ such that $|A| > 1$, the maximal proper partition of $A$ is the partition of $A$ into the maximal cycles that are strictly contained in $A$.

For a cycle $C$,

  • the order of $C$ is the level of the decomposition at which $C$ first appears;

  • the exit height $H_e(C)$ is the minimum cost that the chain has to pay to leave $C$;

  • the mixing height $H_m(C)$ is the maximum cost that the chain has to pay to move between any two states within $C$;

  • the potential of $C$ is $\phi(C) = \max_{x \in C} \phi(x)$;

  • the communication altitude $A(x, y)$ between any two states $x$ and $y$ is the best value, over all paths $\omega \in \Omega_{x, y}$, of the lowest potential level encountered along the path, where $\omega_k$ is the $k$th element in the path $\omega$;

  • the communication altitude of a cycle $C$ is $A(C) = \min_{x, y \in C} A(x, y)$.

1: Define level zero as the set of singleton cycles $\mathcal{X}_0 = \{\{x\} : x \in \mathcal{X}\}$
with communication costs $V_0(\{x\}, \{y\}) = V(x, y)$.
2: Suppose the level $k$ decomposition $\mathcal{X}_k$ with costs $V_k$ has been constructed.
3: while $|\mathcal{X}_k| > 1$ do
4:     Form a graph $G_k$ such that each cycle $C \in \mathcal{X}_k$ is a vertex in $G_k$ and the edges are weighted by the costs $V_k$.
5:     Compute the minimum exit cost $V^e_k(C) = \min_{C' \neq C} V_k(C, C')$ for every $C \in \mathcal{X}_k$.
6:     For every $C$ and $C' \neq C$, compute the excess cost $\tilde V_k(C, C') = V_k(C, C') - V^e_k(C)$.
7:     Form a graph $\tilde G_k$ such that, for each vertex $C$, only the outgoing edges with $\tilde V_k(C, C') = 0$ are retained. The graph $\tilde G_k$ is a subgraph of $G_k$.
8:     Compute the strongly connected components in $\tilde G_k$. A set $S$ is a strongly connected component of $\tilde G_k$ if for every $C$ and $C'$ in $S$, there exists a directed path in $\tilde G_k$ from $C$ to $C'$.
9:     Let $\mathcal{S}_k$ be the set of strongly connected components in $\tilde G_k$.
10:    Construct the set $\mathcal{X}_{k+1}$ by merging the cycles that belong to the same strongly connected component into a single cycle of level $k+1$.
11:    For each new cycle in $\mathcal{X}_{k+1}$, define its exit height, mixing height, and potential from the corresponding quantities of its constituent cycles.
12:    Compute the renormalized cost $V_{k+1}$ between the sets in $\mathcal{X}_{k+1}$ from the level $k$ costs. (11)
13:    $k \leftarrow k + 1$.
14: end while

The output of the algorithm is the collection of all cycles generated at all levels,

$$\mathcal{C} = \bigcup_{k \geq 0} \mathcal{X}_k. \qquad (12)$$

Algorithm 1 Cycle Decomposition
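To make the merging step concrete, here is a minimal Python sketch of one level of the decomposition (steps 4-10), assuming the cycles and costs are given as plain Python objects; the helper name, the toy cost values, and the use of networkx are illustrative choices, not part of the algorithm in [26].

```python
# A minimal sketch of one level of the cycle decomposition: merge cycles that
# are connected at zero excess cost. Names and toy values are illustrative.
import networkx as nx

def one_level_of_cda(states, V):
    """states: iterable of cycle labels at the current level.
    V: dict mapping (C, C') -> transition cost between distinct cycles.
    Returns the next-level cycles as a list of frozensets."""
    # Minimum exit cost from each cycle (step 5).
    exit_cost = {
        C: min(V[C, Cp] for Cp in states if Cp != C and (C, Cp) in V)
        for C in states
    }
    # Keep only zero-excess-cost edges (steps 6-7).
    G = nx.DiGraph()
    G.add_nodes_from(states)
    for (C, Cp), cost in V.items():
        if C != Cp and cost - exit_cost[C] == 0:
            G.add_edge(C, Cp)
    # Strongly connected components become the next-level cycles (steps 8-10).
    return [frozenset(scc) for scc in nx.strongly_connected_components(G)]

# Toy usage: four states with a two-state "valley" {a, b} that merges first.
states = ["a", "b", "c", "d"]
V = {("a", "b"): 0, ("b", "a"): 0, ("b", "c"): 2, ("c", "b"): 0,
     ("c", "d"): 1, ("d", "c"): 3}
print(one_level_of_cda(states, V))   # e.g. [{'a', 'b'}, {'c'}, {'d'}]
```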

The exit and mixing heights of a cycle provide an estimate of how long the Markov chain will remain in the cycle. The potential of a cycle is the maximum potential of a state within the cycle. The communication altitude was introduced in [26], where it was shown that it relates the exit height and the potential of a cycle as follows.