1 Introduction
Recently, multiagent formulations have been utilized to tackle distributed optimization problems, since communication and computational complexity might be an issue under centralized schemes. In such formulations, decisions are usually taken in a repeated fashion, where agents select their next actions based on their own prior experience. Naturally, such multiagent interactions can be designed as strategic-form games, where agents are repeatedly involved in a strategic interaction with a fixed payoff or utility matrix. Such a framework finds numerous applications, including, for example, the problem of distributed overlay routing [2], distributed topology control [3] and distributed resource allocation [4].
Given the repeated fashion of the involved strategic interactions in such formulations, several questions naturally emerge: a) Can agents “learn” to asymptotically select optimal decisions/actions?, b) What information should agents share with each other?, and c) What is the computational complexity of the learning process? In the context of engineering applications, it is usually desirable that each agent shares a minimal amount of information with other agents, while the computational complexity of the learning process remains small. A class of learning dynamics that achieves small communication and computational complexity is so-called payoff-based learning. Under this class of learning dynamics, each agent only receives measurements of its own utility function, without the need to know the actions selected by other agents, or the details of its own utility function (i.e., its dependence on other agents’ actions).
In such repeatedly-played strategic-form games, a popular objective for payoff-based learning is to guarantee convergence (in some sense) to Nash equilibria. Convergence to Nash equilibria may be desirable, especially when the set of optimal centralized solutions coincides with the set of Nash equilibria.
Reinforcement-based learning has been utilized in strategic-form games in order for agents to gradually learn to play Nash equilibria. It may appear under alternative forms, including discrete-time replicator dynamics [5], learning automata [6, 7] or approximate policy iteration and Q-learning [8]. In all these classes of learning dynamics, deriving conditions under which convergence to Nash equilibria is achieved may not be a trivial task, especially for a large number of agents (as will be discussed in detail in the forthcoming Section 2).
In the present paper, we consider a class of reinforcement-based learning introduced in [9] that is closely related to both discrete-time replicator dynamics and learning automata. We will refer to this class of dynamics as perturbed learning automata. The main differences from prior reinforcement learning schemes lie in a) the step-size sequence, and b) the perturbation (or mutation) term. The step-size sequence is assumed constant, thus introducing a fading-memory effect of past experiences in each agent’s strategy. The perturbation term, on the other hand, introduces errors in each agent’s selection process. Both features can be used for designing a desirable asymptotic behavior.
We provide an analytical framework for deriving conclusions about the asymptotic behavior of the dynamics that is based upon an explicit characterization of the invariant probability measure of the induced Markov chain. In particular, we show that in all strategic-form games satisfying the Positive-Utility Property, the support of the invariant probability measure coincides with the set of pure strategy profiles. This extends prior work in coordination games, where convergence to mixed strategy profiles may only be excluded under strong conditions on the payoff matrix (e.g., existence of a potential function). Furthermore, we provide a methodology for computing the set of stochastically stable states in all positive-utility games. We illustrate this methodology in the context of coordination games and provide a simulation study in distributed network formation.
In the remainder of the paper, Section 2 presents the investigated class of learning dynamics, related work and the main contributions. Section 3 provides a simplification in the characterization of stochastic stability, while Section 4 presents its technical derivation. This result is utilized for computing the stochastically stable states in positive-utility games in Section 5. In Section 6, we present an illustration of the proposed methodology in the context of coordination games, together with a simulation study in distributed network formation. Finally, Section 7 presents concluding remarks.
Notation:

For a Euclidean topological space , let denote the neighborhood of , i.e.,
where denotes the Euclidean distance.

denotes the probability simplex of dimension , i.e.,

For some set in a topological space , let denote the index function, i.e.,

For a finite set , denotes its cardinality.

For a finite set and any probability distribution , the random selection of an element of will be denoted by . If , the random selection will be denoted by .
denotes the Dirac measure at .

denotes the natural logarithm.
2 Perturbed Learning Automata
2.1 Terminology
We consider the standard setup of finite strategic-form games. Consider a finite set of agents (or players) , and let each agent have a finite set of actions . Let denote any such action of agent . The set of action profiles is the Cartesian product , and let be a representative element of this set. We will denote by the complementary set, and often decompose an action profile as . The payoff/utility function of agent is a mapping . A strategic-form game is defined by the triple .
For the remainder of the paper, we will be concerned with strategic-form games that satisfy the Positive-Utility Property.
Property 2.1 (Positive-Utility Property)
For any agent and any action profile , .
This property is rather generic and applies to a large family of games. For example, games in which some form of alignment of interests exists between agents (e.g., coordination games [10] or weakly-acyclic games [11]) can be designed to satisfy this property, since agents’ utilities/preferences are rather close to each other at any given action profile. However, in the forthcoming analysis, we do not impose any structural constraint other than Property 2.1.
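Since the property only involves finitely many joint action profiles, it can be checked by direct enumeration in small games. The sketch below is purely illustrative; the `utility(i, a)` interface and the example payoffs are assumptions, not notation from the paper.

```python
from itertools import product

def is_positive_utility(actions, utility):
    """Check Property 2.1: every agent's utility is strictly positive
    at every joint action profile of the finite game.
    actions[i] is the finite action set of player i (hypothetical interface);
    utility(i, a) is player i's payoff at the joint profile a."""
    num_players = len(actions)
    for a in product(*actions):
        if any(utility(i, a) <= 0 for i in range(num_players)):
            return False
    return True

# A 2-player coordination game with strictly positive payoffs satisfies it:
coord = lambda i, a: 2.0 if a[0] == a[1] else 0.5
```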
2.2 Perturbed Learning Automata
We consider a form of reinforcement-based learning that belongs to the general class of learning automata [7]. In learning automata, each agent updates a finite probability distribution representing its beliefs with respect to the most profitable action. The precise manner in which this distribution changes at each time step, depending on the performed action and the response of the environment, completely defines the learning model.
The proposed learning model is described in Table 1. At the first step, each agent updates its action given its current strategy vector . Its selection is slightly perturbed by a perturbation (or mutations) factor , such that, with a small probability agent follows a uniform strategy (or, it trembles). At the second step, agent evaluates its new selection by collecting a utility measurement, while in the last step, agent updates its strategy vector given its new experience.
Here, we identify actions with vertices of the simplex, . For example, if agent selects its th action at time , then . To better see how the strategies evolve, let us consider the following toy example. Let the current strategy of player be , i.e., player has two actions, each assigned probability . Let also , i.e., player selects the first action according to rule (1). Then, the new strategy vector for agent , updated according to rule (2), is:
In other words, when player selects its first action, the probability assigned to this action increases proportionally to the reward received from it. Such dynamics reinforce repeated selection; however, the size of the reinforcement depends on the reward received.
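The toy example above can be sketched in code. This is a minimal, hypothetical rendering of rules (1)–(2): with probability `lam` the agent trembles to a uniform action, otherwise it samples from its current strategy, and the selected action is then reinforced proportionally to the measured utility. The variable names are assumptions, not notation from the paper.

```python
import random

def perturbed_update(x, utility, eps, lam, rng=random):
    """One iteration for a single agent: perturbed action selection (rule (1))
    followed by reinforcement of the selected action (rule (2))."""
    n = len(x)
    if rng.random() < lam:          # tremble: uniform selection
        a = rng.randrange(n)
    else:                           # sample a ~ x (inverse-CDF sampling)
        r, acc, a = rng.random(), 0.0, n - 1
        for i, p in enumerate(x):
            acc += p
            if r < acc:
                a = i
                break
    u = utility(a)                  # payoff measurement, assumed in (0, 1]
    # x <- x + eps * u * (e_a - x); remains in the simplex when eps * u <= 1
    x_new = [xi + eps * u * ((1.0 if i == a else 0.0) - xi)
             for i, xi in enumerate(x)]
    return a, x_new

# Toy example: x = [1/2, 1/2]; if action 0 is selected with reward u = 1 and
# eps = 0.1, the new strategy is [0.55, 0.45].
```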
By playing a strategic-form game repeatedly over time, players do not always experience the same reward when selecting the same action, since other players may also change their actions. This dynamic element in the size of the reinforcement is the factor that complicates the convergence analysis, as will become clear in the forthcoming discussion of related work.
Note that, by letting the step size be sufficiently small, and since the utility function is uniformly bounded in , for all .
In case , the above update recursion will be referred to as the unperturbed learning automata.
2.3 Related work
Discrete-time replicator dynamics
A type of learning dynamics quite closely related to the dynamics of Table 1 is the discrete-time version of replicator dynamics (cf., [12]). It has been used in different forms, depending primarily on the step-size sequence in Table 1. For example, Arthur [5] considered a similar rule, with and the step size of each agent defined as , for some positive constant and for (in place of the constant step size of (2)). A comparable model is also used by Hopkins and Posch in [13], with , where is the accumulated benefit of agent up to time , which gives rise to the urn process of Erev-Roth [14]. Some similarities are also shared with Cross’ learning model of [15], where and , and its modification presented by Leslie in [16], where , instead, is assumed to decrease with time.
The main difference of the proposed dynamics of Table 1 lies in the perturbation parameter, which was first introduced and analyzed in [9]. A state-dependent perturbation term has also been investigated in [17]. The perturbation parameter may serve as an equilibrium selection mechanism, since it excludes convergence to non-Nash action profiles [9]. It resolved one of the main issues of discrete-time replicator dynamics, namely the positive probability of convergence to action profiles that are not Nash equilibria (briefly, non-Nash action profiles).
Although excluding convergence to non-Nash action profiles can be guaranteed by using a sufficiently small perturbation, establishing convergence to action profiles that are Nash equilibria (pure Nash equilibria) may still be an issue. This is desirable in the context of coordination games [18], where Pareto-efficient outcomes are usually pure Nash equilibria (see, e.g., the definition of a coordination game in [10]). As presented in [17], convergence to pure Nash equilibria can be guaranteed only under strong conditions on the payoff matrix. For example, as shown in [17, Proposition 8], and under the ODE method for stochastic approximations, it requires a) the existence of a potential function, and b) conditions on the Jacobian matrix of the potential function. Even if a potential function does exist, verifying condition (b) is practically infeasible for games of more than 2 players [17].
On the other hand, an important side-benefit of using this class of dynamics is the indirect “filtering” of the utility-function measurements (through the formulation of the strategy vectors in (2)). This is demonstrated, for example, in [13] for the Erev-Roth model [14], where the robustness of convergence/non-convergence asymptotic results is presented under the presence of noise in the utility measurements.
Learning automata
Learning automata, as first introduced by [6], have attracted attention with respect to the control of complex and distributed systems due to their simple structure and low computational complexity (cf., [7, Chapter 1]). Variable-structure stochastic automata may incorporate a form of reinforcement of favorable actions. Therefore, such stochastic automata bear many similarities to the discrete-time analogs of replicator dynamics discussed above. An example of such stochastic learning automata is the linear reward-inaction scheme described in [7, Chapter 4]. Compared with the reinforcement rule of (2), the linear reward-inaction scheme accepts a utility function of the form , where corresponds to an unfavorable response and corresponds to a favorable one. More general forms can also be used, where the utility function may accept discrete or continuous values in the unit interval .
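As a minimal sketch (with hypothetical variable names), the linear reward-inaction update leaves the strategy unchanged on an unfavorable response and reinforces the selected action on a favorable one; with a continuous response in the unit interval it takes the same form as rule (2):

```python
def linear_reward_inaction(x, a, beta, eps):
    """Linear reward-inaction (L_RI) step: x <- x + eps * beta * (e_a - x).
    beta = 0 (unfavorable response) leaves x unchanged; beta = 1 (favorable)
    reinforces the selected action a. beta may also take values in [0, 1]."""
    return [xi + eps * beta * ((1.0 if i == a else 0.0) - xi)
            for i, xi in enumerate(x)]
```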
Analysis of learning automata in games has been restricted to zero-sum and identical-interest games [7, 19]. In identical-interest games, convergence analysis has been derived for a small number of players and actions, due to the difficulty in deriving conditions for absolute monotonicity, which corresponds to the property that the expected utility received by each player increases monotonically in time (cf., [7, Definition 8.1]). Similar results are presented in [19].
The property of absolute monotonicity is closely related to the existence of a potential function, as in the case of potential games [20]. Similarly to the discrete-time replicator dynamics, convergence to non-Nash action profiles cannot be excluded when the step-size sequence is constant, even if the utility function satisfies the same conditions as in the learning automata. (The behavior under a decreasing step size is different, as [17, Proposition 2] has shown.) Furthermore, deriving conditions for excluding convergence to mixed strategy profiles in coordination games continues to be an issue for learning automata, as in the case of discrete-time replicator dynamics.
Recognizing these issues, reference [21] introduced a class of linear reward-inaction schemes in combination with a coordinated exploration phase, so that convergence to the efficient (pure) Nash equilibrium is achieved. However, coordination of the exploration phase requires communication between the players, an approach that does not fit the distributed nature of the dynamics pursued here.
Q-learning
Similar questions of convergence to Nash equilibria also appear in alternative reinforcement-based learning formulations, such as approximate dynamic programming and Q-learning. Usually, under Q-learning, players keep track of the discounted running average reward received by each action, based on which optimal decisions are made (see, e.g., [22]). Convergence to Nash equilibria can be accomplished under a stronger set of assumptions, which increases the computational complexity of the dynamics. For example, in the Nash-Q learning algorithm of [8], it is indirectly assumed that agents have full access to the joint action space and the rewards received by the other agents.
More recently, reference [23] introduced a Q-learning scheme in combination with either adaptive play or better-reply dynamics in order to attain convergence to Nash equilibria in potential games [20] or weakly-acyclic games. However, this form of dynamics requires that each player observes the actions selected by the other players, since a Q-value needs to be assigned to each joint action.
When the evaluation of the Q-values is totally independent, as in the individual Q-learning of [22], convergence to Nash equilibria has been shown only for 2-player zero-sum games and 2-player partnership games with countably many Nash equilibria. Currently, there exist no convergence results in multiplayer games. This is a main drawback for Q-learning dynamics in strategic-form games, as also pointed out in [24]. To overcome this drawback, in the context of stochastic dynamic games, reference [24] employs an additional feature (motivated by [11]), namely exploration phases. In any such exploration phase, all agents use constant policies, which allows the accurate computation of the optimal Q-factors. We may argue that the introduction of common exploration phases for all agents partially destroys the distributed nature of the dynamics, since it requires synchronization between agents.
Aspiration-based learning
Recently, there have been several attempts to establish convergence to Nash equilibria through alternative payoff-based learning dynamics (see, e.g., the benchmark-based dynamics of [11] for convergence to Nash equilibria in weakly-acyclic games, the trial-and-error learning of [25] for convergence to Nash equilibria in generic games, the mood-based dynamics of [26] for maximizing welfare in generic games, or the aspiration learning in [10] for convergence to efficient outcomes in coordination games). We will refer to such approaches as aspiration-based learning. For these types of dynamics, convergence to Nash equilibria or efficient outcomes can be established without requiring any strong monotonicity properties (as in the multiplayer weakly-acyclic games in [11]).
The case of noisy utility measurements, which are present in many engineering applications, has not currently been addressed through aspiration-based learning. The only exception is reference [11], under benchmark-based dynamics, where (synchronized) exploration phases are introduced, during which each agent plays a fixed action. If such exploration phases are long in duration (as required by the results in [11]), this may reduce the robustness of the dynamics to dynamic changes in the environment (e.g., changes in the utility function). One reason that such robustness analysis is currently not possible in this class of dynamics is the fact that decisions are taken directly based on the measured performances (e.g., by comparing the currently measured performance with the benchmark performance in [11]).
2.4 Contributions
The aforementioned literature on payoff-based learning dynamics in strategic-form games can be grouped into two main categories, namely reinforcement-based learning (including discrete-time replicator dynamics, learning automata and Q-learning) and aspiration-based learning. Summarizing their main advantages/disadvantages, we may make the following high-level observations.

(O1) Strong asymptotic convergence guarantees for a large number of players, even for generic games, are currently possible under aspiration-based learning. Similar results in reinforcement-based learning are currently restricted to games with a small number of players and under strong structural assumptions (e.g., the existence of a potential function). See, for example, the discussion on discrete-time replicator dynamics and learning automata in [17], or the discussion on Q-learning in [24].

(O2) Noisy observations can be “handled” through reinforcement-based learning due to the indirect filtering of the observation signals (e.g., through the strategy-vector formulation in the model of Table 1, or in the formulation of the Q-factors in Q-learning). This is demonstrated, for example, in the convergence/non-convergence asymptotic results presented in [13] for a variation of the proposed learning dynamics of Table 1 (with no perturbation and a decreasing step size) and under the presence of noise. Similar effects in aspiration-based learning can currently be achieved only through the introduction of synchronized exploration phases, as discussed in Section 2.3.
Motivated by these two observations (O1)–(O2), and the inability of reinforcement-based learning to provide strong convergence guarantees in large games, this paper advances asymptotic convergence guarantees for the class of reinforcement-based learning described in Table 1 (closely related to both discrete-time replicator dynamics and learning automata, as discussed in Section 2.3). Our goal is to go beyond the common restrictions of a small number of players and strong assumptions on the game structure (such as the existence of a potential function).
The proposed dynamics (also referred to as perturbed learning automata) were first introduced in [9] to resolve stability issues on the boundary of the domain appearing in prior schemes [5, 13]. This was achieved through the introduction of the perturbation factor of Table 1. However, strong convergence guarantees (e.g., w.p.1 convergence to Nash equilibria or efficient outcomes) are currently limited to a small number of players and strict structural assumptions, e.g., the existence of a potential function and additional conditions on its Jacobian matrix [17].
In this paper, we drop the assumption of a decreasing step-size sequence, and instead consider the case of a constant step size. This selection increases the adaptivity of the dynamics to varying conditions (e.g., the number of agents or the utility function). Furthermore, we provide a stochastic-stability analysis that yields a detailed characterization of the invariant probability measure of the induced Markov chain with no restrictions on the number of players. In particular, our contributions are the following:

(C1) We provide an equivalent finite-dimensional characterization of the infinite-dimensional induced Markov chain of the dynamics, which simplifies significantly the characterization of its invariant probability measure. This simplification is based upon a weak-convergence result and applies to any strategic-form game with the Positive-Utility Property 2.1 (Theorem 3.1).

(C2) We capitalize on this simplification and provide a methodology for computing stochastically stable states in positive-utility strategic-form games (Theorem 5.1).

(C3) We illustrate the utility of this methodology by establishing stochastic stability in a class of coordination games with no restriction on the number of players or actions (Theorem 6.1).
These contributions significantly extend the utility of reinforcement-based learning for the reasons explained in observation (O1). We note that the illustration result in coordination games (contribution (C3) above) is of independent interest. To the best of our knowledge, it is the first convergence result in the context of reinforcement-based learning in repeatedly-played strategic-form games with the following features: a) a completely distributed setup (i.e., with no information exchange), b) more than two players, and c) a set of weakly-acyclic games that do not require the strong condition of the existence of a potential function.
This paper is an extension of an earlier version that appeared in [1], which only focused on contribution (C1) above.
3 Stochastic Stability
In this section, we provide a characterization of the invariant probability measure of the induced Markov chain of the dynamics of Table 1. Its importance lies in an equivalence relation (established through a weak-convergence argument) of this measure with an invariant distribution of a finite-state Markov chain. Characterization of the stochastic stability of the dynamics will then follow directly from Birkhoff’s individual ergodic theorem.
This simplification in the characterization of will be the first important step for providing specialized results for stochastic stability in strategic-form games.
3.1 Terminology and notation
Let , where , i.e., pairs of joint actions and strategy profiles . We will denote the elements of the state space by .
The set is endowed with the discrete topology, with its usual Euclidean topology, and with the corresponding product topology. We also let denote the Borel field of , and the set of probability measures (p.m.) on , endowed with the Prohorov topology, i.e., the topology of weak convergence. The learning algorithm of Table 1 defines a Markov chain on this state space. Let denote its transition probability function (t.p.f.), parameterized by . We refer to the process with as the perturbed process. Let also denote the t.p.f. of the unperturbed process, i.e., when .
We let denote the Banach space of realvalued continuous functions on under the supnorm (denoted by ) topology. For , define
and
The process governed by the unperturbed process will be denoted by . Let denote the canonical path space, i.e., an element is a sequence , with . We use the same notation for the elements of the space and for the coordinates of the process . Let also denote the unique p.m. induced by the unperturbed process on the product field of , initialized at , and the corresponding expectation operator. Let also , , denote the field of generated by .
3.2 Stochastic stability
First, we note that both and () satisfy the weak Feller property (cf., [27, Definition 4.4.2]).
Proposition 3.1
Both the unperturbed process () and the perturbed process () have the weak Feller property.
Proof.
See Appendix 8.
The measure is called an invariant probability measure (i.p.m.) for if
Since defines a locally compact separable metric space and , have the weak Feller property, they both admit an i.p.m., denoted and , respectively [27, Theorem 7.2.3].
We would like to characterize the stochastically stable states of the dynamics, that is, any state for which any collection of i.p.m.’s satisfies . As the forthcoming analysis will show, the stochastically stable states will be a subset of the set of pure strategy states (p.s.s.), defined as follows:
Definition 3.1 (Pure Strategy State)
A pure strategy state is a state such that for all , , i.e., coincides with the vertex of the probability simplex which assigns probability 1 to action .
We will denote the set of pure strategy states by .
Theorem 3.1 (Stochastic Stability)
There exists a unique probability vector such that for any collection of i.p.m.’s , the following hold:

where convergence is in the weak sense.

The probability vector is an invariant distribution of the (finitestate) Markov process , such that, for any ,
(5) for any sufficiently small, where is the t.p.f. corresponding to only one player trembling (i.e., following the uniform distribution of rule (1)).
The proof of Theorem 3.1 requires a series of propositions and will be presented in detail in Section 4.
Theorem 3.1 implicitly provides a stochastic-stability argument. In fact, the expected asymptotic behavior of the dynamics can be characterized by and, therefore, . In particular, by Birkhoff’s individual ergodic theorem [27, Theorem 2.3.4], the weak convergence of to , and the fact that is ergodic, we have that the expected percentage of time that the process spends in any such that is given by as the experimentation probability approaches zero and time increases, i.e.,
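In one standard notation (assumed here, since the symbols of the display did not survive extraction: $Z_k$ for the pure strategy state occupied at time $k$, $\lambda$ for the perturbation, and $\pi_z$ for the weight of state $z$ under the invariant distribution $\pi$ of Theorem 3.1), the ergodic statement reads:

```latex
% Assumed notation: Z_k = state at time k, z = a pure strategy state,
% pi_z = weight of z under the invariant distribution of Theorem 3.1.
\lim_{\lambda \to 0} \, \lim_{t \to \infty}
  \frac{1}{t} \sum_{k=0}^{t-1} \mathbb{I}\{ Z_k = z \}
  \;=\; \pi_z \qquad \text{w.p.~1.}
```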
3.3 Discussion
Theorem 3.1 establishes “equivalence” (in a weak-convergence sense) of the original (perturbed) learning process with a simplified process, where only one player trembles at the first iteration and no player trembles thereafter. This simplification in the analysis was originally exploited to analyze aspiration learning dynamics in [28, 10], and it is based upon the observation that, under the unperturbed process, agents’ strategies will converge to a pure strategy state, as will be shown in the forthcoming Section 4.
Furthermore, the limiting behavior of the original (perturbed) dynamics can be characterized by the (unique) invariant distribution of a finite-state Markov chain whose states correspond to the pure-strategy states (Definition 3.1). In other words, we should expect that, as the perturbation parameter approaches zero, the algorithm spends the majority of the time on pure strategy states. The importance of this result lies in the fact that no constraints have been imposed on the payoff matrix of the game other than the Positive-Utility Property 2.1.
In the forthcoming Section 5, we will use this result to provide a methodology for computing the set of stochastically stable states. This methodology will further be illustrated in the context of coordination games.
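Theorem 3.1 reduces the asymptotic analysis to the invariant distribution of a finite-state chain over the pure strategy states. As a sketch of how such a distribution can be computed in practice (the 2-state transition matrix below is hypothetical, standing in for the one-player-trembling t.p.f. of (5)):

```python
def stationary_distribution(P, iters=10_000):
    """Invariant distribution of a finite-state Markov chain with
    row-stochastic transition matrix P, via power iteration pi <- pi P.
    Assumes irreducibility and aperiodicity, so the limit is unique."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

# Hypothetical 2-state example (two pure strategy states):
P = [[0.9, 0.1],
     [0.4, 0.6]]
# Detailed balance of the flows gives pi = [0.8, 0.2]: the first pure
# strategy state carries the dominant weight.
```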
4 Technical Derivation
In this section, we provide the main steps for the proof of Theorem 3.1. We begin by investigating the asymptotic behavior of the unperturbed process , and then we characterize the i.p.m. of the perturbed process with respect to the p.s.s.’s .
4.1 Unperturbed Process
For define the sets
Note that is a nonincreasing sequence, i.e., , while is nondecreasing, i.e., . Let
In other words, corresponds to the event that agents eventually play the same action profile, while corresponds to the event that agents never change their actions.
Proposition 4.1 (Convergence to p.s.s.)
Let us assume that the step size is sufficiently small such that for all and . Then, the following hold:

,

.
Proof.
See Appendix 9.
Statement (a) of Proposition 4.1 asserts that the probability that agents never change their actions is bounded away from zero, while statement (b) asserts that the probability that agents eventually play the same action profile is one. This also indicates that any invariant measure of the unperturbed process can be characterized with respect to the pure strategy states , which is established by the following proposition.
Proposition 4.2 (Limiting t.p.f. of unperturbed process)
Let denote an i.p.m. of . Then, there exists a t.p.f. on with the following properties:

for a.e. , is an i.p.m. for ;

for all , ;

is an i.p.m. for ;

the support of is on for all . (The support of a measure is the unique closed set of full measure such that every open set intersecting it has positive measure.)
Proof. The state space is a locally compact separable metric space and the t.p.f. of the unperturbed process admits an i.p.m. due to Proposition 3.1. Thus, statements (a), (b) and (c) follow directly from [27, Theorem 5.2.2 (a), (b), (e)].
(d) Let us assume that the support of includes points in other than the pure strategy states in . Then, there exists an open set such that and for some . According to (b), converges weakly to . Thus, from the Portmanteau theorem (cf., [27, Theorem 1.4.16]), the corresponding limit inferior is positive. This contradicts Proposition 4.1(b), which concludes the proof.
Proposition 4.2 states that the limiting unperturbed t.p.f. converges weakly to a t.p.f. that admits the same i.p.m. as . Furthermore, the support of is the set of pure strategy states . This is a rather important observation, since the limiting perturbed process can also be “related” (in a weak-convergence sense) to the t.p.f. , as will be shown in the following section.
4.2 Invariant probability measure (i.p.m.) of perturbed process
According to the definition of the perturbed learning automata of Table 1, when a player updates its action, there is a small probability that it “trembles,” i.e., it selects a new action according to a uniform distribution (instead of using its current strategy). Thus, we can decompose the t.p.f. induced by the one-step update as follows:
where is the probability that at least one agent trembles (since is the probability that no agent trembles), and corresponds to the t.p.f. when at least one agent trembles. Note that as .
Define also as the t.p.f. where only one player trembles, and as the t.p.f. where at least two players tremble. Then, we may write:
(6) 
where corresponds to the probability that at least two players tremble given that at least one player trembles. It also satisfies as , which establishes an approximation of by as the perturbation factor approaches zero.
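Under assumed symbols (the originals did not survive extraction: $n$ agents, each trembling independently with probability $\lambda$, $P$ the unperturbed t.p.f.), the decomposition described above can be written as:

```latex
% Assumed symbols: P_lambda = perturbed t.p.f., P = unperturbed t.p.f.,
% n = number of agents, each trembling independently with probability lambda.
P_\lambda \;=\; (1-\lambda)^n P \;+\; \big(1-(1-\lambda)^n\big)\, Q_\lambda ,
\qquad
Q_\lambda \;=\; \big(1-\psi(\lambda)\big)\, Q_{1} \;+\; \psi(\lambda)\, Q_{\geq 2},
```

where $Q_1$ denotes the t.p.f. when exactly one player trembles, $Q_{\geq 2}$ when at least two tremble, and $\psi(\lambda) \to 0$ as $\lambda \to 0$, so that $Q_\lambda$ is approximated by $Q_1$ for a small perturbation factor.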
Let us also define the infinite-step t.p.f. when trembling only at the first step (briefly, the lifted t.p.f.) as follows:
(7) 
where i.e., corresponds to the resolvent t.p.f.
In the following proposition, we establish weak convergence of the lifted t.p.f. with as , which will further allow for an explicit characterization of the weak limit points of the i.p.m. of .
Proposition 4.3 (i.p.m. of perturbed process)
The following hold:

For ,

For , .

Any invariant distribution of is also an invariant distribution of .

Any weak limit point in of , as , is an i.p.m. of .
Proof. (a) For any , we have
where we have used the property . Note that
From Proposition 4.2(b), we have that for any , there exists such that the r.h.s. is uniformly bounded by for all . Thus, the sequence
is Cauchy and therefore convergent (under the sup-norm). In other words, there exists such that . For every , we have
Note that
If we take , then the r.h.s. converges to zero. Thus,
which concludes the proof.
(b) For any , we have
The first term of the r.h.s. approaches 0 as according to (a). The second term of the r.h.s. also approaches 0 as since as .
(c) By definition of the perturbed t.p.f. , we have
Note that and where corresponds to the identity operator. Thus,
For any i.p.m. of , , we have
which equivalently implies that since . We conclude that is also an i.p.m. of .
(d) Let $\mu$ denote a weak limit point of $\mu_\lambda$ as $\lambda \to 0$. To see that such a limit point exists, take $\mu_\lambda$ to be an i.p.m. of $P_\lambda$, which, by (c), is also an i.p.m. of $\hat{P}_\lambda$; by compactness, the family $\{\mu_\lambda\}$ admits a weakly convergent subsequence. Note that the weak convergence of $\mu_\lambda$ to $\mu$ necessarily implies that $\mu \in \mathcal{P}(\mathcal{X})$. Note further that $\|\mu \hat{P} - \mu\|$ can be split into three terms by the triangle inequality. The first and the third term of the r.h.s. approach 0 as $\lambda \to 0$ due to the fact that $\mu_\lambda$ converges weakly to $\mu$. The same holds for the second term of the r.h.s. due to part (b). Thus, we conclude that any weak limit point $\mu$ of $\mu_\lambda$ as $\lambda \to 0$ is an i.p.m. of $\hat{P}$.
Proposition 4.3 establishes convergence (in a weak sense) of the i.p.m. $\mu_\lambda$ of the perturbed process to an i.p.m. of $\hat{P}$. In the following section, this convergence result will allow for a more explicit characterization of $\mu_\lambda$ as $\lambda \to 0$.
4.3 Equivalent finite-state Markov process
Define the finite-state Markov process as in (5).
Proposition 4.4 (Unique i.p.m. of the finite-state process)
There exists a unique i.p.m. $\pi$ of the finite-state Markov process. It satisfies
$$\pi = \sum_{x^{*} \in \mathcal{S}} \pi_{x^{*}} \, \delta_{x^{*}}, \qquad (8)$$
for some constants $\pi_{x^{*}} \geq 0$, $\sum_{x^{*} \in \mathcal{S}} \pi_{x^{*}} = 1$. Moreover, $\pi$ is an invariant distribution of $\hat{P}$, i.e., $\pi \hat{P} = \pi$.
Proof. From Proposition 4.2(d), we know that the support of any i.p.m. of the unperturbed process is the set of pure strategy states $\mathcal{S}$. Thus, the support of any i.p.m. of $\hat{P}$ is also on $\mathcal{S}$. From Proposition 4.3, we know that $\hat{P}$ admits an i.p.m., say $\mu$, whose support is also $\mathcal{S}$. Thus, $\mu$ admits the form of (8), for some constants $\pi_{x^{*}} \geq 0$, $\sum_{x^{*} \in \mathcal{S}} \pi_{x^{*}} = 1$.
For any two distinct $x^{*}, y^{*} \in \mathcal{S}$, note that the singleton $\{y^{*}\}$ is a continuity set of $\hat{P}(x^{*}, \cdot)$, i.e., its boundary has zero measure under $\hat{P}(x^{*}, \cdot)$. Thus, from the Portmanteau theorem, given that $\hat{P}_\lambda(x^{*}, \cdot)$ converges weakly to $\hat{P}(x^{*}, \cdot)$, the corresponding transition probabilities converge as well. If we also define $\pi_{x^{*}} \doteq \mu(\{x^{*}\})$, then the invariance of $\mu$ under $\hat{P}$ carries over to the restricted probabilities, which shows that $\pi$ is an invariant distribution of the finite-state Markov process.
It remains to establish uniqueness of the invariant distribution of the finite-state chain. Note that the set of pure strategy states $\mathcal{S}$ is isomorphic to the set of action profiles $\mathcal{A}$. If agent $i$ trembles (as the t.p.f. $Q_\lambda^{(1)}$ dictates), then all actions in $\mathcal{A}_i$ have positive probability of being selected, i.e., positive probability for all $i$ and all $\alpha_i \in \mathcal{A}_i$. It follows by Proposition 4.1 that the corresponding transitions have positive probability for all such actions. Finite induction then shows that every state in $\mathcal{S}$ can be reached from any other state. It follows that if we restrict the domain of the transition probabilities to $\mathcal{S}$, it defines an irreducible stochastic matrix. Therefore, the finite-state chain has a unique i.p.m.

4.4 Proof of Theorem 3.1
5 Stochastically Stable States
In this section, we capitalize on Theorem 3.1 to further simplify the computation of the stochastically stable states in games satisfying Property 2.1.
5.1 Background on finite Markov chains
In order to compute the invariant distribution of a finite-state, irreducible and aperiodic Markov chain, we consider a characterization introduced by [29]. In particular, for finite Markov chains, an invariant measure can be expressed as the ratio of sums of products consisting of transition probabilities. These products can be described conveniently by means of graphs on the set of states of the chain. In particular, let $\mathcal{S}$ be a finite set of states, whose elements will be denoted by $s$, $s'$, etc., and let $W$ be a subset of $\mathcal{S}$.
Definition 5.1
(W-graph) A graph consisting of arrows $s \to s'$ (with $s \in \mathcal{S} \setminus W$, $s' \in \mathcal{S}$, $s' \neq s$) is called a W-graph if it satisfies the following conditions:

every point $s \in \mathcal{S} \setminus W$ is the initial point of exactly one arrow;

there are no closed cycles in the graph; or, equivalently, for any point $s \in \mathcal{S} \setminus W$ there exists a sequence of arrows leading from it to some point $s' \in W$.
Figure 1 provides examples of $\{s\}$-graphs for some state $s$ when $\mathcal{S}$ contains four states. We will denote by $G\{W\}$ the set of W-graphs, and we shall use the letter $g$ to denote graphs. If $P(s, s')$ are nonnegative numbers, where $s, s' \in \mathcal{S}$, define also the transition probability along graph $g$ as
$$\pi(g) \doteq \prod_{(s \to s') \in g} P(s, s').$$
The following Lemma holds:
Lemma 5.1 (Lemma 6.3.1 in [29])
Let us consider a Markov chain with a finite set of states $\mathcal{S}$ and transition probabilities $\{P(s, s')\}$, and assume that every state can be reached from any other state in a finite number of steps. Then, the stationary distribution of the chain is $\{\mu_s,\, s \in \mathcal{S}\}$, where
$$\mu_s = \frac{Q_s}{\sum_{s' \in \mathcal{S}} Q_{s'}}, \qquad (9)$$
where $Q_s \doteq \sum_{g \in G\{s\}} \pi(g)$.
In other words, in order to compute the weight that the stationary distribution assigns to a state $s$, it suffices to compute the ratio of the sum of transition probabilities along all $\{s\}$-graphs over the corresponding sums for all states.
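The graph characterization of Lemma 5.1 can be checked numerically on a toy chain. The sketch below enumerates, for a three-state chain, all $\{s\}$-graphs (every state other than $s$ has exactly one outgoing arrow and all arrows lead into $s$ without closed cycles), computes the weights $Q_s$, and compares the result against the balance equations; the chain and all names are illustrative.

```python
from itertools import product
from math import prod

def s_graphs(n, s):
    # Enumerate all {s}-graphs on states 0..n-1: every state x != s is the
    # initial point of exactly one arrow x -> g[x], and following arrows
    # from any x leads to s without closed cycles (Definition 5.1).
    others = [x for x in range(n) if x != s]
    graphs = []
    for succ in product(*[[y for y in range(n) if y != x] for x in others]):
        g = dict(zip(others, succ))
        if all(_leads_to(g, x, s) for x in others):
            graphs.append(g)
    return graphs

def _leads_to(g, x, s):
    seen = set()
    while x != s:
        if x in seen:
            return False  # closed cycle
        seen.add(x)
        x = g[x]
    return True

def stationary_via_graphs(P):
    # Lemma 5.1: mu_s is proportional to Q_s, the sum over {s}-graphs of the
    # product of transition probabilities along the graph's arrows.
    n = len(P)
    Q = [sum(prod(P[x][g[x]] for x in g) for g in s_graphs(n, s))
         for s in range(n)]
    total = sum(Q)
    return [q / total for q in Q]

# Toy irreducible 3-state chain (illustrative numbers).
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.1, 0.3, 0.6]]
mu = stationary_via_graphs(P)
# Sanity check against the balance equations mu P = mu.
balance = [sum(mu[i] * P[i][j] for i in range(3)) for j in range(3)]
print(mu)
print(balance)
```

Brute-force enumeration is exponential in the number of states, so this is only a verification device for small chains, which is exactly the role the characterization plays in the analysis.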
5.2 Approximation of one-step transition probability
We wish to provide an approximation of the transition probabilities between states in $\mathcal{S}$, since this will allow for explicitly computing the stationary distribution of Theorem 3.1. Based on the definition of the t.p.f. $Q_\lambda^{(1)}$, and as $\lambda \to 0$, a transition from $x^{*}$ to $y^{*}$ influences the stationary distribution only if the action profile of $y^{*}$ differs from that of $x^{*}$ in the action of a single player. This observation will be capitalized on by the forthcoming Lemmas 5.2–5.3 to approximate the transition probability from $x^{*}$ to $y^{*}$.
Let $\tau_A$ denote the first hitting time of the unperturbed process to the set $A$, and denote by $\tau_A^{*}(x)$ the minimum hitting time of a set $A$ when the process starts from state $x$. Let us also define the set of strategies of agent $i$ that remain unreachable when, starting from $x^{*}$, the unperturbed process is run and agent $i$ plays action $\alpha_i'$ for a number of consecutive times.
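First hitting times of the kind used here can be estimated by simulation. The sketch below does so for a toy absorbing chain; the chain, the target set, and all names are illustrative assumptions, not the paper's process.

```python
import random

def first_hitting_time(P, x0, A, rng, t_max=10_000):
    # tau_A = inf{t >= 0 : X_t in A}, simulated for a finite chain with
    # transition matrix P, started at x0. Returns t_max if A is not reached.
    x = x0
    for t in range(t_max):
        if x in A:
            return t
        r, acc = rng.random(), 0.0
        for y, p in enumerate(P[x]):
            acc += p
            if r < acc:
                x = y
                break
    return t_max

# Toy chain: a biased walk on {0, 1, 2}; estimate E[tau_{2}] from state 0.
P = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0]]
rng = random.Random(0)
samples = [first_hitting_time(P, 0, {2}, rng) for _ in range(2000)]
print(sum(samples) / len(samples))  # close to 4: two geometric(1/2) waits
```

For this toy chain the expected hitting time is computable by hand (two independent geometric waits with mean 2 each), which makes it a convenient sanity check for the simulator.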
Lemma 5.2 (One-step transition probability)
Consider any two action profiles $\alpha, \alpha' \in \mathcal{A}$ that differ in the action of a single player $j$. Let $x^{*}, y^{*}$ define the corresponding pure strategy states associated with $\alpha$ and $\alpha'$, respectively. Let also $x'$ denote the state reached after agent $j$ trembles once starting from $x^{*}$ and plays $\alpha_j'$. Define also $p^{*}$ as the probability that the unperturbed process transits from the perturbed state $x'$ to a neighborhood of $y^{*}$ in finite time. For a sufficiently small such neighborhood, the following hold:

(a) The transition probability from $x^{*}$ to $y^{*}$ under the lifted t.p.f. can be approximated, as in (10), by the product of $p^{*}$ and the probability that agent $j$ trembled and selected action $\alpha_j'$, given that only one player trembles (under t.p.f. $Q_\lambda^{(1)}$).

(b) Along any sample path that reaches the neighborhood of $y^{*}$, the action profile $\alpha'$ is played a minimum number of times.

(c) The probability $p^{*}$ corresponds to the probability of the shortest such path.

(d) There exists a positive constant $C > 0$ such that, for any transition step with the above properties and as $\lambda \to 0$, the bound in (11) holds.
Proof.
See Appendix 10.
Note that, for sufficiently small $\lambda$, the larger the utility at the destination state $y^{*}$, the larger the transition probability to $y^{*}$. In a way, the inverse of the destination utility at $y^{*}$ represents a measure of “resistance” of the process to transit to $y^{*}$. Lemma 5.2 provides a tool for simplifying the computation of stochastically stable pure strategy states, as will become apparent in the following section.
5.3 Approximation of stationary distribution
In this section, using Lemma 5.2, which approximates one-step transition probabilities, we provide an approximation of the invariant stationary distribution of the t.p.f. $\hat{P}$. By definition of $\hat{P}$, this approximation is based upon the observation that, for the computation of the quantities of Lemma 5.1, it suffices to consider only those graphs which involve one-step transitions as defined in the previous section.
Define the set of graphs in $G\{W\}$ consisting solely of one-step transitions, i.e., for any such graph and any arrow $s \to s'$ in it, the associated action profiles, say $\alpha$ and $\alpha'$, respectively, differ in the action of a single player. It is straightforward to check that this set is nonempty for any $W$.
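Restricting to one-step transitions can be sketched by filtering graphs whose arrows connect action profiles at unit Hamming distance. The two-player, two-action profile set and all names below are illustrative.

```python
from itertools import product

profiles = [(0, 0), (0, 1), (1, 0), (1, 1)]  # two players, two actions each

def hamming1(a, b):
    # True iff action profiles a and b differ in the action of exactly one player.
    return sum(x != y for x, y in zip(a, b)) == 1

def one_step_s_graphs(s):
    # {s}-graphs whose arrows all connect profiles differing in a single
    # player's action (the "one-step transitions" restriction).
    others = [p for p in profiles if p != s]
    out = []
    for succ in product(*[[q for q in profiles if hamming1(p, q)]
                          for p in others]):
        g = dict(zip(others, succ))
        if all(_reaches(g, p, s) for p in others):
            out.append(g)
    return out

def _reaches(g, p, s):
    seen = set()
    while p != s:
        if p in seen:
            return False  # closed cycle
        seen.add(p)
        p = g[p]
    return True

gs = one_step_s_graphs((0, 0))
print(len(gs))  # prints 4: the set of one-step {s}-graphs is nonempty
```

Here the unit-Hamming-distance arrows form a 4-cycle on the profiles, and the four surviving graphs are exactly its spanning in-trees rooted at $(0,0)$, confirming nonemptiness on this toy example.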
Lemma 5.3 (Approximation of stationary distribution)
The stationary distribution of the finite Markov chain, $\{\mu_s,\, s \in \mathcal{S}\}$, satisfies
(12)
where the dominant terms are given by
(13)
for some constant $C > 0$.
Proof. According to Lemma 5.1, for any $s \in \mathcal{S}$, we have $\mu_s = Q_s / \sum_{s' \in \mathcal{S}} Q_{s'}$. Given the definition of the t.p.f. $Q_\lambda^{(1)}$, where only one player trembles, we should only consider one-step transition probabilities (as defined in Lemma 5.2). Thus, the sums defining $Q_s$ may be restricted to graphs consisting solely of one-step transitions. According to Lemma 5.2 and Equation (10), each such one-step transition probability factors into the probability of the corresponding tremble and the probability of the subsequent shortest path, where $j$ denotes the single player whose action changes from $\alpha$ to $\alpha'$, and