Stochastic Stability of Reinforcement Learning in Positive-Utility Games

This paper considers a class of discrete-time reinforcement-learning dynamics and provides a stochastic-stability analysis in repeatedly played positive-utility (strategic-form) games. For this class of dynamics, convergence to pure Nash equilibria has previously been demonstrated only for the restricted class of potential games. Prior work primarily establishes convergence through stochastic approximations, where the asymptotic behavior can be associated with the limit points of an ordinary differential equation (ODE). However, analyzing global convergence through an ODE approximation requires the existence of a Lyapunov or a potential function, which naturally restricts the analysis to a narrow class of games. To overcome these limitations, this paper introduces an alternative framework for analyzing convergence under reinforcement learning that is based upon an explicit characterization of the invariant probability measure of the induced Markov chain. We further provide a methodology for computing the invariant probability measure in positive-utility games, together with an illustration in the context of coordination games.

Authors

Georgios C. Chasparis



1 Introduction

Recently, multi-agent formulations have been utilized to tackle distributed optimization problems, since communication and computational complexity might be an issue under centralized schemes. In such formulations, decisions are usually taken in a repeated fashion, where agents select their next actions based on their own prior experience. Naturally, such multi-agent interactions can be modeled as strategic-form games, where agents are repeatedly involved in a strategic interaction with a fixed payoff or utility matrix. Such a framework finds numerous applications, including, for example, distributed overlay routing [2], distributed topology control [3] and distributed resource allocation [4].

Given the repeated fashion of the strategic interactions involved in such formulations, several questions naturally emerge: a) Can agents "learn" to asymptotically select optimal decisions/actions? b) What information should agents share with each other? c) What is the computational complexity of the learning process? In engineering applications, it is usually desirable that each agent shares a minimum amount of information with other agents, while keeping the computational complexity of the learning process small. A class of learning dynamics that achieves small communication and computational complexity is so-called payoff-based learning. Under this class of learning dynamics, each agent only receives measurements of its own utility function, without the need to know the actions selected by other agents, or the details of its own utility function (i.e., its dependence on other agents' actions).
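To make this information structure concrete, the following is a minimal sketch of a payoff-based learning loop (our own illustration, not an algorithm from the paper; all function names are hypothetical). Each agent holds a private mixed strategy and observes only a scalar measurement of its own utility after each joint play:

```python
import random

def run_payoff_based(utility, update, n_agents, n_actions, T, seed=0):
    """Generic payoff-based learning loop. Each agent i holds a private
    mixed strategy over its own actions; after the joint action is played,
    agent i observes only the scalar utility(i, actions) -- never the
    other agents' actions or the form of its own utility function."""
    rng = random.Random(seed)
    strategies = [[1.0 / n_actions] * n_actions for _ in range(n_agents)]
    for _ in range(T):
        actions = [rng.choices(range(n_actions), weights=s)[0]
                   for s in strategies]
        for i in range(n_agents):
            u_i = utility(i, actions)       # own-utility measurement only
            strategies[i] = update(strategies[i], actions[i], u_i)
    return strategies

# Illustrative run with a trivial identity update (strategies stay uniform).
final = run_payoff_based(lambda i, a: 1.0, lambda s, a, u: s, 3, 2, 100)
```

The point of the sketch is the interface: the `update` rule sees only the agent's own strategy, chosen action, and realized utility, which is exactly the information pattern assumed by the dynamics studied in this paper.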

In such repeatedly-played strategic-form games, a popular objective for payoff-based learning is to guarantee convergence (in some sense) to Nash equilibria. Convergence to Nash equilibria may be desirable, especially when the set of optimal centralized solutions coincides with the set of Nash equilibria.

Reinforcement-based learning has been utilized in strategic-form games in order for agents to gradually learn to play Nash equilibria. It may appear under alternative forms, including discrete-time replicator dynamics [5], learning automata [6, 7] or approximate policy iteration or Q-learning [8]. In all these classes of learning dynamics, deriving conditions under which convergence to Nash equilibria is achieved may not be a trivial task, especially in the case of a large number of agents (as will be discussed in detail in the forthcoming Section 2).

In the present paper, we consider a class of reinforcement-based learning introduced in [9] that is closely related to both discrete-time replicator dynamics and learning automata. We will refer to this class of dynamics as perturbed learning automata. The main difference with prior reinforcement-learning schemes lies in a) the step-size sequence, and b) the perturbation (or mutation) term. The step-size sequence is assumed constant, thus introducing a fading-memory effect of past experiences into each agent's strategy. The perturbation term, on the other hand, introduces errors into each agent's selection process. Both features can be used to design a desirable asymptotic behavior.

We provide an analytical framework for deriving conclusions about the asymptotic behavior of the dynamics that is based upon an explicit characterization of the invariant probability measure of the induced Markov chain. In particular, we show that in all strategic-form games satisfying the Positive-Utility Property, the support of the invariant probability measure coincides with the set of pure strategy profiles. This extends prior work in coordination games, where convergence to mixed strategy profiles may only be excluded under strong conditions on the payoff matrix (e.g., existence of a potential function). Furthermore, we provide a methodology for computing the set of stochastically stable states in all positive-utility games. We illustrate this methodology in the context of coordination games and provide a simulation study in distributed network formation.

In the remainder of the paper, Section 2 presents the investigated class of learning dynamics, related work and the main contributions. Section 3 provides a simplification in the characterization of stochastic stability, while Section 4 presents its technical derivation. This result is utilized for computing the stochastically stable states in positive-utility games in Section 5. In Section 6, we present an illustration of the proposed methodology in the context of coordination games, together with a simulation study in distributed network formation. Finally, Section 7 presents concluding remarks.

Notation:

• For a Euclidean topological space Z, let Nδ(x) denote the δ-neighborhood of x ∈ Z, i.e.,

 Nδ(x) ≐ {y ∈ Z : |x − y| < δ},

where |⋅| denotes the Euclidean distance.

• ej denotes the unit vector in ℝⁿ whose jth entry is equal to 1 and all other entries are equal to 0.

• Δ(n) denotes the probability simplex of dimension n, i.e.,

 Δ(n) ≐ {x ∈ ℝⁿ : x ≥ 0, 1ᵀx = 1}.
• For a set A in a topological space Z, IA(⋅) denotes the index (indicator) function, i.e.,

 IA(x) ≐ 1 if x ∈ A, and 0 otherwise.
• For a finite set A, |A| denotes its cardinality.

• For a finite set A and a probability distribution σ ∈ Δ(|A|), the random selection of an element of A according to σ will be denoted by randσ[A]. If σ is the uniform distribution, the random selection will be denoted by unif[A].

• δx denotes the Dirac measure at x.

• log(⋅) denotes the natural logarithm.

2 Perturbed Learning Automata

2.1 Terminology

We consider the standard setup of finite strategic-form games. Consider a finite set of agents (or players) I = {1, …, n}, and let each agent i ∈ I have a finite set of actions Ai. Let αi ∈ Ai denote any such action of agent i. The set of action profiles is the Cartesian product A ≐ A1 × ⋯ × An; let α = (α1, …, αn) be a representative element of this set. We will denote by −i the complementary set I ∖ {i} and often decompose an action profile as α = (αi, α−i). The payoff/utility function of agent i is a mapping ui : A → ℝ. A strategic-form game is then defined by the triple ⟨I, A, {ui}i∈I⟩.

For the remainder of the paper, we will be concerned with strategic-form games that satisfy the Positive-Utility Property.

Property 2.1 (Positive-Utility Property)

For any agent i ∈ I and any action profile α ∈ A, ui(α) > 0.

This property is rather generic and applies to a large family of games. For example, games in which some form of alignment of interests exists between agents (e.g., coordination games [10] or weakly-acyclic games [11]) can be designed to satisfy this property, since agents' utilities/preferences are rather close to each other at any given action profile. However, in the forthcoming analysis, we do not impose any structural constraint other than Property 2.1.

2.2 Perturbed Learning Automata

We consider a form of reinforcement-based learning that belongs to the general class of learning automata [7]. In learning automata, each agent i updates a finite probability distribution xi(t) ∈ Δ(|Ai|) representing its beliefs with respect to the most profitable action. The precise manner in which xi changes at time t, depending on the performed action and the response of the environment, completely defines the learning model.

The proposed learning model is described in Table 1. At the first step, each agent i updates its action given its current strategy vector xi(t). Its selection is slightly perturbed by a perturbation (or mutation) factor λ > 0, such that, with a small probability, agent i follows a uniform strategy (or, it trembles). At the second step, agent i evaluates its new selection by collecting a utility measurement, while in the last step, agent i updates its strategy vector given its new experience.

Here, we identify actions with vertices of the simplex Δ(|Ai|). For example, if agent i selects its jth action at time t, then αi(t) is identified with the unit vector ej. To better see how the strategies evolve, let us consider the following toy example. Let the current strategy of player i be xi(t) = (1/2, 1/2), i.e., player i has two actions, each assigned probability 1/2. Let also player i select its first action according to rule (1). Then, the new strategy vector for agent i, updated according to rule (2), is:

 xi(t+1) = ½ ( 1 + ϵ ui(α(t+1)), 1 − ϵ ui(α(t+1)) ).

In other words, when player i selects its first action, the probability assigned to this action increases proportionally to the reward received from it. We may say that this type of dynamics reinforces repeated selection; however, the size of the reinforcement depends on the reward received.
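The toy computation above can be checked numerically. The sketch below is our own illustration of rules (1)-(2) (the function names are hypothetical, not the paper's): selection with a tremble probability, and the strategy update x(t+1) = x(t) + ϵ·u·(e_a − x(t)).

```python
import math
import random

def pla_select(x, lam, rng):
    """Rule (1): with probability lam the agent trembles (uniform choice);
    otherwise it samples an action from its current strategy x."""
    if rng.random() < lam:
        return rng.randrange(len(x))
    return rng.choices(range(len(x)), weights=x)[0]

def reinforce(x, a, u, eps):
    """Rule (2): move the strategy toward the vertex e_a of the selected
    action a, scaled by the step size eps and the received reward u:
    x(t+1) = x(t) + eps * u * (e_a - x(t))."""
    return [xj + eps * u * ((1.0 if j == a else 0.0) - xj)
            for j, xj in enumerate(x)]

# Toy example from the text: x = (1/2, 1/2), first action selected.
eps, u = 0.1, 0.5
x_new = reinforce([0.5, 0.5], 0, u, eps)
# Matches 1/2 * (1 + eps*u, 1 - eps*u):
assert math.isclose(x_new[0], 0.5 * (1 + eps * u))
assert math.isclose(x_new[1], 0.5 * (1 - eps * u))
```

Note that the update is a convex combination of x(t) and the vertex e_a (for ϵ·u ≤ 1), so the strategy remains a valid probability vector.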

By playing a strategic-form game repeatedly over time, players do not always experience the same reward when selecting the same action, since other players may also change their actions. This dynamic element of the size of the reinforcement is the factor that complicates the convergence analysis, as will become clear in the forthcoming discussion of related work.

Note that, since the utility function is uniformly bounded on A, by letting the step size ϵ be sufficiently small the strategy vector xi(t) remains in the simplex Δ(|Ai|) for all t.

In case λ = 0, the above update recursion will be referred to as the unperturbed learning automata.

2.3 Related work

Discrete-time replicator dynamics

A type of learning dynamics quite closely related to the dynamics of Table 1 is the discrete-time version of the replicator dynamics (cf. [12]). It has been used in different forms, depending primarily on the step-size sequence in Table 1. For example, Arthur [5] considered a similar rule, with no perturbation and a decreasing step size (in place of the constant step size of (2)). A comparable model is also used by Hopkins and Posch in [13], where the step size at time t depends on the accumulated benefits of agent i up to time t, which gives rise to the urn process of Erev and Roth [14]. Some similarities are also shared with Cross's learning model [15], and with its modification presented by Leslie in [16], where the step size is instead assumed to decrease with time.

The main difference of the proposed dynamics of Table 1 lies in the perturbation parameter λ, which was first introduced and analyzed in [9]. A state-dependent perturbation term has also been investigated in [17]. The perturbation parameter may serve as an equilibrium-selection mechanism, since it excludes convergence to non-Nash action profiles [9]. It resolved one of the main issues of discrete-time replicator dynamics, namely the positive probability of convergence to action profiles that are not Nash equilibria (briefly, non-Nash action profiles).

Although convergence to non-Nash action profiles can be excluded by using a sufficiently small λ, establishing convergence to pure Nash equilibria may still be an issue. Such convergence is desirable in the context of coordination games [18], where Pareto-efficient outcomes are usually pure Nash equilibria (see, e.g., the definition of a coordination game in [10]). As presented in [17], convergence to pure Nash equilibria can be guaranteed only under strong conditions on the payoff matrix. For example, as shown in [17, Proposition 8], and under the ODE method for stochastic approximations, it requires a) the existence of a potential function, and b) conditions on the Jacobian matrix of the potential function. Even if a potential function does exist, verifying condition (b) is practically infeasible for games of more than 2 players [17].

On the other hand, an important side benefit of using this class of dynamics is the indirect "filtering" of the utility-function measurements (through the formation of the strategy vectors in (2)). This is demonstrated, for example, in [13] for the Erev-Roth model [14], where the robustness of the convergence/non-convergence asymptotic results is established under the presence of noise in the utility measurements.

Learning automata

Learning automata, as first introduced by [6], have attracted attention with respect to the control of complex and distributed systems due to their simple structure and low computational complexity (cf. [7, Chapter 1]). Variable-structure stochastic automata may incorporate a form of reinforcement of favorable actions. Therefore, such stochastic automata bear a lot of similarities to the discrete-time analogs of the replicator dynamics discussed above. An example of such stochastic learning automata is the linear reward-inaction scheme described in [7, Chapter 4]. Compared with the reinforcement rule of (2), the linear reward-inaction scheme accepts a utility function taking values in {0, 1}, where 0 corresponds to an unfavorable response and 1 corresponds to a favorable one. More general forms can also be used, where the utility function may take discrete or continuous values in the unit interval [0, 1].

Analysis of learning automata in games has been restricted to zero-sum and identical-interest games [7, 19]. In identical-interest games, convergence analysis has been derived only for a small number of players and actions, due to the difficulty of deriving conditions for absolute monotonicity, which corresponds to the property that the expected utility received by each player increases monotonically in time (cf. [7, Definition 8.1]). The results presented in [19] are similar.

The property of absolute monotonicity is closely related to the existence of a potential function, as in the case of potential games [20]. Similarly to the discrete-time replicator dynamics, convergence to non-Nash action profiles cannot be excluded when the step-size sequence is constant, even if the utility function takes values in the unit interval as in the learning automata. (The behavior under a decreasing step size is different, as [17, Proposition 2] has shown.) Furthermore, deriving conditions for excluding convergence to mixed strategy profiles in coordination games continues to be an issue for learning automata, as in the case of discrete-time replicator dynamics.

Recognizing these issues, reference [21] introduced a class of linear reward-inaction schemes in combination with a coordinated exploration phase, so that convergence to the efficient (pure) Nash equilibrium is achieved. However, coordinating the exploration phase requires communication between the players, an approach that does not fit the distributed nature of the dynamics pursued here.

Q-learning

Similar questions of convergence to Nash equilibria also appear in alternative reinforcement-based learning formulations, such as approximate dynamic programming and Q-learning. Usually, under Q-learning, players keep track of the discounted running average reward received by each action, based on which optimal decisions are made (see, e.g., [22]). Convergence to Nash equilibria can be accomplished under a stronger set of assumptions, which increases the computational complexity of the dynamics. For example, in the Nash-Q learning algorithm of [8], it is indirectly assumed that agents have full access to the joint action space and the rewards received by other agents.

More recently, reference [23] introduced a Q-learning scheme in combination with either adaptive play or better-reply dynamics in order to attain convergence to Nash equilibria in potential games [20] or weakly-acyclic games. However, this form of dynamics requires that each player observes the actions selected by the other players, since a Q-value needs to be assigned to each joint action.

When the evaluation of the Q-values is totally independent, as in the individual Q-learning of [22], convergence to Nash equilibria has been shown only for 2-player zero-sum games and 2-player partnership games with countably many Nash equilibria. Currently, there exist no convergence results in multi-player games. This is a main drawback of Q-learning dynamics in strategic-form games, as also pointed out in [24]. To overcome this drawback, in the context of stochastic dynamic games, reference [24] employs an additional feature (motivated by [11]), namely exploration phases. During any such exploration phase, all agents use constant policies, which allows the accurate computation of the optimal Q-factors. We may argue that the introduction of common exploration phases for all agents partially destroys the distributed nature of the dynamics, since it requires synchronization between agents.
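As a rough sketch of the individual Q-learning idea discussed above (our own simplification, with an ε-greedy rule in place of any specific smoothing used in [22]; all names are illustrative), each agent tracks a running-average value for its own actions only:

```python
import random

def q_update(q, a, u, step):
    """Update only the value of the chosen action a toward the observed
    reward u; the agent never needs the joint action or others' rewards."""
    q = list(q)
    q[a] += step * (u - q[a])
    return q

def eps_greedy(q, eps_explore, rng):
    """Pick a value-maximizing action, except with probability
    eps_explore pick uniformly at random."""
    if rng.random() < eps_explore:
        return rng.randrange(len(q))
    return max(range(len(q)), key=lambda a: q[a])

rng = random.Random(0)
q = q_update([0.0, 0.0], 0, 1.0, 0.5)   # observed reward 1.0 for action 0
assert q == [0.5, 0.0]
```

The fully independent evaluation is exactly what limits the known convergence guarantees to the 2-player cases mentioned above: each agent's Q-values ignore the nonstationarity created by the other learners.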

Aspiration-based learning

Recently, there have been several attempts to establish convergence to Nash equilibria through alternative payoff-based learning dynamics (see, e.g., the benchmark-based dynamics of [11] for convergence to Nash equilibria in weakly-acyclic games, the trial-and-error learning of [25] for convergence to Nash equilibria in generic games, the mood-based dynamics of [26] for maximizing welfare in generic games, or the aspiration learning of [10] for convergence to efficient outcomes in coordination games). We will refer to such approaches as aspiration-based learning. For these types of dynamics, convergence to Nash equilibria or efficient outcomes can be established without requiring any strong monotonicity properties (as in the multi-player weakly-acyclic games in [11]).

The case of noisy utility measurements, which are present in many engineering applications, has not currently been addressed through aspiration-based learning. The only exception is reference [11], under benchmark-based dynamics, where (synchronized) exploration phases are introduced during which each agent plays a fixed action. If such exploration phases are large in duration (as required by the results in [11]), this may reduce the robustness of the dynamics to changes in the environment (e.g., changes in the utility function). One reason such a robustness analysis is currently not possible for this class of dynamics is that decisions are taken directly based on the measured performances (e.g., by comparing the currently measured performance with the benchmark performance in [11]).

2.4 Contributions

The aforementioned literature on payoff-based learning dynamics in strategic-form games can be grouped into two main categories, namely reinforcement-based learning (including discrete-time replicator dynamics, learning automata and Q-learning) and aspiration-based learning. Summarizing their main advantages/disadvantages, we make the following high-level observations.

• (O1) Strong asymptotic convergence guarantees for a large number of players, even in generic games, are currently possible under aspiration-based learning. Similar results in reinforcement-based learning are currently restricted to games with a small number of players and under strong structural assumptions (e.g., the existence of a potential function). See, for example, the discussion on discrete-time replicator dynamics or learning automata in [17], or the discussion on Q-learning in [24].

• (O2) Noisy observations can be "handled" through reinforcement-based learning due to the indirect filtering of the observation signals (e.g., through the strategy-vector formation in the model of Table 1 or in the formation of the Q-factors in Q-learning). This is demonstrated, for example, in the convergence/non-convergence asymptotic results presented in [13] for a variation of the proposed learning dynamics of Table 1 (with no perturbation and a decreasing step size) and under the presence of noise. Similar effects in aspiration-based learning can currently be achieved only through the introduction of synchronized exploration phases, as discussed in Section 2.3.

Motivated by these two observations (O1)–(O2), and in particular the current inability of reinforcement-based learning to provide strong convergence guarantees in large games, this paper advances asymptotic convergence guarantees for the class of reinforcement-based learning described in Table 1 (closely related to both discrete-time replicator dynamics and learning automata, as discussed in Section 2.3). Our goal is to go beyond the common restrictions of a small number of players and strong assumptions on the game structure (such as the existence of a potential function).

The proposed dynamics (termed perturbed learning automata) were first introduced in [9] to resolve stability issues on the boundary of the domain appearing in prior schemes [5, 13]. This was achieved through the introduction of the perturbation factor of Table 1. However, strong convergence guarantees (e.g., w.p. 1 convergence to Nash equilibria or efficient outcomes) are currently limited to a small number of players and strict structural assumptions, e.g., the existence of a potential function and additional conditions on its Jacobian matrix [17].

In this paper, we drop the assumption of a decreasing step-size sequence and instead consider the case of a constant step size ϵ > 0. This choice increases the adaptivity of the dynamics to varying conditions (e.g., changes in the number of agents or the utility function). Furthermore, we provide a stochastic-stability analysis that gives a detailed characterization of the invariant probability measure of the induced Markov chain with no restrictions on the number of players. In particular, our contributions are the following:

1. (C1) We provide an equivalent finite-dimensional characterization of the infinite-dimensional Markov chain induced by the dynamics, which significantly simplifies the characterization of its invariant probability measure. This simplification is based upon a weak-convergence result, and it applies to any strategic-form game satisfying the Positive-Utility Property 2.1 (Theorem 3.1).

2. (C2) We capitalize on this simplification and provide a methodology for computing the stochastically stable states in positive-utility strategic-form games (Theorem 5.1).

3. (C3) We illustrate the utility of this methodology by establishing stochastic stability in a class of coordination games with no restriction on the number of players or actions (Theorem 6.1).

These contributions significantly extend the utility of reinforcement-based learning for the reasons explained in observation (O1). We note that the illustration result in coordination games (contribution (C3) above) is of independent interest. To the best of our knowledge, it is the first convergence result in the context of reinforcement-based learning in repeatedly played strategic-form games with the following features: a) a completely distributed setup (i.e., with no information exchange), b) more than two players, and c) a set of weakly-acyclic games that does not require the strong condition of the existence of a potential function.

This paper is an extension of an earlier version that appeared in [1], which only focused on contribution (C1) above.

3 Stochastic Stability

In this section, we provide a characterization of the invariant probability measure μλ of the Markov chain induced by the dynamics of Table 1. The importance lies in an equivalence relation (established through a weak-convergence argument) of μλ with an invariant distribution of a finite-state Markov chain. The characterization of the stochastic stability of the dynamics will then follow directly from Birkhoff's individual ergodic theorem.

This simplification in the characterization of μλ will be the first important step toward providing specialized results for stochastic stability in strategic-form games.

3.1 Terminology and notation

Let Z ≐ A × Δ, where Δ ≐ Δ(|A1|) × ⋯ × Δ(|An|), i.e., Z consists of pairs of joint actions α and strategy profiles x. We will denote the elements of the state space by z = (α, x).

The set A is endowed with the discrete topology, Δ with its usual Euclidean topology, and Z with the corresponding product topology. We also let B(Z) denote the Borel σ-field of Z, and P(Z) the set of probability measures (p.m.) on B(Z) endowed with the Prohorov topology, i.e., the topology of weak convergence. The learning algorithm of Table 1 defines a Z-valued Markov chain. Let Pλ denote its transition probability function (t.p.f.), parameterized by λ > 0. We refer to the process with λ > 0 as the perturbed process. Let also P denote the t.p.f. of the unperturbed process, i.e., when λ = 0.

We let C(Z) denote the Banach space of real-valued continuous functions on Z under the sup-norm topology. For f ∈ C(Z), define

 Pλf(z) ≐ ∫Z Pλ(z, dy) f(y),

and

 μ[f] ≐ ∫Z μ(dz) f(z), for μ ∈ P(Z).

The process governed by the unperturbed t.p.f. P will be denoted by z(t) = (α(t), x(t)). Let Ω ≐ Z∞ denote the canonical path space, i.e., an element ω ∈ Ω is a sequence {z(0), z(1), …} with z(t) = (α(t), x(t)) ∈ Z. We use the same notation for the elements of the space Ω and for the coordinates of the process. Let also Pz denote the unique p.m. induced by the unperturbed process on the product σ-field of Ω when initialized at z, and Ez the corresponding expectation operator. Finally, let Ft, t ≥ 0, denote the σ-field of Ω generated by {z(0), …, z(t)}.

3.2 Stochastic stability

First, we note that both P and Pλ (λ > 0) satisfy the weak Feller property (cf. [27, Definition 4.4.2]).

Proposition 3.1

Both the unperturbed process (λ = 0) and the perturbed process (λ > 0) have the weak Feller property.

Proof. See Appendix 8.

A measure μλ ∈ P(Z) is called an invariant probability measure (i.p.m.) for Pλ if

 (μλPλ)(A) ≐ ∫Z μλ(dz) Pλ(z, A) = μλ(A),  A ∈ B(Z).

Since Z is a locally compact separable metric space and P, Pλ have the weak Feller property, they both admit an i.p.m., denoted μ and μλ, respectively [27, Theorem 7.2.3].

We would like to characterize the stochastically stable states of the dynamics, that is, any state z ∈ Z for which any collection of i.p.m.'s {μλ}λ>0 retains positive mass in every neighborhood of z as λ ↓ 0. As the forthcoming analysis will show, the stochastically stable states will be a subset of the set of pure strategy states (p.s.s.) defined as follows:

Definition 3.1 (Pure Strategy State)

A pure strategy state is a state s = (α, x) ∈ Z such that, for all i ∈ I, xi coincides with the vertex of the probability simplex that assigns probability 1 to the action αi.

We will denote the set of pure strategy states by S.

Theorem 3.1 (Stochastic Stability)

There exists a unique probability vector ^μ = {^μs}s∈S such that, for any collection of i.p.m.'s {μλ}λ>0, the following hold:

• μλ converges to ^μ as λ ↓ 0, where convergence is in the weak sense.

• The probability vector ^μ is an invariant distribution of the (finite-state) Markov chain with t.p.f. ^P = {^Pss′}, such that, for any s, s′ ∈ S,

 ^Pss′ ≐ limt→∞ QPt(s, Nδ(s′)), (5)

for any δ > 0 sufficiently small, where Q is the t.p.f. corresponding to only one player trembling (i.e., following the uniform distribution of (1)).

The proof of Theorem 3.1 requires a series of propositions and will be presented in detail in Section 4.

Theorem 3.1 implicitly provides a stochastic-stability argument. In fact, the expected asymptotic behavior of the dynamics can be characterized by ^P and, therefore, by ^μ. In particular, by Birkhoff's individual ergodic theorem [27, Theorem 2.3.4], the weak convergence of μλ to ^μ, and the fact that ^P is ergodic, the expected fraction of time that the process spends in a neighborhood O of any s ∈ S with ^μs > 0 is given by ^μ(O) as the experimentation probability λ approaches zero and time increases, i.e.,

 limλ↓0 ( limt→∞ 1/t ∑k=0t−1 Pλk(z, O) ) = ^μ(O).
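The ergodic average above reduces, in the limit, to the invariant distribution of the finite chain ^P. For a concrete, purely hypothetical two-state chain (a stand-in for ^P, not derived from any specific game), this distribution can be computed by power iteration:

```python
def stationary(P, iters=5000):
    """Invariant distribution of a row-stochastic matrix P (a stand-in
    for the finite-state t.p.f. ^P of Theorem 3.1), by power iteration:
    repeatedly apply mu <- mu P until mu is (numerically) a fixed point."""
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(iters):
        mu = [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]
    return mu

# Hypothetical two-state chain; its invariant distribution is (2/3, 1/3),
# since mu0 * 0.1 = mu1 * 0.2 at the fixed point.
mu = stationary([[0.9, 0.1], [0.2, 0.8]])
assert abs(mu[0] - 2 / 3) < 1e-6 and abs(mu[1] - 1 / 3) < 1e-6
```

Power iteration suffices here because ^P is a finite ergodic chain, so the iterates converge to its unique invariant distribution from any initial vector.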

3.3 Discussion

Theorem 3.1 establishes "equivalence" (in a weak-convergence sense) of the original (perturbed) learning process with a simplified process, in which only one player trembles at the first iteration and no player trembles thereafter. This simplification of the analysis was originally exploited to analyze aspiration-learning dynamics in [28, 10], and it is based upon the observation that, under the unperturbed process, agents' strategies converge to a pure strategy state, as will be shown in the forthcoming Section 4.

Furthermore, the limiting behavior of the original (perturbed) dynamics can be characterized by the (unique) invariant distribution of a finite-state Markov chain ^P, whose states correspond to the pure strategy states (Definition 3.1). In other words, we should expect that, as the perturbation parameter λ approaches zero, the algorithm spends the majority of its time at pure strategy states. The importance of this result lies in the fact that no constraints have been imposed on the payoff matrix of the game other than the Positive-Utility Property 2.1.

In the forthcoming Section 5, we will use this result to provide a methodology for computing the set of stochastically stable states. This methodology will further be illustrated in the context of coordination games.
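As an informal sanity check of this prediction (a simulation sketch with hypothetical payoff values chosen only to satisfy Property 2.1, not the paper's network-formation study), one can simulate the dynamics of Table 1 in a small coordination game and measure the time spent near pure strategy states:

```python
import random

def fraction_near_pure(eps=0.1, lam=0.01, T=20000, seed=1):
    """Two players, two actions each; payoff 1.0 on coordination and 0.1
    otherwise (both positive, so Property 2.1 holds). Returns the
    fraction of the second half of the run in which both strategies are
    within 0.05 of a simplex vertex, i.e., near a pure strategy state."""
    rng = random.Random(seed)
    x = [[0.5, 0.5], [0.5, 0.5]]
    near = 0
    for t in range(T):
        # Rule (1): tremble with probability lam, else sample strategy.
        acts = [rng.randrange(2) if rng.random() < lam
                else rng.choices([0, 1], weights=x[i])[0] for i in range(2)]
        u = 1.0 if acts[0] == acts[1] else 0.1
        # Rule (2): reinforce each player's chosen action by eps * u.
        for i in range(2):
            x[i] = [xj + eps * u * ((1.0 if j == acts[i] else 0.0) - xj)
                    for j, xj in enumerate(x[i])]
        if t >= T // 2 and all(max(xi) > 0.95 for xi in x):
            near += 1
    return near / (T - T // 2)
```

For small λ the returned fraction is close to one: the strategies spend almost all of their time near simplex vertices, consistent with the invariant measure concentrating on pure strategy states.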

4 Technical Derivation

In this section, we provide the main steps of the proof of Theorem 3.1. We begin by investigating the asymptotic behavior of the unperturbed process, and then we characterize the i.p.m. of the perturbed process with respect to the set of p.s.s.'s S.

4.1 Unperturbed Process

For t ≥ 0, define the sets

 At ≐ {ω ∈ Ω : α(τ) = α(t) for all τ ≥ t},
 Bt ≐ {ω ∈ Ω : α(τ) = α(0) for all 0 ≤ τ ≤ t}.

Note that {At} is a non-decreasing sequence, i.e., At ⊆ At+1, while {Bt} is non-increasing, i.e., Bt+1 ⊆ Bt. Let

 A∞ ≐ ⋃t=0∞ At and B∞ ≐ ⋂t=1∞ Bt.

In other words, A∞ corresponds to the event that agents eventually play the same action profile, while B∞ corresponds to the event that agents never change their actions.

Proposition 4.1 (Convergence to p.s.s.)

Let us assume that the step size ϵ is sufficiently small, such that xi(t) ∈ Δ(|Ai|) for all i ∈ I and t ≥ 0. Then, for any initial state z ∈ Z, the following hold:

• Pz(B∞) > 0,

• Pz(A∞) = 1.

Proof. See Appendix 9.

Statement (a) of Proposition 4.1 states that the probability that agents never change their actions is bounded away from zero, while statement (b) states that the probability that agents eventually play the same action profile is one. This also indicates that any invariant measure of the unperturbed process can be characterized with respect to the pure strategy states S, which is established by the following proposition.

Proposition 4.2 (Limiting t.p.f. of unperturbed process)

Let μ denote an i.p.m. of P. Then, there exists a t.p.f. Π on Z with the following properties:

• for μ-a.e. z ∈ Z, Π(z,·) is an i.p.m. for P;

• for all f ∈ C_b(Z), ‖P^t f − Π f‖_∞ → 0 as t → ∞;

• μ is also an i.p.m. for Π;

• the support¹ of Π(z,·) is on S for all z ∈ Z.

¹The support of a measure μ on Z is the unique closed set F ⊆ Z such that μ(F) = 1 and μ(G ∩ F) > 0 for every open set G such that G ∩ F ≠ ∅.

Proof. The state space Z is a locally compact separable metric space and the t.p.f. P of the unperturbed process admits an i.p.m. due to Proposition 3.1. Thus, statements (a), (b) and (c) follow directly from [27, Theorem 5.2.2 (a), (b), (e)].

(d) Let us assume that the support of Π(z,·) includes points in Z other than the pure strategy states in S. Then, there exists an open set G ⊆ Z∖S such that Π(z,G) > 0 for some z ∈ Z. According to (b), P^t(z,·) converges weakly to Π(z,·). Thus, from the Portmanteau theorem (cf. [27, Theorem 1.4.16]), we have that lim inf_{t→∞} P^t(z,G) ≥ Π(z,G) > 0. This contradicts Proposition 4.1(b), which concludes the proof.

Proposition 4.2 states that the unperturbed t.p.f. P^t converges weakly to a t.p.f. Π which admits the same i.p.m. as P. Furthermore, the support of Π(z,·) is the set of pure strategy states S. This is a rather important observation, since the limiting perturbed process can also be "related" (in a weak-convergence sense) to the t.p.f. Π, as will be shown in the following section.

4.2 Invariant probability measure (i.p.m.) of perturbed process

According to the definition of the perturbed learning automata of Table 1, when a player updates its action, there is a small probability λ that it "trembles," i.e., it selects a new action according to a uniform distribution (instead of using its current strategy). Thus, we can decompose the t.p.f. Pλ induced by the one-step update as follows:

 Pλ = (1 − φ(λ)) P + φ(λ) Qλ,

where φ(λ) is the probability that at least one agent trembles (so that 1 − φ(λ) is the probability that no agent trembles), and Qλ corresponds to the t.p.f. when at least one agent trembles. Note that φ(λ) → 0 as λ → 0.

Define also Q as the t.p.f. conditioned on the event that exactly one player trembles, and Q* as the t.p.f. conditioned on the event that at least two players tremble. Then, we may write:

 Qλ = (1 − ψ(λ)) Q + ψ(λ) Q*, (6)

where ψ(λ) corresponds to the probability that at least two players tremble given that at least one player trembles. It also satisfies ψ(λ) → 0 as λ → 0, which establishes an approximation of Qλ by Q as the perturbation factor λ approaches zero.
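For intuition, if each of the n agents trembles independently with probability λ (a natural instantiation of the tremble model, assumed here for illustration), then φ and ψ admit closed forms and one can check numerically that ψ(λ) vanishes as λ → 0; a minimal sketch:

```python
# Assumption: each of n agents trembles independently with probability lam.
# Then phi(lam) = P(at least one trembles), and
# psi(lam) = P(at least two tremble | at least one trembles).
def phi(lam, n):
    return 1.0 - (1.0 - lam) ** n

def psi(lam, n):
    exactly_one = n * lam * (1.0 - lam) ** (n - 1)
    return 1.0 - exactly_one / phi(lam, n)

# Both vanish as lam -> 0: multi-agent trembles become negligible
# relative to single-agent trembles, justifying Q_lambda ~ Q.
for lam in [1e-1, 1e-3, 1e-5]:
    print(lam, phi(lam, 3), psi(lam, 3))
```

In this instantiation φ(λ) ≈ nλ and ψ(λ) = O(λ), which is exactly the asymptotic regime the decomposition (6) exploits.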

Let us also define the infinite-step t.p.f. when trembling only at the first step (briefly, lifted t.p.f.) as follows:

 P^L_λ ≐ φ(λ) ∑_{t=0}^∞ (1 − φ(λ))^t Qλ P^t = Qλ Rλ, (7)

where Rλ ≐ φ(λ) ∑_{t=0}^∞ (1 − φ(λ))^t P^t, i.e., Rλ corresponds to the resolvent t.p.f.
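On a finite chain, the resolvent admits the closed form Rλ = φ(λ)(I − (1 − φ(λ))P)^{-1}, and Proposition 4.3(a) below predicts that it converges to the rank-one limit Π = 1πᵀ as φ(λ) ↓ 0. A toy numerical check (the 3×3 matrix is hypothetical):

```python
import numpy as np

# Toy irreducible, aperiodic 3-state chain (hypothetical numbers).
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])

def resolvent(P, phi):
    """Abel-averaged resolvent R = phi * sum_t (1-phi)^t P^t,
    computed in closed form as phi * (I - (1-phi) P)^{-1}."""
    n = P.shape[0]
    return phi * np.linalg.inv(np.eye(n) - (1 - phi) * P)

# Stationary distribution pi of P (left eigenvector for eigenvalue 1).
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmax(np.real(w))])
pi = pi / pi.sum()

R = resolvent(P, 1e-6)
# Each row of R approaches pi as phi -> 0, i.e. R -> Pi = 1 pi^T.
print(np.max(np.abs(R - pi)))
```

The Abel average over the geometric weights φ(1 − φ)^t plays the role of the infinite sum in (7); its limit coincides with the Cesàro limit Π on a finite chain.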

In the following proposition, we establish weak convergence of the lifted t.p.f. P^L_λ to QΠ as λ ↓ 0, which will further allow for an explicit characterization of the weak limit points of the i.p.m. μλ of Pλ.

Proposition 4.3 (i.p.m. of perturbed process)

The following hold:

• (a) For f ∈ C_b(Z), ‖Rλ f − Π f‖_∞ → 0 as λ ↓ 0.

• (b) For f ∈ C_b(Z), ‖P^L_λ f − QΠ f‖_∞ → 0 as λ ↓ 0.

• (c) Any invariant distribution μλ of Pλ is also an invariant distribution of P^L_λ.

• (d) Any weak limit point μ̂ of μλ, as λ ↓ 0, is an i.p.m. of QΠ.

Proof. (a) For any f ∈ C_b(Z), we have

 ‖Rλ f − Π f‖_∞ = ‖φ(λ) ∑_{t=0}^∞ (1 − φ(λ))^t P^t f − Π f‖_∞ = ‖φ(λ) ∑_{t=0}^∞ (1 − φ(λ))^t (P^t f − Π f)‖_∞,

where we have used the property φ(λ) ∑_{t=0}^∞ (1 − φ(λ))^t = 1. Note that

 φ(λ) ∑_{t=T}^∞ (1 − φ(λ))^t ‖P^t f − Π f‖_∞ ≤ (1 − φ(λ))^T sup_{t≥T} ‖P^t f − Π f‖_∞.

From Proposition 4.2(b), we have that for any ε > 0, there exists T sufficiently large such that the r.h.s. is uniformly bounded by ε for all λ. Thus, the sequence

 A_T ≐ φ(λ) ∑_{t=0}^T (1 − φ(λ))^t (P^t f − Π f)

is Cauchy and therefore convergent (under the sup-norm). In other words, there exists A such that A_T → A = Rλ f − Π f. For every T, we have

 ‖Rλ f − Π f‖_∞ ≤ ‖A_T‖_∞ + ‖A − A_T‖_∞.

Note that

 ‖A_T‖_∞ ≤ φ(λ) ∑_{t=0}^T (1 − φ(λ))^t ‖P^t f − Π f‖_∞.

If we take λ ↓ 0, then the r.h.s. converges to zero, since φ(λ) → 0 and the sum is finite. Thus,

 lim_{λ↓0} ‖Rλ f − Π f‖_∞ ≤ ‖A − A_T‖_∞, for all T > 0,

which concludes the proof, since ‖A − A_T‖_∞ can be made arbitrarily small by taking T large.

(b) For any f ∈ C_b(Z), we have

 ‖P^L_λ f − QΠ f‖_∞ ≤ ‖Qλ(Rλ f − Π f)‖_∞ + ‖Qλ Π f − QΠ f‖_∞ ≤ ‖Rλ f − Π f‖_∞ + ‖Qλ Π f − QΠ f‖_∞.

The first term of the r.h.s. approaches 0 as λ ↓ 0 according to (a). The second term of the r.h.s. also approaches 0 as λ ↓ 0, since by (6) Qλ → Q as λ ↓ 0.

(c) By definition of the perturbed t.p.f. Pλ, we have

 Pλ Rλ = (1 − φ(λ)) P Rλ + φ(λ) Qλ Rλ.

Note that Qλ Rλ = P^L_λ and (1 − φ(λ)) P Rλ = Rλ − φ(λ) I, where I corresponds to the identity operator. Thus,

 Pλ Rλ = Rλ − φ(λ) I + φ(λ) P^L_λ.

For any i.p.m. μλ of Pλ, we have

 μλ Pλ Rλ = μλ Rλ − φ(λ) μλ + φ(λ) μλ P^L_λ,

which equivalently implies that μλ = μλ P^L_λ, since μλ Pλ = μλ. We conclude that μλ is also an i.p.m. of P^L_λ.
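The operator identity used in part (c) is purely algebraic and is easy to verify numerically for any pair of stochastic matrices in the roles of P and Qλ; the 2×2 matrices below are arbitrary toy choices, used only as a check:

```python
import numpy as np

# Arbitrary toy stochastic matrices (illustrative only).
P  = np.array([[0.7, 0.3], [0.4, 0.6]])
Ql = np.array([[0.5, 0.5], [0.2, 0.8]])
phi = 0.1

I  = np.eye(2)
Pl = (1 - phi) * P + phi * Ql                 # perturbed t.p.f. P_lambda
Rl = phi * np.linalg.inv(I - (1 - phi) * P)   # resolvent R_lambda
PL = Ql @ Rl                                  # lifted t.p.f. P^L_lambda

# Check P_lambda R_lambda = R_lambda - phi*I + phi*P^L_lambda.
lhs = Pl @ Rl
rhs = Rl - phi * I + phi * PL
print(np.max(np.abs(lhs - rhs)))
```

The identity holds exactly because (1 − φ)P Rλ = Rλ − φI, which follows from the geometric-series form of the resolvent.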

(d) Let μ̂ denote a weak limit point of μλ as λ ↓ 0; such a limit point exists since the family {μλ} of probability measures on the compact state space Z is tight. The weak convergence of μλ to μ̂ implies that μλ[f] → μ̂[f] for every f ∈ C_b(Z). Note further that

 μ̂[f] − μ̂[QΠ f] = (μ̂[f] − μλ[f]) + μλ[P^L_λ f − QΠ f] + (μλ[QΠ f] − μ̂[QΠ f]),

where we have used part (c), i.e., μλ[f] = μλ[P^L_λ f]. The first and the third terms of the r.h.s. approach 0 as λ ↓ 0, due to the fact that μλ converges weakly to μ̂. The same holds for the second term of the r.h.s. due to part (b). Thus, μ̂[f] = μ̂[QΠ f] for all f ∈ C_b(Z), and we conclude that any weak limit point μ̂ of μλ as λ ↓ 0 is an i.p.m. of QΠ.

Proposition 4.3 establishes convergence (in a weak sense) of the i.p.m. μλ of the perturbed process Pλ to an i.p.m. of QΠ. In the following section, this convergence result will allow for a more explicit characterization of μλ as λ ↓ 0.

4.3 Equivalent finite-state Markov process

Define the finite-state Markov process over the pure strategy states S, with transition matrix P̂ = [P̂_{ss′}], as in (5).

Proposition 4.4 (Unique i.p.m. of QΠ)

There exists a unique i.p.m. μ̂ of QΠ. It satisfies

 μ̂(·) = ∑_{s∈S} π_s δ_s(·) (8)

for some constants π_s ≥ 0, ∑_{s∈S} π_s = 1. Moreover, π = (π_s)_{s∈S} is an invariant distribution of P̂, i.e., π P̂ = π.

Proof. From Proposition 4.2(d), we know that the support of Π(z,·) is the set of pure strategy states S. Thus, the support of QΠ(z,·) is also on S. From Proposition 4.3, we know that QΠ admits an i.p.m., say μ̂, whose support is also S. Thus, μ̂ admits the form of (8), for some constants π_s ≥ 0, ∑_{s∈S} π_s = 1.

For any two distinct s, s′ ∈ S, note that the δ-neighborhood N_δ(s′), δ > 0, is a continuity set of QΠ(s,·), i.e., QΠ(s, ∂N_δ(s′)) = 0. Thus, from the Portmanteau theorem, given that QP^t converges weakly to QΠ,

 QΠ(s, N_δ(s′)) = lim_{t→∞} QP^t(s, N_δ(s′)) = P̂_{ss′}.

Combining this with the invariance of μ̂ under QΠ, we obtain

 π_{s′} = ∑_{s∈S} π_s P̂_{ss′},

which shows that π is an invariant distribution of P̂, i.e., π P̂ = π.

It remains to establish uniqueness of the invariant distribution of P̂. Note that the set of pure strategy states S is isomorphic to the set of action profiles. If agent i trembles (as t.p.f. Q dictates), then all actions in its action set have positive probability of being selected. It then follows by Proposition 4.1 that any pure strategy state differing in the action of a single player can be reached with positive probability. Finite induction shows that every state in S can be reached from any other state in S with positive probability. It follows that, if we restrict the domain of QΠ to S, it defines an irreducible stochastic matrix. Therefore, QΠ has a unique i.p.m.

4.4 Proof of Theorem 3.1

Theorem 3.1(a)–(b) is a direct implication of Propositions 4.3 and 4.4.

5 Stochastically Stable States

In this section, we capitalize on Theorem 3.1 to further simplify the computation of the stochastically stable states in games satisfying Property 2.1.

5.1 Background on finite Markov chains

In order to compute the invariant distribution of a finite-state, irreducible and aperiodic Markov chain, we are going to consider a characterization introduced by [29]. In particular, for finite Markov chains, an invariant measure can be expressed as a ratio of sums of products of transition probabilities. These products can be described conveniently by means of graphs on the set of states of the chain. In particular, let S be a finite set of states, whose elements will be denoted by s, s′, etc., and let W be a subset of S.

Definition 5.1

(W-graph) A graph consisting of arrows s → s′ (s ∈ S∖W, s′ ∈ S, s′ ≠ s) is called a W-graph if it satisfies the following conditions:

1. every point s ∈ S∖W is the initial point of exactly one arrow;

2. there are no closed cycles in the graph; or, equivalently, for any point s ∈ S∖W there exists a sequence of arrows leading from it to some point s′ ∈ W.

Figure 1 provides examples of {s}-graphs for some state s when S contains four states. We will denote by G{W} the set of W-graphs, and we shall use the letter g to denote graphs. If P̂_{s_k s_ℓ} are nonnegative numbers, where s_k, s_ℓ ∈ S, define also the transition probability along a graph g as

 ϖ(g) ≐ ∏_{(s_k → s_ℓ) ∈ g} P̂_{s_k s_ℓ}.

The following Lemma holds:

Lemma 5.1 (Lemma 6.3.1 in [29])

Let us consider a Markov chain with a finite set of states S and transition probabilities P̂_{ss′}, and assume that every state can be reached from any other state in a finite number of steps. Then, the stationary distribution of the chain is π = (π_s)_{s∈S}, where

 π_s = R_s / ∑_{s_i∈S} R_{s_i}, s ∈ S, (9)

where R_s ≐ ∑_{g∈G{s}} ϖ(g).

In other words, in order to compute the weight that the stationary distribution assigns to a state s, it suffices to compute the ratio of the total transition probability over all {s}-graphs to the corresponding total over all {s_i}-graphs, s_i ∈ S.
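As an illustration of Lemma 5.1, the following sketch enumerates the {s}-graphs of a small chain by brute force and recovers the stationary distribution from (9); the 3×3 transition matrix is a hypothetical example, and the result is cross-checked against the left eigenvector of the matrix.

```python
import itertools
import numpy as np

# Toy irreducible 3-state transition matrix (hypothetical numbers).
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4]])
n = P.shape[0]

def tree_weight(s):
    """R_s: sum over all {s}-graphs g of the product of transition
    probabilities along the arrows of g (Lemma 5.1)."""
    others = [k for k in range(n) if k != s]
    total = 0.0
    # Each state k != s is the initial point of exactly one arrow.
    for choice in itertools.product(*[[j for j in range(n) if j != k]
                                      for k in others]):
        succ = dict(zip(others, choice))
        # Keep the graph only if every arrow sequence leads into s
        # (equivalently, no closed cycle among the other states).
        ok = True
        for k in others:
            seen, cur = set(), k
            while cur != s:
                if cur in seen:
                    ok = False
                    break
                seen.add(cur)
                cur = succ[cur]
            if not ok:
                break
        if ok:
            w = 1.0
            for k in others:
                w *= P[k, succ[k]]
            total += w
    return total

R = np.array([tree_weight(s) for s in range(n)])
pi_graph = R / R.sum()            # stationary distribution via (9)

# Cross-check against the left eigenvector of P for eigenvalue 1.
w, V = np.linalg.eig(P.T)
pi_eig = np.real(V[:, np.argmax(np.real(w))])
pi_eig /= pi_eig.sum()
print(np.max(np.abs(pi_graph - pi_eig)))
```

Brute-force enumeration is exponential in the number of states, but for the small pure-strategy state spaces considered here it makes the graph characterization directly computable.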

5.2 Approximation of one-step transition probability

We wish to provide an approximation of the transition probabilities between states in S, since this will allow for explicitly computing the stationary distribution of Theorem 3.1. Based on the definition of the t.p.f. Q, and as λ → 0, a transition from s to s′ influences the stationary distribution only if s′ differs from s in the action of a single player. This observation will be exploited in the forthcoming Lemmas 5.2 and 5.3 to approximate the transition probability from s to s′.

Let τ(A) denote the first hitting time of the unperturbed process to a set A ⊆ Z, and let τ*_s(A) denote the minimum hitting time of a set A when the process starts from state s. Let us also define the set

 D_{i,ℓ}(α) ≐ {(α, x) ∈ Z : x_{iα_i} > 1 − H_i(α)^ℓ},

where H_i(α) ≐ 1 − ϵ u_i(α). The set D_{i,ℓ}(α) defines the unreachable set in the strategy space of agent i when starting from α under P and agent i plays action α_i for ℓ consecutive times.

Lemma 5.2 (One-step transition probability)

Consider any two action profiles α, α′ which differ in the action of a single player j. Let s, s′ ∈ S define the corresponding pure strategy states associated with α and α′, respectively. Let also z = (α′, x) ∈ Z, which corresponds to the state after agent j perturbed once starting from s and played α′_j. Define also ˘P_{ss′}(δ), which corresponds to the probability that the process transits from the perturbed state z to a δ-neighborhood N_δ(s′) of s′ in finite time. For δ > 0 sufficiently small, the following hold:

• (a) The transition probability from s to s′ under P̂ can be approximated as follows:

 P̂_{ss′} = γ_j · lim_{δ↓0} ˘P_{ss′}(δ), (10)

where γ_j corresponds to the probability that agent j trembled and selected action α′_j, given that only one player trembles (under t.p.f. Q).

• (b) Along any sample path that reaches the set N_δ(s′), action profile α′ is played at least ℓ = ℓ(δ) times.

• (c) ˘P_{ss′}(δ) corresponds to the probability of the shortest path, i.e.,

 ˘P_{ss′}(δ) = P_z[α(t+1) = α′, t < τ*_s(N_δ(s′))].
• (d) There exists a positive constant C_0(δ), such that for any transition step (with the above properties) and as λ → 0,

 ˘P_{ss′}(δ) ≈ exp(−C_0(δ)/(ϵ u_j(α′))). (11)

Proof. See Appendix 10.

Note that, for sufficiently small ϵ, the larger the destination utility u_j(α′), the larger the transition probability to s′. In a way, the inverse of the destination utility at α′ represents a measure of "resistance" of the process to transit to s′. Lemma 5.2 provides a tool for simplifying the computation of stochastically stable pure strategy states, as will become apparent in the following section.
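To see the "resistance" interpretation numerically, the approximation (11), read as exp(−C_0(δ)/(ϵ u_j(α′))) consistently with the monotonicity just noted, can be tabulated for a few destination utilities; C_0 and ϵ below are placeholder values, chosen only for illustration:

```python
import math

C0, eps = 1.0, 0.1   # placeholder constants, for illustration only

def p_approx(u):
    """Approximate one-step transition probability exp(-C0 / (eps * u))
    as a function of the destination utility u."""
    return math.exp(-C0 / (eps * u))

# Higher destination utility -> lower "resistance" -> larger probability.
for u in [0.5, 1.0, 2.0]:
    print(u, p_approx(u))
```

The exponential dependence means that, as δ ↓ 0, graphs through low-utility destinations contribute negligibly to the sums of Lemma 5.1, which is what makes the forthcoming approximation of the stationary distribution tractable.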

5.3 Approximation of stationary distribution

In this section, using Lemma 5.2, which approximates the one-step transition probabilities, we provide an approximation of the stationary distribution of the t.p.f. P̂. By definition of Q, this approximation is based upon the observation that, for the computation of the quantities R_s of Lemma 5.1, it suffices to consider only those graphs which involve one-step transitions as defined in the previous section.

Define G(1){W} to be the set of W-graphs consisting solely of one-step transitions, i.e., for any g ∈ G(1){W} and any arrow (s_k → s_ℓ) ∈ g, the associated action profiles, say α_k, α_ℓ, respectively, differ in the action of a single player. It is straightforward to check that G(1){W} ≠ ∅ for any W ⊆ S.

Lemma 5.3 (Approximation of stationary distribution)

The stationary distribution π = (π_s)_{s∈S} of the finite Markov chain P̂ satisfies

 π_s = lim_{δ↓0} ˘R_s(δ) / ∑_{s_i∈S} ˘R_{s_i}(δ), s ∈ S, (12)

where ˘R_s(δ) ≐ ∑_{g∈G(1){s}} ˘ϖ(g; δ) and

 ˘ϖ(g; δ) ≐ γ̄_g ∏_{(s_k→s_ℓ)∈g} ˘P_{s_k s_ℓ}(δ), (13)

for some constant γ̄_g > 0.

Proof. According to Lemma 5.1, for any s ∈ S, we have R_s = ∑_{g∈G{s}} ϖ(g). Given the definition of the t.p.f. Q, where only one player trembles, we should only consider one-step transition probabilities (as defined in Lemma 5.2). Thus,

 R_s = ∑_{g∈G(1){s}} ϖ(g) = ∑_{g∈G(1){s}} ∏_{(s_k→s_ℓ)∈g} P̂_{s_k s_ℓ}.

According to Lemma 5.2 and Equation (10), we have

 R_s = lim_{δ↓0} ∑_{g∈G(1){s}} ∏_{(s_k→s_ℓ)∈g} γ_{j(s_k,s_ℓ)} ˘P_{s_k s_ℓ}(δ) = lim_{δ↓0} ∑_{g∈G(1){s}} γ̄_g ∏_{(s_k→s_ℓ)∈g} ˘P_{s_k s_ℓ}(δ),

where j(s_k, s_ℓ) denotes the single player whose action changes from s_k to s_ℓ, and