Multi-agent formulations can be used to tackle distributed optimization problems due to their reduced communication and computational complexity. In such formulations, agents make their own decisions repeatedly over time trying to maximize their own utility/performance function. However, due to the interdependencies among agents’ utility functions, local (or distributed) optimization does not necessarily imply global (or centralized) optimization. The problem becomes even more challenging when the utility function of each agent is unknown, and only measurements of this function (possibly corrupted by noise) are available. For this reason, there have been several efforts towards the design of distributed payoff-based learning dynamics for convergence to globally optimal outcomes.
Naturally, several such distributed optimization problems can be formulated as strategic-form games. A rather common objective is then to derive conditions under which convergence to efficient Nash equilibria can be achieved, i.e., locally stable outcomes that also maximize a centralized objective. One large class of payoff-based learning dynamics that has been utilized for convergence to Nash equilibria is reinforcement-based learning. It may appear under alternative forms, including discrete-time replicator dynamics, learning automata [2, 3], or approximate policy iteration or Q-learning. It is highly attractive for several engineering applications, since agents need to know neither the actions of other agents nor their own utility function. For example, it has been utilized for system identification and pattern recognition, distributed network formation, and resource-allocation problems.
In reinforcement-based learning, deriving conditions for convergence to Nash equilibria may not be a trivial task, especially when the number of agents is large. In the context of coordination games (e.g., ), two main difficulties are encountered: a) excluding convergence to pure strategies that are not Nash equilibria, and b) excluding convergence to mixed strategy profiles. Recent work by the author on perturbed learning automata overcame these limitations by directly characterizing the stochastically stable states of the induced Markov chain (independently of the number of players or actions). This type of analysis provided convergence guarantees in multi-player coordination games (thus extending previous results in reinforcement-based learning, which were restricted to potential games).
Although Nash equilibria are stochastically stable in coordination games under perturbed learning automata, not all Nash equilibria may be desirable. An example may be drawn from the classical Stag-Hunt coordination game of Table I.
In this game, the first player selects the row of the payoff matrix and the second player selects the column. The first element of the selected entry determines the reward of the row player, and the second element determines the reward of the column player. This game has two pure Nash equilibria, which correspond to the two symmetric plays. Ideally, we would prefer that agents eventually learn to play the payoff-dominant (or Pareto-efficient) equilibrium. However, existing results in perturbed learning automata [6, 9, 8] demonstrate that the other pure equilibrium may prevail asymptotically with positive probability. The reason lies in the cost that an agent experiences when the other agent deviates from a Nash equilibrium, which captures the notion of risk dominance (cf., ). In fact, this other equilibrium is the risk-dominant equilibrium in the Stag-Hunt game of Table I.
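The distinction between payoff dominance and risk dominance can be illustrated with a minimal sketch. The payoff entries below are hypothetical (they are not the entries of Table I), and the action labels A and B are chosen only for illustration. Risk dominance is evaluated with the standard product-of-deviation-losses criterion for 2x2 games.

```python
from itertools import product

# Hypothetical Stag-Hunt payoffs (NOT the entries of Table I):
# PAYOFF[(row_action, col_action)] = (row reward, column reward).
PAYOFF = {
    ("A", "A"): (5, 5), ("A", "B"): (1, 3),
    ("B", "A"): (3, 1), ("B", "B"): (4, 4),
}
ACTIONS = ("A", "B")
OTHER = {"A": "B", "B": "A"}

def is_pure_nash(profile):
    """True if no player gains by a unilateral deviation."""
    row, col = profile
    u_row, u_col = PAYOFF[profile]
    row_ok = all(PAYOFF[(r, col)][0] <= u_row for r in ACTIONS)
    col_ok = all(PAYOFF[(row, c)][1] <= u_col for c in ACTIONS)
    return row_ok and col_ok

def deviation_loss_product(profile):
    """Product of the players' losses from deviating unilaterally (2x2 games)."""
    row, col = profile
    loss_row = PAYOFF[profile][0] - PAYOFF[(OTHER[row], col)][0]
    loss_col = PAYOFF[profile][1] - PAYOFF[(row, OTHER[col])][1]
    return loss_row * loss_col

nash = [p for p in product(ACTIONS, ACTIONS) if is_pure_nash(p)]
# Payoff-dominant: every player is at least as well off as in any other equilibrium.
payoff_dominant = [p for p in nash
                   if all(PAYOFF[p][k] >= PAYOFF[q][k] for q in nash for k in (0, 1))]
# Risk-dominant: the equilibrium with the largest product of deviation losses.
risk_dominant = max(nash, key=deviation_loss_product)
```

With these illustrative numbers the two notions select different equilibria, which is the situation described above.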
In the present paper, we extend the perturbed learning automata dynamics presented in  to incorporate agents’ satisfaction levels; we refer to the resulting dynamics as aspiration-based perturbed learning automata (APLA). In this extended version, an agent reinforces an action based on both repeated selection and its satisfaction level. We provide a stochastic stability analysis of the proposed dynamics in multi-player coordination games. Furthermore, we show that payoff-dominant Nash equilibria are the only stochastically stable states, in contrast to standard learning automata.
This paper (also in combination with ) provides an analytical framework that significantly expands the utility of reinforcement-based learning in strategic-form games. Note though that several classes of aspiration-based learning also guarantee convergence to efficient outcomes in coordination games. For example, the baseline-based dynamics of , the mode-based dynamics of , the trial-and-error dynamics of , and the aspiration-learning dynamics of  also guarantee convergence to efficient Nash equilibria in certain classes of coordination and weakly-acyclic games. However, existing analysis does not take into account the possibility of noisy observations (with the exception of  and through the introduction of sufficiently large exploration phases). In comparison with these learning dynamics, learning automata can naturally incorporate noisy observations, as demonstrated in the robustness convergence analysis of , due to the indirect filtering of measurement noise in the formulation of the agents’ strategies.
In the remainder of the paper, Section II introduces coordination games and Section III presents the aspiration-based perturbed learning automata dynamics. Section IV presents the main weak-convergence result and Section V its technical derivation. Section VI provides a refinement of stochastically stable states together with a simulation study. Finally, Section VII presents concluding remarks.
For a Euclidean topological space, denotes the Euclidean distance.
unit vector in where its th entry is equal to 1 and all other entries are equal to 0.
denotes the probability simplex of dimension , i.e.,
denotes the Dirac measure at .
For a finite set , denotes its cardinality.
II Coordination Games
We consider the standard setup of finite strategic-form games. There is a finite set of agents or players, , and each agent has a finite set of actions, denoted by . The set of action profiles is the Cartesian product ; denotes an action of agent ; and denotes the action profile or joint action of all agents. The payoff/utility function of player is a mapping . An action profile is a (pure) Nash equilibrium if, for each ,
for all , where denotes the complementary set . We denote the set of pure Nash equilibria by .
Before defining coordination games, we first need to define the notion of best response.
Definition II.1 (Best Response)
The best response of agent to an action profile is a set-valued map such that
A coordination game is defined as follows:
Definition II.2 (Coordination game)
A game of two or more agents is a coordination game if the following conditions are satisfied:
for any , there exist and action such that
for any , there exist an agent and an action such that
The conditions of a coordination game establish a weak form of “coincidence of interests” among players. For example, condition (2) states that there always exists a best response of a player that makes no other player worse off. Due to this condition, a pure Nash equilibrium always exists. Furthermore, condition (3) states that, at a pure Nash equilibrium, there exists an agent and an action of that agent that makes every other player worse off. It is straightforward to show that coordination games are weakly acyclic games (cf., ). For example, the Stag-Hunt game of Table I satisfies the properties of Definition II.2. Other examples include the network formation games and common-pool games presented in .
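The two conditions can be checked mechanically on a small game. The sketch below follows the verbal description of conditions (2) and (3) given above; the payoffs are hypothetical (not those of Table I), and requiring the best response in condition (2) to differ from the current action is an interpretation adopted here to keep the check non-trivial.

```python
from itertools import product

# Hypothetical two-player payoffs (not the entries of Table I).
ACTIONS = [("A", "B"), ("A", "B")]             # action set of each agent
U = {("A", "A"): (5, 5), ("A", "B"): (1, 3),
     ("B", "A"): (3, 1), ("B", "B"): (4, 4)}   # profile -> utilities
N = len(ACTIONS)

def deviate(profile, i, a):
    """Profile obtained when agent i switches to action a."""
    return profile[:i] + (a,) + profile[i + 1:]

def others(i):
    return [j for j in range(N) if j != i]

def best_response(i, profile):
    """Set of actions of agent i maximizing its utility against the others."""
    values = {a: U[deviate(profile, i, a)][i] for a in ACTIONS[i]}
    best = max(values.values())
    return {a for a, v in values.items() if v == best}

def is_pure_nash(profile):
    return all(profile[i] in best_response(i, profile) for i in range(N))

def is_coordination_game():
    for profile in product(*ACTIONS):
        if is_pure_nash(profile):
            # Some unilateral deviation makes every other agent strictly worse off.
            ok = any(all(U[deviate(profile, i, a)][j] < U[profile][j] for j in others(i))
                     for i in range(N) for a in ACTIONS[i])
        else:
            # Some (new) best response of an agent makes no other agent worse off.
            ok = any(all(U[deviate(profile, i, a)][j] >= U[profile][j] for j in others(i))
                     for i in range(N) for a in best_response(i, profile)
                     if a != profile[i])
        if not ok:
            return False
    return True

result = is_coordination_game()
```

The illustrative payoffs are strictly positive, so this toy game is also consistent with a positive-utility requirement.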
For the remainder of the paper, we will be concerned with coordination games that satisfy the Positive-Utility Property.
Property II.1 (Positive-Utility Property)
For any agent and any action profile , .
III Aspiration-based Perturbed Learning Automata (APLA)
In this section, we present a novel reinforcement-based learning algorithm, namely aspiration-based perturbed learning automata (APLA).
The proposed dynamics is presented in Table II and extends the recently developed perturbed learning automata [6, 8]. At the first step, each agent selects an action according to a finite probability distribution (i.e., strategy) (capturing its beliefs about the most rewarding action). Its selection is slightly perturbed by a perturbation (or mutation) factor , such that, with a small probability agent follows a uniform strategy (or, it trembles). At the second step, agent evaluates its new selection by collecting a utility measurement, while in the third step, agent updates its strategy vector given its new experience. Finally, each agent updates its discounted running average performance, namely aspiration level, , for some , .
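The first step (perturbed action selection) can be sketched as follows. The function names and the symbol `lam` for the perturbation factor are assumptions made for illustration; the sketch only encodes the verbal rule that an agent follows its strategy with probability one minus the perturbation factor and the uniform distribution otherwise.

```python
import random

def select_action(strategy, lam, rng=random):
    """Sample an action index: tremble with probability lam, else follow the strategy."""
    n = len(strategy)
    if rng.random() < lam:
        return rng.randrange(n)                        # tremble: uniform over actions
    return rng.choices(range(n), weights=strategy)[0]  # follow the current strategy

def perturbed_distribution(strategy, lam):
    """Effective selection probabilities induced by the rule above."""
    n = len(strategy)
    return [(1 - lam) * p + lam / n for p in strategy]

rng = random.Random(0)
no_tremble = select_action([1.0, 0.0], lam=0.0, rng=rng)   # always the first action
always_tremble = select_action([1.0, 0.0], lam=1.0, rng=rng)
effective = perturbed_distribution([1.0, 0.0], lam=0.1)
```

Even a pure strategy assigns positive probability to every action under the perturbed rule, which is what keeps the induced chain exploring.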
Here we identify actions with vertices of the simplex, . For example, if agent selects its th action at time , then . To better clarify how the strategies evolve, consider the following toy example. Let the current strategy of player be , i.e., player has two actions, each assigned probability . Let also , i.e., player selects the first action according to rule (4). Then, the new strategy vector for agent , updated according to rule (6), is:
If , i.e., player receives a satisfactory performance, then , and the strategy of the selected action is going to increase proportionally to the reward received from this action. If, instead, , then , and the strategy of the selected action is going to increase proportionally to both the reward received and . By adjusting , we may control the increase in the strategy of a dissatisfactory action, since, if the current reward is far below the current aspiration level, then .
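The toy example can be reproduced numerically under stated assumptions. The sketch below assumes that the strategy update moves the strategy toward the selected vertex by a step equal to the step size times the received utility times the satisfaction level; the exponential form of `satisfaction` and its parameter `c` are hypothetical choices, used only because they equal 1 for satisfactory rewards and decay toward 0 as the reward falls below the aspiration level.

```python
import math

def satisfaction(u, rho, c=5.0):
    """Hypothetical satisfaction level: 1 if u >= rho, decaying toward 0 otherwise."""
    return 1.0 if u >= rho else math.exp(-c * (rho - u))

def update_strategy(x, action, u, rho, eps, c=5.0):
    """Assumed update: x <- x + eps * u * satisfaction * (e_action - x)."""
    step = eps * u * satisfaction(u, rho, c)
    return [p + step * ((1.0 if k == action else 0.0) - p) for k, p in enumerate(x)]

x = [0.5, 0.5]  # two actions, each with probability 1/2; the first action is selected
satisfied = update_strategy(x, action=0, u=0.8, rho=0.5, eps=0.1)     # u >= rho
dissatisfied = update_strategy(x, action=0, u=0.8, rho=1.5, eps=0.1)  # u < rho
```

In both cases the probability of the selected action increases, but the increase is much smaller when the reward is below the aspiration level; a larger `c` suppresses the reinforcement of dissatisfactory actions more strongly.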
The introduction of the level of satisfaction captured by in the strategy update is the main contribution of this paper as compared to the original perturbed learning automata (PLA) introduced in [6, 8], where .
Note that, by letting the step size be sufficiently small and since the utility function is uniformly bounded in , for all .
We also deliberately set the step size of the aspiration-level update to be different from the step size of the strategy vector update . In general, and for reasons that will become clearer in the forthcoming Section V, we would like to be sufficiently larger than in order for the aspiration level to evolve at a faster rate than the strategy vector. For the remainder of the paper, we will assume the following design property.
Given , we set the step size sufficiently small such that, for any ,
The ratio of the l.h.s. represents the minimum number of steps that the aspiration-level update of a player needs to reach a -neighborhood of the utility starting from any other action profile. The ratio of the r.h.s. represents the minimum number of steps that the strategy update of a player needs to reach a -neighborhood of the pure strategy vector corresponding to , when playing continuously. It is clear that for a finite number of actions and bounded utilities, and for sufficiently larger than , property (11) will be satisfied.
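The comparison of the two time scales can be illustrated by counting steps directly. The sketch assumes an aspiration update of the form rho <- rho + nu * (u - rho) and a strategy update of the form x <- x + eps * u * (1 - x) for the continuously played action (i.e., a fully satisfied agent); the symbols `nu`, `eps`, and `delta`, and all numerical values, are assumptions for illustration.

```python
def aspiration_steps(rho0, u, nu, delta):
    """Steps until the aspiration level is within delta of the utility u."""
    steps, rho = 0, rho0
    while abs(rho - u) > delta:
        rho += nu * (u - rho)      # assumed discounted running average
        steps += 1
    return steps

def strategy_steps(x0, u, eps, delta):
    """Steps until the played action's probability is within delta of 1."""
    steps, x = 0, x0
    while 1.0 - x > delta:
        x += eps * u * (1.0 - x)   # assumed reinforcement of the played action
        steps += 1
    return steps

t_aspiration = aspiration_steps(rho0=1.0, u=0.5, nu=0.1, delta=0.01)
t_strategy = strategy_steps(x0=0.0, u=0.5, eps=0.01, delta=0.01)
```

With the aspiration step size ten times larger than the strategy step size, the aspiration level settles well before the strategy does, which is the ordering that the design property asks for.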
IV Stochastic Stability
IV-A Terminology and notation
Let , where and , i.e., tuples of joint actions , strategy profiles and aspiration-level profiles . We will denote the elements of the state space by .
The set is endowed with the discrete topology, and with the usual Euclidean topology, and with the corresponding product topology. We also let denote the Borel -field of , and the set of probability measures (p.m.) on endowed with the Prohorov topology, i.e., the topology of weak convergence. The dynamics of Table II defines an -valued Markov chain. Let denote its transition probability function (t.p.f.), parameterized by . We will refer to this process as the perturbed process.
Note that under the perturbed process one or more agents may tremble (i.e., select randomly an action according to the uniform distribution). Define also the process where at most one agent may tremble. We will refer to this process as the unperturbed process.
We let denote the Banach space of real-valued continuous functions on under the sup-norm (denoted by ) topology. For , define
The process governed by the unperturbed process will be denoted by . Let denote the canonical path space, i.e., an element is a sequence , with . We use the same notation for the elements of the space and for the coordinates of the process . Let also denote the unique p.m. induced by the unperturbed process on the product -field of , initialized at .
IV-B Stochastic stability
First, note that both and () satisfy the weak-Feller property (cf., [16, Definition 4.4.2]).
Both the unperturbed process () and the perturbed process () satisfy the weak-Feller property.
Let us consider the perturbed process . The proof for the unperturbed process will be directly implied by employing . Let us also consider any sequence such that .
For any open set , the following holds:
where , and are the canonical projections defined by the product topology, and
Similarly, we have:
To investigate the limit of as , it suffices to investigate the behavior of the sequences
Let us first investigate the sequence . We distinguish the following (complementary) cases:
(a) and : In this case, there exists an open ball about the next strategy vector that does not share any common points with the canonical projection of . Due to the continuity of the function , we have that .
(b) : In this case, there exists an open ball about the next strategy vector that belongs to the canonical projection of , since . Due to the continuity of the function with respect to both the strategy and the aspiration level , we have that .
(c) and : In this case, . We conclude that , since .
In either one of the above (complementary) cases (a), (b) or (c), we have that . Following the exact same reasoning, and the continuity of the mapping with respect to the aspiration level , we also derive that (for all the corresponding (a), (b) and (c) cases).
Finally, due to the continuity of the perturbed strategy vector with respect to , we conclude that for any sequence ,
By [16, Proposition 7.2.1], we conclude that satisfies the weak-Feller property.
The above derivation can be generalized to any selection probability function in the place of , provided that it is a continuous function. Thus, the proof for the unperturbed process follows the exact same reasoning by simply setting .
The measure is called an invariant probability measure (i.p.m.) for if
Since defines a locally compact separable metric space and , satisfy the weak-Feller property, they both admit an i.p.m., denoted and , respectively [16, Theorem 7.2.3].
We would like to characterize the stochastically stable states of , that is any state for which any collection of i.p.m. satisfies . As the forthcoming analysis will show, the stochastically stable states will be a subset of the set of pure strategy states (p.s.s.) defined as follows:
Definition IV.1 (Pure Strategy State)
A pure strategy state is a state such that for all , and , i.e., coincides with the vertex of the probability simplex which assigns probability 1 to action , and coincides with the utility of agent under action profile .
We will denote the set of pure strategy states by . Pure strategy states that correspond to pure Nash equilibria will be referred to as pure Nash equilibrium states and will be denoted by . For any pure strategy state , define the -neighborhood of as follows
Theorem IV.1 (Stochastic Stability)
In any coordination game (Definition II.2) and under the aspiration-based perturbed learning automata of Table II (APLA), there exists a unique probability vector such that, for any collection of i.p.m.’s ,
where convergence is in the weak sense.
The probability vector is an invariant distribution of the (finite-state) Markov process , such that, for any ,
for any sufficiently small, where is the t.p.f. corresponding to at least two players trembling (i.e., following the uniform distribution of (4)).
Theorem IV.1 establishes weak convergence of the i.p.m. of to the invariant distribution of a finite Markov chain , whose support is on the set of pure Nash equilibrium states. Thus, from the ergodicity of , we have that the expected percentage of time that the process spends in any such that is given by as and time increases, i.e.,
The methodology for assessing which Nash equilibria are stochastically stable will follow in the forthcoming Section VI.
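The link between the invariant distribution of a finite Markov chain and the fraction of time spent in each state can be illustrated with a small sketch. The two-state transition matrix below is hypothetical; its states merely stand in for two pure Nash equilibrium states, and the entries are not derived from the dynamics of Table II.

```python
import random

# Hypothetical transition matrix of a finite chain (rows sum to 1).
P = [[0.9, 0.1],
     [0.3, 0.7]]

def invariant_distribution(P, iters=1000):
    """Power iteration pi <- pi P, starting from the uniform distribution."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

def occupation_fractions(P, steps, seed=0):
    """Empirical fraction of time a simulated path spends in each state."""
    rng = random.Random(seed)
    n = len(P)
    counts, state = [0] * n, 0
    for _ in range(steps):
        counts[state] += 1
        state = rng.choices(range(n), weights=P[state])[0]
    return [c / steps for c in counts]

pi = invariant_distribution(P)
fractions = occupation_fractions(P, steps=100_000)
```

For an ergodic chain the empirical fractions approach the invariant distribution as the number of steps grows, which is the sense in which the invariant distribution describes long-run behavior.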
V Technical Derivation
In this section, we provide the main steps for the proof of Theorem IV.1. We begin by investigating the asymptotic behavior of the unperturbed process , and then we characterize the i.p.m. of the perturbed process with respect to the pure Nash equilibrium states .
V-A Unperturbed Process
Recall that the unperturbed process with t.p.f. has been defined such that at most one agent may tremble. We first present two technical lemmas that will help us identify the behavior of the unperturbed process.
Let denote the first hitting time of the unperturbed process to a set .
For some action profile and , define the set:
The set , corresponds to any state at which the aspiration level is below for some given and . Define also the event
i.e., the event corresponds to the case that the process first reaches for some action profile before time .
The first lemma states that, for any initial state , at least one occurs for some .
For any , and any ,
Consider any initial state . Since , there exists an action profile such that for all . For some , define to be the maximum (with respect to ) number of iterations required for the aspiration level profile to drop from (i.e., its maximum value) to , when playing only action profile . Let also denote the complement of . We will first consider the case that the initial state . In this case, we have:
The first inequality results from the fact that one possible sample path that reaches corresponds to playing action continuously, and the probability of this path is smaller when we start from . The second inequality results from the fact that, under the unperturbed process , only one agent may tremble at any given time. Finally, note that by selecting action continuously for steps, for all . By selecting (according to Property III.1), is finite, and . Finally, note that if, instead, , then . Thus, we conclude that
For some , let us define the event:
In other words, corresponds to the event that some action profile has been selected (continuously) for at least times before time . Note that . Furthermore, . Thus, from the counterpart of the Borel-Cantelli Lemma (cf., [17, Lemma 1]), and continuity from below, we conclude that , i.e., the probability that at least one occurs, starting from any , is one. Given that for any , the conclusion follows.
The second lemma states that the unperturbed process reaches (infinitely often) a state at which the aspiration level is below the utility level of a Nash equilibrium and its strategy assigns positive probability to it. Define the event:
which corresponds to the case that the aspiration level is below the utility level of a Nash equilibrium.
For any ,
By Lemma V.1, there exists a subsequence such that for some action profile . We distinguish the following (complementary) cases.
(a) . In this case, corresponds to a pure Nash equilibrium. By definition of a coordination game, there exists an agent and that makes every other agent worse off. By selecting sufficiently small, such a drop in the performance brings the aspiration level of every agent strictly below . Formally, select , where
which is strictly positive since .
(b) . In this case, corresponds to an action profile that is not a Nash equilibrium. According to the definition of a coordination game, there exists a finite sequence of action profiles starting from , namely , such that: a) , b) and c) for each , there exists such that satisfies condition (2). Thus, there exists a finite integer , such that
Along this sequence of best responses there is no agent that gets worse off. Thus, along this sample path, increases with an order of (and independent of ) for every agent . Thus, .
We can define a subsequence , that takes into account the size of , such that and
Thus, the conclusion follows from the counterpart of the Borel-Cantelli Lemma (cf., [17, Lemma 1]).
Next, we will use Lemma V.2 to show that the process will reach a -neighborhood of a pure Nash equilibrium infinitely often with probability one.
For any , define the event:
In other words, corresponds to the event that the unperturbed process has reached a -neighborhood of a pure Nash equilibrium state before time instance .
For any and any initial state ,
By Lemma V.2, there exists a subsequence such that for some pure Nash equilibrium . Let us consider one such , for some . Let us also consider a sample path of the unperturbed process, where action is played continuously until is reached, where is the pure strategy state corresponding to action profile . Let be the maximum (with respect to ) number of steps required for the process to reach when playing action profile continuously. Proposition 4.1 in  shows that such a sample path occurs with strictly positive probability (of order of ), say . Then,
We can define a subsequence such that , such that and
Thus, from the counterpart of the Borel-Cantelli Lemma (cf., [17, Lemma 1]), we conclude that
Proposition V.1 states that the unperturbed process will reach a -neighborhood of a Nash equilibrium infinitely often with probability one. Note that this derivation is independent of the size of and .
Proposition V.2 (Limiting t.p.f. of unperturbed process)
Let denote an i.p.m. of . Then, there exists a t.p.f. on with the following properties:
for -a.e. , is an i.p.m. for ;
for all , ;
is an i.p.m. for ;
the support of is on for all . (Footnote 1: The support of a measure on is the unique closed set such that and for every open set such that .)
The state space is a locally compact separable metric space and the t.p.f. of the unperturbed process admits an i.p.m. due to the weak-Feller property. Thus, statements (a), (b) and (c) follow directly from [16, Theorem 5.2.2 (a), (b), (e)].
(d) Let us assume that the support of includes points in other than the pure Nash equilibrium states in . Then, there exists an open set such that and for some . According to (b), converges weakly to . Thus, from the Portmanteau theorem (cf., [16, Theorem 1.4.16]), we have that However, this contradicts Proposition V.1.
Proposition V.2 states that the limiting unperturbed t.p.f. converges weakly to a t.p.f. which admits the same invariant probability measure as . Furthermore, the support of is the set of pure Nash equilibrium states in . This is a rather handy property, since the limiting perturbed process can also be “related” (in a weak-convergence sense) to the t.p.f. , as will be shown in the following section.
V-B Invariant probability measure (i.p.m.) of perturbed process
Note that the t.p.f. of the perturbed process can be decomposed as follows:
where is the t.p.f. of the one-step process where at least two agents tremble simultaneously, i.e., they play an action uniformly at random. Note that
is the probability that at most one agent trembles. It is straightforward to check that as .
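The probability that at most one agent trembles admits a simple closed form if one assumes that the agents tremble independently, each with probability equal to the perturbation factor. The sketch below encodes this assumption (the names `n` and `lam` are illustrative) and cross-checks the closed form against the complementary binomial tail.

```python
from math import comb

def prob_at_most_one_trembles(n, lam):
    """Closed form: nobody trembles, or exactly one of the n agents trembles."""
    return (1 - lam) ** n + n * lam * (1 - lam) ** (n - 1)

def prob_at_least_two_tremble(n, lam):
    """Binomial tail, summed directly as a cross-check."""
    return sum(comb(n, k) * lam ** k * (1 - lam) ** (n - k) for k in range(2, n + 1))

n = 3
values = [prob_at_most_one_trembles(n, lam) for lam in (0.1, 0.01, 0.001)]
total = prob_at_most_one_trembles(n, 0.1) + prob_at_least_two_tremble(n, 0.1)
```

Under this assumption the weight of the at-least-two-tremble component is of second order in the perturbation factor, so it vanishes faster than the perturbation factor itself.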
Define also the infinite-step t.p.f. when trembling only at the first step (briefly, lifted t.p.f.) as follows:
where i.e., corresponds to the resolvent t.p.f.
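For intuition, a resolvent can be computed explicitly for a finite chain. The sketch below assumes the standard geometric-series form, namely (1 - phi) times the sum over t of phi^t P^t, which may differ in detail from the definition used here; the two-state matrix `P` and the value of `phi` are hypothetical, and the series is truncated at a finite number of terms.

```python
# Hypothetical two-state transition matrix (rows sum to 1).
P = [[0.9, 0.1],
     [0.3, 0.7]]

def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def resolvent(P, phi, terms=20000):
    """Truncated series (1 - phi) * sum_t phi^t P^t (assumed standard form)."""
    n = len(P)
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # P^0
    R = [[0.0] * n for _ in range(n)]
    weight = 1.0 - phi
    for _ in range(terms):
        for i in range(n):
            for j in range(n):
                R[i][j] += weight * power[i][j]
        power = mat_mul(power, P)
        weight *= phi
    return R

R = resolvent(P, phi=0.999)
```

Each row of the resolvent is a probability vector, and as `phi` approaches 1 the rows approach the invariant distribution of `P`, here (0.75, 0.25), for this ergodic example.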
In the following proposition, we establish weak-convergence of the lifted t.p.f. with as , which will further allow for an explicit characterization of the weak limit points of the i.p.m. of .
Proposition V.3 (i.p.m. of perturbed process)
The following hold:
For , .
Any invariant distribution of is also an invariant distribution of .
Any weak limit point in of , as , is an i.p.m. of .
The proof follows the same reasoning as the proof of [8, Proposition 4.3].
Proposition V.3 establishes convergence (in a weak sense) of the i.p.m. of the perturbed process to an i.p.m. of . In the following section, this convergence result will allow for a more explicit characterization of as .
V-C Equivalent finite-state Markov process
Define the finite-state Markov process as in (12).
Proposition V.4 (Unique i.p.m. of )
There exists a unique i.p.m. of . It satisfies