I Introduction
There has been much recent activity in using techniques of learning in games to design distributed control systems. This research ranges from utility function design [1, 2, 3], through the analysis of potential suboptimalities due to the use of distributed selfish controllers [4], to the design and analysis of game-theoretical learning algorithms with specific control-inspired objectives (reaching a global optimum, fast convergence, etc.) [5, 6].
In this context, considerable interest has arisen from the approach of [1, 2] in which the independent controls available to a system are distributed among a set of agents, henceforth called “players”. To complete the game-theoretical analogy, the controls available to a player are called “actions”, and each player is assigned a utility function which depends on the actions of all players (as does the global system-level utility). As such, a player’s utility in a particular play of the game could be set to be the global utility of the joint action selected by all players. However, a more learnable choice is the so-called Wonderful Life Utility (WLU) [1, 2], in which the utility of any particular player is given by how much better the system is doing as a result of that player’s action (compared to the situation where no other player changes their action but the focal player uses a baseline action instead). A fundamental result in this domain is that setting the players’ utilities using WLUs results in a potential game [7] (see Section II below). There are alternative methods for converting a system-level utility function into individual utilities, such as Shapley value utility [8]; however, most of these also boil down to a potential game (possibly in the extended sense of [3]) where the optimal system control is a Nash equilibrium of the game. Thus, by representing a control problem as a potential game, the controllers’ main objective amounts to reaching a Nash equilibrium of the resulting game.
On the other hand, like much of the economic literature on learning in games [9, 10], the vast majority of this corpus of research has focused almost exclusively on situations where each player’s controls comprise a finite set. This allows results from the theory of learning in games to be applied directly, resulting in learning algorithms that converge to the set of equilibria, and hence to system optima. However, the assumption of discrete action sets is frequently anomalous in control, engineering and economics: after all, prices are not discrete, and neither are the controls in a large number of engineering systems. For instance, in massively parallel grid computing networks (such as the Berkeley Open Infrastructure for Network Computing, BOINC) [11], the decision granularity of “bag-of-tasks” application scheduling gives rise to a potential game with continuous action sets [7]. A similar situation is encountered in the case of energy-efficient power control and power allocation in large wireless networks [12, 13]: mobile wireless users can transmit at different power levels (or split their power across different subcarriers [14]), and their throughput is a continuous function of their chosen transmit power profiles (which have to be optimized unilaterally and without recourse to user coordination or cooperation). Finally, decision-making in the emerging “smart grid” paradigm for power generation and management in electricity grids also revolves around continuous variables (such as the amount of power to generate, or when to power down during the day), leading again to game-theoretical model formulations with continuous action sets [15].
In this paper, we focus squarely on control problems (presented as potential games) with continuous action sets and we propose an actor-critic reinforcement learning algorithm that provably converges to equilibrium. To address this problem in an economic setting, very recent work by Perkins and Leslie [16] extended the theory of learning in games to zero-sum games with continuous action sets (see also [17, 18]); however, from a control-theoretical point of view, zero-sum games are of limited practical relevance because they only capture adversarial interactions between two players. Owing to this fundamental difference between zero-sum and potential games, the two-player analysis of [16] no longer applies to our case, so a completely different approach is required to obtain convergence in the context of many-player potential games.
To accomplish this, our analysis relies on two theoretical contributions of independent interest. The first is the extension of stochastic approximation techniques for Banach spaces (otherwise known as “abstract stochastic approximation” [19, 20, 21, 22, 23, 24]) to the so-called “two-timescales” framework originally introduced in standard (finite-dimensional space) stochastic approximation by [25]. This allows us to consider interdependent strategies and value functions evolving as a stochastic process in a Banach space (the space of signed measures over the players’ continuous action sets and the space of continuous functions from action space to $\mathbb{R}$ respectively, both endowed with appropriate norms). Our second contribution is the asymptotic analysis of the mean field dynamics of this process on the space of probability measures on the action space; our analysis reveals that the dynamics’ rest points in potential games are globally attracting, so, combined with our stochastic approximation results, we obtain the convergence of our actor-critic reinforcement learning algorithm to equilibrium.
In Section II we introduce the framework and notation, and present our actor–critic learning algorithm. Following that, in Section III we introduce two-timescales stochastic approximation in Banach spaces, and prove our general result. Section IV applies the stochastic approximation theory to the actor–critic algorithm to show that it can be studied via a mean field dynamical system. Section V then analyses the convergence of the mean field dynamical system in potential games, a result which allows us to prove the convergence of the actor–critic process in this context.
II Actor–critic learning with continuous action spaces
Throughout this paper, we will focus on control problems presented as potential games with finitely many players and continuous action spaces. Such a game comprises a finite set of players labelled $i \in \{1, \dots, N\}$. For each $i$ there exists an action set $A^i \subset \mathbb{R}$ which is a compact interval (we are only making this assumption for convenience; our analysis carries through to higher-dimensional convex bodies with minimal hassle); when each player selects an action $a^i \in A^i$, this results in a joint action $a = (a^1, \dots, a^N) \in A = \prod_i A^i$. We will frequently use the notation $(a^i, a^{-i})$ to refer to the joint action in which Player $i$ uses action $a^i$ and all other players use the joint action $a^{-i}$. Each player is also associated with a bounded and continuous utility function $u^i : A \to \mathbb{R}$. For the game to be a potential game, there must exist a potential function $\phi : A \to \mathbb{R}$ such that
$$u^i(a^i, a^{-i}) - u^i(\tilde a^i, a^{-i}) = \phi(a^i, a^{-i}) - \phi(\tilde a^i, a^{-i})$$
for all $i$, for all $a^{-i}$ and for all $a^i, \tilde a^i \in A^i$. Thus if any player changes their action while the others do not, the change in utility for the player that changes their action is equal to the change in value of the potential function of the game. Methods for constructing potential games from system utility functions [1, 2, 3] usually ensure that the potential corresponds to the system utility, so maximising the potential function corresponds to maximising the system utility.
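The potential condition above can be checked numerically on a grid. The following is a minimal sketch, not taken from the paper: the quadratic `potential`, the baseline action `0.0`, and the helper names are all assumptions made for illustration. It builds wonderful-life-style utilities from a system-level utility and verifies that every unilateral deviation changes the deviating player's utility by exactly the change in the potential.

```python
import itertools

def potential(a):
    """Hypothetical system-level utility on [0, 1]^2 (an assumption for this sketch)."""
    a1, a2 = a
    return -(a1 - 0.5) ** 2 - (a2 - 0.5) ** 2 - (a1 - a2) ** 2

def replace_action(a, i, value):
    """Return the joint action a with player i's action replaced by value."""
    b = list(a)
    b[i] = value
    return tuple(b)

def wlu(a, i, baseline=0.0):
    """Wonderful-life-style utility: system utility minus the system utility
    obtained when player i is switched to a fixed baseline action."""
    return potential(a) - potential(replace_action(a, i, baseline))

def is_potential_game(utility, phi, grid, n_players, tol=1e-12):
    """Check the potential condition for every unilateral deviation on a grid."""
    for a in itertools.product(grid, repeat=n_players):
        for i in range(n_players):
            for deviation in grid:
                b = replace_action(a, i, deviation)
                utility_change = utility(a, i) - utility(b, i)
                potential_change = phi(a) - phi(b)
                if abs(utility_change - potential_change) > tol:
                    return False
    return True

grid = [k / 10 for k in range(11)]
print(is_potential_game(wlu, potential, grid, 2))
```

The check passes for the wonderful-life utilities because the baseline term does not depend on the deviating player's own action, so it cancels in the difference.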
Game-theoretical analyses usually focus on mixed strategies where a player selects an action to play randomly. A mixed strategy for Player $i$ is defined to be a probability distribution over the action space $A^i$. This is a simple concept when $A^i$ is finite, but for the continuous action spaces considered in this paper more care is required. Specifically, let $\mathcal{B}^i$ be the Borel sigma-algebra on $A^i$ and let $\Delta^i$ denote the set of all probability measures on $(A^i, \mathcal{B}^i)$. Throughout this article we endow $\Delta^i$ with the weak topology, metrized by the bounded Lipschitz norm (see Section IV; also [26, 27, 16]). A mixed strategy is then an element $\pi^i \in \Delta^i$; for $B^i \in \mathcal{B}^i$ we have that $\pi^i(B^i)$ is the probability that Player $i$ selects an action in the Borel set $B^i$. Note that a mixed strategy under this definition need not admit a density with respect to Lebesgue measure, and in particular may contain an atom at a particular action $a^i \in A^i$.
Returning to our game-theoretical considerations, we extend the definition of utilities to the space of mixed strategy profiles. In particular, let $\pi = (\pi^i)_i$ be a mixed strategy profile, and define
$$u^i(\pi) = \int_A u^i(a) \, \prod_j \pi^j(da^j).$$
As before we use the notation $(\pi^i, \pi^{-i})$ to refer to the mixed strategy profile in which Player $i$ uses $\pi^i$ and all other players use $\pi^{-i}$. In further abuse of notation, we write $(a^i, \pi^{-i})$ for the mixed strategy profile $(\delta_{a^i}, \pi^{-i})$, where $\delta_{a^i}$ is the Dirac measure at $a^i$ (meaning that Player $i$ selects action $a^i$ with probability $1$). Hence $u^i(a^i, \pi^{-i})$ is the utility to Player $i$ for selecting $a^i$ when all other players use strategy $\pi^{-i}$.
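A central concept in game theory is the best response correspondence of Player $i$,
$$BR^i(\pi^{-i}) = \arg\max_{\pi^i} u^i(\pi^i, \pi^{-i}),$$
i.e. the set of mixed strategies that maximise Player $i$’s utility given any particular opponent mixed strategy $\pi^{-i}$. A Nash equilibrium is a fixed point of this correspondence, in which all players are playing a best response to all other players. In a learning context however, the discontinuities that appear in best response correspondences can cause great difficulties [28]. We focus instead on a smoothing of the best response. For a fixed $\eta > 0$, the logit best response with noise level $\eta$ of Player $i$ to strategy $\pi^{-i}$ is defined to be the mixed strategy $L^i_\eta(\pi^{-i})$ such that
$$L^i_\eta(\pi^{-i})(B^i) = \frac{\int_{B^i} \exp\{\eta^{-1} u^i(a^i, \pi^{-i})\} \, da^i}{\int_{A^i} \exp\{\eta^{-1} u^i(b^i, \pi^{-i})\} \, db^i} \tag{1}$$
for each Borel set $B^i \subseteq A^i$. In [18] it is shown that $L^i_\eta(\pi^{-i})$ is absolutely continuous (with respect to Lebesgue measure), with density given by
$$l^i_\eta(\pi^{-i})(a^i) = \frac{\exp\{\eta^{-1} u^i(a^i, \pi^{-i})\}}{\int_{A^i} \exp\{\eta^{-1} u^i(b^i, \pi^{-i})\} \, db^i}. \tag{2}$$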
To ease notation in what follows, we let $L_\eta(\pi) = \big(L^i_\eta(\pi^{-i})\big)_i$ denote the joint logit best response to the mixed strategy profile $\pi$.
The existence of fixed points of $L_\eta$ is shown in [18] and [16]; such a fixed point is a joint strategy $\pi$ such that $\pi^i = L^i_\eta(\pi^{-i})$ for each $i$, and so is a mixed strategy profile such that every player is playing a smooth best response to the strategies of the other players. Such profiles are called logit equilibria, and the collection of all such fixed points is referred to as the set of logit equilibria. A logit equilibrium is thus an approximation of a local maximizer of the potential function of the game in the sense that for small $\eta$ a logit equilibrium places most of the probability mass in areas where the joint action results in a high potential function value; in particular, logit equilibria approximate Nash equilibria when the noise level is sufficiently small. (We note here that the notion of a logit equilibrium is a special case of the more general concept of quantal response equilibrium introduced in [29].)
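To illustrate how the noise level shapes a logit best response, the sketch below approximates a logit density on a midpoint grid. It assumes the usual exponential form, density proportional to exp(payoff / eta); the payoff function, the interval [0, 1] and all function names are hypothetical choices for this example rather than anything specified in the text.

```python
import math

def logit_density(payoff, lo, hi, eta, n_grid=1000):
    """Approximate the density proportional to exp(payoff(x) / eta) on [lo, hi],
    evaluated at the midpoints of n_grid equal cells."""
    h = (hi - lo) / n_grid
    xs = [lo + (k + 0.5) * h for k in range(n_grid)]
    scaled = [payoff(x) / eta for x in xs]
    shift = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - shift) for s in scaled]
    normaliser = sum(weights) * h
    return xs, [w / normaliser for w in weights], h

def mass_near(xs, density, h, centre, radius):
    """Probability mass assigned to the cells whose midpoint is within radius of centre."""
    return sum(d * h for x, d in zip(xs, density) if abs(x - centre) <= radius)

def payoff(x):
    """Hypothetical payoff against fixed opponents, maximised at x = 0.3."""
    return -(x - 0.3) ** 2

for eta in (0.5, 0.01):
    xs, density, h = logit_density(payoff, 0.0, 1.0, eta)
    print(eta, round(mass_near(xs, density, h, 0.3, 0.1), 3))
```

With the larger noise level the density is close to uniform, while with the smaller one most of the mass sits near the maximiser, which is the sense in which logit responses approximate best responses as the noise level shrinks.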
Smooth best responses also play an important part in discrete action games, particularly when learning is considered. In this domain they were introduced in stochastic fictitious play by [30], and later studied by, among others, [31, 32, 33] to ensure the played mixed strategies in a fictitious play process converge to logit equilibrium. This is in contrast to classical fictitious play in which the beliefs of players converge, but the played strategies are (almost) always pure. The technique was also required by [34, 35, 36] to allow simple reinforcement learners to converge to logit equilibria: as discussed in [34], players whose strategies are a function of the expected value of their actions cannot converge to a Nash equilibrium because, at equilibrium, all actions in the support of the equilibrium mixed strategies will receive the same expected reward.
Recently [18] developed the dynamical systems tools necessary to consider whether the smooth best response dynamics converge to logit equilibria in the infinite-dimensional setting. This was extended to learning systems in [16], where it was shown that stochastic fictitious play converges to logit equilibrium in two-player zero-sum games with compact continuous action sets.
One of the main requirements for efficient learning in a control setting is that the full utility functions of the game need not be known in advance, and players may not be able to observe the actions of all other players. Using fictitious play (or, indeed, many of the other standard game-theoretical tools) does not satisfy this requirement because these methods assume full knowledge and observability of payoff functions and opponent actions. This is what motivates the simple reinforcement learning approaches discussed previously [34, 35, 36], and also the actor-critic reinforcement learning approach of [37], which we extend in this article to the continuous action space setting. The idea is to learn both a value function $Q^i : A^i \to \mathbb{R}$ that estimates the function $u^i(\cdot, \pi^{-i})$ for the current value of $\pi^{-i}$, while also maintaining a separate mixed strategy $\pi^i$. The critic, $Q^i$, informs the update of the actor, $\pi^i$. In turn the observed utilities received by the actor, $u^i(\cdot, a^{-i})$, inform the update of the critic $Q^i$.
In the continuous action space setting of this paper, we implement the actor-critic algorithm as the following iterative process (for a pseudocode implementation, see Algorithm 1):

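At the $n$th stage of the process, each player $i$ selects an action $a^i_n$ by sampling from the distribution $\pi^i_n$ and uses $a^i_n$ to play the game.

Players update their critics using the update equation
$$Q^i_{n+1} = Q^i_n + \beta_n \big[ u^i(\cdot, a^{-i}_n) - Q^i_n \big]. \tag{3a}$$

Each player samples $\tilde a^i_n \sim L^i_\eta(Q^i_n)$ and updates their actor using the update equation
$$\pi^i_{n+1} = \pi^i_n + \alpha_n \big[ \delta_{\tilde a^i_n} - \pi^i_n \big]. \tag{3b}$$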
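The three steps above can be sketched on a discretised action grid. This is only an illustration under several assumptions that are not in the text: actions are restricted to a finite grid (whereas the algorithm itself works with continuous action sets), the critic moves towards the counterfactual utility of each own action against the opponents' realised actions, the actor moves towards a point mass at an action sampled from the logit response to the current critic, and the step sizes `n ** -0.6` (critic) and `1 / n` (actor) are hypothetical choices whose ratio tends to zero. The two-player shared utility is likewise invented for the example.

```python
import math
import random

def sample_logit(values, eta, rng):
    """Sample a grid index with probability proportional to exp(values / eta)."""
    top = max(values)
    weights = [math.exp((v - top) / eta) for v in values]
    return rng.choices(range(len(values)), weights=weights)[0]

def actor_critic(utilities, grid, n_steps, eta=0.05, seed=0):
    """Grid-based sketch of the actor-critic iteration (assumed update forms)."""
    rng = random.Random(seed)
    n_players, k = len(utilities), len(grid)
    actors = [[1.0 / k] * k for _ in range(n_players)]   # mixed strategies
    critics = [[0.0] * k for _ in range(n_players)]      # value estimates
    for n in range(1, n_steps + 1):
        fast = n ** -0.6   # critic step size (assumed)
        slow = 1.0 / n     # actor step size (assumed); slow / fast -> 0
        # Step 1: every player samples an action from their actor and plays it.
        played = [rng.choices(range(k), weights=actors[i])[0] for i in range(n_players)]
        joint = [grid[j] for j in played]
        for i in range(n_players):
            # Sample the actor's target from the logit response to the current critic.
            target = sample_logit(critics[i], eta, rng)
            # Step 2: move the critic towards the utility of each own action
            # against the opponents' realised actions.
            for j in range(k):
                trial = list(joint)
                trial[i] = grid[j]
                critics[i][j] += fast * (utilities[i](trial) - critics[i][j])
            # Step 3: move the actor towards a point mass at the sampled target.
            actors[i] = [(1.0 - slow) * p for p in actors[i]]
            actors[i][target] += slow
    return actors, critics

def shared_utility(a):
    """Hypothetical identical-interest utility, maximised at a = (0.5, 0.5)."""
    return -(a[0] - 0.5) ** 2 - (a[1] - 0.5) ** 2 - 0.5 * (a[0] - a[1]) ** 2

grid = [j / 20 for j in range(21)]
actors, critics = actor_critic([shared_utility, shared_utility], grid, n_steps=2000)
means = [sum(x * p for x, p in zip(grid, actor)) for actor in actors]
print([round(m, 2) for m in means])
```

Because the actor is only ever mixed with point masses, it stays a finite collection of weighted atoms, so sampling from it reduces to a weighted draw over grid indices.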
The algorithm above is the main focus of our paper, so some remarks are in order:
Remark 1.
In (3a), it is assumed that a player can access $u^i(\cdot, a^{-i}_n)$, so they can calculate how much they would have received for each of their actions in response to the joint action that was selected by the other players. Even though this assumption restricts the applicability of our method somewhat, it is relatively harmless in many settings; for instance, in congestion games such estimates can be calculated simply by observing the utilization level of the system’s facilities. Note further that to implement this algorithm an individual need not actually observe the action profile $a^{-i}_n$, needing only the utility $u^i(\cdot, a^{-i}_n)$. This means that a player need know nothing at all about the players who do not directly affect her utility function, which allows a degree of separation and modularisation in large systems, as demonstrated in [38].
Remark 2.
The logit response used to sample the $\tilde a^i_n$ used in (3b) is now parameterised by $Q^i_n$ instead of $\pi^{-i}_n$. This is a trivial change in which we use $Q^i_n(a^i)$ in place of $u^i(a^i, \pi^{-i})$ in (1), which represents the fact that players now select smooth best responses to their critic instead of directly to the estimated mixed strategy of the other players.
Remark 3.
Also in (3b), the players update towards a sampled point mass $\delta_{\tilde a^i_n}$ instead of toward the full logit response $L^i_\eta(Q^i_n)$. This is so that the actor $\pi^i_n$ can be represented as a collection of weighted atoms, instead of as a complicated and continuous probability measure. Representing $\pi^i_n$ as a collection of atoms means that sampling $a^i_n \sim \pi^i_n$ is particularly easy.
Remark 4.
On the other hand, sampling $\tilde a^i_n$ from the logit response could be extremely difficult for general $Q^i_n$. The gradual evolution of the $Q^i_n$ however implies that a sequential Monte Carlo sampler [39] could be used to produce samples according to the logit response. The representation of $Q^i_n$ is also potentially troublesome and we do not address it fully here. However, one could assume that each $Q^i_n$ can be represented as a finite linear combination of basis functions such as a spline, Fourier or wavelet basis. Another option would be to slowly increase the size of a Fourier or wavelet basis as $n$ gets large, resulting in vanishing bias terms which can be easily incorporated in the stochastic approximation framework.
The remainder of this article works to prove the following theorem, while also providing several auxiliary results of independent interest along the way:
Theorem 1.
In a continuous-action-set potential game with bounded Lipschitz rewards and isolated equilibrium components, the actor–critic algorithm (3) converges strongly to a component of the equilibrium set (a.s.).
Remark.
We recall here that the notion of strong convergence of probability measures $\mu_n \to \mu$ is defined by asking that $\mu_n(B) \to \mu(B)$ for every measurable set $B$. As such, this notion of convergence is even stronger than the notion of convergence in distribution (weak convergence) used in the central limit theorem and other weak-convergence results.
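A standard example, added here only as an illustration and not drawn from the paper, shows why setwise (strong) convergence is strictly more demanding than weak convergence. The measures $\mu_n$ and the set $B$ below are chosen for this example.

```latex
% Point masses approaching the origin on [0, 1]:
\mu_n = \delta_{1/n}, \qquad \mu = \delta_0 .

% Weak convergence holds: for every bounded continuous f,
\int f \, d\mu_n = f(1/n) \;\longrightarrow\; f(0) = \int f \, d\mu .

% Strong (setwise) convergence fails: take the measurable set B = \{0\}. Then
\mu_n(B) = 0 \quad \text{for all } n \ge 1, \qquad \text{but} \qquad \mu(B) = 1 .
```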
III Two-timescales stochastic approximation in Banach spaces
The analysis of systems such as Algorithm 1 is enabled by the use of two-timescales stochastic approximation techniques [25]. By letting the ratio of the slow learning rate to the fast learning rate tend to zero as $n \to \infty$, the system can be analysed as if the ‘fast’ update (3a), which uses the larger learning parameter, has fully converged to the current value of the ‘slow’ system (3b), which uses the smaller learning parameter. Note that it is not the case that we have an outer and inner loop, in which (3a) is run to convergence for every update of (3b): both the actor and the critic are updated on every iteration. It is simply that the two-timescales technique allows us to analyse the system as if there were an inner loop.
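The two-timescales idea can be seen in a finite-dimensional toy problem. The sketch below is an illustration only: the mean fields, the Gaussian noise and the step sizes `1 / n` and `n ** -0.6` are assumptions made for the example. The fast variable `y` tracks `2 * x` for the current value of the slow variable `x`, so that `x` behaves approximately like the one-dimensional recursion with mean field `1 - 2 * x`, whose rest point is `x = 0.5`.

```python
import random

def two_timescale(n_steps, noise=0.1, seed=0):
    """Toy two-timescale stochastic approximation in one dimension each.

    Fast mean field: 2 * x - y   (y tracks y*(x) = 2 * x for frozen x)
    Slow mean field: 1 - y       (becomes 1 - 2 * x once y has equilibrated)
    """
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    for n in range(1, n_steps + 1):
        slow = 1.0 / n        # slow step size (assumed)
        fast = n ** -0.6      # fast step size (assumed); slow / fast -> 0
        # Both variables are updated on every iteration: there is no inner loop.
        new_x = x + slow * ((1.0 - y) + rng.gauss(0.0, noise))
        new_y = y + fast * ((2.0 * x - y) + rng.gauss(0.0, noise))
        x, y = new_x, new_y
    return x, y

x, y = two_timescale(20000)
print(round(x, 2), round(y, 2))
```

In this toy setting the iterates settle near `x = 0.5` and `y = 1.0`, the rest point obtained by substituting the equilibrated fast variable into the slow mean field.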
That being said, the results of [25] are only cast in the framework of finite-dimensional spaces. We have already observed that, with continuous action spaces, the mixed strategies are probability measures on the action spaces and the critics are functions on them. Placing appropriate norms on these spaces results in Banach spaces, and in this section we combine the two-timescales results of [25] with the Banach space stochastic approximation framework of [16] to develop the tool necessary to analyse the recursion (3).
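To that end, consider the general two-timescales stochastic approximation system
$$x_{n+1} = x_n + \alpha_n \big[ F(x_n, y_n) + U_{n+1} + c_n \big], \tag{4a}$$
$$y_{n+1} = y_n + \beta_n \big[ G(x_n, y_n) + V_{n+1} + d_n \big], \tag{4b}$$
where

$x_n$ and $y_n$ are sequences in the Banach spaces $X$ and $Y$ respectively.

$\alpha_n$ and $\beta_n$ are the learning rate sequences of the process.

$F$ and $G$ comprise the mean field of the process.

$U_n$ and $V_n$ are stochastic processes in $X$ and $Y$ respectively. (For a detailed exposition of Banach-valued random variables, see [40].)

$c_n$ and $d_n$ are bias terms that converge almost surely to $0$.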
We will study this system using the asymptotic pseudotrajectory approach of [41], which is already cast in the language of metric spaces; since Banach spaces are metric, the framework of [41] still applies to our scenario. This modernises the approach of [22] while also introducing the two-timescales technique to ‘abstract stochastic approximation’.
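To proceed, recall that a semiflow on a metric space, $M$, is a continuous map $\Phi : \mathbb{R}_+ \times M \to M$, $(t, x) \mapsto \Phi_t(x)$, such that $\Phi_0(x) = x$ and $\Phi_{t+s}(x) = \Phi_t(\Phi_s(x))$ for all $t, s \ge 0$. As in simple Euclidean spaces, well-posed differential equations on Banach spaces induce a semiflow [42]. A continuous function $z : \mathbb{R}_+ \to M$ is an asymptotic pseudotrajectory for $\Phi$ if for any $T > 0$,
$$\lim_{t \to \infty} \sup_{0 \le s \le T} d\big( z(t + s), \Phi_s(z(t)) \big) = 0,$$
where $d$ denotes the metric on $M$.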
Properties of asymptotic pseudotrajectories are discussed in detail in [41].
We will prove that interpolations of the stochastic approximation process (4) result in asymptotic pseudotrajectories to flows induced by dynamical systems on and governed by and respectively. To do so, and to allow us to state necessary assumptions on the processes, we define timescales on which we will interpolate the stochastic approximation process. In particular, let (with ), and for let . Similarly let (with ), and for let .
With these timescales we define interpolations of the stochastic approximation processes (4). On the slow () timescale we define a continuous-time interpolation of by letting
(5) 
for . On the fast () timescale we consider , and define the continuous time interpolation of by letting
(6) 
for .
Our assumptions, which are simple extensions to those of [25] and [41], can now be stated as follows:

Noise control.

For all ,

and are bounded sequences such that and as .


Boundedness and continuity.

There exist compact sets and such that and for all .

and are bounded and uniformly continuous on .


Learning rates.

and with and as .

as .


Mean field behaviour.

For any fixed the differential equation
(7) has unique solution trajectories that remain in for any initial value . Furthermore the differential equation (7) has a unique globally attracting fixed point , and the function is Lipschitz continuous.

The differential equation
(8) has unique solution trajectories that remain in for any initial value .

Assumption A1 is the standard assumption for noise control in stochastic approximation. It has traditionally caused difficulty in abstract stochastic approximation, but recent solutions are discussed in the following paragraph. Assumption A2 is simply a boundedness and continuity assumption, but can cause difficulty with some norms in function spaces. Assumption A3 provides the two-timescales nature of the scheme, with both learning rate sequences converging to 0, but with the slow learning rate becoming much smaller than the fast one. Finally, Assumption A4 provides both the existence of unique solutions of the relevant mean field differential equations, and the useful separation of timescales in continuous time which is directly analogous to Assumption (A1) of [25]. Note that we do not make the stronger assumption that there exists a unique globally asymptotically stable fixed point in the slow timescale dynamics (8) [25, Assumption A2]; this assumption is not necessary for the theory presented here, and would unnecessarily restrict the applicability of the results.
Note that the noise assumption A1(a) has traditionally caused difficulty for stochastic approximation on Banach spaces: [23] considers the simple case where the stochastic terms are independent and identically distributed, whilst [22] proves a very weak convergence result for a particular process which again uses independent noise. However, [16] provides criteria, analogous to the martingale noise assumptions of finite-dimensional stochastic approximation, which guarantee that the noise condition A1(a) holds in useful Banach spaces. In particular, if is a sequence of martingale differences in Banach space , then
with probability 1 if is:

the space of functions for , is deterministic with , is a martingale difference sequence with respect to some filtration , and (cf. the remark following Proposition A.1 of [16]);

the space of functions on bounded spaces (see [43]); or

the space of finite signed measures on a compact interval of with the bounded Lipschitz norm (see [26, 27, 16] or Section IV below) is deterministic with , where there exists a filtration such that is measurable with respect to , is a bounded absolutely continuous probability measure which is measurable with respect to and has density , and is sampled from the probability distribution (Proposition 3.6 of [16]);
Clearly, if similar conditions also hold for then Assumption A1(a) holds.
Our first lemma demonstrates that we can analyse the system as if the fast system is fully calibrated to the slow system . By this we mean that, for sufficiently large , is close to the value it would converge to if were fixed and allowed to fully converge.
Lemma 2.
Under Assumptions A1–A4,
Proof:
Let , with the induced product norm from the topologies of and . Under this topology, is a Banach space, and is compact. The updates (4) can be expressed as
(9) 
where is such that , for , and
Assumptions A1–A4 imply the assumptions of Theorem 3.3 of [16]. Most are direct translations, but the noise must be carefully considered. For any , any , and any ,
Since , the second term converges to 0 as . Hence, using assumption A1 to control the first term,
Therefore , defined in (6), is an asymptotic pseudotrajectory of the flow defined by
(10) 
Assumption A4(a) implies that is globally attracting for (10). Hence Theorem 6.10 of [41] gives that . The result follows by the continuity of assumed in A4(a). ∎
We use this fact to consider the evolution of on the slow timescale.
Theorem 3.
Proof:
Rewrite (4a) as
(11) 
where . We will show that this is a well-behaved stochastic approximation process. In particular, we need to show that can be absorbed into in such a way that the equivalent Assumption A1 of [16] can be applied to .
By Lemma 2 we have that . Hence we can define
with as . By the uniform continuity of , it follows that we can define a sequence such that for all , .
From this construction, for any and for any ,
As in the proof of Lemma 2, similar arguments can be used for under assumption (A1)(b). Hence for all ,
Once again it is straightforward to show that, under (A1)–(A4), the slow timescale stochastic approximation (11) satisfies the assumptions of Theorem 3.3 of [16], and therefore is an asymptotic pseudotrajectory to the flow induced by the differential equation (8). ∎
While [41] provides several results that can be combined with Theorem 3, we summarise the result used in this paper with the following corollary:
Corollary 4.
Suppose that Assumptions A1–A4 hold. Then converges to an internally chain transitive set of the flow induced by the mean field differential equation (8).
IV Stochastic approximation of the actor–critic algorithm
In this section we demonstrate that the actor–critic algorithm (3) can be analysed using the two-timescales stochastic approximation framework of Section III. Our first task is to define the Banach spaces in which the algorithm evolves.
Note that the set of probability distributions on an action space is a subset of the space of finite signed measures on that action space. To turn this space into a Banach space, the most convenient norm for our purposes is the bounded Lipschitz (BL) norm (for a discussion regarding the appropriateness of this norm for game-theoretical considerations, see [26, 27, 18], and, for stochastic approximation, especially [16]). To define the BL norm, let
Then, for we define
with norm is a Banach space [27], and convergence of a sequence of probability measures under corresponds to weak convergence of the measures [26]. Under the BL norm, is a compact subset of (see Proposition 4.6 of [16]), allowing Assumption A2 to be easily verified.
We consider mixed strategy profiles as existing in the subset of the product space We use the max norm to induce the product topology, so that if we define
(12) 
Suppose also that utility functions are bounded and Lipschitz continuous. Since their domain is a bounded interval of we can assume that the estimates are in the Banach space of functions with a finite norm, under the
norm. Hence we consider the vectors
as elements of the Banach space with
Theorem 5.
Consider the actor–critic algorithm (3). Suppose that for each the action space is a compact interval of , and the utility function is bounded and uniformly Lipschitz continuous. Suppose also that and are chosen to satisfy Assumption A3 as well as and Then, under the bounded Lipschitz norm, converges with probability 1 to an internally chain transitive set of the flow defined by the player logit best response dynamics
(13) 
Proof:
We take , and as above. This allows a direct mapping of the actor–critic algorithm (3) to the stochastic approximation framework (4) by taking
and
By Corollary 4 we therefore only need to verify Assumptions A1–A4.
 A1:

is of exactly the form studied by [16] and therefore Proposition 3.6 of that paper suffices to prove the condition on the tail behaviour of holds with probability 1. The are martingale difference sequences, since , and the are functions. Hence Proposition A.1 of [16] suffices to prove the condition on the tail behaviour of holds with probability 1 under the norm. Since and are identically zero, we have shown that A1 holds.
 A2:

is a compact subset of under the bounded Lipschitz norm, so taking suffices. Furthermore, with bounded continuous reward functions it follows that the are uniformly bounded and equicontinuous and therefore remain in a compact set . is clearly uniformly continuous on the compact set . The continuity of , and therefore , is shown in Lemma C.2 of [16].
 A3:

The learning rates are chosen to satisfy this assumption.
 A4:

For fixed , the differential equations
converge exponentially quickly to . Furthermore is Lipschitz continuous in , so part (a) is satisfied. Equation (8) then becomes
Since we rewrote to depend on the utility functions instead of directly on , we find that we have recovered the logit best response dynamics of [18] and [16], which those authors show to have unique solution trajectories.∎
V Convergence of the logit best response dynamics
We have shown in Theorem 5 that the actor–critic algorithm (3) results in joint strategies that converge to an internally chain transitive set of the flow defined by the logit best response dynamics (13) under the bounded Lipschitz norm. It is demonstrated in [16] that in two-player zero-sum continuous action games the set of logit equilibria (the fixed points of the logit best response) is a global attractor of the flow. Hence, by Corollary 5.4 of [41], in such games any internally chain transitive set is contained in the set of logit equilibria.
However, two-player zero-sum games are not particularly relevant for control systems: multiplayer potential games are much more important. The logit best responses in a potential game are identical to the logit best responses in the identical interest game in which the potential function is the global utility function. Hence the evolution of strategies under the logit best response dynamics in a potential game is identical to that in the identical interest game in which the potential acts as the global utility. We therefore carry out our convergence analysis for the logit best response dynamics (13) in $N$-player identical interest games with continuous action spaces. See [44] for related issues.
For the remainder of this section we work to prove the following theorem:
Theorem 6.
In a potential game with continuous bounded rewards, in which the connected components of the set of logit equilibria of the game are isolated, any internally chain transitive set of the flow induced by the smooth best response dynamics (13) is contained in a connected component of the set of logit equilibria.
Define
Appendix C of [16] shows that if the utility functions are bounded and Lipschitz continuous then, for any , there exists a such that for all , and that is forward invariant under the logit best response dynamics. For the remainder of this article, is taken to be sufficiently large for this to be the case.
Our method first demonstrates that the set is globally attracting for the flow, so any internally chain transitive set of the flow is contained in . The nice properties of then allow the use of a Lyapunov function argument to show that any internally chain transitive set in is a connected set of logit equilibria.
Lemma 7.
Let be an internally chain transitive set. Then .
Proof:
Consider the trajectory of (13) starting at an arbitrary . We can write as
Defining
it is immediate both that and
(14) 
Thus approaches at an exponential rate, uniformly in . Hence is uniformly globally attracting.
We would like to invoke Corollary 5.4 of [41], but since may not be invariant it is not an attractor in the terminology of [41] either. We therefore prove directly that . Suppose not, so there exists a point and by the compactness of internally chain transitive sets there exists a such that . There exists a such that for the trajectory with ,