1.1. Entropy games and matrix multiplication games
Entropy games were introduced by Asarin et al. [ACD16]. They model the situation in which two players with conflicting interests, called “Despot” and “Tribune”, wish to minimize or to maximize a topological entropy representing the freedom of a half-player, “People”. Entropy games are special “matrix multiplication games”, in which two players alternately choose matrices in certain prescribed sets; the first player wishes to minimize the growth rate of the infinite matrix product obtained in this way, whereas the second player wishes to maximize it. Whereas matrix multiplication games are hard in general (computing joint spectral radii is a special case), entropy games correspond to a tractable subclass of multiplication games, in which the matrix sets have the property of being invariant by row interchange, the so-called independent row uncertainty (IRU) assumption, sometimes also called the row-wise or rectangularity assumption. In particular, Asarin et al. showed in [ACD16] that the problem of comparing the value of an entropy game to a given rational number is in NP ∩ coNP, giving entropy games a status somewhat comparable to other important classes of games with an unsettled complexity, including mean payoff games, simple stochastic games, or stochastic mean payoff games; see [AM09] for background.
Another motivation to study entropy games arises from risk sensitive control [FHH97, FHH99, AB17]: as we shall see, essentially the same class of operators arises in the latter setting. A recent application of entropy games to the approximation of the joint spectral radius of nonnegative matrices (without making the IRU assumption) can be found in [GS18]. Other motivations originate from symbolic dynamics [Lot05, Chapter 1.8.4].
We first show that entropy games, which were introduced as a new class of games, are equivalent to a class of zero-sum mean payoff stochastic games with perfect information, in which some action spaces are simplices, and the instantaneous payments are given by a Kullback-Leibler entropy. Hence, entropy games fit in a classical class of games, with a “nice” payment function over infinite action spaces.
To do so, we introduce a slightly more expressive variant of the model of Asarin et al. [ACD16], in which the initial state is prescribed (the initial state is chosen by a half-player, People, in the original model). This may look like a relatively minor extension, so we keep the name “entropy game” for it, but this extension is essential to develop an operator approach and derive consequences from it. We show that the main results known for stochastic mean payoff games with finite action spaces and perfect information, namely the existence of the value and the existence of optimal positional strategies, are still valid for entropy games (Theorems 10 and 9). This is derived from a model theory approach of Bolte, Gaubert, and Vigeral [BGV14], together with the observation that the dynamic programming operators of entropy games are definable in the real exponential field. Then, a key ingredient is the proof of existence of Blackwell optimal policies, as a consequence of o-minimality, see Theorem 8. Another consequence of the operator approach is the existence of Collatz-Wielandt optimality certificates for entropy games, Theorem 13. When specialized to the one player case, this leads to a convex programming characterization of the value, Corollary 14, which can also be recovered from a characterization of Anantharam and Borkar [AB17].
Our main result, Theorem 16, shows that entropy games in which Despot has a fixed number of significant states (states with a nontrivial choice) can be solved strategically in polynomial time, meaning that optimal (stationary) strategies can be found in polynomial time. Thus, entropy games are somewhat similar to stochastic mean payoff games, for which an analogous fixed-parameter tractability result holds (by reducing the one player case to a linear program). This approach also reveals a fundamental asymmetry between the players Despot and Tribune: our approach does not lead to a polynomial bound if one fixes the number of states of Tribune. In our proof, o-minimality arguments allow a reduction from the two-player to the one-player case (Theorem 9). Then, the one-player case is dealt with using several ingredients: ellipsoid method, separation bounds between algebraic numbers, and results from Perron-Frobenius theory.
The operator approach also allows one to obtain practically efficient algorithms to solve entropy games. In this way, the classical policy iteration of Hoffman-Karp [HK66] can be adapted to entropy games. We report experiments showing that when specialized to one player problems, policy iteration yields a speedup by one order of magnitude by comparison with the “spectral simplex” method recently introduced by Protasov [Pro15].
Let us finally complete the discussion of related works. The formulation of entropy games in terms of “classical” mean payoff games in which the payments are given by a Kullback-Leibler entropy builds on known principles in risk sensitive control [FHH99, AB17]. It can be thought of as a version for two player problems of the Donsker-Varadhan characterization of the Perron-eigenvalue [DV75]. The latter is closely related to the log-convexity property of the spectral radius established by Kingman [Kin61]. A Donsker-Varadhan type formula for risk sensitive problems, which can be applied in particular to Despot-free player entropy games, has been recently obtained by Anantharam and Borkar, in a wider setting allowing an infinite state space [AB17]. In a nutshell, for Despot-free problems, the Donsker-Varadhan formula appears to be the (convex-analytic) dual of the Collatz-Wielandt formula. Chen and Han [CH14] developed a related convex programming approach to solve the entropy maximization problem for Markov chains with uncertain parameters. We also note that the present Collatz-Wielandt approach, building on [AGN11], yields an alternative to the approach of [ACD16] using the “hourglass alternative” of [Koz15] to produce concise certificates allowing one to bound the value of entropy games. By comparison with [ACD16], an essential difference is the use of o-minimality arguments: these are needed because we study the more precise version of the game, in which the initial state is fixed. Indeed, a counterexample of Vigeral shows that the mean payoff may not exist in such cases without an o-minimality assumption [Vig13], whereas the existence of the mean payoff holds universally (without restrictions of an algebraic nature on the Shapley operator) if one allows one player to choose the initial state, see e.g. Proposition 2.12 of [AGG12]. Finally, the identification of tractable subclasses of matrix multiplication games can be traced back at least to the work of Blondel and Nesterov [BN09].
2. Entropy games
An entropy game is a perfect information game played on a finite directed weighted graph . There are two players, “Despot” and “Tribune”, and a half-player with a nondeterministic behavior, “People”. The set of nodes of the graph is written as the disjoint union , where and represent sets of states in which Despot, Tribune, and People play. We assume that the set of arcs is included in , meaning that Despot, Tribune, and People alternate their actions. A weight , which is a positive real number, is attached to every arc . All the other arcs in have weight . An initial state, , is known to the players. A token, initially in node , is moved in the graph according to the following rule. If the token is currently in a node belonging to , then Despot chooses an arc and moves the token to a node . Similarly, if the token is currently in a node , Tribune chooses an arc and moves the token to node . Finally, if the token is in a node , People chooses an arc and moves the token to a node . We will assume that every player has at least one possible action in each state in which it is his or her turn to play. In other words, for all , the set of actions must be nonempty, and similar conditions apply to and .
A history of the game consists of a finite path in the directed graph , starting from the initial node . The number of turns of this history is defined to be the length of this path, each arc counting for a length of one third. The weight of a history is defined to be the product of the weights of the arcs arising on this path. For instance, a history where , and , makes and turn, and its weight is .
A strategy of Player Despot is a map which assigns to every history ending in some node in an arc of the form . Similarly, a strategy of Player Tribune is a map which assigns an arc to every history ending with a node in . The strategy is said to be positional if it only depends on the last node which has been visited and possibly on the number of turns. Similarly, the strategy is said to be positional if it only depends on and possibly on the number of turns. These strategies are, in addition, stationary if they do not depend on the number of turns.
For every integer , we define as follows the game in horizon with initial state , . We assume that Despot and Tribune play according to the strategies . Then, People plays in a nondeterministic way. Therefore, the pair of strategies allows for different histories. The payment received by Tribune, in turns, is denoted by . It is defined as the sum of the weights of all the paths of the directed graph of length with initial node determined by the strategies and : each of these paths corresponds to different successive choices of People, leading to different histories allowed by the strategies . The payment received by Despot is defined to be the opposite of , so that the game in horizon is zero-sum. In that way, the payment measures the “freedom” of People; Despot wishes to minimize it whereas Tribune wishes to maximize it.
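To make the definition of the payment concrete, here is a minimal Python sketch under explicit assumptions: the strategies are stationary and positional (given as the dictionaries `sigma` and `tau`), the toy graph and the names are hypothetical, and only the arcs leaving People's states carry weights (every other arc is counted with weight 1). It sums path weights by enumerating the choices of People, and compares the result with a recursion that amounts to repeated matrix-vector products.

```python
# Hypothetical toy game (not the example discussed in the text).
sigma = {0: "a", 1: "b"}              # Despot: state d -> Tribune state t
tau = {"a": "p", "b": "q"}            # Tribune: state t -> People state p
weights = {"p": {0: 1.0, 1: 1.0},     # People's arcs p -> d with positive weights
           "q": {0: 1.0}}

def payment_by_enumeration(d, k):
    """Sum of the weights of all k-turn paths from d allowed by (sigma, tau)."""
    if k == 0:
        return 1.0  # the empty path has weight 1 (empty product)
    p = tau[sigma[d]]
    return sum(w * payment_by_enumeration(d_next, k - 1)
               for d_next, w in weights[p].items())

def payment_by_recursion(k):
    """Same quantity for every initial state, via k matrix-vector products."""
    states = sorted(sigma)
    value = {d: 1.0 for d in states}
    for _ in range(k):
        value = {d: sum(w * value[d_next]
                        for d_next, w in weights[tau[sigma[d]]].items())
                 for d in states}
    return value

for k in range(6):
    print(k, payment_by_enumeration(0, k), payment_by_recursion(k)[0])
```

The enumeration is exponential in the horizon, whereas the recursion is linear in it; both compute the same number, since each path corresponds to one sequence of choices of People.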
We say that the game in horizon with initial state has the value if for all , there is a strategy of Despot such that for all strategies of Tribune,
and similarly, there is a strategy of Tribune such that for all strategies of Despot,
The strategies and are said to be -optimal. In other words, Despot can make sure his loss will not exceed by playing , and Tribune can make sure to win at least by playing . The strategies and are optimal if they are -optimal, i.e., if we have the saddle point property:
for all strategies of Despot and Tribune.
If the value exists for all choices of the initial state , we define the value vector of the family of games in horizon , to be .
We now define the infinite horizon game , in which the payment received by Tribune is given by
and the payment received by Despot is the opposite of the latter payment. (The choice of limsup is somewhat arbitrary: we could choose liminf instead without affecting the results which follow.) The value of the infinite horizon game , and the optimal strategies in this game, are still defined by a saddle point condition, as in (1), (2), (3), the payment being now replaced by .
We denote by the value vector of the infinite horizon games .
We associate to the latter games the dynamic programming operator , such that, for all , and ,
To relate this operator with the value of the above finite or infinite horizon games, we shall interpret these games as zero-sum stochastic games with expected multiplicative criteria. The one-player case was studied in particular by Howard and Matheson under the name of risk-sensitive Markov decision processes [HM72] and by Rothblum under the name of multiplicative Markov decision processes, see for instance [Rot84].
For any node , we denote by the set of actions available to People in state , and we denote by the probability measure on obtained by normalizing the restriction of the weight function to : with . Then, can be rewritten as
A pair of strategies and of both players determines the stochastic process with values in , such that for all and all histories having turns and ending in , and such that the transitions from to and to are deterministically determined by the strategies and respectively as in the above description of the entropy games . Then, the payoff of the entropy game with horizon starting in , , is equal to the following expected multiplicative/risk-sensitive criterion:
The value of the entropy game in horizon with initial state , , does exist. The value vector of this game is determined by the relations , , , where is the unit vector of . Moreover, there exist optimal strategies for Despot and Tribune that are positional.
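The recursion in this statement can be sketched in Python as follows. This is only an illustration under stated assumptions: the operator is taken to have the min-max-sum form suggested by the rules of the game (Despot minimizes over his arcs, Tribune maximizes over hers, and People's arcs contribute a weighted sum), the iteration starts from the unit vector, and the toy game and all names (`despot_moves`, `tribune_moves`, `weights`, `shapley_step`) are hypothetical.

```python
# Hypothetical toy game: Despot states 0, 1; Tribune states "a", "b"; People states "p", "q".
despot_moves = {0: ["a", "b"], 1: ["a"]}          # arcs d -> t
tribune_moves = {"a": ["p", "q"], "b": ["p"]}     # arcs t -> p
weights = {"p": {0: 2.0, 1: 1.0}, "q": {1: 4.0}}  # arcs p -> d with positive weights

def shapley_step(V, despot_moves, tribune_moves, weights):
    """One application of the assumed operator:
    F_d(V) = min over arcs (d,t) of max over arcs (t,p) of sum_{d'} m[p][d'] * V[d']."""
    return {
        d: min(
            max(sum(w * V[d2] for d2, w in weights[p].items())
                for p in tribune_moves[t])
            for t in despot_moves[d])
        for d in despot_moves
    }

def finite_horizon_values(k, despot_moves, tribune_moves, weights):
    """Value vector after k turns, starting from the unit vector."""
    V = {d: 1.0 for d in despot_moves}
    for _ in range(k):
        V = shapley_step(V, despot_moves, tribune_moves, weights)
    return V

print(finite_horizon_values(3, despot_moves, tribune_moves, weights))
```

Recording, at each step, which arcs attain the minimum and the maximum yields positional (time-dependent) strategies of the kind constructed in the proof.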
This result follows from a classical dynamic programming argument. Indeed, in the one player case, that is when there is only one choice of or one choice of , that is when the operator contains only a “min” or a “max”, the game is in the class of Markov Decision Problems with multiplicative criterion and the Dynamic Programming Principle has already been proved in this setting in [HM72, Rot84], see also [Whi82, Th. 1.1, Chap 11]. This shows that the game has a value which satisfies and , and that an optimal strategy is obtained using these equations. For instance for a “max” (when Despot has only one choice), Tribune chooses any action attaining the maximum in
The resulting strategy is positional and it is optimal among all strategies . A similar result holds for a “min”, leading to a positional strategy for Despot.
Let us now consider the general two-player case. Define the sequence of vectors by
with , for all . We construct candidate strategies and , depending on the current position and number of turns, as follows. In state , if there remain turns to be played, Despot selects an action achieving the minimum in (5). We denote by the value of such that is selected. In state , if there remain turns to be played, Tribune chooses any action attaining the maximum in
Now, if Player Despot plays according to , we obtain a reduced one player game. It follows from the same dynamic programming principle as above (applied here to time dependent transition probabilities and factors ) that the value vector of this reduced game in horizon does exist and satisfies the recursion
with , for all . Since is the value, we have for all strategies of Tribune. Noting that by definition of , we deduce that Despot, by playing , can guarantee that his loss in the horizon game starting from state will not exceed . A dual argument shows that by playing , Tribune can guarantee that his win will be at least . ∎
Consider the entropy game whose graph and dynamic programming operator are given by:
For readability, the states of Despot are shown twice on the picture. Here, , , , , , , , , , , , , , , , , , , , , and all the weights are equal to , i.e., for all and such that .
One can check that , where and is the Fibonacci sequence. As an application of Theorem 10 below, it can be checked that the value vector of this entropy game is where is the golden mean.
3. Stochastic mean payoff games with Kullback-Leibler payments
We next show that entropy games are equivalent to a class of stochastic mean payoff games in which some action spaces are simplices, and payments are given by a Kullback-Leibler divergence.
To the entropy game , we associate a stochastic zero-sum game with Kullback-Leibler payments, denoted and defined as follows, referred to as “Kullback-Leibler game” for brevity. This new game is played by the same players, Despot and Tribune, on the same weighted directed graph (so with the same sets and the same weight function ). The nondeterministic half-player, People, will be replaced by a standard probabilistic half-player, Nature.
For any node , recalling that is the set of actions available to People in state , we denote by the set of probability measures on . Therefore, an element of can be identified with a vector with nonnegative entries and sum . The admissible actions of Despot and Tribune in the states and are the same in the game and in the entropy game . However, the two games have different rules when the state belongs to the set of People’s states. Then, Tribune is allowed to play again, by selecting a probability measure ; in other words, Tribune plays twice in a row, selecting first an arc , and then a measure . Then, Nature chooses the next state according to probability , and Tribune receives the payment , where is the relative entropy or Kullback-Leibler divergence between and the measure obtained by restricting the weight function to :
Therefore, using the notations of Section 2, we get that is minimal when the chosen probability distribution on is equal to the probability distribution of the transitions from state in the stochastic game defined in Section 2. Recall that relative entropy is related to information theory and statistics [Kul97]. An interesting special case arises when , as in [ACD16], thus is the uniform distribution on . Then, is nothing but the Shannon entropy of .
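The two properties just stated can be checked numerically. The following Python sketch assumes the usual formula for the relative entropy of a probability vector `nu` with respect to a positive (not necessarily normalized) weight vector `m`, namely the sum of `nu[d] * log(nu[d] / m[d])`, and assumes that the payment is minus this quantity; the weight values and function names are hypothetical.

```python
import math

def relative_entropy(nu, m):
    """sum_d nu[d] * log(nu[d] / m[d]), with the convention 0 * log 0 = 0."""
    return sum(x * math.log(x / w) for x, w in zip(nu, m) if x > 0.0)

def shannon_entropy(nu):
    """-sum_d nu[d] * log(nu[d]), with the convention 0 * log 0 = 0."""
    return -sum(x * math.log(x) for x in nu if x > 0.0)

m = [2.0, 1.0, 1.0]                    # hypothetical positive weights of People's arcs
normalized = [w / sum(m) for w in m]   # weights normalized into a probability vector

# The relative entropy is smallest at the normalized weights, where it equals -log(sum(m)).
print(relative_entropy(normalized, m), -math.log(sum(m)))
print(relative_entropy([1 / 3, 1 / 3, 1 / 3], m))

# With unit weights, minus the relative entropy is the Shannon entropy.
nu = [0.5, 0.25, 0.25]
print(-relative_entropy(nu, [1.0, 1.0, 1.0]), shannon_entropy(nu))
```

The minimality follows from writing the relative entropy with respect to `m` as the Kullback-Leibler divergence to the normalized weights (which is nonnegative) minus `log(sum(m))`.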
A history in the game now consists of a finite sequence , , , which encodes both the states and actions which have been chosen. A strategy of Despot is still a function which associates to a history ending in a state in an arc in . A strategy of Tribune has now two components , is a map which assigns to a history ending in a state in an arc , as before, whereas assigns to the same history and to the next state chosen according to a probability measure on .
To each history corresponds a path in , obtained by ignoring the occurrences of probability measures. For instance, the path corresponding to the history is . Again, the number of turns of a history is defined as the length of this path, each arc counting for . So the number of turns of is and . Choosing strategies and of both players and fixing the initial state determines a probability measure on the space of histories . We denote by
the expectation of the payment received by Tribune, in turns, with respect to this measure, where is as in (6) and is the weight function of the graph of the game. We denote by the value of the game in horizon , with initial state , and we denote by the value vector. As in the case of entropy games, we shall use subscripts and superscripts to indicate special versions of the game, e.g., refers to the game in horizon with initial state . Note also our convention to use lowercase letters (as in ) to refer to the game with Kullback-Leibler payments, whereas we used uppercase letters (as in ) to refer to the entropy game.
It will be convenient to consider more special games in which the actions of one of the players are restricted. We will call policy of Despot a stationary positional strategy of this player, i.e., a map which assigns to every node a node such that . Similarly, we will call policy of Tribune a map which assigns to every node a node such that . Observe, in this definition of policy, the symmetry between Despot and Tribune, while the game is asymmetric: the policy is not enough to determine a positional strategy of Tribune, because the probability distribution at every state is not specified by the policy . The sets of policies of Despot and Tribune are denoted by and , respectively.
If one fixes a policy of Despot, we end up with a reduced game in which only Tribune has actions. We denote by the value vector of this game in horizon . Similarly, if one fixes a policy of Tribune, we obtain a reduced game denoted by , in which Despot plays when the state is in , Tribune selects an action according to the policy when the state is in , and Tribune plays when the state is in . The value vector of this reduced game is denoted by . We also denote by the value of the reduced game in which both policies of Despot and of Tribune are fixed, which means that only Tribune plays when the state is in . The systematic character of the notation used here should be self-explanatory: the symbol refers to the actions which are not fixed by the policy.
We also consider the infinite horizon or mean payoff game , in which the payment of Tribune is now
For , we also consider the discounted game with a discount factor , in which the payment of Tribune is
The value of the mean payoff game is denoted by , whereas the value of the discounted game is denoted by . As above, we denote by and the games restricted by the choice of policies , and use an analogous notation for the corresponding value vectors. For instance, refers to the value vector of the game with a discount factor . We define the notion of value, as well as the notion of optimal strategies, by saddle point conditions, as in Section 2.
The following dynamic programming principle entails that the value of the stochastic game with Kullback-Leibler payments in horizon is the log of the value of the entropy game.
The value vector of the Kullback-Leibler game in horizon does exist. It is determined by the relations , , , where
and we have .
In order to prove Proposition 2, we recall the following classical result in convex analysis showing that the “log-exp” function is the Legendre-Fenchel transform of Shannon entropy.
The function is convex and it satisfies
This result is mentioned in [RW98], Example 11.12. This convexity property is a special instance of the general fact that the log of the Laplace transform of a positive measure is convex (which follows from the Cauchy-Schwarz inequality), whereas the explicit expression as a maximum follows from a straightforward computation (apply the Lagrange multiplier rule).
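As a numerical illustration of this lemma, the Python sketch below assumes the standard form of the identity: the logarithm of the sum of the `exp(x[i])` equals the maximum, over probability vectors `nu`, of the sum of `nu[i] * (x[i] - log(nu[i]))`, the maximum being attained at the "softmax" vector proportional to `exp(x[i])`. The function names and the test vector are hypothetical.

```python
import math

def log_sum_exp(x):
    """log(sum_i exp(x[i])), computed with a shift for numerical stability."""
    shift = max(x)
    return shift + math.log(sum(math.exp(xi - shift) for xi in x))

def entropy_objective(nu, x):
    """sum_i nu[i] * (x[i] - log(nu[i])), with the convention 0 * log 0 = 0."""
    return sum(n * (xi - math.log(n)) for n, xi in zip(nu, x) if n > 0.0)

def softmax(x):
    """Probability vector proportional to exp(x[i]): the candidate maximizer."""
    shift = max(x)
    e = [math.exp(xi - shift) for xi in x]
    total = sum(e)
    return [ei / total for ei in e]

x = [0.3, -1.2, 2.0]
print(log_sum_exp(x), entropy_objective(softmax(x), x))

# Any other probability vector gives a value that is not larger.
for nu in ([1 / 3, 1 / 3, 1 / 3], [0.0, 0.0, 1.0], [0.2, 0.5, 0.3]):
    print(nu, entropy_objective(nu, x))
```

The equality at the softmax vector is exact: there, `log(nu[i])` equals `x[i]` minus the log-sum-exp, so every term of the objective contributes `nu[i]` times the log-sum-exp.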
Proof of Proposition 2.
For a zero-sum game with finite horizon and additive criterion, the existence of the value is a standard fact, proved in a way similar to Proposition 1. The value vector satisfies the following dynamic programming equation
where for , and . By Lemma 3,
and so, (8) can be rewritten as where is given by (7). Observe that the operator is the conjugate of the operator of the original entropy game: . It follows that , where for a vector the notation denotes the vector , and . ∎
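The conjugacy between the two operators can be checked on a small instance. The Python sketch below is an illustration under assumptions: the entropy-game operator `F` is taken in min-max-sum form, the Kullback-Leibler operator `f` is written in the closed form obtained after maximizing over the probability measure (a log-sum-exp with the logarithms of the weights added), the entropy-game recursion starts from the unit vector and the Kullback-Leibler one from the zero vector, and the toy game is hypothetical.

```python
import math

# Hypothetical toy game: Despot states 0, 1; Tribune states "a", "b"; People states "p", "q".
despot_moves = {0: ["a", "b"], 1: ["a"]}
tribune_moves = {"a": ["p", "q"], "b": ["p"]}
weights = {"p": {0: 2.0, 1: 1.0}, "q": {1: 4.0}}

def F(V):
    """Assumed entropy-game operator: min over Despot, max over Tribune, weighted sum."""
    return {d: min(max(sum(w * V[d2] for d2, w in weights[p].items())
                       for p in tribune_moves[t])
                   for t in despot_moves[d])
            for d in despot_moves}

def log_weighted_sum(p, v):
    """log(sum_{d'} m[p][d'] * exp(v[d'])), computed stably."""
    terms = [math.log(w) + v[d2] for d2, w in weights[p].items()]
    shift = max(terms)
    return shift + math.log(sum(math.exp(term - shift) for term in terms))

def f(v):
    """Assumed Kullback-Leibler operator, i.e. log o F o exp applied entrywise."""
    return {d: min(max(log_weighted_sum(p, v) for p in tribune_moves[t])
                   for t in despot_moves[d])
            for d in despot_moves}

V = {d: 1.0 for d in despot_moves}   # entropy game, horizon 0
v = {d: 0.0 for d in despot_moves}   # Kullback-Leibler game, horizon 0
for k in range(1, 7):
    V, v = F(V), f(v)
    print(k, {d: (v[d], math.log(V[d])) for d in V})
```

Working with `f` rather than `F` avoids overflow for long horizons, since the entries of the entropy-game value vector typically grow exponentially with the number of turns.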
The map arising in (7) is obviously order preserving and it commutes with the addition of a constant, meaning that where is the unit vector of , and . Any map with these two properties is nonexpansive in the sup-norm, meaning that , see [CT80]. Hence, the map has a unique fixed point. For discounted games, the existence of the value and of optimal positional strategies is a known fact:
The discounted game with discount factor has a value and it admits optimal strategies that are positional and stationary. The value vector is the unique solution of .
The existence and the characterization of the value are standard results, see e.g. the discussion in [Ney03]. It is also known that the optimal strategies are obtained by selecting actions of the player attaining the minimum and maximum when evaluating every coordinate of , in a way similar to the proof of Proposition 1, there being replaced by . Since does not depend on the number of turns, the optimal strategies are also stationary. ∎
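The fixed point characterizing the discounted value can be approximated by straightforward iteration. The Python sketch below assumes that the equation reads `v = f(alpha * v)`, so that the iterated map is a contraction of rate `alpha` in the sup-norm; the operator `f` is written in the log-sum-exp closed form, and the toy game and names are hypothetical.

```python
import math

# Hypothetical toy game: Despot states 0, 1; Tribune states "a", "b"; People states "p", "q".
despot_moves = {0: ["a", "b"], 1: ["a"]}
tribune_moves = {"a": ["p", "q"], "b": ["p"]}
weights = {"p": {0: 2.0, 1: 1.0}, "q": {1: 4.0}}

def log_weighted_sum(p, v):
    """log(sum_{d'} m[p][d'] * exp(v[d'])), computed stably."""
    terms = [math.log(w) + v[d2] for d2, w in weights[p].items()]
    shift = max(terms)
    return shift + math.log(sum(math.exp(term - shift) for term in terms))

def f(v):
    """Assumed dynamic programming operator of the Kullback-Leibler game."""
    return {d: min(max(log_weighted_sum(p, v) for p in tribune_moves[t])
                   for t in despot_moves[d])
            for d in despot_moves}

def discounted_value(alpha, tol=1e-12, max_iter=100000):
    """Iterate v <- f(alpha * v) until successive iterates differ by less than tol."""
    v = {d: 0.0 for d in despot_moves}
    for _ in range(max_iter):
        new = f({d: alpha * v[d] for d in v})
        if max(abs(new[d] - v[d]) for d in v) < tol:
            return new
        v = new
    raise RuntimeError("fixed-point iteration did not converge")

v_alpha = discounted_value(0.9)
print(v_alpha)
```

The number of iterations needed grows like `1 / (1 - alpha)`, which is one reason why the text analyses the limit of discount factors close to one by other means.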
Nonexpansive maps can be considered more generally with respect to an arbitrary norm. In this setting, the issue of the existence of the limit of as , and of the limit of , as , where is the unique fixed point of , has received much attention. The former limit is sometimes called escape rate vector. Nonexpansiveness implies that the set of accumulation points of the sequence is independent of the choice of , but it does not suffice to establish the existence of the limit; some additional “tameness” condition on the map is needed. Indeed, a result of Neyman [Ney03], using a technique of Bewley and Kohlberg [BK76], shows that the two limits and do exist and coincide if is semi-algebraic. More generally, Bolte, Gaubert and Vigeral [BGV14] showed that the same limits still exist and coincide if the nonexpansive mapping is definable in an o-minimal structure. A counterexample of Vigeral shows that the latter limit may not exist, even if the action spaces are compact and the payment and transition probability functions are continuous, so the o-minimality assumption is essential in what follows [Vig13].
In order to apply this result, let us recall the needed definitions, referring to [vdD98, vdD99] for background. An o-minimal structure consists, for each integer , of a family of subsets of . A subset of is said to be definable with respect to this structure if it belongs to this family. It is required that definable sets are closed under the Boolean operations, under every projection map (elimination of one variable) from to , and under the lift, meaning if is definable, then and are also definable. It is finally required that when , definable subsets are precisely finite unions of intervals. A function from to is said to be definable if its graph is definable.
An important example of o-minimal structure is the real exponential field . The definable sets in this structure are the subexponential sets [vdD99], i.e., the images under the projection maps of the exponential sets of , the latter being sets of the form where is a real polynomial. A theorem of Wilkie [Wil96] implies that is o-minimal, see [vdD99]. Observe in particular that the set is definable in this structure, being the projection of . Using the o-minimal character of this structure, this implies that definable maps are stable by the operations of pointwise maximum and minimum. We deduce the following key fact.
The dynamic programming operator of the Kullback-Leibler game, defined by (7), is definable in the real exponential field. ∎
Theorem 6 ([BGV14]).
Let be nonexpansive in any norm, and suppose that is definable in an o-minimal structure. Then,
does exist, and it coincides with
Let be the value vector in horizon of the stochastic game with Kullback-Leibler payments, , and for , let denote the value vector of the discounted game with discount factor . Then does exist and it coincides with .
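The coincidence of the two limits can be observed numerically on a one-player instance. The Python sketch below makes several assumptions: the two limits are taken to be those of `v_k / k` and of `(1 - alpha) * v_alpha`, the discounted value solves `v = f(alpha * v)`, and the game is a hypothetical stand-in (not the example of Section 2) whose operator is `f(x)[i] = log(sum_j M[i][j] * exp(x[j]))` for the matrix `M = [[1, 1], [1, 0]]`, whose Perron eigenvalue is the golden mean.

```python
import math

M = [[1.0, 1.0], [1.0, 0.0]]   # hypothetical one-player instance

def f(x):
    """f(x)[i] = log(sum_j M[i][j] * exp(x[j])), computed stably over the support of row i."""
    out = []
    for row in M:
        terms = [math.log(w) + xj for w, xj in zip(row, x) if w > 0.0]
        shift = max(terms)
        out.append(shift + math.log(sum(math.exp(term - shift) for term in terms)))
    return out

def finite_horizon(k):
    """v_k = f applied k times to the zero vector."""
    v = [0.0, 0.0]
    for _ in range(k):
        v = f(v)
    return v

def discounted(alpha, tol=1e-10, max_iter=100000):
    """Approximate solution of v = f(alpha * v) by fixed-point iteration."""
    v = [0.0, 0.0]
    for _ in range(max_iter):
        new = f([alpha * vi for vi in v])
        if max(abs(a - b) for a, b in zip(new, v)) < tol:
            return new
        v = new
    raise RuntimeError("fixed-point iteration did not converge")

log_golden_mean = math.log((1.0 + math.sqrt(5.0)) / 2.0)
k, alpha = 200, 0.99
print([vi / k for vi in finite_horizon(k)])
print([(1.0 - alpha) * vi for vi in discounted(alpha)])
print(log_golden_mean)
```

Both normalized vectors approach the logarithm of the Perron eigenvalue in every coordinate, with errors of order `1 / k` and `1 - alpha` respectively.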
Corollary 7 will allow us to establish the existence of the value of the mean payoff game, and to obtain optimal strategies, by considering the discounted game, for which, as noted in Proposition 4, the existence of the value and of optimal policies is already known.
Let us recall that a strategy in a discounted game is said to be Blackwell optimal if it is optimal for all discount factors sufficiently close to one. The existence of Blackwell optimal positional strategies is a basic feature of perfect information zero-sum stochastic games with finite action spaces (see [Put05, Chap. 10] for the one-player case, the two-player case builds on similar ideas, e.g. [GG98, Lemma 26]). We next show that this result has an analogue for entropy games. To get a Blackwell type optimality result, we need to restrict to a setting with finitely many positional strategies. Recall that (resp. ) denotes the set of policies of Despot (resp. Tribune). We also recall our notation for the value of the mean payoff game in which Despot plays according to the policy .
We define the projection of a pair of strategies in the game to be the strategy in the game . In the present setting, it is appropriate to say that a pair of policies is Blackwell optimal if there is a real number such that, for all , is the projection of a pair of optimal strategies in the discounted game .
The family of discounted Kullback-Leibler games has positional Blackwell optimal strategies.
For all , the discounted game has positional optimal strategies . This follows from the standard dynamic programming argument mentioned in the proofs of Propositions 1 and 4, noting that is obtained by choosing any such that attains the minimum in the expression
Similarly, is chosen to be any such that attains the maximum in
and is chosen to be the unique action attaining the maximum in
(observe that the function to be maximized is strictly concave and continuous on , and that is compact and convex, so the maximum is achieved at a unique point).
By definition of the value and of optimal strategies, we have, for all strategies and of Despot and Tribune respectively,
which is equivalent to
Specializing the first inequality in (9) to , and bounding above the last term, we deduce that, for all for all strategies and of Despot and Tribune respectively, we have
where is the value of the reduced discounted 1-player game starting at , in which the (not necessarily positional) strategies of Despot and of Tribune are fixed. The inequalities (11) can be specialized in particular to policies and . Then, by Proposition 4, is the unique fixed point of the self-map of , where is the dynamic programming operator given by
It follows that the map is definable in the real exponential field . (To see this, observe that, by Fact 5, the set is definable in this structure; then, taking the intersection of this set with the definable sets , for , and projecting the intersection keeping only the and variables, we obtain a definable set which is precisely the graph of the map .)
For all , let denote the set of such that
holds for all . Since the saddle point property (11) holds for all ( and depend on , of course), we have
Observe that the set is a subset of definable in the real exponential field, which is o-minimal. It follows that is a finite union of intervals. Hence, (14) provides a covering of by finitely many intervals, and so, one of the sets must include an interval of the form .
To show that the policies obtained in this way are Blackwell optimal, it remains to show that if satisfies (13) for some , then it is the projection of a pair of optimal strategies in the discounted game . For this, we shall apply the existence of optimal strategies that are positional and the resulting equations (10) and (11) to the reduced games and , respectively.
The first game leads to the existence of positional stationary strategies of Tribune such that, for all ,
Then, using (13), we get that , hence the equality .
The second one leads to the existence of positional stationary strategies of Despot and Tribune respectively such that, for all ,
Then, using (13), we deduce that