Linear Quadratic Games with Costly Measurements

09/20/2017 · Dipankar Maity, et al. · University of Maryland, University of Michigan

In this work we consider a stochastic linear quadratic two-player game. The state measurements are observed through a switched noiseless communication link. Each player incurs a finite cost every time the link is established to obtain measurements. Along with the usual control action, each player is equipped with a switching action to control the communication link. The measurements improve the state estimate and hence reduce the quadratic cost, but at the same time the total cost increases with every switching. We study the subgame perfect equilibrium control and switching strategies for the players. We show that the problem can be solved in a two-step process: the first step solves a dynamic program for the control strategy, and the second step solves another dynamic program for the switching strategy.


1 Introduction

Linear quadratic (LQ) stochastic games have attracted a great deal of attention in the control and related communities due to their wide applicability in stochastic control, minimax control, multi-agent systems, and economics [2], [3], [4], [5], [6], [7], [8]. There is a well-established notion of (Nash) equilibrium (NE) strategies for static games, and in dynamic games there are refinements of NE known as subgame perfect equilibria (SPE). Closed-form solutions for these NE (or SPE) may not exist in general, or may be hard to compute when they do. Among the various classes of dynamic games, LQ games admit a closed-form expression for SPE, characterized by Riccati equations. Necessary and sufficient conditions for the NE strategies of LQ games have been studied in [4], [9], [6]. Contrary to prior belief, [10] shows the existence of nonlinear control strategies for LQ games.

In the vast majority of prior works, the underlying assumption is the availability of free observations. Dynamic games are studied either with open-loop strategies (i.e., the only measurement is the initial state) or with feedback strategies where observations are freely available at all times. Challenges emerge when the measurements are on demand, but costly. This adds an extra layer of decision making for the players, because now they must both control the system and decide when to request measurements.

In this work, we consider a class of two-player linear quadratic stochastic games over a finite horizon. The game dynamics are partially observable. Contrary to the existing literature, the observations are not freely available. Each observation requires a finite cost for establishing a communication link. The link through which the observations are communicated to the players (their controllers) is noiseless but operated by two switches (Figure 1), one for each player. The link is established only when both players are willing to establish it, and in that case they both receive the actual state measurement at that time. Consequently, there is an apparent trade-off between the cost of obtaining state measurements and the estimation quality.

Figure 1: Schematic of the system. Each player has to select their control strategy and switching strategy. All the links are noiseless and delay-free.

In this game, the players can maintain a precise estimate of the state if they establish the link at every time instant. However, since link establishment is costly, they may trade estimation accuracy for a reduction in measurement cost. Therefore, the problem is to optimally decide when to establish the link and how to use the acquired measurements in order to minimize their individual costs. Since, in general, the players will have different preferences over the time instants at which they want to acquire a measurement, they have to come to an agreement on when to actually establish the link.

The closest work in a similar game framework is [11], where the authors studied zero-sum stochastic differential LQ games. However, the selection of switching times was performed collaboratively rather than being the outcome of a strategic interaction. The major departure of this work from [11] is that we consider an explicit game for the switching strategy as well. We express the switch as a Boolean control action and seek SPE for both control and switching strategies.

Our contributions are as follows:
(a) We study the SPE of this dynamic game and show that they can be found through a two-step process. Specifically, in the first step we fix the switching strategy and study the SPE control strategies. The analysis shows that the control strategy is linear in the estimated state, where the gain is characterized by two backward Riccati equations that can be computed offline. Moreover, the Riccati equations do not depend on the switching strategy.
(b) Regarding the equilibrium switching strategy, we provide a backward recursive algorithm to find all SPE, where value functions need only be computed over a finite set whose size grows quadratically in the duration of the game.
(c) Regarding the equilibrium switching strategy, we show that there are many equilibria, among which there is one that is strictly preferred by both players and has a Markovian structure. We find that a strictly preferable switching strategy for a player depends not only on their own cost-to-go, but also on the cost-to-go of the opponent.

The remainder of the paper is organized as follows: the problem formulation is provided in Section 2; Section 3 contains the results on the SPE control strategy; the SPE switching strategy and its offline computation are analyzed in Section 4. Finally, we conclude our work in Section 6.

2 Problem Formulation

In the discrete time Gauss-Markov setting, we consider the following linear dynamics of the state $x_t$:

$$x_{t+1} = A_t x_t + B_t^1 u_t^1 + B_t^2 u_t^2 + w_t \qquad (1)$$

where $x_t \in \mathbb{R}^n$, and $u_t^i$ denotes the action of player $i \in \{1, 2\}$. $w_t$ is a Gaussian noise with $\mathbb{E}[w_t] = 0$ and $\mathbb{E}[w_t w_s^\top] = W \delta_{ts}$ ($\delta_{ts}$ is the Kronecker delta), and $x_0$ is a Gaussian random variable independent of the noise process.

There are two additional actions (switching actions) $s_t^1, s_t^2 \in \{0, 1\}$. These switching actions control a switch (the switch closes if both are equal to 1) and the observation available to both players is $y_t$ with

$$y_t = \begin{cases} x_t, & \text{if } s_t^1 s_t^2 = 1, \\ \epsilon, & \text{otherwise,} \end{cases} \qquad (2)$$

where "$\epsilon$" denotes an erasure. Within period $t$, the random variables are assumed to evolve in the following order: the switching actions $s_t^1, s_t^2$ are taken first, then the observation $y_t$ is generated, and finally the control actions $u_t^1, u_t^2$ are applied.
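To make the setup concrete, here is a minimal Python sketch that simulates the dynamics (1) together with the switched observation channel (2). The matrices, the fixed switching pattern, and the zero inputs are illustrative assumptions, not part of the model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (assumptions for the sketch, not from the paper)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B1 = np.array([[0.0], [1.0]])
B2 = np.array([[1.0], [0.0]])
W = 0.01 * np.eye(2)      # process-noise covariance
T = 10

x = rng.multivariate_normal(np.zeros(2), np.eye(2))  # x_0
for t in range(T):
    # switching actions of the two players (here: a fixed pattern)
    s1, s2 = 1, int(t % 3 == 0)
    # observation channel (2): the state is revealed only if both switch
    y = x.copy() if s1 * s2 == 1 else None   # None plays the role of the erasure
    # control actions (zero here, just to exercise the dynamics)
    u1, u2 = np.zeros(1), np.zeros(1)
    w = rng.multivariate_normal(np.zeros(2), W)
    x = A @ x + B1 @ u1 + B2 @ u2 + w        # dynamics (1)
```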

The information available at time $t$ to player $i$ before she takes the switching action is

$$\mathcal{I}_t^s = \{y_{0:t-1},\ u_{0:t-1}^1,\ u_{0:t-1}^2,\ s_{0:t-1}^1,\ s_{0:t-1}^2\} \qquad (3)$$

and the information available at time $t$ to player $i$ before she takes the control action is

$$\mathcal{I}_t^c = \mathcal{I}_t^s \cup \{y_t,\ s_t^1,\ s_t^2\}. \qquad (4)$$

As a result, the actions have the functional form

$$s_t^i = f_t^i(\mathcal{I}_t^s), \qquad (5a)$$
$$u_t^i = g_t^i(\mathcal{I}_t^c), \qquad (5b)$$

where by $g^i = \{g_t^i\}_t$ and $f^i = \{f_t^i\}_t$ we denote the control and switching strategies of player $i$. For every $t$, let $\hat{x}_t = \mathbb{E}[x_t \mid \mathcal{I}_t^c]$ denote the $\mathcal{I}_t^c$-measurable estimate of the state.

The individual cost that each player needs to minimize is quadratic in the state and action, and it also depends on the switching actions $s_t^1$ and $s_t^2$. We consider a game of finite duration $T$ and the per-stage costs are explicitly written as:

$$c_t^i = x_t^\top Q_t^i x_t + u_t^{i\top} R_t^i u_t^i + \lambda^i s_t^1 s_t^2 \qquad (6)$$

for all $t < T$, and

$$c_T^i = x_T^\top Q_T^i x_T. \qquad (7)$$

The quantity $\lambda^i > 0$ is the cost paid by player $i$ when both players attempt to close the switch and they observe the state information $x_t$. Therefore the average cost over the time horizon is represented as,

$$J^i(\gamma^1, \gamma^2) = \mathbb{E}\left[\sum_{t=0}^{T} c_t^i\right] \qquad (8)$$

where $\gamma^i = (g^i, f^i)$ denotes the strategy of player $i$ that corresponds to control strategy $g^i$ and switching strategy $f^i$.

The objective of player $i$ is:

$$\min_{\gamma^i} J^i(\gamma^i, \gamma^{-i}). \qquad (9)$$
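For a fixed (not necessarily equilibrium) strategy profile, the total cost (8) can be estimated by plain Monte Carlo simulation. The sketch below does this for player 1; the `policy` interface and all parameter names are illustrative assumptions.

```python
import numpy as np

def estimate_cost(A, B1, B2, W, Q1, R1, lam1, T, policy, n_runs=2000, seed=0):
    """Monte Carlo estimate of player 1's total cost J^1 in (8) for a
    fixed strategy profile.  `policy(t, obs_hist)` returns the tuple
    (s1, s2, u1, u2); this interface is an illustrative assumption."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    total = 0.0
    for _ in range(n_runs):
        x = rng.multivariate_normal(np.zeros(n), np.eye(n))   # x_0
        obs_hist, J = [], 0.0
        for t in range(T):
            s1, s2, u1, u2 = policy(t, obs_hist)
            obs_hist.append(x.copy() if s1 * s2 == 1 else None)
            J += x @ Q1 @ x + u1 @ R1 @ u1 + lam1 * s1 * s2   # stage cost (6)
            w = rng.multivariate_normal(np.zeros(n), W)
            x = A @ x + B1 @ u1 + B2 @ u2 + w                  # dynamics (1)
        J += x @ Q1 @ x                                        # terminal cost (7)
        total += J
    return total / n_runs
```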

3 Subgame Perfect Control Strategy

For dynamic games with complete information, the appropriate equilibrium concept is a refinement of Nash equilibrium (NE) called the subgame perfect equilibrium (SPE). A strategy profile $\gamma$ is an SPE if the restriction of $\gamma$ to any proper subgame of the original game constitutes a NE [12, pp. 94].

We seek to characterize the SPE for this switched LQG game. Moreover, we will show that among the multiple SPE, there exists one that simultaneously minimizes the cost for both players among all SPE, and thus it will be the preferable SPE solution of this game. In this section, we study the SPE control strategies of both players.

Theorem 3.1

For any switching profile of the players, the SPE control strategy has the following structure:

$$u_t^{i*} = L_t^i\, \hat{x}_t \qquad (10)$$

where

$$L_t^i = -\big(R_t^i + B_t^{i\top} P_{t+1}^i B_t^i\big)^{-1} B_t^{i\top} P_{t+1}^i \big(A_t + B_t^{-i} L_t^{-i}\big). \qquad (11)$$

Furthermore, the cost-to-go incurred by player $i$ under the SPE control strategy at any time step $t$ is given by,

$$V_t^i(\mathcal{I}_t^c) = \hat{x}_t^\top P_t^i \hat{x}_t + \mathbb{E}\big[e_t^\top S_t^i e_t \mid \mathcal{I}_t^c\big] + n_t^i \qquad (12)$$

where $e_t = x_t - \hat{x}_t$. The matrices $P_t^i$ and $S_t^i$ depend only on the game parameters (detailed expressions are in the proof of the theorem) and thus can be calculated offline without knowledge of the switching strategy profile.

proof The proof of this theorem is provided in Appendix 7.1.

To maintain brevity, $V_t^i(\mathcal{I}_t^c)$ will be denoted as $V_t^i$. From this point onward we will set $\Sigma_t = \mathbb{E}[e_t e_t^\top \mid \mathcal{I}_t^c]$ and write (12) in compact form as

$$V_t^i = \hat{x}_t^\top P_t^i \hat{x}_t + \operatorname{tr}\big(S_t^i \Sigma_t\big) + n_t^i. \qquad (13)$$

It should be noted that in Theorem 3.1, $V_t^i$ depends on the given switching strategy through the error covariance $\Sigma_t$.

There are several remarks to be made at this point.

The stochastic control version of the same problem (i.e., single player, single objective) is a modified Kalman filtering problem where the observations are available on demand after paying a certain cost per observation. In that case, the decision of switching solely depends on the influence of switching on the error covariance matrix. This is a side result of our work and the details will appear elsewhere.
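To illustrate this single-player special case, the sketch below propagates the estimation-error covariance under on-demand, noiseless full-state observations: when the link is established the error covariance resets to zero, otherwise it grows through the usual prediction step. This is a minimal sketch under those assumptions, not the paper's algorithm.

```python
import numpy as np

def covariance_trajectory(A, W, schedule, Sigma0):
    """Error-covariance evolution with on-demand noiseless state observations.

    schedule[t] = 1 means the measurement link is established at time t.
    A noiseless full-state observation resets the error covariance to zero;
    otherwise the covariance follows the open-loop prediction step."""
    Sigma, out = Sigma0, []
    for s in schedule:
        if s == 1:
            Sigma = np.zeros_like(Sigma0)   # exact observation: zero error
        out.append(Sigma)
        Sigma = A @ Sigma @ A.T + W         # prediction step to next period
    return out
```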

From Theorem 3.1, $V_0^i$ is the cost-to-go of player $i$ at time $0$. Therefore, the total cost incurred by player $i$ with the SPE control strategy profile $(g^{1*}, g^{2*})$ is $\mathbb{E}[V_0^i]$. Hence, the total cost incurred once the switching cost is included is:

$$J^i = \mathbb{E}[V_0^i] + \lambda^i\, \mathbb{E}\Big[\sum_{t=0}^{T-1} s_t^1 s_t^2\Big]. \qquad (14)$$

Another remark that is apparent from our result is that the SPE control strategy is completely characterized by the pair of matrices $(P_t^i, L_t^i)$, which is uniquely determined by backward dynamic equations.
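Since these matrices can be computed offline, here is a minimal numpy sketch of a coupled backward recursion for a time-invariant two-player LQ game. The coupled-gain equations used here are the standard feedback-Nash relations and serve only as a stand-in for the paper's detailed expressions, which are given in the proof of Theorem 3.1.

```python
import numpy as np

def coupled_riccati(A, B1, B2, Q1, Q2, R1, R2, QT1, QT2, T):
    """Backward recursion for feedback-Nash gains of a two-player LQ game.

    At each stage, solves the coupled linear equations
      (R1 + B1'P1 B1) L1 + (B1'P1 B2) L2 = -B1'P1 A
      (B2'P2 B1) L1 + (R2 + B2'P2 B2) L2 = -B2'P2 A
    and propagates P1, P2 backwards through the closed loop."""
    P1, P2 = QT1, QT2
    m1, m2 = B1.shape[1], B2.shape[1]
    gains = []
    for _ in range(T):
        M = np.block([[R1 + B1.T @ P1 @ B1, B1.T @ P1 @ B2],
                      [B2.T @ P2 @ B1,      R2 + B2.T @ P2 @ B2]])
        rhs = np.vstack([-B1.T @ P1 @ A, -B2.T @ P2 @ A])
        L = np.linalg.solve(M, rhs)
        L1, L2 = L[:m1, :], L[m1:, :]
        Acl = A + B1 @ L1 + B2 @ L2                  # closed-loop dynamics
        P1 = Q1 + L1.T @ R1 @ L1 + Acl.T @ P1 @ Acl
        P2 = Q2 + L2.T @ R2 @ L2 + Acl.T @ P2 @ Acl
        gains.append((L1, L2))
    return gains[::-1], (P1, P2)   # gains ordered t = 0, ..., T-1
```

Note that, consistent with the theorem, nothing in this recursion depends on the switching strategy: the measurements only affect the estimate that the gains multiply.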

4 Subgame Perfect Switching Strategy

In this section we complete the procedure for finding the SPE of this game by focusing on the switching strategies. We do so by considering the backward induction process for finding SPE and reducing the cost-to-go functions to a simpler and more tractable form (compared to the one in (13)).

In this problem, the switching action is taken first at time $t$ based on the knowledge $\mathcal{I}_t^s$, and then the augmented knowledge $\mathcal{I}_t^c$ (that is, $\mathcal{I}_t^s$ together with $y_t$, $s_t^1$, $s_t^2$) is used to select the control actions. To visualize this, one might break each time period into two halves, where in the first half the switching action is performed and in the second half the control action is performed. In Theorem 3.1, $V_t^i$ is the optimal cost-to-go after the switching decision has been taken at time $t$.

The actual (before the switching action is taken) cost-to-go at stage $t$ is:

(15)

and the optimization (game) variables are the control and switching actions for all times $k \ge t$.

Since the SPE control strategies are already fixed by Theorem 3.1 for all $t$, we can write

(16)

where the remaining optimization is over the switching strategies only.

Since each player is interested in minimizing their own cost, they are interested in the cost-to-go at every stage (ultimately they want to minimize the total expected cost). We can write,

(17)

We substitute the expression of $V_t^i$ from Theorem 3.1 into (17), but before that, let us define the pre-switching error covariance

$$\Sigma_t^- = \mathbb{E}\big[(x_t - \mathbb{E}[x_t \mid \mathcal{I}_t^s])(x_t - \mathbb{E}[x_t \mid \mathcal{I}_t^s])^\top \,\big|\, \mathcal{I}_t^s\big]. \qquad (18)$$

Note that $\Sigma_t^-$ is $\mathcal{I}_t^s$-measurable whereas $\Sigma_t$ is $\mathcal{I}_t^c$-measurable. $\Sigma_t^-$ and $\Sigma_t$ are related as follows:

$$\Sigma_t = (1 - s_t^1 s_t^2)\, \Sigma_t^-, \qquad (19)$$

since a closed switch reveals the state exactly and drives the estimation error to zero.

Now let us consider the $t$-th stage cost.

(20)

Therefore,

(21)

Using (47), we get

(22)

The selection of the switching strategy has no effect on this term, and hence it does not play any role in the game at stage $t$.

Let us define an instantaneous cost:

(23)

With a slight abuse of notation, after neglecting the term that does not depend on the switching, we obtain,

(24)

Therefore,

(25)

Let us denote:

(26)

Let us perform a similar backward induction to find the SPE switching strategies. Note that at time $T$, there is no action to optimize and

(27)

Let us define

(28)

Similarly, at $t = T - 1$,

(29)
(30)

Using (29) and (30),

(31)

If $(s_t^{1*}, s_t^{2*})$ is an SPE strategy profile at time $t$, then

(32)

must hold for both players $i \in \{1, 2\}$.

Using the above definition of SPE, $(s_t^1, s_t^2) = (0, 0)$ is always an equilibrium strategy, since a unilateral change from $s_t^i = 0$ to $s_t^i = 1$ does not change the cost for any player (the switch stays open). However, there might be other equilibria (in this case only $(1, 1)$) which produce a lower cost for the above cost function.

It is straightforward to show that the equilibrium strategy at $t = T - 1$ is

(33)

From (33) we notice that $(1, 0)$ and $(0, 1)$ can also be equilibrium strategies. However, those equilibria are equivalent to $(0, 0)$ in the sense that they produce the same cost-to-go for both players. Therefore, we will restrict our attention to the two equilibria $(0, 0)$ and $(1, 1)$.

As a remark, we point out that adding an infinitesimal switching cost for every time a player requests a switching (irrespective of whether the switch was closed or not) will ensure that $(1, 0)$ and $(0, 1)$ are never SPE.
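Per stage, the switching interaction is a 2 × 2 matrix game. The sketch below, with assumed continuation-cost inputs, enumerates its pure equilibria; it illustrates why (0, 0) always survives and why (1, 1) appears exactly when observing strictly benefits both players. The function and argument names are illustrative.

```python
from itertools import product

def stage_switch_equilibria(cost_obs, cost_noobs, lam):
    """Pure equilibria of the one-stage 2x2 switching game.

    cost_obs[i]   : player i's continuation cost if the state is observed
    cost_noobs[i] : player i's continuation cost if it is not
    lam[i]        : player i's measurement cost
    The switch closes only when both actions equal 1."""
    def cost(i, s1, s2):
        return cost_obs[i] + lam[i] if s1 * s2 == 1 else cost_noobs[i]

    eqs = []
    for s1, s2 in product((0, 1), repeat=2):
        # no strictly profitable unilateral deviation for either player
        if (cost(0, s1, s2) <= cost(0, 1 - s1, s2) and
                cost(1, s1, s2) <= cost(1, s1, 1 - s2)):
            eqs.append((s1, s2))
    return eqs

# (1, 1) appears, and is the preferred equilibrium, exactly when
# cost_obs[i] + lam[i] < cost_noobs[i] for both i.
```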

Let us note that when the two continuations yield equal cost-to-go values, $(0, 0)$ and $(1, 1)$ produce the same cost-to-go. In such situations, all possible switching actions are equivalent. In order to rule out such instances we make the following assumption:

Assumption 4.1

If the two cost-to-go values are equal, the players play $s_t = (0, 0)$, for all possible histories. Then, (33) is modified as follows:

(34)

Irrespective of whether the SPE is $(0, 0)$ or $(1, 1)$, the optimal cost-to-go depends only on $\Sigma_t^-$, and the best SPE strategy (the one that produces the least cost among all SPE) also depends only on $\Sigma_t^-$ (and $t$).

Therefore, we hypothesize the following:

Claim 4.2

For any $t$, there exists an SPE switching strategy that depends only on $\Sigma_t^-$ and produces the least cost-to-go among all SPE. Hence the minimal cost-to-go at stage $t$ depends only on $\Sigma_t^-$.

Proof: The hypothesis is true for $t = T$. Let us assume it is true for some $t + 1$, i.e., the minimal cost-to-go at $t + 1$ depends only on $\Sigma_{t+1}^-$. Then, using a dynamic programming argument,

(35)

From (35), the best equilibrium strategy is $(1, 1)$ if closing the switch strictly lowers the cost-to-go of both players (similar to Assumption 4.1, we only consider the strict inequality), and $(0, 0)$ otherwise.

Therefore the decision requires only the knowledge of $\Sigma_t^-$, and hence the hypothesis holds at stage $t$ as well.

For this class of games, there always exists a Markovian SPE switching strategy and a Markovian SPE control strategy which produce the least cost-to-go among all SPE. Although there might be other non-Markovian SPE strategies which produce the same cost, due to Claim 4.2 it is sufficient to consider only Markovian strategies to find the best SPE corresponding to the least cost-to-go.

4.1 Offline Calculation of the Value Functions

In the following we describe how the players can make their decisions online by using some stored offline functions (value functions).

Let us define the value functions in the following manner:

(36)

and

(37)

where the remaining quantities are given by

(38)

By construction, if the stage-$(t+1)$ value denotes the minimum cost-to-go (for the subgame starting at $t + 1$) among the SPE, then the function defined in (37) provides the minimum cost-to-go at stage $t$ for player $i$. Therefore, by backward induction, (37) denotes the cost-to-go function along an SPE that simultaneously minimizes the cost-to-go for both players.
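Because the error covariance either resets to zero (when the switch closes) or evolves as $\Sigma \mapsto A \Sigma A^\top + W$, the reachable covariances at stage $t$ can be indexed by the time elapsed since the last observation. A minimal sketch of the offline backward induction over this finite set follows; `stage_cost`, the zero terminal values, and the assumption $\Sigma_0 = 0$ are illustrative stand-ins for (36)-(38).

```python
import numpy as np

def offline_value_functions(A, W, lam, stage_cost, T):
    """Backward induction for the best-SPE switching value functions.

    Reachable pre-switching covariances at stage t are indexed by
    tau = steps since the last observation (Sigma_0 = 0 assumed), so
    only O(T^2) values are ever stored.  stage_cost(i, t, Sigma) is an
    illustrative hook for the instantaneous cost term in (23)."""
    n = A.shape[0]

    def cov(tau):
        # covariance accumulated over `tau` steps without an observation
        S = np.zeros((n, n))
        for _ in range(tau):
            S = A @ S @ A.T + W
        return S

    V = {(T, tau): (0.0, 0.0) for tau in range(T + 1)}  # illustrative terminal values
    policy = {}
    for t in range(T - 1, -1, -1):
        for tau in range(t + 1):
            S = cov(tau)
            # both close the switch: pay lam, covariance resets to zero
            obs = tuple(stage_cost(i, t, np.zeros((n, n))) + lam[i]
                        + V[(t + 1, 1)][i] for i in range(2))
            # switch stays open: covariance keeps growing
            noobs = tuple(stage_cost(i, t, S) + V[(t + 1, tau + 1)][i]
                          for i in range(2))
            close = (obs[0] < noobs[0]) and (obs[1] < noobs[1])
            policy[(t, tau)] = int(close)
            V[(t, tau)] = obs if close else noobs
    return V, policy
```

For instance, `offline_value_functions(A, W, (1.0, 1.0), lambda i, t, S: float(np.trace(S)), 10)` returns stored values and a Markovian policy indexed by `(t, tau)`, mirroring Claim 4.2.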

Claim 4.3

For any $t$ and history, the best switching strategy (SPE) is given by $s_t^i = 1$ for $i = 1, 2$ if and only if

(39)

Otherwise, $s_t^i = 0$ for $i = 1, 2$.

proof One direction is trivially true.

For the other direction: first, notice that we have established that $(0, 0)$ is an SPE strategy for all $t$. Now let us assume that at some $t$, (39) holds; then, if player $i$ unilaterally selects a strategy with $s_t^i = 0$, the cost-to-go for player $i$ with any strategy profile from time $t$ onward is

(40)

Therefore, a unilateral deviation is harmful (strictly non-profitable) for player $i$, and that allows us to conclude that $s_t^i = 1$ for $i = 1, 2$ is an equilibrium at $t$. Therefore $(0, 0)$ and $(1, 1)$ are both equilibria. However, the cost-to-go from selecting $(1, 1)$ is strictly smaller than that from selecting $(0, 0)$, and $(1, 1)$ is therefore preferred by the players.

Note that (37) can be calculated and stored offline, and (39) can be evaluated online using the stored values.

Equation (39) is equivalent to:

(41)

which shows a threshold policy for SPE switching.

We note that $\Sigma_t^-$ can take only finitely many values; therefore, at time $t$ we only need the values of the value functions at those points, not over the entire space of symmetric positive semidefinite matrices. In order to decide the switching at time $t$, we need to know only four values: for each player, the cost-to-go with and without the observation. Therefore, given the variance of the initial state, we need to store only a finite number of values to characterize all the value functions for a finite duration game.

Claim 4.4

The maximum number of values (value function evaluations) needed to be stored to calculate the switching strategies for the entire game of duration $T$ is $(T + 1)(T + 2)$.

proof Suppose that at stage $t$, $\Sigma_t^-$ takes $N_t$ possible distinct values based on all possible previous histories. To determine the switching at time $t$, we need to make $N_t$ comparison tests (39), and for each test the immediate term is common. Therefore we need to evaluate the value function at only $N_t$ points at time $t$.

For the switching pair $(1, 1)$, $\Sigma_t = 0$, and for any other switching profile at stage $t$, $\Sigma_{t+1}^- = A_t \Sigma_t^- A_t^\top + W$. Therefore at stage $t + 1$, $N_{t+1}$ will be at most $N_t + 1$ (the $N_t$ propagated values of $\Sigma_t^-$, plus the single value that follows a reset). Therefore,

$$N_{t+1} \le N_t + 1, \qquad (42)$$

and with $N_0 = 1$, we get $N_t \le t + 1$.

Total value function evaluations to be stored $= 2 \sum_{t=0}^{T} N_t \le 2 \sum_{t=0}^{T} (t + 1) = (T + 1)(T + 2)$.

The factor $2$ in the above equation is due to the fact that we have to evaluate the value functions for both players.
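The counting argument can be checked mechanically. The sketch below enumerates the distinct reachable pre-switching covariances forward in time (assuming a time-invariant pair $(A, W)$ and $\Sigma_0 = 0$ for simplicity) and confirms that each stage adds at most one new value.

```python
import numpy as np

def reachable_covariances(A, W, T):
    """Forward enumeration of the distinct pre-switching covariances.

    Each stage adds at most one new value (the freshly reset branch),
    so len(sets[t]) <= t + 1, giving the quadratic storage bound."""
    sets = [[np.zeros_like(W)]]            # Sigma_0 = 0 assumed
    for _ in range(T):
        nxt = [A @ S @ A.T + W for S in sets[-1]]   # switch stayed open
        nxt.append(W.copy())                         # switch closed: A 0 A' + W
        uniq = []                                    # deduplicate coinciding values
        for S in nxt:
            if not any(np.allclose(S, U) for U in uniq):
                uniq.append(S)
        sets.append(uniq)
    return sets
```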

Remark 4.5

A switching is performed only when it strictly reduces the cost-to-go for both players. Therefore, each switching strictly reduces the welfare cost-to-go. However, the converse is not necessarily true, i.e., a switching with the potential to reduce the welfare cost-to-go may not always be performed.

4.2 Centralized Optimization vs. Game Setup

The problem we consider here is a game-theoretic setup between two players, each with their own optimization criterion and two types of actions (control and switching). While they can select their controllers independently, their individual switching actions do not affect the system (and cost) unless they switch synchronously. A valid question to ask is how a centralized agent would select its action strategies in order to optimize the welfare cost (i.e., the sum of the two individual players' costs).

We have shown in Theorem 3.1 that the control strategies in the two-player setup are completely characterized by Riccati equations. A similar analysis shows that the same characterization holds for the centralized agent. However, the centralized agent has a single Riccati equation as opposed to the two equations we have here, and the gain of the controller might change accordingly. Considering the symmetric case (i.e., identical cost parameters for the two players), we can show that the control strategy for the centralized agent will be equivalent to the strategies of the two agents (i.e.