Surrogate Optimal Control for Strategic Multi-Agent Systems

by Pedro Hespanhol, et al.
berkeley college

This paper studies how to design a platform to optimally control constrained multi-agent systems with a single coordinator and multiple strategic agents. In our setting, the agents cannot apply control inputs and only the coordinator applies control inputs; however, the coordinator does not know the objective functions of the agents, and so must choose control actions based on information provided by the agents. One major challenge is that if the platform is not correctly designed, then the agents may provide false information to the coordinator in order to achieve improved outcomes for themselves at the expense of the overall system efficiency. Here, we design an interaction mechanism between the agents and the coordinator such that the mechanism: ensures agents truthfully report their information, has low communication requirements, and leads to a control action that achieves efficiency by achieving a Nash equilibrium. In particular, we design a mechanism in which each agent does not need to possess full knowledge of the system dynamics or of the objective functions of other agents. We illustrate our proposed mechanism in a model predictive control (MPC) application involving heating, ventilation, and air-conditioning (HVAC) control by a building manager of an apartment building. Our results showcase how such a mechanism can potentially be used in the context of distributed MPC.

I Introduction

Many systems have dynamics influenced by agents, including power systems [1], communication networks [2], water systems [3], and heating, ventilation, and air-conditioning (HVAC) automation [4]. These systems are characterized by information flows and the order of computations. For cooperative agents, various distributed model predictive control (MPC) schemes have been designed. A system-level control policy was obtained by aggregating locally-computed inputs [5, 6], and central platforms that compute a control based upon information sent by agents have also been designed [7].

Distributed control with strategic agents is less well-studied. The competitive nature of agents and asymmetries of information reward tactical behavior, ultimately leading to instability or poor performance [8, 9, 10, 11]. We focus our attention on the case where equilibrium behavior can be described as a Nash equilibrium of some non-cooperative game [12] that may be inefficient [13]. A common way to overcome such inefficiencies is to force agents to coordinate their goals with the system-wide goal [14], [15]. However, this approach requires strong assumptions that the agents’ utility functions are common knowledge and/or agents are honest when transmitting information [16]. Another line of work [17, 18] provides pricing schemes to induce or manage agents’ behavior in the equilibrium.

I-A Contributions

In this paper, we study the case of strategic agents under weaker assumptions than past work like [19, 16]. In particular, the agents exchange information only with a central platform that is responsible for the control decision. Our goal is to design the interaction mechanism to ensure not only efficiency of the resulting control policy but also honest reporting from the agents. Originally, the study of such mechanisms [20] was concerned with the design of incentives to ensure efficient allocation of commodities amongst market participants, whilst ensuring truthfulness. The classical VCG mechanisms [21, 22, 23] are one example. Our first contribution lies in providing a mechanism that enjoys those properties when applied to an optimal control setting.

A major hurdle in implementing such mechanisms is their steep communication needs [24, 11, 25]. However, minimal strategy spaces that elicit efficient Nash equilibria in convex environments have been developed [26]. A second contribution of our work is to provide communication protocols of low complexity: we avoid having the agents communicate their entire utility functions and instead resort to vector-valued messages inspired by surrogate optimization [24], [27]. Hence, our goal in this paper is to provide a platform where (i) agents provide low-dimensional information, (ii) agents are honest, and (iii) an efficient control policy is implemented in the Nash equilibrium.

Lastly, we demonstrate the practical usefulness of our designed platform by conducting a simulation analysis of HVAC automation [28]. The situation we consider involves an apartment building where each apartment has its own preferences on desired room temperature versus the amount of energy consumption. The thermal dynamics of the apartments are coupled, and more efficient control is possible through coordination. Our simulations quantify the performance improvement possible through the use of our central platform in coordinating agents. This HVAC setup is similar to the setup in [18]. However, a major difference is that in [18] the central platform knows each agent’s utility function and can set prices on the control inputs to induce agents’ behavior. In contrast, we allow the agents to be strategic with respect to how they communicate information about their utility function to the central platform.

I-B Outline

Sect. II defines the system model, and Sect. III defines the mechanism and how agents interact with it. In Sect. IV, we provide a Nash equilibrium characterization of the agents’ equilibrium behavior. We conclude with Sect. V, where we provide a case study in the context of HVAC automation.

II System Model

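Consider a system which obeys linear dynamics

$$x_{k+1} = A x_k + B u_k, \tag{1}$$

where $x_k \in \mathbb{R}^n$ is the state vector, and $u_k \in \mathbb{R}^m$ is the input signal. Suppose this system is composed of $N$ interconnected and non-overlapping subsystems that are each associated to an agent. Let $[N]$ denote the set $\{1, \ldots, N\}$. We let $x_k^i \in \mathbb{R}^{n_i}$ denote the state vector of subsystem $i$ at period $k$. Then we have $x_k = (x_k^1, \ldots, x_k^N)$ and $n = \sum_{i \in [N]} n_i$. In addition, we can also partition the inputs $u_k = (u_k^1, \ldots, u_k^N)$ where $u_k^i \in \mathbb{R}^{m_i}$, so that $m = \sum_{i \in [N]} m_i$.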

II-A Agent Model

The diagonal block $A_{ii}$ of the matrix $A$ gives the subsystem dynamics for the $i$-th agent. Influence by other agents is described by off-diagonal blocks $A_{ij}$ of $A$ when subsystem $j$ impacts $i$. We assume agent $i$’s input only affects states in their subsystem; hence, the input matrix $B$ is block-diagonal with blocks $B_{ii}$. Let $\mathcal{N}_i$ be the set of neighboring subsystems of subsystem $i$. Then the dynamics for the $i$-th subsystem is

$$x_{k+1}^i = A_{ii} x_k^i + B_{ii} u_k^i + \sum_{j \in \mathcal{N}_i} A_{ij} x_k^j. \tag{2}$$

We assume each agent only knows their own local dynamics $A_{ii}$ and $B_{ii}$. Since agents do not know the $\sum_{j \in \mathcal{N}_i} A_{ij} x_k^j$ part of their dynamics, a central platform is needed through which each agent can receive this information.
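The split between local dynamics and the coupling term can be sketched numerically. The following is a minimal illustration, assuming two scalar subsystems and made-up coefficients; the helper names (`global_step`, `local_step`, `coupling_term`) are hypothetical and not taken from the paper. It checks that an agent who knows only its own diagonal blocks reproduces the global update once the principal supplies the aggregate effect of the neighbors.

```python
# Global matrices for N = 2 subsystems, each with one state and one input.
# All numbers are illustrative assumptions.
A = [[0.90, 0.10],
     [0.05, 0.80]]
B = [[0.5, 0.0],
     [0.0, 0.4]]  # block-diagonal: input i only drives subsystem i
NEIGHBORS = {0: [1], 1: [0]}


def global_step(x, u):
    """Whole-system update: next state = A x + B u."""
    n = len(x)
    return [sum(A[i][j] * x[j] for j in range(n)) +
            sum(B[i][j] * u[j] for j in range(n)) for i in range(n)]


def coupling_term(i, x):
    """Sum of A_ij x_j over neighbors j; only the principal can evaluate it."""
    return sum(A[i][j] * x[j] for j in NEIGHBORS[i])


def local_step(i, x_i, u_i, coupling):
    """Agent i's view: its own diagonal blocks plus a supplied coupling value."""
    return A[i][i] * x_i + B[i][i] * u_i + coupling


x = [20.0, 24.0]
u = [1.0, -1.0]
x_next_global = global_step(x, u)
x_next_local = [local_step(i, x[i], u[i], coupling_term(i, x)) for i in range(2)]
```

The two results agree up to floating-point rounding, which is why a reference signal from the principal is enough for an agent to predict its own subsystem.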

Each subsystem has state and input constraints $x_k^i \in \mathcal{X}_i = \{x : F_i x \le f_i\}$ and $u_k^i \in \mathcal{U}_i = \{u : G_i u \le g_i\}$ that are polytopes containing the origin. Here, $F_i, G_i$ and $f_i, g_i$ are matrices and vectors with appropriate dimensions, respectively. Lastly, each agent has their own cost function

$$J_i(x^i, u^i) = \sum_{k=0}^{T-1} \ell_i(x_k^i, u_k^i) + \ell_i^f(x_T^i),$$

where $\ell_i$ and $\ell_i^f$ are stage and terminal costs, and $T$ is the control horizon. We assume each agent’s stage and terminal costs are strictly convex, differentiable, and take their minimum at the origin. The cost function is the agent’s private information, and their goal is to minimize it.

II-B Principal Model

The central platform is operated by a coordinator that we call the principal. We assume the principal has complete knowledge about the dynamics of the system (i.e., the matrices $A$ and $B$) and the constraints (i.e., the state and input constraint sets $\mathcal{X}_i$ and $\mathcal{U}_i$ of every subsystem), and importantly the principal is the one who applies a control input to the entire system (restated, the agents do not directly provide control inputs). In this framework, if the principal knew the objective function $J_i$ of each agent, then they could compute a control sequence by solving the following convex optimal control problem (OCP-T):

$$\min_{x, u} \ \sum_{i \in [N]} J_i(x^i, u^i) \quad \text{s.t.} \quad x_{k+1} = A x_k + B u_k, \quad x_k^i \in \mathcal{X}_i, \quad u_k^i \in \mathcal{U}_i, \quad x_0 \ \text{given}.$$

Throughout the paper we let $(x^\star, u^\star)$ denote the optimal solution of (OCP-T), which we call the efficient trajectory. However, the principal cannot solve this problem, since it does not know the objective function of each agent. It therefore needs to elicit information from each agent. This need for information raises two major issues, which are the central focus of this work: (1) the agents may be unable or unwilling to transmit their entire cost functions to the principal, as each cost function is infinite-dimensional and private information; (2) the agents are strategic and may not tell the truth. Therefore, in order for the principal to solve (OCP-T), it also needs to design a mechanism that gives each agent an incentive to tell the truth. Hence, the principal is faced with both an optimal control problem and a mechanism design problem.

III Mechanism Specification

As described in the previous section, the principal’s goal is to solve (OCP-T). Towards that goal, the principal resorts to approximating the objective function based on a finite number of parameters that the agents can report, and then minimizes this approximated function.

III-A Definition of a Mechanism

Let a mechanism be a tuple $(\mathcal{M}, g)$, where $\mathcal{M} = \mathcal{M}_1 \times \cdots \times \mathcal{M}_N$ and $\mathcal{M}_i$ is the set of allowable messages agent $i$ can send to the principal, and $g$ is the outcome function that determines the outcome $g(m)$ for any message profile $m = (m_1, \ldots, m_N) \in \mathcal{M}$. Here, the outcome function maps a message profile to a state/input trajectory $(x, u)$:

$$g(m) = \big(g_1(m), \ldots, g_N(m)\big), \qquad g_i(m) = (x^i, u^i),$$

where $(x^i, u^i)$ refers to the state/input trajectory associated with agent $i$’s subsystem. Next, we define $t(m) = (t_1(m), \ldots, t_N(m))$ to be a non-negative vector of “fees” for each agent.

The mechanism together with the cost functions of the agents induces a game among the agents. We define the Nash equilibrium (NE) of this game as a message profile $m^\star$ such that

$$J_i\big(g_i(m^\star)\big) + t_i(m^\star) \le J_i\big(g_i(m_i, m_{-i}^\star)\big) + t_i(m_i, m_{-i}^\star)$$

for all $m_i \in \mathcal{M}_i$ and $i \in [N]$, where the compact notation $m_{-i}$ denotes the vector of messages from all agents except $i$. The fee $t_i$ increases costs for agent $i$, which is undesirable since agents are minimizing. The goal of the principal is to design the mechanism such that the efficient trajectory can be implemented as the Nash equilibrium of the game. Implementation means that the trajectory corresponding to the Nash equilibrium of the game induced by the mechanism is equal to the efficient trajectory.
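The Nash equilibrium condition above can be checked by brute force when the message sets are finite. The sketch below is only an illustration under that assumption (the paper's message spaces are continuous), and the toy game, the names `is_nash`, `toy_cost`, and all numbers are hypothetical. In the toy game the outcome is the mean of the reported targets and no fee is charged, which shows how agents drift away from truthful reports when the mechanism gives them no reason to be honest.

```python
from itertools import product


def is_nash(profile, message_sets, total_cost):
    """Return True if no agent can lower its total cost by deviating alone.

    profile      -- tuple with one message per agent
    message_sets -- list of finite candidate message sets (an assumption)
    total_cost   -- total_cost(i, profile): outcome cost plus fee for agent i
    """
    for i, candidates in enumerate(message_sets):
        current = total_cost(i, profile)
        for m in candidates:
            deviation = profile[:i] + (m,) + profile[i + 1:]
            if total_cost(i, deviation) < current - 1e-12:
                return False
    return True


# Toy game: the outcome is the mean of reported targets, and each agent pays
# a quadratic cost for the gap between the outcome and its true target.
TRUE_TARGETS = (1.0, 3.0)
MESSAGES = [(0.0, 1.0, 2.0, 3.0, 4.0)] * 2


def toy_cost(i, profile):
    outcome = sum(profile) / len(profile)
    return (outcome - TRUE_TARGETS[i]) ** 2


equilibria = [p for p in product(*MESSAGES) if is_nash(p, MESSAGES, toy_cost)]
```

Here the truthful profile `(1.0, 3.0)` is not an equilibrium, while the exaggerated profile `(0.0, 4.0)` is. A fee term added inside `total_cost` is the lever a mechanism designer would use to change that.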

III-B Low-Communication Mechanism

We specify our low-communication mechanism as follows: Each agent reports messages of the form


where are weights for every state of subsystem for each stage; are weights for every control input of subsystem for each stage; are weights representing the “sensitivity” of agent dynamics in cost function for each stage; is vector of bounds for states/inputs; and is a reference trajectory for the states of subsystem . Restated, each agent provides some open-loop trajectory coupled with state and input bounds, as well as scalars measuring the “impact” of states, inputs, and dynamics in its cost function.

In addition, the principal announces a single real-valued function to all agents to be used as a surrogate function for their cost functions. Namely, for each agent the principal forms the surrogate function


as an approximation of the agent’s stage cost . The notation indicates the second argument is a parameter of the function and not a variable. We further only consider functions that are strictly convex for all possible parameters . Lastly, for simplicity we let the principal announce the same function for both states and inputs. But one could consider different functions for states and inputs – the key property being that it is the same function for all agents. Then the principal forms the surrogate function


as an approximation of agent’s terminal cost . Based on a message profile , the principal formulates the following surrogate optimal control problem (OCP-S):


where we explicitly consider the desired operational bounds reported by each agent for every stage .

Since is strictly convex, this optimization problem has a unique solution that we call . Note that this notation means the are the optimal inputs for OCP-S. Then, given a message profile , we have that . That is, the outcome function of the mechanism outputs exactly the state/input trajectory of the optimal solution of OCP-S. We also define to be the optimal Lagrange multipliers associated with Eq. 2 for every agent .

Now, suppose the game is repeatedly played with the same initial condition . At first, this mechanism is run for one round, meaning the principal collected some message , and solved OCP-S once. Then, before the next round the principal sends the following reference trajectory to agent :


for , where is part of the message as per (7). Observe that the reference trajectory sent to agent does not depend upon solving OCP-S. Moreover the principal will assign the following fees to each agent:


where is a state reference trajectory computed by the principal for agent given that the other agents behave according to . For example, the principal can solve another round of OCP-S, but now excluding agent ’s contribution to the objective function, in order to obtain (in a way akin to VCG mechanisms [23]). The key observation here is that the reference trajectory does not depend on the message sent by agent . The first term of the fee penalizes deviations of the computed optimal state trajectory from . The second term penalizes mismatches between the reported and the optimal state trajectory . The third term penalizes deviations between the reported sensitivity vector and the optimal Lagrange multipliers of OCP-S associated with the dynamics of agent . Lastly, the vectors are computed by the principal as follows:
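The leave-one-out idea behind this reference can be sketched in a deliberately simplified setting. The example below assumes a static, scalar problem with quadratic surrogates of the form w_j (x - r_j)^2 and no dynamics or constraints, so it is not OCP-S itself; the function names and numbers are hypothetical. It only illustrates the key property stated above: a reference computed without agent i's contribution cannot be moved by agent i's own report.

```python
def surrogate_minimizer(weights, targets, exclude=None):
    """Minimize sum_j w_j * (x - r_j)^2 over a scalar x, optionally skipping one agent.

    For this quadratic objective the minimizer is the weighted mean of the targets.
    """
    pairs = [(w, r) for j, (w, r) in enumerate(zip(weights, targets))
             if j != exclude]
    total = sum(w for w, _ in pairs)
    return sum(w * r for w, r in pairs) / total


def leave_one_out_reference(weights, targets, i):
    """Reference for agent i computed from the other agents' reports only."""
    return surrogate_minimizer(weights, targets, exclude=i)


weights = [1.0, 2.0, 1.0]
targets = [20.0, 23.0, 26.0]
joint_solution = surrogate_minimizer(weights, targets)
ref_0 = leave_one_out_reference(weights, targets, 0)

# Agent 0 changes its report; its leave-one-out reference must not move.
ref_0_after = leave_one_out_reference([5.0, 2.0, 1.0], [10.0, 23.0, 26.0], 0)
```

Because agent 0's entries are dropped before the minimization, `ref_0` and `ref_0_after` coincide, whereas `joint_solution` does depend on every report.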


where we, once again, note that this vector does not depend on the message sent by agent .

IV Equilibrium Characterization

With the mechanism defined, we can now characterize the equilibrium behavior of agents interacting via this mechanism. The goal of this section is to characterize the Nash equilibrium (NE) of the resulting game, which is a message profile . We start by first analyzing the properties of such an equilibrium, and then we show it actually exists. Our analysis begins by showing that in a NE , each agent reports a specific type of state reference trajectory to the principal.

Lemma 1

Let be a NE of the game induced by the mechanism. Then every agent reports and . In addition, the principal sends the following references to the agents:


Suppose all agents adhere to the message profile , except agent which reports some message . Since is a NE, this deviation cannot give a lower cost for agent , that is:


Now, observe that the outcome function only depends on the components of the message . Then substituting into (III-B) gives that


for all possible sensitivities and state trajectory reports . Hence is the solution of the following minimization problem:


which achieves the minimum when . Then by definition of it directly follows that


Next, observe that each agent can only “measure” the impact of other subsystems in its dynamics via the reference signal that is sent by the principal. We say a state/input sequence is feasible for agent if it is feasible for the agent’s subsystem given the reference . We proceed to show that given the reference , any feasible state/input sequence can be achieved by agent . That is, agent can send a message that makes the principal compute the input and as it solves the problem OCP-S, given that OCP-S is feasible.

Lemma 2

For any agent , given a feasible state/input sequence there exists a message such that for all possible messages of the other agents , given that the resulting optimization problem (OCP-S) is feasible for .

Fix some agent and a feasible state/input sequence . We prove this lemma by constructing the message . Specifically, suppose agent chooses and . This choice constrains OCP-S to require that and . Then for any message , OCP-S is either infeasible or returns the desired solution for agent , regardless of the message of other agents.

What this lemma implies is that given the Nash equilibrium message profile , agent can unilaterally deviate in such a way that the principal will compute as part of the optimal solution, as long as OCP-S is feasible. We proceed in writing the agent’s optimal control problem (OCP-A) in equilibrium:


where the equilibrium fee, according to Lemmas 1 and 2 is given by:


where .

Thus, in order for a message profile to be a NE, the optimal solution for OCP-S must also be an optimal solution for each agent’s OCP-A. Since both OCP-S and OCP-A are convex problems, it is enough to require that satisfy the KKT conditions for OCP-A for every agent . Next we present our main theorem, which shows that the efficient trajectory can be implemented as a Nash equilibrium of the game induced by the mechanism:

Theorem 1

(Implementability): The unique efficient trajectory can be supported as a Nash equilibrium of the game induced by the mechanism, that is . In addition the equilibrium messages satisfy


where denotes the derivative of .

First, note OCP-T is an “aggregation” of each agent’s problem: Instead of optimizing each agent separately with references for the neighbors, we optimize all agents at once. The KKT stationarity conditions for multipliers of OCP-T, associated with the dynamics, state and input constraints respectively, are


for all , and . Above, we use the notation to denote the column of matrix . In addition, we let denote the column h of . Similarly, represents the column of the stage k constraints matrix . Now, if the messages follow (21), then it is easy to see that satisfy the KKT conditions of OCP-S:


for all , , and for , and . But in the equilibrium, Lemma 1 says that the reference trajectory sent to each agent is exactly the one that would be obtained if each agent applied the input sequence . Hence solves not only OCP-S, but also each agent’s problem when is sent to the agents (OCP-A). This can be seen directly by using the multipliers , and for every agent’s subproblem and verifying that solves the KKT conditions of OCP-A. As a result, no agent has an incentive to deviate from . Hence will be a Nash equilibrium of the game induced by the mechanism.

We finish this section with some remarks on Theorem 1:

First, at equilibrium each agent reports the largest possible bounds, so that OCP-S is always feasible at equilibrium. One may ask why we include such reports in the message vector. Their presence is key to establishing Lemma 2, as they provide a “credible threat” to the mechanism (and thus to other agents). This forces the solution of OCP-S to solve each agent’s subproblem at equilibrium. A similar argument with a numerical example is given in [25] in the context of routing. Also, each agent reports weights such that the derivative of the surrogate function matches exactly their marginal cost with respect to states and inputs.

Secondly, observe that the fees paid in equilibrium are not zero, as they depend on the reference trajectory sent by the mechanism. The intuition behind this is that the fee charged to agent can capture the “externality” cost it imposes on the system by having its cost function considered by the mechanism.

Thirdly, OCP-S may be infeasible outside of equilibrium, since an agent could report an infeasible operational range. This issue can be overcome by assuming that the principal may apply some feasible control input if OCP-S ends up being infeasible.

Lastly, and more importantly, in order for the agents to behave according to the equilibrium strategies, they need to know the optimal solution for OCP-T. This means that the agents need to “learn” the equilibrium by replaying the game and refining their messages. In the next section, we provide one such simple learning process. Instead of theoretically proving its convergence to the Nash equilibrium defined in Theorem 1, we present a test case on HVAC control in an MPC setting, where the game is replayed consecutively but with a different initial condition each time. This showcases the potential use of our mechanism when a learning protocol is used within the MPC framework.
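The remark that equilibrium weights make the surrogate's derivative match the agent's marginal cost can be illustrated with a small calculation. The sketch below assumes, purely for illustration, a surrogate of the form weight * (x - reference)^2 and a hypothetical private cost alpha * (x - desired)^4; neither form is taken from the paper, and all function names and numbers are made up.

```python
def true_marginal_cost(x, desired=22.0, alpha=0.5):
    """Derivative of a hypothetical private cost alpha * (x - desired)^4."""
    return 4.0 * alpha * (x - desired) ** 3


def surrogate_derivative(x, weight, reference=20.0):
    """Derivative of the assumed surrogate weight * (x - reference)^2."""
    return 2.0 * weight * (x - reference)


def matching_weight(x_star, reference=20.0):
    """Weight that makes the surrogate slope equal the true slope at x_star.

    Requires x_star != reference; the weight is positive only when both
    slopes have the same sign at x_star.
    """
    return true_marginal_cost(x_star) / (2.0 * (x_star - reference))


x_star = 23.0
w = matching_weight(x_star)
```

With these numbers the reported weight is positive, so the surrogate stays strictly convex while agreeing with the private cost to first order at `x_star`. A single scalar can therefore carry the local information the principal needs, which is the point of low-dimensional messages.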

V HVAC Control Case Study

Fig. 1: Closed-Loop State Trajectories for (blue x-marked dashed line) P-MPC: Perfect Information Case; (red circle-marked dotted line) M-MPC: Surrogate-Mechanism Case; and (black dot-marked solid line) A-MPC: Consensus-Average Case
Fig. 2: Room Configuration with Heat Exchange Vectors highlighted

Consider a building manager who controls the HVAC system for four rooms. Each room occupant is an agent. Let the state be the room temperatures. The building manager can heat/cool each individual room: Let be the inputs in each room. Fig. 2 shows the layout of the rooms with respect to each other. Using standard HVAC models [29], the dynamics are


where ; ; ; ; and are the heat transmission coefficients between rooms; and is the heat coefficient with the outside. In addition, is the heat coefficient between the HVAC and each room. Note we treat the outside temperature as an exogenous disturbance vector.
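A coupled room-temperature model of this general kind can be sketched as follows. The discrete-time form, the 2x2 adjacency, and every coefficient below are assumptions chosen for illustration; they are not the model or the values used in the case study. The sketch only shows how room-to-room exchange, exchange with the outside, and the HVAC input enter a linear update.

```python
# Assumed 2x2 layout: each room shares a wall with two others.
ADJACENT = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
K_ROOM = 0.05  # assumed room-to-room heat transmission coefficient
K_OUT = 0.02   # assumed heat coefficient with the outside
K_HVAC = 0.10  # assumed heat coefficient between the HVAC and a room


def thermal_step(temps, inputs, outside):
    """One step of a linear room-temperature model with neighbor coupling."""
    nxt = []
    for i, t in enumerate(temps):
        exchange = sum(K_ROOM * (temps[j] - t) for j in ADJACENT[i])
        nxt.append(t + exchange + K_OUT * (outside - t) + K_HVAC * inputs[i])
    return nxt


# With the HVAC off, every room should relax to the outside temperature.
temps = [18.0, 20.0, 22.0, 24.0]
for _ in range(2000):
    temps = thermal_step(temps, [0.0] * 4, outside=15.0)
```

In this sketch the outside temperature is passed in at every step, mirroring its role as an exogenous disturbance.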

Now suppose each agent has the private cost function


where the tuple is the agent’s private information, namely their desired room temperature and two scalars regulating the trade-off between comfort and energy usage. Following the setup of our mechanism, the building manager does not know the agent’s private information nor the shape of their objective functions. The manager broadcasts the function , where is a reference temperature for the building manager.

We consider an MPC setting, where the principal’s receding horizon OCP-S at stage is given by


where are given in (24) and is a prediction of made at time . Also, we use to denote the open-loop control input computed at stage . Let be the optimal solution of (V). The manager uses the current open-loop trajectories sent by agents to compute references


where the neighborhoods match the room configurations. The principal also uses in order to compute the reference trajectory in the fee . After receiving such references, each agent solves their own OCP-A with the computed fees in the objective, obtaining a private solution vector and setting to be the Lagrange multipliers associated with the dynamics. Then each agent updates the remaining according to (21), which in our case reduces to


Lastly, and . When , all weights are initialized to unit values. All optimization problems were solved using the optimization solver MOSEK [30]. We consider an optimal control length and an MPC horizon of . We compare our mechanism-based MPC (M-MPC) with the perfect-information case (P-MPC), where the principal knows the exact form of each . We also consider a “consensus”-type case, where no weights are updated and is set to the average of the desired temperatures (A-MPC). Fig. 1 shows the closed-loop state trajectory of the three approaches. It shows that our M-MPC closely tracks the P-MPC trajectory. Note that disturbance from the outside temperature causes the room temperatures to fluctuate around the desired values.

Fig. 3 shows M-MPC recovers the P-MPC cost after a few time steps. Since we used true costs to compute P-MPC, this shows our mechanism recovers the efficient trajectory. In contrast, the case without information exchange behaves poorly. This example shows our mechanism can be used with MPC: at each stage an optimal control problem is solved, the first-stage control is applied, and agents update their messages based on knowledge received from the principal.
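The receding-horizon pattern described here (solve a finite-horizon problem, apply only the first input, repeat) can be sketched on a scalar example. The code below substitutes an unconstrained scalar linear-quadratic problem, solved by a backward Riccati recursion, for OCP-S; it does not use MOSEK, does not model the agents' message updates, and all parameter values and function names are assumptions made for illustration.

```python
def first_gain(a, b, q, r, horizon):
    """Backward Riccati recursion for a scalar LQ problem.

    Returns the feedback gain of the first stage of the horizon.
    """
    p = q  # terminal cost weight
    gain = 0.0
    for _ in range(horizon):
        gain = (a * b * p) / (r + b * b * p)
        p = q + a * a * p - a * b * p * gain
    return gain


def receding_horizon(x0, a=1.05, b=0.5, q=1.0, r=0.1, horizon=5, steps=30):
    """Re-solve the finite-horizon problem each step; apply only the first input."""
    x, trajectory = x0, [x0]
    for _ in range(steps):
        u = -first_gain(a, b, q, r, horizon) * x  # first-stage control only
        x = a * x + b * u                         # plant update
        trajectory.append(x)
    return trajectory


trajectory = receding_horizon(4.0)
```

In a mechanism-based variant, the place where `first_gain` is called is where the principal would instead solve the surrogate problem built from the latest messages, and the agents would refresh those messages before the next step.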

Fig. 3: MPC Aggregated Stage Cost with Agents’ True Utility Functions

VI Conclusion

We studied a dynamical system with several non-cooperative strategic agents. We proposed a mechanism where the agents interact via a platform and characterized the equilibrium strategies. We provided an HVAC control test case to highlight the need for mechanisms that have low communication requirements in an MPC setting.


  • [1] A. Venkat, I. Hiskens, J. Rawlings, and S. Wright, “Distributed MPC strategies with application to power system automatic generation control,” IEEE T-CST, vol. 16, no. 6, pp. 1192–1206, 2008.
  • [2] F. Kelly, A. Maulloo, and D. Tan, “Rate control for communication networks: shadow prices, proportional fairness and stability,” J. Oper. Res. Soc., vol. 49, no. 3, pp. 237–252, 1998.
  • [3] R. R. Negenborn, P.-J. van Overloop, T. Keviczky, and B. De Schutter, “Distributed model predictive control of irrigation canals.” NHM, vol. 4, no. 2, pp. 359–380, 2009.
  • [4] A. Aswani, N. Master, J. Taneja, A. Krioukov, D. Culler, and C. Tomlin, “Energy-efficient building HVAC control using hybrid system LBMPC,” IFAC Proceedings, vol. 45, no. 17, pp. 496–501, 2012.
  • [5] M. Farina and R. Scattolini, “Distributed non-cooperative MPC with neighbor-to-neighbor communication,” IFAC Proceedings, vol. 44, no. 1, pp. 404–409, 2011.
  • [6] A. Alessio, D. Barcelli, and A. Bemporad, “Decentralized model predictive control of dynamically coupled linear systems,” Journal of Process Control, vol. 21, no. 5, pp. 705–714, 2011.
  • [7] G. Ferrari-Trecate, L. Galbusera, M. P. E. Marciandi, and R. Scattolini, “Model predictive control schemes for consensus in multi-agent systems with single-and double-integrator dynamics,” IEEE Transactions on Automatic Control, vol. 54, no. 11, pp. 2560–2572, 2009.
  • [8] A. Venkat, J. Rawlings, and S. Wright, “Stability and optimality of distributed model predictive control,” in Conference on Decision and Control, 2005, pp. 6680–6685.
  • [9] J. Rawlings and D. Mayne, Model Predictive Control: Theory and Design.   Nob Hill Pub., 2009.
  • [10] Y. Mintz, J. A. Cabrera, J. R. Pedrasa, and A. Aswani, “Control synthesis for bilevel linear model predictive control,” in 2018 Annual American Control Conference (ACC).   IEEE, 2018, pp. 2338–2343.
  • [11] R. Johari and J. Tsitsiklis, “Efficiency loss in a network resource allocation game,” Math. Oper. Res., vol. 29, no. 3, pp. 407–435, 2004.
  • [12] R. Neck and E. Dockner, “Conflict and cooperation in a model of stabilization policies: A differential game approach,” Journal of Economic Dynamics and Control, vol. 11, no. 2, pp. 153–158, 1987.
  • [13] J. E. Cohen, “Cooperation and self-interest: Pareto-inefficiency of nash equilibria in finite random games,” Proceedings of the National Academy of Sciences, vol. 95, no. 17, pp. 9724–9731, 1998.
  • [14] J. Marden and A. Wierman, “Overcoming limitations of game-theoretic distributed control,” in CDC, 2009, pp. 6466–6471.
  • [15] N. Li and J. R. Marden, “Designing games to handle coupled constraints,” in Conference on Decision and Control, 2010, pp. 250–255.
  • [16] B. T. Stewart, A. N. Venkat, J. B. Rawlings, S. J. Wright, and G. Pannocchia, “Cooperative distributed model predictive control,” Systems & Control Letters, vol. 59, no. 8, pp. 460–469, 2010.
  • [17] L. Ratliff, S. Coogan, D. Calderone, and S. S. Sastry, “Pricing in linear-quadratic dynamic games,” in Allerton, 2012, pp. 1798–1805.
  • [18] S. Coogan, L. Ratliff, D. Calderone, C. Tomlin, and S. S. Sastry, “Energy management via pricing in LQ dynamic games,” in American Control Conference, 2013, pp. 443–448.
  • [19] J. Shamma, Cooperative control of distributed multi-agent systems.   John Wiley & Sons, 2008.
  • [20] F. Kelly, “Charging and rate control for elastic traffic,” T. Emerg. Telecommun. T., vol. 8, no. 1, pp. 33–37, 1997.
  • [21] W. Vickrey, “Counterspeculation, auctions, and competitive sealed tenders,” The Journal of Finance, vol. 16, no. 1, pp. 8–37, 1961.
  • [22] E. H. Clarke, “Multipart pricing of public goods,” Public Choice, vol. 11, no. 1, pp. 17–33, 1971.
  • [23] T. Groves, “Incentives in teams,” Econometrica, pp. 617–631, 1973.
  • [24] S. Yang and B. Hajek, “VCG-Kelly mechanisms for allocation of divisible goods: Adapting VCG mechanisms to one-dimensional signals,” IEEE J. Sel. Areas Commun., vol. 25, no. 6, 2007.
  • [25] F. Farhadi, J. Golestani, and D. Teneketzis, “A surrogate optimization-based mechanism for resource allocation and routing in networks with strategic agents,” IEEE Trans. Autom. Control, 2018.
  • [26] S. Reichelstein and S. Reiter, “Game forms with minimal message spaces,” Econometrica, pp. 661–692, 1988.
  • [27] R. Johari and J. N. Tsitsiklis, “Efficiency of scalar-parameterized mechanisms,” Operations Research, vol. 57, no. 4, pp. 823–839, 2009.
  • [28] Y. Ma, G. Anderson, and F. Borrelli, “A distributed predictive control approach to building temperature regulation,” in American Control Conference, 2011, pp. 2089–2094.
  • [29] A. Aswani, N. Master, J. Taneja, V. Smith, A. Krioukov, D. Culler, and C. Tomlin, “Identifying models of HVAC systems using semiparametric regression,” in ACC, 2012, pp. 3675–3680.
  • [30] Mosek, ApS, “The MOSEK optimization toolbox for MATLAB manual,” 2015.