A Variational Perturbative Approach to Planning in Graph-based Markov Decision Processes

Coordinating multiple interacting agents to achieve a common goal is a difficult task with broad applicability. This problem remains hard to solve, even when interactions are restricted to be mediated via a static interaction graph. We present a novel approximate solution method for multi-agent Markov decision problems on graphs, based on variational perturbation theory. We adopt the strategy of planning via inference, which has been explored in various prior works. We employ a non-trivial extension of a novel high-order variational method that allows for approximate inference in large networks and has been shown to surpass the accuracy of existing variational methods. To compare our method to two state-of-the-art methods for multi-agent planning on graphs, we apply it to different standard GMDP problems. We show that in cases where the goal is encoded as a non-local cost function, our method performs well, while state-of-the-art methods approach the performance of random guessing. In a final experiment, we demonstrate that our method brings significant improvements for synchronization tasks.




1 Introduction

Understanding and designing the behavior of multiple agents interacting through large networks in order to achieve a common goal is a task studied across many fields, such as artificial intelligence [Sigaud and Buffet2013] and electrical engineering [Tousi, Hosseinian, and Menhaj2010], but also economics and the biological sciences [Castellano, Fortunato, and Loreto2009] and epidemics [Venkatramanan et al.2018]. Finding optimal policies, e.g., for the distribution of information across a social or communication network, for effective intervention in molecular networks, or for vaccinations in order to prevent the spreading of diseases are actively discussed problems. In many of these applications, there exists no unique natural time-scale. In such cases, it is appropriate to reason in continuous time. The setting of multiple agents on a graph in continuous time has been previously explored [Kan and Shelton2008].

For a Markov decision process (MDP), an optimal policy can be computed in time scaling polynomially in the size of the state and action space using dynamic programming [Puterman2005]. However, in many realistic scenarios, these spaces are high dimensional, e.g., in multi-agent settings [Boutilier, Dean, and Hanks1996], where the size of the state and action space of the underlying global MDP in general scales exponentially in the number of agents. Solving such problems exactly is infeasible for large-scale systems. For this reason, various simplifying assumptions on the structure of MDPs have been proposed. Assuming a factorized state space and local representations of the transition model and the reward function that decompose according to a graph structure, so-called factored MDPs (FMDPs) [Guestrin, Koller, and Parr2001, Boutilier1996] have been defined. For this model, various approximate solution schemes have been developed [Guestrin, Koller, and Parr2001, Guestrin et al.2003].

Graph-based MDPs (GMDPs), as proposed in [Sabbadin, Peyrard, and Forsell2012], present a subclass of FMDPs in which, additionally, agent-wise policies are assumed. We note that this renders GMDPs equivalent to mMDPs [Boutilier, Dean, and Hanks1996] interacting and communicating over a graph structure. GMDPs can be solved approximately using approximate linear programming [Sabbadin, Peyrard, and Forsell2012], approximate policy iteration [Sabbadin, Peyrard, and Forsell2012], or approximate value iteration using mean-field or cluster variational methods [Cheng and Chen2013]. Additional simplifying assumptions, such as transition-independence of agents (TI-Dec-MDP), can be made [Sigaud and Buffet2013], however at the cost of reducing the descriptive power of the model. We will thus not compare to such models.

In this work, we propose a novel method for approximate inference and planning for GMDPs inspired by advances in statistical physics. We emphasize that in planning problems [Fleming and Soner2006], the system dynamics are known, given a policy. Thus, we do not encounter problems known from reinforcement learning, such as the exploration-exploitation dilemma [Puterman2005]. We employ a scheme based on variational perturbation theory [Tanaka1999, Paquet, Winther, and Opper2009, Opper, Paquet, and Winther2013, Linzner and Koeppl2018], which was originally introduced in [Plefka1982].

The manuscript is organized as follows: In Section 2, we briefly summarize the connection between variational inference and planning. Here, the main result is that maximization of the expected reward can be cast as maximization of a variational lower bound [Toussaint and Storkey2006, Furmston and Barber2010, Kappen, Gómez, and Opper2012]. In Sections 3 and 4, we develop an expectation-maximization algorithm to iteratively improve the policy for each agent individually. Lastly, we perform simulated experiments on several standard planning tasks and show realistic cases where current state-of-the-art methods perform similarly to random guessing, while our method performs well (Section 5). An implementation of our method is available via Git.


2 Background

Continuous-time MDPs on Graphs.

Figure 1: a) A minimal example of a GMDP. The state of agent is modulated by its parent . b) The same GMDP unrolled in time as directed graphical model. Agent  affects agent ’s state (blue) by influencing agent ’s choice over actions (green) defined by ’s policy. The rewards (red) of agent are determined by ’s and ’s state. It is also possible to incorporate direct modulation of the transition models by the states of adjacent agents (not displayed for readability).

A MDP models an agent picking actions according to a policy, depending on its current state. Its objective is to maximize its reward, while being subject to some, possibly hostile, environment. Herein, we define a homogeneous continuous-time MDP by a tuple . It defines a two-component Markov process through a transition intensity matrix over a countable state space and a countable action space, together with a policy . Each state-action pair is mapped to a reward via the reward function . In this work, we only consider negative rewards, which poses no restriction, as any bounded reward function can be trivially shifted into the negative half-space. For the sake of conciseness, we will often adopt shorthand notations of the type , with . Given a sequence of actions, the evolution of the MDP can be understood as a usual continuous-time Markov chain (CTMC) with the (infinitesimal) transition probability for some time-step with , and the indicator function. We note that any intensity matrix fulfils . A multi-agent MDP (mMDP) can be understood as an -component MDP over state- and action-spaces , , with denoting the Cartesian product, evolving jointly as an MDP. We state explicitly that single-component states and actions are entries of the states and actions of the global MDP, i.e., for with and for with for all . In this multi-agent setting, each component, referred to as an individual agent, has no direct access to the global state of the system, but can only observe the states of a subset of agents, which we will call its parent-set. In the following analysis, we restrict ourselves to mMDPs on graphs (GMDPs).
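As a schematic illustration of the infinitesimal transition probability, a valid intensity matrix (rows summing to zero, non-negative off-diagonals) yields a row-stochastic first-order transition matrix for a small time step; the matrix and step size below are invented for the sketch:

```python
import numpy as np

# Illustrative 2-state intensity matrix Q: rows sum to zero, off-diagonals are
# non-negative, so that P(dt) = 1 + Q*dt is a valid (infinitesimal) transition
# probability for sufficiently small dt.
Q = np.array([[-0.5, 0.5],
              [1.0, -1.0]])
dt = 1e-3
P = np.eye(2) + Q * dt          # first-order transition matrix over one step dt

assert np.allclose(P.sum(axis=1), 1.0)   # each row is a probability distribution
assert (P >= 0.0).all()                  # holds whenever dt <= 1 / max_x |Q[x, x]|
```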

For GMDPs, the parent configuration can be summarized via a directed graph structure encoding the relationship among the agents , in this context also referred to as nodes. These are connected via an edge set . The parent-set is then defined as . Conversely, we define the child-set . The -th agent's process then depends only on its current state , its action and the states of all its parents, taking values in . We display a sketch of a GMDP in Fig. 1. We note that cycles in a graphical model as in Fig. 1(a) are unproblematic, as the corresponding temporally unrolled model, as displayed in Fig. 1(b), is acyclic. For a GMDP, the global marginal transition matrix then factorizes over agents

into local conditional transition probabilities. We define local transition rates and policies for each parent configuration . In the following we write compactly and . Subsequently, we can express the local conditional transition probabilities as


We consider the problem of planning in continuous time over a countable state space.

Definition 1.

Consider a MDP with initial state and a policy . Then, we can define the (discounted) infinite horizon value function in continuous time as

with being the expectation with respect to the MDP's path measure .

We can now cast the planning problem as: for a given initial state , find a policy , such that


A common solution strategy for these kinds of problems is to solve the Bellman equation [Puterman2005]. Instead of trying to optimize a Bellman equation, we want to take advantage of the close relationship of planning and inference [Dayan and Hinton1997, Furmston and Barber2010, Toussaint and Storkey2006, Kappen, Gómez, and Opper2012, Levine and Koltun2013]. In the following, we restrict ourselves to finite horizon MDPs, for which the process evolution terminates at time , and later extend to the infinite horizon problem.
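As a sketch of how the value function of Definition 1 can be evaluated without solving a Bellman equation, the following Monte Carlo estimator uses Gillespie sampling for a toy two-state chain; all rates, rewards and the discount rate are illustrative placeholders, not taken from the paper's experiments:

```python
import numpy as np

# Monte Carlo evaluation of the discounted value function for a toy 2-state
# CTMC under a fixed policy, via Gillespie sampling.
rng = np.random.default_rng(0)
Q = np.array([[-0.5, 0.5],
              [1.0, -1.0]])     # intensity matrix of the policy-induced chain
R = np.array([-1.0, -0.1])      # negative state rewards, as assumed in the text
gamma = 1.0                     # discount rate

def discounted_value(x0, horizon=50.0, n_samples=500):
    """Estimate E[ int_0^inf e^(-gamma*t) R(X_t) dt ] from sampled trajectories."""
    total = 0.0
    for _ in range(n_samples):
        x, t, v = x0, 0.0, 0.0
        while t < horizon:
            dwell = rng.exponential(1.0 / -Q[x, x])   # exponential dwell time
            t_end = min(t + dwell, horizon)
            # reward accumulated over the dwell time, discounted analytically
            v += R[x] * (np.exp(-gamma * t) - np.exp(-gamma * t_end)) / gamma
            t += dwell
            x = 1 - x            # a 2-state chain always jumps to the other state
        total += v
    return total / n_samples
```

Discounting the reward analytically over each dwell time keeps the estimator exact up to sampling noise and horizon truncation.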

Finite Horizon Planning via Inference. In order to establish the connection between inference and planning, we can, following [Dayan and Hinton1997, Toussaint and Storkey2006, Levine and Koltun2013] or similarly [Kappen, Gómez, and Opper2012, Furmston and Barber2010], define a boolean auxiliary process taking values in , with emission probability . Defining the finite horizon trajectories and , we can express the reward-optimal posterior process for a given policy according to Definition 1 as , with meaning that for . We consider the Kullback–Leibler (KL) divergence between the posterior and a variational measure induced by a time-inhomogeneous MDP with the same policy as (in supplementary B, we show that the KL-divergence between two continuous-time MDPs with different policies diverges). We arrive at a lower bound on the marginal log-likelihood in the finite horizon case


with the variational lower bound . The full derivation and structure of (3) can be found in supplementary A and B. When performing exact inference, meaning that , the lower bound and the log-likelihood coincide, and the maximization of the value function as in Definition 1 corresponds to a maximization of the log-likelihood w.r.t. the policy , establishing the connection between planning and inference. When performing approximate inference, we can iteratively maximize the lower bound with respect to , thereby approximating the log-likelihood, followed by a maximization with respect to . This is the expectation-maximization algorithm, which has previously been applied to policy optimization [Toussaint and Storkey2006, Levine and Koltun2013].
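In its generic form (notation ours, with observations $O$, latent paths $X$ and variational measure $q$), such a bound follows from Jensen's inequality:

```latex
\log p(O \mid \pi)
= \log \int q(X)\,\frac{p(O, X \mid \pi)}{q(X)}\,\mathrm{d}X
\geq \mathbb{E}_{q}\!\left[\log p(O, X \mid \pi) - \log q(X)\right]
=: \mathcal{F}(q, \pi),
```

with equality iff $q$ equals the posterior $p(X \mid O, \pi)$; alternating maximization of $\mathcal{F}$ over $q$ (E-step) and over $\pi$ (M-step) yields the expectation-maximization scheme described above.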

Infinite Horizon Planning via Inference. The same framework as above can be used to solve (discounted) infinite horizon problems. Following [Toussaint and Storkey2006], this can be achieved by introducing a prior over horizons . As a derivation in continuous time is missing in the literature, we provide it in supplementary C. By choosing , one recovers exponential discounting with discount factor .
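Concretely, with an exponential horizon prior $P(T = t) = \gamma e^{-\gamma t}$ and $R(s)$ denoting the expected instantaneous reward (a compact version of the supplementary-C argument, in our notation):

```latex
\mathbb{E}_{T}\!\left[\int_0^T R(s)\,\mathrm{d}s\right]
= \int_0^\infty \gamma e^{-\gamma t} \int_0^t R(s)\,\mathrm{d}s\,\mathrm{d}t
= \int_0^\infty R(s) \int_s^\infty \gamma e^{-\gamma t}\,\mathrm{d}t\,\mathrm{d}s
= \int_0^\infty e^{-\gamma s}\, R(s)\,\mathrm{d}s ,
```

where the order of integration is exchanged via Fubini's theorem, recovering exponential discounting with rate $\gamma$.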

3 Variational Perturbation Theory for GMDPs

Calculating a variational lower bound exactly is, in general, intractable for interacting systems. This is often circumvented by assuming a factorized proposal distribution , which corresponds to the naive mean-field approximation. Variational perturbation theory (VPT) offers a different approach. Here, the similarity measure (the KL-divergence) itself is approximated via a series expansion [Tanaka1999]. A prominent example of this approach is Plefka's expansion [Plefka1982, Bachschmid et al.2016]. The central assumption is that variables are only weakly coupled, i.e., the interaction of variables is scaled by some small perturbation parameter . In this case, the objective is to find an expansion of the KL-divergence in orders of the interaction parameter . This approximate variational lower bound is then maximized with respect to . We note that , as in the case of cluster variational methods [Yedidia, Freeman, and Weiss2000] (CVMs), no longer provides a lower bound but only an approximation. However, in contrast to CVMs (which can be used to construct similar approximate KL-divergences [Vázquez, Ferraro, and Ricci-Tersenghi2017]), variational perturbation theory yields a controlled approximation in the perturbation parameter .

Weak Coupling Expansion. In the following, we briefly recapitulate and extend the weak coupling expansion for the lower bound in (3), as derived in [Linzner and Koeppl2018] in the context of factorized CTMCs, to (discounted) infinite horizon GMDPs. For this, we notice that the lower bound decomposes over time

where we introduced the shorthands for the marginals and the infinitesimal transition matrix of the variational process , for notational convenience.

For a weak coupling expansion, we decompose the node-wise transition probability into an uncoupled part, given by averaging over parents, and a deviation around it, defined as . Following the standard mean-field procedure, we extract a scale parameter , with having the same magnitude as the uncoupled part. This allows us to rewrite the transition matrix


We emphasize that this procedure is generic and can be performed for any transition probability. This motivates the weak-coupling expansion upon which the results in this manuscript are built, for which we define the shorthand .
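Schematically, in our own notation with $u_n$ the parent configuration of node $n$, the decomposition of the previous paragraph reads:

```latex
\tau_n(x_n' \mid x_n, u_n)
= \underbrace{\bar{\tau}_n(x_n' \mid x_n)}_{\text{parent average}}
+ \varepsilon\, \underbrace{\Delta\tau_n(x_n' \mid x_n, u_n)}_{\text{deviation}},
\qquad
\bar{\tau}_n(x_n' \mid x_n) \equiv \mathbb{E}_{u_n}\!\left[\tau_n(x_n' \mid x_n, u_n)\right],
```

with $\varepsilon$ a formal bookkeeping parameter that is set to one at the end of the calculation.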

Theorem 1 (Weak coupling expansion for GMDPs).

The time-point-wise lower bound admits an expansion in , as given in (4), into node-wise terms

The proof of this theorem proceeds along the same lines as in [Linzner and Koeppl2018].

Weak Coupling Expansion for GMDPs in Continuous Time. In order to derive the approximate variational lower bound in continuous time for a GMDP, we define variational marginal rates and , but will from now on use the redefinitions , for these objects in order to avoid clutter. We further make use of a mean-field assumption , with the shorthand , assuming factorization of the marginals. We emphasize that, in contrast to naive mean-field [Opper and Sanguinetti2008, Cohn et al.2010], we only have to assume a factorization of these marginals, but keep the dependency on the parents in the rates . Together with the normalization constraint, this defines an expansion of the proposal transition probability in time-steps of : . The proposal transition probability defines an inhomogeneous master equation


In order for to describe a probability distribution, this constraint has to be enforced at all times.

Proposition 1.

The variational lower bound of a GMDP has an expansion into agent-wise terms in the perturbation parameter


with the discounting function .


We prove our proposition by inserting the marginals into the expansion of Theorem 1. We then insert the expression for the conditional transition matrix (1). Subsequently, we take the limit . We arrive at the approximate lower bound of a GMDP. The discounting function follows from Fubini's theorem. For a detailed derivation, see supplementary D. ∎

By minimizing this functional while fulfilling continuity, we can derive approximate dynamic equations corresponding to the stationary solutions of the Lagrangian


with being the constraint enforcing (5) (see supplementary E) and Lagrange multipliers .

4 Approximate Inference

1:  Input: Initial trajectories obeying normalization, boundary conditions and , reward function .
2:  repeat
3:     for all   do
4:        Update by backward propagation (9).
5:        Update by forward propagation using (8) given .
6:     end for
7:  until Convergence (1)
8:  Output: Set of and .
Algorithm 1 Stationary points of Euler–Lagrange equation

We finally derive approximate dynamics of the GMDP as stationary points of the Lagrangian, satisfying the Euler–Lagrange equation. These are the key equations that enable us to perform scalable approximate inference for large GMDPs.

Proposition 2.

We define the agent-wise expectation . The stationary points of the Lagrangian (7) are given by the following set of ordinary differential equations for every component



with as given in the supplementary and . We note that for exponential discounting .


Differentiating with respect to , its time-derivative , and the Lagrange multiplier yields a closed set of coupled ODEs for the posterior process of the marginal distributions and the transformed Lagrange multipliers , eliminating . For more details, we refer the reader to supplementary E. ∎

Although the restriction of the reward function to decompose into local terms is not necessary, we will assume it for readability. The coupled set of ODEs can be solved iteratively as a fixed-point procedure, in the same manner as in previous works [Opper and Sanguinetti2008], in a forward-backward procedure (see Algorithm 1). Because we only need to solve ODEs to approximate the dynamics of an -agent system, we recover linear complexity in the number of agents, rendering our method scalable.
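The structure of this forward-backward fixed-point procedure can be sketched as follows; the drift terms standing in for Eqs. (8) and (9) are problem-specific and are passed in as placeholder callables here, and the numerical scheme (an explicit Euler discretization) is our own simplification:

```python
import numpy as np

# Generic skeleton of the forward-backward fixed-point iteration (Algorithm 1).
# f_fwd / f_bwd stand in for the problem-specific right-hand sides of the
# forward marginal equation and the backward multiplier equation.
def forward_backward(mu0, rho_T, f_fwd, f_bwd, T=1.0, steps=100, iters=50, tol=1e-6):
    dt = T / steps
    mu = np.tile(mu0, (steps + 1, 1))      # marginals mu_t on a time grid
    rho = np.tile(rho_T, (steps + 1, 1))   # Lagrange multipliers rho_t
    for _ in range(iters):
        mu_old = mu.copy()
        # backward sweep for the multipliers, from the terminal condition
        for k in range(steps, 0, -1):
            rho[k - 1] = rho[k] - dt * f_bwd(rho[k], mu[k])
        # forward sweep for the marginals, given the multipliers
        for k in range(steps):
            mu[k + 1] = mu[k] + dt * f_fwd(mu[k], rho[k])
            mu[k + 1] = np.clip(mu[k + 1], 1e-12, None)
            mu[k + 1] /= mu[k + 1].sum()   # keep a normalized distribution
        if np.abs(mu - mu_old).max() < tol:
            break                          # fixed point reached
    return mu, rho
```

With zero drifts, the sweeps leave the boundary conditions untouched, which makes the skeleton easy to sanity-check before plugging in the actual equations.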

We require boundary conditions for the evolution interval in order to determine a unique solution to the set of equations in Proposition 2. We thus set to the desired initial state and for free evolution of the system. We note that, while we do not consider time-dependent rewards in general, our method is capable of handling them. We use this in the following control setting: in control scenarios, a deterministic goal state of the system is often desired [Kappen, Gómez, and Opper2012]. In this case, we can put infinite reward on the goal state at the boundary . We then recover the terminal condition . By setting the reward-dependent terms in Proposition 2 to zero, we can evaluate the prior dynamics of the system given a policy. We will use this to evaluate Definition 1 approximately.

Expectation-Maximization for GMDPs. By examining the approximate lower bound of the value function, one notices that it decomposes into local agent-wise value functions, each conditioned on its parents.


The marginal log-likelihood of a GMDP has an approximate agent-wise decomposition


where the are given by Proposition 1.

Because of this, the global marginal log-likelihood can be maximized by locally maximizing the local lower bounds of the individual agents with respect to the local policies . Given the dynamic equations from Proposition 2, we now devise a strategy for scalable planning in GMDPs. For this, we notice that the solutions of these equations maximize the lower bound, thereby providing an approximation to the marginal log-likelihood. Because of (10), we can maximize this objective with respect to the policies for each agent individually. Thus, the complexity of our optimization scales linearly in the number of components. Given this maximizer, we again evaluate the dynamic equations. We repeat this until convergence, thereby implementing an expectation-maximization (EM) algorithm. This strategy is summarized in Algorithm 2. We note that the resulting policy is probabilistic, but a MAP-deterministic policy can be constructed.

1:  Input: Initial trajectories obeying normalization, boundary conditions and , reward function , initial policy .
2:  Set
3:  repeat
4:     Solve Euler-Lagrange equations given using Algorithm 1.
5:     for all   do
6:        Maximize (10) with respect to ’s.
7:        Set maximizer .
8:     end for
9:  until Convergence of (10)
10:  Output: Optimal policy .
Algorithm 2 Expectation-Maximization for Planning
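As noted above, the EM output is a probabilistic policy; a MAP-deterministic policy is obtained by taking, per local state, the most probable action. The table below is an invented two-state, two-action example:

```python
import numpy as np

# Probabilistic policy pi(u | x) for one agent: rows are local states, columns
# are actions (values invented for illustration).
pi = np.array([[0.7, 0.3],    # in state 0, action 0 is most probable
               [0.2, 0.8]])   # in state 1, action 1 is most probable

map_policy = pi.argmax(axis=1)   # deterministic MAP action per local state
assert map_policy.tolist() == [0, 1]
```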

5 Experiments

We evaluate the performance of our method on real-world problem settings against two existing state-of-the-art methods for GMDPs on different network topologies. One method is based on policy iteration in mean-field approximation (API) [Sabbadin, Peyrard, and Forsell2012], the other on approximate linear programming (ALP) [Guestrin, Koller, and Parr2001]. Both algorithms have been developed and implemented in the GMDPtoolbox [Cros et al.2017]. For small problems, we compare the performance of all algorithms to the exact solution. To ensure a correct evaluation, we first construct the GMDP problem and then transform it into the corresponding MDP problem using a built-in function of the GMDPtoolbox, in order to recover the exact solution. For small problems, we finally perform exact policy evaluation using this MDP.

As the competing methods are implemented in discrete time, we have to pass them an equivalent discrete-time version of the continuous-time problem via uniformization [Kan and Shelton2008]. For this, we generate transformed rewards and transition matrices

for some .
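A minimal sketch of the uniformization step (the rate matrix below is invented; in the same step, rewards are rescaled by the same factor):

```python
import numpy as np

# Uniformization: turn a rate matrix Q into a discrete-time transition matrix
# P = I + Q / tau, valid for any tau >= max_x |Q[x, x]|.
Q = np.array([[-0.5, 0.5],
              [1.0, -1.0]])
tau = np.abs(np.diag(Q)).max()   # smallest admissible uniformization rate
P = np.eye(2) + Q / tau

assert np.allclose(P.sum(axis=1), 1.0)   # P is row-stochastic
assert (P >= 0.0).all()
```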

GMDPs have previously been applied to a variety of problems in agriculture and forest management [Peyrard et al.2007, Sabbadin, Peyrard, and Forsell2012], socio-physics [Castellano, Fortunato, and Loreto2009, Yang et al.2018] and caching networks [Rezaei, Manoochehri, and Khalaj2018], to name a few. In the following, we benchmark our method on those problem sets. We want to compare to the exact solution; thus, the network considered is a small regular grid with nearest-neighbour bi-directional couplings, unless specified otherwise. In the end, we demonstrate scalability on a larger grid in a synchronization task experiment. We denote the policies returned by the different methods by for ALP, for API, for a random policy, and for our method VPT. For all experiments, we set the discount factor to and the atomic reward . As a metric for performance, we calculate the relative deviation (with being the exact optimal policy) in percent for the crop and forest planning problems, and the interval of total deviation for the opinion dynamics model.

Table 1: Results of disease control problem. We give the relative deviation of the values returned by different methods from the exact optimal values.

Disease Control. First, we apply our method to the task of disease control, originally posed for crop fields [Sabbadin, Peyrard, and Forsell2012]. Each crop is in either of two states – susceptible or infected (). The rate , with which a susceptible crop is infected, is proportional to the number of its infected neighbours, which we denote by . The recovery rate is assumed to be constant . The planner has to decide between two local actions for each crop – either to harvest or to leave it fallow and treat it (). Below, we summarize the transition model:

The reward model is:

In Table 1, we display the results of this experiment for different parameters. We find that API and VPT perform equally well in this problem.
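The transition model of the disease-control problem can be sketched as local rate functions; the parameter names `beta`, `delta` and `treat_boost` as well as their values are ours, chosen only for illustration (the exact parameterization follows [Sabbadin, Peyrard, and Forsell 2012] only loosely):

```python
# Local rates of the disease-control model, sketched with invented parameters.
def infection_rate(n_infected_neighbours, beta=0.2):
    # susceptible -> infected: proportional to the number of infected neighbours
    return beta * n_infected_neighbours

def recovery_rate(action, delta=0.1, treat_boost=0.5):
    # infected -> susceptible: constant base rate; action 1 ("leave fallow and
    # treat") is assumed here to add a treatment bonus
    return delta + (treat_boost if action == 1 else 0.0)

assert infection_rate(0) == 0.0            # no infected neighbours, no infection
assert recovery_rate(1) > recovery_rate(0) # treatment speeds up recovery
```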

Table 2: Results of forest management problem. We give the relative deviation of the values returned by different methods from the exact optimal values.

Forest Management. We consider the forest management problem as in [Sabbadin, Peyrard, and Forsell2012]. Here, each node has multiple states, depending on each tree's age and on whether it is damaged by wind or not. A tree can either age or become damaged over time. In a simplified scenario, we assume that a tree can either be grown – or not – or damaged (). As trees can shield one another against wind damage, this rate depends on the number of grown trees . The planner has, again, two actions – either to harvest and cut down the tree, or to leave it (). The transition model is summarized below:

As the yield depends on having neighbours for various reasons, the reward function in [Sabbadin, Peyrard, and Forsell2012] has a non-local form. We consider reward functions of the form:

The results of this experiment are displayed in Table 2, where we give the relative deviation in percent between the optimal policy and the policies returned by the different methods. We find that, for all parameters, our method performs significantly better than the other methods.

Opinion Dynamics. In this experiment, we test the performance of our method on the seminal Ising model, which has applications in socio-physics [Castellano, Fortunato, and Loreto2009] for modeling opinion dynamics, in swarming [Šošic et al.2017], and as a benchmark for multi-agent reinforcement learning [Yang et al.2018], among others. In the Ising model, each node is in either of two states and the reward function takes the form


In the following, we consider random reward functions, where the couplings are drawn from Gaussians and . Further, we model the transition rates according to opinion dynamics (voter model) [Castellano, Fortunato, and Loreto2009] and , with being the sum of the sequence , see below:
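The Ising-type reward with Gaussian couplings can be sketched as follows; the ring topology and all numerical choices are ours, for concreteness:

```python
import numpy as np

rng = np.random.default_rng(2)

# Ising-type reward R(x) = sum over edges of J_ij * x_i * x_j, with spins in
# {-1, +1} and couplings J_ij drawn from a standard Gaussian.
n = 4
edges = [(i, (i + 1) % n) for i in range(n)]       # 4 agents on a ring
J = {e: rng.normal(0.0, 1.0) for e in edges}

def reward(x):
    return sum(J[i, j] * x[i] * x[j] for (i, j) in edges)

x = np.array([1, -1, 1, -1])
assert np.isfinite(reward(x))
assert abs(reward(x) - reward(-x)) < 1e-12   # pairwise terms are spin-flip invariant
```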

The results for an ensemble of 20 random reward functions are displayed in Table 3. Again, we find that our method performs best in all tested parameter regimes, while in some cases RND achieves a higher value than API and ALP.

Table 3: Results of voter model for an ensemble of 20 random reward functions. We give the interval of the deviation of the achieved values returned by different methods from the exact optimal value.

Synchronization of Agents. In a final experiment, we compare the performance of the methods in a synchronization task. We consider a regular grid of agents. We encode a synchronization goal by a reward function as in (11) with and . The reward function takes the form of an order parameter , which measures anti-parallel alignment between neighbouring agents. Each agent's transition model is local:

We display over time for different methods in Figure 2 (LP returned the same policy as MF). For evaluation, we simulated each trained model using Gillespie sampling.

Figure 2: Results of the synchronization task. We track the mean order parameter over time under the VPT (red) and MF (blue-dashed) policy. Areas denote percent of variance.
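An order parameter measuring anti-parallel alignment between neighbouring agents can be sketched as follows; the normalization is our guess, and the paper's exact definition may differ:

```python
import numpy as np

# Order parameter for anti-parallel alignment of spins in {-1, +1}: it reaches
# +1 for a perfectly anti-aligned configuration and -1 for a fully parallel one.
def order_parameter(x, edges):
    return -np.mean([x[i] * x[j] for (i, j) in edges])

n = 6
edges = [(i, (i + 1) % n) for i in range(n)]       # ring of 6 agents
x_anti = np.array([1, -1] * (n // 2))              # perfectly anti-aligned
assert order_parameter(x_anti, edges) == 1.0
assert order_parameter(np.ones(n, dtype=int), edges) == -1.0
```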

6 Conclusion

We proposed a new method for planning in large-scale GMDPs based on variational perturbation theory. We compared our method to state-of-the-art methods for planning in GMDPs and showed that, for non-local reward functions, state-of-the-art methods approach the performance of random guessing, while our method performs well. In the future, we want to use this planning method as the basis for a new reinforcement learning algorithm for multiple agents on a graph.


We thank the anonymous reviewers for helpful comments on the previous version of this manuscript. Dominik Linzner acknowledges funding by the European Union’s Horizon 2020 research and innovation programme (iPC–Pediatric Cure, No. 826121) and (PrECISE, No. 668858). Heinz Koeppl acknowledges support by the European Research Council (ERC) within the CONSYN project, No. 773196, and by the Hessian research priority programme LOEWE within the project CompuGene.


  • [Bachschmid et al.2016] Bachschmid, L.; Battistin, C.; Opper, M.; and Roudi, Y. 2016. Variational perturbation and extended Plefka approaches to dynamics on random networks: The case of the kinetic Ising model. Journal of Physics A: Mathematical and Theoretical 49(43):434003–33.
  • [Boutilier, Dean, and Hanks1996] Boutilier, C.; Dean, T.; and Hanks, S. 1996. Planning under uncertainty: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research 11:1–94.
  • [Boutilier1996] Boutilier, C. 1996. Planning, learning and coordination in multiagent decision processes. Proceedings of the 6th conference on Theoretical aspects of rationality and knowledge 195–210.
  • [Castellano, Fortunato, and Loreto2009] Castellano, C.; Fortunato, S.; and Loreto, V. 2009. Statistical physics of social dynamics. Reviews of Modern Physics 81(2):1–58.
  • [Cheng and Chen2013] Cheng, Q., and Chen, F. 2013. Variational Planning for Graph-based MDPs. Advances in Neural Information Processing Systems.
  • [Cohn et al.2010] Cohn, I.; El-Hay, T.; Friedman, N.; and Kupferman, R. 2010. Mean field variational approximation for continuous-time Bayesian networks. Journal of Machine Learning Research.

  • [Cros et al.2017] Cros, M. J.; Aubertot, J. N.; Peyrard, N.; and Sabbadin, R. 2017. GMDPtoolbox: A Matlab library for designing spatial management policies. Application to the long-term collective management of an airborne disease. PLoS ONE 12(10):e0186014.
  • [Dayan and Hinton1997] Dayan, P., and Hinton, G. E. 1997. Using EM for reinforcement learning. Neural Computation 278(1):271–278.
  • [Fleming and Soner2006] Fleming, W. H., and Soner, H. M. 2006. Controlled Markov processes and viscosity solutions. Springer.
  • [Furmston and Barber2010] Furmston, T., and Barber, D. 2010. Variational Methods for Reinforcement Learning. International conference on artificial intelligence and statistics 241–248.
  • [Guestrin et al.2003] Guestrin, C.; Koller, D.; Parr, R.; and Venkataraman, S. 2003. Efficient solution algorithms for factored MDPs. J. Artificial Intelligence Res. 19:399–468.
  • [Guestrin, Koller, and Parr2001] Guestrin, C.; Koller, D.; and Parr, R. 2001. Multiagent Planning with Factored MDPs. Advances in Neural Information Processing Systems 1523–1530.
  • [Kan and Shelton2008] Kan, K. F., and Shelton, C. R. 2008. Solving Structured Continuous-Time Markov Decision Processes. AAAI.
  • [Kappen, Gómez, and Opper2012] Kappen, H. J.; Gómez, V.; and Opper, M. 2012. Optimal control as a graphical model inference problem. Machine Learning 87(2):159–182.
  • [Levine and Koltun2013] Levine, S., and Koltun, V. 2013. Variational policy search via trajectory optimization. Advances in Neural Information Processing Systems.
  • [Linzner and Koeppl2018] Linzner, D., and Koeppl, H. 2018. Cluster Variational Approximations for Structure Learning of Continuous-Time Bayesian Networks from Incomplete Data. Advances in Neural Information Processing Systems 7891–7901.
  • [Opper and Sanguinetti2008] Opper, M., and Sanguinetti, G. 2008. Variational inference for Markov jump processes. Advances in Neural Information Processing Systems 20 1105–1112.
  • [Opper, Paquet, and Winther2013] Opper, M.; Paquet, U.; and Winther, O. 2013. Perturbative corrections for approximate inference in Gaussian latent variable models. Journal of Machine Learning Research 14:2857–2898.
  • [Paquet, Winther, and Opper2009] Paquet, U.; Winther, O.; and Opper, M. 2009. Perturbation Corrections in Approximate Inference: Mixture Modelling Applications. Journal of Machine Learning Research 10:1263–1304.
  • [Peyrard et al.2007] Peyrard, N.; Sabbadin, R.; Lo-Pelzer, E.; and Aubertot, J. N. 2007. A Graph-based Markov Decision Process framework for Optimising Collective Management of Diseases in Agriculture: Application to Blackleg on Canola. Modsim 2007: International Congress on Modelling and Simulation 2175–2181.
  • [Plefka1982] Plefka, T. 1982. Convergence condition of the TAP equation for the infinite-range Ising spin glass model. Journal of Physics A 15:1971–1978.
  • [Puterman2005] Puterman, M. L. 2005. Markov Decision Processes: Discrete stochastic dynamic programming. Wiley-Interscience.
  • [Rezaei, Manoochehri, and Khalaj2018] Rezaei, E.; Manoochehri, H. E.; and Khalaj, B. H. 2018. Multi-agent Learning for Cooperative Large-scale Caching Networks. arXiv.
  • [Sabbadin, Peyrard, and Forsell2012] Sabbadin, R.; Peyrard, N.; and Forsell, N. 2012. A framework and a mean-field algorithm for the local control of spatial processes. International Journal of Approximate Reasoning 53(1):66–86.
  • [Sigaud and Buffet2013] Sigaud, O., and Buffet, O. 2013. Markov Decision Processes in Artificial Intelligence. Wiley.
  • [Šošic et al.2017] Šošic, A.; KhudaBukhsh, W. R.; Zoubir, A. M.; and Koeppl, H. 2017. Inverse reinforcement learning in swarm systems. In Proceedings of the International Joint Conference on Autonomous Agents and Multiagent Systems, AAMAS, volume 3, 1413–1420.
  • [Tanaka1999] Tanaka, T. 1999. A Theory of Mean Field Approximation. Advances in Neural Information Processing Systems.
  • [Tousi, Hosseinian, and Menhaj2010] Tousi, M. R.; Hosseinian, S. H.; and Menhaj, M. B. 2010. A Multi-agent-based voltage control in power systems using distributed reinforcement learning. Simulation 87(7):581–599.
  • [Toussaint and Storkey2006] Toussaint, M., and Storkey, A. 2006. Probabilistic inference for solving discrete and continuous state Markov Decision Processes. In International conference on Machine learning.
  • [Vázquez, Ferraro, and Ricci-Tersenghi2017] Vázquez, E. D.; Ferraro, G. D.; and Ricci-Tersenghi, F. 2017. A simple analytical description of the non-stationary dynamics in Ising spin systems. Journal of Statistical Mechanics: Theory and Experiment 2017(3):033303.
  • [Venkatramanan et al.2018] Venkatramanan, S.; Lewis, B.; Chen, J.; Higdon, D.; Vullikanti, A.; and Marathe, M. 2018. Using data-driven agent-based models for forecasting emerging infectious diseases. Epidemics 22:43–49.
  • [Yang et al.2018] Yang, Y.; Luo, R.; Li, M.; Zhou, M.; Zhang, W.; and Wang, J. 2018. Mean field multi-agent reinforcement learning. In 35th International Conference on Machine Learning, ICML 2018, volume 12, 8869–8886.
  • [Yedidia, Freeman, and Weiss2000] Yedidia, J. S.; Freeman, W. T.; and Weiss, Y. 2000. Bethe free energy, Kikuchi approximations, and belief propagation algorithms. Advances in Neural Information Processing Systems 13.