# Mean field for Markov Decision Processes: from Discrete to Continuous Optimization

We study the convergence of Markov decision processes, made of a large number of objects, to optimization problems on ordinary differential equations (ODEs). We show that the optimal reward of such a Markov decision process, which satisfies a Bellman equation, converges to the solution of a continuous Hamilton-Jacobi-Bellman (HJB) equation based on the mean field approximation of the Markov decision process. We give bounds on the difference of the rewards, and a constructive algorithm for deriving an approximating solution to the Markov decision process from a solution of the HJB equations. We illustrate the method on three examples pertaining respectively to investment strategies, population dynamics control and scheduling in queues. They are used to illustrate and justify the construction of the controlled ODE and to show the gain obtained by solving a continuous HJB equation rather than a large discrete Bellman equation.


## 1 Introduction

In this paper we study dynamic optimization problems on Markov decision processes composed of a large number of interacting objects.

Consider a system of N objects evolving in a common environment. At each time step, objects change their state randomly according to some probability kernel. This kernel depends on the number of objects in each state, as well as on the decisions of a centralized controller. Our goal is to study the behavior of the controlled system when N becomes large.

Several papers investigate the asymptotic behavior of such systems, but without controllers. For example, in [2, 19], the authors show that under mild conditions, as N grows, the system converges to a deterministic limit. The limiting system can be of two types, depending on the intensity (the intensity is the probability that an object changes its state between two time steps). If the intensity does not vanish as N grows, the system converges to a dynamical system in discrete time [19]. If the intensity goes to 0 as N grows, the limiting system is a continuous time dynamical system and can be described by ordinary differential equations (ODEs).

### Contributions

Here, we consider a Markov decision process where at each time step, a central controller chooses an action from a predefined set that will modify the dynamics of the system; the controller receives a reward depending on the current state of the system and on the action. The goal of the controller is to maximize the expected reward over a finite time horizon. We show that when N becomes large this problem converges to an optimization problem on an ordinary differential equation.

More precisely, we focus on the case where the Markov decision process is such that its empirical occupancy measure is also Markov; this occurs when the system consists of many interacting objects, the objects can be observed only through their state and the system evolution depends only on the collection of all states. We show that the optimal reward converges to the optimal reward of the mean field approximation of the system, which is given by the solution of an HJB equation. Furthermore, the optimal policy of the mean field approximation is also asymptotically optimal in N for the original discrete system. Our method relies on bounding techniques used in stochastic approximation and learning [4, 1]. We also introduce an original coupling method where, to each sample path of the Markov decision process, we associate a random trajectory that is obtained as a solution of the ODE, i.e. the mean field limit, controlled by random actions.

This convergence result has an algorithmic by-product. Roughly speaking, when confronted with a large Markov decision problem, we can first solve the HJB equation for the associated mean field limit and then build a decision policy for the initial system that is asymptotically optimal in N.

Our results have two main implications. The first is to justify the construction of controlled ODEs as good approximations of large discrete controlled systems. In the literature, this construction is usually done without rigorous proofs. In Section 4.3.2 we illustrate this point with an example of malware infection in computer systems.

The second implication concerns the effective computation of an optimal control policy. In the discrete case, this is usually done by using dynamic programming for the finite horizon case or by computing a fixed point of the Bellman equation in the discounted case. Both approaches suffer from the curse of dimensionality, which makes them impractical when the state space is too large. In our context, the size of the state space is exponential in N, making the problem even more acute. In practice, modern supercomputers only allow us to tackle such optimal control problems when N is no larger than a few tens [20].

The mean field approach offers an alternative to brute force computations. By letting N go to infinity, the discrete problem is replaced by a limit Hamilton-Jacobi-Bellman equation, which is deterministic and in which the dimensionality of the original system has been hidden in the occupancy measure. Solving the HJB equation numerically is sometimes rather easy, as in the examples in Sections 4.3.1 and 4.3.2. It provides a deterministic optimal policy whose reward with a finite (but large) number of objects is remarkably close to the optimal reward.

### Related Work

Several papers in the literature are concerned with the problem of mixing the limiting behavior of a large number of objects with optimization.

In [6], the value function of the Markov decision process is approximated by a linearly parametrized class of functions and a fluid approximation of the MDP is used. It is shown that a solution of the HJB equation is a value function for a modification of the original MDP problem. In [25, 8], the curse of dimensionality of dynamic programming is circumvented by approximating the value function by linear regression. Here, we use instead a mean field limit approximation and prove asymptotic optimality in N of the limit policy.

In [9], the authors also consider Markov decision processes with a growing number of objects, but in the case where the intensity does not vanish with N. In their case, the optimization problem of the system of size N converges to a deterministic optimization problem in discrete time. In this paper however, we focus on the vanishing intensity case, which is substantially different from the discrete time case because the limiting system does not evolve in discrete time anymore.

Actually, most of the papers dealing with mean field limits of optimization problems over large systems are set in a game theory framework, leading to the concept of mean field games introduced in [18]. The objects composing the system are seen as players of a game with distributed information, cost and control; their actions lead to a Nash equilibrium. To the best of our knowledge, the classic case with global information and centralized control has not yet been considered. Our work focuses precisely on classic Markov decision problems, where a central controller (our objects are passive) aims at minimizing a global cost function.

For example, a series of papers by M. Huang, P.E. Caines and R. Malhamé such as [11, 12, 13, 14] investigate the behavior of systems made of a large number of objects under distributed control. They mostly investigate Linear-Quadratic-Gaussian (LQG) dynamics and use the fact that, in this case, the solution can be given in closed form as a Riccati equation to show that the limit satisfies a Nash fixed point equation. Their more general approach uses the Nash Certainty Equivalence principle introduced in [11]. The limiting equilibrium may or may not be a global optimum. Here, we consider the general case where the dynamics and the cost may be arbitrary (we do not assume LQG dynamics), so that the optimal policy is not given in closed form. The main difference with their approach comes from the fact that we focus instead on centralized control to achieve a global optimum. The techniques to prove convergence are also rather different: our proofs are more in line with classic mean field arguments and use stochastic approximation techniques.

Another example is the work of Tembiné and others [23, 24] on the limits of games with many players. The authors provide conditions under which the limit when the number of players grows to infinity commutes with the fixed point equation satisfied by a Nash equilibrium. Again, our investigation solves a different problem and focuses on the centralized case. In addition, our approach is more algorithmic: we construct two intermediate systems, one with a finite number of objects controlled by a limit policy and one where the limit system is controlled by a stochastic policy induced by the finite system.

### Structure of the paper

The rest of the paper is structured as follows. In Section 2 we give definitions, notation and hypotheses. In Section 3 we state our main theoretical results. In Section 4 we describe the resulting algorithm and illustrate the application of our method with a few examples. The details of all proofs are in Section 5, and Section 6 concludes the paper.

## 2 Notations and Definitions

### 2.1 System with N Objects

We consider a system composed of N objects. Each object has a state from a finite set $\mathcal{S}$. Time is discrete and the state of the n-th object at step k is denoted $X^N_n(k)$. The state of the system at time k is $X^N(k)=(X^N_1(k),\dots,X^N_N(k))$. For all k, we denote by $M^N(k)$ the empirical measure of the N objects at time k:

$$M^N(k) \stackrel{\text{def}}{=} \frac{1}{N}\sum_{n=1}^{N}\delta_{X^N_n(k)}, \qquad (1)$$

where $\delta_x$ denotes the Dirac measure in x. $M^N(k)$ is a probability measure on $\mathcal{S}$ and its i-th component $M^N_i(k)$ denotes the proportion of objects in state i at time k (also called the occupancy measure).

The system $X^N(k)$ is a Markov process once the sequence of the actions taken by the controller is fixed. Let $\Gamma^N$ be the transition kernel, namely a mapping $\mathcal{S}^N\times\mathcal{S}^N\times\mathcal{A}\to[0,1]$, where $\mathcal{A}$ is the set of possible actions, such that for every $x\in\mathcal{S}^N$ and $a\in\mathcal{A}$, $\Gamma^N(x,\cdot,a)$ is a probability distribution on $\mathcal{S}^N$ and, further, if the controller takes the action a at time k and the system is in state $x=x_1\dots x_N$, then:

$$\mathbb{P}\left(X^N(k+1)=y_1\dots y_N \,\middle|\, X^N(k)=x_1\dots x_N,\ A^N(k)=a\right)=\Gamma^N(x_1\dots x_N,\,y_1\dots y_N,\,a) \qquad (2)$$

We assume that:

(A0) Objects are observable only through their states.

In particular, the controller can observe the collection of all object states, but not the identities of the objects. This assumption is required for mean field convergence to occur. In practice, it means that we need to put into the object state any information that is relevant to the description of the system.

Assumption (A0) translates into the requirement that the kernel $\Gamma^N$ be invariant under object re-labeling. Formally, let $\Sigma_N$ be the set of permutations of $\{1,\dots,N\}$. By a slight abuse of notation, for $\sigma\in\Sigma_N$ and $x\in\mathcal{S}^N$ we also denote by $\sigma(x)$ the collection of object states after the permutation, i.e. $\sigma(x)=(x_{\sigma(1)},\dots,x_{\sigma(N)})$. The requirement is that

$$\Gamma^N(\sigma(x),\sigma(y),a)=\Gamma^N(x,y,a) \qquad (3)$$

for all $x,y\in\mathcal{S}^N$, $\sigma\in\Sigma_N$ and $a\in\mathcal{A}$. A direct consequence, shown in Section 5, is:

###### Theorem 1.

For any given sequence of actions, the process $M^N(k)$ is a Markov chain.
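As a toy illustration of Theorem 1 (not taken from the paper), the following Python sketch simulates a two-state system whose per-object transition probabilities depend only on the occupancy measure and the action; the kernel is therefore invariant under relabeling, and it suffices to simulate the occupancy measure itself. All numerical parameters are hypothetical.

```python
import numpy as np

def step_occupancy(m, a, N, rng):
    """One step of the occupancy-measure Markov chain M^N for a toy 2-state model.

    m: fraction of objects in state 1; a: scalar action in [0, 1].
    Each object changes state with a probability that depends only on (m, a, N),
    so the kernel is invariant under object relabeling (assumption (A0)).
    """
    n1 = int(round(m * N))                # objects currently in state 1
    p_up = a * (1.0 - m) / N              # hypothetical transition probability 0 -> 1
    p_down = (1.0 - a) * m / N            # hypothetical transition probability 1 -> 0
    ups = rng.binomial(N - n1, p_up)      # objects moving from 0 to 1
    downs = rng.binomial(n1, p_down)      # objects moving from 1 to 0
    return (n1 + ups - downs) / N

rng = np.random.default_rng(0)
N, m = 1000, 0.2
for k in range(5 * N):                    # intensity I(N) = 1/N, rescaled horizon T = 5
    m = step_occupancy(m, a=0.7, N=N, rng=rng)
print("final occupancy of state 1:", m)
```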

### 2.2 Action, Reward and Policy

At every time k, a centralized controller chooses an action $a\in\mathcal{A}$, where $\mathcal{A}$ is called the action set; $\mathcal{A}$ is a compact metric space with some distance d. The purpose of Markov decision control is to compute optimal policies. A policy $\pi$ is a sequence of decision rules that specify the action at every time instant. The policy might depend on the sequence of past and present states of the process; however, it is known that when the state space is finite, the action set compact, and the kernel and the reward continuous, there exists a deterministic Markovian policy which is optimal (see Theorem 4.4.3 in [21]). This implies that we can limit ourselves to policies that depend only on the current state $X^N(k)$.

Further, we assume that the controller can only observe object states. Therefore she cannot distinguish between states that result from object relabeling, i.e. the policy depends on $X^N(k)$ in a way that is invariant under permutation. By Lemma 2 in Section 5.2, it depends on $M^N(k)$ only. Thus, we may assume that, for every k, the decision rule is a function from the set of occupancy measures to $\mathcal{A}$. Let $M^N_\pi(k)$ denote the occupancy measure of the system at time k when the controller applies policy $\pi$.

If the system has occupancy measure m at time k and the controller chooses the action a, she gets an instantaneous reward $r^N(m,a)$. The expected value over a finite time horizon $H^N$ starting from m when applying the policy $\pi$ is defined by

$$V^N_\pi(m) \stackrel{\text{def}}{=} \mathbb{E}\left(\sum_{k=0}^{H^N} r^N\bigl(M^N_\pi(k),\pi(M^N_\pi(k))\bigr) \,\middle|\, M^N_\pi(0)=m\right) \qquad (4)$$

The goal of the controller is to find an optimal policy that maximizes the expected value. We denote by $V^N_*(m)$ the optimal value when starting from m:

$$V^N_*(m)=\sup_\pi V^N_\pi(m) \qquad (5)$$

### 2.3 Scaling Assumptions

If at some time k the system has occupancy measure m and the controller chooses action a, the system moves to its next state with probabilities given by the kernel $\Gamma^N$. The expectation of the difference between $M^N(k+1)$ and $M^N(k)$ is called the drift and is denoted by $F^N(m,a)$:

$$F^N(m,a) \stackrel{\text{def}}{=} \mathbb{E}\left[M^N(k+1)-M^N(k)\,\middle|\,M^N(k)=m,\ A^N(k)=a\right]. \qquad (6)$$

In order to study the limit in N, we assume that $F^N(m,a)$ goes to 0 at speed $I(N)$ when N goes to infinity and that $F^N/I(N)$ converges to a Lipschitz continuous function f. More precisely, we assume that there exists a sequence $I(N)$, called the intensity of the model, with $\lim_{N\to\infty}I(N)=0$, and a sequence $I_0(N)$, also with $\lim_{N\to\infty}I_0(N)=0$, such that for all m, a and N: $\bigl\|F^N(m,a)/I(N)-f(m,a)\bigr\|\le I_0(N)$. In a sense, $N\,I(N)$ represents the order of magnitude of the number of objects that change their state within one unit of time.

The change of $M^N$ during a time step is of order $I(N)$. This suggests a rescaling of time by $I(N)$ to obtain an asymptotic result. We define the continuous time process $\hat M^N(t)$ as the affine interpolation of $M^N(k)$, rescaled by the intensity function, i.e. $\hat M^N$ is affine on each interval $[kI(N),(k+1)I(N)]$ and

$$\hat M^N\bigl(kI(N)\bigr)=M^N(k).$$

Similarly, $\hat M^N_\pi$ denotes the affine interpolation of the occupancy measure under policy $\pi$. Thus, $I(N)$ can also be interpreted as the duration of a time slot for the system with N objects.

We assume that the time horizon and the reward per time slot scale accordingly, i.e. we impose

$$H^N = \left\lfloor \frac{T}{I(N)} \right\rfloor, \qquad r^N(m,a) = I(N)\,r(m,a)$$

for every m and a (where $\lfloor x\rfloor$ denotes the largest integer smaller than or equal to x).

### 2.4 Limiting System (Mean Field Limit)

We will see in Section 3 that as N grows, the stochastic system $M^N$ converges to a deterministic limit, the mean field limit. For more clarity, all stochastic variables (i.e., for finite N) are written in uppercase and their limiting deterministic values in lowercase.

An action function $\alpha$ is a piecewise Lipschitz continuous function that associates to each time $t\in[0,T]$ an action $\alpha(t)\in\mathcal{A}$. Note that action functions and policies are different in the sense that action functions do not take into account the state to determine the next action. For an action function $\alpha$ and an initial condition $m_0$, we consider the following ordinary integral equation for $m(t)$, $t\in[0,T]$:

$$m(t)-m(0)=\int_0^t f\bigl(m(s),\alpha(s)\bigr)\,ds. \qquad (7)$$

(This equation is equivalent to an ODE, but is easier to manipulate in integral form. In the rest of the paper, we make a slight abuse of language and refer to it as an ODE.) Under the foregoing assumptions on f and $\alpha$, this equation satisfies the Cauchy-Lipschitz condition and therefore has a unique solution once the initial condition $m(0)=m_0$ is fixed. We call $\phi_t(m_0,\alpha)$, $t\in[0,T]$, the corresponding semi-flow, i.e.

$$m(t)=\phi_t(m_0,\alpha) \qquad (8)$$

is the unique solution of Eq.(7).

As for the system with N objects, we define $v_\alpha(m_0)$ as the value of the limiting system over the finite horizon T when applying the action function $\alpha$ and starting from $m_0$:

$$v_\alpha(m_0) \stackrel{\text{def}}{=} \int_0^T r\bigl(\phi_s(m_0,\alpha),\alpha(s)\bigr)\,ds. \qquad (9)$$

This equation looks similar to the stochastic case (4), although there are two main differences. The first is that the system is deterministic. The second is that it is defined for action functions and not for policies. We also define $v_*(m_0)$, the optimal value of the deterministic limit:

$$v_*(m_0)=\sup_\alpha v_\alpha(m_0), \qquad (10)$$

where the supremum is taken over all possible action functions from $[0,T]$ to $\mathcal{A}$.
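To make the limiting system concrete, here is a minimal Python sketch (not from the paper) that approximates the semi-flow $\phi_t(m_0,\alpha)$ of Eq.(7) with an explicit Euler scheme and evaluates $v_\alpha(m_0)$ from Eq.(9). The drift f, reward r and action function $\alpha$ used below are hypothetical placeholders.

```python
import numpy as np

def semi_flow(f, m0, alpha, T, dt=1e-3):
    """Explicit Euler approximation of phi_t(m0, alpha) on a regular grid of [0, T]."""
    times = np.arange(0.0, T + dt, dt)
    traj = [np.asarray(m0, dtype=float)]
    for t in times[:-1]:
        m = traj[-1]
        traj.append(m + dt * f(m, alpha(t)))   # one Euler step of Eq.(7)
    return times, np.array(traj)

def value(f, r, m0, alpha, T, dt=1e-3):
    """Riemann-sum approximation of v_alpha(m0) = int_0^T r(phi_s, alpha(s)) ds, Eq.(9)."""
    times, traj = semi_flow(f, m0, alpha, T, dt)
    return dt * sum(r(m, alpha(t)) for t, m in zip(times[:-1], traj[:-1]))

# Hypothetical two-state drift: m = (fraction in state 0, fraction in state 1).
f = lambda m, a: np.array([-a * m[0] + (1 - a) * m[1], a * m[0] - (1 - a) * m[1]])
r = lambda m, a: m[1] - 0.1 * a                # hypothetical instantaneous reward
alpha = lambda t: 1.0 if t < 2.0 else 0.3      # piecewise constant action function

print("v_alpha(m0) ~", value(f, r, m0=[1.0, 0.0], alpha=alpha, T=5.0))
```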

### 2.5 Table of Notations

We recall here a list of the main notations used throughout the paper.

| Notation | Meaning |
|---|---|
| $M^N_\pi(k)$ | Empirical measure of the system with N objects, under policy $\pi$, at time k (Section 2.2) |
| $F^N(m,a)$ | Drift of the system with N objects when the state is m and the action is a, Eq.(6) |
| $f(m,a)$ | Drift of the limiting system (limit of $F^N/I(N)$ as $N\to\infty$), Eq.(11) |
| $\phi_t(m_0,\alpha)$ | State of the limiting system at time t, Eq.(8) |
| $\pi$ | Policy for the system with N objects: associates an action to each occupancy measure |
| $\alpha$ | Action function for the limiting system: associates an action $\alpha(t)$ to each time t |
| $\pi_*$ | Optimal policy for the system with N objects |
| $\alpha_*$ | Optimal action function for the limiting system (if it exists) |
| $V^N_\pi(m)$ | Expected reward for the system with N objects starting from m under policy $\pi$, Eq.(4) |
| $V^N_*(m)$ | Optimal expected value for the system with N objects: $V^N_*(m)=\sup_\pi V^N_\pi(m)$, Eq.(5) |
| $V^N_\alpha(m)$ | Expected value for the system with N objects when applying the action function $\alpha$, Eq.(12) |
| $v_\alpha(m_0)$ | Value of the limiting system starting from $m_0$ under action function $\alpha$, Eq.(9) |
| $v_*(m_0)$ | Optimal value of the limiting system: $v_*(m_0)=\sup_\alpha v_\alpha(m_0)$, Eq.(10) |

### 2.6 Summary of Assumptions

In Section 3 we establish theorems for the convergence of the discrete stochastic optimization problem to a continuous deterministic one. These theorems are based on several technical assumptions, which are given next. Since $\mathcal{S}$ is finite, the set $\mathcal{P}(\mathcal{S})$ of probability measures on $\mathcal{S}$ is the simplex in $\mathbb{R}^{|\mathcal{S}|}$; for $m\in\mathbb{R}^{|\mathcal{S}|}$, $\|m\|$ denotes its norm and $\langle\cdot,\cdot\rangle$ the usual inner product.

##### (A1) (Transition probabilities)

Objects can be observed only through their state, i.e., the transition kernel $\Gamma^N$, defined by Eq.(2), is invariant under permutations of the objects.

There exist non-random functions $I_1(N)$ and $I_2(N)$, both vanishing as $N\to\infty$, such that for all m and any policy $\pi$, the number $\Delta^N_\pi(k)$ of objects that perform a transition between time slots k and k+1 satisfies

$$\mathbb{E}\bigl(\Delta^N_\pi(k)\,\big|\,M^N_\pi(k)=m\bigr) \le N\,I_1(N), \qquad \mathbb{E}\bigl(\Delta^N_\pi(k)^2\,\big|\,M^N_\pi(k)=m\bigr) \le N^2\,I(N)\,I_2(N)$$

where $I(N)$ is the intensity function of the model, defined in the following assumption (A2).

##### (A2) (Convergence of the Drift)

There exist non-random functions $I(N)$ and $I_0(N)$, with $\lim_{N\to\infty}I(N)=\lim_{N\to\infty}I_0(N)=0$, and a function f such that for all m, a and N:

$$\left\|\frac{1}{I(N)}F^N(m,a)-f(m,a)\right\|\le I_0(N) \qquad (11)$$

The function f is defined on $\mathcal{P}(\mathcal{S})\times\mathcal{A}$ and is bounded on this set.

##### (A3) (Lipschitz Continuity)

There exist constants $L_1$, K and $K_r$ such that for all $m,m'\in\mathcal{P}(\mathcal{S})$, $a,a'\in\mathcal{A}$ and all N:

$$\bigl\|F^N(m,a)-F^N(m',a)\bigr\| \le L_1\,\|m-m'\|\,I(N), \qquad \bigl\|f(m,a)-f(m',a')\bigr\| \le K\bigl(\|m-m'\|+d(a,a')\bigr), \qquad \bigl|r(m,a)-r(m',a)\bigr| \le K_r\,\|m-m'\|$$

We also assume that the reward r is bounded: $\sup_{m,a}|r(m,a)|<\infty$.

To make things more concrete, here is a simple but useful case where all assumptions are true.

• There are constants $c_1$ and $c_2$ such that the expectation of the number of objects that perform a transition in one time slot is at most $c_1 N I(N)$ and its standard deviation is at most $c_2\sqrt{N I(N)}$.

• $F^N(m,a)/I(N)$ can be written under the form $g(m,a,1/N)$, where g is a continuous function on $\mathcal{V}\times\mathcal{A}\times[0,\varepsilon)$ for some neighborhood $\mathcal{V}$ of $\mathcal{P}(\mathcal{S})$ and some $\varepsilon>0$, continuously differentiable with respect to its last argument.

In this case we can choose $I_1(N)=c_1 I(N)$, $I_2(N)=c_1^2 I(N)+c_2^2/N$, $f(m,a)=g(m,a,0)$ and $I_0(N)=c_3/N$ (where $c_3$ is an upper bound on the norm of the differential of g with respect to its last argument).

## 3 Mean Field Convergence

In Section 3.1 we establish the main results, then, in Section 3.2, we provide the details of the method used to derive them.

### 3.1 Main Results

The first result establishes convergence of the optimization problem for the system with N objects to the optimization problem of the mean field limit:

###### Theorem 2 (Optimal System Convergence).

Assume (A0) to (A3). If $M^N(0)\to m_0$ almost surely [resp. in probability] then:

$$\lim_{N\to\infty}V^N_*\bigl(M^N(0)\bigr)=v_*(m_0)$$

almost surely [resp. in probability], where $V^N_*$ and $v_*$ are the optimal values for the system with N objects and the mean field limit, defined in Section 2.

The proof is given in Section 5.6.

The second result states that an optimal action function for the mean field limit provides an asymptotically optimal strategy for the system with N objects. We need, at this point, to introduce a first auxiliary system, which is a system with N objects controlled by an action function borrowed from the mean field limit. More precisely, let $\alpha$ be an action function that specifies the action $\alpha(t)$ to be taken at time t. Although $\alpha$ has been defined for the limiting system, it can also be used in the system with N objects. In this case, the action function can be seen as a policy that does not depend on the state of the system: at step k, the controller applies action $\alpha(kI(N))$. By abuse of notation, we denote by $M^N_\alpha(k)$ the state of the system when applying the action function $\alpha$ (it will be clear from the notation whether the subscript is an action function or a policy). The value for this system is defined by

$$V^N_\alpha(m_0) \stackrel{\text{def}}{=} \mathbb{E}\left(\sum_{k=0}^{H^N} I(N)\,r\bigl(M^N_\alpha(k),\alpha(kI(N))\bigr) \,\middle|\, M^N_\alpha(0)=m_0\right) \qquad (12)$$

Our next result is the convergence of $\hat M^N_\alpha$ and of the corresponding value:

###### Theorem 3.

Assume (A0) to (A3), and let $\alpha$ be a piecewise Lipschitz continuous action function on $[0,T]$, with a given Lipschitz constant and a finite number of discontinuity points. Let $\hat M^N_\alpha$ be the linear interpolation of the discrete time process $M^N_\alpha(k)$. Then for all $\epsilon>0$:

$$\mathbb{P}\left\{\sup_{0\le t\le T}\bigl\|\hat M^N_\alpha(t)-\phi_t(m_0,\alpha)\bigr\| > \Bigl[\bigl\|M^N(0)-m_0\bigr\|+I_0'(N,\alpha)\,T+\epsilon\Bigr]e^{L_1 T}\right\} \le \frac{J(N,T)}{\epsilon^2} \qquad (13)$$

and

$$\bigl|V^N_\alpha\bigl(M^N(0)\bigr)-v_\alpha(m_0)\bigr| \le B'\bigl(N,\bigl\|M^N(0)-m_0\bigr\|\bigr) \qquad (14)$$

where $I_0'(N,\alpha)$, $J(N,T)$ and $B'$ are defined in Section 5.1 and satisfy $\lim_{N\to\infty}I_0'(N,\alpha)=\lim_{N\to\infty}J(N,T)=0$ and $\lim_{N\to\infty,\,\epsilon\to 0}B'(N,\epsilon)=0$.

In particular, if $M^N(0)\to m_0$ almost surely [resp. in probability] then $V^N_\alpha\bigl(M^N(0)\bigr)\to v_\alpha(m_0)$ almost surely [resp. in probability].

The proof is given in Section 5.5.

As the reward function is bounded and the time horizon is finite, the set of values $\{v_\alpha(m_0):\alpha \text{ action function}\}$ for a given initial condition $m_0$ is bounded. This set is not necessarily compact because the set of action functions may not be closed (a limit of Lipschitz continuous functions is not necessarily Lipschitz continuous). However, as the set is bounded, for all $\epsilon>0$ there exists an action function $\alpha_\epsilon$ such that $v_{\alpha_\epsilon}(m_0)\ge v_*(m_0)-\epsilon$. Theorem 2 shows that $\alpha_\epsilon$ is optimal up to $\epsilon$ for N large enough. This shows the following corollary:

###### Corollary 4 (Asymptotically Optimal Policy).

Let $\alpha_*$ be an optimal action function for the limiting system. Then

$$\lim_{N\to\infty}\bigl|V^N_{\alpha_*}-V^N_*\bigr|=0$$

In other words, an optimal action function for the limiting system is asymptotically optimal for the system with N objects.

In particular, this shows that as N grows, policies that do not take into account the state of the system (i.e., action functions) are asymptotically as good as adaptive policies. In practice, adaptive policies might perform better, especially for very small values of N; however, it is in general impossible to prove convergence for adaptive policies.

### 3.2 Derivation of Main Results

#### 3.2.1 Second Auxiliary System

The method of proof uses a second auxiliary system, the process $\phi_t(m_0,A^N_\pi)$ defined below. It is a limiting system controlled by an action function derived from the policy of the original system with N objects.

Consider the system with N objects under policy $\pi$. The process $M^N_\pi$ is defined on some probability space $\Omega$. To each $\omega\in\Omega$ corresponds a trajectory of $M^N_\pi$, and for each $\omega$ we define an action function $A^N_\pi(\omega)$. This random function is constant on each interval $[kI(N),(k+1)I(N))$ and is such that $A^N_\pi(\omega)(kI(N))$ is the action taken by the controller of the system with N objects at time slot k, under policy $\pi$.

Recall that for any $m_0$ and any action function $\alpha$, $\phi_t(m_0,\alpha)$ is the solution of the ODE (7). For every $\omega$, $\phi_t(m_0,A^N_\pi(\omega))$ is the solution of the limiting system with action function $A^N_\pi(\omega)$, i.e.

$$\phi_t\bigl(m_0,A^N_\pi(\omega)\bigr)-m_0=\int_0^t f\Bigl(\phi_s\bigl(m_0,A^N_\pi(\omega)\bigr),\,A^N_\pi(\omega)(s)\Bigr)\,ds.$$

When $\omega$ is fixed, $\phi_t(m_0,A^N_\pi(\omega))$ is a continuous time deterministic process corresponding to one trajectory of $M^N_\pi$. When considering all possible realizations $\omega$, $\phi_t(m_0,A^N_\pi)$ is a random, continuous time function “coupled” to $M^N_\pi$. Its randomness comes only from the action term $A^N_\pi$ in the ODE. In the following, we omit to write the dependence in $\omega$; $M^N_\pi$ and $\phi_t(m_0,A^N_\pi)$ will always designate the processes corresponding to the same $\omega$.

#### 3.2.2 Convergence of Controlled System

The following result is the main technical result; it shows the convergence of the controlled system in probability, with explicit bounds. Notice that it does not require any regularity assumption on the policy $\pi$.

###### Theorem 5.

Under Assumptions (A0) to (A3), for any $\epsilon>0$, any N and any policy $\pi$:

$$\mathbb{P}\left\{\sup_{0\le t\le T}\bigl\|\hat M^N_\pi(t)-\phi_t(m_0,A^N_\pi)\bigr\| > \Bigl[\bigl\|M^N(0)-m_0\bigr\|+I_0(N)\,T+\epsilon\Bigr]e^{L_1 T}\right\} \le \frac{J(N,T)}{\epsilon^2} \qquad (15)$$

where $\hat M^N_\pi$ is the linear interpolation of the discrete time system with N objects under policy $\pi$, and $J(N,T)$ is defined in Section 5.1.

Recall that $I_0(N)$ and, for fixed T, $J(N,T)$ go to 0 as $N\to\infty$. The proof is given in Section 5.3.

#### 3.2.3 Convergence of Value

Let $\pi$ be a policy and $A^N_\pi$ the random action function corresponding to a trajectory of $M^N_\pi$, as just defined. Eq.(9) defines the value for the deterministic limit when applying an action function. This defines a random variable $v_{A^N_\pi}(m_0)$ that corresponds to the value of the limit system when using $A^N_\pi$ as action function. The random part comes from $A^N_\pi$; $\mathbb{E}\bigl[v_{A^N_\pi}(m_0)\bigr]$ designates the expectation of this value over all possible $\omega$. A first consequence of Theorem 5 is the convergence of $V^N_\pi\bigl(M^N(0)\bigr)$ to $\mathbb{E}\bigl[v_{A^N_\pi}(m_0)\bigr]$, with an error that can be uniformly bounded.

###### Theorem 6 (Uniform convergence of the value).

Let $A^N_\pi$ be the random action function associated with $M^N_\pi$, as defined earlier. Under Assumptions (A0) to (A3),

$$\bigl|V^N_\pi\bigl(M^N(0)\bigr)-\mathbb{E}\bigl[v_{A^N_\pi}(m_0)\bigr]\bigr| \le B\bigl(N,\bigl\|M^N(0)-m_0\bigr\|\bigr)$$

where B is defined in Section 5.1.

Note that $\lim_{N\to\infty,\,\epsilon\to 0}B(N,\epsilon)=0$; in particular, if $M^N(0)\to m_0$ almost surely [resp. in probability] then $V^N_\pi\bigl(M^N(0)\bigr)-\mathbb{E}\bigl[v_{A^N_\pi}(m_0)\bigr]\to 0$ almost surely [resp. in probability].

The proof is given in Section 5.4.

#### 3.2.4 Putting Things Together

The proof of the main result uses the two auxiliary systems. The first auxiliary system provides a strategy for the system with N objects derived from an action function of the mean field limit; it cannot do better than the optimal value for the system with N objects, and its value is close to the optimal value of the mean field limit. Therefore, the optimal value for the system with N objects is asymptotically lower bounded by the optimal value for the mean field limit. The second auxiliary system is used in the opposite direction, which shows that, roughly speaking, for large N the two optimal values are the same. We give the details of the derivation in Section 5.6.

## 4 Applications

### 4.1 Hamilton-Jacobi-Bellman Equation and Dynamic Programming

Let us now consider the finite time optimization problem for the stochastic system and its limit from a constructive point of view. As the state space is finite, we can compute the optimal value by using a dynamic programming algorithm. If $U^N(m,t)$ denotes the optimal value for the stochastic system starting from m at time t, then $V^N_*(m)=U^N(m,0)$. The optimal value can be computed by a discrete dynamic programming algorithm [21] by setting the terminal value $U^N(\cdot,T)$ to 0 and

$$U^N(m,t)=\sup_{a\in\mathcal{A}}\ \mathbb{E}\Bigl(r^N(m,a)+U^N\bigl(M^N(t+I(N)),\,t+I(N)\bigr) \,\Big|\, M^N(t)=m,\ A^N(t)=a\Bigr). \qquad (16)$$

Then, the optimal value over the horizon $[0,T]$ is $U^N(m,0)$.

Similarly, if we denote by $u(m,t)$ the optimal value of the limiting system over the horizon $[t,T]$ when starting from m at time t, then u satisfies the classical Hamilton-Jacobi-Bellman equation:

$$\frac{\partial u}{\partial t}(m,t)+\max_{a\in\mathcal{A}}\Bigl\{\nabla_m u(m,t)\cdot f(m,a)+r(m,a)\Bigr\}=0. \qquad (17)$$

This provides a way to compute the optimal value, as well as the optimal policy, by solving the partial differential equation above.
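As a rough illustration of how Eq.(17) can be solved numerically, the sketch below parametrizes the occupancy measure of a two-state system by a single coordinate $x\in[0,1]$ and runs a backward explicit finite-difference scheme with a maximization over a finite action grid. This is only a schematic solver with hypothetical drift and reward; it is not the scheme used in the paper.

```python
import numpy as np

def solve_hjb(f, r, T, actions, nx=201, nt=2000):
    """Backward scheme for u_t + max_a { u_x * f(x,a) + r(x,a) } = 0 with u(x,T) = 0.

    Returns the grid xs, the value u(x, 0) and the maximizing action on the grid.
    The state x in [0, 1] is the fraction of objects in state 1 (two-state system).
    """
    xs = np.linspace(0.0, 1.0, nx)
    dx, dt = xs[1] - xs[0], T / nt
    u = np.zeros(nx)                        # terminal condition u(x, T) = 0
    policy = np.zeros((nt, nx))
    for k in range(nt - 1, -1, -1):
        grad = np.gradient(u, dx)           # finite-difference gradient of u in x
        best = np.full(nx, -np.inf)
        best_a = np.zeros(nx)
        for a in actions:
            h = grad * f(xs, a) + r(xs, a)  # Hamiltonian candidate for this action
            better = h > best
            best, best_a = np.where(better, h, best), np.where(better, a, best_a)
        u = u + dt * best                   # one explicit step backward in time
        policy[k] = best_a
    return xs, u, policy

# Hypothetical drift and reward, in the spirit of the pricing example of Section 4.3.1.
f = lambda x, a: 1.0 - x - a                # drift similar to Eq.(19)
r = lambda x, a: a * x                      # reward: price times proportion of subscribers
xs, u0, policy = solve_hjb(f, r, T=2.0, actions=[0.0, 1.0])
print("u(x=0.5, t=0) ~", np.interp(0.5, xs, u0))
```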

### 4.2 Algorithms

Theorem 2 above can be used to design an effective construction of an asymptotically optimal policy for the system with N objects over the horizon $[0,T]$, by using the procedure described in Algorithm 1.

Theorem 2 says that under the policy $\pi$ constructed this way, the total value is asymptotically optimal:

$$\lim_{N\to\infty}V^N_\pi\bigl(M^N(0)\bigr)=\liminf_{N\to\infty}V^N_*\bigl(M^N(0)\bigr).$$
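The body of Algorithm 1 is not reproduced here, but one plausible reading of the static construction just described is sketched below in Python: take an optimal action function of the limiting system (e.g. obtained from the HJB solution) and apply the open-loop action $\alpha_*(kI(N))$ at every slot k of the system with N objects. The helper names (`simulate_step`, `alpha_star`, `reward`) are hypothetical.

```python
def run_static_policy(simulate_step, alpha_star, m0, I_N, T, reward):
    """Static (open-loop) policy: the action only depends on the rescaled time k*I(N).

    simulate_step(m, a): one slot of the N-object system, returns the next occupancy measure.
    alpha_star(t):       action function of the limiting system (e.g. from the HJB equation).
    reward(m, a):        instantaneous reward r(m, a) of the limiting system.
    """
    horizon = int(T / I_N)                # H^N = floor(T / I(N))
    m, total = m0, 0.0
    for k in range(horizon + 1):
        a = alpha_star(k * I_N)           # action read off the mean field solution
        total += I_N * reward(m, a)       # r^N(m, a) = I(N) * r(m, a)
        m = simulate_step(m, a)
    return total
```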

The policy constructed by Algorithm 1 is static in the sense that it does not depend on the current state of the system but only on the initial state $M^N(0)$ and on the deterministic estimation of the state provided by the differential equation. One can construct a more adaptive policy by updating the starting point of the differential equation at each step. This new procedure, constructing an adaptive policy from 0 to the final horizon, is given in Algorithm 2.
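For comparison, here is a sketch of the adaptive variant in the same style (one possible reading of Algorithm 2, again with hypothetical helpers): at each slot, the limiting problem is re-anchored at the currently observed occupancy measure before the next action is chosen.

```python
def run_adaptive_policy(simulate_step, solve_action, m0, I_N, T, reward):
    """Adaptive variant: the action is recomputed from the observed state at each slot.

    solve_action(m, t): action prescribed by the mean field optimum when the ODE is
    restarted from occupancy measure m at rescaled time t (hypothetical helper, e.g.
    read off the HJB value function u(m, t)).
    """
    horizon = int(T / I_N)
    m, total = m0, 0.0
    for k in range(horizon + 1):
        a = solve_action(m, k * I_N)      # on-line correction using the current state
        total += I_N * reward(m, a)
        m = simulate_step(m, a)
    return total
```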

In practice, the total value of the adaptive policy is larger than the value of the static policy because it uses on-line corrections at each step before taking a new action. However, Theorem 2 does not provide a proof of its asymptotic optimality.

### 4.3 Examples

In this section, we develop three examples. The first one can be seen as a simple illustration of the optimal mean field approach: the limiting ODE is quite simple and can be optimized in closed analytical form.

The second example considers a classic virus propagation problem. Although virus propagation concerns discrete objects (individuals or devices), most works in the literature study a continuous approximation of the problem under the form of an ODE. The justification for passing from a discrete to a continuous model is barely mentioned in most papers (they mainly focus on the study of the ODE). Here we present a discrete dynamical system, based on a simple stochastic mobility model for the individuals, whose behavior converges to a classic continuous model. We show on a numerical example that the limiting problem provides a policy that is close to optimal, even for a system with a relatively small number of nodes.

Finally, the last example comes from routing optimization in a queueing network model of volunteer computing platforms. The purpose of this last example is to show that a discrete optimal control problem suffering from the curse of dimensionality can be replaced by a continuous optimization problem where an HJB equation must be solved over a much smaller state space.

#### 4.3.1 Utility Provider Pricing

This is a simplified discrete Merton’s problem. This example shows a case where the optimization problem in the infinite system can be solved in closed form. This can be seen as an ideal case for the mean field approach: although the original system is difficult to solve, even numerically, when N is large, taking the limit when N goes to infinity makes it simple to solve, in analytical form.

We consider a system made of a utility provider and N users; users can be either in state 1 (subscribed) or 0 (unsubscribed). The utility fixes its price $\alpha$. At every time step, one randomly chosen customer revises her status: if she is unsubscribed [resp. subscribed], she moves to the other state with probability $s(\alpha)$ [resp. $a(\alpha)$]; $s(\alpha)$ is the probability of a new subscription, and $a(\alpha)$ is the probability of attrition. We assume that s decreases with $\alpha$ and a increases with $\alpha$. If the price is large, the instant gain is large, but the utility loses customers, which eventually reduces the gain.

Within our framework, this problem can be seen as a Markovian system made of N objects (the users) and one controller (the provider). The intensity of the model is $I(N)=1/N$. Moreover, if the immediate profit is divided by N (this does not alter the optimal pricing policy), and if $x(t)$ denotes the fraction of objects in state 1 (subscribed) at time t and $\alpha(t)$ the action taken by the provider at time t, the mean field limit of the system is:

$$\frac{\partial x}{\partial t}=-x(t)\,a(\alpha(t))+(1-x(t))\,s(\alpha(t))=s(\alpha(t))-x(t)\bigl(s(\alpha(t))+a(\alpha(t))\bigr) \qquad (18)$$

and the rescaled profit over a time horizon T is $\int_0^T \alpha(t)\,x(t)\,dt$. Call $u_*(t,x)$ the optimal benefit over the interval $[t,T]$ if there is a proportion x of subscribers at time t. The Hamilton-Jacobi-Bellman equation is

$$\frac{\partial}{\partial t}u_*(t,x)+H\Bigl(x,\frac{\partial}{\partial x}u_*(t,x)\Bigr)=0 \quad\text{with}\quad H(x,p)=\max_{\alpha\in[0,1]}\Bigl[p\bigl(s(\alpha)-x\,(s(\alpha)+a(\alpha))\bigr)+\alpha x\Bigr]$$

The Hamiltonian H can be computed under reasonable assumptions on the rates of subscription and attrition s and a, which can then be used to show that there exists an optimal policy that is threshold based. To continue this illustration, we consider the radically simplified case where $\alpha$ can take only the values 0 and 1, with $s(\alpha)=1-\alpha$ and $a(\alpha)=\alpha$, in which case the ODE becomes

$$\frac{\partial x}{\partial t}=-x(t)\,\alpha(t)+(1-x(t))(1-\alpha(t))=1-x(t)-\alpha(t), \qquad (19)$$

and the profit to maximize is still $\int_0^T\alpha(t)\,x(t)\,dt$. The solution of the HJB equation can be given in closed form. The optimal policy is to choose action $\alpha=1$ in an explicit region of the $(t,x)$ plane, and $\alpha=0$ otherwise. Figure 1 shows the evolution of the proportion of subscribers when the optimal policy is used. The coloured area corresponds to all the points where the optimal policy is $\alpha=1$ (fix a high price) and the white area is where the optimal policy is to choose $\alpha=0$ (low price).

To show that this policy is indeed optimal, one has to compute the corresponding value of the benefit and show that it satisfies the HJB equation. This can be done by a case analysis: compute explicitly the value of $u_*$ in each of the zones displayed in Figure 1, and check that it satisfies Eq.(4.3.1) in each case.
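For illustration, the sketch below integrates the simplified dynamics of Eq.(19) under a hypothetical threshold policy of the kind described above (high price $\alpha=1$ inside a region of the $(t,x)$ plane, low price $\alpha=0$ elsewhere) and accumulates the profit $\int\alpha(t)x(t)\,dt$. The specific threshold rule is a placeholder, not the one obtained from the HJB solution.

```python
def simulate_pricing(threshold, T=3.0, x0=0.0, dt=1e-3):
    """Integrate dx/dt = 1 - x - alpha (Eq.(19)) under a (t, x)-threshold policy.

    threshold(t, x) returns the chosen action alpha in {0, 1};
    the instantaneous profit is alpha * x (price times proportion of subscribers).
    """
    x, profit = x0, 0.0
    for step in range(int(T / dt)):
        t = step * dt
        a = threshold(t, x)
        profit += dt * a * x               # accumulate the rescaled profit
        x += dt * (1.0 - x - a)            # one Euler step of the ODE
    return x, profit

# Hypothetical threshold rule (for illustration only, not the HJB-derived one):
# charge the high price when subscribers are plentiful or the horizon is almost over.
rule = lambda t, x: 1.0 if (x > 0.5 or t > 2.5) else 0.0
x_final, profit = simulate_pricing(rule)
print("final proportion of subscribers:", x_final, "profit:", profit)
```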

#### 4.3.2 Infection Strategy of a Viral Worm

This second example has two purposes. The first one is to provide a rigorous justification of the use of a continuous optimization approach for this classic problem in population dynamics and to show that the continuous limit provides insights on the structure of the optimal behavior for the discrete system. Here, the optimal action function can be shown to be of the bang-bang type for the limit problem, by using tools from continuous optimization such as the Pontryagin maximum principle. Theorem 2 shows that a bang-bang policy should also be asymptotically optimal in the discrete case.

The second purpose is to compare numerically the performance of the optimal policy of the deterministic limit with the performance of other policies for the stochastic system, for small values of N. We show that the limit policy is close to optimal even for moderate values of N, and that it outperforms another classic heuristic.

This example is taken from [15] and considers the propagation of infection by a viral worm. Similar epidemic models have been validated through experiments as well as simulations as realistic representations of the spread of a virus in mobile wireless networks (see [7, 22]). A susceptible node is a mobile wireless device that is not contaminated by the worm but is prone to infection. A node is infective if it is contaminated by the worm. An infective node spreads the worm to a susceptible node whenever they meet, with a fixed probability. The worm can also choose to kill an infective node, i.e., render it completely dysfunctional; such nodes are called dead. A functional node that is immune to the worm is referred to as recovered. Although the network operator uses security patches to immunize susceptibles (they become recovered) and heals infectives to the recovered state, the goal of the worm is to maximize the damage done to the network. Let the total number of nodes in the network be N, and let the proportions of susceptible, infective, recovered and dead nodes at time t be denoted by S, I, R and D, respectively. Under a uniform mobility model, the probability that a given susceptible node becomes infected during a time slot is proportional to the proportion I of infectives. The immunization of susceptibles (resp. healing of infectives) happens at a fixed rate q (resp. b), meaning that a susceptible (resp. infective) node is immunized (resp. healed) with a probability proportional to q (resp. b) at every time step.

At this point, the authors of [15] invoke the classic results of Kurtz [17] to show that the dynamics of this population process converge to the solution of the following differential equations:

$$\frac{\partial S}{\partial t}=-\beta I S-qS,\qquad \frac{\partial I}{\partial t}=\beta I S-bI-v(t)I,\qquad \frac{\partial D}{\partial t}=v(t)I,\qquad \frac{\partial R}{\partial t}=bI+qS. \qquad (20)$$

This system actually satisfies assumptions (A0) to (A3), which allows us not only to obtain the mean field limit, but also to say more about the optimization problem. The objective of the worm is to find a killing rate $v(\cdot)$ such that the damage function is maximized, under a constraint involving a convex function. In [15], this problem is shown to have a solution and the Pontryagin maximum principle is used to show that the optimal solution is of bang-bang type:

$$\exists\, t_1\in[0,T)\ \text{ such that }\ v_*(t)=0\ \text{ for } 0\le t<t_1\ \text{ and }\ v_*(t)\ \text{ is maximal for } t_1\le t\le T. \qquad (21)$$

Theorem 2 makes the formal link between the optimization of the model at the individual level and the resolution of the optimization problem on the differential equations done in [15]. It allows us to formally claim that this policy of the worm is indeed asymptotically optimal when the number N of objects goes to infinity.
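To give a numerical feel for the bang-bang structure, here is a small Python sketch (with hypothetical parameter values and a hypothetical damage function, not those of [15]) that integrates the mean field dynamics of Eq.(20) under a killing rate switching from 0 to its maximum at a time $t_1$, and searches for a good switching time by brute force.

```python
import numpy as np

def simulate_worm(t1, v_max, beta=0.6, q=0.05, b=0.05, T=10.0, dt=1e-3,
                  s0=0.9, i0=0.1):
    """Integrate Eq.(20) with the bang-bang control v(t) = 0 for t < t1, v_max after."""
    S, I, D, R = s0, i0, 0.0, 0.0
    for step in range(int(T / dt)):
        v = 0.0 if step * dt < t1 else v_max
        dS = -beta * I * S - q * S
        dI = beta * I * S - b * I - v * I
        dD = v * I
        dR = b * I + q * S
        S, I, D, R = S + dt * dS, I + dt * dI, D + dt * dD, R + dt * dR
    return S, I, D, R

# Hypothetical damage: infective plus dead fraction at the horizon T.
damage = lambda t1: sum(simulate_worm(t1, v_max=1.0)[1:3])
best_t1 = max(np.arange(0.0, 10.0, 0.5), key=damage)
print("best switching time t1 ~", best_t1, "damage ~", damage(best_t1))
```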

We investigated numerically the performance of the limit policy against various infection policies for small values of the number N of nodes in the system. These results are reported in Figure 2, where we compare four values:

• the optimal value of the limiting system;

• the optimal expected damage for the system with N objects (MDP problem);

• the expected value of the system with N objects when applying the action function that is optimal for the limiting system (i.e., the performance of Algorithm 1);

• the performance of a heuristic where, instead of using the switching time suggested by the limiting system (21), the killing probability is fixed for the whole time horizon. The curve on the figure is drawn for the best such constant probability (recomputed for each value of N).

We implemented a simulator that follows strictly the model of infection described earlier in this section, with parameters similar to those used in [15] for the dynamics and for the damage function to be optimized. It should be noted that the choice of these parameters does not qualitatively influence the results. Thanks to the relatively small size of the system, the four quantities above can be computed numerically using backward induction. The optimal policy for the deterministic limit consists in not killing machines until a switching time $t_1$ and in killing machines at the maximum rate after that time.

Theorem 2 shows that the limit policy is asymptotically optimal, but Figure 2(a) shows that, already for low values of N, these three quantities are very close. A classic heuristic for this maximal infection problem is to kill a node with a constant probability, regardless of the time horizon. Our numerical study shows that the limit policy outperforms this heuristic by a clear margin, and that the performance of the heuristic does not improve with the size of the system N.

In order to illustrate the convergence of the values $V^N_*$ and $V^N_{\alpha_*}$ to $v_*$, Figure 2(b) is a detailed view of Figure 2(a) where we show these two quantities and their common limit $v_*$. The figure shows that the convergence is indeed very fast, and other numerical experiments indicate that this holds for a large panel of parameters. Although the figure seems to indicate a systematic ordering between the finite-N values and $v_*$, this is not true in general: modifying the damage function can reverse the ordering ($V^N_{\alpha_*}$ is always less than $V^N_*$, by definition of $V^N_*$).

#### 4.3.3 Brokering Problem

Finally, let us consider a model of a volunteer computing system such as BOINC (http://boinc.berkeley.edu/). Volunteer computing means that people make their personal computers available to a computing system: when the owner is not using the computer, it is available for the computing system; as soon as the owner starts using it again, it becomes unavailable. These systems are becoming more and more popular and provide large computing power at a very low cost [16].

The Markovian model with objects is defined as follows. The objects represent the users that can submit jobs to the system and the resources that can run the jobs. The resources are grouped into a small number of clusters and all resources in the same cluster share the same characteristics in terms of speed and availability. Users send jobs to a central broker whose role is to balance the load among the clusters.

The model is a discrete time model of a queuing system. Actually, a more natural continuous-time Markov model could also be handled similarly, by using uniformization.

Each user has a state, on or off. At each time step, a user that is on sends one job with some probability and becomes off with some probability. A user that is off sends no jobs to the system and becomes on with some probability.

There is a fixed number of clusters in the system; each cluster contains several computing resources. Each resource has a buffer of bounded size. A resource can be either valid or broken. If it is valid and has one or more jobs in its queue, it completes one job with some probability during the time slot. A resource gets broken with some probability; in that case, it discards all the packets in its buffer. A broken resource becomes valid again with some probability.

At each time step, the broker takes an action and dispatches the packets it received to the clusters according to a probability distribution over the clusters. A packet sent to a cluster joins the queue of one of the resources of that cluster, according to a local rule (for example, a resource chosen at random within the cluster).