# Optimizing over a Restricted Policy Class in Markov Decision Processes

We address the problem of finding an optimal policy in a Markov decision process under a restricted policy class defined by the convex hull of a set of base policies. This problem is of great interest in applications in which a number of reasonably good (or safe) policies are already known and we are only interested in optimizing in their convex hull. We show that this problem is NP-hard to solve exactly as well as to approximate to arbitrary accuracy. However, under a condition that is akin to the occupancy measures of the base policies having large overlap, we show that there exists an efficient algorithm that finds a policy that is almost as good as the best convex combination of the base policies. The running time of the proposed algorithm is linear in the number of states and polynomial in the number of base policies. In practice, we demonstrate an efficient implementation for large-state problems. Compared to traditional policy gradient methods, the proposed approach has the advantage that, apart from the computation of occupancy measures of some base policies, the iterative method need not interact with the environment during the optimization process. This is especially important in complex systems where estimating the value of a policy can be a time-consuming process.


## 1 Introduction

In many control problems, a number of reasonable base policies are known. For example, these policies might be provided by an expert. A natural question is whether a combination of these base policies can provide an improvement. This problem is especially important when the number of states is large and the exact computation of the optimal policy is not feasible. We want to design reinforcement learning algorithms that can take this extra prior information into account. One way to formulate the problem is to define a policy space that includes all mixtures of the base policies. A member policy of this class is determined by its mixture distribution: it samples from this mixture at each state, as opposed to sampling from this mixture at the initial state and running the sampled policy until the end.

A popular method to optimize over a parameterized policy space is policy gradient, which typically employs a variant of the gradient descent method (Williams92SS, Sutton-McAllester-Singh-Mansour-2000, Konda00AA, Baxter01IP, Peters05NA, Bhatnagar09NA). Although in some applications the quality of the solution is high, policy gradient methods often converge to local minima, as the problem is highly non-convex. Further, computing a gradient estimate can be an expensive operation. For example, the finite difference method requires running a number of policies in each iteration, and estimating the value of a policy in a complicated system might require a long running time.

In this paper, we show a number of results on the problem of policy optimization in a restricted class of mixture policies. First, we show that the optimization problem is NP-hard to solve exactly as well as to approximate. The hardness result is obtained by a reduction from the independent set problem for graphs and an application of the Motzkin-Straus theorem for optimizing quadratic forms over the simplex (Motzkin65). The hardness result is somewhat surprising, since the same problem is known to be easy (in the complexity class P) if the space of base policies includes all MDP policies (an exponentially large space!) (Papadimitriou87). The critical difference is that in the unconstrained case an optimal MDP policy is known to be deterministic, in which case linear programming or policy iteration are known to run in polynomial time (Ye05), whereas in the restricted case an optimal policy may need to randomize.

Although this hardness result is somewhat disappointing, we show that an approximately optimal solution can be found in a reasonable time when the occupancy measures of the base policies have large overlap. We obtain this result by formulating the problem in the dual space. More specifically, instead of searching in the space of mixture policies, we construct a new search space that consists of linear combinations of the occupancy measures of the base policies. Each such linear combination is not an occupancy measure itself, but it defines a policy through a standard normalization. The objective function in the dual space is still highly non-convex, but we can exploit the convex relaxation proposed by ABM2014 to obtain an efficient algorithm with performance guarantees. ABM2014 study the linear programming approach to dynamic programming in the dual space (the space of occupancy measures), and propose a penalty method that minimizes the sum of the linear objective and a multiple of the constraint violations.

To demonstrate the idea, consider the problem of controlling the service rate of a queue where jobs arrive at a certain rate and the cost is the sum of the queue length and the service rate chosen. Consider two policies, one that selects low and one that selects high service rates. The space of the mixtures of these two policies is rich and is likely to contain a policy with low total cost: we can generate a wide range of service rates as a convex combination of these two base policies. Now let us consider the dual space. The occupancy measure of the first policy is concentrated at large queue lengths, while the occupancy measure of the second one is concentrated at short queue lengths. It can be shown that linear combinations of the occupancy measures generate a limited set of policies that are either similar to the first policy or to the second one. In short, although the space of mixture policies is a rich space (and hence the optimization is NP-hard in that space), the space of dual policies can be more limited. More crucially, if the base policies have some similarities so that their occupancy measures overlap, then we can generate non-trivial policies in the dual space that can be competitive with the mixture policies in the primal space. In the experiments section, we show that this is indeed the case.

Given that we use occupancy measures as the basis of the linear subspace, our approach has several advantages compared to ABM2014. First, the importance weighting distributions are no longer needed, which simplifies the design of the algorithm. Second, our bounds are tighter, as constraint violation terms no longer appear in the error bounds. Finally, unlike the algorithm of ABM2014, our algorithm does not require a backward simulator. Thus it can be used when only a forward simulator is given. This also provides a mechanism to run the algorithm using only a single trajectory of state-action observations. Our algorithm requires the occupancy measures of the base policies as input. We employ recent results of LL-2017 to efficiently compute these vectors in time linear in the size of the state space. When the size of the state space is very large, these occupancy measures can be estimated by roll-outs.

Let us compare our algorithm with the traditional policy gradient in the space of mixtures of policies. Although the space of mixtures of policies is rich and is likely to contain a policy with lower total cost than any policy produced from a mixture of occupancy measures, our approach has several advantages. First, the policy gradient method is more computationally demanding: the gradient descent method needs to perform several roll-outs in each round to estimate the gradient direction. In a complicated system, the mixing times can be large and thus we might need to run a policy for a very long time before we can reliably estimate its value. In contrast, and as we will show, apart from the initial roll-outs to estimate the occupancy measures of the base policies, the proposed method does not need to interact with the environment when optimizing in the dual space. Further, our method enjoys stronger theoretical guarantees than the policy gradient method.

### 1.1 Notation

Let $A_{i:}$ and $A_{:j}$ denote the $i$th row and $j$th column of matrix $A$, respectively. We use $I$ to denote an identity matrix. We use $\mathbf{1}_d$ and $\mathbf{0}_d$ to denote $d$-dimensional vectors with all elements equal to one and zero, respectively. We use $\mathbf{0}_{d\times d'}$ to denote the all-zeros matrix, and $e_i$ to denote the unit vector (1 at position $i$ and 0 elsewhere). We also use $\mathbb{I}\{\cdot\}$ to denote the indicator function. We use $\wedge$ and $\vee$ to denote the minimum and the maximum, respectively. We use the notation $[v]_+ = v \vee 0$ for the positive part of $v$ and, similarly, $[v]_- = v \wedge 0$ for the negative part. For vectors $v$ and $w$, $v \le w$ means element-wise inequality, i.e., $v_i \le w_i$ for all $i$. We use $\Delta_S$ to denote the space of probability distributions defined on the set $S$. For a positive integer $d$, we use $[d]$ to denote the set $\{1, 2, \dots, d\}$.

## 2 Preliminaries

In this paper, we study Reinforcement Learning (RL) problems in which the interaction between the agent and environment is modeled as a discrete, discounted Markov decision process (MDP).¹ A discrete MDP is a tuple $\langle \mathcal{X}, \mathcal{A}, c, P, \alpha, \gamma \rangle$, where $\mathcal{X}$ and $\mathcal{A}$ are the sets of states and actions, respectively; $c$ is the cost function; $P$ is the transition probability distribution that maps each state-action pair $(x,a)$ to a distribution over states, $P(\cdot|x,a)$; $\alpha$ is the initial state distribution; and $\gamma \in (0,1)$ is the discount factor. We are primarily interested in the case where the number of states is large. Note that since we consider discrete MDPs, all the MDP-related quantities can be written in vector and matrix form.

¹ For simplicity, we consider finite-state problems in this paper. The results can be extended to continuous-state problems.

We also need to specify the rule according to which the agent selects actions at each state. We assume that this rule does not depend explicitly on time. A stationary policy $\pi$ is a probability distribution over actions, conditioned on the current state. The MDP controlled by a policy $\pi$ induces a Markov chain with transition probability $P_\pi(x'|x) = \sum_{a\in\mathcal{A}} \pi(a|x)\, P(x'|x,a)$ and cost function $c_\pi(x) = \sum_{a\in\mathcal{A}} \pi(a|x)\, c(x,a)$. We denote by $J_\pi$ the value function of policy $\pi$, i.e., the expected sum of discounted costs of following policy $\pi$, and by $\nu_\pi$ and $\mu_\pi$ the state and state-action occupancy measures under policy $\pi$ and w.r.t. the starting distribution $\alpha$, respectively, i.e.,

$$\nu_\pi(x) = (1-\gamma) \sum_{x'\in\mathcal{X}} \alpha(x') \sum_{t=0}^{\infty} \gamma^t\, \mathbb{P}(X_t = x \mid X_0 = x'),$$

and $\mu_\pi(x,a) = \nu_\pi(x)\,\pi(a|x)$. Note that, in vector form, we may write $\nu_\pi^\top = (1-\gamma)\,\alpha^\top (I - \gamma P_\pi)^{-1}$, where $I$ is the identity matrix. Given a policy class $\Pi$, the goal is to find a policy $\pi \in \Pi$ that minimizes $J(\pi) = \alpha^\top J_\pi$. It is easy to show that $J(\pi) = \mu_\pi^\top c$.
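These identities are easy to check numerically. The sketch below builds a toy MDP (sizes and random dynamics are made up for illustration), solves the linear system for $\nu_\pi$, and verifies that $\nu_\pi$ and $\mu_\pi$ are probability distributions:

```python
import numpy as np

# Hypothetical toy MDP; sizes and dynamics are illustrative only.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[x, a] = dist over x'
pi = rng.dirichlet(np.ones(n_actions), size=n_states)             # pi[x] = dist over a
alpha = np.full(n_states, 1.0 / n_states)                         # initial state distribution

# Induced chain: P_pi[x, x'] = sum_a pi(a|x) P(x'|x, a)
P_pi = np.einsum('xa,xay->xy', pi, P)

# nu_pi^T = (1 - gamma) alpha^T (I - gamma P_pi)^{-1}, solved as a linear system
nu = (1 - gamma) * np.linalg.solve(np.eye(n_states) - gamma * P_pi.T, alpha)

# State-action occupancy: mu_pi(x, a) = nu_pi(x) pi(a|x)
mu = nu[:, None] * pi

assert np.isclose(nu.sum(), 1.0) and np.isclose(mu.sum(), 1.0)    # both sum to one
```

Summing the Neumann series $\sum_t \gamma^t = 1/(1-\gamma)$ shows why the $(1-\gamma)$ factor makes $\nu_\pi$ a distribution.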

In this work, we are interested in the policy optimization problem when the policy class is defined by mixtures of $m$ base policies $\pi_1, \dots, \pi_m$:

$$\Pi = \Big\{ \pi_w : \pi_w = \sum_{i=1}^m w_i\, \pi_i,\;\; w \in \Delta_{[m]} \Big\}.$$

We call the policy space $\Pi$ the primal space. Executing a policy $\pi_w \in \Pi$ amounts to sampling one of the base policies from the distribution $w$ at each time step, and acting according to this policy. Finding the best policy in $\Pi$ requires solving the following optimization problem

$$\min_{w\in\Delta_{[m]}} J(\pi_w) = \min_{w\in\Delta_{[m]}} \alpha^\top J_{\pi_w}. \qquad (1)$$

We denote by $w_*$ the solution to (1) and use $\pi_* = \pi_{w_*}$. For a positive constant $S$, we define the dual space $\Xi$ of $\Pi$ as the space of linear combinations of the state-action occupancies of the base policies, i.e.,

$$\Xi = \Big\{ \xi_\theta : \xi_\theta = \sum_{i=1}^m \theta_i\, \mu_{\pi_i},\;\; \theta \in \Theta \Big\}, \qquad (2)$$

where $\Theta \subset \mathbb{R}^m$ is a ball of radius $S$. Here, $S$ is the radius of the parameter space and restricts the size of the policy class. Note that given the definition of $\Theta$, each $\xi_\theta \in \Xi$ is not necessarily a state-action occupancy measure. However, each $\xi_\theta$ corresponds to a policy $\pi_\theta$, defined as

$$\pi_\theta(a|x) = \frac{[\xi_\theta(x,a)]_+}{\sum_{a'\in\mathcal{A}} [\xi_\theta(x,a')]_+}. \qquad (3)$$

If $[\xi_\theta(x,a)]_+ = 0$ for all $a \in \mathcal{A}$, we let $\pi_\theta(\cdot|x)$ be the uniform distribution. We denote by $\mu_{\pi_\theta}$ the state-action occupancy of this policy. The policy optimization problem in the dual space is defined as

$$\min_{\theta\in\Theta} J(\pi_\theta) = \min_{\theta\in\Theta} \alpha^\top J_{\pi_\theta} = \min_{\theta\in\Theta} \mu_{\pi_\theta}^\top c, \qquad (4)$$

where $\pi_\theta$ is computed from $\xi_\theta$ using Eq. 3.
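The normalization of Eq. 3, together with the uniform fallback above, is straightforward to implement. The sketch below is a minimal illustration; the occupancy measures `mu1`, `mu2` and the weights are toy numbers of ours, not values from the paper:

```python
import numpy as np

def policy_from_xi(xi):
    """Eq. 3: normalize the positive part of xi(x, a) over actions; states where
    [xi(x, .)]_+ is identically zero fall back to the uniform distribution."""
    clipped = np.maximum(xi, 0.0)                    # [xi]_+
    row_sums = clipped.sum(axis=1, keepdims=True)
    n_actions = xi.shape[1]
    safe = np.where(row_sums > 0, row_sums, 1.0)     # avoid division by zero
    return np.where(row_sums > 0, clipped / safe, 1.0 / n_actions)

# xi_theta = theta_1 mu_1 + theta_2 mu_2 with one negative weight (toy numbers)
mu1 = np.array([[0.3, 0.1], [0.2, 0.4]])
mu2 = np.array([[0.1, 0.3], [0.4, 0.2]])
xi = 1.5 * mu1 - 0.5 * mu2                           # theta = (1.5, -0.5)
pi = policy_from_xi(xi)
assert np.allclose(pi.sum(axis=1), 1.0)              # valid conditional distributions
```

Note that a negative $\theta_i$ can zero out some entries of $\xi_\theta$, which is exactly why the clipping in Eq. 3 is needed.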

## 3 Hardness Result

In this section we show that the policy optimization problem in the primal space (Eq. 1) is NP-hard. At a high level, the proof involves designing a special MDP and base policies such that solving Eq. 1 in polynomial time would imply P = NP.

###### Theorem 3.1.

Given a discounted MDP, a set of base policies, and a target cost, it is NP-hard to decide whether there exists a mixture of these policies that has expected cost at most the target.

###### Proof.

We reduce from the INDEPENDENT-SET problem. This problem asks, for a given (undirected and with no self-loops) graph $G$ with vertex set $V$, $|V| = m$, and a positive integer $k \le m$, whether $G$ contains an independent set $I$ with $|I| \ge k$. This problem is NP-complete (Garey79).

Let $G$ also denote (with slight abuse of notation) the symmetric, 0-1 adjacency matrix of the input graph, and let $A = I + G$, where $I$ is the $m \times m$ identity matrix. The reduction constructs a deterministic MDP with $m+3$ states and $m$ actions, and $m$ deterministic policies $\pi_1, \dots, \pi_m$ that induce corresponding chains $P_{\pi_i}$, for $i \in [m]$, where each matrix reads

$$P_{\pi_i} = \begin{bmatrix} 0 & e_i^\top & 0 & 0 \\ \mathbf{0}_m & \mathbf{0}_{m\times m} & A_{:i} & \mathbf{1}_m - A_{:i} \\ 0 & \mathbf{0}_m^\top & 0 & 1 \\ 0 & \mathbf{0}_m^\top & 0 & 1 \end{bmatrix}. \qquad (5)$$

A mixture of the base policies, with weights $w \in \Delta_{[m]}$, induces the chain

$$P_{\pi_w} = \sum_{i=1}^m w_i\, P_{\pi_i} = \begin{bmatrix} 0 & w^\top & 0 & 0 \\ \mathbf{0}_m & \mathbf{0}_{m\times m} & Aw & \mathbf{1}_m - Aw \\ 0 & \mathbf{0}_m^\top & 0 & 1 \\ 0 & \mathbf{0}_m^\top & 0 & 1 \end{bmatrix}. \qquad (6)$$

It is easy to see that, for each $t \ge 3$, the $t$th power of $P_{\pi_w}$ is all zeros except for the last column, which is all ones, while its square reveals the quadratic form $w^\top A w$ in position $(1, m+2)$. Hence, for initial state distribution $\alpha = e_1$ (all mass on the first state), state-only-dependent cost vector $c = e_{m+2}$ (all zeros except for 1 in the $(m+2)$th state), and discount factor $\gamma$, the expected discounted cost of $\pi_w$ is

$$J(\pi_w) = (1-\gamma)\,\alpha^\top (I - \gamma P_{\pi_w})^{-1} c = (1-\gamma)\,\alpha^\top \big( I + \gamma P_{\pi_w} + \gamma^2 P_{\pi_w}^2 + \cdots \big)\, c = (1-\gamma)\,\gamma^2\, w^\top A w. \qquad (7)$$

For any graph $G$ with $m$ vertices, the following identity holds (Motzkin65):

$$\frac{1}{\omega(G)} = \min_{y\in\Delta_{[m]}} y^\top (I + G)\, y, \qquad (8)$$

where $\omega(G)$ is the size of the maximum independent set of $G$. Let the target cost be $(1-\gamma)\,\gamma^2 / k$, where $k$ is the target integer of the INDEPENDENT-SET instance. Then the decision question is equivalent to asking whether $\min_{w\in\Delta_{[m]}} w^\top A w \le 1/k$, where we used (7). Hence, it follows from (8) that the existence of a vector $w$ that satisfies $w^\top A w \le 1/k$ would imply $1/\omega(G) \le 1/k$, and therefore $\omega(G) \ge k$, i.e., $G$ contains an independent set of size at least $k$. This establishes that deciding the MDP policy optimization problem in polynomial time would also decide the INDEPENDENT-SET problem in polynomial time, implying P = NP. ∎
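The construction is easy to sanity-check numerically. The sketch below builds the chain of Eq. (6) under our reading of the block structure in Eq. (5), for a hypothetical 3-vertex path graph, and verifies the identity (7); the graph instance and the mixture weights are illustrative choices of ours:

```python
import numpy as np

def mixture_chain(A, w):
    """P_{pi_w} of Eq. (6): states are (start, m vertex states, cost state, sink)."""
    m = len(w)
    P = np.zeros((m + 3, m + 3))
    P[0, 1:m + 1] = w                 # start -> vertex i with probability w_i
    Aw = A @ w
    P[1:m + 1, m + 1] = Aw            # vertex i -> cost state with probability (Aw)_i
    P[1:m + 1, m + 2] = 1 - Aw        # vertex i -> sink otherwise
    P[m + 1, m + 2] = 1               # cost state -> sink
    P[m + 2, m + 2] = 1               # sink is absorbing
    return P

G = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], float)   # path graph on 3 vertices
A = np.eye(3) + G
m, gamma = 3, 0.9
w = np.array([0.5, 0.0, 0.5])         # uniform over the independent set {1, 3}

alpha = np.eye(m + 3)[0]              # all mass on the start state
c = np.eye(m + 3)[m + 1]              # cost 1 at the (m+2)th state
P = mixture_chain(A, w)
J = (1 - gamma) * alpha @ np.linalg.solve(np.eye(m + 3) - gamma * P, c)
assert np.isclose(J, (1 - gamma) * gamma**2 * (w @ A @ w))   # Eq. (7)
assert np.isclose(w @ A @ w, 0.5)     # = 1/omega(G): the max independent set has size 2
```

The cost state can be reached only at $t = 2$, which is where the single $\gamma^2 w^\top A w$ term in (7) comes from.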

#### Remarks:

1. The same technique can be used to show NP-hardness of the problem under an average cost criterion. This only requires changing the last two rows of the matrices $P_{\pi_i}$, by having the 1's in the first column instead of the last column. In that case, if $\rho = (\rho_1, \rho_v^\top, \rho_{m+2}, \rho_{m+3})$ is the stationary distribution of $P_{\pi_w}$, where $\rho_v$ is an $m$-vector and $\rho_1, \rho_{m+2}, \rho_{m+3}$ are scalars, we can algebraically solve the eigensystem $\rho^\top P_{\pi_w} = \rho^\top$ (by elementary manipulations) to get $\rho_1 = 1/3$ and $\rho_{m+2} = \tfrac{1}{3}\, w^\top A w$. Hence, for cost vector $c = e_{m+2}$, the average cost of the MDP under $\pi_w$ is $\tfrac{1}{3}\, w^\top A w$, and by choosing target cost $1/(3k)$ the Motzkin-Straus argument applies as above.

2. The reduction via INDEPENDENT-SET automatically establishes hardness of approximating the problem to within a factor $m^{1-\epsilon}$, for any $\epsilon > 0$ (Haastad99, Zuckerman06).

3. Optimizing over a restricted policy class essentially converts the MDP to a POMDP, for which related complexity results are known (Papadimitriou87, Mundhenk00, Vlassis12).

## 4 Reduction to Convex Optimization

In this section, we first propose an algorithm to solve the policy optimization problem in the dual space (Eq. 4) and then prove a bound on the performance of the policy returned by our algorithm compared to the solution of the policy optimization problem in the primal space (Eq. 1).

Figure 1 contains the pseudocode of our proposed algorithm. The algorithm takes the base policies and the dual parameter space as input. It first computes the state-action occupancy measures of the base policies, i.e., $\mu_{\pi_1}, \dots, \mu_{\pi_m}$. This is done using the recent results of LL-2017, which show that it is possible to compute the stationary distribution (occupancy measure) of a Markov chain in time linear in the size of the state space.² This guarantees that we can compute the state-action occupancy measures of the base policies efficiently, even when the size of the state space is large. Alternatively, when the size of the state space is very large, these occupancy measures can be estimated by roll-outs.

² These results apply to the discounted case as well.

ABM2014 propose minimizing a convex surrogate function which, when the argument is a linear combination of occupancy measures, reads

$$L(\theta) = c^\top \xi_\theta + H \sum_{(x,a)\in\mathcal{X}\times\mathcal{A}} \big|[\xi_\theta(x,a)]_-\big|.$$

Here, $H$ is a parameter that penalizes negative values in $\xi_\theta$. At each time step $t$, our algorithm first computes an estimate of the sub-gradient $\nabla L(\theta_t)$ and then feeds it to the projected sub-gradient method to update the policy parameters $\theta_t$. In order to compute this estimate, we first sample a state-action pair $(x_t, a_t)$ from the mixture $\frac{1}{m}\sum_{i=1}^m \mu_{\pi_i}$, and then compute the function

$$g_t(\theta_t) = c^\top M - H\, \frac{m\, M_{(x_t,a_t)}}{\sum_{i=1}^m \mu_{\pi_i}(x_t,a_t)}\, \mathbb{I}\{\xi_{\theta_t}(x_t,a_t) < 0\}, \qquad (9)$$

where $M$ is a $|\mathcal{X}||\mathcal{A}| \times m$ matrix whose $i$th column is $\mu_{\pi_i}$, and $M_{(x,a)}$ is the row of $M$ corresponding to the state-action pair $(x,a)$. To sample from $\frac{1}{m}\sum_{i=1}^m \mu_{\pi_i}$, we first select a number $i \in [m]$ uniformly at random and then sample a state-action pair from the state-action occupancy measure of the $i$th base policy, i.e., $\mu_{\pi_i}$. We now show that $g_t(\theta)$ in (9) is an unbiased estimate of $\nabla L(\theta)$:

$$\begin{aligned} \mathbb{E}[g_t(\theta)] &= c^\top M - H\, \mathbb{E}\!\left[ \frac{M_{(x_t,a_t)}}{(1/m)\sum_{i=1}^m \mu_{\pi_i}(x_t,a_t)}\, \mathbb{I}\{\xi_\theta(x_t,a_t) < 0\} \right] \\ &= c^\top M - H \sum_{(x,a)\in\mathcal{X}\times\mathcal{A}} \frac{M_{(x,a)}\, \mathbb{P}(x_t = x, a_t = a)}{(1/m)\sum_{i=1}^m \mu_{\pi_i}(x,a)}\, \mathbb{I}\{\xi_\theta(x,a) < 0\} \\ &= c^\top M - H \sum_{(x,a)\in\mathcal{X}\times\mathcal{A}} M_{(x,a)}\, \mathbb{I}\{\xi_\theta(x,a) < 0\} \\ &= \nabla L(\theta). \end{aligned}$$

After $T$ rounds, we average the computed policy parameters and obtain the final solution $\hat\theta = \frac{1}{T}\sum_{t=1}^T \theta_t$. The parameter vector $\hat\theta$ defines an element $\xi_{\hat\theta}$ of the dual space $\Xi$ (Eq. 2), which in turn defines a policy $\pi_{\hat\theta}$ using Eq. 3. We denote by $\mu_{\pi_{\hat\theta}}$ the state-action occupancy measure of this policy. Our main result in this section is that $\pi_{\hat\theta}$ is near-optimal in the dual space $\Xi$. This result is a consequence of Theorem 1 of ABM2014 applied to the dual space $\Xi$.
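A compact sketch of the whole loop (sampling, the estimate of Eq. 9, a projected step, and iterate averaging) is given below. We assume here that $\Theta$ is a Euclidean ball of radius $S$, and the step size and toy inputs are illustrative choices of ours, not the paper's pseudocode:

```python
import numpy as np

def optimize_dual(mu, c, S=10.0, H=5.0, T=2000, eta=0.01, seed=0):
    """Projected stochastic subgradient method for the surrogate L(theta).

    mu: (m, K) array, row i is mu_{pi_i} flattened over the K state-action pairs;
    c:  (K,) cost vector. Theta is assumed to be the L2 ball of radius S."""
    rng = np.random.default_rng(seed)
    m, K = mu.shape
    M = mu.T                           # K x m matrix with columns mu_{pi_i}
    grad_lin = c @ M                   # the c^T M term of Eq. (9)
    mix = mu.mean(axis=0)              # sampling distribution (1/m) sum_i mu_{pi_i}
    theta = np.zeros(m)
    avg = np.zeros(m)
    for _ in range(T):
        k = rng.choice(K, p=mix)       # sample (x_t, a_t) from the mixture
        g = grad_lin.copy()
        if M[k] @ theta < 0:           # indicator {xi_theta(x_t, a_t) < 0}
            g -= H * M[k] / mix[k]     # = H m M_(x,a) / sum_i mu_{pi_i}(x,a), Eq. (9)
        theta = theta - eta * g        # subgradient step
        norm = np.linalg.norm(theta)
        if norm > S:                   # project back onto the ball of radius S
            theta *= S / norm
        avg += theta
    return avg / T                     # averaged iterate hat-theta

# Toy inputs: two base occupancy measures over three state-action pairs.
mu = np.array([[0.5, 0.3, 0.2], [0.2, 0.3, 0.5]])
c = np.array([1.0, 0.5, 0.1])
theta_hat = optimize_dual(mu, c)
assert theta_hat.shape == (2,) and np.linalg.norm(theta_hat) <= 10.0
```

Note that the environment never appears in the loop: everything is computed from the precomputed occupancy measures, which is the advantage discussed in the introduction.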

###### Theorem 4.1.

Let $\delta \in (0,1)$ be the probability of error, $T$ be the number of rounds, and $H$ be the constraints multiplier in the subgradient estimate (9). Let $\hat\theta$ be the output of the stochastic subgradient method after $T$ rounds, run with a constant learning rate $\eta$. Then, with probability at least $1-\delta$,

$$J(\hat\theta) = \alpha^\top J_{\pi_{\hat\theta}} \le \min_{\theta\in\Theta} \left( \alpha^\top J_{\pi_\theta} + \Big( \frac{1}{\eta} + \frac{6}{1-\gamma} \Big) \sum_{(x,a)} \big|[\xi_\theta(x,a)]_-\big| + \frac{6\sqrt{m}\, C S \eta}{1-\gamma} + O(\eta) \right),$$

where the constants hidden in the big-O notation are polynomials in the problem parameters, such as $m$, $S$, and $1/(1-\gamma)$.

Let $U(\theta) = \sum_{(x,a)} \big|[\xi_\theta(x,a)]_-\big|$ be the constraint violation. If we make the optimal choice of $\eta$, we get that

$$J(\hat\theta) = \alpha^\top J_{\pi_{\hat\theta}} \le \min_{\theta\in\Theta} \left( \alpha^\top J_{\pi_\theta} + O\!\left( \max\Big( \frac{U(\theta)}{1-\gamma},\, \sqrt{\frac{U(\theta)}{1-\gamma}} \Big) \right) \right).$$

This result highlights a number of interesting features of our approach. First, unlike the algorithm of ABM2014, our algorithm does not require a backward simulator. Second, error terms involving stationarity constraints do not appear in the performance bound.

Our next result shows that $\pi_{\hat\theta}$ is near-optimal as long as the occupancy measures of the base policies have large overlap.

###### Theorem 4.2.

Let $\pi_*$ be the solution of the primal problem (1). We have the following bound on the performance of the policy returned by the algorithm in Figure 1:

 J(πˆθ)α⊤Jπˆθ≤minw∈Δ[m]J(πw)α⊤Jπw+minθ∈Θ(m∑i=1θi(μπi−μπ∗)⊤c+O(U(θ)1−γ∨√U(θ)1−γ)).

Further, the computational complexity of the algorithm is linear in the number of states and polynomial in the number of base policies.

Before proving Theorem 4.2, we need to prove the following lemma.

###### Lemma 4.3.

Let $\epsilon = \max_{i,j\in[m]} \|\nu_{\pi_i} - \nu_{\pi_j}\|_1$. Then, for any $i \in [m]$ and any policy $\pi_w$ in the primal space $\Pi$, we have

$$\big\|\nu_{\pi_i} - \nu_{\pi_w}\big\|_1 \le \frac{\epsilon\,(1+\gamma)}{1-\gamma}.$$
###### Proof.

For any $i, j \in [m]$, there exists a vector $v_{i,j}$ with $\|v_{i,j}\|_1 \le \epsilon$ such that

$$\nu_{\pi_i} - \nu_{\pi_j} = v_{i,j}. \qquad (10)$$

We may rewrite Eq. 10 as

$$\alpha^\top (I - \gamma P_{\pi_i})^{-1} = \alpha^\top (I - \gamma P_{\pi_j})^{-1} + v_{i,j}^\top,$$

and further as

$$\alpha^\top (I - \gamma P_{\pi_i})^{-1} (I - \gamma P_{\pi_j}) = \alpha^\top + v_{i,j}^\top (I - \gamma P_{\pi_j}). \qquad (11)$$

Now, for a policy $\pi_w \in \Pi$ corresponding to the weight vector $w$, we may write

$$\begin{aligned} \alpha^\top (I - \gamma P_{\pi_i})^{-1} \Big( I - \gamma \sum_{k=1}^m w_k P_{\pi_k} \Big) &= \sum_{k=1}^m w_k\, \alpha^\top (I - \gamma P_{\pi_i})^{-1} (I - \gamma P_{\pi_k}) \\ &\overset{(a)}{=} \alpha^\top \sum_{k=1}^m w_k + \sum_{k=1}^m w_k\, v_{i,k}^\top (I - \gamma P_{\pi_k}) \\ &\overset{(b)}{=} \alpha^\top + \sum_{k=1}^m w_k\, v_{i,k}^\top (I - \gamma P_{\pi_k}), \end{aligned} \qquad (12)$$

where (a) is from Eq. 11 and (b) is from the fact that $w \in \Delta_{[m]}$, and thus, $\sum_{k=1}^m w_k = 1$. From Eq. 12, we have

$$\alpha^\top (I - \gamma P_{\pi_i})^{-1} = \alpha^\top \Big( I - \gamma \sum_{k=1}^m w_k P_{\pi_k} \Big)^{-1} + \sum_{k=1}^m w_k\, v_{i,k}^\top (I - \gamma P_{\pi_k}) \Big( I - \gamma \sum_{l=1}^m w_l P_{\pi_l} \Big)^{-1}. \qquad (13)$$

Since $P_{\pi_w} = \sum_{k=1}^m w_k P_{\pi_k}$, we may rewrite Eq. 13 as

$$\nu_{\pi_i}^\top - \nu_{\pi_w}^\top = \alpha^\top (I - \gamma P_{\pi_i})^{-1} - \alpha^\top (I - \gamma P_{\pi_w})^{-1} = \sum_{k=1}^m w_k\, v_{i,k}^\top (I - \gamma P_{\pi_k}) \Big( I - \gamma \sum_{l=1}^m w_l P_{\pi_l} \Big)^{-1}.$$

Let

$$\epsilon' = \max_{i\in[m]} \Big\| \sum_{k=1}^m w_k\, v_{i,k}^\top (I - \gamma P_{\pi_k}) \Big( I - \gamma \sum_{l=1}^m w_l P_{\pi_l} \Big)^{-1} \Big\|_1,$$

$z_k^\top = v_{i,k}^\top (I - \gamma P_{\pi_k})$, $Q = \sum_{l=1}^m w_l P_{\pi_l}$, and $M = (I - \gamma Q)^{-1}$. We may write

$$\begin{aligned} \big\|\nu_{\pi_i} - \nu_{\pi_w}\big\|_1 \le \epsilon' &\le \Big\| M^\top \sum_{k=1}^m w_k z_k \Big\|_1 \le \|M^\top\|_1 \Big\| \sum_{k=1}^m w_k z_k \Big\|_1 \\ &\le \underbrace{\big\| I + \gamma Q^\top + \gamma^2 (Q^2)^\top + \cdots \big\|_1}_{\le\, 1/(1-\gamma)}\; \sum_{k=1}^m w_k \underbrace{\big\| (I - \gamma P_{\pi_k})^\top v_{i,k} \big\|_1}_{\le\, (1+\gamma)\,\|v_{i,k}\|_1\, \le\, (1+\gamma)\,\epsilon} \\ &\le \frac{\epsilon\,(1+\gamma)}{1-\gamma}. \end{aligned}$$

This concludes the proof. ∎

Now we are ready to prove Theorem 4.2.

###### Proof of Theorem 4.2.

From Lemma 2 of ABM2014, for any $\theta \in \Theta$, we have

$$\big\|\xi_\theta - \mu_{\pi_\theta}\big\|_1 \le \frac{3\, U(\theta)}{1-\gamma}.$$

From Lemma 4.3, we have $\|\nu_{\pi_i} - \nu_{\pi_w}\|_1 \le \epsilon(1+\gamma)/(1-\gamma)$ for any $i \in [m]$ and $\pi_w \in \Pi$. Thus, for any $\theta \in \Theta$, we may write

$$\begin{aligned} \alpha^\top J_{\pi_\theta} &= \alpha^\top J_{\pi_*} + \mu_{\pi_\theta}^\top c - \mu_{\pi_*}^\top c \\ &= \alpha^\top J_{\pi_*} + \xi_\theta^\top c - \mu_{\pi_*}^\top c + \mu_{\pi_\theta}^\top c - \xi_\theta^\top c \\ &\le \alpha^\top J_{\pi_*} + (\xi_\theta - \mu_{\pi_*})^\top c + \frac{3\, U(\theta)}{1-\gamma} \\ &= \alpha^\top J_{\pi_*} + \sum_i \theta_i\, (\mu_{\pi_i} - \mu_{\pi_*})^\top c + \frac{3\, U(\theta)}{1-\gamma}. \end{aligned}$$

Let $b$ be the $m$-vector with entries $b_i = (\mu_{\pi_i} - \mu_{\pi_*})^\top c$. We get that

$$\alpha^\top J_{\pi_\theta} \le \alpha^\top J_{\pi_*} + \theta^\top b + \frac{3\, U(\theta)}{1-\gamma}.$$

Thus, by Theorem 4.1,

$$\alpha^\top J_{\pi_{\hat\theta}} \le \alpha^\top J_{\pi_*} + \min_{\theta\in\Theta} \left( \theta^\top b + O\!\left( \max\Big( \frac{U(\theta)}{1-\gamma},\, \sqrt{\frac{U(\theta)}{1-\gamma}} \Big) \right) \right).$$

Let $\lambda > 0$ be such that for any $i, j \in [m]$ and any $(x,a)$,

$$\frac{\lambda}{1+\lambda}\, \mu_{\pi_j}(x,a) \le \mu_{\pi_i}(x,a) \le \frac{1+\lambda}{\lambda}\, \mu_{\pi_j}(x,a). \qquad (14)$$

A large value of $\lambda$ indicates large overlap among the occupancy measures. We can obtain an easy but non-trivial bound under condition (14) as follows. Let $\pi_1$ be the base policy with the smallest value $\alpha^\top J_{\pi_1}$ and $\pi_m$ be the base policy with the largest value, and let $\varepsilon = (\mu_{\pi_1} - \mu_{\pi_*})^\top c$; parameter $\lambda$ quantifies the overlap, and $\varepsilon$ provides an error bound. Choose $\theta_1 = 1+\lambda$, $\theta_m = -\lambda$, and all other elements of $\theta$ equal to zero. Then

$$U(\theta) = \sum_{(x,a)} \big|[\xi_\theta(x,a)]_-\big| = \sum_{(x,a)} \Big| \big[ (1+\lambda)\, \mu_{\pi_1}(x,a) - \lambda\, \mu_{\pi_m}(x,a) \big]_- \Big|.$$

For each term to be negative, we must have

$$\mu_{\pi_1}(x,a) \le \frac{\lambda}{1+\lambda}\, \mu_{\pi_m}(x,a),$$

which contradicts assumption (14). Thus, $U(\theta) = 0$. We also get that

$$\theta^\top b = (1+\lambda)\, (\mu_{\pi_1} - \mu_{\pi_*})^\top c - \lambda\, (\mu_{\pi_m} - \mu_{\pi_*})^\top c = (\mu_{\pi_1} - \mu_{\pi_*})^\top c - \lambda\, \big( \alpha^\top J_{\pi_m} - \alpha^\top J_{\pi_1} \big).$$

Thus,

$$\alpha^\top J_{\pi_{\hat\theta}} \le \alpha^\top J_{\pi_*} + \varepsilon - \lambda\, \alpha^\top (J_{\pi_m} - J_{\pi_1}). \qquad (15)$$

Let’s compare the above result to a simple bound:

$$\alpha^\top J_{\pi_1} \le \alpha^\top J_{\pi_*} + \varepsilon. \qquad (16)$$

The term $\lambda\, \alpha^\top (J_{\pi_m} - J_{\pi_1})$ in (15) shows the amount of improvement compared to the bound in (16). This term is positive given that $\pi_m$ has a larger value than $\pi_1$. ∎

## 5 Experiments

In this section, we compare the results of combining policies in the policy space ($\Pi$) and in the occupancy measure space ($\Xi$), where the base policies have overlapping occupancy measures. We consider two different queuing problems. The results show that solving the problem in the $\Xi$ space yields a better policy.

### 5.1 Queuing Problem: Single Queue

Consider a queue of length $L$. We denote the state of the system at time $t$ by $x_t \in \{0, 1, \dots, L\}$, which is the number of jobs in the queue. This problem has been studied before in de2003linear. Jobs arrive at the queue with rate $p$. The action at each time, $a_t$, is chosen from a finite set $\mathcal{A}$ of service rates (departure probabilities). The transition function of the system is then defined as:

$$x_{t+1} = \begin{cases} x_t - 1 & \text{with probability } a_t \\ x_t + 1 & \text{with probability } p \\ x_t & \text{otherwise.} \end{cases} \qquad (17)$$

At the boundaries, the system goes from state $x_t = 0$ to $x_{t+1} = 1$ with probability $p$ and stays in $x_t = 0$ with probability $1-p$; likewise, the transition from state $x_t = L$ to $x_{t+1} = L-1$ has probability $a_t$ and the system stays in $x_t = L$ with probability $1 - a_t$. The cost incurred by being in state $x$ and taking action $a$ is the sum of the queue length and a cost that increases with the chosen service rate.
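A single transition of this chain, including the boundary behavior just described, can be simulated as follows (a minimal sketch; the parameter values in the usage lines are made up):

```python
import random

def queue_step(x, a, p, L, rng=random):
    """One step of Eq. (17), with the boundary cases at x = 0 and x = L.
    Assumes a + p <= 1 so the three interior cases are valid probabilities."""
    u = rng.random()
    if x == 0:
        return 1 if u < p else 0          # arrival w.p. p, else stay
    if x == L:
        return L - 1 if u < a else L      # departure w.p. a, else stay
    if u < a:
        return x - 1                      # departure (service rate a)
    if u < a + p:
        return x + 1                      # arrival
    return x                              # otherwise stay

random.seed(1)
x, L = 0, 10
trajectory = []
for _ in range(1000):
    x = queue_step(x, a=0.5, p=0.3, L=L)
    trajectory.append(x)
assert all(0 <= v <= L for v in trajectory)
```

Averaging the per-step cost along such a trajectory gives the roll-out estimate of a policy's value mentioned above.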

Consider a queue with fixed length $L$ and two base policies. The two policies are independent of the state, each a fixed distribution over the set of actions. We consider two scenarios: 1) mixing the original policies in the space $\Pi$; 2) building a policy by combining the corresponding occupancy measures (optimization in the space $\Xi$). We intentionally choose two similar policies with high overlap in occupancy measure to show the effectiveness of finding a good mixture in the space $\Xi$ in this regime, while the gain from combining the original policies is not promising.

Figure 5 shows the results of optimization in the two spaces, along with the costs of the two base policies $\pi_1$ and $\pi_2$. Suppose $\pi_w = w\,\pi_1 + (1-w)\,\pi_2$ is our mixture policy. Then, for $w \in [0, 1]$, $J(\pi_w)$ is a monotonically increasing function of $w$. In other words, there is no gain in mixing the policies when $w$ is between $0$ and $1$. For this experiment, we let $w$ take values outside this interval to examine the lowest possible cost; the best mixing weight falls outside $[0,1]$. However, in the space of occupancy measures we can build a better policy. Suppose $\xi_\theta = \theta\, \mu_{\pi_1} + (1-\theta)\, \mu_{\pi_2}$. Then the cost associated with the policy built from $\xi_\theta$ decreases monotonically as we decrease $\theta$, and it saturates at a value much lower than the cost of the optimal policy mixture in the space $\Pi$.

### 5.2 Queuing Problem: Four Queues

Consider a system with four queues, each with finite capacity, shown in Figure 6. This problem has been studied in several works (chen1999value, kumar1990dynamic, de2003linear). There are two servers in the system: server 1 serves queues 1 and 4, and server 2 serves queues 2 and 3, each with its own service rate. Each server serves only one of its associated queues at each time. Jobs arrive at queues 1 and 3 with a fixed rate. A job leaves the system after being served at queue 1 and then queue 2, or at queue 3 and then queue 4. The state of the system is denoted by a four-dimensional vector $x = (x_1, x_2, x_3, x_4)$, where $x_i$ represents the number of jobs in queue $i$ at each time. A controller chooses a four-dimensional binary action $a = (a_1, a_2, a_3, a_4)$ such that $a_1 + a_4 \le 1$ and $a_2 + a_3 \le 1$ (each server serves at most one queue). The cost at each time is equal to the total number of jobs in the system: $c(x, a) = \sum_{i=1}^4 x_i$.

We fix the arrival and service rates, and choose our base policies from a family of policies for which each server $i$ serves its longer queue with probability $p_i$ and its shorter queue with probability $1 - p_i$. Five base policies and their associated costs are shown in Table 1.

Solving this problem in the space of policies using policy gradient results in an optimal weight vector that gives all the weight to the base policy with the lowest cost. Therefore, the cost of the mixture policy equals the lowest cost among the base policies.

Interestingly, if we solve the problem in the dual space (the space of stationary distributions) using policy gradient and Eq. 3, after 200 iterations the resulting policy has a lower cost than the best base policy (note that, based on the definition of $\Theta$, the parameter vector can have negative values). However, this method is not computationally efficient. Using our stochastic subgradient method, we can approximate this cost quickly and efficiently, and the resulting policy attains a comparable cost. Figure 9 shows the difference between the costs of policy gradient and the stochastic subgradient method at each iteration.

Figure 7: Cost per iteration (step) for the primal and the dual space. The policy gradient (b) for the dual space is the method described in Algorithm 1. Horizontal dashed lines are the costs of the base policies.

Figure 8: 8-queue problem.

### 5.3 Queuing Problem: Eight Queues

We consider a queuing problem with eight queues, similar to the problem in de2003linear with slightly different parameters. Consider a system with three servers and eight queues. Similar to the 4-queue problem, each server is responsible for serving multiple queues. There are two pipelines: jobs in the first pipeline enter the system at the first queue and exit from the third queue, while jobs in the second pipeline enter the system at the fourth queue and exit from the eighth queue. Fig. 8 illustrates the system. Each queue $i$ has a service rate $d_i$, $i \in [8]$. There is no limit on the length of the queues. The state of the system is an 8-dimensional vector that gives the number of jobs in each queue. Actions are 8-dimensional binary vectors, where each dimension indicates whether the associated queue is served or not. The cost at each time step is equal to the total number of jobs in the system. Each server can serve only one queue at each time. We choose three base policies from the family of policies described below.

Family of base policies: The base policies are parameterized by three components, $p_1$, $p_2$, and $p_3$, one associated with each server, which satisfy the following properties:

• If all of the queues associated with server $j$ are empty, the server is idle.

• If only one of the queues of server $j$ is non-empty, that queue is served with probability $p_j$.

• If server $j$ has more than one non-empty queue, it serves the longest queue with probability $p_j$ and each of the remaining non-empty queues with probability $(1-p_j)/(n-1)$, where $n$ is the number of non-empty queues.
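Under our reading of the three rules above, a server's serving distribution can be written as follows (a sketch; the function and variable names are ours, and the even split of the remaining probability mass is our interpretation):

```python
import numpy as np

def server_action_probs(lengths, p):
    """Probability of serving each of one server's queues under the base-policy
    family above. The probabilities may sum to less than 1, in which case the
    remaining mass corresponds to the server idling."""
    lengths = np.asarray(lengths)
    probs = np.zeros(len(lengths))
    nonempty = np.flatnonzero(lengths > 0)
    if len(nonempty) == 0:
        return probs                                   # all queues empty: idle
    if len(nonempty) == 1:
        probs[nonempty[0]] = p                         # served with probability p
        return probs
    longest = nonempty[np.argmax(lengths[nonempty])]
    probs[nonempty] = (1 - p) / (len(nonempty) - 1)    # the rest share 1 - p
    probs[longest] = p                                 # longest queue: probability p
    return probs

probs = server_action_probs([3, 1, 0], p=0.7)
assert np.allclose(probs, [0.7, 0.3, 0.0])
```

Varying the single scalar $p_j$ per server is what makes the base policies similar, which in turn gives their occupancy measures the overlap exploited by our bound.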

The three base policies and their costs are shown in Table 2. The cost is computed by running each policy for 10,000,000 iterations, starting from an empty system, and averaging the number of jobs per iteration.

Below we discuss combining the policies in the primal and dual spaces.

Solving the problem in the policy space: Compared to the previous problems, the number of states in this domain is unbounded. The cost of each policy must be computed by running the policy for a long time, e.g., 10,000,000 iterations. We solve the problem in the primal space using policy gradient, specifically a finite-differences method. We start from a random initial weight. At each iteration, we slightly perturb the weight in different directions, compute the cost for each perturbation, and approximate the gradient surface. The weight is then updated in the direction of the gradient. Since computing the cost is computationally expensive, finding the optimal mixing weight in this space is very tedious. After 50 iterations, the weight converges.

Solving the problem in the space of occupancy measures: To solve the problem in this space, we only need to run the base policies once to obtain their stationary distributions. From there, we do not need to run any other policy, and the optimal $\theta$ is obtained using the described algorithm. By mixing the policies in this space, we obtain a lower cost than that of any base policy. Although this cost is slightly worse than the cost of the policy obtained in the primal space, the gain in computation is significant.

Figure 9: Mean and standard deviation of the cost per iteration (step) in the dual space in the 8-queue problem.