## I Introduction

The goal of a policy evaluation algorithm is to estimate the performance that an agent will achieve when it follows a particular policy to interact with an environment, usually modeled as a Markov Decision Process (MDP). Policy evaluation algorithms are important because they are often key parts of more elaborate solution methods where the ultimate goal is to find an optimal policy for a particular task (one such example is the class of actor-critic algorithms; see [1] for a survey). This work studies the problem of policy evaluation in a fully decentralized setting. We consider two distinct scenarios.

In the first case, independent agents interact with independent instances of the same environment, following potentially different behavior policies, to collect data; the objective is for the agents to cooperate. In this scenario each agent only has knowledge of its own states and rewards, which are independent of the states and rewards of the other agents. Various practical situations give rise to this scenario. For example, consider a task that takes place in a large geographic area. The area can be divided into smaller sections, each of which can be explored by a separate agent. This framework is also useful for collective robot learning (see [2, 3, 4]).

The second scenario we consider is that of multi-agent reinforcement learning (MARL). In this case a group of agents interact simultaneously with a single MDP and with each other to attain a common goal. In this setting there is a unique global state known to all agents, and each agent receives distinct local rewards, which are unknown to the other agents. Some examples that fit into this framework are teams of robots working on a common task such as moving a bulky object, catching prey, or putting out a fire.

### I-A Related Work

Our contributions belong to the class of works that deal with policy evaluation, distributed reinforcement learning, and multi-agent reinforcement learning.

There exist a plethora of algorithms for policy evaluation such as GTD [5], TDC [6], GTD2 [6], GTD-MP/GTD2-MP [7], GTD($\lambda$) [8], and True Online GTD($\lambda$) [9]. The main feature of these algorithms is that they have guaranteed convergence (for small enough step-sizes) while combining off-policy learning and linear function approximation, and they are applicable to scenarios with streaming data. They are also applicable to cases with a finite amount of data. However, in this latter situation, they have the drawback that they converge at a sub-linear rate because a decaying step-size is necessary to guarantee convergence to the minimizer. In most current applications, policy evaluation is actually carried out after collecting a finite amount of data (one example is the recent success in the game of Go [10]). Therefore, deriving algorithms with better convergence properties for the finite sample case becomes necessary. By leveraging recent developments in variance-reduced algorithms, such as SVRG [11] and SAGA [12], the work [13] presented SVRG and SAGA-type algorithms for policy evaluation. These algorithms combine GTD2 with SVRG and SAGA, and they have the advantage over GTD2 that linear convergence is guaranteed for fixed data sets. Our work is related to [13] in that we too use a variance-reduced strategy; however, we use the AVRG strategy [14], which is more convenient for distributed implementations because of an important balanced gradient calculation feature.

Another interesting line of work in the context of distributed policy evaluation is [15], [16] and [17]. In [15] and [16] the authors introduce Diffusion GTD2 and ALG2, which are extensions of GTD2 and TDC to the fully decentralized case, respectively, while [17] is a shorter version of this work. These algorithms consider the situation where independent agents interact with independent instances of the same MDP. These strategies allow individual agents to converge through collaboration even in situations where convergence is infeasible without collaboration. The algorithm we introduce in this paper can be applied to this setting as well and has two main advantages over [15] and [16]. First, the proposed algorithm has guaranteed linear convergence, while the previous algorithms converge at a sub-linear rate. Second, while in some instances the solutions in [15] and [16] may be biased due to the use of the Mean Square Projected Bellman Error (MSPBE) as a surrogate cost (this point is further clarified in Section II), the proposed method allows better control of the bias term due to a modification in the cost function. We extend our previous work [17] in four main ways, which we discuss in the Contribution sub-section.

There is also a good body of work on multi-agent reinforcement learning (MARL). However, most works in this area focus on the policy optimization problem instead of the policy evaluation problem. The work that is closest to the current contribution is [18], which was pursued simultaneously and independently of our current work. The goal of the formulation in [18] is to derive a linearly-convergent distributed policy evaluation procedure for MARL. The work [18] does not consider the case where independent agents interact with independent MDPs. In the context of MARL, our proposed technique has three advantages in comparison to the approach from [18]. First, the memory requirement of the algorithm in [18] scales linearly with the amount of data (i.e., $O(N)$), while the memory requirement for the proposed method in this manuscript is $O(1)$, i.e., it is independent of the amount of data. Second, the algorithm of [18] does not include the use of eligibility traces, a feature that is often necessary to reach state-of-the-art performance (see, for example, [19, 20]). Finally, the algorithm from [18] requires all agents in the network to sample their data points in a synchronized manner, while the algorithm we propose in this work does not require this type of synchronization. Another paper that is related to the current work is [21], which considers the same distributed MARL setting as we do, although their contribution is different from ours. The main contribution in [21] is to extend the policy gradient theorem to the MARL case and derive two fully distributed actor-critic algorithms with linear function approximation for policy optimization. The connection between [21] and our work is that their actor-critic algorithms require a distributed policy evaluation algorithm. The algorithm they use is similar to [15] and [16] (they combine diffusion learning [22] with standard TD instead of GTD2 and TDC as was the case in [15] and [16]). The algorithm we present in this paper is compatible with their actor-critic schemes (i.e., it could be used as the critic), and hence could potentially be used to improve their performance and convergence rate.

Our work is also related to the literature on distributed optimization. Some notable works in this area include [22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32]. Consensus [23] and Diffusion [22] constitute some of the earliest work in this area. These methods can converge to a neighborhood around, but not exactly to, the global minimizer when constant step-sizes are employed [24, 32]. Another family of methods is based on the distributed alternating direction method of multipliers (ADMM) [25]. While these methods can converge linearly to the exact global minimizer, they are computationally more expensive than the previous methods since they need to solve an optimization sub-problem at each iteration. An exact first-order algorithm (EXTRA) was proposed in [26] for undirected networks to correct the bias suffered by consensus (this work was later extended to the case of directed networks [29]). EXTRA and DEXTRA [29] can also converge linearly to the global minimizer while maintaining the same computational efficiency as consensus and diffusion. Several other works employ instead a gradient tracking strategy [33, 27, 28]. These works guarantee linear convergence to the global minimizer even when they operate over time-varying networks. More recently, the Exact Diffusion algorithm [30, 31] has been introduced for static undirected graphs. This algorithm has a wider stability range than EXTRA (and hence exhibits faster convergence [31]), and for the case of static graphs is more communication efficient than gradient tracking methods since the gradient vectors are not shared among agents. Our current work is closely related to Exact Diffusion since our MARL model is based on static undirected graphs and our distributed strategy is derived in a similar manner to Exact Diffusion. We remark that there is a fundamental difference between the algorithm we present and the works in [22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32]: our algorithm finds the global saddle-point in a primal-dual formulation, while the cited works solve convex minimization problems.

### I-B Contribution

The contribution of this paper is twofold. In the first place, we introduce Fast Diffusion for Policy Evaluation (FDPE), a fully decentralized policy evaluation algorithm under which all agents have a guaranteed linear convergence rate to the minimizer of the global cost function. The algorithm is designed for the finite data set case and combines off-policy learning, eligibility traces, and linear function approximation. The eligibility traces are derived from the use of a more general cost function and they allow the control of the bias-variance trade-off we mentioned previously. In our distributed model, a fusion center is not required and communication is only allowed between immediate neighbors. The algorithm is applicable both to distributed situations with independent MDPs (i.e., independent states and rewards) and to MARL scenarios (i.e., global state and independent rewards). To the best of our knowledge, this is the first algorithm that combines all these characteristics. Our second contribution is a novel proof of convergence for the algorithm. This proof is challenging due to the combination of three factors: the distributed nature of the algorithm, the primal-dual structure of the cost function, and the use of stochastic biased gradients as opposed to exact gradients.

This work expands our short work [17] in four ways. First, in that work we used the MSPBE as a cost function, while in this work we employ a more general cost function. Second, we include the proof of convergence. Third, we show that our approach applies to MARL scenarios, while in our previous short paper we only discussed the distributed policy evaluation scenario with independent MDPs. Finally, in this paper we provide more extensive simulations.

### I-C Notation and Paper Outline

Matrices are denoted by upper case letters, while vectors are denoted with lower case. Random variables and sets are denoted with bold font and calligraphic font, respectively. $\rho(A)$ indicates the spectral radius of matrix $A$. $I_M$ is the identity matrix of size $M$. $\mathbb{E}_g$ is the expected value with respect to distribution $g$. $\|\cdot\|_D$ refers to the weighted matrix norm, where $D$ is a diagonal positive definite matrix. We use $\preceq$ to denote entry-wise inequality. $\text{col}\{v(n)\}_{n=i}^{j}$ is a column vector with elements $v(i)$ through $v(j)$ (where $v(j)$ is at the bottom). Finally, $\mathbb{R}$ and $\mathbb{N}$ represent the sets of real and natural numbers, respectively.

The outline of the paper is as follows. In the next section we introduce the framework under consideration. In Section III we derive our algorithm and provide a theorem that guarantees a linear convergence rate. In Section IV we discuss the MARL setting. Finally, we show simulation results in Section V.

## II Problem Setting

### II-A Markov Decision Processes and the Value Function

We consider the problem of policy evaluation within the traditional reinforcement learning framework. We recall that the objective of a policy evaluation algorithm is to estimate the performance of a known target policy using data generated by either the same policy (this case is referred to as on-policy) or a different policy that is also known (this case is referred to as off-policy). We model our setting as a finite Markov Decision Process (MDP), with an MDP defined by the tuple $(\mathcal{S},\mathcal{A},\mathcal{P},r,\gamma)$, where $\mathcal{S}$ is a set of states of size $S=|\mathcal{S}|$, $\mathcal{A}$ is a set of actions of size $A=|\mathcal{A}|$, $\mathcal{P}(s'|s,a)$ specifies the probability of transitioning to state $s'$ from state $s$ having taken action $a$, $r(s,a,s')$ is the reward function (i.e., the reward received when the agent transitions to state $s'$ from state $s$ having taken action $a$), and $\gamma\in[0,1)$ is the discount factor.

Even though in this paper we analyze the distributed scenario, in this section we motivate the cost function for the single agent case for clarity of exposition, and in the next section we generalize it to the distributed setting. We thus consider an agent that wishes to learn the value function, $v^{\pi}$, for a target policy of interest $\pi(a|s)$, while following a potentially different behavior policy $\phi(a|s)$. Here, the notation $\pi(a|s)$ specifies the probability of selecting action $a$ at state $s$. We recall that the value function for a target policy $\pi$, starting from some initial state $s\in\mathcal{S}$ at time $i$, is defined as follows:

$$v^{\pi}(s)=\mathbb{E}\Big(\sum_{t=i}^{\infty}\gamma^{t-i}\,r(\boldsymbol{s}_t,\boldsymbol{a}_t,\boldsymbol{s}_{t+1})\,\Big|\,\boldsymbol{s}_i=s\Big) \tag{1}$$

where $\boldsymbol{s}_t$ and $\boldsymbol{a}_t$ are the state and action at time $t$, respectively. Note that since we are dealing with a constant target policy $\pi$, the transition probabilities between states, which are given by $p^{\pi}_{s,s'}=\sum_{a\in\mathcal{A}}\pi(a|s)\,\mathcal{P}(s'|s,a)$, are fixed and hence the MDP reduces to a Markov Reward Process. In this case, the state evolution of the agent can be modeled with a Markov chain with transition matrix $P^{\pi}$ whose entries are given by $[P^{\pi}]_{ij}=p^{\pi}_{i,j}$.

###### Assumption 1.

We assume that the Markov chain induced by the behavior policy $\phi$ is aperiodic and irreducible. In view of the Perron-Frobenius Theorem [32], this condition guarantees that the Markov chain under $\phi$ will have a steady-state distribution in which every state has a strictly positive probability of visitation [32].

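Using the matrix $P^{\pi}$ and defining:

$$v^{\pi}=\text{col}\{v^{\pi}(s)\}_{s=1}^{S},\qquad r^{\pi}=\text{col}\{r^{\pi}(s)\}_{s=1}^{S},\qquad r^{\pi}(s)=\sum_{a\in\mathcal{A}}\pi(a|s)\sum_{s'\in\mathcal{S}}\mathcal{P}(s'|s,a)\,r(s,a,s') \tag{2}$$

we can rewrite (1) in matrix form as:

$$v^{\pi}=\sum_{n=0}^{\infty}(\gamma P^{\pi})^{n}r^{\pi}=(I-\gamma P^{\pi})^{-1}r^{\pi} \tag{3}$$

Note that the inverse $(I-\gamma P^{\pi})^{-1}$ always exists; this is because $\gamma<1$ and the matrix $P^{\pi}$ is right stochastic with spectral radius equal to one. We further note that $v^{\pi}$ also satisfies the following $n$-stage Bellman equation for any $n\geq 1$:

$$v^{\pi}=\sum_{j=0}^{n-1}(\gamma P^{\pi})^{j}r^{\pi}+(\gamma P^{\pi})^{n}v^{\pi} \tag{4}$$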
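To make the matrix form of the value function and the $n$-stage Bellman equation concrete, the following minimal sketch evaluates a toy three-state Markov reward process. The transition matrix, the rewards, and the helper names (`evaluate_policy`, `n_stage_rhs`) are invented for illustration, and the $n$-stage equation is assumed to take its standard form (discounted rewards over $n$ steps plus the discounted value after $n$ steps).

```python
# Toy 3-state Markov reward process; all numbers are made up for illustration.
GAMMA = 0.9
P = [[0.5, 0.5, 0.0],
     [0.1, 0.6, 0.3],
     [0.2, 0.0, 0.8]]   # right stochastic: every row sums to one
R = [1.0, 0.0, -1.0]    # expected one-step reward in each state


def matvec(M, x):
    """Matrix-vector product for a list-of-lists matrix."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]


def evaluate_policy(P, r, gamma, sweeps=2000):
    """Iterate v <- r + gamma * P v; the fixed point solves (I - gamma P) v = r."""
    v = [0.0] * len(r)
    for _ in range(sweeps):
        v = [rs + gamma * pv for rs, pv in zip(r, matvec(P, v))]
    return v


def n_stage_rhs(P, r, gamma, v, n):
    """Assumed standard form: sum_{j<n} (gamma P)^j r + (gamma P)^n v."""
    acc = [0.0] * len(r)
    reward_term = list(r)   # holds (gamma P)^j r
    value_term = list(v)    # holds (gamma P)^j v
    for _ in range(n):
        acc = [a + t for a, t in zip(acc, reward_term)]
        reward_term = [gamma * x for x in matvec(P, reward_term)]
        value_term = [gamma * x for x in matvec(P, value_term)]
    return [a + t for a, t in zip(acc, value_term)]


v = evaluate_policy(P, R, GAMMA)
residuals = {}
for n in (1, 2, 5):
    rhs = n_stage_rhs(P, R, GAMMA, v, n)
    residuals[n] = max(abs(a - b) for a, b in zip(v, rhs))
print(residuals)  # each entry should be at round-off level
```

The fixed-point iteration is used here only to keep the example dependency-free; for a small state space a direct linear solve would do the same job.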

### II-B Definition of cost function

We are interested in applications where the state space is too large (or even infinite) and hence some form of function approximation is necessary to reduce the dimensionality of the parameters to be learned. As we anticipated in the introduction, in this work we use linear approximations. (We choose linear function approximation not just because it is mathematically convenient, since with this approximation our cost function is strongly convex, but because there are theoretical justifications for this choice. In the first place, in some domains, for example Linear Quadratic Regulator problems, the value function is a linear function of known features. Secondly, when policy evaluation is used to estimate the gradient of a policy in a policy gradient algorithm, the policy gradient theorem [34] assures that the exact gradient can be obtained even when a linear function is used to estimate $v^{\pi}$.) More formally, for every state $s\in\mathcal{S}$, we approximate $v^{\pi}(s)\approx x_s^{T}\theta^{\star}$, where $x_s\in\mathbb{R}^{M}$ is a feature vector corresponding to state $s$ and $\theta^{\star}\in\mathbb{R}^{M}$ is a parameter vector such that $M\ll S$. Defining $X=[x_1,x_2,\ldots,x_S]^{T}\in\mathbb{R}^{S\times M}$, we can write a vector approximation for $v^{\pi}$ as $v^{\pi}\approx X\theta^{\star}$. We assume that $X$ is a full rank matrix; this is not a restrictive assumption since the feature matrix is a design choice. It is important to note, though, that the true $v^{\pi}$ need not be in the range space of $X$. If $v^{\pi}$ is in the range space of $X$, an equality of the form $v^{\pi}=X\theta^{\star}$ holds exactly and the value of $\theta^{\star}$ is unique (because $X$ is full rank) and given by $\theta^{\star}=(X^{T}X)^{-1}X^{T}v^{\pi}$. For the more general case where $v^{\pi}$ is not in the range space of $X$, one sensible choice for $\theta^{\star}$ is:

$$\theta^{\star}=\arg\min_{\theta}\|X\theta-v^{\pi}\|_{D}^{2}=(X^{T}DX)^{-1}X^{T}Dv^{\pi} \tag{5}$$

where $D$ is some positive definite weighting matrix to be defined later. Although (5) is a reasonable cost to define $\theta^{\star}$, it is not useful to derive a learning algorithm since $v^{\pi}$ is not known beforehand. As a result, for the purposes of deriving a learning algorithm, another cost (one whose gradients can be sampled) needs to be used as a surrogate for (5). One popular choice for the surrogate cost is the MSPBE (see, e.g., [6, 7, 15, 16]); this cost has the inconvenience that its minimizer is different from (5) and some bias is incurred [6]. In order to control the magnitude of the bias, we shall derive a generalization of the MSPBE which we refer to as the truncated $\lambda$-weighted Mean Square Projected Bellman Error (H-MSPBE). To introduce this cost, we start by writing a convex combination of equation (4) with different $n$'s ranging from 1 to $H$ (we choose $H$ to be a finite amount instead of $H\to\infty$ because in this paper we deal with finite data instead of streaming data) as follows:

(6) | ||||

(7) |

where we introduced:

(8) | |||

(9) | |||

(10) |

and is a parameter that controls the bias.
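As an illustration of such a convex combination, the sketch below uses the weights of the standard truncated $\lambda$-return: $(1-\lambda)\lambda^{h-1}$ on the $h$-stage equation for $h<H$, and $\lambda^{H-1}$ on the $H$-stage equation. This weighting is an assumption made for the example and is not taken from the expressions above; the helper names are hypothetical. The code checks that the weights add up to one and that the same mixture of powers of a right stochastic matrix still has rows that sum to one.

```python
def truncated_lambda_weights(lam, horizon):
    """Assumed weights: (1 - lam) * lam**(h-1) for h < horizon, lam**(horizon-1) for h = horizon."""
    if horizon < 1 or not 0.0 <= lam <= 1.0:
        raise ValueError("need horizon >= 1 and 0 <= lam <= 1")
    weights = [(1.0 - lam) * lam ** (h - 1) for h in range(1, horizon)]
    weights.append(lam ** (horizon - 1))
    return weights


def matmul(A, B):
    """Product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def mix_of_powers(P, weights):
    """Return sum_h weights[h-1] * P**h for h = 1..len(weights)."""
    n = len(P)
    mix = [[0.0] * n for _ in range(n)]
    power = [row[:] for row in P]  # starts at P**1
    for w in weights:
        mix = [[m + w * p for m, p in zip(mr, pr)] for mr, pr in zip(mix, power)]
        power = matmul(power, P)
    return mix


P = [[0.5, 0.5],
     [0.2, 0.8]]          # toy right stochastic matrix
weights = truncated_lambda_weights(0.7, 5)
mixed = mix_of_powers(P, weights)
row_sums = [sum(row) for row in mixed]
print(sum(weights), row_sums)
```

Under this weighting, `lam = 0` places all the weight on the one-stage equation and `lam = 1` places all of it on the $H$-stage equation, which is the usual way such a parameter interpolates between short and long look-ahead.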

###### Remark 1.

Note that .

###### Remark 2.

is a right stochastic matrix because it is defined as a convex combination of powers of $P^{\pi}$ (which are right stochastic matrices).

Note that from now on, to simplify the notation, we refer to , and as , and , respectively. Replacing in (7) by its linear approximation we get:

(11) |

Projecting the right hand side onto the range space of so that an equality holds, we arrive at:

(12) |

where is the weighted projection matrix onto the space spanned by , (i.e., ). We can now use (12) to define our surrogate cost function:

(13) |

where the first term on the right hand side is the H-MSPBE, is a regularization parameter, is a symmetric positive-definite weighting matrix, and reflects prior knowledge about . Two sensible choices for are and , which reflect previous knowledge about or the value function , respectively. The regularization term can be particularly useful when the policy evaluation algorithm is used as part of a policy gradient loop (since subsequent policies are expected to have similar value functions and the value of learned in one iteration can be used as in the next iteration) like, for example, in [35]. One main advantage of using the proposed cost (13) instead of the more traditional MSPBE cost is that the magnitude of the bias between its minimizer (denoted as ) and the desired solution can be controlled through and . To see this, we first rewrite in the following equivalent form:

(14) |

Next, we introduce the quantities:

(15) |

###### Remark 3.

is an invertible matrix.

###### Proof.

Due to remarks 1 and 2 we have that the spectral radius of is strictly smaller than one, and hence is invertible. The result follows by recalling that and are full rank matrices.

The minimizer of (14) is given by:

(16) |

where exists and hence is well defined. This is because is positive-definite and is invertible. Also note that when , and , reduces to (5) and hence the bias is removed. We do not fix because while the bias diminishes as , the estimate of the value function approaches a Monte Carlo estimate and hence the variance of the estimate increases. Note from (7) and (14) that in the particular case where the value function lies in the range space of (and there is no regularization, i.e., ) there is no bias (i.e., ) independently of the values of and . This observation shows that when there is bias between and , the bias arises from the fact that the value function being estimated does not lie in the range space of . In practice, offers a valuable bias-variance trade-off, and its optimal value depends on each particular problem. Note that since we are dealing with finite data samples, in practice, will always be finite. Therefore, eliminating the bias completely is not possible (even when ). The exact expression for the bias is obtained by subtracting (16) from (5). However, this expression does not easily indicate how the bias behaves as a function of , and . Lemma 1 provides a simplified expression.

###### Lemma 1.

The bias is approximated by:

(17) |

where

(18) |

for some constants , and .

###### Proof.

See Appendix A.

In the statement of the lemma, the notation is the indicator function. Note that expression (17) agrees with our previous discussion and with several intuitive facts. First, due to the indicator function, if lies in the range space of there is no bias independently of the values of , and (as long as ). Second, if , the bias is independent of (because when all terms that depend on are zeroed). Third, if then the bias is independent of the value of (because when all terms that depend on are zeroed). Furthermore, the expression is monotone decreasing in (for the case where ) which agrees with the intuition that the bias diminishes as increases. Finally, we note that the bias is minimized for and in this case there is still a bias, which if , is on the order of . This explicitly shows the effect on the bias of having a finite . The following lemma describes the behavior of the variance.

###### Lemma 2.

The variance of the estimate is approximated by:

(19) |

for some constants and .

###### Proof.

See Appendix B.

Note that (19) is monotone increasing as a function of (for ) and as a function of (for ). Adding expressions (17) and (19) shows explicitly the bias-variance trade-off handled by the parameter $\lambda$ and the finite horizon $H$. We remark that the idea of an eligibility trace parameter as a bias-variance trade-off is not novel to this paper and has been previously used in algorithms such as TD($\lambda$) [36], TD($\lambda$) with replacing traces [37], GTD($\lambda$) [8] and True Online GTD($\lambda$) [9]. Note, however, that these works derive algorithms for the online case (as opposed to the batch setting) using different cost functions. Therefore, the expressions we present in this paper are different from those in previous works, which is why we derive them in detail. Moreover, the expressions corresponding to Lemmas 1 and 2 that quantify such bias-variance trade-off for are new and specific to our batch model.

At this point, all that is left to fully define the surrogate cost function is to choose the positive definite matrix . The algorithm that we derive in this paper is of the stochastic gradient type. With this in mind, we shall choose such that the quantities , and turn out to be expectations that can be sampled from data realizations. Thus, we start by setting to be a diagonal matrix with positive entries; we collect these entries into a vector and write instead of , i.e., . We shall select to correspond to the steady-state distribution of the Markov chain induced by the behavior policy, . This choice for not only is convenient in terms of algorithm derivation, it is also physically meaningful; since with this choice for , states that are visited more often are weighted more heavily while states which are rarely visited receive lower weights. As a consequence of Assumption 1 and the Perron-Frobenius Theorem [32], the vector is guaranteed to exist and all its entries will be strictly positive and add up to one. Moreover, this vector satisfies where is the transition probability matrix defined in a manner similar to .
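The steady-state distribution used as the weighting can be illustrated numerically. The sketch below is a minimal example with an invented three-state chain standing in for the one induced by the behavior policy; `stationary_distribution` is a hypothetical helper that runs a plain power iteration and then checks the fixed-point property (the distribution is unchanged by one more transition step).

```python
def stationary_distribution(P, iters=2000):
    """Power iteration d <- d P, started from the uniform distribution."""
    n = len(P)
    d = [1.0 / n] * n
    for _ in range(iters):
        d = [sum(d[i] * P[i][j] for i in range(n)) for j in range(n)]
    return d


# Toy aperiodic, irreducible chain (numbers invented for illustration).
P_behavior = [[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.0, 0.8]]

d = stationary_distribution(P_behavior)
d_next = [sum(d[i] * P_behavior[i][j] for i in range(3)) for j in range(3)]
print(d)
```

In this toy chain the third state is the one the agent stays in longest, so it receives the largest weight, matching the remark that frequently visited states are weighted more heavily.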

###### Lemma 3.

Setting , the matrices , and can be written as expectations as follows:

(20a) | |||

(20b) |

where, with a little abuse of notation, we defined and , where is the state visited at time .

###### Proof.

See Appendix C.

### II-C Optimization problem

Since the signal distributions are not known beforehand and we are working with a finite amount of data, say, of size , we need to rely on empirical approximations to estimate the expectations in . We thus let , , and denote estimates for , , and from data and replace them in (14) to define the following empirical optimization problem:

(21) |

Note that whether an empirical estimate for is required depends on the choice for . For instance, if then obviously no estimate is needed. However, if then an empirical estimate is needed, (i.e., ).

To fully characterize the empirical optimization problem, expressions for the empirical estimates still need to be provided. The following lemma provides the necessary estimates.

###### Lemma 4.

For the general off-policy case, the following expressions provide unbiased estimates for , and :

(22a) | |||

(22b) | |||

(22c) |

where

(23) | |||

(24) |

###### Proof.

See Appendix D.

Note that is the importance sample weight corresponding to the trajectory that started at some state and took steps before arriving at some other state . Note that even if we have transitions, we can only use training samples because every estimate of and looks steps into the future.
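The importance sampling weight of a multi-step trajectory is, in its usual form, the product of per-step ratios between the target and behavior action probabilities. The sketch below assumes that standard product form, which may differ in indexing from the paper's exact expression; `importance_weight` and the tabular policies `target` and `behavior` are illustrative names and numbers.

```python
def importance_weight(states, actions, target, behavior):
    """Product over visited (state, action) pairs of target[s][a] / behavior[s][a]."""
    if len(states) != len(actions):
        raise ValueError("states and actions must have the same length")
    weight = 1.0
    for s, a in zip(states, actions):
        weight *= target[s][a] / behavior[s][a]
    return weight


# Tabular policies: policy[s][a] is the probability of action a in state s.
target = [[0.2, 0.8],
          [0.5, 0.5]]
behavior = [[0.5, 0.5],
            [0.25, 0.75]]

# A two-step trajectory: action 1 in state 0, then action 0 in state 1.
rho = importance_weight([0, 1], [1, 0], target, behavior)
rho_on_policy = importance_weight([0, 1], [1, 0], target, target)
print(rho, rho_on_policy)
```

When the behavior and target policies coincide every ratio equals one, so the weight is one and the on-policy case is recovered.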

## III Distributed Policy Evaluation

In this section we present the distributed framework and use (21) to derive Fast Diffusion for Policy Evaluation (FDPE). The purpose of this algorithm is to deal with situations where data is dispersed among a number of nodes and the goal is to solve the policy evaluation problem in a fully decentralized manner.

### III-A Distributed Setting

We consider a situation in which there are agents that wish to evaluate a target policy for a common MDP. Each agent has samples, which are collected following its own behavior policy (with steady state distribution matrix ). Note that the behavior policies can be potentially different from each other. The goal for all agents is to estimate the value function of the target policy leveraging all the data from all other agents in a fully decentralized manner.

To do this, they form a network in which each agent can only communicate with other agents in its immediate neighborhood. The network is represented by a graph in which the nodes and edges represent the agents and communication links, respectively. The topology of the graph is defined by a combination matrix whose -th entry (i.e., ) is a scalar with which agent scales information arriving from agent . If agent is not in the neighborhood of agent , then .

###### Assumption 2.

We assume that the network is strongly connected. This implies that there is at least one path from any node to any other node and that at least one node has a self-loop (i.e. that at least one agent uses its own information). We further assume that the combination matrix is symmetric and doubly-stochastic.

###### Remark 4.

In view of the Perron-Frobenius Theorem, Assumption 2 implies that the matrix $L$ can be diagonalized as , where one element of is equal to 1 and its corresponding eigenvector is given by (where is the all ones vector). The remaining eigenvalues of lie strictly inside the unit circle.

A combination matrix satisfying Assumption 2 can be generated using the Laplacian rule, the maximum-degree rule, or the Metropolis rule (see Table 14.1 in [32]).
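As a concrete illustration of the Metropolis rule, the sketch below builds a combination matrix for an undirected graph using a common form of the rule: two neighbors share the weight one over (one plus the larger of their degrees), and each agent keeps the remaining mass as its self-weight. The exact convention in Table 14.1 of [32] may be stated differently; `metropolis_weights` and the adjacency format are assumptions made for this example.

```python
def metropolis_weights(neighbors):
    """neighbors[k] is the set of agents adjacent to agent k (excluding k itself)."""
    n = len(neighbors)
    L = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for j in neighbors[k]:
            # Symmetric by construction: the weight depends only on the pair (k, j).
            L[k][j] = 1.0 / (1 + max(len(neighbors[k]), len(neighbors[j])))
        # The diagonal entry is still zero here, so this is the leftover mass.
        L[k][k] = 1.0 - sum(L[k])
    return L


# Path graph 0 - 1 - 2 (an undirected, connected toy network).
path = [{1}, {0, 2}, {1}]
L = metropolis_weights(path)
for row in L:
    print(row)
```

Because the matrix is symmetric and its rows sum to one, its columns sum to one as well, so it is doubly stochastic; the positive self-weights provide the self-loops mentioned in Assumption 2.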

### III-B Algorithm Derivation

Mathematically, the goal for all agents is to minimize the following aggregate cost:

(25) |

where the purpose of the nonnegative coefficients is to scale the costs of the different agents; this is useful since the costs of agents whose behavior policy is closer to the target policy might be assigned higher weights. For (25), we define the matrices and to be:

(26) |

so that equation (25) becomes:

(27) |

Note that (27) has the same form as (14); the only difference is that in (27) the matrices and are defined by linear combinations of the individual matrices and , respectively. Matrices are therefore not required to be positive definite, only is required to be a positive definite diagonal matrix. Since the matrices are given by the steady-state probabilities of the behavior policies, this implies that each agent does not need to explore the entire state-space by itself, but rather all agents collectively need to explore the state-space. This is one of the advantages of our multi-agent setting. In practice, this could be useful since the agents can divide the entire state-space into sections, each of which can be explored by a different agent in parallel.

###### Assumption 3.

We assume that the behavior policies are such that the aggregated steady state probabilities (i.e., ) are strictly positive for every state.
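The following small sketch illustrates Assumption 3 with invented numbers: two agents each explore only half of a four-state space, so neither steady-state distribution is strictly positive on its own, yet their weighted aggregate is. The function name `aggregate` and the scaling weights are hypothetical.

```python
def aggregate(weights, distributions):
    """Weighted sum of per-agent steady-state distributions, state by state."""
    n_states = len(distributions[0])
    return [sum(w * d[s] for w, d in zip(weights, distributions))
            for s in range(n_states)]


# Agent 0 only visits states 0 and 1; agent 1 only visits states 2 and 3.
d_agents = [[0.5, 0.5, 0.0, 0.0],
            [0.0, 0.0, 0.4, 0.6]]
scaling = [0.5, 0.5]   # nonnegative coefficients that scale the agents' costs

d_total = aggregate(scaling, d_agents)
print(d_total)
```

Each individual weighting has zero entries, but the aggregate does not, which is all the assumption asks for.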

The empirical problem for the multi-agent case is then given by:

(28) |

(29a) | |||

(29b) |

###### Assumption 4.

We assume that and are positive definite and invertible, respectively.

It is easy to show that Assumption 4 is equivalent to assuming that each state has been visited at least once while collecting data. Intuitively, this assumption is necessary for any policy evaluation algorithm since one cannot expect to estimate the value function of states that have never been visited. Since we are interested in deriving a distributed algorithm we define local copies and rewrite (28) equivalently in the form:

(30) |

The above formulation, although correct, is not useful because the gradient with respect to any individual depends on all the data from all agents, and we want to derive an algorithm that only relies on local data. To circumvent this inconvenience, we reformulate (28) into an equivalent problem. To this end, we note that every quadratic function can be expressed in terms of its conjugate function as:

(31) |

Therefore, expression (28) can equivalently be rewritten as:

(32) |
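The conjugate representation of a quadratic can be checked numerically. The sketch below uses a diagonal positive definite weighting for simplicity and the sign convention in which the quadratic equals the maximum over the dual variable of a linear term minus a quadratic term; the paper's expression may use a different sign or scaling, so this is only an illustration of the identity. The names `quadratic` and `dual_objective` are made up for the example.

```python
def quadratic(u, c):
    """0.5 * u^T C^{-1} u for a diagonal C with positive diagonal entries c."""
    return 0.5 * sum(ui * ui / ci for ui, ci in zip(u, c))


def dual_objective(u, c, w):
    """u^T w - 0.5 * w^T C w, a concave function of the dual variable w."""
    return sum(ui * wi - 0.5 * ci * wi * wi for ui, ci, wi in zip(u, c, w))


u = [1.5, -2.0, 0.25]   # stands in for a residual vector
c = [2.0, 0.5, 1.0]     # diagonal of the positive definite weighting

w_star = [ui / ci for ui, ci in zip(u, c)]        # maximizer: C^{-1} u
gap = quadratic(u, c) - dual_objective(u, c, w_star)
w_other = [wi + 0.3 for wi in w_star]             # any other dual point does worse
print(gap, dual_objective(u, c, w_other) < quadratic(u, c))
```

Replacing the quadratic by this maximization is what turns the minimization into a saddle-point problem whose objective is linear in the data-dependent terms, which is the property exploited when splitting the cost across agents.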

###### Remark 5.

The saddle-point of (32) is given by

(33) |

###### Proof.

and are obtained by equating the gradient of (32) to zero and solving for and .

Defining local copies for the primal and dual variables we can write:

(34) |

Now to derive a learning algorithm we rewrite (34) in an equivalent, more convenient manner (the following steps can be seen as an extension to the primal-dual case of similar steps used in [30]). We start by defining the following network-wide quantities:

(35) |

We remind the reader that and were defined in Remark 4. We further clarify that is the entrywise square root of the positive definite diagonal matrix . The notation refers to stacking vectors from to into one larger vector. Moreover, is a block diagonal matrix with matrices as its diagonal elements.

###### Remark 6.

Due to Remark 4, it follows that the bases of the null-spaces of and are given by and , respectively. Therefore, we get:

(36a) | |||

(36b) |

Using (36) we transform (34) into the following equivalent formulation:

(37) |

We next introduce the constraints into the cost by using Lagrangian and extended Lagrangian terms as follows:

(38) |

where and are the dual variables of and , respectively. Now we perform incremental gradient ascent on and gradient descent on to obtain the following updates:

(39a) | ||||

(39b) | ||||

(39c) | ||||

(39d) |

where in (39b) we used . Combining (39b) and (39c) we get:

(40a) | ||||

(40b) | ||||

(40c) |

Using (40b) to calculate we get:

(41) |

Substituting (40c) into (41) we get:

(42) |

which we rewrite as:

(43a) | ||||

(43b) | ||||

(43c) |

Notice that steps (39)-(43) allow us to get rid of . Performing incremental gradient descent on and gradient ascent on and following equivalently (39)-(43) we get:

(44a) | ||||

(44b) | ||||

(44c) |

Combining (43) and (44) and defining (and similarly for ) we arrive at Algorithm 1, which is a fully distributed algorithm.
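For orientation only, the sketch below runs the Exact Diffusion recursion of [30, 31] (adapt, correct, combine) on a toy minimization problem. It is not Algorithm 1: it handles a plain minimization with exact gradients rather than a primal-dual problem with variance-reduced stochastic gradients, and the scalar costs, the step size, and the ring network are all invented for the example.

```python
def exact_diffusion(targets, A, mu=0.5, iters=200):
    """Agent k holds J_k(w) = 0.5 * (w - targets[k])**2.

    The minimizer of the sum of the J_k is the average of the targets.
    A is a symmetric doubly stochastic combination matrix.
    """
    n = len(targets)
    # Exact Diffusion combines with the averaged matrix (A + I) / 2.
    A_bar = [[0.5 * (A[k][j] + (1.0 if k == j else 0.0)) for j in range(n)]
             for k in range(n)]
    w = [0.0] * n
    psi_prev = list(w)  # initialization psi = w makes the first correction vanish
    for _ in range(iters):
        # Adapt: local gradient step on each agent's own cost.
        psi = [w[k] - mu * (w[k] - targets[k]) for k in range(n)]
        # Correct: remove the bias that plain diffusion would retain.
        phi = [psi[k] + w[k] - psi_prev[k] for k in range(n)]
        # Combine: average the corrected iterates over the neighborhood.
        w = [sum(A_bar[k][j] * phi[j] for j in range(n)) for k in range(n)]
        psi_prev = psi
    return w


third = 1.0 / 3.0
ring = [[third, third, 0.0, third],
        [third, third, third, 0.0],
        [0.0, third, third, third],
        [third, 0.0, third, third]]   # Metropolis weights on a 4-agent ring
targets = [1.0, 2.0, 3.0, 6.0]

estimates = exact_diffusion(targets, ring)
print(estimates)  # every agent should end close to the average of the targets
```

With a constant step size, every agent reaches the exact network-wide minimizer using only exchanges with immediate neighbors, which is the behavior the derivation above aims to carry over to the saddle-point setting.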

###### Theorem 1.

Note that the above condition can always be satisfied by making sufficiently large.

Algorithm 1: Processing steps at node

Initialize: and arbitrarily and let .

For :

(46a) | ||||

(46b) | ||||

(46c) |

###### Proof.

We start by setting and introducing the following definitions:

(47a) | |||

(47b) | |||

(47c) | |||

(47d) |

With these definitions, we write the update equations of Algorithm 1 in the form of equations (40) (for both the primal and dual) in the following first order network-wide recursion:

(48a) | ||||

(48b) |

for which . Note that the variable is a network-wide variable that includes both and .

###### Lemma 5.

###### Proof.

See Appendix E.

Subtracting and from (48) and defining the error quantities and we get:

(50) |

Multiplying by the inverse of the leftmost matrix we get:

(51) |

###### Lemma 6.

Through a coordinate transformation applied to (51) we obtain the following error recursion:

(52) |

where are some constant matrices, , and is a diagonal matrix with . Furthermore and satisfy:

(53) |

###### Proof.

See Appendix F.
