The goal of a policy evaluation algorithm is to estimate the performance that an agent will achieve when it follows a particular policy to interact with an environment, usually modeled as a Markov Decision Process (MDP). Policy evaluation algorithms are important because they are often key parts of more elaborate solution methods where the ultimate goal is to find an optimal policy for a particular task (one such example is the class of actor-critic algorithms – see for a survey). This work studies the problem of policy evaluation in a fully decentralized setting. We consider two distinct scenarios.
In the first case, independent agents interact with independent instances of the same environment following potentially different behavior policies to collect data the objective is for the agents to cooperate. In this scenario each agent only has knowledge of its own states and rewards, which are independent of the states and the rewards of the other agents. Various practical situations give rise to this scenario, for example, consider a task that takes place in a large geographic area. The area can be divided into smaller sections, each of which can be explored by a separate agent. This framework is also useful for collective robot learning (see, for example, ,  and ).
The second scenario we consider is that of multi-agent reinforcement learning (MARL). In this case a group of agents interact simultaneously with a unique MDP and with each other to attain a common goal. In this setting there is a unique global state known to all agents and each agent receives distinct local rewards, which are unknown to the other agents. Some examples that fit into this framework are teams of robots working on a common task such as moving a bulky object, try to catch a prey, or putting out a fire.
I-a Related Work
Our contributions belong to the class of works that deal with policy evaluation, distributed reinforcement learning, and multi-agent reinforcement learning.
There exist a plethora of algorithms for policy evaluation such as GTD , TDC , GTD2 , GTD-MP/GTD2-MP , GTD() , and True Online GTD() . The main feature of these algorithms is that they have guaranteed convergence (for small enough step-sizes) while combining off-policy learning and linear function approximation; and are applicable to scenarios with streaming data. They are also applicable to cases with a finite amount of data. However, in this latter situation, they have the drawback that they converge at a sub-linear rate because a decaying step-size is necessary to guarantee convergence to the minimizer. In most current applications, policy evaluation is actually carried out after collecting a finite amount of data (one example is the recent success in the game of GO ). Therefore, deriving algorithms with better convergence properties for the finite sample case becomes necessary. By leveraging recent developments in variance-reduced algorithms, such as SVRG  and SAGA , the work  presented SVRG and SAGA-type algorithms for policy evaluation. These algorithms combine GTD2 with SVRG and SAGA and they have the advantage over GTD2 in that linear convergence is guaranteed for fixed data sets. Our work is related to  in that we too use a variance-reduced strategy, however we use the AVRG strategy  which is more convenient for distributed implementations because of an important balanced gradient calculation feature.
Another interesting line of work in the context of distributed policy evaluation is ,  and . In  and  the authors introduce Diffusion GTD2 and ALG2; which are extensions of GTD2 and TDC to the fully decentralized case, respectively. While  is a shorter version of this work. These algorithms consider the situation where independent agents interact with independent instances of the same MDP. These strategies allow individual agents to converge through collaboration even in situations where convergence is infeasible without collaboration. The algorithm we introduce in this paper can be applied to this setting as well and has two main advantages over  and . First, the proposed algorithm has guaranteed linear convergence, while the previous algorithms converge at a sub-linear rate. Second, while in some instances, the solutions in  and  may be biased, the proposed method allows better control of the bias term due to a modification in the cost function. We extend our previous work  in four main ways which we discuss in the Contribution sub-section.
There is also a good body of work on multi-agent reinforcement learning (MARL). However, most works in this area focus on the policy optimization problem instead of the policy evaluation problem. The works that are most related to the current contribution are  and . In , the authors extend the policy gradient theorem to the multi-agent case and derive two actor-critic algorithms for policy optimization that employ a distributed policy evaluation strategy. Their distributed policy evaluation algorithm is similar to  and  (they combine diffusion learning  with standard TD instead of GTD2 and TDC as was the case in  and ). As a result, the algorithm converges sublinearly. Furthermore, the algorithm only works for the on-policy case and is not suitable to off-policy learning. The algorithm we present in this paper is compatible with these actor-critic strategies, and hence can be used to augment their performance. Another related work is , which was pursued simultaneously and independently of our current work. This paper also derives an algorithm for distributed policy evaluation, which is proved to converge linearly. Our proposed technique has at least five advantages in comparison to the approach from . First, the work  only considers the scenario in which the state is global and assumed known by all agents. In other words, the algorithm cannot be used in situations where agents have different states that are only known locally. Second, the memory requirement of the algorithm in  scales linearly with the amount of data (i.e., ), while the memory requirement for the proposed method is ), i.e., it is independent of the amount of data. This is because the algorithm in  relies on IAG  and SAG  while the proposed method relies on AVRG . Third, the proposed method works both on-policy and off-policy, while the algorithm of  is designed solely for on-policy operation. Fourth, we employ a more general cost function that provides a valuable bias-variance trade-off and. Finally, the algorithm from  requires all agents in the network to sample their data points in a synchronized manner (i.e., all agents need to sample the data points corresponding to the same time), while the algorithm we propose in this work does not require this type of synchronization.
The contribution of this paper is twofold. In the first place, we introduce Fast Diffusion for Policy Evaluation (FDPE), a fully decentralized policy evaluation algorithm under which all agents have a guaranteed linear convergence rate to the minimizer of the global cost function. The algorithm is designed for the finite data set case and combines off-policy learning, eligibility traces, and linear function approximation. The eligibility traces are derived from the use of a more general cost function and they allow the control of the bias-variance trade-off we mentioned previously. In our distributed model, a fusion center is not required and communication is only allowed between immediate neighbors. The algorithm is applicable both to distributed situations with independent MDPs (i.e., independent states and rewards) and to MARL scenarios (i.e., global state and independent rewards). To the best of our knowledge, this is the first algorithm that combines all these characteristics. Our second contribution is a novel proof of convergence for the algorithm.
This work expands our short work  in four ways. In the first place, in that work we used the Mean Square Projected Bellman Error (MSPBE) as a cost function, while in this work we employ a more general cost function. Second, we include the proof of convergence. Third, we show that our approach applies to MARL scenarios, while in our previous short paper we only discussed the distributed policy evaluation scenario with independent MDPs. Finally in this paper we provide more extensive simulations.
I-C Notation and Paper Outline
The outline of the paper is as follows. In the next section we introduce the framework under consideration. In Section III we derive our algorithm and provide a theorem that guarantees linear convergence rate. In Section IV we discuss the MARL setting. We show simulation results in Section V, and finally provide concluding remarks in Section VI.
Ii Problem Setting
Ii-a Markov Decision Processes and the Value Function
We consider the problem of policy evaluation within the traditional reinforcement learning framework. We model this setting as a finite Markov Decision Process (MDP), with an MDP defined by the tuple (,,,,), where is a set of states of size , is a set of actions of size ,
specifies the probability of transitioning to statefrom state having taken action , is the reward function when the agent transitions to state from state having taken action ), and is the discount factor.
Even though in this paper we analyze the distributed scenario, in this section we motivate the cost function for the single agent case for clarity of exposition and in the next section we generalize it to the distributed setting. We thus consider an agent that wishes to learn the value function, , for a target policy of interest while following a potentially different behavior policy, . Here, the notation specifies the probability of selecting action at state . We recall that the value function for a target policy , starting from some initial state at time , is defined as follows:
where and are the state and action at time , respectively. Note that since we are dealing with a constant target policy , the transition probabilities between states, which are given by
, are fixed and hence the MDP reduces to a Markov Rewards Process (MRP). In this case, the state evolution of the agent can be modeled with a Markov Chain with transition matrixwhose entries are given by .
We assume that the Markov Chain induced by the behavior policy is aperiodic and irreducible. In view of the Perron-Frobenius Theorem , this condition guarantees that the Markov Chain under will have a steady-state distribution in which every state has a strictly positive probability of visitation .
Using the matrix and defining:
we can rewrite (1) in matrix form as:
Note that the inverse always exists; this is because and the matrix is right stochastic with spectral radius equal to one. We further note that the value function vector also satisfies the following stage Bellman equation:
for all .
Ii-B Definition of cost function
We are interested in applications where the state space is too large (or even infinite) and hence some form of function approximation is necessary to reduce the dimensionality of the parameters to be learned. As we anticipated in the introduction, in this work we use linear approximations111We choose linear function approximation, not just because it is mathematically convenient (since with this approximation our cost function turns out to be strongly convex) but because there are theoretical justifications for this choice. In the first place, in some domains (for example Linear Quadratic Regulator problems) the value function is a linear function of known features. Secondly, when policy evaluation is used to estimate the gradient of a policy in a policy gradient algorithm, the policy gradient theorem  assures that the exact gradient can be obtained even when a linear function is used to estimate the value function.. More formally, for every state , we approximate where is a feature vector corresponding to state and is a parameter vector such that . Defining , we can write a vector approximation for as . We assume that is a full rank matrix; this is not a restrictive assumption since the feature matrix is a design choice. It is important to note though that the true need not be in the rangespace of . If is in the rangespace of , an equality of the form holds exactly and the value of is unique (because is full rank) and given by . For the more general case where is not in the rangespace of , then one sensible choice for is:
where is some positive definite weighting matrix to be defined later. Note that always exists and is positive definite because is positive definite (due to the fact that is full rank and is positive definite). Although (7) is a reasonable cost to define , it is not useful to derive a learning algorithm since is not known beforehand. As a result, for the purposes of deriving a learning algorithm, another cost (one whose gradients can be sampled) needs to be used as a surrogate for (7). One popular choice for the surrogate cost is the Mean Square Projected Bellman Error (MSPBE) (see, e.g., , ,  and ); this cost has the inconvenience that its minimizer is different from (7) and some bias given by is incurred. In order to control the magnitude of the bias, we shall derive a generalization of the MSPBE which we refer to as truncated -weighted Mean Square Projected Bellman Error (H-MSPBE). To introduce this cost, we start by writing a convex combination of equation (6) with different ’s ranging from 1 to (we choose to be a finite amount instead of because in this paper we deal with finite data instead of streaming data) as follows:
where we introduced:
and is a parameter that controls the bias.
Note that .
is a right stochastic matrix because it is defined as a convex combination of powers of
is a right stochastic matrix because it is defined as a convex combination of powers of(which are right stochastic matrices).
Replacing in (8) by its linear approximation we get:
Projecting the right hand side onto the range space of so that an equality holds, we arrive at the truncated -weighted Projected Bellman Equation:
where is the weighted projection matrix onto the space spanned by , i.e.,
We can now use (13) to define our surrogate cost function:
where the first term on the right hand side is the H-MSPBE, is a regularization parameter, is a symmetric positive-definite weighting matrix, and reflects prior knowledge about . Two sensible choices for are and , which reflect previous knowledge about or the value function , respectively. The regularization term can be particularly useful when the policy evaluation algorithm is used as part of a policy gradient loop (since subsequent policies are expected to have similar value functions and the value of learned in one iteration can be used as in the next iteration) like, for example, in . One main advantage of using the proposed cost (II-B) instead of the more traditional MSPBE cost is that the size of the bias can be controlled through and . To see this, we first rewrite in the following equivalent form:
Next, we introduce the quantities:
is an invertible matrix.
is an invertible matrix.
Due to remarks 1 and 2 we have that the spectral radius of is strictly smaller than one, and hence is invertible. The result now follows by recalling that and are full rank matrices.
The minimizer of (16) is given by:
where the inverse exists and hence is well defined. This is because is positive-definite and is invertible. Also note that when , and , reduces to (7) and hence the bias is removed. We do not fix because while the bias diminishes as , the estimate of the value function approaches a Monte Carlo estimate and hence the variance of the estimate increases. The parameter then offers a valuable bias-variance trade-off in practice, and its optimal value naturally depends on each particular problem.
At this point, all that is left to fully define the surrogate cost function is to choose the positive definite matrix . The algorithm that we derive in this paper is of the stochastic gradient type. With this in mind, we shall choose such that the quantities , and turn out to be expectations that can be sampled from data realizations. Thus, we start by setting to be a diagonal matrix with positive entries; we collect these entries into a vector and write instead of , i.e., . We shall select to correspond to the steady-state distribution of the Markov chain induced by the behavior policy, . This choice for not only is convenient in terms of algorithm derivation, it is also physically meaningful; since with this choice for , states that are visited more often are weighted more heavily while states which are rarely visited receive lower weights. As a consequence of Assumption 1 and the Perron-Frobenius Theorem , the vector is guaranteed to exist and all its entries will be strictly positive and add up to one. Moreover, this vector satisfies where is the transition probability matrix defined in a manner similar to .
Setting , the matrices , and can be written as expectations as follows:
where, with a little abuse of notation, we defined and , where is the state visited at time .
See Appendix A.
Ii-C Optimization problem
Since the signal distributions are not known beforehand and we are working with a finite amount of data, say, of size , we need to rely on empirical approximations to estimate the expectations in . We thus let , , and denote estimates for , , and from data and replace them in (16) to define the following empirical optimization problem:
Note that whether an empirical estimate for is required depends on the choice for . For instance, if then obviously no estimate is needed. However, if then an empirical estimate is needed (for this particular choice, ).
To fully characterize the empirical optimization problem, expressions for the empirical estimates still need to be provided. The following lemma provides the necessary estimates.
For the general off-policy case, the following expressions provide unbiased estimates for the desired quantities:
For the general off-policy case, the following expressions provide unbiased estimates for the desired quantities:
See Appendix B.
Note that is the importance sample weight corresponding to the trajectory that started at some state and took steps before arriving at some other state . Note that even if we have transitions, we can only use training samples because every estimate of and looks steps into the future.
Iii Distributed Policy Evaluation
In this section we present the distributed framework and use (24) to derive Fast Diffusion for Policy Evaluation (FDPE). The purpose of this algorithm is to deal with situations where data is dispersed among a number of nodes and the goal is to solve the policy evaluation problem in a fully decentralized manner. The algorithm combines two important tools for inference from data: diffusion strategies [20, 23, 26, 27] and amortized variance-reduced techniques .
Iii-a Distributed Setting
We consider a situation in which there are agents that wish to evaluate a target policy for a common MDP. Each agent has samples, which are collected following its own behavior policy (with steady state distribution matrix ). Note that the behavior policies can be potentially different from each other. The goal for all agents is to estimate the value function of the target policy leveraging all the data from all other agents in a fully decentralized manner.
To do this, they form a network in which each agent can only communicate with other agents in its immediate neighborhood. The network is represented by a graph in which the nodes and edges represent the agents and communication links, respectively. The topology of the graph is defined by a combination matrix whose -th entry (i.e., ) is a scalar with which agent scales information arriving from agent . If agent is not in the neighborhood of agent , then . A sample network is shown in Figure 1.
We assume that the network is strongly connected. This implies that there is at least one path from any node to any other node and that at least one node has a self-loop (i.e., that at least one agent uses its own information). We further assume that the combination matrix is symmetric and doubly-stochastic.
A combination matrix satisfying assumption 2 can be generated using the Laplacian rule, the maximum-degree rule, or the Metropolis rule (see Table 14.1 in ).
Iii-B Algorithm Derivation
Mathematically, the goal for all agents is to minimize the following aggregate cost:
where the purpose of the nonnegative coefficients is to scale the costs of the different agents; this is useful since the costs of agents whose behavior policy is closer to the target policy might be assigned higher weights. For (III-B), we define the matrices and to be:
so that equation (III-B) becomes:
Note that (III-B) has the same form as (16); the only difference is that in (III-B) the matrices and are defined by linear combinations of the individual matrices and , respectively. Matrices are therefore not required to be positive definite, only is required to be a positive definite diagonal matrix. Since the matrices are given by the steady-state probabilities of the behavior policies, this implies that each agent does not need to explore the entire state-space by itself, but rather all agents collectively need to explore the state-space. This is one of the advantages of our multi-agent setting. In practice, this could be useful since the agents can divide the entire state-space into sections, each of which can be explored by a different agent in parallel.
The empirical problem for the multi-agent case is then given by:
Since we are interested in deriving a distributed algorithm we define local copies and rewrite (31) equivalently in the form:
The above formulation although correct is not useful because the gradient with respect to any individual depends on all the data from all agents and we want to derive an algorithm that only relies on local data. To circumvent this inconvenience, we reformulate (31) into an equivalent problem. To this end, we note that every quadratic function can be expressed in terms of its conjugate function  as:
Therefore, expression (31) can equivalently be rewritten as:
Defining local copies for the primal and dual variables we can write:
where we defined . Note that in the above formulation, the gradients with respect and depend only on local quantities, and therefore this optimization problem can be used to derive a fully distributed learning algorithm. To solve problem (38) we shall judiciously combine the Exact Diffusion  and Amortized Variance Reduced Gradient (AVRG)  techniques. We do so in a manner similar to what was done for the Diffusion AVRG approach from .
Exact Diffusion is a fully distributed algorithm that guarantees convergence to the global minimizer in strongly convex problems of the following form:
where the summation runs across the agents and is the individual risk function of agent . On the other hand, AVRG is a reduced-variance stochastic gradient algorithm that relies on random reshuffling. This algorithm tackles single agent problems of the following form:
where the summation runs across samples and
is some loss function evaluated atand the -th data point. Note that (38) has a form that is a combination of (39) and (40), with the added difference that (38) is a saddle-point problem while (39) and (40) are minimization problems. Hence, to be able to apply Exact Diffusion to (38) we modify the original formulation to make it suitable for our saddle-point problem. The modification we make is to change the gradient vector in the original formulation for another vector (which we refer to as ) in which the gradient with respect to the minimization variable is stacked with the negative gradient with respect to the maximization variable. We further apply the AVRG variance reduction scheme to the gradients of the primal and dual variables. We refer to this algorithm as Fast Diffusion for Policy Evaluation or FDPE — see Algorithm 1. Note that the proofs of convergence of ,  and  are not applicable to our algorithm because those papers derive algorithms that minimize strongly convex optimization problems, while FDPE relies on a combination of these techniques to solve a saddle-point problem. Hence, a new convergence technique and proof are required.
Update with exact diffusion:
In the listing of Fast Diffusion for Policy Evaluation, we introduce , , and , where indicates a random permutation of the mini-batches of the
-th agent, which is generated at the beginning of epoch; is the -th mini-batch and is defined as follows:
Note that the choice of the mini-batch size provides a communication-computation trade-off. As the number of mini-batches diminishes so do the communication requirements per epoch. However, more gradients need to be calculated per update and hence more gradient calculations might be required to achieve a desired error. Obviously the optimal amount of mini-batches to minimize the overall time of the optimization process depends on the particular hardware availability for each implementation.
The saddle-point of (38) is given by
and are obtained by equating the gradient of (38) to zero and solving for and .
If there is enough data such that and are positive definite matrices, and the step-sizes and are small enough, then the iterates and generated by Fast Diffusion for Policy Evaluation (FDPE) converge linearly to (49).
The proof is demanding and lengthy and involves several steps. Here we present a sketch of the proof based on high level lemmas without justification. The full proofs for all these results and lemmas are included in the supplementary material and can also be found in the arXiv version .
The first step we take towards proving convergence is to rewrite the equations of Algorithm 1 in the form of a single network recursion.
The update equations of Algorithm 1 can compactly be expressed by the following second order recursion for :
where , , and:
Here, the notation refers to the -th agent’s estimate of using all samples from the -th mini-batch (if no mini-batches are being used, is just a sample estimate, otherwise it is the empirical average using the samples corresponding to the -th mini-batch).
Recursion (50) is only valid for due to the fact that during the initial epoch () and in the first iteration of the second epoch (), not all gradients in Algorithm 1 have variance reduction. During the initial iterations up until , equations different than (50) describe the evolution of the network. Note, however, that such equations are not necessary to establish convergence because they only affect a finite number of iterations at the beginning of the optimization process.
If is initialized to , then recursion (50) is equivalent to the following first-order recursion for :
The next step in proving convergence is to derive an error recursion driven by gradient noise. We do this in the following lemma.
Defining the following error quantities and gradient noise:
we can derive the following error recursion, which is valid for :
where is a constant matrix.
We follow by doing a coordinate transformation so that the properties of the driving matrix in (56) are easier to analyze.
Through a coordinate transformation applied to (56) we obtain the following recursion:
where are some constant matrices, , , and is a diagonal matrix with . Furthermore and satisfy: