## 1 Introduction

Reinforcement learning (RL) is a machine learning paradigm with potential for impact in wide-ranging applications. At a high level, RL studies autonomous agents interacting with uncertain environments – by taking actions, observing the effects of those actions, and incurring costs – in hopes of achieving some goal. Mathematically, this is often cast in the following (finite, discrete-time) Markov decision process (MDP) model. Let and be finite sets of states and actions, respectively; for simplicity, we assume for some throughout the paper. The uncertain environment is modeled by a controlled Markov chain with transition matrix , i.e.

(1) |

where and are the random sequences of states and actions, respectively. State-action pair incurs instantaneous cost . Mappings are called (stationary, deterministic, Markov) policies and dictate the action taken at each state, i.e. . If the initial state is and the agent follows policy , it incurs discounted cost

(2) |

where means inside the expectation, is a discount factor, and and are the transition matrix and cost vector induced by .

To find good policies – roughly, for which is small – one often needs to estimate for a fixed policy . For example, the empirical policy iteration algorithm of [Haskell et al.(2016)Haskell, Jain, and Kalathil] iteratively estimates and greedily updates . Moving forward, we focus on the former step, called empirical policy evaluation (EPE). The policy will thus be fixed for the remainder of the paper, so we dispense with this subscript in (2) and (with slight abuse of notation) define our problem as follows. Let be a discount factor, a cost vector, and a row stochastic matrix. We seek an algorithm to estimate the value function

(3) |

As is typical in the RL literature, we assume is unknown but the agent can sample random states distributed as via interaction with the environment. Since this interaction can be costly in applications, we aim to estimate (3) with as few samples as possible. In contrast to some works, we also assume is a known input to the algorithm. Thus, our work is suitable for goal-oriented applications where one knows instantaneous costs a priori – for instance, which states correspond to winning or losing if the MDP models a game – and aims to estimate long-term discounted costs – for instance, how good or bad non-terminal configurations of the game are.

To contextualize our contributions, we contrast two approaches to EPE. The first approach is one of forward exploration, where is estimated by sampling trajectories beginning at each state. We focus on a typical approach employed in e.g. [Haskell et al.(2016)Haskell, Jain, and Kalathil], which we refer to as the forward approach for the remainder of the paper, and which proceeds as follows. First, let be a Markov chain with transition matrix , fix and , and rewrite (3) as

(4) |

Here the bias can be made small if is chosen large, and the first term can be estimated by simulating length- trajectories. More specifically, let be a trajectory obtained as follows: set and, for , sample from . Letting and repeating this for , we obtain an unbiased estimate of the first term in (4):

(5) |

This forward approach is analytically quite tractable; indeed, rigorous guarantees follow easily from standard Chernoff bounds (see Appendix I). However, since trajectories must be sampled starting at each state, samples are fundamentally required, which may be prohibitive in practice.
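The forward approach can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure of the cited work: it assumes the normalized value function v = (1 - α) Σ_t α^t P^t c, and the names `forward_epe`, `sample_next`, `horizon`, and `n_traj` are hypothetical. The outer loop over every start state is what makes the sample count scale with the number of states.

```python
import random

def forward_epe(sample_next, cost, alpha, horizon, n_traj, rng=None):
    """Truncated Monte Carlo estimate of a discounted value function.

    Assumption: v = (1 - alpha) * sum_t alpha^t P^t c (normalized form).
    sample_next(state, rng) is a hypothetical environment sampler that
    returns one next state drawn from the row P(state, .).
    """
    rng = rng or random.Random(0)
    estimate = []
    for start in range(len(cost)):            # trajectories from EVERY state
        total = 0.0
        for _ in range(n_traj):
            state = start
            for t in range(horizon):          # truncate after `horizon` steps
                if t > 0:
                    state = sample_next(state, rng)
                total += (alpha ** t) * cost[state]
        estimate.append((1.0 - alpha) * total / n_traj)
    return estimate

# Deterministic 3-state ring: state s always moves to (s + 1) % 3.
print(forward_epe(lambda s, rng: (s + 1) % 3, [1.0, 0.0, 0.0], 0.5, 3, 2))
```

Each call uses `len(cost) * n_traj * (horizon - 1)` environment samples, regardless of how sparse the cost vector is.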

The second approach we consider is one of backward exploration. This approach relies on the idea that if there are only a few high-cost states with only a few trajectories leading to them, it is more efficient to work backward along just these trajectories (or along a small set containing them) to identify high-value states (those for which is large). Put differently, if and are sparse, intuition suggests that backward exploration from high-cost states is more sample-efficient than forward exploration from all states. While intuitively reasonable, there are two issues that prevent backward exploration from reducing the linear sample complexity of the forward approach. First, the agent must identify high-cost states in order to explore backward from them, without visiting all states. Second, the agent must explore a small set of trajectories likely to lead to high-cost states, without starting at each state and filtering out trajectories that do not reach the high-cost set.

Several approaches have been proposed to combat these issues. For instance, [Goyal et al.(2018)Goyal, Brakel, Fedus, Singhal, Lillicrap, Levine, Larochelle, and Bengio] uses observed state-action-cost sequences to train a model that generates samples of state-action pairs likely to lead to a given state. This allows the agent to construct simulated trajectories that are guaranteed to lead to high-cost states, addressing the second issue; the observed sequences are also used to identify high-cost states, addressing the first issue. [Edwards et al.(2018)Edwards, Downs, and Davidson] similarly trains a model that predicts which trajectories lead to high-cost states while assuming costs are known a priori. In a different vein, [Florensa et al.(2017)Florensa, Held, Wulfmeier, Zhang, and Abbeel] considers physical tasks like a robot navigating a maze which have clear goal states, addressing the first issue. The state-action space is assumed to have a certain continuity – namely, “small” actions (e.g. a robot moving a small distance) lead to “nearby” states (e.g. physically close locations) – addressing the second issue.

Our approach is as follows. First, as mentioned above, we assume the cost vector is known a priori (like [Edwards et al.(2018)Edwards, Downs, and Davidson] and similar to [Florensa et al.(2017)Florensa, Held, Wulfmeier, Zhang, and Abbeel]). Second, we assume the agent is provided certain side information: satisfying the “absolute continuity” condition

(6) |

Note we can view as the adjacency matrix for a graph whose edges are a superset of those in the graph induced by ; thus, we refer to this side information as the supergraph. The utility of the supergraph is that it allows the agent to determine which states may be “close” to high-cost states in the induced graph, which may allow for construction of trajectories leading to such states. In this work, we assume the supergraph is provided and do not address the important practical consideration of how to actually obtain it. However, we do note it can likely be obtained from domain knowledge. For instance, in a robot navigation task like [Florensa et al.(2017)Florensa, Held, Wulfmeier, Zhang, and Abbeel], one-step transitions between physically distant states and may be impossible, which would allow us to conclude a priori and set . Unlike [Florensa et al.(2017)Florensa, Held, Wulfmeier, Zhang, and Abbeel], however, our supergraph assumption does not depend on state-action continuity and thus should hold more generally; for example, if the MDP models a game, the game’s rules may prevent transitions from to , so that . We also emphasize that the reverse of the implication in (6) need not hold. Thus, one can always set to ensure that (6) holds. Of course, there is a trade-off; as will be seen, our algorithms are most efficient when is sparse in a certain sense.
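To make the supergraph condition concrete, the following sketch checks it for small matrices given as nested lists. It assumes condition (6) reads "a zero entry of the supergraph adjacency matrix forces a zero entry of the transition matrix", which is how the surrounding discussion describes it; the function names `is_valid_supergraph` and `in_neighbors` are hypothetical.

```python
def is_valid_supergraph(P, A):
    """True if A[s][t] == 0 implies P[s][t] == 0 for every pair (s, t).

    Assumed reading of condition (6); P is a row stochastic matrix and
    A is a 0/1 adjacency matrix, both as lists of lists.
    """
    n = len(P)
    return all(A[s][t] != 0 or P[s][t] == 0
               for s in range(n) for t in range(n))

def in_neighbors(A):
    """Map each state t to the states s with a supergraph edge s -> t."""
    n = len(A)
    return {t: [s for s in range(n) if A[s][t] != 0] for t in range(n)}

P = [[0.5, 0.5, 0.0],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]
tight = [[1, 1, 0], [0, 0, 1], [1, 0, 0]]   # exactly the support of P
dense = [[1, 1, 1]] * 3                     # all-ones choice: always valid
print(is_valid_supergraph(P, tight), is_valid_supergraph(P, dense))
print(in_neighbors(tight))
```

The all-ones matrix always passes the check but gives every state the maximal in-degree, which is the trade-off noted above.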

In the remainder of the paper, we devise two backward exploration-based EPE algorithms that exploit the supergraph. Unlike [Goyal et al.(2018)Goyal, Brakel, Fedus, Singhal, Lillicrap, Levine, Larochelle, and Bengio, Edwards et al.(2018)Edwards, Downs, and Davidson, Florensa et al.(2017)Florensa, Held, Wulfmeier, Zhang, and Abbeel], which only present empirical results, our algorithms are amenable to rigorous accuracy and sample complexity guarantees. Thus, our main contribution is to offer theoretical evidence for the empirical success of backward exploration. More precisely, our contributions are as follows. First, we devise an algorithm called Backward-EPE in Section 2 that uses the supergraph to discover high-value states while working backward from high-cost ones. We establish accuracy and worst-case sample complexity , equivalent to the average-case complexity of the forward approach. More notably, we show the average-case complexity of Backward-EPE is , where is the average degree in the supergraph. Note this bound precisely captures the intuition that backward exploration depends on how many high-cost states are present ( term) and how many trajectories lead to them ( term). In the extreme case, , in which case Backward-EPE reduces complexity from to . Next, we combine Backward-EPE with the forward approach for our second algorithm Bidirectional-EPE in Section 3. We establish a (pseudo)-relative error guarantee, which we argue is useful in e.g. empirical policy iteration. Analytically, we show Bidirectional-EPE reduces the sample complexity of a plug-in method with the same accuracy guarantee; empirically, we show it is more efficient than using the backward or forward approach alone. Both of our algorithms are inspired by methods that estimate PageRank, a node centrality measure used in the network science literature [Page et al.(1999)Page, Brin, Motwani, and Winograd]. 
While seemingly unrelated to EPE, PageRank has mathematical form similar to that of the value function (3); however, the PageRank estimation literature assumes is known, so the extension to EPE is non-trivial. Thus, another contribution of this work is to show how PageRank estimators can be adapted to EPE. We believe our algorithms and analysis are only examples of a more general approach; Section 4 discusses other problems where we believe our approach will be useful.

Commonly-used notation: For a matrix and any , , , and denote the -th entry, -th row, and -th column of , respectively. We write and for the matrices of zeroes and ones, resp. Matrix transpose is denoted by . We use for the indicator function, i.e. if statement is true and otherwise. For , is the -length vector with in the -th entry and elsewhere, i.e. . Also for , and are the incoming neighbors and in-degree of in the supergraph. Average degree is denoted by . For , we write , , , and , resp., if , , and , and

, resp. All random variables are defined on a common probability space , with denoting expectation and meaning -almost surely.

## 2 Backward empirical policy evaluation

Our first algorithm is called Backward-EPE and is based on the Approx-Contributions PageRank estimator from [Andersen et al.(2008)Andersen, Borgs, Chayes, Hopcroft, Mirrokni, and Teng]. The latter algorithm restricts to the case for some and assumes is known; our algorithm is a fairly natural generalization to the case and unknown . For brevity, we restrict attention to Backward-EPE in this section. For transparency, Appendix A discusses Approx-Contributions and clarifies which aspects of our analysis are borrowed from [Andersen et al.(2008)Andersen, Borgs, Chayes, Hopcroft, Mirrokni, and Teng] and other existing work.

Backward-EPE is defined in Algorithm 2. The algorithm takes as input cost vector , discount factor , and desired accuracy , and initializes four variables: a value function estimate , a residual error vector , a set we call the encountered set, and a transition matrix estimate . Conceptually, the algorithm then works backward from high-cost states, iteratively pushing mass from residual vector to estimate vector so as to improve the estimate of . More precisely, the first iteration proceeds as follows. First, a high-cost state is chosen ( such that is maximal) and its incoming supergraph neighbors are added to the encountered set (first line in while loop). For – i.e. for which may be nonzero by (6) – an estimate of is computed using samples (first for loop). The estimate is then incremented with the component of , and is used to estimate the component of and to increment the corresponding residual (second for loop).

In subsequent iterations , the iterative update proceeds analogously, choosing to maximize , adding to the encountered set, incrementing by , and using an estimate of to increment . The only distinction is that at iteration , is estimated only for states . Put differently, the first time we encounter state – i.e. the first for which – we estimate ; we then retain that estimate for the remainder of the algorithm. Thus, the encountered set tracks the rows of we have estimated up to and including iteration . Alternatively, one could estimate with independent samples at each iteration for which ; we discuss the merits of this approach in Section 4.

Algorithm 2: Backward-EPE

Input: sampler for transition matrix; cost vector; discount factor; supergraph in-neighbors; termination parameter; per-state sample count

Output: estimate of the value function
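The iteration described above can be sketched as follows. This is an illustrative reconstruction, not a verbatim copy of Algorithm 2: the push rule is the one used by standard backward-push PageRank estimators, adapted under the assumption that the value function is normalized as v = (1 - α) Σ_t α^t P^t c, and the names `backward_epe`, `sample_next`, `in_nbrs`, `eps`, and `n` are hypothetical. Costs are assumed nonnegative, and the sampler is assumed to respect the supergraph.

```python
import random

def backward_epe(sample_next, cost, alpha, in_nbrs, eps, n, rng=None):
    """Backward push with empirically estimated rows (a sketch, see lead-in)."""
    rng = rng or random.Random(0)
    n_states = len(cost)
    v_hat = [0.0] * n_states      # value function estimate
    resid = list(cost)            # residual vector, initialized to the cost
    p_hat = {}                    # encountered state -> estimated row
    while max(resid) > eps:
        top = max(resid)
        # choose a maximal-residual state, breaking ties uniformly
        s_k = rng.choice([s for s in range(n_states) if resid[s] == top])
        for s in in_nbrs[s_k]:
            if s not in p_hat:    # estimate each row once, then reuse it
                counts = {}
                for _ in range(n):
                    nxt = sample_next(s, rng)
                    counts[nxt] = counts.get(nxt, 0) + 1
                p_hat[s] = {t: c / n for t, c in counts.items()}
        mass = resid[s_k]
        v_hat[s_k] += (1.0 - alpha) * mass   # push mass into the estimate
        resid[s_k] = 0.0
        for s in in_nbrs[s_k]:               # propagate to in-neighbors
            resid[s] += alpha * p_hat[s].get(s_k, 0.0) * mass
    return v_hat, resid, p_hat

# Deterministic 4-state ring: s -> (s + 1) % 4, so the only in-neighbor of t is t - 1.
ring_in = {t: [(t - 1) % 4] for t in range(4)}
v_hat, resid, p_hat = backward_epe(lambda s, rng: (s + 1) % 4,
                                   [1.0, 0.0, 0.0, 0.0], 0.5, ring_in, 0.01, 4)
print(v_hat, max(resid), len(p_hat) * 4)   # estimate, final residual, samples used
```

In this sketch the total number of environment samples is `n * len(p_hat)`, so the cost of a run is governed by how many states end up in the encountered set.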

The manner in which we estimate and update the estimate and residual vectors may appear mysterious, but it allows us to prove the following analogue of a key result from [Andersen et al.(2008)Andersen, Borgs, Chayes, Hopcroft, Mirrokni, and Teng]. To explain this result, first let denote the iteration at which Backward-EPE terminates, and let denote the -th row of , so that is the value function for defined on the final estimate of . Then the result (roughly) says that the fixed point equation is preserved across iterations . Conceptually, this means that if we run the algorithm until it terminates to obtain , then look back at the sequence generated by the algorithm, the fixed point equation will have held at each . This non-causality is somewhat unintuitive, yet is crucial to the ensuing analysis.

More precisely, Lemma 2 says that such a fixed point equation holds for certain row stochastic matrices which differ from only in unestimated rows of , i.e. rows indexed by . The set of such matrices for which the result holds is discussed in Appendix B; for brevity, here we state the result only for the two elements of this set we require in later analyses: , which fills unestimated rows with offline estimates, and , which fills unestimated rows with actual rows of .

Let , independent across and , and independent of the random variables in Algorithm 2. From , define an offline estimate of row-wise by . Furthermore, define

(7) | |||

(8) |

where is the iteration at which Algorithm 2 terminates. Then

(9) |

See Appendix B.

Owing to the fact that (9) holds across iterations, we will refer to the identities in (9) as the -invariant and the -invariant, respectively. These invariants will be pivotal in the theorems to come; interestingly, though, only one invariant is useful for each theorem, while the other fails. This is due to technical issues discussed in Remarks C, D, and F.2 in the appendix. We also emphasize the offline estimate is an analytical tool and does not affect our algorithm’s sample complexity.

We turn to the first of the aforementioned theorems, an accuracy guarantee for Backward-EPE. Toward this end, note that is a distribution over and recall that by definition; thus, the -invariant ensures that the ultimate estimate of satisfies

(10) |

For the remaining summand , recall and are the value functions defined on and an estimate of , respectively. Thus, if the estimate of is sufficiently accurate, this remaining summand will be small. This is made precise by the following theorem. We note that showing with high probability is not immediate, because is a biased estimate of in general; instead, the proof bounds by a random variable more conducive to standard Chernoff bounds. We also note this guarantee matches the forward approach’s guarantee from [Haskell et al.(2016)Haskell, Jain, and Kalathil]. Fix and define

(11) |

Theorem 10 says that if we take samples per state encountered, the estimate produced by Backward-EPE will be -accurate. Since Backward-EPE encounters states by definition, the total number of samples needed to ensure -accuracy is . Hence, our next goal is to bound , in order to bound this overall complexity. By the backward exploration intuition discussed in Section 1, we should expect a nontrivial bound if the cost vector and supergraph are sufficiently sparse. However, even when both objects are maximally sparse, one can construct adversarial examples for which . For instance, suppose we restrict to having a single high-cost state and the supergraph to having the minimal number of edges . Then taking and will satisfy this restriction, but will yield (assuming ). Note the key issue in this example (and, we suspect, in most adversarial examples) is the interaction between the cost vector and the supergraph; in particular, if high-cost states have high in-degrees, will be large (even if there are few high-cost states and few edges overall).
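The adversarial example can be made concrete with a small sketch. It assumes one reading of the construction above: a single unit-cost "hub" state, and a supergraph in which every state has exactly one outgoing edge, pointing at the hub (so the number of edges equals the number of states). The helper names `star_instance` and `encountered_after_first_push` are hypothetical.

```python
def star_instance(n_states, hub=0):
    """Single high-cost state whose in-neighborhood is the whole state space."""
    cost = [1.0 if s == hub else 0.0 for s in range(n_states)]
    adjacency = [[1 if t == hub else 0 for t in range(n_states)]
                 for _ in range(n_states)]
    return cost, adjacency

def encountered_after_first_push(cost, adjacency):
    """States whose rows must be estimated when the top-cost state is chosen."""
    n = len(cost)
    first = max(range(n), key=lambda s: cost[s])
    return {s for s in range(n) if adjacency[s][first] != 0}

cost, adjacency = star_instance(6)
print(sum(map(sum, adjacency)))                            # number of edges
print(len(encountered_after_first_push(cost, adjacency)))  # encountered states
```

Both the cost vector and the supergraph are as sparse as possible here, yet the very first iteration already encounters every state, because the single high-cost state has maximal in-degree.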

In light of this, our best hope for a nontrivial bound on is an average-case analysis; in particular, bounding while randomizing over the inputs of Backward-EPE. As it turns out, we only need to randomize over the cost vector (not the transition matrix). Roughly speaking, we will consider a random cost vector for which , i.e. the expected cost of any given state does not dominate the average expected cost. For such cost vectors, the interaction between cost and in-degree discussed in the previous paragraph will “average out”, and consequently the adversarial examples will not dominate in expectation.

This intuition is formalized in the following theorem. Similar to Theorem 10, the proof exploits the -invariant. Here the key observations are that and that increases by at least at each for which , which prevents certain states from being chosen as and thus (potentially) prevents their incoming supergraph neighbors from being encountered.

Let be an -valued random vector s.t. for some absolute constant . Then if Algorithm 2 is initialized with cost vector ,

(12) |

where the expectation is with respect to and the randomness in Algorithm 2. See Appendix D.

We now return to interpret our results and derive Backward-EPE’s overall sample complexity, which (we recall) is . In the worst case, , and thus the worst-case sample complexity for fixed is . Neglecting factors and constants, ignoring terms for quantities that have polynomial scaling (e.g. writing as simply ), and assuming is either constant or grows to , Theorem 10 implies

(13) |

For comparison, the complexity of the forward approach is

(14) |

(see Appendix I). Thus, in the worst case Backward-EPE has similar complexity to that of the forward approach, with a slightly improved dependence on the discount factor .

In the average case, however, the sample complexity of Backward-EPE can be dramatically better than the forward approach. In particular, Theorem 2 implies average-case sample complexity

(15) |

(This argument is not precise, since is random in Theorem 2; we return to address this shortly.) Thus, if , , and are constants, Backward-EPE has average case complexity

(16) |

Interestingly, (16) exactly captures the intuition that backward exploration is efficient when the costs and supergraph are sufficiently sparse, since and quantify cost and supergraph sparsity, respectively. We also note that when , , and are constants, the forward approach’s complexity (14) becomes simply . In the extreme case, and Backward-EPE offers a dramatic reduction in sample complexity; namely, by a factor of .

Though this average-case argument is not precise, we can make it rigorous with further assumptions on . For example, the following corollary considers random binary cost vectors with nonzero entries. Such cost vectors could arise, for example, in MDP models of games, where states corresponding to losing configurations of the game have unit cost and other states have zero cost. Fix and define to be the set of binary vectors with nonzero entries. Assume the cost vector is chosen uniformly at random from and are constants. Then to guarantee , Backward-EPE requires samples in expectation. See Appendix E.

To conclude this section and illustrate our analysis, we present empirical results in Figure 1. Here we generate random problem instances in a manner that yields three different cases of the complexity factor identified above; roughly, , , and (left). In all cases, the sample complexity of Backward-EPE decays relative to that of the forward approach, suggesting sublinear complexity (middle). Moreover, the different scalings of reflect in different rates of decay in relative complexity, suggesting indeed determines sample complexity. We also note algorithmic parameters are chosen to ensure both algorithms yield similar error (right). Error bars show standard deviation across problem instances. Further details regarding the experimental setup can be found in Appendix H.

## 3 Bidirectional empirical policy evaluation

Our second algorithm is called Bidirectional-EPE and is inspired by the Bidirectional-PPR PageRank estimator from [Lofgren et al.(2016)Lofgren, Banerjee, and Goel] (see Appendix A for further discussion of this PageRank estimator). As will be seen, this algorithm is conducive to a stronger accuracy guarantee; namely, a (pseudo)-relative error guarantee. The utility of such a guarantee is that the resulting estimates tend to better preserve the ordering of the actual value function when compared to an guarantee. Preserving this ordering is important in the problem of finding good policies; e.g. in the greedy update of policy iteration mentioned in Section 1.

As its name suggests, Bidirectional-EPE proceeds in two stages: it first conducts backward exploration using Backward-EPE, then improves the resulting estimate via forward exploration. The analysis of this bidirectional approach relies on the -invariant (9). Similar to Theorem 10, we can make small by taking large in Backward-EPE; when this holds, we have

(17) |

Since is a probability distribution over , the residual term in (17) satisfies

(18) |

where in the approximate equality are distributed as and is large. Hence, by (17),

(19) |

Intuitively, the right side of (19) is a more accurate estimate of than alone; the only remaining question is how to generate . This can indeed be done in our model; namely, by generating -length trajectories on . More specifically, given , we first generate a random variable and set ; we then sample from for each ; and finally we set . Then conditioned on , is distributed as . To see why, let denote probability conditioned on and observe

(20) |

Thus, sampling from amounts to sampling from . To do so, we either sample from (if ) or from (if ); the former is exactly what was done in Backward-EPE, and the latter can be done after running Backward-EPE. Put differently, to generate we sample from unless we have already sampled from during Backward-EPE, in which case we sample from the empirical estimate obtained during Backward-EPE.
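The forward stage described above can be sketched as follows. This is a hedged illustration, not the paper's Algorithm 3: it assumes the trajectory length L satisfies P(L = l) = (1 - α) α^l (consistent with a normalized value function), and the names `sample_endpoint`, `bidirectional_estimate`, `p_hat`, `sample_next`, and `n_fwd` are hypothetical. Rows already estimated during the backward stage are reused; all other rows are sampled from the environment.

```python
import random

def sample_endpoint(start, alpha, p_hat, sample_next, rng):
    """Endpoint of a trajectory of random length L, P(L = l) = (1 - alpha) * alpha**l."""
    state = start
    while rng.random() < alpha:              # continue with probability alpha
        if state in p_hat:                   # row estimated in the backward stage
            nxt, probs = zip(*p_hat[state].items())
            state = rng.choices(nxt, weights=probs)[0]
        else:                                # otherwise query the environment
            state = sample_next(state, rng)
    return state

def bidirectional_estimate(start, v_hat, resid, alpha, p_hat,
                           sample_next, n_fwd, rng=None):
    """Backward estimate plus the average residual at sampled endpoints."""
    rng = rng or random.Random(0)
    total = sum(resid[sample_endpoint(start, alpha, p_hat, sample_next, rng)]
                for _ in range(n_fwd))
    return v_hat[start] + total / n_fwd

# Hypothetical backward-stage output on a 3-state ring.
v_hat, resid = [0.10, 0.20, 0.30], [0.05, 0.05, 0.05]
p_hat = {0: {1: 1.0}}                        # only row 0 was estimated
print(bidirectional_estimate(1, v_hat, resid, 0.5, p_hat,
                             lambda s, rng: (s + 1) % 3, n_fwd=100))
```

Only the transitions taken from states outside `p_hat` consume new environment samples, so the forward stage reuses the work of the backward stage wherever the two overlap.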

The Bidirectional-EPE algorithm is formally defined in Algorithm 3. As above, write for the per-state forward trajectory count; we also write for the per-state sample count in the Backward-EPE subroutine. We denote the ultimate estimate of by .

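Algorithm 3: Bidirectional-EPE

Input: sampler for transition matrix; cost vector; discount factor; supergraph in-neighbors; termination parameter; per-state backward and forward sample counts

Run Backward-EPE (Algorithm 2) with the sampler, cost vector, discount factor, supergraph in-neighbors, termination parameter, and per-state backward sample count as inputs

Take the estimate vector, residual vector, encountered states, and transition matrix estimate at termination of Backward-EPE, and define the matrix in (8)

Generate the forward samples and set the estimate as on the right side of (19)

Output: estimate of the value function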

As alluded to above, Bidirectional-EPE is conducive to a pseudo-relative error guarantee. In particular, given relative error tolerance and absolute tolerance , Theorem 3 shows that with high probability, the estimate satisfies

(21) |

Thus, Bidirectional-EPE permits a relative-plus-absolute accuracy guarantee. (Note that since can be arbitrarily small in general, we should not expect a relative error guarantee for all states.) This guarantee is formalized in the next theorem. As suggested by (17)-(19), the proof first shows for large; conditioned on , we then show for large, using separate Chernoff bounds for two cases of .

We next discuss Theorem 3. To simplify notation, we restrict to the setting of Corollary 2; however, the key insights extend to the more general setting of Theorem 2. Also, we assume the relative error tolerance , the discount factor , and inaccuracy probability are constants independent of . Finally, we note Theorem 3 holds for random ; see Remark F.2 in Appendix F.

We begin by deriving expressions for the asymptotic sample complexity of Bidirectional-EPE in the setting of Corollary 2. For the backward exploration stage (i.e. the Backward-EPE subroutine), we require per-state sample complexity ; note this is deterministic since pointwise in Corollary 2. Thus, the average-case sample complexity is (by Corollary 2),

(25) |

For the forward exploration stage, we require trajectories of expected length for each of states. We are assuming is a constant, and thus the expected forward complexity is simply . Combined with (25), and writing for the overall expected sample complexity of Bidirectional-EPE in the setting of Corollary 2,

(26) |

Here the termination parameter for the Backward-EPE subroutine is a free parameter that can be chosen to minimize the overall sample complexity. For example,

(27) |

where for simplicity we wrote as simply (note this choice of minimizes (26) if we also ignore the term in that expression). To interpret (27), we consider a specific choice of . To motivate this, we first observe that in the setting of Corollary 2,

(28) |

i.e. the “typical” value is . It is thus sensible to choose , so that we obtain a relative guarantee for above-typical values and settle for the absolute guarantee for below-typical values. Substituting into (27), we conclude that Bidirectional-EPE requires
