1 Introduction
One of the unique characteristics of human problem solving is the ability to represent the world in different granularities. When we plan a trip, we first choose the destinations we want to visit and only then decide what to do at each destination. Hierarchical reasoning enables us to break a complex task into simpler ones that are computationally tractable to reason about. Nevertheless, the most successful Reinforcement Learning (RL) algorithms perform planning in a single abstraction level.
RL provides a general framework for optimizing decisions in dynamic environments. However, scaling it to real-world problems suffers from the curses of dimensionality; that is, coping with large state spaces, action spaces, and long horizons. The most common approach to deal with large state spaces is to approximate the value function or policy, making it possible to generalize across different states (Tesauro, 1995; Mnih et al., 2015). For long horizons, combining Monte Carlo simulation with value and policy networks was shown to search among game outcomes efficiently, leading to superhuman performance in playing Go, Chess, and Poker (Silver et al., 2016; Moravčík et al., 2017). Another long-standing approach for dealing with long horizons is to introduce hierarchy into the problem (see Barto and Mahadevan (2003) for a survey). In particular, Sutton et al. (1999) extended the RL formulation to include options: local policies for taking actions over a period of time. An option is formally defined as a three-tuple consisting of a set of option initiation states, the option policy, and a set of option termination states. The options framework presents a two-level hierarchy, where options can be learned separately to achieve subgoals and a policy over options selects among options to accomplish the final goal of a task. Hierarchical solutions based on this formulation simplify the problem and demonstrate superior performance in challenging environments (Tessler et al., 2017; Vezhnevets et al., 2017; Bacon et al., 2017).
In this work, we focus on a specific type of hierarchy, reward function decomposition, that dates back to the works of Humphrys (1996) and Karlsson (1997) and was recently combined with deep learning (van Seijen et al., 2017). In this formulation, the goal of each option is to maximize a local reward function, while the final goal is to maximize the sum of the local rewards. Each option is trained separately, and provides a policy and its value function. The goal is to learn a policy over options that uses the value functions of the options to select among them. That way, each option is responsible for solving a simple task, and the options are learnt in parallel across different machines.
Formally, a set of options defined over a Markov decision process (MDP) constitutes a semi-MDP (SMDP), and the theory of SMDPs provides the foundation for the theory of options. In particular, the policy over options can be learnt using an SMDP algorithm
(Sutton et al., 1999), which converges to the optimal policy over the options under sufficient conditions. A different approach is to use predefined rules to select among options. Such rules allow us to derive policies for the MDP (to maximize the sum of rewards) by learning only the options (without learning directly in the MDP), such that learning is fully decentralized. Examples include choosing the option with the largest value function (Humphrys, 1996; Barreto et al., 2017), or choosing the action that maximizes the sum of the values given to it by the options (Karlsson, 1997). The goal of this work is to provide theoretical guarantees for using such rules.
We focus on a specific case where the dynamics are deterministic and the individual rewards correspond to collectible items, which is common in navigation benchmarks (Tessler et al., 2017; Beattie et al., 2016). The challenge with collectible rewards is that the state changes each time we collect a reward (the subset of available rewards is part of the state). Since all the combinations of remaining items have to be considered, the state space grows exponentially with the number of rewards. We therefore focus on solving an SMDP that is a composition of this MDP with a set of (optimal) options for collecting each one of the rewards. Each option can be initiated in any state, its policy is to collect a single reward, and it terminates once it has collected it.
We show that finding the optimal policy in this SMDP (the optimal policy over options) is equivalent to solving a Reward Discounted Traveling Salesman Problem (RDTSP). Similar to the classical TSP, the goal in the RDTSP is to find an order in which to collect all the rewards; but instead of finding the shortest tour, we maximize the discounted cumulative sum of the rewards (Definition 2). Not surprisingly, computing an optimal solution to the RDTSP is NP-hard (Blum et al., 2007). A brute-force approach for solving the RDTSP requires evaluating all the possible tours connecting the rewards; an adaptation of the Bellman–Held–Karp algorithm (a dynamic programming solution to the TSP; Bellman, 1962; Held and Karp, 1962) to the RDTSP (Algorithm 4 in the supplementary material) is identical to tabular Q-learning on the SMDP, and requires exponential time. This makes the task of computing the optimal policy for our SMDP infeasible. (The hardness results for the RDTSP do not rule out efficient solutions for special MDPs. In the supplementary material, we provide exact polynomial-time solutions for the case in which the MDP is a line and for the case in which it is a star.) Blum et al. (2007) proposed a polynomial-time algorithm for the RDTSP that computes a policy which collects at least a constant fraction of the optimal discounted return; this constant was later improved (Farbstein and Levin, 2016). These algorithms need to know the entire SMDP to compute their approximately optimal policies.
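To make the exponential cost concrete, the following Python sketch adapts the Held–Karp recursion to a discounted objective. It is an illustration under assumptions that the text above does not fix: unit rewards, a symmetric distance matrix `dist` whose index 0 is the start node, and a return in which each reward is discounted by the total distance traveled before it is collected. The name `rdtsp_held_karp` is hypothetical, and the sketch is not the paper's Algorithm 4.

```python
from functools import lru_cache

def rdtsp_held_karp(dist, gamma):
    """Optimal discounted return for collecting every reward (nodes 1..n-1) from node 0.

    Assumptions (illustrative only): unit rewards; the reward at node j is
    discounted by gamma ** (total distance traveled before reaching j).
    """
    n = len(dist)
    full = (1 << n) - 1  # bitmask in which every node is marked as visited

    @lru_cache(maxsize=None)
    def value(visited, current):
        # Best discounted return obtainable from `current`, measured from "now".
        if visited == full:
            return 0.0
        best = 0.0
        for nxt in range(n):
            if visited & (1 << nxt):
                continue
            # Travel to nxt, collect 1 there, then continue; the hop discounts everything after it.
            candidate = gamma ** dist[current][nxt] * (1.0 + value(visited | (1 << nxt), nxt))
            best = max(best, candidate)
        return best

    return value(1, 0)  # node 0 (the start) is marked visited and carries no reward
```

The memo table can hold on the order of n·2^n (subset, node) pairs, which is where the exponential running time comes from.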
In contrast, we focus on deriving and analyzing approximate solutions that use only local information, i.e., they only observe the value of each option from the current state. Such local policies are straightforward to implement and are computationally efficient. The reinforcement learning community is already using simple local approximation algorithms for hierarchical RL. We hope that our research provides theoretical support for comparing local heuristics, and in addition introduces new reasonable local heuristics. Specifically, we prove worst-case guarantees on the reward collected by these algorithms relative to the optimal solution (given optimal options). We also prove bounds on the maximum reward that such local policies can collect. In our experiments, we compare the performance of these local policies in the planning setup (where all of our assumptions hold), during learning (with suboptimal options), and with stochastic dynamics.
Our results: We establish impossibility results for local policies, showing that no deterministic local policy can guarantee a reward larger than for any MDP, and no stochastic policy can guarantee a reward larger than (where OPT denotes the value of the optimal solution). These impossibility results imply that the Nearest Neighbor (NN) algorithm that iteratively collects the closest reward (and thereby a total of at least reward) is optimal up to a constant factor amongst all deterministic local policies.
On the positive side, we propose three simple stochastic policies that outperform NN. The best of them combines NN with a Random Depth First Search (RDFS) and guarantees performance of at least when OPT achieves , and at least in the general case. Combining NN with jumping to a random reward and sorting the rewards by their distance from it has a slightly worse guarantee. A simple modification of NN, which first jumps to a random reward and continues with NN from there, already improves the guarantee to .
2 Problem formulation
We consider the standard RL formulation (Sutton and Barto, 2018), and focus on MDPs with deterministic dynamics and a reward that decomposes into a sum of collectible rewards (Definition 2). Models that satisfy these properties appear in numerous domains including many maze navigation problems, the Arcade Learning Environment, and games like Chess and Go (Bellemare et al., 2013; Barreto et al., 2017; Tessler et al., 2017; van Seijen et al., 2017).
[Collectible Reward Decomposition MDP] An MDP is called a Collectible Reward Decomposition MDP if it satisfies the following properties: (1) Reward Decomposition: the reward in the MDP is the sum of the local rewards; (2) Collectible Rewards: each local reward signal represents a single collectible reward, i.e., it is nonzero at one particular state and zero otherwise, and each reward can only be collected once; (3) Deterministic Dynamics.
We consider the setup where we are given a set of optimal local policies (options), one per reward. Each option tells us how to get to its reward via a shortest path. To make decisions at a given state, we can observe the value of each option at that state. This value is determined by the length of the shortest path from the current state to the corresponding reward state. Notice that at any state, an optimal policy in the MDP always follows the shortest path to one of the rewards. (To see this, assume there exists an optimal policy that does not follow the shortest path from some state to the next reward state. Then, we can improve this policy by taking the shortest path between these two states, contradicting its optimality.) This implies that an optimal policy on the MDP is a composition of the local options. Given that the dynamics are deterministic, an optimal policy makes decisions only at states which contain rewards. In other words, once the policy has arrived at a reward state and decided to go to another reward state, it follows the corresponding optimal option until it reaches that state. (This is not true if the dynamics are stochastic. To see this, recall that a stochastic shortest path is only shortest in expectation. Thus, stochasticity may lead to states in which it is better to change the target reward.)
For a collectible reward decomposition MDP (Definition 2), an optimal policy can be derived in an SMDP (Sutton et al., 1999) denoted by . The state space of contains only the initial state and the reward states . The action space is replaced by the set of optimal options for collecting each one of the rewards. These options can start in any state of and terminate once they collect the reward.
In general, optimal policies for SMDPs are not guaranteed to be optimal in the original MDP, i.e., they do not achieve the same reward as the optimal policy that uses only primitive actions. One trivial exception is the case in which the set of options in the SMDP includes the set of primitive actions (and perhaps also other options) (Sutton et al., 1999). Another example is landmark options (Mann et al., 2015), which plan to reach a specific state in a deterministic MDP. Given a set of landmark options (that does not include primitive actions), Mann et al. showed that the optimal policy in the SMDP is also optimal in the original MDP (under a variation of Definition 2). In addition, they analysed suboptimal options and stochastic dynamics (Mann et al., 2015), which may help to extend our results in future work.
We conclude this section with Proposition 1, suggesting that an optimal policy on the SMDP can be derived by solving an RDTSP (Definition 2). Here, a node corresponds to a state in the SMDP (the initial state and the reward states of the MDP), and the length of an edge corresponds to the time it takes to follow an option from one such state until it collects the reward in another.
[RDTSP] Given an undirected graph with nodes and edges of lengths , find a path (a set of indices ) that maximizes the discounted cumulative return:
Proposition 1 (MDP to RDTSP)
Given an MDP that satisfies Definition 2 with rewards and a set of options for collecting them, define a graph, , with nodes corresponding to the initial state and the rewardstates of . Define the length of an edge in to be , i.e., the value of following option from state . Then, an optimal policy in can be derived by solving an RDTSP in .
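As a concrete reading of the RDTSP objective, the sketch below scores a visiting order and finds the best one by enumeration. It assumes unit rewards and that each reward is discounted by the cumulative distance traveled up to its collection. The names `tour_return` and `brute_force_rdtsp` are hypothetical, and `dist[i][j]` stands for the edge length between nodes i and j, with node 0 playing the role of the initial state.

```python
from itertools import permutations

def tour_return(order, dist, gamma, start=0):
    """Discounted return of visiting the reward nodes in `order`, beginning at `start`."""
    total, traveled, current = 0.0, 0.0, start
    for node in order:
        traveled += dist[current][node]  # cumulative distance so far
        total += gamma ** traveled       # unit reward (assumption), discounted by elapsed distance
        current = node
    return total

def brute_force_rdtsp(dist, gamma, start=0):
    """Try every ordering of the reward nodes; feasible only for a handful of rewards."""
    rewards = [v for v in range(len(dist)) if v != start]
    best_order = max(permutations(rewards),
                     key=lambda o: tour_return(o, dist, gamma, start))
    return list(best_order), tour_return(best_order, dist, gamma, start)
```

For example, with `dist = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]` and `gamma = 0.5`, visiting node 1 and then node 2 yields 0.5 + 0.25, while the reverse order yields 0.25 + 0.125.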
3 Local heuristics
We consider the SMDP described by the complete graph derived from in Proposition 1.
Recall that we have a reward in each vertex of .
We start by defining a local policy.
All the policies which we analyze are local policies.
A local policy is a mapping whose inputs are:
(1) the current state ,
(2) the history containing the previous steps taken by the policy; in particular includes the rewards that we have already collected, and
(3) the discounted return for each option from the current state, i.e., ,
and whose output is a distribution over the options.
Notice that a local policy does not have full information on the MDP (but only on local distances). (Notice that the optimal global policy, computed given the entire SMDP as input, has the same history dependence as local policies. However, while the global policy is deterministic, our results show that for local policies, stochastic policies are better than deterministic ones. It follows that the benefit of stochastic selection is due to the local information and not due to the dependence on the history. This implies that locality is related to partial observability, for which it is known that stochastic policies can be better (Aumann et al., 1996).)
A local policy is a mapping:
where is the set of distributions over a finite set .
3.1 The performance of NN
We start with an analysis of the natural Nearest Neighbor (NN) heuristic for the TSP. In the context of our problem, NN is the policy that selects the option with the highest estimated value, exactly like GPI (Barreto et al., 2017). We shall abuse the notation slightly and use the same name (e.g., NN) for the algorithm itself and its value; no confusion will arise. For TSP (without discount) in general graphs, we know that (Rosenkrantz et al., 1977):
However, for RDTSP the NN algorithm only guarantees a value of as Theorem 3.1 states. In the next subsection, we prove a lower bound for deterministic local policies (such as NN) of This implies that NN is optimal for deterministic policies. [NN Performance] For any MDP satisfying Definition 2 with rewards, and ,
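Denote by $i^*$ the nearest reward to the origin $s_0$, and by $d_{0,i^*}$ the distance from the origin to $i^*$. The distance from $s_0$ to the first reward collected by OPT is at least $d_{0,i^*}$. Thus, if $o_1, \dots, o_n$ are the rewards ordered in the order in which OPT collects them, each of them is collected after traveling a distance of at least $d_{0,i^*}$, and we get that
$\mathrm{OPT} \le n \gamma^{d_{0,i^*}}$.
On the other hand, the NN heuristic chooses $i^*$ in the first round; thus, its cumulative reward is at least $\gamma^{d_{0,i^*}}$ and we get that
$\mathrm{NN} \ge \gamma^{d_{0,i^*}} \ge \mathrm{OPT}/n$.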
RandomNN: Next, we propose a simple, easy to implement, stochastic adjustment to the NN algorithm with a better upper bound which we call RandomNN (RNN). The algorithm starts by collecting one of the rewards, say , picked at random, and continues by executing NN (Algorithm 1).
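A minimal sketch of NN and of the random first step of RNN, operating on the distance matrix of the SMDP graph. It is not the paper's Algorithm 1: the names `nearest_neighbor_order` and `random_nn_order` are hypothetical, node 0 is assumed to be the start, and ties are broken by the lowest index. Under the assumption of equal reward magnitudes, an option's value decreases with distance, so picking the nearest uncollected reward is the same as picking the option with the highest value.

```python
import random

def nearest_neighbor_order(dist, start=0, collected=()):
    """Repeatedly move to the closest uncollected reward; returns the visiting order."""
    remaining = set(range(len(dist))) - {start} - set(collected)
    order, current = [], start
    while remaining:
        nxt = min(remaining, key=lambda v: (dist[current][v], v))  # ties -> lowest index
        order.append(nxt)
        remaining.remove(nxt)
        current = nxt
    return order

def random_nn_order(dist, start=0, rng=None):
    """Jump to a uniformly random reward first, then continue with NN from there."""
    rng = rng or random.Random()
    rewards = [v for v in range(len(dist)) if v != start]
    if not rewards:
        return []
    first = rng.choice(rewards)
    # The original start node holds no reward, so it is excluded from the NN continuation.
    return [first] + nearest_neighbor_order(dist, start=first, collected=[start])
```

Both functions only read distances from the current node, which is the sense in which these policies are local.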
[RNN Performance] For an MDP satisfying Definition 2 with rewards, and :
We now analyze the performance guarantees of the RNN method. The analysis is conducted in two steps. In the first step, we assume that OPT achieved a value of by collecting rewards and consider the case that . The second step considers the more general case and analyzes the performance of RNN for the worst value of . We emphasize that unlike the previous two algorithms, we do not assume this time that OPT collects its rewards in a segment of length ; therefore, we do not perform a third step as we did in the analysis of the previous methods.
Step 1. Assume OPT collects rewards. Define and (here we can replace the by any fractional power of , this will not affect the asymptotics of the result) and denote by the CCs that are obtained by pruning edges longer than . We define a CC to be large if it contains more than rewards. Observe that since there are at most CCs (Lemma 2), at least one large CC exists.
Assume that is in a large component . Let be the path covered by NN starting from until it reaches in a large component. Let be the length of and let be the number of rewards collected by NN in (including the last reward in which is back in a large component, but not including ). Note that . Then . Let be the prefix of that ends at the th reward on () and let be the length of . Let be the distance from the th reward on to the th reward on . Since when NN is at the th reward on , the neighbor of in is at distance at most from this reward we have that . Thus, (with the initial condition ). The solution to this recurrence is .
For , we have that after visits of RNN in large CCs, for any in a large CC there exists an unvisited reward at distance shorter than from .
Let be a reward in a large component . We have collected at most rewards from . Therefore, there exists a reward which we have not collected at distance at most from .
Assume that , and let be the path of NN from its th reward in a large CC to its st reward in a large connected component. Let denote the length of and be the number of rewards on (excluding the first and including the last). Then .
The following lemma concludes the analysis of this step.
Let be the prefix of RNN of length . Let be the number of segments on of RNN that connect rewards in large CCs and contain internally rewards in small CCs. For , let be the number of rewards RNN collects in the th segment. Then . (We assume that splits exactly into segments, but in fact the last segment may be incomplete, this requires a minor adjustment in the proof.) Since then if the lemma follows. So assume that . By Corollary 3.1 we have that
(1) 
where . Since , Equation (1) implies that and since we get that . Taking logs the lemma follows.
Lemma 3.1 guarantees that once at RNN collects rewards before traversing a distance of . Next, notice that the chance that (as defined in Algorithm 3) belongs to one of the large CCs is , which is larger than for .
Finally, similar to NNRDFS, assume that the value of OPT is greater than a constant fraction of , i.e., This means that OPT must have collected the first rewards after traversing a distance of at most (to see this, recall that after traversing a distance of , OPT achieved less than ; since it already traversed , it can only achieve less than from the remaining rewards, in contradiction with the assumption that it achieved more than ), and denote this fraction of the rewards by . Further denote by the shortest and longest distances from to respectively. By the triangle inequality,
therefore, with a constant probability of
we get that By taking expectation over the first random pick, it follows that
Step 2. Similar to the analysis of NNRDFS, we now assume that OPT collects its value from rewards that it collects in a segment of length (and from all other rewards OPT collects a negligible value). Recall that RNN is either NN with probability or a random pick with probability followed by NN. By picking the single reward closest to the starting point, NN gets at least of the value of OPT. Notice that we do not need to assume anything about the length of the tour that OPT takes to collect the rewards (since we did not use it in Step 1). It follows that:
Thus, in the worst case scenario, , which implies that . Therefore .
Theorem 1 shows that our stochastic modification to NN improves its guarantees by a factor of While the improvement over NN may seem small () the observation that stochasticity improves the performance guarantees of local policies is essential to our work. In the following sections, we derive more sophisticated randomized algorithms with better performance guarantees.
3.2 Impossibility Results
The NN heuristic guarantees performance of at least We now describe an MDP at which NN indeed cannot guarantee more than Such an MDP is presented in Figure 1 (left). The agent starts in the center, there are rewards at a distance to its left and a single reward at a distance to its right. Inside the left ”room” the rewards are connected to each other with short edges (compared to ). For large values of (larger than and the size of each room), only the rewards in the room that is visited first contribute to the value function because of discounting. Since the single reward is located slightly closer to the agent, it will be collected first. As a result, NN only achieves (only collects the reward to the right).
Next, we show an impossibility result for all deterministic local policies, indicating that no such policy can guarantee more than which makes NN optimal over such policies. An example for the MDP that is used in the proof is presented in Figure 1 (right). [Impossibility for Deterministic Local Policies] For any deterministic local policy DLocal, MDP with rewards and a discount factor s.t.:
Consider a family of graphs, , each of which consists of a star with a central vertex and leaves. The starting vertex is the central vertex, and there is a reward at each leaf. The length of each edge is (s.t ).
Each graph of the family corresponds to a different subset of of the leaves which we connect (pairwise) by edges of length (the other leaves are only connected to the central vertex). While at the central vertex, a local policy cannot distinguish among the rewards (they are all at the same distance from the origin), and therefore its choice is the same for all graphs in the family (the next decision is also the same and so on, as long as it does not hit one of the special rewards).
Thus, for any given policy, there exists a graph in such that the adjacent rewards are visited last. Since we have that therefore
Stochastic Local Policies: The last theorem implies that NN is optimal over local deterministic policies, and in the previous section, we saw that a small stochastic adjustment could improve its guarantees. These observations motivated us to look for better local policies in the broader class of stochastic local policies. Theorem 1 gives a stronger impossibility result for such policies. The MDP that is used in the proof is very similar to the one presented in Figure 1 (right), but now there are rewards in the clique instead of .
[Impossibility for Stochastic Local Policies] For each stochastic local policy SLocal, MDP with rewards and a discount factor s.t.:
We consider a family of graphs, , each of which consists of a star with a central vertex and leaves. The starting vertex is the central vertex, and there is a reward at each leaf. The length of each edge is , where is chosen such that . Each graph in corresponds to a subset of leaves which we pairwise connect to form a clique.
Since we have that and therefore
OPT 
On the other hand, local policy at the central vertex cannot distinguish among the rewards and therefore for every graph in it picks the first reward from the same distribution. The policy continues to choose rewards from the same distribution until it hits the first reward from the size clique.
To argue formally that every SLocal policy has small expected reward on a graph from , we use Yao’s principle (Yao, 1977)
and consider the expected reward of a DLocal policy on the uniform distribution over
.Let be the probability that DLocal picks its first vertex from the size clique. Assuming that the first vertex is not in the clique, let be the probability that the second vertex is from the clique, and let , , be defined similarly. When Dlocal picks a vertex in the clique then its reward (without the cumulative discount) is . However, each time DLocal misses the clique then it collects a single reward but suffers a discount of . Neglecting the rewards collected until it hits the clique, the total value of DLocal is
Since for this value is
We do not have a policy that achieves this lower bound, but we now propose and analyze two stochastic policies (in addition to the RNN) that substantially improve over the deterministic upper bound. As we will see, these policies exhibit a trade-off between guarantees and simplicity: policies with better guarantees are more complicated and require more computation.
3.3 NN with Randomized Depth First Search (RDFS)
We now describe the NNRDFS policy (Algorithm 2), our best performing local policy. The policy randomizes between NN and a local policy that we call RDFS. RDFS first collects a random reward and continues by performing DFS on edges shorter than a threshold, which is chosen at random as we specify later. (DFS is an algorithm for traversing a graph: it starts at the root node and explores as far as possible along each branch, with the shortest first, before backtracking.) When it runs out of edges shorter than the threshold, RDFS continues by performing NN. The performance guarantees for the NNRDFS method are stated in Theorem 2.
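The RDFS component described above can be sketched as follows. This is an illustration, not the paper's Algorithm 2: `rdfs_order` and the threshold argument `theta` are hypothetical names, the threshold is passed in rather than drawn at random, node 0 is assumed to be the start, and the function only returns the order in which rewards are collected. It further assumes that consecutive rewards are joined directly in the complete SMDP graph, so a backtracking move is represented by the direct edge to the next reward.

```python
import random

def rdfs_order(dist, theta, start=0, rng=None):
    """Random first reward, DFS over edges shorter than `theta`, then NN on what is left."""
    rng = rng or random.Random()
    n = len(dist)
    rewards = [v for v in range(n) if v != start]
    if not rewards:
        return []
    visited, order = {start}, []

    def collect(v):
        visited.add(v)
        order.append(v)

    first = rng.choice(rewards)
    collect(first)
    stack = [first]
    while stack:
        top = stack[-1]
        short = [v for v in range(n) if v not in visited and dist[top][v] < theta]
        if short:
            nxt = min(short, key=lambda v: (dist[top][v], v))  # shortest edge first
            collect(nxt)
            stack.append(nxt)
        else:
            stack.pop()  # no short edge from here: backtrack

    # Out of short edges: continue with nearest neighbor from the last collected reward.
    current = order[-1]
    while len(visited) < n:
        nxt = min((v for v in range(n) if v not in visited),
                  key=lambda v: (dist[current][v], v))
        collect(nxt)
        current = nxt
    return order
```

With rewards at positions 10, 11, 12 and 30 on a line and `theta = 1.5`, a random start inside the tight cluster makes the DFS phase exhaust that cluster before the NN phase moves on to the distant reward.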
[NNRDFS Performance] For any MDP that satisfies Definition 2, with rewards,
Step 1. Assume that OPT collects a set of rewards for some fixed , in a segment of length (i.e. is the distance from the first reward to the last reward – it does not include the distance from the starting point to the first reward). Let the shortest and longest distances from to a reward in respectively. By the triangle inequality, We further assume that (i.e., That is the value that OPT collects from rewards which are not in is negligible). We now show that RDFS is for . We start with the following Lemma. For any path of length , and , there are less than edges in that are larger than . For contradiction, assume there are more than edges longer than . The length of is given by thus a contradiction to the assumption that the path length is at most .
Lemma 2 assures that after pruning all edges longer than (from the graph), breaks into at most Connected Components (CCs). Let be the number of such CCs containing . In addition, it holds that , and the edges inside any connected component are shorter than .
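The pruning step used in this analysis can be illustrated with a small union-find sketch. The names `components_after_pruning` and `theta` are hypothetical; `nodes` is whichever subset of rewards the caller wants to partition, and an edge is kept when its length is at most `theta`, matching "pruning all edges longer than" the threshold.

```python
def components_after_pruning(dist, nodes, theta):
    """Connected components of `nodes` when only edges of length <= theta are kept."""
    nodes = list(nodes)
    parent = {v: v for v in nodes}

    def find(v):
        # Follow parent pointers to the root, halving the path along the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            if dist[u][v] <= theta:
                parent[find(u)] = find(v)  # merge the two components

    groups = {}
    for v in nodes:
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values())
```

A component is then "small" or "large" according to how many rewards it contains relative to whatever cutoff the analysis uses; that cutoff is not part of this sketch.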
Next, we (lower) bound the total gain of RDFS in its prefix of length following its initial reward . Say . Then, since all edges in are shorter than , it collects in this prefix either all the rewards in , or at least rewards overall. That is rewards. To see this, recall that the DFS algorithm traverses each edge at most twice. In addition, as long as it did not collect all the rewards of , the length of each edge that the DFS traverses is at most . Thus, if the algorithm did not collect all the rewards in in its prefix of length , then it collected at least rewards in this prefix.
Notice that the first random step leads RDFS to a vertex in CC with probability . We say that a connected component is small if and we say that it is large otherwise. Let and let be the number of small CCs. If more than half of the rewards are in small components (), then
RDFS  
On the other hand, if more than half of rewards are not in (), then
RDFS 
By setting we guarantee that the value of RDFS is at least . Since ,
where the last inequality follows from the triangle inequality.
Step 2. Assume that OPT gets its value from rewards that it collects in a segment of length (and from all other rewards OPT collects a negligible value). Recall that the NNRDFS policy is either NN with probability or RDFS with probability . By picking the single reward closest to the starting point, NN gets at least of the value of OPT. Otherwise, with probability , RDFS starts with one of the rewards picked by OPT and then, by the analysis of step 1, if it sets , RDFS collects of the value collected by OPT (we use Step (1) with ). It follows that
This lower bound is smallest when , in which case NNRDFS collect of OPT. Notice that since is not known to NNRDFS, it has to be guessed in order to choose . This is done by setting at random from . This guarantees that with probability our guess for will be off of its true value by a factor of at most . This guess approximates by a factor of at most of its true value. Finally, these approximations degrade our bounds by a factor of
Step 3. Finally, we consider the general case where OPT may collect its value in a segment of length larger than . Notice that the value which OPT collects from rewards that follow the first segments of length in its tour is at most (since ). This means that there exists at least one segment of length in which OPT collects at least of its value. Combining this with the analysis in the previous step, the proof is complete.
3.4 NN with a Random Ascent (RA)
We now describe the NNRA policy (Algorithm 3). Similar in spirit to NNRDFS, the policy performs NN with probability and local policy which we call RA with probability . RA starts at a random node, , sorts the rewards in increasing order of their distance from and then collects all other rewards in this order. The algorithm is simple to implement, as it does not require guessing any parameters (like which RDFS has to guess). However, this comes at the cost of a worse bound.
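The RA component is short enough to sketch directly. This is an illustration rather than the paper's Algorithm 3: `random_ascent_order` is a hypothetical name, node 0 is assumed to be the start, and ties in distance are broken by the lowest index.

```python
import random

def random_ascent_order(dist, start=0, rng=None):
    """Pick a random reward, then collect the others in increasing distance from it."""
    rng = rng or random.Random()
    rewards = [v for v in range(len(dist)) if v != start]
    if not rewards:
        return []
    pivot = rng.choice(rewards)
    others = [v for v in rewards if v != pivot]
    # The order depends only on distances measured from the pivot, not from the
    # agent's current position, which is what distinguishes RA from NN.
    return [pivot] + sorted(others, key=lambda v: (dist[pivot][v], v))
```

Unlike the RDFS sketch, no threshold has to be supplied, which reflects the remark above that RA does not need to guess any parameter.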
The performance guarantees for the NNRA method are given in Theorem 3. The analysis follows the same steps as the proof of the NNRDFS algorithm. We emphasize that here, the pruning parameter is only used for analysis purposes and is not part of the algorithm. Consequently, we see only one logarithmic factor in the performance bound of Theorem 3 in contrast with two in Theorem 2.
[NNRA Performance] For any MDP that satisfies Definition 2 with rewards, :
Step 1. Assume that OPT collects a set of rewards for some , in a segment of length (i.e. is the distance from the first reward to the last reward – it does not include the distance from the starting point to the first reward). Let the shortest and longest distances from to a reward in respectively. By the triangle inequality, We further assume that (i.e., That is the value that OPT collects from rewards which are not in is negligible).
Let be a threshold that we will fix below, and denote by the CCs of that are created by deleting edges longer than among vertices of . By Lemma 2, we have at most CC.
Assume that RA starts at a vertex of a component , such that . Since the diameter of is at most then it collects its first vertices (including ) within a total distance of . So if then it collects at least rewards before traveling a total distance of , and if it collects at least rewards. (We shall omit the floor function for brevity in the sequel.) It follows that RA collects rewards. Notice that the first random step leads RA to a vertex in CC with probability . If more than half of rewards are in CCs s.t then
If more than half of rewards in are in CCs such that let be the number of such CCs and notice that We get that:
RA  
By setting we guarantee that the value of RA is at least . Since ,
where the last inequality follows from the triangle inequality.
Step 2. Assume that OPT gets its value from rewards that it collects in a segment of length (and from all other rewards OPT collects a negligible value). Recall that the NNRA policy is either NN with probability or RA with probability . By picking the single reward closest to the starting point, NN gets at least of the value of OPT. Otherwise, with probability , RA starts with one of the rewards picked by OPT and then, by the analysis of step 1, if it sets , RA collects of the value collected by OPT (we use Step (1) with ). Thus,
This lower bound is smallest when , in which case NNRA collects of OPT.
Step 3. By the same arguments from Step 3 in the analysis of NNRDFS, it follows that
4 Simulations
4.1 Learning Simulations
We evaluated our algorithms in an MDP that satisfies Definition 2, and in more challenging generalized settings in which the MDP is stochastic. We also evaluated them throughout the learning process of the options, when the option policies (and value functions) are suboptimal. (We leave the theoretical analysis of these settings to future work. Note that with suboptimal options the agent may reach the reward in suboptimal time and may even fail to reach the reward at all. Furthermore, in stochastic MDPs, while executing an option the agent may arrive at a state where it prefers to switch to a different option rather than completing the execution of the current option.)
Our experiments show that our policies perform well, even when some of our assumptions for our theoretical analysis are relaxed.
Setup. An agent (yellow) is placed in a X gridworld domain (Figure 2, top left). Its goal is to navigate the maze and collect the available rewards (teal) as fast as possible. The agent can move up, down, left, and right, without crossing walls (red). In the stochastic scenario, the same four actions are available, but once an action, say up, is chosen, there is a chance that a random action (chosen uniformly) will be executed instead. We are interested in testing our algorithms in the regime where OPT can collect almost all of the rewards within a constant discount (i.e., ), but there also exist bad tours that achieve only a constant value (i.e., taking the most distant reward in each step); thus, we set
The agent consists of a set of options and a policy over the options. There is one option per reward; each option learns, by interacting with the environment, a policy that moves the agent from its current position to the reward along the shortest path. We learned the options in parallel using Q-learning. We performed the learning in epochs.
At each epoch, we initialized each option in a random state, and performed Q-learning steps either until the option found the reward or until steps had passed. Every phase of epochs ( steps), we tested our policies with the available set of options. We performed this evaluation for phases, resulting in a total of M training epochs for each option. At the end of these epochs the policy for each option was approximately optimal.
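The epoch structure described above can be sketched for a single option as follows. This is a minimal illustration, not the authors' implementation: the function names (`step`, `train_option`, `greedy_steps`), the grid size, the number of epochs, the step limit, and the values of `gamma`, `alpha`, and `epsilon` are all assumptions, since the paper's settings are not recoverable from the text here. The sketch uses deterministic dynamics, which is why a learning rate of 1 is safe; the stochastic scenario would need a smaller one.

```python
import random

ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def step(state, action, size, walls):
    """Deterministic move; hitting a wall or the border leaves the agent in place."""
    r, c = state
    dr, dc = ACTIONS[action]
    nxt = (r + dr, c + dc)
    if not (0 <= nxt[0] < size and 0 <= nxt[1] < size) or nxt in walls:
        return state
    return nxt


def train_option(goal, size=5, walls=frozenset(), epochs=2000, max_steps=100,
                 gamma=0.95, alpha=1.0, epsilon=0.3, seed=0):
    """Tabular Q-learning for one option whose only reward sits at `goal`.

    All hyperparameter defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    free = [(r, c) for r in range(size) for c in range(size) if (r, c) not in walls]
    q = {(s, a): 0.0 for s in free for a in range(4)}
    for _ in range(epochs):
        state = rng.choice(free)          # each epoch starts in a random state
        for _ in range(max_steps):        # stop at the reward or after max_steps
            if state == goal:
                break
            if rng.random() < epsilon:
                action = rng.randrange(4)
            else:                         # greedy action, ties broken at random
                best = max(q[(state, a)] for a in range(4))
                action = rng.choice([a for a in range(4) if q[(state, a)] == best])
            nxt = step(state, action, size, walls)
            if nxt == goal:
                target = 1.0              # unit reward, episode terminates
            else:
                target = gamma * max(q[(nxt, a)] for a in range(4))
            q[(state, action)] += alpha * (target - q[(state, action)])
            state = nxt
    return q


def greedy_steps(q, start, goal, size=5, walls=frozenset(), limit=100):
    """Number of steps the greedy option policy needs, or None if it fails."""
    state, steps = start, 0
    while state != goal and steps < limit:
        action = max(range(4), key=lambda a: q[(state, a)])
        state = step(state, action, size, walls)
        steps += 1
    return steps if state == goal else None
```

Calling `greedy_steps` after different numbers of training epochs gives the kind of per-phase evaluation the text describes: early on it may return `None` (the option does not reach the reward), and later it should approach the shortest-path length.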
Options: Figure 2 (top, right) shows the quality of the options in collecting the reward during learning in the stochastic MDP. For each of the phases of epochs ( steps) we plot the fraction of the runs in which the option reached the reward (red), and the option time gap (blue). The option time gap is the time it took the option to reach the goal, minus the length of the deterministic shortest path from the initial state to the reward state. We can see that the options improve as learning proceeds, eventually reaching the reward in more than of the runs. The time gap converges, but not to zero, since the shortest stochastic path (due to the random environment and the greedy policy with ) is longer than the shortest deterministic path.
Local policies: At the bottom of Figure 2, we show the performance of the different policies (using the options available at the end of each of the phases), measured by the discounted cumulative return. In addition to our four policies, we also evaluated two additional heuristics for comparison. The first, denoted by RAND, is a random policy over the options, which selects an option at random at each step. RAND performs the worst since it keeps moving in and out of clusters. The second, denoted by OPT (with a slight abuse of notation), is a fast approximation to OPT that uses all the information about the MDP (not only local information): it checks all possible choices of the first two clusters and all possible paths through the rewards that they contain, and picks the best. This is a good approximation for OPT since discounting makes the rewards collected after the first two clusters negligible. OPT performs better than our policies because it has knowledge of the full SMDP. On the other hand, our policies perform competitively without learning the policy over options at all (a zero-shot solution).
Among the local policies, we can see that NN does not perform well since it is “tempted” to collect nearby rewards in small clusters instead of going to large clusters. The stochastic algorithms, on the other hand, choose the first reward at random; thus they have a higher chance of reaching larger clusters, and consequently they perform better than NN. RNN and NNRDFS perform the best and almost the same because, effectively, inside the first two clusters RDFS takes a tour which is similar to the one taken by NN. This happens because is larger than most pairwise distances inside clusters. NNRA performs worse than the other stochastic algorithms, since sorting the rewards by their distances from the first reward in the cluster introduces an undesired “zigzag” behavior, in which rewards that are at approximately the same distance from the first one are not collected in the right order.
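The option time gap defined above needs the deterministic shortest-path length as a baseline. A minimal sketch of how it could be computed on a gridworld with walls is a breadth-first search; the function names and the grid encoding (cells as `(row, col)` tuples, walls as a set of blocked cells) are assumptions made for this illustration.

```python
from collections import deque


def shortest_path_length(start, goal, size, walls=frozenset()):
    """Breadth-first search over free cells; returns None if goal is unreachable."""
    frontier = deque([(start, 0)])
    seen = {start}
    while frontier:
        (r, c), dist = frontier.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nxt = (r + dr, c + dc)
            if (0 <= nxt[0] < size and 0 <= nxt[1] < size
                    and nxt not in walls and nxt not in seen):
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return None


def option_time_gap(steps_taken, start, goal, size, walls=frozenset()):
    """Steps the option actually needed minus the deterministic shortest path."""
    return steps_taken - shortest_path_length(start, goal, size, walls)
```

A gap of zero means the option followed a shortest path; in the stochastic scenario a strictly positive gap is expected even for a well-trained option, as the text notes.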
4.2 Planning Simulations
In this section, we evaluate and compare the performance of deterministic and stochastic local policies by measuring the (cumulative discounted) reward achieved by each algorithm on different MDPs as a function of the number of the rewards, with For each MDP, the algorithm is provided with the set of optimal options and their corresponding value functions. We are interested in testing our algorithms in the regime where OPT can collect almost all of the rewards within a constant discount (i.e., ), but there also exist bad tours that achieve only a constant value (i.e., taking the most distant reward in each step); thus, we set We always place the initial state at the origin, i.e., . We define , and denotes a short distance.
Next, we describe five MDP types (Figure 4, ordered from left to right) that we considered for evaluation. For each of these MDP types, we generate different MDPs, and report the average reward achieved by each algorithm (Figure 3, top) and the worst-case reward (the minimum among the scores) (Figure 3, bottom). As some of our algorithms are stochastic, we report average results, i.e., for each MDP we run each algorithm times and report the average score.
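To make the evaluation metric concrete, the following sketch computes the cumulative discounted reward of a tour and builds the NN tour for rewards given as points in the plane. It assumes unit rewards, Euclidean travel time between rewards, and a start at the origin; the function names are hypothetical and the code is an illustration of the metric rather than the authors' evaluation code.

```python
import math


def discounted_value(tour, gamma, start=(0.0, 0.0)):
    """Sum of gamma**(distance travelled so far) over the rewards in tour order."""
    pos, travelled, value = start, 0.0, 0.0
    for point in tour:
        travelled += math.dist(pos, point)
        value += gamma ** travelled   # unit reward, discounted by arrival time
        pos = point
    return value


def nn_tour(points, start=(0.0, 0.0)):
    """Nearest-neighbor policy: always move to the closest uncollected reward."""
    remaining, pos, tour = list(points), start, []
    while remaining:
        nearest = min(remaining, key=lambda p: math.dist(pos, p))
        remaining.remove(nearest)
        tour.append(nearest)
        pos = nearest
    return tour
```

For example, with rewards at `(1, 0)`, `(2, 0)`, and `(-1.5, 0)`, NN collects them in that order and travels distances 1, 1, and 3.5, so its value is `gamma + gamma**2 + gamma**5.5`.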
Figure 4 visualizes these MDPs, for rewards, where each reward is displayed on a 2D grid using gray dots. For each MDP type, we present a single MDP sampled from the appropriate distribution. For the stochastic algorithms, we present the best (Figure 5) and the worst tours (Figure 4), among 20 different runs (for NN we display the same tour since it is deterministic). Finally, for better interpretability, we only display the first rewards of each tour, in which the policy collects most of its value, with unless mentioned otherwise.
(1) Random Cities. For a vanilla TSP with rewards (nodes) randomly distributed on a plane, it is known that the NN algorithm yields a tour which is longer than optimal on average (Johnson and McGeoch, 1997). We used a similar input to compare our algorithms, specifically, we generated an MDP with rewards where U is the uniform distribution.
Figure 3 (left) presents the results for such an MDP. We can see that the NN algorithm performs the best both on average and in the worst case. This observation suggests that when the rewards are distributed at random, selecting the nearest reward is a reasonable thing to do. In addition, we can see that NNRDFS performs the best among the stochastic policies (as predicted by our theoretical results). On the other hand, the RA policy performs the worst among the stochastic policies. This happens because sorting the rewards by their distances from introduces an undesired “zigzag” behavior while collecting rewards at equal distance from (Figure 4).
(2) Line. This MDP demonstrates a scenario where greedy algorithms like NN and RNN are likely to fail. The rewards are located in three different groups; each contains of the rewards. In group 1, the rewards are located in a cluster to the left of the origin , while in group 2 they are located in a cluster to the right of the origin but a bit closer than group 1 . Group 3 is also located to the right, but the rewards are placed at increasing distances, such that the i-th reward is located at .
For visualization purposes, we added a small variance in the locations of the rewards at groups 1 and 2 and rescaled the axes. The two vertical lines of rewards represent these two groups, while we cropped the graph such that only the first few rewards in group 3 are observed. Finally, we chose , such that the first half of the tour is displayed, and we can see the first two groups visited in each tour.
Inspecting the results, we can see that NN and RNN indeed perform the worst. To understand this, consider the tour that each algorithm takes. NN goes to group 2, then 3, then 1 (and loses a lot from going to 3). The stochastic tours depend on the choice of . If it belongs to group 1, they collect group 1, then 2, then 3, from left to right, and perform roughly the same. If it belongs to group 3, they will first collect the rewards to the left of in ascending order and then come back to collect the remaining rewards to the right, again performing roughly the same. However, if is in group 2, then NNRDFS and NNRA will visit group 1 before going to 3, while RNN is tempted to go to group 3 before going to 1 (and loses a lot from doing so).
(3) Random Clusters. This MDP demonstrates the advantage of stochastic policies. We first randomly place cluster centers , on a circle of radius . Then to draw a reward we first draw a cluster center uniformly and then draw such that .
This scenario is motivated by maze navigation problems, where collectible rewards are located in rooms (clusters) while in between rooms there are fewer rewards to collect (similar to Figure 2). Inspecting the results, we can see that NNRDFS and RNN perform the best, in particular in the worst-case scenario. The reason for this is that NN picks the nearest reward, and most of its value comes from rewards collected in the cluster containing that reward. On the other hand, the stochastic algorithms visit larger clusters first with higher probability and achieve higher value by doing so.
(4) Circle. In this MDP, there are circles, all centered at the origin, and the radius of the i-th circle is . On each circle we place rewards at equal distances. This implies that the distance between adjacent rewards on the same circle is longer than the distance between adjacent rewards on two consecutive circles.
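The sampling procedure for the Random Clusters MDP can be sketched as below. The paper's number of clusters, circle radius, and within-cluster distribution are not recoverable from the text here, so every parameter of `random_clusters` is a hypothetical placeholder, and the isotropic Gaussian noise around each center is an assumption made for illustration.

```python
import math
import random


def random_clusters(n_rewards, n_clusters, radius, spread, seed=0):
    """Sample rewards around cluster centers placed at random on a circle.

    Assumptions (not from the paper): Gaussian noise with standard deviation
    `spread` around the chosen center, and all argument values.
    """
    rng = random.Random(seed)
    centers = []
    for _ in range(n_clusters):
        angle = rng.uniform(0.0, 2.0 * math.pi)
        centers.append((radius * math.cos(angle), radius * math.sin(angle)))
    rewards = []
    for _ in range(n_rewards):
        cx, cy = rng.choice(centers)      # draw a cluster center uniformly
        rewards.append((rng.gauss(cx, spread), rng.gauss(cy, spread)))
    return centers, rewards
```

Because each reward picks its center uniformly, cluster sizes vary from draw to draw, which is what gives a policy with a random first step a higher chance of starting in a large cluster.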
Examining the tours, we can see that indeed NN and RNN are taking tours that lead them to the outer circles. On the other hand, RDFS and RA are staying closer to the origin. Such local behavior is beneficial for RDFS, which achieves the best performance in this scenario. However, while RA performs well in the best case, its performance is much worse than the other algorithms in the worst case. Hence, its average performance is the worst in this scenario.
(5) Rural vs. Urban. Here, the rewards are sampled from a mixture of two normal distributions. Half of the rewards are located in a “city”, i.e., their position is a Gaussian random variable with a small standard deviation s.t. ; the other half is located in a “village”, i.e., their position is a Gaussian random variable with a larger standard deviation s.t. To improve the visualization here, we chose , such that the first half of the tour is displayed. Since half of the rewards belong to the city, choosing ensures that, for any tour that reaches the city, only the first segment of the tour (until the tour reaches the city) is displayed.
In this MDP, we can see that in the worst-case scenario, the stochastic policies perform much better than NN. This happens because NN mistakenly chooses rewards that take it to remote places in the rural area, while the stochastic algorithms remain near the city with high probability and collect its rewards.
5 Related work
Predefined rules for option selection are used in several studies. Karlsson (1997) suggested a policy that chooses greedily with respect to the sum of the local Q-values . Humphrys (1996) suggested choosing the option with the highest local Q-value (NN).
Barreto et al. (2017) considered a general transfer learning problem in RL, where the dynamics are shared across a set of MDPs, and the reward in the i-th MDP is linear in some reward features . They suggested using NN (pick the option of maximum value) as the predefined rule for option selection (but referred to it as Generalized Policy Improvement (GPI)), and provided performance guarantees for using GPI in the form of additive (regret-based) error bounds. In contrast, we prove multiplicative performance guarantees for NN and for our stochastic policies. We also prove, for the first time, impossibility results for such local selection rules. Since our Definition 2 is a special case of the framework of Barreto et al. (2017), our impossibility results apply to their framework as well.
A different approach to tackle these challenges is multi-task learning, in which the options are optimized in parallel with the policy over options (Russell and Zimdars, 2003; Sprague and Ballard, 2003; van Seijen et al., 2017). One method that achieves this goal is the local SARSA algorithm (Russell and Zimdars, 2003; Sprague and Ballard, 2003). Similar to (Karlsson, 1997), a Q-function is learned locally for each option (with respect to a local reward). However, here the local Q-functions are learned on-policy (using SARSA) with respect to the policy over options, instead of being learned off-policy with Q-learning. Russell and Zimdars (2003) showed that if the policy over options is updated in parallel with the local SARSA updates, then the local SARSA algorithm is guaranteed to converge to the optimal value function.
6 Conclusions
We establish theoretical guarantees for policies to collect rewards, based on reward decomposition in deterministic MDPs. Using reward decomposition one can learn many (option) policies in parallel and combine them into a composite solution efficiently. In particular, we focus on approximate solutions that are local, and therefore easy to implement and undemanding in computational resources. Local deterministic policies, like NN, are being used in practice for hierarchical reinforcement learning. Our study provides theoretical guarantees on the reward collected by such local policies, as well as impossibility results. Our theoretical results show that our stochastic policies outperform NN in the worst case.
We tested our policies in a practical maze navigation setup. Our experiments show that our randomized local policies work well compared to the optimal policy and better than the NN policy. Furthermore, we demonstrated that this also holds throughout the options’ learning process (when their policies are suboptimal), and even when the actions are stochastic. We expect to see similar results if each option is learned with function approximation techniques like DNNs (Tessler et al., 2017; Bacon et al., 2017).
References
Aumann et al. (1996) Robert J Aumann, Sergiu Hart, and Motty Perry. The absent-minded driver. In Proceedings of the 6th conference on Theoretical aspects of rationality and knowledge, pages 97–116, 1996.
Bacon et al. (2017) Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. AAAI, 2017.
Barreto et al. (2017) Andre Barreto, Remi Munos, Tom Schaul, and David Silver. Successor features for transfer in reinforcement learning. Advances in Neural Information Processing Systems, 2017.
Barto and Mahadevan (2003) Andrew G Barto and Sridhar Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(4), 2003.
Beattie et al. (2016) Charles Beattie, Joel Z Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich Küttler, Andrew Lefrancq, Simon Green, Víctor Valdés, Amir Sadik, et al. Deepmind lab. arXiv preprint, 2016.
Bellemare et al. (2013) Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research (JAIR), 47, 2013.
Bellman (1962) Richard Bellman. Dynamic programming treatment of the travelling salesman problem. Journal of the ACM (JACM), 1962.
Blum et al. (2007) Avrim Blum, Shuchi Chawla, David R Karger, Terran Lane, Adam Meyerson, and Maria Minkoff. Approximation algorithms for orienteering and discounted-reward TSP. SIAM Journal on Computing, 37(2), 2007.
Farbstein and Levin (2016) Boaz Farbstein and Asaf Levin. Discounted reward TSP. Algorithmica, pages 1–24, 2016.
Held and Karp (1962) Michael Held and Richard M Karp. A dynamic programming approach to sequencing problems. Journal of the Society for Industrial and Applied Mathematics, 10(1):196–210, 1962.
Humphrys (1996) Mark Humphrys. Action selection methods using reinforcement learning. From Animals to Animats, 4, 1996.
Johnson and McGeoch (1997) David S Johnson and Lyle A McGeoch. The traveling salesman problem: A case study in local optimization. Local search in combinatorial optimization, 1997.
Karlsson (1997) Jonas Karlsson. Learning to solve multiple goals. PhD thesis, Citeseer, 1997.
Mann et al. (2015) Timothy Arthur Mann, Shie Mannor, and Doina Precup. Approximate value iteration with temporally extended actions. Journal of Artificial Intelligence Research (JAIR), 53, pages 375–438, 2015.
Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540), 2015.
Moravčík et al. (2017) Matej Moravčík, Martin Schmid, Neil Burch, Viliam Lisý, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling. DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356, 2017.
Rosenkrantz et al. (1977) Daniel J Rosenkrantz, Richard E Stearns, and Philip M Lewis, II. An analysis of several heuristics for the traveling salesman problem. SIAM Journal on Computing, 6(3):563–581, 1977.
Russell and Zimdars (2003) Stuart J Russell and Andrew Zimdars. Q-decomposition for reinforcement learning agents. ICML, 2003.
Silver et al. (2016) David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
Sprague and Ballard (2003) Nathan Sprague and Dana Ballard. Multiple-goal reinforcement learning with modular SARSA. IJCAI, 2003.
Sutton and Barto (2018) Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. 2018.
Sutton et al. (1999) Richard S Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1–2), 1999.
Tesauro (1995) Gerald Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58–69, 1995.
Tessler et al. (2017) Chen Tessler, Shahar Givony, Tom Zahavy, Daniel J Mankowitz, and Shie Mannor. A deep hierarchical approach to lifelong learning in Minecraft. AAAI, 2017.
van Seijen et al. (2017) Harm van Seijen, Mehdi Fatemi, Joshua Romoff, Romain Laroche, Tavian Barnes, and Jeffrey Tsang. Hybrid reward architecture for reinforcement learning. Advances in Neural Information Processing Systems, 2017.
Vezhnevets et al. (2017) Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. Proceedings of the 34th International Conference on Machine Learning (ICML17), 2017.
Yao (1977) Andrew Chi-Chih Yao. Probabilistic computations: Toward a unified measure of complexity. In Proceedings of the 18th Annual Symposium on Foundations of Computer Science, pages 222–227. IEEE Computer Society, 1977.
Zahavy et al. (2018) Tom Zahavy, Avinatan Hasidim, Haim Kaplan, and Yishay Mansour. Hierarchical reinforcement learning: Approximating optimal discounted TSP using local policies. arXiv preprint arXiv:1803.04674, 2018.
7 Exact solutions for the RDTSP
We now present a variation of the Held-Karp algorithm for the RDTSP. Note that, similar to the TSP, denotes the length of the tour visiting all the cities in , with being the last one (for TSP this is the length of the shortest tour). However, our formulation requires the definition of an additional recursive quantity, , that accounts for the value function (the discounted sum of rewards) of the shortest path. Using this notation, we observe that Held-Karp is identical to doing tabular Q-learning on SMDP . Since Held-Karp is known to have exponential complexity, it follows that solving using SMDP algorithms is also of exponential complexity.
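As an illustration of why an exact solution is exponential, the sketch below solves a small discounted-reward instance by dynamic programming over (visited set, last reward) pairs, in the spirit of Held-Karp. It is not the recursion defined in this section: instead of tracking tour length and value forward, it computes a backward value-to-go, which is valid because discounting is multiplicative, so the best continuation does not depend on the time already elapsed. Unit rewards, Euclidean distances, a start at the origin, and the name `exact_discounted_value` are assumptions of this sketch.

```python
import math
from functools import lru_cache


def exact_discounted_value(points, gamma, start=(0.0, 0.0)):
    """Optimal discounted sum of unit rewards over all visiting orders.

    State: (bitmask of visited nodes, index of last node). There are
    O(2**n * n) states and O(n) transitions per state.
    """
    nodes = [start] + list(points)      # node 0 is the start and carries no reward
    n = len(nodes)
    full = (1 << n) - 1

    @lru_cache(maxsize=None)
    def value_to_go(visited, last):
        if visited == full:
            return 0.0
        best = 0.0
        for j in range(1, n):
            if not visited & (1 << j):
                d = math.dist(nodes[last], nodes[j])
                # collect reward j after travelling d, then continue optimally
                cont = value_to_go(visited | (1 << j), j)
                best = max(best, gamma ** d * (1.0 + cont))
        return best

    return value_to_go(1, 0)
```

The table has one entry per subset of rewards and last position, so memory and time grow exponentially with the number of rewards, which matches the complexity statement above and limits this approach to small instances.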