Finding Options that Minimize Planning Time

10/16/2018 · Yuu Jinnai et al. · Brown University

While adding temporally abstract actions, or options, to an agent's action repertoire can often accelerate learning and planning, existing approaches for determining which specific options to add are largely heuristic. We aim to formalize the problem of selecting the optimal set of options for planning, in two contexts: 1) finding the set of k options that minimize the number of value-iteration passes until convergence, and 2) computing the smallest set of options so that planning converges in less than a given maximum of ℓ value-iteration passes. We first show that both problems are NP-hard. We then provide a polynomial-time approximation algorithm for computing the optimal options for tasks with bounded return and goal states. We prove that the algorithm has bounded suboptimality for deterministic tasks. Finally, we empirically evaluate its performance against both the optimal options and a representative collection of heuristic approaches in simple grid-based domains including the classic four rooms problem.


1 Introduction

Markov Decision Processes (MDPs) (Puterman, 1994) are an expressive yet simple model of sequential decision-making environments. However, MDPs are computationally expensive to solve. One approach to solving such problems is to add high-level, temporally extended actions—often formalized as options (Sutton et al., 1999)—to the action space. The right set of options allows planning to probe more deeply into the search space with a single computation. Thus, if options are chosen appropriately, planning algorithms can find good plans with less computation.

Indeed, previous work has offered substantial support that abstract actions can accelerate planning. However, little is known about how to find the right set of options. Prior work often seeks to codify an intuitive notion of what underlies an effective option, such as identifying relatively unusual states (Şimşek & Barto, 2004), identifying bottleneck states or high-betweenness states (Şimşek et al., 2005; Şimşek & Barto, 2009; Bacon, 2013; Moradi et al., 2012), finding repeated policy fragments (Pickett & Barto, 2002), or finding states that often occur on successful trajectories (McGovern & Barto, 2001; Bakker & Schmidhuber, 2004). While such intuitions often capture important aspects of the role of options in planning, the resulting algorithms are somewhat heuristic in that they are not based on optimizing any precise performance-related metric; consequently, their relative performance can only be evaluated empirically.

We aim to formalize what it means to find the set of options that is optimal for planning, and to use the resulting formalization to develop a practical algorithm with a principled theoretical foundation. Specifically, we consider the problem of finding the smallest set of options such that planning converges within a given maximum number ℓ of value-iteration (VI) passes. We show that the problem is NP-hard. More precisely, the problem has the following properties:

  1. $2^{\log^{1-\epsilon} n}$-hard to approximate for any $\epsilon > 0$ unless NP $\subseteq$ DTIME($n^{\mathrm{polylog}(n)}$), where $n$ is the input size. (This is a standard complexity assumption; see, for example, Dinitz et al. (2012).)

  2. $\Omega(\log n)$-hard to approximate even for deterministic MDPs unless P = NP.

  3. There exists an $O(n)$-approximation algorithm.

  4. There exists an $O(\log n)$-approximation algorithm for deterministic MDPs.

In Section 4 we present A-MOMI, a polynomial-time approximation algorithm with an $O(n)$ suboptimality bound in general and an $O(\log n)$ suboptimality bound for deterministic MDPs. The factor $2^{\log^{1-\epsilon} n}$ is only slightly smaller than $n$: if $\epsilon = 0$ then $2^{\log n} = n$. Thus, the inapproximability results imply that A-MOMI is close to the best possible approximation factor.

In addition, we consider the complementary problem of finding a set of at most $k$ options that minimizes the number of VI iterations until convergence. We show that this problem is also NP-hard, even for deterministic MDPs.

Finally, we empirically evaluate the performance of two heuristic approaches for option discovery, betweenness options (Şimşek & Barto, 2009) and eigenoptions (Machado et al., 2018), against the proposed approximation algorithms and the optimal options in standard grid domains.

2 Background

We first provide background on Markov Decision Processes (MDPs), planning, and options.

2.1 Markov Decision Processes

An MDP is a five-tuple $\langle S, A, R, T, \gamma \rangle$, where $S$ is a finite set of states; $A$ is a finite set of actions; $R(s, a)$ is a reward function; $T(s' \mid s, a)$ is a transition function, denoting the probability of arriving in state $s'$ after executing action $a$ in state $s$; and $\gamma \in [0, 1]$ is a discount factor, expressing the agent's preference for immediate over delayed rewards.

An action-selection strategy is modeled by a policy, $\pi : S \to \Pr(A)$, mapping states to a distribution over actions. Typically, the goal of planning in an MDP is to solve the MDP, that is, to compute an optimal policy. A policy is evaluated according to the Bellman equation, denoting the long-term expected reward received by executing $\pi$:

$$V^{\pi}(s) = R(s, \pi(s)) + \gamma \sum_{s' \in S} T(s' \mid s, \pi(s)) V^{\pi}(s'). \qquad (1)$$

We denote by $\pi^*$ and $V^*$ the optimal policy and value function, respectively.
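To make the preceding definitions concrete, the following is a minimal value-iteration sketch for a tabular MDP. The dictionary-based representation of $T$ and $R$, and the names used, are illustrative assumptions rather than the authors' implementation.

```python
# A minimal tabular value-iteration sketch (illustrative; not the authors' code).
# T[s][a] is a list of (next_state, probability) pairs and R[s][a] is the expected reward.
def value_iteration(states, actions, T, R, gamma=0.95, theta=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        delta, V_new = 0.0, {}
        for s in states:
            V_new[s] = max(
                R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a])
                for a in actions[s]
            )
            delta = max(delta, abs(V_new[s] - V[s]))
        V = V_new
        if delta < theta:  # Bellman residual small enough: V is (approximately) a fixed point
            return V
```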

2.2 Planning

The core problem we study is planning: computing a near-optimal policy for a given MDP. The main variant of the planning problem we study is the value-planning problem:

Definition 1 (Value-Planning Problem): Given an MDP $M = \langle S, A, R, T, \gamma \rangle$ and a non-negative real value $\epsilon$, return a value function $V$ such that $|V(s) - V^*(s)| \le \epsilon$ for all $s \in S$.

2.3 Options and Value Iteration

Temporally extended actions offer great potential for mitigating the difficulty of solving complex MDPs, either through planning or reinforcement learning (Sutton et al., 1999). Indeed, it is possible that options that are useful for learning are not necessarily useful for planning, and vice versa. Identifying techniques that produce good options in these scenarios is an important open problem in the literature.

We use the standard definition of options (Sutton et al., 1999):

Definition 2 (Option): An option $o$ is defined by a triple $\langle \mathcal{I}, \pi, \beta \rangle$, where:

  • $\mathcal{I} \subseteq S$ is the set of states where the option can initiate,

  • $\pi : S \to \Pr(A)$ is a policy,

  • $\beta : S \to [0, 1]$ is a termination condition.

We let $\mathcal{O}$ denote the set containing all options.

In planning, each option has a well-defined transition and reward model for every state, called the multi-time model, introduced by Precup & Sutton (1998):

$$R_o(s) = \mathbb{E}\left[ r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k-1} r_{t+k} \mid \mathcal{E}(o, s, t) \right], \qquad (2)$$
$$P_o(s' \mid s) = \sum_{k=1}^{\infty} \gamma^{k} \Pr\left(s_{t+k} = s', k \mid \mathcal{E}(o, s, t)\right), \qquad (3)$$

where $\mathcal{E}(o, s, t)$ denotes the event that option $o$ is initiated in state $s$ at time $t$, and $k$ is the (random) duration of the option.

We use the multi-time model for value iteration. The algorithm computes a sequence of functions $V_0, V_1, V_2, \ldots$ using the Bellman optimality operator on the multi-time model:

$$V_{i+1}(s) = \max_{o \in A \cup O} \left[ R_o(s) + \sum_{s' \in S} P_o(s' \mid s) V_i(s') \right], \qquad (4)$$

where each primitive action is treated as a one-step option.

The problem we consider is to find a set of options $O$ to add to the set of primitive actions so as to minimize the number of iterations required for VI to converge. (We can ensure $|V_i(s) - V^*(s)| \le \epsilon$ by running VI until $\max_s |V_{i+1}(s) - V_i(s)| \le \epsilon(1-\gamma)/\gamma$ for all $s$ (Williams & Baird, 1993).)

Definition 3 ($L_\epsilon(O)$): The number of iterations $L_\epsilon(O)$ of a value-iteration algorithm using option set $A \cup O$ (a non-empty set of actions and options) is the smallest $i$ at which $|V_i(s) - V^*(s)| \le \epsilon$ for all $s \in S$.
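As an illustration of how value iteration uses these models and how $L_\epsilon(O)$ can be measured, here is a small sketch. The dictionary representation of option models and the availability of $V^*$ for the stopping test are assumptions made for clarity, not part of the paper's algorithm.

```python
# Sketch: value iteration over primitive actions plus option models (Eq. 4),
# counting the iterations until every state is within eps of V_star (Definition 3).
# Each option model is a dict {"init": set_of_states, "R": {s: r}, "P": {s: [(s2, w), ...]}},
# where the weights w already include the discounting of Eq. (3). Names are illustrative.
def vi_iterations_with_options(states, actions, T, R, option_models, V_star,
                               gamma=0.95, eps=1e-3):
    V = {s: 0.0 for s in states}
    i = 0
    while any(abs(V[s] - V_star[s]) > eps for s in states):
        V_new = {}
        for s in states:
            backups = [R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a])
                       for a in actions[s]]
            for o in option_models:
                if s in o["init"]:
                    backups.append(o["R"][s] + sum(w * V[s2] for s2, w in o["P"][s]))
            V_new[s] = max(backups)
        V = V_new
        i += 1
    return i  # the number of iterations L_eps(O) for this option set
```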

2.3.1 Point options.

The options formalism is immensely general: a single option can encode several completely unrelated behaviors. Consider the nine-state example MDP pictured in Figure 1; a single option can in fact initiate, make decisions in, and terminate along entirely independent trajectories. As we consider more complex MDPs (which, as discussed earlier, is often a motivation for introducing options), the number of independent behaviors that can be encoded by a single option increases further still.

Figure 1: A single option can encode multiple unrelated behaviors. The dark circles indicate where the option can be initiated and terminated, whereas the lighter circles denote the states visited by the option policy when applied in the respective initiating state.

As a result, it can be difficult to reason about the impact of adding a single option in the traditional sense. As the MDP grows larger, a combinatorial number of different behaviors can emerge from "one" option. Consequently, it is difficult to address the question: which single option helps planning? Thus, we instead introduce and study "point options", which only allow for a single continuous stream of behavior:

Definition 4 (Point option): A point option is any option whose initiation set contains exactly one state and whose termination condition is nonzero at exactly one state:

$$|\mathcal{I}| = 1, \qquad (5)$$
$$|\{ s \in S : \beta(s) > 0 \}| = 1, \qquad (6)$$
$$\beta(s) \in \{0, 1\} \ \text{for all } s \in S. \qquad (7)$$

We let $\mathcal{O}_p$ denote the set containing all point options. For simplicity, we refer to the single state in which a point option can initiate as its initiation state and the single state in which it terminates as its termination state.

To plan with a point option, the agent runs value iteration using the option's multi-time model in addition to the backup operations of the primitive actions. We assume that the model of each option is given to the agent, and we ignore the computational cost of computing the options' models.

Point options are a useful subclass to consider for several reasons. First, a point option is a simple model of a temporally extended action. Second, the policy of a point option can be computed as a path-planning problem for deterministic MDPs. Third, any option with a single termination state reached with probability 1 can be represented as a collection of point options. Fourth, a point option adds a fixed amount of computational overhead per iteration.
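For a deterministic MDP, the multi-time model of a point option can be obtained by simply rolling out its policy, as in the sketch below. The helpers next_state and reward are assumed, and the policy is assumed to reach the termination state; this is an illustration, not the authors' implementation.

```python
# Sketch: build the multi-time model of a point option in a deterministic MDP by
# rolling out its policy from the initiation state to the termination state.
# next_state(s, a) and reward(s, a) are assumed deterministic helpers.
def point_option_model(s_init, s_term, policy, next_state, reward, gamma=0.95):
    R_o, discount, s = 0.0, 1.0, s_init
    while s != s_term:                      # assumes the policy reaches s_term
        a = policy[s]
        R_o += discount * reward(s, a)      # accumulate discounted reward along the path (Eq. 2)
        discount *= gamma
        s = next_state(s, a)
    # all (discounted) transition mass lands on the termination state (Eq. 3, deterministic case)
    return {"init": {s_init}, "R": {s_init: R_o}, "P": {s_init: [(s_term, discount)]}}
```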

3 Complexity Results

Our main results focus on two computational problems:

  1. MinOptionMaxIter (MOMI): What is the smallest set of options that lets value iteration converge within at most $\ell$ iterations?

  2. MinIterMaxOption (MIMO): Which set of $k$ or fewer options minimizes the number of iterations to convergence?

More formally, MOMI is defined as follows.

Definition 5 (MOMI): The MinOptionMaxIter problem:
Given an MDP $M$, a non-negative real value $\epsilon$, and an integer $\ell$, return an option set $O$ that minimizes $|O|$ subject to $O \subseteq \mathcal{O}_p$ and $L_\epsilon(O) \le \ell$.

We then consider the complementary optimization problem: compute a set of at most $k$ options that minimizes the number of iterations. This second problem is MinIterMaxOption (MIMO).

Definition 6 (MIMO): The MinIterMaxOption problem:
Given an MDP $M$, a non-negative real value $\epsilon$, and an integer $k$, return an option set $O$ that minimizes $L_\epsilon(O)$ subject to $O \subseteq \mathcal{O}_p$ and $|O| \le k$.

We now introduce our main result, which shows that both MOMI and MIMO are NP-hard.

Theorem 1.

MOMI and MIMO are NP-hard.

Proof.

We consider a problem, OI-DEC, which is a decision version of both MOMI and MIMO. The problem asks whether we can solve the MDP within $\ell$ iterations using at most $k$ point options.

Definition 7 (OI-DEC):
Given an MDP $M$, a non-negative real value $\epsilon$, and integers $k$ and $\ell$, return 'Yes' if there exists an option set $O \subseteq \mathcal{O}_p$ such that $|O| \le k$ and $L_\epsilon(O) \le \ell$, and 'No' otherwise.

We prove the theorem by reduction from the decision version of the set-cover problem—known to be NP-complete—to OI-DEC. The set-cover problem is defined as follows.

Definition 8 (SetCover-DEC):
Given a set of elements $U$, a collection of subsets $\mathcal{S} = \{S_1, \ldots, S_m\}$ with $S_j \subseteq U$, and an integer $k$, return 'Yes' if there exists a cover $\mathcal{C} \subseteq \mathcal{S}$ such that $\bigcup_{S \in \mathcal{C}} S = U$ and $|\mathcal{C}| \le k$, and 'No' otherwise.

If there is some element $u \in U$ that is not included in at least one of the subsets in $\mathcal{S}$, then the answer is 'No'. Assuming otherwise, we construct an instance of a shortest-path problem (a special case of an MDP) as follows (Figure 2). There are four types of states in the MDP: (1) a state $s_u$ for each element $u \in U$, (2) a state $s_{S_j}$ for each subset $S_j \in \mathcal{S}$, (3) a copy $s'_{S_j}$ of every subset state $s_{S_j}$, and (4) a goal state $g$. Thus, the state set is $\{s_u\}_{u \in U} \cup \{s_{S_j}\}_{j} \cup \{s'_{S_j}\}_{j} \cup \{g\}$. We build edges between states as follows: (1) for every $u \in U$ and $S_j \in \mathcal{S}$ with $u \in S_j$, there is an edge from $s_u$ to $s_{S_j}$; (2) for every $S_j$, there is an edge from $s_{S_j}$ to its copy $s'_{S_j}$; and (3) for every $S_j$, there is an edge from $s'_{S_j}$ to the goal $g$. This construction can be done in polynomial time.

Let $M$ be the MDP constructed in this way. We show that SetCover-DEC($U, \mathcal{S}, k$) = OI-DEC($M, \epsilon, k, 2$). Note that by construction every state $s_{S_j}$, $s'_{S_j}$, and $g$ converges to its optimal value within 2 iterations, as it reaches the goal state within 2 steps. A state $s_u$ converges within 2 iterations if and only if there exists a point option (a) from $s_{S_j}$ to $g$ where $u \in S_j$, (b) from $s_u$ to $s'_{S_j}$ where $u \in S_j$, or (c) from $s_u$ to $g$. For options of type (b) and (c), we can find an option of type (a) that makes $s_u$ converge within 2 iterations by setting the initiation state of the option to $s_{S_j}$, where $u \in S_j$, and the termination state to $g$. Let $O$ be the solution of OI-DEC($M, \epsilon, k, 2$). If $O$ contains an option of type (b) or (c), we can swap it with an option of type (a) and still maintain a solution. Let $X$ be the set of initiation states of the options in $O$; the corresponding subsets $\{S_j : s_{S_j} \in X\}$ exactly match a solution of SetCover-DEC. ∎

Figure 2: Reduction from SetCover-DEC to OI-DEC. The example shows the reduction from an instance of SetCover-DEC that asks whether two subsets can be picked to cover all elements. It reduces to an instance of OI-DEC asking whether the MDP can be solved with 2 iterations of VI by adding at most two point options. The answer to OI-DEC is 'Yes' (adding point options from two of the subset states to the goal solves the problem), thus the answer to SetCover-DEC is 'Yes'. Here the set of initiation states corresponds to the cover for SetCover-DEC.
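A sketch of this construction as code is given below; the state names are illustrative and the MDP is represented simply as an edge list of the deterministic shortest-path problem described in the proof.

```python
# Sketch of the reduction in the proof of Theorem 1: build the deterministic shortest-path
# MDP (as an edge list) from a set-cover instance. State names are illustrative.
def setcover_to_oidec(universe, subsets):
    goal = "g"
    states = ([("elem", u) for u in universe] +
              [("sub", j) for j in range(len(subsets))] +
              [("sub_copy", j) for j in range(len(subsets))] + [goal])
    edges = []
    for j, S in enumerate(subsets):
        for u in S:
            edges.append((("elem", u), ("sub", j)))   # element -> subset containing it
        edges.append((("sub", j), ("sub_copy", j)))   # subset -> its copy
        edges.append((("sub_copy", j), goal))         # copy -> goal
    return states, edges, goal

# A cover of size k corresponds to k point options, each initiating at a chosen
# subset state and terminating at the goal, that let VI converge in 2 iterations.
```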

3.1 Generalizations of MOMI and MIMO

A natural question is whether Theorem 1 extends to more general option-construction settings. We consider two possible extensions, which we believe offer significant coverage of finding optimal options for planning in general.

We first consider the case where the options are not necessarily point options. There is little sense in considering MOMI where one can choose any option whatsoever, since clearly the best single option is the one whose policy is the optimal policy. Thus, restricting to a given subset $\mathcal{O}'$ of the space of all options $\mathcal{O}$, we generalize MOMI as follows:

Definition 9 (MOMI-gen):
Given an MDP $M$, a non-negative real value $\epsilon$, a set of candidate options $\mathcal{O}' \subseteq \mathcal{O}$, and an integer $\ell$, return $O$ minimizing $|O|$ subject to $O \subseteq \mathcal{O}'$ and $L_\epsilon(O) \le \ell$.

Theorem 2.

MOMI-gen and MIMO-gen (the analogous generalization of MIMO) are NP-hard.

The proof follows from the fact that MOMI-gen contains MOMI as a special case and MIMO-gen contains MIMO as a special case.

We next consider the multi-task generalization, where we aim to find the smallest set of options for which the expected number of iterations to solve a problem sampled from a distribution over MDPs is bounded:

Definition 10 (MOMI-multi):
Given a distribution $D$ over MDPs, a non-negative real value $\epsilon$, and an integer $\ell$, return $O \subseteq \mathcal{O}_p$ that minimizes $|O|$ such that $\mathbb{E}_{M \sim D}\left[L_\epsilon(O)\right] \le \ell$.

Theorem 3.

MOMI-multi and MIMO-multi are NP-hard.

The proof follows from the fact that MOMI-multi contains MOMI as a special case and MIMO-multi contains MIMO as a special case.

In light of the computational difficulty of both problems, the appropriate approach is to find tractable approximation algorithms. However, even approximately solving MOMI is a hard problem. More precisely:

Theorem 4.
  1. MOMI is $\Omega(\log n)$-hard to approximate even for deterministic MDPs unless P = NP.

  2. MOMI-gen is $2^{\log^{1-\epsilon} n}$-hard to approximate for any $\epsilon > 0$ even for deterministic MDPs unless NP $\subseteq$ DTIME($n^{\mathrm{polylog}(n)}$).

  3. MOMI is $2^{\log^{1-\epsilon} n}$-hard to approximate for any $\epsilon > 0$ unless NP $\subseteq$ DTIME($n^{\mathrm{polylog}(n)}$).

Proof.

See appendix. ∎

The factor $2^{\log^{1-\epsilon} n}$ in Theorem 4 can be thought of as approaching polynomial hardness of approximation, being only slightly smaller than $n$: if $\epsilon = 0$ then $2^{\log n} = n$. Note that an $O(n)$-approximation is achievable by a trivial algorithm that returns the set of all candidate options. Thus, the result roughly says that, in general, no polynomial-time approximation algorithm for MOMI does substantially better than this trivial algorithm.

In the next section we show that an $O(\log n)$-approximation is achievable if the MDP is deterministic and the agent is given the set of all point options. These two results, then, give a formal separation between the hardness of abstraction in MDPs with and without stochasticity.

In summary, the problem of computing optimal behavioral abstractions for planning is intractable.

4 Approximation Algorithms

We now provide polynomial-time approximation algorithms, A-MOMI and A-MIMO, to solve MOMI and MIMO, respectively. Both algorithms have bounded suboptimality, slightly worse than a constant factor, for deterministic MDPs. We assume that (1) there is exactly one absorbing goal state $g$, with $R(g, a) = 0$ and $T(g \mid g, a) = 1$ for every action $a$, and every optimal policy eventually reaches $g$ with probability 1; and (2) there is no cycle with positive reward along an optimal policy's trajectory. Note that we can convert a problem with multiple goals to a problem with a single goal by adding a new absorbing state $g$ to the MDP and adding a transition from each of the original goals to $g$.
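As a small illustration of the multi-goal-to-single-goal conversion just described, the sketch below adds one absorbing state and redirects each original goal to it. The dictionary representation and the helper names are assumptions for clarity.

```python
# Sketch: convert a multi-goal shortest-path MDP into the single-goal form assumed above
# by adding one absorbing state and a zero-reward transition from each original goal to it.
def merge_goals(states, actions, T, R, goals, new_goal="g_star", noop="noop"):
    states = list(states) + [new_goal]
    actions = dict(actions)
    T = {s: dict(T.get(s, {})) for s in states}
    R = {s: dict(R.get(s, {})) for s in states}
    for g in goals:
        actions[g] = [noop]
        T[g][noop] = [(new_goal, 1.0)]      # each original goal now leads to the new state
        R[g][noop] = 0.0
    actions[new_goal] = [noop]
    T[new_goal][noop] = [(new_goal, 1.0)]   # absorbing, zero reward
    R[new_goal][noop] = 0.0
    return states, actions, T, R, new_goal
```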

Unfortunately, these algorithms are computationally harder than solving the MDP itself, and are thus not practical for speeding up planning directly. Instead, they are useful for analyzing and evaluating heuristically generated options. If the option set generated by a heuristic method outperforms the option set found by the following algorithms, then one can claim that the heuristic's option set is close to the optimal option set (for that MDP). Our algorithms have a formal guarantee of bounded suboptimality if the MDP is deterministic, so any heuristic method that provably exceeds our algorithms' performance also inherits bounded suboptimality. We also believe that the following algorithms may be a useful foundation for future option-discovery methods.

4.1 A-MOMI

We now describe a polynomial-time approximation algorithm, A-MOMI, based on using set cover to solve MOMI. The overview of the procedure is as follows.

  1. Compute $d^\epsilon(s, s')$ for every state pair.

  2. For every state $s$, compute the set $X_s$ of states within distance $\ell - 1$ of $s$ under $d^\epsilon$. The set $X_s$ represents the states that converge within $\ell$ iterations if we add a point option from $s$ to $g$.

  3. Form a set-cover instance whose elements are the states outside $X^+_\ell$, where $X^+_\ell$ is the set of states that converge within $\ell$ iterations without any options (and can thus be ignored), and whose candidate subsets are the sets $X_s$.

  4. Solve the set-cover optimization problem to find a collection of subsets that covers the remaining states, using the approximation algorithm of Chvatal (1979). This corresponds to finding a minimum set of subsets that makes every state converge within $\ell$ iterations (a greedy sketch follows this list).

  5. Generate a point option for each chosen subset $X_s$, with initiation state $s$ and termination state set to the goal $g$.
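A compact sketch of steps 3-5 using the standard greedy set-cover heuristic is given below. It assumes the distance function of Definition 11 is available as a nested dictionary that includes the goal, and that $d^\epsilon(s, g)$ reflects how many iterations $s$ needs without extra options; the exact off-by-one conventions are illustrative assumptions, not the paper's specification.

```python
# Sketch of A-MOMI's core: greedy set cover (Chvatal, 1979) over the sets X_s.
# d[s][c] is the distance function of Definition 11; the thresholds below assume that
# d(s, goal) tracks convergence without options and that a point option c -> goal helps
# state s iff d(s, c) <= ell - 1 (the off-by-one bookkeeping is an assumption).
def a_momi(states, d, ell, goal):
    uncovered = {s for s in states if d[s][goal] > ell}          # states needing help
    X = {c: {s for s in states if d[s][c] <= ell - 1} for c in states}
    chosen = []
    while uncovered:
        best = max(states, key=lambda c: len(X[c] & uncovered))  # cover the most new states
        if not X[best] & uncovered:
            raise ValueError("no set of point options meets the bound ell")
        chosen.append(best)
        uncovered -= X[best]
    return chosen   # initiation states; each point option terminates at the goal
```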

We compute a distance function $d^\epsilon : S \times S \to \mathbb{Z}_{\ge 0}$, defined as follows:

Definition 11 (Distance $d^\epsilon$): $d^\epsilon(s, s')$ is the number of iterations for $s$ to reach an $\epsilon$-optimal value if we add a point option from $s'$ to $g$, minus one.

More formally, let $N^\epsilon(s)$ denote the number of iterations needed for the value of state $s$ to satisfy $|V(s) - V^*(s)| \le \epsilon$, and let $N^\epsilon(s \mid s')$ be an upper bound on the number of iterations needed for the value of $s$ to satisfy $|V(s) - V^*(s)| \le \epsilon$ if the value of $s'$ is initialized such that $|V_0(s') - V^*(s')| \le \epsilon$. We define $d^\epsilon(s, s') := \min\left(N^\epsilon(s),\, N^\epsilon(s \mid s') + 1\right) - 1$. For simplicity, we use $d$ to denote the function $d^\epsilon$. Consider the following example.

Example.

Table 1 shows the distance function for the MDP in Figure 3. For a deterministic MDP, $N^\epsilon(s \mid s')$ corresponds to the number of edge traversals from state $s$ to $s'$, where we include only the edges corresponding to state transitions under optimal actions. The quantity $d^\epsilon(s, s')$ is then the minimum of $N^\epsilon(s)$ and one plus this number of edge traversals, minus one. ∎

Note that we only need to solve the MDP once to compute $d^\epsilon$. It can be computed by solving the MDP without any options and storing all value functions ($V_0, V_1, \ldots$) until convergence. If we add a point option from $s'$ to $g$, then $V_1(s')$ is already $\epsilon$-optimal. Thus, $N^\epsilon(s \mid s')$ is the smallest $i$ at which $V_i(s)$ reaches $\epsilon$-optimality if we replace $V_0(s')$ with $V^*(s')$ when computing the sequence of value functions.

Example.

We use the MDP shown in Figure 3 as an example. Consider the problem of finding a set of options so that the MDP can be solved within $\ell$ iterations. We generate an instance of a set-cover optimization problem. The element set for the set cover is the set of states of the MDP that do not reach their optimal values within $\ell$ iterations without any options; the states that can be solved within $\ell$ iterations are excluded. A state $s'$ is included in the subset $X_s$ iff $d(s', s) \le \ell - 1$. In this case, the approximation algorithm finds the optimal solution for the set-cover instance, and we generate a point option for each state in that solution. Thus, the output of the algorithm is a set of two point options, each terminating at the goal $g$. ∎

Theorem 5.

A-MOMI has the following properties:

  1. A-MOMI runs in polynomial time.

  2. It guarantees that the MDP is solved within $\ell$ iterations using the option set acquired by A-MOMI.

  3. If the MDP is deterministic, the option set is at most $O(\log n)$ times larger than the smallest option set that can solve the MDP within $\ell$ iterations.

Proof.

See the supplementary material. ∎

Note that the approximation bound for deterministic MDPs could be improved by improving the approximation algorithm for set cover. Set cover is NP-hard to approximate within a factor of $(1 - o(1)) \ln n$ (Dinur & Steurer, 2014), so there may still be a modest improvement available in the approximation ratio for the set-cover problem, which would also improve the approximation ratio of A-MOMI.

4.2 A-MIMO

The outline of the approximation algorithm for MIMO, A-MIMO, is as follows.

  1. Compute an asymmetric distance function $d^\epsilon(s, s')$ representing the number of iterations for a state $s$ to reach its $\epsilon$-optimal value if we add a point option from a state $s'$ to the goal state $g$.

  2. Using this distance function, solve an asymmetric $k$-center problem, which finds a set of $k$ center states that minimizes the maximum number of iterations needed for every state to converge.

  3. Generate point options with initiation states set to the center states in the solution of the asymmetric $k$-center problem, and termination states set to the goal.

(1) We compute the distance function $d^\epsilon$ as in A-MOMI.

(2) We exploit the fact that $d^\epsilon$ obeys the triangle inequality (but is not symmetric) and solve the asymmetric $k$-center problem (Panigrahy & Vishwanathan, 1998) on $d^\epsilon$ to get a set of centers, which we use as initiation states for point options. The asymmetric $k$-center problem is a generalization of the metric $k$-center problem in which the distance function obeys the triangle inequality but is not necessarily symmetric:

Definition 12 (AsymKCenter):
Given a set of elements $U$, a function $d : U \times U \to \mathbb{R}_{\ge 0}$, and an integer $k$, return a set of centers $C \subseteq U$ that minimizes $\max_{u \in U} \min_{c \in C} d(u, c)$ subject to $|C| \le k$.

We solve the problem using a polynomial-time approximation algorithm proposed by Archer (2001). The algorithm has a suboptimality bound of $O(\log^* k)$, where $\log^*$ denotes the iterated logarithm, and it is proven that the problem cannot be approximated within a factor of $\log^* n$ unless P = NP (Chuzhoy et al., 2005). As the procedure by Archer (2001) often finds a set of options smaller than $k$, we generate the rest of the options by greedily adding one option at a time. See the supplementary material for details.

(3) We generate a set of point options with initiation states set to the centers and termination states set to the goal state of the MDP. That is, for every center $c \in C$, we generate a point option from $c$ to the goal state $g$.

Example.

Consider the MDP shown in Figure 3. The distance function $d$ for this MDP is shown in Table 1. Note that the triangle inequality holds for every pair of states. Let us first consider finding one option ($k = 1$). This corresponds to finding the column with the smallest maximum value in Table 1, so the optimal point option initiates at the state of the fifth column, whose maximum value is 2. If $k = 2$, an optimal set of options initiates at the states of the second and fourth columns, achieving a maximum of 1. Note that the optimal option for $k = 1$ is not in the optimal option set of size 2. This example shows that the strategy of greedily adding options does not find the optimal set. In fact, the improvement achieved by greedily adding an option can be arbitrarily small (i.e., 0) compared to the optimal option set (see Proposition 1 in the supplementary material for a proof). ∎

Figure 3: Example MDP for A-MIMO. Discovered options are denoted by the dashed lines.
0 1 3 3 2 3
2 0 2 2 1 2
3 3 0 1 2 3
2 2 2 0 1 2
1 1 1 1 0 1
0 0 0 0 0 0
Table 1: The distance function $d$ for the MDP in Figure 3. Rows correspond to the state $s$ whose convergence is measured (the last row is the goal), and columns correspond to the candidate initiation state $s'$.
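The following short sketch reproduces the calculation in the example above directly from Table 1 (rows indexed 0 to 5, with row 5 taken to be the goal); treating the $k$-center objective as the quantity to minimize is used here purely for illustration.

```python
# Worked check of the A-MIMO example using the distance matrix of Table 1.
# Rows are states s, columns are candidate initiation states s' (0-indexed).
from itertools import combinations

d = [[0, 1, 3, 3, 2, 3],
     [2, 0, 2, 2, 1, 2],
     [3, 3, 0, 1, 2, 3],
     [2, 2, 2, 0, 1, 2],
     [1, 1, 1, 1, 0, 1],
     [0, 0, 0, 0, 0, 0]]

def objective(centers):
    # asymmetric k-center objective: worst distance from any state to its nearest center
    return max(min(row[c] for c in centers) for row in d)

best1 = min(range(6), key=lambda c: objective([c]))
print(best1, objective([best1]))                     # 4, 2  (the fifth column)
best2 = min(combinations(range(6), 2), key=objective)
print(best2, objective(best2))                       # (1, 3), 1  -- column 4 is excluded
print(min(objective([best1, c]) for c in range(6)))  # 2: greedily extending best1 does not help
```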
Theorem 6.

A-MIMO has the following properties:

  1. A-MIMO runs in polynomial time.

  2. If the MDP is deterministic, it has a bounded suboptimality of $O(\log^* k)$, where $\log^* k$ is the number of times the logarithm function must be applied iteratively before the result is at most 1.

  3. The number of iterations needed to solve the MDP using the acquired option set is bounded above in terms of the objective value of the asymmetric $k$-center solution.

Proof.

See the supplementary material. ∎

5 Experiments

Figure 10: Comparison of the optimal point options (panels a and c) with options generated by the approximation algorithm A-MIMO (panels b and d), betweenness options (panel e), and eigenoptions (panel f). The green square represents the termination state and the blue squares the initiation states. Observe that the options found by the approximation algorithm are similar to the optimal options. Note that the optimal option set is not unique: there can be multiple optimal option sets, and we visualize just one returned by the solver.
Figure 15: MIMO and MOMI evaluations on the four-room domain and the grid with no walls. Panels (a) and (b) show the number of iterations for VI using options generated by A-MIMO. Panels (c) and (d) show the number of options generated by A-MOMI to ensure the MDP is solved within a given number of iterations. OPT: optimal set of options. APPROX: a bounded-suboptimal set of options generated by A-MIMO and A-MOMI. BET: betweenness options. EIG: eigenoptions.

We evaluate the performance of the value-iteration algorithm using options generated by the approximation algorithms on several simple grid-based domains.

We ran the experiments on a four-room domain and a grid world with no walls. In both domains, the agent's goal is to reach a specific square. The agent can move in the four cardinal directions but cannot cross walls.

5.0.1 Visualizations

First, we visualize a variety of option types, including the optimal point options, those found by our approximation algorithms, and several option types proposed in the literature. We computed the optimal set of point options by enumerating every possible set of point options and picking the best. We were only able to find optimal solutions for up to four options within 10 minutes, while the approximation algorithm could find any number of options within a few minutes. Both betweenness options and eigenoptions are discovered by polynomial-time algorithms and can thus also be found within a few minutes. Figure 10 shows the optimal and bounded-suboptimal sets of options computed by A-MIMO. See the supplementary material for visualizations of the grid domain.
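The exhaustive search described above can be summarized by the following sketch; `evaluate_iterations` stands in for an $L_\epsilon(O)$ evaluator and is an assumed helper, not part of the paper's code.

```python
# Sketch of the brute-force search for the optimal set of k point options: enumerate
# every k-subset of candidate point options and keep the one minimizing VI iterations.
from itertools import combinations

def optimal_point_options(candidate_options, k, evaluate_iterations):
    best_set, best_iters = None, float("inf")
    for option_set in combinations(candidate_options, k):
        iters = evaluate_iterations(option_set)   # e.g., an L_eps(O) evaluator
        if iters < best_iters:
            best_set, best_iters = option_set, iters
    return best_set, best_iters
```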

Figure 10(e) shows the four bottleneck states with the highest shortest-path betweenness centrality in the state-transition graph (Şimşek & Barto, 2009). Interestingly, the optimal options are quite close to the bottleneck states in the four-room domain, suggesting that bottleneck states are also useful for planning as a heuristic for finding important subgoals.

Figure 10(f) shows the set of subgoals discovered by graph Laplacian analysis following the method of Machado et al. (2018). While they proposed generating options that travel between subgoals for reinforcement learning, we instead generate a set of point options from each subgoal to the goal state, as that is a better use of the subgoals in the planning setting.

5.0.2 Quantitative Evaluation

Next, we run value iteration using the sets of options generated by A-MIMO and A-MOMI. Figures 15(a) and 15(b) show the number of iterations on the four-room domain and the grid using option sets of increasing size. The experimental results suggest that the approximation algorithm finds sets of options similar to, but not quite as good as, the optimal ones. For betweenness options and eigenoptions, we evaluated every subset of options among the four and present results for the best subset found. Because betweenness options are placed close to the optimal options, their performance is close to optimal, especially when the number of options is small.

In addition, we used A-MOMI to find a minimum option set that solves the MDP within a given number of iterations. Figures 15(c) and 15(d) show the number of options generated by A-MOMI compared to the minimum number of options.

6 Related Work

Many heuristic algorithms have been proposed to discover options useful for various purposes (Iba, 1989; McGovern & Barto, 2001; Menache et al., 2002; Stolle & Precup, 2002; Şimşek & Barto, 2004, 2009; Konidaris & Barto, 2009; Machado et al., 2018; Eysenbach et al., 2018). These algorithms seek to capture varying intuitions about what makes behavioral abstraction useful. Jong et al. (2008) investigated the utility of options empirically and pointed out that introducing options can worsen learning performance; they argued that options can potentially improve learning performance by encouraging exploitation or exploration. For example, several works investigate the use of bottleneck states (Stolle & Precup, 2002; Şimşek & Barto, 2009; Menache et al., 2002; Lehnert et al., 2018). Stolle & Precup (2002) proposed setting states with high visitation counts as subgoal states, which identifies the bottleneck states in the four-room domain. Şimşek & Barto (2009) generalized the concept of a bottleneck to the (shortest-path) betweenness of the state-transition graph to capture how pivotal a state is. Menache et al. (2002) used a learned model of the environment to run a Max-Flow/Min-Cut algorithm on the state-space graph to identify bottleneck states. These methods generate options that leverage the idea that subgoals are states visited most frequently. On the other hand, Şimşek & Barto (2004) proposed generating options that encourage exploration by targeting relatively novel states. Eysenbach et al. (2018) instead proposed learning a policy for each option so that the diversity of the trajectories produced by the set of options is maximized. These methods generate options that explore infrequently visited states. Harb et al. (2017) proposed formulating good options as those that minimize deliberation cost in the bounded-rationality framework (Simon, 1957). That being said, the problem of discovering efficient behavioral abstractions in reinforcement learning is still an open question. The problem of finding a minimum state abstraction with bounded performance loss was studied by Even-Dar & Mansour (2003), who showed it is NP-hard and proposed a polynomial-time bicriteria approximation algorithm.

For planning, several works have shown empirically that adding a particular set of options or macro-operators can speed up planning algorithms (Francis & Ram, 1993; Sutton & Barto, 1998; Silver & Ciosek, 2012; Konidaris, 2016). In terms of theoretical analysis, Mann et al. (2015) analyzed the convergence rate of approximate value iteration with and without options and showed that options lead to faster convergence if their durations are long and the value function is initialized pessimistically. As in reinforcement learning, how to automatically find efficient temporal abstractions for planning remains an open question.

7 Conclusions

We considered a fundamental theoretical question concerning the use of behavioral abstractions to solve MDPs. We considered two problem formulations for finding options: (1) minimize the size of the option set given a maximum number of iterations (MOMI), and (2) minimize the number of iterations given a maximum size of the option set (MIMO). We showed that both problems are computationally intractable, even for deterministic MDPs. For each problem, we produced a polynomial-time approximation algorithm for MDPs with bounded reward and goal states, with bounded suboptimality for deterministic MDPs. In the future, we are interested in using the insights established here to develop principled option-discovery algorithms for model-based reinforcement learning. Since we now know which options minimize planning time, we can better guide model-based agents toward learning them and potentially reduce sample complexity considerably.

Acknowledgments

We would like to thank the anonymous reviewer for their advice and suggestions to improve the inapproximability result for MOMI.

References

  • Archer (2001) Archer, A. Two O(log* k)-approximation algorithms for the asymmetric k-center problem. In International Conference on Integer Programming and Combinatorial Optimization, pp. 1–14, 2001.
  • Bacon (2013) Bacon, P.-L. On the bottleneck concept for options discovery. Master’s thesis, McGill University, 2013.
  • Bakker & Schmidhuber (2004) Bakker, B. and Schmidhuber, J. Hierarchical reinforcement learning based on subgoal discovery and subpolicy specialization. In Proceedings of the 8th Conference on Intelligent Autonomous Systems, pp. 438–445, 2004.
  • Bhattacharyya et al. (2012) Bhattacharyya, A., Grigorescu, E., Jung, K., Raskhodnikova, S., and Woodruff, D. P. Transitive-closure spanners. SIAM Journal on Computing, 41(6):1380–1425, 2012.
  • Chuzhoy et al. (2005) Chuzhoy, J., Guha, S., Halperin, E., Khanna, S., Kortsarz, G., Krauthgamer, R., and Naor, J. S. Asymmetric k-center is log* n-hard to approximate. Journal of the ACM, 52(4):538–551, 2005.
  • Chvatal (1979) Chvatal, V. A greedy heuristic for the set-covering problem. Mathematics of operations research, 4(3):233–235, 1979.
  • Dinitz et al. (2012) Dinitz, M., Kortsarz, G., and Raz, R. Label cover instances with large girth and the hardness of approximating basic k-spanner. In International Colloquium on Automata, Languages, and Programming, pp. 290–301. Springer, 2012.
  • Dinur & Safra (2004) Dinur, I. and Safra, S. On the hardness of approximating label-cover. Information Processing Letters, 89(5):247–254, 2004.
  • Dinur & Steurer (2014) Dinur, I. and Steurer, D. Analytical approach to parallel repetition. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, pp. 624–633. ACM, 2014.
  • Even-Dar & Mansour (2003) Even-Dar, E. and Mansour, Y. Approximate equivalence of Markov decision processes. In Learning Theory and Kernel Machines, pp. 581–594. Springer, 2003.
  • Eysenbach et al. (2018) Eysenbach, B., Gupta, A., Ibarz, J., and Levine, S. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070, 2018.
  • Francis & Ram (1993) Francis, A. G. and Ram, A. The utility problem in case-based reasoning. In Case-Based Reasoning: Papers from the 1993 Workshop, pp. 160–161, 1993.
  • Harb et al. (2017) Harb, J., Bacon, P.-L., Klissarov, M., and Precup, D. When waiting is not an option: Learning options with a deliberation cost. arXiv preprint arXiv:1709.04571, 2017.
  • Hochbaum (1982) Hochbaum, D. S. Approximation algorithms for the set covering and vertex cover problems. SIAM Journal on Computing, 11(3):555–556, 1982.
  • Iba (1989) Iba, G. A. A heuristic approach to the discovery of macro-operators. Machine Learning, 3(4):285–317, 1989.
  • Jong et al. (2008) Jong, N. K., Hester, T., and Stone, P. The utility of temporal abstraction in reinforcement learning. In Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems, pp. 299–306, 2008.
  • Konidaris (2016) Konidaris, G. Constructing abstraction hierarchies using a skill-symbol loop. In Proceedings of the 25th International Joint Conference on Artificial Intelligence, pp. 1648, 2016.
  • Konidaris & Barto (2009) Konidaris, G. and Barto, A. Skill discovery in continuous reinforcement learning domains using skill chaining. In Advances in Neural Information Processing Systems, pp. 1015–1023, 2009.
  • Kortsarz (2001) Kortsarz, G. On the hardness of approximating spanners. Algorithmica, 30(3):432–450, 2001.
  • Lehnert et al. (2018) Lehnert, L., Laroche, R., and van Seijen, H. On value function representation of long horizon problems. In AAAI, 2018.
  • Littman et al. (1995) Littman, M. L., Dean, T. L., and Kaelbling, L. P. On the complexity of solving Markov decision problems. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pp. 394–402, 1995.
  • Machado et al. (2018) Machado, M. C., Bellemare, M. G., and Bowling, M. A Laplacian framework for option discovery in reinforcement learning. In Proceedings of the Thirty-fourth International Conference on Machine Learning, 2018.
  • Mann et al. (2015) Mann, T. A., Mannor, S., and Precup, D. Approximate value iteration with temporally extended actions. Journal of Artificial Intelligence Research, 53:375–438, 2015.
  • McGovern & Barto (2001) McGovern, A. and Barto, A. G. Automatic discovery of subgoals in reinforcement learning using diverse density. In Proceedings of the Eighteenth International Conference on Machine Learning, pp. 361–368, 2001.
  • Menache et al. (2002) Menache, I., Mannor, S., and Shimkin, N. Q-cut - dynamic discovery of sub-goals in reinforcement learning. In European Conference on Machine Learning, pp. 295–306, 2002.
  • Moradi et al. (2012) Moradi, P., Shiri, M. E., Rad, A. A., Khadivi, A., and Hasler, M. Automatic skill acquisition in reinforcement learning using graph centrality measures. Intelligent Data Analysis, 16(1):113–135, 2012.
  • Panigrahy & Vishwanathan (1998) Panigrahy, R. and Vishwanathan, S. An O(log* n) approximation algorithm for the asymmetric p-center problem. Journal of Algorithms, 27(2):259–268, 1998.
  • Pickett & Barto (2002) Pickett, M. and Barto, A. Policyblocks: An algorithm for creating useful macro-actions in reinforcement learning. In Proceedings of the Nineteenth International Conference on Machine Learning, pp. 506–513, 2002.
  • Precup & Sutton (1998) Precup, D. and Sutton, R. S. Multi-time models for temporally abstract planning. In Advances in neural information processing systems, pp. 1050–1056, 1998.
  • Puterman (1994) Puterman, M. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 1994.
  • Raz & Safra (1997) Raz, R. and Safra, S. A sub-constant error-probability low-degree test, and a sub-constant error-probability PCP characterization of NP. In Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing, pp. 475–484. ACM, 1997.
  • Silver & Ciosek (2012) Silver, D. and Ciosek, K. Compositional planning using optimal option models. In Proceedings of the 29th International Conference on Machine Learning, pp. 1063–1070, 2012.
  • Simon (1957) Simon, H. A. Models of man; social and rational. 1957.
  • Şimşek & Barto (2004) Şimşek, Ö. and Barto, A. Using relative novelty to identify useful temporal abstractions in reinforcement learning. In Proceedings of the 21st International Conference on Machine Learning, pp. 751–758, 2004.
  • Şimşek & Barto (2009) Şimşek, Ö. and Barto, A. G. Skill characterization based on betweenness. In Advances in Neural Information Processing Systems, pp. 1497–1504, 2009.
  • Şimşek et al. (2005) Şimşek, Ö., Wolfe, A., and Barto, A. Identifying useful subgoals in reinforcement learning by local graph partitioning. In Proceedings of the Twenty Second International Conference on Machine Learning, pp. 816–823, 2005.
  • Stolle & Precup (2002) Stolle, M. and Precup, D. Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation, and Approximation, pp. 212–223, 2002.
  • Sutton et al. (1999) Sutton, R., Precup, D., and Singh, S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1):181–211, 1999.
  • Sutton & Barto (1998) Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 1998.
  • Williams & Baird (1993) Williams, R. J. and Baird, L. C. Tight performance bounds on greedy policies based on imperfect value functions. Technical report, College of Computer Science, Northeastern University, 1993.

Appendix A: Inapproximability of MOMI

In this section we prove Theorem 4:

Theorem 4.
  1. MOMI is $\Omega(\log n)$-hard to approximate even for deterministic MDPs unless P = NP.

  2. MOMI-gen is $2^{\log^{1-\epsilon} n}$-hard to approximate for any $\epsilon > 0$ even for deterministic MDPs unless NP $\subseteq$ DTIME($n^{\mathrm{polylog}(n)}$).

  3. MOMI is $2^{\log^{1-\epsilon} n}$-hard to approximate for any $\epsilon > 0$ unless NP $\subseteq$ DTIME($n^{\mathrm{polylog}(n)}$).

First, we show Theorem 4.1 by a reduction from the set-cover problem to MOMI with a deterministic MDP.

Next, we demonstrate Theorems 4.2 and 4.3. For both results we reduce from the Min-Rep problem, originally defined by Kortsarz (2001). Min-Rep is a variant of the better-studied label-cover problem (Dinur & Safra, 2004) and has been integral to recent hardness-of-approximation results in network design (Dinitz et al., 2012; Bhattacharyya et al., 2012). Roughly, Min-Rep asks how to assign as few labels as possible to the nodes of a bipartite graph such that every edge is "satisfied."

Definition 13 (Min-Rep):
Given a bipartite graph $G = (V_1 \cup V_2, E)$ and alphabets $\Sigma_1$ and $\Sigma_2$ for the left and right sides of $G$ respectively, where each edge $e \in E$ has associated with it a set $\pi_e \subseteq \Sigma_1 \times \Sigma_2$ of pairs which satisfy it. Return a pair of assignments $A : V_1 \to 2^{\Sigma_1}$ and $B : V_2 \to 2^{\Sigma_2}$ such that for every edge $e = (u, v) \in E$ there exists an $(a, b) \in \pi_e$ with $a \in A(u)$ and $b \in B(v)$. The objective is to minimize $\sum_{u \in V_1} |A(u)| + \sum_{v \in V_2} |B(v)|$.

We illustrate a feasible solution to an instance of Min-Rep in Figure 16.

Figure 16: An instance of Min-Rep. Each edge $e$ is labeled with the pairs in $\pi_e$. A feasible solution is illustrated, with the assigned labels shown below the vertices in blue. Constraints are colored to coincide with the stochastic action colors in Figure 18.
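A direct way to read Definition 13 is as a feasibility and cost check, sketched below; the dictionary-based encoding of the instance and of the assignments is an assumption for illustration.

```python
# Sketch: check feasibility and cost of a Min-Rep solution (Definition 13).
# edges is a list of (u, v) pairs; pi maps each (u, v) to its set of satisfying label pairs;
# A and B map vertices to the sets of labels assigned to them.
def minrep_cost(edges, pi, A, B):
    for (u, v) in edges:
        if not any((a, b) in pi[(u, v)] for a in A[u] for b in B[v]):
            return None   # edge (u, v) is not satisfied, so the assignment is infeasible
    return sum(len(labels) for labels in A.values()) + sum(len(labels) for labels in B.values())
```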

The crucial property of Min-Rep that we use is that no polynomial-time algorithm can approximate Min-Rep well. Let $\gamma := 2^{\log^{1-\epsilon} n}$.

Lemma 1 (Kortsarz 2001).

Unless NP $\subseteq$ DTIME($n^{\mathrm{polylog}(n)}$), Min-Rep admits no polynomial-time $\gamma$-approximation algorithm for any $\epsilon > 0$.

As a technical note, we emphasize that all relevant quantities in Min-Rep are polynomially bounded: the alphabet sizes and the number of edges are polynomial in $n$, so the MDPs produced by our reduction are of size polynomial in $n$.

A.1 Hardness of Approximation of MOMI with Deterministic MDPs

Theorem 4.1 Proof.

The optimization version of the set-cover problem cannot be approximated within a factor of $c \ln n$, for some constant $c > 0$, by a polynomial-time algorithm unless P = NP (Raz & Safra, 1997). The set-cover optimization problem can be reduced to MOMI with a construction similar to the reduction from SetCover-DEC to OI-DEC. Here, the targeted minimization values of the two problems are equal, $|O| = |\mathcal{C}|$, and the number of states in the constructed MDP is linear in the size of the set-cover instance. Assume there is a polynomial-time algorithm with an approximation factor of $o(\log n)$ for MOMI, where $n$ is the number of states in the MDP. Let $(U, \mathcal{S})$ be an instance of the set-cover problem. We can convert this instance into an instance of MOMI. Using the approximation algorithm, we get a solution $O$ with $|O| \le o(\log n) \cdot |O^*|$, where $O^*$ is the optimal solution. We construct a solution $\mathcal{C}$ for the set cover from the solution to MOMI (see the construction in the proof of Theorem 1). Because $|\mathcal{C}| = |O|$ and $|O^*| \le |\mathcal{C}^*|$, where $\mathcal{C}^*$ is the optimal solution for the set cover, we get $|\mathcal{C}| \le o(\log n) \cdot |\mathcal{C}^*|$. Thus, we acquire an $o(\log n)$-approximation for the set-cover problem in polynomial time, something only possible if P = NP. Thus, there is no polynomial-time algorithm with an approximation factor of $o(\log n)$ for MOMI, unless P = NP. ∎

A.2 Hardness of Approximation of MOMI-gen

We now show our hardness of approximation for MOMI-gen, Theorem 4.2. (We assume that $\mathcal{O}'$ is a "good" set of candidate options in the sense that there exists some set $O \subseteq \mathcal{O}'$ such that $L_\epsilon(O) \le \ell$. We also assume, without loss of generality, that $\gamma = 1$ throughout this section; other values of $\gamma$ can be handled by re-scaling rewards in our reduction.)

We start by describing our reduction from an instance of Min-Rep to an instance of MOMI-gen. The intuition behind our reduction is that we can encode choosing a label for a vertex in Min-Rep as choosing an option in our instance. In particular, we will have a state for each edge in our Min-Rep instance, and reward will propagate quickly to that state when value iteration is run only if the options corresponding to a satisfying assignment for that edge are chosen.

More formally, our reduction is as follows. Consider an instance of Min-Rep, MR, given by $G = (V_1 \cup V_2, E)$, $\Sigma_1$, $\Sigma_2$, and $\{\pi_e\}_{e \in E}$. Our instance of MOMI-gen is as follows, where $\ell = 3$ and $\epsilon = 0$. (It is easy to generalize these results to $\ell > 3$ by replacing certain edges with paths.)

  • State space. We have a single goal state $g$, along with auxiliary states $t_1$ and $t_2$. For each edge $e \in E$ we create a state $s_e$. Let $\Sigma_1^e$ consist of all $a \in \Sigma_1$ such that $a$ appears in some pair in $\pi_e$; define $\Sigma_2^e$ symmetrically. For each edge $e$ we create a set of states, namely $x^e_{a,1}$ and $x^e_{a,2}$ for every $a \in \Sigma_1^e$. We do the same for $\Sigma_2^e$, creating $y^e_{b,1}$ and $y^e_{b,2}$ for every $b \in \Sigma_2^e$.

  • Actions and Transitions. We have a single action from $t_1$ to $g$ and a single action from $t_2$ to $t_1$. For each edge $e$ we have the following deterministic actions: every $x^e_{a,1}$ has a single outgoing action to $x^e_{a,2}$ for $a \in \Sigma_1^e$; every $y^e_{b,1}$ has a single outgoing action to $y^e_{b,2}$ for $b \in \Sigma_2^e$; every $x^e_{a,2}$ has an outgoing action to $y^e_{b,1}$ if $(a, b) \in \pi_e$, every $x^e_{a,1}$ has a single outgoing action to $t_2$, and every $y^e_{b,2}$ has a single outgoing action to $g$; lastly, we have a single action from $s_e$ to $x^e_{a,1}$ for every $a \in \Sigma_1^e$.

  • Reward. The reward for arriving in $g$ is 1. The reward for arriving in every other state is 0.

  • Option Set. Our candidate option set $\mathcal{O}'$ is as follows. For each vertex $u \in V_1$ and each $a \in \Sigma_1$ we have an option $o_{u,a}$: the initiation set of this option is every $s_e$ where $e$ is incident to $u$; the termination set of this option is every $x^e_{a,2}$ where $e$ is incident to $u$; the policy of this option takes the action from $s_e$ to $x^e_{a,1}$ when in $s_e$ and the action from $x^e_{a,1}$ to $x^e_{a,2}$ when in $x^e_{a,1}$.

    Symmetrically, for every vertex $v \in V_2$ and each $b \in \Sigma_2$ we have an option $o_{v,b}$: the initiation set of this option is every $y^e_{b,1}$ where $e$ is incident to $v$; the termination set of this option is $g$; the policy of this option takes the action from $y^e_{b,1}$ to $y^e_{b,2}$ when in $y^e_{b,1}$ and from $y^e_{b,2}$ to $g$ when in $y^e_{b,2}$.

One should think of choosing option $o_{u,a}$ as corresponding to choosing label $a$ for vertex $u$ in the input Min-Rep instance. Let $M_{\mathrm{MR}}$ be the MDP output by this reduction given instance MR of Min-Rep, and see Figure 17 for an illustration of our reduction.

Figure 17: Our reduction applied to the Min-Rep instance in Figure 16. Actions are given as solid lines and each option in $\mathcal{O}'$ is represented in its own color as a dashed line from its initiation states to its termination states. Notice that a single option can initiate at, and terminate in, states belonging to several different edges.

Let $\mathrm{OPT}(M_{\mathrm{MR}})$ be the value of the optimal solution to the MOMI-gen instance on $M_{\mathrm{MR}}$ and let $\mathrm{OPT}(\mathrm{MR})$ be the value of the optimal Min-Rep solution to MR. The following lemmas demonstrate the correspondence between MOMI-gen and Min-Rep solutions.

Lemma 2. $\mathrm{OPT}(M_{\mathrm{MR}}) \le \mathrm{OPT}(\mathrm{MR})$.

Proof.

Given a solution $(A, B)$ to MR, define $O_{A,B} := \{o_{u,a} : u \in V_1, a \in A(u)\} \cup \{o_{v,b} : v \in V_2, b \in B(v)\}$ as the corresponding set of options. Let $(A^*, B^*)$ be an optimal solution to MR, which is of cost $\mathrm{OPT}(\mathrm{MR})$.

We now argue that $O_{A^*,B^*}$ is a feasible solution to $M_{\mathrm{MR}}$ of cost $\mathrm{OPT}(\mathrm{MR})$, demonstrating that the optimal solution to $M_{\mathrm{MR}}$ has cost at most $\mathrm{OPT}(\mathrm{MR})$. To see this, notice that by construction the cost of $O_{A^*,B^*}$ is exactly the Min-Rep cost of $(A^*, B^*)$.

We need only argue, then, that $O_{A^*,B^*}$ is feasible for $M_{\mathrm{MR}}$, and do so now. The value of every state in $M_{\mathrm{MR}}$ is 1. Thus, we must guarantee that after 3 iterations of value iteration, every state has value 1. However, without any options every state except each $s_e$ has value 1 after 3 iterations of value iteration. Thus, it suffices to argue that $O_{A^*,B^*}$ guarantees that every $s_e$ will have value 1 after 3 iterations of value iteration. Since $(A^*, B^*)$ is a feasible solution to MR, we know that for every $e = (u, v) \in E$ there exist $a \in A^*(u)$ and $b \in B^*(v)$ such that $(a, b) \in \pi_e$; correspondingly, there are options $o_{u,a}, o_{v,b} \in O_{A^*,B^*}$. It follows that, starting from $s_e$, one can take option $o_{u,a}$, then the action from $x^e_{a,2}$ to $y^e_{b,1}$, and then option $o_{v,b}$ to arrive in $g$; thus, after 3 iterations of value iteration the value of $s_e$ is 1. Thus, we conclude that after 3 iterations of value iteration every state has converged to its value. ∎

We now show that a solution to $M_{\mathrm{MR}}$ corresponds to a solution to MR. For the remainder of this section, $O$ is a feasible solution to $M_{\mathrm{MR}}$ and $(A_O, B_O)$ is the Min-Rep solution corresponding to option set $O$, namely $A_O(u) := \{a : o_{u,a} \in O\}$ and $B_O(v) := \{b : o_{v,b} \in O\}$.

Lemma 3.

For a feasible solution $O$ to $M_{\mathrm{MR}}$, we have that $(A_O, B_O)$ is a feasible solution to MR of cost $|O|$.

Proof.

Notice that by construction the Min-Rep cost of $(A_O, B_O)$ is exactly $|O|$. Thus, we need only prove that $(A_O, B_O)$ is a feasible solution for MR.

We do so now. Consider an arbitrary edge $e = (u, v) \in E$; we wish to show that $(A_O, B_O)$ satisfies $e$. Since $O$ is a feasible solution to $M_{\mathrm{MR}}$, we know that after 3 iterations of value iteration every state must converge to its value. Moreover, notice that the value of every state in $M_{\mathrm{MR}}$ is 1. Thus, it must be the case that for every $s_e$ there exists a path of length 3 from $s_e$ to $g$ using either options or actions. The only such paths are those that take an option $o_{u,a}$, then an action from $x^e_{a,2}$ to $y^e_{b,1}$, and then option $o_{v,b}$, where $(a, b) \in \pi_e$. It follows that $a \in A_O(u)$ and $b \in B_O(v)$. But since $(a, b) \in \pi_e$, we then know that $e$ is satisfied. Thus, every edge is satisfied and so $(A_O, B_O)$ is a feasible solution to MR. ∎

Theorem 4.2 Proof.

Assume, for the sake of contradiction, that there exists an $\epsilon > 0$ for which a polynomial-time algorithm can $\gamma$-approximate MOMI-gen. We use this algorithm to