In classical planning problems, a model of the acting agent and its relationship to the relevant world is given in a formal planning description language, e.g., the classical STRIPS model [Fikes and Nilsson1971] or PDDL [McDermott et al.1998]. Planning algorithms (planners) use this model to generate a plan for achieving a given goal condition. Creating a planning domain model, however, is acknowledged as a notoriously hard knowledge engineering task. This has motivated much work on learning such knowledge.
One such well-studied approach is to learn a domain model by observing the agent’s interactions with the environment. The problems that arise in such approaches, however, are frequently intractable [Kearns and Valiant1994, Daniely and Shalev-Shwartz2016]. An alternative approach that is commonly used in reinforcement learning is to skip the model-learning phase and directly learn how to act by observing the agent’s past actions and observations, and by guiding the agent towards performing exploratory actions [Kearns et al.2002, inter alia]. In most prior work, the agent may fail to execute a planned action. When this occurs, the agent replans, possibly refining an underlying domain or action model it has learned. Thus, the agent learns from both positive and negative examples.
In this work we address a different setting, in which such execution failures must be avoided. This setting is relevant when execution failures are very costly or when the agent has limited computation capabilities, and thus does not have the capability to re-plan after the plan it has tried to execute has failed. Consider, for example, a team of nano-bots deployed inside a human body for medical target identification and drug delivery [Cavalcanti et al.2007]. Re-planning is likely not possible in such nano-bots, and, of course, failing to cure the disease is undesirable. Thus, the planning task we focus on in this paper is how to find a plan that is safe, i.e., is guaranteed to achieve the goal, in a setting where a domain model is not available. We call this problem the safe model-free planning problem.
Since performing actions that might fail is not allowed, exploration actions cannot be performed. The only source of information available is a set of trajectories of previously executed plans. First, we show how to learn from the given trajectories a set of actions that the agent can use. For every such action, we bound the set of predicates that may be its preconditions and effects. This bounded action model is then used to construct a classical planning problem such that a solution to it is a solution to our model-free problem. This approach to solving the model-free planning problem is sound and can be very efficient, since current classical planners are very efficient. However, it is not complete, as the planning problem that uses the learned action model might not be solvable even if the underlying model-free planning problem is. Nonetheless, we prove that under some assumptions, the probability of this occurring decreases quasi-linearly with the number of observed trajectories.
This positive result stands in contrast to the hardness of other tasks related to model learning. For example, learning to predict labels computed by finite-state machines [Kearns and Valiant1994] or even DNF formulas [Daniely and Shalev-Shwartz2016] is believed to be computationally intractable. Thus, we cannot hope to predict the values that fluents will take merely on the basis of the fact that these can be computed by a finite-state machine or a DNF. Similarly, even the problems of finding memoryless policies [Littman1994] or finite-state policies [Meuleau et al.1999] in simple environments are computationally intractable. Finally, in the standard interactive learning model, simple examples (that would be captured by a STRIPS environment, for example) are known to force a learner to explore an exponential number of paths in any reasonable measure of the environment’s complexity [Kakade2003, Section 8.6].
The key difference between the model learning we propose and these hardness results is that we limit our attention to the learning of STRIPS domain models in a PAC (“Probably Approximately Correct”) sense: we do not aim to learn an accurate action model, only one that is sufficient for finding a safe plan in most cases. We introduce and describe this PAC-like setting in Section 4.
2 Problem Definition
The setting we address is a STRIPS planning domain, represented using the SAS formalism [Bäckström and Nebel1995]. A planning domain in SAS is defined by the tuple , where
is a set of state variables, each associated with a finite domain .
is a set of actions, where each action is defined by a tuple , where and are assignments of values to state variables, i.e., a set of assignments of the form where . We refer to and its associated sets of preconditions and effects as the action model of the domain.
A state is also a set of assignments of the form such that every variable in is assigned a single value from its corresponding domain. As a shorthand notation, if a state contains an assignment we will write . A planning problem in SAS is defined by the tuple , where is the initial state and is a partial assignment of some state variables that defines the goal. A state is a goal state if . For an action and a state we denote by the state resulting from applying on state . A solution to an SAS planning problem is a plan, which is a sequence of actions such that .
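The definitions above can be made concrete with a minimal sketch. It assumes states and partial assignments are represented as Python dictionaries from variable names to values; the names `Action`, `satisfies`, `apply`, and `is_solution`, as well as the variable and location names in the example, are illustrative assumptions and do not come from the formalism itself.

```python
from typing import Dict, List, NamedTuple, Optional

State = Dict[str, str]  # full assignment: every variable -> one value


class Action(NamedTuple):
    name: str
    pre: Dict[str, str]  # partial assignment that must hold before execution
    eff: Dict[str, str]  # partial assignment that holds after execution


def satisfies(state: State, condition: Dict[str, str]) -> bool:
    """True if every assignment in the partial assignment holds in the state."""
    return all(state.get(var) == val for var, val in condition.items())


def apply(action: Action, state: State) -> Optional[State]:
    """Return the successor state, or None if the action is not applicable."""
    if not satisfies(state, action.pre):
        return None
    successor = dict(state)
    successor.update(action.eff)  # variables not mentioned in eff keep their value
    return successor


def is_solution(plan: List[Action], init: State, goal: Dict[str, str]) -> bool:
    """Check that applying the plan from init reaches a state satisfying goal."""
    state: Optional[State] = init
    for action in plan:
        state = apply(action, state)
        if state is None:  # an action failed to execute
            return False
    return satisfies(state, goal)


# Hypothetical two-variable example.
move_ab = Action("Move(A,B)", pre={"TruckAt": "A"}, eff={"TruckAt": "B"})
init = {"TruckAt": "A", "PackageAt": "B"}
print(is_solution([move_ab], init, {"TruckAt": "B"}))
```

In this representation a plan is rejected as soon as one of its actions is inapplicable, which mirrors the requirement that a solution must execute without failure.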
The key challenge we address in this paper is how to solve a SAS planning problem without having the action model of . Instead, a set of trajectories is assumed to be given.
Definition 1 (Trajectory)
A trajectory is an alternating sequence of states () and actions () that starts and ends with a state.
A trajectory represents a successful execution of a sequence of actions by the agent. A set of trajectories may be obtained, for example, by monitoring the acting agent when it is controlled manually by a human operator. The states in the given trajectories are assumed to be fully observable.
Finally, we can define the safe model-free planning problem, which is the focus of this work.
Definition 2 (Safe model-free planning)
Let be a planning problem and let be a set of trajectories in the planning domain . The input to a safe model-free planning problem is the tuple and the task is to generate a plan that is a solution to . We denote this safe model-free planning problem as .
3 Conservative Planning
To solve the safe model-free planning problem, we propose to learn a conservative action model, and then use it to find sound plans.
Following prior work on learning action models [Wang1995, Wang1994, Walsh and Littman2008], we partition every observed trajectory into a set of action triplets, where each action triplet is of the form . Let be all the action triplets for action . A state and are called pre- and post-state of , respectively, if there is an action triplet . Following Walsh and Littman [Walsh and Littman2008] and Wang [Wang1994, Wang1995], we observe that from the set of trajectories we can “bound” the set of predicates in an action’s preconditions and effects, as follows.
Equation 1 holds because a value assignment cannot be a precondition of if it is not in every pre-state of , and thus only value assignments that exist in all the pre-states of may be preconditions of (hence ). On the other hand, the fact that a state variable happened to have the same value in all the pre-states of the observed trajectories does not necessarily mean that is a precondition of . It may even be the case that has no preconditions at all, and thus the “lower bound” on an action’s precondition is .
Equation 2 holds because a value assignment cannot be an effect of if it is not in every post-state of , and thus only value assignments that exist in all the post-states of may be effects of (hence ). On the other hand, every variable whose value in the post-state differs from its value in the pre-state must be an effect (hence ). We denote the “upper bound” of the preconditions by and the “lower bound” of the effects by .
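For reference, the two bounds argued for above can be written out as follows. This is a reconstruction from the surrounding explanation, not a verbatim copy of Equations 1 and 2; the symbols $\mathrm{pre}(a)$, $\mathrm{eff}(a)$, and $\mathcal{T}(a)$ (the set of action triplets of $a$) are assumed notation.

```latex
\begin{align}
\emptyset \;\subseteq\; \mathrm{pre}(a) \;&\subseteq\; \bigcap_{\langle s, a, s' \rangle \in \mathcal{T}(a)} s
  \tag{1}\\
\bigcup_{\langle s, a, s' \rangle \in \mathcal{T}(a)} \left( s' \setminus s \right) \;\subseteq\; \mathrm{eff}(a) \;&\subseteq\; \bigcap_{\langle s, a, s' \rangle \in \mathcal{T}(a)} s'
  \tag{2}
\end{align}
```

The right-hand side of (1) is the “upper bound” on the preconditions and the left-hand side of (2) is the “lower bound” on the effects.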
3.1 Compilation to Classical Planning
Next, we use the bounds in Equations 1 and 2 to compile a safe model-free planning problem to a classical SAS problem , such that a solution to is a solution to , i.e., it is a solution for the underlying planning problem . is defined as follows. It has exactly the same set of state variables (), start state (), and goal () as . The actions of are all the actions seen in an observed trajectory. We denote this set of actions by . The preconditions of an action in are defined as the “upper bound” estimate given in Equation 1 () and the effects of in are defined to be the “lower bound” estimate given in Equation 2 ().
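The construction of the conservative action model can be sketched in a few lines. The sketch assumes a trajectory is a Python list that alternates state dictionaries and action names; the function names `triplets` and `learn_conservative_model` and the example trajectories are hypothetical and serve only to illustrate the intersection of pre-states and the union of observed changes.

```python
from typing import Dict, List, Set, Tuple

State = Dict[str, str]
Assignment = Tuple[str, str]  # (variable, value)


def triplets(trajectory: List) -> List[Tuple[State, str, State]]:
    """Split [s0, a1, s1, a2, s2, ...] into (pre-state, action, post-state)."""
    return [
        (trajectory[i], trajectory[i + 1], trajectory[i + 2])
        for i in range(0, len(trajectory) - 2, 2)
    ]


def learn_conservative_model(
    trajectories: List[List],
) -> Dict[str, Dict[str, Set[Assignment]]]:
    """Preconditions: intersection of all pre-states of the action.
    Effects: union of all assignments that changed between pre- and post-state."""
    model: Dict[str, Dict[str, Set[Assignment]]] = {}
    for trajectory in trajectories:
        for pre, name, post in triplets(trajectory):
            changed = {(v, val) for v, val in post.items() if pre.get(v) != val}
            if name not in model:
                model[name] = {"pre": set(pre.items()), "eff": changed}
            else:
                model[name]["pre"] &= set(pre.items())
                model[name]["eff"] |= changed
    return model


# Two hypothetical trajectories in a truck-and-package domain.
t1 = [
    {"TruckAt": "A", "PackageAt": "B"}, "Move(A,B)",
    {"TruckAt": "B", "PackageAt": "B"}, "Pickup(B)",
    {"TruckAt": "B", "PackageAt": "T"},
]
t2 = [
    {"TruckAt": "A", "PackageAt": "C"}, "Move(A,B)",
    {"TruckAt": "B", "PackageAt": "C"},
]
model_one = learn_conservative_model([t1])
model_two = learn_conservative_model([t1, t2])
print(sorted(model_one["Move(A,B)"]["pre"]))
print(sorted(model_two["Move(A,B)"]["pre"]))
```

With only the first trajectory, the learned precondition of the move action still mentions the package location; the second trajectory removes that spurious condition, since the intersection can only shrink as more pre-states are observed.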
Definition 3 (Safe)
An action model is safe with respect to an action model if for every state and action it holds that (1) if is applicable in according to then it is also applicable in according to , and (2) applying to results in exactly the same state when using either or .
Theorem 1
The action model in is safe with respect to the action model of .
Proof: Let and be the action models of and , respectively. Since is applicable in according to , it means that and consequently is also applicable in according to , since (Equation 1).
Now, let be the state resulting from applying on according to , and let denote the value of a state variable in , i.e., . Since , then according to either is an effect of or has no effect on . If the former is true then . Otherwise, it means that in the observed trajectories, applying never changed the value of , i.e., was equal to in both the pre-state and post-state. By definition, this means that is a precondition of according to , and thus . Thus, the effects of on will be the same in both action models, and hence a .
Corollary 1 (Soundness)
Every solution to is also a solution to
Corollary 1 is a direct result of Theorem 1, and its practical implication is the following algorithm for solving any safe model-free planning problem : compile it to a classical planning problem , run an off-the-shelf classical planner, and return the resulting plan. We refer to this algorithm as the conservative model-free planner. The conservative model-free planner is sound, but it is not complete. There can be planning problems that have a solution but the observed trajectories are not sufficient to induce a corresponding compiled planning problem that is solvable. As an extreme example, if we do not receive any observed trajectories, the compiled planning problem will not have any actions in its action model and thus will not be able to solve anything. In the next section we show that the required number of trajectories is actually reasonable.
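The planning step of the conservative model-free planner can be illustrated with a small stand-in for an off-the-shelf classical planner. The sketch below assumes the learned model is already available as a dictionary from action names to (precondition, effect) pairs, and uses plain breadth-first search in place of a real planner; `plan_bfs` and the example model are hypothetical.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

Partial = Dict[str, str]


def plan_bfs(
    model: Dict[str, Tuple[Partial, Partial]],
    init: Dict[str, str],
    goal: Partial,
) -> Optional[List[str]]:
    """Breadth-first search over states using only the learned actions.
    Returns a list of action names, or None if no plan is found."""
    start = frozenset(init.items())
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state_items, plan = frontier.popleft()
        state = dict(state_items)
        if all(state.get(v) == val for v, val in goal.items()):
            return plan
        for name, (pre, eff) in model.items():
            # An action is used only if its conservative preconditions hold.
            if all(state.get(v) == val for v, val in pre.items()):
                successor = frozenset({**state, **eff}.items())
                if successor not in seen:
                    seen.add(successor)
                    frontier.append((successor, plan + [name]))
    return None


# Hypothetical learned model with two actions.
learned = {
    "Move(A,B)": ({"TruckAt": "A"}, {"TruckAt": "B"}),
    "Pickup(B)": ({"TruckAt": "B", "PackageAt": "B"}, {"PackageAt": "T"}),
}
init = {"TruckAt": "A", "PackageAt": "B"}
print(plan_bfs(learned, init, {"PackageAt": "T"}))
print(plan_bfs({}, init, {"PackageAt": "T"}))
```

The second call shows the extreme case of incompleteness: with no observed actions the search cannot leave the initial state, so no plan is returned even though one exists in the underlying domain.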
Figure 1 illustrates how to generate from observed trajectories in a simple logistics-like domain with one truck, one package, and three possible locations , , and . The state variables are TruckAt, with domain , , and , and PackageAt with domain , , , and , where represents that the package is on the truck. The possible actions are Move, Pickup, and Unload, for every .
The three tables on the left-hand side, , , and are three observed trajectories, where the column represents the value of the state variable TruckAt and the column represents the value of the state variable PackageAt. For example, represents a trajectory where the truck starts at , moves to , picks up the package, moves to , and unloads the package there. The tables on the right-hand side of Figure 1 show the action model learned from these trajectories. For didactic reasons, we show the action model learned given just (), then the action model learned given and (), and finally the action model learned using all three trajectories.
As can be observed, given only we do not have any knowledge of many actions such as Pick, Pick, etc. Also, the preconditions learned for the actions Move and Move are too restrictive, requiring that the package is at some location (while clearly a Move action only requires knowing the truck’s location). However, given and , these redundant preconditions are removed, and thus any task that can be achieved with the actions Move, Move, Pick, and Unload will be solved by our conservative model-free planner.
4 Learning to Perform Safe Planning
In general, we cannot guarantee that any finite number of trajectories will suffice to obtain precisely the underlying action model. This is because, for example, if some action never appears in a trajectory, we may not know its effects; or, if an action is only used in a limited variety of states, it may be impossible to distinguish its preconditions. Consequently, we cannot guarantee a complete solution to the model-free safe planning problem. However, as the number of trajectories increases, we can hope to learn enough of the actions accurately enough to be able to find plans for most goals in practice. This gives rise to a statistical view of the model-free safe planning task (Definition 2) that follows the usual statistical view of learning, along the lines of Vapnik and Chervonenkis [Vapnik and Chervonenkis1971] and Valiant [Valiant1984].
Definition 4 (Safe Model-Free Learning-to-Plan)
We suppose that there is an arbitrary, unknown (prior) probability distribution over triples of the form , where is a state, is a goal condition, and is a trajectory that starts in and ends in a state that satisfies , and all trajectories are applicable in a fixed planning domain . In the safe model-free learning-to-plan task, we are given a set of triplets drawn independently from , and a new SAS planning problem such that the initial state and goal condition are from some drawn from . The task is to either output a plan for or, with probability at most , return that no plan was found.
Remarks on the task formulation
We stress that is arbitrary, and thus the conditional distribution of trajectories given a start and goal state, , can also be any arbitrary distribution. For example, could be the distribution of trajectories obtained by running some (unknown, sophisticated) planning algorithm on input , or produced by hand by a human domain expert. More generally, the adversarial choice of in our model subsumes a model in which a trajectory is nondeterministically and adversarially chosen to be paired with the start state and goal . (Indeed, the distributions used in such a case satisfy the stronger restriction that produces a deterministic outcome , which does not necessarily hold in our model.)
We also note that our conservative model-free planner does not actually require knowledge of the goals associated with the trajectories drawn from . Thus, our approach actually solves a more demanding task that does not provide the goals to the learning algorithm. But, such a distribution over goals is nevertheless a central feature in our notion of “approximate completeness,” and features prominently in the analysis as we discuss next.
Analysis of learning
A key question is how our certainty that a plan can be generated for a new start and goal state grows with the number of trajectories. Let and denote and , respectively, and let denote the largest number of values for a state variable.
Theorem 2
Using the conservative model-free planner, it is sufficient to observe trajectories to solve the safe model-free learning-to-plan problem with probability .
Proof Outline. First, Lemma 1 shows that the set of actions used by our conservative model-free planner () is sufficient to solve a randomly drawn problem with high probability. Then, Lemma 2 shows that under certain conditions, the preconditions we learned for these actions () are not too conservative, i.e., they are adequate for finding a plan with high probability. Finally, we prove that with high probability these conditions over the action model we learned indeed hold.
Lemma 1
Let be the set of actions such that each action appears in a trajectory sampled from with probability at least and let be the set of every action that appeared in an observed trajectory. The probability that all the actions appear in is at least .
Proof: By definition, the probability that an action does not exist in a trajectory drawn from is . Since the observed trajectories are drawn independently from we have that the probability that is , using the inequality . Since we assume in Theorem 2 that which is larger than , we have that the probability that is at most
Hence, by a union bound over (noting ), we have that with probability as needed.
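The shape of this argument can be spelled out generically. The block below is a sketch with placeholder symbols rather than the exact constants of Lemma 1: $p$ stands for the assumed lower bound on the probability that a frequent action appears in a sampled trajectory, $m$ for the number of observed trajectories, $A_{p}$ for the set of such frequent actions, $A(\mathcal{T})$ for the set of observed actions, and $\delta'$ for the failure probability allotted to this step.

```latex
\begin{align*}
\Pr\left[a \notin A(\mathcal{T})\right]
  &\le (1 - p)^{m} \le e^{-p m}
  && \text{(independence, and } 1 - x \le e^{-x}\text{)} \\
\Pr\left[\exists\, a \in A_{p} : a \notin A(\mathcal{T})\right]
  &\le |A_{p}| \, e^{-p m} \le |A| \, e^{-p m}
  && \text{(union bound, } |A_{p}| \le |A|\text{)} \\
|A| \, e^{-p m} \le \delta'
  &\iff m \ge \frac{1}{p} \ln \frac{|A|}{\delta'}
\end{align*}
```

The dependence on the number of actions is therefore only logarithmic in this step.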
Stated informally, Lemma 1 says that with high probability we will observe all the “useful” actions, i.e., the actions used in many trajectories. However, we may have learned preconditions for these actions that are too conservative, preventing the planner from finding a plan even if one exists. We next define a property of action models that states that this does not occur frequently.
Definition 5 (Adequate)
We call an action model -adequate if, with probability at most , we sample a trajectory from such that contains an action triplet where and does not satisfy .
We say that an action model is -inadequate if it is not -adequate. An equivalent way to define the notion of an -adequate action model is that with probability at most a trajectory is sampled in which an action is invoked on a state that does not satisfy the conservative preconditions of we learned from the given set of trajectories ().
Lemma 2
If the learned action model is -adequate and , then with probability our conservative model-free planner will find a plan for a start-goal pair sampled from .
Proof: Let be the (unknown) trajectory sampled for . The probability that uses an action that is not in is at most . Thus, contains only actions known to our planner with probability at least , since we assumed that . Since the action model is -adequate then with probability the learned preconditions are satisfied on all of the states in . Thus, by a union bound, we find that with probability , our planner could at least find . Hence it will find a trajectory from to , as required.
Lemma 3
The action model used by our conservative model-free planner is -adequate with probability at least .
Proof: Whether an action model is -adequate or not depends on the assignment of preconditions to actions. Since there are state variables each with at most values, then there are possible assignments of preconditions for an individual action and a total of possible preconditions assignments for an action model . Let be the subset of these action model preconditions assignments that are not -adequate. Clearly, has size at most .
Consider a particular assignment of preconditions in an -inadequate action model . Since is -inadequate, it has a set of state-action pairs () associated with it such that and cannot be applied to according to . The action model can only be learned by our algorithm if none of these state-action pairs were observed in the given trajectories . On the other hand, by the definition of inadequacy the probability of having a state-action pair from that list in a trajectory drawn from is at least . Thus, the probability that our algorithm will learn a particular preconditions assignment of an -inadequate action model is at most . Since , then is smaller than
which is at most . Thus, by a union bound over this set of inadequate assignments , the probability that any inadequate assignment of preconditions could be output is at most . Thus, with probability , the algorithm indeed produces an assignment of preconditions that is neither unsafe for nor inadequate for , as needed.
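The counting argument has the form of a standard bound over a finite hypothesis class, which can be sketched as follows. The symbols are placeholders and the count is an assumption made for illustration: $\varepsilon'$ is the inadequacy threshold, $\delta'$ the failure probability allotted to this step, $m$ the number of trajectories, $d$ the largest domain size, and $N$ the number of candidate precondition assignments, here bounded by letting each of the $|X|$ variables be either unconstrained or fixed to one of at most $d$ values, for each of the $|A|$ actions.

```latex
\begin{align*}
N &\le \left( (d + 1)^{|X|} \right)^{|A|} = (d + 1)^{|X| \, |A|} \\
\Pr\left[\text{a fixed inadequate assignment is consistent with all } m \text{ trajectories}\right]
  &\le (1 - \varepsilon')^{m} \le e^{-\varepsilon' m} \\
\Pr\left[\text{some inadequate assignment is learned}\right]
  &\le N \, e^{-\varepsilon' m} \\
N \, e^{-\varepsilon' m} \le \delta'
  &\iff m \ge \frac{1}{\varepsilon'} \left( |X| \, |A| \ln (d + 1) + \ln \frac{1}{\delta'} \right)
\end{align*}
```

Under these assumptions the required number of trajectories grows with the product of the number of actions and the number of state variables, and only logarithmically with the domain size.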
4.1 Unsolvable Instances
The implication of Theorem 2 is that by observing a number of trajectories that is quasi-linear in the number of actions and the number of state variables, we expect our safe model-free planner to be complete with high probability, in the sense that if a solution exists it will be found. But what if some of the drawn problem instances are not solvable?
Corollary 2
If the probability of drawing a solvable start-goal pair from is then it is sufficient to observe trajectories (of solvable instances) to guarantee that with probability of at least our conservative model-free planner will solve a start-goal pair drawn from with probability at least
Proof: Let be true or false if a given planning problem is solvable or unsolvable, respectively, and let be true or false if our planner returns a solution to or not respectively. We aim to bound . Since our planner is sound, and so
According to Theorem 2, and by definition.
Table 1 shows the priors, conditional probabilities, and marginals used in the proof of Corollary 2. The first row shows the probabilities , , and ; the second row shows the probabilities , , and ; and the last row shows the marginal probabilities and .
Corollary 2 and Table 1 are valuable in that they provide a relationship between , , , and . Thus, we can increase to satisfy more demanding values of , , and and different types of error bounds. For example, consider an application that requires bounding, by some , the probability that our planner outputs incorrectly that no plan exists. In other words, an application that requires
Using Bayes’ rule and the values from Table 1, this means that
Plugging into the sample complexity instead of in Theorem 2 will give the required number of trajectories to obtain a bound of on the probability of incorrectly outputting that a problem is not solvable.
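The Bayes'-rule step can be checked numerically with a short sketch. It assumes, as in the argument above, that a sound planner never returns a plan for an unsolvable instance; the parameter names `p_solvable` (prior probability that a drawn instance is solvable), `miss_rate` (probability that no plan is returned for a solvable instance), and `target` (the required bound) are hypothetical names introduced here.

```python
def prob_solvable_given_no_plan(p_solvable: float, miss_rate: float) -> float:
    """Posterior probability that an instance is solvable given that the
    planner returned no plan, assuming it never solves unsolvable instances."""
    if not 0.0 < p_solvable < 1.0:
        raise ValueError("p_solvable must lie strictly between 0 and 1")
    # P(no plan) = P(no plan | solvable) P(solvable) + 1 * P(unsolvable)
    p_no_plan = miss_rate * p_solvable + (1.0 - p_solvable)
    return miss_rate * p_solvable / p_no_plan


def max_miss_rate(p_solvable: float, target: float) -> float:
    """Largest miss rate for which the posterior above stays at most target."""
    if not (0.0 < p_solvable < 1.0 and 0.0 < target < 1.0):
        raise ValueError("arguments must lie strictly between 0 and 1")
    bound = target * (1.0 - p_solvable) / (p_solvable * (1.0 - target))
    return min(1.0, bound)


# Example: 80% of instances are solvable and we tolerate a 10% chance that
# a reported "no plan" is wrong.
eps = max_miss_rate(0.8, 0.1)
print(eps, prob_solvable_given_no_plan(0.8, eps))
```

The posterior is increasing in the miss rate, so any miss rate below the value returned by `max_miss_rate` also satisfies the target.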
4.2 Limited Planner Capabilities
The given trajectories are presumably generated by some planning entity. Since planning in general is a hard problem, it may be the case that the planner that generated the given set of trajectories has drawn a solvable problem from but was just not able to solve it due to memory or time constraints.
Learning from such a set of trajectories does not enable bounding the probability of solving problems in general. What can be obtained in such cases is to bound the solving capabilities of our conservative model-free planner with respect to the capabilities of the planner that generated the observed trajectories. Thus, instead of having represent the probability that an instance is solvable, we will have represent the probability that an instance is solvable by the original planner. The rest of the analysis follows exactly the same as in the previous section.
5 Related Work
Our work relates to several well-studied types of problems: planning under uncertainty, reinforcement learning, and domain model learning.
Planning under uncertainty.
In common models for planning under uncertainty, such as Markov Decision Problems (MDP) and Partially Observable MDPs (POMDP), the uncertainty stems from the stochastic nature of the world or from imperfect sensors that prevent full observability of the agent’s state. Our setting is different in that our uncertainty only stems from not knowing the agent’s action model.
Reinforcement learning.
Reinforcement learning algorithms learn how to act by interacting with the environment. Thus, they are designed for a trial-and-error approach to learn the domain and/or how to plan in it. Our task is to generate a plan that must work, so a trial-and-error approach is not sufficient.
Domain model learning.
Most prior work on learning a domain model in general, or a STRIPS action model from observed trajectories in particular, such as ARMS [Yang et al.2007] and LOCM [Cresswell et al.2013], learns approximate models that do not guarantee safety. Hence, such work generally also involves some form of trial-and-error, iteratively requesting more example trajectories or interacting directly with the environment to refine the learned model [Mourão et al.2012, Wang1994, Wang1995, Walsh and Littman2008, Levine and DeJong2006, Jiménez et al.2013]. In addition, most works learn from both positive and negative examples, observing successful and failed trajectories, while we only require successful trajectories to be provided.
Another key difference is that, unlike our work, most prior works do not provide statistical guarantees on the soundness of the plan generated with their learned model. An exception to this is the work of Walsh and Littman [Walsh and Littman2008], who also discussed the problem of learning STRIPS operators from observed trajectories and provided theoretical bounds on the sample complexity, i.e., the number of interactions that may fail until the resulting planner is sound and complete. By contrast, we do not assume any planning and execution loop and do not allow failed interactions. Hence, we aim for a planning algorithm that is guaranteed to be sound, at the cost of completeness. This difference affects their approach to learning. They attempted to follow an optimistic assumption about the preconditions and effects of the learned actions, in an effort to identify inaccuracies in their action model. By contrast, we are forced to take a pessimistic approach, as we aim for a successful execution of the plan rather than information gathering to improve the action model.
This paper deals with a planning problem in which the planning agent has no knowledge about its actions. Instead of an action model, the planner is given a set of observed trajectories of successfully executed plans. In this setting we introduced the safe model-free planning problem, in which the task is to find a plan that is guaranteed to reach the goal, i.e., there is no tolerance for execution failure. This type of problem is important in cases where failure is costly or in cases where the agent has no capability to replan during execution.
We showed how to use the given set of trajectories to learn about the agent’s actions, bounding the set of predicates in the actions’ preconditions and effects. Then, we proposed a conservative approach to solve the safe model-free problem that is based on a translation to a classical planning problem. This solution is sound but is not complete, as it may fail to find a solution even if one exists. However, we prove that under some assumptions the likelihood of finding a solution with this approach grows linearly with the number of predicates and quasi-linearly with the number of actions.
Future directions for safe model-free planning include studying how to address richer underlying planning models including parametrized actions, conditional effects, stochastic action outcomes, and partial observability. While some of these more complex action models can be compiled away (e.g., a problem with conditional effects can be compiled to a problem without conditional effects [Nebel2000]), the resulting problem can be significantly larger. A particularly interesting direction is how to learn a lifted action model, i.e., what can be learned from a trajectory with an action on the action model of , where is a parameterized action and and are different values for the same parameter.
B. Juba was partially supported by an AFOSR Young Investigator Award. R. Stern was partially supported by the Cyber Security Research Center at BGU.
- [Bäckström and Nebel1995] Christer Bäckström and Bernhard Nebel. Complexity results for SAS+ planning. Computational Intelligence, 11(4):625–655, 1995.
- [Cavalcanti et al.2007] Adriano Cavalcanti, Bijan Shirinzadeh, Robert A Freitas Jr, and Tad Hogg. Nanorobot architecture for medical target identification. Nanotechnology, 19(1):015103, 2007.
- [Cresswell et al.2013] Stephen N. Cresswell, Thomas L. McCluskey, and Margaret M. West. Acquiring planning domain models using LOCM. The Knowledge Engineering Review, 28(02):195–213, 2013.
- [Daniely and Shalev-Shwartz2016] Amit Daniely and Shai Shalev-Shwartz. Complexity theoretic limitations on learning DNF’s. In Proceedings of the 29th Conference on Computational Learning Theory, volume 49 of JMLR Workshops and Conference Proceedings, pages 1–16, 2016.
- [Fikes and Nilsson1971] Richard E Fikes and Nils J Nilsson. STRIPS: A new approach to the application of theorem proving to problem solving. Artificial intelligence, 2(3-4):189–208, 1971.
- [Jiménez et al.2013] Sergio Jiménez, Fernando Fernández, and Daniel Borrajo. Integrating planning, execution, and learning to improve plan execution. Computational Intelligence, 29(1):1–36, 2013.
- [Kakade2003] Sham M. Kakade. On the sample complexity of reinforcement learning. PhD thesis, University of London, 2003.
- [Kearns and Valiant1994] Michael Kearns and Leslie Valiant. Cryptographic limitations on learning boolean formulae and finite automata. Journal of the ACM (JACM), 41(1):67–95, 1994.
- [Kearns et al.2002] Michael Kearns, Yishay Mansour, and Andrew Ng. A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine Learning, 49(2):193–208, 2002.
- [Levine and DeJong2006] Geoffrey Levine and Gerald DeJong. Explanation-based acquisition of planning operators. In ICAPS, pages 152–161, 2006.
- [Littman1994] Michael L. Littman. Memoryless policies: Theoretical limitations and practical results. In From Animals to Animats 3: Proceedings of the Third International Conference on Simulation of Adaptive Behavior, volume 3, page 238. MIT Press, 1994.
- [McDermott et al.1998] Drew McDermott, Malik Ghallab, Adele Howe, Craig Knoblock, Ashwin Ram, Manuela Veloso, Daniel Weld, and David Wilkins. PDDL-the planning domain definition language. Technical report, AIPS ’98 - The Planning Competition Committee, 1998.
- [Meuleau et al.1999] Nicolas Meuleau, Kee-Eung Kim, Leslie Pack Kaelbling, and Anthony R. Cassandra. Solving POMDPs by searching the space of finite policies. In Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence, pages 417–426. Morgan Kaufmann Publishers Inc., 1999.
- [Mourão et al.2012] Kira Mourão, Luke S Zettlemoyer, Ronald Petrick, and Mark Steedman. Learning STRIPS operators from noisy and incomplete observations. In UAI, pages 614–623, 2012.
- [Nebel2000] Bernhard Nebel. On the compilability and expressive power of propositional planning formalisms. Journal of Artificial Intelligence Research, 12:271–315, 2000.
- [Valiant1984] Leslie G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
- [Vapnik and Chervonenkis1971] V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264–280, 1971.
- [Walsh and Littman2008] Thomas J. Walsh and Michael L. Littman. Efficient learning of action schemas and web-service descriptions. In the National Conference on Artificial Intelligence (AAAI), pages 714–719, 2008.
- [Wang1994] Xuemei Wang. Learning planning operators by observation and practice. In the International Conference on Artificial Intelligence Planning Systems (AIPS), pages 335–340, 1994.
- [Wang1995] Xuemei Wang. Learning by observation and practice: An incremental approach for planning operator acquisition. In ICML, pages 549–557, 1995.
- [Yang et al.2007] Qiang Yang, Kangheng Wu, and Yunfei Jiang. Learning action models from plan examples using weighted MAX-SAT. Artificial Intelligence, 171(2-3):107–143, 2007.