End-users of service mobile robots want the ability to teach their robots how to perform novel tasks by composing known low-level skills into high-level behaviors based on demonstrations and user preferences. Learning from Demonstration (LfD) [lfdSurvey] and Inverse Reinforcement Learning (IRL) [ziebartInverse] have been applied to this problem with great success in several domains, including furniture assembly [nieukumFurniture], object pick-and-place [levineDemo], and surgery [padoySurgery, kellerSurgery]. A key driving factor for these successes has been the use of Neural Networks (NNs) to learn the action selection policy (ASP) directly [schulman2015trust, schulman2017proximal], or the value function from which the policy is derived [tamar2016value]. Unfortunately, despite their success at representing and learning policies, NN-based LfD approaches suffer from the following well-known problems: 1) they are extremely data-intensive, and need a variety of demonstrations before a meaningful policy can be learned [deepLearningRobots]; 2) they are opaque to the user, making it hard to understand why they do things in specific ways or to verify them [manuelaExplanation]; 3) they are quite brittle, and very hard to repair when parameters of the problem change, or when moving from simulation to real robots [closingSimToReal].
We present the following observations about ASPs independent of their representation: 1) The input states to a policy consist of physically meaningful quantities, such as velocities, distances, and angles. 2) The structure of a policy has distinct levels of abstraction, including computing relevant features from the state, composing several decision-making criteria, and making decisions based on task- and domain-specific parameters. 3) When the domain changes, a well-structured policy is easy to repair by adjusting only the parameters that determine the decision boundaries.
Based on these insights, we build on program synthesis as a means to address the shortcomings of neural approaches. Program synthesis seeks to automatically find a program in an underlying programming language that satisfies some user specification [gulwani2017program]. Synthesis directly addresses these concerns by learning policies as human-readable programs that are amenable to program repair, and it can do so with only a small number of demonstrations as a specification. However, due to two major limitations, existing state-of-the-art synthesis approaches are not sufficient for learning robot programs. First, these approaches are not designed to handle non-linear real arithmetic, vector operations, or dimensioned quantities, all commonly found in robot programs. Second, synthesis techniques are largely limited by their ability to scale with the search space of potential programs, such that ASP synthesis is intractable for existing approaches.
To address these limitations and apply synthesis to solving the LfD problem we propose Layered Dimension-Informed Program Synthesis (LDIPS). We introduce a domain-specific language (DSL) for representing ASPs where a type system keeps track of the physical dimensions of expressions, and enforces dimensional constraints on mathematical operations. These dimensional constraints limit the search space of the program, greatly improving the scalability of the approach and the performance of the resulting policies. The DSL structures ASPs into decision-making criteria for each possible action, where the criteria are repairable parameters, and the expressions used are derived from the state variables. The inputs to LDIPS are a set of sparse demonstrations and an optional incomplete ASP, that encodes as much structure as the programmer may have about the problem. LDIPS then fills in the blanks of the incomplete ASP using syntax-guided synthesis [syntaxGuided] with dimension-informed expression and operator pruning. The result of LDIPS is a fully instantiated ASP, composed of synthesized features, conditionals, and parameters.
We present empirical results of applying LDIPS to robot soccer and autonomous driving, showing that it is capable of generating ASPs that are comparable in performance to expert-written ASPs that performed well in a (omitted for double-blind review) competition. We experimentally evaluate the effect of dimensional constraints on the performance of the policy and the number of candidate programs considered. We further show that LDIPS is capable of synthesizing such ASPs with two orders of magnitude fewer examples than an NN representation. Finally, we show that LDIPS can synthesize ASPs in simulation, and given only a few corrections, can repair the ASPs so that they perform almost as well on the real robots as they did in simulation.
2 Related Work
The problem of constructing ASPs from human demonstrations has been extensively studied in the LfD and inverse reinforcement learning (IRL) settings [reinSurvey, deepLearningRobots, lfdSurvey]. In this section, we focus on 1) alternative approaches to overcome data efficiency, domain transfer, and interpretability problems; 2) concurrent advances in program synthesis; 3) recent work on symbolic learning similar to our approach; and 4) synthesis and formal methods applied to robotics. We conclude with a summary of our contributions compared to the state of the art.
The field of transfer learning attempts to address generalization and improve learning rates to reduce data requirements[stoneTransferSurvey]. Model-based RL can also reduce the data requirements on real robots, such as by using dynamic models to guide simulation [dudekModelBased]. Other work addresses the problem of generalizing learning by incorporating corrective demonstration when errors are encountered during deployment [thomazCorrectiveDemo]. Approaches to solving the Sim-to-Real problem have modified the training process and adapted simulations [closingSimToReal], or utilized progressive nets to transfer features [simToRealProg]. Recent work on interpreting policies has focused on finding interpretable representations of NN policies, such as with Abstracted Policy Graphs [manuelaExplanation], or by utilizing program synthesis to mimic the NN policy [swaratInterp].
SyGuS is a broad field of synthesis techniques that have been applied in many domains [syntaxGuided]. The primary challenge of SyGuS is scalability, and there are many approaches for guiding the synthesis in order to tractably find the best programs. A common method for guiding synthesis is the use of sketches, where a sketch is a partial program with some holes left to be filled in via synthesis [combinatorialSketching]. Another approach is to quickly rule out portions of the program space that can be identified as incorrect or redundant, such as by identifying equivalent programs given examples [exampleTransit], by learning to recognize properties that make candidate programs invalid [isilConflictDriven], or by using type information to identify promising programs [typeExample]. A similar approach is to consider sets of programs at once, such as by using a special data structure for string manipulation expressions [gulwaniStrings], or by using SMT alongside sketches to rule out multiple programs simultaneously [swaratComponent].
Recent symbolic learning approaches have sought to combine synthesis and deep learning by leveraging NNs for sketch generation[swaratNeuralSketch, solarSketchLearn], by guiding the search using neural models [gulwaniNeuralGuided], or by leveraging purely statistical models to generate programs [robustFill]. Alternatively, synthesis has been used to guide learning, as in work that composes neural perception and symbolic program execution to jointly learn visual concepts, words, and semantic parsing of questions [maoNeurSymbo]. While symbolic learning leveraging program synthesis produces interpretable ASPs in restricted program spaces, these approaches often still require large amounts of data.
State-of-the-art work for synthesis in robotics focuses on three primary areas. The most related work uses SMT-based parameter repair alongside human corrections for adjusting transition functions in robot behaviors [holtz2018interactive]. Similar work utilizes SyGuS as part of a symbolic learning approach to interpret NN policies as PID controllers for autonomous driving [swaratInterp, swaratImitation]. A different, but more common synthesis strategy in robotics is reactive synthesis. Reactive synthesis produces correct-by-construction policies based on Linear Temporal Logic specifications of behavior by generating policies as automata without relying on a syntax [kressSurvey, kressPlanning, topcuReactive, salty].
In this work, we present an LfD approach that addresses data-efficiency, verifiability, and repairability concerns by utilizing SyGuS, without any NN components. LDIPS builds on past SyGuS techniques by introducing dimensional constraints. While past work in the programming languages community has leveraged types for synthesis [typeExample], to the best of our knowledge none has incorporated dimensional analysis. Further, LDIPS extends prior approaches by supporting non-linear real arithmetic, such as trigonometric functions, as well as vector algebra.
3 Synthesis for Action Selection
This section presents LDIPS, using our RoboCup soccer-playing robot as an example (fig. 0(a)). We consider the problem of learning an action selection policy (ASP) that directs our robot to intercept a moving ball and kick it towards the goal. An ASP for this task employs three low-level actions: go to the ball (Goto), intercept it (Inter), and kick it toward the goal (Kick). The robot runs the ASP repeatedly, several times per second, and uses it to transition from one action to another, based on the observed position and velocity of the ball and the robot. Formally, an ASP for this problem is a function that maps a previous action and a current world state to a next action. The world state definition is domain-dependent: for robot soccer, it consists of the locations and velocities of the ball and the robot.
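To make the function view of an ASP concrete, the following is a minimal hand-written sketch in Python, not the policy from the paper. The `World` record, the helper names, and both threshold constants are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class World:
    """Hypothetical world state: positions (m) and velocities (m/s) as (x, y)."""
    ball_pos: tuple
    ball_vel: tuple
    robot_pos: tuple
    robot_vel: tuple


def dist(a, b):
    """Euclidean distance between two 2D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def speed(v):
    """Magnitude of a 2D velocity vector."""
    return math.hypot(v[0], v[1])


# Illustrative decision boundaries (assumptions, not values from the paper).
KICK_RANGE_M = 0.15
MOVING_BALL_MPS = 0.10


def asp(prev_action: str, w: World) -> str:
    """Map (previous action, world state) to the next action."""
    d = dist(w.ball_pos, w.robot_pos)
    if prev_action in ("Goto", "Inter") and d < KICK_RANGE_M:
        return "Kick"
    if prev_action == "Goto" and speed(w.ball_vel) > MOVING_BALL_MPS:
        return "Inter"
    if prev_action == "Kick" and d >= KICK_RANGE_M:
        return "Goto"
    return prev_action  # no transition fires: keep the current action
```

Each `if` guard combines a feature (a distance or a speed) with a threshold, which mirrors the three logical layers discussed next.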
An ASP can be decomposed into three logical layers: 1) expressions that compute features (e.g., the distance to the ball, or its velocity relative to the robot); 2) the construction of decision logic based on feature expressions (e.g., the features needed to determine whether to kick or follow the ball); and 3) the parameters that determine the decision boundaries (e.g., the dimensions of the ball and robot determine the distance at which a kick will succeed).
Given only a sequence of demonstrations, LDIPS can synthesize an ASP encoded as a structured program. For example, fig. 0(b) shows a set of nine demonstrations, where each is a transition from one action to another, given a set of observations. Given these demonstrations, LDIPS generates an ASP in three steps. 1) It generates a sequence of if-then-else statements that test the current action and return a new action (fig. 0(e)). However, this is an incomplete ASP that has blank expressions and blank parameters. 2) LDIPS uses bounded program enumeration to generate candidate features. However, these features have blank parameters for decision boundaries. 3) LDIPS uses an SMT solver to find parameter values that are consistent with the demonstrations. If the currently generated set of features is inadequate, then LDIPS will not find parameter values. In that case, the algorithm returns to step (2) to generate new features. Eventually, the result is a complete ASP that we can run on the robot (fig. 0(f)). Compared to other LfD approaches, a unique feature of LDIPS is that it can also synthesize parts of an ASP with varying amounts of guidance. In addition to the demonstrations, the user may also provide an incomplete ASP. For example, the user can write the ASP shown in fig. 0(c), which has several blank parameters, e.g., to determine the maximum distance at which a Kick will succeed. It also has blank expressions and predicates, e.g., for the conditions under which the robot should kick a moving ball. Given this incomplete ASP, LDIPS will produce a completed executable ASP that preserves the non-blank portions of the incomplete ASP (fig. 0(d)).
3.1 A Language for (Incomplete) Action Selection Policies
Figure 1(a) presents a context-free grammar for the language of ASPs. In this language, a policy is a sequence of nested conditionals that return the next action. Every condition is a predicate that compares feature expressions to threshold parameters. A feature expression can refer to input variables and the value of the last action. An incomplete ASP may have blank expressions, predicates, or parameters. The output of LDIPS is a complete ASP with all blanks filled in. At various points in LDIPS we need to evaluate programs in this syntax with respect to a world state; to accomplish this, we employ a function Eval.
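One possible way to represent such incomplete programs is a small abstract syntax tree with explicit hole nodes. The sketch below is an assumption-laden illustration: the class names (`Var`, `Const`, `Hole`, `BinOp`, `Cmp`), the operator set, and the dictionary-based world state are all hypothetical and do not reproduce the grammar of fig. 1(a).

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Var:
    name: str  # input variable looked up in the world state


@dataclass(frozen=True)
class Const:
    value: float  # threshold parameter with a concrete value


@dataclass(frozen=True)
class Hole:
    kind: str  # "expr", "pred", or "param": a blank left for synthesis


@dataclass(frozen=True)
class BinOp:
    op: str  # one of "+", "-", "*"
    left: "Expr"
    right: "Expr"


@dataclass(frozen=True)
class Cmp:
    op: str  # ">" or "<": compares a feature to a threshold
    feature: "Expr"
    threshold: "Expr"


Expr = Union[Var, Const, Hole, BinOp]


def is_complete(node) -> bool:
    """True when the program contains no blanks."""
    if isinstance(node, Hole):
        return False
    if isinstance(node, BinOp):
        return is_complete(node.left) and is_complete(node.right)
    if isinstance(node, Cmp):
        return is_complete(node.feature) and is_complete(node.threshold)
    return True


def eval_expr(e, world: dict) -> float:
    """Evaluate a complete expression against a world state."""
    if isinstance(e, Var):
        return world[e.name]
    if isinstance(e, Const):
        return e.value
    if isinstance(e, BinOp):
        l, r = eval_expr(e.left, world), eval_expr(e.right, world)
        return {"+": l + r, "-": l - r, "*": l * r}[e.op]
    raise ValueError("cannot evaluate an incomplete program")


def eval_pred(p, world: dict) -> bool:
    """Evaluate a complete comparison predicate against a world state."""
    if not isinstance(p, Cmp):
        raise ValueError("cannot evaluate an incomplete predicate")
    f, t = eval_expr(p.feature, world), eval_expr(p.threshold, world)
    return f > t if p.op == ">" else f < t
```

In this representation, a sketch such as `Cmp("<", Hole("expr"), Hole("param"))` is a predicate whose feature and threshold are both still blank.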
Different problem domains require different sets of primitive actions and operators. Thus, for generality, LDIPS is agnostic to the collection of actions and operators required. Instead, we instantiate LDIPS for different domains by specifying the collection of actions, unary operators, and binary operators that are relevant to ASPs for that domain. For example, fig. 1(b) shows the actions and operators of the RoboCup domain.
The specification of every operator includes the types and dimensions of its operands and result. In section 3.3, we see how LDIPS uses both types and dimensions to constrain its search space significantly. LDIPS supports real-valued scalars, vectors, and booleans with specific dimensions. Dimensional analysis involves tracking base physical quantities as calculations are performed, such that the space of legal operations is constrained and the dimensionality of the result is well-defined. Quantities can only be compared, added, or subtracted when they are commensurable, but they may be multiplied or divided even when they are incommensurable. We extend the types of our language with dimensions by defining the dimension as the vector of dimensional exponents corresponding to Length, Time, and Mass, in that order. As an example, a quantity that represents a length in meters has the exponent vector [1, 0, 0], and a velocity vector, with dimensionality Length/Time, has the exponent vector [1, -1, 0]. Further, we extend the type signature of operations to include dimensional constraints that refine their domains and describe the resulting dimensions in terms of the input dimensions. The type signatures of operations are represented in a type environment that maps from expressions to types.
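The dimension rules above can be sketched in a few lines of Python. This is an illustrative encoding, not the type system from the paper: the `Dim` class and the `result_dim` helper are hypothetical names, and booleans are modeled as dimensionless for simplicity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dim:
    """Exponents of the base quantities Length, Time, and Mass."""
    length: int = 0
    time: int = 0
    mass: int = 0

    def __mul__(self, other):
        # Multiplying quantities adds their exponents.
        return Dim(self.length + other.length,
                   self.time + other.time,
                   self.mass + other.mass)

    def __truediv__(self, other):
        # Dividing quantities subtracts their exponents.
        return Dim(self.length - other.length,
                   self.time - other.time,
                   self.mass - other.mass)


LENGTH = Dim(1, 0, 0)
TIME = Dim(0, 1, 0)
VELOCITY = LENGTH / TIME  # Dim(1, -1, 0)


def result_dim(op: str, a: Dim, b: Dim):
    """Return the dimension of `a op b`, or None if the operation is illegal."""
    if op in ("+", "-"):
        return a if a == b else None      # operands must be commensurable
    if op in ("<", ">"):
        return Dim() if a == b else None  # comparison yields a dimensionless bool
    if op == "*":
        return a * b
    if op == "/":
        return a / b
    raise ValueError(f"unknown operator: {op}")
```

Under these rules, adding a length to a velocity is rejected, while multiplying a velocity by a time yields a length.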
3.2 LDIPS-L1: Parameter Synthesis
LDIPS-L1 fills in values for blank constant parameters in a predicate, under the assumption that the predicate contains no blank expressions or predicates. The input is the predicate, a set of positive examples on which the predicate must produce true, and a set of negative examples on which it must produce false. The result of LDIPS-L1 is a new predicate where all blanks in the input are replaced with constant values.
LDIPS-L1 uses Rosette and the Z3 SMT solver [rosette, z3] to solve constraints. To do so, we translate the incomplete predicate and examples into SMT constraints (l1Sketch). LDIPS-L1 builds a formula for every example, which asserts that there exists some value for each blank parameter in the predicate, such that the predicate evaluates to true on a positive example (and false on a negative example). Moreover, for each blank parameter, we ensure that we choose the same value across all examples. The algorithm uses two auxiliary functions: 1) ParamHoles returns the set of blank parameters in the predicate, and 2) PartialEval substitutes input values from the example into a predicate and simplifies it as much as possible, using partial evaluation [partialEval]. A solution to this system of constraints allows us to replace blank parameters with values that are consistent with all examples. If no solution exists, we return UNSAT (unsatisfiable).
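For intuition, the single-threshold case can be solved without an SMT solver. The sketch below is a simplified stand-in for the Rosette/Z3 encoding, not the paper's algorithm: it assumes a predicate of the form `feature(w) > x` with exactly one blank parameter `x`, and the function name `solve_threshold` is hypothetical. Each positive example contributes the constraint `x < feature(w)` and each negative example contributes `x >= feature(w)`; the same `x` must satisfy all of them.

```python
def solve_threshold(feature, positives, negatives):
    """Find x so that feature(w) > x on positives and feature(w) <= x on negatives.

    Returns a consistent value for x, or None when the constraints are
    unsatisfiable (the UNSAT case).
    """
    pos_vals = [feature(w) for w in positives]
    neg_vals = [feature(w) for w in negatives]
    upper = min(pos_vals, default=float("inf"))   # x must be strictly below this
    lower = max(neg_vals, default=float("-inf"))  # x must be at or above this
    if lower >= upper:
        return None  # no single threshold separates the examples
    if upper == float("inf") and lower == float("-inf"):
        return 0.0   # no examples at all: any value is consistent
    if upper == float("inf"):
        return lower
    if lower == float("-inf"):
        return upper - 1.0
    return (lower + upper) / 2.0  # midpoint of the feasible interval
```

Choosing the midpoint is a design choice of this sketch; any value in the feasible interval is consistent with the examples.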
3.3 LDIPS-L2: Feature Synthesis
LDIPS-L2 consumes a predicate with blank expressions and blank parameters and produces a completed predicate. (An incomplete predicate may occur in a user-written ASP, or may be generated by LDIPS-L3 to decide on a specific action transition in the ASP.) To complete the predicate, LDIPS-L2 also receives sets of positive and negative examples, on which the predicate should evaluate to true and false respectively. Since the predicate guards an action transition, each positive example corresponds to a demonstration where the transition is taken, and each negative example corresponds to a demonstration where it is not. Finally, LDIPS-L2 receives a type environment of candidate expressions to plug into blank expressions and a maximum depth. If LDIPS-L2 cannot complete the predicate to satisfy the examples, it returns UNSAT.
The LDIPS-L2 algorithm (L2) proceeds in several steps. 1) It enumerates a set of candidate expressions that do not exceed the maximum depth and are dimension-constrained (line 4). 2) It fills the blank expressions in the predicate using the candidate expressions computed in the previous step, which produces a new predicate that only has blank parameters (line 4). 3) It calls LDIPS-L1 to fill in the blank parameters and returns that result if it succeeds. 4) If LDIPS-L1 produces UNSAT, then the algorithm returns to Step 2 and tries a new candidate expression.
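The control flow of these steps can be sketched as a loop over candidate features. This is a simplified illustration under stated assumptions: the predicate has one blank expression and one blank threshold in the form `feature(w) > x`, the candidates are supplied as a pre-enumerated list of hypothetical `(name, function)` pairs, and the parameter step is a plain interval check standing in for the SMT-based LDIPS-L1.

```python
def fill_parameter(pos_vals, neg_vals):
    """Stand-in for the parameter step: find x with pos > x and neg <= x."""
    upper = min(pos_vals, default=float("inf"))
    lower = max(neg_vals, default=float("-inf"))
    if lower >= upper:
        return None  # UNSAT for this candidate feature
    if upper == float("inf") or lower == float("-inf"):
        # One side is unconstrained: pick any value on the feasible side.
        return lower if lower != float("-inf") else (
            upper - 1.0 if upper != float("inf") else 0.0)
    return (lower + upper) / 2.0


def complete_predicate(candidates, positives, negatives):
    """Try each candidate feature in order; return (name, threshold) or None."""
    for name, feature in candidates:
        # Step 2: plug the candidate into the blank expression.
        pos_vals = [feature(w) for w in positives]
        neg_vals = [feature(w) for w in negatives]
        # Step 3: try to fill in the remaining blank parameter.
        x = fill_parameter(pos_vals, neg_vals)
        if x is not None:
            return name, x
        # Step 4: UNSAT, so move on to the next candidate.
    return None  # no candidate satisfies the examples
```

For instance, if a distance feature cannot separate the positive and negative examples but a speed feature can, the loop skips the former and returns the latter together with its threshold.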
The algorithm uses the EnumFeatures helper function to enumerate all expressions up to the maximum depth that are type- and dimension-correct. The only expressions that can appear in predicates are scalars; thus the initial call to EnumFeatures asks for scalar expressions. (Recursive calls encounter other types.) EnumFeatures generates expressions by applying all possible operators to sub-expressions, where each sub-expression is itself produced by a recursive call to EnumFeatures.
The base case for the recursive definition is a depth of zero: the result is the empty set of expressions. Calling EnumFeatures with a depth of one and a given type produces the subset of input identifiers from the type environment that have that type. Calling EnumFeatures with a greater depth and a given type produces all expressions of that type up to that depth, including those that involve operators. For example, if EnumFeatures generates a binary-operator expression at a given depth, it makes recursive calls to generate the two sub-expressions at the next smaller depth. However, it ensures that the types and dimensions of the sub-expressions are compatible with the binary operator. For example, if the binary operator is addition, the sub-expressions must both be scalars or vectors with the same dimensions. This type and dimension constraint allows us to exclude a large number of meaningless expressions from the search space. L2 presents a subset of the recursive rules of expansion for EnumFeatures.
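A reduced version of this recursion can be written directly. The sketch below is illustrative only: it handles scalars (no vectors or type checks), uses four binary operators chosen for the example, represents expressions as strings, and encodes dimensions as `(length, time, mass)` exponent tuples. The names `enum_features` and `with_dimension` are hypothetical.

```python
from itertools import product


def enum_features(env, depth):
    """Enumerate dimension-correct expressions up to `depth`.

    `env` maps input identifiers to (length, time, mass) exponent tuples.
    Returns a dict from expression strings to their dimensions.
    """
    if depth <= 0:
        return {}         # base case: nothing can be built
    if depth == 1:
        return dict(env)  # only the input identifiers
    subs = enum_features(env, depth - 1)
    exprs = dict(subs)
    for (e1, d1), (e2, d2) in product(subs.items(), repeat=2):
        if d1 == d2:
            # Addition and subtraction require commensurable operands.
            exprs[f"({e1} + {e2})"] = d1
            exprs[f"({e1} - {e2})"] = d1
        # Multiplication and division combine the exponents.
        exprs[f"({e1} * {e2})"] = tuple(a + b for a, b in zip(d1, d2))
        exprs[f"({e1} / {e2})"] = tuple(a - b for a, b in zip(d1, d2))
    return exprs


def with_dimension(exprs, dim):
    """Select the expressions whose dimension matches the requested one."""
    return sorted(e for e, d in exprs.items() if d == dim)
```

With three inputs of pairwise different dimensions, depth two yields 27 expressions in this sketch, compared with 39 if addition and subtraction were allowed on every pair; the gap widens quickly with depth.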
Even with type and dimension constraints, the search space of EnumFeatures can be intractable. To further reduce the search space, the function uses a variation of signature equivalence [exampleTransit] that we extend to support dimensions. A naive approach to expression enumeration would generate type- and dimension-correct expressions that represent different functions, but produce the same result on the set of examples. For example, an input x and its absolute value |x| represent different functions with the same type and dimension. However, if our demonstrations only have positive values for x, there is no reason to consider both expressions, because they are equivalent given our demonstrations. We define the signature of an expression as its result on the sequence of demonstrations, and we prune expressions with duplicate signatures at each recursive call, using the SigFilter function.
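Signature-based pruning can be sketched as a deduplication keyed on the dimension and the tuple of results over the demonstrations. The function name `sig_filter` and the `(name, dimension, function)` triple format are assumptions made for this illustration.

```python
def sig_filter(exprs, demos):
    """Keep the first expression seen for each (dimension, signature) pair.

    `exprs` is a list of (name, dimension, function) triples, and `demos`
    is the sequence of world states from the demonstrations.
    """
    seen = set()
    kept = []
    for name, dim, fn in exprs:
        # The signature is the expression's result on every demonstration.
        signature = tuple(fn(w) for w in demos)
        key = (dim, signature)
        if key not in seen:
            seen.add(key)
            kept.append((name, dim, fn))
    return kept
```

Including the dimension in the key means that two expressions with identical values but different dimensions are both retained, since they are not interchangeable in a dimension-checked program.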
3.4 LDIPS-L3: Predicate Synthesis
Given a set of demonstrations, LDIPS-L3 returns a complete ASP that is consistent with the demonstrations. The provided type environment is used to perform dimension-informed enumeration, up to a specified maximum depth. The LDIPS-L3 algorithm (L3) proceeds as follows.
1) It separates the demonstrations into sub-problems consisting of action pairs, with positive and negative examples, according to the transitions in the demonstrations.
2) For each sub-problem, it generates candidate predicates up to the maximum depth.
3) For each candidate predicate, it invokes LDIPS-L2 with the corresponding examples, and the resulting expression, if one is returned, is used to guard the transition for that sub-problem.
4) If all sub-problems are solved, it composes them into an ASP.
LDIPS-L3 divides synthesis into sub-problems, using the DivideProblem helper function, to address scalability. DivideProblem identifies all unique transitions from a starting action to a final action and, for each transition, a pair of positive and negative example sets that demonstrate transitions from the starting action to that final action, and transitions from the starting action to any other final action, respectively. As an example of a sketch generated by DivideProblem, consider the partial program shown in fig. 0(e).
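The partitioning performed by DivideProblem can be sketched as a grouping over demonstration triples. This is a hedged illustration: the `(start_action, world_state, next_action)` triple format and the function name `divide_problem` are assumptions, and the sketch treats a demonstration that stays in the same action as a transition like any other.

```python
from collections import defaultdict


def divide_problem(demos):
    """Split demonstrations into one sub-problem per observed transition.

    `demos` is a list of (start_action, world_state, next_action) triples.
    Returns a dict mapping (start, target) to (positives, negatives), where
    positives are world states in which `start` led to `target` and
    negatives are world states in which `start` led to any other action.
    """
    by_start = defaultdict(list)
    for start, world, nxt in demos:
        by_start[start].append((world, nxt))

    problems = {}
    for start, rows in by_start.items():
        for target in sorted({nxt for _, nxt in rows}):
            positives = [w for w, nxt in rows if nxt == target]
            negatives = [w for w, nxt in rows if nxt != target]
            problems[(start, target)] = (positives, negatives)
    return problems
```

Each resulting sub-problem is small and independent, so a predicate for one transition can be synthesized without considering demonstrations that begin in a different action.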
Given the sketch generated by DivideProblem, LDIPS-L3 employs EnumPredicates to enumerate predicate structure. EnumPredicates fills predicate holes with predicates according to the ASP grammar in fig. 1(a), such that all expressions are left as expression holes and all constants are left as repairable parameter holes. Candidate predicates are enumerated in order of increasing size until the maximum depth is reached, or a solution is found. For each candidate predicate and its corresponding positive and negative example sets, the problem reduces to one amenable to LDIPS-L2. If LDIPS-L2 identifies a satisfying solution for every sub-problem, the solutions are composed into the policy using MakeP; otherwise UNSAT is returned, indicating that there is no policy consistent with the demonstrations.
4 Evaluation
We now present several experiments that evaluate 1) the performance of ASPs synthesized by LDIPS, 2) the data-efficiency of LDIPS compared to training an NN, 3) the generalizability of synthesized ASPs to novel scenarios, and 4) the ability to repair ASPs developed in simulation and to transfer them to real robots. Our experiments use three ASPs from two application domains.
From robot soccer, the attacker plays the primary offensive role, and we use the fraction of scored goals over attempted goals as its success rate.
From robot soccer, the deflector executes one-touch passes to the attacker, and we use the fraction of successful passes over attempted passes as its success rate.
From autonomous driving, the passer maneuvers through slower traffic, and we use the fraction of completed passes as its success rate.
We use reference ASPs to build a dataset of demonstrations. For robot soccer, we use ASPs that have been successful in RoboCup tournaments. For autonomous driving, the reference ASP encodes user preferences of desired driving behavior.
4.1 Performance of Synthesized ASPs
|Policy||Success Rates ()|
We use our demonstrations to 1) train an LSTM that encodes the ASP, and 2) synthesize ASPs using LDIPS-L1, LDIPS-L2, and LDIPS-L3. For training and synthesis, the training set consists of 10, 20, and 20 trajectories for the attacker, deflector, and passer, respectively. For evaluation, the test sets consist of 12000, 4800, and 4960 problems. The results table (simPerformance) shows that LDIPS outperforms the LSTM in all cases. For comparison, we also evaluate the reference ASPs, which can outperform the synthesized ASPs. The best LDIPS ASP for the deflector came close to the reference, while the LSTM ASP was worse.
4.2 Effects of Dimensional Analysis
Dimensional analysis enables tractable synthesis of ASPs and improves the performance of the learned policies. We evaluate the impact of dimensional analysis by synthesizing policies with four variations of LDIPS-L3: the full algorithm, a variant with only dimension-based pruning, a variant with only signature-based pruning, and a variant with no expression pruning, all with the same fixed depth. In simPerformance2 we report the number of expressions enumerated for each variant, for each of our behaviors, as well as the performance of each of the resulting policies.
|Policy||# Enumerated||Success Rate %|
For all of our behaviors, the number of expressions enumerated without dimensional analysis or dimension-informed signature pruning increases by orders of magnitude. With this many possible candidate expressions, synthesis becomes intractable, and as such, without pruning, we cannot synthesize a policy to evaluate at all. Further, the performance of the ASPs synthesized with only signature pruning is consistently worse than that of LDIPS-L3, and the difference is most stark in the passer ASP.
4.3 Data Efficiency
LDIPS can synthesize ASPs with far fewer demonstrations than the LSTM. To illustrate this phenomenon, we train the LSTM with 1) the full LSTM training demonstrations (LSTM-Full), 2) half of the training demonstrations (LSTM-Half), and 3) the demonstrations that LDIPS uses (LSTM-Synth), which is a tiny fraction of the previous two training sets.
dataEfficiency shows how the performance of the LSTM degrades as we cut the size of the training demonstrations. In particular, when the LSTM and LDIPS use the same training demonstrations, the LSTM fares significantly worse.
4.4 Ability to Generalize From Demonstrations
A key requirement for an LfD algorithm is its ability to generalize to novel problems. This experiment shows that an attacker ASP, synthesized using LDIPS-L3 and only ten demonstrations, can score a goal when a moving ball is placed at almost any reasonable position on the field. On each run, the attacker starts at the origin (heatmap figure). We discretize the soccer field, place the ball at a discrete point, and set the ball in motion in a set of possible directions (12,000 total runs). Thus, each point of the heatmap shows the attacker’s success rate on all runs that start at that point. The figure shows the performance of the ASP that LDIPS-L3 synthesized from ten demonstration runs that start from the eight marked positions. The synthesized ASP generalizes to problems that are significantly different from the training examples. Moreover, its performance degrades on exactly the same region of the field as the reference ASP (i.e., when the ball is too far away for the attacker to intercept).
4.5 Transfer From Sim To Real
ASPs designed and tested in simulation frequently suffer from degraded performance when run on real robots. If the ASP is hand-written and includes parameters, it may be repaired by parameter optimization, but NN ASPs are much harder to repair without significant additional data collection and retraining. However, LDIPS can make the sim-to-real transfer process significantly easier. For this experiment, using the attacker and deflector, we 1) synthesize ASPs in a simulator, and 2) deploy them on a real robot. Predictably, the real robot sees significantly degraded performance on the reference ASP, the learned LSTM ASP, and the LDIPS-synthesized ASP. We use a small variant of LDIPS-L1 (inspired by SRTR [holtz2018interactive]) on the reference and LDIPS ASPs: to every parameter we add a blank adjustment, and synthesize a minimal value for each blank, using ten real-world demonstration runs. The resulting ASPs perform significantly better, and are much closer to their performance in the simulator (real-robot performance figure). This procedure is ineffective on the LSTM: merely ten demonstration runs have minimal effect on the LSTM's parameters. Moreover, gathering a large volume of real-world demonstrations is often impractical.
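The repair step can be illustrated for a single threshold. The sketch below is not the SMT-based variant used in the paper: it assumes a predicate of the form `feature(w) > x + delta`, finds the smallest-magnitude `delta` consistent with the real-world corrections by clamping, and uses a hypothetical tolerance `eps` to keep the strict inequality satisfied. The function name `repair_threshold` is likewise an assumption.

```python
def repair_threshold(x, feature, positives, negatives, eps=1e-6):
    """Return the smallest adjustment delta so that x + delta fits the corrections.

    The repaired threshold must satisfy feature(w) > x + delta on every
    positive example and feature(w) <= x + delta on every negative example.
    Returns None when no adjustment can satisfy all corrections.
    """
    upper = min((feature(w) for w in positives), default=float("inf"))
    lower = max((feature(w) for w in negatives), default=float("-inf"))
    highest = upper - eps  # largest threshold that keeps positives strictly above
    if lower > highest:
        return None        # corrections are not separable by one threshold
    # Clamp the original parameter into the feasible interval.
    repaired = min(max(x, lower), highest)
    return repaired - x
```

When the simulated parameter already agrees with the corrections, the adjustment is zero, so the policy is left unchanged; otherwise the parameter moves only as far as the nearest feasible value.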
5 Conclusion
In this work, we presented an approach for learning action selection policies for robot behaviors utilizing layered dimension-informed program synthesis (LDIPS). This work composes skills into high-level behaviors, represented as human-readable programs, using a small number of demonstrations. We demonstrated that our technique generates policies that perform well compared to human-engineered and learned policies in two different domains. Further, we showed that these policies could be transferred from simulation to real robots by utilizing parameter repair.
This work was partially supported by the National Science Foundation under grants CCF-2102291 and CCF-2006404, and by JPMorgan Chase & Co. In addition, we acknowledge support from Northrop Grumman Mission Systems’ University Research Program. Any views or opinions expressed herein are solely those of the authors listed.