1 Introduction
Consider the act of driving a car along a narrow country road or in a cramped parking garage. While the rules of the road are defined for all jurisdictions, it may be impossible to follow all of those rules in certain situations; however, even if every single rule cannot be adhered to, it remains desirable to follow the largest possible set of rules. The specifications for obeying the rules of the road are non-Markovian and can be encoded as linear temporal logic (LTL) formulas [1]. There has been significant interest in incorporating LTL formulas as specifications for reinforcement learning ([2], [3], [4]); however, these approaches require specifications to be expressed as a single LTL formula. Such approaches are not sufficiently expressive to handle a scenario like the one above, where the task specifications can, at best, be expressed as a belief over multiple LTL formulas.
In this paper, we introduce a novel problem formulation for planning with uncertain specifications (PUnS), which allows task specifications to be expressed as a distribution over multiple LTL formulas. We identify four evaluation criteria that capture the semantics of satisfying a belief over LTL formulas and analyze the nature of the task executions they entail. Finally, we demonstrate the existence of an equivalent MDP reformulation for all instances of PUnS, allowing any planning algorithm that accepts an instance of an MDP to act as a solver for instances of PUnS.
2 Related Work
Prior research into reinforcement learning has indicated great promise in sequential decision-making tasks, with breakthroughs in handling large-dimensional state spaces such as Atari games ([5]), continuous action spaces ([6], [7]), sparse rewards ([8], [9]), and all of these challenges in combination ([10]). These were made possible due to the synergy between off-policy training methods and the expressive power of neural networks. This body of work has largely focused on algorithms for reinforcement learning rather than the source of task specifications; however, reward engineering is crucial to achieving high performance, and is particularly difficult in complex tasks where the user’s intent can only be represented as a collection of preferences ([11]) or a belief over logical formulas inferred from demonstrations ([12]).
Reward design according to user intent has primarily been studied in the context of Markovian reward functions. Singh et al. [13] first defined the problem of optimal reward design with respect to a distribution of target environments. Ratner et al. [14] and Hadfield-Menell et al. [15] defined inverse reward design as the problem of inferring the true desiderata of a task from proxy reward functions provided by users for a set of task environments. Sadigh et al. [16] developed a model to utilize binary preferences over executions as a means of inferring the true reward. However, all of these works only allow for Markovian reward functions; our proposed framework handles uncertain, non-Markovian specifications expressed as a belief over LTL formulas.
LTL is an expressive language for representing non-Markovian properties. There has been considerable interest in enabling LTL formulas to be used as planning problem specifications, with applications in symbolic planning ([11], [17]) and hybrid controller synthesis ([18]). There has also been growing interest in the incorporation of LTL specifications into reinforcement learning. Aksaray et al. [2] proposed using temporal logic variants with quantitative semantics as the reward function. Littman et al. [3] compiled an LTL formula into a specification MDP with binary rewards and introduced geometric-LTL, a bounded-time variant of LTL where the time horizon is sampled from a geometric distribution. Toro Icarte et al. [4] proposed a curriculum learning approach for progressions of a co-safe LTL ([19]) specification. Lacerda et al. [20] also developed planners that resulted in maximal completion of tasks for unsatisfiable specifications for co-safe LTL formulas. However, while these works are restricted to specifications expressed as a single temporal logic formula, our framework allows for simultaneous planning with a belief over a finite set of LTL formulas.
3 Preliminaries
3.1 Linear Temporal Logic
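Linear temporal logic (LTL), introduced by Pnueli [21], provides an expressive grammar for describing temporal behaviors. An LTL formula is composed of atomic propositions (discrete time sequences of Boolean literals) and both logical and temporal operators, and is interpreted over traces $[\alpha]$ of the set of propositions, $\alpha$. The notation $[\alpha], t \models \varphi$ indicates that $\varphi$ holds at time $t$. The trace $[\alpha]$ satisfies $\varphi$ (denoted as $[\alpha] \models \varphi$) iff $[\alpha], 0 \models \varphi$. The minimal syntax of LTL can be described as follows:
$$\varphi ::= p \mid \neg\varphi_1 \mid \varphi_1 \vee \varphi_2 \mid \mathbf{X}\varphi_1 \mid \varphi_1 \mathbf{U} \varphi_2 \tag{1}$$
$p$ is an atomic proposition, and $\varphi_1$ and $\varphi_2$ represent valid LTL formulas. The operator $\mathbf{X}$ is read as “next” and $\mathbf{X}\varphi_1$ evaluates as true at time $t$ if $\varphi_1$ evaluates to true at $t+1$. The operator $\mathbf{U}$ is read as “until” and the formula $\varphi_1 \mathbf{U} \varphi_2$ evaluates as true at time $t_1$ if $\varphi_2$ evaluates as true at some time $t_2 \geq t_1$ and $\varphi_1$ evaluates as true for all time steps $t$, such that $t_1 \leq t < t_2$. In addition to the minimal syntax, we also use the additional propositional logic operators $\wedge$ (and) and $\rightarrow$ (implies), as well as other higher-order temporal operators: $\mathbf{F}$ (eventually) and $\mathbf{G}$ (globally). $\mathbf{F}\varphi$ evaluates to true at $t_1$ if $\varphi$ evaluates as true for some $t \geq t_1$. $\mathbf{G}\varphi$ evaluates to true at $t_1$ if $\varphi$ evaluates as true for all $t \geq t_1$.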
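The “safe” and “co-safe” subsets of LTL formulas have been identified in prior research ([19], [22], [23]). A “co-safe” formula is one that can always be verified by a trace of a finite length, whereas a “safe” formula can always be falsified by a finite trace. Any formula produced by the following grammar is considered “co-safe”:
$$\varphi_{cosafe} ::= \top \mid p \mid \neg p \mid \varphi_1 \vee \varphi_2 \mid \varphi_1 \wedge \varphi_2 \mid \mathbf{X}\varphi \mid \mathbf{F}\varphi \mid \varphi_1 \mathbf{U} \varphi_2 \tag{2}$$
Similarly, any formula produced by the following grammar is considered “safe”, where $\mathbf{R}$ denotes the release operator, the dual of $\mathbf{U}$:
$$\varphi_{safe} ::= \bot \mid p \mid \neg p \mid \varphi_1 \vee \varphi_2 \mid \varphi_1 \wedge \varphi_2 \mid \mathbf{X}\varphi \mid \mathbf{G}\varphi \mid \varphi_1 \mathbf{R} \varphi_2 \tag{3}$$
A formula expressed as $\varphi = \varphi_{safe} \wedge \varphi_{cosafe}$ belongs to the Obligation class of formulas presented in Manna and Pnueli’s [23] temporal hierarchy.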
Finally, a progression $\mathrm{Prog}(\varphi, \alpha_t)$ over an LTL formula $\varphi$ with respect to a truth assignment $\alpha_t$ at time $t$ is defined such that, for a trace $[\alpha]$: $[\alpha], t \models \varphi$ iff $[\alpha], t+1 \models \mathrm{Prog}(\varphi, \alpha_t)$. Thus, a progression of an LTL formula with respect to a truth assignment is a formula that must hold at the next time step in order for the original formula to hold at the current time step. Bacchus and Kabanza [24] defined a list of progression rules for the temporal operators in Equations 1, 2, and 3.
3.2 Belief over Specifications
In this paper, we define the specification of our planning problem as a belief over LTL formulas. A belief over LTL formulas is defined as a probability distribution with support over a finite set of formulas with the density function $P : \{\varphi\} \rightarrow [0, 1]$. The distribution represents the probability of a particular formula being the true specification. In this paper, we restrict $\{\varphi\}$ to the Obligation class of formulas.
3.3 Model-free Reinforcement Learning
A Markov decision process (MDP) is a planning problem formulation defined by the tuple $\mathcal{M} = \langle S, A, T, R \rangle$, where $S$ is the set of all possible states, $A$ is the set of all possible actions, and $T(s_1, a, s_2)$ is the probability that the next state will be $s_2$ given that the current state is $s_1$ and the action taken at the current time step is $a$. $R$ represents the reward function that returns a scalar value given a state. The Q-value function $Q^{\pi}(s, a)$ is the expected discounted value under a policy $\pi$. In a model-free setting, the transition function is not known to the learner, and the Q-value is updated by the learner acting within the environment and observing the resulting reward. If the Q-value is updated while not following the current estimate of the optimal policy, it is considered “off-policy” learning. Given an initial estimate of the Q-value $Q(s, a)$, the agent performs an action $a$ from state $s$ to reach $s'$ while collecting a reward $r$, with $\gamma$ denoting the discounting factor. The Q-value function is then updated as follows, where $\eta \in (0, 1]$ is the learning rate:
$$Q(s, a) \leftarrow (1 - \eta)\, Q(s, a) + \eta \left( r + \gamma \max_{a'} Q(s', a') \right) \tag{4}$$
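The update rule above can be illustrated with a short tabular sketch in Python. This is a generic Q-learning step under stated assumptions rather than the authors' code: the table is a dictionary keyed by (state, action) pairs, and the learning rate `lr` and the default values are hypothetical choices.

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, lr=0.1, gamma=0.95):
    """Apply one off-policy Q-learning update for the transition (s, a, r, s_next).

    Q is a mapping from (state, action) to a value estimate; unseen pairs
    default to 0.0 when Q is a defaultdict(float).
    """
    # Bootstrap from the greedy value of the next state, regardless of
    # which action the behavior policy will actually take there.
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - lr) * Q[(s, a)] + lr * (r + gamma * best_next)
    return Q[(s, a)]

Q = defaultdict(float)
actions = ["left", "right"]
first = q_update(Q, 0, "left", 1.0, 1, actions, lr=0.5, gamma=0.9)
```

Because the target uses the maximum over next actions, the same update applies to any transition tuple, whichever policy produced it.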
4 Planning with Uncertain Specifications (PUnS)
The problem of planning with uncertain specifications (PUnS) is formally defined as follows: The state representation of the learning and task environment is denoted by $x \in X$, where $X$ is a set of features that describe the physical state of the system. The agent has a set of available actions, $A$. The state of the system maps to a set of finite known Boolean propositions, $\alpha \in \{0,1\}^{n_{prop}}$, through a known labeling function, $f : X \rightarrow \{0,1\}^{n_{prop}}$. The specification is provided as a belief over LTL formulas, $P(\varphi)$, with a finite set of formulas $\{\varphi\}$ in its support. The expected output of the planning problem is a stochastic policy, $\pi$, that satisfies the specification.
The semantics of satisfying a logical formula are well defined; however, there is no single definition for satisfying a belief over logical formulas. In this work, we present four criteria for satisfying a specification expressed as a belief over LTL, and express them as non-Markovian reward functions. A solution to PUnS optimizes the reward function representing the selected criterion. Next, using an approach inspired by LTL-to-automata compilation methods ([25]), we demonstrate the existence of an MDP that is equivalent to PUnS. The reformulation as an MDP allows us to utilize any reinforcement learning algorithm that accepts an instance of an MDP to solve the corresponding instance of PUnS.
4.1 Satisfying beliefs over specifications
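A single LTL formula can be satisfied, dissatisfied, or undecided; however, satisfaction semantics over a distribution of LTL formulas do not have a unique interpretation. We identify the following four evaluation criteria, which capture the semantics of satisfying a distribution over specifications, and formulate each as a non-Markovian reward function:

Most likely: This criterion entails executions that satisfy the formula with the largest probability as per $P(\varphi)$. As a reward, this is represented as follows:
$$R([\alpha]; P(\varphi)) = \mathbb{1}([\alpha] \models \varphi^{*}) \tag{5}$$
where
$$\varphi^{*} = \arg\max_{\varphi \in \{\varphi\}} P(\varphi) \tag{6}$$
Maximum coverage: This criterion entails executions that satisfy the maximum number of formulas in support of the distribution $P(\varphi)$. As a reward function, it is represented as follows:
$$R([\alpha]; P(\varphi)) = \sum_{\varphi \in \{\varphi\}} \mathbb{1}([\alpha] \models \varphi) \tag{7}$$
Minimum regret: This criterion entails executions that maximize the hypothesis-averaged satisfaction of the formulas in support of $P(\varphi)$. As a reward function, this is represented as follows:
$$R([\alpha]; P(\varphi)) = \sum_{\varphi \in \{\varphi\}} P(\varphi)\, \mathbb{1}([\alpha] \models \varphi) \tag{8}$$
Chance constrained: Suppose the maximum probability of failure is set to $\delta$, with $\varphi^{\delta}$ defined as the set of formulas such that $\sum_{\varphi \in \varphi^{\delta}} P(\varphi) \geq 1 - \delta$; and $P(\varphi') \leq P(\varphi)\ \forall\ \varphi' \notin \varphi^{\delta}, \varphi \in \varphi^{\delta}$. This is equivalent to selecting the most-likely formulas until the cumulative probability density exceeds the risk threshold. As a reward, this is represented as follows:
$$R([\alpha]; P(\varphi)) = \sum_{\varphi \in \varphi^{\delta}} P(\varphi)\, \mathbb{1}([\alpha] \models \varphi) \tag{9}$$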
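The four criteria can be compared side by side with a small Python sketch. It is illustrative only and rests on several assumptions: the belief is a dictionary from hypothetical formula names to probabilities, satisfaction of each formula by a finished trace is given as a Boolean, the satisfaction indicator is taken to be +1 or -1, and the chance-constrained reward is left unnormalized.

```python
def indicator(satisfied):
    """Assumed convention: +1 if the trace satisfies the formula, -1 otherwise."""
    return 1.0 if satisfied else -1.0

def most_likely(belief, sat):
    """Reward from the single most probable formula only."""
    best = max(belief, key=belief.get)
    return indicator(sat[best])

def max_coverage(belief, sat):
    """Unweighted count: every formula in the support matters equally."""
    return sum(indicator(sat[f]) for f in belief)

def min_regret(belief, sat):
    """Probability-weighted (hypothesis-averaged) satisfaction."""
    return sum(p * indicator(sat[f]) for f, p in belief.items())

def chance_constrained(belief, sat, delta):
    """Keep the most likely formulas until their mass reaches 1 - delta."""
    kept, mass = [], 0.0
    for f in sorted(belief, key=belief.get, reverse=True):
        if mass >= 1.0 - delta:
            break
        kept.append(f)
        mass += belief[f]
    return sum(belief[f] * indicator(sat[f]) for f in kept)

# Hypothetical belief over three formulas and one finished execution.
belief = {"phi_1": 0.5, "phi_2": 0.3, "phi_3": 0.2}
sat = {"phi_1": True, "phi_2": False, "phi_3": True}
```

For this execution, the most likely criterion is fully satisfied even though the second formula is violated, whereas minimum regret and the chance-constrained criterion with `delta = 0.25` both penalize that violation in proportion to its probability.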
Each of these four criteria represents a “reasonable” interpretation of satisfying a belief over LTL formulas, with the choice between the criteria dependent upon the relevant application. In a preference elicitation approach proposed by Kim et al. [11], the specifications within the set are provided by different experts. In such scenarios, it is desirable to satisfy the largest common set of specifications, making maximum coverage the most suitable criterion. When the specifications are inferred from task demonstrations (such as in the case of Bayesian specification inference [12]), minimum regret would be the natural formulation. However, if the formula distribution is skewed towards a few likely formulas with a long tail of low-probability formulas, the chance constrained or most likely criteria can be used to reduce computational overhead in resource-constrained or time-critical applications.
4.2 Specification-MDP compilation
We demonstrate that an equivalent MDP exists for all instances of PUnS. We represent the task environment as an MDP sans the reward function, then compile the specification $P(\varphi)$ into a finite state automaton (FSA) with terminal reward generating states. The MDP equivalent of the PUnS problem is generated through the cross-product of the environment MDP with the FSA representing $P(\varphi)$.
Given a single LTL formula, $\varphi$, a finite state automaton (FSA) can be constructed which accepts traces that satisfy the property represented by $\varphi$ [22]. An algorithm to construct the FSA was proposed by Gerth et al. [25]. The automata are directed graphs where each node represents an LTL formula $\varphi'$ that the trace must satisfy from that point onward in order to be accepted by the automaton. An edge, labeled by the truth assignment at a given time $\alpha_t$, connects a node to its progression, $\mathrm{Prog}(\varphi', \alpha_t)$. Our decision to restrict $\varphi$ to the Obligation class of temporal properties ($\varphi_{safe} \wedge \varphi_{cosafe}$) ensures that the FSA constructed from $\varphi$ is deterministic and will have terminal states that represent $\top$, $\bot$, or $\varphi_{safe}$ [23]. When planning with a single formula, these terminal states are the reward-generating states for the overall MDP, as seen in approaches proposed by Littman et al. [3] and Toro Icarte et al. [4].
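A single LTL formula can be represented by an equivalent deterministic MDP described by the tuple $\mathcal{M}_{\varphi} = \langle \{\varphi'\}, \{0,1\}^{n_{prop}}, T_{\varphi}, R_{\varphi} \rangle$, with the states representing the possible progressions of $\varphi$ and the actions representing the truth assignments causing the progressions ([3], [4]). The transition function is defined as follows:
$$T_{\varphi}(\varphi'_1, \varphi'_2, \alpha) = \begin{cases} 1, & \text{if } \varphi'_2 = \mathrm{Prog}(\varphi'_1, \alpha) \\ 0, & \text{otherwise} \end{cases} \tag{10}$$
The reward function is a function of the MDP state, and defined as follows:
$$R_{\varphi}(\varphi') = \begin{cases} 1, & \text{if } \varphi' = \top \text{ or } \varphi' = \varphi_{safe} \\ -1, & \text{if } \varphi' = \bot \\ 0, & \text{otherwise} \end{cases} \tag{11}$$
For an instance of PUnS with specification $P(\varphi)$ and support $\{\varphi\}$, a deterministic MDP is constructed by computing the cross-product of MDPs of the component formulas. Let $\langle \varphi'^{1}, \ldots, \varphi'^{n} \rangle$ be the progression state for each of the formulas in $\{\varphi\}$; the MDP equivalent of $\{\varphi\}$ is then defined as $\mathcal{M}_{\{\varphi\}} = \langle \{\langle \varphi' \rangle\}, \{0,1\}^{n_{prop}}, T_{\{\varphi\}}, R_{\{\varphi\}} \rangle$. Here, the states are all possible combinations of the component formulas’ progression states, and the actions are propositions’ truth assignments. The transition is defined as follows:
$$T_{\{\varphi\}}(\langle \varphi'_1 \rangle, \langle \varphi'_2 \rangle, \alpha) = \prod_{i} T_{\varphi^{i}}(\varphi'^{i}_1, \varphi'^{i}_2, \alpha) \tag{12}$$
This MDP reaches a terminal state when all of the formulas comprising $\{\varphi\}$ have progressed to their own terminal states. The reward is computed using one of the criteria represented by Equations 5, 7, 8, or 9, with $\mathbb{1}([\alpha] \models \varphi)$ replaced by $R_{\varphi}(\varphi')$. Note that while $\mathbb{1}([\alpha] \models \varphi)$ has two possible values ($1$ when the formula is satisfied and $-1$ when it is not), $R_{\varphi}(\varphi')$ has three possible values ($1$ when $\varphi$ has progressed to $\top$ or $\varphi_{safe}$, $-1$ when $\varphi$ has progressed to $\bot$, or $0$ when $\varphi$ has not progressed to a terminal state). Thus, the reward is nonzero only in a terminal state.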
In the worst case, the size of the FSA of is exponential in . In practice, however, many formulas contained within the posterior may be logically correlated. For example, consider the formula , with its FSA states being ; and the formula , with FSA states representing . The cross product, FSA, can have a maximum of eight unique states; however, a state such as can never exist. Thus, the actual, reachable states for this cross product are . To create a minimal reachable set of states, we start from and perform a breadth-first enumeration.
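The breadth-first enumeration of reachable product states can be sketched as follows. This is a minimal illustration under assumptions, not the authors' compiler: each component automaton is given as a hand-written step function, and the two formulas used (“eventually a” and “eventually a and eventually b”) are hypothetical stand-ins chosen so that the full cross product has eight states while only some are reachable.

```python
from collections import deque
from itertools import product

def step_f_a(state, alpha):
    """Hypothetical automaton for 'eventually a': states 'F a' and 'true'."""
    if state == "F a" and alpha["a"]:
        return "true"
    return state

def step_f_a_and_f_b(state, alpha):
    """Hypothetical automaton for 'eventually a and eventually b' (four states)."""
    if state == "F a & F b":
        if alpha["a"] and alpha["b"]:
            return "true"
        if alpha["a"]:
            return "F b"
        if alpha["b"]:
            return "F a"
        return state
    if state == "F a" and alpha["a"]:
        return "true"
    if state == "F b" and alpha["b"]:
        return "true"
    return state

def reachable_product_states(initial, steps, props):
    """Breadth-first search over joint progression states.

    initial: tuple with one start state per component automaton.
    steps:   one deterministic step function per component.
    props:   proposition names; every truth assignment is tried as an edge.
    """
    assignments = [dict(zip(props, bits))
                   for bits in product([False, True], repeat=len(props))]
    seen = {initial}
    queue = deque([initial])
    while queue:
        joint = queue.popleft()
        for alpha in assignments:
            # All components advance on the same truth assignment.
            nxt = tuple(step(s, alpha) for step, s in zip(steps, joint))
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

states = reachable_product_states(
    ("F a", "F a & F b"), [step_f_a, step_f_a_and_f_b], ["a", "b"])
```

In this hypothetical pair, a joint state in which the second formula is already satisfied while the first is still pending is never generated, so the enumeration returns fewer states than the full cross product.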
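We represent the task environment as an MDP without a reward function using the tuple $\mathcal{M}_{X} = \langle X, A, T_{X} \rangle$. The cross product of $\mathcal{M}_{X}$ and $\mathcal{M}_{\{\varphi\}}$ results in an MDP: $\mathcal{M}_{Spec} = \langle \{\langle \varphi' \rangle\} \times X, A, T_{Spec}, R_{\{\varphi\}} \rangle$. The transition function of $\mathcal{M}_{Spec}$ is defined as follows:
$$T_{Spec}(\langle \langle \varphi'_1 \rangle, x_1 \rangle, \langle \langle \varphi'_2 \rangle, x_2 \rangle, a) = T_{\{\varphi\}}(\langle \varphi'_1 \rangle, \langle \varphi'_2 \rangle, f(x_2)) \times T_{X}(x_1, x_2, a) \tag{13}$$
$\mathcal{M}_{Spec}$ is an equivalent reformulation of PUnS as an MDP, creating the possibility of leveraging recent advances in reinforcement learning for PUnS. In Section 5, we demonstrate examples of PUnS trained using off-policy reinforcement learning algorithms.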
4.3 Counterfactual updates in a modelfree setting
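Constructing $\mathcal{M}_{Spec}$ as a composition of $\mathcal{M}_{\{\varphi\}}$ and $\mathcal{M}_{X}$ results in the following properties: the reward function is only dependent upon $\langle \varphi' \rangle$, the state of $\mathcal{M}_{\{\varphi\}}$; the action availability only depends upon $x$, the state of $\mathcal{M}_{X}$; and the stochasticity of transitions is only in $T_{X}$, as $T_{\{\varphi\}}$ is deterministic. These properties allow us to exploit the underlying structure of $\mathcal{M}_{Spec}$ in a model-free learning setting. Let an action $a \in A$ from state $x_1 \in X$ result in a state $x_2 \in X$. As $T_{\{\varphi\}}$ is deterministic, we can use this action update to apply a Q-function update (Equation 4) to all states described by $\langle \langle \varphi' \rangle, x_1 \rangle\ \forall\ \langle \varphi' \rangle \in \{\langle \varphi' \rangle\}$.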
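The counterfactual update described above can be sketched in Python as follows. This is an illustrative reading under stated assumptions, not the authors' implementation: the specification automaton is supplied through hypothetical callables (`spec_step`, `spec_reward`, `is_terminal`), the reward is assumed to be collected on entering the next specification state, and terminal specification states are neither updated nor bootstrapped from.

```python
from collections import defaultdict

def counterfactual_q_update(Q, x, a, x_next, label, spec_states, spec_step,
                            spec_reward, is_terminal, actions,
                            lr=0.1, gamma=0.95):
    """Reuse one observed environment transition (x, a, x_next) for every
    specification state, since the specification transitions are deterministic."""
    alpha = label(x_next)  # truth assignment produced by the new environment state
    for q in spec_states:
        if is_terminal(q):
            continue  # nothing left to learn once the specification has resolved
        q_next = spec_step(q, alpha)
        r = spec_reward(q_next)
        if is_terminal(q_next):
            target = r
        else:
            target = r + gamma * max(Q[((q_next, x_next), a2)] for a2 in actions)
        key = ((q, x), a)
        Q[key] = (1 - lr) * Q[key] + lr * target

# Hypothetical specification automaton for "eventually a and eventually b".
def spec_step(q, alpha):
    pending = {"F a & F b": {"a", "b"}, "F a": {"a"}, "F b": {"b"}, "true": set()}[q]
    left = {p for p in pending if not alpha[p]}
    return {frozenset({"a", "b"}): "F a & F b", frozenset({"a"}): "F a",
            frozenset({"b"}): "F b", frozenset(): "true"}[frozenset(left)]

spec_states = ["F a & F b", "F a", "F b", "true"]
Q = defaultdict(float)
counterfactual_q_update(
    Q, x=0, a="go", x_next=1,
    label=lambda x: {"a": x == 1, "b": x == 2},
    spec_states=spec_states, spec_step=spec_step,
    spec_reward=lambda q: 1.0 if q == "true" else 0.0,
    is_terminal=lambda q: q == "true",
    actions=["go"], lr=0.5, gamma=0.9)
```

A single environment step here refreshes the estimates for all three non-terminal specification states, which is the source of the sample-efficiency gain discussed in the text.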
5 Evaluations
In this section, we first explore how the choice of criteria represented by Equations 5, 7, 8, and 9 results in qualitatively different performance by trained RL agents. Then, we demonstrate how the MDP compilation can serve to train an agent on a real-world task involving setting a dinner table with specifications inferred from human demonstrations, as per Shah et al. [12]. We also demonstrate the value of counterfactual Q-value updates for speeding up the agent’s learning curve.
5.1 Synthetic Examples
The choice of the evaluation criterion impacts the executions it entails based on the nature of the distribution . Figure 1 depicts examples of different distribution types. Each figure is a Venn diagram where each formula represents a set of executions that satisfy . The size of the set represents the number of execution traces that satisfy the given formula, while the thickness of the set boundary represents its probability. Consider the simple discrete environment depicted in (a): there are five states, with the start state in the center labeled ‘0’ and the four corner states labeled ‘’, ‘’, ‘’, and ‘’. The agent can act to reach one of the four corner states from any other state, and that action is labeled according to the node it is attempting to reach.
Case 1: (a) represents a distribution where the most restrictive formula of the three is also the most probable. All criteria will result in the agent attempting to perform executions that adhere to the most restrictive specification.
Case 2: (b) represents a distribution where the most likely formula is the least restrictive. The minimum regret and maximum coverage rewards will result in the agent producing executions that satisfy , the most restrictive formula; however, using the most likely criterion will only generate executions that satisfy . With the chance-constrained policy, the agent begins by satisfying and relaxes the satisfactions as risk tolerance is decreased, eventually satisfying but not necessarily or .
Case 3: Case 3 represents three specifications that share a common subset but also have subsets that satisfy neither of the other specifications. Let the scenario specification be with assigned probabilities to each of , respectively. These specifications correspond to always avoiding “” and visiting either “”, “”, or “”. For each figure of merit defined in Section 4.1, the Q-value function was estimated using and an $\epsilon$-greedy exploration policy. A softmax policy with temperature parameter was used to train the agent, and the resultant exploration graph of the agent was recorded. The most likely criterion requires only the first formula in to be satisfied; thus, the agent will necessarily visit “” but may or may not visit “” or “”, as depicted in (b). With either maximum coverage or minimum regret serving as the reward function, the agent tries to complete executions that satisfy all three specifications simultaneously. Therefore, each task execution ends with the agent visiting all three nodes in all possible orders, as depicted in (c). Finally, in the chance-constrained setting with risk level , the automaton compiler drops the second specification; the resulting task executions always visit “” and “” but not necessarily “”, as depicted in (d).
Case 4: Case 4 depicts a distribution where an intersecting subset does not exist. Let the scenario specifications be = , with probabilities assigned to each of , respectively. The first two formulas correspond to the agent visiting either “” or “” but not “”. The third specification is satisfied when the agent visits “”; thus, any execution that satisfies the third formula will not satisfy the first two. The first two formulas also have an intersecting set of satisfying executions when both “” and “” are visited, corresponding to Case 4 from (d). Optimizing for max coverage will result in the agent satisfying both the first and the second formula but ignoring the third, as depicted in (a). However, when using the minimum regret formulation, the probability of the third specification is higher than the combined probability of the first two formulas; thus, a policy learned to optimize minimum regret will ignore the first two formulas and always end an episode by visiting “”, as depicted in (b). The specific examples and exploration graphs for the agents in each of the scenarios in Figure 1 and for each reward formulation in Section 4.1 are provided in the supplemental materials.
5.2 Planning with Learned Specifications: Dinner Table Domain
We also formulated the task of setting a dinner table as an instance of PUnS, using the dataset and resulting posterior distributions provided by Shah et al. [12]. This task features eight dining set pieces that must be organized in a configuration depicted in (a). In order to successfully complete the task, the agent must place each of the eight objects in the final configuration. As the dinner plate, small plate, and bowl were stacked, they had to be placed in that particular partial order. The propositions comprise eight Boolean variables associated with whether an object is placed in its correct position. The original dataset included 71 demonstrations; Bayesian specification inference was used to compute the posterior distributions over LTL formulas for different training set sizes.
For the purpose of planning, the task environment MDP $\mathcal{M}_{X}$ was simulated. Its state was defined by the truth values of the eight propositions defined above; thus, it had 256 unique states. The action space of the robot was the choice of which object to place next. Once an action was selected, it had an 80% chance of success as per the simulated transitions. For this demonstration, we selected the posterior distribution trained with 30 training examples, as it had the largest uncertainty in true specification. This distribution had 25 unique formulas in its support $\{\varphi\}$. As per the expected value of the intersection over union metric, the belief was 85% similar to the true specification. The true specification itself was part of the support, but was only the fourth most likely formula, as per the distribution. The deterministic MDP compiled from $P(\varphi)$ had 3,025 distinct states; thus, the cross-product of $\mathcal{M}_{\{\varphi\}}$ and $\mathcal{M}_{X}$ yielded $\mathcal{M}_{Spec}$ with 774,400 unique states and the same action space as $\mathcal{M}_{X}$. We chose the minimum regret criterion to construct the reward function, and trained two learning agents using Q-learning with an $\epsilon$-greedy policy (): one with and one without counterfactual updates. We evaluated the agent at the end of every training episode using an agent initialized with softmax policy (the temperature parameter was set to ). The agent was allowed to execute 50 episodes, and the terminal value of the reward function was recorded for each; this was replicated 10 times for each agent. All evaluations were conducted on a desktop with i7-7700K and 16 GB of RAM.
The statistics of the learning curve are depicted in (b). The solid line represents the median value of terminal reward across evaluations collected from all training runs. The error bounds indicate the and percentile. The maximum value of the terminal reward is when all formulas in the support are satisfied, and the minimum value is when all formulas are not satisfied. The learning curves indicate that the agent that performed counterfactual Q-value updates learned faster and had less variability in its task performance across training runs compared with the one that did not perform counterfactual updates.
We implemented the learned policy with pre-designed motion primitives on a UR10 robotic arm. We observed during evaluation runs that the robot never attempted to violate any temporal ordering constraint. The stochastic policy also made it robust to some environment disturbances: for example, if one of the objects was occluded, the robot finished placing the other objects before waiting for the occluded object to become visible again. The robot adhered to the temporal task specifications, despite the maximum a posteriori formula not being the ground truth, by identifying a common set of executions that optimized the minimum regret reward function by satisfying all the formulas in the posterior distributions. Example executions can be viewed at https://youtu.be/LrIh_jbnfmo.
6 Conclusions
In this work, we formally define the problem of planning with uncertain specifications (PUnS), where the task specification is provided as a belief over LTL formulas. We propose four evaluation criteria that define what it means to satisfy a belief over logical formulas, and discuss the type of task executions that arise from the various choices. We also present a methodology for compiling PUnS as an equivalent MDP using LTL compilation tools adapted to multiple formulas. We also demonstrate that the MDP reformulation of PUnS can be solved using off-policy algorithms with counterfactual updates for a synthetic example and a real-world task. Although we restricted the scope of this paper to discrete task environment MDPs, this technique is extensible to continuous state and action spaces; we plan to explore this possibility in future work.
References
 [1] E. Frazzoli and K. Iagnemma, “Facilitating vehicle driving and self-driving,” May 9 2017. US Patent 9,645,577.
 [2] D. Aksaray, A. Jones, Z. Kong, M. Schwager, and C. Belta, “Q-learning for robust satisfaction of signal temporal logic specifications,” in 2016 IEEE 55th Conference on Decision and Control (CDC), pp. 6565–6570, IEEE, 2016.
 [3] M. L. Littman, U. Topcu, J. Fu, C. Isbell, M. Wen, and J. MacGlashan, “Environment-independent task specifications via GLTL,” arXiv preprint arXiv:1704.04341, 2017.
 [4] R. Toro Icarte, T. Q. Klassen, R. Valenzano, and S. A. McIlraith, “Teaching multiple tasks to an RL agent using LTL,” in Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp. 452–461, International Foundation for Autonomous Agents and Multiagent Systems, 2018.
 [5] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
 [6] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International conference on machine learning, pp. 1928–1937, 2016.
 [7] V. R. Konda and J. N. Tsitsiklis, “Actor-critic algorithms,” in Advances in neural information processing systems, pp. 1008–1014, 2000.
 [8] A. Ecoffet, J. Huizinga, J. Lehman, K. O. Stanley, and J. Clune, “Go-Explore: a new approach for hard-exploration problems,” arXiv preprint arXiv:1901.10995, 2019.
 [9] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, et al., “Mastering chess and shogi by self-play with a general reinforcement learning algorithm,” arXiv preprint arXiv:1712.01815, 2017.
 [10] O. Vinyals, I. Babuschkin, J. Chung, M. Mathieu, M. Jaderberg, W. M. Czarnecki, A. Dudzik, A. Huang, P. Georgiev, R. Powell, T. Ewalds, D. Horgan, M. Kroiss, I. Danihelka, J. Agapiou, J. Oh, V. Dalibard, D. Choi, L. Sifre, Y. Sulsky, S. Vezhnevets, J. Molloy, T. Cai, D. Budden, T. Paine, C. Gulcehre, Z. Wang, T. Pfaff, T. Pohlen, Y. Wu, D. Yogatama, J. Cohen, K. McKinney, O. Smith, T. Schaul, T. Lillicrap, C. Apps, K. Kavukcuoglu, D. Hassabis, and D. Silver, “AlphaStar: Mastering the Real-Time Strategy Game StarCraft II.” https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/, 2019.
 [11] J. Kim, C. J. Banks, and J. A. Shah, “Collaborative planning with encoding of users’ high-level strategies,” in AAAI, pp. 955–962, 2017.
 [12] A. Shah, P. Kamath, J. A. Shah, and S. Li, “Bayesian inference of temporal task specifications from demonstrations,” in Advances in Neural Information Processing Systems 31 (S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, eds.), pp. 3804–3813, Curran Associates, Inc., 2018.
 [13] S. Singh, R. L. Lewis, and A. G. Barto, “Where do rewards come from,” in Proceedings of the annual conference of the cognitive science society, pp. 2601–2606, Cognitive Science Society, 2009.
 [14] E. Ratner, D. Hadfield-Menell, and A. Dragan, “Simplifying reward design through divide-and-conquer,” in Robotics: Science and Systems, 2018.
 [15] D. Hadfield-Menell, S. Milli, P. Abbeel, S. J. Russell, and A. Dragan, “Inverse reward design,” in Advances in neural information processing systems, pp. 6765–6774, 2017.
 [16] D. Sadigh, A. D. Dragan, S. Sastry, and S. A. Seshia, “Active preference-based learning of reward functions,” in Robotics: Science and Systems (RSS), 2017.
 [17] A. Camacho, J. A. Baier, C. Muise, and S. A. McIlraith, “Finite LTL synthesis as planning,” in Twenty-Eighth International Conference on Automated Planning and Scheduling, 2018.
 [18] H. Kress-Gazit, G. E. Fainekos, and G. J. Pappas, “Temporal-logic-based reactive mission and motion planning,” IEEE transactions on robotics, vol. 25, no. 6, pp. 1370–1381, 2009.
 [19] O. Kupferman and M. Y. Vardi, “Model checking of safety properties,” Formal Methods in System Design, vol. 19, no. 3, pp. 291–314, 2001.
 [20] B. Lacerda, D. Parker, and N. Hawes, “Optimal policy generation for partially satisfiable co-safe LTL specifications,” in Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
 [21] A. Pnueli, “The temporal logic of programs,” in Foundations of Computer Science, 1977., 18th Annual Symposium on, pp. 46–57, IEEE, 1977.
 [22] M. Y. Vardi, “An automata-theoretic approach to linear temporal logic,” in Logics for concurrency, pp. 238–266, Springer, 1996.
 [23] Z. Manna and A. Pnueli, A hierarchy of temporal properties. Department of Computer Science, 1987.
 [24] F. Bacchus and F. Kabanza, “Using temporal logics to express search control knowledge for planning,” Artificial intelligence, vol. 116, no. 1-2, pp. 123–191, 2000.
 [25] R. Gerth, D. Peled, M. Y. Vardi, and P. Wolper, “Simple on-the-fly automatic verification of linear temporal logic,” in International Conference on Protocol Specification, Testing and Verification, pp. 3–18, Springer, 1995.