The thesis of this work is that (1) rewards are an excellent way of controlling the behavior of agents, but (2) rewards are difficult to use for specifying behaviors in an environment-independent way, therefore (3) we need intermediate representations between behavior specifications and reward functions.
The intermediate representation we propose is a novel variant of linear temporal logic that is modified to be probabilistic so as to better support reinforcement-learning tasks. Linear temporal logic has been used in the past to specify reward functions that depend on temporal sequences (Bacchus et al., 1996)
; here, we expand the role to provide a robust and consistent semantics that allows desired behaviors to be specified in an environment-independent way. Briefly, our approach involves the specification of tasks via temporal operators that have a constant probability of expiring on each step. As such, it bears a close relationship to the notion of discounting in standard Markov decision process (MDP) reward functions(Puterman, 1994).
At a philosophical level, we are viewing behavior specification as a kind of programming problem. That is, if we think of a Markov decision process (MDP) as an input, a reward function as a program, and a policy as an output, then reinforcement learning can be viewed as a process of program interpretation. We would like the same program to work across all possible inputs.
1.1 Specifying behavior via reward functions
An MDP consists of a finite state space, action space, transition function, and reward function. Given an environment, an agent should behave in a way that maximizes cumulative discounted expected reward. The problems of learning and planning in such environments have been vigorously studied in the AI community for over 25 years (Watkins, 1989; Boutilier et al., 1999; Strehl et al., 2009). A reinforcement-learning (RL) agent needs to learn to maximize cumulative discounted expected reward starting with an incomplete model of the MDP itself.
For “programming” reinforcement-learning agents, the state of the art is to define a reward function and then for the learning agent to interact with the environment to discover ways to maximize its reward. Reward-based specifications have proven to be extremely valuable for optimal planning in complex, uncertain environments (Russell & Norvig, 1994). However, we can show that reward functions, as they are currently structured, are very difficult to work with as a way of reliably specifying tasks. The best use case for reward functions is when the utilities of all actions and outcomes can be expressed in a consistent unit, for example, time or money or energy. In reality, however, putting a meaningful dollar figure on scuffing a wall or dropping a clean fork is challenging. When informally adding negative rewards to undesirable outcomes, it is difficult to ensure a consistent semantics over which planning and reasoning can be carried out. Further, reward values often need to be changed if the environment itself changes—they are not environment independent. Therefore, to get a system to exhibit a desired behavior, it can be necessary to try different reward structures and carry out learning multiple times in the target environment, greatly undermining the purpose of autonomous learning in the first place.
Consider the simple example MDP in Figure 1. The agent is choosing between and in the initial state . Choosing causes the agent to pass through bad state for one step, then to continue on to the goal . Action , however, results in a probabilistic transition to (with probability ) or bad state (with slip probability ). From , the agent can continue on to the goal. If it reaches , it gets stuck there forever.
Let’s say our desired behavior is “maximize the probability of reaching without hitting a bad state”. (A bad state could be something like colliding with a wall or bumping up against a table.) The probability of success of is zero and is . Thus, for any , it is better to take action .
What reward function encourages this behavior? For concreteness, let’s assume a discount of and a reward of for reaching the goal. We can assign bad states a value of . In the case where , setting encourages the desired behavior.
Consider, though, what happens if the slip probability is . Now, there is no value of for which is preferred to 111Actually, works for this example, but that is tantamount to rewarding the agent for bumping into things—something bound to result in other problems.. That is, it has become impossible to find a reward function that creates the correct incentives for the desired behavior to be optimal.
This example is perhaps a bit contrived, but we have observed the same phenomenon in large and natural state spaces as well. The reason for this result is that reward functions force us to express utility in terms of the discounted expected visit frequency of states. In this case, we are stuck trying to make a tradeoff between the certainty of encountering a bad state once and the possibility of encountering a bad state repeatedly. Since we are trying to maximize the probability of zero encounters with a bad state, the expected number of encounters is only useful for distinguishing zero from more than zero—the objective cannot be translated into a reward function when bad states are unavoidable.
1.2 Specifying behavior via LTL
Linear temporal logic formulas are built up from a set of atomic propositions; the logic connectives: negation (), disjunction (), conjunction () and material implication (); and the temporal modal operators: next (), always (), eventually () and until (). A wide class of properties including safety (), goal guarantee (), progress (), response (), and stability (), where and are atomic propositions, can be expressed as LTL formulas. More complicated specifications can be obtained from the composition of such simple formulas. For example, the specification of “repeatedly visit certain locations of interest in a given order while avoiding certain other unsafe or undesirable locations” can be obtained through proper composition of simpler safety and progress formulas (Manna & Pnueli, 1992; Baier & Katoen, 2008).
Returning to the example in Figure 1, the task is to avoid states until is reached: . Given an LTL specification and an environment, an agent, for example, should adopt a behavior that maximizes the probability that the specification is satisfied. One advantage of this approach is its ability to specify tasks that cannot be expressed using simple reward functions (like the example MDP in Section 1.1). Indeed, in the context of reinforcement-learning problems, we have found it very natural to express standard MDP task specifications using LTL.
Standard MDP tasks can be expressed well using these temporal operators. For example:
Goal-based tasks like mountain car (Moore, 1991): If represents the attribute of being at the goal (the top of the hill, say), corresponds to eventually reaching the goal.
Avoidance-type tasks like cart pole (Barto et al., 1983): If represents the attribute of being in the failure state (dropping the pole, say), corresponds to always avoiding the failure state.
Sequence tasks like taxi (Dietterich, 2000): If represents some task being completed (getting the passenger, say) and represents another task being completed (delivering the passenger, say), corresponds to eventually completing the first task, then, from there, eventually completing the second task.
Stabilizing tasks like pendulum swing up (Atkeson, 1994): If represents the property that needs to be stabilized (the pendulum being above the vertical, say), corresponds to eventually achieving and continually maintaining the desired property.
Approach-avoid tasks like the 43 grid (Russell & Norvig, 1994): If represents the attribute of being at the goal (the upper right corner the grid, say), and represents the attribute of being at a bad state (the state below it, say), corresponds to avoiding the bad state en route to the goal.
On the other hand, there are barriers to straightforwardly adopting temporal logic-based languages in a reinforcement-learning setup. The most significant is that we can show that it is simply impossible to learn to satisfy classical LTL specifications in some cases. A key property for being able to learn near-optimal policies efficiently in the context of reward-based MDPs is what is known as the Simulation Lemma (Kearns & Singh, 1998). Informally, it says that, for any MDP and any , there exists an such that finding optimal policies in an -close model of the real environment results in behavior that is -close to optimal in the real environment.
Unfortunately, tasks specified via LTL do not have this property. In particular, there is an MDP and an such that no -close approximation for is sufficient to produce a policy with -close satisfaction probability.
Consider the MDP in Figure 2. If we want to find a behavior that nearly maximizes the probability of satisfying the specification
(stay in the good state forever), we need accurate estimates ofand . If or and , either policy is equally good. If and , only action is near optimal. If and , only action is near optimal. As there is no finite bound on the number of learning trials needed to distinguish from , a near optimal behavior cannot be found in worst-case finite time. LTL expressions are simply too unforgiving to be used with any confidence in a learning setting.
In this work, we develop a hybrid approach for specifying behavior in reinforcement learning that combines the strengths of both reward functions and temporal logic specifications.
2 Learning To Satisfy LTL
While provable guarantees of efficiency and optimality have been at the core of the literature on learning (Fiechter, 1994; Kearns & Singh, 2002; Brafman & Tennenholtz, 2002; Li et al., 2011), correctness with respect to complicated, high-level task specifications—during the learning itself or in the behavior resulting from the learning phase—has attracted limited attention (Abbeel & Ng, 2005).
2.1 Geometric linear temporal logic
We present a variant of LTL we call geometric linear temporal logic (GLTL) that builds on the logical and temporal operators in LTL while ensuring learnability. The idea of GLTL is roughly to restrict the period of validity of the temporal operators to bounded windows—similar to the bounded semantics of LTL (Manna & Pnueli, 1992). To this end, GLTL introduces operators of the form of with the atomic proposition , which is interpreted as “ eventually holds within time steps where.” Similar semantics stochastically restricting the window of validity for other temporal operators are also introduced.
This kind of geometric decay fits very nicely with MDPs for a few reasons. It can be viewed as a generalization of reward discounting, which is already present in many MDP models. It also avoids unnecessarily expanding the specification state space by only requiring extra states to represent events and not simply the passage of time.
Using to represent the geometric distribution with parameter , the temporal operators are:
: is achieved in the next steps, .
: holds for the next steps, .
: must hold at least until becomes true, which itself must be achieved in the next steps, .
Returning to our earlier example from Figure 2, evaluating the probability of satisfaction for requires infinite precision in the learned transition probabilities in the environment. Consider instead evaluating in this environment. An encoding of the specification for this example is shown in Figure 3 (Third). We call it a specification MDP, as it specifies the task using states (derived from the formula), actions (representing conditions), and probabilities (capturing the stochasticity of operator expiration). This example says that, from the initial state , encountering any state where is not true results in immediately failing the specification. In contrast, encountering any state where is true results in either continued evaluation (with probability ) or success (with probability ). Success represents the idea that the temporal window in which must hold true has expired without being violated.
Composing these two MDPs leads to the composite MDP in Figure 3 (Fourth). The true satisfaction probability for action is . Thus, if , the dependence of this value on is , which is well behaved for all values of . The sensitivity of the computed satisfaction probability has a maximum dependence on the accuracy of the estimate of . Thus, GLTL is considerably more friendly to learning than is LTL.
Returning to the MDP example in Figure 1, we find that GLTL is also more expressive than rewards. The GLTL formula can be translated to a specification MDP. Essentially, the idea is that encountering a bad state () even once or running out of time results in specification failure. Maximizing the satisfaction of this GLTL formula results in taking action regardless of the value of . That is, it is an environment-independent specification of the task.
The reason the GLTL formulation is able to succeed where standard rewards fail is that the GLTL formula results in an augmentation of the state space so that the reward function can depend on whether a bad state has yet been encountered. On the first encounter, a penalty can be issued. After the first encounter, no additional penalty is added. By composing the environment MDP with this bit of internal memory, the task can be expressed provably correctly and in an environment-independent way.
3 Related Work
Discounting has been used in previous temporal models. In quantitative temporal logic, it gives more weight to the satisfaction of a logic property in the near future than the far future. De Alfaro et al. (2003, 2004) augment computation tree logic (CTL) with discounting and develop fixpoint-based algorithms for checking such properties for probabilistic systems and games. Almagor et al. (2014) explicitly refine the “eventually” operator of LTL to a discounting operator such that the longer it takes to fulfill the task the smaller the value of satisfaction. Further, they show that discounted LTL is more expressive than discounted CTL. They use both discounted until and undiscounted until for expressing traditional eventually as well as its discounted version. However, algorithms for model checking and synthesis discounted LTL for probabilistic systems and games are yet to be developed.
LTL has been used extensively in robotics domains. Work on the trustworthiness of autonomous robots, automated verification and synthesis with provable correctness with respect to temporal logic-based specifications in motion, task, and mission planning have attracted considerable attention recently. The results include open-loop and reactive control of deterministic, stochastic or non-deterministic finite-state models as well as continuous state models through appropriate finite-state abstractions (Wongpiromsarn et al., 2012; Kress-Gazit et al., 2009; Liu et al., 2013; Wolff et al., 2012; Ding et al., 2011; Lahijanian et al., 2011; Kress-Gazit et al., 2011). While temporal logic had initially focused on reasoning about temporal and logical relations, its dialects with probabilistic modalities have been used increasingly for robotics applications (Baier & Katoen, 2008; De Alfaro, 1998; Kwiatkowska et al., 2002).
4 Generating Specification MDPs
Similar to LTL, GLTL formulas are built from a set of atomic propositions , Boolean operators (conjunction), (negation) and temporal operator (-until). Useful operators such as (disjunction), (-eventually) and (-always) can be derived from these basic operators.
GLTL formulas can be converted to the corresponding specification MDPs recursively, with the operator precedence listed in descending order in Table 1. Operators of the same precedence are read from right to left. For example, , .
|Precedence||Operator||# of Operands|
Assume are GLTL formulas in the following discussion.
, where is an atomic proposition: A specification MDP for can be constructed such that, if holds at , the transition is taken with probability 1; otherwise, the transition is taken with probability 1.
: A specification MDP can be constructed from a specification MDP by swapping the terminal states acc and rej.
: A specification MDP can be constructed from specification MDPs and such that (1) , and the accepting state is ; (2) ; (3) for all transitions of and of , if either or , let ; otherwise, .
: The operator -until has two operands, and , which generate specification MDPs and . The new specification MDP is constructed from and : , where acc and rej are the accepting and rejecting state, respectively, and is the initial state; ; for all , , and , if and , a transition is added to with probability as specified in Table 2.
Here are some intuitions behind the construction of . The formula means that, within some stochastically decided time period , we would like to successfully implement task in at most steps without ever failing in task . If we observe a success in (that is, the specification reaches ) before fails (that is the sepcification reaches ), goes to state acc for sure; if we observe a failure in (that, the specifcation reaches ) before succeeding in (that is, the specification reaches ), goes to state rej for sure. In all other cases, primarily keeps track of the transitions in and , with a tiny probability of failing immediately, which corresponds to the operator expiring.
: As in the semantics of LTL, -eventually . Hence, given a specification MDP for , we can construct a specification MDP for : , , , ; ; transitions of are modified from those of as in Table 3. Informally, is satisfied if we succeed in task within the stochastic observation time period.
Using the transitions as described, a given GLTL formula can be converted into a specification MDP. To satisfy the specification in a given environment, a joint MDP is created as follows:
Take the cross product of the MDP representing the environment and the specification MDP.
Any state that corresponds to an accepting or rejecting state of the specification MDP becomes a sink state. However, the accepting states also include a reward of .
The resulting MDP is solved to create a policy.
The resulting policy is one that maximizes the probability of satisfying the given formula where the random events are both the transitions in the environment and the stochastic transitions in the specification MDP. Such policies tend to prefer satisfying formulas quickly, as that increases the chance of successful completion before operators expire.
5 Example Domain
Consider the following formula:
It specifies a task of reaching a red state without encountering a blue state and, once a red state is reached, going to a green state.
Figure 5 illustrates a grid world environment in which this task can be carried out. It consists of different colored grid cells. The agent can move to one of the four adject cells to its current position with a north, south, east, or west action. However, selecting an action for one direction has a 0.02 probability of moving in the one of the three other directions. This stochastic movement causes the agent to keep its distance from dangerous grid cells that could result in task failure, whenever possible. The solid line in the figure traces the path of the optimal policy of following this specification in the grid. As can be seen, the agent moves to red and then green. Note that this behavior can be very difficult to encode in a standard reward function as both green and red need to be given positive reward and therefore either would be a sensible place for the agent to stop.
Figure 5 illustrates a grid world environment in which the blue cells create a partial barrier between the red and green cells. As a result of the “until” in the specification, the agent goes around the blue wall to get to the red cell. However, since the prohibition against blue cells is lifted once the red cell is reached, it goes directly through the barrier to reach green.
These 25-state environments become 98-state MDPs when combined with the specification MDP.
In contrast to standard MDP reward functions, we have provided an environment-independent specification for tasks. We have shown that this specification language can capture standard tasks used in the MDP community and that it can be automatically incorporated into an environment MDP to create a fixed MDP to solve. Maximizing reward in this resulting MDP maximizes the probability of satisfying the task specification.
Future work includes inverse reinforcement learning of task specifications and techniques for accelerating planning.
Abbeel & Ng (2005)
Abbeel, Pieter and Ng, Andrew Y.
Exploration and apprenticeship learning in reinforcement learning.
Proceedings of the 22nd International Conference on Machine Learning, pp. 1–8, 2005.
- Almagor et al. (2014) Almagor, Shaull, Boker, Udi, and Kupferman, Orna. Discounting in ltl. In Tools and Algorithms for the Construction and Analysis of Systems, pp. 424–439. Springer, 2014.
- Atkeson (1994) Atkeson, Christopher G. Using local trajectory optimizers to speed up global optimization in dynamic programming. In Advances in Neural Information Processing Systems, pp. 663–663, 1994.
Bacchus et al. (1996)
Bacchus, Fahiem, Boutilier, Craig, and Grove, Adam.
Proceedings of the Thirteenth National Conference on Artificial Intelligence, pp. 1160–1167. AAAI Press/The MIT Press, 1996.
- Baier & Katoen (2008) Baier, Christel and Katoen, Joost-Pieter. Principles of Model Checking. MIT Press, 2008.
- Barto et al. (1983) Barto, Andrew G., Sutton, Richard S., and Anderson, Charles W. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13(5):834–846, 1983.
- Boutilier et al. (1999) Boutilier, Craig, Dean, Thomas, and Hanks, Steve. Decision-theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research, 11:1–94, 1999.
- Brafman & Tennenholtz (2002) Brafman, Ronen I. and Tennenholtz, Moshe. R-MAX—a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:213–231, 2002.
- De Alfaro (1998) De Alfaro, Luca. Formal verification of probabilistic systems. PhD thesis, Stanford University, Stanford, CA, USA, 1998.
- De Alfaro et al. (2003) De Alfaro, Luca, Henzinger, Thomas A, and Majumdar, Rupak. Discounting the future in systems theory. In Automata, Languages and Programming, pp. 1022–1037. Springer, 2003.
- De Alfaro et al. (2004) De Alfaro, Luca, Faella, Marco, Henzinger, Thomas A, Majumdar, Rupak, and Stoelinga, Mariëlle. Model checking discounted temporal properties. Springer, 2004.
- Dietterich (2000) Dietterich, Thomas G. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000. URL citeseer.ist.psu.edu/article/dietterich00hierarchical.html.
- Ding et al. (2011) Ding, Xu Chu, Smith, Stephen L., Belta, Calin, and Rus, Daniela. Ltl control in uncertain environments with probabilistic satisfaction guarantees. CoRR, abs/1104.1159, 2011.
Efficient reinforcement learning.
Proceedings of the Seventh Annual ACM Conference on Computational Learning Theory, pp. 88–97. Association of Computing Machinery, 1994.
- Kearns & Singh (1998) Kearns, Michael and Singh, Satinder. Near-optimal reinforcement learning in polynomial time. In Proceedings of the 15th International Conference on Machine Learning, pp. 260–268, 1998. URL citeseer.nj.nec.com/kearns98nearoptimal.html.
- Kearns & Singh (2002) Kearns, Michael J. and Singh, Satinder P. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2–3):209–232, 2002.
- Kress-Gazit et al. (2009) Kress-Gazit, H., Fainekos, G.E., and Pappas, G.J. Temporal-logic-based reactive mission and motion planning. IEEE Tans. on Robotics, 25:1370–1381, 2009.
- Kress-Gazit et al. (2011) Kress-Gazit, H., Wongpiromsarn, T., and Topcu, U. Correct, reactive robot control from abstraction and temporal logic specifications. IEEE RAM, 18:65–74, 2011.
- Kwiatkowska et al. (2002) Kwiatkowska, Marta, Norman, Gethin, and Parker, David. Prism: Probabilistic symbolic model checker. In Computer Performance Evaluation: Modelling Techniques and Tools, volume 2324, pp. 113–140. Springer, 2002.
- Lahijanian et al. (2011) Lahijanian, M., Andersson, S. B., and Belta, C. Control of Markov decision processes from PCTL specifications. In Proc. of the American Control Conference, pp. 311–316, 2011.
- Li et al. (2011) Li, Lihong, Littman, Michael L., Walsh, Thomas J., and Strehl, Alexander L. Knows what it knows: A framework for self-aware learning. Machine Learning, 82(3):399–443, 2011.
- Liu et al. (2013) Liu, Jun, Ozay, Necmiye, Topcu, Ufuk, and Murray, Richard M. Synthesis of reactive switching protocols from temporal logic specifications. IEEE Transactions on Automatic Control, 58(7):1771–1785, 2013.
- Manna & Pnueli (1992) Manna, Zohar and Pnueli, Amir. The Temporal Logic of Reactive & Concurrent Sys. . Springer, 1992.
- Moore (1991) Moore, Andrew W. Variable resolution dynamic programming: Efficiently learning action maps in multivariate real-valued spaces. In Proc. Eighth International Machine Learning Workshop, 1991.
- Puterman (1994) Puterman, Martin L. Markov Decision Processes—Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, 1994.
- Russell & Norvig (1994) Russell, Stuart J. and Norvig, Peter. Artificial Intelligence: A Modern Approach. Prentice-Hall, Englewood Cliffs, NJ, 1994. ISBN 0-13-103805-2.
- Strehl et al. (2009) Strehl, Alexander L., Li, Lihong, and Littman, Michael L. Reinforcement learning in finite MDPs: PAC analysis. Journal of Machine Learning Research, 10:2413–2444, 2009.
- Watkins (1989) Watkins, Christopher J. C. H. Learning from Delayed Rewards. PhD thesis, King’s College, Cambridge, UK, 1989.
- Wolff et al. (2012) Wolff, Eric M., Topcu, Ufuk, and Murray, Richard M. Robust control of uncertain markov decision processes with temporal logic specifications. In Proc. of the IEEE Conference on Decision and Control, 2012.
- Wongpiromsarn et al. (2012) Wongpiromsarn, T., Topcu, U., and Murray, R.M. Receding horizon temporal logic planning. IEEE T. on Automatic Control, 57:2817–2830, 2012.