A Hierarchical Reinforcement Learning Method for Persistent Time-Sensitive Tasks

06/20/2016 ∙ by Xiao Li, et al. ∙ 0

Reinforcement learning has been applied to many interesting problems such as the famous TD-gammon and inverted helicopter flight. However, little effort has been put into developing methods to learn policies for complex persistent tasks and tasks that are time-sensitive. In this paper, we take a step towards solving this problem by using signal temporal logic (STL) as the task specification, and by taking advantage of the temporal abstraction that the options framework provides. We show via simulation that a relatively easy-to-implement algorithm that combines STL and options can learn a satisfactory policy with a small number of training cases.




I Introduction

Reinforcement learning is the problem of learning from interaction with the environment to achieve a goal [3]. Usually the interaction model is unknown to the learning agent, and an optimal policy is to be learned from sequences of interaction experiences and a reward that indicates the “correctness” of taking an action (hence the term reinforcement). There have been a number of successful attempts to apply reinforcement learning to the field of control. One of the most widely known efforts is the learning of a flight controller for aggressive aerobatic maneuvers on an RC helicopter [2]. In addition, a PR2 (Personal Robot 2) has learned to perform a number of household chores, such as placing a coat hanger and twisting open bottle caps, using ideas from reinforcement learning [4]. More recent efforts in this area have led to a learning agent that plays many of the classic Atari 2600 games at the professional human level [5], and to the possibility of a match at the game of Go between AlphaGo (an AI agent created by Google DeepMind [6]) and Lee Sedol, one of the top Go players in the world.

Reinforcement learning has great potential in areas where precise modeling of system and environmental dynamics is difficult but interaction data is available, which is the case for many real-world applications. In classical reinforcement learning, the reward structure needs to be carefully designed to obtain a desirable outcome, and often additional techniques such as reward shaping [7],[8] need to be applied to improve the learning efficiency. Moreover, the tasks being learned are often single-goal episodic tasks such as reaching a destination in the shortest time [9], paddling a ball [10], or winning a game that has a set of well-defined rules [5],[1]. Little effort has been put into creating a learning agent for complex, time-sensitive, multi-goal persistent tasks. Persistence means that the task is continuous/cyclic and does not have a notion of termination (or absorbing state), whereas multi-goal time-sensitivity indicates that the task consists of subtasks and it is desirable to switch among them in a predefined, timely fashion. An example of such a task is controlling a robotic manipulator on an assembly line. Here the manipulator may switch from fastening a screw at one location to welding at another location, and the time between the switches may need to be controlled depending on how the position and orientation of the part are handled, possibly by the conveyor belt or other manipulators.

Learning of simple persistent tasks has traditionally been tackled using average reward reinforcement learning [11]. Well-known algorithms include R-learning [12] and H-learning [13]. However, these methods work well only with unichain MDPs, where every deterministic and stationary policy on that MDP contains only a single loop, also called a recurrent class of states [14]. This is not enough for tasks of reasonable complexity. The authors of [15],[16] use model-based approaches to learn policies that maximize the probability of satisfying a given linear temporal logic (LTL) formula. However, using probability of satisfaction to guide learning can be inefficient because no “partial credit” is given to the agent for being “close” to satisfying the specification. As a result, the agent performs random search until it “accidentally” satisfies the LTL specification for the first time. Moreover, LTL has time-abstract semantics that prevent users from specifying time bounds quantitatively.

In this paper we turn to signal temporal logic (STL), a rich predicate logic that can describe tasks involving bounds on physical states, continuous time windows and logical relationships. For example, the assembly line manipulator control task described earlier can be easily expressed using STL in the form “from the start of the assembly task until the end, with a given period, repeatedly position the end-effector to within a tolerance of the screw location and perform the fastening motion, and then position the end-effector to within a tolerance of the welding point and perform the welding task” (the STL formula is presented in the next section). One significant convenience of STL is that it comes equipped with a continuous measure of satisfaction called the robustness degree, which translates naturally to a continuous reward structure that can be used in the reinforcement learning framework. Therefore, with STL the user only has to “spell out” the task requirements in a compact yet powerful language, and there is no need to struggle with designing a good reward structure.

The challenge of using STL is that evaluating the robustness degree requires a state trajectory. Therefore, either some kind of memory needs to be incorporated into the learning agent, or the state/action space needs to be expanded to incorporate trajectories that the agent can choose from. Here we adopt the options framework [17], which abstracts each subtask as an MDP with a policy of its own, with a higher level policy that chooses among the subtask policies at appropriate times. In this paper we present an algorithm that, given an STL task specification, automatically generates a set of subtasks and, by interacting with the environment, simultaneously learns the subtasks’ policies and the higher level policy that will lead the agent to satisfy the given specification.

Section II introduces the Q-learning algorithm that subsequent contents are developed on, as well as the options framework and STL. Section III describes in detail the proposed algorithm. Section IV provides simulation results to verify the proposed approach and some discussions about the advantages and shortcomings of the algorithm. Section V concludes with final remarks and directions for future work.

II Background

Reinforcement learning bears the curse of dimensionality. Especially for discretized representations of state and action spaces (used in many classical tabular methods [3]), the number of parameters (state values, action values, etc.) increases exponentially with the dimension of the state/action space. One attempt to alleviate this computational burden is to exploit temporal and state abstractions, and the possibility of learning on a reduced set of abstractions as opposed to the primitive states and actions. Ideas along this line are called hierarchical reinforcement learning (HRL) in the literature, and a survey of advances in this area as well as the main approaches used is provided in [18]. We base our work on the options framework developed in [17] for its ability to deal with temporally extended actions, which is extremely helpful in developing an algorithm that learns a policy satisfying the complicated task specification given by an STL formula.

II-A Reinforcement Learning Framework and Q-Learning

Here we briefly describe the reinforcement learning framework for discrete-time finite Markov decision processes (MDP).

Definition 1

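An MDP is a tuple $\langle S, A, T, R \rangle$ where

  • $S$ is a finite set of states;

  • $A$ is a finite set of actions;

  • $T: S \times A \times S \rightarrow [0,1]$ is the transition probability, with $T(s, a, s')$ being the probability of taking action $a \in A$ at state $s \in S$ and ending up in state $s' \in S$;

  • $R: S \times A \times S \rightarrow \mathbb{R}$ is the reward function, with $R(s, a, s')$ being the reward obtained by taking action $a$ in $s$ and ending up in $s'$.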

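In reinforcement learning, the transition model $T$ and the reward structure $R$ are unknown to the learning agent (although an immediate reward is given to the agent after each action). Instead, the agent has to interact with the environment and figure out the optimal sequence of actions to take in order to maximize the obtained reward. We base our method on one of the most popular model-free off-policy reinforcement learning algorithms, Q-learning [19]. In short, Q-learning aims at finding a policy $\pi: S \rightarrow A$ that maximizes the expected sum of discounted reward given by

$$V^{\pi}(s) = E^{\pi}\left[\sum_{i=0}^{\infty} \gamma^{i} r_{t+i} \,\middle|\, s_t = s\right]. \tag{1}$$

Here $\gamma \in [0,1]$ is a constant discount factor that is decayed with time (hence the exponent $i$) to put higher value on more recent rewards. $r_{t+i}$ is the one step immediate reward at step $t+i$. Equation (1) can be written recursively as

$$V^{\pi}(s) = \sum_{s'} T(s, \pi(s), s') \left[ R(s, \pi(s), s') + \gamma V^{\pi}(s') \right], \tag{2}$$

which is the well-known Bellman equation. Algorithms exist that learn the optimal value function from experience. The most famous one is perhaps the temporal difference learning algorithm (also called TD-learning [20]). After $V^{\pi}$ converges to its optimal value $V^{*}$, we have the recursive relationship

$$V^{*}(s) = \max_{a} \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^{*}(s') \right]. \tag{3}$$

The optimal policy is calculated from

$$\pi^{*}(s) = \arg\max_{a} \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^{*}(s') \right]. \tag{4}$$

However, without knowing the transition model $T$, it is difficult to extract the optimal policy from $V^{*}$. This is where Q-learning comes in. Define an action-value function that assigns a value to each state-action pair (also known as the Q-function) as follows:

$$Q^{\pi}(s, a) = \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^{\pi}(s') \right]. \tag{5}$$

Then following Equation (3) we have

$$V^{*}(s) = \max_{a} Q^{*}(s, a). \tag{6}$$

Now we can write the optimal Q-function in a recursive form as

$$Q^{*}(s, a) = \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma \max_{a'} Q^{*}(s', a') \right]. \tag{7}$$

$Q^{*}$ can be approximated by calculating a running average of the Q-values obtained from experience.

Assume that at time $t$ the agent takes action $a_t$, transitions from state $s_t$ to $s_{t+1}$, and obtains a one step immediate reward $r_t$ (an experience usually takes the form of a tuple $(s_t, a_t, s_{t+1}, r_t)$). The Q-function is then updated following

$$Q(s_t, a_t) \leftarrow (1 - \alpha_t) Q(s_t, a_t) + \alpha_t \left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') \right], \tag{8}$$

where $\alpha_t$ is the learning rate. It is proven in [21] that if the choice of $\alpha_t$ satisfies $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$ while every state and action are visited infinitely often, then $Q$ converges to $Q^{*}$ (denoted by $Q \rightarrow Q^{*}$). In practice it is usually sufficient to use a constant $\alpha$, and thus the subscript is dropped in later formulations. After convergence, the optimal policy can be calculated by

$$\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a). \tag{9}$$

Since the action is an explicit variable of the Q-function, Equation (9) can be easily evaluated.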
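The tabular Q-learning update described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the dictionary-based table, the function names, and the action labels are assumptions made for the example, while the learning rate 0.2 and discount factor 0.9 are the flat-policy values listed in Table II.

```python
from collections import defaultdict


def q_update(Q, s, a, r, s_next, actions, alpha=0.2, gamma=0.9):
    """One Q-learning step: blend the old estimate with the bootstrapped target."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
    return Q[(s, a)]


def greedy_action(Q, s, actions):
    """Greedy policy: the action with the largest Q-value at state s."""
    return max(actions, key=lambda a: Q[(s, a)])


# Hypothetical action labels and states, used only for this example.
actions = ["up", "down", "left", "right"]
Q = defaultdict(float)  # unseen state-action pairs default to 0.0
q_update(Q, s=(0, 0), a="right", r=1.0, s_next=(1, 0), actions=actions)
print(greedy_action(Q, (0, 0), actions))
```

Because the update only needs the sampled tuple of state, action, next state and reward, no transition model is required, which is the point made in the text.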

II-B Option-Based Hierarchical Reinforcement Learning

The options framework deals explicitly with temporally extended actions. An option is defined by a tuple $o = \langle I, \pi, \beta \rangle$, where $I \subseteq S$ is the initiation set denoting the states where the option is available, $\pi: S \rightarrow A$ is the option’s policy (also called a flat policy), and $\beta: S \rightarrow [0,1]$ is the termination map defining the probability of termination of the option at each state. Suppose at time $t$ the agent resides at state $s_t$. Instead of choosing an action $a_t \in A$, the agent chooses an option $o_t \in O$, where $O$ is the set of options (note that option $o_t$ needs to be available at state $s_t$, i.e. $s_t \in I_{o_t}$). After selecting the option, the agent follows the option’s flat policy until termination is invoked at some state $s_{t+k}$. Analogous to Q-learning, the experience that the agent obtains now becomes a tuple $(s_t, o_t, s_{t+k}, r_{o_t})$, where $r_{o_t}$ is a lumped reward from executing option $o_t$ to termination. Assuming that option $o_t$ is executed for $k$ time steps, instead of updating an action-value function $Q(s, a)$, an option-value function $Q^{\mu}(s, o)$ is updated using

$$Q^{\mu}(s_t, o_t) \leftarrow (1 - \alpha) Q^{\mu}(s_t, o_t) + \alpha \left[ r_{o_t} + \gamma^{k} \max_{o' \in O} Q^{\mu}(s_{t+k}, o') \right]. \tag{10}$$

This update is applied each time an option is executed to termination. Equation (10) is very similar to Equation (8) except for the exponent $k$ on the discount factor $\gamma$. This signifies that the option is executed for a temporally extended period of time and future rewards should be discounted accordingly. It is worth mentioning that a primitive action $a$ can be considered a one step option that is available in every state, always selects $a$, and terminates after one step ($k = 1$); therefore, if only such options are executed, Equation (10) becomes Equation (8). The optimal options policy is obtained by

$$\mu^{*}(s) = \arg\max_{o \in O} Q^{\mu^{*}}(s, o). \tag{11}$$

The flat policy $\pi$ for each option in $O$ can be provided by the user or be learned simultaneously with the options policy $\mu$. Details on simultaneous learning will be discussed in the next section. We refer readers to [17] for a detailed formulation of the options framework.
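The option-value update differs from the one-step Q-learning update only in the exponent on the discount factor. The sketch below illustrates this under stated assumptions: the table layout, the function name `option_q_update`, and the state and option labels are hypothetical, and the learning rate 0.5 and discount factor 0.9 are the options-policy values from Table II.

```python
from collections import defaultdict


def option_q_update(Q_mu, s, o, lumped_r, s_term, k, options,
                    alpha=0.5, gamma=0.9):
    """Update the option-value table after option o ran for k steps.

    The next-state value is discounted by gamma**k because the option
    occupied k primitive time steps before terminating in s_term.
    """
    best_next = max(Q_mu[(s_term, o2)] for o2 in options)
    target = lumped_r + (gamma ** k) * best_next
    Q_mu[(s, o)] = (1 - alpha) * Q_mu[(s, o)] + alpha * target
    return Q_mu[(s, o)]


# Hypothetical labels, used only for this example.
options = ["o1", "o2"]
Q_mu = defaultdict(float)
Q_mu[("s3", "o2")] = 2.0
print(option_q_update(Q_mu, "s0", "o1", lumped_r=1.0, s_term="s3",
                      k=3, options=options))
```

With `k=1` the function reduces to the ordinary Q-learning step, which mirrors the remark that a primitive action is a one step option.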

II-C Signal Temporal Logic (STL)

Signal temporal logic is a framework used to describe an expressive collection of specifications in a compact form. It was originally developed to monitor continuous-time signals, but can be extended to describe desired state constraints in a control system. Here we briefly present the necessary definitions of STL and refer interested readers to [22], [23], [24] for further details. Informally, STL formulas consist of boolean connectives $\neg$ (negation/not), $\wedge$ (conjunction/and), $\vee$ (disjunction/or), as well as bounded-time temporal operators $U_{[a,b)}$ (until between $a$ and $b$), $F_{[a,b)}$ (eventually between $a$ and $b$) and $G_{[a,b)}$ (always between $a$ and $b$) that operate on a finite set of predicates over the underlying states. As a quick example, consider a robot traveling in a plane with its position $(x, y)$ as states. The trajectory of the robot is specified by a simple STL formula

$$\phi = G_{[0,4)}(x > 10 \wedge x < 12 \wedge y > 6 \wedge y < 8). \tag{12}$$

The formula in Equation (12) reads “always in 0 to 4 time steps, x is to be greater than 10 and smaller than 12, and y greater than 6 and smaller than 8”, which specifies that the robot should stay in the square region bounded by $10 < x < 12$ and $6 < y < 8$ from 0 to 4 time steps.

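In this paper, we constrain STL to be defined over sequences of discrete valued states produced by the MDP in Definition 1. We denote by $s_t \in S$ the state at time $t$, and by $s_{t:t+k}$ the time series of the state trajectory from $t$ to $t+k$, i.e. $s_{t:t+k} = s_t s_{t+1} \ldots s_{t+k}$. The usefulness of STL lies in the fact that it comes equipped with a quantitative measure of how well a given formula is satisfied, which is called the robustness degree (robustness for short). In the above example, a term like $x > 10$ is called a predicate, which we denote by p. Let p take the form of a general inequality $f(s) < c$, where $f$ is a function of the states and $c$ is a constant (such as $-x < -10$). If a state trajectory $s_{t:t+k}$ is provided, the robustness of an STL formula is defined recursively by

$$\begin{aligned}
\rho(s_{t:t+k}, f(s) < c) &= c - f(s_t), \\
\rho(s_{t:t+k}, \neg\phi) &= -\rho(s_{t:t+k}, \phi), \\
\rho(s_{t:t+k}, \phi_1 \wedge \phi_2) &= \min\big(\rho(s_{t:t+k}, \phi_1), \rho(s_{t:t+k}, \phi_2)\big), \\
\rho(s_{t:t+k}, \phi_1 \vee \phi_2) &= \max\big(\rho(s_{t:t+k}, \phi_1), \rho(s_{t:t+k}, \phi_2)\big), \\
\rho(s_{t:t+k}, F_{[a,b)}\phi) &= \max_{t' \in [t+a,\, t+b)} \rho(s_{t':t+k}, \phi), \\
\rho(s_{t:t+k}, G_{[a,b)}\phi) &= \min_{t' \in [t+a,\, t+b)} \rho(s_{t':t+k}, \phi), \\
\rho(s_{t:t+k}, \phi_1 U_{[a,b)} \phi_2) &= \max_{t' \in [t+a,\, t+b)} \min\Big(\rho(s_{t':t+k}, \phi_2), \min_{t'' \in [t,\, t')} \rho(s_{t'':t+k}, \phi_1)\Big).
\end{aligned} \tag{13}$$

Note that in general, if the formula $\phi$ contains temporal operators ($U$, $F$, $G$), a state trajectory is required to evaluate robustness, but if $\phi$ contains only boolean connected predicates, the robustness is evaluated with respect to one particular state $s_t$. Using the above definition of robustness, a larger positive value means stronger satisfaction and a larger negative value means stronger violation of the STL formula.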

Time t          0       1        2        3
State (x, y)    (9,7)   (10,7)   (11,7)   (11,8)
Robustness      -1      0        1        0
TABLE I: Simple STL Example

Table I shows an example of how to calculate the robustness of a trajectory given the STL formula in Equation (12). We can see that for this trajectory the overall robustness is negative, meaning that Equation (12) is violated. The reason is that the first point in the trajectory lies outside the desired square given by $10 < x < 12$, $6 < y < 8$, and the STL formula dictates that all positions should stay inside the square within the timeframe of 0 to 4. If instead of $G_{[0,4)}$ we specify $F_{[0,4)}$, then the overall robustness is 1, because $F$ (eventually) looks at the point of highest satisfaction whereas $G$ (always) looks at the point of highest violation. The point of maximum satisfaction occurs at the center of the square, $(11, 7)$, with a robustness value of 1.
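The robustness calculation for this example can be sketched as follows. The code is an illustration under stated assumptions: the function names are hypothetical, each predicate of the form $f(s) < c$ is scored as $c - f(s)$, conjunction is a minimum, and the always and eventually operators are the minimum and maximum over the trajectory. The trajectory used is the first three states of Table I.

```python
def rho_square(state):
    """Robustness of x > 10 and x < 12 and y > 6 and y < 8 at one state."""
    x, y = state
    # Each term is c - f(s) for one predicate; conjunction takes the minimum.
    return min(x - 10, 12 - x, y - 6, 8 - y)


def rho_always(trajectory, rho_state):
    """G over the whole trajectory: the worst state decides."""
    return min(rho_state(s) for s in trajectory)


def rho_eventually(trajectory, rho_state):
    """F over the whole trajectory: the best state decides."""
    return max(rho_state(s) for s in trajectory)


trajectory = [(9, 7), (10, 7), (11, 7)]
print([rho_square(s) for s in trajectory])
print(rho_always(trajectory, rho_square))
print(rho_eventually(trajectory, rho_square))
```

The sign of the result indicates satisfaction or violation, and its magnitude gives the continuous signal that is later used as a reward.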

Even though the example above uses the simplest form of STL formula for explanation, an STL specification can be much richer. For the assembly line manipulator task mentioned in the Introduction, let be the position of the end-effector, be the position of the screw to be fastened, and be the welding point. Then the assembly task can be expressed by the STL formula


In the above formula is the Euclidean distance. and are the position thresholds for the screw fastening and welding tasks respectively.

III Reinforcement Learning for STL Specified Goals

The options framework provides a way to expand the action space to a set of options. Executing options generates repeatable trajectories that can be used to evaluate STL robustness. In this section we present an algorithm that, given an STL formula that describes the desired behavior of the system, automatically generates a set of options. The algorithm then learns a hierarchically optimal options policy and all options’ flat policies by interacting with the environment (more on hierarchical optimality in the next section).

III-A Problem Formulation

Given an options policy , let be the expected sum of discounted lumped reward of state obtained from following , which can be written recursively as


In the above equation, the subscripts denote the option being executed i.e. at each state . is the lumped reward obtained from executing option at time and state , and terminating at time and state (refer to Equation (13) for notation and robustness calculation). Here we denote to be the number of time steps option takes to terminate. The problem that we address in this paper can then be formulated as:

Problem 1

Given an MDP with unknown transition model and reward structure , an STL formula over , and a set of options , find a policy that maximizes the expected sum of discounted lumped reward as specified in Equation (15).

Before the algorithm is presented, we introduce some terminology. First, a primitive option is an option whose policy is a flat policy (i.e. ). This is in contrast with a hierarchical option whose policy maps states to lower level options (). In other words, a hierarchical option is an option over options and thus sits higher up the hierarchy. We will not be using hierarchical options in this paper. A temporally combined option is an option constructed by executing a selected set of options in a predefined order. For example, suppose we have two primitive options and ; a temporally combined option can be executed by first following option until termination and then following option until termination. Therefore the initiation set and the termination map . It should also be ensured that the states where termination of option is possible are elements of the initiation set of , i.e. . A temporally combined option can be a primitive option or a hierarchical option depending on its constituent options. For the method presented in this paper, all options are primitive options, hence the subscript is dropped.
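A temporally combined option can be sketched as running its constituent options back to back. The example below is an assumption-laden illustration, not the paper's implementation: options are represented as hypothetical (policy, termination test) pairs, termination is deterministic, the one-dimensional world is a toy, and the initiation-set check mentioned in the text is omitted for brevity.

```python
def run_option(state, policy, terminates, step, max_steps=1000):
    """Follow one option's flat policy until its termination test fires."""
    trajectory = [state]
    for _ in range(max_steps):
        if terminates(state):
            break
        state = step(state, policy(state))
        trajectory.append(state)
    return trajectory


def run_combined_option(state, options, step):
    """Execute (policy, terminates) pairs in order and chain the trajectories."""
    trajectory = [state]
    for policy, terminates in options:
        segment = run_option(trajectory[-1], policy, terminates, step)
        trajectory.extend(segment[1:])  # drop the duplicated start state
    return trajectory


def step(s, a):
    """Toy 1-D world: the state is an integer and the action is added to it."""
    return s + a


go_up = (lambda s: +1, lambda s: s >= 3)    # terminates once s >= 3
go_down = (lambda s: -1, lambda s: s <= 1)  # terminates once s <= 1
print(run_combined_option(0, [go_up, go_down], step))
```

The combined trajectory starts where the first option starts and ends where the last option terminates, matching the description of the initiation set and termination map above.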

III-B The Hierarchical STL Learning Algorithm (HSTL-Learning)

Given an STL specification containing boolean connected predicates to (like the in Equation (12)), for each , construct a primitive option ( and are user defined). Using these primitive options, a set of temporally combined options is constructed. The way in which is constructed can be controlled by the user. For example, if the primitive options set is , a possible temporally combined options set can be . Here we take advantage of the fact that Q-learning is an off-policy learning algorithm, meaning that the learned policy is independent of the exploration scheme [25]. Hence multiple policies can be learned simultaneously while the agent is interacting with the environment. In the case of this example, policies need to be learned, where is the number of boolean connected predicates in the STL specification (hence the number of flat policies), plus one more for the options policy . The complete learning algorithm is presented in Algorithm 1.

1:procedure HSTL-update()
2:     For each of the primitive options, initialized action-value function , initiation set and termination map
3:     Construct the temporally combined options set
4:     Initialize the option-value function for
5:     Choose learning rates and discount factors for all learning agents
6:      this is the state where is terminated. colon indicates all elements in the dimension
7:     for  to  do
11:         for  to  do update all primitive options’ Q-functions
12:               robustness as the reward for flat policy learning, refer to Equation (13)
14:         end for
15:          indicates element to
18:     end for
        return all for and
19:end procedure
Algorithm 1 HSTL-Learning

The inputs to Algorithm 1 are an STL specification , the currently selected option , and the trajectory resulting from executing to termination . Here is a matrix where is the dimension of the state space and is the number of time steps is executed before termination. is a matrix where is the dimension of the primitive action space. The algorithm outputs the updated Q-functions. The main idea of Algorithm 1 is that every time an option is executed to termination, the resulting trajectory is used to calculate a reward by evaluating its robustness against the given STL formula (line 19). This reward is used to update the Q-function . In cases where the time of executing an option to termination is less than that required to evaluate the robustness of the given STL formula, the upper time bound of the STL formula is adjusted to coincide with the execution time of the option, and evaluation proceeds as usual. This ensures that choices of options are Markovian and do not depend on previous history. In addition, every primitive step within the trajectory is used to update the Q-functions for all options’ flat policies, with the reward being the robustness of the resulting state with respect to the corresponding (line 15). Because is updated only once, when an option terminates, convergence to a desirable policy can be quite slow. To speed up the learning process, an intra-option update step is introduced, which follows from the idea of intra-option value learning presented in [17]. If an option is initiated at state and terminated at with trajectory , then for every intermediate state we can also consider the sub-trajectory a valid experience, where option is initiated at state and terminated at . Therefore, instead of updating only once for state , it is updated for all intermediate states (lines 18-20), which drastically increases the efficiency of experience usage.
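The intra-option update step can be sketched as a loop over the suffixes of the executed trajectory. This is a hedged illustration of the idea rather than a transcription of Algorithm 1: the function name, the table layout, and the `robustness` callable (which stands in for evaluating the STL formula on a sub-trajectory) are assumptions, and the toy reward used in the demonstration is not an STL robustness.

```python
from collections import defaultdict


def intra_option_updates(Q_mu, trajectory, option, options, robustness,
                         alpha=0.5, gamma=0.9):
    """Update the option-value table for every intermediate start state.

    trajectory holds the states visited while executing `option`,
    from its initiation state to its termination state.
    """
    s_term = trajectory[-1]
    k = len(trajectory) - 1  # primitive steps taken by the option
    for i in range(k):
        sub_trajectory = trajectory[i:]      # as if initiated at state i
        reward = robustness(sub_trajectory)  # lumped reward for the suffix
        steps = k - i                        # duration of the suffix
        best_next = max(Q_mu[(s_term, o)] for o in options)
        target = reward + (gamma ** steps) * best_next
        key = (trajectory[i], option)
        Q_mu[key] = (1 - alpha) * Q_mu[key] + alpha * target
    return Q_mu


# Toy stand-in reward: the length of the sub-trajectory (hypothetical).
Q_mu = intra_option_updates(defaultdict(float), ["s0", "s1", "s2"], "o1",
                            ["o1", "o2"], robustness=lambda sub: float(len(sub)))
print(dict(Q_mu))
```

One executed option therefore yields one update per visited state instead of a single update at the initiation state.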

III-C Discussion

In this subsection we discuss some of the advantages and shortcomings of the proposed method. Unlike conventional reinforcement learning approaches where manual design of rewards is necessary, STL provides a convenient way to specify complicated task goals while naturally translating the specifications into rewards. In addition, since robustness is a continuous measure of satisfaction, the resulting reward structure helps to speed up learning of the flat policies, much like potential-based reward shaping [8].

The correctness and completeness of the proposed algorithm are determined by the options framework. Here we introduce the notion of hierarchical optimality. A policy is said to be hierarchically optimal if it achieves the highest cumulative reward among all policies consistent with the given hierarchy [26]. In general, a hierarchical learning algorithm with a fixed set of options converges to a hierarchically optimal policy [17], which is the case for the HSTL-learning algorithm. More specifically, the HSTL-learning algorithm will find a hierarchically optimal policy that satisfies


for a fixed set of options ( defined in Equation (15)). Whether robustness of the STL specification is satisfied/maximized depends on the set of options provided to the algorithm. A policy leading to trajectories that maximize the robustness of the given STL formula will be found if the trajectories can be constructed from the options provided. Therefore the correctness and completeness of the proposed algorithm are related to the hierarchical optimality property, and hence also depend on the set of options provided.

On complexity, Algorithm 1 requires operations per update. Here is the number of steps the current option takes to terminate, and is the number of elements in the set . depends on the number of flat policies and how is constructed. Like Q-learning, the number of training steps required for convergence depends largely on the learning parameters listed in Table II, and convergence is guaranteed if each state-action pair is visited infinitely often (convergence guarantee discussed in Section II-A).

Finally, it is worth mentioning that multiple trajectories may exist that maximally satisfy a given STL formula (for example, any trajectory that passes through maximally satisfies ). The proposed method chooses only the most greedy trajectory given the set of available options. This takes away some flexibility and diversity in the policies an agent can learn, but it is also a predictable characteristic that can be used to one’s advantage.

IV Case Study

In this section we evaluate the performance of the proposed method in a simulated environment and discuss the results. As depicted in Figure (1), a mobile robot navigates in a grid world with three rectangular regions , , enclosed by colored borders. The state space of the robot is its 2D position , which takes 225 discrete combinations. The robot has an action space . The robot’s transition model entails that it follows a given action with probability 0.7, or takes one of the other three actions, each with probability 0.1. The robot has full state observability but has no knowledge of its transition model. The goal is for the robot to interact with the environment by taking sequences of actions and observing the resultant states, and in the end to learn a policy that, when followed, satisfies the STL specification



Fig. 1 : A grid world simulation environment. are three regions the robot can visit. The robot can choose to move in the four directions shown in the figure. The probability of moving in the desired direction is 0.7 and the probability of moving in any of the three undesired directions is 0.1
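The stochastic transition model can be sketched as below. Several details are assumptions made for the example and are not stated in the text: a 15 by 15 grid (one layout consistent with 225 states), the action labels, and the rule that the robot stays in place when a move would leave the grid.

```python
import random

# Hypothetical action labels and their displacement on the grid.
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
SIZE = 15  # assumed grid side length, giving 15 * 15 = 225 states


def step(state, action, rng=random):
    """Sample the next state: intended move w.p. 0.7, each other move w.p. 0.1."""
    if rng.random() < 0.7:
        chosen = action
    else:
        # The remaining 0.3 is split evenly over the three other actions.
        chosen = rng.choice([a for a in ACTIONS if a != action])
    dx, dy = ACTIONS[chosen]
    x = min(max(state[0] + dx, 0), SIZE - 1)  # clamp to the grid
    y = min(max(state[1] + dy, 0), SIZE - 1)
    return (x, y)


rng = random.Random(0)
print(step((7, 7), "right", rng))
```

The learning agent only ever calls such a function to sample transitions; it never reads the probabilities directly, which is what makes the setting model-free.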

In English the above specification says “for as long as the robot is running (), enter regions , and every 40 time steps”. This is a cyclic task with no termination. Three primitive options are constructed , . Here we let their initiation sets be the entire state space, i.e. , which means all three options can be initiated anywhere. The termination map is given by

otherwise, (20)

which indicates that each option only terminates when entering a state where the robustness of that state with respect to the corresponding is maximum. The last step is to construct the set of temporally combined options. Here we used (the hyphen in the subscript is dropped to save space). Note that the order of the subscripts is the order in which each primitive option is executed. To obtain a reasonable exploration-exploitation ratio, an exploration policy is carried out. The agent follows the greedy policy (exploitation) with probability , and chooses a random option/action with probability (exploration). The exploration is implemented at both the options policy level and the flat policy level. It is important that the flat policies converge faster and take greedy actions with higher probability than the options policy, because the execution of options depends on the flat policies. This is enforced by decaying the exploration probabilities linearly with time for both and ( where is the rate of decay) while ensuring that decays faster. The exploration probabilities have a lower limit of 0.1, which preserves some exploration even near convergence. Table II shows the learning parameters used in simulation. Even though the task specified by the STL formula in Equation (17) is persistent without termination, we divide our learning process into episodes of 200 option choices. That is, within each episode the robot chooses an option according to the policy , executes the option to termination, and repeats this 200 times. Then the robot is randomly placed at another location and the next learning episode starts. We performed the training process for 1200 episodes on a Mac with a 3 GHz processor and 8 GB of memory, and the training took 36 minutes 12 seconds to complete. The resulting policies and two sample runs are presented in Figure (2).
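The exploration scheme described above can be sketched as an epsilon-greedy rule with a linearly decaying exploration probability and a floor of 0.1. The exact decay expression and the initial values are not recoverable from the text, so the form `eps0 - decay * t` and the numbers in the demonstration are assumptions, as are the function names.

```python
import random


def exploration_prob(eps0, decay, t, floor=0.1):
    """Linearly decayed exploration probability, never below the floor."""
    return max(floor, eps0 - decay * t)


def epsilon_greedy(q_values, eps, rng=random):
    """Pick a random key with probability eps, otherwise the greedy key.

    q_values maps each available action (or option) to its current value.
    """
    if rng.random() < eps:
        return rng.choice(list(q_values))
    return max(q_values, key=q_values.get)


# Hypothetical values, for illustration only.
rng = random.Random(1)
eps = exploration_prob(eps0=1.0, decay=0.01, t=50)
print(eps, epsilon_greedy({"o1": 0.3, "o2": 1.2}, eps, rng))
```

Using a larger decay rate for the flat policies than for the options policy gives the ordering required in the text, where the flat policies become greedy sooner.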

Parameter Description Value
Discount factors for flat policies 0.9
Learning rates for flat policies 0.2
Initial exploration probability for flat policy learning
Linear decay rate for flat policy’s exploration probability
Discount factor for options policy 0.9
Learning rate for options policy 0.5
Initial exploration probability for options policy learning
Linear decay rate for options policy’s exploration probability
TABLE II: Parameters used in simulation
Fig. 2 : Learning results for 1200 episodes of training. Subfigures (a), (b), and (c) show the learned flat policies (primitive policies) . The red dot in each figure denotes the state of termination defined by the termination map . Subfigure (d) shows the learned options policy . Subfigures (e) and (f) illustrate sample runs of following the learned policies from two different initial positions shown by the red star

Figures (2a), (2b), and (2c) show the three flat policies , , and learned by the algorithm. The red dot represents the termination state for each option defined by the termination map. In this case the flat policies lead the robot to this state because it is the state of maximum robustness (but termination can be any state or set of states defined by the user). Figure (2d) shows the learned options policy . This is the policy that the robot follows at the highest level. For example at state , therefore option and will be executed to termination in order. For the STL formula in Equation (17), the desired trajectory as will be a loop that goes through regions , and , and the action/option taken at any other state should lead the agent to this loop along a trajectory that evaluates to the highest robustness degree. Figures (2e) and (2f) show two sample runs with different initial positions (indicated by the red star), and the resulting behavior is as expected. The colors of the arrows correspond to the color coding of the options in the previous options policy subfigure, and are subject to overlay. It can be observed that although neither the flat policies nor the options policy has converged after the 1200 episodes of training (for example at state in ), the resulting policies succeed in navigating the robot towards the desired behavior.

As discussed in Section III-C, the quality of the learned policies with respect to maximizing robustness depends on the set of options provided to the algorithm. Figure (3) shows a comparison of cumulative reward per episode between two different sets of temporally combined options. The first is the set used in the previous simulation. The second set takes into account the permutations of primitive options. Results show that using options set achieves an average of 34.8% higher cumulative reward per episode compared to using (negative reward values are due to random exploration when following learned policies). However, the time used to train the agent for the same 1200 episodes is 43 minutes 51 seconds for compared to 36 minutes 12 seconds for . In a way, this allows the user to trade off computational resources against optimality by deciding on the number and complexity of the options provided to the framework.

Fig. 3 : Comparison of cumulative reward per episode for two sets of temporally combined options

V Conclusion

In this paper we have developed a reinforcement learning algorithm that takes in an STL formula as the task specification, and learns a hierarchy of policies that maximize the expected sum of discounted robustness degree with hierarchical optimality. We have taken advantage of the options framework to provide the learning agent with a set of temporally extended actions (options), and the “correctness” of choosing an option at a state is evaluated by calculating the robustness degree of the resulting trajectory against the given STL formula. This naturally becomes the one step immediate reward in the reinforcement learning architecture and thus takes away the burden of manually designing a reward structure. We have shown in simulation that the proposed algorithm learns an options policy and the dependent flat policies that guide the agent to satisfy the task specification with a relatively low number of training steps. The temporal and state abstractions provided by options and STL respectively decompose a complicated task into a hierarchy of simpler subtasks, thus modularizing the learning process and increasing the learning efficiency. Moreover, the policies learned for the subtasks can be reused for learning a different high level task, which enables knowledge transfer. In future work we will look at applying the proposed algorithm to more realistic problems and extending it from discrete state and action spaces to continuous ones.


References

  • [1] G. Tesauro, “Temporal difference learning and TD-Gammon,” Communications of the ACM, vol. 38, no. 3, pp. 58–68, 1995.
  • [2] A. Ng, A. Coates, M. Diel, V. Ganapathi, J. Schulte, B. Tse, E. Berger, and E. Liang, “Autonomous inverted helicopter flight via reinforcement learning,” International Symposium on Experimental Robotics, 2004.
  • [3] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed.   The MIT Press, 2012.
  • [4] S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-End Training of Deep Visuomotor Policies,” arXiv, p. 6922, 2015. [Online]. Available: http://arxiv.org/abs/1504.00702
  • [5] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
  • [6] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. V. D. Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, and K. Kavukcuoglu, “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7585, pp. 484–489, 2016. [Online]. Available: http://dx.doi.org/10.1038/nature16961
  • [7] A. Y. Ng, D. Harada, and S. Russell, “Policy invariance under reward transformations: Theory and application to reward shaping,” Sixteenth International Conference on Machine Learning, vol. 3, pp. 278–287, 1999.
  • [8] A. Y. Ng, “Shaping and policy search in reinforcement learning,” Ph.D. dissertation, Computer Science, UC Berkeley, Berkeley, CA, 2003.
  • [9] A. Dutech, T. Edmunds, J. Kok, M. Lagoudakis, M. Littman, M. Riedmiller, B. Russell, B. Scherrer, R. Sutton, S. Timmer, N. Vlassis, A. White, and S. Whiteson, “Reinforcement Learning Benchmarks and Bake-offs II,” Workshop at 2005 NIPS Conference, pp. 1–50, 2005.
  • [10] J. Kober and J. Peters, “Imitation and Reinforcement Learning,” Robotics and Automation Magazine, vol. 17, no. 2, pp. 55–62, 2010.
  • [11] S. Mahadevan, “Average reward reinforcement learning: Foundations, algorithms, and empirical results,” Machine Learning, vol. 22, no. 1-3, pp. 159–195, 1996.
  • [12] A. Schwartz, “A Reinforcement Learning Method for Maximizing Undiscounted Rewards,” Proceedings of the Tenth International Conference on Machine Learning, pp. 298–305, 1993.
  • [13] P. Tadepalli and D. Ok, “H-learning: A reinforcement learning method for optimizing undiscounted average reward,” Corvallis, OR, USA, Tech. Rep., 1994.
  • [14] M. L. Puterman, “Markov Decision Processes: Discrete Stochastic Dynamic Programming,” p. 672, 1994.
  • [15] J. Fu and U. Topcu, “Probably approximately correct MDP learning and control with temporal logic constraints,” CoRR, 2014.
  • [16] D. Sadigh, E. Kim, S. Coogan, S. Sastry, and S. Seshia, “A learning based approach to control synthesis of markov decision processes for linear temporal logic specifications,” CoRR, 2014.
  • [17] R. S. Sutton, D. Precup, and S. Singh, “Between MDPs and Semi-MDPs: Learning, Planning, and Representing Knowledge at Multiple Temporal Scales,” Artificial Intelligence, vol. 1, no. 98-74, pp. 1–39, 1998.
  • [18] A. G. Barto, “Recent Advances in Hierarchical Reinforcement Learning,” Discrete Event Dynamic Systems: Theory and Application, vol. 13, pp. 41–77, 2003.
  • [19] C. J. C. H. Watkins, “Learning from delayed rewards,” Ph.D. dissertation, King’s College, Cambridge, England, 1989.
  • [20] R. S. Sutton, “Learning to predict by the methods of temporal differences,” in Machine Learning. Kluwer Academic Publishers, 1988, pp. 9–44.
  • [21] F. S. Melo, “Convergence of Q-learning: A simple proof,” Institute Of Systems and Robotics, Tech. Rep, pp. 1–4.
  • [22] A. Donzé and O. Maler, “Robust satisfaction of temporal logic over real-valued signals,” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 6246 LNCS, pp. 92–106, 2010.
  • [23] O. Maler and D. Nickovic, “Monitoring Temporal Properties of Continuous Signals,” Formal Techniques, Modelling and Analysis of Timed and Fault-Tolerant Systems, pp. 152–166, 2004.
  • [24] S. Sadraddini and C. Belta, “Robust Temporal Logic Model Predictive Control,” 53rd Annual Conference on Communication, Control, and Computing (Allerton), 2015.
  • [25] M. Herrmann, “RL 5: On-policy and off-policy algorithms,” Edinburgh, UK, 2015.
  • [26] T. G. Dietterich, “Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition,” Journal of Artificial Intelligence Research, vol. 13, pp. 227–303, 2000.