
Model-Based Task Transfer Learning

by Charlott Vallon, et al.

A model-based task transfer learning (MBTTL) method is presented. We consider a constrained nonlinear dynamical system and assume that a dataset of state and input pairs that solve a task T1 is available. Our objective is to find a feasible state-feedback policy for a second task, T2, by using stored data from T1. Our approach applies to tasks T2 which are composed of the same subtasks as T1, but in different order. In this paper we formally introduce the definition of a subtask and the MBTTL problem, and provide examples of MBTTL in the fields of autonomous cars and manipulators. Then, a computationally efficient approach to solve the MBTTL problem is presented, along with proofs of feasibility for constrained linear dynamical systems. Simulation results show the effectiveness of the proposed method.




I Introduction

Control design for systems repetitively performing a single task has been studied extensively. Such problems arise frequently in practical applications [1, 2] and examples include autonomous cars racing around a track [3, 4, 5], robotic system manipulators [6, 7, 8, 9], and batch processes in general [10]. Iterative learning controllers (ILCs) aim to autonomously improve the closed-loop performance of a system with each iteration of a repeated task. ILCs are initialized using a suboptimal policy, and at every subsequent task iteration the system uses data from previous iterations to improve the performance of the closed-loop system.

However, ILCs are not guaranteed to be feasible, let alone effective, on tasks even slightly varied from the original task. Typically, if the task changes, a new ILC must be trained from scratch. This can be problematic for two reasons: (i) it can be very difficult to design a suboptimal policy for complex tasks with which to initialize an ILC algorithm [11, 12, 13], and (ii) even if a suboptimal policy is found, the ILC's convergence to a locally optimal trajectory may be slow, depending on how suboptimal the initial policy is.

Task transfer learning refers to methods that allow controllers to make use of their experience solving a task to efficiently solve variations of that task. These approaches aim to reduce computation or convergence time, with respect to planning from scratch (PFS), when designing a reference trajectory for a task that is similar to previously seen tasks.

The authors in [14] propose running a desired PFS method in parallel with a retrieve and repair (RR) algorithm that adapts trajectories collected on previous tasks to the new task. In [15], environment features are used to divide a task and create a library of trajectories in relative state space frames. In [16], a locally optimal controller is designed from Differential Dynamic Programming (DDP) solutions from previous iterations using k-nearest neighbors. While these methods decrease planning time, they verify or interpolate saved trajectories at every time step, which can be inefficient.

The authors in [17] propose piecing together trajectories corresponding to discrete system dynamics only at states of dynamics transition, rather than at every time step. However, this method only applies to discontinuities in system dynamics, and does not generalize to other task variations.

Task transfer learning is also well-explored in the feature-based reinforcement learning (RL) literature, where methods attempt to transfer policies by considering shared features between tasks. The authors of [18] estimate the reward function of a new task by using a set of transition sample states from a previous task. However, this only applies to tasks that have different reward functions but are otherwise identical. In [19], a method to learn a model-free mapping relating shared features of two tasks is proposed, using probability trees to represent the probability that actions taken in a previous task will also be useful in the new task. Similar strategies are proposed in [20, 21]. The authors in [22] propose learning a feature-based value function approximation in the space of shared task features. The learned function then provides an initial guess for the new task's true value function, and can be used to initialize an RL method. However, these mappings must be learned and applied separately for each saved state, scaling poorly to long-horizon tasks. Furthermore, these methods offer no guarantees for safety in the new task.

In this paper, a Model-Based Task Transfer Learning (MBTTL) algorithm is presented. We consider a constrained nonlinear system and assume that a dataset of state and input pairs that solve a task T1 is available. The dataset can be stored explicitly (for example, from human demonstrations [23] or an ILC) or generated by rolling out a given policy (for example, a hand-tuned controller). Our objective is to find a feasible policy for a second task, T2, by using the stored data from T1. Specifically, MBTTL applies to tasks T2 composed of the same subtasks as T1, but in different sequences. In the first part of this paper we formally introduce the definition of a subtask and provide examples of MBTTL in the fields of autonomous cars and manipulators.

The contributions of this work are twofold. First, we present the MBTTL algorithm. MBTTL improves upon past work by reducing the complexity of adapting policies from T1 to the new task T2. Specifically, MBTTL breaks tasks into different modes of operation, called subtasks. The T1 policy is adapted to T2 only at points of subtask transition, by solving one-step reachability problems.

Second, we prove that when MBTTL is applied to linear systems with convex constraints, the output policies are feasible for T2 and reduce iteration cost compared to PFS. Next we start by formally defining the MBTTL problem.

II Problem Definition

II-A Tasks and Subtasks

We consider a system

x_{k+1} = f(x_k, u_k),   (1a)
x_k ∈ X, u_k ∈ U,   (1b)

where f is the dynamical model, and X and U are the state and input constraint sets, respectively. The vectors x_k and u_k collect the states and inputs at time step k.

A task T is an ordered sequence of M subtasks

T = {S_1, S_2, ..., S_M},   (2)

where the i-th subtask S_i is the four-tuple

S_i = (X_i, U_i, O_i, R_i).   (3)

X_i is the subtask workspace and U_i the subtask input space. O_i is the subtask obstacle space. R_i represents the set of transition states and inputs from the current subtask workspace into the subsequent one:

R_i = {(x, u) : x ∈ X_i, u ∈ U_i, f(x, u) ∈ X_{i+1}}.   (4)
A successful subtask execution E(S_i) of a subtask S_i is a trajectory of states and inputs evolving according to (1) without intersecting the subtask obstacle set. We define the j-th successful execution of subtask S_i as

E^j(S_i) = [x^j_0, ..., x^j_{T^j}; u^j_0, ..., u^j_{T^j}],   (5)

where the vectors u^j and x^j collect the inputs applied to the system (1a) and the resulting states, and x^j_t and u^j_t denote the system state and the control input at time t of subtask execution j. The final state of each successful subtask execution is in the subtask transition set. For the sake of notational simplicity, we have written all subtask executions as beginning at time step t = 0.
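As an illustration, the subtask four-tuple and execution definitions above can be sketched as lightweight data structures; the names (`Subtask`, `Execution`, `is_successful`) and the membership-test interface are assumptions of this sketch, not notation from the paper.

```python
from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class Subtask:
    """Four-tuple (3): workspace, input space, obstacle space, and
    transition set, each represented here by a membership test."""
    in_workspace: Callable[[np.ndarray], bool]
    in_input_space: Callable[[np.ndarray], bool]
    in_obstacle_space: Callable[[np.ndarray], bool]
    in_transition_set: Callable[[np.ndarray], bool]

@dataclass
class Execution:
    """One subtask execution: state and input trajectories."""
    states: np.ndarray   # shape (T + 1, n)
    inputs: np.ndarray   # shape (T, m)

def is_successful(sub: Subtask, ex: Execution) -> bool:
    """A successful execution stays in the workspace, avoids the obstacle
    set, respects the input space, and ends in the transition set."""
    states_ok = all(sub.in_workspace(x) and not sub.in_obstacle_space(x)
                    for x in ex.states)
    inputs_ok = all(sub.in_input_space(u) for u in ex.inputs)
    return states_ok and inputs_ok and sub.in_transition_set(ex.states[-1])
```

Representing the sets by membership tests keeps the sketch agnostic to how X_i, O_i, and R_i are actually parameterized (boxes, polytopes, or learned classifiers).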

We define the j-th successful task execution as a concatenation of successful subtask executions:

E^j(T) = [E^j(S_1), E^j(S_2), ..., E^j(S_M)],

where t^j_i denotes the duration of the first i subtasks during the j-th task iteration. When the state reaches a subtask transition set, the system has completed subtask S_i, and it transitions into the following subtask S_{i+1}. The task is completed when the system reaches the last subtask's transition set, considered to be the control invariant target set for the task.

After J successful executions of task T, we define the sampled safe state set SS and sampled safe input set US as:

SS = ∪_{j=1}^{J} {x^j_0, x^j_1, ..., x^j_{T^j}},   US = ∪_{j=1}^{J} {u^j_0, u^j_1, ..., u^j_{T^j}}.

SS contains all states visited by the system in all previous successful iterations of task T, and US the inputs applied at each of these states. Thus, for any state in SS there exists a feasible input sequence contained in US to complete the task while satisfying constraints.

We also define the sampled cost set

CS = ∪_{j=1}^{J} {q^j},

where q^j is the vector containing the costs associated with each state and input pair of the j-th task iteration, calculated according to the user-defined cost function h(x, u):

q^j_t = Σ_{k=t}^{T^j} h(x^j_k, u^j_k).

q^j_t is the realized cost-to-go from state x^j_t at time step t of the j-th task execution.
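The sampled safe sets and realized cost-to-go admit a direct data-driven construction. The sketch below (function names are illustrative) computes each trajectory's cost-to-go with a reverse cumulative sum and stacks the stored states, inputs, and costs.

```python
import numpy as np

def cost_to_go(states, inputs, stage_cost):
    """Realized cost-to-go: sum of stage costs from each time step onward."""
    stage = np.array([stage_cost(x, u) for x, u in zip(states, inputs)])
    # Reverse cumulative sum gives the cost-to-go from every time step.
    return stage[::-1].cumsum()[::-1]

def build_safe_sets(executions, stage_cost):
    """Stack stored (states, inputs) trajectories into the sampled safe
    state set, safe input set, and cost set."""
    SS, US, CS = [], [], []
    for states, inputs in executions:
        SS.append(states[:-1])     # states at which an input was applied
        US.append(inputs)
        CS.append(cost_to_go(states[:-1], inputs, stage_cost))
    return np.vstack(SS), np.vstack(US), np.concatenate(CS)
```

Keeping the three sets index-aligned (row i of SS, US, and entry i of CS refer to the same stored sample) is what later allows the policies below to look up the input and cost associated with any stored state.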

II-B Safe-Set Based Policies

The sets SS, US, and CS induce policies to control the system in a task execution.

II-B1 Interpolated Policies

For linear systems with convex constraints, we can define an interpolated policy. At a state x in the convex hull of SS, we first solve the LP

min_λ Σ_i λ_i q_i   (13a)
s.t. Σ_i λ_i x_i = x, Σ_i λ_i = 1, λ ≥ 0,   (13b)

where x_i ∈ SS and q_i ∈ CS are the stored states and their associated costs-to-go. The LP interpolates the realized cost-to-go over the safe set.

Let λ* be the optimal solution to (13). Then the policy applies the input

u = Σ_i λ*_i u_i,   (14)

where u_i ∈ US is the stored input applied at x_i. The interpolation-based policy (13)-(14) computes the control input as a weighted sum of stored inputs.
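A minimal numerical sketch of the interpolated policy, assuming `scipy` is available; the decision variable is the multiplier vector λ over stored states, and the function name is an assumption of this sketch.

```python
# SS: (N, n) stored states, US: (N, m) stored inputs, q: (N,) costs-to-go.
import numpy as np
from scipy.optimize import linprog

def interpolated_policy(x, SS, US, q):
    """Solve the LP (13): min q^T lam s.t. SS^T lam = x, 1^T lam = 1,
    lam >= 0, then return the weighted sum of stored inputs (14)."""
    N = SS.shape[0]
    A_eq = np.vstack([SS.T, np.ones((1, N))])
    b_eq = np.concatenate([x, [1.0]])
    res = linprog(q, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    if not res.success:
        raise ValueError("x is outside the convex hull of the safe set")
    lam = res.x
    return US.T @ lam   # control input as weighted sum of stored inputs
```

Note the policy is only defined inside the convex hull of the stored states; outside it the LP is infeasible, which is exactly the condition Theorem 1 requires.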

II-B2 MPC-based Policies

For nonlinear systems, we can define an MPC-based policy as

min_u h(x_k, u) + q(x⁺)   (15a)
s.t. x⁺ = f(x_k, u), x⁺ ∈ SS, u ∈ U,   (15b)

where h is a chosen stage cost and q(x⁺) the stored cost-to-go of the safe state x⁺. Problem (15) searches for an input that controls the current state to the state in the safe state set with the lowest cost-to-go. The policy prediction horizon can be extended as necessary.
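For intuition, a crude stand-in for the MPC-based policy (15) can enumerate a sampled input grid instead of calling a nonlinear solver; the function name, the input grid, and the tolerance-based safe-set membership test are all assumptions of this sketch.

```python
import numpy as np

def mpc_policy(x, f, SS, q, input_grid, stage_cost, tol=1e-3):
    """Pick the input from a sampled grid whose successor state
    (approximately) lands on a stored safe state, minimizing the sum of
    stage cost and stored cost-to-go."""
    best_u, best_cost = None, np.inf
    for u in input_grid:
        x_next = f(x, u)
        d = np.linalg.norm(SS - x_next, axis=1)   # distance to safe states
        reachable = d < tol                        # states we (nearly) hit
        if not reachable.any():
            continue
        cost = stage_cost(x, u) + q[reachable].min()
        if cost < best_cost:
            best_u, best_cost = u, cost
    return best_u
```

A real implementation would solve (15) with a nonlinear program per time step; the grid search above only conveys the structure of the optimization.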

Next we provide two examples for how to formulate tasks using the introduced task-subtask notation.

II-C Task Formulation Example 1: Autonomous Racing

Consider an autonomous racing task, in which a vehicle is controlled to minimize lap time driving around a race track with piecewise constant curvature (Fig. 1). We model this task as a series of ten subtasks, where the i-th subtask corresponds to a section of the track with constant radius of curvature. Tasks with different subtask order are tracks consisting of the same road segments arranged in a different order.

Fig. 1: Each subtask of the racing task corresponds to a segment of the track with constant curvature. The vehicle state tracks the distance traveled along the centerline.

The vehicle system is modeled in the curvilinear abscissa reference frame [24], with state and input at time step k

x_k = [v_x, v_y, ω_z, s, e_ψ, e_y]ᵀ,   u_k = [a, δ]ᵀ,

where v_x, v_y, and ω_z are the vehicle's longitudinal velocity, lateral velocity, and yaw rate, respectively, at time step k; s is the distance travelled along the centerline of the road; and e_ψ and e_y are the heading angle and lateral distance error between the vehicle and the path. The inputs are the longitudinal acceleration a and steering angle δ.

Accordingly, the system state and input spaces are


The system dynamics (1a) are described using an Euler-discretized dynamic bicycle model, where dt is the discretization step, and I_z and m are the moment of inertia and mass of the vehicle, respectively. l_f and l_r are the distances from the center of gravity to the front and rear axles. F_f and F_r are the Pacejka functions for the front and rear tire forces, respectively. For more detail, we refer to [5].
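A hedged sketch of such an Euler-discretized bicycle model with simplified Pacejka tire forces; all numerical parameters are illustrative placeholders, not the values used in the paper, and the track curvature term is omitted (straight segment).

```python
import numpy as np

def pacejka(alpha, B=7.0, C=1.3, D=1.0):
    """Simplified Pacejka magic-formula lateral tire force."""
    return D * np.sin(C * np.arctan(B * alpha))

def bicycle_step(x, u, dt=0.1, m=1.98, Iz=0.03, lf=0.125, lr=0.125):
    """One Euler step of a dynamic bicycle model in the curvilinear frame."""
    vx, vy, wz, s, e_psi, e_y = x   # state
    a, delta = u                    # acceleration, steering
    # Slip angles at the front and rear tires.
    alpha_f = delta - np.arctan2(vy + lf * wz, vx)
    alpha_r = -np.arctan2(vy - lr * wz, vx)
    Fyf, Fyr = pacejka(alpha_f), pacejka(alpha_r)
    # Euler-discretized dynamics (curvature term omitted for brevity).
    vx_n = vx + dt * (a - (Fyf * np.sin(delta)) / m + wz * vy)
    vy_n = vy + dt * ((Fyf * np.cos(delta) + Fyr) / m - wz * vx)
    wz_n = wz + dt * (lf * Fyf * np.cos(delta) - lr * Fyr) / Iz
    s_n = s + dt * (vx * np.cos(e_psi) - vy * np.sin(e_psi))
    e_psi_n = e_psi + dt * wz
    e_y_n = e_y + dt * (vx * np.sin(e_psi) + vy * np.cos(e_psi))
    return np.array([vx_n, vy_n, wz_n, s_n, e_psi_n, e_y_n])
```

The small mass and inertia values echo a scale-car setup; a full model would also include the centerline curvature in the s and e_ψ updates, as in [5].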

We formulate each subtask according to (3), with:

II-C1 Subtask Workspace


where s_i and s_{i+1} mark the distances along the centerline to the start and end of the curve, and the lateral bound is the lane width. L is the total length of the track. These bounds indicate that the vehicle can only drive forwards on the track, up to a maximum velocity, and must stay within the lane.

II-C2 Subtask Input Space


where [a_min, a_max] are the acceleration input bounds, and [δ_min, δ_max] the steering input bounds. The input limits are determined by the vehicle, and do not change between subtasks.

II-C3 Subtask Obstacle Space

In the absence of other vehicles or roadblocks on the track, the subtask obstacle space in this example is empty: O_i = ∅.

II-C4 Subtask Transition Set

Lastly, we define the subtask transition set R_i to be the set of states along the subtask border where the track's radius of curvature changes.

II-D Task Formulation Example 2: Robotic Path Planning

Fig. 2: Top view of the robotic path planning task. Each subtask corresponds to an obstacle in the environment with constant height.

Consider a task in which a robotic arm needs to move an object to a target without colliding with obstacles (Fig. 2). The obstacles are modeled as extruded disks of varying heights . Here, each subtask corresponds to the workspace above an obstacle. For this example, different subtask orderings correspond to a rearranging of the obstacle locations.

The robotic manipulator is modeled as a six-joint robotic arm, with states and inputs


where θ_j and θ̇_j are the angle and angular velocity of the j-th joint, respectively, at time step k. The inputs are the torques applied at each of the six joints.

The system state and input spaces are


The continuous-time system dynamics are given by:

M(θ)θ̈ + C(θ, θ̇)θ̇ + g(θ) = τ,   (25)

where M(θ) is the mass inertia matrix, C(θ, θ̇) the matrix of Coriolis and centrifugal forces, and g(θ) the vector of gravity terms. We refer to [13] for details and the discretized form of (25).

We formulate each subtask according to (3), with:

II-D1 Subtask Workspace


where the two bounding angles mark the cumulative angle to the beginning and end of the i-th obstacle, as in Fig. 2.

II-D2 Subtask Input Space


where τ_min^j and τ_max^j are the minimum and maximum allowed torques for the j-th joint.

II-D3 Subtask Obstacle Space

The subtask obstacle space is the set of states in which the robot end-effector collides with the subtask obstacle:

where h_i is the height of the obstacle defining the i-th subtask, and the kinematic mapping from a state to the z-position of the robot end-effector is given in [25].

II-D4 Subtask Transition Set

We define the subtask transition set R_i to be the set of states along the subtask border where the next obstacle begins.

III Model-Based Task Transfer Learning

In this section we describe the intuition behind MBTTL and provide an algorithm for the method. We prove feasibility and iteration cost reduction of policies output by MBTTL for linear systems with convex constraints. We conclude this section by analyzing MBTTL from two other perspectives: hybrid systems and feature-based aggregation.

III-A MBTTL

Let Task 1 and Task 2 be different orderings of the same M subtasks:

T_1 = {S_{i_1}, S_{i_2}, ..., S_{i_M}},   T_2 = {S_{j_1}, S_{j_2}, ..., S_{j_M}},   (30)

where the sequence (j_1, ..., j_M) is a reordering of the sequence (i_1, ..., i_M). Assume non-empty sampled safe sets SS_1, US_1, and CS_1 from executions of Task 1.

The goal of MBTTL is to use the state trajectories stored in the Task 1 sampled safe sets in order to find feasible trajectories for Task 2, ending in the Task 2 target set. The key intuition of the method is that all successful subtask executions from Task 1 are also successful subtask executions for Task 2, as this definition only depends on properties (5) of the subtask itself, not the subtask sequence.

Based on the above intuition, the algorithm proceeds backwards through the new subtask sequence. Consider the final subtask S_{j_M}. We know that all states from S_{j_M} stored in our Task 1 executions are controllable to the target set using the stored policies (Alg. 2, Lines 4-5). We then look for stored states from the preceding subtask S_{j_{M-1}} that are controllable to these verified states. Only reachability from the sampled guard set will be important in our approach.

Define the sampled guard set of subtask S_i as the set of stored transition pairs: the sampled guard set for subtask S_i contains the states in SS_1, and associated inputs, from which the system transitioned into another subtask during the previous task executions.

We search for the set of points in the sampled guard set that are controllable to stored states in the following subtask (Alg. 2, Lines 9-14). This reachability problem can be solved using a variety of numerical approaches.

Then, for each guard point for which the reachability problem is infeasible, we remove the associated backward reachable states stored in SS_1 as candidate controllable states for Task 2 (Alg. 2, Lines 15-17). All remaining stored states from the subtask are controllable to the target set.

Alg. 2 iterates backwards through the remaining subtasks, stopping early if no states in a subtask's sampled guard set can be shown to be controllable to the verified safe set. The algorithm returns sampled safe sets for Task 2 that have been verified through reachability to contain feasible executions of Task 2. Fig. 3 depicts this process across three subtasks with sample data from the autonomous racing task detailed in Sec. IV.

Fig. 3: Alg. 2 checks reachability from states in the sampled guard set (in green) to the convex hull of safe trajectories through the next subtask (plotted in light red). If the reachability fails for a point in the sampled guard set, the backwards reachable states are removed from the safe set (shown in grey). The track centerline is plotted in dashed yellow.
Algorithm 1 MBTTL Method (overview)
1: input: set of state, input, and cost trajectories, segmented by subtask
2: initialize StateSet, InputSet, CostSet
3: for each subtask, iterating backwards through the Task 2 sequence do
4:     solve reachability from SampledGuard to executions in StateSet
5:     for infeasible points in SampledGuard do
6:         remove associated trajectories from StateSet and InputSet
7:     update StateSet, InputSet, CostSet

In this paper, we implement the search for controllable points by solving a one-step reachability problem:

min_{u, λ} h(x_g, u) + Σ_i λ_i q_i   (32a)
s.t. f(x_g, u) = Σ_i λ_i z_i, Σ_i λ_i = 1, λ ≥ 0, u ∈ U,   (32b)

where h is a user-defined stage cost, x_g the sampled guard state, z a state trajectory (with states z_i) through the next Task 2 subtask, and q the cost vector associated with the trajectory. Problem (32) aims to find an input that connects the sampled guard state to a state in the convex hull of a previously verified state trajectory through the next subtask. We note that solving the reachability analysis to the convex hull is a method for reducing the computational complexity of MBTTL and is exact only for linear systems with convex constraints.

MBTTL improves on the computational complexity of the surveyed transfer learning methods in two key ways: (i) by verifying the stored trajectories only at states in the sampled guard set, rather than at each recorded time step, and (ii) by solving a data-driven, one-step reachability problem to adapt the trajectories, rather than a multi-step or set-based reachability method.
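For a linear system x_{k+1} = A x_k + B u_k, the one-step reachability problem (32) reduces to a single LP. The sketch below (assuming `scipy`; the function name is illustrative) stacks the input and the convex multipliers as one decision vector.

```python
import numpy as np
from scipy.optimize import linprog

def one_step_reach(x_g, A, B, Z, q, u_min, u_max):
    """Find u and multipliers lam so that A x_g + B u lies in the convex
    hull of the stored trajectory Z (rows are states), minimizing the
    interpolated cost-to-go q^T lam. Decision variables: [u; lam]."""
    n, m, N = A.shape[0], B.shape[1], Z.shape[0]
    c = np.concatenate([np.zeros(m), q])
    # Equality constraints: B u - Z^T lam = -A x_g  and  1^T lam = 1.
    A_eq = np.block([[B, -Z.T], [np.zeros((1, m)), np.ones((1, N))]])
    b_eq = np.concatenate([-A @ x_g, [1.0]])
    bounds = [(lo, hi) for lo, hi in zip(u_min, u_max)] + [(0, None)] * N
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return (res.x[:m], res.fun) if res.success else (None, np.inf)
```

An infeasible LP corresponds exactly to a guard point that fails the reachability check and triggers trajectory removal in Alg. 2; for nonlinear f, the same structure would require a nonlinear program instead.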

Algorithm 2 MBTTL algorithm
1: input: SS_1, US_1, CS_1, Task 2 subtask sequence
2: initialize empty SS_2, US_2, CS_2
3: for each Task 2 subtask, iterating backwards do
4:     for each point in the subtask's sampled guard set do
5:         solve the one-step reachability problem (32) to the verified trajectories of the following subtask
6:     if the set of controllable guard points is not empty then
7:         add the associated trajectories, inputs, and costs to SS_2, US_2, CS_2
8:     else
9:         break
10: Return SS_2, US_2, CS_2
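The backward pass of Alg. 2 can be summarized in a few lines of code; the `reach` callable stands in for the one-step reachability check (32), and the container layout (a dict of trajectory lists keyed by subtask) is an assumption of this sketch.

```python
def mbttl(subtask_order, safe_sets, reach):
    """Backward pass over Task 2's subtask sequence: keep only stored
    trajectories whose final (guard) state is controllable to the already
    verified safe set of the following subtask."""
    # The last subtask's executions end in the target set and need no check.
    verified = {subtask_order[-1]: list(safe_sets[subtask_order[-1]])}
    pairs = zip(reversed(subtask_order[:-1]), reversed(subtask_order[1:]))
    for prev, nxt in pairs:
        kept = [traj for traj in safe_sets[prev]
                if reach(traj[-1], verified[nxt])]   # one-step reachability
        if not kept:
            break          # no controllable guard states: stop early
        verified[prev] = kept
    return verified
```

Because `reach` is only called at guard states, the number of reachability problems scales with the number of stored transitions, not with trajectory length.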

III-B Properties of MBTTL-derived Policies

Assumption 1

The dynamics of the system (1a) are linear, and the system's state and input constraint sets (1b), X and U, are convex.

Assumption 2

Task 1 and Task 2 are as in (30), where the subtask workspaces and input spaces are given by the subtask definitions (3).

Theorem 1

(Feasibility) Let Assumptions 1-2 hold. Assume non-empty sets SS_1, US_1, CS_1 containing trajectories for Task 1. Assume Alg. 2 outputs non-empty sets SS_2, US_2, CS_2 for Task 2. Then, if the initial state lies in the convex hull of SS_2, the policy defined in (14) produces a feasible execution for Task 2.

By non-emptiness of SS_2 and US_2 (Alg. 2, Line 15), it follows that the convex hull of SS_2 is non-empty. For any state x in this convex hull, the interpolated policy applies the input

u = Σ_i λ*_i u_i,

where λ* satisfies

Σ_i λ*_i x_i = x,   Σ_i λ*_i = 1,   λ* ≥ 0,

with x_i ∈ SS_2 and u_i the corresponding stored inputs in US_2. It follows from convexity of X and U that the policy is feasible at every state in the convex hull of SS_2.

The proof that the linear system (1a) in closed-loop with the interpolated policy converges to the target set follows from [26].

The above Theorem 1 implies that the safe sets output by the MBTTL algorithm induce an interpolating policy for linear systems that can be used to successfully complete Task 2 while satisfying all input and state constraints. The safe sets can also be directly used to initialize an ILC for Task 2 [5].

Assumption 3

Consider Task 1 and Task 2 as defined in (30). The trajectories stored in SS_1 and US_1 correspond to executions of Task 1 by the linear system in closed-loop with an ILC. At each iteration j, the ILC executes a feedback policy π_j. The ILC is initialized with a policy π_0 that is feasible for both Task 1 and Task 2.

An example of a control policy feasible for two different orderings of the autonomous racing task is a low-speed centerline-following controller. For the robotic manipulation task, a policy which always controls the end-effector above the tallest obstacle is feasible for any subtask order.

Theorem 2

(Cost Improvement) Let Assumptions 1-3 hold. Then, Alg. 2 will return non-empty sets SS_2, US_2 for Task 2. Furthermore, if the initial state lies in the convex hull of SS_2, the interpolated MBTTL policy defined in (14) will incur no higher iteration cost than π_0 during an execution of Task 2.

Define the vectors x^0 and u^0 to be the stored state and input trajectory associated with the implemented policy π_0.

Since π_0 is also feasible for Task 2, when Alg. 2 is applied, the entire task execution can be stored as a successful execution for Task 2 without adapting the policy. It follows that x^0 ⊆ SS_2 and u^0 ⊆ US_2, and the returned sampled safe sets are non-empty.

Next, note that the interpolated policy (13) can always select multipliers λ that reproduce the stored π_0 trajectory exactly, in which case it incurs the same cost as π_0. Since (13) minimizes the interpolated cost-to-go over all feasible multipliers, it follows that the cost incurred by a Task 2 execution with the MBTTL policy is no higher than an execution with π_0.

III-C A Hybrid Systems Perspective

The MBTTL algorithm performs backwards reachability between points in different subtasks. If each subtask is viewed as a different mode of operation, the algorithm can be analyzed from a hybrid systems reachability perspective.

Hybrid systems refer to a class of dynamical systems that switch among several discrete operating modes, with each mode governed by its own dynamics [27]. This includes systems such as automobile powertrains, analog alarm clocks, and walking robots.

Hybrid systems reachability considers whether a feasible trajectory exists between a set of initial states and a set of goal states in a potentially different mode. The extensive literature on hybrid systems reachability mainly focuses on two approaches: set-based methods and simulation [28]. Set-based methods are exhaustive methods that use reach set computation to verify feasibility of entire sets of initial conditions and bounded inputs, and many algorithms have recently been proposed [29, 30, 31, 32]. While effective, set-based methods suffer from the "curse of dimensionality" and do not scale well with state dimension. In order to handle complex systems, these methods often approximate sets as polyhedral or ellipsoidal, which can affect solution accuracy. The authors of [33] propose splitting the system state into independent substates to combat the curse of dimensionality, but this is not guaranteed to work for complex systems.

Sampling-based simulation methods check the feasibility of a trajectory beginning from an initial condition under sampled input sequences. These methods are less limited by state dimension, but they are not exhaustive and can miss subtle phenomena that a particular model may generate [34, 35, 36, 37].

In contrast, MBTTL only solves reachability problems between discrete points in the sampled guard set. This ensures the algorithm scales well with state dimension and number of subtasks without requiring drastic approximations. Additionally, even when set-based methods are computationally feasible for a particular system, in contrast to the MBTTL algorithm they only provide sets of reachable states, rather than a complete policy. An additional problem with traditional sampling-based hybrid systems reachability methods is that sampled trajectories are propagated with random inputs without any check on whether the trajectory is promising, or will inevitably lead to eventual infeasibility. MBTTL explicitly checks only for controllability to feasible points. Lastly, MBTTL views the transitions between subtasks as particular to the task instance, rather than permanent. The authors are aware of no previously published work in which the transitions of a hybrid system change.

III-D A Feature-Based Aggregation Perspective

Dynamic Programming (DP) methods provide exact solutions to constrained optimal control problems. However, DP can incur tremendous cost and is therefore not implementable for high-dimensional systems. Recent work [38] proposes forming aggregate (or representative) features out of system states in order to reduce the problem dimension. These reduced-dimension problems can provide approximate solutions to the original task.

In spatio-temporal aggregation, coarse space and time states are chosen as aggregate features. Space-time barriers serve as transition sets between these aggregate features, and the shortest path problem is solved only between points immediately adjacent to the barriers. This is analogous to MBTTL performing reachability only at points in the sampled guard sets. While, unlike MBTTL, spatio-temporal aggregation does not explicitly consider a notion of reordering, it provides an additional perspective on the utility of task segmentation for computationally effective policy instantiation.

IV Simulation Results

Fig. 4: MBTTL-initialized ILC controllers converge to locally optimal trajectories faster than PID-initialized ones for the three different Task 2 variants shown here. Total Iteration Cost corresponds to seconds required to traverse the track. Because the reachability is performed to the convex hull of a trajectory, the green MBTTL states appear disconnected when plotted.

We demonstrate the utility of MBTTL on the autonomous racing task introduced in Sec. II-C, taking Task 1 to be the track in Fig. 1. An ILC using Learning Model Predictive Control (LMPC) is used to complete executions of Task 1, with the vehicle beginning each task iteration at standstill on the centerline at the start of the track. These executions and their costs are stored in SS_1, US_1, and CS_1. An initial policy for the ILC is provided by a centerline-tracking, low-velocity PID controller. For more details on the LMPC implementation, we refer to [5].

MBTTL is then used to design initial policies for reconfigured tracks from these sampled safe sets. Figure 4 compares the candidate initial policies designed by MBTTL and PID for three different tracks composed of the same track segments as Task 1. The top figures compare the trajectories resulting from the two initialization methods. While the PID-initialized policy tracks the centerline, the MBTTL policy makes use of the previous ILC's experience solving Task 1 to traverse the new track more efficiently, for example by traveling along the insides of curves. As shown in the bottom row of figures, this results in an improvement in the time required to traverse the tracks when compared with the conservative PID initialization.

(a) After one MBTTL application.
(b) After two MBTTL applications.
(c) After three MBTTL applications.
Fig. 5: The utility of the MBTTL algorithm increases with each continued application.

MBTTL can also be applied repeatedly if the subtask sequence changes multiple times. If an MBTTL-initialized ILC completes iterations of a related Task 2 ("one MBTTL application"), and MBTTL is then applied again to design an initial policy for another related task ("two MBTTL applications"), the algorithm draws on subtask executions collected over two different tasks in order to build safe sets for the newest task. This increases the variability of trajectories contained in the subtask safe sets, which means more subtask sequences may become viable task executions, leading to better initial policies. Figure 5 compares the cost (lap time) incurred by executions of a PID-initialized ILC with the cost incurred by executions of three different levels of MBTTL-initialized ILCs. For the example shown, after three applications the MBTTL-initialized ILC is faster than the PID-initialized ILC. The MBTTL-initialized ILC also converges to a local optimum more quickly, with the PID-initialized controller being two iterations slower than the MBTTL controller (see Fig. 4(c)).

MBTTL was run offline here. For real-time feasibility, the set computation time must be shown to be sufficiently low that an ILC could transition seamlessly between tasks. This remains to be explored in future work, along with application of the method to robotic manipulation tasks.

Note: Since the curvilinear abscissa state s is a cumulative state that integrates distance traveled along the centerline, its value depends on the order of subtasks, and stored trajectories must first be preprocessed.

V Conclusion

A model-based task transfer learning method is presented. The MBTTL algorithm uses stored state and input trajectories from executions of a particular task to design safe policies for executing variations of that task. The method breaks each task into subtasks and performs reachability analysis at sampled safe states between subtasks. MBTTL improves upon other task transfer learning methods by only verifying and adapting the previous policy at points of subtask transition, rather than along the entire trajectory.

We test the proposed algorithm on an autonomous racing task. Our simulation results confirm that MBTTL allows an ILC to converge to an optimal lap trajectory faster than planning from scratch. Future work is needed to validate the real-time feasibility of the method in experimental setups.


  • [1] D. A. Bristow, M. Tharayil, and A. G. Alleyne, “A survey of iterative learning control,” IEEE Control Systems, vol. 26, no. 3, pp. 96–114, 2006.
  • [2] Y. Wang, F. Gao, and F. J. D. III, “Survey on iterative learning control, repetitive control, and run-to-run control,” Journal of Process Control, vol. 19, no. 10, pp. 1589 – 1600, 2009.
  • [3] K. Kritayakirana and C. Gerdes, “Using the centre of percussion to design a steering controller for an autonomous race car,” Vehicle System Dynamics, vol. 15, pp. 33–51, 2012.
  • [4] J. Carrau, A. Liniger, X. Zhang, and J. Lygeros, “Efficient Implementation of Randomized MPC for Miniature Race Cars,” in European Control Conference, Jun 2016, pp. 957–962.
  • [5] U. Rosolia, A. Carvalho, and F. Borrelli, “Autonomous racing using learning model predictive control,” in American Control Conference (ACC), 2017.   IEEE, 2017.
  • [6] R. Horowitz, “Learning control of robot manipulators,” Transactions-American Society of Mechanical Engineers Journal of Dynamic Systems Measurement and Control, vol. 115, pp. 402–402, 1993.
  • [7] S. Arimoto, M. Sekimoto, and S. Kawamura, “Task-space iterative learning for redundant robotic systems: Existence of a task-space control and convergence of learning,” SICE Journal of Control, Measurement, and System Integration, vol. 1, no. 4, pp. 312–319, 2008.
  • [8] J. Van Den Berg, S. Miller, D. Duckworth, H. Hu, A. Wan, X.-Y. Fu, K. Goldberg, and P. Abbeel, “Superhuman performance of surgical tasks by robots using iterative learning from human-guided demonstrations,” in 2010 IEEE International Conference on Robotics and Automation.   IEEE, 2010, pp. 2074–2081.
  • [9] Y.-C. Wang, C.-J. Chien, and C.-N. Chuang, “Adaptive iterative learning control of robotic systems using backstepping design,” Transactions of the Canadian Society for Mechanical Engineering, vol. 37, no. 3, pp. 591–601, 2013.
  • [10] J. H. Lee and K. S. Lee, “Iterative learning control applied to batch processes: An overview,” Control Engineering Practice, vol. 15, no. 10, pp. 1306–1318, 2007.
  • [11] K. Wei and B. Ren, “A method on dynamic path planning for robotic manipulator autonomous obstacle avoidance based on an improved rrt algorithm,” Sensors, vol. 18, no. 2, p. 571, 2018.
  • [12] E. Gilbert and D. Johnson, “Distance functions and their application to robot path planning in the presence of obstacles,” IEEE Journal on Robotics and Automation, vol. 1, no. 1, pp. 21–30, March 1985.
  • [13] M. W. Spong and M. Vidyasagar, Robot dynamics and control.   John Wiley & Sons, 2008.
  • [14] D. Berenson, P. Abbeel, and K. Y. Goldberg, “A robot path planning framework that learns from experience,” 2012 IEEE International Conference on Robotics and Automation, pp. 3671–3678, 2012.
  • [15] M. Stolle, “Finding and transferring policies using stored behaviors,” Ph.D. dissertation, Carnegie Mellon University, 2008.
  • [16] Y. Tassa, T. Erez, and W. D. Smart, “Receding horizon differential dynamic programming,” in Advances in Neural Information Processing Systems 20, J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, Eds.   Curran Associates, Inc., 2008, pp. 1465–1472. [Online]. Available:
  • [17] C. G. Atkeson and J. Morimoto, “Nonparametric representation of policies and value functions: A trajectory-based approach,” in Advances in Neural Information Processing Systems 15, S. Becker, S. Thrun, and K. Obermayer, Eds.   MIT Press, 2003, pp. 1643–1650.
  • [18] R. Laroche and M. Barlier, “Transfer reinforcement learning with shared dynamics,” in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  • [19] T. Croonenborghs, K. Driessens, and M. Bruynooghe, “Learning a transfer function for reinforcement learning problems,” 2008.
  • [20] T. G. Karimpanal and R. Bouffanais, “Self-organizing maps for storage and transfer of knowledge in reinforcement learning,” CoRR, vol. abs/1811.08318, 2018.
  • [21] A. Tirinzoni, A. Sessa, M. Pirotta, and M. Restelli, “Importance weighted transfer of samples in reinforcement learning,” CoRR, vol. abs/1805.10886, 2018.
  • [22] G. Konidaris, I. Scheidwasser, and A. Barto, “Transfer in reinforcement learning via shared features,” Journal of Machine Learning Research, vol. 13, no. May, pp. 1333–1371, 2012.
  • [23] A. Coates, P. Abbeel, and A. Y. Ng, “Learning for control from multiple demonstrations,” in Proceedings of the 25th International Conference on Machine Learning, ser. ICML ’08.   New York, NY, USA: ACM, 2008, pp. 144–151.
  • [24] R. Rajamani, Vehicle dynamics and control.   Springer Science & Business Media, 2011.
  • [25] P. Kebria, S. Al-Wais, H. Abdi, and S. Nahavandi, “Kinematic and dynamic modelling of UR5 manipulator,” Oct. 2016, pp. 4229–4234.
  • [26] U. Rosolia, X. Zhang, and F. Borrelli, “Simple policy evaluation for data-rich iterative tasks,” CoRR, vol. abs/1810.06764, 2018.
  • [27] F. Borrelli, A. Bemporad, and M. Morari, Predictive Control for linear and hybrid systems.   Cambridge University Press, 2017.
  • [28] S. Schupp, E. Ábrahám, X. Chen, I. B. Makhlouf, G. Frehse, S. Sankaranarayanan, and S. Kowalewski, “Current challenges in the verification of hybrid systems,” in International Workshop on Design, Modeling, and Evaluation of Cyber Physical Systems.   Springer, 2015, pp. 8–24.
  • [29] S. Kong, S. Gao, W. Chen, and E. Clarke, “dReach: δ-reachability analysis for hybrid systems,” in International Conference on Tools and Algorithms for the Construction and Analysis of Systems.   Springer, 2015, pp. 200–205.
  • [30] S. Ratschan and Z. She, “Safety verification of hybrid systems by constraint propagation-based abstraction refinement,” ACM Transactions on Embedded Computing Systems (TECS), vol. 6, no. 1, p. 8, 2007.
  • [31] K. Scheibler, S. Kupferschmid, and B. Becker, “Recent improvements in the SMT solver iSAT.”
  • [32] I. M. Mitchell and Y. Susuki, “Level set methods for computing reachable sets of hybrid systems with differential algebraic equation dynamics,” Apr. 2008.
  • [33] S. Bansal, M. Chen, S. Herbert, and C. J. Tomlin, “Hamilton-jacobi reachability: A brief overview and recent advances,” in 2017 IEEE 56th Annual Conference on Decision and Control (CDC).   IEEE, 2017, pp. 2242–2253.
  • [34] W. Taha, A. Duracz, Y. Zeng, K. Atkinson, F. Bartha, P. Brauner, J. Duracz, F. Xu, R. Cartwright, M. Konečný, E. Moggi, J. Masood, P. Andreasson, J. Inoue, A. Sant’Anna, R. Philippsen, A. Chapoutot, M. O’Malley, A. Ames, and C. Grante, “Acumen: An open-source testbed for cyber-physical systems research,” Oct. 2015.
  • [35] P. S. Duggirala, S. Mitra, M. Viswanathan, and M. Potok, “C2e2: A verification tool for stateflow models,” in International Conference on Tools and Algorithms for the Construction and Analysis of Systems.   Springer, 2015, pp. 68–82.
  • [36] S. Bak and P. S. Duggirala, “Hylaa: A tool for computing simulation-equivalent reachability for linear systems,” in Proceedings of the 20th International Conference on Hybrid Systems: Computation and Control.   ACM, 2017, pp. 173–178.
  • [37] L. Liebenwein, C. Baykal, I. Gilitschenski, S. Karaman, and D. Rus, “Sampling-based approximation algorithms for reachability analysis with provable guarantees,” 2018.
  • [38] D. P. Bertsekas, “Feature-based aggregation and deep reinforcement learning: A survey and some new implementations,” arXiv preprint arXiv:1804.04577, 2018.