I Introduction
Control design for systems repetitively performing a single task has been studied extensively. Such problems arise frequently in practical applications [1, 2]; examples include autonomous cars racing around a track [3, 4, 5], robotic manipulators [6, 7, 8, 9], and batch processes in general [10]. Iterative learning controllers (ILCs) aim to autonomously improve the closed-loop performance of a system with each iteration of a repeated task. ILCs are initialized using a suboptimal policy, and at every subsequent task iteration the system uses data from previous iterations to improve the performance of the closed-loop system.
However, ILCs are not guaranteed to be feasible, let alone effective, on tasks even slightly varied from the original task. Typically, if the task changes, a new ILC must be trained from scratch. This can be problematic for two reasons: (i) it can be very difficult to design a suboptimal policy with which to initialize an ILC algorithm for complex tasks [11, 12, 13], and (ii) even if a suboptimal policy is found, the ILC's convergence to a locally optimal trajectory may be slow, depending on how suboptimal the initial policy is.
Task transfer learning refers to methods that allow controllers to make use of their experience solving a task to efficiently solve variations of that task. These approaches aim to reduce computation or convergence time, with respect to planning from scratch (PFS), when designing a reference trajectory for a task that is similar to previously seen tasks.
The authors in [14] propose running a desired PFS method in parallel with a retrieve and repair (RR) algorithm that adapts trajectories collected on previous tasks to the new task. In [15], environment features are used to divide a task and create a library of trajectories in relative state space frames. In [16], a locally optimal controller is designed from Differential Dynamic Programming (DDP) solutions from previous iterations using k-nearest neighbors. While these methods decrease planning time, they verify or interpolate saved trajectories at every time step, which can be inefficient.
The authors in [17] propose piecing together trajectories corresponding to discrete system dynamics only at states of dynamics transition, rather than at every time step. However, this method only applies to discontinuities in system dynamics, and does not generalize to other task variations.
Task transfer learning is also well-explored in the feature-based reinforcement learning (RL) literature, where methods attempt to transfer policies by considering shared features between tasks. The authors of [18] estimate the reward function of a new task by using a set of transition sample states from a previous task. However, this only applies to tasks that have different reward functions but are otherwise identical. In [19], a method to learn a model-free mapping relating shared features of two tasks is proposed, using probability trees to represent the probability that actions taken in a previous task will also be useful in the new task. Similar strategies are proposed in [20, 21]. The authors in [22] propose learning a feature-based value function approximation in the space of shared task features. The learned function then provides an initial guess for the new task's true value function, and can be used to initialize an RL method. However, these mappings must be learned and applied separately for each saved state, scaling poorly to long-horizon tasks. Furthermore, these methods offer no guarantees of safety in the new task.

In this paper, a Model-Based Task Transfer Learning (MBTTL) algorithm is presented. We consider a constrained nonlinear system and assume that a dataset of state and input pairs that solve a task is available. The dataset can be stored explicitly (for example, from human demonstrations [23] or an ILC) or generated by rolling out a given policy (for example, a hand-tuned controller). Our objective is to find a feasible policy for a second task, Task 2, by using stored data from the first task, Task 1. Specifically, MBTTL applies to tasks composed of the same subtasks as Task 1, but in different sequences. In the first part of this paper we formally introduce the definition of a subtask and provide examples of MBTTL in the fields of autonomous cars and manipulators.
The contributions of this method are twofold. First, we present the MBTTL algorithm. MBTTL improves upon past work by reducing the complexity of adapting policies from Task 1 to the new task, Task 2. Specifically, MBTTL breaks tasks into different modes of operation, called subtasks. The policy is adapted to Task 2 only at points of subtask transition, by solving one-step reachability problems.
Second, we prove that when MBTTL is applied to linear systems with convex constraints, the output policies are feasible for Task 2 and reduce iteration cost compared to PFS. We now formally define the MBTTL problem.
II Problem Definition
II-A Tasks and Subtasks
We consider a system

x_{t+1} = f(x_t, u_t),   (1a)
x_t ∈ X,  u_t ∈ U,   (1b)

where f is the dynamical model, and X and U are the state and input constraint sets, respectively. The vectors x_t and u_t collect the states and inputs at time step t.

A task is an ordered sequence of M subtasks

T = (S_1, S_2, …, S_M),   (2)

where the i-th subtask S_i is the four-tuple

S_i = (X_i, U_i, O_i, R_i).   (3)

X_i ⊆ X is the subtask workspace and U_i ⊆ U the subtask input space. O_i ⊂ X_i is the subtask obstacle space. R_i represents the set of transition states and inputs from the current subtask workspace into the subsequent one:

R_i = {(x, u) ∈ X_i × U_i : f(x, u) ∈ X_{i+1}}.   (4)
A successful subtask execution E^j(S_i) of a subtask S_i is a trajectory of states and inputs evolving according to (1) without intersecting the subtask obstacle set. We define the j-th successful execution of subtask S_i as

E^j(S_i) = (x_i^j, u_i^j),   (5)
x_i^j = [x_i^j(0), x_i^j(1), …, x_i^j(T_i^j)],   (6)
u_i^j = [u_i^j(0), u_i^j(1), …, u_i^j(T_i^j)],   (7)
x_i^j(t) ∈ X_i \ O_i ∀t,  (x_i^j(T_i^j), u_i^j(T_i^j)) ∈ R_i,   (8)

where the vectors u_i^j and x_i^j collect the inputs applied to the system (1a) and the resulting states, and x_i^j(t) and u_i^j(t) denote the system state and the control input at time t of subtask execution j. The final state of each successful subtask execution is in the subtask transition set. For the sake of notational simplicity, we have written all subtask executions as beginning at time step 0.
We define the j-th successful task execution as a concatenation of successful subtask executions:
(9a)  
(9b)  
(9c)  
(9d)  
(9e) 
where the concatenation times mark the cumulative duration of the preceding subtasks during the j-th task iteration. When the state reaches a subtask transition set, the system has completed that subtask, and it transitions into the following one. The task is completed when the system reaches the last subtask's transition set, which is taken to be a control invariant target set for the task.
After a number of successful executions of Task 1, we define the sampled safe state set and the sampled safe input set as:
(10) 
The sampled safe state set contains all states visited by the system in all previous successful iterations of Task 1, and the sampled safe input set contains the inputs applied at each of these states. Thus, for any state in the sampled safe state set there exists a feasible input sequence contained in the sampled safe input set to complete the task while satisfying constraints.
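As a concrete illustration of how these sets can be assembled from stored trajectories, consider the following minimal sketch. The trajectory container format and the `build_safe_sets` helper are our own illustrative choices, not part of the algorithm itself:

```python
import numpy as np

def build_safe_sets(trajectories):
    """Collect sampled safe state/input sets from stored task executions.

    `trajectories` is a list of (X, U) pairs, where X is a (T+1, n) array of
    visited states and U the (T, m) array of inputs applied at X[0..T-1].
    """
    safe_states, safe_inputs = [], []
    for X, U in trajectories:
        # every visited state (except the terminal one) pairs with the
        # input that was applied there
        safe_states.append(X[:-1])
        safe_inputs.append(U)
    return np.vstack(safe_states), np.vstack(safe_inputs)
```

Stacking all iterations into flat arrays keeps state/input pairs aligned by row, which is convenient for the interpolation and reachability problems introduced below.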
We also define the sampled cost set
(11) 
where each element is the vector containing the costs associated with each state and input pair of the corresponding task iteration, calculated according to the user-defined cost function:
(12) 
Each entry is the realized cost-to-go from the corresponding state at time step t of the task execution.
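For an additive cost, the realized cost-to-go in (12) is simply a reverse cumulative sum of the stage costs along a stored trajectory. A minimal sketch (the function name is illustrative):

```python
import numpy as np

def cost_to_go(stage_costs):
    """Realized cost-to-go at every time step of one stored execution:
    the reverse cumulative sum of the per-step stage costs."""
    return np.cumsum(stage_costs[::-1])[::-1]
```

For example, stage costs [1, 2, 3] yield costs-to-go [6, 5, 3].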
II-B Safe-Set Based Policies
The sampled safe sets and the sampled cost set induce policies to control the system in a task execution.
II-B1 Interpolated Policies
For linear systems with convex constraints, we can define an interpolated policy. At a state in the convex hull of the sampled safe state set, we first solve the LP
(13a)  
s.t.  (13b)  
(13c) 
The optimal value of (13) interpolates the realized cost-to-go over the safe set, and the corresponding input is obtained from the same interpolation of the stored inputs.
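For intuition, the interpolation LP can be sketched with an off-the-shelf solver. The sketch below assumes the standard barycentric form used in the LMPC literature (the decision variables are convex-combination weights over the stored states); the exact constraint set of (13) may differ:

```python
import numpy as np
from scipy.optimize import linprog

def interpolated_policy(x, SS, SU, Q):
    """Barycentric interpolation of the stored cost-to-go over the safe set.

    SS: (N, n) stored states, SU: (N, m) stored inputs, Q: (N,) costs-to-go.
    Solves  min_lam Q @ lam  s.t. lam >= 0, sum(lam) = 1, SS.T @ lam = x,
    and returns the interpolated input SU.T @ lam and the optimal value.
    """
    N = SS.shape[0]
    A_eq = np.vstack([np.ones((1, N)), SS.T])
    b_eq = np.concatenate([[1.0], x])
    res = linprog(Q, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * N, method="highs")
    if not res.success:  # x lies outside the convex hull of the safe set
        return None, np.inf
    lam = res.x
    return SU.T @ lam, float(Q @ lam)
```

For a 1-D safe set {0, 1} with costs {0, 4}, querying x = 0.5 returns the midpoint interpolation of both the stored inputs and the cost-to-go.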
II-B2 MPC-based Policies
For nonlinear systems, we can define an MPC-based policy as
(15a)  
s.t.  (15b) 
where the objective includes a chosen stage cost. Problem (15) searches for an input that controls the current state to the state in the safe state set with the lowest cost-to-go. The policy prediction horizon can be extended as necessary.
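One crude way to approximate such a one-step policy is to sample candidate inputs and keep only those whose successor lands (approximately) on a stored safe state. This is an illustrative sampling sketch, not the optimization problem (15) itself:

```python
import numpy as np

def one_step_policy(x, f, candidate_inputs, SS, Q, stage_cost, tol=1e-2):
    """Sampling approximation of a one-step safe-set policy: among candidate
    inputs whose successor lands within `tol` of a stored safe state, pick
    the one minimizing stage cost plus that state's stored cost-to-go."""
    best_u, best_cost = None, np.inf
    for u in candidate_inputs:
        x_next = f(x, u)
        dists = np.linalg.norm(SS - x_next, axis=1)
        j = int(np.argmin(dists))
        if dists[j] > tol:  # successor not (approximately) in the safe set
            continue
        cost = stage_cost(x, u) + Q[j]
        if cost < best_cost:
            best_u, best_cost = u, cost
    return best_u, best_cost
```

A denser input grid or a nonlinear program would tighten the approximation; the tolerance `tol` trades off feasibility strictness against sample density.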
Next we provide two examples of how to formulate tasks using the introduced task-subtask notation.
II-C Task Formulation Example 1: Autonomous Racing
Consider an autonomous racing task, in which a vehicle is controlled to minimize lap time while driving around a race track with piecewise constant curvature (Fig. 1). We model this task as a series of ten subtasks, where the i-th subtask corresponds to a section of the track with constant radius of curvature. Tasks with a different subtask order are tracks consisting of the same road segments in a different order.
The vehicle system is modeled in the curvilinear abscissa reference frame [24], with states and inputs at time step
(16a)  
(16b) 
where the states are the vehicle's longitudinal velocity, lateral velocity, and yaw rate at time step t, together with the distance travelled along the centerline of the road and the heading angle and lateral distance error between the vehicle and the path. The inputs are the longitudinal acceleration and the steering angle.
Accordingly, the system state and input spaces are
(17a)  
(17b) 
The system dynamics (1a) are described using an Euler discretized dynamic bicycle model:
(18)  
where dt is the discretization step, and I_z and m are the moment of inertia and mass of the vehicle, respectively. l_f and l_r are the distances from the center of gravity to the front and rear axles, and F_f and F_r are the Pacejka functions for the front and rear tire forces, respectively. For more detail, we refer to [5].

We formulate each subtask according to (3), with:
II-C1 Subtask Workspace
(19) 
where the workspace bounds involve the distances along the centerline to the start and end of the curve, the lane width, and the total length of the track. These bounds indicate that the vehicle can only drive forwards on the track, up to a maximum velocity, and must stay within the lane.
II-C2 Subtask Input Space
(20) 
where the bounds are the acceleration and steering input limits. The input limits are a function of the vehicle and do not change between subtasks.
II-C3 Subtask Obstacle Space
In the absence of other vehicles or roadblocks on the track, the subtask obstacle space in this example is empty:
(21) 
II-C4 Subtask Transition Set
Lastly, we define the subtask transition set to be the states along the subtask border where the track’s radius of curvature changes:
(22) 
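As a concrete reference for the vehicle model in this example, an Euler-discretized dynamic bicycle model with a simplified Pacejka tire model can be sketched as follows. The parameter values and Pacejka coefficients are illustrative placeholders, and the curvature coupling terms are omitted (valid on a straight segment only):

```python
import numpy as np

def pacejka(alpha, B=10.0, C=1.9, D=1.0):
    # simplified Pacejka "magic formula" lateral tire force;
    # B, C, D are illustrative placeholder coefficients
    return D * np.sin(C * np.arctan(B * alpha))

def bicycle_step(x, u, dt=0.02, m=1.98, Iz=0.24, lf=0.125, lr=0.125):
    """One Euler step of a dynamic bicycle model in the curvilinear frame.
    x = [vx, vy, wz, e_psi, s, e_y], u = [a, delta]."""
    vx, vy, wz, e_psi, s, e_y = x
    a, delta = u
    alpha_f = delta - np.arctan2(vy + lf * wz, vx)  # front slip angle
    alpha_r = -np.arctan2(vy - lr * wz, vx)         # rear slip angle
    Fyf, Fyr = pacejka(alpha_f), pacejka(alpha_r)
    return np.array([
        vx + dt * (a - Fyf * np.sin(delta) / m + wz * vy),
        vy + dt * ((Fyf * np.cos(delta) + Fyr) / m - wz * vx),
        wz + dt * (lf * Fyf * np.cos(delta) - lr * Fyr) / Iz,
        e_psi + dt * wz,
        s + dt * (vx * np.cos(e_psi) - vy * np.sin(e_psi)),
        e_y + dt * (vx * np.sin(e_psi) + vy * np.cos(e_psi)),
    ])
```

Driving straight at 1 m/s with unit acceleration for one step advances the longitudinal velocity by dt and the centerline progress by dt·vx, as expected.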
II-D Task Formulation Example 2: Robotic Path Planning
Consider a task in which a robotic arm needs to move an object to a target without colliding with obstacles (Fig. 2). The obstacles are modeled as extruded disks of varying heights. Here, each subtask corresponds to the workspace above an obstacle. For this example, different subtask orderings correspond to a rearranging of the obstacle locations.
The robotic manipulator is modeled as a six-joint robotic arm, with states and inputs
(23a)  
(23b) 
where the states are the angle and angular velocity of each joint, respectively, at time step t. The inputs are the torques applied at each of the joints.
The system state and input spaces are
(24a)  
(24b) 
The continuous-time system dynamics are given by:

M(q) q̈ + C(q, q̇) q̇ + g(q) = τ,   (25)

where M(q) is the mass inertia matrix, C(q, q̇) the matrix of Coriolis and centrifugal forces, and g(q) the vector of gravity terms. We refer to [13] for details and the discretized form of (25).
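A forward-Euler integration step of manipulator dynamics of this form can be sketched as below, with the arm-specific terms supplied as callables; the interface is our own illustrative choice, not the discretization used in [13]:

```python
import numpy as np

def manipulator_step(q, qd, tau, M, C, g, dt=0.01):
    """One Euler step of M(q) qdd + C(q, qd) qd + g(q) = tau.
    M, C, g are user-supplied callables for the specific arm;
    returns the next joint angles and velocities."""
    qdd = np.linalg.solve(M(q), tau - C(q, qd) @ qd - g(q))
    return q + dt * qd, qd + dt * qdd
```

For a trivial unit-inertia single joint with no Coriolis or gravity terms, a unit torque raises the joint velocity by dt per step.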
We formulate each subtask according to (3), with:
II-D1 Subtask Workspace
(26) 
where the bounds mark the cumulative angle to the beginning and end of the i-th obstacle, as in Fig. 2.
II-D2 Subtask Input Space
(27) 
where the bounds are the minimum and maximum allowed torques for each joint.
II-D3 Subtask Obstacle Space
The subtask obstacle space contains those states where the robot end-effector collides with the subtask obstacle:
(28) 
where the obstacle height defines the i-th subtask, and the kinematic mapping from a state to the position of the robot end-effector is given in [25].
II-D4 Subtask Transition Set
We define the subtask transition set to be the states along the subtask border where the next obstacle begins:
(29) 
III Model-Based Task Transfer Learning
In this section we describe the intuition behind MBTTL and provide an algorithm for the method. We prove feasibility and iteration cost reduction of policies output by MBTTL for linear systems with convex constraints. We conclude the section by analyzing MBTTL from two other perspectives: hybrid systems and feature-based aggregation.
III-A MBTTL
Let Task 1 and Task 2 be different orderings of the same subtasks:
(30) 
where the subtask sequence of Task 2 is a reordering of that of Task 1. Assume the sampled safe state set, sampled safe input set, and sampled cost set of Task 1 are nonempty.
The goal of MBTTL is to use the state trajectories stored in the sampled safe sets in order to find feasible trajectories for Task 2, ending in the task target set. The key intuition of the method is that all successful subtask executions from Task 1 are also successful subtask executions for Task 2, since this definition (5) depends only on properties of the subtask itself, not on the subtask sequence.
Based on the above intuition, the algorithm proceeds backwards through the new subtask sequence. Consider the final subtask of Task 2. We know that all of its states stored in our Task 1 executions are controllable to the target set using the stored policies (Alg. 2, Lines 4-5). We then look for stored states from the preceding subtask that are controllable to these states. Only reachability from the sampled guard set will be important in our approach.
Define the sampled guard set of a subtask as
(31) 
The sampled guard set for a subtask contains the stored states from which the system transitioned into another subtask during the previous task executions.
We search for the set of points in the sampled guard set that are controllable to stored states in the subsequent Task 2 subtask (Alg. 2, Lines 9-14). This reachability problem can be solved using a variety of numerical approaches. Then, stored states not shown to be backward reachable are removed as candidate controllable states for Task 2 (Alg. 2, Lines 15-17). All remaining stored states can be controlled to the target set.

Alg. 2 iterates backwards through the remaining subtasks, or until no states in a subtask's sampled guard set can be shown to be controllable to the target set. The algorithm returns sampled safe sets for Task 2 that have been verified through reachability to contain feasible executions of Task 2. Fig. 3 depicts this process across three subtasks with sample data from the autonomous racing task detailed in Sec. IV.
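The backward pass described above can be summarized in Python-style pseudocode. The data layout and the `controllable_to` reachability check are hypothetical interfaces standing in for the reachability and pruning steps of Alg. 2:

```python
def mbttl_backward_pass(new_order, subtask_data, controllable_to):
    """Backward pass of an MBTTL-style algorithm over the new subtask order.

    `new_order` lists subtask indices for the new task, `subtask_data[i]`
    holds the stored trajectories for subtask i from the previous task, and
    `controllable_to(x, verified)` is a user-supplied one-step reachability
    check (hypothetical interface) returning True if guard state x can be
    steered into the verified trajectories of the next subtask.
    """
    # the last subtask's stored trajectories end in the task target set
    verified = {new_order[-1]: subtask_data[new_order[-1]]["trajectories"]}
    # iterate backwards from the second-to-last subtask in the new order
    for i, j in zip(reversed(new_order[:-1]), reversed(new_order[1:])):
        kept = []
        for traj in subtask_data[i]["trajectories"]:
            guard_state = traj[-1]  # final state lies in the guard set
            if controllable_to(guard_state, verified[j]):
                kept.append(traj)
        if not kept:  # no guard state is controllable; stop early
            break
        verified[i] = kept
    return verified
```

Only guard states are ever tested, which is what keeps the pass cheap relative to verifying every recorded time step.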
In this paper, we implement the search for controllable points by solving a one-step reachability problem:
(32a)  
(32b)  
(32c)  
(32d) 
where the objective includes a user-defined stage cost, z denotes a state trajectory through the next Task 2 subtask, and q the cost vector associated with that trajectory. Problem (32) aims to find an input that connects the sampled guard state to a state in the convex hull of a previously verified state trajectory through the next subtask. We note that solving the reachability problem to the convex hull reduces the computational complexity of MBTTL, and is exact only for linear systems with convex constraints.
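For linear dynamics, a problem of this form reduces to an LP over the input and the convex-combination weights. A sketch under that assumption (box input bounds only, with the interpolated cost-to-go as the objective; the exact objective of (32) may include additional stage-cost terms):

```python
import numpy as np
from scipy.optimize import linprog

def one_step_reach(x, A, B, Z, q, u_bounds):
    """One-step controllability of guard state x into conv(Z) for
    x+ = A x + B u, minimizing the interpolated cost-to-go q @ lam.
    Z: (N, n) verified states, q: (N,) their costs, u_bounds: (lo, hi) per
    input. Decision variables are stacked as [u (m,), lam (N,)]."""
    m, N = B.shape[1], Z.shape[0]
    # equality constraints: A x + B u = Z.T @ lam  and  sum(lam) = 1
    A_eq = np.block([[B, -Z.T],
                     [np.zeros((1, m)), np.ones((1, N))]])
    b_eq = np.concatenate([-A @ x, [1.0]])
    cost = np.concatenate([np.zeros(m), q])  # interpolated cost-to-go
    bounds = list(u_bounds) + [(0, None)] * N
    res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
    if not res.success:
        return None, np.inf  # x is not one-step controllable into conv(Z)
    return res.x[:m], float(q @ res.x[m:])
```

Infeasibility of the LP certifies that the guard state cannot be connected to the verified trajectory in one step, which is exactly the pruning test used in the backward pass.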
MBTTL improves on the computational complexity of the surveyed transfer learning methods in two key ways: (i) by verifying the stored trajectories only at states in the sampled guard set, rather than at each recorded time step, and (ii) by solving a data-driven, one-step reachability problem to adapt the trajectories, rather than a multi-step or set-based reachability method.
III-B Properties of MBTTL-derived Policies
Assumption 1: The system dynamics (1a) are linear, and the state and input constraint sets are convex.
Assumption 2
Task 1 and Task 2 are as in (30), where the two tasks share the same subtask workspaces and input spaces.
Theorem 1: Let Assumptions 1 and 2 hold. Then the interpolated policy induced by the sampled safe sets output by the MBTTL algorithm is feasible for Task 2.
Proof: By nonemptiness of the returned safe sets (Alg. 2, Line 15), it follows that:
(33)  
For any such state, the interpolated policy applies the input
(34) 
where the interpolation weights satisfy
(35) 
Then,
(36a)  
(36b)  
(36c) 
It follows from convexity of the constraint sets and of the safe set that the policy is feasible at every time step.
The above Theorem 1 implies that the safe sets output by the MBTTL algorithm induce an interpolating policy for linear systems that can be used to successfully complete Task 2 while satisfying all input and state constraints. The safe sets can also be used directly to initialize an ILC for Task 2 [5].
Assumption 3
Consider Task 1 and Task 2 as defined in (30). The stored trajectories correspond to executions of Task 1 by the linear system in closed-loop with an ILC. At each iteration, the ILC executes a feedback policy. The ILC is initialized with a policy that is feasible for both Task 1 and Task 2.
For the autonomous racing task, an example of a control policy feasible for two different tasks is a low-speed centerline-following controller. For the robotic manipulation task, a policy which always keeps the end-effector above the tallest obstacle is feasible for any subtask order.
Theorem 2: Let Assumptions 1-3 hold. Then the sampled safe sets returned by MBTTL are nonempty, and the cost of executing Task 2 with the resulting interpolated policy is no higher than the cost of executing Task 2 with the initial policy.
Define the vectors
(37a)  
(37b) 
to be the stored state and input trajectories associated with the implemented initial policy.
Since the initial policy is also feasible for Task 2, when Alg. 2 is applied, the entire task execution can be stored as a successful execution of Task 2 without adapting the policy. It follows that this stored trajectory is retained in the returned sampled safe sets, which are therefore nonempty.
Next, note that the interpolated policy is defined with the cost in (13) such that
(38) 
Trivially,
(39) 
and it follows that the cost incurred by a Task 2 execution with the interpolated policy is no higher than that of an execution with the initial policy.
III-C A Hybrid Systems Perspective
The MBTTL algorithm performs backwards reachability between points in different subtasks. If each subtask is viewed as a different mode of operation, the algorithm can be analyzed from a hybrid systems reachability perspective.
Hybrid systems refer to a class of dynamical systems that switch among several discrete operating modes, with each mode governed by its own dynamics [27]. This includes systems such as automobile powertrains, analog alarm clocks, and walking robots.
Hybrid systems reachability considers whether a feasible trajectory exists between a set of initial states and a set of goal states in a potentially different mode. The extensive literature on hybrid systems reachability mainly focuses on two approaches: set-based methods and simulation [28]. Set-based methods are exhaustive methods that use reach set computation to verify feasibility of entire sets of initial conditions and bounded inputs, and many algorithms have recently been proposed [29, 30, 31, 32]. While effective, set-based methods suffer from the "curse of dimensionality" and do not scale well with state dimension. In order to handle complex systems, these methods often approximate sets as polyhedral or ellipsoidal, which can affect solution accuracy. The authors of [33] propose splitting the system state into independent substates to combat the curse of dimensionality, but this is not guaranteed to work for complex systems.

Sampling-based simulation methods check the feasibility of a trajectory beginning from an initial condition under sampled input sequences. These methods are less limited to low-dimensional systems, but are not an exhaustive search and can miss subtle phenomena that a particular model may generate [34, 35, 36, 37].
In contrast, MBTTL only solves reachability problems between discrete points in the sampled guard set. This ensures the algorithm scales well with state dimension and number of subtasks without requiring drastic approximations. Additionally, even when set-based methods are computationally feasible for a particular system, in contrast to the MBTTL algorithm they only provide sets of reachable states, rather than a complete policy. A further problem with traditional sampling-based hybrid systems reachability methods is that sampled trajectories are propagated with random inputs without any check on whether the trajectory is promising or will inevitably lead to eventual infeasibility; MBTTL explicitly checks only for controllability to feasible points. Lastly, MBTTL views the transitions between subtasks as particular to the task instance, rather than as permanent. The authors are aware of no previously published work in which the transitions of a hybrid system change.
III-D A Feature-Based Aggregation Perspective
Dynamic Programming (DP) methods provide exact solutions to constrained optimal control problems. However, DP can incur tremendous cost and is therefore not implementable for highdimensional systems. Recent work [38] proposes forming aggregate (or representative) features out of system states in order to reduce the problem dimension. These reduceddimension problems can provide approximate solutions to the original task.
In spatiotemporal aggregation, coarse space and time states are chosen as aggregate features. Space-time barriers serve as transition sets between these aggregate features, and the shortest path problem is solved only between points immediately adjacent to the barriers. This is analogous to MBTTL performing reachability only at points in the sampled guard sets. While, unlike MBTTL, spatiotemporal aggregation does not explicitly consider a notion of reordering, it provides an additional perspective on the utility of task segmentation for computationally efficient policy construction.
IV Simulation Results
We demonstrate the utility of MBTTL on the autonomous racing task introduced in Sec. II-C, taking Task 1 to be the track in Fig. 1. An ILC using Learning Model Predictive Control (LMPC) is used to complete executions of Task 1, with the vehicle beginning each task iteration at a standstill on the centerline at the start of the track. These executions and their costs are stored in the sampled safe and cost sets. An initial policy for the ILC is provided by a centerline-tracking, low-velocity PID controller. For more details on the LMPC implementation, we refer to [5].
MBTTL is then used to design initial policies for reconfigured tracks from these sampled safe sets. Figure 4 compares the candidate initial policies designed by MBTTL and PID for three different tracks composed of the same track segments as Task 1. The top figures compare the resulting trajectories from the two initialization methods. While the PID initialization tracks the centerline, the MBTTL policy makes use of the previous ILC's experience solving Task 1 to traverse the new track more efficiently, for example by traveling along the insides of curves. As shown in the bottom row of figures, this results in an improvement in the time required to traverse the tracks when compared with the conservative PID initialization.
MBTTL can also be applied repeatedly if the subtask sequence changes multiple times. If an MBTTL-initialized ILC completes iterations of a related Task 2 ("one MBTTL application"), and MBTTL is then applied again to design an initial policy for another related Task 3 ("two MBTTL applications"), the algorithm draws on subtask executions collected over two different tasks in order to build safe sets for Task 3. This increases the variability of trajectories contained in the subtask safe sets, which means more subtask sequences may become viable task executions, leading to better initial policies. Figure 5 compares the cost (lap time) incurred by executions of a PID-initialized ILC with the cost incurred by executions of three different levels of MBTTL-initialized ILCs. For the example shown, after three applications the MBTTL-initialized ILC is faster than the PID-initialized ILC across the recorded laps. The MBTTL-initialized ILC leads to quicker convergence to a local optimum than the PID-initialized ILC, with the latter being two iterations slower than the MBTTL controller (see Fig. 4(c)).
MBTTL was run offline here. For real-time feasibility, the set computation time must be shown to be sufficiently low that an ILC could transition seamlessly between tasks. This remains to be explored in future work, along with application of the method to robotic manipulation tasks.
Note: since the curvilinear abscissa state is a cumulative state that integrates distance traveled along the centerline, its value depends on the order of subtasks, and stored trajectories must first be preprocessed.
V Conclusion
A model-based task transfer learning method is presented. The MBTTL algorithm uses stored state and input trajectories from executions of a particular task to design safe policies for executing variations of that task. The method breaks each task into subtasks and performs reachability analysis at sampled safe states between subtasks. MBTTL improves upon other task transfer learning methods by verifying and adapting the previous policy only at points of subtask transition, rather than along the entire trajectory.
We test the proposed algorithm on an autonomous racing task. Our simulation results confirm that MBTTL allows an ILC to converge to an optimal lap trajectory faster than planning from scratch. Future work is needed to validate the realtime feasibility of the method in experimental setups.
References
 [1] D. A. Bristow, M. Tharayil, and A. G. Alleyne, “A survey of iterative learning control,” IEEE Control Systems, vol. 26, no. 3, pp. 96–114, 2006.
 [2] Y. Wang, F. Gao, and F. J. Doyle III, “Survey on iterative learning control, repetitive control, and run-to-run control,” Journal of Process Control, vol. 19, no. 10, pp. 1589–1600, 2009.
 [3] K. Kritayakirana and C. Gerdes, “Using the centre of percussion to design a steering controller for an autonomous race car,” Vehicle System Dynamics, vol. 15, pp. 33–51, 2012.
 [4] J. Carrau, A. Liniger, X. Zhang, and J. Lygeros, “Efficient Implementation of Randomized MPC for Miniature Race Cars,” in European Control Conference, Jun 2016, pp. 957–962.
 [5] U. Rosolia, A. Carvalho, and F. Borrelli, “Autonomous racing using learning model predictive control,” in American Control Conference (ACC), 2017. IEEE, 2017.
 [6] R. Horowitz, “Learning control of robot manipulators,” Transactions of the American Society of Mechanical Engineers, Journal of Dynamic Systems, Measurement and Control, vol. 115, pp. 402–402, 1993.
 [7] S. Arimoto, M. Sekimoto, and S. Kawamura, “Taskspace iterative learning for redundant robotic systems: Existence of a taskspace control and convergence of learning,” SICE Journal of Control, Measurement, and System Integration, vol. 1, no. 4, pp. 312–319, 2008.
 [8] J. Van Den Berg, S. Miller, D. Duckworth, H. Hu, A. Wan, X.Y. Fu, K. Goldberg, and P. Abbeel, “Superhuman performance of surgical tasks by robots using iterative learning from humanguided demonstrations,” in 2010 IEEE International Conference on Robotics and Automation. IEEE, 2010, pp. 2074–2081.
 [9] Y.C. Wang, C.J. Chien, and C.N. Chuang, “Adaptive iterative learning control of robotic systems using backstepping design,” Transactions of the Canadian Society for Mechanical Engineering, vol. 37, no. 3, pp. 591–601, 2013.
 [10] J. H. Lee and K. S. Lee, “Iterative learning control applied to batch processes: An overview,” Control Engineering Practice, vol. 15, no. 10, pp. 1306–1318, 2007.
 [11] K. Wei and B. Ren, “A method on dynamic path planning for robotic manipulator autonomous obstacle avoidance based on an improved rrt algorithm,” Sensors, vol. 18, no. 2, p. 571, 2018.
 [12] E. Gilbert and D. Johnson, “Distance functions and their application to robot path planning in the presence of obstacles,” IEEE Journal on Robotics and Automation, vol. 1, no. 1, pp. 21–30, March 1985.
 [13] M. W. Spong and M. Vidyasagar, Robot dynamics and control. John Wiley & Sons, 2008.
 [14] D. Berenson, P. Abbeel, and K. Y. Goldberg, “A robot path planning framework that learns from experience,” 2012 IEEE International Conference on Robotics and Automation, pp. 3671–3678, 2012.
 [15] M. Stolle, “Finding and transferring policies using stored behaviors,” Ph.D. dissertation, Carnegie Mellon University, 2008.
 [16] Y. Tassa, T. Erez, and W. D. Smart, “Receding horizon differential dynamic programming,” in Advances in Neural Information Processing Systems 20, J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, Eds. Curran Associates, Inc., 2008, pp. 1465–1472. [Online]. Available: http://papers.nips.cc/paper/3297recedinghorizondifferentialdynamicprogramming.pdf

 [17] C. G. Atkeson and J. Morimoto, “Nonparametric representation of policies and value functions: A trajectory-based approach,” in Advances in Neural Information Processing Systems 15, S. Becker, S. Thrun, and K. Obermayer, Eds. MIT Press, 2003, pp. 1643–1650.
 [18] R. Laroche and M. Barlier, “Transfer reinforcement learning with shared dynamics,” in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
 [19] T. Croonenborghs, K. Driessens, and M. Bruynooghe, “Learning a transfer function for reinforcement learning problems,” 2008.
 [20] T. G. Karimpanal and R. Bouffanais, “Selforganizing maps for storage and transfer of knowledge in reinforcement learning,” CoRR, vol. abs/1811.08318, 2018. [Online]. Available: http://arxiv.org/abs/1811.08318
 [21] A. Tirinzoni, A. Sessa, M. Pirotta, and M. Restelli, “Importance weighted transfer of samples in reinforcement learning,” CoRR, vol. abs/1805.10886, 2018. [Online]. Available: http://arxiv.org/abs/1805.10886

 [22] G. Konidaris, I. Scheidwasser, and A. Barto, “Transfer in reinforcement learning via shared features,” Journal of Machine Learning Research, vol. 13, no. May, pp. 1333–1371, 2012.
 [23] A. Coates, P. Abbeel, and A. Y. Ng, “Learning for control from multiple demonstrations,” in Proceedings of the 25th International Conference on Machine Learning, ser. ICML ’08. New York, NY, USA: ACM, 2008, pp. 144–151. [Online]. Available: http://doi.acm.org/10.1145/1390156.1390175
 [24] R. Rajamani, Vehicle dynamics and control. Springer Science & Business Media, 2011.
 [25] P. Kebria, S. Al-Wais, H. Abdi, and S. Nahavandi, “Kinematic and dynamic modelling of UR5 manipulator,” Oct. 2016, pp. 4229–4234.
 [26] U. Rosolia, X. Zhang, and F. Borrelli, “Simple policy evaluation for datarich iterative tasks,” CoRR, vol. abs/1810.06764, 2018. [Online]. Available: http://arxiv.org/abs/1810.06764
 [27] F. Borrelli, A. Bemporad, and M. Morari, Predictive Control for linear and hybrid systems. Cambridge University Press, 2017.
 [28] S. Schupp, E. Ábrahám, X. Chen, I. B. Makhlouf, G. Frehse, S. Sankaranarayanan, and S. Kowalewski, “Current challenges in the verification of hybrid systems,” in International Workshop on Design, Modeling, and Evaluation of Cyber Physical Systems. Springer, 2015, pp. 8–24.
 [29] S. Kong, S. Gao, W. Chen, and E. Clarke, “dreach: reachability analysis for hybrid systems,” in International Conference on TOOLS and Algorithms for the Construction and Analysis of Systems. Springer, 2015, pp. 200–205.
 [30] S. Ratschan and Z. She, “Safety verification of hybrid systems by constraint propagationbased abstraction refinement,” ACM Transactions on Embedded Computing Systems (TECS), vol. 6, no. 1, p. 8, 2007.
 [31] K. Scheibler, S. Kupferschmid, and B. Becker, “Recent improvements in the SMT solver iSAT.”
 [32] I. M. Mitchell and Y. Susuki, “Level set methods for computing reachable sets of hybrid systems with differential algebraic equation dynamics,” Apr. 2008.
 [33] S. Bansal, M. Chen, S. Herbert, and C. J. Tomlin, “Hamiltonjacobi reachability: A brief overview and recent advances,” in 2017 IEEE 56th Annual Conference on Decision and Control (CDC). IEEE, 2017, pp. 2242–2253.
 [34] W. Taha, A. Duracz, Y. Zeng, K. Atkinson, F. Bartha, P. Brauner, J. Duracz, F. Xu, R. Cartwright, M. Konečný, E. Moggi, J. Masood, P. Andreasson, J. Inoue, A. Sant’Anna, R. Philippsen, A. Chapoutot, M. O’Malley, A. Ames, and C. Grante, “Acumen: An open-source testbed for cyber-physical systems research,” Oct. 2015.
 [35] P. S. Duggirala, S. Mitra, M. Viswanathan, and M. Potok, “C2e2: A verification tool for stateflow models,” in International Conference on Tools and Algorithms for the Construction and Analysis of Systems. Springer, 2015, pp. 68–82.
 [36] S. Bak and P. S. Duggirala, “Hylaa: A tool for computing simulationequivalent reachability for linear systems,” in Proceedings of the 20th International Conference on Hybrid Systems: Computation and Control. ACM, 2017, pp. 173–178.
 [37] L. Liebenwein, C. Baykal, I. Gilitschenski, S. Karaman, and D. Rus, “Samplingbased approximation algorithms for reachability analysis with provable guarantees,” 2018.
 [38] D. P. Bertsekas, “Featurebased aggregation and deep reinforcement learning: A survey and some new implementations,” arXiv preprint arXiv:1804.04577, 2018.