I Introduction
There exists a plethora of motion planning and control techniques for self-driving vehicles [1]. This diversity stems from a core difficulty: the trade-off between model complexity and the online computation permitted at short sampling times. Three popular control classes and recent vision-based end-to-end solutions are briefly summarized below.
I-A Model-based control methods
In [2], a sampling-based anytime algorithm, RRT, is discussed. The key notion is to refine an initial suboptimal path while it is being followed. As demonstrated, this is feasible when driving towards a static goal in a static environment. However, it may be problematic in dynamic environments that require constant replanning of paths, and where a suitable online-sampled trajectory may not be returned in time. Other problems of online sampling-based methods are a limited model complexity and their tendency to produce jagged controls that require a smoothing step, e.g., via conjugate gradient [3]. In [4], a lattice-based method is discussed. Such methods, and similarly those based on motion primitives [5, 6, 7, 8], are always limited by the size of the look-up table that can be searched in real time. In [4], a GPU is used for search. In [9], linear time-varying model predictive control (LTV-MPC) is discussed for autonomous vehicles. While appealing for its ability to incorporate constraints, MPC must trade off model complexity against the computational burden of solving optimization problems online. Furthermore, MPC depends on state and input reference trajectories, typically for the linearization of dynamics, but almost always also for providing a tracking reference. Therefore, a two-layered approach is often applied, with motion planning and tracking as the 2 layers [1]. See [10] for a method using geometric corridor planning in the first layer for reference generation and for the combinatorial decision on which side to overtake obstacles. As indicated in [9, Sect. V-A] and further emphasized in [11], the selection of reference velocities can become problematic for time-based MPC, which motivated the use of spatial-based system modeling. Vehicle dynamics can be incorporated by inflating obstacles [7]. For tight maneuvering, a linearization approach [12] is more accurate but computationally more expensive. To summarize, 2 core observations are made.
First, all methods (from sampling-based to MPC) are derived from vehicle models. Second, all of the above methods suffer from the real-time requirement of short sampling times. As a consequence, all methods simplify the employed model. Simplifications include, e.g., omitting dynamical effects, tire dynamics, or vehicle dimensions, using inflated obstacles, pruning search graphs, solving optimization problems iteratively, or precomputing trajectories offline.
I-B Vision-based methods
In [13], a pioneering end-to-end trained neural network labeled ALVINN was used for steering control of an autonomous vehicle. Video and range measurements are fed to a fully connected (FC) network with a single hidden layer of 29 hidden units and an output layer with 45 direction output units, i.e., discretized steering angles, plus one road intensity feedback unit. ALVINN does not control velocity and is trained using supervised learning based on road “snapshots”. Similarly, the more recent DAVE-2 [14] also only controls steering and is trained in a supervised manner. However, it outputs continuous steering actions and is composed of a network including convolutional neural network (CNN) layers as well as FC layers, with a total of 250000 parameters. During testing (i.e., after training), steering commands are generated from only a front-facing camera. Another end-to-end system based on only camera vision is presented in [15]. First, a driving intention (change to left lane, change to right lane, stay in lane, or brake) is determined, before the steering angle is output from a recurrent neural network (RNN). Instead of mapping images to steering control, in [16] and [17], affordance indicators (such as distance to cars in current and adjacent lanes) and feasible driving actions (such as straight, stop, left-turn, right-turn) are output from neural networks, respectively. See also [18] and their treatment of “option policies”. To summarize, a distinction is made between (i) vision-based end-to-end control, and (ii) perception-driven approaches that attempt to extract useful features from images. Note that such features (e.g., obstacle positions) are implicitly required for all methods from Sect. I-A.
I-C Motivation and Contribution
This work is motivated by the following additional considerations. First, as noted in [19], localization relative to lane boundaries is more important than with respect to GPS coordinates, which underlines the importance of lasers, lidars and cameras for automated driving. Second, vehicles are man- and woman-made products for which there exists decade-long experience in vehicle dynamics modeling [20], [21]. There is no reason to a priori entirely discard this knowledge (for manufacturers it is present even in the form of construction plans). This motivates leveraging available vehicle models for control design. Consider also the position paper [22] for general limitations of end-to-end learning. Third, a general-purpose control setup is sought that avoids switching between different vehicle models and algorithms for, e.g., highway driving and parking. There also exists only one real-world vehicle. From that perspective, a complex vehicle model encompassing all driving scenarios is in general preferable for control design. Also, a model mismatch between the planning and tracking layers can produce paths that are infeasible to track [7]. Fourth, the most frequent accidents involving other moving vehicles are rear-end collisions [23], which are most often caused by inattentiveness or insufficient following distances. Control methods that enable minimal sampling times, such as feedforward control, can deterministically increase safety through minimal reaction times. In contrast, environment motion prediction (which can also increase safety) always remains stochastic. Fifth, small sampling times may conflict with using complex vehicle models for control when these are applied in expensive online optimization or search problems. These considerations motivate a 2-step procedure: first, learning of a controller during offline training based on an arbitrarily complicated mathematical system model, followed by fast online evaluation of the trained controller. In an automated vehicle setting, this implies that once trained, low-cost embedded hardware can be used online for the evaluation of only a few matrix-vector multiplications.
The contribution of this paper is a simple gradient-free algorithm for model-based deep reinforcement learning using task separation with hill climbing (TSHC). Specifically, it is proposed to (i) simultaneously train on separate deterministic tasks with the purpose of encoding motion primitives in a neural network, and (ii) employ maximally sparse rewards during training in combination with virtual velocity constraints (VVCs) in setpoint proximity.
II Problem Formulation and Preliminaries
II-A General setup
The problem formulation is visualized in Fig. 1. Exteroceptive measurements are assumed to include inter-vehicular communication (car2car) sensing as well as communication with a centralized or decentralized coordination service such that, in general, multi-automated vehicle coordination is also enabled [24]. For learning of controller C, 5 core aspects are distinguished: the system model used for training, the neural network architecture used for function approximation, the training algorithm, the selection of training tasks, and the hardware/software implementation. The fundamental objective is to encode many desired motion primitives (training tasks) in a neural network. The main focus of this paper is on the training algorithm, motivated within the context of motion planning for autonomous vehicles characterized by nonholonomic system models.
II-B Illustrative system model for simulation experiments
For simplicity, an Euler-discretized nonlinear kinematic bicycle model [21] is assumed for the simulation experiments of Sect. IV. Equations of motion are , , , with 3 states (position coordinates and heading), 2 controls (steering angle and velocity ), 1 system parameter (wheelbase m), and indexing sampling time . Coordinates and describe the center of gravity (CoG) in the inertial frame and denotes the yaw angle relative to the inertial frame. Physical actuator absolute and rate constraints are treated as part of the vehicle model on which the network training is based. Thus, the continuous control vector is defined as , with , and . The minimum velocity is negative to permit reverse driving.
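As a minimal sketch of such a model, the following Python snippet implements one Euler step of a kinematic bicycle model in a commonly used form. The exact equations, symbol names, wheelbase value, and sampling time used here are assumptions for illustration and are not taken from the text above.

```python
import math

def bicycle_step(state, control, wheelbase, ts):
    """One Euler step of a kinematic bicycle model (assumed standard form).

    state   = (x, y, psi): position coordinates and heading [m, m, rad]
    control = (delta, v):  steering angle [rad] and velocity [m/s]
    """
    x, y, psi = state
    delta, v = control
    x_next = x + ts * v * math.cos(psi)
    y_next = y + ts * v * math.sin(psi)
    psi_next = psi + ts * v / wheelbase * math.tan(delta)
    return (x_next, y_next, psi_next)

# Hypothetical parameter values, chosen only for illustration.
WHEELBASE = 3.0   # [m]
TS = 0.1          # [s]

# Straight forward driving, then reversing with a negative velocity.
forward = bicycle_step((0.0, 0.0, 0.0), (0.0, 10.0), WHEELBASE, TS)
reverse = bicycle_step((0.0, 0.0, 0.0), (0.0, -1.0), WHEELBASE, TS)
```

A negative velocity moves the vehicle backwards along its heading, which is why a negative minimum velocity suffices to permit reverse driving in this formulation.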
II-C Comments on feature vector selection
While the mathematical system model used for training prescribes the system state, this is not the case for the feature vector. The dimension of the feature vector may in general be much smaller than that of the system's state space. In general, it may be an arbitrary function of filtered extero- and proprioceptive measurements according to Fig. 1. Thus, many different sensors may be compressed through filtering into a low-dimensional feature vector. Due to the curse of dimensionality, low-dimensional feature vectors are favorable, since the easiest way to generate training tasks is to grid over the elements of the feature vector. Note further that, for our purpose of encoding specific motion primitives, the feature vector must always relate the current vehicle state to a goal state (e.g., via a difference operator). Certificates about learnt control performance can be provided by stating (i) the system model used for training, and (ii) the encoded motion primitives (training tasks) and their associated feature vectors. Finally, instead of only a single time instant, the feature vector may, in general, also represent a collection of multiple past measurements (time series) leading up to the current time.
II-D Comments on computation
For perspective, deep learning using neural networks as function approximators is in general computationally very demanding. To underline the dimensions and computational effort encountered in practice, note that, for example, in [25] training is distributed over 80 machines and 1440 CPU cores. In [26], even more strikingly, 1024 Tesla P100 GPUs are used in parallel. For reference, one Tesla P100 offers a double-precision performance of 4.7 TeraFLOPs [27].
III Training Algorithm
This section motivates a simple gradient-free algorithm for learning of neural network controllers according to Fig. 1.
III-A Neural network controller parametrization
The controller in Fig. 1 may be parameterized by any of, e.g., FC layers, LSTM cells including peephole connections [28], GRUs [29], and variants. All neural network parameter weights to be learnt are initialized with Gaussian-distributed variables with zero mean and a small standard deviation (e.g., ). An exception is the forget gate biases of LSTM cells, to which 1 is added as recommended in [30], and which are thus initialized with mean 1. In the proposed setting, the affine part of all FC layers is followed by nonlinear tanh activation functions acting element-wise. Because of their bounded outputs, saturating nonlinearities are preferred over ReLUs, which are used for the hidden layers in other RL settings [31] but can result in large unbounded changes of layer outputs. Before entering the neural network, the feature vector is normalized element-wise (accounting for the typical range of feature vector elements). The final FC layer also comprises a tanh activation. It accordingly outputs bounded continuous values, which are then affinely scaled to the control vector via the physical actuator absolute and rate constraints valid at the current time.
So far, a continuous control vector was assumed. A remark with respect to gear selection is made. Electric vehicles, which appear suitable to curb urban pollution, do not require gearboxes. Nevertheless, the control vector can in general be extended to include a discrete gear as an additional decision variable. Suppose a number of gears are available. Then, the output layer can be extended by as many channels, with each channel output representing a normalized probability of gear selection as a function of the feature vector, which can be trained by means of a softmax classifier.
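The output stage described above can be sketched as follows. The snippet maps a tanh output in [-1, 1] affinely onto the interval that is admissible under absolute and rate constraints, and normalizes hypothetical gear channels with a softmax. The function names, argument names, and numerical values are assumptions for illustration only.

```python
import math

def scale_output(y, prev_u, u_min, u_max, du_max, ts):
    """Affinely map a tanh output y in [-1, 1] to the admissible control interval.

    The interval is the intersection of the absolute bounds [u_min, u_max] and
    the rate-limited range around the previously applied control prev_u.
    """
    lo = max(u_min, prev_u - du_max * ts)
    hi = min(u_max, prev_u + du_max * ts)
    return lo + 0.5 * (y + 1.0) * (hi - lo)

def softmax(logits):
    """Normalized probabilities for discrete channels (e.g., gear selection)."""
    m = max(logits)                       # subtract the maximum for stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical steering example: bounds of +/-1 rad, rate limit 0.5 rad/s.
steer = scale_output(math.tanh(0.3), prev_u=0.0, u_min=-1.0, u_max=1.0,
                     du_max=0.5, ts=0.1)
gear_probs = softmax([0.2, 1.5, -0.3])
```

Because the scaling interval already respects the actuator constraints, every network output yields an admissible control, so no additional clipping is needed in this sketch.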
III-B Reward shaping
Reward shaping is crucial for the success of learning by reinforcement signals [32]. However, reward shaping was found to be far from trivial in practical problems. Therefore, our preferred choice is motivated in detail. In most practical control problems, a state is given at the current time, and a desired goal state is known. Not known, however, are the shape of the best trajectory (w.r.t. a given criterion) and the control signals that realize that trajectory. Thus, by nature these problems offer a sparse reward signal, received only upon reaching the desired goal state at some later time. In the following, alternative rich reward signals and curriculum learning [33] are discussed.
III-B1 The problem of designing rich reward signals
A reward signal , abbreviated by , is labeled as rich when it is time-varying as a function of states, controls and feature vector. Note that the design of any such signal is heuristic and motivated by the hope for accelerated learning through maximally frequent feedback. In the following, the problem with rich rewards is exposed. First, let , , and relate states with desired goals, and let a binary flag indicate whether the desired goal pose is reached,
(1)
where are small tolerance hyperparameters. Then, suppose a rich reward signal of the form is designed, which characterizes a weighted linear combination of different measures. This class of reward signals, trading off various terms and providing feedback at every , occurs frequently in the literature [34, 35, 31, 36]. However, as will be shown, for trajectory planning in an automotive setting (especially due to nonholonomic vehicle models), it may easily lead to undesirable behavior. Suppose case (a) in Fig. 2 and a maximum simulation time . Then, omitting a discount factor for brevity, , may be obtained for accumulated rewards. Thus, the no-movement solution may incur more accumulated reward, namely , in comparison to the true solution, which is indicated on the right-hand side of the inequality sign.
Similarly, for specific , the second scenario (b) in Fig. 2 can return a no-movement solution since the initial angle already coincides with the target angle. Hence, for a specific combination, the accumulated reward when not moving may exceed the value of the actual solution.
The third scenario (c) in Fig. 2 shows that even when rich rewards are reduced to a single measure, e.g., , an undesired standstill may result. This occurs especially in the presence of obstacles (and maze-like situations in general).
To summarize, for finite , the design of rich reward signals is not straightforward and can easily result in solution trajectories that may even be globally optimal w.r.t. accumulated reward, yet fail to solve the original problem of determining a trajectory from the initial to the target state.
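The effect of a finite horizon can be illustrated with a small worked example. The sketch below assumes a rich reward equal to the negated distance to the goal at every time step. The two distance sequences are hypothetical numbers chosen for illustration and are not taken from Fig. 2.

```python
def accumulated_reward(distances, horizon):
    """Accumulate the assumed rich reward r_t = -distance_t over a finite horizon.

    Once the listed distances are exhausted, the last value is held, which
    models standing still at the final position.
    """
    total = 0.0
    for t in range(horizon):
        d = distances[t] if t < len(distances) else distances[-1]
        total += -d
    return total

# Hypothetical distance-to-goal profiles [m].
stay = [2.0]                                   # vehicle never moves
detour = [2.0, 4.0, 6.0, 6.0, 4.0, 2.0, 0.0]   # must first move away from the goal

short = (accumulated_reward(stay, 5), accumulated_reward(detour, 5))
long = (accumulated_reward(stay, 20), accumulated_reward(detour, 20))
# short == (-10.0, -22.0): standing still accumulates more reward
# long  == (-40.0, -24.0): only a longer horizon favors the true solution
```

In this toy setting the ranking of the two trajectories depends on the horizon length, so the no-movement trajectory can be optimal with respect to accumulated reward without solving the task.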
III-B2 The problem of curriculum learning
In [33], curriculum learning (CL) is discussed as a method to speed up learning by providing the learning agent first with simpler examples before gradually increasing complexity. Analogies to humans and animals are drawn. The same paper also acknowledges the difficulty of determining “interesting” examples [33, Sect. 7] that optimize learning progress.
Indeed, CL entails the following issues. First, “simpler” tasks need to be identified. This is not straightforward as discussed shortly. Second, these tasks must first be solved before their result can serve as initialization to more complex tasks. In contrast, without CL, the entire solution time can be devoted to the complex tasks rather than being partitioned into easier and difficult tasks. In experiments, this was found to be relevant. Third, the solution of an easier task does not necessarily represent a better initialization to a harder problem in comparison to an alternative random initialization. For example, consider the scenario in Fig. 3. The solution of the simpler task does not serve as a better initialization than a purely random initialization of weights. This is since the final solution requires outreaching steering and possibly reversing of the vehicle. The simpler task just requires forward driving and stopping. This simple example illustrates the need for careful manual selection of suitable easier tasks for CL.
III-B3 The benefits of maximally sparse rewards in combination with virtual velocity constraints
In the course of this work, many reward shaping methods were tested. These include, first, solving “simpler” tasks by first dismissing target angles limited to deviation from the initial heading. Second, tolerances were initially relaxed before being gradually decreased. Third, it was tested to first solve a task for only the criterion, then both , and only finally all of . Here, varying sequences (e.g., first instead of ) were also tested. No consistent improvement could be observed for any of these methods. On the contrary, solving allegedly simpler tasks reduced the available solver time for the original “hard” problems. Without CL, the entire solution time can be devoted to the complex tasks.
Based on these findings, our preferred reward design method is maximally sparse and defined by
(2)  
(3) 
where from (1), and being an indicator flag for a vehicle crash. Thus, upon the RL problem is considered as solved. In addition, the path length incurred for a transition from sampling time to is defined as
(4)
As elaborated below, the accumulated total path length is used to rank solution candidates that solve all desired training tasks.
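A minimal sketch of these quantities is given below. It assumes a goal flag built from three tolerance checks, a constant reward of -1 at every step until the goal is reached, and a negated Euclidean path-length increment. These specific forms and all names are assumptions for illustration and may differ from (1)-(4).

```python
import math

def goal_reached(e_d, e_psi, e_v, eps_d, eps_psi, eps_v):
    """Binary goal flag: all error measures lie within their tolerances."""
    return abs(e_d) < eps_d and abs(e_psi) < eps_psi and abs(e_v) < eps_v

def sparse_reward(reached):
    """Assumed maximally sparse reward: constant until the goal is reached."""
    return 0.0 if reached else -1.0

def path_increment(pos, pos_next):
    """Negated Euclidean path length of one transition (maximization convention)."""
    return -math.hypot(pos_next[0] - pos[0], pos_next[1] - pos[1])

# Hypothetical usage: one transition of 5 m that does not yet reach the goal.
flag = goal_reached(e_d=4.0, e_psi=0.01, e_v=0.0, eps_d=0.1, eps_psi=0.05, eps_v=0.5)
reward = sparse_reward(flag)
delta_j = path_increment((0.0, 0.0), (3.0, 4.0))
```

Since the reward carries no information about path quality, the accumulated path length serves as a separate measure for comparing candidates that solve the same tasks.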
The integral for is defined for generality, in particular with problems such as the inverted pendulum [37] in mind, which are considered to be solved only after stabilization is demonstrated for sufficiently many consecutive time steps. Note, however, that this is not required for an automotive setting. Here, it must be . Only then is learning with possible. Other criteria and trade-offs for are possible (e.g., accumulated curvature of resulting paths and a min-max objective thereof). The negation is introduced for maximization (“hill climbing” convention). Note that the preferred reward signal is maximally sparse, returning , for all times up until reaching the target. It represents a tabula rasa solution criticized in [34] for its maximal sparsity. Indeed, on its own it was not sufficient to facilitate learning when also accounting for a velocity target . Therefore, virtual velocity constraints (VVCs) in target proximity are introduced. Two variants are discussed. First, VVCs spatially dependent on can be defined as
(5) 
where , and is a hyperparameter (e.g., range-view length or a heuristic constant). Second and alternatively, VVCs may be defined as spatially invariant with a constant margin (e.g., 5 km/h) around the target velocity. For both variants, the neural network output that regulates velocity is scaled with updated and constraints (i.e., using (5) for spatially dependent VVCs).
Let us further justify VVCs. Since speed is a decision variable, it can always be constrained artificially. This justifies the introduction of VVCs. In (5), bounds are set to converge affinely towards in the proximity of the goal location. This is a heuristic choice. Note that the affine choice does not necessarily imply constant accelerations, since (5) is spatially parameterized. Note further that physical actuator rate constraints still hold when is applied to the vehicle.
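The spatially dependent variant can be sketched as follows. The snippet assumes that both velocity bounds interpolate affinely between the target velocity at the goal and the physical bounds at a given distance from it. This particular form, the function names, and the numerical values are assumptions for illustration and need not coincide with (5).

```python
def vvc_bounds(dist, v_goal, v_min, v_max, d_vvc):
    """Spatially dependent virtual velocity bounds (assumed affine form).

    dist:  current distance to the goal location [m]
    d_vvc: distance hyperparameter beyond which the physical bounds apply [m]
    """
    ratio = min(max(dist / d_vvc, 0.0), 1.0)
    lower = v_goal + ratio * (v_min - v_goal)
    upper = v_goal + ratio * (v_max - v_goal)
    return lower, upper

def scale_velocity(y, lower, upper):
    """Map a tanh output y in [-1, 1] affinely onto the current bounds."""
    return lower + 0.5 * (y + 1.0) * (upper - lower)

# Hypothetical values [m/s]: standstill target, reverse driving permitted.
at_goal = vvc_bounds(0.0, 0.0, -2.0, 10.0, 20.0)
halfway = vvc_bounds(10.0, 0.0, -2.0, 10.0, 20.0)
far_away = vvc_bounds(50.0, 0.0, -2.0, 10.0, 20.0)
v_cmd = scale_velocity(0.0, *halfway)
```

At the goal location both bounds coincide with the target velocity, so any network output is mapped to the desired velocity there, while far from the goal only the physical bounds remain active.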
It was also tested to constrain . The final heading pose implies circles that cannot be trespassed because of the nonholonomic vehicle dynamics. It was tested to add these as virtual obstacles. However, this did not accelerate learning.
Finally, note that VVCs artificially introduce hard constraints and thus shape the learning result w.r.t. velocity, at least towards the end of the trajectory. Two comments are made. First, in receding online operation, with additional frequent resetting of targets, this shaping effect is reduced since only the first control of a planned trajectory is applied. Second, in the case of spatially dependent VVCs, the influence of hyperparameter only becomes apparent during parking when following the trajectory up until standstill. Here, however, no significant velocity changes are desired, such that the choice is not decisive. Ultimately, note that sparse rewards naturally avoid the need to introduce trade-off hyperparameters for the weighting of states in different units. This permits solution trajectories between start and goal poses to evolve naturally without biasing them through the provision of rich references to track.
To summarize this section, it was illustrated that the design of rich reward signals as well as curriculum learning can be problematic. Therefore, maximally sparse rewards in combination with virtual velocity constraints are proposed.
III-C The role of tolerances
Tolerances in (1) play an important role for 2 reasons. On the one hand, nonzero result in deviations between the actually learnt and the originally desired goal pose . On the other hand, very small (e.g., m, and km/h) prolong learning time. Two scenarios apply.
First, for a network trained on a large-scale and dense grid of training tasks and for small , suitable control commands are naturally interpolated during online operation even for setpoints not seen during training. The concept of natural interpolation through motion primitives encoded in neural networks is the core advantage over methods relying on look-up tables with stored trajectories, which require solving time-critical search problems. For example, in [4] an exhaustive search of the entire lattice graph is conducted online on a GPU. In [8], a total of about 100 motion primitives is considered. Then, an integer program is solved online by enumeration, using maximal progress along the centerline as the criterion for selection of the best motion primitive. In contrast, for control using neural networks as function approximators this search is not required.
Second, the scenario was considered in which existing training hardware (i) does not permit large-scale encoding, and (ii) only permits the use of larger tolerances to limit training time. Therefore, the following method is devised. First, tuples are stored for each training task. Then, during online operation, for any setpoint, , the closest (according to a criterion) from the set of training tasks is searched, before the corresponding is applied to the network controller. Two comments are made. First, in order to reach (with zero deviation), must be applied to the network. Therefore, tuples need to be stored. Second, even though this method now also includes a search, it still holds an important advantage over lattice-based methods, namely the compression of the look-up table into the network weights. Hence, only tuples need to be stored, not entire trajectories. This is especially relevant in view of limited hardware memory. Thus, through encoding, potentially many more motion primitives can be stored.
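One possible reading of this second scenario is sketched below. It assumes that each stored tuple pairs the goal commanded during training with the goal pose actually reached, and that closeness is measured by a weighted squared distance. Both the tuple layout and the distance criterion are assumptions for illustration.

```python
def closest_task(setpoint, tasks, weights=(1.0, 1.0, 1.0)):
    """Return the commanded goal whose achieved goal is closest to the setpoint.

    tasks: list of (commanded_goal, achieved_goal) tuples, each goal being a
           pose (x, y, heading). Only these tuples are stored, no trajectories.
    """
    def dist(achieved):
        return sum(w * (a - s) ** 2
                   for w, a, s in zip(weights, achieved, setpoint))

    commanded, _ = min(tasks, key=lambda pair: dist(pair[1]))
    return commanded

# Hypothetical stored tuples for two training tasks.
stored = [
    ((10.0, 0.0, 0.0), (9.8, 0.1, 0.0)),
    ((0.0, 10.0, 1.57), (0.2, 9.9, 1.5)),
]

# The commanded goal of the closest achieved pose is fed to the controller.
goal_for_network = closest_task((0.0, 10.0, 1.5), stored)
```

The search runs over stored goal tuples only, so its memory footprint grows with the number of tasks rather than with the length of the associated trajectories.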
In practice, the first scenario is preferable. It is also implementable for 2 reasons. First, see Sect. II-D for computational opportunities. Second, neural networks have in principle unlimited function approximation capability [38]. Hence, the implementation of the first approach is purely a question of intelligent task setup and computational power.
III-D Main Algorithm – TSHC
Algorithm 1 is proposed for simple gradient-free model-based reinforcement learning. The name is derived from (i) learning from separate training tasks, and (ii) a hill climbing update of parameters (greedy local search).
Let us elaborate on definitions. Analysis is provided in Sect. III-E. First, all network parameters are lumped into variable . Second, the perturbation Step 8 in Algorithm 1 has to be interpreted accordingly. It implies parameter-wise affine perturbations with zero-mean Gaussian noise and spherical variance . Third, , and in Steps 14-16 denote functional mappings between properties defined in the preceding sections. Fourth, hyperparameters are stated in Step 1. While , , , and denote lengths of different iterations, is used for updating of in Steps 35 and 37. Fifth, for every restart iteration, , multiple parameter iterations are conducted, at most many. Sixth, in Steps 25 and 29 hill climbing is conducted when (i) all tasks have been solved for current , or (ii) not all tasks have yet been solved, respectively. Seventh, there are 2 steps in which an early termination of iterations may occur: Steps 21 and 41. The former is a must. Only then is learning with possible. The latter termination criterion in Step 41 is optional. If dismissed, a refinement step is implied. Thus, even though all tasks have been solved, parameter iterations (up until ) are continued. Eighth, note that a discount hyperparameter , common to gradient-based RL methods [39], is not required, since it is irrelevant in the maximally sparse reward setting. Ninth, nested parallelization is in principle possible with an inner and outer parallelization of Steps 10-22 and 7-22, respectively. The former refers to solutions for a given parameter vector , whereas the latter parallelizes parameter perturbations. For final experiments, Steps 7-22 were implemented asynchronously. Finally, there are 3 options considered for selection. First, holding an initial selection constant throughout TSHC. Second, updating randomly (e.g., uniformly distributed between 10 and 1000), whereby this can be implemented either in Step 4 at every , or in Step 6 at every combination. Third, may be adapted according to progress in , as outlined in Algorithm 1. For the first 2 options of selecting , Steps 34-37 are dismissed and at least can be dismissed from the list of hyperparameters in Step 1.
III-E Analysis
According to classifications in [40], TSHC is a gradient-free instance-based simulation optimization method, generating new candidate solutions based on only the current solution and random search in its neighborhood. Because of its hill climbing (greedy) characteristic, it differs (i) from evolutionary (population-based) methods that construct solutions by combining others, typically using weighted averaging [41, 25], and (ii) from model-based methods that use probability distributions on the space of solution candidates, see [40] for a survey. In its high-level structure, Algorithm 1 can be related to the COMPASS algorithm [42]. Within a global stage, the authors of [42] identify several possible regions with locally optimal solutions. Then, they find locally optimal solutions for each of the identified regions, before they select the best solution among all identified locally optimal solutions. In our setting, these regions are enforced as the separate training tasks and the best solution for all of these is selected.
In combination with sufficiently large , must be large enough to permit sufficient exploration such that a network parametrization solving all tasks can be found. In contrast, the effect of decreasing with an increasing number of solved tasks is that, ideally, a speedup in learning progress results from the assignment of more of solution candidates closer in variance to a promising (see Step 8 of TSHC).
Steps 29-31 are discussed. For the case that for a specific iteration not all tasks have yet been solved, has been considered as an alternative criterion for Step 29. Several remarks can be made. First, Step 29 and the alternative are not equal. This is because, in general, different tasks are solved in a different number of time steps. However, the criteria are approximately equivalent for sparse rewards (since accumulates constants according to (2)), and especially for large . The core advantage of employing Step 29 in TSHC is that it can, if desired, also be used in combination with rich rewards to accelerate learning progress (if a suitable rich reward signal can be generated). In such a scenario, according to Step 29 is updated towards the most promising , then representing the accumulated rich reward. Thus, in contrast to (2), a rich reward could be represented by a weighted sum of squared errors between state and a reference ,
(6) 
where , are trade-off hyperparameters and scalar elements of vectors are indexed by in brackets. Another advantage of the design in Algorithm 1 according to Steps 29-31 is its anytime solution character. Even if not all are solved, the solution returned for the tasks that are solved is typically of good quality and optimized according to Steps 29-31.
If for all tasks there exists a feasible solution for a given system model and a sufficiently expressive network structure parameterized by , then Algorithm 1 can find such parametrization for sufficiently large hyperparameters , , , and . The solution parametrization is the result from the initialization Step 4 and parameter perturbations according to Step 8, both nested within multiple iterations. As noted in [43], for optimization via simulation, a global convergence guarantee provides little practical meaning other than reassuring a solution will be found “eventually” when simulation effort goes to infinity. However, the same reference also states that a convergence property is most meaningful if it can help in designing suitable stopping criteria. In our case, there are 2 such conceptual levels of stopping criteria: first, the solution of all training tasks, and second, the refinement of solutions.
Control design is implemented hierarchically in 2 steps. First, suitable training tasks (desired motion primitives) are defined. Then, these are encoded in the network by the application of TSHC. This has practical implications. First, it encourages training on deterministic tasks. Furthermore, at every , training is conducted simultaneously on all of these separate tasks. This is beneficial in that the best parametrization, , is clearly defined via Step 25, maximizing the accumulated measure over all tasks. Second, it enables certificates on the learnt performance, which can be provided by stating (i) the employed vehicle model, and (ii) the list of encoded tasks (motion primitives). Note that such certificates cannot be given for the class of stochastic continuous action RL algorithms that are derived from the Stochastic Policy Gradient Theorem [44]. This class includes all stochastic actor-critic algorithms, including A3C [45] and PPO [39].
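The interplay of task separation and hill climbing can be illustrated on a toy problem. The sketch below is a strongly simplified illustration and not a reproduction of Algorithm 1. The 1-D tasks, the two-parameter tanh controller, the acceptance rule, and all hyperparameter values are assumptions made for this example.

```python
import math
import random

TASKS = [-5.0, -2.0, 2.0, 5.0]   # hypothetical 1-D goal positions
T_MAX, TOL = 30, 0.1             # episode length and goal tolerance

def rollout(theta, goal):
    """Simulate one task; return (solved, accumulated reward, negated path length)."""
    x, reward, path = 0.0, 0.0, 0.0
    for _ in range(T_MAX):
        if abs(goal - x) < TOL:
            return True, reward, path          # early termination on success
        v = math.tanh(theta[0] * (goal - x) + theta[1])   # bounded control
        x += v                                 # unit sampling time
        reward -= 1.0                          # sparse reward: constant until goal
        path -= abs(v)
    return abs(goal - x) < TOL, reward, path

def evaluate(theta):
    """Run all separate tasks for one parameter vector."""
    results = [rollout(theta, g) for g in TASKS]
    n_solved = sum(1 for solved, _, _ in results if solved)
    return n_solved, sum(r for _, r, _ in results), sum(p for _, _, p in results)

def is_better(cand, best):
    """Acceptance rule: path length if all tasks are solved, reward otherwise."""
    if best is None:
        return True
    c_all, b_all = cand[0] == len(TASKS), best[0] == len(TASKS)
    if c_all and b_all:
        return cand[2] > best[2]
    if c_all != b_all:
        return c_all
    return cand[1] > best[1]

def tshc_sketch(n_restarts=3, n_iter=200, n_pert=8, sigma=0.5, seed=0):
    """Random restarts with greedy local search in parameter space."""
    rng = random.Random(seed)
    best_theta, best_score = None, None
    for _ in range(n_restarts):
        theta = [rng.gauss(0.0, 0.01) for _ in range(2)]   # small random init
        score = evaluate(theta)
        for _ in range(n_iter):
            # All perturbations of one iteration start from the same parameters.
            cands = [[p + rng.gauss(0.0, sigma) for p in theta]
                     for _ in range(n_pert)]
            for cand in cands:
                cand_score = evaluate(cand)
                if is_better(cand_score, score):
                    theta, score = cand, cand_score
        if is_better(score, best_score):
            best_theta, best_score = theta, score
    return best_theta, best_score
```

Calling `tshc_sketch()` returns the best parameter vector found together with its tuple of solved tasks, accumulated reward, and negated path length. Because every candidate is evaluated on all tasks with a deterministic simulation, re-evaluating the returned parameters reproduces the reported score exactly.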
III-F Discussion and comparison with related RL work
Related continuous control methods that use neural networks for function approximation are discussed, focusing on one stochastic policy gradient method [39], one deterministic policy gradient method [31], and one evolution strategy [25]. The methods are discussed in detail to underline aspects of TSHC.
First, the stochastic policy gradient method PPO [39] is discussed. Suppose that a stochastic continuous control vector is sampled from a Gaussian distribution parameterized by such that . (Footnote 1: In this setting, mean and variance of the Gaussian distribution are the output of a neural network whose parameters are summarized by lumped .) Then,
(7) 
is defined as the expected accumulated and time-discounted reward when at drawing , and following the stochastic policy for all subsequent times when acting in the simulation environment. Since function is a priori not known, it is parameterized by and estimated. Using RL terminology, in the PPO setting, represents the advantage function. Then, using the “log-likelihood trick”, and subsequently a first-order Taylor approximation of around some reference , the following parameterized cost function is obtained as an approximation of (7),
(8)
Finally, (8) is modified to the final PPO cost function [39]
(9) 
whereby the advantage function is estimated by the policy parameterized by , which is run for consecutive time steps such that for all the tuples can be added to a replay buffer, from which minibatches are later drawn. According to [39], the estimate is with , and so forth until , and where represents a second, the so-called critic neural network. Then, using uniformly randomly drawn minibatches of size , parameters of both networks are updated according to and , with denoting the argument of the expectation in (9) evaluated at time index . This relatively detailed discussion is given to underline the following observations. First, with the introduction of a parameterized estimator, then a first-order Taylor approximation, and then clipping, (9) is an arguably crude approximation of the original problem (7). Second, the complexity of two networks, actor and critic, is noted. Typically, both are of the same dimensions apart from the output layers. Hence, when not sharing weights between the networks, approximately twice as many parameters are required. However, when sharing any weights between the actor and critic networks, the optimization function (9) must be extended accordingly, which introduces another approximation step. Third, note that gradients of both networks must be computed for backpropagation. Fourth, the dependence on rich reward signals is stressed. As long as the current policy does not find a solution candidate, in a sparse reward setting, all are uniform. Hence, there is no information permitting to find a suitable parameter update direction, and all of the computationally expensive gradient computations are essentially not usable. (Footnote 2: It is mentioned that typically the first, for example, 50000 samples are collected without parameter update. However, even then that threshold must be selected, and the fundamental problem still persists.) Thus, the network parameters are still updated entirely at random. Moreover, even if a solution candidate trajectory was found, it is easily averaged out through the random minibatch update. This underlines the problem of sparse rewards for PPO. Fifth, A3C [45] and PPO [39] are by nature stochastic policies, which draw their controls from a Gaussian distribution (for which mean and variance are the output of a trained network with the current state as its input). Hence, exact repetition of any task (e.g., the navigation between 2 locations) cannot be guaranteed. It can only be guaranteed when dismissing the variance component and consequently using solely the mean for deterministic control. This can be done in practice, but introduces another approximation step.
The deterministic policy gradient method DDPG [31] is discussed next. Suppose a deterministic continuous control vector parameterized such that . Then, a corresponding cost function is defined.
Its gradient can now be computed by applying the chain rule for derivatives [46]. Introducing a parameterized estimate of , which here represents the Q-function or action value function (in contrast to the advantage function in the above stochastic setting), the final DDPG cost function [31] is
Then, using minibatches, critic and actor network parameters are updated as and , with slowly tracking target network parameters . Several remarks can be made. First, the Q-function is updated towards only its one-step-ahead target. Rewards are therefore propagated very slowly. For sparse rewards this is even more problematic than for rich rewards, especially because of the additional danger of averaging out important update directions through random minibatch sampling. Furthermore, and analogous to the stochastic setting, for the sparse reward setting, as long as no solution trajectory has been found, all of the gradient computations are not usable and all network parameters are still updated entirely at random. DDPG is an off-policy algorithm. In [31], exploration of the simulation environment is achieved according to the current policy plus additive noise following an Ornstein-Uhlenbeck process. This is a mean-reverting linear stochastic differential equation [47]. A first-order Euler approximation thereof can be expressed as the action exploration rule , with hyperparameters in [31]. This detail is provided to stress a key difference between policy gradient methods (both stochastic and deterministic) and methods such as [25] and TSHC. Namely, while the former methods sample controls from the stochastic policy or according to heuristic exploration noise before updating parameters using minibatches of incremental tuples plus for PPO, the latter work directly in the parameter space via local perturbations, see Step 8 of Algorithm 1. This approach appears particularly suitable when dealing with sparse rewards.
As outlined above, in such a setting, parameter updates according to policy gradient methods are also entirely at random, however, with the computationally significant difference of, first, an approximately four times as large parameter space and, second, the unnecessarily costly solution of nonconvex optimization problems as long as no solution trajectory has been found. A well-known issue in training neural networks is the problem of vanishing or exploding gradients. It is particularly relevant for networks with saturating nonlinearities and can be addressed by batch [48] and layer normalization [49]. In both normalization approaches, additional parameters (bias and gains) are introduced to the network, which must be learnt. These issues are not relevant for the proposed gradient-free approach.
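The contrast drawn above between exploration in the action space and exploration in the parameter space can be illustrated with a minimal sketch. It assumes a commonly used Euler discretization of the Ornstein-Uhlenbeck process with the noise hyperparameters 0.15 (mean reversion) and 0.2 (volatility) reported in [31]; the function names, the unit step size, and the perturbation scale `sigma_pert` are hypothetical choices for illustration and are not taken from Algorithm 1.

```python
import numpy as np

def ou_step(noise, rng, theta=0.15, sigma=0.2, mu=0.0, dt=1.0):
    """One Euler step of a mean-reverting Ornstein-Uhlenbeck process."""
    drift = theta * (mu - noise) * dt
    diffusion = sigma * np.sqrt(dt) * rng.standard_normal(noise.shape)
    return noise + drift + diffusion

def explore_action_space(policy, state, noise):
    """DDPG-style exploration: deterministic action plus additive noise."""
    return policy(state) + noise

def explore_parameter_space(params, rng, sigma_pert=0.1):
    """Parameter-space exploration: perturb the network weights directly.
    sigma_pert is a hypothetical perturbation scale."""
    return params + sigma_pert * rng.standard_normal(params.shape)

rng = np.random.default_rng(0)
noise = np.zeros(2)
for _ in range(3):
    # Temporally correlated noise that would be added to the actions.
    noise = ou_step(noise, rng)
```

In the first case the policy parameters stay fixed during a rollout and only the applied controls are disturbed; in the second case a whole rollout is generated by a perturbed but deterministic controller.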
This paper is originally inspired by and most closely related to [25]. The main differences are discussed. The latter evolutionary (population-based) strategy updates parameters using a stochastic gradient estimate. Thus, it updates , where hyperparameters and denote the learning rate and noise standard deviation, and where here indicates the stochastic scalar return provided by the simulation environment. This weighted averaging approach for the stochastic gradient estimate is not suitable for our control design method when using separate deterministic training tasks in combination with maximally sparse rewards. Here, hill climbing is more appropriate, because most of the trajectory candidates do not end up at and are therefore not useful. Note also that only the introduction of virtual velocity constraints permitted us to quickly train with maximally sparse rewards. It is well known that for gradient-based training, especially of RNNs, the learning rate ( in [25]) is a critical hyperparameter choice. In the hill climbing setting this issue does not occur. Likewise, fitness shaping [41], also used in [25], is not required. Note that above has the same role as . However, in our setting, it is additionally adaptive according to Steps 29 and 31 in Algorithm 1. As implemented, this is only possible when training on multiple separate tasks. Other differences include the parallelization method in [25], where random seeds shared among workers permit each worker to only send and receive the scalar return of an episode to and from each other worker. All perturbations and parameters are then reconstructed locally by each worker. Thus, for workers there are reconstructions at each parameter iteration step. This requires precise control of each worker and can in rare cases lead to differing CPU utilizations among workers due to differing episode lengths. Therefore, they use a capping strategy on maximal episode length. In contrast, our proposed method is less sophisticated, with one synchronized parameter update, which is then sent to all workers.
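The difference between the weighted-averaging update of [25] and a hill climbing update can be sketched as follows. The first function follows the evolution strategies update as stated in [25], with learning rate `alpha`, noise standard deviation `sigma`, perturbations `eps` and scalar returns. The second function is only a simplified, hypothetical stand-in for a hill climbing step and omits the task separation and adaptive noise of Algorithm 1; all numerical values are illustrative assumptions.

```python
import numpy as np

def es_update(theta, eps, returns, alpha, sigma):
    """Update of [25]: theta + alpha / (n * sigma) * sum_i F_i * eps_i."""
    n = len(returns)
    grad_estimate = eps.T @ returns / (n * sigma)
    return theta + alpha * grad_estimate

def hill_climb_update(theta, eps, returns, sigma, best_return):
    """Adopt the best perturbed candidate only if it beats the incumbent."""
    i = int(np.argmax(returns))
    if returns[i] > best_return:
        return theta + sigma * eps[i], float(returns[i])
    return theta, best_return

rng = np.random.default_rng(0)
theta = np.zeros(3)
eps = rng.standard_normal((4, 3))  # 4 candidates, 3 parameters

# Maximally sparse reward: only candidate 2 reaches the goal.
sparse_returns = np.array([0.0, 0.0, 1.0, 0.0])
theta_es = es_update(theta, eps, sparse_returns, alpha=0.01, sigma=0.1)
theta_hc, best = hill_climb_update(theta, eps, sparse_returns,
                                   sigma=0.1, best_return=0.0)
```

For these illustrative values, the averaged update moves only partway towards the single successful candidate, whereas the hill climbing step adopts that candidate as the new parameter vector. If all returns are zero, neither update changes the parameters.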
IV Numerical Simulations
This section highlights different aspects of Algorithm 1. Numerical simulations of Sect. IV-A and IV-B were conducted on a laptop with an Intel Core i7 CPU @ 2.80GHz × 8 and 15.6GB of memory, and the only libraries employed were Python’s numpy and multiprocessing. Furthermore, in Sect. IV-A, Tensorflow (without GPU support) was used for the implementation of 2 comparative policy gradient methods. Using these (for deep learning) very limited resources made it possible to evaluate the method’s potential when significant computational power is not available. For more complex problems the latter is a necessity. Therefore, in Sect. IV-C, TSHC is implemented in Cuda C++ and 1 GPU is used.
IV-A Experiment 1: Comparison with policy gradient methods
Table I: Number of parameters to be identified.
Method      DDPG   PPO    TSHC
Parameters  19078  18440  4610
To underline conceptual differences between TSHC and the 2 policy gradient methods DDPG [31] and PPO [39], a freeform navigation task with and was considered, where vector summarizes four of the vehicle’s states. The same network architecture as in [39] is used: a fully-connected MLP with 2 hidden layers of 64 units before the output layer. Even though this is the basic setup, considerable differences between DDPG, PPO and TSHC are implied. Both DDPG and PPO are each composed of a total of four networks: one actor, one critic, one actor target and one critic target network. For DDPG, further parameters result from batch normalization [48]. The number of parameters that need to be identified is indicated in Table I. To enable a fair comparison, all of DDPG, PPO and TSHC are permitted to train on 1000 full rollouts according to their methods, whereby each rollout lasts at most timesteps. Thus, for TSHC, and are set. For both PPO and DDPG, this implies 1000 iterations. Results are summarized in Fig. 4. The following observations can be made. First, in comparison to TSHC, for both DDPG and PPO significantly more parameters need to be identified, see Table I. Second, DDPG and PPO do not solve the task based on 1000 training simulations. In contrast, as Fig. 4 demonstrates, TSHC has a much better exploration strategy resulting from noise perturbations in the parameter space. It solves the task in just 2.1s. Finally, note that no iteration is conducted. It is not applicable since a single task is solved with an initial . Because of these findings (other target poses were tested with qualitatively equivalent results), the discussion in Sect. III-F about the handling of sparse rewards, and the fact that DDPG and PPO have no useful gradient direction for their parameter update or may average it out through random minibatch sampling, the focus in the subsequent sections is on TSHC and its analysis.
IV-B Experiment 2: Inverted Pendulum
The discussion of tolerance levels in Sect. III-C motivated the consideration of an alternative approach for tasks requiring stabilization. An analogy to optimal control is drawn. In linear finite horizon MPC, closed-loop stability can be guaranteed through a terminal state constraint set which is invariant for a terminal controller, often a linear quadratic regulator (LQR), see [50]. In an RL setting, the following procedure was considered. First, design an LQR for stabilization. Second, compute the region of attraction of the LQR controller [51, Sect. 3.1.1]. Third, use this region of attraction as stopping criterion, replacing the heuristic tolerance selection.
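The three-step procedure above can be sketched for a toy linear system. This is a sketch under stated assumptions, not the setup of the experiment: the double integrator model, the weights `Q` and `R`, and the level `rho` of the quadratic sublevel set are hypothetical. In particular, `rho` merely stands in for a region of attraction that would have to be computed separately, e.g. along the lines of [51].

```python
import numpy as np

def dlqr(A, B, Q, R, tol=1e-10, max_iter=100000):
    """Discrete-time LQR via fixed-point iteration of the Riccati equation."""
    P = Q.copy()
    for _ in range(max_iter):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P_next = Q + A.T @ P @ (A - B @ K)
        converged = np.max(np.abs(P_next - P)) < tol
        P = P_next
        if converged:
            break
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return K, P

def in_terminal_set(x, P, rho):
    """Stopping criterion: is x inside the sublevel set x' P x <= rho?"""
    return float(x @ P @ x) <= rho

# Step 1: LQR design for a double integrator sampled at 0.02 s (toy model).
dt = 0.02
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.5 * dt**2], [dt]])
K, P = dlqr(A, B, Q=np.eye(2), R=np.eye(1))

# Steps 2 and 3: rho is a placeholder for a computed region of attraction.
rho = 1.0
done = in_terminal_set(np.array([0.01, 0.0]), P, rho)
```

A learning episode would then terminate successfully as soon as the state enters the set, after which the linear feedback `-K @ x` would take over.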
For evaluation, the inverted pendulum system equations and parameters from [37] were adopted (four states, one input). However, in contrast to [37], which assumes just 2 discrete actions (maximum and minimum actuation force), here a continuous control variable is assumed which is limited by the 2 bounds, respectively. There are 2 basic problems: stabilization in the upright position with the initial state in the same position, as well as a swing-up from the hanging position plus subsequent stabilization in the upright position. For the application of TSHC, are set, and the same MLP architecture from Sect. IV-A is used. The following remarks can be made. First, the swing-up plus stabilization task was solved in s runtime of TSHC (without refinement step) and using sparse rewards (obtained in the upright position ). For all three restarts a valid solution was generated. Note that in combination with a sampling time [37] of 0.02s corresponds to 10s simulation time. Stabilization in the upright position was achieved from 2.9s on. Rich reward signals were also tested, exploiting the deviation from current to goal angle as measure. However, rich rewards did not accelerate learning.
In a second experiment, the objective was to simultaneously encode the following 2 tasks in the network: stabilization in the upright position with the initial state in the same position and a swing-up from the hanging position plus subsequent stabilization in the upright position. The runtime of TSHC (without refinement step) was s, with 2 of 3 restarts returning a valid solution and using sparse rewards. Instead of learning both tasks simultaneously according to TSHC, it was also attempted to learn them by selecting one of the 2 tasks at random at every , and subsequently conducting Steps 6-41. Since the 2 tasks are quite different, this procedure could not encode a solution for both tasks. This is mentioned to exemplify the importance of training simultaneously on separate tasks, rather than training on a single task with combinations varying over .
Finally, for the system parameters from [37], it was observed that the continuous control signal was operating mostly at saturated actuation bounds (switching in between). This is mentioned for 2 reasons. First, the aforementioned LQR strategy could therefore never be applied since LQR assumes absence of state and input constraints. Second, it exemplifies the ease of the RL workflow with TSHC for quick nonlinear control design, even without significant system insights.
IV-C Experiments 3 and 4: GPU-based training
Experiment 3 is characterized by transitioning from to with measured in . This implies . The feature vector is selected as with normalization constants in the denominators and indicating the steering angle-related network output (before scaling to ). A high-resolution tolerance of is set. In addition, m and km/h. Sampling time is 0.01s. As neural network, an MLP[5,8,2] is used, which implies 1 hidden layer with 8 units. For selections and , MLP[5,8,2] was the smallest possible network found to simultaneously encode all training tasks. The second variant of VVCs discussed in Sect. III-B is employed. Furthermore, , i.e., uniformly distributed at every combination.
Several comments can be made. First, by application of control mirroring w.r.t. steering, the trained network also makes it possible to reach all of . Second, the total learning time (runtime of TSHC) to encode all 181 training tasks was 31.1min. MLP[5,8,2] implies a total of 66 parameters to learn. It is remarkable that such a small network has enough function approximation capability to encode all 181 tasks within limited training time and . Third, note that the TSHC-trained network controller permits repeatable precision. As mentioned in of Sect. III-F, this is not attainable for stochastic policy gradient-based algorithms, which draw their control signals, typically, from a Gaussian distribution. Fourth, the learning results are visualized in Fig. 5. These motivated an additional experiment with an identical basic training setup (TSHC settings, MLP[5,8,2], etc.), however, now encoding only one task for the transition from to . The result is visualized in Fig. 6. Notice the much reduced number of switches between forward and backward driving, and the different range.
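The parameter count of 66 for MLP[5,8,2] can be verified by counting one weight per connection and one bias per non-input unit. The short sketch below does this; the helper name is hypothetical. As a cross-check, and assuming four inputs and two outputs for the network of Experiment 1 (an assumption, since the dimensions are not restated here), the same count gives 4610 for two hidden layers of 64 units, and four such networks give 18440, consistent with the TSHC and PPO entries of Table I.

```python
def mlp_param_count(layers):
    """Weights plus biases of a fully-connected MLP with given layer sizes."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layers[:-1], layers[1:]))

n_exp3 = mlp_param_count([5, 8, 2])        # (5+1)*8 + (8+1)*2
n_exp1 = mlp_param_count([4, 64, 64, 2])   # assumed input/output sizes
n_four_networks = 4 * n_exp1               # actor, critic and both targets
```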
The comparison of Fig. 5 and 6 emphasizes the interesting observation that the more motion primitives are encoded in a single network, the less performant the individual learnt motion trajectories are. This is believed to illustrate the potential of partitioning the total number of designated tasks into subsets of training tasks for which separate networks are then learnt using TSHC. The promised advantages include faster overall training times, higher performance of learnt trajectories, and the ability to employ tiny networks with few parameters for each subset. This direction is the subject of ongoing work.
V Conclusion
Within the context of automated vehicles, a simple gradient-free reinforcement learning algorithm labeled TSHC was proposed for the design of model-based controllers parameterized by neural networks. The concepts of (i) training on separate tasks with the purpose of encoding motion primitives, and (ii) employing sparse rewards in combination with virtual velocity constraints in setpoint proximity were specifically advocated. Aspects of TSHC were illustrated in 4 numerical experiments. The presented method is not limited to automated driving. Most real-world learning applications for control systems, especially in robotics, are characterized by sparse rewards and the availability of high-fidelity system models that can be leveraged for offline training.
Future work will focus on system models of various complexity (e.g., kinematic vs. dynamic vehicle models), the partitioning of tasks into separate subsets of tasks for which separate network parametrizations are learnt, the analysis of different feature vectors, and closed-loop evaluation.
References
 [1] B. Paden, M. Čáp, S. Z. Yong, D. Yershov, and E. Frazzoli, “A survey of motion planning and control techniques for self-driving urban vehicles,” IEEE Transactions on Intelligent Vehicles, vol. 1, no. 1, pp. 33–55, 2016.
 [2] S. Karaman, M. R. Walter, A. Perez, E. Frazzoli, and S. Teller, “Anytime motion planning using the RRT,” in IEEE Conference on Robotics and Automation, pp. 1478–1483, 2011.
 [3] D. Dolgov, S. Thrun, M. Montemerlo, and J. Diebel, “Path planning for autonomous vehicles in unknown semi-structured environments,” The International Journal of Robotics Research, vol. 29, no. 5, pp. 485–501, 2010.
 [4] M. McNaughton, C. Urmson, J. M. Dolan, and J.W. Lee, “Motion planning for autonomous driving with a conformal spatiotemporal lattice,” in IEEE Conference on Robotics and Automation, pp. 4889–4895, 2011.
 [5] E. Frazzoli, M. A. Dahleh, and E. Feron, “A hybrid control architecture for aggressive maneuvering of autonomous helicopters,” in IEEE Conference on Decision and Control, vol. 3, pp. 2471–2476, 1999.
 [6] T. Schouwenaars, B. Mettler, E. Feron, and J. P. How, “Robust motion planning using a maneuver automation with built-in uncertainties,” in IEEE American Control Conference, vol. 3, pp. 2211–2216, 2003.
 [7] A. Gray, Y. Gao, T. Lin, J. K. Hedrick, H. E. Tseng, and F. Borrelli, “Predictive control for agile semi-autonomous ground vehicles using motion primitives,” in IEEE American Control Conference, pp. 4239–4244, 2012.
 [8] A. Liniger, A. Domahidi, and M. Morari, “Optimization-based autonomous racing of 1:43 scale rc cars,” Optimal Control Applications and Methods, vol. 36, no. 5, pp. 628–647, 2015.
 [9] P. Falcone, F. Borrelli, J. Asgari, H. E. Tseng, and D. Hrovat, “Predictive active steering control for autonomous vehicle systems,” IEEE Transactions on Control Systems Technology, vol. 15, no. 3, pp. 566–580, 2007.
 [10] M. G. Plessen, D. Bernardini, H. Esen, and A. Bemporad, “Spatial-based predictive control and geometric corridor planning for adaptive cruise control coupled with obstacle avoidance,” IEEE Transactions on Control Systems Technology, 2017.
 [11] M. G. Plessen, “Trajectory planning of automated vehicles in tube-like road segments,” in IEEE Conference on Intelligent Transportation Systems, pp. 83–88, 2017.
 [12] M. G. Plessen, P. F. Lima, J. Mårtensson, A. Bemporad, and B. Wahlberg, “Trajectory planning under vehicle dimension constraints using sequential linear programming,” in IEEE Conference on Intelligent Transportation Systems, pp. 108–113, 2017.
 [13] D. A. Pomerleau, “ALVINN: An autonomous land vehicle in a neural network,” in Advances in Neural Information Processing Systems, pp. 305–313, 1989.
 [14] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, et al., “End to end learning for self-driving cars,” arXiv preprint arXiv:1604.07316, 2016.
 [15] S. Chen, S. Zhang, J. Shang, B. Chen, and N. Zheng, “Brain inspired cognitive model with attention for self-driving cars,” arXiv preprint arXiv:1702.05596, 2017.
 [16] C. Chen, A. Seff, A. Kornhauser, and J. Xiao, “Deepdriving: Learning affordance for direct perception in autonomous driving,” in IEEE International Conference on Computer Vision, pp. 2722–2730, 2015.
 [17] H. Xu, Y. Gao, F. Yu, and T. Darrell, “End-to-end learning of driving models from large-scale video datasets,” arXiv preprint arXiv:1612.01079, 2016.
 [18] C. Paxton, V. Raman, G. D. Hager, and M. Kobilarov, “Combining neural networks and tree search for task and motion planning in challenging environments,” arXiv preprint arXiv:1703.07887, 2017.
 [19] C. Urmson, J. Anhalt, D. Bagnell, C. Baker, R. Bittner, M. Clark, J. Dolan, D. Duggins, T. Galatali, C. Geyer, et al., “Autonomous driving in urban environments: Boss and the urban challenge,” Journal of Field Robotics, vol. 25, no. 8, pp. 425–466, 2008.
 [20] T. D. Gillespie, “Vehicle dynamics,” Warrendale, 1997.
 [21] R. Rajamani, Vehicle dynamics and control. Springer Science & Business Media, 2011.
 [22] T. Glasmachers, “Limits of end-to-end learning,” arXiv preprint arXiv:1704.08305, 2017.
 [23] National Highway Traffic Safety Administration, “Traffic safety facts, 2014: a compilation of motor vehicle crash data from the fatality analysis reporting system and the general estimates system. dot hs 812261,” Department of Transportation, Washington, DC, 2014.
 [24] M. G. Plessen, D. Bernardini, H. Esen, and A. Bemporad, “Multi-automated vehicle coordination using decoupled prioritized path planning for multi-lane one- and bi-directional traffic flow control,” in IEEE Conference on Decision and Control, pp. 1582–1588, 2016.
 [25] T. Salimans, J. Ho, X. Chen, and I. Sutskever, “Evolution strategies as a scalable alternative to reinforcement learning,” arXiv preprint arXiv:1703.03864, 2017.
 [26] T. Akiba, S. Suzuki, and K. Fukuda, “Extremely large minibatch sgd: Training resnet50 on imagenet in 15 minutes,” arXiv preprint arXiv:1711.04325, 2017.
 [27] Nvidia, “Tesla P100.” https://images.nvidia.com/content/tesla/pdf/nvidiateslap100PCIedatasheet.pdf, 2016.
 [28] F. A. Gers, N. N. Schraudolph, and J. Schmidhuber, “Learning precise timing with LSTM recurrent networks,” Journal of Machine Learning Research, vol. 3, no. Aug, pp. 115–143, 2002.
 [29] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
 [30] R. Jozefowicz, W. Zaremba, and I. Sutskever, “An empirical exploration of recurrent network architectures,” in International Conference on Machine Learning, pp. 2342–2350, 2015.
 [31] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
 [32] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction, vol. 1. MIT press Cambridge, 1998.
 [33] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum learning,” in International Conference on Machine Learning, pp. 41–48, ACM, 2009.
 [34] J. Randlov and P. Alstrom, “Learning to drive a bicycle using reinforcement learning and shaping,” in International Conference on Machine Learning, pp. 463–471, 1998.
 [35] J. Koutník, J. Schmidhuber, and F. Gomez, “Online evolution of deep convolutional network for vision-based reinforcement learning,” in International Conference on Simulation of Adaptive Behavior, pp. 260–269, Springer, 2014.
 [36] N. Heess, S. Sriram, J. Lemmon, J. Merel, G. Wayne, Y. Tassa, et al., “Emergence of locomotion behaviours in rich environments,” arXiv preprint arXiv:1707.02286, 2017.
 [37] C. W. Anderson, “Learning to control an inverted pendulum using neural networks,” IEEE Control Systems Magazine, vol. 9, no. 3, pp. 31–37, 1989.
 [38] H. T. Siegelmann and E. D. Sontag, “Turing computability with neural nets,” Applied Mathematics Letters, vol. 4, no. 6, pp. 77–80, 1991.
 [39] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
 [40] M. C. Fu, F. W. Glover, and J. April, “Simulation optimization: a review, new developments, and applications,” in IEEE Winter Simulation Conference, pp. 13–pp, IEEE, 2005.
 [41] D. Wierstra, T. Schaul, T. Glasmachers, Y. Sun, J. Peters, and J. Schmidhuber, “Natural evolution strategies,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 949–980, 2014.
 [42] J. Xu, B. L. Nelson, and J. Hong, “Industrial strength COMPASS: A comprehensive algorithm and software for optimization via simulation,” ACM Transactions on Modeling and Computer Simulation, vol. 20, no. 1, p. 3, 2010.
 [43] L. J. Hong and B. L. Nelson, “A brief introduction to optimization via simulation,” in IEEE Winter Simulation Conference, pp. 75–85, 2009.
 [44] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” in Advances in Neural Information Processing Systems, pp. 1057–1063, 2000.
 [45] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International Conference on Machine Learning, pp. 1928–1937, 2016.
 [46] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic policy gradient algorithms,” in International Conference on Machine Learning, pp. 387–395, 2014.
 [47] H. Geering, G. Dondi, F. Herzog, and S. Keel, “Stochastic systems,” Course script, 2011.
 [48] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning, pp. 448–456, 2015.
 [49] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
 [50] D. Q. Mayne, J. B. Rawlings, C. V. Rao, and P. O. Scokaert, “Constrained model predictive control: Stability and optimality,” Automatica, vol. 36, no. 6, pp. 789–814, 2000.
 [51] R. Tedrake, I. R. Manchester, M. Tobenkin, and J. W. Roberts, “LQR-trees: Feedback motion planning via sums-of-squares verification,” International Journal of Robotics Research, vol. 29, no. 8, pp. 1038–1052, 2010.
 [52] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440, 2015.