Automating Vehicles by Deep Reinforcement Learning using Task Separation with Hill Climbing

Within the context of autonomous vehicles, classical model-based control methods suffer from the trade-off between model complexity and the computational burden of solving expensive optimization or search problems online at every short sampling time. These methods include sampling-based algorithms, lattice-based algorithms and algorithms based on model predictive control (MPC). Recently, end-to-end trained deep neural networks were proposed to map camera images directly to steering control. These algorithms, however, a priori dismiss decades of vehicle dynamics modeling experience, which could be leveraged for control design. In this paper, a model-based reinforcement learning (RL) method is proposed for the training of feedforward controllers in the context of autonomous driving. The fundamental philosophy is to train offline on arbitrarily sophisticated models while cheaply evaluating a feedforward controller online, thereby avoiding the need for online optimization. The contributions are, first, the discussion of two closed-loop control architectures, and, second, the proposition of a simple gradient-free algorithm for deep reinforcement learning using task separation with hill climbing (TSHC). To this end, a) simultaneous training on separate deterministic tasks with the purpose of encoding motion primitives in a neural network, and b) the employment of maximally sparse rewards in combination with virtual actuator constraints on velocity in setpoint proximity are advocated. For feedforward controller parametrization, both fully connected (FC) and recurrent neural networks (RNNs) are used.



I Introduction

There exists a plethora of motion planning and control techniques for self-driving vehicles [1]. The diversity is caused by a core difficulty: the trade-off between model complexity and permitted online computation at short sampling times. Three popular control classes and recent vision-based end-to-end solutions are briefly summarized below.

I-A Model-based control methods

In [2] a sampling-based anytime RRT algorithm is discussed. The key notion is to refine an initial suboptimal path while it is being followed. As demonstrated, this is feasible when driving towards a static goal in a static environment. However, it may be problematic in dynamic environments that require constant replanning of paths, and where a suitable trajectory sampled online may not be returned in time. Other problems of online sampling-based methods are limited model complexity and their tendency to produce jagged controls that require a smoothing step, e.g., via conjugate gradient [3]. In [4], a lattice-based method is discussed. Such methods, and similarly those based on motion primitives [5, 6, 7, 8], are always limited by the size of the look-up table that can be searched in real time. In [4], a GPU is used for the search. In [9], linear time-varying model predictive control (LTV-MPC) is discussed for autonomous vehicles. While appealing for its ability to incorporate constraints, MPC must trade off model complexity against the computational burden of solving optimization problems online. Furthermore, MPC depends on state and input reference trajectories, typically for linearization of dynamics, but almost always also for providing a tracking reference. Therefore, a two-layered approach is often applied, with motion planning and tracking as the two layers [1]. See [10] for a method using geometric corridor planning in the first layer for reference generation and for the combinatorial decision on which side to overtake obstacles. As indicated in [9, Sect. V-A] and further emphasized in [11], the selection of reference velocities can become problematic for time-based MPC, which motivated the use of spatial-based system modeling. Vehicle dynamics can be incorporated by inflating obstacles [7]. For tight maneuvering, a linearization approach [12] is more accurate but computationally more expensive. To summarize, two core observations are made.
First, all methods (from sampling-based to MPC) are derived from vehicle models. Second, all of the above methods suffer from the real-time requirement of short sampling times. As a consequence, all methods make simplifications to the employed model. These include, e.g., omitting dynamical effects, tire dynamics or vehicle dimensions, using inflated obstacles, pruning search graphs, solving optimization problems iteratively, or precomputing trajectories offline.

Fig. 1: Closed-loop control system architecture. “Navi” and “Filter” (not the focus of this paper) map human route selections as well as extero- and proprioceptive measurements to a feature vector. This paper proposes a simple gradient-free algorithm (TSHC) for learning controller C, which maps the feature vector to the control action to be applied to the vehicle.

I-B Vision-based methods

In [13], a pioneering end-to-end trained neural network labeled ALVINN was used for steering control of an autonomous vehicle. Video and range measurements are fed to a fully connected (FC) network with a single hidden layer of 29 hidden units and an output layer with 45 direction output units, i.e., discretized steering angles, plus one road intensity feedback unit. ALVINN does not control velocity and is trained using supervised learning based on road “snapshots”. Similarly, the more recent DAVE-2 also only controls steering and is trained in a supervised manner. However, it outputs continuous steering actions and is composed of a network including convolutional neural network (CNN) layers as well as FC layers, with a total of 250000 parameters. During testing (i.e., after training), steering commands are generated from only a front-facing camera. Another end-to-end system based only on camera vision is presented in [15]. First, a driving intention (change to left lane, change to right lane, stay in lane, or brake) is determined, before the steering angle is output from a recurrent neural network (RNN). Instead of mapping images to steering control, in [16] and [17], affordance indicators (such as distance to cars in current and adjacent lanes) and feasible driving actions (such as straight, stop, left-turn, right-turn) are output from neural networks, respectively. See also [18] and their treatment of “option policies”. To summarize, one can distinguish between (i) vision-based end-to-end control, and (ii) perception-driven approaches that attempt to extract useful features from images. Note that such features (e.g., obstacle positions) are implicitly required for all methods from Sect. I-A.

I-C Motivation and Contribution

This work is motivated by the following additional considerations. First, as noted in [19], localization relative to lane boundaries is more important than with respect to GPS coordinates, which underlines the importance of lasers, lidars and cameras for automated driving. Second, vehicles are man- and woman-made products for which there exist decades of experience in vehicle dynamics modeling [20],[21]. There is no reason to a priori entirely discard this knowledge (for manufacturers it is present even in the form of construction plans). This motivates leveraging available vehicle models for control design. Consider also the position paper [22] for general limitations of end-to-end learning. Third, a general-purpose control setup is sought that avoids switching between different vehicle models and algorithms for, e.g., highway driving and parking. There also exists only one real-world vehicle. From that perspective, a complex vehicle model encompassing all driving scenarios is in general preferable for control design. Also, a model mismatch between the planning and tracking layers can result in paths that are infeasible to track [7]. Fourth, the most frequent accidents involving other moving vehicles are rear-end collisions [23], which are most often caused by inattentiveness or too-close following distances. Control methods that enable minimal sampling times, such as feedforward control, can deterministically increase safety through minimal reaction times. In contrast, environment motion prediction (which can also increase safety) always remains stochastic. Fifth, small sampling times may conflict with the use of complex vehicle models for control when these are applied in expensive online optimization or search problems. These considerations motivate a two-step procedure: first, learning a controller during offline training based on an arbitrarily complicated mathematical system model, and then fast online evaluation of the trained controller.
In an automated vehicle setting, this implies that, once trained, low-cost embedded hardware can be used online for the evaluation of only a few matrix-vector multiplications.

The contribution of this paper is a simple gradient-free algorithm for model-based deep reinforcement learning using task separation with hill climbing (TSHC). To this end, it is specifically proposed to (i) simultaneously train on separate deterministic tasks with the purpose of encoding motion primitives in a neural network, and (ii) employ, during training, maximally sparse rewards in combination with virtual velocity constraints (VVCs) in setpoint proximity.

This paper is organized as follows. Problem formulation, the proposed training algorithm and numerical simulation experiments are discussed in Sections II-IV, before concluding.

II Problem Formulation and Preliminaries

II-A General setup

The problem formulation is visualized in Fig. 1. Exteroceptive measurements are assumed to include inter-vehicular communication (car-2-car) sensing as well as communication with a centralized or decentralized coordination service such that, in general, multi-automated vehicle coordination is also enabled [24]. For learning of controller C, five core aspects are distinguished: the system model used for training, the neural network architecture used for function approximation, the training algorithm, the selection of training tasks, and the hardware/software implementation. The fundamental objective is to encode many desired motion primitives (training tasks) in a neural network. The main focus of this paper is on the training algorithm, motivated within the context of motion planning for autonomous vehicles characterized by nonholonomic system models.

II-B Illustrative system model for simulation experiments

For simplicity, an Euler-discretized nonlinear kinematic bicycle model [21] is assumed for the simulation experiments of Sect. IV. The equations of motion involve 3 states (position coordinates and heading), 2 controls (steering angle and velocity), 1 system parameter (the wheelbase, in m), and a discrete time index tied to the sampling time. The position coordinates describe the center of gravity (CoG) in the inertial frame, and the heading denotes the yaw angle relative to the inertial frame. Physical actuator absolute and rate constraints are treated as part of the vehicle model on which the network training is based. Thus, the continuous control vector is subject to both absolute and rate bounds on its elements. The minimum velocity is negative to permit reverse driving.
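As a minimal sketch of such a model, the following Python function performs one Euler step of a kinematic bicycle model. The specific form of the equations (the common textbook variant without a slip angle), the names `bicycle_step`, `wheelbase` and `ts`, and all numerical values are assumptions made for illustration and are not taken from the paper.

```python
import math

def bicycle_step(x, y, psi, v, delta, wheelbase=2.7, ts=0.01):
    """One Euler step of a kinematic bicycle model (assumed textbook form).

    x, y  : position in the inertial frame [m]
    psi   : yaw angle relative to the inertial frame [rad]
    v     : velocity control [m/s], negative values mean reverse driving
    delta : steering angle control [rad]
    wheelbase and ts (sampling time) are hypothetical example values.
    """
    x_next = x + ts * v * math.cos(psi)
    y_next = y + ts * v * math.sin(psi)
    psi_next = psi + ts * v / wheelbase * math.tan(delta)
    return x_next, y_next, psi_next

# Example: drive straight for 100 steps (1 s) at 10 m/s.
state = (0.0, 0.0, 0.0)
for _ in range(100):
    state = bicycle_step(*state, v=10.0, delta=0.0)
```

With zero steering the heading stays constant and only the x-coordinate advances, which is a quick sanity check before training on more involved maneuvers.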

II-C Comments on feature vector selection

While the mathematical system model used for training prescribes the state vector, this is not the case for the feature vector. The dimension of the feature vector may in general be much smaller than that of the system’s state space. In general, the feature vector may be an arbitrary function of filtered extero- and proprioceptive measurements according to Fig. 1. Thus, measurements from many different sensors may be compressed through filtering into a low-dimensional feature vector. Because of the curse of dimensionality, low-dimensional feature vectors are favorable, since the easiest way to generate training tasks is to grid over the elements of the feature vector. Note further that for our purpose of encoding specific motion primitives, the feature vector must always relate the current vehicle state to a goal state (e.g., via a difference operator). Certificates about learnt control performance can be provided by stating (i) the system model used for training, and (ii) the encoded motion primitives (training tasks) and their associated feature vectors. Finally, instead of only a single time instant, the feature vector may in general also represent a collection of multiple past measurements (a time series) leading up to the current time.

II-D Comments on computation

For perspective, deep learning using neural networks as function approximators is in general computationally very demanding. To underline the dimensions and computational efforts seen in practice, note that, for example, in [25] training is distributed over 80 machines and 1440 CPU cores. In [26], on an even larger scale, 1024 Tesla P100 GPUs are used in parallel. For perspective, one Tesla P100 offers a double-precision performance of 4.7 TeraFLOPS [27].

III Training Algorithm

This section motivates a simple gradient-free algorithm for learning of neural network controllers according to Fig. 1.

III-A Neural network controller parametrization

The controller in Fig. 1 may be parameterized by any of, e.g., FC layers, LSTM cells including peephole connections [28], GRUs [29] and variants. All neural network parameter weights to be learnt are initialized by Gaussian-distributed variables with zero mean and a small standard deviation. An exception is the forget gate biases of LSTM cells, to which a 1 is added as recommended in [30], and which are thus initialized with mean 1. In the proposed setting, the affine part of all FC layers is followed by nonlinear tanh activation functions acting elementwise. Because of their bounded outputs, saturating nonlinearities are preferred over ReLUs, which are used for the hidden layers in other RL settings [31] but can result in large unbounded changes of layer outputs. Before entering the neural network, the feature vector is normalized elementwise (accounting for the typical range of feature vector elements). The final FC layer also comprises a tanh activation. It accordingly outputs bounded continuous values, which are then affinely scaled to the control vector via the physical actuator absolute and rate constraints valid at the current time.
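The parametrization described above can be sketched in plain Python for the FC case. This is an illustrative sketch only: the function names `init_layers` and `controller`, the layer sizes, the standard deviation 0.01, the feature scaling and all bound values are assumptions, not values from the paper.

```python
import math
import random

def init_layers(sizes, std=0.01, seed=0):
    """Zero-mean Gaussian initialization with a small (assumed) std for all FC weights."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]
        biases = [rng.gauss(0.0, std) for _ in range(n_out)]
        layers.append((weights, biases))
    return layers

def controller(layers, features, feature_scale, u_prev, u_min, u_max, du_max):
    """Map a feature vector to controls that respect absolute and rate bounds."""
    # Elementwise normalization by the typical range of each feature.
    h = [f / s for f, s in zip(features, feature_scale)]
    # Every affine part is followed by tanh, including the final layer.
    for weights, biases in layers:
        h = [math.tanh(sum(w * x for w, x in zip(row, h)) + b)
             for row, b in zip(weights, biases)]
    controls = []
    for out, prev, lo_abs, hi_abs, rate in zip(h, u_prev, u_min, u_max, du_max):
        # Bounds valid at this time step: absolute limits tightened by rate limits.
        lo = max(lo_abs, prev - rate)
        hi = min(hi_abs, prev + rate)
        # Affine map from the tanh range [-1, 1] to [lo, hi].
        controls.append(lo + 0.5 * (out + 1.0) * (hi - lo))
    return controls

# Hypothetical example: 3 features, 8 hidden units, 2 controls (velocity, steering).
layers = init_layers([3, 8, 2])
u = controller(layers, [5.0, -2.0, 0.3], [10.0, 10.0, math.pi],
               u_prev=[0.0, 0.0], u_min=[-3.0, -0.6], u_max=[30.0, 0.6],
               du_max=[0.5, 0.05])
```

Because the output layer is bounded, the scaled controls satisfy the actuator constraints by construction, regardless of the network weights.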

So far, a continuous control vector was assumed. A remark with respect to gear selection is made. Electric vehicles, which appear suitable to curb urban pollution, do not require gearboxes. Nevertheless, in general the control vector can be extended to include a discrete gear as an additional decision variable. Suppose a finite number of gears is available. Then the output layer can be extended by one channel per gear, with each channel output representing a normalized probability of gear selection as a function of the feature vector, which can be trained by means of a softmax classifier.
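A minimal sketch of such a gear-selection head, assuming four gears and made-up channel outputs (both are hypothetical), normalizes the extra output channels with a softmax and picks the most probable gear:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of channel outputs."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw outputs of four extra gear channels.
gear_logits = [0.2, 1.5, -0.3, 0.0]
probs = softmax(gear_logits)
gear = probs.index(max(probs)) + 1  # gears numbered from 1
```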

III-B Reward shaping



Fig. 2: The problem with rich rewards. Three scenarios (a), (b) and (c) indicating different start (black) and goal (red dashed) states (position and heading). For (c), an obstacle is added. See Sect. III-B for discussion.

Reward shaping is crucial for the success of learning by reinforcement signals [32]. However, reward shaping was found to be far from trivial in practical problems. Therefore, our preferred choice is motivated in detail. In most practical control problems, the state at the current time is given and a desired goal state is known. Not known, however, are the shape of the best trajectory (w.r.t. a given criterion) and the control signals that realize that trajectory. Thus, by nature these problems offer a sparse reward signal, received only upon reaching the desired goal state at some later time. In the following, alternative rich reward signals and curriculum learning [33] are discussed.

III-B1 The problem of designing rich reward signals

A reward signal is labeled as rich when it is time-varying as a function of states, controls and the feature vector. Note that the design of any such signal is heuristic and motivated by the hope for accelerated learning through maximally frequent feedback. In the following, the problem with rich rewards is exposed. First, let position, heading and velocity errors relate states to desired goals, and let a binary flag indicate whether the desired goal pose is reached, i.e., whether all of these errors lie below small tolerance hyperparameters (1). Then, suppose a rich reward signal is designed as a weighted linear combination of different measures. This class of reward signals, trading off various terms and providing feedback at every time step, occurs frequently in the literature [34, 35, 31, 36]. However, as will be shown, for trajectory planning in an automotive setting (especially due to nonholonomic vehicle models), it may easily lead to undesirable behavior. Consider case (a) in Fig. 2 and a finite maximum simulation time. Then, omitting a discount factor for brevity, the no-movement solution may incur more accumulated reward than the true solution.

Similarly, for specific weights, the second scenario (b) in Fig. 2 can return a no-movement solution since the initial angle already coincides with the target angle. Hence, for a specific weight combination, the accumulated reward when not moving may exceed the value of the actual solution.

The third scenario (c) in Fig. 2 shows that even if rich rewards are reduced to a single measure, an undesired standstill may result. This occurs especially in the presence of obstacles (and maze-like situations in general).

To summarize, for a finite maximum simulation time, the design of rich reward signals is not straightforward and can easily result in solution trajectories that may even be globally optimal w.r.t. accumulated reward, but fail to solve the original problem of determining a trajectory from the initial to the target state.

III-B2 The problem with curriculum learning

In [33], curriculum learning (CL) is discussed as a method to speed up learning by providing the learning agent first with simpler examples before gradually increasing complexity. Analogies to humans and animals are drawn. The same paper also acknowledges the difficulty of determining “interesting” examples [33, Sect. 7] that optimize learning progress.


Fig. 3: The problem with curriculum learning. The difficulty of selecting “simple” examples is illustrated, see Sect. III-B for discussion. The original problem with start (black) and goal (red dashed) state is shown in (a). A “simpler” problem is given in (b).

Indeed, CL entails the following issues. First, “simpler” tasks need to be identified. This is not straightforward as discussed shortly. Second, these tasks must first be solved before their result can serve as initialization to more complex tasks. In contrast, without CL, the entire solution time can be devoted to the complex tasks rather than being partitioned into easier and difficult tasks. In experiments, this was found to be relevant. Third, the solution of an easier task does not necessarily represent a better initialization to a harder problem in comparison to an alternative random initialization. For example, consider the scenario in Fig. 3. The solution of the simpler task does not serve as a better initialization than a purely random initialization of weights. This is since the final solution requires outreaching steering and possibly reversing of the vehicle. The simpler task just requires forward driving and stopping. This simple example illustrates the need for careful manual selection of suitable easier tasks for CL.

III-B3 The benefits of maximally sparse rewards in combination with virtual velocity constraints

In the course of this work, many reward shaping methods were tested. These include, first, solving “simpler” tasks by initially limiting target angles to a bounded deviation from the initial heading. Second, tolerances were initially relaxed before gradually decreasing them. Third, it was tested to first solve a task for only one criterion, then for two, and only finally for all three. Here, varying sequences of criteria were also tested. No consistent improvement could be observed for any of these methods. On the contrary, solving allegedly simpler tasks reduced the available solver time for the original “hard” problems. Without CL, the entire solution time can be devoted to the complex tasks.

Based on these findings, our preferred reward design method is maximally sparse and defined by


where the goal flag is taken from (1), and a second indicator flag marks a vehicle crash. Thus, once the goal flag is set, the RL problem is considered solved. In addition, the pathlength incurred for a transition from one sampling time to the next is defined as


As elaborated below, accumulated total pathlength is used to rank solution candidates solving all desired training tasks.
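The ingredients discussed so far (goal flag, maximally sparse reward, pathlength increment) can be sketched as follows. All concrete forms here are assumptions for illustration: the error definitions, the reward values (-1 per step before the goal is reached, 0 afterwards, negative infinity on a crash), and the Euclidean pathlength are plausible choices and not necessarily the exact definitions used in the paper.

```python
import math

def goal_reached(state, goal, eps_d, eps_psi, eps_v):
    """Binary goal flag: all errors below their tolerances (assumed error definitions)."""
    x, y, psi, v = state
    xg, yg, psig, vg = goal
    e_d = math.hypot(xg - x, yg - y)                      # position error
    e_psi = abs(math.atan2(math.sin(psig - psi),
                           math.cos(psig - psi)))         # wrapped heading error
    e_v = abs(vg - v)                                     # velocity error
    return e_d < eps_d and e_psi < eps_psi and e_v < eps_v

def sparse_reward(reached, crashed):
    """Maximally sparse reward (assumed values): constant penalty until the goal."""
    if crashed:
        return -math.inf
    return 0.0 if reached else -1.0

def path_increment(p0, p1):
    """Pathlength of one transition, assumed to be the Euclidean distance."""
    return math.hypot(p1[0] - p0[0], p1[1] - p0[1])

# Hypothetical example values.
flag = goal_reached((0.0, 0.0, 0.0, 0.0), (0.05, 0.0, 0.01, 0.0),
                    eps_d=0.1, eps_psi=0.05, eps_v=0.5)
step_length = path_increment((0.0, 0.0), (3.0, 4.0))
```

Note that no weights are needed to combine position, heading and velocity, which is one practical appeal of the sparse formulation.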

The integral for is defined for generality, in particular with problems such as the inverted pendulum [37] in mind, which are considered solved only after stabilization is demonstrated for sufficiently many consecutive time steps. Note, however, that this is not required for an automotive setting. Here, it must be . Only then learning with is possible. Other criteria and trade-offs for are possible (e.g., accumulated curvature of resulting paths and a minmax objective thereof). The negation is introduced for maximization (“hill climbing” convention). Note that the preferred reward signal is maximally sparse, returning , for all times up until reaching the target. It represents a tabula rasa solution criticized in [34] for its maximal sparsity. Indeed, on its own it was not sufficient to facilitate learning when also accounting for a velocity target . Therefore, virtual velocity constraints (VVCs) in target proximity are introduced. Two variants are discussed. First, VVCs spatially dependent on can be defined as


where , and is a hyperparameter (e.g., range-view length or a heuristic constant). Second and alternatively, VVCs may be defined as spatially invariant with a constant margin (e.g., 5 km/h) around the target velocity. For both variants, the neural network output that regulates velocity is scaled with the updated minimum and maximum velocity constraints (i.e., using (5) for spatially dependent VVCs).
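One way to picture spatially dependent VVCs is the sketch below, in which the velocity bounds shrink affinely towards the target velocity once the vehicle is within a given range of the goal. The exact functional form, the name `virtual_velocity_bounds` and the parameter `d_range` are assumptions for illustration and may differ from equation (5) of the paper.

```python
def virtual_velocity_bounds(dist_to_goal, v_goal, v_min, v_max, d_range):
    """Velocity bounds that converge affinely to v_goal near the goal (assumed form).

    dist_to_goal : current distance to the goal location [m]
    d_range      : hypothetical hyperparameter; beyond it the physical bounds apply
    Assumes v_min <= v_goal <= v_max.
    """
    alpha = min(max(dist_to_goal / d_range, 0.0), 1.0)  # 0 at the goal, 1 far away
    lo = v_goal + alpha * (v_min - v_goal)
    hi = v_goal + alpha * (v_max - v_goal)
    return lo, hi

# Hypothetical example: parking (target velocity 0) with a 20 m range.
far = virtual_velocity_bounds(25.0, 0.0, -3.0, 30.0, 20.0)
mid = virtual_velocity_bounds(10.0, 0.0, -3.0, 30.0, 20.0)
at_goal = virtual_velocity_bounds(0.0, 0.0, -3.0, 30.0, 20.0)
```

The returned bounds would replace the physical velocity limits when the network output is scaled, so that reaching the goal location implies reaching the target velocity.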

Let us further legitimize VVCs. Since speed is a decision variable, it can always be constrained artificially. This justifies the introduction of VVCs. In (5), bounds are set to converge affinely towards the target velocity in the proximity of the goal location. This is a heuristic choice. Note that the affine choice does not necessarily imply constant accelerations, because (5) is spatially parameterized. Note further that physical actuator rate constraints still hold when the control is applied to the vehicle.

It was also tested to constrain . The final heading pose implies circles that are prohibited from being trespassed because of the nonholonomic vehicle dynamics. It was tested to add these as virtual obstacles. However, this did not accelerate learning.

Finally, note that VVCs artificially introduce hard constraints and thus shape the learning result w.r.t. velocity, at least towards the end of the trajectory. Two comments are made. First, in receding online operation, with additional frequent resetting of targets, this shaping effect is reduced since only the first control of a planned trajectory is applied. Second, in the case of spatially dependent VVCs, the influence of the hyperparameter in (5) only becomes apparent during parking when following the trajectory up until standstill. Here, however, no significant velocity changes are desired, such that its choice is not decisive. Ultimately, note that sparse rewards naturally avoid the need to introduce trade-off hyperparameters for the weighting of states in different units. This permits solution trajectories between start and goal poses to evolve naturally without biasing them by provision of rich references to track.

To summarize this section, it was illustrated that the design of rich reward signals as well as curriculum learning can be problematic. Therefore, maximally sparse rewards in combination with virtual velocity constraints are proposed.

III-C The role of tolerances

Tolerances in (1) play an important role for two reasons. On the one hand, nonzero tolerances result in deviations between the actually learnt and the originally desired goal pose. On the other hand, very small tolerances (e.g., m, and km/h) prolong learning time. Two scenarios apply.

First, for a network trained on a large-scale and dense grid of training tasks and for small tolerances, suitable control commands are naturally interpolated during online operation, even for setpoints not seen during training. The concept of natural interpolation through motion primitives encoded in neural networks is the core advantage over methods relying on look-up tables with stored trajectories, which require solving time-critical search problems. For example, in [4] an exhaustive search of the entire lattice graph is conducted online on a GPU. In [8], a total of about 100 motion primitives is considered. Then, an integer program is solved online by enumeration, using maximal progress along the centerline as the criterion for selecting the best motion primitive. In contrast, for control using neural networks as function approximators this search is not required.

Second, the scenario was considered in which existing training hardware (i) does not permit large-scale encoding, and (ii) only permits the use of larger tolerances to limit training time. Therefore, the following method is devised. First, tuples are stored for each training task. Then, during online operation, for any setpoint, , the closest (according to a criterion) from the set of training tasks is searched, before the corresponding is applied to the network controller. Two comments are made. First, in order to reach (with zero deviation), must be applied to the network. Therefore, tuples need to be stored. Second, even though this method now also includes a search, it still holds an important advantage over lattice-based methods. This is the compression of the look-up table in the network weights. Hence, only tuples need to be stored, not entire trajectories. This is especially relevant in view of limited hardware memory. Thus, through encoding, potentially many more motion primitives can be stored.
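The look-up step of this second scenario can be sketched as a nearest-neighbor search over stored tuples. The tuple layout (a goal pose as key, paired with the network input to apply), the weighted Euclidean criterion and all values are hypothetical choices for illustration; heading wrap-around is ignored for brevity.

```python
import math

def nearest_task(setpoint, stored_tasks, weights=(1.0, 1.0, 1.0)):
    """Return the stored tuple whose key is closest to the requested setpoint."""
    def distance(a, b):
        # Weighted Euclidean distance (an assumed selection criterion).
        return math.sqrt(sum(w * (p - q) ** 2 for w, p, q in zip(weights, a, b)))
    return min(stored_tasks, key=lambda task: distance(setpoint, task[0]))

# Hypothetical store: (goal pose (x, y, heading), label of the network input to apply).
stored_tasks = [
    ((10.0, 0.0, 0.0), "features_A"),
    ((10.0, 5.0, 0.5), "features_B"),
    ((20.0, -5.0, -0.5), "features_C"),
]
key, features = nearest_task((11.0, 4.0, 0.4), stored_tasks)
```

Only one small tuple per training task is kept in memory here, whereas the trajectories themselves remain encoded in the network weights.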

In practice, the first scenario is preferable. It is also implementable for 2 reasons. First, see Sect. II-D for computational opportunities. Second, neural networks have in principle unlimited function approximation capability [38]. Hence, the implementation of the first approach is purely a question of intelligent task setup, and computational power.

III-D Main Algorithm – TSHC

Algorithm 1 is proposed for simple gradient-free model-based reinforcement learning. The name is derived from the fact of (i) learning from separate training tasks, and (ii) a hill climbing update of parameters (greedy local search).

Let us elaborate on definitions. Analysis is provided in Sect. III-E. First, all network parameters are lumped into variable . Second, the perturbation Step 8 in Algorithm 1 has to be interpreted accordingly. It implies parameter-wise affine perturbations with zero-mean Gaussian noise and spherical variance . Third, , and in Steps 14-16 denote functional mappings between properties defined in the preceding sections. Fourth, hyperparameters are stated in Step 1. While , , , and denote lengths of different iterations, is used for updating of in Steps 35 and 37. Fifth, for every restart iteration, , multiple parameter iterations are conducted, at most many. Sixth, in Steps 25 and 29 hill climbing is conducted when (i) all tasks have been solved for current , or (ii) not all tasks have yet been solved, respectively. Seventh, there are two steps in which an early termination of iterations may occur: Steps 21 and 41. The former is a must. Only then learning with is possible. The latter termination criterion in Step 41 is optional. If dismissed, a refinement step is implied. Thus, even though all tasks have been solved, parameter iterations (up until ) are continued. Eighth, note that a discount hyperparameter , common to gradient-based RL methods [39], is not required, since it is irrelevant in the maximally sparse reward setting. Ninth, nested parallelization is in principle possible with an inner and outer parallelization of Steps 10-22 and 7-22, respectively. The former refers to solutions for a given parameter vector , whereas the latter parallelizes parameter perturbations. For final experiments, Steps 7-22 were implemented asynchronously. Finally, there are three options considered for -selection. First, holding an initial -selection constant throughout TSHC. Second, updating randomly (e.g., uniformly distributed between 10 and 1000), whereby this can be implemented either in Step 4 at every , or in Step 6 at every -combination. Third, may be adapted according to progress in , as outlined in Algorithm 1. For the first two options of selecting , Steps 34-37 are dismissed and at least can be dismissed from the list of hyperparameters in Step 1.

1 Input: system model, network structure, training tasks; , , , , ; and a method to update : constant, random or adaptive based on (, , ). Initialize , , , . for  do
2        Initialize randomly, and . for  do
3               % RUN ASYNCHRONOUSLY: for  do
4                      Perturb , with . Initialize , , . for  do
5                             Initialize (and LSTM and GRU cells). for  do
6                                    Read from -environment. . . . . . according to (3). if  then
7                                          Break -loop.
9                            .
11              % DETERMINE : if  then
12                      . if  then
13                             .
15              else
16                      . if  then
17                             .
19              . % UPDATE PARAMETERS: if  then
20                      .
21              else if  then
22                      .
23               and . % OPTIONAL: if  then
24                      Break -loop. % no further refinement step.
Output: .
Algorithm 1 TSHC
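To make the overall structure more tangible, the following is a strongly simplified Python sketch in the spirit of TSHC on a toy problem. It is not the paper's Algorithm 1: the one-dimensional "vehicle", the two-parameter tanh controller, the function names (`rollout`, `evaluate`, `tshc`), all hyperparameter values and the constant perturbation level are assumptions for illustration, and adaptive noise updates and asynchronous execution are omitted. It retains the main ingredients: restarts, Gaussian parameter perturbations, separate deterministic tasks, early termination of a task once its goal is reached, and a greedy hill-climbing update that ranks by accumulated sparse reward while not all tasks are solved and by total pathlength once they are.

```python
import math
import random

def rollout(theta, goal, t_max=40, ts=0.5, tol=0.1):
    """Simulate one deterministic task; return (solved, steps, pathlength)."""
    w, b = theta
    pos, path = 0.0, 0.0
    for t in range(t_max):
        if abs(goal - pos) < tol:            # goal flag set: stop this task early
            return True, t, path
        u = math.tanh(w * (goal - pos) + b)  # bounded control from a tiny "network"
        pos += ts * u
        path += abs(ts * u)
    return abs(goal - pos) < tol, t_max, path

def evaluate(theta, goals):
    """Score a parameter vector on all tasks; larger tuples are better."""
    results = [rollout(theta, g) for g in goals]
    if all(solved for solved, _, _ in results):
        # All tasks solved: rank by (negated) total pathlength.
        return (1, -sum(path for _, _, path in results))
    # Otherwise: rank by accumulated sparse reward (-1 per step, assumed).
    return (0, -float(sum(steps for _, steps, _ in results)))

def tshc(goals, n_restarts=3, n_iter=50, n_perturb=20, sigma=0.5, seed=0):
    rng = random.Random(seed)
    best_theta, best_score = None, (-1, 0.0)
    for _ in range(n_restarts):                      # restart iterations
        theta = [rng.gauss(0.0, 0.01) for _ in range(2)]
        score = evaluate(theta, goals)
        for _ in range(n_iter):                      # parameter iterations
            candidates = [[p + rng.gauss(0.0, sigma) for p in theta]
                          for _ in range(n_perturb)]
            cand_score, cand = max((evaluate(c, goals), c) for c in candidates)
            if cand_score > score:                   # greedy hill-climbing update
                theta, score = cand, cand_score
        if score > best_score:
            best_theta, best_score = theta, score
    return best_theta, best_score

goals = [-3.0, 2.0, 4.0]        # three separate training tasks
theta, score = tshc(goals)
```

A first score entry of 1 indicates that the returned parameters solve all tasks. With sparse rewards, many perturbations score identically until at least one task is solved, which is why a sufficiently large number of perturbations and restarts matters.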

III-E Analysis

According to the classifications in [40], TSHC is a gradient-free instance-based simulation optimization method, generating new candidate solutions based only on the current solution and random search in its neighborhood. Because of its hill climbing (greedy) characteristic, it differs (i) from evolutionary (population-based) methods that construct solutions by combining others, typically using weighted averaging [41, 25], and (ii) from model-based methods that use probability distributions on the space of solution candidates, see [40] for a survey. In its high-level structure, Algorithm 1 can be related to the COMPASS algorithm [42]. Within a global stage, COMPASS identifies several possible regions with locally optimal solutions. Then, it finds locally optimal solutions for each of the identified regions, before it selects the best solution among all identified locally optimal solutions. In our setting, these regions are enforced as the separate training tasks and the best solution for all of these is selected.

In combination with sufficiently large , must be large enough to permit sufficient exploration such that a network parametrization solving all tasks can be found. In contrast, the effect of decreasing with an increasing number of solved tasks is that, ideally, a speedup in learning progress results from the assignment of more solution candidates closer in variance to a promising (see Step 8 of TSHC).

Steps 29-31 are discussed. For the case that for a specific -iteration not all tasks have yet been solved, has been considered as an alternative criterion for Step 29. Several remarks can be made. First, Step 29 and the alternative are not equal. This is because, in general, different tasks are solved in a different number of time steps. However, the criteria are approximately equivalent for sparse rewards (since accumulates constants according to (2)), and especially for large . The core advantage of employing Step 29 in TSHC is that it can, if desired, also be used in combination with rich rewards to accelerate learning progress (if a suitable rich reward signal can be generated). In such a scenario, according to Step 29 is updated towards the most promising , then representing the accumulated rich reward. Thus, in contrast to (2), a rich reward could be represented by a weighted sum of squared errors between state and a reference ,


where , are trade-off hyperparameters and scalar elements of vectors are indexed by in brackets. Another advantage of the design in Algorithm 1 according to Steps 29-31 is its anytime solution character. Even if not all are solved, the solution returned for the tasks that are solved is typically of good quality and optimized according to Steps 29-31.
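A rich reward of this kind can be sketched in a few lines. The function name `rich_reward`, the argument names and the negative sign (so that a larger value means a smaller deviation) are assumptions for illustration, not the original formulation.

```python
def rich_reward(z, z_ref, weights):
    """Negative weighted sum of squared errors between state z and reference z_ref.
    The sign convention (larger is better) is an assumption of this sketch."""
    if not (len(z) == len(z_ref) == len(weights)):
        raise ValueError("z, z_ref and weights must have equal length")
    return -sum(w * (a - b) ** 2 for w, a, b in zip(weights, z, z_ref))
```

For example, with state `[1.0, 2.0]`, reference `[0.0, 0.0]` and weights `[1.0, 0.5]`, the per-step reward is -(1.0·1 + 0.5·4) = -3.0, and it is zero exactly at the reference.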

If for all tasks there exists a feasible solution for a given system model and a sufficiently expressive network structure parameterized by , then Algorithm 1 can find such a parametrization for sufficiently large hyperparameters , , , and . The solution parametrization is the result of the initialization in Step 4 and the parameter perturbations according to Step 8, both nested within multiple iterations. As noted in [43], for optimization via simulation, a global convergence guarantee provides little practical meaning other than reassuring that a solution will be found “eventually” when simulation effort goes to infinity. However, the same reference also states that a convergence property is most meaningful if it can help in designing suitable stopping criteria. In our case, there are 2 such conceptual levels of stopping criteria: first, the solution of all training tasks, and second, the refinement of solutions.

Control design is implemented hierarchically in 2 steps. First, suitable training tasks (desired motion primitives) are defined. Then, these are encoded in the network by the application of TSHC. This has practical implications. First, it encourages training on deterministic tasks. Furthermore, at every , training is conducted simultaneously on all of these separate tasks. This is beneficial in that the best parametrization, , is clearly defined via Step 25, maximizing the accumulated -measure over all tasks. Second, it enables certificates on the learnt performance, which can be provided by stating (i) the employed vehicle model, and (ii) the list of encoded tasks (motion primitives). Note that such certificates cannot be given for the class of stochastic continuous action RL algorithms that are derived from the Stochastic Policy Gradient Theorem [44]. This class includes all stochastic actor-critic algorithms, including A3C [45] and PPO [39].

III-F Discussion and comparison with related RL work

Related continuous control methods that use neural networks for function approximation are discussed, focusing on one stochastic policy gradient method [39], one deterministic policy gradient method [31], and one evolution strategy [25]. The methods are discussed in detail to underline aspects of TSHC.

First, the stochastic policy gradient method PPO [39] is discussed. Suppose that a stochastic continuous control vector is sampled from a Gaussian distribution parameterized by such that . (In this setting, mean and variance of the Gaussian distribution are the output of a neural network whose parameters are summarized by lumped .) Then,


is defined as the expected accumulated and time-discounted reward when at drawing , and following the stochastic policy for all subsequent times when acting in the simulation environment. Since function is a priori not known, it is parameterized by

and estimated. Using RL-terminology, in the PPO-setting,

represents the advantage function. Then, using the “log-likelihood trick”, and subsequently a first-order Taylor approximation of around some reference , the following parameterized cost function is obtained as an approximation of (7),


Finally, (8) is modified to the final PPO-cost function [39]


whereby the advantage function is estimated by the policy parameterized by , which is run for consecutive time steps such that for all the tuples can be added to a replay buffer, from which later minibatches are drawn. According to [39], the estimate is with , and so forth until , and where represents a second, the so-called critic neural network. Then, using uniform randomly drawn minibatches of size , parameters of both networks are updated according to and , with denoting the argument of the expectation in (9) evaluated at time-index . This relatively detailed discussion is given to underline the following observations. First, with the introduction of a parameterized estimator, then a first-order Taylor approximation, and then clipping, (9) is an arguably crude approximation of the original problem (7). Second, the complexity of maintaining both an actor and a critic network is noted. Typically, both are of the same dimensions apart from the output layers. Hence, when not sharing weights between the networks, approximately twice as many parameters are required. However, when sharing any weights between actor and critic network, optimization function (9) must be extended accordingly, which introduces another approximation step. Third, note that gradients of both networks must be computed for backpropagation. Fourth, the dependence on rich reward signals is stressed. As long as the current policy does not find a solution candidate, in a sparse reward setting, all are uniform. Hence, there is no information permitting to find a suitable parameter update direction and all of the computationally expensive gradient computations are essentially not usable. (Typically, the first, for example, 50000 samples are collected without parameter update. However, even then that threshold must be selected, and the fundamental problem persists.) Thus, the network parameters are still updated entirely at random. Moreover, even if a solution candidate trajectory was found, it is easily averaged out through the random minibatch update. This underlines the problem of sparse rewards for PPO. Fifth, A3C [45] and PPO [39] are by nature stochastic policies, which draw their controls from a Gaussian distribution (for which mean and variance are the output of a trained network with current state as its input). Hence, exact repetition of any task (e.g., the navigation between 2 locations) cannot be guaranteed. It can only be guaranteed if dismissing the variance component, and consequently using solely the mean for deterministic control. This can be done in practice but introduces another approximation step.
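The clipping step and the sparse-reward issue can be made concrete with a minimal sketch of the clipped surrogate objective in its commonly stated form from [39]. The function name `ppo_clip_objective` and the default clipping range of 0.2 are assumptions for illustration. The inputs are the probability ratios between new and reference policy and the estimated advantages.

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over a minibatch."""
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return float(np.mean(np.minimum(unclipped, clipped)))
```

If all advantage estimates in a minibatch are zero, as can occur with sparse rewards before any solution candidate has been found, the objective evaluates to zero for every ratio. It then carries no information about a useful update direction.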

Deterministic policy gradient method DDPG [31] is discussed. Suppose a deterministic continuous control vector parameterized such that . Then, the following cost function is defined,

Its gradient can now be computed by applying the chain rule for derivatives [46]. Introducing a parameterized estimate of , which here represents the Q-function or action value function (in contrast to the advantage function in the above stochastic setting), the final DDPG-cost function [31] is

Then, using minibatches, critic and actor network parameters are updated as and , with slowly tracking target network parameters . Several remarks can be made. First, the Q-function is updated towards only its one-step ahead target. Rewards are therefore propagated very slowly. For sparse rewards this is even more problematic than for rich rewards, especially because of the additional danger of averaging out important update directions through random minibatch sampling. Furthermore, and analogous to the stochastic setting, for the sparse reward setting, as long as no solution trajectory was found, all of the gradient computations are not usable and all network parameters are still updated entirely at random. DDPG is an off-policy algorithm. In [31], exploration of the simulation environment is achieved according to the current policy plus additive noise following an Ornstein-Uhlenbeck process. This is a mean-reverting linear stochastic differential equation [47]. A first-order Euler approximation thereof can be expressed as the action exploration rule , with hyperparameters in [31]. This detail is provided to stress a key difference between policy gradient methods (both stochastic and deterministic), and methods such as [25] and TSHC. Namely, while the former methods sample controls from the stochastic policy or according to heuristic exploration noise before updating parameters using minibatches of incremental tuples plus for PPO, the latter directly work in the parameter space via local perturbations, see Step 8 of Algorithm 1. This approach appears particularly suitable when dealing with sparse rewards.
As outlined above, in such a setting, parameter updates according to policy gradient methods are also entirely at random, however, with the computationally significant difference of, first, an approximately four times larger parameter space and, second, the unnecessarily costly solution of non-convex optimization problems as long as no solution trajectory has been found. A well-known issue in training neural networks is the problem of vanishing or exploding gradients. It is particularly relevant for networks with saturating nonlinearities and can be addressed by batch [48] and layer normalization [49]. In both normalization approaches, additional parameters are introduced to the network which must be learnt (bias and gains). These issues are not relevant for the proposed gradient-free approach.
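For reference, the action-space exploration noise discussed for DDPG can be sketched as a first-order Euler step of an Ornstein-Uhlenbeck process. The function name `ou_step`, the unit step size and the default values 0.15 and 0.2 (values commonly quoted for [31]) are assumptions of this sketch and should be checked against [31] before use.

```python
import random

def ou_step(x, theta=0.15, sigma=0.2, mu=0.0, dt=1.0, rng=random):
    """One Euler step of dx = theta * (mu - x) * dt + sigma * dW.
    The returned value is the next noise sample, which would be added
    to the deterministic policy output to obtain the explored action."""
    return x + theta * (mu - x) * dt + sigma * (dt ** 0.5) * rng.gauss(0.0, 1.0)
```

Here the noise perturbs each individual action along a rollout. In contrast, parameter-space methods such as TSHC perturb the network parameters once per candidate and then run the resulting deterministic policy unchanged.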

Fig. 4: Experiment 1. 1000 training trajectories resulting from the application of DDPG (Left), PPO (Middle) and TSHC (Right), respectively. The effect of virtual constraints on velocity is particularly visible for DDPG. For the given hyperparameter setting [39, Tab. 3], the trajectories for PPO have little spread and favor reverse driving. TSHC has a much better exploration strategy resulting from noise perturbations in the parameter space. The task is solved by TSHC in only 2.1s of learning time, when terminating upon the first solution found (no refinement step, no additional restart).

This paper is originally inspired by and most closely related to [25]. The main differences are discussed. The latter evolutionary (population-based) strategy updates parameters using a stochastic gradient estimate. Thus, it updates , where hyperparameters and denote the learning rate and noise standard deviation, and where here indicates the stochastic scalar return provided by the simulation environment. This weighted averaging approach for the stochastic gradient estimate is not suitable for our control design method when using separate deterministic training tasks in combination with maximally sparse rewards. Here, hill climbing is more appropriate. This is because most of the trajectory candidates do not end up at and are therefore not useful. Note also that only the introduction of virtual velocity constraints permitted us to quickly train with maximally sparse rewards. It is well known that for gradient-based training, especially of RNNs, the learning rate ( in [25]) is a critical hyperparameter choice. In the hill climbing setting this issue does not occur. Likewise, fitness shaping [41], also used in [25], is not required. Note that above has the same role as , except that in our setting it is additionally adaptive according to Steps 29 and 31 in Algorithm 1. As implemented, this is only possible when training on multiple separate tasks. Other differences include the parallelization method in [25], where random seeds shared among workers permit each worker to only need to send and receive the scalar return of an episode to and from each other worker. All perturbations and parameters are then reconstructed locally by each worker. Thus, for workers there are reconstructions at each parameter-iteration step. This requires precise control of each worker and can in rare cases lead to differing CPU utilizations among workers due to differing episode lengths. Therefore, they use a capping strategy on maximal episode length.
In contrast, our proposed method is less sophisticated with one synchronized parameter update, which is then sent to all workers.
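The difference between the weighted-averaging update of [25] and a greedy hill climbing step can be sketched as follows. The evolution strategy step uses the commonly stated form, in which the parameters move along the return-weighted average of the perturbations. The function names, the default learning rate and noise level, and the toy numbers are assumptions for illustration.

```python
import numpy as np

def es_update(theta, eps, returns, alpha=0.01, sigma=0.1):
    """Weighted averaging: theta + alpha / (n * sigma) * sum_i R_i * eps_i."""
    returns = np.asarray(returns, dtype=float)
    grad_est = eps.T @ returns / (len(returns) * sigma)
    return theta + alpha * grad_est

def hill_climb_update(theta, eps, returns, best_return, sigma=0.1):
    """Greedy step: adopt the best perturbed candidate only if it improves."""
    i = int(np.argmax(returns))
    if returns[i] > best_return:
        return theta + sigma * eps[i], returns[i]
    return theta, best_return
```

With maximally sparse returns such as `[0, 0, 1, 0]`, the averaging step moves only a small, learning-rate-dependent distance towards the single successful candidate, whereas the greedy step adopts that candidate's parameters directly and needs no learning rate.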

IV Numerical Simulations

This section highlights different aspects of Algorithm 1. Numerical simulations of Sect. IV-A and IV-B were conducted on a laptop with an Intel Core i7 CPU @2.80GHz×8 and 15.6GB of memory, with Python’s numpy and multiprocessing as the only libraries employed. Furthermore, in Sect. IV-A, Tensorflow (without GPU-support) was used for the implementation of the 2 comparative policy gradient methods. Using these (for deep learning) very limited resources made it possible to evaluate the method’s potential when significant computational power is not available. For more complex problems, the latter is a necessity. Therefore, in Sect. IV-C, TSHC is implemented in Cuda C++ and 1 GPU is used.

IV-A Experiment 1: Comparison with policy gradient methods

DDPG     PPO      TSHC
19078    18440    4610
TABLE I: Experiment 1. Number of scalar parameters (weights) that need to be identified for DDPG, PPO and TSHC, respectively. TSHC requires the fewest by a large margin, roughly a factor of 4. That PPO here requires exactly four times the number of parameters of TSHC is a special case for controls (not generalizable to arbitrary ).

To underline conceptual differences between TSHC and the 2 policy gradient methods DDPG [31] and PPO [39], a freeform navigation task with and was considered, where vector summarizes four of the vehicle’s states. The same network architecture from [39] is used: a fully-connected MLP with 2 hidden layers of 64 units before the output layer. Even though this is the basic setup, considerable differences between DDPG, PPO and TSHC are implied. Both DDPG and PPO are each composed of a total of four networks: one actor, one critic, one actor target and one critic target network. For DDPG, further parameters result from batch normalization [48]. The number of parameters that need to be identified is indicated in Table I. To enable a fair comparison, all of DDPG, PPO and TSHC are permitted to train on 1000 full rollouts according to their methods, whereby each rollout lasts at most timesteps. Thus, for TSHC, and are set. For both PPO and DDPG, this implies 1000 iterations. Results are summarized in Fig. 4. The following observations can be made. First, in comparison to TSHC, for both DDPG and PPO significantly more parameters need to be identified, see Table I. Second, DDPG and PPO do not solve the task based on 1000 training simulations. In contrast, as Fig. 4 demonstrates, TSHC has a much better exploration strategy resulting from noise perturbations in the parameter space. It solves the task in just 2.1s. Finally, note that no -iteration is conducted. It is not applicable since a single task is solved with an initial . Because of these findings (other target poses were tested with qualitatively equivalent results) and the discussion in Sect. III-F about the handling of sparse rewards and the fact that DDPG and PPO have no useful gradient direction for their parameter update or may average these out through random minibatch sampling, the focus in the subsequent sections is on TSHC and its analysis.
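The TSHC entry of Table I can be checked with a short calculation. The sketch below assumes a fully connected network with one bias per unit, four inputs (the four vehicle states) and two outputs. The output dimension and the helper name `mlp_param_count` are assumptions for illustration, chosen because they reproduce the reported count.

```python
def mlp_param_count(layers):
    """Weights plus biases of a fully connected network with the given layer sizes."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layers, layers[1:]))

# TSHC network of Experiment 1 under the stated assumptions: 4 -> 64 -> 64 -> 2.
tshc_params = mlp_param_count([4, 64, 64, 2])   # (5*64) + (65*64) + (65*2) = 4610
```

Under these assumptions the count is 4610, and four times this number gives the 18440 reported for PPO. The same helper yields 66 parameters for the layer sizes [5, 8, 2].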

IV-B Experiment 2: Inverted Pendulum

The discussion of tolerance levels in Sect. III-C motivated the consideration of an alternative approach for tasks requiring stabilization. An analogy to optimal control is drawn. In linear finite horizon MPC, closed-loop stability can be guaranteed through a terminal state constraint set which is invariant for a terminal controller, often a linear quadratic regulator (LQR), see [50]. In an RL setting, the following procedure was considered. First, design an LQR for stabilization. Second, compute the region of attraction of the LQR controller [51, Sect. 3.1.1]. Third, use this region of attraction as stopping criterion, replacing the heuristic -tolerance selection.
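The three-step procedure can be sketched for a generic discrete-time linear model. This is a hedged illustration only: the Riccati fixed-point iteration, the function names `dlqr` and `in_terminal_set`, and the use of a user-chosen Lyapunov level set as a stand-in for the region of attraction are assumptions. The procedure of [51, Sect. 3.1.1] for computing the region under constraints is not reproduced here.

```python
import numpy as np

def dlqr(A, B, Q, R, n_iter=500):
    """Discrete-time LQR via fixed-point iteration of the Riccati recursion.
    Returns the feedback gain K (u = -K x) and the cost matrix P."""
    P = Q.copy()
    for _ in range(n_iter):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return K, P

def in_terminal_set(x, P, level):
    """Stopping test: is x inside the Lyapunov level set x' P x <= level?
    The value of `level` is a hypothetical design choice."""
    return float(x @ P @ x) <= level
```

During a rollout, the learnt controller would be applied until `in_terminal_set` returns true, at which point the LQR takes over. For a system whose inputs saturate, the level would additionally have to be small enough that the LQR input stays within its bounds.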

For evaluation, the inverted pendulum system equations and parameters from [37] were adopted (four states, one input). However, in contrast to [37], which assumes just 2 discrete actions (maximum and minimum actuation force), here a continuous control variable is assumed, limited by these 2 bounds. There are 2 basic problems: stabilization in the upright position with the initial state in the same position, and a swing-up from the hanging position with subsequent stabilization in the upright position. For the application of TSHC, are set, and the same MLP-architecture from Sect. IV-A is used. The following remarks can be made. First, the swing-up plus stabilization task was solved in s runtime of TSHC (without refinement step) and using sparse rewards (obtained in the upright position ). For all three restarts a valid solution was generated. Note that in combination with a sampling time [37] of 0.02s corresponds to 10s simulation time. Stabilization in the upright position was achieved from 2.9s on. Rich reward signals were also tested, exploiting the deviation from current to goal angle as measure. However, rich rewards did not accelerate learning.

In a second experiment, the objective was to simultaneously encode the following 2 tasks in the network: stabilization in the upright position with the initial state in the same position, and a swing-up from the hanging position with subsequent stabilization in the upright position. The runtime of TSHC (without refinement step) was s, with 2 of 3 restarts returning a valid solution and using sparse rewards. Instead of learning both tasks simultaneously according to TSHC, it was also attempted to learn them by selecting one of the 2 tasks at random at every , and subsequently conducting Steps 6-41. Since the 2 tasks are quite different, this procedure could not encode a solution for both tasks. This is mentioned to exemplify the importance of training simultaneously on separate tasks, rather than training on a single task with -combinations varying over .

Finally, for the system parameters from [37], it was observed that the continuous control signal was operating mostly at saturated actuation bounds (switching in-between). This is mentioned for 2 reasons. First, the aforementioned LQR-strategy could therefore never be applied since LQR assumes the absence of state and input constraints. Second, it exemplifies the ease of the RL-workflow with TSHC for quick nonlinear control design, even without significant system insights.

Fig. 5: Experiment 3. (Left) Display of all 181 trajectories learnt and encoded in one MLP-[5,8,2]. Trajectories for each task are visualized in separate colors. (Right) Display of learnt result for only the most complex of the 181 training tasks, i.e., for . Recall that and m. The vehicle’s start and end CoG-position is indicated by red and black balls, respectively. As indicated, the transition involves frequent forward and backward driving but is constrained locally around the -origin. See Sect. IV-C for further discussion.

IV-C Experiment 3 and 4: GPU-based training

Experiment 3 is characterized by transitioning from to with measured in . This implies . The feature vector is selected as with normalization constants in the denominators and indicating the steering angle-related network output (before scaling to ). A high-resolution tolerance of is set. In addition, m and km/h. Sampling time is 0.01s. As the neural network, an MLP-[5,8,2] is used, which implies 1 hidden layer with 8 units. For selections and , MLP-[5,8,2] was the smallest possible network found to simultaneously encode all training tasks. The second variant of VVCs discussed in Sect. III-B is employed. Furthermore, , i.e., uniformly distributed at every -combination.

Fig. 6: Experiment 4. Display of learnt trajectory when encoding just one task in an MLP-[5,8,2] for the transition from to . Brackets (1) and (3) imply forward, (2) reverse driving and their sequence. A black indicator visualizes the vehicle’s final heading. Total learning time (including 10 restarts) was 10.6s. See Sect. IV-C for further discussion.

Several comments can be made. First, by application of control mirroring w.r.t. steering, the trained network also enables reaching all of . Second, the total learning time (runtime of TSHC) to encode all 181 training tasks was 31.1min. MLP-[5,8,2] implies a total of 66 parameters to learn. It is remarkable that such a small network has enough function approximation capability to encode all 181 tasks within limited training time and . Third, note that the TSHC-trained network controller permits repeatable precision. As mentioned in Sect. III-F, this is not attainable for stochastic policy gradient-based algorithms, which draw their control signals typically from a Gaussian distribution. Fourth, the learning results are visualized in Fig. 5. These motivated an additional experiment with identical basic training setup (TSHC-settings, MLP-[5,8,2], etc.), however, now encoding only one task for the transition from to . The result is visualized in Fig. 6. Notice the much reduced number of switches between forward and backward driving, and the different -range.

The comparison of Fig. 5 and 6 emphasizes the interesting observation that the more motion primitives are encoded in a single network, the less performant the individual learnt motion trajectories are. This is believed to illustrate the potential of partitioning the total number of designated tasks into subsets of training tasks for which separate networks are then learnt using TSHC. The promised advantages include faster overall training times, higher performance of learnt trajectories, and the ability to employ tiny networks with few parameters for each subset. This direction is the subject of ongoing work.

V Conclusion

Within the context of automated vehicles, for the design of model-based controllers parameterized by neural networks, a simple gradient-free reinforcement learning algorithm labeled TSHC was proposed. The concepts of (i) training on separate tasks with the purpose of encoding motion primitives, and (ii) employing sparse rewards in combination with virtual velocity constraints in setpoint proximity were specifically advocated. Aspects of TSHC were illustrated in 4 numerical experiments. The presented method is not limited to automated driving. Most real-world learning applications for control systems, especially in robotics, are characterized by sparse rewards and the availability of high-fidelity system models that can be leveraged for offline training.

Future work will focus on system models of various complexity (e.g., kinematic vs. dynamic vehicle models), the partitioning of tasks into separate subsets for which separate network parametrizations are learnt, the analysis of different feature vectors, and closed-loop evaluation.