Deep architectures have become popular as function approximators to represent action-selection policies. Common approaches to learning the parameters of such models include reinforcement learning and learning from demonstration: they fit model parameters to maximize expected reward, mimic human behavior, and/or achieve implicit goals. However, the design of policy architectures, especially in a deep learning paradigm, remains relatively unexplored. Architectures are typically selected through a combination of intuition and trial and error.
Learning to learn, including the learning of learning architectures, is a long-articulated goal of AI, and many “meta-learning” and “lifelong learning” schemes have been proposed (e.g.,  offered seminal views; see for a survey). Recently, renewed interest in this topic has focused on models which explicitly search over the structure of deep architectures, including models which fuse non-parametric Bayesian inference with deep learning to select the number of channels for visual recognition tasks, models which use reinforcement learning to directly optimize over deep architectures for recognition, and models which use a gradient-free optimization method (“evolutionary search”) to infer optimal network structure.
We investigate policy architecture search using gradient-free optimization and learn optimal policy structure for autonomous driving tasks. We propose a model which learns jointly from demonstration and optimization, with the goal of “safe training”: minimizing the amount of damage a vehicle incurs to learn a threshold level of performance. We base our approach on exploration-based schemes due to their ability to optimize model weights and architecture hyperparameters, leverage expert demonstrations, and adapt to reward obtained in new domains. We believe that a model which can initialize from demonstration, and learn an optimal policy from that foundation, is likely to achieve higher performance while maintaining the constraint of safe training, compared to models which must randomly search through action space during initial learning, or which learn from a reasonably safe demonstration but cannot further optimize performance based on environmental reward.
Prior approaches that combine demonstration with reward-based learning have had mixed success [15, 9, 1], mainly due to the poor generalization of the policy learned from demonstrations. We posit that effective behavior cloning requires learning a visual agent architecture with sufficient structure to perceive the state of the world deemed relevant by the expert providing the demonstration. This may or may not be the case with existing, off-the-shelf visual models. We therefore optimize over both architectures and parameters when performing expert behavioral cloning.
Often, deep models which learn to perform in one domain fail to perform well when deployed in another setting, such as differing weather or lighting conditions. Models learned from demonstration are also well known to fail when the learned policy takes the agent away from the region of the state space where the demonstration was provided. We show that our method can effectively and safely adapt a model demonstrated in one environment but deployed in a visually different environment based on the reward signal in the latter domain, even when the agent is initialized far from initial demonstrations. Our approach leverages only target domain reward, and makes no assumptions about domain alignment, explicit or implicit, nor assumes any demonstration supervision in the target domain.
To achieve these goals, we present a gradient-free optimization algorithm, inspired by prior work, with a modification in noise generation that estimates gradients more efficiently and accurately (Sec. 3.1). We then apply this algorithm to search over variable-length architectures (Sec. 3.2). Next, we combine our gradient-free policy search with demonstrations to learn a better policy that adapts to the new environment by receiving rewards as feedback (Sec. 3.3). We experimentally show that our architecture search model finds a policy on the GTA game environment that outperforms previously published methods (e.g., Bojarski et al.) in end-to-end steering prediction from demonstrations, and that it can be efficiently adapted to learn to drive in previously unseen scenarios (Sec. 4). Our model reduces the number of crashes incurred while learning to drive, compared to baselines based only on reward or only on demonstration, and compared to previously proposed fixed architectures that were not optimized for the domain.
2 Related work
In prior work, a recurrent neural network (RNN) was used to generate fixed-length architecture descriptions from a predefined search space and was trained with policy gradient methods. The authors came close to and surpassed state-of-the-art results on the CIFAR-10 and Penn Treebank datasets, respectively. A meta-modeling algorithm was also proposed which used Q-learning to sequentially search for convolutional layers for image classification tasks. Its authors showed that their approach outperforms other existing meta-models and manually designed architectures with similar types of layers. Recently, Budgeted Super Networks were introduced; they are inspired by the REINFORCE algorithm and use an objective function that accounts for prediction quality and computation cost simultaneously. Various versions of biologically inspired methods, or neuroevolution strategies, have been proposed for architecture search ever since they were first introduced. Most of them are based on genetic algorithms, in which a fitness function is re-evaluated at each “generation” to determine whether “genotypes” are perturbed in the correct direction to evolve appropriately [12, 11, 19, 17]. That is, they initialize a model and evolve it based on its performance. This paradigm was recently revisited as an alternative to reinforcement learning algorithms: optimization is performed in a gradient-free fashion, and the algorithm was shown to be highly parallelizable, resulting in significant speedups in playing MuJoCo and Atari games.
Policy search in autonomous driving applications has largely focused on demonstration-based optimization approaches, with or without [3, 10] affordance measurements. This line of work dates back to a classic model, a shallow architecture that could map from pixels to simple driving actions. Several years later, researchers demonstrated end-to-end deep learning models for steering control of small-scale cars, and more recently NVIDIA followed the same path and showed success in predicting steering angle on a full-size vehicle from raw pixels using a convolutional network. A novel architecture was also proposed and trained on large-scale crowd-sourced data to perform egomotion prediction conditioned on the previous temporal states. Its authors used dashcam videos to derive a generic driving model that predicts trajectory angle (not steering angle).
We propose a learning-to-learn model which includes architecture optimization, parameter learning, and representation adaptation over different time scales. Our approach can be summarized by the following two steps. (1) Given expert demonstration, search over architectures and parameters to find a policy that best mimics the expert's performance by monitoring the obtained accuracy and number of parameters. (2) Having learned from demonstration, adapt the model to the reward provided by the target environment. In both steps, it is essential to derive a function approximator that optimizes an objective function. We use a gradient-free optimization algorithm that maximizes a parametrized reward function using gradient estimation to perform architecture search (Sec. 3.2) and policy learning (Sec. 3.3).
3.1 Gradient-free optimization algorithm
Let be our objective function parametrized by which is an
-dimensional vector.can be the reward that an environment provides for an agent when it executes a policy with parameters ; our goal is to maximize the expected reward by perturbing the policy parameters, denoted as , by moving in particular directions. The parameter estimate update can be performed using a general stochastic form:
where is an approximation of the objective function (i.e. ) and is the gradient of objective estimate that can be approximated by any gradient estimator in the family of finite difference methods. The gradient is estimated in a randomly chosen direction by perturbing all the elements of to obtain two measurements of as follows:
where is a vector of mutually independent randomly perturbed variables taken from a zero-mean distribution. While there is no restriction for it to have a specific type of distribution, we use Laplace distribution, as it tends to choose orthogonal directions in the long run. Other recent efforts  utilized Gaussian noise to sample mirrored projections. Figure 2 shows a comparison between the two distributions. is a small positive number and and are the noise associated with evaluating such that: . The gradient estimate can then be computed as:
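The two-measurement estimate described above can be illustrated with a minimal Python sketch. It is written under stated assumptions and is not the paper's exact estimator: the names `laplace_noise`, `estimate_gradient`, `gradient_free_ascent` and `toy_reward`, the step size, the perturbation size `c`, and the number of directions are all hypothetical choices, and the estimate projects the mirrored finite difference back onto the sampled direction, which yields the gradient scaled by the noise variance rather than the gradient itself.

```python
import random


def laplace_noise(n, scale, rng):
    # The difference of two i.i.d. exponentials is a zero-mean Laplace sample.
    return [rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
            for _ in range(n)]


def estimate_gradient(objective, theta, c, scale, num_directions, rng):
    # Average mirrored two-point measurements over several random directions.
    grad = [0.0] * len(theta)
    for _ in range(num_directions):
        delta = laplace_noise(len(theta), scale, rng)
        y_plus = objective([t + c * d for t, d in zip(theta, delta)])
        y_minus = objective([t - c * d for t, d in zip(theta, delta)])
        slope = (y_plus - y_minus) / (2.0 * c)  # directional derivative estimate
        for i, d in enumerate(delta):
            grad[i] += slope * d / num_directions
    return grad


def gradient_free_ascent(objective, theta, steps=200, step_size=0.05,
                         c=0.1, scale=1.0, num_directions=20, seed=0):
    # Stochastic update: move the parameters along the estimated gradient.
    rng = random.Random(seed)
    theta = list(theta)
    for _ in range(steps):
        grad = estimate_gradient(objective, theta, c, scale, num_directions, rng)
        theta = [t + step_size * g for t, g in zip(theta, grad)]
    return theta


def toy_reward(theta):
    # Hypothetical stand-in for the environment reward, maximized at (1, -2).
    return -((theta[0] - 1.0) ** 2) - (theta[1] + 2.0) ** 2


best = gradient_free_ascent(toy_reward, [0.0, 0.0])
```

On this toy quadratic the parameters approach the maximizer (1, -2). Only objective evaluations are used, so the same loop applies when the objective is a reward returned by an environment.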
| Layer type | Hyperparameter | Search space |
|---|---|---|
| Convolutional layer followed by Max-pool and/or Dropout | Filter height (FH) | [1, 3, 5, 7] |
| | Filter width (FW) | [1, 3, 5, 7] |
| | Stride height (SH) | [1, 2, 3] |
| | Stride width (SW) | [1, 2, 3] |
| | Number of filters (NF) | [16, 24, 32, 64, 128, 256] |
| | Max-pool size (MP) | [1, 2, 3] |
| | Dropout (DO1) | [0.3, 0.5, 0.7, 1.0] |
| Fully-connected layer followed by Dropout | Number of units (NU) | [8, 16, 32, 64, 128, 256, 512] |
| | Dropout (DO2) | [0.3, 0.5, 0.7, 1.0] |
3.2 Learning an optimal initial policy from demonstrations
Inspired by prior work, we use a recurrent neural network to sequentially generate the description of the layers of an architecture from a design space defined by the user. The RNN acts as a controller which generates the architecture description, defined by hyper-parameters chosen from a pre-defined search space. In that work, the authors used policy gradients to train the RNN, which was able to produce fixed-length convolutional and recurrent architectures. Given demonstrations and a child network defined by the RNN, they trained the child network using supervised learning, obtained an accuracy metric for the given task on a held-out validation set, and used that accuracy as a reward signal to train the RNN.
Our model uses the demonstrations to provide the reward function (i.e., the objective function of Sec. 3.1) used to train the RNN. Unlike backpropagation, which suffers from vanishing gradients when training RNNs, gradient-free algorithms do not have this issue. Our RNN controller specifies three types of layers: convolutional, fully connected, and max-pool, which can have inter-layer dropout. For the reward signal, we use the negative value of the total loss function. At the last layer of the network, we regress to three real-valued numbers, each with a mean-squared loss; the total loss is the sum of all three losses. We use a novel reward function (Eq. 5) that not only results in the minimum total loss but also grows the architecture as long as the loss keeps decreasing. Note that for a classification problem, an accuracy metric can replace the loss value, so the goal becomes maximizing accuracy while controlling the number of parameters. Here we have a regression problem, and our goal is to search for an architecture that achieves a low loss on the given task and keeps growing by adding layers (i.e., adding parameters) to decrease this loss, while being penalized for adding parameters that yield no gain in loss reduction. We propose to use a ReLU-based Lagrange-multiplier reward function as below:
where is the negative of the minimum loss (or maximum accuracy in a classification problem) on the validation set for the last epochs and is the total number of parameters in the child network. is the Lagrange parameter defined as a function of the first sub-reward in a ReLU-based fashion:
This reward function acts as follows. The RNN keeps generating new layers, rewarded only by the obtained total loss, until it produces a child network that achieves the desired value of loss (or accuracy) on the validation set. Once it reaches this threshold, it is penalized for further growing the architecture if the loss does not decrease as a result. The parameter in Eq. 5 defines the respective threshold. E.g., choosing allows architecture growth by the number of parameters in the new layer if it causes the overall loss to decrease (or the accuracy to increase, in classification) by . The thresholds are adjustable based on the problem at hand and the desired trade-off between computational cost and loss minimization.
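The gating behavior described above can be sketched in Python as follows. This is an illustrative form only and does not reproduce Eq. 5: the function name `architecture_reward`, the arguments `threshold` and `penalty_scale`, and the specific product form of the penalty are assumptions made for the example.

```python
def relu(x):
    return max(0.0, x)


def architecture_reward(neg_val_loss, num_params, threshold, penalty_scale):
    """Hypothetical ReLU-gated reward: loss term minus a parameter penalty.

    neg_val_loss  -- negative of the validation loss (higher is better)
    num_params    -- total number of parameters in the child network
    threshold     -- desired value of neg_val_loss that switches the penalty on
    penalty_scale -- assumed constant scaling the multiplier
    """
    # The multiplier is zero until the loss-based sub-reward reaches the threshold.
    multiplier = penalty_scale * relu(neg_val_loss - threshold)
    return neg_val_loss - multiplier * num_params


# Below the threshold the parameter count does not affect the reward.
below = architecture_reward(-0.2, 1_000_000, threshold=-0.1, penalty_scale=1e-6)

# Past the threshold, growing the network without reducing the loss costs reward.
small = architecture_reward(-0.09, 1_000_000, threshold=-0.1, penalty_scale=1e-6)
large = architecture_reward(-0.09, 2_000_000, threshold=-0.1, penalty_scale=1e-6)
```

With these hypothetical numbers, `below` equals the loss term alone, while `large` is lower than `small` because the extra parameters bring no loss reduction.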
Our RNN controller is a three-layer LSTM network followed by a softmax layer. The inputs to the RNN are the hyper-parameters that describe a layer (see Tab. 1 for our search space). Training the RNN starts by randomly initializing the hyper-parameters of the child network, which initially has only one layer. The RNN uses the reward function to update its weights such that those which contributed more to the obtained reward receive a higher weighting factor during the update; hence we move in a hill-climbing direction which eventually maximizes the reward. We use the algorithm described in Sec. 3.1 to generate an architecture that yields the minimum loss. The process of generating new layers terminates when the received reward converges.
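To make the size of the design space concrete, the search space of Tab. 1 can be written down as plain Python data. This is a minimal sketch: the dictionary and function names are hypothetical, and uniform random sampling merely stands in for the softmax output of the LSTM controller, which is not implemented here.

```python
import random

# Hyper-parameter choices copied from Tab. 1.
CONV_SPACE = {
    "FH": [1, 3, 5, 7],
    "FW": [1, 3, 5, 7],
    "SH": [1, 2, 3],
    "SW": [1, 2, 3],
    "NF": [16, 24, 32, 64, 128, 256],
    "MP": [1, 2, 3],
    "DO1": [0.3, 0.5, 0.7, 1.0],
}
FC_SPACE = {
    "NU": [8, 16, 32, 64, 128, 256, 512],
    "DO2": [0.3, 0.5, 0.7, 1.0],
}


def sample_layer(space, rng):
    # Stand-in for the controller: pick one value per hyper-parameter.
    return {name: rng.choice(values) for name, values in space.items()}


def count_choices(space):
    # Number of distinct layer descriptions in the given space.
    total = 1
    for values in space.values():
        total *= len(values)
    return total


rng = random.Random(0)
conv_layer = sample_layer(CONV_SPACE, rng)
fc_layer = sample_layer(FC_SPACE, rng)
num_conv_descriptions = count_choices(CONV_SPACE)  # 10368
num_fc_descriptions = count_choices(FC_SPACE)      # 28
```

Even a single convolutional layer admits 10,368 descriptions, which is why the layers are generated sequentially by a learned controller rather than enumerated.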
3.3 Adapting a demonstrated policy to a new driving domain
It is well known, and confirmed experimentally below, that a policy learned from behavioral cloning can perform poorly when evaluated on inputs with a domain shift relative to the demonstration supervision. To overcome this, we further use the gradient-free search algorithm described in Sec. 3.1 to adapt a driving policy learned from demonstration in a source domain based on rewards in a target domain. We experiment with the setting where the initial agent state (e.g., location), and/or the weather and lighting conditions, are substantially different from those provided in the demonstration. We compare against baselines that start reward-based optimization from a randomly initialized policy; starting from initial demonstrations instead makes the reward converge faster and more safely.
In the driving scenario, we wish to learn to drive with optimal or near-optimal performance, defined by the reward in the target domain. Specifically, the reward function used in our experiment is composed of two factors (we receive if obeyed and if violated): 1) No crashes with other objects 2) Staying within the lane lines if they are available in the driving scene. We have used the lane reward function and accident detection function defined in  source code. The necessary information is provided by a paths.xml file in .
In our model, an episode is the time interval that the agent has successfully driven without having a car crash. Note that not all deviations from the middle of the road necessarily result in an accident. In case of a minor deviation, while the car receives as its reward, it continues driving until it makes a mistake that causes it to crash and the game restarts. There are distinct thresholds for middle-lane deviation defined in  for different roads (highway, urban, etc.) and different vehicle types.
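The reward and episode definitions above can be sketched as follows. This is an assumption-laden illustration: the two factors are summed, which the text does not state, the reward values are left as arguments because they are not given here, and the function names and the toy trajectory are hypothetical.

```python
def step_reward(crashed, in_lane, lanes_available, reward_obeyed, reward_violated):
    # Factor 1: no crashes with other objects.
    total = reward_violated if crashed else reward_obeyed
    # Factor 2: staying within the lane lines, only when they are available.
    if lanes_available:
        total += reward_obeyed if in_lane else reward_violated
    return total


def run_episode(steps, reward_obeyed, reward_violated):
    """Accumulate reward until the first crash, which ends the episode.

    steps -- iterable of (crashed, in_lane, lanes_available) tuples
    """
    total, length = 0.0, 0
    for crashed, in_lane, lanes_available in steps:
        total += step_reward(crashed, in_lane, lanes_available,
                             reward_obeyed, reward_violated)
        length += 1
        if crashed:
            break
    return total, length


# Hypothetical trajectory: a minor lane deviation, then a crash at the third step.
trajectory = [
    (False, True, True),
    (False, False, True),
    (True, False, True),
    (False, True, True),  # never reached: the episode ended at the crash
]
episode_return, episode_length = run_episode(trajectory, 1.0, -1.0)
```

The lane deviation at the second step lowers the reward without ending the episode, while the crash at the third step terminates it, mirroring the distinction drawn in the text.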
4 Experimental evaluation
We implemented the method described above and ran comprehensive experiments to show the efficiency and applicability of our approach in searching for an optimal driving policy that has the minimum number of catastrophic failures. Details of the experiments along with the results are provided in the following subsections. All the experiments are executed in the GTA game environment using a publicly available plugin that allowed us to have control over driving conditions such as lighting, weather, car model, and reward function.
For our evaluation we have collected a dataset of an expert policy by playing GTA collecting images of size similar to . Labels include steering angle, brake, and throttle values. In order to learn from diverse driving scenarios, we have used data from different locations (highways, rural roads, urban streets) where weather and lighting conditions were adjusted using . Our goal was to expose the learning algorithm to a comprehensive demonstration set yet to set aside some specific scenes for further testing the performance of behavioral cloning task. Sample images from demonstration are shown in Fig. 1. They include rainy (daytime), overcast (day and night), foggy (daytime), sunny scenes and thunderstorms (daytime). Some particular scenes such as rain and thunderstorm during nighttime as well as snow at daytime have been kept for our test set (see Fig. 1). Our test set is composed of images.
4.2 Learning a policy architecture from GTA demonstrations
The search space for the hyperparameters that describe a fully-convolutional architecture is presented in Tab. 1. The RNN controller has three LSTM layers (Sec. 3.2) with hidden units in each, and a softmax layer at the end to choose from the given search space. The RNN weights are initialized with a random Laplace distribution. Once the RNN predicts a new layer’s description, the child network is built and trained with batch size of and Adam optimizer with learning rate . We train the child network for a different number of epochs starting from epochs (depending on which layer we are at) and compute the reward function as described in Sec. 3.2. To decide when to finish optimizing one layer, we track the loss reduction on both the validation and training sets between the first and the last epoch, which helps avoid overfitting.
Our model is capable of generating architectures at low or high costs of architecture growth. In order to compare our designed architecture with Bojarski et al. (see footnote 1), we control the reward function such that it never produces an architecture with more than parameters while minimizing the loss (maximizing the reward). We present our architectures and comparison to prior work in Fig. 2(2). The corresponding performance comparison is shown in Tab. 2. The smallest architecture (in terms of the number of parameters) that we have built is shown in Fig. 2, which obtains a smaller total loss than Bojarski et al. The network of Bojarski et al. (Fig. 2(1)) appears to suffer from overfitting, which might be explained by the absence of Dropout or a pooling layer. When we do not bound our architecture search algorithm by a number of parameters, we learn a larger network (Fig. 2(3)) that has over 2 million parameters and obtains a minimum total loss of 0.085 on the training set and 0.088 on the validation set (Tab. 2). As discussed above, we test all models on substantially different driving scenes which were never seen during training. On this challenging test set our large network obtains the minimum loss of 0.185. We use this network as our initial driving policy in the next section and improve it further in the reward-providing GTA environment.
Footnote 1: As no reference implementation of Bojarski et al. is openly available, we had to use our own, which may be suboptimal w.r.t. the authors’ as we did not have full access to their model parameters. Also, their model was only used to predict steering angle, and overall it is not clear whether their goal was to maximize performance, find a model with relatively few parameters, or both, so they may not have explored the full design space with their model. Nonetheless, it was the closest model in the literature for end-to-end steering angle prediction and thus the best available baseline.
| Model | Number of parameters | Training loss | Validation loss | Test loss |
|---|---|---|---|---|
| Bojarski et al. | 252,241 | 0.098 | 0.11 | 0.212 |
| Our small network | 228,227 | 0.093 | 0.096 | 0.197 |
| Our large network | 2,198,723 | 0.085 | 0.088 | 0.185 |
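The comparison in Tab. 2 can be restated as relative changes with respect to the baseline. The short Python sketch below only re-derives ratios from the tabulated values; the names `TABLE_2` and `relative_reduction` are hypothetical and no new measurements are involved.

```python
# Parameter counts and test losses copied from Tab. 2.
TABLE_2 = {
    "Bojarski et al.": {"params": 252_241, "test_loss": 0.212},
    "Our small network": {"params": 228_227, "test_loss": 0.197},
    "Our large network": {"params": 2_198_723, "test_loss": 0.185},
}


def relative_reduction(baseline, value):
    # Fraction by which `value` is lower than `baseline`.
    return (baseline - value) / baseline


base = TABLE_2["Bojarski et al."]
summary = {}
for name in ("Our small network", "Our large network"):
    row = TABLE_2[name]
    summary[name] = {
        "test_loss_reduction": relative_reduction(base["test_loss"], row["test_loss"]),
        "param_ratio": row["params"] / base["params"],
    }

for name, stats in summary.items():
    print(f"{name}: test loss {stats['test_loss_reduction']:.1%} lower, "
          f"{stats['param_ratio']:.2f}x the baseline parameters")
```

The small network lowers the test loss by about 7% with roughly 0.9 times the baseline parameters, while the large network lowers it by about 13% at roughly 8.7 times the parameters.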
We further empirically compare two different noise distributions for perturbing the parameters of the network: random Gaussian and random Laplace. We perform the architecture search over our small model using both distributions with mean zero; the variance is chosen using grid search. Fig. 2 shows reward convergence versus the number of iterations. Both distributions converge to high reward values (i.e., low loss); however, the Laplace distribution tends to be less noisy and reaches slightly higher reward values.
4.3 Safe policy adaptation
Next, we want to learn the driving policy in a target game domain. We start with an initial model, using either a behaviorally cloned or a randomly initialized policy, and gradually improve it by receiving rewards from the environment. As stated in Sec. 3.3, an episode of the game starts at a random location and weather condition in the game. To initialize the policy, we use the larger architecture learned in the first step (with the model of Bojarski et al. as the baseline). We evaluate both models with and without being adapted to demonstration, forming four cases: (1) the baseline network of Bojarski et al. without demonstration (i.e., with randomly initialized weights) and (2) with behaviorally cloned initial weights, (3) our larger architecture without demonstration and (4) with behaviorally cloned initial weights. We run all models in the GTA environment to receive the reward described in Sec. 3.3. Once the reward is received, the weights are perturbed by random Laplace noise and the same procedure is repeated until the average reward in each episode of the game converges to its maximum value. Results averaged across several runs are presented in Tab. 3, where our model optimized with demonstration outperforms all other cases. In particular, our model has the lowest number of cumulative crashes prior to converging to of averaged reward (details below).
|(1) Bojarski et al., w.o. demo||154 hours||15,565||18,662|
|(2) Bojarski et al., w. demo||74 hours||1,387||3,243|
|(3) Our large network, w.o. demo||114 hours||6,877||8,781|
|(4) Our large network, w. demo||53 hours||832||982|
In Table 4 we list test results on a dataset taken from a target domain that has not been seen in the demonstrations. On the left, we see that behavioral cloning alone performs poorly when significant domain shifts occur. Our adapted model has performance over loss minimization of . On the right we see that adapted performance is strong even without reward in the target domain, indicating that visual domain shift is a lesser issue than being off-demonstration; our model can adapt in the source domain and still be accurate on the target. The best performance is obtained with adaptation to reward in both source and target. This also shows that loss minimization improves when we learn from rewards. It is worth noting that we cannot judge the driving behavior only by looking at the total MSE loss, as it is not a comprehensive representation of the driving task. Each of the angle, brake, and throttle outputs converges to a separate MSE loss, among which steering angle has the smallest and brake the largest loss value. This suggests that learning steering angle from demonstrations is easier than learning brake and throttle, which at each time step depend on multiple previous frames.
Table 5 shows all model predictions for a relatively complex image chosen from the target domain (Fig. 1 at top corner), where a pedestrian is crossing the street while the signal light is green. The behaviorally cloned models tend to predict that the agent should keep going, whereas the models adapted to the target domain with rewards are able to predict the correct decision despite the presence of the green light in the image. Our adapted large network rewarded on both source and target makes the best prediction for throttle and brake (steering angle is perfect across all models).
|Bojarski et al.||0.212||
Fig. 3 illustrates the percentage of averaged reward per episode for the aforementioned four models until convergence. Our designed architecture, which is adapted to the demonstrations on a source domain, starts with less than reward in its first episode which lasts for seconds. This is reasonable considering that each episode of the game is intentionally set up to start in a completely random new environment, which is very likely to be a significantly different domain than what the policy has seen up to that point. This again highlights the fact that a behaviorally cloned model is at high risk of failure when it is tested in a different domain. The models keep learning from the rewards until convergence. It can be seen in Fig. 3 that our designed architecture adapted to the demonstration reaches of averaged reward after hours in its last episode (episode number ) which lasts for minutes and is then terminated by the user (no crashing happens). It is also shown that a suboptimal policy that is nevertheless adapted to the demonstrations also converges, but only to of the maximum reward, and then plateaus for more than hours. Unadapted policies are also shown in Fig. 3 converging to an averaged reward of on a drastically different time scale, confirming the positive effect of using demonstrations in policy learning. Supplementary video of results can be found at https://saynaebrahimi.github.io/corl.html
|Large network||Steering angle||-0.006||0.003||0.005||0.002|
|Bojarski et al.||Steering angle||-0.005||-0.002||0.002||-0.001|
The goal of this work is to learn a policy for an autonomous driving task while minimizing crashes and other safety violations during training. To this end we propose an algorithm which learns to generate an optimal network architecture from demonstration using a new reward function that optimizes accuracy and model size simultaneously. We confirm that behavioral cloning alone can perform poorly when the target domain differs from the source demonstrations. We show that our method can adapt the model learned by demonstration to a new domain relying on target environmental rewards. Experimental evaluation shows that our model achieves higher accuracy, fewer cumulative crashes, and higher target domain reward. We believe these results are encouraging and important steps towards the ultimate goal of learning complex driving policies with zero cumulative crashes or serious accidents, either in simulation or the real world.
- [1] B. D. Argall, S. Chernova, M. Veloso, and B. Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009.
- [2] B. Baker, O. Gupta, N. Naik, and R. Raskar. Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167, 2016.
- [3] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.
- [4] C. Chen, A. Seff, A. Kornhauser, and J. Xiao. DeepDriving: Learning affordance for direct perception in autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision, pages 2722–2730, 2015.
- [5] M. Eigen. Ingo Rechenberg Evolutionsstrategie Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. Mit einem Nachwort von Manfred Eigen, Friedrich Frommann Verlag, Stuttgart-Bad Cannstatt, 1973.
- [6] J. Feng and T. Darrell. Learning the structure of deep convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2749–2757, 2015.
- [7] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [8] Y. LeCun, U. Muller, J. Ben, E. Cosatto, and B. Flepp. Off-road obstacle avoidance through end-to-end learning.
- [9] M. N. Nicolescu and M. J. Mataric. Learning and interacting in human-robot domains. IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, 31(5):419–430, 2001.
- [10] D. A. Pomerleau. ALVINN, an autonomous land vehicle in a neural network. Technical report, Carnegie Mellon University, Computer Science Department, 1989.
- [11] J. K. Pugh and K. O. Stanley. Evolving multimodal controllers with HyperNEAT. In Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation, pages 735–742. ACM, 2013.
- [12] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, Q. Le, and A. Kurakin. Large-scale evolution of image classifiers. arXiv preprint arXiv:1703.01041, 2017.
- [13] S. Ross, G. J. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics, pages 627–635, 2011.
- [14] A. Ruano. DeepGTAV. https://github.com/ai-tor/DeepGTAV, 2017.
- [15] P. E. Rybski and R. M. Voyles. Interactive task training of a mobile robot through human gesture recognition. In Robotics and Automation, 1999. Proceedings. 1999 IEEE International Conference on, volume 1, pages 664–669. IEEE, 1999.
- [16] T. Salimans, J. Ho, X. Chen, and I. Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.
- [17] J. D. Schaffer, D. Whitley, and L. J. Eshelman. Combinations of genetic algorithms and neural networks: A survey of the state of the art. In Combinations of Genetic Algorithms and Neural Networks, 1992., COGANN-92. International Workshop on, pages 1–37. IEEE, 1992.
- [18] J. Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.
- [19] K. O. Stanley and R. Miikkulainen. Evolving neural networks through augmenting topologies. Evolutionary Computation, 10(2):99–127, 2002.
- [20] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction, volume 1. MIT Press, Cambridge, 1998.
- [21] S. Thrun and L. Pratt. Learning to learn. Springer Science & Business Media, 2012.
- [22] T. Veniat and L. Denoyer. Learning time-efficient deep architectures with budgeted super networks. arXiv preprint arXiv:1706.00046, 2017.
- [23] H. Xu, Y. Gao, F. Yu, and T. Darrell. End-to-end learning of driving models from large-scale video datasets. arXiv preprint arXiv:1612.01079, 2016.
- [24] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.