1 Introduction
One of the key challenges in building robust autonomous navigation systems is the development of a strong intelligence pipeline that can efficiently gather incoming sensor data and take suitable control actions with good repeatability and fault-tolerance. In the past, this was addressed in a modular fashion, where specialized algorithms were developed for each subsystem and integrated with fine tuning. More recent trends show a revival of end-to-end approaches that learn complex mappings directly from the input to the output by leveraging large volumes of task-specific data and the remarkable abstraction abilities afforded by deep neural networks. In autonomous navigation, these techniques have been used for learning visuomotor policies from human driving data [1]. However, traditional deep supervised learning-based driving requires a great deal of human annotation, and yet may not be able to deal with the problem of accumulating errors during test time [2]. On the other hand, deep reinforcement learning (DRL) offers a better formulation that allows policy improvement with feedback, and has achieved human-level performance on challenging game environments [3, 4].

In this work, we present an end-to-end controller that uses multi-sensor input to learn an autonomous navigation policy in a physics-based gaming environment called TORCS [5], without needing any pre-training. To show the effectiveness of multi-sensory perception, we pick two popular continuous-action DRL algorithms, namely Normalized Advantage Function (NAF) [6] and Deep Deterministic Policy Gradient (DDPG) [7], and augment them to accept multi-sensory input. We limit our objective to achieving autonomous navigation without any obstacles or other cars. The problem is kept simple to focus on analyzing the performance of the proposed multi-sensor configurations through extensive quantitative and qualitative testing. Sensor redundancy can be a bane if the policy relies heavily on all inputs, leading to a significant performance drop even if a single sensor fails. To avoid this situation, we apply a customized stochastic regularization technique called Sensor Dropout during training. Our approach reduces the policy's over-dependence on a specific sensor subset, and guarantees minimal performance drop even in the face of any partial sensor failure. We further augment the standard DRL loss with an auxiliary loss to reduce variance in the trained policy and offer smoother performance during abrupt sensor loss or reactivation.
Recently, promising experimental results were shown combining camera and lidar to build an end-to-end steering controller for UGV navigation [8]. Similarly, a multimodal DQN was built for a KUKA youBot [9] by fusing information from homogeneous sensing modalities. However, the fusion stage in [8] is limited to sensors that are spatially redundant with each other, and requires the feature embedding of each sensor to have the same dimensionality. On the other hand, [9] requires a two-stage training scheme that first approximates a value function and then refines the policy with DropPath [9] regularization. In addition to the longer training time, this works only if one assumes that DropPath during the second stage does not throw the policy outside the initially optimized policy distribution; any two-stage scheme with regularization in the second stage has to make this strong assumption.
The proposed method can best be seen as a far more general version of the above two. Multi-sensor fusion can be performed on heterogeneous sensing modalities, anywhere in the network pipeline, and on shorter timescales. Moreover, the objective is not only improved sensor fusion but also a guarantee of continued operation even if a sensor subset fails (unique to this work). Through extensive empirical testing, we show the following results in this paper:

Multi-sensory DRL with Sensor Dropout (SD) reduces the performance drop in a noisy environment from 51.7% to just 9.8% when compared to a baseline system (Table 1).

A multi-sensory policy with SD guarantees functionality even in the face of a sensor subset failure. This particular feature underscores the need for redundancy in a safety-critical application like autonomous navigation.
2 Related Work
Multi-sensor DRL aims to leverage the availability of multiple, potentially imperfect, sensor inputs to improve the learned policy. Most autonomous driving vehicles today are equipped with an array of sensors such as GPS, lidar, camera, and odometer. However, some of these sensors, like GPS and odometers, are readily available but seldom included in deep supervised learning models [1]. Even in DRL, policies are predominantly single-sensor based, i.e., they use either low-dimensional physical states or high-dimensional pixels. For autonomous driving, where it is essential to achieve the highest possible safety and accuracy targets, developing policies that operate with multiple inputs is better suited. In fact, before the advent of deep learning based approaches, multi-sensory perception was an integral part of autonomous navigation solutions and even played a critical role in their success [10]. Sensor fusion offers several advantages, namely robustness to individual sensor noise/failure and improved object classification and tracking [11]. In this light, several recent works in DRL have tried to solve complex robotics tasks such as human-robot interaction [12], manipulation [13], and maze navigation [14] with multi-sensory inputs. Mirowski et al. [14] use similar sensory data as in this work to navigate through a maze. However, their robot evolves with simpler dynamics, and the depth information is only used to formulate an auxiliary loss, not as an input for learning the navigation policy.

Multi-sensory deep learning, popularly called multimodal deep learning, is an active area of research in other domains like audio-visual systems [15] and text/speech and language models [16]. However, multimodal learning is conspicuous by its absence in the modern end-to-end autonomous navigation literature. Another challenge in multimodal learning is a specific case of overfitting where, instead of learning the underlying latent target state representation from multiple diverse observations, the model learns a complex representation in the original space itself, defeating the purpose of using multi-sensor observations and making the process computationally burdensome. An illustrative example of this case is a car that navigates well when all sensors remain functional but fails to navigate at all if even one sensor fails or is partially corrupted. This kind of behavior is detrimental, and suitable regularization measures should be set up during training to avoid it.
Stochastic regularization is an active area of research in deep learning, made popular by the success of Dropout [17]. Following this landmark paper, numerous extensions were proposed to further generalize the idea [18, 19, 20, 21]. In a similar vein, an interesting technique has been proposed for specialized regularization in the multimodal setting, namely ModDrop [22]. ModDrop, however, requires pre-training with individual sensor inputs using separate loss functions, and the method was originally designed for multimodal deep learning on a fixed dataset. We argue that for DRL, where the training dataset is generated during run time, pre-training each sensor policy may end up optimizing on different input distributions. In comparison, Sensor Dropout is designed to be applicable to the DRL setting. With SD, a network can be constructed directly in an end-to-end fashion, and the sensor fusion layer can be added just like Dropout. The training time is much shorter and scales better with an increasing number of sensors.

3 Multimodal Deep Reinforcement Learning
Deep Reinforcement Learning (DRL) Brief Review: We consider a standard reinforcement learning (RL) setup, where an agent operates in an environment $E$. At each discrete time step $t$, the agent observes a state $s_t$, picks an action $a_t$, and receives a scalar reward $r_t$ from the environment. The return $R_t = \sum_{i=t}^{T} \gamma^{i-t} r_i$ is defined as the total discounted future reward at time step $t$, with $\gamma \in (0, 1]$ being a discount factor. The objective of the agent is to learn a policy that maximizes the expected return. The learned policy can be formulated as either stochastic, $\pi(a|s)$, or deterministic, $a = \mu(s)$. The value function $V^{\pi}(s)$ and action-value function $Q^{\pi}(s, a)$ describe the expected return for each state and state-action pair upon following a policy $\pi$. Finally, the advantage function $A^{\pi}(s, a)$ is defined as the additional reward, or advantage, that the agent receives for executing action $a$ at state $s$, and is given by $A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$.

In a high-dimensional state/action space, these functions are usually approximated by a suitable parametrization. Accordingly, we define $\theta^{\pi}$, $\theta^{\mu}$, $\theta^{V}$, $\theta^{Q}$, and $\theta^{A}$ as the parameters approximating the $\pi$, $\mu$, $V$, $Q$, and $A$ functions, respectively. It was generally believed that using nonlinear function approximators would lead to unstable learning in practice. Recently, Mnih et al. [3] applied two novel modifications, namely a replay buffer and a target network, to stabilize learning with deep nets. Later, several variants were introduced that exploited deep architectures and extended learning to tasks with continuous actions [7, 23, 6].
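The definitions above can be illustrated with a minimal sketch of the discounted return. The function name and the recursive accumulation are our own illustration, not part of the original formulation:

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Compute R_t = sum_{i >= t} gamma^(i-t) * r_i for every time step t,
    using the recursion R_t = r_t + gamma * R_{t+1}."""
    returns = np.zeros(len(rewards))
    future = 0.0
    for t in reversed(range(len(rewards))):
        future = rewards[t] + gamma * future
        returns[t] = future
    return returns

# Example: three unit rewards with gamma = 0.5
print(discounted_returns([1.0, 1.0, 1.0], 0.5).tolist())  # [1.75, 1.5, 1.0]
```

The advantage $A^{\pi}(s_t, a_t)$ then measures how much an empirical return from $(s_t, a_t)$ exceeds the state's baseline value $V^{\pi}(s_t)$.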
To exhaustively analyze the effect of multi-sensor input and the new stochastic regularization technique, we pick two algorithms in this work, namely DDPG and NAF. It is worth noting that the two algorithms are very different, with DDPG being an off-policy actor-critic method and NAF an off-policy value-based one. By augmenting these two algorithms, we highlight that any DRL algorithm, modified appropriately, can benefit from using multi-sensor inputs. Due to space constraints, we list the formulations of the two algorithms in the Supplementary Material (Section A).
Multimodal (or) Multi-sensory Policy Architecture: We denote a set of observations composed from $M$ sensors as $\mathbf{s} = \{s^{(1)}, s^{(2)}, \dots, s^{(M)}\}$, where $s^{(i)}$ stands for the observation from the $i$-th sensor. In the multimodal network, each sensory signal is preprocessed along an independent path. Each path has a feature extraction module that can be either a pure identity function (for low-dimensional modalities) or convolution-based layers (for high-dimensional modalities). The modularized feature extraction stage naturally allows for independent extraction of salient information that is transferable (with some tuning if needed) to other applications. The outputs of the feature extraction modules are eventually flattened and concatenated to form the multimodal state. A schematic illustration of the modularized multimodal policy is shown in Fig. 1.

4 Augmenting MDRL
In this section, we propose two methods to improve training of a multisensor policy. We first introduce a new stochastic regularization called Sensor Dropout, and explain its advantages over the standard Dropout for this problem. Later, we propose an additional unsupervised auxiliary loss function to reduce the policy variance.
Sensor Dropout (SD) for Robustness: Sensor Dropout is a customization of Dropout [17]
that drops configurations of entire sensor modules instead of individual neurons. Though both methods serve the purpose of regularization, SD is better motivated for training multi-sensory policies. By randomly dropping sensor blocks during training, the policy network is encouraged to exploit cross connections across different sensing streams. When applied to a complex robotic system, SD has the advantage of handling imperfect sensing conditions such as latency, noise, and even partial sensor failure. As shown in Fig. 1, consider the multimodal state $\mathbf{s} = \{s^{(1)}, \dots, s^{(M)}\}$. The dropping configuration is defined as an $M$-dimensional vector $\mathbf{c} = (c_1, \dots, c_M)$, where each element $c_i \in \{0, 1\}$ represents the on/off indicator for the $i$-th sensor modality. Each sensor modality is represented by a $K_i$-dimensional feature vector, denoted as $\tilde{s}^{(i)} \in \mathbb{R}^{K_i}$; the subscript indicates that each sensor may have a different dimension. We now detail the two main differences between the original Dropout and SD, along with their interpretations.

Firstly, note that the dimension of the dropping vector $\mathbf{c}$ is much lower than that of the corresponding vector in standard Dropout ($M \ll \sum_i K_i$). As a consequence, the probability of the event where all sensors are dropped (i.e., $\mathbf{c} = \mathbf{0}$) is not negligible in SD. To explicitly remove $\mathbf{c} = \mathbf{0}$, we slightly depart from [17] in modeling the SD layer. Instead of modeling SD as a random process where each sensor block is switched on/off with a fixed probability, we define the random variable as the dropping configuration $\mathbf{c}$ itself. Since there are $N = 2^M - 1$ admissible states for $\mathbf{c}$, we sample $\mathbf{c}$ from an $N$-state categorical distribution $\mathcal{P}$. We denote the probability of the $j$-th dropping configuration occurring by $p_j$, where the subscript $j$ ranges from $1$ to $N$. The corresponding pseudo-Bernoulli distribution for switching on a sensor block $i$ can be calculated as $p^{(i)} = \sum_{j:\, c_i^{(j)} = 1} p_j$. (We call $p^{(i)}$ pseudo-Bernoulli because we restrict our attention to cases where at least one sensor block is switched on at any given instant. This implies that switching any sensor block on is independent of the others, but switching it off is not, so the distribution is no longer fully independent.)

Remark: Note that sampling from a standard Bernoulli distribution over sensor blocks, with rejection of $\mathbf{c} = \mathbf{0}$, would have the same effect. However, the proposed categorical distribution aids in better bookkeeping and makes configurations easy to interpret. It can also be adapted to the current sensor reliability during run time.
Another difference from standard Dropout is the rescaling process. Unlike standard Dropout, which preserves a fixed scaling ratio after dropping neurons, the rescaling ratio in SD is formulated as a function of the dropping configuration and the sensor dimensions. The intuition is to keep the weighted summations equivalent among different dropping configurations in order to activate the later hidden layers. The scaling ratio is calculated as $\alpha_{\mathbf{c}} = \sum_{i=1}^{M} K_i \, \big/ \, \sum_{i=1}^{M} c_i K_i$.

In summary, the output of SD for the $k$-th feature in sensor block $i$ (i.e., $\tilde{s}^{(i)}_k$) given a dropping configuration $\mathbf{c}$ can be written as $\hat{s}^{(i)}_k = m^{(i)}_{\mathbf{c}} \tilde{s}^{(i)}_k$, where $m^{(i)}_{\mathbf{c}} = \alpha_{\mathbf{c}} c_i$ is an augmented mask encapsulating both dropout and rescaling.
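A minimal numpy sketch of the SD layer, under the assumptions stated above (a categorical distribution over the non-empty configurations, and rescaling by total over kept feature dimension); the function name and signature are our own illustration:

```python
import itertools
import numpy as np

def sensor_dropout(features, dims, rng, probs=None):
    """Apply Sensor Dropout to a concatenated multimodal feature vector.

    features : 1-D array of length sum(dims), the concatenated sensor features.
    dims     : list with the feature dimension K_i of each sensor block.
    probs    : categorical distribution over the 2^M - 1 non-empty
               configurations (uniform if None).
    """
    M = len(dims)
    # Enumerate every dropping configuration with at least one sensor on,
    # so the all-off event c = 0 is excluded by construction.
    configs = [c for c in itertools.product([0, 1], repeat=M) if any(c)]
    if probs is None:
        probs = np.full(len(configs), 1.0 / len(configs))
    c = configs[rng.choice(len(configs), p=probs)]
    # Rescaling keeps the weighted summation comparable across configurations:
    # alpha_c = (total feature dim) / (kept feature dim).
    alpha = sum(dims) / sum(k for k, on in zip(dims, c) if on)
    mask = np.concatenate([np.full(k, alpha * on) for k, on in zip(dims, c)])
    return mask * features, c
```

With all-ones features, the sum of the masked output equals the total feature dimension for every configuration, which is exactly the "equivalent weighted summation" property described above.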
Auxiliary Loss for Variance Reduction: An alternative interpretation of the SD-augmented policy is that the sub-policies induced by each sensor combination are jointly optimized during training. Denote the ultimate SD-augmented policy and the sub-policy induced by the $j$-th sensor combination as $\pi_{SD}$ and $\pi_j$, respectively. The final output maintains a geometric mean over the $N$ different actions.

Though the expectation of the total policy gradients for each sub-policy is the same, SD provides no guarantee on the consistency of these actions. To encourage the policy network to extract salient features from each sensor that embed into a common latent state representation, we add an auxiliary loss that penalizes the inconsistency among the $\pi_j$. This additional penalty term provides an alternative gradient that reduces the variance of the ultimate policy $\pi_{SD}$. The mechanism is motivated by recent successes [14, 24] that use auxiliary tasks to improve both the agent's performance and its convergence rate. However, unlike most previous works, which design the auxiliary tasks carefully from ground truth about the environment, we formulate the target action from the policy network itself. Under the standard actor-critic architecture, the target action $\pi^{*}$ is defined as the output action of the sub-policy in the target actor network that maximizes the target critic value. In other words, we use the currently best-trained sub-policy as a heuristic to guide the other sub-policies during training.
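The target-action construction described above can be sketched per state as follows; the function and its arguments are our own illustration of the idea, not the authors' implementation:

```python
import numpy as np

def auxiliary_loss(sub_actions, critic_values):
    """Inconsistency penalty across sub-policies for a single state.

    sub_actions   : (N, action_dim) actions from the N sensor-subset
                    sub-policies.
    critic_values : (N,) target-critic value of each sub-policy's action.
    The best-rated action serves as the target for all the others.
    """
    target = sub_actions[np.argmax(critic_values)]
    return np.mean(np.sum((sub_actions - target) ** 2, axis=1))
```

The penalty vanishes when all sub-policies agree and grows with their disagreement, which is the gradient signal used to pull the sub-policies toward a common latent representation.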
$$L_{aux} = \frac{\lambda}{N_b} \sum_{b=1}^{N_b} \sum_{j=1}^{N} \left\| \pi_j(s_b) - \pi^{*}(s_b) \right\|^2 \qquad (1)$$

Here, $\lambda$ is an additional hyper-parameter that indicates the ratio between the two losses, and $N_b$ is the batch size for off-policy learning.

5 Evaluation Results
5.1 Platform Setup
TORCS Simulator: The proposed approach is verified on TORCS [5], a popular open-source car racing simulator that is capable of simulating physically realistic vehicle dynamics as well as multiple sensing modalities [25] to build sophisticated AI agents. In order to make the learning problem representative of the real-world setting, we use the following sensing modalities for our state description: (1) We define Sensor 1 as a hybrid state containing physical-based information such as odometry and a simulated GPS signal. (2) Sensor 2 consists of consecutive laser scans (i.e., at time $t$, we input the most recent scans). (3) Finally, as Sensor 3, we supply consecutive color images capturing the car's front view. These three representations are used separately to develop our baseline uni-modal sensor policies. The multimodal state, on the other hand, has access to all sensors at any given point. When Sensor Dropout (SD) is applied, the agent randomly loses access to a strict subset of sensors. The categorical distribution is initialized as a uniform distribution over the $2^3 - 1 = 7$ possible sensor-subset combinations, and the best learned policy is reported here. The action space is a continuous vector in $\mathbb{R}^2$, whose elements represent the steering angle and acceleration. Experiment details such as the exploration strategy, network architectures of each model, and sensor dimensionality are given in the Supplementary Material (Section B).

5.2 Results
Training Summary: The training performance for all the proposed models and their corresponding baselines is shown in Fig. 2. For DDPG, using high-dimensional sensory input directly impacts the convergence rate of the policy. Note that the image-based uni-modal policy (orange line) has a much larger state space than the multimodal policies (purple and green lines). Counter-intuitively, NAF shows nearly linear improvement over training steps and is relatively insensitive to the dimensionality of the state space. However, adding Sensor Dropout (SD) dramatically increases its convergence rate. For both algorithms, the final performance of multimodal sensor policies trained with SD is slightly lower than that of policies trained without SD, indicating that SD has a regularization effect similar to the original Dropout.
Table 1: Policy performance with and without sensor noise.

Policy | w/o Noise | w/ Noise | Performance Drop
Multi Unimodal w/ Meta Controller | 1.51 ± 0.57 | 0.73 ± 0.40 | 51.7%
Multimodal w/ SD | 2.54 ± 0.08 | 2.29 ± 0.60 | 9.8%
Comparison with Uni-modal Policies + Meta Controller: An intuitive baseline for the multi-sensor problem is to train each uni-modal sensor policy separately. Once the individual policies are learned, we can train an additional meta controller that selects which policy to follow given the current state. For this, we follow the setup in [26] by training a meta controller that takes the processed states from each uni-modal policy and outputs, via a softmax layer, the probability of choosing each sub-policy to execute. Note that we assume perfect sensing during training. However, to test performance in a more realistic scenario, we simulate mildly imperfect sensing by adding Gaussian noise. Policy performance with and without noise is summarized in Table 1. The performance of the baseline policy drops dramatically once noise is introduced, which implies that the uni-modal policies are prone to overfitting without any regularization. In fact, the performance drop is sometimes severe for the physical-based or laser-based policies. In comparison, the policy trained with SD reaches a higher score in both scenarios, and its drop when noise is introduced is almost negligible.
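The baseline's gating step can be sketched as follows; the linear gate, its weights, and the greedy selection are our own simplifying assumptions for illustration:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def meta_controller_act(processed_states, weights, policies, state):
    """Pick one uni-modal sub-policy via a softmax gate (baseline sketch).

    processed_states : concatenated features from each uni-modal policy.
    weights          : (num_policies, feature_dim) linear gating weights.
    policies         : list of callables, one action function per sensor.
    """
    probs = softmax(weights @ processed_states)
    k = int(np.argmax(probs))   # greedy selection at test time
    return policies[k](state), probs
```

Note that because exactly one sub-policy acts at a time, this baseline fuses no information across sensors, which is one reason it degrades sharply under noise.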
Table 2: Policy sensitivity metric in the training and testing environments.

 | Training Env. | Testing Env.
NAF w/o SD | 1.651 | 1.722
NAF w/ SD | 1.284 | 1.086
DDPG w/o SD | 1.458 | 1.468
DDPG w/ SD | 1.168 | 1.171
Policy Robustness Analysis: In this part, we show that SD reduces the learned policy's acute dependence on a subset of sensors in a multimodal sensor setting. First, we consider a scenario where sensor malfunctions have been detected by the system, and the agent must rely on the remaining sensors to make navigation decisions. To simulate this setting during testing, we randomly block out some sensor modules and scale the rest using the same rescaling mechanism proposed in Section 4. Fig. 3 reports the average normalized reward of each model. A naive multimodal policy without any stochastic regularization (blue bar) performs poorly in the face of partial sensor failure and transfer tasks. Adding the original Dropout makes the policy generalize better, yet the performance is not comparable to SD. Interestingly, reducing the variance of the multimodal sensor policy with the auxiliary loss tends to make the policy generalize better to other environments.
Policy Sensitivity Analysis: To monitor the extent to which the learned policy depends on each sensor block, we measure the gradient of the policy output w.r.t. a subset of the input blocks. The technique is motivated by saliency map analysis [27], which has also recently been applied to DRL [28]. To better analyze the effects of SD, we report on a smaller subset by implementing the SD layer to drop either of two chosen sensor subsets. Consequently, the sensitivity metric is formulated as the relative sensitivity of the policy to the two sensor subsets, i.e., the ratio of the average gradient magnitudes, $T_{1:2} = \mathbb{E}\big[\lVert \nabla_{s^{(1)}} \pi(\mathbf{s}) \rVert\big] \, / \, \mathbb{E}\big[\lVert \nabla_{s^{(2)}} \pi(\mathbf{s}) \rVert\big]$. If the ratio increases, the agent's dependence shifts toward the sensor block in the numerator, and vice versa. Assuming the fusion of interest is between the above-mentioned two subsets, we show in Table 2 that, using SD, the metric gets closer to 1, indicating nearly equal importance of both sensing modalities.
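The gradient-ratio metric can be sketched with finite differences for a scalar policy output; the function and its central-difference approximation are our own illustration, not the paper's implementation:

```python
import numpy as np

def sensitivity_ratio(policy, state, idx_a, idx_b, eps=1e-5):
    """Relative sensitivity of a policy to two input subsets.

    policy       : maps a 1-D state vector to a scalar action.
    idx_a, idx_b : index lists of the two sensor subsets being compared.
    Returns ||grad w.r.t. subset a|| / ||grad w.r.t. subset b||; values
    near 1 indicate roughly equal dependence on both subsets.
    """
    def grad(indices):
        g = np.zeros(len(indices))
        for n, i in enumerate(indices):
            d = np.zeros_like(state)
            d[i] = eps
            # central finite difference along dimension i
            g[n] = (policy(state + d) - policy(state - d)) / (2 * eps)
        return g
    return np.linalg.norm(grad(idx_a)) / np.linalg.norm(grad(idx_b))
```

In practice the gradients would be taken through the network by backpropagation and averaged over states; finite differences are used here only to keep the sketch self-contained.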
Effect of Auxiliary Loss: In this experiment, we verify how the auxiliary loss helps reshape the multimodal sensor policy and reduce the action variance. We extract the representations of the last hidden layer assigned by the policy network throughout a fixed episode. At every time step, the representation induced by each sensor combination is collected. Our intuition is that this latent space represents how the policy network interprets the incoming sensor streams for reaction. Based on this assumption, an ideal multimodal sensor policy should map different sensor streams to a similar distribution, as long as the information provided by each combination is representative enough to lead to the same output action.

As shown in Fig. 4, the naive multimodal sensor policy has a scattered distribution over the latent space, indicating that representative information from each sensor is treated very differently. In comparison, the policy trained with SD has a concentrated distribution, yet it is still distinguishable w.r.t. different sensors. Adding the auxiliary training loss encourages true sensor fusion, as the distribution becomes more integrated. During training, the policy is not only forced to explicitly make decisions under each sensor combination, but is also penalized for disagreements among the multimodal sensor sub-policies. In fact, as shown in Fig. 5, the concentration of the latent space directly affects the action variance induced by each sub-policy. We provide the actual covariances for each component and the actual action variance values in the Supplementary Material (Section C).
6 Discussion
Full Sub-Policy Analysis: The performance of each sub-policy is summarized in Fig. 6. As shown in the first and third columns, the performance of the naive multimodal sensor policy (red) and of the policy trained with standard Dropout (blue) drops dramatically as the policies lose access to the image, which accounts for a large share of the total multimodal state. Though Dropout improves the performance of the policy in the testing environment, the generalization is limited to using the full multimodal state as input. On the other hand, SD generalizes the policy across sensor modules, making the sub-policies transfer successfully to the testing environment. It is worth mentioning that policies trained with SD are capable of operating even when both laser and image sensors are blocked. Interestingly, neither the original Dropout nor SD shows apparent degradation of the full policy induced by the regularization. We leave a deeper analysis of this to future work.
Visualizing Policy Attention Regions: The average gradient in the policy sensitivity section can also be used to visualize the regions of each sensor to which the policy network pays attention. As shown in Fig. 7, we observe that policies trained with SD have higher gradients on neurons corresponding to the corner inputs of the laser sensor, indicating that a sparser and more meaningful policy is learned. These corner inputs correspond to the laser beams oriented perpendicularly to the vehicle's direction of motion, which give an estimate of its relative position on the track. To look for similar patterns in Fig. 7, image pixels with higher gradients are marked to interpret the policy's view of the world. We pick two scenarios, (1) a straight track and (2) a sharp left turn, depicted by the first and second rows of the figure. Note that though policies trained without SD tend to focus more on the road, those areas are in plain color and offer little salient information. In conclusion, policies trained with SD are more sensitive to features such as the road boundary, which is crucial for long-horizon planning. In comparison, networks trained without SD have relatively low and unclear gradients over both the laser and image sensor state spaces.
7 Conclusions and Future Work
In this work, we introduce a new stochastic regularization technique called Sensor Dropout to promote effective fusion of information from multiple sensors. The variance of the resulting policy can be further reduced by introducing an auxiliary loss during training. We show that SD reduces the policy's sensitivity to a particular sensor subset and guarantees functionality even in the face of a sensor subset failure. Moreover, the policy network is able to automatically infer and weight the locations providing salient information. For future work, we wish to extend the framework to other environments, such as real robotic systems, and to other algorithms like TRPO [29] and Q-Prop [30]. Systematic investigation of problems such as how to augment the reward function for other important driving tasks, like collision avoidance and lane changing, and how to adaptively adjust the SD distribution during training are also interesting avenues that merit further study.
Acknowledgement
The authors would like to thank Po-Wei Chou, Humphrey Hu, and Ming Hsiao for many helpful discussions, suggestions, and comments on the paper. This research was funded under an award from Yamaha Motor Corporation.
References
 Bojarski et al. [2016] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.

 Ross et al. [2011] S. Ross, G. J. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, volume 1, page 6, 2011.
 Mnih et al. [2013] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing atari with deep reinforcement learning. In NIPS'13 Workshop on Deep Learning, 2013.
 Mnih et al. [2015] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
 Wymann et al. [2000] B. Wymann, E. Espié, C. Guionneau, C. Dimitrakakis, R. Coulom, and A. Sumner. TORCS, the open racing car simulator. Software available at http://torcs.sourceforge.net, 2000.

 Gu et al. [2016] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine. Continuous deep q-learning with model-based acceleration. In Proceedings of The 33rd International Conference on Machine Learning, pages 2829–2838, 2016.
 Lillicrap et al. [2016] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations (ICLR), 2016.
 Patel et al. [2017] N. Patel, A. Choromanska, P. Krishnamurthy, and F. Khorrami. Sensor modality fusion with cnns for ugv autonomous driving in indoor environments. In International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017.
 Bohez et al. [2017] S. Bohez, T. Verbelen, E. De Coninck, B. Vankeirsbilck, P. Simoens, and B. Dhoedt. Sensor fusion for robot control through deep reinforcement learning. arXiv preprint arXiv:1703.04550, 2017.
 Urmson et al. [2007] C. Urmson, J. A. Bagnell, C. R. Baker, M. Hebert, A. Kelly, R. Rajkumar, P. E. Rybski, S. Scherer, R. Simmons, S. Singh, et al. Tartan racing: A multimodal approach to the darpa urban challenge. 2007.
 Cho et al. [2014] H. Cho, Y.W. Seo, B. V. Kumar, and R. R. Rajkumar. A multisensor fusion system for moving object detection and tracking in urban driving environments. In International Conference on Robotics and Automation (ICRA), pages 1836–1843. IEEE, 2014.
 Qureshi et al. [2016] A. H. Qureshi, Y. Nakamura, Y. Yoshikawa, and H. Ishiguro. Robot gains social intelligence through multimodal deep reinforcement learning. In 16th International Conference on Humanoid Robots, pages 745–751. IEEE, 2016.
 Levine et al. [2016] S. Levine, C. Finn, T. Darrell, and P. Abbeel. Endtoend training of deep visuomotor policies. Journal of Machine Learning Research, 17(39):1–40, 2016.
 Mirowski et al. [2017] P. Mirowski, R. Pascanu, F. Viola, H. Soyer, A. Ballard, A. Banino, M. Denil, R. Goroshin, L. Sifre, K. Kavukcuoglu, D. Kumaran, and R. Hadsell. Learning to navigate in complex environments. In International Conference on Learning Representations (ICLR), 2017.
 Ngiam et al. [2011] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal deep learning. In Proceedings of the 28th international conference on machine learning (ICML11), pages 689–696, 2011.

 Srivastava and Salakhutdinov [2012] N. Srivastava and R. R. Salakhutdinov. Multimodal learning with deep boltzmann machines. In Advances in Neural Information Processing Systems, pages 2222–2230, 2012.
 Srivastava et al. [2014] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.

 Murdock et al. [2016] C. Murdock, Z. Li, H. Zhou, and T. Duerig. Blockout: Dynamic model selection for hierarchical deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2583–2591, 2016.
 Wan et al. [2013] L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus. Regularization of neural networks using dropconnect. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1058–1066, 2013.
 Krueger et al. [2016] D. Krueger, T. Maharaj, J. Kramár, M. Pezeshki, N. Ballas, N. R. Ke, A. Goyal, Y. Bengio, H. Larochelle, A. Courville, et al. Zoneout: Regularizing rnns by randomly preserving hidden activations. arXiv preprint arXiv:1606.01305, 2016.

Frazão and Alexandre [2014]
X. Frazão and L. A. Alexandre.
Dropall: Generalization of two convolutional neural network regularization methods.
In International Conference Image Analysis and Recognition, pages 282–289. Springer, 2014.  Neverova et al. [2016] N. Neverova, C. Wolf, G. Taylor, and F. Nebout. Moddrop: adaptive multimodal gesture recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(8):1692–1706, 2016.
 Mnih et al. [2016] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 2016.
 Jaderberg et al. [2016] M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. CoRR, abs/1611.05397, 2016. URL http://arxiv.org/abs/1611.05397.
 Yoshida [2016] N. Yoshida. Gym-TORCS. https://github.com/ugo-nama-kun/gym_torcs, 2016.
 [26] R. Liaw, S. Krishnan, A. Garg, D. Crankshaw, J. E. Gonzalez, and K. Goldberg. Composing metapolicies for autonomous driving using hierarchical deep reinforcement learning.
 Simonyan et al. [2014] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
 Wang et al. [2016] Z. Wang, N. de Freitas, and M. Lanctot. Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning (ICML), 2016.
 Schulman et al. [2015] J. Schulman, S. Levine, P. Abbeel, M. I. Jordan, and P. Moritz. Trust region policy optimization. In ICML, pages 1889–1897, 2015.
 Gu et al. [2017] S. Gu, T. P. Lillicrap, Z. Ghahramani, R. E. Turner, and S. Levine. Qprop: Sampleefficient policy gradient with an offpolicy critic. In International Conference on Learning Representations (ICLR), 2017.
 Sutton et al. [1999] R. S. Sutton, D. A. McAllester, S. P. Singh, et al. Policy gradient methods for reinforcement learning with function approximation. In NIPS, volume 99, pages 1057–1063, 1999.
 Silver et al. [2014] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller. Deterministic policy gradient algorithms. In ICML, 2014.

Lau [2016]
Y.P. Lau.
Using keras and deep deterministic policy gradient to play torcs.
https://yanpanlau.github.io/2016/10/11/TorcsKeras.html, 2016.  Kingma and Ba [2015] D. P. Kingma and J. L. Ba. Adam: A method for stochastic optimization. 2015.
Appendix A Continuous Action Space Algorithms
A.1 Normalized Advantage Function (NAF)
Q-learning [31] is an off-policy, model-free algorithm in which the agent learns an approximated Q-function $Q(s, a \mid \theta^Q)$ and follows a greedy policy at each step. The objective can be reached by minimizing the squared Bellman error $L = \big(y_t - Q(s_t, a_t \mid \theta^Q)\big)^2$, where the target $y_t$ is defined as $y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a' \mid \theta^Q)$.
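The target and loss above can be sketched in a few lines; this is a minimal illustration, not the network-based implementation used in the experiments.

```python
import numpy as np

def td_target(reward, next_q_values, gamma=0.99, done=False):
    """Bellman target y = r + gamma * max_a' Q(s', a'); no bootstrap at episode end."""
    return reward + (0.0 if done else gamma * np.max(next_q_values))

def bellman_loss(q_sa, target):
    """Squared Bellman error (y - Q(s, a))^2."""
    return (target - q_sa) ** 2

# toy check: Q(s', .) = [1.0, 3.0], r = 1, gamma = 0.9  ->  y = 1 + 0.9 * 3 = 3.7
y = td_target(1.0, np.array([1.0, 3.0]), gamma=0.9)
loss = bellman_loss(2.7, y)  # (3.7 - 2.7)^2 = 1.0
```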
Recently, [6] proposed a continuous variant of deep Q-learning via a clever network construction. The network, which they call the Normalized Advantage Function (NAF), parameterizes the advantage function quadratically over the action space, weighted by non-linear features of the state:
$Q(s, a \mid \theta^Q) = A(s, a \mid \theta^A) + V(s \mid \theta^V)$  (2)
$A(s, a \mid \theta^A) = -\tfrac{1}{2}\,\big(a - \mu(s \mid \theta^\mu)\big)^{T} P(s \mid \theta^P)\,\big(a - \mu(s \mid \theta^\mu)\big)$  (3)
$P(s \mid \theta^P) = L(s \mid \theta^P)\,L(s \mid \theta^P)^{T}$  (4)
During runtime, the greedy action can be obtained by simply taking the output of the sub-network $\mu$. The data flow at the forward-prediction and back-propagation steps is shown in Fig. 8 (a) and (b), respectively.
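The quadratic construction guarantees that the greedy action is exactly $\mu(s)$; a minimal numpy sketch (function names illustrative, not the paper's code):

```python
import numpy as np

def naf_q(v, mu, L, a):
    """NAF value: Q(s,a) = V(s) - 0.5 (a - mu)^T P (a - mu), with P = L L^T >= 0."""
    P = L @ L.T                   # positive semi-definite by construction
    d = a - mu
    return v - 0.5 * d @ P @ d

# toy 2-D action space: the advantage vanishes at a = mu, so Q is maximized there
v, mu = 1.5, np.array([0.2, -0.1])
L = np.tril(np.array([[1.0, 0.0], [0.5, 2.0]]))   # lower-triangular factor of P
q_greedy = naf_q(v, mu, L, mu)        # advantage term is zero -> Q = V = 1.5
q_other = naf_q(v, mu, L, mu + 0.3)   # any other action scores strictly lower
```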
A.2 Deep Deterministic Policy Gradient (DDPG)
An alternative approach to continuous RL tasks is the actor-critic framework, which maintains an explicit policy function, called the actor, and an action-value function, called the critic. In [32], a novel deterministic policy gradient (DPG) approach was proposed, and it was shown that deterministic policy gradients have a model-free form and follow the gradient of the action-value function:
$\nabla_{\theta^\mu} J = \mathbb{E}_{s \sim \rho^\beta}\!\left[ \nabla_a Q(s, a \mid \theta^Q)\big|_{a = \mu(s \mid \theta^\mu)}\, \nabla_{\theta^\mu} \mu(s \mid \theta^\mu) \right]$  (5)
[32] proved that updating the model parameters with the policy gradient calculated in (5) leads to the maximum expected reward.
Building on this result, [7] proposed an extension of DPG with a deep architecture to generalize their prior success on discrete action spaces [4] to continuous spaces. Using DPG, an off-policy algorithm was developed to estimate the Q-function with a differentiable function approximator, and techniques similar to [4] were utilized for stable learning. To explore the full state and action space, an exploration policy was constructed by adding noise sampled from an Ornstein-Uhlenbeck process. The data flow for the prediction and back-propagation steps is shown in Fig. 8 (c) and (d), respectively.
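The chain rule in (5) can be made concrete with a one-parameter toy problem: a linear policy and a quadratic critic whose maximizer is known. All names here are illustrative; the experiments use deep networks, not this closed-form critic.

```python
# Toy deterministic policy gradient: linear policy mu(s) = theta * s,
# critic Q(s, a) = -(a - a_star)^2, maximized at a = a_star.
a_star, s, theta, lr = 2.0, 1.0, 0.0, 0.1

for _ in range(200):
    a = theta * s                       # deterministic action
    dq_da = -2.0 * (a - a_star)         # gradient of Q w.r.t. the action
    dmu_dtheta = s                      # gradient of the policy w.r.t. its parameter
    theta += lr * dq_da * dmu_dtheta    # ascend the deterministic policy gradient

# after training, the policy outputs the critic's maximizer: theta * s -> a_star
```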
Appendix B Experiment Details
B.1 Exploration and Reward
An exploration strategy is injected by adding Ornstein-Uhlenbeck process noise to the output of the policy network. The choice of reward function differs slightly from [7] and [23] in that an additional penalty term was added to penalize sideways drift along the track. In practice, this modification leads to more stable policies during training [33].
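The Ornstein-Uhlenbeck process produces temporally correlated, mean-reverting noise, which suits inertial controls like steering better than i.i.d. Gaussian noise. A minimal Euler-discretized sketch (the parameter values below are illustrative, not the ones used in training):

```python
import numpy as np

def ou_step(x, theta=0.15, mu=0.0, sigma=0.2, dt=1.0, rng=None):
    """One Euler step of an Ornstein-Uhlenbeck process: dx = theta*(mu - x)*dt + sigma*dW."""
    rng = rng or np.random.default_rng()
    return x + theta * (mu - x) * dt + sigma * np.sqrt(dt) * rng.standard_normal(np.shape(x))

rng = np.random.default_rng(0)
x = np.zeros(2)                 # one noise channel per action dimension (e.g. steering, accel)
trace = []
for _ in range(1000):
    x = ou_step(x, rng=rng)
    trace.append(x.copy())
trace = np.array(trace)
# mean reversion keeps the noise centered near mu = 0 instead of drifting like a random walk
```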
B.2 Network Architecture
For the laser feature-extraction module, we use two convolution layers, while the image feature-extraction module is composed of three convolution layers: one strided layer followed by two further layers. Batch normalization follows every convolution layer. All extraction modules are fused and followed by two fully-connected layers. All hidden layers use ReLU activations. The final layer of the critic network uses a linear activation, while the outputs of the actor network are bounded using tanh activations. We use a sigmoid activation for the output of the network in NAF; in practice, this leads to more stable training in the high-dimensional state space. We train on minibatches using Adam [34] to learn the network parameters, with separate learning rates for the DDPG actor and critic. We allow the actor and critic to maintain their own feature-extraction modules; in practice, sharing the same module can lead to unstable training. Note that the NAF algorithm maintains three separate networks, representing the value function ($V$), the policy ($\mu$), and the state-dependent covariance matrix in the action space ($L$). To maintain a similar experimental setting and avoid unstable training, we maintain two independent feature-extraction modules: one for $V$, and one shared by $\mu$ and $L$. In a similar vein, we apply one learning rate for $V$, and another for both $\mu$ and $L$.
Model ID  State Dimensionality        Description
Physical  10
Lasers    4 × 19                      4 consecutive laser scans
Images    12 × 64 × 64                4 consecutive RGB images
Multi     10 + 4 × 19 + 12 × 64 × 64  all sensor streams at the current time step
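The per-modality sizes in the table compose additively into the multi-modal state; a quick arithmetic check:

```python
# State sizes from the table above: 4 stacked laser scans of 19 readings each,
# and 4 stacked RGB frames (4 * 3 = 12 channels) at 64 x 64 resolution.
physical = 10
lasers = 4 * 19            # 76
images = 12 * 64 * 64      # 49152
multi = physical + lasers + images
```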
B.3 Simulated Sensor Detail
As shown in Fig. 9, the physical state is a 10-DOF hybrid state, including 3D velocity (3 DOF), position and orientation with respect to the track centerline (2 DOF), and finally the rotational speed of the wheels (4 DOF) and engine (1 DOF). Each laser scan is composed of 19 readings spanning a 180° field-of-view in front of the car. Finally, the camera provides RGB channels with a resolution of 64 × 64.
Appendix C More Experimental Results
C.1 Effect of Auxiliary Loss
                     NAF                            DDPG
Principal Component  w/o SD  w/ SD  w/ SD+aux       w/o SD  w/ SD  w/ SD+aux
First (%)            94.9    82.0   58.9            93.4    59.2   47.4
Second (%)           4.1     12.3   25.2            3.1     20.7   21.9
Third (%)            0.6     3.1    5.3             1.6     6.2    6.1
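The explained-variance percentages in the table above come from a PCA over recorded trajectories; the metric itself is easy to reproduce. This sketch uses synthetic data (all names and scales illustrative) where one direction dominates, mimicking a policy that leans on a single sensor subset:

```python
import numpy as np

def explained_variance_pct(X):
    """Percentage of total variance captured by each principal component of the rows of X."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)   # singular values of the centered data
    var = s ** 2
    return 100.0 * var / var.sum()

rng = np.random.default_rng(0)
# synthetic 3-D data dominated by its first axis -> first component captures most variance
X = rng.standard_normal((500, 3)) * np.array([5.0, 1.0, 0.3])
pct = explained_variance_pct(X)   # pct[0] >> pct[1] >> pct[2]; the entries sum to 100
```

A more even spread across components, as seen with Sensor Dropout in the table, indicates reduced dependence on any single direction of the input.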
              NAF                            DDPG
              w/o SD  w/ SD   w/ SD+aux      w/o SD  w/ SD   w/ SD+aux
Steering      0.1177  0.0819  0.0135         0.3329  0.0302  0.0290
Acceleration  0.4559  0.0472  0.0186         0.5714  0.0427  0.0143