One of the key challenges in building robust autonomous navigation systems is the development of a strong intelligence pipeline that can efficiently gather incoming sensor data and take suitable control actions with good repeatability and fault tolerance. In the past, this was addressed in a modular fashion, where specialized algorithms were developed for each sub-system and integrated with fine-tuning. More recent trends show a revival of end-to-end approaches that learn complex mappings directly from input to output by leveraging large volumes of task-specific data and the remarkable abstraction abilities afforded by deep neural networks. In autonomous navigation, these techniques have been used for learning visuomotor policies [1] from human driving data. However, traditional deep supervised learning-based driving requires a great deal of human annotation, and yet may not be able to deal with the problem of accumulating errors at test time [2]. On the other hand, deep reinforcement learning (DRL) offers a better formulation that allows policy improvement with feedback, and has achieved human-level performance on challenging game environments [3, 4].
In this work, we present an end-to-end controller that uses multi-sensor input to learn an autonomous navigation policy in a physics-based gaming environment called TORCS [5], without needing any pretraining. To show the effectiveness of multisensory perception, we pick two popular continuous-action DRL algorithms, namely Normalized Advantage Function (NAF) [6] and Deep Deterministic Policy Gradient (DDPG) [7], and augment them to accept multisensory input. We limit our objective to achieving autonomous navigation without any obstacles or other cars. The problem is kept simple in order to focus on analyzing the performance of the proposed multi-sensor configurations using extensive quantitative and qualitative testing. Sensor redundancy can be a bane if the policy relies heavily on all inputs, since the failure of even a single sensor can then lead to a significant performance drop. To avoid this situation, we apply a customized stochastic regularization technique called Sensor Dropout during training. Our approach reduces the policy's over-dependence on a specific sensor subset, and guarantees minimal performance drop even in the face of partial sensor failure. We further augment the standard DRL loss with an auxiliary loss to reduce variance in the trained policy and offer smoother performance during abrupt sensor loss or re-activation.
Recently, promising experimental results were shown by combining camera and lidar to build an end-to-end steering controller for UGV navigation [8]. Similarly, a multimodal DQN was built for a Kuka YouBot [9] by fusing information from homogeneous sensing modalities. However, the fusion stage in one of these approaches is limited to sensors that are spatially redundant with each other, and requires the feature embedding of each sensor to have the same dimensionality. The other requires a two-stage training scheme, which first approximates a function and then refines the policy with DropPath regularization. In addition to requiring longer training time, this scheme works only under the assumption that applying DropPath during the second stage does not push the policy outside the initially optimized policy distribution. Any two-stage method with regularization in the second stage has to make this strong assumption.
The proposed method can best be seen as a far more general version of the above two. Multi-sensor fusion can be performed on heterogeneous sensing modalities, anywhere in the network pipeline, and in shorter timescales. Moreover, the objective is not only to improve sensor fusion but also to guarantee operation even if a sensor subset fails (unique to this work). Through extensive empirical testing, we show the following results in this paper:
- Multisensory DRL with Sensor Dropout (SD) reduces the performance drop in a noisy environment when compared to a baseline system (see Table 1).
- A multisensory policy with SD guarantees functionality even in the face of a sensor subset failure. This particular feature underscores the need for redundancy in a safety-critical application like autonomous navigation.
2 Related Work
Multisensory DRL aims to leverage the availability of multiple, potentially imperfect, sensor inputs to improve the learned policy. Most autonomous vehicles today are equipped with an array of sensors such as GPS, lidar, camera, and odometer. However, some of these sensors, like GPS and odometers, are readily available but seldom included in deep supervised learning models. Even in DRL, policies are predominantly based on a single sensor, i.e., either low-dimensional physical states or high-dimensional pixels. For autonomous driving, where it is essential to achieve the highest possible safety and accuracy targets, policies that operate with multiple inputs are better suited. In fact, multisensory perception was an integral part of autonomous navigation solutions, and even played a critical role in their success, before the advent of deep learning based approaches. Sensor fusion offers several advantages, namely robustness to individual sensor noise or failure and improved object classification and tracking. In this light, several recent works in DRL have tried to solve complex robotics tasks such as human-robot interaction [12], manipulation [13], and maze navigation [14] with multisensory inputs. Mirowski et al. [14] use sensory data similar to that in this work to navigate through a maze. However, the robot evolves with simpler dynamics, and the depth information is only used to formulate an auxiliary loss and not as an input to learn a navigation policy.
Multisensory deep learning, popularly called multimodal deep learning, is an active area of research in other domains such as audiovisual systems [15] and text/speech and language models [16]. However, multimodal learning is conspicuous by its absence in the modern end-to-end autonomous navigation literature. Another challenge in multimodal learning is a specific case of over-fitting: instead of learning the underlying latent target state representation from multiple diverse observations, the model learns a complex representation in the original space itself, defeating the purpose of using multi-sensor observations and making the process computationally burdensome. An illustrative example is a car that navigates well when all sensors remain functional but fails to navigate at all as soon as one sensor fails or is partially corrupted. This kind of behavior is detrimental, and suitable regularization measures should be set up during training to avoid it.
Stochastic regularization is an active area of research in deep learning, made popular by the success of Dropout [17]. Following this landmark paper, numerous extensions were proposed to further generalize the idea [18, 19, 20, 21]. In a similar vein, an interesting technique called ModDrop [22] has been proposed for specialized regularization in the multimodal setting. ModDrop, however, requires pretraining with individual sensor inputs using separate loss functions. The method was originally designed for multimodal deep learning on a fixed dataset. We argue that for DRL, where the training dataset is generated at run-time, pretraining each sensor policy may end up optimizing on different input distributions. In comparison, Sensor Dropout is designed to be applicable to the DRL setting. With SD, a network can be directly constructed in an end-to-end fashion, and the sensor fusion layer can be added just like Dropout. The training time is much shorter and scales better with an increasing number of sensors.
3 Multimodal Deep Reinforcement Learning
Deep Reinforcement Learning (DRL) Brief Review: We consider a standard Reinforcement Learning (RL) setup, where an agent operates in an environment $E$. At each discrete time step $t$, the agent observes a state $s_t$, picks an action $a_t$, and receives a scalar reward $r(s_t, a_t)$ from the environment. The return $R_t = \sum_{i=t}^{T} \gamma^{(i-t)} r(s_i, a_i)$ is defined as the total discounted future reward at time step $t$, with $\gamma$ being a discount factor. The objective of the agent is to learn a policy that eventually maximizes the expected return. The learned policy, $\pi$, can be formulated as either stochastic, $\pi(a|s)$, or deterministic, $a = \mu(s)$. The value function $V^{\pi}$ and action-value function $Q^{\pi}$ describe the expected return for each state and state-action pair upon following a policy $\pi$. Finally, an advantage function $A^{\pi}(s_t, a_t)$ is defined as the additional reward or advantage that the agent will have for executing some action $a_t$ at state $s_t$, and it is given by $A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t)$.
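As a small worked example of the return and advantage definitions above, the following sketch computes the discounted return for every step of a short episode. The function names, the reward values, and the discount factor of 0.5 are illustrative assumptions and do not come from the experiments in this paper.

```python
def discounted_returns(rewards, gamma):
    """Discounted return at every step t: the sum over i >= t of gamma^(i-t) * r_i."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one computed for the next step.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def advantage(q_value, v_value):
    """Advantage of an action: action-value minus state value."""
    return q_value - v_value


# Hypothetical three-step episode.
rewards = [1.0, 0.0, 2.0]
print(discounted_returns(rewards, gamma=0.5))  # [1.5, 1.0, 2.0]
print(advantage(q_value=3.0, v_value=1.0))     # 2.0
```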
In high-dimensional state/action spaces, these functions are usually approximated by a suitable parametrization. Accordingly, we define $\theta^{Q}$, $\theta^{V}$, $\theta^{A}$, $\theta^{\pi}$, and $\theta^{\mu}$ as the parameters for approximating the $Q$, $V$, $A$, $\pi$, and $\mu$ functions, respectively. It was generally believed that using non-linear function approximators would lead to unstable learning in practice. Recently, Mnih et al. [4] applied two novel modifications, namely a replay buffer and a target network, to stabilize learning with deep nets. Later, several variants were introduced that exploited deep architectures and extended to learning tasks with continuous actions [7, 23, 6].
To exhaustively analyze the effect of multi-sensor input and the new stochastic regularization technique, we pick two algorithms in this work, namely DDPG and NAF. It is worth noting that the two algorithms are very different, with DDPG being an off-policy actor-critic method and NAF an off-policy value-based one. By augmenting these two algorithms, we highlight that any DRL algorithm, modified appropriately, can benefit from using multi-sensor inputs. Due to space constraints, we list the formulation of the two algorithms in the Supplementary Material (Section A).
Multimodal (or) Multisensory Policy Architecture: We denote a set of observations composed from $M$ sensors as $S = [S^{(1)}, S^{(2)}, \ldots, S^{(M)}]^{T}$, where $S^{(i)}$ stands for the observation from the $i$-th sensor. In the multimodal network, each sensory signal is pre-processed along an independent path. Each path has a feature extraction module that can be either a pure identity function or a convolution-based layer, depending on the modality. The modularized feature extraction stage naturally allows for independent extraction of salient information that is transferable (with some tuning if needed) to other applications. The outputs of the feature extraction modules are eventually flattened and concatenated to form the multimodal state. A schematic illustration of the modularized multimodal policy is shown in Fig. 1.
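The flatten-and-concatenate step described above can be sketched as follows. This is only an illustration: `mean_pool_features` is a hypothetical stand-in for a convolution-based module, and the sensor shapes are made up rather than taken from the network architectures in the Supplementary Material.

```python
def identity_features(obs):
    """Identity feature extractor for a low-dimensional sensor."""
    return list(obs)


def mean_pool_features(image, block=2):
    """Hypothetical stand-in for a convolutional module: average each
    non-overlapping block x block patch of a 2-D image."""
    rows, cols = len(image), len(image[0])
    pooled = []
    for r in range(0, rows - block + 1, block):
        pooled_row = []
        for c in range(0, cols - block + 1, block):
            patch = [image[r + dr][c + dc]
                     for dr in range(block) for dc in range(block)]
            pooled_row.append(sum(patch) / len(patch))
        pooled.append(pooled_row)
    return pooled


def flatten(nested):
    """Recursively flatten nested lists into one flat list of floats."""
    flat = []
    for item in nested:
        if isinstance(item, (list, tuple)):
            flat.extend(flatten(item))
        else:
            flat.append(float(item))
    return flat


def multimodal_state(observations, extractors):
    """Run each sensor through its own extractor, then flatten and concatenate.
    Also returns the per-sensor feature dimensions."""
    blocks = [flatten(extract(obs)) for extract, obs in zip(extractors, observations)]
    state = [x for block in blocks for x in block]
    dims = [len(block) for block in blocks]
    return state, dims


# Hypothetical observations: a 3-D physical state and a 4x4 single-channel image.
physical = [0.1, 0.2, 0.3]
image = [[1.0] * 4 for _ in range(4)]
state, dims = multimodal_state([physical, image],
                               [identity_features, mean_pool_features])
print(dims)  # [3, 4]
```

Keeping the per-sensor dimensions alongside the concatenated state is convenient because any later block-wise operation on the state needs to know where each sensor's features start and end.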
4 Augmenting MDRL
In this section, we propose two methods to improve training of a multi-sensor policy. We first introduce a new stochastic regularization called Sensor Dropout, and explain its advantages over the standard Dropout for this problem. Later, we propose an additional unsupervised auxiliary loss function to reduce the policy variance.
Sensor Dropout (SD) for Robustness: Sensor Dropout is a customization of Dropout [17] that maintains dropping configurations on each sensor module instead of each neuron. Though both methods serve the purpose of regularization, SD is better motivated for training multisensory policies. By randomly dropping sensor blocks during training, the policy network is encouraged to exploit cross connections across different sensing streams. When applied to a complex robotic system, SD has the advantage of handling imperfect sensing conditions such as latency, noise, and even partial sensor failure. As shown in Fig. 1, consider the multimodal state $\tilde{S}$ obtained from feature extraction. The dropping configuration is defined as an $M$-dimensional vector $c = [\delta_{c}^{(1)}, \delta_{c}^{(2)}, \ldots, \delta_{c}^{(M)}]^{T}$, where $M$ is the number of sensors and each element $\delta_{c}^{(i)} \in \{0, 1\}$ represents the on/off indicator for the $i$-th sensor modality. Each sensor modality is represented by a $K_i$-dimensional vector, denoted as $\tilde{S}^{(i)} = [s_{1}^{(i)}, s_{2}^{(i)}, \ldots, s_{K_i}^{(i)}]^{T}$. The subscript $i$ of $K_i$ indicates that each sensor may have a different dimension. We now detail the two main differences between the original Dropout and SD along with their interpretations.
Firstly, note that the dimension of the dropping vector is much lower than in the standard Dropout, which keeps one indicator per neuron rather than one per sensor. As a consequence, the probability of the event where all sensors are dropped out (i.e., the all-zero configuration) is not negligible in SD. To explicitly remove this configuration, we slightly depart from the standard Dropout formulation in modeling the SD layer. Instead of modeling SD as a random process where any sensor block is switched on/off with a fixed probability $p$, we define the random variable as the dropping configuration $c$ itself. Since there are $N = 2^{M} - 1$ possible states for $c$ with $M$ sensors, we accordingly sample from an $N$-state categorical distribution. We denote the probability of a dropping configuration $c_j$ occurring with $p_j$, where the subscript $j$ ranges from $1$ to $N$. The corresponding pseudo-Bernoulli distribution for switching on the $i$-th sensor block can be calculated as $p^{(i)} = \sum_{j=1}^{N} \delta_{c_j}^{(i)} p_j$, where $\delta_{c_j}^{(i)} \in \{0, 1\}$ indicates whether sensor $i$ is on in configuration $c_j$. (We wish to point out that $p^{(i)}$ is pseudo-Bernoulli as we restrict our attention to cases where at least one sensor block is switched on at any given instant. This implies that switching-on of any sensor block is independent of the other but switching-off is not. So the distribution is no longer fully independent.)
Remark: Note that sampling from a standard Bernoulli distribution on sensor blocks, with rejection of the all-zero configuration, will have the same effect. However, the proposed categorical distribution aids in better bookkeeping and makes configurations easy to interpret. It can also be adapted to the current sensor reliability during run-time.
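A minimal sketch of the categorical sampling described above, assuming three sensors and a uniform distribution over configurations. The helper names are hypothetical. The sketch enumerates every on/off configuration except the all-off one, draws one configuration, and recovers the per-sensor switch-on probability by summing the probabilities of the configurations in which that sensor is on.

```python
import itertools
import random


def dropping_configurations(num_sensors):
    """All on/off vectors over the sensors except the all-off one (2^M - 1 of them)."""
    return [c for c in itertools.product((0, 1), repeat=num_sensors) if any(c)]


def sample_configuration(configs, probs, rng):
    """Draw one dropping configuration from the categorical distribution."""
    return rng.choices(configs, weights=probs, k=1)[0]


def switch_on_probabilities(configs, probs):
    """Marginal probability that each sensor is on: sum of p_j over the
    configurations c_j in which that sensor's indicator is 1."""
    num_sensors = len(configs[0])
    return [sum(p for c, p in zip(configs, probs) if c[i])
            for i in range(num_sensors)]


configs = dropping_configurations(3)        # 7 configurations for 3 sensors
probs = [1.0 / len(configs)] * len(configs)  # uniform initialization (assumed)
rng = random.Random(0)                       # fixed seed for repeatability
sampled = sample_configuration(configs, probs, rng)
marginals = switch_on_probabilities(configs, probs)
print(len(configs), marginals)
```

With a uniform distribution over the seven configurations, each sensor is on in four of them, so every marginal equals 4/7. Changing individual entries of `probs` is one way the distribution could be adapted to sensor reliability.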
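Another difference from the standard Dropout is the rescaling process. Unlike the standard Dropout, which preserves a fixed scaling ratio after dropping neurons, the rescaling ratio in SD is formulated as a function of the dropping configuration and the sensor dimensions. The intuition is to keep the weighted summations equivalent among different dropping configurations in order to activate the later hidden layers. The scaling ratio is calculated as $\alpha_{c_j} = \frac{\sum_{i=1}^{M} K_i}{\sum_{i=1}^{M} \delta_{c_j}^{(i)} K_i}$, where $K_i$ is the feature dimension of the $i$-th of $M$ sensors and $\delta_{c_j}^{(i)} \in \{0, 1\}$ indicates whether that sensor is kept in configuration $c_j$.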
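In summary, the output of SD for the $k$-th feature in the $i$-th sensor block (i.e., $s_{k}^{(i)}$) given a dropping configuration $c_j$ can be shown as $\hat{s}_{c_j,k}^{(i)} = \mathcal{M}_{c_j}^{(i)} s_{k}^{(i)}$, where $\mathcal{M}_{c_j}^{(i)} = \alpha_{c_j} \delta_{c_j}^{(i)}$ is an augmented mask encapsulating both dropout (the on/off indicator $\delta_{c_j}^{(i)}$) and rescaling (the scaling ratio $\alpha_{c_j}$).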
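The masking and rescaling step can be sketched as below. The sketch assumes the scaling ratio is the total feature dimension divided by the dimension of the sensors that remain on, which is one form consistent with the stated intuition of keeping the summed activation comparable across configurations. The function names and block sizes are hypothetical.

```python
def scaling_ratio(config, dims):
    """Assumed rescaling: total feature dimension over the dimension kept on."""
    active = sum(d for on, d in zip(config, dims) if on)
    if active == 0:
        raise ValueError("at least one sensor block must stay on")
    return sum(dims) / active


def sensor_dropout(blocks, config):
    """Zero out dropped sensor blocks and rescale the surviving ones."""
    dims = [len(block) for block in blocks]
    alpha = scaling_ratio(config, dims)
    # The augmented mask for a block is alpha * on, applied to every feature in it.
    return [[alpha * on * x for x in block] for block, on in zip(blocks, config)]


# Hypothetical blocks: a 2-D sensor and a 6-D sensor, all features equal to 1.
blocks = [[1.0, 1.0], [1.0] * 6]
masked = sensor_dropout(blocks, config=(1, 0))
print(masked[0])       # [4.0, 4.0]
print(sum(masked[1]))  # 0.0
```

In this toy case the total activation before dropping is 8, and after dropping the larger block the two surviving features are each scaled by 4, so the total is again 8.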
Auxiliary Loss for Variance Reduction: An alternative interpretation of the SD-augmented policy is that the sub-policies induced by each sensor combination are jointly optimized during training. Denote the ultimate SD-augmented policy and the sub-policy induced by each sensor combination as $\mu_{\tilde{c}}$ and $\mu_{c_j}$, respectively. The final output maintains a geometric mean over the different actions of the sub-policies.
Though the expectation of the total policy gradients for each sub-policy is the same, SD provides no guarantees on the consistency of these actions. To encourage the policy network to extract salient features from each sensor that embed into a common latent state representation, we further add an auxiliary loss that penalizes the inconsistency among the sub-policies. This additional penalty term provides an alternative gradient that reduces the variation of the ultimate policy. The mechanism is motivated by recent successes [14, 24] that use auxiliary tasks to improve both the agent's performance and its convergence rate. However, unlike most previous works, which design the auxiliary tasks carefully from the ground-truth environment, we formulate the target action from the policy network itself. Under the standard actor-critic architecture, the target action is defined as the output action of the sub-policy in the target actor network that maximizes the target critic value. In other words, we use the currently best-trained sub-policy as a heuristic to guide the other sub-policies during training.
The auxiliary loss involves two further quantities: an additional hyperparameter that indicates the ratio between the two losses, and the batch size used for off-policy learning.
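The sketch below shows one way such an auxiliary loss could be computed. It is an assumption-laden illustration, not the paper's exact formula: the penalty is taken to be a squared error between each sub-policy action and the target action, averaged over the batch and scaled by a weight `lam`. All names and numbers are hypothetical.

```python
def auxiliary_loss(sub_actions, target_sub_actions, target_q_values, lam):
    """Assumed form of the auxiliary loss.

    sub_actions[n][j]        action of sub-policy j (actor network) for sample n
    target_sub_actions[n][j] action of sub-policy j (target actor) for sample n
    target_q_values[n][j]    target critic value of target_sub_actions[n][j]
    lam                      weight between the auxiliary and the standard loss
    """
    batch_size = len(sub_actions)
    total = 0.0
    for actions, t_actions, t_q in zip(sub_actions, target_sub_actions, target_q_values):
        # Target action: output of the target sub-policy with the highest critic value.
        best = max(range(len(t_q)), key=t_q.__getitem__)
        target = t_actions[best]
        # Penalize every sub-policy's squared deviation from that target action.
        for action in actions:
            total += sum((a - b) ** 2 for a, b in zip(action, target))
    return lam * total / batch_size


# One sample, two sub-policies, 2-D actions (steering, acceleration).
actions = [[[0.0, 0.0], [1.0, 0.0]]]
target_actions = [[[0.0, 0.0], [1.0, 0.0]]]
target_q = [[0.2, 0.9]]
print(auxiliary_loss(actions, target_actions, target_q, lam=0.5))  # 0.5
```

In the toy call, the second sub-policy has the higher target critic value, so its action becomes the target and only the first sub-policy contributes to the penalty.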
5 Evaluation Results
5.1 Platform Setup
TORCS Simulator: The proposed approach is verified on TORCS [5], a popular open-source car racing simulator that is capable of simulating physically realistic vehicle dynamics as well as multiple sensing modalities to build sophisticated AI agents. In order to make the learning problem representative of the real-world setting, we use the following sensing modalities for our state description: (1) We define Sensor 1 as a hybrid state containing physical information such as odometry and a simulated GPS signal. (2) Sensor 2 consists of consecutive laser scans (i.e., at time $t$, the input stacks the scan at $t$ with those from the immediately preceding time steps). (3) Finally, as Sensor 3, we supply consecutive color images capturing the car's front view. These three representations are used separately to develop our baseline uni-modal sensor policies. The multimodal state, on the other hand, has access to all sensors at any given point. When Sensor Dropout (SD) is applied, the agent will randomly lose access to a strict subset of sensors. The categorical distribution is initialized as a uniform distribution over the $2^{3} - 1 = 7$ possible non-empty combinations of sensors, and the best-learned policy is reported here. The action space is a continuous vector in $\mathbb{R}^{2}$, whose elements represent steering angle and acceleration. Experiment details such as the exploration strategy, the network architecture of each model, and the sensor dimensionality are given in the Supplementary Material (Section B).
Training Summary: The training performance of all the proposed models and their corresponding baselines is shown in Fig. 2. For DDPG, using high-dimensional sensory input directly impacts the convergence rate of the policy. Note that the Images uni-policy (orange line) has a much larger dimensional state space compared with the Multi policies (purple and green lines). Counter-intuitively, NAF shows a nearly linear improvement over training steps, and is relatively insensitive to the dimensionality of the state space. However, adding Sensor Dropout (SD) dramatically increases the convergence rate. For both algorithms, the final performance of multimodal sensor policies trained with SD is slightly lower than that of policies trained without SD, indicating that SD has a regularization effect similar to the original Dropout.
| Policy | w/o Noise | w/ Noise | Performance Drop |
|---|---|---|---|
| Multi Uni-modal w/ Meta Controller | 1.51 ± 0.57 | 0.73 ± 0.40 | 51.7 % |
| Multimodal w/ SD | 2.54 ± 0.08 | 2.29 ± 0.60 | 9.8 % |
Comparison with Uni-modal Policies + Meta Controller: One intuitive baseline for the multi-sensor problem is to train each uni-modal sensor policy separately. Once the individual policies are learned, we can train an additional meta controller that selects which policy to follow given the current state. For this, we follow the setup in [26] by training a meta controller that takes the processed states from each uni-modal policy and outputs, through a softmax layer, the probability of choosing each sub-policy. Note that we assume perfect sensing during training. However, to test performance in a more realistic scenario, we simulate mildly imperfect sensing by adding Gaussian noise. Policy performance with and without noise is summarized in Table 1. The performance of the baseline policy drops dramatically once noise is introduced, which implies that the uni-modal policies are prone to over-fitting without any regularization. In fact, the performance drop is sometimes severe for the physical-based or laser-based policy. In comparison, the policy trained with SD reaches a higher score in both scenarios, and its drop when noise is introduced is much smaller.
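The performance-drop column of Table 1 can be reproduced from the mean scores with and without noise. The short check below assumes the drop is the relative decrease of the mean score, which matches the tabulated percentages. The function name is hypothetical.

```python
def performance_drop(clean_score, noisy_score):
    """Relative decrease of the mean score, in percent."""
    return 100.0 * (clean_score - noisy_score) / clean_score


# Mean scores taken from Table 1.
baseline_drop = round(performance_drop(1.51, 0.73), 1)
sd_drop = round(performance_drop(2.54, 2.29), 1)
print(baseline_drop)  # 51.7
print(sd_drop)        # 9.8
```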
Policy Robustness Analysis: In this part, we show that SD reduces the learned policy's acute dependence on a subset of sensors in a multimodal sensor setting. First, we consider a scenario in which sensor malfunctions have been detected by the system, and the agent must rely on the remaining sensors to make navigation decisions. To simulate this setting during testing, we randomly block out some sensor modules and scale the rest using the same rescaling mechanism as proposed in Section 4. Fig. 3 reports the average normalized reward of each model. A naive multimodal policy without any stochastic regularization (blue bar) performs poorly in the face of partial sensor failure and transfer tasks. Adding the original Dropout makes the policy generalize better, yet its performance is not comparable with SD. Interestingly, when the variance of the multimodal sensor policy is reduced with the auxiliary loss, the policy tends to generalize better to other environments.
Policy Sensitivity Analysis: To monitor the extent to which the learned policy depends on each sensor block, we measure the gradient of the policy output with respect to a sensor subset block. The technique is motivated by saliency map analysis [27], which has also been applied to DRL recently [28]. To better analyze the effects of SD, we report on a smaller setting by implementing the SD layer to drop one of two sensor subsets. Consequently, the sensitivity metric is formulated as the relative sensitivity of the policy to the two sensor subsets, i.e., the ratio of the policy gradient with respect to one subset to that with respect to the other. If the ratio increases, the agent's dependence shifts toward the sensor block in the numerator, and vice versa. Assuming the fusion of interest is between the above-mentioned two subsets, we show in Table 2 that, using SD, the metric gets closer to $1$, indicating nearly equal importance of both sensing modalities.
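A relative-sensitivity ratio of this kind could be estimated as sketched below. The sketch makes several assumptions that are not taken from the paper: the policy output is a scalar, gradients are approximated with central finite differences rather than back-propagation, and absolute gradients are averaged within each block and then over states. The toy linear policy is purely illustrative.

```python
def block_gradient(policy, state, indices, eps=1e-5):
    """Mean absolute partial derivative of the policy output with respect to
    the state entries in `indices`, using central finite differences."""
    total = 0.0
    for k in indices:
        up = list(state)
        down = list(state)
        up[k] += eps
        down[k] -= eps
        total += abs(policy(up) - policy(down)) / (2.0 * eps)
    return total / len(indices)


def relative_sensitivity(policy, states, block_a, block_b):
    """Average over states of gradient(block_a) / gradient(block_b)."""
    ratios = [block_gradient(policy, s, block_a) / block_gradient(policy, s, block_b)
              for s in states]
    return sum(ratios) / len(ratios)


def toy_policy(state):
    """Hypothetical linear policy that leans on the first two inputs."""
    return 2.0 * state[0] + 2.0 * state[1] + 0.5 * state[2]


states = [[0.1, 0.2, 0.3], [1.0, -1.0, 0.5]]
ratio = relative_sensitivity(toy_policy, states, block_a=[0, 1], block_b=[2])
print(round(ratio, 3))  # 4.0
```

A ratio well above 1, as in this toy case, means the policy leans on the block in the numerator. A value near 1 would indicate balanced dependence on the two blocks.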
Effect of Auxiliary Loss: In this experiment, we verify how the auxiliary loss helps reshape the multimodal sensor policy and reduce the action variance. We extract the representations of the last hidden layer produced by the policy network throughout a fixed episode. At every time step, the representation induced by each sensor combination is collected. Our intuition is that this latent space represents how the policy network interprets the incoming sensor stream in order to react. Based on this assumption, an ideal multimodal sensor policy should map different sensor streams to a similar distribution, as long as the information provided by each combination is representative enough to lead to the same output action.
As shown in Fig. 4, the naive multimodal sensor policy has a scattered distribution over the latent space, indicating that representative information from each sensor is treated very differently. In comparison, the policy trained with SD has a concentrated distribution, yet it is still distinguishable with respect to different sensors. Adding the auxiliary training loss encourages true sensor fusion, as the distribution becomes more integrated. During training, the policy is not only forced to explicitly make decisions under each sensor combination, but is also penalized for disagreements among the multimodal sensor sub-policies. In fact, as shown in Fig. 5, the concentration of the latent space directly affects the action variance induced by each sub-policy. We provide the actual covariances for each component and the actual action variance values in the Supplementary Material (Section C).
Full Sub-Policy Analysis: The performance of each sub-policy is summarized in Fig. 6. As shown in the first and third columns, the performance of the naive multimodal sensor policy (red) and of the policy trained with standard Dropout (blue) drops dramatically as the policies lose access to the image sensor, which makes up a large share of the total multimodal state. Though Dropout increases the performance of the policy in the testing environment, the generalization is limited to using the full multimodal state as input. On the other hand, SD generalizes the policy across sensor modules, so that the sub-policies successfully transfer to the testing environment. It is worth mentioning that the policies trained with SD are capable of operating even when both the laser and image sensors are blocked. Interestingly, neither the original Dropout nor SD shows apparent degradation of the full policy induced by the regularization. We leave further analysis as future work.
Visualize Policy Attention Region: The average gradient from the policy sensitivity section can also be used to visualize the regions of each sensor to which the policy network pays attention. As shown in Fig. 7, we observe that policies trained with SD have higher gradients on neurons corresponding to the corner inputs of the laser sensor, indicating that a sparser and more meaningful policy is learned. These corner inputs correspond to the laser beams that are oriented perpendicularly to the vehicle's direction of motion, and give an estimate of its relative position on the track. To look for similar patterns in Fig. 7, image pixels with higher gradients are marked to interpret the policy's view of the world. We pick two scenarios, (1) a straight track and (2) a sharp left turn, depicted by the first and second rows in the figure. Note that although policies trained without SD tend to focus more on the road, those areas are in plain color and offer little salient information. In conclusion, policies trained with SD are more sensitive to features such as the road boundary, which is crucial for long-horizon planning. In comparison, networks trained without SD have relatively low and unclear gradients over both the laser and image sensor state spaces.
7 Conclusions and Future Work
In this work, we introduce a new stochastic regularization technique called Sensor Dropout to promote effective fusion of information from multiple sensors. The variance of the resulting policy can be further reduced by introducing an auxiliary loss during training. We show that SD reduces the policy's sensitivity to a particular sensor subset, and guarantees functionality even in the face of a sensor subset failure. Moreover, the policy network is able to automatically infer and weight locations providing salient information. For future work, we wish to extend the framework to other environments, such as real robotic systems, and to other algorithms like TRPO [29] and Q-Prop [30]. Secondly, systematic investigation into problems such as how to augment the reward function for other important driving tasks, like collision avoidance and lane changing, and how to adaptively adjust the SD distribution during training, are also interesting avenues that merit further study.
The authors would like to thank Po-Wei Chou, Humphrey Hu, and Ming Hsiao for many helpful discussions, suggestions and comments on the paper. This research was funded under award by Yamaha Motor Corporation.
- Bojarski et al.  M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.
Ross et al. 
S. Ross, G. J. Gordon, and D. Bagnell.
A reduction of imitation learning and structured prediction to no-regret online learning.In AISTATS, volume 1, page 6, 2011.
- Mnih et al.  V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing atari with deep reinforcement learning. In NIPS’13 Workshop on Deep Learning, 2013.
- Mnih et al.  V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
- Wymann et al.  B. Wymann, E. Espié, C. Guionneau, C. Dimitrakakis, R. Coulom, and A. Sumner. Torcs, the open racing car simulator. Software available at http://torcs. sourceforge. net, 2000.
Gu et al. 
S. Gu, T. Lillicrap, I. Sutskever, and S. Levine.
Continuous deep q-learning with model-based acceleration.
Proceedings of The 33rd International Conference on Machine Learning, pages 2829–2838, 2016.
- Lillicrap et al.  T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations (ICLR), 2016.
- Patel et al.  N. Patel, A. Choromanska, P. Krishnamurthy, and F. Khorrami. Sensor modality fusion with cnns for ugv autonomous driving in indoor environments. In International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017.
- Bohez et al.  S. Bohez, T. Verbelen, E. De Coninck, B. Vankeirsbilck, P. Simoens, and B. Dhoedt. Sensor fusion for robot control through deep reinforcement learning. preprint arXiv:1703.04550, 2017.
- Urmson et al.  C. Urmson, J. A. Bagnell, C. R. Baker, M. Hebert, A. Kelly, R. Rajkumar, P. E. Rybski, S. Scherer, R. Simmons, S. Singh, et al. Tartan racing: A multi-modal approach to the darpa urban challenge. 2007.
- Cho et al.  H. Cho, Y.-W. Seo, B. V. Kumar, and R. R. Rajkumar. A multi-sensor fusion system for moving object detection and tracking in urban driving environments. In International Conference on Robotics and Automation (ICRA), pages 1836–1843. IEEE, 2014.
- Qureshi et al.  A. H. Qureshi, Y. Nakamura, Y. Yoshikawa, and H. Ishiguro. Robot gains social intelligence through multimodal deep reinforcement learning. In 16th International Conference on Humanoid Robots, pages 745–751. IEEE, 2016.
- Levine et al.  S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(39):1–40, 2016.
- Mirowski et al.  P. Mirowski, R. Pascanu, F. Viola, H. Soyer, A. Ballard, A. Banino, M. Denil, R. Goroshin, L. Sifre, K. Kavukcuoglu, D. Kumaran, and R. Hadsell. Learning to navigate in complex environments. In International Conference on Learning Representations (ICLR), 2017.
- Ngiam et al.  J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal deep learning. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 689–696, 2011.
Srivastava and Salakhutdinov 
N. Srivastava and R. R. Salakhutdinov.
Multimodal learning with deep boltzmann machines.In Advances in neural information processing systems, pages 2222–2230, 2012.
- Srivastava et al.  N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
- Murdock et al.  C. Murdock, Z. Li, H. Zhou, and T. Duerig. Blockout: Dynamic model selection for hierarchical deep networks. In
- Wan et al.  L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus. Regularization of neural networks using dropconnect. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1058–1066, 2013.
- Krueger et al.  D. Krueger, T. Maharaj, J. Kramár, M. Pezeshki, N. Ballas, N. R. Ke, A. Goyal, Y. Bengio, H. Larochelle, A. Courville, et al. Zoneout: Regularizing rnns by randomly preserving hidden activations. arXiv preprint arXiv:1606.01305, 2016.
Frazão and Alexandre 
X. Frazão and L. A. Alexandre.
Dropall: Generalization of two convolutional neural network regularization methods.In International Conference Image Analysis and Recognition, pages 282–289. Springer, 2014.
- Neverova et al.  N. Neverova, C. Wolf, G. Taylor, and F. Nebout. Moddrop: adaptive multi-modal gesture recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(8):1692–1706, 2016.
- Mnih et al.  V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 2016.
- Jaderberg et al.  M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. CoRR, abs/1611.05397, 2016. URL http://arxiv.org/abs/1611.05397.
- Yoshida  N. Yoshida. Gym-torcs. https://github.com/ugo-nama-kun/gym_torcs, 2016.
-  R. Liaw, S. Krishnan, A. Garg, D. Crankshaw, J. E. Gonzalez, and K. Goldberg. Composing meta-policies for autonomous driving using hierarchical deep reinforcement learning.
- Simonyan et al.  K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
- Wang et al.  Z. Wang, N. de Freitas, and M. Lanctot. Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning (ICML), 2016.
- Schulman et al.  J. Schulman, S. Levine, P. Abbeel, M. I. Jordan, and P. Moritz. Trust region policy optimization. In ICML, pages 1889–1897, 2015.
- Gu et al.  S. Gu, T. P. Lillicrap, Z. Ghahramani, R. E. Turner, and S. Levine. Q-prop: Sample-efficient policy gradient with an off-policy critic. In International Conference on Learning Representations (ICLR), 2017.
- Sutton et al.  R. S. Sutton, D. A. McAllester, S. P. Singh, et al. Policy gradient methods for reinforcement learning with function approximation. In NIPS, volume 99, pages 1057–1063, 1999.
- Silver et al.  D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller. Deterministic policy gradient algorithms. In ICML, 2014.
-  Using keras and deep deterministic policy gradient to play torcs. https://yanpanlau.github.io/2016/10/11/Torcs-Keras.html, 2016.
- Kingma and Ba  D. P. Kingma and J. L. Ba. Adam: A method for stochastic optimization. 2015.
Appendix A Continuous Action Space Algorithms
A.1 Normalized Advantage Function (NAF)
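Q-learning is an off-policy, model-free algorithm in which the agent learns an approximate $Q$ function and follows a greedy policy $\mu(s) = \arg\max_a Q(s, a)$ at each step. The objective function $J$, the expected discounted return, can be optimized by minimizing the squared Bellman error $L = \frac{1}{N}\sum_{i} \big(y_i - Q(s_i, a_i \mid \theta^Q)\big)^2$, where the target is defined as $y_i = r(s_i, a_i) + \gamma\, Q(s_{i+1}, \mu(s_{i+1}))$.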
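As a minimal sketch of the squared Bellman error described above, the following NumPy snippet computes the targets and the loss for a small minibatch. It assumes a discrete action set so that the greedy maximization reduces to a `max`; the function names, the `dones` mask for terminal transitions, and the discount value are illustrative assumptions rather than details from this work.

```python
import numpy as np


def bellman_targets(rewards, next_q_values, dones, gamma=0.99):
    """Targets y_i = r_i + gamma * max_a Q(s_{i+1}, a).

    `dones` is 1.0 for terminal transitions, where no bootstrapping is applied
    (an assumption of this sketch).
    """
    greedy_next = next_q_values.max(axis=1)  # value of the greedy action
    return rewards + gamma * (1.0 - dones) * greedy_next


def bellman_loss(q_taken, targets):
    """Mean squared Bellman error over the minibatch."""
    return float(np.mean((targets - q_taken) ** 2))


# Toy minibatch of two transitions with two discrete actions (hypothetical values).
rewards = np.array([1.0, 0.0])
next_q = np.array([[0.5, 2.0], [1.0, -1.0]])
dones = np.array([0.0, 1.0])
q_taken = np.array([2.0, 0.5])

targets = bellman_targets(rewards, next_q, dones, gamma=0.9)
loss = bellman_loss(q_taken, targets)
```

In a continuous action space the `max` over actions is not available in closed form, which is the difficulty the constructions in this appendix address.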
Recently, Gu et al. proposed a continuous variant of deep Q-learning through a clever network construction. The network, which they called the Normalized Advantage Function (NAF), parameterizes the advantage function quadratically over the action space, weighted by a non-linear feature of the state.
At run time, the greedy action is obtained simply by taking the output of the policy sub-network $\mu(s \mid \theta^\mu)$. The data flows for the forward-prediction and back-propagation steps are shown in Fig. 8 (a) and (b), respectively.
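The quadratic parameterization can be sketched as follows, under the common NAF construction $Q(s, a) = V(s) - \tfrac{1}{2}(a - \mu(s))^\top P(s)\,(a - \mu(s))$ with $P = LL^\top$ for a lower-triangular $L$. The function name, the argument layout, and the exponentiated diagonal used to keep $P$ positive definite are assumptions for illustration; the networks that would produce `mu`, `l_entries`, and `value` are replaced here by fixed arrays.

```python
import numpy as np


def naf_q_value(action, mu, l_entries, value):
    """Q(s, a) = V(s) - 0.5 * (a - mu)^T P (a - mu), with P = L L^T."""
    n = mu.shape[0]
    L = np.zeros((n, n))
    L[np.tril_indices(n)] = l_entries   # fill the lower triangle row by row
    diag = np.diag_indices(n)
    L[diag] = np.exp(L[diag])           # strictly positive diagonal (assumption)
    P = L @ L.T                         # positive-definite weighting matrix
    diff = action - mu
    advantage = -0.5 * diff @ P @ diff  # never positive, zero at a = mu
    return value + advantage


# Hypothetical network outputs for a 2-D action space.
mu = np.array([0.2, -0.1])
l_entries = np.array([0.0, 0.5, 0.0])   # n * (n + 1) / 2 = 3 entries
value = 1.0

q_at_mu = naf_q_value(mu, mu, l_entries, value)
q_off_mu = naf_q_value(mu + np.array([1.0, 0.0]), mu, l_entries, value)
```

Because the advantage term is never positive and vanishes at $a = \mu(s)$, the maximizing action is the output of the policy sub-network, so no separate optimization over actions is needed at run time.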
A.2 Deep Deterministic Policy Gradient (DDPG)
An alternative approach to continuous RL tasks is the actor-critic framework, which maintains an explicit policy function, called the actor, and an action-value function, called the critic. Silver et al. proposed a novel deterministic policy gradient (DPG) approach and showed that deterministic policy gradients have a model-free form and follow the gradient of the action-value function.
Building on this result, Lillicrap et al. proposed an extension of DPG with deep architectures to generalize their prior success with discrete action spaces to continuous spaces. Using DPG, an off-policy algorithm was developed to estimate the $Q$ function using a differentiable function approximator. Techniques similar to those used in the discrete-action setting were utilized for stable learning. In order to explore the full state and action space, an exploration policy was constructed by adding an Ornstein-Uhlenbeck noise process. The data flows for the prediction and back-propagation steps are shown in Fig. 8 (c) and (d), respectively.
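For reference, the updates described above are commonly written as follows. This is a sketch in the standard DDPG notation rather than a formula taken from this work: $\mu(s \mid \theta^\mu)$ is the actor, $Q(s, a \mid \theta^Q)$ is the critic, primed symbols denote slowly updated target networks, and $N$, $\gamma$, and $\tau$ (minibatch size, discount factor, and target update rate) are assumed symbols.

```latex
\begin{align}
  y_i &= r_i + \gamma\, Q'\!\big(s_{i+1},\, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\big), \\
  L(\theta^Q) &= \frac{1}{N} \sum_{i=1}^{N} \big(y_i - Q(s_i, a_i \mid \theta^Q)\big)^2, \\
  \nabla_{\theta^\mu} J &\approx \frac{1}{N} \sum_{i=1}^{N}
      \nabla_a Q(s, a \mid \theta^Q)\big|_{s = s_i,\, a = \mu(s_i)}\;
      \nabla_{\theta^\mu} \mu(s \mid \theta^\mu)\big|_{s = s_i}, \\
  \theta' &\leftarrow \tau\, \theta + (1 - \tau)\, \theta'.
\end{align}
```

The third line is the deterministic policy gradient: the actor parameters move in the direction that increases the critic's estimate of the action value at the actor's own output.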
Appendix B Experiment Details
B.1 Exploration and Reward
Exploration is injected by adding Ornstein-Uhlenbeck process noise to the output of the policy network. The reward function differs slightly from those used in prior work: an additional term was added to penalize sideways drifting along the track. In practice, this modification leads to more stable policies during training.
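A minimal sketch of this kind of exploration noise is given below, using an Euler discretization of the Ornstein-Uhlenbeck process. The class and function names, the parameter values (`theta`, `sigma`, `dt`), and the clipping range of the action are assumptions for illustration and are not taken from the experiments reported here.

```python
import numpy as np


class OrnsteinUhlenbeckNoise:
    """Temporally correlated noise: dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, dt=1.0, seed=None):
        self.size = size
        self.mu = mu
        self.theta = theta
        self.sigma = sigma
        self.dt = dt
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        """Restart the process at its long-run mean (e.g. at the start of an episode)."""
        self.state = np.full(self.size, self.mu, dtype=float)

    def sample(self):
        """Advance the process by one step and return the new noise value."""
        drift = self.theta * (self.mu - self.state) * self.dt
        diffusion = self.sigma * np.sqrt(self.dt) * self.rng.standard_normal(self.size)
        self.state = self.state + drift + diffusion
        return self.state


def explore(policy_action, noise, low=-1.0, high=1.0):
    """Add exploration noise to the policy output and clip to the valid action range."""
    return np.clip(policy_action + noise.sample(), low, high)


noise = OrnsteinUhlenbeckNoise(size=2, seed=0)
noisy_action = explore(np.array([0.9, -0.9]), noise)
```

Unlike independent Gaussian noise, successive samples are correlated, which tends to produce smoother exploratory steering and throttle commands in a physical control task.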
B.2 Network Architecture
For the laser feature extraction module, we use two convolution layers with filters of size , while the image feature extraction module is composed of three convolution layers: one layer of filters of size and striding length , followed by two layers each with filters of size and striding length . Batch normalization is applied after every convolution layer. The outputs of all extraction modules are fused and then passed through two fully-connected layers of hidden units each. All hidden layers use ReLU activations. The final layer of the critic network uses a linear activation, while the output of the actor network is bounded using a tanh activation. We use a sigmoid activation for the output of network in NAF. In practice, this leads to more stable training for high-dimensional state spaces. We trained with a minibatch size of .
We used Adam (Kingma and Ba) for learning the network parameters. For DDPG, the learning rates for actor and critic are and , respectively. We allow the actor and critic to each maintain their own feature extraction module; in practice, sharing the same extraction module can lead to unstable training. Note that the NAF algorithm maintains three separate networks, which represent the value function (), policy network (), and the state-dependent covariance matrix in the action space (), respectively. In order to maintain a similar experiment setting and avoid unstable training, we maintain two independent feature extraction modules for , and both and . In a similar vein, we apply a learning rate of for , and for both and .
|Model ID|State Dimensionality|Description|
|---|---|---|
|Lasers|4 × 19|4 consecutive laser scans|
|Images|12 × 64 × 64|4 consecutive RGB images|
|Multi|10 + 1 × 19 + 3 × 64 × 64|all sensor streams at current time step|
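The input sizes implied by the table can be checked with a few lines of arithmetic. The snippet below is only a sketch: it reads the Multi entry as 10 physical-state values plus one 19-reading laser scan plus one 3 × 64 × 64 RGB image, and the dictionary layout and variable names are hypothetical.

```python
from math import prod

# Per-model list of input tensor shapes (layout is an illustrative assumption).
state_shapes = {
    "Lasers": [(4, 19)],                       # 4 stacked laser scans
    "Images": [(12, 64, 64)],                  # 4 stacked RGB frames (4 * 3 channels)
    "Multi": [(10,), (1, 19), (3, 64, 64)],    # physical state + 1 scan + 1 RGB frame
}

# Total number of scalar inputs per model when every tensor is flattened.
flat_sizes = {
    name: sum(prod(shape) for shape in shapes)
    for name, shapes in state_shapes.items()
}
```

Under this reading the image stream dominates the Multi input (12288 of 12317 values), which is one reason a policy can become over-dependent on a single sensor.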
B.3 Simulated Sensor Detail
As shown in Fig. 9, the physical state is a 10 DOF hybrid state, including 3D velocity (3 DOF), position and orientation with respect to the track center-line (2 DOF), and finally rotational speed of the wheels (4 DOF) and engine (1 DOF). Each laser scan is composed of 19 readings spanning a field of view in front of the car. Finally, the camera provides RGB channels with resolution 64 × 64.
Appendix C More Experimental Results
C.1 Effect of Auxiliary Loss