Recently, the domain of autonomous driving has emerged as one of the hotbeds for deep learning research, bolstered by strong industry support and availability of large real world datasets (such as KITTI [kitti], Berkeley Driving Dataset [bdd], Honda Driving Dataset [ramanishka2018toward], Argoverse [argoverse]) and physically/visually realistic simulators (like TORCS [wymann2000torcs, GymTORCS], Udacity’s Car Simulator [udacity], CARLA [carla]).
More recent datasets and simulators provide multi-sensor temporal data, which offers more leverage to predict optimal driving actions. (Footnote 1: In this paper, we use the terms sensor and mode interchangeably. The term sensor is more meaningful in the autonomous driving setting, whereas mode is more generally used in the literature to indicate the various forms of state representation - this may come directly from sensors (image, speech signal) or after some meaningful post-processing (depth map, n-grams, etc.).) Designing efficient architectures to exploit this rich source of data is still an open research problem. The current strategy for multimodal temporal data is to either pre-concatenate (concatenation followed by recurrent modules) [ramanishka2018toward, lidar-video, girdhar2017actionvlad] or post-concatenate (parallel recurrent modules for each sensor followed by concatenation) [ren2016look-aaai, jain2016recurrent, CorrRNN, morency-multiview, li2016lstm, girdhar2017actionvlad]. While both offer unique merits and challenges, neither fuses the multi-sensor data explicitly. Moreover, even if they work in practice, these design choices may lead to networks with much larger parameter spaces that are potentially harder to train and might need an order of magnitude more data.
Research on temporal fusion is more popular in multimedia domains, where either text or audio [CorrRNN, ren2016look-aaai] is combined with video. Temporal fusion in autonomous driving, while similar in some ways (Video + LiDAR is similar in dimensions to Video + Audio), presents many more challenges: a) there can be more sensors (for example, CAN bus data, radar streams, etc.) or multiple copies of each for full coverage (multi-camera [nvidiacar], multi-radar, etc.), b) the individual sensor data sizes could be disproportionate, leading to undesirable biases towards a select few, and c) there can be instances of intermittent data quality degradation or loss (e.g., motion blur and occlusion, LiDAR in snow, camera in a tunnel, etc.). This makes autonomous navigation a more general and challenging setting for developing temporal fusion models. Therefore, we first validate our proposed models on autonomous driving related tasks. However, our main goal is to build on past work in the temporal fusion domain and develop general purpose temporal fusion models that can be used in a wider range of interesting tasks spanning audio, video, text and tactile data streams [calandra2017feeling].
Finally, given the complex inter-dependencies that emerge from learning on multimodal and temporal data, it is essential to ensure the models are interpretable, so that any undesirable over-fitting can be verified and corrected. Hence, in this work, fusion is formulated as the problem of finding the optimal linear interpolation between all the sensors. The interpolation weights (also learned using gating functions) can be directly interpreted as each sensor’s percentage contribution to the fused state representation.
Contributions of this work:
We introduce a novel recurrent neural network unit, called the Gated Recurrent Fusion Unit (GRFU), that can jointly learn fusion and target prediction from temporal data. The new formulation not only offers superior performance but is also interpretable.
The formulation learns a linear interpolation of individual sensor encodings. This makes sensor contribution to the fused representation explainable and useful for any higher level intervention. To the best of our knowledge, this is the first time this has been attempted in the autonomous navigation domain.
The new modules are tested on two challenging datasets (one real-world and one simulated) and for two tasks (driver behavior classification and steering angle regression, respectively).
For an overview of each task, see Fig. 1. On the classification task, we report an improvement in the mAP score over the current state-of-the-art, and on the regression task, we note a drop in test error. In the next section, we briefly review relevant prior work in both the end-to-end autonomous driving and temporal fusion domains to position our contribution at their intersection.
II Related Work
II-A Temporal Fusion
An exciting area of research using recurrent neural networks is temporal fusion, where the input temporal data comes from multiple sensors. This is the more challenging problem of learning not only the temporal pattern but also the joint encoding of the data that best predicts the desired output. This has been used to great success in Behavior Prediction and Image Captioning [morency-multiview], Activity Recognition [ordonez2016deep], Audiovisual Speech Recognition [CorrRNN, hu2016temporal], Speaker Identification [ren2016look-aaai], Action Classification [girdhar2017actionvlad], etc. However, temporal fusion has received little attention in the Autonomous Navigation domain, which also generates a lot of multimodal data coming from a range of sensors like Camera, LiDAR, wheel odometry, etc.
II-B Learning for Autonomous Navigation
Typical driving automation tasks and sub-tasks of interest are learning driver behavior [ramanishka2018toward] and intent [jain2016recurrent], motion forecasting [luo2018fast, radwan2018multimodal], object detection [muller2006off, johnson2016driving] and motion prediction [nvidiacar], learning affordances [deepdriving], action regression [kendall2019learning, pomerleau1989alvinn], and semantic segmentation [radwan2018vlocnet++, valada2018self], among others. A major chunk of research attacks these tasks in a non-temporal fashion, mainly using either RGB or RGBD data [pomerleau1989alvinn, muller2006off, radwan2018vlocnet++, deepdriving, nvidiacar, luo2018fast, johnson2016driving]. There is some prior work on using fusion for autonomous navigation that is either non-recurrent [liu2017learning] in the reinforcement learning setting, or recurrent and unsupervised [endtoendcars] in the motion forecasting setting with only video and odometry information.
In this section, we describe the new temporal fusion architectures that we build over the standard LSTM model. We first review the LSTM model and simple fusion ideas in Section III-B. Next, in Section III-C, we introduce the three new models that we propose, the last two of which use linear interpolation to find the optimal fused state to pass through the recurrent units.
Assume we are given a set of M modalities, where M is the number of sensors, and the signal from each sensor is a time series. The objective is to jointly learn the optimal temporal and modal composition to correctly predict the desired classification/regression target. Further, we make no additional assumptions, such as sensors having similar structure or dimensions, having similar forms of occlusion and noise ranges, or always being temporally correlated. We do, however, pre-process all sensor inputs using appropriate encoders to bring them to the same dimension prior to temporal fusion (we call the processed sensor inputs sensor encodings). This is done for all proposed models and baselines for fair comparison.
The LSTM setup most commonly used in the literature [lstm, lstm-forgetgate] features three gated states (input, forget, output) along with the hidden and candidate cell states. The cell state represents memory, while the hidden state is the output of the model at each time step. The gated states control how much of the current and past information needs to be fused and transmitted to the next state in time. The two states perform important functions, namely a slow state (the cell state) that fights vanishing gradient problems, and a fast state (the hidden state) that allows the LSTM to make complex decisions over short periods of time. Each gated state performs a unique task of modulating the exposure and combination of the cell and hidden states. For a detailed overview of LSTM inner workings and the empirically evaluated importance of each gate, refer to [bengio-empirical, zaremba-empirical, schmidhuber-empirical].
Remark To make visualization easy, all the model schematics depicted in this section are for the two sensor case, but the equations are defined for the general M sensor case.
III-B Early Recurrent Fusion (ERF)
The simplest way to extend LSTMs to multimodal settings is by first summing or concatenating all the sensor encodings [ramanishka2018toward, lidar-video, girdhar2017actionvlad] and passing the result as the input to the LSTM, i.e., the LSTM input at each time step is the sum or concatenation of the M sensor encodings at that step. From a temporal standpoint, one can view this as a type of early fusion.
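The two early-fusion variants differ only in how the LSTM input is formed. The following is a minimal sketch in plain Python; the function names, the two-sensor setup, and the four-dimensional encodings are illustrative assumptions, not values from the paper.

```python
def early_concat(encodings):
    """Concatenate M sensor encodings into one LSTM input vector."""
    fused = []
    for enc in encodings:
        fused.extend(enc)
    return fused


def early_add(encodings):
    """Element-wise sum of M sensor encodings of equal dimension."""
    return [sum(values) for values in zip(*encodings)]


# Hypothetical encodings for one time step (illustrative values only).
image_enc = [0.2, -0.1, 0.5, 0.0]
can_enc = [1.0, 0.3, -0.5, 0.4]

concat_input = early_concat([image_enc, can_enc])  # dimension M * d
add_input = early_add([image_enc, can_enc])        # dimension d
print(len(concat_input), len(add_input))
```

Concatenation keeps every sensor's values visible to the LSTM at the cost of an input that grows with the number of sensors, while summation keeps the input size fixed but mixes the sensors with equal weight.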
Remark: Concatenation, while providing individual sensor inputs to the LSTM to extract useful information, bloats the cell and hidden state size. On the other hand, summation reduces the cell size but naively combines all sensor encodings with equal emphasis. This may not always be a good idea, especially at time steps where one or more sensors provide noisy information to the fused state (for example, when a car is driving through a tunnel, camera information is unreliable). Temporal fusion architectures must therefore be given sufficient flexibility to learn how to fuse and use temporal data. This is particularly necessary in driving datasets, and both ERF models lack the explicit structures to learn it. Example scenarios where fusion needs to be dynamic are:
1) Occlusion in a sensor subset: While approaching an intersection a huge object in the form of a truck occludes the entire view in one of the image frames rendering image features unreliable. The model should rely on CAN data history to classify driver action correctly.
2) Action specific dependency: Actions like lane branching are subtle steering actions. Therefore, if the steering signal doesn’t offer sufficient correlation, change in video features like lane markers and road curvature could supplement that to avoid inter-class confusion.
3) Loss of temporal correlation across sensors: As alluded to previously, when a car is going through a dark tunnel, optical flow for odometry may be hard to obtain and might at best be weakly correlated with the data stream obtained from the CAN bus or LiDAR. Similarly, LiDAR gets very noisy and unreliable in snow, rain and grass [lidarfailures].
III-C Proposed Temporal Fusion Models
Motivated by this opportunity, we identify two important ways to mitigate the above-mentioned issues: a) delay fusion and pass each sensor in parallel through M LSTM cells, allowing each sensor to individually decide how much of its history to utilize with the current sensor input (we term this late recurrent sensor summation), and b) define gates for each sensor to determine the contribution of each sensor encoding to the fused cell and output states (we term this early gated recurrent fusion). In the next sub-sections, we first define both modifications separately and finally define our main model, which combines the two (this leads to the late gated recurrent fusion model). Moreover, we also use the late recurrent sensor summation and early gated recurrent fusion models as baselines to evaluate the individual contributions of the two modifications (ablation study).
III-C1 Late Recurrent Summation (LRS)
In this model, we use M copies of the LSTM unit, one for each sensor. A model schematic along with equations is shown below. For each modality, separate forget, input, output and cell states are first computed. The weights and biases that transform the input space for each gate are unique to each modality but are shared across time. As summarized in the previous section, each LSTM unit receives the hidden and cell states of the past time step and the input from the current time step. Now, instead of keeping separate states for each sensor's LSTM unit, all the copies receive the same hidden and cell states obtained from the previous time step. Through this modelling choice we can propagate fused representations temporally. In contrast, in [ren2016look-aaai], the weights are shared between modalities but not the states. By sharing the past cell state across all sensors, the model can individually decide whether to retain or discard memory for each modality. Finally, all the hidden and cell states are added to produce a combined representation that is sent to the next time step (hence the prefix late, indicating late fusion, in the model name).
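The LRS recurrence described above can be sketched in dependency-free Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the gate parameterization is the textbook LSTM, and the dimensions, random initialization, and helper names (`init_lstm`, `lstm_step`, `lrs_step`) are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, x, b):
    """Return W @ x + b for list-based matrices and vectors."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def init_lstm(in_dim, hid_dim, rng):
    """Random weights for the input (i), forget (f), output (o), candidate (g) transforms."""
    def matrix():
        return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim + hid_dim)]
                for _ in range(hid_dim)]
    return {name: (matrix(), [0.0] * hid_dim) for name in "ifog"}


def lstm_step(params, x, h_prev, c_prev):
    """One standard LSTM step on the concatenation [x; h_prev]."""
    z = list(x) + list(h_prev)

    def gate(name, act):
        W, b = params[name]
        return [act(v) for v in affine(W, z, b)]

    i, f, o = gate("i", sigmoid), gate("f", sigmoid), gate("o", sigmoid)
    g = gate("g", math.tanh)
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


def lrs_step(cells, encodings, h_prev, c_prev):
    """Run one LSTM per sensor on the SHARED previous state, then sum the outputs."""
    outputs = [lstm_step(p, e, h_prev, c_prev) for p, e in zip(cells, encodings)]
    h = [sum(vals) for vals in zip(*(h_m for h_m, _ in outputs))]
    c = [sum(vals) for vals in zip(*(c_m for _, c_m in outputs))]
    return h, c


# Illustrative sizes only (assumptions, not the paper's settings).
rng = random.Random(0)
enc_dim, hid_dim, num_sensors, num_steps = 4, 3, 2, 5
cells = [init_lstm(enc_dim, hid_dim, rng) for _ in range(num_sensors)]
h, c = [0.0] * hid_dim, [0.0] * hid_dim
for _ in range(num_steps):
    encodings = [[rng.uniform(-1, 1) for _ in range(enc_dim)]
                 for _ in range(num_sensors)]
    h, c = lrs_step(cells, encodings, h, c)
print(len(h), len(c))
```

Each per-sensor cell has its own weights, but the `h_prev` and `c_prev` arguments are the summed states from the previous step, which is what carries the fused representation through time.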
III-C2 Early Gated Recurrent Fusion (EGRF)
We argued previously that late fusion offers the model some flexibility to separately control the memories of individual sensors, but even there the summation at the end fuses all sensors assuming equal importance. We wish to also learn from the data the extent of each sensor’s contribution to the final fused states. Inspired by the gating mechanisms used in the LSTM [lstm, lstm-forgetgate] and GRU [gru], we propose a similar exposure control in the sensor fusion module. For M sensors, we define M-1 gates that control the exposure of each sensor encoding in the final fused state. Similar to [gru], we define the gating for the last sensor as the complement of the other gates. This makes the joint representation a linear interpolation of individual sensor encodings. The model schematic and equations are shown below. In (9), we show the final fusion step, where each gate is multiplied with the corresponding sensor encoding and the products are summed to form the fused state. Temporal modelling is performed with the fused state as the input to the LSTM, as shown in (10) and (11).
The gating functions are very valuable for drawing insights and explaining the nature of the fusion occurring within the model. Once they are learned, the user can easily interpret the gating values as percentage contributions of each sensor and verify whether they match human insight/experience for some arbitrary sample in the dataset. This explainability feature becomes critical in complex modeling scenarios involving safety-critical tasks.
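The gated interpolation can be illustrated for the two-sensor case with a short sketch. The parameterization below is an assumption made for illustration: a single sigmoid gate vector `p` is computed from both encodings, sensor A is weighted by `p` and sensor B by `1 - p`. The names `egrf_fuse`, `W_p` and `b_p` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def egrf_fuse(W_p, b_p, enc_a, enc_b):
    """Two-sensor gated fusion: gate p weights sensor A, (1 - p) weights sensor B."""
    z = list(enc_a) + list(enc_b)  # the gate sees all sensor encodings
    p = [sigmoid(sum(w * v for w, v in zip(row, z)) + bias)
         for row, bias in zip(W_p, b_p)]
    fused = [pk * a + (1.0 - pk) * b for pk, a, b in zip(p, enc_a, enc_b)]
    return fused, p


# Illustrative encodings (not data from the paper).
enc_image = [1.0, 0.0]
enc_can = [0.0, 2.0]
zero_W = [[0.0] * 4 for _ in range(2)]

# Zero weights and zero bias give p = 0.5: a plain average of the encodings.
fused_even, p_even = egrf_fuse(zero_W, [0.0, 0.0], enc_image, enc_can)
# A large positive bias pushes p toward 1, so the image encoding dominates.
fused_img, p_img = egrf_fuse(zero_W, [10.0, 10.0], enc_image, enc_can)
print(fused_even, p_even)
print(fused_img, p_img)
```

Because the two weights sum to one in every dimension, the fused vector always lies between the two encodings, and the gate values can be read as each sensor's share of the fused state. The fused vector would then be the input to a single LSTM.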
III-C3 Late Gated Recurrent State Fusion (LGRF)
As a final step, we describe our proposed Late Gated Recurrent Fusion model, which combines the best aspects of both late recurrent fusion (independent control of memory for each sensor) and gated recurrent fusion (learning how to fuse) in order to improve learning performance of temporal fusion models.
The model schematic is shown above. Similar to the early gated recurrent fusion model, we compute fusion gates as a function of all the sensor encodings, but instead of linearly interpolating all sensor inputs to get a joint input state, we use the gates to control the exposure of each encoding that is passed into the sensor-specific LSTM cells. The final joint cell and hidden states are computed by summing all the final cell and hidden state outputs.
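Putting the two ideas together, one LGRF time step for two sensors might look like the sketch below. It is a hedged illustration, not the paper's exact equations: the gate is assumed to be a sigmoid vector `p` with complement `1 - p`, the LSTM cell is the textbook formulation, and all names and sizes are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, x, b):
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def init_lstm(in_dim, hid_dim, rng):
    return {name: (rand_matrix(hid_dim, in_dim + hid_dim, rng), [0.0] * hid_dim)
            for name in "ifog"}


def lstm_step(params, x, h_prev, c_prev):
    z = list(x) + list(h_prev)

    def gate(name, act):
        W, b = params[name]
        return [act(v) for v in affine(W, z, b)]

    i, f, o = gate("i", sigmoid), gate("f", sigmoid), gate("o", sigmoid)
    g = gate("g", math.tanh)
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


def lgrf_step(cells, W_p, b_p, enc_a, enc_b, h_prev, c_prev):
    """Gate each encoding, feed it to its own LSTM on the shared state, sum the outputs."""
    p = [sigmoid(v) for v in affine(W_p, list(enc_a) + list(enc_b), b_p)]
    gated = [[pk * a for pk, a in zip(p, enc_a)],
             [(1.0 - pk) * b for pk, b in zip(p, enc_b)]]
    outputs = [lstm_step(params, x, h_prev, c_prev) for params, x in zip(cells, gated)]
    h = [h_a + h_b for h_a, h_b in zip(outputs[0][0], outputs[1][0])]
    c = [c_a + c_b for c_a, c_b in zip(outputs[0][1], outputs[1][1])]
    return h, c, p


# Illustrative sizes only (assumptions, not the paper's settings).
rng = random.Random(0)
enc_dim, hid_dim = 4, 3
cells = [init_lstm(enc_dim, hid_dim, rng) for _ in range(2)]
W_p, b_p = rand_matrix(enc_dim, 2 * enc_dim, rng), [0.0] * enc_dim
h, c = [0.0] * hid_dim, [0.0] * hid_dim
for _ in range(5):
    enc_a = [rng.uniform(-1, 1) for _ in range(enc_dim)]
    enc_b = [rng.uniform(-1, 1) for _ in range(enc_dim)]
    h, c, p = lgrf_step(cells, W_p, b_p, enc_a, enc_b, h, c)
print(len(h), len(p))
```

Compared with summing ungated encodings, the gate here scales each sensor's input before its LSTM cell sees it, so a noisy sensor can be attenuated at the time step where it is unreliable.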
Having described the new temporal fusion designs, in the next section, we test the models on two challenging autonomous driving datasets to validate our claims.
IV-A Tactical Driver Behaviour Classification
IV-A1 HDD Dataset
Recently, HDD [ramanishka2018toward] was proposed to stimulate research on learning driver behavior in interactive situations. The dataset includes 104 hours of synchronized multi-sensor naturalistic driving data. We focus on the goal-oriented driver behavior classification task, which involves temporally classifying multimodal data, consisting of a video stream and CAN bus data, into driver actions. The 104 video hours of data correspond to 137 sessions. Each frame carries one label from the twelve behavior classes, such as left turn, right turn, intersection passing, lane change, etc.
We follow the prior work [ramanishka2018toward] to obtain our training (100 driving sessions) and testing (37 driving sessions) splits. CAN bus sensors include car speed, accelerator and braking pedal positions, yaw rate, steering wheel angle, the rotation speed of the steering wheel, and turn signals (an eight-dimensional stream). The images are of dimension . The data is synchronized. The image representation is extracted from conv layer of InceptionResnet-V2 [szegedy2017inception] pre-trained on ImageNet [deng2009imagenet]. The features are convolved with a convolution to reduce the dimensionality from to and flattened to . Raw sensor signals are passed through a fully-connected layer to transform size signal to obtain a one dimensional feature vector of size. This is similar to the feature embedding used in [ramanishka2018toward]. For a fair comparison, the InceptionResnet-V2 backbone weights are frozen, while the rest of the linear and convolution weights are learnt end to end.
In this task, the input is untrimmed, egocentric sequences of video and CAN signals. The output is the tactical driver behavior label of each frame. We follow the evaluation protocol of [NakamuraCVPR2017, ramanishka2018toward, shou2018online], which computes the average precision (AP) for each tactical driver behavior as well as the mean AP (mAP) over all classes. We use the Adam optimizer [kingma2014adam] to learn the network parameters, with the sequence length set to 90 video frames. To compare fairly with the baseline methods [ramanishka2018toward], we use the same batch size of 40. Training is performed using truncated back-propagation through time and is terminated after 50 epochs, with a fixed learning rate.
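For readers unfamiliar with the metric, per-class AP and mAP over frame-level predictions can be sketched as below. This uses the common non-interpolated definition of AP (mean of precision at the rank of each positive frame); the exact protocol in the cited works may differ in details, and the scores and labels are made-up illustrative values.

```python
def average_precision(scores, labels):
    """AP for one class: mean of the precision at the rank of each positive frame."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(per_class_scores, per_class_labels):
    """Return the list of per-class APs and their mean (mAP)."""
    aps = [average_precision(s, l)
           for s, l in zip(per_class_scores, per_class_labels)]
    return aps, sum(aps) / len(aps)


# Hypothetical scores for four frames and two classes.
scores = [[0.9, 0.8, 0.3, 0.1],   # class 0 confidence per frame
          [0.2, 0.7, 0.6, 0.4]]   # class 1 confidence per frame
labels = [[1, 0, 1, 0],           # class 0 ground truth per frame
          [0, 1, 1, 0]]           # class 1 ground truth per frame

aps, m_ap = mean_average_precision(scores, labels)
print(aps, m_ap)
```

Because mAP averages over classes rather than frames, rare behaviors count as much as frequent ones, which matters for driving data where most frames belong to a few common classes.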
Non-Fusion Architecture. We first perform experiments on the CAN (CAN bus signal) and Img (image) sensors separately. The embeddings are sent directly to a standard LSTM with hidden size of dimension . The output is fed directly into a fully connected layer that squashes the dimension to classes including the background class. The CAN bus sensor performs better on classes such as left turn, right turn, and U-turn, while the image sensor performs better on classes such as lane change, lane branch, intersection passing and crosswalk passing. TCN [bai2018empirical] performs slightly better than LSTM. A successful sensor fusion architecture should outperform these results by benefiting from the best characteristics of each sensor.
Early Fusion LSTM. As baseline architectures we use early sensor fusion, where sensor embeddings are either concatenated (Early-Concat) or summed element-wise (Early-Add), as explained in Section III-B. Early-Concat is similar to the technique used in [ramanishka2018toward]. Among the early fusion baselines, Early-Concat outperforms Early-Add (mAP of 32.66 vs 29.88), as the LSTM has access to individual sensor information and can choose to discard noisy sensor readings. Adding, however, can corrupt the fused encoding, so that the LSTM operates on a corrupted feature space.
Late Fusion LSTM. Here we have two separate LSTM cells that do not share any weights or hidden states between the modalities. Concatenation or summation happens after the LSTM cell. More precisely, the combined output is sent to a single fully connected layer for classification. The fully connected layer operates on a dimension vector in the case of Late-Concat or dimension vector in the case of Late-Add respectively. Interestingly, Late-Add (which is essentially LRS without cell state sharing) outperforms all other types of baseline fusion, as the addition of cell states allows the model to focus more on the individual temporal aspects of each sensor. However, this can still be improved using a joint cell representation (which can be thought of as a latent embedding).
Look, Listen and Learn [ren2016look-aaai]. The baseline most similar to our LRS model described in Section III-C1 is the Look, Listen and Learn architecture presented in [ren2016look-aaai]. We re-implement the architecture in PyTorch for the HDD dataset. We add auxiliary losses to both modalities and sum the predicted outputs, with weight sharing. While this provides a huge improvement over the baseline models on every class, we turn to our fusion models to improve upon the SOTA. We call this model Look Listen in Table I.
EGRF, LRS, LGRF. We replace the standard LSTM with each of our fusion modules explained in Section III-C2 (EGRF), Section III-C1 (LRS), and Section III-C3 (LGRF). Each of our proposed fusion architectures outperforms the SOTA on almost all of the classes. Our EGRF and LRS models increase the mAP by over the standard fusion LSTMs while benefiting distinct class labels. Finally, we hypothesize that our combined model LGRF combines the benefits of both LRS and EGRF, thereby increasing the mAP by . As alluded to in the earlier section, the main driver for the performance boost is the added flexibility in learning afforded by the gating functions, which allow the network to modulate the fusion process at each time step and make the best use of the data coming from individual sensors.
Table I (row for the Look, Listen & Learn baseline [ren2016look-aaai]): sensors CAN+Img; per-class AP 81.11, 78.46, 79.01, 43.20, 25.29, 30.17, 7.79, 13.94, 3.56, 8.92, 33.39; mAP 36.80.
One of the limitations of most sensor fusion architectures is the inability to provide visual explanations for the decision-making process. For example, in the case of a noisy sensor signal, the model needs to adapt to another sensor and gate out the noise. The LGRF model is uniquely positioned to give class-specific reasoning for the sensor weighting. For this, we apply global average pooling on the pre-gate layer along the sensor dimension and display its value. For example, a value of for sensor 1 means that sensor 1 had a higher weighting than the for sensor 2. We additionally visualize the class activation maps [zhou2016learning] to show the localization ability of our models by using Grad-CAM [gradcam] on the last convolution layer of the image input.
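The pooling step used to read off a per-sensor weighting can be sketched as follows. This is an interpretation under assumptions: each sensor is taken to have a gate vector at a given time step, the vector is averaged to one scalar, and the scalars are normalized to percentages. The function name and the gate values are hypothetical.

```python
def sensor_contributions(gate_vectors):
    """Average each sensor's gate vector, then express the averages as percentages."""
    pooled = [sum(gate) / len(gate) for gate in gate_vectors]
    total = sum(pooled)
    return [100.0 * value / total for value in pooled]


# Hypothetical two-sensor gates at one time step; the second is the complement.
p_image = [0.8, 0.6, 0.7]
p_can = [1.0 - value for value in p_image]

image_pct, can_pct = sensor_contributions([p_image, p_can])
print(round(image_pct, 1), round(can_pct, 1))
```

Averaging such percentages over all frames of one behavior class gives a class-level summary of which sensor the model leans on.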
This is showcased in Figure 6. We get explainable results that validate our assumptions about which sensor is important for which action. The heat map falls on image locations such as lane markers for lane actions, and road extensions for turns or intersection passing. Turns have a higher CAN bus weighting, as the CAN signals capture the motion better. The most interesting observation is the truck occluding the view in the last example. Our model not only improves the attention region by localizing to the cross-walk but also shows a high but equal weighting for both images and CAN signals, thereby correctly classifying the action.
IV-B Steering Action Regression
To showcase the generality of our methodology, we test our models on the problem of steering angle regression. Given a set of sensor signals, the task is to determine the appropriate steering control action to successfully drive on a race track. One method of addressing this problem is to perform end-to-end regression. Better temporal fusion could provide richer features for the challenging task of understanding vehicle dynamics just by observing sensors. We showcase our extension to a three-sensor setting where there is a vast discrepancy in the dimension of each sensor.
IV-B1 TORCS Dataset
TORCS is a driving simulator capable of simulating physically realistic vehicle dynamics as well as multiple sensing modalities, which can be used to build sophisticated AI agents that complete race tracks. The sensing modalities in our state description include: (1) odometry (SpeedX, SpeedY, SpeedZ), substituting for CAN signals, (2) laser scans consisting of 19 LiDAR points, and (3) color images capturing the ego-view.
We collect time steps from 32 different tracks that vary from complicated loops to different road conditions. To collect the steering action ground truths, we use the standard PID controller, which successfully completes one lap on each track without veering off the road. Out of the 32 tracks, we divide the training-test into - track split. We perform cross validation and display the results for the best split. For the encoding we employ multiple convolution layers of kernel size . More specifically, the following layers are used: layer1 and layer2 have 32 filters, followed by max pooling with kernel size, followed by layer3 and layer4 with 64 filters each. Finally an additional max pooling layer with to downsample the feature space to a feature size of . The velodyne points and odometry signals are embedded using separate linear layers to and respectively. The flattened image, odometry and LiDAR embeddings are used as inputs to our model. This is shown in Fig. 2. The output of the LSTM is also sent to a same linear layer followed by activation. This is maintained for all experiments and trained end to end for fair comparison.
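The PID controller used to generate steering ground truth is a standard technique, and a generic version can be sketched as below. Everything specific here is an assumption for illustration: the gains, the use of lateral offset from the track centre as the error signal, the clipping of steering to [-1, 1], and the toy one-line vehicle dynamics are not taken from the paper.

```python
class PIDController:
    """Textbook PID controller on a scalar error signal."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error, dt):
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def clip(value, low=-1.0, high=1.0):
    return max(low, min(high, value))


# Hypothetical gains and a toy model in which steering directly reduces the offset.
pid = PIDController(kp=2.0, ki=0.0, kd=0.1)
offset, dt = 0.8, 0.1
steering_log = []
for _ in range(50):
    steer = clip(pid.step(offset, dt))  # this value would be logged as the label
    steering_log.append(steer)
    offset -= steer * dt                # toy dynamics, not a vehicle model
print(len(steering_log), abs(offset) < 0.01)
```

In a data-collection loop of this kind, the logged steering value at each time step would serve as the regression target paired with that step's sensor readings.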
IV-B2 Results and Discussion
In this task the input is a batch of images, odometry and LiDAR points with a time history of four time steps. We train the model to directly regress the steering action. We compute the average mean squared error (MSE) between the predicted action and the ground truth action value over all the data in the validation set. After an ablation study we found that history above time steps does not provide significant information for temporal modeling. We set the batch size to 128 for all experiments and use the Adam optimizer to train the weights. For the final activation function we choose activation function to squish the last linear layer output to a range . We perform a grid search on the learning rate from to . Overall performs best.
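The evaluation metric is the usual mean squared error between predicted and ground-truth steering values. A minimal sketch follows; the function name and the sample values are illustrative, not results from the paper.

```python
def mean_squared_error(predicted, target):
    """Average squared difference between predicted and ground-truth values."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


# Hypothetical steering predictions and ground truth for three validation samples.
predicted_steering = [0.1, -0.2, 0.0]
target_steering = [0.0, -0.2, 0.3]

mse = mean_squared_error(predicted_steering, target_steering)
print(mse)
```

Since the errors are squared, a few large steering mistakes dominate the score, which suits a task where a single bad action can take the car off the track.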
IV-B3 Fusion Architectures
Baselines. We extend our previous baselines from the HDD experiments to a three sensor setting and report the results in Table II. A similar trend in results is obtained, with [ren2016look-aaai] outperforming other baselines with the lowest error of . Most of the models are not able to handle the huge disparity in sensor embedding dimensionality, with the image embedding size of overshadowing the others.
EGRF, LRS, LGRF. We extend our models to a three sensor setting. This involves 1) modifying Eqn. 4 in LRS to support 12 gates (4 for each modality), 2) computing two pre-gates for EGRF as in Eq. 8 for images, LiDAR and odometry, and 3) combining both for LGRF. Our models outperform all other baselines. LGRF gives the best overall performance, with an additional +20% drop over the state-of-the-art. An interesting note is the huge variation in error between EGRF and LRS. We suspect that this might be due to the highly correlated sensors in a simulated setting, as opposed to the real world setting in the HDD dataset. Hence the benefit of early noise rejection from EGRF does not play as important a role as learning to fuse the best aspects of each sensor as in LRS.
In this work we presented a novel temporal fusion architecture, which we termed the Gated Recurrent Fusion Unit, to learn from large-scale multi-sensory temporal data. We define gating functions that modulate the exposure of individual sensor data at each time step to determine the optimal fusion strategy. For future work we wish to test on more general multimodal settings. Failure cases, as shown in Figure 7, have to be investigated further to understand why the attention appears to fall on unexplained regions in the image. Moreover, additional extensions using TCN backbones need to be considered for further temporal modelling improvement.