
Multi-Modal Fusion for Sensorimotor Coordination in Steering Angle Prediction

Imitation learning, when employed to learn sensorimotor coordination for steering angle prediction in an end-to-end fashion, requires expert demonstrations. These expert demonstrations pair environmental perception data with vehicle control data. The conventional frame-based RGB camera is the most common exteroceptive sensor modality used to acquire the environmental perception data, and it has produced promising results when used as a single modality in learning end-to-end lateral control. However, the frame-based RGB camera has limited operability under illumination variation and is affected by motion blur. The event camera provides complementary information to the frame-based RGB camera. This work explores the fusion of frame-based RGB and event data for learning end-to-end lateral control by predicting the steering angle. In addition, it examines how fusing the representation from event data with frame-based RGB data helps to predict the lateral control robustly for the autonomous vehicle. To this end, we propose DRFuser, a novel convolutional encoder-decoder architecture for learning end-to-end lateral control. The encoder module is branched between the frame-based RGB data and the event data, along with self-attention layers. Moreover, this study contributes our own collected dataset comprising event, frame-based RGB, and vehicle control data. The efficacy of the proposed method is experimentally evaluated on our collected dataset, the Davis Driving dataset (DDD), and the Carla EventScape dataset. The experimental results illustrate that DRFuser outperforms the state-of-the-art in terms of root-mean-square error (RMSE) and mean absolute error (MAE), the evaluation metrics used.



1 Introduction

Autonomous driving has gained tremendous attention in both the vision and robotics communities over the past decade. The development of autonomous vehicles involves a modular design incorporating localization, perception, planning and control azam2020system. Since the vision modality most closely mimics human perception, an alternative to the classical design is to learn a mapping function between the vision modality and the control actuators. Learning this mapping function broadly involves two approaches: i) computer vision approaches and ii) imitation learning-based approaches. The former predicts the steering angle by using road boundary or lane marking detection as the feature extraction mechanism: the images are first pre-processed and then passed to either road boundary detection or lane marking detection. To further improve the detection, tracking over consecutive frames is employed to remove erroneous road or lane information. The latter approach employs a neural network to learn the mapping function between the vision modality and the control actuator. For this purpose, vision data labelled with steering angle information is collected through behavior cloning (expert demonstrations) and used to train the neural network. Learning the mapping function through imitation learning involves neural network architecture selection, dataset collection, and optimization of the network parameters for better performance in predicting the steering angle.


Figure 1: (a,b) Our autonomous vehicle equipped with sensors for data collection. (c,d) The frame-based RGB image and event data captured using this setup.

Frame-based RGB cameras, as a vision modality, have shown impressive results for end-to-end driving codevilla2019exploring; chen2020learning; ohn2020learning; prakash2020exploring. Yet, the robust performance of these vision systems for autonomous driving in challenging conditions is still an open problem. For instance, the frame-based camera has limitations in changing illumination conditions and is prone to motion blur. In contrast, the event camera captures the per-pixel brightness change as a stream of asynchronous events, providing complementary sensory information to the conventional frame-based RGB camera. The event camera operates in a high dynamic range and has a high temporal resolution and less motion blur compared to frame-based RGB cameras gallego2019event.

Since the event camera provides complementary sensory information to the frame-based RGB camera in changing illumination conditions, this work hypothesizes that fusing event and frame-based RGB cameras benefits steering angle prediction for lateral control, and explores the fusion mechanism accordingly. Prior work on the fusion of event and frame-based cameras follows two approaches. The first augments the frame-based RGB image with the event data as an additional channel hu2020ddd20. In the other approach, frames are generated from the event data for feature representation learning maqueda2018event. These earlier approaches have utilized a convolutional neural network (CNN) as the feature extractor, with stacked event and frame-based RGB data, which helps the CNN capture the global context as if from a single modality. However, this stacking limits the utility of the event camera as a secondary modality alongside the frame-based RGB images and leads to non-trivial behavior in the model's interactions between multiple features.

To overcome the problems mentioned earlier, this work explores learning the lateral control in terms of steering angle prediction by introducing self-attention into the fusion of frame-based RGB and event data. A convolutional encoder-decoder architecture named DRFuser is designed. In the proposed architecture, an individual convolutional encoder network is adopted for each of the two modalities, including self-attention layers for feature extraction and fusion, as illustrated in Fig.2. In this work, we follow the behavior cloning approach of imitation learning to learn the lateral control in terms of steering angle prediction. The efficacy of the proposed method is evaluated on three datasets: i) our collected dataset, ii) the Davis Driving dataset (DDD) hu2020ddd20, and iii) the simulated Carla EventScape dataset gehrig2021combining. For the dataset collection, we have utilized our experimental vehicle, illustrated in Fig.1, on which a Davis- event camera and a frame-based RGB camera are installed azam2020system. The vehicle control data is collected from the vehicle Controller Area Network (CAN) bus. In the experimental evaluation, the Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) metrics are employed, and the proposed method achieves better scores on both. In the literature, only the DDD dataset is used for the experimental evaluation of steering angle prediction; on this dataset, the proposed method outperforms the state-of-the-art methods in terms of RMSE. In summary, the main contributions of this work are as follows:

  1. A novel convolutional encoder-decoder architecture (DRFuser), with a self-attention layer in the encoder of each modality (event and frame-based RGB), is designed for learning the end-to-end lateral steering angle. We explore the behavior cloning approach of imitation learning for learning the steering angle by fusing the event data and frame-based RGB data.

  2. We have collected our own dataset comprising event data, frame-based RGB data and Controller Area Network (CAN) bus data, which includes throttle, brake, speed, steering angle and torque values. All the data is collected and synchronized using the Robot Operating System (ROS) framework. Our code and data are open-sourced.

  3. The efficacy of the proposed method is extensively evaluated on three datasets: our collected dataset, the DDD dataset and the Carla EventScape dataset. We have evaluated the proposed method in terms of RMSE and MAE scores for predicting the steering angle.

2 Related Work

2.1 End-to-End lateral control learning for Autonomous vehicles

End-to-end learning of lateral control in terms of steering angle prediction is a challenging problem. Typical autonomous driving lateral control is developed by aggregating a highly engineered modular architecture that includes localization, perception, motion planning, prediction, decision making, and control. In contrast, end-to-end approaches generate steering angle values by mapping visual observations directly to control actions. Visuomotor lateral control thus tightly couples the perception and control elements of the problem. ALVINN was the first visuomotor prediction network that learned to steer from images pomerleau1998autonomous. The network consists of neural network layers trained in an end-to-end fashion, and it demonstrated the potential of neural networks for steering autonomous vehicles. Following that, NVIDIA used a convolutional neural network (CNN) to predict steering angles using images from the front-facing camera bojarski2016end. They obtain good predictions for highway driving using a network of five convolution layers and three fully connected layers.

Researchers have built on the studies mentioned above to introduce an assortment of networks that learn to steer. xu2017end use an extensive video database to train an FCN-LSTM network to learn a generic vehicle motion model that predicts the future steering angle from camera observations and the previous vehicle state. kim2017interpretable use attention heat maps to regress the steering angle from frame-based RGB images, emphasizing the interpretability of the input data used to predict the control commands. azam2021n2c leverage behavioral cloning to train a neural network-based controller that predicts throttle, brake, and torque commands from speed and steering values. Moreover, they use images to predict speed and steering angle with a deep neural network, which is used in conjunction with the neural network-based controller.

Moreover, some research is focused on multi-modal steering angle prediction. For example, maanpaa2021multimodal use RGB images together with Lidar range and reflectance to predict the steering angle in adverse weather conditions. The data is fused using two techniques: the middle fusion dual model and the channel gated dual model. In the middle fusion, the features are concatenated after the convolutional part of the network, whereas the channel gated dual model has two parallel architectures whose features are concatenated through a gated channel. hou2019learning supplement the visual input with auxiliary information from segmentation and optical flow data to learn the steering angle. They use the PSPNet and FlowNet networks to obtain segmentation and optical flow from the RGB image and incorporate low-, medium-, and high-level feature maps to train an end-to-end network for steering angle prediction.

2.2 Dynamic vision sensor for visuomotor lateral control of autonomous vehicles

Early work on end-to-end learning to steer used visual data from the frame-based camera. In contrast, we propose fusing information gathered from event-based asynchronous cameras and frame-based cameras to learn the lateral control of the autonomous vehicle in an end-to-end manner. The event camera operates asynchronously and captures the change in brightness (events) for all pixels independently gallego2019event. Therefore, the event camera generates sparse asynchronous signals in space and time, giving it a higher temporal resolution, low latency, and high dynamic range. Frame-based cameras are sensitive to illumination variation, motion blur, and sun glare. The complementary nature of event data to frame-based RGB data makes it well suited to fusing the information from both modalities.

The proficiency of event cameras in providing rich data helps in solving perception problems in autonomous vehicles hidalgo2020learning; munir2021ldnet; alonso2019ev. Here, we tackle the problem of steering angle prediction by fusing data from event-based cameras and frame-based RGB cameras. Similar work is done by maqueda2018event; hu2020ddd20, who publish two large-scale datasets, DDD and DDD. These datasets contain event and RGB-frame data for different road and weather conditions. They designed a simple deep neural network based on ResNet-32 and linear layers to predict the steering angle. The input is a two-channel tensor consisting of an event frame and an RGB frame. The network is trained on one sequence of recorded data and tested on the event-only data and the fused data. In maqueda2018event, the same network is used with a different input data representation: the event data is converted to frames by integrating events over time. The proposed work improves on maqueda2018event; hu2020ddd20 by developing a self-attention-based fusion network for steering angle prediction. Instead of using event data as an input channel, we explore the fusion of features by utilizing self-attention to emphasize valuable features and improve the network’s learning.

2.3 Attention

The ability of attention to learn to focus on essential features within a context has made it a significant ingredient in deep learning models for several modalities. In the literature, three kinds of attention are used: additive bahdanau2014neural, multiplicative luong2015effective and self-attention vaswani2017attention. Additive attention transfers information from the encoder to the decoder, which enhances the feature representation. The decoder neuron receives additional input through a gating signal from the encoder, providing the flexibility to focus on essential features. In multiplicative attention, the gating signal is multiplied instead of added, which has the drawback of poor performance for high-dimensional input features. Self-attention incorporates long-distance interactions into the model, which gives it the strength to retain both global and local features. We incorporate a self-attention module to fuse the information from the event and frame-based data and encode the global context of the surrounding environment, which helps the algorithm predict the steering angle robustly.

Figure 2: The overall framework of the proposed method. It includes convolutional encoders, self-attention layers for feature extraction and fusion, and a convolutional decoder. The network takes frame-based RGB and event data as input and predicts the steering angle.

3 Methodology

This section describes in detail the proposed framework for learning the end-to-end lateral control steering angle from multi-modal data, as illustrated in Fig.2. The framework is composed of convolutional encoders for the frame-based RGB images and the event data, self-attention layers, and a convolutional decoder. The self-attention layers are introduced between the encoders to learn the long-range interaction between the encoded features of the frame-based RGB image and the event data.

3.1 Problem Formulation

This work explores end-to-end learning of the lateral control by predicting steering angle in an urban setting through fusing the information from frame-based RGB and event data.

Imitation learning (IL) deals with learning a control strategy that imitates the behaviour of an expert hussein2017imitation. Employing the IL strategy, the focus of our problem is to learn the lateral control by mapping the input to the steering angle. We have adopted IL’s Behavior Cloning (BC) approach, which corresponds to a supervised learning method ly2020learning. An expert behavior cloning dataset of size is collected from the environment. The dataset comprises high-dimensional observations with the corresponding steering angles recorded by manual driving. The fusion network, in the form of an encoder-decoder architecture, is trained in a supervised manner using the dataset with the loss function by employing the objective function expressed as:


The high-dimensional observation, , encompasses the frame-based RGB image and event camera data. In training, the Huber (smooth-L1) loss function is employed to measure the distance between the predicted steering angle, and the expert steering angle . The Huber loss is given below, the is used in the experimentation. The Huber loss function is a combination of the mean squared error and the absolute value function. The intuition behind using it as the objective function is to combine the best of two worlds (i.e., the mean squared error and the absolute value function). The balance that the Huber loss strikes between mean squared error and mean absolute error allows the proposed model to be trained on varied data (as in the case of the steering angle). The value of in the Huber loss determines where the mean squared error and where the mean absolute error applies as the objective function in training the proposed network. For loss values that are less than , the mean squared error is employed, whereas for loss values larger than the mean absolute error is employed.
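The switching behavior described above can be sketched in a few lines of plain Python. This is a minimal illustration of the standard Huber form, not the authors' training code: the function name `huber_loss` and the threshold default `delta=1.0` are assumptions made here for illustration (the threshold value used in the experiments is not recoverable from the text). With a threshold of 1 the standard Huber loss coincides with the smooth-L1 loss.

```python
def huber_loss(predicted, expert, delta=1.0):
    """Mean Huber loss between predicted and expert steering angles.

    delta is a hypothetical threshold chosen for illustration only.
    """
    if len(predicted) != len(expert) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for p, e in zip(predicted, expert):
        err = abs(p - e)
        if err <= delta:
            # Small errors: quadratic branch (behaves like squared error).
            total += 0.5 * err * err
        else:
            # Large errors: linear branch (behaves like absolute error).
            total += delta * (err - 0.5 * delta)
    return total / len(predicted)
```

The quadratic branch keeps gradients small near the target, while the linear branch limits the influence of occasional large steering errors.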


3.2 Self-attention Multi-Modal Fusion Network

Convolution forms a fundamental building block of current vision architectures. However, convolution is unable to capture long-range dependencies between features, and effort has been focused on augmenting convolution modules with non-local means such as attention to advance vision tasks. The introduction of attention in the encoder-decoder architecture of neural transduction models has enabled learning representations from variable sources bahdanau2014neural. Beyond the success of attention in natural language processing applications, approaches incorporating attention into vision tasks have shown strong performance. Specifically, self-attention is a form of attention applied within a single context instead of across multiple contexts. Here, a context can be regarded as a modality, for instance, speech, text, or images xu2015show; wu2016google. The capability of attention to learn to concentrate on essential features within the context has motivated this study to use it in the fusion of multi-modal data.

The self-attention multi-modal fusion network comprises an encoder-decoder architecture with the addition of self-attention layers in the encoder module. The key idea is to exploit the self-attention layer to incorporate the global context of the frame-based RGB image and event data in predicting the steering angle. Formally, consider the input features , where is the number of channels of the input features. A local region of features, known as the memory bank, in the position of with the spatial extent of centred around is extracted for the computation of self-attention. This local form of self-attention differs from global attention between all features. Global attention is computationally expensive, which limits its applicability across all layers in the encoder-decoder network. The linear projections for computing the set of queries, keys and values (, and ) are expressed in Eq.(3) vaswani2017attention.


where are the learned weight matrices. The single-headed attention is computed by taking the dot product between and and then aggregating the values for each query, as expressed in Eq.(4)


where corresponds to the output feature of the same size as the input feature . In practice, multiple attention heads are employed, so the aforementioned computation is performed repeatedly for every feature . For computational efficiency, the input feature is partitioned into groups, single-head attention is computed for each group with different learned weight matrices per head, and the results are finally concatenated to give the output representations. To incorporate positional information in the attention, so that it is no longer permutation equivariant, the relative distance of is added to each position of . This relative distance is expressed by row and column offsets, denoted as and respectively. In addition, these offsets are associated with embedding and respectively. The relative attention that is employed in the feature extractor is expressed in Eq.(5) vaswani2017attention.
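To make the query/key/value computation concrete, the following is a simplified sketch of single-head local dot-product self-attention, written in plain Python. It is an illustration under stated assumptions rather than the layer used in DRFuser: it operates on a 1-D sequence of feature vectors instead of a 2-D feature map, it omits the relative position embeddings and the multi-head grouping, and the names `local_self_attention`, `matvec` and `window` are hypothetical.

```python
import math


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def local_self_attention(features, w_q, w_k, w_v, window=3):
    """Single-head dot-product self-attention over a local 1-D window.

    features: list of feature vectors, one per position.
    w_q, w_k, w_v: projection matrices for queries, keys and values.
    """
    half = window // 2
    queries = [matvec(w_q, x) for x in features]
    keys = [matvec(w_k, x) for x in features]
    values = [matvec(w_v, x) for x in features]
    outputs = []
    for i, q in enumerate(queries):
        # Neighbourhood (memory bank) centred on position i, clipped at the borders.
        lo, hi = max(0, i - half), min(len(features), i + half + 1)
        logits = [sum(a * b for a, b in zip(q, keys[j])) for j in range(lo, hi)]
        # Numerically stable softmax over the neighbourhood.
        peak = max(logits)
        weights = [math.exp(l - peak) for l in logits]
        norm = sum(weights)
        out = [0.0] * len(values[0])
        for w, j in zip(weights, range(lo, hi)):
            for d, v in enumerate(values[j]):
                out[d] += (w / norm) * v
        outputs.append(out)
    return outputs
```

Restricting the softmax to a small neighbourhood is what keeps the cost manageable compared with global attention, where every position attends to every other position.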


For feature extraction and fusion, two encoder networks are designed for the frame-based RGB image and the event data, respectively. The architectures of both encoder networks are identical. In contrast to SegNet badrinarayanan2017segnet, MFNet ha2017mfnet and FuseNet hazirbas2016fusenet, we have employed the ResNet he2016deep model for feature extraction. The last two layers of ResNet, the average pooling and fully connected layers, are excluded because they serve the classification task. The ResNet model incorporates an initial block containing convolution, batch normalization, and ReLU activation layers. A max-pooling layer and four residual blocks are used sequentially after the initial block to reduce the spatial resolution of the features. We have chosen the ResNet model in this work because it is widely accepted as a feature extractor in the research community.

The self-attention layer is employed in each encoder to learn the long-range interaction between the encoded features. The self-attention layers are applied at multiple scales of the ResNet feature extraction, as illustrated in Fig.2. The outputs of the self-attention layers from both encoder networks are fused element-wise. Note that the feature map size is not changed by the fusion. The output of the last fusion operation is fed as input to the decoder network.

The function of the decoder is to produce the prediction from the encoded features. We have designed a simple yet effective decoder for steering angle prediction in the proposed work. The decoder is not a mirrored version of the encoder network; the encoder and decoder are asymmetric in architecture. The decoder network comprises three convolutional blocks, each of which includes convolution, batch normalization, and ReLU activation layers. In the second and third convolutional blocks, a dropout layer is introduced to improve generalization and avoid over-fitting. Following the convolutional blocks, two linear layers are employed to predict the steering angle. The decoder architecture details are given in Table-6. The decoder used with ResNet-34 has one convolution block compared to ResNet-50 to accommodate the feature map size.

Figure 3: Detailed operation of the self-attention layer.

4 Experimentation and Results

4.1 Datasets

The efficacy of the proposed method for learning the control policies using the fusion of frame-based RGB image and event data is evaluated on two publicly available datasets and our collected dataset. The details of each dataset are explained below.

4.1.1 Our Collected dataset

We have used our experimental vehicle shown in Fig.1, equipped with exteroceptive and proprioceptive sensors, for data collection. The perception sensors include a -channels Velodyne Lidar, a Logitech RGB camera and a DAVIS dynamic vision event camera. In addition, the Novatel Global Navigation Satellite System (GNSS) is incorporated as the proprioceptive sensor. Finally, the drivekit and can-shield are used to extract the CAN bus data, the details of which are given in our previous work azam2021n2c. This study used CAN bus data, frame-based RGB image data, and event data for the proposed method. The event data is collected using a DAVIS camera, which has a resolution of pixels. The DAVIS camera also outputs an intensity image, but the resolution of that intensity image is low. To compensate for this, we have utilized the Logitech-c RGB camera. It gives a max resolution of pixel at frame per second (fps). Meanwhile, the CAN data, which comprises torque, speed, throttle, brake, and steering angle, is obtained through behavioural cloning while driving in an urban environment during day and night.

The event camera, frame-based RGB camera and vehicle CAN all operate at different frequencies, which prevents the data from being used directly by the proposed method. The data from these sensors therefore needs to be synchronized. Generally, two synchronization approaches are adopted in the research community: hardware-triggered synchronization and soft-time synchronization. In our data collection, the soft-time synchronization approach is adopted, developed under the Robot Operating System (ROS) framework. ROS provides a flexible ecosystem for robotics systems and contains powerful tools and libraries. We have employed individual ROS nodes for data acquisition from the CAN bus, the frame-based RGB camera and the event camera. We have designed a proprietary ROS node for the CAN bus that publishes speed, steering angle, brake, torque, and throttle values in the ROS framework. Since ROS provides drivers for the frame-based RGB camera, we have utilized that ROS node for the RGB image data collection. For the DAVIS camera, we have employed the open-source ROS node for event data collection. Since each sensor has a different frequency, we adopted a nearest-neighbour search anchored on the sensor with the lowest frequency and matched the data from the other sensors to it. In our case, the frame-based RGB camera has the lowest frequency, whereas the event camera and CAN bus have higher frequencies. In addition, we have open-sourced the synchronization code for the research community.
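The nearest-neighbour matching described above can be sketched as follows. This is an illustrative stand-alone version in plain Python, not the released ROS synchronization code: the function names `nearest_index` and `synchronize`, the stream names, and the assumption that each stream's timestamps are already sorted are all hypothetical choices made for this example.

```python
from bisect import bisect_left


def nearest_index(sorted_stamps, t):
    """Index of the timestamp in sorted_stamps closest to t (ties go to the earlier one)."""
    if not sorted_stamps:
        raise ValueError("stream has no timestamps")
    pos = bisect_left(sorted_stamps, t)
    if pos == 0:
        return 0
    if pos == len(sorted_stamps):
        return len(sorted_stamps) - 1
    before, after = sorted_stamps[pos - 1], sorted_stamps[pos]
    return pos if (after - t) < (t - before) else pos - 1


def synchronize(reference_stamps, other_streams):
    """Match every reference timestamp to the nearest sample of each other stream.

    reference_stamps: timestamps of the lowest-frequency sensor (here, the RGB camera).
    other_streams: dict mapping a stream name to its sorted timestamps.
    Returns one dict per reference sample, mapping stream name to the matched index.
    """
    return [
        {name: nearest_index(stamps, t) for name, stamps in other_streams.items()}
        for t in reference_stamps
    ]
```

Anchoring on the slowest sensor guarantees that every output tuple contains exactly one RGB frame, at the cost of discarding the surplus samples of the faster streams.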

Pre-processing of Event Data

Event cameras operate asynchronously and generate output data in the form of spikes or events as a result of changes in the brightness level of the viewed scene. The event or spike representation of event camera data limits its usability with convolutional neural networks. For this purpose, the event data is transformed into a suitable representation (for instance, images) that convolutional neural networks can consume. Mathematically, a stream of event data is converted to an image where each independent pixel corresponds to a change in brightness . An event is recorded for each pixel and time when the change in brightness surpasses the threshold (), as expressed in:


where represents the polarity of the brightness change and corresponds to the time of the last triggered event at location . Eq.(6) represents the event generation model, and using it a sequence of events is generated for the time interval . In our data collection, we fixed the events to , which is approximately msec, and generate the corresponding image representation whenever the number of events reaches this fixed number. The number of events was selected empirically and produced better results in our data collection.
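The fixed-count accumulation described above can be illustrated with a short plain-Python sketch. It makes several assumptions that are not stated in the text: events are given as `(x, y, t, polarity)` tuples with polarity +1 or -1, each frame stores a signed per-pixel event count, and a trailing batch with fewer than the required number of events is dropped. The name `events_to_frames` and its parameters are hypothetical.

```python
def events_to_frames(events, width, height, events_per_frame):
    """Accumulate a stream of events into image-like frames.

    events: iterable of (x, y, t, polarity) tuples, polarity being +1 or -1.
    A frame is emitted each time events_per_frame events have been accumulated.
    """
    frames = []
    frame = [[0] * width for _ in range(height)]
    count = 0
    for x, y, _t, polarity in events:
        # Signed count per pixel: +1 for brightness increase, -1 for decrease.
        frame[y][x] += 1 if polarity > 0 else -1
        count += 1
        if count == events_per_frame:
            frames.append(frame)
            frame = [[0] * width for _ in range(height)]
            count = 0
    # Any incomplete trailing batch is discarded in this sketch.
    return frames
```

Fixing the number of events rather than the time window means that frames are produced more often when the scene changes quickly and less often when it is static.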

We have collected GB of raw driving data in an urban environment at day and night, respectively. We have processed GB of data for our experimentation by converting the event data into an image representation and synchronizing it with the frame-based RGB and vehicle control data. The processed data consists of image pairs of event and frame-based RGB and steering angle information, split into training () and testing () sets. The split is made carefully: consecutive, non-overlapping sequences of data are used for the training and test sets.

4.1.2 Davis Driving Dataset (DDD)

The Davis Driving Dataset (DDD) is one of the most extensive datasets collected using a DAVIS camera, a dynamic and active pixel vision sensor hu2020ddd20. It provides a concurrent stream of events and active pixel sensor frames (standard grayscale images). The events represent brightness changes occurring at a particular moment and have a dynamic range of dB and an effective frame rate of . The DAVIS camera has a resolution of pixels. The dataset comprises DAVIS camera data (frames and events) and human vehicle control data in the form of speed and steering angle, collected over approximately hours of driving on urban and highway roads under different weather and illumination conditions. The total data size is approximately , from which we have sampled of data containing different weather and lighting conditions, as shown in Table-1. The consecutive and non-overlapping recordings are split into training and test sets. The data is pre-processed on the same principle as explained above. The events are accumulated over a time interval of msec and represented as an image. The training set contains image pairs and the test set .

Filename     Scene     Condition    T (s)   Size (GB)
1487354030   City      night,wet      377   3
1487354811   City      night,wet      190   1.4
1487417411   Freeway   day           2096   18.2
1487597945   City      evening         50   0.5
1487598202   freeway   day           1882   15.1
1487778564   campus    day            101   1.1
1487839456   City      day,sun        406   5.7
1487849151   town      day,sun        429   5.5
1487856408   town      day,sun        817   13.2
Table 1: The list of sample data included in training and testing from Davis driving data.

4.1.3 EventScape

EventScape is a large-scale synthetic dataset recorded through the CARLA simulator gehrig2021combining. An event camera sensor is implemented in the CARLA simulator dosovitskiy2017carla, which renders images by computing the per-pixel brightness change from the simulated environment. A pre-processing method similar to the one explained above is adopted. The event data for an approximate time interval of 2ms is rendered into a single image frame. The camera parameters are set to resolution and focal length of pixels. In addition, the simulated sensor can record various environmental conditions, such as different weather conditions and illumination variations. In this work, we have utilized the data published by gehrig2021combining. It consists of approximately hours of driving in different towns and weather conditions. The data contains event data frames, frame-based RGB images, and vehicle control data. The training set consists of data collected from Town-, , and , while validation and test data are collected from Town-. The training set contains image pairs, and the test set has .

Figure 4: The additive attention layer mechanism, which takes one feature set from each encoder as input and computes the fusion of the features.

4.2 Baseline

To gain deeper insight into our proposed network, which fuses the event and frame-based RGB data by employing a self-attention layer in the encoder, we investigate the effect of the self-attention layer on the fusion by performing experiments using a No-attention layer and an additive attention layer bahdanau2014neural. The No-attention layer is simply the element-wise addition of the feature maps obtained from the event encoder and the RGB encoder, as shown by Eq.(7).


The additive attention layer utilizes an additive attention mechanism to fuse data between the event and RGB encoders. Fig.4 illustrates the working principle of additive attention. The attention coefficients identify the salient regions in the frame-based RGB features and prune them so that only the relevant features remain, which are then fused with the event data by element-wise multiplication. The output of the additive attention layer is . The is given by Eq.(9) bahdanau2014neural.



characterised sigmoid activation function.

represents the set of learnable parameters , , and bias term, . These parameters are trained with standard back-propagation.
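The two baseline fusion layers described above can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `AdditiveAttentionFusion`, the 1x1 convolutions used as projections, the `tanh` non-linearity, the `hidden` width, and the reading that the gated RGB features are multiplied element-wise with the event features are all assumptions made for this example.

```python
from typing import Tuple

import torch
import torch.nn as nn


def no_attention_fusion(rgb_feat: torch.Tensor, event_feat: torch.Tensor) -> torch.Tensor:
    """No-attention baseline: element-wise addition of the two feature maps."""
    return rgb_feat + event_feat


class AdditiveAttentionFusion(nn.Module):
    """Additive-attention gate over two same-shaped feature maps (illustrative)."""

    def __init__(self, channels: int, hidden: int) -> None:
        super().__init__()
        # 1x1 convolutions act as per-location linear projections (assumed form).
        self.proj_rgb = nn.Conv2d(channels, hidden, kernel_size=1, bias=False)
        self.proj_event = nn.Conv2d(channels, hidden, kernel_size=1, bias=True)
        self.score = nn.Conv2d(hidden, 1, kernel_size=1, bias=True)

    def forward(
        self, rgb_feat: torch.Tensor, event_feat: torch.Tensor
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Additive scoring: project both inputs, sum them, apply a non-linearity.
        energy = torch.tanh(self.proj_rgb(rgb_feat) + self.proj_event(event_feat))
        # The sigmoid yields one attention coefficient per spatial location.
        alpha = torch.sigmoid(self.score(energy))  # shape (B, 1, H, W)
        # Gate the RGB features, then fuse with the event features by
        # element-wise multiplication (one reading of the text above).
        fused = (alpha * rgb_feat) * event_feat
        return fused, alpha
```

Both functions expect feature maps of identical shape `(B, C, H, W)`, one from each encoder. All weights in the attention module are ordinary `nn.Module` parameters, so they are trained with standard back-propagation.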

Figure 5: Visualization of the predicted steering angle compared with the ground truth on (a) our collected dataset, (b) the Davis driving dataset and (c) the Carla EventScape dataset.
Figure 6: Quantitative comparison on test data between the proposed self-attention DRFuser and DRFuser with the No-attention and additive attention layers. Steering angle prediction on (a) our collected dataset, (b) the Davis driving dataset and (c) the Carla EventScape dataset.
Figure 7: Quantitative comparison between frame-based RGB and event data fusion using the proposed self-attention DRFuser (ResNet-50) and ResNet-50 trained on a single modality. Steering angle prediction on (a) our collected dataset, (b) the Davis driving dataset and (c) the Carla EventScape dataset.

4.3 Training Details

We implemented the proposed network, DRFuser, in PyTorch 1.8.0 with CUDA 11.1. DRFuser is trained on a PC with Intel i core and RTX graphic card. Because the graphics card memory limits the batch size, the batch size was adjusted for each network; the minimum batch size used was 2. The network is trained end-to-end, starting from ImageNet pre-trained ResNet weights. The frame-based RGB and event inputs to the network have identical image sizes. We use the AdamW optimizer for training. The weight decay, learning rate and momentum are set to , and , respectively. The network is trained for 100 epochs.

Two evaluation metrics, root-mean-square error (RMSE) and mean absolute error (MAE), are used to determine the effectiveness of the proposed network, as shown by Eq.(11) and Eq.(10).
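For reference, the two metrics can be computed as in the short sketch below. It uses the standard definitions, RMSE as the square root of the mean squared difference between predicted and ground-truth steering angles and MAE as the mean absolute difference. The function names `rmse` and `mae` are chosen for this example only.

```python
import math
from typing import Sequence


def _check(pred: Sequence[float], target: Sequence[float]) -> None:
    # Both sequences must be non-empty and aligned sample by sample.
    if len(pred) == 0 or len(pred) != len(target):
        raise ValueError("pred and target must be non-empty and of equal length")


def rmse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Root-mean-square error between predicted and ground-truth angles."""
    _check(pred, target)
    squared = [(p - t) ** 2 for p, t in zip(pred, target)]
    return math.sqrt(sum(squared) / len(squared))


def mae(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error between predicted and ground-truth angles."""
    _check(pred, target)
    absolute = [abs(p - t) for p, t in zip(pred, target)]
    return sum(absolute) / len(absolute)
```

Because RMSE squares each error before averaging, it penalizes occasional large steering errors more heavily than MAE, which is why the two are usually reported together.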

Model | Frame-based RGB | Event Data | RMSE Score | MAE Score
ResNet-34 | ✓ | ✗ | 0.2396 | 0.1295
ResNet-34 | ✗ | ✓ | 0.3234 | 0.1913
ResNet-50 | ✓ | ✗ | 0.2046 | 0.1143
ResNet-50 | ✗ | ✓ | 0.2782 | 0.1698
No attention (ResNet-34 backbone) | ✓ | ✓ | 0.1837 | 0.0963
No attention (ResNet-50 backbone) | ✓ | ✓ | 0.1773 | 0.0842
Additive attention (ResNet-34 backbone) | ✓ | ✓ | 0.1722 | 0.0808
Additive attention (ResNet-50 backbone) | ✓ | ✓ | 0.1612 | 0.0713
DRFuser (ResNet-34 backbone) | ✓ | ✓ | 0.1430 | 0.0540
DRFuser (ResNet-50 backbone) | ✓ | ✓ | 0.1266 | 0.0396
Table 2: Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) Scores for Our Collected Dataset for the proposed method and baseline methods.
Model | Frame-based RGB | Event Data | RMSE Score | MAE Score
ResNet-34 | ✓ | ✗ | 0.0293 | 0.0093
ResNet-34 | ✗ | ✓ | 0.0562 | 0.0134
ResNet-50 | ✓ | ✗ | 0.0281 | 0.0082
ResNet-50 | ✗ | ✓ | 0.0475 | 0.0122
No attention (ResNet-34 backbone) | ✓ | ✓ | 0.0191 | 0.0051
No attention (ResNet-50 backbone) | ✓ | ✓ | 0.0174 | 0.0049
Additive attention (ResNet-34 backbone) | ✓ | ✓ | 0.0169 | 0.00435
Additive attention (ResNet-50 backbone) | ✓ | ✓ | 0.0148 | 0.0040
DRFuser (ResNet-34 backbone) | ✓ | ✓ | 0.0125 | 0.00297
DRFuser (ResNet-50 backbone) | ✓ | ✓ | 0.0118 | 0.00214
Table 3: Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) Scores for EventScape Dataset for the proposed method and baseline methods.
Model | Frame-based RGB | Event Data | RMSE Score | MAE Score
ResNet-34 | ✓ | ✗ | 0.03363 | 0.01107
ResNet-34 | ✗ | ✓ | 0.03749 | 0.01312
ResNet-50 | ✓ | ✗ | 0.03242 | 0.01009
ResNet-50 | ✗ | ✓ | 0.03582 | 0.01211
No attention (ResNet-34 backbone) | ✓ | ✓ | 0.03023 | 0.00988
No attention (ResNet-50 backbone) | ✓ | ✓ | 0.02992 | 0.00934
Additive attention (ResNet-34 backbone) | ✓ | ✓ | 0.02355 | 0.00812
Additive attention (ResNet-50 backbone) | ✓ | ✓ | 0.02118 | 0.00798
DRFuser (ResNet-34 backbone) | ✓ | ✓ | 0.01654 | 0.00721
DRFuser (ResNet-50 backbone) | ✓ | ✓ | 0.01519 | 0.00631
Table 4: Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) Scores for Davis Driving Dataset (DDD) for the proposed method and baseline methods.
Model Frame-based RGB Event Data RMSE Score
Bojarski et al. bojarski2016end 0.1574287
CNN-LSTM xu2017end 0.1429425
Maqueda et al. ResNet50 maqueda2018event 0.0715585
Yuhuang et al. hu2020ddd20 0.0977384
Yuhuang et al. hu2020ddd20 0.11397
Yuhuang et al. hu2020ddd20 0.0720821
DRFuser self-attention ResNet-50 0.05192
Table 5: Comparison of the evaluation results of DRFuser with other state-of-the-art methods on the Davis Driving Dataset. All RMSE scores are in radians.
Layers | Input/Output channel size
Decoder with ResNet-50 encoder |
Convolution 2D | 2048/1024
BatchNormalization | 1024/1024
ReLU | 1024/1024
Convolution 2D | 1024/512
BatchNormalization | 512/512
Dropout | 512/512
ReLU | 512/512
Convolution 2D | 512/256
BatchNormalization | 256/256
ReLU | 256/256
Dropout | 256/256
Linear layer | 256*h*w/512
ReLU | 512/512
Linear layer | 512/1
Decoder with ResNet-34 encoder |
Convolution 2D | 512/256
BatchNormalization | 256/256
ReLU | 256/256
Dropout | 256/256
Linear layer | 256*h*w/512
ReLU | 512/512
Linear layer | 512/1
Table 6: The detailed framework of decoder of the proposed network DRFuser.
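As an illustration of how the layer list in Table 6 translates into code, the sketch below builds the decoder that follows the ResNet-34 encoder. The channel sizes come from the table; the class name `DecoderResNet34`, the 3x3 kernel with same-padding, and the dropout probability of 0.5 are assumptions, since the table does not state them. `h` and `w` are the spatial size of the encoder output.

```python
import torch
import torch.nn as nn


class DecoderResNet34(nn.Module):
    """Decoder for the ResNet-34 encoder, following the layer order of Table 6."""

    def __init__(self, h: int, w: int, kernel_size: int = 3, p_drop: float = 0.5) -> None:
        super().__init__()
        # Convolutional stage: 512 -> 256 channels; padding keeps (h, w) unchanged.
        self.features = nn.Sequential(
            nn.Conv2d(512, 256, kernel_size, padding=kernel_size // 2),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.Dropout(p_drop),
        )
        # Regression head: flatten 256*h*w features and map them to one angle.
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * h * w, 512),
            nn.ReLU(inplace=True),
            nn.Linear(512, 1),
        )

    def forward(self, fused_features: torch.Tensor) -> torch.Tensor:
        # fused_features: (B, 512, h, w) -> steering angle prediction: (B, 1)
        return self.head(self.features(fused_features))
```

The decoder for the ResNet-50 encoder would follow the same pattern with three convolutional stages (2048 to 1024, 1024 to 512, and 512 to 256 channels) before the same two linear layers.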

4.4 Results

The proposed method is evaluated quantitatively and qualitatively on our collected dataset, the DDD dataset, and the Carla EventScape dataset. For the quantitative evaluation, RMSE and MAE scores are computed for the proposed and baseline methods. Table-2 shows the RMSE and MAE scores for our collected dataset. Both RMSE and MAE are negative-oriented scores, which means lower values are better. In Table-2, the proposed method is quantitatively compared with the baselines, which include the additive and no-attention models. We evaluated each model with both the ResNet-34 and ResNet-50 backbones; the ResNet-50 backbone gives better results. In addition, steering angle prediction using a single modality is performed as a sanity check for the fusion of event and frame-based RGB data and to illustrate the efficacy of the proposed method in learning end-to-end lateral control. Moreover, the proposed network DRFuser is extensively evaluated on the Carla EventScape and DDD datasets. Table-3 and Table-4 show the quantitative evaluation of DRFuser, the baseline methods and the single-modality models on the Carla EventScape and DDD datasets, respectively. Note that, to the best of our knowledge, only the DDD dataset has been used in prior research on steering angle prediction. Table-5 shows the quantitative comparison between the proposed DRFuser and state-of-the-art methods. For a fuller comparison, Table-5 lists results for event and frame-based RGB data fusion as well as for single modalities. The focus of this paper is to explore the fusion of event and frame-based RGB data for learning lateral control in the form of the steering angle. In this setting, the proposed DRFuser achieves an RMSE score of 0.05192 (Table-5), which is lower than the RMSE scores reported for the state-of-the-art hu2020ddd20 method. Similarly, DRFuser outperforms the state-of-the-art single-modality scores.

To further evaluate the proposed method DRFuser, it is compared with the baselines on our collected dataset, the DDD dataset, and the Carla EventScape dataset, as illustrated in Fig.6. The graphs in Fig.6 show the prediction results for the proposed method and the baselines. The additive-attention baseline over-estimates the steering angle compared with the proposed DRFuser method. Similarly, to analyze the effect of fusion in comparison with single-modality prediction, a quantitative evaluation is performed, as shown in Fig.7. On all three datasets, the event camera data used as a standalone modality for steering angle prediction yields less promising results than either frame-based RGB data alone or DRFuser. The fusion results show better efficacy since the event camera provides complementary information to the frame-based RGB camera, as illustrated in Fig.7. Note that we compare only the ResNet-50 backbone results, as the proposed method has shown better efficacy with the ResNet-50 backbone. For the qualitative evaluation, we visualized the prediction results of the proposed DRFuser method for all three datasets, as illustrated in Fig.5. Fig.5 shows only a few frames for each dataset; however, the complete visualization is available at

4.5 Ablation Studies

To further investigate the efficacy of the proposed method, we expand our analysis by designing the early and late fusion approaches using the same modalities as utilized by the proposed method.

Figure 8: The visualization of prediction of steering angle in comparison of ground-truth, (a) shows collected dataset, (b) shows Davis driving data and (c) shows Carla EventScape dataset.
Figure 9: Comparison of early and late fusion approaches with the proposed method on our collected dataset.

4.5.1 Comparison of early and late fusion approaches with the proposed method

Multi-modal fusion approaches can be grouped into early, late, and intermediate fusion. In early fusion, the multi-modal data are fused at the input level, before the learning algorithm ramachandram2017deep. In late fusion, the features from the individual modalities are fused at the decision level. However, studies from neuroscience suggest that intermediate fusion could provide the necessary assistance in learning feature representations from multiple modalities schroeder2005multisensory macaluso2006multisensory. To evaluate the proposed method against these alternatives, early and late fusion approaches are designed using the frame-based RGB and event cameras as modalities. Fig.8(a) and Fig.8(b) illustrate the early and late fusion architectures, respectively. In the early fusion network, the frame-based RGB and event data are fused by stacking both modalities, and the encoder follows the same architecture as in the proposed network. RMSE and MAE are used as evaluation metrics to compare early fusion quantitatively with the proposed method. Fig.9, Fig.10 and Fig.11 illustrate the quantitative results of early fusion on our collected, DDD and EventScape datasets, respectively.

Figure 10: Comparison of early and late fusion approaches with the proposed method on DDD dataset.
Figure 11: Comparison of early and late fusion approaches with the proposed method on EventScape dataset.

In late fusion, each input modality is fed to its own unimodal learning architecture to obtain a feature representation. The features are then fused for the final decision of predicting the steering angle. In our experimental analysis, late fusion performs better than early fusion. This behavior can be attributed to CNN architectures having been carefully designed over the past few years for unimodal streams. However, the proposed method is better than both the early and late fusion approaches in terms of RMSE and MAE. Fig.9, Fig.10 and Fig.11 show the quantitative results of late fusion on the three datasets, respectively. Note that both the early and late fusion networks use the same decoder settings as the proposed method.
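The structural difference between the two ablation baselines can be sketched in PyTorch as follows. This is a simplified illustration, not the architecture used in the experiments: `tiny_encoder` is a hypothetical stand-in for the ResNet encoders, a single linear layer replaces the decoder, the 3-channel event frame is an assumption, and concatenation is assumed as the late-fusion operator.

```python
import torch
import torch.nn as nn


def tiny_encoder(in_channels: int, feat: int = 16) -> nn.Sequential:
    """Hypothetical stand-in for a ResNet encoder: image -> feature vector."""
    return nn.Sequential(
        nn.Conv2d(in_channels, feat, kernel_size=3, stride=2, padding=1),
        nn.ReLU(inplace=True),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
    )


class EarlyFusion(nn.Module):
    """Stack both modalities along the channel axis, then use one encoder."""

    def __init__(self, rgb_ch: int = 3, event_ch: int = 3, feat: int = 16) -> None:
        super().__init__()
        self.encoder = tiny_encoder(rgb_ch + event_ch, feat)
        self.head = nn.Linear(feat, 1)

    def forward(self, rgb: torch.Tensor, event: torch.Tensor) -> torch.Tensor:
        stacked = torch.cat([rgb, event], dim=1)  # fuse at the input level
        return self.head(self.encoder(stacked))


class LateFusion(nn.Module):
    """Encode each modality separately, then fuse features before the decision."""

    def __init__(self, rgb_ch: int = 3, event_ch: int = 3, feat: int = 16) -> None:
        super().__init__()
        self.rgb_encoder = tiny_encoder(rgb_ch, feat)
        self.event_encoder = tiny_encoder(event_ch, feat)
        self.head = nn.Linear(2 * feat, 1)

    def forward(self, rgb: torch.Tensor, event: torch.Tensor) -> torch.Tensor:
        # Each unimodal stream produces its own feature vector.
        features = torch.cat([self.rgb_encoder(rgb), self.event_encoder(event)], dim=1)
        return self.head(features)  # fuse at the decision level
```

Both models take the same pair of inputs and return one steering angle per sample, so they can be trained and scored with the same RMSE and MAE protocol as the proposed method.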

For early fusion of the frame-based RGB and event data on our collected dataset, both the ResNet-34 and ResNet-50 backbones are used for a fair quantitative comparison with the proposed method. Early fusion with ResNet-34 achieves an RMSE score of and an MAE score of , in contrast to the RMSE and MAE scores of and , respectively, for DRFuser with the ResNet-34 backbone. Similarly, the proposed method with the ResNet-50 backbone, with RMSE and MAE scores of and , surpasses early fusion with the ResNet-50 backbone, which has RMSE and MAE scores of and , respectively. Furthermore, the early fusion results on the DDD and EventScape datasets are also compared with the proposed method using both the ResNet-34 and ResNet-50 backbones, as illustrated in Fig.10 and Fig.11, respectively. For instance, early fusion with ResNet-34 on the DDD dataset gives an RMSE score of and an MAE score of ; in contrast, DRFuser with ResNet achieves RMSE and MAE scores of and , respectively. The proposed DRFuser with ResNet-50 also surpasses early fusion with ResNet-50 on the DDD dataset: early fusion has RMSE and MAE scores of and , respectively, in contrast to the DRFuser ResNet-50 RMSE score of and MAE score of . On the EventScape dataset, early fusion with the ResNet-34 backbone achieves RMSE and MAE scores of and , respectively. Similarly, with the ResNet-50 backbone, early fusion gives an RMSE score of and an MAE score of . With both backbones, these early fusion scores on the EventScape dataset are worse than the RMSE and MAE scores of DRFuser with the ResNet-34 and ResNet-50 backbones.

In terms of RMSE and MAE, late fusion performs slightly better than early fusion, but still worse than the proposed method. For our collected dataset, the late fusion approaches with ResNet-34 and ResNet-50 achieve RMSE scores of and , respectively, in contrast to the DRFuser RMSE scores of and with ResNet-34 and ResNet-50. Furthermore, the MAE scores of the proposed method with both the ResNet-34 and ResNet-50 backbones are better than those of the late fusion approaches. For the DDD dataset, the late fusion approach with the ResNet-34 backbone achieves an RMSE score of and an MAE score of , in contrast to DRFuser with the ResNet-34 backbone, which has RMSE and MAE scores of and , respectively. Similarly, for the ResNet-50 backbone, the proposed DRFuser has better RMSE and MAE scores of and than the late fusion approach with ResNet-50, which has RMSE and MAE scores of and , respectively. Furthermore, on the EventScape dataset, late fusion with ResNet-34 and ResNet-50 achieves RMSE scores of and , respectively. The MAE scores of late fusion with the ResNet-34 and ResNet-50 backbones are and . These RMSE and MAE scores illustrate the lower performance of late fusion compared to the proposed method.

5 Conclusion

This paper proposed DRFuser, a novel self-attention architecture that fuses frame-based RGB and event data. We have shown how deep neural networks benefit from the dynamic response of event cameras and accurately predict the vehicle steering angle under a wide range of conditions. We have also shown the efficacy of the self-attention layer in a deep neural network compared with element-wise addition of features and with additive attention for fusing the data from the two modalities. In addition, we have conducted an ablation study on the architecture configurations of early, dense and late fusion. The dense fusion of data is robust in learning data representations and gives accurate results. Our approach processes event data as an image representation so that a deep neural network can easily be applied. The experimental results show the robustness of DRFuser in comparison to other state-of-the-art methods.

In future work, we aim to exploit event data in rainy and foggy conditions and to improve the representation of event data to include dense feature representations.


This work was partly supported by Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.2014-3-00077, AI National Strategy Project) and the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2019R1A2C2087489), and Ministry of Culture, Sports and Tourism (MCST), and Korea Creative Content Agency (KOCCA) in the Culture Technology (CT) Research & Development (R2020070004) Program 2022.