
VM-MODNet: Vehicle Motion aware Moving Object Detection for Autonomous Driving

by   Hazem Rashed, et al.

Moving Object Detection (MOD) is a critical task in autonomous driving, as moving agents around the ego-vehicle need to be accurately detected for safe trajectory planning. It also enables appearance-agnostic detection of objects based on motion cues. Geometric challenges like motion-parallax ambiguity make it a difficult problem. In this work, we aim to leverage the vehicle motion information and feed it into the model to provide an adaptation mechanism based on ego-motion. The motivation is to enable the model to implicitly perform ego-motion compensation to improve performance. We convert the six degrees of freedom vehicle motion into a pixel-wise tensor which can be fed as input to the CNN model. The proposed model using the Vehicle Motion Tensor (VMT) achieves an absolute improvement of 5.6% in mIoU over the baseline architecture. We also achieve state-of-the-art results on the public KITTI_MoSeg_Extended dataset, even compared to methods which make use of LiDAR and additional input frames. Our model is also lightweight and runs at 85 fps on a TitanX GPU. Qualitative results are provided in



I Introduction

The Autonomous Driving (AD) environment is complex, as it includes moving objects that navigate in different ways [8, 5]. Thus, full perception of all the surrounding moving agents is necessary for effective motion planning of the autonomous vehicle. Motion in the image, captured by optical flow, is induced by the motion of other objects in the scene and by the ego-motion of the vehicle carrying the reference camera. Unlike systems where the reference camera is fixed, it is challenging to predict moving objects in automotive scenes as the ego-vehicle is constantly in motion.

Motion is a strong cue in automotive scenes. Moving objects like pedestrians and vehicles pose a higher risk and it is essential to detect them reliably. Motion cues can also be used to detect objects in an appearance-agnostic manner. For example, construction trucks and animals like moose, which are too rare to be learned from appearance cues, can alternatively be detected using motion cues. In addition, High Definition maps, which are a main source of information for autonomous driving, provide a reliable prior for detecting static objects [17]. Thus, it is more important to detect moving objects reliably.

Dominant approaches for moving object detection (MOD) use a combination of optical flow and RGB images to combine motion and appearance cues. The two inputs are combined through either early or mid-level fusion within a CNN model. These methods do not take advantage of known vehicle motion. Vehicle motion has six degrees of freedom, comprising three rotation angles and three translations. It can be obtained from a highly accurate inertial measurement unit (IMU) sensor. It can also be partially obtained from vehicle odometry sensors, which provide two translations and the yaw angle from the steering wheel. It can also be estimated using visual odometry or sensor-fused odometry.

Fig. 1: Our model predicts motion segmentation using RGB image, optical flow and vehicle motion as inputs.

In this work, we aim to leverage vehicle motion and use it explicitly within a CNN model as an inductive bias to improve accuracy. We propose and implement a vehicle motion aware network that is able to detect surrounding moving objects more accurately. We focus on the KITTI [7] dataset, which provides accurate IMU values for vehicle motion. The approach can also be used with the other estimates of vehicle motion mentioned in the previous paragraph. We convert the vehicle motion parameters into a pixel-wise Vehicle Motion Tensor (VMT) that is suitable for use within a CNN. We demonstrate significant improvements over the baseline using VMT. To summarize, the contributions of this work include:

  • Design and implementation of CNN based moving object detection utilizing vehicle motion.

  • State-of-the-art results on KITTI_MoSeg_Extended dataset with significantly faster runtime.

  • Ablation study of different fusion methodologies.

The paper is organized as follows. Section II reviews the related work in moving object detection for autonomous driving. Section III discusses the proposed architecture including details of modelling of vehicle motion tensor. Section IV describes the experimental setup and discusses quantitative and qualitative analysis. Finally, Section V provides concluding remarks.

II Related Work

Motion detection has been studied through classical approaches such as [14]. Recently, CNN-based approaches have provided better accuracy as they can encode global context. However, they require an extensive dataset with diverse moving objects. Optical flow has been used for generic foreground segmentation in [10]. Video object segmentation has been explored in [4, 21]; however, these models are computationally intensive. Siam et al. [20] explored motion segmentation using a CNN for the autonomous driving scenario and further improved it by using depth [19]. In [24], MOD has been explored on images captured by a wide-angle fisheye camera using the dataset of [25].

In addition to camera sensors, MOD has been explored using LiDAR sensors as well. The most common way is to predict moving points using geometric constraints and then perform clustering to obtain moving objects [2]. Using deep learning approaches, 3D convolution has been utilized to predict moving vehicles in [12]. In other approaches, the points were projected from 3D to 2D range images to make use of conventional 2D convolutions [11]. In FuseMODNet [16], optical flow was generated from both camera and LiDAR to obtain robust low-illumination detection. Some approaches predict MOD using two sequential LiDAR scans to implicitly learn motion without using optical flow [3].

III Proposed Method

In this section, we describe the proposed method including vehicle motion modelling and our network architecture.

III-A Baseline Architecture

We start with the two-stream, RGB-only motion segmentation network of OmniDet [18]. The network consists of two ResNet18 streams with shared weights and a motion segmentation decoder with deconv layers for upsampling to the higher resolution output. The architecture is simple and provides real-time performance. We train this model by feeding an RGB image at time $t$ into one stream and the previous image at time $t-1$ into the other stream. This model is reported as the baseline in the first row of Table I. To enable comparison of accuracy and inference speeds with previous methods [16, 15], we scale the input resolution to . Then we replace the previous frame with optical flow to enable better fusion with the Vehicle Motion Tensor. This two-stream RGB and optical flow model with shared weights acts as our baseline architecture.

III-B Vehicle Motion Tensor

Motivation: Motion segmentation is far more challenging in autonomous vehicles than with surveillance or traffic light cameras, which are fixed. The camera motion due to the motion of the vehicle induces optical flow in all the static objects, and it becomes difficult to separate them from moving objects. In addition, there are fundamental geometric challenges like motion-parallax ambiguity, which make it difficult to distinguish between a car moving in parallel and a static car.

In this work, we aim to design a vehicle motion aware network that utilizes ego-motion information to improve motion segmentation. Ego-motion compensation is commonly used in classical computer vision to subtract the motion fields caused by camera motion from the optical flow. It was explored for the motion segmentation task in SMSNet [23], where optical flow was compensated utilizing depth information. This requires a good depth estimation model, which is a complex task on its own, and it does not handle noise in ego-motion, as the compensation term is explicitly subtracted from the optical flow. In contrast, we want to be able to use even noisy ego-motion from vehicle odometry sensors without using depth.

Camera motion and vehicle motion are equivalent except for a possible coordinate system transformation. Camera motion information has six degrees of freedom. It is difficult to incorporate these six scalar values directly as input to a CNN model. Thus, we explore converting them to a pixel-wise tensor which is easy to fuse with other image planes. We make use of the concept of motion fields, which is closely related to optical flow. The motion field is the 2D vector field of velocities of image pixels induced by the relative motion between the observing camera and the 3D scene. It is the projection of 3D relative velocity vectors onto the image plane, whereas optical flow is the observed 2D displacement on the image plane, obtained without using 3D information. We propose to use a modified version of motion fields as a pixel-wise encoding of ego-motion in a scene-agnostic manner. We then feed this as an independent input plane so that the network can learn to combine it effectively with optical flow and image information. This design gives the network a loose coupling to ego-motion, so that, unlike with explicit ego-motion compensation, the estimation does not break down when the ego-motion is incorrect.

We provide a short overview of motion fields in this section; please refer to the textbook [22] for more details. We make use of a pinhole camera model for the KITTI images. They have a slight barrel distortion which can be rectified using the intrinsic parameters. The motion field induced by camera motion has two components, one for rotation and one for translation.

Fig. 2: Different representations of the Vehicle Motion Tensor (VMT) computed from the six degrees of freedom vehicle motion. (a) is the RGB image and (e) is the optical flow map. (b,c,d) correspond to VMT planes generated from each camera rotation angle around the x, y and z axes. (f,g,h) correspond to VMT planes generated from each camera translation along the x, y and z axes. All the images are normalized for better visualization.

III-B1 Rotation

Rotation of the camera can be described by three parameters, namely pitch, roll and yaw. The motion field induced by pure camera rotation is derived in [22] as shown below. It is important to observe that it is independent of scene structure.

$$
v_x^{R} = -\omega_y f + \omega_z y + \frac{\omega_x x y}{f} - \frac{\omega_y x^2}{f}, \qquad
v_y^{R} = \omega_x f - \omega_z x - \frac{\omega_y x y}{f} + \frac{\omega_x y^2}{f}
$$

where $(\omega_x, \omega_y, \omega_z)$ represent the camera rotation parameters, $(x, y)$ represent the pixel indices, $f$ is the focal length and $(v_x^{R}, v_y^{R})$ represent the motion vectors corresponding to camera motion in the image coordinate system. The visualization of this equation is illustrated in the first column of Fig. 2. Rows (b), (c) and (d) represent rotation around the x, y and z axes respectively. These motion vectors are converted to a colorwheel representation to enable reuse of CNN pre-trained weights, similar to how it was done for optical flow in [20]. To compute each component separately, we assume that the other components are equal to zero. The focal length is normalized to 1 and the output vectors are normalized from 0 to 255 for better visualization. The blue color represents pixels moving to the left, which indicates camera rotation to the right, while the red color indicates the opposite.
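The rotation planes of the VMT can be sketched in a few lines of NumPy. This is a minimal illustration that assumes the standard textbook form of the rotational motion field from [22]; the function names, the toy image size and the focal length in pixels are hypothetical and not taken from the paper.

```python
import numpy as np


def pixel_grid(height, width, focal_px):
    """Return image-plane coordinates (x, y) with the origin at the image centre.

    Coordinates are divided by the focal length in pixels, so the effective
    focal length of the returned grid is 1, as in the text.
    """
    u = np.arange(width, dtype=np.float64) - (width - 1) / 2.0
    v = np.arange(height, dtype=np.float64) - (height - 1) / 2.0
    x, y = np.meshgrid(u / focal_px, v / focal_px)  # both of shape (H, W)
    return x, y


def rotational_motion_field(x, y, omega, f=1.0):
    """Motion field induced by pure camera rotation omega = (wx, wy, wz).

    Assumed textbook form; note that no scene depth appears anywhere.
    """
    wx, wy, wz = omega
    vx = -wy * f + wz * y + (wx * x * y) / f - (wy * x ** 2) / f
    vy = wx * f - wz * x - (wy * x * y) / f + (wx * y ** 2) / f
    return np.stack([vx, vy], axis=0)  # shape (2, H, W)


# Toy example: pure yaw (rotation about the y-axis), other angles set to zero.
x, y = pixel_grid(height=4, width=6, focal_px=5.0)
yaw_field = rotational_motion_field(x, y, omega=(0.0, 0.01, 0.0))
```

With a positive yaw rate every horizontal component is negative, so all pixels move to the left, which matches the description of a camera rotating to the right.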

Camera rotation is mainly dominated by rotation around the y-axis due to the steering of the car. Hence, the final VMT is mainly dominated by rotation around the y-axis, as it has larger values than rotation around the x and z axes. This is demonstrated in Fig. 3, which shows an example from the KITTI dataset where the car is being steered to the right.
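The conversion of motion vectors into a three-channel image can be illustrated as follows. This sketch uses a simplified HSV mapping (hue from direction, brightness from magnitude) rather than the exact colorwheel of [20], so the precise hues are an assumption; `flow_to_color` is a hypothetical helper name.

```python
import colorsys
import math

import numpy as np


def flow_to_color(field):
    """Encode a (2, H, W) motion field as an (H, W, 3) uint8 image.

    Hue encodes direction and brightness encodes magnitude, scaled so that
    the largest vector in the field maps to full brightness (0 to 255).
    """
    vx, vy = field
    magnitude = np.hypot(vx, vy)
    max_mag = float(magnitude.max())
    height, width = vx.shape
    image = np.zeros((height, width, 3), dtype=np.uint8)
    if max_mag == 0.0:
        return image  # no motion: all-black plane
    for r in range(height):
        for c in range(width):
            angle = math.atan2(float(vy[r, c]), float(vx[r, c]))
            hue = (angle / (2.0 * math.pi)) % 1.0
            value = float(magnitude[r, c]) / max_mag
            rgb = colorsys.hsv_to_rgb(hue, 1.0, value)
            image[r, c] = [int(round(255 * ch)) for ch in rgb]
    return image


# One pixel moving right and one pixel moving left, both at full magnitude.
field = np.array([[[1.0, -1.0]], [[0.0, 0.0]]])
colors = flow_to_color(field)
```

In this simplified mapping rightward motion comes out red and leftward motion comes out cyan, which is close to the red and blue shades described for Fig. 2.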


Fig. 3: Example from KITTI dataset when the camera is dominated by rotation around y-axis. From top to bottom: RGB image, Optical Flow and Vehicle Motion Tensor.

III-B2 Translation

Fig. 4: Illustration of our VM-MODNet architecture. The Vehicle Motion Tensor (VMT) is generated from the vehicle's differential 3D translation and differential 3D rotation. Optical flow is generated using [9], but it would be obtained for free on typical automotive hardware. A ResNet18 encoder is used as the backbone and deconv layers are used in the decoder.

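Camera translation can be described by three parameters along the three axes in 3D space. The motion field induced by pure translation is derived in [22] as shown below.

$$
v_x^{T} = \frac{T_z x - T_x f}{Z}, \qquad v_y^{T} = \frac{T_z y - T_y f}{Z}
$$

where $(T_x, T_y, T_z)$ denote the camera translation parameters, $(x, y)$ are the pixel indices, $f$ is the focal length and $Z$ indicates the scene's depth.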

Unlike the rotational motion field, the translational motion field has a scene-dependent depth component. Our objective is to avoid using an additional depth measurement sensor or a more complex depth estimation network that predicts a pixel-wise depth map of the scene. To simplify, we assume a virtual plane of constant depth parallel to the image plane and use the motion field induced by it. The resulting motion fields are shown in Fig. 2 (f,g,h). Fixed depth results in a constant VMT in the x and y directions, which does not add significant information to the network. However, forward motion along the z direction, shown in Fig. 2 (h), captures useful information across the tensor. In this case, pixels on the right side move to the right (shown in shades of red) and pixels on the left move to the left (shown in shades of blue).
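The effect of the constant-depth assumption can be checked with a short sketch. It assumes the standard textbook form of the translational motion field from [22]; the plane depth, the grid size and the function name are hypothetical choices made only for illustration.

```python
import numpy as np


def translational_motion_field(x, y, translation, depth, f=1.0):
    """Motion field induced by pure camera translation (Tx, Ty, Tz).

    `depth` is a scalar: the constant-depth virtual plane assumed in the
    text, so no per-pixel depth map is needed.
    """
    tx, ty, tz = translation
    vx = (tz * x - tx * f) / depth
    vy = (tz * y - ty * f) / depth
    return np.stack([vx, vy], axis=0)  # shape (2, H, W)


# Normalised image coordinates with the origin at the image centre.
xs = np.linspace(-1.0, 1.0, 5)
ys = np.linspace(-0.5, 0.5, 3)
x, y = np.meshgrid(xs, ys)

DEPTH = 10.0  # hypothetical plane depth; the source does not state a value
sideways = translational_motion_field(x, y, (1.0, 0.0, 0.0), DEPTH)
forward = translational_motion_field(x, y, (0.0, 0.0, 1.0), DEPTH)
```

Here `sideways` is the same vector at every pixel, so it carries no spatial information, while `forward` points left on the left half of the image and right on the right half, as described for Fig. 2 (h).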

III-C Fusion Architectures

We use the two-stream RGB and optical flow mid-fusion model as the baseline architecture. We aim to keep the RGB encoder unaffected, as it will be shared with other tasks in a multi-task setting. Thus we focus on different ways to fuse the Vehicle Motion Tensor (VMT) with optical flow. Specifically, we evaluate early fusion, mid-fusion and multi-scale fusion of VMT. We also explore weight sharing to reduce the number of parameters in the network, which simplifies training and reduces the model footprint. Table I summarizes the results of the different fusion architectures we evaluated.

III-C1 Early Fusion Architecture

In early fusion, input modalities are fused before being processed by the network. In this work, we concatenate VMT and optical flow and then feed the concatenated tensor into the encoder. We adapt the encoder's first layer to accept an input of six channels and the corresponding weights are randomly initialized. For the rest of the encoder, ResNet18 pre-trained weights are used to initialize the training process. The output feature maps from this encoder are then concatenated with the RGB encoder features and fed to the decoder. The architecture is shown in Fig. 4.
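The input side of early fusion can be sketched with NumPy alone. The array sizes are toy values, and the He-style scaling of the random first-layer weights is an assumption, since the text only says that these weights are randomly initialized.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative inputs: colour-coded optical flow and VMT, channels first.
H, W = 64, 128  # toy resolution, not the one used in the paper
optical_flow = rng.random((3, H, W), dtype=np.float32)
vmt = rng.random((3, H, W), dtype=np.float32)

# Early fusion: stack the two 3-channel planes into one 6-channel input.
fused_input = np.concatenate([optical_flow, vmt], axis=0)

# ResNet18's first layer is a 7x7 convolution with 64 filters over 3 channels.
# With 6 input channels its weight tensor must have shape (64, 6, 7, 7).
# These weights are drawn at random (the scaling is an assumption), while the
# remaining encoder layers would keep their pre-trained values.
fan_in = 6 * 7 * 7
first_conv_weight = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(64, 6, 7, 7))
```

Only the first layer changes shape, which is why this variant keeps essentially the same complexity as the two-stream baseline.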

III-C2 Mid-fusion Architecture

In the mid-fusion architecture, a dedicated encoder is used for VMT and its feature map is concatenated with the RGB and optical flow encoder feature maps. This increases the complexity significantly and is not the preferred approach for deployment. However, we use it to understand the best possible performance. As expected, this model provides the best performance, as shown in Table I.

III-C3 Multi-scale Fusion Architecture

We also make use of the multi-scale fusion mechanism of CAM-Convs [6], which fuses camera calibration information into a depth estimation network. The authors show that this performs better than simple concatenation. In our case, we perform multi-scale fusion of VMT with the optical flow encoder. We resize VMT to five different resolutions and concatenate it to the corresponding feature map when feeding the decoder. Thus, it has slightly higher complexity than simple early fusion. However, there was a slight degradation in performance compared to early fusion, as reported in Table I. This is likely because optical flow and VMT are closely linked and the network is able to leverage ego-motion with simple concatenation.
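The multi-scale concatenation can be sketched as below. The resizing method is not stated in the text, so average pooling is used here as an assumption; the feature-map channel counts and strides follow the usual ResNet18 layout, and the arrays are zero-filled stand-ins rather than real encoder outputs.

```python
import numpy as np


def downsample(plane, factor):
    """Average-pool a (C, H, W) array by an integer factor (H, W divisible)."""
    c, h, w = plane.shape
    blocks = plane.reshape(c, h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(2, 4))


def multiscale_concat(feature_maps, vmt):
    """Concatenate a resized copy of the VMT to every encoder feature map."""
    fused = []
    for fmap in feature_maps:
        factor = vmt.shape[1] // fmap.shape[1]
        fused.append(np.concatenate([fmap, downsample(vmt, factor)], axis=0))
    return fused


H, W = 64, 128  # toy resolution
vmt = np.ones((3, H, W), dtype=np.float32)
# Stand-ins for ResNet18 feature maps at strides 2, 4, 8, 16 and 32.
channels_and_strides = [(64, 2), (64, 4), (128, 8), (256, 16), (512, 32)]
features = [
    np.zeros((ch, H // s, W // s), dtype=np.float32)
    for ch, s in channels_and_strides
]
fused = multiscale_concat(features, vmt)
```

Each fused map gains three extra channels, which is the source of the slightly higher decoder cost compared to early fusion.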

III-C4 Weight Sharing

We aim to design efficient models targeting deployment on memory-constrained automotive embedded platforms. Thus we explore weight sharing between encoders to minimize the model footprint stored in persistent memory. This matters even more given the large number of cameras (around 10) used in modern vehicles. We therefore perform an ablation study to understand the impact of weight sharing for both two-stream (RGB + optical flow) and three-stream (RGB + optical flow + VMT) architectures. Our results in Table I show that there is a significant degradation when using shared weights, probably due to the difference in modalities of the inputs.
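A rough back-of-the-envelope estimate of the encoder footprint illustrates what weight sharing saves. The parameter count is that of the standard torchvision ResNet18 without its final classification layer, and float32 storage is assumed; neither detail is stated in the paper, and the decoder is ignored.

```python
# Standard ResNet18 without the final fully connected layer (assumption:
# the encoders in the paper use this unmodified backbone).
RESNET18_ENCODER_PARAMS = 11_176_512
BYTES_PER_PARAM = 4  # float32 storage; quantised deployments would differ


def footprint_mb(num_encoders):
    """Encoder weight storage in megabytes (10**6 bytes)."""
    return num_encoders * RESNET18_ENCODER_PARAMS * BYTES_PER_PARAM / 1e6


shared = footprint_mb(1)        # one encoder reused by every stream
three_stream = footprint_mb(3)  # separate RGB, optical flow and VMT encoders
```

Under these assumptions a single shared encoder needs about 45 MB, while three separate encoders need about 134 MB.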

IV Experiments

In this section, we provide details of the experimental setup and analysis of results obtained for different architectures.

IV-A Dataset

There are only a few automotive datasets that provide moving object detection (MOD) annotation. Cityscapes [1] has been manually labelled for MOD by [23] for around 3k images. The dataset does not provide the vehicle motion information that is necessary for our experiments, and thus we could not use it. KITTI [7] is the most commonly used dataset for automated driving tasks. FuseMODNet [16] released an extended version of improved MOD annotations for 12.9k images. The dataset contains annotations for the vehicle class only. In this work, we make use of this dataset and report results on the same test set used by other recent methods to enable comparison.

IV-B Experimental Setup

We use ResNet18 as the backbone with the multiple architecture configurations explained in Section III. We initialize our network with ResNet18 pre-trained weights and set the batch size to 16. The network is trained using the Ranger (RAdam [13] + LookAhead [26]) optimizer. We train all the models using a weighted binary cross-entropy loss function for 60 epochs. For the early-fusion architecture, we adapt the first layer of the encoder to accept an input of 6 channels instead of 3, and we initialize the corresponding weights randomly. We use transposed convolution layers in the decoder to upsample progressively to the original input size.
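A weighted binary cross-entropy loss of the kind mentioned above can be sketched as follows. The weighting scheme and the weight value are assumptions, since the text does not give them; here the moving (positive) class is simply multiplied by a scalar `pos_weight`.

```python
import numpy as np


def weighted_bce(probs, targets, pos_weight, eps=1e-7):
    """Mean binary cross-entropy with the moving (positive) class up-weighted."""
    probs = np.clip(probs, eps, 1.0 - eps)  # avoid log(0)
    loss = -(pos_weight * targets * np.log(probs)
             + (1.0 - targets) * np.log(1.0 - probs))
    return float(loss.mean())


# Toy 2x2 mask: one moving pixel and three static pixels.
targets = np.array([[1.0, 0.0], [0.0, 0.0]])
probs = np.array([[0.5, 0.5], [0.5, 0.5]])
unweighted = weighted_bce(probs, targets, pos_weight=1.0)
weighted = weighted_bce(probs, targets, pos_weight=3.0)
```

Up-weighting the moving class is a common way to counter the imbalance between the few moving pixels and the many static ones in a driving scene.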

IV-C Results

Architecture Type                                      | Moving IoU | mIoU
RGB only architecture with shared weights
  RGB + RGB (prev)                                     | 40.5       | 69.85
RGB & OF Fusion architectures with shared weights
  {RGB + OF}                                           | 44.7       | 72
  {RGB + OF + VMT}                                     | 46.6       | 72.95
RGB & OF Fusion architectures without shared weights
  RGB + OF                                             | 49.3       | 74.3
  (RGB + OF) [+] VMT                                   | 51         | 75.15
  RGB + (OF x VMT)                                     | 51.4       | 75.4
  RGB + {OF + VMT(yaw-only)}                           | 52.9       | 76.2
  RGB + {OF + VMT}                                     | 53.6       | 76.5
  RGB + OF + VMT                                       | 55.6       | 77.6
TABLE I: Quantitative comparison of different architectures. OF is Optical Flow, VMT is Vehicle Motion Tensor, + is mid-fusion architecture, x is early-fusion via concatenation of inputs, [+] is multi-scale feature map concatenation [6], and {} refers to encoders with shared weights.
Fig. 5: Qualitative comparison on the KITTI_MoSeg_Extended dataset. (a) is the output of the baseline RGB + Optical Flow architecture, (b) is the proposed VM-MODNet output and (c) is the ground truth. Red boxes illustrate better detection of static vehicles in (b) compared to (a). The yellow box illustrates better detection of a far-away moving object in (b) compared to (a).

Table I summarizes the results for the different fusion architectures. The first model, RGB + RGB (prev), learns motion cues without the explicit use of optical flow. It uses a shared encoder with two sequential RGB frames. The shared encoder enables reuse of the previous frame's encoder output without having to compute it again, so only one encoder pass is needed in steady state. Thus it is the most efficient model. We obtain 40.5% IoU for moving objects, but using optical flow provides better performance, consistent with the previous literature [20, 16].
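The two metrics reported in Table I can be computed as in the sketch below. It assumes that mIoU is the mean of the moving-class and static-class IoU, which the text does not state explicitly; under that assumption, a moving IoU of 40.5 and an mIoU of 69.85 would imply a static IoU of about 99.2. The masks are toy data, and a real evaluation would typically accumulate counts over the whole test set.

```python
import numpy as np


def iou(pred, target, cls):
    """Intersection over union for one class of a binary mask."""
    p, t = (pred == cls), (target == cls)
    union = np.logical_or(p, t).sum()
    if union == 0:
        return float("nan")  # class absent from both masks
    return float(np.logical_and(p, t).sum() / union)


def moving_and_mean_iou(pred, target):
    """Return (moving IoU, mean IoU over the moving and static classes)."""
    moving = iou(pred, target, 1)
    static = iou(pred, target, 0)
    return moving, (moving + static) / 2.0


# Toy masks: 1 marks a moving pixel, 0 a static pixel.
target = np.array([[1, 1, 0, 0], [0, 0, 0, 0]])
pred = np.array([[1, 0, 0, 0], [0, 0, 0, 1]])
moving_iou, mean_iou = moving_and_mean_iou(pred, target)
```

Because static pixels dominate the image, the static IoU stays high even for weak models, which is why moving IoU is the more sensitive column of the table.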

Then we report the shared-weight two-stream {RGB+OF} and three-stream {RGB+OF+VMT} architectures. {RGB+OF} provides an increase of about 4% in moving IoU compared to the baseline, and {RGB+OF+VMT} provides an additional 2% improvement. The same models achieve much better performance without sharing weights. The three-stream RGB+OF+VMT achieves the best results, improving moving IoU by 11% (55.6 versus 44.7 for {RGB+OF}). However, it is computationally expensive, so we explore efficient two-stream versions. We also evaluate the RGB+{OF+VMT} model, where weight sharing is limited to the optical flow and VMT encoders. Interestingly, its performance is closer to the model with no weight sharing than to the model with shared weights. This illustrates the stronger relationship between OF and VMT.

RGB+(OFxVMT) is an efficient fusion model in which VMT is concatenated with optical flow. It improves moving IoU by only about 2% over RGB+OF, but it has the same complexity as the baseline model. We also evaluate multi-scale feature map fusion of VMT in the (RGB+OF)[+]VMT model, but it did not provide any improvement over simple concatenation; there was a slight degradation in accuracy. In future work, we aim to explore incorporating epipolar geometric constraints as an inductive bias to obtain the best performance in this efficient model.

We provide an ablation study on the effect of using only the ego-vehicle rotation around the y-axis (yaw or steering angle) in VMT. There are two reasons for evaluating this model, RGB+{OF+VMT(yaw-only)}. Firstly, yaw is the dominant rotation because of the steering of the vehicle. Secondly, it is available in all vehicles through the steering wheel angle sensor, without needing a more expensive IMU. There was only a slight degradation of 0.7% in moving IoU (52.9 versus 53.6) relative to using the full VMT.

Network Type                    | Moving IoU | mIoU  | fps
FuseMODNet (RGB + OF) [16]      | 49.36      | 74.24 | 25
FuseMODNet (RGB + LiDAR) [16]   | 51.46      | 75.3  | 18
RST-MODNet (LSTM) [15]          | 53.3       | 76.3  | 21
Ours (RGB + OF x VMT)           | 51.4       | 75.4  | 125
Ours (RGB + OF + VMT)           | 55.6       | 77.6  | 85
TABLE II: Quantitative comparison on the KITTI_MoSeg_Extended dataset.

Table II compares our model's performance with other methods. Our method achieves better performance than higher-complexity networks such as the multi-stage LSTM architecture [15] and the multi-sensor model [16], which fuses LiDAR data to improve motion segmentation results. Our run-time is also significantly better than that of these methods.

Fig. 5 shows qualitative results of the best model, RGB+OF+VMT. (a) shows the baseline results of the RGB + OF model. In some cases, we observe static cars that are incorrectly segmented as moving objects. Visually, we observe better results using VMT for the static cars indicated by red boxes in (b). Furthermore, higher accuracy is observed for the moving vehicles highlighted in yellow boxes. More qualitative results can be seen in the video provided in the abstract. The proposed method provides a significant improvement despite the fact that KITTI is dominated by forward motion without rotation.

V Conclusion

In this paper, we proposed a vehicle motion aware moving object detection network. We demonstrated significant improvements over the baseline by using vehicle motion on the KITTI_MoSeg_Extended dataset. The majority of this dataset consists of the vehicle driving in a straight line at a standard urban driving velocity, and we expect larger improvements on a dataset with more diverse ego-motion. We performed a comparative study of different network architectures and fusion mechanisms to find the best model, which achieves state-of-the-art results. Given the emergence of low-cost IMU sensors for commercial deployment, we hope that our work encourages further research in using vehicle motion for other tasks, including tracking and depth estimation.


References

  • [1] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3213–3223.
  • [2] A. Dewan, T. Caselitz, G. D. Tipaldi, and W. Burgard (2016) Motion-based detection and tracking in 3d lidar scans. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 4508–4513.
  • [3] A. Dewan, G. L. Oliveira, and W. Burgard (2017) Deep semantic classification for 3d lidar data. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
  • [4] B. Drayer and T. Brox (2016) Object detection, tracking, and motion segmentation for object-level video segmentation. arXiv preprint arXiv:1608.03066.
  • [5] C. Eising, J. Horgan, and S. Yogamani (2021) Near-field sensing architecture for low-speed vehicle automation using a surround-view fisheye camera system. arXiv preprint arXiv:2103.17001.
  • [6] J. M. Facil, B. Ummenhofer, H. Zhou, L. Montesano, T. Brox, and J. Civera (2019) CAM-convs: camera-aware multi-scale convolutions for single-view depth. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11826–11835.
  • [7] A. Geiger, P. Lenz, and R. Urtasun (2012) Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [8] J. Horgan, C. Hughes, J. McDonald, and S. Yogamani (2015) Vision-based driver assistance systems: survey, taxonomy and advances. In 2015 IEEE 18th International Conference on Intelligent Transportation Systems, pp. 2032–2039.
  • [9] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox (2017) Flownet 2.0: evolution of optical flow estimation with deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2462–2470.
  • [10] S. D. Jain, B. Xiong, and K. Grauman (2017) Fusionseg: learning to combine motion and appearance for fully automatic segmentation of generic objects in videos. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2117–2126.
  • [11] B. Li, T. Zhang, and T. Xia (2016) Vehicle detection from 3d lidar using fully convolutional network. In Robotics: Science and Systems.
  • [12] B. Li (2017) 3d fully convolutional network for vehicle detection in point cloud. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1513–1518.
  • [13] L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and J. Han (2020) On the variance of the adaptive learning rate and beyond. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020.
  • [14] M. Menze and A. Geiger (2015) Object scene flow for autonomous vehicles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3061–3070.
  • [15] M. Ramzy, H. Rashed, A. E. Sallab, and S. Yogamani (2019) Rst-modnet: real-time spatio-temporal moving object detection for autonomous driving. NeurIPS Workshop on Autonomous Driving.
  • [16] H. Rashed, M. Ramzy, V. Vaquero, A. El Sallab, et al. (2019) Fusemodnet: real-time camera and lidar based moving object detection for robust low-light autonomous driving. In Proc. of the IEEE International Conference on Computer Vision Workshops.
  • [17] B. Ravi Kiran, L. Roldao, B. Irastorza, R. Verastegui, S. Suss, S. Yogamani, V. Talpaert, et al. (2018) Real-time dynamic object detection for autonomous driving using prior 3d-maps. In Proceedings of the European Conference on Computer Vision (ECCV).
  • [18] V. Ravikumar, S. Yogamani, H. Rashed, G. Sistu, C. Witt, I. Leang, S. Milz, and P. Mader (2021) OmniDet: Surround View Cameras based Multi-task Visual Perception Network for Autonomous Driving. IEEE Robotics and Automation Letters.
  • [19] M. Siam, S. Eikerdawy, M. Gamal, M. Abdel-Razek, M. Jagersand, and H. Zhang (2018) Real-time segmentation with appearance, motion and geometry. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5793–5800.
  • [20] M. Siam, H. Mahgoub, M. Zahran, S. Yogamani, M. Jagersand, and A. El-Sallab (2018) MODNet: motion and appearance based moving object detection network for autonomous driving. In Proceedings of the 21st International Conference on Intelligent Transportation Systems (ITSC), pp. 2859–2864.
  • [21] P. Tokmakov, K. Alahari, and C. Schmid (2017) Learning video object segmentation with visual memory. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4481–4490.
  • [22] E. Trucco and A. Verri (1998) Introductory techniques for 3-d computer vision. Vol. 201, Prentice Hall Englewood Cliffs.
  • [23] J. Vertens, A. Valada, and W. Burgard (2017) Smsnet: semantic motion segmentation using deep convolutional neural networks. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 582–589.
  • [24] M. Yahiaoui, H. Rashed, L. Mariotti, G. Sistu, I. Clancy, et al. (2019) FisheyeMODNet: moving object detection on surround-view cameras for autonomous driving. Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops.
  • [25] S. Yogamani, C. Hughes, J. Horgan, G. Sistu, P. Varley, et al. (2019) WoodScape: a multi-task, multi-camera fisheye dataset for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9308–9318.
  • [26] M. R. Zhang, J. Lucas, J. Ba, and G. E. Hinton (2019) Lookahead optimizer: k steps forward, 1 step back. In Advances in Neural Information Processing Systems, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett (Eds.).