Input Dropout for Spatially Aligned Modalities

02/07/2020 ∙ by Sébastien de Blois, et al. ∙ 0

Computer vision datasets containing multiple modalities such as color, depth, and thermal properties are now commonly accessible and useful for solving a wide array of challenging tasks. However, deploying multi-sensor heads is not possible in many scenarios. As such, many practical solutions tend to be based on simpler sensors, mostly for cost, simplicity and robustness considerations. In this work, we propose a training methodology to take advantage of these additional modalities available in datasets, even if they are not available at test time. By assuming that the modalities have a strong spatial correlation, we propose Input Dropout, a simple technique that consists of stochastically hiding one or more input modalities at training time, while using only the canonical (e.g. RGB) modalities at test time. We demonstrate that Input Dropout trivially combines with existing deep convolutional architectures, and improves their performance on a wide range of computer vision tasks such as dehazing, 6-DOF object tracking, pedestrian detection and object classification.




1 Introduction

The use of deeper networks and data-hungry algorithms to solve challenging computer vision problems has created the need for ever richer datasets. In addition to common image datasets such as the famed ImageNet [6], datasets containing multiple modalities have also been collected to address a variety of problems ranging from depth estimation [21], indoor scene understanding [28], 6-DOF tracking [9], multispectral object detection [15], autonomous driving [10, 4] to haze removal [23], to name just a few. More generally, learning from multiple modalities has been explored to determine which ones are useful [27], and multiple ways of combining them have been proposed [11, 14, 20, 26].

Figure 1: The Input Dropout strategy for a given RGB image and additional modality (e.g. depth, in orange). At training time, the additional modality is concatenated to the RGB image and, with a given probability, either the RGB modality or the additional modality is set to 0 (black). At test time (middle), the additional modality is always unavailable (i.e., set to 0). In training, for the addit mode, only the two left cases are used, while for the both mode, all three cases are used.

Training deep learning models on additional modalities typically means that these extra modalities must also be available at test time. Unfortunately, capturing more modalities requires significant time and effort. Adding sensors alongside an RGB camera results in increased power consumption, less portable setups, the need to carefully calibrate and synchronize each sensor, as well as additional constraints on bandwidth and storage requirements. This may not be practical for multiple applications—including augmented reality, robotics, wearable and mobile computing, etc.—where these physical constraints preclude the use of additional sensors.

This dichotomy between the advantage brought by additional modalities and the impediment they impose on real systems has attracted attention in the literature. Can we train on additional modalities without relying on them at test time? In their “learning with privileged information” paper, Vapnik et al. [25] introduce a theoretical framework which shows that this may indeed be feasible. Practical techniques have since been introduced, but those tend to be tailored to specific network architectures and applications. For example, “modality hallucination” [13] and variants [7, 8] propose to train networks on different modalities independently, and show that changing the input modality of one of the networks while forcing the latent space to keep its former structure improves convergence. In [22], the authors propose to independently process multiple modalities in parallel branches within a network, and fuse the resulting feature maps using so-called “modality dropout” to make the network invariant to missing modalities. Despite improving performance, these methods are complex to implement, may require multiple training steps, and must be adapted differently to each problem.

In this paper, we propose a technique for exploiting additional modalities at training time, without having to rely on them at test time. In contrast to previous work which requires task-specific architectures [22] or multiple training passes [8], our approach is extremely simple (it can be implemented in a few lines of code), is independent of the learning architecture used, and does not require any additional training pass. Assuming that modalities are spatially aligned and share the same spatial resolution, we propose to randomly drop out [24] entire input modalities at training time. At test time, the missing modality is simply set to 0. We demonstrate that our proposed strategy, Input Dropout, can be leveraged to obtain gains of between 2% and 20% over training on RGB only, on a variety of applications.

2 Input Dropout


We assume that all input modalities are spatially aligned and can be represented as additional channels of the same input image. In our experiments, we also assume that the RGB modality is the only modality available at test time, therefore the other modality is never available during testing.


Our proposed Input Dropout strategy is illustrated in fig. 1. The additional modality is first channel-wise concatenated to the RGB image, and the resulting tensor is fed as input to the neural network. The first convolutional layer of the network must be adapted to this new input dimensionality (cf. sec. 4). At training time, one of the input modalities is randomly set to 0 with a given probability. This effectively “drops out” [24] the corresponding modality. At test time, the additional modality is always set to 0. Implementing Input Dropout requires a few lines of PyTorch code.

Since we assume a single additional modality is combined with an RGB image, we are faced with two options. We could randomly drop only the additional modality and always keep the RGB (we dub this option addit), or drop either the RGB or the additional modality (both). In these two cases, a uniform probability distribution over the different possible cases is used. For the addit mode, the probability of dropping the additional modality is set to 1/2. For the both mode, the probability of dropping either the RGB or the additional modality is 1/3 each.

Our method is mainly related to “modality dropout” [22], which fuses the modalities in a learned latent space. Their main limitation is that specialized network branches must be learned for each modality, which adds complexity. In contrast, our method can be used on existing convolutional architectures with very little change. We will compare to [22] in sec. 4.

3 Input Dropout for image dehazing

We first experiment with Input Dropout on single image dehazing [17, 29] with depth (RGB+D) as the additional modality available at training time only. For this, we employ the D-Hazy dataset [1], which contains 1449 pairs of RGB+D images where haze is synthetically added on images from the NYU Depth dataset [21]. We use 1180 images in training, 69 for validation, and 200 for test. Our model is similar to [17], the only difference being that the generator is a ResNet (with nine blocks) as in [16].

Similar to [17, 29], the network is trained on a weighted combination of a GAN loss, a pixel-wise loss, and a perceptual loss [16] to preserve the sharpness of the image. The loss weights are obtained with grid search on the validation set. At training time, Input Dropout uses the addit mode. Indeed, it does not make sense to drop the RGB image, since this would amount to recovering a haze-free image from a depth map alone. We also experiment on single image dehazing using segmentation (RGB+S) as an additional training modality. For this, we use the Foggy Cityscapes dataset [23], an extension of Cityscapes [4] which contains ground truth scene segmentations. The same network and training procedure are used.

Quantitative dehazing results with Input Dropout are provided in tab. 1, and corresponding representative qualitative results in fig. 2. Over the RGB-only baseline, relative improvements of 3.6% and 3.4% on PSNR and SSIM respectively are observed when using Input Dropout on RGB+D, and 4.5% PSNR and 2.2% SSIM for RGB+S. We also compare our method to competing techniques: “Dehazing for segmentation” (D4S) [5], which dehazes images so as to increase performance on a subsequent task using a modality only available during training, and Pix2Pix GAN [3], which employs an extra generator to generate the missing modality from the RGB image. In every case, Input Dropout performs better while being simpler than the other approaches. Note that we have not compared our approach to “modality distillation” [8] here since the method cannot be applied to this scenario. Indeed, it would involve training a network to dehaze a depth (or segmentation) image, which would require hallucinating scene contents.
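For readers unfamiliar with the PSNR metric used in this section, the sketch below follows the standard definition, PSNR = 10 log10(MAX² / MSE). The text does not state which intensity range or implementation was used for the reported numbers, so the `max_value=1.0` default (images scaled to [0, 1]) and the function name `psnr` are assumptions.

```python
import numpy as np


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    if reference.shape != estimate.shape:
        raise ValueError("images must have the same shape")

    # Mean squared error over all pixels and channels.
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))
```

Higher values indicate a dehazed image closer to the haze-free ground truth.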

RGB+D RGB+S
Method PSNR SSIM PSNR SSIM
RGB-only 17.61 0.74 23.55 0.91
D4S [5] 17.95 0.75 23.90 0.92
Pix2Pix GAN [3] 17.70 0.75 22.90 0.91
Input Dropout 18.24 0.76 24.60 0.93
Table 1: Quantitative results for single image dehazing using an additional depth (RGB+D) and segmentation (RGB+S) modality at training time. Results are reported on the D-Hazy dataset [1] for RGB+D and the Foggy Cityscapes dataset [23] for RGB+S. For each technique, the average over five different training runs is reported. In all scenarios, Input Dropout, despite its simplicity, is the technique that provides the largest improvement over the RGB-only baseline.
Figure 2: Qualitative examples for dehazing RGB images from (top row) D-Hazy [1] and (bottom row) Foggy Cityscapes [23, 4]. From left to right: hazy input, ground truth haze-free image, results when trained on RGB only, results with Input Dropout, and absolute difference between the 3rd and 4th columns, shown using a color map ranging from blue (low) to yellow (high).

4 Input Dropout for classification

We evaluate the use of Input Dropout for image classification using RGB+D training data. For this, we rely on the methodology proposed by Garcia et al. [8], who use the crops of individual objects from the NYU V2 dataset [21] adapted by [13] for object classification using RGB+D. We used the same split as in [8]: 4,600 RGB-D images in total, where around 50% are used for training and the remainder for testing. Here, we rely on a ResNet-34 [12], initialized with pretrained weights on ImageNet [6]. To adapt the pretrained ResNet-34 to use Input Dropout, we append additional channels to the filters of the first convolution layer and initialize the new weights randomly. Doing so preserves the pretrained weights for the RGB modality.
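The adaptation of the first convolution layer can be sketched as follows. This is a NumPy illustration, not the authors' implementation: the text only says the new weights are initialized randomly, so the zero-mean Gaussian with `std=0.01`, the function name `expand_first_conv`, and the (out_channels, in_channels, kH, kW) weight layout are assumptions.

```python
import numpy as np


def expand_first_conv(weight, extra_channels=1, std=0.01, rng=None):
    """Append input channels to pretrained first-layer filters.

    weight: pretrained filters of shape (out_channels, 3, kH, kW)
    Returns filters of shape (out_channels, 3 + extra_channels, kH, kW)
    whose first three input channels are the untouched pretrained weights.
    """
    rng = np.random.default_rng() if rng is None else rng
    out_channels, _, k_h, k_w = weight.shape

    # New filter slices for the additional modality (assumed Gaussian init).
    new_slices = rng.normal(0.0, std, size=(out_channels, extra_channels, k_h, k_w))
    new_slices = new_slices.astype(weight.dtype)

    # Concatenate along the input-channel axis; RGB weights are preserved.
    return np.concatenate([weight, new_slices], axis=1)
```

With a zeroed additional modality at test time, the appended slices contribute nothing to the first-layer response, so the layer then acts on the RGB channels only.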

Tab. 2 shows the quantitative classification accuracy obtained with the various methods. First, we report results when the depth modality is available at test time to provide an upper bound on performance. Next, we evaluate training a single network on the RGB modality only (“RGB-only”), the approach of [3] which relies on a GAN to hallucinate the depth at test time, and our Input Dropout strategy (in the addit mode). Our approach provides the best results, despite being the simplest.

We further compare to ensemble methods. First, two networks trained on RGB only, with their answers averaged before the argmax, yield an absolute performance improvement of 3.2% over the single-network baseline. The “modality distillation” approach of Garcia et al. [8] relies on a combination of two networks: one trained on RGB only, and another, so-called “hallucination” network. That second network is trained to produce a latent representation that is similar to a proxy network trained on the depth modality only. The final output is the mean of the RGB-only and the hallucination network. We reimplemented their approach in PyTorch to ensure direct comparison with our results, which yields a 1.8% absolute improvement over the RGB+RGB baseline.

We directly compare our technique to “modality distillation” [8] by using one network trained on RGB only, and another network trained on RGB+D with Input Dropout (instead of their “hallucination” network). This yields approximately the same performance as “modality distillation” [8], despite being much simpler to train, requiring a single network architecture and a single training pass (i.e. both networks can be trained in parallel, while they must be trained sequentially for [8]).
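The output-averaging step shared by the ensembles above can be sketched in a few lines. The function name `ensemble_predict` and the use of plain Python lists of per-class scores are assumptions made for illustration.

```python
def ensemble_predict(scores_a, scores_b):
    """Average the class scores of two networks, then take the argmax."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both networks must score the same set of classes")

    # Element-wise mean of the two score vectors.
    mean_scores = [(a + b) / 2.0 for a, b in zip(scores_a, scores_b)]

    # Index of the highest averaged score (first index wins on ties).
    return max(range(len(mean_scores)), key=mean_scores.__getitem__)
```

For example, with scores `[0.5, 0.3, 0.2]` and `[0.1, 0.2, 0.7]`, the two networks disagree individually (classes 0 and 2), and the averaged scores select class 2.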

Method Ensemble Accuracy
RGB+D No 58.9%
Depth (D) only No 57.0%
RGB only No 47.5%
Pix2Pix GAN [3] No 48.2%
ModDrop [22] No 44.3%
Input Dropout No 49.5%
RGB+RGB Yes 50.7%
Mod. distillation [8] Yes 52.5%
Input Dropout + RGB Yes 52.7%
Table 2: Classification accuracies on the NYU V2 dataset adapted by [13]. Results are the average over five different training runs.

5 Other applications

We evaluate Input Dropout (in the both mode) on two additional applications: tracking in RGB+D and pedestrian detection in RGB+thermal.

5.1 3D object tracking with RGB+D

We first focus on the problem of tracking 3D objects in 6 degrees of freedom (DOF). To do so, we employ the methodology of Garon et al. [9], who presented a technique for tracking a known 3D object in real-time using synthetic RGB+D data. They also provide an evaluation dataset containing 297 real sequences captured with a Kinect V2 with ground truth annotations of the 6-DOF poses of 11 different objects.

Here, we focus on the “occlusion” scenario proposed by [9] where the objects are rotated on a turntable while being partially hidden by a planar occluder with (measured) occlusion varying from 0% to 75%. We evaluate Input Dropout using the same CNN architecture as in [9]. For a given pose $P = [R \mid t]$, where $t$ is the translation and $R$ the rotation matrix, the translation error is defined by the norm of the difference between the two translations, and the rotation matrix distance is computed with:

$$\delta_R(R_1, R_2) = \arccos\left(\frac{\mathrm{Tr}(R_1^\top R_2) - 1}{2}\right)$$

where $\mathrm{Tr}$ is the matrix trace [9].
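A common trace-based distance between two rotation matrices is the geodesic angle arccos((Tr(R1ᵀR2) − 1) / 2). Assuming this is the form intended here, and assuming a Euclidean norm for the translation error (the text only says "its norm"), both errors can be sketched in pure Python. The function names are hypothetical.

```python
import math


def rotation_distance_deg(r1, r2):
    """Angle in degrees between two 3x3 rotation matrices (nested lists)."""
    # Tr(R1^T R2) equals the sum of element-wise products of R1 and R2.
    trace = sum(r1[i][j] * r2[i][j] for i in range(3) for j in range(3))

    # Clamp to [-1, 1] to guard against rounding just outside acos' domain.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def translation_error(t1, t2):
    """Euclidean distance between two translation vectors (assumed L2 norm)."""
    return math.dist(t1, t2)
```

Identical rotations give a distance of 0 degrees, and a quarter turn about any axis gives 90 degrees.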

Quantitative 6-DOF tracking results are reported in tab. 3. We observe that Input Dropout generally improves the results for the tracking task in translation with a relative gain as high as 33.3% in the hardest sequences, and an average of 17% relative gain in rotation. The error reported is the average of 5 training runs for each method.

Translation (mm) Rotation (degrees)
Occlusion % 0–30 45–75 0–30 45–75
RGB-only 22.3 43.2 10.2 24.6
Input Dropout 22.8 28.8 8.1 21.3
Relative gain -2.2% 33.3% 20.5% 13.4%
Table 3: Tracking error in translation and rotation with respect to the ratio of occlusion from the dataset of Garon et al. [9]. We observe that Input Dropout improves results significantly in most scenarios. In translation, the error with Input Dropout stabilizes after 45% occlusion, and the average relative gain in rotation is approximately 17%.
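The relative gains in tab. 3 are consistent with the usual definition for error metrics, (baseline − ours) / baseline. The helper below is a small sketch of that arithmetic; the name `relative_gain` and the `higher_is_better` flag (for metrics such as accuracy, where the sign flips) are assumptions.

```python
def relative_gain(baseline, ours, higher_is_better=False):
    """Relative gain over a baseline, in percent.

    For errors (lower is better) a positive gain means `ours` is smaller;
    for scores (higher is better) a positive gain means `ours` is larger.
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    diff = (ours - baseline) if higher_is_better else (baseline - ours)
    return 100.0 * diff / baseline
```

Applied to the rounded translation errors at 45–75% occlusion (43.2 mm and 28.8 mm), this gives a gain of about 33.3%, while the 0–30% case (22.3 mm and 22.8 mm) gives a slightly negative gain of about −2.2%.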

5.2 Pedestrian detection with RGB+T

We experiment with pedestrian detection on RGB+T (thermal) images using the KAIST Multispectral pedestrian dataset [15]. The training/validation/test sets are composed of 16,000/1,100/3,500 pairs of thermal/visible images for nighttime and 32,000/1,500/8,500 for daytime.

Here, we rely on RetinaNet [19], which is a state-of-the-art architecture for object detection. The RetinaNet is trained with a focal loss using a ResNet-34 [12] and a Feature Pyramid Network (FPN) [18] as backbone for feature extraction. The RetinaNet is initialized with pretrained weights on ImageNet [6]. As in sec. 4, additional channels are appended to the filters of the first convolutional layer to preserve the learned weights on RGB. The network is then fine-tuned on the KAIST images until convergence on the validation set.

To evaluate performance, we compute the mean average precision (mAP) with an intersection-over-union (IoU) threshold of 0.5. Tab. 4 shows the results of the experiments in both night- and daytime scenarios. We observe that our Input Dropout strategy yields improvements in all cases: nighttime RGB pedestrian detection improves by 18.9%, and daytime RGB pedestrian detection improves by 15.1%.
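The IoU criterion used to decide whether a detection matches a ground-truth box can be sketched as follows. The `(x1, y1, x2, y2)` corner format and the function name `iou` are assumptions; the KAIST annotation format may differ.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Width and height are clamped to zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Under a 0.5 threshold, a detection would count as a true positive only if its IoU with a ground-truth pedestrian box is at least 0.5.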

Method Nighttime Daytime
RGB-only 0.228 0.351
Input Dropout 0.271 0.404
Relative gain 18.9% 15.1%
Table 4: Mean average precision (mAP) at an IoU threshold of 0.5 for nighttime and daytime pedestrian detection on RGB+T, with and without Input Dropout. Only RGB is available at test time. The RGB-only row trains on the test modality only, while Input Dropout uses both modalities at training time. Results are the average over five different training runs. The last row is the relative performance gain resulting from using Input Dropout.

6 Discussion

We propose Input Dropout as a simple and effective strategy for leveraging additional modalities at training time which are not available at test time. We extensively test our technique in several applications—including single image dehazing, object classification, 3D object tracking, and object detection—on several additional modalities—including depth, segmentation maps, and thermal images. In all cases, using Input Dropout in training yields improved performance at test time, even if the additional modality is unavailable. Our approach, which can be implemented in a few lines of code only, can be used as a drop-in replacement with no change to the network architecture, aside from the addition of one extra input dimension to the first layer filters. The main limitation of our approach is that we have experimented on adding only a single additional modality to the RGB baseline. In the future, we plan on exploring the applicability of the approach with more modalities.


  • [1] C. Ancuti, C. O. Ancuti, and C. De Vleeschouwer (2016) D-Hazy: a dataset to evaluate quantitatively dehazing algorithms. In ICIP.
  • [2] C. Ancuti, C. O. Ancuti, and R. Timofte (2018) NTIRE 2018 challenge on image dehazing: methods and results. In CVPR Workshops.
  • [3] B. Bischke, P. Helber, F. Koenig, D. Borth, and A. Dengel (2018) Overcoming missing and incomplete modalities with generative adversarial networks for building footprint segmentation. In CBMI, pp. 1–6.
  • [4] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The Cityscapes dataset for semantic urban scene understanding. In CVPR.
  • [5] S. de Blois, I. Hedhli, and C. Gagné (2019) Learning of image dehazing models for segmentation tasks. In EUSIPCO, A Coruña, Spain.
  • [6] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In CVPR.
  • [7] N. C. Garcia, P. Morerio, and V. Murino (2018) Modality distillation with multiple stream networks for action recognition. In ECCV.
  • [8] N. C. Garcia, P. Morerio, and V. Murino (2019) Learning with privileged information via adversarial discriminative modality distillation. TPAMI.
  • [9] M. Garon, D. Laurendeau, and J. Lalonde (2018) A framework for evaluating 6-DOF object trackers. In ECCV.
  • [10] A. Geiger, P. Lenz, and R. Urtasun (2012) Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR.
  • [11] H. Gunes and M. Piccardi (2005) Affect recognition from face and body: early fusion vs. late fusion. In IEEE International Conference on Systems, Man and Cybernetics.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR.
  • [13] J. Hoffman, S. Gupta, and T. Darrell (2016) Learning with side information through modality hallucination. In CVPR.
  • [14] J. Hu, W. Zheng, J. Lai, and J. Zhang (2015) Jointly learning heterogeneous features for RGB-D activity recognition. In CVPR.
  • [15] S. Hwang, J. Park, N. Kim, Y. Choi, and I. S. Kweon (2015) Multispectral pedestrian detection: benchmark dataset and baseline. In CVPR.
  • [16] J. Johnson, A. Alahi, and L. Fei-Fei (2016) Perceptual losses for real-time style transfer and super-resolution. In ECCV.
  • [17] R. Li, J. Pan, Z. Li, and J. Tang (2018) Single image dehazing via conditional generative adversarial network. In CVPR.
  • [18] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In CVPR.
  • [19] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In ICCV.
  • [20] O. Mees, A. Eitel, and W. Burgard (2016) Choosing smartly: adaptive multimodal fusion for object detection in changing environments. In IROS.
  • [21] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus (2012) Indoor segmentation and support inference from RGBD images. In ECCV.
  • [22] N. Neverova, C. Wolf, G. Taylor, and F. Nebout (2016) ModDrop: adaptive multi-modal gesture recognition. TPAMI 38 (8), pp. 1692–1706.
  • [23] C. Sakaridis, D. Dai, and L. Van Gool (2018) Semantic foggy scene understanding with synthetic data. IJCV 126 (9), pp. 973–992.
  • [24] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, pp. 1929–1958.
  • [25] V. Vapnik and A. Vashist (2009) A new learning paradigm: learning using privileged information. Neural Networks 22 (5-6), pp. 544–557.
  • [26] J. Wagner, V. Fischer, M. Herman, and S. Behnke (2016) Multispectral pedestrian detection using deep fusion convolutional neural networks. In ESANN.
  • [27] Y. Wu, E. Y. Chang, K. C. Chang, and J. R. Smith (2004) Optimal multimodal fusion for multimedia data analysis. In ACM Multimedia.
  • [28] J. Xiao, A. Owens, and A. Torralba (2013) SUN3D: a database of big spaces reconstructed using SfM and object labels. In ICCV.
  • [29] X. Yang, Z. Xu, and J. Luo (2018) Towards perceptual image dehazing by physics-based disentanglement and adversarial training. In AAAI.