1 Introduction

The use of deeper networks and data-hungry algorithms to solve challenging computer vision problems has created the need for ever richer datasets. In addition to common image datasets such as the famed ImageNet, datasets containing multiple modalities have also been collected to address a variety of problems, ranging from depth estimation, indoor scene understanding, 6-DOF tracking, multispectral object detection, and autonomous driving [10, 4] to haze removal, to name just a few. More generally, learning from multiple modalities has been explored to determine which ones are useful, and multiple ways of combining them have been proposed [11, 14, 20, 26].
Training deep learning models on additional modalities typically means that these extra modalities must also be available at test time. Unfortunately, capturing more modalities requires significant time and effort. Adding sensors alongside an RGB camera results in increased power consumption, less portable setups, the need to carefully calibrate and synchronize each sensor, as well as additional constraints on bandwidth and storage requirements. This may not be practical for multiple applications—including augmented reality, robotics, wearable and mobile computing, etc.—where these physical constraints preclude the use of additional sensors.
This dichotomy between the advantage brought by additional modalities and the impediment they impose on real systems has attracted attention in the literature. Can we train on additional modalities without relying on them at test time? In their "learning with privileged information" paper, Vapnik et al. introduce a theoretical framework which shows that this may indeed be feasible. Practical techniques have since been introduced, but they tend to be tailored to specific network architectures and applications. For example, "modality hallucination" and its variants [7, 8] train networks on different modalities independently, and show that changing the input modality of one of the networks, while forcing its latent space to keep its former structure, improves convergence. Other authors propose to process multiple modalities independently in parallel branches within a network, and to fuse the resulting feature maps using so-called "modality dropout" to make the network invariant to missing modalities. Despite improving performance, these methods are complex to implement, may require multiple training steps, and must be adapted differently to each problem.
In this paper, we propose a technique for exploiting additional modalities at training time without having to rely on them at test time. In contrast to previous work, which requires task-specific architectures or multiple training passes, our approach is extremely simple (it can be implemented in a few lines of code), is independent of the learning architecture used, and does not require any additional training pass. Assuming that modalities are spatially aligned and share the same spatial resolution, we propose to randomly drop out entire input modalities at training time. At test time, the missing modality is simply set to 0. We demonstrate that our proposed strategy, Input Dropout, yields gains of 2–20% over training on RGB only, on a variety of applications.
2 Input Dropout
We assume that all input modalities are spatially aligned and can be represented as additional channels of the same input image. In our experiments, we also assume that RGB is the only modality available at test time; the additional modality is never available during testing.
Our proposed Input Dropout strategy is illustrated in fig. 1. The additional modality is first concatenated channel-wise to the RGB image, and the resulting tensor is fed as input to the neural network. The first convolutional layer of the network must be adapted to this new input dimensionality (cf. sec. 4). At training time, one of the input modalities is randomly set to 0 with a fixed probability, effectively "dropping out" the corresponding modality. At test time, the additional modality is always set to 0. Implementing Input Dropout requires only a few lines of PyTorch code.
Since we assume a single additional modality is combined with an RGB image, we are faced with two options: randomly drop only the additional modality and always keep the RGB (we dub this option addit), or drop either the RGB or the additional modality (both). In these two cases, a uniform probability distribution over the possible cases is used. For the addit mode, the probability of dropping the additional modality is 1/2. For the both mode, each of the three cases (dropping the RGB, dropping the additional modality, or keeping both) occurs with probability 1/3.
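The paper notes that Input Dropout takes only a few lines of code; below is a NumPy sketch of the operation (the function name, the (B, C, H, W) layout with RGB in channels 0–2, and the default probabilities are our assumptions):

```python
import numpy as np

def input_dropout(x, mode="addit", p=0.5, training=True, rng=None):
    """Zero out entire input modalities of a (B, C, H, W) batch.

    Channels 0..2 are assumed to be RGB; channels 3.. are the extra
    modality. At test time (training=False) the extra modality is
    always zeroed, since it is assumed unavailable.
    """
    x = x.copy()
    if not training:
        x[:, 3:] = 0.0              # extra modality never seen at test time
        return x
    rng = rng or np.random.default_rng()
    if mode == "addit":             # drop only the extra modality, w.p. p
        if rng.random() < p:
            x[:, 3:] = 0.0
    elif mode == "both":            # uniform over: drop RGB, drop extra, keep both
        case = rng.integers(3)
        if case == 0:
            x[:, :3] = 0.0
        elif case == 1:
            x[:, 3:] = 0.0
    return x
```

In PyTorch the same logic applies verbatim to a `torch.Tensor` batch, placed just before the first convolution.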
Our method is most closely related to "modality dropout", which fuses the modalities in a learned latent space. Its main limitation is that a specialized network branch must be learned for each modality, which adds complexity. In contrast, our method can be used on existing convolutional architectures with very little change. We compare against it in sec. 4.
3 Input Dropout for image dehazing
We first experiment with Input Dropout on single image dehazing [17, 29], with depth (RGB+D) as the additional modality available at training time only. For this, we employ the D-Hazy dataset, which contains 1449 pairs of RGB+D images where haze is synthetically added to images from the NYU Depth dataset. We use 1180 images for training, 69 for validation, and 200 for testing. Our model is similar to previous work, the only difference being that the generator is a ResNet (with nine blocks). The loss hyperparameters are obtained with a grid search on the validation set. At training time, Input Dropout uses the addit mode: it does not make sense to drop the RGB image, since that would amount to recovering a haze-free image from a depth map alone. We also experiment on single image dehazing using segmentation (RGB+S) as an additional training modality. For this, we use the Foggy Cityscapes dataset, an extension of Cityscapes which contains ground-truth scene segmentations. The same network and training procedure are used.
Quantitative dehazing results with Input Dropout are provided in tab. 1, and corresponding representative qualitative results in fig. 2. Over the RGB-only baseline, relative improvements of 3.6% in PSNR and 3.4% in SSIM are observed when using Input Dropout on RGB+D, and of 4.5% in PSNR and 2.2% in SSIM on RGB+S. We also compare our method to competing techniques: "Dehazing for segmentation" (D4S), which dehazes images to improve performance on a subsequent task using a modality available only during training, and Pix2Pix GAN, which employs an extra generator to generate the missing modality from the RGB image. In every case, Input Dropout performs better while being simpler than the other approaches. Note that we do not compare our approach to "modality distillation" here, since that method cannot be applied to this scenario: it would involve training a network to dehaze a depth (or segmentation) image, which would require hallucinating scene contents.
| Pix2Pix GAN | 17.70 | 0.75 | 22.90 | 0.91 |
Figure 2 (columns, left to right): Input, Ground truth, RGB only, Input Dropout, Difference.
4 Input Dropout for classification
We evaluate the use of Input Dropout for image classification using RGB+D training data. For this, we follow the methodology of Garcia et al., who use crops of individual objects from the NYU V2 dataset, adapted for object classification on RGB+D. We use the same split: 4,600 RGB-D images in total, of which around 50% are used for training and the remainder for testing. Here, we rely on a ResNet-34, initialized with weights pretrained on ImageNet. To adapt the pretrained ResNet-34 to Input Dropout, we append additional channels to the filters of the first convolutional layer and initialize the new weights randomly. Doing so preserves the pretrained weights for the RGB modality.
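The first-layer adaptation can be sketched as follows (a NumPy sketch of the weight-expansion step; the function name, shapes, and the std-matched random initialization are our assumptions):

```python
import numpy as np

def expand_first_conv(weight, n_extra, rng=None):
    """Append `n_extra` input channels to a pretrained first-conv weight.

    `weight` has shape (out_channels, 3, kH, kW). The pretrained RGB
    filters are kept as-is; the new channels are initialized randomly,
    with a standard deviation matched to the existing weights.
    """
    rng = rng or np.random.default_rng()
    out_c, _, kh, kw = weight.shape
    new = rng.normal(0.0, weight.std(), size=(out_c, n_extra, kh, kw))
    return np.concatenate([weight, new], axis=1)  # (out_c, 3 + n_extra, kH, kW)
```

In PyTorch, the result would be assigned back to `conv1.weight` after rebuilding the layer with the larger `in_channels`.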
Tab. 2 shows the quantitative classification accuracy obtained with the various methods. First, we report results when the depth modality is available at test time, to provide an upper bound on performance. Next, we evaluate a single network trained on the RGB modality only ("RGB-only"), an approach which relies on a GAN to hallucinate the depth at test time, and our Input Dropout strategy (in the addit mode). Our approach provides the best results, despite being the simplest.
We further compare to ensemble methods. First, two networks trained on RGB only, with their outputs averaged before the argmax, yield an absolute performance improvement of 3.2% over the single-network baseline. The "modality distillation" approach of Garcia et al. relies on a combination of two networks: one trained on RGB only, and another, so-called "hallucination" network. That second network is trained to produce a latent representation similar to that of a proxy network trained on the depth modality only. The final output is the mean of the RGB-only and the hallucination network. We reimplemented their approach in PyTorch to ensure direct comparison with our results, which yields a 1.8% absolute improvement over the RGB+RGB baseline.
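The averaging used by these two-network ensembles can be sketched as follows (a pure-Python sketch; the function name and per-class probability inputs are our assumptions):

```python
def ensemble_predict(probs_a, probs_b):
    """Average two networks' per-class probabilities, then take the argmax."""
    avg = [(a + b) / 2.0 for a, b in zip(probs_a, probs_b)]
    return max(range(len(avg)), key=avg.__getitem__)
```

The same averaging applies whether the second network is the hallucination network of modality distillation or a network trained with Input Dropout.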
We directly compare our technique to "modality distillation" by using one network trained on RGB only, and another network trained on RGB+D with Input Dropout (instead of their "hallucination" network). This yields approximately the same performance as "modality distillation", despite being much simpler to train: it requires a single network architecture and a single training pass (both networks can be trained in parallel, whereas they must be trained sequentially for modality distillation).
| Method | Ensemble | Accuracy |
| Depth (D) only | No | 57.0% |
| Pix2Pix GAN | No | 48.2% |
| Mod. distillation | Yes | 52.5% |
| Input Dropout + RGB | Yes | 52.7% |
5 Other applications
We evaluate Input Dropout (in the both mode) on two additional applications: tracking in RGB+D and pedestrian detection in RGB+thermal.
5.1 3D object tracking with RGB+D
We first focus on the problem of tracking 3D objects in 6 degrees of freedom (DOF). To do so, we employ the methodology of Garon et al., who presented a technique for tracking a known 3D object in real-time using synthetic RGB+D data. They also provide an evaluation dataset containing 297 real sequences captured with a Kinect V2 with ground truth annotations of the 6-DOF poses of 11 different objects.
Here, we focus on the "occlusion" scenario, where the objects are rotated on a turntable while being partially hidden by a planar occluder, with (measured) occlusion varying from 0% to 75%. We evaluate Input Dropout using the same CNN architecture as in that work. For a given pose $(\mathbf{R}, \mathbf{t})$, where $\mathbf{t}$ is the translation and $\mathbf{R}$ the rotation matrix, the translation error is the $\ell_2$ norm of the difference between predicted and ground-truth translations, and the distance between two rotation matrices $\mathbf{R}_1$ and $\mathbf{R}_2$ is computed with
$$\delta(\mathbf{R}_1, \mathbf{R}_2) = \arccos\left(\frac{\operatorname{tr}(\mathbf{R}_1 \mathbf{R}_2^\top) - 1}{2}\right),$$
where $\operatorname{tr}(\cdot)$ is the matrix trace.
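For illustration, the standard geodesic rotation distance used for this metric can be computed as follows (a minimal stdlib sketch; the function name is ours):

```python
import math

def rotation_distance(R1, R2):
    """Geodesic distance (radians) between two 3x3 rotation matrices:
    arccos((tr(R1 R2^T) - 1) / 2)."""
    # tr(R1 @ R2^T) equals the elementwise sum of R1 * R2
    tr = sum(R1[i][j] * R2[i][j] for i in range(3) for j in range(3))
    # clamp against floating-point drift before arccos
    return math.acos(max(-1.0, min(1.0, (tr - 1.0) / 2.0)))
```

Identical rotations give 0, and a 90-degree rotation about any axis gives pi/2.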
Quantitative 6-DOF tracking results are reported in tab. 3. We observe that Input Dropout generally improves the tracking results in translation, with a relative gain of up to 33.3% on the hardest sequences, and an average relative gain of 17% in rotation. The reported errors are averaged over 5 training runs for each method.
Table 3 reports errors in translation (mm) and rotation (degrees).
5.2 Pedestrian detection with RGB+T
We experiment with pedestrian detection on RGB+T (thermal) images using the KAIST Multispectral pedestrian dataset . The training/validation/test sets are composed of 16,000/1,100/3,500 pairs of thermal/visible images for nighttime and 32,000/1,500/8,500 for daytime.
Here, we rely on RetinaNet, a state-of-the-art architecture for object detection, trained with a focal loss and using a ResNet-34 with a Feature Pyramid Network (FPN) as the backbone for feature extraction. The RetinaNet is initialized with weights pretrained on ImageNet. As in sec. 4, additional channels are appended to the filters of the first convolutional layer to preserve the weights learned on RGB. The network is then fine-tuned on the KAIST images until convergence on the validation set.
To evaluate performance, we compute the mean average precision (mAP) at an intersection-over-union (IoU) threshold of 0.5. Tab. 4 shows the results in both night- and daytime scenarios. Our Input Dropout strategy yields improvements in all cases: nighttime RGB pedestrian detection improves by 18.9%, and daytime by 15.1%.
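The IoU underlying this detection criterion can be sketched as follows (boxes as (x1, y1, x2, y2) tuples; the function name is ours):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```

A predicted pedestrian box counts as a true positive when its IoU with a ground-truth box is at least 0.5.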
6 Conclusion

We propose Input Dropout as a simple and effective strategy for leveraging additional modalities that are available at training time but not at test time. We extensively test our technique on several applications (single image dehazing, object classification, 3D object tracking, and object detection) and several additional modalities (depth, segmentation maps, and thermal images). In all cases, using Input Dropout during training yields improved performance at test time, even though the additional modality is unavailable. Our approach, which can be implemented in only a few lines of code, can be used as a drop-in replacement with no change to the network architecture aside from extra input channels in the first-layer filters. The main limitation of our approach is that we have experimented with adding only a single additional modality to the RGB baseline; in the future, we plan to explore the applicability of the approach to more modalities.
References

- (2016) D-Hazy: a dataset to evaluate quantitatively dehazing algorithms. In ICIP.
- (2018) NTIRE 2018 challenge on image dehazing: methods and results. In CVPR Workshops.
- (2018) Overcoming missing and incomplete modalities with generative adversarial networks for building footprint segmentation. In CBMI, pp. 1–6.
- (2016) The Cityscapes dataset for semantic urban scene understanding. In CVPR.
- (2019) Learning of image dehazing models for segmentation tasks. In EUSIPCO, A Coruña, Spain.
- (2009) ImageNet: a large-scale hierarchical image database. In CVPR.
- (2018) Modality distillation with multiple stream networks for action recognition. In ECCV.
- (2019) Learning with privileged information via adversarial discriminative modality distillation. TPAMI.
- (2018) A framework for evaluating 6-DOF object trackers. In ECCV.
- (2012) Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR.
- (2005) Affect recognition from face and body: early fusion vs. late fusion. In IEEE International Conference on Systems, Man and Cybernetics.
- (2016) Deep residual learning for image recognition. In CVPR.
- (2016) Learning with side information through modality hallucination. In CVPR.
- (2015) Jointly learning heterogeneous features for RGB-D activity recognition. In CVPR.
- (2015) Multispectral pedestrian detection: benchmark dataset and baseline. In CVPR.
- Perceptual losses for real-time style transfer and super-resolution. In ECCV.
- (2018) Single image dehazing via conditional generative adversarial network. In CVPR.
- (2017) Feature pyramid networks for object detection. In CVPR.
- (2017) Focal loss for dense object detection. In ICCV.
- (2016) Choosing smartly: adaptive multimodal fusion for object detection in changing environments. In IROS.
- (2012) Indoor segmentation and support inference from RGBD images. In ECCV.
- (2016) ModDrop: adaptive multi-modal gesture recognition. TPAMI 38(8), pp. 1692–1706.
- (2018) Semantic foggy scene understanding with synthetic data. IJCV 126(9), pp. 973–992.
- (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, pp. 1929–1958.
- (2009) A new learning paradigm: learning using privileged information. Neural Networks 22(5-6), pp. 544–557.
- Multispectral pedestrian detection using deep fusion convolutional neural networks. In ESANN.
- (2004) Optimal multimodal fusion for multimedia data analysis. In ACM Multimedia.
- (2013) SUN3D: a database of big spaces reconstructed using SfM and object labels. In ICCV.
- (2018) Towards perceptual image dehazing by physics-based disentanglement and adversarial training. In AAAI.