
What Makes Training Multi-Modal Networks Hard?

by   Weiyao Wang, et al.

Consider end-to-end training of a multi-modal vs. a single-modal network on a task with multiple input modalities: the multi-modal network receives more information, so it should match or outperform its single-modal counterpart. In our experiments, however, we observe the opposite: the best single-modal network always outperforms the multi-modal network. This observation is consistent across different combinations of modalities and on different tasks and benchmarks. This paper identifies two main causes for this performance drop: first, multi-modal networks are often prone to overfitting due to increased capacity. Second, different modalities overfit and generalize at different rates, so training them jointly with a single optimization strategy is sub-optimal. We address these two problems with a technique we call Gradient Blending, which computes an optimal blend of modalities based on their overfitting behavior. We demonstrate that Gradient Blending outperforms widely-used baselines for avoiding overfitting and achieves state-of-the-art accuracy on various tasks including fine-grained sport classification, human action recognition, and acoustic event detection.



1 Introduction

Consider a late-fusion multi-modal network, trained end-to-end to solve a task. In this setting, the single-modal solutions are a strict subset of the solutions available to the multi-modal network; a well-optimized multi-modal model should outperform the best single-modal model. However, we show here that current techniques do not generally achieve this. In fact, what we observe is contrary to common sense: the best single-modal model always outperforms the jointly trained late-fusion model, across different modalities and benchmarks (Table 1). Details are given in Section 4.1. Anecdotally, this appears to be common. In personal communications we have heard of similar phenomena occurring in other tasks when fusing RGB+geometry, audio+video, and others.

Dataset      Multi-modal   V@1   Best single  V@1   Accuracy drop
Kinetics     A + RGB       71.4  RGB          72.6  -1.2
Kinetics     RGB + OF      71.3  RGB          72.6  -1.3
Kinetics     A + OF        58.3  OF           62.1  -3.8
Kinetics     A + RGB + OF  70.0  RGB          72.6  -2.6
mini-Sports  A + RGB       60.2  RGB          62.7  -2.5
Table 1: Single-modal networks consistently outperform multi-modal networks. Comparison between the best single-modal networks with late fusion multi-modal networks on Kinetics and mini-Sports using video top-1 validation accuracy. Single stream modalities include video clips (RGB), Optical Flow (OF), and Audio Signal (A). Multi-modal networks use the same architectures as those used in single-modal networks with late fusion by concatenation at the last layer before prediction.
RGB   late-concat  pre-train  early-stop  dropout  mid-concat  SE-gate  NL-gate
72.6  71.4         71.7       71.3        72.9     72.8        71.4     72.0
Table 2: Standard regularizers do not provide a good improvement over the best single-modal network. Comparison of the best single modal network (RGB) with the known approaches applied on a multi-modal network (RGB+Audio) on Kinetics. Various approaches for avoiding overfitting (Pre-training, Early-stopping, and Dropout) cannot solve the issue. Different fusion architectures (Mid-concatenation fusion, SE-gating, and NL-gating) also do not help. Dropout and Mid-concatenation fusion approaches provide small improvements (+0.3% and +0.2%), while other methods degrade accuracy.

1.1 Lack of known solution to the problem

There are two direct ways to approach this problem. First, one can consider solutions such as dropout Dropout14 , pre-training, or early stopping to reduce overfitting. On the other hand, one may speculate that this is an architectural deficiency. We experiment with mid-level fusion by concatenation Owens_2018_ECCV and fusion by gating Kiela18 , trying both Squeeze-and-Excitation (SE) SENet gates and Non-Local (NL) XiaolongWang18 gates. We refer to supplementary materials for details of these architectures.

Remarkably, none of these provides an effective solution. For each method, we record the best audio-visual results on Kinetics in Table 2. Pre-training fails to offer improvements, and early stopping tends to under-fit. Gating adds interactions between the modalities but fails to improve performance. Mid-concat and dropout provide only modest improvements over the RGB model. We note that mid-concat (with 37% fewer parameters than late-concat) and dropout improve over late-concat by 1.4% and 1.5%, respectively, which indicates an overfitting problem with late-concat.

How do we reconcile these experiments with previous multi-modal successes? Multi-modal networks have successfully been trained jointly on tasks including sound localization ZhaoSOP18 , image-audio alignment L3Net17 , and audio-visual synchronization Owens_2018_ECCV ; Korbar18 . However, these tasks cannot be performed with a single-modal network alone, so the performance drop found in this paper does not apply to them. In other work, joint training is avoided entirely by fusing features from independently pre-trained single-modal networks (either on the same task or on different tasks). Good examples include two-stream networks for video classification SimonyanZ14 ; WangXW0LTG16 ; FeichtenhoferPZ16 ; I3D and image+text classification Arevalo17 ; Kiela18 . However, these methods do not train jointly, so they are again not comparable, and their accuracy is most likely sub-optimal due to independent training.

1.2 The contributions of this paper

Our contributions include:

  • We empirically demonstrate the significance of overfitting in joint training of multi-modal networks, and we identify two causes for the problem. We note that this problem is architecture-agnostic: different fusion techniques suffer from the same overfitting problem.

  • We propose a metric to understand the overfitting problem quantitatively: the overfitting-to-generalization ratio (OGR). We provide both theoretical and empirical justification.

  • We propose a new training scheme based on OGR which constructs an optimal blend (in a sense we make precise below) of multiple supervision signals. This Gradient-Blend method gives significant gains in ablations and achieves state-of-the-art accuracy on benchmarks including Kinetics, Sports1M, and AudioSet. It applies broadly to end-to-end training of ensemble models.

2 Background: joint multi-modal training

Single-modal network. Given a training set $\mathcal{T} = \{X_{1 \dots n}, y_{1 \dots n}\}$, where $X_i$ is the $i$-th training example and $y_i$ is its true label, training on a single modality $m$ (e.g. RGB frames, audio, or optical flows) means minimizing an empirical loss:

$$\mathcal{L}\left(\mathcal{C}(\varphi_m(X)), y\right) \quad (1)$$

where $\varphi_m$ is normally a deep network parameterized by $\Theta_m$, and $\mathcal{C}$ is a classifier, typically one or more fully-connected (FC) layers with parameters $\Theta_c$. For classification problems, $\mathcal{L}$ is normally the cross entropy loss. Minimizing Eq. 1 gives a solution $\Theta_m^*$ and $\Theta_c^*$. Figure 1a shows independent training of two modalities $m_1$ and $m_2$.

Multi-modal network. We train a late-fusion ensemble model on $k$ different modalities ($\{m_i\}_{i=1}^{k}$). Each modality is processed by a different deep network $\varphi_{m_i}$, and their features are concatenated and passed to a classifier $\mathcal{C}$. Formally, training is done by minimizing the loss:

$$\mathcal{L}_{multi} = \mathcal{L}\left(\mathcal{C}(\varphi_{m_1} \oplus \varphi_{m_2} \oplus \cdots \oplus \varphi_{m_k}), y\right) \quad (2)$$

where $\oplus$ denotes a concatenation operation. Figure 1b shows an example of joint training of two modalities $m_1$ and $m_2$. Note that the multi-modal network in Eq. 2 is a superset of the single-modal network in Eq. 1, for any modality $m_i$. In fact, for any solution to Eq. 1 on any modality $m_i$, one can construct an equally good solution to Eq. 2 by choosing classifier parameters $\Theta_c$ that mute all modalities other than $m_i$. In practice, this solution is not found, and we next explain why.
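The superset argument can be illustrated with a minimal sketch, assuming a single bias-free linear layer as the classifier. The helper names (`linear`, `late_fusion_logits`, `mute_other_modalities`) and all feature values are hypothetical and chosen only for illustration; the networks used in the experiments are deep backbones followed by FC layers.

```python
# Late fusion by concatenation with a single linear classifier (no bias).
# All names and numbers are illustrative assumptions, not values from the paper.

def linear(weights, features):
    """Apply a linear classifier: one row of weights per class."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]

def late_fusion_logits(weights, features_per_modality):
    """Concatenate per-modality features, then classify the fused vector."""
    fused = [f for feats in features_per_modality for f in feats]
    return linear(weights, fused)

def mute_other_modalities(single_weights, dims, keep):
    """Embed single-modal classifier weights into a fused classifier,
    placing zeros on the feature slots of every other modality."""
    offset = sum(dims[:keep])
    total = sum(dims)
    fused_weights = []
    for row in single_weights:
        padded = [0.0] * total
        padded[offset:offset + dims[keep]] = row
        fused_weights.append(padded)
    return fused_weights

rgb_feats = [0.5, -1.0, 2.0]   # stand-in for the RGB encoder output
audio_feats = [3.0, 0.25]      # stand-in for the audio encoder output
rgb_classifier = [[1.0, 0.0, 0.5], [-0.5, 2.0, 0.0]]  # 2 classes x 3 features

single_logits = linear(rgb_classifier, rgb_feats)
fused_classifier = mute_other_modalities(rgb_classifier, dims=[3, 2], keep=0)
fused_logits = late_fusion_logits(fused_classifier, [rgb_feats, audio_feats])
print(single_logits, fused_logits)
```

With zeros on the audio slots, the fused classifier reproduces the RGB-only logits exactly, so the fused model can always represent the single-modal solution. Whether training actually finds it is a separate question.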

Figure 1: Single- vs. multi-modal joint training. a) Single-modal training of two different modalities. b) Naive joint training of two modalities by late fusion. c) Joint training of two modalities with weighted blending of supervision signals. Different deep network encoders (white trapezoids) produce features (blue or pink rectangles) which are concatenated and passed to a classifier (yellow rounded rectangles).

3 Multi-modal joint training via gradient blending

3.1 Generalizing vs. Overfitting

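Overfitting is, by definition, learning patterns in a training set that do not generalize to the target distribution. We quantify this as follows. Given model parameters $\Theta^{(N)}$, where $N$ indicates the training epoch, let $\mathcal{L}^T(\Theta^{(N)})$ be the model's average loss over the fixed training set, and $\mathcal{L}^*(\Theta^{(N)})$ be the "true" loss w.r.t. the hypothetical target distribution. (In practice, $\mathcal{L}^*$ is approximated by the test and validation losses.) For either loss, the quantity $\mathcal{L}(\Theta^{(0)}) - \mathcal{L}(\Theta^{(N)})$ is a measure of the information gained during training. Writing $\mathcal{L}_N$ for $\mathcal{L}(\Theta^{(N)})$, we define overfitting as the gap between the gain on the training set and the target distribution:

$$O_N \equiv \left(\mathcal{L}^T_0 - \mathcal{L}^T_N\right) - \left(\mathcal{L}^*_0 - \mathcal{L}^*_N\right)$$

and generalization to be the amount we learn (from training) about the target distribution:

$$G_N \equiv \mathcal{L}^*_0 - \mathcal{L}^*_N$$

The overfitting-to-generalization ratio is a measure of information quality:

$$OGR \equiv \left| \frac{O_N}{G_N} \right|$$

However, it does not make sense to optimize this as-is. Very underfit models, for example, may still score quite well. What does make sense, however, is to solve an infinitesimal problem: given several estimates of the gradient, blend them to minimize an infinitesimal $OGR^2$, ensuring each gradient step now produces a gain no worse than that of the single best modality.

Given parameter $\Theta$, the full-batch gradient with respect to the training set is $\nabla\mathcal{L}^T(\Theta)$, and the groundtruth gradient is $\nabla\mathcal{L}^*(\Theta)$. We decompose $\nabla\mathcal{L}^T$ into the true gradient and a remainder:

$$\nabla\mathcal{L}^T(\Theta) = \nabla\mathcal{L}^*(\Theta) + \epsilon$$

In particular, $\epsilon = \nabla\mathcal{L}^T - \nabla\mathcal{L}^*$ is exactly the infinitesimal overfitting. Given an estimate $\hat{g}$ applied with learning rate $\eta$, we can measure its contribution to the losses via Taylor's theorem:

$$\mathcal{L}^T(\Theta - \eta\hat{g}) \approx \mathcal{L}^T(\Theta) - \eta \langle \nabla\mathcal{L}^T, \hat{g} \rangle, \qquad \mathcal{L}^*(\Theta - \eta\hat{g}) \approx \mathcal{L}^*(\Theta) - \eta \langle \nabla\mathcal{L}^*, \hat{g} \rangle$$

which implies $\hat{g}$'s contribution to overfitting is given by $\eta \langle \nabla\mathcal{L}^T - \nabla\mathcal{L}^*, \hat{g} \rangle$. If we train for $N$ steps with gradients $\{\hat{g}_i\}_{i=0}^{N}$, and $\eta_i$ is the learning rate at the $i$-th step, the final $OGR$ can be aggregated as:

$$OGR = \left| \frac{\sum_{i=0}^{N} \eta_i \langle \nabla\mathcal{L}^T(\Theta^{(i)}) - \nabla\mathcal{L}^*(\Theta^{(i)}), \hat{g}_i \rangle}{\sum_{i=0}^{N} \eta_i \langle \nabla\mathcal{L}^*(\Theta^{(i)}), \hat{g}_i \rangle} \right|$$

and, for a single vector $\hat{g}_i$,

$$OGR^2 = \left( \frac{\langle \nabla\mathcal{L}^T(\Theta^{(i)}) - \nabla\mathcal{L}^*(\Theta^{(i)}), \hat{g}_i \rangle}{\langle \nabla\mathcal{L}^*(\Theta^{(i)}), \hat{g}_i \rangle} \right)^2$$

Next we will compute the optimal blend to minimize single-step $OGR^2$.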
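The checkpoint-level quantities defined in this subsection can be sketched in a few lines. This is a minimal illustration, assuming the gain is measured as the loss at epoch 0 minus the loss at epoch N, that the ratio is taken in absolute value, and that a held-out loss stands in for the true loss. The function names and loss curves are hypothetical.

```python
def generalization(target_losses, n):
    """G_N: loss reduction on the (approximate) target distribution."""
    return target_losses[0] - target_losses[n]

def overfitting(train_losses, target_losses, n):
    """O_N: training-set gain minus target-distribution gain at checkpoint n."""
    train_gain = train_losses[0] - train_losses[n]
    return train_gain - generalization(target_losses, n)

def ogr(train_losses, target_losses, n):
    """Overfitting-to-generalization ratio |O_N / G_N|."""
    g = generalization(target_losses, n)
    if g == 0:
        raise ValueError("generalization is zero; OGR is undefined")
    return abs(overfitting(train_losses, target_losses, n) / g)

# Hypothetical per-epoch losses: the training loss keeps falling while the
# held-out loss plateaus after the first epoch.
train_curve = [2.0, 1.0, 0.5]
heldout_curve = [2.0, 1.5, 1.5]

print(ogr(train_curve, heldout_curve, 1))  # after one epoch
print(ogr(train_curve, heldout_curve, 2))  # after two epochs
```

In this toy curve the ratio doubles between the two checkpoints: the second epoch lowers only the training loss, so all of its gain counts as overfitting.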

3.2 Blending of Multiple Supervision Signals by OGR Minimization

We can obtain multiple approximate gradients by attaching classifiers to each modality’s features and to the fused features (see fig 1c). Gradients are obtained by back-propagating through each loss separately (so per-modality gradients contain many zeros in other parts of the network). Our next result allows us to blend them all into a single vector with better overfitting behavior.

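Proposition 1 (Gradient-Blend).

Let $\{v_k\}$ be a set of gradient estimators whose overfitting satisfies $\mathbb{E}\left[\langle \nabla\mathcal{L}^T - \nabla\mathcal{L}^*, v_k \rangle \langle \nabla\mathcal{L}^T - \nabla\mathcal{L}^*, v_j \rangle\right] = 0$ for $j \neq k$. Let $w_k \in \mathbb{R}$ denote weights with $\sum_k w_k = 1$. The optimal weights

$$w^* = \arg\min_{w} \mathbb{E}\left[ \left( \frac{\langle \nabla\mathcal{L}^T - \nabla\mathcal{L}^*, \sum_k w_k v_k \rangle}{\langle \nabla\mathcal{L}^*, \sum_k w_k v_k \rangle} \right)^2 \right]$$

are given by

$$w_k^* = \frac{1}{Z} \frac{\langle \nabla\mathcal{L}^*, v_k \rangle}{\sigma_k^2}$$

where $\sigma_k^2 \equiv \mathbb{E}\left[\langle \nabla\mathcal{L}^T - \nabla\mathcal{L}^*, v_k \rangle^2\right]$ and $Z = \sum_k \frac{\langle \nabla\mathcal{L}^*, v_k \rangle}{\sigma_k^2}$ enforces that the sum is unity.

The assumption that $\mathbb{E}\left[\langle \nabla\mathcal{L}^T - \nabla\mathcal{L}^*, v_k \rangle \langle \nabla\mathcal{L}^T - \nabla\mathcal{L}^*, v_j \rangle\right] = 0$ will be false when two models' overfitting is very correlated. However, if this is the case then very little can be gained by blending. In informal experiments we have indeed observed that these cross terms are often small relative to the $\mathbb{E}\left[\langle \nabla\mathcal{L}^T - \nabla\mathcal{L}^*, v_k \rangle^2\right]$. This is likely due to complementary information across modalities, and we speculate that, additionally, this happens naturally as joint training tries to learn complementary features across neurons. Please see supplementary materials for the proof of Proposition 1, including formulas for the correlated case.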
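A small numeric check can make the proposition concrete. The sketch below assumes uncorrelated overfitting terms, so that the expected squared ratio for weights w reduces to the sum of w_k² σ_k² divided by the square of the sum of w_k g_k, where g_k is estimator k's inner product with the true gradient. The function names and the numbers are hypothetical, and positive g_k is assumed so the normalizer is non-zero.

```python
def blend_weights(gains, variances):
    """Weights proportional to gain / variance, normalized to sum to one."""
    raw = [g / s2 for g, s2 in zip(gains, variances)]
    z = sum(raw)
    if z == 0:
        raise ValueError("normalizer is zero; weights are undefined")
    return [r / z for r in raw]

def expected_ogr2(weights, gains, variances):
    """Expected squared overfitting-to-generalization ratio of a blend,
    under the assumption that cross terms between estimators vanish."""
    noise = sum(w * w * s2 for w, s2 in zip(weights, variances))
    signal = sum(w * g for w, g in zip(weights, gains))
    return noise / (signal * signal)

# Hypothetical two-estimator example.
gains = [1.0, 2.0]       # inner products with the true gradient
variances = [1.0, 4.0]   # expected squared overfitting per estimator

w_opt = blend_weights(gains, variances)
best = expected_ogr2(w_opt, gains, variances)
equal = expected_ogr2([0.5, 0.5], gains, variances)
single = [expected_ogr2(w, gains, variances) for w in ([1.0, 0.0], [0.0, 1.0])]
print(w_opt, best, equal, single)
```

In this example either estimator alone has an expected squared ratio of 1, equal weighting gives roughly 0.56, and the proposed weights give 0.5, which a grid search over the remaining weight choices does not improve on.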

3.3 Use of OGR and Gradient-Blend in practice

We adapt a multi-task architecture to construct an approximate solution to the optimization above.

Gradient-Blend in practice. Proposition 1 suggests calculating optimal weights at every update step to minimize $OGR^2$. This would be a noisy and computationally demanding task. Instead, we find it works remarkably well to assign a single fixed weight per modality, obtained using the per-modality generalization ($G$) and overfitting ($O$) measured after an initial training run of each model separately. We demonstrate the gains from this simplified training scheme and look forward to developing robust per-step or per-epoch estimation in future work.

Optimal blending by loss re-weighting. Figure 1c shows our joint training setup for two modalities with weighted losses. At each back-propagation step, the per-modality gradient for $m_i$ is $\nabla\mathcal{L}_i$ ($i = 1, \dots, k$), and the gradient from the fused loss is given by Eq. 2 (we denote it as $\nabla\mathcal{L}_{k+1}$). Taking the gradient of the blended loss

$$\mathcal{L}_{blend} = \sum_{i=1}^{k+1} w_i \mathcal{L}_i$$

thus produces the blended gradient $\sum_{i=1}^{k+1} w_i \nabla\mathcal{L}_i$. For appropriate choices of $w_i$ this yields a convenient way to implement gradient blending with fixed weights. Intuitively, loss re-weighting re-calibrates the learning schedule to balance the generalization/overfitting rate of different modalities.
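The equivalence between re-weighting losses and blending gradients can be checked on a toy problem. The sketch below is an illustration only: it assumes one shared scalar parameter and three quadratic losses standing in for the audio, RGB, and fused heads. The targets and weights are hypothetical, and a real implementation would rely on a framework's automatic differentiation rather than hand-written gradients.

```python
# Toy setup (hypothetical): one shared scalar parameter theta and three
# quadratic losses standing in for the per-modality and fused heads.
TARGETS = [1.0, -2.0, 0.0]

def losses(theta):
    """Per-head losses L_i(theta) = (theta - t_i)^2."""
    return [(theta - t) ** 2 for t in TARGETS]

def gradients(theta):
    """Analytic per-head gradients dL_i/dtheta."""
    return [2.0 * (theta - t) for t in TARGETS]

def blended_loss(theta, weights):
    """Weighted sum of the per-head losses."""
    return sum(w * l for w, l in zip(weights, losses(theta)))

def blended_gradient(theta, weights):
    """Weighted sum of the per-head gradients."""
    return sum(w * g for w, g in zip(weights, gradients(theta)))

def numeric_gradient(f, theta, eps=1e-5):
    """Central finite difference, used only to cross-check the analytic value."""
    return (f(theta + eps) - f(theta - eps)) / (2.0 * eps)

weights = [0.1, 0.6, 0.3]
theta = 0.5
analytic = blended_gradient(theta, weights)
numeric = numeric_gradient(lambda t: blended_loss(t, weights), theta)
print(analytic, numeric)
```

Because differentiation is linear, differentiating the weighted loss and weighting the individual gradients agree, which is why fixed-weight blending needs no change to the optimizer, only to the loss.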

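Measuring OGR in practice. In practice, $\mathcal{L}^*$ is not available. To measure OGR, we hold out a subset $\mathcal{V}$ of the training set to approximate the true distribution (i.e. $\mathcal{L}^{\mathcal{V}} \approx \mathcal{L}^*$), and compute, with subscripts indexing the training epoch,

$$G_N \approx \mathcal{L}^{\mathcal{V}}_0 - \mathcal{L}^{\mathcal{V}}_N, \qquad O_N \approx \left(\mathcal{L}^T_0 - \mathcal{L}^T_N\right) - \left(\mathcal{L}^{\mathcal{V}}_0 - \mathcal{L}^{\mathcal{V}}_N\right) \quad (10)$$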
In summary, we train as follows:

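  1. Train single-modal models for each modality, as well as the joint model.

  2. For each model, compute $G$ and $O$ as per (10).

  3. Train a multi-modal model, as per Figure 1c, with loss weights given by $w_k = \frac{1}{Z} \frac{G_k}{O_k^2}$, where $Z = \sum_k G_k / O_k^2$ normalizes the weights to sum to one.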

In practice, we find it is equally effective to replace the loss measure by an accuracy metric.
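The weight-estimation step of the procedure above can be sketched as follows. The sketch assumes each loss weight is proportional to the measured generalization divided by the squared measured overfitting, normalized to sum to one, with both quantities taken from training and held-out losses at the start and end of the initial runs. The function name `ogr_weights` and all loss values are hypothetical.

```python
def ogr_weights(curves):
    """Compute normalized loss weights from start/end losses.

    `curves` maps a model name to a tuple
    (train_start, train_end, heldout_start, heldout_end).
    Each weight is proportional to G / O^2 (an assumption of this sketch).
    """
    raw = {}
    for name, (t0, t1, v0, v1) in curves.items():
        g = v0 - v1              # generalization measured on held-out data
        o = (t0 - t1) - g        # overfitting: train gain minus held-out gain
        if o == 0:
            raise ValueError("zero measured overfitting for " + name)
        raw[name] = g / (o * o)
    z = sum(raw.values())
    return {name: r / z for name, r in raw.items()}

# Hypothetical measurements from the three initial training runs.
curves = {
    "audio": (5.0, 1.0, 5.0, 4.0),   # overfits heavily, learns little
    "rgb":   (5.0, 2.0, 5.0, 3.0),   # overfits least
    "joint": (5.0, 1.5, 5.0, 3.0),   # in between
}
weights = ogr_weights(curves)
print(weights)
```

With these numbers the RGB head receives about two thirds of the total weight and the audio head under 4%, reflecting how strongly the squared overfitting term penalizes a head that fits the training set without improving on held-out data.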

4 Ablation Experiments

4.1 Experimental setup

Datasets. We use three datasets for our ablation experiments: Kinetics, mini-Sports, and mini-AudioSet. Kinetics is a standard benchmark for action recognition with 260k videos Kinetics of 400 human action classes. We use the train split (240k) for training and the validation split (20k) for testing. Mini-Sports is a subset of Sports-1M Karpathy14 , a large-scale video classification dataset with 1.1M videos of 487 different fine-grained sports. We uniformly sampled 240k videos from the train split and 20k videos from the test split. Mini-AudioSet is a subset of AudioSet audioset , a multi-label dataset consisting of 2M videos labeled by 527 acoustic events. AudioSet is very class-unbalanced, so we remove classes with fewer than 500 samples and subsample such that each class has about 1100 samples to balance it (see supplementary). The balanced mini-AudioSet has 418 classes with 243k videos.

Backbone architecture. We use ResNet3D Tran18 as our visual backbone and ResNet KaimingHe16 as our audio model, both with 50 layers. For fusion, we use a two-FC-layer network applied on the concatenated features from visual and audio backbones, followed by one prediction layer.

Input preprocessing & augmentation. We use three modalities in ablations: RGB frames, optical flows and audio. For RGB and optical flows, we use the same visual backbone ResNet3D-50, which takes a clip of 16×224×224 as input. We follow the same data pre-processing and augmentation as used in XiaolongWang18 for our visual modality, except that we use 16-frame clip input (instead of 32) to reduce memory. For audio, our ResNet-50 takes a spectrogram image of 40×100, i.e. MEL-spectrograms extracted from audio input with 100 temporal frames, each of which has 40 MEL filters.

Training and testing. We train our models with synchronous distributed SGD on GPU clusters using Caffe2 caffe2 with the same training setup as Tran18 . We hold out a small portion of training data for estimating the optimal weights (8% for Kinetics and mini-Sports, 13% for mini-AudioSet). For evaluation, we report clip top-1 accuracy, video top-1 and top-5 accuracy. For video accuracy, we use the center crops of 10 clips uniformly sampled from the video and average these 10 clip predictions to get the final video prediction.
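The video-level evaluation described above (averaging clip predictions, then scoring top-k) can be sketched as below. This is a minimal illustration with hypothetical function names and made-up scores. It uses three clips and three classes for brevity, whereas the experiments average 10 clips per video.

```python
def video_prediction(clip_scores):
    """Average per-class scores over all clips sampled from one video."""
    n = len(clip_scores)
    return [sum(per_class) / n for per_class in zip(*clip_scores)]

def top_k_correct(scores, label, k):
    """True if `label` is among the k highest-scoring classes."""
    ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    return label in ranked[:k]

# Hypothetical clip-level class probabilities for one video.
clip_scores = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.6, 0.3],
]
video_scores = video_prediction(clip_scores)
print(video_scores)
print(top_k_correct(video_scores, label=1, k=1))
print(top_k_correct(video_scores, label=0, k=2))
```

Note that the first clip alone would predict class 0, while the averaged video-level prediction is class 1, which is why clip accuracy and video accuracy are reported separately.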

4.2 Overfitting Problems in Naive Joint Training

In this ablation, we compare the performance of naive audio-RGB joint training with the single-modal network training of audio-only and RGB-only. Fig. 2 plots the training curves of these models on Kinetics (left) and mini-Sports (right). On both Kinetics and mini-Sports, the audio model overfits the most and video overfits least. We note that the naive joint audio-RGB model has lower training error and higher validation error compared with the video-only model. This is evidence that naive joint training of the audio-RGB model increases overfitting, explaining its accuracy drop compared with the video-only model.

Figure 2: Severe overfitting of naive audio-video models on Kinetics and mini-Sports. The learning curves (error-rate) of audio model (A), video model (V), and the naive joint audio-video (AV) model on Kinetics (left) and mini-Sports (right). Solid lines plot validation error while dashed lines show train error. The audio-video model inherits the severe overfitting of audio model, and is inferior to the video-only model.

We extend the analysis and confirm severe overfitting on other multi-modal problems. We consider all 4 possible combinations of the three modalities (audio, RGB, and optical flow). In every case, the validation accuracy of naive joint training is significantly worse than that of the best single-stream model (see Table 1), and training accuracy is almost always higher (see supplementary materials).

4.3 Gradient-Blend is an effective regularizer

In this ablation, we show the merit of Gradient-Blend in multi-modal training. We first show our method helps to regularize and improve the performance on different multi-modal problems on Kinetics. We then compare our method with other regularization methods on the three datasets.

On Kinetics, we study all combinations of three modalities: RGB, optical flow, and audio. Table 3 presents a comparison of our method with naive joint training and the best single-stream model. We observe significant gains of our Gradient-Blend strategy compared to both baselines on all multi-modal problems. It is worth noting that Gradient-Blend is generic enough to work for more than two modalities.

Modal RGB + A RGB + OF OF + A RGB + OF + A
Clip V@1 V@5 Clip V@1 V@5 Clip V@1 V@5 Clip V@1 V@5
Single 63.5 72.6 90.1 63.5 72.6 90.1 49.2 62.1 82.6 63.5 72.6 90.1
Naive 61.8 71.4 89.3 62.2 71.3 89.6 46.2 58.3 79.9 61.0 70.0 88.7
G-Blend 65.8 74.7 91.5 64.3 73.1 90.8 55.0 66.5 86.3 65.7 74.7 91.6
Table 3: Gradient-Blend (G-Blend) works on different multi-modal problems. Comparison between G-Blend with naive late fusion and single best modality on Kinetics. On all 4 combinations of different modalities, G-Blend outperforms both the naive late fusion network and the best single-modal network by large margins, and it also works for cases with more than two modalities.

Furthermore, we pick the problem of joint audio-RGB model training, and go deeper to compare our Gradient-Blend with other regularization methods on different tasks and benchmarks: action recognition (Kinetics), sport classification (mini-Sports), and acoustic event detection (mini-AudioSet). We include three baselines: adding dropout at concatenation layer Dropout14 , pre-training single stream backbones then finetuning the fusion model, and blending the supervision signals with equal weights (which is equivalent to naive training with two auxiliary losses). Auxiliary losses are popularly used in multi-task learning, and we extend it as a baseline for multi-modal training.

As presented in Table 4, our Gradient-Blend outperforms all baselines by significant margins on both Kinetics and mini-Sports. On mini-AudioSet, Gradient-Blend improves over all baselines on mAP, and is slightly worse on mAUC compared to the auxiliary loss baseline. The reason is that the gradient weights learned in Gradient-Blend for Audio, RGB, and Audio-RGB are very similar to equal weights. The failure of the auxiliary loss on Kinetics and mini-Sports demonstrates that the weights used in Gradient-Blend are indeed important. We also experiment with other less obvious multi-task techniques such as treating the weights as learnable parameters during back-prop Kendall18 . However, this approach converges to a similar result as naive joint training. This happens because it lacks an overfitting prior, and thus the learnable weights are biased towards the head with the lowest training loss, which is audio-RGB.

Dataset Kinetics mini-Sports mini-AudioSet
Method Clip V@1 V@5 Clip V@1 V@5 mAP mAUC
Audio only 13.9 19.7 33.6 14.7 22.1 35.6 29.1 90.4
RGB only 63.5 72.6 90.1 48.5 62.7 84.8 22.1 86.1
Pre-Training 61.9 71.7 89.6 48.3 61.3 84.9 37.4 91.7
Naive 61.8 71.4 89.3 47.1 60.2 83.3 36.5 92.2
Dropout 63.8 72.9 90.6 47.4 61.4 84.3 36.7 92.3
Auxiliary Loss 60.5 70.8 88.6 48.9 62.1 84.0 37.7 92.3
G-Blend 65.8 74.7 91.5 49.7 62.8 85.5 37.8 92.2
Table 4: G-Blend outperforms all baseline methods on different benchmarks and tasks. Comparison of our G-blend with different regularization baselines as well as single-modal networks on Kinetics, mini-Sports, and mini-AudioSet. G-Blend consistently outperforms other methods, except for being comparable with using auxiliary loss on mini-AudioSet due to the similarity of learned weights of G-Blend and equal weights.

Fig. 3 presents the top and bottom 20 classes on Kinetics where Gradient-Blend makes the most and least improvements compared with the RGB network. We observe that the improved classes usually have a strong audio correlation, such as beatboxing and whistling. For classes like moving-furniture and cleaning-floor, although audio alone has nearly 0 accuracy, there are still significant improvements when it is combined with RGB. These classes also tend to have high accuracy with naive joint training, which indicates the value of the joint supervision signal. On the bottom-20 classes, where Gradient-Blend does worse than the RGB model, we indeed find that audio does not seem to be very semantically relevant.

Figure 3: Top and bottom 20 classes based on improvement of G-Blend over the RGB model. The improved classes are indeed audio-relevant, while those with a performance drop are not very semantically related to audio.

5 Comparison with State-of-the-Art

In this section, we train our multi-modal networks with deeper backbone architectures using Gradient-Blend and compare them with state-of-the-art methods on Kinetics, Sports1M, and AudioSet. Our G-Blend is trained with RGB and audio input. We use R(2+1)D Tran18 for visual backbone and R2D KaimingHe16 for audio backbone, both with 101 layers. We use the same pre-processing and data augmentation as described in section 4. We use the same 10-crop evaluation setup as in section 4 for Sports-1M and AudioSet. For Kinetics, we follow the same 30-crop evaluation setup as XiaolongWang18 . Our main purposes in these experiments are: 1) to confirm the benefit of Gradient-Blend on high-capacity models; and 2) to compare our G-Blend with state-of-the-art methods on different large-scale benchmarks.

Results. Table 5 presents results of our G-Blend and compares them with current state-of-the-art methods on Kinetics. First, we observe that our G-Blend provides a 1.3% improvement over the RGB model (the best single-modal network) with the same backbone architecture R(2+1)D-101 when both models are trained from scratch. This confirms that the benefits of G-Blend still hold with high-capacity models. Second, our G-Blend, when fine-tuned from Sports-1M, outperforms Shift-Attention Network abs-1708-03805 and Non-local Network XiaolongWang18 by 1.2% and achieves state-of-the-art accuracy on Kinetics. We note that this is not a fair and direct comparison. First, the Shift-Attention network uses 3 different modalities (RGB, optical flows, and audio), while our G-Blend uses only RGB and audio. Second, the Non-local network uses 128-frame clip input while our G-Blend uses only 16-frame clip input. We also note that there are many competitive methods reporting results on Kinetics; due to the space limit, we select only a few representative methods for comparison, including Shift-Attention network abs-1708-03805 , Non-local network XiaolongWang18 , and R(2+1)D Tran18 . Shift-Attention and Non-local networks are the methods with the best published accuracy using multi-modal and single-modal input, respectively. R(2+1)D is used as the visual backbone of G-Blend and thus serves as a direct baseline.

Table 6 and Table 7 present our G-Blend results and compare them with current best methods on Sports-1M and AudioSet. On Sports-1M, G-Blend outperforms previously published results by good margins. It outperforms the current state-of-the-art R(2+1)D model by 1.8% while using shorter clip input (16 frames instead of 32, due to memory constraints). On AudioSet, our G-Blend outperforms the Google benchmark audioset and Softmax Attention Kong18 by 4.1% and 2.8%, respectively, both of which used the feature extractor pre-trained on YouTube100M YouTube100M . Our G-Blend is comparable with Multi-level Attention Network YuMultilvl18 and TAL-Net TalNet , although the first one uses strong features (pre-trained on YouTube100M) and the second one uses 100 clips per video, while our G-Blend uses only 10 clips.

Method          Backbone               Input         Pretrain  Video@1  Video@5
abs-1708-03805  Shift-Attn Net         RGB + OF + A  ImageNet  77.7     93.2
XiaolongWang18  R3D-101+NL             RGB           ImageNet  77.7     93.3
Tran18          R(2+1)D-34             RGB           Sports1M  74.3     91.4
Tran18          R(2+1)D-101            RGB           none      76.4     92.1
G-Blend (ours)  R(2+1)D-101 & R2D-101  RGB + A       none      77.7     93.0
G-Blend (ours)  R(2+1)D-101 & R2D-101  RGB + A       Sports1M  78.9     93.5
Table 5: Comparison with state-of-the-art methods on Kinetics. Our G-Blend outperforms various current state-of-the-art methods despite the fact that it uses fewer modalities or shorter clip input. Our G-Blend also gives a good improvement over RGB model (the best single-modal network) when using the same backbone.
Method          Video@1  Video@5
C3D Tran15      61.1     85.2
P3D P3D         66.4     87.4
Conv pool Ng15  71.7     90.4
R(2+1)D Tran18  73.0     91.5
G-Blend (ours)  74.8     92.4
Table 6: Comparison with state-of-the-art methods on Sports1M. Our G-Blend outperforms the state-of-the-art methods by good margins.
Method                          mAP    mAUC
Benchmark audioset              0.314  0.959
Softmax Attn. Kong18            0.327  0.965
Multi-level Attn. YuMultilvl18  0.360  0.970
TAL-Net TalNet                  0.362  0.965
G-Blend (ours)                  0.355  0.966
Table 7: Comparison with state-of-the-art methods on AudioSet. Our G-Blend outperforms or is comparable with state-of-the-art methods.

6 Related Work

Our work is related to the previous line of research on multi-modal networks Baltruaitis2018MultimodalML for classification SimonyanZ14 ; WangXW0LTG16 ; FeichtenhoferPZ16 ; Fukui16 ; I3D ; Arevalo17 ; abs-1708-03805 ; Kiela18 , which primarily uses pre-training in contrast to our joint training. On the other hand, our work is also related to cross-modal tasks Weston:2011:WSU:2283696.2283856 ; NIPS2013_5204 ; Socher:2013:ZLT:2999611.2999716 ; VQA ; balanced_binary_vqa ; balanced_vqa_v2 ; ImageCaption16 and cross-modal self-supervised learning ZhaoSOP18 ; L3Net17 ; Owens_2018_ECCV ; Korbar18 . These tasks either take one modality as input and make predictions on the other modality (e.g. Visual-Q&A VQA ; balanced_binary_vqa ; balanced_vqa_v2 , image captioning ImageCaption16 , sound localization Owens_2018_ECCV ; ZhaoSOP18 in videos) or use cross-modality correspondences as self-supervision (e.g. image-audio correspondence L3Net17 , video-audio synchronization Korbar18 ). Different from them, our Gradient-Blend tries to address the problem of joint training of multi-modal networks for classification. Our Gradient-Blend training scheme is also related to other works on auxiliary losses, which are widely adopted in multi-task learning approaches Kokkinos16 ; Eigen15 ; Kendall18 ; GradNorm18 . These methods either use uniform or manually tuned weights, or learn the weights as parameters during training, while our work re-calibrates supervision signals using a prior OGR.

7 Discussion

In single-modal networks, diagnosing and correcting overfitting typically involves manual inspection of learning curves. Here we have shown that for multi-modal networks it is essential to measure and correct overfitting in a principled way, and we put forth a useful and practical measure of overfitting. Our proposed method, Gradient-Blend, uses this measure to obtain significant improvements over baselines, and either outperforms or is comparable with state-of-the-art methods on multiple tasks and benchmarks. We look forward to extending Gradient-Blend to a single-pass online algorithm in which OGR estimates are made during training and learning parameters are dynamically adjusted.


  • (1) S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual Question Answering. In International Conference on Computer Vision (ICCV), 2015.
  • (2) R. Arandjelović and A. Zisserman. Look, listen and learn. In ICCV, 2017.
  • (3) J. Arevalo, T. Solorio, M. M. y Gómez, and F. A. González. Gated multimodal units for information fusion. In ICLR Workshop, 2017.
  • (4) T. Baltrušaitis, C. Ahuja, and L.-P. Morency. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41:423–443, 2018.
  • (5) R. Bernardi, R. Cakici, D. Elliott, A. Erdem, E. Erdem, N. Ikizler-Cinbis, F. Keller, A. Muscat, and B. Plank. Automatic description generation from images: A survey of models, datasets, and evaluation measures. J. Artif. Int. Res., 55(1):409–442, Jan. 2016.
  • (6) Y. Bian, C. Gan, X. Liu, F. Li, X. Long, Y. Li, H. Qi, J. Zhou, S. Wen, and Y. Lin. Revisiting the effectiveness of off-the-shelf temporal modeling approaches for large-scale video classification. CoRR, abs/1708.03805, 2017.
  • (7) Caffe2-Team. Caffe2: A new lightweight, modular, and scalable deep learning framework.
  • (8) J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, 2017.
  • (9) Z. Chen, V. Badrinarayanan, C.-Y. Lee, and A. Rabinovich. Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks. 2018.
  • (10) D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. 2015.
  • (11) C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In CVPR, 2016.
  • (12) A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. A. Ranzato, and T. Mikolov. Devise: A deep visual-semantic embedding model. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2121–2129. Curran Associates, Inc., 2013.
  • (13) A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. In EMNLP, 2016.
  • (14) J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter. Audio set: An ontology and human-labeled dataset for audio events. In Proc. IEEE ICASSP 2017, New Orleans, LA, 2017.
  • (15) Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • (16) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • (17) S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney, R. J. Weiss, and K. Wilson. Cnn architectures for large-scale audio classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 131–135, March 2017.
  • (18) J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In CVPR, 2018.
  • (19) A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
  • (20) W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, and A. Zisserman. The kinetics human action video dataset. CoRR, abs/1705.06950, 2017.
  • (21) A. Kendall, Y. Gal, and R. Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In CVPR, 2018.
  • (22) D. Kiela, E. Grave, A. Joulin, and T. Mikolov. Efficient large-scale multi-modal classification. In AAAI, 2018.
  • (23) I. Kokkinos. Ubernet: Training a ‘universal’ convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. arXiv preprint arXiv:1609.02132, 2016.
  • (24) Q. Kong, Y. Xu, W. Wang, and M. Plumbley. Audio set classification with attention model: A probabilistic perspective. April 2018.
  • (25) B. Korbar, D. Tran, and L. Torresani. Cooperative learning of audio and video models from self-supervised synchronization. In NeurIPS, 2018.
  • (26) A. Owens and A. A. Efros. Audio-visual scene analysis with self-supervised multisensory features. In The European Conference on Computer Vision (ECCV), September 2018.
  • (27) Z. Qiu, T. Yao, and T. Mei. Learning spatio-temporal representation with pseudo-3d residual networks. In ICCV, 2017.
  • (28) K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
  • (29) R. Socher, M. Ganjoo, C. D. Manning, and A. Y. Ng. Zero-shot learning through cross-modal transfer. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 1, NIPS’13, pages 935–943, USA, 2013. Curran Associates Inc.
  • (30) N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958, Jan. 2014.
  • (31) D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, 2015.
  • (32) D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri. A closer look at spatiotemporal convolutions for action recognition. In CVPR, 2018.
  • (33) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. 2017.
  • (34) L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. V. Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.
  • (35) X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In CVPR, 2018.
  • (36) Y. Wang, J. Li, and F. Metze. A comparison of five multiple instance learning pooling functions for sound event detection with weak labeling. arXiv preprint arXiv:1810.09050, 2018.
  • (37) J. Weston, S. Bengio, and N. Usunier. Wsabie: Scaling up to large vocabulary image annotation. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Three, IJCAI’11, pages 2764–2770. AAAI Press, 2011.
  • (38) C. Yu, K. S. Barsim, Q. Kong, and B. Yang. Multi-level attention model for weakly supervised audio classification. arXiv preprint arXiv:1803.02353, 2018.
  • (39) J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4694–4702, 2015.
  • (40) P. Zhang, Y. Goyal, D. Summers-Stay, D. Batra, and D. Parikh. Yin and Yang: Balancing and answering binary visual questions. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • (41) H. Zhao, C. Gan, A. Rouditchenko, C. Vondrick, J. McDermott, and A. Torralba. The sound of pixels. In ECCV, 2018.

Appendix A Proof of Proposition 1

We first introduce the following lemma to assist with the proof:

Lemma 1 (Scaling Invariance of Minimization).

Given the assumptions of Proposition 1, we transform the vectors such that . Let be a transformed set of weights with , the weights that minimize , satisfies


In other words, the optimum value of is invariant to rescaling of input vectors .

Proof of Lemma 1.

Let be the optimal value of given and be the optimal value of given , we only need to show , because by symmetry we have .

Assume a contradiction: , and let be the solution for . Then we have


However, in equation 12 is a feasible solution to minimization given , and thus, its satisfies . Therefore, we have


Therefore, the contradiction assumption is incorrect; thus . By symmetry, ; thus . ∎

Proof of Proposition 1.

We create the normalized set


and solve for coefficients given . By Lemma 1, minimizing is equivalent to original problem (minimizing ). From the constraint, we have


Thus, the problem simplifies to:


We first compute the expectation:


where .

We apply Lagrange multipliers on our objective function (equation 17) and constraint (equation 15):


The partial gradient of is given by


Setting the partial gradient to zero gives:


Applying the constraint gives:


In other words,


The normalized variance

and original variance are related by


And the original problem is related to the normalized problem by:


Using to normalize the weights we get . ∎
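The Lagrange-multiplier step can be checked numerically. The sketch below is not the paper's derivation: it assumes the simplified objective has the form "sum over k of w_k^2 * s_k^2" with the constraint "sum over k of w_k = 1" (the equations are not legible in this text, so that form is an assumption), and the function names and sample variances are hypothetical. Under that assumption the stationary point is w_k proportional to 1 / s_k^2.

```python
import random

def inverse_variance_weights(variances):
    """Closed-form minimizer of sum(w_k^2 * s_k^2) subject to sum(w_k) = 1.

    Setting the gradient of the Lagrangian to zero gives w_k = lam / (2 * s_k^2);
    the constraint then fixes lam, leaving w_k proportional to 1 / s_k^2.
    """
    inverse = [1.0 / s2 for s2 in variances]
    total = sum(inverse)
    return [x / total for x in inverse]

def objective(weights, variances):
    """Assumed simplified objective: weighted sum of variances."""
    return sum(w * w * s2 for w, s2 in zip(weights, variances))

variances = [0.5, 2.0, 4.0]  # hypothetical per-estimate variances
w_star = inverse_variance_weights(variances)
best = objective(w_star, variances)

# Sanity check: no random feasible weight vector does better.
rng = random.Random(0)
for _ in range(1000):
    raw = [rng.uniform(-1.0, 2.0) for _ in variances]
    s = sum(raw)
    if abs(s) < 1e-6:
        continue  # cannot be rescaled to sum to one
    candidate = [r / s for r in raw]
    assert objective(candidate, variances) >= best - 1e-12
```

Estimates with smaller variance receive larger weight, which is the qualitative behavior the closed-form solution expresses.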

Note: if we relax the assumption that for , the proof proceeds similarly, although from (17) it becomes more convenient to proceed in matrix notation. Define a matrix with entries given by

Then one finds that

Appendix B Sub-sampling and Balancing Multi-label Dataset

For a single-label dataset, one can sub-sample and balance at the per-class level so that each class has the same volume of data. Unlike in a single-label dataset, classes in a multi-label dataset can be correlated. As a result, sampling a single data entry may add volume to more than one class, which makes the naive per-class sub-sampling approach difficult.

To uniformly sub-sample and balance AudioSet to get mini-AudioSet, we propose the following algorithm:

Data: Original Multi-Class Dataset , Minimum Class Threshold , Target Class Volume
Result: Balanced Sub-sampled Multi-label Dataset
Initialize empty dataset ;
Remove labels from such that label volume is less than ;
Randomly shuffle entries in ;
for Data Entry  do
       Choose class of such that the volume of is the smallest in ;
       Let the volume of be in ;
       Let the volume of be in ;
       Generate random number to be an integer between and ;
       if  then
             Select to ;
             Skip and continue ;
       end if
end for
Algorithm 1 Sub-sampling and Balancing Multi-label Dataset
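A minimal Python sketch of this kind of procedure follows. The bounds of the random draw and the acceptance test are not legible in the listing above, so the rule used here (draw an integer below the class's volume in the source dataset and accept when it falls below the class's remaining quota) is an assumption, as are the function and parameter names.

```python
import random
from collections import Counter

def subsample_balanced(dataset, min_volume, target_volume, seed=0):
    """dataset: list of label lists. Returns indices of the selected entries."""
    rng = random.Random(seed)

    # Remove labels whose volume is below the minimum class threshold.
    volume = Counter(label for labels in dataset for label in labels)
    kept = {label for label, n in volume.items() if n >= min_volume}
    entries = [(i, [l for l in labels if l in kept]) for i, labels in enumerate(dataset)]
    entries = [(i, labels) for i, labels in entries if labels]
    source_volume = Counter(l for _, labels in entries for l in labels)

    rng.shuffle(entries)
    selected, selected_volume = [], Counter()
    for index, labels in entries:
        # Class of this entry with the smallest volume in the source dataset.
        rarest = min(labels, key=lambda l: source_volume[l])
        remaining_quota = target_volume - selected_volume[rarest]
        # ASSUMED acceptance rule (not taken from the source listing).
        r = rng.randrange(source_volume[rarest])
        if r < remaining_quota:
            selected.append(index)
            selected_volume.update(labels)  # one entry adds volume to all its labels
    return selected

# Hypothetical toy data: two frequent classes and one class below the threshold.
data = [["x"]] * 100 + [["y"]] * 50 + [["z"]] * 2
picked = subsample_balanced(data, min_volume=5, target_volume=10, seed=1)
counts = Counter(data[i][0] for i in picked)
```

Because an entry adds volume to every label it carries, a frequent label can still exceed the target in a genuinely multi-label dataset; the quota is only enforced on the rarest label of each entry.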

Appendix C Details on Model Architectures

C.1 Late Fusion By Concatenation

In late fusion by concatenation strategy, we concatenate the output features from each individual network (i.e. modalities’ 1-D vectors with dimensions). If needed, we add dropout after the feature concatenations.

The fusion network is composed of two FC layers, each followed by a ReLU layer, and a linear classifier. The first FC maps dimensions to dimensions, and the second one maps to . The classifier maps to , where is the number of classes.
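The structure described above can be sketched in NumPy as a forward pass only. The layer sizes are not legible in this text, so `hidden_dim`, the feature sizes, the use of one shared hidden size for both FC layers, and the class name `LateFusionHead` are all hypothetical; dropout is omitted.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class LateFusionHead:
    """Concatenate per-modality feature vectors, then FC-ReLU, FC-ReLU, classifier."""

    def __init__(self, feature_dims, hidden_dim, num_classes, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = sum(feature_dims)
        # hidden_dim is a hypothetical size; the source dimensions are not given here.
        self.w1 = rng.normal(0.0, 0.01, (in_dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.normal(0.0, 0.01, (hidden_dim, hidden_dim))
        self.b2 = np.zeros(hidden_dim)
        self.wc = rng.normal(0.0, 0.01, (hidden_dim, num_classes))
        self.bc = np.zeros(num_classes)

    def __call__(self, features):
        x = np.concatenate(features, axis=1)  # (batch, sum of feature dims)
        x = relu(x @ self.w1 + self.b1)       # first FC + ReLU
        x = relu(x @ self.w2 + self.b2)       # second FC + ReLU
        return x @ self.wc + self.bc          # linear classifier (logits)

rng = np.random.default_rng(1)
audio = rng.normal(size=(4, 128))   # hypothetical audio features
video = rng.normal(size=(4, 256))   # hypothetical visual features
head = LateFusionHead([128, 256], hidden_dim=64, num_classes=10)
logits = head([audio, video])
```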

As a sanity check, we experimented with fewer or more layers on Kinetics:

  • 0 FC. We only add a classifier that maps dimensions to dimensions.

  • 1 FC. We add one FC layer that maps dimensions to dimension, followed by a ReLU layer and a classifier to map dimension to dimensions.

  • 4 FC. We add one FC layer that maps dimensions to dimension, followed by a ReLU layer. Then we add 3 FC-ReLU pairs that preserve the dimensions. Then we add a classifier to map dimension to dimensions.

We found that the results of all these variants are sub-optimal. We speculate that fewer layers may fail to fully learn the relations between the features, while a deeper fusion network overfits more.

C.2 Mid Fusion By Concatenation

Inspired by Owens_2018_ECCV, we also concatenate the features from each stream at an earlier stage instead of using late fusion. The difficulty with mid fusion is that features from individual streams can have different dimensions. For example, audio features are 2-D (time-frequency) while visual features are 3-D (time-height-width).

We propose three ways to match the dimension, depending on the output dimension of the concatenated features:

  • 1-D Concat. We downsample the audio features to 1-D by average pooling on the frequency dimension. We downsample the visual features to 1-D by average pooling over the two spatial dimensions.

  • 2-D Concat. We keep the audio features the same and match the visual features to audio features. We downsample the visual features to 1-D by average pooling over the two spatial dimensions. Then we tile the 1-D visual features on frequency dimension to make 2-D visual features.

  • 3-D Concat. We keep the visual features fixed and match the audio features to the visual features. We downsample the audio features to 1-D by average pooling over the frequency dimension. Then we tile the 1-D audio features over the two spatial dimensions to make 3-D audio features.
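The three dimension-matching options can be sketched with NumPy pooling and tiling. The sketch assumes a channel-first layout, (channels, time, frequency) for audio and (channels, time, height, width) for visual features, and assumes the temporal lengths already match; the function names and tensor sizes are hypothetical.

```python
import numpy as np

def concat_1d(audio, visual):
    """Pool both streams down to (channels, time), then concatenate channels."""
    a = audio.mean(axis=2)            # average over frequency
    v = visual.mean(axis=(2, 3))      # average over height and width
    return np.concatenate([a, v], axis=0)

def concat_2d(audio, visual):
    """Keep audio as is; pool visual to 1-D and tile it along frequency."""
    v = visual.mean(axis=(2, 3))                            # (Cv, T)
    v = np.repeat(v[:, :, None], audio.shape[2], axis=2)    # (Cv, T, F)
    return np.concatenate([audio, v], axis=0)

def concat_3d(audio, visual):
    """Keep visual as is; pool audio to 1-D and tile it over height and width."""
    a = audio.mean(axis=2)                                  # (Ca, T)
    a = np.broadcast_to(a[:, :, None, None], a.shape + visual.shape[2:])
    return np.concatenate([a, visual], axis=0)

rng = np.random.default_rng(0)
audio = rng.normal(size=(8, 16, 40))      # hypothetical (C, T, F)
visual = rng.normal(size=(12, 16, 7, 7))  # hypothetical (C, T, H, W)
fused_1d = concat_1d(audio, visual)
fused_2d = concat_2d(audio, visual)
fused_3d = concat_3d(audio, visual)
```

The output sizes make the cost difference visible: the 3-D variant carries every audio channel at every spatial location, which is consistent with it being the most memory-hungry option.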

The temporal dimension may also be mismatched between the streams: the audio stream is usually longer than the visual stream. We add convolution layers with a stride of 2 to downsample the audio stream if we are performing 2-D concat. Otherwise, we upsample the visual stream by replicating features along the temporal dimension.

There are five blocks in the backbones of our ablation experiments (section 4), and we fuse the features using all three strategies after block 2, block 3, and block 4. Due to memory constraints, fusion using 3-D concat after block 2 is infeasible. On Kinetics, we found that 3-D concat after block 3 works best, and this is the result reported in Table 2. In addition, we found that 2-D concat works best on AudioSet and uses fewer GFLOPs than 3-D concat. We speculate that the best method for dimension matching is task-dependent.

C.3 SE Gate

The Squeeze-and-Excitation network introduced in SENet applies a self-gating mechanism to produce a collection of per-channel weights. Similar strategies can be applied in a multi-modal network to take inputs from one stream and produce channel weights for the other stream.

Specifically, we perform global average pooling on one stream and use the same architecture as in SENet to produce a set of weights for the other stream. Then we scale the channels of the other stream using the learned weights. We either use a ResNet-style skip connection to add the new features or directly replace the features with the scaled features. The gate can be applied in one direction or in both directions. The gate can also be added multiple times at different levels. We found that on Kinetics, it works best when applied after block 3 and in both directions.

We note that we can also first concatenate the features and use features from both streams to learn the per-channel weights. The results are similar to learning the weights with a single stream.
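A cross-stream SE-style gate can be sketched as a NumPy forward pass. The squeeze, FC, ReLU, FC, sigmoid sequence follows the standard Squeeze-and-Excitation design; the function name `se_gate`, the reduction size, the channel-first layout, and the tensor sizes are hypothetical, and the weights are random rather than learned.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_gate(source, target, w_reduce, w_expand, residual=True):
    """Gate the channels of `target` with weights computed from `source`.

    source: (Cs, ...) features of the gating stream
    target: (Ct, ...) features of the gated stream
    w_reduce: (Cs, hidden), w_expand: (hidden, Ct)
    """
    # Squeeze: global average pooling over all non-channel dimensions.
    squeezed = source.reshape(source.shape[0], -1).mean(axis=1)   # (Cs,)
    # Excitation: FC -> ReLU -> FC -> sigmoid, one weight per target channel.
    hidden = np.maximum(squeezed @ w_reduce, 0.0)
    weights = sigmoid(hidden @ w_expand)                          # (Ct,)
    shape = (target.shape[0],) + (1,) * (target.ndim - 1)
    scaled = target * weights.reshape(shape)
    # Either add the scaled features via a skip connection or replace the input.
    return target + scaled if residual else scaled

rng = np.random.default_rng(0)
audio = rng.normal(size=(8, 16, 40))       # hypothetical (C, T, F)
visual = rng.normal(size=(12, 16, 7, 7))   # hypothetical (C, T, H, W)
w_reduce = rng.normal(size=(8, 4))
w_expand = rng.normal(size=(4, 12))
gated = se_gate(audio, visual, w_reduce, w_expand, residual=False)
```

Gating in both directions amounts to calling the function twice with the roles of the two streams swapped and separate weights.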

C.4 NL Gate

Although lightweight, the SE-gate does not offer any attention at the spatial-temporal or frequency-temporal level. An alternative is to apply an attention-based gate. We are inspired by the Query-Key-Value formulation of gates in AttentionAll17. For example, if we are gating from the audio stream to the visual stream, then the visual stream is the Query and the audio stream is the Key and Value. The output has the same spatial-temporal dimension as the Query.

Specifically, we use the Non-Local gate in XiaolongWang18 as the implementation of the Query-Key-Value attention mechanism. Details of the design are illustrated in fig. 4. Similar to the SE-gate, the NL-Gate can be added in multiple directions and at multiple positions. We found that it works best when added after block 4, with a 2-D concat of audio and RGB features as Key-Value and visual features as Query to gate the visual stream.

Figure 4: NL-Gate Implementation. Figure of the implementation of NL-Gate on visual stream. Visual features are the Query. The 2D Mid-Concatenation of visual and audio features is the Key and Value.
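The Query-Key-Value gating idea can be sketched in NumPy as follows. This is a generic attention forward pass under stated assumptions, not the exact design in the figure: it uses audio features alone as Key and Value, applies the scaled dot-product form, adds the result residually to the Query stream, and uses hypothetical names, projection sizes, and random weights.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def nl_gate(query_feat, kv_feat, w_q, w_k, w_v, w_out):
    """Attention gate: every Query position attends over all Key/Value positions.

    query_feat: (Cq, ...) stream being gated; kv_feat: (Ckv, ...) gating stream.
    w_q: (Cq, d), w_k: (Ckv, d), w_v: (Ckv, d), w_out: (d, Cq)
    """
    cq = query_feat.shape[0]
    q = query_feat.reshape(cq, -1).T @ w_q            # (Nq, d)
    kv = kv_feat.reshape(kv_feat.shape[0], -1).T      # (Nk, Ckv)
    k = kv @ w_k                                      # (Nk, d)
    v = kv @ w_v                                      # (Nk, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[1]), axis=1)   # (Nq, Nk)
    out = (attn @ v) @ w_out                          # (Nq, Cq)
    # Residual add keeps the output in the Query's spatial-temporal shape.
    return query_feat + out.T.reshape(query_feat.shape)

rng = np.random.default_rng(0)
visual = rng.normal(size=(12, 4, 7, 7))   # hypothetical Query stream (C, T, H, W)
audio = rng.normal(size=(8, 4, 40))       # hypothetical Key/Value stream (C, T, F)
d = 16
w_q = rng.normal(size=(12, d))
w_k = rng.normal(size=(8, d))
w_v = rng.normal(size=(8, d))
identity_out = nl_gate(visual, audio, w_q, w_k, w_v, np.zeros((d, 12)))
gated = nl_gate(visual, audio, w_q, w_k, w_v, rng.normal(size=(d, 12)))
```

With a zero output projection the gate reduces to the identity on the Query stream, which is a convenient starting point when inserting such a block into an existing network.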

Appendix D Additional Ablation Results

D.1 Training Accuracy

In section 4.2, we introduced the overfitting problem in joint training of multi-modal networks. Here we include both the validation accuracy and the train accuracy of the multi-modal networks (Table 8). In all cases, the multi-modal networks perform worse on validation than their best single-modal counterparts, while almost all of their train accuracies are higher (the sole exception is OF+A, whose train accuracy is similar to that of the audio network).

Dataset      Modality        Validation Accuracy (%)   Train Accuracy (%)
Kinetics     A               19.7                      85.9
Kinetics     RGB             72.6                      90.0
Kinetics     OF              62.1                      75.1
Kinetics     A + RGB         71.4                      95.6
Kinetics     RGB + OF        71.3                      91.9
Kinetics     A + OF          58.3                      83.2
Kinetics     A + RGB + OF    70.0                      96.5
mini-Sport   A               22.1                      56.1
mini-Sport   RGB             62.7                      77.6
mini-Sport   A + RGB         60.2                      84.2
Table 8: Multi-modal networks have lower validation accuracy but higher train accuracy. Top-1 accuracy of single-stream models and naive late fusion models. Single-stream modalities include RGB, Optical Flow (OF), and Audio Signal (A). The higher train accuracy and lower validation accuracy of the multi-modal networks signal severe overfitting.
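The gap between train and validation accuracy can be read off Table 8 with a few lines of Python. The numbers are copied from the table; the variable names and the use of "train minus validation" as an overfitting indicator are illustrative choices, not a measure defined in this appendix.

```python
# (dataset, modality) -> (validation accuracy, train accuracy), from Table 8.
results = {
    ("Kinetics", "A"): (19.7, 85.9),
    ("Kinetics", "RGB"): (72.6, 90.0),
    ("Kinetics", "OF"): (62.1, 75.1),
    ("Kinetics", "A + RGB"): (71.4, 95.6),
    ("Kinetics", "RGB + OF"): (71.3, 91.9),
    ("Kinetics", "A + OF"): (58.3, 83.2),
    ("Kinetics", "A + RGB + OF"): (70.0, 96.5),
    ("mini-Sport", "A"): (22.1, 56.1),
    ("mini-Sport", "RGB"): (62.7, 77.6),
    ("mini-Sport", "A + RGB"): (60.2, 84.2),
}

# Train-minus-validation gap per model.
gaps = {key: train - val for key, (val, train) in results.items()}

# Best single-modal validation accuracy per dataset.
best_single = {}
for (dataset, modality), (val, _) in results.items():
    if "+" not in modality:
        best_single[dataset] = max(best_single.get(dataset, 0.0), val)

for (dataset, modality), gap in gaps.items():
    print(f"{dataset:11s} {modality:13s} gap = {gap:5.1f}")
```

For example, on Kinetics the RGB network has a gap of 17.4 points, while A + RGB has a gap of 24.2 points and a lower validation accuracy.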

D.2 Early Stopping

For early stopping, we experimented with three different stopping schedules: using 25%, 50% and 75% of iterations per epoch. We found that although overfitting becomes less of a problem, the model tends to under-fit. In practice, the 75% schedule works best among the three, though its performance is worse than that of the full training schedule, which suffers from overfitting. We summarize the learning curves in fig. 5.

Figure 5: Early stopping avoids overfitting but tends to under-fit. Learning curves for the three early stopping schedules we experimented with. When we train the model for fewer iterations, the model does not overfit, but its poor performance indicates under-fitting instead.