
DRDF: Determining the Importance of Different Multimodal Information with Dual-Router Dynamic Framework

In multimodal tasks, we find that the relative importance of text and image information differs from one input case to another. Motivated by this, we propose a high-performance and highly general Dual-Router Dynamic Framework (DRDF), consisting of a Dual-Router, an MWF-Layer, experts and an expert fusion unit. The text router and image router in the Dual-Router accept text and image information respectively, and the MWF-Layer determines the importance of each modality. Based on this determination, the MWF-Layer generates fused weights for the fusion of experts. Experts are model backbones that match the current task. We have tested DRDF with 12 backbones such as Visual BERT on the multimodal dataset Hateful memes and the unimodal datasets CIFAR10, CIFAR100, and TinyImagenet. DRDF outperforms all the baselines. We also verify the components of DRDF in detail through ablations, and compare and discuss the reasons and ideas behind the DRDF design.



1. Introduction

Multimodal learning has developed rapidly in recent years, and a number of excellent works have emerged (Li et al., 2019b; Lu et al., 2019; Kiela et al., 2019). They extract and combine features from different modalities to obtain richer information. However, most of them have the following drawbacks: (1) they ignore differences in the importance of modal information, using one fixed architecture for all multimodal inputs even though the importance of each modality varies widely between input cases; (2) they study a single model architecture, which makes it difficult to integrate different works so that they complement each other; (3) they generalize poorly, and cannot be unified across unimodal tasks such as NLP and CV and multimodal tasks.

In our study, we found that the importance of information varies dramatically across modalities even within the same task. In real life, humans do not treat information from different modalities equally. For example, when reading a novel with illustrations, we learn the plot mainly from the text rather than from the illustrations, while in some comics we tend to pay more attention to the images than to the text. As shown in Figure 1, we can tell at a glance that the meaning of this picture is “you want to thank me”. The main information is in the text “thank you”, and the background of the picture does not give us such clear information: even if we replace the cat in the background image with a dog or a fish, the meaning of the meme hardly changes. Likewise, the creators of Hateful memes (Kiela et al., 2020) mention that replacing the text or the image in many memes does not change the original meaning, which means that in some multimodal samples one modality carries more important information than the other. Even within a specific dataset like Hateful memes, the importance of each modality differs greatly from one input case to another.

This difference in the importance of modal information is widespread in multimodal tasks. A model that treats all modalities equally for every input case, even within a single task, is therefore likely to make many errors.

Based on the above motivation, we designed the Dual-Router Dynamic Framework (DRDF) for both multimodal and unimodal learning, which is high-performance, general and modular. It consists of a Dual-Router (image router and text router), multiple experts, an MWF-Layer (Modal Weight Fusion Layer) and an expert fusion unit. The image router and text router generate two different sets of weights (called text weights and image weights) based on the different modal information of the input case. After the importance of each modality has been determined, these two sets of weights are fused in the MWF-Layer into a single set of fused weights, which guides the fusion of the experts to produce the final result best suited to that input case.

Compared with previous work, our architecture (1) can compare the importance of multiple modalities in the MWF-Layer and fuse dynamically to get the result that best fits the current input case; (2) is a highly modular framework that is orthogonal to existing work and can help existing models further improve performance; (3) is general, can be applied to a large number of models, and can be adapted to both multimodal and unimodal tasks.

Experts can be a variety of existing model backbones. The experts have strengths and weaknesses, and Dual-Router and MWF-Layer output weights to guide the complementary integration of the experts, so DRDF can ultimately achieve a better performance than the original individual experts.

Image router and text router in Dual-Router are two simple neural networks that simply look at the input case and make a rough judgment about the modal information. Dual-Router generates two sets of weights, one from the text modality and one from the image modality. In each specific input case, the importance of the two modalities is different, so we design a novel MWF-Layer, which can determine the importance of each modal information based on the data distribution by two types of weights. MWF-Layer will fuse the text weights and image weights according to the importance of the different modal information, so as to produce fused weights that fit the current input case and guide the fusion process of experts to get the best final results in the direction of focusing on the more important modalities.

Experts work like a traditional multimodal model, accepting multiple modal information from the input case and then giving prediction results. The core of DRDF therefore lies in two data streams for each input case: the first passes through the Dual-Router and the MWF-Layer, and the second passes directly through the experts and the expert fusion unit to produce the prediction result.

Dual-Router and experts can be replaced with any available existing backbone, and for unimodal tasks we can use only one of the routers, which makes DRDF modular and general.

In summary, our contributions are as follows:

  • We propose a novel high-performance and general framework, DRDF, with a novel Dual-Router adapted to both multimodal and unimodal tasks. We experimentally verify the effectiveness of Dual-Router and explore the impact of continuity on the Dual-Router design.

  • We design a novel MWF-Layer that can determine the importance of different modal information and weight the fusion according to that importance. We verify its effectiveness by experiments.

  • We extend DRDF to 12 backbones such as VGG16, WideResnet, Resnet, MMBT, Visual BERT, etc., and run experiments on 4 datasets including multimodal and unimodal datasets. DRDF outperforms all the baselines, illustrating its high performance and generality.

2. Related work

Multimodal Works

In recent years, there has been a lot of work on multimodality. In Visual BERT (Li et al., 2019b), Unicoder-VL (Li et al., 2020) and VL-BERT (Su et al., 2019), textual and visual information are fused at the beginning, while in ViLBERT (Lu et al., 2019) and LXMERT (Tan and Bansal, 2019), the textual and visual information first go through two separate encoding modules before being fused through mutual attention mechanisms. These works attempt to transfer unimodal frameworks such as BERT to the multimodal setting and to build a generalizable multimodal feature learning model.

In the process of multimodal information fusion, there is quite a bit of work using attention. (Hori et al., 2017) uses attention to fuse the multimodal information and features, and achieves excellent performance in the video description task. (Ovalle et al., 2017) proposes GMU, based on a gated neural network, which can weight the fusion of information from different modalities in the video task.

While the above works all optimize the model itself, our work focuses on the connection and scheduling between models. Our work is orthogonal to all the above works: in our experiments, DRDF takes these works as backbones, and uses Dual-Router and MWF-Layer to schedule them as experts and help them achieve higher performance.

Dynamic Neural Network and Mixture of Experts

According to this comprehensive and detailed survey(Han et al., 2021), dynamic neural networks and mixture of experts are increasingly popular fields, with excellent works coming out all the time. SENet(Hu et al., 2018) slices the traditional network into squeeze and excitation steps, and increases performance with a small extra cost by first descending and then ascending. In DY-CNN(Chen et al., 2020), each layer uses attention to gather multiple input independent convolutional kernels and fuse them, exchanging a small consumption for better performance. (Sharma et al., 2021) divides the dataset into manyshot, mediumshot and fewshot according to the number of class inclusions, trains on these subsets respectively to ensure the diversity of the experts, pretrains the baseline on the whole dataset for knowledge transfer, and then different experts for different subsets for finetuning to handle the long-tailed tasks.

Some methods(Chen et al., 2017; Woo et al., 2018) make the network dynamic by ignoring or cutting certain regions. SKNet(Li et al., 2019a) proposes an efficacious module on the basis of channel attention. (Zhong et al., 2020) proposes the pixel-group attention to enrich spatial information in SENet. Deformable Kernels(Gao et al., 2019) resamples the original kernel space to fit the effective receptive field. CondConv(Yang et al., 2019) uses the routing function to output weights for linear fusion with experts, which can improve performance while keeping the inference costs in an acceptable range.

Our work is very different from the above work. First, our scope is mainly multimodal tasks. Second, CondConv only uses the routing function as a weight generator without any further meaning, while our Dual-Router is more explainable, because it is designed for the text modality and the image modality respectively, and uses the MWF-Layer to determine the importance of the modal information and fuse accordingly. In both its overall design concept and its model structure, DRDF is novel. In addition, DRDF is highly modular: experts can be many existing backbones, without any changes to the backbone internals.

Figure 2. Workflow of the DRDF. The text and image inputs are passed through the text router and image router respectively, generating corresponding text weights and image weights. The importance of the two types of weights is determined in the Importance Determiner (the Std Calculator is used to get the standard deviation of the weights), and the Weights Fusion Unit fuses them to generate fused weights according to the importance of the modalities. Experts directly process the input multimodal information to obtain multiple results. The fused weights instruct the expert fusion unit how to fuse the experts' results into the final result.

3. Dual-Router Dynamic Framework

DRDF is a general and high-performance framework which, as shown in Figure 2, consists of four parts. For the sake of illustration, we assume that the number of experts is $N$.

(1) Dual-Router contains a text router and an image router, which accept the text and image information of the multimodal input, respectively, and generate corresponding text weights and image weights by forward propagation. The dimension of each is $N \times 1$ (without considering the batch size). These weights are input to the MWF-Layer for importance determination and fusion. The role of the two routers is not fine classification but a preliminary analysis of the two input modalities, so a router is generally a simple neural network; for example, the text router we choose is an LSTM, while the image router we choose is VGG16.

(2) The Importance Determiner in the MWF-Layer determines the importance of the two modalities from the data distribution of the text weights and image weights, and the Weights Fusion Unit fuses them according to the difference in importance to obtain fused weights (with the same dimension of $N \times 1$) for the experts. We explain the details of the MWF-Layer later.

(3) The experts share the same structure but have different parameter values. Each expert accepts the input text information and image information and outputs its own result. The base performance of the experts determines the overall performance of the whole DRDF.

(4) In the expert fusion unit, the experts' results are fused by weighting them with the fused weights to get the final result.

An input case passes through DRDF along two data streams: the first, through the Dual-Router and the MWF-Layer, produces the fused weights according to the importance of each modality for the current input case; the second passes directly through the experts and the expert fusion unit.

In addition, when facing a unimodal task such as image classification in CV, only the image router works while the text router receives no data, so the fused weights are exactly the same as the image weights and DRDF still runs normally. Likewise, for an NLP task we only need to enable the text router and replace the experts with backbones suitable for that task, so our framework is also compatible with unimodal NLP tasks.

When adding a new modality, as long as an existing neural network available for the modal task is found as router and experts (it is not difficult because DRDF can be adapted to almost all the backbones), MWF-Layer can still work and get the fused weights, so the generality of DRDF is excellent.

The Dual-Router and the experts in DRDF can thus be replaced arbitrarily without any internal changes to the backbones, which lets DRDF work together with many popular models today. The dynamic mechanism combining Dual-Router and MWF-Layer essentially compensates for the experts' deficiencies and thereby achieves performance improvements.

3.1. Dual-Router

In DRDF, Dual-Router contains a text router and an image router, both of which follow the same calculation principle:

$$w = \mathrm{Sigmoid}(\mathrm{Router}(x))$$

where $x$ is the input and $w$ is the text weights or image weights.
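As a rough illustration of the router computation described here, the sketch below maps a feature vector to one weight per expert and squashes the result with a Sigmoid. It is only a sketch under stated assumptions: a single linear layer stands in for the real router backbone (the paper uses an LSTM and VGG16), and the names `router_weights`, `W`, `b` and the sizes are hypothetical.

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def router_weights(features, W, b):
    """Map a feature vector to N expert weights, each in (0, 1).

    A single linear layer is an assumed stand-in for the router backbone.
    """
    logits = W @ features + b
    return sigmoid(logits)


rng = np.random.default_rng(0)
num_experts, feat_dim = 4, 8                # hypothetical sizes
W = rng.normal(size=(num_experts, feat_dim))
b = np.zeros(num_experts)

x_text = rng.normal(size=feat_dim)          # placeholder text features
w_text = router_weights(x_text, W, b)       # one weight per expert
print(w_text.shape)
```

Because each weight is squashed independently, the outputs are non-negative and several experts can receive a high weight at once.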

In fact, Dual-Router makes a coarse judgment of the multimodal information, and the result of this judgment is output in the form of weights for experts fusion. This means that Dual-Router does not require too fine and accurate judgments.

It is worth mentioning that the raw outputs of the routers cannot be fed directly into the MWF-Layer. First, the backbones of the text router and the image router are very different, so the distributions of their outputs are also very different and not comparable; it is not reasonable to let the MWF-Layer compare importance on them directly. Second, if the raw router output were used in the subsequent expert fusion, there could be negative weights, which is not intuitive for model fusion. We use Sigmoid because DRDF fuses the experts rather than picking one: we prefer a reasonable weight for each expert over deliberately selecting the single expert with the highest weight. For Dual-Router, this is essentially a multi-label decision (several experts may be weighted highly at once) rather than a single-label one, and in this setting Sigmoid works better than ReLU and Softmax. We also verify this in our subsequent experiments.
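The contrast between the three activations can be seen on a small example. The logits below are made-up values chosen for illustration, not outputs of any trained router.

```python
import numpy as np

# Hypothetical router logits for four experts: two are strongly preferred.
logits = np.array([2.0, 2.0, -1.0, -3.0])

# Sigmoid scores each expert independently.
sig = 1.0 / (1.0 + np.exp(-logits))

# Softmax makes the experts compete: the weights must sum to 1.
exp = np.exp(logits - logits.max())
soft = exp / exp.sum()

# ReLU clips every negative logit to exactly zero.
relu = np.maximum(logits, 0.0)

print(np.round(sig, 3))
print(np.round(soft, 3))
print(relu)
```

With Sigmoid both preferred experts keep a weight above 0.8; with Softmax the same two experts share the total mass and each stays below 0.5; with ReLU the two negative logits become indistinguishable zeros.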

3.2. Modal Weight Fusion Layer(MWF-Layer)

MWF-Layer accepts the weights from Dual-Router, judges the importance of each modality from the data distribution of its weights, and then generates fused weights based on that importance. During the testing of routers, under our initialization conditions, we find that when a router learns poorly it tends to output a nearly uniform set of weights, and when it outputs weights with a pronounced preference it tends to achieve a relatively good performance improvement. This is also consistent with the intuition that the weights output by a router should be spread out enough to help assemble a diverse set of final results. Based on this motivation, and after testing many computational approaches, we chose the following algorithm:

$$I_{text} = \frac{\mathrm{std}(w_{text})}{\mathrm{std}(w_{text}) + \mathrm{std}(w_{img})}, \qquad I_{img} = \frac{\mathrm{std}(w_{img})}{\mathrm{std}(w_{text}) + \mathrm{std}(w_{img})}$$

where $\mathrm{std}(\cdot)$ is the standard deviation, the above formula holds when $\mathrm{std}(w_{text}) + \mathrm{std}(w_{img}) \neq 0$, and $I$ is the importance of the corresponding modal information. If $\mathrm{std}(w_{text}) + \mathrm{std}(w_{img}) = 0$, then $\mathrm{std}(w_{text}) = \mathrm{std}(w_{img}) = 0$, which means that the text weights and the image weights each assign exactly the same weight to every expert, so we consider the text importance and image importance equal in this case, $I_{text} = I_{img} = 0.5$. Then we fuse the weights according to the importance of the modalities, as follows:

$$w = I_{text} \, w_{text} + I_{img} \, w_{img}$$

$w$ represents the fused weights and is used for the fusion of experts to get the final result. In unimodal tasks, since there is only one modality, no calculation is necessary: the $I$ of that modality is directly taken to be 1 and that of the other modality 0. The MWF-Layer is therefore also compatible with unimodal tasks.
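The importance determination and weight fusion can be sketched as follows. This is an assumption-laden reading rather than the authors' implementation: importance is taken to be each modality's share of the summed standard deviations, with the equal-importance fallback when both deviations are zero and the 1/0 rule for unimodal inputs. The function name `mwf_fuse` is hypothetical.

```python
import numpy as np


def mwf_fuse(w_text=None, w_image=None):
    """Return (fused_weights, importance_text, importance_image).

    Assumed rule: importance = std of a modality's weights divided by
    the sum of both stds; 0.5 / 0.5 if both stds are zero.
    """
    # Unimodal case: the only available modality gets importance 1.
    if w_text is None:
        return np.asarray(w_image, dtype=float), 0.0, 1.0
    if w_image is None:
        return np.asarray(w_text, dtype=float), 1.0, 0.0

    w_text = np.asarray(w_text, dtype=float)
    w_image = np.asarray(w_image, dtype=float)
    s_text, s_image = float(np.std(w_text)), float(np.std(w_image))
    total = s_text + s_image

    if total == 0.0:
        imp_text = imp_image = 0.5      # both routers are uniform
    else:
        imp_text, imp_image = s_text / total, s_image / total

    fused = imp_text * w_text + imp_image * w_image
    return fused, imp_text, imp_image


# A uniform text router carries no preference, so the image router dominates.
fused, imp_t, imp_i = mwf_fuse([0.5, 0.5, 0.5, 0.5], [0.9, 0.1, 0.8, 0.2])
print(fused, imp_t, imp_i)
```

In this example the text weights have zero spread, so the fused weights equal the image weights.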

Unlike ordinary attention-based fusion of modal information and features, the MWF-Layer fuses the weights that Dual-Router outputs for expert fusion, rather than the modal information itself; it is a mixture-of-experts style of fusion. Its purpose is to let the subsequent experts be fused into a final result that is better suited to the current input sample, whereas multimodal information fusion in previous work aims to give subsequent models more reasonable input vectors. The two are fundamentally different.

MWF-Layer essentially encourages diversity in the router weights. During optimization, a modality whose weights have a larger variance is more diverse in its weights, so it acquires greater importance and more involvement in the training process, and is in turn better optimized during backpropagation, forming a virtuous circle. We also experimentally validate the effect of MWF-Layer.

3.3. Experts

Expert can be a variety of existing backbones, and for specific tasks such as multimodal classification tasks, expert can be a backbone that can handle multimodal classification tasks alone.

The role of the experts is to process the input case and give candidate results for the expert fusion unit to fuse into the final result, as follows:

$$y_i = E_i(x)$$

where $E_i$ is the $i$-th expert, $y_i$ is the output of the $i$-th expert, and $x$ is the input case.

It is worth mentioning that the experts can be a single model or a multi-model combination of framework, as long as to ensure that its input and output meet the requirements of the current task. This means that our work is not in direct competition with previous works on multimodal information fusion, and the various multimodal models proposed by previous authors may become experts of our framework, thus obtaining performance gains.

3.4. Expert Fusion Unit

The expert fusion unit receives the fused weights from the preceding MWF-Layer and uses them to weight the output of each expert, obtaining the final result as follows:

$$y = \sum_{i=1}^{N} w_i \, y_i$$

where $y$ is the final output of DRDF, $N$ is the number of experts, $w_i$ is the $i$-th fused weight from the MWF-Layer, and $y_i$ is the result from the $i$-th expert. Dual-Router has in effect made a rough classification of the input case, and the expert fusion according to the fused weights is a fine-grained classification of the input case.
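A minimal sketch of this weighted result fusion is shown below. The function name `fuse_experts` and the toy expert outputs are illustrative assumptions; in practice each row would be the prediction of one backbone.

```python
import numpy as np


def fuse_experts(expert_outputs, fused_weights):
    """Weighted sum of expert outputs.

    expert_outputs: shape (N, C), one row of class scores per expert.
    fused_weights:  shape (N,), one weight per expert.
    Returns a vector of shape (C,).
    """
    outputs = np.asarray(expert_outputs, dtype=float)
    weights = np.asarray(fused_weights, dtype=float)
    return weights @ outputs


# Two toy experts that disagree on a two-class problem.
outputs = [[1.0, 0.0],
           [0.0, 1.0]]
final = fuse_experts(outputs, [0.75, 0.25])
print(final)  # [0.75 0.25]
```

The expert with the larger fused weight contributes proportionally more to the final scores.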

In fact, any fusion approach that is done in a weighted fusion manner can be applied in DRDF. For example, when the expert is a purely linear layer, the parameter fusion instead of result fusion mentioned in CondConv(Yang et al., 2019) can be applied. In addition, many dynamic network-related fusion methods can be substituted here, which also makes our framework very modular in the concept of the mixture of experts and dynamic neural network.
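For purely linear experts, fusing the parameters first and fusing the results afterwards give the same output, since the weighted sum commutes with a linear map. The sketch below checks this numerically; the shapes, random matrices and weights are arbitrary illustrative choices, and bias terms are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
num_experts, d_in, d_out = 4, 5, 3            # hypothetical sizes

# One weight matrix per linear expert, stacked along the first axis.
expert_mats = rng.normal(size=(num_experts, d_out, d_in))
x = rng.normal(size=d_in)
w = np.array([0.9, 0.1, 0.6, 0.3])            # fused weights, one per expert

# Result fusion: run every expert, then combine the outputs.
result_fusion = sum(w[i] * (expert_mats[i] @ x) for i in range(num_experts))

# Parameter fusion: combine the matrices once, then run a single expert.
fused_mat = np.tensordot(w, expert_mats, axes=1)   # shape (d_out, d_in)
param_fusion = fused_mat @ x

print(np.allclose(result_fusion, param_fusion))
```

Parameter fusion needs only one forward pass per input, which is why it is attractive when the experts are linear; for non-linear backbones the two are no longer equivalent and result fusion is the safe choice.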

4. Experiments

4.1. Setup

We have extended DRDF to a lot of existing models and obtained significant performance improvements on several datasets. We validated both multimodal and unimodal tasks.

In the multimodal task, the dataset we selected is Hateful memes (Kiela et al., 2020), a dataset and benchmark centered around detecting hate speech in multimodal memes. Some of the memes in the dataset are original memes from social media, while others are new memes that are similar to the original memes but have very different meanings, created by manually replacing the background or the text of the memes. We extended DRDF to backbones such as Late Fusion, Concat BERT (Kiela et al., 2020), MMBT-Grid, MMBT-Region (Kiela et al., 2019), ViLBERT (Lu et al., 2019), Visual BERT (Li et al., 2019b), ViLBERT CC (Sharma et al., 2018), and Visual BERT COCO. We use AUROC and accuracy as metrics here. It is worth mentioning that since test-set evaluation on Hateful memes requires uploading predictions to a website, for convenience and fairness we compare all baselines and their DRDF extensions on the validation set.

In the unimodal task, we tested image classification on the datasets CIFAR10 (Krizhevsky et al., 2009), CIFAR100 (Krizhevsky et al., 2009) and TinyImagenet (Le and Yang, 2015) (we think the unimodal test is not the core of the paper, but only supports the generality of DRDF, so there is no need to use the time-consuming full Imagenet). We borrowed part of the baselines of NBDT (Wan et al., 2020), and made DRDF extensions on WideResnet28*10 (Zagoruyko and Komodakis, 2016), Resnet18 (He et al., 2016) and VGG16 (Simonyan and Zisserman, 2015). In the image classification task, we only use the image router, and do not need the text router.

For the unimodal text classification problem, we perform a DRDF extension to Text BERT(Devlin et al., 2019) and test it on hateful memes. Since Text BERT only accepts text modal information from hateful memes, this can be considered as a DRDF performance test for text modality.

For unimodal tests, we have used accuracy as the metric.

We use the pretrained backbones with added noise as experts, and finetune them on the downstream datasets. In the above evaluation, we set the text router to be a traditional LSTM, the image router to be VGG16, and the number of experts to be 4.

| Model | Metric | Late Fusion | Concat BERT | MMBT-Grid | MMBT-Region | ViLBERT | Visual BERT | ViLBERT CC | Visual BERT COCO | Text BERT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NN | AUROC | 65.97 | 65.25 | 68.57 | 71.03 | 71.13 | 70.60 | 70.07 | 73.97 | 64.65 |
| NN | Acc | 61.53 | 58.60 | 58.20 | 58.73 | 62.20 | 62.10 | 61.40 | 65.06 | 58.26 |
| DRDF | AUROC | 66.98 | 67.12 | 70.24 | 72.13 | 72.34 | 71.10 | 71.09 | 74.80 | 66.88 |
| DRDF | Acc | 64.82 | 59.81 | 61.67 | 60.74 | 64.26 | 62.59 | 62.22 | 65.19 | 60.93 |
| Improvement | AUROC | 1.01 | 1.87 | 1.67 | 1.10 | 1.21 | 0.50 | 1.02 | 0.83 | 2.23 |
| Improvement | Acc | 3.29 | 1.21 | 3.47 | 2.01 | 2.06 | 0.49 | 0.82 | 0.13 | 2.67 |

Table 1. Evaluation results on the validation set of the multimodal dataset hateful memes. NN means the original model, DRDF means that we use these backbones as experts to get DRDF instances. Text BERT is a unimodal model that accepts only text information in hateful memes, and Late Fusion, Concat BERT, MMBT-Grid, MMBT-Region, ViLBERT, Visual BERT, ViLBERT CC and Visual BERT COCO are all multimodal backbones.
Figure 3. AUROC (a) and accuracy (b) of the multimodal backbones on the hateful memes validation set. NN means the original model, DRDF means that we use these backbones as experts to get DRDF instances. Text BERT is a unimodal model that accepts only text information in hateful memes, and Late Fusion, Concat BERT, MMBT-Grid, MMBT-Region, ViLBERT, Visual BERT, ViLBERT CC and Visual BERT COCO are all multimodal backbones.

4.2. Multimodal Results

Table 1 and Figure 3 show the results. After the DRDF extension, the AUROC and accuracy of every backbone improve.

In terms of AUROC, Concat BERT and MMBT-Grid show the largest improvements among the multimodal backbones, 1.87 and 1.67 respectively. What the two have in common is that both use BERT for text feature extraction while accepting the original images directly, rather than extracted image features, for the image modality. The improvement of MMBT-Region is smaller than that of MMBT-Grid. The two have almost identical model structures; the difference is that MMBT-Grid receives the original image as input directly, while MMBT-Region receives extracted image features. The image router also receives the original images instead of extracted features, so its output weights are better suited to models that take the original images as the image-modality input, which is an intuitive result.

In addition, Visual BERT and Visual BERT COCO have the lowest AUROC improvements, 0.50 and 0.83. Both backbones are already strong; Visual BERT COCO in particular achieves the best baseline AUROC of 73.97 on hateful memes, which suggests that it performs more comprehensively than the others. DRDF essentially uses expert fusion to make up for the shortcomings of the experts, so for a relatively comprehensive backbone the improvement is smaller because there is less to make up.

In terms of accuracy, the improvements are larger. Late Fusion and MMBT-Grid improve the most, by 3.29% and 3.47%, and MMBT-Grid again improves more than MMBT-Region, which is consistent with the difference between original images and extracted image features discussed above. Late Fusion has a simple structure and its backbone is among the weaker baselines, which means it may have many defects that expert fusion can compensate for, making the effect of DRDF very visible. The accuracy improvements of Visual BERT and Visual BERT COCO remain small at 0.49% and 0.13%, which again indicates that their already good performance on hateful memes leaves little room for DRDF. ViLBERT and ViLBERT CC are similar backbones, but their accuracy improvements differ considerably, 2.06% and 0.82% respectively. This may be an effect of the pretraining of ViLBERT CC: pretraining brings more comprehensive performance, so the improvement from DRDF over the original backbone is less pronounced. Overall, these experiments on a wide range of backbones show that DRDF works with a large number of existing models. DRDF does not compete with them directly but complements them orthogonally, and can help other models and frameworks achieve better performance.

4.3. Unimodal Results

| Model | Dataset | WideResnet28*10 | Resnet18 | VGG16 |
| --- | --- | --- | --- | --- |
| NN | CIFAR10 | 97.62 | 94.97 | 93.46 |
| NN | CIFAR100 | 82.09 | 75.92 | 70.39 |
| NN | TinyImagenet | 67.65 | 64.13 | 53.28 |
| DRDF | CIFAR10 | 97.80 | 95.42 | 94.21 |
| DRDF | CIFAR100 | 83.52 | 76.95 | 71.52 |
| DRDF | TinyImagenet | 69.47 | 67.12 | 56.14 |

Table 2. The accuracy results of image classification tasks on CIFAR10, CIFAR100 and TinyImagenet. The metric is accuracy (%) on the test set. NN means the original model, and DRDF means that we use these backbones (WideResnet28*10, Resnet18 and VGG16) as the experts. We only use the image router VGG16 here instead of Dual-Router, and the number of experts is 4.

We have tested the adaptability of DRDF to unimodal tasks. The text modality test is shown in the column "Text BERT" in Table 1: the improvement in AUROC reaches 2.23, and the improvement in accuracy reaches 2.67%.

In addition, Table 2 shows the effect of DRDF on a CV task, image classification, with accuracy as the evaluation metric. DRDF improves on every backbone and every dataset.

With the backbone WideResnet28*10, our DRDF achieves 97.80%, 83.52%, 69.47% on three datasets. On CIFAR10, DRDF outperforms WideResnet28*10 by 0.18%, on CIFAR100, DRDF achieves accuracy 1.43% higher than WideResnet28*10 and on TinyImageNet the DRDF outperforms the WideResnet28*10 by 1.82%.

With Resnet18, our DRDF achieves 95.42%, 76.95%, 67.12%. DRDF outperforms ResNet18 by 0.45%, 1.03%, 2.99% on three datasets.

With VGG16, our DRDF achieves 94.21%, 71.52%, 56.14%, outperforms original VGG16 by 0.75%, 1.13%, 2.86% on three datasets.

The above experiments illustrate the generality of DRDF for unimodal and multimodal tasks. They show that the text router and image router in Dual-Router can also be used alone. In the unimodal setting, DRDF acts purely as an MoE framework, which improves performance by providing a fused final result for each specific input sample.

4.4. Ablations

4.4.1. Dual-Router ablations

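| Router | Acc: Sigmoid | Acc: Softmax | Acc: Relu | AUROC: Sigmoid | AUROC: Softmax | AUROC: Relu |
| --- | --- | --- | --- | --- | --- | --- |
| NN | 58.20 | 58.20 | 58.20 | 68.57 | 68.57 | 68.57 |
| Text Router | 59.26 | 61.30 | 59.81 | 68.90 | 67.11 | 67.27 |
| Image Router | 60.93 | 55.93 | 59.44 | 70.12 | 67.49 | 64.09 |
| Dual Router | 61.67 | 58.33 | 60.19 | 70.24 | 68.24 | 66.97 |

Table 3. Ablations of Dual-Router with different activation functions on the hateful memes validation set with MMBT-Grid as the expert backbone. The number of experts is 4, the text router is LSTM and the image router is VGG16. NN means the original backbone, which uses no router, so its accuracy (58.20) and AUROC (68.57) are the same for every activation function.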

Figure 4. Ablations of routers with different activation functions on hateful memes with MMBT-Grid as the expert backbone; panel (b) shows accuracy. Both sub-figures show that the dual router with Sigmoid works best.

To verify the effects of Dual-Router, we run ablations on hateful memes with MMBT-Grid as the expert backbone. The number of experts is 4, the text router is LSTM and the image router is VGG16. For each activation function, we compare the results of using only the text router, only the image router, and the Dual-Router. Table 3 and Figure 4 show the results.

Among the activation functions, Sigmoid is the best overall; the one exception is the accuracy of the text router alone, where Softmax reaches 61.30. A likely reason is that Softmax is suited to a single choice: its outputs are mutually exclusive to some degree, which widens the gap between the output weights, whereas the router needs to allow multiple high weights to coexist so as to combine a more diverse and accurate mixed model. This is the advantage of Sigmoid. Similarly, ReLU sets negative values directly to zero, which causes a serious loss of weight information.

From the router perspective, introducing any router with the Sigmoid activation function gives a performance boost, which validates one of the ideas of DRDF: providing a suitable fused final result for each specific input case does make up for some of the shortcomings of the original backbone and ultimately improves performance. With Sigmoid, Dual-Router outperforms the text router and the image router by 2.41% and 0.74% in accuracy, and by 1.34 and 0.12 in AUROC, indicating that the Dual-Router design is suitable for multimodal tasks. In addition, the image router performs better overall than the text router, probably because the model complexity of VGG16 is higher than that of LSTM. Fusing the two into Dual-Router gives still better results, which shows that Dual-Router is not simply a stack of modal information: it considers the information of both modalities to output more reasonable weights, which is the core idea of the DRDF design.

4.4.2. Discrete router or continuous router

Router         Single-choice Gate   Multi-choice Gate   Continuous (Ours)
NN             68.57 (backbone, no gate)
Text Router    66.23                67.11               68.90
Image Router   67.87                67.14               70.12
Dual Router    68.45                67.22               70.24
Table 4. Three different router attempts. Single-choice gate means that only one expert is selected at a time; multi-choice gate means that more than one expert can be selected for fusion at a time, but each output weight is either 0 or 1. Results are on hateful memes with MMBT-Grid as the expert backbone. The text router is an LSTM and the image router is VGG16. NN means the original backbone.
Figure 5. Trend chart of the ablations of the three router attempts. The left sub-figure, (a) Different gate, shows that our continuous gate fusion policy works best, while the right sub-figure, (b) Different router, shows that the dual router works best.

The Dual-Router used by DRDF outputs continuous weights and fuses the results of the experts according to these weights. We also test discrete routers to verify the effectiveness of the continuous router.
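One plausible reading of this continuous fusion is a weighted sum of the expert outputs. The sketch below assumes exactly that. The function name, the expert outputs, and the weights are all hypothetical, and the actual expert fusion unit of DRDF may differ in detail.

```python
def fuse_expert_outputs(expert_outputs, weights):
    """Return the weighted sum of per-expert output vectors.

    expert_outputs: one output vector (list of floats) per expert.
    weights: one continuous weight per expert (assumed to come from the router).
    """
    if len(expert_outputs) != len(weights):
        raise ValueError("need exactly one weight per expert")
    dim = len(expert_outputs[0])
    fused = [0.0] * dim
    for output, weight in zip(expert_outputs, weights):
        for k in range(dim):
            fused[k] += weight * output[k]
    return fused


# Hypothetical two-class outputs from 4 experts and hypothetical fused weights.
expert_outputs = [[2.0, -1.0], [1.0, 0.0], [-1.0, 1.0], [0.0, 0.5]]
weights = [0.9, 0.8, 0.1, 0.2]

fused = fuse_expert_outputs(expert_outputs, weights)
print([round(v, 3) for v in fused])
```

Because the weights are continuous, every expert contributes in proportion to its weight, rather than being switched fully on or off.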

As shown in Table 4 and Figure 5, we design three different routers. The single-choice gate takes the largest of the fused weights output by the MWF-Layer and sets it to 1 (if several values tie for the maximum, one of them is chosen at random), and sets the rest of the weights to 0, i.e., only one expert is activated at a time. The multi-choice gate sets every fused weight greater than 0.25 to 1 and the rest to 0 (if no value is greater than 0.25, the maximum is set to 1), i.e., multiple experts can be activated and fused at a time.
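The two discrete gates described above can be sketched as follows. This is a minimal illustration operating on a plain list of fused weights. The function names and example weights are hypothetical, and only the 0.25 threshold and the tie and fallback rules come from the text.

```python
import random


def single_choice_gate(weights, rng=random):
    """Set the largest weight to 1 and all others to 0.

    If several weights tie for the maximum, one of them is picked at random.
    """
    top = max(weights)
    tied = [i for i, w in enumerate(weights) if w == top]
    chosen = rng.choice(tied)
    return [1.0 if i == chosen else 0.0 for i in range(len(weights))]


def multi_choice_gate(weights, threshold=0.25):
    """Set every weight above the threshold to 1 and all others to 0.

    If no weight exceeds the threshold, only the maximum is set to 1.
    """
    mask = [1.0 if w > threshold else 0.0 for w in weights]
    if not any(mask):
        top = max(range(len(weights)), key=lambda i: weights[i])
        mask[top] = 1.0
    return mask


# Hypothetical fused weights for 4 experts.
fused_weights = [0.9, 0.8, 0.1, 0.2]
print(single_choice_gate(fused_weights))            # one expert active
print(multi_choice_gate(fused_weights))             # two experts active
print(multi_choice_gate([0.1, 0.2, 0.05, 0.0]))     # fallback to the maximum
```

Both gates discard the magnitude of the weights: 0.9 and 0.8 become indistinguishable under the multi-choice gate, whereas the continuous router keeps that difference.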

With the Dual-Router, our continuous router outperforms the single-choice and multi-choice gates by 1.79 and 3.02, respectively. When the text router and image router are used together, the single-choice gate performs better than the multi-choice gate, and the Dual-Router result for the single-choice gate (68.45) is very close to the original backbone result (68.57).

The reason may be that the single-choice gate falls into a vicious circle in the middle of training: after a certain expert has been selected many times, that expert is well trained, which makes the single-choice gate even more inclined to select it, so the final result is close to that of training a single backbone alone. The multi-choice gate can train multiple experts at the same time, but as a consequence each expert is not sufficiently trained, and the final result is inferior to that of the single-choice gate.

The continuous router is the better choice, probably for three reasons. First, a neural network with continuous values is easier to train, and the Dual-Router sits close to the input and far from the output, which is inherently harder to optimize and would be even more difficult to train with a discrete-router design. Second, the core idea of DRDF is to produce unique fused weights for each specific multimodal input case according to the importance of its individual modalities, and only continuous weights can express the unlimited range of combinations that this motivation requires. Finally, discrete routers place high demands on the ability of the experts, because they cannot combine the experts into an effective mixed network and instead rely more heavily on each selected expert. In the extreme case, the single-choice gate depends completely on the ability of the selected expert and does not help the experts achieve better results.

4.4.3. MWF-Layer ablations

Method      Accuracy (%)   AUROC
NN          58.20          68.57
Average     60.37          69.12
Multiply    57.22          65.64
MWF-Layer   61.67          70.24
Table 5. Ablations of MWF-Layer on hateful memes with MMBT-Grid as experts backbone. The number of experts is 4, text router is LSTM and image router is VGG16. NN means the original backbone. Average represents a direct summation of the weights, and Multiply represents a direct multiplication of the weights.

To verify the effect of the MWF-Layer, we run ablations on hateful memes with MMBT-Grid as the expert backbone. The number of experts is 4, the text router is an LSTM, and the image router is VGG16. We compare several approaches to forming the fused weights; the results are shown in Table 5, where Average represents the direct summation of the text weights and image weights, and Multiply represents their direct multiplication.
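The two baseline fusion rules can be sketched in a few lines. The weights below are hypothetical router outputs in [0, 1], and the function names are assumptions. Average is implemented here as an element-wise mean, which differs from the direct summation described above only by a constant factor of 1/2.

```python
def average_fusion(text_weights, image_weights):
    """Element-wise mean of the text and image router weights."""
    return [(t + i) / 2.0 for t, i in zip(text_weights, image_weights)]


def multiply_fusion(text_weights, image_weights):
    """Element-wise product of the text and image router weights."""
    return [t * i for t, i in zip(text_weights, image_weights)]


# Hypothetical per-expert weights from each router (4 experts).
text_weights = [0.9, 0.6, 0.2, 0.5]
image_weights = [0.7, 0.2, 0.8, 0.5]

avg = average_fusion(text_weights, image_weights)
mul = multiply_fusion(text_weights, image_weights)

print("average :", [round(w, 3) for w in avg])
print("multiply:", [round(w, 3) for w in mul])
```

For weights in [0, 1] the product is never larger than the mean, and it shrinks sharply whenever the two routers disagree (for example 0.2 and 0.8 average to 0.5 but multiply to 0.16). Both rules treat the two modalities symmetrically, so neither can express that one modality matters more for a given input.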

Average performs better than the original backbone, while Multiply performs worse. The improvement probably arises because the Dual-Router can make a rough judgment on the modal information, so the text weights and image weights have a certain guiding effect on the fusion of experts. Neither approach can determine which modal information is more important. Average gives the same importance to both modalities and does not change the weights much, so the effect of the Dual-Router itself still improves performance. Multiply changes the weights more, so the final effect is poor.

The MWF-Layer achieves the best results in Table 5 (61.67% accuracy and 70.24 AUROC). This illustrates its effect: it outputs fused weights by deciding the importance of the information of different modalities, and thereby guides the experts to fuse into a final result that is more suitable for the current input sample.

4.4.4. Experts ablations

NofE   Accuracy (%)   AUROC
NN     58.20          68.57
2      60.56          67.19
3      63.52          69.39
4      61.67          70.24
8      62.22          69.78
Table 6. Ablations of the experts on hateful memes with MMBT-Grid as experts backbones. NofE means the number of experts. NN means the original backbone.
Figure 6. Trend chart of the ablations of the experts on hateful memes with MMBT-Grid as the expert backbone. Panel (b) shows accuracy. NN means the original backbone.

To verify the effect of the experts, we run ablations on hateful memes with MMBT-Grid as the expert backbone. The text router is an LSTM and the image router is VGG16. We compare DRDF with different numbers of experts; the results are shown in Table 6 and Figure 6.

When the number of experts is 2, accuracy improves by 2.36% compared to the backbone, while AUROC drops by 1.38. The improvement from DRDF is not obvious here, probably because two experts are more difficult to train than one, so the performance of each individual expert decreases. We have verified this point and found that in DRDF the performance of almost every single expert is lower than that of the original backbone. With only two experts, the increase in the overall number of parameters, and hence in representation power, is smaller than with more experts, so performance declines at the stage where the experts generate the fused final result. This degradation outweighs the improvement brought by the dynamic mechanism of the Dual-Router and MWF-Layer, so when the number of experts is small, DRDF may perform poorly.

As the number of experts increases, the effect of DRDF improves noticeably. Although the performance of individual experts decreases, the larger number of experts gives a larger overall number of parameters and stronger representation power, which ultimately yields a performance increase.

However, more experts is not always better. The accuracy of DRDF peaks at 63.52% with 3 experts, which is 5.32% higher than the original backbone and 1.30% higher than with 8 experts, while the AUROC of DRDF peaks at 70.24 with 4 experts, which is 1.67 higher than the original backbone and 0.46 higher than with 8 experts.
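The differences quoted above follow directly from Table 6. As a quick check, the short script below recomputes them from the table values. The variable names are illustrative only.

```python
# (accuracy %, AUROC) per row of Table 6; "NN" is the original backbone.
table6 = {
    "NN": (58.20, 68.57),
    2: (60.56, 67.19),
    3: (63.52, 69.39),
    4: (61.67, 70.24),
    8: (62.22, 69.78),
}

backbone_acc, backbone_auroc = table6["NN"]
drdf = {n: v for n, v in table6.items() if n != "NN"}

# Number of experts with the highest accuracy and the highest AUROC.
best_acc_n = max(drdf, key=lambda n: drdf[n][0])
best_auroc_n = max(drdf, key=lambda n: drdf[n][1])

acc_gain_vs_backbone = round(drdf[best_acc_n][0] - backbone_acc, 2)
acc_gain_vs_8 = round(drdf[best_acc_n][0] - drdf[8][0], 2)
auroc_gain_vs_backbone = round(drdf[best_auroc_n][1] - backbone_auroc, 2)
auroc_gain_vs_8 = round(drdf[best_auroc_n][1] - drdf[8][1], 2)

print("best accuracy at NofE =", best_acc_n, acc_gain_vs_backbone, acc_gain_vs_8)
print("best AUROC at NofE =", best_auroc_n, auroc_gain_vs_backbone, auroc_gain_vs_8)
```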

This may be because, at inference time, adding experts essentially increases the number of parameters that can be selected in the entire DRDF dynamic network. As with traditional networks, if the number of DRDF parameters is too large and the dataset is too small, overfitting is likely to occur, so the number of experts cannot be increased indefinitely.

Since AUROC is the main metric on the hateful memes dataset, and an excessive number of experts is undesirable in terms of training time cost, the number of experts we use in our experiments is 4.

5. Conclusion

We believe that a model should be able to judge the importance of different modal information for different input cases in a multimodal task. We therefore propose a high-performance, highly general Dual-Router Dynamic Framework (DRDF), which is highly modular and applicable to both multimodal and unimodal tasks. DRDF receives text and image information through the text router and image router in the Dual-Router, determines the importance of the modal information through the MWF-Layer, and fuses it into fused weights that guide the subsequent fusion of experts. We applied DRDF to multiple backbones on multimodal and unimodal datasets, and DRDF outperforms all the baselines.

In future work we will continue to optimize the importance determination mechanism of DRDF and test DRDF in a wider range of domains.

