BM-NAS: Bilevel Multimodal Neural Architecture Search

04/19/2021 ∙ by Yihang Yin, et al. ∙ Penn State University ∙ Baidu, Inc.

Deep neural networks (DNNs) have shown superior performances on various multimodal learning problems. However, it often requires huge efforts to adapt DNNs to individual multimodal tasks by manually engineering unimodal features and designing multimodal feature fusion strategies. This paper proposes Bilevel Multimodal Neural Architecture Search (BM-NAS) framework, which makes the architecture of multimodal fusion models fully searchable via a bilevel searching scheme. At the upper level, BM-NAS selects the inter/intra-modal feature pairs from the pretrained unimodal backbones. At the lower level, BM-NAS learns the fusion strategy for each feature pair, which is a combination of predefined primitive operations. The primitive operations are elaborately designed and they can be flexibly combined to accommodate various effective feature fusion modules such as multi-head attention (Transformer) and Attention on Attention (AoA). Experimental results on three multimodal tasks demonstrate the effectiveness and efficiency of the proposed BM-NAS framework. BM-NAS achieves competitive performances with much less search time and fewer model parameters in comparison with the existing generalized multimodal NAS methods.




1 Introduction

Deep neural networks (DNNs) have achieved great success on various unimodal tasks (e.g., image categorization [24, 15], language modeling [49, 10], and speech recognition [2]) as well as multimodal tasks (e.g., action recognition [43, 50], image/video captioning [53, 19, 18], visual question answering [32, 3], and cross-modal generation [41, 56]). Despite the superior performances achieved by DNNs on these tasks, it usually requires huge efforts to adapt DNNs to specific tasks. Especially as the number of modalities increases, it is exhausting to manually design the backbone architectures and the feature fusion strategies. This raises an urgent need for the automatic design of multimodal DNNs with minimal human intervention.

Neural architecture search (NAS) [58, 27] is a promising data-driven solution to this concern, as it searches for the optimal neural network architecture within a predefined space. Applying NAS to multimodal learning, MMnas [54] searches the architecture of a Transformer model for visual-text alignment, and MMIF [37] searches the optimal CNN structure to extract multi-modality image features for tomography. These methods lack generalization ability since they are designed for models on specific modalities. MFAS [38] is a more generalized framework that searches the feature fusion strategy based on unimodal features. However, MFAS [38] only allows fusion of inter-modal features, and its feature fusion operations are not searchable. This results in a limited space of feature fusion strategies when dealing with various modalities in different multimodal tasks.

Figure 1: An overview of our BM-NAS framework for multimodal learning. A Cell is a structured feature fusion unit that accepts two inputs from modality features or other Cells. In a bilevel fashion, we simultaneously search the connections between Cells and the inner structures of the Cells.

In this paper, we propose a generalized framework, named Bilevel Multimodal Neural Architecture Search (BM-NAS), to adaptively learn the architectures of DNNs for a variety of multimodal tasks. BM-NAS adopts a bilevel searching scheme in which it learns the unimodal feature selection strategy at the upper level and the multimodal feature fusion strategy at the lower level. As shown in the left part of Fig. 1, the upper level of BM-NAS consists of a series of feature fusion units, i.e., Cells. The Cells are organized to combine and transform the unimodal features into the task output through a searchable directed acyclic graph (DAG). The right part of Fig. 1 illustrates the lower level of BM-NAS, which learns the inner structures of Cells. A Cell is comprised of several predefined primitive operations. We carefully select the primitive operations such that different combinations of them can form a large variety of feature fusion modules, including benchmark attention mechanisms such as multi-head attention (Transformer) [49] and Attention on Attention (AoA) [16]. The bilevel scheme of BM-NAS is learned end-to-end using the differentiable NAS framework [29]. We conduct extensive experiments on three multimodal tasks to evaluate the proposed BM-NAS framework. BM-NAS models show superior performances in comparison with state-of-the-art multimodal learning methods. Compared with existing generalized multimodal NAS frameworks, BM-NAS achieves competitive performances with much less search time and fewer model parameters. To the best of our knowledge, BM-NAS is the first multimodal NAS framework that supports the search of both the unimodal feature selection strategies and the multimodal fusion strategies.

The main contributions of this paper are three-fold.

  1. Towards a more generalized and flexible design of DNNs for multimodal learning, we propose a new paradigm that employs NAS to search both the unimodal feature selection strategy and the multimodal fusion strategy.

  2. We present a novel BM-NAS framework to address the proposed paradigm. BM-NAS makes the architecture of multimodal fusion models fully searchable via a bilevel searching scheme.

  3. We conduct extensive experiments on three multimodal learning tasks to evaluate the proposed BM-NAS framework. Empirical evidence indicates that both the unimodal feature selection strategy and the multimodal fusion method are significant to the performance of multimodal DNNs.

2 Related Work

2.1 Neural Architecture Search

Neural architecture search (NAS) aims at automatically finding the optimal neural network architectures for specific learning tasks. NAS can be viewed as a bilevel optimization problem that optimizes the weights and the architecture of DNNs at the same time. Since the network architecture is discrete, traditional NAS methods usually rely on black-box optimization algorithms, resulting in an extremely large computing cost. For example, searching architectures using reinforcement learning [57] or evolution [40] would require thousands of GPU-days to find a state-of-the-art architecture on the ImageNet dataset [9] due to low sampling efficiency.

As a result, many methods have been proposed for speeding up NAS. From the engineering perspective, ENAS [39] and NASH [11] improve the sampling efficiency by weight sharing and weight inheritance, respectively. From the optimization perspective, PNAS [28] employs sequential model-based optimization (SMBO) [17], using a surrogate model to predict the performance of an architecture. Monte Carlo tree search (MCTS) [36] and Bayesian optimization (BO) [21] have also been explored to enhance the sampling performance of NAS.

Recently, a remarkable efficiency improvement of NAS was achieved by differentiable architecture search (DARTS) [29]. DARTS introduces a continuous relaxation of the network architecture, making it possible to search an architecture via gradient-based optimization. However, applying DARTS to multimodal learning directly is ineffective, since the intermediate nodes of DARTS only apply a summation over their inputs, which may not be the optimal fusion strategy for specific multimodal tasks. In this work, we devise a novel NAS framework named BM-NAS for multimodal learning. BM-NAS follows the optimization scheme of DARTS, but introduces a novel bilevel searching scheme that searches the unimodal feature selection strategy and the multimodal fusion strategy simultaneously, enabling an effective search for multimodal fusion.
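The continuous relaxation behind DARTS-style search can be sketched in a few lines. The scalar operations below are toy stand-ins of our own choosing, not the actual search space of DARTS or BM-NAS; they only illustrate how a discrete architecture choice becomes differentiable.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(v - m) for v in xs]
    s = sum(es)
    return [e / s for e in es]

# Toy candidate operations on a scalar feature (stand-ins for network ops).
OPS = [lambda x: 0.0,       # "zero": drop the input
       lambda x: x,         # identity: keep the input
       lambda x: 2.0 * x]   # an arbitrary transform stand-in

def mixed_op(x, alpha):
    """Searching stage: a softmax(alpha)-weighted sum over all candidates,
    which makes the architecture choice differentiable in alpha."""
    return sum(w * op(x) for w, op in zip(softmax(alpha), OPS))

def discretize(alpha):
    """Evaluation stage: keep only the argmax operation."""
    return OPS[max(range(len(alpha)), key=lambda i: alpha[i])]
```

With uniform `alpha` the mixed output is the plain average of the candidates; as one logit grows during search, the mixed output approaches that single operation, which is then kept at discretization.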

Figure 2: An example of a multimodal fusion network found by BM-NAS via the bilevel searching scheme. Searched edges are denoted in blue and fixed edges in black. Left: the upper level of BM-NAS. The input features are extracted by pretrained unimodal models. Each Cell accepts two inputs from its predecessors, i.e., any unimodal feature or previous Cell. Right: the lower level of BM-NAS. Within a Cell, each Step denotes a primitive operation selected from a predefined operation pool. The topologies of Cells and Steps are both searchable. The numbers of Cells and Steps are hyper-parameters, so BM-NAS can be adapted to a variety of multimodal tasks with different scales.

2.2 Multimodal Fusion

The multimodal fusion techniques for DNNs can be generally classified into two categories: early fusion and late fusion. Early fusion combines low-level features, while late fusion combines prediction-level features, i.e., the outputs of the last layer of DNNs. To combine these features, a series of reduction operations such as weighted average [35] and bilinear product [47] have been proposed in previous works. As each unimodal DNN backbone can have tens of layers or more, manually sorting out the best intermediate features for multimodal fusion can be exhausting. Therefore, some works enable fusion at multiple intermediate layers. For instance, CentralNet [50] and MMTM [20] join the latent representations at each layer and pass them as auxiliary information to deeper layers. Such methods achieve superior performances on several multimodal tasks, including multimodal action recognition [42] and gesture recognition [55]. However, they largely increase the number of parameters of multimodal fusion models.

In recent years, there has been increased interest in introducing attention mechanisms such as the Transformer [49] to multimodal learning. The multimodal-BERT family [7, 26, 31, 46] is a typical approach for inter-modal fusion. Moreover, DFAF [12] shows that intra-modal fusion can also be helpful; it proposes a dynamic attention flow module that mixes inter-modal and intra-modal features through multi-head attention [49]. Additional efforts have been made to enhance the multimodal fusion efficacy of attention mechanisms. For instance, AoANet [16] proposes the Attention on Attention (AoA) module, showing that adding an attention operation on top of another one can achieve better performance on the image captioning task.

Recently, NAS approaches have made exciting progress for DNNs, and there is huge potential in introducing NAS to multimodal learning. One representative work is MFAS [38], which employs the SMBO algorithm [17] to search multimodal fusion strategies given the unimodal backbones. But as SMBO is a black-box optimization algorithm, every update step requires a batch of DNNs to be trained, leading to the inefficiency of MFAS. Besides, MFAS only uses concatenation and fully connected (FC) layers for unimodal feature fusion, and the stack of FC layers is a heavy computational burden. Further works, including MMnas [54] and MMIF [37], adopt the efficient DARTS algorithm [29] for architecture search. MMIF [37] follows the original DARTS framework, which only supports the search of unary operations on graph edges and uses summation on every intermediate node for reduction. MMnas [54] allows searching the attention operations, but the topological structure of the network is fixed during architecture search.

Different from these related works, our proposed BM-NAS supports searching both the unimodal feature selection strategy and the fusion strategy of multimodal DNNs. BM-NAS introduces a bilevel searching scheme. The upper level of BM-NAS supports both intra-modal and inter-modal feature selection. The lower level of BM-NAS searches the fusion operations within every intermediate step. The steps can flexibly form summation, concatenation, multi-head attention [49], attention on attention [16], or other unexplored fusion mechanisms. BM-NAS is a generalized and efficient NAS framework for multimodal learning. In experiments, we show that BM-NAS can be applied to various multimodal tasks regardless of the modalities or backbone models.

3 Methodology

In this work, we propose a generalized NAS framework, named Bilevel Multimodal NAS (BM-NAS), to search the architectures of multimodal fusion DNNs. More specifically, BM-NAS searches a Cell-by-Cell architecture in a bilevel fashion. The upper level architecture is a directed acyclic graph (DAG) over the input features and Cells. The lower level architecture is a DAG over the inner step nodes within a Cell, where each inner step node is a bivariate operation drawn from a predefined pool. The bilevel searching scheme ensures that BM-NAS can be easily adapted to various multimodal learning tasks regardless of the types of modalities. In the following, we discuss the unimodal feature extraction in Section 3.1, the upper and lower levels of BM-NAS in Sections 3.2 and 3.3, and the architecture search algorithm and evaluation in Section 3.4.

3.1 Unimodal feature extraction

Following previous multimodal fusion works, such as CentralNet [50], MFAS [38], and MMTM [20], we employ pretrained unimodal backbone models as feature extractors. We use the outputs of their intermediate layers (or intermediate blocks, if the model has a block-by-block structure like ResNeXt [51]) as raw features.

Since the raw features vary in shape, we reshape them by applying pooling, interpolation, and fully connected layers on the spatial, temporal, and channel dimensions, successively. By doing so, we reshape all the raw features to the shape of $B \times C \times L$, such that we can easily perform fusion operations between features of different modalities. Here $B$ is the batch size, $C$ is the embedding dimension or the number of channels, and $L$ is the sequence length.
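A rough NumPy sketch of this reshaping pipeline is given below. The concrete steps (spatial average pooling, linear temporal interpolation, a random stand-in for the channel FC layer) and the input sizes are our own simplifications for illustration, not the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def reshape_feature(feat, C=128, L=8):
    """Reshape a raw video-style backbone feature of shape (B, c, t, h, w)
    to (B, C, L): pool the spatial dims, interpolate the temporal dim to
    length L, then project channels from c to C with a linear layer."""
    B, c, t, h, w = feat.shape
    x = feat.mean(axis=(3, 4))                        # spatial avg pool -> (B, c, t)
    # linear interpolation of the temporal axis to length L
    src = np.linspace(0, t - 1, L)
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, t - 1)
    frac = src - lo
    x = x[:, :, lo] * (1 - frac) + x[:, :, hi] * frac  # (B, c, L)
    W = rng.standard_normal((C, c)) / np.sqrt(c)       # stand-in FC on channels
    return np.einsum('oc,bcl->bol', W, x)              # (B, C, L)

f = reshape_feature(rng.standard_normal((2, 64, 10, 7, 7)))
```

After this step, every modality feature has the same $B \times C \times L$ shape, so any pair of them can be fed to the same fusion operations.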

3.2 Upper Level: Cells for Feature Selection

The upper level of BM-NAS searches the unimodal feature selection strategy and consists of a group of Cells. Formally, suppose we have two modalities A and B, with a pretrained unimodal backbone for each. Let $f^A_1, f^A_2$ and $f^B_1, f^B_2$ denote the modality features extracted by the backbone models. We formulate the upper level nodes as an ordered sequence $S = (f^A_1, f^A_2, f^B_1, f^B_2, Cell_1, \dots, Cell_N)$, where $N$ is the number of Cells.

Since a Cell may take any pair of its predecessors in $S$ as inputs, both inter-modal fusion and intra-modal fusion are considered in BM-NAS.

Feature selection. By adopting the continuous relaxation of the differentiable architecture search scheme [29], all predecessors of a Cell are connected to it through weighted edges at the searching stage. This directed complete graph between Cells is called the hypernet. For two upper level nodes $x_i, x_j \in S$ with $i < j$, let $\alpha^{(i,j)}$ denote the edge weights between $x_i$ and $x_j$. Each edge is a unary operation selected from a function set $\mathcal{E}$ including

(1) $\mathrm{identity}(x) = x$, i.e., selecting an edge;

(2) $\mathrm{zero}(x) = 0$, i.e., discarding an edge.

Then, the mixed edge operation on edge $(i, j)$ is

$$\bar{e}^{(i,j)}(x_i) = \sum_{e \in \mathcal{E}} \frac{\exp\big(\alpha^{(i,j)}_e\big)}{\sum_{e' \in \mathcal{E}} \exp\big(\alpha^{(i,j)}_{e'}\big)}\, e(x_i).$$

A Cell receives inputs from all its predecessors, as

$$x_j = \sum_{i < j} \bar{e}^{(i,j)}(x_i).$$

In the evaluation stage, the network architecture is discretized: an input pair $(x_i, x_k)$ is selected for $Cell_j$ if $\alpha^{(i,j)}_{\mathrm{identity}}$ and $\alpha^{(k,j)}_{\mathrm{identity}}$ are the two largest identity weights among the predecessors of $Cell_j$ (we enforce the Cells to have different predecessors).
It is worth noting that, compared with searching the feature pairs directly, the Cell-by-Cell structure significantly reduces the complexity of the search space for unimodal feature selection: for input pairs drawn from two feature sequences, the number of candidate choices under the Cell-by-Cell setting is much smaller than under the pairwise search setting.

3.3 Lower Level: Multimodal Fusion Strategy

The lower level of BM-NAS searches the multimodal fusion strategy, i.e., the inner structure of Cells. Specifically, a Cell is a DAG consisting of a set of inner step nodes. The inner step nodes are primitive operations drawn from a predefined operation pool, which we introduce in the following.

Primitive operations. All the primitive operations take two tensor inputs $x, y \in \mathbb{R}^{C \times L}$ and output a tensor $z \in \mathbb{R}^{C \times L}$.

(1) $\mathrm{Zero}(x, y)$: this operation discards an inner step completely, outputting $z = 0$. It is helpful when BM-NAS decides to use only a part of the inner steps.

(2) $\mathrm{Sum}(x, y)$: the DARTS [29] framework uses summation to combine two features, as

$$z = x + y.$$

(3) $\mathrm{Attention}(x, y)$: we use the scaled dot-product attention [49]. As a standard attention module takes three inputs, namely query, key, and value, we let the query be $x$ and the key and value be $y$, which is also known as guided-attention [54]:

$$z = V \cdot \mathrm{softmax}\!\left(\frac{K^\top Q}{\sqrt{C}}\right), \quad Q = W_q x, \; K = W_k y, \; V = W_v y.$$

(4) $\mathrm{LinearGLU}(x, y)$: a linear layer with the gated linear unit (GLU) [8]. Let $\sigma$ denote the sigmoid function and $\odot$ denote element-wise multiplication; then LinearGLU is

$$z = (W_1 x) \odot \sigma(W_2 y).$$

(5) $\mathrm{ConcatFC}(x, y)$: passing the concatenation of $x$ and $y$ to a fully connected (FC) layer with ReLU activation [34]. The FC layer reduces the number of channels from $2C$ back to $C$. Let $[x; y] \in \mathbb{R}^{2C \times L}$ denote the concatenation; then ConcatFC is

$$z = \mathrm{ReLU}(W [x; y] + b).$$

Figure 3: The search space of a Cell in BM-NAS accommodates many existing multimodal fusion strategies with only two inner steps. (d) is a two-head version of multi-head attention [49]; more heads can be flexibly added by changing the number of inner steps. (e) is the Cell found by BM-NAS on the NTU RGB-D dataset [42], which outperforms the existing fusion strategies, as shown in Table 7.

We carefully choose these primitive operations such that they can be flexibly combined to form various feature fusion modules. In Fig. 3, we show that the search space of the lower level of BM-NAS accommodates many benchmark multimodal fusion strategies, such as the summation used in DARTS [29], the ConcatFC used in MFAS [38], the multi-head attention used in Transformer [49], and the Attention on Attention used in AoANet [16]. There also remains flexibility to discover other, better fusion modules for specific multimodal learning tasks.
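To make the operation pool concrete, here is a minimal NumPy sketch of the five primitives on single (unbatched) $C \times L$ features. The random weight matrices, the tiny sizes, and the omission of the batch dimension are our own simplifications for illustration.

```python
import numpy as np

C, L = 8, 4                         # illustrative feature sizes
rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((C, C)), rng.standard_normal((C, C))
Wq, Wk, Wv = (rng.standard_normal((C, C)) for _ in range(3))
Wc = rng.standard_normal((C, 2 * C))

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def zero(x, y):                     # (1) discard the step entirely
    return np.zeros_like(x)

def op_sum(x, y):                   # (2) DARTS-style summation
    return x + y

def attention(x, y):                # (3) scaled dot-product, query from x
    Q, K, V = Wq @ x, Wk @ y, Wv @ y             # each (C, L)
    A = softmax((K.T @ Q) / np.sqrt(C), axis=0)  # attend over y's positions
    return V @ A                                 # (C, L)

def linear_glu(x, y):               # (4) gated linear unit: (W1 x) ⊙ σ(W2 y)
    return (W1 @ x) * (1.0 / (1.0 + np.exp(-(W2 @ y))))

def concat_fc(x, y):                # (5) FC + ReLU over concatenated channels
    return np.maximum(Wc @ np.concatenate([x, y], axis=0), 0.0)

POOL = [zero, op_sum, attention, linear_glu, concat_fc]
x, y = rng.standard_normal((C, L)), rng.standard_normal((C, L))
outs = [op(x, y) for op in POOL]
```

Since every primitive maps two $C \times L$ inputs to a $C \times L$ output, any DAG built from them is well-typed, which is what makes the lower-level search space freely composable.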

Fusion strategy. In the searching stage, the inner step set of a Cell is an ordered sequence of feature nodes $(s_1, \dots, s_M)$, where $M$ is the number of inner steps. An inner step node transforms its two input nodes $x, y$ to its output through a weighted average over the primitive operation pool $\mathcal{O}$, as

$$\bar{o}(x, y) = \sum_{o \in \mathcal{O}} \frac{\exp(\beta_o)}{\sum_{o' \in \mathcal{O}} \exp(\beta_{o'})}\, o(x, y),$$

where $\beta$ denotes the weights of the primitive operations. In the evaluation stage, the optimal operation of an inner step node is derived as

$$o^* = \arg\max_{o \in \mathcal{O}} \beta_o.$$

The continuous relaxation of the weighted edges between inner step nodes is similar to that of the upper level of BM-NAS; for simplicity, we omit the formulation in this paper. Note that unlike the upper level, the pairwise inputs within a Cell can be chosen repeatedly (we do not enforce the step nodes to have different predecessors), so the inner steps can form structures like multi-head attention [49].
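A toy version of this lower-level relaxation over bivariate operations is sketched below, using scalar stand-ins (of our own choosing) for the real tensor primitives.

```python
import math

def softmax(b):
    m = max(b)
    e = [math.exp(v - m) for v in b]
    s = sum(e)
    return [v / s for v in e]

# Scalar stand-ins for the bivariate primitive operation pool.
OPS = {'zero': lambda x, y: 0.0,
       'sum':  lambda x, y: x + y,
       'prod': lambda x, y: x * y}   # 'prod' is illustrative, not a paper primitive

def mixed_step(x, y, beta):
    """Searching stage: softmax(beta)-weighted average over the op pool."""
    return sum(w * op(x, y) for w, op in zip(softmax(beta), OPS.values()))

def derive_step(beta):
    """Evaluation stage: keep only the operation with the largest weight."""
    names = list(OPS)
    return names[max(range(len(beta)), key=lambda i: beta[i])]
```

The only structural difference from the upper level is that each candidate here is a two-argument operation, so a derived step is a concrete fusion primitive rather than a kept/dropped edge.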

1:Initialize architecture parameters $\alpha$, $\beta$, $\gamma$
2:Initialize model parameters $w$
3:Initialize genotype based on $\alpha$, $\beta$, $\gamma$
4:Set genotype_best = genotype
5:Construct hypernet based on genotype_best
6:while not converged do
7:     Update $w$ on training set
8:     Update $\alpha$, $\beta$, $\gamma$ on validation set
9:     Derive upper level genotype based on $\alpha$
10:     Derive lower level genotype based on $\beta$, $\gamma$
11:     Update hypernet based on genotype
12:     if higher validation accuracy is reached then
13:         Update genotype_best using genotype
14:     end if
15:end while
16:Return genotype_best
Algorithm 1 The Search Algorithm of Bilevel Multimodal NAS (BM-NAS)
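Algorithm 1 can be paraphrased as the self-contained sketch below. `ToyModel` and its update rules are entirely hypothetical stand-ins for the hypernet, its weights, and its architecture parameters; only the control flow mirrors the algorithm.

```python
class ToyModel:
    """Hypothetical stand-in for the hypernet: one scalar weight w and one
    architecture logit per candidate, updated by fake 'gradient' steps."""
    def __init__(self):
        self.w = 0.0
        self.alpha = [0.0, 0.0]
    def update_weights(self, batch):          # line 7: update w on the train split
        self.w += 0.1 * (sum(batch) / len(batch) - self.w)
    def update_arch(self, batch):             # line 8: update arch params on val split
        self.alpha[0] += 0.1                  # pretend candidate 0 keeps winning
    def derive_genotype(self):                # lines 9-10: discretize by argmax
        return max(range(len(self.alpha)), key=lambda i: self.alpha[i])
    def validate(self):                       # proxy accuracy, peaks at w == 1.0
        return -abs(self.w - 1.0)

def search(model, train_batches, val_batches, steps=50):
    """Alternating bilevel loop of Algorithm 1: weights on train data,
    architecture on validation data, keeping the best genotype seen."""
    best_acc, best_genotype = float('-inf'), None
    for step in range(steps):
        model.update_weights(train_batches[step % len(train_batches)])
        model.update_arch(val_batches[step % len(val_batches)])
        genotype = model.derive_genotype()
        acc = model.validate()
        if acc > best_acc:                    # lines 12-14: track genotype_best
            best_acc, best_genotype = acc, genotype
    return best_genotype

best = search(ToyModel(), [[1.0, 1.0]], [[1.0]])
```

The key design point preserved from the paper is the data split: the weights never see the validation batches, and the architecture parameters never see the training batches.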

3.4 Architecture Search and Evaluation

Search algorithm. In Sections 3.2 and 3.3, we introduced three variables $\alpha$, $\beta$, and $\gamma$ as the architecture parameters. Algorithm 1 shows the searching process of BM-NAS, which follows DARTS [29] to optimize the architecture parameters and the model weights $w$ alternately. In Algorithm 1, the model in the searching stage is called the hypernet, since all its edges and nodes are mixed operations. The searched structure description of the fusion network is called the genotype.

Implementation details. In order to make the whole BM-NAS framework searchable and flexible, the Cells and inner step nodes should have the same number of inputs and outputs, so that they can be put together in an arbitrary topological order. The two-input setting follows the benchmark NAS frameworks (DARTS [29], MFAS [38], MMIF [37]), which all have only two searchable inputs for each Cell/step node. It also requires no extra effort to let the Cells or step nodes support 3 or more inputs, by simply adding ternary (or higher-arity) operations into the primitive operation pool.

Evaluation. In architecture evaluation, we select the genotype with the best validation performance as the searched fusion network. Then we combine the training and validation sets together to train the unimodal backbones and the searched fusion network jointly.

4 Experiments

In this work we evaluate BM-NAS on three multimodal tasks: (1) the multi-label movie genre classification task on the MM-IMDB dataset [4], (2) the multimodal action recognition task on the NTU RGB-D dataset [42], and (3) the multimodal gesture recognition task on the EgoGesture dataset [55]. Examples of these tasks are shown in Fig. 4. In the following, we discuss the experiments on the three tasks in Sections 4.1, 4.2, and 4.3, respectively. We analyze computing efficiency in Section 4.4, and further evaluate the search strategies of the proposed BM-NAS framework in Sections 4.5 and 4.6.

Figure 4: Examples of the evaluation datasets.

4.1 MM-IMDB Dataset

The MM-IMDB dataset [4] is a multimodal dataset collected from the Internet Movie Database, containing posters, plots, genres, and other meta information of 25,959 movies. We conduct multi-label genre classification on MM-IMDB using posters (RGB images) and plots (text) as the input modalities. There are 27 non-mutually exclusive genres in total, including Drama, Comedy, Romance, etc. Since the number of samples in each class is highly imbalanced, we only use 23 genres for classification; the classes News, Adult, Talk-Show, and Reality-TV are discarded since they account for only 0.10% of the samples in total. We adopt the original split of the dataset, where 15,552 movies are used for training, 2,608 for validation, and 7,799 for testing.

For a fair comparison with other explicit multimodal fusion methods, we use the same backbone models. Specifically, we use Maxout MLP [13] as the backbone for the text modality and VGG Transfer [44] as the backbone for the RGB image modality. For BM-NAS, we adopt a setting of 2 fusion Cells and 1 step per Cell.

Table 1 shows that BM-NAS achieves the best Weighted F1 score in comparison with the existing multimodal fusion methods. Notice that as the class distribution of MM-IMDB is highly imbalanced, the Weighted F1 score is in fact a more reliable metric for multi-label classification than other variants of the F1 score.

4.2 NTU RGB-D Dataset

Method Modality F1-W(%)
Unimodal Methods
Maxout MLP [13] Text 57.54
VGG Transfer [44] Image 49.21
Multimodal Methods
Two-stream [43] Image + Text 60.81
GMU [4] Image + Text 61.70
CentralNet [50] Image + Text 62.23
MFAS [38] Image + Text 62.50
BM-NAS (ours) Image + Text 62.92 ± 0.03
Table 1: Multi-label genre classification results on MM-IMDB dataset [4]. Weighted F1 (F1-W) is reported.
Method Modality Acc(%)
Unimodal Methods
Inflated ResNet-50 [5] Video 83.91
Co-occurrence [25] Pose 85.24
Multimodal Methods
Two-stream [43] Video + Pose 88.60
GMU [4] Video + Pose 85.80
MMTM [20] Video + Pose 88.92
CentralNet [50] Video + Pose 89.36
MFAS [38] Video + Pose 89.50
BM-NAS (ours) Video + Pose 90.48 ± 0.24
Table 2: Action recognition results on NTU RGB-D dataset [42].

The NTU RGB-D dataset [42] is a large-scale multimodal action recognition dataset, containing a total of 56,880 samples with 40 subjects, 80 viewpoints, and 60 classes of daily activities. In this work we use the skeleton and RGB video modalities for the fusion experiments, and we measure performance using cross-subject (CS) accuracy. We follow the dataset split of MFAS [38]: subjects 1, 4, 8, 13, 15, 17, and 19 for training, subjects 2, 5, 9, and 14 for validation, and the rest for testing. There are 23,760, 2,519, and 16,558 samples in the training, validation, and test sets, respectively.

For a fair comparison, we use two CNN models, Inflated ResNet-50 [5] for the video modality and Co-occurrence [25] for the skeleton modality, as backbones, ensuring all the methods in our experiments share the same backbones. We test the performances of MFAS [38], MMTM [20], and the proposed BM-NAS using our data preprocessing pipeline, so the performances of these methods differ from those originally reported. For BM-NAS, we use 2 fusion Cells and 2 steps per Cell. For the inner step representations, we set $C = 128$ and $L = 8$.

In Table 2, our method achieves a cross-subject accuracy of 90.48%, a state-of-the-art result on NTU RGB-D [42] with the video and pose modalities.

Compared with MFAS [38], our framework has several advantages. First, MFAS discards too much information by reshaping all the modality features using global average pooling; for video features, the spatial and temporal dimensions are pooled together into a single number. By contrast, we only pool the spatial dimension, and the temporal dimension is interpolated to length $L$ for the attention modules. Second, the feature selection strategy of MFAS may be problematic. We found that the feature pair (Video_4, Skeleton_4) is included in all five architectures provided by MFAS, and in Table 6 the performance of MFAS is the same as that of a late fusion model which simply concatenates (Video_4, Skeleton_4). This means the other feature pairs selected by MFAS may be useless. Also, MFAS forces the features to be selected across modalities; however, in our experiments we found that BM-NAS favors the feature pair (Video_3, Video_4) the most. As addressed in DFAF [12], intra-modal fusion can also be helpful.

4.3 EgoGesture Dataset

The EgoGesture dataset [55] is a large-scale multimodal gesture recognition dataset, containing 24,161 gesture samples of 83 classes collected from 50 distinct subjects in 6 different scenes. We follow the original split of the EgoGesture dataset [55], i.e., a cross-subject split: there are 14,416 samples for training, 4,768 for validation, and 4,977 for testing.

We use ResNeXt-101 [23] as the backbone for both the RGB and depth video modalities. As former works like CentralNet [50] and MFAS [38] did not perform experiments on this dataset, we compare our method with other unimodal and multimodal methods. For BM-NAS, we use 2 fusion Cells and 3 steps per Cell. For the inner step representations, we set $C = 128$ and $L = 8$.

Table 3 reports the experimental results on the EgoGesture dataset [55]. Compared with other unimodal and multimodal methods, BM-NAS achieves state-of-the-art fusion performance, showing that BM-NAS is effective for enhancing gesture recognition on EgoGesture.

Method Modality Acc(%)
Unimodal Methods
VGG-16 + LSTM [43] RGB 74.70
C3D + LSTM + RSTTM [48] RGB 89.30
I3D [6] RGB 90.33
ResNext-101 [23] RGB 93.75
VGG-16 + LSTM [52] Depth 77.70
C3D + LSTM + RSTTM [33] Depth 90.60
I3D [6] Depth 89.47
ResNeXt-101 [23] Depth 94.03
Multimodal Methods
VGG-16 + LSTM [6] RGB + Depth 81.40
C3D + LSTM + RSTTM [1] RGB + Depth 92.20
I3D [14] RGB + Depth 92.78
MMTM [20] RGB + Depth 93.51
MTUT [14] RGB + Depth 93.87
BM-NAS (ours) RGB + Depth 94.96 ± 0.07
Table 3: Gesture recognition results on the EgoGesture dataset [55]. We use ResNeXt-101 [23] as the backbone for both the RGB and depth modalities in our BM-NAS method.

4.4 Computing Efficiency

Model size. Table 4 compares the model sizes of different multimodal fusion methods on NTU RGB-D. All the three methods share exactly the same unimodal backbones. Compared with the manually designed fusion model MMTM [20] and the fusion model searched by MFAS [38], our BM-NAS achieves a better performance with fewer model parameters.

Search cost. Table 5 compares the search cost of generalized multimodal NAS frameworks including MFAS [38] and our BM-NAS. Thanks to the efficient differentiable architecture search framework [29], BM-NAS is about 10x faster than MFAS [38] when searching on MM-IMDB [4] and NTU RGB-D [42].

Method Dataset Parameters Acc(%)
MMTM [20] NTU 8.61 M 88.92
MFAS [38] NTU 2.16 M 89.50
BM-NAS (ours) NTU 0.98 M 90.48
Table 4: Model size and performance on NTU RGB-D [42].
Method MM-IMDB NTU
MFAS [38] 9.24 603.64
BM-NAS (ours) 1.24 53.68
Table 5: Search cost (GPU hours) of generalized multimodal NAS methods.
Figure 5: Best model found on NTU RGB-D dataset [42].

4.5 Ablation Study

In this section, we conduct an ablation study to verify the effectiveness of the unimodal feature selection strategy and the multimodal fusion strategy, respectively.

Unimodal feature selection. Table 6 compares different unimodal feature selection strategies on the NTU RGB-D dataset [42]. We compare the best strategy found by BM-NAS against random selection, late fusion, and the best strategy found by MFAS [38]. For all the random baselines, the inner structure of the Cells is the same: we randomly select the input features and the connections between Cells, and report the result averaged over 5 trials. For the late fusion baseline, we concatenate the outputs of the last layers of the backbones, i.e., features Video_4 and Skeleton_4 in Fig. 5. MFAS selects four feature pairs: (Video_4, Skeleton_4), (Video_2, Skeleton_4), (Video_2, Skeleton_2), and (Video_4, Skeleton_4). As shown in Table 6, the feature selection strategy searched by BM-NAS outperforms all baselines, demonstrating that a better unimodal feature selection strategy benefits multimodal fusion performance.

Features Dataset Accuracy(%)
Random NTU 86.35 ± 0.68
Late fusion NTU 89.49
Searched (MFAS [38]) NTU 89.50
Searched (BM-NAS) NTU 90.48
Table 6: Ablation study for feature selection.
Fusion Framework Dataset Acc (%)
Sum DARTS [29] NTU 87.64
ConcatFC MFAS [38] NTU 89.20
MHA Transformer [49] NTU 88.29
AoA AoANet [16] NTU 89.11
Searched BM-NAS NTU 90.48
Table 7: Ablation study for fusion strategy.

Multimodal fusion strategy. Table 7 evaluates different multimodal fusion strategies on the NTU RGB-D dataset [42]. All the strategies in Table 7 adopt the same feature selection strategy. We compare the best Cell structure found by BM-NAS against the summation used in DARTS [29], the ConcatFC used in MFAS [38], the multi-head attention (MHA) used in Transformer [49], and the attention on attention (AoA) used in AoANet [16]. All these fusion strategies can be formed as certain combinations of our predefined primitive operations, as shown in Fig. 3. In Table 7, the fusion strategy derived by BM-NAS outperforms the baseline strategies, showing the effectiveness of searching the fusion strategy for multimodal fusion models.

4.6 Search Configurations

To better understand the proposed BM-NAS framework, we empirically study various search configurations of BM-NAS on NTU RGB-D [42] and EgoGesture [55]. The configurations include the number of Cells $N$, the number of steps $M$, and the inner representation sizes $C$ and $L$. We list the top-4 configurations in Table 8 together with the validation/test accuracies.

Table 8 suggests that $N = 2$ might be a good choice. We find that with too few Cells, BM-NAS leans toward selecting the late fusion strategy (i.e., selecting the last features of the backbones), but as shown in Table 6, late fusion may not be the best choice. Regarding the number of steps $M$, $M = 2$ already covers many existing fusion strategies (as shown in Fig. 3), while $M = 3$ makes a slightly larger search space. We observe that larger $N$ and $M$ may easily lead to overfitting, as the fusion network contains $N \times M$ inner steps in total.

With the search configurations in Table 8, Fig. 6 shows the validation accuracies of the hypernets during search. Fig. 7 further compares the performances of the hypernets and of the sampled architectures. Figs. 6 and 7 show that the performances of different search configurations of BM-NAS are consistent between searching and evaluation. This suggests that we can select good search configurations according to the validation performance of the hypernets, instead of performing additional evaluation on the test set with the sampled architectures.

Dataset ID N M C L Val Acc Test Acc
NTU 1 2 2 128 8 94.48 90.48
NTU 2 2 1 256 8 94.16 89.19
NTU 3 2 1 128 8 93.01 88.77
NTU 4 4 2 256 8 92.22 88.30
Ego 1 2 3 128 8 98.87 94.96
Ego 2 1 2 128 8 98.58 94.45
Ego 3 4 2 192 8 98.56 93.25
Ego 4 2 2 192 12 98.60 94.33
Table 8: Top-4 search configurations on NTU RGB-D [42] and EgoGesture [55] with accuracy (%).
Figure 6: The validation accuracy of hypernets in searching stage. Results on NTU RGB-D [42] and EgoGesture [55] are reported.
Figure 7: Performance of hypernets (searching stage) and sampled fusion network structures (evaluation stage).

5 Conclusion

In this paper, we have presented BM-NAS, a novel multimodal NAS framework that learns the architectures of multimodal fusion models via a bilevel searching scheme. To the best of our knowledge, BM-NAS is the first NAS framework that supports searching both the unimodal feature selection strategies and the multimodal fusion strategies for multimodal DNNs. In experiments, we have demonstrated the effectiveness and efficiency of BM-NAS on three different multimodal learning tasks.


  • [1] M. Abavisani, H. R. V. Joze, and V. M. Patel (2019) Improving the performance of unimodal dynamic hand-gesture recognition with multimodal training. In CVPR, pp. 1165–1174. Cited by: Table 3.
  • [2] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen, et al. (2016) Deep speech 2: end-to-end speech recognition in english and mandarin. In ICML, pp. 173–182. Cited by: §1.
  • [3] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, pp. 6077–6086. Cited by: §1.
  • [4] J. Arevalo, T. Solorio, M. Montes-y-Gómez, and F. A. González (2017) Gated multimodal units for information fusion. arXiv preprint arXiv:1702.01992. Cited by: §A.2, §A.2, Table 9, Figure 11, §B.3, §B.3, Table 10, §4.1, §4.4, Table 1, Table 2, Table 3, §4.
  • [5] F. Baradel, C. Wolf, J. Mille, and G. W. Taylor (2018) Glimpse clouds: human activity recognition from unstructured feature points. In CVPR, pp. 469–478. Cited by: §B.2, §4.2, Table 2.
  • [6] J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, pp. 6299–6308. Cited by: Table 3.
  • [7] Y. Chen, L. Li, L. Yu, A. E. Kholy, F. Ahmed, Z. Gan, Y. Cheng, and J. Liu (2019) Uniter: learning universal image-text representations. arXiv preprint arXiv:1909.11740. Cited by: §2.2.
  • [8] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier (2017) Language modeling with gated convolutional networks. In ICML, pp. 933–941. Cited by: §3.3.
  • [9] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In CVPR, pp. 248–255. Cited by: §2.1.
  • [10] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1.
  • [11] T. Elsken, J. Metzen, and F. Hutter (2017) Simple and efficient architecture search for convolutional neural networks. arXiv preprint arXiv:1711.04528. Cited by: §2.1.
  • [12] P. Gao, Z. Jiang, H. You, P. Lu, S. C. Hoi, X. Wang, and H. Li (2019) Dynamic fusion with intra-and inter-modality attention flow for visual question answering. In CVPR, pp. 6639–6648. Cited by: §2.2, §4.2.
  • [13] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio (2013) Maxout networks. In ICML, pp. 1319–1327. Cited by: §4.1, Table 1.
  • [14] V. Gupta, S. K. Dwivedi, R. Dabral, and A. Jain (2019) Progression modelling for online and early gesture detection. In 3DV, pp. 289–297. Cited by: Table 3.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §1.
  • [16] L. Huang, W. Wang, J. Chen, and X. Wei (2019) Attention on attention for image captioning. In ICCV, pp. 4634–4643. Cited by: §B.1, §1, §2.2, §2.2, §3.3, §4.5, Table 7.
  • [17] F. Hutter, H. H. Hoos, and K. Leyton-Brown (2011) Sequential model-based optimization for general algorithm configuration. In International conference on learning and intelligent optimization, pp. 507–523. Cited by: §2.1, §2.2.
  • [18] T. Jin, S. Huang, M. Chen, Y. Li, and Z. Zhang (2020) SBAT: video captioning with sparse boundary-aware transformer. In IJCAI, Cited by: §1.
  • [19] T. Jin, S. Huang, Y. Li, and Z. Zhang (2019) Low-rank hoca: efficient high-order cross-modal attention for video captioning. In EMNLP, pp. 2001–2011. Cited by: §1.
  • [20] H. R. V. Joze, A. Shaban, M. L. Iuzzolino, and K. Koishida (2020) MMTM: multimodal transfer module for cnn fusion. In CVPR, pp. 13289–13299. Cited by: §2.2, §3.1, §4.2, §4.4, Table 2, Table 3, Table 4.
  • [21] K. Kandasamy, W. Neiswanger, J. Schneider, B. Poczos, and E. P. Xing (2018) Neural architecture search with bayesian optimisation and optimal transport. In NeurIPS, pp. 2016–2025. Cited by: §2.1.
  • [22] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §A.2, §A.2.
  • [23] O. Köpüklü, A. Gunduz, N. Kose, and G. Rigoll (2019) Real-time hand gesture detection and classification using convolutional neural networks. In FG, pp. 1–8. Cited by: §B.2, §4.3, Table 3.
  • [24] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In NeurIPS, Cited by: §1.
  • [25] C. Li, Q. Zhong, D. Xie, and S. Pu (2018) Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation. arXiv preprint arXiv:1804.06055. Cited by: §B.2, §4.2, Table 2.
  • [26] L. H. Li, M. Yatskar, D. Yin, C. Hsieh, and K. Chang (2019) Visualbert: a simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557. Cited by: §2.2.
  • [27] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy (2018) Progressive neural architecture search. In ECCV, pp. 19–34. Cited by: §1.
  • [28] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy (2018) Progressive neural architecture search. In ECCV, pp. 19–34. Cited by: §2.1.
  • [29] H. Liu, K. Simonyan, and Y. Yang (2018) Darts: differentiable architecture search. arXiv preprint arXiv:1806.09055. Cited by: §1, §2.1, §2.2, §3.2, §3.3, §3.3, §3.4, §3.4, §4.4, §4.5, Table 7.
  • [30] I. Loshchilov and F. Hutter (2016) SGDR: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983. Cited by: §A.2.
  • [31] J. Lu, D. Batra, D. Parikh, and S. Lee (2019) Vilbert: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS, pp. 13–23. Cited by: §2.2.
  • [32] J. Lu, J. Yang, D. Batra, and D. Parikh (2016) Hierarchical question-image co-attention for visual question answering. In NeurIPS, pp. 289–297. Cited by: §1.
  • [33] P. Molchanov, X. Yang, S. Gupta, K. Kim, S. Tyree, and J. Kautz (2016) Online detection and classification of dynamic hand gestures with recurrent 3d convolutional neural network. In CVPR, pp. 4207–4215. Cited by: Table 3.
  • [34] V. Nair and G. E. Hinton (2010) Rectified linear units improve restricted boltzmann machines. In ICML, Cited by: §3.3.
  • [35] P. Natarajan, S. Wu, S. Vitaladevuni, X. Zhuang, S. Tsakalidis, U. Park, R. Prasad, and P. Natarajan (2012) Multimodal feature fusion for robust event detection in web videos. In CVPR, pp. 1298–1305. Cited by: §2.2.
  • [36] R. Negrinho and G. Gordon (2017) Deeparchitect: automatically designing and training deep architectures. arXiv preprint arXiv:1704.08792. Cited by: §2.1.
  • [37] Y. Peng, L. Bi, M. Fulham, D. Feng, and J. Kim (2020) Multi-modality information fusion for radiomics-based neural architecture search. In MICCAI, pp. 763–771. Cited by: §1, §2.2, §3.4.
  • [38] J. Pérez-Rúa, V. Vielzeuf, S. Pateux, M. Baccouche, and F. Jurie (2019) Mfas: multimodal fusion architecture search. In CVPR, pp. 6966–6975. Cited by: §B.1, §1, §2.2, §3.1, §3.3, §3.4, §4.2, §4.2, §4.2, §4.3, §4.4, §4.4, §4.5, §4.5, Table 1, Table 2, Table 4, Table 5, Table 6, Table 7.
  • [39] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean (2018) Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268. Cited by: §2.1.
  • [40] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le (2019) Regularized evolution for image classifier architecture search. In AAAI, Vol. 33, pp. 4780–4789. Cited by: §2.1.
  • [41] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee (2016) Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396. Cited by: §1.
  • [42] A. Shahroudy, J. Liu, T. Ng, and G. Wang (2016) Ntu rgb+ d: a large scale dataset for 3d human activity analysis. In CVPR, pp. 1010–1019. Cited by: §A.2, §A.2, Table 9, Figure 9, §B.1, §B.2, §B.3, §2.2, Figure 3, Figure 5, Figure 6, §4.2, §4.2, §4.4, §4.5, §4.5, §4.6, Table 2, Table 4, Table 8, §4.
  • [43] K. Simonyan and A. Zisserman (2014) Two-stream convolutional networks for action recognition in videos. In NeurIPS, pp. 568–576. Cited by: §1, Table 1, Table 2, Table 3.
  • [44] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §4.1, Table 1.
  • [45] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: §A.2.
  • [46] H. Tan and M. Bansal (2019) Lxmert: learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490. Cited by: §2.2.
  • [47] D. Teney, P. Anderson, X. He, and A. Van Den Hengel (2018) Tips and tricks for visual question answering: learnings from the 2017 challenge. In CVPR, pp. 4223–4232. Cited by: §2.2.
  • [48] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri (2015) Learning spatiotemporal features with 3d convolutional networks. In ICCV, pp. 4489–4497. Cited by: Table 3.
  • [49] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NeurIPS, pp. 5998–6008. Cited by: §1, §1, §2.2, §2.2, Figure 3, §3.3, §3.3, §3.3, §4.5, Table 7.
  • [50] V. Vielzeuf, A. Lechervy, S. Pateux, and F. Jurie (2018) Centralnet: a multilayer approach for multimodal fusion. In ECCV, pp. 0–0. Cited by: §1, §2.2, §3.1, §4.3, Table 1, Table 2.
  • [51] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He (2017) Aggregated residual transformations for deep neural networks. In CVPR, pp. 1492–1500. Cited by: §3.1.
  • [52] X. Yang and Y. Tian (2014) Super normal vector for activity recognition using depth sequences. In CVPR, pp. 804–811. Cited by: Table 3.
  • [53] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo (2016) Image captioning with semantic attention. In CVPR, pp. 4651–4659. Cited by: §1.
  • [54] Z. Yu, Y. Cui, J. Yu, M. Wang, D. Tao, and Q. Tian (2020) Deep multimodal neural architecture search. arXiv preprint arXiv:2004.12070. Cited by: §1, §2.2, §3.3.
  • [55] Y. Zhang, C. Cao, J. Cheng, and H. Lu (2018) Egogesture: a new dataset and benchmark for egocentric hand gesture recognition. IEEE Transactions on Multimedia 20 (5), pp. 1038–1050. Cited by: §A.2, §A.2, Table 9, Figure 10, §B.2, §B.2, §B.2, §B.3, §2.2, Figure 6, §4.3, §4.3, §4.6, Table 8, §4.
  • [56] X. Zhou, S. Huang, B. Li, Y. Li, J. Li, and Z. Zhang (2019) Text guided person image synthesis. In CVPR, pp. 3663–3672. Cited by: §1.
  • [57] B. Zoph and Q. V. Le (2016) Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578. Cited by: §2.1.
  • [58] B. Zoph and Q. V. Le (2017) Neural architecture search with reinforcement learning. In ICLR, Cited by: §1.

Appendix A Learning Details

a.1 More Details on Architecture Parameters

The functions of the primitive operation weights and the inner step edge weights are shown in Fig. 8. The edge weights are used for feature selection within a Cell, selecting two inputs for each inner step node, and the operation weights are used for operation selection on each inner step node.
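This discretization can be sketched roughly as follows. It is our own simplification, assuming a DARTS-style argmax over softmaxed logits; the operation names are placeholders, not necessarily the paper's exact primitive set.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def derive_step(edge_logits, op_logits, op_names):
    """Discretize one inner step: keep the two strongest input edges
    and the single strongest primitive operation."""
    edge_probs = softmax(np.asarray(edge_logits, dtype=float))
    inputs = np.argsort(edge_probs)[-2:]  # indices of the top-2 input nodes
    op_probs = softmax(np.asarray(op_logits, dtype=float))
    op = op_names[int(np.argmax(op_probs))]
    return sorted(inputs.tolist()), op

# e.g., 4 candidate input nodes and 5 candidate primitives (names assumed):
inputs, op = derive_step(
    edge_logits=[0.1, 2.0, -1.0, 1.5],
    op_logits=[0.3, 0.1, 1.2, -0.5, 0.0],
    op_names=["Zero", "Sum", "ScaledDotAttn", "ConcatFC", "LinearGLU"],
)
```

Edge logits pick which two predecessor nodes feed the step; operation logits pick how the chosen pair is fused.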

a.2 Hyper-parameters

Dataset  ID   C    L   N  M  Ep  BS  Drpt  LR    L2    MaxLR  MinLR  L2    Model Size  Search Time  Search Score  Eval Score
NTU      1    128  8   2  2  30  96  0.2   3e-4  1e-3  1e-3   1e-6   1e-4  0.98 M      53.68        94.48         90.48
NTU      2    256  8   2  1  30  96  0.2   3e-4  1e-3  1e-3   1e-6   1e-4  1.71 M      47.76        94.16         89.19
NTU      3    128  8   2  1  30  96  0.2   3e-4  1e-3  1e-3   1e-6   1e-4  0.98 M      45.84        93.01         88.78
NTU      4    256  8   4  2  30  96  0.2   3e-4  1e-3  1e-3   1e-6   1e-4  2.57 M      58.64        92.22         88.30
Ego      1    128  8   2  3  7   72  0.0   3e-4  1e-3  3e-3   1e-6   1e-4  0.61 M      20.67        98.87         94.96
Ego      2    128  8   1  2  7   72  0.2   3e-4  1e-3  1e-2   1e-6   1e-4  0.45 M      27.60        98.58         94.45
Ego      3    192  8   4  2  7   72  0.2   3e-4  1e-3  3e-3   1e-6   1e-4  1.17 M      36.82        98.56         93.25
Ego      4    192  12  2  2  7   72  0.0   3e-4  1e-3  3e-3   1e-6   1e-4  1.59 M      33.62        98.60         94.33
MM-IMDB  1    192  16  2  1  12  96  0.1   3e-4  1e-3  1e-3   1e-6   1e-4  0.65 M      1.24         53.44         62.92
Table 9: Top-4 Configurations on NTU [42] and EgoGesture [55] datasets and the best configuration on MM-IMDB [4] dataset.

We describe the detailed hyper-parameter configurations on MM-IMDB [4], NTU RGB-D [42], and EgoGesture [55] datasets in Table 9, where the notations are discussed in the following.

Cells and steps. C is the number of channels and L is the feature length; in the paper, we refer to (C, L) as the inner representation size. N is the number of Cells, and M is the number of steps in each Cell.

Basic training settings. Ep is the number of epochs during the searching stage. In the evaluation stage, the number of epochs could be larger, and we set it accordingly in the experiments. BS is the batch size and Drpt is the Dropout rate [45]. BS and Drpt are the same for both the searching stage and the evaluation stage.

Architecture optimization. For architecture parameter optimization, we use the Adam [22] optimizer. The architecture parameters control the structures of the Cells and steps, as described in the paper. LR is the learning rate and L2 is the weight decay term.

Network optimization. For the network parameters, we use the Adam [22] optimizer with a Cosine Annealing scheduler [30]. The network parameters are the trainable parameters of the fusion network, including the reshaping layers, the Cells, and the classifier. MaxLR and MinLR are the learning rate boundaries used by the Cosine Annealing scheduler [30]. L2 is the weight decay term.
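The resulting schedule can be sketched as follows. We assume a single cosine cycle without warm restarts; `cosine_annealing_lr` is our own helper, not from the paper's code.

```python
import math

def cosine_annealing_lr(epoch, max_lr, min_lr, total_epochs):
    """Learning rate at a given epoch under cosine annealing,
    decaying smoothly from max_lr down to min_lr."""
    return min_lr + 0.5 * (max_lr - min_lr) * (
        1.0 + math.cos(math.pi * epoch / total_epochs)
    )

# With the NTU settings from Table 9: MaxLR = 1e-3, MinLR = 1e-6, 30 search epochs.
lrs = [cosine_annealing_lr(t, 1e-3, 1e-6, 30) for t in range(31)]
```

The schedule starts at MaxLR, ends at MinLR, and decreases monotonically in between, which keeps the late search epochs stable while the architecture parameters converge.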

Model size and search time. Model Size is the total number of parameters of the fusion network (in millions), excluding the backbone models. Search Time is the time taken for the searching stage (GPU·hours). We use 8 NVIDIA M40 GPUs in our experiments.

Searching and evaluation scores. The Search Score is the performance of the hypernet on the validation set in the searching stage. The Eval Score is the performance of the fusion network on the test set in the evaluation stage. The MM-IMDB [4] dataset involves a multi-label classification task, so we use the Weighted F1 score (F1-W) as its performance metric. For the NTU RGB-D [42] and EgoGesture [55] datasets, we use the classification accuracy (%) as the metric.

Figure 8: The functions of the architecture parameters for feature selection and operation selection.

Appendix B Discovered Architectures

b.1 NTU RGB-D Dataset

(a) NTU Config 1 (b) NTU Config 2 (c) NTU Config 3 (d) NTU Config 4
Figure 9: The top-4 architectures found by BM-NAS on the NTU [42] dataset. ‘NTU Config 1’ is the best architecture found on the NTU dataset. ‘C1_S1’ denotes Step 1 of Cell 1, and so on. The blue edges are the connections at the upper level, and the dark edges are the connections at the lower level.

We tune the hyper-parameters extensively on the NTU RGB-D [42] dataset. The top-4 configurations are shown in Table 9, and the architectures found under these configurations are shown in Fig. 9. ‘NTU Config 1’ is the best architecture found by our BM-NAS framework.

For the feature selection strategy, we find that Video_3, Video_4, and Skeleton_4 are always selected by our BM-NAS framework, no matter how many Cells and steps are used. It indicates that these are the most effective modality features. In particular, Video_3 is strongly favored in all the found architectures. MFAS [38] also selects Video_4 and Skeleton_4 in every found architecture, but it does not pay much attention to Video_3.

For the fusion strategy, we find that adding more inner steps (increasing M) is more effective than adding more Cells (increasing N). However, since we have N × M steps in total, setting N or M too large would easily lead to overfitting. Roughly, we find that setting N = 2, M = 2 is a good option. N = 2 means we have two different feature pairs for the Cells, which is sufficient to cover the three most important features Video_3, Video_4, and Skeleton_4. And M = 2 is sufficient for BM-NAS to form fusion strategies like concatenation and attention on attention (AoA) [16], as shown in the paper. The best fusion strategy found by BM-NAS on NTU is very similar to an AoA [16] module; see ‘NTU Config 1’ in Fig. 9.
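To get a feel for how the number of inner steps M inflates the lower-level search space, here is a rough count. It assumes step i picks an unordered input pair from its 2 + (i − 1) predecessor nodes and one of K primitives; this is our own simplification and ignores the upper-level feature-pair selection.

```python
from math import comb

def cell_search_space(m_steps, k_ops, n_inputs=2):
    """Rough count of discrete structures for one Cell, assuming step i
    chooses an unordered input pair from n_inputs + (i - 1) predecessor
    nodes and one of k_ops primitive operations (illustrative only)."""
    size = 1
    for i in range(m_steps):
        size *= comb(n_inputs + i, 2) * k_ops
    return size

# Under this count with, say, 5 primitives, M = 3 is 30x larger than M = 2:
small = cell_search_space(m_steps=2, k_ops=5)
large = cell_search_space(m_steps=3, k_ops=5)
```

Even under this simplified count, each extra step multiplies the space, consistent with larger M both enlarging the search space and making overfitting easier.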

b.2 EgoGesture Dataset

(a) Ego Config 1 (b) Ego Config 2 (c) Ego Config 3 (d) Ego Config 4
Figure 10: The top-4 architectures found by BM-NAS on EgoGesture[55] dataset. ‘Ego Config 1’ is the best architecture found on EgoGesture dataset.

For the experiments on the EgoGesture [55] dataset, we basically follow the same settings as on the NTU RGB-D dataset. The top-4 configurations are shown in Table 9, and the architectures found under these configurations are shown in Fig. 10. ‘Ego Config 1’ is the best architecture found by our BM-NAS framework.

For the feature selection strategy, we find that Depth_1, Depth_2, and RGB_2 are the most important features for EgoGesture [55].

For the fusion strategy, we find the combination shown as ‘Ego Config 1’ in Fig. 10 to be the most effective, probably because the backbone models share the same architecture. Unlike the experiments on NTU RGB-D [42], which use Inflated ResNet-50 [5] for the RGB videos and the Co-occurrence network [25] for the skeleton modality, the experiments on EgoGesture [55] use a ResNeXt-101 [23] backbone for both the depth videos and the RGB videos. These two backbone models have exactly the same architecture, except for the input channels of the first convolutional layer. Therefore, the depth features and the RGB features probably share the same semantic levels at the same depths, such as Depth_2 and RGB_2 in ‘Ego Config 1’.

b.3 MM-IMDB Dataset

Figure 11: MM-IMDB Config 1, which is the best architecture found by BM-NAS on MM-IMDB [4] dataset.

We do not tune the hyper-parameters extensively on MM-IMDB [4] since it is a relatively simple task compared with NTU RGB-D [42] and EgoGesture [55]. The configuration can be found in Table 9. As shown in Fig. 11, we find Image_2 and Text_0 are the most important modality features. The best fusion operation is ConcatFC for Image_2 and Text_0, and LinearGLU for Cell_0 and Text_0.

It is worth noting that we use the Weighted F1 score (F1-W) as the performance metric, since we perform a multi-label classification task on the MM-IMDB [4] dataset. Although the Macro F1 score (F1-M) is also reported in the paper, we only use F1-W for model selection, because the label distribution of MM-IMDB [4] is highly imbalanced, as illustrated in Table 10. F1-W is therefore a better metric, as F1-M does not take label imbalance into account.
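The gap between the two metrics under imbalance can be seen in a small sketch. The per-label error counts below are our own toy numbers; only the supports (Drama with 13967 samples vs. Reality-TV with 1) come from Table 10.

```python
def f1(tp, fp, fn):
    """Per-label F1 from true positives, false positives, false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_and_weighted_f1(per_label):
    """per_label: list of (support, tp, fp, fn) tuples, one per label.
    Macro-F1 averages label F1s uniformly; weighted-F1 weights by support."""
    f1s = [f1(tp, fp, fn) for _, tp, fp, fn in per_label]
    supports = [s for s, *_ in per_label]
    macro = sum(f1s) / len(f1s)
    weighted = sum(s * v for s, v in zip(supports, f1s)) / sum(supports)
    return macro, weighted

# A frequent label predicted well and a rare label predicted poorly:
labels = [(13967, 12000, 1000, 1967),  # Drama-sized label, decent F1
          (1, 0, 0, 1)]                # Reality-TV-sized label, F1 = 0
macro, weighted = macro_and_weighted_f1(labels)
```

A single one-sample label with zero F1 drags the macro average down by half, while the weighted average stays close to the performance on the bulk of the data, which is why F1-W is the more informative selection metric here.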

Label        #Samples    Label        #Samples
Drama        13967       War          1335
Comedy       8592        History      1143
Romance      5364        Music        1045
Thriller     5192        Animation    997
Crime        3838        Musical      841
Action       3550        Western      705
Adventure    2710        Sport        634
Horror       2703        Short        471
Documentary  2082        Film-Noir    338
Mystery      2057        News         64
Sci-Fi       1991        Adult        4
Fantasy      1933        Talk-Show    2
Family       1668        Reality-TV   1
Biography    1343
Table 10: Label distribution of MM-IMDB[4] dataset.