Deep neural networks (DNNs) have achieved great success on various unimodal tasks (e.g., image categorization [24, 15], language modeling [49, 10], and speech recognition) as well as multimodal tasks (e.g., action recognition [43, 50], image/video captioning [53, 19, 18], visual question answering [32, 3], and cross-modal generation [41, 56]). Despite the superior performance achieved by DNNs on these tasks, adapting them to specific tasks usually requires huge effort. Especially as the number of modalities increases, it is exhausting to manually design the backbone architectures and the feature fusion strategies. This raises an urgent demand for the automatic design of multimodal DNNs with minimal human intervention.
Neural architecture search (NAS) [58, 27] is a promising data-driven solution to this concern that searches for the optimal neural network architecture in a predefined space. Applying NAS to multimodal learning, MMnas searches the architecture of a Transformer model for visual-text alignment, and MMIF searches the optimal CNN structure to extract multi-modality image features for tomography. These methods lack generalization ability since they are designed for models on specific modalities. MFAS is a more generalized framework that searches the feature fusion strategy based on unimodal features. However, MFAS only allows fusion of inter-modal features, and its feature fusion operations are not searchable. This results in a limited space of feature fusion strategies when dealing with the various modalities of different multimodal tasks.
In this paper, we propose a generalized framework, named Bilevel Multimodal Neural Architecture Search (BM-NAS), to adaptively learn the architectures of DNNs for a variety of multimodal tasks. BM-NAS adopts a bilevel searching scheme that learns the unimodal feature selection strategy at the upper level and the multimodal feature fusion strategy at the lower level. As shown in the left part of Fig. 1, the upper level of BM-NAS consists of a series of feature fusion units, i.e., Cells. The Cells combine and transform the unimodal features into the task output through a searchable directed acyclic graph (DAG). The right part of Fig. 1 illustrates the lower level of BM-NAS, which learns the inner structures of the Cells. A Cell is comprised of several predefined primitive operations. We carefully select the primitive operations such that different combinations of them can form a large variety of feature fusion modules, including benchmark attention mechanisms such as the multi-head attention of the Transformer and Attention on Attention (AoA). The bilevel scheme of BM-NAS is learned end-to-end using the differentiable NAS framework. We conduct extensive experiments on three multimodal tasks to evaluate the proposed BM-NAS framework. BM-NAS models show superior performance in comparison with state-of-the-art multimodal learning methods. Compared with existing generalized multimodal NAS frameworks, BM-NAS achieves competitive performance with much less search time and fewer model parameters. To the best of our knowledge, BM-NAS is the first multimodal NAS framework that supports the search of both the unimodal feature selection strategies and the multimodal fusion strategies.
The main contributions of this paper are three-fold.
Towards a more generalized and flexible design of DNNs for multimodal learning, we propose a new paradigm that employs NAS to search both the unimodal feature selection strategy and the multimodal fusion strategy.
We present a novel BM-NAS framework to address the proposed paradigm. BM-NAS makes the architecture of multimodal fusion models fully searchable via a bilevel searching scheme.
We conduct extensive experiments on three multimodal learning tasks to evaluate the proposed BM-NAS framework. Empirical evidence indicates that both the unimodal feature selection strategy and the multimodal fusion strategy are significant to the performance of multimodal DNNs.
2 Related Work
2.1 Neural Architecture Search
Neural architecture search (NAS) aims at automatically finding the optimal neural network architectures for specific learning tasks. NAS can be viewed as a bilevel optimization problem that optimizes the weights and the architecture of a DNN at the same time. Since the network architecture is discrete, traditional NAS methods usually rely on black-box optimization algorithms, resulting in an extremely large computing cost. For example, searching architectures using reinforcement learning or evolution would require thousands of GPU-days to find a state-of-the-art architecture on the ImageNet dataset due to low sampling efficiency.
As a result, many methods have been proposed to speed up NAS. From the engineering perspective, ENAS and NASH improve the sampling efficiency via weight sharing and inheritance, respectively. From the perspective of optimization algorithms, PNAS employs sequential model-based optimization (SMBO), using a surrogate model to predict the performance of an architecture. Monte Carlo tree search (MCTS) and Bayesian optimization (BO) have also been explored to enhance the sampling performance of NAS.
Recently, a remarkable efficiency improvement of NAS was achieved by differentiable architecture search (DARTS). DARTS introduces a continuous relaxation of the network architecture, making it possible to search an architecture via gradient-based optimization. However, directly applying DARTS to multimodal learning is ineffective, since each intermediate node of DARTS only uses a summation of its inputs, which may not be the optimal fusion strategy for specific multimodal tasks. In this work, we devise a novel NAS framework named BM-NAS for multimodal learning. BM-NAS follows the optimization scheme of DARTS, but introduces a novel bilevel searching scheme to search the unimodal feature selection strategy and the multimodal fusion strategy simultaneously, enabling an effective search scheme for multimodal fusion.
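To make the continuous relaxation concrete, here is a minimal NumPy sketch of a DARTS-style mixed operation on one edge; the three candidate operations and the architecture parameter values are illustrative, not the actual pool used by DARTS or BM-NAS:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

# Candidate operations on an edge (illustrative pool).
ops = {
    "zero": lambda x: np.zeros_like(x),
    "identity": lambda x: x,
    "scale": lambda x: 0.5 * x,
}

def mixed_op(x, alpha):
    """DARTS-style mixed operation: a softmax-weighted sum of all
    candidate ops, making the architecture choice differentiable."""
    w = softmax(alpha)
    return sum(wi * op(x) for wi, op in zip(w, ops.values()))

x = np.ones(4)
alpha = np.array([0.1, 2.0, -1.0])  # architecture parameters (learned)
y = mixed_op(x, alpha)

# Discretization: keep only the op with the largest weight.
best = list(ops)[int(np.argmax(alpha))]
```

After search, the soft mixture is discretized by keeping the operation with the largest architecture weight, which here is `identity`.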
2.2 Multimodal Fusion
The multimodal fusion techniques for DNNs can be generally classified into two categories: early fusion and late fusion. Early fusion combines low-level features, while late fusion combines prediction-level features, i.e., the outputs of the last layers of DNNs. To combine these features, a series of reduction operations such as weighted average and bilinear product have been proposed in previous works. As each unimodal backbone could have tens of layers or more, manually sorting out the best intermediate features for multimodal fusion can be exhausting. Therefore, some works enable fusion at multiple intermediate layers. For instance, CentralNet and MMTM join the latent representations at each layer and pass them as auxiliary information to deeper layers. Such methods achieve superior performance on several multimodal tasks including multimodal action recognition and gesture recognition. However, they largely increase the number of parameters of multimodal fusion models.
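As a minimal sketch of the two reduction operations mentioned above (the scalar weight here is an illustrative constant rather than a learned parameter):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(8)   # feature from modality A
b = rng.standard_normal(8)   # feature from modality B

# Weighted average: a convex combination of the two features,
# with the weight normally learned during training.
w = 0.7                      # illustrative weight
avg_fused = w * a + (1 - w) * b

# Bilinear product: captures pairwise interactions between all
# feature dimensions, then flattens to a single fused vector.
bilinear_fused = np.outer(a, b).reshape(-1)
```

Note the trade-off: the weighted average keeps the feature dimension, while the bilinear product grows quadratically with it, which is one reason fusion operators are usually followed by a dimensionality reduction.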
In recent years, there has been increased interest in introducing attention mechanisms such as the Transformer to multimodal learning. The multimodal-BERT family [7, 26, 31, 46] is a typical approach for inter-modal fusion. Moreover, DFAF shows that intra-modal fusion can also be helpful. DFAF proposes a dynamic attention flow module that mixes inter-modal and intra-modal features together through multi-head attention. Additional efforts have been made to enhance the multimodal fusion efficacy of attention mechanisms. For instance, AoANet proposes the attention on attention (AoA) module, showing that adding an attention operation on top of another one achieves better performance on the image captioning task.
Recently, NAS approaches have made exciting progress for DNNs, showing huge potential for multimodal learning. One representative work is MFAS, which employs the SMBO algorithm to search multimodal fusion strategies given the unimodal backbones. But as SMBO is a black-box optimization algorithm, every update step requires a batch of DNNs to be trained, making MFAS inefficient. Besides, MFAS only uses concatenation and fully connected (FC) layers for unimodal feature fusion, and the stack of FC layers is a heavy computing burden. Further works including MMnas and MMIF adopt the efficient DARTS algorithm for architecture search. MMIF follows the original DARTS framework, which only supports the search of unary operations on graph edges and uses summation on every intermediate node for reduction. MMnas allows searching the attention operations, but the topological structure of the network is fixed during architecture search.
Different from these related works, our proposed BM-NAS supports searching both the unimodal feature selection strategy and the fusion strategy of multimodal DNNs. BM-NAS introduces a bilevel searching scheme. The upper level of BM-NAS supports both intra-modal and inter-modal feature selection. The lower level of BM-NAS searches the fusion operations within every intermediate step. Each step can flexibly form summation, concatenation, multi-head attention, attention on attention, or other yet unexplored fusion mechanisms. BM-NAS is a generalized and efficient NAS framework for multimodal learning. In experiments, we show that BM-NAS can be applied to various multimodal tasks regardless of the modalities or backbone models.
In this work, we propose a generalized NAS framework, named Bilevel Multimodal NAS (BM-NAS), to search the architectures of multimodal fusion DNNs. More specifically, BM-NAS searches a Cell-by-Cell architecture in a bilevel fashion. The upper level architecture is a directed acyclic graph (DAG) over the input features and the Cells. The lower level architecture is a DAG over the inner step nodes within a Cell. Each inner step node is a bivariate operation drawn from a predefined pool. The bilevel searching scheme ensures that BM-NAS can be easily adapted to various multimodal learning tasks regardless of the types of modalities. In the following, we discuss the unimodal feature extraction in Section 3.1, the upper and lower levels of BM-NAS in Sections 3.2 and 3.3, and the architecture search algorithm and evaluation in Section 3.4.
3.1 Unimodal Feature Extraction
Following previous multimodal fusion works such as CentralNet, MFAS, and MMTM, we employ pretrained unimodal backbone models as feature extractors. We use the outputs of their intermediate layers (or intermediate blocks, if the model has a block-by-block structure like ResNeXt) as raw features.
Since the raw features vary in shape, we reshape them by applying pooling, interpolation, and fully connected layers on the spatial, temporal, and channel dimensions, successively. By doing so, we reshape all the raw features to the shape of (B, C, L), such that we can easily perform fusion operations between features of different modalities. Here B is the batch size, C is the embedding dimension or the number of channels, and L is the sequence length.
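A minimal NumPy sketch of this reshaping pipeline, assuming a raw video-like feature of shape (B, C_in, T, H, W); the target sizes and the random projection standing in for the fully connected layer are illustrative:

```python
import numpy as np

def reshape_feature(raw, C=64, L=8, rng=np.random.default_rng(0)):
    """Reshape a raw backbone feature of shape (B, C_in, T, H, W)
    to the common shape (B, C, L): pool the spatial dims, linearly
    interpolate the temporal dim to length L, then project channels
    with a fully connected layer (random weights as a stand-in)."""
    B, C_in, T, H, W = raw.shape
    x = raw.mean(axis=(3, 4))                     # spatial pooling -> (B, C_in, T)

    # Linear interpolation of the temporal axis to length L.
    src = np.linspace(0, T - 1, L)
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, T - 1)
    frac = src - lo
    x = x[:, :, lo] * (1 - frac) + x[:, :, hi] * frac  # (B, C_in, L)

    W_fc = rng.standard_normal((C, C_in)) / np.sqrt(C_in)
    return np.einsum("oc,bcl->bol", W_fc, x)      # channel projection -> (B, C, L)

feat = reshape_feature(np.random.default_rng(1).standard_normal((2, 256, 16, 7, 7)))
```

Once every modality's raw feature is mapped to (B, C, L), any bivariate fusion operation can be applied to any pair of features without shape checks.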
3.2 Upper Level: Cells for Feature Selection
The upper level of BM-NAS searches the unimodal feature selection strategy; it consists of a group of Cells. Formally, suppose we have two modalities A and B, with a pretrained unimodal model for each. The modality features extracted by the backbone models, together with the Cells, form the upper level nodes in an ordered sequence, as
Under this setting, both inter-modal fusion and intra-modal fusion are considered in BM-NAS.
Feature selection. By adopting the continuous relaxation of the differentiable architecture search scheme, all predecessors of a Cell are connected to it through weighted edges at the searching stage. The resulting directed complete graph between the Cells is called the hypernet. Each edge between two upper level nodes is associated with an edge weight, and carries a unary operation selected from a function set including
(1) Identity, i.e., selecting the edge.
(2) Zero, i.e., discarding the edge.
Then, the mixed edge operation on an edge is
A Cell receives inputs from all its predecessors, as
In the evaluation stage, the network architecture is discretized: an input pair will be selected for a Cell (we enforce the Cells to have different predecessors) if
It is worth noting that, compared with searching the feature pairs directly, the Cell-by-Cell structure significantly reduces the complexity of the search space for unimodal feature selection. For an input pair drawn from two feature sequences, the number of candidate choices under the Cell-by-Cell search setting is much smaller than that under the pairwise search setting.
3.3 Lower Level: Multimodal Fusion Strategy
The lower level of BM-NAS searches the multimodal fusion strategy, i.e., the inner structure of the Cells. Specifically, a Cell is a DAG consisting of a set of inner step nodes. The inner step nodes are primitive operations drawn from a predefined operation pool, which we introduce in the following.
All the primitive operations take two tensor inputs and output a tensor of the same shape, so that they can be composed in arbitrary order.
(1) Zero: this operation discards an inner step completely. It is helpful when BM-NAS decides to use only a part of the inner steps.
(2) Sum: the DARTS framework uses summation to combine two features as
(3) Attention: we use the scaled dot-product attention. As a standard attention module takes three inputs, namely the query, key, and value, we let the query be one input and the key and value both be the other input, which is also known as guided-attention.
(4) LinearGLU: a linear layer with the gated linear unit (GLU), which gates the linear projection of one input with the other through element-wise multiplication.
(5) ConcatFC: passing the concatenation of the two inputs to a fully connected (FC) layer with ReLU activation. The FC layer reduces the number of channels of the concatenated feature back to that of a single input.
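The operations above can be sketched as follows for a single sample; this ignores the Zero operation and the batch dimension, and the projection matrices and gating details are simplifying assumptions rather than the exact parameterization of the paper:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def softmax(v, axis=-1):
    e = np.exp(v - v.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

C, L = 4, 3                          # channels and sequence length (illustrative)
rng = np.random.default_rng(0)
x, y = rng.standard_normal((C, L)), rng.standard_normal((C, L))

def op_sum(x, y):
    # Sum: element-wise addition, as in DARTS.
    return x + y

def op_attention(x, y):
    # Scaled dot-product attention with x as query, y as key and value.
    scores = softmax(x.T @ y / np.sqrt(C), axis=-1)   # (L, L) attention map
    return y @ scores.T                               # (C, L)

def op_linear_glu(x, y, Wx=np.eye(C), Wy=np.eye(C)):
    # LinearGLU: linear projection of x gated by a sigmoid of y.
    return (Wx @ x) * sigmoid(Wy @ y)

def op_concat_fc(x, y, W=np.ones((C, 2 * C)) / (2 * C)):
    # ConcatFC: concatenate along channels, then FC + ReLU back to C channels.
    return np.maximum(W @ np.concatenate([x, y], axis=0), 0.0)

outs = [op(x, y) for op in (op_sum, op_attention, op_linear_glu, op_concat_fc)]
```

Every operation maps two (C, L) inputs to one (C, L) output, which is what allows the inner steps to be stacked in any searched order.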
We deliberately choose these primitive operations such that they can be flexibly combined to form various feature fusion modules. In Fig. 3, we show that the search space of the lower level of BM-NAS accommodates many benchmark multimodal fusion strategies, such as the summation used in DARTS, the ConcatFC used in MFAS, the multi-head attention used in the Transformer, and the Attention on Attention used in AoANet. There also remains flexibility to discover other, better fusion modules for specific multimodal learning tasks.
Fusion strategy. In the searching stage, the inner steps of a Cell form an ordered feature sequence,
An inner step node transforms its two input nodes into its output through a weighted average over the primitive operation pool, as
where the weights of the primitive operations are architecture parameters. In the evaluation stage, the optimal operation of an inner step node is derived as
The continuous relaxation of the weighted edges between inner step nodes is similar to that of the upper level of BM-NAS. For simplicity, we omit the formulation in this paper. Note that unlike the upper level, the pairwise inputs within a Cell can be chosen repeatedly (we do not enforce the step nodes to have different predecessors), so the inner steps can form structures like multi-head attention.
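A minimal sketch of how an inner step mixes and then discretizes its primitive operations; the toy operation pool here stands in for the actual Zero/Sum/Attention/LinearGLU/ConcatFC pool:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

# Illustrative primitive operation pool for one inner step node.
pool = {
    "zero": lambda x, y: np.zeros_like(x),
    "sum":  lambda x, y: x + y,
    "max":  lambda x, y: np.maximum(x, y),
}

def inner_step(x, y, beta):
    # Searching stage: softmax-weighted average over the pool,
    # keeping the operation choice differentiable in beta.
    w = softmax(beta)
    return sum(wi * op(x, y) for wi, op in zip(w, pool.values()))

def discretize(beta):
    # Evaluation stage: keep only the highest-weighted primitive op.
    return list(pool)[int(np.argmax(beta))]

x, y = np.ones(4), 2 * np.ones(4)
beta = np.array([-1.0, 0.5, 2.0])
mixed = inner_step(x, y, beta)
best = discretize(beta)
```

Because the inputs of a step may repeat, several such steps applied to the same pair can emulate multi-head structures.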
3.4 Architecture Search and Evaluation
Search algorithm. In Sections 3.2 and 3.3, we introduced three sets of variables as the architecture parameters. Algorithm 1 shows the searching process of BM-NAS, which follows DARTS to alternately optimize the architecture parameters and the model weights. In Algorithm 1, the model in the searching stage is called the hypernet since all the edges and nodes are mixed operations. The searched structure description of the fusion network is called the genotype.
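The alternating optimization can be sketched with a toy bilevel problem; the quadratic losses below are stand-ins for the hypernet's training and validation losses, not the actual objectives:

```python
import numpy as np

# Alternating (bilevel) update used during search: architecture
# parameters alpha are updated on the validation split, model
# weights w on the training split.
rng = np.random.default_rng(0)
w = rng.standard_normal(3)        # model weights
alpha = rng.standard_normal(3)    # architecture parameters

def train_loss_grad(w, alpha):
    return 2 * (w - alpha)        # d/dw of ||w - alpha||^2

def val_loss_grad(w, alpha):
    return 2 * (alpha - 0.5 * w)  # d/dalpha of ||alpha - 0.5 w||^2

lr = 0.1
for step in range(200):
    alpha -= lr * val_loss_grad(w, alpha)   # architecture step (validation)
    w -= lr * train_loss_grad(w, alpha)     # weight step (training)
```

For these toy losses the two update rules drive both parameter vectors toward a shared fixed point; in the real algorithm the same alternation trades off fitting the training data against selecting an architecture that generalizes to the validation split.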
Implementation details. In order to make the whole BM-NAS framework searchable and flexible, the Cells and inner step nodes should have the same number of inputs and outputs, so they can be put together in arbitrary topological order. The two-input setting follows the benchmark NAS frameworks (DARTS, MFAS, and MMIF), all of which have only two searchable inputs for each Cell or step node. Also, it requires no extra effort to let the Cells or step nodes support three or more inputs, by simply adding ternary (or other arbitrary) operations into the primitive operation pool.
Evaluation. In architecture evaluation, we select the genotype with the best validation performance as the searched fusion network. Then we combine the training and validation sets together to train the unimodal backbones and the searched fusion network jointly.
In this work, we evaluate BM-NAS on three multimodal tasks: (1) multi-label movie genre classification on the MM-IMDB dataset, (2) multimodal action recognition on the NTU RGB-D dataset, and (3) multimodal gesture recognition on the EgoGesture dataset. Examples of these tasks are shown in Fig. 4. In the following, we discuss the experiments on the three tasks in Sections 4.1, 4.2, and 4.3, respectively. We analyze computing efficiency in Section 4.4, and further evaluate the search strategies of the proposed BM-NAS framework in Sections 4.5 and 4.6.
4.1 MM-IMDB Dataset
The MM-IMDB dataset is a multimodal dataset collected from the Internet Movie Database, containing posters, plots, genres, and other meta information of 25,959 movies. We conduct multi-label genre classification on MM-IMDB using posters (RGB images) and plots (text) as the input modalities. There are 27 non-mutually exclusive genres in total, including Drama, Comedy, Romance, etc. Since the number of samples in each class is highly imbalanced, we only use 23 genres for classification. The classes News, Adult, Talk-Show, and Reality-TV are discarded since together they account for only 0.10% of the samples. We adopt the original split of the dataset, where 15,552 movies are used for training, 2,608 for validation, and 7,799 for testing.
For a fair comparison with other explicit multimodal fusion methods, we use the same backbone models. Specifically, we use Maxout MLP  as the backbone of text modality and VGG Transfer  as the backbone of RGB image modality. For BM-NAS, we adopt a setting of 2 fusion Cells and 1 step/Cell. For inner step representations, we set .
Table 1 shows that BM-NAS achieves the best Weighted F1 score in comparison with existing multimodal fusion methods. Notice that, as the class distribution of MM-IMDB is highly imbalanced, the Weighted F1 score is in fact a more reliable metric for multi-label classification than other variants of the F1 score.
4.2 NTU RGB-D Dataset
| Method | Modalities | Weighted F1 (%) |
|---|---|---|
| Maxout MLP | Text | 57.54 |
| VGG Transfer | Image | 49.21 |
| Two-stream | Image + Text | 60.81 |
| GMU | Image + Text | 61.70 |
| CentralNet | Image + Text | 62.23 |
| MFAS | Image + Text | 62.50 |
| BM-NAS (ours) | Image + Text | 62.92 ± 0.03 |
| Method | Modalities | CS Accuracy (%) |
|---|---|---|
| Inflated ResNet-50 | Video | 83.91 |
| Two-stream | Video + Pose | 88.60 |
| GMU | Video + Pose | 85.80 |
| MMTM | Video + Pose | 88.92 |
| CentralNet | Video + Pose | 89.36 |
| MFAS | Video + Pose | 89.50 |
| BM-NAS (ours) | Video + Pose | 90.48 ± 0.24 |
The NTU RGB-D dataset is a large-scale multimodal action recognition dataset, containing a total of 56,880 samples with 40 subjects, 80 viewpoints, and 60 classes of daily activities. In this work, we use the skeleton and RGB video modalities for the fusion experiments. We measure performance using cross-subject (CS) accuracy. We follow the dataset split of MFAS: subjects 1, 4, 8, 13, 15, 17, 19 for training; subjects 2, 5, 9, 14 for validation; and the rest for testing. There are 23,760, 2,519, and 16,558 samples in the training, validation, and test sets, respectively.
For a fair comparison, we use two CNN models as backbones, the Inflated ResNet-50 for the video modality and Co-occurrence for the skeleton modality, ensuring all the methods in our experiments share the same backbones. We test the performances of MFAS, MMTM, and the proposed BM-NAS using our data preprocessing pipeline, so the performances of these methods differ from those originally reported. For BM-NAS, we use 2 fusion Cells and 2 steps/Cell. For inner step representations we set .
Compared with MFAS, our framework has several advantages. First, MFAS discards too much information by reshaping all the modality features with global average pooling: for video features, the spatial and temporal dimensions are pooled down to a single number. By contrast, we only pool the spatial dimensions, and the temporal dimension is interpolated to the sequence length required by the attention modules. Second, the feature selection strategy of MFAS may be problematic. We found that the feature pair (Video_4, Skeleton_4) is included in all five architectures provided by MFAS. In Table 6, the performance of MFAS is the same as that of a late fusion model which simply concatenates (Video_4, Skeleton_4), suggesting that the other feature pairs selected by MFAS may be useless. Also, MFAS forces the features to be selected across modalities; however, in our experiments we found that BM-NAS favors the feature pair (Video_3, Video_4) the most. As addressed in DFAF, intra-modal fusion can also be helpful.
4.3 EgoGesture Dataset
The EgoGesture dataset is a large-scale multimodal gesture recognition dataset, containing 24,161 gesture samples of 83 classes collected from 50 distinct subjects in 6 different scenes. We follow the original cross-subject split of the EgoGesture dataset: 14,416 samples for training, 4,768 for validation, and 4,977 for testing.
We use ResNeXt-101 as the backbone for both the RGB and depth video modalities. As former works like CentralNet and MFAS did not perform experiments on this dataset, we compare our method with other unimodal and multimodal methods. For BM-NAS, we use 2 fusion Cells and 3 steps/Cell. For inner step representations we set .
Table 3 reports the experimental results on the EgoGesture dataset. Compared with other unimodal and multimodal methods, BM-NAS achieves state-of-the-art fusion performance, showing that BM-NAS is effective in enhancing gesture recognition on EgoGesture.
| Method | Modalities | Accuracy (%) |
|---|---|---|
| VGG-16 + LSTM | RGB | 74.70 |
| C3D + LSTM + RSTTM | RGB | 89.30 |
| VGG-16 + LSTM | Depth | 77.70 |
| C3D + LSTM + RSTTM | Depth | 90.60 |
| VGG-16 + LSTM | RGB + Depth | 81.40 |
| C3D + LSTM + RSTTM | RGB + Depth | 92.20 |
| I3D | RGB + Depth | 92.78 |
| MMTM | RGB + Depth | 93.51 |
| MTUT | RGB + Depth | 93.87 |
| BM-NAS (ours) | RGB + Depth | 94.96 ± 0.07 |
4.4 Computing Efficiency
Model size. Table 4 compares the model sizes of different multimodal fusion methods on NTU RGB-D. All the three methods share exactly the same unimodal backbones. Compared with the manually designed fusion model MMTM  and the fusion model searched by MFAS , our BM-NAS achieves a better performance with fewer model parameters.
Search cost. Table 5 compares the search cost of generalized multimodal NAS frameworks including MFAS  and our BM-NAS. Thanks to the efficient differentiable architecture search framework , BM-NAS is about 10x faster than MFAS  when searching on MM-IMDB  and NTU RGB-D .
| Method | Dataset | Params | CS Accuracy (%) |
|---|---|---|---|
| MMTM | NTU | 8.61 M | 88.92 |
| MFAS | NTU | 2.16 M | 89.50 |
| BM-NAS (ours) | NTU | 0.98 M | 90.48 |
4.5 Ablation Study
In this section, we conduct ablation studies to verify the effectiveness of the unimodal feature selection strategy and the multimodal fusion strategy, respectively.
Unimodal feature selection. Table 6 compares different unimodal feature selection strategies on the NTU RGB-D dataset. We compare the best strategy found by BM-NAS against random selection, late fusion, and the best strategy found by MFAS. For all the random baselines, the inner structures of the Cells are the same; we randomly select the input features and the connections between Cells, and report the result averaged over 5 trials. For the late fusion baseline, we concatenate the outputs of the last layers of the backbones, i.e., the features Video_4 and Skeleton_4 in Fig. 5. MFAS selects four feature pairs: (Video_4, Skeleton_4), (Video_2, Skeleton_4), (Video_2, Skeleton_2), and (Video_4, Skeleton_4). As shown in Table 6, the searched feature selection strategy is better than all baselines, demonstrating that a better unimodal feature selection strategy benefits the multimodal fusion performance.
| Feature selection strategy | Dataset | CS Accuracy (%) |
|---|---|---|
| Searched (MFAS) | NTU | 89.50 |
Multimodal fusion strategy. Table 7 evaluates different multimodal fusion strategies on the NTU RGB-D dataset. All strategies in Table 7 adopt the same feature selection strategy. We compare the best Cell structure found by BM-NAS against the summation used in DARTS, the ConcatFC used in MFAS, the multi-head attention (MHA) used in the Transformer, and the attention on attention (AoA) used in AoANet. All these fusion strategies can be formed as certain combinations of our predefined primitive operations, as shown in Fig. 3. In Table 7, the fusion strategy derived by BM-NAS outperforms the baseline strategies, showing the effectiveness of searching the fusion strategy for multimodal fusion models.
4.6 Search Configurations
To better understand the proposed BM-NAS framework, we empirically study various search configurations of BM-NAS on NTU RGB-D and EgoGesture. The configurations include the number of Cells, the number of steps, and the inner representation size. We list the top-4 configurations in Table 8, together with the validation/test accuracies.
Table 8 suggests a good choice of configuration. We find that under some settings, BM-NAS leans toward the late fusion strategy (i.e., selecting the last features of the backbones). But as shown in Table 6, late fusion may not be the best choice. Regarding the number of steps, even a small number of steps already covers many existing fusion strategies (as shown in Fig. 3), while a larger number gives a slightly larger search space. We observe that larger numbers of Cells and steps may easily lead to overfitting, as they multiply the total number of inner steps.
Using the search configurations in Table 8, Fig. 6 shows the validation accuracies of the hypernets during search. Fig. 7 further compares the performances of the hypernets with those of the sampled architectures. Figs. 6 and 7 show that the performances of different search configurations of BM-NAS are consistent between searching and evaluation. This suggests that we can select good search configurations according to the validation performance of the hypernets, instead of performing additional evaluation on the test set with the sampled architectures.
| Dataset | ID | N | M | C | L | Val Acc | Test Acc |
|---|---|---|---|---|---|---|---|
In this paper, we have presented BM-NAS, a novel multimodal NAS framework that learns the architectures of multimodal fusion models via a bilevel searching scheme. To the best of our knowledge, BM-NAS is the first NAS framework that supports searching both the unimodal feature selection strategies and the multimodal fusion strategies for multimodal DNNs. In experiments, we have demonstrated the effectiveness and efficiency of BM-NAS on three different multimodal learning tasks.
-  (2019) Improving the performance of unimodal dynamic hand-gesture recognition with multimodal training. In CVPR, pp. 1165–1174. Cited by: Table 3.
-  (2016) Deep speech 2: end-to-end speech recognition in english and mandarin. In ICML, pp. 173–182. Cited by: §1.
-  (2018) Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, pp. 6077–6086. Cited by: §1.
-  (2017) Gated multimodal units for information fusion. arXiv preprint arXiv:1702.01992. Cited by: §A.2, §A.2, Table 9, Figure 11, §B.3, §B.3, Table 10, §4.1, §4.4, Table 1, Table 2, Table 3, §4.
-  (2018) Glimpse clouds: human activity recognition from unstructured feature points. In CVPR, pp. 469–478. Cited by: §B.2, §4.2, Table 2.
-  (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, pp. 6299–6308. Cited by: Table 3.
-  (2019) Uniter: learning universal image-text representations. arXiv preprint arXiv:1909.11740. Cited by: §2.2.
-  (2017) Language modeling with gated convolutional networks. In ICML, pp. 933–941. Cited by: §3.3.
-  (2009) Imagenet: a large-scale hierarchical image database. In CVPR, pp. 248–255. Cited by: §2.1.
-  (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1.
-  (2017) Simple and efficient architecture search for convolutional neural networks. arXiv preprint arXiv:1711.04528. Cited by: §2.1.
-  (2019) Dynamic fusion with intra-and inter-modality attention flow for visual question answering. In CVPR, pp. 6639–6648. Cited by: §2.2, §4.2.
-  (2013) Maxout networks. In ICML, pp. 1319–1327. Cited by: §4.1, Table 1.
-  (2019) Progression modelling for online and early gesture detection. In 3DV, pp. 289–297. Cited by: Table 3.
-  (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §1.
-  (2019) Attention on attention for image captioning. In ICCV, pp. 4634–4643. Cited by: §B.1, §1, §2.2, §2.2, §3.3, §4.5, Table 7.
-  (2011) Sequential model-based optimization for general algorithm configuration. In International conference on learning and intelligent optimization, pp. 507–523. Cited by: §2.1, §2.2.
-  (2020) SBAT: video captioning with sparse boundary-aware transformer. In IJCAI, Cited by: §1.
-  (2019) Low-rank hoca: efficient high-order cross-modal attention for video captioning. In EMNLP, pp. 2001–2011. Cited by: §1.
-  (2020) MMTM: multimodal transfer module for cnn fusion. In CVPR, pp. 13289–13299. Cited by: §2.2, §3.1, §4.2, §4.4, Table 2, Table 3, Table 4.
-  (2018) Neural architecture search with bayesian optimisation and optimal transport. In NeurIPS, pp. 2016–2025. Cited by: §2.1.
-  (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §A.2, §A.2.
-  (2019) Real-time hand gesture detection and classification using convolutional neural networks. In FG, pp. 1–8. Cited by: §B.2, §4.3, Table 3.
-  (2012) ImageNet classification with deep convolutional neural networks. In NeurIPS, Cited by: §1.
-  (2018) Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation. arXiv preprint arXiv:1804.06055. Cited by: §B.2, §4.2, Table 2.
-  (2019) Visualbert: a simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557. Cited by: §2.2.
-  (2018) Progressive neural architecture search. In ECCV, pp. 19–34. Cited by: §1, §2.1.
-  (2018) Darts: differentiable architecture search. arXiv preprint arXiv:1806.09055. Cited by: §1, §2.1, §2.2, §3.2, §3.3, §3.3, §3.4, §3.4, §4.4, §4.5, Table 7.
-  (2016) SGDR: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983. Cited by: §A.2.
-  (2019) Vilbert: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS, pp. 13–23. Cited by: §2.2.
-  (2016) Hierarchical question-image co-attention for visual question answering. In NeurIPS, pp. 289–297. Cited by: §1.
-  (2016) Online detection and classification of dynamic hand gestures with recurrent 3d convolutional neural network. In CVPR, pp. 4207–4215. Cited by: Table 3.
-  (2010) Rectified linear units improve restricted boltzmann machines. In ICML, Cited by: §3.3.
-  (2012) Multimodal feature fusion for robust event detection in web videos. In CVPR, pp. 1298–1305. Cited by: §2.2.
-  (2017) Deeparchitect: automatically designing and training deep architectures. arXiv preprint arXiv:1704.08792. Cited by: §2.1.
-  (2020) Multi-modality information fusion for radiomics-based neural architecture search. In MICCAI, pp. 763–771. Cited by: §1, §2.2, §3.4.
-  (2019) Mfas: multimodal fusion architecture search. In CVPR, pp. 6966–6975. Cited by: §B.1, §1, §2.2, §3.1, §3.3, §3.4, §4.2, §4.2, §4.2, §4.3, §4.4, §4.4, §4.5, §4.5, Table 1, Table 2, Table 4, Table 5, Table 6, Table 7.
-  (2018) Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268. Cited by: §2.1.
-  (2019) Regularized evolution for image classifier architecture search. In AAAI, Vol. 33, pp. 4780–4789. Cited by: §2.1.
-  (2016) Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396. Cited by: §1.
-  (2016) Ntu rgb+d: a large scale dataset for 3d human activity analysis. In CVPR, pp. 1010–1019. Cited by: §A.2, §A.2, Table 9, Figure 9, §B.1, §B.2, §B.3, §2.2, Figure 3, Figure 5, Figure 6, §4.2, §4.2, §4.4, §4.5, §4.5, §4.6, Table 2, Table 4, Table 8, §4.
-  (2014) Two-stream convolutional networks for action recognition in videos. In NeurIPS, pp. 568–576. Cited by: §1, Table 1, Table 2, Table 3.
-  (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §4.1, Table 1.
-  (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: §A.2.
-  (2019) Lxmert: learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490. Cited by: §2.2.
-  (2018) Tips and tricks for visual question answering: learnings from the 2017 challenge. In CVPR, pp. 4223–4232. Cited by: §2.2.
-  (2015) Learning spatiotemporal features with 3d convolutional networks. In ICCV, pp. 4489–4497. Cited by: Table 3.
-  (2017) Attention is all you need. In NeurIPS, pp. 5998–6008. Cited by: §1, §1, §2.2, §2.2, Figure 3, §3.3, §3.3, §3.3, §4.5, Table 7.
-  (2018) Centralnet: a multilayer approach for multimodal fusion. In ECCV, pp. 0–0. Cited by: §1, §2.2, §3.1, §4.3, Table 1, Table 2.
-  (2017) Aggregated residual transformations for deep neural networks. In CVPR, pp. 1492–1500. Cited by: §3.1.
-  (2014) Super normal vector for activity recognition using depth sequences. In CVPR, pp. 804–811. Cited by: Table 3.
-  (2016) Image captioning with semantic attention. In CVPR, pp. 4651–4659. Cited by: §1.
-  (2020) Deep multimodal neural architecture search. arXiv preprint arXiv:2004.12070. Cited by: §1, §2.2, §3.3.
-  (2018) Egogesture: a new dataset and benchmark for egocentric hand gesture recognition. IEEE Transactions on Multimedia 20 (5), pp. 1038–1050. Cited by: §A.2, §A.2, Table 9, Figure 10, §B.2, §B.2, §B.2, §B.3, §2.2, Figure 6, §4.3, §4.3, §4.6, Table 8, §4.
-  (2019) Text guided person image synthesis. In CVPR, pp. 3663–3672. Cited by: §1.
-  (2017) Neural architecture search with reinforcement learning. In ICLR, Cited by: §1, §2.1.
Appendix A Learning Details
a.1 More Details on Architecture Parameters
The roles of the two sets of architecture parameters, i.e., the weights of the primitive operations and the weights of the inner step node edges, are illustrated in Fig. 8. The edge weights perform feature selection within a cell, selecting two inputs for each inner step node, while the operation weights select the primitive operation applied at each inner step node.
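To make the two selection mechanisms concrete, the following is a minimal sketch of DARTS-style discretization: continuous architecture weights are relaxed with a softmax, and the final architecture keeps the top-scoring inputs and operation. The function names, toy weights, and operation names are illustrative, not the actual implementation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def select_inputs(edge_weights, num_inputs=2):
    """Relax the edge weights with a softmax, then keep the top-k
    candidate features as the inputs of an inner step node."""
    probs = softmax(edge_weights)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return ranked[:num_inputs]

def select_operation(op_weights, op_names):
    """Keep the primitive operation with the largest architecture weight."""
    probs = softmax(op_weights)
    return op_names[max(range(len(probs)), key=lambda i: probs[i])]

# Toy example: 4 candidate input features, 3 primitive operations.
edges = [0.1, 2.0, -0.5, 1.2]   # learnable edge weights (feature selection)
ops = [0.3, 1.5, -1.0]          # learnable operation weights
print(select_inputs(edges))     # indices of the two selected inputs
print(select_operation(ops, ["ConcatFC", "LinearGLU", "Attention"]))
```

During search the softmax keeps everything differentiable; the argmax/top-k step above is only applied once, when the final discrete architecture is derived.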
Cells and steps. C is the number of channels and L is the length. In the paper, we refer to (C, L) as the inner representation size. N is the number of cells, and M is the number of steps in each cell.
Basic training settings. Ep is the number of epochs during the searching stage. In the evaluation stage, it could be larger, and we roughly set it within a moderate range in the experiments. BS is the batch size and Drpt is the Dropout rate . BS and Drpt are the same for both the searching stage and the evaluation stage.
Architecture optimization. For architecture parameter optimization, we use the Adam  optimizer. The architecture parameters control the structures of the cells and steps, as described in the paper. LR is the learning rate. L2 is the weight decay term.
Network optimization. For the network parameters, we use the Adam  optimizer with a Cosine Annealing scheduler . The network parameters are the trainable parameters of the fusion network, including the reshaping layers, the cells, and the classifier. MaxLR and MinLR are the learning rate boundaries used by the Cosine Annealing scheduler . L2 is the weight decay term.
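The role of MaxLR and MinLR can be sketched with the standard cosine annealing schedule (SGDR without warm restarts); the function below is a plain-Python illustration of the schedule, not the scheduler implementation used in the experiments.

```python
import math

def cosine_annealing_lr(step, total_steps, max_lr, min_lr):
    """Cosine annealing from max_lr down to min_lr over total_steps
    (the SGDR schedule without warm restarts)."""
    cos = math.cos(math.pi * step / total_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + cos)

# The learning rate starts at MaxLR, ends at MinLR, and is halfway
# in between at the midpoint of training.
total = 100
print(cosine_annealing_lr(0, total, 1e-3, 1e-5))    # MaxLR
print(cosine_annealing_lr(50, total, 1e-3, 1e-5))   # midpoint
print(cosine_annealing_lr(100, total, 1e-3, 1e-5))  # MinLR
```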
Model size and search time. Model Size is the total number of parameters of the fusion network (in millions), excluding the backbone models. Search Time is the time taken for the searching stage (GPU·hours). We use 8 NVIDIA M40 GPUs in our experiments.
Searching and evaluation scores. The Search Score is the performance of the hypernet on the validation set during the searching stage. The Eval Score is the performance of the fusion network on the test set during the evaluation stage. Since the MM-IMDB  dataset involves a multi-label classification task, we use the Weighted F1 score (F1-W) as the performance metric. For the NTU RGB-D dataset and the EgoGesture  dataset, we use the classification accuracy (%) as the metric.
Appendix B Discovered Architectures
b.1 NTU RGB-D Dataset
We tune the hyperparameters extensively on the NTU RGB-D  dataset. The top-4 configurations are shown in Table 9, and the architectures found under these configurations are shown in Fig. 9. 'NTU Config 1' is the best architecture found by our BM-NAS framework.
For the feature selection strategy, we find that Video_3, Video_4, and Skeleton_4 are always selected by our BM-NAS framework no matter how many cells and steps are used, which indicates that they are the most effective modality features. In particular, Video_3 is strongly favored in all the found architectures. MFAS  also selects Video_4 and Skeleton_4 in every found architecture, but it does not pay much attention to Video_3.
For the fusion strategy, we find that adding more inner steps (increasing M) is more effective than adding more cells (increasing N). However, since we have N × M steps in total, setting N or M too large would easily lead to overfitting. Roughly, we find that setting N = 2 and M = 2 is a good option. N = 2 means we have two different feature pairs for the cells, which is sufficient to cover the three most important features Video_3, Video_4, and Skeleton_4. And M = 2 is sufficient for BM-NAS to form all the fusion strategies, such as concatenation and attention on attention (AoA) , as shown in the paper. The best fusion strategy found by BM-NAS on NTU RGB-D is very similar to an AoA  module; see 'NTU Config 1' in Fig. 9.
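As a rough intuition for why a two-step cell can express an AoA-style fusion, the following toy sketch composes scaled dot-product attention with a sigmoid gate over the attended result and the query, mirroring the AoA structure. The identity "information" projection and the elementwise gate stand in for the learned linear layers of the real module; all weights and inputs here are illustrative.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(q, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(q))
    weights = softmax([dot(q, k) / scale for k in keys])
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def attention_on_attention(q, keys, values):
    """AoA-style fusion: a second gating 'attention' applied on top of the
    first attention result. The projections are toy stand-ins for the
    learned W_i, W_g linear layers of the real AoA module."""
    attended = attention(q, keys, values)           # step 1: attention
    info = [a + x for a, x in zip(attended, q)]     # toy information vector
    gate = [sigmoid(a * x) for a, x in zip(attended, q)]  # toy gate
    return [g * i for g, i in zip(gate, info)]      # step 2: gated output

q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[2.0, 0.0], [0.0, 2.0]]
print(attention_on_attention(q, keys, values))
```

The point is structural: one inner step produces the attention output, and a second step gates it, so M = 2 steps already suffice for this family of fusion modules.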
b.2 EgoGesture Dataset
For the experiments on the EgoGesture  dataset, we basically follow the same settings as those on the NTU RGB-D dataset. The top-4 configurations are shown in Table 9, and the architectures found under these configurations are shown in Fig. 10. 'Ego Config 1' is the best architecture found by our BM-NAS framework.
For the feature selection strategy, we find that Depth_1, Depth_2, and RGB_2 are the most important features for EgoGesture .
For the fusion strategy, we find that the combination shown as 'Ego Config 1' in Fig. 10 is the most effective, probably because the backbone models share the same architecture. Unlike the experiments on NTU RGB-D, which use an Inflated ResNet-50  for the RGB videos and the Co-occurrence  model for the skeleton modality, EgoGesture  uses a ResNeXt-101  backbone for both the depth videos and the RGB videos. These two backbone models have exactly the same architecture except for the input channels of the first convolutional layer. Therefore, the depth features and the RGB features probably share the same semantic levels at the same depths, such as Depth_2 and RGB_2 in 'Ego Config 1'.
b.3 MM-IMDB Dataset
We do not tune the hyperparameters extensively on MM-IMDB  since it is a relatively simple task compared with NTU RGB-D  and EgoGesture . The configuration can be found in Table 9. As shown in Fig. 11, we find that Image_2 and Text_0 are the most important modality features. The best fusion operation is ConcatFC for Image_2 and Text_0, and LinearGLU for Cell_0 and Text_0.
It is worth noting that we use the Weighted F1 score (F1-W) as the performance metric, since we perform a multi-label classification task on the MM-IMDB dataset. Although the Macro F1 score (F1-M) is also reported in the paper, we only use F1-W for model selection, because the distribution of labels in MM-IMDB  is highly imbalanced, as illustrated in Table 10. Thus, F1-W is a better metric, as F1-M does not take label imbalance into account.
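The difference between the two averages can be seen on a small imbalanced toy problem (single-label here for simplicity; the same weighting logic applies per label in the multi-label case). F1-M averages per-class F1 uniformly, so a rare, poorly predicted class drags it down, while F1-W weights each class by its support.

```python
def f1_per_class(y_true, y_pred, label):
    """F1 score for one class, computed from TP/FP/FN counts."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def f1_scores(y_true, y_pred):
    """Return (macro F1, weighted F1) over the classes in y_true."""
    labels = sorted(set(y_true))
    f1s = {c: f1_per_class(y_true, y_pred, c) for c in labels}
    support = {c: sum(t == c for t in y_true) for c in labels}
    macro = sum(f1s.values()) / len(labels)
    weighted = sum(f1s[c] * support[c] for c in labels) / len(y_true)
    return macro, weighted

# A toy imbalanced problem: 8 samples of class 0, only 2 of class 1.
# The classifier is good on the majority class, poor on the minority one.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [0, 1]
macro, weighted = f1_scores(y_true, y_pred)
print(macro, weighted)  # macro is dragged down by the rare class
```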