Designing effective neural networks is fundamentally important in deep multimodal learning. Most existing works focus on a single task and design neural architectures manually, which are highly task-specific and hard to generalize to different tasks. In this paper, we devise a generalized deep multimodal neural architecture search (MMnas) framework for various multimodal learning tasks. Given multimodal input, we first define a set of primitive operations, and then construct a deep encoder-decoder based unified backbone, where each encoder or decoder block corresponds to an operation searched from a predefined operation pool. On top of the unified backbone, we attach task-specific heads to tackle different multimodal learning tasks. By using a gradient-based NAS algorithm, the optimal architectures for different tasks are learned efficiently. Extensive ablation studies, comprehensive analysis, and superior experimental results show that MMnasNet significantly outperforms existing state-of-the-art approaches across three multimodal learning tasks (over five datasets), including visual question answering, image-text matching, and visual grounding. Code will be made available.
The developments in deep neural networks enable machines to deal with complicated multimodal learning tasks that require a fine-grained understanding of both vision and language cues, e.g., visual captioning (Xu et al., 2015; Anderson et al., 2018), visual grounding (Rohrbach et al., 2016; Yu et al., 2018a), image-text matching (Kim et al., 2018; Nam et al., 2017), and visual question answering (VQA) (Fukui et al., 2016; Yu et al., 2017b). Existing approaches have pushed state-of-the-art performance on their respective tasks; however, their architectures are usually dedicated to one specific task, preventing them from being generalized to other tasks. This phenomenon raises a question: Is it possible to design a generalized framework that can simultaneously adapt to various multimodal learning tasks?
One promising answer to this question is the multimodal-BERT framework (Tan and Bansal, 2019; Chen et al., 2019; Lu et al., 2019; Li et al., 2019), which is inspired by the de facto BERT model (Devlin et al., 2019) in the natural language processing (NLP) community. Using the Transformer-based architecture (Vaswani et al., 2017) as its backbone, BERT adopts a two-stage learning paradigm that first pre-trains a universal backbone via self-supervised learning, and then fine-tunes the model for the specific task via supervised learning. Analogously, the multimodal-BERT family pre-trains the Transformer-based backbone to obtain generalizable representations from a large-scale corpus consisting of paired multimodal data (e.g., images and their associated captions). Thereafter, the generalized multimodal backbone is fine-tuned to downstream tasks such as VQA and visual grounding. Although the multimodal-BERT approaches deliver promising results on the benchmarks of various multimodal learning tasks, their computational costs are usually very high (e.g., 10M training samples (Tan and Bansal, 2019) or 300M model size (Lu et al., 2019; Chen et al., 2019)), which severely limits their applicability.
In this paper, we tackle the generalized multimodal learning problem from another perspective. Rather than pre-training one generalized model for various tasks, we design a generalized framework that can adaptively learn the optimal architecture for each task. To do this, we introduce neural architecture search (NAS) (Zoph and Le, 2016) into multimodal learning and propose a deep multimodal neural architecture search (MMnas) framework (see Figure 1). Inspired by the modularized MCAN model (Yu et al., 2019b), we first define a set of primitive operations as the basic units to be searched. Taking image and sentence features as inputs, we design a unified encoder-decoder backbone by respectively feeding the features into the encoder and decoder. The encoder (or the decoder by analogy) in the unified backbone consists of multiple encoder blocks cascaded in depth, where each block corresponds to an operation searched from the encoder operation pool. On top of the unified backbone, task-specific heads are respectively designed for each task (e.g., VQA, visual grounding). By attaching each head (i.e., task) to the unified backbone, we use a gradient-based one-shot NAS algorithm to efficiently search the optimal composition of the operations and obtain the MMnasNet for the respective task. Compared to the hand-crafted composition used by MCAN, the automatically searched composition of MMnasNet adapts better to the characteristics of each task and hence leads to better performance. It is worth noting that the proposed MMnasNet does not conflict with the multimodal-BERT approaches. We can also apply the pre-training strategy to MMnasNet to further enhance its performance.
To summarize, the main contributions of this study are three-fold:
We put forward a new generalized multimodal learning paradigm that uses the neural architecture search (NAS) algorithm to search for the optimal architecture for different tasks. Compared with the multimodal-BERT approaches that use large-scale data to pre-train a generalized model, our paradigm can better capture the characteristics of each task and is more parameter-efficient.
We devise a novel MMnas framework, which consists of a unified encoder-decoder backbone and task-specific heads to deal with different tasks, including visual question answering, image-text matching, and visual grounding.
We conduct extensive experiments on five commonly used benchmark datasets. The optimal MMnasNet delivers new state-of-the-art performance, highlighting the effectiveness and generalizability of the proposed MMnas framework.
We briefly review previous studies on typical multimodal learning tasks and neural architecture search.
Multimodal Learning Tasks: Multimodal learning aims to build models that can understand and associate information from multiple modalities. From early research on audio-visual speech recognition (Yuhas et al., 1989; Dupont and Luettin, 2000) to the recent explosion of interest in vision-and-language tasks (Antol et al., 2015; Chen et al., 2015; Yu et al., 2016), multimodal learning is a multi-disciplinary field of significant importance and potential. At present, multimodal learning with deep neural networks is the de facto paradigm for modern multimodal learning tasks, such as visual question answering (VQA) (Antol et al., 2015; Kim et al., 2018; Yu et al., 2019b), image-text matching (Karpathy and Fei-Fei, 2015; Lee et al., 2018), and visual grounding (Yu et al., 2017a; Yu et al., 2018a). In the following, we briefly describe three typical multimodal learning tasks and a few representative approaches accordingly.
The VQA task aims to answer a question in natural language with respect to a given image, which requires a fine-grained and simultaneous understanding of both image and question. Antol et al. presented a large-scale VQA benchmark with human annotations and some baseline methods (Antol et al., 2015). Fukui et al. (Fukui et al., 2016), Kim et al. (Kim et al., 2017), Ben-Younes et al. (Ben-Younes et al., 2017), and Yu et al. (Yu et al., 2017b) devised different approximated bilinear pooling models to effectively fuse multimodal features with second-order interactions and then integrate them with attention-based neural networks. Most recently, deep co-attention models have been proposed to integrate multimodal fusion and attention learning, delivering new state-of-the-art performance on the benchmark datasets (Nguyen and Okatani, 2018; Kim et al., 2018; Gao et al., 2019; Yu et al., 2019b).
Image-text matching aims to learn two respective mapping functions for the image modality and the text modality, which are then projected into a common semantic space for distance measurement. Karpathy et al. proposed a deep fragment embedding approach to learn the fine-grained similarity between the visual object in the image and textual word in the caption by maximizing their dot-product similarity under a multi-instance learning framework (Karpathy and Fei-Fei, 2015). Lee et al. proposed a stacked cross attention network to exploit the correspondences between textual words and image regions in discovering full latent alignments (Lee et al., 2018). Wang et al. introduced a cross-modal message passing approach that adaptively controls the information flow across modalities to model fine-grained image-text interactions (Wang et al., 2019).
Visual grounding (a.k.a. referring expression comprehension) aims to localize an object in an image referred to by a textual query. Rohrbach et al. proposed a GroundeR model to localize the referred object by reconstructing the sentence using an attention mechanism (Rohrbach et al., 2016). Yu et al. introduced a modular attention network that simultaneously models the language-based attention and visual-based attention to capture rich contextual information for accurate localization (Yu et al., 2018a). Yang et al. proposed a dynamic graph attention network to perform language-driven visual reasoning by modeling the relationships among the visual objects in the image and the linguistic structure of the query expression (Yang et al., 2019).
The tasks above have the same input modalities (i.e., image and text); however, their solutions are diverse and task-specific, which prevents them from being generalized to other tasks. Inspired by the success of the BERT model (Devlin et al., 2019) in the NLP community, multimodal-BERT approaches have been proposed to learn generalized multimodal representations in a self-supervised manner (Tan and Bansal, 2019; Chen et al., 2019; Lu et al., 2019; Li et al., 2019). Although they have obtained promising results, they usually suffer from tremendous computational costs, which limit their usability in practical scenarios.
Neural Architecture Search: Neural architecture search (NAS), a.k.a. AutoML, has drawn increasing interest in the last couple of years, and has been successfully applied to various deep learning tasks, such as image recognition (Zoph et al., 2018), object detection (Ghiasi et al., 2019), and language modeling (So et al., 2019). Early NAS methods use reinforcement learning to search neural architectures, which is computationally expensive (Zoph and Le, 2016; Zoph et al., 2018). Recent works accelerate the searching process by using weight-sharing (Pham et al., 2018) or a hypernetwork (Brock et al., 2018). Although these methods bring acceleration by orders of magnitude, they require a meta-controller (e.g., a hypernetwork or an RNN), which still burdens computational speed. Recently, one-shot NAS methods have been proposed to eliminate the meta-controller by modeling the NAS problem as a single training process of an over-parameterized supernet that comprises all candidate paths (Bender et al., 2018; Liu et al., 2018; Cai et al., 2018).
The most closely related study to our work is the MFAS approach (Pérez-Rúa et al., 2019), which also incorporates NAS to search the optimal architecture for multimodal tasks. However, MFAS focuses on a simpler problem to search for the multimodal fusion model given two input features, which cannot be directly used to address the multimodal learning tasks in this paper.
In this section, we introduce a generalized multimodal learning framework MMnas via neural architecture search, which can be flexibly adapted to a wide range of multimodal learning tasks involving image-sentence inputs. As shown in Figure 2, MMnas contains a unified encoder-decoder backbone and task-specific heads. Taking an image and its associated sentence (e.g., a question, a caption or a query) as inputs, the unified encoder-decoder backbone learns the multimodal interactions with a deep modularized network consisting of stacked encoder and decoder blocks, where each block is searched within a set of predefined primitive operations. On top of the unified backbone, we design task-specific heads to deal with the VQA, image-text matching (ITM), and visual grounding (VG) tasks, respectively. Before presenting the MMnas framework, we first introduce its basic building blocks, the primitive operations.
In the following, we present four types of primitive operations, termed as the self-attention (SA), guided-attention (GA), feed-forward network (FFN), and relation self-attention (RSA) operations. First, we introduce a generalized formulation of the scaled dot-product attention proposed in (Vaswani et al., 2017), which is the core of our primitive operations below.
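Denote $m$ queries and $n$ key-value pairs as $Q \in \mathbb{R}^{m \times d}$, $K \in \mathbb{R}^{n \times d}$, and $V \in \mathbb{R}^{n \times d}$, respectively, where $d$ is the common dimensionality. The original scaled dot-product attention function in (Vaswani et al., 2017) obtains the output features by weighted summation over all values $V$ with respect to the attention learned from the scaled dot-product of $Q$ and $K$:
$$\mathrm{Att}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)V \qquad (1)$$
The generalized formulation additionally takes a relation prior $R \in \mathbb{R}^{m \times n}$ between the queries and the keys, which is added to the scaled dot-products before the softmax:
$$\mathrm{Att}(Q, K, V, R) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d}} + R\right)V \qquad (2)$$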
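Without loss of generality, the commonly used multi-head mechanism (Vaswani et al., 2017) can also be incorporated with the generalized scaled dot-product attention function, which consists of $h$ parallel heads (i.e., independent attention functions) to further improve the representation capacity of the attended features:
$$\mathrm{MHA}(Q, K, V, R) = [\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_h]\,W^{o}, \qquad \mathrm{head}_j = \mathrm{Att}(QW_j^{Q}, KW_j^{K}, VW_j^{V}, R) \qquad (3)$$
where each $\mathrm{head}_j$ refers to an independent scaled dot-product attention function. $W_j^{Q}, W_j^{K}, W_j^{V} \in \mathbb{R}^{d \times d_h}$ are the projection matrices for the $j$-th head, and $W^{o} \in \mathbb{R}^{h d_h \times d}$. $d_h$ is the dimensionality of the output features from each head and is usually set to $d_h = d/h$.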
SA(X): Taking a group of input features $X$, the output features $Z$ of the SA operation are obtained by feeding the inputs through the multi-head attention function (MHA) of Eq.(3) as follows:
$$Z = \mathrm{SA}(X) = \mathrm{MHA}(X, X, X, \mathbf{0}) \qquad (4)$$
where each $z_i \in Z$ encodes the intra-modal interactions between $x_i$ and all features within $X$. $\mathbf{0}$ is an all-zero matrix indicating that no relation prior is provided.
GA(X, Y): Taking two groups of features $X$ and $Y$, the GA operation transforms them into $Z$ through Eq.(3) as follows:
$$Z = \mathrm{GA}(X, Y) = \mathrm{MHA}(X, Y, Y, \mathbf{0}) \qquad (5)$$
where each $z_i \in Z$ encodes the inter-modal interactions between $x_i$ and all features within $Y$.
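To make the relation between the attention function and the SA and GA operations concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that the relation prior is added to the scaled dot-products before the softmax (which is consistent with an all-zero prior meaning "no prior"). The helper names `attention`, `multi_head`, `init_heads`, `SA`, and `GA`, as well as the random initialization, are hypothetical and chosen for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise maximum for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, R=None):
    # Q: (m, d); K and V: (n, d); R: (m, n) additive prior (None means all zeros).
    logits = Q @ K.T / np.sqrt(Q.shape[-1])
    if R is not None:
        logits = logits + R
    return softmax(logits) @ V

def init_heads(d, h, rng):
    # Hypothetical random initialization: h heads of width d // h plus an output projection.
    d_h = d // h
    heads = [tuple(0.1 * rng.standard_normal((d, d_h)) for _ in range(3)) for _ in range(h)]
    return heads, 0.1 * rng.standard_normal((h * d_h, d))

def multi_head(Q, K, V, heads, W_o, R=None):
    # Run every head on its own projections, concatenate, then project back to d.
    outs = [attention(Q @ Wq, K @ Wk, V @ Wv, R) for Wq, Wk, Wv in heads]
    return np.concatenate(outs, axis=-1) @ W_o

def SA(X, heads, W_o):
    return multi_head(X, X, X, heads, W_o)   # queries, keys, and values all come from X

def GA(X, Y, heads, W_o):
    return multi_head(X, Y, Y, heads, W_o)   # queries from X, keys and values from Y

rng = np.random.default_rng(0)
X, Y = rng.standard_normal((5, 16)), rng.standard_normal((7, 16))
heads, W_o = init_heads(d=16, h=4, rng=rng)
print(SA(X, heads, W_o).shape, GA(X, Y, heads, W_o).shape)  # (5, 16) (5, 16)
```

In both operations the output keeps the number of rows of the first argument, which is why GA can inject sentence information into image features without changing the number of objects.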
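FFN(X): This operation is a two-layer MLP network with ReLU activation and dropout in between. Taking one group of input features $X$, the transformed output features $Z$ of the FFN operation are obtained as follows:
$$Z = \mathrm{FFN}(X) = \mathrm{FC}(d) \circ \mathrm{Drop}(p) \circ \mathrm{ReLU} \circ \mathrm{FC}(4d)(X) \qquad (6)$$
where $\mathrm{FC}(\cdot)$ is a fully-connected layer of the given output dimension and $\mathrm{Drop}(p)$ is a dropout layer with dropout rate $p$. The symbol $\circ$ denotes a composition of two layers.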
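As a rough illustration of the FFN operation, the sketch below applies a fully-connected layer, ReLU, dropout, and a second fully-connected layer to every row of the input. The hidden width, the dropout rate, and the convention of enabling dropout only when a random generator is passed are assumptions made for this example.

```python
import numpy as np

def ffn(X, W1, b1, W2, b2, drop_rate=0.1, rng=None):
    # FC -> ReLU -> Dropout -> FC, applied independently to every row of X.
    H = np.maximum(X @ W1 + b1, 0.0)
    if rng is not None and drop_rate > 0.0:    # dropout is active only when an rng is given
        keep = rng.random(H.shape) >= drop_rate
        H = H * keep / (1.0 - drop_rate)       # inverted dropout keeps the expected scale
    return H @ W2 + b2

rng = np.random.default_rng(0)
d, hidden = 8, 32                              # the hidden width is an assumed value
X = rng.standard_normal((5, d))
W1, b1 = 0.1 * rng.standard_normal((d, hidden)), np.zeros(hidden)
W2, b2 = 0.1 * rng.standard_normal((hidden, d)), np.zeros(d)
Z = ffn(X, W1, b1, W2, b2)                     # inference mode: no dropout
```

Because the operation preserves the feature dimensionality, it can be placed at any position in the encoder or decoder cascade.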
RSA(X, R): This operation takes a group of features $X$ along with their relation features $R$ as inputs, where $d_r$ is the dimensionality of the relation features. The output features $Z$ of the RSA operation are obtained through Eq.(3) as follows:
$$Z = \mathrm{RSA}(X, R) = \mathrm{MHA}(X, X, X, \log(\mathrm{MLP}_r(R) + \epsilon)) \qquad (7)$$
where $\mathrm{MLP}_r$ denotes a two-layer MLP network with transformations applied on the last axis of $R$. $\epsilon$ is a small constant to avoid the underflow problem.
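The following single-head sketch shows one plausible reading of the RSA operation: pairwise relation features are mapped by a two-layer MLP to one scalar per object pair, and the logarithm of that scalar (plus a small constant) is used as an additive attention prior. The ReLU on the MLP output, the single-head simplification, and all names and sizes are assumptions of this example rather than details confirmed by the text.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def relation_prior(R, W1, b1, W2, b2, eps=1e-6):
    # R: (m, m, d_r). The MLP acts on the last axis and yields one scalar per object pair.
    H = np.maximum(R @ W1 + b1, 0.0)
    G = np.maximum(H @ W2 + b2, 0.0)[..., 0]   # assumed ReLU so that the log below is defined
    return np.log(G + eps)                     # eps keeps the log finite when G is zero

def rsa_single_head(X, R, mlp, eps=1e-6):
    prior = relation_prior(R, *mlp, eps=eps)   # (m, m) additive prior
    logits = X @ X.T / np.sqrt(X.shape[-1]) + prior
    return softmax(logits) @ X

rng = np.random.default_rng(0)
m, d, d_r, hid = 4, 8, 4, 16
X = rng.standard_normal((m, d))
R = rng.standard_normal((m, m, d_r))
mlp = (rng.standard_normal((d_r, hid)), np.zeros(hid),
       rng.standard_normal((hid, 1)), np.zeros(1))
Z = rsa_single_head(X, R, mlp)
```

Object pairs whose relation gate is close to zero receive a large negative prior and are therefore almost ignored by the attention.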
Inspired by (Yu et al., 2019b), we construct a unified encoder-decoder as the backbone to model the deep interactions between the bimodal inputs consisting of an image and its associated sentence. In the following, we describe each component of the backbone in detail.
Sentence and Image Representations: The input sentence is first tokenized and then trimmed (or zero-padded) into a fixed-length sequence of words. Each word is represented as a 300-D vector using the pre-trained GloVe word embeddings (Pennington et al., 2014). The word embeddings are fed into a one-layer LSTM network, resulting in the final word-level sentence features.
Following the strategy in (Anderson et al., 2018), the input image is represented as a set of objects extracted from a pre-trained object detection model (e.g., Faster R-CNN). For each image, the object detector predicts a set of objects, with each object being represented by a group of visual features and relation features, respectively. The visual features are obtained by pooling the convolutional features from the detected objects. The relation features are calculated from the relative spatial relationships of object pairs. (Footnote 1: The location of each object is described by the center coordinates, the width, and the height of its bounding box. Following the strategy in (Hu et al., 2018), a 4-D relation feature is defined between each pair of objects from these locations.)
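As an illustration of how a 4-D relation feature can be derived from two bounding boxes, the sketch below uses the log-ratio geometry feature commonly attributed to (Hu et al., 2018). Whether this matches the exact definition used here cannot be confirmed from the text, and the clamping constant `eps` and the function names are assumptions.

```python
import math

def relation_feature(box_i, box_j, eps=1e-3):
    # Boxes are (cx, cy, w, h): center coordinates, width, and height.
    xi, yi, wi, hi = box_i
    xj, yj, wj, hj = box_j
    return (
        math.log(max(abs(xi - xj), eps) / wi),  # horizontal offset, scaled by the width of i
        math.log(max(abs(yi - yj), eps) / hi),  # vertical offset, scaled by the height of i
        math.log(wj / wi),                      # relative width
        math.log(hj / hi),                      # relative height
    )

def relation_features(boxes):
    # Pairwise features for n boxes: an n x n grid of 4-tuples.
    return [[relation_feature(bi, bj) for bj in boxes] for bi in boxes]

boxes = [(10.0, 10.0, 4.0, 4.0), (14.0, 14.0, 4.0, 8.0)]
R = relation_features(boxes)
```

The feature is invariant to a common translation of both boxes, so it encodes only relative geometry.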
Sentence Encoder and Image Decoder: Taking the word-level sentence features as inputs, the sentence encoder learns the intra-modal interactions of sentence words by passing them through a cascade of encoder blocks recursively: the first block takes the sentence features as its input, and each subsequent block takes the output features of the preceding block. Each encoder block corresponds to an operation searched from an encoder operation pool with independent operation weights. Similar to (Yu et al., 2019b), the encoder operation pool consists of two candidate operations: SA and FFN.
Analogous to the sentence encoder, we design an image decoder consisting of cascaded decoder blocks. Slightly different from that of the encoder, the decoder operation pool contains four operations: SA, RSA, GA, and FFN. Taking the visual features and relation features from the image, along with the output features from the sentence encoder as inputs, the image decoder models the intra- and inter-modal interactions of the multimodal inputs in a recursive manner: the first decoder block takes the visual features as its input, and each subsequent block takes the output features of the preceding block. Each decoder block takes at least one input (i.e., the output features of the preceding block) and may take an additional input (i.e., the relation features or the output features of the sentence encoder) if the corresponding operation is searched (i.e., RSA or GA, respectively).
The output sentence features and image features from the unified encoder-decoder backbone contain rich information about the attentive interactions between the sentence words and image objects. On top of the backbone, we attach task-specific heads to address the visual question answering (VQA), image-text matching (ITM), and visual grounding (VG) tasks, respectively.
VQA Head: Similar to most existing works (Antol et al., 2015; Yu et al., 2017b; Kim et al., 2018), we resolve the VQA problem by predicting the best-matched answer to the question from a large answer vocabulary. Inspired by the multimodal fusion model in (Yu et al., 2019b), we use two independent attentional reduction models (Eq.(10)) for the output sentence features and image features to obtain their reduced features, respectively. In each model, the attention weights to be learnt are predicted by a two-layer MLP network and are used to aggregate the input features into a single reduced feature vector. After that, the reduced features are fused together (Eq.(11)): two projection matrices embed the reduced features into a common space, and LayerNorm is appended on the fused feature to stabilize training (Ba et al., 2016).
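The attentional reduction and fusion steps can be sketched as follows. This is an illustrative NumPy example under stated assumptions: the attention weights are a softmax over MLP scores, the reduced feature is the attention-weighted sum, the two projected vectors are fused by addition, and LayerNorm is shown without its learnable gain and bias. All names and sizes are hypothetical.

```python
import numpy as np

def attentional_reduction(F, W1, b1, W2, b2):
    # F: (n, d). A two-layer MLP scores each item; softmax over the n items gives the weights.
    scores = (np.maximum(F @ W1 + b1, 0.0) @ W2 + b2)[:, 0]
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()
    return alpha @ F, alpha                    # weighted sum of shape (d,), weights of shape (n,)

def layer_norm(z, eps=1e-6):
    # LayerNorm without the learnable gain and bias, for brevity.
    return (z - z.mean()) / np.sqrt(z.var() + eps)

def fuse(x_red, y_red, W_x, W_y):
    # Assumed additive fusion of the two projected vectors, followed by LayerNorm.
    return layer_norm(x_red @ W_x + y_red @ W_y)

rng = np.random.default_rng(0)
d, d_z, hid = 8, 16, 8
img = rng.standard_normal((6, d))              # 6 object features
txt = rng.standard_normal((5, d))              # 5 word features

def make_mlp():
    return (rng.standard_normal((d, hid)), np.zeros(hid),
            rng.standard_normal((hid, 1)), np.zeros(1))

x_red, alpha_x = attentional_reduction(img, *make_mlp())
y_red, alpha_y = attentional_reduction(txt, *make_mlp())
z = fuse(x_red, y_red, rng.standard_normal((d, d_z)), rng.standard_normal((d, d_z)))
```

The fused vector `z` is what a task-specific classifier would consume; its length is independent of the number of words and objects.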
The fused feature is then projected into a vector whose length equals the size of the answer vocabulary and fed into a classification loss over all candidate answers. For the dataset that provides multiple answers to each question, we formulate it as a multi-label classification problem and use binary cross-entropy (BCE) loss to train the model. For the dataset that only has one answer to each question, we regard it as a single-label classification problem and use the softmax cross-entropy loss instead.
ITM Head: Image-text matching aims to learn a matching score that measures the cross-modal similarity between an image-text pair. Since the outputs of the ITM and VQA tasks are similar, we reuse part of the model in the VQA head. On top of the fused feature from Eq.(11), the matching score is obtained by mapping the fused feature to a scalar and applying the sigmoid function. Given a positive image-text pair with correspondence, we use the BCE loss with hard negative mining as our loss function to train the matching model: the positive pair is scored together with a hard negative text sample and a hard negative image sample for that pair, both mined from the whole training set per training epoch.
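A BCE loss over one positive pair and its two hard negatives can be written compactly. The sketch below assumes that the three terms are simply summed with equal weight, which the text does not confirm; the function name and the `eps` guard are hypothetical.

```python
import math

def itm_bce_loss(s_pos, s_neg_text, s_neg_image, eps=1e-12):
    # Scores are sigmoid outputs in (0, 1); eps guards against log(0).
    return -(math.log(s_pos + eps)
             + math.log(1.0 - s_neg_text + eps)
             + math.log(1.0 - s_neg_image + eps))

good = itm_bce_loss(0.9, 0.1, 0.2)   # positive scored high, negatives scored low
bad = itm_bce_loss(0.4, 0.7, 0.6)    # negatives scored above the positive
```

The loss decreases when the positive pair is scored close to 1 and both hard negatives are scored close to 0.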
VG Head: We address the visual grounding task by predicting a ranking score and a refined bounding box for each visual object in the image referred to by the query. To do this, we first feed the word-level query features into the attentional reduction model in Eq.(10) to obtain a reduced query feature vector. After that, this vector is broadcast and integrated with the object-level image features, yielding one fused feature per object in the image. Each fused object feature is then linearly projected into a ranking score and a 4-D bounding box offset, respectively. Similar to (Yu et al., 2018c), we design a multi-task loss function consisting of a ranking loss and a regression loss, combined with a hyper-parameter that balances the two terms. The ranking loss penalizes the KL-divergence between the predicted scores and the ground-truth scores of all objects, where the ground-truth scores are obtained by calculating the IoU scores of all objects with respect to the unique ground-truth bounding box. Softmax normalizations are applied to the predicted and ground-truth scores, respectively, to form score distributions. The regression loss penalizes the smoothed L1 distance (Girshick, 2015) between the predicted offset and the ground-truth offset for the objects whose IoU scores are larger than a threshold. The ground-truth offset is obtained by calculating the translations between the bounding box of the input object and the bounding box of the ground-truth object (Girshick, 2015).
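The multi-task grounding loss can be sketched as a KL-divergence ranking term plus a smoothed L1 regression term. In the example below, the direction of the KL-divergence, the averaging of the regression term over the selected objects, and the IoU threshold value of 0.5 are assumptions; only the loss weight of 1 is taken from the experimental setup.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def smooth_l1(x):
    a = np.abs(x)
    return np.where(a < 1.0, 0.5 * a ** 2, a - 0.5)

def vg_loss(scores, offsets, ious, gt_offsets, lam=1.0, iou_thresh=0.5):
    # Ranking term: KL(q || p) between the softmax-normalized IoU and predicted scores.
    p, q = softmax(scores), softmax(ious)
    rank = np.sum(q * (np.log(q) - np.log(p)))
    # Regression term: smoothed L1 on objects whose IoU exceeds the (assumed) threshold.
    pos = ious > iou_thresh
    reg = smooth_l1(offsets[pos] - gt_offsets[pos]).sum() / max(int(pos.sum()), 1)
    return float(rank + lam * reg)

ious = np.array([0.1, 0.8, 0.3])
gt_offsets = np.zeros((3, 4))
perfect = vg_loss(ious, gt_offsets, ious, gt_offsets)
worse = vg_loss(np.array([0.8, 0.1, 0.3]), gt_offsets + 0.5, ious, gt_offsets)
```

When the predicted score distribution matches the IoU distribution and the offsets are exact, both terms vanish.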
To obtain the optimal MMnasNet architecture for each task on a specific dataset, we introduce an efficient one-shot search algorithm that searches for the optimal architecture within an over-parameterized supernet with weight sharing.
Consider a supernet that encodes the whole search space of MMnas and is parameterized by the model weights and the architecture weights of all the possible operations in the supernet. (Footnote 2: Given a MMnas supernet consisting of $M$ encoder blocks and $N$ decoder blocks, the size of the search space is $2^{M} \times 4^{N}$ and the number of all the possible operations in the supernet is $2M + 4N$, where 2 and 4 correspond to the sizes of the encoder and decoder operation pools, respectively.) The optimal architecture is obtained by minimizing the expected loss with respect to the model weights and the architecture weights jointly. Here, the loss function is applied on the training set for each task, the expectation is taken over valid architectures sampled from the search space, and the model weights of each sampled architecture are inherited from the supernet in a weight sharing strategy. Based on the learned architecture weights, the optimal architecture is obtained by selecting the operation with the largest architecture weight in each block of the backbone.
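For a sense of scale, the search-space size and the number of candidate operations follow directly from the pool sizes (2 for the encoder, 4 for the decoder). The short computation below uses the default depth of 12 encoder blocks and 18 decoder blocks; the function name is hypothetical.

```python
def search_space_stats(num_enc_blocks, num_dec_blocks, enc_pool=2, dec_pool=4):
    # One operation is chosen per block, so the choices multiply across blocks.
    size = enc_pool ** num_enc_blocks * dec_pool ** num_dec_blocks
    # The supernet holds every candidate operation of every block.
    num_ops = enc_pool * num_enc_blocks + dec_pool * num_dec_blocks
    return size, num_ops

size, num_ops = search_space_stats(12, 18)
print(size == 2 ** 48, num_ops)  # True 96
```

With roughly 2.8e14 candidate architectures but only 96 operations to train, exhaustive evaluation is infeasible while a weight-sharing supernet remains tractable.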
Inspired by the strategy in (Cai et al., 2018), we adopt an iterative algorithm to optimize the architecture weights and the model weights alternately. We first separate the training set into two non-overlapping sets, one for the model weights and one for the architecture weights. When training the model weights, we first freeze the architecture weights and stochastically sample exactly one operation for each block with respect to the architecture weights after softmax activation, which results in a valid architecture. After that, we update the model weights activated by the sampled architecture via standard gradient descent on the first set. When training the architecture weights, we freeze the model weights, sample a valid architecture, and then update the architecture weights via gradient descent on the second set.
As claimed in (Chu et al., 2019), the iterative optimization of the architecture weights and the model weights inevitably introduces bias toward certain architectures and leaves the remaining ones poorly optimized. To alleviate the problem, we introduce an additional warming-up stage before the iterative optimization. In the warming-up stage, we do not train the architecture weights and sample operations uniformly to train the model weights. This ensures that the model weights are well initialized, thus leading to a more impartial and robust architecture search.
The detailed search algorithm is illustrated in Algorithm 1.
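The sampling and selection steps described above can be sketched in a few lines of plain Python. This is not Algorithm 1: it omits the gradient updates entirely and only illustrates uniform sampling during warm-up, softmax sampling during search, and the final argmax selection. The architecture weights shown are made-up values.

```python
import math
import random

def softmax(ws):
    m = max(ws)
    es = [math.exp(w - m) for w in ws]
    s = sum(es)
    return [e / s for e in es]

def sample_architecture(arch_weights, rng, warmup=False):
    # Pick exactly one operation index per block.
    arch = []
    for ws in arch_weights:
        # Uniform sampling in the warm-up stage, softmax over architecture weights afterwards.
        probs = [1.0 / len(ws)] * len(ws) if warmup else softmax(ws)
        arch.append(rng.choices(range(len(ws)), weights=probs, k=1)[0])
    return arch

def derive_architecture(arch_weights):
    # Final architecture: the operation with the largest architecture weight in each block.
    return [max(range(len(ws)), key=lambda i: ws[i]) for ws in arch_weights]

ENC_POOL = ["SA", "FFN"]
arch_weights = [[0.2, 1.5], [2.0, -1.0], [0.0, 0.1]]   # made-up values for 3 encoder blocks
rng = random.Random(0)
sampled = sample_architecture(arch_weights, rng)
print([ENC_POOL[i] for i in derive_architecture(arch_weights)])  # ['FFN', 'SA', 'FFN']
```

In a full search loop, each sampled architecture would activate only its own subset of supernet weights for the forward and backward pass.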
We evaluate the searched MMnasNets on three multimodal learning tasks and perform a thorough comparative analysis against the state-of-the-art methods on five benchmark datasets. Furthermore, we conduct comprehensive ablation experiments to explore the reasons why MMnas is effective. The statistics and evaluation metrics of the datasets are shown in Table 1.
Universal Setup: We use the following hyper-parameters for MMnasNet as the default settings unless otherwise stated. For each primitive operation, the latent dimensionality in the multi-head attention is 512 and the number of heads is 8. The dimensionality of the fused features is set to . The numbers of encoder blocks and decoder blocks are set to 12 and 18, respectively, to match the number of blocks in the 6-layer MCAN model (Yu et al., 2019b). (Footnote 3: An $L$-layer MCAN model corresponds to a special case of the MMnasNet model consisting of $2L$ encoder blocks (with repeated SA-FFN operations) and $3L$ decoder blocks (with repeated SA-GA-FFN operations).)
For each dataset, we use its train split to perform architecture search. The train set is further randomly split into two subsets, one for updating the model weights and one for updating the architecture weights. Each randomly initialized model is first warmed up and then searched, with a mini-batch size of 256. For both the warming-up and searching stages, the early stopping strategy is used if the accuracy on the validation set does not improve for 5 epochs. The Adam solver is used as the optimizer (Kingma and Ba, 2014). The frequency ratio for updating the model and architecture weights is set to 5. With the searched optimal architecture, we train the MMnasNet model again from scratch to obtain the final model. All the experiments below are conducted on a workstation with 4 Titan-V GPUs. The searching process takes 10 to 120 GPU hours for different tasks with multi-GPU parallelization.
VQA Setup: For VQA-v2, we follow the setting in (Yu et al., 2019b): all questions are processed to a fixed maximum length and the size of the answer vocabulary is set to 3129. The visual features and relation features are extracted from a Faster R-CNN model pre-trained on Visual Genome (Anderson et al., 2018). The number of extracted objects is determined by a confidence threshold.
ITM Setup: For Flickr30K, we follow the strategy in (Karpathy and Fei-Fei, 2015) to split the data into 29K/1K/1K training/validation/test images. The texts (i.e., captions) are processed to a fixed maximum length. The visual features and relation features are extracted from a Faster R-CNN model pre-trained on Visual Genome with a preset number of objects (Anderson et al., 2018). For each positive image-text pair in the training set, we use the following hard sample mining strategy before each training epoch: we randomly sample 64 negative images per text and 64 negative texts per image from the whole training set to generate negative image-text pairs. Thereafter, we feed all these negative pairs to the current model checkpoint to predict their matching scores and regard the top-5 ranked negative samples as the hard negative samples according to their scores. Finally, we randomly pick one hard image sample and one hard text sample from the candidate hard negative samples, respectively.
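The hard sample mining strategy can be illustrated for one direction (e.g., negative texts for a given image) as follows. The function name, the toy scoring function, and the integer "captions" are hypothetical stand-ins; in practice the scores would come from the current model checkpoint.

```python
import random

def mine_hard_negative(candidates, score_fn, rng, num_sampled=64, top_k=5):
    # 1) Randomly sample negatives, 2) rank them by the current matching score,
    # 3) pick one at random among the top-k highest-scoring (hardest) negatives.
    sampled = rng.sample(candidates, min(num_sampled, len(candidates)))
    ranked = sorted(sampled, key=score_fn, reverse=True)
    return rng.choice(ranked[:top_k])

rng = random.Random(0)
captions = list(range(1000))               # stand-ins for negative captions
toy_score = lambda c: c / 1000.0           # stand-in for the model's matching score
hard = mine_hard_negative(captions, toy_score, rng)
```

Picking randomly among the top-5 rather than always taking the single hardest negative adds some diversity to the training signal.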
VG Setup: We use the same settings for the three visual grounding datasets. For the textual queries, a fixed maximum length is used. For the images, we adopt two pre-trained object detectors to extract the visual features: 1) a Mask R-CNN model trained on COCO (He et al., 2017); and 2) a Faster R-CNN model trained on Visual Genome (Ren et al., 2015). During the training data preparation for the two detectors, we excluded all the images that exist in the training, validation, and testing sets of RefCOCO, RefCOCO+, and RefCOCOg to avoid the data leakage problem. For both detectors above, we detect a preset number of objects for each image to extract the visual and relation features. The loss weight is set to 1.
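Table 1: Statistics and evaluation metrics of the datasets.

| Task | Dataset | Image source | #Images | #Sentences | Metric |
|---|---|---|---|---|---|
| VQA | VQA-v2 (Goyal et al., 2017) | COCO | 204K | 1.1M | Accuracy |
| VG | RefCOCO (Kazemzadeh et al., 2014) | COCO | 20K | 142K | Accuracy |
| VG | RefCOCO+ (Kazemzadeh et al., 2014) | COCO | 20K | 142K | Accuracy |
| VG | RefCOCOg (Mao et al., 2016) | COCO | 26K | 95K | Accuracy |
| ITM | Flickr30K (Plummer et al., 2015) | Flickr | 31K | 155K | Recall@K |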
Search Space: In Table 2, we compare the MMnasNet models searched from different decoder operation pools. From the results, we can see that: 1) modeling the intra-modal attention among visual objects by SA or RSA is vital to object counting performance (i.e., the number type answers), which is consistent with the results reported in (Yu et al., 2019b); 2) introducing the RSA operation which models the relative spatial relationships between paired objects can further facilitate the object counting performance; and 3) SA and RSA are complementary to each other, hence modeling them together leads to the best performance on all answer types.
Model Depth: In Table 2, we compare MMnasNet to the reference MCAN model (Yu et al., 2019b) under different model depths (i.e., the numbers of encoder blocks and decoder blocks). The results reveal that: 1) MMnasNet consistently outperforms MCAN, especially when the model is relatively shallow. This can be explained by the fact that the optimal architectures for different model depths are quite different; 2) with the same numbers of encoder and decoder blocks, the model size of MMnasNet is slightly larger than that of MCAN. This is because MMnasNet tends to use more FFN operations, which introduce more parameters to increase the nonlinearity of the model; and 3) with the increase of model depth, both MCAN and MMnasNet saturate at 12 encoder blocks and 18 decoder blocks, which reflects the bottleneck of the deep encoder-decoder framework used.
Random vs. Searched: To demonstrate the necessity and superiority of the searched architectures over randomly generated ones, we conduct the experiments in Table 2 by alternately using the searched or random architectures for the encoder and decoder, respectively. From the results, we can see that: 1) the searched architectures outperform their random counterparts by up to 0.9 points; 2) the design of the decoder architecture is much more important than that of the encoder architecture; and 3) the all-random architecture also performs well compared to some recent works (Kim et al., 2018; Gao et al., 2019). This suggests that the primitive operations that constitute the architecture also play a key role in model performance.
Efficiency vs. Accuracy: With the optimal MMnasNet architecture (12 encoder blocks and 18 decoder blocks), we explore the trade-off between efficiency and accuracy by training MMnasNet variants with different latent dimensionality $d$. By setting the variant with $d$=512 as the reference model (1), we vary the scaling factor with respect to $d$ and report the parameters-accuracy results in Figure 3. We can see that: 1) the reference MMnasNet (1) model steadily outperforms all the existing state-of-the-art methods by 0.6 to 2.6 points with about 60M parameters, showing the parameter efficiency of MMnasNet; 2) with only 1/3 of the parameters, MMnasNet (0.5) is still competitive with MCAN-6; 3) MMnasNet (1.5) brings only a 0.1 point improvement over the reference model at the expense of twice the model size; and 4) the smallest MMnasNet variant yields a very compact model at the expense of a dramatic accuracy drop. Therefore, we use MMnasNet (1) and MMnasNet (0.5) in the following experiments to compare with the state-of-the-art models.
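Table 4: Accuracies (%) of existing methods on RefCOCO, RefCOCO+, and RefCOCOg.

| Method | #Params | Detector training data | Detector | Backbone | RefCOCO TestA | RefCOCO TestB | RefCOCO Val | RefCOCO+ TestA | RefCOCO+ TestB | RefCOCO+ Val | RefCOCOg Test | RefCOCOg Val |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VC (Zhang et al., 2018) | - | COCO | FRCN | VGG-16 | 73.3 | 67.4 | - | 58.4 | 53.2 | - | - | - |
| Spe.+Lis.+Rein.+MMI (Yu et al., 2017a) | - | COCO | SSD | VGG-16 | 73.7 | 65.0 | 69.5 | 60.7 | 48.8 | 55.7 | 59.6 | 60.2 |
| Spe.+Lis.+Rein.+MMI (Yu et al., 2017a) | - | COCO | SSD | VGG-16 | 73.1 | 64.9 | 69.0 | 60.0 | 49.6 | 54.9 | 59.2 | 59.3 |
| MAttNet (Yu et al., 2018a) | 14M | COCO | MRCN | ResNet-101 | 81.1 | 70.0 | 76.7 | 71.6 | 56.0 | 65.3 | 67.3 | 66.6 |
| DDPN (Yu et al., 2018c) | 10M | Genome | FRCN | ResNet-101 | 80.1 | 72.4 | 76.8 | 70.5 | 54.1 | 64.8 | 67.0 | 66.7 |
| MUAN-10 (Yu et al., 2019a) | 75M | Genome | FRCN | ResNet-101 | 86.5 | 78.7 | 82.8 | 79.5 | 64.3 | 73.2 | 74.3 | 74.2 |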
Table 3: Accuracies (%) of existing methods on VQA-v2.

| Method | #Params | Test-dev All | Test-dev Y/N | Test-dev Num | Test-dev Other | Test-std All |
|---|---|---|---|---|---|---|
| Bottom-Up (Teney et al., 2018) | 22M | 65.32 | 81.82 | 44.21 | 56.05 | 65.67 |
| MFH+CoAtt (Yu et al., 2018b) | 116M | 68.76 | 84.27 | 49.56 | 59.89 | - |
| BAN-8 (Kim et al., 2018) | 79M | 69.52 | 85.31 | 50.93 | 60.26 | - |
| BAN-8 (+G+C) (Kim et al., 2018) | 90M | 70.04 | 85.42 | 54.04 | 60.52 | 70.35 |
| DFAF-8 (Gao et al., 2019) | 114M | 70.22 | 86.09 | 53.32 | 60.49 | 70.34 |
| MCAN-6 (Yu et al., 2019b) | 58M | 70.63 | 86.82 | 53.26 | 60.72 | 70.90 |
| MUAN-10 (Yu et al., 2019a) | 83M | 70.82 | 86.77 | 54.40 | 60.89 | 71.10 |
Taking the ablation studies into account, we compare the best-performing MMnasNet models (with 12 encoder blocks and 18 decoder blocks) to the state-of-the-art approaches on five benchmark datasets. In addition to the standard MMnasNet (1) model, we also report the results of the compact MMnasNet (0.5) model on each dataset. Figure 4 illustrates the optimal MMnasNet backbones searched for different tasks (over specific datasets). This verifies our hypothesis that the optimal architectures for different tasks may vary prominently. Note that we do not compare MMnasNet to the multimodal-BERT approaches (e.g., LXMERT (Tan and Bansal, 2019) or UNITER (Chen et al., 2019)), since they introduce additional training datasets for model pre-training, which may lead to an unfair comparison.
In Table 3, we compare MMnasNets to the state-of-the-art methods on VQA-v2. The results show that: 1) with 1/4 to 1/3 of the model size, the MMnasNet (0.5×) model achieves performance competitive with the previous state-of-the-art models; and 2) with nearly the same model size, MMnasNet (1×) outperforms the existing top-performing approaches by a clear margin on all answer types.
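| Method | Image-to-text R@1 | Image-to-text R@5 | Image-to-text R@10 | Text-to-image R@1 | Text-to-image R@5 | Text-to-image R@10 |
|---|---|---|---|---|---|---|
| DAN (Nam et al., 2017) | 55.0 | 81.8 | 89.0 | 39.4 | 69.2 | 79.1 |
| DPC (Zheng et al., 2017) | 55.6 | 81.9 | 89.5 | 39.1 | 69.2 | 80.9 |
| SCO (Huang et al., 2018) | 55.5 | 82.0 | 89.3 | 41.1 | 70.5 | 80.1 |
| SCAN (Lee et al., 2018) | 61.8 | 87.5 | 93.7 | 45.8 | 74.4 | 83.0 |
| SCAN (Lee et al., 2018) | 67.7 | 88.9 | 94.0 | 44.0 | 74.2 | 82.6 |
| CAMP (Wang et al., 2019) | 68.1 | 89.7 | 95.2 | 51.5 | 77.1 | 85.3 |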
In Table 4, we report the comparative results on RefCOCO, RefCOCO+, and RefCOCOg, respectively. We use the commonly used accuracy metric (Yu et al., 2018a), where a prediction is considered correct if the predicted bounding box overlaps with the ground truth at an IoU threshold of 0.5. With the standard visual features (i.e., MRCN pre-trained on COCO), MMnasNet (0.5×) significantly outperforms the previous state-of-the-art MAttNet model (Yu et al., 2018a) with a similar model size, and MMnasNet (1×) obtains a slight improvement over MMnasNet (0.5×) on RefCOCO+ and RefCOCOg. Equipped with more powerful visual features (i.e., FRCN pre-trained on Visual Genome), MMnasNet (1×) obtains a remarkable improvement and delivers new state-of-the-art performance across all datasets.
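The accuracy metric described above can be sketched as follows. This is a minimal illustration, assuming boxes are given as `(x1, y1, x2, y2)` corner coordinates and that a prediction counts as correct when its IoU is strictly greater than the threshold; both the box format and the strictness of the comparison are assumptions, and the function names are illustrative rather than taken from any released code.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Width and height are clamped at zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(predictions, ground_truths, threshold=0.5):
    """Fraction of predicted boxes whose IoU with the ground truth exceeds the threshold."""
    # ASSUMPTION: strict comparison; some evaluations use >= instead.
    correct = sum(iou(p, g) > threshold for p, g in zip(predictions, ground_truths))
    return correct / len(predictions)


predicted = [(0, 0, 10, 8), (5, 0, 15, 10)]
annotated = [(0, 0, 10, 10), (0, 0, 10, 10)]
print(grounding_accuracy(predicted, annotated))
```

In this toy example the first prediction covers most of its ground-truth box, while the second overlaps only half of it, so only the first one passes the threshold.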
Table 5 contains the image-text matching results on Flickr30K. Similar to most existing works (Lee et al., 2018; Wang et al., 2019), we report the matching results in terms of Recall@K, where K denotes the number of top-ranked results retrieved from a database and ranges within {1, 5, 10}. The cross-modal matching results from two directions, i.e., image-to-texts and text-to-images, are reported in Table 5 to compare with the state-of-the-art approaches. From the results, we can see that MMnasNet (0.5×) significantly outperforms existing state-of-the-art methods in terms of all evaluation metrics. Furthermore, the standard MMnasNet (1×) model consistently outperforms the compact (0.5×) model, as expected. Since the model sizes of MMnasNets do not change much across different tasks, we do not report them further due to space limitations.
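For concreteness, the retrieval metric can be computed along the following lines. This is a minimal sketch, assuming each query comes with a list of similarity scores over the database and a set of indices of its matching items, and that a query counts as a hit when at least one matching item appears among its top-K results. The function name and the toy scores are illustrative only.

```python
def recall_at_k(scores, relevant, k):
    """Fraction of queries with at least one relevant item in the top-k results.

    scores   -- one list of similarity scores per query (one score per database item)
    relevant -- one set of relevant database indices per query
    k        -- number of top-ranked results to inspect
    """
    hits = 0
    for query_scores, relevant_ids in zip(scores, relevant):
        # Rank database items by descending similarity to the query.
        ranked = sorted(range(len(query_scores)),
                        key=lambda j: query_scores[j], reverse=True)
        if relevant_ids & set(ranked[:k]):
            hits += 1
    return hits / len(scores)


# Toy example: three queries against a database of three items.
toy_scores = [[0.9, 0.1, 0.3],
              [0.2, 0.4, 0.8],
              [0.5, 0.6, 0.1]]
toy_relevant = [{0}, {1}, {0}]
for k in (1, 2):
    print(k, recall_at_k(toy_scores, toy_relevant, k))
```

The same function covers both retrieval directions: for image-to-text retrieval the queries are images and the database holds captions (an image may have several matching captions), and for text-to-image retrieval the roles are swapped.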
In this paper, we present a generalized deep multimodal neural architecture search (MMnas) framework for various multimodal learning tasks. Different from the existing approaches that design hand-crafted and task-specific architectures to address only a single task, MMnas can be generalized to automatically learn the optimal architectures of different tasks. To achieve this, we construct a unified encoder-decoder backbone with each encoder/decoder block corresponding to an operation searched from a candidate set of predefined operations. On top of the unified backbone, we attach task-specific heads to deal with different tasks. The optimal architecture for each task is learned by an efficient neural architecture search (NAS) algorithm to obtain task-specific MMnasNet. Extensive experiments are conducted on the VQA, visual grounding, and image-text matching tasks to show the generalizability and effectiveness of the proposed MMnas framework. Comprehensive results from five benchmark datasets validate the superiority of MMnasNet over existing state-of-the-art methods.
Different from existing multimodal-BERT approaches that use large-scale multimodal pre-training, we introduce an alternative way to address the generalized multimodal learning problem via a NAS framework. We hope our work may serve as a solid baseline to inspire future research on multimodal learning.
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering.
International Conference on Machine Learning (ICML). 550–559.
CAMP: Cross-Modal Adaptive Message Passing for Text-Image Retrieval. In IEEE International Conference on Computer Vision (ICCV). 5764–5773.
International Joint Conference on Artificial Intelligence (IJCAI) (2018).