The task of visual question answering (VQA) is to build a model that answers a question given an image-question pair. Recently, it has received much attention from researchers in computer vision [29, 13, 1, 14, 28, 24]. VQA requires techniques from both image recognition and natural language processing. Most existing works use Convolutional Neural Networks (CNNs) to extract visual features from images and Recurrent Neural Networks (RNNs) to generate textual features from questions, and then combine the two to generate the final answer.
However, most of the existing VQA datasets are artificially created and thus may not be suitable as training data for real-world applications. For example, VQA 2.0 [7] and Visual7W [30], arguably two of the most popular datasets for VQA, were created using images from MSCOCO [18] with questions asked by crowd workers. Therefore, the images are typically of high quality and the questions are less conversational than those asked in practice. In contrast, the recently proposed VizWiz [9] dataset was collected from blind people who take photos and ask questions about them. The images in VizWiz are therefore often of poor quality, the questions are more conversational, and some questions are unanswerable because of the poor image quality. The VizWiz dataset reflects a more realistic setting for VQA, but it is much smaller due to the difficulty of collecting such data. A straightforward way to address the limited data is to first train a model on the VQA 2.0 dataset and then fine-tune it on the VizWiz data. This solution provides only limited improvement, for two major reasons. First, the VQA datasets are constructed in different ways, so they differ significantly in visual features, textual features and answers. The authors of [22] trained a simple multi-layer perceptron (MLP) with one hidden layer to classify which VQA dataset a sample comes from, and it achieved over 98% accuracy. This is a strong indication of significant bias across datasets. Our experiments also show that directly fine-tuning the model trained on VQA 2.0 yields only a minor improvement on VizWiz. Second, the two modalities (visual and textual) pose a further challenge to generalizability across datasets. Because heterogeneous modalities share no common feature representation, it is difficult to bridge the domain gap consistently and in a coordinated fashion when multiple modalities are involved.
Domain adaptation methods, which handle the difference between two domains, have been developed to address the first issue [11, 15, 25, 4, 23, 6, 8, 27]. However, most existing domain adaptation methods focus on single-modal tasks such as image classification and sentiment classification, and thus may not be directly applicable to multi-modal settings. Moreover, these methods usually rely on the strong assumption that the source domain and the target domain share the same (usually small) label space, which may be unrealistic in real-world applications. [20] proposed a new framework for unsupervised multi-modal domain adaptation, but it did not target VQA. Recently, several VQA domain adaptation methods have been proposed to address the multi-modal challenge. However, to the best of our knowledge, all the existing VQA domain adaptation methods focus on the multiple choice setting, where several answer candidates are provided and the model only needs to select one of them. In contrast, we focus on the more challenging open-ended setting, where there is no prior knowledge of answer choices and the model can select any term from a vocabulary.
In this paper, we address the aforementioned challenges by proposing a novel multi-modal domain adaptation framework. Under this framework, we develop a method that simultaneously learns a domain-invariant and downstream-task-discriminative multi-modal feature embedding based on an adversarial loss and a classification loss. We additionally incorporate the maximum mean discrepancy (MMD) to further reduce the domain distribution mismatch for multiple modalities, i.e., visual embeddings, textual embeddings and joint embeddings. We conduct experiments on two popular VQA benchmark datasets. The results show that the proposed model outperforms the state-of-the-art VQA models and that the proposed domain adaptation method surpasses other state-of-the-art domain adaptation methods on the VQA task. Our contributions are summarized as follows:
We propose a novel supervised multi-modal domain adaptation framework.
We tackle the more challenging open-ended VQA task with the proposed domain adaptation method. To the best of our knowledge, this is the first attempt to use domain adaptation for open-ended VQA.
The proposed method can simultaneously learn domain invariant and downstream-task-discriminative multi-modal feature embedding with an adversarial loss and a classification loss. At the same time, it minimizes the difference of cross-domain feature embeddings jointly over multiple modalities.
We conduct extensive experiments between two popular VQA benchmark datasets, VQA 2.0 and VizWiz, and the results show the proposed method outperforms the existing state-of-the-art methods.
2 Related Works
VQA Datasets. Over the past few years, several VQA datasets [30, 7, 9, 16, 2] and tasks were proposed to encourage researchers to develop algorithms that answer visual questions. One limitation of many existing datasets is that they were created either automatically or from an existing large vision dataset like MSCOCO [18], and the questions were either generated automatically or contrived by human annotators on Amazon Mechanical Turk (AMT). Therefore, the images in these datasets are typically of high quality but the questions are less conversational. They might not be directly applicable to real-world applications such as VizWiz [9], which aims to answer the visual questions asked by blind people in their daily life. The main differences between VizWiz and other artificial VQA datasets are as follows: 1) Both the image and question quality of VizWiz are lower, as the images suffer from poor lighting and focus, and the audio-recorded questions suffer from problems such as clipping at either end or background audio content; 2) The questions can be unanswerable, since blind people cannot verify whether the images contain the visual content they are asking about, because of blurring, inadequate lighting, framing errors, a finger covering the lens, etc. Our experiments also reveal that fine-tuning the model trained on the somewhat artificial VQA 2.0 dataset provides limited improvement on VizWiz, due to the significant difference in bias between these two datasets.
VQA Settings. There are two main VQA settings, namely multiple choice and open-ended. Under the multiple choice setting, the model is provided with multiple answer candidates and is expected to select the correct one. VQA models following this setting usually take characteristics of all answer candidates, such as word embeddings, as input to make a selection [22, 12]. In the open-ended setting, however, neither prior knowledge nor answer candidates are provided, and the model can respond with any free-form answer. This makes the open-ended setting more challenging and realistic [14, 13, 24, 1].
VQA Models. Recently, a plethora of VQA models were proposed [29, 13, 1, 14, 24]. Most of them consist of image and question encoders, and a multi-modal fusion module followed by a classification module. [13] used an LSTM to encode the question and a residual network [10] to compute the image features with a soft attention mechanism. [1] implemented bottom-up attention using Faster R-CNN [21] to extract features of detected image regions, and then a top-down mechanism used task-specific context to predict an attention distribution over the image regions. The final output was generated by an MLP after fusing the image and question features. [14] used bilinear attention between two groups of input channels on top of low-rank bilinear pooling, which extracted the joint representations for each pair of channels. [24] proposed an approach that takes original image features, bottom-up attention features from an object detection module, question features and the optical character recognition (OCR) strings detected from the image as input, and answers either with an answer from the fixed answer vocabulary or by selecting one of the OCR strings detected in the image. Similar to the state-of-the-art model [24], our VQA base model also takes original image features, bottom-up attention features and question features to predict the final answer. Details of our VQA base model are described in the next section.
Domain Adaptation. Domain adaptation techniques have been proposed to learn a common domain-invariant latent feature space where the distributions of two domains are aligned. Recent works typically focused on transferring neural networks from a labeled source domain to a target domain where there is no or limited labeled data [11, 15, 25, 23, 4, 6, 8]. [11] optimized for domain invariance to facilitate domain transfer and used a soft label distribution matching loss to transfer information between tasks. [25] proposed a framework which combines discriminative modeling, untied weight sharing and a GAN loss to reduce the difference between domains. [23] estimated the empirical Wasserstein distance between the source and the target samples and optimized the feature extractor network to minimize the estimated Wasserstein distance in an adversarial manner. [4] utilized a gradient reversal layer to jointly train the domain classifier, label classifier and feature extractor to align domains. Similarly, another method simultaneously minimized the classification error, preserved the structure within and across domains, and restricted similarity on target samples. The major difference between our work and these works is that we propose a novel multi-modal domain adaptation framework, while these works assumed a single modality.
Domain Adaptation for VQA. Although domain adaptation has been successfully applied to computer vision tasks, its applicability to VQA has yet to be well studied. A recent work investigated domain adaptation for VQA [22]. It reduces the difference in statistical distributions by transforming the feature representation of the data in the target domain. However, one major limitation is the assumption of a multiple choice setting, where four answer candidates are provided as input to the model. This is unrealistic in real-world applications because one can never guarantee that the ground truth answer is among the four candidates. Moreover, it is unclear how to create answer candidates for an image-question pair. In contrast, our model is only provided with an image-question pair and can generate any free-form answer. This makes our task more challenging and realistic.
3 The VQA Framework
In this section, we describe our base VQA framework. Given an image $I$ and a question $Q$, the VQA model estimates the most likely answer $\hat{a}$ from a large vocabulary based on the content of the image, which can be written as follows:
$$\hat{a} = \arg\max_{a} P(a \mid I, Q) \tag{1}$$
Our base framework consists of four components: 1) a question encoder; 2) an image encoder; 3) a multi-modal fusion module; and 4) a classification module at the output end. We elaborate on each component in the following paragraphs.
Question Encoding. The question $Q$ of length $T$ is first tokenized and encoded using word embeddings based on pretrained GloVe [19] as $\{x_1, \dots, x_T\}$. These embeddings are then fed into a GRU cell [3]. The encoded question is obtained from the last hidden state at time step $T$, denoted as $q = h_T \in \mathbb{R}^{d}$, where $h_t = \mathrm{GRU}(x_t, h_{t-1})$ for $1 \le t \le T$, and $d$ is the feature dimension.
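To make the recurrence concrete, the following is a minimal sketch of a GRU question encoder in plain Python. It assumes the gate convention of [3] (the new state is a convex combination of the previous state and a tanh candidate), omits bias terms, and uses random toy weights and random vectors in place of pretrained GloVe embeddings. All names (`gru_step`, `encode_question`, `params`) and sizes are hypothetical and not taken from the paper.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, U, x, h):
    """Return W @ x + U @ h for plain Python lists (bias omitted for brevity)."""
    return [
        sum(w * xi for w, xi in zip(w_row, x)) + sum(u * hi for u, hi in zip(u_row, h))
        for w_row, u_row in zip(W, U)
    ]

def gru_step(params, x, h):
    """One GRU update: new hidden state from the word vector x and previous state h."""
    z = [sigmoid(a) for a in affine(params["Wz"], params["Uz"], x, h)]  # update gate
    r = [sigmoid(a) for a in affine(params["Wr"], params["Ur"], x, h)]  # reset gate
    reset_h = [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(a) for a in affine(params["Wh"], params["Uh"], x, reset_h)]
    # Convex combination of the previous state and the candidate state.
    return [zi * hi + (1.0 - zi) * ci for zi, hi, ci in zip(z, h, cand)]

def encode_question(params, embeddings, hidden_dim):
    """Run the GRU over all word vectors and return the last hidden state."""
    h = [0.0] * hidden_dim
    for x in embeddings:
        h = gru_step(params, x, h)
    return h

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
EMB_DIM, HID_DIM, LENGTH = 4, 3, 5  # toy sizes, not the paper's settings
params = {}
for gate in "zrh":
    params["W" + gate] = random_matrix(HID_DIM, EMB_DIM, rng)
    params["U" + gate] = random_matrix(HID_DIM, HID_DIM, rng)
word_vectors = [[rng.uniform(-1, 1) for _ in range(EMB_DIM)] for _ in range(LENGTH)]
q = encode_question(params, word_vectors, HID_DIM)
```

Because the state starts at zero and every update mixes it with a tanh output, each entry of the encoded question stays strictly between -1 and 1 in this sketch.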
Image Encoding. Similar to [1] and [24], we first feed the input image to an object detector [5] pretrained on the Visual Genome dataset [16], based on Feature Pyramid Networks (FPN) [17] with ResNeXt [26] as the backbone. The output from the fully connected layer is used as the region-based features, i.e., $R = \{r_1, \dots, r_K\}$ with $r_i$ as the feature for the $i$-th object. Meanwhile, we divide the entire image into a grid and obtain the grid-based features $G$ by average pooling features from the penultimate layer of a ResNet-101 network [10] pretrained on the ImageNet dataset. Finally, we combine $R$ and $G$ as well as the question embedding $q$ to obtain the joint feature embedding in a multi-modal fusion module, as described in the next paragraph.
Multi-Modal Fusion and Classification. The question embedding $q$ is used to obtain the top-down, i.e., region-based, attention on the image features $R$. Then, the region-based features are averaged according to the attention weights to obtain the weighted region-based image features. Similarly, the grid-based features are fused with the question embedding by concatenation. The fused grid-based features and the weighted region-based image features are then concatenated to obtain the final image features. We denote the final image feature embedding as $v$. The final joint embedding is then calculated by taking the Hadamard product of $v$ and $q$, which is then fed to an MLP for classification, i.e., $P(a \mid I, Q) = \mathrm{MLP}(v \odot q)$. The final answer is represented by $\hat{a} = \arg\max_{a} P(a \mid I, Q)$.
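The fusion steps can be sketched as follows in plain Python. Several details here are assumptions made only for illustration: the attention score is a dot product between the question and each region, the grid features are already pooled into a single vector, a hypothetical projection `W_proj` maps the concatenated image features to the question dimension so that the Hadamard product is defined, and the MLP is reduced to one linear layer followed by a softmax.

```python
import math
import random

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    return [dot(row, x) for row in W]

def fuse(q, regions, grid, W_proj):
    """Return the joint embedding and the attention weights over regions."""
    # Top-down attention: score each region against the question (assumed dot product).
    weights = softmax([dot(q, r) for r in regions])
    dim = len(regions[0])
    attended = [sum(w * r[j] for w, r in zip(weights, regions)) for j in range(dim)]
    grid_fused = grid + q               # concatenate grid features with the question
    image_feat = grid_fused + attended  # concatenate with the attended region features
    v = matvec(W_proj, image_feat)      # hypothetical projection to the question dimension
    joint = [vi * qi for vi, qi in zip(v, q)]  # Hadamard product
    return joint, weights

def classify(joint, W_out):
    """Single linear layer plus softmax standing in for the MLP classifier."""
    probs = softmax(matvec(W_out, joint))
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs

rng = random.Random(0)
DIM, N_REGIONS, N_ANSWERS = 4, 3, 5  # toy sizes, not the paper's settings
q = [rng.uniform(-1, 1) for _ in range(DIM)]
regions = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(N_REGIONS)]
grid = [rng.uniform(-1, 1) for _ in range(DIM)]
W_proj = [[rng.uniform(-0.5, 0.5) for _ in range(3 * DIM)] for _ in range(DIM)]
W_out = [[rng.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(N_ANSWERS)]
joint, weights = fuse(q, regions, grid, W_proj)
answer, probs = classify(joint, W_out)
```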
4 Multi-Modal Domain Adaptation
In this section, we present our framework for supervised multi-modal domain adaptation. We assume there are two modalities of source samples $X_s^A$, $X_s^B$, where $A$, $B$ denote the two modalities, and labels $Y_s$ drawn from a source domain joint distribution $P_s(x^A, x^B, y)$, as well as the two modalities of target samples $X_t^A$, $X_t^B$ and labels $Y_t$ drawn from a target joint distribution $P_t(x^A, x^B, y)$. (For simplicity, we assume the data has two modalities, but the framework can be easily generalized to more modalities.) We also assume there are sufficient source data so that a good pretrained source model can be built, but the amount of target data is limited so that learning on only the target data leads to poor performance. Our goal is to learn target representations for the two modalities $M_t^A$, $M_t^B$, the multi-modal fusion $F_t$ and the target classifier $C_t$ with the help of the pretrained source representations $M_s^A$, $M_s^B$, the source fusion $F_s$ and the source classifier $C_s$. For the VQA task in our work, $A$ and $B$ denote the visual and textual modalities, respectively.
A typical approach to achieving this goal is to regularize the learning of the source and target joint representations by minimizing the distance between the empirical distributions of the source and target domains, i.e., between $F_s(M_s^A(X_s^A), M_s^B(X_s^B))$ and $F_t(M_t^A(X_t^A), M_t^B(X_t^B))$. In this way, the data from the source domain and the target domain are projected onto a similar latent space, such that a well-performing source model can lead to a well-performing target model. Following this idea, we propose a novel multi-modal domain adaptation framework as shown in Figure 1.
4.1 Joint Embedding Alignment
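We propose to reduce the difference of joint embeddings between the source and the target domains by minimizing the Maximum Mean Discrepancy (MMD). The intuition is that two distributions are identical if and only if all of their moments coincide. Suppose we have two distributions $\mathbb{P}$ and $\mathbb{Q}$ over a set $\mathcal{X}$. Let $\varphi: \mathcal{X} \to \mathcal{H}$, where $\mathcal{H}$ is a reproducing kernel Hilbert space (RKHS). Then, we have:
$$\mathrm{MMD}^2(\mathbb{P}, \mathbb{Q}) = \left\| \mu_{\mathbb{P}} - \mu_{\mathbb{Q}} \right\|_{\mathcal{H}}^2 = \mathbb{E}_{x, x' \sim \mathbb{P}}[k(x, x')] - 2\,\mathbb{E}_{x \sim \mathbb{P},\, y \sim \mathbb{Q}}[k(x, y)] + \mathbb{E}_{y, y' \sim \mathbb{Q}}[k(y, y')] \tag{2}$$
where $\mu_{\mathbb{P}} = \mathbb{E}_{x \sim \mathbb{P}}[\varphi(x)]$ is the kernel mean embedding of $\mathbb{P}$ and $k(x, y) = \langle \varphi(x), \varphi(y) \rangle_{\mathcal{H}}$ is a kernel function such as a Gaussian kernel. Let $X = \{x_i\}_{i=1}^{n} \sim \mathbb{P}$ and $Y = \{y_j\}_{j=1}^{m} \sim \mathbb{Q}$; the empirical estimate of the distance between $\mathbb{P}$ and $\mathbb{Q}$ is
$$\widehat{\mathrm{MMD}}^2(X, Y) = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{i'=1}^{n} k(x_i, x_{i'}) - \frac{2}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} k(x_i, y_j) + \frac{1}{m^2} \sum_{j=1}^{m} \sum_{j'=1}^{m} k(y_j, y_{j'}) \tag{3}$$
We then define the loss function as
$$\mathcal{L}_j = \widehat{\mathrm{MMD}}^2(Z_s, Z_t) \tag{4}$$
where $Z_s = F_s(M_s^A(X_s^A), M_s^B(X_s^B))$ and $Z_t = F_t(M_t^A(X_t^A), M_t^B(X_t^B))$. By minimizing the difference between the source and target joint embeddings, we encourage the joint embeddings of both domains to be projected onto a similar latent space.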
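As an illustration of the empirical MMD with a Gaussian kernel, here is a minimal sketch in plain Python. It uses the biased estimate that averages the kernel over all pairs, including each sample with itself. The bandwidth `sigma` and the toy point sets are assumptions chosen for the example, and the function names are hypothetical.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def mean_kernel(xs, ys, sigma):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(gaussian_kernel(x, y, sigma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))

def mmd2(xs, ys, sigma=1.0):
    """Biased empirical estimate of the squared MMD between two sample sets."""
    return (
        mean_kernel(xs, xs, sigma)
        + mean_kernel(ys, ys, sigma)
        - 2.0 * mean_kernel(xs, ys, sigma)
    )

# Toy embeddings: `near` overlaps with `source`, `far` is shifted away from it.
source = [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0]]
near = [[0.1, 0.0], [0.0, 0.2], [-0.2, 0.1]]
far = [[3.0, 3.1], [3.2, 2.9], [2.9, 3.0]]

same_gap = mmd2(source, source)
near_gap = mmd2(source, near)
far_gap = mmd2(source, far)
```

The estimate is zero when both sample sets are identical and grows as the two sets move apart, which is the property the alignment loss relies on.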
4.2 Multi-Modal Embedding Alignment
It is more challenging to reduce multi-modal domain shift than conventional single-modal domain shift. The previous loss in Eq. (4) does not explicitly consider the multi-modal property. Aligning only the joint feature embedding is insufficient to adapt the source domain to the target domain. This is because the feature extractor for each modality has its own complexity of domain shift, which often differs from each other (e.g., visual vs. textual). Aligning only the fused features cannot fully reduce domain differences.
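Therefore, we introduce the following terms to minimize the maximum mean discrepancy for every single modality, i.e., $\widehat{\mathrm{MMD}}^2(M_s^A(X_s^A), M_t^A(X_t^A))$ and $\widehat{\mathrm{MMD}}^2(M_s^B(X_s^B), M_t^B(X_t^B))$. Then, the loss function we try to minimize can be written as
$$\mathcal{L}_m = \mathcal{L}_j + \lambda_A\, \widehat{\mathrm{MMD}}^2\big(M_s^A(X_s^A), M_t^A(X_t^A)\big) + \lambda_B\, \widehat{\mathrm{MMD}}^2\big(M_s^B(X_s^B), M_t^B(X_t^B)\big) \tag{5}$$
where $\mathcal{L}_j$ is the MMD loss on the joint embeddings, and $\lambda_A$ and $\lambda_B$ are trade-off parameters for the two modalities.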
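4.3 Classification
While minimizing the distance between source and target embeddings, we also want to maintain the classification performance on both the source domain and the target domain. As in a standard supervised learning setting, we employ the cross entropy loss for classification:
$$\mathcal{L}_c = \mathcal{L}_{CE}\big(C_t(Z_t), Y_t\big) + \lambda_c\, \mathcal{L}_{CE}\big(C_s(Z_s), Y_s\big) \tag{6}$$
where $\mathcal{L}_{CE}$ denotes the standard cross entropy loss, $Z_s$ and $Z_t$ are the source and target joint embeddings, and $\lambda_c$ is a trade-off parameter between the two domains.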
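A two-domain classification loss of this kind can be sketched as below. This is only an illustration: which of the two terms carries the trade-off weight is an assumption (here the source term), and the helper names and toy logits are hypothetical.

```python
import math

def cross_entropy(logits, label):
    """Softmax cross entropy for one sample, computed in a numerically stable way."""
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_norm - logits[label]

def mean_ce(batch_logits, labels):
    losses = [cross_entropy(l, y) for l, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)

def classification_loss(src_logits, src_labels, tgt_logits, tgt_labels, trade_off):
    """Target cross entropy plus a weighted source cross entropy.

    Assumption: the trade-off parameter scales the source term.
    """
    return mean_ce(tgt_logits, tgt_labels) + trade_off * mean_ce(src_logits, src_labels)

# With uniform logits over two answers, each cross entropy equals log(2).
uniform_loss = classification_loss([[0.0, 0.0]], [0], [[0.0, 0.0]], [1], trade_off=0.5)
```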
4.4 Domain Discriminator
We also propose to use a domain classifier to reduce the mismatch between the source domain and the target domain by confusing the domain classifier such that it cannot correctly tell whether a sample comes from the source domain or the target domain. The domain classifier $D$ has a similar structure to $C_s$ or $C_t$, except that the last layer outputs a scalar in $[0, 1]$ indicating how likely the sample comes from the source domain. Thus, $D$ can be optimized according to a standard cross-entropy loss. To make the features domain-invariant, the source and target mappings are optimized according to a constrained adversarial objective. The domain classifier minimizes this objective while the encoding model maximizes it. The generic formulation for the domain adversarial technique is:
$$\mathcal{L}_d = -\mathbb{E}_{z_s \sim Z_s}\big[\log D(z_s)\big] - \mathbb{E}_{z_t \sim Z_t}\big[\log\big(1 - D(z_t)\big)\big] \tag{7}$$
where $Z_s$ and $Z_t$ denote the source and target joint embeddings.
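The domain classification loss can be illustrated with a short sketch, assuming the discriminator output is the predicted probability that a sample comes from the source domain. The function name, the `eps` guard against `log(0)` and the toy probabilities are assumptions for illustration.

```python
import math

def domain_loss(src_probs, tgt_probs, eps=1e-12):
    """Binary cross entropy of a domain classifier.

    src_probs / tgt_probs hold the classifier outputs (probability of "source")
    for source and target samples respectively.
    """
    src_term = -sum(math.log(p + eps) for p in src_probs) / len(src_probs)
    tgt_term = -sum(math.log(1.0 - p + eps) for p in tgt_probs) / len(tgt_probs)
    return src_term + tgt_term

# A discriminator that separates the domains well has a low loss ...
confident = domain_loss([0.9, 0.95], [0.1, 0.05])
# ... while one that is fully confused (always 0.5) sits at 2 * log(2).
confused = domain_loss([0.5, 0.5], [0.5, 0.5])
```

Domain-invariant features push the discriminator toward the confused case, which is why the encoder is trained to increase this loss while the discriminator is trained to decrease it.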
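For simplicity, we denote $\theta_f$ as the parameters of all feature mappings, $\theta_c$ as the parameters of all label predictors, and $\theta_d$ as the parameters of the domain classifier. Putting everything together, we obtain our final objective function as follows:
$$\mathcal{L}(\theta_f, \theta_c, \theta_d) = \mathcal{L}_c(\theta_f, \theta_c) + \mathcal{L}_m(\theta_f) - \lambda_d\, \mathcal{L}_d(\theta_f, \theta_d) \tag{8}$$
where $\mathcal{L}_c$ is the label prediction (classification) loss, $\mathcal{L}_m$ is the MMD alignment loss, $\mathcal{L}_d$ is the domain classification loss and $\lambda_d$ is a trade-off parameter. We seek the parameters $\hat{\theta}_f$, $\hat{\theta}_c$, $\hat{\theta}_d$ which attain a saddle point of $\mathcal{L}$, satisfying the following conditions:
$$(\hat{\theta}_f, \hat{\theta}_c) = \arg\min_{\theta_f, \theta_c} \mathcal{L}(\theta_f, \theta_c, \hat{\theta}_d), \qquad \hat{\theta}_d = \arg\max_{\theta_d} \mathcal{L}(\hat{\theta}_f, \hat{\theta}_c, \theta_d) \tag{9}$$
At the saddle point, the parameters $\theta_d$ of the domain classifier minimize the domain classification loss $\mathcal{L}_d$, while the parameters $\theta_c$ of the label predictor minimize the label prediction loss $\mathcal{L}_c$. The feature mapping parameters $\theta_f$ minimize the label prediction loss such that the features are discriminative, while maximizing the domain classification loss such that the features are domain-invariant.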
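One common way to reach such a saddle point with ordinary gradient descent is a gradient reversal layer, which flips the sign of the domain-loss gradient before it reaches the feature mapping. The toy sketch below uses a scalar feature mapping and a scalar logistic discriminator with hand-derived gradients. The model, the learning rate `lr`, the reversal weight `lam` and all names are assumptions for illustration, not the paper's implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def domain_loss(w, u, x, is_source):
    """Cross entropy of the discriminator sigmoid(u * f) on the feature f = w * x."""
    p = sigmoid(u * w * x)
    return -math.log(p) if is_source else -math.log(1.0 - p)

def domain_grads(w, u, x, is_source):
    """Gradients of the domain loss with respect to u (discriminator) and w (feature)."""
    f = w * x
    p = sigmoid(u * f)
    d_logit = p - (1.0 if is_source else 0.0)  # derivative of the loss w.r.t. the logit
    grad_u = d_logit * f
    grad_w = d_logit * u * x
    return grad_u, grad_w

def update(w, u, x, is_source, lr=0.1, lam=1.0):
    grad_u, grad_w = domain_grads(w, u, x, is_source)
    u_new = u - lr * grad_u           # discriminator descends on the domain loss
    w_new = w - lr * (-lam * grad_w)  # reversed gradient: feature mapping ascends
    return w_new, u_new

w0, u0, x0 = 0.5, 0.8, 1.0
w1, u1 = update(w0, u0, x0, is_source=True)
before = domain_loss(w0, u0, x0, True)
after_disc_step = domain_loss(w0, u1, x0, True)  # only the discriminator moved
after_feat_step = domain_loss(w1, u0, x0, True)  # only the feature mapping moved
```

In this toy example the discriminator step lowers the domain loss while the feature step raises it, matching the min-max roles described above.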
5 Experiments
In this section, we validate our method on the open-ended VQA task and compare it with state-of-the-art methods.
5.1 Datasets
Two standard VQA benchmarks are used in our experiments, VQA 2.0 [7] and VizWiz [9]. A comparison of the statistics of these datasets is listed in Table 1, which shows that VizWiz is much smaller in terms of the numbers of images and questions. Although VizWiz has more unique answers, only 824 of its top 3,000 answers overlap with the top 3,000 answers in VQA 2.0. This helps explain why models trained on VQA 2.0 perform poorly on VizWiz and transfer only to a limited extent. We also find that 28.63% of the questions in VizWiz are not even answerable for the reasons mentioned before, which makes the domain gap even more significant. Figure 2 shows some examples from both the VQA 2.0 and VizWiz datasets.
| |VQA 2.0|VizWiz|
|# unique answers|3,126|58,789|
5.2 Evaluation Metrics
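In VQA, each question is usually associated with 10 valid answers from 10 annotators. We follow the conventional evaluation metric for the open-ended VQA setting and compute the accuracy using the following formula:
$$\mathrm{Acc}(a) = \min\left(1, \frac{\#\,\text{annotators who provided answer } a}{3}\right) \tag{10}$$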
An answer is considered correct if at least three annotators agree on the answer. Note that the true answers in VizWiz test set are not publicly available. In order to obtain the performance on the test set, results need to be uploaded to the official online submission system (https://evalai.cloudcv.org/web/challenges/challenge-page/102).
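The metric can be sketched in a few lines of Python. This is a simplified illustration: it compares raw strings and scores against all annotators at once, whereas official evaluation scripts may additionally normalize answers and average over annotator subsets. The function names and the toy annotations are hypothetical.

```python
def vqa_accuracy(predicted, human_answers):
    """Score one prediction: full credit once at least three annotators agree with it."""
    matches = sum(1 for answer in human_answers if answer == predicted)
    return min(1.0, matches / 3.0)

def mean_accuracy(predictions, all_human_answers):
    """Average the per-question scores over a set of questions."""
    scores = [vqa_accuracy(p, h) for p, h in zip(predictions, all_human_answers)]
    return sum(scores) / len(scores)

# Ten toy annotations for a single question.
answers = ["red"] * 4 + ["dark red"] * 2 + ["maroon"] + ["unanswerable"] * 3

full_credit = vqa_accuracy("red", answers)          # 4 matches
partial_credit = vqa_accuracy("dark red", answers)  # 2 matches
no_credit = vqa_accuracy("blue", answers)           # 0 matches
```

With this scoring, an answer given by only one or two annotators still earns partial credit, which is why accuracies are not restricted to 0 or 1 per question.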
5.3 Implementation Details
In all our experiments, we extract objects for each image to construct our region-based features and set the visual feature dimension to . We also set the hidden dimension of GRU to and hidden dimension after fusion to . The question length is truncated at . In the training phase, we apply a warm-up strategy by gradually increasing the learning rate from to in the first iterations. It is then multiplied by after every iterations. We use a batch size of .
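The warm-up and step-decay schedule described above can be sketched as follows. Every numeric value in the example is a placeholder chosen for illustration and is not a setting from the paper. The sketch also assumes a linear warm-up and that the decay intervals are counted from the end of the warm-up phase, neither of which is stated in the text.

```python
def learning_rate(step, base_lr, warmup_start_lr, warmup_steps, decay_factor, decay_every):
    """Learning rate at a given iteration with warm-up followed by step decay."""
    if step < warmup_steps:
        # Assumed linear interpolation from the starting rate to the base rate.
        fraction = step / warmup_steps
        return warmup_start_lr + fraction * (base_lr - warmup_start_lr)
    # Assumed: decay intervals are counted from the end of warm-up.
    n_decays = (step - warmup_steps) // decay_every
    return base_lr * decay_factor ** n_decays

# Placeholder hyperparameters for the example only.
schedule = dict(base_lr=0.01, warmup_start_lr=0.002, warmup_steps=1000,
                decay_factor=0.1, decay_every=5000)

lr_start = learning_rate(0, **schedule)
lr_mid_warmup = learning_rate(500, **schedule)
lr_after_warmup = learning_rate(1000, **schedule)
lr_after_decay = learning_rate(6000, **schedule)
```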
For domain adaptation, we let the source and target networks share the same parameters up to the penultimate layer, i.e., and . In multi- or single-modal alignment, we use Gaussian kernel to compute MMD. The trade-off parameters are set as , , , , , and .
5.4 Experimental Setup
First, we conduct experiments by using the VQA 2.0 dataset as the source domain and the VizWiz dataset as the target domain, to evaluate the effectiveness of our proposed method for multi-modal domain adaptation. We also conduct experiments in the opposite way, using VizWiz as the source domain and VQA 2.0 as the target domain, to further demonstrate the effectiveness of our approach.
We need to emphasize that we choose not to use an overly strong base model (i.e., question embedding from FastText, complex fusion techniques, OCR tokens etc.), as our focus is on multi-modal adaptation instead of the base model itself. Despite that, we will show that our proposed domain adaptation method with a weaker base model still outperforms the fine-tuned state-of-the-art model.
5.5 Results and Analysis
Adaptation from VQA 2.0 to VizWiz As discussed in previous sections, we first pretrain a source model on the VQA 2.0 dataset, and then adapt the pretrained source model to the target dataset VizWiz. The results of our proposed method and other leading methods are shown in Table 2.
We first compare our method with the original VizWiz baseline proposed by [9], the previous state-of-the-art VQA model BAN by [14] and the current state-of-the-art VQA model Pythia by [24]. Table 2 shows that our method outperforms these state-of-the-art models by a significant margin.
In order to validate that the better performance of our method is not due to a strong base model, we additionally report the results of our method in Table 3, with 1) training our single base model from scratch using only the VizWiz dataset (Target only), 2) fine-tuning from the model pretrained on the VQA 2.0 dataset (Fine-tune), and 3) our proposed domain adaptation method (DA). From Table 3, it shows that our model fine-tuned from VQA 2.0 is about percent worse than Pythia fine-tuned from VQA 2.0 ( vs. ), indicating that the better performance of our final model than the state-of-the-art is not from a strong base model. Moreover, the accuracy of our base model trained from scratch is , falling behind percent to Pythia trained from scratch, which is consistent with our observation that our method even with a weaker base model can achieve superior final results.
|Pythia (see footnote 2)|54.72|
Footnote 2: The accuracy for Pythia was obtained by fine-tuning from the model pretrained on the VQA 2.0 dataset.
|(Accuracy in %)|Overall|Yes/No|Number|Answerable|Other|
Results breakdown into answer categories Table 4 shows the accuracy breakdown into different answer categories. The results show that our model achieves new state-of-the-art performance on “Number” and “Other” categories as well as overall accuracy. Note that the overall accuracy for Pythia in this table is instead of which we were unable to reproduce using the released code and there are no breakdown numbers reported associated with it. The best we can achieve with Pythia (after fine-tuning from VQA 2.0) is and the corresponding breakdown numbers are reported in the table.
|(+ Fine-tune)|53.97|+ 0.86|
|+ MMD on V and Q, CLS|55.46|+ 1.49|
|+ MMD, GRL on joint|55.87|+ 0.41|
|+ Ensemble of 3 models|56.20|+ 0.33|
Ablation study. We conduct an ablation study to show the contributions of different components of our method. Specifically, we consider: 1. Target only: training the base model using only the data in the target domain. 2. +Fine-tune: pretraining a model on the source VQA 2.0 dataset and then fine-tuning it on the target VizWiz dataset. Please note that this is unavailable during adaptation, thus it is marked inside “()”. 3. +MMD on V and Q, CLS: our domain adaptation method with MMD alignment on visual and textual features separately, and classification modules applied for both domains. 4. +MMD, GRL on joint: our domain adaptation method with MMD alignment also on the joint embeddings of both domains, along with the domain discriminator trained through a gradient reversal layer. 5. +Ensemble of 3 models. The results show that the multi-modal MMD brings the most significant performance gain (+1.49), which validates that aligning every single modality is beneficial to the transferability of multi-modal tasks. In addition, MMD on the joint embedding together with the discriminator brings a further gain (+0.41). Not surprisingly, an ensemble of three models pushes our performance even higher to 56.20%, which is the state-of-the-art performance to date.
Comparisons on domain adaptation methods. We compare our multi-modal domain adaptation method with some popular domain adaptation methods, including DANN [4], ADDA [25], WDGRL [23], and SDT [11]. Note that DANN, ADDA and WDGRL were originally designed for unsupervised domain adaptation. For a fair comparison, we fine-tune the model using target labels after unsupervised adaptation (hence they are indicated by a suffix ‘+’). SDT is currently the most popular and best-performing supervised domain adaptation method. The results in Table 6 show that, compared to direct fine-tuning, the existing domain adaptation methods do not help much in the multi-modal task (DANN even performs worse), while our method outperforms both direct fine-tuning and the existing domain adaptation methods by a notable margin.
|Target data used|Target only|Fine-tune|DA|
Adaptation with fewer target training samples We also validate the robustness of our framework by reducing the target training dataset size. We experiment with various target sizes of 1/8 (2,500), 1/4 (5,000), 1/2 (10,000) and all data (20,000). The results are shown in Table 7. We can observe that with the increase of the amount of training data, the performance gain over fine-tuning is decreasing. We conjecture that this is because when we have limited amount of target data, having more prior knowledge is beneficial to model performance, while having more target data will make prior knowledge less helpful. However, our method can stably improve the performance because it sufficiently makes use of target data and source data. It is more promising that our domain adaptation method using fewer samples can achieve comparable or better performance compared to training from scratch using doubled amount of data (especially when target data is scarce), e.g., our method using 1/4 data (48.93%) outperforms training from scratch using 1/2 data (47.48%).
Adaptation from VizWiz to VQA 2.0. In order to further validate the robustness of our method, we reverse the source domain and the target domain and perform adaptation. We pretrain the source model on VizWiz and adapt it to VQA 2.0. The results are shown in Table 8, from which we can still observe a significant improvement of our method over fine-tuning. As a comparison, BAN and Pythia trained from scratch achieve 69.08% and 69.21%, respectively, and our DA model achieves performance comparable to the state-of-the-art on VQA 2.0.
6 Conclusion
We have presented a novel supervised multi-modal domain adaptation framework for open-ended visual question answering. Under the framework, we have developed a new method for VQA which can simultaneously learn domain-invariant and downstream-task-discriminative multi-modal feature embedding. We validate our proposed method on two popular VQA benchmark datasets, VQA 2.0 and VizWiz, in both directions of adaptation. The experimental results show our method outperforms the state-of-the-art methods.
References
[1] Bottom-up and top-down attention for image captioning and visual question answering. In CVPR.
[2] (2015) VQA: visual question answering. In ICCV.
[3] (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP.
[4] (2015) Unsupervised domain adaptation by backpropagation. In ICML.
[5] (2018) Detectron. https://github.com/facebookresearch/detectron
[6] (2012) Geodesic flow kernel for unsupervised domain adaptation. In CVPR.
[7] (2019) Making the V in VQA matter: elevating the role of image understanding in visual question answering. IJCV, pp. 398–414.
[8] (2012) Cross language text classification via subspace co-regularized multi-view learning. In ICML.
[9] (2018) VizWiz grand challenge: answering visual questions from blind people. In CVPR.
[10] (2015) Deep residual learning for image recognition. In CVPR.
[11] (2015) Simultaneous deep transfer across domains and tasks. In ICCV.
[12] (2016) Revisiting visual question answering baselines. In ECCV, B. Leibe, J. Matas, N. Sebe, and M. Welling (Eds.).
[13] (2017) Show, ask, attend, and answer: a strong baseline for visual question answering. In CVPR.
[14] (2018) Bilinear attention networks. In NIPS.
[15] Domain adaptation by mixture of alignments of second- or higher-order scatter tensors. In CVPR.
[16] (2017) Visual Genome: connecting language and vision using crowdsourced dense image annotations. IJCV, pp. 32–73.
[17] (2016) Feature pyramid networks for object detection. In CVPR.
[18] (2014) Microsoft COCO: common objects in context. In ECCV, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars (Eds.).
[19] GloVe: global vectors for word representation. In EMNLP.
[20] (2018) A unified framework for multimodal domain adaptation. In ACM Multimedia.
[21] (2015) Faster R-CNN: towards real-time object detection with region proposal networks. TPAMI.
[22] (2018) Cross-dataset adaptation for visual question answering. In CVPR.
[23] (2017) Wasserstein distance guided representation learning for domain adaptation. In AAAI.
[24] (2019) Towards VQA models that can read. In CVPR.
[25] (2017) Adversarial discriminative domain adaptation. In CVPR.
[26] (2017) Aggregated residual transformations for deep neural networks. In CVPR.
[27] (2015) Semi-supervised domain adaptation with subspace learning for visual recognition. In CVPR.
[28] (2018) Learning to count objects in natural images for visual question answering. In ICML.
[29] (2015) Simple baseline for visual question answering. In arXiv:1512.02167.
[30] (2016) Visual7W: grounded question answering in images. In CVPR.