Semantic Equivalent Adversarial Data Augmentation for Visual Question Answering

Visual Question Answering (VQA) has achieved great success thanks to the fast development of deep neural networks (DNNs). Data augmentation, one of the major tricks for training DNNs, has been widely used in many computer vision tasks. However, few works study data augmentation for VQA, and none of the existing image-based augmentation schemes (such as rotation and flipping) can be directly applied to VQA because of its semantic structure: an ⟨image, question, answer⟩ triplet needs to be maintained correctly. For example, a direction-related Question-Answer (QA) pair may no longer be true if the associated image is rotated or flipped. In this paper, instead of directly manipulating images and questions, we use generated adversarial examples for both images and questions as the augmented data. The augmented examples change neither the visual properties presented in the image nor the semantic meaning of the question, so the correctness of the ⟨image, question, answer⟩ triplet is still maintained. We then use adversarial learning to train a classic VQA model (BUTD) with our augmented data. We find that our model not only improves the overall performance on VQAv2 but also withstands adversarial attacks effectively, compared to the baseline model. The source code is available at



1 Introduction

Both computer vision and natural language processing (NLP) have made enormous progress on many problems using deep learning in recent years. Visual question answering (VQA) is a field of study that fuses computer vision and NLP and builds on these successes. A VQA algorithm aims to predict a correct answer to a given question about an image. The recent benchmark study [vqafut] demonstrates that the performance of VQA algorithms hinges on the amount of training data, and that existing algorithms can always benefit greatly from more training data. This suggests that data augmentation without manual annotation is a natural way to improve VQA performance, just as it has succeeded in other deep learning applications.

Existing data augmentation approaches enlarge the training set by either data warping or oversampling [dasur]. Data warping transforms data while keeping their labels; typical examples include geometric and color transformations, random erasing, adversarial training, and neural style transfer. Oversampling generates synthetic instances and adds them to the training set. Data augmentation has been shown to be effective in alleviating the overfitting problem of DNNs [dasur]. However, data augmentation for VQA is barely studied because of the challenge of keeping an ⟨image, question, answer⟩ triplet semantically correct. Neither geometric transformation nor random erasing of the image preserves the answer. For example, for questions such as “What is the position of the computer?” or “Is the car to the left or right of the trash can?”, flipping or rotating the image results in the opposite answer. Randomly erasing part of the image associated with a question such as “How many …?” could remove some of the objects to be counted. Such transforms need tailored answers, which are unavailable. On the textual side, it is challenging to come up with generalized rules for language transformation, and universal data augmentation techniques in NLP have not been thoroughly explored. Therefore, it is non-trivial to explore data augmentation techniques for VQA.

Previous works on Visual Question Generation (VQG) have generated reasonable questions based on the image content [othervqg] and the given answer [dualvqg]. However, a significant portion of the generated questions either have grammatical errors or are oddly phrased. In addition, these models learn from the questions and images in the same target dataset, so the generated data are drawn from the same distribution as the original data. Since the training and test data usually do not share the same distribution, the generated data cannot help to relieve overfitting.

In this paper, we propose to generate semantically equivalent adversarial examples of both visual and textual data as augmented data. Adversarial examples are strategically modified samples that can fool deep models into making incorrect predictions. The modification is nevertheless imperceptible: it keeps the semantics of the data while driving the underlying distribution of adversarial examples away from that of the original data [advprop]. In our method, visual adversarial examples are generated by an un-targeted gradient-based attacker [scale], and textual adversarial examples are paraphrases that can fool the VQA model (making it predict a wrong answer) while keeping the questions semantically equivalent. The existence of adversarial examples not only reveals the limited generalization ability of ConvNets, but also poses security threats to the real-world deployment of these models.

We adversarially train the strong baseline method Bottom-Up-Attention and Top-Down (BUTD) [butd] on the VQAv2 dataset [vqa] with clean examples and adversarial examples generated on-the-fly. We regard adversarial training as a regularizer that acts during a period of the training time. Experimental results demonstrate that our proposed adversarial training framework not only boosts model performance on clean examples more than other data augmentation techniques, but also improves model robustness against adversarial attacks. Although a few works study the data augmentation problem for VQA [vqada, cyc, ctm, cau], they merely generate either new questions or new images. To the best of our knowledge, our work is the first to augment both visual and textual data in VQA.

To summarize, our major contributions are threefold:

  • We propose to generate visual and textual adversarial examples to augment the VQA dataset. Our generated data preserve the semantics and explore the learned decision boundary to help improve the model generalization.

  • We propose an adversarial training scheme that enables VQA models to take advantage of the regularization power of adversarial examples.

  • We show that the model trained with our method achieves 65.16% accuracy on the clean validation set, beating its vanilla training counterpart by 1.84%. Moreover, the adversarially trained model significantly increases accuracy on adversarial examples by 21.55%.

2 Related Work

2.0.1 VQA.

A large number of VQA algorithms have been proposed, including spatial attention [butd, stacked, hierarchical, mutan], compositional approaches [nmn, compose, e2e], and bilinear pooling schemes [multimodal, hadamard]. Spatial attention [butd] is one of the most widely used methods for both natural and synthetic image VQA. A large portion of prior work [ban, count, graph, intra] is built upon the bottom-up top-down (BUTD) attention method [butd]. We also choose BUTD as our baseline VQA model. Instead of developing a more sophisticated answering machine, we propose a VQA data augmentation technique that can potentially benefit existing VQA methods, since data is the fuel.

2.0.2 Data Augmentation.

Compared to vision, few efforts have been made to augment text for classification problems. Wei et al. [eda] comprehensively extend text editing techniques for NLP data augmentation and achieve gains on text classification. However, our study shows that their method can degrade model performance on the VQA task (see Section 4). Other works generate paraphrases [qanet, para] or add noise to smooth text data [noise]. Even fewer works [vqada, ctm, cyc, cau, patro2018differential] learn data augmentation for VQA. Kafle et al. [vqada] do pioneering work in which they generate new questions by using semantic annotations on images. The work of [ctm] automatically generates entailed questions for a source QA pair, but it uses additional data from Visual Genome [vg] to add diversity to the generated questions. The work of [cyc] proposes a cycle-consistent training scheme that generates different rephrasings of a question and trains the model such that the predicted answers across the generated and original questions remain consistent. The method of [cau] employs a GAN-based re-synthesis technique to automatically remove objects, strengthening model robustness against semantic visual variations. Note that all of these methods augment data in a single modality (text-only or image-only) and rely heavily on complex modules to achieve slight performance gains.

2.0.3 Adversarial Attack and Defense.

In recent years, research works [szegedy, fgsm] have added imperceptible perturbations to input images, named adversarial examples, to evaluate the robustness of deep neural networks against such perturbation attacks. In the NLP community, state-of-the-art textual DNN attackers [belinkov2017synthetic, blohm2018comparing, ebrahimi2018adversarial] use a different approach from those in the visual community to generate textual adversarial examples. Our work is inspired by SCPNs [scpn] and SEA [sea], which generate paraphrases of a sentence as textual adversarial examples. Meanwhile, previous work [fgsm] shows that, in the fully-supervised setting, training with adversarial examples can improve model generalization on small datasets (e.g., MNIST) but degrades performance on large datasets (e.g., ImageNet). The recent notable work [advprop] suggests that adversarial training can boost model performance even on ImageNet with a well-designed training scheme. A number of methods [att, fool] have investigated adversarial attacks on the VQA task. However, they merely attack the image and do not discuss how adversarial examples can benefit the VQA model. To summarize, how adversarial examples can facilitate VQA remains an open problem. This work sheds light on utilizing adversarial examples as augmented data for VQA.

Figure 1: Framework of the proposed data augmentation method. We generate adversarial examples of both visual and textual data as augmented data, which are passed through the VQA model to obtain incorrect answers. The augmented and original data are jointly trained using the proposed adversarial training scheme, which can boost model performance on clean data while improving model robustness against attack.

3 Method

We now introduce our data augmentation method to train a robust VQA model. As illustrated in Fig. 1, given an ⟨image, question, answer⟩ triplet, we first generate paraphrases of the question and store them. We then generate visual adversarial examples on-the-fly to obtain semantically equivalent additional training triplets, which are used in the proposed adversarial training scheme. We describe each step in detail below.

3.1 VQA Model

Answering questions about images can be formulated as the problem of predicting an answer $a$ given an image $v$ and a question $q$ according to a parametric probability measure:

$$\hat{a} = \arg\max_{a \in \mathcal{A}} p(a \mid v, q; \theta) \tag{1}$$

where $\theta$ represents a vector of all parameters to learn and $\mathcal{A}$ is a set of all answers. VQA requires solving several tasks at once involving both visual and textual inputs. Here we use Bottom-Up-Attention and Top-Down (BUTD) [butd] as our backbone model because it has become a gold-standard baseline in VQA. In BUTD, region-specific image features extracted by a fine-tuned Faster R-CNN [frcnn] are utilized as visual inputs. In this paper, let $v = \{v_1, \dots, v_K\}$ be a collection of visual features extracted from $K$ image regions, and let the question be a sequence of words $q = \{q_1, \dots, q_n\}$. The ⟨$v$, $q$, $a$⟩ triplet has such a strong semantic relation that neither the image nor the question can be easily transformed to augment the training data while preserving the original content.
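The prediction rule above, which picks the answer with the highest probability among all candidate answers, can be illustrated with a minimal sketch. The candidate answers and their scores below are made-up toy values, and `predict_answer` is a hypothetical helper, not part of BUTD.

```python
import math

def softmax(scores):
    """Turn unnormalized scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_answer(answer_scores):
    """Return the most probable answer and its probability.

    answer_scores: dict mapping each candidate answer to an
    unnormalized model score (toy stand-in for a VQA model output).
    """
    answers = list(answer_scores)
    probs = softmax([answer_scores[a] for a in answers])
    best = max(range(len(answers)), key=lambda i: probs[i])
    return answers[best], probs[best]

# Toy scores over a tiny answer set (assumed values).
toy_scores = {"single": 2.1, "double": 0.3, "none": -1.0}
answer, prob = predict_answer(toy_scores)
print(answer, round(prob, 3))
```

In a real model the scores would come from the network given the image features and the question; only the final arg max step is shown here.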

3.2 Data Augmentation

Due to the risk of affecting answers, we avoid manipulating the raw inputs (i.e., images and questions) directly, such as cropping the image or changing the word order. Inspired by the adversarial attack and defense, we propose to generate adversarial examples as additional training data. In this section, we present how to generate adversarial examples of images and questions while preserving the original labels and how to use them to augment the training data.

3.2.1 Visual Adversarial Examples Generation.

Adversarial attacks originated in the computer vision community. In general, the overarching goal is to add the least amount of perturbation to the input data that causes the desired misclassification. We employ an efficient gradient-based attacker, the Iterative Fast Gradient Sign Method (IFGSM) [ifgsm], to generate visual adversarial examples. Before describing IFGSM, we first introduce FGSM, since IFGSM is an extension of it. Goodfellow et al. [fgsm] proposed FGSM as a simple way to generate adversarial examples. We can apply it to the visual input as:

$$v_{adv} = v + \epsilon \, \mathrm{sign}\left(\nabla_{v} L(\theta, v, q, a_{true})\right) \tag{2}$$

where $v_{adv}$ is the adversarial example of $v$, $\theta$ is the set of model parameters, $L(\theta, v, q, a_{true})$ denotes the cost function used to train the VQA model, and $\epsilon$ is the size of the adversarial perturbation. The attack backpropagates the gradient to the input visual feature to calculate $\nabla_{v} L(\theta, v, q, a_{true})$ while fixing the network parameters. Then, it adjusts the input by a small step in the direction (i.e., $\mathrm{sign}(\nabla_{v} L(\theta, v, q, a_{true}))$) that maximizes the loss. The resulting perturbed input, $v_{adv}$, is then misclassified by the VQA model (e.g., the model predicts Double in Fig. 1).

A straightforward extension of FGSM is to apply it multiple times with a small step size, referred to as IFGSM:

$$v_{adv}^{0} = v, \qquad v_{adv}^{N+1} = \mathrm{Clip}_{v,\epsilon}\left\{ v_{adv}^{N} + \alpha \, \mathrm{sign}\left(\nabla_{v} L(\theta, v_{adv}^{N}, q, a_{true})\right) \right\} \tag{3}$$

where $\mathrm{Clip}_{v,\epsilon}\{A\}$ denotes element-wise clipping of $A$, with $A_{i,j}$ clipped to the range $[v_{i,j} - \epsilon, v_{i,j} + \epsilon]$, and $\alpha$ is the step size in each iteration. In this paper, we summarize the gradient-based method as $v_{adv} = \mathrm{V}_{adv}(v, q; \theta)$.

One-step methods of adversarial example generation produce a candidate adversarial image after computing only one gradient, whereas iterative methods apply many gradient updates. Iterative methods typically do not rely on any approximation of the model and typically produce more harmful adversarial examples when run for more iterations. Our experimental results show that the accuracy of the vanilla trained BUTD model on visual adversarial examples generated by IFGSM is about 17% to 30%, depending on the attacker parameters. This implies that adversarial examples follow a different distribution from normal examples.
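To make the two attackers concrete, the following is a minimal sketch of the FGSM and IFGSM updates on a toy problem. It assumes a tiny logistic model in place of the VQA network so that the input gradient can be written by hand; the weights, inputs, perturbation size `eps` and step size `alpha` are illustrative values, not the settings used in the experiments.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_grad(v, w, y):
    """Cross-entropy loss of a toy logistic model and its gradient w.r.t. the input v."""
    p = sigmoid(sum(wi * vi for wi, vi in zip(w, v)))
    loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    grad = [(p - y) * wi for wi in w]  # d loss / d v, parameters w held fixed
    return loss, grad

def sign(x):
    return (x > 0) - (x < 0)

def fgsm(v, w, y, eps):
    """One step of size eps along the sign of the input gradient."""
    _, grad = loss_and_grad(v, w, y)
    return [vi + eps * sign(gi) for vi, gi in zip(v, grad)]

def ifgsm(v, w, y, eps, alpha, steps):
    """Repeated small steps, each followed by clipping to [v - eps, v + eps]."""
    v_adv = list(v)
    for _ in range(steps):
        _, grad = loss_and_grad(v_adv, w, y)
        stepped = [xi + alpha * sign(gi) for xi, gi in zip(v_adv, grad)]
        v_adv = [min(max(xi, vi - eps), vi + eps) for xi, vi in zip(stepped, v)]
    return v_adv

# Toy input features, fixed model weights and true label (assumed values).
v, w, y = [0.5, 1.0, 2.0], [1.0, -2.0, 0.5], 1
clean_loss, _ = loss_and_grad(v, w, y)
adv_one_step = fgsm(v, w, y, eps=0.3)
adv_iter = ifgsm(v, w, y, eps=0.3, alpha=0.2, steps=2)
print(clean_loss, loss_and_grad(adv_one_step, w, y)[0], loss_and_grad(adv_iter, w, y)[0])
```

Both perturbed inputs stay within `eps` of the clean input in every coordinate while increasing the loss, which is the property the attack relies on.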

3.2.2 Semantic Equivalent Questions Generation.

To generate an adversarial example of a question, we cannot directly apply approaches from image DNN attackers, since textual data is discrete. In addition, the perturbation size measured by a norm in the image domain is inapplicable to textual data. Moreover, small changes in text, e.g., changing a character or a word, can easily destroy the grammar and semantics, which may cause the attack to fail. Adhering to the principle of not changing the semantics of the input data, and inspired by [scpn, para], we generate semantically equivalent adversarial questions by using a sequence-to-sequence paraphrasing model.

Here we propose a paraphrasing model based purely on neural networks; it is an extension of the basic encoder-decoder Neural Machine Translation (NMT) framework. In the neural encoder-decoder framework, the encoder (an RNN) compresses the meaning of the source sentence into a sequence of vectors. The decoder, a conditional RNN language model, generates a target sentence word by word. The encoder takes a sequence of original question words $q = \{q_1, \dots, q_n\}$ as input and produces a sequence of context vectors. Given the source sentence, the decoder produces a probability distribution over the target sentence $q' = \{q'_1, \dots, q'_m\}$ with a softmax function:

$$P(q'_1, \dots, q'_m \mid q) = \prod_{t=1}^{m} P(q'_t \mid q'_{<t}, q) \tag{4}$$

However, in the case of paraphrasing, there is no direct path from English to English, but a path from English to a pivot language and back to English can be used. For example, the source English sentence $q$ is translated into a single French sentence $f$. Next, $f$ is translated back into English, giving a probability distribution over English sentences $q'$, which acts as the paraphrase distribution:

$$P(q' \mid q, f) = P(q' \mid f) \tag{5}$$

Our paraphrasing model pivots through the set of $M$-best translations $F = \{f_1, \dots, f_M\}$ of $q$. This ensures that multiple aspects (semantic and syntactic) of the source sentence are captured. Translating multiple pivot sentences into one sentence, producing a probability distribution over the target vocabulary, can be formulated as:

$$P(q'_t \mid q'_{<t}, F) = \sum_{k=1}^{M} \lambda_k \, P(q'_t \mid q'_{<t}, f_k) \tag{6}$$

where the weights $\lambda_k$ sum to one. We further expand on the multi-pivot approach by pivoting over multiple sentences in multiple languages (e.g., Portuguese). Deriving from Eq. 6, we obtain $P(q'_t \mid q'_{<t}, F^{Fr})$ and $P(q'_t \mid q'_{<t}, F^{Po})$. Averaging these two distributions produces a multi-sentence, multi-lingual paraphrase probability:

$$P(q'_t \mid q'_{<t}, F^{Fr}, F^{Po}) = \frac{1}{2}\left( P(q'_t \mid q'_{<t}, F^{Fr}) + P(q'_t \mid q'_{<t}, F^{Po}) \right) \tag{7}$$

which is used to obtain the probability distribution over sentences:

$$P(q' \mid q) = \prod_{t=1}^{m} P(q'_t \mid q'_{<t}, F^{Fr}, F^{Po}) \tag{8}$$

We employ a pre-trained NMT model, which is trained for English↔Portuguese and English↔French, to generate paraphrase candidates. A score [sea] that measures the semantic similarity between a paraphrase and its original text is defined as:

$$S(q, q') = \min\left(1, \frac{P(q' \mid q)}{P(q \mid q)}\right) \tag{9}$$

where $P(q' \mid q)$ is the probability of a paraphrase $q'$ given the original question $q$ defined in Eq. 8, and $P(q \mid q)$, which approximates how difficult it is to recover $q$, is used to normalize different distributions. We penalize candidates whose edit distance from the original question exceeds a threshold, or that contain unknown words, by adding a large negative number to the similarity score. We select the paraphrase candidates with the top-$k$ semantic scores as our adversarial questions $q_{adv}$. The generation algorithm of $q_{adv}$ is denoted $\mathrm{Q}_{adv}(q)$.
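The candidate selection step described above (normalized similarity score, penalty for large edit distance or unknown words, then top-k) can be sketched as follows. This is an illustration under stated assumptions: the paraphrase probabilities are made-up numbers standing in for the output of an NMT model, and the threshold `max_edits`, the penalty value and the vocabulary are hypothetical.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, start=1):
        cur = [i]
        for j, wb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (wa != wb)))
        prev = cur
    return prev[-1]

def semantic_score(p_para, p_self, original, candidate, vocab, max_edits, penalty):
    """min(1, P(q'|q) / P(q|q)), plus a large negative penalty for bad candidates."""
    score = min(1.0, p_para / p_self)
    tokens = candidate.split()
    too_far = edit_distance(original.split(), tokens) > max_edits
    has_unknown = any(tok not in vocab for tok in tokens)
    if too_far or has_unknown:
        score += penalty
    return score

def select_paraphrases(original, candidates, p_self, vocab, k=1, max_edits=3, penalty=-10.0):
    """Return the k best (score, text) pairs among (text, probability) candidates."""
    scored = [
        (semantic_score(p, p_self, original, text, vocab, max_edits, penalty), text)
        for text, p in candidates
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]

# Toy vocabulary and candidate probabilities (assumed values).
vocab = {"what", "which", "color", "colour", "is", "does", "the", "sky", "have"}
original = "what color is the sky"
candidates = [
    ("what colour is the sky", 0.02),
    ("which color does the sky have", 0.03),
    ("what color is the zzz", 0.05),  # contains an out-of-vocabulary word
]
best = select_paraphrases(original, candidates, p_self=0.04, vocab=vocab)
print(best)
```

The third candidate has the highest raw probability ratio but is pushed to the bottom by the penalty because it contains an unknown word.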

Figure 2: Examples of our generated paraphrases $q_{adv}$. The first question in bold in each block is the original question. The words in brackets are the model predictions for the corresponding question; the numbers in brackets are the semantic scores of $q_{adv}$.

Our paraphrases edit as few words as possible in order to maintain syntax and semantics, rather than exploring linguistic variations regardless of whether they can be perceived. We illustrate two examples of our paraphrased questions $q_{adv}$ in Fig. 2. They show that generated paraphrases can easily “break” the BUTD model. A predicted label is considered “flipped” if it differs from the prediction on the corresponding original question (assuming that we do not attack the visual data in this part). We observe that the paraphrases not only flip positive predictions to negative ones but also, in some cases, correct negative predictions to positive ones. Surprisingly, the flip rate of the vanilla trained model is 36.72%, causing an absolute accuracy drop of about 9% (from 63.32% to 54.03%; see Section 4.5). This suggests that the model's decisions are brittle and indicates that the model exploits spurious correlations when making its predictions.
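The flip rate used above can be computed directly from two lists of predictions. The sketch below uses made-up predictions and, as a simplifying assumption, exact-match accuracy against a single ground-truth answer rather than the soft VQA accuracy metric.

```python
def flip_rate(preds_original, preds_paraphrase):
    """Fraction of questions whose prediction changes after paraphrasing."""
    flipped = sum(po != pp for po, pp in zip(preds_original, preds_paraphrase))
    return flipped / len(preds_original)

def accuracy(preds, answers):
    """Exact-match accuracy (a simplification of the VQA accuracy metric)."""
    return sum(p == a for p, a in zip(preds, answers)) / len(answers)

# Toy predictions on four questions (assumed values).
answers = ["yes", "2", "blue", "right"]
preds_original = ["yes", "2", "red", "left"]
preds_paraphrase = ["no", "2", "red", "right"]

print(flip_rate(preds_original, preds_paraphrase))
print(accuracy(preds_original, answers), accuracy(preds_paraphrase, answers))
```

In this toy case one flip turns a correct prediction into a wrong one and another flip does the opposite, so the flip rate is non-zero even though the accuracy is unchanged.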

3.3 Adversarial Training with Augmented Examples

Following the adversarial training framework [scale, advprop], we treat adversarial examples as additional training samples and train networks with a mixture of adversarial and clean examples. The augmented questions are model-agnostic and generated before training, while visual adversarial examples are continually generated at every step of training. There are two kinds of visual adversarial examples depending on the question input: those generated with the original question $q$ and those generated with its paraphrase $q_{adv}$.

For each ⟨$v$, $q$⟩ pair, we have 4 additional training pairs, which combine clean or adversarial visual features with the original or paraphrased question. All four pairs are semantically equivalent, which means they hold the same ground-truth answer. We maintain the original triplet but augment the original example at least four times, in the case where only one $q_{adv}$ is generated. We formulate a loss function that allows control of the relative weight of the additional pairs in each batch. It combines $L(\theta, v, q, a_{true})$, the loss on a batch of clean $v$ and $q$ examples with true answer $a_{true}$, with the losses on the additional pairs, where $\lambda$ is a parameter that controls the relative weight of adversarial examples in the loss. Our main goal is to improve network performance on clean images by leveraging the regularization power of adversarial examples. We empirically find that training with a mixture of adversarial and clean examples from beginning to end does not converge well on clean samples. Therefore, we mix them during a period of the training time and fine-tune with clean examples in the remaining epochs. This not only boosts the performance on clean examples, but also improves the robustness of the model to adversarial examples. We present our adversarial training scheme in Algorithm 1.


Input: A set of clean visual and textual examples $v$, $q$ with answers $a_{true}$
Output: Network parameter $\theta$
for each training step i do
    Sample a mini-batch of clean visual examples $v$, clean textual examples $q$ and textual adversarial examples $q_{adv}$ with answer $a_{true}$;
    if i is in the adversarial training period then
        Generate the corresponding mini-batch of additional training pairs;
        Minimize the loss in Section 3.3 w.r.t. network parameter $\theta$
    else
        Minimize the loss $L(\theta, v, q, a_{true})$ w.r.t. network parameter $\theta$
    end if
end for
Algorithm 1: Pseudo code of our adversarial training
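The control flow of Algorithm 1 can be sketched as a per-step loss selection. The exact form of the combined loss is not reproduced here; as an assumption for illustration, the sketch adds the mean loss over the additional pairs, weighted by a hypothetical `lam`, to the clean loss. The period boundaries and loss values are also made-up.

```python
def in_adversarial_period(epoch, start_epoch, end_epoch):
    """True while additional (adversarial) pairs are injected into training."""
    return start_epoch <= epoch < end_epoch

def training_loss(clean_loss, adversarial_losses, lam, epoch, start_epoch, end_epoch):
    """Clean loss alone outside the period; clean plus weighted adversarial loss inside.

    The weighting scheme (clean + lam * mean of the additional-pair losses) is an
    assumption made for this sketch.
    """
    if in_adversarial_period(epoch, start_epoch, end_epoch) and adversarial_losses:
        extra = sum(adversarial_losses) / len(adversarial_losses)
        return clean_loss + lam * extra
    return clean_loss

# Toy run: 25 epochs with a hypothetical adversarial period of epochs 10 to 19.
schedule = [
    training_loss(1.0, [2.0, 2.0, 3.0, 1.0], lam=0.5, epoch=e, start_epoch=10, end_epoch=20)
    for e in range(25)
]
print(schedule)
```

Outside the period the model is trained or fine-tuned on clean examples only, which matches the warm-up and clean fine-tuning phases described in the text.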

4 Experiments

4.1 Experiments Setup

4.1.1 Dataset.

We conduct experiments on the VQAv2 [vqa], which is improved from the previous version to emphasize visual understanding by reducing the answer bias in the dataset. VQAv2 contains 443K train, 214K validation and 453K test examples. The annotations for the test set are unavailable except for the remote evaluation servers. We provide our results on both validation and test set, and perform ablation study on the validation set.

4.1.2 VQA Architectures.

We use a strong baseline Bottom-Up-Attention and Top-Down (BUTD) [butd] which combines a bottom-up and a top-down attention mechanism to perform VQA, with the bottom-up mechanism generating object proposals from Faster R-CNN [frcnn], and the top-down mechanism predicting an attention distribution over those proposals. This model obtained first place in the 2017 VQA Challenge. Following setting in [butd, tip], we use a maximum of 100 object proposals per image, which are 2048 dimensional features, as visual input. We represent question words as 300 dimensional embeddings initialized with pre-trained GloVe vectors [glove], and process them with a one-layer GRU to obtain a 1024 dimensional question embedding.

4.1.3 Training Details.

For fair comparison, we train the BUTD baseline and our framework using Adamax [adam] with a batch size of 256 on the training split for a total of 25 epochs. The baseline achieves 63.32% accuracy on the validation set, and we save this checkpoint to evaluate the attackers in the following sections. To augment data with our framework, we adjust the learning rate at different stages. We set an initial learning rate of 0.001, and then, after five epochs, decay it at the rate of 0.25 every two epochs. We inject the additional data only during a period of epochs $(T_s, T_e)$, where $T_s$ is the epoch when we start adversarial training and $T_e$ is the epoch when we start standard training. We set the number of iterations of IFGSM to 2 and the number of generated paraphrases per question to 1 to save training time. In paraphrase generation, we apply the edit distance threshold and penalization score described in Section 3.2.2. To avoid the label leaking effect [scale], we replace the true label in Eq. 2 and 3 with the most likely label predicted by the model during adversarial training. The hyperparameters of our best result are chosen based on grid search, and other settings are tested in the ablation studies.
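One possible reading of the learning rate schedule described above is sketched below. It assumes that the rate stays at 0.001 for the first five epochs and is then multiplied by 0.25 at epoch 5 and again every two epochs; whether the first decay happens at epoch 5 or epoch 7 is an assumption, and the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=1e-3, hold_epochs=5, decay=0.25, every=2):
    """Step schedule: hold base_lr, then multiply by `decay` every `every` epochs."""
    if epoch < hold_epochs:
        return base_lr
    n_decays = (epoch - hold_epochs) // every + 1
    return base_lr * decay ** n_decays

# Learning rate for the first ten epochs (0-indexed) under this reading.
for epoch in range(10):
    print(epoch, learning_rate(epoch))
```

With these assumptions, epochs 0 to 4 use 0.001, epochs 5 and 6 use 0.00025, and each later pair of epochs uses a quarter of the previous rate.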

4.2 Results

4.2.1 Overall Performance.

Table 1 shows the results on the VQAv2 validation, test-dev and test-std sets. We compare our method with the BUTD vanilla training setting. Our method outperforms the vanilla trained baseline, with gains of 1.84%, 2.55% and 2.6% on the validation, test-dev and test-std sets, respectively. Furthermore, our adversarial training method consumes only a small amount of additional time (4 min per epoch) while allowing a considerable increase in standard accuracy.

4.2.2 Comparison with Other Data Augmentation Methods.

We compare our method with the related VQA data augmentation method CC [cyc] and the NLP data augmentation method EDA [eda], and report the results on VQAv2 in Table 1. The CC model is trained to predict the same (correct) answer for a question and its rephrasings, which are generated by a VQG module in its training scheme. Its higher validation accuracy contrasts with its less competitive accuracy on the test-dev set, which suggests that CC is less capable of generalizing to unseen data. Other related studies (e.g., CausalVQA [cau], CTM [ctm]) explore VQA data augmentation as a complement to constructing a new VQA dataset, and they evaluate their data augmentation methods on the new dataset instead of VQAv2, so it is hard to compare our method with them. EDA is a text editing technique that boosts model performance on the text classification task. We use it to generate three augmented questions per original question, changing a fixed percentage of the words in each question. However, the results (see Table 1) show that EDA degrades performance on clean data, causing a 0.59% accuracy drop. This indicates that text editing techniques are not suitable for generating questions, since many questions are too short to allow inserting, deleting or swapping words. Moreover, text editing may sometimes change the original semantic meaning of the question, which leads to noisy or even incorrect data.

Since our augmented data might be regarded as injecting noise into the original data, we also compare against injecting random noise with a standard deviation of 0.3 (the same as our $\epsilon$ in the reported results) into the visual data. Random noise can also be regarded as a naive attacker that causes a 0.9% absolute accuracy drop on the vanilla model. However, jointly training with clean and noisy data does not boost performance on clean data, as reported in Table 1. This suggests that our generated data are drawn from a proper distribution that lets the model take full advantage of the regularization power of adversarial examples.

| Method | Val | Test-dev Overall | Test-dev Yes/no | Test-dev Number | Test-dev Others | Test-std |
|---|---|---|---|---|---|---|
| BUTD [butd] | 63.32 | 65.23 | 81.82 | 44.21 | 56.05 | 65.67 |
| +Noise | 63.28 | 64.80 | 81.03 | 43.96 | 55.70 | - |
| +EDA-3 [eda] | 62.73 | - | - | - | - | - |
| +CC [cyc] | 65.53 | 67.55 | - | - | - | - |
| +Ours Aug-Q | 65.05 | 67.58 | 83.85 | 47.34 | 58.31 | - |
| +Ours Aug-V | 64.69 | 67.45 | 83.55 | 46.96 | 58.37 | - |

Table 1: Performance and ablation studies on VQAv2.0. All models listed here are single models, trained on the training set to report Val scores and on the training and validation sets to report Test-dev and Test-std scores. The first row is the vanilla trained baseline model. Rows beginning with + denote a data augmentation method added to the first row. EDA-3 means that we generate three augmented questions per original question using EDA [eda]. The CC method is implemented based on a stronger BUTD (see [cyc]) and obtains a relatively small improvement (0.48%) on the validation score; even so, its test-dev score is surpassed by our method.

4.3 Analysis

4.3.1 Training Set Size Impact.

Furthermore, we conduct experiments using a fraction of the available data in the training set. As overfitting tends to be more severe when training on smaller datasets, we show that our method brings more significant improvements for smaller training sets. We run both vanilla training and our method for the following training set fractions (%): 20, 40, 60 and 80. Performances are shown in Table 2. The best accuracy without augmentation, 63.32%, is achieved using 100% of the training data. Our method surpasses it with 80% of the training data, achieving 64.27%.

| Training set size | BUTD | +Ours |
|---|---|---|
| 80% | 62.77 | 64.27 (+1.50) |
| 60% | 61.55 | 63.11 (+1.56) |
| 40% | 59.47 | 61.35 (+1.88) |
| 20% | 55.45 | 57.39 (+1.94) |

Table 2: Validation accuracy (%) of BUTD with and without our framework on different training set sizes.
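The gains in Table 2 can be recomputed from the two accuracy columns. The short sketch below does this arithmetic and checks that the gain grows as the training set shrinks; the dictionary layout is just a convenient encoding of the table.

```python
# Validation accuracy (%) from Table 2: fraction of training data -> (BUTD, +Ours).
table2 = {
    80: (62.77, 64.27),
    60: (61.55, 63.11),
    40: (59.47, 61.35),
    20: (55.45, 57.39),
}

# Absolute gain of our method over vanilla training for each fraction.
gains = {frac: round(ours - base, 2) for frac, (base, ours) in table2.items()}

# Gains ordered from the largest to the smallest training set.
ordered = [gains[frac] for frac in sorted(gains, reverse=True)]
print(gains)
print(ordered == sorted(ordered))  # True if smaller sets always gain more
```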
| Adversarial training period | Val accuracy (%) |
|---|---|
| (5,25) | 63.93 |
| (10,25) | 64.08 |
| (15,20) | 64.18 |

Table 3: Validation accuracy (%) of our method using different adversarial training periods.
Figure 3: Ablation on visual attacker strength and type. The top row is the accuracy of the vanilla model on adversarial examples generated by FGSM, IFGSM, and PGD, respectively. The bottom row is the standard accuracy of our model that adversarially trained with the corresponding attacker. The number of iterations is fixed to 2.

4.3.2 Effect of Augmenting Time.

We empirically find that the time at which the adversarial examples are injected into training has an effect on accuracy. We demonstrate this via ablation studies in Table 3, where we try several adversarial training periods. They respectively evaluate the effect of delaying the injection of additional training data until after different epochs, and show the advantage gained from fine-tuning with clean data in the last few epochs. Results show that the optimal adversarial training period surpasses the baseline model and achieves 65.16% accuracy. One explanation is that adversarial examples have a different underlying distribution from normal examples, so if boosting model performance on clean examples is our main goal, the model's ability to fit clean examples needs to be established at the beginning and recovered at the end of the training process. It is inappropriate to inject the perturbed examples at an early stage, when the model has not yet warmed up. We fix the adversarial training period to this optimal setting and reuse the same partially trained (for 10 epochs) model as a starting point for the other ablation experiments.

4.4 Ablation Studies

4.4.1 Augmentation Decomposed.

Results from ablation studies testing the contributions of our method’s components are given in Table 1. The augmentations of visual and textual (question) data each make an individual contribution to improving accuracy. We observe that visual adversarial examples are critical to our performance: removing them causes a 0.47% accuracy drop on the validation set (see the Ours Aug-V row, which omits the visual augmentation). The question augmentation also leads to comparable improvements; see the model of Ours Aug-V.

4.4.2 Ablation on Adversarial Attackers.

We now ablate the effects of the attacker strength and type used in our method on network performance. To evaluate the regularization power of adversarial examples, we first compute the accuracy of the vanilla model after being attacked by the gradient-based attacker with a variety of parameter settings. Since the visual input ranges from 0 to 83, we try several perturbation sizes $\epsilon$, approximately following the ratio of $\epsilon$ to pixel value in [advprop], and several step sizes $\alpha$.

The top row of Fig. 3 reflects how the attacker strength changes with different parameter settings (a decline in accuracy implies an increase in strength), while the bottom row reflects how the model performance changes with attacker strength. As expected, the accuracy of the vanilla model declines as the perturbation grows. We observe that the accuracy on clean data decreases as the attacker strength increases. Since weaker attackers push the distribution of adversarial examples less far away from the distribution of clean data, the model is better at bridging the domain difference. However, an extremely weak attacker (e.g., random noise) yields negligible improvement in standard accuracy, since the generated data are drawn from a distribution similar to that of the original data.

We then study the effects of applying different gradient-based attackers in our method on model performance. Specifically, we try two other attackers, FGSM and PGD [pgd]. FGSM is the one-step version of IFGSM, and PGD is a universal “first-order adversary” that adds a random noise initialization step to IFGSM. Their performances are reported in Fig. 3. We observe that all attackers substantially improve model performance over the vanilla training baseline. The two iterative attackers obtain almost the same results, while FGSM is less competitive. This result suggests that our VQA data augmentation is not tied to a specific attacker.

4.5 Model Robustness

Improved robustness against adversarial attacks is an additional benefit of our adversarial training scheme. As shown in Table 4, we significantly increase accuracy on visual adversarial examples by 13.74% when using the training attacker at test time. Following [carlini2019evaluating], we also test a stronger PGD attacker, and our model beats the baseline by 4.59%. On the textual side, the accuracy of the vanilla model on the paraphrased questions $q_{adv}$ is 54.03% and the flip rate (the rate of changing the original answers; lower is better) is 36.72%, while our adversarially trained model obtains an accuracy of 63.18% and a flip rate of 18.8% on $q_{adv}$. When attacking both the visual and textual sides at test time, our model beats the vanilla model by 21.55%. These results indicate that our model is capable of defending against common attackers on both the visual and textual sides.

4.6 Human Evaluation of Semantic Consistency

To show the semantic consistency of our generated paraphrases with the original questions, we conduct a human study. We sampled 100 questions and their paraphrases with the top-1 semantic similarity score defined in Eq. 9, and asked 4 human evaluators to assign labels (positive for similar or negative for not similar). We averaged the opinions of the different evaluators for each query to get a positive score of 84%. This demonstrates that the majority of paraphrases are similar to the originals.

| Model | Clean | IFGSM | Parap. | IFGSM + Parap. | PGD |
|---|---|---|---|---|---|
| BUTD [butd] | 63.32 | 30.83 | 54.03 | 22.09 | 18.05 |
| +Ours | 65.16 | 44.57 | 63.18 | 43.64 | 22.64 |

Table 4: Validation accuracy (%) of the vanilla and adversarially trained (using IFGSM) networks on clean and adversarial examples with various test-time attackers. Parap. represents the generated paraphrases in our method. Note that IFGSM and PGD still act as white-box attackers when testing.

5 Conclusions

In this paper, we propose to generate visual and textual adversarial examples as augmented data to train a robust VQA model with our adversarial training scheme. The visual adversaries are generated by a gradient-based adversarial attacker, and the textual adversaries are paraphrases. Both keep the modification imperceptible and maintain the semantics. Experimental results show that our method not only outperforms prior VQA data augmentation methods, but also improves model robustness against adversarial attacks. To the best of our knowledge, this is the first work that uses both semantically equivalent visual and textual adversaries as data augmentation for the visual question answering problem.

5.0.1 Acknowledgements.

This work was supported by National Key Research and Development Program of China (2016YFB1001003), NSFC (U19B2035, 61527804, 60906119), STCSM (18DZ1112300). C. Ma was sponsored by Shanghai Pujiang Program.