Visual Question Answering (VQA) (Antol et al., 2015)
is an important task at the intersection of computer vision (CV) and natural language processing (NLP). Over the last decade, deep neural networks (DNNs) have made promising progress on VQA. However, recent studies (Agrawal et al., 2018, 2016; Manjunatha et al., 2019) have found that existing VQA models are prone to language biases and directly output answers based on superficial correlations between answers and questions. As a result, they suffer a sharp performance drop when the answer distributions of the training set and the test set differ (out-of-distribution, OOD).
Large-scale vision-language pre-trained models (VLPs) achieve great improvements on the in-distribution standard VQA benchmark. Nevertheless, they also fail to address the language-bias problem. For example, the performance of the pre-trained LXMERT model (Tan and Bansal, 2019) drops significantly from 71.27% on the in-distribution VQA v2 (Goyal et al., 2017) to 48.01% on the OOD benchmark VQA-CP v2 (Agrawal et al., 2018). At the same time, the improvement brought by VLPs is partly due to their large model size, which increases the computational cost of deploying VQA models. To facilitate the application of VLPs to VQA tasks, the two problems should be addressed simultaneously. However, existing research mostly focuses on them separately.
To address the language-bias problem, a large number of debiasing methods have been proposed for conventional VQA models, such as UpDn (Anderson et al., 2018) and S-MRL (Cadene et al., 2019). These methods can be divided into two categories: Data-augmentation methods (Chen et al., 2020a; Gokhale et al., 2020) generate additional training samples that are inconsistent with the current dataset biases, alleviating biases in the training set. Non-data-augmentation methods (Clark et al., 2019; Mahabadi and Henderson, 2019; Liang et al., 2021b) regularize the training loss according to the bias degree of training samples.
In terms of the increased computational cost, a line of recent efforts has been made to compress pre-trained language models (PLMs) in the NLP field (Li et al., 2020b; Michel et al., 2019; Liu et al., 2021; Chen et al., 2020b; Prasanna et al., 2020; Liu et al., 2022; Liang et al., 2021a) and VLPs for visual-linguistic tasks (Fang et al., 2021; Gan et al., 2022). These works show that large-scale PLMs and VLPs can be compressed into lightweight models without performance degradation.
In this paper, we jointly study the compression and debiasing problems of VLPs for the VQA task. To this end, we combine existing debiasing and pruning methods to establish a training and compression pipeline, and conduct extensive experiments with the pre-trained LXMERT model, a representative VLP, on the VQA task under the OOD setting. We show that there exist sparse LXMERT subnetworks that are more robust than the full model, which suggests that the goals of OOD robustness and computational efficiency can be achieved simultaneously. We also present a comprehensive study on the design of the training and compression pipeline, as well as the assignment of sparsity to different modules of LXMERT, in order to identify subnetworks with better OOD generalization. Our empirical results highlight the importance of 1) introducing the debiasing objective throughout the training and compression process and 2) assigning modality-specific sparsity to different modules of LXMERT.
We summarize our main contributions as follows:
We present the first (to our knowledge) systematic study on sparsity and OOD robustness for VLPs, which can facilitate the application of VLP-based VQA systems.
We present a comprehensive study on the training and compression pipeline, as well as the assignment of sparsity to different modules of LXMERT, which can serve as a useful guideline to the design of VLP subnetwork searching methods in the future.
The subnetworks obtained in this paper retain the full pre-trained model’s OOD performance with much fewer parameters, and achieve the best performance among the SoTA debiasing methods with similar model parameter counts (see Figure 1).
Overcoming Language-bias Problem in VQA
Most VQA systems heavily rely on the question to predict answers, regardless of the content of the given image. They are not robust and typically perform poorly in the OOD setting, where the language biases learned from the training set are invalid on the test set. To promote the development of models that overcome this problem, VQA-CP v2 was proposed and has become the standard OOD benchmark for VQA. The widely used debiasing methods can be roughly grouped into non-data-augmentation and data-augmentation methods. The former applies a biased model (trained with the question only) to regularize the training of the main model and thus prevents it from learning from the question alone. The latter generates samples to balance the training data and directly erase the biases in the training set. However, the augmented data also increase the training cost, and overcoming the language-bias problem while leaving the original dataset biases unchanged remains a major challenge (Liang et al., 2021b; Niu et al., 2021). Thus, we focus only on non-data-augmentation methods, represented by the classic LMH (Clark et al., 2019).
Vision-Language Pre-trained Models
Recently, VLPs (Wang et al., 2021b; Dou et al., 2022; Li et al., 2021, 2020a; Zhang et al., 2021; Wang et al., 2021a) based on the Transformer backbone (Vaswani et al., 2017) have achieved encouraging success. Notably, OFA (Wang et al., 2022) and Florence (Yuan et al., 2021) establish the SoTA on the in-distribution VQA v2. To learn better cross-modality representations and vision-language alignment, they are trained with large-scale pre-training data and generally have huge model capacity. Among them, LXMERT is the most widely used VLP backbone in the VQA field (e.g., in some data-augmentation debiasing methods (Si et al., 2021; Gokhale et al., 2020; Wen et al., 2021) and the competitive MuKEA model (Ding et al., 2022) for OK-VQA (Marino et al., 2019)). In this paper, we therefore choose LXMERT as the backbone model and extend LMH to it for an in-depth study of compressing and debiasing.
Model Compression and Robustness
Model compression techniques for Transformer-based pre-trained models are well developed (mainly around BERT), including pruning (Gordon et al., 2020; Michel et al., 2019; Gale et al., 2019), knowledge distillation (Sanh et al., 2019; Sun et al., 2019; Jiao et al., 2020), parameter sharing (Lan et al., 2020) and quantization (Zafrir et al., 2019; Zhang et al., 2020). Inspired by the lottery ticket hypothesis (Frankle and Carbin, 2019), many recent studies show that BERT can be pruned to a sparse subnetwork after (Gale et al., 2019) and before fine-tuning (Chen et al., 2020b; Prasanna et al., 2020; Liang et al., 2021a; Liu et al., 2022), without degrading performance. On this basis, we extend the pruning paradigm to the fine-tuned LXMERT for the OOD scenario in VQA, incorporating debiasing methods into fine-tuning and pruning.
In the NLP and CV fields, some recent efforts have also been made to study model compression and robustness to adversarial attacks (Gui et al., 2019; Ye et al., 2019; Sehwag et al., 2020; Fu et al., 2021; Xu et al., 2021) and spurious correlations (Xu et al., 2021; Du et al., 2021), which are more common than worst-case adversarial attacks. The language-bias problem is a typical symptom of spurious correlations and poses a challenge to the application of VQA models. We are the first to thoroughly investigate sparsity and OOD robustness for VLPs in VQA.
LXMERT Architecture and Subnetworks
LXMERT is composed of an embedding layer, a visual fc layer, a pooler layer, a stack of Transformer layers and a VQA-specific classifier. The embedding layer and the visual fc layer map the language-modality input (token sequences obtained by the WordPiece tokenizer) and the vision-modality input (36 object features obtained by Faster R-CNN (Ren et al., 2015)) into the same-dimension space. The pooler layer connects the top Transformer layer and the classifier. The Transformer layers involve three encoders: the language encoder, the object-relationship encoder and the cross-modality encoder, and are composed of attention modules and feed-forward networks (FFN). See App. A for more details of the Transformer layers. The attention modules have four kinds of weight matrices, i.e., the query, key and value matrices W_Q, W_K, W_V, and the output matrix W_O. The FFN contains two linear layers W_in and W_out.
We adopt unstructured pruning to obtain a compressed version (i.e., a subnetwork) of the original VLPs. Specifically, given a VLP f(x; θ) with parameters θ, we apply a binary pruning mask m ∈ {0, 1}^|θ| to the model parameters, which gives rise to f(x; m ⊙ θ), where ⊙ is the element-wise product. For LXMERT, we focus on the embedding layer, visual fc layer, pooler layer and Transformer layers, whose parameters are pre-trained, while the classifier is excluded. The language encoder, object-relationship encoder and cross-modality encoder have 9, 5 and 5 Transformer layers, respectively. The parameters to be pruned are θ = {θ_emb, θ_fc, θ_pool, θ_trm}, where θ_emb, θ_fc and θ_pool are the weights of the embedding layer, visual fc layer and pooler layer, and θ_trm are the parameters of the Transformer layers, which cover the weight matrices of the language self-attention, visual self-attention and cross-attention modules, as well as the FFN weights.
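As a toy illustration of this formulation (our own sketch, not the authors’ code), unstructured pruning keeps the dense weight shapes and simply zeroes out the masked entries:

```python
import numpy as np

# Toy sketch: unstructured pruning applies a binary mask m to
# parameters theta via the element-wise product m ⊙ theta.
rng = np.random.default_rng(0)
theta = rng.normal(size=(4, 4))                     # a toy weight matrix
m = (rng.random((4, 4)) > 0.5).astype(theta.dtype)  # binary pruning mask

pruned_theta = m * theta       # masked entries become exactly zero
sparsity = 1.0 - m.mean()      # fraction of parameters removed
```

Note that the weight tensor keeps its dense shape; sparsity here refers to zeroed entries, not a smaller matrix.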
There are two major kinds of approaches to prune the model. The first is to compute an importance score for every single parameter based on certain heuristics, and then prune out the least important ones (LeCun et al., 1989; Han et al., 2015; Molchanov et al., 2017). The second is to directly optimize the subnetwork structure by training the pruning mask (Louizos et al., 2018; Sanh et al., 2020; Sehwag et al., 2020; Ramanujan et al., 2020). In this work, we consider two typical pruning methods, i.e., magnitude-based pruning and mask training, one from each of the two categories.
Magnitude-based pruning approximates the importance of model parameters based on their absolute values and eliminates the less important ones. In this work, we adopt the basic version of magnitude-based pruning, namely one-shot magnitude pruning (OMP). OMP can optionally be combined with further fine-tuning of the pruned subnetwork to recover the performance drop.
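Per weight matrix, OMP reduces to a single magnitude threshold. A minimal sketch (the function name `omp_mask` is ours, for illustration only):

```python
import numpy as np

# Sketch of one-shot magnitude pruning (OMP) for a single weight matrix:
# keep the (1 - sparsity) fraction of weights with the largest magnitudes.
def omp_mask(weight: np.ndarray, sparsity: float) -> np.ndarray:
    k = int(round(sparsity * weight.size))      # number of weights to prune
    if k == 0:
        return np.ones_like(weight)
    cut = np.sort(np.abs(weight).ravel())[k - 1]  # k-th smallest magnitude
    return (np.abs(weight) > cut).astype(weight.dtype)

w = np.array([[0.1, -0.9], [0.5, -0.2]])
m = omp_mask(w, sparsity=0.5)   # prunes the entries with magnitudes 0.1 and 0.2
```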
Mask training directly optimizes the binary pruning mask m towards the given objectives. Specifically, each weight matrix W is associated with two mask matrices, namely a binary mask m and a real-valued mask m̂. In the forward propagation, m is computed from m̂ through binarization: m_ij = 1 if m̂_ij ≥ φ and m_ij = 0 otherwise, where φ is the threshold. Then, the original weight matrix W is replaced with the pruned one m ⊙ W. In backward propagation, we follow Mallya et al. (2018); Zhao et al. (2020); Radiya-Dixit and Wang (2020); Liu et al. (2022) and use the straight-through estimator (Bengio et al., 2013) to estimate the gradients of m̂ using the gradients of m, i.e., ∂L/∂m̂ ≈ ∂L/∂m, and then update the real-valued mask as m̂ ← m̂ − η(∂L/∂m), where η is the learning rate.
We initialize m̂ according to the magnitudes of the pre-trained weights of LXMERT. This strategy is shown to be more effective than random initialization for pre-trained language models (Radiya-Dixit and Wang, 2020; Liu et al., 2022) and we also validate this in our experiments with LXMERT (see App. B). Specifically, each entry of the real-valued mask is initialized in proportion to the magnitude of the corresponding pre-trained weight, scaled by a hyper-parameter α. At initialization, we set the threshold φ to a small constant (any other value of the same order of magnitude should also be fine). To ensure that the subnetwork satisfies the given sparsity, φ is re-computed every t_m training steps.
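One mask-training step can be sketched as follows (a toy illustration with names of our choosing): the forward pass binarizes the real-valued mask, and the backward pass reuses the binary mask’s gradient for the real-valued mask via the straight-through estimator.

```python
import numpy as np

# Toy sketch of one mask-training step on a flattened mask.
def binarize(m_real, phi):
    return (m_real >= phi).astype(m_real.dtype)

def recompute_threshold(m_real, sparsity):
    # Choose phi so that exactly `sparsity` of the entries fall below it.
    k = int(round(sparsity * m_real.size))
    return np.sort(m_real.ravel())[k]

m_real = np.array([0.3, 0.8, 0.1, 0.9])
phi = recompute_threshold(m_real, sparsity=0.5)  # re-computed periodically
m_bin = binarize(m_real, phi)                    # forward: prune with m_bin ⊙ W

grad_m_bin = np.array([0.5, -0.2, 1.0, 0.0])     # gradient w.r.t. m_bin
m_real = m_real - 0.1 * grad_m_bin               # STE: reuse it to update m_real
```

Recomputing the threshold from the sorted mask values is what keeps the subnetwork at the target sparsity as m̂ drifts during training.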
In this paper, we mainly experiment with Learned-Mixin +H (LMH) as the debiasing method. LMH is a representative debiasing method widely studied for the OOD scenario of VQA Si et al. (2021); Chen et al. (2020a); Liang et al. (2020) and natural language understanding McCoy et al. (2019); Zhang et al. (2019); Schuster et al. (2019); Jia and Liang (2017) tasks. For comparison, we also display the binary cross-entropy loss (BCE) here.
BCE computes the binary cross-entropy between the predicted distribution p_m (from the main model) and the soft target score t of each ground-truth answer, which can be formalized as follows:

L_bce = −[t · log σ(p_m) + (1 − t) · log(1 − σ(p_m))]

where σ denotes the sigmoid function.
LMH takes a step further based on Product of Experts (PoE) (Hinton, 2002), which simply combines the predicted distributions of the main model and the biased model as follows:

p_poe = softmax(log p_m + log p_b)

where p_b is the predicted distribution of the biased model, which represents the bias degree of the sample.
To selectively adjust the main model’s behavior, LMH adds a learned function g to explicitly determine how much to trust the learned biases:

p_lmh = softmax(log p_m + g(h) · log p_b)

where h is the cross-modality representation from the last hidden layer of LXMERT and g is trainable. To prevent p_b from simply being ignored, LMH also adds an entropy penalty item R to the training loss:

R = z · H(softmax(g(h) · log p_b))

where z is the hyperparameter and H(·) computes the entropy. The training loss of LMH is:

L_lmh = L_bce(p_lmh) + R
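The ensembling above can be sketched in a few lines of numpy (our illustration only; g is shown as a fixed scalar, whereas in LMH it is a learned function of h):

```python
import numpy as np

# Numpy sketch of the PoE and Learned-Mixin ensembles.
def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def entropy(p, eps=1e-12):
    return float(-(p * np.log(p + eps)).sum())

p_m = np.array([0.2, 0.5, 0.3])   # main model's answer distribution
p_b = np.array([0.7, 0.2, 0.1])   # biased (question-only) model's distribution

p_poe = softmax(np.log(p_m) + np.log(p_b))       # Product of Experts

g = 0.5                                           # stand-in for learned g(h)
p_lmh = softmax(np.log(p_m) + g * np.log(p_b))   # Learned-Mixin ensemble

z = 0.36                                          # entropy-penalty weight
R = z * entropy(softmax(g * np.log(p_b)))         # discourages ignoring p_b
```

With g between 0 and 1, the biased distribution is down-weighted relative to plain PoE, while the entropy penalty keeps g from collapsing to zero.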
Given the pre-trained LXMERT f(x; θ), our goal is to find a subnetwork f(x; m ⊙ θ) that satisfies a target sparsity level s and maximizes the OOD performance:

max_{m, θ}  E_OOD(f(x; m ⊙ θ))   s.t.  ‖m‖_0 / |θ| = 1 − s

where E_OOD denotes the OOD evaluation, ‖·‖_0 is the L0 norm and |θ| is the total number of parameters in θ. This goal is achieved by searching for the optimal m and θ through model training and compression.
The above constraint only specifies the overall sparsity. In this work, we also explore a finer-grained control over sparsity, which allocates different sparsity to different modules of LXMERT while keeping the overall sparsity satisfied. Concretely, we consider three modules from different modalities, namely the language module (which consists of the language encoder and the embedding layer), the visual module (which consists of the visual encoder and the visual fc layer) and the cross-modality module (which only contains the cross-modality encoder). The constraint in the optimization problem is then re-written as (for simplicity, the 0.5M parameters of the pooler layer are not included, and we directly set its sparsity to the target sparsity s):

(‖m_L‖_0 + ‖m_V‖_0 + ‖m_X‖_0) / (|θ_L| + |θ_V| + |θ_X|) = 1 − s

where θ_L, θ_V and θ_X are the model parameters of the language module, visual module and cross-modality module, respectively; m_L, m_V and m_X are the binary masks for the three modules; and s_L, s_V and s_X are the target sparsity levels for the three modules, with s_L = 1 − ‖m_L‖_0 / |θ_L| and likewise for the other two.
If not otherwise specified, we set the sparsity of every weight matrix to the target sparsity. For instance, if s = 70% and there is no modality-specific constraint, then all the weight matrices are at 70% sparsity (uniform sparsity). If s_L = 50%, then all the weight matrices in the language module are at 50% sparsity, while s_V and s_X could be different (modality-specific sparsity).
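The bookkeeping behind a modality-specific configuration can be checked with a small helper (our own sketch, not from the paper; the module names and the illustrative 0.75/0.5/0.75 configuration are assumptions):

```python
# Verify that a modality-specific sparsity configuration still (approximately)
# meets the overall target sparsity constraint.
def overall_sparsity(param_counts, sparsities):
    """param_counts: module -> #params; sparsities: module -> sparsity."""
    total = sum(param_counts.values())
    pruned = sum(param_counts[k] * sparsities[k] for k in param_counts)
    return pruned / total

# Approximate prunable parameter counts of the three LXMERT modules
# (in millions, pooler layer excluded).
params = {"language": 83.1, "visual": 35.3, "cross": 78.8}

uniform = overall_sparsity(params, dict.fromkeys(params, 0.7))

# Illustrative modality-specific config: prune the visual module less,
# the language and cross-modality modules a bit more.
specific = overall_sparsity(
    params, {"language": 0.75, "visual": 0.5, "cross": 0.75}
)
```

Because the visual module is the smallest of the three, lowering its sparsity only requires a modest increase elsewhere to keep the overall sparsity near the target.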
Results of subnetworks pruned from the BCE fine-tuned LXMERT (upper) and from the LMH fine-tuned LXMERT (lower). “lxmert(bce/lmh)” denotes full model fine-tuning in Stage1, “mask train(bce/lmh)” and “OMP” mean pruning in Stage2. “bce/lmh ft” means further fine-tuning in Stage3. “Gap” denotes the improvement achieved by mask train(bce/lmh) over full lxmert(bce/lmh). These abbreviations are used throughout this paper. The shadowed areas denote standard deviations.
Training and Compression Pipeline
Before delving into the details, we first define two notations:
Train(f; L): training model f using loss L.
Compress(f; P, L): compressing model f using pruning method P and loss L (if applicable). The output is a binary pruning mask m.
A typical training and compression pipeline consists of three stages:
Stage1: Full Model Fine-tuning.
The pre-trained LXMERT f(x; θ) is fine-tuned on the VQA dataset using loss L1, which produces f(x; θ1).
Stage2: Model Compression.
The fine-tuned LXMERT is compressed and we obtain the subnetwork f(x; m ⊙ θ1), where the binary mask m is produced by the pruning method using loss L2.
Stage3: Further Fine-tuning (Optional).
The subnetwork f(x; m ⊙ θ1) is further fine-tuned using loss L3, which results in f(x; m ⊙ θ3).
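The three stages can be strung together as a toy script (the stand-ins are ours: `train` merely perturbs parameters, and Stage2 uses magnitude pruning in place of mask training; the real stages fine-tune LXMERT):

```python
import numpy as np

# Toy end-to-end run of the three-stage pipeline.
def train(theta, loss_name):
    return theta + 0.01                       # placeholder for fine-tuning

def compress(theta, sparsity):                # Stage2 stand-in (magnitude-based)
    k = int(round(sparsity * theta.size))
    cut = np.sort(np.abs(theta).ravel())[k - 1] if k else -np.inf
    return (np.abs(theta) > cut).astype(theta.dtype)

theta = np.array([0.05, -0.4, 0.3, -0.02])
theta1 = train(theta, "lmh")                  # Stage1: full model fine-tuning
m = compress(theta1, sparsity=0.5)            # Stage2: model compression
subnetwork = m * theta1                       # subnetwork f(x; m ⊙ θ1)
theta3 = train(subnetwork, "lmh") * m         # Stage3 (optional), mask fixed
```

Note that Stage3 updates only the surviving weights: the mask found in Stage2 is kept fixed, so the sparsity level is preserved.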
In this section, we mainly investigate three questions: (1) How does compression affect the OOD generalization ability of LXMERT? (2) How to design the training and compression pipeline to achieve a good sparsity-performance trade-off? (3) How should we assign sparsity to different modality-specific modules?
Dataset, Model and Implementation Details
We conduct experiments on the OOD benchmark VQA-CP v2 (Agrawal et al., 2018), which evaluates the robustness of VQA systems, with the accuracy-based evaluation metric (Antol et al., 2015). For the VLP, we adopt the LXMERT-base-uncased model (Tan and Bansal, 2019) released by HuggingFace (Wolf et al., 2020b). It has about 202M parameters, of which 197.7M are involved in the pruning process (4.5M parameters belong to the classifier). The three modules from different modalities, namely the language module, the visual module and the cross-modality module, contain 83.1M, 35.3M and 78.8M parameters, respectively. We train the models for 20 epochs with a batch size of 128 on two Tesla V100-32G GPUs, or 256 on an A100-80GB GPU. The AdamW (Loshchilov and Hutter, 2017) optimizer is adopted with a learning rate of 5e-5. Our code is based on the HuggingFace Transformers library (Wolf et al., 2020a). All results are averaged over 4 random seeds.
The Effect of Compression on OOD Generalization
Subnetworks from BCE Fine-tuned LXMERT
The BCE fine-tuned LXMERT encodes language bias from the training set. We compress this model using OMP and mask training, and introduce either the BCE loss or the LMH loss in the pruning process (for mask training) or the further fine-tuning process (for OMP).
The results are shown in the upper row of Figure 2, from which we derive several observations: 1) With plain pruning, which does not involve any debiasing method, the subnetworks of “mask train(bce)” and “OMP + bce ft” can improve over the full LXMERT by 1.35% ~ 2.79%, even with a large proportion of the model parameters compressed (up to 70% sparsity), as shown by the “Acc Gap”. This implies that LXMERT is overparameterized and pruning may remove some parameters related to the bias features, which alleviates over-fitting to the training set. 2) Compared with “mask train(bce)” and “OMP + bce ft”, “mask train(lmh)” and “OMP + lmh ft” achieve a further performance boost, exceeding the full LXMERT by a large margin (11.05% ~ 14.02%). In particular, since mask training does not change the values of the parameters, the results of “mask train(lmh)” indicate that the biased “full lxmert(bce)” already contains sparse and robust subnetworks that effectively alleviate the language-bias problem. Such sparse and robust subnetworks can be identified across a wide range of sparsity levels (from 10% to 90%). 3) “mask train” outperforms “OMP” in general, which suggests that directly optimizing the subnetwork structure is more effective than debiasing a compressed subnetwork by further fine-tuning.
For the three types of questions (“Y/N”, “Num” and “Other”), as shown in the right three plots of Figure 2, we find that: 1) The performance on “Num” questions is sensitive to the varying sparsity levels, while that on “Y/N” questions is relatively stable in general, except at 90% sparsity. Notably, as sparsity increases, the performance on “Num” questions of “mask train(lmh)” and “OMP + lmh ft” improves greatly, which is counterintuitive. This shows that language biases for the “Num” questions reside in a large proportion of the parameters of the biased LXMERT. 2) For the “Other” questions, debiasing methods bring little gain to the performance of subnetworks. For example, the performance of “mask train(lmh)” is similar to that of “mask train(bce)”. This indicates that the language biases for “Other” questions are minor in the training set; “Other” questions therefore require more reasoning than debiasing. 3) There is a sharp decline in all subnetworks’ performance on “Other” questions from 70% to 90% sparsity. We conjecture that this is because reducing the model’s capacity too drastically hurts the reasoning ability necessary to answer the “Other” questions correctly.
Subnetworks from LMH Fine-tuned LXMERT
We have shown that compression has a positive effect on the biased BCE fine-tuned LXMERT. Now, let us investigate how compression affects the OOD robustness of the relatively unbiased LMH fine-tuned LXMERT.
From the results in the lower row of Figure 2, we find that: 1) For the full LXMERT, the OOD performance is clearly improved by the LMH debiasing method, as shown by the blue and red dashed lines. 2) Unlike subnetworks from lxmert(bce), subnetworks from lxmert(lmh) do not exhibit significant improvement over the full model. However, the “mask train(lmh)” and “OMP + lmh ft” subnetworks can preserve lxmert(lmh)’s performance at up to 50% sparsity, and there is no significant performance decline at 70% sparsity. Such subnetworks can serve as an alternative to the LMH fine-tuned full LXMERT to promote efficiency. 3) “mask train(bce)” and “OMP + bce ft” (the green and yellow lines) clearly underperform their lmh counterparts, which demonstrates that using the BCE loss in the pruning stage or the further fine-tuning stage has a negative impact on the OOD performance when the full model is lxmert(lmh). This suggests that it is important to use the debiasing method in pruning and subnetwork fine-tuning even when the full model is already trained with the debiasing method. 4) For the “Num” questions, when compressing the LMH fine-tuned LXMERT (the grey and maroon lines), the performance of subnetworks no longer rises with growing sparsity. This demonstrates that language biases for the “Num” questions reside in a much smaller proportion of the parameters of the debiased LXMERT than of the biased LXMERT. 5) For “Other” questions, “lxmert(bce) + mask train(lmh)” is consistently superior to “lxmert(lmh) + mask train(lmh)”, which demonstrates that further debiasing the already debiased full LXMERT in the pruning process sacrifices reasoning ability.
How to Design the Training and Compression Pipeline for better OOD performance?
In this section, we study the proper design of the training and compression pipeline, under the basic framework described in the methodological section “Training and Compression Pipeline”. Here we focus on the mask training compression method, as it has been shown to generally outperform OMP with further fine-tuning. Our main conclusions can be described from two perspectives:
First, it is recommended to introduce the LMH debiasing loss across Stage1, Stage2 and (if applicable) Stage3. The reason is three-fold: 1) As shown in Figure 4, the subnetworks at 10%, 30% and 70% sparsity levels perform better when starting from lxmert(lmh) than from lxmert(bce). At 90% sparsity, “lxmert(lmh) + mask train(lmh)” underperforms “lxmert(bce) + mask train(lmh)” (see App. C for reasons), but the accuracy gap is small. Therefore, adopting the LMH loss in Stage1 is a better choice than the BCE loss, especially when the subnetworks are not at extremely high sparsity. 2) As discussed in the previous section, introducing the LMH loss in the mask training process (Stage2) substantially outperforms the BCE loss for both lxmert(lmh) and lxmert(bce). 3) When both Stage1 and Stage2 adopt the BCE loss, further fine-tuning the subnetworks with the LMH loss in Stage3 can significantly boost the performance, as shown by the results of “lxmert(bce) + mask train(bce)” w/o ft and w/ lmh ft in Figure 4.
Second, it is unnecessary to further fine-tune the subnetworks if Stage2 and Stage3 have the same training objective. Comparing the blue and red (or cyan) bars in Figure 4, we can see that further fine-tuning with the same training objective generally degrades the performance of “lxmert(lmh) + mask train(lmh)”, “lxmert(bce) + mask train(lmh)” and “lxmert(bce) + mask train(bce)”.
How to Assign Sparsity to Different Modality-specific Modules?
Pruning Each Single Modality-specific Module
Since LXMERT uses different modules to encode the multi-modal data, it is intuitive to hypothesize that different modules of LXMERT may capture the language bias to different extents. To validate this hypothesis, we compress the language, visual and cross-modality modules separately. As shown in Figure 3, compressing different modality-specific modules indeed exhibits different effects.
When the full model is lxmert(bce) (the orange and cyan lines), compressing the language or cross-modality module has a positive effect on the OOD performance, and the accuracy generally improves stably as sparsity increases from 10% to 90%. By contrast, compressing the visual module results in performance obviously inferior to compressing the other two modules, even though the number of remaining parameters is larger (note that the visual module has fewer parameters than the other two modules). These results suggest that, for the biased lxmert(bce), the language and cross-modality modules capture more training-set bias than the visual module, which supports the above hypothesis.
In terms of “lxmert(lmh) + mask train(lmh)” (the red line), although compression does not improve performance as it does for lxmert(bce), the results likewise demonstrate that the language and cross-modality modules are more compressible than the visual module.
Searching for Appropriate Modality-specific Sparsity
Motivated by the above findings, we search for the appropriate modality-specific sparsity by performing mask training with a variety of sparsity configurations (see App. D) for the three modules while keeping the overall sparsity the same.
As we can see in Figure 6, at 50% and 70% overall sparsity, the configuration that achieves the best result assigns slightly higher sparsity to the language and cross-modality modules and significantly lower sparsity to the visual module, as compared with the configuration that assigns sparsity uniformly. This phenomenon accords with the findings in Figure 3, indicating that compressing the three modules uniformly is suboptimal (at 50% ~ 70% sparsity) and that the language and cross-modality modules should be compressed to a larger extent than the visual module. At 90% sparsity, the sparsity configuration’s comfort zone lies in the proximity of the uniform point: further increasing the sparsity of the language and cross-modality modules results in performance decline or only minor improvements. This is because 90% sparsity already approaches the compression upper bound, even for the language and cross-modality modules, which are more compressible.
A more direct comparison between the uniform and modality-specific sparsity assignments is presented in Figure 6. We also introduce another baseline called “matrix-specific sparsity”, which ranks all the model parameters together, instead of ranking the parameters within each weight matrix. This also results in different sparsity levels for different weight matrices, but without explicit control over the modality-specific sparsity. We can see that modality-specific sparsity achieves the best performance across the three overall sparsity levels from 50% to 90%, demonstrating its superiority. Besides, the results also suggest that, although simply allowing different matrices to have different sparsity is more flexible than uniform sparsity, it is not conducive to the final performance.
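The contrast between per-matrix and global ranking can be made concrete with a small sketch (ours, for illustration): ranking all parameters globally lets per-matrix sparsities diverge arbitrarily.

```python
import numpy as np

# Sketch of the "matrix-specific sparsity" baseline: rank all parameters
# globally by magnitude, so individual matrices end up at different sparsities.
def global_masks(matrices, sparsity):
    flat = np.concatenate([np.abs(w).ravel() for w in matrices])
    k = int(round(sparsity * flat.size))
    cut = np.sort(flat)[k - 1] if k else -np.inf
    return [(np.abs(w) > cut).astype(w.dtype) for w in matrices]

w1 = np.array([0.9, 0.8, 0.7, 0.6])   # matrix with large-magnitude weights
w2 = np.array([0.1, 0.2, 0.3, 0.4])   # matrix with small-magnitude weights
m1, m2 = global_masks([w1, w2], sparsity=0.5)
# At 50% overall sparsity, w1 is kept entirely while w2 is pruned entirely.
```

Under per-matrix (uniform) pruning, both matrices would instead lose exactly half of their entries, regardless of their relative magnitudes.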
Comparison with Debiasing SoTAs
Method | Backbone | #Params | All | Y/N | Num | Other
S-MRL (Cadene et al., 2019) | - | 60M | 38.46 | 42.85 | 12.81 | 43.20
UpDn (Anderson et al., 2018) | - | 35M | 39.74 | 42.27 | 11.93 | 46.05
lxmert(bce) + mask train(bce) (Ours) | 10% lxmert | 24M | 45.42 | 46.55 | 19.59 | 51.91
lxmert(bce) + mask train(bce) (Ours) | 30% lxmert | 64M | 51.34 | 54.69 | 27.02 | 56.26
lxmert(bce) + mask train(bce) (Ours) | 50% lxmert | 103M | 51.43 | 54.38 | 25.10 | 57.10
LXMERT (Tan and Bansal, 2019) | full lxmert | 202M | 48.01 | 48.24 | 20.04 | 55.57
RUBi (Cadene et al., 2019) | S-MRL | 60M | 47.11 | 68.65 | 20.28 | 43.18
VGQE (Kv and Mittal, 2020) | S-MRL | 60M | 50.11 | 66.35 | 27.08 | 46.77
LPF (Liang et al., 2021b) | S-MRL | 60M | 53.38 | 88.06 | 25.00 | 42.99
CF-VQA (Niu et al., 2021) | S-MRL | 60M | 55.05 | 90.61 | 21.50 | 45.61
AdvReg (Ramakrishnan et al., 2018) | UpDn | 35M | 41.17 | 65.49 | 15.48 | 35.48
GRL (Grand and Belinkov, 2019) | UpDn | 35M | 42.33 | 59.74 | 14.78 | 40.76
RUBi (Cadene et al., 2019) | UpDn | 35M | 44.23 | 67.05 | 17.48 | 39.61
Loss-Rescaling (Guo et al., 2021) | UpDn | 35M | 47.09 | 68.42 | 21.71 | 42.88
VGQE (Kv and Mittal, 2020) | UpDn | 35M | 48.75 | - | - | -
DLR (Jing et al., 2020) | UpDn | 35M | 48.87 | 70.99 | 18.72 | 45.57
LMH (Clark et al., 2019) | UpDn | 35M | 52.01 | 72.58 | 31.12 | 46.97
CF-VQA (Niu et al., 2021) | UpDn | 35M | 53.55 | 91.15 | 13.03 | 44.97
LPF (Liang et al., 2021b) | UpDn | 35M | 55.34 | 88.61 | 23.78 | 46.57
CGE (Han et al., 2021) | UpDn | 35M | 57.32 | 87.04 | 27.75 | 49.59
lxmert(lmh) + mask train(lmh) (Ours) | 10% lxmert | 24M | 58.87 | 75.45 | 57.09 | 50.66
lxmert(lmh) + mask train(lmh) (Ours) | 30% lxmert | 64M | 63.11 | 79.43 | 55.24 | 56.72
lxmert(lmh) + mask train(lmh) (Ours) | 50% lxmert | 103M | 63.88 | 80.30 | 56.95 | 57.17
LMH (Clark et al., 2019) | full lxmert | 202M | 63.55 | 81.84 | 55.00 | 56.32
Table 1 (upper) compares the plain VQA models with our LXMERT subnetworks that involve no debiasing method. We observe that: 1) Our model (10% lxmert) beats the popular VQA models (UpDn and S-MRL) by 5.68% and 6.96% accuracy with only 69% and 40% of their model sizes, respectively. 2) Our model (30% lxmert) outperforms the full LXMERT by 3.33% with only 64M parameters, which is close to the parameter counts of the plain VQA models. We will release our models and encourage researchers to use them (in place of the full LXMERT) as strong VQA backbones. Table 1 (lower) compares the debiasing SoTAs with our models. We find that: 1) Our models (10% lxmert and 30% lxmert) establish new state-of-the-art accuracy, beating the previous best by 1.55% and 5.79% with fewer and similar amounts of parameters, respectively. 2) Our models (30% lxmert and 50% lxmert) are competitive with the debiased full LXMERT, even with much fewer parameters.
To facilitate the application of VLP-based VQA systems, this paper presents the first joint study on the compression and debiasing problems of VLP for the VQA task. Through extensive experiments with LXMERT, a representative VLP, we analyze the impact of compression on the OOD generalization ability. We present a comprehensive study on the design of the training and compression pipeline for a good sparsity-performance trade-off, and provide some valuable findings about the assignment of sparsity to different modality-specific modules. The compressed LXMERT subnetworks in this paper outperform the SoTA debiasing methods with fewer or similar model parameter counts.
References

- Don’t just assume; look and answer: overcoming priors for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4971–4980.
- Analyzing the behavior of visual question answering models. arXiv preprint arXiv:1606.07356.
- Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6077–6086.
- VQA: visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2425–2433.
- Estimating or propagating gradients through stochastic neurons for conditional computation. CoRR abs/1308.3432.
- RUBi: reducing unimodal biases for visual question answering. Advances in Neural Information Processing Systems 32, pp. 841–852.
- Counterfactual samples synthesizing for robust visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10800–10809.
- The lottery ticket hypothesis for pre-trained BERT networks. In NeurIPS, pp. 15834–15846.
- Don’t take the easy way out: ensemble based methods for avoiding known dataset biases. arXiv preprint arXiv:1909.03683.
- MuKEA: multimodal knowledge extraction and accumulation for knowledge-based visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5089–5098.
- An empirical study of training end-to-end vision-and-language transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18166–18176.
- What do compressed large language models forget? Robustness challenges in model compression. CoRR abs/2110.08419.
- Compressing visual-linguistic model via knowledge distillation. In ICCV, pp. 1408–1418.
- The lottery ticket hypothesis: finding sparse, trainable neural networks. In ICLR.
- Drawing robust scratch tickets: subnetworks with inborn robustness are found within randomly initialized networks. In NeurIPS, pp. 13059–13072.
- The state of sparsity in deep neural networks. CoRR abs/1902.09574.
- Playing lottery tickets with vision and language. In AAAI, pp. 652–660.
- MUTANT: a training paradigm for out-of-distribution generalization in visual question answering. arXiv preprint arXiv:2009.08566.
- Compressing BERT: studying the effects of weight pruning on transfer learning. In RepL4NLP@ACL, pp. 143–155.
- Making the V in VQA matter: elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6904–6913.
- Adversarial regularization for visual question answering: strengths, shortcomings, and side effects. NAACL HLT 2019, pp. 1.
- Model compression with adversarial robustness: a unified optimization framework. In NeurIPS, pp. 1283–1294.
- Loss re-scaling VQA: revisiting the language prior problem from a class-imbalance view. IEEE Transactions on Image Processing 31, pp. 227–238.
- Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems 28, pp. 1135–1143.
- Greedy gradient ensemble for robust visual question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1584–1593.
- Training products of experts by minimizing contrastive divergence. Neural Computation 14 (8), pp. 1771–1800.
- Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 2021–2031.
- TinyBERT: distilling BERT for natural language understanding. In EMNLP (Findings), pp. 4163–4174.
- Overcoming language priors in VQA via decomposed linguistic representations. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 11181–11188.
- Reducing language biases in visual question answering with visually-grounded question encoder. In European Conference on Computer Vision, pp. 18–34.
- ALBERT: a lite BERT for self-supervised learning of language representations. In ICLR.
- Optimal brain damage. In NIPS, pp. 598–605.
- Align before fuse: vision and language representation learning with momentum distillation. Advances in Neural Information Processing Systems 34, pp. 9694–9705.
- UNIMO: towards unified-modal understanding and generation via cross-modal contrastive learning. arXiv preprint arXiv:2012.15409.
- Train large, then compress: rethinking model size for efficient training and inference of transformers. CoRR abs/2002.11794.
- Super tickets in pre-trained language models: from model compression to improving generalization. In ACL/IJCNLP, pp. 6524–6538.
- LPF: a language-prior feedback objective function for de-biased visual question answering. arXiv preprint arXiv:2105.14300.
- Learning to contrast the counterfactual samples for robust visual question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 3285–3292.
- ROSITA: refined BERT compression with integrated techniques. In AAAI, pp. 8715–8722.
- Learning to win lottery tickets in BERT transfer via task-agnostic mask training. CoRR abs/2204.11218.
- Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
- Learning sparse neural networks through l_0 regularization. In ICLR (Poster).
- Simple but effective techniques to reduce biases. arXiv preprint arXiv:1909.06321.
- Piggyback: adapting a single network to multiple tasks by learning to mask weights. In ECCV, Lecture Notes in Computer Science, Vol. 11208, pp. 72–88.
- Explicit bias discovery in visual question answering models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9562–9571.
- OK-VQA: a visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3195–3204.
- Right for the wrong reasons: diagnosing syntactic heuristics in natural language inference. In ACL, pp. 3428–3448.
- Are sixteen heads really better than one? In NeurIPS, pp. 14014–14024.
- Pruning convolutional neural networks for resource efficient inference. In ICLR (Poster).
- Counterfactual VQA: a cause-effect look at language bias. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12700–12710.
- When BERT plays the lottery, all tickets are winning. In EMNLP, pp. 3208–3229.
- How fine can fine-tuning be? Learning efficient language models. In Proceedings of Machine Learning Research, Vol. 108, pp. 2435–2443.
- Overcoming language priors in visual question answering with adversarial regularization. arXiv preprint arXiv:1810.03649.
- What’s hidden in a randomly weighted neural network? In CVPR, pp. 11890–11899.
- Faster R-CNN: towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems 28, pp. 91–99.
- DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. CoRR abs/1910.01108.
- Movement pruning: adaptive sparsity by fine-tuning. In NeurIPS, pp. 20378–20389.
- Towards debiasing fact verification models. In EMNLP/IJCNLP, pp. 3417–3423.
- HYDRA: pruning adversarially robust neural networks. In NeurIPS.
- Check it again: progressive visual question answering via visual entailment. arXiv preprint arXiv:2106.04605.
- Patient knowledge distillation for BERT model compression. In EMNLP/IJCNLP, pp. 4322–4331.
- LXMERT: learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490.
- Attention is all you need. In NIPS, pp. 5998–6008.
- OFA: unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In ICML.
- VLMo: unified vision-language pre-training with mixture-of-modality-experts. arXiv preprint arXiv:2111.02358.
- SimVLM: simple visual language model pretraining with weak supervision. arXiv preprint arXiv:2108.10904.
- Debiased visual question answering from feature and sample perspectives. Advances in Neural Information Processing Systems 34, pp. 3784–3796.
- Transformers: state-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45.
- Beyond preserved accuracy: evaluating loyalty and robustness of BERT compression. In EMNLP (1), pp. 10653–10659.
- Adversarial robustness vs. model compression, or both? In ICCV, pp. 111–120.
- Florence: a new foundation model for computer vision. arXiv preprint arXiv:2111.11432.
- Q8BERT: quantized 8bit BERT. In EMC2@NeurIPS, pp. 36–39.
- VinVL: revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5579–5588.
- TernaryBERT: distillation-aware ultra-low bit BERT. In EMNLP, pp. 509–521.
- PAWS: paraphrase adversaries from word scrambling. In NAACL-HLT, pp. 1298–1308.
- Masking as an efficient alternative to finetuning for pretrained language models. In EMNLP, pp. 2226–2241.
Appendix A: Transformer Layers in LXMERT
Each Transformer layer of the language encoder and the object-relationship encoder contains a multi-head self-attention module and a feed-forward network (FFN). Each Transformer layer of the cross-modality encoder contains a language self-attention module, a visual self-attention module, and a multi-head cross-attention module; only the language and visual self-attention modules are followed by an FFN. All the weight matrices of these Transformer layers are summarized in Eq. 2 of the main paper.
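The layer composition above can be sketched as a small helper that enumerates the prunable weight matrices of each layer type. The module and projection names below are illustrative, not the actual parameter names in the LXMERT codebase:

```python
# Illustrative names for the prunable projections (not LXMERT's real names).
ATTN = ["query", "key", "value", "output"]  # multi-head attention projections
FFN = ["intermediate", "output"]            # feed-forward network projections

def layer_weights(layer_type):
    """Return the prunable weight-matrix names for one Transformer layer."""
    if layer_type in ("language", "object_relation"):
        # One self-attention module followed by one FFN.
        return [f"self_attn.{w}" for w in ATTN] + [f"ffn.{w}" for w in FFN]
    if layer_type == "cross_modality":
        # Language and visual self-attention (each followed by an FFN),
        # plus one cross-attention module with no FFN of its own.
        names = []
        for mod in ("lang", "visual"):
            names += [f"{mod}_self_attn.{w}" for w in ATTN]
            names += [f"{mod}_ffn.{w}" for w in FFN]
        names += [f"cross_attn.{w}" for w in ATTN]
        return names
    raise ValueError(f"unknown layer type: {layer_type}")
```

Counting these names makes the asymmetry concrete: a language or object-relationship layer contributes 6 prunable matrices, while a cross-modality layer contributes 16.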
Appendix B: The Effect of Different Initialization Strategies for Mask Training
We conduct experiments with different subnetworks to validate the effectiveness of initializing the mask according to the magnitudes of LXMERT’s pre-trained weights. As Figure 7 shows, “lxmert(bce) + mask train(bce)”, “lxmert(bce) + mask train(lmh)”, and “lxmert(lmh) + mask train(bce)” (dashed lines) consistently outperform “lxmert(bce) + rand-init mask train(bce)”, “lxmert(bce) + rand-init mask train(lmh)”, and “lxmert(lmh) + rand-init mask train(bce)” (solid lines) at all sparsity levels, and the gaps widen as sparsity increases. This shows that the magnitude-based initialization strategy we adopt is more effective than random initialization.
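The two initialization strategies compared here can be sketched in NumPy, assuming a binary mask is obtained by thresholding real-valued scores at the target sparsity (the function and argument names are ours, not the paper's implementation):

```python
import numpy as np

def init_mask(weight, sparsity, magnitude_init=True, seed=0):
    """Initialize a binary mask for one weight matrix.

    magnitude_init=True scores each weight by its magnitude, so the initial
    mask keeps the largest pre-trained weights (as in one-shot magnitude
    pruning). magnitude_init=False draws random scores instead, which
    corresponds to the "rand-init" baseline.
    """
    rng = np.random.default_rng(seed)
    scores = np.abs(weight) if magnitude_init else rng.random(weight.shape)
    threshold = np.quantile(scores, sparsity)   # keep the top (1 - sparsity)
    return (scores > threshold).astype(np.float32)
```

With magnitude initialization the mask starts from a subnetwork that already preserves the pre-trained knowledge, which is consistent with the widening gap over random initialization at high sparsity.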
Appendix C: A Closer Look at the Performance of Subnetworks at 90% Sparsity
From Figure 8, we make two observations at the extremely high sparsity of 90% that are contrary to those at other sparsity levels: 1) pruning with “OMP + lmh ft” (pink and grey lines) outperforms pruning with “mask train(lmh)” (cyan and brown lines); 2) starting from “lxmert(bce)” (pink and cyan lines) outperforms starting from “lxmert(lmh)” (grey and brown lines). For the first observation, we conjecture that mask training (which involves binarization and gradient estimation) is harder to optimize at 90% sparsity than further fine-tuning the OMP subnetworks. The second observation can be explained as follows: further debiasing the already debiased full LXMERT during pruning slightly sacrifices performance on the “Other” questions, which require more reasoning ability than debiasing ability (as shown in the rightmost two plots of Figure 2 in the main paper). At extremely high sparsity, when the benefits of debiasing on the “Y/N” and “Num” questions are small, this penalty on “Other” questions lowers the “Overall” accuracy. Nevertheless, the gaps between “lxmert(lmh) + mask train(lmh)” and the other two pipelines remain small at 90% sparsity.
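The binarization and gradient estimation referred to above can be illustrated with a minimal straight-through-estimator sketch. This is a generic illustration of the technique, not the paper's exact implementation:

```python
import numpy as np

def binarize(scores, sparsity):
    """Forward pass: threshold real-valued mask scores, keeping the top
    (1 - sparsity) fraction of entries as 1 and zeroing the rest."""
    threshold = np.quantile(scores, sparsity)
    return (scores > threshold).astype(np.float32)

def ste_grad(grad_wrt_mask):
    """Backward pass (straight-through estimator): thresholding is
    non-differentiable, so its gradient is approximated by the identity
    and the mask gradient flows to the real-valued scores unchanged."""
    return grad_wrt_mask

def update_scores(scores, grad_wrt_mask, lr=0.1):
    """One illustrative SGD step on the scores (learning rate arbitrary)."""
    return scores - lr * ste_grad(grad_wrt_mask)
```

Because the loss only ever sees the hard 0/1 mask while the update acts on the continuous scores, the optimization signal is approximate, which is one plausible reason mask training becomes harder than plain fine-tuning at extreme sparsity.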
Appendix D: Sparsity Configurations for the Three Modality-specific Modules
For overall target sparsities of 50% and 70%, we adopt the following procedure to search for the comfortable zone of the modality-specific sparsity:
First, we traverse sparsity assignments for any two modules with a step size of 20% and compute the sparsity of the remaining module according to Eq. 11 in the main paper (we exclude the configurations where the computed sparsity of the remaining module is greater than 1 or smaller than 0). From the experimental results of these sparsity configurations (shown in Table 2 and Table 4), we determine the approximate range in which the pruned subnetworks perform better.
Second, we traverse this reduced range in the same way with a smaller step size of 5%, as shown in Table 3 and Table 5. In this way, we determine the most comfortable zone for the modality-specific sparsity.
Similarly, when the overall target sparsity is 90%, we directly traverse 80%–98% with a step size of 2% (as shown in Table 6) to search for the most comfortable zone of the modality-specific sparsity.
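The search procedure above can be sketched as follows, under the assumption that Eq. 11 constrains the parameter-weighted average of the three module sparsities to equal the overall target (the function names and the `counts` argument are ours):

```python
import numpy as np

def remaining_sparsity(target, s1, s2, counts):
    """Solve for the third module's sparsity, assuming the overall target is
    the parameter-weighted average of the three module sparsities.
    counts = (n1, n2, n3) are the modules' parameter counts."""
    n1, n2, n3 = counts
    return (target * (n1 + n2 + n3) - s1 * n1 - s2 * n2) / n3

def search_configs(target, counts, step=0.2):
    """Traverse sparsity assignments for two modules with the given step size,
    compute the third, and exclude configurations outside [0, 1]."""
    grid = np.arange(0.0, 1.0 + 1e-9, step)
    configs = []
    for s1 in grid:
        for s2 in grid:
            s3 = remaining_sparsity(target, s1, s2, counts)
            if 0.0 <= s3 <= 1.0:
                configs.append((float(s1), float(s2), float(s3)))
    return configs
```

Refining the search then amounts to calling `search_configs` again with a narrower grid and a smaller `step`, mirroring the 20% then 5% (or 2%) schedule described above.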