With the success of Transformers and self-supervised learning, we have recently witnessed a growing number of research works on cross-modal learning, especially on vision-language pre-training (VLPT) [chen2020uniter, li2019unicoder, li2019visualbert, lu2019vilbert, su2019vl, tan2019lxmert, zhou2019unified]. VLPT models learn better cross-modal representations from large-scale, easily accessible image-text pairs. They have established state-of-the-art results on many vision-language tasks, such as visual question answering (VQA) [Antol_2015_ICCV], image-text retrieval [lin2014microsoft], and natural language for visual reasoning (NLVR) [suhr2019corpus]. Visual representation plays an important role in VLPT models. The recent success of VLPT models has been accompanied by the use of region-based image features, which are extracted by object detectors pre-trained on the Visual Genome dataset [Anderson_2018_CVPR]. However, directly utilizing region-based image features for vision-language understanding poses three challenges. First, regions focus on objects inside bounding boxes while neglecting the contextual information outside the boxes, which is important for relation understanding and reasoning. For example, in Figure 1, we can easily detect "man", "woman", and "boat" in the image. However, without the contextual information outside these boxes, a model will misinterpret the relation as "people boating" and produce an incorrect answer for either text retrieval or VQA. Second, visual understanding of images is limited to the pre-defined region categories. Third, most region-based image features are extracted by a detection model, which suffers from low quality, noise, and over-sampling [Anderson_2018_CVPR] and relies on large-scale box annotation data. Although some works attempt to train detection models with weak supervision [2017Multiple, zeng2019wsod2], their performance falls far below the requirements.
Recently, some works argue that grid-based convolutional features are also effective for learning visual representations [desai2021virtex, huang2020pixel, jiang2020defense, sariyildiz2020icmlm]. Among them, Jiang et al. show that grid features can be as effective as region features for VQA [jiang2020defense]. Sariyildiz et al. and Desai et al. use image-text data to train visual backbones for recognition tasks (e.g., image classification) [desai2021virtex, sariyildiz2020icmlm]. These models are designed either for a specific vision-language task [jiang2020defense] or for vision tasks [desai2021virtex, sariyildiz2020icmlm]. In this paper, we focus on VLPT and propose one of the first end-to-end VLPT models that does not rely on region features.
To overcome the limitations of region-based image features and better utilize image-text pairs for cross-modal understanding, we propose SOHO, an end-to-end vision-language pre-training framework that directly learns image embeddings, language embeddings, and their semantic alignment from image-text pairs. Compared with existing VLPT works, SOHO adopts a simple pipeline that does not require a complicated visual backbone for pre-training and reduces the design effort for VLPT tasks. Without requiring laborious category or box annotations, SOHO can enrich visual semantics by directly optimizing visual representations on a wider range of image-text data.
End-to-end learning for vision and language is challenging due to the different representations of the two modalities. Pixel-level visual representations are much more diverse and dense than language embeddings, and the lack of explicit language supervision at the pixel level adds difficulty to alignment learning. To tackle these problems, we introduce a visual dictionary (VD) that represents more comprehensive and compact semantics in the visual domain. To learn the visual dictionary, we design a moving-average encoder to group visual pixels with similar visual semantics. The VD can be dynamically updated through our trainable CNN backbone directly from vision-language data during pre-training. For pre-training tasks, we propose a novel Masked Visual Modeling (MVM) task based on the learned visual dictionary, in addition to two commonly used tasks, Masked Language Modeling (MLM) and Image-Text Matching (ITM).
Our contributions are summarized as follows: (i) We propose SOHO, one of the first end-to-end VLPT models to learn cross-modal representations directly from image-text pairs. Without the need to extract bounding boxes, our model achieves at least a 10 times speedup at inference. (ii) To better align visual features and language tokens, we propose a novel dynamically updated visual dictionary that represents a visual abstraction of similar semantics in images. (iii) We conduct extensive experiments on four well-established downstream tasks. Experimental results show that SOHO improves SOTA performance with absolute gains of 2.0% R@1 on the MSCOCO text retrieval 5K test split, 1.5% accuracy on the NLVR test-P split, 6.7% accuracy on the SNLI-VE test split, and 0.56% VQA score on the VQA 2.0 test-std split. We release both the model and code to facilitate the research community: https://github.com/researchmm/soho.
2 Related Work
2.1 Visual Representation for Vision-Language
Visual representation learning for vision-language understanding is a long-standing research topic. Early works use CNN classification models pre-trained on ImageNet to extract visual features [deng2009imagenet, li2018tell, liu2018beyond, NIPS2016_9dcb88e0, yang2016stacked, yu2017multi]. Later, Anderson et al. propose the Bottom-Up and Top-Down Attention (BUTD) detection model [Anderson_2018_CVPR], pre-trained on the Visual Genome dataset, to extract salient region features as visual inputs for VQA and image captioning. The BUTD features are adopted by many vision-language works [Anderson_2018_CVPR, krishna2017visual, singh2018pythia, suhr2019corpus] and pre-training works [chen2020uniter, kim2018bilinear, tan2019lxmert]. Recently, some works propose to directly learn visual representations in the form of grid features with convolutional networks, either for specific vision-language tasks [jiang2020defense] or for vision recognition tasks [desai2021virtex, sariyildiz2020icmlm]. Our work shares a similar form of visual representation with [jiang2020defense], while we focus on vision-language pre-training and propose one of the first end-to-end VLPT models that does not rely on box annotations.
VideoBERT [sun2019videobert] and the bag-of-words literature [fei2005bayesian] also use vector quantization to represent visual information. The key difference between our VD and these works is that we dynamically update the VD-based embedding with the output of a trainable visual encoder, instead of with pre-computed input features. This dynamic updating mechanism allows the VD to capture text-guided semantics from the vision-language dataset, so the model can be directly optimized with high-level semantics for VL understanding and alignment.
2.2 Pre-training for Vision-Language
Many vision-language pre-training (VLPT) works have been proposed to learn cross-modal representations [chen2020uniter, li2019unicoder, li2019visualbert, lu2019vilbert, su2019vl, sun2019videobert, tan2019lxmert, zhou2019unified]. They can be categorized as two-stream or single-stream models. Two-stream models process visual and language information separately and fuse them afterwards with another Transformer layer [lu2019vilbert, tan2019lxmert]. In contrast, single-stream models use BERT [devlin2018bert] to learn a bi-directional joint distribution over detection bounding-box features and text embedding features [alberti2019fusion, chen2020uniter, li2019unicoder, li2019visualbert, su2019vl, zhou2019unified]. Both types use Transformer-based models to learn joint vision-language embeddings, but they neglect that visual representation learning is also important for vision-language tasks.
The key differences between SOHO and existing VLPT works are: 1) SOHO adopts a simple VLPT pipeline; our vision backbone uses only ImageNet pre-trained parameters, yet achieves even higher performance than existing VLPT works that use VG features on five downstream tasks. 2) SOHO uses the fewest annotations to achieve SOTA performance. 3) SOHO enriches visual semantics by directly optimizing visual representations for target language tasks.
The overall architecture of our proposed vision-language pre-training framework SOHO is shown in Figure 2. SOHO is an end-to-end framework that consists of a trainable CNN-based visual encoder, a visual dictionary (VD) embedding module, and a multi-layer Transformer. The visual encoder takes an image as input and produces visual features. The VD embedding module aggregates diverse visual semantic information into visual tokens with the proposed visual dictionary. The Transformer fuses features from the visual and language modalities and produces task-specific outputs. SOHO can be pre-trained end-to-end with the Masked Visual Modeling (MVM), Masked Language Modeling (MLM), and Image-Text Matching (ITM) tasks, and can be easily adapted to several downstream tasks, including image-text retrieval, VQA, NLVR, and visual entailment.
3.1 Trainable Visual Encoder
Most recent vision-language research follows Bottom-Up and Top-Down attention [Anderson_2018_CVPR] to extract region-level visual features with a Faster R-CNN [ren2015faster] detector pre-trained on the Visual Genome [krishna2017visual] dataset. The representation ability of such region-based features is limited by the pre-defined object and attribute categories. Besides, contextual information outside the regions is important for VL understanding, yet it is neglected because it falls outside the pre-defined categories. Even though treating the whole image as a region and extracting its feature as a global representation is an improved solution, the detector cannot guarantee feature quality because such global regions are unseen during training. Despite these issues, most recent VLPT models adopt pre-extracted region-level visual features, because end-to-end fine-tuning an object detector in VL tasks is complicated. Moreover, the extracted region-level features have a semantic gap with the language domain, which existing works try to bridge with only one or a few fully-connected layers.
To keep all visual information, we propose a trainable CNN visual encoder, which takes the whole image as input and produces image-level visual features instead of region-level features. Without the limitation of bounding boxes, the visual encoder can be learned end-to-end and updated by pre-training losses or downstream task-specific losses, which in turn further optimizes cross-modal learning. Given an input image $I$, we obtain its features by:
$$V = \{v_1, \ldots, v_l\} = E(I; \theta),$$
where $E$ is the visual feature encoder with parameters $\theta$, $l$ denotes the number of embedded feature vectors, and each $v_i \in \mathbb{R}^c$ has embedded dimension $c$. We adopt ResNet [he2016deep] pre-trained on ImageNet [deng2009imagenet], followed by a convolutional layer and a max pooling layer, as the architecture of the encoder $E$. For simplicity, we use $v_i$ to denote the $i$-th feature vector of $V$ for the rest of this paper.
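As an illustration of the output format, the encoder turns an image into a grid feature map that is flattened into a sequence of feature vectors. The following NumPy sketch (function name and shapes are our own, for illustration only; the real encoder is the ResNet pipeline described above) shows the flattening step:

```python
import numpy as np

def flatten_feature_map(feature_map):
    """Flatten a CNN feature map of shape (c, h, w) into l = h * w
    feature vectors of dimension c, the form in which image-level
    grid features enter the later modules. Illustrative only."""
    c, h, w = feature_map.shape
    return feature_map.reshape(c, h * w).T  # shape (l, c)

# e.g. a ResNet-style 2048-channel map over a 7x7 grid
fmap = np.random.randn(2048, 7, 7)
V = flatten_feature_map(fmap)
print(V.shape)  # (49, 2048)
```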
3.2 Visual Dictionary
The visual features extracted by the visual feature encoder are more diverse and dense than language word tokens, which makes cross-modal understanding harder to learn. To bridge this representation gap with language tokens, we propose a visual dictionary (VD) to tokenize the visual features by aggregating similar visual semantics into the same image feature.
Visual Dictionary Embedding. We define a visual dictionary (VD) as a matrix $D \in \mathbb{R}^{k \times c}$ which contains $k$ embedding vectors of dimension $c$. The $j$-th embedding vector is denoted as $d_j$. For a visual feature $v_i$, we compute its mapping index by searching for its nearest neighbor in $D$, denoted as:
$$h_i = \arg\min_j \|v_i - d_j\|_2.$$
We define the visual dictionary embedding as a mapping function $f$, which maps $v_i$ to $d_{h_i}$ by:
$$f(v_i) = d_{h_i},$$
which uses the nearest embedding vector to represent the visual feature. We denote $f^{-1}(j)$ as an inverse mapping function, which maps the index $j$ back to the group of visual features assigned to it. We use $|f^{-1}(j)|$ to represent the inverse mapping group size, and use $f(V)$ to represent the encoded features.
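The nearest-neighbor mapping above can be sketched in a few lines of NumPy (a toy illustration with our own variable names; in SOHO the dictionary is learned jointly with the encoder):

```python
import numpy as np

def vd_quantize(V, D):
    """Map each visual feature v_i (a row of V, shape (l, c)) to the
    index of its nearest embedding vector in the visual dictionary
    D (shape (k, c)), and return both the indices h and the
    quantized features f(V) = D[h]."""
    # squared L2 distances between every feature and every entry
    dists = ((V[:, None, :] - D[None, :, :]) ** 2).sum(axis=-1)  # (l, k)
    h = dists.argmin(axis=1)
    return h, D[h]

D = np.array([[0.0, 0.0], [1.0, 1.0]])   # k = 2 embeddings, c = 2
V = np.array([[0.1, -0.1], [0.9, 1.2]])
h, Vq = vd_quantize(V, D)
print(h)  # [0 1]
```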
Momentum Learning for Visual Dictionary Update. The visual dictionary is randomly initialized and further updated by a moving-average operation within each mini-batch, denoted as:
$$\hat{d}_j \leftarrow \gamma \cdot \hat{d}_j + (1 - \gamma) \cdot \frac{1}{|f^{-1}(j)|} \sum_{v_i \in f^{-1}(j)} v_i,$$
where $\hat{d}_j$ indicates the updated embedding vector of $d_j$, and $\gamma \in (0, 1)$ is a momentum coefficient. Note that this update is only applied when $|f^{-1}(j)| > 0$.
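The moving-average update can be illustrated as follows (a NumPy sketch in our notation; entries with an empty inverse-mapping group are left untouched, matching the condition above):

```python
import numpy as np

def vd_momentum_update(D, V, h, gamma=0.99):
    """Momentum update of the visual dictionary within a mini-batch:
    each used entry d_j is pulled toward the mean of the features
    currently mapped to index j; unused entries stay unchanged."""
    D = D.copy()
    for j in np.unique(h):
        group = V[h == j]          # the inverse-mapping group f^{-1}(j)
        D[j] = gamma * D[j] + (1.0 - gamma) * group.mean(axis=0)
    return D

D = np.array([[0.0, 0.0], [1.0, 1.0]])
V = np.array([[1.0, 0.0]])
h = np.array([0])
D_new = vd_momentum_update(D, V, h, gamma=0.5)
```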
Gradient Back-Propagation. Since the argmin operation is not differentiable, gradient back-propagation would be stopped by the visual dictionary. To make the visual feature encoder trainable, we follow [van2017neural] to update $v_i$ by:
$$\tilde{v}_i = v_i + \mathrm{sg}(f(v_i) - v_i),$$
where $\mathrm{sg}(\cdot)$ is the stop-gradient operator.
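In a NumPy sketch (no autograd, so the stop-gradient is shown as an identity placeholder), the straight-through trick reads:

```python
import numpy as np

def straight_through(v, d):
    """Straight-through estimator: the forward pass outputs the
    dictionary vector d, while the gradient flows back to v
    unchanged (in a real framework, sg would be e.g. PyTorch's
    .detach()). With NumPy, sg is just the identity, so only the
    forward value can be checked here."""
    sg = lambda x: x  # stands in for the stop-gradient operator
    return v + sg(d - v)

v = np.array([0.2, 0.8])   # encoder output
d = np.array([0.0, 1.0])   # its nearest dictionary vector
print(straight_through(v, d))  # forward output equals d
```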
The visual dictionary performs online clustering on visual feature maps based on feature similarity and represents each feature vector by its cluster center. Feature vectors sharing similar semantics are aggregated into the same cluster, and the cluster index can be considered a virtual visual semantic label. Since the clustering is shaped by the vision-language learning tasks, the learned semantics of each embedding vector are more suitable for cross-modal understanding and alignment.
The visual dictionary faces a cold-start problem: directly copying the gradient from randomly initialized embedding vectors to visual feature maps would push the model in an incorrect optimization direction (i.e., mode collapse). Therefore, we freeze the parameters of the ResNet in the visual feature encoder at the beginning of pre-training.
| Task | Dataset | Train Split | Test Split | Metric |
| --- | --- | --- | --- | --- |
| Image-Text Retrieval | MSCOCO [lin2014microsoft] | train+restval* | test* | Recall@1,5,10 |
| Visual Question Answering | VQA2.0 [goyal2017making] | train+val | test-dev/test-std | VQA-score [goyal2017making] |
| Visual Reasoning | NLVR [suhr2019corpus] | train | dev/test-P | Top-1 Accuracy |
| Visual Entailment | SNLI-VE [xie2018visual] | train | val/test | Top-1 Accuracy |
3.3 Pre-training Pipeline
We apply a multi-layer Transformer to learn cross-modal representations by fusing visual and language features. To learn a universal representation for vision-and-language tasks, we pre-train the model with self-supervised objectives on a large aggregated dataset. We follow existing works [chen2020uniter, li2019unicoder, lu2019vilbert, su2019vl, tan2019lxmert, zhou2019unified] in adopting the Masked Language Modeling (MLM) and Image-Text Matching (ITM) pre-training tasks. Besides, we propose a novel Masked Visual Modeling (MVM) pre-training task based on the virtual visual semantic labels produced by the visual dictionary.
Cross-Modal Transformer. For the visual representation, we utilize 2-D position embeddings computed by sine functions to encode the spatial information of visual tokens, following other works [carion2020end, dosovitskiy2020image, parmar2018image]. For the input sentence, we follow BERT [devlin2018bert] to tokenize it and represent the tokens by embedding vectors $W = \{w_1, \ldots, w_n\}$, where $w_t$ denotes the $t$-th embedding vector and $n$ is the number of tokens. The word embeddings and the VD embeddings share the same output dimension. We concatenate the VD embeddings and the word embedding vectors to form an input sequence for cross-modal learning. Similar to other VLPT models, we add two special tokens, [CLS] and [SEP], to the input sequence to indicate the classification position and the end of the text, respectively. A multi-layer Transformer takes the joint vision-language input and outputs the attended features.
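The 2-D sinusoidal position embedding referenced above can be sketched as follows. This is a hedged NumPy illustration: the function names and the exact split of channels between the x and y coordinates are our assumptions (the scheme follows the DETR/ViT-style works cited), not SOHO's verified layout.

```python
import numpy as np

def sine_embed_1d(pos, dim):
    """Standard sinusoidal embedding of one coordinate (dim even)."""
    i = np.arange(dim // 2)
    freq = 1.0 / (10000.0 ** (2.0 * i / dim))
    angles = pos[:, None] * freq[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

def sine_embed_2d(h, w, dim):
    """2-D positional embedding for an h x w grid of visual tokens:
    half the channels encode the x coordinate, half the y coordinate
    (channel layout is our assumption)."""
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    ex = sine_embed_1d(xs.ravel().astype(float), dim // 2)
    ey = sine_embed_1d(ys.ravel().astype(float), dim // 2)
    return np.concatenate([ex, ey], axis=1)  # (h * w, dim)

pe = sine_embed_2d(7, 7, 768)
print(pe.shape)  # (49, 768)
```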
Masked Language Modeling. We follow [chen2020uniter] and adopt Masked Language Modeling (MLM) to encourage the model to build the mapping between language tokens and visual contents. The goal of MLM is to predict the masked word tokens based on the other word tokens and all image features by minimizing the negative log-likelihood. The learning target can be formulated as:
$$\mathcal{L}_{\mathrm{MLM}} = -\mathbb{E}_{(V, W) \sim \mathcal{D}} \log p(w_m \mid W_{\setminus m}, V),$$
where $\mathcal{D}$ hereinafter indicates the whole training dataset, $w_m$ is a masked token, and $W_{\setminus m}$ denotes the remaining tokens. We adopt the same masking strategy used in BERT [devlin2018bert].
Masked Visual Modeling. We propose Masked Visual Modeling (MVM) based on the visual dictionary, which is symmetric to MLM. We randomly mask the image features before feeding them into the Transformer. The learning target of MVM is denoted as:
$$\mathcal{L}_{\mathrm{MVM}} = -\mathbb{E}_{(V, W) \sim \mathcal{D}} \log p(h_m \mid V_{\setminus m}, W),$$
where $h_m$ is the VD mapping index of a masked feature and $V_{\setminus m}$ denotes the unmasked features. The goal of MVM is to predict the masked image features based on their surrounding image features and all language tokens by minimizing the negative log-likelihood. MVM encourages the model to infer visual knowledge from contextual visual information as well as language. When an image feature is masked, its mapping index in the VD is taken as its label. In visual feature maps, neighboring features may have similar values and thus share the same mapping index, which would allow the model to lazily copy the label from surrounding features as its prediction. To prevent this, in the masking stage we first randomly select an existing label index $j$, then replace all visual embedding vectors in $f^{-1}(j)$ with the special [MASK] token embedding vector.
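The index-level masking strategy can be sketched as follows (a NumPy illustration with our own helper names; the [MASK] embedding is represented here by an arbitrary vector):

```python
import numpy as np

def mvm_mask(Vq, h, mask_token, rng):
    """MVM masking: randomly pick one VD index that occurs in the
    feature map and replace *all* features mapped to it with the
    [MASK] embedding, so the model cannot trivially copy the label
    from neighbouring features sharing the same index. Returns the
    masked features and the chosen index (the prediction target)."""
    target = rng.choice(np.unique(h))
    Vm = Vq.copy()
    Vm[h == target] = mask_token
    return Vm, target

rng = np.random.default_rng(0)
Vq = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]])
h = np.array([0, 0, 1])            # two features share index 0
mask_token = np.array([9.0, 9.0])  # stand-in for the [MASK] embedding
Vm, target = mvm_mask(Vq, h, mask_token, rng)
```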
Image-Text Matching. To enhance cross-modal matching, we adopt the Image-Text Matching (ITM) task for pre-training as in previous works [chen2020uniter]. We apply a binary classifier on the joint embedding feature of the [CLS] token to predict whether the input image and text are matched. The ITM task is driven by the following loss function:
$$\mathcal{L}_{\mathrm{ITM}} = -\mathbb{E}_{(V, W) \sim \mathcal{D}} \big[ y \log s(V, W) + (1 - y) \log \left(1 - s(V, W)\right) \big],$$
where $s(V, W)$ is the classifier's matching score and $y$ indicates whether the image and text are matched ($y = 1$) or not ($y = 0$).
The visual feature encoder, the VD-based image embedding module, and the cross-modal Transformer are end-to-end jointly trainable. We assign equal loss weights to the three pre-training objectives, so the full pre-training objective of SOHO is:
$$\mathcal{L} = \mathcal{L}_{\mathrm{MLM}} + \mathcal{L}_{\mathrm{MVM}} + \mathcal{L}_{\mathrm{ITM}}.$$
3.4 Pre-training Datasets
Several large-scale datasets have been proposed to facilitate VL pre-training. Following the typical settings in UNITER [chen2020uniter], these datasets can be categorized into two classes: "in-domain" and "out-of-domain". In our work, we use "in-domain" data for pre-training, as most VL pre-training works are built on it [chen2020uniter, li2019visualbert, tan2019lxmert]. We construct our pre-training datasets from MSCOCO [lin2014microsoft] and VG [krishna2017visual].
To avoid data leakage, we only use the train and restval splits of the MSCOCO dataset and the train and val splits of the VG dataset in the training stage. The detailed statistics of our pre-training datasets can be found in Table 1. Detailed comparisons of the pre-training dataset usage of most VLPT works, including our train/test image and text numbers, are included in our supplementary material.
(Table 2: Evaluation of image-to-text retrieval (TR) and text-to-image retrieval (IR) on the MSCOCO dataset, reported on the 1K and 5K test sets. "-" indicates the detail is not reported.)
4.1 Implementation Details
For language processing, we follow BERT and use the WordPiece tokenizer [wu2016google] to split each text into language tokens. For visual processing, most previous works adopt feature extractors with a fixed input resolution; for a fair comparison, we resize the shorter edge of input images and limit the longer edge to match that setting. We initialize the parameters of our visual backbone and Transformer with models pre-trained on the publicly accessible ImageNet [deng2009imagenet] and BERT [devlin2018bert], respectively. Specifically, we adopt the widely used ResNet-101 backbone and a 12-layer Transformer to compare fairly with other works, while we adopt a lightweight ResNet-18 backbone and a 3-layer Transformer in our ablation studies to reduce experiment cost. We use RX to denote an X-layer ResNet architecture in the rest of this paper for simplicity (e.g., R101 denotes ResNet-101). Since the visual backbone and the Transformer favor different kinds of optimizers [zhang2019adam], we follow the suggestion of Zhang et al. [zhang2019adam] and use SGD with weight decay for the visual backbone and AdamW with weight decay for the Transformer. We pre-train our model on 32 NVIDIA Tesla V100 GPUs. The training process takes 40 epochs until convergence, and we empirically decay the learning rates by 10 times twice during training.
We adopt mixed-precision training to reduce memory cost and speed up training procedure. An image will be paired with four texts in each batch during pre-training, including two positive pairs and two negative pairs. We only apply MLM and MVM on the positive image-text pairs.
4.2 Downstream Tasks and Results
We test the performance of SOHO on four well-established downstream tasks, including image-text retrieval, visual question answering (VQA), natural language for visual reasoning (NLVR), and fine-grained visual reasoning (Visual Entailment, or VE). The image-text retrieval task includes two sub-tasks, i.e., image-to-text retrieval (TR) and text-to-image retrieval (IR), and is conducted on the Flickr30K [plummer2015flickr30k] and MSCOCO [lin2014microsoft] datasets. The tasks of VQA, NLVR, and VE are conducted on the VQA 2.0 [goyal2017making], NLVR [suhr2019corpus], and SNLI-VE [xie2018visual] datasets, respectively. Table 1 summarizes the statistics of all our downstream tasks.
We compare our approach with several task-specific methods and pre-training models. Most pre-training models adopt Transformer-like architectures [vaswani2017attention] with BERT-like objectives [devlin2018bert] to learn cross-modal representations [chen2020uniter, li2019unicoder, li2019visualbert, lu2019vilbert, su2019vl, tan2019lxmert, zhou2019unified]. For downstream tasks, we find that using the input features of the VD module as the visual representation is better than directly applying the VD embedding, and we adopt the former setting in our experiments. This suggests that VD suits visual representations learned over a very large scale of semantics, while dense features provide more details on relatively small datasets.
4.2.1 Task I: Image-Text Retrieval
Image-text retrieval requires a model to retrieve the most relevant caption for a given image from a set of candidates, or vice versa. It is one of the most typical tasks in vision-language learning and enables a broad range of applications (e.g., image search). Image-text retrieval includes two sub-tasks: image-to-text retrieval (TR) and text-to-image retrieval (IR). During training, we construct aligned and unaligned pairs inside a mini-batch, like most image-text retrieval models. We randomly sample aligned image-caption pairs from the ground-truth annotations to form a mini-batch, and all the other captions in the batch are used as unaligned captions for each image. To encourage the model to predict the right labels for both the aligned and unaligned pairs, we treat the retrieval task as a binary classification problem.
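The in-batch pairing can be pictured as a label matrix (a minimal sketch with our own function name): with a mini-batch of ground-truth pairs, only the diagonal entries are aligned.

```python
import numpy as np

def retrieval_pair_labels(batch_size):
    """Binary alignment labels for a mini-batch of ground-truth
    image-caption pairs: caption i is aligned only with image i,
    and every other caption in the batch serves as an unaligned
    (negative) example for that image."""
    return np.eye(batch_size, dtype=int)

labels = retrieval_pair_labels(4)
print(labels.sum())  # 4 positive pairs among 16 image-caption pairs
```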
In our implementation, we use the joint embedding representation of the [CLS] token from the Transformer to predict whether an image-caption pair is aligned. Since the objective of the image-text retrieval task is consistent with the image-text matching (ITM) pre-training task, the pre-trained parameters can be well inherited for fine-tuning. We adopt the AdamW optimizer with weight decay. We train for 20 epochs until convergence and empirically decay the learning rate by half several times during training.
We conduct experiments on MSCOCO [lin2014microsoft] and Flickr30K [plummer2015flickr30k]; the results are shown in Table 2 and Table 3, respectively. It is worth noting that UNITER additionally uses out-of-domain datasets, and its results are expected to be better than using merely in-domain datasets, as reported in [chen2020uniter]. Unicoder-VL [li2019unicoder] adopts only out-of-domain datasets, so it is also not directly comparable to SOHO. Nevertheless, SOHO outperforms the most recent VLPT works under most metrics on both MSCOCO and Flickr30K. The performance improvements indicate that SOHO learns better image-text embeddings through our end-to-end pre-training architecture and exploits a comprehensive yet compact visual semantic abstraction through the proposed visual dictionary.
(Table 4 row: Unified VLP [zhou2019unified], X101 backbone, 70.50 / 70.70.)
4.2.2 Task II: Visual Question Answering
Visual Question Answering (VQA) requires a model to take an image and a question as input and output an answer. The task requires machines to reason across vision and language as humans do, a step toward intelligent AI. We model VQA as a classification problem by learning a multi-layer perceptron on the [CLS] token. We follow [kim2018bilinear] to treat it as a multi-label classification problem over a fixed answer vocabulary and optimize the model with a binary cross-entropy loss. We fine-tune for 18 epochs with a batch size of 256 until convergence. The optimizers and initial learning rates are the same as in the pre-training stage, and we empirically decay the learning rate by 10 times twice during fine-tuning.
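The multi-label binary cross-entropy objective commonly used for this kind of VQA fine-tuning can be sketched as follows (NumPy, our own function name; targets are soft VQA scores in [0, 1]):

```python
import numpy as np

def vqa_bce_loss(logits, targets):
    """Binary cross-entropy over the answer vocabulary: every answer
    gets an independent sigmoid, and soft targets let several answers
    be partially correct, as in the VQA-score annotation."""
    probs = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-12  # numerical safety for log(0)
    return -np.mean(targets * np.log(probs + eps)
                    + (1.0 - targets) * np.log(1.0 - probs + eps))

logits = np.array([2.0, -2.0, 0.0])    # scores for three answers
targets = np.array([1.0, 0.0, 0.3])    # soft VQA targets
loss = vqa_bce_loss(logits, targets)
```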
Results are presented in Table 4. The most direct comparable baseline to our SOHO is LXMERT [tan2019lxmert], which adopts the same backbone and pre-training dataset as our SOHO. SOHO obtains 0.83% and 0.93% absolute improvements on test-dev and test-std split over LXMERT respectively. It is worth noting that SOHO outperforms UNITER [chen2020uniter] even under an inferior experimental setting, where UNITER additionally uses out-domain datasets in the pre-training stage. The promising results of SOHO on VQA demonstrate that our end-to-end pre-training approach enables intelligent question answering on visual contents.
4.2.3 Task III: Visual Reasoning
Visual Reasoning with Natural Language (NLVR) requires a model to predict whether a text is true for a given pair of images. Compared with VQA, NLVR addresses the challenge of compositional visual reasoning about relations, comparisons, and quantities. We conduct this task on the NLVR dataset [suhr2019corpus]. In our implementation, we follow LXMERT [tan2019lxmert] and UNITER [chen2020uniter] to feed the two image-text pairs to the Transformer and obtain two embedding vectors from the [CLS] tokens. We then learn a classifier over "true" and "false" on the concatenation of the two embedding vectors with a cross-entropy loss. The settings of the optimizer, number of epochs, and learning rate are the same as for VQA. Since the number of input images for NLVR is twice that of VQA, the batch size for NLVR is half that of VQA.
We mainly compare with the SOTA results provided by LXMERT [tan2019lxmert] and UNITER [chen2020uniter] under the same settings for fair comparisons. From the results shown in Table 5, we observe 0.52% and 1.52% absolute gains for SOHO over UNITER on the dev and test-P splits, respectively. This result validates that SOHO also has advantages when applied to compositional visual reasoning tasks.
(Table 7: ablation on VD size across Text Retrieval, Image Retrieval, VQA, NLVR, and SNLI-VE.)
4.2.4 Task IV: Visual Entailment
Visual Entailment (VE) is a fine-grained visual reasoning task that predicts whether an image semantically entails a text. The relationship between an image and a text in the VE task is more fine-grained than in VQA and NLVR: it can be true (entailment), false (contradiction), or neutral. The SNLI-VE dataset [xie2018visual] was proposed for the VE task and is constructed from the Stanford Natural Language Inference (SNLI) [bowman2015large] and Flickr30K [plummer2015flickr30k] datasets. We follow UNITER [chen2020uniter] to treat the VE task as a three-way classification problem and predict the score of each class with a fully-connected layer on the representation of the [CLS] token from the Transformer output. We fine-tune the model for 6 epochs with batch size 128 until convergence. The learning rate is initialized to 1e-4 and empirically decayed to 1e-5 after four epochs.
We compare SOHO with the VLPT work UNITER [chen2020uniter] and the task-specific method EVE-Image [xie2018visual]. As reported in Table 6, SOHO achieves 85.00% and 84.95% accuracy on the val and test splits, respectively. The results significantly outperform the SOTA results of UNITER [chen2020uniter], with 6.41% and 6.67% absolute accuracy improvements on the val and test splits, respectively. The results indicate the advantage of our end-to-end framework, which refines the CNN backbone together with the cross-modal Transformer to facilitate thorough vision-language alignment.
4.3 Ablation Study
We perform ablation studies to validate the effectiveness of the visual dictionary (VD) on all downstream tasks. We first establish a baseline model without VD, then incorporate VD into the baseline and further evaluate the influence of the number of embedding vectors (the VD size) $k$.
Results are presented in Table 7. Generally, we observe that for most tasks, a VD size of 2048 or 4096 achieves the best results among the four sizes ranging from 1024 to 8192. This is reasonable, as VD is designed to aggregate similar visual semantics into the same image feature. With this design, a bigger VD can learn to group more fine-grained and complete visual semantics, which benefits VL alignment as expected. However, if overly fine-grained visual semantics are split across different image features, the abstraction of visual semantics may deteriorate, which in turn harms VL alignment. We empirically adopt the best-performing VD size as our default setting.
When compared with the baseline without VD, our proposed method with VD achieves better performance under almost all metrics across a wide range of $k$ (i.e., 1024, 2048, and 4096). This validates the effectiveness of VD and shows that VD is generally applicable to a broad range of tasks.
4.4 Visualization of Visual Dictionary
To share insights on what the proposed visual dictionary (VD) learns, we visualize some representative VD indices in Figure 3. As introduced in Sec. 3.2, a VD index is correlated with many visual features, where each visual feature corresponds to an image patch. We randomly sample some indices from the VD and visualize their corresponding image patches. As shown in Figure 3, the VD groups meaningful and consistent image patches under different indices, which reflects an abstraction of visual semantics. The visualization shows the strong capability of the learned VD. More cases can be found in the supplementary materials.
4.5 Inference Time
BUTD-based methods mainly include three inference stages: CNN forwarding, region feature generation, and Transformer forwarding [Anderson_2018_CVPR]. In contrast, SOHO only includes two inference stages: CNN and Transformer forwarding. To compare the inference efficiency of SOHO and BUTD-based methods, we set up an experiment on a V100 GPU with the same input resolution, a ResNet-101 backbone, a 12-layer Transformer, 100 boxes, and a sentence padding length of 16. For BUTD-based methods, in addition to the time cost of region feature generation, the main cost comes from the non-maximum suppression that is required for all categories. Consequently, the time cost of SOHO for an inference step is about 10 times lower than that of BUTD-based methods. Therefore, our highly efficient SOHO is better suited to real applications.
In this paper, we present a new perspective on vision-language model design. In particular, we propose SOHO, one of the first end-to-end vision-language pre-training models, which learns comprehensive yet compact visual representations for cross-modal understanding. To generate visual features that can be fused with language tokens, we propose a novel visual dictionary that transforms an image into concrete semantics. Three pre-training tasks are conducted to build connections between images and language. Performance on four downstream tasks shows the superiority of SOHO over pre-training models with region-based image features. Moreover, we relieve the requirement for bounding-box annotations and reduce heavy human labeling costs. The end-to-end framework also accelerates inference in vision-language tasks by about 10 times, which enables more online vision-language applications. In the future, we will further explore vision-language generation tasks and study the utilization of large-scale unpaired multi-modal data for cognition-level visual understanding.
This work is supported by the Ministry of Science and Technology International Exchange Project (No. 43-17).
Appendix A
A.1 Dataset Statistics
Here we first summarize the detailed train/test image and text numbers of our pre-training and downstream datasets in Table 9. Then we provide a detailed comparison of the pre-training dataset usage of recent VLPT works in Table 10.
We follow UNITER [chen2020uniter] to classify pre-training datasets into two classes: "in-domain" and "out-of-domain". MSCOCO Captions (MSCOCO) [lin2014microsoft] and Visual Genome Dense Captions (VG) [krishna2017visual] are typical in-domain datasets for many VL downstream tasks (e.g., image-text retrieval). In contrast, Conceptual Captions [sharma2018conceptual] and SBU Captions [ordonez2011im2text] are out-of-domain datasets, which are noisier than in-domain datasets. We show the dataset usage of recent VLPT works in Table 10. For example, VisualBERT [li2019visualbert], LXMERT [tan2019lxmert] and UNITER [chen2020uniter] pre-train with in-domain datasets. Among them, UNITER [chen2020uniter] additionally uses out-of-domain data for model training. The ablation study of UNITER [chen2020uniter] shows that the additional usage of out-of-domain data further improves performance.
In our work, we focus on in-domain datasets as they are commonly used in many VL tasks (e.g., image-text retrieval) and adopted by many VLPT works (e.g., VisualBERT [li2019visualbert], LXMERT [tan2019lxmert] and UNITER [chen2020uniter]). When comparing with UNITER, we compare with its in-domain pre-training results whenever they are provided. Otherwise, our "in-domain" setting is at a disadvantage relative to UNITER's "in-domain + out-of-domain" pre-training setting, and the results are not directly comparable.
We plan to include out-of-domain data in our pre-training data as future work.
[Table 9: train/test statistics of pre-training and downstream datasets, with columns Dataset, Split, #Image (K), and #Text (K).]
A.2 Implementation Details
We adopt two strategies to speed up training. First, we adopt mixed-precision training to reduce memory cost and accelerate the training procedure. Second, we re-organize the input data in each mini-batch. Within a mini-batch, if an image has multiple corresponding texts, we forward it through the visual backbone only once, and then concatenate it with each text before feeding the pairs into the cross-modal Transformer. For example, an image is paired with four texts in each batch during pre-training, including two positive pairs and two negative pairs. We only apply MLM and MVM to the positive image-text pairs.
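The mini-batch reorganization above can be sketched as follows. This is a minimal illustration, not the actual SOHO implementation: the tiny feature sizes, the `cnn_forward` stand-in, and the pairing layout are all assumptions made for the example.

```python
import numpy as np

# Illustrative feature sizes; the real model uses a ResNet backbone
# followed by a multi-layer cross-modal Transformer.
IMG_DIM, TXT_DIM = 8, 8

def cnn_forward(image):
    """Stand-in for the visual backbone; called once per unique image."""
    return image.mean(axis=0)  # (IMG_DIM,)

def build_batch(images, pairs):
    """Forward each unique image once, then reuse its features for every
    (image_index, text_features) pair in the mini-batch."""
    visual_feats = [cnn_forward(img) for img in images]  # one pass per image
    batch = [np.concatenate([visual_feats[i], txt]) for i, txt in pairs]
    return np.stack(batch)

rng = np.random.default_rng(0)
images = [rng.normal(size=(4, IMG_DIM)) for _ in range(2)]
texts = [rng.normal(size=TXT_DIM) for _ in range(4)]
# Each image appears in two pairs but enters the backbone only once.
pairs = [(0, texts[0]), (0, texts[1]), (1, texts[2]), (1, texts[3])]
print(build_batch(images, pairs).shape)  # (4, 16)
```

The saving is that the expensive backbone runs once per image rather than once per image-text pair, while the cross-modal Transformer still sees every pair.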
A.3 Visualization of Visual Dictionary
To show the semantics of the visual dictionary (VD) items, we visualize the image patches that are grouped into each index. We have shown two examples in the paper, and in the supplementary material we randomly select ten more indices from the VD. From the visualization shown in Figure 4, we find that each item in the VD has meaningful and consistent semantics. In other words, our model is able to learn unified representations for the different semantics of an image even though we have no object bounding box annotations for supervision.
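As a rough illustration of how grid features could be grouped into VD indices, the sketch below assigns each feature to its nearest dictionary item by Euclidean distance. The dictionary size, feature dimension, and `quantize` helper are hypothetical and only approximate the paper's learnable visual dictionary.

```python
import numpy as np

def quantize(features, dictionary):
    """Assign each grid feature to its nearest visual-dictionary item
    (Euclidean distance) and return the index map plus the quantized
    embeddings that would replace the raw features."""
    # features: (N, D) grid features; dictionary: (K, D) dictionary items.
    dists = ((features[:, None, :] - dictionary[None, :, :]) ** 2).sum(-1)
    indices = dists.argmin(axis=1)        # (N,) dictionary indices
    return indices, dictionary[indices]   # quantized features, (N, D)

rng = np.random.default_rng(1)
dictionary = rng.normal(size=(16, 4))     # K=16 items, D=4 (illustrative)
# Features near items 3, 3, and 7, perturbed by small noise.
features = dictionary[[3, 3, 7]] + 0.01 * rng.normal(size=(3, 4))
indices, quantized = quantize(features, dictionary)
print(indices.tolist())  # patches near the same item share an index
```

Visualizing the patches that map to one index is then just grouping grid positions by `indices`, which is why semantically similar patches end up in the same figure panel.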
For the image-text retrieval task, traditional approaches [faghri2017vse++] first project an image and a text into a common representation space and then correlate their representations by late fusion. The most widely used late-fusion method computes cosine similarity with a dot-product operation, which is simple and fast. In contrast, Transformer-based approaches fuse the image and text early with a multi-layer Transformer to obtain a unified representation. The unified representation captures the deep relation between an image and a text via the self-attention mechanism, and is thus able to achieve better results than late-fusion representations. However, early-fusion Transformer-based approaches cannot produce separate representations for images and texts, and thus suffer from slow speed due to the exhaustive computation over every possible image-text combination. Our model, like other vision-language pre-training models, is based on Transformers, and inference speed has become a bottleneck for applying these models to real-world search engines. For future work, we are curious about how to speed up Transformer-based approaches for the image-text retrieval task.
[Table 10: pre-training dataset usage of recent VLPT works across Visual Genome [krishna2017visual], MSCOCO [lin2014microsoft], Conceptual Captions [sharma2018conceptual], and SBU [ordonez2011im2text]; the surviving row shows Unified VLP [zhou2019unified] with a single check mark.]