Recent studies (Lu et al., 2019; Su et al., 2019; Tan and Bansal, 2019; Chen et al., 2019; Li et al., 2020c; Yu et al., 2020) on vision-language pre-training (VLP) have substantially advanced the state of the art across a variety of vision-and-language (V+L) tasks. These approaches typically follow a two-stage pipeline: 1) a pre-trained object detector first encodes an image by identifying a set of salient visual objects in it; and 2) a cross-modal fusion model is pre-trained to learn cross-modal representations. While most recent VLP methods use region-based visual features (Anderson et al., 2018) extracted by object detection, this paper explores a new approach to vision-language pre-training that uses vanilla grid-based convolutional features from ConvNets (He et al., 2016; Xie et al., 2017).
Despite their superior performance, region-based VLP solutions suffer from three problems: 1) The task-specific object detector has a strong impact on the performance of the VLP model, and important visual information may be lost during object detection; 2) The extraction and post-processing of region features in object detection are very time-consuming; 3) The feature extraction process is non-differentiable, which imposes unnecessary constraints on model design and makes end-to-end training difficult.
To address these limitations, we revisit grid-based convolutional features for vision-language pre-training. Specifically, we encode an image into a feature map with a convolutional network such as ResNet (He et al., 2016), and then conduct cross-modal fusion directly between the text and the derived image feature map. In this way, we step outside the bounding-box design of local region features and make full use of all visual information for vision-and-language learning. In addition, computing grid features enables a simpler and more flexible model design and speeds up inference by skipping the expensive region-related steps.
In this paper, we propose a simple yet effective grid-based VLP method, namely Grid-VLP, which pre-trains directly on grid-based convolutional features instead of the dominant region-based features from bottom-up attention (Anderson et al., 2018). We follow LXMERT (Tan and Bansal, 2019) and adopt the typical Masked Language Modeling, Image-Text Matching and Image Question Answering as our pre-training tasks. To handle the challenge of modeling long sequences of grid features and to accelerate training, we use a random grid sampling mechanism as in PixelBERT (Huang et al., 2020), which also helps improve the robustness of visual feature learning. We pre-train Grid-VLP only with in-domain datasets (Visual Genome (Krishna et al., 2017) and MS-COCO (Lin et al., 2014)), and then fine-tune it on three popular V+L understanding tasks: VQA (Antol et al., 2015), NLVR2 (Suhr et al., 2018) and GQA (Hudson and Manning, 2019). Our results show that with grid features, Grid-VLP outperforms competitive region-based VLP methods across all three tasks. We also provide an in-depth analysis of the influence of different design choices regarding feature types, image encoder architectures and the resolution of input images.
2 Related Work
Existing approaches to VLP (Li et al., 2020a; Su et al., 2019; Chen et al., 2019; Li et al., 2020c; Tan and Bansal, 2019; Yu et al., 2020; Huang et al., 2020) mainly follow a two-step training pipeline, which consists of extracting semantic visual features with a specific object detector and then training a cross-modal pre-training model to align text and visual features. There are typically two lines of research on this topic. The first line uses a single-stream Transformer architecture to model both image and text representations in a unified semantic space, such as VLBERT (Su et al., 2019), UNITER (Chen et al., 2019) and OSCAR (Li et al., 2020c). In contrast, the other line adopts a two-stream Transformer architecture that first encodes the image and text modalities separately, and then fuses the cross-modal representations with another Transformer network, such as LXMERT (Tan and Bansal, 2019) and ERNIE-ViL (Yu et al., 2020). Recently, VinVL (Zhang et al., 2021) pre-trained a large-scale object-attribute detection model with much larger amounts of data from four public object detection datasets to further improve performance. In addition to image-text pairs, UNIMO (Li et al., 2020b) also employed large-scale free text corpora and image collections to enhance cross-modal learning. These methods rely heavily on a task-specific bounding box (or region) based object detector, which imposes unnecessary constraints on model design and limits the potential applications of existing vision-and-language systems. In this paper, Grid-VLP explores a new way of vision-language pre-training that uses grid-based convolutional features, which skips the expensive region-related steps and does not require an object detector.
3 Grid-VLP Approach
3.1 Model Architecture
The architecture of Grid-VLP is shown in Figure 1. Given a pair of aligned image and caption text, we first use a pre-trained CNN encoder to extract grid features for the image; a Transformer-based model then conducts cross-modal fusion directly between the token embeddings and the image feature map. Different V+L pre-training tasks are designed to further enhance cross-modal learning. To ease the difficulty of modeling long grid feature sequences and to accelerate training, we adopt a simple random grid sampling mechanism that selects part of the image feature map during cross-modal fusion. The architecture is flexible and involves no expensive region-related steps.
3.2 Input Representations
The input to Grid-VLP is an image and its related text (e.g. caption text). We first introduce the way to represent the text sequence and image.
Similar to BERT (Devlin et al., 2018), each sentence is first split into a sequence of sub-words by the WordPiece tokenizer. Each token is then assigned three kinds of learnable embeddings: token, modal type and position embeddings. The three embeddings are summed and layer-normalized to represent the input sentence as a sequence of embedding vectors that begins with [CLS] and ends with [SEP], where [CLS] and [SEP] are special tokens in BERT.
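The embedding step described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy lookup tables, the hidden size of 4 and the toy token ids are all hypothetical, and the layer normalization omits the learnable gain and bias that a full implementation would normally include.

```python
import math


def layer_norm(vec, eps=1e-12):
    """Normalize a vector to zero mean and unit variance (no learnable gain/bias)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def embed_tokens(token_ids, token_table, type_vec, position_table):
    """Sum token, modal type and position embeddings, then layer-normalize."""
    embedded = []
    for pos, tok in enumerate(token_ids):
        summed = [t + m + p for t, m, p in
                  zip(token_table[tok], type_vec, position_table[pos])]
        embedded.append(layer_norm(summed))
    return embedded


# Toy lookup tables (hypothetical values, hidden size 4).
token_table = [
    [0.5, -1.0, 0.25, 2.0],   # id 0: stands in for [CLS]
    [1.0, 0.0, -0.5, 0.5],    # id 1: stands in for [SEP]
    [0.0, 1.5, 1.0, -1.0],    # id 2: a sub-word
    [-0.5, 0.5, 0.0, 1.0],    # id 3: another sub-word
]
position_table = [[0.1 * p * (j + 1) for j in range(4)] for p in range(8)]
text_type_vec = [0.2, -0.2, 0.1, -0.1]  # modal type embedding for text

token_ids = [0, 2, 3, 1]  # [CLS] w1 w2 [SEP]
embeddings = embed_tokens(token_ids, token_table, text_type_vec, position_table)
print(len(embeddings), len(embeddings[0]))
```

Each output vector has the same dimension as the lookup tables, so the text sequence can later be concatenated with projected image features of matching width.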
Image Grid Features
We use traditional grid-based convolutional features instead of region-based features to represent an image. Starting from the initial image with 3 color channels, a CNN-based image encoder generates a lower-resolution activation map, with the channel dimension and spatial size set to the typical values used in Faster R-CNN (Ren et al., 2015). To make the CNN encoder aware of fine-grained semantics, we add a RoIPool layer on top of the CNN encoder, followed by two fully-connected layers, and pre-train the encoder on the Visual Genome dataset as in (Jiang et al., 2020). After pre-training, the CNN encoder is frozen and used as a grid feature extractor. As the cross-modal fusion network expects a sequence as input, we collapse the spatial dimensions of the activation map into one dimension, which yields a sequence of grid feature vectors, one per spatial location. We then apply a linear projection layer to reduce the channel dimension of this high-level feature map to a smaller dimension that matches the dimension of the token embeddings. To distinguish between the modalities, we supplement the grid feature map with a learnable modal type embedding that is added to the output of the linear projection layer. Finally, the sequential image representation can be seen as a sequence with one vector per grid location, each with the same dimension as the token embeddings.
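To make the flatten, project and add-type-embedding steps concrete, the following is a small sketch in plain Python. It is illustrative only: the function names, the row-major flattening order, and the toy sizes (3 channels, a 2 x 2 grid, projection to dimension 2) are assumptions chosen for readability and are not taken from the paper.

```python
def flatten_grid(feature_map):
    """Collapse a C x H x W map (nested lists) into H*W vectors of length C.

    Grid cells are visited in row-major order; this ordering is an assumption.
    """
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    return [[feature_map[c][y][x] for c in range(channels)]
            for y in range(height) for x in range(width)]


def project(vec, weight, bias):
    """Linear projection: weight has shape d x C, bias has length d."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def image_sequence(feature_map, weight, bias, image_type_vec):
    """Flatten the grid, project each cell, and add the image modal type embedding."""
    return [[p + t for p, t in zip(project(cell, weight, bias), image_type_vec)]
            for cell in flatten_grid(feature_map)]


# Toy activation map with C=3, H=2, W=2 (hypothetical values).
feature_map = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
    [[9.0, 10.0], [11.0, 12.0]],
]
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]  # projects C=3 down to d=2
bias = [0.0, 0.0]
image_type_vec = [0.5, -0.5]  # modal type embedding for images

seq = image_sequence(feature_map, weight, bias, image_type_vec)
print(len(seq), seq[0])
```

In a real system the projection weights and the type embedding would be learned jointly with the Transformer, while the CNN producing `feature_map` stays fixed as described above.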
3.3 Cross-modal Fusion
Given the embeddings of the tokens for the sentence and the sequential image representations, we adopt a Transformer encoder to learn cross-modal fusion between image grids and language tokens. To allow fine-grained feature-level fusion, we concatenate the derived image features and text embeddings into a single input sequence. To facilitate cross-modal understanding, we follow LXMERT (Tan and Bansal, 2019) and conduct three popular pre-training tasks: Masked Language Modeling (MLM), Image-Text Matching (ITM) and Image Question Answering (QA). Unlike LXMERT, we do not use any vision pre-training task, since our approach has no concept of region.
Random Grid Sampling
During pre-training, to further increase the difficulty of the cross-modal pre-training tasks, we adopt a random sampling strategy as in (Huang et al., 2020) and randomly sample a fixed number of grids for each image. At each iteration step, given the sequence of extracted grid features, we dynamically sample a subset of them and feed it into the Transformer. In this way, we encourage the model to learn cross-modal relations from incomplete visual input, which enhances robustness. It also largely accelerates training by reducing the total input sequence length. During fine-tuning, we use the complete set of grid features to keep all the extracted visual information.
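The sampling behavior described above can be sketched as follows. This is a hedged illustration, not the released code: `sample_grids` is a hypothetical helper, the toy count of 500 grid cells is arbitrary, and keeping the sampled cells in their original order is a choice made for this sketch rather than something the text specifies.

```python
import random


def sample_grids(grid_features, num_samples, training, rng=random):
    """Return a random subset of grid features in training, all of them otherwise.

    Sampling is without replacement; sampled cells keep their original order
    (an assumption of this sketch).
    """
    if not training or len(grid_features) <= num_samples:
        return list(grid_features)
    keep = sorted(rng.sample(range(len(grid_features)), num_samples))
    return [grid_features[i] for i in keep]


# Toy sequence of 500 one-dimensional grid features (hypothetical size).
grid_features = [[float(i)] for i in range(500)]
rng = random.Random(0)  # seeded generator so the sketch is repeatable

pretrain_input = sample_grids(grid_features, 100, training=True, rng=rng)
finetune_input = sample_grids(grid_features, 100, training=False, rng=rng)
print(len(pretrain_input), len(finetune_input))  # 100 500
```

Because a fresh subset is drawn on every call, the same image contributes different partial views across iterations, which is the source of the robustness effect the section describes.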
4.1 Pre-training Dataset
We use the same in-domain data as LXMERT (Tan and Bansal, 2019) for pre-training. It consists of image caption data from MS-COCO (Lin et al., 2014) and Visual Genome (Krishna et al., 2017), and image question answering data from VQA v2.0 (Antol et al., 2015), the balanced version of GQA (Hudson and Manning, 2019) and VG-QA (Zhu et al., 2016). In total, the dataset contains 9.18M image-and-sentence pairs on 180K distinct images.
4.2 Implementation Details
The maximum sequence length for the sentence is set to 20. The visual encoder is ResNeXt with different sizes (Xie et al., 2017), as in (Jiang et al., 2020; Huang et al., 2020). We resize the shorter side of the input image to 600, and limit the longer side to at most 1000. We select a fixed number of 100 random grids each time during pre-training (we also tested other numbers of selected grids, such as 64 and 128, and found little difference). We pre-train Grid-VLP with a 12-layer Transformer encoder. The basic settings of the Transformer are the same as in BERT (Devlin et al., 2018). We pre-train Grid-VLP with a total batch size of 512 for 30 epochs on 8 V100 GPUs. We use the AdamW optimizer and set the initial learning rate as.
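The resizing rule above (shorter side 600, longer side capped at 1000) can be written as a short helper. This sketch assumes the aspect ratio is preserved, as in the common detection-style resizing convention; the text does not state this explicitly, and `resized_shape` is a hypothetical name.

```python
def resized_shape(height, width, shorter=600, longer_max=1000):
    """Scale so the shorter side equals `shorter`, unless that would push the
    longer side past `longer_max`; in that case scale so the longer side
    equals `longer_max`. Aspect ratio is preserved (an assumption)."""
    scale = shorter / min(height, width)
    if scale * max(height, width) > longer_max:
        scale = longer_max / max(height, width)
    return round(height * scale), round(width * scale)


print(resized_shape(480, 640))   # (600, 800)
print(resized_shape(400, 1000))  # (400, 1000)
```

For the second image the cap on the longer side takes over, so the shorter side ends up below 600.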
4.3 Main Results on Downstream Tasks
We compare the Grid-VLP model against state-of-the-art VLP models of comparable size on three popular V+L understanding tasks: VQA (Antol et al., 2015), NLVR2 (Suhr et al., 2018) and GQA (Hudson and Manning, 2019). Detailed task descriptions can be found in LXMERT (Tan and Bansal, 2019). The results are shown in Table 1. With only in-domain pre-training data from MS-COCO and Visual Genome, Grid-VLP outperforms almost all other existing VLP methods by a large margin, without using any complicated strategies such as data augmentation, adversarial training or knowledge enrichment. The only exception is VinVL, which uses large amounts of object detection data, out-of-domain pre-training data and fine-tuning tricks such as using unbalanced data on GQA. Grid-VLP even outperforms PixelBERT, an end-to-end method that optimizes directly at the pixel level. This shows the effectiveness of conducting VLP at the grid feature level, which provides a new way of performing vision-language pre-training.
4.4 Influence of Image Encoder
Table 2 (columns): Backbone | Avg Time (ms) | VQA | NLVR2
As the image encoder is critical for extracting grid features, we further study its importance by varying the ResNet visual backbone. To allow a fair comparison between grid features and region features, we also add an experiment that uses Faster R-CNN (Ren et al., 2015) with the same ResNeXt backbone to extract region features of 100 objects. The results are shown in Table 2. Using more complex visual backbones contributes to the final performance gain, which confirms the importance of the image encoder for cross-modal understanding. Moreover, under the same settings with only the extracted feature type changed, grid features can even outperform region-based features at a much lower feature extraction cost.
4.5 Impact of Input Image Size
Table 3 (columns): shorter side | longer side
The sequence length of the grid features is determined by the image size. The final sequence length of the input to the Transformer therefore also depends largely on the image size, which is important for controlling the tradeoff between effectiveness and efficiency. We further analyze the impact of the input image size on Grid-VLP, as shown in Table 3. The results show that Grid-VLP benefits from larger input images, from which more fine-grained information can be extracted. Moreover, resizing the image to a smaller size can significantly improve inference speed without decreasing performance much, e.g., a drop of no more than 1% for a 2 times speedup.
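As a rough illustration of how image size drives sequence length, the sketch below counts grid cells for a given input size. It assumes a total backbone stride of 32 with ceiling rounding, which is common for ResNet-style encoders but is not confirmed by the text, and it uses the maximum text length of 20 from the implementation details. The function names are hypothetical.

```python
import math


def grid_sequence_length(height, width, stride=32):
    """Number of grid cells, assuming a total downsampling stride (default 32,
    an assumption) and ceiling rounding at the borders."""
    return math.ceil(height / stride) * math.ceil(width / stride)


def total_sequence_length(height, width, text_len=20, stride=32):
    """Grid cells plus text tokens fed to the Transformer."""
    return grid_sequence_length(height, width, stride) + text_len


print(grid_sequence_length(600, 1000))  # 608
print(grid_sequence_length(300, 500))   # 160
print(total_sequence_length(600, 1000)) # 628
```

Under these assumptions, halving both image sides cuts the number of grid cells by roughly a factor of four, and since self-attention cost grows with the square of the sequence length, smaller inputs translate into noticeably faster inference.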
In this paper, we revisit grid features as an alternative to the dominant bottom-up region features for vision-language pre-training. We propose a simple yet effective grid-based VLP method, and find that with larger and well pre-trained visual backbones, Grid-VLP equipped with grid-based convolutional features can outperform most competitive region-based VLP methods. We hope our findings help advance vision-language pre-training and provide new perspectives on it.
References
- Anderson et al. (2018). Bottom-up and top-down attention for image captioning and visual question answering. pp. 6077–6086.
- Antol et al. (2015). VQA: visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2425–2433.
- Chen et al. (2019). UNITER: universal image-text representation learning.
- Devlin et al. (2018). BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- He et al. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- Huang et al. (2020). Pixel-BERT: aligning image pixels with text by deep multi-modal transformers. arXiv preprint arXiv:2004.00849.
- Hudson and Manning (2019). GQA: a new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6700–6709.
- Jiang et al. (2020). In defense of grid features for visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10267–10276.
- Krishna et al. (2017). Visual Genome: connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123 (1), pp. 32–73.
- Li et al. (2020a). Unicoder-VL: a universal encoder for vision and language by cross-modal pre-training. Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 11336–11344.
- Li et al. (2020b). UNIMO: towards unified-modal understanding and generation via cross-modal contrastive learning. arXiv preprint arXiv:2012.15409.
- Li et al. (2020c). Oscar: object-semantics aligned pre-training for vision-language tasks. arXiv preprint arXiv:2004.06165.
- Lin et al. (2014). Microsoft COCO: common objects in context. In European Conference on Computer Vision, pp. 740–755.
- Lu et al. (2019). ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, pp. 13–23.
- Ren et al. (2015). Faster R-CNN: towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497.
- Su et al. (2019). VL-BERT: pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530.
- Suhr et al. (2018). A corpus for reasoning about natural language grounded in photographs. arXiv preprint arXiv:1811.00491.
- Tan and Bansal (2019). LXMERT: learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490.
- Xie et al. (2017). Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1492–1500.
- Yu et al. (2020). ERNIE-ViL: knowledge enhanced vision-language representations through scene graph. arXiv preprint arXiv:2006.16934.
- Zhang et al. (2021). VinVL: making visual representations matter in vision-language models. arXiv preprint arXiv:2101.00529.
- Zhu et al. (2016). Visual7W: grounded question answering in images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4995–5004.