
VLDeformer: Vision-Language Decomposed Transformer for Fast Cross-Modal Retrieval

Cross-modal retrieval has emerged as one of the most important upgrades for text-only search engines (SE). Recently, with powerful representations for pairwise text-image inputs via early interaction, the accuracy of vision-language (VL) transformers has outperformed existing methods for text-image retrieval. However, when the same paradigm is used for inference, the efficiency of VL transformers is still too low to be applied in a real cross-modal SE. Inspired by the mechanism of human learning and use of cross-modal knowledge, this paper presents a novel Vision-Language Decomposed Transformer (VLDeformer), which greatly increases the efficiency of VL transformers while maintaining their outstanding accuracy. With the proposed method, cross-modal retrieval is separated into two stages: the VL transformer learning stage and the VL decomposition stage. The latter stage plays the role of single-modal indexing, which is to some extent like the term indexing of a text SE. The model learns cross-modal knowledge from early-interaction pre-training and is then decomposed into an individual encoder. The decomposition requires only small target datasets for supervision and achieves both 1000+ times acceleration and less than 0.6% average recall drop. VLDeformer also outperforms state-of-the-art visual-semantic embedding methods on COCO and Flickr30k.



1 Introduction

Vision-language transformers (VL transformers) [13, 21, 15] are well known for their superior accuracy in cross-modal retrieval, which involves searching for instances semantically similar to the query from another modality. These models learn cross-modal matching relationships with an early-interaction dataflow and produce a joint representation for the text-image input. For example, ViLBERT [17] computes co-attention among the outputs of each layer in the text and image encoders and learns fused text and image representations. The model significantly outperforms conventional multimodal encoders and proves the effectiveness of pre-training and early interaction. The following works [13, 21, 15] unify the image and text encoders within one BERT network [3] and learn one joint representation for a text-image pair. The BERT network tightens the interaction with fully connected attention among all vision and language features and achieves superior accuracy on several cross-modal tasks, including retrieval.

Figure 1: Illustration of different cross-modal retrieval models. (a) VL transformers that produce a joint representation for a text-image pair. (b) Late-interaction two-branch encoders. (c) Pre-trained two-branch encoders. (d) VLDeformer that decomposes the VL transformer into an individual encoder.
Type Model T2I I2T Pre-training data Params
(b) DIME[22] 59.3 43.1 - 116M
(a) VinVL[31] 74.6 58.1 8.8M 111M
(c) LightningDOT[26] 70.0 54.0 9.5M 116M
(c) ALIGN-small[9] 52.0 39.2 180M 235M
(c) ALIGN[9] 71.9 54.7 1800M 900M
Table 1: Retrieval accuracy of state-of-the-art models of each type. There is still a performance gap between pre-trained two-branch encoders (type c) and VL transformers (type a) under a similar pre-training data scale.

However, to fulfill the requirements of modern applications such as search engines, social media, and e-commerce, a cross-modal retrieval model should achieve not only high accuracy but also fast retrieval speed. Although VL transformers achieve superior accuracy on the retrieval task, they suffer from low retrieval speed. Because of the early-interaction dataflow, they have to compute a joint representation for every matching composition, as shown in Fig. 1(a). For $N$ texts and $M$ images, the paradigm requires $N \times M$ inferences, so finding the most similar pairs in a large set of text-image records can take hours on a modern V100 machine. Therefore, improving the cross-modal retrieval speed of VL transformers is important when applying them to real-world large-scale retrieval scenarios.

Faster in speed are conventional cross-modal retrieval methods [29, 5, 6, 11, 30, 22] that use a late-interaction dataflow to match the multimodal embeddings at the output of the encoders. They commonly use contrastive learning to learn the multimodal embeddings with two-branch networks, as shown in Fig. 1(b). During retrieval, the multimodal embeddings can be reused across comparisons, which reduces the cost to $N + M$ inferences. These methods are faster but usually not as accurate as the pre-trained VL transformers.
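The speed difference between the two dataflows can be made concrete by counting encoder inferences. The sketch below is purely illustrative (the function names and the 1k/1k example sizes are our assumptions, not figures from the paper):

```python
# Illustrative sketch: encoder-inference counts for early- vs
# late-interaction retrieval over N texts and M images.
def early_interaction_calls(n_texts, n_images):
    # Early interaction: a joint representation is computed
    # for every text-image pairing.
    return n_texts * n_images

def late_interaction_calls(n_texts, n_images):
    # Late interaction: each text and image is encoded once;
    # the pairwise similarities are cheap vector operations.
    return n_texts + n_images

print(early_interaction_calls(1000, 1000))  # 1000000
print(late_interaction_calls(1000, 1000))   # 2000
```

The quadratic-versus-linear gap is what makes early-interaction VL transformers impractical for large-scale retrieval despite their accuracy.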

Recent works address the trade-off between accuracy and speed by pre-training the two-branch encoders, as illustrated in Fig. 1(c). However, as shown in Table 1, these models [9] require thousands of times more pre-training data to outperform VL transformers. Works that pre-train two-branch transformers as individual encoders on a pre-training data scale similar to VL transformers, such as LightningDOT [26], still show an accuracy gap to the VL transformers, which has to be closed by using a VL transformer as a post re-ranker.

We believe that VL transformers are still superior models, and to make them more efficient in real-world scenarios, we aim to drastically improve their speed by decomposing the vision-language interaction inside the VL transformer. We noticed that the pre-training and fine-tuning process of VL transformers differs greatly from how the human brain uses cross-modal associations. While humans can handle multimodal information separately during retrieval, first understanding the query and then searching the content, existing VL transformers can only process the multimodal information simultaneously. To enable them to handle the image and text information separately, we propose to change the process of building VL transformers: they can be trained with an early-interaction dataflow and then decomposed into an individual encoder via fine-tuning, as shown in Fig. 1(d).

Since we want to benefit from the speed of the conventional cross-retrieval methods and the accuracy of the VL transformers, we modify the VL transformer by applying the principles of conventional cross-retrieval: late-interaction dataflow, the two-branch structure, and contrastive learning. We thereby propose Vision-Language Transformer Decomposing (VLDeformer), which improves the speed of the VL transformer while maintaining its accuracy. Considering both effectiveness and efficiency, VLDeformer provides a superior selection for cross-modal retrieval. Our contributions can be summarized as follows:

  1. We propose Vision-Language Transformer Decomposing (VLDeformer) that modifies the pre-trained VL transformer to an individual encoder for single text or image.

  2. We propose to use contrastive learning as the objective of VLDeformer to maintain the accuracy of the backbone VL transformer after decomposing.

  3. The proposed VLDeformer achieves a retrieval speed acceleration of more than a thousand times with accuracy comparable to state-of-the-art cross-modal retrieval methods.

  4. VLDeformer outperforms state-of-the-art cross-modal embedding models as an individual encoder.

2 Related Work

Figure 2: Overall structure of the Vision-Language Transformer Decomposing (VLDeformer). The vision-language interaction inside the transformer module is decomposed (illustrated by the red line with the scissors) so that the model produces individual embeddings for every input image or text. The networks for image and text inputs share the same weights.

Cross-modal Embedding Models

Text-image retrieval methods [8, 19, 33] usually learn embeddings for the image and text with a two-branch network. Generally, the network includes a convolutional neural network as the image encoder and a sequence model as the text encoder. Researchers have found that fine-grained relations between visual objects and text tokens are essential for improving the embedding quality. For example, [11] calculates attention between detected object features and word embeddings from both the visual and textual views. The follow-up work SCAN [12] explores an early-interaction structure with stacked cross attention and obtains significant improvements. However, the early-interaction dataflow decreases its retrieval speed. During inference, a late-interaction dataflow enables extracting representations offline to achieve fast online retrieval. Therefore, most subsequent cross-modal embedding methods still use a late-interaction dataflow; e.g., [32, 22] develop the interaction from global and local views.

Contrastive learning is widely used by cross-modal embedding models [6, 23] to reduce the distance between similar samples and increase that of dissimilar samples. Yet, to the best of our knowledge, no prior work uses contrastive learning to decompose the early-interaction dataflow.

VL Transformer Pre-trained VL transformers have shown impressive performance on many multimodal tasks. These models learn the cross-modal interaction with an early-interaction dataflow, where the text and image features in each layer are fused with the attention mechanism. Using a BERT [3] network, these models learn cross-modal relations from self-supervision signals, e.g., text-image alignment [13], word-region alignment [2], and object labels [15]. Early VL transformers [17, 27] achieve early interaction with two-branch transformer networks connected by co-attention, where the per-layer outputs of the image and text branches are fused by a third transformer. The following models [14, 13, 2, 25, 15, 21] use a single-stream architecture where the features are fused with fully-connected attention mechanisms. These models achieve improvements on cross-modal retrieval tasks with different self-supervised tasks.

In common practice, the pre-trained VL transformers are fine-tuned on downstream tasks while keeping the early-interaction dataflow. Therefore the network still needs to compute a fused representation for every compared text-image pair during inference, which leads to high computation costs in large-scale cross-modal retrieval tasks.

Pre-trained Cross-modal Embedding Models Pre-training transformers usually uses several million text-image pairs. Recently, some researchers have explored pre-training cross-modal embedding methods on larger sets of text-image pairs. For example, CLIP [23] trains a two-branch network with contrastive learning on 400 million image-text pairs and achieves competitive results on various downstream tasks. ALIGN [9] expands the pre-training data even further to a larger and noisier 1,800 million scale. The model consists of an EfficientNet [28] with global pooling as the image encoder and a BERT as the text encoder and achieves higher performance than the pre-trained VL transformers after fine-tuning. Although these pre-trained cross-modal embedding models achieve advanced performance, their corpora are hundreds of times larger than those of VL transformers. Inspired by the pre-training of VL transformers, LightningDOT [26] tries to train two-branch transformers with a data scale close to that of VL transformers. The network uses a late-interaction dataflow to accelerate inference. The cross-modal interaction is enabled at the transformer output for self-supervised pre-training tasks. As a result, the model outperforms cross-modal embedding methods without pre-training. However, there is still a performance gap to the VL transformers, which has to be closed by collecting the top-N candidates and using a VL transformer to select the final top-k (k < N) results.

Different from the above methods, we explore a new form for vision-language models. We believe the transformer should be first pre-trained with an early-interaction dataflow and then be fine-tuned as an independent embedding encoder. Following this novel process, our model could achieve state-of-the-art accuracy and fast speed at the same time.

3 Methodology

The architecture of the VLDeformer is illustrated in Fig. 2. It consists of three components: Image and Text Input, Decomposed VL Transformer, and Contrastive Representation. VLDeformer can be applied to many pre-trained VL transformers that use an early-interaction dataflow; in this section, we take the pre-trained VinVL-base [31] model as an example to introduce the proposed method.

3.1 Image and Text Input

The formats of the image and text input follow the backbone transformer with one simple difference: the image and text inputs are not concatenated. Both inputs consist of three parts: position embedding, segment embedding, and token embedding. The input text is tokenized into a token sequence $\{t_1, \dots, t_n\}$, where $n$ is the length of the WordPiece [10] tokenizer output. The input image is pre-processed by the object detection network [31] to extract region features and tags.

A special [CLS] token is added to the beginning of each token sequence, and [SEP] is added to the end. As for the segment tokens, we assign segment id 0 to mark the word tokens and the object tags, and 1 to represent the region features. The position indices for tokens, tags, and objects are assigned separately: the text position index ranges from $0$ to $n-1$, while for the image input, the position indices of the tags and object features range from $0$ to $m-1$, where $m$ is the number of objects or tags. The final embedding for both text and image input is obtained by summing up the position, segment, and token embeddings, followed by a layer normalization.
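The input construction above can be sketched as follows. This is a minimal illustration of summing the three embeddings and applying LayerNorm; the vocabulary size, dimensions, and embedding tables are toy assumptions, not VinVL's actual configuration:

```python
import numpy as np

# Toy embedding tables (illustrative sizes, not VinVL's).
rng = np.random.default_rng(0)
VOCAB, MAX_POS, N_SEG, DIM = 100, 64, 2, 8
token_table = rng.normal(size=(VOCAB, DIM))
pos_table = rng.normal(size=(MAX_POS, DIM))
seg_table = rng.normal(size=(N_SEG, DIM))

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def embed(token_ids, segment_ids):
    # Position indices are assigned separately per sequence, starting at 0.
    positions = np.arange(len(token_ids))
    x = token_table[token_ids] + pos_table[positions] + seg_table[segment_ids]
    return layer_norm(x)  # sum of the three embeddings, then LayerNorm

text = embed([2, 5, 7], [0, 0, 0])   # word tokens use segment 0
region = embed([11, 12], [1, 1])     # region features use segment 1
print(text.shape, region.shape)      # (3, 8) (2, 8)
```

Because the text and image sequences are embedded independently, each can later be fed to the decomposed transformer on its own.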

3.2 Decomposed VL Transformer

The decomposed VL transformer module uses the pre-trained VinVL weights [31], which form a BERT [3] network. For the late-interaction dataflow, we simply isolate the cross-modal dataflow by feeding the text and image embeddings to the network separately, so that there is no early interaction inside the transformer. As a result, the decomposed VL transformer obtains individual representations for the input images and texts. Since the transformers are mostly pre-trained to produce a joint representation for combined pairwise input, natural language processing researchers [24] have found that pre-trained BERT networks are not good at representing individual sentences. We will show that directly using a pre-trained VL transformer for individual text and image representation also loses its outstanding performance, and that this loss can be largely recovered by simply fine-tuning with a contrastive loss.
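The decomposition can be illustrated with a single shared self-attention layer standing in for the pre-trained transformer weights (the layer, its dimensions, and the random inputs are our assumptions for illustration; VinVL is a full multi-layer BERT):

```python
import numpy as np

# One shared self-attention layer: a stand-in for the pre-trained weights.
rng = np.random.default_rng(1)
DIM = 8
Wq, Wk, Wv = (rng.normal(size=(DIM, DIM)) for _ in range(3))

def self_attend(x):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(DIM)
    attn = np.exp(scores - scores.max(-1, keepdims=True))
    attn /= attn.sum(-1, keepdims=True)
    return attn @ v

text_feats = rng.normal(size=(5, DIM))   # 5 text token embeddings
image_feats = rng.normal(size=(7, DIM))  # 7 region-feature embeddings

# Early interaction would concatenate text and image before attention.
# The decomposed model instead runs the SAME weights on each modality
# separately, so no token ever attends across modalities.
text_out = self_attend(text_feats)       # attends only within the text
image_out = self_attend(image_feats)     # attends only within the image
print(text_out.shape, image_out.shape)   # (5, 8) (7, 8)
```

The key point is weight sharing: both modalities pass through the same decomposed network, matching Fig. 2.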

3.3 Contrastive Representation

The outputs of the decomposed VL transformer then pass through an average pooling layer with an activation to produce the representation $v$ (for an image) or $t$ (for a text). A contrastive learning loss is applied to minimize the cosine distance between semantically aligned samples and maximize the distance between dissimilar samples. We do not follow the common practice in transformers of using the [CLS] vector as the representation, for two reasons. First, [CLS] is assigned at the beginning of the text tokens, followed by the image features, so its position is not symmetric with respect to the text and image inputs during pre-training. Second, during pre-training, the transformer outputs of the text tokens and image features are used for self-supervision objectives, e.g., masked language prediction [3], so they are likely to keep the information of the input and are therefore suitable for representation. The experimental results will show that the average pooling output is more effective than the [CLS] vector.
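A minimal sketch of this pooling step, assuming toy hidden states and omitting the activation (whose exact form the excerpt does not specify); the L2 normalization makes cosine similarity a plain dot product:

```python
import numpy as np

def represent(hidden_states):
    # Average over the token dimension (instead of taking a
    # [CLS]-style first vector), then L2-normalise.
    pooled = hidden_states.mean(axis=0)
    return pooled / np.linalg.norm(pooled)

rng = np.random.default_rng(2)
text_hidden = rng.normal(size=(5, 8))    # toy transformer outputs for a text
image_hidden = rng.normal(size=(7, 8))   # toy transformer outputs for an image
t = represent(text_hidden)
v = represent(image_hidden)
print(round(float(np.linalg.norm(t)), 6))  # 1.0 (unit norm)
print(float(t @ v))                        # cosine similarity in [-1, 1]
```

With unit-norm representations, the contrastive objective of Sec. 3.5 and the retrieval ranking of Sec. 3.6 both reduce to dot products.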

3.4 Bi-modal Hard Negative

Hard negative samples have been shown to be effective for learning high-quality representations in contrastive learning [6, 7]. For VL alignment, we propose to construct hard negative samples from both the image and text perspectives. In the contrastive objective, a hard negative text is used as an additional negative sample for the aligned image, and vice versa.

The hard negative texts are constructed by replacing verbs, nouns, adjectives, and adverbials in the sentence, each selected with a fixed replacement ratio. The selected nouns are replaced with randomly sampled object tags; the other selected words are replaced with their antonyms in WordNet [18]. It is challenging to construct a hard negative image directly from the raw image. Therefore, we create hard negatives by changing the objects, including the tag tokens and the corresponding features: the objects that appear in the aligned text are replaced with random objects from the mini-batch, and half of the remaining objects are also replaced to increase variation.
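The text-side construction can be sketched as below. The word lists, the toy antonym table, and the function name are our illustrative stand-ins for the POS tagging and WordNet lookup the paper describes:

```python
import random

# Toy stand-ins for WordNet antonyms and detected object tags.
ANTONYMS = {"small": "large", "sitting": "standing"}
OBJECT_TAGS = ["dog", "bench", "car"]

def hard_negative_text(tokens, nouns, replace_prob, rng):
    out = []
    for tok in tokens:
        if rng.random() < replace_prob:
            if tok in nouns:
                out.append(rng.choice(OBJECT_TAGS))  # noun -> random object tag
            elif tok in ANTONYMS:
                out.append(ANTONYMS[tok])            # other words -> antonym
            else:
                out.append(tok)                      # no replacement available
        else:
            out.append(tok)
    return out

rng = random.Random(0)
caption = ["a", "small", "cat", "sitting", "on", "a", "mat"]
print(hard_negative_text(caption, {"cat", "mat"}, 1.0, rng))
```

The resulting sentence stays grammatically close to the original while contradicting it semantically, which is what makes it a hard negative.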

3.5 Optimization with Contrastive Learning

To maintain the performance of the transformer after decomposing, we propose to train VLDeformer with a contrastive learning objective. The pre-trained VL transformer is therefore decomposed with a contrastive learning loss. In a mini-batch of $N$ text-image pairs, we regard the aligned pairs as the positive samples and the other combinations as negatives. We use the objective in Eq. 1 to pull semantically close image representations toward the text representation and push non-close samples apart:

\[ \mathcal{L}_{t2i} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(\mathrm{sim}(t_i, v_i)/\tau)}{\sum_{j=1}^{N} \exp(\mathrm{sim}(t_i, v_j)/\tau)}, \tag{1} \]

where $\tau$ is a temperature hyper-parameter and $\mathrm{sim}(\cdot,\cdot)$ is the cosine similarity. The term can also be regarded as optimizing text-to-image retrieval in a mini-batch.

Symmetric to $\mathcal{L}_{t2i}$, we use the loss term in Eq. 2 to learn the image-to-text condition:

\[ \mathcal{L}_{i2t} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(\mathrm{sim}(v_i, t_i)/\tau)}{\sum_{j=1}^{N} \exp(\mathrm{sim}(v_i, t_j)/\tau)}. \tag{2} \]

The complete contrastive learning loss is the sum of these two terms:

\[ \mathcal{L} = \mathcal{L}_{t2i} + \mathcal{L}_{i2t}. \tag{3} \]

Since the main goal of this paper is to show the effectiveness of contrastive learning for VL transformer decomposing, we find that this simple contrastive loss suffices to maintain performance comparable to the backbone VL transformer. Other self-supervision signals could also be useful, which is left for future work.
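The symmetric objective of this section can be sketched in a few lines of numpy (a minimal InfoNCE-style sketch under the assumption of unit-normalised embeddings, where the diagonal of the similarity matrix holds the positive pairs; sizes and the temperature value are illustrative):

```python
import numpy as np

def info_nce(a, b, tau):
    # a, b: (N, D) unit-norm embeddings; row i of a aligns with row i of b.
    logits = (a @ b.T) / tau                      # pairwise cosine similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))           # positives on the diagonal

rng = np.random.default_rng(3)
t = rng.normal(size=(4, 8)); t /= np.linalg.norm(t, axis=1, keepdims=True)
v = rng.normal(size=(4, 8)); v /= np.linalg.norm(v, axis=1, keepdims=True)

# L = L_t2i + L_i2t: the same loss applied in both directions.
loss = info_nce(t, v, tau=0.07) + info_nce(v, t, tau=0.07)
print(loss > 0)  # True: cross-entropy is strictly positive
```

Note that `info_nce(t, v, tau)` corresponds to Eq. 1 and `info_nce(v, t, tau)` to Eq. 2; the denominator runs over all candidates in the mini-batch, which is what pushes non-aligned pairs apart.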

3.6 VLDeformer based Cross-modal Retrieval

Figure 3: Illustration of the text-to-image retrieval process with VLDeformer.

VLDeformer modifies the VL transformer into an individual encoder and therefore enables encoding the retrieval contents offline. For example, in text-to-image retrieval, the images are encoded into embeddings offline, so that the online computation only includes the query encoding and the cosine similarity, which is the main reason for the retrieval speed acceleration.

In this part, we take text-to-image retrieval as an example to introduce the retrieval process. Formally, the image set is denoted as $V = \{v_1, \dots, v_M\}$, where $M$ is the image set size, and the query text is denoted as $q$.

In the offline encoding stage, the images are processed following Sec. 3.1 to get the position, segment, and token embeddings and then passed to the VLDeformer model to get the image embeddings $e_{v_1}, \dots, e_{v_M}$. These image embeddings can be reused for comparison with every text query.

During online retrieval, the query text is first transformed into position, segment, and token embeddings and then encoded into the query embedding $e_q$. The indices of the top-$K$ images related to the query text are calculated as in Eq. 4:

\[ \mathrm{top}K = \operatorname*{arg\,topK}_{i \in \{1,\dots,M\}} \cos(e_q, e_{v_i}), \tag{4} \]

and the top-$K$ retrieved images are then obtained from the image set.
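The two-stage retrieval process can be sketched as follows, with a toy averaging "encoder" standing in for the decomposed VL transformer (the encoder, corpus size, and dimensions are our illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)

def encode(x):
    # Placeholder for VLDeformer encoding: pool and L2-normalise.
    e = x.mean(axis=0)
    return e / np.linalg.norm(e)

# Offline stage: embed the whole image set once and keep the matrix.
image_set = [rng.normal(size=(7, 8)) for _ in range(100)]
index = np.stack([encode(img) for img in image_set])  # (100, 8), reusable

# Online stage: encode one text query, rank all images by cosine similarity.
def retrieve(query_tokens, index, k):
    q = encode(query_tokens)
    scores = index @ q                 # dot product = cosine (unit norms)
    return np.argsort(-scores)[:k]     # indices of the top-K images

top5 = retrieve(rng.normal(size=(5, 8)), index, k=5)
print(top5.shape)  # (5,)
```

Only the query encoding and one matrix-vector product happen online; the index matrix is computed once and reused, which is the source of the speed-up discussed in Sec. 4.4.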

3.7 Implementation Details

All processed images are first resized, and then regions of interest are detected together with the corresponding features and object tags. The maximum sequence length of the text tokens, the batch size, and the temperature for the contrastive decomposing are set as fixed hyper-parameters. During the contrastive decomposing, the AdamW optimizer is adopted with weight decay. The VLDeformer is trained on an NVIDIA DGX with 8 V100 GPUs, running Ubuntu 18.04 with CUDA and PyTorch.


4 Experiments

4.1 Datasets and Evaluation Protocols

Methods Flickr30k Test (1k images) COCO Test (1k images) COCO Test (5k images)
Text Retrieval Image Retrieval Text Retrieval Image Retrieval Text Retrieval Image Retrieval
R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10
LightningDOT 83.9 97.2 98.6 69.9 91.1 95.2 - - - - - - 70.0 91.6 95.5 54.0 80.8 88.5
+Reranker[26] 87.2 98.3 99.0 75.6 94.0 96.5 - - - - - - 74.2 92.4 96.0 57.4 82.7 89.9
UnicoderVL[13] 86.2 96.3 99.0 71.5 90.9 94.9 84.3 97.3 99.3 69.7 93.5 97.2 62.3 87.1 92.8 46.7 76.0 85.3
Uniter[2] 86.9 98.1 99.2 75.5 94.0 96.6 - - - - - - 65.7 88.6 93.8 52.9 79.9 88.0
ImageBERT[21] 87.0 97.6 99.2 73.1 92.6 96.0 85.4 98.7 99.8 73.6 94.3 97.2 66.4 89.8 94.4 50.5 78.7 87.1
Oscar-base[15] - - - - - - 88.4 99.1 99.8 75.7 95.2 98.3 70.0 91.1 95.5 54.0 80.8 88.5
VinVL-base[31] - - - - - - 89.8 98.8 99.7 78.2 95.6 98.0 74.6 92.6 96.3 58.1 83.2 90.1
VLDeformer 91.7 98.7 99.6 74.2 89.4 91.3 89.2 98.9 99.9 75.9 95.4 98.0 71.9 91.8 96.2 54.7 81.0 88.6
Table 2: Cross-modal retrieval comparison with VL transformers on the COCO and Flickr30k datasets.
Methods COCO Test (5k images) Flickr30k Test (1k images) Pre-training data
Text Retrieval Image Retrieval Text Retrieval Image Retrieval
R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10
Two-branch encoders
CAAN[32] 52.5 83.3 90.9 41.2 70.3 82.9 70.1 91.6 97.2 52.8 79.0 87.9 -
IMRAM[1] 53.7 83.2 91.0 39.7 69.1 79.8 74.1 93.0 96.6 53.9 79.4 87.2 -
SGRAF[4] 57.8 - 91.6 41.9 - 81.3 77.8 94.1 97.4 58.5 83.0 88.8 -
DIME[22] 59.3 85.4 91.9 43.1 73.0 83.1 81.0 95.9 98.4 63.6 88.1 93.0 -
Pre-trained two-branch encoders
ALIGN-small[9] 52.0 - - 39.2 - - - - - - - - 180M
ALIGN[9] 77.0 93.5 96.9 59.9 83.3 89.8 95.3 99.8 99.9 84.9 97.4 98.6 1800M
LightningDOT[26] 70.0 91.6 95.5 54.0 80.8 88.5 83.9 97.2 98.6 69.9 91.1 95.2 9.5M
VLDeformer 71.9 91.8 96.2 54.7 81.0 88.6 91.7 98.7 99.6 74.2 89.4 91.3 8.8M
Table 3: Comparison results with two-branch encoders (Fig. 1(b)) and pre-trained two-branch encoders (Fig. 1(c)) on the COCO and Flickr30k datasets. The pre-training data column gives the number of text-image pairs.

Datasets The COCO [16] and Flickr30k [20] datasets are used for optimization and evaluation. Each image in the two datasets has five caption texts. The COCO dataset contains 123K images, divided into 113K train, 5K valid, and 5K test images. We also use the common 1k test split for comprehensive evaluation. The Flickr30k dataset contains 31K images, divided into 29K/1K/1K for train, valid, and test.

Evaluation Protocols The retrieval performance is measured by the recall at top $K$ samples (R@$K$). Following common practice, three values, R@1, R@5, and R@10, are reported for text-to-image retrieval and vice versa.

We evaluate the retrieval speed on the text-to-image retrieval task. We record the time cost of computing the similarity of all text-image pairs on 1k, 5k, and 10k data. We also record the time of a single query response, which calculates the similarity between one text query and all the images. We calculate the P95 and P99.99 percentile values of the single-response time over 1k query sentences on 10k images.
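Both protocols are straightforward to implement; the sketch below (with synthetic scores and latencies, which are our assumptions for illustration) shows R@K over a query-by-item score matrix and the latency percentiles:

```python
import numpy as np

def recall_at_k(scores, k):
    # scores[i, j]: similarity of query i to item j; ground truth is item i.
    ranks = np.argsort(-scores, axis=1)[:, :k]
    hits = [i in ranks[i] for i in range(scores.shape[0])]
    return sum(hits) / scores.shape[0]

rng = np.random.default_rng(5)
scores = rng.normal(size=(100, 100))
np.fill_diagonal(scores, 100.0)       # force every ground truth to rank first
print(recall_at_k(scores, 1))         # 1.0

# Percentile latencies over simulated per-query response times (ms).
latencies_ms = rng.exponential(scale=16.0, size=1000)
p95 = np.percentile(latencies_ms, 95)
p9999 = np.percentile(latencies_ms, 99.99)
print(p95 <= p9999)                   # True: higher percentile, higher bound
```

P99.99 captures worst-case tail latency, which is why Table 4 reports it alongside the average.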

4.2 Cross-modal Retrieval Accuracy

4.2.1 Comparison with VL transformers

Table 2 shows the comparison results between the VLDeformer network and pre-trained VL transformers on the COCO and Flickr30k datasets. VLDeformer achieves performance close to the backbone VinVL-base model and even outperforms it on R@5 and R@10 on the COCO 1k text retrieval set. The scores are also higher than those of other VL transformers such as Unicoder-VL and ImageBERT. Compared with the pre-trained two-branch transformer LightningDOT, VLDeformer achieves results comparable to LightningDOT with a reranker. Since the reranker is a post-processing technique that relies on another VL transformer [26], we also compare VLDeformer with the pure LightningDOT encoder and show its superior performance on both the COCO 5k and Flickr30k datasets.

We notice that there is still a performance gap between VLDeformer and the backbone VinVL model, which is most obvious on the R@1 metrics for image retrieval on the COCO 5k and 1k sets. Interestingly, the gap is small on R@5 and R@10, which means that many ground-truth images are not hit by the top-1 result but are recalled within the top-5 records, indicating that future work on hard negatives is likely to benefit the accuracy. Besides, based on the common sense that many images can depict the same semantics, we are also curious about the bad top-1 qualitative cases, which are analyzed in Sec. 4.3.

4.2.2 Comparison with Visual-Semantic Embeddings

Now that VLDeformer achieves performance comparable to the VL transformers as an individual encoder, we wonder how it compares with other cross-modal embedding models. As shown in Table 3, both VLDeformer and the other pre-trained models substantially outperform the embedding models without pre-training, such as CAAN and DIME.

It is worth noting that the performance of pre-trained cross-modal embeddings varies with the pre-training data scale. For example, the ALIGN model, trained on the largest dataset, outperforms the other models. However, the data of ALIGN is hundreds of times larger than ours and than that of LightningDOT, making a fair comparison of these models difficult. Moreover, the smaller ALIGN-small model, trained on 180M text-image pairs, shows a dramatic performance drop as the data scale decreases. Since VLDeformer outperforms the pure LightningDOT model, which is pre-trained with a similar data size, we conclude that VLDeformer is the most effective individual encoder at this data scale. On the other hand, the performance gap between VLDeformer and the ALIGN model indicates that using larger pre-training data may bring further improvement, which is left as future work.

Figure 4: Top-5 retrieved images for queries where VLDeformer flips VinVL from right to wrong at R@1. Each row shows the query text, the ground truth, and the Top@1 to Top@5 results. (Better viewed with zoom-in)

4.3 Qualitative Case Analysis

Since the R@$K$ metrics only calculate the hit ratio of the single aligned ground-truth image, they may be affected by other semantically similar samples. Therefore, we inspect the cases that are properly predicted by the backbone VinVL model at top-1 but flipped by VLDeformer. Fig. 4 shows the top-5 retrieved images for such cases. Interestingly, many images share the same semantics as the query text although they are not the ground truth, e.g., "two giraffes next to a pole" or "four-way stop signs". For such queries, the top-1 metric is not suitable for judging the retrieval results. Some queries, like the third case, have rough semantics that could be aligned with a wide range of images, e.g., "table with different food"; these also decrease the top-$K$ metrics because it is hard to recall the ground truth within the top-1 or even the top-5 records.

The samples also show some limitations of VLDeformer. For example, the fourth case mainly focuses on the “clock mounted on outdoor post” but fails to distinguish the “roman number” on the dial, indicating that more detailed matching is necessary for future works.

4.4 Retrieval Efficiency Analysis

4.4.1 Text-to-image Retrieval

In this experiment, the model compares the similarity between all the text-image pairs. We compare VLDeformer with the backbone VinVL-base and the LightningDOT model on the same machine using one V100 GPU. The batch size of every tested model is set to 400. The 1k and 5k text-image pairs are from the COCO test set and the 10k pairs from the COCO train dataset. Since VinVL costs too much time on the 5k and 10k data, we only run the first 1k batches and estimate the total time from the average per-batch time cost.

Figure 5: Text-to-image retrieval time on 1k and 5k and 10k image corpus with mini-batch size.

The time costs are shown in Fig. 5. VinVL takes a very long time (about 4 hours on the 1k data and even more on larger data), and its time cost grows quadratically with the data size. In contrast, the late-interaction VLDeformer costs substantially less time: it achieves more than 1,000 times acceleration over VinVL. Both VLDeformer and LightningDOT show linear time-cost curves as the data size increases, but LightningDOT costs more time than VLDeformer, likely because it is built on the larger BERT-large network. It is worth noting that when LightningDOT uses an Oscar-large [15] reranker to achieve accuracy comparable to VLDeformer, its retrieval time increases by an order of magnitude.

4.4.2 Single Query Response Time

Model Batch size Avg Min Max P95 P99.99 Mem
VinVL 250 758 632 16634 846 10567 3.66
500 375 338 7172 398 6130 5.73
1000 197 168 3903 228 3542 9.84
VLDeformer 250 18 16 360 17 31 1.59
500 16 15 369 17 51 1.60
1000 16 15 382 17 33 1.60
Table 4: Single Query Response time (ms) and corresponding memory space (GB) of single text-to-image retrieval query on 10k images.

We further simulate a text-to-image search service and test the response time. The image embeddings of VLDeformer are computed offline, as in a real-world retrieval scenario. As shown in Table 4, VLDeformer achieves an average response time of about 16 ms, with 95% of queries finishing within 17 ms and 99.99% within about 50 ms. The response time of VinVL can be reduced by increasing the batch size, but at the cost of memory usage. In comparison, VLDeformer is stable in both time and space cost because it encodes the images offline, so the main computations for one query are the sentence encoding and the cosine similarity operation.

4.5 Ablation study

To verify the designed components of VLDeformer, we conduct an ablation study on the COCO 1k test set. The compared models are trained with the same hyper-parameters and epochs. We also conduct two comparisons to verify the importance of decomposing and pre-training the VL transformer. The results are shown in Table 5.

How much do average pooling/contrastive learning loss matter? We test the performance of using the [CLS] feature (w/o avg pool) to verify the selection of the average pooling output for representation. The performance decreases significantly on all metrics, confirming the hypothesis in Sec. 3 that the averaged features are more suitable than the [CLS] feature for representing the input information. To evaluate the effectiveness of the contrastive learning loss in decomposing, we replace the objective with a pairwise cosine similarity loss (w/o contrastive). As a result, the metrics decrease dramatically, especially on R@1, proving that the contrastive learning loss is essential for distinguishing similar samples and maintaining the performance of the backbone VL transformer.
How much does the pre-trained VL transformer matter to VLDeformer? We test the retrieval performance of VLDeformer with a randomly initialized transformer instead of the pre-trained VinVL weights (w/o pre-train). The model still achieves performance comparable to the cross-modal embedding models without pre-training shown in Table 3. However, it has a large performance gap to VLDeformer and the pre-trained VL transformers. Therefore, pre-training the VL transformer is necessary to achieve state-of-the-art performance; in other words, some of the knowledge in the pre-trained VL transformer is kept after the decomposing.
How much does the decomposing matter for the pre-trained VL transformer? Since pre-training the VL transformer benefits the VLDeformer performance, we wonder how the pre-trained model works without decomposing. Therefore, we use the VinVL model as an individual embedding encoder that produces representations for the texts and images and then test its retrieval ability (w/o decompose). As shown in the last row, the metrics are very low, indicating that the decomposing stage is necessary to transfer the VL transformer into an individual encoder while maintaining the performance.

Methods          |   Text Retrieval    |   Image Retrieval
                 |  R@1   R@5   R@10   |  R@1   R@5   R@10
VLDeformer       | 89.2  98.9   99.9   | 75.9  95.4   98.0
w/o contrastive  | 72.5  95.8   99.0   | 60.7  90.2   96.2
w/o avg pool     | 83.0  95.3   96.6   | 69.7  87.8   90.4
w/o pre-train    | 81.5  97.3   98.8   | 64.8  91.6   95.9
w/o decompose    |  0.3   1.0    2.0   |  0.1   0.2    1.6
Table 5: Ablation study of VLDeformer on COCO 1k Test. (w/o decompose is tested using the pre-trained VinVL transformer as an individual encoder)

5 Conclusion

We proposed the Vision-Language Decomposed Transformer (VLDeformer), which converts a VL transformer into an individual encoder for single images or texts, achieving a retrieval speedup of three orders of magnitude. We further proposed training the image and text representations through contrastive learning, which enables VLDeformer to retain the outstanding accuracy of the backbone VL transformer. The model achieves superior performance on the COCO and Flickr30k datasets. Future work may explore using more training data to further improve performance.


  • [1] Hui Chen, Guiguang Ding, Xudong Liu, Zijia Lin, Ji Liu, and Jungong Han. Imram: Iterative matching with recurrent attention memory for cross-modal image-text retrieval. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12652–12660, 2020.
  • [2] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Universal image-text representation learning. In European conference on computer vision, pages 104–120, 2020.
  • [3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.
  • [4] Haiwen Diao, Ying Zhang, Lin Ma, and Huchuan Lu. Similarity reasoning and filtration for image-text matching. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 1218–1226, 2021.
  • [5] Aviv Eisenschtat and Lior Wolf. Linking image and text with 2-way nets. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4601–4611, 2017.
  • [6] Fartash Faghri, David J. Fleet, Jamie Ryan Kiros, and Sanja Fidler. Vse++: Improving visual-semantic embeddings with hard negatives. In BMVC, page 12, 2018.
  • [7] Tianyu Gao, Xingcheng Yao, and Danqi Chen. Simcse: Simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821, 2021.
  • [8] Jiuxiang Gu, Jianfei Cai, Shafiq R Joty, Li Niu, and Gang Wang. Look, imagine and match: Improving textual-visual cross-modal retrieval with generative models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7181–7189, 2018.
  • [9] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. arXiv preprint arXiv:2102.05918, 2021.
  • [10] Melvin Johnson, Mike Schuster, Quoc V Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351, 2017.
  • [11] Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. Stacked cross attention for image-text matching. In Proceedings of the European Conference on Computer Vision, pages 201–216, 2018.
  • [12] Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. Stacked cross attention for image-text matching. In Proceedings of the European Conference on Computer Vision (ECCV), pages 201–216, 2018.
  • [13] Gen Li, Nan Duan, Yuejian Fang, Ming Gong, and Daxin Jiang. Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 11336–11344, 2020.
  • [14] Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019.
  • [15] Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et al. Oscar: Object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision, pages 121–137, 2020.
  • [16] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
  • [17] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv preprint arXiv:1908.02265, 2019.
  • [18] George A Miller. Wordnet: a lexical database for english. Communications of the ACM, 38(11):39–41, 1995.
  • [19] Hyeonseob Nam, Jung-Woo Ha, and Jeonghee Kim. Dual attention networks for multimodal reasoning and matching. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 299–307, 2017.
  • [20] Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision, pages 2641–2649, 2015.
  • [21] Di Qi, Lin Su, Jia Song, Edward Cui, Taroon Bharti, and Arun Sacheti. Imagebert: Cross-modal pre-training with large-scale weak-supervised image-text data. arXiv, pages 1–12, 2020.
  • [22] Leigang Qu, Meng Liu, Jianlong Wu, Zan Gao, and Liqiang Nie. Dynamic modality interaction modeling for image-text retrieval. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1104–1113, 2021.
  • [23] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.
  • [24] Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 3982–3992, 2019.
  • [25] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. Vl-bert: Pre-training of generic visual-linguistic representations. In International Conference on Learning Representations, 2019.
  • [26] Siqi Sun, Yen-Chun Chen, Linjie Li, Shuohang Wang, Yuwei Fang, and Jingjing Liu. Lightningdot: Pre-training visual-semantic embeddings for real-time image-text retrieval. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 982–997, 2021.
  • [27] Hao Tan and Mohit Bansal. Lxmert: Learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 5100–5111, 2019.
  • [28] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pages 6105–6114. PMLR, 2019.
  • [29] Liwei Wang, Yin Li, Jing Huang, and Svetlana Lazebnik. Learning two-branch neural networks for image-text matching tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2):394–407, 2018.
  • [30] Yaxiong Wang, Hao Yang, Xiuxiu Bai, Xueming Qian, Lin Ma, Jing Lu, Biao Li, and Xin Fan. Pfan++: Bi-directional image-text retrieval with position focused attention network. IEEE Transactions on Multimedia, pages 1–1, 2020.
  • [31] Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. Vinvl: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5579–5588, 2021.
  • [32] Qi Zhang, Zhen Lei, Zhaoxiang Zhang, and Stan Z Li. Context-aware attention network for image-text retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3536–3545, 2020.
  • [33] Zhedong Zheng, Liang Zheng, Michael Garrett, Yi Yang, Mingliang Xu, and Yi-Dong Shen. Dual-path convolutional image-text embeddings with instance loss. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 16(2):1–23, 2020.