
LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval

by   Siqi Sun, et al.

Multimodal pre-training has propelled great advancement in vision-and-language research. These large-scale pre-trained models, although successful, fatefully suffer from slow inference speed due to enormous computation cost mainly from cross-modal attention in Transformer architecture. When applied to real-life applications, such latency and computation demand severely deter the practical use of pre-trained models. In this paper, we study Image-text retrieval (ITR), the most mature scenario of V+L application, which has been widely studied even prior to the emergence of recent pre-trained models. We propose a simple yet highly effective approach, LightningDOT that accelerates the inference time of ITR by thousands of times, without sacrificing accuracy. LightningDOT removes the time-consuming cross-modal attention by pre-training on three novel learning objectives, extracting feature indexes offline, and employing instant dot-product matching with further re-ranking, which significantly speeds up retrieval process. In fact, LightningDOT achieves new state of the art across multiple ITR benchmarks such as Flickr30k, COCO and Multi30K, outperforming existing pre-trained models that consume 1000x magnitude of computational hours. Code and pre-training checkpoints are available at





Code Repositories


source code and pre-trained/fine-tuned checkpoint for NAACL 2021 paper LightningDOT


1 Introduction

Image-text retrieval (ITR) has been widely studied as a staple benchmark task in both NLP and computer vision communities. Traditional ITR search engines typically deploy ranking-based models built upon visual-semantic embedding matching Faghri et al. (2017); Huang et al. (2018) or deep cross-modal fusion with attention mechanism Lee et al. (2018); Li et al. (2020a, b). Earliest works Kiros et al. (2014); Faghri et al. (2017); Wang et al. (2018) employ a separate image encoder (e.g., CNN) and text encoder (e.g., RNN), whose embeddings are then compared by dot product for similarity matching (Figure 1(a)). Later studies Lee et al. (2018, 2019); Wang et al. (2019); Zhang et al. (2020) improve this paradigm by employing an advanced region-level visual encoder (e.g., Faster-RCNN) and applying cross-attention between word features and region features for multimodal fusion (Figure 1(b)).

With the advent of Transformer Vaswani et al. (2017) and BERT Devlin et al. (2019), cross-modal retrieval tasks are more recently dominated by vision-and-language (V+L) pre-trained models, such as ViLBERT Lu et al. (2019), UNITER Chen et al. (2020), OSCAR Li et al. (2020b), and VILLA Gan et al. (2020). Large-scale pre-trained models learned from massive corpus of image-text pairs can power heterogeneous downstream tasks that take diverse modalities as inputs (e.g., text, image, video, audio). These models benefit from the self-attention mechanism in Transformer architecture, learning joint image+text embeddings through pre-training objectives such as masked language modeling (MLM) and masked region modeling (MRM) (Figure 1(c)).

However, the very ingredient that engenders the success of these pre-trained models, cross-modal attention between two modalities (through self-attention), also destines the inevitable latency and huge computation cost in training and deploying such massive-scale models. For example, UNITER Chen et al. (2020) builds upon 12/24 Transformer layers, and trains over 10 million image+text pairs. The inference time of such large models with 110 million parameters is 48 seconds on average for text query from COCO dataset Chen et al. (2015), not scalable in real-life applications serving millions of queries per second.

To make real-time ITR possible with low latency, we ask a bold question: can we go back to the beginning, reverting to simple dot product for efficient cross-modal retrieval? To make this retro experiment feasible, we rely on Transformer to pre-train high-quality image and text encoders, but use efficient dot product for multimodal fusion instead of computationally heavy self-attention. To still facilitate effective cross-modal embedding learning, we use a special [CLS] token on both encoders, which transfers the learned embedding from the other modality (Figure 1(d)). We name this new paradigm LightningDOT, for its lightning speed benefiting from dot product computation.

By removing the time-consuming cross-attention between modalities, the model can learn visual-semantic embeddings without extensive matching between each image-text pair during inference, as used in existing pre-trained models Chen et al. (2020); Li et al. (2020b); Lu et al. (2019). Further, by eliminating the dependency on real-time computation over image-text pairs, we can compute all image and text embeddings independently offline just for once, and reuse these embeddings as cached indexes for new queries on the fly (Figure 2).

For model training, we propose three learning objectives to jointly train two Transformer blocks: Image Encoder and Language Encoder. Specifically, Visual-embedding fused MLM (namely VMLM) and Semantic-embedding fused MRM (namely SMRM) ensure cross-modal information is harnessed even without cross-modality self-attention. A cross-modal retrieval objective (namely CMR) encourages the model to learn multimodal fusion through pre-training. To maintain competitive model performance, we further introduce a re-ranking mechanism to bring back the benefit of cross-attention methods.

In summary, LightningDOT is designed with late fusion to learn visual-semantic embeddings. Experiments on popular ITR benchmarks show that LightningDOT is 600/1900 times faster than existing pre-trained models on Flickr30k/COCO, while achieving new state-of-the-art results. When retrieving from larger candidate pool (>120K images), LightningDOT is 23,000 times faster. To the best of our knowledge, this is the first known effort on improving V+L model efficiency.

Figure 2: An overview of our proposed framework. (a) LightningDOT is pre-trained with Semantic-embedding Fused Masked Region Modeling (SMRM), Visual-embedding Fused Masked Language Modeling (VMLM) and Cross-modal Retrieval (CMR). (b) LightningDOT ITR pipeline (image retrieval as an example). Similarities between input textual query and image candidates are computed via dot product. During inference, image representations can be computed offline, and a re-ranker can be applied for better accuracy, still with significant speedup.

2 Related Work

V+L Pre-training

Inspired by the success of Transformer-based Vaswani et al. (2017) language model pre-training Devlin et al. (2019); Liu et al. (2019); Yang et al. (2019); Raffel et al. (2020); Lan et al. (2020); Clark et al. (2020), vision-and-language pre-training Huang et al. (2020b); Su et al. (2020); Li et al. (2020b, 2019a) has become the prevailing paradigm in learning multimodal representations, with strong results on tasks such as image-text retrieval Kiros et al. (2014), visual question answering Antol et al. (2015) and referring expression comprehension Yu et al. (2016). Exemplary works include two-stream  (Tan and Bansal, 2019; Lu et al., 2019) and single-stream models (Chen et al., 2020; Li et al., 2020a; Zhou et al., 2020). Multi-task learning Lu et al. (2020) and adversarial training Gan et al. (2020) are also explored. This family of pre-training methods aims for general-purpose V+L without computation cost consideration. To the best of our knowledge, our work is the first known effort on pre-training visual-semantic embedding that enables low-latency real-time cross-modal retrieval. Ours is concurrent work with CLIP Radford et al. (2021).

Image-Text Retrieval

Early cross-modal embedding works Kiros et al. (2014); Wang et al. (2018); Faghri et al. (2017) focus on using a two-stream model to learn a unified visual-semantic embedding, with progressive improvement on two popular benchmarks: Flickr30K Plummer et al. (2015) and COCO Chen et al. (2015). Later methods with cross-attention Lee et al. (2018, 2019); Wang et al. (2019); Zhang et al. (2020) become more popular, with significant performance gain. Pre-trained V+L models also fall into this category. By exploiting large-scale image-text datasets, pre-trained V+L models further push the performance on Flickr30K and COCO. Although achieving high recall, cross-attention requires excessive computation cost during inference that cannot be overlooked. [Footnote 2: The total inference time is quadratic in the dataset size with cross-attention for the image-text retrieval task.] In this work, inspired by dense retrieval in text retrieval domain Guu et al. (2020); Karpukhin et al. (2020); Xiong et al. (2020); Mao et al. (2020); Lewis et al. (2020), we propose a more efficient attention-less framework. With pre-training, our model achieves better performance while being significantly faster than cross-modal attention methods. Note that the proposed approach is orthogonal to model compression techniques that reduce the number of layers/parameters Sun et al. (2019); Jiao et al. (2020), since we do not reduce the number of parameters from the UNITER baseline. These two approaches can be combined to further boost the speed, which is an interesting future work direction.

3 LightningDOT Framework

In this section, we present the proposed LightningDOT framework, which consists of two deep Transformers as image and language encoders. We first introduce three tasks designed to pre-train the model, then present our inference pipeline from offline feature extraction to online instant retrieval.

3.1 Model Pre-training

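We denote the Transformer-based Vaswani et al. (2017) image encoder and language encoder by $f_{\theta_V}$ and $f_{\theta_L}$, respectively ($\theta_V$, $\theta_L$ are learnable parameters). Given a dataset of paired image and text $\{(i, t)\}$, we first extract region features $\mathbf{v} = \{\mathbf{v}_0, \mathbf{v}_1, \dots, \mathbf{v}_N\}$ ($\mathbf{v}_j \in \mathbb{R}^{d_v}$, $N$ is the number of regions) for image $i$, along with bounding box positions of regions via a pre-trained Faster-RCNN (Ren et al., 2015; Anderson et al., 2018). [Footnote 3: $\mathbf{v}_0$ is a special [CLS] embedding.] The image encoder $f_{\theta_V}$ encodes this sequence of image regions into a $d$-dimensional space: $f_{\theta_V}(\mathbf{v}) = \mathbf{h} = \{\mathbf{h}_0, \dots, \mathbf{h}_N\}$ ($\mathbf{h}_j \in \mathbb{R}^{d}$). The corresponding text $t$ is tokenized into sub-word units and projected into high-dimensional feature vectors $\mathbf{w} = \{\mathbf{w}_0, \mathbf{w}_1, \dots, \mathbf{w}_T\}$ ($\mathbf{w}_j \in \mathbb{R}^{d_w}$, $T$ is the number of tokens) following Devlin et al. (2019). [Footnote 4: A 30k BPE (Sennrich et al., 2016) vocabulary (bert-base-cased) is used to tokenize the text. A special [CLS] token is also prepended following the common practice ($\mathbf{w}_0$).] Similarly, the text encoding process can be written as $f_{\theta_L}(\mathbf{w}) = \mathbf{z} = \{\mathbf{z}_0, \dots, \mathbf{z}_T\}$ ($\mathbf{z}_j \in \mathbb{R}^{d}$). We regard the output [CLS] embedding $\mathbf{h}_0$ as global image representation, and $\mathbf{z}_0$ as global text representation. The following sections discuss how to jointly train these two encoders to learn strong visual-semantic embeddings, through three pre-training objectives.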

Visual-embedding Fused Masked Language Modeling (VMLM)

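Masked Language Modeling (MLM) pre-training was first proposed by Devlin et al. (2019), where 15% of the words are masked and the model is trained to reconstruct the masked words. [Footnote 5: In practice, this 15% is further decomposed into 10% random words, 10% unchanged, and 80% [MASK].] Formally, we denote $\mathbf{w}_m = \{\mathbf{w}_{m_1}, \dots, \mathbf{w}_{m_M}\}$ as masked tokens, where $m \in \mathbb{N}^M$ is the set of masked indices of size $M$, randomly sampled from the natural numbers $\mathbb{N}$. $\mathbf{w}_{\setminus m}$ are the unmasked words. MLM can be optimized by minimizing the negative log-likelihood:

$$\mathcal{L}_{\text{MLM}}(t) = -\log P_{\theta}(\mathbf{w}_m \mid \mathbf{w}_{\setminus m}) = -\frac{1}{M} \sum_{k=1}^{M} \log P_{\theta_{\text{mlm}}}(\mathbf{w}_{m_k} \mid \mathbf{z}_{m_k}) \qquad (1)$$

where $\theta_{\text{mlm}}$ denotes the additional parameters introduced to map hidden states $\mathbf{z}$ to word probabilities.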

Under the V+L setting, the textual input is usually highly correlated with the image. To leverage this cross-modal relation, we propose visual-embedding fused MLM (VMLM), in which the paired image $i$ is considered as additional input when training the model to reconstruct masked tokens in sentence $t$. The loss function of VMLM can be formulated as:

$$\mathcal{L}_{\text{VMLM}}(t, i) = -\log P_{\theta}(\mathbf{w}_m \mid \mathbf{w}_{\setminus m}, i) = -\frac{1}{M} \sum_{k=1}^{M} \log P_{\theta_{\text{mlm}}}(\mathbf{w}_{m_k} \mid \mathbf{z}_{m_k} + \mathbf{h}_0) \qquad (2)$$

where $\mathbf{z} = f_{\theta_L}(\mathbf{w}_{\setminus m})$ and the word probabilities $P_{\theta}$ are conditioned on the corresponding image $i$ via the global image representation $\mathbf{h}_0$. Although VMLM takes a similar mathematical form to the MLM task proposed in UNITER, they differ in two main aspects: 1) LightningDOT uses two separate encoders ($\mathbf{h}_0$ is computed by $f_{\theta_V}$); and 2) visual dependency is explicitly injected to text representations ($\mathbf{z}_{m_k} + \mathbf{h}_0$), instead of implicitly learned through cross-modal attention.
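The additive fusion in VMLM can be illustrated with a minimal NumPy sketch. It assumes the hidden states of the masked sentence and the global image [CLS] embedding have already been computed by the two encoders, and it replaces the word-prediction head with a single linear projection `vocab_proj`; the function and argument names are hypothetical and not taken from the released code.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def vmlm_loss(text_hidden, image_cls, masked_positions, target_ids, vocab_proj):
    """Negative log-likelihood of masked tokens with image-fused hidden states.

    text_hidden:      (T, d) text-encoder outputs for the masked sentence
    image_cls:        (d,)   global image [CLS] embedding
    masked_positions: (M,)   indices of the masked tokens
    target_ids:       (M,)   vocabulary ids of the original tokens
    vocab_proj:       (d, V) assumed linear head (stand-in for the real MLM head)
    """
    # Cross-modal fusion is a plain addition: no cross-attention, no new parameters.
    fused = text_hidden[masked_positions] + image_cls
    log_probs = log_softmax(fused @ vocab_proj)
    picked = log_probs[np.arange(len(target_ids)), target_ids]
    return -picked.mean()
```

Setting `image_cls` to zeros recovers the plain MLM loss under the same assumptions, which makes the difference between the two objectives easy to inspect.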

Semantic-embedding Fused Masked Region Modeling (SMRM)

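Recent works on V+L pre-training (Lu et al., 2019; Tan and Bansal, 2019) have shown that mask-then-reconstruct pre-training on image regions also helps image+text embedding learning. Similar to MLM, Masked Region Modeling (MRM) is supervised by:

$$\mathcal{L}_{\text{MRM}}(i) = D_{\theta}(\mathbf{v}_m, f_{\theta_V}(\mathbf{v}_{\setminus m})) = \frac{1}{M} \sum_{k=1}^{M} D_{\theta_{\text{mrm}}}(\mathbf{v}_{m_k}, \mathbf{h}_{m_k}) \qquad (3)$$

where $D_{\theta}$ can be any differentiable distance function. Among the variants of MRM, we consider Masked Region Feature Regression (MRFR) with L2 distance and Masked Region Classification with KL-Divergence (MRC-kl), due to their proven success in learning V+L representations (Chen et al., 2020). [Footnote 6: In our implementation, no textual inputs are directly concatenated with image regions due to separate encoding of image and text.] In MRFR, the distance between two feature vectors $\mathbf{x}$ and $\mathbf{y}$ is defined as:

$$D_{\theta_{\text{fr}}}(\mathbf{x}, \mathbf{y}) = \lVert \mathbf{x} - g_{\theta_{\text{fr}}}(\mathbf{y}) \rVert_2^2$$

where $\lVert \cdot \rVert_2$ denotes the $L_2$-norm, and $g_{\theta_{\text{fr}}}(\cdot)$ is a learnable Multi-layer Perceptron (MLP) with parameters $\theta_{\text{fr}}$. The KL-divergence $D_{\text{KL}}$ in MRC-kl measures distance between two probability distributions:

$$D_{\theta_{\text{mrc}}}(\mathbf{x}, \mathbf{y}) = D_{\text{KL}}\big(c(\mathbf{x}) \,\Vert\, g_{\theta_{\text{mrc}}}(\mathbf{y})\big)$$

where $\theta_{\text{mrc}}$ denotes the parameters of a trainable MLP that maps feature vector $\mathbf{y}$ to the object class distribution $c(\mathbf{x})$ predicted by Faster R-CNN.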

To incorporate language information encoded in the paired text, we extend MRM to Semantic-embedding fused MRM (SMRM), where the global text representation $\mathbf{z}_0$ is exploited when reconstructing masked regions:

$$\mathcal{L}_{\text{SMRM}}(i, t) = D_{\theta}(\mathbf{v}_m, f_{\theta_V}(\mathbf{v}_{\setminus m}), t) = \frac{1}{M} \sum_{k=1}^{M} D_{\theta_{\text{mrm}}}(\mathbf{v}_{m_k}, \mathbf{h}_{m_k} + \mathbf{z}_0) \qquad (4)$$

The specific variants SMRFR and SMRC-kl can be derived using the corresponding distance function, which is omitted for simplicity. Note that the cross-modal fusion introduced in both Eqn. (2) and Eqn. (4) uses simple addition, without introducing extra parameters beyond their uni-modal counterparts. Moreover, the extra parameters $\theta_{\text{mlm}}$ and $\theta_{\text{mrm}}$ are not needed at downstream inference, so they will not slow down retrieval.
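As an illustration of the feature-regression variant (SMRFR), the sketch below adds the global text [CLS] embedding to the hidden state of each masked region and regresses the original region feature with a squared L2 distance. The single linear layer (`weight`, `bias`) is an assumed stand-in for the learnable MLP, and all names are hypothetical.

```python
import numpy as np

def smrfr_loss(region_feats, region_hidden, text_cls, masked_positions, weight, bias):
    """Text-fused masked region feature regression (illustrative sketch).

    region_feats:     (N, d_v) original Faster R-CNN features (regression targets)
    region_hidden:    (N, d)   image-encoder outputs for the masked region sequence
    text_cls:         (d,)     global text [CLS] embedding
    masked_positions: (M,)     indices of the masked regions
    weight, bias:     (d, d_v), (d_v,) assumed linear head replacing the MLP
    """
    # Inject the paired sentence by simple addition before reconstruction.
    fused = region_hidden[masked_positions] + text_cls
    predicted = fused @ weight + bias
    diff = region_feats[masked_positions] - predicted
    # Squared L2 distance per masked region, averaged over the M masked regions.
    return (diff ** 2).sum(axis=-1).mean()
```

Because the regression head is used only during pre-training, nothing in this function is needed when serving queries.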

Cross-modal Retrieval Objective (CMR)

Beyond image or text focused reconstructive objectives, we also propose a new pre-training task, Cross-modal Retrieval (CMR), to leverage the paired information between image and text. With this learning objective, the model is optimized to promote a high similarity score for a matched image-sentence pair and a low score for a mismatched pair. The similarity score between query $t$ and image $i$ is defined as:

$$S(t, i) = \langle \mathbf{z}_0, \mathbf{h}_0 \rangle \qquad (5)$$

where $\langle \cdot, \cdot \rangle$ denotes the inner product between two vectors, and $\mathbf{h}_0$ and $\mathbf{z}_0$ are the output [CLS] embeddings from image encoder $f_{\theta_V}$ and language encoder $f_{\theta_L}$, respectively.

In order to capture both image-retrieval and text-retrieval supervision signals in a single forward-backward pass, we propose a bi-directional variant of contrastive loss. Given any matched image-text pair $(i, t)$, we treat text $t$ as the query, sample $n-1$ negative images $\{i_2, i_3, \dots, i_n\}$, and then compute the objective function as:

$$\mathcal{L}_{\text{IR}}^{(t)} = -\log \frac{e^{S(t, i_1)}}{\sum_{k=1}^{n} e^{S(t, i_k)}}$$

where $i_1 := i$. Similarly, we take image $i$ as query ($t_1 := t$), sample $n-1$ negative text, and compute:

$$\mathcal{L}_{\text{TR}}^{(i)} = -\log \frac{e^{S(t_1, i)}}{\sum_{k=1}^{n} e^{S(t_k, i)}}$$

to optimize for text retrieval.

Figure 3: An illustration of the proposed CMR Loss. Note that positive pairs lie in the diagonal of the matrix.

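Following Henderson et al. (2017); Gillick et al. (2019); Karpukhin et al. (2020), we use in-batch negatives to avoid the actual sampling of a negative image or text: given a batch of $n$ positive image-text pairs $\mathcal{B} = \{(i_1, t_1), \dots, (i_n, t_n)\}$, we use all other images from within the batch as negatives ($\{i_j\}$, where $j \in \{1, \dots, n\}$ and $j \neq k$) for every positive pair $(i_k, t_k)$, and vice versa for negative text. The final CMR loss for batch $\mathcal{B}$ is:

$$\mathcal{L}_{\text{CMR}}(\mathcal{B}) = \frac{1}{2n} \sum_{k=1}^{n} \left( \mathcal{L}_{\text{IR}}^{(t_k)} + \mathcal{L}_{\text{TR}}^{(i_k)} \right) \qquad (6)$$

An illustration of $\mathcal{L}_{\text{CMR}}(\mathcal{B})$ is presented in Figure 3. [Footnote 7: The whole similarity matrix can be computed efficiently with one batched matrix multiplication call. This operation can take advantage of GPU hardware with Tensor Cores for faster training.]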
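The in-batch formulation can be sketched in a few lines of NumPy: one matrix multiplication yields every text-image similarity in the batch, positives sit on the diagonal, and a softmax cross-entropy is taken along rows (text as query) and along columns (image as query). This is a minimal sketch; the function names are hypothetical, and averaging the two directions with a 1/(2n) factor is an assumption of the sketch.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def cmr_loss(text_cls, image_cls):
    """Bi-directional contrastive loss with in-batch negatives.

    text_cls, image_cls: (n, d) global [CLS] embeddings; row k of each
    array belongs to the same matched image-text pair.
    """
    n = text_cls.shape[0]
    # sim[a, b] is the dot-product similarity between text a and image b.
    sim = text_cls @ image_cls.T
    diag = np.arange(n)
    # Text as query: normalise over all images in the batch (each row).
    loss_image_retrieval = -log_softmax(sim, axis=1)[diag, diag]
    # Image as query: normalise over all texts in the batch (each column).
    loss_text_retrieval = -log_softmax(sim, axis=0)[diag, diag]
    return (loss_image_retrieval + loss_text_retrieval).sum() / (2 * n)
```

With uninformative (all-zero) embeddings every candidate is equally likely and the loss equals log n; it approaches zero as matched pairs dominate their row and column.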

Through joint pre-training with CMR, VMLM and SMRM, the visual-semantic embeddings learned from image encoder and language encoder can be readily applied to downstream tasks. During finetuning stage, we directly adopt CMR loss to supervise the training process.

3.2 Real-time Inference

For simplicity, we take text-to-image retrieval as an example to introduce the real-time inference pipeline (Figure 2(b)): (i) offline image feature extraction and encoding; (ii) online retrieval with text query; and (iii) online re-ranking with top-retrieved images. Text retrieval is conducted in a symmetric manner.

Offline Feature Extraction

The image retrieval task requires the model to rank every image $i$ in an image database $\mathcal{D}_I$ based on its similarity to a text query $t$. In LightningDOT, we first apply the image encoder $f_{\theta_V}$ to all images in $\mathcal{D}_I$, and cache the resulting global image representations $\mathbf{h}_0$ into an index Johnson et al. (2019) in memory for later use. Note that the entire image-to-index process, including Faster-RCNN feature extraction and Transformer encoding, can all be conducted offline. Therefore, for every new query $t$ at real time, the cached index can be reused for maximum inference time saving.

Online Retrieval

During inference, given a text query $t$, we encode it with the language encoder $f_{\theta_L}$, and then compute its similarity score to the embedding of every image in the image database $\mathcal{D}_I$ (stored in memory index) via Eqn. (5). Finally, the images will be ranked by their similarity scores, from the highest to lowest. In practice, people are more interested in top-$K$ retrieval, with a list of $K$ images $I_t$ satisfying:

$$I_t := \{ i_{m_k} \}_{k=1}^{K}, \quad \text{where } S(t, i_{m_1}) \geq S(t, i_{m_2}) \geq \dots \geq S(t, i_{m_K}) \text{ and } S(t, i_{m_K}) \geq S(t, i) \;\; \forall\, i \in \mathcal{D}_I \setminus I_t \qquad (7)$$

This optimization problem has been well studied, and we use FAISS (Johnson et al., 2019) to solve it in our implementation. It is worth noting that in order to apply fast search, the similarity function has to be decomposable. Therefore, we choose the simple dot product as $S$ instead of a more complicated neural network function. Similarly, for text retrieval, the same architecture can be applied by simply pre-computing the embedding for all sentences and using an image as query instead.
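The offline-index and online-query steps can be sketched as follows. The paper's implementation uses FAISS; this sketch swaps in a brute-force NumPy search so that it stays dependency-free, and the function names `build_index` and `top_k` are hypothetical.

```python
import numpy as np

def build_index(image_embeddings):
    # Offline step: cache one global [CLS] vector per image, shape (N, d).
    return np.ascontiguousarray(image_embeddings, dtype=np.float32)

def top_k(index, query_embedding, k):
    """Online step: return the ids and scores of the k best-matching images."""
    scores = index @ query_embedding           # one dot product per cached image
    k = min(k, scores.shape[0])
    # argpartition selects the k largest scores without sorting the whole pool.
    candidates = np.argpartition(-scores, k - 1)[:k]
    order = candidates[np.argsort(-scores[candidates])]
    return order, scores[order]

index = build_index(np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]))
ids, scores = top_k(index, np.array([1.0, 0.0]), k=2)
print(ids.tolist())  # [0, 2]
```

Only the query has to be encoded at request time; the image side is a lookup into the cached matrix, which is what makes the search decomposable.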

| Model | COCO TR R@1 | COCO TR R@5 | COCO TR R@10 | COCO IR R@1 | COCO IR R@5 | COCO IR R@10 | COCO AR | Flickr30K TR R@1 | Flickr30K TR R@5 | Flickr30K TR R@10 | Flickr30K IR R@1 | Flickr30K IR R@5 | Flickr30K IR R@10 | Flickr30K AR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VSE++ | 41.3 | 69.2 | 81.2 | 30.3 | 59.1 | 72.4 | 58.9 | 52.9 | 80.5 | 87.2 | 39.6 | 70.1 | 79.5 | 68.3 |
| SCO | 42.8 | 72.3 | 83.0 | 33.1 | 62.9 | 75.5 | 61.6 | 55.5 | 82.0 | 89.3 | 41.1 | 70.5 | 81.1 | 69.9 |
| GXN | 42.0 | - | 84.7 | 31.7 | - | 74.6 | - | 56.8 | - | 89.6 | 41.5 | - | 80.0 | - |
| SCAN-single | 46.4 | 77.4 | 87.2 | 34.4 | 63.7 | 75.7 | 64.1 | 67.9 | 89.0 | 94.4 | 43.9 | 74.2 | 82.8 | 75.4 |
| R-SCAN | 45.4 | 77.9 | 87.9 | 36.2 | 65.6 | 76.7 | 65.0 | 66.3 | 90.6 | 96.0 | 51.4 | 77.8 | 84.9 | 77.8 |
| CAMP | 50.1 | 82.1 | 89.7 | 39.0 | 68.9 | 80.2 | 68.3 | 68.1 | 89.7 | 95.2 | 51.5 | 77.1 | 85.3 | 77.8 |
| CAAN | 52.5 | 83.3 | 90.9 | 41.2 | 70.3 | 82.9 | 70.2 | 70.1 | 91.6 | 97.2 | 52.8 | 79.0 | 87.9 | 79.8 |
| ViLBERT | - | - | - | - | - | - | - | - | - | - | 58.2 | 84.9 | 72.8 | - |
| Unicoder-VL | 62.3 | 87.1 | 92.8 | 46.7 | 76.0 | 85.3 | 75.0 | 86.2 | 86.3 | 99.0 | 71.5 | 90.9 | 94.9 | 88.1 |
| UNITER-base | 64.4 | 87.4 | 93.1 | 50.3 | 78.5 | 87.2 | 76.8 | 85.9 | 97.1 | 98.8 | 72.5 | 92.3 | 95.9 | 90.4 |
| UNITER-large | 65.7 | 88.6 | 93.8 | 52.9 | 79.9 | 88.0 | 78.1 | 86.9 | 98.1 | 99.2 | 75.5 | 94.0 | 96.6 | 91.7 |
| OSCAR | 73.5 | 92.2 | 96.0 | 57.5 | 82.8 | 89.8 | 82.0 | - | - | - | - | - | - | - |
| LightningDOT | 60.1 | 85.1 | 91.8 | 45.8 | 74.6 | 83.8 | 73.5 | 83.9 | 97.2 | 98.6 | 69.9 | 91.1 | 95.2 | 89.3 |
| +UNITER-base Re-Ranker | 64.6 | 87.6 | 93.5 | 50.3 | 78.7 | 87.5 | 77.0 | 86.5 | 97.5 | 98.9 | 72.6 | 93.1 | 96.1 | 90.8 |
| +UNITER-large Re-Ranker | 65.7 | 89.0 | 93.7 | 53.0 | 80.1 | 88.0 | 78.2 | 87.2 | 98.3 | 99.0 | 75.6 | 94.0 | 96.5 | 91.8 |
| +OSCAR Re-Ranker | 74.2 | 92.4 | 96.0 | 57.4 | 82.7 | 89.9 | 82.1 | - | - | - | - | - | - | - |

Table 1: Evaluation results on image-to-text and text-to-image retrieval over Flickr30k (1k test images) and COCO (5k test images) test sets. TR: text retrieval; IR: image retrieval. We compare the proposed method with both task-specific models: VSE++ Faghri et al. (2017), GXN Gu et al. (2018), SCO Huang et al. (2018), SCAN Lee et al. (2018), R-SCAN Lee et al. (2019), CAMP Wang et al. (2019) and CAAN Zhang et al. (2020), and V+L pre-trained models: ViLBERT Lu et al. (2019), Unicoder-VL Li et al. (2020a), UNITER Chen et al. (2020) and OSCAR Li et al. (2020b). Models in bold are embedding-based methods without cross-attention.


To further improve retrieval accuracy, we propose a two-stage approach by adopting an optional re-ranking model. In the first stage, we use LightningDOT to retrieve top-$M$ images (or texts), where $M$ is an integer much smaller than the database (index) size. Next, we apply a stronger retrieval model (usually slower due to the use of cross-attention) to re-rank the retrieved top-$M$ pairs from the first stage. The final $M$ similarity scores obtained from the second stage will be used to re-compute the desired top-$K$ retrieval ($K \leq M$) in Eqn. (7). Please refer to Figure 2 for a more detailed visualization. Our experiments show that this two-stage approach can benefit from the best of both worlds: maintaining a constant fast speed per query while achieving state-of-the-art accuracy. [Footnote 8: The computation time of LightningDOT is negligible compared to that of UNITER. Therefore, the empirical speed is proportional to the number of pairs UNITER has to rank: constant $M$ for LightningDOT + UNITER vs. the whole database (index) size for UNITER only.] Another advantage of this pipeline is that it can readily incorporate any advanced model as the re-ranker, thus future stronger image-text retrieval models can take advantage of LightningDOT for better efficiency.
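The two-stage procedure can be sketched as below: a cheap dot-product pass shortlists a fixed number of candidates, and a slower scorer is called only on that shortlist. `rerank_score` is a hypothetical callable standing in for a cross-attention model such as UNITER or OSCAR; its interface is an assumption of this sketch, not the released API.

```python
import numpy as np

def retrieve_then_rerank(index, query_embedding, rerank_score, m, k):
    """Return the ids of the final top-k candidates after re-ranking.

    index:           (N, d) cached candidate embeddings
    query_embedding: (d,)   encoded query
    rerank_score:    callable mapping a candidate id to a relevance score
                     (hypothetical stand-in for a cross-attention model)
    m:               shortlist size from the dot-product stage
    k:               size of the final list, with k <= m
    """
    # Stage 1: score every candidate with a dot product and keep the best m.
    scores = index @ query_embedding
    m = min(m, scores.shape[0])
    shortlist = np.argsort(-scores)[:m]
    # Stage 2: the expensive model sees only m pairs, regardless of N.
    reranked = np.array([rerank_score(int(j)) for j in shortlist])
    order = np.argsort(-reranked)[:k]
    return shortlist[order]
```

The number of calls to the slow model depends only on the shortlist size, which is why the per-query cost stays constant as the candidate pool grows.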

4 Experiments

This section discusses our experiments on pre-training and evaluating LightningDOT on downstream ITR benchmarks.

4.1 Datasets and Metrics

For pre-training, we use pre-processed data provided by Chen et al. (2020), including 4.2 million images with 9.5 million associated captions from COCO (Chen et al., 2015), VG (Krishna et al., 2017), Conceptual Captions (Sharma et al., 2018), and SBU captions (Ordonez et al., 2011).

For evaluation, we use Flickr30k Plummer et al. (2015) and COCO Lin et al. (2014) datasets, which include 31K/123K images, respectively, each associated with 5 human-written captions. Following (Faghri et al., 2017), we split COCO into 114K/5K/5K and Flickr30K into 29K/1k/1k images for train, validation and test.

Downstream performance is measured by recall at $K$ (R@K) for both image and text retrieval tasks. We also use an additional metric “AR”, the average of R@K for all $K$ across both image and sentence retrieval tasks.
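A minimal sketch of the two metrics follows. It assumes the common convention that a query counts as a hit at K when at least one of its ground-truth items appears among the top K results; the function names are hypothetical.

```python
import numpy as np

def recall_at_k(ranked_ids, relevant_ids, k):
    """Percentage of queries with at least one relevant item in the top k.

    ranked_ids:   list of ranked candidate-id lists, one per query
    relevant_ids: list of sets of ground-truth ids, one per query
    """
    hits = [len(set(ranked[:k]) & set(relevant)) > 0
            for ranked, relevant in zip(ranked_ids, relevant_ids)]
    return 100.0 * float(np.mean(hits))

def average_recall(recalls):
    # AR: plain mean of all R@K values across both retrieval directions.
    return float(np.mean(recalls))

# The "R-CNN only" row of Table 4: three text-retrieval and three
# image-retrieval recalls average to the AR reported in that row.
print(round(average_recall([62.2, 85.9, 91.1, 42.0, 70.9, 80.3]), 1))  # 72.1
```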

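| Model | COCO TR R@5 | COCO TR R@10 | COCO TR R@20 | COCO IR R@5 | COCO IR R@10 | COCO IR R@20 | COCO AR | Flickr30K TR R@5 | Flickr30K TR R@10 | Flickr30K TR R@20 | Flickr30K IR R@5 | Flickr30K IR R@10 | Flickr30K IR R@20 | Flickr30K AR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LightningDOT | 40.1 | 51.0 | 62.0 | 28.2 | 37.4 | 47.8 | 44.4 | 69.6 | 78.9 | 86.1 | 51.8 | 62.3 | 72.3 | 70.2 |
| + Re-Ranker-base | 47.9 | 58.5 | 67.8 | 35.7 | 45.2 | 55.2 | 51.7 | 74.2 | 81.7 | 88.2 | 56.9 | 66.7 | 75.6 | 73.9 |
| + Re-Ranker-large | 48.0 | 59.0 | 68.9 | 37.3 | 46.8 | 56.4 | 52.7 | 75.1 | 83.9 | 90.5 | 60.1 | 69.5 | 78.3 | 76.2 |

Table 2: Results on the extreme retrieval setting of full Flickr30k (31K images) and full COCO (123K images) datasets. TR: text retrieval; IR: image retrieval.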

4.2 Results on Flickr30K and COCO

We compare the proposed approach with state-of-the-art methods (with and without pre-training) and report the results in Table 1. Without cross-attention, our method outperforms non-pre-training approaches by large margins on all metrics. Specifically, our model improves over CAAN Zhang et al. (2020) (SOTA method with cross-attention) by 3.3 points (73.5 vs. 70.2) on COCO and 9.5 points (89.3 vs. 79.8) on Flickr30K in terms of AR. When compared with methods without cross-attention (VSE++ Faghri et al. (2017) and SCO Huang et al. (2018)), LightningDOT achieves a gain of nearly 20 points on AR on Flickr30K. Although LightningDOT achieves slightly lower AR than UNITER-base (pre-training method with cross-attention), with a 1.1/3.3-point drop on Flickr30K/COCO (Table 1), it is 600/1900× faster than UNITER at inference time.

We further apply second-stage re-ranking, and use UNITER to score the top-$M$ retrieved image-text pairs from LightningDOT to obtain the final top-$K$ ranked lists. With re-ranking, LightningDOT achieves an instant performance lift, surpassing UNITER on both benchmarks, while still being 46-95 times faster than UNITER. With an even stronger re-ranker, OSCAR, LightningDOT achieves results similar to the state-of-the-art performance on COCO.

| Benchmark | #images | SCAN | Ours | +Re-ranker |
|---|---|---|---|---|
| Flickr30K-test | 1,000 | 1.8× | 639× | 46× |
| COCO-test | 5,000 | 1.9× | 1,927× | 95× |
| Flickr30K-full | 31,014 | 1.8× | 6,591× | 1,255× |
| COCO-full | 123,287 | 1.9× | 23,869× | 2,235× |

Table 3: Speedup w.r.t. UNITER-base. We compare LightningDOT (Ours) and +Re-Ranker, plus a lightweight cross-attention method SCAN Lee et al. (2018). LightningDOT with/without UNITER-base re-ranker is significantly faster.

4.3 Speed & Space Improvement

To demonstrate the efficiency of LightningDOT, we use UNITER-base as baseline to compare inference speed. We also compare with a more lightweight cross-attention method SCAN Lee et al. (2018), which uses GRU Chung et al. (2014) instead of a 12-layer Transformer. All methods are tested on a single TITAN RTX GPU, with batch size of 400. As shown in Table 3, SCAN is 1.8-1.9× faster than UNITER-base across both benchmarks, as the computational cost of GRU is much cheaper than that of Transformer (performance drop is significant though). However, the speedup from SCAN is limited, as it computes cross-attention between each query and all images. On the other hand, LightningDOT is 639× faster than UNITER on Flickr30K. When tested with 5 times more images in COCO, the speedup from LightningDOT is 1,927×. Even with re-ranking, LightningDOT is still much more efficient than UNITER-base (46× faster on Flickr30K and 95× faster on COCO).

To mimic a real-life scenario for image retrieval, where the candidate pool contains hundreds of thousands of images, we combine all images from training, validation and test set to form a larger candidate pool. Note that models are still trained on the training set. Although the number of text queries remains the same, the number of candidate images scales up by more than 20×, at which point cross-attention methods immediately become impractical. We refer to this setting on both benchmarks as Flickr30k-full (31k) and COCO-full (123k). Our algorithm is 6,591× faster on Flickr30k-full and 23,869× faster on COCO-full, which clearly shows the advantage of LightningDOT and its potential in real-world applications. With re-ranking, LightningDOT is still more than 1,000× and 2,000× faster on Flickr30k-full and COCO-full, respectively. In general, for other re-rankers such as OSCAR, our algorithm can approximately speed up inference by $N/M$ times, where $N$ is the number of candidate images, and $M$ is the number of re-ranked images from the top-$M$ results retrieved by LightningDOT.

Similarly, we construct a full setting for text retrieval by combining all text queries from training, validation and test set. Results are summarized in Table 2. Considering that the candidate pool has become more than 20× larger, we adopt recall at top 5, 10 and 20 as evaluation metrics. Our method achieves reasonably good performance, with AR of 44.4 on COCO and 70.2 on Flickr30K. Re-ranking further lifts AR to 52.7 and 76.2. Results from UNITER or SCAN are not included as the computation of pairwise scores is extremely expensive, given the excessive amount of retrieval candidates. While LightningDOT only takes minutes to evaluate, UNITER-base is estimated to take about 28 days to evaluate under the full setting for both image retrieval and text retrieval. [Footnote 9: This estimation is based on the inference time taken by UNITER-base on a smaller dataset.]

In addition, we compare all models under the same setting: cache as much as possible for fastest speed, where our model outperforms others in both speed and space on image retrieval. The proposed algorithm maps each image to a 768-dimensional vector, which only consumes about 300 MB of storage space for the whole COCO dataset. Cross-attention models such as SCAN, UNITER or OSCAR also need to cache image features, which typically requires saving a 36 × 2048 dimensional matrix per image and consumes about 28 GB of storage space for the COCO dataset.

| Method | TR R@1 | TR R@5 | TR R@10 | IR R@1 | IR R@5 | IR R@10 | AR |
|---|---|---|---|---|---|---|---|
| R-CNN only | 62.2 | 85.9 | 91.1 | 42.0 | 70.9 | 80.3 | 72.1 |
| +Image Encoder | 73.4 | 92.5 | 95.6 | 59.5 | 84.5 | 90.3 | 82.6 |
| +PT | 83.5 | 96.4 | 98.7 | 68.6 | 90.5 | 94.8 | 88.8 |
| LightningDOT | 85.2 | 96.4 | 98.7 | 69.9 | 90.4 | 94.5 | 89.2 |

Table 4: Ablation studies on model design over Flickr30K validation set. TR: text retrieval; IR: image retrieval. PT indicates pre-training with MLM+MRM+CMR, while LightningDOT is pre-trained with VMLM+SMRM+CMR.

| LightningDOT | TR R@1 | TR R@5 | TR R@10 | IR R@1 | IR R@5 | IR R@10 | AR |
|---|---|---|---|---|---|---|---|
| No PT | 73.4 | 92.5 | 95.6 | 59.5 | 84.5 | 90.3 | 82.6 |
| PT(CMR) | 75.0 | 93.9 | 97.3 | 61.5 | 85.5 | 91.1 | 84.0 |
| PT(All) | 78.1 | 94.0 | 96.9 | 62.6 | 85.7 | 91.8 | 84.8 |

Table 5: Ablation studies on pre-training tasks over Flickr30K validation set after finetuning on the corresponding training set. TR: text retrieval; IR: image retrieval. All pre-training experiments are conducted on COCO dataset only. PT is short for pre-training. PT(CMR) refers to pre-training using CMR task only, and PT(All) refers to pre-training with all of the three tasks.

4.4 Ablation Studies

We conduct ablation studies on Flickr30K (Table 4) and compare LightningDOT (L4) against 3 ablated instances: (i) “R-CNN only” (L1): image representations are extracted from Faster R-CNN directly, with no image encoder applied; (ii) “+Image Encoder” (L2): regional features are encoded with a 12-layer Transformer as the image encoder; (iii) “+PT” (L3): our model is pre-trained with MLM+MRM+CMR, then finetuned on Flickr30K. Here L1-L4 denote the first to fourth rows of Table 4. Note that the difference between MLM vs. VMLM and MRM vs. SMRM is whether the predictions of masked tokens (regions) rely on infused embeddings from the other modality.

Results show that “R-CNN only” is not sufficient in learning good image representations for ITR task, while image encoder with Transformer architecture can effectively learn contextualized image representations, hence achieving better performance. Pre-trained models (L3-4) generally achieve better performance, compared to non-pretrained models (L1-2). Comparing “+PT” to the full instance of LightningDOT, dependency on the other modality in VMLM and SMRM brings universal performance lift across all metrics. This indicates that these cross-modal dependencies introduced by VMLM and SMRM are effective in learning the association between image and text inputs.

| Method | Multi30K DE | Multi30K FR | Multi30K CS | COCO ZH | COCO JA | Meta-Ave |
|---|---|---|---|---|---|---|
| S-LIWE | 72.1 | 63.4 | 59.4 | 73.6 | 70.0 | 67.7 |
| MULE | 64.1 | 62.3 | 57.7 | 75.9 | 75.6 | 67.1 |
| SMALR | 69.8 | 65.9 | 64.8 | 77.5 | 76.7 | 70.9 |
| M3P | 82.0 | 73.5 | 70.2 | 81.8 | 86.8 | 78.9 |
| UNITER | 85.9 | 87.1 | 85.7 | 88.4 | 85.9 | 86.6 |
| LightningDOT | 83.3 | 83.7 | 82.2 | 87.2 | 82.3 | 83.7 |
| +Re-Ranker | 86.1 | 87.1 | 86.2 | 88.4 | 86.1 | 86.8 |

Table 6: Evaluation on multilingual image-text retrieval over Multi30K and COCO datasets. We compare with task-specific methods: S-LIWE (Wehrmann et al., 2019), MULE (Kim et al., 2020), SMALR (Burns et al., 2020), pre-trained method M3P Huang et al. (2020a) and UNITER with translate-test. Numbers in blue indicate the use of different dev/test splits of COCO compared to other methods. UNITER and Re-ranker are large model size.

In addition, we investigate the effectiveness of each pre-training task in Table 5. Compared to the baseline without pre-training, pre-training with CMR alone lifts AR by 1.4 points (82.6 to 84.0). Pre-training with all three tasks achieves the best performance, indicating that the learning of contextualized word and region representations promotes better global alignment between image and text, and these three pre-training tasks work collaboratively to yield better visual-semantic embeddings.

Figure 4: Retrieved top 10 images from the query "Sky view of a blue and yellow biplane flying near each other." The ground truth is in the red rectangle.

4.5 Multilingual Image-Text Retrieval

We further report results on multilingual image-text retrieval tasks. Specifically, we evaluate LightningDOT under the translate-test setting, in which the test captions in other languages are translated to English with a Machine Translation (MT) tool (we use the Microsoft Azure Translation API Service). Note that our method is only trained on English captions, without exploiting the original or translated captions from multilingual benchmarks.

We consider two benchmarks: Multi30K (Elliott et al., 2016, 2017; Barrault et al., 2018) with captions in German, French and Czech; and COCO Japanese (Yoshikawa et al., 2017) and Chinese (Li et al., 2019b).

Average Recall (AR) is used as the evaluation metric. Meta-Ave, the average of AR over different languages across two benchmarks, is used as a global metric. More details on multilingual ITR benchmarks are included in Appendix.
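As a quick check of how the global metric is formed, the sketch below recomputes Meta-Ave as the plain mean of the five per-language AR values reported in Table 6. The dictionary layout and the function name `meta_ave` are illustrative choices, not part of the paper. The row label `M3P` is taken from the Huang et al. (2020a) reference.

```python
# Per-language AR values copied from Table 6.
# Order: DE, FR, CS (Multi30K), then ZH, JA (COCO).
TABLE6_AR = {
    "S-LIWE": [72.1, 63.4, 59.4, 73.6, 70.0],
    "MULE": [64.1, 62.3, 57.7, 75.9, 75.6],
    "SMALR": [69.8, 65.9, 64.8, 77.5, 76.7],
    "M3P": [82.0, 73.5, 70.2, 81.8, 86.8],
    "UNITER": [85.9, 87.1, 85.7, 88.4, 85.9],
    "LightningDOT": [83.3, 83.7, 82.2, 87.2, 82.3],
    "LightningDOT+Re-Ranker": [86.1, 87.1, 86.2, 88.4, 86.1],
}


def meta_ave(ar_per_language):
    """Average AR over all languages across both benchmarks."""
    return sum(ar_per_language) / len(ar_per_language)


# Round to one decimal place, matching the precision used in the table.
META = {name: round(meta_ave(values), 1) for name, values in TABLE6_AR.items()}

for name, value in META.items():
    print(f"{name:>24s}  Meta-Ave = {value:.1f}")
```

Under this reading, each recomputed value agrees with the Meta-Ave column of Table 6 to one decimal place.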

We compare LightningDOT against 3 task-specific methods: S-LIWE (Wehrmann et al., 2019), MULE (Kim et al., 2020) and SMALR (Burns et al., 2020), which all exploit captions in different languages to learn multilingual or language-agnostic word embeddings. We also compare with a pre-trained model, M3P (Huang et al., 2020a), which is pre-trained alternately on image-caption pairs labeled in English and on a cross-lingual corpus in 100 different languages. Note that all methods discussed above are trained/finetuned on captions in different languages. For fair comparison, we report the performance of UNITER under the same translate-test setting, where it is finetuned with English captions only and tested on translated captions.

Table 6 shows performance trends similar to those on the English benchmarks. Compared to both state-of-the-art task-specific methods and pre-trained models, LightningDOT under the translate-test setting achieves a new state of the art on most languages and establishes a strong baseline for future study on these multilingual benchmarks.

4.6 Qualitative Examples

We show an example of image retrieval results in Figure 4 for the query "Sky view of a blue and yellow biplane flying near each other". In addition to the ground-truth image in the red rectangle, all 10 images retrieved by our model are valid retrievals, since multiple keywords ("sky", "blue", "yellow", "airplane", "near") are captured in each image. See Appendix A.4 for more examples.

5 Conclusion

In this paper, we propose a pre-training framework that learns joint visual-semantic embeddings without any cross-attention between modalities. LightningDOT outperforms the previous state of the art, while speeding up inference by 600-2000× on the Flickr30K and COCO image-text retrieval benchmarks. Future work includes extending this efficient training framework to other V+L tasks.


References

  • P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, Cited by: §3.1.
  • S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh (2015) Vqa: visual question answering. In ICCV, Cited by: §2.
  • L. Barrault, F. Bougares, L. Specia, C. Lala, D. Elliott, and S. Frank (2018) Findings of the third shared task on multimodal machine translation. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, Cited by: §A.2, §4.5.
  • A. Burns, D. Kim, D. Wijaya, K. Saenko, and B. A. Plummer (2020) Learning to scale multilingual representations for vision-language tasks. In ECCV, Cited by: §4.5, Table 6.
  • X. Chen, H. Fang, T. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick (2015) Microsoft coco captions: data collection and evaluation server. arXiv preprint arXiv:1504.00325. Cited by: §1, §2, §4.1.
  • Y. Chen, L. Li, L. Yu, A. El Kholy, F. Ahmed, Z. Gan, Y. Cheng, and J. Liu (2020) UNITER: universal image-text representation learning. In ECCV, Cited by: §A.1, Figure 1, §1, §1, §1, §2, §3.1, Table 1, §4.1.
  • J. Chung, C. Gulcehre, K. Cho, and Y. Bengio (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. In Deep Learning and Representation Learning Workshop, Cited by: §A.3, §4.3.
  • K. Clark, M. Luong, Q. V. Le, and C. D. Manning (2020) Electra: pre-training text encoders as discriminators rather than generators. In ICLR, Cited by: §2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, Cited by: §1, §2, §3.1, §3.1.
  • D. Elliott, S. Frank, L. Barrault, F. Bougares, and L. Specia (2017) Findings of the second shared task on multimodal machine translation and multilingual image description. In Proceedings of the Second Conference on Machine Translation, Cited by: §A.2, §4.5.
  • D. Elliott, S. Frank, K. Sima’an, and L. Specia (2016) Multi30K: multilingual English-German image descriptions. In Proceedings of the 5th Workshop on Vision and Language, Cited by: §A.2, §4.5.
  • F. Faghri, D. J. Fleet, J. R. Kiros, and S. Fidler (2017) Vse++: improving visual-semantic embeddings with hard negatives. In BMVC, Cited by: Figure 1, §1, §2, Table 1, §4.1, §4.2.
  • Z. Gan, Y. Chen, L. Li, C. Zhu, Y. Cheng, and J. Liu (2020) Large-scale adversarial training for vision-and-language representation learning. In NeurIPS, Cited by: §1, §2.
  • D. Gillick, S. Kulkarni, L. Lansing, A. Presta, J. Baldridge, E. Ie, and D. Garcia-Olano (2019) Learning dense representations for entity retrieval. In CoNLL, Cited by: §3.1.
  • J. Gu, J. Cai, S. R. Joty, L. Niu, and G. Wang (2018) Look, imagine and match: improving textual-visual cross-modal retrieval with generative models. In CVPR, Cited by: Table 1.
  • K. Guu, K. Lee, Z. Tung, P. Pasupat, and M. Chang (2020) Realm: retrieval-augmented language model pre-training. arXiv preprint arXiv:2002.08909. Cited by: §2.
  • M. Henderson, R. Al-Rfou, B. Strope, Y. Sung, L. Lukács, R. Guo, S. Kumar, B. Miklos, and R. Kurzweil (2017) Efficient natural language response suggestion for smart reply. arXiv preprint arXiv:1705.00652. Cited by: §3.1.
  • H. Huang, L. Su, D. Qi, N. Duan, E. Cui, T. Bharti, L. Zhang, L. Wang, J. Gao, B. Liu, J. Fu, D. Zhang, X. Liu, and M. Zhou (2020a) M3P: learning universal representations via multitask multilingual multimodal pre-training. arXiv preprint arXiv:2006.02635. Cited by: §4.5, Table 6.
  • Y. Huang, Q. Wu, C. Song, and L. Wang (2018) Learning semantic concepts and order for image and sentence matching. In CVPR, Cited by: §1, Table 1, §4.2.
  • Z. Huang, Z. Zeng, B. Liu, D. Fu, and J. Fu (2020b) Pixel-bert: aligning image pixels with text by deep multi-modal transformers. arXiv preprint arXiv:2004.00849. Cited by: §2.
  • X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, F. Wang, and Q. Liu (2020) Tinybert: distilling bert for natural language understanding. In Findings of EMNLP, Cited by: §2.
  • J. Johnson, M. Douze, and H. Jégou (2019) Billion-scale similarity search with gpus. IEEE Transactions on Big Data. Cited by: §3.2, §3.2.
  • A. Karpathy and L. Fei-Fei (2015) Deep visual-semantic alignments for generating image descriptions. In CVPR, Cited by: §A.2.
  • V. Karpukhin, B. Oğuz, S. Min, L. Wu, S. Edunov, D. Chen, and W. Yih (2020) Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906. Cited by: §2, §3.1.
  • D. Kim, K. Saito, K. Saenko, S. Sclaroff, and B. A. Plummer (2020) MULE: Multimodal Universal Language Embedding. In AAAI, Cited by: §4.5, Table 6.
  • R. Kiros, R. Salakhutdinov, and R. S. Zemel (2014) Unifying visual-semantic embeddings with multimodal neural language models. In Deep Learning and Representation Learning Workshop, Cited by: §1, §2, §2.
  • R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, et al. (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. IJCV. Cited by: §4.1.
  • Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut (2020) Albert: a lite bert for self-supervised learning of language representations. In ICLR, Cited by: §2.
  • K. Lee, X. Chen, G. Hua, H. Hu, and X. He (2018) Stacked cross attention for image-text matching. In ECCV, Cited by: Figure 1, §1, §2, Table 1, §4.3, Table 3.
  • K. Lee, H. Palangi, X. Chen, H. Hu, and J. Gao (2019) Learning visual relation priors for image-text matching and image captioning with neural scene graph generators. arXiv preprint arXiv:1909.09953. Cited by: §1, §2, Table 1.
  • P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, et al. (2020) Retrieval-augmented generation for knowledge-intensive nlp tasks. arXiv preprint arXiv:2005.11401. Cited by: §2.
  • G. Li, N. Duan, Y. Fang, D. Jiang, and M. Zhou (2020a) Unicoder-vl: a universal encoder for vision and language by cross-modal pre-training. In AAAI, Cited by: §1, §2, Table 1.
  • L. H. Li, M. Yatskar, D. Yin, C. Hsieh, and K. Chang (2019a) Visualbert: a simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557. Cited by: §2.
  • X. Li, C. Xu, X. Wang, W. Lan, Z. Jia, G. Yang, and J. Xu (2019b) COCO-cn for cross-lingual image tagging, captioning, and retrieval. IEEE Transactions on Multimedia. Cited by: §A.2, §4.5.
  • X. Li, X. Yin, C. Li, X. Hu, P. Zhang, L. Zhang, L. Wang, H. Hu, L. Dong, F. Wei, et al. (2020b) Oscar: object-semantics aligned pre-training for vision-language tasks. In ECCV, Cited by: §1, §1, §1, §2, Table 1.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In ECCV, Cited by: §A.1, §A.2, §4.1.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §2.
  • I. Loshchilov and F. Hutter (2019) Decoupled weight decay regularization. In ICLR, Cited by: §A.1.
  • J. Lu, D. Batra, D. Parikh, and S. Lee (2019) Vilbert: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS, Cited by: §1, §1, §2, §3.1, Table 1.
  • J. Lu, V. Goswami, M. Rohrbach, D. Parikh, and S. Lee (2020) 12-in-1: multi-task vision and language representation learning. In CVPR, Cited by: §2.
  • Y. Mao, P. He, X. Liu, Y. Shen, J. Gao, J. Han, and W. Chen (2020) Generation-augmented retrieval for open-domain question answering. arXiv preprint arXiv:2009.08553. Cited by: §2.
  • V. Ordonez, G. Kulkarni, and T. L. Berg (2011) Im2text: describing images using 1 million captioned photographs. In NeurIPS, Cited by: §4.1.
  • B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik (2015) Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV, Cited by: §A.2, §2, §4.1.
  • A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020. Cited by: §2.
  • C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research. Cited by: §2.
  • S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In NeurIPS, Cited by: §3.1.
  • R. Sennrich, B. Haddow, and A. Birch (2016) Neural machine translation of rare words with subword units. In ACL, Cited by: footnote 4.
  • P. Sharma, N. Ding, S. Goodman, and R. Soricut (2018) Conceptual captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, Cited by: §4.1.
  • W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai (2020) VL-bert: pre-training of generic visual-linguistic representations. In ICLR, Cited by: §2.
  • S. Sun, Y. Cheng, Z. Gan, and J. Liu (2019) Patient knowledge distillation for bert model compression. In EMNLP, Cited by: §2.
  • H. Tan and M. Bansal (2019) LXMERT: learning cross-modality encoder representations from transformers. In EMNLP, Cited by: §2, §3.1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NeurIPS, Cited by: §1, §2, §3.1.
  • L. Wang, Y. Li, J. Huang, and S. Lazebnik (2018) Learning two-branch neural networks for image-text matching tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §1, §2.
  • Z. Wang, X. Liu, H. Li, L. Sheng, J. Yan, X. Wang, and J. Shao (2019) Camp: cross-modal adaptive message passing for text-image retrieval. In ICCV, Cited by: §1, §2, Table 1.
  • J. Wehrmann, D. M. Souza, M. A. Lopes, and R. C. Barros (2019) Language-agnostic visual-semantic embeddings. In ICCV, Cited by: §4.5, Table 6.
  • L. Xiong, C. Xiong, Y. Li, K. Tang, J. Liu, P. Bennett, J. Ahmed, and A. Overwijk (2020) Approximate nearest neighbor negative contrastive learning for dense text retrieval. arXiv preprint arXiv:2007.00808. Cited by: §2.
  • Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le (2019) Xlnet: generalized autoregressive pretraining for language understanding. In NeurIPS, Cited by: §2.
  • Y. Yoshikawa, Y. Shigeto, and A. Takeuchi (2017) STAIR captions: constructing a large-scale Japanese image caption dataset. In ACL, Cited by: §A.2, §4.5.
  • L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg (2016) Modeling context in referring expressions. In ECCV, Cited by: §2.
  • Q. Zhang, Z. Lei, Z. Zhang, and S. Z. Li (2020) Context-aware attention network for image-text retrieval. In CVPR, Cited by: §1, §2, Table 1, §4.2.
  • L. Zhou, H. Palangi, L. Zhang, H. Hu, J. J. Corso, and J. Gao (2020) Unified vision-language pre-training for image captioning and vqa. In AAAI, Cited by: §2.

Appendix A

A.1 Implementation Details

To further facilitate the reproducibility of our proposed method, we include more details about the choice of model size and hyper-parameters for both pre-training and finetuning.

The model dimensions are set to (L=12, H=768, A=12) for both the image encoder and the language encoder, where L is the number of stacked Transformer blocks, H is the hidden activation dimension, and A is the number of attention heads. The total number of parameters in LightningDOT is 220M. Pre-training and finetuning learn the parameters of both encoders. During inference, with offline representation caching, only the forward pass of the encoder for the query modality is performed online.
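The caching scheme described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the two `encode_*` functions are hypothetical stand-ins that return random vectors instead of running the Transformer encoders, and `rerank` accepts an arbitrary scoring callable in place of a cross-attention re-ranker. Only the hidden size H=768 comes from the text.

```python
import numpy as np

H = 768  # hidden size from the paper; everything else here is illustrative
rng = np.random.default_rng(0)


def encode_images_offline(num_images):
    """Hypothetical stand-in for the image encoder: one H-dim vector per image."""
    return rng.standard_normal((num_images, H)).astype(np.float32)


def encode_query_online(image_index, target_id):
    """Hypothetical stand-in for the text encoder: a noisy copy of the target vector."""
    noise = 0.1 * rng.standard_normal(H).astype(np.float32)
    return image_index[target_id] + noise


def retrieve_top_k(query_vec, image_index, k):
    """Score every cached image with a single dot product and return the top-k ids."""
    scores = image_index @ query_vec
    top = np.argpartition(-scores, k - 1)[:k]   # unordered top-k
    return top[np.argsort(-scores[top])]        # sorted by descending score


def rerank(candidate_ids, score_fn):
    """Re-order a short candidate list with a slower, more accurate scorer."""
    return sorted(candidate_ids, key=score_fn, reverse=True)


# Offline: build the image index once and cache it.
index = encode_images_offline(num_images=1000)

# Online: encode only the query, then match against the cached index.
target = 42
query = encode_query_online(index, target)
top_ids = retrieve_top_k(query, index, k=20)

# Optional second stage on the 20 candidates (toy scorer: negative distance).
reranked = rerank(top_ids.tolist(), lambda i: -float(np.linalg.norm(index[i] - query)))
```

The point of the design is that the per-query cost of the first stage is one matrix-vector product over cached vectors, while the expensive scorer only sees the k retrieved candidates.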

For both pre-training and finetuning, AdamW (Loshchilov and Hutter, 2019) is used to optimize the model training, with , . We adopt a learning rate warmup strategy, where the learning rate is linearly increased during the first 10% of training steps, followed by a linear decay to 0. We set the L2 weight decay to be 0.01.
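The warmup-then-decay schedule can be written as a small function. This is a sketch under stated assumptions: the name `learning_rate` and its arguments are illustrative, and the choice to start the warmup from exactly 0 is an assumption, since the text only says the rate is "linearly increased".

```python
def learning_rate(step, total_steps, peak_lr, warmup_fraction=0.1):
    """Linear warmup over the first `warmup_fraction` of steps, then linear decay to 0."""
    warmup_steps = max(1, int(round(total_steps * warmup_fraction)))
    if step < warmup_steps:
        # Assumption: the warmup starts at 0 and reaches peak_lr at warmup_steps.
        return peak_lr * step / warmup_steps
    decay_steps = total_steps - warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / decay_steps


# Example with the pre-training setting (300,000 steps, peak learning rate 5e-5).
for s in (0, 15_000, 30_000, 165_000, 300_000):
    print(s, learning_rate(s, total_steps=300_000, peak_lr=5e-5))
```

With these inputs the rate peaks at step 30,000 (10% of training) and returns to 0 at the final step.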

During pre-training, we follow UNITER (Chen et al., 2020) to randomly sample 1 task per mini-batch update (footnote 11: code obtained from …). Our best model is pre-trained on VMLM+SMRM+CMR for 300,000 optimization steps. We set the batch size to 10240 per GPU (batch size is specified by #tokens + #regions, as in UNITER). Pre-training experiments are conducted on 8 V100 GPUs with 6-step gradient accumulation, and the learning rate is set to 5e-5. For the ablation studies presented in Table 5, the ablated instances of our model are pre-trained for 30k steps on the COCO dataset (Lin et al., 2014) only, with the same learning rate and batch size as in the best pre-training setting.
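The arithmetic behind this setup can be spelled out in a short sketch. Two assumptions are made here and are not stated in the text: that the per-GPU batches from all GPUs and all accumulation steps are combined into a single parameter update, and that tasks are sampled uniformly. UNITER-style training may use non-uniform mixing ratios. The function names are illustrative.

```python
import random

TASKS = ["VMLM", "SMRM", "CMR"]   # the three pre-training objectives
BATCH_PER_GPU = 10240             # measured in #tokens + #regions
NUM_GPUS = 8
GRAD_ACCUM_STEPS = 6


def units_per_update():
    """Tokens plus regions contributing to one update (assumes full aggregation)."""
    return BATCH_PER_GPU * NUM_GPUS * GRAD_ACCUM_STEPS


def sample_task_schedule(num_steps, seed=0):
    """Pick one task per mini-batch update (assumption: uniform sampling)."""
    rng = random.Random(seed)
    return [rng.choice(TASKS) for _ in range(num_steps)]


print(units_per_update())           # 10240 * 8 * 6 = 491520
print(sample_task_schedule(5))
```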

For finetuning, we set the batch size to 96 (measured in examples, rather than in the sequence length of tokens and regions), and search the learning rate over {1e-5, 2e-5, 5e-5}. We select models based on their AR on the validation set. The best learning rate is 5e-5 for COCO and 1e-5 for Flickr30K. Our models are trained for 15 epochs on Flickr30K and 20 epochs on COCO. For re-ranking, we choose the number of top retrieved candidates passed to the re-ranker from {20, 50}.

A.2 Multilingual Image-Text Retrieval Benchmarks

When evaluating ITR under the multilingual setting, we consider two benchmarks: Multi30K (Elliott et al., 2016, 2017; Barrault et al., 2018), and COCO Japanese (Yoshikawa et al., 2017) and Chinese (Li et al., 2019b). Multi30K is constructed by manually translating English captions in Flickr30K (Plummer et al., 2015) to German, French, and Czech. Each image in Multi30K is paired with 5 captions in German and 1 caption each in French and Czech. We adopt the same train/val/test split as in Flickr30K. COCO Japanese (Yoshikawa et al., 2017) collected 820K Japanese captions for 165K COCO images (Lin et al., 2014). We use the same train/dev/test splits for COCO Japanese as in Karpathy and Fei-Fei (2015), and present results on the 1K test set. Similarly, Li et al. (2019b) collected 1-2 Chinese captions per image for 20K COCO images to build COCO Chinese. We follow the original split defined in Li et al. (2019b).

| Dataset | #images | UNITER-base | SCAN | LightningDOT | LightningDOT+Re-ranker |
|---|---|---|---|---|---|
| Flickr30K-test | 1000 | 0.41 | 0.23 | 0.00064 | 0.0089 |
| COCO-test | 5000 | 1.95 | 1.04 | 0.00101 | 0.020 |
| Flickr30K-full | 31014 | 12.8* | 7.10* | 0.00193 | 0.010 |
| COCO-full | 123287 | 48.0* | 25.7* | 0.00201 | 0.021 |

Table 7: Image retrieval time cost, measured as computation time (in seconds) per query. The computation time for UNITER and SCAN is roughly linear in #images. Numbers with * are estimated from the running time on the test set.

A.3 Inference Time

We present the detailed inference time of UNITER-base, SCAN, the proposed LightningDOT, and LightningDOT with a UNITER-base re-ranker in Table 7, measured in seconds per query. UNITER is clearly the slowest, as the 12-layer Transformer model must be run between each query and every image. Comparing Flickr30K-test and COCO-test, its inference time scales up linearly with the number of images. With the lightweight GRU (Chung et al., 2014), SCAN is about 1.9× faster than UNITER. Across all settings, LightningDOT is significantly faster than both cross-attention methods (UNITER-base and SCAN). When UNITER-base is added as the re-ranker, our method slows down by roughly 10×, but still achieves a decent speedup.
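The ratios discussed here follow directly from Table 7. The sketch below recomputes them from the measured test-set rows and reproduces the linear extrapolation that the table caption describes for the starred entries. The helper names `speedup` and `extrapolate_linear` are illustrative, and the extrapolation formula is the simplest reading of "roughly linear in #images", not a documented procedure.

```python
# Seconds per query from Table 7 (measured rows only):
# (#images, UNITER-base, SCAN, LightningDOT, LightningDOT+Re-ranker)
MEASURED = {
    "Flickr30K-test": (1000, 0.41, 0.23, 0.00064, 0.0089),
    "COCO-test": (5000, 1.95, 1.04, 0.00101, 0.020),
}
FULL_SIZES = {"Flickr30K-test": 31014, "COCO-test": 123287}


def speedup(slow_seconds, fast_seconds):
    """How many times faster the second method is than the first."""
    return slow_seconds / fast_seconds


def extrapolate_linear(seconds_on_test, test_images, full_images):
    """Estimate cost on the full set, assuming time grows linearly with #images."""
    return seconds_on_test * full_images / test_images


for name, (n, uniter, scan, dot, dot_rerank) in MEASURED.items():
    print(name)
    print("  LightningDOT vs UNITER speedup:", round(speedup(uniter, dot)))
    print("  SCAN vs UNITER speedup:", round(speedup(uniter, scan), 2))
    print("  re-ranker slowdown:", round(speedup(dot_rerank, dot), 1))
    print("  UNITER on full set (est.):",
          round(extrapolate_linear(uniter, n, FULL_SIZES[name]), 1))
```

On the two measured test sets the LightningDOT-versus-UNITER ratio comes out at roughly 640 and 1930, and the extrapolated UNITER costs land close to the starred values in the table.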

Figure 5: Retrieved top-10 images for query "romantic".
Figure 6: Retrieved top-10 images for query "blue girl boy ball"

A.4 More Qualitative Examples

We show several qualitative results of image retrieval (top-10). All results are retrieved from the COCO-Full dataset (123k images in total). Our model captures the underlying semantic meaning well. For example, “romantic” appears only twice in the whole COCO dataset annotations, yet the top retrieved images are all topic-related (Figure 5). With multiple keywords, our model attempts to retrieve combinations of them (if not all). For example, for the query “blue girl boy ball” with four keywords, our model retrieves images that capture at least three keywords (Figure 6).

We also present image retrieval results where the text query is sampled from COCO dataset. We randomly sample 3 queries and present the results as below (ground truth on the top, retrieved top-10 images at the bottom). Clearly, our model retrieves related images from the full dataset.

Figure 7: Retrieved top 10 images from the query "A man and a little boy on skis on a ski hill." (Top picture is the ground truth.)
Figure 8: Retrieved top 10 images from the query "A road is lined with buildings and has cars on it." (Top picture is the ground truth.)
Figure 9: Retrieved top 10 images from the query "Two train employees stand near the open train car door." (Top picture is the ground truth.)
Figure 10: Retrieved top 10 images from the query "The sun hits the floor in a rustic bedroom." (Top picture is the ground truth.)