StacMR: Scene-Text Aware Cross-Modal Retrieval

by Andrés Mafla et al.

Recent models for cross-modal retrieval have benefited from an increasingly rich understanding of visual scenes, afforded by techniques such as scene graphs and object interactions. This has resulted in improved matching between the visual representation of an image and the textual representation of its caption. Yet, current visual representations overlook a key aspect: the text appearing in images, which may contain crucial information for retrieval. In this paper, we first propose a new dataset that allows exploration of cross-modal retrieval where images contain scene-text instances. Then, armed with this dataset, we describe several approaches which leverage scene text, including a better scene-text aware cross-modal retrieval method which uses specialized representations for text from the captions and text from the visual scene, and reconciles them in a common embedding space. Extensive experiments confirm that cross-modal retrieval approaches benefit from scene text and highlight interesting research questions worth exploring further. Dataset and code are available at





1 Introduction

Textual content is omnipresent in most man-made environments and plays a crucial role, as it conveys key information needed to understand a visual scene. Scene text commonly appears in natural images, especially in urban scenarios, where about half of the images contain text [51]. This is especially relevant when considering vision and language tasks and, in particular, cross-modal retrieval, the focus of our work. Scene text is a rich, explicit and semantic source of information that can be used to disambiguate the fine-grained semantics of a visual scene and can help provide a suitable ranking for otherwise equally probable results (see the example in Figure 1). Thus, explicitly taking advantage of this third modality should be a natural step towards more effective retrieval models. Nonetheless, and to the best of our knowledge, scene text has never been used for the task of cross-modal retrieval, and the community lacks a benchmark to properly address this research question. Our work tackles these two open directions.

Figure 1: This paper introduces the scene-text aware cross-modal retrieval (StacMR) task and studies scene text as a third modality for cross-modal retrieval. For the example query above, the restaurant name provides crucial information to disambiguate two otherwise equally relevant results.

Scene text has been successfully leveraged to improve several semantic tasks in the past, such as fine-grained image classification [4, 21, 34, 40], visual question answering (VQA) [5, 47] and image captioning [46]. Current mainstream methods tackle cross-modal retrieval either by learning to project images and their captions into a joint embedding space [15, 25, 28, 54] or by directly comparing image regions and caption fragments to compute a similarity score [22, 27]. Although previous methods have closed significant gaps, the lack of integration between scene text and the other modalities still hinders fuller image comprehension. The intuition that serves as the foundation of this work is that scene text found in natural images can be exploited to obtain stronger semantic relations between images and their captions. Such relations open a path toward improved retrieval systems in which scene text serves as a guiding signal, providing more relevant and precise results.

This paper introduces the Scene-Text Aware Cross-Modal Retrieval (StacMR) task, which aims to capture the interplay between captions, scene text, and visual signals. To overcome the data scarcity of the proposed task, we have constructed a dataset based on COCO images [30], which we name COCO-Text Captioned (CTC). It exhibits unique characteristics compared to other datasets employed for multi-modal tasks and does not share their bias towards scene text as the main component of an image. In this work, we also evaluate the performance of different state-of-the-art cross-modal retrieval models, discuss their limitations, and propose distinctive baselines to solve this task.

Concretely, the contribution of this paper is threefold. First, we introduce a new task called Scene-Text Aware Cross-Modal Retrieval (or StacMR in short), as an extension to cross-modal retrieval. In this task, leveraging the additional modality provided by scene text is crucial to further reduce the heterogeneity gap between images and captions.

Second, we describe a new dataset, COCO-Text Captioned (CTC), as the first dataset properly equipped to evaluate the StacMR task. We highlight the importance of the role that incidental scene text plays when interpreting an image and its positive impact on retrieval results. We also compare the properties of our CTC dataset with similar existing datasets containing scene text and captions.

Finally, we provide an extensive analysis of CTC. In particular, (1) we benchmark combinations of different cross-modal baselines to model the interaction between scene text, visual, and caption information, and (2) we propose and evaluate a new model, STARNet, which explicitly learns to combine visual and scene-text cues into a unified image-level representation.

2 Related Work

Scene-Text Detection and Recognition.

Due to the large variance of text instances found in the wild [9, 64], scene-text detection and recognition is still an active research field. Methods such as EAST [63], Textboxes++ [29] or LOMO [61] draw inspiration from general object detectors [19, 31, 44, 45] and typically localize text instances by regressing pre-defined anchor boxes or pixels.

Moreover, pipelines trained end-to-end often benefit from addressing both tasks, detection and recognition, jointly. Mask TextSpotter [32] is an end-to-end segmentation-based approach that detects and recognizes text of arbitrary shapes. Similarly, [20] extracts image features with a CNN that are later refined by two Long Short-Term Memory (LSTM) networks along with a text-alignment layer to perform the two tasks jointly. In another approach, [60] employs a semantic reasoning network that mitigates recognition errors by projecting textual regions into a learned semantic space.

Scene Text in Vision and Language. Methods for vision and language tasks typically align both modalities and often perform visual reasoning. Only recently have they started including scene text as an additional modality. Works such as Text-VQA [47] and Scene-Text VQA [5] focus on models capable of reading text in the wild as well as reasoning about its inherent relations with visual features to properly answer a question given in natural language. Scene text has also been used to perform fine-grained image classification: [4, 21, 35] learn a shared semantic space between visual features and text to perform classification, while [34] uses the Pyramidal Histogram Of Characters (PHOC) [2, 16, 36] descriptor as a way of overcoming OCR limitations and learning a morphological space suitable for the task. Other works [17, 39] perform scene-text based image search, in which a query word is used to retrieve images containing that word. Closer to our work, the TextCaps dataset [46] includes scene text in textual descriptions. We discuss the link with our work further in Section 3.

Cross-Modal Retrieval.

Most cross-modal retrieval (CMR) approaches learn a joint representation space together with visual and textual embedding functions which produce similar representations for semantically related inputs, i.e. an image and its captions. Often, the visual embedding function is a CNN and the textual one a recurrent neural network [15, 33, 37, 55]. Other approaches use regions of interest given by a detector [3] and align each visual region with a corresponding caption word to get a finer-grained image representation [8, 23, 27, 28, 54, 62]. Some methods also use attention mechanisms [27, 41, 48] to model detailed interactions between captions and image regions. More recently, transformers [50] have been employed [49, 57, 58] to perform multi-layered self-attention operations that better align visual and textual features. Other works [28, 56] perform visual reasoning with graph convolutional networks [24], which yield a rich set of features by defining a relational graph between paired images and sentences. Closer to our work, Vo et al. [53] propose to use text modifiers along with images to retrieve relevant images.

3 The CTC Dataset

This section introduces the proposed COCO-Text Captioned (CTC) dataset. We first describe how it was gathered and tailored for the new StacMR task, which extends traditional cross-modal retrieval to leverage information from a third modality: scene text (Section 3.1). Then we present CTC statistics and discuss the dataset in the light of other benchmarks, in particular the most related one, TextCaps [46] (Section 3.2).

3.1 Data Collection and Statistics

Building the Dataset. A suitable dataset for the proposed StacMR task requires the availability of three modalities: images, captions and scene text. The most commonly used datasets for the cross-modal retrieval task [14, 15, 26, 27, 28, 49, 54, 56] are COCO Captions [10], commonly known as MS-COCO in the cross-modal literature, and Flickr30K [59]. Only very few images from Flickr30K contain scene text (see Table 1), so we decided to start from COCO Captions, a subset of the COCO dataset [30]. Additionally, the reading systems community commonly uses the COCO-Text dataset [51], which contains a sample of COCO images with fully annotated scene-text instances. Among the COCO-Text images, we selected the ones that contain machine-printed, legible text in English. In order to gather only images with all three modalities, we then select the intersection between the filtered COCO-Text and COCO Captions. This leads to a multimodal dataset of 10,683 items, each consisting of an image with scene text and five captions, referred to as COCO-Text Captioned (CTC).

Note that the resulting CTC dataset shares part of its images with the original COCO Captions training split. As a consequence, we cannot use any models trained on COCO Captions in our experiments, as their training set would inevitably share images with our test set. The dataset's construction is illustrated in Figure 2.

Dataset               Total Images   Images w/ Text
Flickr30K [59]        31,783         3,338
TextCaps [46]         28,408         28,408
COCO Captions [10]    123,287        15,844
COCO-Text [51]        63,686         17,237
COCO-Text Caps        10,683         10,683
Table 1: Dataset statistics for standard benchmarks and the proposed CTC. COCO-Text Caps refers to COCO-Text filtered by selecting machine-printed, English and legible scene text only. Estimated scene-text counts were obtained with the methods from [36] and [7].
Figure 2: Proposed CTC dataset, which is designed to allow a proper evaluation of the StacMR task, as all entries contain the three modalities: image, scene text and caption.

Statistics. Our only driver for building the CTC dataset has been to identify samples where all three modalities are available, without explicitly requiring at any point that scene text had any semantic relation to the captions. This is the most important requirement for a dataset where scene text is truly incidental and captions are not biased towards this additional modality. Despite this, to be coherent with the StacMR task definition, it is paramount to show that the proposed CTC dataset contains some inherent semantic relations between scene text found in an image and the captions that describe it. To this end, we design three scenarios which illustrate this semantic relevance at the image, caption and word-level.

Figure 3: CTC full statistics. Cumulative histograms (as thresholds over similarity vary) of the semantic similarity between instances of scene-text tokens and a) all captions for an image (Images), b) individual captions (Captions), and c) individual words in captions (Words).

More precisely, we first remove stop-words from captions and scene-text annotations, and embed each remaining word with Word2Vec [38] vectors trained on the Google News dataset. The semantic relevance between two words is defined as the cosine similarity between their Word2Vec embeddings. We then consider three scenarios to showcase the relevance of scene text to image captions. The first scenario considers, for each image, the highest semantic similarity between any scene-text word and any word from the set of its captions; it captures the semantic relation with images, seen as sets of captions. The second scenario considers the highest semantic similarity between any scene-text word and any word from a single corresponding caption; it highlights the semantic relation with individual captions. The third scenario considers how many caption words are related to scene-text words; it captures the semantic relation with individual words in captions.
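As an illustration, the word-level relevance measure of the first two scenarios can be sketched as follows. The tiny 3-d embedding table below is a hypothetical stand-in for the 300-d Google News Word2Vec vectors used in the paper (which could be loaded, e.g., with gensim):

```python
import numpy as np

# Hypothetical 3-d stand-ins for 300-d Word2Vec vectors.
EMB = {
    "pizza":      np.array([0.9, 0.1, 0.0]),
    "restaurant": np.array([0.8, 0.3, 0.1]),
    "stop":       np.array([0.0, 0.9, 0.2]),
    "sign":       np.array([0.1, 0.8, 0.3]),
}

def cos_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def max_scene_text_similarity(scene_words, caption_words):
    """Highest cosine similarity between any scene-text word and any
    caption word; words missing from the vocabulary are skipped."""
    sims = [cos_sim(EMB[s], EMB[c])
            for s in scene_words for c in caption_words
            if s in EMB and c in EMB]
    return max(sims) if sims else 0.0

sim = max_scene_text_similarity(["pizza"], ["restaurant", "sign"])
```

Aggregating this score over all captions of an image gives the image-level statistic, while computing it per caption gives the caption-level one.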

The three histograms of Figure 3 correspond to the three previously described scenarios. The fact that many words have a strong similarity at all three levels confirms that scene text can be used to model the semantics across the three studied modalities and leveraged to obtain a better-performing cross-modal retrieval system.

As scene text provides fine-grained semantic information, its importance is query-dependent and it should be used selectively. An algorithm designed for the task should be able to decide, for each image, to what extent scene text is relevant for the cross-modal retrieval task. In order to better capture this, we define two partitions of the CTC dataset. CTC presents a natural semantic split that is evident in Figure 3(a), which quantifies semantic similarity at the image level. The first partition (threshold = 1) corresponds to images for which at least one word appears in both the scene text and one of the captions. We refer to this set of images as CTC explicit. We expect scene text from this set to often be relevant to the retrieval task. We employ the full CTC dataset, here referenced as CTC full to avoid ambiguity, to evaluate the more generic scenario where the role of scene text for retrieval is a priori unknown. This second set contains the previously mentioned explicit partition as well as images in which scene text is less relevant according to the annotated captions. Example image-caption pairs from CTC explicit are shown in Figure 5, illustrating that scene text provides a strong cue and fine-grained information for cross-modal retrieval.
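The membership test for the explicit partition can be sketched as a simple word-overlap check after stop-word removal; the stop-word list and the sample below are illustrative, not the exact ones used to build CTC:

```python
STOP_WORDS = {"a", "the", "is", "on", "of", "in", "and", "to"}  # tiny illustrative list

def tokens(text):
    """Lower-cased content words of a string, stop-words removed."""
    return {w.strip(".,").lower() for w in text.split()} - STOP_WORDS

def is_explicit(scene_text, captions):
    """An image is 'explicit' if at least one non-stop word occurs both
    in its scene text and in any of its captions."""
    scene = tokens(scene_text)
    return any(scene & tokens(c) for c in captions)

# Hypothetical sample in the spirit of Figure 5:
is_explicit("GIANT HOT DOG", ["A gummy hot dog that is for sale."])  # True
```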

For evaluation purposes, we define two test splits. The first one, which we refer to as CTC-1K, is a subset of 1,000 images from CTC explicit. The second test set, CTC-5K, contains the 1,000 explicit images of CTC-1K plus 4,000 non-explicit images. The remaining explicit and non-explicit images are used for training and validation purposes.

3.2 Comparison with other Datasets

Table 1 compares the previously mentioned datasets with statistics on the three modalities. Scene text for COCO Captions [10] and Flickr30K [59] was acquired using a scene-text detector [36]. As mentioned earlier, none of the existing benchmarks contains samples where all three modalities are annotated.

Closely related to the proposed CTC dataset, TextCaps [46] is an image captioning dataset that contains scene-text instances in all of its images. TextCaps is biased by design, as annotators were asked to describe an image in one sentence in a way that requires reading the text in the image. The statistics shown in Figure 4 reveal, first, that TextCaps images were selected to contain more text tokens than would naturally be expected and, second, that many more of these tokens end up being used in the captions compared to the unbiased captions of CTC. The existing bias in TextCaps is also evident when analysing the intersection of its images with the recently published Localized Narratives dataset [43]. Of those images, only a fraction () were annotated with captions that make use of any text tokens in Localized Narratives, where annotators were not instructed to always use the scene text. According to our statistics, this is already higher than expected in the real world, because Localized Narratives captions are long descriptions that tend to venture into fine-grained (localized) descriptions of image parts where text is more relevant.

The proposed CTC is a much less biased dataset in terms of caption generation. The objective is to provide realistic data that permit algorithms to explore the complex, real-life interaction between captions, visual and scene-text information, without assuming or forcing any semantic relation between them. More experiments showing the bias between TextCaps' captions and scene text are provided in Section 5 and in the supplementary material.

Figure 4: Histograms of the number of OCR tokens found in images (seen as sets of captions, left) and in individual captions (right) for the CTC and TextCaps datasets.
Image 1:
Sign warns against runaway vehicles along a hilly roadway.
A white signing telling people how to park their cars on a steep hill.
A sign explaining how to park on a hill is posted on the street.
A warning sign is fastened to a post.
Street sign with instructions on parking the hilly city roads.

Image 2:
A person holding up a tasty looking treat.
A person holding up a gummy hot dog in their hand.
a closeup of a candy gummy hot dog in plastic packaging.
A hotdog that appears to be a gummy hotdog.
A gummy hot dog that is for sale.

Image 3:
Parked school bus with a banner attached to it and people looking at it.
A man and a woman outside a school bus.
A school bus parked outside of a building.
A school bus sits parked as people walk by.
A school bus sitting on the side of the road near a pink car.

Figure 5: Image-caption pairs from the CTC dataset. These images belong to CTC explicit: their scene text and captions share at least one word.

4 Method

This section describes approaches to tackle the StacMR task. First, we propose strategies to directly apply standard pretrained cross-modal retrieval models to our new task and its three modalities: images, captions and scene text (Section 4.1). Second, we propose an architecture to learn a joint embedding space for cross-modal retrieval in which the image embedding function learns to fuse both the visual and the scene-text information (Section 4.2).

4.1 Re-Ranking Strategies

This subsection considers the image-to-caption retrieval task. Note that everything can easily be rewritten to consider the caption-to-image case.

For StacMR, images are multimodal objects: they contain visual information as well as textual information coming from scene text. On the other hand, captions contain textual information only. This asymmetry allows decomposing the StacMR task into two independent retrieval problems: visual-to-caption and scene-text-to-caption. The first visual-to-caption retrieval task performs comparisons between a purely visual descriptor of the query image and the textual descriptor of the captions. This corresponds to the standard cross-modal retrieval task as performed on Flickr30K or COCO Captions. The second, scene-text-to-caption retrieval task, performs comparisons between the textual descriptors of the scene text from the query image and the captions. Any textual descriptor could be used. In our experiments, we use the textual descriptor of a cross-modal retrieval model as it has been tailored for capturing concepts relevant for images.

A pretrained cross-modal retrieval model relies on a metric space equipped with a similarity function which can indistinguishably compare visual and textual descriptors and allows ranking all database elements with respect to a query.

Notations. Given a query image q and a caption c from the gallery G, let s_v(q, c) be the score between q and c using the image-to-caption similarity from a cross-modal retrieval model, and s_t(q, c) the score between q and c using the scene-text-to-caption similarity from that same model.

Re-Ranking Strategies. The most straightforward way to obtain StacMR results is simply to perform a late fusion (LF) of the ranking results obtained using both s_v and s_t. More formally, we compute the linear combination of the scores s_v and s_t, using a parameter λ:

    s_LF(q, c) = λ s_v(q, c) + (1 − λ) s_t(q, c).    (1)

One weakness of the late fusion strategy is that it combines scores over all gallery items. Instead, we can limit the influence of the tails of the rankings to avoid misranking by using different fusion strategies. Given a similarity s, let 1_k^s(q, c) be the indicator function that gallery item c is in the top-k ranked items according to s, i.e. 1_k^s(q, c) = 1 if c is in the top-k results when querying with q and similarity s, and 0 otherwise. Following [1, 12, 13], we then define the late semantic combination (LSC) and product semantic combination (PSC) with Equations (2) and (3) respectively:

    s_LSC(q, c) = λ s_v(q, c) + (1 − λ) 1_k^{s_t}(q, c) s_t(q, c),    (2)

    s_PSC(q, c) = s_v(q, c) · (1 + s_t(q, c))^{λ 1_k^{s_t}(q, c)}.    (3)

Note that LSC is equivalent to the late fusion if k is equal to the gallery size.
These different re-ranking strategies do not require any training and rely on existing pretrained cross-modal retrieval models. We simply use the part of CTC disjoint from the two test sets to choose the hyperparameters k and λ.
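As a sketch, these fusion rules can be implemented over per-query score vectors as follows; the exact functional forms of LSC and PSC are illustrative assumptions, with the indicator restricting the scene-text score to its own top-k items:

```python
import numpy as np

def top_k_mask(scores, k):
    """1.0 for gallery items ranked in the top-k by `scores`, else 0.0."""
    idx = np.argsort(-scores)[:k]
    mask = np.zeros_like(scores)
    mask[idx] = 1.0
    return mask

def late_fusion(s_v, s_t, lam):
    """LF: linear combination of visual and scene-text scores."""
    return lam * s_v + (1.0 - lam) * s_t

def late_semantic_combination(s_v, s_t, lam, k):
    """LSC: like LF, but the scene-text score only acts on its top-k items."""
    return lam * s_v + (1.0 - lam) * top_k_mask(s_t, k) * s_t

def product_semantic_combination(s_v, s_t, lam, k):
    """PSC: multiplicative boost of the visual score by the scene-text
    score, again restricted to the scene-text top-k items."""
    return s_v * (1.0 + s_t) ** (lam * top_k_mask(s_t, k))
```

With k equal to the gallery size, the top-k mask is all ones and LSC reduces to LF, matching the note above.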

4.2 STARNet: A Dedicated Trimodal Architecture

All previously described approaches rely on a pretrained cross-modal retrieval model. Here, we introduce a new architecture able to handle the trimodality of the StacMR task. We start from the model presented in [28] and extend it to integrate scene text. First, we assume that scene text has been detected within an image. Then we adapt the model of [28] to be able to read scene-text instances. We include a positional information encoder along with a scene-text Graph Convolutional Network (GCN) and a customized fusion module into the original pipeline. Sharing intuition with [53], we assume that scene text acts as a modifier in the joint embedding space, applied to the visual descriptor of an image.

We propose the STARNet (Scene-Text Aware Retrieval Network) model, illustrated in Figure 6. It is composed of the following modules: a joint encoder for an image and its scene text, a caption encoder, and a caption generation module. Given an image and its scene text, the joint encoder produces a global feature encoding for both modalities. The image branch follows [3] and uses a customized Faster R-CNN [45] to extract visual features for all regions of interest. Similarly, the employed OCR [18] extracts scene-text instances as well as their bounding boxes.

For both modalities, image and scene text, we use a GCN [24] to obtain richer representations. Since the formulations of the visual and textual GCNs are similar, we use a single notation: the input to each GCN is a set of features X = {x_1, ..., x_N}, where each x_i is either a region descriptor (visual branch) or a scene-text token embedding enriched with positional information (textual branch). A zero-padding scheme is employed for both modalities if the number of features is smaller than N. We define the affinity matrix A, which encodes the correlation between two regions and is given by A(x_i, x_j) = φ(x_i)^T ψ(x_j), where x_i and x_j are the two features being compared and φ and ψ are two fully connected layers that are learned end-to-end by back-propagation.

The obtained graph can be defined by G = (X, A), in which the nodes are represented by the features X and the edges are described by the affinity matrix A, which encodes the degree of semantic relation between two nodes. In our method, we employ the definition of Graph Convolutional Networks given by [24] to obtain a richer set of features from the nodes and edges. A single graph convolution layer is described by

    X^(l+1) = A X^(l) W^(l) + X^(l) W_r^(l),

where A is the affinity matrix, X^(l) are the input features from the previous layer, W^(l) is a learnable weight matrix of the GCN, W_r^(l) is a residual weight matrix and l is the index of the GCN layer. We employ the same number of layers for both GCNs used in the proposed pipeline.
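A minimal numpy sketch of one such layer is shown below; the tensor sizes, random weights, and the row-wise softmax normalization of the affinity matrix are illustrative assumptions rather than the trained model's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 5, 8  # N nodes (regions or scene-text tokens), feature size D

X  = rng.normal(size=(N, D))   # input node features
Wp = rng.normal(size=(D, D))   # projection phi (a learned FC layer)
Wq = rng.normal(size=(D, D))   # projection psi (a learned FC layer)
W  = rng.normal(size=(D, D))   # GCN weight matrix
Wr = rng.normal(size=(D, D))   # residual weight matrix

# Pairwise affinities A_ij = phi(x_i)^T psi(x_j), here normalized with a
# row-wise softmax so each row sums to 1 (an illustrative choice).
logits = (X @ Wp) @ (X @ Wq).T
A = np.exp(logits - logits.max(axis=1, keepdims=True))
A = A / A.sum(axis=1, keepdims=True)

# One graph-convolution layer with a residual term.
X_next = A @ X @ W + X @ Wr
```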

The output of the visual GCN goes through a Gated Recurrent Unit (GRU) [11] to obtain the global image representation, denoted by v. Textual features from the output of the scene-text GCN are average-pooled to obtain a final textual representation, denoted by t. The final image representation is the product between the visual and final scene-text features (which act as a modifier) added to the original visual features: v* = v + v ⊙ t.
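A toy sketch of this fusion step, assuming an element-wise (Hadamard) reading of the product and hypothetical 3-d features:

```python
import numpy as np

v = np.array([0.2, 0.5, 0.1])   # global visual feature (GRU output)
t = np.array([0.0, 1.0, 0.5])   # pooled scene-text feature (the "modifier")

# Scene text modulates the visual embedding element-wise, and the result
# is added back to the original visual features.
v_star = v + v * t
```

Where the scene-text feature is zero, the visual feature passes through unchanged, which matches the intuition of scene text acting as an optional modifier.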

The caption from the corresponding training image-caption pair is encoded with a GRU [11, 15], yielding the caption embedding c. To align image features with their caption features in a joint embedding space, we train the two encoders using a triplet ranking loss [15, 27], employing the hardest negative sample in each mini-batch.
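The max-hinge objective with hardest in-batch negatives (in the style of VSE++ [15]) can be sketched as follows; the batch layout (matching pairs share an index, embeddings L2-normalized) and the margin value are illustrative:

```python
import numpy as np

def hardest_negative_triplet_loss(img, cap, margin=0.2):
    """Triplet ranking loss over a mini-batch of L2-normalized image and
    caption embeddings, keeping only the hardest negative per anchor."""
    S = img @ cap.T                 # pairwise cosine similarities
    pos = np.diag(S)                # similarities of the matching pairs
    n = S.shape[0]
    off = S - np.eye(n) * 1e9       # mask out the positives
    hard_i2t = off.max(axis=1)      # hardest caption for each image
    hard_t2i = off.max(axis=0)      # hardest image for each caption
    loss = np.maximum(0, margin + hard_i2t - pos) \
         + np.maximum(0, margin + hard_t2i - pos)
    return float(loss.mean())
```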

In order to provide the model with a stronger supervision signal, the learned image representation is also used to generate a caption as an auxiliary task. We train a third module, the caption generator, to reproduce the ground-truth caption from the final image representation. This sequence-to-sequence model uses an attention mechanism similar to [52], and we optimize the log-likelihood of the predicted output caption given the final visual features and the previously generated word.

Figure 6: Our proposed STARNet model. Visual regions and scene-text instances are used as input to a GCN. The final learned representations are later combined to leverage complementary semantic information.

5 Experiments

We present results on CTC, split into two parts: visual-only and scene-text-only baselines, as well as their unsupervised re-ranking (Section 5.1), and supervised trimodal fusion results from STARNet (Section 5.2). Following cross-modal retrieval (CMR) evaluation standards, we report performance with recall at K (R@K) for K ∈ {1, 5, 10}, for both image-to-text and text-to-image retrieval.
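A simple sketch of the R@K computation is shown below; for simplicity it assumes one ground-truth match per query sitting at the same gallery index, whereas the actual CTC protocol associates each image with five captions:

```python
import numpy as np

def recall_at_k(sims, ks=(1, 5, 10)):
    """R@K (in %) for a query-by-gallery similarity matrix where the
    ground-truth match of query i sits at gallery index i."""
    ranks = []
    for i, row in enumerate(sims):
        order = np.argsort(-row)                       # best match first
        ranks.append(int(np.where(order == i)[0][0]))  # rank of ground truth
    ranks = np.array(ranks)
    return {k: float(np.mean(ranks < k) * 100) for k in ks}
```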

Columns: Visual Model, Scene-text Model, Trained on (F30K, TC), Scene-text Source, Re-rank, followed by CTC-1K and CTC-5K results. For each test set we report Image to Text and Text to Image retrieval with R@1, R@5 and R@10.
(1) VSE++ [15] - - 20.5 42.8 54.5 15.4 35.2 48.4 13.3 30.2 40.2 8.4 21.5 30.1
(2) VSE++ - - 23.9 50.6 63.2 16.5 39.6 53.3 12.6 30.1 40.2 7.9 21.0 29.7
(3) VSRN [28] - - 27.1 50.7 62.0 19.7 42.8 55.7 19.2 38.6 49.4 12.5 29.2 39.1
(4) VSRN - - 35.6 64.4 76.0 24.1 50.1 63.8 22.7 45.1 56.0 14.2 32.1 42.6
(5) VSE++ GRU GT - 26.3 40.4 47.3 10.0 20.3 25.6 4.4 7.1 8.2 1.6 3.5 4.7
(6) VSRN GRU GT - 12.3 25.1 30.1 6.8 15.3 20.0 1.9 4.0 5.2 1.1 2.8 3.8
(7) Fasttext+FV GT - 21.7 36.5 44.3 3.2 6.6 9.0 3.5 5.9 7.5 0.6 1.3 1.7
(8) VSE++ VSE++ GRU GT AVG 34.6 53.1 61.0 14.5 31.0 39.4 10.0 21.5 29.5 5.0 14.1 21.4
(9) LF 31.0 60.0 72.3 20.4 44.7 57.3 13.4 30.9 41.5 7.4 20.5 29.1
(10) PSC 37.4 62.8 73.6 15.5 42.6 57.1 12.2 32.1 42.4 4.1 19.3 29.2
(11) LSC 31.6 57.8 70.2 20.3 44.7 57.8 13.7 31.7 41.6 7.7 21.0 29.6
(12) VSRN VSRN GRU GT AVG 36.8 62.2 72.9 18.6 40.5 52.9 15.3 33.5 44.3 6.4 18.9 28.0
(13) LF 40.3 68.5 79.9 23.9 49.9 63.4 22.6 45.0 56.3 11.8 29.5 40.0
(14) PSC 33.5 65.9 78.2 15.8 48.1 64.3 18.5 44.5 56.0 5.3 28.7 41.0
(15) LSC 38.6 67.5 78.5 24.3 50.4 64.0 23.4 45.6 56.5 12.1 30.6 41.1
(16) VSRN VSE++ GRU GT LF 45.8 72.7 81.4 26.5 52.7 66.1 24.2 46.1 57.1 12.9 31.0 41.2
(17) PSC 42.2 71.5 82.8 18.9 51.1 66.4 20.1 46.4 57.5 6.7 29.5 41.6
(18) LSC 45.3 71.5 80.7 26.7 53.0 66.2 24.4 46.9 57.4 13.2 31.8 42.3
(19) VSRN VSE++ GRU OCR LF 41.5 70.1 79.8 25.1 51.2 64.3 23.3 45.0 58.9 12.6 30.5 41.1
(20) PSC 38.5 69.6 80.6 17.9 50.1 65.1 19.8 45.7 57.2 7.0 29.8 41.7
(21) LSC 42.2 68.6 78.5 25.5 51.8 64.9 19.8 45.7 57.2 13.2 31.5 42.2
Table 2: Results on CTC for visual and scene-text baselines, and their re-ranking combinations. Visual model and Scene-text model indicate image-caption and scene-text-caption retrieval, respectively. GT stands for ground-truth scene-text annotations and OCR for scene-text prediction obtained from [18]. Bold numbers denote the best performances of visual, scene-text, and re-ranking methods for each ensemble of models.

5.1 Baselines and Re-Ranking Results

This section first introduces visual-only CMR models. These allow observing how standard CMR models tackle the StacMR task on CTC. Then, we propose scene-text-only metric spaces, where the only information extracted from the image is its scene text. These baselines should help judge the semantic relevance of the scene-text with respect to the captions. The remaining results correspond to different combinations: a naive average of visual and scene-text embeddings for metric spaces that allow it, and the different re-ranking strategies introduced in Section 4.1.

Visual-only Baselines. We use two CMR models based on global features for both images and captions: VSE++ [15] and VSRN [28]. Both works provide public training code, which we use for all models in this section, with the exception of the VSE++ model trained on Flickr30K, for which we use the model provided by [15]. We train these architectures either on Flickr30K or on Flickr30K + TextCaps. As mentioned in Section 3.1, models pretrained on COCO Captions are not considered due to the overlap between the training set of COCO Captions and our test sets.

Results are presented in Table 2, rows (1-4). VSRN surpasses VSE++, mirroring their relative performance on CMR benchmarks. Furthermore, models trained on the additional data of TextCaps outperform models trained only on Flickr30K. This is interesting, as TextCaps image-caption pairs depend more heavily on their scene text than those from Flickr30K. Enlarging the training set with TextCaps explains this improvement to an extent, as Flickr30K is relatively small. Moving forward, we only report models trained on F30K+TC.

Scene-Text only Baselines. We use the textual embedding part of our two previously used CMR models (denoted VSE++ GRU and VSRN GRU, respectively). We also consider FastText [6] word embeddings followed by a Fisher vector encoding [42] (denoted FastText+FV), which can deal with out-of-vocabulary words. For these experiments, we use the ground-truth OCR annotations as scene text. Results are presented in Table 2, rows (5-7). We observe much weaker results than for the purely visual baselines. For CTC-1K, this approach can rely on shared words between the scene text and one of the captions. For the more realistic CTC-5K, we see that scene text brings very little in isolation. Note that VSE++ GRU outperforms VSRN GRU on this task, while VSRN is better in the purely visual case. This motivates the hybrid strategies merging both models that we report later. FastText+FV yields strong results on image-to-caption retrieval on CTC-1K, but poor results in the other evaluated scenarios. A discussion of several scene-text only baselines is available in the supplementary material.

Average Embedding. If the image and its scene text are represented using the same CMR model, all three modalities live in the same embedding space. This allows a naive combination which consists in averaging the visual and scene-text embeddings to represent the image, reported as AVG in Table 2, rows (8) and (12). This brings a non-negligible improvement on CTC-1K Image to Text over the respective visual-only baselines, a first indication that scene text, even naively used, improves some StacMR queries. However, we observe a decline on CTC-5K in the same comparison. This hints at the fact that scene text provides fine-grained information that should be used selectively; giving equal weight to both modalities is too crude an approach.

Re-Ranking Results. Re-ranking results are presented in Table 2, rows (9-21). We test the best pairing of visual-only and scene-text-only models with three combination strategies: late fusion (LF), product semantic combination (PSC) and late semantic combination (LSC). Hyper-parameters of each re-ranking strategy are chosen for VSRN with VSE++ GRU and applied to all other combinations as is. We use the part of CTC explicit which is not used for testing as a validation set. For LF, . For PSC, and . For LSC, and .

When compared to the unimodal baselines, all combinations improve results on CTC-1K. Both LF and LSC match the results of their visual baselines on CTC-5K, showing that these methods are more robust to scene-text information unrelated to the captions.
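Of the three strategies, LF is the simplest: a convex combination of the similarity scores of the two models. A minimal sketch under that reading (the function names and the explicit mixing weight `lam` are illustrative, not taken from the paper):

```python
import numpy as np

def late_fusion(sim_visual, sim_scene_text, lam):
    """Late fusion (LF): re-rank with a convex combination of the
    visual-only and scene-text-only similarity matrices.
    `lam` stands for the hyper-parameter tuned on the held-out split."""
    return lam * sim_visual + (1.0 - lam) * sim_scene_text

def rank_candidates(fused_row):
    """Candidates for one query, sorted from most to least similar."""
    return np.argsort(-fused_row)
```

PSC and LSC combine the two scores non-linearly (semantic combinations [12, 13]) and are not sketched here.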

For the three best performing re-ranking variants, we repeat the experiment using OCR predictions instead of the ground-truth scene-text annotations. Results are shown in rows (19-21). Compared with their counterparts in rows (16-18), we observe an average R@10 loss of 1.7 on CTC-1K and stable results on CTC-5K. This confirms the robustness of these re-ranking strategies to the loss of information caused by imperfect OCR predictions.

Uses Scene Text Scene-Text Source Trained on CTC-1K CTC-5K
Image to Text Text to Image Image to Text Text to Image
F30K TextCaps CTC R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10
SCAN [27] - 26.4 48.6 61.1 15.2 36.8 49.3 17.5 36.7 47.1 7.6 21.2 30.4
OCR 19.5 43.8 57.1 10.2 28.7 42.1 7.0 20.0 29.7 3.2 11.7 18.1
OCR 35.0 62.9 74.4 19.3 44.0 58.3 21.1 43.0 54.6 9.6 25.4 35.6
OCR 27.5 48.9 61.9 16.5 37.7 51.1 18.6 37.3 47.6 8.1 21.6 30.6
OCR 36.3 63.7 75.2 26.6 53.6 65.3 22.8 45.6 54.3 12.3 28.6 39.9
VSRN [28] - 27.1 50.7 62.0 19.7 42.8 55.7 19.2 38.6 49.4 12.5 29.2 39.1
OCR 18.6 40.4 52.2 11.7 31.0 44.2 6.6 17.9 25.8 4.5 13.0 19.8
OCR 35.6 64.3 76.0 24.0 50.1 63.1 22.6 45.0 55.9 14.2 32.1 42.5
OCR 36.1 64.1 75.8 26.2 53.1 65.2 24.6 48.1 58.8 15.4 35.7 46.9
OCR 38.2 67.4 79.1 26.6 54.2 66.2 23.7 47.6 59.1 14.9 34.7 45.5
STARNet OCR 29.4 52.3 62.6 21.8 44.3 57.2 19.9 39.6 50.1 13.4 30.7 40.4
OCR 23.4 48.0 61.0 14.2 34.9 47.3 5.1 15.1 22.3 3.9 11.9 25.1
OCR 39.3 65.4 76.8 25.9 52.3 65.2 21.1 41.8 52.9 13.8 31.8 42.0
OCR 36.5 64.6 74.3 26.4 53.8 65.6 25.5 48.4 59.8 15.7 35.3 46.6
OCR 44.1 74.8 82.7 31.5 60.8 72.4 26.4 51.1 63.9 17.1 37.4 48.3
Re-rank Comb. (21) OCR 42.2 68.6 78.5 25.5 51.8 64.9 19.8 45.7 57.2 13.2 31.5 42.2
STARNet - GT GT 45.4 74.9 83.9 32.0 61.2 73.3 26.8 51.4 64.1 17.4 37.8 48.7
Table 3: Retrieval results on the CTC-1K and CTC-5K test set of supervised models. Second-to-last row shows the result from the unsupervised re-ranking baseline described in Table 2, line 21. OCR stands for the textual features obtained from [18], whereas GT refers to the Ground-truth annotated scene text. Results depicted in terms of Recall@K (R@K).

5.2 Supervised Results

The latest cross-modal retrieval models rely on region-based visual features [27, 28, 54] rather than a global image representation [15]. In this section, we include results of two state-of-the-art models, SCAN [27] and VSRN [28], that employ such region-based visual features. The original cross-modal retrieval models, SCAN and VSRN, are used only when trained on Flickr30K. In order to leverage scene text, we modified them to include OCR features. In both models, the OCR features are projected into the same space as the visual features and the default hyper-parameters are employed; details are given in the supplementary material. All results are reported in Table 3. The second column indicates whether each model uses scene-text instances, and the third column indicates the source of the scene text. We make the following observations.

First, we see that standard models trained on a common cross-modal retrieval dataset, such as Flickr30k, do not yield good performance on the StacMR task.

Second, we note different behaviors when each dataset is used for training and testing is done on the CTC test sets. In particular, training solely on TextCaps [46] significantly decreases the performance of any model, especially on CTC-5K. This effect is caused by the bias of TextCaps, whose captions place a strong focus on scene-text instances to describe an image rather than combining visual and textual cues in an unbiased way.

However, the datasets provide complementary statistics for training the STARNet model. For instance, Flickr30k focuses on relevant visual regions, whereas TextCaps and CTC can be seen as a reciprocal pair of datasets that model the relevance of the scene text in an image in a more natural manner.

It is worth pointing out that STARNet almost doubles the performance on the CTC-1K subset compared to common retrieval models. We believe this is due to the explicit scene-text instances reinforcing the relevance of this modality. A smaller improvement is achieved on CTC-5K: even though the scene text does not appear explicitly in the captions, a varying degree of semantic overlap between image and scene text can still be found.

Finally, we also show an upper bound at test time assuming a perfect OCR (using the ground-truth scene-text annotations of CTC), which adds a slight boost to the proposed method. This confirms the importance of accurate scene-text recognizers for the StacMR task. Additional experiments with the baseline supervised models on the Flickr30K and TextCaps datasets, along with qualitative results, are available in the supplementary material.

6 Conclusion

In this work, we highlight the challenges stemming from including scene-text information in the cross-modal retrieval task. Although of high semantic value, scene text proves to be a fine-grained element of the retrieval process that should be used selectively. We introduce a realistic dataset, CTC, in which annotations for both scene text and captions are available. Contrary to datasets constructed with scene text in mind, CTC is unbiased in terms of scene-text content and of how it is employed in the captions. A comprehensive set of baseline methods showcases that combining modalities is beneficial, while a simple fusion cannot tackle the newly introduced task of scene-text aware cross-modal retrieval. Finally, we introduce STARNet, a supervised model that successfully combines all three modalities.


  • [1] J. Ah-Pine, S. Clinchant, G. Csurka, F. Perronnin, and J. Renders (2010) Leveraging image, text and cross–media similarities for diversity–focused multimedia retrieval. In ImageCLEF, pp. 315–342. Cited by: §4.1.
  • [2] J. Almazán, A. Gordo, A. Fornés, and E. Valveny (2014) Word spotting and recognition with embedded attributes. PAMI 36 (12), pp. 2552–2566. Cited by: §2.
  • [3] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In

    Proceedings of the IEEE conference on computer vision and pattern recognition

    pp. 6077–6086. Cited by: §B.2, §2, §4.2.
  • [4] X. Bai, M. Yang, P. Lyu, Y. Xu, and J. Luo (2018) Integrating scene text and visual appearance for fine-grained image classification. IEEE Access 6, pp. 66322–66335. Cited by: §1, §2.
  • [5] A. Biten, R. Tito, A. Mafla, L. Gomez, M. Rusinol, E. Valveny, C.V. Jawahar, and D. Karatzas (2019-10) Scene text visual question answering. In Proc. ICCV, Cited by: §1, §2.
  • [6] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov (2017) Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, pp. 135–146. Cited by: §A.1, §B.1, §5.1.
  • [7] F. Borisyuk, A. Gordo, and V. Sivakumar (2018) Rosetta: large scale system for text detection and recognition in images. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 71–79. Cited by: Table 5, Table 1.
  • [8] H. Chen, G. Ding, X. Liu, Z. Lin, J. Liu, and J. Han (2020) IMRAM: iterative matching with recurrent attention memory for cross-modal image-text retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12655–12663. Cited by: §2.
  • [9] X. Chen, L. Jin, Y. Zhu, C. Luo, and T. Wang (2020) Text recognition in the wild: a survey. arXiv preprint arXiv:2005.03492. Cited by: §2.
  • [10] X. Chen, H. Fang, T. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick (2015) Microsoft coco captions: data collection and evaluation server. arXiv preprint arXiv:1504.00325. Cited by: §3.1, §3.2, Table 1.
  • [11] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio (2014)

    Empirical evaluation of gated recurrent neural networks on sequence modeling

    arXiv preprint arXiv:1412.3555. Cited by: §4.2, §4.2.
  • [12] S. Clinchant, J. Ah-Pine, and G. Csurka (2011) Semantic combination of textual and visual information in multimedia retrieval. In Proceedings of the 1st ACM international conference on multimedia retrieval, pp. 1–8. Cited by: §4.1.
  • [13] G. Csurka and S. Clinchant (2012) An empirical study of fusion operators for multimodal image retrieval. In 2012 10th International Workshop on Content-Based Multimedia Indexing (CBMI), pp. 1–6. Cited by: §4.1.
  • [14] M. Engilberge, L. Chevallier, P. Pérez, and M. Cord (2018) Finding beans in burgers: deep semantic-visual embedding with localization. In Proc. CVPR, Cited by: §3.1.
  • [15] F. Faghri, D. J. Fleet, J. R. Kiros, and S. Fidler (2018) VSE++: improving visual-semantic embeddings with hard negatives. In Proc. BMVC, Cited by: §1, §2, §3.1, §4.2, §5.1, §5.2, Table 2.
  • [16] L. Gomez, A. Mafla, M. Rusinol, and D. Karatzas (2018) Single shot scene text retrieval. In Proc. ECCV, Cited by: §2.
  • [17] L. Gómez, A. Mafla, M. Rusinol, and D. Karatzas (2018) Single shot scene text retrieval. In Proc. ECCV, Cited by: §2.
  • [18] Google (2020) Cloud Vision API. Accessed June 3, 2020. Cited by: §A.1, §A.1, §4.2, Table 2, Table 3.
  • [19] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proc. ICCV, pp. 2961–2969. Cited by: §2.
  • [20] T. He, Z. Tian, W. Huang, C. Shen, Y. Qiao, and C. Sun (2018) An end-to-end textspotter with explicit alignment and attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5020–5029. Cited by: §2.
  • [21] S. Karaoglu, R. Tao, J. C. van Gemert, and T. Gevers (2017) Con-text: text detection for fine-grained object classification. TIP 26 (8), pp. 3965–3980. Cited by: §1, §2.
  • [22] A. Karpathy and L. Fei-Fei (2015) Deep visual-semantic alignments for generating image descriptions. In Proc. CVPR, pp. 3128–3137. Cited by: §B.1, §1.
  • [23] A. Karpathy, A. Joulin, and L. F. Fei-Fei (2014) Deep fragment embeddings for bidirectional image sentence mapping. In Proc. NeurIPS, pp. 1889–1897. Cited by: §2.
  • [24] T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §2, §4.2, §4.2.
  • [25] R. Kiros, R. Salakhutdinov, and R. S. Zemel (2014) Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539. Cited by: §1.
  • [26] B. Klein, G. Lev, G. Sadeh, and L. Wolf (2015) Associating neural word embeddings with deep image representations using fisher vectors. In Proc. CVPR, Cited by: §3.1.
  • [27] K. Lee, X. Chen, G. Hua, H. Hu, and X. He (2018) Stacked cross attention for image-text matching. In Proc. ECCV, Cited by: §B.1, §B.2, §1, §2, §3.1, §4.2, §5.2, Table 3.
  • [28] K. Li, Y. Zhang, K. Li, Y. Li, and Y. Fu (2019) Visual semantic reasoning for image-text matching. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4654–4662. Cited by: §B.1, §B.2, §1, §2, §3.1, §4.2, §5.1, §5.2, Table 2, Table 3.
  • [29] M. Liao, B. Shi, and X. Bai (2018) Textboxes++: a single-shot oriented scene text detector. TIP 27 (8), pp. 3676–3690. Cited by: §2.
  • [30] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In Proc. ECCV, Cited by: §1, §3.1.
  • [31] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) Ssd: single shot multibox detector. In Proc. ECCV, pp. 21–37. Cited by: §2.
  • [32] P. Lyu, M. Liao, C. Yao, W. Wu, and X. Bai (2018) Mask textspotter: an end-to-end trainable neural network for spotting text with arbitrary shapes. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 67–83. Cited by: §2.
  • [33] L. Ma, Z. Lu, L. Shang, and H. Li (2015)

    Multimodal convolutional neural networks for matching image and sentence

    In Proceedings of the IEEE international conference on computer vision, pp. 2623–2631. Cited by: §2.
  • [34] A. Mafla, S. Dey, A. F. Biten, L. Gomez, and D. Karatzas (2020) Fine-grained image classification and retrieval by combining visual and locally pooled textual features. In The IEEE Winter Conference on Applications of Computer Vision, pp. 2950–2959. Cited by: §1, §2.
  • [35] A. Mafla, S. Dey, A. F. Biten, L. Gomez, and D. Karatzas (2020) Multi-modal reasoning graph for scene-text based fine-grained image classification and retrieval. arXiv preprint arXiv:2009.09809. Cited by: §2.
  • [36] A. Mafla, R. Tito, S. Dey, L. Gómez, M. Rusiñol, E. Valveny, and D. Karatzas (2020)

    Real-time lexicon-free scene text retrieval

    Pattern Recognition, pp. 107656. Cited by: §2, §3.2, Table 1.
  • [37] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille (2014) Deep captioning with multimodal recurrent neural networks (m-rnn). arXiv preprint arXiv:1412.6632. Cited by: §2.
  • [38] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119. Cited by: §3.1.
  • [39] A. Mishra, K. Alahari, and C. Jawahar (2013) Image retrieval using textual cues. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3040–3047. Cited by: §2.
  • [40] Y. Movshovitz-Attias, Q. Yu, M. C. Stumpe, V. Shet, S. Arnoud, and L. Yatziv (2015) Ontological supervision for fine grained classification of street view storefronts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1693–1702. Cited by: §1.
  • [41] H. Nam, J. Ha, and J. Kim (2017) Dual attention networks for multimodal reasoning and matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 299–307. Cited by: §2.
  • [42] F. Perronnin and C. Dance (2007) Fisher kernels on visual vocabularies for image categorization. In Proc. CVPR, Cited by: §A.1, §5.1.
  • [43] J. Pont-Tuset, J. Uijlings, S. Changpinyo, R. Soricut, and V. Ferrari (2019) Connecting vision and language with localized narratives. arXiv preprint arXiv:1912.03098. Cited by: §3.2.
  • [44] J. Redmon and A. Farhadi (2016) YOLO9000: better, faster, stronger. arXiv preprint arXiv:1612.08242. Cited by: §2.
  • [45] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Proc. NeurIPS, pp. 91–99. Cited by: §2, §4.2.
  • [46] O. Sidorov, R. Hu, M. Rohrbach, and A. Singh (2020) TextCaps: a dataset for image captioning with reading comprehension. arXiv preprint arXiv:2003.12462. Cited by: §B.1, §1, §2, §3.2, Table 1, §3, §5.2.
  • [47] A. Singh, V. Natarajan, M. Shah, Y. Jiang, X. Chen, D. Batra, D. Parikh, and M. Rohrbach (2019) Towards vqa models that can read. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8317–8326. Cited by: §1, §2.
  • [48] Y. Song and M. Soleymani (2019) Polysemous visual-semantic embedding for cross-modal retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1979–1988. Cited by: §2.
  • [49] H. Tan and M. Bansal (2019) Lxmert: learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490. Cited by: §2, §3.1.
  • [50] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. Cited by: §2.
  • [51] A. Veit, T. Matera, L. Neumann, J. Matas, and S. Belongie (2016) Coco-text: dataset and benchmark for text detection and recognition in natural images. arXiv preprint arXiv:1601.07140. Cited by: §1, §3.1, Table 1.
  • [52] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko (2015) Sequence to sequence-video to text. In Proceedings of the IEEE international conference on computer vision, pp. 4534–4542. Cited by: §4.2.
  • [53] N. Vo, L. Jiang, C. Sun, K. Murphy, L. Li, L. Fei-Fei, and J. Hays (2019) Composing text and image for image retrieval-an empirical odyssey. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6439–6448. Cited by: §2, §4.2.
  • [54] H. Wang, Y. Zhang, Z. Ji, Y. Pang, and L. Ma (2020) Consensus-aware visual-semantic embedding for image-text matching. arXiv preprint arXiv:2007.08883. Cited by: §1, §2, §3.1, §5.2.
  • [55] L. Wang, Y. Li, and S. Lazebnik (2016) Learning deep structure-preserving image-text embeddings. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5005–5013. Cited by: §2.
  • [56] S. Wang, R. Wang, Z. Yao, S. Shan, and X. Chen (2020) Cross-modal scene graph matching for relationship-aware image-text retrieval. In The IEEE Winter Conference on Applications of Computer Vision, pp. 1508–1517. Cited by: §2, §3.1.
  • [57] X. Wei, T. Zhang, Y. Li, Y. Zhang, and F. Wu (2020) Multi-modality cross attention network for image and sentence matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10941–10950. Cited by: §2.
  • [58] Y. Wu, S. Wang, G. Song, and Q. Huang (2019) Learning fragment self-attention embeddings for image-text matching. In Proceedings of the 27th ACM International Conference on Multimedia, pp. 2088–2096. Cited by: §2.
  • [59] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier (2014) From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. ACL 2, pp. 67–78. Cited by: §B.1, §3.1, §3.2, Table 1.
  • [60] D. Yu, X. Li, C. Zhang, T. Liu, J. Han, J. Liu, and E. Ding (2020) Towards accurate scene text recognition with semantic reasoning networks. In Proc. CVPR, pp. 12113–12122. Cited by: §2.
  • [61] C. Zhang, B. Liang, Z. Huang, M. En, J. Han, E. Ding, and X. Ding (2019) Look more than once: an accurate detector for text of arbitrary shapes. In Proc. CVPR, pp. 10552–10561. Cited by: §2.
  • [62] Q. Zhang, Z. Lei, Z. Zhang, and S. Z. Li (2020) Context-aware attention network for image-text retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3536–3545. Cited by: §2.
  • [63] X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, W. He, and J. Liang (2017) EAST: an efficient and accurate scene text detector. In Proc. CVPR, pp. 2642–2651. Cited by: §2.
  • [64] Y. Zhu, C. Yao, and X. Bai (2016) Scene text detection and recognition: recent advances and future trends. Frontiers of Computer Science 10 (1), pp. 19–36. Cited by: §2.

Appendix A Additions to Baselines and Re-Ranking

A.1 Full Table of Results on CTC

Table 4 presents a more extensive version of the results presented in Table 2. This section discusses some of these results in more detail.

Visual Model Scene-text Model Trained on Scene-text Source Re-rank CTC-1K CTC-5K
Image to Text Text to Image Image to Text Text to Image
F30K TC R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10
(1) VSE++ - - 20.5 42.8 54.5 15.4 35.2 48.4 13.3 30.2 40.2 8.4 21.5 30.1
(2) VSE++ - - 23.9 50.6 63.2 16.5 39.6 53.3 12.6 30.1 40.2 7.9 21.0 29.7
(3) VSRN - - 27.1 50.7 62.0 19.7 42.8 55.7 19.2 38.6 49.4 12.5 29.2 39.1
(4) VSRN - - 35.6 64.4 76.0 24.1 50.1 63.8 22.7 45.1 56.0 14.2 32.1 42.6
(5) VSE++ GRU GT - 17.4 29.9 37.1 8.3 17.5 23.2 2.4 4.8 5.8 1.3 3.0 4.2
(5’) VSE++ GRU OCR - 12.4 21.7 26.0 6.5 14.5 18.9 1.9 3.6 4.4 1.1 2.6 3.6
(6) VSE++ GRU GT - 26.3 40.4 47.3 10.0 20.3 25.6 4.4 7.1 8.2 1.6 3.5 4.7
(6’) VSE++ GRU OCR - 19.9 30.8 36.4 8.8 16.1 20.8 3.4 5.4 6.3 1.5 3.0 4.0
(7) VSRN GRU GT - 7.7 18.8 26.0 5.2 12.7 18.8 1.1 2.4 3.3 0.9 2.2 3.3
(8) VSRN GRU GT - 12.3 25.1 30.1 6.8 15.3 20.0 1.9 4.0 5.2 1.1 2.8 3.8
(9) GRU++ GT - 16.0 29.9 35.1 8.7 17.7 22.4 1.4 2.5 3.5 0.8 2.0 2.9
(10) Fasttext+FV uncleaned GT - 19.5 35.8 43.1 0.5 1.4 2.1 3.1 5.4 7.1 0.1 0.3 0.4

Fasttext+FV GT - 21.7 36.5 44.3 3.2 6.6 9.0 3.5 5.9 7.5 0.6 1.3 1.7
(12) VSE++ VSE++ GRU GT AVG 31.1 54.5 65.7 17.2 37.2 47.6 7.2 16.4 24.0 4.7 13.5 20.7
(13) LF 25.3 51.9 63.6 17.3 39.5 52.2 13.4 30.1 40.4 7.5 20.3 29.2
(14) PSC 25.8 51.7 63.2 13.5 37.4 51.0 10.9 30.5 41.3 4.2 19.8 29.5
(15) LSC 25.9 51.8 63.1 17.2 39.4 52.5 13.6 31.1 41.5 7.9 20.8 30.0
(16) VSRN VSE++ GRU GT LF 35.6 61.2 71.3 21.8 45.4 58.0 19.2 39.2 50.2 10.7 26.7 36.9
(17) PSC 30.6 59.3 69.5 16.2 43.2 58.2 14.8 38.8 50.2 6.0 26.4 38.1
(18) LSC 38.0 60.3 70.3 21.9 45.8 58.2 20.3 40.0 50.6 11.1 27.8 38.2
(19) VSRN VSE++ GRU OCR LF 32.2 58.3 69.3 20.3 43.5 56.5 18.3 37.8 48.5 10.6 27.0 36.8
(20) PSC 26.7 56.0 66.7 15.0 44.2 57.4 14.5 38.1 49.5 6.2 26.4 38.0
(21) LSC 32.8 57.0 68.5 20.7 44.0 57.1 19.7 39.6 50.3 11.3 27.9 38.3
(22) VSE++ VSE++ GRU GT AVG 34.6 53.1 61.0 14.5 31.0 39.4 10.0 21.5 29.5 5.0 14.1 21.4
(23) LF 31.0 60.0 72.3 20.4 44.7 57.3 13.4 30.9 41.5 7.4 20.5 29.1
(24) PSC 37.4 62.8 73.6 15.5 42.6 57.1 12.2 32.1 42.4 4.1 19.3 29.2
(25) LSC 31.6 57.8 70.2 20.3 44.7 57.8 13.7 31.7 41.6 7.7 21.0 29.6
(26) VSRN VSRN GRU GT AVG 36.8 62.2 72.9 18.6 40.5 52.9 15.3 33.5 44.3 6.4 18.9 28.0
(27) LF 40.3 68.5 79.9 23.9 49.9 63.4 22.6 45.0 56.3 11.8 29.5 40.0
(28) PSC 33.5 65.9 78.2 15.8 48.1 64.3 18.5 44.5 56.0 5.3 28.7 41.0
(29) LSC 38.6 67.5 78.5 24.3 50.4 64.0 23.4 45.6 56.5 12.1 30.6 41.1
(30) VSRN VSE++ GRU GT LF 41.7 68.6 78.9 25.1 52.0 65.5 22.5 44.4 55.7 12.8 31.0 41.3
(31) PSC 32.8 67.3 79.9 17.6 49.4 64.9 16.1 44.6 56.2 6.5 29.3 41.3
(32) LSC 42.2 67.9 78.5 25.5 52.0 65.6 23.1 45.9 56.1 13.3 31.7 42.2
(33) Oracle LF 63.2 82.9 89.3 37.9 64.3 75.5 31.0 53.9 64.5 19.7 39.3 49.6
(34) VSRN VSE++ GRU OCR LF 39.1 66.7 79.1 24.1 50.3 64.3 21.2 43.8 55.4 12.8 31.8 43.0
(35) PSC 31.6 65.2 78.5 16.6 48.6 64.6 15.8 43.9 55.8 6.7 29.4 41.4
(36) LSC 39.3 67.4 78.7 24.7 50.9 64.6 22.7 45.3 56.3 13.3 31.6 42.2
(37) VSRN VSE++ GRU GT LF 45.8 72.7 81.4 26.5 52.7 66.1 24.2 46.1 57.1 12.9 31.0 41.2
(38) PSC 42.2 71.5 82.8 18.9 51.1 66.4 20.1 46.4 57.5 6.7 29.5 41.6
(39) LSC 45.3 71.5 80.7 26.7 53.0 66.2 24.4 46.9 57.4 13.2 31.8 42.3
(40) Oracle LF 67.9 84.8 91.1 39.2 64.8 76.2 32.9 55.3 65.2 20.1 39.7 50.3
(41) VSRN VSE++ GRU OCR LF 41.5 70.1 79.8 25.1 51.2 64.3 23.3 45.0 58.9 12.6 30.5 41.1
(42) PSC 38.5 69.6 80.6 17.9 50.1 65.1 19.8 45.7 57.2 7.0 29.8 41.7
(43) LSC 42.2 68.6 78.5 25.5 51.8 64.9 19.8 45.7 57.2 13.2 31.5 42.2
Table 4: Results on CTC-1k and CTC-5k for visual-only baselines, scene-text-only baselines and re-ranking combinations of these baselines. Bold results denote the best performance at each of visual model, scene-text model and re-ranking methods. denotes theoretical upper-bounds to the linear combination re-rankings. (see Section A.3)
Visual Model Scene-Text Model Trained on Scene-text Source Re-rank TextCaps
Image to Text Text to Image
F30K TC R@1 R@5 R@10 R@1 R@5 R@10
(1) VSE++ - - 5.6 15.1 21.5 4.1 11.1 16.6
(2) VSRN - - 6.2 14.5 20.2 4.5 11.7 16.6
(3) VSE++ - - 14.7 30.9 40.4 10.0 24.3 32.9
(4) VSE++ GRU GT - 11.5 18.7 22.0 10.3 17.5 20.1
(5) VSE++ GRU - 34.6 45.7 49.7 25.1 35.0 37.9
(6) VSE++ VSE++ GRU GT AVG 42.8 56.6 62.8 30.8 46.2 52.7
(7) LF 33.5 54.7 63.7 22.6 40.8 50.2
(8) PSC 40.0 56.3 64.6 24.7 42.3 50.7
(9) LSC 25.7 46.0 56.1 18.0 36.0 45.3
(10) Oracle LF 57.3 72.3 78.0 39.6 55.9 63.0
Table 5: Results on TextCaps (validation set) for visual-only baselines, scene-text-only baselines and re-ranking combinations of these baselines. GT stands for ground-truth scene-text annotations, which for TextCaps are OCR predictions from [7].  denotes theoretical upper-bounds to the linear combination re-rankings. (see Section A.3)
Model Trained on Flickr30K TextCaps
Image to Text Text to Image Image to Text Text to Image
F30K TextCaps CTC R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10
SCAN 57.2 84.4 90.5 38.6 68.4 79.1 9.3 21.7 29.8 4.7 14.1 21.2
14.1 34.6 45.0 7.8 22.7 32.1 23.2 50.5 63.5 14.1 37.6 52.1
57.6 85.3 92.4 39.2 70.0 80.2 16.6 36.6 48.7 9.3 25.4 36.4
58.1 83.2 91.5 39.6 69.8 81.3 4.4 11.2 16.2 2.4 7.2 11.3
55.1 79.6 87.1 35.5 67.2 77.3 15.4 35.2 46.9 13.4 37.1 51.8
VSRN 63.1 86.5 92.1 47.1 75.3 83.8 6.3 14.9 21.4 4.2 11.4 16.6
11.7 30.1 40.2 9.2 23.7 32.8 14.3 34.9 46.2 9.53 26.2 37.2
62.5 86.1 92.3 48.1 76.8 84.3 19.6 41.9 53.1 13.9 32.8 43.8
64.9 88.0 93.2 49.0 76.9 84.9 8.21 18.6 25.4 5.56 14.0 19.5
60.7 85.2 90.4 45.7 73.9 81.8 18.7 38.6 50.1 12.4 30.0 41.2
STARNet 63.9 86.9 92.4 48.6 76.7 84.7 6.79 15.5 21.6 4.6 12.1 17.5
13.3 29.6 39.6 9.8 24.5 34.1 28.7 53.7 65.1 19.8 40.1 51.6
62.4 85.8 92.1 47.1 76.1 84.1 24.0 48.9 60.7 17.3 37.9 49.8
63.2 87.2 92.5 49.5 78.1 85.2 7.5 17.5 25.1 5.2 13.6 19.5
67.5 88.1 93.6 50.7 78.0 85.4 29.5 53.8 65.3 20.8 42.9 53.6
Table 6: Quantitative comparison of experimental results of image-to-text and text-to-image retrieval on the Flickr30K (test) and TextCaps (val) sets of supervised models. Metric depicted in terms of Recall@K (R@K).

Scene-Text-only Baselines. Here we discuss additional scene-text baselines applied to our task. As described in the main paper, we first experimented with the GRU (textual embedding) of the cross-modal models to describe the scene text and compare it to the captions. Results are shown in Table 4, rows (5-8). In contrast to the visual case, where VSRN consistently outperformed VSE++, for scene text VSE++ GRU performs better than VSRN GRU. Models trained on Flickr30K + TextCaps also perform better than their counterparts trained on Flickr30K only.

We also experimented with training a GRU for caption-to-scene-text retrieval on Flickr30K. We directly applied the training code of VSE++ to these two modalities (scene text and captions), simulating the scene text of an image as the intersection between two of its captions. The results of this method, called GRU++, are presented in row (9).

Using a GRU trained for cross-modal retrieval (CMR) as a scene-text descriptor has its limitations. The scene text is described with a descriptor learned to represent captions, which is not optimal: for scene text, the order of the words is not as relevant as for a caption, yet since the CMR models use a GRU, the scene-text representation depends on the order in which the words are fed to the model. The FastText+FV baseline aims to address these limitations. FastText [6] uses a larger vocabulary than other Word2Vec-based models and embeds words via word n-grams. In this manner, FastText is a more robust embedding that captures the syntax as well as the semantics of a given word. On top of FastText, a Fisher kernel [42] is employed to aggregate the word embeddings. An additional advantage of this approach is that the scene-text instances are not order dependent, and the only training required is the construction of a Gaussian Mixture Model (GMM) that models the FastText vocabulary distribution. The best performing implementation of the FastText+FV approach is presented in row (11). In row (10), we show a first implementation of this method before lemmatisation and removal of stop words.
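A simplified sketch of this Fisher-vector aggregation, assuming a pre-trained diagonal-covariance GMM and keeping only the gradients with respect to the means (a full Fisher vector also includes weight and covariance terms); all names here are illustrative:

```python
import numpy as np

def fisher_vector(word_embs, means, covs, weights):
    """Simplified Fisher-vector aggregation of word embeddings: gradients
    w.r.t. the GMM means only, with diagonal covariances `covs` (k x d).
    Order-independent by construction, since it sums over words."""
    n, d = word_embs.shape
    # soft assignment of each word embedding to each Gaussian component
    diff = word_embs[:, None, :] - means[None, :, :]         # (n, k, d)
    log_p = -0.5 * np.sum(diff ** 2 / covs[None], axis=2)    # unnormalised log-lik.
    log_p += np.log(weights)[None, :]
    p = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    gamma = p / p.sum(axis=1, keepdims=True)                 # (n, k) posteriors
    # gradient w.r.t. the means, normalised by n and the mixture weights
    grad = (gamma[:, :, None] * diff / np.sqrt(covs)[None]).sum(axis=0)  # (k, d)
    fv = (grad / (n * np.sqrt(weights)[:, None])).ravel()
    # power + L2 normalisation, standard post-processing for Fisher vectors
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    return fv / (np.linalg.norm(fv) + 1e-12)
```

In the baseline described above, `word_embs` would hold the FastText vectors of the scene-text words, and the GMM would be fit once on the FastText vocabulary distribution.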

Finally, we show results for the two best models (two different flavors of VSE++ GRU) when using OCR predictions from [18] in rows (5') and (6'). These models are also used in combination with visual-only baselines in rows (19-21), (34-36) and (41-43). We observe a considerable decline in performance between (5) and (5'), and between (6) and (6'), which can be attributed to errors in the OCR predictions. Indeed, COCO-Text is a very challenging dataset for scene-text recognition due to its many small bounding boxes, and CTC inherits these annotations. These results highlight the importance of good scene-text recognition for StacMR. When comparing combinations to their equivalents with ground-truth annotations, the decline in performance is less pronounced.

Models trained on Flickr30K. In the main paper, we highlighted that the best performance is obtained by cross-modal retrieval models trained on Flickr30K+TextCaps, and we recommend models trained on this combination of datasets for benchmarking on CTC. For completeness, we include here re-ranking results combining models trained on Flickr30K only. Their performance is shown in rows (12-18) using ground-truth scene-text annotations and in rows (19-21) using OCR predictions from [18]. Compared to the models trained on Flickr30K+TextCaps, models trained on Flickr30K obtain similar improvements on CTC-1K and more significant gains on CTC-5K.

In addition, a few hybrid models (where the visual-only model is trained on F30K+TC and the scene-text-only model on F30K) are shown in rows (30-36).

A.2 Performance on TextCaps

To illustrate why TextCaps is not suitable as an evaluation dataset for StacMR, we performed experiments similar to those described in Section 5.1 of the main paper. The main results are shown in Table 5. Here we see that a model trained for cross-modal retrieval with no access to the scene-text information performs better as a scene-text model than as a visual model. This highlights the bias of the dataset towards scene text as its main source of information, with purely visual information coming second.

A.3 Oracle Late Fusion

In addition to providing strong multimodal baselines from separate visual and scene-text models, combination methods are very intuitive to understand. For example, the late fusion score of two models is a linear combination λ·s_V + (1 − λ)·s_ST of the scores s_V and s_ST given by the visual and scene-text models. The hyper-parameter λ corresponds to the best linear combination factor on average over all queries, both images and captions.

A natural extension of the late fusion combination is to make the parameter λ dependent on the values of the image-to-caption similarity s_V and the scene-text-to-caption score s_ST. Based on this extension, we propose an oracle combination method, called oracle late fusion, where the parameter λ is query dependent and hand-picked to optimize the ranking for the query. More precisely, this oracle optimizes the median rank of the first retrieved positive item:

λ*_q = argmin_{λ ∈ [0, 1]} r_q(λ·s_V + (1 − λ)·s_ST),

where r_q denotes the rank of the first retrieved positive item for query q. Given a visual-only and a scene-text-only model, the oracle late fusion provides a theoretical upper bound on the performance of any combination obtained by linearly combining these models. Moreover, we can analyse the values of λ*_q obtained for each query to understand how often a combination prefers the visual model or the scene-text model. Indeed, λ = 1 indicates that, for this query, the visual model is enough and the scene text should be ignored, λ = 0 means that the scene text is enough, and values in between imply a balanced optimal weighting of both modalities.
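A per-query grid-search sketch of this oracle; the grid resolution and the convention that λ = 1 means visual-only are assumptions, and the function name is illustrative:

```python
import numpy as np

def oracle_lambda(sim_v_row, sim_st_row, positives,
                  grid=np.linspace(0.0, 1.0, 101)):
    """Oracle late fusion for one query: pick the lambda on a grid that
    minimises the rank of the first retrieved positive item.
    `positives` holds the indices of the ground-truth matches."""
    best_lam, best_rank = None, None
    for lam in grid:
        scores = lam * sim_v_row + (1.0 - lam) * sim_st_row
        ranking = np.argsort(-scores)
        # 1-based rank of the first positive item in the ranking
        rank = min(int(np.where(ranking == p)[0][0]) for p in positives) + 1
        if best_rank is None or rank < best_rank:
            best_lam, best_rank = lam, rank
    return best_lam, best_rank
```

Running this over every query yields both the oracle's upper-bound retrieval scores and the per-query histogram of optimal λ values.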

We present the performance of the oracle late fusion, evaluated both on CTC and TextCaps, in Table 4, rows (33) and (40), and Table 5, row (10). We observe a considerable improvement compared to the combination methods: for instance, row (39) improves upon row (4) by 4.7, 2.4, 1.4 and -0.3 points, while row (40) beats row (39) by 10.4, 10, 7.8 and 8. More importantly, these theoretical upper bounds show the unexplored potential of combining visual and scene-text information to improve StacMR results. We also provide, for the oracle late fusion of row (40), the histogram of optimal values of λ in Figure 7. We observe that λ = 1 is more common for text queries than for image queries, and more common on CTC-5K than on CTC-1K. Indeed, text queries and CTC-5K queries have a higher probability of a zero-word intersection between the ground-truth scene text and the positive captions than image queries and CTC-1K queries, respectively, which favors λ = 1.

Appendix B The STARNet Model

B.1 Implementation Details

For the supervised baselines, SCAN [27] and VSRN [28] use the same hyperparameters as the corresponding published works and are based on their publicly available code. We modify each of these models so that scene-text instances are treated similarly to visual regions. We expand the original number of visual region inputs to additionally accommodate scene-text instances, yielding a combined total of visual and textual regions. Text instances are sorted according to their confidence value; if no text is present, or fewer instances are detected, we use a zero-padding scheme.
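The confidence-sorted, zero-padded scene-text input described above can be sketched as follows (the instance budget and feature dimensionality are placeholder values; the paper's exact numbers are not reproduced here):

```python
def prepare_scene_text(ocr_feats, max_instances=16, dim=300):
    # ocr_feats: list of (confidence, feature_vector) pairs from an
    # OCR module; keep the highest-confidence instances and zero-pad
    # when fewer than `max_instances` are detected (or none at all)
    ranked = sorted(ocr_feats, key=lambda x: -x[0])[:max_instances]
    feats = [f for _, f in ranked]
    while len(feats) < max_instances:
        feats.append([0.0] * dim)
    return feats
```

This yields a fixed-size block of textual "regions" that can be concatenated with the visual region inputs.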

The proposed supervised model, STARNet, was trained for epochs with a batch size of samples per iteration in each experiment. The learning rate was and was decreased by a factor of every epochs. The visual features are -d. The FastText [6] textual vectors that serve as input to the model are -d and are linearly projected into the same -d feature space as the visual features. We use GCN-based reasoning layers in both the visual and the scene-text GCN to enrich and reason over the visual and scene-text features. The final learned semantic space is -d and is used to project the final image representation and the captions.
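A single GCN-based reasoning step of the kind used to enrich region and scene-text features can be sketched in pure Python as below; the adjacency and weight matrices are illustrative stand-ins, and the actual model's layer sizes and learned parameters differ:

```python
def gcn_layer(feats, adj, weight):
    # one graph-convolution step: aggregate neighbour features through
    # the adjacency matrix, project with `weight`, then apply ReLU
    n, d = len(feats), len(feats[0])
    d_out = len(weight[0])
    agg = [[sum(adj[i][j] * feats[j][k] for j in range(n))
            for k in range(d)] for i in range(n)]
    return [[max(0.0, sum(agg[i][k] * weight[k][m] for k in range(d)))
             for m in range(d_out)] for i in range(n)]
```

Stacking several such layers lets each region's representation absorb context from related regions before the final projection into the joint semantic space.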

In our experiments, when the Flickr30K [59] dataset is employed, we use the same training, validation and testing splits as in [22], which contain , and images, respectively. When using only the TextCaps [46] dataset, the original training set is used and the validation set serves as the evaluation set, since the test set is currently not publicly available. When training the proposed STARNet model, we use the TextCaps validation set to select the best-performing weights.

B.2 Performance on Flickr30K and TextCaps

In Table 6 we compare the performance of our proposed model with SCAN [27] and VSRN [28]. To obtain comparable results, we extract visual region features with our implementation of [3]. The publicly available code for SCAN [27] and VSRN [28] was used to train those models.

The results show that improvements can be achieved by leveraging scene text. It is important to note the effect of the datasets employed in the training procedure. As expected, training on TextCaps alone does not yield good results, since by its nature the dataset focuses only on scene-text instances and the captions describing them. Adding samples from the CTC dataset at training time, however, yields an improvement when evaluating on the TextCaps validation set.

It is also worth noting that for standard cross-modal retrieval models, adding TextCaps training data achieves only a minor improvement (SCAN) or lowers the performance (VSRN) when evaluated on the Flickr30k dataset. A slight improvement is, however, achieved when adding the CTC training set.

In contrast, the proposed model learns to model the interactions between scene-text and visual descriptors and to combine them appropriately. STARNet achieves better performance on both datasets, even though scene text is not widely available in Flickr30k.

Appendix C Dataset Samples

Figure 8 showcases a few samples of image-caption pairs that belong to the full CTC dataset. Figure 9, in turn, depicts image-caption pairs that belong to the explicit set of the CTC dataset; the bold words in the captions refer to scene text appearing in the image. Note that scene text provides strong cues to better discriminate each image. Leveraging scene text can thus provide important complementary information for language-and-vision tasks such as cross-modal retrieval.

Appendix D Qualitative Results

In Figure 10 we illustrate qualitative results for image-to-text cross-modal retrieval. Text contained within an image usually serves as a discriminative signal, such as the word "samsung" in the third image and the number "15" in the fifth query. Scene text also provides a strong complementary cue to be used along with visual features, as the rest of the queried samples suggest.

It is important to note that, even though some retrieved samples are not entirely correct, the model still preserves the semantics between the image and the retrieved captions.

We illustrate in Figure 11 the results obtained for text-to-image cross-modal retrieval. In the queries performed, scene text works as fine-grained, discriminative information for correctly retrieving an image. As in the previous scenario, wrongly retrieved samples still preserve semantics.

The qualitative results, together with the quantitative tables in the previous sections, reinforce the notion that modelling scene text along with visual features improves retrieval granularity, thus yielding higher-performing cross-modal retrieval pipelines.

Figure 7: Histogram of λ values for oracle late fusion, row (36) of Table 4. Blue histograms show the oracle for CTC-1k, green histograms for CTC-5k.
Image Captions
A blue bus at a bus stop with its doors open.
A bus with its doors open is waiting at a bus stop.
A bus sits parked on the side of a street.
A picture of a bus on the side of the street.
The blue and white trolley is waiting on passengers.
A woman, man and two dogs in an inflatable raft on some water.
The two ladies are in the row boat.
Three people in a raft on the lake.
A boat with people on it with a dog in water with a goose in it.
Man and woman with two dogs on a power boat on a lake.
A train on the tracks with people standing and walking by it
A crowd of people are walking in front of a train
A stopped train at a train crossing with people crossing the tracks.
A black train parked at a train station as people walk across the train tracks.
People at a train station, gathering around a black locomotive.
A man holding a tennis racquet on a court.
A man swinging a tennis racket during a tennis match.
A tennis player in mid air action on the court.
A tennis player about to serve the ball as a small crowd looks on.
A tennis player is in the air making an overhead swing.
A red double decker bus on street next to building.
A bus that is driving in the street.
A ride double-decker bus stands out against a black and white background.
A double decker bus with few passengers turning at a corner.
A red double decker bus driving down a city street.
Figure 8: Image-caption pairs taken from the full proposed CTC dataset, in which the appearing scene text has no semantic relation with the annotated captions: scene text and captions share no common words.
Image Captions
An emergency response person is on a motorcycle.
A medical person riding a motorcycle with ambulance on back.
A police officer on a motorcycle pulling over a black car.
A police motorcycle gets down to business when someone speeds.
A motorcycle with a sign on the back that says ambulance.
A China Airlines Airplane sitting on a waiting area of an airport.
A big commuter plane sits parked in a air port.
A China Airlines airliner is parked at an airport near another jet.
Some white red and blue jets at an airport.
China airplane airline is parked at a dock.
A motorcycle parked in a parking lot next to a car.
An antique Indian motorcycle is parked next to the sidewalk.
Motorcycle parked on the edge of a street.
An old Indian motorcycle parked at the curb of a street.
A motorcycle parked on a sidewalk next to a street.
Looks like a portrait of a distinguished gentleman.
A painting of Walter Camp, siting on bench.
A painting of a man in brown jacket and hat sitting at a bench.
This a painting of Walter Camp in a trench coat.
A painting of an older man on a city bench holding a rolled up magazine.
A professional baseball player standing on the field while holding a mitt.
A baseball player wearing a catchers mitt on top of a field.
A Twins baseball player holding his glove walking on the field.
The pitcher is resigned to losing the important game.
A Twins baseball player walking to the dugout.
Figure 9: Image-caption pairs from the proposed CTC explicit dataset, where the scene text and the captions have at least one word in common (marked in bold).
Queried Image Retrieved Captions
Clock at a train station showing the time of the next trains arrival. ✓
A clock with the words next train written about it. ✓
A clock on a train platform during day time.
A clock attached to a pole at a train station.
A clock that is sitting on the side of the pole. ✓
A large number of police motorcycles are lined up.
A bunch of police officers on motorcycles waiting for something. ✓
A group of police officers that are riding on motorcycles.
A police on motorcycles are parked beside a crowd.
A line of police are riding motorcycles down the street.
People riding on the upper level of a samsung bus in a parade. ✓
A blue tow truck carrying a boat.
A blue truck is pulling a white boat.
A police vehicle on a tow truck that is being taken away.
A group of police standing at the back of a moving truck.
A tall lighthouse sign with a clock on the tower of a plaza. ✓
A tall church building with a massive clock on front of it.
A modern clock tower is embellishing a market which sits beneath a clear blue sky. ✓
Tall tower with clock near well lit building at night.
A large tower that has a clock on the very top of it.
Two woman near the interstate 15 sign in las vegas. ✓
Two women standing on a sidewalk next to a street sign at night while cars drive on the street next to them and behind them. ✓
Two young ladies standing on the sidewalk under a street sign. ✓
Two people standing on a street with a street sign. ✓
Two women on street next to cars and traffic signs. ✓
Figure 10: Qualitative samples obtained when an image is used as a query (Image to Text) in the proposed CTC explicit dataset. Correct results are marked with ✓. Incorrect results are marked with ✗. Reasonable mismatches are semantically close but are still marked with a ✗.
Query 1: A marc passenger drains rides along railroad tracks.
Query 2: Sign explaining how to park on a hill is posted on the street.
Query 3: Commuter shuttle bus on roadway in large city.
Query 4: A china airlines airliner is parked at an airport near another jet.
Figure 11: Qualitative samples when a caption is used as a query (Text to Image) in the proposed CTC explicit dataset. Correct results are marked with a green box; incorrect results with a red box. Words in bold in the queried captions depict the scene text that helps discriminate the retrieved images, which are otherwise ambiguous. Query 1 contains an annotator typo, "drains".