
A Comprehensive Empirical Study of Vision-Language Pre-trained Model for Supervised Cross-Modal Retrieval

01/08/2022
by   Zhixiong Zeng, et al.

Cross-Modal Retrieval (CMR) is an important research topic across multimodal computing and information retrieval, which takes one type of data as the query to retrieve relevant data of another type. It has been widely used in many real-world applications. Recently, vision-language pre-trained models represented by CLIP have demonstrated their superiority in learning visual and textual representations and achieved impressive performance on various vision and language related tasks. Although CLIP as well as previous pre-trained models have shown great performance improvement on unsupervised CMR, the performance and impact of these pre-trained models on supervised CMR have rarely been explored, due to the lack of a common representation for multimodal class-level associations. In this paper, we take CLIP as the current representative vision-language pre-trained model to conduct a comprehensive empirical study. We evaluate its performance and impact on supervised CMR, and attempt to answer several key research questions. To this end, we first propose a novel model, CLIP4CMR (CLIP enhanced network for Cross-Modal Retrieval), that employs the pre-trained CLIP as the backbone network to perform supervised CMR. Then, by means of the CLIP4CMR framework, we revisit the design of different learning objectives in current CMR methods to provide new insights on model design. Moreover, we investigate the aspects of most practical concern in applying CMR, including the robustness to modality imbalance and the sensitivity to hyper-parameters, to provide new perspectives for practical applications. Through extensive experiments, we show that CLIP4CMR achieves state-of-the-art results with prominent improvements on the benchmark datasets, and can be used as a fundamental framework to empirically study the key research issues of supervised CMR, with significant implications for model design and practical considerations.


1. Introduction

With the explosive increase of multimodal data on social media platforms, cross-modal retrieval (CMR) has become an emergent need for people to acquire relevant images and texts conveniently. CMR is a fundamental task across multimodal computing and information retrieval, which takes a query in one modality to retrieve relevant data of another modality. It not only lays the basis for multimodal visual and language processing, analysis and understanding, but also facilitates a number of applications in domains such as image retrieval (Xia et al., 2014), image captioning (Vinyals et al., 2015), recipe recommendation (Carvalho et al., 2018), automatic story generation (Li et al., 2020a) and so forth.

The aim of cross-modal retrieval is to establish similarity links between samples from different modalities based on their semantic correlation. Existing research can be broadly categorized into two groups: unsupervised CMR for paired multimodal data and supervised CMR for labeled multimodal data. The unsupervised CMR (also called cross-modal matching) methods center on the design of explainable vision and language reasoning networks to learn the cross-modal semantic alignment, which gracefully aggregate word-level and region-level fine-grained similarities into a cross-modal similarity to perform the retrieval task (Chen et al., 2020; Huang et al., 2017; Song and Soleymani, 2019; Wang et al., 2019; Xu et al., 2015; Zhang et al., 2020). As items from one modality usually have multiple semantically related items in another modality, the supervised CMR methods center on designing effective loss functions to preserve the multimodal class-level semantic associations (i.e., the modality invariance and semantic discrimination) of the common representation space (Wang et al., 2017, 2015; Wu et al., 2018; Zeng et al., 2021b; Zhen et al., 2019; Zheng et al., 2016). Because multiple related samples across different modalities are ubiquitous in reality, we focus on the supervised CMR in this paper, and use cross-modal matching and cross-modal retrieval to refer to the unsupervised and supervised CMR, respectively.

Inspired by the great success of self-supervised pre-trained language models (Devlin et al., 2018; Liu et al., 2019), a large number of vision-language pre-trained (VLP) models (Tan and Bansal, 2019; Lu et al., 2019; Su et al., 2019; Chen et al., 2019; Li et al., 2020b; Desai and Johnson, 2021) have been developed that learn vision-language semantic alignments and are fine-tuned on downstream tasks. Recently, CLIP (Contrastive Language-Image Pre-training) (Radford et al., 2021), pre-trained on 400 million noisy image-text pairs from the web, has demonstrated impressive performance on various downstream vision and language related tasks. The VLP models represented by CLIP are profoundly reshaping the cross-modal field (Cao et al., 2020), and their superiority on cross-modal tasks is increasingly recognized (Shin et al., 2021). Although the VLP models have been successfully fine-tuned for unsupervised cross-modal matching, their performance and impact on the supervised CMR have not been investigated, because these pre-trained models cannot be directly applied to the supervised CMR, which requires common representations that capture the more complex multimodal class-level associations.

In this paper, we conduct an empirical study of the vision-language pre-trained model for cross-modal retrieval. The first important research question raised for our empirical study is: can CLIP boost the performance of the CMR task, and why? To explore this, we propose a model named CLIP4CMR, which takes the pre-trained CLIP as the backbone network. To generate the common representation space, CLIP4CMR exploits the pre-trained CLIP as the visual and textual encoders and then employs modality-specific multilayer perceptrons for cross-modal retrieval. Existing CMR methods rely heavily on the design of learning objectives, but due to the diversity of model architectures, parameter choices and training protocols, previous research fails to supply a fair comparison vehicle for evaluating the learning objectives designed in the existing models. The CLIP4CMR framework provides a unified common ground for such fair comparison. The second important research question raised for our empirical study is: how does the design of different learning objectives (and their combinations) influence the retrieval results? By means of CLIP4CMR, we are able to revisit the existing learning objectives, including the widely used pair-wise losses, more recent class-wise losses and hybrid ones that combine pair-wise and class-wise losses, and assess their comparative performance in the same experimental setting.

In addition, we consider the practical applications of CMR models. Benefiting from CLIP's abundant multimodal knowledge obtained from extra pre-training data, we would like to investigate the practical aspects of applying a cross-modal retrieval model built on CLIP. The third important research question raised for our empirical study is: how does the CMR model built on CLIP perform in practical situations? Two key practical issues arise here: the robustness to modality imbalance (Zeng et al., 2021b) and the sensitivity to hyper-parameters (Luo et al., 2021); therefore, the above research question is broken down into two sub-questions. The robustness to modality imbalance has attracted much attention recently due to the discrepancies in data collection and annotation effort between different modalities in real-world applications. To alleviate this problem, previous models mainly rely on semantic consistency and modality heterogeneity to reconstruct modality-balanced data for improving robustness (Zeng et al., 2021b, a; Jing et al., 2020). The sensitivity to hyper-parameters is related to evaluating the scalability of a CMR model in real-world situations. In particular, the dimensionality of the common representation space is a crucial hyper-parameter for analyzing the storage and time efficiency of cross-modal retrieval, as the pre-calculated image and text representations are usually used for similarity ranking during the test phase. Previous studies have shown that the performance of retrieval models in a more compact representation space is worse due to the loss of part of the representation information (Roth et al., 2020; Kim et al., 2021). With the new perspective brought in by CLIP, these issues need to be reexamined.

Through developing CLIP4CMR, this paper proposes the first supervised CMR framework built on the vision-language pre-trained model. Our empirical study based on CLIP4CMR contributes to cross-modal retrieval field in providing the following insights:

  • Benefiting from the improvement of intra-class compactness, CLIP4CMR can significantly facilitate the cross-modal retrieval task and serve as a promising new baseline.

  • Under the unified experimental setting based on CLIP4CMR, the currently widely-used hybrid losses that combine pair-wise and class-wise losses bring no obvious performance gains over applying the class-wise loss alone.

  • The cross-modal retrieval model built on CLIP can markedly improve the robustness to modality imbalance, and suffers only a small performance degradation even in some extremely modality-imbalanced cases.

  • The cross-modal retrieval model built on CLIP is almost insensitive to changes in the dimensionality of the common representation space, and can still maintain relatively high performance in a very compact representation space.

Figure 1. Overall architecture of the proposed CLIP4CMR. We leverage CLIP's visual encoder (i.e., CLIP_v) and textual encoder (i.e., CLIP_t) to generate the original image and text representations, and employ modality-specific multilayer perceptrons (MLPs) to learn the common representation space. We then revisit the existing pair-wise and class-wise losses to provide insights on applying CLIP for supervised cross-modal retrieval.

2. Related Work

Our work focuses on applying the vision-language pre-trained model to the cross-modal retrieval task. Below we review cross-modal retrieval methods and vision-language pre-trained models.

2.1. Cross-modal Retrieval

The key challenge of cross-modal retrieval is to bridge the heterogeneity gap and learn transformation functions to project multimodal data into a common representation space, such that the cross-modal retrieval task boils down to the familiar nearest neighbor retrieval in the embedding space (Chun et al., 2021). Existing cross-modal retrieval methods can be broadly categorized into two groups: unsupervised methods for paired multimodal data and supervised methods for labeled multimodal data. The unsupervised methods focus on designing explainable vision and language reasoning networks to learn the cross-modal semantic alignment, which gracefully aggregate word-level and region-level fine-grained similarities into a cross-modal similarity to perform the retrieval task (Chen et al., 2020; Huang et al., 2017; Song and Soleymani, 2019; Wang et al., 2019; Xu et al., 2015; Zhang et al., 2020). The supervised methods focus on preserving the multimodal class-wise associations of the common representation space, so that items of the same class but from different modalities are closely grouped together (Wang et al., 2017; Zeng et al., 2021b; Zhen et al., 2019; Wang et al., 2016b; Wu et al., 2017; Zeng et al., 2021a; Wang et al., 2015; Wu et al., 2020).

The multimodal class-wise associations are mainly preserved by the learning objectives used to train the networks, including the widely used pair-wise losses, more recent class-wise losses and hybrid ones that combine pair-wise and class-wise losses. The pair-wise loss provides rich class-level supervisory signals for learning the common representation space by comparing fine-grained intra-class and inter-class relations between items from different modalities, i.e., cross-modal data-to-data relations. A typical pair-wise loss is the modality invariant loss, which maximizes the intra-class similarities between items from different modalities (Hardoon et al., 2004; Feng et al., 2014; Wang and Livescu, 2015; Wang et al., 2015). Inspired by the success of deep metric learning in learning discriminative representations (Hu et al., 2014; Bellet et al., 2013), recent methods calculate the contrastive loss or semi-hard triplet loss on the multimodal data, thereby maximizing the similarity of intra-class multimodal pairs and minimizing that of inter-class multimodal pairs (Wang et al., 2017; Zhen et al., 2019; Peng et al., 2018; Zeng et al., 2020). In contrast, the class-wise loss leverages multimodal shared class proxies for learning the common representation space by comparing samples with class proxies, i.e., data-to-proxy relations. Seminal examples are the linear regression loss (Wu et al., 2017; Wang et al., 2015; Zhen et al., 2019) and the cross-entropy loss (Peng et al., 2016; Wang et al., 2017; Peng and Qi, 2019), which project image and text samples into a shared label space to preserve class-level associations. Since the classification rule with a softmax output layer lacks robustness to unknown classes (Yang et al., 2018), the prototype contrastive loss has been proposed to improve robustness by pulling samples towards the prototype of their class and pushing samples away from the prototypes of other classes (Zeng et al., 2021a, b). In fact, most of the existing methods (Wang et al., 2017, 2015; Wu et al., 2017; Zhen et al., 2019; Peng and Qi, 2019) follow the paradigm of optimizing hybrid losses that combine pair-wise and class-wise losses to maximize information utilization, but fail to provide a fair comparison vehicle for evaluating the loss functions designed in these existing methods.

2.2. Vision-Language Pre-trained Models

Recently, self-supervised pre-trained language models such as BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019) and GPT-2 (Radford et al., 2019) have pushed the state of the art on a wide range of NLP tasks. There are two keys to their success: effective pre-training tasks over large-scale language corpora, and the utilization of the Transformer (Vaswani et al., 2017) for learning contextualized text representations (Chen et al., 2019). Inspired by the success of pre-trained language models, a large number of Transformer-based vision-language pre-trained (VLP) models (Tan and Bansal, 2019; Lu et al., 2019; Su et al., 2019; Chen et al., 2019; Li et al., 2020b; Desai and Johnson, 2021; Geigle et al., 2021; Sun et al., 2021; Radford et al., 2021) have been developed to build the multimodal counterpart that learns vision-language semantic alignments, bringing about great advances on downstream multimodal tasks like cross-modal retrieval.

Exemplary VLP models can be categorized into cross-encoder based and embedding based methods (Geigle et al., 2021). The cross-encoder based methods (Tan and Bansal, 2019; Lu et al., 2019; Su et al., 2019; Chen et al., 2019; Li et al., 2020b) apply a cross-attention mechanism built on Transformer-based neural architectures to compute the similarity score between items from different modalities. The embedding based methods encode multimodal items separately to generate high-dimensional visual and textual representations, and utilize standard distance metrics to compute the cross-modal similarity (Desai and Johnson, 2021; Geigle et al., 2021; Sun et al., 2021; Radford et al., 2021). More recently, CLIP (Radford et al., 2021), which employs the embedding-based architecture and is pre-trained on 400 million noisy image-text pairs from the web, achieves impressive performance on many downstream vision and language related tasks. The great success of CLIP comes from the generality and usability learned from hundreds of millions of raw image and text data. It has inspired growing interest in empirical studies that explore the impact of CLIP on video retrieval (Luo et al., 2021), visual question answering and visual entailment (Shen et al., 2021).

3. The Unified Framework

Figure 1 illustrates the unified framework of applying the vision-language pre-trained model for cross-modal retrieval, which consists of the design of the CLIP4CMR model and the learning objectives.

3.1. Design of CLIP4CMR

Without losing generality, we focus on cross-modal retrieval for image and text. Suppose that we have a collection of N instances of image-text pairs, denoted as D = {(x_i^v, x_i^t)}_{i=1}^N, where x_i^v is the input image sample and x_i^t is the input text sample. Each pair is assigned a semantic label y_i ∈ {1, 2, …, C}, where C is the number of semantic categories.

Inspired by the superiority of CLIP in learning visual and textual representations, we utilize the model architecture of CLIP to perform cross-modal retrieval. The model architecture of CLIP consists of a visual encoder for the image modality and a textual encoder for the text modality. The visual encoder takes the form of a convolutional neural network like ResNet-50 (He et al., 2016) or a vision transformer like ViT (Dosovitskiy et al., 2020), and is pre-trained by a broad source of textual supervision to learn low-dimensional image representations. The textual encoder is built on top of a Transformer (Vaswani et al., 2017), and is pre-trained by a broad source of visual supervision to learn low-dimensional text representations. We employ the pre-trained CLIP to generate image and text representations, which can be formulated as:

h_i^v = \mathrm{CLIP}_v(x_i^v), \qquad h_i^t = \mathrm{CLIP}_t(x_i^t)    (1)

where h_i^v and h_i^t are the generated image and text representations of the same dimensionality, and CLIP_v and CLIP_t denote the visual encoder and textual encoder of CLIP, respectively. However, it may be unreasonable to directly apply the representation space generated by CLIP for cross-modal retrieval, as CLIP pre-trained by the self-supervised task fails to capture the more complex class-level semantic discrimination. Thus we deploy modality-specific multilayer perceptrons to generate a common representation space as in most existing work (Zeng et al., 2021a, b), which can be formulated as:

u_i = W_2^v \,\sigma(W_1^v h_i^v + b_1^v) + b_2^v    (2)
v_i = W_2^t \,\sigma(W_1^t h_i^t + b_1^t) + b_2^t    (3)

where σ denotes the GeLU (Hendrycks and Gimpel, 2016) activation function, W_1^v, b_1^v, W_2^v, b_2^v, W_1^t, b_1^t, W_2^t and b_2^t are the trainable parameters, u_i and v_i are the projected features in the common representation space, and d is the dimensionality of the common representation space. To prevent the divergence of the magnitudes, we apply an l2-normalization layer to output the normalized representations.
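To make the architecture concrete, the following PyTorch sketch shows one plausible way to wire modality-specific MLP heads in the spirit of Eq. (2)-(3) on top of the pre-trained CLIP encoders of Eq. (1). It assumes the openai/CLIP package (github.com/openai/CLIP); the class name, hidden width and default output dimension are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip  # openai/CLIP package (assumed available); provides clip.load and clip.tokenize


class CLIP4CMR(nn.Module):
    """Sketch of Eq. (1)-(3): CLIP encoders plus modality-specific MLP heads."""

    def __init__(self, d=1024, hidden=2048, backbone="RN50", device="cpu"):
        super().__init__()
        # Pre-trained CLIP visual and textual encoders (Eq. 1).
        self.clip, _ = clip.load(backbone, device=device)
        feat_dim = self.clip.visual.output_dim  # dimensionality of CLIP's joint embedding
        # Modality-specific two-layer perceptrons with GeLU (Eq. 2-3); hidden width is an assumption.
        self.img_head = nn.Sequential(nn.Linear(feat_dim, hidden), nn.GELU(), nn.Linear(hidden, d))
        self.txt_head = nn.Sequential(nn.Linear(feat_dim, hidden), nn.GELU(), nn.Linear(hidden, d))

    def forward(self, images, tokenized_texts):
        h_v = self.clip.encode_image(images).float()         # original image representations
        h_t = self.clip.encode_text(tokenized_texts).float()  # original text representations
        u = F.normalize(self.img_head(h_v), dim=-1)  # l2-normalized common-space image features
        v = F.normalize(self.txt_head(h_t), dim=-1)  # l2-normalized common-space text features
        return u, v
```

A text query would be prepared with `clip.tokenize(["a photo of ..."])` and an image with CLIP's own preprocessing transform before being passed to `forward`.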

3.2. Learning Objectives

3.2.1. Pair-wise loss

The pair-wise loss provides rich supervisory signals for learning the common representation space by comparing fine-grained intra-class and inter-class relations between samples from different modalities, i.e., cross-modal data-to-data relations. A seminal pair-wise loss for cross-modal retrieval is the contrastive loss, which minimizes the distances of positive image-text pairs belonging to the same class and pushes the distances of negative pairs to be larger than a margin (Wang et al., 2015; Wu et al., 2017; Peng et al., 2017). Given a batch of N image-text pairs, it can be formulated as:

L_{cl} = \frac{1}{N^2}\sum_{i,j=1}^{N}\Big[\, y_{ij}\, d(u_i, v_j) + (1 - y_{ij})\,\max\big(0,\ m - d(u_i, v_j)\big) \Big]    (4)

where d(·,·) denotes the squared Euclidean distance, m denotes the distance margin, and the label y_ij indicates whether the image-text pair (x_i^v, x_j^t) belongs to the same class or not. Some early cross-modal retrieval methods (Feng et al., 2014; Zhai et al., 2013) only consider the optimization of positive image-text pairs in Equation (4), which was called the modality-invariant loss in subsequent work (Zhen et al., 2019). Another popular pair-wise loss for cross-modal retrieval is the triplet loss, which encourages the distances of positive image-text pairs to be smaller than those of negative pairs by a margin (Peng et al., 2018; Wang et al., 2017). Given a batch of image-text pairs, it can be formulated as:

L_{tl} = \frac{1}{|T_v|}\sum_{(u_i,\, v_j^{+},\, v_k^{-}) \in T_v} \max\big(0,\ m + d(u_i, v_j^{+}) - d(u_i, v_k^{-})\big) + \frac{1}{|T_t|}\sum_{(v_i,\, u_j^{+},\, u_k^{-}) \in T_t} \max\big(0,\ m + d(v_i, u_j^{+}) - d(v_i, u_k^{-})\big)    (5)

where T_v denotes the set of triplets that take an image u_i as the anchor and select a positive text v_j^+ and a negative text v_k^-, T_t denotes the set of triplets that take a text v_i as the anchor and select a positive image u_j^+ and a negative image u_k^-, and |T_v| and |T_t| are their cardinalities.
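As a concrete illustration, the PyTorch sketch below implements batch versions of the contrastive and triplet losses in the spirit of the reconstructed Eq. (4) and Eq. (5); the squared Euclidean distance, the all-pairs/all-triplets enumeration and the default margin of 0.2 are our assumptions for illustration rather than the exact formulation used in the paper.

```python
import torch
import torch.nn.functional as F


def contrastive_loss(u, v, labels, margin=0.2):
    """Pair-wise contrastive loss over a batch (cf. Eq. 4): pull same-class
    image-text pairs together and push different-class pairs beyond the margin."""
    d2 = torch.cdist(u, v).pow(2)                        # squared Euclidean distances d(u_i, v_j)
    same = (labels[:, None] == labels[None, :]).float()  # y_ij: 1 if pair (i, j) shares a class
    return (same * d2 + (1 - same) * F.relu(margin - d2)).mean()


def triplet_loss(u, v, labels, margin=0.2):
    """Pair-wise triplet loss over a batch (cf. Eq. 5), enumerating all valid
    cross-modal triplets in both retrieval directions."""
    def directional(anchor, other):
        d2 = torch.cdist(anchor, other).pow(2)
        same = labels[:, None] == labels[None, :]
        diff = d2.unsqueeze(2) - d2.unsqueeze(1)          # d(a_i, pos_j) - d(a_i, neg_k)
        valid = same.unsqueeze(2) & (~same).unsqueeze(1)  # j is a positive, k is a negative
        return F.relu(diff + margin)[valid].mean()
    return directional(u, v) + directional(v, u)
```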

MAP Wikipedia Pascal-Sentence NUS-WIDE XmediaNet
I2T T2I Avg. I2T T2I Avg. I2T T2I Avg. I2T T2I Avg.
CCA (Hardoon et al., 2004) 0.298 0.273 0.286 0.203 0.208 0.206 0.167 0.181 0.174 0.212 0.217 0.215
KCCA (Wang and Livescu, 2015) 0.438 0.389 0.414 0.488 0.446 0.467 0.351 0.356 0.354 0.252 0.270 0.261
Corr-AE (Feng et al., 2014) 0.442 0.429 0.436 0.532 0.521 0.527 0.441 0.494 0.468 0.469 0.507 0.488
JRL (Zhai et al., 2013) 0.479 0.428 0.454 0.563 0.505 0.534 0.466 0.499 0.483 0.488 0.405 0.447
CMDN (Peng et al., 2016) 0.487 0.427 0.457 0.544 0.526 0.535 0.492 0.542 0.517 0.485 0.516 0.501
JFSSL (Wang et al., 2015) 0.458 0.426 0.442 0.553 0.542 0.548 0.514 0.523 0.519 0.525 0.518 0.521
ACMR (Wang et al., 2017) 0.468 0.412 0.440 0.538 0.544 0.541 0.519 0.542 0.531 0.536 0.519 0.528
JLSLR (Wu et al., 2017) 0.473 0.440 0.456 0.568 0.551 0.560 0.536 0.531 0.534 0.544 0.553 0.549
MCSM (Peng et al., 2018) 0.516 0.458 0.487 0.598 0.598 0.598 0.522 0.546 0.534 0.540 0.550 0.545
CCL (Peng et al., 2017) 0.505 0.457 0.481 0.576 0.561 0.569 0.506 0.535 0.521 0.537 0.528 0.533
CM-GANS (Peng and Qi, 2019) 0.521 0.466 0.494 0.603 0.604 0.604 0.536 0.551 0.543 0.567 0.551 0.559
PAN (Zeng et al., 2021b) 0.517 0.462 0.489 0.686 0.689 0.688 0.590 0.571 0.581 0.669 0.660 0.665
DSCMR (Zhen et al., 2019) 0.521 0.478 0.499 0.674 0.682 0.678 0.611 0.615 0.613 0.697 0.693 0.695
MCCN (Zeng et al., 2021a) 0.552 0.487 0.520 0.681 0.686 0.683 - - - 0.741 0.743 0.742
CLIP4CMR 0.592 0.574 0.583 0.698 0.692 0.695 0.609 0.621 0.615 0.746 0.758 0.752
  • Two-stage approach, which uses the training data to train pre-classified visual and textual encoders before performing cross-modal retrieval.

  • Results reproduced using the test samples of the Pascal-Sentence dataset following (Wang et al., 2017). The average mAP of our method is when following the dataset split of MCCN (Zeng et al., 2021a), but their test set is too small for effective evaluation.

Table 1. Performance comparison in terms of mAP on four widely-used benchmark datasets for cross-modal retrieval.

3.2.2. Class-wise loss

The class-wise loss leverages multimodal shared class proxies for learning the common representation space by comparing samples with class proxies, i.e., data-to-proxy relations. A seminal example is the linear regression loss, which can be formulated as (Wu et al., 2017; Zhen et al., 2019; Wang et al., 2015):

L_{lrl} = \frac{1}{N}\sum_{i=1}^{N}\Big( \big\|P^{\top} u_i - e_{y_i}\big\|_F^2 + \big\|P^{\top} v_i - e_{y_i}\big\|_F^2 \Big)    (6)

where ||·||_F denotes the Frobenius norm, P is the projection matrix of the linear classifier, and e_{y_i} is the one-hot label vector whose y_i-th element is 1 and whose other elements are 0. Each column of the projection matrix represents a class proxy, which provides a unified anchor to pull together all images and texts belonging to the same class. To exploit the nonlinearity of the label space, another popular class-wise loss is the cross-entropy loss, calculated by (Wang et al., 2017; Peng and Qi, 2019):

L_{cel} = -\frac{1}{N}\sum_{i=1}^{N}\Big( \log\frac{\exp(w_{y_i}^{\top} u_i + b_{y_i})}{\sum_{j=1}^{C}\exp(w_j^{\top} u_i + b_j)} + \log\frac{\exp(w_{y_i}^{\top} v_i + b_{y_i})}{\sum_{j=1}^{C}\exp(w_j^{\top} v_i + b_j)} \Big)    (7)

where w_j and b_j denote the j-th column of the weight matrix and the j-th element of the bias vector of the shared classification layer. Here the layer parameters w_j and b_j can be regarded as a class proxy with a bias term. However, this classification rule with a softmax output layer lacks robustness to unknown classes (Yang et al., 2018). To improve the robustness of cross-modal retrieval, the recent work PAN (Zeng et al., 2021b) assigns a set of unified prototypes as class proxies and adopts the nearest-prototype classification rule to infer unknown classes. The multimodal representations and prototypes are jointly learned through a prototype contrastive loss:

L_{pcl} = -\frac{1}{N}\sum_{i=1}^{N}\Big( \log\frac{\exp(\tau\, u_i^{\top} p_{y_i})}{\sum_{j=1}^{C}\exp(\tau\, u_i^{\top} p_j)} + \log\frac{\exp(\tau\, v_i^{\top} p_{y_i})}{\sum_{j=1}^{C}\exp(\tau\, v_i^{\top} p_j)} \Big)    (8)

where p_j denotes the prototype of the j-th class and τ is a scaling factor.
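For the class-wise side, a hedged sketch of a prototype-contrastive objective in the spirit of Eq. (8) is given below; the prototype initialization and the default value of the scaling factor are illustrative assumptions, since the value used in the paper is not given in this text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PrototypeContrastiveLoss(nn.Module):
    """Class-wise loss sketch (cf. Eq. 8): learn one l2-normalized prototype per
    class and pull image/text features toward the prototype of their class."""

    def __init__(self, num_classes, d, scale=10.0):  # 'scale' is a placeholder for the scaling factor
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_classes, d))
        self.scale = scale

    def forward(self, u, v, labels):
        p = F.normalize(self.prototypes, dim=-1)
        logits_img = self.scale * u @ p.t()   # data-to-proxy similarities for image features
        logits_txt = self.scale * v @ p.t()   # data-to-proxy similarities for text features
        return F.cross_entropy(logits_img, labels) + F.cross_entropy(logits_txt, labels)
```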

3.2.3. Hybrid loss

To utilize both data-to-data and data-to-proxy relations and maximize information utilization, most of the existing methods (Wang et al., 2017, 2015; Wu et al., 2017; Zhen et al., 2019; Peng and Qi, 2019) follow the paradigm of optimizing hybrid losses that combine class-wise and pair-wise losses. Generally, the hybrid loss can be formulated as:

L_{hybrid} = L_{cls} + \lambda\, L_{pair}    (9)

where L_{cls} is a class-wise loss, L_{pair} is a pair-wise loss, and λ is a carefully selected combination weight.
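Combining the two families as in Eq. (9) is then a one-line operation; the default weight below is purely illustrative, and Section 4.3 sweeps this weight.

```python
import torch


def hybrid_loss(class_loss: torch.Tensor, pair_loss: torch.Tensor, lam: float = 0.1) -> torch.Tensor:
    """Hybrid objective (cf. Eq. 9): a class-wise loss plus a weighted pair-wise loss."""
    return class_loss + lam * pair_loss
```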

Figure 2. The distributions of the intra-class and inter-class image-text distances across different modalities in the test set: (a) Wikipedia intra-class distances, (b) Wikipedia inter-class distances, (c) XmediaNet intra-class distances, (d) XmediaNet inter-class distances.

4. Experiments

4.1. Experimental Setup

4.1.1. Datasets

To verify the effectiveness of our proposed method, we conduct our empirical study on four widely-used benchmark datasets, namely Wikipedia (Rasiwasia et al., 2010), Pascal-Sentence (Rashtchian et al., 2010), NUS-WIDE (Chua et al., 2009) and XmediaNet (Peng et al., 2018). For the Wikipedia dataset, we use 2,157 image-text pairs from 10 semantic classes for training and 462 image-text pairs for test. For the Pascal-Sentence dataset, we use 800 image-text pairs from 20 classes for training and 200 image-text pairs for test. For the NUS-WIDE dataset, we use 8,000 image-text pairs from 10 classes for training and 1,000 image-text pairs for test. For the XmediaNet dataset, we use 32,000 image-text pairs from 200 classes for training and another 4,000 image-text pairs for test. The dataset splits mainly follow those in (Zhen et al., 2019; Wang et al., 2017).

MAP Wikipedia Pascal-Sentence NUS-WIDE XmediaNet
I2T T2I Avg. I2T T2I Avg. I2T T2I Avg. I2T T2I Avg.
Class-wise loss LRL 0.592 0.585 0.588 0.686 0.680 0.683 0.621 0.643 0.632 0.574 0.576 0.575
CEL 0.586 0.565 0.576 0.697 0.686 0.692 0.605 0.619 0.612 0.671 0.674 0.673
PCL 0.592 0.574 0.583 0.698 0.692 0.695 0.609 0.621 0.615 0.746 0.758 0.752
Pair-wise loss ML 0.147 0.153 0.150 0.114 0.104 0.109 0.137 0.131 0.134 0.012 0.011 0.012
CL 0.516 0.498 0.507 0.587 0.555 0.571 0.577 0.592 0.584 0.628 0.641 0.635
TL 0.550 0.536 0.543 0.624 0.620 0.622 0.595 0.603 0.599 0.674 0.678 0.676
Table 2. Revisiting pair-wise and class-wise losses in cross-modal retrieval with the unified CLIP4CMR framework.
Figure 3. The impact of hybrid losses that combine class-wise and pair-wise losses: (a) Wikipedia, LRL; (b) Wikipedia, CEL; (c) Wikipedia, PCL; (d) NUS-WIDE, LRL; (e) NUS-WIDE, CEL; (f) NUS-WIDE, PCL. We show the average mAP values of the text retrieval and image retrieval tasks under different combination weights λ. Each sub-figure compares the hybrid losses formed by one class-wise loss and the different pair-wise losses.

4.1.2. Evaluation Metrics

The results of all the experiments are presented in terms of the mean average precision (mAP), which is the standard evaluation measure in cross-modal retrieval (Wang et al., 2015, 2016a). We compute the mAP scores for two different tasks: text retrieval using an image query (I2T) and image retrieval using a text query (T2I). To calculate mAP, we first evaluate the average precision (AP) of a set of R retrieved items by AP = (1/T) \sum_{r=1}^{R} P(r)\,\delta(r), where T is the number of relevant items in the retrieved set, P(r) represents the precision of the top r retrieved items, and δ(r) is an indicator function whose value is 1 if the r-th retrieved item is relevant (i.e., from the same class). The mAP scores are then calculated by averaging the AP values over all queries.
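The NumPy sketch below shows how mAP could be computed for class-based cross-modal retrieval consistently with the formula above; the function names and the assumption of l2-normalized features (so that ranking by dot product equals ranking by cosine similarity) are ours.

```python
import numpy as np


def average_precision(relevant: np.ndarray) -> float:
    """AP for one query: 'relevant' is a boolean vector over the ranked list
    (True if the r-th retrieved item shares the query's class)."""
    T = relevant.sum()
    if T == 0:
        return 0.0
    ranks = np.arange(1, len(relevant) + 1)
    precision_at_r = np.cumsum(relevant) / ranks          # P(r)
    return float((precision_at_r * relevant).sum() / T)   # (1/T) * sum_r P(r) * delta(r)


def mean_average_precision(query_feats, gallery_feats, query_labels, gallery_labels):
    """mAP for one retrieval direction (e.g., I2T: image queries against a text gallery)."""
    sims = query_feats @ gallery_feats.T                  # cosine similarities (features l2-normalized)
    order = np.argsort(-sims, axis=1)                     # rank gallery items by descending similarity
    aps = [average_precision(gallery_labels[order[i]] == query_labels[i])
           for i in range(len(query_labels))]
    return float(np.mean(aps))
```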

4.1.3. Implementation Details

The model architecture of CLIP4CMR is mainly based on CLIP, which consists of a visual encoder and a textual encoder that process the image and text modalities separately. The visual encoder utilizes ResNet-50 (He et al., 2016) as the base architecture, with several modifications that incorporate the style of the Transformer (Vaswani et al., 2017). Specifically, it adopts the modified version in ResNet-D (He et al., 2019) and antialiased rect-2 blur pooling (Zhang, 2019), and then replaces the global average pooling layer with an attention pooling mechanism. The attention pooling is implemented as a single layer of multi-head QKV attention where the query is conditioned on the pooled representation, and finally a 1024-dimensional image representation is obtained. The textual encoder first converts each token (including punctuation) of the input text into a lower-cased byte pair encoding (BPE) representation (Sennrich et al., 2015), which is essentially a unique numeric ID. The vocabulary size is 49,152 and the text length is fixed at 76 with the [SOS] and [EOS] tokens. The text IDs are then mapped to 512-dimensional word embeddings and passed into the 12-layer Transformer. Finally, the feature at the [EOS] position is layer normalized and processed by a linear projection layer to generate 1024-dimensional text representations. We then employ two fully connected layers per modality to project the original image and text representations into the common representation space. The entire network is optimized by the Adam update rule (Kingma and Ba, 2014). We set the initial learning rate to , the dropout ratio to , the early stop to , the batch size to and the maximal training epoch to .

Hyper-parameter setting: We report the results corresponding to the optimal hyper-parameters, where the dimension of the common representation space is d=1024, and the scaling factor in Eq. (8) is . In addition, the margin of the pair-wise losses is set to 0.2 as in most previous work (Chen et al., 2020). Further analysis of these hyper-parameters is provided in Section 4.4.
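As a minimal sketch of the optimization setup, the snippet below wires stand-in MLP heads to the Adam optimizer; every numeric value in the config is a placeholder assumption, since the concrete learning rate, dropout ratio, batch size and epoch budget are not recoverable from this text, and only the heads are optimized here because it is not stated whether the CLIP backbone is also fine-tuned.

```python
import torch
import torch.nn as nn

# Placeholder hyper-parameters: purely illustrative assumptions, not the paper's values.
config = dict(lr=1e-4, dropout=0.1, batch_size=64, max_epochs=100, d=1024)

# Stand-in for the two modality-specific MLP heads trained on top of the 1024-dimensional
# CLIP (ResNet-50) representations; the hidden width of 2048 is also an assumption.
heads = nn.ModuleList([
    nn.Sequential(nn.Linear(1024, 2048), nn.GELU(),
                  nn.Dropout(config["dropout"]),
                  nn.Linear(2048, config["d"]))
    for _ in range(2)  # one head for images, one for texts
])
optimizer = torch.optim.Adam(heads.parameters(), lr=config["lr"])  # Adam update rule (Kingma and Ba, 2014)
```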

Figure 4. Visualization of the distance matrices between the embeddings of the test set learned by the pair-wise loss and the class-wise loss, respectively: (a)-(d) pair-wise loss and (e)-(h) class-wise loss on Wikipedia, Pascal-Sentence, NUS-WIDE and XmediaNet. The X-axis denotes the image labels, and the Y-axis denotes the text labels.

4.2. Study on the CLIP4CMR Performance

4.2.1. Comparative Results

To evaluate the performance and impact of the vision-language pre-trained model CLIP in cross-modal retrieval, we compare the proposed CLIP4CMR with fourteen representative baseline methods, namely CCA (Hardoon et al., 2004), KCCA (Wang and Livescu, 2015), Corr-AE (Feng et al., 2014), JRL (Zhai et al., 2013), CMDN (Peng et al., 2016), JFSSL (Wang et al., 2015), ACMR (Wang et al., 2017), JLSLR (Wu et al., 2017), MCSM (Peng et al., 2018), CCL (Peng et al., 2017), CM-GANs (Peng and Qi, 2019), DSCMR (Zhen et al., 2019), PAN (Zeng et al., 2021b) and MCCN (Zeng et al., 2021a). Note that DSCMR and MCCN are two-stage methods, which use the training data to train pre-classified visual and textual encoders before performing cross-modal retrieval. This two-stage training approach significantly improves the performance of cross-modal retrieval reported in their original papers, including that of the baseline methods. Here we report the results of CLIP4CMR trained with the prototype contrastive loss because of its overall better performance on all the datasets. We compare the performance of CLIP4CMR under different loss functions in Section 4.3.

Table 1 reports the mAP scores of CLIP4CMR and the comparative methods. From the results, we can see that CLIP4CMR outperforms the baseline methods on all benchmark datasets. Compared with representative one-stage methods, our method outperforms PAN with average mAP improvements of 9.4%, 0.7%, 3.4% and 8.7% on Wikipedia, Pascal-Sentence, NUS-WIDE and XmediaNet, respectively. Moreover, CLIP4CMR still achieves better performance than the recent two-stage methods, especially on the Wikipedia dataset with significant performance gains. The promising results of CLIP4CMR indicate the superiority of CLIP in learning visual and textual representations for boosting cross-modal retrieval.

4.2.2. Visualization Analysis

To further study where the superiority of CLIP4CMR comes from, we examine the distributions of the intra-class image-text distances and inter-class image-text distances in the test set. Specifically, we collect the intra-class and inter-class image-text distances in the Wikipedia and XmediaNet datasets. We adopt the previous SOTA method PAN (Zeng et al., 2021b) for comparison, and show the visualization results in Figure 2. From the figure, we can see that the intra-class image-text distances of CLIP4CMR are clearly more compact than those of PAN, while the inter-class image-text distances of the two methods are not significantly different. The visualization results show that the superiority of CLIP4CMR mainly comes from the more compact distribution of multimodal samples within each class, which in turn benefits from the prior knowledge of cross-modal semantic alignment obtained by the vision-language pre-trained model.

4.2.3. Summary and Implication for Future Research

Benefiting from the improvement of intra-class compactness, CLIP4CMR provides a promising baseline and can significantly facilitate the cross-modal retrieval task. This indicates that more future research efforts are needed to actively explore the effective utilization of powerful vision-language pre-trained models for cross-modal retrieval.

percentage Wikipedia Pascal-Sentence
Baseline DAVAE (Jing et al., 2020) PAN (Zeng et al., 2021b) CLIP4CMR Baseline DAVAE (Jing et al., 2020) PAN (Zeng et al., 2021b) CLIP4CMR
100%I, 50%T 0.452±0.013 0.462±0.011 0.475±0.007 0.578±0.001 0.522±0.016 0.629±0.017 0.659±0.015 0.688±0.005
50%I, 100%T 0.433±0.016 0.465±0.021 0.471±0.006 0.573±0.003 0.548±0.019 0.618±0.014 0.652±0.008 0.684±0.005
100%I, 30%T 0.425±0.021 0.453±0.016 0.470±0.009 0.571±0.003 0.466±0.025 0.583±0.022 0.655±0.017 0.687±0.004
30%I, 100%T 0.417±0.019 0.448±0.018 0.462±0.010 0.578±0.003 0.495±0.024 0.606±0.021 0.642±0.012 0.681±0.005
100%I, 100%T 0.482±0.003 0.485±0.006 0.489±0.002 0.576±0.002 0.664±0.007 0.673±0.010 0.688±0.005 0.690±0.003
Table 3. Average mAP scores (mean ± standard deviation) with imbalanced training data under the experimental settings of PAN (Zeng et al., 2021b).
percentage Wikipedia Pascal-Sentence
100%I, 10%T 10%I, 100%T 100%I, 0%T 0%I, 100%T 100%I, 10%T 10%I, 100%T 100%I, 0%T 0%I, 100%T
CLIP4CMR 0.564±0.002 0.577±0.004 0.139±0.003 0.129±0.005 0.682±0.003 0.673±0.006 0.100±0.014 0.088±0.007
Table 4. Average mAP scores (mean ± standard deviation) with extremely imbalanced training data.

4.3. Study on the Design of Learning Objectives

4.3.1. Comparative Results

To provide a fair comparison of the loss function designs in existing models, we deploy CLIP4CMR as the uniform framework and experimental tool for revisiting the most common pair-wise losses, class-wise losses and hybrid ones. Specifically, we unify the model architecture of CLIP4CMR, the training protocol, the parameter choices and the random seed for a relatively objective comparison. We compare three popular pair-wise losses, namely the modality-invariant loss (ML), contrastive loss (CL) and triplet loss (TL), as well as three popular class-wise losses, namely the linear regression loss (LRL), cross-entropy loss (CEL) and prototype contrastive loss (PCL).

Table 2 reports the performance comparison of the different loss function designs. From the results, we can see that the overall performance of the prototype contrastive loss on the four datasets is significantly better than that of the other loss functions, although its performance on the Wikipedia and NUS-WIDE datasets is slightly lower than that of the linear regression loss. For the pair-wise losses, the performance of the modality-invariant loss is very poor, which shows the necessity of considering negative samples for cross-modal retrieval. Moreover, the results show that there is an obvious performance gap between the pair-wise and class-wise losses. Specifically, the prototype contrastive loss outperforms the triplet loss with average mAP improvements of 4.0%, 7.3%, 1.6% and 7.6% on Wikipedia, Pascal-Sentence, NUS-WIDE and XmediaNet, respectively.

Figure 3 illustrates the performance of the hybrid losses that combine class-wise and pair-wise losses. We carefully compare nine hybrid losses under different combinations, including LRL+ML, LRL+CL, LRL+TL, CEL+ML, CEL+CL, CEL+TL, PCL+ML, PCL+CL and PCL+TL, where λ represents the combination weight. Since the combination weight is a carefully selected parameter and the existing work does not provide a clear value, we tune this parameter and show the average mAP values. The results show that under all possible combinations, the hybrid losses with carefully adjusted weights bring no obvious performance gains compared to applying the class-wise loss alone. This empirical finding is consistent with the perspective in the recently proposed method PAN (Zeng et al., 2021b), namely that a simple combination of pair-wise loss and class-wise loss in cross-modal retrieval may not be necessary.

4.3.2. Visualization Analysis

To further explore the reason for this obvious performance gap, we carry out a visualization experiment to analyze the difference between the common representation spaces obtained by the pair-wise loss and the class-wise loss. Concretely, we randomly select 20 image-text pairs from 10 classes of the test set, where each class evenly contains 2 image-text pairs. We choose the triplet loss and the prototype contrastive loss as the representatives of the pair-wise loss and the class-wise loss, respectively. We illustrate the results in Figure 4, where the positions on the diagonal represent the intra-class image-text distances in the common representation space, and the other positions represent the inter-class image-text distances. The visualization results show that the inter-class distances in the common representation space obtained by the triplet loss are significantly smaller than those obtained by the prototype contrastive loss. This indicates that a large number of negative sample pairs in the pair-wise loss cannot be optimized, leading to poorer retrieval performance. Therefore, simply combining pair-wise and class-wise losses does not guarantee the expected performance gains, and the performance of the hybrid loss is better when the combination weight is smaller, as shown in Figure 3.

4.3.3. Summary and Implication for Future Research

Under the unified experimental setting based on CLIP4CMR, the hybrid losses that combine pair-wise and class-wise losses have no obvious performance gains compared to applying the class-wise loss alone. This indicates that on the one hand, more future research efforts are needed to design effective high-performing data-to-proxy relations in class-wise loss. On the other hand, the complementary research efforts to further explore the design of more fine-grained data-to-data relations in pair-wise loss (possibly by learning from the merits of class-wise loss) may also be needed.

4.4. Study on Two Practical Issues

To facilitate practical applications, we experiment on two key practical issues: the robustness to modality imbalance and the sensitivity to hyper-parameters.

4.4.1. The Robustness to Modality Imbalance

First, we follow the dataset split scheme in PAN (Zeng et al., 2021b) to construct imbalanced training data, which includes two imbalance ratios: retaining 50% of the text or image samples (i.e., 100%I+50%T or 50%I+100%T) and retaining 30% of the text or image samples (i.e., 100%I+30%T or 30%I+100%T). We then construct a more extreme imbalanced setting in which only 10% of the text or image samples are retained (i.e., 100%I+10%T or 10%I+100%T). Finally, to show the importance of the coexistence of the image and text modalities, we also compare the results of retaining only image samples (i.e., 100%I+0%T) or only text samples (i.e., 0%I+100%T). For comparison, we consider DAVAE (Jing et al., 2020), PAN (Zeng et al., 2021b), and the baseline method that does not process the imbalanced data. All compared results are reported in PAN.
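One possible way to construct such modality-imbalanced splits (e.g., 100%I, 30%T) is to subsample the training indices of one modality, as in the hedged sketch below; the exact sampling protocol of PAN may differ, and the function name and seed are illustrative.

```python
import numpy as np


def make_imbalanced_split(n_pairs: int, image_ratio: float, text_ratio: float, seed: int = 0):
    """Return index sets of retained image and text training samples.
    For example, ratios (1.0, 0.3) keep all images but only 30% of the texts,
    i.e. the '100%I, 30%T' setting."""
    rng = np.random.default_rng(seed)
    img_idx = rng.choice(n_pairs, size=int(round(image_ratio * n_pairs)), replace=False)
    txt_idx = rng.choice(n_pairs, size=int(round(text_ratio * n_pairs)), replace=False)
    return np.sort(img_idx), np.sort(txt_idx)


# Example: the '100%I, 30%T' setting on Wikipedia's 2,157 training pairs.
img_idx, txt_idx = make_imbalanced_split(2157, image_ratio=1.0, text_ratio=0.3)
```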

Following PAN, we repeat each experiment five times and report the average mAP scores (mean ± standard deviation) in Table 3 and Table 4. From the experimental results, we can see that the baseline method suffers an obvious performance decline in the face of modality imbalance, and the degree of the decline is positively correlated with the degree of modality imbalance. We can also see that DAVAE and PAN achieve significant performance improvements by reconstructing modality-balanced data, validating the necessity of using modality-balanced data during the training phase. However, the emergence of CLIP4CMR changes these previously formed perspectives. CLIP4CMR achieves significantly better performance under all the imbalanced settings, and it suffers only slight performance degradation even in the extremely imbalanced settings (i.e., 100%I+10%T and 10%I+100%T in Table 4). The robustness of CLIP4CMR shows that the image and text representations obtained by CLIP, pre-trained on large-scale modality-balanced data, can greatly alleviate the imbalance problem almost effortlessly, which is an important change brought by the vision-language pre-trained model to cross-modal retrieval. In particular, the performance of the model drops severely when we discard all text or image samples (i.e., 100%I+0%T and 0%I+100%T in Table 4), indicating that the coexistence of the image and text modalities is important in modality-imbalanced situations.

Parameter Wikipedia Pascal-Sentence NUS-WIDE XmediaNet
d=64 0.569 0.675 0.606 0.730
d=128 0.576 0.687 0.609 0.738
d=256 0.582 0.691 0.613 0.743
d=512 0.583 0.695 0.614 0.748
d=1024 0.585 0.694 0.615 0.752
d=2048 0.581 0.694 0.615 0.750
Table 5. Parameter analysis of the dimension d.

4.4.2. The Sensitivity to Hyper-parameters

To investigate the influence of hyper-parameters on the retrieval performance, we examine the mAP values of CLIP4CMR by varying the dimensionality d of the common representation space. Note that previous work did not perform a detailed parameter analysis of the dimensionality d, but we believe this is necessary due to the importance of its value in analyzing the storage and time efficiency of cross-modal retrieval. We vary d from 64 to 2048 and show the impact of the different values of d. We report the average mAP values of the text retrieval (I2T) and image retrieval (T2I) tasks in Table 5. We can see that when d=1024, the overall performance of CLIP4CMR on the four datasets is the best. We can also see that the performance of CLIP4CMR decreases only slightly as d decreases, which means that CLIP4CMR can maintain considerable performance even in a more compact representation space. In particular, CLIP4CMR still shows only a small performance degradation in a very compact representation space (such as d=64), indicating that the retrieval model built on CLIP is almost insensitive to changes in the dimensionality of the common representation space. In addition, we also analyze the impact of the scaling factor in the prototype contrastive loss. We vary it from 0.01 to 10 and show the impact in Figure 5. From the results, we can see that CLIP4CMR achieves the best average mAP value at a moderate scaling factor, and the performance drops significantly for larger values, suggesting that larger scaling factors are harder to train due to numerical stability.

Figure 5. Parameter analysis of the scaling factor on (a) Wikipedia and (b) NUS-WIDE.

4.4.3. Summary and Implication for Future Research

The cross-modal retrieval model built on CLIP markedly improves the robustness to modality imbalance and reduces the sensitivity to changes in the dimensionality of the common representation space. This indicates that, with the help of vision-language pre-trained models, the dataset labeling and computational costs in practical applications can be greatly reduced in future research on cross-modal retrieval.

5. Conclusion

In this paper, we conduct a comprehensive empirical study to investigate the performance and impact of the pre-trained CLIP for cross-modal retrieval. Our empirical study demonstrates that the CLIP4CMR framework built on CLIP can significantly improve the performance of cross-modal retrieval, and reveals the underlying rationale for this. The CLIP4CMR framework also provides a uniform experimental setting for a relatively objective comparison of the existing methods, yielding valuable insights on loss function design.

References

  • A. Bellet, A. Habrard, and M. Sebban (2013) A survey on metric learning for feature vectors and structured data. arXiv preprint arXiv:1306.6709. Cited by: §2.1.
  • J. Cao, Z. Gan, Y. Cheng, L. Yu, Y. Chen, and J. Liu (2020) Behind the scene: revealing the secrets of pre-trained vision-and-language models. In European Conference on Computer Vision, pp. 565–580. Cited by: §1.
  • M. Carvalho, R. Cadène, D. Picard, L. Soulier, N. Thome, and M. Cord (2018) Cross-modal retrieval in the cooking context: learning semantic text-image embeddings. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 35–44. Cited by: §1.
  • H. Chen, G. Ding, X. Liu, Z. Lin, J. Liu, and J. Han (2020) IMRAM: iterative matching with recurrent attention memory for cross-modal image-text retrieval. In Proceedings of the CVPR, pp. 12655–12663. Cited by: §1, §2.1, §4.1.3.
  • Y. Chen, L. Li, L. Yu, A. El Kholy, F. Ahmed, Z. Gan, Y. Cheng, and J. Liu (2019) Uniter: learning universal image-text representations. Cited by: §1, §2.2, §2.2.
  • T. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng (2009) NUS-wide: a real-world web image database from national university of singapore. In Proceedings of the CIVR, pp. 1–9. Cited by: §4.1.1.
  • S. Chun, S. J. Oh, R. S. de Rezende, Y. Kalantidis, and D. Larlus (2021) Probabilistic embeddings for cross-modal retrieval. In Proceedings of the CVPR, pp. 8415–8424. Cited by: §2.1.
  • K. Desai and J. Johnson (2021) Virtex: learning visual representations from textual annotations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11162–11173. Cited by: §1, §2.2, §2.2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1, §2.2.
  • A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. Cited by: §3.1.
  • F. Feng, X. Wang, and R. Li (2014) Cross-modal retrieval with correspondence autoencoder. In Proceedings of the ACM MM, pp. 7–16. Cited by: §2.1, §3.2.1, Table 1, §4.2.1.
  • G. Geigle, J. Pfeiffer, N. Reimers, I. Vulić, and I. Gurevych (2021) Retrieve fast, rerank smart: cooperative and joint approaches for improved cross-modal retrieval. arXiv preprint arXiv:2103.11920. Cited by: §2.2, §2.2.
  • D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor (2004) Canonical correlation analysis: an overview with application to learning methods. Neural computation 16 (12), pp. 2639–2664. Cited by: §2.1, Table 1, §4.2.1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §3.1, §4.1.3.
  • T. He, Z. Zhang, H. Zhang, Z. Zhang, J. Xie, and M. Li (2019) Bag of tricks for image classification with convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 558–567. Cited by: §4.1.3.
  • D. Hendrycks and K. Gimpel (2016) Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415. Cited by: §3.1.
  • J. Hu, J. Lu, and Y. Tan (2014) Discriminative deep metric learning for face verification in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1875–1882. Cited by: §2.1.
  • Y. Huang, W. Wang, and L. Wang (2017) Instance-aware image and sentence matching with selective multimodal lstm. In Proceedings of the CVPR, pp. 2310–2318. Cited by: §1, §2.1.
  • M. Jing, J. Li, L. Zhu, K. Lu, Y. Yang, and Z. Huang (2020) Incomplete cross-modal retrieval with dual-aligned variational autoencoders. In Proceedings of the ACM international conference on Multimedia, pp. 3283–3291. Cited by: §1, §4.4.1, Table 3.
  • S. Kim, D. Kim, M. Cho, and S. Kwak (2021) Embedding transfer with label relaxation for improved metric learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3967–3976. Cited by: §1.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1.3.
  • J. Li, S. Tang, J. Li, J. Xiao, F. Wu, S. Pu, and Y. Zhuang (2020a) Topic adaptation and prototype encoding for few-shot visual storytelling. arXiv preprint arXiv:2008.04504. Cited by: §1.
  • X. Li, X. Yin, C. Li, P. Zhang, X. Hu, L. Zhang, L. Wang, H. Hu, L. Dong, F. Wei, et al. (2020b) Oscar: object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision, pp. 121–137. Cited by: §1, §2.2, §2.2.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §1, §2.2.
  • J. Lu, D. Batra, D. Parikh, and S. Lee (2019) Vilbert: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv preprint arXiv:1908.02265. Cited by: §1, §2.2, §2.2.
  • H. Luo, L. Ji, M. Zhong, Y. Chen, W. Lei, N. Duan, and T. Li (2021) Clip4clip: an empirical study of clip for end to end video clip retrieval. arXiv preprint arXiv:2104.08860. Cited by: §1, §2.2.
  • Y. Peng, X. Huang, and J. Qi (2016) Cross-media shared representation by hierarchical learning with multiple deep networks.. In Proceedings of the IJCAI, pp. 3846–3853. Cited by: §2.1, Table 1, §4.2.1.
  • Y. Peng, J. Qi, X. Huang, and Y. Yuan (2017) CCL: cross-modal correlation learning with multigrained fusion by hierarchical network. IEEE Transactions on Multimedia 20 (2), pp. 405–420. Cited by: §3.2.1, Table 1, §4.2.1.
  • Y. Peng, J. Qi, and Y. Yuan (2018) Modality-specific cross-modal similarity measurement with recurrent attention network. IEEE Transactions on Image Processing 27 (11), pp. 5585–5599. Cited by: §2.1, §3.2.1, Table 1, §4.1.1, §4.2.1.
  • Y. Peng and J. Qi (2019) CM-GANs: cross-modal generative adversarial networks for common representation learning. Transactions on Multimedia Computing, Communications, and Applications 15 (1), pp. 1–24. Cited by: §2.1, §3.2.2, §3.2.3, Table 1, §4.2.1.
  • A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020. Cited by: §1, §2.2, §2.2.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. (2019) Language models are unsupervised multitask learners. OpenAI blog 1 (8), pp. 9. Cited by: §2.2.
  • C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier (2010) Collecting image annotations using amazon’s mechanical turk. In Proceedings of the NAACL, pp. 139–147. Cited by: §4.1.1.
  • N. Rasiwasia, J. Costa Pereira, E. Coviello, G. Doyle, G. R. Lanckriet, R. Levy, and N. Vasconcelos (2010) A new approach to cross-modal multimedia retrieval. In Proceedings of the ACM MM, pp. 251–260. Cited by: §4.1.1.
  • K. Roth, T. Milbich, S. Sinha, P. Gupta, B. Ommer, and J. P. Cohen (2020) Revisiting training strategies and generalization performance in deep metric learning. In International Conference on Machine Learning, pp. 8242–8252. Cited by: §1.
  • R. Sennrich, B. Haddow, and A. Birch (2015) Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909. Cited by: §4.1.3.
  • S. Shen, L. H. Li, H. Tan, M. Bansal, A. Rohrbach, K. Chang, Z. Yao, and K. Keutzer (2021) How much can clip benefit vision-and-language tasks?. arXiv preprint arXiv:2107.06383. Cited by: §2.2.
  • A. Shin, M. Ishii, and T. Narihira (2021) Perspectives and prospects on transformer architecture for cross-modal tasks with language and vision. arXiv preprint arXiv:2103.04037. Cited by: §1.
  • Y. Song and M. Soleymani (2019) Polysemous visual-semantic embedding for cross-modal retrieval. In Proceedings of the CVPR, pp. 1979–1988. Cited by: §1, §2.1.
  • W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai (2019) Vl-bert: pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530. Cited by: §1, §2.2, §2.2.
  • S. Sun, Y. Chen, L. Li, S. Wang, Y. Fang, and J. Liu (2021) LightningDOT: pre-training visual-semantic embeddings for real-time image-text retrieval. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 982–997. Cited by: §2.2, §2.2.
  • H. Tan and M. Bansal (2019) Lxmert: learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490. Cited by: §1, §2.2, §2.2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §2.2, §3.1, §4.1.3.
  • O. Vinyals, A. Toshev, S. Bengio, and D. Erhan (2015) Show and tell: a neural image caption generator. In Proceedings of the CVPR, pp. 3156–3164. Cited by: §1.
  • B. Wang, Y. Yang, X. Xu, A. Hanjalic, and H. T. Shen (2017) Adversarial cross-modal retrieval. In Proceedings of the 25th ACM MM, pp. 154–162. Cited by: §1, §2.1, §2.1, 2nd item, §3.2.1, §3.2.2, §3.2.3, Table 1, §4.1.1, §4.2.1.
  • K. Wang, R. He, L. Wang, W. Wang, and T. Tan (2015) Joint feature selection and subspace learning for cross-modal retrieval. IEEE transactions on pattern analysis and machine intelligence 38 (10), pp. 2010–2023. Cited by: §1, §2.1, §2.1, §3.2.1, §3.2.2, §3.2.3, Table 1, §4.1.2, §4.2.1.
  • K. Wang, Q. Yin, W. Wang, S. Wu, and L. Wang (2016a) A comprehensive survey on cross-modal retrieval. arXiv preprint arXiv:1607.06215. Cited by: §4.1.2.
  • L. Wang, Y. Li, and S. Lazebnik (2016b) Learning deep structure-preserving image-text embeddings. In Proceedings of the CVPR, pp. 5005–5013. Cited by: §2.1.
  • W. Wang and K. Livescu (2015) Large-scale approximate kernel canonical correlation analysis. arXiv preprint arXiv:1511.04773. Cited by: §2.1, Table 1, §4.2.1.
  • Z. Wang, X. Liu, H. Li, L. Sheng, J. Yan, X. Wang, and J. Shao (2019) Camp: cross-modal adaptive message passing for text-image retrieval. In Proceedings of the ICCV, pp. 5764–5773. Cited by: §1, §2.1.
  • F. Wu, X. Jing, Z. Wu, Y. Ji, X. Dong, X. Luo, Q. Huang, and R. Wang (2020) Modality-specific and shared generative adversarial network for cross-modal retrieval. Pattern Recognition, pp. 107335. Cited by: §2.1.
  • J. Wu, Z. Lin, and H. Zha (2017) Joint latent subspace learning and regression for cross-modal retrieval. In Proceedings of the SIGIR, pp. 917–920. Cited by: §2.1, §2.1, §3.2.1, §3.2.2, §3.2.3, Table 1, §4.2.1.
  • L. Wu, Y. Wang, and L. Shao (2018) Cycle-consistent deep generative hashing for cross-modal retrieval. IEEE Transactions on Image Processing 28 (4), pp. 1602–1612. Cited by: §1.
  • R. Xia, Y. Pan, H. Lai, C. Liu, and S. Yan (2014) Supervised hashing for image retrieval via image representation learning.. In Proceedings of the AAAI, pp. 2156–2162. Cited by: §1.
  • K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In Proceedings of the ICML, pp. 2048–2057. Cited by: §1, §2.1.
  • H. Yang, X. Zhang, F. Yin, and C. Liu (2018) Robust classification with convolutional prototype learning. In Proceedings of the CVPR, pp. 3474–3482. Cited by: §2.1, §3.2.2.
  • Z. Zeng, Y. Sun, and W. Mao (2021a) MCCN: multimodal coordinated clustering network for large-scale cross-modal retrieval. In Proceedings of the 29th ACM International Conference on Multimedia, pp. 5427–5435. Cited by: §1, §2.1, §2.1, 2nd item, §3.1, Table 1, §4.2.1.
  • Z. Zeng, S. Wang, N. Xu, and W. Mao (2021b) PAN: prototype-based adaptive network for robust cross-modal retrieval. In Proceedings of the SIGIR, pp. 1125–1134. Cited by: §1, §1, §2.1, §2.1, §3.1, §3.2.2, Table 1, §4.2.1, §4.2.2, §4.3.1, §4.4.1, Table 3.
  • Z. Zeng, N. Xu, and W. Mao (2020) Event-driven network for cross-modal retrieval. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 2297–2300. Cited by: §2.1.
  • X. Zhai, Y. Peng, and J. Xiao (2013) Learning cross-media joint representation with sparse and semisupervised regularization. IEEE Transactions on Circuits and Systems for Video Technology 24 (6), pp. 965–978. Cited by: §3.2.1, Table 1, §4.2.1.
  • Q. Zhang, Z. Lei, Z. Zhang, and S. Z. Li (2020) Context-aware attention network for image-text retrieval. In Proceedings of the CVPR, pp. 3536–3545. Cited by: §1, §2.1.
  • R. Zhang (2019) Making convolutional networks shift-invariant again. In International conference on machine learning, pp. 7324–7334. Cited by: §4.1.3.
  • L. Zhen, P. Hu, X. Wang, and D. Peng (2019) Deep supervised cross-modal retrieval. In Proceedings of the CVPR, pp. 10394–10403. Cited by: §1, §2.1, §2.1, §3.2.1, §3.2.2, §3.2.3, Table 1, §4.1.1, §4.2.1.
  • F. Zheng, Y. Tang, and L. Shao (2016) Hetero-manifold regularisation for cross-modal hashing. IEEE transactions on pattern analysis and machine intelligence 40 (5), pp. 1059–1071. Cited by: §1.