Matching Images and Text with Multi-modal Tensor Fusion and Re-ranking

by   Tan Wang, et al.

A major challenge in matching images and text is that they have intrinsically different data distributions and feature representations. Most existing approaches are based either on embedding or classification, the first one mapping image and text instances into a common embedding space for distance measuring, and the second one regarding image-text matching as a binary classification problem. Neither of these approaches can, however, balance the matching accuracy and model complexity well. We propose a novel framework that achieves remarkable matching performance with acceptable model complexity. Specifically, in the training stage, we propose a novel Multi-modal Tensor Fusion Network (MTFN) to explicitly learn an accurate image-text similarity function with rank-based tensor fusion rather than seeking a common embedding space for each image-text instance. Then, during testing, we deploy a generic Cross-modal Re-ranking (RR) scheme for refinement without requiring additional training procedure. Extensive experiments on two datasets demonstrate that our MTFN-RR consistently achieves the state-of-the-art matching performance with much less time complexity. The implementation code is available at



Code Repositories


The official code for the paper "Matching Images and Text with Multi-modal Tensor Fusion and Re-ranking", ACM Multimedia 2019 Oral


1. Introduction

In contrast to retrieval of unimodal data, image-text matching (Zhang and Lu, 2018; Li et al., 2017; Huang et al., 2017a; Lee et al., 2018) focuses on retrieving the relevant instances of a different media type than the query, including two typical cross-modal retrieval (Song et al., 2013; Xu et al., 2018; Hu et al., 2019; Xu et al., 2019) scenarios: 1) image-based sentence retrieval (I2T), i.e., retrieving ground-truth text sentences given a query image, and 2) text-based image retrieval (T2I), i.e., retrieving matched images given a query text. Essentially, image-text matching requires algorithms that are able to assess the similarity between the data and feature representations of images and the semantics of text. Due to the large discrepancy between the nature of textual and visual data and their feature representations, achieving this matching in an effective, efficient and robust fashion is a challenge.

A straightforward step in pursuing this challenge is to extend a typical unimodal classification approach to operate in a cross-modal case. Methods like (Zhang and Lu, 2018; Wang et al., 2016; Li et al., 2017; Huang et al., 2017a) have been proposed to predict match or mismatch (i.e., “+1” and “-1”) for an input image-text pair by optimizing a logistic regression loss, turning this into a binary classification problem. They have, however, been shown to be insufficiently capable of handling cross-modal data complexity and therefore insufficiently effective in finding boundaries between unbalanced matching and non-matching image-text pairs. As an alternative, the embedding-based approach has therefore been investigated as well. Embedding-based methods (e.g., (Karpathy and Fei-Fei, 2015; Lee et al., 2018; Zheng et al., 2017; Faghri et al., 2018; Huang et al., 2017b)) try to map image and text features, either global or local, into a joint embedding subspace by optimizing a ranking loss that ensures the similarities of the ground-truth image-text pairs to be greater than those of any other negative pairs by a margin. Once the common space is established, the relevance between any image and text instance can be easily measured by cosine similarity or Euclidean distance. However, the main limitation of these methods is that constructing such a high-dimensional common space for complex multi-modal data is not a trivial task, typically showing significant problems with learning convergence and generalizability of the learned space and requiring significant computational time and resources. The general pipelines of the two categories of approaches are illustrated in Figure


Clearly, while the embedding-based methods have more potential to capture the complexity in data than the classification-based methods, their model and algorithmic complexity is significantly higher. This analysis leads to a question that inspired the research reported in this paper: Is a new image-text matching framework possible that combines the advantages of the two paradigms, i.e., balancing the matching performance and model efficiency? To answer this question, in this paper we propose a novel image-text matching framework named Multi-modal Tensor Fusion Network with Re-ranking (MTFN-RR) that in an innovative fashion combines the concepts of embedding and classification to achieve the aforementioned balance. As illustrated in Figure 2, our framework is constructed as a cascade of two steps: 1) deploying tensor fusion to learn an accurate image-text similarity measure in the training stage, and 2) performing cross-modal re-ranking to jointly refine the I2T and T2I retrieval results in the testing stage.

For the first part, our MTFN takes the multi-modal global features as input and passes them to two branches, Image-Text Fusion and Text-Text Fusion. Then, for each branch, a tensor fusion block with rank constraint is used to capture rich bilinear interactions between the multimodal inputs and encode them into a vector. Finally, the similarity of the inputs is directly learned from the fused vector with a fully connected layer and naturally embedded into the ranking loss to encourage a large margin between ground-truth image-text pairs and negative pairs. In this way, the similarity measuring functions for both image-text and text-text inputs are directly learned without constructing the whole embedding subspace.

Regarding the re-ranking step in the second part, we note that in previous work, the I2T and T2I retrieval tasks are typically conducted separately in the testing phase. This may be problematic because in the training stage these two tasks are optimized with a bi-directional max-margin ranking loss function like Eq. 5, yielding a discrepancy between training and inference. To reduce this discrepancy in an efficient way, we develop a general cross-modal re-ranking (RR) scheme that jointly considers the retrieval results of both the I2T and T2I directions. It bridges the gap between training and testing at little time cost, yields significant improvement, and is applicable to most existing image-text matching approaches. Additionally, to mitigate the effects of the unbalanced data (i.e., an image corresponds to five sentences) in MSCOCO (Lin et al., 2014) and Flickr30k (Young et al., 2014), the similarities between unimodal text predicted by our MTFN are further used to significantly boost the T2I retrieval performance.

We summarize our contributions as follows: 1) We propose a novel Multi-modal Tensor Fusion Network (MTFN) that directly learns an accurate image-text similarity function for visual and textual global features via image-text fusion and text-text fusion. It explicitly incorporates the advantages of both embedding-based and classification-based methods and enables efficient training. 2) We further develop an efficient cross-modal re-ranking scheme that remarkably improves the matching performance without extra training and can be freely applied to other off-the-shelf methods. With extensive experiments, the proposed MTFN itself shows competitive accuracy compared to the current best methods on two standard datasets with much less time complexity and simpler feature extraction. Furthermore, when integrating the proposed RR scheme in our MTFN, it achieves the state-of-the-art performance on two datasets, especially on the R@1 score, showing the effectiveness of the RR scheme. The implementation code and related materials are available at

Figure 2. The overview architecture of our proposed framework separated by training and testing parts. 1) During training, the global features of multi-modal inputs are first passed into two branches (i.e., Image-Text Fusion and Text-Text Fusion). Then, for each branch, a tensor fusion scheme with rank constraint is used for modelling rich interactions between the input features and encoding them into a vector for learning the similarity score. 2) In testing, a cross-modal re-ranking scheme is applied to jointly consider the I2T and T2I retrieval, with the combination of both image-text similarity and text-text similarity.

2. Related Work

Image-Text Similarity. As mentioned before, the embedding-based methods (Karpathy and Fei-Fei, 2015; Lee et al., 2018; Niu et al., 2017) project multimodal (global/local) features into a common embedding space, in which similarities between instances are measured by conventional cosine or Euclidean distance. However, modeling the similarity between image and text can also be regarded as a classification problem to directly answer whether two input samples match each other (Wang et al., 2017; Fukui et al., 2016; Jabri et al., 2016). These methods typically secure rapid convergence of the training process, but are limited in their ability to fully exploit the identity information of cross-modal features with simple match/mismatch classifiers. Our MTFN is proposed to leverage the advantages of the two kinds of methods above. Specifically, a fusion network is first designed for directly learning a similarity function from the image-text fusion and text-text fusion rather than using a conventional distance metric in a common embedding space. The predicted similarity is then plugged into the widely used ranking loss with large-margin constraint, which can be optimized efficiently in the next step.

Multi-modal Fusion. To fully capture the interactions between multiple modalities, a number of fusion strategies have been used for exploring the relationship between visual and textual data. Liu et al. (Liu et al., 2017) applied a fusion module to integrate the intermediate recurrent outputs and generate a more powerful representation for image-text matching. Wang et al. (Wang et al., 2017) used an element-wise product to aggregate features from visual and text data in two branches. Recently, bilinear fusion (Fukui et al., 2016; Kim et al., 2016; Ben-younes et al., 2017) has proved to be more effective than traditional fusion schemes such as element-wise product (Antol et al., 2015) and concatenation (Zhou et al., 2015) in the Visual Question Answering (VQA) problem, since it enables all elements of multi-modal vectors to interact with each other in a multiplicative way. We draw inspiration from the VQA method (Ben-younes et al., 2017) to capture the bilinear interactions of the image-text and text-text data inputs and directly learn image-text similarity. Note that instead of modelling a vector between two modalities by Tucker decomposition for classification with a small set of concepts/answers in a VQA dataset, here our MTFN can be regarded as a general tensor fusion architecture for various inputs (e.g., image-text, text-text) to directly learn the similarity.

Re-ranking Scheme. Re-ranking has been successfully studied in unimodal retrieval tasks such as person re-ID (García et al., 2015; Ye et al., 2016, 2015; Leng et al., 2015), object retrieval (Shen et al., 2012; Qin et al., 2011) and text-based image search (Yang and Hanjalic, 2010, 2012; Hsu et al., 2006) to improve the accuracy. In these problems, retrieved candidates within the initial rank list can be re-ordered as an additional refinement process. For example, Leng et al. (Leng et al., 2015) proposed a bi-directional ranking method with the newly computed similarity by fusing the contextual similarity between query images. Zhong et al. (Zhong et al., 2017) designed a new feature vector for the given image under the Jaccard distance after the initial ranking. Yang et al. (Yang and Hanjalic, 2010) introduced a supervised “learning to rerank” paradigm into visual search reranking by applying query-dependent reranking features. Unlike the unimodal re-ranking methods and the learning-to-rerank paradigm in text-based image search, we propose a cross-modal re-ranking scheme for the image-text matching scenario that needs neither supervision nor a learning procedure. It combines the bidirectional retrieval process (I2T and T2I), takes only a few seconds, and can be inserted into most image-text matching methods for performance improvement.

3. Proposed Model

Let $\mathcal{X} = \{(I_i, T_i)\}_{i=1}^{N}$ be a training set of $N$ image-text pairs, where the image set is denoted as $\mathcal{I}$ and the text set as $\mathcal{T}$. We refer to $(I_i, T_i)$ as positive pairs and $(I_i, T_j)$ with $j \neq i$ as negative pairs, i.e., the most relevant sentence to image $I_i$ is $T_i$ and for sentence $T_i$, its matched image is $I_i$. Given a query of one modality, the goal of image-text matching is to find the most relevant instances of the other modality. In this work, we define a similarity function $S(\cdot, \cdot)$ that is expected to, ideally, assign higher similarity scores to the positive pairs than the negative ones. This requirement can be written as:

$$S(I_i, T_i) > S(I_i, T_j) \quad \text{and} \quad S(I_i, T_i) > S(I_j, T_i), \quad \forall j \neq i. \tag{1}$$

Accordingly, we can conduct the I2T retrieval task by ranking a database of text instances based on their similarity scores with the query image using $S(\cdot, \cdot)$, and likewise for the T2I retrieval task. Different from most existing embedding-based methods that adopt a conventional distance metric (e.g., cosine similarity or Euclidean distance) as the similarity function on a common embedding space, in this work we aim to directly learn a similarity function that accurately measures the relevance of image-text pairs without seeking a common subspace for each instance.

3.1. Multi-modal Tensor Fusion Network

Inspired by the Multimodal Tucker Fusion proposed in visual question answering (Ben-younes et al., 2017), as illustrated in Fig. 2, we introduce a novel Tensor Fusion Network into image-text matching for feature merging and similarity learning. Specifically, MTFN contains two branches, Image-Text Fusion and Text-Text Fusion, which learn the similarity scores with different inputs (i.e., image-text and text-text). The Image-Text Fusion branch is a conventional process that fuses the global feature representations of images and sentences at the tensor level and predicts the similarity score of any image-text pair as $S_{it}(I, T)$. Moreover, considering the fact that multiple sentences annotated to one image have common semantics, we introduce the Text-Text Fusion branch to further capture the semantic relevance of any text-text pair as $S_{tt}(T_i, T_j)$. Different from (Wang et al., 2017), the learned information of the text modality is used for re-ranking in the testing stage to narrow the gap between training and inference. In the following parts, we describe the details of the two fusion branches in our MTFN.

Image-Text Fusion. Firstly, given the global feature vectors $v_I \in \mathbb{R}^{d_I}$ and $v_T \in \mathbb{R}^{d_T}$ for image $I$ and sentence $T$, the intra-modal projection matrices $W_I \in \mathbb{R}^{d_I \times t_I}$ and $W_T \in \mathbb{R}^{d_T \times t_T}$ are constructed to encode the two feature vectors into spaces of respective dimensions $t_I$ and $t_T$, which can be written as $\tilde{v}_I = v_I^\top W_I$ and $\tilde{v}_T = v_T^\top W_T$. Then, for feature fusion at the tensor level, we project $\tilde{v}_I$ and $\tilde{v}_T$ into a common space of dimension $t_o$ and merge them with an element-wise product, which can be written as:

$$f = (\tilde{v}_I M_I) \odot (\tilde{v}_T M_T), \tag{2}$$

where $M_I \in \mathbb{R}^{t_I \times t_o}$, $M_T \in \mathbb{R}^{t_T \times t_o}$, and $\odot$ denotes the element-wise product.

Considering that each such fusion vector can be regarded as a rank-1 vector carrying limited information, to fully encode the bilinear interactions between the two modalities we further impose a rank constraint on the fusion vector and express it as a sum of $R$ rank-1 vectors instead of performing a single feature merging function. In this way, we directly learn $R$ different common subspaces. For each space embedding $\tilde{v}_I M_I^{r}$ and $\tilde{v}_T M_T^{r}$, a specific fusion vector is obtained by element-wise product, and the $R$ vectors are ultimately summed together, allowing the model to jointly capture the interactions between the two modalities from different representation subspaces. Thus, Eq. 2 can be rewritten as follows:

$$f = \sum_{r=1}^{R} (\tilde{v}_I M_I^{r}) \odot (\tilde{v}_T M_T^{r}). \tag{3}$$

Finally, a fully connected layer is added to transform the fusion vector $f$ into the score $S_{it}(I, T)$, which infers the similarity between image and text, followed by a sigmoid layer mapping it into the range $(0, 1)$:

$$S_{it}(I, T) = \mathrm{sigmoid}\big(\mathrm{FC}(f)\big). \tag{4}$$
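As an illustration of the rank-constrained fusion and scoring described above, the following is a minimal pure-Python sketch, not the authors' PyTorch implementation. The helper names (`matvec`, `init_params`, `fuse`, `similarity`), the random weights standing in for learned parameters, and the toy dimensions are all assumptions made for readability; the paper reports 2048d image features, 2400d text features, 1024d projections and a rank of 20.

```python
import math
import random


def matvec(v, W):
    """Multiply a row vector v (length n) by a matrix W (n x m)."""
    return [sum(v[i] * W[i][j] for i in range(len(v))) for j in range(len(W[0]))]


def rand_matrix(rows, cols, rng):
    """Small random matrix standing in for learned weights (assumption)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def init_params(d_img, d_txt, t_img, t_txt, t_out, rank, seed=0):
    """Hypothetical parameter container for one fusion branch."""
    rng = random.Random(seed)
    return {
        "W_img": rand_matrix(d_img, t_img, rng),  # intra-modal image projection
        "W_txt": rand_matrix(d_txt, t_txt, rng),  # intra-modal text projection
        "M_img": [rand_matrix(t_img, t_out, rng) for _ in range(rank)],
        "M_txt": [rand_matrix(t_txt, t_out, rng) for _ in range(rank)],
        "w_out": [rng.uniform(-0.1, 0.1) for _ in range(t_out)],  # final linear layer
        "b_out": 0.0,
    }


def fuse(v_img, v_txt, p):
    """Sum over `rank` element-wise products of projected features."""
    h_img = matvec(v_img, p["W_img"])
    h_txt = matvec(v_txt, p["W_txt"])
    fused = [0.0] * len(p["w_out"])
    for M_i, M_t in zip(p["M_img"], p["M_txt"]):
        a = matvec(h_img, M_i)
        b = matvec(h_txt, M_t)
        fused = [f + x * y for f, x, y in zip(fused, a, b)]
    return fused


def similarity(v_img, v_txt, p):
    """Linear layer on the fused vector followed by a sigmoid."""
    f = fuse(v_img, v_txt, p)
    z = sum(w * x for w, x in zip(p["w_out"], f)) + p["b_out"]
    return 1.0 / (1.0 + math.exp(-z))


params = init_params(d_img=8, d_txt=6, t_img=5, t_txt=5, t_out=4, rank=3)
image_vec = [0.1 * i for i in range(8)]
text_vec = [0.2 * i for i in range(6)]
score = similarity(image_vec, text_vec, params)
```

Because every fused element is a sum of products of one image-side and one text-side projection, scaling the image feature scales the fused vector by the same factor, which is the bilinear behaviour the rank constraint is meant to preserve.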


Next, instead of treating “match” and “mismatch” as a binary classification problem, we propose to combine the learned image-text similarity $S_{it}$ with the ranking loss constraint widely used in existing embedding-based methods to construct a bi-directional max-margin ranking loss. In this way, the nonlinear boundary can be found more easily while the inter-modal invariance is preserved, which will be further explained in the Ablation Study. Specifically, in our work, we follow (Faghri et al., 2018; Lee et al., 2018) and focus on the hardest negatives in the mini-batch during training. For each positive pair of an image $I$ and a text $T$, we additionally sample their hardest negatives, which are given by $\hat{I} = \arg\max_{x \neq I} S_{it}(x, T)$ and $\hat{T} = \arg\max_{y \neq T} S_{it}(I, y)$. Then the image-text loss can be defined as:

$$L_{it} = \big[\alpha - S_{it}(I, T) + S_{it}(I, \hat{T})\big]_{+} + \big[\alpha - S_{it}(I, T) + S_{it}(\hat{I}, T)\big]_{+}, \tag{5}$$

where $\alpha$ is a constant value of the margin and the operator $[x]_{+} = \max(x, 0)$ compares the tolerance value with zero. By minimizing the loss term in Eq. 5, the network is trained to guarantee that the truly matching image-text pairs have larger similarity scores than the most confusing negative pairs by a margin $\alpha$.
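The hardest-negative ranking loss described above can be sketched on a mini-batch similarity matrix as follows. This is an illustrative sketch under stated assumptions: the function name `hardest_negative_loss` is hypothetical, the matrix layout (rows are images, columns are texts, positives on the diagonal) is assumed, and summing over the batch is one possible reduction, not necessarily the one used in the released code.

```python
def hardest_negative_loss(sim, margin=0.2):
    """Bi-directional max-margin loss with hardest in-batch negatives.

    sim[i][j] is the predicted similarity of image i and text j; the
    diagonal holds the positive pairs (assumed layout). Needs batch size >= 2.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        pos = sim[i][i]
        # Hardest negative text for image i: highest-scoring non-matching text.
        hardest_text = max(sim[i][j] for j in range(n) if j != i)
        # Hardest negative image for text i: highest-scoring non-matching image.
        hardest_image = max(sim[j][i] for j in range(n) if j != i)
        total += max(0.0, margin - pos + hardest_text)
        total += max(0.0, margin - pos + hardest_image)
    return total


well_separated = [[0.9, 0.1], [0.2, 0.8]]
confusing = [[0.5, 0.6], [0.1, 0.9]]
loss_easy = hardest_negative_loss(well_separated)
loss_hard = hardest_negative_loss(confusing)
```

In the second batch, image 0 scores a wrong text (0.6) above its own text (0.5), so only that term violates the margin and contributes to the loss.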

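Text-Text Fusion. Different from the Image-Text Fusion, the Text-Text Fusion measures the similarity of two unimodal sentences, i.e., $T_i$ and $T_j$, whose features are respectively denoted as $v_{T_i}$ and $v_{T_j}$. Similar to the tensor fusion in Eq. 3, here the similarity function of two sentences can be derived as:

$$S_{tt}(T_i, T_j) = \mathrm{sigmoid}\Big(\mathrm{FC}\Big(\sum_{r=1}^{R} (\tilde{v}_{T_i} U_1^{r}) \odot (\tilde{v}_{T_j} U_2^{r})\Big)\Big), \tag{6}$$

where $\tilde{v}_{T_i} = v_{T_i}^\top W_{T_i}$ and $\tilde{v}_{T_j} = v_{T_j}^\top W_{T_j}$ are the projected sentence features and $U_1^{r}$, $U_2^{r}$ are the projection matrices of the $r$-th common subspace. Accordingly, we also adopt the ranking constraint with large margin to learn the text-text similarity $S_{tt}$. Specifically, given two sentences in a positive pair $(T_i, T_j)$, they have the same negative sample $\hat{T}$. Like Eq. 5, here the text-text loss can be formulated as:

$$L_{tt} = \big[\alpha - S_{tt}(T_i, T_j) + S_{tt}(T_i, \hat{T})\big]_{+} + \big[\alpha - S_{tt}(T_i, T_j) + S_{tt}(T_j, \hat{T})\big]_{+}, \tag{7}$$

which becomes a triplet ranking loss term. The two loss functions in Eq. 5 and Eq. 7 can be optimized independently with the Adam optimizer (Kingma and Ba, 2014).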

3.2. Cross-modal Re-ranking

Like most existing methods, the I2T and T2I retrieval tasks can be conducted in our MTFN separately, using the learned similarity function to obtain the retrieval candidates for a query image or text. However, the interactions between the two retrieval directions (I2T and T2I) are then ignored in the testing stage, resulting in a discrepancy between training and inference. Motivated by the success of deploying RR methods in the person re-ID task (Zhong et al., 2017; García et al., 2015; Qin et al., 2011) and text-based image search (Yang and Hanjalic, 2010, 2012; Hsu et al., 2006), which are designed for retrieval within unimodal data, we propose a cross-modal RR formulated as a novel $k$-reciprocal nearest neighbours searching problem to make the best of the initially learned similarity of image-text pairs and text-text pairs and narrow the gap between training and testing.

The basic assumption is that if an image is paired with a text, they can be retrieved from each other by I2T or T2I retrieval forwardly and reversely. In other words, for an image, its matching text should be the top of its ranking candidates and vice versa. Based on this assumption, we define two re-ranking strategies of I2T Re-ranking and T2I Re-ranking as follows.

Figure 3. An example of our I2T Re-ranking scheme. Given a query image, a conventional I2T retrieval list is first built. Then we apply the inverse retrieval direction T2I for further refinement.

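I2T Re-ranking. As shown in Fig. 3, given a query image $I$ and its initial ranking list produced by our MTFN using $S_{it}$, we define $N_{I2T}(I, K)$ as the initial cross-modal $K$-nearest neighbour texts of image $I$:

$$N_{I2T}(I, K) = \{t_1, t_2, \dots, t_K\}, \tag{8}$$

where $K$ denotes the number of candidates in the list. Then, for each candidate text $t_i \in N_{I2T}(I, K)$, a set of $N_I$-nearest images can be defined as:

$$N_{T2I}(t_i, N_I) = \{I_1, I_2, \dots, I_{N_I}\}, \tag{9}$$

where $N_I$ is the number of images in the testing set. To fuse the bi-directional nearest neighbours $N_{I2T}(I, K)$ and $N_{T2I}(t_i, N_I)$, we further introduce a position index for each candidate $t_i$ as:

$$p(t_i) = k \quad \text{if } I_k = I, \; I_k \in N_{T2I}(t_i, N_I). \tag{10}$$

Then a position set $P(I)$ can be built for all candidate texts in the initial $K$-nearest neighbours $N_{I2T}(I, K)$:

$$P(I) = \{p(t_1), p(t_2), \dots, p(t_K)\}. \tag{11}$$

The set $P(I)$ can be regarded as a reordering of the initial retrieval list using the text modality, effectively deploying the learned similarity matrix from the other direction (i.e., T2I). Therefore, we simply re-calculate the pairwise similarity between the query image and the candidate texts by ranking the position set as:

$$R(I) = \operatorname{argsort}_{t_i \in N_{I2T}(I, K)} \; p(t_i), \tag{12}$$

where $R(I)$ denotes the refined retrieval list for the query image $I$ after I2T re-ranking.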
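One way to read the I2T re-ranking steps above is sketched below. This is an interpretation rather than the authors' code: the function names `rank_desc` and `i2t_rerank` are hypothetical, the similarity matrix layout (rows are images, columns are texts) is assumed, and ties in the position index are broken by the initial I2T order because Python's `sorted` is stable.

```python
def rank_desc(scores):
    """Indices sorted by decreasing score (stable for ties)."""
    return sorted(range(len(scores)), key=lambda j: -scores[j])


def i2t_rerank(sim, query_img, k):
    """Re-rank the top-k texts of a query image using the T2I direction.

    sim[i][t] is the image-text similarity of image i and text t.
    """
    n_img = len(sim)
    # Initial I2T list: the k texts most similar to the query image.
    candidates = rank_desc(sim[query_img])[:k]
    positions = []
    for t in candidates:
        # Inverse direction: rank all images for this candidate text.
        column = [sim[i][t] for i in range(n_img)]
        image_ranking = rank_desc(column)
        # Position index: where the query image lands in that ranking.
        positions.append(image_ranking.index(query_img))
    # Candidates that rank the query image higher move to the front.
    order = sorted(range(len(candidates)), key=lambda c: positions[c])
    return [candidates[c] for c in order]


sim = [
    [0.6, 0.7, 0.1],
    [0.3, 0.9, 0.2],
    [0.1, 0.2, 0.8],
]
initial = rank_desc(sim[0])[:2]
refined = i2t_rerank(sim, query_img=0, k=2)
```

In this toy matrix, text 1 is the top I2T result for image 0, but text 1 itself prefers image 1, while text 0 prefers image 0; the re-ranking therefore promotes text 0.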

T2I Re-ranking. As an image is commonly annotated with multiple sentences in the datasets, we apply the obtained unimodal text similarity $S_{tt}$ as prior information to refine the T2I re-ranking process. Likewise, we first define the $K$-nearest images for a query text $T$ with the initial ranking list generated by our MTFN using $S_{it}$:

$$N_{T2I}(T, K) = \{I_1, I_2, \dots, I_K\}.$$

In contrast to I2T re-ranking, since each image is annotated with multiple sentences in practice, the retrieval results of $T$ have inner associations with those of other semantically related texts. Therefore, we find the unimodal nearest neighbour set of $T$ using the text-text similarity $S_{tt}$ and use it to replace the individual query text:

$$N_{T2T}(T, m) = \{T_1, T_2, \dots, T_m\},$$

where $m$ is the number of texts related to $T$. Then, similar to the re-ranking procedure in I2T Re-ranking, the refined results are obtained by performing I2T retrieval for each image in $N_{T2I}(T, K)$, yielding $R(T)$, the refined image list for the query text $T$ after the T2I re-ranking.
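The exact T2I re-ranking equations did not survive in this copy of the text, so the sketch below is only one plausible reading of the prose: the query text is replaced by itself plus its `m` nearest texts under the text-text similarity, and each candidate image is scored by the best position any of those texts reaches in the image's I2T ranking. The function names, the use of the minimum position, and the inclusion of the query text in its own neighbour set are all assumptions.

```python
def rank_desc(scores):
    """Indices sorted by decreasing score (stable for ties)."""
    return sorted(range(len(scores)), key=lambda j: -scores[j])


def t2i_rerank(sim_it, sim_tt, query_txt, k, m):
    """Re-rank the top-k images of a query text with text-text neighbours.

    sim_it[i][t]: image-text similarity; sim_tt[s][t]: text-text similarity.
    """
    n_img = len(sim_it)
    # Initial T2I list: the k images most similar to the query text.
    column = [sim_it[i][query_txt] for i in range(n_img)]
    candidates = rank_desc(column)[:k]
    # Unimodal neighbours of the query text (assumed to include the query).
    neighbours = set(rank_desc(sim_tt[query_txt])[:m])
    neighbours.add(query_txt)
    positions = []
    for i in candidates:
        # Inverse direction: rank all texts for this candidate image.
        text_ranking = rank_desc(sim_it[i])
        # Best position reached by the query text or any related text.
        positions.append(min(text_ranking.index(t) for t in neighbours))
    order = sorted(range(len(candidates)), key=lambda c: positions[c])
    return [candidates[c] for c in order]


# Texts 0 and 1 describe image 0; texts 2 and 3 describe image 1.
sim_it = [
    [0.9, 0.4, 0.5, 0.1],
    [0.2, 0.5, 0.8, 0.7],
]
sim_tt = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.1, 0.2],
    [0.1, 0.1, 1.0, 0.8],
    [0.2, 0.2, 0.8, 1.0],
]
with_neighbours = t2i_rerank(sim_it, sim_tt, query_txt=1, k=2, m=2)
without_neighbours = t2i_rerank(sim_it, sim_tt, query_txt=1, k=2, m=0)
```

In this toy example the query text 1 is initially matched to the wrong image; its sibling caption (text 0) ranks first for image 0, so using the text neighbours corrects the order, while the query text alone cannot break the tie.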

| Method | Flickr30k I2T R@1 | Flickr30k I2T R@5 | Flickr30k I2T R@10 | Flickr30k T2I R@1 | Flickr30k T2I R@5 | Flickr30k T2I R@10 | Flickr30k mR | MSCOCO I2T R@1 | MSCOCO I2T R@5 | MSCOCO I2T R@10 | MSCOCO T2I R@1 | MSCOCO T2I R@5 | MSCOCO T2I R@10 | MSCOCO mR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LTBN (Sim) (Wang et al., 2017) (TPAMI’18) | 16.6 | 38.8 | 51.0 | 7.4 | 23.5 | 33.3 | 28.4 | 30.9 | 61.1 | 76.2 | 14.0 | 30.0 | 37.8 | 41.7 |
| sm-LSTM (Huang et al., 2017a) (CVPR’17) | 42.5 | 71.9 | 81.5 | 30.2 | 60.4 | 72.3 | 59.8 | 53.2 | 83.1 | 91.5 | 40.7 | 75.8 | 87.4 | 72.0 |
| CMPM (Zhang and Lu, 2018) (ECCV’18) | 48.3 | 75.6 | 84.5 | 35.7 | 63.6 | 74.1 | 63.6 | 56.1 | 86.3 | 92.9 | 44.6 | 78.8 | 89 | 74.6 |
| DAN (Nam et al., 2017) (CVPR’17) | 41.4 | 73.5 | 82.5 | 31.8 | 61.7 | 72.5 | 60.6 | - | - | - | - | - | - | - |
| JGCAR (Wang et al., 2018) (MM’18) | 44.9 | 75.3 | 82.7 | 35.2 | 62.0 | 72.4 | 62.1 | 52.7 | 82.6 | 90.5 | 40.2 | 74.8 | 85.7 | 71.1 |
| LTBN (Emb) (Wang et al., 2017) (TPAMI’18) | 43.2 | 71.6 | 79.8 | 31.7 | 61.3 | 72.4 | 60.0 | 54.9 | 84.0 | 92.2 | 43.3 | 76.4 | 87.5 | 73.1 |
| RRF (Liu et al., 2017) (ICCV’17) | 47.6 | 77.4 | 87.1 | 35.4 | 68.3 | 79.9 | 66.0 | 56.4 | 85.3 | 91.5 | 43.9 | 78.1 | 88.6 | 74.0 |
| VSE++ (Faghri et al., 2018) (BMVC’18) | 52.9 | 79.1 | 87.2 | 39.6 | 69.6 | 79.5 | 68.0 | 64.6 | 89.1 | 95.7 | 52.0 | 83.1 | 92.0 | 79.4 |
| GXN (Gu et al., 2017) (CVPR’18) | - | - | - | - | - | - | - | 68.5 | - | 97.9 | 56.6 | - | 94.5 | - |
| SCO (Huang et al., 2017b) (CVPR’18) | 55.5 | 82.0 | 89.3 | 41.1 | 70.5 | 80.1 | 69.8 | 69.9 | 92.9 | 97.5 | 56.7 | 87.5 | 94.8 | 83.2 |
| SCAN (T2I) (Lee et al., 2018) (ECCV’18) | 61.8 | 87.5 | 93.7 | 45.8 | 74.4 | 83.0 | 74.4 | 70.9 | 94.5 | 97.8 | 56.4 | 87.0 | 93.9 | 83.4 |
| SCAN (I2T) (Lee et al., 2018) (ECCV’18) | 67.9 | 89.0 | 94.4 | 43.9 | 74.2 | 82.8 | 75.4 | 69.2 | 93.2 | 97.5 | 54.4 | 86.0 | 93.6 | 82.3 |
| MTFN | 63.1 | 85.8 | 92.4 | 46.3 | 75.3 | 83.6 | 74.4 | 71.9 | 94.2 | 97.9 | 57.3 | 88.6 | 95.0 | 84.2 |
| MTFN-RR (w/o text-text similarity) | 65.3 | 88.3 | 93.3 | 46.7 | 75.9 | 83.8 | 75.6 | 74.3 | 94.9 | 97.9 | 57.5 | 88.8 | 95.0 | 84.7 |
| MTFN-RR (with text-text similarity) | 65.3 | 88.3 | 93.3 | 52.0 | 80.1 | 86.1 | 77.5 | 74.3 | 94.9 | 97.9 | 60.1 | 89.1 | 95.0 | 85.2 |

Table 1. Overall comparison with the state-of-the-art results. Three panels are the classification-based methods, embedding-based methods and our proposed method, respectively. The best results are marked in bold font.

4. Experiment

4.1. Experimental Setup

Datasets and Evaluation Metric.

We conducted several experiments on two widely used datasets, i.e., Flickr30k (Young et al., 2014) and MSCOCO (Lin et al., 2014), with the following widely used experimental protocols: 1) Flickr30k contains 31000 images collected from the Flickr website. Each image is manually annotated with 5 sentences. We use the same data split setting as in (Faghri et al., 2018; Lee et al., 2018), with the training, validation and test splits containing 28000, 1000 and 1000 images, respectively. 2) MSCOCO consists of 123287 images, each associated with 5 sentences. We use the public training, validation and testing splits following (Lee et al., 2018; Faghri et al., 2018), where 113287 and 5000 images are used for training and validation, respectively. For the 5000 test images, we report results by i) averaging over 5 folds of 1k test images and ii) directly evaluating on the full 5k images.

We conduct two kinds of image-text matching tasks: 1) sentence retrieval, i.e., retrieving ground-truth sentences given a query image (I2T); 2) image retrieval, i.e., retrieving ground-truth images given a query text (T2I). The commonly used evaluation metric for the I2T and T2I tasks is Recall@K (R@K), defined as the recall rate at the top $K$ results to the query, usually with $K \in \{1, 5, 10\}$. We also use the “mR” score proposed in (Huang et al., 2017b) for additional evaluation, which averages all the R@K recall scores to assess the overall performance for both I2T and T2I tasks.
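The two metrics can be sketched in a few lines. The function names `recall_at_k` and `mean_recall` are hypothetical, and representing each query by the 0-based rank of its best-ranked ground-truth item is an assumed convention (it matters for I2T, where an image has five ground-truth sentences). The mR example uses the Flickr30k row reported for MTFN in Table 1.

```python
def recall_at_k(ranks, k):
    """Percentage of queries whose ground truth appears in the top k.

    ranks: for each query, the 0-based rank of its best-ranked ground truth.
    """
    return 100.0 * sum(1 for r in ranks if r < k) / len(ranks)


def mean_recall(i2t_recalls, t2i_recalls):
    """mR: plain average of all reported recall scores of both tasks."""
    values = list(i2t_recalls) + list(t2i_recalls)
    return sum(values) / len(values)


toy_ranks = [0, 3, 7, 12]
toy_recalls = [recall_at_k(toy_ranks, k) for k in (1, 5, 10)]

# R@1, R@5, R@10 of MTFN on Flickr30k as listed in Table 1.
mtfn_flickr_mr = mean_recall([63.1, 85.8, 92.4], [46.3, 75.3, 83.6])
```

Rounded to one decimal, the second value reproduces the 74.4 mR listed for MTFN on Flickr30k, which confirms that mR is the plain average of the six recall scores.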

Implementation Details. The feature extraction in our experiments generally follows the pre-processing adopted in (Anderson et al., 2017; Lee et al., 2018). Specifically, for visual feature representation, we use the ResNet (He et al., 2016) model to extract the CNN features for 36 regions detected by a Faster-RCNN (Ren et al., 2015) model pre-trained on Visual Genome (Krishna et al., 2016). Then, after global average pooling on the feature map, an image is represented by a 2048d global feature vector. For textual feature representation, we use a GRU (Cho et al., 2014) initialized with the parameters of a pre-trained Skip-thoughts model (Kiros et al., 2015) to represent each text sentence by a 2400d feature vector. We trained our model using the Adam optimizer with a mini-batch size of 128 for 50 epochs on each dataset. The initial learning rate is 0.0001, decayed by 2 every 10 epochs. The two fusion branches in our model are trained successively, where we use the parameters of the Image-Text fusion branch to initialize the Text-Text fusion branch for stable training performance. The projection dimensions $t_I$, $t_T$ and $t_o$ of the two intra-modal spaces and the common fusion space are empirically set to 1024, the margin $\alpha$ is set to 0.2 and the rank $R$ is 20. In the cross-modal RR, the number $K$ for nearest-neighbor searching is set to 15 and 7 for the Flickr30k and MSCOCO datasets, respectively. Our model is implemented in PyTorch (27) and all the experiments are conducted on a workstation with two NVIDIA 1080 Ti GPUs.
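As a small worked example of the training schedule, the sketch below assumes that "decayed by 2 every 10 epochs" means the learning rate is halved at epochs 10, 20, 30 and 40; the function name `learning_rate` and this step-wise reading are assumptions, not details confirmed by the text.

```python
def learning_rate(epoch, base_lr=1e-4, factor=2.0, step=10):
    """Step decay: divide the base rate by `factor` once per `step` epochs."""
    return base_lr / (factor ** (epoch // step))


# Learning rate at the start of each 10-epoch stage of a 50-epoch run.
schedule = [learning_rate(e) for e in (0, 10, 20, 30, 40)]
final_lr = learning_rate(49)
```

Under this reading the rate falls from 1e-4 to 6.25e-6 over the 50 training epochs.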

4.2. Comparisons with the State-of-the-arts

We compared our model with several recent state-of-the-art models, including the classification-based methods: LTBN (Sim) (Wang et al., 2017), sm-LSTM (Huang et al., 2017a), CMPM (Zhang and Lu, 2018) and the embedding-based methods: DAN (Nam et al., 2017), JGCAR (Wang et al., 2018), LTBN (Emb) (Wang et al., 2017), RRF (Liu et al., 2017), VSE++ (Faghri et al., 2018), GXN (Gu et al., 2017), SCO (Huang et al., 2017b), SCAN (Lee et al., 2018). Note that for fair and objective comparison, feature extractions for images and text and evaluation protocols in all methods are consistent with (Lee et al., 2018; Huang et al., 2017b).

Table 1 shows the overall I2T and T2I retrieval results of our model and the counterparts on the Flickr30k and MSCOCO datasets. We can make the following observations:

  • Our MTFN itself achieves competitive results for both tasks on the two datasets. This indicates that our proposed fusion network is capable of learning an effective similarity function that fully encodes the interactions between image and text. We note that our MTFN obtains slightly inferior I2T performance compared to the current best model SCAN on Flickr30k. A probable reason is the smaller size of Flickr30k compared to MSCOCO. However, SCAN still cannot achieve the best results on both the I2T and T2I tasks with a single model. Moreover, the difference in performance between MTFN and SCAN is insignificant compared to the immense difference in algorithmic complexity: SCAN is much more complex than our MTFN, as it is elaborately designed for the I2T and T2I tasks separately and relies on fine-grained local features of both image regions and textual words with an additional attention mechanism.

  • When combined with the proposed cross-modal RR scheme with text-text similarity, our MTFN-RR gains remarkable improvements compared with MTFN on both tasks and achieves state-of-the-art performance in most cases. The main reason is that the two cascaded steps of the framework exploit the synergy between the I2T and T2I retrieval tasks by looking at the image-text matching task simultaneously from two perspectives (from image to text and from text to image). In addition, we also explore our MTFN-RR without exploiting text-text similarity. From the results we can see that the improvement on T2I decreases significantly in this case, due to the data imbalance between images and text.

  • The most notable improvement of our MTFN-RR is achieved on R@1 and R@5 on both datasets, which is the most beneficial for retrieval in practice. Specifically, on the Flickr30k dataset, the best R@1 on the T2I task of our MTFN-RR is 52.0, which is superior to SCAN by a large margin of 13.5%. On the MSCOCO 1k test, our MTFN-RR obtains R@1 scores of 74.3 and 60.1 on the I2T and T2I tasks, consistently outperforming the second best by 4.8% and 6.0%, respectively. Besides, our model also performs best on the MSCOCO 5k test shown in Table 2, which further verifies the superiority of the proposed MTFN-RR.

  • The improvement on the T2I task by our MTFN-RR is more remarkable than that on the I2T task, showing the advantage of the Text-Text fusion in our proposed fusion network in capturing the similarity of semantically related text and enhancing the accuracy of the learned image-text similarity. Fig. 6 visualizes several typical retrieval examples obtained by our MTFN and MTFN-RR on the two datasets.
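The percentage margins quoted in the observations above appear to be relative gains over the strongest competing R@1 score in Table 1, not absolute differences in recall points. The following check recomputes them; the function name `relative_gain` is hypothetical, and the choice of baselines (SCAN (T2I) for Flickr30k T2I and MSCOCO I2T, SCO for MSCOCO T2I) is inferred from the table.

```python
def relative_gain(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return 100.0 * (ours - baseline) / baseline


# Flickr30k T2I R@1: MTFN-RR 52.0 vs. SCAN (T2I) 45.8.
flickr_t2i = relative_gain(52.0, 45.8)
# MSCOCO 1k I2T R@1: MTFN-RR 74.3 vs. SCAN (T2I) 70.9.
coco_i2t = relative_gain(74.3, 70.9)
# MSCOCO 1k T2I R@1: MTFN-RR 60.1 vs. SCO 56.7.
coco_t2i = relative_gain(60.1, 56.7)
```

Rounded to one decimal, these give 13.5, 4.8 and 6.0, matching the figures in the text; the corresponding absolute gains are 6.2, 3.4 and 3.4 recall points.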

| Method | I2T R@1 | I2T R@5 | I2T R@10 | T2I R@1 | T2I R@5 | T2I R@10 |
|---|---|---|---|---|---|---|
| DPC (Zheng et al., 2017) | 41.2 | 70.5 | 81.1 | 25.3 | 53.4 | 66.4 |
| GXN (Gu et al., 2017) | 42.0 | - | 84.7 | 31.7 | - | 74.6 |
| SCO (Huang et al., 2017b) | 42.8 | 72.3 | 83.0 | 33.1 | 62.9 | 75.5 |
| CMPM (Zhang and Lu, 2018) | 31.1 | 60.7 | 73.9 | 22.9 | 50.2 | 63.8 |
| SCAN (T2I) (Lee et al., 2018) | 46.2 | 77.1 | 86.8 | 34.3 | 64.7 | 75.8 |
| SCAN (I2T) (Lee et al., 2018) | 46.4 | 77.4 | 87.2 | 34.7 | 64.8 | 76.8 |
| MTFN (Ours) | 44.7 | 76.4 | 87.3 | 33.1 | 64.7 | 76.1 |
| MTFN-RR (Ours) | 48.3 | 77.6 | 87.3 | 35.9 | 66.1 | 76.1 |

Table 2. The I2T and T2I retrieval results obtained by our models and the counterparts on MSCOCO 5k test set.
| Fusion Strategy (MTFN (Ours) / Sum / Product / Concatenation) | Attention | Training Time (hours) | Evaluating Time (s) | I2T R@1 | I2T R@5 | I2T R@10 | T2I R@1 | T2I R@5 | T2I R@10 |
|---|---|---|---|---|---|---|---|---|---|
| | | 7.9 | 36.2 | 65.2 | 90.1 | 96.2 | 45.3 | 82.6 | 90.8 |
| | | 7.9 | 36.7 | 67.1 | 92.8 | 97.2 | 48.3 | 84.6 | 92.5 |
| | | 8.2 | 37.6 | 65.1 | 90.6 | 96.1 | 46.2 | 83.0 | 91.9 |
| | | 47.1 | 261.5 | 66.3 | 90.2 | 96.2 | 46.1 | 82.9 | 90.7 |
| | | 48.1 | 262.9 | 67.3 | 92.6 | 97.1 | 48.8 | 84.8 | 92.6 |
| | | 48.9 | 264.3 | 66.1 | 91.8 | 96.4 | 46.6 | 83.2 | 92.0 |
| | | 50.2 | 283.2 | 70.8 | 92.8 | 97.1 | 53.7 | 87.2 | 94.5 |
| | | 9.0 | 40.2 | 71.9 | 94.2 | 97.9 | 57.3 | 88.6 | 95.0 |

Table 3. Comparison of our MTFN with other common fusion strategies on the MSCOCO 1k test set. Check mark represents the combination of different fusion methods and attention mechanism.

Figure 4. Visualization of the fusion vector by classification-based method and our MTFN embedding on the part of MSCOCO test set (8000 image-text pairs) with the learned linear SVM boundary.

4.3. Further Analysis

Distribution of Fusion Vector. We conclude our qualitative analysis with a global view of the performance of our proposed MTFN compared to a classification-based method, obtained by replacing the ranking loss in MTFN with logistic regression. We visualize the distributions of the fusion vector on the MSCOCO dataset by using the t-SNE tool to map it onto a 2D space. For better analysis, we further fit a standard SVM (support-vector machine) on each embedding to find a linear boundary between “match” and “mismatch” dots and to compute the classification accuracy. Fig. 4 depicts the overall results, where the black dashed line denotes the learned SVM boundary. We can conclude that our model better preserves the structure of matching image-text pairs with a larger margin and achieves much higher accuracy for classifying “match” and “mismatch”. The main reason is that our ranking-based loss optimizes the model in terms of a margin without forcing the image-text pairs only to “1” and “-1”.
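
The difference between a margin-based ranking objective and a classification objective can be sketched as follows. This is an illustration only: it assumes a hinge-form ranking loss summed over negatives with a hypothetical margin of 0.2, which may differ from the exact loss defined in Section 3.1, and the scores are made up.

```python
import math


def ranking_loss(pos_score, neg_scores, margin=0.2):
    """Hinge ranking loss: each negative should score at least `margin` below the positive."""
    return sum(max(0.0, margin - pos_score + s) for s in neg_scores)


def logistic_loss(score, label):
    """Logistic loss for a label in {+1, -1}; it is never exactly zero for finite scores."""
    return math.log(1.0 + math.exp(-label * score))


# Hypothetical similarity scores for one query: one matching pair, two mismatching pairs.
pos, negs = 0.7, [0.4, 0.6]

# Only the 0.6 negative violates the margin, so only it contributes to the ranking loss.
rank = ranking_loss(pos, negs)

# The classification loss penalizes every pair, even those that are already ranked correctly.
cls = logistic_loss(pos, +1) + sum(logistic_loss(s, -1) for s in negs)
```

Once every negative is separated from the positive by the margin, the ranking loss is zero and the scores are left alone, whereas the logistic loss keeps pushing each pair toward its label.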

Analysis on Fusion Strategy. In this experiment, we compare our MTFN with the linear fusion schemes used in (Liu et al., 2017; Wang et al., 2017), e.g., element-wise sum/product and concatenation, by evaluating the I2T and T2I retrieval results together with the training and evaluation time as a measure of time complexity. The popular attention mechanism used in (Xu et al., 2015; Nam et al., 2017; Lee et al., 2018) is also included to assess its influence on the different fusion schemes. Following the experimental setting in (Fukui et al., 2016), each model combination is given a similar number of parameters by adding multiple fully connected layers.

Table 3 shows that our MTFN itself outperforms all the traditional linear fusion strategies with much less training and evaluation time. The attention mechanism benefits the previous linear fusion schemes; however, it degrades the performance of our MTFN and greatly increases the time consumed. A likely reason is that our MTFN already effectively encodes the bilinear interactions between the global features of images and text under the rank constraint, so adding attention only increases the model complexity and time consumption.
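
The fusion schemes compared in this experiment can be sketched in a few lines. The low-rank bilinear variant below is a generic rank-R formulation (a sum over R element-wise products of projected features) and is only an assumption about the general form; it is not the exact operator of Eq. 3, and all function names and toy values are hypothetical.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def fuse_sum(v, t):
    return [a + b for a, b in zip(v, t)]


def fuse_product(v, t):
    return [a * b for a, b in zip(v, t)]


def fuse_concat(v, t):
    return list(v) + list(t)


def fuse_low_rank_bilinear(v, t, U, V):
    """Rank-R bilinear fusion: sum over r of (U[r] v) * (V[r] t), element-wise.

    U[r] is a (d_out x d_v) matrix and V[r] is a (d_out x d_t) matrix, so each
    output coordinate is a bilinear form in (v, t) whose matrix has rank at most R.
    """
    out = [0.0] * len(U[0])
    for Ur, Vr in zip(U, V):
        pv, pt = matvec(Ur, v), matvec(Vr, t)
        out = [o + a * b for o, a, b in zip(out, pv, pt)]
    return out


# Toy global features for one image and one sentence (hypothetical values).
v = [1.0, 2.0]
t = [3.0, 4.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
z = fuse_low_rank_bilinear(v, t, U=[identity], V=[identity])
```

With identity projections and rank 1, the bilinear fusion reduces to the element-wise product, which shows that the linear product scheme is a special case; learned projections and a higher rank let every image dimension interact with every text dimension.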

Figure 5. R@1 scores of the I2T retrieval on the MSCOCO 1k test set with various model parameters using global and local features. The yellow label indicates the time consumption when achieving best result.
Figure 6. Qualitative results of I2T and T2I retrieval on Flickr30k and MSCOCO datasets obtained by our MTFN and MTFN-RR models. For I2T, the ground-truth texts are marked in red and bold, while texts sharing similar meanings are underlined. For T2I, the ground-truth images are outlined with red rectangles. More results can be found in our supplementary material.

Analysis on Model Complexity. Our MTFN can use either global or local features for images and text under the tensor fusion constraint depicted in Eq. 3. To investigate the effect of the raw features and hyper-parameters on our MTFN, we take the MSCOCO dataset as a testbed and assess the model complexity using the extracted global or local features for images and text. As shown in Fig. 5, across various numbers of model parameters, global features consistently obtain better performance than local ones, showing that the bilinear tensor fusion process in our MTFN handles global features more effectively. Moreover, to achieve its best performance, our MTFN trains much faster with global features (around 9 hours) than with local ones (around 48 hours). It is worth mentioning that we also evaluate the model complexity of the previous sm-LSTM and SCAN methods, which in practice need around 50 and 60 hours for training, respectively. This further demonstrates that our MTFN is much more efficient to train than these two counterparts, which use local features, owing to the tensor fusion applied in our MTFN.

Figure 7. For the proposed RR scheme: (a) Comparison of RR applied to our MTFN and recent state-of-the-art methods on MSCOCO dataset. (b) Effect of the number of nearest neighbors used in RR on R@1 on Flickr30k and MSCOCO datasets.

Analysis on Cross-modal Re-ranking. As mentioned above, the proposed cross-modal RR scheme can also be applied to most previous methods that use image-text similarity to obtain a ranking list. In this experiment, we take the MSCOCO dataset and apply our proposed RR scheme to our MTFN and three recent methods, DAN (Nam et al., 2017), VSE++ (Faghri et al., 2018) and SCAN (Lee et al., 2018), to investigate its effect on refining the retrieval results. Specifically, for each query instance (image or sentence), we perform I2T and T2I retrieval with each model to get its initial ranking list, and then obtain the refined ranking list after the re-ranking process. Fig. 7(a) compares the initial and refined ranking lists on both the I2T and T2I tasks. The re-ranking process brings remarkable improvements for all four methods on the I2T task, showing that utilizing cross-modal associations helps achieve more accurate retrieval. We can also observe that re-ranking is effective for our MTFN on the T2I task, while its effect is minor in other cases. In Fig. 7(b), we also assess the impact of the number of nearest neighbors used during the RR process on the R@1 of our MTFN. The RR performance on R@1 remains stable for large neighborhoods, with the critical point at a neighborhood size of 6, below which the performance degrades.
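
Because the RR scheme only needs a similarity matrix, it can be bolted onto any of these models at test time. The sketch below illustrates one simplified form of the idea for I2T, assuming that the top-k sentences of a query image are re-ordered by how early each sentence retrieves that image in the reverse (T2I) direction. It is not the full scheme of Section 3.2, which additionally exploits text-text similarity; the function name `rerank_i2t` and the toy scores are hypothetical.

```python
def rerank_i2t(sim, query_img, k):
    """Re-rank the top-k sentences for one query image using reverse (T2I) search.

    sim[i][j] is the image-text similarity of image i and sentence j.
    Returns sentence indices, best first; sentences beyond the top-k keep their order.
    """
    n_img = len(sim)
    scores = sim[query_img]
    initial = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    top, rest = initial[:k], initial[k:]

    def reverse_rank(j):
        # Number of images that sentence j retrieves ahead of the query image.
        col = [sim[i][j] for i in range(n_img)]
        return sum(1 for s in col if s > col[query_img])

    # Sentences that retrieve the query image earlier move up.
    # list.sort is stable, so ties keep their initial order.
    top.sort(key=reverse_rank)
    return top + rest


# Toy similarity matrix: 3 images (rows) x 3 sentences (columns), hypothetical scores.
sim = [
    [0.80, 0.75, 0.10],
    [0.90, 0.20, 0.30],
    [0.10, 0.30, 0.60],
]
refined = rerank_i2t(sim, query_img=0, k=2)
```

In this toy case sentence 0 initially ranks first for image 0, but it prefers image 1 in the reverse direction, so sentence 1 is promoted. With k set to 1 the list is unchanged, which is consistent with the observation that too small a neighborhood limits what re-ranking can correct.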

Analysis on Model Ensemble. Model ensemble is a practical strategy that integrates the retrieval results from multiple models. The recent approaches RRF-Net and SCAN have already studied model ensembles and shown that they further boost retrieval performance. In this part, we follow them by averaging individual MTFN-RR models and compare with RRF-Net and SCAN under different ensemble settings on the Flickr30k dataset. Specifically, for RRF-Net and our MTFN-RR, denotes the number of individual models used for the ensemble, while (I2T + T2I) denotes the integration of the SCAN models separately trained for I2T and T2I. As shown in Table 4, for our MTFN-RR model, compared with a single model (i.e., ), merging multiple models generally obtains better retrieval performance without increasing the training complexity. In addition, our MTFN-RR () significantly outperforms the best ensemble result of SCAN (I2T + T2I) on the T2I task, showing the advantage of our MTFN-RR method.
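
A minimal sketch of such an ensemble is given below. It assumes that the quantity being averaged is the image-text similarity matrix produced by each individually trained model; the text only states that individual models are averaged, so this choice, the function names, and the toy scores are assumptions made for illustration.

```python
def ensemble_similarity(sim_matrices):
    """Average the similarity matrices produced by several individually trained models."""
    n = len(sim_matrices)
    rows, cols = len(sim_matrices[0]), len(sim_matrices[0][0])
    return [
        [sum(m[i][j] for m in sim_matrices) / n for j in range(cols)]
        for i in range(rows)
    ]


def rank_candidates(scores):
    """Candidate indices sorted by descending score."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


# Two hypothetical models scoring 2 queries against 2 candidates.
model_a = [[0.6, 0.5], [0.2, 0.9]]
model_b = [[0.3, 0.6], [0.1, 0.8]]
avg = ensemble_similarity([model_a, model_b])
ranking_single = rank_candidates(model_a[0])
ranking_ensemble = rank_candidates(avg[0])
```

For the first query, model A alone prefers candidate 0, while the averaged scores prefer candidate 1. The ensemble only touches the test-time scores, so each model is still trained independently.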

5. Conclusion

In this work, we proposed a novel image-text matching method named MTFN, which directly and effectively learns the image-text similarity function by multi-modal tensor fusion of global visual and textual features, without redundant training steps. We then combined our MTFN with an effective and general cross-modal RR scheme, forming the MTFN-RR framework, to boost the I2T and T2I retrieval results by considering additional unimodal text-text similarity. Experiments on two benchmark datasets showed the effectiveness of our MTFN and the RR scheme, which achieve state-of-the-art retrieval performance with much less time consumption. In the future, we plan to develop more effective cross-modal RR schemes to form an end-to-end matching framework.

| Ensemble Model | I2T R@1 | I2T R@5 | T2I R@1 | T2I R@5 |
| --- | --- | --- | --- | --- |
| RRF-Net () (Liu et al., 2017) | 50.3 | 79.2 | 37.4 | 70.4 |
| SCAN (I2T + T2I) (Lee et al., 2018) | 67.4 | 90.3 | 48.6 | 77.7 |
| MTFN-RR | 65.3 | 88.3 | 52.0 | 80.1 |
| MTFN-RR | 67.4 | 89.4 | 52.8 | 80.9 |
| MTFN-RR | 67.7 | 90.1 | 53.2 | 81.2 |

Table 4. Model ensemble results of our MTFN-RR and the counterparts RRF-Net and SCAN on Flickr30k dataset.

6. Acknowledgment

This work was supported in part by the National Natural Science Foundation of China under grants No. 61602089, 61572108, 61632007 and the Sichuan Science and Technology Program, China, under Grants No. 2019ZDZX0008 and 2018GZDZX0032.

References


  • P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2017) Bottom-up and top-down attention for image captioning and VQA. CoRR abs/1707.07998. Cited by: §4.1.
  • S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh (2015) VQA: visual question answering. In 2015 IEEE International Conference on Computer Vision, pp. 2425–2433. Cited by: §2.
  • H. Ben-younes, R. Cadène, M. Cord, and N. Thome (2017) MUTAN: multimodal tucker fusion for visual question answering. In IEEE International Conference on Computer Vision, pp. 2631–2639. Cited by: §2, §3.1.
  • K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio (2014) On the properties of neural machine translation: encoder-decoder approaches. In Proceedings of SSST@EMNLP 2014, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pp. 103–111. Cited by: §4.1.
  • F. Faghri, D. J. Fleet, J. Kiros, and S. Fidler (2018) VSE++: improving visual-semantic embeddings with hard negatives. In British Machine Vision Conference 2018, pp. 12. Cited by: §1, §3.1, Table 1, §4.1, §4.2, §4.3.
  • A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach (2016) Multimodal compact bilinear pooling for visual question answering and visual grounding. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 457–468. Cited by: §2, §2, §4.3.
  • J. García, N. Martinel, C. Micheloni, and A. G. Vicente (2015) Person re-identification ranking optimisation by discriminant context information analysis. In 2015 IEEE International Conference on Computer Vision, pp. 1305–1313. Cited by: §2, §3.2.
  • J. Gu, J. Cai, S. R. Joty, L. Niu, and G. Wang (2017) Look, imagine and match: improving textual-visual cross-modal retrieval with generative models. CoRR abs/1711.06420. Cited by: Table 1, §4.2, Table 2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §4.1.
  • W. H. Hsu, L. S. Kennedy, and S. Chang (2006) Video search reranking via information bottleneck principle. In Proceedings of the 14th ACM international conference on Multimedia, pp. 35–44. Cited by: §2, §3.2.
  • M. Hu, Y. Yang, F. Shen, N. Xie, R. Hong, and H. T. Shen (2019) Collective reconstructive embeddings for cross-modal hashing. IEEE Transactions on Image Processing 28 (6), pp. 2770–2784. Cited by: §1.
  • Y. Huang, W. Wang, and L. Wang (2017a) Instance-aware image and sentence matching with selective multimodal LSTM. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, pp. 7254–7262. Cited by: §1, §1, Table 1, §4.2.
  • Y. Huang, Q. Wu, and L. Wang (2017b) Learning semantic concepts and order for image and sentence matching. CoRR abs/1712.02036. Cited by: §1, Table 1, §4.1, §4.2, Table 2.
  • A. Jabri, A. Joulin, and L. van der Maaten (2016) Revisiting visual question answering baselines. In Computer Vision - ECCV 2016 - 14th European Conference, pp. 727–739. Cited by: §2.
  • A. Karpathy and L. Fei-Fei (2015) Deep visual-semantic alignments for generating image descriptions. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 3128–3137. Cited by: §1, §2.
  • J. Kim, K. W. On, W. Lim, J. Kim, J. Ha, and B. Zhang (2016) Hadamard product for low-rank bilinear pooling. CoRR abs/1610.04325. Cited by: §2.
  • D. P. Kingma and J. Ba (2014) Adam: A method for stochastic optimization. CoRR abs/1412.6980. External Links: 1412.6980 Cited by: §3.1.
  • R. Kiros, Y. Zhu, R. Salakhutdinov, R. S. Zemel, R. Urtasun, A. Torralba, and S. Fidler (2015) Skip-thought vectors. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, pp. 3294–3302. Cited by: §4.1.
  • R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, M. S. Bernstein, and F. Li (2016) Visual genome: connecting language and vision using crowdsourced dense image annotations. CoRR abs/1602.07332. Cited by: §4.1.
  • K. Lee, X. Chen, G. Hua, H. Hu, and X. He (2018) Stacked cross attention for image-text matching. CoRR abs/1803.08024. Cited by: §1, §1, §2, §3.1, Table 1, §4.1, §4.1, §4.2, §4.3, §4.3, Table 2, Table 4.
  • Q. Leng, R. Hu, C. Liang, Y. Wang, and J. Chen (2015) Person re-identification with content and context re-ranking. Multimedia Tools Appl. 74 (17), pp. 6989–7014. Cited by: §2.
  • S. Li, T. Xiao, H. Li, W. Yang, and X. Wang (2017) Identity-aware textual-visual matching with latent co-attention. In IEEE International Conference on Computer Vision, pp. 1908–1917. Cited by: §1, §1.
  • T. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In Computer Vision - ECCV 2014 - 13th European Conference, pp. 740–755. Cited by: §1, §4.1.
  • Y. Liu, Y. Guo, E. M. Bakker, and M. S. Lew (2017) Learning a recurrent residual fusion network for multimodal matching. In IEEE International Conference on Computer Vision, pp. 4127–4136. Cited by: §2, Table 1, §4.2, §4.3, Table 4.
  • H. Nam, J. Ha, and J. Kim (2017) Dual attention networks for multimodal reasoning and matching. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2156–2164. Cited by: Table 1, §4.2, §4.3, §4.3.
  • Z. Niu, M. Zhou, L. Wang, X. Gao, and G. Hua (2017) Hierarchical multimodal LSTM for dense visual-semantic embedding. In IEEE International Conference on Computer Vision, pp. 1899–1907. Cited by: §2.
  • [27] PyTorch open source toolkit. Note: Cited by: §4.1.
  • D. Qin, S. Gammeter, L. Bossard, T. Quack, and L. J. V. Gool (2011) Hello neighbor: accurate object retrieval with k-reciprocal nearest neighbors. In The 24th IEEE Conference on Computer Vision and Pattern Recognition, pp. 777–784. Cited by: §2, §3.2.
  • S. Ren, K. He, R. B. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, pp. 91–99. Cited by: §4.1.
  • X. Shen, Z. Lin, J. Brandt, S. Avidan, and Y. Wu (2012) Object retrieval and localization with spatially-constrained similarity measure and k-nn re-ranking. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3013–3020. Cited by: §2.
  • J. Song, Y. Yang, Y. Yang, Z. Huang, and H. T. Shen (2013) Inter-media hashing for large-scale retrieval from heterogeneous data sources. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 785–796. Cited by: §1.
  • L. Wang, Y. Li, and S. Lazebnik (2016) Learning deep structure-preserving image-text embeddings. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, pp. 5005–5013. Cited by: §1.
  • L. Wang, Y. Li, and S. Lazebnik (2017) Learning two-branch neural networks for image-text matching tasks. CoRR abs/1704.03470. Cited by: §2, §2, §3.1, Table 1, §4.2, §4.3.
  • S. Wang, Y. Chen, J. Zhuo, Q. Huang, and Q. Tian (2018) Joint global and co-attentive representation learning for image-sentence retrieval. In 2018 ACM Multimedia Conference on Multimedia Conference, pp. 1398–1406. Cited by: Table 1, §4.2.
  • K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning, pp. 2048–2057. Cited by: §4.3.
  • X. Xu, L. He, H. Lu, L. Gao, and Y. Ji (2018) Deep adversarial metric learning for cross-modal retrieval. World Wide Web. External Links: Document Cited by: §1.
  • X. Xu, H. Lu, J. Song, Y. Yang, H. T. Shen, and X. Li (2019) Ternary adversarial networks with self-supervision for zero-shot cross-modal retrieval. IEEE Trans Cybernetics. External Links: Document Cited by: §1.
  • L. Yang and A. Hanjalic (2010) Supervised reranking for web image search. In Proceedings of the 18th ACM international conference on Multimedia, pp. 183–192. Cited by: §2, §3.2.
  • L. Yang and A. Hanjalic (2012) Prototype-based image search reranking. IEEE transactions on multimedia 14 (3), pp. 871–882. Cited by: §2, §3.2.
  • M. Ye, J. Chen, Q. Leng, C. Liang, Z. Wang, and K. Sun (2015) Coupled-view based ranking optimization for person re-identification. In International Conference on Multimedia Modeling, pp. 105–117. Cited by: §2.
  • M. Ye, C. Liang, Y. Yu, Z. Wang, Q. Leng, C. Xiao, J. Chen, and R. Hu (2016) Person re-identification via ranking aggregation of similarity pulling and dissimilarity pushing. IEEE Transactions on Multimedia 18 (12), pp. 2553–2566. Cited by: §2.
  • P. Young, A. Lai, M. Hodosh, and J. Hockenmaier (2014) From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. TACL 2, pp. 67–78. Cited by: §1, §4.1.
  • Y. Zhang and H. Lu (2018) Deep cross-modal projection learning for image-text matching. In The European Conference on Computer Vision (ECCV). Cited by: §1, §1, Table 1, §4.2, Table 2.
  • Z. Zheng, L. Zheng, M. Garrett, Y. Yang, and Y. Shen (2017) Dual-path convolutional image-text embedding. arXiv preprint arXiv:1711.05535. Cited by: §1, Table 2.
  • Z. Zhong, L. Zheng, D. Cao, and S. Li (2017) Re-ranking person re-identification with k-reciprocal encoding. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3652–3661. Cited by: §2, §3.2.
  • B. Zhou, Y. Tian, S. Sukhbaatar, A. Szlam, and R. Fergus (2015) Simple baseline for visual question answering. CoRR abs/1512.02167. Cited by: §2.