In this paper, we investigate the cross-media retrieval between images and text, i.e., using image to search text (I2T) and using text to search images (T2I). Existing cross-media retrieval methods usually learn one couple of projections, by which the original features of images and text can be projected into a common latent space to measure the content similarity. However, using the same projections for the two different retrieval tasks (I2T and T2I) may lead to a tradeoff between their respective performances, rather than their best performances. Different from previous works, we propose a modality-dependent cross-media retrieval (MDCR) model, where two couples of projections are learned for different cross-media retrieval tasks instead of one couple of projections. Specifically, by jointly optimizing the correlation between images and text and the linear regression from one modal space (image or text) to the semantic space, two couples of mappings are learned to project images and text from their original feature spaces into two common latent subspaces (one for I2T and the other for T2I). Extensive experiments show the superiority of the proposed MDCR compared with other methods. In particular, based on the 4,096 dimensional convolutional neural network (CNN) visual feature and the 100 dimensional LDA textual feature, the mAP of the proposed method reaches 41.5%, which is a new state-of-the-art performance on the Wikipedia dataset.
With the rapid development of information technology, multi-modal data (e.g., image, text, video or audio) have become widely available on the Internet. For example, an image often co-occurs with text on a web page to describe the same object or event. Related research has grown steadily in recent decades, among which the retrieval across different modalities has attracted much attention and benefited many practical applications. However, multi-modal data usually span different feature spaces. This heterogeneity poses a great challenge to cross-media retrieval tasks. In this work, we mainly focus on the cross-media retrieval between text and images (Fig. 1), i.e., using an image (text) to search for text documents (images) with similar semantics.
To address this issue, many approaches have been proposed that learn a common representation for the data of different modalities. We observe that most existing works [Hardoon et al. (2004), Rasiwasia et al. (2010), Sharma et al. (2012), Gong et al. (2013)] focus on learning one couple of mapping matrices to project high-dimensional features from different modalities into a common latent space. By doing this, the correlations of two variables from different modalities can be maximized in the learned common latent subspace. However, only considering pair-wise closeness [Hardoon et al. (2004)] is not sufficient for cross-media retrieval tasks, since multi-modal data with the same semantics should also be grouped together in the common latent subspace. Although [Sharma et al. (2012)] and [Gong et al. (2013)] have proposed to use supervised information to cluster the multi-modal data with the same semantics, learning one couple of projections may only lead to compromised results for each retrieval task.
In this paper, we propose a modality-dependent cross-media retrieval (MDCR) method, which recommends different treatments for different retrieval tasks, i.e., I2T and T2I. Specifically, MDCR is a task-specific method, which learns two couples of projections for different retrieval tasks. The proposed method is illustrated in Fig. 2. Fig. 2(a) and Fig. 2(c) are two linear regression operations from the image and the text feature space to the semantic space, respectively. By doing this, multi-modal data with the same semantics can be grouped together in the common latent subspace. Fig. 2(b) is a correlation analysis operation to keep pair-wise closeness of multi-modal data in the common space. We combine Fig. 2(a) and Fig. 2(b) to learn a couple of projections for I2T, and a different couple of projections for T2I is jointly optimized by Fig. 2(b) and Fig. 2(c). The reason why we learn two couples of projections rather than one couple for different retrieval tasks can be explained as follows. For I2T, we argue that the accurate representation of the query (i.e., the image) in the semantic space is more important than that of the text to be retrieved. If the semantics of the query is misjudged, it will be even harder to retrieve the relevant text. Therefore, only the linear regression term from image feature to semantic label vector and the correlation analysis term are considered for optimizing the mapping matrices for I2T. For T2I, the same reasoning applies with the roles of the two modalities exchanged: only the linear regression term from text feature to semantic label vector and the correlation analysis term are considered. The main contributions of this work are listed as follows:
We propose a modality-dependent cross-media retrieval method, which projects data of different modalities into a common space so that similarity measurement such as Euclidean distance could be applied for cross-media retrieval.
To better validate the effectiveness of our proposed MDCR, we compare it with other state-of-the-art methods based on more powerful feature representations. In particular, with the 4,096 dimensional CNN visual feature and the 100 dimensional LDA textual feature, the mAP of the proposed method reaches 41.5%, which is, as far as we know, a new state-of-the-art performance on the Wikipedia dataset.
Based on the INRIA-Websearch dataset [Krapac et al. (2010)], we construct a new dataset for cross-media retrieval evaluation. In addition, all the features utilized in this paper are publicly available (https://sites.google.com/site/yunchaosite/mdcr).
The remainder of this paper is organized as follows. We briefly review the related work of cross-media retrieval in Section 2. In Section 3, the proposed modality-dependent cross-media retrieval method is described in detail. Then in Section 4, experimental results are reported and analyzed. Finally, Section 5 presents the conclusions.
During the past few years, numerous methods have been proposed to address cross-media retrieval. Some works [Hardoon et al. (2004), Tenenbaum and Freeman (2000), Rosipal and Krämer (2006), Yang et al. (2008), Sharma and Jacobs (2011), Hwang and Grauman (2010), Rasiwasia et al. (2010), Sharma et al. (2012), Gong et al. (2013), Wei et al. (2014), Zhang et al. (2014)] try to learn an optimal common latent subspace for multi-modal data. Methods of this kind project representations of multiple modalities into an isomorphic space, such that similarity measurement can be directly applied between multi-modal data. Two popular approaches, Canonical Correlation Analysis (CCA) [Hardoon et al. (2004)] and Partial Least Squares (PLS) [Rosipal and Krämer (2006), Sharma and Jacobs (2011)], are usually employed to find a couple of mappings that maximize the correlations between two variables. Based on CCA, a number of successful algorithms have been developed for cross-media retrieval tasks [Rasiwasia et al. (2010), Hwang and Grauman (2010), Sharma et al. (2012), Gong et al. (2013)]. The work [Rasiwasia et al. (2010)] investigated the cross-media retrieval problem in terms of a correlation hypothesis and an abstraction hypothesis. Based on the isomorphic feature space obtained from CCA, a multi-class logistic regression is applied to generate a common semantic space for cross-media retrieval tasks. In [Hwang and Grauman (2010)], Hwang and Grauman used KCCA to develop a cross-media retrieval method by modeling the correlation between visual features and textual features. The work [Sharma et al. (2012)] presented a generic framework for multi-modal feature extraction techniques, called Generalized Multiview Analysis (GMA). More recently, the work [Gong et al. (2013)] proposed a three-view CCA model by introducing a semantic view to produce a better separation for multi-modal data of different classes in the learned latent subspace.
To address the problem of prohibitively expensive nearest neighbor search, some hashing-based approaches [Kumar and Udupa (2011), Wu et al. (2014)] to large scale similarity search have drawn much interest from the cross-media retrieval community. In particular, [Kumar and Udupa (2011)] proposed a cross view hashing method to generate hash codes by minimizing the distance of hash codes for similar data and maximizing the distance for dissimilar data. Recently, [Wu et al. (2014)] proposed a sparse multi-modal hashing method, which can obtain sparse codes for the data across different modalities via joint multi-modal dictionary learning, to address cross-modal retrieval.
Besides, with the development of deep learning, some deep models [Frome et al. (2013), Wang et al. (2014), Lu et al. (2014), Zhuang et al. (2014)] have also been proposed to address cross-media problems. Specifically, [Frome et al. (2013)] presented a deep visual-semantic embedding model to identify visual objects using both labeled image data and semantic information obtained from unannotated text documents. [Wang et al. (2014)] proposed an effective mapping mechanism based on stacked auto-encoders, which can capture both intra-modal and inter-modal semantic relationships of multi-modal data from heterogeneous sources.
Beyond the above mentioned models, some other works [Yang et al. (2009), Yang et al. (2010), Yang et al. (2012), Wu et al. (2013), Zhai et al. (2013), Kang et al. (2014)] have also been proposed to address cross-media problems. In particular, [Wu et al. (2013)] presented a bi-directional cross-media semantic representation model by optimizing the bi-directional list-wise ranking loss with a latent space embedding. In [Zhai et al. (2013)], both the intra-media and the inter-media correlation are explored for cross-media retrieval. Most recently, [Kang et al. (2014)] presented a heterogeneous similarity learning approach based on metric learning for cross-media retrieval. With the convolutional neural network (CNN) visual feature, some new state-of-the-art cross-media retrieval results have been achieved in [Kang et al. (2014)].
In this section, we detail the proposed supervised cross-media retrieval method, which we call modality-dependent cross-media retrieval (MDCR). Each pair of image and text in the training set is accompanied with semantic information (e.g., class labels). Different from [Gong et al. (2013)] which incorporates the semantic information as a third view, in this paper, semantic information is employed to determine a common latent space with a fixed dimension where samples with the same label can be clustered.
Suppose we are given a dataset of $n$ data instances, i.e., $G = \{(x_i, t_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^{p}$ and $t_i \in \mathbb{R}^{q}$ are original low-level features of image and text document, respectively. Let $X = [x_1, \ldots, x_n]^{T} \in \mathbb{R}^{n \times p}$ be the feature matrix of image data, and $T = [t_1, \ldots, t_n]^{T} \in \mathbb{R}^{n \times q}$ be the feature matrix of text data. Assume that there are $c$ classes in $G$. $S = [s_1, \ldots, s_n]^{T} \in \mathbb{R}^{n \times c}$ is the semantic matrix with the $i$th row being the semantic vector corresponding to $x_i$ and $t_i$. In particular, we set the $j$th element of $s_i$ as 1, if $x_i$ and $t_i$ belong to the $j$th class.
Definition 1: The cross-media retrieval problem is to learn two optimal mapping matrices $V \in \mathbb{R}^{c \times p}$ and $W \in \mathbb{R}^{c \times q}$ from the multi-modal dataset $G$, which can be formally formulated into the following optimization framework:
$$\min_{V, W} f(V, W) = C(V, W) + L(V, W) + R(V, W) \qquad (1)$$
where $f$ is the objective function consisting of three terms. In particular, $C(V, W)$ is a correlation analysis term used to keep pair-wise closeness of multi-modal data in the common latent subspace. $L(V, W)$ is a linear regression term from one modal feature space (image or text) to the semantic space, used to centralize the multi-modal data with the same semantics in the common latent subspace. $R(V, W)$ is the regularization term to control the complexity of the mapping matrices $V$ and $W$.
In the following subsections, we will detail the two algorithms for I2T and T2I based on the optimization framework Eq.(1).
This section addresses the cross-media retrieval problem of using an image to retrieve its related text documents. Denote the two optimal mapping matrices for images and text as $V_1$ and $W_1$, respectively. Based on the optimization framework Eq.(1), the objective function of I2T is defined as follows:
$$\min_{V_1, W_1} f(V_1, W_1) = \lambda \|X V_1^{T} - T W_1^{T}\|_F^2 + (1 - \lambda) \|X V_1^{T} - S\|_F^2 + R(V_1, W_1) \qquad (2)$$
where $\lambda$ is a tradeoff parameter to balance the importance of the correlation analysis term and the linear regression term, $\|\cdot\|_F$ denotes the Frobenius norm of the matrix, and $R(V_1, W_1)$ is the regularization function used to regularize the mapping matrices. In this paper, the regularization function is defined as:
$$R(V_1, W_1) = \eta_1 \|V_1\|_F^2 + \eta_2 \|W_1\|_F^2$$
where $\eta_1$ and $\eta_2$ are nonnegative parameters to balance these two regularization terms.
This section addresses the cross-media retrieval problem of using text to retrieve its related images. Different from the objective function of I2T, the linear regression term for T2I is a regression operation from the textual space to the semantic space. Denote the two optimal mapping matrices for images and text in T2I as $V_2$ and $W_2$, respectively. Based on the optimization framework Eq.(1), the objective function of T2I is defined as follows:
$$\min_{V_2, W_2} f(V_2, W_2) = \lambda \|X V_2^{T} - T W_2^{T}\|_F^2 + (1 - \lambda) \|T W_2^{T} - S\|_F^2 + R(V_2, W_2) \qquad (3)$$
where the setting of the tradeoff parameter $\lambda$ and the regularization function $R(V_2, W_2)$ are consistent with the setting presented in Section 3.1.
The optimization problems for I2T and T2I are unconstrained optimization problems with respect to two matrices. We treat both Eq.(2) and Eq.(3) as non-convex optimization problems, which may have many local optimal solutions. For a non-convex problem, we usually design algorithms to seek stationary points. We note that Eq.(2) is convex with respect to either $V_1$ or $W_1$ while fixing the other. Similarly, Eq.(3) is also convex with respect to either $V_2$ or $W_2$ while fixing the other. Specifically, by fixing $V_1$ ($V_2$) or $W_1$ ($W_2$), the minimization over the other can be carried out with the gradient descent method.
A common way to solve this kind of optimization problem is to alternate the updates of the two matrices until the result converges. Algorithm 1 summarizes the optimization procedure of the proposed MDCR method for I2T, which can be easily extended to T2I.
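The alternating procedure can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' Algorithm 1: it assumes the I2T objective has the form λ‖XVᵀ − TWᵀ‖² + (1 − λ)‖XVᵀ − S‖² + η₁‖V‖² + η₂‖W‖² (Frobenius norms, with X, T and S the image feature, text feature and semantic matrices), takes one gradient step per matrix in each round, and uses step sizes derived from the Lipschitz constants of the two partial gradients. The function names, the default values of `lam`, `eta1`, `eta2` and `iters`, and the synthetic data are all hypothetical.

```python
import numpy as np


def i2t_objective(X, T, S, V, W, lam, eta1, eta2):
    """Assumed I2T objective: correlation + image-to-semantic regression + ridge penalty."""
    corr = np.linalg.norm(X @ V.T - T @ W.T) ** 2      # pair-wise closeness term
    regress = np.linalg.norm(X @ V.T - S) ** 2         # image -> semantic regression term
    penalty = eta1 * np.linalg.norm(V) ** 2 + eta2 * np.linalg.norm(W) ** 2
    return lam * corr + (1.0 - lam) * regress + penalty


def fit_i2t(X, T, S, lam=0.5, eta1=0.1, eta2=0.1, iters=200, seed=0):
    """Alternate gradient steps on V (W fixed) and on W (V fixed)."""
    rng = np.random.default_rng(seed)
    c = S.shape[1]
    V = 0.01 * rng.standard_normal((c, X.shape[1]))
    W = 0.01 * rng.standard_normal((c, T.shape[1]))
    # Step sizes 1/L, where L bounds the curvature of each sub-problem.
    step_v = 1.0 / (2.0 * (np.linalg.norm(X, 2) ** 2 + eta1))
    step_w = 1.0 / (2.0 * (lam * np.linalg.norm(T, 2) ** 2 + eta2))
    history = [i2t_objective(X, T, S, V, W, lam, eta1, eta2)]
    for _ in range(iters):
        # Update V with W fixed.
        residual_corr = X @ V.T - T @ W.T
        residual_sem = X @ V.T - S
        grad_v = 2.0 * (lam * residual_corr.T @ X
                        + (1.0 - lam) * residual_sem.T @ X
                        + eta1 * V)
        V = V - step_v * grad_v
        # Update W with the new V fixed.
        residual_corr = X @ V.T - T @ W.T
        grad_w = 2.0 * (-lam * residual_corr.T @ T + eta2 * W)
        W = W - step_w * grad_w
        history.append(i2t_objective(X, T, S, V, W, lam, eta1, eta2))
    return V, W, history


# Synthetic toy data: 60 pairs, 8-d "image" and 5-d "text" features, 3 classes.
rng = np.random.default_rng(1)
n, p, q, c = 60, 8, 5, 3
labels = rng.integers(0, c, size=n)
S = np.eye(c)[labels]
X = S @ rng.standard_normal((c, p)) + 0.1 * rng.standard_normal((n, p))
T = S @ rng.standard_normal((c, q)) + 0.1 * rng.standard_normal((n, q))
V, W, history = fit_i2t(X, T, S)
```

Under the same assumptions, a T2I variant would replace the regression residual `X @ V.T - S` with `T @ W.T - S`, which moves the corresponding gradient contribution from `grad_v` to `grad_w`.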
To evaluate the proposed MDCR algorithm, we systematically compare it with other state-of-the-art methods on three datasets, i.e., Wikipedia [Rasiwasia et al. (2010)], Pascal Sentence [Rashtchian et al. (2010)] and a subset of INRIA-Websearch [Krapac et al. (2010)].
Wikipedia (http://www.svcl.ucsd.edu/projects/crossmodal/): This dataset contains 2,866 image-text pairs in total, from 10 categories. The whole dataset is randomly split into a training set and a test set with 2,173 and 693 pairs, respectively. We utilize the publicly available features provided by [Rasiwasia et al. (2010)], i.e., 128 dimensional SIFT BoVW for images and 10 dimensional LDA for text, to compare directly with existing results. Besides, we also present the cross-media retrieval results based on the 4,096 dimensional CNN visual features and the 100 dimensional Latent Dirichlet Allocation (LDA) [Blei et al. (2003)] textual features (we first obtain the textual feature vector based on 500 tokens and then the LDA model is used to compute the probability of each document under 100 topics). The CNN model is pre-trained on ImageNet, and we utilize the outputs from the second fully-connected layer as the CNN visual feature in this paper; for more details, please refer to [Krizhevsky et al. (2012)].
Pascal Sentence (http://vision.cs.uiuc.edu/pascal-sentences/): This dataset contains 1,000 pairs of image and text descriptions from 20 categories (50 for each category). We randomly select 30 pairs from each category as the training set and the rest are taken as the testing set. We utilize the 4,096 dimensional CNN visual feature for image representation. For textual features, we first extract the feature vector based on the 300 most frequent tokens (with stop words removed) and then utilize LDA to compute the probability of each document under 100 topics. The 100 dimensional probability vector is used for textual representation.
INRIA-Websearch: This dataset contains 71,478 pairs of image and text annotations from 353 categories. We remove those pairs which are marked as irrelevant, and select those pairs that belong to any one of the 100 largest categories. Then, we get a subset of 14,698 pairs for evaluation. We randomly select 70% pairs from each category as the training set (10,332 pairs), and the rest are treated as the testing set (4,366 pairs). We utilize the 4,096 dimensional CNN visual feature for image representation. For textual features, we firstly obtain the feature vector based on 25,000 most frequent tokens (with stop words removed) and then employ the LDA to compute the probability of each document under 1,000 topics.
For semantic representation, the ground-truth labels of each dataset are employed to construct semantic vectors (10 dimensions for Wikipedia dataset, 20 dimensions for Pascal Sentence dataset, and 100 dimensions for INRIA-Websearch dataset) for pairs of image and text.
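As an illustration of how such semantic vectors can be built from ground-truth labels, the sketch below encodes integer class indices as one-hot rows, so that a 10-class dataset yields 10 dimensional vectors. It assumes one label per image-text pair and zero-based class indices; the helper name `semantic_matrix` is hypothetical.

```python
import numpy as np


def semantic_matrix(labels, num_classes):
    """Return an (n, num_classes) 0/1 matrix with a single 1 per row."""
    labels = np.asarray(labels, dtype=int)
    S = np.zeros((labels.shape[0], num_classes))
    # Row i gets a 1 in the column of the class that pair i belongs to.
    S[np.arange(labels.shape[0]), labels] = 1.0
    return S


# Four image-text pairs drawn from three classes.
S_demo = semantic_matrix([0, 2, 1, 2], num_classes=3)
```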
In the experiments, Euclidean distance is used to measure the similarity between features in the embedding latent subspace. Retrieval performance is evaluated by mean average precision (mAP), which is one of the standard information retrieval metrics. Specifically, given a set of queries, the average precision (AP) of each query is defined as:
$$AP = \frac{\sum_{k=1}^{n} P(k)\, rel(k)}{\sum_{k=1}^{n} rel(k)}$$
where $n$ is the size of the test dataset. $rel(k) = 1$ if the item at rank $k$ is relevant, and $rel(k) = 0$ otherwise. $P(k)$ denotes the precision of the result ranked at $k$. We obtain the mAP score by averaging AP over all queries.
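The evaluation protocol can be sketched as follows. The code assumes that AP is the sum of the precision values at the ranks of relevant items divided by the number of relevant items, that an item counts as relevant when it shares the class label of the query, and that items are ranked by ascending Euclidean distance in the common space. The function names and the toy data are hypothetical.

```python
import numpy as np


def average_precision(relevant):
    """AP of one ranked list; `relevant` holds 0/1 flags in rank order."""
    relevant = np.asarray(relevant, dtype=float)
    if relevant.sum() == 0:
        return 0.0
    ranks = np.arange(1, relevant.size + 1)
    precision_at_k = np.cumsum(relevant) / ranks
    return float((precision_at_k * relevant).sum() / relevant.sum())


def mean_average_precision(queries, query_labels, gallery, gallery_labels):
    """Rank the gallery by Euclidean distance for each query and average the APs."""
    gallery = np.asarray(gallery, dtype=float)
    gallery_labels = np.asarray(gallery_labels)
    aps = []
    for query, label in zip(np.asarray(queries, dtype=float), query_labels):
        distances = np.linalg.norm(gallery - query, axis=1)
        order = np.argsort(distances, kind="stable")
        aps.append(average_precision(gallery_labels[order] == label))
    return float(np.mean(aps))


# Toy example: projected queries and gallery items in a 1-d common space.
toy_map = mean_average_precision(
    queries=[[0.1], [9.0]], query_labels=[0, 1],
    gallery=[[0.0], [1.0], [10.0]], gallery_labels=[0, 0, 1],
)
```

In the I2T setting the queries would be projected image features and the gallery projected text features; for T2I the roles are exchanged.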
mAP scores for image and text queries on the Wikipedia dataset based on the publicly available features.
Query | PLS | BLM | CCA | SM | SCM | GMMFA | GMLDA | T-V CCA | MDCR |
Image | 0.207 | 0.237 | 0.182 | 0.225 | 0.277 | 0.264 | 0.272 | 0.228 | 0.287 |
Text | 0.192 | 0.144 | 0.209 | 0.223 | 0.226 | 0.231 | 0.232 | 0.205 | 0.225 |
Average | 0.199 | 0.191 | 0.196 | 0.224 | 0.252 | 0.248 | 0.253 | 0.217 | 0.256 |
In the experiments, we mainly compare the proposed MDCR with six algorithms, including CCA, Semantic Matching (SM) [Rasiwasia et al. (2010)], Semantic Correlation Matching (SCM) [Rasiwasia et al. (2010)], Three-View CCA (T-V CCA) [Gong et al. (2013)], Generalized Multiview Marginal Fisher Analysis (GMMFA) [Sharma et al. (2012)] and Generalized Multiview Linear Discriminant Analysis (GMLDA) [Sharma et al. (2012)].
For the Wikipedia dataset, we first compare the proposed MDCR with other methods based on the publicly available features [Rasiwasia et al. (2010)], i.e., 128-SIFT BoVW for images and 10-LDA for text. Two parameters are fixed and the remaining three are set experimentally, with separate settings for the optimization of I2T and T2I. The mAP scores for each method are shown in Table 4.2. It can be seen that our method is more effective than the other common space learning methods. To further validate the necessity of being task-specific for cross-media retrieval, we also evaluate a unified scheme that trains a single couple of mapping matrices by incorporating both linear regression terms in Eq.(2) and Eq.(3) into a single optimization objective. As shown in Table 4.3, the learned subspaces for I2T and T2I cannot be used interchangeably, and the unified scheme only achieves compromised performance for each retrieval task, which falls short of the proposed modality-dependent scheme.
As a very popular dataset, Wikipedia has been employed by many other works for cross-media retrieval evaluation. With a different train/test division, [Wu et al. (2014)] achieved an average mAP score of 0.226 (Image Query: 0.227, Text Query: 0.224) through a sparse hash model and [Wang et al. (2014)] achieved an average mAP score of 0.183 (Image Query: 0.187, Text Query: 0.179) through a deep auto-encoder model. Besides, some other works utilized their own extracted features (both for images and text) for cross-media retrieval evaluation. To further validate the effectiveness of the proposed method, we also compare MDCR with other methods based on more powerful features, i.e., 4,096-CNN for images and 100-LDA for text. Two parameters are fixed and the remaining three are set experimentally for the optimization of I2T and T2I. The comparison results are shown in Table 4.3. It can be seen that all methods achieve better performance with the new feature representations, and that the proposed MDCR still outperforms the others. In addition, we also compare our method with the recent work [Kang et al. (2014)], which utilizes 4,096-CNN for images and 200-LDA for text, in Table 4.3. We can see that the proposed MDCR reaches a new state-of-the-art performance on the Wikipedia dataset. Please refer to Fig. 3 for the comparisons of Precision-Recall curves and Fig. 4 for the mAP score of each category. Figure 5 gives some successful and failure cases of our method. For the image query (the 2nd row), although the query image is categorized into Art, it is dominated by the human figure, i.e., a strong man, which has been captured by our method and thus leads to the failure results shown. For the text query (the 4th row), the document contains many Warfare descriptions such as war, army and troops, which can hardly be related to the label of the query text, i.e., Art.
For the Pascal Sentence dataset and the INRIA-Websearch dataset, all five parameters are set experimentally during the alternating optimization process for I2T and T2I. The comparison results can be found in Table 4.3. It can be seen that our method is more effective than the others even on a more challenging dataset, i.e., INRIA-Websearch (with 14,698 pairs of multi-media data and 100 categories). Please refer to Fig. 3 for the comparisons of Precision-Recall curves for these two datasets and Fig. 4 for the mAP score of each category on the Pascal Sentence dataset.
Comparisons of cross-media retrieval performance (mAP).
Dataset | Query | CCA | SM | SCM | T-V CCA | GMLDA | GMMFA | MDCR |
Wikipedia | Image | 0.226 | 0.403 | 0.351 | 0.310 | 0.372 | 0.371 | 0.435 |
Wikipedia | Text | 0.246 | 0.357 | 0.324 | 0.316 | 0.322 | 0.322 | 0.394 |
Wikipedia | Average | 0.236 | 0.380 | 0.337 | 0.313 | 0.347 | 0.346 | 0.415 |
Pascal Sentence | Image | 0.261 | 0.426 | 0.369 | 0.337 | 0.456 | 0.455 | 0.455 |
Pascal Sentence | Text | 0.356 | 0.467 | 0.375 | 0.439 | 0.448 | 0.447 | 0.471 |
Pascal Sentence | Average | 0.309 | 0.446 | 0.372 | 0.388 | 0.452 | 0.451 | 0.463 |
INRIA-Websearch | Image | 0.274 | 0.439 | 0.403 | 0.329 | 0.505 | 0.492 | 0.520 |
INRIA-Websearch | Text | 0.392 | 0.517 | 0.372 | 0.500 | 0.522 | 0.510 | 0.551 |
INRIA-Websearch | Average | 0.333 | 0.478 | 0.387 | 0.415 | 0.514 | 0.501 | 0.535 |
Cross-media retrieval has long been a challenge. In this paper, we focus on designing an effective cross-media retrieval model for images and text, i.e., using image to search text (I2T) and using text to search images (T2I). Different from traditional common space learning algorithms, we propose a modality-dependent scheme which recommends different treatments for I2T and T2I by learning two couples of projections for different cross-media retrieval tasks. Specifically, by jointly optimizing a correlation term (between images and text) and a linear regression term (from one modal space, i.e., image or text, to the semantic space), two couples of mappings are obtained for different retrieval tasks. Extensive experiments on the Wikipedia dataset, the Pascal Sentence dataset and the INRIA-Websearch dataset show the superiority of the proposed method compared with state-of-the-art methods.
[Blei et al. (2003)] Journal of Machine Learning Research 3 (2003), 993–1022.
[Gong et al. (2013)] International Journal of Computer Vision (2013), 1–24.
[Hwang and Grauman (2010)] Accounting for the Relative Importance of Objects in Image Retrieval. In British Machine Vision Conference. 1–12.
[Krapac et al. (2010)] Improving web-image search results using query-relative classifiers. In IEEE Conference on Computer Vision and Pattern Recognition. 1094–1101. http://lear.inrialpes.fr/pubs/2010/KAVJ10
[Kumar and Udupa (2011)] IJCAI Proceedings-International Joint Conference on Artificial Intelligence, Vol. 22. 1360.
[Rosipal and Krämer (2006)] Subspace, Latent Structure and Feature Selection. Springer, 34–51.
[Sharma and Jacobs (2011)] Bypassing synthesis: PLS for face recognition with pose, low-resolution and sketch. In IEEE Conference on Computer Vision and Pattern Recognition. 593–600.