Multi-modal space structure: a new kind of latent correlation for multi-modal entity resolution

04/21/2018
by   Qibin Zheng, et al.

Multi-modal data is becoming increasingly common in the big data era. Finding semantically equal or similar objects across different data sources (called entity resolution) is one of the core problems of multi-modal tasks. Current models for this problem usually need a large amount of paired data to find the latent correlation between multi-modal data, which is costly to obtain. A new kind of latent correlation is proposed in this article. With this correlation, multi-modal objects can be uniformly represented in a commonly shared space. A classification-based model is designed for the multi-modal entity resolution task. With the proposed method, the demand for training data can be greatly decreased.



I Introduction

Recent years have witnessed a surge of interest in jointly analyzing multi-modal data. As one of the fundamental problems of multi-modal learning, multi-modal coreference resolution aims to find similar objects across different modalities. It is widely used in areas such as cross-modal retrieval, image captioning, and cross-modal entity resolution. The problem is challenging because it requires detailed knowledge of the different modalities and of the correspondence between them[1].

Most current approaches presume that there is a linear or non-linear projection between multi-modal data[2][3]. These methods focus on how to utilize extrinsic supervised information to project one modality onto the other or to map both modalities into a commonly shared space[2][4][5][1]. The performance of these methods heavily depends on the richness of training samples. However, in real-world applications, obtaining matched data from multiple modalities is costly and sometimes impossible[6]. Therefore, sample-insensitive methods for multi-modal coreference resolution are urgently needed.

When training samples are not rich enough, the intrinsic structure information of each modality can be helpful. In this paper, specifically, the space structure of each modality is employed for multi-modal coreference resolution. For two sets of cross-modal objects, although the representations are heterogeneous, their proximity relationships are similar. Taking the coreference resolution between images and textual descriptions as an example, Fig. 1 demonstrates the core idea of our method: given a set of images and a set of corresponding textual descriptions, if two images (such as two images of a bird) are semantically similar to each other, their corresponding descriptions should have similar meanings; otherwise, their meanings should be distinct. In other words, with proper distance metrics, the distances between images are similar to those between the corresponding textual descriptions. Ignoring the difference in scale, although the representation spaces of different modalities are heterogeneous, their space structures are similar. Hence, with a few matched pairs (the red points with the same labels in the figure, called reference points), the space structures of different modalities can be anchored.

Nevertheless, raw feature spaces do not describe the space structure well[7], so higher-level semantic embedding spaces should be employed. Besides, it does not always hold that more matched objects bring higher resolution precision; it is therefore important to select reference points that describe the space structure well.

Fig. 1: The core idea of proposed method

Unlike previous works, we focus on mining the intrinsic correlation between modalities. In this paper, we bring intra-modal space structure information into the resolution of multi-modal coreference relations. The main contributions of this paper are as follows:

1) We investigate the intrinsic correlation between multi-modal data and employ it to find semantically equivalent multi-modal objects. Different from similar works in zero-shot or few-shot learning[8][9], we prove that the correlation between space structures widely exists across multi-modal datasets, rather than learning it from a large amount of training data each time.

2) To describe the space structure better, we utilize high-level features to represent images and text, and design an optimization-based strategy to select representative reference points.

3) Experiments on public datasets are carried out to compare the performance of the proposed method with state-of-the-art methods.

The paper is organized as follows. Section II discusses related work on multi-modal and cross-modal modeling. Section III introduces the proposed uniform representation of multi-modal data and employs it for coreference resolution. Section IV evaluates the proposed method through experiments on public datasets.

II Related Work

According to the way the shared space is built, current methods for multi-modal coreference resolution can be divided into four types:

II-A CCA-Based Methods

To the best of our knowledge, the first well-known cross-modal correlation model is the CCA-based model proposed by Hardoon et al.[10]. It learns a linear projection that maximizes the correlation between the representations of different modalities in the projected space. Inspired by this work, many CCA-based models have been designed for cross-modal analysis[2][11][12][13]. Rasiwasia et al.[2] utilized CCA to learn two maximally correlated subspaces, within which multiclass logistic regression was performed to produce the respective semantic spaces. Mroueh et al.[12] proposed a truncated-SVD based algorithm to efficiently compute the full regularization path of CCA for multi-modal retrieval. Wang et al.[13] developed a hypergraph-based Canonical Correlation Analysis (HCCA) to project low-level features into a shared space where intra-pair and inter-pair correlations are maintained simultaneously; heterogeneous high-order relationships were used to discover the structure of cross-modal data.

II-B Deep Learning Methods

Due to the strong learning ability of deep neural networks, many deep models have been proposed for multi-modal analysis[4][14][15][1][16][17][18]. Ngiam et al.[4] presented an auto-encoder model to learn joint representations for speech audio and videos of lip movements. Srivastava and Salakhutdinov[14] employed restricted Boltzmann machines to learn a shared space between data of different modalities. Frome et al.[16] proposed a deep visual-semantic embedding (DeViSE) model to identify visual objects using information from labeled images and unannotated text. Andrew et al.[15] introduced Deep Canonical Correlation Analysis to learn nonlinear mappings between two views of data such that the corresponding objects are linearly related in the representation space. Jiang et al.[17] proposed a real-time Internet cross-media retrieval method in which deep learning was employed for feature extraction and distance detection. Due to the powerful representation ability of convolutional neural network visual features, Wei et al.[18] coupled them with a deep semantic matching method for cross-modal retrieval.

II-C Topic Model Methods

Topic models are also helpful for uniformly representing multi-modal data, assuming that objects of different modalities share some latent topics. Latent Dirichlet Allocation (LDA) based methods establish the shared space through the joint distribution of multi-modal data and the conditional relations between them[19][20]. Roller and Walde[20] integrated visual features into LDA and presented a multi-modal LDA model to learn joint representations for textual and visual data. Wang et al.[21] proposed a multi-modal mutual topic reinforce model to discover mutually consistent topics.

II-D Hashing-Based Methods

With the rapid growth of data volume, the cost of finding nearest neighbors cannot be ignored. Hashing is a scalable method for approximate nearest-neighbor search[22]. It projects data into a Hamming space, where the neighbor search can be performed efficiently. To improve the efficiency of finding similar multi-modal objects, many cross-modal hashing methods have been proposed[22][23][24][25][26]. Kumar and Udupa[23] proposed a cross-view hashing method that generates hash codes minimizing the Hamming distance between similar objects and maximizing that between dissimilar ones. Zhen et al.[24] used a co-regularization framework to generate binary codes such that the hash codes from different modalities are consistent. Ou et al.[25] constructed a Hamming space for each modality and built the mapping between them with logistic regression. Wu et al.[26] proposed a sparse multi-modal hashing method for cross-modal retrieval.

Besides the methods above, other models have been proposed for multi-modal problems. Chen et al.[27] employed voxel-based multi-modal partial least squares (PLS) to analyze the correlations between FDG PET glucose uptake and MRI gray matter volume scores and apolipoprotein E epsilon 4 gene dose in cognitively normal adults.

Although these methods have achieved great success in multi-modal learning, most of them need a large amount of training data to learn the complex correlation between objects from different modalities. To reduce the demand for training data, Gao et al.[6] proposed an active similarity learning model for cross-modal data. Nevertheless, without extra information, the improvement is limited.

Fig. 2: Overview of our multi-modal coreference resolution model.

III Proposed Approach

To perform coreference resolution with insufficient matched samples, we take the similarity between space structures into consideration. Given two datasets $X$ and $Y$ from two modalities (e.g., text and images), Fig. 2 shows the overview of our proposed model: the red points refer to the matched objects in the two modalities, the solid lines refer to semantic embedding, and the dotted lines denote the distance between two objects. $X$ and $Y$ are first mapped into semantic spaces (the blue and pink areas in the figure) as described in Section III-A:

$S_X = f_X(X)$ (1)

and

$S_Y = f_Y(Y)$ (2)

Second, the to-be-matched objects of $S_X$ and $S_Y$ (the non-red points in Fig. 2) are represented relative to the matched objects (the red ones in the figure) with the space structure based scheme in Section III-B:

$R_X = g(S_X; P_X)$ (3)

and

$R_Y = g(S_Y; P_Y)$ (4)

where $P_X$ and $P_Y$ denote the embeddings of the matched reference objects in each modality. Since $R_X$ and $R_Y$ are positively correlated (proved in Section III-C), isomorphic and homogeneous, they can be linearly projected onto each other:

$\hat{R}_Y = R_X \Lambda + B$ (5)

In the projected space, cross-modal similarities are measured with the cosine distance. With this similarity measurement, multi-modal coreference resolution can be solved by traditional methods.

III-A High-level feature extraction for text and images

In general, different low-level representations are adopted for different modalities, and there is no explicit correspondence between them[28][29]. Similar to the hypothesis of zero-shot learning[8], we believe that the space structures of different modalities are similar, and we try to connect modalities through them. It is crucial that the pair-wise distances over the representation properly reflect semantic similarity.

The distances over raw or low-level features are often far from the real semantic ones. Many methods are available to achieve better representations for various kinds of data, e.g.,[1][5][18][16]. In this paper, we use the following methods for feature extraction of text and images:

III-A1 Smooth inverse frequency for text embedding

Smooth inverse frequency (SIF) is a simple but powerful method for measuring textual similarity[30]. It achieves great performance on a variety of textual similarity tasks and even beats some sophisticated supervised methods, including RNNs and LSTMs. In addition, the method is well suited to domain-adaptive settings. The core idea of SIF is surprisingly simple: compute a weighted average of the word vectors and remove the projection onto the first principal component of the average vectors. For its performance, simplicity, and domain adaptivity, SIF is utilized for text embedding in this work.

As in [30], we apply SIF to 300-dimensional GloVe word vectors trained on the 840-billion-token Common Crawl corpus for textual semantic embedding, i.e., the mapping $f_X$ in Equation (1).
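For concreteness, here is a minimal numpy sketch of the SIF embedding step, assuming GloVe vectors and unigram word frequencies are already loaded; the function name, argument layout, and the smoothing constant a=1e-3 are our own choices rather than the paper's code:

```python
import numpy as np

def sif_embed(sentences, word_vecs, word_freq, a=1e-3):
    """SIF sentence embeddings: weighted word-vector average minus the
    projection onto the first principal component (Arora et al., 2017).

    sentences: list of token lists; word_vecs: dict token -> np.ndarray(d);
    word_freq: dict token -> unigram probability p(w); a: smoothing constant.
    """
    d = len(next(iter(word_vecs.values())))
    emb = np.zeros((len(sentences), d))
    for i, sent in enumerate(sentences):
        tokens = [w for w in sent if w in word_vecs]
        if not tokens:
            continue
        # weight a / (a + p(w)) down-weights frequent words
        weights = np.array([a / (a + word_freq.get(w, 0.0)) for w in tokens])
        vecs = np.stack([word_vecs[w] for w in tokens])
        emb[i] = weights @ vecs / len(tokens)
    # remove the projection onto the first principal direction
    u, _, _ = np.linalg.svd(emb.T @ emb)
    pc = u[:, :1]                      # shape (d, 1)
    return emb - (emb @ pc) @ pc.T
```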

III-A2 Convolutional neural network off-the-shelf features for image embedding

Convolutional neural networks (CNNs) have demonstrated outstanding ability on machine learning tasks for images, such as image classification and object detection. Wei et al. proposed to utilize CNN visual features for cross-modal retrieval[18]; these perform better than commonly used features and establish a baseline for cross-modal retrieval. First, off-the-shelf CNN features[31] are extracted from a CNN model pretrained on ImageNet; then a fine-tuning step is performed for the target dataset.

In this paper, for better semantic representation, we extract CNN features from images with the method above, namely the mapping $f_Y$ in Equation (2).
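The paper extracts fc6 features with the Caffe feature extraction tool (see Section IV-A). Purely as an illustration, the following sketch pulls comparable 4096-dimensional fc6 activations from an ImageNet-pretrained VGG-16 with torchvision (assuming torchvision >= 0.13); the fine-tuning step is omitted:

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Feature extractor that returns fc6-style activations: convolutional trunk,
# average pooling, then the first fully connected layer (+ ReLU) of VGG-16.
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()
fc6 = torch.nn.Sequential(vgg.features, vgg.avgpool,
                          torch.nn.Flatten(), vgg.classifier[:2])

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def image_embedding(path: str) -> torch.Tensor:
    """Map one image file to its 4096-d fc6 feature vector."""
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return fc6(x).squeeze(0)
```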

III-B Representation scheme for the space structure of a single modality

Although multi-modal data are highly heterogeneous, most of them are represented in a Euclidean or Hamming space. In this section we modify the space structure representation scheme of[32] and employ it to represent objects from Euclidean and Hamming spaces.

Fig. 3: The position of the five points in (a) can be described by their distances to the reference points, as in (b).

Qian et al. proposed a space structure based representation scheme for clustering categorical data[32]. The space structure is captured by pair-wise similarities: a categorical object $x_i$ is re-represented as the vector of similarities between $x_i$ and every object $x_j$ in the categorical dataset of size $n$. Qian et al. used all members of the dataset as the reference system to represent categorical data; however, this is unnecessary and inefficient in Euclidean or Hamming spaces.

In a Euclidean space, an object is represented by its distances from the coordinate axes (as in Fig. 3 (a)). However, it can also be positioned by its Euclidean distances from several given points (called reference points). For example, in Fig. 3 (b), the red points can be uniquely positioned by their distances from the black ones. In fact, for a $d$-dimensional object in Euclidean space, $d+1$ affinely independent reference points are enough to locate it. Based on this representation scheme, a set of numeric objects $X = \{x_1, \dots, x_n\}$ can be mapped onto $R_X$ with a reference point set $P = \{p_1, \dots, p_k\}$, where

$R_X(i, j) = d_E(x_i, p_j)$ (6)
$d_E(x_i, p_j) = \lVert x_i - p_j \rVert_2$ (7)

Similarly, a binary object in Hamming space can also be represented by its Hamming distances from the reference points. For a $d$-dimensional object, a set of pairwise-distinct reference points is enough to determine its position in Hamming space. Given a reference point set $Q = \{q_1, \dots, q_k\}$, a set of binary objects $Y = \{y_1, \dots, y_n\}$ in Hamming space is mapped onto $R_Y$, where

$R_Y(i, j) = d_H(y_i, q_j)$ (8)
$d_H(y_i, q_j) = \sum_{l=1}^{d} y_{il} \oplus q_{jl}$ (9)

and $\oplus$ is the xor operator.
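A minimal numpy sketch of this representation scheme for both the Euclidean and the Hamming case; the function names are ours:

```python
import numpy as np

def euclidean_structure(X: np.ndarray, refs: np.ndarray) -> np.ndarray:
    """Represent each row of X by its Euclidean distances to the reference points (Eq. 6-7)."""
    # (n, 1, d) - (1, k, d) -> (n, k, d); the norm over the last axis gives an (n, k) matrix
    return np.linalg.norm(X[:, None, :] - refs[None, :, :], axis=-1)

def hamming_structure(Y: np.ndarray, refs: np.ndarray) -> np.ndarray:
    """Represent each binary row of Y by its Hamming distances to the reference points (Eq. 8-9)."""
    return (Y[:, None, :] != refs[None, :, :]).sum(axis=-1)

# toy usage: five 2-D points described by their distances to two reference points
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 1.0]])
refs = X[:2]                        # take the first two points as references
R = euclidean_structure(X, refs)    # shape (5, 2): one distance column per reference point
```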

Fig. 4: Reference points

Although, intuitively, more matched multi-modal objects should benefit coreference resolution, some pairs of objects may undermine the representational ability of the space structure. As Fig. 4 illustrates, taking the red and green points as reference points is clearly better than taking the black ones. On the one hand, a good reference set should adequately reflect the distribution of the dataset, so the members of the reference set should be quite distinct from each other. On the other hand, for higher discrimination, their distances from the non-reference objects should also be distinct. Thus, we cast the selection of reference points as an optimization problem:

(10)

where $k$ is the number of reference points (user-specified), $\mathrm{Var}(p_i)$ is the variance of the distances of $p_i$ from all non-reference objects, and $\lambda$ is a balance factor.
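Since the exact form of the objective in Equation (10) and its solver are not reproduced here, the following is only a hedged greedy sketch of one plausible instantiation: candidates are scored by their minimum distance to the already chosen references plus a weighted variance of their distances to the other candidates, with `lam` playing the role of the balance factor; the function name and scoring rule are ours:

```python
import numpy as np

def select_references(cands: np.ndarray, k: int, lam: float = 1.0) -> list[int]:
    """Greedily pick k reference points from the matched candidates `cands`."""
    dists = np.linalg.norm(cands[:, None, :] - cands[None, :, :], axis=-1)
    variances = dists.var(axis=1)                    # how varied each candidate's distance profile is
    selected = [int(np.argmax(variances))]           # seed with the most "spread" candidate
    while len(selected) < k:
        remaining = [i for i in range(len(cands)) if i not in selected]
        # distinct from chosen references + discriminative distance profile
        scores = [dists[i, selected].min() + lam * variances[i] for i in remaining]
        selected.append(remaining[int(np.argmax(scores))])
    return selected
```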

Once represented with the scheme above, the similarity between two numerical objects $x_i$ and $x_j$ (or two binary objects $y_i$ and $y_j$) can be directly measured with the Euclidean or cosine distance between their representations, as in[32]:

$d(x_i, x_j) = \lVert R_X(i,\cdot) - R_X(j,\cdot) \rVert_2$ (11)

or

$d(x_i, x_j) = 1 - \dfrac{R_X(i,\cdot) \cdot R_X(j,\cdot)}{\lVert R_X(i,\cdot)\rVert_2 \, \lVert R_X(j,\cdot)\rVert_2}$ (12)

III-C Correlation between the space structures of different modalities

Although the feature spaces of multi-modal data are usually highly heterogeneous, they may share some latent properties. In this part, we prove that the pair-wise similarities of different modalities are correlated with each other and, based on this, that these modalities share similar space structures.

As discussed in Section II-A, CCA-based methods aim to discover the linear correlation between multi-modal data. First, we assume that $X$ and $Y$ are linearly correlated as

$Y = XW$ (13)

where $W$ is a projection matrix. Let $m^X_{ij}$ be the similarity between $x_i$ and $x_j$, and $m^Y_{ij}$ be that between $y_i$ and $y_j$. We have

Theorem 1.

If multi-modal data $X$ and $Y$ are linearly correlated as in Equation (13), there exists a positive correlation between $m^X_{ij}$ and $m^Y_{ij}$.

Proof.

For simplicity, we assume that all entries of $X$ follow a standard normal distribution and are independent of each other. The inner products in Equations (14) and (15) are utilized to compute the similarity matrices:

$m^X_{ij} = x_i x_j^\top$ (14)
$m^Y_{ij} = y_i y_j^\top = x_i W W^\top x_j^\top$ (15)

Since $WW^\top$ is symmetric, it can be diagonalized as

$WW^\top = Q \Lambda Q^\top$ (16)

The Pearson correlation coefficient between $m^X_{ij}$ and $m^Y_{ij}$ is

$\rho(m^X_{ij}, m^Y_{ij}) = \dfrac{\mathrm{Cov}(m^X_{ij}, m^Y_{ij})}{\sqrt{\mathrm{Var}(m^X_{ij})\,\mathrm{Var}(m^Y_{ij})}}$ (17)

where the variances of $m^X_{ij}$ and $m^Y_{ij}$ are

(18)

and

(19)

Since

(20)

and the entries involved are dependent on each other only when their indices coincide, then

(21)

The covariance of $m^X_{ij}$ and $m^Y_{ij}$ is

(22)

Since the entries of $X$ are independent of each other, then

(23)

and therefore

(24)

Finally,

(25)

From Equations (17), (18), (19) and (25), the correlation coefficient is

(26)

Because $WW^\top$ is a non-negative symmetric matrix, Equation (27) holds unless it is a zero matrix, which would obviously be unreasonable.

(27)

In Equation (27), the terms are the principal diagonal elements of $\Lambda$ in Equation (16). In conclusion, there exists a positive correlation between $m^X_{ij}$ and $m^Y_{ij}$. ∎

In Theorem 1 we assume that $Y$ is linearly correlated with $X$; however, the conclusion also holds if the correlation is nonlinear. Similar to [3], we define a nonlinear mapping

$Y = \sigma(XW + B)$ (28)

where $W$ is again a linear projection matrix, $B$ is a bias matrix, and $\sigma$ is the sigmoid function. The correlation between $m^X_{ij}$ and $m^Y_{ij}$ is similar to that in Theorem 1.

Theorem 2.

If multi-modal data $X$ and $Y$ are non-linearly correlated as in Equation (28), there exists a positive correlation between $m^X_{ij}$ and $m^Y_{ij}$.

As in Theorem 1, we still use Equations (14) and (15) to compute $m^X_{ij}$ and $m^Y_{ij}$. Since $XW + B$ and $X$ are correlated, and the sigmoid $\sigma$ is a monotonically increasing function, it is clear that $m^Y_{ij}$ is positively correlated with $m^X_{ij}$.

From Theorem 1 and Theorem 2, whether $X$ and $Y$ are linearly or nonlinearly correlated with each other, $m^X_{ij}$ and $m^Y_{ij}$ are correlated with each other in most cases. This kind of correlation is utilized below to measure the similarity between cross-modal objects.
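As a quick numerical sanity check of Theorem 1 (an illustration of ours, not one of the paper's experiments), the following sketch draws one modality with i.i.d. standard-normal features, builds the second modality through a random linear map as in Equation (13), and measures the Pearson correlation between the two sets of pairwise inner-product similarities; the correlation comes out clearly positive:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_x, d_y = 200, 64, 32

# One modality with i.i.d. standard-normal features, the other a random linear image of it
X = rng.standard_normal((n, d_x))
W = rng.standard_normal((d_x, d_y))
Y = X @ W                              # linear correlation, Eq. (13)

# Pairwise inner-product similarities within each modality, Eq. (14)-(15)
MX, MY = X @ X.T, Y @ Y.T
iu = np.triu_indices(n, k=1)           # off-diagonal pairs only

# Pearson correlation between the two similarity profiles
r = np.corrcoef(MX[iu], MY[iu])[0, 1]
print(f"correlation between pairwise similarities: {r:.3f}")
```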

III-D Cross-modal projection and coreference resolution

Considering the positive correlation between the pair-wise similarities of the two modalities, and since $R_X$ and $R_Y$ are sub-matrices of the corresponding similarity matrices, $R_X$ and $R_Y$ are positively correlated as well. Taking a set of already matched objects as the shared reference set, $R_X$ and $R_Y$ become isomorphic: the corresponding dimensions of $R_X$ and $R_Y$ are consistent in meaning and value. (It should be noted that although the cardinality of the reference set should be larger than the representation dimension to avoid confusion, this is difficult to achieve in practice; in fact, slight confusion can reduce the possibility of missed detection.) Here, we select a reference set from all the matched objects with the strategy in Section III-B.

A linear mapping is performed on each dimension to eliminate the differences in scale and distance metric between $R_X$ and $R_Y$:

$\hat{R}_Y = R_X \Lambda + B$ (29)

where $\Lambda$ is a diagonal matrix and $B$ is a bias matrix in which each column is a constant. Since the projection has very few parameters, $\Lambda$ and $B$ can be learnt with a regression on the shared reference set; the objective is:

(30)

where the first term is the square loss between the projected and the true representations on the reference set, and the last term is a regularization item on the scale and bias parameters.

Although either $R_X$ or $R_Y$ could be taken as the target space, the visual feature space is more appropriate because it suffers less from the hubness problem[33].

Given an object $x_i$, the object $y_j$ that most closely matches it is the one that minimizes the cosine distance between their representations:

$y^{*} = \arg\min_{y_j} d_{\cos}(\hat{r}_{x_i}, r_{y_j})$ (31)
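The following is a minimal sketch of this projection-and-matching step, assuming the structure representations $R_X$ and $R_Y$ have already been computed over a shared reference set; the closed-form per-dimension ridge fit and the function names are our own reading of Equations (29)-(31):

```python
import numpy as np

def fit_projection(RX_ref: np.ndarray, RY_ref: np.ndarray, reg: float = 1e-3):
    """Fit a per-dimension scale and bias (diagonal Lambda, constant-column B, Eq. 29)
    by regularized least squares on the shared reference set."""
    scale = np.empty(RX_ref.shape[1])
    bias = np.empty(RX_ref.shape[1])
    for j in range(RX_ref.shape[1]):
        x, y = RX_ref[:, j], RY_ref[:, j]
        xc, yc = x - x.mean(), y - y.mean()
        scale[j] = (xc @ yc) / (xc @ xc + reg)      # ridge-regularized slope
        bias[j] = y.mean() - scale[j] * x.mean()
    return scale, bias

def match(RX: np.ndarray, RY: np.ndarray, scale: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Project RX into the space of RY and return, for each row of RX,
    the index of the nearest row of RY under cosine distance (Eq. 31)."""
    proj = RX * scale + bias
    a = proj / np.linalg.norm(proj, axis=1, keepdims=True)
    b = RY / np.linalg.norm(RY, axis=1, keepdims=True)
    return np.argmax(a @ b.T, axis=1)               # max cosine similarity = min cosine distance
```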

IV Experiments

IV-A Datasets

We employ the following datasets to evaluate the performance of the proposed method: Wikipedia[2] and Pascal-sentences[34].

Wikipedia: This dataset contains 2866 image-text pairs from ten categories. Each pair of image and text is extracted from Wikipedia articles[2]. Instead of the SIFT BoVW visual features and LDA textual features provided by[2], we extract CNN fc6 features (with the Caffe feature extraction tool) and SIF features (Python code available at https://github.com/PrincetonML/SIF) as in Section III-A.

Pascal-sentences: This dataset is a subset of Pascal VOC and contains 1000 pairs of images and corresponding textual descriptions from twenty categories. Feature extraction is the same as for the Wikipedia dataset.

Fig. 5: The mAP scores on the Wikipedia dataset: (a) image query, (b) text query.
Fig. 6: The mAP scores on the Pascal-sentences dataset: (a) image query, (b) text query.

IV-B Evaluation Protocol

Two common methods for cross-modal coreference resolution are compared with our method (Space Structure Matching, SSM):

Correlation Matching (CM)[2][18]: Using canonical correlation analysis (CCA), Rasiwasia et al. proposed to learn a shared space in which different modalities are maximally correlated, and projected cross-modal objects into it.

Semantic Matching (SM)[2][18]: SM represents multi-modal data at more abstract levels, at which they should naturally correspond to each other. Rasiwasia et al. adopted multiclass logistic regression to generate common representations of multi-modal data.

Since our goal is to perform coreference resolution with insufficient training data, we run experiments that evaluate cross-modal retrieval precision with very limited matched objects. A small subset (ranging from 6 to 50 pairs) of the dataset is randomly selected as the training set, and the model trained on this subset is tested on the remaining data.

As in[2][18][6], mean average precision (mAP), a widely used metric in information retrieval[35], is employed to evaluate the performance of these models:

$\mathrm{mAP} = \dfrac{1}{N} \sum_{i=1}^{N} AP(q_i)$ (32)

where $AP(q_i)$ is the average precision of test sample $q_i$. Multi-modal objects with the same label are considered semantically similar. For the coreference resolution of $q_i$, the average precision is computed as:

$AP(q_i) = \dfrac{1}{R} \sum_{r=1}^{n} P(r)\,\mathrm{rel}(r)$ (33)

where $R$ denotes the number of truly similar objects in the target modality, $P(r)$ is the precision of the results ranked up to position $r$, and $\mathrm{rel}(r) = 1$ if the object at rank $r$ is similar and $0$ otherwise. The mAP scores of both query directions and their average are computed; a higher mAP indicates better performance.
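A minimal implementation of this evaluation protocol (the function names are ours):

```python
import numpy as np

def average_precision(ranked_labels: np.ndarray, query_label) -> float:
    """AP of one query given the labels of the retrieved items, best match first (Eq. 33)."""
    rel = (ranked_labels == query_label).astype(float)
    if rel.sum() == 0:
        return 0.0
    precision_at_r = np.cumsum(rel) / np.arange(1, len(rel) + 1)
    return float((precision_at_r * rel).sum() / rel.sum())

def mean_average_precision(sim: np.ndarray, query_labels, target_labels) -> float:
    """mAP over all queries (Eq. 32); sim[i, j] is the cross-modal similarity score."""
    target_labels = np.asarray(target_labels)
    aps = []
    for i, q in enumerate(query_labels):
        order = np.argsort(-sim[i])                 # rank targets by decreasing similarity
        aps.append(average_precision(target_labels[order], q))
    return float(np.mean(aps))
```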

IV-C Performance Evaluation

We report the mAP scores of CM, SM and SSM on the Wikipedia and Pascal-sentences datasets in Fig. 5 and Fig. 6, respectively.

IV-C1 Results on Wikipedia

Fig. 5 reports the mAP scores on the Wikipedia dataset in both retrieval directions: searching for similar text given an image and the reverse. In Fig. 5 (a), the mAP scores of SSM are higher than the others in most cases, and CM's scores are the lowest throughout. The mAP scores of both SSM and SM increase with the number of matched objects, while those of CM are almost unchanged. The reason for the poor performance of CM may be that the training samples are not sufficient for CCA to find a significant correlation between modalities. It should be noted that the mAP of SM increases rapidly, especially for the image query, and exceeds that of SSM when the size of the training set reaches forty.

IV-C2 Results on Pascal-sentences

Fig. 6 shows the results on the Pascal-sentences dataset. The results are quite similar to those on the Wikipedia dataset. In Fig. 6 (a), the mAP scores of SSM are again higher than the others in most cases, and CM's scores remain the lowest. The mAP scores of both SSM and SM increase with the number of matched objects, and the performance gap shows a decreasing trend. Besides, with the same training set, the mAP scores of the text query are higher than those of the image query on Pascal-sentences. The performance of SSM on Pascal-sentences is better than that on Wikipedia, which could be due to a stronger correlation between the space structures of Pascal-sentences.

Overall, the proposed SSM method performs better than the two popular methods for multi-modal retrieval when training samples are not rich enough. However, the performance has not reached our expectations. The reason may be that our method discards the class label information, which is central to the evaluation of the results. There is still much room to improve the method; for example, the way SM uses class label information may help to improve precision.

V Conclusion

For cross-modal issues, abstraction and correlation are two useful tools. In this paper, we have demonstrated the possibility of finding semantically similar objects from different modalities using the correlation between the space structures of different modalities, which can be considered a combination of abstraction and correlation.

First, we proved that the space structures of different modalities are correlated, and we employed this kind of correlation for multi-modal coreference resolution. To capture the space structure of each modality more accurately, high-level features for images and text are employed. On this basis, a semi-supervised method for coreference resolution has been proposed. Different from existing methods, the proposed method mainly utilizes an intrinsic correlation between modalities, so it needs very little training data to learn the correlation. Experiments on two multi-modal datasets have verified that our method outperforms previous methods when training data are insufficient.

References

  • [1] A. Karpathy, A. Joulin, and F. F. Li, “Deep fragment embeddings for bidirectional image sentence mapping,” in International Conference on Neural Information Processing Systems, 2014, Conference Proceedings.
  • [2] N. Rasiwasia, J. C. Pereira, E. Coviello, G. Doyle, G. R. G. Lanckriet, R. Levy, and N. Vasconcelos, “A new approach to cross-modal multimedia retrieval,” in International Conference on Multimedia, Conference Proceedings.
  • [3] M. N. Luo, X. J. Chang, Z. H. Li, L. Q. Nie, A. G. Hauptmann, and Q. H. Zheng, “Simple to complex cross-modal learning to rank,” Computer Vision and Image Understanding, vol. 163, pp. 67–77, 2017.
  • [4] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, “Multimodal deep learning,” in International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July, 2011, Conference Proceedings.
  • [5] R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng, “Grounded compositional semantics for finding and describing images with sentences,” Transactions of the Association for Computational Linguistics, vol. 2, no. 0, pp. 207–218, 2014.
  • [6] N. Gao, S.-J. Huang, Y. Yan, and S. Chen, “Cross modal similarity learning with active queries,” Pattern Recognition, vol. 75, pp. 214–222, 2018.
  • [7] M. Rohrbach, S. Ebert, and B. Schiele, “Transfer learning in a transductive setting,” in International Conference on Neural Information Processing Systems, 2013, Conference Proceedings.
  • [8] R. Socher, M. Ganjoo, C. D. Manning, and A. Y. Ng, “Zero-shot learning through cross-modal transfer,” in neural information processing systems, 2017, Conference Proceedings, pp. 935–943.
  • [9] Y. Guo, G. Ding, X. Jin, and J. Wang, “Transductive zero-shot recognition via shared model space learning,” in AAAI, 2016, Conference Proceedings.
  • [10] D. R. Hardoon, S. Szedmak, and J. Shawetaylor, “Canonical correlation analysis: An overview with application to learning methods,” Neural Computation, vol. 16, no. 12, pp. 2639–2664, 2004.
  • [11] A. Sharma, “Generalized multiview analysis: A discriminative latent space,” in IEEE Conference on Computer Vision and Pattern Recognition, 2012, Conference Proceedings.
  • [12] Y. Mroueh, E. Marcheret, and V. Goel, “Multimodal retrieval with asymmetrically weighted truncated-svd canonical correlation analysis,” Computer Science, 2015.
  • [13] L. Wang, W. Sun, Z. Zhao, and F. Su, “Modeling intra- and inter-pair correlation via heterogeneous high-order preserving for cross-modal retrieval,” Signal Processing, vol. 131, pp. 249–260, 2017.
  • [14] N. Srivastava and R. Salakhutdinov, “Multimodal learning with deep boltzmann machines,” in International Conference on Neural Information Processing Systems, 2012, Conference Proceedings.
  • [15] G. Andrew, R. Arora, J. A. Bilmes, and K. Livescu, “Deep canonical correlation analysis,” in international conference on machine learning, 2013, Conference Proceedings.
  • [16] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov, “Devise: A deep visual-semantic embedding model,” in neural information processing systems, 2013, Conference Proceedings.
  • [17] B. Jiang, J. Yang, Z. Lv, K. Tian, Q. Meng, and Y. Yan, “Internet cross-media retrieval based on deep learning,” Journal of Visual Communication and Image Representation, vol. 48, pp. 356–366, 2017.
  • [18] Y. Wei, Y. Zhao, C. Lu, S. Wei, L. Liu, Z. Zhu, and S. Yan, “Cross-modal retrieval with cnn visual features: A new baseline,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 47, pp. 449–460, 2017.
  • [19] Y. Jia, M. Salzmann, and T. Darrell, “Learning cross-modality similarity for multinomial data,” in International Conference on Computer Vision, 2011, Conference Proceedings.
  • [20] S. Roller and S. S. I. Walde, “A multimodal lda model integrating textual, cognitive and visual modalities,” in Empirical Methods in Natural Language Processing, 2013, Conference Proceedings.
  • [21] Y. Wang, F. Wu, J. Song, X. Li, and Y. Zhuang, “Multi-modal mutual topic reinforce modeling for cross-media retrieval,” in acm multimedia, 2014, Conference Proceedings.
  • [22] J. Wang, S. Kumar, and S. Chang, “Sequential projection learning for hashing with compact codes,” in international conference on machine learning, 2010, Conference Proceedings.
  • [23] S. Kumar and R. Udupa, “Learning hash functions for cross-view similarity search,” in International Joint Conference on Artificial Intelligence, 2011, Conference Proceedings.
  • [24] Z. Yi and D. Y. Yeung, “Co-regularized hashing for multimodal data,” in International Conference on Neural Information Processing Systems, 2012, Conference Proceedings.
  • [25] M. Ou, P. Cui, F. Wang, J. Wang, W. Zhu, and S. Yang, “Comparing apples to oranges: a scalable solution with heterogeneous hashing,” in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2013, Conference Proceedings.
  • [26] F. Wu, Y. Zhou, Y. Yang, S. Tang, Y. Zhang, and Y. Zhuang, “Sparse multi-modal hashing,” IEEE Transactions on Multimedia, vol. 16, no. 2, pp. 427–439, 2014.
  • [27] K. W. Chen, N. Ayutyanont, J. B. S. Langbaum, A. S. Fleisher, C. Reschke, W. Lee, X. F. Liu, G. E. Alexander, D. Bandy, R. J. Caselli, and E. M. Reiman, “Correlations between fdg pet glucose uptake-mri gray matter volume scores and apolipoprotein e epsilon 4 gene dose in cognitively normal adults: A cross-validation study using voxel-based multi-modal partial least squares,” Neuroimage, vol. 60, no. 4, pp. 2316–2322, 2012.
  • [28] P. J. Costa, E. Coviello, G. Doyle, N. Rasiwasia, G. R. Lanckriet, R. Levy, and N. Vasconcelos, “On the role of correlation and abstraction in cross-modal multimedia retrieval,” IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 36, no. 3, pp. 521–35, 2014.
  • [29] L. Huang and Y. Peng, “Cross-media retrieval by exploiting fine-grained correlation at entity level,” Neurocomputing, vol. 236, pp. 123–133, 2017.
  • [30] S. Arora, Y. Liang, and T. Ma, “A simple but tough-to-beat baseline for sentence embeddings,” in International Conference on Learning Representations, 2017, Conference Proceedings.
  • [31] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, “Decaf: A deep convolutional activation feature for generic visual recognition,” international conference on machine learning, pp. 647–655, 2014.
  • [32] Y. H. Qian, F. J. Li, J. Y. Liang, B. Liu, and C. Y. Dang, “Space structure and clustering of categorical data,” IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 10, pp. 2047–2059, 2016.
  • [33] L. Zhang, T. Xiang, and S. Gong, “Learning a deep embedding model for zero-shot learning,” computer vision and pattern recognition, pp. 3010–3019, 2017.
  • [34] C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier, “Collecting image annotations using amazon’s mechanical turk,” in NAACL Hlt 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, 2010, Conference Proceedings.
  • [35] N. Rasiwasia, P. J. Moreno, and N. Vasconcelos, “Bridging the gap: Query by semantic example,” IEEE Transactions on Multimedia, vol. 9, no. 5, pp. 923–938, 2007.