Recent years have witnessed a surge of interest in jointly analyzing multi-modal data. As one of the fundamental problems of multi-modal learning, multi-modal coreference resolution aims to find semantically equivalent objects across different modalities. It is widely used in such areas as cross-modal retrieval, image captioning, and cross-modal entity resolution. The problem is challenging because it requires detailed knowledge of the different modalities and the correspondence between them.
Most current approaches presume that there is a linear or non-linear projection between multi-modal data. These methods focus on how to utilize extrinsic supervised information to project one modality onto the other, or to map both modalities into a commonly shared space. The performance of these methods heavily depends on the richness of training samples. However, in real-world applications, obtaining matched data from multiple modalities is costly and sometimes even impossible. Therefore, it is urgently needed to develop sample-insensitive methods for multi-modal coreference resolution.
When training samples are not rich enough, the intrinsic structure information of each modality can be helpful. In this paper specifically, the space structure of each modality is employed for multi-modal coreference resolution. For two sets of cross-modal objects, although the representations are heterogeneous, their proximity relationships are similar. Taking the coreference resolution between images and textual descriptions as an example, Fig. 1 demonstrates the core idea of our method: given a set of images and one of corresponding textual descriptions, if two images (like the two images of the bird) are semantically similar to each other, their corresponding descriptions should have similar meanings; otherwise, their meanings should be distinct. In other words, with proper distance metrics, the distances between images are similar to those between the corresponding textual descriptions. Ignoring the scale difference, although the representation spaces of different modalities are heterogeneous, their space structures are similar. Hence, with a few matched pairs (the red points with the same labels in the figure, called reference points), the space structures of different modalities can be anchored.
Nevertheless, the raw feature spaces do not describe the space structure well, so higher-level semantic embedding spaces should be employed. Besides, it does not always hold that more matched objects bring higher resolution precision; it is therefore important to select reference points that describe the space structure better.
Unlike previous works, we focus on exploiting the intrinsic correlation between modalities. In this paper, we bring intra-modal space structure information into resolving multi-modal coreference relations. The main contributions of this paper are listed as follows:
1) We investigate the intrinsic correlation between multi-modal data, and employ it to find semantically equivalent multi-modal objects. Different from similar works in zero-shot or few-shot learning, we prove that the correlation between space structures widely exists among multi-modal datasets, rather than learning it with a large amount of training data every time.
2) To describe the space structure better, we utilize high-level features to represent images and text, and an optimized strategy to select representative reference points.
3) Experiments on public datasets are carried out to compare the performance of the proposed method with state-of-the-art methods.
The paper is organized as follows. Section II discusses related work on multi-modal and cross-modal modeling. Section III introduces the proposed uniform representation of multi-modal data and employs it for coreference resolution. Section IV evaluates the proposed method through experiments on public datasets.
II Related Work
According to the way the shared space is built, current methods for multi-modal coreference resolution can be divided into four types:
II-A CCA-Based Methods
To the best of our knowledge, the first well-known cross-modal correlation model may be the CCA-based model proposed by Hardoon et al. It learnt a linear projection to maximize the correlation between the representations of different modalities in the projected space. Inspired by this work, many CCA-based models have been designed for cross-modal analysis. Rasiwasia et al. utilized CCA to learn two maximally correlated subspaces, within which multiclass logistic regression was performed to produce the respective semantic spaces. Mroueh et al. proposed a Truncated-SVD based algorithm to compute the full regularization path of CCA efficiently for multi-modal retrieval. Wang et al. developed a hypergraph-based Canonical Correlation Analysis (HCCA) to project low-level features into a shared space where intra-pair and inter-pair correlations are maintained simultaneously; heterogeneous high-order relationships are used to discover the structure of cross-modal data.
II-B Deep Learning Methods
Due to the strong learning ability of deep neural networks, many deep models have been proposed for multi-modal analysis. Ngiam et al. presented an auto-encoder model to learn joint representations for speech audio and videos of lip movements. Srivastava and Salakhutdinov employed restricted Boltzmann machines to learn a shared space between data of different modalities. Frome et al. proposed a deep visual-semantic embedding (DeViSE) model to identify visual objects using information from labeled images and unannotated text. Andrew et al. introduced Deep Canonical Correlation Analysis to learn nonlinear mappings between two views of data such that the corresponding objects are linearly related in the representation space. Jiang et al. proposed a real-time Internet cross-media retrieval method, in which deep learning was employed for feature extraction and distance detection. Due to the powerful representation ability of convolutional neural network visual features, Wei et al. coupled them with a deep semantic matching method for cross-modal retrieval.
II-C Topic Model Methods
Topic models are also helpful for the uniform representation of multi-modal data, under the assumption that objects of different modalities share some latent topics. Latent Dirichlet Allocation (LDA) based methods establish the shared space through the joint distribution of multi-modal data and the conditional relations between them. Roller and Walde integrated visual features into LDA and presented a multi-modal LDA model to learn joint representations for textual and visual data. Wang et al. proposed the multimodal mutual topic reinforce model (MR) to discover mutually consistent topics.
II-D Hashing-Based Methods
With the rapid growth of data volume, the cost of finding nearest neighbors cannot be dismissed. Hashing is a scalable method for finding approximate nearest neighbors: it projects data into a Hamming space, where neighbor search can be performed efficiently. To improve the efficiency of finding similar multi-modal objects, many cross-modal hashing methods have been proposed. Kumar and Udupa proposed a cross-view hashing method to generate hash codes that minimize the Hamming distance between similar objects and maximize that between dissimilar ones. Zhen et al. used a co-regularization framework to generate binary codes such that the hash codes from different modalities were consistent. Ou et al. constructed a Hamming space for each modality and built the mapping between them with logistic regression. Wu et al. proposed a sparse multi-modal hashing method for cross-modal retrieval.
Besides the methods above, other models have also been proposed for multi-modal problems. Chen et al. employed voxel-based multi-modal partial least squares (PLS) to analyze the correlations between FDG PET glucose uptake-MRI gray matter volume scores and apolipoprotein E epsilon 4 gene dose in cognitively normal adults.
Although these methods have achieved great success in multi-modal learning, most of them need a large amount of training data to learn the complex correlation between objects from different modalities. To reduce the demand for training data, Gao et al. proposed an active similarity learning model for cross-modal data. Nevertheless, without extra information, the improvement is limited.
III Proposed Approach
To perform coreference resolution with insufficient matched samples, we take the similarity between space structures into consideration. Given two datasets X and Y from two modalities, Fig. 2 shows the overview of our proposed model: the red points refer to the matched objects in the two modalities, the solid lines refer to semantic embedding, and the dotted lines denote the distance between two objects. X and Y are first mapped into semantic spaces (the blue and pink areas in the figure) respectively, as described in Section III-A.
Since the resulting semantic spaces are positively correlated (proved in Section III-C), isomorphic, and homogeneous, they can be linearly projected into each other.
In the projected space, cross-modal similarities are measured with the cosine distance. With this similarity measurement, multi-modal coreference resolution can be solved by traditional methods.
III-A High-Level Feature Extraction for Text and Images
In general, different low-level representations are adopted for different modalities, and there is no explicit correspondence between them. Similar to the hypothesis of zero-shot learning, we believe that the space structures of different modalities are similar, and try to connect the modalities through them. It is crucial that the pair-wise distances over the representation reflect the semantic similarity properly.
The distances over raw or low-level features are often far from the real semantic ones, and many works are available to achieve better representations for various kinds of data. In this paper, we utilize the following methods for feature extraction of text and images:
III-A1 Smooth Inverse Frequency for Text Embedding
Smooth inverse frequency (SIF) is a simple but powerful method for measuring textual similarity. It achieves great performance on a variety of textual similarity tasks, and even beats some sophisticated supervised methods including RNNs and LSTMs. In addition, the method is well suited to domain-adaptive settings. The core idea of SIF is surprisingly simple: compute a weighted average of the word vectors and remove the projection onto the first principal component of the average vectors. For its performance, simplicity, and domain adaptivity, SIF is utilized for text embedding in this work.
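The two SIF steps above (frequency-weighted averaging, then removal of the first principal component) can be sketched in a few lines; the word vectors, unigram frequencies, and the smoothing constant `a` are inputs supplied by the caller:

```python
import numpy as np

def sif_embeddings(sentences, word_vecs, word_freq, a=1e-3):
    """SIF sentence embeddings: weighted-average word vectors, then
    remove the projection onto the first principal component."""
    dim = len(next(iter(word_vecs.values())))
    emb = np.zeros((len(sentences), dim))
    for i, sent in enumerate(sentences):
        words = [w for w in sent if w in word_vecs]
        if not words:
            continue
        # Weight a / (a + p(w)) down-weights frequent words.
        weights = np.array([a / (a + word_freq.get(w, 0.0)) for w in words])
        emb[i] = weights @ np.array([word_vecs[w] for w in words]) / len(words)
    # First right singular vector of the embedding matrix = first
    # principal direction; subtract each row's projection onto it.
    u = np.linalg.svd(emb, full_matrices=False)[2][0]
    return emb - np.outer(emb @ u, u)
```

In practice the word vectors come from a pretrained model (e.g. GloVe) and the frequencies from a large corpus.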
III-A2 Convolutional Neural Network Off-the-Shelf for Image Embedding
Convolutional neural networks (CNNs) have demonstrated outstanding ability on machine learning tasks for images, such as image classification and object detection. Wei et al. proposed to utilize CNN visual features for cross-modal retrieval, which perform better than commonly used features and provide a baseline for cross-modal retrieval. First, off-the-shelf CNN features are extracted from a CNN model pretrained on ImageNet; then a fine-tuning step is performed on the target dataset.
In this paper, for better semantic representation, we extract CNN features from images with the method above, namely, the mapping in Equation (2).
III-B Representation Scheme of the Space Structure of a Single Modality
Although multi-modal data are highly heterogeneous, most of them are represented in Euclidean or Hamming space. In this section we modify the representation scheme of space structure proposed by Qian et al., and employ it to represent objects from Euclidean and Hamming spaces.
Qian et al. proposed a space-structure-based representation scheme for clustering categorical data, in which the space structure is captured by pair-wise similarities: a categorical object is re-represented as the vector of its similarities to every object of the categorical dataset. Qian et al. utilized all members of the dataset as the reference system to represent categorical data; however, it is unnecessary and inefficient to do so in Euclidean or Hamming space.
In Euclidean space, an object is represented by its distances from the coordinate axes (as in Fig. 3(a)). However, it can also be positioned by its Euclidean distances from several given points (called reference points). For example, in Fig. 3(b), the red points can be uniquely positioned by their distances from the black ones. In fact, for a d-dimensional object in Euclidean space, d+1 non-colinear reference points are enough to locate it. Based on this representation scheme, a set of numeric objects can be mapped into a new representation with a reference point set, where each dimension of an object's new representation is its Euclidean distance from one reference point.
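The Euclidean re-representation described above is a plain pairwise-distance computation; a minimal sketch:

```python
import numpy as np

def space_structure(X, refs):
    """Re-represent each row of X by its Euclidean distances to the
    reference points: result[i, j] = ||X[i] - refs[j]||."""
    diff = X[:, None, :] - refs[None, :, :]   # broadcast to (n, m, d)
    return np.linalg.norm(diff, axis=2)       # collapse feature axis
```

For instance, the point (3, 4) represented against a single reference point at the origin becomes the 1-dimensional vector (5).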
Similarly, a binary object in Hamming space can also be represented by its Hamming distances from the reference points. For a d-dimensional object, d disparate reference points are enough to determine its position in Hamming space. Given a reference point set, a set of binary objects in Hamming space is mapped into a new representation, where each dimension of an object's new representation is its Hamming distance from one reference point, computed with the XOR operator.
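The Hamming-space counterpart, for 0/1 integer codes, XORs each object with each reference point and counts the differing bits:

```python
import numpy as np

def hamming_structure(Y, refs):
    """Re-represent binary codes Y by their Hamming distances to binary
    reference points: XOR the bit vectors, then count the set bits."""
    return (Y[:, None, :] ^ refs[None, :, :]).sum(axis=2)
```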
Although, intuitively, more matched multi-modal objects would benefit coreference resolution, some pairs of objects may undermine the representing ability of the space structure. As Fig. 4 illustrates, taking the red and green points as reference points is definitely better than taking the black ones. On the one hand, a good reference set should adequately reflect the distribution of the dataset, so the members of the reference set should be quite distinct from each other. On the other hand, for higher discrimination, their distances from the non-reference objects should also be distinct. Thus, we cast the selection of reference points as an optimization problem: maximize the pairwise distances among the reference points plus, for each reference point, a balance factor times the variance of its distances from all non-reference objects, where the number of reference points is user-specified.
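Since the exact objective is not fully reproduced here, the following greedy sketch illustrates one plausible reading of the two criteria: each new reference point maximizes its distance to the already-chosen references plus a balance factor times the variance of its distances to the remaining points.

```python
import numpy as np

def select_references(X, m, lam=1.0):
    """Greedy sketch of reference-point selection (the paper's exact
    optimization is assumed, not reproduced)."""
    n = X.shape[0]
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    refs = [int(dist.sum(axis=1).argmax())]   # start from the most spread-out point
    while len(refs) < m:
        best, best_score = None, -np.inf
        for i in range(n):
            if i in refs:
                continue
            others = [j for j in range(n) if j not in refs and j != i]
            spread = dist[i, others].var() if others else 0.0
            # Distinct from chosen references + discriminative distances.
            score = dist[i, refs].sum() + lam * spread
            if score > best_score:
                best, best_score = i, score
        refs.append(best)
    return refs
```

A greedy pass trades optimality for simplicity; the original formulation would solve the selection jointly.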
Being represented with the scheme above, the similarity between two numeric objects (or two binary objects) can be directly measured with the Euclidean or cosine distance between their new representations.
III-C Correlation Between the Space Structures of Different Modalities
Although the feature spaces of multi-modal data are usually highly heterogeneous, they may share some latent properties. In this part, we prove that the pair-wise similarities of different modalities are correlated with each other; consequently, these modalities share similar space structures.
As discussed in Section II-A, CCA-based methods aim to discover the linear correlation between multi-modal data. Firstly, we assume that X and Y are linearly correlated as

Y = XW, (13)

where W is a projection matrix. Let s^x_ij denote the similarity between x_i and x_j, and s^y_ij that between y_i and y_j. We have the following theorem.

Theorem 1: If multi-modal data X and Y are linearly correlated as in Equation (13), there exists a positive correlation between s^x_ij and s^y_ij.

Proof: For simplicity, we assume all entries of X are subject to a standard normal distribution and are independent of each other. The inner products

s^x_ij = x_i x_j^T, (14)

s^y_ij = y_i y_j^T = x_i WW^T x_j^T (15)

are utilized to compute the similarity matrices S_X and S_Y. Since WW^T is symmetric, it can be diagonalized as

WW^T = Q Λ Q^T,

where Q is orthogonal and Λ = diag(λ_1, ..., λ_d). The Pearson correlation coefficient between s^x_ij and s^y_ij is

ρ = Cov(s^x_ij, s^y_ij) / (σ(s^x_ij) σ(s^y_ij)),

where the variances of s^x_ij and s^y_ij are positive and finite (note that s^x_ij and s^x_i'j' are dependent on each other only if i = i' or j = j'). Since the entries of X are independent of each other, the covariance reduces to

Cov(s^x_ij, s^y_ij) = Σ_k λ_k, (27)

where λ_k refers to the principal diagonal elements of Λ. Because WW^T is a non-negative symmetric matrix, the right-hand side of Equation (27) is positive unless WW^T is a zero matrix, which is obviously unreasonable. In conclusion, there exists a positive correlation between s^x_ij and s^y_ij. ∎
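The claim of Theorem 1 is easy to check numerically; the sample sizes and dimensions below are arbitrary:

```python
import numpy as np

# Empirical check of Theorem 1: if Y = X W with X standard normal,
# the pairwise inner-product similarities of X and Y correlate positively.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 16))
W = rng.standard_normal((16, 16))
Y = X @ W

Sx, Sy = X @ X.T, Y @ Y.T
iu = np.triu_indices_from(Sx, k=1)        # off-diagonal pairs only
rho = np.corrcoef(Sx[iu], Sy[iu])[0, 1]   # positive, as the theorem predicts
```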
Further, we assume that X and Y are non-linearly correlated as

Y = f(XW + B), (28)

where W is also a linear projection matrix, B is a bias matrix, and f is specified as the Sigmoid function. The correlation between s^x_ij and s^y_ij is similar to that in Theorem 1.

Theorem 2: If multi-modal data X and Y are non-linearly correlated as in Equation (28), there exists a positive correlation between s^x_ij and s^y_ij.
III-D Cross-Modal Projection and Coreference Resolution
Considering the positive correlation between S_X and S_Y, since the structure representations X' and Y' are sub-matrices of S_X and S_Y respectively, they are positively correlated with each other, too. Taking a set of already matched objects as the shared reference set R, X' and Y' will be isomorphic: the corresponding dimensions of X' and Y' are consistent in meaning and value. (It should be noted that although the cardinality of the reference set should be larger than the representation dimension to avoid confusion, this is difficult to achieve; in fact, slight confusion can reduce the possibility of missed detection.) Here, we select a reference set from all the matched objects with the strategy in Section III-B.
A linear mapping is performed on each dimension to eliminate the difference in the scales and distance metrics of X' and Y':

Y' = X'A + B,

where A is a diagonal matrix and B is a bias matrix in which each column is a constant. Since the parameters of the projection are quite few, A and B can be learnt with a regression on the shared reference set R, minimizing the square loss between X'A + B and Y' plus a regularization item on A and B.
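Because A is diagonal and each column of B is constant, the regression decomposes into an independent scale-and-shift fit per dimension; the ridge form of the regularization below is an assumption:

```python
import numpy as np

def fit_scale_shift(Xp, Yp, gamma=1e-3):
    """Fit per-dimension scale a_j and shift b_j mapping Xp to Yp by
    ridge-regularized least squares. Xp, Yp: (n_refs, m) structure
    representations of the shared reference objects."""
    a = np.empty(Xp.shape[1])
    b = np.empty(Xp.shape[1])
    for j in range(Xp.shape[1]):
        x, y = Xp[:, j], Yp[:, j]
        xc, yc = x - x.mean(), y - y.mean()
        a[j] = (xc @ yc) / (xc @ xc + gamma)  # ridge-damped slope
        b[j] = y.mean() - a[j] * x.mean()     # shift from the means
    return a, b
```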
Although it is acceptable to take either space as the target, the visual feature space is more appropriate because it suffers less from the hubness problem.
Given an object x'_i, the object in the other modality that most closely matches it is the one that minimizes the distance from the projection of x'_i.
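A sketch of this final matching step, using the cosine similarity discussed earlier (the function names are illustrative):

```python
import numpy as np

def resolve(Xp, Yp, a, b):
    """For each structure vector in Xp, return the index of its nearest
    neighbour in Yp after applying the learnt per-dimension mapping."""
    proj = Xp * a + b                                      # map X' into Y'-space
    proj_n = proj / np.linalg.norm(proj, axis=1, keepdims=True)
    Yp_n = Yp / np.linalg.norm(Yp, axis=1, keepdims=True)
    return (proj_n @ Yp_n.T).argmax(axis=1)                # max cosine similarity
```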
IV Experiments

IV-A Datasets

Wikipedia: This dataset contains 2,866 pairs of images and text from ten categories. Each pair of image and text is extracted from Wikipedia articles. Instead of the SIFT BoVW visual features and LDA textual features provided with the dataset, we extract CNN fc6 features (with the Caffe feature extraction tool) and SIF features (Python code available from https://github.com/PrincetonML/SIF) as in Section III-A.
Pascal-sentences: This dataset is a subset of Pascal VOC, containing 1,000 pairs of images and corresponding textual descriptions from twenty categories. Feature extraction is the same as for the Wikipedia dataset.
IV-B Evaluation Protocol
Two common methods for cross-modal coreference resolution are compared with our method (Space Structure Matching, SSM):
Correlation Matching (CM): With canonical correlation analysis (CCA), Rasiwasia et al. proposed to learn a shared space for different modalities in which they are maximally correlated, and projected cross-modal objects into it.
Semantic Matching (SM): SM represents multi-modal data at more abstract levels, where they should naturally correspond to each other. Rasiwasia et al. adopted multiclass logistic regression to generate common representations of multi-modal data.
Since our objective is to perform coreference resolution with insufficient training data, we conduct experiments to evaluate the cross-modal retrieval precision with very limited matched objects. A small subset (ranging from 6 to 50 pairs) of each dataset is randomly selected as the training set, and the model trained on it is tested on the remaining part.
The mean average precision (mAP) is adopted as the evaluation metric:

mAP = (1/|Q|) Σ_q AP(q),

where AP(q) is the average precision of test sample q. Multi-modal objects with the same labels are considered semantically similar. For the coreference resolution of q, the average precision can be computed as

AP(q) = (1/N) Σ_k P(k) · rel(k),

where N denotes the number of the truly similar objects in the target modality, P(k) is the precision of the results ranked at k, and rel(k) = 1 if the object at rank k is similar and 0 otherwise. Both the mAP scores of bidirectional queries and their average are computed, and a higher mAP indicates better performance of a model.
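The metric can be computed directly from ranked label lists; a minimal sketch:

```python
import numpy as np

def average_precision(ranked_labels, query_label):
    """AP for one query: an item is relevant iff it shares the query's label."""
    rel = np.array([l == query_label for l in ranked_labels], dtype=float)
    if rel.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(rel) / np.arange(1, len(rel) + 1)
    return float((precision_at_k * rel).sum() / rel.sum())

def mean_ap(all_ranked, query_labels):
    """mAP: mean of per-query average precisions."""
    return float(np.mean([average_precision(r, q)
                          for r, q in zip(all_ranked, query_labels)]))
```

For example, the ranking [relevant, irrelevant, relevant] gives AP = (1 + 2/3) / 2 = 5/6.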
IV-C Performance Evaluation
IV-C1 Results on Wikipedia
Fig. 5 reports the mAP scores on the Wikipedia dataset in both retrieval directions: searching similar text for images and the reverse. In Fig. 5(a), the mAP scores of SSM are higher than the others in most cases, and CM's scores are the lowest all the time. The mAP scores of both SSM and SM increase with the number of matched objects, while those of CM remain almost unchanged. The reason for the poor performance of CM may be that the training samples are not enough for CCA to find a significant correlation between modalities. It should be noticed that the mAP of SM increases rapidly, especially for the image query, and exceeds that of SSM when the size of the training data increases to forty.
IV-C2 Results on Pascal-sentences
Fig. 6 shows the results on the Pascal-sentences dataset, which are quite similar to those on the Wikipedia dataset. In Fig. 6(a), the mAP scores of SSM are also higher than the others in most cases, and CM's scores remain the lowest. The mAP scores of both SSM and SM increase with the number of matched objects, and the performance difference demonstrates a decreasing trend. Besides, with the same training set, the mAP scores of the text query are higher than those of the image query on Pascal-sentences. The performance of SSM on Pascal-sentences is better than that on Wikipedia, which could be due to a stronger correlation between the space structures of Pascal-sentences.
Overall, the proposed SSM method performs better than the two popular methods for multi-modal retrieval when training samples are not rich enough. However, the performance has not reached our expectation. The reason may lie in the fact that our method discards the class label information, which is the most important information for results evaluation. There is still much room to improve the method; for example, the way SM uses class label information may be helpful for improving the precision.
V Conclusion

For cross-modal issues, abstraction and correlation are two useful tools. In this paper, we have demonstrated the possibility of finding semantically similar objects from different modalities with the correlation between the space structures of different modalities, which can be considered a combination of abstraction and correlation.
Firstly, we have proved that the space structures of different modalities are correlated, and employed this kind of correlation for multi-modal coreference resolution. To describe the space structure of each modality more accurately, high-level features for images and text are employed. With the above, a semi-supervised method for coreference resolution has been proposed. Different from existing methods, the proposed method mainly utilizes an intrinsic correlation between different modalities, which makes it need very little training data to learn the correlation. The experiments on two multi-modal datasets have verified that our method outperforms previous methods when training data are insufficient.
-  A. Karpathy, A. Joulin, and F. F. Li, “Deep fragment embeddings for bidirectional image sentence mapping,” in International Conference on Neural Information Processing Systems, 2014, Conference Proceedings.
-  N. Rasiwasia, J. C. Pereira, E. Coviello, G. Doyle, G. R. G. Lanckriet, R. Levy, and N. Vasconcelos, “A new approach to cross-modal multimedia retrieval,” in International Conference on Multimedia, Conference Proceedings.
-  M. N. Luo, X. J. Chang, Z. H. Li, L. Q. Nie, A. G. Hauptmann, and Q. H. Zheng, “Simple to complex cross-modal learning to rank,” Computer Vision and Image Understanding, vol. 163, pp. 67–77, 2017.
-  J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, “Multimodal deep learning,” in International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July, 2011, Conference Proceedings.
-  R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng, “Grounded compositional semantics for finding and describing images with sentences,” Transactions of the Association for Computational Linguistics, vol. 2, no. 0, pp. 207–218, 2014.
-  N. Gao, S.-J. Huang, Y. Yan, and S. Chen, “Cross modal similarity learning with active queries,” Pattern Recognition, vol. 75, pp. 214–222, 2018.
-  M. Rohrbach, S. Ebert, and B. Schiele, “Transfer learning in a transductive setting,” in International Conference on Neural Information Processing Systems, 2013, Conference Proceedings.
-  R. Socher, M. Ganjoo, C. D. Manning, and A. Y. Ng, “Zero-shot learning through cross-modal transfer,” in Neural Information Processing Systems, 2013, Conference Proceedings, pp. 935–943.
-  Y. Guo, G. Ding, X. Jin, and J. Wang, “Transductive zero-shot recognition via shared model space learning,” in AAAI, 2016, Conference Proceedings.
-  D. R. Hardoon, S. Szedmak, and J. Shawetaylor, “Canonical correlation analysis: An overview with application to learning methods,” Neural Computation, vol. 16, no. 12, pp. 2639–2664, 2004.
-  A. Sharma, “Generalized multiview analysis: A discriminative latent space,” in IEEE Conference on Computer Vision and Pattern Recognition, 2012, Conference Proceedings.
-  Y. Mroueh, E. Marcheret, and V. Goel, “Multimodal retrieval with asymmetrically weighted truncated-svd canonical correlation analysis,” Computer Science, 2015.
-  L. Wang, W. Sun, Z. Zhao, and F. Su, “Modeling intra- and inter-pair correlation via heterogeneous high-order preserving for cross-modal retrieval,” Signal Processing, vol. 131, pp. 249–260, 2017.
-  N. Srivastava and R. Salakhutdinov, “Multimodal learning with deep boltzmann machines,” in International Conference on Neural Information Processing Systems, 2012, Conference Proceedings.
-  G. Andrew, R. Arora, J. A. Bilmes, and K. Livescu, “Deep canonical correlation analysis,” in International Conference on Machine Learning, 2013, Conference Proceedings.
-  A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov, “Devise: A deep visual-semantic embedding model,” in Neural Information Processing Systems, 2013, Conference Proceedings.
-  B. Jiang, J. Yang, Z. Lv, K. Tian, Q. Meng, and Y. Yan, “Internet cross-media retrieval based on deep learning,” Journal of Visual Communication and Image Representation, vol. 48, pp. 356–366, 2017.
-  Y. Wei, Y. Zhao, C. Lu, S. Wei, L. Liu, Z. Zhu, and S. Yan, “Cross-modal retrieval with cnn visual features: A new baseline,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 47, pp. 449–460, 2017.
-  Y. Jia, M. Salzmann, and T. Darrell, “Learning cross-modality similarity for multinomial data,” in International Conference on Computer Vision, 2011, Conference Proceedings.
-  S. Roller and S. S. I. Walde, “A multimodal LDA model integrating textual, cognitive and visual modalities,” in Empirical Methods in Natural Language Processing, 2013, Conference Proceedings.
-  Y. Wang, F. Wu, J. Song, X. Li, and Y. Zhuang, “Multi-modal mutual topic reinforce modeling for cross-media retrieval,” in ACM Multimedia, 2014, Conference Proceedings.
-  J. Wang, S. Kumar, and S. Chang, “Sequential projection learning for hashing with compact codes,” in International Conference on Machine Learning, 2010, Conference Proceedings.
-  S. Kumar and R. Udupa, “Learning hash functions for cross-view similarity search,” in International Joint Conference on Artificial Intelligence, 2011, Conference Proceedings.
-  Z. Yi and D. Y. Yeung, “Co-regularized hashing for multimodal data,” in International Conference on Neural Information Processing Systems, 2012, Conference Proceedings.
-  M. Ou, P. Cui, F. Wang, J. Wang, W. Zhu, and S. Yang, “Comparing apples to oranges: a scalable solution with heterogeneous hashing,” in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2013, Conference Proceedings.
-  F. Wu, Y. Zhou, Y. Yang, S. Tang, Y. Zhang, and Y. Zhuang, “Sparse multi-modal hashing,” IEEE Transactions on Multimedia, vol. 16, no. 2, pp. 427–439, 2014.
-  K. W. Chen, N. Ayutyanont, J. B. S. Langbaum, A. S. Fleisher, C. Reschke, W. Lee, X. F. Liu, G. E. Alexander, D. Bandy, R. J. Caselli, and E. M. Reiman, “Correlations between fdg pet glucose uptake-mri gray matter volume scores and apolipoprotein e epsilon 4 gene dose in cognitively normal adults: A cross-validation study using voxel-based multi-modal partial least squares,” Neuroimage, vol. 60, no. 4, pp. 2316–2322, 2012.
-  P. J. Costa, E. Coviello, G. Doyle, N. Rasiwasia, G. R. Lanckriet, R. Levy, and N. Vasconcelos, “On the role of correlation and abstraction in cross-modal multimedia retrieval,” IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 36, no. 3, pp. 521–35, 2014.
-  L. Huang and Y. Peng, “Cross-media retrieval by exploiting fine-grained correlation at entity level,” Neurocomputing, vol. 236, pp. 123–133, 2017.
-  S. Arora, Y. Liang, and T. Ma, “A simple but tough-to-beat baseline for sentence embeddings,” in International Conference on Learning Representations, 2017, Conference Proceedings.
-  J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, “Decaf: A deep convolutional activation feature for generic visual recognition,” in International Conference on Machine Learning, pp. 647–655, 2014.
-  Y. H. Qian, F. J. Li, J. Y. Liang, B. Liu, and C. Y. Dang, “Space structure and clustering of categorical data,” IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 10, pp. 2047–2059, 2016.
-  L. Zhang, T. Xiang, and S. Gong, “Learning a deep embedding model for zero-shot learning,” in Computer Vision and Pattern Recognition, pp. 3010–3019, 2017.
-  C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier, “Collecting image annotations using amazon’s mechanical turk,” in NAACL Hlt 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, 2010, Conference Proceedings.
-  N. Rasiwasia, P. J. Moreno, and N. Vasconcelos, “Bridging the gap: Query by semantic example,” IEEE Transactions on Multimedia, vol. 9, no. 5, pp. 923–938, 2007.