I Introduction
Recent years have witnessed a surge of interest in jointly analyzing multimodal data. As one of the fundamental problems of multimodal learning, multimodal coreference resolution aims to find semantically equivalent objects across different modalities. It is widely used in areas such as cross-modal retrieval, image captioning, and cross-modal entity resolution. The problem is challenging because it requires detailed knowledge of the different modalities and of the correspondence between them [1].
Most current approaches presume that there is a linear or nonlinear projection between multimodal data [2][3]. These methods focus on utilizing extrinsic supervised information either to project one modality onto the other or to map both modalities into a commonly shared space [2][4][5][1]. The performance of these methods depends heavily on the richness of training samples. However, in real-world applications, obtaining matched data from multiple modalities is costly and sometimes impossible [6]. Therefore, there is an urgent need for sample-insensitive methods for multimodal coreference resolution.
When training samples are not rich enough, the intrinsic structure information of each modality can be helpful. In this paper, specifically, the space structure of each modality is employed for multimodal coreference resolution. For two sets of cross-modal objects, although their representations are heterogeneous, their proximity relationships are similar. Taking the coreference resolution between images and textual descriptions as an example, Fig. 1 demonstrates the core idea of our method: given a set of images and the set of corresponding textual descriptions, if two images (such as the two images of the bird) are semantically similar, their corresponding descriptions should have similar meanings; otherwise, their meanings should be distinct. In other words, with proper distance metrics, the distances between images are similar to those between the corresponding textual descriptions. Ignoring the difference in scale, although the representation spaces of different modalities are heterogeneous, their space structures are similar. Hence, with a few matched pairs (the red points with the same labels in the figure, called reference points), the space structures of the different modalities can be anchored.
Nevertheless, raw feature spaces do not describe the space structure well [7]; higher-level semantic embedding spaces should be employed instead. Besides, it does not always hold that more matched objects bring higher resolution precision, so it is important to select reference points that describe the space structure better.
Unlike previous works, we focus on mining the intrinsic correlation between modalities. In this paper, we bring intra-modal space structure information into the resolution of multimodal coreference relations. The main contributions of this paper are as follows:
1) We investigate the intrinsic correlation between multimodal data and employ it to find semantically equivalent multimodal objects. Different from similar works in zero-shot or few-shot learning [8][9], we prove that the correlation between space structures exists widely among multimodal datasets, rather than learning it anew with a large amount of training data every time.
2) To describe the space structure better, we utilize high-level features to represent images and text, and an optimized strategy to select representative reference points.
3) Experiments on public datasets are carried out to compare the performance of the proposed method with state-of-the-art methods.
The paper is organized as follows. Section II discusses the related work on multimodal and cross-modal modeling. Section III introduces the proposed uniform representation of multimodal data and employs it for coreference resolution. Section IV tests the proposed method through experiments on public datasets.
II Related Work
According to the way the shared space is built, current methods for multimodal coreference resolution can be divided into four types:
II-A CCA-Based Methods
To the best of our knowledge, the first well-known cross-modal correlation model is the CCA-based model proposed by Hardoon et al. [10]. It learns a linear projection that maximizes the correlation between the representations of different modalities in the projected space. Inspired by this work, many CCA-based models have been designed for cross-modal analysis [2][11][12][13]. Rasiwasia et al. [2] utilized CCA to learn two maximally correlated subspaces, and multiclass logistic regression was performed within them to produce the respective semantic spaces. Mroueh et al. [12] proposed a truncated-SVD-based algorithm to compute the full regularization path of CCA for efficient multimodal retrieval. Wang et al. [13] developed a hypergraph-based Canonical Correlation Analysis (HCCA) to project low-level features into a shared space where intra-pair and inter-pair correlations are maintained simultaneously; heterogeneous high-order relationships were used to discover the structure of cross-modal data.

II-B Deep Learning Methods
Due to the strong learning ability of deep neural networks, many deep models have been proposed for multimodal analysis [4][14][15][1][16][17][18]. Ngiam et al. [4] presented an autoencoder model to learn joint representations for speech audio and videos of lip movements. Srivastava and Salakhutdinov [14] employed restricted Boltzmann machines to learn a shared space between data of different modalities. Frome et al. [16] proposed a deep visual-semantic embedding (DeViSE) model to identify visual objects using information from labeled images and unannotated text. Andrew et al. [15] introduced Deep Canonical Correlation Analysis to learn nonlinear mappings between two views of data such that corresponding objects are linearly related in the representation space. Jiang et al. [17] proposed a real-time Internet cross-media retrieval method in which deep learning was employed for feature extraction and distance detection. Owing to the powerful representational ability of convolutional neural network visual features, Wei et al. [18] coupled them with a deep semantic matching method for cross-modal retrieval.

II-C Topic Model Methods
Topic models are also helpful for the uniform representation of multimodal data, under the assumption that objects of different modalities share some latent topics. Latent Dirichlet Allocation (LDA) based methods establish the shared space through the joint distribution of multimodal data and the conditional relations between them [19][20]. Roller and Schulte im Walde [20] integrated visual features into LDA and presented a multimodal LDA model to learn joint representations for textual and visual data. Wang et al. [21] proposed a multimodal mutual topic reinforce model (M³R) to discover mutually consistent topics.

II-D Hashing-Based Methods
Owing to the rapid growth of data volume, the cost of finding nearest neighbors can no longer be dismissed. Hashing is a scalable method for finding approximate nearest neighbors [22]. It projects data into a Hamming space, where neighbor search can be performed efficiently. To improve the efficiency of finding similar multimodal objects, many cross-modal hashing methods have been proposed [22][23][24][25][26]. Kumar and Udupa [23] proposed a cross-view hashing method that generates hash codes minimizing the Hamming distance between similar objects and maximizing that between dissimilar ones. Zhen and Yeung [24] used a co-regularization framework to generate binary codes such that the hash codes from different modalities were consistent. Ou et al. [25] constructed a Hamming space for each modality and built the mapping between them with logistic regression. Wu et al. [26] proposed a sparse multimodal hashing method for cross-modal retrieval.
Besides the methods above, other models have also been proposed for multimodal problems. Chen et al. [27] employed voxel-based multimodal partial least squares (PLS) to analyze the correlations between FDG-PET glucose uptake/MRI gray matter volume scores and apolipoprotein E epsilon 4 gene dose in cognitively normal adults.
Although these methods have achieved great success in multimodal learning, most of them need a large amount of training data to learn the complex correlations between objects from different modalities. To reduce the demand for training data, Gao et al. [6] proposed an active similarity learning model for cross-modal data. Nevertheless, without extra information, the improvement is limited.
III Proposed Approach
To perform coreference resolution with insufficient matched samples, we take the similarity between space structures into consideration. Given two datasets from two modalities, Fig. 2 shows the overview of our proposed model: the red points denote the matched objects in the two modalities, the solid lines denote semantic embedding, and the dotted lines denote the distances between objects. The two datasets are first mapped into semantic spaces (the blue and pink areas in the figure), as described in Section III-A:
(1) 
and
(2) 
Secondly, the to-be-matched objects (the non-red points in Fig. 2) are represented with respect to the matched objects (the red ones in the figure) using the space-structure-based scheme in Section III-B:
(3) 
and
(4) 
Since the two structure representations are positively correlated (proved in Section III-C), isomorphic, and homogeneous, they can be linearly projected onto each other:
(5) 
In the projected space, cross-modal similarities are measured with the cosine distance. With this similarity measurement, multimodal coreference resolution can be solved by traditional methods.
III-A High-level feature extraction for text and images
In general, different low-level representations are adopted for different modalities, and there is no explicit correspondence between them [28][29]. Similar to the hypothesis of zero-shot learning [8], we believe that the space structures of different modalities are similar, and we try to connect the modalities through them. It is thus crucial whether the pairwise distances over a representation reflect semantic similarity properly.
The distances over raw or low-level features are often far from the real semantic ones. Many methods are available to achieve better representations for various kinds of data, e.g., [1][5][18][16]. In this paper, we utilize the following methods for feature extraction from text and images:
III-A1 Smooth inverse frequency for text embedding
Smooth inverse frequency (SIF) is a simple but powerful method for measuring textual similarity [30]. It achieves great performance on a variety of textual similarity tasks, and even beats some sophisticated supervised methods based on RNNs and LSTMs. In addition, the method is well suited to domain-adaptive settings. The core idea of SIF is surprisingly simple: compute a weighted average of the word vectors, then remove the projection onto the first principal component of the averaged vectors. For its performance, simplicity, and domain adaptivity, SIF is utilized for text embedding in this work.
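The two SIF steps above can be sketched in a few lines of numpy. The function name, the weighting parameter `a`, and the toy inputs are our illustrative assumptions, not from the original papers:

```python
import numpy as np

def sif_embeddings(sentences, word_vecs, word_freq, a=1e-3):
    """SIF sketch: weight each word vector by a / (a + p(w)), average per
    sentence, then remove the projection onto the first principal component."""
    # Step 1: frequency-weighted average of word vectors per sentence
    emb = np.array([
        np.mean([word_vecs[w] * (a / (a + word_freq[w])) for w in s], axis=0)
        for s in sentences
    ])
    # Step 2: remove the common (first principal) component
    u = np.linalg.svd(emb, full_matrices=False)[2][0]  # top right-singular vector
    return emb - np.outer(emb @ u, u)
```

After step 2, every sentence embedding is orthogonal to the removed common direction, which is what makes the resulting distances more semantic.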
III-A2 Convolutional neural network off-the-shelf for image embedding
Convolutional neural networks (CNNs) have demonstrated outstanding ability on machine learning tasks over images, such as image classification and object detection. Wei et al. [18] proposed to utilize CNN visual features for cross-modal retrieval; these perform better than commonly used features and provide a baseline for cross-modal retrieval. First, off-the-shelf CNN features [31] are extracted from a CNN model pretrained on ImageNet; then a fine-tuning step is performed on the target dataset. In this paper, for better semantic representation, we extract CNN features from images with the method above, namely, the mapping in Equation (2).
III-B Representation scheme for the space structure of a single modality
Although multimodal data are highly heterogeneous, most of them are represented in Euclidean or Hamming spaces. In this section, we modify the representation scheme of space structure in [32] and employ it to represent objects from Euclidean and Hamming spaces.
Qian et al. proposed a space-structure-based representation scheme for clustering categorical data [32]. The space structure is characterized by pairwise similarities: a categorical object is re-represented by its similarities to every member of the categorical dataset. Qian et al. utilized all the members of the dataset as the reference system to represent categorical data; however, it is unnecessary and inefficient to do so in Euclidean or Hamming spaces.
In Euclidean space, an object is represented by its distances from the coordinate axes (Fig. 3(a)). However, it can also be positioned by its Euclidean distances from several given points (called reference points). For example, in Fig. 3(b), the red points can be uniquely positioned by their distances from the black ones. In fact, for a d-dimensional object in Euclidean space, d + 1 non-colinear reference points are enough to locate it. Based on this representation scheme, a set of numeric objects can be mapped into the structure space with a reference point set, where
(6) 
(7) 
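Since the symbols of Equations (6) and (7) are not reproduced here, the following is only a minimal numpy sketch of the idea: each object is re-represented by its Euclidean distances from the chosen reference points (the function name and array shapes are our assumptions):

```python
import numpy as np

def euclidean_structure(X, R):
    """Re-represent each row of X by its Euclidean distances from the
    reference points in R (one column per reference point)."""
    # result[i, j] = ||X[i] - R[j]||
    diff = X[:, None, :] - R[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))
```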
Similarly, a binary object in Hamming space can also be represented by its Hamming distances from the reference points; enough disparate reference points determine its position in Hamming space uniquely. Given a reference point set, a set of binary objects in Hamming space is mapped into the structure space, where
(8) 
(9) 
⊕ denotes the XOR operator.
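A corresponding sketch for Hamming space, again with assumed names and binary 0/1 row vectors: each object is XORed against the reference points and the differing bits are counted.

```python
import numpy as np

def hamming_structure(B, R):
    """Re-represent each binary row of B by its Hamming distances
    from the binary reference points in R (XOR, then popcount)."""
    return (B[:, None, :] ^ R[None, :, :]).sum(axis=-1)
```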
Although, intuitively, more matched multimodal objects would benefit coreference resolution, some pairs of objects may undermine the representational ability of the space structure. As Fig. 4 illustrates, taking the red and green points as reference points is definitely better than taking the black ones. On the one hand, a good reference set should adequately reflect the distribution of the dataset, so the members of the reference set should be quite distinct from each other. On the other hand, for higher discrimination, their distances from the non-reference objects should be distinct as well. Thus, we cast the selection of reference points as an optimization problem:
(10) 
where the number of reference points is user-specified, the variance term measures how spread out a candidate's distances from all non-reference objects are, and a balance factor weighs the two criteria.

Represented with the scheme above, the similarity between two numeric objects (or two binary objects) can be directly measured with the Euclidean or cosine distance between their structure representations, as in [32]:
(11) 
or
(12) 
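The exact objective of Equation (10) is not reproduced above, so the following greedy heuristic is only a plausible sketch of the two stated criteria: chosen references should be far from each other, and their distances to the remaining objects should vary enough to discriminate. The scoring rule and the balance factor `alpha` are our assumptions:

```python
import numpy as np

def select_references(D, m, alpha=1.0):
    """Greedy reference-point selection sketch. D is the full pairwise
    distance matrix of one modality; m is the number of references."""
    n = D.shape[0]
    chosen = [int(np.argmax(D.var(axis=1)))]          # start with the most discriminative point
    while len(chosen) < m:
        rest = [i for i in range(n) if i not in chosen]
        # score = diversity (min distance to chosen) + alpha * discrimination (distance variance)
        scores = [D[i, chosen].min() + alpha * D[i].var() for i in rest]
        chosen.append(rest[int(np.argmax(scores))])
    return chosen
```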
III-C Correlation between the space structures of different modalities
Although the feature spaces of multimodal data are usually highly heterogeneous, they may share some latent properties. In this part, we prove that the pairwise similarities of different modalities are correlated with each other; consequently, the modalities share similar space structures.
As discussed in Section II-A, CCA-based methods aim to discover linear correlations between multimodal data. First, we assume that the two modalities are linearly correlated as
(13) 
where a projection matrix defines the mapping. Denoting the pairwise similarities within the two modalities accordingly, we have
Theorem 1.
If the multimodal data are linearly correlated as in Equation (13), there exists a positive correlation between their pairwise similarities.
Proof.
For simplicity, we assume that all entries are independent of each other and follow a standard normal distribution. The inner products in Equations (14) and (15) are used to compute the two similarity matrices:
(14)
(15) 
Since the matrix product is symmetric, it can be diagonalized as
(16) 
The Pearson correlation coefficient between the two similarity measures is
(17) 
where the variances of the two similarity measures are
(18) 
and
(19) 
Since
(20) 
and the corresponding terms are mutually dependent only when their indices coincide, then
(21) 
The covariance of and is
(22) 
Since the entries are mutually independent, then
(23) 
then
(24) 
Finally,
(25) 
From Equation (17), (18), (19) and (25), the correlation coefficient is
(26) 
Because the diagonalized matrix is nonnegative and symmetric, Equation (27) holds unless it is a zero matrix, which would obviously be unreasonable.
(27) 
In Equation(27), refer to the the principal diagonal elements of . In conclusion, there exist a positive correlation between and . ∎
In Theorem 1 we assumed a linear correlation between the modalities; however, the conclusion also holds if the correlation is nonlinear. Similar to [3], we define a nonlinear mapping
(28) 
where the mapping again uses a linear projection matrix together with a bias matrix, and the nonlinearity is specified as the sigmoid function. The correlation between the resulting similarity measures is similar to that in Theorem 1.

Theorem 2.
If the multimodal data are nonlinearly correlated as in Equation (28), there still exists a positive correlation between their pairwise similarities.
III-D Cross-modal projection and coreference resolution
Considering the positive correlation between the two similarity matrices, their submatrices over the matched objects are positively correlated as well. Taking a set of already matched objects as the shared reference set, the two structure representations become isomorphic: their corresponding dimensions are consistent in meaning and value.¹ Here, we select a reference set from all the matched objects with the strategy in Section III-B.

¹It should be noted that although the cardinality of the reference set should be larger than the representation dimension to avoid confusion, this is difficult to achieve in practice. In fact, slight confusion can reduce the possibility of missed detection.
A linear mapping is performed on each dimension to eliminate the differences in scale and distance metric between the two structure representations:
(29) 
where the mapping uses a diagonal matrix and a bias matrix in which each column is constant. Since the projection has very few parameters, they can be learnt by a regression on the shared reference set; the objective is:
(30) 
where
where the first term is the squared loss, the second involves the bias vector, and the last is a regularization term.

Although either modality's space could be taken as the target space, the visual feature space is more appropriate because it suffers less from the hubness problem [33].
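A minimal sketch of the per-dimension mapping of Equation (29), fitted by plain least squares on the shared reference set. The regularization term of Equation (30) is omitted here, and all names are our assumptions:

```python
import numpy as np

def fit_dimensionwise_map(S_src, S_tgt):
    """Fit a diagonal scale and per-dimension bias so that
    S_tgt[:, j] ~ w[j] * S_src[:, j] + b[j] on the shared reference set."""
    k = S_src.shape[1]
    w, b = np.empty(k), np.empty(k)
    for j in range(k):
        # 1-D least squares per dimension (slope, intercept)
        w[j], b[j] = np.polyfit(S_src[:, j], S_tgt[:, j], 1)
    return w, b

def apply_map(S, w, b):
    """Project a source structure representation into the target space."""
    return S * w + b
```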
Given an object in one modality, the object in the other modality that most closely matches it is the one that minimizes
(31) 
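A sketch of this matching step with cosine similarity (maximizing cosine similarity is equivalent to minimizing cosine distance); the names are ours:

```python
import numpy as np

def best_match(q, C):
    """Return the index of the candidate row in C with the smallest
    cosine distance to the query structure vector q."""
    sims = (C @ q) / (np.linalg.norm(C, axis=1) * np.linalg.norm(q))
    return int(np.argmax(sims))   # max cosine similarity = min cosine distance
```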
IV Experiments
IV-A Datasets
We employ the following datasets to evaluate the performance of the proposed method: Wikipedia [2] and Pascal Sentences [34].
Wikipedia: This dataset contains 2866 pairs of images and text from ten categories. Each image-text pair is extracted from a Wikipedia article [2]. Instead of the SIFT BoVW visual features and LDA textual features provided by [2], we extract CNN fc6 features² and SIF features³ as in Section III-A.

Pascal Sentences: This dataset is a subset of Pascal VOC and contains 1000 pairs of images and corresponding textual descriptions from twenty categories. Feature extraction is the same as for the Wikipedia dataset.

²CNN features are extracted with the Caffe feature extraction tool.
³Python code can be obtained from: https://github.com/PrincetonML/SIF.
IV-B Evaluation Protocol
Two common methods for cross-modal coreference resolution are compared with our method (Space Structure Matching, SSM):
Correlation Matching (CM) [2][18]: using canonical correlation analysis (CCA), Rasiwasia et al. proposed to learn a shared space in which the different modalities are maximally correlated, and projected cross-modal objects into it.
Semantic Matching (SM) [2][18]: SM represents multimodal data at a more abstract level, where they should be naturally correspondent. Rasiwasia et al. adopted multiclass logistic regression to generate common representations of multimodal data.
Since our objective is to perform coreference resolution with insufficient training data, we conduct experiments evaluating cross-modal retrieval precision with very limited matched objects. A small subset (ranging from 6 to 50 pairs) of each dataset is randomly selected as the training set; the model trained on it is tested on the remaining pairs.
As in [2][18][6], mean average precision (mAP), a widely used metric in information retrieval [35], is employed to evaluate the performance of these models:

$\mathrm{mAP} = \frac{1}{N}\sum_{i=1}^{N} AP(i)$ (32)

where $AP(i)$ is the average precision of test sample $i$ and $N$ is the number of test samples. Multimodal objects with the same labels are considered semantically similar. The average precision of a query can be computed as:

$AP = \frac{1}{R}\sum_{k=1}^{n} P(k)\,\mathrm{rel}(k)$ (33)

where $R$ denotes the number of truly similar objects in the target modality, $P(k)$ is the precision of the results ranked in the top $k$, and $\mathrm{rel}(k) = 1$ if the object at rank $k$ is similar and $0$ otherwise. Both the mAP scores of the bidirectional queries and their average are computed; a higher mAP indicates better performance.
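The two metrics can be computed directly from ranked relevance lists; the helper names below are ours:

```python
def average_precision(ranked_rel):
    """AP of Eq. (33): ranked_rel[k-1] is 1 if the item at rank k is relevant."""
    R = sum(ranked_rel)
    hits, ap = 0, 0.0
    for k, rel in enumerate(ranked_rel, start=1):
        if rel:
            hits += 1
            ap += hits / k          # P(k) accumulated at each relevant rank
    return ap / R if R else 0.0

def mean_average_precision(all_ranked_rel):
    """mAP of Eq. (32): mean of the per-query average precisions."""
    return sum(average_precision(r) for r in all_ranked_rel) / len(all_ranked_rel)
```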
IV-C Performance evaluation
We demonstrate the mAP score of CM, SM and SSM on Pascalsentences and Wikipedia datasets in Fig. 5 and Fig. 6.
IV-C1 Results on Wikipedia
Fig. 5 reports the mAP scores on the Wikipedia dataset for both retrieval directions: searching similar text for images and the reverse. In Fig. 5(a), the mAP scores of SSM are higher than the others in most cases, and CM's scores are the lowest throughout. The mAP scores of both SSM and SM increase with the number of matched objects, while those of CM remain almost unchanged. The reason for CM's poor performance may be that the training samples are not enough for CCA to find significant correlations between the modalities. It should be noted that the mAP of SM increases rapidly, especially for image queries, and exceeds that of SSM once the size of the training set reaches forty.
IV-C2 Results on Pascal Sentences
Fig. 6 shows the results on the Pascal Sentences dataset, which are quite similar to those on Wikipedia. In Fig. 6(a), the mAP scores of SSM are again higher than the others in most cases, and CM's scores remain the lowest. The mAP scores of both SSM and SM increase with the number of matched objects, and the performance gap shows a decreasing trend. Besides, with the same training set, the mAP scores of text queries are higher than those of image queries on Pascal Sentences. The performance of SSM on Pascal Sentences is better than on Wikipedia, which could be due to a stronger correlation between the space structures of Pascal Sentences.
Overall, the proposed SSM method performs better than the two popular multimodal retrieval methods when training samples are scarce. However, the performance has not reached our expectations. The reason may lie in the fact that our method discards the class label information, which is central to the evaluation of results. There is still much room for improvement; for example, the way SM uses class label information may help improve precision.
V Conclusion
For cross-modal issues, abstraction and correlation are two useful tools. In this paper, we have demonstrated the possibility of finding semantically similar objects from different modalities through the correlation between their space structures, which can be considered a combination of abstraction and correlation.
First, we proved that the space structures of different modalities are correlated, and employed this correlation for multimodal coreference resolution. To characterize the space structure of each modality more accurately, high-level features for images and text are employed. Building on these, a semi-supervised method for coreference resolution has been proposed. Different from existing methods, the proposed method mainly utilizes an intrinsic correlation between different modalities, which means it needs very little training data to learn the correlation. Experiments on two multimodal datasets have verified that our method outperforms previous methods when training data are insufficient.
References
 [1] A. Karpathy, A. Joulin, and F. F. Li, “Deep fragment embeddings for bidirectional image sentence mapping,” in International Conference on Neural Information Processing Systems, 2014, Conference Proceedings.
 [2] N. Rasiwasia, J. C. Pereira, E. Coviello, G. Doyle, G. R. G. Lanckriet, R. Levy, and N. Vasconcelos, “A new approach to crossmodal multimedia retrieval,” in International Conference on Multimedia, Conference Proceedings.
 [3] M. N. Luo, X. J. Chang, Z. H. Li, L. Q. Nie, A. G. Hauptmann, and Q. H. Zheng, “Simple to complex crossmodal learning to rank,” Computer Vision and Image Understanding, vol. 163, pp. 67–77, 2017.
 [4] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, “Multimodal deep learning,” in International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28  July, 2011, Conference Proceedings.
 [5] R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng, “Grounded compositional semantics for finding and describing images with sentences,” Transactions of the Association for Computational Linguistics, vol. 2, no. 0, pp. 207–218, 2014.
 [6] N. Gao, S.J. Huang, Y. Yan, and S. Chen, “Cross modal similarity learning with active queries,” Pattern Recognition, vol. 75, pp. 214–222, 2018.

[7] M. Rohrbach, S. Ebert, and B. Schiele, “Transfer learning in a transductive setting,” in International Conference on Neural Information Processing Systems, 2013, Conference Proceedings.
 [8] R. Socher, M. Ganjoo, C. D. Manning, and A. Y. Ng, “Zero-shot learning through cross-modal transfer,” in Neural Information Processing Systems, 2013, Conference Proceedings, pp. 935–943.
 [9] Y. Guo, G. Ding, X. Jin, and J. Wang, “Transductive zeroshot recognition via shared model space learning,” in AAAI, 2016, Conference Proceedings.
 [10] D. R. Hardoon, S. Szedmak, and J. Shawetaylor, “Canonical correlation analysis: An overview with application to learning methods,” Neural Computation, vol. 16, no. 12, pp. 2639–2664, 2004.
 [11] A. Sharma, “Generalized multiview analysis: A discriminative latent space,” in IEEE Conference on Computer Vision and Pattern Recognition, 2012, Conference Proceedings.
 [12] Y. Mroueh, E. Marcheret, and V. Goel, “Multimodal retrieval with asymmetrically weighted truncatedsvd canonical correlation analysis,” Computer Science, 2015.
 [13] L. Wang, W. Sun, Z. Zhao, and F. Su, “Modeling intra and interpair correlation via heterogeneous highorder preserving for crossmodal retrieval,” Signal Processing, vol. 131, pp. 249–260, 2017.
 [14] N. Srivastava and R. Salakhutdinov, “Multimodal learning with deep boltzmann machines,” in International Conference on Neural Information Processing Systems, 2012, Conference Proceedings.
 [15] G. Andrew, R. Arora, J. A. Bilmes, and K. Livescu, “Deep canonical correlation analysis,” in international conference on machine learning, 2013, Conference Proceedings.
 [16] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov, “Devise: A deep visualsemantic embedding model,” in neural information processing systems, 2013, Conference Proceedings.
 [17] B. Jiang, J. Yang, Z. Lv, K. Tian, Q. Meng, and Y. Yan, “Internet crossmedia retrieval based on deep learning,” Journal of Visual Communication and Image Representation, vol. 48, pp. 356–366, 2017.
 [18] Y. Wei, Y. Zhao, C. Lu, S. Wei, L. Liu, Z. Zhu, and S. Yan, “Crossmodal retrieval with cnn visual features: A new baseline,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 47, pp. 449–460, 2017.
 [19] Y. Jia, M. Salzmann, and T. Darrell, “Learning crossmodality similarity for multinomial data,” in International Conference on Computer Vision, 2011, Conference Proceedings.

[20] S. Roller and S. Schulte im Walde, “A multimodal LDA model integrating textual, cognitive and visual modalities,” in Empirical Methods in Natural Language Processing, 2013, Conference Proceedings.
 [21] Y. Wang, F. Wu, J. Song, X. Li, and Y. Zhuang, “Multi-modal mutual topic reinforce modeling for cross-media retrieval,” in ACM Multimedia, 2014, Conference Proceedings.
 [22] J. Wang, S. Kumar, and S. Chang, “Sequential projection learning for hashing with compact codes,” in international conference on machine learning, 2010, Conference Proceedings.

[23] S. Kumar and R. Udupa, “Learning hash functions for cross-view similarity search,” in International Joint Conference on Artificial Intelligence, 2011, Conference Proceedings.
 [24] Y. Zhen and D.-Y. Yeung, “Co-regularized hashing for multimodal data,” in International Conference on Neural Information Processing Systems, 2012, Conference Proceedings.
 [25] M. Ou, P. Cui, F. Wang, J. Wang, W. Zhu, and S. Yang, “Comparing apples to oranges: a scalable solution with heterogeneous hashing,” in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2013, Conference Proceedings.
 [26] F. Wu, Y. Zhou, Y. Yang, S. Tang, Y. Zhang, and Y. Zhuang, “Sparse multimodal hashing,” IEEE Transactions on Multimedia, vol. 16, no. 2, pp. 427–439, 2014.
 [27] K. W. Chen, N. Ayutyanont, J. B. S. Langbaum, A. S. Fleisher, C. Reschke, W. Lee, X. F. Liu, G. E. Alexander, D. Bandy, R. J. Caselli, and E. M. Reiman, “Correlations between fdg pet glucose uptakemri gray matter volume scores and apolipoprotein e epsilon 4 gene dose in cognitively normal adults: A crossvalidation study using voxelbased multimodal partial least squares,” Neuroimage, vol. 60, no. 4, pp. 2316–2322, 2012.
 [28] P. J. Costa, E. Coviello, G. Doyle, N. Rasiwasia, G. R. Lanckriet, R. Levy, and N. Vasconcelos, “On the role of correlation and abstraction in crossmodal multimedia retrieval,” IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 36, no. 3, pp. 521–35, 2014.
 [29] L. Huang and Y. Peng, “Crossmedia retrieval by exploiting finegrained correlation at entity level,” Neurocomputing, vol. 236, pp. 123–133, 2017.
 [30] S. Arora, Y. Liang, and T. Ma, “A simple but toughtobeat baseline for sentence embeddings,” in International Conference on Learning Representations, 2017, Conference Proceedings.
 [31] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, “Decaf: A deep convolutional activation feature for generic visual recognition,” international conference on machine learning, pp. 647–655, 2014.
 [32] Y. H. Qian, F. J. Li, J. Y. Liang, B. Liu, and C. Y. Dang, “Space structure and clustering of categorical data,” IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 10, pp. 2047–2059, 2016.
 [33] L. Zhang, T. Xiang, and S. Gong, “Learning a deep embedding model for zeroshot learning,” computer vision and pattern recognition, pp. 3010–3019, 2017.
 [34] C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier, “Collecting image annotations using amazon’s mechanical turk,” in NAACL Hlt 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, 2010, Conference Proceedings.
 [35] N. Rasiwasia, P. J. Moreno, and N. Vasconcelos, “Bridging the gap: Query by semantic example,” IEEE Transactions on Multimedia, vol. 9, no. 5, pp. 923–938, 2007.