1 Introduction
Recently, it was discovered that word representations learned by a recurrent neural network (RNN) language model [Mikolov et al.2013c], as well as by related log-linear models [Mikolov et al.2013b], can capture linguistic regularities in language, which allows analogy questions of the form “Beijing:China as Paris:___” to be solved with simple linear algebra. With this word analogy task, a flurry of subsequent work showed that similar linear structure can also be revealed in representations learned by other methods [Mnih and Kavukcuoglu2013, Pennington et al.2014, Levy and Goldberg2014b].
Besides word representation, document representation is also a fundamental and critical problem in natural language processing. Over the past decades, various methods have been proposed to represent a document as a vector, including Bag of Words (BOW) [Harris1954], Latent Semantic Indexing (LSI) [Deerwester et al.1990], Non-negative Matrix Factorization (NMF) [Lee and Seung1999], and Latent Dirichlet Allocation (LDA) [Blei et al.2003]. Recently, there has been rising enthusiasm for applying neural embedding methods to document representation [Srivastava et al.2013, Le and Mikolov2014].

It is therefore natural to ask whether these learned document representations also exhibit linear structure that allows similar reasoning at the document level. For example, given three articles about naive Bayes, logistic regression, and hidden Markov models, is it possible to find the article about conditional random fields as the solution to the document analogy question “naive Bayes : logistic regression as hidden Markov model : ___” (i.e., document pairs reflecting the generative-discriminative relation between models)? Such reasoning is much more complex semantically and cannot be achieved by simple retrieval or classification based on lexical information. Representations with such linear structure would be useful for many semantic processing applications. For example, they may help controversy search [Dori-Hacohen et al.2015] by discovering, from a few seed pairs, document pairs presenting opposing views on controversial topics, or support non-local corpus navigation and paper recommendation together with word vectors [Dai et al.2014].

For this purpose, we introduce a new document analogy task for evaluating the semantic regularities in document representations. Since it is non-trivial to directly label analogy questions over documents, we leverage existing word/phrase semantic analogy test sets and map the words/phrases in these questions to Wikipedia articles through title matching. In this way, we obtain a large labeled analogy test set over documents. The task is then to test whether different document representations of the Wikipedia articles can find the right answers to these semantic analogy questions.
Based on this test set, we evaluate several existing state-of-the-art document representations and show that neural embedding based models achieve better performance than conventional models. The major contributions of this paper are: 1) the introduction of a new document analogy task with a benchmark dataset for evaluating document representations; and 2) an empirical comparison among state-of-the-art models with preliminary explanations of the results.
2 Measuring Semantic Regularities
Table 1: Details of the document analogy test set.

Relation | Count | Example
---|---|---
capital-common-countries | 506 | beijing : china, paris : france
capital-world | 3991 | bangkok : thailand, cairo : egypt
currency | 88 | europe : euro, india : rupee
city-in-state | 277 | houston : texas, miami : florida
family | 56 | boy : girl, man : woman
newspapers | 20 | chicago : chicago tribune, houston : houston chronicle
ice hockey | 462 | boston : boston bruins, los angeles : los angeles kings
basketball | 306 | chicago : chicago bulls, dallas : dallas mavericks
airlines | 306 | canada : air canada, italy : alitalia
people-companies | 100 | bill gates : microsoft, larry page : google
2.1 A Document Analogy Test Set
We propose to create a document analogy test set so that we can quantitatively evaluate how well different document representations capture semantic regularities. Following the idea of the word analogy task, we build a test set of analogy questions of the form “$d_a$ is to $d_b$ as $d_c$ is to $d_d$”, where $d_a$, $d_b$, $d_c$, and $d_d$ are documents. However, it is not trivial to directly label the relation between two arbitrary documents due to the diversity of their topics. Fortunately, each Wikipedia page is a concise document describing one specific concept, so the relation between two such documents can be explained by the relation between their corresponding concepts. Therefore, we can convert the task of labeling relations between documents into that of labeling relations between concepts (i.e., words or phrases), for which a large labeled data set is already available from [Mikolov et al.2013b].
Based on the idea above, we build a document analogy test set using Wikipedia and existing word and phrase analogy test sets. Specifically, we adopt the publicly available April 2010 dump of Wikipedia (http://nlp.stanford.edu/data/WestburyLab.wikicorp.201004.txt.bz2) [Shaoul and Westbury2010], which has been widely used in previous work [Huang et al.2012, Luong et al.2013, Neelakantan et al.2014]. The corpus contains 3,035,070 articles and about one billion tokens. We then collect all the existing word and phrase analogy test sets and match the words/phrases in the questions to Wikipedia page titles. Note that we do not consider syntactic analogy questions over words, because the relations between documents are usually semantic. After resolving ambiguities in matching, we finally obtain 6,112 analogy questions over Wikipedia documents. Table 1 shows the details of the test set.
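The construction can be summarized by the following sketch of the title-matching step; the function names and the policy of discarding ambiguous titles are illustrative assumptions, not the exact procedure or code used to build the released test set.

```python
def build_title_index(wiki_articles):
    """Map lowercased Wikipedia titles to article ids, dropping ambiguous titles.
    wiki_articles: iterable of (article_id, title) pairs."""
    index, ambiguous = {}, set()
    for article_id, title in wiki_articles:
        key = title.lower()
        if key in index:
            ambiguous.add(key)
        else:
            index[key] = article_id
    return {k: v for k, v in index.items() if k not in ambiguous}

def to_document_question(word_question, title_index):
    """Convert a word/phrase analogy question (a, b, c, d) into a document
    question; keep it only if all four items match unambiguous article titles."""
    ids = [title_index.get(w.lower()) for w in word_question]
    return tuple(ids) if all(i is not None for i in ids) else None
```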
2.2 Analogy Reasoning
In this work, we adopt the same vector offset method [Mikolov et al.2013c] as in the word analogy task for analogy reasoning. To answer a question of the form “$d_a$ is to $d_b$ as $d_c$ is to ___”, we look for the document $d^{*}$ whose vector is closest to $\vec{d}_b - \vec{d}_a + \vec{d}_c$ according to cosine similarity:

$$d^{*} = \operatorname*{arg\,max}_{d \in D} \ \cos\big(\vec{d},\ \vec{d}_b - \vec{d}_a + \vec{d}_c\big) \qquad (1)$$

where $\vec{d}_a$, $\vec{d}_b$, and $\vec{d}_c$ are the normalized document vectors. A question is judged as correctly answered only if $d^{*}$ is exactly the answer document in the evaluation set. The evaluation metric for this task is the percentage of questions answered correctly.
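As a concrete illustration, here is a minimal NumPy sketch of this reasoning step; the function name and the convention of excluding the three question documents from the candidate set are our own assumptions rather than part of the protocol described above.

```python
import numpy as np

def answer_analogy(doc_vectors, a, b, c):
    """Vector offset reasoning (Eq. 1): return the index of the document whose
    vector is most cosine-similar to d_b - d_a + d_c.
    doc_vectors: (N, K) array; a, b, c: row indices of the question documents."""
    # length-normalize all document vectors
    V = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    target = V[b] - V[a] + V[c]
    target /= np.linalg.norm(target)
    scores = V @ target                      # cosine similarities to the target
    scores[[a, b, c]] = -np.inf              # assumed: question documents excluded
    return int(np.argmax(scores))
```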
3 Models
In this section, we briefly summarize the models used in this paper. Before that, we first list the notations.
Let $D = \{d_1, \dots, d_N\}$ denote a corpus of $N$ documents over the word vocabulary $W$. Let $X \in \mathbb{R}^{N \times |W|}$ be the document-word matrix, where entry $x_{ij}$ of $X$ denotes the weight of the $j$-th word $w_j$ in the $i$-th document $d_i$.
Bag of Words (BOW) model treats a document as a bag (multiset) of its words. It represents the $i$-th document as the vector $x_i = (x_{i1}, \dots, x_{i|W|})$, where $x_{ij}$ denotes the weight of the $j$-th word in document $d_i$. The most popular weighting scheme for $x_{ij}$ is TF-IDF [Jones1972]. However, the BOW model suffers from sparsity and the curse of dimensionality because it treats each individual word as a distinct feature.
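For reference, a small sketch of TF-IDF weighted BOW vectors; it uses scikit-learn's TfidfVectorizer for brevity, whereas our experiments use gensim for BOW (Section 4.2), so weighting and normalization details may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF weighted BOW vectors: each row of X is one (sparse) document vector
# of dimension |W|, i.e., the vocabulary size.
docs = ["naive bayes is a generative model",
        "logistic regression is a discriminative model"]
vectorizer = TfidfVectorizer(lowercase=True)
X = vectorizer.fit_transform(docs)           # shape: (num_docs, |W|)
print(X.shape)
```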
Matrix Factorization methods attempt to address the limitations of the BOW model by learning a low-dimensional vector for each document through factorizing the document-word matrix $X$.
Latent Semantic Indexing (LSI) [Deerwester et al.1990] applies truncated Singular Value Decomposition (SVD) to the document-word matrix. Given the SVD $X = U \Sigma V^{\top}$, LSI approximates $X$ by setting all but the $K$ largest singular values in $\Sigma$ to $0$ ($K \ll |W|$):

$$X \approx X_K = U_K \Sigma_K V_K^{\top}.$$

Hence one might take the rows of $U_K \Sigma_K$ as the representations of the documents in the latent space.
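A minimal LSI sketch with scikit-learn's TruncatedSVD; our experiments use gensim for LSI (Section 4.2), so this stand-in only illustrates the factorization itself.

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Stand-in for the document-word matrix X of a real corpus:
# 1000 documents over a 5000-word vocabulary.
X = sparse_random(1000, 5000, density=0.01, random_state=0)

K = 100                                       # number of retained singular values
lsi = TruncatedSVD(n_components=K, random_state=0)
doc_latent = lsi.fit_transform(X)             # (N, K): rows are document vectors
```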
An alternative, Non-negative Matrix Factorization (NMF) [Lee and Seung1999], factorizes $X$ into two non-negative matrices,

$$X \approx U V, \quad U \in \mathbb{R}_{\ge 0}^{N \times K}, \ V \in \mathbb{R}_{\ge 0}^{K \times |W|},$$

where the rows of $U$ can be seen as the representations of the documents. Unlike LSI, whose factors may have negative entries, NMF offers better interpretability thanks to the non-negativity constraint.
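Similarly, a hedged NMF sketch with scikit-learn, the library used for NMF in our experiments (Section 4.2); the solver settings shown here are illustrative only.

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import NMF

# Non-negative stand-in for the document-word matrix X (1000 docs, 5000 words).
X = sparse_random(1000, 5000, density=0.01, random_state=0)

nmf = NMF(n_components=100, init="nndsvd", random_state=0, max_iter=300)
U = nmf.fit_transform(X)                      # (N, K): document representations
V = nmf.components_                           # (K, |W|): non-negative word loadings
```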
Topic Models are also very popular for document representation because of their good interpretability, generalization ability, and extensibility. The most representative work is the Latent Dirichlet Allocation (LDA) model [Blei et al.2003]. It represents each document as a distribution over latent topics, where each topic is characterized by a distribution over words.
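A toy LDA sketch with gensim (the library used in our experiments); the corpus and the number of topics here are purely illustrative.

```python
from gensim import corpora, models

texts = [["naive", "bayes", "generative", "model"],
         ["logistic", "regression", "discriminative", "model"],
         ["hidden", "markov", "model", "sequence"]]
dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(text) for text in texts]

lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=2)
# per-document topic distributions, used as the document representations
doc_topics = [lda.get_document_topics(bow, minimum_probability=0.0)
              for bow in bow_corpus]
```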
Neural Embedding models have attracted much attention in text representation due to their breakthrough in statistical language modeling [Bengio et al.2003].
The Paragraph Vector models were first introduced in [Le and Mikolov2014] for document representation. The Distributed Memory Model of Paragraph Vectors (PV-DM) learns the representation of a document by inserting a document vector into the continuous bag-of-words (CBOW) model [Mikolov et al.2013a]. A simpler model, the Distributed Bag of Words version of Paragraph Vector (PV-DBOW), is obtained by replacing the input word vectors of the Skip-Gram (SG) model with the document vector.
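Although our experiments use our own C++ implementation (Section 4.2), the same two models are available in gensim's Doc2Vec, sketched here for illustration; the gensim 4.x API is assumed and the hyperparameters mirror Section 4.2.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

texts = [["naive", "bayes"], ["logistic", "regression"], ["hidden", "markov", "model"]]
tagged = [TaggedDocument(words=words, tags=[i]) for i, words in enumerate(texts)]

# dm=1 gives PV-DM, dm=0 gives PV-DBOW; negative sampling and window size
# follow the settings used in our experiments.
pv_dbow = Doc2Vec(tagged, vector_size=100, window=10, negative=10, hs=0,
                  dm=0, min_count=1, epochs=20)
doc_vector = pv_dbow.dv[0]                    # learned 100-dimensional vector
```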
Bag of Word Embeddings (BOWE) model represents a document as a linear combination of word vectors, where the word vectors can be obtained with tools like Word2Vec or GloVe. The low-dimensional representations of the documents in BOWE can be written as

$$D_{\text{BOWE}} = X E,$$

where $X$ denotes the BOW representation of the documents and $E \in \mathbb{R}^{|W| \times K}$ is the word embedding matrix.
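A minimal sketch of BOWE, assuming the word embedding matrix has already been trained (e.g., with Word2Vec); the length normalization at the end is our assumption, added to match the cosine-based reasoning of Section 2.2.

```python
import numpy as np

def bowe_vectors(bow_matrix, embedding_matrix):
    """Document vectors as the BOW-weighted sum of word vectors: D = X E,
    with X of shape (N, |W|) and E of shape (|W|, K)."""
    D = bow_matrix @ embedding_matrix
    return D / np.linalg.norm(D, axis=1, keepdims=True)   # length-normalize

# toy example: 2 documents over a 3-word vocabulary, 4-dimensional embeddings
X = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0]])
E = np.random.default_rng(0).normal(size=(3, 4))
doc_vectors = bowe_vectors(X, E)              # shape (2, 4)
```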
4 Experiments
Table 2: Accuracy (%) of 100-dimensional document representations on each subtask of the document analogy test set.

Relation | BOW | LSI | NMF | LDA | PV-DM | PV-DBOW | BOWE
---|---|---|---|---|---|---|---
capital-common-countries | 0.0 | 23.12 | 9.29 | 23.72 | 60.87 | 54.15 | 83.0
capital-world | 0.8 | 9.97 | 5.06 | 9.15 | 43.62 | 42.65 | 67.53
currency | 0.0 | 0.0 | 0.0 | 0.0 | 4.55 | 3.41 | 14.77
city-in-state | 0.0 | 7.94 | 6.50 | 4.33 | 33.57 | 34.30 | 51.26
family | 19.64 | 5.36 | 1.79 | 14.29 | 21.43 | 21.43 | 19.64
newspapers | 5.0 | 25.0 | 10.0 | 10.0 | 5.0 | 50.0 | 40.0
ice hockey | 0.0 | 3.68 | 2.16 | 0.0 | 12.12 | 20.13 | 33.33
basketball | 0.0 | 4.25 | 1.63 | 0.0 | 10.13 | 14.71 | 38.56
airlines | 11.76 | 9.15 | 2.94 | 2.61 | 12.42 | 20.26 | 42.48
people-companies | 2.0 | 1.0 | 0.0 | 0.0 | 6.0 | 12.0 | 2.0
total | 1.34 | 9.88 | 4.81 | 8.43 | 37.47 | 37.76 | 60.42
In this section, we first describe our experimental settings including the corpus, hyper-parameter selections, and specifications for different document representation methods. Then we compare these methods on document analogy task and discuss the results.
4.1 Corpus and Preprocessing
The corpus used to learn document representations in this experiment is the same Wikipedia April 2010 dump described in Section 2.1. In preprocessing, we lowercase the corpus, remove pure-digit words and non-English characters, and discard words that occur fewer than 20 times.
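A sketch of this preprocessing, with the exact tokenization details (e.g., how non-English characters are stripped) filled in by assumption:

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase, drop pure-digit tokens, and strip non-English characters."""
    tokens = text.lower().split()
    tokens = [t for t in tokens if not t.isdigit()]          # remove pure-digit words
    tokens = [re.sub(r"[^a-z]", "", t) for t in tokens]      # keep English letters only
    return [t for t in tokens if t]

def remove_rare_words(tokenized_docs, min_count=20):
    """Drop words occurring fewer than min_count times in the whole corpus."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    return [[tok for tok in doc if counts[tok] >= min_count] for doc in tokenized_docs]
```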
4.2 Experimental Settings
The baseline methods used in this paper include BOW with TF-IDF weights, LSI, NMF, LDA, PV-DM, PV-DBOW, and BOWE. For BOW, LSI, and LDA, we use the popular Python topic modeling library gensim (http://radimrehurek.com/gensim). For NMF, we use the Python machine learning library scikit-learn (http://scikit-learn.org). We implement PV-DM and PV-DBOW in C++, since the source code of the PV models [Le and Mikolov2014] has not been released. For the word embeddings in BOWE, we use CBOW from the Word2Vec tool (https://code.google.com/p/Word2Vec/). Negative sampling is adopted in place of the hierarchical softmax, since we found the former always achieves better performance. The learning rate decays linearly over training as described in [Mikolov et al.2013a]; the initial learning rate is 0.05 for PV-DM and CBOW and 0.025 for PV-DBOW. We set the context window size to 10 and use 10 negative samples.

4.3 Results
In Table 2, we compare the results of 100-dimensional document vectors from all the methods on the different subtasks of document analogy. As we can see, BOW performs the worst overall among all the methods. This demonstrates the weakness of the simple vector space model in capturing semantic regularities.
Neural embedding models such as PV-DM and PV-DBOW perform much better than conventional latent models such as LSI, NMF, and LDA. This is somewhat surprising, since PV models can also be viewed as implicit matrix factorization, following the analysis of Word2Vec in [Levy and Goldberg2014a]. A major difference is that conventional latent models usually factorize a matrix whose entries are word frequencies or TF-IDF weights, while PV models implicitly factorize a shifted pointwise mutual information (shifted-PMI) matrix. As discussed in [Arora et al.2015], PMI is a key reason why Word2Vec works well on the word analogy task. We conjecture that this may also be a major factor behind the gap between PV models and the other latent models. We therefore conducted a further experiment on LSI and found that its total accuracy reaches 15.53% when applied to a PMI matrix (about a 57% relative gain over LSI on the TF-IDF matrix). This result indicates that PMI plays an important role in revealing linear structure in document representations.
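For clarity, here is a sketch of how a document-word PMI matrix can be computed before applying LSI; the positive-PMI clipping is a common choice and an assumption here, not necessarily the exact variant used in the experiment above.

```python
import numpy as np

def pmi_matrix(counts, positive=True):
    """PMI of a dense document-word count matrix:
    pmi[i, j] = log( p(d_i, w_j) / (p(d_i) * p(w_j)) ), with zero counts mapped to 0."""
    total = counts.sum()
    p_dw = counts / total                         # joint probabilities
    p_d = p_dw.sum(axis=1, keepdims=True)         # document marginals
    p_w = p_dw.sum(axis=0, keepdims=True)         # word marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_dw / (p_d * p_w))
    pmi[~np.isfinite(pmi)] = 0.0                  # zero out log(0) entries
    return np.maximum(pmi, 0.0) if positive else pmi
```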
A surprising result is that the simple BOWE model performs significantly better than all the other methods on almost all subtasks. There are two possible reasons. First, word vectors learned with Word2Vec alone already achieve very high scores on word analogy tasks [Mikolov et al.2013c]; by building document representations directly from these word vectors, BOWE inherits their strong linear structure. Second, computing Euclidean distances between documents under BOWE (note that the dot product between normalized vectors used in analogy reasoning is equivalent to Euclidean distance) amounts to a relaxed Word Mover’s Distance [Kusner et al.2015], which has been shown to perform strongly in measuring document distances.
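To make the equivalence just noted explicit: for length-normalized vectors $u$ and $v$ with $\|u\| = \|v\| = 1$,

$$\|u - v\|^{2} = \|u\|^{2} - 2\,u^{\top}v + \|v\|^{2} = 2 - 2\,u^{\top}v,$$

so ranking candidate documents by the dot product of normalized vectors (cosine similarity) is the same as ranking them by Euclidean distance.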
We also conduct experiments with different vector dimensions, as shown in Table 3. The trends are similar to those in Table 2.
Table 3: Total accuracy (%) on the document analogy task with different vector dimensions (50, 100, 150, 200).

Model | 50 | 100 | 150 | 200
---|---|---|---|---
BOW | 1.34 | 1.34 | 1.34 | 1.34
LSI | 4.19 | 9.88 | 17.0 | 21.81
NMF | 1.59 | 4.81 | 8.75 | 11.85
LDA | 3.52 | 8.43 | 10.39 | 10.45
PV-DM | 25.62 | 37.47 | 37.71 | 36.03
PV-DBOW | 25.33 | 37.76 | 40.61 | 39.09
BOWE | 42.05 | 60.42 | 66.74 | 69.49
5 Conclusion
In this paper, we introduce a new document analogy task for quantitatively evaluating how well different document representations capture semantic regularities. Based on the introduced benchmark dataset, we conduct empirical comparisons among several state-of-the-art document representation methods. The results reveal that neural embedding based document representations work better on this analogy task. We provide some preliminary explanations on these observations, leaving the inherent differences of these models to be further investigated in the future. With this benchmark dataset, it would also be easier for us to develop new document representation models and to compare with existing methods.
References
- [Arora et al.2015] Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. 2015. Random walks on context spaces: Towards an explanation of the mysteries of semantic word embeddings. CoRR, abs/1502.03520.
- [Bengio et al.2003] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. J. Mach. Learn. Res., 3:1137–1155, March.
- [Blei et al.2003] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, March.
- [Dai et al.2014] Andrew M. Dai, Christopher Olah, Quoc V. Le, and Greg S. Corrado. 2014. Document embedding with paragraph vectors. In NIPS Deep Learning Workshop.
- [Deerwester et al.1990] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407.
- [Dori-Hacohen et al.2015] Shiri Dori-Hacohen, Elad Yom-Tov, and James Allan. 2015. Navigating controversy as a complex search task. In Proceedings of the First International Workshop on Supporting Complex Search Tasks co-located with the 37th European Conference on Information Retrieval (ECIR 2015). Elsevier, March.
- [Harris1954] Zellig Harris. 1954. Distributional structure. Word, 10(2-3):146–162.
- [Huang et al.2012] Eric H. Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2012. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, ACL ’12, pages 873–882, Stroudsburg, PA, USA. Association for Computational Linguistics.
- [Jones1972] Karen Spärck Jones. 1972. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11–21.
- [Kusner et al.2015] Matt J. Kusner, Yu Sun, Nicholas I. Kolkin, and Kilian Q. Weinberger. 2015. From word embeddings to document distances. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), ICML ’15. ACM, New York, NY, USA, July.
- [Le and Mikolov2014] Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Tony Jebara and Eric P. Xing, editors, Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1188–1196. JMLR Workshop and Conference Proceedings.
- [Lee and Seung1999] Daniel D. Lee and H. Sebastian Seung. 1999. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, October.
- [Levy and Goldberg2014a] Omer Levy and Yoav Goldberg. 2014a. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems 27, pages 2177–2185. Curran Associates, Inc., Montreal, Quebec, Canada.
- [Levy and Goldberg2014b] Omer Levy and Yoav Goldberg. 2014b. Linguistic regularities in sparse and explicit word representations. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pages 171–180. Association for Computational Linguistics.
- [Luong et al.2013] Minh-Thang Luong, Richard Socher, and Christopher D. Manning. 2013. Better word representations with recursive neural networks for morphology. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 104–113. Association for Computational Linguistics.
- [Mikolov et al.2013a] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. In Proceedings of the ICLR Workshop.
- [Mikolov et al.2013b] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.
- [Mikolov et al.2013c] Tomas Mikolov, Wen tau Yih, and Geoffrey Zweig. 2013c. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT-2013). Association for Computational Linguistics, May.
- [Mnih and Kavukcuoglu2013] Andriy Mnih and Koray Kavukcuoglu. 2013. Learning word embeddings efficiently with noise-contrastive estimation. In C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2265–2273. Curran Associates, Inc.
- [Neelakantan et al.2014] Arvind Neelakantan, Jeevan Shankar, Alexandre Passos, and Andrew McCallum. 2014. Efficient non-parametric estimation of multiple embeddings per word in vector space. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1059–1069, Doha, Qatar, October. Association for Computational Linguistics.
- [Pennington et al.2014] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1532–1543.
- [Shaoul and Westbury2010] Cyrus Shaoul and Chris Westbury. 2010. The Westbury Lab Wikipedia corpus. Edmonton, AB: University of Alberta.
- [Srivastava et al.2013] Nitish Srivastava, Ruslan Salakhutdinov, and Geoffrey E. Hinton. 2013. Modeling documents with deep Boltzmann machines. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, pages 616–625, Seattle, USA, August.