Distributed word embeddings represent words as dense, low-dimensional, real-valued vectors that capture their semantic and syntactic properties. These embeddings are widely used by machine learning algorithms in tasks such as text classification and clustering. Traditional bag-of-words models, which represent words as indices into a vocabulary, do not account for word ordering or long-distance semantic relations. Representations based on neural network language models Mikolov et al. (2013b) can overcome these flaws and further reduce the dimensionality of the vectors. However, there is a need to extend word embeddings to entire paragraphs and documents for tasks such as document and short-text classification.
Representing entire documents in a dense, low-dimensional space is a challenge. A simple weighted average of the word embeddings in a large chunk of text ignores word ordering, while a parse-tree-based combination of embeddings Socher et al. (2013) can only extend to sentences. Le and Mikolov (2014) train word and paragraph vectors to predict context but share word embeddings across paragraphs. However, words can have different semantic meanings in different contexts. Hence, vectors of two documents that contain the same word in two distinct senses need to account for this distinction for an accurate semantic representation of the documents. Wang Ling (2015) and Liu et al. (2015a) map word embeddings to a latent topic space to capture the different senses in which words occur. However, they represent complex documents in the same space as words, reducing their expressive power. These methods are also computationally intensive.
In this work, we propose the Sparse Composite Document Vector (SCDV) representation learning technique to address these challenges and create efficient, accurate and robust semantic representations of large texts for document classification tasks. SCDV combines the syntax and semantics learnt by word embedding models with a latent topic model that can handle different senses of words, thus enhancing the expressive power of document vectors. The topic space is learnt efficiently using a soft clustering technique over embeddings, and the final document vectors are made sparse to reduce time and space complexity in tasks that consume these vectors.
The remainder of the paper is organized as follows. Section 2 discusses related work in document representations. Section 3 introduces and explains SCDV in detail. This is followed by extensive experiments and analysis in Sections 4 and 5 respectively.
2 Related Work
Le and Mikolov (2014) proposed two models for distributional representation of a document, namely, Distributed Memory Model Paragraph Vectors (PV-DM) and Distributed BoWs paragraph vectors (PV-DBoW). In PV-DM, the model is trained to predict the next context word using word and paragraph vectors. In PV-DBoW, the paragraph vector is directly learned to predict randomly sampled context words. In both models, word vectors are shared across paragraphs. While word vectors capture semantics across different paragraphs of the text, document vectors are learned over context words generated from the same paragraph and potentially capture only local semantics Pranjal Singh (2015). Moreover, a paragraph vector is embedded in the same space as word vectors though it can contain multiple topics and words with multiple senses. As a result, doc2vec Le and Mikolov (2014) does not perform well on Information Retrieval, as described in Ai et al. (2016a) and Roy et al. (2016). Consequently, we expect a paragraph vector to be embedded in a higher dimensional space.
A paragraph vector also assumes all words contribute equally, both quantitatively (weight) and qualitatively (meaning). It ignores the importance and distinctiveness of a word across all documents Pranjal Singh (2015). Pranjal Singh (2015) proposed idf-weighted averaging of word vectors to form document vectors. This method tries to address the above problem. However, it assumes that all words within a document belong to the same semantic topic. Intuitively, a paragraph often has words originating from several semantically different topics. In fact, Latent Dirichlet Allocation Blei et al. (2003) models a document as a distribution over multiple topics.
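The idf-weighted averaging discussed above can be sketched as follows. This is a minimal illustration rather than the cited authors' implementation: the toy corpus, the 2-dimensional vectors, the unsmoothed formula idf(w) = log(N / df(w)) and the normalization by the sum of weights are all assumptions made for the example.

```python
import math

def idf_weights(docs):
    """Return idf(w) = log(N / df(w)) for every word in a tokenised corpus."""
    n_docs = len(docs)
    df = {}
    for doc in docs:
        for word in set(doc):
            df[word] = df.get(word, 0) + 1
    return {word: math.log(n_docs / count) for word, count in df.items()}

def idf_weighted_average(doc, vectors, idf):
    """Average the word vectors of `doc`, weighting each word by its idf."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word in doc:
        if word not in vectors:
            continue  # skip out-of-vocabulary words
        w = idf[word]
        weight_sum += w
        for j in range(dim):
            total[j] += w * vectors[word][j]
    if weight_sum == 0.0:
        return total
    return [x / weight_sum for x in total]

# Hypothetical toy data: three tokenised documents and 2-d word vectors.
docs = [["the", "apple", "fruit"], ["the", "apple", "laptop"], ["the", "bank"]]
vectors = {"the": [1.0, 1.0], "apple": [0.0, 2.0], "fruit": [4.0, 0.0],
           "laptop": [0.0, 4.0], "bank": [2.0, 2.0]}
idf = idf_weights(docs)
doc_vec = idf_weighted_average(docs[0], vectors, idf)
```

Note that a word occurring in every document ("the" here) receives zero weight, while all remaining words are merged into a single vector regardless of topic, which is the limitation pointed out above.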
These shortcomings are addressed in three novel composite document representations called Topical word embeddings (TWE-1, TWE-2 and TWE-3) by Liu et al. (2015b). TWE-1 learns word and topic embeddings by considering each topic as a pseudo-word and builds the topical word embedding for each word-topic assignment. Here, the interaction between a word and the topic to which it is assigned is not considered. TWE-2 learns a topical word embedding for each word-topic assignment directly, by considering each word-topic pair as a pseudo-word. Here, the interaction between a word and its assigned topic is considered but the vocabulary of pseudo-words blows up. For each word and each topic, TWE-3 builds distinct embeddings for the topic and word and concatenates them for each word-topic assignment. Here, the word embeddings are influenced by the corresponding topic embeddings, making words in the same topic less discriminative.
Liu et al. (2015a) proposed an architecture called the Neural tensor skip-gram model (NTSG-1, NTSG-2, NTSG-3, NTSG-4), which learns multi-prototype word embeddings and uses a tensor layer to model the interaction of words and topics to capture different senses. NTSG outperforms other embedding methods such as TWE on the 20 newsgroup data-set by modeling context-sensitive embeddings in addition to topical-word embeddings. LTSG Law et al. (2017) builds on NTSG by jointly learning the latent topic space and context-sensitive word embeddings. All three, TWE, NTSG and LTSG, use LDA and suffer from computational issues such as large training time, prediction time and storage space. They also embed document vectors in the same space as terms. Other works that harness topic modeling, such as Fu et al. (2016), Nguyen et al. (2015), Li et al. (2016a), Law et al. (2017), Das et al. (2015), Niu et al. (2015), Moody (2016) and Li et al. (2016b), also suffer from similar issues.
Vivek Gupta (2016) proposed a method to form a composite document vector using word embeddings and tf-idf values, called the Bag of Words Vector (BoWV). In BoWV, each document is represented by a vector whose dimension depends on $K$, the number of clusters, and $d$, the dimension of the word embeddings. The core idea behind BoWV is that semantically different words belong to different topics and their word vectors should not be averaged. Further, BoWV computes the inverse cluster frequency (icf) of each cluster by averaging the idf values of its member terms to capture the importance of words in the corpus. However, BoWV does hard clustering using the K-means algorithm, assigning each word to only one cluster or semantic topic, whereas a word can belong to multiple topics. For example, the word apple belongs to the topic food as a fruit, and to the topic Information Technology as an IT company. Moreover, BoWV is a non-sparse, high-dimensional continuous vector and suffers from computational problems such as large training time, prediction time and storage requirements.
3 Sparse Composite Document Vectors
In this section, we present the proposed Sparse Composite Document Vector (SCDV) representation as a novel document vector learning algorithm. The feature formation algorithm can be divided into three steps.
3.1 Word Vector Clustering
We begin by learning $d$-dimensional word vector representations for every word in the vocabulary using the skip-gram algorithm with negative sampling (SGNS) Mikolov et al. (2013a). We then cluster these word embeddings using the Gaussian Mixture Model (GMM) Reynolds (2015) soft clustering technique. The number of clusters, $K$, to be formed is a parameter of the SCDV model. By inducing soft clusters, we ensure that each word belongs to every cluster with some probability.
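To make the soft assignment concrete, the sketch below computes the posterior probability of each cluster for a single word vector under a Gaussian mixture whose components share one spherical covariance. It assumes the mixture parameters are already fitted (in practice by EM, for instance with scikit-learn's `GaussianMixture`); the means, weights and variance used here are hypothetical toy values, not taken from the paper.

```python
import math

def soft_assignments(vec, means, weights, variance):
    """Posterior P(c_k | w) for one word vector under a spherical GMM.

    All components share the same variance, so the Gaussian normalising
    constant is identical for every k and cancels in the posterior.
    """
    log_scores = []
    for mean, weight in zip(means, weights):
        sq_dist = sum((x - m) ** 2 for x, m in zip(vec, mean))
        log_scores.append(math.log(weight) - sq_dist / (2.0 * variance))
    top = max(log_scores)  # subtract the max for numerical stability
    exp_scores = [math.exp(s - top) for s in log_scores]
    norm = sum(exp_scores)
    return [s / norm for s in exp_scores]

# Hypothetical fitted parameters for K = 2 clusters in d = 2 dimensions.
means = [[0.0, 0.0], [4.0, 4.0]]
weights = [0.5, 0.5]
probs = soft_assignments([1.0, 1.0], means, weights, variance=4.0)
```

Every cluster receives a non-zero probability, so a word lying between two cluster centres contributes to both rather than being forced into one, as hard K-means assignment would do.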
3.2 Document Topic-vector Formation
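For each word $w_i$, we create $K$ different word-cluster vectors of $d$ dimensions ($\vec{wcv}_{ik}$) by weighting the word's embedding $\vec{wv}_i$ with its probability distribution in the $k^{th}$ cluster, $P(c_k \mid w_i)$. We then concatenate all $K$ word-cluster vectors ($\vec{wcv}_{ik}$) into a $Kd$-dimensional embedding and weight it with the inverse document frequency of $w_i$ to form a word-topics vector ($\vec{wtv}_i$). Finally, for all words appearing in document $D_n$, we sum their word-topic vectors $\vec{wtv}_i$ to obtain the document vector $\vec{dv}_{D_n}$.
$$\vec{wcv}_{ik} = \vec{wv}_i \times P(c_k \mid w_i)$$
$$\vec{wtv}_i = idf(w_i) \times \bigoplus_{k=1}^{K} \vec{wcv}_{ik}$$
where $\bigoplus$ is concatenation.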
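The document vector formation step can be sketched in a few lines. This is an illustrative sketch, not the authors' released code: the function names, the toy 2-dimensional embeddings, the cluster probabilities and the idf values are all hypothetical.

```python
def word_topic_vector(vec, cluster_probs, idf):
    """Concatenate K probability-weighted copies of `vec`, scaled by idf."""
    wtv = []
    for p in cluster_probs:  # one d-dimensional block per cluster
        wtv.extend(idf * p * x for x in vec)
    return wtv

def document_vector(doc, vectors, probs, idf):
    """Sum the word-topic vectors of all in-vocabulary words in `doc`."""
    d = len(next(iter(vectors.values())))
    k = len(next(iter(probs.values())))
    dv = [0.0] * (k * d)
    for word in doc:
        if word not in vectors:
            continue  # skip out-of-vocabulary words
        wtv = word_topic_vector(vectors[word], probs[word], idf[word])
        dv = [a + b for a, b in zip(dv, wtv)]
    return dv

# Hypothetical inputs: d = 2, K = 2.
vectors = {"apple": [1.0, 2.0], "fruit": [3.0, 0.0]}
probs = {"apple": [0.5, 0.5], "fruit": [0.9, 0.1]}
idf = {"apple": 2.0, "fruit": 1.0}
dv = document_vector(["apple", "fruit", "apple"], vectors, probs, idf)
```

The result has K times d entries (4 here): the first d entries collect each word's contribution to cluster 1, the next d its contribution to cluster 2, so embeddings of words from different topics are kept in separate blocks instead of being averaged together.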
3.3 Sparse Document Vectors
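After normalizing the vector, we observed that most values in the document vector are very close to zero. Figure 3 verifies this observation. We utilize this fact to make the document vector sparse by zeroing attribute values whose absolute value falls below a threshold (specified as a parameter), which results in the Sparse Composite Document Vector (SCDV).
In particular, let $p$ be the percentage sparsity threshold parameter, $a_i$ the value of the $i^{th}$ attribute of the non-sparse composite document vector and $n$ represent the $n^{th}$ document in the training set:
$$a_i = \begin{cases} a_i & \text{if } |a_i| \geq \frac{p}{100} \cdot t \\ 0 & \text{otherwise} \end{cases}$$
$$t = \frac{|a_{min}| + |a_{max}|}{2}, \qquad a_{min} = avg_n\left(min_i\left(a_i^n\right)\right), \qquad a_{max} = avg_n\left(max_i\left(a_i^n\right)\right)$$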
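A small sketch of the sparsification step follows. It assumes one particular way of setting the cutoff: a reference magnitude t is taken as the mean of the absolute values of the per-document minimum and maximum attribute, each averaged over the training documents, and entries with magnitude below p percent of t are zeroed. The function name and the toy vectors are hypothetical.

```python
def sparsify(doc_vectors, p):
    """Zero entries whose magnitude is below p percent of the reference t."""
    n = len(doc_vectors)
    a_min = sum(min(v) for v in doc_vectors) / n  # average per-document minimum
    a_max = sum(max(v) for v in doc_vectors) / n  # average per-document maximum
    t = (abs(a_min) + abs(a_max)) / 2.0
    cutoff = (p / 100.0) * t
    return [[x if abs(x) >= cutoff else 0.0 for x in v] for v in doc_vectors]

# Hypothetical dense document vectors (two documents, four attributes each).
dense = [[0.9, -0.01, 0.02, -0.7], [0.5, 0.03, -0.9, 0.001]]
sparse = sparsify(dense, p=10)
```

With these toy values the cutoff is 0.075, so only the four large-magnitude entries survive. Because the cutoff is computed once from the training set, the same value can be reused when sparsifying vectors of unseen documents.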
4 Experiments
We perform multiple experiments to show the effectiveness of SCDV representations for multi-class and multi-label text classification. For all experiments and baselines, we use an Intel(R) Xeon(R) CPU E5-2670 v2 2.50GHz machine with 40 working cores and 128GB RAM, running Linux Ubuntu 14.4. However, we utilize multiple cores only during Word2Vec training and when we run the one-vs-rest classifier for Reuters.
4.1 Baselines
We consider the following baselines: the Bag-of-Words (BoW) model Harris (1954), the Bag of Words Vector (BoWV) model Vivek Gupta (2016), paragraph vector models Le and Mikolov (2014), Topical word embeddings (TWE-1) Liu et al. (2015b), the Neural Tensor Skip-Gram Model (NTSG-1 to NTSG-3) Liu et al. (2015a), the tf-idf weighted average word-vector model Pranjal Singh (2015) and weighted Bag of Concepts (weight-BoC) Kim et al. (2015), where we build topic-document vectors by counting the member words in each topic.
We use the best parameter settings as reported in all our baselines to generate their results. We use dimensions for tf-idf weighted word-vector model, for paragraph vector model, topics and dimensional vectors for TWE, NTSG, LTSG and topics and dimensional word vectors for BOWV. We also compare our results with reported results of other topic modeling based document embedding methods like Fu et al. (2016), Nguyen et al. (2015), Liu and EDU (2014), Li et al. (2016a), Law et al. (2017), Das et al. (2015), Niu et al. (2015), Moody (2016) and Li et al. (2016b). The implementation of SCDV and related experiments is available at https://github.com/dheeraj7596/SCDV.
4.2 Text Classification
We run multi-class experiments on the 20NewsGroup dataset (http://qwone.com/jason/20Newsgroups/) and multi-label classification experiments on the Reuters-21578 dataset (www.daviddlewis.com/resources/testcollections/reuters21578/). We use the script at https://gist.github.com/herrfz/7967781 for preprocessing the Reuters-21578 dataset. We use LinearSVM for multi-class classification and logistic regression with the OneVsRest setting for multi-label classification in baselines and SCDV.
For SCDV, we set the dimension of word-embeddings to , sparsity threshold parameter to and the number of mixture components in GMM to . All mixture components share the same spherical covariance matrix. We learn word vector embeddings using Skip-Gram with Negative Sampling (SGNS), with a negative sampling of 10 and a minimum word frequency of 20. We use 5-fold cross-validation on F1 score to tune the parameter C of the SVM.
4.2.1 Multi-class classification
We evaluate classifier performance using standard metrics such as accuracy and macro-averaged precision, recall and F-measure. Table 1 shows a comparison with the current state-of-the-art (NTSG) document representations on the 20Newsgroup dataset. We observe that SCDV outperforms all other current models by fair margins. We also present the class-wise precision and recall for 20Newsgroup on an almost balanced dataset with SVM over the Bag of Words model and the SCDV embeddings in Table 2 and observe that SCDV improves consistently over all classes.
4.2.2 Multi-label classification
We evaluate multi-label classification performance using Precision@K, nDCG@k Bhatia et al. (2015), Coverage error, Label ranking average precision score (LRAPS) and F1-score. All measures are extensively used for the multi-label classification task. However, F1-score is an appropriate metric for multi-label classification as it considers label biases when train-test splits are random. Table 3 shows evaluation results for multi-label text classification on the Reuters-21578 dataset.
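For readers unfamiliar with the ranking-based measures, the following sketch computes Precision@K and nDCG@k for a single document under the common binary-relevance definitions. It is an illustration only; the function names and the toy scores are assumptions, and the exact evaluation code used for Table 3 may differ.

```python
import math

def precision_at_k(scores, true_labels, k):
    """Fraction of the k highest-scored labels that are relevant."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = sum(1 for i in ranked[:k] if i in true_labels)
    return hits / k

def ndcg_at_k(scores, true_labels, k):
    """Binary-relevance nDCG@k with a log2 position discount."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, i in enumerate(ranked[:k]) if i in true_labels)
    ideal = sum(1.0 / math.log2(rank + 2)
                for rank in range(min(k, len(true_labels))))
    return dcg / ideal if ideal > 0 else 0.0

# Hypothetical classifier scores for four labels; labels 0 and 3 are correct.
scores = [0.9, 0.1, 0.8, 0.3]
true_labels = {0, 3}
p_at_2 = precision_at_k(scores, true_labels, k=2)
ndcg_2 = ndcg_at_k(scores, true_labels, k=2)
```

Here the top two predicted labels are 0 and 2, so one of the two is correct and Precision@2 is 0.5; nDCG@2 is higher because the correct label is ranked first.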
4.2.3 Effect of Hyper-Parameters
SCDV has three parameters: the number of clusters, the word vector dimension and the sparsity threshold parameter. We vary one parameter while keeping the other two constant. Performance on varying all three parameters is shown in Figure 4. We observe that performance improves as we increase the number of clusters and saturates at 60. The performance improves until a word vector dimension of 300, after which it saturates. Similarly, we observe that the performance improves as we increase the sparsity threshold up to 4%, after which it declines. At 4% thresholding, we reduce the storage space by 80% compared to the dense vectors.
4.3 Topic Coherence
We evaluate the topics generated by GMM clustering on 20NewsGroup for quantitative and qualitative analysis. Instead of using perplexity Chang et al. (2011), which doesn’t correlate with semantic coherence and human judgment of individual topics, we used the popular topic coherence Mimno et al. (2011), Arora et al. (2013), Liu and EDU (2014) measure. A higher topic coherence score indicates a more coherent topic.
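We used Bayes rule to compute $P(w_i \mid c_k)$ for a given topic $c_k$ and given word $w_i$, and computed the score of the top 10 words for each topic.
$$P(w_i \mid c_k) = \frac{P(c_k \mid w_i)\,P(w_i)}{P(c_k)}$$
where
$$P(c_k) = \sum_{i=1}^{V} P(c_k \mid w_i)\,P(w_i), \qquad P(w_i) = \frac{\#(w_i)}{\sum_{j=1}^{V} \#(w_j)}$$
Here, $\#(w_i)$ denotes the number of times word $w_i$ appears in the corpus and $V$ represents vocabulary size.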
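The Bayes-rule inversion used to rank words within a topic can be sketched as below. The function names, the three-word vocabulary, the soft assignments and the counts are hypothetical; the sketch only shows how word-to-cluster probabilities and corpus counts combine into a per-topic word distribution.

```python
def word_given_topic(cluster_probs, counts):
    """Return, for each topic k, a dict mapping word -> P(word | topic k).

    cluster_probs[w] is the list of soft assignments P(c_k | w);
    counts[w] is the number of times w occurs in the corpus.
    """
    total = sum(counts.values())
    p_word = {w: c / total for w, c in counts.items()}  # unigram prior P(w)
    n_topics = len(next(iter(cluster_probs.values())))
    result = []
    for k in range(n_topics):
        joint = {w: cluster_probs[w][k] * p_word[w] for w in counts}
        p_topic = sum(joint.values())  # P(c_k), marginalised over the vocabulary
        result.append({w: j / p_topic for w, j in joint.items()})
    return result

def top_words(p_w_given_c, n):
    """The n most probable words of one topic."""
    return sorted(p_w_given_c, key=p_w_given_c.get, reverse=True)[:n]

# Hypothetical soft assignments over K = 2 topics and corpus counts.
cluster_probs = {"image": [0.9, 0.1], "pixel": [0.8, 0.2], "doctor": [0.1, 0.9]}
counts = {"image": 30, "pixel": 10, "doctor": 60}
p_w_c = word_given_topic(cluster_probs, counts)
```

Calling `top_words(p_w_c[0], 2)` on this toy input yields "image" and "pixel": "doctor" is frequent in the corpus but has a low assignment probability for the first topic, so it ranks last there.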
We calculated the topic coherence score for all topics for GMM clustering, LDA and LTSG Law et al. (2017). Averaging the score of all 80 topics, GMM clustering scores -85.23 compared to -108.72 for LDA and -92.23 for LTSG. Thus, SCDV creates more coherent topics than both LDA and LTSG.
Table 4: Top words of the topics Image, Health and Mail.
Table 4 shows the top 10 words of 3 topics from GMM clustering, the LDA model and the LTSG model on 20NewsGroup; GMM clustering shows higher topic coherence. Words are ranked based on their probability distribution in each topic. Our results also support the qualitative results of Randhawa et al. (2016), where k-means was used over word vectors to find topics.
4.4 Context-Sensitive Learning
In order to demonstrate the effects of soft clustering (GMM) during SCDV formation, we select some words (w) with multiple senses from 20Newsgroup and their soft cluster assignments to find the dominant clusters. We also select top scoring words (w) from each cluster (c) to represent the meaning of that cluster. Table 5 shows polysemic words and their dominant clusters with assignment probabilities. This indicates that using soft clustering to learn word vectors helps combine multiple senses into a single embedding vector.
|Word||Cluster words||Assignment probability|
|subject:1||physics, chemistry, math, science||0.27|
|subject:2||mail, letter, email, gmail||0.72|
|interest:1||information, enthusiasm, question||0.65|
|interest:2||bank, market, finance, investment||0.32|
|break:1||vacation, holiday, trip, spring||0.52|
|break:2||encryption, cipher, security, privacy||0.22|
|break:3||if, elseif, endif, loop, continue||0.23|
|unit:1||calculation, distance, mass, length||0.25|
|unit:2||electronics, KWH, digital, signal||0.69|
4.5 Information Retrieval
Ai et al. (2016b) used Le and Mikolov (2014)’s paragraph vectors to enhance the basic language model based retrieval model. The language model (LM) probabilities are estimated from the corpus and smoothed using a Dirichlet prior Zhai and Lafferty (2004). In (Ai et al., 2016b), this language model is then interpolated with the paragraph vector (PV) language model as follows.
$$P(w \mid d) = (1-\lambda)\,P_{LM}(w \mid d) + \lambda\,P_{PV}(w \mid d)$$
and the score for document $d$ and query string $Q$ is given by
$$score(q, d) = \sum_{w \in Q} P(w \mid Q)\,\log P(w \mid d)$$
where $P(w \mid Q)$ is obtained from the unigram query model and $score(q, d)$ is used to rank documents. Ai et al. (2016b) do not directly make use of paragraph vectors for the retrieval task, but improve the document language model. To directly make use of paragraph vectors and make computations more tractable, we directly interpolate the language model query-document score with the similarity score between the normalized query and document vectors to generate $score(q, d)$, which is then used to rank documents.
$$score(q, d) = (1-\lambda)\,LM(q, d) + \lambda\,SCDV(q, d)$$
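The direct interpolation of the language model score with a vector similarity score can be sketched as follows. Several details are assumptions made for illustration: cosine similarity stands in for the similarity between normalized query and document vectors, the weight is applied as (1 - lam) on the language model score and lam on the similarity, and the document ids, scores and vectors are toy values.

```python
import math

def cosine(u, v):
    """Cosine similarity, i.e. the dot product of the normalised vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def interpolated_score(lm_score, query_vec, doc_vec, lam):
    """Mix the language-model score with the vector similarity."""
    return (1.0 - lam) * lm_score + lam * cosine(query_vec, doc_vec)

def rank(lm_scores, query_vec, doc_vecs, lam):
    """Return document ids sorted by interpolated score, best first."""
    scored = {
        doc_id: interpolated_score(lm_scores[doc_id], query_vec, doc_vecs[doc_id], lam)
        for doc_id in lm_scores
    }
    return sorted(scored, key=scored.get, reverse=True)

# Hypothetical language-model scores and document vectors.
lm_scores = {"d1": -2.0, "d2": -2.1}
doc_vecs = {"d1": [0.0, 1.0], "d2": [1.0, 0.0]}
query_vec = [1.0, 0.0]
ranking = rank(lm_scores, query_vec, doc_vecs, lam=0.5)
```

In this toy case the language model alone slightly prefers d1, but the vector similarity moves d2 to the top once the two scores are mixed. Only one similarity is computed per query-document pair, rather than one per query word.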
Directly evaluating the document similarity score with the query paragraph vector, rather than collecting similarity scores for individual words in the query, helps avoid confusion amongst distinct query topics and makes the interpolation operation faster. In Table 6, we report Mean Average Precision (MAP) values for four datasets, Associated Press 88-89 (topics 51-200), Wall Street Journal (topics 51-200), San Jose Mercury (topics 51-150) and Disks 4 & 5 (topics 301-450) in the TREC collection. We learn $\lambda$ on a held-out set of topics. We observe consistent improvement in MAP for all datasets. We marginally improve the MAP reported by Ai et al. (2016b) on the Robust04 task. In addition, we also report the improvements in MAP score when model-based relevance feedback Zhai and Lafferty (2001) is applied over the initially retrieved results from both models. Again, we notice a consistent improvement in MAP.
Table 6: MAP per dataset for LM, LM+SCDV, MB and MB+SCDV.
5 Analysis and Discussion
SCDV overcomes several challenges encountered while training document vectors, which we had mentioned above.
Clustering word embeddings to discover topics improves classification performance, as Figure 4 (left) indicates, while also generating coherent clusters of words (Table 4). Figure 5 shows that clustering gives more discriminative representations of documents than paragraph vectors do, since it uses $K \times d$ dimensions while paragraph vectors embed documents and words in the same space. This enables SCDV to represent complex documents. Fuzzy clustering allows words to belong to multiple topics, thereby recognizing polysemic words, as Table 5 indicates. Thus it mimics the word-context interaction in NTSG and LTSG.
Semantically different words are assigned to different topics. Moreover, a single document can contain words from multiple different topics. Instead of a weighted averaging of word embeddings to form document vectors, as most of the previous work does, concatenating word embeddings for each topic (cluster) avoids merging of semantically different topics.
It is well known that in higher dimensions, structural regularizers such as sparsity help overcome the curse of dimensionality Wainwright (2014). Figure 3 demonstrates this, since the majority of the features are close to zero. Sparsity also enables linear SVM to scale to large dimensions. On 20NewsGroups, the BoWV model takes up 1.1 GB while SCDV takes up only 236 MB (about an 80% decrease). Since GMM assigns a non-zero probability to every topic in the word embedding, noise can accumulate when document vectors are created and tip the scales in favor of an unrelated topic. Sparsity helps to reduce this by zeroing out very small values of probability.
SCDV uses a Gaussian Mixture Model (GMM) while TWE, NTSG and LTSG use LDA for finding semantic topics. GMM time complexity is $O(VNT^2)$ while that of LDA is $O(V^2NT)$. Here, $V$ = vocabulary size, $N$ = number of documents and $T$ = number of topics. Since the number of topics $T$ is smaller than the vocabulary size $V$, GMM is faster. Empirically, compared to TWE, SCDV reduces document vector formation, training and prediction time significantly. Table 7 shows training and prediction times for BoWV, SCDV and TWE models.
6 Conclusion
In this paper, we propose a document feature formation technique for topic-based document representation. SCDV outperforms state-of-the-art models in multi-class and multi-label classification tasks. SCDV introduces sparsity in document vectors to handle high dimensionality. Table 7 indicates that SCDV shows considerable improvements in feature formation, training and prediction times for the 20NewsGroups dataset. We show that fuzzy GMM clustering on word vectors leads to more coherent topics than LDA and can also be used to detect polysemic words. SCDV embeddings also provide a robust estimation of the query and document language models, thus improving the MAP of language model based retrieval systems. In conclusion, SCDV is simple, efficient and creates a more accurate semantic representation of documents.
The authors want to thank Nagarajan Natarajan (Post-Doc, Microsoft Research, India), Praneeth Netrapalli (Researcher, Microsoft Research, India), Raghavendra Udupa (Researcher, Microsoft Research, India) and Prateek Jain (Researcher, Microsoft Research, India) for their encouraging and valuable feedback.
- Ai et al. (2016a) Qingyao Ai, Liu Yang, Jiafeng Guo, and W Bruce Croft. 2016a. Analysis of the paragraph vector model for information retrieval. In Proceedings of the 2016 ACM on International Conference on the Theory of Information Retrieval. ACM, pages 133–142.
- Ai et al. (2016b) Qingyao Ai, Liu Yang, Jiafeng Guo, and W Bruce Croft. 2016b. Improving language estimation with the paragraph vector model for ad-hoc retrieval. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval. ACM, pages 869–872.
- Arora et al. (2013) Sanjeev Arora, Rong Ge, Yonatan Halpern, David M Mimno, Ankur Moitra, David Sontag, Yichen Wu, and Michael Zhu. 2013. A practical algorithm for topic modeling with provable guarantees. In ICML (2). pages 280–288.
- Bhatia et al. (2015) Kush Bhatia, Himanshu Jain, Purushottam Kar, Manik Varma, and Prateek Jain. 2015. Sparse local embeddings for extreme multi-label classification. In Advances in Neural Information Processing Systems. pages 730–738.
- Blei et al. (2003) David M. Blei, Andrew Y. Ng, Michael I. Jordan, and John Lafferty. 2003. Latent dirichlet allocation. Journal of Machine Learning Research 3:2003.
- Chang et al. (2011) Jonathan Chang, Jordan L Boyd-Graber, Sean Gerrish, Chong Wang, and David M Blei. 2011. Reading tea leaves: How humans interpret topic models. pages 262–272.
- Das et al. (2015) Rajarshi Das, Manzil Zaheer, and Chris Dyer. 2015. Gaussian lda for topic models with word embeddings. In ACL (1). pages 795–804.
- Fu et al. (2016) Xianghua Fu, Ting Wang, Jing Li, Chong Yu, and Wangwang Liu. 2016. Improving distributed word representation and topic model by word-topic mixture model. In Proceedings of The 8th Asian Conference on Machine Learning. pages 190–205.
- Harris (1954) Zellig Harris. 1954. Distributional structure. Word 10:146–162.
- Kim et al. (2015) Han Kyul Kim, Hyunjoong Kim, and Sungzoon Cho. 2015. Bag-of-concepts: Comprehending document representation through clustering words in distributed representation. SNU Data Mining Center 12.
- Law et al. (2017) Jarvan Law, Hankz Hankui Zhuo, Junhua He, and Erhu Rong. 2017. Ltsg: Latent topical skip-gram for mutually learning topic model and vector representations. arXiv preprint arXiv:1702.07117 .
- Le and Mikolov (2014) Quoc V Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In ICML. volume 14, pages 1188–1196.
- Li et al. (2016a) Shaohua Li, Tat-Seng Chua, Jun Zhu, and Chunyan Miao. 2016a. Generative topic embedding: a continuous representation of documents. In Proceedings of The 54th Annual Meeting of the Association for Computational Linguistics (ACL).
- Li et al. (2016b) Ximing Li, Jinjin Chi, Changchun Li, Jihong Ouyang, and Bo Fu. 2016b. Integrating topic modeling with word embeddings by mixtures of vmfs. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. pages 151–160.
- Liu and EDU (2014) Bing Liu and UIC EDU. 2014. Topic modeling using topics from many domains, lifelong learning and big data .
- Liu et al. (2015a) Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2015a. Learning context-sensitive word embeddings with neural tensor skip-gram model. In IJCAI. pages 1284–1290.
- Liu et al. (2015b) Yang Liu, Zhiyuan Liu, Tat-Seng Chua, and Maosong Sun. 2015b. Topical word embeddings. In AAAI. pages 2418–2424.
- Mikolov et al. (2013a) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013a. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. pages 3111–3119.
- Mikolov et al. (2013b) Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013b. Linguistic regularities in continuous space word representations. In HLT-NAACL. pages 746–751.
- Mimno et al. (2011) David Mimno, Hanna M Wallach, Edmund Talley, Miriam Leenders, and Andrew McCallum. 2011. Optimizing semantic coherence in topic models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 262–272.
- Moody (2016) Christopher E Moody. 2016. Mixing dirichlet topic models and word embeddings to make lda2vec. arXiv preprint arXiv:1605.02019 .
- Nguyen et al. (2015) Dat Quoc Nguyen, Richard Billingsley, Lan Du, and Mark Johnson. 2015. Improving topic models with latent feature word representations. Transactions of the Association for Computational Linguistics 3:299–313.
- Niu et al. (2015) Liqiang Niu, Xinyu Dai, Jianbing Zhang, and Jiajun Chen. 2015. Topic2vec: learning distributed representations of topics. In Asian Language Processing (IALP), 2015 International Conference on. IEEE, pages 193–196.
- Pranjal Singh (2015) Amitabha Mukerjee Pranjal Singh. 2015. Words are not equal: Graded weighting model for building composite document vectors. In Proceedings of the twelfth International Conference on Natural Language Processing (ICON-2015). BSP Books Pvt. Ltd.
- Randhawa et al. (2016) Ramandeep S Randhawa, Parag Jain, and Gagan Madan. 2016. Topic modeling using distributed word embeddings. arXiv preprint arXiv:1603.04747 .
- Reynolds (2015) Douglas Reynolds. 2015. Gaussian mixture models. Encyclopedia of biometrics pages 827–832.
- Roy et al. (2016) Dwaipayan Roy, Debasis Ganguly, Mandar Mitra, and Gareth JF Jones. 2016. Representing documents and queries as sets of word embedded vectors for information retrieval. arXiv preprint arXiv:1606.07869 .
- Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the conference on empirical methods in natural language processing (EMNLP). Citeseer, volume 1631, page 1642.
- Vivek Gupta (2016) Harish Karnick Ashendra Bansal Pradhuman Jhala Vivek Gupta. 2016. Product classification in e-commerce using distributional semantics. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers.
- Wainwright (2014) Martin J Wainwright. 2014. Structured regularizers for high-dimensional problems: Statistical and computational issues. Annual Review of Statistics and Its Application 1:233–253.
- Wang Ling (2015) Chris Dyer Wang Ling. 2015. Two/too simple adaptations of word2vec for syntax problems. In Proceedings of the 50th Annual Meeting of the North American Association for Computational Linguistics. North American Association for Computational Linguistics.
- Zhai and Lafferty (2001) Chengxiang Zhai and John Lafferty. 2001. Model-based feedback in the language modeling approach to information retrieval. In Proceedings of the tenth international conference on Information and knowledge management. ACM, pages 403–410.
- Zhai and Lafferty (2004) Chengxiang Zhai and John Lafferty. 2004. A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems (TOIS) 22(2):179–214.