N-gram-Based Low-Dimensional Representation for Document Classification

12/19/2014 ∙ by Rémi Lebret, et al. ∙ Villa Collobert 0

The bag-of-words (BOW) model is the common approach for classifying documents, where words are used as feature for training a classifier. This generally involves a huge number of features. Some techniques, such as Latent Semantic Analysis (LSA) or Latent Dirichlet Allocation (LDA), have been designed to summarize documents in a lower dimension with the least semantic information loss. Some semantic information is nevertheless always lost, since only words are considered. Instead, we aim at using information coming from n-grams to overcome this limitation, while remaining in a low-dimension space. Many approaches, such as the Skip-gram model, provide good word vector representations very quickly. We propose to average these representations to obtain representations of n-grams. All n-grams are thus embedded in a same semantic space. A K-means clustering can then group them into semantic concepts. The number of features is therefore dramatically reduced and documents can be represented as bag of semantic concepts. We show that this model outperforms LSA and LDA on a sentiment classification task, and yields similar results than a traditional BOW-model with far less features.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Text document classification aims at assigning a text document to one or more classes. Successful methods are traditionally based on bag-of-words (BOW). Finding discriminative keywords is, in general, good enough for text classification. Given a dictionary of words to consider, documents are represented by a -dimensional vector (the bag of its words). Each dimension is either a binary value (present or not in the document) or a word occurrence frequency. Some term weightings (e.g.

the popular td-idf) have also been defined to reflect how discriminative a word is for a document. These are considered as features for training a classifier. Naive Bayes (NB) and Support Vector Machine (SVM) models are often the first choices. One limitation of the bag-of-words model is that the discriminative words are usually not the most frequent ones. A large dictionary of words needs to be defined to obtain a robust model. Classifiers then have to deal with a huge number of features, and thus become time-consuming and memory-hungry.

Some techniques have been proposed to reduce the dimensionality and represent documents in a low-dimensional semantic space. Latent Semantic Analysis (LSA) (Deerwester et al., 1990)

uses the term-document matrix and a singular value decomposition (SVD) to represent terms and documents in a new low-dimensional space. Latent Dirichlet Allocation (LDA)

(Blei et al., 2003) is a generative probabilistic model of a corpus. Each document is represented as a mixture of latent topics, where each topic is characterized by a distribution over words. By defining topics, documents can then be represented as -dimensional vectors. Pessiot et al. (2010) also proposed probabilistic models for unsupervised dimensionality reduction in the context of document clustering. They make the hypothesis that words occuring with the same frequencies in the same document are semantically related. Based on this assumption, words are partioned into word topics. Document are then represented by a vector where each feature corresponds to a word-topic representing the number of occurrences of words from that word-topic in the document. Other techniques have tried to improve text document clustering by taking into account relationships between important terms. Some have enriched document representations by integrating core ontologies as background knowledge (Staab & Hotho, 2003), or with Wikipedia concepts and category information (Hu et al., 2009). Part-of-speech tags have also been used to disambiguate words (Sedding & Kazakov, 2004).

All these techniques are based on words alone, which raises another limitation. A collection of words cannot capture phrases or multi-word expressions, while

-grams have shown to be helpful features in several natural language processing tasks

(Tan et al., 2002; Lin & Wu, 2009; Wang & Manning, 2012).

-gram features are not commonly used in text classification, probably because the dictionary

tends to grow exponentially with . Phrase structure extraction can be used to identify only -grams which are phrase patterns, and thus limit the dictionary size. However, this adds another step to the model, making it more complex. To overcome these barriers, we propose that documents be represented as a bag of semantic concepts, where -grams are considered instead of only words. Good word vector representations are obtained very quickly with many different recent approaches (Mikolov et al., 2013b; Mnih & Kavukcuoglu, 2013; Lebret & Collobert, 2014; Pennington et al., 2014). Mikolov et al. (2013a) also showed that simple vector addition can often produce meaningful results, such as king - man + woman queen. By leveraging the ability of these word vector representations to compose, representations for -grams are easily computed with an element-wise addition. Using a clustering algorithm such as -means, those representations are grouped into clusters which can be viewed as semantic concepts. Text documents are now represented as bag of semantic concepts, with each feature corresponding to the presence or not of -grams from the resulting clusters. Therefore, more information is captured while remaining in a low-dimensional space. As Mikolov et al’s Skip-gram model and -means are highly parallelizable, this model is much faster to compute than LSA or LDA. The same classifiers as with BOW-based models are then applied on these bag of semantic concepts. We show that such model is a good alternative to LSA or LDA to represent documents and yields even better results on movie review tasks.

2 A bag of semantic concepts model

The model is divided into three steps: (1) vector representations of -grams are obtained by averaging pre-trained representations of its individual words; (2) -grams are gouped into semantic concepts by performing -means clustering on all -gram representations; (3) documents are represented by a bag of semantic concepts, where each entry depends on the presence of -grams from the concepts defined in the previous step.

2.1 -gram representation

The first step of the model is to generate continuous vector representations for each word within the dictionary . Leveraging recent models, such as the Skip-gram (Mikolov et al., 2013b) or GloVe (Pennington et al., 2014) models, they are trained over a large corpus of unlabeled data in an efficient manner. These models are indeed highly parallelizable, which helps to obtain these representations very quickly. Word representations are then summed to generate -gram representations:


These representations are vectors which keep the semantic information of -grams with different in the same dimensionality. Distances between them are thus computable. It allows the use of a -means clustering for grouping all -grams into classes.

2.2 -means clustering

-means is an unsupervised learning algorithm commonly used to automatically partition a data set into

clusters. Considering a set of -gram representations , the algorithm will determine a set of centroids , so as to minimize the average distance from each representation to its nearest centroid:


The limitation due to the size of the dictionary is therefore overcomed. By setting to a low value, documents can also be represented by more compact vectors than with a bag-of-words model, while keeping all the meaningful information.

2.3 Document representation

Denoting a set of text documents, where each document contains a set of -grams. First, each -gram is embedded into a common vector space by averaging its word vector representations. The resulting -grams representations are assigned to clusters using the centroids defined by the -means clustering. Documents are then represented by a vector of features, . Each entry usually corresponds to the frequency of -grams from the cluster within the document . The set of text documents is then defined as .

With NB features.

For certain type of document, such as movie reviews, the use of Naive Bayes features can improve the general performance (Wang & Manning, 2012)

. Success in sentiment analysis relies mostly on the capability of the models to detect negative and positive

-grams in a document. A proper normalization is then calculated to determine how important each -gram is for a given class . We denote a set of count vectors for all -grams contained in , . represents the number of occurence of the -gram in the training document . Defining count vectors as and , a log-count ratio is calculated to determine how important -grams are for the classes :


Because -grams are in clusters, we extract the maximum absolute log-count ratio for every cluster:


These document representations can then be used for several NLP tasks such as classification or information retrieval. As for BOW-based models, this model is particulary suitable for linear SVM.

3 Experiments with sentiment analysis

Sentiments can have a completely different meaning if -grams are considered instead of words. A classifier might leverage a bigram such as “not good” to classify a document as negative, while this would probably fail if only unigrams (words) were considered. We thus benchmark the bag of semantic concepts model on sentiment analysis.

3.1 IMDB movie reviews datasets

Datasets from IMDB have the nice property of containing long documents. It is thus valuable to considerer -grams in such a framework. We did experiments with small and large collections of reviews. We can thus analyse how well our model compares against classical models, for different dataset sizes.

3.1.1 Pang & Lee (2004)

The collection consists of 1,000 positive and 1,000 negative processed reviews111Available at http://www.cs.cornell.edu/people/pabo/movie-review-data/.. So a random guess yields 50% accuracy. The authors selected only reviews where rating was expressed either with stars or some numerical value. To avoid domination of the corpus by a small number of prolific reviewers, they imposed a limit of fewer than 20 reviews per author per sentiment category. As there is no test set, we used 10-fold cross-validation.

3.1.2 Maas et al. (2011)

The collection consists of 100,000 reviews222Available at http://www.andrew-maas.net/data/sentiment.. It has been divided into three datasets: training and test sets (25,000 labeled reviews each), and 50,000 unlabeled training reviews. It allows no more than 30 reviews per movie. It contains an even number of positive and negative reviews, so randomly guessing yields 50% accuracy. Only highly polarized reviews have been considered. A negative review has a score 4 out of 10, and a positive review has a score 7 out of 10.

3.2 Experimental setup

We first learn word vector representations over a large corpus of unlabeled text. This step could however be skipped by taking existing pre-trained word representations333Different pre-trained word vector representations are available at https://code.google.com/p/word2vec/, http://stanford.edu/~jpennin/ or http://lebret.ch/words/. instead of learning them from scratch. By following the three steps described in Section 2, movie reviews are then represented as bags of semantic concepts. These representations are finally used for training a linear SVM to classify sentiment.

3.2.1 Learning word representation over large corpora

Our English corpus is composed of the entire English Wikipedia444We took the January 2014 version., the Reuters corpus and the Wall Street Journal (WSJ) corpus. We consider lower case words and replace digits with a special token. The resulting text is tokenized using the Stanford tokenizer. The final data set contains about 2 billion words. Our dictionary

consists of all the words appearing at least one hundred times. This results in a 202,255 words dictionary. We then train a Skip-gram model to get word representation in a 100-dimensional vector. This dimension is intentionally quite low to speed up the clustering afterwards. As other hyperparameters, we use a fixed learning rate of 0.01, a context size of 5 phrases, Negative Sampling with 5 negative samples for each positive sample, and a subsampling approach with a threshold of

3.2.2 Bag of semantic concepts for movie reviews

Computing -gram representations.

We consider -grams up to . Only -grams with words from our dictionary are considered for both datasets.555Our English corpus is not large enough to cover all the words present in the IMDB datasets. We thus use the same 1-gram dictionary with the other methods. This results in a set of 34,360 1-gram representations, 419,918 2-gram representations, and 921,837 3-gram representations for the Pang and Lee’s dataset. And 67,847 1-gram representations, 1,842,461 2-gram representations, and 5,724,871 3-gram representations for the Maas et al.’s dataset. Because -gram representations are computed by averaging representations of its word, all -grams are also represented in a 100-dimensional vector.

Partitioning -grams into semantic concepts.

Because -grams are represented in a common vector space, similarities between -grams of different length can be computed. To evaluate the benefit of adding -grams for sentiment analysis, we define semantic concepts with different combinations of -grams: (1) only 1-grams (i.e. clusters of words), (2) only 2-grams, (3) only 3-grams, (4) with 1-grams and 2-grams, and (5) with 1-grams, 2-grams and 3-grams. Each of these five sets of -gram representations are then partitioned in clusters with the -means clustering. The centroids are obtained after 10 iterations of the algorithm.

Movie review representations.

Movie reviews are then represented as bags of semantic concepts with naive bayes features as described in Section 2.3. The log-count ratio for each -gram is calculated on the training set for both datasets.

3.2.3 Comparison with other methods

We compare our models with two classical techniques for representing text documents in a low-dimensional vector space: LSA and LDA. Both methods use the same 1-gram dictionaries than with the bag of semantic concepts model with . In the framework of Maas et al.’s dataset, LSA and LDA benefit from the large set of unlabeled reviews.

Latent Sentiment Analysis (LSA) (Deerwester et al., 1990).

Let be a matrix where each element describes the log count ratio of words in document , with the number of training documents and the dictionary of words (i.e. 34,360 for Pang and Lee’s dataset, 67,847 for Maas et al’s dataset). By applying truncated SVD to the log-count ratio matrix , we thus obtain semantic representations in a -dimensional space for movie reviews.

Latent Dirichlet Allocation (LDA) (Blei et al., 2003).

We train the -topics LDA model using the code released by Blei et al. (2003)666Available at http://www.cs.princeton.edu/~blei/lda-c/.. We leave the LDA hyperparameters at their default values. Like our model, LDA extracts topics (i.e. semantic concepts) and assigns words to these topics. Considering only the words in documents, we thus apply the method described in Section 2.3 to get document representations. A movie review is then represented in a -dimensional vector, where each feature is the maximum absolute log-count ratio for the topic.

3.2.4 Classification using SVM

Having representations of movie reviews in a -dimensional vector, a classifier is trained to determine whether a given review is positive or negative. Given the set of training documents , we picked a linear SVM as a classifier, trained using the LIBLINEAR library (Fan et al., 2008):


with the weight vector, and a penalty parameter.

3.3 Results

Pang and Lee, 2004 Maas et al., 2011
100 200 300 100 200 300
LDA 76.20 77.10 76.80 85.43 85.45 84.40
LSA 81.60 82.55 83.75 85.82 86.63 86.88
1-grams 81.60 82.60 82.70 84.51 84.76 85.54
2-grams 82.30 82.25 83.15 88.02 88.06 87.87
3-grams 73.85 73.05 72.65 87.41 87.46 87.22
1+2-grams 83.85 84.00 84.00 88.10 88.19 88.18
1+2+3-grams 82.45 83.05 83.05 88.39 88.46 88.55
Table 1: Classification accuracy on both movie review tasks with number of features.

The overall results summarized in Table 1 show that the bag of semantic concepts approach outperforms the traditionnal LDA and LSA approaches to represent documents in a low-dimensional space. Good performance is achieved even with only 100 clusters, where LSA needs more clusters to improve. We also denote that our approach performs well on a small dataset, where LDA fails. A significant increase is observed when using 2-grams instead of 1-grams. However, using only 3-grams hurts the performance. The best results are obtained using a combination of -grams, which confirms the benefit of the method. That also means that word vector representations can be combined while keeping relevant semantic information. This is illustrated in Table 3 where semantically close -grams are in the same cluster. We can see that the model is furthermore able to clearly separate antonyms, which is a good asset for sentiment classification. The results are also very competitive with a traditional BOW-model. Using the same 1-gram dictionary and a linear SVM classifier with the naive bayes features, BOW-model achieves 83% accuracy for Pang and Lee’s dataset, and 88.58% for Maas et al’s dataset. Our model therefore performs better with about 344 times less features for the first dataset, and yields similar result with about 678 times less features for the second one.

3.4 Computation time

The bag of semantic concepts model can leverage information coming from -grams to improve sentiment classification of documents. This model has also the nice property to build document representations in an efficient and timely manner. The most time-consuming and costly process step in the model is the -means clustering, especially when dealing with millions of -gram representations. However, this step can be done very quickly with low memory by using mini-batch -means method. Computation times for generating 300-dimensional representations are reported in Table 2. All experiments have been run on single CPU core Intel i7 2600K 3.4 GHz. Despite the fact that single CPU has been used for this benchmark, the three steps of the model are highly parallelizable. The recorded times could thus be divided by the number of CPUs available. We see that representations can be computed in less than one minute with only 1-gram dictionary. About 10 minutes are necessary when adding 2-grams, and about 40 minutes by adding 3-grams. In comparison, LDA needs six hours for extracting 100 topics and three days for 300 topics. Our model is also very competitive with LSA which takes 540 seconds to generate 300-dimensional document representations. However, adding 2-grams and 3-grams to perform a LSA would be extremely time-consuming and memory-hungry while our model can handle it.

1-grams 2-grams 3-grams 1+2-grams 1+2+3-grams
-gram Representations 0 43.00 164.34 43.00 207.34
-means 14.18 291.62 747.90 302.34 1203.99
Document Representations 36.45 173.48 494.06 343.29 949.01
Total 50.63 508.10 1406.30 688.63 2360.34
Table 2: Computation time for building movie review representations with semantic concepts. Time is reported in seconds.

3.5 Inferring semantic concepts for unseen -grams

good not good enjoy did n’t enjoy
=269 =297 =160 =108
nice one sufficiently bad entertain sceptics
liked here not liked adored them did n’t like
is pretty nice is far worse enjoying n’t enjoy any
the greatest thing not that greatest watched and enjoy valueless
Table 3: Selected pairs of antonyms and their cluster number. Here, -grams from Maas et al’s dataset have been partitioned into 300 clusters. Each -gram is accompagnied with a selection of others from its cluster.

Another drawback of classical models is that they cannot deal with unseen words. Only words present in the training documents are used to infer representation for a new text document. Unlike these models, our model can easily assign semantic concepts for new -grams. Because -gram representations are based on its word vector representations, a new -gram vector representation can be calculated if a representation is available for each of its words. This new representations is then assigned to the nearest centroid , which determines its semantic concept. With a small training set, this is a valuable asset when compared to other models.

4 Conclusion

Word vector representations can be quickly obtained with recent techniques such as the Skip-gram model. -grams with different length can then be embedded in a same dimensional vector space with a simple element-wise addition. This makes it possible to compute distances between -grams, which can have many applications in natural language processing. We therefore proposed a bag of semantic concepts model to represent documents in a low-dimensional space. These semantic concepts are obtained by performing a -means clustering which partition all -grams into clusters. This model has several advantages over classical approaches for representing documents in a low-dimensional space: it leverages semantic information coming from -grams; it builds document representations with low resource consumption (time and memory); it can infer semantic concepts for unseen -grams. Furthermore, we have shown that such model is suitable for document classification. Competitive performance has been reached on binary sentiment classification tasks, where this model outperforms traditional approaches. It also attained similar results to traditional bag-of-words with considerably less features.


This work was supported by the HASLER foundation through the grant “Information and Communication Technology for a Better World 2020” (SmartWorld).


  • Blei et al. (2003) Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent Dirichlet Allocation.

    Journal of Machine Learning Research

    , 2003.
  • Deerwester et al. (1990) Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 1990.
  • Fan et al. (2008) Fan, R., Chang, K., Hsieh, C., Wang, X., and Lin, C. LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research, 2008.
  • Hu et al. (2009) Hu, X., Zhang, X., Lu, C., Park, E. K., and Zhou, X. Exploiting Wikipedia As External Knowledge for Document Clustering. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’09, pp. 389–396, New York, NY, USA, 2009. ACM.
  • Lebret & Collobert (2014) Lebret, R. and Collobert, R. Word Embeddings through Hellinger PCA. In Proceedings of the EACL, 2014.
  • Lin & Wu (2009) Lin, D. and Wu, X. Phrase clustering for discriminative learning. In Proceedings of ACL, 2009.
  • Maas et al. (2011) Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., and Potts, C. Learning Word Vectors for Sentiment Analysis. In Proceedings of ACL, 2011.
  • Mikolov et al. (2013a) Mikolov, T., Chen, K., Corrado, G., and Dean, J.

    Efficient Estimation of Word Representations in Vector Space.

    ICLR Workshp, 2013a.
  • Mikolov et al. (2013b) Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. Distributed Representations of Words and Phrases and their Compositionality. In NIPS. 2013b.
  • Mnih & Kavukcuoglu (2013) Mnih, A. and Kavukcuoglu, K.

    Learning word embeddings efficiently with noise-contrastive estimation.

    In NIPS. 2013.
  • Pang & Lee (2004) Pang, B. and Lee, L. A sentimental education: Sentiment analysis using subjectivity. In Proceedings of ACL, 2004.
  • Pennington et al. (2014) Pennington, J., Socher, R., and Manning, C. D. GloVe: Global Vectors for Word Representation. In Proceedings of EMNLP, 2014.
  • Pessiot et al. (2010) Pessiot, J.-F., Kim, Y.-M., Amini, M.-R., and Gallinari, P. Improving Document Clustering in a Learned Concept Space. Information Processing & Management, 46(2):180–192, 2010.
  • Sedding & Kazakov (2004) Sedding, J. and Kazakov, D. WordNet-based Text Document Clustering. In Proceedings of the 3rd Workshop on RObust Methods in Analysis of Natural Language Data, pp. 104–113, Stroudsburg, PA, USA, 2004. Association for Computational Linguistics.
  • Staab & Hotho (2003) Staab, S. and Hotho, A. Ontology-based Text Document Clustering. In Intelligent Information Processing and Web Mining, Proceedings of the International IIS: IIPWM’03 Conference held in Zakopane, pp. 451–452, 2003.
  • Tan et al. (2002) Tan, C., Wang, Y., and Lee, C. The use of bigrams to enhance text categorization. Journal of Information Processing and Management, 2002.
  • Wang & Manning (2012) Wang, S. I. and Manning, C. D. Baselines and Bigrams: Simple, Good Sentiment and Topic Classification. In Proceedings of ACL, 2012.