Short text clustering is a challenging problem because short texts yield sparse representations. Here we propose a flexible Self-Taught Convolutional neural network framework for Short Text Clustering (dubbed STC^2), which can flexibly and successfully incorporate more useful semantic features and learn non-biased deep text representations in an unsupervised manner. In our framework, the original raw text features are first embedded into compact binary codes by using an existing unsupervised dimensionality reduction method. Then, word embeddings are explored and fed into convolutional neural networks to learn deep feature representations, while the output units are used to fit the pre-trained binary codes in the training process. Finally, we obtain the optimal clusters by employing K-means to cluster the learned representations. Extensive experimental results demonstrate that the proposed framework is effective and flexible, and that it outperforms several popular clustering methods when tested on three public short text datasets.
Short text clustering is of great importance due to its various applications, such as user profiling li-ritter-hovy:2014:P14-1 and recommendation wang-EtAl:2010:ACL1 , given the social media data that now emerge day by day. However, short text clustering suffers from data sparsity: most words occur only once in each short text 15_aggarwal2012survey . As a result, the Term Frequency-Inverse Document Frequency (TF-IDF) measure does not work well in the short text setting. To address this problem, some researchers expand and enrich the context of the data using Wikipedia 29_banerjee2007clustering or an ontology 33_fodeh2011ontology . However, these methods require solid Natural Language Processing (NLP) knowledge and still use high-dimensional representations, which may waste both memory and computation time. Another way to overcome these issues is to explore more sophisticated models for clustering short texts. For example, Yin and Wang 30_yin2014dirichlet proposed a Dirichlet multinomial mixture model-based approach for short text clustering. Yet how to design an effective model remains an open question, and most of these methods are trained directly on Bag-of-Words (BoW) features and are shallow structures that cannot preserve accurate semantic similarities.
Recently, with the help of word embeddings, neural networks have demonstrated strong performance in constructing text representations, such as Recursive Neural Network (RecNN) 24_socher2011semi ; 35_socher2013recursive and Recurrent Neural Network (RNN) 38_mikolov2011extensions . However, RecNN exhibits high time complexity when constructing the textual tree, and RNN, which uses the hidden layer computed at the last word to represent the text, is a biased model in which later words are more dominant than earlier words 14_lai2015rcnn . In non-biased models, by contrast, the learned representation of a text is extracted from all the words in the text with non-dominant learned weights. More recently, Convolutional Neural Network (CNN), the most popular non-biased model, which applies convolutional filters to capture local features, has achieved better performance in many NLP applications, such as sentence modeling 16_blunsom2014convolutional , relation classification 34_zeng2014relation , and other traditional NLP tasks 19_collobert2011natural . Most previous work applies CNN to supervised NLP tasks, whereas in this paper we aim to explore the power of CNN on one unsupervised NLP task, short text clustering.
We systematically introduce a simple yet surprisingly powerful Self-Taught Convolutional neural network framework for Short Text Clustering, called STC^2. The overall architecture of our proposed approach is illustrated in Figure 1. Inspired by 28_zhang2010self ; TwoStepHash_Lin_2013 , we apply a self-taught learning framework to our task. In particular, the original raw text features are first embedded into compact binary codes with the help of one traditional unsupervised dimensionality reduction function. Then the text matrices projected from word embeddings are fed into a CNN model to learn the deep feature representation, and the output units are used to fit the pre-trained binary codes. After obtaining the learned features, the K-means algorithm is employed to cluster the texts. We call our approach “self-taught” because the CNN model is learnt from the pseudo labels generated in the previous stage, which is quite different from the term “self-taught” in raina2007self . Our main contributions can be summarized as follows:
We propose a flexible short text clustering framework which explores the feasibility and effectiveness of combining CNN and traditional unsupervised dimensionality reduction methods.
Non-biased deep feature representations can be learned through our self-taught CNN framework which does not use any external tags/labels or complicated NLP pre-processing.
We conduct extensive experiments on three short text datasets. The experimental results demonstrate that the proposed method achieves excellent performance in terms of both accuracy and normalized mutual information.
This work is an extension of our conference paper xu2015short , and the two differ in the following aspects. First, we put forward a general self-taught CNN framework in this paper which can flexibly couple various semantic features, whereas the conference version can be seen as a specific example of this work. Second, in this paper we use a new short text dataset, Biomedical, in the experiments to verify the effectiveness of our approach. Third, we put much effort into studying the influence of the different semantic features integrated in our self-taught CNN framework, which is not covered in the conference paper.
For the purpose of reproducibility, we make the datasets and software used in our experiments publicly available at https://github.com/jacoxu/STC2.
The remainder of this paper is organized as follows: In Section 2, we briefly survey several related works. In Section 3, we describe the proposed approach STC^2 and its implementation details. Experimental results and analyses are presented in Section 4. Finally, conclusions are given in the last section.
In this section, we review the related work from the following two perspectives: short text clustering and deep neural networks.
There have been several studies that attempted to overcome the sparseness of short text representation. One way is to expand and enrich the context of the data. For example, Banerjee et al. 29_banerjee2007clustering proposed a method of improving the accuracy of short text clustering by enriching the representation with additional features from Wikipedia, and Fodeh et al. 33_fodeh2011ontology incorporated semantic knowledge from an ontology into text clustering. However, these works require solid NLP knowledge and still use high-dimensional representations, which may waste both memory and computation time. Another direction is to map the original features into a reduced space, using methods such as Latent Semantic Analysis (LSA) deerwester1990indexing , Laplacian Eigenmaps (LE) 37_ng2002spectral , and Locality Preserving Indexing (LPI) niyogi2004locality . Some researchers have also explored more sophisticated models to cluster short texts. For example, Yin and Wang 30_yin2014dirichlet proposed a Dirichlet multinomial mixture model-based approach for short text clustering. Moreover, some studies combine both of the above directions. For example, Tang et al. 32_tang2012enriching proposed a novel framework which enriches the text features by employing machine translation and simultaneously reduces the original features through matrix factorization techniques.
Although the above clustering methods can alleviate the sparseness of short text representation to some extent, most of them ignore word order in the text and are shallow structures which cannot fully capture accurate semantic similarities.
Recently, there has been a revival of interest in deep neural networks (DNN), and many researchers have concentrated on using deep learning to learn features. Hinton and Salakhutdinov 23_hinton2006reducing
More recently, researchers have proposed using an external corpus to learn a distributed representation for each word, called word embedding 25_turian2010word , to improve DNN performance on NLP tasks. The Skip-gram and continuous bag-of-words models of Word2vec 21_mikolov2013distributed propose a simple single-layer architecture based on the inner product between two word vectors, and Pennington et al. 26_pennington2014glove introduce a new model for word representation, called GloVe, which captures the global corpus statistics.
In order to learn compact representation vectors of sentences, Le and Mikolov le2014distributed directly extend the previous Word2vec 21_mikolov2013distributed by predicting words in the sentence, which is named Paragraph Vector (Para2vec). Para2vec is still a shallow window-based method and needs a larger corpus to yield better performance. More neural networks utilize word embeddings to capture meaningful syntactic and semantic regularities, such as RecNN 24_socher2011semi ; 35_socher2013recursive and RNN 38_mikolov2011extensions . However, RecNN exhibits high time complexity when constructing the textual tree, and RNN, which uses the layer computed at the last word to represent the text, is a biased model. Recently, Long Short-Term Memory (LSTM) hochreiter1997long and Gated Recurrent Unit (GRU) cho2014learning , as sophisticated recurrent hidden units of RNN, have shown their advantages in many sequence generation problems, such as machine translation sutskever2014sequence , speech recognition graves2013speech , and text conversation shang2015neural . CNN, by contrast, is better at learning non-biased implicit features and has been successfully exploited for many supervised NLP learning tasks as described in Section 1. Various CNN-based variants have been proposed in recent works, such as Dynamic Convolutional Neural Network (DCNN) 16_blunsom2014convolutional , Gated Recursive Convolutional Neural Network (grConv) cho2014properties and Self-Adaptive Hierarchical Sentence model (AdaSent) zhao2015self .
Very recently, Visin et al. visin2015renet attempted to replace the convolutional layer in CNN, in order to learn non-biased features for object recognition, with four RNNs, called ReNet, that sweep over lower-layer features in different directions: (1) bottom to top, (2) top to bottom, (3) left to right and (4) right to left. However, ReNet does not outperform state-of-the-art convolutional neural networks on any of the three benchmark datasets, and it is also a supervised learning model for classification. Inspired by the Skip-gram model of word2vec mikolov2013efficient ; 21_mikolov2013distributed , the Skip-thought model kiros2015skip describes an approach for unsupervised learning of a generic, distributed sentence encoder. Similar to the Skip-gram model, the Skip-thought model trains an encoder-decoder model that tries to reconstruct the surrounding sentences of an encoded sentence, and its authors released an off-the-shelf encoder to extract sentence representations. Some researchers have even introduced continuous Skip-gram and negative sampling to CNN for learning visual representations in an unsupervised manner wang2015unsupervised . This paper, from a new perspective, puts forward a general self-taught CNN framework which can flexibly couple various semantic features and achieve good performance on one unsupervised learning task, short text clustering.
Assume that we are given a dataset of $n$ training texts denoted as $X = \{x_i : x_i \in \mathbb{R}^{d \times 1}\}_{i=1,2,\ldots,n}$, where $d$ is the dimensionality of the original BoW representation. Denote its tag set as $T$ and the pre-trained word embedding set as $E = \{e(w_i) : e(w_i) \in \mathbb{R}^{d_w \times 1}\}_{i=1,2,\ldots,|V|}$, where $d_w$ is the dimensionality of word vectors and $|V|$ is the vocabulary size. In order to learn the $r$-dimensional deep feature representation $h$ from CNN in an unsupervised manner, some unsupervised dimensionality reduction methods are employed to guide the learning of the CNN model. Our goal is to cluster these texts $X$ into clusters based on the learned deep feature representation while preserving the semantic consistency.
As depicted in Figure 1, the proposed framework consists of three components: a deep convolutional neural network (CNN), an unsupervised dimensionality reduction function and a K-means module. In the remaining subsections, we first present the first two components, and then give the trainable parameters and the objective function used to learn the deep feature representation. Finally, the last subsection describes how to perform clustering on the learned features.
In this section, we briefly review one popular deep convolutional neural network, Dynamic Convolutional Neural Network (DCNN) 16_blunsom2014convolutional , which we use as the instance of CNN in the following sections. DCNN, the foundation of our proposed method, was originally proposed for a completely supervised learning task, text classification.
Taking a neural network with two convolutional layers in Figure 2 as an example, the network transforms raw input text to a powerful representation. Particularly, each raw text vector $x_i$ is projected into a matrix representation $S \in \mathbb{R}^{d_w \times s}$ by looking up a word embedding $E$, where $s$ is the length of one text. We also let $\tilde{W} = \{W_i\}_{i=1,2}$ and $W_O$ denote the weights of the neural network. The network defines a transformation $f(\cdot): \mathbb{R}^{d \times 1} \rightarrow \mathbb{R}^{r \times 1}$ ($d \gg r$) which transforms an input raw text $x$ to an $r$-dimensional deep representation $h$. There are three basic operations described as follows:
Wide one-dimensional convolution This operation, with a filter of width $m$, is applied to an individual row of the sentence matrix $S \in \mathbb{R}^{d_w \times s}$, and yields a resulting matrix $C \in \mathbb{R}^{d_w \times (s+m-1)}$, where $m$ is the width of the convolutional filter.
Folding In this operation, every two rows in a feature map are simply summed component-wise. For a map of $d_w$ rows, folding returns a map of $d_w/2$ rows, thus halving the size of the representation and yielding a matrix feature $\hat{C} \in \mathbb{R}^{(d_w/2) \times (s+m-1)}$. Note that the folding operation does not introduce any additional parameters.
Dynamic $k$-max pooling Assuming the pooling parameter is $k$, $k$-max pooling selects the sub-matrix $\bar{C} \in \mathbb{R}^{(d_w/2) \times k}$ of the $k$ highest values in each row of the matrix $\hat{C}$. For dynamic $k$-max pooling, the pooling parameter $k$ is dynamically selected in order to allow for a smooth extraction of higher-order and longer-range features 16_blunsom2014convolutional . Given a fixed pooling parameter $k_{top}$ for the topmost convolutional layer, the parameter $k$ of $k$-max pooling in the $l$-th convolutional layer can be computed as follows:
$$k_l = \max\left(k_{top}, \left\lceil \frac{L-l}{L}\, s \right\rceil\right), \tag{1}$$
where $L$ is the total number of convolutional layers in the network.
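As a rough illustration of the three operations above, the following pure-Python sketch applies a wide one-dimensional convolution to each row of a toy matrix, folds adjacent row pairs, and performs k-max pooling that keeps the selected values in their original order. The function names, the toy matrix and the filter values are hypothetical, and the rule in `dynamic_k` (the maximum of k_top and the ceiling of (L - l) / L times the text length) is the one we assume from the DCNN paper; this is a sketch, not the released implementation.

```python
import math

def wide_conv_row(row, filt):
    """Wide 1-D convolution of one row: output length is len(row) + len(filt) - 1."""
    s, m = len(row), len(filt)
    out = []
    for j in range(s + m - 1):
        acc = 0.0
        for t in range(m):
            idx = j - t  # positions outside the row act as zero padding
            if 0 <= idx < s:
                acc += filt[t] * row[idx]
        out.append(acc)
    return out

def fold(matrix):
    """Sum every two adjacent rows component-wise, halving the number of rows."""
    return [[a + b for a, b in zip(matrix[i], matrix[i + 1])]
            for i in range(0, len(matrix) - 1, 2)]

def k_max_pool_row(row, k):
    """Keep the k largest values of a row, preserving their original order."""
    top = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
    return [row[i] for i in sorted(top)]

def dynamic_k(k_top, layer, total_layers, text_len):
    """Pooling size for a given layer (assumed DCNN rule, layers counted from 1)."""
    return max(k_top, math.ceil((total_layers - layer) / total_layers * text_len))

# Toy sentence matrix with 2 embedding rows and 4 words (hypothetical values).
S = [[1.0, 2.0, 3.0, 4.0],
     [0.5, 0.0, -1.0, 2.0]]
filt = [1.0, 0.0, -1.0]  # hypothetical filter of width m = 3

C = [wide_conv_row(r, filt) for r in S]   # 2 rows of length 4 + 3 - 1 = 6
folded = fold(C)                          # 1 row of length 6
k1 = dynamic_k(k_top=2, layer=1, total_layers=2, text_len=4)
pooled = [k_max_pool_row(r, k1) for r in folded]
```

In a real network each feature map has its own learned filters and a bias followed by a non-linearity; those are omitted here to keep the shapes easy to follow.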
As described in Figure 1, the dimensionality reduction function is defined as follows:
$$Y = f(X), \tag{2}$$
where $Y \in \mathbb{R}^{q \times n}$ are the $q$-dimensional reduced latent space representations. Here, we take four popular dimensionality reduction methods as examples in our framework.
Average Embedding (AE): This method directly averages the word embeddings, which are respectively weighted with TF and TF-IDF. Huang et al. 13_huang2012improving used this strategy as the global context in their task, and Socher et al. 35_socher2013recursive and Lai et al. 14_lai2015rcnn used this method for text classification. The weighted average of all word vectors in one text can be computed as follows:
$$Y(x_i) = \frac{\sum_{k=1}^{|x_i|} w(w_k) \cdot e(w_k)}{\sum_{k=1}^{|x_i|} w(w_k)}, \tag{3}$$
where $w(w_k)$ can be any weighting function that captures the importance of word $w_k$ in the text $x_i$.
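The weighted average described above can be sketched in a few lines of plain Python. The helper name `average_embedding`, the toy two-dimensional embeddings and the choice to skip out-of-vocabulary words are assumptions made for illustration only; the weighting function here is raw term frequency, and a TF-IDF weight could be passed in the same way.

```python
def average_embedding(words, embeddings, weight):
    """Weighted average of the word vectors of one text.

    words      -- distinct words of the text
    embeddings -- dict mapping a word to its vector (list of floats)
    weight     -- function returning the importance of a word in the text
    """
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    norm = 0.0
    for word in words:
        if word not in embeddings:
            continue  # assumption: words without an embedding are skipped
        w = weight(word)
        norm += w
        for d in range(dim):
            total[d] += w * embeddings[word][d]
    return [v / norm for v in total] if norm > 0 else total

# Toy example with hypothetical 2-dimensional embeddings.
embeddings = {"short": [1.0, 0.0], "text": [0.0, 1.0]}
tokens = ["short", "text", "text"]
tf = {w: tokens.count(w) for w in set(tokens)}  # term frequency as the weight

y = average_embedding(sorted(tf), embeddings, lambda w: tf[w])
```

Because "text" occurs twice and "short" once, the result leans towards the vector of "text" by a factor of two to one.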
Latent Semantic Analysis (LSA): LSA deerwester1990indexing is the most popular global matrix factorization method, which applies a dimension-reducing linear projection, Singular Value Decomposition (SVD), to the corresponding term/document matrix. Suppose the rank of $X$ is $\hat{r}$; LSA decomposes $X$ into the product of three other matrices:
$$X = U \Sigma V^T, \tag{4}$$
where $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_{\hat{r}})$ and $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_{\hat{r}}$ are the singular values of $X$, $U \in \mathbb{R}^{d \times \hat{r}}$ is a set of left singular vectors and $V \in \mathbb{R}^{n \times \hat{r}}$ is a set of right singular vectors. LSA uses the top $q$ vectors in $U$ as the transformation matrix to embed the original text features into a $q$-dimensional subspace $Y$ deerwester1990indexing .
Laplacian Eigenmaps (LE): The top eigenvectors of the graph Laplacian, defined on the similarity matrix of texts, are used in this method, which can discover the manifold structure of the text space 37_ng2002spectral . In order to avoid storing the dense similarity matrix, many approximation techniques have been proposed to reduce the memory usage and computational complexity of LE. There are two representative approximation methods, the sparse similarity matrix and the Nyström approximation. Following previous studies 4_cai2005document ; 28_zhang2010self , we select the former technique to construct the $n \times n$ local similarity matrix $S$ by using the heat kernel as follows:
$$S_{ij} = \begin{cases} \exp\left(-\dfrac{\|x_i - x_j\|^2}{2\sigma^2}\right), & \text{if } x_i \in N_k(x_j) \text{ or vice versa} \\ 0, & \text{otherwise} \end{cases} \tag{5}$$
where $\sigma$ is a tuning parameter (default is 1) and $N_k(x)$ represents the set of $k$-nearest-neighbors of $x$. By introducing a diagonal $n \times n$ matrix $D$ whose entries are given by $D_{ii} = \sum_{j=1}^{n} S_{ij}$, the graph Laplacian $L$ can be computed by ($D - S$). The optimal $q \times n$ real-valued matrix $Y$ can be obtained by solving the following objective function:
$$\arg\min_{Y} \; \mathrm{tr}(Y L Y^T) \quad \text{subject to} \quad Y D Y^T = I, \;\; Y D \mathbf{1} = 0, \tag{6}$$
where $\mathrm{tr}(\cdot)$ is the trace function, $Y D Y^T = I$ requires the different dimensions to be uncorrelated, and $Y D \mathbf{1} = 0$ requires each dimension to achieve equal probability of being positive or negative.
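To make the graph construction concrete, here is a small pure-Python sketch that builds a sparse k-nearest-neighbor similarity matrix with a heat kernel and derives the graph Laplacian as the degree matrix minus the similarity matrix. The exact kernel form exp(-||x_i - x_j||^2 / (2 sigma^2)), the symmetric "either point is a neighbor of the other" rule, and all function names are assumptions for illustration; the eigen-decomposition step is not shown.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_heat_kernel(X, k, sigma=1.0):
    """Sparse similarity matrix: heat-kernel weight for k-NN pairs, 0 elsewhere."""
    n = len(X)
    neighbors = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: sq_dist(X[i], X[j]))
        neighbors.append(set(order[:k]))
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # keep the edge if either point is among the other's k nearest neighbors
            if i != j and (j in neighbors[i] or i in neighbors[j]):
                S[i][j] = math.exp(-sq_dist(X[i], X[j]) / (2.0 * sigma ** 2))
    return S

def graph_laplacian(S):
    """Unnormalized graph Laplacian: diagonal degree matrix minus S."""
    n = len(S)
    degree = [sum(row) for row in S]
    return [[(degree[i] if i == j else 0.0) - S[i][j] for j in range(n)]
            for i in range(n)]

# Four toy points forming two well-separated pairs (hypothetical data).
X = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]]
S = knn_heat_kernel(X, k=1)
L = graph_laplacian(S)
```

With `k=1` only the two close pairs are connected, so `S` is block-sparse and each row of `L` sums to zero, as expected for a graph Laplacian.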
Locality Preserving Indexing (LPI): This method extends LE to deal with unseen texts by approximating the linear function $Y = W_{LPI}^T X$ 28_zhang2010self , and the subspace vectors are obtained by finding the optimal linear approximations to the eigenfunctions of the Laplace-Beltrami operator on the Riemannian manifold niyogi2004locality . Similar to LE, we first construct the local similarity matrix $S$; then the graph Laplacian $L$ can be computed by ($D - S$), where $D_{ii}$ measures the local density around $x_i$ and is equal to $\sum_{j=1}^{n} S_{ij}$. Compute the eigenvectors $a$ and eigenvalues $\lambda$ of the following generalized eigen-problem:
$$X L X^T a = \lambda X D X^T a. \tag{7}$$
The mapping function $W_{LPI} = [a_1, \ldots, a_q]$ can be obtained and applied to the unseen data 4_cai2005document .
All of the above methods claim a better performance in capturing semantic similarity between texts in the reduced latent space representation $Y$ than in the original representation $X$, while the performance of short text clustering can be further enhanced with the help of our framework, self-taught CNN.
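The last layer of CNN is an output layer as follows:
$$O = W_O h, \tag{8}$$
where $h$ is the deep feature representation, $O \in \mathbb{R}^{q}$ is the output vector and $W_O \in \mathbb{R}^{q \times r}$ is the weight matrix.
In order to incorporate the latent semantic features $Y$, we first binarize the real-valued vectors $Y$ into the binary codes $B$ by setting the threshold to be the median vector $\mathrm{median}(Y)$. Then, the output vector $O$ is used to fit the binary codes $B$ via $q$ logistic operations as follows:
$$p_i = \frac{\exp(O_i)}{1 + \exp(O_i)}. \tag{9}$$
All parameters to be trained are defined as $\theta = \{E, \tilde{W}, W_O\}$.
Given the training text collection $X$ and the pre-trained binary codes $B$, the log likelihood of the parameters can be written down as follows:
$$J(\theta) = \sum_{i=1}^{n} \log p(b_i \mid x_i, \theta). \tag{10}$$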
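The pseudo-label step can be illustrated with a short pure-Python sketch: real-valued latent vectors are thresholded at the per-dimension median to obtain binary codes, and the network outputs are scored against those codes with independent logistic units. The function names are hypothetical, and treating each bit as an independent Bernoulli variable (so that the log likelihood is a sum of per-bit terms) is our reading of the description above rather than a confirmed detail of the authors' code.

```python
import math
from statistics import median

def binarize_by_median(Y):
    """Threshold each dimension of the latent vectors at its median value."""
    q = len(Y[0])
    thresholds = [median(y[d] for y in Y) for d in range(q)]
    return [[1 if y[d] > thresholds[d] else 0 for d in range(q)] for y in Y]

def sigmoid(o):
    """Logistic function; equals exp(o) / (1 + exp(o))."""
    return 1.0 / (1.0 + math.exp(-o))

def log_likelihood(outputs, codes):
    """Sum of per-bit Bernoulli log likelihoods of the codes given the outputs."""
    total = 0.0
    for O, b in zip(outputs, codes):
        for o_i, b_i in zip(O, b):
            p = sigmoid(o_i)
            total += math.log(p) if b_i == 1 else math.log(1.0 - p)
    return total

# Four texts with hypothetical 2-dimensional latent features.
Y = [[0.2, -1.0], [0.8, 0.5], [-0.3, 2.0], [1.5, -0.2]]
B = binarize_by_median(Y)

# Outputs that agree with the codes score higher than uninformative outputs.
good = [[4.0 if bit else -4.0 for bit in code] for code in B]
flat = [[0.0, 0.0] for _ in B]
ll_good = log_likelihood(good, B)
ll_flat = log_likelihood(flat, B)
```

Thresholding at the median gives each bit a roughly balanced split of zeros and ones, which is why it is a natural choice for turning the latent features into training targets.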
Following the previous work 16_blunsom2014convolutional , we train the network with mini-batches by back-propagation and perform gradient-based optimization using the Adagrad update rule 36_duchi2011adaptive . For regularization, we apply dropout with a 50% rate to the penultimate layer 16_blunsom2014convolutional ; 22_kim2014convolutional .
With the given short texts, we first utilize the trained deep neural network to obtain the semantic representations $h$, and then employ the traditional K-means algorithm to perform clustering.
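For readers who want to see the final step end to end, the following is a minimal K-means sketch in plain Python applied to unit-normalized feature vectors (the experiments later state that vectors are normalized before clustering). The function names, the seeded random initialization and the toy features are assumptions; a library implementation would normally be used in practice.

```python
import math
import random

def unit_normalize(v):
    """Scale a vector to unit Euclidean length (left unchanged if all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd-style K-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iters):
        # assignment step: nearest centroid by squared Euclidean distance
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # update step: each centroid becomes the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical learned features for six texts from two topics.
features = [[1.0, 0.1], [1.0, 0.0], [0.9, 0.1],
            [0.0, 1.0], [0.1, 1.0], [0.1, 0.9]]
points = [unit_normalize(f) for f in features]
labels = kmeans(points, k=2)
```

Since the result depends on the initial centroids, the procedure is usually repeated with different seeds, which is also what the experimental protocol in this paper does.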
|Dataset||Tags/labels|
|SearchSnippets||8 different domains|
|StackOverflow||20 semantic tags|
|Biomedical||20 MeSH major topics|
SearchSnippets (http://jwebpro.sourceforge.net/data-web-snippets.tar.gz). This dataset was selected from the results of web search transactions using predefined phrases of 8 different domains by Phan et al. 20_phan2008learning .
StackOverflow. We use the challenge data published on Kaggle.com (https://www.kaggle.com/c/predict-closed-questions-on-stack-overflow/download/train.zip). The raw dataset consists of 3,370,528 samples from July 31, 2012 to August 14, 2012. In our experiments, we randomly select 20,000 question titles from 20 different tags as in Table 2.
Biomedical. We use the challenge data published on BioASQ’s official website (http://participants-area.bioasq.org/). In our experiments, we randomly select 20,000 paper titles from 20 different MeSH (http://en.wikipedia.org/wiki/Medical_Subject_Headings) major topics as in Table 2. As described in Table 1, the max length of selected paper titles is 53 (http://www.ncbi.nlm.nih.gov/pubmed/207752).
For these datasets, we randomly select 10% of the data as the development set. Since SearchSnippets has been pre-processed by Phan et al. 20_phan2008learning , we do not further process this dataset. In StackOverflow, texts contain much computer terminology, and symbols and capital letters are meaningful, so we do not apply any pre-processing. For Biomedical, we remove the symbols and convert letters to lower case.
|Dataset||Vocabulary coverage||Token coverage|
|SearchSnippets||23,826 (77%)||211,575 (95%)|
|StackOverflow||19,639 (85%)||162,998 (97%)|
|Biomedical||18,381 (97%)||257,184 (99%)|
We use the publicly available word2vec tool (https://code.google.com/p/word2vec/) to train word embeddings, and most parameters are set the same as in Mikolov et al. 21_mikolov2013distributed for training word vectors in the Google News setting (https://groups.google.com/d/msg/word2vec-toolkit/lxbl_MB29Ic/NDLGId3KPNEJ), except that the vector dimensionality is 48 and the minimum count is 5. For SearchSnippets, we train word vectors on Wikipedia dumps (http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2). For StackOverflow, we train word vectors on the whole corpus of the StackOverflow dataset described above, which includes the question titles and post contents. For Biomedical, we train word vectors on all titles and abstracts of the 2014 training articles. The coverage of these learned vectors on the three datasets is listed in Table 3, and the words not present in the set of pre-trained words are initialized randomly.
In our experiment, some widely used text clustering methods are compared with our approach. Besides K-means, Skip-thought Vectors, Recursive Neural Network and Paragraph Vector based clustering methods, four baseline clustering methods are directly based on the popular unsupervised dimensionality reduction methods as described in Section 3.2. We further compare our approach with some other non-biased neural networks, such as bidirectional RNN. More details are listed as follows:
K-means K-means 2_wagstaff2001constrained on original keyword features which are respectively weighted with term frequency (TF) and term frequency-inverse document frequency (TF-IDF).
Skip-thought Vectors (SkipVec) This baseline kiros2015skip gives an off-the-shelf encoder to produce highly generic sentence representations. The encoder (https://github.com/ryankiros/skip-thoughts) is trained using a large collection of novels and provides three encoder modes: a unidirectional encoder (SkipVec (Uni)) with 2,400 dimensions, a bidirectional encoder (SkipVec (Bi)) with 2,400 dimensions, and a combined encoder (SkipVec (Combine)) that combines SkipVec (Uni) and SkipVec (Bi) of 2,400 dimensions each. K-means is employed on each of these vector representations.
Recursive Neural Network (RecNN) In 24_socher2011semi , the tree structure is first greedily approximated via an unsupervised recursive autoencoder. Then, semi-supervised recursive autoencoders are used to capture the semantics of texts based on the predicted structure. In order to make this recursive-based method completely unsupervised, we remove the cross-entropy error in the second phase to learn the vector representation, and subsequently employ K-means on the learned vectors of the top tree node and on the average of all vectors in the tree.
Paragraph Vector (Para2vec) K-means on the fixed-size feature vectors generated by Paragraph Vector (Para2vec) le2014distributed , which is an unsupervised method to learn distributed representations of words and paragraphs. In our experiments, we use the open source software (https://github.com/mesnilgr/iclr15) released by Mesnil et al. mesnil2014ensemble .
Average Embedding (AE) K-means on the weighted average vectors of the word embeddings, which are respectively weighted with TF and TF-IDF. The dimension of the average vectors is equal to, and determined by, the dimension of the word vectors used in our experiments.
Latent Semantic Analysis (LSA) K-means on the reduced subspace vectors generated by the Singular Value Decomposition (SVD) method. The dimension of the subspace is set by default to the number of clusters. We also sweep the dimension from 10 to 200 in steps of 10 to find the best performance, which is obtained at 10 on SearchSnippets, 20 on StackOverflow and 20 on Biomedical in our experiments.
Laplacian Eigenmaps (LE) This baseline, which uses Laplacian Eigenmaps and subsequently employs the K-means algorithm, is well known as spectral clustering 12_belkin2001laplacian . The dimension of the subspace is set by default to the number of clusters 37_ng2002spectral ; 4_cai2005document . We also sweep the dimension from 10 to 200 in steps of 10 to find the best performance, which is obtained at 20 on SearchSnippets, 70 on StackOverflow and 30 on Biomedical in our experiments.
Locality Preserving Indexing (LPI) This baseline, which projects the texts into a lower-dimensional semantic space, can discover both the geometric and discriminating structures of the original feature space 4_cai2005document . The dimension of the subspace is set by default to the number of clusters 4_cai2005document . We also sweep the dimension from 10 to 200 in steps of 10 to find the best performance, which is obtained at 20 on SearchSnippets, 80 on StackOverflow and 30 on Biomedical in our experiments.
bidirectional RNN (bi-RNN) We replace the CNN model in our framework (Figure 1) with bi-RNN models. In particular, LSTM and GRU units are used in the experiments. In order to generate a fixed-length document representation from the variable-length vector sequences, for both bi-LSTM and bi-GRU based clustering methods we further utilize three pooling methods: last pooling (using the last hidden state), mean pooling and element-wise max pooling. These pooling methods are respectively used in the previous works palangi2015deep ; cho2014learning , tang2015document and 14_lai2015rcnn . For regularization, the training gradients of all parameters with an $\ell_2$ norm larger than 40 are clipped to 40, following the previous work sukhbaatar2015end .
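The three pooling strategies and the gradient clipping rule mentioned for the bi-RNN baselines are simple enough to sketch directly. The function names and toy values below are hypothetical, and the clipping shown rescales a single gradient vector so that its L2 norm is at most the threshold; whether the norm is taken per parameter or over all parameters jointly is not specified in the text, so that choice is an assumption.

```python
import math

def last_pool(states):
    """Use the last hidden state as the text representation."""
    return list(states[-1])

def mean_pool(states):
    """Average the hidden states over time, dimension by dimension."""
    return [sum(col) / len(states) for col in zip(*states)]

def max_pool(states):
    """Element-wise maximum of the hidden states over time."""
    return [max(col) for col in zip(*states)]

def clip_by_l2_norm(grad, max_norm=40.0):
    """Rescale a gradient vector so that its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]

# Three hypothetical 2-dimensional hidden states of one text.
states = [[1.0, -2.0], [3.0, 0.0], [2.0, 4.0]]
pooled_last = last_pool(states)
pooled_mean = mean_pool(states)
pooled_max = max_pool(states)

# A gradient with norm 50 is scaled down to norm 40; a small one is unchanged.
clipped = clip_by_l2_norm([30.0, 40.0])
untouched = clip_by_l2_norm([3.0, 4.0])
```

Last pooling keeps only the final state, which is why it tends to favor later words, while mean and max pooling draw on every position of the sequence.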
The clustering performance is evaluated by comparing the clustering results of texts with the tags/labels provided by the text corpus. Two metrics, the accuracy (ACC) and the normalized mutual information metric (NMI), are used to measure the clustering performance 4_cai2005document ; 1_huang2014deep . Given a text $x_i$, let $c_i$ and $y_i$ be the obtained cluster label and the label provided by the corpus, respectively. Accuracy is defined as:
$$ACC = \frac{\sum_{i=1}^{n} \delta(y_i, map(c_i))}{n}, \tag{11}$$
where $n$ is the total number of texts, $\delta(x, y)$ is the indicator function that equals one if $x = y$ and equals zero otherwise, and $map(c_i)$ is the permutation mapping function that maps each cluster label $c_i$ to the equivalent label from the text data by the Hungarian algorithm 7_papadimitriou1998combinatorial .
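A small sketch may help clarify how the accuracy is computed once cluster labels have to be matched to corpus labels. For brevity, the example below finds the best one-to-one mapping by exhaustive search over permutations instead of the Hungarian algorithm; both return the same optimum, but the exhaustive search is only practical for a handful of clusters. The function name and toy labels are hypothetical.

```python
from itertools import permutations

def clustering_accuracy(true_labels, cluster_labels):
    """Best-mapping accuracy via brute-force search (stand-in for Hungarian)."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_labels))
    # pad with None so that every cluster can receive a distinct target
    targets = classes + [None] * max(0, len(clusters) - len(classes))
    best_hits = 0
    for perm in permutations(targets, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(1 for y, c in zip(true_labels, cluster_labels)
                   if mapping[c] == y)
        best_hits = max(best_hits, hits)
    return best_hits / len(true_labels)

# Six texts, three corpus labels, cluster ids in a different numbering.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [2, 2, 1, 0, 0, 1]
acc = clustering_accuracy(y_true, y_pred)
```

Here the best mapping sends cluster 2 to label 0, cluster 0 to label 1 and cluster 1 to label 2, which matches five of the six texts.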
Normalized mutual information 8_chen2011parallel between the tag/label set $T$ and the cluster set $C$ is a popular metric used for evaluating clustering tasks. It is defined as follows:
$$NMI(T, C) = \frac{MI(T, C)}{\sqrt{H(T)\, H(C)}}, \tag{12}$$
where $MI(T, C)$ is the mutual information between $T$ and $C$, $H(\cdot)$ is entropy and the denominator $\sqrt{H(T)\, H(C)}$ is used for normalizing the mutual information to be in the range of [0, 1].
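The metric can be computed from label counts alone, as the following pure-Python sketch shows. It assumes the geometric-mean normalization, that is, mutual information divided by the square root of the product of the two entropies; the function names are hypothetical, and returning 0 when either partition has zero entropy is a convention chosen here to avoid division by zero.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((cnt / n) * math.log(cnt / n) for cnt in Counter(labels).values())

def mutual_information(a, b):
    """Mutual information between two label assignments of the same items."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((cnt / n) * math.log(cnt * n / (count_a[x] * count_b[y]))
               for (x, y), cnt in count_ab.items())

def nmi(a, b):
    """Mutual information normalized by sqrt(H(a) * H(b))."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0.0 or h_b == 0.0:
        return 0.0  # assumption: trivial partitions get a score of 0
    return mutual_information(a, b) / math.sqrt(h_a * h_b)

same_partition = nmi([0, 0, 1, 1], [1, 1, 0, 0])  # identical up to renaming
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])     # carries no shared information
```

Unlike accuracy, this score needs no mapping between cluster ids and corpus labels, since it depends only on how the two partitions overlap.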
|Method||SearchSnippets ACC (%)||StackOverflow ACC (%)||Biomedical ACC (%)|
|Method||SearchSnippets NMI (%)||StackOverflow NMI (%)||Biomedical NMI (%)|
Most parameters are set uniformly for these datasets. Following the previous study 4_cai2005document , the number of nearest neighbors in Eqn. (5) is fixed to 15 when constructing the graph structures for LE and LPI. For the CNN model, the network has two convolutional layers. The widths of the convolutional filters are both 3. The value of $k_{top}$ for the top $k$-max pooling in Eqn. (1) is 5. The number of feature maps is 12 at the first convolutional layer and 8 at the second convolutional layer. Both convolutional layers are followed by a folding layer. We further set the dimension of word embeddings to 48. Finally, the dimension of the deep feature representation is fixed to 480. Moreover, we set the learning rate to 0.01 and the mini-batch training size to 200. The output size in Eqn. (8) is set to the best subspace dimension of the corresponding baseline method, as described in Section 4.3.
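The settings above can be collected in one place, and doing so also gives a quick consistency check on the stated representation size. The dictionary keys and the helper `deep_feature_dim` are hypothetical names; the values are those reported in this section and in the training description. The check assumes that the deep representation is the flattened output of the top pooling layer, which is one plausible reading: 48 embedding rows halved by two folding layers leaves 12 rows, times 5 pooled values, times 8 feature maps.

```python
# Hyperparameters as reported in the text; key names are illustrative only.
config = {
    "word_dim": 48,
    "num_conv_layers": 2,
    "filter_widths": [3, 3],
    "feature_maps": [12, 8],
    "k_top": 5,
    "fold_after_each_conv": True,
    "learning_rate": 0.01,
    "batch_size": 200,
    "dropout": 0.5,
    "knn_neighbors": 15,
}

def deep_feature_dim(cfg):
    """Size of the flattened top-layer output under the stated assumption."""
    rows = cfg["word_dim"]
    for _ in range(cfg["num_conv_layers"]):
        if cfg["fold_after_each_conv"]:
            rows //= 2  # each folding layer halves the number of rows
    return rows * cfg["k_top"] * cfg["feature_maps"][-1]

feature_dim = deep_feature_dim(config)  # 12 rows * 5 values * 8 maps
```

The result agrees with the 480-dimensional deep feature representation stated above, which suggests the reported settings are mutually consistent.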
Because the initial centroids have a significant impact on clustering results when utilizing the K-means algorithm, we repeat K-means multiple times with random initial centroids (specifically, 100 times for statistical significance), as in Huang 1_huang2014deep . All subspace vectors are normalized to unit norm before applying K-means, and the final results reported are the average of 5 trials with all clustering methods on the three text datasets.
|Method||SearchSnippets ACC (%)||StackOverflow ACC (%)||Biomedical ACC (%)|
|Method||SearchSnippets NMI (%)||StackOverflow NMI (%)||Biomedical NMI (%)|
In Table 4 and Table 5, we report the ACC and NMI performance of our proposed approaches and four baseline methods: K-means, SkipVec, RecNN and Para2vec based clustering methods. As a general observation, (1) BoW based approaches, including K-means (TF) and K-means (TF-IDF), and SkipVec based approaches do not perform well; (2) RecNN based approaches, both RecNN (Ave.) and RecNN (Top+Ave.), do better; (3) Para2vec achieves performance comparable with most baselines; and (4) the evaluation clearly demonstrates the superiority of our proposed method STC^2. These results are expected. For SkipVec based approaches, the off-the-shelf encoders are trained on the BookCorpus dataset zhu2015aligning , and then applied to our datasets to extract the sentence representations. The SkipVec encoders can produce generic sentence representations but may not perform well on specific datasets. In our experiments, the StackOverflow and Biomedical datasets contain many computer and medical terms, such as “ASP.NET”, “XML”, “C#”, “serum” and “glycolytic”. On closer inspection, we find that RecNN (Top) does poorly, even worse than K-means (TF-IDF). The reason may be that, although recursive neural models introduce a tree structure to capture compositional semantics, the vector of the top node mainly captures a biased semantic, while the average of all vectors in the tree nodes, as in RecNN (Ave.), better represents sentence-level semantics. We also observe that, although our proposed STC^2-LE and STC^2-LPI outperform both BoW based and RecNN based approaches across all three datasets, STC^2-AE and STC^2-LSA perform only similarly to RecNN (Ave.) and RecNN (Top+Ave.) on the StackOverflow and Biomedical datasets.
We further replace the CNN model in our framework (Figure 1) with other non-biased models, such as bi-LSTM and bi-GRU, and report the results in Table 6 and Table 7. As an instance, the binary codes generated from LPI are used to guide the learning of the bi-LSTM/bi-GRU models. From the results, we can see that bi-GRU and bi-LSTM based clustering methods do equally well, with no clear winner, and both achieve large improvements over LPI (best). Compared with these bi-LSTM/bi-GRU based models, the evaluation results still demonstrate the superiority of our CNN based clustering model in most cases. As in the results reported by Visin et al. visin2015renet , although bi-directional or multi-directional RNN models perform good non-biased feature extraction, they do not yet outperform state-of-the-art CNN on some tasks.
To make clear which factors make our proposed method work, we report bar charts of the ACC and NMI of our proposed methods and the corresponding baseline methods in Figure 3 and Figure 4. It is clear that, although AE and LSA do as well as or even better than LE and LPI, especially on the StackOverflow and Biomedical datasets, STC-LE and STC-LPI achieve much larger performance enhancements than STC-AE and STC-LSA do. A possible reason is that what makes the difference is the information carried by the pseudo supervision used to guide the learning of the CNN model. In the AE case in particular, the input features fed into the CNN model and the pseudo supervision employed to guide its learning both come from word embeddings. No different semantic features are brought into our proposed method, so the performance enhancements of STC-AE are limited. In the LSA case, as we know, LSA performs a matrix factorization to find the best subspace approximation of the original feature space, minimizing the global reconstruction error. And as 26_pennington2014glove ; li2015mfp recently pointed out, training word embeddings with word2vec or its variants is essentially a matrix factorization operation. Therefore, the information in the input and in the pseudo supervision of the CNN does not differ greatly, and the performance enhancements of STC-LSA are also not quite satisfactory. In the LE and LPI cases, as we know, LE extracts the manifold structure of the original feature space, and LPI extracts both the geometric and the discriminating structure of the original feature space 4_cai2005document . We conjecture that STC-LE and STC-LPI improve on LE and LPI by a large margin because both LE and LPI capture useful semantic features, and these features also differ from the word embeddings used as input to the CNN. From this view, our proposed STC has the potential to be more effective when the pseudo supervision captures semantically meaningful features that are sufficiently different from the input of the CNN.
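The statement above that LSA finds the best subspace approximation by minimizing the global reconstruction error can be illustrated with a truncated SVD. The sketch below is only an illustration under stated assumptions: the term-by-document matrix `X` holds made-up toy counts, `lsa_embed` is a hypothetical helper name, and the exact LSA configuration used in the experiments is not given in this section.

```python
import numpy as np


def lsa_embed(term_doc, k):
    """Rank-k LSA via truncated SVD of a term-by-document matrix.

    Returns the k-dimensional document vectors, the rank-k reconstruction,
    and the singular values that were discarded.
    """
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    doc_vectors = (np.diag(s[:k]) @ vt[:k]).T           # shape: (n_docs, k)
    reconstruction = u[:, :k] @ np.diag(s[:k]) @ vt[:k]  # best rank-k approximation
    return doc_vectors, reconstruction, s[k:]


# Toy counts (rows: terms, columns: short texts); values are invented.
X = np.array(
    [
        [2.0, 1.0, 0.0, 0.0],
        [1.0, 2.0, 0.0, 1.0],
        [0.0, 0.0, 3.0, 1.0],
        [0.0, 1.0, 1.0, 2.0],
        [1.0, 0.0, 0.0, 1.0],
    ]
)

docs, X_k, discarded = lsa_embed(X, k=2)

# Frobenius reconstruction error of the rank-2 approximation.
error = np.linalg.norm(X - X_k)
# By the Eckart-Young theorem this equals the norm of the discarded singular values.
expected_error = np.sqrt(np.sum(discarded ** 2))
```

The equality between `error` and `expected_error` is what makes the truncated SVD the minimizer of the global reconstruction error among all rank-k matrices.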
Furthermore, from the results of K-means and AE in Tables 4-5 and Figures 3-4, we note that TF-IDF weighting gives a more remarkable improvement for K-means, while TF weighting works better than TF-IDF weighting for Average Embedding. A possible reason is that pre-trained word embeddings encode useful information from an external corpus and can obtain even better results without TF-IDF weighting. Meanwhile, we find that LE achieves unusually good performance compared with LPI, LSA and AE on the SearchSnippets dataset, which is not observed on the other two datasets. To clarify this, and to better illustrate our proposed approaches and the baselines, we further report 2-dimensional text embeddings on SearchSnippets in Figure 5, using t-SNE (http://lvdmaaten.github.io/tsne/) 39_van2008visualizing to obtain a distributed stochastic neighbor embedding of the feature representations used in the clustering methods. We can see that the results from AE and LSA seem to be fairly good, or even better than those from LE and LPI, which is not the same as the ACC and NMI results in Figures 3-4. Meanwhile, RecNN (Ave.) performs better than BoW (both TF and TF-IDF) while RecNN (Top) does not, which is the same as the ACC and NMI results in Table 4 and Table 5. We take both the “same as” and the “not the same as” cases above as a good illustration that a visualization tool such as t-SNE provides useful information for assessing results, and that this information differs from what ACC and NMI provide. Moreover, from this complementary view of t-SNE, we can see that our STC-AE, STC-LSA, STC-LE, and STC-LPI show more clear-cut margins among different semantic topics (that is, tags/labels) than AE, LSA, LE and LPI, respectively, and also than both the BoW and RecNN based baselines.
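The difference between TF and TF-IDF weighting for Average Embedding can be made concrete with a small sketch. Everything here is illustrative rather than the experimental setup: the 2-dimensional word vectors and the three tokenized texts are toy data, the function names are hypothetical, and the smoothed IDF formula `log(N / df) + 1` is an assumed variant.

```python
import math
from collections import Counter


def idf_weights(tokenized_texts):
    """Assumed IDF variant: log(N / document frequency) + 1."""
    n_docs = len(tokenized_texts)
    doc_freq = Counter(w for text in tokenized_texts for w in set(text))
    return {w: math.log(n_docs / df) + 1.0 for w, df in doc_freq.items()}


def average_embedding(tokens, vectors, idf=None):
    """Weighted mean of word vectors: TF weights if idf is None, else TF-IDF."""
    counts = Counter(t for t in tokens if t in vectors)
    if not counts:
        return None  # no known word in this text
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word, tf in counts.items():
        weight = tf * (idf[word] if idf else 1.0)
        weight_sum += weight
        for i, value in enumerate(vectors[word]):
            total[i] += weight * value
    return [x / weight_sum for x in total]


# Toy word vectors standing in for pre-trained embeddings.
vectors = {"java": [1.0, 0.0], "error": [0.0, 1.0], "python": [1.0, 1.0]}
texts = [["java", "error"], ["python", "error"], ["java", "java", "error"]]

idf = idf_weights(texts)
tf_vec = average_embedding(texts[2], vectors)
tfidf_vec = average_embedding(texts[2], vectors, idf)
```

In this toy case "error" occurs in every text, so TF-IDF weighting shifts the text vector toward the rarer word "java"; whether that shift helps clustering depends on the dataset, as the results above indicate.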
From all these results, evaluated with ACC, NMI and t-SNE on three datasets, we conclude that our proposed approaches are effective at obtaining useful semantic features for short text clustering.
With the emergence of social media, short text clustering has become an increasingly important task. This paper explores a new perspective on clustering short texts, based on deep feature representations learned by the proposed self-taught convolutional neural networks. Our framework works without any external tags/labels or complicated NLP pre-processing, and it is flexible, in that traditional dimensionality reduction approaches can be used to obtain performance enhancements. Our extensive experimental study on three short text datasets shows that our approach achieves significantly better performance. In the future, how to select and incorporate more effective semantic features into the proposed framework calls for more research.
We would like to thank reviewers for their comments, and acknowledge Kaggle and BioASQ for making the datasets available. This work is supported by the National Natural Science Foundation of China (No. 61602479, No. 61303172, No. 61403385) and the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDB02070005).
S. Lai, L. Xu, K. Liu, J. Zhao, Recurrent convolutional neural networks for text classification, in: Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, P. Kuksa, Natural language processing (almost) from scratch, The Journal of Machine Learning Research 12 (2011) 2493–2537.
G. Lin, C. Shen, D. Suter, A. v. d. Hengel, A general two-step approach to learning-based hashing, in: Computer Vision (ICCV), 2013 IEEE International Conference on, IEEE, 2013, pp. 2552–2559.
R. Raina, A. Battle, H. Lee, B. Packer, A. Y. Ng, Self-taught learning: transfer learning from unlabeled data, in: Proceedings of the 24th international conference on Machine learning, ACM, 2007, pp. 759–766.
A. Y. Ng, M. I. Jordan, Y. Weiss, et al., On spectral clustering: Analysis and an algorithm, Advances in neural information processing systems 2 (2002) 849–856.
J. Turian, L. Ratinov, Y. Bengio, Word representations: a simple and general method for semi-supervised learning, in: Proceedings of the 48th annual meeting of the association for computational linguistics, Association for Computational Linguistics, 2010, pp. 384–394.
J. Duchi, E. Hazan, Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization, The Journal of Machine Learning Research 12 (2011) 2121–2159.
X.-H. Phan, L.-M. Nguyen, S. Horiguchi, Learning to classify short and sparse text & web with hidden topics from large-scale data collections, in: Proceedings of the 17th international conference on World Wide Web, ACM, 2008, pp. 91–100.
C. H. Papadimitriou, K. Steiglitz, Combinatorial optimization: algorithms and complexity, Courier Corporation, 1998.