Short text clustering is a challenging problem due to the sparseness of its text representation. Here we propose a flexible Self-Taught Convolutional neural network framework for Short Text Clustering (dubbed STC²), which can flexibly and successfully incorporate more useful semantic features and learn non-biased deep text representations in an unsupervised manner. In our framework, the original raw text features are first embedded into compact binary codes by using an existing unsupervised dimensionality reduction method. Then, word embeddings are explored and fed into convolutional neural networks to learn deep feature representations, while the output units are used to fit the pre-trained binary codes in the training process. Finally, we obtain the optimal clusters by employing K-means to cluster the learned representations. Extensive experimental results demonstrate that the proposed framework is effective and flexible, and outperforms several popular clustering methods when tested on three public short text datasets.
Short text clustering is of great importance due to its various applications, such as user profiling liritterhovy:2014:P141 and recommendation wangEtAl:2010:ACL1 , for the social media datasets that emerge day by day. However, short text clustering suffers from the data sparsity problem: most words occur only once in each short text 15_aggarwal2012survey . As a result, the Term Frequency-Inverse Document Frequency (TF-IDF) measure does not work well in the short text setting. In order to address this problem, some researchers work on expanding and enriching the context of the data from Wikipedia 29_banerjee2007clustering or an ontology 33_fodeh2011ontology . However, these methods require solid Natural Language Processing (NLP) knowledge and still use high-dimensional representations, which may waste both memory and computation time. Another way to overcome these issues is to explore sophisticated models to cluster short texts. For example, Yin and Wang 30_yin2014dirichlet proposed a Dirichlet multinomial mixture model-based approach for short text clustering. Yet how to design an effective model is an open question, and most of these methods, trained directly on Bag-of-Words (BoW) features, are shallow structures which cannot preserve accurate semantic similarities.

Recently, with the help of word embeddings, neural networks have demonstrated great performance in constructing text representations, such as the Recursive Neural Network (RecNN) 24_socher2011semi ; 35_socher2013recursive and the Recurrent Neural Network (RNN) 38_mikolov2011extensions . However, RecNN exhibits high time complexity to construct the textual tree, and RNN, which uses the hidden layer computed at the last word to represent the text, is a biased model where later words are more dominant than earlier words 14_lai2015rcnn . In contrast, for non-biased models, the learned representation of one text can be extracted from all the words in the text with non-dominant learned weights. More recently, the Convolutional Neural Network (CNN), the most popular non-biased model, which applies convolutional filters to capture local features, has achieved better performance in many NLP applications, such as sentence modeling 16_blunsom2014convolutional , relation classification 34_zeng2014relation , and other traditional NLP tasks 19_collobert2011natural . Most previous work applies CNN to supervised NLP tasks, while in this paper we aim to explore the power of CNN on one unsupervised NLP task, short text clustering.

We systematically introduce a simple yet surprisingly powerful Self-Taught Convolutional neural network framework for Short Text Clustering, called STC². An overall architecture of our proposed approach is illustrated in Figure 1. Inspired by 28_zhang2010self ; TwoStepHash_Lin_2013 , we adopt a self-taught learning framework for our task. In particular, the original raw text features are first embedded into compact binary codes B with the help of one traditional unsupervised dimensionality reduction function. Then text matrices projected from word embeddings are fed into a CNN model to learn the deep feature representation h, and the output units are used to fit the pre-trained binary codes B. After obtaining the learned features, the K-means algorithm is employed on them to cluster texts into K clusters. We call our approach "self-taught" because the CNN model is learnt from the pseudo labels generated in the previous stage, which is quite different from the term "self-taught" in raina2007self .
Our main contributions can be summarized as follows:
We propose a flexible short text clustering framework which explores the feasibility and effectiveness of combining CNN and traditional unsupervised dimensionality reduction methods.
Non-biased deep feature representations can be learned through our self-taught CNN framework, which does not use any external tags/labels or complicated NLP preprocessing.
We conduct extensive experiments on three short text datasets. The experimental results demonstrate that the proposed method achieves excellent performance in terms of both accuracy and normalized mutual information.
This work is an extension of our conference paper xu2015short , and they differ in the following aspects. First, we put forward a general self-taught CNN framework in this paper which can flexibly couple various semantic features, whereas the conference version can be seen as a specific example of this work. Second, in this paper we use a new short text dataset, Biomedical, in the experiments to verify the effectiveness of our approach. Third, we devote considerable effort to studying the influence of the various semantic features integrated in our self-taught CNN framework, which was not involved in the conference paper.
For the purpose of reproducibility, we make the datasets and software used in our experiments publicly available at https://github.com/jacoxu/STC2.
The remainder of this paper is organized as follows: In Section 2, we briefly survey several related works. In Section 3, we describe the proposed approach STC² and implementation details. Experimental results and analyses are presented in Section 4. Finally, conclusions are given in the last section.
In this section, we review the related work from the following two perspectives: short text clustering and deep neural networks.
There have been several studies that attempt to overcome the sparseness of short text representation. One way is to expand and enrich the context of the data. For example, Banerjee et al. 29_banerjee2007clustering proposed a method for improving the accuracy of short text clustering by enriching their representation with additional features from Wikipedia, and Fodeh et al. 33_fodeh2011ontology incorporate semantic knowledge from an ontology into text clustering. However, these works require solid NLP knowledge and still use high-dimensional representations, which may waste both memory and computation time. Another direction is to map the original features into a reduced space, such as Latent Semantic Analysis (LSA) deerwester1990indexing , Laplacian Eigenmaps (LE) 37_ng2002spectral , and Locality Preserving Indexing (LPI) niyogi2004locality . Some researchers have also explored sophisticated models to cluster short texts. For example, Yin and Wang 30_yin2014dirichlet proposed a Dirichlet multinomial mixture model-based approach for short text clustering. Moreover, some studies combine both of the above directions. For example, Tang et al. 32_tang2012enriching proposed a novel framework which enriches the text features by employing machine translation and simultaneously reduces the original features through matrix factorization techniques.

Although the above clustering methods can alleviate the sparseness of short text representation to some extent, most of them ignore word order in the text and are shallow structures which cannot fully capture accurate semantic similarities.
Recently, there has been a revival of interest in Deep Neural Networks (DNN), and many researchers have concentrated on using deep learning to learn features. Hinton and Salakhutdinov 23_hinton2006reducing use a Deep Autoencoder (DAE) to learn text representation. During the fine-tuning procedure, they use backpropagation to find codes that are good at reconstructing the word-count vector.
More recently, researchers have proposed to use an external corpus to learn a distributed representation for each word, called a word embedding 25_turian2010word , to improve DNN performance on NLP tasks. The Skip-gram and continuous bag-of-words models of Word2vec 21_mikolov2013distributed propose a simple single-layer architecture based on the inner product between two word vectors, and Pennington et al. 26_pennington2014glove introduce a new model for word representation, called GloVe, which captures global corpus statistics.

In order to learn compact representation vectors of sentences, Le and Mikolov le2014distributed directly extend Word2vec 21_mikolov2013distributed by predicting words in the sentence, which is named Paragraph Vector (Para2vec). Para2vec is still a shallow window-based method and needs a larger corpus to yield better performance. More neural networks utilize word embeddings to capture meaningful syntactic and semantic regularities, such as RecNN 24_socher2011semi ; 35_socher2013recursive and RNN 38_mikolov2011extensions . However, RecNN exhibits high time complexity to construct the textual tree, and RNN, using the hidden layer computed at the last word to represent the text, is a biased model. Recently, Long Short-Term Memory (LSTM) hochreiter1997long and Gated Recurrent Unit (GRU) cho2014learning , as sophisticated recurrent hidden units of RNN, have presented their advantages in many sequence generation problems, such as machine translation sutskever2014sequence , speech recognition graves2013speech , and text conversation shang2015neural . In contrast, CNN is better suited to learning non-biased implicit features, and has been successfully exploited for many supervised NLP learning tasks as described in Section 1; various CNN-based variants have been proposed in recent works, such as the Dynamic Convolutional Neural Network (DCNN) 16_blunsom2014convolutional , the Gated Recursive Convolutional Neural Network (grConv) cho2014properties and the Self-Adaptive Hierarchical Sentence model (AdaSent) zhao2015self .

Recently, Visin et al. visin2015renet have attempted to replace the convolutional layers in CNN with four RNNs, called ReNet, to learn non-biased features for object recognition; the RNNs sweep over lower-layer features in different directions: (1) bottom to top, (2) top to bottom, (3) left to right and (4) right to left. However, ReNet does not outperform state-of-the-art convolutional neural networks on any of the three benchmark datasets, and it is also a supervised learning model for classification. Inspired by the Skip-gram model of word2vec mikolov2013efficient ; 21_mikolov2013distributed , the Skip-thought model kiros2015skip describes an approach for unsupervised learning of a generic, distributed sentence encoder. Similar to the Skip-gram model, the Skip-thought model trains an encoder-decoder model that tries to reconstruct the surrounding sentences of an encoded sentence, and the authors released an off-the-shelf encoder to extract sentence representations. Some researchers have also introduced continuous Skip-gram and negative sampling to CNN for learning visual representations in an unsupervised manner wang2015unsupervised . This paper, from a new perspective, puts forward a general self-taught CNN framework which can flexibly couple various semantic features and achieves good performance on one unsupervised learning task, short text clustering.

Assume that we are given a dataset of n training texts denoted as X = {x_i : x_i ∈ R^d}, i = 1, …, n, where d is the dimensionality of the original BoW representation. Denote its tag set as T and the pre-trained word embedding set as E, where d_w is the dimensionality of the word vectors and |V| is the vocabulary size. In order to learn the r-dimensional deep feature representation h from the CNN in an unsupervised manner, some unsupervised dimensionality reduction method is employed to guide the learning of the CNN model. Our goal is to cluster these texts into K clusters C based on the learned deep feature representation while preserving the semantic consistency.
As depicted in Figure 1, the proposed framework consists of three components: a deep convolutional neural network (CNN), an unsupervised dimensionality reduction function, and a K-means module. In the following sections, we first present the first two components respectively, then give the trainable parameters and the objective function used to learn the deep feature representation, and finally describe how to perform clustering on the learned features.
In this section, we briefly review one popular deep convolutional neural network, the Dynamic Convolutional Neural Network (DCNN) 16_blunsom2014convolutional , which serves as the instance of CNN in the following sections. As the foundation of our proposed method, it was originally proposed for a fully supervised learning task, text classification.
Taking a neural network with two convolutional layers in Figure 2 as an example, the network transforms raw input text into a powerful representation. In particular, each raw text vector x is projected into a matrix representation S ∈ R^{d_w × s} by looking up the word embeddings E, where s is the length of one text. We also let W₁ and W₂ denote the weights of the neural network. The network defines a transformation f: R^d → R^r which transforms an input raw text x into an r-dimensional deep representation h. There are three basic operations, described as follows:
Wide one-dimensional convolution: This operation is applied to an individual row of the sentence matrix S ∈ R^{d_w × s}, and yields a resulting matrix C ∈ R^{d_w × (s + m − 1)}, where m is the width of the convolutional filter.
Folding: In this operation, every two rows in a feature map are simply summed component-wise. For a map of d_w rows, folding returns a map of d_w/2 rows, thus halving the size of the representation. Note that the folding operation does not introduce any additional parameters.
Dynamic max pooling: Denoting the pooling parameter as k, (k-)max pooling selects the sub-matrix of the k highest values in each row of the matrix. For dynamic max pooling, the pooling parameter k is dynamically selected in order to allow for a smooth extraction of higher-order and longer-range features 16_blunsom2014convolutional . Given a fixed pooling parameter k_top for the topmost convolutional layer, the pooling parameter of the l-th convolutional layer can be computed as follows:

k_l = max( k_top , ⌈ ((L − l) / L) · s ⌉ ),    (1)

where L is the total number of convolutional layers in the network and s is the length of the input text.
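The pooling rule in Eqn. (1) can be sketched in a few lines. This is an illustrative sketch, not code from the paper; `pool_size` and `k_max_pool` are hypothetical helper names.

```python
import math
import numpy as np

def pool_size(l, L, k_top, s):
    """Pooling parameter for the l-th of L convolutional layers,
    per Eqn. (1): interpolate between the text length s and the
    fixed top-layer value k_top."""
    if l == L:
        return k_top
    return max(k_top, math.ceil((L - l) / L * s))

def k_max_pool(row, k):
    """Keep the k highest values of a row, preserving their order."""
    idx = np.sort(np.argsort(row)[-k:])
    return row[idx]
```

For example, with a text of length s = 18, L = 2 layers and k_top = 5, the first layer pools to max(5, ⌈18/2⌉) = 9 values per row.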
As described in Figure 1, the dimensionality reduction function is defined as follows:

Y = f_dr(X),    (2)

where Y ∈ R^{n × q} are the q-dimensional reduced latent space representations. Here, we take four popular dimensionality reduction methods as examples in our framework.
Average Embedding (AE): This method directly averages the word embeddings, weighted respectively with TF and TF-IDF. Huang et al. 13_huang2012improving used this strategy as the global context in their task, and Socher et al. 35_socher2013recursive and Lai et al. 14_lai2015rcnn used this method for text classification. The weighted average of all word vectors in one text can be computed as follows:

v(x) = ( Σ_{i=1}^{s} w(x_i) · e(x_i) ) / ( Σ_{i=1}^{s} w(x_i) ),    (3)

where w(x_i) can be any weighting function that captures the importance of word x_i in the text x, and e(x_i) is its word vector.
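The weighted average in Eqn. (3) can be sketched as follows; `average_embedding` is an illustrative name, and the weight dictionary stands in for any TF or TF-IDF weighting function.

```python
import numpy as np

def average_embedding(tokens, emb, weight):
    """Weighted average of word vectors for one text.
    `emb` maps word -> vector; `weight` maps word -> importance
    (e.g. TF or TF-IDF); words without vectors are skipped."""
    vecs, ws = [], []
    for w in tokens:
        if w in emb:
            vecs.append(emb[w])
            ws.append(weight.get(w, 1.0))
    if not vecs:
        return None
    V, wv = np.array(vecs), np.array(ws)
    return (wv[:, None] * V).sum(axis=0) / wv.sum()
```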
Latent Semantic Analysis (LSA): LSA deerwester1990indexing is the most popular global matrix factorization method, which applies a dimension-reducing linear projection, the Singular Value Decomposition (SVD), to the corresponding term/document matrix. Suppose the rank of X is ρ; LSA decomposes X into the product of three other matrices:

X = U Σ V^T,    (4)

where Σ = diag(σ_1, …, σ_ρ) holds the singular values of X, U is a set of left singular vectors and V is a set of right singular vectors. LSA uses the top q singular vectors as the transformation matrix to embed the original text features into a q-dimensional subspace deerwester1990indexing .
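A minimal sketch of the LSA projection in Eqn. (4), assuming texts are stored as rows of X; `lsa` is an illustrative name, not an API from the paper.

```python
import numpy as np

def lsa(X, q):
    """Embed the n x d matrix X (rows = texts) into a q-dimensional
    subspace by truncated SVD: project onto the top-q right singular
    directions, equivalent to U_q * diag(sigma_1..sigma_q)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:q].T   # shape (n, q)
```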
Laplacian Eigenmaps (LE): The top eigenvectors of the graph Laplacian, defined on the similarity matrix of texts, are used in this method, which can discover the manifold structure of the text space 37_ng2002spectral . In order to avoid storing the dense similarity matrix, many approximation techniques have been proposed to reduce the memory usage and computational complexity of LE. There are two representative approximation methods, the sparse similarity matrix and the Nyström approximation. Following previous studies 4_cai2005document ; 28_zhang2010self , we select the former technique to construct the local similarity matrix S by using the heat kernel as follows:

S_ij = exp( −‖x_i − x_j‖² / t ) if x_j ∈ N_k(x_i), and S_ij = 0 otherwise,    (5)

where t is a tuning parameter (default 1) and N_k(x_i) represents the set of k nearest neighbors of x_i. By introducing a diagonal matrix D whose entries are given by D_ii = Σ_j S_ij, the graph Laplacian can be computed by L = D − S. The optimal real-valued matrix Y can be obtained by solving the following objective function:

min_Y Tr(Y^T L Y)  s.t.  Y^T D Y = I,  Y^T D 1 = 0,    (6)

where Tr(·) is the trace function, the constraint Y^T D Y = I requires the different dimensions to be uncorrelated, and the constraint Y^T D 1 = 0 requires each dimension to achieve equal probability as positive or negative.
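The steps in Eqns. (5)-(6) can be sketched densely as follows. This is a simplified illustration: it uses a dense similarity matrix for clarity (the text above selects a sparse kNN matrix to save memory), and solves the generalized eigenproblem L y = λ D y by D^{-1/2} whitening.

```python
import numpy as np

def laplacian_eigenmaps(X, q, k=15, t=1.0):
    """LE sketch: kNN heat-kernel similarity (Eqn. 5), unnormalized
    Laplacian L = D - S, bottom non-trivial eigenvectors of
    L y = lam D y (Eqn. 6)."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared dists
    S = np.zeros((n, n))
    for i in range(n):
        nn = np.argsort(d2[i])[1:k + 1]        # k nearest, skip self
        S[i, nn] = np.exp(-d2[i, nn] / t)
    S = np.maximum(S, S.T)                     # symmetrize the kNN graph
    D = np.diag(S.sum(1))
    L = D - S
    Dm = np.diag(1.0 / np.sqrt(np.diag(D)))    # D^{-1/2}
    vals, vecs = np.linalg.eigh(Dm @ L @ Dm)   # whitened eigenproblem
    return Dm @ vecs[:, 1:q + 1]               # drop the trivial eigenvector
```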
Locality Preserving Indexing (LPI): This method extends LE to deal with unseen texts by approximating a linear function 28_zhang2010self , and the subspace vectors are obtained by finding the optimal linear approximations to the eigenfunctions of the Laplace-Beltrami operator on the Riemannian manifold niyogi2004locality . Similar to LE, we first construct the local similarity matrix S; then the graph Laplacian can be computed by L = D − S, where D_ii measures the local density around x_i and is equal to Σ_j S_ij. We then compute the eigenvectors a and eigenvalues λ of the following generalized eigenproblem:

X L X^T a = λ X D X^T a.    (7)

The mapping function f(x) = a^T x can then be obtained and applied to unseen data 4_cai2005document .
All of the above methods claim better performance in capturing the semantic similarity between texts in the reduced latent space representation Y than in the original representation X, while the performance of short text clustering can be further enhanced with the help of our framework, the self-taught CNN.
The last layer of the CNN is an output layer, defined as follows:

O = W h,    (8)

where h is the r-dimensional deep feature representation, O ∈ R^q is the output vector and W ∈ R^{q × r} is the weight matrix.
In order to incorporate the latent semantic features Y, we first binarize the real-valued vectors Y to binary codes B by setting the threshold to be the median vector of Y. Then, the output vector O is used to fit the binary codes B via the logistic operation as follows:

p_i = 1 / (1 + exp(−O_i)).    (9)
All parameters to be trained are defined as θ:

θ = { E, W₁, W₂, W },    (10)

i.e., the word embeddings, the weights of the two convolutional layers, and the output-layer weights.
Given the training text collection X and the pre-trained binary codes B, the log-likelihood of the parameters can be written as follows:

J(θ) = Σ_{i=1}^{n} Σ_{j=1}^{q} [ b_{ij} log p_{ij} + (1 − b_{ij}) log(1 − p_{ij}) ],    (11)

where b_{ij} is the j-th bit of the binary code of text x_i and p_{ij} is the corresponding logistic output from Eqn. (9).
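The median-threshold binarization and the logistic fitting objective can be sketched as follows; this is an illustrative numpy sketch (the loss is the mean negative log-likelihood, i.e. the negation of Eqn. (11) averaged over entries), and the function names are hypothetical.

```python
import numpy as np

def binarize_by_median(Y):
    """Turn real-valued reduced representations into binary codes by
    thresholding each dimension at its median over the corpus."""
    return (Y > np.median(Y, axis=0)).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_loss(O, B):
    """Mean negative log-likelihood of binary codes B under the
    logistic outputs sigmoid(O): what the CNN is trained to minimize."""
    P = sigmoid(O)
    eps = 1e-12  # guard against log(0)
    return -np.mean(B * np.log(P + eps) + (1 - B) * np.log(1 - P + eps))
```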
Following previous work 16_blunsom2014convolutional , we train the network with mini-batches by backpropagation and perform gradient-based optimization using the Adagrad update rule 36_duchi2011adaptive . For regularization, we apply dropout with a 50% rate to the penultimate layer 16_blunsom2014convolutional ; 22_kim2014convolutional .
With the given short texts, we first utilize the trained deep neural network to obtain the semantic representations h, and then employ the traditional K-means algorithm to perform clustering.
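The clustering stage can be sketched as plain Lloyd's K-means on the learned features. One assumption for determinism in this sketch: initial centroids are the first K points, whereas the experiments below restart K-means many times with random centroids.

```python
import numpy as np

def kmeans(H, K, iters=100):
    """Lloyd's K-means on the learned deep representations H.
    Rows are L2-normalized first, matching the experimental setup."""
    H = H / np.linalg.norm(H, axis=1, keepdims=True)
    centers = H[:K].copy()  # deterministic init for this sketch
    for _ in range(iters):
        d = ((H[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        new = np.array([H[labels == k].mean(0) if np.any(labels == k)
                        else centers[k] for k in range(K)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels
```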
Table 1. Statistics of the text datasets: C is the number of classes, Num. the number of texts, Len. the average/maximum text length, and Vocab. the vocabulary size.

| Dataset        | C  | Num.   | Len. (avg/max) | Vocab. |
| SearchSnippets | 8  | 12,340 | 17.88/38       | 30,642 |
| StackOverflow  | 20 | 20,000 | 8.31/34        | 22,956 |
| Biomedical     | 20 | 20,000 | 12.88/53       | 18,888 |
Table 2. Semantic topics of the three datasets.

SearchSnippets (8 different domains): business, computers, health, education, culture, engineering, sports, politics.

StackOverflow (20 semantic tags): svn, oracle, bash, apache, excel, matlab, cocoa, visualstudio, osx, wordpress, spring, hibernate, scala, sharepoint, ajax, drupal, qt, haskell, linq, magento.

Biomedical (20 MeSH major topics): aging, chemistry, cats, erythrocytes, glucose, potassium, lung, lymphocytes, spleen, mutation, skin, norepinephrine, insulin, prognosis, risk, myocardium, sodium, mathematics, swine, temperature.
We test our proposed approach on three public short text datasets. The summary statistics and semantic topics of these datasets are described in Table 1 and Table 2.
SearchSnippets (http://jwebpro.sourceforge.net/datawebsnippets.tar.gz): This dataset was selected from the results of web search transactions using predefined phrases of 8 different domains by Phan et al. 20_phan2008learning .
StackOverflow: We use the challenge data published on Kaggle.com (https://www.kaggle.com/c/predictclosedquestionsonstackoverflow/download/train.zip). The raw dataset consists of 3,370,528 samples posted from July 31, 2012 to August 14, 2012. In our experiments, we randomly select 20,000 question titles from 20 different tags, as listed in Table 2.
Biomedical: We use the challenge data published on the official BioASQ website (http://participantsarea.bioasq.org/). In our experiments, we randomly select 20,000 paper titles from 20 different MeSH (http://en.wikipedia.org/wiki/Medical_Subject_Headings) major topics, as listed in Table 2. As described in Table 1, the maximum length of the selected paper titles is 53 (see, e.g., http://www.ncbi.nlm.nih.gov/pubmed/207752).
For these datasets, we randomly select 10% of the data as the development set. Since SearchSnippets has been preprocessed by Phan et al. 20_phan2008learning , we do not further process this dataset. In StackOverflow, texts contain a lot of computer terminology, and symbols and capital letters are meaningful, thus we do not perform any preprocessing. For Biomedical, we remove the symbols and convert letters into lower case.
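The Biomedical preprocessing described above (strip symbols, lowercase) amounts to something like the following sketch; the exact symbol set used in the experiments is not specified, so the regex here is an assumption.

```python
import re

def preprocess_biomedical(text):
    """Strip non-alphanumeric symbols and lowercase, as done for the
    Biomedical titles only. SearchSnippets is already preprocessed,
    and StackOverflow is kept verbatim because its symbols and
    capitalization carry meaning."""
    text = re.sub(r"[^0-9A-Za-z\s]", " ", text)   # drop symbols
    return re.sub(r"\s+", " ", text).lower().strip()
```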
Table 3. Coverage of the learned word embeddings on the three datasets: number (and percentage) of vocabulary words and of tokens covered by the trained vectors.

| Dataset        | Words covered | Tokens covered |
| SearchSnippets | 23,826 (77%)  | 211,575 (95%)  |
| StackOverflow  | 19,639 (85%)  | 162,998 (97%)  |
| Biomedical     | 18,381 (97%)  | 257,184 (99%)  |
We use the publicly available word2vec tool (https://code.google.com/p/word2vec/) to train word embeddings, and most parameters are set the same as in Mikolov et al. 21_mikolov2013distributed for training word vectors on Google News (https://groups.google.com/d/msg/word2vectoolkit/lxbl_MB29Ic/NDLGId3KPNEJ), except that the vector dimensionality is set to 48 and the minimum count to 5. For SearchSnippets, we train word vectors on Wikipedia dumps (http://dumps.wikimedia.org/enwiki/latest/enwikilatestpagesarticles.xml.bz2). For StackOverflow, we train word vectors on the whole corpus of the StackOverflow dataset described above, which includes the question titles and post contents. For Biomedical, we train word vectors on all titles and abstracts of the 2014 training articles. The coverage of these learned vectors on the three datasets is listed in Table 3, and the words not present in the set of pre-trained words are initialized randomly.
In our experiments, several widely used text clustering methods are compared with our approach. Besides K-means, Skip-thought Vectors, Recursive Neural Network and Paragraph Vector based clustering methods, four baseline clustering methods are directly based on the popular unsupervised dimensionality reduction methods described in Section 3.2. We further compare our approach with some other non-biased neural networks, such as bidirectional RNNs. More details are listed as follows:
K-means: K-means 2_wagstaff2001constrained on the original keyword features, weighted respectively with term frequency (TF) and term frequency-inverse document frequency (TF-IDF).
Skip-thought Vectors (SkipVec): This baseline kiros2015skip provides an off-the-shelf encoder to produce highly generic sentence representations. The encoder (https://github.com/ryankiros/skipthoughts) is trained on a large collection of novels and provides three encoder modes: the unidirectional encoder (SkipVec (Uni)) with 2,400 dimensions, the bidirectional encoder (SkipVec (Bi)) with 2,400 dimensions, and the combined encoder (SkipVec (Combine)), concatenating SkipVec (Uni) and SkipVec (Bi) of 2,400 dimensions each. K-means is employed on these vector representations respectively.
Recursive Neural Network (RecNN): In 24_socher2011semi , the tree structure is first greedily approximated via an unsupervised recursive autoencoder. Then, semi-supervised recursive autoencoders are used to capture the semantics of texts based on the predicted structure. In order to make this recursive-based method completely unsupervised, we remove the cross-entropy error in the second phase to learn vector representations, and subsequently employ K-means on the learned vectors of the top tree node and on the average of all vectors in the tree.
Paragraph Vector (Para2vec): K-means on the fixed-size feature vectors generated by Paragraph Vector (Para2vec) le2014distributed , an unsupervised method to learn distributed representations of words and paragraphs. In our experiments, we use the open-source software (https://github.com/mesnilgr/iclr15) released by Mesnil et al. mesnil2014ensemble .
Average Embedding (AE): K-means on the weighted average vectors of the word embeddings, weighted respectively with TF and TF-IDF. The dimension of the average vectors is equal to, and decided by, the dimension of the word vectors used in our experiments.
Latent Semantic Analysis (LSA): K-means on the reduced subspace vectors generated by the Singular Value Decomposition (SVD) method. The dimension of the subspace is set by default to the number of clusters; we also iterate over dimensions from 10 to 200 in steps of 10 to get the best performance, which is 10 on SearchSnippets, 20 on StackOverflow and 20 on Biomedical in our experiments.
Laplacian Eigenmaps (LE): This baseline, using Laplacian Eigenmaps and subsequently employing the K-means algorithm, is well known as spectral clustering 12_belkin2001laplacian . The dimension of the subspace is set by default to the number of clusters 37_ng2002spectral ; 4_cai2005document ; we also iterate over dimensions from 10 to 200 in steps of 10 to get the best performance, which is 20 on SearchSnippets, 70 on StackOverflow and 30 on Biomedical in our experiments.

Locality Preserving Indexing (LPI): This baseline, projecting the texts into a lower-dimensional semantic space, can discover both the geometric and discriminating structures of the original feature space 4_cai2005document . The dimension of the subspace is set by default to the number of clusters 4_cai2005document ; we also iterate over dimensions from 10 to 200 in steps of 10 to get the best performance, which is 20 on SearchSnippets, 80 on StackOverflow and 30 on Biomedical in our experiments.
Bidirectional RNN (biRNN): We replace the CNN model in our framework, as in Figure 1, with biRNN models. In particular, LSTM and GRU units are used in the experiments. In order to generate a fixed-length document representation from the variable-length vector sequences, for both the biLSTM and biGRU based clustering methods, we further utilize three pooling methods: last pooling (using the last hidden state), mean pooling and element-wise max pooling. These pooling methods were used respectively in the previous works palangi2015deep ; cho2014learning , tang2015document and 14_lai2015rcnn . For regularization, the training gradients of all parameters with an L2 norm larger than 40 are clipped to 40, following previous work sukhbaatar2015end .
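The gradient clipping used for the biRNN baselines (threshold 40) can be sketched as follows; `clip_gradient` is an illustrative helper name.

```python
import numpy as np

def clip_gradient(g, max_norm=40.0):
    """Rescale a parameter gradient so its L2 norm does not
    exceed max_norm; gradients under the threshold pass through."""
    norm = np.linalg.norm(g)
    if norm > max_norm:
        g = g * (max_norm / norm)
    return g
```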
The clustering performance is evaluated by comparing the clustering results of texts with the tags/labels provided by the text corpus. Two metrics, accuracy (ACC) and normalized mutual information (NMI), are used to measure the clustering performance 4_cai2005document ; 1_huang2014deep . Given a text x_i, let c_i and y_i be the obtained cluster label and the label provided by the corpus, respectively. Accuracy is defined as:

ACC = ( Σ_{i=1}^{n} δ(y_i, map(c_i)) ) / n,    (12)

where n is the total number of texts, δ(x, y) is the indicator function that equals one if x = y and equals zero otherwise, and map(c_i) is the permutation mapping function that maps each cluster label c_i to the equivalent label from the text data, obtained by the Hungarian algorithm 7_papadimitriou1998combinatorial .
Normalized mutual information 8_chen2011parallel between the tag/label set Y and the cluster set C is a popular metric used for evaluating clustering tasks. It is defined as follows:

NMI(Y, C) = I(Y, C) / sqrt( H(Y) H(C) ),    (13)

where I(Y, C) is the mutual information between Y and C, H(·) is entropy, and the denominator sqrt(H(Y) H(C)) is used for normalizing the mutual information to lie in the range [0, 1].
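Both metrics in Eqns. (12)-(13) can be sketched in pure Python. One simplification in this sketch: the label mapping in ACC is found by brute force over permutations, which is exact only for small numbers of clusters; the experiments use the Hungarian algorithm, which scales to the 20-cluster datasets.

```python
import itertools
import math
from collections import Counter

def accuracy(true, pred):
    """Clustering accuracy (Eqn. 12): best one-to-one mapping of
    cluster labels to true labels, brute-forced over permutations."""
    labels = sorted(set(true) | set(pred))
    best = 0
    for perm in itertools.permutations(labels):
        mapping = dict(zip(labels, perm))
        best = max(best, sum(t == mapping[p] for t, p in zip(true, pred)))
    return best / len(true)

def nmi(true, pred):
    """Normalized mutual information (Eqn. 13): I(Y;C)/sqrt(H(Y)H(C))."""
    n = len(true)
    pt, pc = Counter(true), Counter(pred)
    joint = Counter(zip(true, pred))
    I = sum(c / n * math.log(c * n / (pt[t] * pc[p]))
            for (t, p), c in joint.items())
    H = lambda cnt: -sum(c / n * math.log(c / n) for c in cnt.values())
    return I / math.sqrt(H(pt) * H(pc))
```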
Table 4. ACC (%) of all compared methods on the three datasets (mean ± standard deviation).

| Method            | SearchSnippets | StackOverflow | Biomedical   |
| K-means (TF)      | 24.75 ± 2.22   | 13.51 ± 2.18  | 15.18 ± 1.78 |
| K-means (TF-IDF)  | 33.77 ± 3.92   | 20.31 ± 3.95  | 27.99 ± 2.83 |
| SkipVec (Uni)     | 28.23 ± 1.08   | 08.79 ± 0.19  | 16.44 ± 0.50 |
| SkipVec (Bi)      | 29.24 ± 1.57   | 09.59 ± 0.15  | 16.11 ± 0.60 |
| SkipVec (Combine) | 33.58 ± 1.95   | 09.34 ± 0.24  | 16.27 ± 0.33 |
| RecNN (Top)       | 21.21 ± 1.62   | 13.13 ± 0.80  | 13.73 ± 0.67 |
| RecNN (Ave.)      | 65.59 ± 5.35   | 40.79 ± 1.38  | 37.05 ± 1.27 |
| RecNN (Top+Ave.)  | 65.53 ± 5.64   | 40.45 ± 1.60  | 36.68 ± 1.29 |
| Para2vec          | 69.07 ± 2.53   | 32.55 ± 0.89  | 41.26 ± 1.22 |
| STC²-AE           | 68.34 ± 2.51   | 40.05 ± 1.77  | 37.44 ± 1.19 |
| STC²-LSA          | 73.09 ± 1.45   | 35.81 ± 1.80  | 38.47 ± 1.55 |
| STC²-LE           | 77.09 ± 3.99   | 51.13 ± 2.80  | 43.62 ± 1.00 |
| STC²-LPI          | 77.01 ± 4.13   | 51.14 ± 2.92  | 43.00 ± 1.25 |
Table 5. NMI (%) of all compared methods on the three datasets (mean ± standard deviation).

| Method            | SearchSnippets | StackOverflow | Biomedical   |
| K-means (TF)      | 09.03 ± 2.30   | 07.81 ± 2.56  | 09.36 ± 2.04 |
| K-means (TF-IDF)  | 21.40 ± 4.35   | 15.64 ± 4.68  | 25.43 ± 3.23 |
| SkipVec (Uni)     | 10.98 ± 0.93   | 02.24 ± 0.13  | 10.52 ± 0.41 |
| SkipVec (Bi)      | 09.27 ± 0.29   | 02.89 ± 0.20  | 10.15 ± 0.59 |
| SkipVec (Combine) | 13.85 ± 0.78   | 02.72 ± 0.34  | 10.72 ± 0.46 |
| RecNN (Top)       | 04.04 ± 0.74   | 09.90 ± 0.96  | 08.87 ± 0.53 |
| RecNN (Ave.)      | 50.55 ± 1.71   | 40.58 ± 0.91  | 33.85 ± 0.50 |
| RecNN (Top+Ave.)  | 50.44 ± 1.84   | 40.21 ± 1.18  | 33.75 ± 0.50 |
| Para2vec          | 50.51 ± 0.86   | 27.86 ± 0.56  | 34.83 ± 0.43 |
| STC²-AE           | 54.01 ± 1.55   | 38.22 ± 1.31  | 33.58 ± 0.48 |
| STC²-LSA          | 54.53 ± 1.47   | 34.38 ± 1.12  | 33.90 ± 0.67 |
| STC²-LE           | 63.16 ± 1.56   | 49.03 ± 1.46  | 38.05 ± 0.48 |
| STC²-LPI          | 62.94 ± 1.65   | 49.08 ± 1.49  | 38.18 ± 0.47 |
Most parameters are set uniformly across these datasets. Following a previous study 4_cai2005document , the number of nearest neighbors k in Eqn. (5) is fixed to 15 when constructing the graph structures for LE and LPI. For the CNN model, the network has two convolutional layers. The widths of the convolutional filters are both 3. The value of k_top for the top max pooling in Eqn. (1) is 5. The number of feature maps at the first convolutional layer is 12, and there are 8 feature maps at the second convolutional layer. Both convolutional layers are followed by a folding layer. We further set the dimension of the word embeddings d_w to 48. Finally, the dimension of the deep feature representation r is fixed to 480. Moreover, we set the learning rate to 0.01 and the mini-batch training size to 200. The output size q in Eqn. (8) is set the same as the best subspace dimensions of the baseline methods, as described in Section 4.3.
Since initial centroids have a significant impact on the clustering results of the K-means algorithm, we repeat K-means multiple times with random initial centroids (specifically, 100 times for statistical significance), as in Huang 1_huang2014deep . All subspace vectors are normalized to unit length before applying K-means, and the final results reported are the average of 5 trials of all clustering methods on the three text datasets.
Table 6. ACC (%) of the non-biased biLSTM/biGRU variants and STC²-LPI on the three datasets (mean ± standard deviation).

| Method        | SearchSnippets | StackOverflow | Biomedical   |
| biLSTM (last) | 64.50 ± 3.18   | 46.83 ± 1.79  | 36.50 ± 1.08 |
| biLSTM (mean) | 65.85 ± 4.18   | 44.93 ± 1.83  | 35.60 ± 1.21 |
| biLSTM (max)  | 61.70 ± 5.10   | 38.74 ± 1.62  | 32.83 ± 0.73 |
| biGRU (last)  | 70.18 ± 2.62   | 43.36 ± 1.46  | 35.19 ± 0.78 |
| biGRU (mean)  | 70.29 ± 2.61   | 44.53 ± 1.81  | 36.75 ± 1.21 |
| biGRU (max)   | 65.69 ± 1.02   | 54.40 ± 2.07  | 37.23 ± 1.19 |
| LPI (best)    | 47.11 ± 2.91   | 38.04 ± 1.72  | 37.15 ± 1.16 |
| STC²-LPI      | 77.01 ± 4.13   | 51.14 ± 2.92  | 43.00 ± 1.25 |
Table 7. NMI (%) of the non-biased biLSTM/biGRU variants and STC²-LPI on the three datasets (mean ± standard deviation).

| Method        | SearchSnippets | StackOverflow | Biomedical   |
| biLSTM (last) | 50.32 ± 1.15   | 41.89 ± 0.90  | 34.51 ± 0.34 |
| biLSTM (mean) | 52.11 ± 1.69   | 40.93 ± 0.91  | 34.03 ± 0.28 |
| biLSTM (max)  | 46.81 ± 2.38   | 36.73 ± 0.56  | 31.90 ± 0.23 |
| biGRU (last)  | 56.00 ± 0.75   | 38.73 ± 0.78  | 32.91 ± 0.40 |
| biGRU (mean)  | 55.76 ± 0.85   | 39.84 ± 0.94  | 34.27 ± 0.27 |
| biGRU (max)   | 51.11 ± 1.06   | 51.10 ± 1.31  | 32.74 ± 0.34 |
| LPI (best)    | 38.48 ± 2.39   | 27.21 ± 0.88  | 29.73 ± 0.30 |
| STC²-LPI      | 62.94 ± 1.65   | 49.08 ± 1.49  | 38.18 ± 0.47 |
In Table 4 and Table 5, we report the ACC and NMI performance of our proposed approaches and of four groups of baseline methods: K-means, SkipVec, RecNN and Para2vec based clustering methods. Intuitively, we make the general observations that (1) BoW based approaches, including K-means (TF) and K-means (TF-IDF), and SkipVec based approaches do not perform well; (2) RecNN based approaches, both RecNN (Ave.) and RecNN (Top+Ave.), do better; (3) Para2vec achieves performance comparable with most baselines; and (4) the evaluation clearly demonstrates the superiority of our proposed method STC². These results are expected. For the SkipVec based approaches, the off-the-shelf encoders are trained on the BookCorpus dataset zhu2015aligning and then applied to our datasets to extract sentence representations. The SkipVec encoders can produce generic sentence representations but may not perform well on specific datasets; in our experiments, the StackOverflow and Biomedical datasets consist of many computer terms and medical terms, such as "ASP.NET", "XML", "C#", "serum" and "glycolytic". On a more careful look, we find that RecNN (Top) does poorly, even worse than K-means (TF-IDF). The reason may be that although recursive neural models introduce tree structure to capture compositional semantics, the vector of the top node mainly captures a biased semantic, while the average of all vectors in the tree nodes, as in RecNN (Ave.), better represents sentence-level semantics. We also observe that, although our proposed STC²-LE and STC²-LPI outperform both the BoW based and the RecNN based approaches across all three datasets, STC²-AE and STC²-LSA exhibit performance similar to that of RecNN (Ave.) and RecNN (Top+Ave.) on the StackOverflow and Biomedical datasets.
We further replace the CNN model in our framework (Figure 1) with other non-biased models, such as biLSTM and biGRU, and report the results in Table 6 and Table 7. In this setting, the binary codes generated from LPI are used to guide the learning of the biLSTM/biGRU models. From the results, we can see that the biGRU and biLSTM based clustering methods perform equally well, with no clear winner, and both achieve large improvements over LPI (best). Compared with these biLSTM/biGRU based models, the evaluation results still demonstrate the superiority of our CNN based clustering model in most cases. As reported by Visin et al. visin2015renet, although bidirectional or multi-directional RNN models perform good non-biased feature extraction, they still do not outperform state-of-the-art CNNs on some tasks.
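The pseudo-supervision step shared by all of these encoders (CNN, biLSTM, biGRU) is to train sigmoid output units to reproduce the pre-trained binary codes. A deliberately simplified stand-in, a single linear layer trained by gradient descent on the binary cross-entropy loss, illustrates the idea; `fit_to_binary_codes` and all hyperparameters here are our own illustrative choices, not the paper's deep model:

```python
import numpy as np

def fit_to_binary_codes(X, B, lr=0.1, epochs=200):
    """Train one-layer sigmoid 'output units' to reproduce the
    pre-trained binary codes B from input features X (toy stand-in
    for the deep CNN/biLSTM/biGRU encoder in the framework)."""
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(X.shape[1], B.shape[1]))
    for _ in range(epochs):
        P = 1.0 / (1.0 + np.exp(-X @ W))   # sigmoid output units
        W -= lr * X.T @ (P - B) / len(X)   # gradient of BCE loss
    return W
```

In the full framework, the learned deep representation (here, the pre-sigmoid features) would then be clustered with K-means; the output units exist only to transfer the semantic structure captured by the binary codes into the encoder.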
To understand which factors make our proposed method work, we report bar charts of the ACC and NMI of our proposed methods and the corresponding baseline methods in Figure 3 and Figure 4. Although AE and LSA perform as well as or even better than LE and LPI, especially on the StackOverflow and Biomedical datasets, STC²-LE and STC²-LPI achieve much larger performance improvements than STC²-AE and STC²-LSA do. The likely reason is the information that the pseudo supervision contributes to the learning of the CNN model. In the AE case, both the input features fed into the CNN model and the pseudo supervision used to guide its learning come from word embeddings; no additional semantic features are introduced, so the performance gain of STC²-AE is limited. In the LSA case, LSA performs a matrix factorization to find the subspace approximation of the original feature space that minimizes the global reconstruction error. As 26_pennington2014glove ; li2015mfp recently point out, training word embeddings with word2vec or its variants is essentially also an operation of matrix factorization. Therefore, the information in the pseudo supervision does not depart much from the CNN input, and the performance gain of STC²-LSA is also not quite satisfactory. In the LE and LPI case, LE extracts the manifold structure of the original feature space, and LPI extracts both geometric and discriminating structure of the original feature space 4_cai2005document . We conjecture that STC²-LE and STC²-LPI outperform LE and LPI by a large margin because both LE and LPI extract useful semantic features, and these features also differ from the word embeddings used as CNN input. From this view, our proposed STC² has the potential to be even more effective when the pseudo supervision captures semantically meaningful features that are sufficiently different from the CNN input.

Furthermore, from the results of K-means and AE in Tables 4-5 and Figures 3-4, we note that TF-IDF weighting gives a more remarkable improvement for K-means, while TF weighting works better than TF-IDF weighting for Average Embedding. The reason may be that pre-trained word embeddings encode useful information from an external corpus and can achieve even better results without TF-IDF weighting. Meanwhile, we find that LE achieves unusually good performance compared with LPI, LSA and AE on the SearchSnippets dataset, which is not observed on the other two datasets. To clarify this, and to better illustrate our proposed approaches against the baselines, we further report 2-dimensional text embeddings on SearchSnippets in Figure 5, using t-SNE 39_van2008visualizing (http://lvdmaaten.github.io/tsne/) to compute distributed stochastic neighbor embeddings of the feature representations used in the clustering methods. We can see that the embeddings from AE and LSA appear fairly good or even better than those from LE and LPI, which does not match the ACC and NMI results in Figures 3-4. Meanwhile, RecNN (Ave.) performs better than BoW (both TF and TF-IDF) while RecNN (Top) does not, which does match the ACC and NMI results in Table 4 and Table 5. Both the matching and non-matching cases illustrate that a visualization tool such as t-SNE provides useful information for assessing results that differs from what ACC and NMI provide. Moreover, from this complementary t-SNE view, our STC²-AE, STC²-LSA, STC²-LE and STC²-LPI show more clear-cut margins among different semantic topics (that is, tags/labels) than AE, LSA, LE and LPI, respectively, as well as than both the BoW based and RecNN based baselines.
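The claim that LSA finds the best subspace approximation of the original feature space can be made concrete: by the Eckart-Young theorem, a rank-k truncated SVD minimizes the global reconstruction error in the Frobenius norm, and the document embeddings are the projections onto the top k singular directions. A minimal NumPy sketch (the function name `lsa` and the toy data are our own):

```python
import numpy as np

def lsa(X, k):
    """LSA as rank-k truncated SVD of the document-term matrix X.
    Returns k-dimensional document embeddings U_k * s_k, i.e. the
    coordinates of each row of X in the best rank-k subspace."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k]
```

Because the rows of Vt are orthonormal, when X already has rank k the embeddings preserve all pairwise distances between documents exactly; for higher-rank X they preserve them as well as any k-dimensional linear projection can.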
From all these results, with the three measures of ACC, NMI and t-SNE across three datasets, we can draw a solid conclusion that our proposed approach is effective at extracting useful semantic features for short text clustering.
With the emergence of social media, short text clustering has become an increasingly important task. This paper explores a new perspective on clustering short texts based on deep feature representations learned from the proposed self-taught convolutional neural networks. Our framework can be trained without any external tags/labels or complicated NLP preprocessing, and it is flexible, in that traditional dimensionality reduction approaches can be plugged in to obtain performance improvements. Our extensive experimental study on three short text datasets shows that our approach achieves significantly better performance. In the future, how to select and incorporate more effective semantic features into the proposed framework calls for further research.
We would like to thank reviewers for their comments, and acknowledge Kaggle and BioASQ for making the datasets available. This work is supported by the National Natural Science Foundation of China (No. 61602479, No. 61303172, No. 61403385) and the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDB02070005).