Aspect term extraction [Hu and Liu2004, Pontiki et al.2014, Pontiki et al.2015] aims to identify, from a review sentence, the aspect expressions that refer to a product's or service's properties (or attributes). It is a fundamental step towards obtaining the fine-grained sentiment of specific aspects of a product, beyond the coarse-grained overall sentiment. To date, there have been two major approaches to aspect term extraction. The unsupervised (or rule-based) methods [Qiu et al.2011] rely on a set of manually defined opinion words as seeds, together with rules derived from syntactic parse trees, to iteratively extract aspect terms. The supervised methods [Jakob and Gurevych2010, Li et al.2010, Chernyshevich2014, Toh and Wang2014, San Vicente et al.2015] usually treat aspect term extraction as a sequence labeling problem, and the conditional random field (CRF) has been the mainstream method in the aspect term extraction task of SemEval.
Representation learning has been introduced and achieved success in natural language processing (NLP)[Bengio et al.2013], such as word embeddings [Mikolov et al.2013b] and structured embeddings of knowledge bases [Bordes et al.2011]. It learns distributed representations for text in different granularities, such as words, phrases and sentences, and reduces data sparsity compared with the conventional one-hot representation. The distributed representations have been reported to be useful in many NLP tasks [Turian et al.2010, Collobert et al.2011].
In this paper, we focus on representation learning for aspect term extraction under an unsupervised framework. Besides words, dependency paths, which have been shown to be important clues in aspect term extraction [Qiu et al.2011], are also taken into consideration. Inspired by the representation learning of knowledge bases [Bordes et al.2011, Neelakantan et al.2015, Lin et al.2015] that embeds both entities and relations into a low-dimensional space, we learn distributed representations of words and dependency paths from the text corpus. Specifically, the optimization objective is formalized over triples (w1, r, w2), where w1 and w2 are words and r is the corresponding dependency path consisting of a sequence of grammatical relations. The recurrent neural network [Mikolov et al.2010] is used to learn the distributed representations of dependency paths. Furthermore, the word embeddings are enhanced by linear context information in a multi-task learning manner.
The learned embeddings of words and dependency paths are utilized as features in CRF for aspect term extraction. The embeddings are real values that are not necessarily in a bounded range [Turian et al.2010]. We therefore firstly map the continuous embeddings into the discrete embeddings and make them more appropriate for the CRF model. Then, we construct the embedding features which include the target word embedding, linear context embedding and dependency context embedding for aspect term extraction. We conduct experiments on the SemEval datasets and obtain comparable performances with the top systems. To demonstrate the effectiveness of the proposed embedding method, we also compare our method with other state-of-the-art models. With the same feature settings, our approach achieves better results. Moreover, we perform a qualitative analysis to show the effectiveness of the learned word and dependency path embeddings.
The contributions of this paper are two-fold. First, we use the dependency path to link words in the embedding space for distributed representation learning of words and dependency paths. By this method, the syntactic information is encoded in word embeddings and the multi-hop dependency path embeddings are learned explicitly. Second, we construct the CRF features only based on the derived embeddings for aspect term extraction, and achieve state-of-the-art results.
2 Related Work
There are a number of unsupervised methods for aspect term extraction. Hu and Liu (2004) mine aspect terms with a data mining approach called association rule mining. In addition, they use opinion words to extract infrequent aspect terms. Using the relationships between opinion words and aspect words to extract aspect terms is employed in many follow-up studies. In [Qiu et al.2011], the dependency relation is used as a crucial clue, and the double propagation method is proposed to iteratively extract aspect terms and opinion words.
A number of supervised algorithms have also been provided for aspect term extraction. Among these algorithms, the mainstream method is the CRF model. Li et al. (2010) propose a machine learning framework on top of CRF to jointly extract positive opinion words, negative opinion words and aspect terms. In order to obtain a more domain-independent extraction, Jakob and Gurevych (2010) train the CRF model on review sentences from different domains. Most of the top systems in SemEval also rely on the CRF model, with hand-crafted features constructed to boost performance.
Representation learning has been employed in NLP [Bengio et al.2013]. Two successful applications are word embeddings [Mikolov et al.2013b] and knowledge embeddings [Bordes et al.2011]. Word embeddings are learned from raw textual data and are effective representations of word meanings; they have been proven useful in many NLP tasks [Turian et al.2010, Guo et al.2014]. The knowledge embedding method models a relation as a translation vector connecting the vectors of two entities. It optimizes an objective over all the facts in the knowledge base, thereby encoding global information. The learned embeddings are useful for reasoning about missing facts and for fact extraction in knowledge bases.
As dependency paths contain rich linguistic information between words, recent works model them as dense vectors. Liu et al. (2015b) incorporate the vectors of grammatical relations (one-hop dependency paths) into the relation representation between two entities for the relation classification task. The dependency-based word embedding model [Levy and Goldberg2014] encodes dependency information into word embeddings and is the work most similar to ours. However, it encodes the dependency information only implicitly, embedding the unit word + dependency path as the context vector. Besides, it only considers one-hop dependency paths and ignores multi-hop dependency paths. In this paper, we take the dependency-based word embedding model as a baseline method.
Approaches that learn path embeddings have been investigated for knowledge bases. Neelakantan et al. (2015) provide a method that reasons about conjunctions of multi-hop relations for knowledge base completion, where the implication of a path is composed using a recurrent neural network. Lin et al. (2015) also learn representations of relation paths, employing a path-constrained resource allocation algorithm to measure the reliability of a path. In this paper, we learn the semantic composition of dependency paths over dependency trees.
3 Method
In this section, we first present the unsupervised learning of word and dependency path embeddings. In addition, we describe how to enhance word embeddings by leveraging the linear context in a multi-task learning manner. Then, the construction of embedding features for the CRF model is discussed.
3.1 Unsupervised Learning of Word and Dependency Path Embeddings
We learn the distributed representations of words and dependency paths in a unified model, as shown in Figure 1. We extract triples (w1, r, w2) from dependency trees, where w1 and w2 denote two words, and the corresponding dependency path r is the shortest path from w1 to w2, consisting of a sequence of grammatical relations. For example, in Figure 1, given that service is w1 and professional is w2, we obtain the corresponding dependency path r between them. We notice that considering lexicalized dependency paths could provide more information for the embedding learning. However, we would then need to memorize many more dependency path frequencies for the learning method (negative sampling) described in Section 3.3. Formally, the number of lexicalized dependency paths grows exponentially with the hop number n (V is the set of words, with |V| over 100,000 in our paper; R is the set of grammatical relations, with |R| about 50). This is computationally expensive even for small n. Hence, in our paper, we only include the grammatical relations in the modeling, which also shows promising results in the experiment section.
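The triple extraction step can be illustrated as a breadth-first search for the shortest relation path over the parse edges. The sketch below is ours, not the authors' code, and the hand-written edge list stands in for real parser output:

```python
from collections import deque

def shortest_dep_path(edges, w1, w2):
    """Shortest dependency path (sequence of grammatical relations)
    between two words. `edges` is a list of (head, relation, dependent)
    triples; the path may traverse edges in either direction."""
    # build an undirected adjacency map, keeping the relation labels
    adj = {}
    for head, rel, dep in edges:
        adj.setdefault(head, []).append((rel, dep))
        adj.setdefault(dep, []).append((rel, head))
    # breadth-first search from w1 to w2
    queue = deque([(w1, [])])
    seen = {w1}
    while queue:
        node, path = queue.popleft()
        if node == w2:
            return path
        for rel, nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [rel]))
    return None  # no path (e.g. word not in the sentence)

# hypothetical parse of "the service is very professional"
edges = [("professional", "nsubj", "service"),
         ("professional", "cop", "is"),
         ("professional", "advmod", "very"),
         ("service", "det", "the")]
print(shortest_dep_path(edges, "service", "professional"))  # ['nsubj']
```

In practice the edges would come from a dependency parser such as Stanford CoreNLP; only the relation labels along the path are kept, matching the unlexicalized paths used in the paper.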
The proposed learning model connects two words with their dependency path in the vector space. By incorporating the dependency path into the model, the word embeddings encode more syntactic information [Levy and Goldberg2014] and the distributed representations of dependency paths are learned explicitly. To be specific, this model requires that r is close to w2 - w1 in the vector space, and we minimize the following loss function for model learning:

min sum_{(w1, r, w2) in T} sum_{r'} max(0, 1 - (w1 + r) . w2 + (w1 + r') . w2)    (1)

where T represents the set of triples extracted from the dependency trees parsed from the text corpus, r is a sequence of grammatical relations denoted as (r_1, ..., r_n), n is the hop number of r, r_i is the i-th grammatical relation in r, and r' is a corrupted dependency path drawn from the marginal distribution over paths. The given loss function encourages the observed triple (w1, r, w2) to have a higher ranking score than the randomly corrupted triple (w1, r', w2). The ranking score is measured by the inner product of the vector w1 + r and the vector w2.
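As a concrete reading of this objective, here is a minimal sketch of the per-triple hinge loss, under the assumption that the ranking score is the inner product of (w1 + r) and w2 with a margin of 1:

```python
import numpy as np

def score(w1, r, w2):
    # assumed ranking score: inner product of (w1 + r) and w2
    return np.dot(w1 + r, w2)

def triple_loss(w1, r, w2, neg_paths, margin=1.0):
    """Margin ranking loss for one observed triple (w1, r, w2)
    against corrupted paths r' (negative samples)."""
    pos = score(w1, r, w2)
    return sum(max(0.0, margin - pos + score(w1, rn, w2))
               for rn in neg_paths)
```

During training, the loss is summed over all extracted triples and back-propagated into the word vectors and the path composition network.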
We employ a recurrent neural network to learn compositional representations for multi-hop dependency paths. The composition operation is realized with a matrix W:

h_i = hTanh(W [h_{i-1} ; r_i])

where hTanh is the hard hyperbolic tangent function, [. ; .] is the concatenation of two vectors, and r_i is the embedding of the i-th grammatical relation. We set h_1 = r_1 and recursively perform the composition operation to obtain the final representation r = h_n. Figure 1 (Left) illustrates the composition process for the dependency path from service to professional. In this work, we train on triples whose hop numbers are less than or equal to 3, as it is time-consuming to learn embeddings for triples with longer dependency paths.
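The recurrent composition can be sketched as follows; the toy composition matrix in the test below is only for illustration, while in practice W is learned:

```python
import numpy as np

def htanh(x):
    # hard hyperbolic tangent: clip activations to [-1, 1]
    return np.clip(x, -1.0, 1.0)

def compose_path(rel_embs, W):
    """Recurrent composition of relation embeddings r_1..r_n on a
    dependency path: h_1 = r_1, h_i = hTanh(W [h_{i-1} ; r_i]),
    returning the final path representation h_n."""
    h = rel_embs[0]
    for r_i in rel_embs[1:]:
        h = htanh(W @ np.concatenate([h, r_i]))
    return h
```

A one-hop path is simply its relation embedding; longer paths are folded in left to right, so the same W is shared across hops.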
3.2 Multi-Task Learning with Linear Context
We use the linear context to enhance word embeddings. The method is based on the distributional hypothesis that words in similar contexts have similar meanings. Inspired by Skip-gram [Mikolov et al.2013a], we enhance word embeddings by maximizing the prediction accuracy of a context word c that occurs in the linear context of a target word w. The method is shown in Figure 1 (Right). Following the previous setting in [Mikolov et al.2013b], we set the window size to 5. A hinge loss function, which has been proven effective in cross-domain representation learning [Bollegala et al.2015], is modified to the following (we use the dot product to calculate the similarity of words, without the softmax function employed in Skip-gram):

min sum_{(w, c) in C} sum_{c'} max(0, 1 - w . c + w . c')    (2)

where (w, c) in C means that the word c appears in the linear context of the target word w, and c' is a negative context word drawn from the marginal distribution over words. Every word has two roles: the target word, and the context word of other target words. In this paper, we use different vectors to represent the two roles of a word.
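A minimal sketch of this context objective, again assuming dot-product similarity and a margin of 1:

```python
import numpy as np

def context_loss(w, pos_ctx, neg_ctx, margin=1.0):
    """Hinge loss: the target word w should score its observed
    linear-context words above randomly sampled negative words
    (dot-product similarity)."""
    return sum(max(0.0, margin - np.dot(w, c) + np.dot(w, cn))
               for c in pos_ctx for cn in neg_ctx)
```

Because this loss shares the target-word vectors with the dependency-path objective, minimizing both jointly realizes the multi-task learning described above.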
3.3 Model Training
The negative sampling method [Collobert et al.2011, Mikolov et al.2013b] is employed to train the embedding model. Specifically, for each pair (w, c), we randomly select k words that do not appear in the linear context window of the target word w. These negative words are sampled from a marginal distribution P(w) estimated from the word frequency raised to the 3/4 power [Mikolov et al.2013a] in the corpus. We use the same method to optimize Equation (1). The difference is that we sample paths separately according to the hop number, randomly selecting k_n corrupted dependency paths for an n-hop path r; the selected dependency paths have the same hop number as r. We experimentally find that k = 5 is a trade-off between training time and performance. Similarly, k_1, k_2 and k_3 are set to 5, 3 and 2.
The model is trained by back-propagation. We use asynchronous gradient descent for parallel training. Following the strategy for updating learning rate [Mikolov et al.2013a], we linearly decrease it over our training instances. The initial learning rate is set to 0.001.
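The negative sampler can be sketched as a draw from the unigram distribution raised to the 3/4 power (the word2vec convention); the sampler below is our illustration of that scheme:

```python
import random

def build_sampler(word_freq, power=0.75):
    """Negative sampler: draw words from the unigram distribution
    raised to the 3/4 power, skipping excluded (true-context) words."""
    words = list(word_freq)
    weights = [word_freq[w] ** power for w in words]
    def sample(k, exclude=()):
        out = []
        while len(out) < k:
            w = random.choices(words, weights=weights)[0]
            if w not in exclude:  # reject words in the true context
                out.append(w)
        return out
    return sample
```

The same rejection loop works for dependency paths: maintain one frequency table per hop number and sample corrupted paths only among those with the same hop count.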
3.4 Aspect Term Extraction with Embeddings
We use the CRF model for aspect term extraction, and the features are constructed based on the learned embeddings. Besides the target word embedding features, we also utilize features encoding context information: (1) the linear context, i.e., the words within a fixed window of the target word in the linear sequence; (2) the dependency context, i.e., the words directly related to the target word in the dependency tree. These context features serve as complementary information to help label aspect terms.
Given a sentence and its dependency tree, we denote F(w_i) as the function extracting the features for word w_i. In our case, F(w_i) = {E(w_i), L(w_i), D(w_i)}, where E(w_i) represents the embedding of word w_i, and L(w_i) and D(w_i) refer to the linear context embedding and the dependency context embedding respectively. These extracted embeddings are concatenated as CRF features.
Based on the assumption that the label of a word mainly depends on itself and its neighboring words [Collobert et al.2011], the linear context is leveraged as additional information to help tag a word in NLP tasks. In our work, the linear context embedding L(w_i) is built from the embeddings of the words within a window centred on w_i:

L(w_i) = [E(w_{i - floor(win/2)}); ... ; E(w_{i + floor(win/2)})]

where floor(.) is the floor function and win represents the window size of the linear context, set to 5 in this paper following [Collobert et al.2011].
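A small helper illustrating which positions fall into the linear-context window (window size 5, target excluded, clipped at sentence boundaries):

```python
def linear_context(words, i, win=5):
    """Positions of the linear-context words in a window of size `win`
    centred on target position i (the target itself is excluded)."""
    half = win // 2  # floor(win / 2)
    return [j for j in range(i - half, i + half + 1)
            if 0 <= j < len(words) and j != i]
```

The linear context embedding for position i is then assembled from the word embeddings at these positions.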
The dependency context captures syntactic information from the dependency tree. We collect (dependency path, context word) pairs as the dependency context; these pairs are directly linked with the word to be labeled, and we limit the max hop number of the dependency paths in these pairs to 3. For example, the dependency context of the target word staff in Figure 1 consists of the pairs linking staff via its dependency paths to the context words waiter, service, and, very and professional. We sum the vectors of the dependency path and the context word as the embedding for each pair, and average the embeddings of the collected pairs to derive the dependency context embedding features.
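The sum-then-average construction of the dependency context feature can be sketched as follows; the embedding tables here are hypothetical lookup dictionaries:

```python
import numpy as np

def dep_context_embedding(pairs, word_emb, path_emb):
    """Dependency-context feature for a target word: sum the path and
    word vectors for each (dependency path, context word) pair, then
    average over all collected pairs."""
    vecs = [path_emb[p] + word_emb[w] for p, w in pairs]
    return np.mean(vecs, axis=0)
```

Averaging keeps the feature dimensionality fixed regardless of how many dependency neighbours a word has.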
The goal of embedding discretization is to map the real-valued embedding matrix M (of size d x |V|) into a discrete-valued embedding matrix M', where d is the dimension of the embeddings and |V| is the size of the vocabulary. Formally, the operation of embedding discretization is:

M'_ij = floor( (M_ij - min_i) / ((max_i - min_i) / t) )

where max_i and min_i are the maximum and minimum values in the i-th dimension respectively, and t is the number of discrete intervals for each dimension. By this transformation, we obtain discrete CRF features for aspect term extraction without losing much of the information encoded in the original embeddings.
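A sketch of this discretization with NumPy; the guard for constant dimensions and the clipping of the per-dimension maximum into the top interval are our additions:

```python
import numpy as np

def discretize(M, t=15):
    """Map a real-valued embedding matrix M (d x |V|) to integer
    interval ids in {0, ..., t-1}, dimension by dimension."""
    mx = M.max(axis=1, keepdims=True)
    mn = M.min(axis=1, keepdims=True)
    step = (mx - mn) / t
    step[step == 0] = 1.0          # guard: constant dimensions map to 0
    D = np.floor((M - mn) / step).astype(int)
    return np.minimum(D, t - 1)    # the max value falls in the top interval
```

Each dimension then contributes one categorical feature (its interval id), which is the form a linear-chain CRF expects.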
4 Experiments
4.1 Dataset and Setting
We conduct our experiments on the SemEval 2014 and 2015 datasets. The SemEval 2014 dataset covers two domains, laptop (D1) and restaurant (D2), and the SemEval 2015 dataset only includes the restaurant domain (D3). The details of the datasets are described in Table 1, and the F1 score is used as the evaluation metric.
Dataset | # Sentences | # Aspects
We also introduce two in-domain corpora for learning the distributed representations of words and dependency paths: the Yelp dataset (https://www.yelp.com/academic_dataset) and the Amazon dataset (https://snap.stanford.edu/data/web-Amazon.html), which are in-domain corpora for the restaurant domain and the laptop domain respectively. As the number of laptop reviews in the Amazon corpus is small, we add the reviews of similar products (Electronics, Kindle, Phone and Video Game) to the laptop corpus. All corpora are parsed with Stanford CoreNLP (http://nlp.stanford.edu/software/corenlp.shtml). The derived tokens are converted to lowercase, and words and dependency paths that appear fewer than 10 times are removed. The details of the unlabeled data are shown in Table 2.
Dataset | # Sentences | # Tokens
In order to choose the embedding dimension and the interval number t, we use 80% of the sentences in the training data as the training set and the remaining 20% as the development set. The dimensions of word and dependency path embeddings are set to 100; larger dimensions give similar results on the development set but need more training time. t is set to 15, which performs best on the development set.
We use an available CRF tool (https://crfsharp.codeplex.com/) for tagging aspect terms, and set the parameters to their default values. The only exception is the regularization option: L1 regularization is chosen as it obtains better results than L2 regularization on the development set.
4.2 Result and Analysis
We compare our method with the following methods:
(1) Naive: A token in the test sentences is tagged as an aspect term if it appears in the dictionary containing all the aspect terms of the training sentences.
(2) Baseline Feature Templates: Most of the features used in the Baseline Feature Templates are adopted from the NER feature templates of [Guo et al.2014], and are described in Table 3.
(3) IHS_RD: IHS_RD is the top system in D1 which also relies on CRF with a rich set of lexical, syntactic and statistical features. Unlike other systems in SemEval 2014 and our work, IHS_RD trains the CRF model on the reviews of both the restaurant domain and laptop domain.
(4) DLIREC: DLIREC is the top system in D2, which relies on CRF with a variety of lexical, syntactic and semantic features derived from NLP resources. In terms of features based on the dependency tree, they use the head word, the head word POS and the one-hop dependency relation of the target word. Besides, different from our work, DLIREC utilizes the corpora of Yelp and Amazon reviews to derive cluster features for words.
(5) EliXa: EliXa is the top system in D3 which addresses the problem using an averaged perceptron with a BIO tagging scheme. The features used in EliXa consist of n-grams, token shape (digits, lowercase, punctuation, etc.), previous prediction, n-gram prefixes and suffixes, and word clusters derived from additional data (Yelp for Brown and Clark clusters; Wikipedia for word2vec clusters).
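The Naive baseline (1) amounts to a dictionary lookup over the aspect terms seen in training; a minimal single-token sketch:

```python
def naive_tagger(tokens, aspect_dict):
    """Naive baseline: a token is tagged as an aspect term iff it
    occurred as an aspect term in the training data."""
    return ["ASPECT" if tok.lower() in aspect_dict else "O" for tok in tokens]
```

This baseline cannot generalize to unseen aspect terms, which is exactly what the learned representations are meant to address.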
The statistical significance tests are calculated by approximate randomization, as described in [Yeh2000], and the results are displayed in Table 4. Compared with the Baseline Feature Templates, the top systems (IHS_RD, DLIREC and EliXa) obtain F1 score gains of 3.83, 2.45 and 5.72 in D1, D2 and D3 respectively. This indicates that high-quality feature engineering is important for aspect term extraction. In terms of embedding-based features, the target word embedding alone performs worse than the other feature settings. Both "W+L" and "W+D" improve performance, as they capture additional context information. By combining the embedding features of the target word, linear context and dependency context, we achieve comparable performance with the best system in SemEval 2015, and outperform the best systems in SemEval 2014. This shows that, (1) with features based on distributed representations, we can achieve state-of-the-art results in aspect term extraction; (2) the context embedding features offer complementary information that helps detect aspect terms. Additionally, adding the traditional features (Baseline Feature Templates) yields better performance. This can be explained by the fact that the traditional features capture surface information (e.g., stem, suffix and prefix) that is not explicitly encoded in the embedding features.
Method | D1 | D2 | D3
Baseline Feature Templates | 70.72 | 81.56 | 64.32
IHS_RD (Top system in D1) | 74.55 | 79.62 | -
DLIREC (Top system in D2) | 73.78 | 84.01 | -
EliXa (Top system in D3) | - | - | 70.04
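The approximate-randomization significance test used above can be sketched as follows. Note this simplified version assumes additive per-sentence scores; for a set-level metric such as F1, the metric would be recomputed after each shuffle rather than summed:

```python
import random

def approx_randomization(scores_a, scores_b, trials=10000, seed=0):
    """Approximate-randomization test: randomly swap the two systems'
    per-sentence scores and count how often the absolute score gap is
    matched or exceeded. Returns a smoothed p-value."""
    rng = random.Random(seed)
    observed = abs(sum(scores_a) - sum(scores_b))
    hits = 0
    for _ in range(trials):
        sa = sb = 0.0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:  # swap this sentence's outputs
                a, b = b, a
            sa += a
            sb += b
        if abs(sa - sb) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)  # add-one smoothed p-value
```

A small p-value means the observed gap between two systems is unlikely under random reassignment of their outputs.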
4.3 Comparison of Different Embedding Methods
We conduct experiments to compare our method of word and dependency path embeddings (WDEmb) against state-of-the-art embedding methods.
The baseline embedding algorithms include the Dependency Recurrent Neural Language Model (DRNLM) [Piotr and Andreas2015], Skip-gram, CBOW [Mikolov et al.2013a] and Dependency-based word embeddings (DepEmb) [Levy and Goldberg2014]. The DRNLM predicts the current word given the previous words, aiming to learn a probability distribution over sequences of words. Note that, besides the previous words, DRNLM also uses one-hop dependency relations as additional features to help model learning. Skip-gram learns word embeddings by predicting the context words given the target word, while CBOW learns word embeddings by predicting the current word given the context words. DepEmb learns word embeddings using one-hop dependency context. It is similar to our method in encoding functional information into word embeddings, but the dependency paths it considers are one-hop rather than multi-hop.
The implementations of the baseline models are publicly released (DepEmb: https://levyomer.wordpress.com/software/; Skip-gram and CBOW: https://code.google.com/p/word2vec/; DRNLM: https://github.com/piotrmirowski/DependencyTreeRnn), and are used to train embeddings on the same corpora as our model. We set the parameters of the baselines to ensure a fair comparison: the dimensions and interval numbers of the baselines are tuned on the development set; the interval numbers of DRNLM, Skip-gram and CBOW are set to 10, and the interval number of DepEmb is set to 15; all dimensions are set to 100; the window sizes of Skip-gram and CBOW are set to 5, the same as in our multi-task learning with linear context. As the baselines do not learn distributed representations of multi-hop dependency paths, we compare our model with them in the feature settings (W and W+L) where only word embeddings are used. The statistical significance tests are calculated by approximate randomization, and the results are shown in Table 5.
Among these baselines, DepEmb learns word embeddings from dependency context and performs best in both domains. This indicates that syntactic information is more desirable for aspect term extraction. WDEmb outperforms DepEmb for two reasons: (1) compared with DepEmb, which embeds the unit word + dependency path into a single vector, WDEmb learns word and dependency path embeddings separately, alleviating the data sparsity problem; (2) WDEmb takes multi-hop dependency paths into consideration, which helps encode more syntactic information. DRNLM performs worst among these baselines, even though it utilizes dependency relations in its model. The reason is that DRNLM focuses on language modeling and, unlike the other embedding baselines, does not include subsequent words as context.
We also employ two methods for qualitative analysis of the learned embeddings: (1) we select the 6 most similar words (by cosine similarity) for each aspect word; results are shown in Table 6. (2) we design queries made up of one word and one dependency path, such as delicious combined with two different dependency paths, to find similar words; results are displayed in Table 7.
Table 6 shows that the similar words derived from both Skip-gram and CBOW are less similar to the given aspect words in syntactic function (e.g., fast, estimated and speedy). The reason is that the Skip-gram and CBOW models learn word embeddings based on linear context and capture less functional similarity between words than DepEmb and WDEmb [Levy and Goldberg2014]. Table 6 also shows that the similar words induced by WDEmb are more topically similar to the aspect words than those of DepEmb (e.g., racquetball, everything, smog). The reason is that WDEmb encodes more topical information into word embeddings through multi-task learning based on linear context. Besides, we obtain sensible words for the word + dependency path queries in Table 7. This indicates that our approach to word and dependency path embeddings is promising.
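The query-based analysis amounts to a cosine nearest-neighbour search around a composed vector (e.g., a word embedding plus a composed path embedding). A sketch with a toy vocabulary:

```python
import numpy as np

def most_similar(query_vec, word_emb, topn=6):
    """Return the topn vocabulary words ranked by cosine similarity
    to the query vector."""
    def cos(a, b):
        return float(np.dot(a, b) /
                     (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    ranked = sorted(word_emb, key=lambda w: cos(query_vec, word_emb[w]),
                    reverse=True)
    return ranked[:topn]
```

For a query such as word + path, `query_vec` would be the sum of the word vector and the path vector, mirroring the translation-style scoring of the triple objective.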
4.4 Effect of Parameters
In this section, we investigate the effect of the interval number t and the dimension of the embeddings in the feature setting W+L+D. All the experiments are conducted on the development set.
We vary t from 3 to 21 in increments of 3. The performances on the three datasets are given in Figure 2. We can see that the performances increase while t is below 15, and the models achieve the best results in two domains at t = 15; beyond that, the performances decrease. We also vary the embedding dimension from 25 to 225 in increments of 25. The results are shown in Figure 2. The performances increase with the dimension up to 100, but remain stable beyond it. We can infer that 100 is a trade-off between performance and training time.
5 Conclusion and Future Work
In this paper, we propose to learn distributed representations of words and dependency paths in an unsupervised way for aspect term extraction. Our method leverages dependency path information to connect words in the embedding space. It is effective in distinguishing words that have similar contexts but different syntactic functions, which is important for aspect term extraction. Furthermore, the distributed representations of multi-hop dependency paths are derived explicitly. The embeddings are then discretized and used as features in the CRF model to extract aspect terms from review sentences. With only the distributed representation features, we obtain comparable results with the top systems in aspect term extraction on the benchmark datasets. Recently, Wang et al. [Wang et al.2015a] propose to incorporate knowledge graphs to represent documents or short texts [Wang et al.2015b, Wang et al.2016]. It could be interesting to combine such representations with our embedding methods to further improve performance.
Acknowledgments
This paper is partially supported by the National Natural Science Foundation of China (NSFC Grant Numbers 61272343, 61472006), the Doctoral Program of Higher Education of China (Grant No. 20130001110032) as well as the National Basic Research Program (973 Program No. 2014CB340405).
References
- [Bengio et al.2013] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. PAMI, 35(8):1798–1828, Aug 2013.
- [Bollegala et al.2015] Danushka Bollegala, Takanori Maehara, and Ken-ichi Kawarabayashi. Unsupervised cross-domain word representation learning. arXiv:1505.07184, 2015.
- [Bordes et al.2011] Antoine Bordes, Jason Weston, Ronan Collobert, and Yoshua Bengio. Learning structured embeddings of knowledge bases. In AAAI Conference on Artificial Intelligence, 2011.
- [Chernyshevich2014] Maryna Chernyshevich. IHS R&D Belarus: Cross-domain extraction of product features using conditional random fields. In SemEval, page 309, 2014.
- [Collobert et al.2011] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. JMLR, 2011.
- [Guo et al.2014] Jiang Guo, Wanxiang Che, Haifeng Wang, and Ting Liu. Revisiting embedding features for simple semi-supervised learning. In EMNLP, pages 110–120, 2014.
- [Hu and Liu2004] Minqing Hu and Bing Liu. Mining opinion features in customer reviews. In AAAI, volume 4, pages 755–760, 2004.
- [Jakob and Gurevych2010] Niklas Jakob and Iryna Gurevych. Extracting opinion targets in a single-and cross-domain setting with conditional random fields. In EMNLP, pages 1035–1045, 2010.
- [Levy and Goldberg2014] Omer Levy and Yoav Goldberg. Dependency-based word embeddings. In ACL, pages 302–308, 2014.
- [Li et al.2010] Fangtao Li, Chao Han, Minlie Huang, Xiaoyan Zhu, Ying-Ju Xia, Shu Zhang, and Hao Yu. Structure-aware review mining and summarization. In ACL, pages 653–661, 2010.
- [Li et al.2012] Fangtao Li, Sinno Jialin Pan, Ou Jin, Qiang Yang, and Xiaoyan Zhu. Cross-domain co-extraction of sentiment and topic lexicons. In ACL, pages 410–419, 2012.
- [Lin et al.2015] Yankai Lin, Zhiyuan Liu, Huanbo Luan, Maosong Sun, Siwei Rao, and Song Liu. Modeling relation paths for representation learning of knowledge bases. In EMNLP, 2015.
- [Liu et al.2015a] Pengfei Liu, Shafiq Joty, and Helen Meng. Fine-grained opinion mining with recurrent neural networks and word embeddings. In EMNLP, pages 1433–1443, 2015.
- [Liu et al.2015b] Yang Liu, Furu Wei, Sujian Li, Heng Ji, Ming Zhou, and Houfeng Wang. A dependency-based neural network for relation classification. In ACL, pages 285–290, 2015.
- [Mikolov et al.2010] Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernockỳ, and Sanjeev Khudanpur. Recurrent neural network based language model. In INTERSPEECH, pages 1045–1048, 2010.
- [Mikolov et al.2013a] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv:1301.3781, 2013.
- [Mikolov et al.2013b] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119, 2013.
- [Neelakantan et al.2015] Arvind Neelakantan, Benjamin Roth, and Andrew McCallum. Compositional vector space models for knowledge base completion. In ACL, pages 156–166, July 2015.
- [Piotr and Andreas2015] Mirowski Piotr and Vlachos Andreas. Dependency recurrent neural language models for sentence completion. arXiv:1507.01193, 2015.
- [Pontiki et al.2014] Maria Pontiki, Haris Papageorgiou, Dimitrios Galanis, Ion Androutsopoulos, John Pavlopoulos, and Suresh Manandhar. Semeval-2014 task 4: Aspect based sentiment analysis. In SemEval, pages 27–35, 2014.
- [Pontiki et al.2015] Maria Pontiki, Dimitrios Galanis, Haris Papageogiou, Suresh Manandhar, and Ion Androutsopoulos. Semeval-2015 task 12: Aspect based sentiment analysis. In SemEval, 2015.
- [Qiu et al.2011] Guang Qiu, Bing Liu, Jiajun Bu, and Chun Chen. Opinion word expansion and target extraction through double propagation. CL, 37(1):9–27, 2011.
- [San Vicente et al.2015] Iñaki San Vicente, Xabier Saralegi, and Rodrigo Agerri. Elixa: A modular and flexible absa platform. In SemEval, pages 748–752, 2015.
- [Toh and Wang2014] Zhiqiang Toh and Wenting Wang. DLIREC: Aspect term extraction and term polarity classification system. In SemEval, 2014.
- [Turian et al.2010] Joseph Turian, Lev Ratinov, and Yoshua Bengio. Word representations: a simple and general method for semi-supervised learning. In ACL, pages 384–394, 2010.
- [Wang et al.2015a] Chenguang Wang, Yangqiu Song, Ahmed El-Kishky, Dan Roth, Ming Zhang, and Jiawei Han. Incorporating world knowledge to document clustering via heterogeneous information networks. In KDD, pages 1215–1224, 2015.
- [Wang et al.2015b] Chenguang Wang, Yangqiu Song, Haoran Li, Ming Zhang, and Jiawei Han. Knowsim: A document similarity measure on structured heterogeneous information networks. In ICDM, pages 1015–1020, 2015.
- [Wang et al.2016] Chenguang Wang, Yangqiu Song, Haoran Li, Ming Zhang, and Jiawei Han. Text classification with heterogeneous information network kernels. In AAAI, 2016.
- [Yeh2000] Alexander Yeh. More accurate tests for the statistical significance of result differences. In COLING, pages 947–953, 2000.