
tax2vec: Constructing Interpretable Features from Taxonomies for Short Text Classification

02/01/2019
by   Blaž Škrlj, et al.

The use of background knowledge remains largely unexploited in many text classification tasks. In this work, we explore word taxonomies as a means for constructing new semantic features, which may improve the performance and robustness of the learned classifiers. We propose tax2vec, a parallel algorithm for constructing taxonomy-based features, and demonstrate its use on six short-text classification problems, including gender, age and personality type prediction, drug effectiveness and side effect prediction, and news topic prediction. The experimental results indicate that the interpretable features constructed using tax2vec can notably improve the performance of classifiers; the constructed features, in combination with fast, linear classifiers tested against strong baselines such as hierarchical attention neural networks, achieved comparable or better classification results on short documents. Further, tax2vec can also serve for the extraction of corpus-specific keywords. Finally, we investigated the semantic space of potential features and observed a similarity with the well-known Zipf's law.


1 Introduction

In text mining, document classification refers to the task of classifying a given text document into one or more categories based on its content Sebastiani:2002:MLA:505282.505283. A text classifier is given a set of labeled documents as input, and is expected to learn to associate the patterns appearing in the documents with the document labels. Lately, deep learning approaches have become the standard in natural language-related learning tasks, demonstrating good performance on a variety of classification tasks, including sentiment analysis of tweets tang2015document and news categorization kusner2015word. Despite achieving state-of-the-art performance on many tasks, deep learning is not yet optimized for situations where the number of documents in the training set is low, or where the documents contain very little text rangel2017overview.

Semantic data mining denotes a data mining approach where domain ontologies are used as background knowledge in the data mining process lawrynowicz2017semantic. Semantic data mining approaches have been successfully applied to association rule learning angelino2017learning, semantic subgroup discovery vavpetic_SDM ; wordification, data visualization Adhikari2014, as well as to text classification scott1998text. Provision of semantic information allows the learner to use features on a higher semantic level, allowing for better data generalization. The semantic information is commonly represented as relational data in the form of complex networks, ontologies and taxonomies. Development of approaches which leverage such information remains a lively research topic in several fields, including biology 8538944 ; Chang:2015:HNE:2783258.2783296, sociology freeman2017research, and natural language processing wang2017combining.

This paper contributes to semantic data mining by using word taxonomies as a means for semantic enrichment through the construction of new features, with the goal of improving the performance and robustness of the learned classifiers. In particular, it addresses the classification of short or incomplete documents, which is useful in a large variety of tasks. For example, in author profiling the task is to recognize the author's characteristics, such as age or gender rangel2014overview, based on a collection of the author's text samples. Here, the amount of available data is known to be an important factor influencing classification performance rangel2016overview. A frequent text type for this task is tweets, where a collection of tweets from the same author is considered a single document, to which a label must be assigned. The fewer instances (tweets) per user we need, the more powerful and useful the approach. Learning from only a handful of tweets can enable early detection of bots in social networks, and is hence of practical importance chu2012detecting ; chu2010tweeting. The same holds true for nearly any kind of text classification task: when classifying news into topics, for example, using only snippets or titles instead of the entire news text may be preferred due to text availability or processing speed. Similarly, in biomedical applications, Grässer et al. GraBer:2018:ASA:3194658.3194677 tried to predict a drug's side effects and effectiveness from patients' short commentaries, while Boyce et al. boyce2012using investigated the use of short user comments to assess drug-drug interactions.

It has been demonstrated that deep neural networks generally need a large amount of information in order to learn complex classifiers, i.e., they require a large training set of documents. For example, the recently introduced BERT neural network architecture devlin2018bert, comprising hundreds of millions of parameters, was pre-trained on the whole of English Wikipedia (among other corpora), even though its application (fine-tuning) can be carried out on smaller data sets. However, state-of-the-art models do not perform well when incomplete (or scarce) information is used as input cho2015much, even though promising results regarding zero-shot socher2013zero and few-shot snell2017prototypical learning were recently achieved.

This paper proposes a novel approach named tax2vec, in which semantic information in the form of taxonomies is used to improve classification performance on short texts. In the proposed approach, based on a single input parameter (the number of features), the features are constructed autonomously and remain interpretable. We believe that tax2vec could help explore and understand how external semantic information can be incorporated into existing (black-box) machine learning models, as well as help to explain what is being learned.

This work is structured as follows. Following the theoretical preliminaries and the related work necessary to understand how semantic background knowledge can be used in learning, we continue with the description of the proposed tax2vec methodology. This is followed by the experimental evaluation, where we first evaluate the qualitative properties of features constructed using tax2vec, followed by extensive classification benchmark tests. The paper concludes with a comment on the open source software and a discussion of further work. In terms of sections, we formulate the proposed tax2vec algorithm in Section 3. In Section 4, we describe the experimental setting used to test the methodology. In Section 5, we present the results of the experimental testing. In Section 6, we demonstrate how tax2vec can be used for qualitative corpus analysis.

2 Background and related work

In this section we present the theoretical preliminaries and some related work, which served as the basis for the proposed tax2vec approach. We begin by explaining different levels of semantic context, followed by the explanation of the rationale behind the proposed approach.

2.1 Semantic context

Document classification is highly dependent on document representation. In simple bag-of-words representations, the frequency (or a similar weight, such as the term frequency-inverse document frequency, tf-idf) of each word or n-gram is considered as a separate feature. More advanced representations group words with similar meaning together. Such approaches include Latent Semantic Analysis landauer2006latent , Latent Dirichlet Allocation blei2003latent , and more recently word embeddings mikolov2013efficient . It has been previously demonstrated that context-aware algorithms significantly outperform naive learning approaches cagliero2013improving . We refer to such semantic context as the first-level context.

Second-level context can be introduced by incorporating background knowledge (e.g., ontologies) into a learning task, which can lead to improved interpretability and performance of classifiers learned, e.g., by rule learning vavpetic_SDM and by random forests xu2018ontological. In text mining, Elhadad et al. elhadad2018novel present an ontology-based web document classifier, while Kaur et al. kaur2018domain propose a clustering-based algorithm for document classification, which also benefits from knowledge stored in the underlying ontologies. Cagliero and Garza cagliero2013improving report a custom classification algorithm which can leverage taxonomies, and demonstrate on a case study of geospatial data that such information can be used to improve the learner's classification performance. The use of hypernym-based features for classification tasks has been considered previously: the Ripper rule learner was used with hypernym-based features scott1998text , while mansuy2006evaluating evaluated the impact of WordNet-based features on text classification, demonstrating that hypernym-based features significantly impact classifier performance.

2.2 Feature construction and selection

When unstructured data is used as input, it is common to explore the options for feature construction. Even though recently introduced deep neural network based approaches operate on simple word indices, and thus eliminate the need for manual construction of features, such alternatives are not necessarily the optimal approach when vectorizing background knowledge in the form of taxonomies or ontologies. Features obtained by training a neural network are inherently non-symbolic and as such do not add to the developer's understanding of the (possible) causal mechanisms underlying the learned representations bunge2017causality ; pearl2009causality . In contrast, understanding the semantic background of a classifier's decision can shed light on previously unobserved second-level context vital to the success of learning, rendering otherwise incomprehensible models easier to understand.

Definition 1 (Feature construction).

Given an unstructured input consisting of n documents, a feature construction algorithm outputs an n × d feature matrix F, where d denotes the predefined number of features to be constructed.

In practical applications, features are constructed from various data sources, including texts stanczyk2015feature , graphs kakisim2018unsupervised , audio recordings and similar data tomavsev2015hubness . With the increasing computational power at one's disposal, automated feature construction methods are becoming prevalent. Here, the idea is that, given some criterion, the feature constructor outputs a set of features selected according to that criterion. For example, the tf-idf feature construction algorithm, applied to a given document corpus, can automatically construct hundreds of thousands of n-gram features in a matter of minutes on an average off-the-shelf laptop.

Many approaches can thus output too many features to be processed in a reasonable time, and can introduce additional noise, which renders the task of learning even harder. To solve this problem, one of the known solutions is feature selection.

Definition 2 (Feature selection).

Let F represent the n × d feature matrix (as defined above), obtained during automated feature construction. A feature selection algorithm transforms the matrix F into an n × k matrix F′, where k ≤ d represents the number of desired features after feature selection.

Feature selection thus filters out the (unnecessary) features, with the aim of yielding a compact, information-rich representation of the unstructured input. There exist many approaches to feature selection. They can be based on the individual feature’s information content, correlation, significance etc. chandrashekar2014survey . Feature selection is for example relevant in biological data sets, where e.g., only a handful of the key gene markers are of interest, and can be identified by assessing the impact of individual features on the target space hira2015review .

2.3 Learning from graphs and relational information

In this section we briefly discuss the works that influenced the development of the proposed approach. One of the most elegant ways to learn from graphs is by transforming them into propositional tables, which are a suitable input for many downstream learning algorithms. Recent attempts at vectorization of graphs include node2vec Grover:2016:NSF:2939672.2939754 , an algorithm for constructing features from homogeneous networks; its extension to heterogeneous networks, metapath2vec Dong:2017:MSR:3097983.3098036 ; mol2vec doi:10.1021/acs.jcim.7b00616 , a vectorization algorithm focused on molecular data; struc2vec Ribeiro:2017:SLN:3097983.3098061 , a graph vectorization algorithm based on homophily relations between nodes; and more. All of these approaches are non-symbolic, as the obtained vectorized information (embeddings) is not interpretable. Similarly, recently introduced graph-convolutional neural networks also yield local node embeddings, which additionally take node feature vectors into account kipf2017semi ; NIPS2017_6703 .

In parallel to graph-based vectorization, approaches which tackle the problem of learning from relational databases have emerged. Symbolic (i.e., interpretable) approaches for this vectorization task, known under the term propositionalization, include RSD vzelezny2006propositionalization , a rule-based algorithm which constructs relational features, and wordification perovvsek2013wordification , an approach for unfolding relational databases into bag-of-words representations. The approach described in the following sections relies on some of the key ideas initially introduced in the mentioned works on propositionalization, as taxonomies are inherently relational data structures.

3 The tax2vec approach

In this section we outline the proposed tax2vec approach. We begin with a general description of classification from short texts, followed by the key features of tax2vec, which offer solutions to some currently underexplored issues in text mining.

3.1 The rationale behind tax2vec

Even though deep learning-based approaches currently dominate the field of general text classification, they are still outperformed by simpler models, such as SVMs, on short documents (tweets, opinions, etc.), especially when the number of instances is also low. Compared to the non-symbolic node vectorization algorithms discussed in the previous section, tax2vec uses hypernyms as potential features directly, and thus makes the process of feature construction and selection possible without loss of the classifier's interpretability. In this work we first explore how parts of the WordNet taxonomy Miller:1995:WLD:219717.219748 , related to the training corpus, can be used for the construction of novel features, as such background knowledge can be applied in virtually every English text-based learning setting, as well as for many other languages Gonzalez-Agirre:Laparra:Rigau:2012 .

We propose tax2vec, an algorithm for semantic feature vector construction that can be used to enrich the feature vectors constructed by established text processing methods such as tf-idf. The tax2vec algorithm takes as input a labeled or unlabeled corpus of documents and a word taxonomy. It outputs a matrix of semantic feature vectors in which each row represents a semantics-based vector representation of one input document. An example use of tax2vec in a common language processing pipeline is shown in Figure 1. Note that the obtained feature vectors serve as additional features in the final, vectorized representation of a given corpus.

Figure 1: Schematic representation of tax2vec, combined with standard tf-idf representation of documents. Note that darker nodes in the taxonomy represent more general terms.

3.2 Document-based taxonomy construction

In the first step of the tax2vec algorithm, a document-based taxonomy is constructed from the input corpus. In this section we describe how the words from individual documents of a corpus are mapped to the WordNet taxonomy, where the obtained mappings are considered as the novel features. We focus on semantic structures, derived exclusively from the hypernymy relation between words. Such taxonomies are tree-like structures, which span from individual words to higher-order semantic concepts. For example, given the word monkey, one of its mappings in the WordNet hypernym taxonomy is the term mammal, which can be further mapped to e.g., animal etc., eventually reaching the most general term, i.e. entity.

In the tax2vec algorithm, each word is first mapped to the WordNet hypernym taxonomy. In order to discover the mapping, the first problem that must be solved is that of disambiguation. For example, the word bank has two different meanings when considered in the following sentences:

The river bank was reinforced. The national bank was robbed.

There exist many approaches to word-sense disambiguation (WSD); we refer the reader to Navigli:2009:WSD:1459352.1459355 for a detailed overview of the WSD methodology. In this work we use Lesk basile2014enhanced , a standard WSD algorithm.
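To make the disambiguation step concrete, the following minimal sketch shows how a word sense can be selected with NLTK's implementation of the Lesk algorithm. It illustrates the idea only; tax2vec's exact WSD configuration (the enhanced Lesk variant cited above) may differ, and the example assumes the relevant NLTK corpora are installed.

# Minimal word-sense disambiguation sketch using NLTK's Lesk implementation.
# Requires the NLTK data packages 'punkt' and 'wordnet'.
from nltk.tokenize import word_tokenize
from nltk.wsd import lesk

sentence_1 = word_tokenize("The river bank was reinforced.")
sentence_2 = word_tokenize("The national bank was robbed.")

sense_1 = lesk(sentence_1, "bank")  # expected: a sense related to sloping land
sense_2 = lesk(sentence_2, "bank")  # expected: a sense related to a financial institution

print(sense_1, sense_1.definition() if sense_1 else None)
print(sense_2, sense_2.definition() if sense_2 else None)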

In tax2vec, the disambiguated word, mapped to the WordNet taxonomy, is then associated with a path in the taxonomy leading from the word to the root of the taxonomy. For example, the hypernym path (in WordNet-style notation) extracted for the word "astatine" follows the "hypernym of" relation through increasingly general terms until it reaches the term "entity" (the majority of hypernym paths end with "entity", as it represents one of the most general concepts in the taxonomy).

By finding this path to the root of the taxonomy for every word in an input document, a document-based taxonomy is constructed, which consists of all hypernyms of all words in the document. During the construction of the document-based taxonomy, document-level term counts are calculated: for each word w and document d, we count the number of times w or one of its hypernyms appears in d. After constructing the document-based taxonomies for all the documents in the corpus, the taxonomies are joined into a corpus-based taxonomy.
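For illustration, the following sketch extracts hypernym paths from WordNet with NLTK and accumulates document-level hypernym counts. For brevity it takes the first sense of each word instead of running Lesk disambiguation, so it approximates rather than reproduces the actual tax2vec procedure.

# Sketch: per-document hypernym counts built from WordNet hypernym paths.
# Requires the NLTK 'wordnet' corpus.
from collections import Counter
from nltk.corpus import wordnet as wn

def document_hypernym_counts(document):
    counts = Counter()
    for word in document.lower().split():
        synsets = wn.synsets(word)
        if not synsets:
            continue
        # First sense only, for brevity; tax2vec disambiguates with Lesk first.
        for path in synsets[0].hypernym_paths():
            counts.update(s.name() for s in path)  # e.g., 'mammal.n.01', 'entity.n.01'
    return counts

counts = document_hypernym_counts("the monkey climbed a tall tree")
print(counts.most_common(5))  # very general terms such as 'entity.n.01' dominate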

Note that processing of each document and constructing the document-based taxonomy is entirely independent from other documents, allowing us to process the documents in parallel and join the results only when constructing the joint corpus-based taxonomy.

The obtained counts can be used for feature construction directly; each term from the corpus-based taxonomy is associated with a feature, and a (potentially weighted) document-level term count is used as the feature value. The current implementation of tax2vec weighs the feature values according to the double normalization tf-idf scheme and calculates the feature value tf-idf(t, d) for hypernym t and document d as follows manning_raghavan_schütze_2008 :

tf-idf(t, d) = ( K + (1 − K) · f(t, d) / max_{t′ ∈ d} f(t′, d) ) · log( N / n_t )     (1)

In calculating the tf-idf value, the raw count f(t, d) of hypernym t in document d is normalized by max_{t′ ∈ d} f(t′, d), the raw count of the most common hypernym in the document. N represents the total number of documents in the corpus, n_t denotes the number of document-based taxonomies the hypernym t appears in (i.e. the number of documents that contain a hyponym of t), and K is the double normalization constant, set to 0.5 in this work. The term frequencies are normalized with respect to the most frequent term to prevent a bias towards longer documents.
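A small self-contained sketch of the weighting in Eq. (1), assuming the per-document hypernym counts have already been collected (the variable names are illustrative):

# Sketch of the double-normalized tf-idf weighting from Eq. (1), with K = 0.5.
import math

def hypernym_tfidf(counts_per_doc, term, doc_index, K=0.5):
    """counts_per_doc: list of dicts mapping hypernym -> raw count, one dict per document."""
    doc_counts = counts_per_doc[doc_index]
    if term not in doc_counts:
        return 0.0
    tf = K + (1 - K) * doc_counts[term] / max(doc_counts.values())
    N = len(counts_per_doc)                              # total number of documents
    n_t = sum(1 for d in counts_per_doc if term in d)    # documents containing the hypernym
    return tf * math.log(N / n_t)

docs = [{"entity.n.01": 7, "mammal.n.01": 2}, {"entity.n.01": 5}]
print(hypernym_tfidf(docs, "mammal.n.01", 0))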

3.3 Feature selection

The problem with the approach presented so far is that all hypernyms from the corpus-based taxonomy are considered, and therefore the number of columns in the feature matrix can grow to tens of thousands of terms. Including all these terms in the learning process introduces unnecessary noise and increases the spatial complexity. This necessitates the use of feature selection (see Definition 2 in Section 2.2) to reduce the number of features to a user-defined number k (a free parameter specified as part of the input). We next describe the scoring functions of the feature selection approaches considered in this work.

3.3.1 Feature selection by term counts

Intuitively, the rarest terms are the most document-specific and could provide additional information to the classifier. This is addressed in tax2vec by the simplest heuristic used in the algorithm: a term-count based heuristic, which takes the overall counts of all hypernyms across the document-based taxonomies, sorts them in ascending order according to their frequency of occurrence, and takes the top k rarest terms as features.
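A minimal sketch of this heuristic, assuming a dictionary of corpus-level hypernym counts:

# Sketch: keep the k hypernyms with the lowest overall counts ("rarest terms").
def rarest_terms(corpus_counts, k):
    """corpus_counts: dict mapping hypernym -> overall count across the corpus."""
    return [term for term, _ in sorted(corpus_counts.items(), key=lambda kv: kv[1])[:k]]

print(rarest_terms({"entity.n.01": 900, "mammal.n.01": 40, "astatine.n.01": 2}, k=2))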

3.3.2 Feature selection using term betweenness centrality

As the training corpus-specific taxonomy is not necessarily the same as the global (whole) taxonomy, the graph-theoretic properties of individual terms within the local taxonomy could provide a reasonable estimate of a term's importance. The proposed tax2vec implements the betweenness centrality (BC) brandes of individual terms as the scoring measure. The betweenness centrality of a node v is defined as:

BC(v) = Σ_{s ≠ v ≠ t} σ_{st}(v) / σ_{st}     (2)

where σ_{st} corresponds to the number of shortest paths (see Figure 2) between nodes s and t, and σ_{st}(v) corresponds to the number of those paths that pass through the node (hypernym) v. Intuitively, betweenness measures v's importance in the local taxonomy. Here, the terms are sorted in descending order according to their betweenness centrality, and again, the top k terms are used for learning.

Figure 2: An example shortest path. The path colored red represents the smallest number of edges needed to reach node C from node A.
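For illustration, the betweenness-based ranking of Eq. (2) can be computed on the corpus-based taxonomy with NetworkX; the tiny graph below is a toy stand-in for the real taxonomy.

# Sketch: rank hypernyms by betweenness centrality (Eq. 2) using NetworkX.
import networkx as nx

taxonomy = nx.Graph()  # toy corpus-based taxonomy; edges encode the hypernymy relation
taxonomy.add_edges_from([
    ("monkey.n.01", "mammal.n.01"),
    ("mammal.n.01", "animal.n.01"),
    ("animal.n.01", "entity.n.01"),
    ("wine.n.01", "beverage.n.01"),
    ("beverage.n.01", "entity.n.01"),
])

bc = nx.betweenness_centrality(taxonomy)
top_k = sorted(bc, key=bc.get, reverse=True)[:3]  # keep the k most central hypernyms
print(top_k)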

3.3.3 Feature selection using mutual information

The third heuristic, mutual information (MI) peng2005feature , aims to exploit the information from the labels, assigned to the documents used for training.

The MI between two discrete random variables represented as vectors X_j and C (i.e., the j-th hypernym feature and a target binary class) is defined as:

MI(X_j, C) = Σ_{x ∈ {0,1}} Σ_{c ∈ {0,1}} p(x, c) · log [ p(x, c) / ( p(x) · p(c) ) ]     (3)

where p(x) and p(c) correspond to the marginal distributions of the joint probability distribution p(x, c) of X_j and C. Note that for this step, tax2vec uses a binary feature representation, where the tf-idf features are rounded to the closest integer value (either 0 or 1). This way, only well represented features are taken into account. Further, tax2vec uses one-hot encodings of target classes, meaning that each target class vector consists exclusively of zeros and ones. For each of the target classes, tax2vec computes the mutual information between all hypernym features (i.e., the columns of the feature matrix F) and the given class. Hence, for each target class, a vector of mutual information scores is obtained, corresponding to the MI between individual hypernym features and that target class.

Finally, tax2vec sums the MI scores obtained for each target class to obtain the final score vector, which is then sorted in descending order. The first k hypernym features are used for learning. At this point tax2vec yields the selected features as a sparse matrix, keeping the spatial complexity proportional to the number of float-valued non-zero entries.
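The following sketch mirrors this procedure with scikit-learn's mutual_info_score: the tf-idf features are binarized, MI is computed against each one-vs-rest class indicator, the per-class scores are summed, and the top-k features are kept. It is an illustration of the described steps, not the library's internal code.

# Sketch of the MI-based selection: binarize features, compute MI against each
# one-vs-rest class indicator, sum over classes, and keep the top-k features.
import numpy as np
from sklearn.metrics import mutual_info_score

def mi_select(F_tfidf, labels, k):
    X = (np.rint(F_tfidf) > 0).astype(int)           # round tf-idf values to 0/1
    scores = np.zeros(X.shape[1])
    for c in np.unique(labels):                       # one-hot (one-vs-rest) targets
        y = (labels == c).astype(int)
        scores += [mutual_info_score(X[:, j], y) for j in range(X.shape[1])]
    return np.argsort(scores)[::-1][:k]               # indices of the k best features

F = np.array([[0.9, 0.1, 0.7], [0.8, 0.0, 0.2], [0.1, 0.6, 0.9]])
labels = np.array(["young", "young", "old"])
print(mi_select(F, labels, k=2))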

3.3.4 Personalized-PageRank-based hypernym ranking

Recent advances by Kralj et al. kralj2017heterogeneous in learning with extensive background knowledge for rule induction explore the use of the Personalized PageRank (PPR) algorithm for prioritizing a semantic search space. In tax2vec, we use the same idea to prioritize (score) hypernyms in the corpus-based taxonomy. In this section, we first briefly describe the Personalized PageRank algorithm and then describe how it is applied in tax2vec.

The PPR algorithm takes as input a network and a set of starting nodes in the network, and returns a vector assigning a score to each node in the input network. The scores of the nodes are calculated as the stationary distribution of the positions of a random walker that starts its walk on one of the starting nodes and, in each step, either randomly jumps from the current node to one of its neighbors (with probability p, fixed in our experiments) or jumps back to one of the starting nodes (with probability 1 − p). A detailed description of the Personalized PageRank variant used in tax2vec is given in Appendix A. The algorithm is used in tax2vec as follows:

  1. Identify the set of hypernyms in the corpus-based taxonomy to which the words in the input corpus map in the first step of tax2vec (described in Section 3.2).

  2. Run the PPR algorithm on the corpus-based taxonomy, using the hypernyms identified in step 1 as the starting set.

  3. Use the top k best-ranked hypernyms as candidate features.

Note that this heuristic offers global node ranks with respect to the corpus used.
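A sketch of the PPR-based ranking using NetworkX's personalized PageRank; the restart probability and the toy graph are illustrative choices, not the exact configuration used in tax2vec.

# Sketch: Personalized PageRank over the corpus-based taxonomy, restarting at the
# hypernyms that corpus words directly map to (step 1), then keeping the top-k (step 3).
import networkx as nx

taxonomy = nx.DiGraph()  # toy graph; edges point from a term towards its hypernym
taxonomy.add_edges_from([
    ("monkey.n.01", "mammal.n.01"),
    ("mammal.n.01", "animal.n.01"),
    ("animal.n.01", "entity.n.01"),
    ("wine.n.01", "beverage.n.01"),
    ("beverage.n.01", "entity.n.01"),
])

starting_nodes = {"monkey.n.01": 1.0, "wine.n.01": 1.0}  # hypernyms hit by corpus words
ranks = nx.pagerank(taxonomy, alpha=0.85, personalization=starting_nodes)
top_k = sorted(ranks, key=ranks.get, reverse=True)[:3]
print(top_k)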

3.4 tax2vec formulation

All the aforementioned steps form the basis of tax2vec, outlined in Algorithm 1.

Data: Training set documents D, training document labels L, taxonomy T, word-to-taxonomy mapping M, heuristic h, number of features k
1 for each document d ∈ D do
2       mappings_d := MapToTaxonomy(d, T, M);
3       storeTermCounts(d, mappings_d);
4 end for
5 F := constructFeatures(termCounts);
6 F_k := featureSelection(F, L, h, k);
7 return F_k;
Result: k new feature vectors in sparse vector format.
Algorithm 1: tax2vec pseudocode

First, tax2vec iterates through the given labeled document corpus (lines 1-4) and samples the word-term mappings for individual documents (the MapToTaxonomy method, line 2). In this process, the counts are stored in a hash-like structure, so that for each document, hypernym counts can be accessed in constant time (line 3, the storeTermCounts method). Once sampled, the counts are subject to processing, feature construction and selection (lines 5-6). Here, the featureSelection method yields the k best features according to the given heuristic h. The final result is thus a set of k novel feature vectors.

3.5 Additional implementation details

The tax2vec algorithm is implemented in Python 3, where the Multiprocessing (https://docs.python.org/2/library/multiprocessing.html), SciPy scipy and Numpy walt2011numpy libraries are used for fast (sparse), vectorized operations and parallelism. We developed a stand-alone library so that it fits as seamlessly as possible into existing text mining workflows; hence, Scikit-learn's model syntax was adopted pedregosa2011scikit . The algorithm is first initiated as an object, followed by the standard fit and transform calls:
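The original code listing is not reproduced here. As a self-contained illustration of the fit/transform pattern referred to above, the toy transformer below combines hypernym counting with the "rarest terms" heuristic; its class name, parameters and internals are illustrative and do not correspond to the released tax2vec API.

# Illustrative scikit-learn-style transformer (NOT the released tax2vec API):
# hypernym counting plus "rarest terms" selection behind fit/transform calls.
from collections import Counter

import numpy as np
from nltk.corpus import wordnet as wn


class HypernymFeatures:
    def __init__(self, num_features=10):
        self.num_features = num_features

    def fit(self, documents, labels=None):
        corpus_counts = Counter()
        for doc in documents:
            corpus_counts.update(self._doc_counts(doc))
        # keep the rarest hypernyms as features
        self.features_ = [t for t, _ in sorted(corpus_counts.items(),
                                               key=lambda kv: kv[1])[:self.num_features]]
        return self

    def transform(self, documents):
        rows = []
        for doc in documents:
            counts = self._doc_counts(doc)
            rows.append([counts[t] for t in self.features_])
        return np.array(rows, dtype=float)

    def _doc_counts(self, doc):
        counts = Counter()
        for word in doc.lower().split():
            synsets = wn.synsets(word)
            if synsets:
                counts.update(s.name() for s in synsets[0].hypernym_paths()[0])
        return counts


vectorizer = HypernymFeatures(num_features=5).fit(["the monkey ate a banana",
                                                   "wine was served at the ceremony"])
print(vectorizer.transform(["a mammal in the zoo"]))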

Such an implementation offers fast prototyping capabilities, ubiquitously needed in the development of learning algorithms and executable NLP and text mining workflows. Installation instructions along with download links are available in Section 7. We continue the discussion by explaining the experimental setting used to test the performance of tax2vec.

4 Experimental setting

This section presents the experimental setting used in testing the performance of tax2vec in document classification tasks. We begin by describing the data sets on which the method was tested. Next, we describe the classifiers used to assess the utility of features constructed using tax2vec, along with the baseline approaches. We then describe the methodology used to explore the qualitative properties of the obtained corpus-based taxonomies, followed by the metrics used to assess classification performance and the description of the experiments.

4.1 Data sets

We tested the effects of the features produced with tax2vec on six different class-labeled text data sets, summarized in Table 1, intentionally chosen from different domains.

data set (target) Classes Words Unique words Documents MNS
PAN 2017 (Gender) 2 5,169,966 607,474 3,600 102
MBTI (Personality) 16 11,832,937 372,811 8,676 89
PAN 2016 (Age) 5 943,880 178,450 403 202
BBC news (Topic) 4 544,872 43,525 1,406 76
Drugs (Side effects) 4 385,746 27,257 3,107 3
Drugs (Overall effect) 4 385,746 27,257 3,107 3
Table 1: Data sets used for experimental evaluation of tax2vec’s impact on learning. Note that MNS corresponds to the maximum number of text segments (max. number of tweets or comments per user or number of news paragraphs as presented in Appendix B).

The first four data sets are composed of short documents appearing in social media, where we consider classification of tweets and news.

We also consider two biomedical data sets related to drug consumption. Here, the same training instances were used to predict two different targets: the drug's overall effect and its side effects (see Table 1).

4.2 The classifiers used

As tax2vec serves as a preprocessing method for data enrichment with semantic features, arbitrary classifiers can use semantic features for learning. We use the following learners:

4.2.1 PAN 2017 approach

An SVM-based approach which relies heavily on the method proposed by Martinc et al. Martinc2017PAN2A for the author profiling task in the PAN 2017 shared task rangel2017overview . This method is based on sophisticated hand-crafted features calculated on different levels of preprocessed text. The following features were used:

  1. tf-idf weighted word unigrams calculated on lower-cased text with stopwords removed;

  2. tf-idf weighted word bigrams calculated on lower-cased text with punctuation removed;

  3. tf-idf weighted word bound character tetragrams calculated on lower-cased text;

  4. tf-idf weighted punctuation trigrams (the so-called beg-punct SapkotaBMS15 , in which the first character is punctuation but other characters are not) calculated on lower-cased text;

  5. tf-idf weighted suffix character tetragrams (the last four letters of every word that is at least four characters long SapkotaBMS15 ) calculated on lower-cased text;

  6. emoji counts: the number of emojis in the document, counted by using the list of emojis created by novak2015sentiment (http://kt.ijs.si/data/Emoji_sentiment_ranking/). This feature is only useful if the input text contains emojis;

  7. document sentiment: the above-mentioned emoji list also contains the sentiment of a specific emoji, which allowed us to calculate the sentiment of the entire document by simply adding the sentiment of all the emojis in the document. Again, this feature is only useful if the input text contains emojis;

  8. character flood counts: the number of times that sequences of three or more identical characters appear in the document.

In contrast to the original approach proposed in Martinc2017PAN2A , we do not use POS tag sequences as features, and the Logistic regression classifier is replaced by a linear SVM. Here, we experimented with the regularization parameter C, for which a range of values was tested. This SVM variation is from this point on referred to as "SVM (Martinc et al.)". As the feature construction pipeline contains too many parameters for an exhaustive grid search to be computationally feasible, we did not experiment with the feature construction parameters and kept the state-of-the-art configuration proposed in the original study.

4.2.2 Linear SVMs, automatic feature construction

The second learner is a libSVM linear classifier chang2011libsvm , trained on a predefined number of word and character level n-grams, constructed using Scikit-learn's TfidfVectorizer method. To find the best setting, we varied the SVM's C parameter, the number of word features and the number of character features over predefined grids. Note that the word features were sorted by decreasing frequency, and that n-grams of lengths between two and six were considered. This SVM variation is from this point on referred to as "SVM (generic)".
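A sketch of this kind of pipeline with scikit-learn; the concrete grid values (n-gram ranges, feature counts, C) are illustrative placeholders rather than the grids used in the paper.

# Sketch of the "SVM (generic)" setup: word and character tf-idf n-grams feeding
# a linear SVM. The parameter values are illustrative, not the paper's grids.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ("features", FeatureUnion([
        ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2), max_features=10000)),
        ("char", TfidfVectorizer(analyzer="char", ngram_range=(2, 6), max_features=10000)),
    ])),
    ("svm", LinearSVC(C=50)),
])

documents = ["cheap flights to new york", "parliament votes on the new budget"]
labels = ["travel", "politics"]
pipeline.fit(documents, labels)
print(pipeline.predict(["budget vote delayed"]))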

4.2.3 Hierarchical attention networks

The first neural network baseline is the recently introduced hierarchical attention network yang2016hierarchical . Here, we performed a grid search over hidden layer sizes, embedding sizes, batch sizes and the number of training epochs. For a detailed explanation of the architecture, please refer to the original contribution yang2016hierarchical . We discuss the best-performing architecture in Section 5 below.

4.2.4 Deep feedforward neural networks

As tax2vec constructs feature vectors, we also attempted to use them as inputs to a standard feedforward neural network architecture lecun2015deep ; schmidhuber2015deep . Here, we performed a grid search across hidden layer settings (where, for example, a setting of (128, 64) corresponds to a two hidden layer neural network with 128 neurons in the first hidden layer and 64 in the second), batch sizes and the number of training epochs. The two deep architectures were implemented using TensorFlow 45166 , and trained using an Nvidia Tesla K40 GPU.

4.3 Statistical properties of the semantic space: qualitative exploration

As the proposed approach is entirely symbolic (each feature can be unambiguously traced back to a unique hypernym), we explored the feature space qualitatively by examining the statistical properties of the induced taxonomy using graph-statistical approaches. Here, we modeled hypernym frequency distributions to investigate a possible similarity with Zipf's law piantadosi2014zipf . The analysis was performed using the Py3plex library 10.1007/978-3-030-05411-3_60 . We also visualized the document-based taxonomy of the PAN (Age) data set using Cytoscape shannon2003cytoscape .

As the proposed experimental setup, performing a grid search over several parameters, is computationally expensive, the majority of the experiments were conducted using the SLING supercomputing architecture (http://www.sling.si/sling/).

4.4 Description of the experiments

The experiments were set up as follows. For the drug-related data sets, we used the splits given in the original paper GraBer:2018:ASA:3194658.3194677 . For the other data sets, we trained the classifiers using stratified splits; for each classifier, 10 such splits were obtained. The measure used in all cases is F1, where for the multiclass problems (e.g., MBTI), we use the micro-averaged F1. All experiments were repeated five times using different random seeds. The features obtained using tax2vec are used in combination with the SVM classifiers, while the other classifiers are used as baselines. (Note that simple feedforward neural networks could also be used in combination with hypernym features; we leave such computationally expensive experiments for further work.)
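An illustrative sketch of this evaluation protocol (repeated stratified splits, micro-averaged F1) with scikit-learn; the 90/10 split ratio, the dummy classifier and the synthetic data are assumptions made only for the example.

# Sketch of the evaluation protocol: repeated stratified splits and micro-averaged F1.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedShuffleSplit

X = np.random.rand(100, 20)              # placeholder feature matrix
y = np.random.randint(0, 3, size=100)    # placeholder multiclass labels

splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.1, random_state=0)
scores = []
for train_idx, test_idx in splitter.split(X, y):
    clf = DummyClassifier(strategy="most_frequent").fit(X[train_idx], y[train_idx])
    scores.append(f1_score(y[test_idx], clf.predict(X[test_idx]), average="micro"))
print(np.mean(scores))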

5 Classification results and qualitative evaluation

In this section we provide the results obtained by conducting the experiments outlined in the previous section. We begin by discussing the overall classification performance with respect to different heuristics used. Next, we discuss how tax2vec augments the learner’s ability to classify when the number of text segments per user is reduced.

5.1 Classification performance evaluation

We first present the classification results in the form of critical distance diagrams. The diagrams show average ranks of the different algorithms according to the (micro) F1 measure. For each data set, we selected the best performing parametrization. A red line connects groups of classifiers that are not statistically significantly different from each other at the chosen confidence level. The significance levels are computed using the Friedman test for multiple comparisons, followed by Nemenyi post-hoc correction demvsar2006statistical . The overall classification results are summarized in Figure 3, Figure 4 and Figure 5.

Figure 3: Overall classifier performance. The best (on average) performing classifier is an SVM classifier augmented with semantic features, selected using either simple frequency counts or closeness centrality.
Figure 4: Effect of semantic features on average classifier rank. Up to 100 semantic features positively affect the classifiers' performance.
Figure 5: Overall model performance. SVMs dominate the short text classification. The diagram shows performance averaged over all data sets, where the best model parameterizations (see Table 2) were used for comparison.
Semantic features Learner PAN (Age) PAN (Gender) MBTI BBC News Drugs (effect) Drugs (side)
0 DNN 0.4 0.511 0.182 0.353 0.4 0.321
HILSTM 0.422 0.752 0.407 0.833 0.443 0.514
SVM (Martinc et al.) 0.417 0.814 0.682 0.983 0.468 0.503
SVM (generic) 0.424 0.751 0.556 0.967 0.445 0.462
10 SVM (Martinc et al.) 0.445 0.815 0.679 0.996 0.47 0.506
SVM (generic) 0.502 0.781 0.556 0.972 0.445 0.469
25 SVM (Martinc et al.) 0.454 0.814 0.681 0.984 0.468 0.5
SVM (generic) 0.484 0.755 0.554 0.967 0.449 0.466
50 SVM (Martinc et al.) 0.439 0.814 0.681 0.983 0.462 0.499
SVM (generic) 0.444 0.751 0.554 0.963 0.446 0.463
100 SVM (Martinc et al.) 0.424 0.816 0.678 0.984 0.466 0.496
SVM (generic) 0.422 0.749 0.551 0.958 0.443 0.46
500 SVM (Martinc et al.) 0.383 0.797 0.662 0.975 0.45 0.477
SVM (generic) 0.4 0.724 0.532 0.909 0.424 0.438
1000 SVM (Martinc et al.) 0.368 0.783 0.647 0.964 0.436 0.466
SVM (generic) 0.373 0.701 0.512 0.851 0.407 0.42
Table 2: Effect of the added semantic features on classification performance, where all text segments (tweets/comments per user or segments per news article) are used. The best performing feature selection heuristic for the majority of top performing classifiers was "rarest terms" or "PPR", indicating that only a handful of hypernyms carry added value relevant for classification. Note that the results in the table correspond to the best performing combination of a classifier and a given heuristic.

The F1 scores are also presented in Table 2. It can be observed that up to 100 semantic features help the SVM learners achieve better scores. The most apparent improvement can be observed for the PAN 2016 (Age) data set, where the task was to predict age; here, 10 semantic features notably improved the classifiers' performance. Further, a minor improvement over the state of the art was also observed on the PAN 2017 (Gender) data set and on BBC news categorization. Hierarchical attention networks outperformed all other learners for the task of side effects prediction, yet semantics-augmented SVMs outperformed the neural models when general drug effects were considered as target classes. No performance improvements were offered by tax2vec on the MBTI data set.

The best (on average) performing C parameter for both SVM models was 50. The number of features that performed the best for all SVMs proposed in this study is 100,000. The HILSTM architecture’s topology varied between data sets, yet we observed that the best results were obtained when more than 15 epochs of training were conducted, combined with the hidden layer size of 64 neurons, where the size of the attention layer was of the same dimension.

5.2 Few-shot (per instance) learning

As discussed in the introductory sections, one of the goals of this paper was also to explore the setting where only a handful of text segments per user are considered. Even though such a setting is not strictly few-shot learning snell2017prototypical , reducing the number of text segments per instance aims to simulate a similar setting where limited information is available. In Table 3, we present the results for the setting where only (up to) 10 text segments (e.g., tweets or news paragraphs) per instance were used for training. The segments were sampled randomly. Only a single text segment per user was considered for the medical texts, as they consist of at most three commentaries. Similarly, as the BBC news data set consists of news article-genre pairs, we split the news articles into sentences, which we randomly sampled. The rationale for such sampling is that it allows us to evaluate tax2vec's performance when, for example, only a handful of sentences are available (e.g., only the abstract).

Semantic features Learner PAN (Age) PAN (Gender) MBTI BBC News Drugs (effect) Drugs (side)
0 SVM (Martinc et al.) 0.378 0.617 0.288 0.977 0.468 0.503
SVM (generic) 0.429 0.554 0.225 0.936 0.445 0.462
10 SVM (Martinc et al.) 0.39 0.616 0.292 0.981 0.47 0.503
SVM (generic) 0.429 0.557 0.225 0.948 0.444 0.464
25 SVM (Martinc et al.) 0.429 0.618 0.288 0.979 0.465 0.5
SVM (generic) 0.439 0.562 0.226 0.933 0.445 0.458
50 SVM (Martinc et al.) 0.402 0.617 0.288 0.974 0.474 0.504
SVM (generic) 0.42 0.557 0.225 0.919 0.442 0.46
100 SVM (Martinc et al.) 0.382 0.614 0.286 0.974 0.476 0.493
SVM (generic) 0.411 0.552 0.223 0.906 0.437 0.457
500 SVM (Martinc et al.) 0.359 0.604 0.276 0.959 0.465 0.471
SVM (generic) 0.365 0.548 0.22 0.8 0.419 0.435
1000 SVM (Martinc et al.) 0.34 0.59 0.266 0.925 0.442 0.46
SVM (generic) 0.359 0.535 0.213 0.704 0.412 0.417
Table 3: Effect of added semantic features to classification performance—few shot learning.

We observe that tax2vec based features improve the learners' performance on all of the data sets. Here, up to 50 semantic features are observed to increase the performance, most notably on the drug effects data. This result could indicate that even a small amount of text per instance contains enough semantic information to improve the classification performance.

5.3 Interpretation of results

In this section we attempt to explain the intuition behind the effect of semantic features on the classifiers' performance. Note that the best performing SVM models consisted of thousands of tf-idf word and character level features, yet as few as 100 added semantic features notably improved the performance. We believe such an effect can be understood via the way SVMs learn from high-dimensional data. With each new feature, we increase the dimensionality of the feature space, and even a single added feature potentially impacts the hyperplane construction. Thus, otherwise problem-irrelevant features can become relevant when novel features are added. We believe that adding semantic features to an otherwise unordered (raw) word tf-idf vector space introduces new information crucial for successful learning, and potentially aligns the remainder of the features so that the classifier can better separate the points of interest.

Another explanation for the notable differences in predictive performance is possibly related to the small data set sizes, where only a handful of features can be relevant and thus notably impact a given classifier's performance.

5.4 Qualitative assessment

In this section we discuss the qualitative properties of the obtained corpus-based taxonomies. We present the results concerning hypernym frequency distributions, as well as the overall structure of an example corpus-based taxonomy. The examples in this section are all based on the corpus-based taxonomy, constructed from the PAN (Age) data set. The results of fitting various heavy-tailed distributions to the hypernym frequencies are given in Figure 6.

Figure 6: Hypernym frequency distribution for the PAN (Age) data set. The equation above the upper plot gives the coefficients of the fitted power law distribution. In real-world phenomena, power law exponents are typically observed to range between 2 and 3, indicating that the hypernym structure of the feature space is subject to a heavy-tailed (possibly best fit: power law) distribution. The x_min value denotes the hypernym count above which scale-free behavior is observed. Such a distribution is to some extent expected, as some hypernyms are more general than others, and are thus present in more document-hypernym mappings.

We fitted power law, truncated power law, log-normal and exponential distributions to the hypernym frequency data. For a detailed overview of these distributions we refer the reader to foss2011introduction . One of the key properties we observed was whether the underlying hypernym distribution is exponential or not, as non-exponential (heavy-tailed) distributions indicate similarity with the well-known Zipf's law piantadosi2014zipf . The hypernym corpus-based taxonomy is visualized in Figure 7.

Figure 7: Topological structure of the hypernym space, induced from the PAN (Age) data set. Multiple connected components emerged, indicating not all hypernyms map to the same high-level concepts. Such segmentation is data set-specific, and can also potentially provide the means to compare semantic spaces of different data sets.

Here, each node represents a hypernym obtained in word-to-hypernym mapping phase of tax2vec. The edges represent the hypernymy relation between a given pair of hypernyms.

We next present the results of modeling the corpus-based hypernym frequency distributions. The two functions representing the best fit to the hypernym frequency distributions are indeed the power law and the truncated power law. As similar behavior is well known for word frequencies in documents piantadosi2014zipf , we believe hypernym distributions are a natural extension: if a high-frequency word maps to a given hypernym, that hypernym will be relatively more common with respect to the occurrence of other hypernyms.
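The fits reported here were produced with the Py3plex library; as an illustration, an equivalent analysis can be run with the standalone powerlaw package, as sketched below on synthetic Zipf-like counts (the package choice and data are assumptions for the example only).

# Sketch: compare heavy-tailed fits to hypernym frequencies with the `powerlaw` package.
import numpy as np
import powerlaw

hypernym_counts = np.random.zipf(a=2.3, size=2000)   # synthetic Zipf-like counts

fit = powerlaw.Fit(hypernym_counts, discrete=True)
print("alpha =", fit.power_law.alpha, "xmin =", fit.power_law.xmin)

# A positive log-likelihood ratio R favors the power law over the exponential fit.
R, p = fit.distribution_compare("power_law", "exponential")
print(R, p)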

We observe that multiple connected components of varying sizes emerge. There exists only a single largest connected component, which consists of more general noun hypernyms, such as entity and similar. Interestingly, many smaller components also emerged, indicating that parts of the word vector space map to very specific, disconnected parts of the WordNet taxonomy. Some of the small disconnected components consist of verb hypernyms, indicating that verb-level semantics can also be captured and taken into account.

6 Interpretability of tax2vec

As discussed in the previous sections, tax2vec selects a set of hypernyms according to a given heuristic and uses them for learning. One of the key benefits of such an approach is that the selected semantic features can easily be inspected, hence potentially offering interesting insights into the semantics underlying the problem at hand.

We discuss here a set of 30 features which emerged as relevant according to the "mutual information" heuristic when the BBC News and PAN (Age) data sets were used for learning. Here, tax2vec was trained on 90% of the data, while the remaining 10% was held out as a test set. The features and their corresponding mutual information scores are shown in Table 4.

Sorted target class-mutual information pairs
Semantic feature Average MI Class 1 Class 2 Class 3 Class 4 Class 5
BBC News data set
tory.n.03 0.057 politics:0.14 entertainment:0.05 business:0.03 sport:0.01 x
movie.n.01 0.059 business:0.14 politics:0.04 entertainment:0.04 sport:0.02 x
conservative.n.01 0.061 politics:0.15 entertainment:0.05 business:0.03 sport:0.01 x
vote.n.02 0.061 business:0.15 entertainment:0.04 politics:0.04 sport:0.02 x
election.n.01 0.063 entertainment:0.16 business:0.05 politics:0.04 sport:0.0 x
topology.n.04 0.063 entertainment:0.16 business:0.05 politics:0.04 sport:0.0 x
mercantile_establishment.n.01 0.068 politics:0.17 business:0.07 entertainment:0.03 sport:0.01 x
star_topology.n.01 0.069 politics:0.17 business:0.07 entertainment:0.03 sport:0.01 x
rightist.n.01 0.074 politics:0.18 business:0.06 entertainment:0.04 sport:0.01 x
marketplace.n.02 0.087 entertainment:0.22 business:0.06 politics:0.05 sport:0.01 x
PAN (Age) data set
hippie.n.01 0.007 25-34:0.01 35-49:0.01 18-24:0.0 65-xx:0.0 50-64:0.0
ceremony.n.03 0.007 25-34:0.01 35-49:0.01 18-24:0.01 65-xx:0.0 50-64:0.0
resource.n.02 0.008 50-64:0.02 18-24:0.01 25-34:0.0 65-xx:0.0 35-49:0.0
draw.v.07 0.008 25-34:0.02 35-49:0.01 50-64:0.01 65-xx:0.0 18-24:0.0
observation.n.02 0.008 25-34:0.02 35-49:0.01 50-64:0.01 65-xx:0.0 18-24:0.0
wine.n.01 0.008 35-49:0.02 25-34:0.01 18-24:0.01 50-64:0.01 65-xx:0.0
suck.v.02 0.008 25-34:0.02 50-64:0.02 35-49:0.0 65-xx:0.0 18-24:0.0
sleep.n.03 0.008 25-34:0.02 50-64:0.02 35-49:0.0 65-xx:0.0 18-24:0.0
recognize.v.09 0.009 25-34:0.02 35-49:0.02 18-24:0.0 50-64:0.0 65-xx:0.0
weather.v.04 0.009 25-34:0.02 50-64:0.02 35-49:0.0 18-24:0.0 65-xx:0.0
invention.n.02 0.009 25-34:0.02 35-49:0.01 18-24:0.01 50-64:0.0 65-xx:0.0
yankee.n.03 0.01 50-64:0.02 18-24:0.01 25-34:0.01 35-49:0.0 65-xx:0.0
Table 4: Most informative features with respect to the target class (ranked by MI) – Classes represent news topics (BBC) and different age intervals (PAN (Age)). Individual target classes are sorted according to descending mutual information with respect to a given feature

We can observe that the “sport” topic (BBC data set) is not well associated with the prioritized features. On the contrary, terms such as “rightist” and “conservative” emerged as relevant for classifying into the “politics” class. Similarly, “marketplace” for example, appeared relevant for classifying into the “entertainment” class. Even more interesting associations emerged when the same feature ranking was conducted on the PAN (Age) data set. Here, terms such as “resource” and “wine” were relevant for classifying middle-aged (“wine”) and older adult (“resource”) populations. Note that the older population (65-xx class) was not associated with any of the hypernyms. We believe the reason for this is that the number of available tweets decreases with age.

We repeated a similar experiment (BBC data set) using the “rarest terms” heuristic. The terms which emerged are:

’problem.n.02’, ’question.n.02’, ’riddle.n.01’, ’salmon.n.04’, ’militia.n.02’, ’orphan.n.04’, ’taboo.n.01’, ’desertion.n.01’, ’dearth.n.02’, ’outfitter.n.02’, ’scarcity.n.01’, ’vasodilator.n.01’, ’dilator.n.02’, ’fluoxetine.n.01’, ’high blood pressure.n.01’, ’amlodipine besylate.n.01’, ’drain.n.01’, ’imperative mood.n.01’, ’fluorescent.n.01’, ’veneer.n.01’, ’autograph.n.01’, ’oak.n.02’, ’layout.n.01’, ’wall.n.01’, ’firewall.n.03’, ’workload.n.01’, ’manuscript.n.02’, ’cake.n.01’, ’partition.n.01’, ’plasterboard.n.01’

Even though this feature selection method is unsupervised (not directly associated with classes), we can immediately observe that the features correspond to different topics, ranging from medicine (e.g., "high blood pressure") and politics (e.g., "militia") to food (e.g., "cake") and more, indicating that the rarest hypernyms are indeed diverse, and as such potentially useful for the learner.

The results suggest that tax2vec could potentially also be used to inspect the semantic background of a given data set directly, regardless of the learning task. We believe there are many potential uses for the obtained features, including the following, to be addressed in further work.

  • Concept drift detection, i.e. topics change over time; could it be qualitatively detected?

  • Topic domination, i.e. what type of topic is dominant with respect to e.g., a geographical region inspected?

  • What other learning tasks can benefit by using second level semantics? Can the obtained features be used, for example, for fast keyword search?

7 Availability

TBA

8 Conclusions and future work

In this work we propose tax2vec, a parallel algorithm for taxonomy-based enrichment of text documents. Tax2vec first maps the words from individual documents to their hypernym counterparts, which are considered as candidate features, and weighs their values according to a normalized tf-idf metric. To select only a user-specified number of relevant features, tax2vec implements multiple feature selection heuristics. The sparse matrix of constructed features is finally used alongside the bag-of-words document representations for the task of text classification, where we study its performance on small data sets, in which both the number of text segments per user and the overall number of users are small.

Tax2vec considerably improves the classification performance especially on data sets consisting of tweets, but also on the news. The proposed implementation offers a simple-to-use API, which facilitates inclusion into existing text preprocessing workflows.

One of the drawbacks we plan to address is the support for arbitrary directed acyclic multigraphs—structures commonly used to represent background knowledge. Support for such knowledge would offer a multitude of applications in e.g., biology, where gene ontology and other resources which annotate entities of interest are freely available.

In this work we focus on the BoW representation of documents, yet we believe tax2vec could also be used alongside Continuous Bag-of-Words (CBoW) models. We leave such experimentation for further work.

Even though we use Lesk for the disambiguation task, we believe recent advancements in neural disambiguation iacobacci2016embeddings could also be a “drop-in” replacement for this part of tax2vec. We leave the exploration of such options for further work.

Further work also includes joining the tax2vec features with existing state-of-the-art deep learning approaches, such as the hierarchical attention networks, which are, according to this study, not very suitable for learning on scarce data sets. We believe that the introduction of semantics into deep learning could be beneficial both for performance and for the interpretability of currently poorly understood black-box models.

Finally, as the main benefit of tax2vec is its explanatory power, we believe it could be used for fast keyword search; here, for example, newly published news articles could be used as inputs, where the ranked list of semantic features could directly serve as candidate keywords.

Acknowledgements

The work of the first author was funded by the Slovenian Research Agency through a young researcher grant (TSP). The work of other authors was supported by the Slovenian Research Agency (ARRS) core research programme Knowledge Technologies (P2-0103), the ARRS funded research project Semantic Data Mining for Linked Open Data (financed under the ERC Complementary Scheme, N2-0078), and the European Union's Horizon 2020 research and innovation programme under grant agreement No 825153, project EMBEDDIA (Cross-Lingual Embeddings for Less-Represented Languages in European News Media). We also gratefully acknowledge the support of NVIDIA Corporation through the donation of a Titan-XP GPU.

References


  • (1) F. Sebastiani, Machine learning in automated text categorization, ACM Comput. Surv. 34 (1) (2002) 1–47.
  • (2) D. Tang, B. Qin, T. Liu, Document modeling with gated recurrent neural network for sentiment classification, in: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 1422–1432.

  • (3) M. Kusner, Y. Sun, N. Kolkin, K. Weinberger, From word embeddings to document distances, in: International Conference on Machine Learning, 2015, pp. 957–966.
  • (4) F. Rangel, P. Rosso, M. Potthast, B. Stein, Overview of the 5th author profiling task at pan 2017: Gender and language variety identification in twitter, Working Notes Papers of the CLEF.
  • (5) A. Ławrynowicz, Semantic Data Mining: An Ontology-based Approach, Vol. 29, IOS Press, 2017.
  • (6) E. Angelino, N. Larus-Stone, D. Alabi, M. Seltzer, C. Rudin, Learning certifiably optimal rule lists, in: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2017, pp. 35–44.
  • (7) A. Vavpetič, N. Lavrač, Semantic subgroup discovery systems and workflows in the sdm-toolkit, The Computer Journal 56 (3) (2013) 304–320.
  • (8) M. Perovšek, A. Vavpetič, B. Cestnik, N. Lavrač, A wordification approach to relational data mining, in: J. Fürnkranz, E. Hüllermeier, T. Higuchi (Eds.), Discovery Science, Springer Berlin Heidelberg, Berlin, Heidelberg, 2013, pp. 141–154.
  • (9) P. R. Adhikari, A. Vavpetič, J. Kralj, N. Lavrač, J. Hollmén, Explaining mixture models through semantic pattern mining and banded matrix visualization, Machine Learning 105 (1) (2016) 3–39.
  • (10) S. Scott, S. Matwin, Text classification using wordnet hypernyms, Usage of WordNet in Natural Language Processing Systems.
  • (11) C. Kim, P. Yin, C. X. Soto, I. K. Blaby, S. Yoo, Multimodal biological analysis using NLP and expression profile, in: 2018 New York Scientific Data Summit (NYSDS), 2018, pp. 1–4.
  • (12) S. Chang, W. Han, J. Tang, G.-J. Qi, C. C. Aggarwal, T. S. Huang, Heterogeneous network embedding via deep architectures, in: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’15, ACM, New York, NY, USA, 2015, pp. 119–128.
  • (13) L. C. Freeman, Research Methods in Social Network Analysis, Routledge, 2017.
  • (14) J. Wang, Z. Wang, D. Zhang, J. Yan, Combining knowledge with deep convolutional neural networks for short text classification, in: Proceedings of IJCAI, Vol. 350, 2017.
  • (15) F. Rangel, P. Rosso, I. Chugur, M. Potthast, M. Trenkmann, B. Stein, B. Verhoeven, W. Daelemans, Overview of the 2nd author profiling task at PAN 2014, in: Working Notes Papers of the CLEF 2014, 2014, pp. 1–30.
  • (16) F. Rangel, P. Rosso, B. Verhoeven, W. Daelemans, M. Potthast, B. Stein, Overview of the 4th author profiling task at PAN 2016: Cross-genre evaluations, in: Working Notes Papers of the CLEF 2016, 2016, pp. 750–784.
  • (17) Z. Chu, S. Gianvecchio, H. Wang, S. Jajodia, Detecting automation of twitter accounts: Are you a human, bot, or cyborg?, IEEE Transactions on Dependable and Secure Computing 9 (6) (2012) 811–824.
  • (18) Z. Chu, S. Gianvecchio, H. Wang, S. Jajodia, Who is tweeting on twitter: human, bot, or cyborg?, in: Proceedings of the 26th annual computer security applications conference, ACM, 2010, pp. 21–30.
  • (19) F. Grässer, S. Kallumadi, H. Malberg, S. Zaunseder, Aspect-based sentiment analysis of drug reviews applying cross-domain and cross-data learning, in: Proceedings of the 2018 International Conference on Digital Health, DH ’18, ACM, New York, NY, USA, 2018, pp. 121–125.
  • (20) R. Boyce, G. Gardner, H. Harkema, Using natural language processing to identify pharmacokinetic drug-drug interactions described in drug package inserts, in: Proceedings of the 2012 Workshop on Biomedical Natural Language Processing, Association for Computational Linguistics, 2012, pp. 206–213.
  • (21) J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805.
  • (22) J. Cho, K. Lee, E. Shin, G. Choy, S. Do, How much data is needed to train a medical image deep learning system to achieve necessary high accuracy?, arXiv preprint arXiv:1511.06348.
  • (23) R. Socher, M. Ganjoo, C. D. Manning, A. Ng, Zero-shot learning through cross-modal transfer, in: Advances in neural information processing systems, 2013, pp. 935–943.
  • (24) J. Snell, K. Swersky, R. Zemel, Prototypical networks for few-shot learning, in: Advances in Neural Information Processing Systems, 2017, pp. 4077–4087.
  • (25) T. K. Landauer, Latent Semantic Analysis, Wiley Online Library, 2006.
  • (26) D. M. Blei, A. Y. Ng, M. I. Jordan, Latent dirichlet allocation, Journal of machine Learning research 3 (1) (2003) 993–1022.
  • (27) T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: Advances in Neural Information Processing Systems 26, Curran Associates, Inc., 2013, pp. 3111–3119.
  • (28) L. Cagliero, P. Garza, Improving classification models with taxonomy information, Data & Knowledge Engineering 86 (2013) 85–101.
  • (29) N. Xu, J. Wang, G. Qi, T. S. Huang, W. Lin, Ontological random forests for image classification, in: Computer Vision: Concepts, Methodologies, Tools, and Applications, IGI Global, 2018, pp. 784–799.
  • (30) M. K. Elhadad, K. M. Badran, G. I. Salama, A novel approach for ontology-based feature vector generation for web text document classification, International Journal of Software Innovation (IJSI) 6 (1) (2018) 1–10.
  • (31) R. Kaur, M. Kumar, Domain ontology graph approach using markov clustering algorithm for text classification, in: International Conference on Intelligent Computing and Applications, Springer, 2018, pp. 515–531.
  • (32) T. N. Mansuy, R. J. Hilderman, Evaluating wordnet features in text classification models., in: FLAIRS Conference, 2006, pp. 568–573.
  • (33) M. Bunge, Causality and Modern Science, Routledge, 2017.
  • (34) J. Pearl, Causality, Cambridge university press, 2009.
  • (35) U. Stańczyk, L. C. Jain, Feature Selection for Data and Pattern Recognition, Springer, 2015.
  • (36) A. G. Kakisim, I. Sogukpinar, Unsupervised binary feature construction method for networked data, Expert Systems with Applications 121 (2019) 256 – 265.
  • (37) N. Tomašev, K. Buza, K. Marussy, P. B. Kis, Hubness-aware classification, Instance Selection and Feature Construction: Survey and Extensions to Time-series, in: Feature selection for data and pattern recognition, Springer, 2015, pp. 231–262.
  • (38) G. Chandrashekar, F. Sahin, A survey on feature selection methods, Computers & Electrical Engineering 40 (1) (2014) 16–28.
  • (39) Z. M. Hira, D. F. Gillies, A review of feature selection and feature extraction methods applied on microarray data, Advances in Bioinformatics 2015.
  • (40) A. Grover, J. Leskovec, Node2vec: Scalable feature learning for networks, in: Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, ACM, New York, NY, USA, 2016, pp. 855–864.
  • (41) Y. Dong, N. V. Chawla, A. Swami, Metapath2vec: Scalable representation learning for heterogeneous networks, in: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’17, ACM, New York, NY, USA, 2017, pp. 135–144.
  • (42) S. Jaeger, S. Fulle, S. Turk, Mol2vec: Unsupervised machine learning approach with chemical intuition, Journal of Chemical Information and Modeling 58 (1) (2018) 27–35.
  • (43) L. F. Ribeiro, P. H. Saverese, D. R. Figueiredo, Struc2vec: Learning node representations from structural identity, in: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’17, ACM, New York, NY, USA, 2017, pp. 385–394.
  • (44) T. N. Kipf, M. Welling, Semi-supervised classification with graph convolutional networks, in: International Conference on Learning Representations (ICLR), 2017.
  • (45) W. Hamilton, Z. Ying, J. Leskovec, Inductive representation learning on large graphs, in: Advances in Neural Information Processing Systems 30, Curran Associates, Inc., 2017, pp. 1024–1034.
  • (46) F. Železnỳ, N. Lavrač, Propositionalization-based relational subgroup discovery with RSD, Machine Learning 62 (1-2) (2006) 33–63.
  • (47) M. Perovšek, A. Vavpetič, B. Cestnik, N. Lavrač, A wordification approach to relational data mining, in: International Conference on Discovery Science, Springer, 2013, pp. 141–154.
  • (48) G. A. Miller, WordNet: A lexical database for English, Commun. ACM 38 (11) (1995) 39–41.
  • (49) A. Gonzalez-Agirre, E. Laparra, G. Rigau, Multilingual central repository version 3.0: upgrading a very large lexical knowledge base, in: Proceedings of the 6th Global WordNet Conference (GWC 2012), Matsue, 2012.
  • (50) R. Navigli, Word sense disambiguation: A survey, ACM Comput. Surv. 41 (2) (2009) 10:1–10:69.
  • (51) P. Basile, A. Caputo, G. Semeraro, An enhanced lesk word sense disambiguation algorithm through a distributional semantic model, in: Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, 2014, pp. 1591–1600.
  • (52) C. D. Manning, P. Raghavan, H. Schütze, Scoring, term weighting, and the vector space model, Cambridge University Press, 2008, p. 100–123.
  • (53) U. Brandes, A faster algorithm for betweenness centrality, The Journal of Mathematical Sociology 25 (2) (2001) 163–177.
  • (54) H. Peng, F. Long, C. Ding, Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy, IEEE Transactions on pattern analysis and machine intelligence 27 (8) (2005) 1226–1238.
  • (55) J. Kralj, Heterogeneous information network analysis for semantic data mining, Ph.D. thesis (2017).
  • (56) E. Jones, T. Oliphant, P. Peterson, SciPy: Open source scientific tools for Python, http://www.scipy.org/ (2001–).
  • (57) S. v. d. Walt, S. C. Colbert, G. Varoquaux, The NumPy array: a structure for efficient numerical computation, Computing in Science & Engineering 13 (2) (2011) 22–30.
  • (58) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al., Scikit-learn: Machine learning in python, Journal of machine learning research 12 (Oct) (2011) 2825–2830.
  • (59) D. Greene, P. Cunningham, Practical solutions to the problem of diagonal dominance in kernel document clustering, in: Proceedings of the 23rd International Conference on Machine learning (ICML’06), ACM Press, 2006, pp. 377–384.
  • (60) M. Martinc, I. Škrjanec, K. Zupan, S. Pollak, PAN 2017: Author profiling - gender and language variety prediction, in: CLEF, 2017.
  • (61) U. Sapkota, S. Bethard, M. Montes-y-Gómez, T. Solorio, Not all character n-grams are created equal: A study in authorship attribution, in: NAACL HLT 2015, The 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado, USA, May 31 - June 5, 2015, 2015, pp. 93–102.
  • (62) P. K. Novak, J. Smailović, B. Sluban, I. Mozetič, Sentiment of emojis, PloS one 10 (12) (2015) e0144296.
  • (63) C.-C. Chang, C.-J. Lin, LIBSVM: A library for support vector machines, ACM Transactions on Intelligent Systems and Technology (TIST) 2 (3) (2011) 27.
  • (64) Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, E. Hovy, Hierarchical attention networks for document classification, in: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 1480–1489.
  • (65) Y. LeCun, Y. Bengio, G. Hinton, Deep learning, nature 521 (7553) (2015) 436.
  • (66) J. Schmidhuber, Deep learning in neural networks: An overview, Neural networks 61 (2015) 85–117.
  • (67) M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, X. Zheng, TensorFlow: Large-scale machine learning on heterogeneous systems, software available from tensorflow.org (2015).
  • (68) S. T. Piantadosi, Zipf’s word frequency law in natural language: A critical review and future directions, Psychonomic bulletin & review 21 (5) (2014) 1112–1130.
  • (69) B. Škrlj, J. Kralj, N. Lavrač, Py3plex: A library for scalable multilayer network analysis and visualization, in: Complex Networks and Their Applications VII, Springer International Publishing, Cham, 2019, pp. 757–768.
  • (70) P. Shannon, A. Markiel, O. Ozier, N. S. Baliga, J. T. Wang, D. Ramage, N. Amin, B. Schwikowski, T. Ideker, Cytoscape: a software environment for integrated models of biomolecular interaction networks, Genome research 13 (11) (2003) 2498–2504.
  • (71) J. Demšar, Statistical comparisons of classifiers over multiple data sets, Journal of Machine learning research 7 (Jan) (2006) 1–30.
  • (72) S. Foss, D. Korshunov, S. Zachary, et al., An introduction to Heavy-tailed and Subexponential Distributions, Vol. 6, Springer, 2011.
  • (73) I. Iacobacci, M. T. Pilehvar, R. Navigli, Embeddings for word sense disambiguation: An evaluation study, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Vol. 1, 2016, pp. 897–907.

Appendix A: Personalized PageRank algorithm

The Personalized PageRank (PPR) algorithm is described as follows. Let $V$ represent the nodes of the corpus-based taxonomy. For each node $u \in V$, a feature vector $\gamma_u$ is computed by calculating the stationary distribution of a random walk starting at node $u$. The stationary distribution is approximated by power iteration, where the $i$-th component of the approximation in the $k$-th iteration is computed as

\gamma_u^{(k+1)}(i) = \alpha \sum_{j \to i} \frac{\gamma_u^{(k)}(j)}{d_{\mathrm{out}}(j)} + (1 - \alpha)\, v_u(i), \qquad i = 1, \dots, |V|.    (4)

The number of iterations $k$ is increased until the approximation converges to the stationary distribution vector $\gamma_u$ (the P-PR value for node $u$). In the above equation, $\alpha$ is the damping factor, which corresponds to the probability that a random walk follows a randomly chosen outgoing edge from the current node rather than restarting its walk. The summation index $j$ runs over all nodes of the network that have an outgoing connection toward $i$ (denoted $j \to i$ in the sum), and $d_{\mathrm{out}}(j)$ is the out-degree of node $j$. The term $v_u(i)$ is the restart distribution, a vector of probabilities for the walker's return to the starting node $u$, i.e. $v_u(u) = 1$ and $v_u(i) = 0$ for $i \neq u$. This vector guarantees that the walker will jump back to the starting node in case of a restart. Note that if the binary vector $v_u$ were instead composed exclusively of ones, the iteration would compute the global PageRank vector, and Equation 4 would correspond to the standard PageRank iteration.
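
As a minimal sketch only (not the implementation used in tax2vec), the power iteration of Equation 4 can be written with NumPy as follows; the dense adjacency-matrix representation and the function name personalized_pagerank are our assumptions for illustration.

import numpy as np

def personalized_pagerank(adj, start, alpha=0.85, tol=1e-8, max_iter=1000):
    # adj[j, i] = 1 if the taxonomy contains an edge j -> i;
    # `start` is the index of node u whose P-PR vector gamma_u we approximate.
    n = adj.shape[0]
    out_deg = adj.sum(axis=1)                       # d_out(j) for every node j
    v = np.zeros(n)                                 # restart distribution v_u
    v[start] = 1.0                                  # all restart mass on node u
    gamma = v.copy()                                # initial approximation gamma_u^(0)
    for _ in range(max_iter):
        # Each node j spreads gamma^(k)(j) / d_out(j) along its outgoing edges.
        spread = np.divide(gamma, out_deg, out=np.zeros(n), where=out_deg > 0)
        gamma_next = alpha * adj.T.dot(spread) + (1.0 - alpha) * v   # Equation (4)
        if np.abs(gamma_next - gamma).sum() < tol:  # stop once the iteration converges
            return gamma_next
        gamma = gamma_next
    return gamma

# Toy usage on a three-node cycle 0 -> 1 -> 2 -> 0:
A = np.array([[0., 1., 0.],
              [0., 0., 1.],
              [1., 0., 0.]])
print(personalized_pagerank(A, start=0))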

Appendix B: Example document split

While for the data sets consisting of tweets and short comments the number of segments in a document corresponds to the number of tweets or comments by a user, in the news data set we varied the size of the news articles (to create short documents) by splitting each article into paragraphs. An example of the segmentation of a news article from the BBC data set (https://github.com/suraj-deshmukh/BBC-Dataset-News-Classification/blob/master/dataset/dataset.csv) is listed below.

——— The decision to keep interest rates on hold at 4.75% earlier this month was passed 8-1 by the Bank of England’s rate-setting body, minutes have shown.——— One member of the Bank’s Monetary Policy Committee (MPC) - Paul Tucker - voted to raise rates to 5%. The news surprised some analysts who had expected the latest minutes to show another unanimous decision. Worries over growth rates and consumer spending were behind the decision to freeze rates, the minutes showed. The Bank’s latest inflation report, released last week, had noted that the main reason inflation might fall was weaker consumer spending.——— However, MPC member Paul Tucker voted for a quarter point rise in interest rates to 5%. He argued that economic growth was picking up, and that the equity, credit and housing markets had been stronger than expected.——— The Bank’s minutes said that risks to the inflation forecast were “sufficiently to the downside” to keep rates on hold at its latest meeting. However, the minutes added: “Some members noted that an increase might be warranted in due course if the economy evolved in line with the central projection”. Ross Walker, UK economist at Royal Bank of Scotland, said he was surprised that a dissenting vote had been made so soon. He said the minutes appeared to be “trying to get the market to focus on the possibility of a rise in rates”. “If the economy pans out as they expect then they are probably going to have to hike rates.” However, he added, any rate increase is not likely to happen until later this year, with MPC members likely to look for a more sustainable pick up in consumer spending before acting.

This news article is split by the parser into the following four segments (and in the short document setting only one paragraph is used to represent the document); a minimal segmentation sketch is given after the example.

  • The decision to keep interest rates on hold at 4.75% earlier this month was passed 8-1 by the Bank of England’s rate-setting body, minutes have shown.

  • One member of the Bank’s Monetary Policy Committee (MPC) - Paul Tucker - voted to raise rates to 5%. The news surprised some analysts who had expected the latest minutes to show another unanimous decision. Worries over growth rates and consumer spending were behind the decision to freeze rates, the minutes showed. The Bank’s latest inflation report, released last week, had noted that the main reason inflation might fall was weaker consumer spending.

  • However, MPC member Paul Tucker voted for a quarter point rise in interest rates to 5%. He argued that economic growth was picking up, and that the equity, credit and housing markets had been stronger than expected.

  • The Bank’s minutes said that risks to the inflation forecast were “sufficiently to the downside” to keep rates on hold at its latest meeting. However, the minutes added: “Some members noted that an increase might be warranted in due course if the economy evolved in line with the central projection”. Ross Walker, UK economist at Royal Bank of Scotland, said he was surprised that a dissenting vote had been made so soon. He said the minutes appeared to be “trying to get the market to focus on the possibility of a rise in rates”. “If the economy pans out as they expect then they are probably going to have to hike rates.” However, he added, any rate increase is not likely to happen until later this year, with MPC members likely to look for a more sustainable pick up in consumer spending before acting.
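
A minimal sketch of this paragraph-level segmentation, assuming paragraphs in the raw text are separated by blank lines (an illustration only, not the exact parser used in our experiments):

import re

def split_into_segments(document):
    # Split a raw news article into paragraph-level segments on blank lines;
    # in the short-document setting, a single segment then represents the document.
    segments = [seg.strip() for seg in re.split(r"\n\s*\n", document)]
    return [seg for seg in segments if seg]

# Hypothetical usage on a raw article string:
# segments = split_into_segments(raw_article)
# short_document = segments[0]   # e.g. keep only the first paragraph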