A Survey on Text Classification: From Shallow to Deep Learning

08/02/2020 · Qian Li et al. · University of Leeds, Lehigh University, Beihang University, University of Illinois at Chicago

Text classification is the most fundamental and essential task in natural language processing. The last decade has seen a surge of research in this area due to the unprecedented success of deep learning. Numerous methods, datasets, and evaluation metrics have been proposed in the literature, raising the need for a comprehensive and updated survey. This paper fills the gap by reviewing the state of the art approaches from 1961 to 2020, focusing on models from shallow to deep learning. We create a taxonomy for text classification according to the text involved and the models used for feature extraction and classification. We then discuss each of these categories in detail, dealing with both the technical developments and benchmark datasets that support tests of predictions. A comprehensive comparison between different techniques, as well as identifying the pros and cons of various evaluation metrics are also provided in this survey. Finally, we conclude by summarizing key implications, future research directions, and the challenges facing the research area.


1. Introduction

Text classification – the procedure of designating pre-defined labels for text – is an essential and significant task in many Natural Language Processing (NLP) applications, such as sentiment analysis (Maas et al., 2011) (Tai et al., 2015) (Zhu et al., 2015), topic labeling (Wang and Manning, 2012) (Zhang et al., 2015) (Yao et al., 2019), question answering (Kalchbrenner et al., 2014) (Liu et al., 2015) and dialog act classification (Lee and Dernoncourt, 2016). In the era of information explosion, it is time-consuming and challenging to process and classify large amounts of text data manually. Besides, the accuracy of manual text classification can be easily influenced by human factors, such as fatigue and expertise. It is desirable to use machine learning methods to automate the text classification procedure to yield more reliable and less subjective results. Moreover, this can also help enhance information retrieval efficiency and alleviate the problem of information overload by locating the required information.

Fig. 1 illustrates a flowchart of the procedures involved in text classification, in the light of shallow and deep learning. Text data differs from numerical, image, or signal data and requires careful processing with NLP techniques. The first important step is to preprocess the text data for the model. Shallow learning models usually need to obtain good sample features by manual methods and then classify them with classic machine learning algorithms. Therefore, the effectiveness of the method is largely restricted by feature extraction. In contrast, deep learning integrates feature engineering into the model fitting process by learning a set of nonlinear transformations that map features directly to outputs.

Figure 1. Flowchart of the text classification with classic methods in each module. It is crucial to extract essential features for shallow models, but features can be extracted automatically by DNNs.

The schematic illustration of the primary text classification methods is shown in Fig. 2. From the 1960s until the 2010s, shallow learning-based text classification models dominated. Shallow learning means statistics-based models, such as Naïve Bayes (NB) (Maron, 1961), K-nearest neighbor (KNN) (Cover and Hart, 1967), and support vector machine (SVM) (Joachims, 1998). Compared with the earlier rule-based methods, these models have clear advantages in accuracy and stability. However, they still require feature engineering, which is time-consuming and costly. Besides, they usually disregard the natural sequential structure or contextual information in textual data, making it challenging to learn the semantic information of the words. Since the 2010s, text classification has gradually changed from shallow learning models to deep learning models. Compared with shallow learning methods, deep learning methods avoid designing rules and features by hand and automatically provide semantically meaningful representations for text mining. Therefore, most text classification research works are based on DNNs, which are data-driven approaches with high computational complexity. Only a few works still focus on shallow learning models to address the limitations of computation and data.

Figure 2. Schematic illustration of the primary text classification methods from 1961 to 2020. Before 2010, almost all existing methods were based on shallow models (orange); since 2010, most work in this area has concentrated on deep learning schemes (green).

1.1. Major Differences and Contributions

In the literature, Kowsari et al. (Kowsari et al., 2019) surveyed different text feature extraction and dimensionality reduction methods, basic model structures for text classification, and evaluation methods. Minaee et al. (Minaee et al., 2020) reviewed recent deep learning based text classification methods, benchmark datasets, and evaluation metrics. Unlike existing text classification surveys, we cover existing models from shallow to deep learning, including works of recent years. Shallow learning models emphasize feature extraction and classifier design. Once the text features are well designed, the classifier can be trained to converge quickly. DNNs can perform feature extraction automatically and learn well without domain knowledge. We then give the datasets and evaluation metrics for single-label and multi-label tasks and summarize future research challenges from the data, model, and performance perspectives. Moreover, we summarize various information in four tables, including the necessary information of classic shallow and deep learning models, technical details of DNNs, primary information of main datasets, and a general benchmark of state-of-the-art methods under different applications. In summary, this study’s main contributions are as follows:

  • We introduce the process and development of text classification and summarize the necessary information of classic models in terms of publishing years in Table 1, including venues, applications, citations, and code links.

  • We present comprehensive analysis and research on primary models – from shallow to deep learning models – according to their model structures. We summarize classic or more specific models and primarily outline the design differences in terms of basic models, metrics, and experimental datasets in Table 2.

  • We introduce the present datasets and give the formulation of main evaluation metrics, including single-label and multi-label text classification tasks. We summarize the necessary information of primary datasets in Table 3, including the number of categories, average sentence length, the size of each dataset, related papers and data addresses.

  • We summarize classification accuracy scores of classical models on benchmark datasets in Table 5 and conclude the survey by discussing the main challenges facing the text classification and key implications stemming from this study.

1.2. Organization of the Survey

The rest of the survey is organized as follows. Section 2 summarizes the existing models related to text classification, including shallow learning and deep learning models. Section 3 introduces the primary datasets with a summary table and evaluation metrics on single-label and multi-label tasks. We then give quantitative results of the leading models in classic text classification datasets in Section 4. Finally, we summarize the main challenges for deep learning text classification in Section 5 before concluding the article in Section 6.

2. Text Classification Methods

Text classification refers to extracting features from raw text data and predicting the categories of the text based on such features. Numerous models have been proposed in the past few decades for text classification, as shown in Table 1. We tabulate primary information – including venues, applications, citations, and code links – of main models for text classification. The applications in this table include sentiment analysis (SA), topic labeling (TL), news classification (NC), question answering (QA), dialog act classification (DAC), natural language inference (NLI), and event prediction (EP). For shallow learning models, NB (Maron, 1961) is the first model used for the text classification task. Thereafter, general classification models were proposed, such as KNN, SVM (Joachims, 1998), and RF (Breiman, 2001), which are commonly used as classifiers for text classification. More recently, XGBoost (Chen and Guestrin, 2016) and LightGBM (Ke et al., 2017) arguably have the potential to provide excellent performance. For deep learning models, TextCNN (Kim, 2014) has the highest number of citations among these models; it introduced a CNN model to solve the text classification problem for the first time. While not specifically designed for text classification, BERT (Devlin et al., 2019) has been widely employed when designing text classification models, considering its effectiveness on numerous text classification datasets.

Year Method Venue Applications Citations Code Link
1961 NB (Maron, 1961) JACM TL 612 (11)
1967 KNN (Cover and Hart, 1967) IEEE Trans. - 12152 (8)
1984 CART (Breiman et al., 1984) Wadsworth - 45967 (5)
1993 C4.5 (Quinlan, 1993) Morgan Kaufmann - 37847 (4)
1995 AdaBoost (Freund and Schapire, 1995) EuroCOLT - 19372 (3)
1998 SVM (Joachims, 1998) ECML - 10770 (11)
2001 RF (Breiman, 2001) Mach. Learn. - 60249 (10)
2011 RAE (Socher et al., 2011) EMNLP SA, QA 1231 (17)
2012 MV-RNN (Socher et al., 2012) EMNLP SA 1141 (37)
2013 RNTN (Socher et al., 2013) EMNLP SA 3725 (18)
2014 Paragraph-Vec (Le and Mikolov, 2014) ICML SA, QA 5679 (24)
2014 DCNN (Kalchbrenner et al., 2014) ACL SA, QA 2433 (33)
2014 TextCNN (Kim, 2014) EMNLP SA, QA 7171 (15)
2015 TextRCNN (Lai et al., 2015) AAAI SA, TL 1141 (41)
2015 DAN (Iyyer et al., 2015) ACL SA, QA 467 (6)
2015 Tree-LSTM (Tai et al., 2015) ACL SA 1761 (29)
2015 CharCNN (Zhang et al., 2015) NeurIPS SA, QA, TL 2114 (32)
2016 XGBoost (Chen and Guestrin, 2016) KDD QA 6187 (13)
2016 HAN (Yang et al., 2016) NAACL SA, TL 1889 (16)
2016 Multi-Task (Liu et al., 2016a) IJCAI SA 410 (23)
2016 LSTMN (Cheng et al., 2016) EMNLP SA 449 (36)
2017 LightGBM (Ke et al., 2017) NeurIPS QA 1065 (9)
2017 FastText (Grave et al., 2017) EACL SA, TL 1954 (7)
2017 Miyato et al. (Miyato et al., 2017a) ICLR SA 246 (42)
2017 TopicRNN (Dieng et al., 2017) ICML SA 113 (28)
2017 DPCNN (Johnson and Zhang, 2017) ACL SA, TL 156 (20)
2017 IAN (Ma et al., 2017) IJCAI SA 222 (19)
2017 DeepMoji (Felbo et al., 2017) EMNLP SA 260 (14)
2017 RAM (Chen et al., 2017) EMNLP SA 225 (19)
2018 ELMo (Peters et al., 2018) NAACL SA, QA, NLI 3722 (21)
2018 DGCNN (Peng et al., 2018) TheWebConf TL 81 (34)
2018 ULMFiT (Howard and Ruder, 2018) ACL SA, TL, News 819 (12)
2018 LEAM (Wang et al., 2018a) ACL TL, News 87 (35)
2018 SGM (Yang et al., 2018b) COLING TL 42 (26)
2018 SGNN (Li et al., 2018) IJCAI EP 26 (27)
2018 TextCapsule (Yang et al., 2018a) EMNLP SA, QA, TL 118 (39)
2018 MGAN (Fan et al., 2018) EMNLP SA 46 (19)
2019 TextGCN (Yao et al., 2019) AAAI SA, TL 114 (40)
2019 BERT (Devlin et al., 2019) NAACL SA, QA 5532 (31)
2019 MT-DNN (Liu et al., 2019a) ACL SA, NLI 186 (22)
2019 XLNet (Yang et al., 2019) NeurIPS SA, QA, NC 652 (43)
2019 RoBERTa (Liu et al., 2019b) arXiv SA, QA 203 (25)
2020 ALBERT(Lan et al., 2020) ICLR SA, QA 197 (30)
2020 SpanBERT (Joshi et al., 2020) TACL QA 63 (38)
Table 1. Necessary information of the principal models for text classification. Citation counts are as of June 8, 2020.

2.1. Shallow Learning Models

Shallow learning models accelerate text classification with improved accuracy and expand the application scope of shallow learning. The first step is to preprocess the raw input text for training shallow learning models, which generally consists of word segmentation, data cleaning, and data statistics. Then, text representation aims to express the preprocessed text in a form that is much easier for computers to handle while minimizing information loss, such as Bag-of-words (BOW), N-gram, term frequency-inverse document frequency (TF-IDF) (184), word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014). At the core of BOW is representing each text with a dictionary-sized vector, where each value denotes the frequency in the text of the word at the corresponding dictionary position. Compared to BOW, N-gram builds a dictionary that also takes the information of adjacent words into account. TF-IDF (184) uses the word frequency, scaled by the inverse document frequency, to model the text. Word2vec (Mikolov et al., 2013) employs local context information to obtain word vectors. GloVe (Pennington et al., 2014) – with both local context and global statistical features – trains on the nonzero elements of a word-word co-occurrence matrix. Finally, the represented text is fed into the classifier according to the selected features. Here, we discuss some of the representative classifiers in detail:
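
To make the representation step concrete, the following is a minimal sketch (not from the original survey) of turning a toy corpus into TF-IDF vectors and feeding them to a linear classifier with scikit-learn; the example texts and labels are invented placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus and sentiment labels (placeholders for illustration only).
texts = ["the movie was great", "terrible plot and bad acting",
         "a wonderful, moving film", "boring and far too long"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# TF-IDF: term frequency in a document, down-weighted by how common
# the term is across all documents (inverse document frequency).
vectorizer = TfidfVectorizer(ngram_range=(1, 2))  # unigram and bigram features
X = vectorizer.fit_transform(texts)

clf = LogisticRegression().fit(X, labels)
print(clf.predict(vectorizer.transform(["a great film"])))
```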

2.1.1. PGM-based methods

Probabilistic graphical models (PGMs) express the conditional dependencies among features in graphs, such as the Bayesian network (Zhang and Zhang, 2010) and the hidden Markov model (van den Bosch, 2017). Such models are combinations of probability theory and graph theory.

Naïve Bayes (NB) (Maron, 1961) is the simplest and most broadly used model based on applying Bayes’ theorem. The NB algorithm rests on an independence assumption: given the target value, the features are conditionally independent of one another (see Fig. 3). The NB algorithm primarily uses the prior probability to calculate the posterior probability. Due to its simple structure, NB is broadly used for text classification tasks. Although the assumption that the features are independent rarely holds in practice, it substantially simplifies the calculation and often performs well. To improve the performance on smaller categories, Schneider (Schneider, 2004) proposes a feature selection scoring method that calculates the KL-divergence (Cover and Thomas, 2006) between the training set and the corresponding categories for multinomial NB text classification. Dai et al. (Dai et al., 2007) propose a transfer learning method named Naive Bayes Transfer Classification (NBTC) to handle the different distributions of the training set and the target set. It uses the EM algorithm (A. et al., 1977) to obtain a locally optimal posterior hypothesis on the target set.
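
As a hedged illustration of the prior/likelihood computation described above, the sketch below trains a multinomial NB classifier on bag-of-words counts with scikit-learn; the tiny corpus and topic labels are invented for demonstration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented corpus: 0 = sports, 1 = politics.
texts = ["the team won the match", "parliament passed the new bill",
         "the coach praised the players", "the senate debated the budget"]
labels = [0, 1, 0, 1]

vec = CountVectorizer()
X = vec.fit_transform(texts)  # word counts; NB assumes conditional independence

# MultinomialNB learns class priors P(c) and word likelihoods P(w|c);
# prediction picks the class maximizing P(c) * prod_w P(w|c).
nb = MultinomialNB().fit(X, labels)
print(nb.class_log_prior_)                                 # learned log priors
print(nb.predict(vec.transform(["the players won the match"])))
```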

Figure 3. The structure of NB (left) and the structure of HMM (right).

Hidden Markov model (HMM) is a Markov model assumed to be a Markov process with hidden states (van den Bosch, 2017). It is suitable for sequential text data and is effective in reducing algorithmic complexity by redesigning the model structure. HMM operates under the assumption that an observable process exists whose behavior depends on an underlying hidden state sequence; the learning goal is to infer the hidden states from the observations, considering the state dependencies (see Fig. 3). To consider the contextual information among pages in a text, Frasconi et al. (Frasconi et al., 2002) reshape a text into sequences of pages and exploit the serial order relationship among pages within a text for multi-page texts. However, these methods do not perform well on domain-specific text. Motivated by this, Yi et al. (Yi and Beheshti, 2009) use prior knowledge – primarily stemming from a specialized subject vocabulary, the Medical Subject Headings (MeSH) (O’Donnell, 2009) – to carry out the medical text classification task.

Figure 4. The structure of KNN (left) and the structure of SVM (right). Different colored nodes represent different categories.

2.1.2. KNN-based Methods

At the core of the K-Nearest Neighbors (KNN) algorithm (Cover and Hart, 1967) is classifying an unlabeled sample by the category that is most common among its k nearest labeled samples. It is a simple classifier that requires no model building, and its complexity can be decreased by speeding up the search for nearest neighbors. Fig. 4 showcases the structure of KNN. We can find the k training texts closest to a specific text to be classified by estimating the in-between distances; the text is then assigned to the most common category among those k training texts. However, due to the positive correlation between model time/space complexity and the amount of data, the KNN algorithm takes an unusually long time on large-scale datasets. To decrease the number of selected features, Soucy et al. (Soucy and Mineau, 2001) propose a KNN algorithm without feature weighting. It manages to find relevant features and builds the inter-dependencies of words by using feature selection. When the data is extremely unevenly distributed, KNN tends to classify samples into the categories with more data. The neighbor-weighted K-nearest neighbor (NWKNN) (Tan, 2005) is proposed to improve classification performance on unbalanced corpora. It assigns a large weight to neighbors from small categories and a small weight to neighbors from broad classes.
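
A minimal sketch of KNN text classification, assuming scikit-learn and cosine distance over TF-IDF vectors; the corpus, labels, and choice of k are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

texts = ["cheap flights and hotel deals", "election results announced tonight",
         "book a holiday package online", "prime minister gives a speech"]
labels = ["travel", "news", "travel", "news"]

vec = TfidfVectorizer()
X = vec.fit_transform(texts)

# Classify an unlabeled text by majority vote among its k nearest labeled
# neighbors; cosine distance is a common choice for sparse text vectors.
knn = KNeighborsClassifier(n_neighbors=3, metric="cosine")
knn.fit(X, labels)
print(knn.predict(vec.transform(["hotel and flight booking"])))
```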

2.1.3. SVM-based Methods

Cortes and Vapnik propose the Support Vector Machine (SVM) (Cortes and Vapnik, 1995) to tackle binary classification in pattern recognition. Joachims (Joachims, 1998), for the first time, uses the SVM method for text classification, representing each text as a vector. As illustrated in Fig. 4, SVM-based approaches turn a text classification task into multiple binary classification tasks. In this context, SVM constructs an optimal hyperplane in the input space or feature space that maximizes the distance between the hyperplane and the two categories of training samples, thereby achieving the best generalization ability. The goal is to maximize the distance of the category boundaries along the direction perpendicular to the hyperplane, which equivalently results in the lowest classification error rate. Constructing an optimal hyperplane can be transformed into a quadratic programming problem, which yields a globally optimal solution. Choosing an appropriate kernel function is of the utmost importance to ensure SVM can deal with nonlinear problems and become a robust nonlinear classifier. To analyze what SVM algorithms learn and which tasks are suitable, Joachims (Joachims, 2001) proposes a theoretical learning model that combines the statistical traits with the generalization performance of an SVM, analyzing its features and benefits with a quantitative approach. The Transductive Support Vector Machine (TSVM) (Joachims, 1999) is proposed to lessen misclassifications on particular test collections by using a general decision function that takes a specific test set into account. It uses prior knowledge to establish a more suitable structure and learn faster.
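
The following is a hedged sketch of a linear SVM text classifier with scikit-learn, which handles multi-class problems as multiple binary tasks as described above; the data are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["goal scored in the last minute", "new smartphone released today",
         "the striker signed a new contract", "chip maker unveils faster processor"]
labels = ["sports", "tech", "sports", "tech"]

# LinearSVC finds the maximum-margin hyperplane; with more than two classes
# it trains one binary (one-vs-rest) SVM per category.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)
print(model.predict(["a faster processor for smartphones"]))
```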

Figure 5. The structure of DT (left) and the structure of RF (right). Nodes in blue represent the nodes of the decision route.

2.1.4. DT-based Methods

Decision trees (DT) (Mitchell, 1997) are a supervised tree-structured learning method – reflective of the idea of divide-and-conquer – constructed recursively. A DT learns disjunctive expressions and is robust to noisy text. As shown in Fig. 5, decision tree learning can generally be divided into two distinct stages: tree construction and tree pruning. It starts at the root node, tests the data samples (composed of instance sets, which have several attributes), and divides the dataset into diverse subsets according to different results. Each subset of the dataset constitutes a child node, and every leaf node in the decision tree represents a category. Constructing the decision tree amounts to determining the correlation between classes and attributes, which is further exploited to predict the categories of unknown forthcoming records. The classification rules generated by the decision tree algorithm are straightforward, and the pruning strategy can also help reduce the influence of noise. Its limitation, however, mainly derives from inefficiency in coping with explosively increasing data sizes. More specifically, the ID3 (Quinlan, 1986) algorithm uses information gain as the attribute selection criterion at each node: the attribute with the maximum information gain value becomes the discriminant attribute of the current node. Based on ID3, C4.5 (Quinlan, 1993) learns a mapping from attributes to classes, which effectively classifies entities unseen in the training data. DT-based algorithms usually need to be trained for each dataset, which is inefficient. Thus, Johnson et al. (Johnson et al., 2002) propose a DT-based symbolic rule system. The method represents each text as a vector calculated from the frequency of each word in the text and induces rules from the training data. The learned rules are used for classifying other data similar to the training data. Furthermore, to reduce the computational costs of DT algorithms, the fast decision tree (FDT) (Vateekul and Kubat, 2009) uses a two-pronged strategy: pre-selecting a feature set and training multiple DTs on different data subsets. Results from multiple DTs are combined through a data-fusion technique to resolve cases of imbalanced classes.
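
To illustrate the node-splitting idea, here is a small sketch of a decision tree trained on word-count features with scikit-learn, printing the learned rules; the spam/ham data are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier, export_text

texts = ["win a free prize now", "meeting agenda for monday",
         "claim your free reward", "project status update attached"]
labels = ["spam", "ham", "spam", "ham"]

vec = CountVectorizer()
X = vec.fit_transform(texts)

# Each internal node tests one word-count feature (entropy / information-gain
# criterion); each leaf corresponds to a predicted category.
dt = DecisionTreeClassifier(criterion="entropy").fit(X, labels)
print(export_text(dt, feature_names=list(vec.get_feature_names_out())))
```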

2.1.5. Integration-based Methods

Integrated algorithms aim to aggregate the results of multiple algorithms for better performance and interpretation. Conventional integrated algorithms include bootstrap aggregation, such as random forest (RF) (Breiman, 2001); boosting, such as AdaBoost (Freund and Schapire, 1995) and XGBoost (Chen and Guestrin, 2016); and stacking. The bootstrap aggregation method trains multiple classifiers without strong dependencies and then aggregates their results. For instance, RF (Breiman, 2001) consists of multiple tree classifiers wherein all trees depend on the value of a random vector sampled independently (depicted in Fig. 5). It is worth noting that each tree within the RF shares the same distribution. The generalization error of an RF relies on the strength of each tree and the relationship among trees, and it converges to a limit as the number of trees in the forest increases. In boosting-based algorithms, all labeled data are initially trained with the same weight to obtain a weak classifier. The weights of the data are then adjusted according to the previous result of the classifier. The training procedure continues by repeating such steps until the termination condition is reached. Unlike bootstrap and boosting algorithms, stacking-based algorithms break down the data into parts and use classifiers to process the input data in a cascade manner – results from an upstream classifier are fed into the downstream classifier as input. The training terminates once a pre-defined number of iterations is reached. Integrated methods can capture more features from multiple trees. However, they help little for short texts. Motivated by this, Bouaziz et al. (Bouaziz et al., 2014) combine data enrichment with semantics in RFs for short text classification to overcome the sparseness and the insufficiency of contextual information. In integrated algorithms, not all classifiers learn well, so it is necessary to give each classifier a different weight. To differentiate the contributions of trees in a forest, Islam et al. (Islam et al., 2019) propose the Semantics Aware Random Forest (SARF) classifier, which chooses features similar to the features of the same class to extract features and produce the prediction values.
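
A brief sketch contrasting bagging and boosting on the same word features, assuming scikit-learn; the corpus and labels are toy placeholders.

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["stock prices rallied today", "the band released a new album",
         "markets fell on inflation fears", "the singer announced a world tour"]
labels = ["finance", "music", "finance", "music"]

vec = TfidfVectorizer()
X = vec.fit_transform(texts)

# Bagging: many trees on bootstrap samples, predictions aggregated by voting.
rf = RandomForestClassifier(n_estimators=100).fit(X, labels)
# Boosting: weak learners trained sequentially, re-weighting hard examples.
ada = AdaBoostClassifier(n_estimators=50).fit(X, labels)

test = vec.transform(["album and tour dates announced"])
print(rf.predict(test), ada.predict(test))
```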

Summary. Shallow learning methods are a type of machine learning that learns from data with pre-defined features, whose quality is crucial to the performance of the predictions. However, feature engineering is laborious: before training the classifier, we need to collect knowledge or experience to extract features from the original text. Shallow learning methods train the initial classifier based on various textual features extracted from the raw text. On small datasets, shallow learning models usually present better performance than deep learning models under the limitation of computational complexity. Therefore, some researchers have studied the design of shallow models for specific domains with less data.

2.2. Deep Learning Models

DNNs consist of artificial neural networks that simulate the human brain to automatically learn high-level features from data, obtaining better results than shallow learning models in speech recognition, image processing, and text understanding. The input dataset should be analyzed to characterize the data, e.g., as a single-label, multi-label, unsupervised, or unbalanced dataset. According to the traits of the dataset, the input word vectors are fed into the DNN for training until the termination condition is reached. The performance of the trained model is verified on the downstream task, such as sentiment classification, question answering, or event prediction. We show a number of DNNs over the years in Table 2, including designs that differ from the corresponding basic models, evaluation metrics, and experimental datasets.

As shown in Table 2, the feed-forward neural network and the recursive neural network are the first two deep learning approaches used for the text classification task, and they improve performance compared with shallow learning models. Then, CNNs, RNNs, and attention mechanisms were used for text classification. Many researchers advance text classification performance for different tasks by improving CNN, RNN, and attention, or by model fusion and multi-task methods. The appearance of Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019), which can generate contextualized word vectors, is a significant turning point in the development of text classification and other NLP technologies. Many researchers have studied text classification models based on BERT, which achieve better performance than the above models in multiple NLP tasks, including text classification. Besides, some researchers study text classification technology based on GNNs (Yao et al., 2019) to capture structural information in the text, which cannot be replaced by other methods. Here, we classify DNNs by structure and discuss some of the representative models in detail:

Model Design Metrics Datasets

recursive autoencoders (Socher et al., 2011) Accuracy MPQA, MR, EP
ReNN recursive neural network (Socher et al., 2012) Accuracy, F1 MR
richer supervised training (Socher et al., 2013) Accuracy Sentiment Treebank
multiple recursive layers (Irsoy and Cardie, 2014) Accuracy SST-1;SST-2
MLP a deep unordered model (Iyyer et al., 2015) Accuracy, Time RT, SST, IMDB
paragraph vector (Le and Mikolov, 2014) Error Rate SST, IMDB
tree-structured topologies (Tai et al., 2015) Accuracy SST-1, SST-2
a memory cell (Zhu et al., 2015) Accuracy SST

RCNN and a max-pooling layer (Lai et al., 2015) Accuracy 20NG, Fudan, ACL, SST-2
multi-timescale (Liu et al., 2015) Accuracy SST-1, SST-2, QC, IMDB
RNN embeddings of text regions (Johnson and Zhang, 2016) Error Rate IMDB, Elec, RCV1, 20NG
2DCNN (Zhou et al., 2016a) Accuracy SST-1, SST-2, Subj, TREC, etc.
multi-task (Liu et al., 2016a) Accuracy SST-1, SST-2, Subj, IMDB
distant supervision (Felbo et al., 2017) Accuracy SS-Twitter, SE1604, etc.
global dependencies (Dieng et al., 2017) Error Rate IMDB
virtual adversarial training (Miyato et al., 2017a) Error Rate IMDB, DBpedia, RCV1, etc.
capsule (Wang et al., 2018b) Accuracy MR, SST-1, Hospital Feedback
basic CNN (Kim, 2014) Accuracy MR, SST-1, SST-2, Subj, etc.
dynamic k-Max pooling (Kalchbrenner et al., 2014) Accuracy MR, TREC, Twitter
character-level (Zhang et al., 2015) Error Rate AG, Yelp P, DBPedia, etc.
preceding short texts (Lee and Dernoncourt, 2016) Accuracy DSTC 4, MRDA, SwDA
extreme multi-label (Liu et al., 2017) P@k, DCG@k, etc. EUR-Lex, Wiki-30K, etc.
CNN deep pyramid CNN (Johnson and Zhang, 2017) Error Rate AG, DBPedia, Yelp.P, etc.
knowledge base (Wang et al., 2017a) Accuracy TREC, Twitter, AG, Bing, MR
8‐bit character encoding (Adams and McKenzie, 2018) Accuracy Geonames toponyms, etc.
dynamic routing (Yang et al., 2018a) Accuracy Subj, TREC, Reuters, etc.
hierarchical relations (Shimura et al., 2018) , , etc. RCV1, Amazon670K
meta-learning (Bao et al., 2020) Accuracy 20NG, RCV, Reuters-2157, etc.
hierarchical attention (Yang et al., 2016) Accuracy Yelp.F, IMDB, YahooA, Amz.F
add bilingual BiLSTM (Zhou et al., 2016b) Accuracy NLP&CC 2013 (149)
intra-attention mechanism (Cheng et al., 2016) Accuracy SST-1
two-way attention mechanism (dos Santos et al., 2016) P, MAP, MRR TREC-QA, WikiQA, etc.
Attention Inner-Attention mechanism (Liu et al., 2016b) Accuracy SNLI
cross-attention mechanism (Hao et al., 2017) F1 WebQuestion
self-attention sentence embedding (Lin et al., 2017) Accuracy Yelp, Age
sequence generation model (Yang et al., 2018b) HL, RCV1-V2, AAPD
deep contextualized representation (Peters et al., 2018) Accuracy, F1 SQuAD, SNLI, SRL, SST-5, etc.
a label tree-based model (You et al., 2019) P@k, N@k, PSP@k EUR-Lex, Amazon-670K, etc.
knowledge powered attention (Chen et al., 2019) Accuracy Weibo, Product Review, etc.
bi-directional block self-attention (Shen et al., 2018b) Accuracy, Time CR, MPQA, SST-1, SUBJ, etc.
deep contextualized representation (Peters et al., 2018) Accuracy SQuAD, SNLI, SST-5
bidirectional encoder (Devlin et al., 2019) Accuracy SST-2, QQP, QNLI, CoLA
multi-label legal text (Chalkidis et al., 2019) P@K, RP@K, R@K, etc. EUR-LEX
Trans fine-tune BERT (Sun et al., 2019) Error Rate IMDB, TREC, DBPedia, etc.
autoregressive pretraining (Yang et al., 2019) NDCG@K, EM, F1, etc. IMDB, Yelp-2, AG, MNLI, etc.
modifications on BERT (Liu et al., 2019b) F1, Accuracy SQuAD, MNLI-m, SST-2
improvement of BERT (Lan et al., 2020) F1, Accuracy SST, MNLI, SQuAD
graph-CNN for multi-label text (Peng et al., 2018) , , etc. RCV1, NYTimes
build a heterogeneous graph (Yao et al., 2019) Accuracy 20NG, Ohsumed, R52, R8, MR
GNN removing the nonlinearities (Wu et al., 2019) Accuracy, Time 20NG, R8, R52, Ohsumed, MR
a text level graph (Huang et al., 2019) Accuracy R8, R52, Ohsumed
hierarchical taxonomy-aware (Peng et al., 2019) , RCV1, EUR-Lex, etc.
graph attention network-based (Pal et al., 2020) , HL Reuters-21578, RCV1-V2, etc.
Table 2. Basic information based on different models. Trans: Transformer. Time: training time.

2.2.1. ReNN-based Methods

Shallow learning models spend a lot of time on designing features for each task. The recursive neural network (ReNN) can automatically learn the semantics of text recursively, along with the syntax tree structure, without feature design, as shown in Fig. 6. We give an example of ReNN-based models. First, each word of the input text is taken as a leaf node of the model structure. Then all nodes are combined into parent nodes using a weight matrix that is shared across the whole model. Each parent node has the same dimension as the leaf nodes. Finally, all nodes are recursively aggregated into a root node that represents the input text and is used to predict the label.

Figure 6. The architecture of ReNN (left) and the architecture of MLP (right).

ReNN-based models improve performance compared with shallow learning models and save on labor costs by excluding the feature designs used for different text classification tasks. The recursive autoencoder (RAE) (Socher et al., 2011) is used to predict the sentiment label distribution of each input sentence and learn the representations of multi-word phrases. To learn compositional vector representations for each input text, the matrix-vector recursive neural network (MV-RNN) (Socher et al., 2012) introduces a ReNN model to learn the representation of phrases and sentences. It allows the length and type of input texts to vary. MV-RNN allocates a matrix and a vector for each node on the constructed parse tree. Furthermore, the recursive neural tensor network (RNTN) (Socher et al., 2013) is proposed with a tree structure to capture the semantics of sentences. It takes phrases of different lengths as input and represents the phrases by parse trees and word vectors. The vectors of higher nodes in the parse tree are estimated by the same tensor-based composition function. For RNTN, the time complexity of building the textual tree is high, and expressing the relationship between documents is complicated within a tree structure. Since performance usually improves as the depth of a DNN increases, Irsoy et al. (Irsoy and Cardie, 2014) propose a deep recursive neural network (DeepReNN), which stacks multiple recursive layers. It is built on binary parse trees and learns distinct perspectives of compositionality in language.

2.2.2. MLP-based Methods

A multilayer perceptron (MLP) (k. Alsmadi et al., 2009), sometimes colloquially called a ”vanilla” neural network, is a simple neural network structure used to capture features automatically. As shown in Fig. 6, we show a three-layer MLP model. It contains an input layer, a hidden layer with an activation function in all nodes, and an output layer. Each connection between nodes carries a certain weight. An MLP treats each input text as a bag of words and achieves high performance on many text classification benchmarks compared with shallow learning models.

Several MLP-based methods have been proposed by research groups for text classification tasks. The Paragraph Vector (Paragraph-Vec) (Le and Mikolov, 2014) is the most popular and widely used method, which is similar to the Continuous Bag of Words (CBOW) (Mikolov et al., 2013). It obtains fixed-length feature representations of texts with various input lengths by employing unsupervised algorithms. Compared with CBOW, it adds a paragraph token mapped to a paragraph vector by a matrix. The model predicts the fourth word from the concatenation or average of this vector with the vectors of the three context words. Paragraph vectors can serve as a memory of the paragraph theme and are used as paragraph features fed into the prediction classifier.
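
In the spirit of the deep unordered models above, the following PyTorch sketch averages word embeddings and classifies the result with a small MLP; the vocabulary size, dimensions, and random token ids are placeholder assumptions.

```python
import torch
import torch.nn as nn

class BagOfWordsMLP(nn.Module):
    """Average word embeddings, then classify with a small MLP."""
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_classes))

    def forward(self, token_ids):                  # token_ids: (batch, seq_len)
        avg = self.embed(token_ids).mean(dim=1)    # order-insensitive pooling
        return self.mlp(avg)                       # (batch, num_classes) logits

model = BagOfWordsMLP(vocab_size=10000, embed_dim=100, hidden_dim=64, num_classes=2)
logits = model(torch.randint(0, 10000, (8, 20)))   # a batch of 8 padded texts
print(logits.shape)                                 # torch.Size([8, 2])
```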

2.2.3. RNN-based Methods

The recurrent neural network (RNN) is broadly used because it captures long-range dependencies through recurrent computation. The RNN language model learns historical information, considering the positional information among all words, which suits text classification tasks. We show an RNN model for text classification with a simple example, as shown in Fig. 7. Firstly, each input word is represented by a specific vector using a word embedding technique. Then, the embedded word vectors are fed into RNN cells one by one. The outputs of the RNN cells have the same dimension as the input vectors and are fed into the next hidden layer. The RNN shares parameters across different parts of the model, with the same weights for each input word. Finally, the label of the input text can be predicted by the last output of the hidden layer.

To diminish the time complexity of the model and capture contextual information, Liu et al. (Liu et al., 2016a) introduce a model for catching the semantics of long texts. It parses the text word by word and is a biased model in which later inputs dominate earlier ones, decreasing its efficiency in capturing the semantics of a whole text. For modeling topic labeling tasks with long input sequences, TopicRNN (Dieng et al., 2017) is proposed. It captures the dependencies of words in a document via latent topics, using RNNs to capture local dependencies and latent topic models to capture global semantic dependencies. Virtual Adversarial Training (VAT) (Miyato et al., 2017b) is a useful regularization method applicable to semi-supervised learning tasks. Miyato et al. (Miyato et al., 2017a) apply adversarial and virtual adversarial training to the text field and apply the perturbation to the word embeddings rather than to the original input text. The model improves the quality of the word embeddings and is less prone to overfitting during training. Capsule networks (Hinton et al., 2011) capture the relationships between features using dynamic routing between capsules, which are comprised of groups of neurons in a layer. Wang et al. (Wang et al., 2018b) propose an RNN-Capsule model with a simple capsule structure for the sentiment classification task.

Figure 7. The RNN based model (left) and the CNN based model (right).

In the backpropagation process of an RNN, the weights are adjusted by gradients, which are calculated by continuous multiplications of derivatives. If the derivatives are extremely small, continuous multiplication may cause a gradient vanishing problem. Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997), an improvement of the RNN, effectively alleviates the gradient vanishing problem. It is composed of a cell that remembers values over arbitrary time intervals and three gate structures that control the information flow. The gate structures include input gates, forget gates, and output gates. The LSTM classification method can better capture the connections among context feature words and uses the forget gate structure to filter useless information, which is conducive to improving the overall capturing ability of the classifier. Tree-LSTM (Tai et al., 2015) extends the sequential LSTM model to a tree structure. A whole subtree with little influence on the result can be forgotten through the LSTM forget gate mechanism in the Tree-LSTM model.
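
A minimal PyTorch sketch of the LSTM classification scheme described above: embed the tokens, run an LSTM, and predict the label from the final hidden state. Sizes and random inputs are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    """Embed tokens, run an LSTM, classify from the final hidden state."""
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):                  # (batch, seq_len)
        emb = self.embed(token_ids)                # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(emb)               # h_n: (1, batch, hidden_dim)
        return self.fc(h_n[-1])                    # logits: (batch, num_classes)

model = LSTMClassifier(vocab_size=10000, embed_dim=128, hidden_dim=256, num_classes=5)
print(model(torch.randint(0, 10000, (4, 30))).shape)   # torch.Size([4, 5])
```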

Natural Language Inference (NLI) predicts whether one text’s meaning can be deduced from another by measuring the semantic similarity between each pair of sentences. To consider other granular matchings and matchings in the reverse direction, Wang et al. (Wang et al., 2017b) propose a model for the NLI task named Bilateral multi-perspective matching (BiMPM). It encodes input sentences by the BiLSTM encoder. Then, the encoded sentences are matched in two directions. The results are aggregated in a fixed-length matching vector by another BiLSTM layer. Finally, the result is evaluated by a fully connected layer.

2.2.4. CNN-based Methods

Convolutional neural networks (CNNs) were originally proposed for image classification, using convolutional filters that can extract features of pictures. Unlike an RNN, a CNN can simultaneously apply convolutions defined by different kernels to multiple chunks of a sequence. Therefore, CNNs are used for many NLP tasks, including text classification. For text classification, the text must be represented as a vector similar to an image representation, and text features can be filtered from multiple angles, as shown in Fig. 7. Firstly, the word vectors of the input text are spliced into a matrix. The matrix is then fed into the convolutional layer, which contains several filters of different dimensions. Finally, the result of the convolutional layer passes through the pooling layer, and the pooled results are concatenated to obtain the final vector representation of the text. The category is predicted from the final vector.

As a first attempt to use CNNs for the text classification task, Kim introduces an unbiased convolutional neural network model, called TextCNN (Kim, 2014). It can better determine discriminative phrases in the max-pooling layer with one layer of convolution and learns hyperparameters other than the word vectors by keeping the word vectors static. Training only on labeled data is not enough for data-driven deep models; therefore, some researchers consider utilizing unlabeled data. Johnson et al. (Johnson and Zhang, 2015) propose a CNN text classification model based on two-view semi-supervised learning, which first uses unlabeled data to train the embeddings of text regions and then trains on labeled data. DNNs usually achieve better performance, but at increased computational complexity. Motivated by this, the deep pyramid convolutional neural network (DPCNN) (Johnson and Zhang, 2017) is proposed, which increases accuracy by raising the network depth at only a small additional computational cost. The DPCNN is more specific than ResNet (He et al., 2016), as all the shortcuts are exactly simple identity mappings without any complication for dimension matching.
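
A hedged PyTorch sketch in the style of TextCNN: parallel convolutions with different kernel widths over the embedded word matrix, max-over-time pooling, and a linear classifier. The hyperparameters are placeholder choices, not those of the original paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """Parallel 1D convolutions over word embeddings + max-over-time pooling."""
    def __init__(self, vocab_size, embed_dim, num_classes,
                 kernel_sizes=(3, 4, 5), num_filters=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes])
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_ids):                   # (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)   # (batch, embed_dim, seq_len)
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))    # logits

model = TextCNN(vocab_size=10000, embed_dim=128, num_classes=2)
print(model(torch.randint(0, 10000, (4, 50))).shape)   # torch.Size([4, 2])
```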

According to the minimum embedding unit of text, embedding methods are divided into character-level, word-level, and sentence-level embeddings. Character-level embeddings can handle Out-of-Vocabulary (OOV) words. Word-level embeddings learn the syntax and semantics of words. Moreover, sentence-level embeddings can capture relationships among sentences. Motivated by these, Nguyen et al. (Nguyen and Nguyen, 2017) propose a dictionary-based deep learning method that enriches word-level embeddings by constructing semantic rules and uses a deep CNN for character-level embeddings. Adams et al. (Adams and McKenzie, 2018) propose a character-level CNN model, called MGTC, to classify texts written in multiple languages. TransCap (Chen and Qian, 2019) is proposed to encapsulate sentence-level semantic representations into semantic capsules and transfer document-level knowledge.

RNN based models capture the sequential information to learn the dependency among input words, and CNN based models extract the relevant features from the convolution kernels. Thus some works study the fusion of the two methods. BLSTM-2DCNN (Zhou et al., 2016a) integrates a Bidirectional LSTM (BiLSTM) with two-dimensional max pooling. It uses a 2D convolution to sample more meaningful information of the matrix and understands the context better through BiLSTM. Moreover, Xue et al. (Xue et al., 2017) propose MTNA, a combination of BiLSTM and CNN layers, to solve aspect category classification and aspect term extraction tasks.

Figure 8. Hierarchical Attention Network (Yang et al., 2016).

2.2.5. Attention-based Methods

CNNs and RNNs provide excellent results on tasks related to text classification. However, these models are not intuitive, and their interpretability is poor, especially for classification errors, which cannot be explained due to the opacity of the hidden representations. Attention-based methods have been successfully used in text classification. Bahdanau et al. (Bahdanau et al., 2015) first propose an attention mechanism that can be used in machine translation. Motivated by this, Yang et al. (Yang et al., 2016) introduce the hierarchical attention network (HAN) to gain better visualization by exploiting the most informative components of a text, as shown in Fig. 8. HAN includes two encoders and two levels of attention layers. The attention mechanism lets the model pay different amounts of attention to specific inputs. It first aggregates essential words into sentence vectors and then aggregates vital sentence vectors into text vectors. Through the two levels of attention, it can learn how much each word and sentence contributes to the classification judgment, which is beneficial for applications and analysis.

The attention mechanism can improve performance with interpretability for text classification, which makes it popular. There are some other works based on attention. LSTMN (Cheng et al., 2016) is proposed to process text step by step from left to right and to perform shallow reasoning through memory and attention. Wang et al. (Wang et al., 2016) propose an attention-based LSTM neural network that exploits the connection between the aspects and the input sentences. BI-Attention (Zhou et al., 2016b) is proposed for cross-lingual text classification to catch bilingual long-distance dependencies. Hu et al. (Hu et al., 2018) propose an attention mechanism based on category attributes to address the imbalanced numbers of various charges, including few-shot charges.

Self-attention (Vaswani et al., 2017) captures the weight distribution of words in sentences by constructing K, Q and V matrices among sentences, which can capture long-range dependencies in text classification. We give an example of self-attention, as shown in Fig. 9. Each input word vector is projected into three n-dimensional vectors: a query, a key, and a value. After self-attention, the output vector of each word is the attention-weighted sum of the value vectors, where the weights come from the scaled dot products between its query and all keys, normalized by a softmax. All output vectors can be computed in parallel. Lin et al. (Lin et al., 2017) used source-token self-attention to explore the weight of every token with respect to the entire sentence in the sentence representation task. To capture long-range dependencies, the Bi-directional Block Self-Attention Network (Bi-BloSAN) (Shen et al., 2018b) applies an intra-block self-attention network (SAN) to each block of the split sequence and an inter-block SAN to the outputs.

Figure 9. An example of self-attention.
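
The sketch below spells out the scaled dot-product self-attention just described for a single sentence, using plain PyTorch tensor operations; the dimensions and random projection matrices are illustrative.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over one sequence of word vectors.

    x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_k) projection matrices.
    Each output position is a softmax-weighted sum of all value vectors.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v          # queries, keys, values
    scores = q @ k.T / k.shape[-1] ** 0.5        # (seq_len, seq_len) dot products
    weights = F.softmax(scores, dim=-1)          # attention distribution per word
    return weights @ v                           # (seq_len, d_k) outputs

seq_len, d_model, d_k = 6, 16, 8
x = torch.randn(seq_len, d_model)                # embeddings of one sentence
w_q, w_k, w_v = (torch.randn(d_model, d_k) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)    # torch.Size([6, 8])
```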

Aspect-based sentiment analysis (ABSA) breaks down a text into multiple aspects and allocates each aspect a sentiment polarity. The sentiment polarity can be of three types: positive, neutral, and negative. Some attention-based models are proposed to identify the fine-grained opinion polarity towards a specific aspect for aspect-based sentiment tasks. ATAE-LSTM (Wang et al., 2016) can concentrate on different parts of each sentence according to the input through attention mechanisms. MGAN (Fan et al., 2018) combines a fine-grained attention mechanism with a coarse-grained attention mechanism to learn the word-level interaction between context and aspect.

To catch the complicated semantic relationship between each question and the candidate answers for the QA task, Tan et al. (Tan et al., 2016) introduce CNN and RNN and generate answer embeddings by using a simple one-way attention mechanism affected by the question context. The attention captures the dependence between the embeddings of questions and answers. Extractive QA can be seen as a text classification task: it takes a question and multiple candidate answers as input and classifies every candidate answer to recognize the correct one. Furthermore, AP-BILSTM (dos Santos et al., 2016) with a two-way attention mechanism can learn the weights between the question and each candidate answer to obtain the importance of each candidate answer to the question.

2.2.6. Transformer-based Methods

Pre-trained language models effectively learn global semantic representations and significantly boost NLP tasks, including text classification. They generally use unsupervised methods to mine semantic knowledge automatically and then construct pre-training objectives so that machines can learn to understand semantics.

As shown in Fig. 10, we illustrate the differences in the model architectures among ELMo (Peters et al., 2018), OpenAI GPT (Radford, 2018), and BERT (Devlin et al., 2019). ELMo (Peters et al., 2018) is a deep contextualized word representation model that is readily integrated into other models. It can model complicated characteristics of words and learn different representations for various linguistic contexts. It learns each word embedding according to the context words with a bi-directional LSTM. GPT (Radford, 2018) employs supervised fine-tuning and unsupervised pre-training to learn general representations that transfer with limited adaptation to many NLP tasks. Furthermore, the domain of the target task does not need to be similar to that of the unlabeled datasets. The training procedure of the GPT algorithm usually includes two stages. Firstly, the initial parameters of a neural network model are learned with a modeling objective on the unlabeled dataset. We can then employ the corresponding supervised objective to adapt these parameters to the target task. To pre-train deep bidirectional representations from unlabeled text through joint conditioning on both left and right context in every layer, the BERT model (Devlin et al., 2019), proposed by Google, significantly improves performance on NLP tasks, including text classification. BERT is fine-tuned by adding just one additional output layer to construct models for multiple NLP tasks, such as SA, QA, and machine translation. Comparing these three models, ELMo is a feature-based method using LSTMs, whereas BERT and OpenAI GPT are fine-tuning approaches using Transformers. Furthermore, ELMo and BERT are bidirectional training models, while OpenAI GPT trains from left to right. BERT thus combines the advantages of ELMo and OpenAI GPT and obtains better results.

Figure 10. Differences in pre-trained model architectures (Devlin et al., 2019), including BERT, OpenAI GPT and ELMo. E_i represents the embedding of the i-th input. Trm represents the transformer block. T_i represents the predicted tag of the i-th input.

Transformer-based models can parallelize computation without the recurrent processing of sequential information, which suits large-scale datasets and makes them popular for NLP tasks. Thus, some other Transformer-based works have been used for text classification tasks and achieve excellent performance. RoBERTa (Liu et al., 2019b) adopts a dynamic masking method that generates a new masking pattern every time a sequence is fed into the model. It uses more data for longer pre-training and estimates the influence of various essential hyperparameters and the size of the training data. ALBERT (Lan et al., 2020) uses two parameter-reduction schemes. In general, these methods adopt unsupervised objective functions for pre-training, including next sentence prediction, masking techniques, and permutation. These objectives, based on word prediction, demonstrate a strong ability to learn word dependence and semantic structure (Jawahar et al., 2019). XLNet (Yang et al., 2019) is a generalized autoregressive pre-training approach. It maximizes the expected likelihood over all permutations of the factorization order to learn bidirectional context. Furthermore, it can overcome the weaknesses of BERT through its autoregressive formulation and integrates ideas from Transformer-XL (Dai et al., 2019) into pre-training.
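
A hedged sketch of fine-tuning a pre-trained Transformer encoder for text classification, assuming the Hugging Face transformers library; the checkpoint name, label count, learning rate, and example sentences are placeholder choices.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load a pre-trained encoder and attach a randomly initialized classification head.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

texts = ["a touching and well acted film", "a dull, predictable mess"]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# One fine-tuning step: the whole encoder and the new output layer are updated.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**batch, labels=labels)
outputs.loss.backward()
optimizer.step()
print(outputs.logits.argmax(dim=-1))
```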

2.2.7. GNN-based Methods

DNN models such as CNNs achieve great performance on data with a regular structure, but not on arbitrarily structured graphs. Some researchers study how to expand them to arbitrarily structured graphs (Henaff et al., 2015) (Defferrard et al., 2016). With the increasing attention on graph neural networks (GNNs), GNN-based models obtain excellent performance by encoding the syntactic structure of sentences in the semantic role labeling task (Marcheggiani and Titov, 2017), the relation classification task (Li et al., 2019) and the machine translation task (Bastings et al., 2017). GNNs turn text classification into a graph node classification task. We show a GCN model for text classification with four input texts, as shown in Fig. 11. Firstly, the four input texts and the words in the texts, defined as nodes, are constructed into graph structures. The graph nodes are connected by bold black edges, which indicate document-word edges and word-word edges. The weight of each word-word edge usually reflects the co-occurrence frequency of the two words in the corpus. Then, the words and texts are represented through the hidden layer. Finally, the labels of all input texts can be predicted by the graph.

GNN-based models can learn the syntactic structure of sentences, which prompts some researchers to study GNNs for text classification. DGCNN (Peng et al., 2018) is a graph-CNN that converts text to a graph-of-words, having the advantage of learning different levels of semantics with CNN models. Yao et al. (Yao et al., 2019) propose the text graph convolutional network (TextGCN), which builds a heterogeneous word-document graph for a whole dataset and captures global word co-occurrence information. To enable GNN-based models to support online testing, Huang et al. (Huang et al., 2019) build a graph for each text with global parameter sharing, instead of a corpus-level graph structure, to preserve global information and reduce the burden. TextING (Zhang et al., 2020) builds individual graphs for each document and learns text-level word interactions by GNN to effectively produce embeddings for obscure words in new texts.

Figure 11. The GCN based model. Black bold edges are document-word edges and word-word edges in the graph.
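
To make the graph propagation idea concrete, the sketch below implements a simple two-layer GCN of the form H' = ReLU(A_hat H W) over a tiny made-up document-word graph; the adjacency matrix, feature sizes, and node layout are illustrative assumptions, not the TextGCN construction itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNLayer(nn.Module):
    """One graph convolution: H' = ReLU(A_hat @ H @ W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, a_hat, h):      # a_hat: normalized adjacency, h: node features
        return F.relu(a_hat @ self.linear(h))

# Toy symmetric graph with self-loops: nodes 0-1 are documents, nodes 2-4 are
# words; edges mark document-word occurrence and word-word co-occurrence.
adj = torch.tensor([[1, 0, 1, 1, 0],
                    [0, 1, 0, 1, 1],
                    [1, 0, 1, 0, 0],
                    [1, 1, 0, 1, 0],
                    [0, 1, 0, 0, 1]], dtype=torch.float)
deg = adj.sum(dim=1)
a_hat = adj / deg.sqrt().unsqueeze(0) / deg.sqrt().unsqueeze(1)  # D^-1/2 A D^-1/2

n = adj.shape[0]
layer1, layer2 = GCNLayer(n, 16), GCNLayer(16, 2)   # two stacked GCN layers
h = layer2(a_hat, layer1(a_hat, torch.eye(n)))      # one-hot initial node features
print(h.shape)  # torch.Size([5, 2]); rows of the document nodes give their logits
```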

Graph attention networks (GATs) (Velickovic et al., 2018) employ masked self-attention layers by attending over each node’s neighbors. Thus, some GAT-based models are proposed to compute the hidden representation of each node. The heterogeneous graph attention network (HGAT) (Hu et al., 2019), with a dual-level attention mechanism, learns the importance of different neighboring nodes and node types for the current node. The model propagates information on the graph and captures relations to address the semantic sparsity in semi-supervised short text classification. MAGNET (Pal et al., 2020) is proposed to capture the correlation among labels based on GATs; it learns the crucial dependencies between the labels and generates classifiers through a feature matrix and a correlation matrix.

Event prediction (EP) can be divided into generated event prediction and selective event prediction (also known as script event prediction). EP, referring to scripted event prediction in this review, infers the subsequent event according to the existing event context. Unlike other text classification tasks, texts in EP are composed of a series of sequential subevents. Extracting features of the relationship among such subevents is of critical importance. SGNN (Li et al., 2018) is proposed to model event interactions and learn better event representations by constructing an event graph to utilize the event network information better. The model makes full use of dense event connections for the EP task.

2.2.8. Others

In addition to all the above models, there are some other individual models. Here we introduce some exciting models.

Siamese neural network.

The siamese neural network (Bromley et al., 1993), also called a twin neural network (Twin NN), utilizes equal weights while working in tandem on two distinct input vectors to compute comparable output vectors. Mueller et al. (Mueller and Thyagarajan, 2016) present a siamese adaptation of the LSTM network for pairs of variable-length sequences. The model is employed to estimate the semantic similarity between texts, exceeding carefully handcrafted features and proposed neural network models of higher complexity. The model further represents text using neural networks whose inputs are word vectors learned separately from a vast dataset. To handle unbalanced data classification in the medical domain, Jayadeva et al. (Jayadeva et al., 2019) use a Twin NN model to learn from enormous unbalanced corpora. The objective functions achieve the Twin SVM approach with non-parallel decision boundaries for the corresponding classes, decrease the Twin NN complexity, and optimize the feature map to better discriminate between classes.

Virtual adversarial training (VAT)

Deep learning methods require many extra hyperparameters, which increase the computational complexity. VAT (Miyato et al., 2015), a regularization method based on local distributional smoothness, can be used in semi-supervised tasks, requires only a small number of hyperparameters, and can be interpreted directly as robust optimization. Miyato et al. (Miyato et al., 2017a) use VAT to effectively improve the robustness, the generalization ability, and the word embedding performance of the model.

Reinforcement learning (RL)

RL learns the best action in a given environment through maximizing cumulative rewards. Zhang et al. (Zhang et al., 2018) offer an RL approach to establish structured sentence representations via learning the structures related to tasks. The model has Information Distilled LSTM (ID-LSTM) and Hierarchical Structured LSTM (HS-LSTM) representation models. The ID-LSTM learns the sentence representation by choosing essential words relevant to tasks, and the HS-LSTM is a two-level LSTM for modeling sentence representation.

QA style for the sentiment classification task.

It is an interesting attempt to treat the sentiment classification task as a QA task. Shen et al. (Shen et al., 2018a) create a high-quality annotated corpus. A three-stage hierarchical matching network was proposed to consider the matching information between questions and answers.

External commonsense knowledge.

Because the information of an event itself is insufficient to distinguish it from other events for the EP task, Ding et al. (Ding et al., 2019) consider that events extracted from the original text lack commonsense knowledge, such as the intention and emotion of the event participants, and they incorporate such external knowledge. The model improves the effect of stock prediction, EP, and so on.

Quantum language model.

In the quantum language model, words and the dependencies among words are represented through fundamental quantum events. Zhang et al. (Zhang et al., 2019) design a quantum-inspired sentiment representation method to learn both the semantic and the sentiment information of subjective text. By feeding density matrices into the embedding layer, the performance of the model improves.
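The density-matrix idea can be illustrated in a few lines of numpy: a piece of text is represented as a weighted mixture of projectors onto normalized word vectors, which yields a symmetric, positive semi-definite matrix with unit trace. The vectors and mixture weights below are random placeholders, not a trained model.

```python
# Build a toy density-matrix representation of a 5-word text.
import numpy as np

rng = np.random.default_rng(0)
word_vecs = rng.normal(size=(5, 8))                  # 5 words, 8-dim embeddings
word_vecs /= np.linalg.norm(word_vecs, axis=1, keepdims=True)

weights = np.full(5, 1 / 5)                          # uniform mixture weights
rho = sum(w * np.outer(v, v) for w, v in zip(weights, word_vecs))

# A valid density matrix is symmetric, positive semi-definite, with trace 1.
print(np.isclose(np.trace(rho), 1.0), np.allclose(rho, rho.T))
```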

Summary. Deep learning consists of multiple hidden layers in a neural network with a higher level of complexity and can be trained on unstructured data. Deep learning architectures can learn feature representations directly from the input without extensive manual intervention or prior knowledge. However, deep learning is a data-driven technology that usually needs enormous amounts of data to achieve high performance. Although self-attention-based models can bring some word-level interpretability to DNNs, this is still not sufficient, compared with shallow models, to explain why and how they work well.

3. Datasets and Evaluation Metrics

Datasets C L N Related Papers Sources Applications
MR 2 20 10,662 (Kim, 2014) (Kalchbrenner et al., 2014) (Yang et al., 2018a) (Yao et al., 2019) (144) SA
SST-1 5 18 11,855 (Socher et al., 2013) (Kim, 2014) (Tai et al., 2015) (Zhu et al., 2015)(Cheng et al., 2016) (178) SA
SST-2 2 19 9,613 (Socher et al., 2013) (Kim, 2014) (Liu et al., 2015) (Liu et al., 2016a) (Devlin et al., 2019) (Socher et al., 2013) SA
Subj 2 23 10,000 (Kim, 2014) (Liu et al., 2016a) (Yang et al., 2018a) (Pang and Lee, 2004) QA
TREC 6 10 5,952 (Kim, 2014) (Kalchbrenner et al., 2014) (Liu et al., 2015) (Wang et al., 2017a) (186) QA
CR 2 19 3,775 (Kim, 2014) (Yang et al., 2018a) (Hu and Liu, 2004) QA
MPQA 2 3 10,606 (Socher et al., 2011) (Kim, 2014) (Shen et al., 2018b) (143) SA
Twitter 3 19 11,209 (Kalchbrenner et al., 2014)(Wang et al., 2017a) (187) SA
EP 5 129 31,675 (Socher et al., 2011) (76) SA
IMDB 2 294 50,000 (Le and Mikolov, 2014) (Iyyer et al., 2015) (Liu et al., 2015) (Liu et al., 2016a) (Miyato et al., 2017a) (Yang et al., 2019) (Diao et al., 2014) SA
20NG 20 221 18,846 (Lai et al., 2015) (Johnson and Zhang, 2016) (Bao et al., 2020) (Yao et al., 2019) (Wu et al., 2019) (2) NC
Fudan 20 2981 18,655 (Lai et al., 2015) (82) TL
AG News 4 45/7 127,600 (Zhang et al., 2015) (Johnson and Zhang, 2017) (Wang et al., 2017a) (Yang et al., 2018a) (Yang et al., 2019) (46) NC
Sogou 6 578 510,000 (Zhang et al., 2015) (Wang et al., 2008) NC
DBPedia 14 55 630,000 (Zhang et al., 2015) (Johnson and Zhang, 2017) (Miyato et al., 2017a) (Sun et al., 2019) (Lehmann et al., 2015) TL
Yelp.P 2 153 598,000 (Zhang et al., 2015) (Johnson and Zhang, 2017) (Tang et al., 2015) SA
Yelp.F 5 155 700,000 (Zhang et al., 2015) (Yang et al., 2016) (Johnson and Zhang, 2017) (Tang et al., 2015) SA
YahooA 10 112 1,460,000 (Zhang et al., 2015) (Yang et al., 2016) (Zhang et al., 2015) TL
Amz.P 2 91 4,000,000 (You et al., 2019) (Zhang et al., 2015) (47) SA
Amz.F 5 93 3,650,000 (Zhang et al., 2015) (Yang et al., 2016) (You et al., 2019) (47) SA
DSTC 4 89 - 30,000 (Lee and Dernoncourt, 2016) (Kim et al., 2016) DAC
MRDA 5 - 62,000 (Lee and Dernoncourt, 2016) (Ang et al., 2005) DAC
SwDA 43 - 1,022,000 (Lee and Dernoncourt, 2016) (Jurafsky and Shriberg, 1997) DAC
RCV1 103 240 807,595 (Johnson and Zhang, 2016) (Shimura et al., 2018) (Peng et al., 2018) (Chalkidis et al., 2019) (Pal et al., 2020) (Lewis et al., 2004) NC
RCV1-V2 103 124 804,414 (Yang et al., 2018b) (Pal et al., 2020) (165) NC
NLP&CC 2013 2 - 115,606 (Zhou et al., 2016b) (149) SA
SS-Twitter 2 - 2,113 (Felbo et al., 2017) (Thelwall et al., 2012) SA
SS-Youtube 2 - 2,142 (Felbo et al., 2017) (Thelwall et al., 2012) SA
SE1604 3 - 39,141 (Felbo et al., 2017) (Nakov et al., 2016) SA
Bing 4 20 34,871 (Wang et al., 2017a) (Wang et al., 2014) TL
AAPD 54 163 55,840 (Yang et al., 2018b) (Pal et al., 2020) (26) TL
Reuters 90 1 10,788 (Yang et al., 2018a) (Pal et al., 2020) (167) NC
R8 8 66 7,674 (Yao et al., 2019) (Wu et al., 2019) (Huang et al., 2019) (166) NC
R52 52 70 9,100 (Yao et al., 2019) (Wu et al., 2019) (Huang et al., 2019) (166) NC
NYTimes 2,318 629 1,855,659 (Peng et al., 2018) (150) NC
SQuAD - 5,000 5,570 (Peters et al., 2018) (Peters et al., 2018) (Liu et al., 2019b) (Lan et al., 2020) (Rajpurkar et al., 2016) QA
WikiQA - 873 243 (dos Santos et al., 2016) (Yang et al., 2015) QA
Ohsumed 23 136 7,400 (Yao et al., 2019) (Wu et al., 2019) (Huang et al., 2019) (152) TL
Amazon670K 670 244 643,474 (Shimura et al., 2018) (You et al., 2019) (48) TL
EUR-Lex 3,956 1,239 19,314 (Liu et al., 2017) (You et al., 2019) (Chalkidis et al., 2019) (Peng et al., 2019) (Chalkidis et al., 2019) (77) TL
Table 3. Summary statistics for the datasets. C: Number of target classes. L: Average sentence length. N: Dataset size.

3.1. Datasets

The availability of labeled datasets for text classification has become the main driving force behind the fast advancement of this research field. In this section, we summarize the characteristics of these datasets in terms of domains and give an overview in Table 3, including the number of categories, average sentence length, the size of each dataset, related papers, data sources, and applications.

Sentiment Analysis (SA). SA is the process of analyzing and reasoning about subjective text that carries emotional tone. Unlike traditional text classification, which analyzes the objective content of a text, SA focuses on determining whether the text supports a particular point of view. SA can be binary or multi-class: binary SA divides the text into two categories, positive and negative, while multi-class SA classifies text into multi-level or fine-grained labels. The SA datasets include MR, SST, MPQA, IMDB, Yelp, AM, Subj (Pang and Lee, 2004), CR (Hu and Liu, 2004), SS-Twitter, SS-Youtube, Twitter, SE1604, EP, and so on. Here we detail several of the primary datasets.

Movie Review (MR). The MR (Pang and Lee, 2005) (144) is a movie review dataset in which each review corresponds to a sentence. The corpus has 5,331 positive and 5,331 negative reviews. Ten-fold cross-validation with random splits is commonly used for evaluation on MR.
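Evaluation on MR typically follows this 10-fold protocol; below is a minimal scikit-learn sketch of it with a TF-IDF plus logistic-regression baseline on placeholder texts (the model and texts are illustrative assumptions, not the referenced papers' systems).

```python
# 10-fold cross-validation of a simple sentiment classifier, MR-style.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

texts = ["a moving and funny film", "flat characters and a dull plot"] * 50
labels = [1, 0] * 50

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
scores = cross_val_score(clf, texts, labels, cv=10, scoring="accuracy")
print(scores.mean())   # mean accuracy over the 10 folds
```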

Stanford Sentiment Treebank (SST). The SST (178) is an extension of MR and has two versions. SST-1 has fine-grained labels with five classes and includes 8,544 training texts and 2,210 test texts. SST-2 has binary labels and 9,613 texts, partitioned into 6,920 training texts, 872 development texts, and 1,821 test texts.

The Multi-Perspective Question Answering (MPQA). The MPQA (Wiebe et al., 2005) (143) is an opinion dataset. It has two class labels and serves the opinion polarity detection sub-task. MPQA includes 10,606 sentences extracted from news articles from various news sources; it contains 3,311 positive texts and 7,293 negative texts.

IMDB reviews. The IMDB review dataset (Diao et al., 2014) was developed for binary sentiment classification of film reviews, with the same number of reviews in each class. It is split evenly into training and test sets, with 25,000 reviews each.
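For quick experimentation, a pre-tokenized copy of IMDB ships with tf.keras (word-index sequences rather than raw text); the snippet below is one convenient way to load it, though published results are usually computed on the raw reviews.

```python
# Load the pre-tokenized IMDB sentiment dataset bundled with tf.keras.
from tensorflow.keras.datasets import imdb

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=10000)
print(len(x_train), len(x_test))   # 25000 25000
print(y_train[:5])                 # binary labels: 0 = negative, 1 = positive
```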

Yelp reviews. The Yelp review dataset (Tang et al., 2015) is summarized from the Yelp Dataset Challenges in 2013, 2014, and 2015. It has two versions. Yelp-2 is used for the binary (negative/positive) sentiment classification task and includes 560,000 training texts and 38,000 test texts. Yelp-5 is used to detect fine-grained sentiment labels, with 650,000 training and 50,000 test texts across all classes.

Amazon Reviews (AM). The AM (Zhang et al., 2015) is a popular corpus formed by collecting product reviews from the Amazon website (47). It has two versions. Amazon-2, with two classes, includes 3,600,000 training samples and 400,000 test samples. Amazon-5, with five classes, includes 3,000,000 training and 650,000 test reviews.

News Classification (NC). News content is one of the most crucial information sources and has a critical influence on people. An NC system helps users obtain essential knowledge in real time. News classification applications mainly include recognizing news topics and recommending related news according to user interests. The news classification datasets include 20NG, AG, R8, R52, Sogou, and so on. Here we detail several of the primary datasets.

20 Newsgroups (20NG). The 20NG (2) is a newsgroup text dataset. It has 20 categories, each with approximately the same number of texts, and 18,846 texts in total.

AG News (AG). The AG News (Zhang et al., 2015) (46) is built from articles collected by an academic news search engine, keeping the four largest classes and using the title and description fields of each news item. AG contains 120,000 texts for training and 7,600 texts for testing.

R8 and R52. R8 and R52 are two subsets of Reuters (167). R8 (166) has 8 categories, divided into 5,485 training files and 2,189 test files. R52 has 52 categories, split into 6,532 training files and 2,568 test files.

Sogou News (Sogou). The Sogou News (Sun et al., 2019) combines two datasets, the SogouCA and SogouCS news corpora. The label of each text is the domain name in its URL.

Topic Labeling (TL). Topic analysis attempts to capture the meaning of a text by identifying its themes. Topic labeling is one of the essential components of topic analysis, aiming to assign one or more subjects to each document to simplify the analysis. The topic labeling datasets include DBPedia, Ohsumed, EUR-Lex, WOS, PubMed, and YahooA. Here we detail several of the primary datasets.

DBpedia. The DBpedia (Lehmann et al., 2015) is a large-scale, multilingual knowledge base generated from Wikipedia's most commonly used infoboxes. A new version of DBpedia is published each month, adding or deleting classes and properties. The most widely used version of DBpedia for classification has 14 classes and is divided into 560,000 training samples and 70,000 test samples.

Ohsumed. The Ohsumed (152) belongs to the MEDLINE database. It includes 7,400 texts and has 23 cardiovascular disease categories. All texts are medical abstracts and are labeled into one or more classes.

Yahoo answers (YahooA). The YahooA (Zhang et al., 2015) is a topic labeling dataset with 10 classes. It includes 140,000 training samples and 5,000 test samples. Each text contains three elements: the question title, the question context, and the best answer.

Question Answering (QA). The QA task can be divided into two types: extractive QA and generative QA. Extractive QA provides multiple candidate answers for each question, from which the correct one must be chosen, so text classification models can be used for the extractive QA task. The QA discussed in this paper is extractive QA: the QA system applies a text classification model to recognize the correct answer and treat the others as candidates. The question answering datasets include SQuAD, MS MARCO, TREC-QA, WikiQA, and Quora (1). Here we detail several of the primary datasets.

Stanford Question Answering Dataset (SQuAD). The SQuAD (Rajpurkar et al., 2016) is a set of question-answer pairs obtained from Wikipedia articles. It has two versions. SQuAD1.1 contains 107,785 question-answer pairs on 536 articles. SQuAD2.0 combines the 100,000 questions of SQuAD1.1 with more than 50,000 unanswerable questions written by crowdworkers in a form similar to answerable ones (Rajpurkar et al., 2018).

MS MARCO. The MS MARCO (Nguyen et al., 2016) contains questions and answers. The questions and part of the answers are sampled from actual web texts via the Bing search engine, while the remaining answers are generated. It was released by Microsoft for developing generative QA systems.

TREC-QA. The TREC-QA (186) includes 5,452 training texts and 500 testing texts. It has two versions. TREC-6 contains 6 categories, and TREC-50 has 50 categories.

WikiQA. The WikiQA dataset (Yang et al., 2015) also includes questions that have no correct answer, requiring systems to evaluate whether a valid answer exists.

Natural Language Inference (NLI). NLI is used to predict whether the meaning of one text can be deduced from another. Paraphrasing is a generalized form of NLI, which measures the semantic similarity of sentence pairs to decide whether one sentence is a restatement of another. The NLI datasets include SNLI, MNLI, SICK, STS, RTE, SciTail, MSRP, etc. Here we detail several of the primary datasets.

The Stanford Natural Language Inference (SNLI). The SNLI (Bowman et al., 2015) is generally applied to NLI tasks. It contains 570,152 human-annotated sentence pairs, including training, development, and test sets, which are annotated with three categories: neutral, entailment, and contradiction.

Multi-Genre Natural Language Inference (MNLI). The Multi-NLI (Williams et al., 2018) is an expansion of SNLI, embracing a broader scope of written and spoken text genres. It includes 433,000 sentence pairs annotated by textual entailment labels.

Sentences Involving Compositional Knowledge (SICK). The SICK (Marelli et al., 2014) contains almost 10,000 English sentence pairs, each labeled as neutral, entailment, or contradiction.

Microsoft Research Paraphrase (MSRP). The MSRP (Dolan et al., 2004) consists of sentence pairs, usually used for the text-similarity task. Each pair is annotated with a binary label indicating whether the two sentences are paraphrases. It includes 4,076 training pairs and 1,725 test pairs.

Dialog Act Classification (DAC). A dialog act describes an utterance in a dialog based on semantic, pragmatic, and syntactic criteria. DAC assigns a label to each utterance according to its category of meaning, which helps to learn the speaker's intentions. Here we detail several of the primary datasets, including DSTC 4, MRDA, and SwDA.

Dialog State Tracking Challenge 4 (DSTC 4). The DSTC 4 (Kim et al., 2016) is used for dialog act classification. It has 89 training classes, 24,000 training texts, and 6,000 testing texts.

ICSI Meeting Recorder Dialog Act (MRDA). The MRDA (Ang et al., 2005) is used for dialog act classification. It has 5 training classes, 51,000 training texts, 11,000 testing texts, and 11,000 validation texts.

Switchboard Dialog Act (SwDA). The SwDA (Jurafsky and Shriberg, 1997) is used for dialog act classification. It has 43 training classes, 1,003,000 training texts, 19,000 testing texts and 112,000 validation texts.

Multi-label datasets. In multi-label classification, an instance is assigned multiple labels, and each label can take one of several possible classes. There are many multi-label text classification datasets, including Reuters, Education, Patent, RCV1, RCV1-2K, AmazonCat-13K, BlurbGenreCollection, WOS-11967, AAPD, etc. Here we detail several of the main datasets.

Reuters news. The Reuters (166) (167) is a widely used text classification dataset collected from the Reuters financial news service. It has 90 training classes, 7,769 training texts, and 3,019 test texts, containing both multi-label and single-label examples. There are also several Reuters subsets, such as R8, R52, RCV1, and RCV1-v2.

Patent Dataset. The Patent Dataset is obtained from USPTO (https://www.uspto.gov/), the U.S. patent system, whose granted patents contain textual details such as title and abstract. It contains 100,000 real-world US patents with multiple hierarchical categories.

Reuters Corpus Volume I (RCV1) and RCV1-2K. The RCV1 (Lewis et al., 2004) is collected from Reuters news articles from 1996 to 1997 and is human-labeled with 103 categories. It consists of 23,149 training texts and 784,446 test texts. The RCV1-2K dataset has the same features as RCV1, but its label set has been expanded with new labels and contains 2,456 labels in total.

Web of Science (WOS-11967). The WOS-11967 (Kowsari et al., 2017) is crawled from the Web of Science, consisting of abstracts of published papers with two labels for each example. It is shallower, but significantly broader, with fewer classes in total.

Arxiv Academic Paper Dataset (AAPD). The AAPD (26) is a large multi-label text classification dataset in the computer science field, collected from https://arxiv.org/. It has 55,840 papers, each consisting of an abstract and the corresponding subjects, with 54 labels in total. The aim is to predict the subjects of each paper according to its abstract.

Others. There are some datasets for other applications, such as Geonames toponyms, Twitter posts, and so on.

3.2. Evaluation Metrics

In terms of evaluating text classification models, accuracy and F1 score are the most commonly used metrics. As classification tasks become more difficult, or for certain specialized tasks, additional evaluation metrics are used. For example, P@K and Micro-F1 are used to evaluate multi-label text classification performance, and MRR is usually used to estimate the performance of QA tasks. Table 4 lists the notations used in the evaluation metrics.

Notations Descriptions
TP true positive
FP false positive
TN true negative
FN false negative
TP_t true positive of the t-th label on a text
FP_t false positive of the t-th label on a text
TN_t true negative of the t-th label on a text
FN_t false negative of the t-th label on a text
S label set of all samples
L the number of ground truth labels or possible answers on each text
Q the number of predicted labels on each text
rank_i the ranking of the ground-truth answer at answer i
k the number of selected labels in extreme multi-label text classification
Table 4. The notations used in evaluation metrics.

3.2.1. Single-label metrics

Single-label text classification assigns each text to one of the candidate categories and is applied in NLP tasks such as QA, SA, and dialogue systems (Lee and Dernoncourt, 2016). In single-label text classification, one text belongs to just one category, so the relations among labels need not be considered. Here we introduce some evaluation metrics used for single-label text classification tasks.

Accuracy and Error Rate. Accuracy and Error Rate are the fundamental metrics for a text classification model. They are respectively defined as

Accuracy = (TP + TN) / (TP + FP + TN + FN),    Error Rate = (FP + FN) / (TP + FP + TN + FN) = 1 − Accuracy.

Precision, Recall and F1.

These are vital metrics for unbalanced test sets, where Accuracy and Error Rate alone can be misleading, for example, when most of the test samples share a single class label. F1 is the harmonic mean of Precision and Recall. Precision, Recall, and F1 are defined as

Precision = TP / (TP + FP),    Recall = TP / (TP + FN),    F1 = 2 · Precision · Recall / (Precision + Recall).

The best results are obtained when Precision, Recall, and F1 reach 1; conversely, values of 0 indicate the worst results. For the multi-class classification problem, the precision and recall of each class can be calculated separately and then used to analyze both per-class and overall performance.
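The following sketch computes these single-label metrics directly from the TP/FP/TN/FN counts, mirroring the formulas above; the counts in the usage line are made-up numbers.

```python
# Accuracy, error rate, precision, recall, and F1 from confusion counts.
def binary_metrics(tp, fp, tn, fn):
    n = tp + fp + tn + fn
    accuracy = (tp + tn) / n
    error_rate = (fp + fn) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, error_rate, precision, recall, f1

print(binary_metrics(tp=40, fp=10, tn=45, fn=5))
# approximately (0.85, 0.15, 0.8, 0.889, 0.842)
```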

Exact Match (EM). The EM is a metric for QA tasks that measures whether a prediction matches one of the ground-truth answers exactly. It is the primary metric utilized on the SQuAD dataset.

Mean Reciprocal Rank (MRR). The MRR is usually applied for assessing the performance of ranking algorithms on QA and Information Retrieval (IR) tasks. MRR is defined as

MRR = (1 / Q) · Σ_{i=1}^{Q} (1 / rank_i),

where the sum runs over the Q evaluated answers and rank_i is the ranking of the ground-truth answer at answer i.

Hamming-loss (HL). The HL (Schapire and Singer, 1999) assesses the fraction of misclassified instance-label pairs, where a relevant label is omitted or an irrelevant one is predicted.
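As a small illustration, MRR can be computed from the (1-based) ranks at which the ground-truth answers appear:

```python
# Mean reciprocal rank over a list of 1-based ranks of the correct answers.
def mean_reciprocal_rank(ranks):
    return sum(1.0 / r for r in ranks) / len(ranks)

print(mean_reciprocal_rank([1, 3, 2]))   # (1 + 1/3 + 1/2) / 3 ≈ 0.611
```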

3.2.2. Multi-label metrics

Compared with single-label text classification, multi-label text classification assigns multiple category labels to each text, and the number of labels per text is variable. The metrics above are designed for single-label text classification and are not suitable for multi-label tasks, so dedicated metrics have been designed for multi-label text classification.

Micro-F1. The Micro-F1 (Manning et al., 2008) is a measure that considers the overall precision and recall of all labels. Micro-F1 is defined as:

Micro-F1 = 2 · P · R / (P + R),

where:

P = Σ_{t∈S} TP_t / Σ_{t∈S} (TP_t + FP_t),    R = Σ_{t∈S} TP_t / Σ_{t∈S} (TP_t + FN_t).

Macro-F1. The Macro-F1 calculates the average F1 of all labels. Unlike Micro-F1, which gives the same weight to every example, Macro-F1 gives the same weight to every label in the averaging process. Formally, Macro-F1 is defined as:

Macro-F1 = (1 / |S|) · Σ_{t∈S} 2 · P_t · R_t / (P_t + R_t),

where:

P_t = TP_t / (TP_t + FP_t),    R_t = TP_t / (TP_t + FN_t).

In addition to the above evaluation metrics, there are some rank-based evaluation metrics for extreme multi-label classification tasks, including P@K and NDCG@K.

Precision at Top K (P@K). P@k is the precision over the top k predicted labels. For each text with a set of ground-truth labels, the predicted labels are sorted in decreasing order of probability, and the precision at k is

P@k = (1 / k) · Σ_{j=1}^{k} rel(j),

where rel(j) = 1 if the label ranked j-th is in the ground-truth label set of the text and 0 otherwise.

Normalized Discounted Cumulated Gains (NDCG@K). The NDCG at k is

NDCG@k = DCG@k / IDCG@k,    with DCG@k = Σ_{j=1}^{k} rel(j) / log_2(j + 1),

where IDCG@k is the DCG@k of the ideal ranking, which places all min(L, k) relevant labels at the top (L being the number of ground-truth labels of the text).
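A short sketch of these rank-based metrics for one text is given below; `scores` are assumed predicted label scores and `relevant` is the ground-truth label set, both placeholders.

```python
# P@k and NDCG@k for one text, given predicted label scores and true labels.
import math

def precision_at_k(scores, relevant, k):
    top_k = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
    return sum(1 for i in top_k if i in relevant) / k

def ndcg_at_k(scores, relevant, k):
    top_k = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
    dcg = sum(1.0 / math.log2(j + 2) for j, i in enumerate(top_k) if i in relevant)
    ideal = sum(1.0 / math.log2(j + 2) for j in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0

scores, relevant = [0.9, 0.1, 0.8, 0.3], {0, 3}
print(precision_at_k(scores, relevant, k=2))  # label 0 is in the top 2 -> 0.5
print(ndcg_at_k(scores, relevant, k=2))       # 1 / (1 + 1/log2(3)) ≈ 0.613
```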

4. Quantitative Results

In this section, we tabulate the performance of the main models on classic datasets evaluated by classification accuracy, as shown in Table 5, including MR, SST-2, IMDB, Yelp.P, Yelp.F, Amz.F, 20NG, AG, DBpedia, and SNLI. BERT-based models achieve better results on most datasets, which suggests trying a BERT-based model first when implementing a text classification task (a minimal fine-tuning sketch follows Table 5). The exceptions are MR and 20NG, on which BERT-based models have not been evaluated: RNN-Capsule (Wang et al., 2018b) obtains the best result on MR, and BLSTM-2DCNN (Zhou et al., 2016a) obtains the best result on 20NG.

Sentiment News Topic NLI
Model MR SST-2 IMDB Yelp.P Yelp.F Amz.F 20NG AG DBpedia SNLI
RAE (Socher et al., 2011) 77.7 82.4 - - - - - - -
MV-RNN (Socher et al., 2012) 79 82.9 - - - - - - - -
RNTN (Socher et al., 2013) 75.9 85.4 - - - - - - - -
DCNN (Kalchbrenner et al., 2014) 86.8 89.4 - - - - - - -
Paragraph-Vec (Le and Mikolov, 2014) 87.8 92.58 - - - - - - -
TextCNN(Kim, 2014) 81.5 88.1 - - - - - - - -
TextRCNN (Lai et al., 2015) - - - - - - 96.49 - - -
DAN (Iyyer et al., 2015) - 86.3 89.4 - - - - - - -
Tree-LSTM (Tai et al., 2015) 88 - - - - - - - -
CharCNN (Zhang et al., 2015) - - - 95.12 62.05 - - 90.49 98.45 -
HAN (Yang et al., 2016) - - 49.4 - - 63.6 - - - -
SeqTextRCNN (Lee and Dernoncourt, 2016) - - - - - - - - - -
oh-2LSTMp (Johnson and Zhang, 2016) - - 94.1 97.1 67.61 - 86.68 93.43 99.16 -
LSTMN (Cheng et al., 2016) - 87.3 - - - - - - - -
Multi-Task (Liu et al., 2016a) - 87.9 91.3 - - - - - - -
BLSTM-2DCNN (Zhou et al., 2016a) 82.3 89.5 - - - - 96.5 - - -
TopicRNN (Dieng et al., 2017) - - 93.72 - - - - - - -
DPCNN (Johnson and Zhang, 2017) - - - 97.36 69.42 65.19 - 93.13 99.12 -
KPCNN (Wang et al., 2017a) 83.25 - - - - - - 88.36 - -
RAM (Chen et al., 2017) - - - - - - - - - -
RNN-Capsule (Wang et al., 2018b) 83.8 - - - - - - - -
ULMFiT (Howard and Ruder, 2018) - - 95.4 97.84 71.02 - - 94.99 99.2 -
LEAM(Wang et al., 2018a) 76.95 - - 95.31 64.09 - 81.91 92.45 99.02 -
TextCapsule (Yang et al., 2018a) 82.3 86.8 - - - - - 92.6 - -
TextGCN (Yao et al., 2019) 76.74 - - - - - 86.34 67.61 - -
BERT-base (Devlin et al., 2019) - 93.5 95.63 98.08 70.58 61.6 - - - 91.0
BERT-large (Devlin et al., 2019) - 94.9 95.79 98.19 71.38 62.2 - - - 91.7
MT-DNN(Liu et al., 2019a) - 95.6 83.2 - - - - - - 91.5
XLNet-Large (Yang et al., 2019) - 96.8 96.21 98.45 72.2 67.74 - - - -
XLNet (Yang et al., 2019) - 97 - - - - - 95.51 99.38 -
RoBERTa (Liu et al., 2019b) - 96.4 - - - - - - - 92.6
Table 5. Accuracy of deep learning-based text classification models on primary datasets evaluated by classification accuracy (in terms of publication year). Bold is the most accurate.
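As referenced above, the following is a minimal Hugging Face Transformers sketch of fine-tuning a BERT-base classifier, the kind of baseline Table 5 suggests trying first; the example texts, label count, and hyperparameters are placeholders rather than settings from any paper in the table.

```python
# One training step of BERT-base fine-tuning for binary text classification.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

texts = ["a gripping, beautifully shot film", "a tedious and predictable plot"]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, max_length=128,
                  return_tensors="pt")
outputs = model(**batch, labels=labels)   # returns loss and logits
outputs.loss.backward()
optimizer.step()
print(outputs.logits.shape)               # (2, num_labels)
```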

5. Future Research Challenges

Text classification, as an efficient information retrieval and mining technology, plays a vital role in managing text data. It uses NLP, data mining, machine learning, and other techniques to automatically classify and discover different text types. Text classification takes multiple types of text as input; the text is represented as a vector by a pre-trained model, the vector is fed into a DNN for training until the termination condition is reached, and finally the performance of the trained model is verified on the downstream task. Existing models have already shown their usefulness in text classification, but there are still many possible improvements to explore.

Although some new text classification models repeatedly push up the accuracy on most classification tasks, this does not indicate whether a model "understands" the text at the semantic level as humans do. Moreover, in the presence of noisy samples, small perturbations may cause the decision confidence to change substantially or even reverse the decision. Therefore, the semantic representation ability and robustness of models still need to be demonstrated in practice. Besides, pre-trained semantic representation models such as word vectors can often improve the performance of downstream NLP tasks, yet existing research on transfer strategies for context-free word vectors is still relatively preliminary. Thus, considering data, models, and performance, text classification mainly faces the following challenges.

5.1. Data

For a text classification task, data is essential to model performance, whether the method is shallow or deep learning. The text data studied mainly includes multi-chapter documents, short texts, cross-lingual texts, multi-label texts, and texts with few samples. Given the characteristics of these data, the existing technical challenges are as follows:

Zero-shot/Few-shot learning. Current deep learning models depend too heavily on large amounts of labeled data; their performance drops significantly in zero-shot or few-shot settings.

The external knowledge.

The more useful information a DNN receives, the better its performance tends to be. Therefore, we believe that adding external knowledge (a knowledge base or knowledge graph) is an efficient way to promote the model's performance. Nevertheless, how and what knowledge to add remains a challenge.

The multi-label text classification task. Multi-label text classification requires full consideration of the semantic relationships among labels, while the embedding and encoding in a model are a form of lossy compression. Therefore, how to reduce the loss of hierarchical semantics and retain rich and complex document semantics during training remains an open problem.

Special domains with many terminologies. Texts in particular fields, such as finance and medicine, contain many specific terms, slang intelligible only to domain experts, abbreviations, and so on, which make existing pre-trained word vectors difficult to apply.

5.2. Models

Most existing structures of shallow and deep learning models have been tried for text classification, including ensemble methods. BERT learns a language representation that can be fine-tuned for many NLP tasks. The primary way to get better results is to increase the data, improve computational power, and design better training procedures. How to trade off data and compute resources against prediction performance is worth studying.

5.3. Performance

Both shallow and deep models can achieve good performance on most text classification tasks, but the robustness of their results to interference needs to be improved. How to interpret deep models is also a technical challenge.

The semantic robustness of the model. In recent years, researchers have designed many models to enhance the accuracy of text classification models. However, when there are some adversarial samples in the datasets, the model’s performance decreases significantly. Consequently, how to improve the robustness of models is a current research hotspot and challenge.

The interpretability of the model. DNNs have unique advantages in feature extraction and semantic mining and have achieved excellent results on text classification tasks. However, deep learning models are black boxes: the training process is difficult to reproduce, and the implicit semantics and outputs are poorly interpretable. This leaves model improvement and optimization without clear guidelines. Furthermore, we cannot accurately explain why a model improves performance.

6. Conclusion

This paper principally introduces the existing models for text classification tasks, from shallow learning to deep learning. First, we introduce the primary shallow and deep learning models with a summary table. Shallow models improve text classification performance mainly by improving the feature extraction scheme and classifier design. In contrast, deep learning models enhance performance by improving the representation learning method, the model structure, and the use of additional data and knowledge. Then, we introduce the datasets with a summary table and the evaluation metrics for single-label and multi-label tasks. Furthermore, we give the quantitative results of the leading models in a summary table under different applications on classic text classification datasets. Finally, we summarize possible future research challenges of text classification.

Acknowledgements.
This work is supported in part by the NSFC (61872022 and 61872294), NSF (III-1526499, III-1763325, III-1909323), CNS-1930941, NSF of Guangdong Province (2017A030313339), and the UK EPSRC (EP/T01461X/1).

References

  • [1] Note: https://data.quora.com/First-Quora-Dataset-Release-QuestionPairs Cited by: §3.1.
  • [2] (2007) 20NG Corpus. Note: http://ana.cachopo.org/datasets-for-single-label-text-categorization Cited by: §3.1, Table 3.
  • [3] (1995) An implementation of AdaBoost. Note: https://github.com/JiangXingRu/Texture-Classification Cited by: Table 1.
  • [4] (1993) An implementation of C4.5. Note: https://github.com/Cater5009/Text-Classify Cited by: Table 1.
  • [5] (1984) An implementation of CART. Note: https://github.com/sayantann11/all-classification-templetes-for-ML Cited by: Table 1.
  • [6] (2015) An implementation of DAN. Note: https://github.com/miyyer/dan Cited by: Table 1.
  • [7] (2016) An implementation of FastText. Note: https://github.com/SeanLee97/short-text-classification Cited by: Table 1.
  • [8] (1967) An implementation of KNN. Note: https://github.com/raimonbosch/knn.classifier Cited by: Table 1.
  • [9] (201) An implementation of LightGBM. Note: https://github.com/creatist/text_classify Cited by: Table 1.
  • [10] (2001) An implementation of RandomForest. Note: https://github.com/hexiaolang/RandomForest-In-text-classification Cited by: Table 1.
  • [11] (1998) An implementation of SVM. Note: https://github.com/Gunjitbedi/Text-Classification Cited by: Table 1.
  • [12] (2018) An implementation of ULMFiT. Note: http://nlp.fast.ai/category/classification.html Cited by: Table 1.
  • [13] (2016) An implementation of XGBoost. Note: https://xgboost.readthedocs.io/en/latest/ Cited by: Table 1.
  • [14] (2018) A Keras implementation of DeepMoji. Note: https://github.com/bfelbo/DeepMoji Cited by: Table 1.
  • [15] (2014) A Keras implementation of TextCNN. Note: https://github.com/alexander-rakhlin/CNN-for-Sentence-Classification-in-Keras Cited by: Table 1.
  • [16] (2014) A Keras implementation of TextCNN. Note: https://github.com/richliao/textClassifier Cited by: Table 1.
  • [17] (2011) A MATLAB implementation of RAE. Note: https://github.com/vin00/Semi-Supervised-Recursive-Autoencoders-for-Predicting-Sentiment-Distributions Cited by: Table 1.
  • [18] (2013) A MATLAB implementation of RNTN. Note: https://github.com/pondruska/DeepSentiment Cited by: Table 1.
  • [19] (2019) A PyTorch implementation of ABSA-PyTorch. Note: https://github.com/songyouwei/ABSA-PyTorch Cited by: Table 1.
  • [20] (2017) A PyTorch implementation of DPCNN. Note: https://github.com/Cheneng/DPCNN Cited by: Table 1.
  • [21] (2018) A PyTorch implementation of ELMo. Note: https://github.com/flairNLP/flair Cited by: Table 1.
  • [22] (2019) A PyTorch implementation of MT-DNN. Note: https://github.com/namisan/mt-dnn Cited by: Table 1.
  • [23] (2016) A PyTorch implementation of Multi-Task. Note: https://github.com/baixl/text_classification Cited by: Table 1.
  • [24] (2014) A PyTorch implementation of Paragraph Vectors (doc2vec). Note: https://github.com/inejc/paragraph-vectors Cited by: Table 1.
  • [25] (2019) A PyTorch implementation of RoBERTa. Note: https://github.com/pytorch/fairseq Cited by: Table 1.
  • [26] (2018) A PyTorch implementation of SGM. Note: https://github.com/lancopku/SGM Cited by: Table 1, §3.1, Table 3.
  • [27] (2019) A PyTorch implementation of SGNN. Note: https://github.com/eecrazy/ConstructingNEEG_IJCAI_2018 Cited by: Table 1.
  • [28] (2017) A PyTorch implementation of TopicRNN. Note: https://github.com/dangitstam/topic-rnn Cited by: Table 1.
  • [29] (2015) A PyTorch implementation of Tree-LSTM. Note: https://github.com/stanfordnlp/treelstm Cited by: Table 1.
  • [30] (2020) A Tensorflow implementation of ALBERT. Note: https://github.com/google-research/ALBERT Cited by: Table 1.
  • [31] (2019) A Tensorflow implementation of BERT. Note: https://github.com/google-research/bert Cited by: Table 1.
  • [32] (2015) A Tensorflow implementation of CharCNN. Note: https://github.com/mhjabreel/CharCNN Cited by: Table 1.
  • [33] (2014) A Tensorflow implementation of DCNN. Note: https://github.com/kinimod23/ATS_Project Cited by: Table 1.
  • [34] (2018) A Tensorflow implementation of DeepGraphCNNforTexts. Note: https://github.com/HKUST-KnowComp/DeepGraphCNNforTexts Cited by: Table 1.
  • [35] (2018) A Tensorflow implementation of LEAM. Note: https://github.com/guoyinwang/LEAM Cited by: Table 1.
  • [36] (2016) A Tensorflow implementation of LSTMN. Note: https://github.com/JRC1995/Abstractive-Summarization Cited by: Table 1.
  • [37] (2012) A Tensorflow implementation of MV_RNN. Note: https://github.com/github-pengge/MV_RNN Cited by: Table 1.
  • [38] (2020) A Tensorflow implementation of SpanBERT. Note: https://github.com/facebookresearch/SpanBERT Cited by: Table 1.
  • [39] (2018) A Tensorflow implementation of TextCapsule. Note: https://github.com/andyweizhao/capsule_text_classification Cited by: Table 1.
  • [40] (2019) A Tensorflow implementation of TextGCN. Note: https://github.com/yao8839836/text_gcn Cited by: Table 1.
  • [41] (2015) A Tensorflow implementation of TextRCNN. Note: https://github.com/roomylee/rcnn-text-classification Cited by: Table 1.
  • [42] (2017) A Tensorflow implementation of Virtual adversarial training. Note: https://github.com/tensorflow/models/tree/master/adversarial_text Cited by: Table 1.
  • [43] (2019) A Tensorflow implementation of XLNet. Note: https://github.com/zihangdai/xlnet Cited by: Table 1.
  • A. P. Dempster, N. M. Laird, and D. B. Rubin (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Cited by: §2.1.1.
  • B. Adams and G. McKenzie (2018) Crowdsourcing the character of a place: character-level convolutional networks for multilingual geographic text classification. Trans. GIS 22 (2), pp. 394–408. External Links: Link, Document Cited by: §2.2.4, Table 2.
  • [46] (2004) AG Corpus. Note: http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html Cited by: §3.1, Table 3.
  • [47] (2015) Amazon review Corpus. Note: https://www.kaggle.com/datafiniti/consumer-reviews-of-amazon-products Cited by: §3.1, Table 3.
  • [48] (2016) Amazon670K Corpus. Note: http://manikvarma.org/downloads/XC/XMLRepository.html Cited by: Table 3.
  • J. Ang, Y. Liu, and E. Shriberg (2005) Automatic dialog act segmentation and classification in multiparty meetings. See DBLP:conf/icassp/2005, pp. 1061–1064. External Links: Link, Document Cited by: §3.1, Table 3.
  • D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. See DBLP:conf/iclr/2015, External Links: Link Cited by: §2.2.5.
  • Y. Bao, M. Wu, S. Chang, and R. Barzilay (2020) Few-shot text classification with distributional signatures. See DBLP:conf/iclr/2020, External Links: Link Cited by: Table 2, Table 3.
  • J. Bastings, I. Titov, W. Aziz, D. Marcheggiani, and K. Sima’an (2017) Graph convolutional encoders for syntax-aware neural machine translation. See DBLP:conf/emnlp/2017, pp. 1957–1967. External Links: Link, Document Cited by: §2.2.7.
  • A. Bouaziz, C. Dartigues-Pallez, C. da Costa Pereira, F. Precioso, and P. Lloret (2014) Short text classification using semantic random forest. See DBLP:conf/dawak/2014, pp. 288–299. External Links: Link, Document Cited by: §2.1.5.
  • S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning (2015) A large annotated corpus for learning natural language inference. See DBLP:conf/emnlp/2015, pp. 632–642. External Links: Link, Document Cited by: §3.1.
  • L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone (1984) Classification and regression trees. Wadsworth. External Links: ISBN 0-534-98053-8 Cited by: Table 1.
  • L. Breiman (2001) Random forests. Mach. Learn. 45 (1), pp. 5–32. External Links: Link, Document Cited by: §2.1.5, Table 1, §2.
  • J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah (1993) Signature verification using a siamese time delay neural network. See DBLP:conf/nips/1993, pp. 737–744. External Links: Link Cited by: §2.2.8.
  • I. Chalkidis, M. Fergadiotis, P. Malakasiotis, and I. Androutsopoulos (2019) Large-scale multi-label text classification on EU legislation. See DBLP:conf/acl/2019-1, pp. 6314–6322. External Links: Link, Document Cited by: Table 2, Table 3.
  • J. Chen, Y. Hu, J. Liu, Y. Xiao, and H. Jiang (2019) Deep short text classification with knowledge powered attention. See DBLP:conf/aaai/2019, pp. 6252–6259. External Links: Link, Document Cited by: Table 2.
  • P. Chen, Z. Sun, L. Bing, and W. Yang (2017) Recurrent attention network on memory for aspect sentiment analysis. See DBLP:conf/emnlp/2017, pp. 452–461. External Links: Link, Document Cited by: Table 1, Table 5.
  • T. Chen and C. Guestrin (2016) XGBoost: A scalable tree boosting system. See DBLP:conf/kdd/2016, pp. 785–794. External Links: Link, Document Cited by: §2.1.5, Table 1, §2.
  • Z. Chen and T. Qian (2019) Transfer capsule network for aspect level sentiment classification. See DBLP:conf/acl/2019-1, pp. 547–556. External Links: Link, Document Cited by: §2.2.4.
  • J. Cheng, L. Dong, and M. Lapata (2016) Long short-term memory-networks for machine reading. See DBLP:conf/emnlp/2016, pp. 551–561. External Links: Link, Document Cited by: §2.2.5, Table 1, Table 2, Table 3, Table 5.
  • C. Cortes and V. Vapnik (1995) Support-vector networks. Mach. Learn. 20 (3), pp. 273–297. External Links: Link, Document Cited by: §2.1.3.
  • T. M. Cover and P. E. Hart (1967) Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 13 (1), pp. 21–27. External Links: Link, Document Cited by: §1, §2.1.2, Table 1.
  • T. M. Cover and J. A. Thomas (2006) Elements of information theory (wiley series in telecommunications and signal processing). Wiley-Interscience, USA. External Links: ISBN 0471241954 Cited by: §2.1.1.
  • W. Dai, G. Xue, Q. Yang, and Y. Yu (2007) Transferring naive bayes classifiers for text classification. See DBLP:conf/aaai/2007, pp. 540–545. External Links: Link Cited by: §2.1.1.
  • Z. Dai, Z. Yang, Y. Yang, J. G. Carbonell, Q. V. Le, and R. Salakhutdinov (2019) Transformer-xl: attentive language models beyond a fixed-length context. See DBLP:conf/acl/2019-1, pp. 2978–2988. External Links: Link, Document Cited by: §2.2.6.
  • M. Defferrard, X. Bresson, and P. Vandergheynst (2016) Convolutional neural networks on graphs with fast localized spectral filtering. See DBLP:conf/nips/2016, pp. 3837–3845. External Links: Link Cited by: §2.2.7.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. See DBLP:conf/naacl/2019-1, pp. 4171–4186. External Links: Link Cited by: Figure 10, §2.2.6, §2.2, Table 1, Table 2, §2, Table 3, Table 5.
  • Q. Diao, M. Qiu, C. Wu, A. J. Smola, J. Jiang, and C. Wang (2014) Jointly modeling aspects, ratings and sentiments for movie recommendation (JMARS). See DBLP:conf/kdd/2014, pp. 193–202. External Links: Link, Document Cited by: §3.1, Table 3.
  • A. B. Dieng, C. Wang, J. Gao, and J. W. Paisley (2017) TopicRNN: A recurrent neural network with long-range semantic dependency. See DBLP:conf/iclr/2017, External Links: Link Cited by: §2.2.3, Table 1, Table 2, Table 5.
  • X. Ding, K. Liao, T. Liu, Z. Li, and J. Duan (2019) Event representation learning enhanced with external commonsense knowledge. See DBLP:conf/emnlp/2019-1, pp. 4893–4902. External Links: Link, Document Cited by: §2.2.8.
  • B. Dolan, C. Quirk, and C. Brockett (2004) Unsupervised construction of large paraphrase corpora: exploiting massively parallel news sources. See DBLP:conf/coling/2004, External Links: Link Cited by: §3.1.
  • C. N. dos Santos, M. Tan, B. Xiang, and B. Zhou (2016) Attentive pooling networks. CoRR abs/1602.03609. External Links: Link, 1602.03609 Cited by: §2.2.5, Table 2, Table 3.
  • [76] (2011) EP Corpus. Note: http://www.experienceproject.com/confessions.php Cited by: Table 3.
  • [77] (2019) EUR-Lex Corpus. Note: http://www.ke.tu-darmstadt.de/resources/eurlex/eurlex.html Cited by: Table 3.
  • F. Fan, Y. Feng, and D. Zhao (2018) Multi-grained attention network for aspect-level sentiment classification. See DBLP:conf/emnlp/2018, pp. 3433–3442. External Links: Link, Document Cited by: §2.2.5, Table 1.
  • B. Felbo, A. Mislove, A. Søgaard, I. Rahwan, and S. Lehmann (2017) Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. See DBLP:conf/emnlp/2017, pp. 1615–1625. External Links: Link Cited by: Table 1, Table 2, Table 3.
  • P. Frasconi, G. Soda, and A. Vullo (2002) Hidden markov models for text categorization in multi-page documents. J. Intell. Inf. Syst. 18 (2-3), pp. 195–217. External Links: Link, Document Cited by: §2.1.1.
  • Y. Freund and R. E. Schapire (1995) A decision-theoretic generalization of on-line learning and an application to boosting. See DBLP:conf/eurocolt/1995, pp. 23–37. External Links: Link, Document Cited by: §2.1.5, Table 1.
  • [82] (2015) Fudan Corpus. Note: www.datatang.com/data/44139and43543 Cited by: Table 3.
  • E. Grave, T. Mikolov, A. Joulin, and P. Bojanowski (2017) Bag of tricks for efficient text classification. See DBLP:conf/eacl/2017-2, pp. 427–431. External Links: Link Cited by: Table 1.
  • I. Gurevych and Y. Miyao (Eds.) (2018) Proceedings of the 56th annual meeting of the association for computational linguistics, ACL 2018, melbourne, australia, july 15-20, 2018, volume 1: long papers. Association for Computational Linguistics. External Links: Link, ISBN 978-1-948087-32-2 Cited by: G. Wang, C. Li, W. Wang, Y. Zhang, D. Shen, X. Zhang, R. Henao, and L. Carin (2018a), J. Howard and S. Ruder (2018).
  • Y. Hao, Y. Zhang, K. Liu, S. He, Z. Liu, H. Wu, and J. Zhao (2017) An end-to-end model for question answering over knowledge base with cross-attention combining global knowledge. See DBLP:conf/acl/2017-1, pp. 221–231. External Links: Link, Document Cited by: Table 2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Identity mappings in deep residual networks. See DBLP:conf/eccv/2016-4, pp. 630–645. External Links: Link, Document Cited by: §2.2.4.
  • M. Henaff, J. Bruna, and Y. LeCun (2015) Deep convolutional networks on graph-structured data. CoRR abs/1506.05163. External Links: Link, 1506.05163 Cited by: §2.2.7.
  • G. E. Hinton, A. Krizhevsky, and S. D. Wang (2011) Transforming auto-encoders. In Artificial Neural Networks and Machine Learning – ICANN 2011, T. Honkela, W. Duch, M. Girolami, and S. Kaski (Eds.), Berlin, Heidelberg, pp. 44–51. External Links: ISBN 978-3-642-21735-7 Cited by: §2.2.3.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780. Cited by: §2.2.3.
  • J. Howard and S. Ruder (2018) Universal language model fine-tuning for text classification. See Proceedings of the 56th annual meeting of the association for computational linguistics, ACL 2018, melbourne, australia, july 15-20, 2018, volume 1: long papers, Gurevych and Miyao, pp. 328–339. External Links: Link, Document Cited by: Table 1, Table 5.
  • L. Hu, T. Yang, C. Shi, H. Ji, and X. Li (2019) Heterogeneous graph attention networks for semi-supervised short text classification. See DBLP:conf/emnlp/2019-1, pp. 4820–4829. External Links: Link, Document Cited by: §2.2.7.
  • M. Hu and B. Liu (2004) Mining and summarizing customer reviews. See DBLP:conf/kdd/2004, pp. 168–177. External Links: Link, Document Cited by: §3.1, Table 3.
  • Z. Hu, X. Li, C. Tu, Z. Liu, and M. Sun (2018) Few-shot charge prediction with discriminative legal attributes. See DBLP:conf/coling/2018, pp. 487–498. External Links: Link Cited by: §2.2.5.
  • L. Huang, D. Ma, S. Li, X. Zhang, and H. Wang (2019) Text level graph neural network for text classification. See DBLP:conf/emnlp/2019-1, pp. 3442–3448. External Links: Link, Document Cited by: §2.2.7, Table 2, Table 3.
  • O. Irsoy and C. Cardie (2014) Deep recursive neural networks for compositionality in language. See DBLP:conf/nips/2014, pp. 2096–2104. External Links: Link Cited by: §2.2.1, Table 2.
  • Md. Z. Islam, J. Liu, J. Li, L. Liu, and W. Kang (2019) A semantics aware random forest for text classification. See DBLP:conf/cikm/2019, pp. 1061–1070. External Links: Link, Document Cited by: §2.1.5.
  • M. Iyyer, V. Manjunatha, J. L. Boyd-Graber, and H. D. III (2015) Deep unordered composition rivals syntactic methods for text classification. See DBLP:conf/acl/2015-1, pp. 1681–1691. External Links: Link, Document Cited by: Table 1, Table 2, Table 3, Table 5.
  • G. Jawahar, B. Sagot, and D. Seddah (2019) What does BERT learn about the structure of language?. See DBLP:conf/acl/2019-1, pp. 3651–3657. External Links: Link, Document Cited by: §2.2.6.
  • Jayadeva, H. Pant, M. Sharma, and S. Soman (2019) Twin neural networks for the classification of large unbalanced datasets. Neurocomputing 343, pp. 34 – 49. Note: Learning in the Presence of Class Imbalance and Concept Drift External Links: ISSN 0925-2312, Document, Link Cited by: §2.2.8.
  • T. Joachims (1998) Text categorization with support vector machines: learning with many relevant features. See DBLP:conf/ecml/1998, pp. 137–142. External Links: Link, Document Cited by: §1, §2.1.3, Table 1, §2.
  • T. Joachims (2001) A statistical learning model of text classification for support vector machines. See DBLP:conf/sigir/2001, pp. 128–136. External Links: Link, Document Cited by: §2.1.3.
  • T. Joachims (1999) Transductive inference for text classification using support vector machines. In International Conference on Machine Learning, Cited by: §2.1.3.
  • D. E. Johnson, F. J. Oles, T. Zhang, and T. Götz (2002) A decision-tree-based symbolic rule induction system for text categorization. IBM Syst. J. 41 (3), pp. 428–437. External Links: Link, Document Cited by: §2.1.4.
  • R. Johnson and T. Zhang (2015) Semi-supervised convolutional neural networks for text categorization via region embedding. See DBLP:conf/nips/2015, pp. 919–927. External Links: Link Cited by: §2.2.4.
  • R. Johnson and T. Zhang (2016) Supervised and semi-supervised text categorization using LSTM for region embeddings. See DBLP:conf/icml/2016, pp. 526–534. External Links: Link Cited by: Table 2, Table 3, Table 5.
  • R. Johnson and T. Zhang (2017) Deep pyramid convolutional neural networks for text categorization. See DBLP:conf/acl/2017-1, pp. 562–570. External Links: Link, Document Cited by: §2.2.4, Table 1, Table 2, Table 3, Table 5.
  • M. Joshi, D. Chen, Y. Liu, D. S. Weld, L. Zettlemoyer, and O. Levy (2020) SpanBERT: improving pre-training by representing and predicting spans. Trans. Assoc. Comput. Linguistics 8, pp. 64–77. External Links: Link Cited by: Table 1.
  • D. Jurafsky and E. Shriberg (1997) Switchboard SWBD-DAMSL shallow-discourse-function annotation coders manual. Cited by: §3.1, Table 3.
  • M. k. Alsmadi, K. B. Omar, S. A. Noah, and I. Almarashdah (2009) Performance comparison of multi-layer perceptron (back propagation, delta rule and perceptron) algorithms in neural networks. In 2009 IEEE International Advance Computing Conference, Vol. , pp. 296–299. Cited by: §2.2.2.
  • N. Kalchbrenner, E. Grefenstette, and P. Blunsom (2014) A convolutional neural network for modelling sentences. See DBLP:conf/acl/2014-1, pp. 655–665. External Links: Link, Document Cited by: §1, Table 1, Table 2, Table 3, Table 5.
  • G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T. Liu (2017) LightGBM: A highly efficient gradient boosting decision tree. See DBLP:conf/nips/2017, pp. 3146–3154. External Links: Link Cited by: Table 1, §2.
  • S. Kim, L. F. D’Haro, R. E. Banchs, J. D. Williams, and M. Henderson (2016) The fourth dialog state tracking challenge. See DBLP:conf/iwsds/2016, pp. 435–449. External Links: Link, Document Cited by: §3.1, Table 3.
  • Y. Kim (2014) Convolutional neural networks for sentence classification. See DBLP:conf/emnlp/2014, pp. 1746–1751. External Links: Link, Document Cited by: §2.2.4, Table 1, Table 2, §2, Table 3, Table 5.
  • K. Kowsari, D. E. Brown, M. Heidarysafa, K. J. Meimandi, M. S. Gerber, and L. E. Barnes (2017) HDLTex: hierarchical deep learning for text classification. See DBLP:conf/icmla/2017, pp. 364–371. External Links: Link, Document Cited by: §3.1.
  • K. Kowsari, K. J. Meimandi, M. Heidarysafa, S. Mendu, L. E. Barnes, and D. E. Brown (2019) Text classification algorithms: A survey. Information 10 (4), pp. 150. External Links: Link, Document Cited by: §1.1.
  • S. Lai, L. Xu, K. Liu, and J. Zhao (2015) Recurrent convolutional neural networks for text classification. See DBLP:conf/aaai/2015, pp. 2267–2273. External Links: Link Cited by: Table 1, Table 2, Table 3, Table 5.
  • Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut (2020) ALBERT: A lite BERT for self-supervised learning of language representations. See DBLP:conf/iclr/2020, External Links: Link Cited by: §2.2.6, Table 1, Table 2, Table 3.
  • Q. V. Le and T. Mikolov (2014) Distributed representations of sentences and documents. See DBLP:conf/icml/2014, pp. 1188–1196. External Links: Link Cited by: §2.2.2, Table 1, Table 2, Table 3, Table 5.
  • J. Y. Lee and F. Dernoncourt (2016) Sequential short-text classification with recurrent and convolutional neural networks. See DBLP:conf/naacl/2016, pp. 515–520. External Links: Link, Document Cited by: §1, Table 2, §3.2.1, Table 3, Table 5.
  • J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. N. Mendes, S. Hellmann, M. Morsey, P. van Kleef, S. Auer, and C. Bizer (2015) DBpedia - A large-scale, multilingual knowledge base extracted from wikipedia. Semantic Web 6 (2), pp. 167–195. External Links: Link, Document Cited by: §3.1, Table 3.
  • D. D. Lewis, Y. Yang, T. G. Rose, and F. Li (2004) RCV1: A new benchmark collection for text categorization research. J. Mach. Learn. Res. 5, pp. 361–397. External Links: Link Cited by: §3.1, Table 3.
  • Y. Li, R. Jin, and Y. Luo (2019) Classifying relations in clinical narratives using segment graph convolutional and recurrent neural networks (seg-gcrns). JAMIA 26 (3), pp. 262–268. External Links: Link, Document Cited by: §2.2.7.
  • Z. Li, X. Ding, and T. Liu (2018) Constructing narrative event evolutionary graph for script event prediction. See DBLP:conf/ijcai/2018, pp. 4201–4207. External Links: Link, Document Cited by: §2.2.7, Table 1.
  • Z. Lin, M. Feng, C. N. dos Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio (2017) A structured self-attentive sentence embedding. See DBLP:conf/iclr/2017, External Links: Link Cited by: §2.2.5, Table 2.
  • J. Liu, W. Chang, Y. Wu, and Y. Yang (2017) Deep learning for extreme multi-label text classification. See DBLP:conf/sigir/2017, pp. 115–124. External Links: Link, Document Cited by: Table 2, Table 3.
  • P. Liu, X. Qiu, X. Chen, S. Wu, and X. Huang (2015) Multi-timescale long short-term memory neural network for modelling sentences and documents. See DBLP:conf/emnlp/2015, pp. 2326–2335. External Links: Link, Document Cited by: §1, Table 2, Table 3.
  • P. Liu, X. Qiu, and X. Huang (2016a) Recurrent neural network for text classification with multi-task learning. See DBLP:conf/ijcai/2016, pp. 2873–2879. External Links: Link Cited by: §2.2.3, Table 1, Table 2, Table 3, Table 5.
  • X. Liu, P. He, W. Chen, and J. Gao (2019a) Multi-task deep neural networks for natural language understanding. See DBLP:conf/acl/2019-1, pp. 4487–4496. External Links: Link, Document Cited by: Table 1, Table 5.
  • Y. Liu, C. Sun, L. Lin, and X. Wang (2016b) Learning natural language inference using bidirectional LSTM model and inner-attention. CoRR abs/1605.09090. External Links: Link, 1605.09090 Cited by: Table 2.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019b) RoBERTa: A robustly optimized BERT pretraining approach. CoRR abs/1907.11692. External Links: Link, 1907.11692 Cited by: §2.2.6, Table 1, Table 2, Table 3, Table 5.
  • D. Ma, S. Li, X. Zhang, and H. Wang (2017) Interactive attention networks for aspect-level sentiment classification. See DBLP:conf/ijcai/2017, pp. 4068–4074. External Links: Link, Document Cited by: Table 1.
  • A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts (2011) Learning word vectors for sentiment analysis. See DBLP:conf/acl/2011, pp. 142–150. External Links: Link Cited by: §1.
  • C. D. Manning, P. Raghavan, and H. Schütze (2008) Introduction to information retrieval. Cambridge University Press. External Links: Link, Document, ISBN 978-0-521-86571-5 Cited by: §3.2.2.
  • D. Marcheggiani and I. Titov (2017) Encoding sentences with graph convolutional networks for semantic role labeling. See DBLP:conf/emnlp/2017, pp. 1506–1515. External Links: Link, Document Cited by: §2.2.7.
  • M. Marelli, L. Bentivogli, M. Baroni, R. Bernardi, S. Menini, and R. Zamparelli (2014) SemEval-2014 task 1: evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. See DBLP:conf/semeval/2014, pp. 1–8. External Links: Link, Document Cited by: §3.1.
  • M. E. Maron (1961) Automatic indexing: an experimental inquiry. J. ACM 8 (3), pp. 404–417. External Links: Link, Document Cited by: §1, §2.1.1, Table 1, §2.
  • T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013) Efficient estimation of word representations in vector space. See DBLP:conf/iclr/2013w, External Links: Link Cited by: §2.1, §2.2.2.
  • S. Minaee, N. Kalchbrenner, E. Cambria, N. Nikzad, M. Chenaghlu, and J. Gao (2020) Deep learning based text classification: A comprehensive review. CoRR abs/2004.03705. External Links: Link, 2004.03705 Cited by: §1.1.
  • T. M. Mitchell (1997) Machine learning. McGraw Hill series in computer science, McGraw-Hill. External Links: Link, ISBN 978-0-07-042807-2 Cited by: §2.1.4.
  • T. Miyato, A. M. Dai, and I. J. Goodfellow (2017a) Adversarial training methods for semi-supervised text classification. See DBLP:conf/iclr/2017, External Links: Link Cited by: §2.2.3, §2.2.8, Table 1, Table 2, Table 3.
  • T. Miyato, S. Maeda, M. Koyama, and S. Ishii (2017b) Virtual adversarial training: a regularization method for supervised and semi-supervised learning. CoRR abs/1704.03976. External Links: Link, 1704.03976 Cited by: §2.2.3.
  • T. Miyato, S. Maeda, M. Koyama, K. Nakae, and S. Ishii (2015) Distributional smoothing with virtual adversarial training. External Links: 1507.00677 Cited by: §2.2.8.
  • [143] (2005) MPQA Corpus. Note: http://www.cs.pitt.edu/mpqa/ Cited by: §3.1, Table 3.
  • [144] (2002) MR Corpus. Note: http://www.cs.cornell.edu/people/pabo/movie-review-data/ Cited by: §3.1, Table 3.
  • J. Mueller and A. Thyagarajan (2016) Siamese recurrent architectures for learning sentence similarity. See DBLP:conf/aaai/2016, pp. 2786–2792. External Links: Link Cited by: §2.2.8.
  • P. Nakov, A. Ritter, S. Rosenthal, F. Sebastiani, and V. Stoyanov (2016) SemEval-2016 task 4: sentiment analysis in twitter. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), Cited by: Table 3.
  • H. Nguyen and M. Nguyen (2017) A deep neural architecture for sentence-level sentiment classification in twitter social networking. See DBLP:conf/pacling/2017, pp. 15–27. External Links: Link, Document Cited by: §2.2.4.
  • T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, and L. Deng (2016) MS MARCO: A human generated machine reading comprehension dataset. See DBLP:conf/nips/2016coco, External Links: Link Cited by: §3.1.
  • [149] (2013) NLP&CC Corpus. Note: http://tcci.ccf.org.cn/conference/2013/index.html Cited by: Table 2, Table 3.
  • [150] (2007) NYTimes Corpus. Note: https://catalog.ldc.upenn.edu/docs/LDC2008T19/new_york_times_annotated_corpus.pdf Cited by: Table 3.
  • M. O’Donnell (2009) Cataloging and classification: an introduction. Technical Services Quarterly 26 (1), pp. 86–87. Cited by: §2.1.1.
  • [152] (2015) Ohsumed Corpus. Note: http://davis.wpi.edu/xmdv/datasets/ohsumed.html Cited by: §3.1, Table 3.
  • A. Pal, M. Selvakumar, and M. Sankarasubbu (2020) MAGNET: multi-label text classification using attention-based graph neural network. See DBLP:conf/icaart/2020-2, pp. 494–505. External Links: Link, Document Cited by: §2.2.7, Table 2, Table 3.
  • B. Pang and L. Lee (2004) A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), Barcelona, Spain, pp. 271–278. External Links: Link, Document Cited by: §3.1, Table 3.
  • B. Pang and L. Lee (2005) Seeing stars: exploiting class relationships for sentiment categorization with respect to rating scales. See DBLP:conf/acl/2005, pp. 115–124. External Links: Link Cited by: §3.1.
  • H. Peng, J. Li, Y. He, Y. Liu, M. Bao, L. Wang, Y. Song, and Q. Yang (2018) Large-scale hierarchical text classification with recursively regularized deep Graph-CNN. See DBLP:conf/www/2018, pp. 1063–1072. External Links: Link, Document Cited by: §2.2.7, Table 1, Table 2, Table 3.
  • H. Peng, J. Li, S. Wang, L. Wang, Q. Gong, R. Yang, B. Li, P. Yu, and L. He (2019) Hierarchical taxonomy-aware and attentional graph capsule RCNNs for large-scale multi-label text classification. IEEE Transactions on Knowledge and Data Engineering. Cited by: Table 2, Table 3.
  • J. Pennington, R. Socher, and C. D. Manning (2014) GloVe: global vectors for word representation. See DBLP:conf/emnlp/2014, pp. 1532–1543. External Links: Link, Document Cited by: §2.1.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. See DBLP:conf/naacl/2018-1, pp. 2227–2237. External Links: Link Cited by: §2.2.6, Table 1, Table 2, Table 3.
  • J. R. Quinlan (1993) C4.5: programs for machine learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA. External Links: ISBN 1558602380 Cited by: §2.1.4, Table 1.
  • J. R. Quinlan (1986) Induction of decision trees. Machine Learning 1 (1), pp. 81–106. Cited by: §2.1.4.
  • A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. OpenAI Technical Report. Cited by: §2.2.6.
  • P. Rajpurkar, R. Jia, and P. Liang (2018) Know what you don’t know: unanswerable questions for SQuAD. See DBLP:conf/acl/2018-2, pp. 784–789. External Links: Link, Document Cited by: §3.1.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. See DBLP:conf/emnlp/2016, pp. 2383–2392. External Links: Link, Document Cited by: §3.1, Table 3.
  • [165] (2004) RCV1-V2 Corpus. Note: http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm Cited by: Table 3.
  • [166] (2007) Reuters Corpus. Note: https://www.cs.umb.edu/~smimarog/textmining/datasets/ Cited by: §3.1, §3.1, Table 3.
  • [167] (2017) Reuters Corpus. Note: https://martin-thoma.com/nlp-reuters Cited by: §3.1, §3.1, Table 3.
  • R. E. Schapire and Y. Singer (1999) Improved boosting algorithms using confidence-rated predictions. Mach. Learn. 37 (3), pp. 297–336. External Links: Link, Document Cited by: §3.2.1.
  • K. Schneider (2004) A new feature selection score for multinomial naive Bayes text classification based on KL-divergence. See DBLP:conf/acl/2004-p, External Links: Link Cited by: §2.1.1.
  • C. Shen, C. Sun, J. Wang, Y. Kang, S. Li, X. Liu, L. Si, M. Zhang, and G. Zhou (2018a) Sentiment classification towards question-answering with hierarchical matching network. See DBLP:conf/emnlp/2018, pp. 3654–3663. External Links: Link, Document Cited by: §2.2.8.
  • T. Shen, T. Zhou, G. Long, J. Jiang, and C. Zhang (2018b) Bi-directional block self-attention for fast and memory-efficient sequence modeling. See DBLP:conf/iclr/2018, External Links: Link Cited by: §2.2.5, Table 2, Table 3.
  • K. Shimura, J. Li, and F. Fukumoto (2018) HFT-CNN: learning hierarchical category structure for multi-label short text categorization. See DBLP:conf/emnlp/2018, pp. 811–816. External Links: Link, Document Cited by: Table 2, Table 3.
  • R. Socher, B. Huval, C. D. Manning, and A. Y. Ng (2012) Semantic compositionality through recursive matrix-vector spaces. See DBLP:conf/emnlp/2012, pp. 1201–1211. External Links: Link Cited by: §2.2.1, Table 1, Table 2, Table 5.
  • R. Socher, J. Pennington, E. H. Huang, A. Y. Ng, and C. D. Manning (2011) Semi-supervised recursive autoencoders for predicting sentiment distributions. See DBLP:conf/emnlp/2011, pp. 151–161. External Links: Link Cited by: §2.2.1, Table 1, Table 2, Table 3, Table 5.
  • R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts (2013) Recursive deep models for semantic compositionality over a sentiment treebank. See DBLP:conf/emnlp/2013, pp. 1631–1642. External Links: Link Cited by: §2.2.1, Table 1, Table 2, Table 3, Table 5.
  • P. Soucy and G. W. Mineau (2001) A simple KNN algorithm for text categorization. See DBLP:conf/icdm/2001, pp. 647–648. External Links: Link, Document Cited by: §2.1.2.
  • [178] (2013) SST Corpus. Note: http://nlp.stanford.edu/sentiment Cited by: §3.1, Table 3.
  • C. Sun, X. Qiu, Y. Xu, and X. Huang (2019) How to fine-tune BERT for text classification?. See DBLP:conf/cncl/2019, pp. 194–206. External Links: Link, Document Cited by: Table 2, §3.1, Table 3.
  • K. S. Tai, R. Socher, and C. D. Manning (2015) Improved semantic representations from tree-structured long short-term memory networks. See DBLP:conf/acl/2015-1, pp. 1556–1566. External Links: Link, Document Cited by: §1, §2.2.3, Table 1, Table 2, Table 3, Table 5.
  • M. Tan, C. dos Santos, B. Xiang, and B. Zhou (2016) Improved representation learning for question answer matching. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 464–473. External Links: Link, Document Cited by: §2.2.5.
  • S. Tan (2005) Neighbor-weighted k-nearest neighbor for unbalanced text corpus. Expert Syst. Appl. 28 (4), pp. 667–671. External Links: Link, Document Cited by: §2.1.2.
  • D. Tang, B. Qin, and T. Liu (2015) Document modeling with gated recurrent neural network for sentiment classification. See DBLP:conf/emnlp/2015, pp. 1422–1432. External Links: Link, Document Cited by: §3.1, Table 3.
  • [184] (2009) Term frequency by inverse document frequency. See DBLP:reference/db/2009, pp. 3035. External Links: Link, Document Cited by: §2.1.
  • M. Thelwall, K. Buckley, and G. Paltoglou (2012) Sentiment strength detection for the social web. J. Assoc. Inf. Sci. Technol. 63 (1), pp. 163–173. External Links: Link, Document Cited by: Table 3.
  • [186] (2002) TREC Corpus. Note: https://cogcomp.seas.upenn.edu/Data/QA/QC/ Cited by: §3.1, Table 3.
  • [187] (2013) Twitter Corpus. Note: https://www.cs.york.ac.uk/semeval-2013/task2/ Cited by: Table 3.
  • A. van den Bosch (2017) Hidden Markov models. See DBLP:reference/ml/2017, pp. 609–611. External Links: Link, Document Cited by: §2.1.1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. See DBLP:conf/nips/2017, pp. 5998–6008. External Links: Link Cited by: §2.2.5.
  • P. Vateekul and M. Kubat (2009) Fast induction of multiple decision trees in text categorization from large scale, imbalanced, and multi-label data. See DBLP:conf/icdm/2009w, pp. 320–325. External Links: Link, Document Cited by: §2.1.4.
  • P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2018) Graph attention networks. See DBLP:conf/iclr/2018, External Links: Link Cited by: §2.2.7.
  • C. Wang, M. Zhang, S. Ma, and L. Ru (2008) Automatic online news issue construction in web environment. See DBLP:conf/www/2008, pp. 457–466. External Links: Link, Document Cited by: Table 3.
  • F. Wang, Z. Wang, Z. Li, and J. Wen (2014) Concept-based short text classification and ranking. See DBLP:conf/cikm/2014, pp. 1069–1078. External Links: Link, Document Cited by: Table 3.
  • G. Wang, C. Li, W. Wang, Y. Zhang, D. Shen, X. Zhang, R. Henao, and L. Carin (2018a) Joint embedding of words and labels for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018), Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pp. 2321–2331. External Links: Link, Document Cited by: Table 1, Table 5.
  • J. Wang, Z. Wang, D. Zhang, and J. Yan (2017a) Combining knowledge with deep convolutional neural networks for short text classification. See DBLP:conf/ijcai/2017, pp. 2915–2921. External Links: Link, Document Cited by: Table 2, Table 3, Table 5.
  • S. I. Wang and C. D. Manning (2012) Baselines and bigrams: simple, good sentiment and topic classification. See DBLP:conf/acl/2012-2, pp. 90–94. External Links: Link Cited by: §1.
  • Y. Wang, M. Huang, X. Zhu, and L. Zhao (2016) Attention-based LSTM for aspect-level sentiment classification. See DBLP:conf/emnlp/2016, pp. 606–615. External Links: Link, Document Cited by: §2.2.5, §2.2.5.
  • Y. Wang, A. Sun, J. Han, Y. Liu, and X. Zhu (2018b) Sentiment analysis by capsules. See DBLP:conf/www/2018, pp. 1165–1174. External Links: Link, Document Cited by: §2.2.3, Table 2, Table 5, §4.
  • Z. Wang, W. Hamza, and R. Florian (2017b) Bilateral multi-perspective matching for natural language sentences. See DBLP:conf/ijcai/2017, pp. 4144–4150. External Links: Link, Document Cited by: §2.2.3.
  • J. Wiebe, T. Wilson, and C. Cardie (2005) Annotating expressions of opinions and emotions in language. Language Resources and Evaluation 39 (2-3), pp. 165–210. External Links: Link, Document Cited by: §3.1.
  • A. Williams, N. Nangia, and S. R. Bowman (2018) A broad-coverage challenge corpus for sentence understanding through inference. See DBLP:conf/naacl/2018-1, pp. 1112–1122. External Links: Link, Document Cited by: §3.1.
  • F. Wu, A. H. Souza Jr., T. Zhang, C. Fifty, T. Yu, and K. Q. Weinberger (2019) Simplifying graph convolutional networks. See DBLP:conf/icml/2019, pp. 6861–6871. External Links: Link Cited by: Table 2, Table 3.
  • W. Xue, W. Zhou, T. Li, and Q. Wang (2017) MTNA: A neural multi-task model for aspect category classification and aspect term extraction on restaurant reviews. See DBLP:conf/ijcnlp/2017-2, pp. 151–156. External Links: Link Cited by: §2.2.4.
  • M. Yang, W. Zhao, J. Ye, Z. Lei, Z. Zhao, and S. Zhang (2018a) Investigating capsule networks with dynamic routing for text classification. See DBLP:conf/emnlp/2018, pp. 3110–3119. External Links: Link Cited by: Table 1, Table 2, Table 3, Table 5.
  • P. Yang, X. Sun, W. Li, S. Ma, W. Wu, and H. Wang (2018b) SGM: sequence generation model for multi-label classification. See DBLP:conf/coling/2018, pp. 3915–3926. External Links: Link Cited by: Table 1, Table 2, Table 3.
  • Y. Yang, W. Yih, and C. Meek (2015) WikiQA: A challenge dataset for open-domain question answering. See DBLP:conf/emnlp/2015, pp. 2013–2018. External Links: Link, Document Cited by: §3.1, Table 3.
  • Z. Yang, Z. Dai, Y. Yang, J. G. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. See DBLP:conf/nips/2019, pp. 5754–5764. External Links: Link Cited by: §2.2.6, Table 1, Table 2, Table 3, Table 5.
  • Z. Yang, D. Yang, C. Dyer, X. He, A. J. Smola, and E. H. Hovy (2016) Hierarchical attention networks for document classification. See DBLP:conf/naacl/2016, pp. 1480–1489. External Links: Link Cited by: Figure 8, §2.2.5, Table 1, Table 2, Table 3, Table 5.
  • L. Yao, C. Mao, and Y. Luo (2019) Graph convolutional networks for text classification. See DBLP:conf/aaai/2019, pp. 7370–7377. External Links: Link, Document Cited by: §1, §2.2.7, §2.2, Table 1, Table 2, Table 3, Table 5.
  • K. Yi and J. Beheshti (2009) A hidden Markov model-based text classification of medical documents. J. Inf. Sci. 35 (1), pp. 67–81. External Links: Link, Document Cited by: §2.1.1.
  • R. You, Z. Zhang, Z. Wang, S. Dai, H. Mamitsuka, and S. Zhu (2019) AttentionXML: label tree-based attention-aware deep model for high-performance extreme multi-label text classification. See DBLP:conf/nips/2019, pp. 5812–5822. External Links: Link Cited by: Table 2, Table 3.
  • M. Zhang and K. Zhang (2010) Multi-label learning by exploiting label dependency. See DBLP:conf/kdd/2010, pp. 999–1008. External Links: Link, Document Cited by: §2.1.1.
  • T. Zhang, M. Huang, and L. Zhao (2018) Learning structured representation for text classification via reinforcement learning. See DBLP:conf/aaai/2018, pp. 6053–6060. External Links: Link Cited by: §2.2.8.
  • X. Zhang, J. J. Zhao, and Y. LeCun (2015) Character-level convolutional networks for text classification. See DBLP:conf/nips/2015, pp. 649–657. External Links: Link Cited by: §1, Table 1, Table 2, §3.1, §3.1, §3.1, Table 3, Table 5.
  • Y. Zhang, D. Song, P. Zhang, X. Li, and P. Wang (2019) A quantum-inspired sentiment representation model for Twitter sentiment analysis. Applied Intelligence. Cited by: §2.2.8.
  • Y. Zhang, X. Yu, Z. Cui, S. Wu, Z. Wen, and L. Wang (2020) Every document owns its structure: inductive text classification via graph neural networks. See DBLP:conf/acl/2020, pp. 334–339. External Links: Link Cited by: §2.2.7.
  • P. Zhou, Z. Qi, S. Zheng, J. Xu, H. Bao, and B. Xu (2016a) Text classification improved by integrating bidirectional LSTM with two-dimensional max pooling. See DBLP:conf/coling/2016, pp. 3485–3495. External Links: Link Cited by: §2.2.4, Table 2, Table 5, §4.
  • X. Zhou, X. Wan, and J. Xiao (2016b) Attention-based LSTM network for cross-lingual sentiment classification. See DBLP:conf/emnlp/2016, pp. 247–256. External Links: Link, Document Cited by: §2.2.5, Table 2, Table 3.
  • X. Zhu, P. Sobhani, and H. Guo (2015) Long short-term memory over recursive structures. See DBLP:conf/icml/2015, pp. 1604–1612. External Links: Link Cited by: §1, Table 2, Table 3.