, etc. Most research on text classification focuses on how to extract effective features from text and how to construct accurate classifiers that integrate those features. In deep neural methods, word vectors project the semantic information of words into a dense low-dimensional space in which the semantic similarity of words can be assessed by Euclidean distance or cosine similarity. Deep learning methods that perform composition over word vectors to extract features have proven to be effective classifiers and achieve excellent performance on different NLP tasks ([15, 11, 16]).
Among deep neural methods for text classification, the Recurrent Neural Network (RNN) and the Convolutional Neural Network (CNN) have been widely used in many applications. RNN can capture dependencies between different components of a statement. CNN, which employs the convolution operation to capture features, was originally used in computer vision. Many works adopt the CNN structure for NLP tasks ([15, 11]). Their results indicate that CNN is particularly good at capturing local features of word vectors.
However, deep neural methods still face a severe challenge: it is hard to interpret the classification results. In an end-to-end deep neural architecture, the intermediate hidden states are usually real-valued vectors or matrices whose semantic meaning is difficult to work out and which are hard to associate with the target categories. Recent work on interpretability has focused on producing a posteriori explanations for neural network predictions. There has been some related research on RNN-based neural networks [4, 10, 7]. However, interpreting CNN-based models remains an under-explored area. There is plenty of research, along with many applications, on interpreting and visualizing CNN in the field of computer vision [22, 23, 12]. However, because the forms of the input data differ, it is hard to apply methods designed for image pixels to word vectors. Previous work studies how CNN-based classifiers work. The procedure can be divided into two steps: (1) 1-dimensional convolution operations are used to detect closely related families of n-grams; (2) max-pooling integrates the n-gram features, and the result is then used by linear layers to make the decision. In both steps, each unit of the output has a complicated association with multiple input units, and the non-linear layers and the complex structure of the convolutional layer make interpretation even harder. Previous work also interprets CNN models by extracting and analyzing the n-gram features the model has learned, but it does not establish an association between input tokens and n-gram features.
In this paper, we make a variant of the typical CNN-based classifier interpretable. For each processing step of the CNN-based model, we propose a method to establish point-wise associations between its input and its output. The max-pooling layer is used to integrate all the n-gram features into the sentence representation. In our work, the interpreting operations of the proposed model consist of two parts: convolution attribution and n-gram feature parsing. Convolution attribution disassembles the convolution operation: we construct the association between n-gram features and word vectors by analyzing the contribution of the values in each word vector to the convolution results. N-gram feature parsing constructs an association between n-gram features and the classification result by backtracking through the pooling operation. In addition, a multi-sentence strategy is proposed for the case where the input text contains multiple sentences.
Common wisdom suggests that interpretability and performance currently stand in apparent conflict in machine learning [1, 19]. In the experiments, we first compare the classification performance of the model with several state-of-the-art classifiers. We then validate the interpretability of the model and visualize its predictions in 2 scenarios, namely interpreting predictions with input tokens and interpreting predictions with samples.
2 Related Works
In this section, we review the related studies in two areas: text classification and neural network interpretability.
Among the large number of recent works that employ deep neural networks for text classification, our model uses a quite common structure and is similar to many models. Compared with textCNN, we apply the pooling operation over n-grams rather than over the vector dimensions. Our model is also similar to a previously proposed model; the difference is that we employ convolution kernels of multiple sizes to capture features. There are also other variants similar to the model.
Research on interpreting NLP neural networks takes various directions. Layer-wise relevance propagation (LRP) [bach2015pixel] is widely used in interpretability research, and our methods also draw on this idea. LRP, which backpropagates relevance recursively from the output layer to the input layer, was originally designed to compute the contributions of single pixels to the predictions of image classifiers. Previous work uses LRP to interpret an attention-based encoder-decoder framework for neural machine translation. [croce2018explaining] explain neural classifiers by providing the examples that motivate the decision. They employ LRP to trace back the model predictions and further provide human-readable justifications. The attention mechanism is used in various NLP settings to adjust the weight of different parts of the input. Some research analyzes and visualizes the weights of the attention mechanism to provide human-readable explanations for the prediction [20, luong2015effective]. Other research proposes new neural network architectures that are easier to interpret. [stahlberg2018operation] propose a neural machine translation model that incorporates explicit word alignment information in the representation of the target sentence; interpretability is achieved by generating the target sentence in parallel with the source sentence. Although the definition of interpretability differs between settings, most related research indicates that there is a conflict between interpretability and performance in machine learning [19, murdoch2019interpretable, rudin2018please]. New neural architectures are usually proposed to take both requirements into account.
In this section, we introduce the structure of our proposed text classification model and the methods used to interpret the classification result. The framework of the proposed model is shown in Figure 1.
3.1 Text Classification Model
As shown in Figure 1, the input of the text classification model is a sequence of words. We convert each word into a word representation vector of fixed dimension. We employ convolution operations with different kernel sizes, up to a maximum convolution size, to derive n-grams. These n-grams serve as the features for classification. The number of convolution filters determines the dimension of the feature vectors and is set to the same value for all kernel sizes. The total number of n-gram features depends on the sentence length and the maximum kernel size. In practice, text classification tasks face the challenge that the length of the input sentence varies. In our model, a max-pooling layer is used to integrate the n-gram features and generate a sentence vector of fixed dimension. The sentence vector is then fed into a fully-connected layer to generate a score distribution over the target categories. The final output of the model is a probability distribution over the target categories.
The convolution operations used in our method are the same as those in earlier CNN architectures for text. We represent an n-gram as the concatenation of its word vectors. The feature vector of an n-gram is obtained by applying the convolution kernel, a linear map with a weight matrix and a bias, to this concatenation. The size of the convolution kernel is equal to the length of the n-gram.
The filter is applied to every possible window in the sentence. The max-pooling operation takes the maximum value of the feature vectors in each dimension over all the n-grams to generate the sentence vector.
The sentence vector is then passed to a fully connected layer to calculate the score distribution over the target categories, and the probability distribution is obtained by normalizing the scores.
The fully connected layer has a weight matrix and a bias as parameters, with one output per target category.
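The forward pass described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the nested-list layout of the kernels, the use of softmax for normalization, and the toy weights are all hypothetical.

```python
import math


def ngram_features(words, kernels, biases):
    """Apply every kernel to every window of matching length.

    words:   list of word vectors (each a list of d floats)
    kernels: {n: filters}, where filters[k] is an n x d grid of weights
    biases:  {n: [one bias per filter]}
    Returns a list of (start, n, feature_vector) triples.
    """
    features = []
    for n, filters in kernels.items():
        for start in range(len(words) - n + 1):
            window = words[start:start + n]
            vector = [
                sum(w * x
                    for w_row, x_row in zip(filt, window)
                    for w, x in zip(w_row, x_row)) + biases[n][k]
                for k, filt in enumerate(filters)
            ]
            features.append((start, n, vector))
    return features


def max_pool(features):
    """Element-wise maximum over all n-gram feature vectors."""
    size = len(features[0][2])
    return [max(f[2][k] for f in features) for k in range(size)]


def dense(vector, weights, bias):
    """Fully connected layer: one score per target category."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def softmax(scores):
    """Assumed normalization from scores to probabilities."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy input: 3 words, embedding size 2, 2 filters per kernel size (1 and 2).
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
kernels = {
    1: [[[1.0, 0.0]], [[0.0, 1.0]]],
    2: [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [0.0, 0.0]]],
}
biases = {1: [0.0, 0.0], 2: [0.0, 0.0]}

features = ngram_features(words, kernels, biases)
sentence = max_pool(features)
scores = dense(sentence, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
probs = softmax(scores)
print(sentence, scores, probs)
```

Because the pooling runs over n-grams, the sentence vector has one entry per filter regardless of the sentence length, which is what lets the same fully connected layer handle inputs of any length.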
3.2 Interpretable Principles
In this part, we introduce the methods used to interpret the classification predictions of the proposed model. As shown in Figure 1, the interpreting process is essentially the inverse derivation of the model predictions. The implementation of interpretability consists of two parts: convolution attribution and n-gram feature analysis.
Convolution Attribution
In the proposed model, each feature vector obtained by the convolution operation corresponds to an n-gram. In this part, we propose convolution attribution, a method that analyzes each convolution operation independently and obtains the relevance between word vectors and the n-gram features.
Given the feature vector of an n-gram and the weights and bias of the convolution layer, we first take the element-wise product between each kernel and the n-gram.
Each dimension of the feature vector is then the sum of the corresponding element-wise products plus the bias of the convolution. From these element-wise products, we define the contribution of each word in the n-gram to each dimension of the feature vector.
If the sum of the products is too small, we simplify the contribution distribution on the corresponding dimension.
We take the contributions of every word of the n-gram on every dimension of the feature vector into account and build a contribution matrix. Each column of the matrix gives the contribution distribution of the words on a specific dimension of the feature vector.
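A small sketch may help make the attribution step concrete. The exact formula is not recoverable from the text, so the following rests on assumptions: a word's contribution to a dimension is taken to be its share of the summed element-wise products (bias excluded), and when that sum is close to zero the distribution is simplified to a uniform one. The function name, the `eps` threshold and the uniform fallback are hypothetical.

```python
def convolution_attribution(window, filters, eps=1e-6):
    """Build a contribution matrix for one n-gram.

    window:  the n word vectors of the n-gram
    filters: filters[k] is an n x d grid of weights for dimension k
    Returns an n x m matrix (rows: words, columns: feature dimensions)
    whose columns each sum to 1.
    """
    n, m = len(window), len(filters)
    # partial[j][k]: sum of element-wise products between word j and
    # the slice of filter k that is aligned with word j.
    partial = [
        [sum(w * x for w, x in zip(filt[j], window[j])) for filt in filters]
        for j in range(n)
    ]
    matrix = [[0.0] * m for _ in range(n)]
    for k in range(m):
        total = sum(partial[j][k] for j in range(n))
        for j in range(n):
            if abs(total) < eps:
                # Assumed simplification: spread the contribution evenly.
                matrix[j][k] = 1.0 / n
            else:
                matrix[j][k] = partial[j][k] / total
    return matrix


# Toy bigram with two filters.
window = [[1.0, 0.0], [0.0, 1.0]]
filters = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [0.0, 0.0]]]
contribution = convolution_attribution(window, filters)
print(contribution)
```

With signed products, individual shares can be negative or exceed 1 while the column still sums to 1; how to treat such cases is a design choice that this sketch leaves open.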
N-gram Feature Analysis
N-gram feature analysis splits the score distribution corresponding to the sentence vector into the sum of the contributions of the n-gram features. In the proposed classification model, all the feature vectors and the sentence vector share the same size. The sentence vector integrates all the n-gram features with max-pooling, so the feature vectors also share the same structure as the sentence vector. We analyze the effect of the feature vectors on the final classification prediction using the fully connected layer that is connected to the sentence vector. The details are shown in Figure 3.
The analysis is performed independently for each n-gram. First, we determine which parts of each feature vector actually play a role in the classification by reversing the max-pooling layer. For each dimension of a feature vector, if the value is equal to the corresponding dimension of the sentence vector, in other words if the value in this dimension is the largest among all the n-gram features, we keep this value. Otherwise we set the value at this position to 0. This defines the modified vector.
We pass the modified vector through the fully connected layer of the original model to get a score distribution over the target categories. We focus only on the differences between the contributions of different features to the categories, so the bias of the fully connected layer is ignored.
If the values of all the features in a specific dimension are less than 0, the result may differ from the real situation. This can be avoided by applying a Rectified Linear Unit (ReLU) activation function to the outputs of the convolution layers in the proposed model.
The resulting score distribution gives the scores of the n-gram feature on the target categories. We further break it down along the dimensions of the feature vector by rewriting Equation 9.
We build a matrix by repeating the transpose of the modified vector once per target category and concatenating the copies along the column axis, so that it has the same dimensions as the weight matrix of the fully connected layer. The element-wise product of this matrix and the weight matrix gives the score distribution matrix, whose values are the scores of each dimension of the feature vector on the target categories.
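The masking and scoring steps above can be sketched as follows. This is an illustration under assumptions, not the authors' code: the function names are hypothetical, the fully connected weights are stored as one row per target category, and ties at the maximum are kept for every n-gram that reaches it, which follows the wording of the text literally.

```python
def mask_by_max_pool(feature, sentence):
    """Keep only the entries that survived max-pooling; zero the rest."""
    return [v if v == s else 0.0 for v, s in zip(feature, sentence)]


def score_matrix(masked, fc_weights):
    """Element-wise product of the repeated masked vector and the weights.

    fc_weights[c][k] is the weight from feature dimension k to category c.
    Entry [c][k] of the result is the score that dimension k of this
    n-gram contributes to category c (the bias is ignored).
    """
    return [[w * v for w, v in zip(row, masked)] for row in fc_weights]


def score_distribution(matrix):
    """Sum over feature dimensions: one score per target category."""
    return [sum(row) for row in matrix]


# Toy example: two n-gram features pooled into one sentence vector.
feature_a = [2.0, 1.0]
feature_b = [1.0, 1.0]
sentence = [max(a, b) for a, b in zip(feature_a, feature_b)]
fc_weights = [[1.0, -1.0], [0.5, 2.0]]

matrix_a = score_matrix(mask_by_max_pool(feature_a, sentence), fc_weights)
matrix_b = score_matrix(mask_by_max_pool(feature_b, sentence), fc_weights)
print(score_distribution(matrix_a), score_distribution(matrix_b))
```

In this toy case both features reach the maximum in the second dimension, so both keep that entry; an implementation that wants the per-n-gram scores to add up exactly to the sentence score would need to break such ties.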
Interpretable Features Aggregation
We defined the contribution matrix in convolution attribution and the score distribution matrix in n-gram feature analysis. The contribution matrix represents the contribution of the word vectors to each dimension of the n-gram feature, and the score distribution matrix measures the impact of each dimension of the n-gram feature on the target categories. We integrate the two matrices to evaluate the impact of each word vector on the classification result.
Within an n-gram feature, the value of a word vector on the target categories is calculated from the contribution matrix and the score distribution matrix.
The value of a word vector on the target categories over the whole sentence is obtained by summing over the set of n-gram features whose n-grams contain the word.
Collecting the results for all the word vectors in the sentence, we define a value matrix as the final interpretable result. It shows the value of each word vector on every target category.
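The aggregation can be illustrated with a short sketch. It assumes that a word's value on a category within one n-gram is the sum, over feature dimensions, of its contribution times the corresponding entry of the score distribution matrix, and that values are then added over all n-grams containing the word. The function names, data layout and toy numbers are hypothetical.

```python
def word_values_in_ngram(contribution, scores):
    """Combine the two matrices for one n-gram.

    contribution: n x m (rows: words, columns: feature dimensions)
    scores:       C x m (rows: categories, columns: feature dimensions)
    Returns an n x C matrix of word values on the target categories.
    """
    return [
        [sum(c * e for c, e in zip(word_row, cat_row)) for cat_row in scores]
        for word_row in contribution
    ]


def aggregate_over_sentence(num_words, ngrams):
    """Sum each word's values over every n-gram that contains it.

    ngrams: list of (start, contribution, scores) triples.
    """
    num_categories = len(ngrams[0][2])
    values = [[0.0] * num_categories for _ in range(num_words)]
    for start, contribution, scores in ngrams:
        local = word_values_in_ngram(contribution, scores)
        for offset, row in enumerate(local):
            for c, v in enumerate(row):
                values[start + offset][c] += v
    return values


# A bigram covering words 0-1 and a unigram covering word 1.
bigram = (0, [[0.5, 1.0], [0.5, 0.0]], [[2.0, -1.0], [1.0, 2.0]])
unigram = (1, [[1.0, 1.0]], [[0.0, -1.0], [0.0, 2.0]])
value_matrix = aggregate_over_sentence(2, [bigram, unigram])
print(value_matrix)
```

Under these assumptions, when each column of a contribution matrix sums to 1, the word values inside an n-gram add up to that n-gram's score on each category, so no score is lost or created by the redistribution.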
3.3 Multi-sentence weighting strategy
For an input of any length, our model outputs a sentence feature vector of fixed size. As one would expect, much of the text information is lost if the input text is long. To let the model be trained on long texts without changing its structure, we propose a multi-sentence weighting strategy.
As shown in Figure 4, we divide a long input text into parts. In general it is sensible to divide by sentences. We apply the short-text classification model to each part independently and obtain a score distribution for each. To simplify the problem, we ignore the order of the parts and the associations between them. The output for all the sentences is computed as a weighted sum of these distributions.
There are many ways to set the sentence weights. A naive approach is to give the same weight to all sentences. Another approach uses a selective attention mechanism to de-emphasize noisy sentences, with a matrix that evaluates the degree to which the input sentence matches a particular category as the weight value.
To simplify the interpretation process, we obtain the sentence weights with a method that adds no extra parameters. We assume that the higher the score of a sentence on a particular category, the more important the sentence is for the classification task. Conversely, an ambiguous sentence receives a low weight. Based on this assumption, we use the maximum value of the score distribution as the weight of the sentence. Specifically, we obtain the weight of each sentence by applying the softmax function to the maximum values of the score distributions obtained for the different sentences.
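The weighting rule can be sketched as follows. This is a minimal illustration under assumptions: the function names are hypothetical, and the per-sentence inputs are treated as raw score vectors, since the text does not make clear whether scores or probabilities are combined.

```python
import math


def sentence_weights(score_dists):
    """Softmax over the per-sentence maximum scores."""
    peaks = [max(scores) for scores in score_dists]
    top = max(peaks)  # subtract the largest peak for numerical stability
    exps = [math.exp(p - top) for p in peaks]
    total = sum(exps)
    return [e / total for e in exps]


def combine_sentences(score_dists):
    """Weighted sum of the per-sentence score distributions."""
    weights = sentence_weights(score_dists)
    num_categories = len(score_dists[0])
    return [
        sum(w * scores[c] for w, scores in zip(weights, score_dists))
        for c in range(num_categories)
    ]


# One confident sentence and one ambiguous sentence (made-up scores).
dists = [[3.0, 0.0], [1.0, 1.0]]
weights = sentence_weights(dists)
combined = combine_sentences(dists)
print(weights, combined)
```

The confident sentence receives the larger weight, and because the weights are a deterministic function of the scores, the combined output can still be traced back to individual sentences without any learned attention parameters.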
In this part, we design experiments to evaluate the performance and the interpretability of the proposed model. The experiments are performed on both Chinese and English datasets to show that the proposed methods can be used in both languages.
4.1 Comparison with State-of-the-art Classifiers
We compare the classification performance of the proposed model with several state-of-the-art classifiers that use different neural structures. The baselines include:
n-grams + LR: We collect the most frequent n-grams in the training set (5 times the size of the training set) as features and employ an LR classifier.
CNN: A CNN classifier with pre-trained word embeddings. The main difference from our method is how the pooling layer is applied.
Bi-LSTM: Bi-directional LSTM with pre-trained word embeddings.
FastText: A model that averages the word embeddings as the sentence representation.
The datasets we use are described below:
Ohsumed (http://disi.unitn.it/moschitti/corpora.htm): The Ohsumed corpus is a bibliographic dataset containing 3,357 documents for training and 4,043 documents for testing. There are 23 target categories in total.
20NG (http://qwone.com/~jason/20Newsgroups/): A document classification dataset consisting of 11,314 training samples and 7,532 test samples distributed over 20 target categories.
The Ohsumed and 20NG datasets contain more than one sentence in each sample. For these two datasets we adopt two processing modes: treating all sentences as a whole, and using the multi-sentence weighting strategy. The proposed model uses randomly initialized word vectors. The embedding dimension and the feature vector dimension are set to 50. The maximum convolution length is set to 6. For the baseline models, some of the results are taken from previous work, and the others are reproduced with the default parameters in the original papers. The models that need pre-trained vectors employ 300-dimensional word vectors.
|n-grams + LR||93.7||87.0||87.2||54.7||83.2|
We present the results in Table 1. The results show that our methods achieve strong performance on most of the datasets. Our methods perform poorly on the TREC dataset, possibly because TREC is small and our method does not use pre-trained word vectors. On Ohsumed and 20NG, the multi-sentence weighting strategy effectively improves the performance of the model.
4.2 Chinese Classification Task
We also evaluate the model on a Chinese task. The dataset consists of user survey feedback from the taxi industry. The goal of the classification task is to identify the scope of the description. There are 7 target categories, including driver related, products and services, etc. The dataset contains 13,525 instances in total, distributed approximately equally across the categories. Each instance consists of a title and a context of variable length. The length of the context varies greatly, from 1 sentence to more than 100 sentences. 10% of the instances are randomly held out for testing.
The experiment consists of two groups. In the first group, training uses only the titles. In the second group, training uses both the title and the context. The baselines and the parameter settings are similar to those in Section 4.1.
|methods||title||title and context|
The results show that the proposed model outperforms other methods. Our methods with multi-sentence weighting strategy achieve the best results due to the integration of title and context information.
5 Analysis and Visualization
In this section, we analyze the interpretability of the proposed method on both English and Chinese datasets. The performance on English is assessed on the TREC dataset and the performance on Chinese is assessed on the proposed Chinese user survey feedback dataset.
5.1 Interpreting Predictions with input tokens
The interpretability of the proposed model is realized by providing human-readable explanations for neural network predictions. After the proposed model converges on the training set, we can use the proposed methods to generate the contribution matrix. We visualize the contributions of the input tokens to the final categories to see which tokens in the input sentence drive the final predictions.
We evaluate the interpretability on the TREC dataset. The task is to classify a question into 6 types (location, entity, abbreviation, description, human, and number). The attention mechanism is used as a baseline: we employ it in a Bi-LSTM structure and visualize the weights to indicate the importance of the input tokens. We apply the proposed model to the dataset and visualize the weight of each word on the ground-truth category. Examples are shown in Figure 6.
As Figure 6 shows, the attention mechanism evaluates the importance of each word for the classification task, but the results are rather vague: it is difficult to pinpoint specific words with the attention mechanism. In contrast, our model accurately locates the words that are related to the ground-truth category.
In text classification, there are cases where different components of a sentence are associated with different target categories. The interpretable result shows that the proposed model can capture this information. As shown in Figure 5, intelligent has a strong association with the target category technology brand. In addition, intelligent scheduling model contributes to products and services, and scheduling model contributes to driver related. Most of the associations that the model has learned are reasonable.
5.2 Interpreting Predictions with Samples
In text classification tasks, the effectiveness of a classification model depends heavily on the dataset. Limited by the scale of the dataset, the bases on which the model makes its judgments often differ from human cognition. In this section, we try to find the judgment bases of the model for a specific prediction and trace them back to samples in the training set.
Given an input context, after the model makes its prediction, we employ the proposed interpretation method to find the value of each word in the context on the target categories. We focus on the positions of the words that have positive values on a particular category to generate a combination pattern. In general, the combination patterns show which features lead the model to make the prediction. Next, we look in the training set for the samples that best match the generated patterns. These samples usually show how the model learned the features.
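One possible reading of this procedure is sketched below. The text does not define how a combination pattern is formed or how suitability is measured, so everything here is an assumption: patterns are taken to be maximal runs of consecutive tokens with positive value, and a training sample's suitability is the number of patterns it contains verbatim. The function names and the word values are made up for illustration.

```python
def extract_patterns(tokens, values, threshold=0.0):
    """Group consecutive tokens whose value exceeds the threshold."""
    patterns, run = [], []
    for token, value in zip(tokens, values):
        if value > threshold:
            run.append(token)
        elif run:
            patterns.append(tuple(run))
            run = []
    if run:
        patterns.append(tuple(run))
    return patterns


def contains(sample, pattern):
    """True if the pattern occurs as a contiguous span of the sample."""
    n = len(pattern)
    return any(tuple(sample[i:i + n]) == pattern
               for i in range(len(sample) - n + 1))


def rank_samples(patterns, training_samples):
    """Order training samples by how many patterns they contain."""
    scored = [(sum(contains(s, p) for p in patterns), s)
              for s in training_samples]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for score, s in scored if score > 0]


# Hypothetical word values on the predicted category.
tokens = ["how", "long", "does", "it", "take"]
values = [0.9, 0.7, -0.1, 0.0, -0.2]
patterns = extract_patterns(tokens, values)

training = [["what", "is", "it"], ["how", "long", "is", "the", "river"]]
matches = rank_samples(patterns, training)
print(patterns, matches)
```

A fuzzier suitability measure, for example one based on token overlap or embedding similarity, would retrieve more samples at the cost of weaker evidence that a given sample taught the model the pattern.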
Examples are shown in Figure 7. The judgments of the model on the TREC dataset are in fact based on a number of combination patterns. In the first example, the model classifies the input sentence as NUM with a confidence greater than 99% solely on the basis of the strong pattern how long. For predictions with weak confidence, we observe that most of the judgment bases are unreasonable combination patterns, such as it take to leading to ENTY and what is the leading to NUM. With the proposed method, we can identify which training samples mislead the model into learning wrong features.
This work can be used to attribute model mispredictions. By analyzing the error cases, we can find the samples that mislead the model into learning wrong judgment bases. In this way, we can make targeted adjustments to the judgment bases learned by the model.
In this work, we propose a CNN text classification model and employ two strategies, n-gram feature analysis and convolution attribution, to interpret the classification process. N-gram feature analysis explains the model prediction through the contributions of the n-gram features, and convolution attribution further establishes the relevance between n-gram features and word vectors. We also employ a multi-sentence weighting strategy to apply the model to long texts. Experiments show that the model performs similarly to state-of-the-art classifiers and that the multi-sentence weighting strategy works for long inputs. Two applications with several interpretable cases are presented to demonstrate the interpretability of the model. In the future, we will explore more methods for interpreting neural models in NLP tasks and study how to use the interpretability of a model to make targeted adjustments to its performance.
[1] (2018) Towards robust interpretability with self-explaining neural networks. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 7786–7795.
[2] (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
[3] (2011) Natural language processing (almost) from scratch. Journal of Machine Learning Research 12 (Aug), pp. 2493–2537.
[4] (2017) Visualizing and understanding neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 1150–1159.
[5] (2009) Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford 1 (12).
[6] (2016) A primer on neural network models for natural language processing. Journal of Artificial Intelligence Research 57, pp. 345–420.
[7] (2018) Interpreting word-level hidden state behaviour of character-level LSTM language models. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 258–266.
[8] (2018) Understanding convolutional neural networks for text classification. arXiv preprint arXiv:1809.08037.
[9] (2016) Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.
[10] (2015) Visualizing and understanding recurrent networks. arXiv preprint arXiv:1506.02078.
[11] (2014) Convolutional neural networks for sentence classification. In EMNLP.
[12] (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
[13] (1998) Gradient-based learning applied to document recognition.
[14] (2002) Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, Volume 1, pp. 1–7.
[15] (2016) Neural relation extraction with selective attention over instances. In ACL.
[16] (2016) Recurrent neural network for text classification with multi-task learning. arXiv preprint arXiv:1605.05101.
[17] (1998) A comparison of event models for naive Bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization, Vol. 752, pp. 41–48.
[18] (2013) Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119.
[19] (2019) Balancing the trade-off between accuracy and interpretability in software defect prediction. Empirical Software Engineering 24 (2), pp. 779–825.
[20] (2016) Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1480–1489.
[21] (2018) Graph convolutional networks for text classification. arXiv preprint arXiv:1809.05679.
[22] (2014) Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pp. 818–833.
[23] (2011) Adaptive deconvolutional networks for mid and high level feature learning. In ICCV, Vol. 1, pp. 6.
[24] (2015) Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pp. 649–657.