Convolutional Neural Networks (CNNs) have shown promising results in text classification, including binary classification of movie reviews, multi-class sentiment classification on the sentiment treebank, and topic categorization (Collobert et al. 2011, Kim 2014, Conneau et al. 2017). This competitive performance across a wide range of text classification tasks has made CNNs attractive for end-to-end applications in industries beyond computer vision. However, in many critical domains (e.g. banking, health care, and medical services), there is an increasing demand for models and evaluation frameworks that support interpretability and exploratory analysis of CNN models.
The importance of model interpretability in banking services is exemplified by the deployment of machine learning models for analyzing customer behaviour in the Customer Due Diligence (CDD) stage of Know Your Customer (KYC). Given customer data in the form of CDD reports and the corresponding historical assessments from analysts (labels of customer categorization), a classifier can be built to characterize customers based on the content of their reports, e.g. as “low” or “high” financial risk. An interpretable model is therefore desirable, since it could reveal confounding factors that further explain the model’s final prediction. For instance, it could provide the reasoning why a customer is categorized as “high” risk instead of “low”, or why the model misclassifies a customer during the validation stage.
Several approaches have been explored for improving the interpretability of Deep Neural Network (DNN) models. They include global (layer-wise) and local (individual feature importance) explanation methods, as exemplified in early work on visualizing DNNs for image classification (Simonyan et al. 2013, Samek et al. 2017, Ancona et al. 2018). The latter work summarizes several attribution methods for explaining what DNN models have learned in a prediction task, including two back-propagation-based methods: Layer-wise Relevance Propagation (LRP) and saliency maps. An evaluation metric based on sensitivity analysis for comparing gradient-based and perturbation-based methods on image and text classification was proposed in (Ancona et al. 2018).
In the domain of text classification, the aforementioned attribution methods have also been employed to explain the predictions of neural models. The works on local explanation (Nguyen 2018) and visualization of linguistic compositionality in neural models (Li et al. 2016) utilized first-derivative saliency to identify the most influential inputs (words) for and against a particular prediction. Likewise, LRP has been employed for explaining CNN predictions on a topic categorization task (Arras et al. 2016).
To compare different attribution-based methods, word-removal experiments were used in (Arras et al. 2017). The main idea is that deleting the words with the highest attribution scores should cause a drastic drop in model accuracy. This approach has a drawback, however: there may be dependent factors that contribute to the change in accuracy. For instance, the model’s decisions could be influenced by the relevance of phrases (n-grams). Removing words will not only eliminate the contribution of those particular words, but could also affect the contribution of other words within the same context window (n-gram), sentence, or document.
In this paper, we employ two attribution methods (saliency maps and LRP) on binary and multi-class classification of customer reviews (public data sets). Different from previous approaches that measure the quality of explanation methods with “word-deleting” perturbation experiments, we evaluate the attribution scores with a “feature removal” method. As an example of an application on a real-world data set, we apply our CNN model and the two attribution-based explanations to CDD reports (corporate data set). We also developed an interactive visualization tool (https://peaceful-journey-19056.herokuapp.com/) to further help analysts investigate the model’s prediction outputs.
The rest of the paper is organized as follows. Section 2 describes the CNN architecture used in this study. The two attribution-based explanations are explained in Section 3, and our proposed evaluation framework in Section 4. Experiments and results are discussed in Section 5. The conclusion is presented in Section 6.
2 CNN model
We employed a word-based CNN model as a document classifier, i.e. to predict whether a text review is positive or negative (binary classification) and to categorize text documents (multi-class classification). Figure 1 depicts the CNN architecture used in this study, which we refer to as TextCNN. “Conv-block” denotes a convolutional layer with its corresponding feature map (filter). In image classification, filters operate on the red, green, and blue (RGB) channels, while in this text classifier the filters correspond to three different n-gram widths.
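As an illustration, the forward pass of such a word-based CNN can be sketched in plain NumPy. All sizes and the filter widths 3, 4, and 5 below are placeholder choices for the sketch, not the paper's actual hyperparameters, and the weights are random rather than trained:

```python
import numpy as np

rng = np.random.default_rng(0)

V, D, T = 100, 16, 20          # vocabulary size, embedding dim, sequence length
widths = [3, 4, 5]             # assumed n-gram filter widths (illustrative)
F = 8                          # filters per width
C = 2                          # number of classes

E = rng.normal(size=(V, D))                        # embedding matrix
conv_w = {n: rng.normal(size=(F, n * D)) for n in widths}
dense_w = rng.normal(size=(len(widths) * F, C))

def text_cnn(token_ids):
    x = E[token_ids]                               # (T, D) embedded words
    pooled = []
    for n in widths:
        # slide an n-word window over the sequence and apply the F filters
        windows = np.stack([x[i:i + n].ravel() for i in range(T - n + 1)])
        feature_maps = np.maximum(windows @ conv_w[n].T, 0)   # ReLU
        pooled.append(feature_maps.max(axis=0))               # max-over-time pooling
    h = np.concatenate(pooled)                     # learned document representation
    logits = h @ dense_w
    logits = logits - logits.max()                 # numerically stable softmax
    return np.exp(logits) / np.exp(logits).sum()

probs = text_cnn(rng.integers(0, V, size=T))
```

The concatenated max-pooled vector `h` plays the role of the document representation that the later sections perturb.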
3 Attribution-based explanations
Gradient-based saliency maps or Sensitivity Analysis (SA) (Simonyan et al. 2013) construct the attribution score by taking the partial derivative of the target output for a particular class c with respect to the input features x_i. Instead of the common absolute form of saliency, we employed the raw (signed) values of the saliency, as in:

R_i^c(x) = ∂f_c(x) / ∂x_i
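A minimal sketch of signed saliency, using a linear classifier so the partial derivative has a closed form (the model, its sizes, and the random weights are illustrative, not the TextCNN itself):

```python
import numpy as np

rng = np.random.default_rng(1)
D, C = 6, 3
W = rng.normal(size=(C, D))
b = rng.normal(size=C)

def score(x, c):
    return (W @ x + b)[c]       # pre-softmax class score f_c(x)

x = rng.normal(size=D)
c = 2

# Signed saliency: partial derivative of f_c with respect to each input
# feature. For this linear model the gradient is exactly the weight row W[c].
saliency = W[c].copy()

# Sanity check against a finite-difference approximation of the derivative.
eps = 1e-6
fd = np.array([(score(x + eps * np.eye(D)[i], c) - score(x, c)) / eps
               for i in range(D)])
```

For a deep model the same derivative would be obtained by back-propagation instead of this closed form.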
LRP (Samek et al. 2017) redistributes the prediction score f_c(x) layer by layer until reaching the desired layer. A unit j in layer l receives a share of the relevance of every unit k in layer l+1 that it contributes to:

R_j^{(l)} = Σ_k (z_{jk} / Σ_{j'} z_{j'k}) R_k^{(l+1)}

where z_{jk} denotes the contribution of unit j to the pre-activation of unit k.
The following conservation rule holds for the LRP attribution scores of all layers: for a particular class c, the sum of attribution scores on each layer l is equal to the prediction score f_c(x):

Σ_j R_j^{(l)} = f_c(x)
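The conservation property can be checked numerically on a tiny two-layer ReLU network with the epsilon-stabilized LRP rule. This is a sketch under assumed choices (no biases, epsilon stabilizer); the paper's actual propagation variant may differ:

```python
import numpy as np

rng = np.random.default_rng(2)
D, H, C = 8, 5, 3
W1 = rng.normal(size=(D, H))
W2 = rng.normal(size=(H, C))

x = rng.normal(size=D)
h = np.maximum(x @ W1, 0)       # hidden layer (ReLU, no bias: exact conservation)
z = h @ W2                      # class scores
c = int(z.argmax())

def lrp_linear(a, W, R_out, eps=1e-9):
    """Epsilon rule: redistribute relevance R_out over the inputs a through W."""
    pre = a @ W
    pre = pre + eps * np.sign(pre)          # stabilizer against tiny denominators
    return a * (W @ (R_out / pre))

R_top = np.zeros(C)
R_top[c] = z[c]                 # start from the prediction score f_c(x)
R_hidden = lrp_linear(h, W2, R_top)   # relevance of the hidden units
R_input = lrp_linear(x, W1, R_hidden) # relevance of the input features
```

The sums of `R_hidden` and `R_input` both recover the prediction score `z[c]` up to the stabilizer term, which is the conservation rule above.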
4 Evaluation framework
4.1 Embedded-word relevance
In this experiment, the attribution score assigned to each feature (each dimension) of a word embedding is utilized without any perturbation-based experiments. For both the quantitative and qualitative evaluation of the embedded-word relevance, we carry out experiments with document highlighting (Arras et al. 2016), i.e. by using the document embedding as the input of classifiers (KNN, SVM, Decision Tree, Random Forest) to predict the category label of the corresponding embedded document. A higher accuracy is expected for weighted document representations if truly important features are assigned higher weights. We do not employ “feature deletions” in this word-based relevance model, since we are more interested in higher abstractions than words (i.e. perturbations of features of embedded n-grams or embedded documents), as explained in Sections 4.2 and 4.3.
Given the three-dimensional output of the embedding layer (N samples, a sequence of T words, and word-embedding dimension D), an attribution score is assigned to each entry of this tensor. To create a document representation (document embedding), the attribution score of each word is used as a weighting factor. The feature-based attribution score for word j in document i and embedding column k is denoted r_{i,j,k}, and the total attribution score for word j is r_{i,j} = Σ_k r_{i,j,k}. Given the word-embedding vectors e_{i,j}, the non-weighted document representation for document i is the average of its word vectors, d_i = (1/T) Σ_j e_{i,j}. The weighted document representation for document i is d_i^w = (1/T) Σ_j r_{i,j} e_{i,j}.
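The weighting scheme above can be sketched as follows for a single document; the attribution scores here are random placeholders standing in for LRP or saliency outputs, and the sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
T, D = 12, 16                   # words per document, embedding dimension

word_embeddings = rng.normal(size=(T, D))   # e_{i,j} for one document
attributions = rng.normal(size=(T, D))      # r_{i,j,k}: placeholder scores

# Total attribution per word: sum over the embedding dimensions.
word_scores = attributions.sum(axis=1)      # r_{i,j}

# Non-weighted document embedding: plain average of the word vectors.
doc_plain = word_embeddings.mean(axis=0)

# Weighted document embedding: each word vector scaled by its attribution
# score before averaging, so relevant words dominate the representation.
doc_weighted = (word_scores[:, None] * word_embeddings).mean(axis=0)
```

`doc_plain` and `doc_weighted` correspond to the "w-0" and weighted inputs fed to the downstream classifiers.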
4.2 Embedded-document relevance
Experiments based on embedded-document perturbations were performed to evaluate whether the important features are assigned high attribution scores. Intuitively, different fragments of a document (e.g. different sentences) may carry different sentiment polarities. A review could start with negative criticism about a small aspect of a product, while its final conclusion gives a positive recommendation. Assuming these different polarities are embedded as features of the learned document embedding, we utilize each “feature” (dimension) of the embedded document to evaluate the importance of the scores assigned by the attribution methods in the corresponding prediction task.
Similar to the scores defined in Section 4.1, the feature-based attribution score for word j in document i and embedding column k is denoted r_{i,j,k}, and the total attribution score for word j is r_{i,j} = Σ_k r_{i,j,k}. The attribution score for embedding column k of the document embedding is calculated by adding the relevance scores of the words in that document, r_{i,k} = Σ_j r_{i,j,k}. Feature removal was done by setting all values in the corresponding columns to 0. The evaluation was carried out in three different settings:
Removing features with the largest attribution scores.
The embedding columns with the largest attribution scores for the true class were removed. The accuracy was therefore expected to be lower. For a correctly classified document, the predicted probability for its true class should be lower.
Removing features with the smallest attribution scores. The embedding columns with the smallest absolute attribution scores for the true class were removed. For both methods, the predicted probability should not be affected more than by randomly removing an embedding column. The purpose of this evaluation was to assess whether features with low attribution scores are truly unimportant features.
Removing features that contribute differently for different classes. For a document i, the attribution difference between the true class c and another class c' for embedding column k is Δr_{i,k} = r_{i,k}^{c} − r_{i,k}^{c'}. When the columns with the largest attribution differences were removed, the predicted probability for class c should decrease while the probability for class c' should increase. This setting was only applied to classification tasks with multiple classes.
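The three removal settings all reduce to zeroing selected embedding columns. A sketch with a linear scorer, where each column's exact contribution to the class score is known in closed form, illustrates the expected behaviour (the scorer, sizes, and random data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
D, C = 16, 4
W = rng.normal(size=(D, C))

doc = rng.normal(size=D)                 # an embedded document
# Per-column attribution toward class 0; in this linear model it is the
# exact contribution of each column to the class-0 score.
col_attr = doc * W[:, 0]

def remove_columns(vec, cols):
    out = vec.copy()
    out[cols] = 0.0                      # "feature removal": zero the columns
    return out

k = 3
largest = np.argsort(col_attr)[-k:]      # columns with largest attribution
smallest = np.argsort(np.abs(col_attr))[:k]  # columns with smallest |attribution|

score = lambda v: (v @ W)[0]
base = score(doc)
drop_large = base - score(remove_columns(doc, largest))
drop_small = base - score(remove_columns(doc, smallest))
```

Removing the columns with the largest attributions causes a much larger score drop than removing the least relevant ones, which is exactly what the evaluation settings test.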
4.3 Embedded n-gram relevance
In our TextCNN model, the learned feature representation from the convolutional layer hypothetically represents the n-gram features. For each filter, only the convolution window with the maximum value has an impact on the output (after the max-pooling layer). Thus, we assume that removing a filter of a convolutional layer is equivalent to removing the representation of an n-gram feature. Here, we define a filter of a convolutional layer as a “feature”. Each filter is assigned one non-zero attribution score, which represents the attribution score of the n-grams of the input sequence. Likewise, the evaluation was conducted in the three different settings explained in Section 4.2.
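Removing a filter amounts to zeroing its max-pooled activation before the dense layer. A linear sketch (sizes, random weights, and the choice of class 1 are illustrative) shows the effect on the class score:

```python
import numpy as np

rng = np.random.default_rng(5)
F, C = 24, 2                    # pooled filters, classes
dense_w = rng.normal(size=(F, C))

pooled = np.maximum(rng.normal(size=F), 0)   # max-pooled filter activations
# Attribution of each filter toward class 1; in this linear head it is the
# exact contribution of the filter's n-gram feature to the class score.
filter_attr = pooled * dense_w[:, 1]

def drop_filters(p, idx):
    out = p.copy()
    out[idx] = 0.0               # removing a filter removes its n-gram feature
    return out

top = np.argsort(filter_attr)[-3:]           # most relevant filters
score = lambda p: (p @ dense_w)[1]
delta = score(pooled) - score(drop_filters(pooled, top))
```

The score drop equals the summed attribution of the removed filters, mirroring the filter-removal evaluation on the real model.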
5 Experiments and analysis
5.1 Data sets
Table 1 shows three data sets that were used in this study and their corresponding statistics. TextCNN was trained on these three datasets. The corresponding classification performance is shown in Table 2.
Yelp reviews (public data set)
The data set (https://www.yelp.com/dataset) is a collection of customer reviews on Yelp. For every review text, the customer gave a “stars” rating ranging from 1 to 5, where a higher rating indicates a more positive review. We removed neutral reviews with 3 stars, relabeled the reviews with 1 and 2 stars as label 0, and the reviews with 4 and 5 stars as label 1. As a result, the classification task on the Yelp review data set was binary.
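The label construction can be sketched with a hypothetical helper that reflects the mapping described above (the helper name and the sample reviews are illustrative):

```python
# Star-to-label mapping for the binary Yelp task: drop neutral 3-star
# reviews, map 1-2 stars to label 0 and 4-5 stars to label 1.
def star_to_label(stars):
    if stars == 3:
        return None            # neutral review: removed from the data set
    return 0 if stars < 3 else 1

reviews = [(5, "great food"), (3, "it was ok"), (1, "never again")]
labeled = [(text, star_to_label(s)) for s, text in reviews
           if star_to_label(s) is not None]
```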
US consumer finance complaints (public data set)
The data set (https://www.kaggle.com/cfpb/us-consumer-finance-complaints) contains customer complaints about 11 financial products and services. Each complaint contains one or more sentences.
Customer Due Diligence (CDD) reports (corporate data set)
This data set is an extract of customer reports from Customer Due Diligence (CDD) cases. It contains pre-processed text reports with the corresponding risk-based labels, i.e. whether the customer is categorized as “low” (class “0”) or “high” (class “1”) financial risk.
(Table 1: data set statistics, including size, vocabulary in number of words, and document lengths.)
5.2 Evaluating embedded-word relevance
Figures 2 and 3 show the visualization of the two attribution methods on correctly classified “0” and “1” Yelp reviews, respectively. Positive scores (positive contribution to class “1”) are shaded in red, while negative scores (negative contribution to class “1”) are highlighted in blue. From Figure 2, we can see that LRP was able to highlight the compositionality of negative words (e.g. “no stars”) that contribute to the negative class “0”. SA could find a negation (the word “no”), but not as a phrase or combination of words. Both attribution methods were able to put relevance scores on the phrase with an excessive expression (“too”), but SA put a higher weight on this type of phrase. In the example of a positive review (Figure 3), LRP assigned a higher relevance score to positive words (e.g. “good”), while SA did not assign the score to the same word or phrase.
For measuring the quality of the embedded-word relevance scores, we employed different weighting schemes of the document embedding (i.e. based on the scores assigned after the embedding layer) as input to a classifier. The comparison of four classifiers is shown in Table 3. “w-0” denotes the unweighted document embedding as input, “w-LRP” the LRP-weighted, and “w-SA” the saliency-weighted document representation. On all three data sets, the LRP-weighted document representation achieved higher accuracy than the non-weighted and saliency-weighted ones. In this experiment, with LRP attribution as the weighting factor, words that are relevant to the actual class label were assigned larger weights and thus became more influential in the generated document representations. Saliency-based weighting (w-SA), in contrast, is not always distinctive; as a result, its classification performance is often similar to or even lower than that of the non-weighted document embedding.
(Table 3: classification accuracy per classifier on the Yelp reviews, US customer complaints, and CDD reports, comparing the w-0, w-LRP, and w-SA representations.)
5.3 Evaluating embedded-document relevance
5.3.1 On binary classification task (Yelp review and CDD reports)
To measure the quality of the attribution scores, in this experiment the columns in the embedded documents (referred to as features) were gradually removed. While removing the features with the largest (Table 4) or smallest absolute (Table 5) attribution scores, the model accuracy was recorded to assess whether the truly relevant features had been identified. In Table 4, LRP resulted in a larger decrease in model accuracy when removing the most relevant features. In Table 5, compared to random feature removal, LRP and SA were both able to preserve the accuracy when removing the least relevant features.
(Tables 4 and 5: model accuracy per number of removed features, for positive (class “1”) and negative (class “0”) Yelp reviews and for high-risk (class “1”) and low-risk (class “0”) CDD reports.)
5.3.2 On multi-class classification task (US customer financial complaints)
In this experiment, 415 documents that were correctly classified as class "0" (bank account or service) were investigated. Based on both LRP and saliency attribution scores, as well as the attribution differences between actual class and other classes, we gradually removed embedding columns with the largest relevance. Figure 4 shows the changes in model accuracy. A significant decline in model accuracy can be observed for the LRP attributions. When using the saliency approach, the accuracy change is similar to random feature removal.
(Table 6: number of predictions per class against the number of removed columns, for attr_0, attr_0 - attr_2, and attr_0 - attr_3.)
The accuracy decrease was also observed when using perturbation based on the LRP attribution differences. To investigate how the model prediction is altered based on attribution differences, the number of mis-classifications in each class was recorded while the embedding columns were gradually removed, as shown in Table 6. Documents correctly classified as class “0” (bank account or service) were used as a baseline (“attr_0”). To investigate the role of attribution differences, we chose as examples the attribution differences between the true class “0” and class “2” (credit card) (“attr_0 - attr_2”), and between the true class “0” and class “3” (credit reporting) (“attr_0 - attr_3”).
When attributions towards the true class were used, the number of correctly classified documents was smaller with the LRP approach, which is consistent with the results presented in Figure 4. What is worth noticing is that when using the LRP attribution differences, the prediction is guided towards favoring a certain class. When applying attribution differences between the true class and class “2”, for instance, the number of documents mis-classified as “2” is significantly larger than with the other feature removal metrics. We make the same observation for the attribution differences between the true class and class “3”. This shows that the attribution-difference removal method, in addition to removing the largest and smallest relevance scores, can also be used to evaluate the quality of attribution methods.
5.4 Evaluating embedded n-gram relevance
5.4.1 On binary classification task (Yelp review and CDD reports)
(Table 7: model accuracy per number of removed filters when removing relevant vs. irrelevant features, for the Yelp reviews and CDD reports.)
Table 7 invites similar observations as in Section 5.3, now using the convolutional-filter feature removal method. When removing relevant features, the LRP-based approach resulted in a larger impact on model accuracy. Likewise, when removing irrelevant features, both LRP and SA were able to preserve model accuracy compared to random feature removal.
5.4.2 On multi-class classification task (US consumer financial complaints)
Similar to the procedure described in Section 5.3.2, only the documents that were correctly classified were investigated. Instead of removing relevant embedding columns, convolutional filters were regarded as the features to be assessed. In both the LRP and saliency approaches, the model accuracy decreased drastically as n-gram influences at certain positions were removed from the model. To investigate whether the predictions were guided towards a certain class, the number of mis-classifications for each class is also recorded in Table 8. While feature removal based on attributions of the true class was able to alter the predictions towards classes “2” and “3”, the mis-classification numbers were significantly higher when using the LRP and saliency attribution differences. The predictions were indeed guided towards the desired classes.
(Table 8: number of predictions per class against the number of removed filters, for attr_0, attr_0 - attr_2, and attr_0 - attr_3.)
6 Conclusion
In this paper, we presented an experimental study on feature-based perturbations for evaluating attribution-based explanations of a CNN model for text classification (TextCNN). Instead of a “word-deleting” evaluation, we investigated attribution-based explanations at different layers of TextCNN. Our experimental analysis was performed on two public data sets (Yelp reviews and US customer complaints) and on customer reports extracted from CDD cases of a financial institution, using three different levels of attribution scores: the embedded-word level, the embedded-document level, and the embedded n-gram level. Our proposed evaluation was able to assess the quality of attribution scores with a measurable metric, while showing the differences between explanation approaches. The results of our experimental study suggest that LRP is better at finding features that are relevant to the prediction. By investigating attribution differences, we were also able to analyze whether the model’s prediction is guided towards a certain outcome. We provide a visualization tool that offers deeper insight into the model’s predictions by visualizing the LRP attributions, as well as the attribution differences between classes, on individual words and n-grams.
- Collobert et al. (2011) Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel P. Kuksa. Natural language processing (almost) from scratch. CoRR, abs/1103.0398, 2011. URL http://arxiv.org/abs/1103.0398.
- Kim (2014) Yoon Kim. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, Doha, Qatar, October 2014. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/D14-1181.
- Conneau et al. (2017) Alexis Conneau, Holger Schwenk, Loïc Barrault, and Yann Lecun. Very deep convolutional networks for text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1107–1116, Valencia, Spain, April 2017. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E17-1104.
- Simonyan et al. (2013) Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. CoRR, abs/1312.6034, 2013. URL http://arxiv.org/abs/1312.6034.
- Samek et al. (2017) W. Samek, A. Binder, G. Montavon, S. Lapuschkin, and K. Müller. Evaluating the visualization of what a deep neural network has learned. IEEE Transactions on Neural Networks and Learning Systems, 28(11):2660–2673, Nov 2017. ISSN 2162-237X. doi: 10.1109/TNNLS.2016.2599820.
- Ancona et al. (2018) Marco Ancona, Enea Ceolini, Cengiz Oztireli, and Markus Gross. Towards better understanding of gradient-based attribution methods for deep neural networks. In 6th International Conference on Learning Representations (ICLR 2018), 2018.
- Nguyen (2018) Dong Nguyen. Comparing automatic and human evaluation of local explanations for text classification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), volume 1, pages 1069–1078, 2018.
- Li et al. (2016) Jiwei Li, Xinlei Chen, Eduard Hovy, and Dan Jurafsky. Visualizing and understanding neural models in nlp. In Proceedings of NAACL-HLT, pages 681–691, 2016.
- Arras et al. (2016) Leila Arras, Franziska Horn, Grégoire Montavon, Klaus-Robert Müller, and Wojciech Samek. Explaining predictions of non-linear classifiers in nlp. In Proceedings of the 1st Workshop on Representation Learning for NLP, pages 1–7, 2016.
- Arras et al. (2017) Leila Arras, Franziska Horn, Grégoire Montavon, Klaus-Robert Müller, and Wojciech Samek. "What is relevant in a text document?": An interpretable machine learning approach. PLoS ONE, 12(8):e0181142, 2017.