1 Introduction

We propose a method for explaining classes in text classification tasks using deep learning models and feature attribution techniques, in particular the Integrated Gradients (IG) method introduced by Sundararajan et al. (2017). We focus on IG because it provides a general framework for estimating feature importance in deep neural networks and has been shown to produce reliable saliency maps in text classification tasks (Bastings and Filippova, 2020; Kokhlikyan et al., 2020).
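As background, the IG attribution of Sundararajan et al. (2017) can be sketched numerically: each input feature is credited with the path integral of the model's gradient along a straight line from a baseline to the input. The helper below approximates that integral with a midpoint Riemann sum on a toy differentiable function; the function, baseline, and step count are illustrative, not the paper's actual setup.

```python
import numpy as np

def integrated_gradients(f, grad_f, x, baseline, steps=50):
    """Approximate IG_i(x) = (x_i - x'_i) * integral_0^1 dF/dx_i(x' + a(x - x')) da
    with a midpoint Riemann sum over `steps` points on the straight-line path."""
    diff = x - baseline
    alphas = (np.arange(steps) + 0.5) / steps          # midpoints in (0, 1)
    grads = np.stack([grad_f(baseline + a * diff) for a in alphas])
    return diff * grads.mean(axis=0)                   # averaged gradient * displacement

# Toy function: f(x) = x0^2 + 3*x1, with gradient (2*x0, 3).
f = lambda x: x[0] ** 2 + 3 * x[1]
grad_f = lambda x: np.array([2 * x[0], 3.0])
x = np.array([1.0, 2.0])
baseline = np.zeros(2)
attr = integrated_gradients(f, grad_f, x, baseline)
# Completeness axiom: attributions sum to f(x) - f(baseline) = 7.
```

The completeness property (attributions summing to the difference in model output) is what makes the per-token scores in the steps below comparable and aggregatable across documents.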
Recently, explaining the predictions of deep neural networks has attracted considerable research interest in fields such as NLP and computer vision. Given the importance of this endeavour, several techniques have been suggested for interpreting model predictions (see Montavon et al., 2018, for a recent discussion). Nevertheless, these tend to focus on explaining individual predictions rather than how models perceive whole classes. To the best of our knowledge, we present the first method for aggregating explanations of individual examples in text classification into general descriptions of the classes. The method consists of three steps: 1) repeated model training and application of IG on random train/test splits, 2) aggregation of the word scores of individual examples and extraction of keywords, and 3) filtering to remove spurious keywords.
We test this method by training Transformer-based text classifiers on a large Web register identification corpus and show that it is able to provide descriptive keywords for the classes. The class descriptions provide both linguistic insight and a means for analyzing and debugging neural classification models in text classification.
2 Data and classifier
In our experiments, we focus on text classification using the Corpus of Online Registers of English (CORE; Egbert et al., 2015), a large-scale collection of Web texts annotated for their register (genre; Biber, 1988). The CORE registers are coded using a two-level taxonomy. In this study, we focus on the upper level, which consists of eight register classes: Narrative (NA), Opinion (OP), How-to (HI), Interactive discussion (ID), Informational description (IN), Lyrical (LY), Spoken (SP) and Informational persuasion (IP). The dataset features the full range of registers found on the unrestricted open Web and consists of nearly 50,000 texts. In our experiments, we combine the train and development sets, totaling 38,760 documents.
Web registers have been frequently studied in recent research, both in linguistics and in NLP (Titak and Robertson, 2013; Dayter and Messerli, 2021; Madjarov et al., 2019; Biber and Egbert, 2019). The range of linguistic variation has, however, caused challenges for both fields, and, in particular, Web register identification studies have lacked robustness (Sharoff et al., 2010; Petrenz and Webber, 2011). The method we propose in this study can benefit both fields, as it provides insight into classification models and the corpora they are trained on, including potential biases.
As a classifier, we use the XLM-R deep language model (Conneau et al., 2020) because of its strong ability to model multiple languages, both in monolingual and cross-lingual settings. We use the base size, since it requires fewer resources and its predictive performance on the CORE corpus is competitive with XLM-R large (Repo et al., 2021). The task is modeled as multilabel classification.
3 Method

The descriptions of the classes are extracted through the following steps:
Step 1: Train and explain. We combine the training and development sets of the corpus and randomly split them into new training and validation sets according to a set ratio, using stratification to keep class distributions stable (cf. Laippala et al., 2021). The pre-trained language model is loaded and the decision layer (a sequence regression head) is randomly initialized; both are fine-tuned on the new training set. The texts in the validation set are then classified, and the IG method is applied to obtain attribution scores for the network inputs, i.e., each dimension of each input token embedding, with respect to each predicted class. The embedding dimensions are summed per token to provide a token-level score, and all token scores in a document are normalized by the L norm. This provides the word attribution score directly if the word consists of a single token; otherwise, it is calculated as the maximum of the sub-word token scores.
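The token-to-word aggregation at the end of Step 1 can be sketched as below. The L2 norm and the word-ids mapping (as produced by a Hugging Face-style subword tokenizer) are assumptions for illustration; the max-over-subwords rule and the per-document normalization follow the description above.

```python
import numpy as np

def word_attributions(token_scores, word_ids):
    """Collapse token-level IG scores (embedding dims already summed per token)
    to word-level scores: normalize the document's token scores by a vector
    norm (L2 assumed here), then take the max over each word's sub-word tokens."""
    token_scores = np.asarray(token_scores, dtype=float)
    norm = np.linalg.norm(token_scores)
    if norm > 0:
        token_scores = token_scores / norm
    n_words = max(word_ids) + 1
    scores = np.full(n_words, -np.inf)
    for score, w in zip(token_scores, word_ids):
        scores[w] = max(scores[w], score)   # max over sub-word tokens
    return scores

# One word split into two sub-word tokens (word id 0), plus two one-token words.
scores = word_attributions([3.0, 4.0, 0.0, 0.0], [0, 0, 1, 2])
# -> [0.8, 0.0, 0.0] after L2 normalization by 5.0
```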
Step 2: Aggregate attributions. We calculate the average attribution score for each word and class as a means of ranking keywords per class. To reduce noise, we only select the top-scoring words per document, and we only consider true positive predictions. We note that the method could alternatively be used for error analysis by targeting false predictions.
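A minimal sketch of this aggregation step; the per-document cutoff `k`, the toy documents, and the word scores are illustrative, and the input is assumed to already be restricted to true positive predictions as described above.

```python
from collections import defaultdict
import heapq

def aggregate_keywords(documents, k=2):
    """Average attribution per (class, word), restricted to each document's
    k top-scoring words. `documents` is a list of
    (predicted_class, {word: attribution}) pairs (true positives only)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for cls, word_scores in documents:
        top = heapq.nlargest(k, word_scores.items(), key=lambda kv: kv[1])
        for word, score in top:
            sums[(cls, word)] += score
            counts[(cls, word)] += 1
    return {key: sums[key] / counts[key] for key in sums}

docs = [
    ("ID", {"forum": 0.9, "answer": 0.7, "the": 0.1}),
    ("ID", {"forum": 0.5, "question": 0.8, "of": 0.0}),
]
avg = aggregate_keywords(docs, k=2)
# "forum" appears in both documents' top-2: (0.9 + 0.5) / 2 = 0.7;
# "the" and "of" never reach the top-2 and are excluded.
```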
Step 3: Select stable keywords. The above process is repeated a number of times, each time randomly shuffling and splitting the data as in Step 1, in order to quantify the stability of the keywords. The keyword candidates, ranked by their average attribution scores, are filtered based on selection frequency: a word is considered stable if the ratio of experiments in which it is selected (in Step 2) exceeds a threshold value. We also ignore words that fall below a minimum document frequency in the corpus.
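The stability filter can be sketched as follows; the threshold and minimum document frequency values here are illustrative placeholders, not the paper's parameters.

```python
def stable_keywords(runs, threshold, doc_freq, min_df):
    """Keep keywords whose selection ratio across the repeated runs exceeds
    `threshold`, dropping words below a minimum corpus document frequency.
    `runs` holds one set of selected keywords per train/explain repetition."""
    n = len(runs)
    counts = {}
    for selected in runs:
        for w in selected:
            counts[w] = counts.get(w, 0) + 1
    return {
        w for w, c in counts.items()
        if c / n > threshold and doc_freq.get(w, 0) > min_df
    }

runs = [{"forum", "answer"}, {"forum", "cat"}, {"forum", "answer"}]
df = {"forum": 50, "answer": 30, "cat": 1}
kept = stable_keywords(runs, threshold=0.6, doc_freq=df, min_df=2)
# "forum" (3/3) and "answer" (2/3) pass; "cat" (1/3, df=1) is dropped.
```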
The selection frequency filtering allows us to remove keywords that are unstable across runs and likely reflect spurious features, resulting, for instance, from an unrepresentative split of the data or from stochastic factors in the training of the classifier itself. McCoy et al. (2020) show, in repeated experiments on a text inference task with random initialization of the decision layer and randomized ordering of training examples, that while consistent test set performance was achieved, the degree of generalization as measured on a related task varied significantly. Similarly, we test the persistence and presumed generalizability of the estimated keywords by accounting for randomness both in training and in data selection.
In our experiments, we repeated the train-and-explain process 100 times, with fixed values for the train/validation split ratio, the number of top-scoring words per document, the selection frequency threshold, and the minimum document frequency.
4 Results

The classifiers trained in our 100 experiments achieved a mean micro-average F1-score of 65.10% and mean class-wise F1-scores in the range 26.45%–82.92% for the eight main register classes (see Table 1 in Appendix). The Spoken (SP) class stands out as a difficult case with notably unstable performance, partly due to its small size.
Our method was able to produce descriptive keywords that clearly reflect our understanding of all the main classes (see Table 2 in Appendix) except for the Spoken class, where no keyword surpassed the selection frequency threshold. The keywords reflect both topical and functional features typical of the registers. For instance, the highest scoring words for Interactive discussion (ID) were question, faq, forum, answer. Similarly, we observe other register-specific linguistic characteristics, such as words associated with research papers in Informational description (IN) and with news in Narrative (NA). The keywords also share many similarities with keywords produced with other methods applied in previous studies (e.g., Biber and Egbert, 2019; Laippala et al., 2021).
Furthermore, the estimated keywords display strong discriminative power, as indicated by their uniqueness in the respective register classes. On average, 82 of the top 100 keywords for a given register were not shared with the other registers, demonstrating that the method was able to identify register-specific keywords. Moreover, the selection frequency of the keywords across the 100 rounds demonstrated their stability: they are consistently identified, often in over 90% of the repetitions.
5 Conclusion

We have proposed a method for describing classes in a text classification task based on IG attributions on predictions, and shown that it produces stable and interpretable results for Web register classification with XLM-R. We see the method as generally applicable and useful for studying text classes beyond registers. In the future, we seek to extend the method and its evaluation, and to apply the approach to other languages and cross-lingual settings. In particular, the comparison of monolingual and zero-shot models will be informative both of the linguistic characteristics of registers and of what models such as XLM-R learn to recognize.
References

- Bastings and Filippova (2020). The elephant in the interpretability room: why use attention as explanation when we have saliency methods? In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pp. 149–155.
- Biber (1988). Variation across speech and writing. Cambridge University Press, Cambridge.
- Biber and Egbert (2019). Register variation online. Cambridge University Press, Cambridge.
- Conneau et al. (2020). Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8440–8451.
- Dayter and Messerli (2021). Persuasive language and features of formality on the r/changemyview subreddit. Internet Pragmatics.
- Egbert et al. (2015). Developing a bottom-up, user-based method of web register classification. Journal of the Association for Information Science and Technology 66, pp. 1817–1831.
- Kokhlikyan et al. (2020). Captum: a unified and generic model interpretability library for PyTorch. arXiv preprint arXiv:2009.07896.
- Laippala et al. (2021). Exploring the role of lexis and grammar for the stable identification of register in an unrestricted corpus of web documents. Language Resources and Evaluation.
- Madjarov et al. (2019). Web genre classification with methods for structured output prediction. Information Sciences 503, pp. 551–573.
- McCoy et al. (2020). BERTs of a feather do not generalize together: large variability in generalization across models with similar test set performance. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pp. 217–227.
- Montavon et al. (2018). Methods for interpreting and understanding deep neural networks. Digital Signal Processing 73, pp. 1–15.
- Petrenz and Webber (2011). Stable classification of text genres. Computational Linguistics 37(2), pp. 385–393.
- Repo et al. (2021). Beyond the English web: zero-shot cross-lingual and lightweight monolingual classification of registers. In Proceedings of the EACL 2021 Student Research Workshop.
- Sharoff et al. (2010). The web library of Babel: evaluating genre collections. In Proceedings of LREC.
- Sundararajan et al. (2017). Axiomatic attribution for deep networks. In International Conference on Machine Learning, pp. 3319–3328.
- Titak and Robertson (2013). Dimensions of web registers: an exploratory multidimensional comparison. Corpora 8, pp. 239–271.
Table 1: Mean class-wise F1-score (M), standard deviation (SD), and mean support (Sup.) per register class.

| Class | F1 (M) | SD | Sup. (M) |
| --- | --- | --- | --- |
| Inter. discussion (ID) | 0.7623 | 0.0787 | 876 |
| Inform. description (IN) | 0.6336 | 0.0662 | 3399 |
| Inform. persuasion (IP) | 0.4094 | 0.0573 | 531 |
Table 2: Top keywords for the classes How-to (HI), Inter. discussion (ID), Inform. description (IN), Inform. persuasion (IP), Lyrical (LY), Narrative (NA), and Opinion (OP).