Learning Explanations from Language Data
In the last decade, deep neural classifiers have achieved state-of-the-art results in many domains, among others in vision and language. Due to the complexity of deep neural models, however, it is difficult to explain their decisions. Understanding the decision process potentially allows us to improve the model and may reveal new knowledge about the input.
Recently, Kindermans et al. (2018) claimed that “popular explanation approaches for neural networks (…) do not provide the correct explanation, even for a simple linear model.” They show that in a linear model, the weights serve to cancel noise in the input data and thus the weights show how to extract the signal but not what the signal is. This is why explanation methods need to move beyond the weights, the authors explain, and they propose the methods “PatternNet” and “PatternAttribution” that learn explanations from data. We test their approach in the language domain and point to room for improvement in the new framework.
Kindermans et al. (2018) assume that the data $x$ passed to a linear model $y = w^\top x$ is composed of signal $s$ and noise $d$ (from "distractor"): $x = s + d$. Furthermore, they also assume that there is a linear relation between signal and target, $s = a y$, where $a$ is a so-called signal base vector, which is in fact the "pattern" that PatternNet finds for us. As mentioned in the introduction, the authors show that in the model above, $w$ serves to cancel the noise such that

$$w^\top d = 0 \quad\text{and}\quad w^\top s = y. \tag{1}$$

They go on to explain that a good signal estimator $S(x) = \hat{s}$ should comply with the conditions in Eq. 1, but that these alone form an ill-posed quality criterion, since $\hat{s} = x - u$ already satisfies them for any $u$ for which $w^\top u = 0$. To address this issue they introduce another quality criterion over a batch of data:

$$\rho(S) = 1 - \max_{v} \operatorname{corr}\!\left(w^\top x,\; v^\top (x - S(x))\right) \tag{2}$$
and point out that Eq. 2 yields maximum values for signal estimators that remove most of the information about $y$ from the noise.
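As a sanity check, the noise-cancellation property and the Eq. 2 criterion can be simulated on synthetic data. The following is a minimal sketch (our construction, not the authors' code), assuming a two-dimensional toy model with known signal and distractor directions; the maximization over $v$ is approximated by a scan over directions:

```python
# Toy sketch: verify that the weight vector w of a linear model cancels
# the distractor d, and evaluate the Eq.-2-style quality criterion
# rho(S) = 1 - max_v |corr(w^T x, v^T (x - S(x)))| for two estimators.
import numpy as np

rng = np.random.default_rng(0)
n = 10000

a = np.array([1.0, 0.0])          # signal base vector ("pattern")
b = np.array([1.0, 1.0])          # distractor direction
y = rng.normal(size=n)
eps = rng.normal(size=n)
x = np.outer(y, a) + np.outer(eps, b)   # x = s + d

w = np.array([1.0, -1.0])         # cancels the distractor: w.b = 0, w.a = 1
assert np.allclose(x @ w, y)      # the weights extract the target exactly

def rho(S_x):
    """1 - max_v |corr(w^T x, v^T (x - S(x)))|, maximization over v
    approximated by scanning directions on the unit circle."""
    noise = x - S_x
    best = 0.0
    for theta in np.linspace(0.0, np.pi, 181):
        v = np.array([np.cos(theta), np.sin(theta)])
        c = np.corrcoef(x @ w, noise @ v)[0, 1]
        if not np.isnan(c):
            best = max(best, abs(c))
    return 1.0 - best

S_perfect = np.outer(y, a)        # S(x) = s: noise estimate is pure noise
S_swapped = np.outer(eps, b)      # S(x) = d: noise estimate is the signal
print(rho(S_perfect), rho(S_swapped))
```

The perfect estimator scores near 1 (its noise estimate carries no information about $y$), while the swapped estimator scores near 0.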
We argue that Eq. 2 still is not exhaustive. Consider the artificial estimator

$$S^{*}(x) = s + 2d. \tag{3}$$

Its noise estimate $x - S^{*}(x) = -d$ is again just scaled noise and thus does not correlate with the output $y$, so $S^{*}$ scores maximally under Eq. 2 even though it retains noise. To solve this issue, we propose the following criterion:

$$\rho'(S) = \Big(1 - \max_{u,\,v} \operatorname{corr}\!\big(u^\top S(x),\; v^\top (x - S(x))\big)\Big) \;-\; \max_{v} \operatorname{corr}\!\big(w^\top x,\; v^\top (x - S(x))\big) \tag{4}$$
The minuend is large when little noise is left in the signal estimate; the subtrahend is large when much signal is left in the noise estimate. Good signal estimators split signal and noise well and thus yield a large $\rho'$. We leave it to future research to evaluate existing signal estimators with our new criterion.
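The failure case can be illustrated numerically. The sketch below (our construction; the toy dimensions and the grid search over projection directions are assumptions) shows that an estimator retaining scaled noise still drives the Eq. 2 correlation to zero, whereas the correlation between signal and noise estimates exposes it:

```python
# Toy sketch: the artificial estimator S*(x) = s + 2d leaves only scaled
# noise -d in its noise estimate, so checking corr(w^T x, noise estimate)
# alone cannot penalize it.  Checking the correlation between signal
# estimate and noise estimate additionally does.
import numpy as np

rng = np.random.default_rng(1)
n = 2000
a, b = np.array([1.0, 0.0]), np.array([1.0, 1.0])
y, eps = rng.normal(size=n), rng.normal(size=n)
s, d = np.outer(y, a), np.outer(eps, b)
x = s + d
w = np.array([1.0, -1.0])            # cancels the distractor: w.b = 0

dirs = [np.array([np.cos(t), np.sin(t)]) for t in np.linspace(0.0, np.pi, 37)]

def max_corr_vec(scores, M):
    """max_v |corr(scores, M v)| over a grid of directions v."""
    cs = [np.corrcoef(scores, M @ v)[0, 1] for v in dirs]
    return max(abs(c) for c in cs if not np.isnan(c))

def max_cross_corr(A, B):
    """max_{u,v} |corr(A u, B v)| over a grid of directions u, v."""
    return max(max_corr_vec(A @ u, B) for u in dirs)

results = {}
for name, S_x in [("perfect", s), ("artificial", s + 2 * d)]:
    noise = x - S_x
    results[name] = (max_corr_vec(x @ w, noise),   # signal left in the noise
                     max_cross_corr(S_x, noise))   # noise left in the signal
print(results)
```

For both estimators the first score (signal left in the noise) is near zero, so a criterion built only on it cannot distinguish them; the second score is near zero only for the perfect estimator and near one for the artificial one.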
For our experiments, the authors equip us with expressions for the signal base vectors $a$ for simple linear layers and for ReLU layers. For the simple linear model, for instance, it turns out that

$$a = \frac{\operatorname{cov}[x, y]}{\sigma_y^2}. \tag{5}$$

To retrieve contributions with PatternAttribution, in the backward pass the authors replace the weights $w$ by $w \odot a$.
We used 150 bigram filters, dropout regularization, and a dense fully connected projection with 128 neurons. Our classifier achieves an F1 score of 0.875 on a fixed test split. We then used Kindermans et al.'s (2018) PatternAttribution to retrieve neuron-wise signal contributions in the input vector space.¹

¹ Our experiments are available at https://github.com/DFKI-NLP/language-attributions.
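The forward pass of such a classifier can be sketched in a few lines of numpy. Embedding size, vocabulary size, and the sigmoid output are our assumptions (they are not stated above); dropout is omitted because it is inactive at inference time:

```python
# Minimal numpy sketch of the classifier's forward pass: 150 bigram
# (width-2) convolution filters with ReLU, max-over-time pooling, and a
# dense 128-unit projection.  All weights are random placeholders.
import numpy as np

rng = np.random.default_rng(3)
vocab, emb_dim, n_filters, hidden = 1000, 50, 150, 128

E = rng.normal(scale=0.1, size=(vocab, emb_dim))          # embedding table
F = rng.normal(scale=0.1, size=(n_filters, 2 * emb_dim))  # bigram filters
W1 = rng.normal(scale=0.1, size=(hidden, n_filters))      # dense projection
w2 = rng.normal(scale=0.1, size=hidden)                   # output layer

def forward(token_ids):
    x = E[token_ids]                                      # (T, emb_dim)
    bigrams = np.concatenate([x[:-1], x[1:]], axis=1)     # (T-1, 2*emb_dim)
    conv = np.maximum(bigrams @ F.T, 0.0)                 # ReLU feature maps
    pooled = conv.max(axis=0)                             # max over time
    h = np.maximum(W1 @ pooled, 0.0)                      # dense + ReLU
    return 1.0 / (1.0 + np.exp(-w2 @ h))                  # sentiment prob.

p = forward(rng.integers(0, vocab, size=20))
print(p)
```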
To align these contributions with plain text, we summed up the contribution scores over the word vector dimensions for each word and used the accumulated scores to scale RGB values for word highlights in the plain text space. Positive scores are highlighted in red, negative scores in blue. This approach is inspired by Arras et al. (2017a). Example contributions are shown in Figs. 1 and 2.
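This score-to-color mapping can be sketched as follows (our construction; the function name and the rgba-based HTML highlighting are assumptions, not the authors' implementation):

```python
# Sketch of the highlighting step: sum neuron-wise contribution scores over
# the word-vector dimensions per word, normalize, and render each word with
# a red (positive) or blue (negative) background of matching intensity.
import numpy as np

def highlight_html(words, contributions):
    """contributions: array of shape (n_words, emb_dim)."""
    scores = contributions.sum(axis=1)          # accumulate per word
    scale = np.abs(scores).max() or 1.0         # normalize into [-1, 1]
    spans = []
    for word, score in zip(words, scores / scale):
        if score >= 0:                          # positive scores: red
            color = f"rgba(255, 0, 0, {score:.2f})"
        else:                                   # negative scores: blue
            color = f"rgba(0, 0, 255, {-score:.2f})"
        spans.append(f'<span style="background-color: {color}">{word}</span>')
    return " ".join(spans)

words = ["a", "great", "movie"]
contribs = np.array([[0.0, 0.0], [0.5, 0.3], [0.1, -0.2]])
html = highlight_html(words, contribs)
print(html)
```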
We observe that bigrams are highlighted; in particular, no highlighted token stands in isolation. Bigrams with clear positive or negative sentiment contribute heavily to the sentiment classification. In contrast, stop words and uninformative bigrams make little to no contribution. We consider these to be meaningful explanations of the sentiment classifications.
Many of the approaches used to explain and interpret models in NLP mirror methods originally developed in the vision domain, such as the recent approaches by Li et al. (2016), Arras et al. (2017a), and Arras et al. (2017b). In this paper we implemented a similar strategy.
Following Kindermans et al. (2018), however, our approach improves upon the latter methods for the reasons outlined above. Furthermore, PatternAttribution is related to Montavon et al. (2017) who make use of Taylor decompositions to explain deep models. PatternAttribution reveals a good root point for the decomposition, the authors explain.
We successfully transferred a new explanation method to the NLP domain. We were able to demonstrate that PatternAttribution can be used to identify meaningful signal contributions in text inputs. Our method should be extended to other popular models in NLP. Furthermore, we introduced an improved quality criterion for signal estimators. In the future, estimators can be deduced from and tested against our new criterion.

* Co-first authorship.
This research was partially supported by the German Federal Ministry of Education and Research through the projects DEEPLEE (01IW17001) and BBDC (01IS14013E).
Leila Arras, Franziska Horn, Grégoire Montavon, Klaus-Robert Müller, and Wojciech Samek. 2017a. "What is relevant in a text document?": An interpretable machine learning approach. PLOS ONE, 12(8).
Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751.