In the last couple of years, neural network models trained on large text corpora and fine-tuned on supervised tasks have been rapidly advancing the state-of-the-art benchmarks in Natural Language Processing (NLP). Recent models like ELMo Peters et al. (2018), GPT Radford et al. (2018), and BERT Devlin et al. (2019) are gradually replacing word embedding models like Word2Vec Mikolov et al. (2013) and GloVe Pennington et al. (2014) as the go-to approaches for tackling NLP tasks. This recent success suggests that neural networks are able to learn syntactic, semantic, and other linguistic information from their training data. However, due to their black-box nature, it remains unclear exactly why neural networks are able to outperform previous state-of-the-art methods by such a large margin.
Recently there has been a growing interest in solving this mystery of how and why neural networks work the way they do. While some researchers have tried to inspect the internal hidden vector representations of models by applying methods like probing classifiers Belinkov et al. (2017), others have examined the outputs of language models under varying input data Linzen et al. (2016). There has also been recent research on analyzing how attention works in such models Clark et al. (2019). All of these works have produced evidence that deep neural language models are capable of encoding some form of syntactic and semantic information. This linguistic knowledge enables models like BERT to tackle challenging tasks in the classical NLP pipeline Tenney et al. (2019). One of the main components behind the recent massive success of neural networks, especially in NLP, is attention. Attention Bahdanau et al. (2014) is a mechanism that determines how important past data is, given the current context: a weight matrix that is used to compute the next representation of the current word in the text.
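To make the notion of an attention weight matrix concrete, the following is a minimal NumPy sketch of single-head scaled dot-product attention; it is an illustration of the general mechanism, not BERT's actual multi-head implementation, and the function name is our own.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Toy single-head attention: each token's new representation is a
    softmax-weighted average of the value vectors of all tokens."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # query-key similarities
    scores -= scores.max(axis=-1, keepdims=True)        # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax: each row sums to 1
    return weights @ V, weights                         # new representations + attention matrix

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                             # 5 tokens, 8-dim embeddings
out, w = scaled_dot_product_attention(x, x, x)          # self-attention: Q = K = V
```

Each row of `w` is the attention distribution of one token over all tokens; these are the values whose layer-by-layer changes we analyze.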
Like Clark et al. (2019), we analyze all the attention heads of BERT, except that we inspect a BERT model fine-tuned for question answering on the SQuAD 2.0 dataset Rajpurkar et al. (2018), with the aim of finding out how important different parts of the input are at each attention layer. We first extract the attention values of each layer during the forward pass and run DeepLIFT Shrikumar et al. (2017) on the results to determine the contribution of each attention component to the output for a given input question. We then try to detect patterns of shifting attention by clustering the resulting representations and analyzing questions typical of each cluster.
Outline of the report. Having briefly discussed background terminology in Section 2, we mention related work in Section 3. Section 4 explains our approach, and we analyze the results of our experiments in Section 5. We conclude in Section 6.
2 Background

2.1 BERT for Question Answering
BERT Devlin et al. (2019) is a large neural network model based on the transformer architecture; it is pre-trained on large amounts of unlabeled text and then fine-tuned on task-specific data. Transformers Vaswani et al. (2017) are large networks made up of multiple encoder and decoder layers, each containing multi-headed attention.
BERT (Bidirectional Encoder Representations from Transformers) is pre-trained on two different tasks: masked language modeling, where the model tries to predict words that have been removed or masked, and next sentence prediction, where it tries to guess whether a statement follows a given proposition. BERT is pre-trained on 3.3 billion English text tokens and then fine-tuned on supervised, task-specific domain data to produce impressive results. The special tokens [CLS] and [SEP] are added to the beginning and end of the input text, respectively. We use the base version of BERT, which has 12 transformer layers with 12 attention heads each, for a total of 144 attention heads.
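As a sketch, the standard way a question-paragraph pair is framed for BERT question answering uses the [CLS] and [SEP] tokens described above. The helper below is our own illustration (hypothetical function name, hand-written token lists rather than actual WordPiece output):

```python
def build_qa_input(question_tokens, paragraph_tokens):
    """Frame a QA pair the way BERT expects: [CLS] question [SEP] paragraph [SEP]."""
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + paragraph_tokens + ["[SEP]"]
    # Segment ids distinguish the question (segment 0) from the paragraph (segment 1).
    segment_ids = [0] * (len(question_tokens) + 2) + [1] * (len(paragraph_tokens) + 1)
    return tokens, segment_ids

tokens, segs = build_qa_input(
    ["when", "did", "beyonce", "start", "becoming", "popular", "?"],
    ["beyonce", "rose", "to", "fame", "in", "the", "late", "1990s", "."],
)
```

The model then predicts a start and an end position within the paragraph segment as the answer span.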
2.2 Attribution Methods

Researchers have used different gradient-based attribution methods to analyze the flow of information inside Deep Neural Networks (DNNs). There are perturbation-based methods like Occlusion Zeiler and Fergus (2014), where the change in output is monitored as a single feature is replaced with a zero baseline. Similarly, there are methods relying on backpropagation, like Gradient*Input Shrikumar et al. (2016), Integrated Gradients Sundararajan et al. (2017), and Layer-wise Relevance Propagation (LRP) Bach et al. (2015). We use DeepLIFT (Deep Learning Important FeaTures) Shrikumar et al. (2017) for this purpose. According to the authors, DeepLIFT decomposes the prediction of a neural network for a specific input by backpropagating once, layer-wise, through the model architecture and monitoring the contribution of each neuron to every input feature. DeepLIFT assigns separate values to positive and negative contributions, and is thus able to reveal dependencies that other methods might miss. It also avoids placing misleading importance on bias terms. A comparative case study of different attribution methods Ancona et al. (2018) shows that DeepLIFT correlates strongly with Integrated Gradients while being faster to compute, making it a good choice for our analysis.
We run DeepLIFT on the final results of our model, using the highest-probability start and end words as the target neurons. DeepLIFT is generally defined for feed-forward networks only and becomes more complicated for multi-headed, multi-layer attention with multiple inputs and inner products. We had to rewrite the backward pass of DeepLIFT ourselves, as PyTorch (https://pytorch.org/) does not support backward hooks for complex modules like the ones utilized by BERT. We also had to change the propagation algorithm from the original DeepLIFT paper, since computing the multipliers in the forward pass requires memory that grows quadratically compared to our implementation.
3 Related Work
As mentioned before, researchers have tried to unravel the mystery of why neural networks work so well. There has been some recent work on analyzing the BERT model to understand whether it is able to encode and learn linguistic information from input text.
Clark et al. (2019) analyzed what BERT looks at and found evidence that attention heads in BERT attend to patterns in data like the next token, delimiters, and periods, with heads in the same layer often exhibiting similar behavior. Certain attention heads relate to specific linguistic information like syntax and coreference, and a substantial amount of linguistic representation can be found in BERT's attention maps. Researchers have also found evidence that BERT represents the classical NLP pipeline Tenney et al. (2019) in a localized, interpretable manner, in the sequence of POS tagging, parsing, NER, semantics, and coreference. It has further been shown Jawahar et al. (2019) that BERT captures phrasal information in its lower layers, followed by syntactic and finally semantic representations in its upper layers. Deep neural networks with more layers are better suited to capturing long-distance dependencies in input text.
4 Approach

Here we use a BERT model fine-tuned on the SQuAD 2.0 dataset. We obtain high accuracy scores, comparable to state-of-the-art benchmark results. However, our main objective in this experiment is not to come up with a new model that beats the current state of the art. Rather, we want to investigate how the multi-headed attention mechanism works in BERT and how similar its shifts in focus on different input tokens are to human thought processes.
The test data is fed into the model in the format described above, with [CLS] and [SEP] tokens framing the question and paragraph. We want to monitor changes in the amount of attention given to each token in the input text. We run DeepLIFT on the results by backpropagating through the layers of our neural network model. DeepLIFT produces a score for each token in the input text at each layer. The scores are either positive or negative, representing higher or lower attention on the tokens, respectively. We focus on the units that receive higher attention in the process, and on the shift in those values. The tokens are highlighted with colors representing their DeepLIFT scores.
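The mapping from signed scores to highlight colors can be sketched as follows: scores are normalized by the largest absolute score, with negative values fading toward blue and positive values toward red. This is a plausible sketch of such a visualization, not the report's actual plotting code; the function name and example scores are our own.

```python
def score_to_rgb(score, max_abs):
    """Map a signed attribution score to blue (negative) .. white .. red (positive)."""
    t = max(-1.0, min(1.0, score / max_abs))        # normalize to [-1, 1]
    if t >= 0:
        # Fade white -> red as the positive score grows.
        return (255, int(255 * (1 - t)), int(255 * (1 - t)))
    # Fade white -> blue as the score becomes more negative.
    return (int(255 * (1 + t)), int(255 * (1 + t)), 255)

# Hypothetical per-token DeepLIFT scores for one layer.
scores = {"beyonce": 0.8, "the": -0.3, "1990s": 1.0}
max_abs = max(abs(v) for v in scores.values())
colors = {tok: score_to_rgb(v, max_abs) for tok, v in scores.items()}
```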
Here we present an example from our experiment. The question is "when did beyonce start becoming popular?", to which the answer is "late 1990s".
The code for this experiment will be open-sourced.
5 Results and Analysis
Please refer to the images included in the Supplemental Material section of this report for the examples discussed here.
Example 1. when did beyonce start becoming popular? Figures 1 to 12 show which tokens are given more attention by our model. Blue represents low attention, while red indicates higher values (see the reference scale provided with each image). In the initial layers, the model focuses on the separator tokens in the input text first. It then switches its focus to punctuation symbols and gradually to certain tokens in the text. This behavior relates to BERT focusing on the syntactic information of the input Clark et al. (2019) in its initial layers. In later layers, the model shifts its attention to tokens that are most likely related to the question and could form an important part of the probable answer; this captures the semantic representations of the text. We can see how attention changes for a token by looking at the word beyonce in figures 4 and 5. The model focuses on beyonce until layer 4 and then completely ignores it (no attention) from layer 5 onward. It was focusing on the token because it is a keyword in the question; it removes its focus when it tries to find the answer in the text. It focuses on the question keywords (for example, beyonce) again in the last layers to semantically verify the answer context against the question keywords. In the output layer (figure 13), we see that our model gives its full attention only to those tokens which are part of the predicted answer.
6 Conclusion

Here we have used DeepLIFT, a backpropagation-based attribution method, first to analyze how the values of multi-headed attention change across the layers of BERT, and then clustered the results to find patterns in the data similar to human reasoning. We find that BERT, fine-tuned for question-answering tasks, first focuses on tokens in the text that match keywords in the question. Later, it shifts its attention to only those tokens which it considers vital in constructing the final answer. We demonstrate pictorially how the model's widespread span of attention narrows down to the answer tokens in the final layers, while maintaining syntactic and semantic representations throughout the process.
Acknowledgements

This report is part of the Neural Question Answering seminar offered at Saarland University, Germany. We are grateful to Prof. Günter Neumann and Stalin Varanasi for offering this seminar and guiding us.
References

- Ancona et al. (2018) Marco Ancona, Enea Ceolini, Cengiz Oztireli, and Markus Gross. 2018. Towards better understanding of gradient-based attribution methods for deep neural networks. In 6th International Conference on Learning Representations (ICLR 2018).
- Bach et al. (2015) Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. 2015. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140.
- Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
- Belinkov et al. (2017) Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. 2017. What do neural machine translation models learn about morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 861–872.
- Clark et al. (2019) Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT's attention. CoRR, abs/1906.04341.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
- Jawahar et al. (2019) Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy.
- Linzen et al. (2016) Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of lstms to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535.
- Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543.
- Peters et al. (2018) Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
- Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL https://blog.openai.com/language-unsupervised.
- Rajpurkar et al. (2018) Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don’t know: Unanswerable questions for squad. arXiv preprint arXiv:1806.03822.
- Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.
- Shrikumar et al. (2017) Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. 2017. Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3145–3153. JMLR.org.
- Shrikumar et al. (2016) Avanti Shrikumar, Peyton Greenside, Anna Shcherbina, and Anshul Kundaje. 2016. Not just a black box: Learning important features through propagating activation differences. arXiv preprint arXiv:1605.01713.
- Sundararajan et al. (2017) Mukund Sundararajan, Ankur Taly, and Qiqi Yan. 2017. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3319–3328. JMLR. org.
- Tenney et al. (2019) Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. Bert rediscovers the classical nlp pipeline. arXiv preprint arXiv:1905.05950.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
- Zeiler and Fergus (2014) Matthew D Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer.
Appendix A Supplemental Material
We have included the attention heatmap images for the input questions and the corresponding input paragraph for each layer of BERT, along with its prediction. In each image, each token of the input text is highlighted with a color denoting the score assigned to it after applying DeepLIFT. That score tells us how much attention the model gives to that particular token.
A.1 Example 1
Q: when did beyonce start becoming popular?
A: late 1990s
Figures 1 to 12 show the importance given to each token after running DeepLIFT.
Figure 13 shows the amount of attention BERT gives to the tokens in the output. We see that the tokens "late 1990s" receive the highest attention, that being the correct answer.
A blue highlight denotes a negative score, meaning that the token contributed negatively to the final result. Similarly, red denotes a positive score, that is, a positive contribution to the result. A color scale is provided for reference.