Since their introduction in Machine Translation, attention mechanisms Bahdanau et al. (2014); Luong et al. (2015) have been extended to other tasks such as text classification Yang et al. (2016), natural language inference Chen et al. (2016) and language modeling Salton et al. (2017).
Self-attention and transformer architectures Vaswani et al. (2017) are now the state of the art in language understanding Devlin et al. (2018); Yang et al. (2019); Liu (2019), semantic role labeling Strubell et al. (2018) and machine translation for low-resource languages Rikters (2018); Rikters et al. (2018).
Attention mechanisms provide explainable attention distributions that can help to interpret predictions. For example, for their machine translation predictions, Bahdanau et al. (2014) show a heat map of attention weights from source language words to target language words. Similarly, a self-attention head produces attention distributions from the input words to the same input words, as shown in the second row of the right side of Figure 1. However, self-attention mechanisms have multiple heads, making the combined outputs difficult to interpret.
We hypothesize that label-specific representations can increase performance and provide interpretable predictions. We introduce the Label Attention Layer: a modified version of self-attention, where each attention head represents a label. We project the output at the attention head level, rather than after aggregating all outputs, to preserve the source of label-specific information.
To test our proposed Label Attention Layer, we build upon the parser of Zhou and Zhao (2019) and establish a new state of the art for both constituency and dependency parsing. We also release our trained parser, as well as our code, to encourage experiments with models that include the Label Attention Layer (code and model to be released soon at https://github.com/KhalilMrini/LAL-Parser).
The rest of this paper is organized as follows: we explain the architecture and intuition behind our proposed Label Attention Layer in Section 2. In Section 3 we describe our syntactic parsing model, and Section 4 presents our experiments and results. Finally, we survey related work in Section 5, then draw conclusions and suggest future work in Section 6.
2 Label Attention Layer
The self-attention mechanism of Vaswani et al. (2017) propagates information between the words of a sentence. Each resulting word representation contains its own attention-weighted view of the sentence. We hypothesize that a word representation can be enhanced by including each label’s attention-weighted view of the sentence, on top of the information obtained from self-attention.
The Label Attention Layer is a novel, modified form of self-attention, where only one query vector is needed per attention head. Each attention head represents a label, and this allows the model to learn label-specific views of the input sentence.
We explain the architecture and intuition behind our proposed Label Attention Layer through the example application of constituency parsing.
Figure 2 shows one of the main differences between our Label Attention mechanism and self-attention: the absence of the Query matrix $Q$. Instead, we have a learned matrix of query vectors, one per label. More formally, for the attention head of label $\ell$ and an input matrix $X$ of word vectors, we compute the corresponding attention weights vector $a_\ell$ as follows:

$$a_\ell = \mathrm{softmax}\left(\frac{q_\ell \cdot K_\ell}{\sqrt{d}}\right) \quad (1)$$
where $d$ is the dimension of query and key vectors, $q_\ell$ is the query vector of label $\ell$, and $K_\ell$ is the matrix of key vectors. Given a learned label-specific key matrix $W_\ell^K$, we compute $K_\ell$ as:

$$K_\ell = W_\ell^K X \quad (2)$$
Each attention head in our Label Attention Layer has an attention vector, instead of an attention matrix as in self-attention. Consequently, we do not obtain a matrix of vectors, but a single vector that contains label-specific context information. This context vector corresponds to the green vector in Figure 3. We compute the context vector $c_\ell$ of label $\ell$ as follows:

$$c_\ell = a_\ell \cdot V_\ell \quad (3)$$
where $a_\ell$ is the vector of attention weights in Equation 1, and $V_\ell$ is the matrix of value vectors. Given a learned label-specific value matrix $W_\ell^V$, we compute $V_\ell$ as:

$$V_\ell = W_\ell^V X \quad (4)$$
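As a minimal numpy sketch of one attention head (the function name `label_attention_head` and all dimensions are illustrative assumptions, not the parser's actual hyperparameters), the head computes the attention weights vector and the label-specific context vector:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a vector
    e = np.exp(x - np.max(x))
    return e / e.sum()

def label_attention_head(X, q, W_K, W_V):
    """One attention head of the Label Attention Layer (sketch).

    X:   (d_model, n) matrix of word vectors, one column per word
    q:   (d,) learned query vector for this head's label
    W_K: (d, d_model) learned label-specific key matrix
    W_V: (d, d_model) learned label-specific value matrix
    Returns the attention weights vector a and the context vector c.
    """
    d = q.shape[0]
    K = W_K @ X                        # key vectors, one per word
    a = softmax(q @ K / np.sqrt(d))    # single attention vector per head
    V = W_V @ X                        # value vectors, one per word
    c = V @ a                          # label-specific context vector
    return a, c

rng = np.random.default_rng(0)
n, d_model, d = 5, 16, 8              # toy sizes: 5 words
X = rng.standard_normal((d_model, n))
q = rng.standard_normal(d)
W_K = rng.standard_normal((d, d_model))
W_V = rng.standard_normal((d, d_model))
a, c = label_attention_head(X, q, W_K, W_V)
```

Because the head has a single query vector rather than a query matrix, the result is one attention distribution over the words and one context vector per label.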
The context vector $c_\ell$ gets added to each individual input vector, as shown in the red box in Figure 3. We project the resulting matrix of word vectors to a lower dimension before normalizing. We then distribute the word vectors computed by each label attention head, as shown in Figure 5.
Our Label Attention Layer contains one attention head per label. The values coming from each label are identifiable within the final word representation, as shown in the color-coded vectors in the middle of Figure 5.
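To make the head-level projection and per-label identifiability concrete, here is a hedged numpy sketch of the whole layer (names, shapes and the shared projection matrix are our assumptions; the normalization step after projecting is omitted for brevity):

```python
import numpy as np

def label_attention_layer(X, heads, proj):
    """Sketch of the layer: one head per label, head-level projection.

    X:     (d_model, n) input word vectors
    heads: list of (q, W_K, W_V) triples, one per label; all matrices
           are (d_model, d_model) here so the context vector can be
           added to each input vector (the residual box in Figure 3)
    proj:  (d_out, d_model) projection applied per head, BEFORE the
           head outputs are aggregated, so each label's contribution
           stays identifiable in the concatenated representation
    """
    d_model, n = X.shape
    per_head = []
    for q, W_K, W_V in heads:
        scores = q @ (W_K @ X) / np.sqrt(d_model)
        a = np.exp(scores - scores.max())
        a /= a.sum()                        # attention over words
        c = (W_V @ X) @ a                   # label-specific context
        per_head.append(proj @ (X + c[:, None]))  # add, then project
    # Final word representations: per-label blocks stacked in order
    return np.concatenate(per_head, axis=0)       # (num_labels*d_out, n)

rng = np.random.default_rng(1)
d_model, n, d_out, num_labels = 8, 5, 4, 3
X = rng.standard_normal((d_model, n))
heads = [(rng.standard_normal(d_model),
          rng.standard_normal((d_model, d_model)),
          rng.standard_normal((d_model, d_model)))
         for _ in range(num_labels)]
out = label_attention_layer(X, heads, rng.standard_normal((d_out, d_model)))
```

In the output, rows `[k*d_out:(k+1)*d_out]` come entirely from head `k`, which is what makes each label's contribution traceable.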
The activation functions of the position-wise feed-forward layer make it difficult to follow the path of the contributions. Therefore we can remove the position-wise feed-forward layer and compute the contributions from each label. We provide an example in Figure 6, where the contributions are computed using normalization and averaging. In this case, we compute the contributions of each label to the span vector. The span representation for “the person” is computed following the method of Gaddy et al. (2018) and Kitaev and Klein (2018). However, forward and backward representations are not formed by splitting the entire word vector at the middle, but rather by splitting each label-specific word vector at the middle.
3 Syntactic Parsing Model
3.1 Encoder
Our parser has an encoder-decoder architecture. The encoder has self-attention layers Vaswani et al. (2017), preceding the Label Attention Layer. We follow the attention partition of Kitaev and Klein (2018), who show that separating content embeddings from position embeddings increases performance.
Sentences are pre-processed following Zhou and Zhao (2019). They represent trees using a simplified Head-driven Phrase Structure Grammar (HPSG) Pollard and Sag (1994). They propose two kinds of span representations: the division span and the joint span. We choose the joint span representation, having determined that it is the better-performing of the two. We show in Figure 4 how the example sentence in Figure 2 is represented.
The token representations for our model are a concatenation of content and position embeddings. The content embeddings are a sum of word and part-of-speech embeddings.
3.2 Constituency Parsing
For constituency parsing, span representations follow the definition of Gaddy et al. (2018) and Kitaev and Klein (2018). For a span starting at the $i$-th word and ending at the $j$-th word, the corresponding span vector $s_{ij}$ is computed as:

$$s_{ij} = \left[\overrightarrow{h_j} - \overrightarrow{h_{i-1}};\; \overleftarrow{h_i} - \overleftarrow{h_{j+1}}\right]$$
where $\overleftarrow{h_i}$ and $\overrightarrow{h_i}$ are respectively the backward and forward representations of the $i$-th word, obtained by splitting its representation in half. An example of a span representation is shown in the middle of Figure 6.
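A small numpy sketch of this span representation (the boundary-row convention and the function name are our assumptions; boundary rows stand in for start/stop tokens so that $h_{i-1}$ and $h_{j+1}$ always exist):

```python
import numpy as np

def span_vector(H, i, j):
    """Span representation s_ij for the span of words i..j (1-indexed).

    H: (n+2, d) word representations, with boundary rows at positions
       0 and n+1; d must be even. Forward and backward halves come
       from splitting each word vector at the middle.
    """
    half = H.shape[1] // 2
    fwd, bwd = H[:, :half], H[:, half:]
    # s_ij = [fwd_j - fwd_{i-1} ; bwd_i - bwd_{j+1}]
    return np.concatenate([fwd[j] - fwd[i - 1], bwd[i] - bwd[j + 1]])

H = np.arange(6 * 4, dtype=float).reshape(6, 4)  # 4 words + 2 boundaries
s = span_vector(H, 1, 2)                         # span over words 1..2
```

With these toy values, `s` is `[8, 8, -8, -8]`: the first half is the forward difference across the span and the second half is the backward difference.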
The score vector for the span is obtained by applying a one-layer feed-forward network:

$$S(i,j) = W_2\, \mathrm{ReLU}\!\left(\mathrm{LN}\!\left(W_1 s_{ij} + b_1\right)\right) + b_2$$
where LN is Layer Normalization, and $W_1$, $W_2$, $b_1$ and $b_2$ are learned parameters. For the $\ell$-th syntactic category, the corresponding score $s(i,j,\ell)$ is the $\ell$-th value in the $S(i,j)$ vector.
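A hedged numpy sketch of this scoring layer (dimensions and weight names are illustrative, not the parser's):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Layer Normalization over a single vector (no learned gain/bias)
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def span_scores(s_ij, W1, b1, W2, b2):
    """S(i,j) = W2 . ReLU(LN(W1 . s_ij + b1)) + b2.
    The l-th entry is the score s(i,j,l) of the l-th category."""
    hidden = np.maximum(layer_norm(W1 @ s_ij + b1), 0.0)
    return W2 @ hidden + b2

rng = np.random.default_rng(2)
d_span, d_hidden, n_labels = 6, 8, 5
S = span_scores(rng.standard_normal(d_span),
                rng.standard_normal((d_hidden, d_span)),
                rng.standard_normal(d_hidden),
                rng.standard_normal((n_labels, d_hidden)),
                rng.standard_normal(n_labels))
```

Reading off a single category's score is then just indexing into `S`.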
Consequently, the score $s(T)$ of a constituency parse tree $T$ is the sum of all of the scores of its spans and their syntactic categories:

$$s(T) = \sum_{(i,j,\ell) \in T} s(i,j,\ell)$$
We then use a CKY-style algorithm Stern et al. (2017); Gaddy et al. (2018) to find the tree with the highest score. The model is trained to find the correct parse tree $T^*$, such that for all trees $T$, the following margin constraint is satisfied:

$$s(T^*) \geq s(T) + \Delta(T, T^*)$$
where $\Delta$ is the Hamming loss on labeled spans. The corresponding loss function is the hinge loss:

$$L_c = \max\left(0,\; \max_{T}\left[s(T) + \Delta(T, T^*)\right] - s(T^*)\right)$$
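The margin objective can be sketched over an explicitly enumerated candidate set (a toy setting for illustration; the parser actually searches over trees with the CKY-style algorithm rather than enumerating them):

```python
import numpy as np

def hinge_loss(scores, hamming, gold_idx):
    """Structured hinge loss over candidate trees.

    scores:   array of s(T), one entry per candidate tree
    hamming:  array of Delta(T, T*), 0 for the gold tree itself
    gold_idx: index of the correct tree T*
    """
    augmented = scores + hamming                 # s(T) + Delta(T, T*)
    return max(0.0, augmented.max() - scores[gold_idx])

# Two candidates: gold tree scores 2.0; a wrong tree scores 5.0
# with Hamming distance 1 from the gold tree.
loss = hinge_loss(np.array([2.0, 5.0]), np.array([0.0, 1.0]), 0)
# loss = max(0, (5 + 1) - 2) = 4.0
```

The loss is zero exactly when the gold tree beats every other candidate by at least its Hamming distance, which is the margin constraint above.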
3.3 Dependency Parsing
We use the biaffine attention mechanism Dozat and Manning (2016) to compute a probability distribution for the dependency head of each word. The child-parent score $\alpha_{ij}$ for the $j$-th word to be the head of the $i$-th word is:

$$\alpha_{ij} = h_i^{(d)\top} W h_j^{(h)} + U^\top h_i^{(d)} + V^\top h_j^{(h)} + b$$
where $h_i^{(d)}$ is the dependent representation of the $i$-th word, obtained by putting its representation through a one-layer perceptron. Likewise, $h_j^{(h)}$ is the head representation of the $j$-th word, obtained by putting its representation through a separate one-layer perceptron. The matrices $W$, $U$ and $V$ are learned parameters.
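A minimal numpy sketch of the biaffine score (shapes and values are illustrative; the perceptrons producing the two representations are omitted):

```python
import numpy as np

def biaffine_score(h_d, h_h, W, U, V, b):
    """Child-parent score alpha_ij:
    alpha = h_d^T W h_h + U^T h_d + V^T h_h + b
    h_d: dependent representation of word i
    h_h: head representation of word j
    """
    return h_d @ W @ h_h + U @ h_d + V @ h_h + b

score = biaffine_score(np.array([1.0, 0.0]),   # h_d
                       np.array([0.0, 1.0]),   # h_h
                       np.eye(2),              # W
                       np.array([1.0, 1.0]),   # U
                       np.array([2.0, 0.0]),   # V
                       0.5)                    # b
# bilinear term 0.0 + dependent bias 1.0 + head bias 0.0 + 0.5 = 1.5
```

Softmax-normalizing the scores `alpha_i*` over all candidate heads `j` then gives the head distribution for word `i`.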
The model trains on dependency parsing by minimizing the negative likelihood of the correct dependency tree. The loss function is cross-entropy:

$$L_d = -\sum_i \left(\log P(h_i \mid d_i) + \log P(l_i \mid d_i, h_i)\right)$$
where $h_i$ is the correct head for dependent $d_i$, $P(h_i \mid d_i)$ is the probability that $h_i$ is the head of $d_i$, and $P(l_i \mid d_i, h_i)$ is the probability of the correct dependency label $l_i$ for the child-parent pair $(d_i, h_i)$.
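This loss can be sketched as follows (the logit layout and function names are our assumptions; a real implementation would batch this):

```python
import numpy as np

def log_softmax(z):
    # Stable log-softmax along the last axis
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def dependency_loss(head_logits, label_logits, gold_heads, gold_labels):
    """Negative log-likelihood of the correct dependency tree.

    head_logits:  (n, n) scores; row i holds the scores of each
                  candidate head for word i
    label_logits: (n, L) label scores for word i's gold arc
    """
    lp_head = log_softmax(head_logits)
    lp_label = log_softmax(label_logits)
    idx = np.arange(head_logits.shape[0])
    return -(lp_head[idx, gold_heads].sum()
             + lp_label[idx, gold_labels].sum())

# Uniform logits: each head term costs log(3), each label term log(2)
loss = dependency_loss(np.zeros((3, 3)), np.zeros((3, 2)),
                       np.array([0, 1, 2]), np.array([0, 1, 0]))
```

With uniform scores the loss reduces to `3*log(3) + 3*log(2)`, i.e. full uncertainty over both heads and labels.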
The model jointly trains on constituency and dependency parsing by minimizing the sum of the constituency and dependency losses:

$$L = L_c + L_d$$
4 Experiments and Results
For the evaluation, we follow standard practice: we use the EVALB program Sekine and Collins (1997) for constituency parsing, and report dependency parsing results excluding punctuation.
In our experiments, the Label Attention Layer has 128 dimensions for the query, key and value vectors, as well as for the output vector of each label attention head. For the dependency and span scores, we use the same hyperparameters as Zhou and Zhao (2019). We use the large cased pre-trained XLNet Yang et al. (2019) as our embedding model. We use a batch size of 100, and each model is trained on a single 32GB GPU.
4.2 Ablation Study
As shown in Figure 6, our Label Attention Layer is interpretable only if there is no position-wise feed-forward layer. We investigate the impact of removing this component from the LAL.
We show the results of our ablation study on the Residual Dropout and Position-wise Feed-forward Layer in Table 1. The second row shows that the performance of our parser decreases significantly when removing the Position-wise Feed-forward Layer and keeping the Residual Dropout. However, that performance is recovered when removing the Residual Dropout as well, as shown in the last row. In fact, removing both the Residual Dropout and Position-wise Feed-forward layer is the best-performing option for a LAL parser with 2 self-attention layers.
4.3 Self-Attention Layers
Table 2 shows the results of our experiments varying the number of self-attention layers in our parser’s encoder. The best-performing option is 3 layers. However, in this setting, removing the position-wise feed-forward layer and residual dropout actually decreases performance.
4.4 State of the Art
Finally, we compare our results with the state of the art in constituency and dependency parsing. Table 3 compares our results with the parsers of Zhou and Zhao (2019). Our LAL parser establishes new state-of-the-art results, improving significantly in dependency parsing.
5 Related Work
Before the HPSG-based model of Zhou and Zhao (2019), the state of the art in constituency parsing was the model of Kitaev and Klein (2018), an encoder-decoder parser. The novelty of Kitaev and Klein (2018) is their self-attentive encoder, where they stack multiple levels of self-attention to embed words. The resulting word embeddings are fed into a decoder, which they borrow from Stern et al. (2017).
Gaddy et al. (2018) use a bidirectional LSTM to compute forward and backward representations of each word. A contiguous span of words from position $i$ to position $j$, of the form $(w_i, \ldots, w_j)$, therefore has forward representations $(f_i, \ldots, f_j)$ and backward representations $(b_i, \ldots, b_j)$. The scores for this span to be attributed a non-terminal are computed as the following output vector, as per Stern et al. (2017):

$$s(i, j, \cdot) = W_2\, \mathrm{ReLU}\!\left(W_1 v + b_1\right) + b_2$$
where $b_1$ and $b_2$ are biases, $s(i,j,\cdot)$ is a vector of dimensionality $|L|$ (the number of possible non-terminal labels), and $v$ is computed as:

$$v = \left[f_j - f_{i-1};\; b_i - b_{j+1}\right]$$
Therefore, $s(i,j,k)$ indicates the score for the span of words $(w_i, \ldots, w_j)$ to be labelled with the non-terminal $k$. These scores are then used in a CKY-style algorithm that produces the final parse tree.
Kitaev and Klein (2018) redefine $v$ as the following:

$$v = \left[\overrightarrow{y_j} - \overrightarrow{y_{i-1}};\; \overleftarrow{y_i} - \overleftarrow{y_{j+1}}\right]$$
where $y_t$ is the output of the encoder for the $t$-th word, i.e. the sum of all self-attention layers of the encoder, split in half into forward ($\overrightarrow{y_t}$) and backward ($\overleftarrow{y_t}$) components.
6 Conclusions and Future Work
In this paper, we introduce a revised form of self-attention: the Label Attention Layer. In our proposed architecture, attention heads represent labels. Each head has only one learned query vector, rather than a query matrix, thereby diminishing the number of parameters per attention head. We incorporate our Label Attention Layer into the HPSG parser of Zhou and Zhao (2019) and obtain state-of-the-art results on the English Penn Treebank benchmark dataset. Our results show a 96.34 F1 score for constituency parsing, and 97.33 UAS and 96.29 LAS for dependency parsing.
In future work, we want to investigate the interpretability of the Label Attention Layer, notably through the label-to-word attention distributions and the contributions of each label attention head. We also want to incorporate it in more self-attentive NLP models for other tasks.
- Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
- Chen et al. (2016) Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, and Hui Jiang. 2016. Enhancing and combining sequential and tree LSTM for natural language inference. arXiv preprint arXiv:1609.06038.
- Cocke (1969) John Cocke. 1969. Programming languages and their compilers: Preliminary notes.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Dozat and Manning (2016) Timothy Dozat and Christopher D Manning. 2016. Deep biaffine attention for neural dependency parsing. arXiv preprint arXiv:1611.01734.
- Gaddy et al. (2018) David Gaddy, Mitchell Stern, and Dan Klein. 2018. What’s going on in neural constituency parsers? an analysis. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 999–1010.
- Kasami (1966) Tadao Kasami. 1966. An efficient recognition and syntax-analysis algorithm for context-free languages. Coordinated Science Laboratory Report no. R-257.
- Kitaev and Klein (2018) Nikita Kitaev and Dan Klein. 2018. Constituency parsing with a self-attentive encoder. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2676–2686.
- Liu (2019) Yang Liu. 2019. Fine-tune BERT for extractive summarization. CoRR, abs/1903.10318.
- Luong et al. (2015) Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421.
- Marcus et al. (1993) Mitchell Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: the Penn Treebank.
- Pollard and Sag (1994) Carl Pollard and Ivan A Sag. 1994. Head-driven phrase structure grammar. University of Chicago Press.
- Rikters (2018) Matīss Rikters. 2018. Impact of corpora quality on neural machine translation. arXiv preprint arXiv:1810.08392.
- Rikters et al. (2018) Matīss Rikters, Mārcis Pinnis, and Rihards Krišlauks. 2018. Training and adapting multilingual nmt for less-resourced and morphologically rich languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018).
- Salton et al. (2017) Giancarlo Salton, Robert Ross, and John Kelleher. 2017. Attentive language models. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 441–450.
- Sekine and Collins (1997) Satoshi Sekine and Michael Collins. 1997. EVALB bracket scoring program. URL: http://www.cs.nyu.edu/cs/projects/proteus/evalb.
- Stern et al. (2017) Mitchell Stern, Jacob Andreas, and Dan Klein. 2017. A minimal span-based neural constituency parser. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 818–827.
- Strubell et al. (2018) Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-informed self-attention for semantic role labeling. arXiv preprint arXiv:1804.08199.
- Toutanova et al. (2003) Kristina Toutanova, Dan Klein, Christopher D Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1, pages 173–180. Association for computational Linguistics.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
- Yang et al. (2019) Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.
- Yang et al. (2016) Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489.
- Younger (1967) Daniel H Younger. 1967. Recognition and parsing of context-free languages in time n³. Information and Control, 10(2):189–208.
- Zhou and Zhao (2019) Junru Zhou and Hai Zhao. 2019. Head-driven phrase structure grammar parsing on penn treebank. arXiv preprint arXiv:1907.02684.