
Rethinking Self-Attention: An Interpretable Self-Attentive Encoder-Decoder Parser

Attention mechanisms have improved the performance of NLP tasks while providing an appearance of model interpretability. Self-attention is currently widely used in NLP models; however, it is difficult to interpret due to the numerous attention distributions. We hypothesize that model representations can benefit from label-specific information, while facilitating interpretation of predictions. We introduce the Label Attention Layer: a new form of self-attention where attention heads represent labels. We validate our hypothesis by running experiments in constituency and dependency parsing and show that our new model obtains new state-of-the-art results for both tasks on the English Penn Treebank. Our neural parser obtains 96.34 F1 score for constituency parsing, and 97.33 UAS and 96.29 LAS for dependency parsing. Additionally, our model requires fewer layers, and therefore fewer parameters, than existing work.





1 Introduction

Since their introduction in Machine Translation, attention mechanisms Bahdanau et al. (2014); Luong et al. (2015) have been extended to other tasks such as text classification Yang et al. (2016), natural language inference Chen et al. (2016) and language modeling Salton et al. (2017).

Self-attention and transformer architectures Vaswani et al. (2017) are now the state of the art in language understanding Devlin et al. (2018); Yang et al. (2019), extractive summarization Liu (2019), semantic role labeling Strubell et al. (2018) and machine translation for low-resource languages Rikters (2018); Rikters et al. (2018).

Figure 1: Comparison of the attention head architectures of our proposed Label Attention Layer and a Self-Attention Layer Vaswani et al. (2017). The input matrix $X$ contains the word vectors for the example input sentence “Select the person”.

Attention mechanisms provide explainable attention distributions that can help to interpret predictions. For example, for their machine translation predictions, Bahdanau et al. (2014) show a heat map of attention weights from source language words to target language words. Similarly, a self-attention head produces attention distributions from the input words to the same input words, as shown in the second row of the right side of Figure 1. However, self-attention mechanisms have multiple heads, making the combined outputs difficult to interpret.

We hypothesize that label-specific representations can increase performance and provide interpretable predictions. We introduce the Label Attention Layer: a modified version of self-attention, where each attention head represents a label. We project the output at the attention head level, rather than after aggregating all outputs, to preserve the source of label-specific information.

To test our proposed Label Attention Layer, we build upon the parser of Zhou and Zhao (2019) and establish a new state of the art for both constituency and dependency parsing. We also release our trained parser, as well as our code, to encourage experiments with models that include the Label Attention Layer (footnote 1: code and model to be released soon).

The rest of this paper is organized as follows: we explain the architecture and intuition behind our proposed Label Attention Layer in Section 2. In Section 3 we describe our syntactic parsing model, and Section 4 presents our experiments and results. Finally, we survey related work in Section 5 and lay out conclusions and suggest future work in Section 6.

2 Label Attention Layer

The self-attention mechanism of Vaswani et al. (2017) propagates information between the words of a sentence. Each resulting word representation contains its own attention-weighted view of the sentence. We hypothesize that a word representation can be enhanced by including each label’s attention-weighted view of the sentence, on top of the information obtained from self-attention.

The Label Attention Layer is a novel, modified form of self-attention, where only one query vector is needed per attention head. Each attention head represents a label, and this allows the model to learn label-specific views of the input sentence.

We explain the architecture and intuition behind our proposed Interpretable Label Attention Layer through the example application of constituency parsing.

Figure 2: The architecture of our proposed Label Attention Layer. In this figure, the example application is constituency parsing, and the example input sentence is “Select the person driving”.

Figure 2 shows one of the main differences between our Label Attention mechanism and self-attention: the absence of the Query matrix $W^Q$. Instead, we have a learned matrix $Q$ of query vectors representing the labels. More formally, for the attention head of label $l$ and an input matrix $X$ of word vectors, we compute the corresponding attention weights vector $a_l$ as follows:

$$a_l = \mathrm{softmax}\left(\frac{q_l K_l^\top}{\sqrt{d}}\right) \tag{1}$$

where $q_l$ is the query vector of label $l$, $d$ is the dimension of query and key vectors, and $K_l$ is the matrix of key vectors. Given a learned label-specific key matrix $W_l^K$, we compute $K_l$ as:

$$K_l = X W_l^K \tag{2}$$

Figure 3: The Value vector computations in our proposed Label Attention Layer.
Figure 4: Parsing representations of the example sentence in Figure 2.

Each attention head in our Label Attention layer has an attention vector, instead of an attention matrix as in self-attention. Consequently, we do not obtain a matrix of vectors, but a single vector that contains label-specific context information. This context vector corresponds to the green vector in Figure 3. We compute the context vector $c_l$ of the label $l$ as follows:

$$c_l = a_l V_l \tag{3}$$

where $a_l$ is the vector of attention weights in Equation 1, and $V_l$ is the matrix of value vectors. Given the input matrix $X$ of word vectors and a learned label-specific value matrix $W_l^V$, we compute $V_l$ as:

$$V_l = X W_l^V \tag{4}$$


The context vector gets added to each individual input vector, as shown in the red box in Figure 3. We project the resulting matrix of word vectors to a lower dimension before normalizing. We then distribute the word vectors computed by each label attention head, as shown in Figure 5.
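The steps of a label attention head can be sketched in NumPy as follows: attention weights from a single learned query vector, a context vector, its addition to every input vector, projection to a lower dimension, and normalization. This is an illustration under stated assumptions, not the released implementation: word vectors are stored as rows of `X`, the value projection maps to the input dimension so that the context vector can be added to the inputs, layer normalization has no learned gain or bias, and the names `W_o`, `label_attention_head` and `label_attention_layer` are hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def layer_norm(x, eps=1e-6):
    """Normalize each row to zero mean and unit variance (no learned gain or bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)


def label_attention_head(X, q, W_k, W_v, W_o):
    """One label attention head (shapes are assumptions of this sketch).

    X:   (n, d_model)       word vectors, one row per word
    q:   (d,)               learned query vector of the label
    W_k: (d_model, d)       label-specific key projection
    W_v: (d_model, d_model) label-specific value projection
    W_o: (d_model, d_out)   per-head output projection (hypothetical name)
    """
    K = X @ W_k                                  # (n, d) key vectors
    a = softmax(q @ K.T / np.sqrt(q.shape[0]))   # (n,) one weight per word
    V = X @ W_v                                  # (n, d_model) value vectors
    c = a @ V                                    # (d_model,) context vector
    H = X + c                                    # context added to every word
    return layer_norm(H @ W_o), a                # project, then normalize


def label_attention_layer(X, Q, W_k, W_v, W_o):
    """Run one head per label and concatenate the outputs word by word."""
    outputs, weights = [], []
    for label in range(Q.shape[0]):
        H_l, a_l = label_attention_head(
            X, Q[label], W_k[label], W_v[label], W_o[label])
        outputs.append(H_l)
        weights.append(a_l)
    # (n, num_labels * d_out) word vectors, (num_labels, n) attention weights
    return np.concatenate(outputs, axis=-1), np.stack(weights)
```

Because each head contributes a contiguous block of `d_out` dimensions to every word vector, the values coming from a given label remain identifiable after concatenation. The returned attention weights hold one label-to-word distribution per row.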

Figure 5: Redistribution of the label-specific word representations to form word vectors by concatenation. The label color scheme follows Figure 2. We do not use colors for the vectors resulting from the position-wise feed-forward layer, as the label-specific information moved.

Our Label Attention Layer contains one attention head per label. The values coming from each label are identifiable within the final word representation, as shown in the color-coded vectors in the middle of Figure 5.

The activation functions of the position-wise feed-forward layer make it difficult to follow the path of the contributions. Therefore we can remove the position-wise feed-forward layer, and compute the contributions from each label. We provide an example in Figure 6, where the contributions are computed using normalization and averaging. In this case, we are computing the contributions of each label to the span vector. The span representation for “the person” is computed following the method of Gaddy et al. (2018) and Kitaev and Klein (2018). However, forward and backward representations are not formed by splitting the entire word vector at the middle, but rather by splitting each label-specific word vector at the middle.
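The difference between splitting the whole word vector and splitting each label-specific word vector can be illustrated with a short NumPy sketch. It assumes that a word vector is the concatenation of equally sized label-specific chunks; the function name `split_label_wise` is hypothetical.

```python
import numpy as np


def split_label_wise(word_vectors, num_labels):
    """Split each label-specific chunk of every word vector at its middle.

    word_vectors: (n, num_labels * d) array, one row per word, where each
    consecutive block of d values comes from one label attention head.
    Returns (forward, backward), each of shape (n, num_labels * d // 2).
    """
    n, total = word_vectors.shape
    d = total // num_labels
    assert d * num_labels == total and d % 2 == 0, "chunks must split evenly"
    chunks = word_vectors.reshape(n, num_labels, d)
    forward = chunks[:, :, : d // 2].reshape(n, -1)   # first half of each chunk
    backward = chunks[:, :, d // 2:].reshape(n, -1)   # second half of each chunk
    return forward, backward


# One word, two labels, four values per label: [0 1 2 3 | 4 5 6 7]
forward, backward = split_label_wise(np.arange(8).reshape(1, 8), num_labels=2)
print(forward)   # [[0 1 4 5]]
print(backward)  # [[2 3 6 7]]
```

With this layout, both the forward and the backward representation keep one block per label, so the label that produced each value can still be traced in the span vector.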

In the example in Figure 6, we show averaging as one way of computing contributions, but other functions, such as softmax, can be used. Another way of interpreting predictions would be to look at the label-to-word attention distributions, which are the output vectors in the computation in Figure 2.

Figure 6: If we remove the position-wise feed-forward layer, we can compute the contributions from each label attention head to the span representation, and thus offer interpretability. This illustrative example follows the label color scheme in Figure 2.

3 Syntactic Parsing Model

3.1 Encoder

Our parser has an encoder-decoder architecture. The encoder has self-attention layers Vaswani et al. (2017), preceding the Label Attention Layer. We follow the attention partition of Kitaev and Klein (2018), who show that separating content embeddings from position embeddings increases performance.

Sentences are pre-processed following Zhou and Zhao (2019). They represent trees using a simplified Head-driven Phrase Structure Grammar (HPSG) Pollard and Sag (1994). They propose two kinds of span representations: the division span and the joint span. We choose the joint span representation after determining that it was the best performing one. We show in Figure 4 how the example sentence in Figure 2 is represented.

The token representations for our model are a concatenation of content and position embeddings. The content embeddings are a sum of word and part-of-speech embeddings.

3.2 Constituency Parsing

For constituency parsing, span representations follow the definition of Gaddy et al. (2018) and Kitaev and Klein (2018). For a span starting at the $i$-th word and ending at the $j$-th word, the corresponding span vector $s_{ij}$ is computed as:

$$s_{ij} = \left[\overrightarrow{h_j} - \overrightarrow{h_{i-1}} \,;\, \overleftarrow{h_{j+1}} - \overleftarrow{h_i}\right]$$

where $\overleftarrow{h_i}$ and $\overrightarrow{h_i}$ are respectively the backward and forward representation of the $i$-th word obtained by splitting its representation in half. An example of a span representation is shown in the middle of Figure 6.
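A minimal sketch of this span representation is given below. It assumes the common convention of subtracting boundary representations, with words indexed from 1 and sentinel rows at positions 0 and n + 1 (for example start and end tokens) so that the neighbours of the first and last word always exist. The function name and the sentinel layout are assumptions of the sketch.

```python
import numpy as np


def span_vector(forward, backward, i, j):
    """Span vector for words i..j (1-indexed, inclusive).

    forward, backward: (n + 2, d) arrays holding the forward and backward
    halves of each word representation; rows 0 and n + 1 are assumed to be
    sentinel positions.
    """
    return np.concatenate([
        forward[j] - forward[i - 1],     # forward difference across the span
        backward[j + 1] - backward[i],   # backward difference across the span
    ])


# Three words plus two sentinels, one dimension per direction.
forward = np.array([[0.0], [1.0], [3.0], [6.0], [10.0]])
backward = np.array([[10.0], [9.0], [7.0], [4.0], [0.0]])
print(span_vector(forward, backward, 2, 3))  # [ 5. -7.]
```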

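The score vector for the span is obtained by applying a one-layer feed-forward layer:

$$S(i, j) = W_2\,\mathrm{ReLU}\left(\mathrm{LN}(W_1 s_{ij} + b_1)\right) + b_2$$

where $s_{ij}$ is the span vector, LN is Layer Normalization, and $W_1$, $W_2$, $b_1$ and $b_2$ are learned parameters. For the $l$-th syntactic category, the corresponding score $s(i, j, l)$ is then the $l$-th value in the $S(i, j)$ vector.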
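The span scorer can be sketched as a small feed-forward function. The ReLU nonlinearity and the absence of a learned gain and bias in the layer normalization are assumptions of this sketch, and the parameter names are illustrative.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned gain or bias)."""
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)


def span_label_scores(s_ij, W1, b1, W2, b2):
    """Score every syntactic category for one span.

    s_ij: (d_span,)              span vector
    W1:   (d_hidden, d_span)     first projection, b1: (d_hidden,)
    W2:   (num_labels, d_hidden) second projection, b2: (num_labels,)
    Returns a (num_labels,) vector whose l-th entry is the score of label l.
    """
    hidden = np.maximum(0.0, layer_norm(W1 @ s_ij + b1))  # LN, then ReLU
    return W2 @ hidden + b2
```

The score of the $l$-th syntactic category for the span is then `span_label_scores(...)[l]`.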

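Consequently, the score $s(T)$ of a constituency parse tree $T$ is the sum of all of the scores of its spans and their syntactic categories, where $s(i, j, l)$ is the score of syntactic category $l$ for the span from word $i$ to word $j$:

$$s(T) = \sum_{(i, j, l) \in T} s(i, j, l)$$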


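We then use a CKY-style algorithm Stern et al. (2017); Gaddy et al. (2018) to find the optimal tree with the highest score. The model is trained to find the correct parse tree $T^*$, such that for all trees $T$, the following margin constraint is satisfied:

$$s(T^*) \geq s(T) + \Delta(T, T^*)$$

where $s(\cdot)$ is the tree score and $\Delta$ is the Hamming loss on labeled spans. The corresponding loss function is the hinge loss:

$$L_c = \max\left(0, \max_{T}\left[s(T) + \Delta(T, T^*)\right] - s(T^*)\right)$$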
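The decoding and training objective can be illustrated with a simplified chart parser. The sketch below is not the decoder used in our parser: it assumes binary trees, uses fencepost positions 0..n instead of word indices, ignores dependency scores, and counts a Hamming cost of 1 for every labeled span of a candidate tree that is absent from the gold tree. All function names are hypothetical.

```python
import numpy as np


def cky_decode(scores):
    """Return (best_score, spans) of the highest-scoring binary tree.

    scores[i, j, l] is the score of label l for the span between fencepost
    positions i and j (covering words i+1..j); only entries with i < j are read.
    """
    n = scores.shape[0] - 1
    best = np.zeros((n + 1, n + 1))
    back = {}
    for length in range(1, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            label = int(np.argmax(scores[i, j]))       # best label for the span
            total = scores[i, j, label]
            split = None
            if length > 1:                             # best split point
                split = max(range(i + 1, j),
                            key=lambda k: best[i, k] + best[k, j])
                total += best[i, split] + best[split, j]
            best[i, j] = total
            back[i, j] = (label, split)
    spans, stack = [], [(0, n)]                        # follow back-pointers
    while stack:
        i, j = stack.pop()
        label, split = back[i, j]
        spans.append((i, j, label))
        if split is not None:
            stack.extend([(i, split), (split, j)])
    return best[0, n], sorted(spans)


def hinge_loss(scores, gold_spans):
    """Structured hinge loss with loss-augmented decoding.

    gold_spans: dict mapping (i, j) to the gold label of every span in the
    gold binary tree.
    """
    gold_score = sum(scores[i, j, l] for (i, j), l in gold_spans.items())
    augmented = scores + 1.0                 # cost 1 for every labeled span...
    for (i, j), l in gold_spans.items():
        augmented[i, j, l] -= 1.0            # ...except those in the gold tree
    best_augmented, _ = cky_decode(augmented)
    return max(0.0, best_augmented - gold_score)
```

Because the Hamming cost decomposes over labeled spans, the inner maximization over trees in the hinge loss can reuse the same chart algorithm on the cost-augmented scores. The loss is zero only when the gold tree outscores every other tree by at least its Hamming distance.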


3.3 Dependency Parsing

We use the biaffine attention mechanism Dozat and Manning (2016) to compute a probability distribution for the dependency head of each word. The child-parent score $\alpha_{ij}$ for the $j$-th word to be the head of the $i$-th word is:

$$\alpha_{ij} = {h_i^{(d)}}^\top W h_j^{(h)} + U^\top h_i^{(d)} + V^\top h_j^{(h)} + b$$

where $h_i^{(d)}$ is the dependent representation of the $i$-th word obtained by putting its representation $h_i$ through a one-layer perceptron. Likewise, $h_j^{(h)}$ is the head representation of the $j$-th word obtained by putting its representation $h_j$ through a separate one-layer perceptron. The matrices $W$, $U$ and $V$, together with the bias $b$, are learned parameters.
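The biaffine scoring can be computed for all word pairs at once, as in the sketch below. It takes the dependent and head representations as given (the two perceptrons are omitted), treats `U` and `V` as vectors and `b` as a scalar bias, and omits any special handling of the root; these choices and the function names are assumptions of the sketch.

```python
import numpy as np


def biaffine_head_scores(dep, head, W, U, V, b):
    """Child-parent scores for every pair of words.

    dep:  (n, p) dependent representations, one row per word
    head: (n, p) head representations, one row per word
    W: (p, p), U: (p,), V: (p,), b: scalar bias
    Returns an (n, n) array where entry [i, j] scores word j as head of word i.
    """
    bilinear = dep @ W @ head.T            # dep[i]^T W head[j] for all pairs
    dep_term = (dep @ U)[:, None]          # depends only on the dependent i
    head_term = (head @ V)[None, :]        # depends only on the candidate head j
    return bilinear + dep_term + head_term + b


def head_probabilities(scores):
    """Softmax over candidate heads (columns) for each dependent (row)."""
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```

Each row of `head_probabilities(scores)` is a distribution over the candidate heads of one word.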

The model trains on dependency parsing by minimizing the negative log-likelihood of the correct dependency tree. The loss function is cross-entropy:

$$L_d = -\log\left(P(h_i \mid d_i)\,P(l_i \mid d_i, h_i)\right)$$

where $h_i$ is the correct head for dependent $d_i$, $P(h_i \mid d_i)$ is the probability that $h_i$ is the head of $d_i$, and $P(l_i \mid d_i, h_i)$ is the probability of the correct dependency label $l_i$ for the child-parent pair $(d_i, h_i)$.
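As an illustration, the dependency loss can be computed from unnormalized head and label scores as follows. The sketch sums the loss over the words of one sentence and assumes that label scores are already computed for each word paired with its gold head; the function names and array layouts are assumptions.

```python
import numpy as np


def log_softmax(x):
    """Numerically stable log-softmax over the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def dependency_loss(head_scores, label_scores, gold_heads, gold_labels):
    """Negative log-likelihood of the gold heads and dependency labels.

    head_scores:  (n, m) score of each of m candidate heads for each word
    label_scores: (n, num_labels) label scores for each word and its gold head
    gold_heads:   (n,) index of the correct head of each word
    gold_labels:  (n,) index of the correct dependency label of each word
    """
    rows = np.arange(len(gold_heads))
    log_p_head = log_softmax(head_scores)[rows, gold_heads]
    log_p_label = log_softmax(label_scores)[rows, gold_labels]
    return -(log_p_head + log_p_label).sum()
```

With uniform scores over 3 candidate heads and 4 labels, each word contributes $\log 3 + \log 4 = \log 12$ to the loss.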

3.4 Decoder

The model jointly trains on constituency and dependency parsing by minimizing the sum of the constituency loss $L_c$ and the dependency loss $L_d$:

$$L = L_c + L_d$$


The decoder is a CKY-style Kasami (1966); Younger (1967); Cocke (1969); Stern et al. (2017) algorithm, modified by Zhou and Zhao (2019) to include dependency scores.

4 Experiments

| Self-Attention Layers | Position-wise Feed-forward Layer | Residual Dropout | Precision | Recall | F1 | UAS | LAS |
|---|---|---|---|---|---|---|---|
| 2 | Yes | Yes | 96.23 | 96.03 | 96.13 | 97.16 | 96.09 |
| 2 | No | Yes | 84.85 | 84.88 | 84.86 | 87.14 | 82.98 |
| 2 | No | No | 96.33 | 95.98 | 96.16 | 97.30 | 96.17 |

Table 1: Ablation study on our parser with a Label Attention Layer and 2 self-attention layers. The parameters are the Position-wise Feed-forward Layer and Residual Dropout of the Label Attention Layer. The evaluation is done on the Penn Treebank test set.

| Self-Attention Layers | Position-wise Feed-forward Layer | Residual Dropout | Precision | Recall | F1 | UAS | LAS |
|---|---|---|---|---|---|---|---|
| 2 | Yes | Yes | 96.23 | 96.03 | 96.13 | 97.16 | 96.09 |
| 3 | Yes | Yes | 96.47 | 96.20 | 96.34 | 97.33 | 96.29 |
| 3 | No | No | 96.30 | 95.92 | 96.11 | 97.12 | 95.94 |
| 12 | Yes | Yes | 96.27 | 96.06 | 96.16 | 97.24 | 96.14 |

Table 2: Performance on the Penn Treebank test set of our LAL parser according to the number of self-attention layers.

We evaluate our model on the English Penn Treebank Marcus et al. (1993) benchmark dataset, and use the Stanford tagger Toutanova et al. (2003) to predict part-of-speech tags.

For the evaluation, we follow standard practice: using the EVALB algorithm Sekine and Collins (1997) for constituency parsing, and reporting results without punctuation for dependency parsing.

4.1 Setup

In our experiments, the Label Attention Layer has 128 dimensions for the query, key and value vectors, as well as for the output vector of each label attention head. For the dependency and span scores, we use the same hyperparameters as Zhou and Zhao (2019). We use the large cased pre-trained XLNet Yang et al. (2019) as our embedding model. We use a batch size of 100 and each model is trained on a single 32GB GPU.

4.2 Ablation Study

As shown in Figure 6, our Label Attention Layer is interpretable only if there is no position-wise feed-forward layer. We investigate the impact of removing this component from the LAL.

We show the results of our ablation study on the Residual Dropout and Position-wise Feed-forward Layer in Table 1. The second row shows that the performance of our parser decreases sharply when removing the Position-wise Feed-forward Layer and keeping the Residual Dropout: F1 falls from 96.13 to 84.86, and LAS from 96.09 to 82.98. However, that performance is recovered when removing the Residual Dropout as well, as shown in the last row (96.16 F1, 97.30 UAS, 96.17 LAS). In fact, removing both the Residual Dropout and the Position-wise Feed-forward Layer is the best-performing option for an LAL parser with 2 self-attention layers.

4.3 Self-Attention Layers

Table 2 shows the results of our experiments varying the number of self-attention layers in our parser’s encoder. The best-performing option is 3 layers (96.34 F1, 97.33 UAS, 96.29 LAS). Unlike in the 2-layer setting, removing the position-wise feed-forward layer and residual dropout with 3 layers decreases performance (96.11 F1, 97.12 UAS, 95.94 LAS).

4.4 State of the Art

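Finally, we compare our results with the state of the art in constituency and dependency parsing. Table 3 compares our results with the parsers of Zhou and Zhao (2019). Our LAL parser establishes new state-of-the-art results, with the largest gains in dependency parsing: compared with the XLNet-based parser of Zhou and Zhao (2019), F1 rises from 96.33 to 96.34, UAS from 97.20 to 97.33, and LAS from 95.72 to 96.29.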

| Model | F1 | UAS | LAS |
|---|---|---|---|
| HPSG parser (Zhou and Zhao, 2019) + 12 self-attention layers + BERT | 95.84 | 97.00 | 95.43 |
| HPSG parser (Zhou and Zhao, 2019) + 12 self-attention layers + XLNet | 96.33 | 97.20 | 95.72 |
| LAL parser (ours) + 3 self-attention layers + XLNet | 96.34 | 97.33 | 96.29 |

Table 3: Comparison of the performance for Constituency and Dependency Parsing of our Label Attention Layer (LAL) parser and the HPSG parser of Zhou and Zhao (2019) on the Penn Treebank test set.

5 Related Work

Before the HPSG-based model of Zhou and Zhao (2019), the state of the art in constituency parsing was held by Kitaev and Klein (2018), who use an encoder-decoder parser. The novelty of Kitaev and Klein (2018) is in their self-attentive encoder, where they stack multiple levels of self-attention to embed words. The resulting word embeddings are fed into a decoder, which they borrow from Stern et al. (2017).

Gaddy et al. (2018) use a bidirectional LSTM to compute forward and backward representations of each word. A contiguous span of words from position $i$ to position $j$ of the form $(w_i, \ldots, w_j)$ therefore has forward representations $(f_i, \ldots, f_j)$ and backward representations $(b_i, \ldots, b_j)$. The scores for this span to be attributed a non-terminal are computed as the following output vector, as per Stern et al. (2017):

$$S(i, j) = W_2\,\mathrm{ReLU}(W_1 s_{ij} + z_1) + z_2$$

where $z_1$ and $z_2$ are biases, $S(i, j)$ is a vector of dimensionality $|L|$ (the number of possible nonterminal labels), and $s_{ij}$ is computed as:

$$s_{ij} = \left[f_j - f_{i-1} \,;\, b_i - b_{j+1}\right]$$

Therefore, $S(i, j)_l$ indicates the score for the span of words to be labelled with non-terminal $l$. These scores are then used in a CKY-style algorithm that produces the final parse tree.

Kitaev and Klein (2018) redefine $s_{ij}$ as the following:

$$s_{ij} = \left[\overrightarrow{y_j} - \overrightarrow{y_{i-1}} \,;\, \overleftarrow{y_{j+1}} - \overleftarrow{y_i}\right]$$

where $y$ is the output of the encoder layer, i.e. the sum of all self-attention layers of the encoder.

None of these papers provided visualisations of their learned attention distributions, but Kitaev and Klein (2018) and Gaddy et al. (2018) do extensive interpretation studies, using ablation and probing of linguistic theories.

6 Conclusions and Future Work

In this paper, we introduce a revised form of self-attention: the Label Attention Layer. In our proposed architecture, attention heads represent labels. We have only one learned vector as query, rather than a matrix, thereby reducing the number of parameters per attention head. We incorporate our Label Attention Layer into the HPSG parser of Zhou and Zhao (2019) and obtain state-of-the-art results on the English Penn Treebank benchmark dataset. Our results show 96.34 F1 score for constituency parsing, and 97.33 UAS and 96.29 LAS for dependency parsing.

In future work, we want to investigate the interpretability of the Label Attention Layer, notably through the label-to-word attention distributions and the contributions of each label attention head. We also want to incorporate it in more self-attentive NLP models for other tasks.