Self-attention networks (SANs) have recently become a popular research topic in the speech recognition community [Dong2018Speech-transformer, zhou2018syllable, pham2019veryDeep, povey2018time, zeyer2019comparison]. Previous studies showed that SANs can yield better speech recognition results than recurrent neural networks (RNNs) [wang2019transformerHybrid, karita2019comparative].
RNNs are the conventional models for sequential data. However, because of vanishing gradients, it is difficult for RNNs to model long-range dependencies [bengio1994learning]. Although gated structures, such as Long Short-Term Memory (LSTM) [hochreiter1997LSTM] and the Gated Recurrent Unit (GRU) [chung2014GRU], have been proposed to alleviate this problem, capturing temporal relationships across a wide time span remains challenging for these models. In SANs, self-attention layers encode contextual information through attention mechanisms [bahdanau2015neural, luong2015effective, vaswani2017attention]. With the attention mechanism, when learning the hidden representation for each time step of a sequence, a self-attention layer has a global view of the entire sequence and can thus capture temporal relationships without any limit on range. This is believed to be a key factor in the success of SANs [vaswani2017attention].
In this paper we study Transformers [vaswani2017attention], end-to-end SAN-based models with two components: an encoder and a decoder. The encoder uses self-attention layers to encode input sequences. At each decoding time step, the decoder generates the current output by attending to the encoded input sequence and to the outputs generated at earlier time steps. In attention-based RNN end-to-end models [luong2015effective, bahdanau2015neural], by comparison, an RNN encoder encodes the input sequence and an RNN decoder interacts with the encoded input sequence through an attention layer to produce outputs.
Previous work on attention-based RNN end-to-end models has shown that for speech recognition, since acoustic events usually happen in a left-to-right order within small time spans, restricting the attention to be monotonic along the time axis improves the model’s performance [tjandra2017local, kim2017joint, zhang2019windowed]. This appears to contradict the explanation given for the success of SANs: if the global view provided by the attention module of self-attention layers is beneficial, then why does forcing the attention mechanism to focus on local information result in performance gains for RNN end-to-end models?
To investigate this, we explore replacing the upper (further from the input) self-attention layers of the Transformer’s encoder with feed forward layers. We ran extensive experiments on a read speech corpus, Wall Street Journal (WSJ) [paul1992wsj], and a conversational telephone speech corpus, Switchboard (SWBD) [godfrey1992switchboard]. We found that replacing the upper self-attention layers with feed forward layers does not increase error rates; it even leads to small accuracy improvements. Since a feed forward layer can be viewed as a pure “monotonic left-to-right diagonal attention”, this observation does not contradict the previous studies on RNN-based end-to-end models which restrict the attention to be diagonal. Rather, it indicates that the inputs to the upper layers of the Transformer encoder have already encoded enough contextual information, so learning further temporal relationships through self-attention is not helpful. These experiments do not nullify the benefits of the self-attention layers either: the range of learned context increases from the bottom up through the self-attention layers, and it is important for the lower layers to encode the context information well. Only when the lower layers have captured sufficient contextual information does the attention mechanism become redundant for the upper layers.
It should be noted that for the attention-based RNN models, the attention mechanism interacts with both the decoder and the encoder. Since an output unit (e.g. a character) is often related to a small time span of acoustic features, the attention needs to attend to a small window of elements in the encoded input sequence in a left-to-right order. In this work we study the self-attention encoder, which learns the hidden representation for each time step of the input sequence. Thus, feed forward layers, which can be viewed as “left-to-right attention without looking at the context”, are sufficient for learning a further abstract representation of the frame at the current time step once the temporal relationships among input frames have been well captured.
Our observations also have practical value. Compared to self-attention layers, feed forward layers have fewer parameters. Furthermore, without parallel computation, the time complexity of a self-attention layer is $O(n^2)$, where $n$ is the length of the input sequence, whereas a feed forward layer processes each time step independently. Replacing self-attention layers with feed forward layers therefore also reduces the time complexity.
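The parameter and complexity argument can be made concrete with a small counting sketch. It is only illustrative: the layer sizes below are hypothetical placeholders rather than the values used in the experiments, the count assumes the heads together span the full model dimension, and biases and layer normalisation are ignored.

```python
def self_attention_layer_params(d_model: int, d_ff: int) -> int:
    """Weights of one self-attention encoder layer (biases, layer norm ignored)."""
    attention = 4 * d_model * d_model   # query, key, value and output projections
    feed_forward = 2 * d_model * d_ff   # the two linear maps of the feed forward module
    return attention + feed_forward


def feed_forward_layer_params(d_model: int, d_ff: int) -> int:
    """Weights of one feed forward encoder layer (biases, layer norm ignored)."""
    return 2 * d_model * d_ff


def attention_score_ops(n: int, d_model: int) -> int:
    """Multiply-adds for the n x n attention scores plus the weighted sums."""
    return 2 * n * n * d_model


d_model, d_ff = 256, 2048  # hypothetical sizes, not taken from the paper
saved = self_attention_layer_params(d_model, d_ff) - feed_forward_layer_params(d_model, d_ff)
ratio = attention_score_ops(1000, d_model) // attention_score_ops(500, d_model)
print("weights removed per replaced layer:", saved)
print("attention cost ratio when the sequence length doubles:", ratio)
```

Under these assumptions, each replaced layer drops the four attention projection matrices, and the attention cost that is removed is the part that grows quadratically with the sequence length.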
2 Related Work
Self-attention and its multi-head attention module [vaswani2017attention], which uses multiple attention mechanisms to encode context, are key components of Transformers. Previous works have analysed the importance of these modules. In a self-attention layer, a single-layer feed forward module is stacked on the multi-head attention module. Irie et al. [irie2020how] extend the single-layer feed forward module to a multi-layer module, arguing that it brings more representation power, and show that a SAN with fewer of these modified self-attention layers suffers only minor performance drops compared to a SAN with a larger number of the original self-attention layers. With fewer layers, however, the models with the modified self-attention layers have fewer parameters and a shorter decoding time. In this work we study the effect of the stacked context among the self-attention layers of the encoder. We do not change the architecture of the self-attention layers; instead, we replace the upper self-attention layers in the Transformer encoder with feed forward layers.
Michel et al. [michel2019are] remove a proportion of the heads in the multi-head attention of each self-attention layer in Transformers, and find that this leads to only minor performance drops. This implies that the benefits of multi-head attention mainly come from the training dynamics. In our work, instead of removing some attention heads, we replace entire self-attention layers with feed forward layers to investigate how the self-attention layers encode the speech signal.
When the upper self-attention layers are replaced with feed forward layers, the architecture of the encoder is similar to the CLDNN (Convolutional, Long Short-Term Memory, Deep Neural Network) [sainath2015CLDNN]. The CLDNN uses an LSTM to model the sequential information and a deep neural network (DNN) to learn a further abstract representation for each time step. Stacking a DNN on an LSTM results in a notable error rate reduction compared to pure LSTM models. While we found that the upper self-attention layers of the Transformer encoder can be replaced with feed forward layers, stacking more feed forward layers does not result in further performance gains. The main goal of this work is to understand the self-attention encoder.
3 Model Architecture
In this section we describe multi-head attention, self-attention layers, and the self-attention encoder module in Transformers. Then, we introduce the replacement of the upper self-attention layers of the encoder by feed forward layers.
3.1 Multi-head Attention
Multi-head attention uses attention mechanisms to encode sequences [vaswani2017attention]. We first consider a single attention head. The input sequences to the attention mechanism are mapped to a query sequence $Q$, a key sequence $K$ and a value sequence $V$. These sequences have the same dimension, and $K$ and $V$ have the same length. For the $i$-th element $Q[i]$ of $Q$, an attention vector is generated by computing the similarity between $Q[i]$ and each element of $K$. Using the attention vector as weights, the output is a weighted sum over the value sequence $V$. Thus, an attention head of the multi-head attention can be described as:
$$(Q, K, V) = (X^Q W^Q,\; X^K W^K,\; X^V W^V)$$
$$\text{head} = \text{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$
where $X^Q \in \mathbb{R}^{n \times d_m}$ and $X^K, X^V \in \mathbb{R}^{m \times d_m}$ are inputs and $m, n$ denote the lengths of the input sequences; $W^Q, W^K, W^V \in \mathbb{R}^{d_m \times d_k}$ are trainable matrices. The three input sequences can be the same sequence, e.g., the speech signal to be recognised. The multi-head attention uses $h$ attention heads and a trainable matrix $U \in \mathbb{R}^{h d_k \times d_m}$ to combine the outputs of each attention head:
$$\text{MHA}(X^Q, X^K, X^V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, U$$
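A minimal sketch of this computation in plain Python may help make the data flow concrete. It assumes the standard scaled dot-product form of [vaswani2017attention]; the helper names (`matmul`, `attention_head`, `multi_head_attention`) and the toy numbers are illustrative and are not taken from the authors' implementation.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(x_q, x_k, x_v, w_q, w_k, w_v):
    """One head: project the inputs, score queries against keys, average the values."""
    q, k, v = matmul(x_q, w_q), matmul(x_k, w_k), matmul(x_v, w_v)
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]  # one attention vector per query
    return matmul(weights, v), weights


def multi_head_attention(x_q, x_k, x_v, heads, u):
    """Concatenate the head outputs along the feature axis, then apply U."""
    outputs = [attention_head(x_q, x_k, x_v, *params)[0] for params in heads]
    concat = [sum((out[i] for out in outputs), []) for i in range(len(x_q))]
    return matmul(concat, u)


# Toy self-attention example: 3 frames, model dimension 2, two heads of dimension 1.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
head_a = ([[1.0], [0.0]],) * 3  # (W^Q, W^K, W^V) for the first head
head_b = ([[0.0], [1.0]],) * 3  # (W^Q, W^K, W^V) for the second head
u = [[1.0, 0.0], [0.0, 1.0]]
out = multi_head_attention(x, x, x, [head_a, head_b], u)
print(len(out), len(out[0]))
```

Passing the same sequence as query, key and value input, as in the toy call, is the self-attention case used in the encoder.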
3.2 Self-attention Encoder
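The self-attention encoder in a Transformer is a stack of self-attention layers. The $j$-th layer reads the output sequence $X^{j-1}$ from its lower layer and uses multi-head attention to process the input sequence. That is, $X^Q = X^K = X^V = X^{j-1}$. The multi-head attention only contains linear operations. Thus, in a self-attention layer, a non-linear feed forward layer is stacked on the multi-head attention module. Following the standard Transformer formulation [vaswani2017attention], a self-attention layer in the encoder of a Transformer can be described as:
$$X' = \text{LayerNorm}\left(\text{MHA}(X^{j-1}, X^{j-1}, X^{j-1}) + X^{j-1}\right)$$
$$X^{j} = \text{LayerNorm}\left(\text{ReLU}(X' W_1 + b_1)\, W_2 + b_2 + X'\right)$$
where $X^{j-1} \in \mathbb{R}^{n \times d_m}$, $W_1 \in \mathbb{R}^{d_m \times d_{ff}}$, $W_2 \in \mathbb{R}^{d_{ff} \times d_m}$, $b_1 \in \mathbb{R}^{d_{ff}}$ and $b_2 \in \mathbb{R}^{d_m}$. $W_1$, $W_2$, $b_1$ and $b_2$ are trainable matrices and vectors.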
3.3 Feed Forward Upper Encoder Layers
We argue that for the encoder, the upper self-attention layers can be replaced with feed forward layers. In the encoder, since each self-attention layer learns contextual information from its lower layer, the span of the learned context increases from the lower layers to the upper layers. Since acoustic events often happen within small time spans in a left-to-right order, if the inputs to the upper layers have encoded a sufficiently large span of context, then it is unnecessary for the upper layers to learn further temporal relationships. Thus, the multi-head attention module, which extracts the contextual information, may be redundant, and the self-attention layer is not essential. If the upper layers of the encoder are nevertheless self-attention layers, and the lower layers have already seen a sufficiently wide context, then, since no further contextual information is needed, the attention mechanism will focus on a narrow range of inputs. Assuming that acoustic events often happen left-to-right, the attention matrix will tend to be diagonal. In that case the output of the multi-head attention at each time step depends essentially only on the input at that time step, so self-attention is not helpful, and replacing these layers with feed forward layers will not lead to a performance drop. The architecture of the feed forward layers is:
$$X^{j} = \text{LayerNorm}\left(\text{ReLU}(X^{j-1} W_1 + b_1)\, W_2 + b_2 + X^{j-1}\right)$$
with trainable $W_1$, $W_2$, $b_1$ and $b_2$, as in the feed forward module of a self-attention layer.
Figure 1 illustrates the architecture of a self-attention layer and a feed forward layer. Furthermore, the feed forward layer can be viewed as a self-attention layer with a diagonal attention weight matrix: in the attention weight matrix, the elements on the diagonal are ones and all other elements are zeros.
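The diagonal-attention view can be checked with a few lines of plain Python. This is a sketch of the weighted-sum step only, with made-up frame values and hypothetical helper names: with an identity attention matrix each output frame depends only on the input frame at the same position, whereas with a non-diagonal (here uniform) attention matrix a change in one frame spreads to all outputs.

```python
def weighted_sum(weights, values):
    """Row i of the result is the sum of the value vectors weighted by weights[i]."""
    dim = len(values[0])
    return [[sum(w * v[c] for w, v in zip(row, values)) for c in range(dim)]
            for row in weights]


def identity(n):
    """Diagonal attention: ones on the diagonal, zeros elsewhere."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def uniform(n):
    """Global attention that weights every position equally."""
    return [[1.0 / n] * n for _ in range(n)]


frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
changed = [[1.0, 2.0], [3.0, 4.0], [50.0, 60.0]]  # only the last frame differs

diag_before = weighted_sum(identity(3), frames)
diag_after = weighted_sum(identity(3), changed)
unif_before = weighted_sum(uniform(3), frames)
unif_after = weighted_sum(uniform(3), changed)

print("diagonal attention, first frame unchanged:", diag_before[0] == diag_after[0])
print("uniform attention, first frame unchanged:", unif_before[0] == unif_after[0])
```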
4 Experiments and Discussion
4.1 Experimental Setup
We experiment on two datasets: Wall Street Journal (WSJ), which contains 81 hours of read speech training data, and Switchboard (SWBD), which contains 260 hours of conversational telephone speech training data. For WSJ we use the dev93 and eval92 test sets, and for SWBD the eval2000 SWBD/callhome test sets. We use Kaldi [povey2011kaldi] for data preparation and feature extraction: 83-dim log-mel filterbank frames with pitch [ghahremani2014pitch]. The output units for the WSJ experiments are the 26 characters, together with the apostrophe, period, dash, space, noise and sos/eos tokens. The output tokens for the SWBD experiments are obtained using Byte Pair Encoding (BPE) [sennrich2016BPE].
We compare Transformers with different types of encoders. The baseline Transformer encoders comprise self-attention layers only, and are compared with Transformers whose encoders have feed forward layers following the self-attention layers. Each self-attention or feed forward layer is counted as one layer, and encoders with the same number of layers are compared. Except for the number of self-attention and feed forward layers in the encoder, all components of all models have the same architecture. In each model, below the Transformer’s encoder there are two convolutional neural network layers with 256 channels, stride size 2 and kernel size 3, which map the dimension of the input sequence to. The multi-head attention components of the self-attention layers have 4 attention heads and . For the feed forward module of the self-attention layers as well as the proposed feed forward encoder layers, . Dropout rate is used when dropout is applied. The decoder of the Transformer has 6 layers. The input sequences to the encoder and the decoder are concatenated with sinusoidal positional encoding [vaswani2017attention]. All models are implemented with the ESPnet toolkit [watanabe2018espnet].
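The two stride-2 convolutional layers shorten the sequence before it reaches the encoder, which matters for the cost of self-attention. The sketch below estimates the resulting length under the assumption of unpadded convolutions applied along the time axis; the padding scheme is not stated in the text, so the exact numbers are illustrative and the function names are hypothetical.

```python
def conv_output_length(n: int, kernel: int = 3, stride: int = 2) -> int:
    """Length after one unpadded convolution along the time axis."""
    return (n - kernel) // stride + 1


def encoder_input_length(n_frames: int, num_conv_layers: int = 2) -> int:
    """Length of the sequence seen by the encoder after the convolutional front end."""
    length = n_frames
    for _ in range(num_conv_layers):
        length = conv_output_length(length)
    return length


n_frames = 1000  # hypothetical utterance length in feature frames
print(n_frames, "->", encoder_input_length(n_frames))
```

Under this assumption the encoder sees roughly a quarter of the original frames, so the number of attention scores per layer shrinks by a factor of about sixteen.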
Adam [kingma2015adam] is used as the optimizer. The training schedule (warm-up steps, learning rate decay) follows previous work [Dong2018Speech-transformer]. The batch size is 32. Label smoothing with smoothing weight 0.1 is used. We train each model for 100 epochs and use the average of the parameters from the last 10 epochs as the parameters of the final model [Dong2018Speech-transformer]. Besides the loss from the Transformer’s decoder, $\mathcal{L}_{\text{att}}$, a connectionist temporal classification (CTC) [graves2006connectionist] loss, $\mathcal{L}_{\text{CTC}}$, is also applied to the Transformer encoder [kim2017joint]. With an interpolation weight $\lambda$, set separately for WSJ and for SWBD, the final loss for the model follows the joint formulation of [kim2017joint]:
$$\mathcal{L} = \lambda\, \mathcal{L}_{\text{CTC}} + (1 - \lambda)\, \mathcal{L}_{\text{att}}$$
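Two pieces of this recipe, the interpolated loss and the averaging of parameters over the final epochs, can be sketched as follows. This is a simplified illustration, not the ESPnet implementation: the function names are hypothetical, parameters are stored as plain lists of floats, and placing the weight on the CTC term is an assumption that follows the convention of [kim2017joint].

```python
def joint_loss(loss_att: float, loss_ctc: float, lam: float) -> float:
    """Interpolate the decoder and CTC losses; lam weights the CTC term (assumed)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * loss_ctc + (1.0 - lam) * loss_att


def average_checkpoints(checkpoints):
    """Average parameters, stored as {name: list of floats}, across checkpoints."""
    count = len(checkpoints)
    averaged = {}
    for name in checkpoints[0]:
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(col) / count for col in columns]
    return averaged


# Toy usage with made-up numbers.
print(joint_loss(loss_att=2.0, loss_ctc=4.0, lam=0.25))
last_epochs = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
print(average_checkpoints(last_epochs))
```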
4.2 Experiments on WSJ
For the experiments on WSJ, we first train a baseline model with a 12-layer self-attention encoder. Then, we use this model to decode WSJ dev93 and compute the averaged attention vectors on dev93 generated by the lowest layer (nearest the input), a middle layer and the topmost layer. Figure 2 shows the plots of the averaged attention vectors for each attention head of these layers. The lowest layer attends to a wide range of context. The middle layer puts more attention weight along the diagonal, and the middle two heads of the topmost layer have nearly pure diagonal attention matrices. This implies that, since the range of learned context increases higher up the encoder, the global view becomes less important and the upper layers focus more on local information. Eventually no additional contextual information is needed, so the attention becomes diagonal. For the topmost layer, .
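One simple way to put a number on how diagonal an attention matrix is, offered here as an illustrative measure rather than the analysis used for Figure 2, is the average attention weight that falls within a small band around the diagonal. The function name and the toy matrices are assumptions made for this sketch, and the matrix is assumed to be square, as in self-attention.

```python
def diagonal_mass(attention, band: int = 0) -> float:
    """Average attention weight within `band` positions of the diagonal.

    `attention` is a square matrix (list of rows) whose rows each sum to one.
    """
    total = 0.0
    for i, row in enumerate(attention):
        total += sum(w for j, w in enumerate(row) if abs(i - j) <= band)
    return total / len(attention)


identity_attention = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
uniform_attention = [[0.25] * 4 for _ in range(4)]

print(diagonal_mass(identity_attention))         # purely diagonal attention
print(diagonal_mass(uniform_attention))          # fully global attention
print(diagonal_mass(uniform_attention, band=1))  # widening the band captures more mass
```

A value close to one for a small band indicates attention that is concentrated on the current frame and its immediate neighbours.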
Table 1: Number of Layers | CER
To demonstrate that the range of learned context builds up through the layers and that multi-head attention is redundant for the upper layers of the encoder, we train models whose encoders are built from different numbers of self-attention layers and feed forward layers. The encoder of each of these models has 12 layers in total; the lower layers are self-attention layers and the upper layers are feed forward layers. We start from an encoder with 6 self-attention layers and 6 feed forward layers. Then, we increase the number of self-attention layers and decrease the number of feed forward layers. Table 1 shows that as the number of self-attention layers increases, the character error rate (CER) decreases, which implies that learning further contextual information is beneficial.
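The family of encoders in this sweep can be written down as a simple layout, listed from the input upwards. The sketch below only enumerates the configurations; the function name and the layer labels are hypothetical and do not correspond to ESPnet options.

```python
def encoder_layout(num_self_attention: int, total_layers: int = 12):
    """Layer types from the input upwards: self-attention first, feed forward on top."""
    if not 0 <= num_self_attention <= total_layers:
        raise ValueError("num_self_attention must be between 0 and total_layers")
    num_feed_forward = total_layers - num_self_attention
    return ["self_attention"] * num_self_attention + ["feed_forward"] * num_feed_forward


# Sweep from 6 self-attention + 6 feed forward layers up to the 12-layer baseline.
for k in range(6, 13):
    layout = encoder_layout(k)
    print(k, "self-attention +", layout.count("feed_forward"), "feed forward")
```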
However, when the number of self-attention layers increases to 10, with 2 upper feed forward layers, the encoder gives almost identical results to the 12-layer self-attention baseline, although an encoder with only 10 self-attention layers has notably higher CERs. Furthermore, although the 11-layer self-attention encoder gives worse results than the 12-layer baseline, the encoder which has 11 self-attention layers and one upper feed forward layer yields the best results. Increasing the number of self-attention layers to 12 and decreasing the number of feed forward layers to 0 is harmful. This set of experiments shows that the temporal relationships are well captured by 10 or 11 self-attention layers and that further contextual information is unnecessary. Thus, feed forward layers are sufficient for learning further hidden representations. If an additional self-attention layer on top of the 11 self-attention layers learns pure diagonal attention matrices, then it can be viewed as a feed forward layer. However, learning the diagonal attention is hard and redundant, whereas the feed forward layers guarantee “diagonal attention”.
We tested whether stacking more feed forward layers to make deeper encoders is beneficial. However, as shown in Table 1, this does not give performance gains. We also tried modifying the architecture of the stacked feed forward layers, such as removing residual connections or using an identity mapping [he2016identity]. However, these modifications were not helpful and we did not observe any CER reduction compared to the encoder with 11 self-attention layers and 1 feed forward layer.
4.3 Experiments on SWBD
We further test our argument on a larger and more complicated dataset, SWBD. The results are shown in Table 2. The encoder with 10 self-attention layers has worse results than the encoders with 11 and 12 self-attention layers. Also, the 12-layer self-attention encoder has higher word error rates (WERs) than the 11-layer encoder. However, the encoder with 10 self-attention layers and 2 feed forward layers, which has 12 layers in total, gives the lowest WERs. The encoder with 9 self-attention layers and 3 feed forward layers yields worse WERs. Thus, the 10th self-attention layer is crucial for learning contextual information. On top of the 10th self-attention layer, feed forward layers are sufficient for learning further abstract representations.
Table 2: Number of Layers | WER
5 Conclusion
In this paper, based on the argument that acoustic events often happen in small time spans with a strict left-to-right ordering, and that the encoded context increases from the lowest to the highest self-attention layer of the Transformer encoder, we propose that for speech recognition, replacing the upper self-attention layers with feed forward layers will not harm the model’s performance. Our experiments on WSJ and SWBD confirm this: replacing the upper self-attention layers with feed forward layers even gives small error rate reductions. Future work includes investigating this idea in other domains (e.g., natural language processing) and developing novel network architectures based on this observation.
Acknowledgements: Supported by a PhD Studentship funded by Toshiba Research Europe Limited (SZ), and EPSRC Project SpeechWave, EP/R012180/1 (EL, PB, SR).