The sequence-to-sequence (S2S) approach [sutskever2014sequence] has achieved remarkable results in automatic speech recognition (ASR), in particular, large vocabulary continuous speech recognition (LVCSR) [chorowski2015attention, bahdanau2016end, chorowski2016towards, zhang2017very, kim2017joint, chan2016listen, huang2019exploring]. Unlike the conventional hybrid approach, S2S ASR requires neither lexicons nor prerequisite models or decision trees. It jointly optimizes the acoustic and language models and directly learns the mapping from speech to text.
The most commonly used structure in S2S approaches is the attention-based encoder-decoder model (AED) [AED]. This model maps input feature sequences to output character sequences and has been widely used in ASR tasks [chorowski2014end, chan2016listen, Speech-transformer]. Among them, the Listen, Attend and Spell (LAS) [chiu2018state] structure has shown performance superior to a conventional hybrid system when trained on large amounts of data. LAS uses a pyramidal recurrent neural network (RNN) encoder to convert low-level speech signals into higher-level acoustic representations, and the relationship between these representations and the targets is learned by an attention mechanism in the RNN-based decoder. However, due to the sequential nature of RNNs, the LAS model cannot parallelize its computations, which hinders training on large datasets.
To remedy this problem, new encoder-decoder structures based on self-attention networks have recently been proposed [Transformer]. With self-attention, these structures not only effectively capture global interactions between sequences [yang2018modeling] and learn direct dependencies between long-distance elements [hochreiter2001gradient], but also support parallelized model training [Transformer]. These structures are now widely used in a variety of machine learning tasks, providing significant improvements. Vaswani et al. [Transformer] first proposed an S2S self-attention-based model called the Transformer, which achieves state-of-the-art performance on the WMT 2014 English-to-French translation task at a remarkably lower training cost. For the ASR task, the Transformer also uses the AED structure. Unlike LAS, the Transformer uses multi-head self-attention (MHA) sub-layers to learn the source-target relationship and to capture the mutual information within the sequences, extracting the most effective high-level features. This enables Transformer-based ASR systems to achieve competitive performance over conventional hybrid and other end-to-end approaches [Speech-transformer, zhou2019improving, pham2019very, li2019speechtransformer, li2019improving, zhou2018syllable, zhou2018comparison].
Inspired by the Transformer and layer-wise coordination [Layer-wise], we propose a novel decoder structure that features a self-and-mixed attention decoder (SMAD) with a deep acoustic structure (DAS) to improve the acoustic representation of Transformer-based LVCSR. With reference to the standard Speech-Transformer in [Speech-transformer], several improvements are made to the decoder.
In the Speech-Transformer decoder, the linguistic information is first extracted using a self-attention sub-layer, and then processed together with the encoder output in another source-target attention sub-layer. The same encoder output is repeatedly taken by every decoder layer to establish the acoustic-target relationship. In this paper, we propose a new attention block, called self-and-mixed attention (SMA), as a unified attention sub-layer in the decoder, which takes the concatenation of the encoded acoustic representation and the word label embedding as input. In this way, the acoustic and linguistic information is projected into the same subspace in the deep decoder network during the attention calculation.
Furthermore, our decoder learns the acoustic and linguistic information together in a layer-by-layer fashion with the SMA mechanism, instead of repeatedly using the same acoustic representations in each decoder layer. This is motivated by two intuitions: 1) we hope to benefit from a deep decoder network structure that encodes multiple levels of abstraction from both acoustic and linguistic representations, and 2) we hypothesize that a shared acoustic and linguistic embedding space will help the network learn the association between acoustic and linguistic information and improve their alignments.
We will introduce the Transformer-based ASR as the prior work in Section 2, and discuss the details of the new decoder in Section 3.
2 Transformer-based ASR
2.1 Encoder-Decoder with Attention
The Transformer model [Transformer] uses an encoder-decoder structure similar to many other neural sequence transduction models. The encoder can be regarded as a feature extractor that converts the input feature sequence $\mathbf{x}$ into a high-level representation $\mathbf{h}$. Given $\mathbf{h}$, the decoder generates the prediction sequence one token at a time in an auto-regressive manner. In an ASR task [Speech-transformer, Transformer-CTC], tokens are usually modeling units such as phones, characters or sub-words.
The encoder has $N$ layers, each of which contains two sub-layers: a multi-head self-attention and a position-wise fully connected feed-forward network (FFN). Similar to the encoder, the decoder is also composed of a stack of $M$ identical layers. In addition to the two sub-layers, each decoder layer has a third sub-layer between the FFN and the MHA, which performs multi-head source-target attention over the output representation of the encoder stack.
2.2 Multi-Head Attention
Multi-head attention is the core module of the Transformer model. Unlike single-head attention, MHA can learn the relationships between queries, keys and values from different subspaces. It computes the "Scaled Dot-Product Attention" with the following form:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
where $Q \in \mathbb{R}^{t_q \times d_q}$ is the query, $K \in \mathbb{R}^{t_k \times d_k}$ is the key and $V \in \mathbb{R}^{t_v \times d_v}$ is the value; $t_q$, $t_k$, $t_v$ are the lengths of the inputs and $d_q$, $d_k$, $d_v$ are the dimensions of the corresponding elements. To prevent the softmax from being pushed into regions of extremely small gradients when the dimension is large, the factor $\sqrt{d_k}$ is used to scale the dot products.
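The scaled dot-product attention described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation; the function and variable names are our own:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (t_q, t_k) similarity matrix
    if mask is not None:
        scores = scores + mask        # additive mask: 0 (keep) or -inf (block)
    return softmax(scores) @ V        # (t_q, d_v) weighted sum of values
```

Each output row is a convex combination of the value rows, with weights given by the softmax over the scaled query-key similarities.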
In order to compute attention over multiple subspaces, multi-head attention is constructed as follows:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^{O}$$
$$\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$$
where $W^{O} \in \mathbb{R}^{hd_v \times d_{model}}$ is the output projection matrix, $W_i^{Q} \in \mathbb{R}^{d_{model} \times d_q}$, $W_i^{K} \in \mathbb{R}^{d_{model} \times d_k}$ and $W_i^{V} \in \mathbb{R}^{d_{model} \times d_v}$, $d_{model}$ is the dimension of the input vector to the encoder, and $h$ is the number of heads. In each attention head, $Q$, $K$ and $V$ are projected to $d_q$, $d_k$ and $d_v$ dimensions through three linear projection layers, respectively. After performing the $h$ attentions, the outputs are concatenated and projected again to obtain the final values.
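The multi-head construction can be sketched as follows; this is a minimal NumPy illustration under the common assumption $d_q = d_k = d_v = d_{model}/h$, with hypothetical parameter names:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(Q, K, V, params):
    # head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V); the head outputs are
    # concatenated and projected by W^O.
    heads = []
    for Wq, Wk, Wv in zip(params["Wq"], params["Wk"], params["Wv"]):
        q, k, v = Q @ Wq, K @ Wk, V @ Wv          # project into one subspace
        scores = q @ k.T / np.sqrt(q.shape[-1])   # scaled dot products
        heads.append(softmax(scores) @ v)
    return np.concatenate(heads, axis=-1) @ params["Wo"]

d_model, h = 16, 4
d_k = d_model // h
params = {
    "Wq": [rng.standard_normal((d_model, d_k)) for _ in range(h)],
    "Wk": [rng.standard_normal((d_model, d_k)) for _ in range(h)],
    "Wv": [rng.standard_normal((d_model, d_k)) for _ in range(h)],
    "Wo": rng.standard_normal((h * d_k, d_model)),
}
x = rng.standard_normal((6, d_model))  # 6 positions; self-attention: Q = K = V = x
y = multi_head_attention(x, x, x, params)
```

Because each head projects into a different learned subspace, the heads can attend to different aspects of the sequence before the final output projection recombines them.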
2.3 Positional Encoding
Unlike an RNN, MHA contains no recurrence or convolution, so it cannot model the order of the input acoustic sequence. We follow the idea of adding a "positional encoding" to the input, as described in [Transformer].
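The sinusoidal positional encoding of [Transformer] can be computed as below; this is a sketch of the published formulation, not code from the paper:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding:
    PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]       # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions
    pe[:, 1::2] = np.cos(angles)               # odd dimensions
    return pe
```

The resulting matrix is simply added to the (scaled) input embeddings, injecting absolute position information without any learned parameters.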
3 Self-and-Mixed Attention Decoder
Figure 1 shows the encoder-decoder architecture for ASR. We adopt the same encoder as in the Transformer, and propose a Self-and-Mixed Attention Decoder (SMAD), which is an attention-based auto-regressive structure. The main difference between SMAD and the standard Transformer decoder [Speech-transformer] lies in the way we integrate the acoustic representations $\mathbf{h}$.
Firstly, unlike the decoder in the Transformer, which feeds the same $\mathbf{h}$ repeatedly to every decoder layer, SMAD employs a deep acoustic structure (DAS), an $M$-layer network that captures multiple levels of acoustic abstraction. For simplicity, we use a single-head self-and-mixed attention in Figure 2 to illustrate the Self-and-Mixed MHA component in the decoder layer of Figure 1, where the self-attention handles the acoustic representations, and the mixed attention handles the acoustic-target alignment. With the decoder layers stacked in a serial pipeline, the flow of acoustic information in Figure 2 (green) forms a deep acoustic structure.
Secondly, the decoder in the standard Transformer uses a self-attention module to learn the current target representation based on the previous tokens, and learns the acoustic-target dependencies using a separate source-target attention sub-layer. In SMAD, however, we merge these two attentions into one, as illustrated in Figure 2. We concatenate the encoded acoustic representation and the linguistic targets to form a joint embedding as the input to the decoder layer. After the self-and-mixed MHA, the concatenated representation carrying both acoustic and linguistic information is fed to the FFN and the next self-and-mixed MHA. Since the information flow in the proposed decoder contains two modalities, we also employ modality-specific residual connections and position-wise feed-forward networks to separate linguistic and acoustic information before obtaining the posterior probability at the output of the softmax layer, preceded by a linear layer.
We perform the same downsampling as in [Transformer-CTC] before the encoder, using two CNN layers with stride 2 to reduce the GPU memory occupation and the length of the input sequence.
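The effect of two stride-2 CNN layers on the sequence length can be sketched as below; the kernel size of 3 is our assumption for illustration (the paper does not state it here):

```python
def conv_subsample_length(T, n_layers=2, kernel=3, stride=2):
    """Time length after stacked stride-2 conv layers (no padding),
    giving roughly 4x frame-rate reduction before the encoder."""
    for _ in range(n_layers):
        T = (T - kernel) // stride + 1
    return T
```

With 10 ms input frames, the encoder then operates at roughly 40 ms resolution, which cuts both memory use and self-attention cost (quadratic in sequence length).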
3.2 Self-and-Mixed Attention (SMA)
For simplicity, we take one head of the self-and-mixed MHA in Figure 1 as an example. The SMA consists of two independent attention mechanisms: a self-attention for the acoustic-only representation, and a mixed attention to learn the linguistic representation and the acoustic-target association. As shown in Figure 2, the source refers to the acoustic representation (marked in green) and the target refers to the linguistic information (marked in yellow).
Specifically, for the self-attention in the SMA, $Q$, $K$ and $V$ are projected by $W^{Q}$, $W^{K}$ and $W^{V}$, respectively, from the acoustic representation of the previous layer. The acoustic hidden representation in the current layer is generated from the accumulated acoustic information in the previous layer using the self-attention mechanism.
For the mixed attention, a linguistic token in the current layer is generated from the acoustic hidden representation and the preceding linguistic tokens in the previous layer. The mixed attention is formulated as follows:
$$\mathrm{MixedAttention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}} + M\right)V$$
where $Q$ is projected from the linguistic representation, $K$ and $V$ are projected from the concatenation of the acoustic and linguistic representations, $M$ is the mask matrix, $n$ is the number of tokens, and $i$ and $j$ refer to the row and column indices of $M$. To project the acoustic and linguistic information into the same subspace, we concatenate the two representations and apply the same projection matrices for $K$ and $V$. We use the acoustic representation of the entire sentence for the decoding of tokens; at the same time, we introduce the mask $M$ to ensure that the prediction of the token sequence is causal, i.e., when predicting a token, we only use the information of the tokens before it. When $M_{ij}$ equals $-\infty$, the corresponding position in the softmax output approaches zero, which prevents position $i$ from attending to position $j$.
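One way to build such a mask is sketched below: target positions may attend to every acoustic frame, but only causally to target positions. This is our reading of the mechanism, with hypothetical names, not the authors' code:

```python
import numpy as np

def sma_mixed_attention_mask(t_src, t_tgt):
    """Additive mask over the concatenated [acoustic; linguistic] key
    sequence: each of the t_tgt target queries may attend to all t_src
    acoustic frames, but only to target positions <= its own index."""
    M = np.zeros((t_tgt, t_src + t_tgt))
    causal = np.triu(np.ones((t_tgt, t_tgt)), k=1)  # 1 strictly above diagonal
    M[:, t_src:] = np.where(causal == 1, float("-inf"), 0.0)
    return M
```

Adding this mask to the scaled dot products before the softmax drives the blocked weights to zero, so decoding remains auto-regressive over the tokens while still conditioning on the whole utterance acoustically.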
3.3 Multi-Objective Learning
As is often done in encoder-decoder structures, our model also uses connectionist temporal classification (CTC) [watanabe2017hybrid, graves2006connectionist, Transformer-CTC] to benefit from its monotonic alignment. The CTC loss function is used to jointly train the extraction of the acoustic representation through Multi-Objective Learning (MOL):
$$\mathcal{L}_{MOL} = \lambda \mathcal{L}_{CTC} + (1 - \lambda)\mathcal{L}_{S2S}$$
where $\mathcal{L}_{MOL}$ is the multi-objective loss with a tuning parameter $\lambda \in [0, 1]$, and $\mathcal{L}_{S2S}$ is the Transformer loss modeled by the Kullback-Leibler divergence [szegedy2016rethinking]. $X$ is a $T$-length speech feature sequence, where $x_t$ is a $d$-dimensional speech feature vector at frame $t$; $C$ is an $L$-length letter sequence containing all the characters in this task, and $c_{l-1}$ is the ground-truth token preceding $c_l$. $Z$ is a framewise letter sequence with an additional blank symbol, $H$ is the acoustic hidden representation, and a linear layer converts $H$ to a vector of the required output dimension.
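The interpolation of the two losses is a one-liner; the sketch below only illustrates the weighting, with the per-branch losses assumed to be computed elsewhere:

```python
def multi_objective_loss(loss_ctc, loss_s2s, lam):
    """L_MOL = lam * L_CTC + (1 - lam) * L_S2S, with lam in [0, 1].
    lam = 0 recovers pure attention training; lam = 1 pure CTC."""
    assert 0.0 <= lam <= 1.0
    return lam * loss_ctc + (1.0 - lam) * loss_s2s
```

During training, the CTC branch regularizes the encoder toward monotonic alignments while the S2S branch drives the auto-regressive decoder.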
We explore two different locations where the CTC loss can be applied, as shown in Figure 1. For CTC1, the encoder accepts the full feature sequence and outputs the acoustic representation at frame $t$. Similar to previous techniques, this CTC loss is used to jointly train the encoder.
For CTC2, since the acoustic representation is also updated in the SMAD, the representation fed to the CTC loss is taken from the acoustic side of the decoder stack output at frame $t$. In this way, the entire acoustic representation extraction process is jointly trained using the CTC loss.
4.1 Experimental setup
We conduct experiments on the 170-hour Aishell-1 corpus [aishell_2017] using the ESPnet [watanabe2018espnet] end-to-end speech processing toolkit. For all experiments, we extract 80-dimensional log Mel-filter bank features plus pitch as acoustic features, and normalize them with the global mean computed from the training set. The frame length is 25 ms with a 10 ms shift.
The standard configuration of the state-of-the-art ESPnet Transformer recipe on Aishell-1 is used for both the baseline and the proposed model. Each model contains a 12-layer encoder and a 6-layer decoder; the model dimension $d_{model}$ and the inner-layer dimensionality of the FFN follow the recipe defaults. In all attention sub-layers, 4 heads are used for MHA. The whole network is trained for 50 epochs, and warmup [Transformer] is used for the first 25,000 iterations. We use 4,230 Chinese characters extracted from the training set as modeling units. A beam search with a width of ten is used with the one-pass decoding for CTC as described in [watanabe2017hybrid], together with shallow fusion of a two-layer RNN language model (LM) [RNN-LM, shallowfusion] trained on the training transcriptions of Aishell-1 with the same 4,230 Chinese characters. We also evaluate the effect of speed perturbation (SP) [ko2015audio], SpecAugment (SpecA) [park2019specaugment] and CTC joint training in our experiments.
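The warmup schedule referenced above is commonly the "Noam" schedule from [Transformer]; a sketch under that assumption (the paper does not spell out the formula) is:

```python
def noam_lr(step, d_model, warmup=25000, factor=1.0):
    """Transformer learning-rate schedule: linear warmup for `warmup`
    steps, then inverse-square-root decay.
    lr = factor * d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)."""
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

The rate peaks exactly at `step == warmup` (25,000 iterations here) and decays as $1/\sqrt{\text{step}}$ afterwards.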
4.2 Results and Discussion
Table 1 reports the results of the proposed Transformer-based ASR, referred to as T-SMAD, the conventional Kaldi hybrid [kaldi] and other E2E ASR systems. Shallow fusion with a 5-gram language model is used in both [LAS_Aishell, Transducers_Aishell]. In the ESPnet RNN and Transformer systems [ESPnet_result] and in T-SMAD, the RNN LM is also used for shallow fusion. We consider the Transformer with speed perturbation and CTC in ESPnet as our reference baseline.
According to Table 1, the T-SMAD system with the proposed CTC2 outperforms all other systems, including both the Transformer baseline and the Kaldi hybrid systems. A relative 20.0% CER reduction on the test set is obtained over the best hybrid system (chain). Relative CER reductions of 10% on the dev set and 10.4% on the test set are obtained over the best E2E system (baseline). Moreover, CTC2 provides better ASR results than CTC1, owing to the fact that the acoustic feature extraction of the entire network and the decoder are jointly trained in CTC2. In these experiments, the default parameter, tuned for the baseline system in ESPnet, is used for both CTC1 and CTC2.
| Model | Dev CER (%) | Test CER (%) |
| Transformer +SP+CTC (baseline) [ESPnet_result] | 6.0 | 6.7 |
In addition to the default configuration of ESPnet, we further apply SpecAugment in our system to investigate its impact on the ASR performance. As shown in Table 2, the baseline system with SpecAugment gives a relative CER reduction of 13.3% on the dev set and 17.9% on the test set. T-SMAD with SpecAugment continues to outperform the corresponding Transformer baseline with SpecAugment, by a relative 8.6% on the dev set and 9.4% on the test set. The best-performing system, T-SMAD+SP+SpecA+CTC2, achieves a CER of 4.8% and 5.1% on the dev and test set, respectively. To the best of the authors' knowledge, these are the best results reported on the Aishell-1 corpus. It can be concluded that the proposed SMAD achieves improved alignment due to the deep acoustic structure and the mixed attention, and yields consistent performance improvements over the standard Transformer architecture.
To examine the contribution of each component in SMAD, we perform several ASR experiments by removing the encoder, DAS, mixed attention and modality-specific network one at a time. To focus on the SMAD mechanism, all results are produced without an additional language model and are reported in Table 3.
Firstly, we remove the encoder in T-SMAD. For a fair comparison, we increase the number of decoder layers to 18 in order to keep the number of model parameters the same. Directly concatenating the acoustic features with the linguistic targets as the input to the decoder increases the CER from 6.1% to 8.2% on the test set, which suggests that the encoder block is essential for effective acoustic representation. Secondly, we feed the same encoder acoustic representation to each SMAD layer in the same way as the standard Transformer, without the deep acoustic structure ('T-SMAD w/o DAS'). This system gives a higher CER than T-SMAD, which confirms the effectiveness of the deep acoustic structure. Thirdly, we replace the mixed attention with the two standard attention mechanisms of the Transformer to extract the linguistic features and learn the source-target alignment, respectively. We observe that the removal of mixed attention degrades the performance, which suggests that mapping the acoustic and linguistic information into the same subspace does help to learn a better alignment. Lastly, the contribution of the modality-specific network is found to be less prominent than that of the previous components. It is worth noting that even without the modality-specific network, T-SMAD still outperforms the standard Transformer without any increase in the number of model parameters.
| Model | Dev CER (%) | Test CER (%) |
| T-SMAD w/o encoder | 7.5 | 8.2 |
| T-SMAD w/o DAS | 6.6 | 7.3 |
| T-SMAD w/o mixed attention | 6.2 | 6.9 |
| T-SMAD w/o modality-specific network | 5.8 | 6.3 |
This paper proposes a novel decoder structure for Transformer-based LVCSR that features a self-and-mixed attention decoder (SMAD) with a deep acoustic structure (DAS) to improve the acoustic representation. With the SMAD mechanism, we have studied the interaction between acoustic and linguistic representations in the training and decoding of an LVCSR system, which opens up a promising direction for improving E2E ASR systems. We confirm that SMAD and DAS effectively improve the acoustic-linguistic representation in the decoder. The improved performance is attributed to the self-and-mixed attention mechanism, which improves the acoustic-linguistic association and alignment in the Transformer decoder. The proposed technique achieves the best results reported to date on both the dev and test sets of Aishell-1. Furthermore, we investigate the impact of each SMAD component on the ASR performance and validate its effectiveness.