Multi-Head Decoder for End-to-End Speech Recognition

04/22/2018 · by Tomoki Hayashi, et al. · Johns Hopkins University

This paper presents a new network architecture called the multi-head decoder for end-to-end speech recognition, as an extension of the multi-head attention model. In the multi-head attention model, multiple attentions are calculated and then integrated into a single attention. In contrast, instead of integrating at the attention level, our proposed method uses a separate decoder for each attention and integrates their outputs to generate the final output. Furthermore, in order to make each head capture different modalities, a different attention function is used for each head, leading to improved recognition performance through an ensemble effect. To evaluate the effectiveness of our proposed method, we conduct an experimental evaluation using the Corpus of Spontaneous Japanese. Experimental results demonstrate that our proposed method outperforms conventional methods such as location-based and multi-head attention models, and that it can capture different speech/linguistic contexts within the attention-based encoder-decoder framework.

1 Introduction

Automatic speech recognition (ASR) is the task of converting a continuous speech signal into a sequence of discrete characters, and it is a key technology for realizing interaction between humans and machines. ASR has great potential for various applications such as voice search and voice input, making our lives richer. Typical ASR systems [1] consist of many modules, such as an acoustic model, a lexicon model, and a language model. Factorizing the ASR system into these modules makes it possible to deal with each module as a separate problem. Over the past decades, this factorization has been the basis of ASR systems; however, it also makes the system much more complex.

With the improvement of deep learning techniques, end-to-end approaches have been proposed [2]. In the end-to-end approach, a continuous acoustic signal or a sequence of acoustic features is directly converted into a sequence of characters with a single neural network. Therefore, the end-to-end approach does not require the factorization into the modules described above, making it easy to optimize the whole system. Furthermore, it does not require lexicon information, which is generally handcrafted by human experts.

End-to-end approaches can be classified into two types. One is based on connectionist temporal classification (CTC) [3, 4, 2], which makes it possible to handle the difference in length between the input and output sequences with dynamic programming. The CTC-based approach can efficiently solve this sequential problem; however, it relies on Markov assumptions to perform dynamic programming and predicts output symbols such as characters or phonemes for each frame independently. Consequently, except in the case of huge training data [5, 6], it still requires a language model and graph-based decoding [7].

The other approach utilizes an attention-based method [8]. In this approach, an encoder-decoder architecture [9, 10] is used to perform a direct mapping from a sequence of input features to text. The encoder network converts the sequence of input features into a sequence of discriminative hidden states, and the decoder network uses an attention mechanism to obtain an alignment between each element of the output sequence and the encoder hidden states. It then estimates the output symbol using the hidden states averaged with the attention weights as the input of the decoder network. Compared with the CTC-based approach, the attention-based method does not require any conditional independence assumptions (including the Markov assumption), language models, or complex decoding. However, the overly flexible alignment of the attention mechanism can cause a non-causal alignment problem [11]. To address this issue, the study in [11] combines the objective function of the attention-based model with that of CTC to constrain the flexibility of the alignments. Another study [12] uses multi-head attention (MHA) to obtain more suitable alignments. In MHA, multiple attentions are calculated and then integrated into a single attention. Using MHA enables the model to jointly focus on information from different representation subspaces at different positions [13], leading to improved recognition performance.

Inspired by the idea of MHA, in this study we present a new network architecture called the multi-head decoder for end-to-end speech recognition, as an extension of the multi-head attention model. Instead of integrating at the attention level, our proposed method uses a separate decoder for each attention and integrates their outputs to generate the final output. Furthermore, in order to make each head capture different modalities, a different attention function is used for each head, leading to improved recognition performance through an ensemble effect. To evaluate the effectiveness of our proposed method, we conduct an experimental evaluation using the Corpus of Spontaneous Japanese. Experimental results demonstrate that our proposed method outperforms conventional methods such as location-based and multi-head attention models, and that it can capture different speech/linguistic contexts within the attention-based encoder-decoder framework.

2 Attention-Based End-to-End ASR

An overview of the attention-based network architecture is shown in Fig. 1.

Figure 1: Overview of attention-based network architecture.

The attention-based method directly estimates the posterior $P(C \mid X)$, where $X = \{x_1, x_2, \ldots, x_T\}$ represents a sequence of input features and $C = \{c_1, c_2, \ldots, c_L\}$ represents a sequence of output characters. The posterior $P(C \mid X)$ is factorized with a probabilistic chain rule as follows:

$P(C \mid X) = \prod_{l=1}^{L} P(c_l \mid c_{1:l-1}, X)$   (1)

where $c_{1:l-1}$ represents the subsequence $\{c_1, c_2, \ldots, c_{l-1}\}$, and $P(c_l \mid c_{1:l-1}, X)$ is calculated as follows:

$h_t = \mathrm{Encoder}(X)$   (2)
$a_{lt} = \mathrm{Attention}(a_{l-1}, q_{l-1}, h_t)$   (3)
$r_l = \sum_{t=1}^{T} a_{lt} h_t$   (4)
$P(c_l \mid c_{1:l-1}, X) = \mathrm{Decoder}(r_l, q_{l-1}, c_{l-1})$   (5)

where Eq. (2) and Eq. (5) represent the encoder and decoder networks, respectively, $a_{lt}$ represents an attention weight, $a_l$ represents an attention weight vector, which is the sequence of attention weights $\{a_{l1}, a_{l2}, \ldots, a_{lT}\}$, $a_{1:l-1}$ represents a subsequence of attention vectors $\{a_1, a_2, \ldots, a_{l-1}\}$, $h_t$ and $q_l$ represent hidden states of the encoder and decoder networks, respectively, and $r_l$ represents the letter-wise hidden vector, which is a weighted summarization of the hidden vectors $h_t$ with the attention weight vector $a_l$.
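
To make the flow of Eqs. (2)-(5) concrete, the following is a minimal PyTorch-style sketch of one decoding step. It is only an illustration under assumed shapes; `attention` and `decoder_step` stand for the networks defined later in this section, and batch handling is omitted.

```python
import torch

def decode_step(encoder_states, a_prev, q_prev, c_prev, attention, decoder_step):
    """One step of Eqs. (3)-(5).
    encoder_states: (T, D_enc) hidden states h_t from Eq. (2).
    a_prev: (T,) previous attention weights, q_prev: previous decoder state,
    c_prev: previous character (or its embedding)."""
    # Eq. (3): soft alignment over the T encoder frames.
    a = attention(a_prev, q_prev, encoder_states)          # (T,)
    # Eq. (4): letter-wise hidden vector as a weighted sum of h_t.
    r = torch.sum(a.unsqueeze(1) * encoder_states, dim=0)  # (D_enc,)
    # Eq. (5): decoder predicts P(c_l | c_{1:l-1}, X) and updates its state.
    p_c, q = decoder_step(c_prev, q_prev, r)
    return p_c, a, q
```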

The encoder network in Eq. (2) converts the sequence of input features $X$ into frame-wise discriminative hidden states $h_t$, and it is typically modeled by a bidirectional long short-term memory recurrent neural network (BLSTM):

$h_t = \mathrm{BLSTM}_t(X)$   (6)

In the case of ASR, the input sequence is significantly longer than the output sequence. Hence, the outputs of the BLSTM are often subsampled to reduce the computational cost [8, 14].
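
As an illustration, the following is a minimal sketch of a BLSTM encoder that subsamples its outputs. The layer count, layer sizes, and factor-2 subsampling here are assumptions for the example, not the exact configuration used in the paper (see Table 1 for that).

```python
import torch
import torch.nn as nn

class BLSTMEncoder(nn.Module):
    """Stacked BLSTM encoder with frame subsampling between layers (cf. Eq. (6))."""
    def __init__(self, input_dim=83, hidden_dim=320, num_layers=2, subsample=2):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.LSTM(input_dim if i == 0 else 2 * hidden_dim,
                     hidden_dim, batch_first=True, bidirectional=True)
             for i in range(num_layers)])
        self.subsample = subsample

    def forward(self, x):
        # x: (batch, T, input_dim) acoustic features
        for lstm in self.layers:
            x, _ = lstm(x)
            # keep every `subsample`-th frame to shorten the sequence
            x = x[:, ::self.subsample, :]
        return x  # (batch, roughly T / subsample**num_layers, 2 * hidden_dim)
```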

The attention weight $a_{lt}$ in Eq. (3) represents a soft alignment between each element of the output sequence $c_l$ and the encoder hidden states $h_t$. The following attention functions can be used as $\mathrm{Attention}(\cdot)$ in Eq. (3); a code sketch of the location-based variant is given after this list.

  • Dot-product attention [15], the simplest attention, is calculated as follows:

    $e_{lt} = (W_q q_{l-1})^\top (W_h h_t)$   (7)
    $a_{lt} = \exp(e_{lt}) \big/ \sum_{t'=1}^{T} \exp(e_{lt'})$   (8)

    where $W_q$ and $W_h$ represent trainable matrix parameters, and $e_l = \{e_{l1}, e_{l2}, \ldots, e_{lT}\}$ represents a sequence of energies.

  • Additive attention [16] replaces the calculation of the energy in Eq. (7) with the following equation:

    $e_{lt} = w^\top \tanh(W_q q_{l-1} + W_h h_t + b)$   (9)

    where $W_q$ and $W_h$ represent trainable matrix parameters, and $w$ and $b$ represent trainable vector parameters.

  • Location-based attention [8] replaces the calculation of the energy in Eq. (7) with the following equations:

    $f_l = F * a_{l-1}$   (10)
    $e_{lt} = w^\top \tanh(W_q q_{l-1} + W_h h_t + W_f f_{lt} + b)$   (11)

    where $f_l$ consists of the vectors $\{f_{l1}, f_{l2}, \ldots, f_{lT}\}$, and $F$ represents trainable one-dimensional convolution filters.

  • Coverage mechanism attention [17] replaces the calculation of the energy in Eq. (7) with the following equations:

    $s_l = \sum_{l'=1}^{l-1} a_{l'}$   (12)
    $e_{lt} = w^\top \tanh(W_q q_{l-1} + W_h h_t + w_s s_{lt} + b)$   (13)

    where $w_s$ represents a trainable vector parameter.
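
As referenced above, here is a minimal PyTorch sketch of the location-based attention of Eqs. (10)-(11), combined with the softmax of Eq. (8). The dimensions and padding handling are assumptions for illustration, and batch handling is omitted; the default filter settings follow Table 1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationAttention(nn.Module):
    """Location-based attention, Eqs. (10)-(11), with the softmax of Eq. (8)."""
    def __init__(self, enc_dim, dec_dim, att_dim, n_filters=10, filter_size=100):
        super().__init__()
        self.W_q = nn.Linear(dec_dim, att_dim, bias=True)   # its bias plays the role of b
        self.W_h = nn.Linear(enc_dim, att_dim, bias=False)
        self.W_f = nn.Linear(n_filters, att_dim, bias=False)
        # F in Eq. (10): one-dimensional convolution over the previous weights
        self.conv = nn.Conv1d(1, n_filters, filter_size,
                              padding=filter_size // 2, bias=False)
        self.w = nn.Linear(att_dim, 1, bias=False)           # w^T tanh(...)

    def forward(self, a_prev, q_prev, h):
        # a_prev: (T,) previous weights a_{l-1}; q_prev: (dec_dim,) state q_{l-1};
        # h: (T, enc_dim) encoder hidden states h_t.
        f = self.conv(a_prev.view(1, 1, -1)).squeeze(0).transpose(0, 1)  # Eq. (10)
        f = f[: h.size(0)]                           # trim convolution padding back to T
        e = self.w(torch.tanh(self.W_q(q_prev) + self.W_h(h) + self.W_f(f)))  # Eq. (11)
        return F.softmax(e.squeeze(-1), dim=0)       # attention weights a_{lt}
```

A coverage-mechanism head (Eqs. (12)-(13)) could be built analogously by replacing the convolutional features with the cumulative sum of the previous attention weight vectors.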

The decoder network in Eq. (5) estimates the next character $c_l$ from the previous character $c_{l-1}$, its own hidden state $q_{l-1}$, and the letter-wise hidden vector $r_l$, similarly to an RNN language model (RNNLM) [18]. It is typically modeled using an LSTM as follows:

$q_l = \mathrm{LSTM}(c_{l-1}, q_{l-1}, r_l)$   (14)
$P(c_l \mid c_{1:l-1}, X) = \mathrm{Softmax}(W q_l + b)$   (15)

where $W$ and $b$ represent trainable matrix and vector parameters, respectively.
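
The following is a minimal sketch of one decoder step corresponding to Eqs. (14)-(15), implemented with an LSTM cell whose input is the concatenation of the embedded previous character and $r_l$. This particular input arrangement and the layer sizes are assumptions for the example, not necessarily the exact ones used in the paper.

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    """One decoder step: Eq. (14) state update and Eq. (15) output distribution."""
    def __init__(self, n_chars, emb_dim=320, enc_dim=640, dec_dim=320):
        super().__init__()
        self.embed = nn.Embedding(n_chars, emb_dim)
        self.lstm = nn.LSTMCell(emb_dim + enc_dim, dec_dim)
        self.out = nn.Linear(dec_dim, n_chars)  # W q_l + b in Eq. (15)

    def forward(self, c_prev, state_prev, r):
        # c_prev: (batch,) previous character ids, r: (batch, enc_dim) letter-wise vector
        x = torch.cat([self.embed(c_prev), r], dim=-1)
        q, cell = self.lstm(x, state_prev)               # Eq. (14)
        log_p = torch.log_softmax(self.out(q), dim=-1)   # Eq. (15), in log domain
        return log_p, (q, cell)
```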

Finally, all of the above networks are jointly optimized using back-propagation through time (BPTT) [19] to minimize the following objective function:

$\mathcal{L} = -\log P(C \mid X) = -\sum_{l=1}^{L} \log P(c_l \mid c^*_{1:l-1}, X)$   (16)

where $c^*_{1:l-1}$ represents the ground truth of the previous characters.
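
As a small illustration of Eq. (16) under teacher forcing, the sketch below sums the negative log-probabilities of the ground-truth characters produced by the decoder steps above; the tensor layout is an assumption for the example.

```python
import torch

def sequence_loss(log_probs, targets):
    """log_probs: (L, batch, n_chars) per-step outputs of Eq. (15);
    targets: (L, batch) ground-truth character ids c*_l."""
    # -log P(c_l | c*_{1:l-1}, X) for each position l
    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return nll.sum(dim=0).mean()  # sum over l, average over the batch
```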

3 Multi-Head Decoder

Figure 2: Overview of multi-head decoder architecture.

An overview of our proposed multi-head decoder (MHD) architecture is shown in Fig. 2. In the MHD architecture, multiple attentions are calculated in the same manner as in conventional multi-head attention (MHA) [13]. We first describe conventional MHA and then extend it to our proposed multi-head decoder (MHD).

3.1 Multi-head attention (MHA)

The letter-wise hidden vector at the $n$-th head is calculated as follows:

$a^{(n)}_{lt} = \mathrm{Attention}\big(a^{(n)}_{l-1}, W^{(n)}_q q_{l-1}, W^{(n)}_h h_t\big)$   (17)
$r^{(n)}_l = \sum_{t=1}^{T} a^{(n)}_{lt} W^{(n)}_v h_t$   (18)

where $W^{(n)}_q$, $W^{(n)}_h$, and $W^{(n)}_v$ represent trainable matrix parameters, and any type of attention in Eq. (3) can be used for $\mathrm{Attention}(\cdot)$ in Eq. (17).

In the case of MHA, the letter-wise hidden vectors of each head are integrated into a single vector with a trainable linear transformation:

$r_l = W_o \big[r^{(1)}_l, r^{(2)}_l, \ldots, r^{(N)}_l\big]$   (19)

where $W_o$ is a trainable matrix parameter and $N$ represents the number of heads.
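
The sketch below illustrates the MHA integration of Eqs. (17)-(19): per-head attention weights, per-head value projections, and a linear transform over the concatenated head contexts. It assumes that each attention module carries its own query/key projections (as in the LocationAttention sketch above); shapes and batch handling are simplifications.

```python
import torch
import torch.nn as nn

class MultiHeadContext(nn.Module):
    """Eqs. (17)-(19): per-head attention, per-head value projection, linear merge."""
    def __init__(self, enc_dim, head_dim, attentions):
        super().__init__()
        n = len(attentions)
        self.heads = nn.ModuleList(attentions)   # each head applies its own W_q, W_h
        self.W_v = nn.ModuleList(
            [nn.Linear(enc_dim, head_dim, bias=False) for _ in range(n)])
        self.W_o = nn.Linear(n * head_dim, enc_dim, bias=False)  # Eq. (19)

    def forward(self, a_prevs, q_prev, h):
        # a_prevs: list of (T,) previous weights; q_prev: decoder state; h: (T, enc_dim)
        contexts, weights = [], []
        for att, w_v, a_prev in zip(self.heads, self.W_v, a_prevs):
            a = att(a_prev, q_prev, h)                                  # Eq. (17)
            contexts.append(torch.sum(a.unsqueeze(1) * w_v(h), dim=0))  # Eq. (18)
            weights.append(a)
        return self.W_o(torch.cat(contexts, dim=-1)), weights           # Eq. (19)
```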

3.2 Multi-head decoder (MHD)

In the case of MHD, on the other hand, instead of integrating at the attention level, we assign a separate decoder to each head and then integrate their outputs to obtain the final output. Since each attention-decoder pair captures different modalities, this is expected to improve recognition performance through an ensemble effect. The calculation of the attention weight at the $n$-th head in Eq. (17) is replaced with the following equation:

$a^{(n)}_{lt} = \mathrm{Attention}\big(a^{(n)}_{l-1}, W^{(n)}_q q^{(n)}_{l-1}, W^{(n)}_h h_t\big)$   (20)

Instead of integrating the letter-wise hidden vectors with a linear transformation, each letter-wise hidden vector $r^{(n)}_l$ is fed to the $n$-th decoder LSTM:

$q^{(n)}_l = \mathrm{LSTM}^{(n)}\big(c_{l-1}, q^{(n)}_{l-1}, r^{(n)}_l\big)$   (21)

Note that each LSTM has its own hidden state $q^{(n)}_l$, which is used for the calculation of the attention weight $a^{(n)}_{lt}$, while the input character $c_{l-1}$ is the same for all of the LSTMs. Finally, all of the outputs are integrated as follows:

$P(c_l \mid c_{1:l-1}, X) = \mathrm{Softmax}\Big(\sum_{n=1}^{N} W^{(n)} q^{(n)}_l + b\Big)$   (22)

where $W^{(n)}$ represents a trainable matrix parameter, and $b$ represents a trainable vector parameter.
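
Putting Eqs. (20)-(22) together, the following is a minimal PyTorch sketch of one MHD step that reuses the attention sketch above. The LSTM input arrangement, the batch-size-1 assumption, and the omission of the per-head value projection of Eq. (18) are simplifications for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class MultiHeadDecoderStep(nn.Module):
    """One MHD step: per-head attention and LSTM, head outputs summed before softmax."""
    def __init__(self, n_chars, attentions, enc_dim=640, emb_dim=320, dec_dim=320):
        super().__init__()
        n = len(attentions)
        self.heads = nn.ModuleList(attentions)          # one attention per head
        self.embed = nn.Embedding(n_chars, emb_dim)
        self.lstms = nn.ModuleList(
            [nn.LSTMCell(emb_dim + enc_dim, dec_dim) for _ in range(n)])
        self.W = nn.ModuleList(
            [nn.Linear(dec_dim, n_chars, bias=False) for _ in range(n)])  # W^(n)
        self.b = nn.Parameter(torch.zeros(n_chars))     # shared bias b in Eq. (22)

    def forward(self, c_prev, states_prev, a_prevs, h):
        # c_prev: (1,) previous character id (batch size 1 assumed for brevity)
        # states_prev: list of per-head (q, cell); a_prevs: list of (T,) weights
        # h: (T, enc_dim) encoder hidden states
        emb = self.embed(c_prev)                        # same character for all heads
        logits = self.b
        states, weights = [], []
        for att, lstm, w, (q_prev, cell_prev), a_prev in zip(
                self.heads, self.lstms, self.W, states_prev, a_prevs):
            a = att(a_prev, q_prev.squeeze(0), h)                     # Eq. (20)
            r = torch.sum(a.unsqueeze(1) * h, dim=0)                  # per-head r_l^(n)
            q, cell = lstm(torch.cat([emb, r.unsqueeze(0)], dim=-1),
                           (q_prev, cell_prev))                       # Eq. (21)
            logits = logits + w(q)                                    # sum in Eq. (22)
            states.append((q, cell))
            weights.append(a)
        return torch.log_softmax(logits, dim=-1), states, weights     # Eq. (22)
```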

3.3 Heterogeneous multi-head decoder (HMHD)

As a further extension, we propose the heterogeneous multi-head decoder (HMHD). The original MHA methods [13, 12] use the same attention function, such as dot-product or additive attention, for every head. In contrast, HMHD uses a different attention function for each head. We expect that this extension enables the model to capture even more diverse contexts in speech within the attention-based encoder-decoder framework.

4 Experimental Evaluation

 # training                    445,068 utterances (581 hours)
 # evaluation (task 1)         1,288 utterances (1.9 hours)
 # evaluation (task 2)         1,305 utterances (2.0 hours)
 # evaluation (task 3)         1,389 utterances (1.3 hours)
 Sampling rate                 16,000 Hz
 Window size                   25 ms
 Shift size                    10 ms
 Encoder type                  BLSTMP
 # encoder layers              6
 # encoder units               320
 # projection units            320
 Decoder type                  LSTM
 # decoder layers              1
 # decoder units               320
 # heads in MHA                4
 # filters in location att.    10
 Filter size in location att.  100
 Learning rate                 1.0
 Initialization                Uniform
 Gradient clipping norm        5
 Batch size                    30
 Maximum epoch                 15
 Optimization method           AdaDelta [20]
 AdaDelta ρ                    0.95
 AdaDelta ε
 AdaDelta ε decay rate
 Beam size                     20
 Maximum length                0.5 × subsampled input length
 Minimum length                0.1 × subsampled input length
Table 1: Experimental conditions.

To evaluate the performance of our proposed method, we conducted an experimental evaluation using the Corpus of Spontaneous Japanese (CSJ) [21], which includes 581 hours of training data and three evaluation sets. For comparison, we used the following dot-product, additive, and location-aware attention models, together with three variants of multi-head attention models:

  • Dot-product attention-based model (Dot),

  • Additive attention-based model (Add),

  • Location-aware attention-based model (Loc),

  • Multi-head dot-product attention model (MHA-Dot),

  • Multi-head additive attention model (MHA-Add),

  • Multi-head location attention model (MHA-Loc).

The input feature vectors consisted of 80-dimensional log Mel filter bank features and three-dimensional pitch features, extracted with the open-source speech recognition toolkit Kaldi [22]. The encoder and decoder networks were a six-layer BLSTM with projection layers [23] (BLSTMP) and a one-layer LSTM, respectively. In the second and third layers from the bottom of the encoder, subsampling was performed to reduce the utterance length, yielding a length of T/4. For MHA/MHD, we set the number of heads to four. For HMHD, we used two kinds of settings: (1) dot-product attention + additive attention + location-based attention + coverage mechanism attention (Dot+Add+Loc+Cov), and (2) two location-based attentions + two coverage mechanism attentions (2Loc+2Cov). The number of distinct output characters was 3,315, including Kanji, Hiragana, Katakana, Latin letters, Arabic numerals, and the sos/eos symbols. In decoding, we used a beam search algorithm [10] with a beam size of 20. We manually set the maximum and minimum lengths of the output sequence to 0.5 and 0.1 times the length of the subsampled input sequence, respectively, and the length penalty to 0.1 times the length of the output sequence. All of the networks were trained using the end-to-end speech processing toolkit ESPnet [24] on a single GPU (TITAN X Pascal). Character error rate (CER) was used as the evaluation metric. The details of the experimental conditions are shown in Table 1.
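
As a small illustration of these decoding-time length constraints, the helper below computes the allowed output-length range from the subsampled input length; it is a hypothetical convenience function for the example, not the ESPnet implementation.

```python
def output_length_bounds(subsampled_input_len, max_ratio=0.5, min_ratio=0.1):
    """Return (min_len, max_len) for the beam search output sequence,
    as ratios of the subsampled input length (cf. Table 1)."""
    return (int(min_ratio * subsampled_input_len),
            int(max_ratio * subsampled_input_len))

# Example: a 20 s utterance -> 2000 frames -> 500 frames after T/4 subsampling.
min_len, max_len = output_length_bounds(500)   # (50, 250) characters
```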

Experimental results are shown in Table 2.

 Method                    Task 1  Task 2  Task 3
 Dot                       12.7    9.8     10.7
 Add                       11.1    8.4     9.0
 Loc                       11.7    8.8     10.2
 MHA-Dot                   11.6    8.5     9.3
 MHA-Add                   10.7    8.2     9.1
 MHA-Loc                   11.5    8.6     9.0
 MHD-Loc                   11.0    8.4     9.5
 HMHD (Dot+Add+Loc+Cov)    11.0    8.3     9.0
 HMHD (2Loc+2Cov)          10.4    7.7     8.9
Table 2: Experimental results (CER [%]).

First, we focus on the results of the conventional methods. It is generally known that location-based attention yields better performance than additive attention [11]. However, Japanese sentences are much shorter than English sentences, which makes location-based attention less effective here. In most cases, the use of MHA improves recognition performance. Next, we focus on the effectiveness of our proposed MHD architecture. Compared with MHA-Loc, the proposed MHD-Loc improved performance in Tasks 1 and 2, while we observed degradation in Task 3. However, the heterogeneous extension (HMHD) introduced in Section 3.3 brings a further improvement over MHD, achieving the best performance among all of the methods on all test sets.

Finally, Figure 3 shows the alignment information of each head of HMHD (2Loc+2Cov), which was obtained by visualizing the attention weights.

Figure 3: Attention weights of each head. The two left panels show the attention weights of the location-based attention heads, and the remaining panels show those of the coverage mechanism attention heads.

Interestingly, the alignments at the left and right ends appear to capture more abstract dynamics of the speech, while the other two alignments behave like the normal alignments obtained by a standard attention mechanism. Thus, the attention weights of each head show different tendencies, which supports our hypothesis that HMHD can capture different speech/linguistic contexts within its framework.

5 Conclusions

In this paper, we proposed a new network architecture called the multi-head decoder for end-to-end speech recognition, as an extension of the multi-head attention model. Instead of integrating at the attention level, our proposed method utilizes multiple decoders, one for each attention, and integrates their outputs to generate the final output. Furthermore, in order to make each head capture different modalities, we used a different attention function for each head. To evaluate the effectiveness of our proposed method, we conducted an experimental evaluation using the Corpus of Spontaneous Japanese. Experimental results demonstrated that our proposed methods outperformed conventional methods such as location-based and multi-head attention models, and that they can capture different speech/linguistic contexts within the attention-based encoder-decoder framework.

In future work, we will combine the multi-head decoder architecture with the joint CTC/attention architecture [11] and evaluate its performance on other databases.

References