means the use of algorithms to convert speech to text. Traditional ASR systems usually consist of individual components such as acoustic models, lexicons and language models. As these components are constructed independently, additional effort is required to develop algorithms and collect data for each of them. To solve this problem, end-to-end ASR systems, which are based on sequence-to-sequence models Sutskever et al. (2014); Amodei et al. (2016); Chiu et al. (2018), have been developed to convert acoustic data to text directly. While traditional ASR systems based on Gaussian Mixture Models and Hidden Markov Models (GMM-HMM) require context-dependent signal pre-processing and forced alignment to obtain input data and labels for supervised learning, end-to-end ASR systems can perform supervised learning directly on acoustic data and text.
The attention mechanism Chorowski et al. (2015); Vaswani et al. (2017) is one of the essential components for boosting the performance of end-to-end ASR systems. It increases the impact of useful information and decreases the impact of useless information by applying different weights to specific regions of the data. However, it is still unclear how the attention mechanism affects the final decisions of an ASR system. According to two past studies Serrano and Smith (2019); Jain and Wallace (2019), the attention mechanism may exhibit two types of behaviour, which depend on the attention mechanism itself and on the whole model, respectively. As the behaviour that depends on the attention mechanism itself is more likely to follow model-independent rules, in this article we study how the attention mechanism of an ASR system impacts itself. To explain the attention mechanism, we build an ASR system and use the Silas decision tree tool Bride et al. (2021) to learn the distributions of attention weights and extract the relationships among the attention weights.
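As a rough illustration of the weighting idea described above, the following minimal sketch turns raw alignment scores into attention weights with a softmax and uses them to form a weighted sum of encoder states. It shows generic softmax weighting only, not the specific attention used in this study; the function names, the toy encoder states and the scores are assumptions made for illustration.

```python
import math

def softmax(scores):
    """Turn raw alignment scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(scores, encoder_states):
    """Return the attention weights and the weighted sum of encoder states."""
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * state[d] for w, state in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context

# Three hypothetical 2-dimensional encoder states and their alignment scores.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend([2.0, 0.0, -1.0], states)
```

The state with the largest score receives the largest weight and therefore dominates the context vector, which is the sense in which attention raises the impact of useful information.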
We train an ASR model based on the encoder-decoder architecture with the attention mechanism Chorowski et al. (2015). The encoder consists of two LSTM-RNNs, and the decoder is a single LSTM-RNN. The attention mechanism is based on the hybrid structure. The whole encoder-decoder model is trained on the TIMIT training dataset. During each epoch of training, 200 speech files in the TIMIT dataset are randomly selected to train the whole encoder-decoder model. The training process terminates after 1,000 epochs.
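The per-epoch file selection can be sketched as follows. This is only an illustration of the sampling schedule, not the training code of this study: the file identifiers are hypothetical, and drawing without replacement within an epoch is an assumption, since the text only says that the files are randomly selected.

```python
import random

def epoch_subsets(file_ids, files_per_epoch=200, num_epochs=1000, seed=0):
    """Yield one random subset of training files per epoch."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    for _ in range(num_epochs):
        # Assumption: files are drawn without replacement within an epoch.
        yield rng.sample(file_ids, files_per_epoch)

# Hypothetical identifiers standing in for the training utterances.
all_files = [f"utt_{i:04d}" for i in range(1000)]
subsets = list(epoch_subsets(all_files, num_epochs=3))
```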
To generate data for the analysis of the attention mechanism, the TIMIT evaluation set is fed into the trained ASR model. We select the 770 audio files with the best (lowest) phoneme error rates and extract their attention weight matrices from the ASR model. The extracted attention weight matrices are analysed in the following steps.
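The file selection described above can be sketched as follows, assuming the usual definition of the phoneme error rate as the edit distance between the reference and the recognised phoneme sequences divided by the reference length. The function names and the toy phoneme sequences are hypothetical, and the study may compute or aggregate the error rate differently.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two phoneme sequences."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_ph in enumerate(reference, start=1):
        current = [i]
        for j, hyp_ph in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_ph != hyp_ph)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]

def phoneme_error_rate(reference, hypothesis):
    """Edit distance normalised by the reference length."""
    return edit_distance(reference, hypothesis) / len(reference)

def select_best(results, n):
    """Return the IDs of the n files with the lowest phoneme error rate.

    results is a list of (file_id, reference, hypothesis) tuples.
    """
    ranked = sorted(results, key=lambda r: phoneme_error_rate(r[1], r[2]))
    return [file_id for file_id, _, _ in ranked[:n]]

# Toy recognition results; the study selects 770 files rather than 2.
results = [
    ("a", ["sh", "iy"], ["sh", "iy"]),            # no errors
    ("b", ["hh", "ae", "d"], ["hh", "eh", "d"]),  # one substitution
    ("c", ["y", "er"], ["er"]),                   # one deletion
]
best = select_best(results, 2)
```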
All attention weights in the attention weight matrices are sorted in ascending order. The sorted attention weights are split evenly into 10 domains. The domains are annotated with 10 levels that represent the strength of attention.
Attention level matrices are produced by converting all of the attention weights in the attention weight matrices to their corresponding levels. As the size of an attention matrix depends on the encoder output and the decoder output, the size of an attention level matrix is fixed to the maximal size, i.e., 100 × 659, where 100 is the maximal size of the encoder output, and 659 is the maximal size of the decoder output. All empty positions in the attention level matrices are filled with 0.
To observe how a given row of the attention level matrix is influenced by the rows that precede it, we produce a feature by concatenating those preceding rows.
Each attention level in the given row is converted to a label. An attention level higher than 5 is labelled “high”, while an attention level not higher than 5 is labelled “low”. The labels and the features together form a binary classification dataset.
The dataset is shuffled and split into a training set and an evaluation set that consist of 80% and 20% of the data, respectively. The training set is used to train 100 decision trees using the Silas tool. Each decision tree has a maximum depth of 64, and each leaf node has at least 64 training examples.
The trained decision trees are scored on the evaluation set. We observe the scores, i.e., prediction accuracy, for each encoder state. In addition, we collect the decision conditions and their influence scores computed by Silas Bride et al. (2021).
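The data preparation in the steps above can be sketched as follows. This is a minimal illustration under several assumptions rather than the pipeline used in the study: the levels are numbered 1 to 10 so that 0 can mark padding, each entry of the target row yields one example that shares the same feature vector, and the tiny matrices and sizes are hypothetical. Training with Silas is omitted because its interface is not described here.

```python
import bisect
import random

def level_boundaries(matrices, num_levels=10):
    """Upper bounds of the first num_levels - 1 equal-count bins of all weights."""
    weights = sorted(w for m in matrices for row in m for w in row)
    # Assumes at least num_levels weights; ties may make bins slightly uneven.
    return [weights[k * len(weights) // num_levels - 1]
            for k in range(1, num_levels)]

def to_level_matrix(matrix, boundaries, max_rows, max_cols):
    """Map weights to levels 1..num_levels and pad empty positions with 0."""
    padded = [[0] * max_cols for _ in range(max_rows)]
    for r, row in enumerate(matrix):
        for c, weight in enumerate(row):
            padded[r][c] = bisect.bisect_left(boundaries, weight) + 1
    return padded

def make_examples(level_matrix, row, num_previous):
    """Pair the concatenated previous rows with a high/low label per entry."""
    if row < num_previous:
        raise ValueError("not enough previous rows")
    features = [lvl for prev in level_matrix[row - num_previous:row]
                for lvl in prev]
    # Assumption: every entry of the target row shares the same features.
    return [(col, features, "high" if lvl > 5 else "low")
            for col, lvl in enumerate(level_matrix[row])]

def shuffle_split(examples, train_fraction=0.8, seed=0):
    """Shuffle the examples and split them into training and evaluation sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Two tiny hypothetical attention weight matrices (real ones are far larger).
matrices = [
    [[0.01, 0.02, 0.03, 0.94], [0.04, 0.05, 0.06, 0.85], [0.07, 0.08, 0.09, 0.76]],
    [[0.10, 0.11, 0.12, 0.67], [0.13, 0.14, 0.15, 0.58]],
]
boundaries = level_boundaries(matrices)
level_matrices = [to_level_matrix(m, boundaries, max_rows=4, max_cols=5)
                  for m in matrices]
examples = make_examples(level_matrices[0], row=2, num_previous=2)
train_set, eval_set = shuffle_split(examples)
```

Padding entries stay at level 0 and therefore fall into the “low” class, which is consistent with the labelling rule that any level not higher than 5 is “low”.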
Figure 1 shows the accuracy of attention level prediction. The accuracy is high for small encoder IDs. This is probably because the smaller encoder IDs correspond to the beginnings of audio files, which are almost silent. As silence does not contain useful information, the attention on it is almost stable, which means that the attention level is relatively easy to predict. For most encoder IDs, the accuracy is around 80%, which means that the attention is mostly predictable. For larger encoder IDs, the accuracy is unstable because of the lack of training data, i.e., most audio files do not use such a large number of encoder IDs. As a supplement, Figure 2 shows the distribution of attention levels over all of the training data. It indicates that high-level attention weights have a positive impact on the accuracy.
Figure 3 shows the accuracy with respect to the number of previous states. It indicates that the accuracy increases as the number of previous states increases. Moreover, the previous four states have the highest impact on the accuracy. When the number of previous states is greater than four, adding more previous states does not increase the accuracy further.
Figure 4 shows the frequencies of attention levels in decision conditions. It indicates that higher attention levels contribute more decision conditions, i.e., higher attention weights have a larger impact on future attention states.
Figure 5 shows the average influence scores of previous states. It indicates a trend that the influence scores decrease as the time interval increases, which means that more recent attention states have more impact on the current attention state. This phenomenon agrees with the results in Figure 3.
In this study, we have used decision trees to explain how the attention mechanism impacts itself in end-to-end ASR models. The results show that the current attention state is mainly impacted by its previous attention states rather than the encoder and decoder states. It is possible that the attention mechanism in sequential tasks, e.g., speech recognition, is continuously impacted by its historical attention states. Moreover, the four most recent attention states have the highest impact on the current attention state, and the influence scores keep decreasing as the time interval increases. However, in real ASR applications, time intervals are usually very large. This suggests that the attention mechanism could be improved by strengthening the attention over larger time intervals, which is a possible direction for future work.
- Amodei et al. (2016). Deep speech 2: end-to-end speech recognition in English and Mandarin. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, M. Balcan and K. Q. Weinberger (Eds.), JMLR Workshop and Conference Proceedings, Vol. 48, pp. 173–182.
- Bride et al. (2021). Silas: A high-performance machine learning foundation for logical reasoning and verification. Expert Syst. Appl. 176, pp. 114806.
- Chan et al. (2016). Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016, Shanghai, China, March 20-25, 2016, pp. 4960–4964.
- Chiu et al. (2018). State-of-the-art speech recognition with sequence-to-sequence models. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2018, Calgary, AB, Canada, April 15-20, 2018, pp. 4774–4778.
- Chorowski et al. (2015). Attention-based models for speech recognition. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), pp. 577–585.
- Jain and Wallace (2019). Attention is not explanation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio (Eds.), pp. 3543–3556.
- Serrano and Smith (2019). Is attention interpretable? In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, A. Korhonen, D. R. Traum, and L. Màrquez (Eds.), pp. 2931–2951.
- Sutskever et al. (2014). Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13, 2014, Montreal, Quebec, Canada, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), pp. 3104–3112.
- Vaswani et al. (2017). Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett (Eds.), pp. 5998–6008.
- Wang et al. (2020). Transformer-based acoustic modeling for hybrid speech recognition. In 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2020, Barcelona, Spain, May 4-8, 2020, pp. 6874–6878.