A Multi-Task Learning Framework for Overcoming the Catastrophic Forgetting in Automatic Speech Recognition

04/17/2019 · Jiabin Xue et al. · Harbin Institute of Technology

Recently, data-driven Automatic Speech Recognition (ASR) systems have achieved state-of-the-art results, and transfer learning, e.g., fine-tuning or retraining, is often used to adapt an existing system to a target domain. In these processes, however, the system parameters may deviate too far from the previously learned parameters. It is therefore difficult for the training process to learn knowledge from the target domain without forgetting knowledge from the previous learning process, a problem known as catastrophic forgetting (CF). In this paper, we attempt to solve the CF problem with lifelong learning and propose a novel multi-task learning (MTL) training framework for ASR. It treats reserving original knowledge and learning new knowledge as two independent tasks. On the one hand, we constrain the new parameters not to deviate too far from the original parameters and penalize the new system when it forgets original knowledge. On the other hand, we force the new system to learn new knowledge quickly. An MTL mechanism is then employed to strike a balance between the two tasks. We applied our method to an End-to-End ASR task and obtained the best performance on both the target and original datasets.


1 Introduction

Deep neural networks (DNNs) have become a successful model in the automatic speech recognition (ASR) field with the continuous development of deep learning theory and technology [1, 2, 3]. However, since they are data-driven, their performance is heavily affected by the scale and domain coverage of the training data.

Various new domains are constantly emerging with the wide application of the Internet. ASR systems therefore need to adapt to these domains as quickly as possible to meet application requirements. Two transfer learning methods, retraining (RT) [4] and fine-tuning (FT) [5], are commonly used to adapt an existing ASR system to a target domain. RT is usually conducted with all training data from both the original and target domains. This method has shortcomings: it cannot effectively model the target domain when the original and target datasets differ greatly, and it achieves the best performance on neither the original nor the target domain. FT uses only training data from the target domain, with the system parameters initialized from the original ones. It accelerates domain adaptation by sharing the learned parameters with the new system, which is thus not trained from scratch. However, although it achieves better performance on the target domain, performance on the original domain degrades considerably. In summary, both methods are prone to the catastrophic forgetting (CF) [6] problem.

CF is a serious and common problem in artificial intelligence systems: such systems fail to learn new knowledge quickly without forgetting previously acquired knowledge. One of the most commonly studied approaches to overcoming CF is lifelong learning [7, 8]. However, existing lifelong learning methods cannot be applied directly to large-scale machine learning problems. In this paper, we therefore attempt to solve the domain adaptation problem of ASR with an improved form of lifelong learning; to the best of our knowledge, this is the first attempt to employ lifelong learning in this field.

Current mainstream lifelong learning methods can be divided into two approaches, depending on whether the network architecture changes during learning. The first adopts a fixed network architecture with a large capacity; when training the network on consecutive tasks, a regularization term keeps the model parameters from deviating too far from previously learned values, weighted by their significance to old tasks [7]. The second dynamically extends the network structure to accommodate new tasks while keeping the previous architecture's parameters unchanged; examples include progressive networks [8], which extend the architecture by a fixed node or layer size, and the dynamically expandable network (DEN) [9], which introduces group sparse regularization when adding new parameters to the original network.
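The first approach can be illustrated with a minimal sketch of an importance-weighted quadratic penalty in the style of [7]; the function name, toy parameter vectors, and per-parameter importance weights below are illustrative assumptions, not taken from any cited implementation:

```python
import numpy as np

def importance_penalty(theta, theta_old, importance, lam):
    """Quadratic penalty anchoring parameters to their previously learned
    values; parameters with large importance are held more firmly."""
    return 0.5 * lam * float(np.sum(importance * (theta - theta_old) ** 2))

theta_old  = np.array([1.0, -2.0, 0.5])   # parameters after the old task
importance = np.array([10.0, 0.1, 1.0])   # per-parameter significance to the old task
theta      = np.array([1.2, -1.0, 0.5])   # current parameters on the new task

penalty = importance_penalty(theta, theta_old, importance, lam=1.0)
```

A large penalty signals that an important old-task parameter has drifted, pushing the optimizer back toward the previous solution, while unimportant parameters remain free to adapt.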

At present, most methods based on dynamically extended networks cannot be applied effectively to large-scale speech recognition tasks, since the continuous arrival of new tasks increases the complexity of the model structure. This increases training and inference times and fails to meet real-time processing and computational resource requirements. We therefore choose the first, fixed-architecture approach for model training.

A surprising result in statistics is Stein's paradox [10], which shows that learning multiple tasks together can yield better performance than learning each task independently. Thus, to overcome the CF problem, we divide the task of domain adaptation into two parts: reserving original knowledge and learning new knowledge. A multi-task learning (MTL) [11] mechanism is then explored to balance the two tasks. Many studies have shown that different configurations of a system can achieve the same performance. Accordingly, we try to find the best configuration for the target domain close to the configuration learned on the original dataset.

In accordance with the above analysis, we propose a novel MTL training framework for ASR that solves the CF problem. This framework treats reserving original knowledge and learning new knowledge as two independent tasks and employs an MTL mechanism to balance them. Experimental results on the LibriSpeech corpus [12] and the Switchboard corpus [13] show that our proposed method achieves better performance than FT and RT on both datasets.

2 The Proposed Method

2.1 Analysis of Catastrophic Forgetting

Suppose we have a dataset of the original domain and have trained a model on it, and we then collect another dataset belonging to the target domain. The problem is how to obtain a fresh model that performs well on the original domain and the target domain simultaneously, thereby expanding the domain coverage of the DNN-based ASR system.

A DNN contains multiple layers, each comprising a linear projection followed by an element-wise nonlinearity. Learning a task consists of adjusting the weights of the linear projections to optimize performance. Many works have pointed out that different weight configurations can achieve the same performance [14, 15]. Besides, the original dataset and the new one both belong to the ASR domain, so their optimal parameter spaces are unlikely to lie far apart and may overlap; within the overlap we can find parameters that perform well on both the original and the target domain. In this paper, we propose a new MTL ASR training framework to avoid CF, namely Multiple Task Learning for Catastrophic Forgetting (MTLCF). It looks for the optimal parameters of the new model near the parameters of the original model.

In Figure 1, we illustrate training trajectories in a schematic parameter space, with parameter regions leading to good performance on original knowledge (yellow) and on new knowledge (blue). After learning the first task, the parameters sit at the learned optimum. From this figure, we can see that if we take gradient steps according to FT alone (green arrow), we may minimize the loss on new knowledge but destroy what was learned for original knowledge. If we instead take gradient steps according to the restriction term alone (red arrow), the imposed restriction is too severe, and we remember original knowledge only at the expense of not learning the new knowledge. Our MTLCF, in contrast, seeks a solution for the new knowledge that does not incur a significant loss on the original knowledge (blue arrow).

Figure 1: Schematic parameter space of the original knowledge and the new knowledge.

2.2 Multi-Task Learning for Catastrophic Forgetting

2.2.1 Multi-Task Objective Function of MTLCF

Our method splits the problem of overcoming CF into two independent tasks, an old-knowledge task and a new-knowledge task, and jointly optimizes the two. The old-knowledge task keeps the new model from leaving the optimal parameter space of the original domain, so that parameters in this space retain good performance on the original domain. The old-knowledge task is further divided into two subtasks, a KL subtask and a CTC subtask.

In the KL subtask, we restrict the retrained model from deviating too far from the original one by minimizing the KL divergence between the output distributions of the two models,

L_KL = (1/B) Σ_{i=1}^{B} KL( P_O(x_i; T) ‖ P_N(x_i; T) ),    (1)

where L_KL is the loss of the KL subtask, P_O and P_N are the output distributions of the original model O and the new model N for a sample x_i drawn from the original domain data D_O, B is the batch size, KL(·‖·) is the Kullback-Leibler divergence, and T is a sharpness (temperature) parameter. In this paper, we fix T to reduce the number of hyperparameters.
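A minimal NumPy sketch of the loss in Eq. (1) might look as follows; the function names and the example logits are illustrative assumptions rather than the paper's implementation:

```python
import numpy as np

def softmax(logits, T=1.0):
    # temperature-sharpened softmax; T is the sharpness parameter
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_subtask_loss(logits_orig, logits_new, T=1.0):
    """Mean KL(P_O || P_N) over a batch of output frames, as in Eq. (1)."""
    p = softmax(logits_orig, T)
    q = softmax(logits_new, T)
    eps = 1e-12  # avoid log(0)
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))

logits_o = np.array([[2.0, 0.0, -1.0]])   # original model outputs
logits_n = np.array([[1.0, 0.5, -1.0]])   # new model outputs
loss = kl_subtask_loss(logits_o, logits_n)
```

The loss is zero when the two models agree exactly and grows as the new model's output distribution drifts away from the original model's.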

Then, we consider that the Connectionist Temporal Classification (CTC) [16] loss employs the Viterbi algorithm to find the optimal path, so even if the output distributions of the two models are similar, they may still decode to two different paths. The CTC subtask therefore minimizes the CTC loss of the new model on the training data of the original domain; from this subtask, we can measure the degree to which the new model forgets the original knowledge and penalize it accordingly. Thus, the loss of the old-knowledge task is

L_old = L_KL + α · L_CTC( N(x_O), y_O ),    (2)

where L_old is the loss of the old-knowledge task, α is a hyperparameter that controls the penalty for forgetting, and L_CTC is the CTC loss between the new model's output on an original-domain utterance x_O and its label sequence y_O.

We force the new model to learn the new knowledge quickly through the new-knowledge task, which optimizes the CTC loss of the retrained model on the target domain training data. The loss of this task is

L_new = β · L_CTC( N(x_T), y_T ),    (3)

where L_new is the loss of the new-knowledge task, β is a hyperparameter that controls the speed of learning the new knowledge, and N(x_T) is the output of the new model for a sample (x_T, y_T) drawn from the target domain data D_T. The final multi-task loss is the sum of the losses of the two tasks.

Figure 2: The structure of MTLCF. It uses two datasets drawn from two different data distributions. The framework consists of an original model O, trained on the original domain data, and a new model N, copied from O. Only N is updated during back propagation.

2.2.2 The Structure of MTLCF

We display the structure of MTLCF in Figure 2. Our MTLCF is composed of two task networks, which can be any network suited to the task at hand, such as Convolutional Neural Networks (CNNs) [17] or Long Short-Term Memory (LSTM) [18] neural networks. In this paper, we use the LSTM Deep Neural Network (LDNN) [1] architecture as the task network for an End-to-End ASR task.

The MTLCF training procedure is described in Algorithm 1. In the training process, we first copy a new model from the current model, then fix the parameters of the original model and train only the parameters of the new model.

Input: D_O, the original domain data; D_T, the target domain data; O, the original model; B, the batch size.
Output: N, the target model
Copy N from O;
while N has not converged do
       Sample a batch of original speech data from D_O;
       Sample a batch of new speech data from D_T;
       Compute the KL subtask loss L_KL by Eq. (1);
       Compute the old-knowledge loss L_old by Eq. (2);
       Compute the new-knowledge loss L_new by Eq. (3);
       L ← L_old + L_new;
       Update N by back propagation on L;
end while
return N;
Algorithm 1 MTLCF
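The loop above can be sketched as follows; the stand-in loss functions merely mimic the shapes of the Eq. (1)-(3) losses (in the real system they would be the KL and CTC losses of the two models), so everything in this snippet is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the real losses of Eqs. (1)-(3) on a sampled batch.
def kl_loss(batch):   return float(np.mean(batch ** 2))
def ctc_loss(batch):  return float(np.mean(np.abs(batch)))

def mtlcf_loss(batch_orig, batch_new, alpha=0.5, beta=0.5):
    """Multi-task objective: old-knowledge loss (KL plus the penalized
    CTC loss on original data) plus the weighted new-knowledge CTC loss."""
    loss_old = kl_loss(batch_orig) + alpha * ctc_loss(batch_orig)   # Eq. (2)
    loss_new = beta * ctc_loss(batch_new)                           # Eq. (3)
    return loss_old + loss_new

for epoch in range(3):                       # "while N has not converged"
    batch_orig = rng.standard_normal(8)      # batch sampled from D_O
    batch_new = rng.standard_normal(8)       # batch sampled from D_T
    loss = mtlcf_loss(batch_orig, batch_new)
    # ...back-propagate `loss` and update only the new model N...
```

Note that both batches contribute to every gradient step, which is how the framework balances remembering the original domain against learning the new one.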

3 Experimental Details

3.1 Experiment Data

Our experiments are conducted on the LibriSpeech corpus, which contains 1,000 hours of training data, and the 300-hour Switchboard English conversational telephone speech task, both among the most studied ASR benchmarks today [19, 20, 21, 22, 23]. For the original domain D_O, we select data from the train-clean-360, dev-clean, and test-clean folders as the training, development, and test sets, respectively. For the target domain D_T, we use 95% and 5% of Switchboard-1 Release 2 (LDC97S62) as the training and development sets, respectively, and the Hub5 2000 (LDC2002S09) evaluation set, which contains 20 ten-minute conversations from Switchboard (SW) and 20 ten-minute conversations from CallHome English (CH), as the test set.

The acoustic features are 80-dimensional log-mel filterbank energies, computed with a 25 ms window every 10 ms and extracted using Kaldi [24]. Similar to the low frame rate (LFR) model [25], each frame is stacked with the two frames to its left and the result is downsampled to a 30 ms frame rate, producing a 240-dimensional feature vector.
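The stacking-and-downsampling step can be sketched in NumPy as follows; the function name and the zero-padding at the utterance start are illustrative assumptions:

```python
import numpy as np

def stack_lfr(feats, left=2, rate=3):
    """Stack each 10 ms frame with its `left` previous frames and keep
    every `rate`-th stacked frame, giving 240-dim vectors at a 30 ms rate."""
    T, D = feats.shape
    padded = np.concatenate([np.zeros((left, D)), feats], axis=0)
    # columns: [frame t-2 | frame t-1 | frame t]
    stacked = np.concatenate([padded[i:i + T] for i in range(left + 1)], axis=1)
    return stacked[::rate]

feats = np.random.randn(100, 80)   # 1 s of 80-dim log-mel at 10 ms
lfr = stack_lfr(feats)             # -> shape (34, 240)
```

Reducing the frame rate this way cuts the sequence length by a factor of three, which shortens both training and decoding for the recurrent layers.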

3.2 Model Training

For evaluation, all neural networks are trained using the CTC objective function with character targets in TensorFlow [26]. Similar to [1], all networks consist of an LDNN model, comprising three bidirectional LSTM (Bi-LSTM) layers followed by a Rectified Linear Unit (ReLU) layer and a linear layer; the fully connected ReLU layer is followed by a softmax output layer.

The training procedure is similar to [27]. The weights of all layers of the original model are uniformly initialized between -0.05 and 0.05, and the weights of all layers of the new model are copied from it. All networks are trained using Adam [28]; the learning rate is halved whenever the held-out loss does not decrease by at least 10%. We clip the gradients to a fixed range to stabilize training, and the training data are sorted by length.
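The learning-rate schedule and gradient clipping described above can be sketched as follows; the clipping bound of 1.0 is an assumption, since the actual range was not preserved in the text:

```python
import numpy as np

def update_lr(lr, prev_loss, curr_loss, min_rel_improve=0.10):
    """Halve the learning rate when the held-out loss fails to drop by
    at least 10% relative to the previous evaluation."""
    if prev_loss is not None and curr_loss > prev_loss * (1.0 - min_rel_improve):
        lr *= 0.5
    return lr

def clip_gradients(grads, bound=1.0):
    # element-wise clipping to [-bound, bound] to stabilize training
    # (the bound here is an assumed placeholder value)
    return np.clip(grads, -bound, bound)

lr = update_lr(0.1, prev_loss=1.0, curr_loss=0.95)   # insufficient drop -> halved
g = clip_gradients(np.array([2.0, -3.0, 0.5]))
```

Halving on a stalled held-out loss and clipping large gradients are both standard safeguards for recurrent models trained with CTC, where loss spikes are common early in training.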

3.3 Experiment Results

3.3.1 Analysis of the Models Convergence Speed

We first analyze the convergence rate of the models trained with RT, FT, and MTLCF; the results are shown in Figure 3.

Figure 3(a) shows the convergence rate of these models on the original domain test set, where each point gives the character error rate (CER) of the model on this set after the given epoch. As can be seen from Figure 3(a), FT forgets the learned knowledge by the end of the first epoch, since it suffers severe CF. The learning of RT is unstable: it often forgets much of the learned knowledge and then relearns it, yet it cannot recover the initial performance. MTLCF is stable on the original test set and does not forget the previous knowledge. The convergence rates of these models on the target domain test set are shown in Figure 3(b).

Figure 3: Convergence rates of RT, FT, and MTLCF on the original domain test set (a) and on the target domain test set (b).

We can see that the fastest convergence rate on the target domain is achieved by MTLCF, which converges within a few epochs.

From the above analysis, the proposed model achieves the best convergence speed on both the original and target domains.

3.3.2 Analysis of the CER

In Table 1, we report the performance of the original model, trained on the original domain training data from random initialization, on the original and target domain test sets. It achieves a low CER on the original test set, but a much higher CER on the target test set. We can therefore conclude that the data distributions of the original domain and the target domain are quite different.

Training data set	CER (original test set)	CER (target test set)
LibriSpeech	–	–
Table 1: Performance of the original model on the original and target domain test sets

We show the final convergence results of these models in Table 2. First, we analyze the results with an equal amount of training data. As the top three rows of Table 2 show, FT largely forgets the learned knowledge but achieves good performance on the target test set. The RT model reaches the optimal convergence result on neither the original nor the target test set. MTLCF achieves better performance on both, including a relative reduction in CER on the original test set compared with the initial model in Table 1.

Method	CER (original test set)	CER (target test set)
FT	–	–
RT	–	–
MTLCF	0.111771	0.288898
FT	–	0.323052
RT	–	–
MTLCF	0.121136	–
Table 2: Performance of the three training methods on the original and target domain test sets

We attribute these results to the fact that learning the new knowledge while not forgetting the old not only effectively prevents CF but can also improve performance on the original domain, in line with Stein's paradox. It also supports our earlier assumption about how the parameters change.

Second, we analyze the results with an unequal amount of training data. As the last three rows of Table 2 show, the results are similar to those of the previous experiment. This demonstrates that our proposed method remains applicable when the data scales differ greatly.

3.3.3 Effect of Hyperparameters on MTLCF

To discuss the experimental results further, we analyze the influence of different hyperparameters on the training results of MTLCF. First, we analyze the influence of the forgetting-penalty hyperparameter α on the CER, shown in Figure 4(a). The figure shows that MTLCF converges to better results on both the original and target test sets for a moderate value of α, and that as α increases, performance on the target domain continues to degrade. In the following analysis, we therefore keep α fixed.

Figure 4: The influence of hyperparameter α on the convergence results of the new model when β is 0.5 (a), and the influence of hyperparameter β when α is 0.5 (b).

Second, the effect of varying the hyperparameter β on the model's convergence results is shown in Figure 4(b). It can easily be seen that as β increases, the proportion of the new-knowledge loss in the optimized gradient grows, resulting in a lower error rate on the target domain and a higher error rate on the original domain. With a suitable β, the model achieves its best performance on the new dataset.

Through the above analysis, we find that β is more important to the convergence result of MTLCF than α, and we can trade off performance between the original domain and the target domain by controlling its size.

4 Conclusions

Domain adaptation is an important topic for ASR systems. In this paper, we attempt to overcome CF in this process by using lifelong learning, adopting a new way of thinking based on MTL. We further propose a novel MTL-based method (MTLCF) that learns new knowledge quickly without forgetting the learned knowledge. We evaluate the proposed method on the Switchboard and LibriSpeech corpora; the experimental results show that it achieves good performance on both the original and the target domain test sets.

5 Acknowledgements

This research was supported by the National Key Research and Development Plan of China under Grant 2017YFB1002102 and the National Natural Science Foundation of China under Grant U1736210.

References

  • [1] T. N. Sainath, O. Vinyals, A. W. Senior, and H. Sak, “Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks,” in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, 2015, pp. 4580–4584.
  • [2] V. Valtchev, J. Odell, P. C. Woodland, and S. J. Young, “Lattice-based discriminative training for large vocabulary speech recognition,” in IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, ICASSP, 1996, pp. 605–608.
  • [3] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen et al., “Deep Speech 2 : End-to-End Speech Recognition in English and Mandarin,” in International Conference on Machine Learning, ICML, 2016, pp. 173–182.
  • [4] M. Paulik, C. Fügen, S. Stüker, T. Schultz, T. Schaaf, and A. Waibel, “Document driven machine translation enhanced ASR,” in INTERSPEECH, 2005, pp. 2261–2264.
  • [5] D. Yu, L. Deng, and G. Dahl, “Roles of pre-training and fine-tuning in context-dependent dbn-hmms for real-world speech recognition,” in Proc. NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2010.
  • [6] R. M. French, “Catastrophic forgetting in connectionist networks,” Trends in cognitive sciences, vol. 3, no. 4, pp. 128–135, 1999.
  • [7] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska et al., “Overcoming catastrophic forgetting in neural networks,” Proceedings of the national academy of sciences, vol. 114, no. 13, pp. 3521–3526, 2017.
  • [8] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell, “Progressive neural networks,” arXiv preprint arXiv:1606.04671, 2016.
  • [9] J. Yoon, E. Yang, J. Lee, and S. J. Hwang, “Lifelong Learning with Dynamically Expandable Networks,” in International Conference on Learning Representations, ICLR, 2018.
  • [10] C. Stein, “Inadmissibility of the usual estimator for the mean of a multivariate normal distribution,” Stanford University, Tech. Rep., 1956.

  • [11] R. Caruana, “Multitask learning,” Machine learning, vol. 28, no. 1, pp. 41–75, 1997.
  • [12] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, 2015, pp. 5206–5210.
  • [13] J. J. Godfrey, E. C. Holliman, and J. McDaniel, “SWITCHBOARD: Telephone speech corpus for research and development,” in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, 1992, pp. 517–520.
  • [14] R. Hecht-Nielsen, “Theory of the backpropagation neural network,” in Neural Networks for Perception. Elsevier, 1992, pp. 65–93.
  • [15] H. J. Sussmann, “Uniqueness of the weights for minimal feedforward nets with a given input-output map,” Neural networks, vol. 5, no. 4, pp. 589–593, 1992.
  • [16] A. Graves, S. Fernández, F. J. Gomez, and J. Schmidhuber, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the International Conference on Machine Learning, ICML, 2006, pp. 369–376.
  • [17] L. O. Chua and L. Yang, “Cellular neural networks: Theory,” IEEE Transactions on circuits and systems, vol. 35, no. 10, pp. 1257–1272, 1988.
  • [18] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [19] G. Saon and M. Picheny, “Recent advances in conversational speech recognition using convolutional and recurrent neural networks,” IBM Journal of Research and Development, vol. 61, no. 4, pp. 1–1, 2017.
  • [20] W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, and G. Zweig, “The Microsoft 2016 conversational speech recognition system,” in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, 2017, pp. 5255–5259.
  • [21] K. Veselỳ, A. Ghoshal, L. Burget, and D. Povey, “Sequence-discriminative training of deep neural networks.” in INTERSPEECH, 2013, pp. 2345–2349.
  • [22] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, “Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI.” in INTERSPEECH, 2016, pp. 2751–2755.
  • [23] W. Hartmann, R. Hsiao, T. Ng, J. Ma, F. Keith, and M.-H. Siu, “Improved Single System Conversational Telephone Speech Recognition with VGG Bottleneck Features,” in INTERSPEECH, 2017, pp. 112–116.
  • [24] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The Kaldi speech recognition toolkit,” in IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU, 2011, pp. 1–4.
  • [25] G. Pundak and T. N. Sainath, “Lower Frame Rate Neural Network Acoustic Models,” in INTERSPEECH, 2016, pp. 22–26.
  • [26] M. Abadi, P. Barham, J. Chen et al., “TensorFlow: A System for Large-Scale Machine Learning,” in 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI, 2016, pp. 265–283.
  • [27] K. Audhkhasi, B. Kingsbury, B. Ramabhadran, G. Saon, and M. Picheny, “Building Competitive Direct Acoustics-to-Word Models for English Conversational Speech Recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, 2018, pp. 4759–4763.
  • [28] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” in International Conference on Learning Representations, ICLR, 2015.