Retraining, i.e., training an existing model with new data from a target domain to improve its performance, is an important topic in Automatic Speech Recognition (ASR) [1, 2]. Because ASR systems are data-driven, their performance depends heavily on the scale and domain coverage of the training data. Thus, the most common retraining method is to add a large amount of target-domain training data when extending a trained model to a new target domain. Unlabeled training data can be obtained easily, but annotating a large amount of speech by hand is time-consuming and challenging. Therefore, we ask: can we manually label only a small portion of the unlabeled data as new training data to retrain ASR systems?
Recently, most ASR systems have been based on Deep Neural Networks (DNNs) [3, 4, 5]. During training, the back-propagation (BP) method is used to update the system parameters: the system first computes the error between the transcription and the corresponding annotated text, and then back-propagates this error to optimize and update the parameters. Consequently, the parameters are influenced more by training samples with a large propagation error than by samples with a small one. We call the former the hard samples and the latter the easy samples. The hard samples represent knowledge that differs between the original and the target domain. Thus, a subset containing a large number of hard samples is the one we want most, and it is useful for improving ASR retraining with little new training data.
Unfortunately, hard samples are sparse in the target-domain training data, and manually labeling them is expensive. We therefore try to find them from the transcriptions, since a transcription fully indicates whether a sentence is misrecognized. This task has some characteristics: a sentence is composed of a series of words, and the correctness of the sentence depends on the words inside it; meanwhile, it is easy to mark a whole sentence manually but hard to label each word, or the words causing the error. Accordingly, we hope to obtain the label of every word within a sentence by using only the sentence-level label, so our task can be regarded as a weakly supervised learning task. Since Deep Multiple Instance Learning (DMIL) is one of the most successful weakly supervised learning methods [7, 8, 9, 10, 11, 12], we select it as the basic method and modify it for our purpose.
The contribution of this paper is a new approach to improve ASR retraining by using the hard samples; we propose three methods, i.e., Sparse-Attention based DMIL, Gated Sparse-Attention based DMIL, and Discriminative DMIL, to effectively find the hard samples.
2 Proposed Method
2.1 Retraining Framework Based on Hard Samples
2.1.1 Influence of Retraining using Hard Samples
Let $P_u$ denote the distribution of the unlabeled data. When we randomly select a subset from the unlabeled training data, it will most likely contain only a small number of hard samples, since hard samples are very sparse in the unlabeled data. However, we aim to obtain as many hard samples as possible. Figure 1 presents the relationship between the model parameters and the character error rate (CER) on the target domain, where $\theta_0$ denotes the parameters of the current model and $\theta^*$ the best parameters for the target domain. If we select a randomly sampled subset to train the current system, the resulting parameters stay near $\theta_0$ and fit that subset well, but they are not suitable for the whole target domain. On the contrary, if a subset that includes many hard samples is selected to train the current model, the resulting parameters fit it well and lie very close to $\theta^*$, which suits the whole target domain. The reason is that the distribution of the hard samples in the target domain differs greatly from that of the original training data, whereas the distribution of the easy samples does not. In summary, the hard samples overwhelmingly promote the retraining process, so we attempt to improve the ASR system by using a subset full of hard samples.
We define the hard samples subset as
$$D_h = \{\, x \mid x \in D_t,\ x \notin D_c \,\},$$
where $D_h$ is the hard samples subset, $D_c$ is the training data of the current domain, $D_t$ is the training data of the target domain, and $x$ is a data sample.
2.1.2 Hard Samples Mining based ASR Retraining Framework
Based on the problem described above, we attempt to improve ASR retraining by using the hard samples subset and propose a novel framework to retrain the ASR system. The structure of the framework is shown in Figure 2; it contains three parts. In the first part, we collect a large amount of unlabeled target-domain data. In the second part, we mine the hard samples subset from the unlabeled data with an ASR error detection model and then manually label them. In the third part, the ASR system's parameters are updated by using the hard samples.
In this process, mining hard samples from unlabeled target domain data is an extremely critical step.
Thus, we introduce how to mine hard samples from the unlabeled training data. As shown in the red box of Figure 2, this method consists of two parts: training the ASR error detection model, and mining hard samples from the unlabeled data with the trained model. First, we select a subset at a fixed sampling interval from the unlabeled data, which has been sorted in advance by sentence length. Then, the selected subset is fed into the original ASR system, which outputs the corresponding transcriptions. Domain experts label each transcription by comparing it with the corresponding reference sentence: if they are identical, it is marked as correct; otherwise, it is marked as erroneous. In this way, all transcriptions are labeled. Next, the labeled transcriptions are used to train the ASR error detection model. Finally, the trained model is used to mine candidate hard samples from the unlabeled data, and these candidates are manually labeled to determine the final results.
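The two-part mining procedure above can be sketched as follows. This is a minimal illustration only; `run_asr` style callables and the detector are hypothetical placeholders, not an existing API.

```python
# Sketch of the hard-sample mining pipeline: (1) take a length-stratified
# subset to train the error detector, (2) use the detector to flag
# candidate hard samples in the remaining unlabeled data.

def length_stratified_subset(utterances, step):
    """Sort utterances by length, then keep every `step`-th one
    (the fixed sampling interval described in the text)."""
    ordered = sorted(utterances, key=len)
    return ordered[::step]

def mine_hard_samples(unlabeled, asr, detector, threshold=0.5):
    """Keep utterances whose transcriptions the detector flags as erroneous.

    `asr` maps an utterance to its transcription; `detector` maps a
    transcription to a predicted probability of error. Both are
    placeholders for the trained models in the paper's framework.
    """
    candidates = []
    for utt in unlabeled:
        hyp = asr(utt)                   # transcription from the current ASR
        if detector(hyp) >= threshold:   # predicted to contain an error
            candidates.append(utt)
    return candidates                    # still requires manual labeling
```

The returned candidates correspond to the "candidate hard samples" in Figure 2, which domain experts then label to produce the final hard samples subset.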
2.2 Hard Samples Mining by Using DMIL
2.2.1 Problem Formulation
In this paper, our aim is to find the hard samples among a large amount of unlabeled target-domain data, since they can be used to improve ASR retraining. We propose to use the transcription to achieve this aim.
It is well known that a sentence consists of a series of words with contextual relationships between them. To take this context into account, we use a fixed-length context window to divide the transcription into multiple context text blocks, and determine the class of the transcription from the classification result of each block. Formally, we write
$$S = \{x_1, x_2, \ldots, x_N\}, \qquad x_i \in \mathbb{R}^{d \times w},$$
where $S$ is the sentence generated by the ASR system, $N$ is the number of words in $S$, $x_i$ represents the feature vector of the $i$-th word in the sentence, $d$ is the dimension of the word feature, and $w$ is the width of the context window. Let $y_i$ be the word label of $x_i$ and $Y$ be the sentence label of $S$; we define $x_i$ as a key word if $y_i = 1$. We hope that the ASR error detection model learns $y_i$ while learning $Y$ when we input the sample pair $(S, Y)$ into it.
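To make the windowing and the multiple-instance labeling concrete, here is a minimal sketch. The padding token and the helper names are illustrative assumptions, not part of the paper's method.

```python
def context_blocks(words, w):
    """Pad the sentence and return one (2*w+1)-word context block per word,
    mirroring the fixed-length context window described above."""
    pad = ["<pad>"] * w
    padded = pad + words + pad
    return [padded[i:i + 2 * w + 1] for i in range(len(words))]

def sentence_label(word_labels):
    """MIL assumption: a sentence is erroneous (label 1) iff at least one
    word inside it is erroneous."""
    return max(word_labels) if word_labels else 0
```

Only `sentence_label`-style bag labels are available at training time; the word labels are what the model is hoped to recover.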
2.2.2 Attention based DMIL
We use the Attention based DMIL model as the baseline in this paper. It includes a word-level embedding network, a pooling operator, and a classifier network. In this structure, the word-level embedding network extracts the feature of each text block, the pooling operator compresses the multiple word features into one sentence feature, and the classifier network predicts the class of the sentence from this feature.
When we input a sample pair $(S, Y)$, the whole process is as follows:
$$h_i = f(x_i), \qquad e_i = w^{\top}\tanh(V h_i), \qquad a_i = \mathrm{softmax}(e)_i, \qquad z = \sum_{i=1}^{N} a_i h_i,$$
where $f(\cdot)$ is the word-level embedding network, $h_i$ is the $D$-dimensional feature vector of the word $x_i$, $V$ and $w$ are the weight matrices of the attention mechanism, and $e_i$ and $a_i$ are the score and the attention value of $h_i$, respectively.
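The attention pooling step can be sketched in plain Python as below. This is a didactic sketch of Ilse et al.-style attention pooling with hand-rolled linear algebra, not the paper's implementation.

```python
import math

def attention_pool(H, V, w):
    """Attention-based MIL pooling:
    e_i = w . tanh(V h_i),  a = softmax(e),  z = sum_i a_i h_i."""
    def matvec(M, x):
        return [sum(m * xi for m, xi in zip(row, x)) for row in M]

    # per-word attention scores
    scores = [sum(wj * t for wj, t in zip(w, map(math.tanh, matvec(V, h))))
              for h in H]
    # numerically stable softmax over the scores
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    a = [e / total for e in exps]
    # weighted sum of word features -> one sentence feature
    D = len(H[0])
    z = [sum(a[i] * H[i][d] for i in range(len(H))) for d in range(D)]
    return z, a
```

The attention weights `a` are exactly what the later sections inspect to locate key (erroneous) words.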
2.2.3 Hard Samples Mining using Sparse-Attention
For Attention based DMIL, as the front-end ASR system improves, the number of errors in the transcription decreases and the similarity between correct and erroneous transcriptions increases. In this case, the traditional Attention based DMIL cannot efficiently find the errors in the transcription.
In the traditional attention mechanism, we calculate the score of each word through a fully-connected network and then normalize the scores with the softmax transformation function:
$$a_i = \mathrm{softmax}(e)_i = \frac{\exp(e_i)}{\sum_{j=1}^{N}\exp(e_j)}. \qquad (6)$$
However, the softmax function maps its input to a normalized output vector in which every dimension is strictly greater than zero; this is wasteful. Moreover, the attention value of each word may then be close to the average, with the result that the key word cannot be found.
With this in mind, we propose an improved DMIL from the aspect of the attention mechanism, i.e., Sparse-Attention based DMIL. We replace (6) with the sparsemax transformation function:
$$a = \mathrm{sparsemax}(e) = \underset{p \in \Delta^{N-1}}{\arg\min}\; \lVert p - e \rVert^{2},$$
where $\Delta^{N-1} = \{\, p \in \mathbb{R}^{N} \mid p \ge 0,\ \textstyle\sum_i p_i = 1 \,\}$. In other words, sparsemax is the Euclidean projection of the scores onto the probability simplex. These projections tend to hit the boundary of the simplex and yield a sparse probability distribution, which allows the classifier to attend to only a few words in the sentence and assign zero probability mass to all other words. It has been shown that the asymptotic forward cost of sparsemax and softmax is the same, while the gradient back-propagation of sparsemax is faster than that of softmax, taking only sublinear time.
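The simplex projection has a simple closed-form algorithm: sort the scores, find the support size, compute a threshold, and shift-and-clip. A minimal sketch, following the standard sparsemax evaluation procedure:

```python
def sparsemax(z):
    """Euclidean projection of z onto the probability simplex.

    Returns p with p >= 0 and sum(p) == 1; coordinates below the
    learned threshold tau are set exactly to zero.
    """
    zs = sorted(z, reverse=True)
    cum = 0.0
    tau = zs[0] - 1.0
    for j, zj in enumerate(zs, start=1):
        cum += zj
        # support condition: 1 + j*z_(j) > sum of the top-j scores
        if 1.0 + j * zj > cum:
            tau = (cum - 1.0) / j
    return [max(zi - tau, 0.0) for zi in z]
```

Unlike softmax, well-separated scores produce exact zeros, e.g. `sparsemax([2.0, 1.0, 0.1])` puts all mass on the first word.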
2.2.4 Hard Samples Mining using Gated Sparse-Attention
To further enhance Sparse-Attention based DMIL, we notice that it is difficult to learn complex relations efficiently by using the $\tanh(\cdot)$ in (5). Our concern follows from the fact that $\tanh(x)$ is approximately linear for $x \in [-1, 1]$, which probably limits the expressiveness of the learned relations among words. Thus, we propose to replace (5) and (6) with the Gated Sparse-Attention mechanism, which additionally uses a gating mechanism together with $\tanh(\cdot)$, yielding:
$$e_i = w^{\top}\big(\tanh(V h_i) \odot \sigma(U h_i)\big), \qquad a = \mathrm{sparsemax}(e),$$
where $U$ is a weight matrix, $\odot$ denotes element-wise multiplication, and $\sigma(\cdot)$ is the sigmoid activation function. The gating mechanism probably removes the problem in $\tanh(\cdot)$ by introducing a learnable non-linearity.
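The gated scoring step can be sketched as below (hand-rolled linear algebra for illustration; the matrices are toy stand-ins for learned weights). The resulting scores would then be normalized with sparsemax as in the previous section.

```python
import math

def gated_scores(H, V, U, w):
    """Gated attention scores: e_i = w . (tanh(V h_i) * sigmoid(U h_i)),
    where * is element-wise multiplication."""
    def matvec(M, x):
        return [sum(m * xi for m, xi in zip(row, x)) for row in M]

    def sigmoid(t):
        return 1.0 / (1.0 + math.exp(-t))

    scores = []
    for h in H:
        t = [math.tanh(v) for v in matvec(V, h)]   # bounded feature transform
        g = [sigmoid(u) for u in matvec(U, h)]     # learnable gate in (0, 1)
        scores.append(sum(wj * tj * gj for wj, tj, gj in zip(w, t, g)))
    return scores
```

The product of the two non-linearities breaks the near-linearity of `tanh` alone around zero, which is the motivation stated above.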
2.2.5 Hard Samples Mining using Discriminative Embedding
In traditional DMIL, the softmax function is usually used as the activation function of the last layer of the classifier network. However, it has two disadvantages that prevent it from finding the key word effectively. On the one hand, as the similarity of the training data increases, the classifier network structure becomes more complex; meanwhile, the gradient of the attention becomes very small and may even vanish. On the other hand, many studies have shown that the softmax function cannot effectively guide the training of the embedding network, making it difficult to find the key word [14, 15]. Thus, the performance of traditional DMIL is limited.
To solve the problems mentioned above, we propose the SVM-based DMIL, which uses a two-stage training strategy to train the word embedding network and the classifier network separately. In the first stage, we jointly optimize the SVM and the embedding network to make the obtained embeddings more discriminative.
The original SVM solves binary classification problems. Given the training sample pairs $(h_i, y_i)$ with $y_i \in \{-1, +1\}$, the SVM optimizes the following constrained problem:
$$\min_{w, b, \xi}\; \frac{1}{2}\lVert w \rVert^{2} + C\sum_{i}\xi_i \quad \text{s.t.} \quad y_i\big(w^{\top}h_i + b\big) \ge 1 - \xi_i,\;\; \xi_i \ge 0,$$
where the $\xi_i$ are slack variables that penalize misclassified samples, and $C$ is the penalty factor, which controls the penalty for misclassified samples. The choice of $C$ also greatly affects the training speed of the neural network.
Then, we can convert the above constrained problem into an unconstrained one:
$$\min_{w, b}\; \frac{1}{2}\lVert w \rVert^{2} + C\sum_{i}\max\big(0,\ 1 - y_i(w^{\top}h_i + b)\big).$$
Further, we can convert this into a neural network objective function by replacing the hinge loss with the generalized logistic loss
$$\ell_{\beta}(x) = \frac{1}{\beta}\log\big(1 + e^{\beta x}\big),$$
which is a smooth approximation of the hinge loss, where $\beta$ is a sharpness parameter. In this paper, we use a fixed $\beta$ to reduce the number of hyperparameters.
Table 1: ASR error detection results.

| Model Name | SI (%) |
|---|---|
| Attention based DMIL | 82.8 |
| Attention based DMIL | 81.9 |
| Attention based DMIL | 81.7 |
| Sparse-Attention based DMIL | 82.5 |
| Attention-based DMIL + DT | 83.0 |
| Sparse-Attention-based DMIL + DT | 82.7 |
| Gated-Sparse-Attention-based DMIL + DT | 82.9 |
3 Experimental Details
3.1 Experiment Settings
In this section, we introduce the datasets, model structures, and training strategies used in our experiments.
Our experiments are conducted on the 300-hour Switchboard English conversational telephone corpus and the 2000-hour Fisher corpus, which are among the most studied ASR benchmarks today [18, 19, 20, 21].
We trained a Listen, Attend and Spell (LAS) model with the ESPnet toolkit (the code of ESPnet is available at https://github.com/espnet/espnet) on the Switchboard corpus. We select 100 h of training data from the Fisher corpus to train the ASR error detection model and then use it to mine hard samples from the rest of the corpus.
For all DMIL variants, we design the word-level embedding network as a Convolutional Neural Network (CNN) to extract word-level features, the pooling as a two-layer fully-connected network, and the classifier network as a three-layer fully-connected network with Rectified Linear Unit (ReLU) activations.
The training strategies in this paper are as follows. First, the weights of all layers are uniformly initialized between -0.05 and 0.05. Then the networks are trained using Adam with a learning rate of 0.001; the learning rate is halved whenever the held-out loss does not decrease by at least 10%. Finally, we clip the gradients to stabilize training.
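The learning-rate schedule and gradient clipping described above can be sketched as follows. The function names are illustrative, and the clipping bound is a hypothetical parameter since the paper's exact range is not stated here.

```python
def next_learning_rate(lr, prev_loss, curr_loss, min_rel_improvement=0.10):
    """Halve the learning rate whenever the held-out loss fails to
    decrease by at least 10% relative to the previous epoch."""
    if prev_loss is not None and curr_loss > prev_loss * (1.0 - min_rel_improvement):
        return lr / 2.0
    return lr

def clip(gradients, bound):
    """Element-wise clip gradients into [-bound, bound] to stabilize training."""
    return [max(-bound, min(bound, g)) for g in gradients]
```

Value clipping of this kind bounds each gradient component independently, which is the simplest way to prevent a single large error signal from destabilizing Adam updates.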
3.2 Results of Sentence Level ASR Error Detection
To evaluate the sentence classification ability of our methods, we choose the text Convolutional Neural Network (TextCNN), which is widely used for sentence classification, as the baseline model (the code of TextCNN is available at https://github.com/dennybritz/cnn-text-classification-tf).
We now analyze the results of the sentence-level ASR error detection model. All model structures and experimental results are given in Table 1, where $w$ is the width of the context window and SI is the accuracy of sentence-level detection.
First, we explore the performance of Attention based DMIL on sentence-level ASR error detection. By comparing its accuracy with that of the TextCNN baseline, we find that it performs similarly on this task.
Then, we explore the influence of the width of the context window $w$ by comparing models with various values of $w$. We find that Attention based DMIL performs best with the smallest $w$ and that performance decreases as $w$ increases. Thus, we use this setting in the following experiments.
Finally, we explore the performance of the improved methods. Comparing the Sparse-Attention and Gated Sparse-Attention models with Attention based DMIL, we find that they alone do not improve sentence-level performance. Comparing each model with and without the discriminative training (DT) strategy, however, shows that DT significantly improves the performance of DMIL and yields the best result among these models.
3.3 Results of Word Level ASR Error Detection
To evaluate the word classification ability of our methods, we use the same baseline model and analyze the results of the word-level ASR error detection model. All model structures and experimental results are described in Table 1, where P, R, F1, and Acc denote the Precision, Recall, F1 score, and Accuracy, respectively.
First, we explore the performance of Attention based DMIL on word-level ASR error detection. From the second row of Table 1, we can see that its word-level performance is poor, so it cannot find the key words efficiently. The reason is the shortcoming of the attention mechanism with softmax discussed above.
Then, we explore the performance of the Sparse-Attention based DMIL. From the fifth row of Table 1, we find that it identifies a small number of key words with high precision but misses most of the key instances, so its F1 score is lower than that of the baseline model.
Next, we explore the influence of the gating mechanism. Comparing the Sparse-Attention and Gated Sparse-Attention models, we find that the gating mechanism improves the performance of Sparse-Attention based DMIL by helping it find more key words, which supports our earlier assumption that it removes the problem in $\tanh(\cdot)$ by introducing a learnable non-linearity.
Finally, we explore the influence of the DT strategy. Comparing each model with and without DT, we find that the DT strategy improves the precision of Attention based DMIL; the same conclusion holds for the Sparse-Attention and Gated Sparse-Attention models.
Table 2: Results of retraining the ASR system with different datasets.

| Training Data | Hours | CER (%) |
|---|---|---|
| SwitchBoard + Easy samples | 800 | |
| SwitchBoard + Hard samples | 800 | 17.2 |
3.4 Results of Retraining ASR via Hard Samples
We show the results of the ASR system retrained with hard samples on different datasets in Table 2. The model trained with the 500 h hard-sample dataset achieves the best performance on the Switchboard test set.
4 Conclusion
In this paper, we first propose an improved retraining framework for ASR using hard samples. We then propose a novel method, based on Attention based DMIL, for mining hard samples from unlabeled data; it can locate errors while determining the class of the transcription. Furthermore, we propose three enhanced methods from the aspects of the attention mechanism and the training strategy, i.e., Sparse-Attention based DMIL, Gated Sparse-Attention based DMIL, and DDMIL. We verified the proposed methods on the Switchboard and Fisher corpora. The experimental results show that, compared with the traditional training method, the model retrained with hard samples achieves greatly improved performance.
This research was supported by the National Key Research and Development Plan of China under Grant 2017YFB1002102 and the National Natural Science Foundation of China under Grant U1736210.
-  C. Konopka and L. C. Almstrand, “Retraining and updating speech models for speech recognition,” Sep. 6 2005, US Patent 6,941,264.
-  R. Haimi-Cohen, “Automatic retraining of a speech recognizer while using reliable transcripts,” Apr. 16 2002, US Patent 6,374,221.
-  T. N. Sainath, O. Vinyals, A. W. Senior, and H. Sak, “Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks,” in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, 2015, pp. 4580–4584.
-  V. Valtchev, J. Odell, P. C. Woodland, and S. J. Young, “Lattice-based discriminative training for large vocabulary speech recognition,” in IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, ICASSP, 1996, pp. 605–608.
-  D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen et al., “Deep speech 2: End-to-end speech recognition in English and Mandarin,” in International Conference on Machine Learning, 2016, pp. 173–182.
-  Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E. Hubbard, and L. D. Jackel, “Handwritten digit recognition with a back-propagation network,” in Advances in neural information processing systems, 1990, pp. 396–404.
-  Y. Xu, T. Mo, Q. Feng, P. Zhong, M. Lai, and E. I. Chang, “Deep learning of feature representation with multiple instance learning for medical image analysis,” in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, 2014, pp. 1626–1630.
-  J. Wu, Y. Yu, C. Huang, and K. Yu, “Deep multiple instance learning for image classification and auto-annotation,” in IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2015.
-  O. Z. Kraus, L. J. Ba, and B. J. Frey, “Classifying and segmenting microscopy images with deep multiple instance learning,” Bioinformatics, vol. 32, no. 12, pp. 52–59, 2016.
-  X. Wang, Y. Yan, P. Tang, X. Bai, and W. Liu, “Revisiting multiple instance neural networks,” Pattern Recognition, vol. 74, pp. 15–24, 2018.
-  X. Liu, L. Jiao, J. Zhao, J. Zhao, D. Zhang, F. Liu, S. Yang, and X. Tang, “Deep Multiple Instance Learning-Based Spatial-Spectral Classification for PAN and MS Imagery,” IEEE Trans. Geoscience and Remote Sensing, vol. 56, no. 1, pp. 461–473, 2018.
-  M. Ilse, J. M. Tomczak, and M. Welling, “Attention-based Deep Multiple Instance Learning,” in International Conference on Machine Learning, ICML, 2018, pp. 2132–2141.
-  Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, “Language Modeling with Gated Convolutional Networks,” in International Conference on Machine Learning, ICML, 2017, pp. 933–941.
-  W. Liu, Y. Wen, Z. Yu, and M. Yang, “Large-Margin Softmax Loss for Convolutional Neural Networks,” in International Conference on Machine Learning, ICML, 2016, pp. 507–516.
-  W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, “SphereFace: Deep Hypersphere Embedding for Face Recognition,” in IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2017, pp. 6738–6746.
-  A. Mignon and F. Jurie, “PCCA: A new approach for distance learning from sparse pairwise constraints,” in IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2012, pp. 2666–2672.
-  J. J. Godfrey, E. C. Holliman, and J. McDaniel, “SWITCHBOARD: Telephone speech corpus for research and development,” in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, 1992, pp. 517–520.
-  W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, and G. Zweig, “The Microsoft 2016 conversational speech recognition system,” in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, 2017, pp. 5255–5259.
-  K. Veselỳ, A. Ghoshal, L. Burget, and D. Povey, “Sequence-discriminative training of deep neural networks.” in INTERSPEECH, 2013, pp. 2345–2349.
-  D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, “Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI.” in INTERSPEECH, 2016, pp. 2751–2755.
-  W. Hartmann, R. Hsiao, T. Ng, J. Ma, F. Keith, and M.-H. Siu, “Improved Single System Conversational Telephone Speech Recognition with VGG Bottleneck Features,” in INTERSPEECH, 2017, pp. 112–116.
-  W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, 2016, pp. 4960–4964.
-  S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. E. Y. Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala, and T. Ochiai, “ESPnet: End-to-End Speech Processing Toolkit,” in INTERSPEECH, 2018, pp. 2207–2211.
-  D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” in International Conference on Learning Representations, ICLR, 2015.
-  Y. Kim, “Convolutional neural networks for sentence classification,” in Conference on Empirical Methods in Natural Language Processing, EMNLP, 2014, pp. 1746–1751.