Exploiting semi-supervised training through a dropout regularization in end-to-end speech recognition

by   Subhadeep Dey, et al.
Idiap Research Institute

In this paper, we explore various approaches for semi-supervised learning in an end-to-end automatic speech recognition (ASR) framework. The first step in our approach involves training a seed model on a limited amount of labelled data. Additional unlabelled speech data is then exploited through a data-selection mechanism that retains the best hypothesized outputs, which are used to retrain the seed model. However, the uncertainties of the model may not be well captured by a single hypothesis. In contrast to this technique, we apply a dropout mechanism to capture the uncertainty by obtaining multiple hypothesized text transcripts of a speech recording. We assume that the diversity of automatically generated transcripts for an utterance will implicitly increase the reliability of the model. Finally, the data-selection process is also applied to these hypothesized transcripts to reduce the uncertainty. Experiments on the freely available TEDLIUM corpus and Adobe's proprietary internal dataset show that the proposed approach significantly reduces ASR errors compared to the baseline model.




1 Introduction

State-of-the-art approaches in automatic speech recognition (ASR) exploit the powerful discriminative capability of deep neural networks (DNN) for acoustic modelling [hinton2012deep, deng2013new, povey2016purely]. Current ASR systems offer low error rates, which makes them suitable for commercial use. In the past few years, sequence-level optimization algorithms, such as lattice-free maximum mutual information (LF-MMI) and end-to-end frameworks, have been adopted over frame-level discrimination approaches (like hybrid-DNN [motlicek2015exploiting]) [kim2017joint, povey2016purely, watanabe2018espnet]. As opposed to LF-MMI, the end-to-end approaches do not require the creation of a lexicon or decision trees for training. End-to-end sequence classification approaches such as connectionist temporal classification (CTC) and encoder-decoder frameworks have been successfully applied in ASR [watanabe2018espnet, miao2015eesen]. However, end-to-end ASR requires a large amount of training data to optimize the network, as the model needs to automatically learn the mapping from the acoustic features to the text transcripts.

Another relevant concept is semi-supervised learning. Our objective is to exploit a (relatively) large amount of unlabelled data when building an end-to-end ASR system [manohar2018semi, karita2018semi]. This scenario is attractive for a wide range of applications, such as low-resource speech recognition and computer vision, where unlabelled data is abundant but obtaining labels is costly [lee2013pseudo]. Various approaches to semi-supervised learning have been proposed in the literature [walker2017semi, karita2018semi, manohar2018semi]. A typical approach involves first training an initial seed model on the limited amount of supervised data. The seed model is then applied to the unsupervised data to automatically generate the transcripts [lee2013pseudo, walker2017semi, bachman2014learning]. As automatically generated transcripts may be erroneous, a data-selection mechanism is applied to filter out the low-confidence speech utterances. In [walker2017semi], utterance-level confidences are obtained to post-process the 1-best hypotheses. As opposed to using only 1-best hypothesized transcripts, the whole decoding lattice is used as the supervision output in [manohar2018semi]. For end-to-end ASR, [karita2018semi] has recently explored an approach exploiting unpaired text and audio data. This technique extracts intermediate hidden representations of speech and text data with a shared encoder network. However, it requires text data from the target domain, which may not be available in practice during the training stage.

In this paper, we explore a data-selection mechanism for semi-supervised learning of end-to-end ASR. Data selection has been shown to be a promising approach in various applications, such as image, text, and speech processing, but it has not been well explored for end-to-end ASR. For data selection, we explore two confidence-based measures, namely (i) utterance-based decoding confidence, and (ii) entropy-based confidence. We hypothesize that these measures indicate the reliability of the transcripts automatically generated by the end-to-end ASR for a given speech recording. In the proposed approach, the N-best hypothesized transcripts (filtered using the confidence measures) are used to further retrain the seed model.

Further, this paper also explores the application of a dropout mechanism for augmenting the N-best hypothesized text. In conventional ASR, dropout is usually employed during training as a regularizer [srivastava14a], while it is not applied during inference. In [Vyas_ICASSP2019_2019, gal2016dropout], dropout was applied in semi-supervised learning to characterize the uncertainties of the DNN. Motivated by this evidence, we propose to exploit the dropout mechanism to augment the N-best list as follows: during the decoding of an utterance, the dropout mechanism is employed to output a 1-best transcript. Dropout is applied to the same utterance multiple times to obtain different versions of the transcript. We hypothesize that the diversity of decoded outputs for an utterance can localize the uncertainties of the model. Experiments are performed on the publicly available TEDLIUM corpus and Adobe's proprietary internal dataset. The results indicate that the proposed approach makes efficient use of unlabelled data, leading to a significant improvement in ASR performance.

This paper is organized as follows. The baseline end-to-end ASR approach is described in Section 2. The semi-supervised training is described in Section 3. The experimental setup and results are described in Sections 4 and 5 respectively. Finally, the paper is concluded in Section 6.

2 End-to-end ASR

The state-of-the-art chain-model ASR requires a lexicon and alignments, usually generated with respect to context-dependent triphone states [povey2016purely]. An alternative technique, referred to as end-to-end, aims to learn the mapping from acoustic features to text directly, without the need for intermediate steps. Recently, various end-to-end sequence-to-sequence approaches, such as the encoder-decoder model, have been successfully applied to ASR [watanabe2018espnet, miao2015eesen]. The basic components of the encoder-decoder model are illustrated in Figure 1 and are described below:

  • Encoder: The purpose of the encoder is to produce a hidden representation of the utterance $X = \{x_1, x_2, \dots, x_T\}$, denoted by $H = \{h_1, h_2, \dots, h_L\}$, where $L \le T$. Typically, the encoder consists of a few convolutional neural network (CNN) layers and a few layers of bidirectional long short-term memory (BLSTM).

  • Attention unit: The attention unit takes as input a sequence of features and estimates the relative importance of each feature vector. The attention unit computes a context vector for each output unit.

  • Decoder: The context vector is used by the decoder unit to predict a character unit. The decoder also uses the previously decoded output character to infer the current character. Training such an end-to-end ASR is performed by optimizing the following loss function:

    $$\mathcal{L} = -\log P(C \mid X) \qquad (1)$$

    where $C$ is the character sequence corresponding to the utterance $X$.
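As an illustration of how an attention unit turns encoder outputs into a context vector, the following is a minimal sketch. It assumes plain dot-product scoring followed by a softmax, which is only one common choice and is not necessarily the attention variant used in the experiments; the function name `attention_context` and its arguments are hypothetical.

```python
import math


def attention_context(hidden_states, decoder_state):
    """Compute attention weights and a context vector for one decoder step.

    hidden_states: list of encoder vectors (one list of floats per frame).
    decoder_state: current decoder vector with the same dimension.
    Assumption: dot-product scoring; other scoring functions are possible.
    """
    # Relevance score of each encoder vector for the current output unit.
    scores = [sum(h_d * s_d for h_d, s_d in zip(h, decoder_state))
              for h in hidden_states]
    # Softmax with max-subtraction for numerical stability.
    max_score = max(scores)
    exp_scores = [math.exp(s - max_score) for s in scores]
    total = sum(exp_scores)
    weights = [e / total for e in exp_scores]
    # Context vector: attention-weighted sum of the encoder vectors.
    dim = len(hidden_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, hidden_states))
               for d in range(dim)]
    return weights, context
```

The weights sum to one, so the context vector is a convex combination of the encoder outputs; the decoder consumes one such vector per predicted character.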

A connectionist temporal classification (CTC) based loss is also combined with the objective function of Equation 1 for training the network. During decoding, the scores from the CTC and encoder-decoder branches, which act as the acoustic models, are combined within a beam-search algorithm. Furthermore, shallow fusion of a language model (LM) with the acoustic-model scores is applied to obtain the text transcripts. Further details of the end-to-end ASR can be found in [kim2017joint, watanabe2018espnet, hori2018end].
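The score combination described above can be sketched as a log-linear interpolation, in the spirit of joint CTC/attention decoding with shallow fusion [kim2017joint, hori2018end]. This is a simplified sketch that rescores complete hypotheses rather than running a beam search, and the weight values `ctc_weight=0.3` and `lm_weight=0.5` are illustrative assumptions, not values reported here.

```python
def joint_score(att_logp, ctc_logp, lm_logp, ctc_weight=0.3, lm_weight=0.5):
    """Interpolate attention and CTC log-probabilities, then add the LM term.

    All inputs are log-probabilities of one complete hypothesis.
    The default weights are placeholders chosen for illustration only.
    """
    acoustic = (1.0 - ctc_weight) * att_logp + ctc_weight * ctc_logp
    return acoustic + lm_weight * lm_logp


def rerank(hypotheses, ctc_weight=0.3, lm_weight=0.5):
    """Sort hypotheses by joint score, best first.

    hypotheses: list of (text, att_logp, ctc_logp, lm_logp) tuples.
    Returns a list of (score, text) pairs.
    """
    scored = [(joint_score(a, c, l, ctc_weight, lm_weight), text)
              for text, a, c, l in hypotheses]
    return sorted(scored, reverse=True)
```

In an actual decoder the same interpolation is applied incrementally to partial hypotheses inside the beam search, so that poor candidates are pruned early.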

Figure 1: Architecture for encoder decoder network for end-to-end ASR.

3 Semi-supervised learning

The end-to-end ASR is typically trained with a large amount (at least 100 hours) of labelled data [hori2018end]. In a semi-supervised setting, however, it is assumed that only a small amount of supervised data (10 to 15 hours) is available for training, in addition to a large amount of untranscribed audio from the target domain. Estimating the parameters of the end-to-end model on such limited data may not lead to a reliable solution. In this paper, we therefore exploit publicly available data (source domain) with a relatively large amount of speech recordings for estimating the parameters of the model. This ASR is referred to as the source-domain model. The parameters of the model are then adapted using the limited amount of transcribed data from the target domain. The adapted end-to-end ASR is finally used as the seed model for exploiting the unsupervised data for further retraining.

In the past, various approaches have been explored for unsupervised model adaptation [walker2017semi, karita2018semi, manohar2018semi, lee2013pseudo, bachman2014learning]. Most of these approaches rely on a data-selection process that bootstraps the model with additional labelled data selected on the basis of high-confidence predictions. Data selection has not been well explored for end-to-end ASR. In this paper, we explore data selection using (i) utterance-level decoding scores, and (ii) entropy-based confidence measures. These confidence measures are then applied to select highly reliable utterances.

The decoding scores are obtained from the posterior probabilities of an utterance given the acoustic features, as a result of the beam-search process. The ASR can generate a decoding score for each N-best text transcript. The utterance-based decoding scores are finally compared to a predefined threshold to perform the data selection.
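The threshold comparison can be sketched in a few lines. The tuple layout and the function name `select_by_decoding_score` are hypothetical, and the sketch assumes that a higher (less negative) decoding score means a more confident hypothesis.

```python
def select_by_decoding_score(nbest, threshold):
    """Keep hypotheses whose decoding score reaches the threshold.

    nbest: list of (utterance_id, transcript, decoding_score) tuples.
    Assumption: scores are log-domain values, so higher means more confident.
    """
    return [(utt_id, text, score)
            for utt_id, text, score in nbest
            if score >= threshold]


# Example with made-up scores and a threshold of -0.5.
hypotheses = [
    ("utt1", "move the table", -0.2),
    ("utt1", "move a table", -0.7),
    ("utt2", "delete the image", -0.5),
]
selected = select_by_decoding_score(hypotheses, threshold=-0.5)
```

Here `selected` keeps the first and third hypotheses, which would then be added to the adaptation data together with their audio.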

Furthermore, we apply entropy as a criterion to filter out utterances from the hypothesized ASR outputs. The entropy of an utterance measures the amount of uncertainty of the model. We hypothesize that the entropy of the utterance is well correlated with the performance of the ASR system. The entropy of an utterance is computed as follows: the posterior probabilities of the character units are obtained by a forward pass of the model. The entropy ($E$) is then:

$$E = -\sum_{i=1}^{K} p(c_i \mid X)\,\log p(c_i \mid X)$$

where $K$ is the number of character outputs and $p(c_i \mid X)$ is the posterior probability of the character unit $c_i$ given the acoustic features ($X$).
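The entropy-based confidence can be illustrated with a short sketch. It assumes the usual Shannon form, the negative sum of p log p over the character posteriors, and takes a flat list of probabilities; how the posteriors are collected over the utterance (per decoding step or pooled) is left open, and the function name is hypothetical.

```python
import math


def posterior_entropy(posteriors, eps=1e-12):
    """Shannon entropy (in nats) of a list of posterior probabilities.

    posteriors: probabilities of the character units given the acoustics.
    Terms with probability below eps are skipped, since p*log(p) tends to 0.
    """
    return -sum(p * math.log(p) for p in posteriors if p > eps)


confident = posterior_entropy([0.97, 0.01, 0.01, 0.01])  # peaked distribution
uncertain = posterior_entropy([0.25, 0.25, 0.25, 0.25])  # flat distribution
```

A peaked posterior gives a low entropy and a flat one gives the maximum, log(4) for four units, which is why entropy can serve as an uncertainty measure for data selection.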

We also propose to localize the uncertainty of the end-to-end ASR by applying a dropout mechanism. This method is motivated by recent advances in measuring the reliability of DNN models [Vyas_ICASSP2019_2019, gal2016dropout]. Conventional methods do not use dropout at decoding time. In contrast, we apply dropout during decoding, which amounts to sampling from the DNN weight distribution. The proposed approach is as follows:

  1. Dropout: apply dropout during inference to obtain a 1-best hypothesized transcript

  2. Data-selection: augment the adaptation data with this utterance if the entropy or the decoding-score is above a threshold

  3. Repeat steps 1 and 2 a fixed number of times

The above steps are applied to all the utterances of the unsupervised dataset.
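The steps above can be sketched as a simple loop. The decoder is represented by a hypothetical callable `decode_with_dropout` that returns one (transcript, confidence) pair per call with dropout active; it stands in for the actual ASR decoder and is not an API of any particular toolkit. The sketch follows the selection rule as stated above, keeping a hypothesis when its confidence reaches the threshold.

```python
def augment_with_dropout(utterances, decode_with_dropout, num_passes, threshold):
    """Collect confident 1-best transcripts from repeated dropout decoding.

    utterances: iterable of utterance identifiers (or feature objects).
    decode_with_dropout: hypothetical callable, utt -> (transcript, confidence),
        expected to give varying outputs because dropout stays active.
    num_passes: how many times each utterance is decoded.
    threshold: minimum confidence for a hypothesis to be kept.
    """
    adaptation_data = []
    for utt in utterances:
        for _ in range(num_passes):
            # Step 1: one stochastic decoding pass with dropout enabled.
            transcript, confidence = decode_with_dropout(utt)
            # Step 2: data selection on the confidence measure.
            if confidence >= threshold:
                adaptation_data.append((utt, transcript, confidence))
    return adaptation_data
```

Because each pass samples a different sub-network, the same utterance may contribute several distinct transcripts; whether duplicates are merged before retraining is a design choice that this sketch leaves open.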

4 Experimental Setup

In this section, the experimental setup for the semi-supervised ASR is detailed. Experiments are performed on TEDLIUM and Adobe (internal) datasets as the target domain data.

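Data                      LibriSpeech   TEDLIUM     Adobe
Source-domain training    100 hours     -           -
Supervised adaptation     -             15 hours    -
Unsupervised              -             50 hours    20 hours
Cross-validation          -             3 hours     -
Test                      5 hours       2.5 hours   2.5 hours
Table 1: Training, adaptation and test data for the different datasets. The supervised adaptation set contains the labelled target-domain data, while the unsupervised set comprises the unlabelled data.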

4.1 LibriSpeech

We selected 100 hours of the LibriSpeech clean portion as the source-domain data (first row of Table 1) [panayotov2015librispeech]. The LibriSpeech test set comprises five hours of clean speech recordings.

4.2 TEDLIUM

Experiments are conducted on the TEDLIUM speech dataset as the target-domain data [rousseau2012ted]. For our experiments, we used only 15 hours of labelled data (the supervised data in Table 1). Furthermore, we use 3 hours and 50 hours of data as the cross-validation and unsupervised sets, respectively (see Table 1). The test data consists of 2.5 hours. The details of the TEDLIUM corpus can be found in [rousseau2012ted].

4.3 Adobe

The experiments are also conducted on Adobe's internal speech dataset. This corpus contains users uttering a list of commands, for example, "Move the table". The unsupervised data contains 24 k utterances spoken by 250 speakers, with an average utterance duration of 3 s (about 20 hours in total). The test data consists of 1300 utterances spoken by 50 speakers. The performance of the ASR systems is reported in terms of word error rate (WER).
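Since all results are reported as WER, a minimal sketch of the metric is given here for reference. It uses the standard definition, the word-level edit distance (substitutions, deletions and insertions) divided by the number of reference words; the scoring tool actually used in the experiments may apply additional text normalization, which this sketch does not attempt.

```python
def word_error_rate(reference, hypothesis):
    """Return the WER in percent between two whitespace-tokenized strings."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed part of ref
    # and the first j words of hyp (rolling one-row dynamic programming).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)
```

For example, scoring the hypothesis "move a table" against the reference "move the table" gives one substitution over three reference words, that is, a WER of about 33.3%.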

4.4 Chain model

For the chain model, 40-dimensional mel-frequency cepstral coefficients (MFCC) are extracted from the speech utterance as input features to the neural network [povey2016purely]. Furthermore, we also use online i-vector features as input. The dimension of the i-vector is fixed to 100. The DNN uses 7 hidden layers of time-delay neural network (TDNN) with 1 k dimensional units. The DNN is trained to predict senones as the output, and LF-MMI is applied as the optimization criterion. The chain model employs a 3-gram LM during the decoding phase. The pronunciation dictionary is based on the publicly available CMU dictionary and includes the vocabulary from the training text of the LibriSpeech and TEDLIUM datasets.

4.5 End-to-end ASR

For the end-to-end ASR, 40-dimensional filter-bank energies are extracted from the utterances to constitute the features for the DNN [watanabe2018espnet, hori2018end]. Delta filter-bank energies and pitch features are appended to the original features to form 83-dimensional vectors. The end-to-end ASR described in Section 2 is trained to predict English characters (including semicolons, commas, etc., giving an output dimension of 30). The end-to-end ASR uses 3 CNN layers followed by 2 BLSTM layers as the encoder, with the dimension of each layer fixed to 512. The decoder network employs 2 LSTM layers, each with 512-dimensional units. A word-based language model is also trained with a vocabulary size of 50 k words. The LM uses 2 LSTM layers, each with a dimension of 1 k. The training text data from LibriSpeech and TEDLIUM are used to train the word-based LM.

5 Results

In this section, the results of the end-to-end ASR are presented. The following ASR systems will be analyzed:

  • LF-MMI: This ASR refers to the traditional chain model using the LF-MMI optimization criterion. The system is described in Section 4.4 and is trained using the standard Kaldi recipe [povey2011kaldi].

  • End-to-end: This refers to end-to-end ASR as presented in Section 2. The end-to-end ASR is trained to predict characters and referred to as E2E. We also trained an end-to-end ASR using a dropout value of 0.2. The dropout is applied to all the layers in encoder and decoder. The network (with dropout) is trained to predict characters. The end-to-end ASR (source domain data of Section 4.1) with dropout is referred to as E2E-drop.

  • Adapted ASR: The end-to-end and LF-MMI based ASR are adapted to TEDLIUM labelled data. The end-to-end adapted ASR that is trained in a supervised manner (on data of Table 1) is referred to as E2E, while the adapted LF-MMI is referred to as . The end-to-end ASR that exploits the unsupervised data from TEDLIUM is referred to as and the ASR that uses unlabelled Adobe’s internal data is referred to as .

5.1 Baseline

For training the models on the source domain, a subset of the LibriSpeech data is used, as described in Section 4.1. The results of the experiments on the LibriSpeech clean test set (column 2) are tabulated in Table 2. We observe that LF-MMI outperforms the end-to-end ASR. Furthermore, we also observe that E2E-drop performs worse than E2E by 0.8% absolute WER. The poor performance of the end-to-end ASR could be due to the limited amount of training data.

Systems LibriSpeech TEDLIUM Adobe
LF-MMI 29.7
E2E 11.2 38.5 40.5
E2E-drop 12.0 39.1 40.7
Table 2: Performance of the various baseline ASR systems in terms of WER (%) on LibriSpeech clean, TEDLIUM and Adobe test set with the source domain model. The chain model, LF-MMI, performs better than the end-to-end ASR systems.

5.2 Experiments on the TEDLIUM data

The performances of the various source domain ASRs on the TEDLIUM test portion are shown in Table 2 (Column 3). It can be observed that the LF-MMI performs the best on the TEDLIUM test set as well. These ASR systems are then adapted to labelled data, (Table 1). For the , retraining all the parameters of the model provides good performance while for end-to-end ASR, retraining the encoder performs the best. From Table 3, it can be observed that outperforms the . The end-to-end models are used as seed-model, for exploiting the unsupervised data. We first present results of an experiment employing decoding-score for end-to-end ASR.

5.2.1 Decoding-score

The first step in data-selection process is to fix a threshold on decoding-score generated by the seed-model on unlabelled data. The threshold is obtained by minimizing false alarm and miss detection rate as follows. The cross validation (CV) part of TEDLIUM data ( data as described in Table 1) is first decoded using the . For each utterance, the WER is computed. Thus, WER and decoding-score are associated to each utterance. We divide the CV set into two parts, (i) Set1: utterances for which WER %, and (ii) Set2: WER for these utterances %. The histogram plot of the decoding-scores of Set1 and Set2 is illustrated in Figure 2. To minimize false positive and miss detection (from Figure 2), the threshold should be fixed between -0.3 to -0.6 (Refer to Figure 2). This threshold is applied on the unsupervised data ( of Table 1) for selecting speech utterances.
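The threshold search can be sketched as a sweep over candidate values. How the false-alarm and miss-detection rates are traded off is not detailed above, so this sketch assumes an unweighted sum of the two; it also assumes that the low-WER utterances are the ones that should be accepted. The function name and the toy scores are hypothetical.

```python
def pick_threshold(reliable_scores, unreliable_scores, candidates):
    """Pick the candidate threshold minimizing miss rate + false-alarm rate.

    reliable_scores: decoding scores of CV utterances with low WER
        (these should be accepted).
    unreliable_scores: decoding scores of CV utterances with high WER
        (these should be rejected).
    candidates: threshold values to try; ties keep the first candidate.
    """
    best_threshold, best_cost = None, None
    for threshold in candidates:
        # Miss: a reliable utterance falls below the threshold.
        miss = sum(s < threshold for s in reliable_scores) / len(reliable_scores)
        # False alarm: an unreliable utterance passes the threshold.
        false_alarm = (sum(s >= threshold for s in unreliable_scores)
                       / len(unreliable_scores))
        cost = miss + false_alarm
        if best_cost is None or cost < best_cost:
            best_threshold, best_cost = threshold, cost
    return best_threshold


# Toy cross-validation scores (made up for illustration).
low_wer_scores = [-0.1, -0.2, -0.3, -0.4]
high_wer_scores = [-0.6, -0.7, -0.8, -0.9]
chosen = pick_threshold(low_wer_scores, high_wer_scores, [-0.3, -0.5, -0.8])
```

With these toy scores the sweep selects -0.5, the only candidate that separates the two groups without errors; on real data the two histograms overlap and the chosen value reflects the trade-off between the two error types.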

The end-to-end ASR (seed-model) is applied to decode the data to obtain text-transcripts (10-best) and decoding-scores for an utterance. Data-selection is applied on these decoding-scores for filtering the highly confident outputs (for further retraining the system). The results of data-selection process using two threshold values are illustrated in Table 3. We observed that 10-best hypothesis is beneficial for than 1-best hypothesis. Data-selection based on 1-best hypothesis provides WER of 29.5% while data-selection using 10-best decoded-outputs provides WER of 28.9%. Furthermore, the performance of does not improve on applying more than 10-best hypothesis. From Table 3, it can be observed that performs better with threshold of -0.5. We observed that thresholds less than -0.3 lead to the selection of shorter duration utterances. The outperforms by 0.3% absolute WER. In rest of the experimental section, threshold of -0.5 is applied on decoding-score for data-selection.

We augment the -best hypothesized transcripts (generated by E2E) by using dropout mechanism from the E2E-drop. The process for data-selection is described in Section 3. From Table 4, it can be observed that significant gain in performance is obtained by this approach ( + E2E-drop). Furthermore, this technique outperforms the E2E by 2% absolute WER (29.2% to 27.2%) showing the importance of localizing the uncertainties in the model.

5.2.2 Entropy

We also explore an approach of using entropy as data-selection criteria as described in Section 3. From Table 4, it can be observed that the performance of does not improve upon on using additional unsupervised data. Furthermore, we apply dropout mechanism as described in Section 3 for augmenting the data (as done in Section 5.2.1 on decoding-score). It can be observed that this approach outperforms by 0.4% absolute WER.

5.3 Experiments on Adobe’s internal dataset

We performed the data-selection process using decoding-score and entropy based confidence scores on Adobe’s internal data as well. Due to the lack of development data from Adobe, the thresholds from TEDLIUM are used for data-selection process. For example, threshold of -0.5 is applied on decoding-score for utterance-selection. The result of data-selection process using decoding-score is shown in Table 4. From the table, it can be observed that performs best (retrained with 10-best hypothesized transcripts) with a WER of 32.2%. Furthermore, additional transcripts generated by the dropout model are used for augmenting the 10-best hypothesized text. The performance of this approach ( + E2E-drop) does not improve upon the result of 32.2% WER. The poor performance could be due to choice of non optimal threshold on decoding-score for data-selection process. It can be observed that dropout mechanism benefits performance of the + E2E-drop over the non-adapted E2E using entropy based data-selection criteria.

Systems adapt-data Threshold TEDLIUM
E2E - 29.2
+ -0.3 29.7
+ -0.5 28.9
Table 3: Performance of various ASR systems in terms of WER (%) on TEDLIUM test set. The details of supervised and unsupervised data (adapt-data) have been tabulated in Table 1.
Systems - TED Adb
E2E - 38.5 40.5
E2E - 29.2 -
- 28.9 -
- -
+ E2E-drop - -
+ E2E-drop - - 34.3
29.2 -
- 38.1
+ E2E-drop 28.8 -
+ E2E-drop - 37.7

Table 4: Performance of various ASR systems on TEDLIUM (TED) and Adobe (Adb) test set using decoding-score (dec-score) and entropy based data-selection (data-sel) criteria. The + E2E-drop performs the best on TEDLIUM test set.
Figure 2: Histogram plot of decoding-scores for set of utterances, with WER 10% (blue) and decoding-scores for utterances with WER 10% (orange).

6 Conclusions

Techniques for semi-supervised learning for ASR were investigated in this paper. For exploiting unlabelled data, the baseline system employs a single best hypothesized text-transcript. As opposed to this approach, we proposed to capture the uncertainties by applying a dropout mechanism to generate multiple hypothesized transcripts. Furthermore, we also used data-selection mechanism to filter the highly confident hypotheses. The techniques were evaluated in publicly available TEDLIUM and Adobe’s internal dataset. Experiments show that the proposed approach ( + E2E-drop) outperforms the baseline method by 2% absolute reduction in WER on TEDLIUM test set.

7 Acknowledgements

This work was done under the “SM2 - Extracting semantic meaning from spoken material” project, partially supported by the Swiss Innovation Agency (InnoSuisse) as well as by a research grant from Adobe Research, USA.