Recent Advances in End-to-End Spoken Language Understanding

09/29/2019 ∙ by Natalia Tomashenko, et al. ∙ Université d'Avignon et des Pays de Vaucluse, University of Nantes, Le Mans Université

This work investigates spoken language understanding (SLU) systems in the scenario when the semantic information is extracted directly from the speech signal by means of a single end-to-end neural network model. Two SLU tasks are considered: named entity recognition (NER) and semantic slot filling (SF). For these tasks, in order to improve the model performance, we explore various techniques including speaker adaptation, a modification of the connectionist temporal classification (CTC) training criterion, and sequential pretraining.




1 Introduction

Spoken language understanding (SLU) is a key component of conversational artificial intelligence (AI) applications. Traditional SLU systems consist of at least two parts. The first is an automatic speech recognition (ASR) system that transcribes the acoustic speech signal into word sequences. The second is a natural language understanding (NLU) system which predicts, given the output of the ASR system, named entities, semantic or domain tags, and other language characteristics depending on the considered task. In classical approaches, these two systems are often built and optimized independently.

Recent progress in deep learning has impacted many research and industrial domains and boosted the development of conversational AI technology. Most state-of-the-art SLU and conversational AI systems employ neural network models. The research community currently shows strong interest in end-to-end systems for various speech and language technologies.

A few recent papers [21, 15, 24, 10, 4, 17] present ASR-free end-to-end approaches for SLU tasks and show promising results. These methods aim to learn SLU models from the acoustic signal without an intermediate text representation. Paper [4] proposed an audio-to-intent architecture for semantic classification in dialog systems. An encoder-decoder framework [26] is used in [24] for domain and intent classification, and in [15] for domain, intent, and argument recognition. A different approach, based on a model trained with the connectionist temporal classification (CTC) criterion [12], was proposed in [10] for named entity recognition (NER) and slot filling. End-to-end methods are motivated by two factors: the possibility of better information transfer from the speech signal thanks to joint optimization on the final objective criterion, and the simplification of the overall system through the elimination of some of its components. However, deep neural networks, and especially end-to-end models, often require more training data to be effective. For SLU, this implies a need for large semantically annotated corpora. In this work, we explore different ways to improve the performance of end-to-end SLU systems.

2 SLU tasks

In SLU for human-machine conversational systems, an important task is to automatically extract semantic concepts or to fill in a set of slots in order to achieve a goal in a human-machine dialogue. In this paper, we consider two SLU tasks: named entity recognition (NER) and semantic slot filling (SF). In the NER task, the purpose is to recognize information units such as names (including person, organization and location names), dates, events and others. In the SF task, the extraction of wider semantic information is targeted. In recent years, NER and SF were addressed as word labelling problems, using the classical BIO (begin/inside/outside) notation. For instance, "I would like to book three double rooms in Paris for tomorrow" will be represented for the NER and SF tasks as the following BIO-labelled sentences:

  • NER: ”I:: would:: like:: to:: book:: three::B-amount double:: rooms:: in:: Paris::B-location/city for:: tomorrow::B-time/date”.

  • SF: ”I::B-command would::I-command like::I-command to::I-command book::I-command three::B-room/number double::B-room/type rooms::I-room/type in:: Paris::B-location/city for:: tomorrow::B-time/date”.

In this paper, similarly to [10], the BIO representation is abandoned in favor of a chunking approach. For instance, for NER, the same sentence will be presented as:

  • NER: "I would like to book <amount three > double rooms in <location/city Paris > for <time/date tomorrow >".

In this study, we train an end-to-end neural model to reproduce such a textual representation from speech. Since our neural model emits characters, we use a specific character for each opening tag (one per named entity category or one per semantic concept), while a single shared symbol represents the closing tag.
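The conversion from BIO labels to the chunked form can be sketched as follows. This is a minimal illustration rather than the authors' data preparation code: the function name `bio_to_chunks`, the use of `O` for outside tokens, and the rendering of an opening tag as `<tag` with `>` as the shared closing symbol are all assumptions made for the example.

```python
def bio_to_chunks(tokens, labels, close=">"):
    """Convert BIO-labelled tokens into a bracketed chunk string.

    Assumed conventions (illustrative only): "B-x" opens a chunk of type x,
    "I-x" continues it, any other label (e.g. "O") is outside a chunk.
    """
    out = []
    open_tag = None  # type of the chunk currently open, if any
    for token, label in zip(tokens, labels):
        starts_chunk = label.startswith("B-")
        continues_chunk = label.startswith("I-") and open_tag == label[2:]
        if open_tag is not None and not continues_chunk:
            out.append(close)  # close before a new chunk or an outside token
            open_tag = None
        if starts_chunk:
            open_tag = label[2:]
            out.append("<" + open_tag)
        out.append(token)
    if open_tag is not None:
        out.append(close)  # close a chunk that runs to the end of the sentence
    return " ".join(out)


tokens = "I would like to book three double rooms in Paris for tomorrow".split()
ner_labels = ["O", "O", "O", "O", "O", "B-amount", "O", "O", "O",
              "B-location/city", "O", "B-time/date"]
chunked = bio_to_chunks(tokens, ner_labels)
```

In the model itself each opening tag and the closing symbol would be a single output character; the multi-character tags here are only for readability.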

3 Model training

End-to-end training of SLU models is realized through a recurrent neural network (RNN) architecture and the CTC loss function [12], as shown in Figure 1. A spectrogram of power-normalized audio clips, computed on 20 ms windows, is used as the input features for the system. It is followed by two 2D-invariant (in the time-and-frequency domain) convolutional layers, and then by five BLSTM layers with sequence-wise batch normalization. A fully connected layer is applied after the BLSTM layers, and the output layer of the neural network is a softmax layer. The model is trained using the CTC loss function. The neural architecture is similar to Deep Speech 2 [18] for ASR.

The outputs of the network depend on the task. For ASR, the outputs consist of graphemes of a corresponding language, a space symbol to denote word boundaries and a blank symbol. For NER, in addition to ASR outputs, we add outputs corresponding to named entity types and a closing symbol for named entities. In the same way, for SF, we use all ASR outputs and additional tags corresponding to semantic concepts and a closing symbol for semantic tags.

In order to improve model training, we investigate speaker adaptive training (SAT), pretraining and transfer learning. First, we formalize the ⋆-mode, which proved its effectiveness in all our previous and current experiments.

3.1 CTC loss function interpretation related to ⋆-mode

The CTC loss function [12] is well suited to training ASR models without Hidden Markov Models. The ⋆-mode can be seen as a minor modification of the CTC loss function.

3.1.1 CTC loss function definition

By means of a many-to-one mapping function $\mathcal{B}$, CTC transforms a sequence of the network outputs, emitted for each acoustic frame, into a sequence of final target labels by merging repeated output labels and removing the blank (no label) symbol. The CTC loss function is defined as:

$$L_{CTC} = -\sum_{(x, y) \in D} \ln P(y \mid x) \qquad (1)$$

where $x$ is a sequence of acoustic observations, $y$ is the target output label sequence, and $D$ the training dataset. $P(y \mid x)$ is defined as:

$$P(y \mid x) = \sum_{\pi \in \mathcal{B}^{-1}(y)} P(\pi \mid x) \qquad (2)$$

where $\pi$ is a sequence of initial output labels emitted by the model for each input frame. To compute $P(\pi \mid x)$ we use the probability of the output label $\pi_t$ emitted by the neural model for frame $t$ to build this sequence. This probability is modeled by the value $o^{t}_{\pi_t}$ given by the output node of the neural model related to the label $\pi_t$. $P(\pi \mid x)$ is defined as $\prod_{t=1}^{T} o^{t}_{\pi_t}$, where $T$ denotes the number of frames.
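The definitions in this subsection can be checked on a toy example. The sketch below is illustrative only: it enumerates every frame-level path by brute force, which is feasible only for a handful of frames (practical CTC implementations use a forward-backward recursion instead). The names `collapse`, `path_probability`, `sequence_probability`, the blank character `-`, and the toy probabilities are assumptions made for the example.

```python
import math
from itertools import groupby, product

BLANK = "-"  # illustrative blank (no label) symbol


def collapse(path):
    """Many-to-one CTC mapping: merge repeated labels, then drop blanks."""
    merged = [label for label, _ in groupby(path)]
    return tuple(label for label in merged if label != BLANK)


def path_probability(path, frame_probs):
    """Product over frames of the network output for the label chosen at each frame."""
    p = 1.0
    for t, label in enumerate(path):
        p *= frame_probs[t][label]
    return p


def sequence_probability(target, frame_probs):
    """Sum of path probabilities over all paths that collapse to the target."""
    labels = sorted(frame_probs[0])
    total = 0.0
    for path in product(labels, repeat=len(frame_probs)):
        if collapse(path) == tuple(target):
            total += path_probability(path, frame_probs)
    return total


# Toy softmax outputs for two frames over the labels {"a", blank}.
frame_probs = [{"a": 0.6, BLANK: 0.4}, {"a": 0.5, BLANK: 0.5}]
p_a = sequence_probability("a", frame_probs)  # paths: aa, a-, -a
loss = -math.log(p_a)  # contribution of this single utterance to the loss
```

Here three of the four possible paths collapse to the one-label target, so its probability is 0.3 + 0.3 + 0.2.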

3.1.2 CTC loss function and ⋆-mode

In the framework of the ⋆-mode, we introduce a new symbol, "⋆", that represents the presence of a label (the opposite of the blank symbol) that does not need to be disambiguated. We expect to build a model that is more discriminant on the important task-specific labels. For example, for the SF task, the important labels are the ones corresponding to semantic concept opening and closing tags, and the characters involved in the word sequences that support the value of these semantic concepts (i.e. characters occurring between an opening and a closing concept tag). In the CTC loss function framework, the ⋆-mode consists in applying another kind of mapping function before $\mathcal{B}$. While $\mathcal{B}$ converts a sequence $\pi$ of initial output labels into the final sequence $y$ to be retrieved, we introduce the mapping function $\mathcal{S}$ that is applied to each final target output label. Let $C$ be the set of elements $y_i$ included in subsequences $y_a \dots y_b$ of $y$ such that $y_a$ is an opening concept tag and $y_b$ the associated closing tag; $i$, $a$ and $b$ are indexes that handle positions in sequence $y$, and $a \le i \le b$. Let $V$ be the vocabulary of all the symbols present in sequences $y$ in the training dataset $D$, and let us consider the new symbol $\star \notin V$. Let us define $V_{\star} = V \cup \{\star\}$, and $V^{*}$ (resp. $V_{\star}^{*}$) the set of all the label sequences that can be generated from $V$ (resp. $V_{\star}$).

Considering $n$ as the number of elements in $y$, and $i$ an integer such that $1 \le i \le n$, we define the mapping function $\mathcal{S}: V^{*} \to V_{\star}^{*}$ in two steps:

$$z_i = \begin{cases} y_i & \text{if } y_i \in C \\ \star & \text{otherwise} \end{cases} \qquad \mathcal{S}(y) = z_1 \dots z_n \text{ with each run of consecutive } \star \text{ merged into one} \qquad (3)$$

By applying $\mathcal{S}$ to the last example sentence used in Section 2 for NER, this sentence is transformed to:

  • sentence: "I would like to book <amount three > double rooms in <location/city Paris > for <time/date tomorrow >".

  • $\mathcal{S}$(sentence): "⋆ <amount three > ⋆ <location/city Paris > ⋆ <time/date tomorrow >".
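One possible reading of this target transformation is sketched below. It is an interpretation of the prose, not the authors' implementation: labels outside any concept span are replaced by a single extra symbol and consecutive occurrences of that symbol are merged. For readability the sketch works on whitespace-separated tokens rather than characters, and the symbol `*`, the `<tag` opening form, the `>` closing symbol, and the name `star_map` are assumptions.

```python
STAR = "*"  # placeholder for the extra symbol; the concrete character is an assumption


def star_map(tokens, close=">"):
    """Keep tokens inside concept spans; merge everything else into single STAR symbols."""
    out = []
    inside = False  # True while between an opening tag and its closing symbol
    for tok in tokens:
        if tok.startswith("<"):
            inside = True
        if inside:
            out.append(tok)
        elif not out or out[-1] != STAR:
            out.append(STAR)  # first outside token after a span (or at the start)
        if tok == close:
            inside = False
    return out


sentence = ("I would like to book <amount three > double rooms in "
            "<location/city Paris > for <time/date tomorrow >").split()
mapped = " ".join(star_map(sentence))
```

With this mapping the model no longer has to spell out words that carry no concept value, only to signal that some label is present there.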

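To introduce the ⋆-mode in the CTC loss function definition, we modify the formulation of $P(y \mid x)$ in formula (2) by introducing the mapping function $\mathcal{S}$ applied to $y$:

$$P(y \mid x) = \sum_{\pi \in \mathcal{B}^{-1}(\mathcal{S}(y))} P(\pi \mid x) \qquad (4)$$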

3.2 Speaker adaptive training

Adaptation is an efficient way to reduce the mismatches between the models and the data from a particular speaker or channel. For many years, acoustic model adaptation has been a key component of any state-of-the-art ASR system. For end-to-end approaches, speaker adaptation is less studied, and most of the first end-to-end ASR systems do not use any speaker adaptation and are built on spectrograms [18] or filterbank features [1]. However, some recent works [6, 28] have demonstrated the effectiveness of speaker adaptation for end-to-end models.

For SLU tasks, there is also an emerging interest in the end-to-end models which have a speech signal as input. Thus, acoustic, and particularly speaker, adaptation for such models can play an important role in improving the overall performance of these systems. However, to our knowledge, there is no research on speaker adaptation for end-to-end SLU models, and the existing works do not use any speaker adaptation.

One way to improve SLU models which we investigate in this paper is speaker adaptation. We apply i-vector based speaker adaptation [23]. The proposed way of integrating i-vectors into the end-to-end model architecture is shown in Figure 1. Speaker i-vectors are appended to the outputs of the last (second) convolutional layer, just before the first recurrent (BLSTM) layer. In this paper, for better initialization, we first train a model with zero pseudo i-vectors (all values are equal to 0). Then, we use this pretrained model and fine-tune it on the same data but with the real i-vectors. This approach was inspired by [5], where the idea of using zero auxiliary features during pretraining was implemented for language models. In our preliminary experiments, it demonstrated better results than direct model training with i-vectors [27].
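The feature concatenation and the zero-vector pretraining stage can be illustrated with a small sketch. It is a simplified picture under stated assumptions: the convolutional outputs are represented as one flat feature list per time frame, the same speaker vector is appended to every frame, and the names `append_ivector` and `zero_ivector` are hypothetical. The 100-dimensional size follows Section 4.2.

```python
def append_ivector(conv_frames, ivector):
    """Concatenate the same speaker vector to the feature vector of every frame."""
    return [list(frame) + list(ivector) for frame in conv_frames]


def zero_ivector(dim=100):
    """Pseudo i-vector used for the first training stage (all values equal to 0)."""
    return [0.0] * dim


# Toy convolutional outputs: 2 time frames with 3 features each.
conv_frames = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

# Stage 1: train with zero pseudo i-vectors, so the input size is already final.
pretrain_input = append_ivector(conv_frames, zero_ivector())

# Stage 2: fine-tune the same model with real speaker i-vectors (toy values here).
real_ivector = [0.05] * 100
finetune_input = append_ivector(conv_frames, real_ivector)
```

Because both stages feed vectors of identical size to the first recurrent layer, the weights learned in the first stage can be reused unchanged when the real i-vectors are introduced.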

Figure 1: Universal end-to-end deep neural network model architecture for ASR, NER and SF tasks. Depending on the task, the set of the output characters consists of: (1) ASR: graphemes for a given language; (2) NER: graphemes and named entity tags; (3) SF: graphemes and semantic SF tags.

3.3 Transfer learning

Transfer learning is a popular and efficient method to improve the learning performance of the target predictive function using knowledge from a different source domain [19]. It allows a model to be trained for a given target task using available out-of-domain source data, and hence avoids an expensive data labeling process, which is especially useful in low-resource scenarios.

In this paper, for SF, we investigate the effectiveness of transfer learning for various source domains and tasks: (1) ASR in the target and out-of-domain languages; (2) NER in the target language; (3) SF. For all the tasks, we used similar model architectures (Section 4.2 and Figure 1). The difference is in the text data preparation and output targets. For training ASR systems, the output targets correspond to alphabetic characters and a blank symbol. For NER tasks, the output targets include all the ASR targets and targets corresponding to named entity tags. We have several symbols corresponding to named entities (in the text, these characters are placed before the beginning of a named entity, which can be a single word or a sequence of several words) and one tag corresponding to the end of the named entity, which is the same for all named entities. Similarly, for SF, we use targets corresponding to the semantic concept tags and one tag corresponding to the end of a concept.

Transfer learning is realized through a chain of consecutive model training steps on different tasks. For example, we can start by training an ASR model on audio data and the corresponding text transcriptions. Then, we change the softmax layer in this model by replacing the targets with the SF targets and continue training on the corpus annotated with semantic tags. Further in the paper, we denote this type of chain as ASR→SF. Models in this chain can be trained on different corpora, which can make this method useful in low-resource scenarios where we do not have enough semantically annotated data to train an end-to-end model, but have a sufficient amount of data annotated with more general concepts or only transcribed data. For NER, we also investigate knowledge transfer from ASR.
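The step of moving from one task in the chain to the next can be sketched with a toy parameter dictionary. This is only a schematic: the layer names, the tiny dimensions, and the functions `init_layer` and `next_task` are hypothetical, the training loops are omitted, and the sketch assumes the whole output layer is freshly initialized, which the text does not specify. The output sizes 43 and 130 are the ones listed in Section 4.3.

```python
import random


def init_layer(n_outputs, hidden_dim, rng):
    """Small random weights for a freshly created layer (illustrative initialization)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
            for _ in range(n_outputs)]


def next_task(model, n_outputs, seed=0):
    """Reuse every pretrained layer; replace only the output (softmax) layer."""
    hidden_dim = len(model["output"][0])
    reused = {name: w for name, w in model.items() if name != "output"}
    reused["output"] = init_layer(n_outputs, hidden_dim, random.Random(seed))
    return reused


rng = random.Random(0)
# Toy "ASR" model: one shared encoder layer and a 43-way output layer.
asr = {"encoder": init_layer(4, 3, rng), "output": init_layer(43, 4, rng)}
# ... train `asr` on transcribed speech (omitted) ...
sf = next_task(asr, n_outputs=130)
# ... continue training `sf` on the semantically annotated corpus (omitted) ...
```

The encoder weights are carried over by reference, so everything learned on the source task is the starting point for the target task; only the task-specific output layer starts from scratch.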

Task        Corpora                           Size, h   #Speakers
NER train   EPAC, ESTER 1,2, ETAPE, REPERE    323.8     7327
NER dev     ETAPE (dev)                       6.6       152
NER test    ETAPE (test), Quaero (test)       12.3      474
SF train    1. MEDIA (train)                  16.1      727
            2. PORTMEDIA (train)              7.2       257
SF dev      MEDIA (dev)                       1.7       79
SF test     MEDIA (test)                      4.8       208
Table 1: Corpus statistics for ASR, NER and SF tasks.

4 Experiments

4.1 Data

Several publicly available corpora have been used for experiments (see Table 1).

4.1.1 ASR data

The corpus for ASR training was composed of corpora from various evaluation campaigns in the field of automatic speech processing for French. The EPAC [8], ESTER 1,2 [9], ETAPE [13], and REPERE [11] corpora contain transcribed speech in French from TV and radio broadcasts. These data were originally recorded in the microphone channel and, for the experiments in this paper, were downsampled from 16 kHz to 8 kHz, since the test set for our main target task (SF) consists of telephone conversations. The DECODA [2] corpus is composed of dialogues from the call-center of the Paris transport authority. The MEDIA [7, 3] and PORTMEDIA [16] corpora contain dialogues simulating a vocal tourist information server. The target language in all experiments is French. For experiments with transfer learning from ASR built in a different source language to SF in the target language, we used the TED-LIUM corpus [22]. This publicly available dataset contains 1495 TED talks in English that amount to 207 hours of speech data from 1242 speakers.

4.1.2 NER data

To train the NER system, we used the following corpora: EPAC, ESTER 1,2, ETAPE, and REPERE. These corpora contain speech with text transcriptions and named entity annotation. The named entity annotation is performed following the methodology of the Quaero project [14]. The taxonomy is composed of 8 main types: person, function, organization, location, product, amount, time, and event. Each named entity can be a single word or a sequence of several words. The total amount of annotated data is 112 hours. Based on this data, a classical NER system was trained using NeuroNLP2 to automatically extract named entities for the remaining 212 hours of the training corpus. This was done in order to increase the amount of training data for NER. Thus, the total amount of audio data to train the NER system is about 324 (112+212) hours. The development part of the ETAPE corpus was used for development, and as a test set we used the ETAPE test and Quaero test datasets.

4.1.3 SF data

The following two French corpora, dedicated to semantic extraction from speech in a context of human/machine dialogues, were used in the current experiments: MEDIA and PORTMEDIA. The corpora have manual transcription and conceptual annotation [29, 7]. The MEDIA corpus is related to the hotel booking domain, and its annotation contains semantic tags: room number, hotel name, location, date, etc. The PORTMEDIA corpus is related to the theater ticket reservation domain and its annotation contains semantic tags which are very similar to the tags used in the MEDIA corpus. For joint training on these corpora, we used a combined set of semantic tags.

4.2 Models

We used the


implementation for training speaker independent (SI) models, and our modification of this implementation to integrate speaker adaptation. The open-source Kaldi toolkit [20] was used to extract 100-dimensional speaker i-vectors. All models had a similar topology (except for the number of outputs), shown in Figure 1 for SAT models. SI models were trained in the same way, but without i-vector integration. Input features are spectrograms. They are followed by two 2D-invariant (in the time-and-frequency domain) convolutional layers (with parameters: kernel size=(41, 11), stride=(2, 2), padding=(20, 5)), and then by five 800-dimensional BLSTM layers with sequence-wise batch normalization. A fully connected layer is applied after the BLSTM layers, and the output layer of the neural network is a softmax layer. The size of the output layer depends on the task (see Section 4.3). The model is trained using the CTC loss function.
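To see what the stated convolution parameters do to the input size, the standard output-size formula floor((n + 2p - k) / s) + 1 can be applied per axis. The sketch below is a worked illustration under assumptions that are not stated in the text: the first entry of each parameter pair is taken to be the frequency axis, and the input is taken to have 81 frequency bins (a 20 ms window at 8 kHz is 160 samples, giving 160/2 + 1 one-sided spectrum bins) and 100 time frames.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution along one axis (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


# Parameters quoted in the text: kernel (41, 11), stride (2, 2), padding (20, 5).
# Assumed axis order: (frequency, time).
freq_bins, time_frames = 81, 100  # assumed input size, for illustration only

freq_after = conv_out(freq_bins, kernel=41, stride=2, padding=20)
time_after = conv_out(time_frames, kernel=11, stride=2, padding=5)
```

With these assumed inputs, both axes are roughly halved by a layer with these parameters, which shortens the sequence seen by the recurrent layers.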

4.3 Tasks

Our target tasks are NER and SF. For each of these tasks, other tasks can be used for knowledge transfer. To train NER, we use ASR for transfer learning. To train SF, we use ASR on French and English, NER and another auxiliary SF task for transfer learning. Hence, we consider the following set of tasks:

  • $ASR_F$ – French ASR with 43 outputs {French characters, blank symbol}.

  • $ASR_E$ – English ASR with 28 outputs {English characters, blank symbol}.

  • $NER$ – French NER with 52 outputs {43 outputs from $ASR_F$, 8 outputs corresponding to named entity tags, 1 output corresponding to the closing tag for all named entities}.

  • $SF_1$ – target SF task with 130 outputs {43 outputs from $ASR_F$, 86 outputs for semantic slot tags, 1 output for the closing tag}; trained on the training part of the MEDIA corpus.

  • $SF_{1+2}$ – auxiliary SF task; trained on the MEDIA plus PORTMEDIA training corpora.

  • $NER^{\star}$, $SF_1^{\star}$ – for the target tasks $NER$ and $SF_1$, we also considered ⋆-mode (Section 3.1).
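The output layer sizes listed above follow one simple rule: the French ASR outputs, plus one output per opening tag, plus one shared closing tag. A short check, using the eight named entity types from Section 4.1.2 (the function name `output_layer_size` is hypothetical):

```python
NE_TYPES = ["person", "function", "organization", "location",
            "product", "amount", "time", "event"]


def output_layer_size(n_asr_outputs, n_opening_tags):
    """ASR outputs + one output per opening tag + one shared closing tag."""
    return n_asr_outputs + n_opening_tags + 1


ner_size = output_layer_size(43, len(NE_TYPES))  # French NER
sf_size = output_layer_size(43, 86)              # target SF task
```

Both values agree with the 52 and 130 outputs given in the list.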

4.4 Results for NER

Performance of NER was evaluated in terms of precision, recall, and F-measure. Results for different training chains for speaker-independent (SI) and speaker adaptive training (SAT) models are given in Table 2. We can see that pretraining with the ASR task does not lead to a significant improvement in performance (second row of Table 2). With the extended training chain (third row), all the evaluation measures improve; in particular, the F-measure increases by 1.9% absolute. For each training chain, we trained a corresponding chain with speaker adaptation. Results for SAT models are given in the right part of Table 2. For all training chains, SAT models outperform SI models in terms of F-measure. The best result with SAT (F-measure 71.8%) outperforms the best SI result by 1.1% absolute.

Model training   SI                     SAT
                 P      R      F        P      R      F
                 78.9   60.7   68.6     80.9   60.9   69.5
                 80.5   60.0   68.8     80.2   61.7   69.7
                 82.1   62.1   70.7     83.1   63.2   71.8
Table 2: NER results on the test dataset in terms of Precision (P, %), Recall (R, %) and F-measure (F, %) for SI and SAT models.
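The F-measure column of Table 2 can be cross-checked from the precision and recall columns, assuming the usual balanced F-measure F = 2PR / (P + R) (the text does not state the weighting explicitly). Since the published P and R values are themselves rounded, agreement is only expected to about 0.1.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F) for each row of Table 2: SI rows first, then SAT rows.
table2 = [
    (78.9, 60.7, 68.6), (80.5, 60.0, 68.8), (82.1, 62.1, 70.7),
    (80.9, 60.9, 69.5), (80.2, 61.7, 69.7), (83.1, 63.2, 71.8),
]
recomputed = [round(f_measure(p, r), 1) for p, r, _ in table2]
```

All six recomputed values match the reported F-measures to within the rounding of the table.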

4.5 Results for SF

SF performance was evaluated in terms of F-measure, concept error rate (CER) and concept value error rate (CVER). Training performance on the MEDIA development dataset in terms of character error rate (CER) is shown in Figure 2 for different transfer learning chains for SI and SAT models. The blue curves correspond to the SI baseline model, which was trained directly on the target SF task without pretraining. All curves of other colours correspond to different sequential transfer learning chains. All considered transfer learning schemes substantially improve the training performance. By comparing and , we can conclude that training on the auxiliary task improves the performance. When we further trained this model on the target task (), the performance continued to improve. This demonstrates that, in the given conditions, sequential transfer learning provides a larger improvement than joint training alone. The best SI model is obtained through the following training chain: . These results are confirmed further in Table 3. Also, we can see that SAT gives an additional improvement in performance for all the models.
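The concept error rate named at the start of this subsection is commonly computed like a word error rate, but over sequences of concept tags: the minimum number of substitutions, deletions and insertions needed to turn the hypothesis into the reference, divided by the number of reference concepts. The sketch below assumes this standard edit-distance definition; the scoring details used for the reported numbers are not given in the text, and the function name is hypothetical.

```python
def concept_error_rate(reference, hypothesis):
    """Edit distance between concept sequences, as a percentage of reference length."""
    n, m = len(reference), len(hypothesis)
    if n == 0:
        raise ValueError("reference must contain at least one concept")
    # dist[i][j]: edit distance between reference[:i] and hypothesis[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i  # i deletions
    for j in range(m + 1):
        dist[0][j] = j  # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = dist[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return 100.0 * dist[n][m] / n


# Concept tags of the example sentence from Section 2; the hypothesis misses one.
reference = ["command", "room/number", "room/type", "location/city", "time/date"]
hypothesis = ["command", "room/number", "location/city", "time/date"]
cer = concept_error_rate(reference, hypothesis)
```

The concept value error rate follows the same scheme but compares (concept, value) pairs, so a correct tag with a wrong value also counts as an error.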

Figure 2: Training performance on the MEDIA development dataset in terms of character error rate (CER) for training SF models. For each type of model chain, a solid line corresponds to an SI model, and a dashed line of the same colour denotes the SAT version of the given model.
Model training   SI                          Model training   SAT
#                F      CER    CVER          #                F             CER           CVER
1                72.5   39.4   52.7
2                73.2   39.0   50.1
3                77.4   33.9   44.9
4                81.3   28.4   37.3
5                85.9   21.7   28.4          9                87.5          19.4          25.4
6                86.4   20.9   27.5          10               87.3          19.5          26.0
7                85.9   21.2   27.9          11               87.7 (89.2)   18.8 (16.5)   25.5 (20.8)
8                87.1   19.5   27.0          12               87.6 (89.2)   18.6 (16.2)   24.6 (20.8)
Table 3: SF performance results on the MEDIA test dataset for end-to-end SF models trained with different transfer learning approaches. Results are given in terms of F-measure (F), CER and CVER metrics (%); $SF_1$ – target task; $SF_{1+2}$ – auxiliary task; F and E refer to the languages. For the best models, the results in brackets correspond to decoding using beam search with a LM.
Systems in literature:     CER     Systems in this paper:    CER
Pipeline: ASR+SLU, [25]    19.9    greedy mode               18.6
End-to-end, [10]           27.0    beam search with LM       16.2
Table 4: SF performance results on the MEDIA test dataset for different systems.

Results for different training chains for speaker-independent (SI) models on the test set are given in Table 3 (#1–8). The first line shows the baseline result on the MEDIA test dataset for the SF task, when a model was trained directly on the target task using in-domain data for this task (the training part of the MEDIA corpus). The second line corresponds to the case when the model was trained on the auxiliary SF task. Other lines in the table correspond to different training chains described in Section 3.3. In #4, we can see a chain that starts from training an ASR model for English. We can observe that using a pretrained ASR model from a different language can significantly improve the performance of the SF model (about 16% of relative CER reduction, #4 vs #3). This result is notable since it shows that we can benefit from linguistic resources in another language when data for the target one is lacking. Using an ASR model trained in French (#5) provides a larger improvement: about 36% of relative CER reduction (#5 vs #3). When we start the training process from a NER model (#6), we can observe slightly better results. Further, for the two best model training chains (#5 and #6), we trained corresponding models in ⋆-mode (#7 and #8). Results with speaker adaptation for the four best models are shown in the right part of Table 3 (#9–12). We can see that SAT models show better results than SI ones. For CVER, we can observe a similar tendency. The results for the best models using beam search and a 4-gram LM are shown in brackets. The LM was built on the texts including "⋆". Finally, Table 4 summarizes our best results (in greedy and beam search modes) and compares them with results on the MEDIA dataset from other works [25, 10]. We can see that our results significantly outperform the results reported in the literature for the current task.
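The relative CER reductions discussed above can be recomputed directly from the CER column of Table 3. A minimal check (the function name is hypothetical; the CER values are those of rows #3, #4 and #5):

```python
def relative_reduction(baseline, improved):
    """Relative reduction of an error rate, in percent of the baseline value."""
    return 100.0 * (baseline - improved) / baseline


cer_row3, cer_row4, cer_row5 = 33.9, 28.4, 21.7  # CER (%) from Table 3

english_asr_gain = relative_reduction(cer_row3, cer_row4)  # #4 vs #3
french_asr_gain = relative_reduction(cer_row3, cer_row5)   # #5 vs #3
```

These evaluate to roughly 16% and 36%, the figures quoted in the conclusions.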

Figure 3: Concept error rate (CER,%) results on the MEDIA test dataset for different concepts depending on the number of corresponding concepts in the training corpus. The CER results are given for the SAT model (#12), decoding with beam search and a 4-gram LM.

4.5.1 Error analysis

In the training corpus, different semantic concepts have different numbers of samples, which may impact the SF performance. Figure 3 demonstrates the relation between the concept error rate (CER) of a particular semantic concept and its frequency in the training corpus. Each point in Figure 3 corresponds to a particular semantic concept. For rare tags, the distribution of errors has a larger variance and mean than for more frequent tags.

Figure 4: Confusion matrix for concepts on the MEDIA test dataset. The last row and last column represent insertion and deletion errors, respectively. The CER results are given for the SAT model (#12), decoding with beam search and a 4-gram LM.

In addition, we are interested in the distribution of different types of SF errors (deletions, insertions and substitutions), which is shown in the form of a confusion matrix in Figure 4. For better representation, we first ordered the concepts in descending order by the total number of errors. Then, we chose the first 36 concepts, which have the largest number of errors. The total number of errors for the chosen 36 concepts corresponds to 90% of all the errors for all concepts in the MEDIA test dataset. The diagonal corresponds to the correctly detected concepts and other elements (except for the last row and last column) correspond to the substitution errors. The final row represents insertion errors and the final column represents deletions. Each element in the matrix shows the total number of the corresponding events (correctly recognized concept, substitution, deletion or insertion) normalized by the total number of such events in the row. The most frequent errors are deletions (50% of all errors), then substitutions (32.3%) and insertions (17.7%).
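The row normalization and the split of errors by type can be sketched as follows. The raw error counts behind Figure 4 are not given in the text, so the counts used here are hypothetical, chosen only so that the resulting shares reproduce the reported percentages; the function names are also illustrative.

```python
def normalize_rows(matrix):
    """Divide every cell by the sum of its row; all-zero rows are left as zeros."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([cell / total if total else 0.0 for cell in row])
    return normalized


def error_shares(n_deletions, n_substitutions, n_insertions):
    """Share of each error type in percent of all errors."""
    total = n_deletions + n_substitutions + n_insertions
    return tuple(100.0 * n / total
                 for n in (n_deletions, n_substitutions, n_insertions))


# Hypothetical counts (not from the paper) matching the reported proportions.
shares = error_shares(n_deletions=500, n_substitutions=323, n_insertions=177)

# Toy confusion counts: rows are reference concepts, last column is deletions.
toy_counts = [[8, 1, 1], [2, 6, 2]]
toy_normalized = normalize_rows(toy_counts)
```

After normalization each row sums to one, so a diagonal cell reads as the fraction of that row's events that were correctly recognized.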

5 Conclusions

In this paper, we have investigated several ways to improve the performance of end-to-end SLU systems. We demonstrated the effectiveness of speaker adaptive training and various transfer learning approaches for two end-to-end SLU tasks: NER and SF. In order to improve the quality of the SF models, we proposed to use, during training, knowledge transfer from an ASR system in another language and from a NER model in the target language. Experiments on the French MEDIA test corpus demonstrated that using knowledge transfer from the ASR in English improves the SF model performance by about 16% of relative CER reduction for SI models.

The improvement from the transfer learning is greater when the ASR model is trained on the target language (36% of relative CER reduction) or when the NER model in the target language is used for pretraining. Another contribution concerns SAT training for SLU models – we demonstrated that this can significantly improve the model performance for NER and SF.

6 Acknowledgements

This work was supported by the French ANR Agency through the ON-TRAC project, under the contract number ANR-18-CE23-0021-01, and by the RFI Atlanstic2020 RAPACE project.


  • [1] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio (2016) End-to-end attention-based large vocabulary speech recognition. In ICASSP, pp. 4945–4949. Cited by: §3.2.
  • [2] F. Bechet, B. Maza, N. Bigouroux, T. Bazillon, M. El-Beze, et al. (2012) DECODA: a call-centre human-human spoken conversation corpus. In LREC, pp. 1343–1347. Cited by: §4.1.1.
  • [3] H. Bonneau-Maynard, C. Ayache, F. Bechet, et al. (2006) Results of the French Evalda-Media evaluation campaign for literal understanding. In LREC, Cited by: §4.1.1.
  • [4] Y. Chen, R. Price, and S. Bangalore (2018) Spoken language understanding without speech recognition. In ICASSP, Cited by: §1.
  • [5] S. Deena et al. (2017) Semi-supervised adaptation of RNNLMs by fine-tuning with domain-specific auxiliary features. In INTERSPEECH, pp. 2715–2719. Cited by: §3.2.
  • [6] M. Delcroix, S. Watanabe, A. Ogawa, S. Karita, and T. Nakatani (2018) Auxiliary feature based adaptation of end-to-end asr systems. In INTERSPEECH, pp. 2444–2448. Cited by: §3.2.
  • [7] L. Devillers et al. (2004) The french MEDIA/EVALDA project: the evaluation of the understanding capability of spoken language dialogue systems. In LREC, Cited by: §4.1.1, §4.1.3.
  • [8] Y. Estève, T. Bazillon, J. Antoine, F. Béchet, and J. Farinas (2010) The EPAC corpus: manual and automatic annotations of conversational speech in french broadcast news. In LREC, Cited by: §4.1.1.
  • [9] S. Galliano et al. (2009) The ESTER 2 evaluation campaign for the rich transcription of french radio broadcasts. In Interspeech, Cited by: §4.1.1.
  • [10] S. Ghannay, A. Caubrière, Y. Estève, et al. (2018) End-to-end named entity and semantic concept extraction from speech. In SLT, pp. 692–699. Cited by: §1, §2, §4.5, Table 4.
  • [11] A. Giraudel, M. Carré, V. Mapelli, J. Kahn, O. Galibert, and L. Quintard (2012) The REPERE corpus: a multimodal corpus for person recognition. In LREC, pp. 1102–1107. Cited by: §4.1.1.
  • [12] A. Graves et al. (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning, pp. 369–376. Cited by: §1, §3.1, §3.
  • [13] G. Gravier, G. Adda, N. Paulson, et al. (2012) The ETAPE corpus for the evaluation of speech-based TV content processing in the french language. In LREC, Cited by: §4.1.1.
  • [14] C. Grouin, S. Rosset, P. Zweigenbaum, K. Fort, O. Galibert, and L. Quintard (2011) Proposal for an extension of traditional named entities: from guidelines to evaluation, an overview. In Proceedings of the 5th Linguistic Annotation Workshop, pp. 92–100. Cited by: §4.1.2.
  • [15] P. Haghani et al. (2018) From audio to semantics: approaches to end-to-end spoken language understanding. arXiv preprint arXiv:1809.09190. Cited by: §1.
  • [16] F. Lefèvre et al. (2012) Robustness and portability of spoken language understanding systems among languages and domains: the PortMedia project [in French]. In The International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey. ELRA, pp. 779–786. Cited by: §4.1.1.
  • [17] L. Lugosch, M. Ravanelli, P. Ignoto, V. S. Tomar, and Y. Bengio (2019) Speech model pre-training for end-to-end spoken language understanding. arXiv preprint arXiv:1904.03670. Cited by: §1.
  • [18] D. Amodei et al. (2016) Deep speech 2: end-to-end speech recognition in English and Mandarin. In International conference on machine learning, pp. 173–182. Cited by: §3.2, §3.
  • [19] S. J. Pan and Q. Yang (2010) A survey on transfer learning. IEEE Transactions on knowledge and data engineering, pp. 1345–1359. Cited by: §3.3.
  • [20] D. Povey, A. Ghoshal, et al. (2011) The Kaldi speech recognition toolkit. In ASRU, Cited by: §4.2.
  • [21] Y. Qian, R. Ubale, et al. (2017) Exploring ASR-free end-to-end modeling to improve spoken language understanding in a cloud-based dialog system. In ASRU, pp. 569–576. Cited by: §1.
  • [22] A. Rousseau, P. Deléglise, and Y. Esteve (2014) Enhancing the TED-LIUM corpus with selected data for language modeling and more ted talks. In LREC, pp. 3935–3939. Cited by: §4.1.1.
  • [23] G. Saon, H. Soltau, D. Nahamoo, and M. Picheny (2013) Speaker adaptation of neural network acoustic models using i-vectors. In ASRU, pp. 55–59. Cited by: §3.2.
  • [24] D. Serdyuk, Y. Wang, C. Fuegen, A. Kumar, B. Liu, and Y. Bengio (2018) Towards end-to-end spoken language understanding. arXiv preprint arXiv:1802.08395. Cited by: §1.
  • [25] E. Simonnet et al. (2018) Simulating asr errors for training SLU systems. In LREC 2018, Cited by: §4.5, Table 4.
  • [26] I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112. Cited by: §1.
  • [27] N. Tomashenko, A. Caubrière, and Y. Estève (2019) Investigating adaptation and transfer learning for end-to-end spoken language understanding from speech. In Interspeech, Graz, Austria. Cited by: §3.2.
  • [28] N. Tomashenko and Y. Estève (2018) Evaluation of feature-space speaker adaptation for end-to-end acoustic models. In LREC, Cited by: §3.2.
  • [29] V. Vukotic, C. Raymond, and G. Gravier (2015) Is it time to switch to word embedding and recurrent neural networks for spoken language understanding? In Interspeech, Cited by: §4.1.3.