The task of a spoken language understanding (SLU) system is to detect fragments of semantic knowledge in speech data. Popular models are made of frames describing relations between entities and their properties [39, 36, 26]. The SLU system instantiates a predefined set of frame structures, called concepts, that can be mentioned in a sentence or a dialogue turn. Concept mentions express dialogue acts (DA), intents, domain knowledge, and frame properties often represented by slots, identified by entity names, and slot filler values identified by mention types. Concept mentions are difficult to characterize in terms of words or characters. They may be localized by head words or short word sequences called concept supports. For example, word spans can be hypothesized to be mentions of concepts, while an entire sentence can be considered for hypothesizing dialogue acts. Unfortunately, mentions may be ambiguous because their word spans may express more than one semantic constituent, be incomplete, or be affected by errors of an automatic speech recognition (ASR) system. These difficulties can be alleviated by considering certain head words, word spans, or a sentence as a seed for hypothesis generation and by using additional context to provide predictions that constrain instantiation decisions. An example of additional distant context used so far is a representation of dialogue history made of embeddings of the sentences preceding the sentence or dialogue turn to be interpreted [6, 17, 45, 34, 16, 41, 21, 23, 24].
A problem that has not yet been thoroughly investigated is the selection of what to embed and how. Some popular corpora used so far (e.g., ATIS) do not have explicit sentence history. In this case, the only context to pay attention to is the sentence to be interpreted. If some history information is available, then distant contexts for DA and concepts may be different. Specific contexts for DA have been proposed in [28, 30]. For concepts, the selection of distant contexts may depend on the complexity of the application semantic domain. For example, the French MEDIA corpus has concepts of reference, relative time, locations, prices, logical conjunction and disjunction that are expressed by short, semantically ambiguous words. These words are often difficult to recognize and require knowledge of a semantic context, called the state of the world, to reduce the perplexity. Furthermore, the problem of deciding the type of embedding is also relevant, as made evident in recently published papers [22, 27, 32, 43, 44, 42].
In this paper, we investigate the use of different types of dialog history representation, extracted with or without supervision, and their impact on the performance of an end-to-end signal-to-concept neural network.
Considering the concern expressed in  and prior knowledge, we propose to focus on types of history contents, starting with the previous system turn, which contains semantically unambiguous information. In fact, the sequence of words in the system turn is generated by a semantic model whose goal is to reach a commit state for performing a transaction. Furthermore, using the train set, it is possible to compute prediction probabilities of user-enunciated concepts, given the system-enunciated concepts. The most likely predicted concepts can thus be used for reducing interpretation uncertainty in the following user turn.
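The prediction probabilities mentioned above can be illustrated with a minimal sketch, assuming they are estimated by relative frequency over pairs of consecutive system and user turns in the train set. The function names and the toy concept labels (`ask-date`, `ask-location`, and so on) are hypothetical and are not taken from the MEDIA annotation.

```python
from collections import Counter, defaultdict


def concept_prediction_probs(turn_pairs):
    """Estimate P(user concept | system concept) by relative frequency.

    turn_pairs: iterable of (system_concepts, user_concepts), one pair per
    system turn followed by a user turn. This estimator is an assumption;
    the paper does not specify how the probabilities are computed.
    """
    system_counts = Counter()
    joint_counts = defaultdict(Counter)
    for system_concepts, user_concepts in turn_pairs:
        for s in set(system_concepts):
            system_counts[s] += 1
            for u in set(user_concepts):
                joint_counts[s][u] += 1
    return {
        s: {u: c / system_counts[s] for u, c in users.items()}
        for s, users in joint_counts.items()
    }


def most_likely(probs, system_concept, k=2):
    """Return the k user concepts with the highest estimated probability."""
    ranked = sorted(
        probs.get(system_concept, {}).items(),
        key=lambda item: (-item[1], item[0]),  # ties broken alphabetically
    )
    return [concept for concept, _ in ranked[:k]]


# Invented toy data: each pair is (system turn concepts, next user turn concepts).
toy_pairs = [
    ({"ask-date"}, {"date", "answer"}),
    ({"ask-date"}, {"date"}),
    ({"ask-location"}, {"location"}),
    ({"ask-date"}, {"answer"}),
]
probs = concept_prediction_probs(toy_pairs)
```

The most probable user concepts for a given system concept, for instance `most_likely(probs, "ask-date")`, could then serve as a prior on what to expect in the next user turn.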
The rest of the paper is organized as follows. Section 2 presents the architecture of an end-to-end signal-to-concept model and the proposed way of integrating dialog history embeddings (to which we refer as h-vectors) into this model. Section 3 introduces different ways to represent the dialog history. Section 4 describes the experimental setup and results. Finally, the conclusions are given in Section 5.
2 End-to-end signal-to-concept neural architecture
The input features are spectrograms; they are followed by 2D-invariant (in the time-and-frequency domain) convolutional layers, and then by BLSTM layers. A fully connected layer is applied after the BLSTM layers, and the output layer of the neural network is a softmax layer. The model is trained using the CTC loss function.
The outputs of the network consist of two subsets: (1) outputs that represent the words (graphemes of the corresponding language, a space symbol to denote word boundaries, and a blank symbol), and (2) outputs that represent semantic concept types and a closing symbol for semantic tags. We have several symbols corresponding to semantic concepts (in the text, these characters are placed before the beginning of a semantic concept, which can be a single word or a sequence of several words) and one tag corresponding to the end of the semantic concept, which is the same for all semantic concepts.
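As an illustration of this output representation, the following sketch builds a target string in which each concept span is preceded by a concept-specific symbol and followed by a shared closing symbol. The particular symbols (`#`, `$`, `>`), the concept names, the spacing around tags, and the helper `build_target` are assumptions made for the example only.

```python
def build_target(words, spans, tag_symbols, close_symbol=">"):
    """Insert concept tags into a word sequence.

    words: list of words in the transcription.
    spans: list of (start, end_exclusive, concept) word-index spans,
           assumed non-overlapping.
    tag_symbols: mapping from concept name to its opening symbol.
    """
    starts = {start: concept for start, _, concept in spans}
    ends = {end - 1 for _, end, _ in spans}
    tokens = []
    for i, word in enumerate(words):
        if i in starts:
            tokens.append(tag_symbols[starts[i]])  # concept-specific opening symbol
        tokens.append(word)
        if i in ends:
            tokens.append(close_symbol)  # one closing symbol shared by all concepts
    return " ".join(tokens)


# Hypothetical symbols and concept names, chosen only for this example.
tag_symbols = {"room-type": "#", "date": "$"}
words = ["une", "chambre", "double", "demain"]
spans = [(1, 3, "room-type"), (3, 4, "date")]

target = build_target(words, spans, tag_symbols)
# Output inventory: every character seen in the targets; a CTC blank would be added separately.
output_symbols = sorted(set(target))
```

Here `target` is the character sequence the network would be trained to emit, and `output_symbols` contains graphemes, the space, the two opening symbols and the closing symbol.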
In order to improve model performance, we integrate dialog history information in the form of h-vectors into the model, as shown in Figure 1. Each h-vector is calculated from the last dialog system response, as described further in Section 3.
H-vectors are appended to the outputs of the last (second) convolutional layer, just before the first recurrent (BLSTM) layer. In this paper, for better initialization, we first train a model using zero vectors of the same dimension (all values are equal to 0) instead of h-vectors. Then, we use this pretrained model and fine-tune it on the same data but with the real h-vectors. This approach was inspired by , where the idea of using zero auxiliary features during pretraining was implemented for language models, and by , where it was used for i-vectors. In our preliminary experiments, this type of pretraining demonstrated better results than direct model training with h-vectors; hence we use it in the experiments presented in this paper.
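The two-stage procedure can be sketched as follows, assuming that the same h-vector is concatenated to every frame of the convolutional output (the text only states that h-vectors are appended to these outputs). The helper names and the toy dimensions are hypothetical.

```python
def append_h_vector(conv_frames, h_vector):
    """Concatenate the same h-vector to each frame of the convolutional output.

    conv_frames: list of per-frame feature lists (time x features).
    Per-frame concatenation is an assumption of this sketch.
    """
    return [list(frame) + list(h_vector) for frame in conv_frames]


def zero_h_vector(dim):
    """Placeholder used during pretraining: same dimension, all values 0."""
    return [0.0] * dim


# Toy values: two frames with two features each, and a 3-dimensional h-vector.
conv_frames = [[0.1, 0.2], [0.3, 0.4]]
h_vector = [1.0, 2.0, 3.0]

# Stage 1 (pretraining): the recurrent layers see zero vectors in place of h-vectors.
pretrain_input = append_h_vector(conv_frames, zero_h_vector(len(h_vector)))

# Stage 2 (fine-tuning): same data and same input dimension, real h-vectors.
finetune_input = append_h_vector(conv_frames, h_vector)
```

Because the input dimension of the first BLSTM layer is identical in both stages, the pretrained weights can be reused without any change to the network shape.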
3 Dialog history representation
The MEDIA corpus is a French corpus of spoken human/machine dialogues dedicated to hotel booking. Recently, it has been shown that this corpus is currently one of the most challenging corpora for the slot filling (SF) task due to its complexity. In this dataset, a human/machine dialogue is composed of 15 utterances from the user on average, and the same number from the system.
For this work, we decided to use the previous system prompt as history information, since most of the time it provides good evidence of what the user will answer. The goal is to help the main system predict concept tags; hence our aim is to encode the previous system prompt into an embedding that contains useful information for this objective.
3.1 Embedding with supervision
A first h-vector type is produced using a bidirectional gated recurrent unit (GRU) network that analyses the system prompt and produces an embedding vector. This vector is the input to a decision layer whose objective is to predict the bag of concepts of the future user answer, as illustrated in Figure (a). The bag of concepts is represented by a vector whose size is the number of unique concepts (slots) in the application; the entries of the concepts that appear in the next user intervention are set to one. The output layer is thus a multiclass multi-output sigmoid layer, and the network is trained using a binary cross-entropy loss. As the position of the turn in the dialog may itself carry useful statistics, a very small part (2%) of the h-vector is reserved to encode the dialog turn itself. Obviously, predicting the presence or absence of all the concepts of the next user answer from the previous system prompt cannot be done reliably, and the network may overfit.
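The training target and loss described above can be illustrated with a small sketch. The concept inventory below is a hypothetical four-concept example, not the MEDIA concept set, and the loss is averaged over concepts, which is one common convention.

```python
import math


def bag_of_concepts(user_concepts, inventory):
    """Multi-hot target: 1.0 for each inventory concept present in the user turn."""
    present = set(user_concepts)
    return [1.0 if concept in present else 0.0 for concept in inventory]


def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over the concept dimensions.

    probs: sigmoid outputs, one per concept; clipped to avoid log(0).
    """
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(targets)


# Hypothetical inventory and user turn; repeated mentions count once in a bag.
inventory = ["date", "location", "price", "room-type"]
target = bag_of_concepts(["price", "date", "date"], inventory)

# An uninformative predictor that outputs 0.5 for every concept.
loss = binary_cross_entropy([0.5] * len(inventory), target)
```

With every output at 0.5 the loss equals ln 2, which gives a simple reference point against which a trained predictor can be compared.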
3.2 Embedding with no supervision
Another solution, which is more straightforward to train, is to use a recurrent autoencoder to encode the prompt into a single h-vector. This h-vector is obtained by a symmetric neural network using a forward GRU in both the encoder and the decoder parts. The output is a softmax layer (whose size is that of the system vocabulary) whose objective is to reconstruct the input prompt, as illustrated in Figure (b).
Several publicly available corpora have been used for experiments (see Table 1).
|ASR train||EPAC, ESTER 1,2, ETAPE, REPERE, DECODA, MEDIA||404.6|
|SF train||MEDIA (train)||15.8|
|SF dev||MEDIA (dev)||1.6|
|SF test||MEDIA (test)||4.6|
4.1.1 ASR data
In this paper, the ASR data (audio speech files with text transcriptions) are used for transfer learning as described in Section 4.3. The corpus for ASR training is composed of corpora from various evaluation campaigns in the field of automatic speech processing for French, as shown in Table 1. The EPAC, ESTER 1,2, ETAPE, and REPERE corpora contain transcribed speech in French from TV and radio broadcasts. These data were originally in the microphone channel and, for the experiments in this paper, were downsampled from 16 kHz to 8 kHz, since the test set for our main target task (SF) consists of telephone conversations. The DECODA corpus is composed of dialogues from the call-center of the Paris transport authority. MEDIA [11, 3] and PORTMEDIA are corpora of dialogues simulating a vocal tourist information server.
4.1.2 SF data
The MEDIA French corpus, dedicated to semantic extraction from speech in a context of human/machine dialogues, is used in the current experiments (see Table 1). The corpus has manual transcription and conceptual annotation of dialogues from 250 speakers. It is split into the following three parts: (1) the training set (720 dialogues, 12K sentences), (2) the development set (79 dialogues, 1.3K sentences), and (3) the test set (200 dialogues, 3K sentences). A concept is defined by a label and a value; for example, the value 2001/02/03 can be associated with the concept date [40, 11]. The MEDIA corpus is related to the hotel booking domain, and its annotation contains semantic concept tags: room number, hotel name, location, date, room equipment, etc.
4.2 H-vector extraction
We produced three different types of h-vectors: two types of h-vectors using the neural architecture trained in a supervised way to predict the bag of MEDIA concepts:
supervised-all h-vectors. To extract these h-vectors, we trained a model as described in Section 3.1. The accuracy of the model in predicting the next bag of concepts is 45% on the train set and 26% on the test set. The model has 30,382 parameters.
supervised-freq h-vectors. This version has been trained with a bag of the four history concepts that have been observed in the train and development sets to predict concepts that are frequently misrecognized. This version tends to overfit, with around 60% accuracy on the train set and only 16% on the test set. The model has 23,918 parameters.
The third type of embeddings is trained in an unsupervised way:
unsupervised h-vectors. These h-vectors are produced by the autoencoder architecture described in Section 3.2. The autoencoder has 246,270 parameters, and the reconstruction accuracy is 52% on the train set and 48% on the test set.
4.3 Signal-to-concept models
The neural architecture is inspired by Deep Speech 2 for ASR. There are two major differences in comparison with the original architecture. First, we integrated dialog history into this system by means of dialog history embedding vectors (h-vectors), as shown in Figure 1 and proposed in Section 3. Second, in this paper the task is SF; therefore, besides the alphabetic characters, the output sequence also contains special characters corresponding to the semantic tags [14, 38].
A spectrogram of power-normalized audio clips, calculated on 20 ms windows, is used as the input features for the system. As shown in Figure 1, the spectrograms are followed by two 2D-invariant (in the time-and-frequency domain) convolutional layers, and then by five 800-dimensional BLSTM layers with sequence-wise batch normalization. A fully connected layer is applied after the BLSTM layers, and the output layer of the neural network is a softmax layer. The model is trained using the CTC loss function. We used the deepspeech.torch implementation (https://github.com/SeanNaren/deepspeech.pytorch) for training baseline models, and our modification of this implementation to integrate dialog history embedding vectors.
In this work, we performed experiments with two types of models: (1) models that are trained directly on the target task using the MEDIA corpus dataset and (2) models that are trained using the transfer learning paradigm. Transfer learning is performed from the ASR task as described in .
For transfer learning experiments, we first trained an ASR model on the ASR data (described in Section 4.1.1) using an end-to-end model architecture similar to the one used for the SLU model. The difference is in the text data preparation and the output targets. For training ASR systems, the output targets correspond to alphabetic characters and a blank symbol, while for the slot filling task we used additional targets corresponding to the semantic concept tags and one tag corresponding to the end of a concept. Then, we changed the softmax layer in this model by replacing the targets with the SF targets and continued training on the corpus annotated with semantic tags (Section 4.1.2).
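The replacement of the output layer can be sketched with plain dictionaries standing in for model parameters. This is a simplified illustration under stated assumptions: the whole output layer is reinitialized at the new size (the text does not say whether the weights of the shared character targets are kept), and the parameter names, dimensions and `transfer_parameters` helper are hypothetical.

```python
import random


def transfer_parameters(asr_params, output_layer_name, hidden_dim, n_sf_targets, seed=0):
    """Reuse all ASR parameters except the output layer, which is rebuilt.

    asr_params: mapping from layer name to its weights (nested lists here).
    The new output layer has one row per SF target (characters, blank,
    concept tags and the closing tag) and is randomly initialized.
    """
    rng = random.Random(seed)
    sf_params = {
        name: weights
        for name, weights in asr_params.items()
        if name != output_layer_name
    }
    sf_params[output_layer_name] = [
        [rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
        for _ in range(n_sf_targets)
    ]
    return sf_params


# Toy ASR model: 3 output targets over a 4-dimensional hidden representation.
asr_params = {
    "conv1": [[1.0, 2.0]],
    "blstm1": [[0.5, 0.5, 0.5, 0.5]],
    "output": [[0.0] * 4 for _ in range(3)],
}

# The SF model adds, in this toy case, one concept tag and one closing tag.
sf_params = transfer_parameters(asr_params, "output", hidden_dim=4, n_sf_targets=5)
```

Training would then continue from `sf_params` on the data annotated with semantic tags, so that only the output layer starts from scratch.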
Performance was evaluated in terms of concept error rate (CER) and concept value error rate (CVER) on the MEDIA test dataset. CER is defined as the ratio of the total number of deleted, inserted and confused concepts to the total number of concepts in the reference utterances. CVER, in comparison to CER, takes into account concept/value pairs instead of only concepts.
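These two metrics can be illustrated with a short sketch, assuming that deletions, insertions and confusions are counted through a minimum edit distance alignment between reference and hypothesis sequences, as is usual for such error rates. The function names are hypothetical, and the value `paris` is invented for the example.

```python
def edit_distance(ref, hyp):
    """Minimum number of deletions, insertions and substitutions (Levenshtein)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference item
                curr[j - 1] + 1,         # insertion of a hypothesis item
                prev[j - 1] + (r != h),  # match or confusion
            ))
        prev = curr
    return prev[-1]


def error_rate(references, hypotheses):
    """Total errors divided by the total number of reference items."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total = sum(len(r) for r in references)
    return errors / total


# One utterance annotated with (concept, value) pairs.
ref_pairs = [[("date", "2001/02/03"), ("location", "paris")]]
hyp_pairs = [[("date", "2001/02/04"), ("location", "paris")]]

# CER compares concept labels only; CVER compares concept/value pairs.
cer = error_rate([[c for c, _ in utt] for utt in ref_pairs],
                 [[c for c, _ in utt] for utt in hyp_pairs])
cver = error_rate(ref_pairs, hyp_pairs)
```

In this toy case both concepts are correct but one value is wrong, so the hypothesis is penalized by CVER and not by CER.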
In the first series of experiments, we trained a baseline model and models with the different types of h-vectors described in Section 4.2. Results for these models are given in Table 2. All the models in this table are trained directly on the MEDIA training corpus. The first line shows the baseline result for the end-to-end signal-to-concept model. The other three lines (#2,3,4) correspond to the models trained with dialog history integration and differ from each other in the way the dialog history is represented in the form of h-vectors. We can observe that all types of h-vectors provide an improvement over the baseline model for both metrics, CER and CVER. The best result (line #4) is obtained for supervised-all h-vectors and corresponds to a relative CER reduction of 12.5% and a CVER reduction of 11.9% in comparison with the baseline model.
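For reference, a relative error reduction is computed from a baseline error rate and a system error rate as sketched below. The numbers used are purely illustrative and are not results from this paper.

```python
def relative_reduction(baseline_error, system_error):
    """Relative error reduction, in percent, of a system over a baseline."""
    return 100.0 * (baseline_error - system_error) / baseline_error


# Illustrative values only: an error rate going from 20.0% to 18.0%.
example = relative_reduction(20.0, 18.0)
```

In this example the absolute reduction is 2 points, while the relative reduction is 10%.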
It was shown in , that transfer learning can significantly improve the performance of end-to-end SLU models. In this work, we are also interested in exploring the proposed approach for more accurate models trained using the transfer learning paradigm. For this purpose, we trained two models using transfer learning from the ASR task as proposed in  and described in Section 4.3. Results for these models are presented in Table 3. The first line corresponds to a baseline model. The second line demonstrates the result for the model trained with the best type of dialog history embedding vectors (supervised-all) chosen according to our first series of experiments. We can see that h-vectors continue to provide an improvement in performance over the stronger baseline: 7.7% of relative CER reduction and 6.3% of relative CVER reduction.
In this paper, we have proposed a novel way of integrating dialog history information into end-to-end signal-to-concept SLU models by means of so-called h-vectors. We have proposed different types of h-vectors and investigated their effectiveness for end-to-end SLU, using the semantic slot filling task as an example. Experiments on the MEDIA corpus demonstrated that using h-vectors improves the slot filling model performance by about 8–13% of relative CER reduction, and by about 6–12% of relative CVER reduction. The best result was obtained using supervised-all h-vectors, which predict bag-of-concepts representations of the user's answer from the last system response.
References
[1] (2012) DECODA: a call-centre human-human spoken conversation corpus. In LREC, pp. 1343–1347.
[2] (2019-09) Benchmarking benchmarks: introducing new automatic indicators for benchmarking Spoken Language Understanding corpora. In Interspeech.
[3] (2006) Results of the French Evalda-Media evaluation campaign for literal understanding. In LREC.
[4] Curriculum-based transfer learning for an effective end-to-end spoken language understanding and domain portability. In Interspeech 2019, pp. 1198–1202.
[5] (2018) Spoken language understanding without speech recognition. In ICASSP.
[6] (2016) End-to-end memory networks with knowledge carryover for multi-turn spoken language understanding. In Interspeech.
[7] (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR abs/1406.1078.
[8] (2015) Keras, https://github.com/fchollet/keras.
[9] (1994) Expanding the scope of the ATIS task: The ATIS-3 corpus. In Workshop on Human Language Technology, pp. 43–48.
[10] (2017) Semi-supervised adaptation of rnnlms by fine-tuning with domain-specific auxiliary features. In Interspeech, pp. 2715–2719.
[11] (2004) The French MEDIA/EVALDA project: the evaluation of the understanding capability of spoken language dialogue systems. In LREC.
[12] (2010) The EPAC corpus: manual and automatic annotations of conversational speech in French broadcast news. In LREC.
[13] (2009) The ESTER 2 evaluation campaign for the rich transcription of french radio broadcasts. In Tenth Annual Conference of the International Speech Communication Association.
[14] (2018) End-to-end named entity and semantic concept extraction from speech. In SLT, pp. 692–699.
[15] (2012) The REPERE corpus: a multimodal corpus for person recognition. In LREC, pp. 1102–1107.
[16] (2019) HyST: a hybrid approach for flexible and accurate dialogue state tracking. In Interspeech.
[17] (2018) Slot-gated modeling for joint slot filling and intent prediction. In NAACL-HLT.
[18] (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In ICML.
[19] (2012) The ETAPE corpus for the evaluation of speech-based TV content processing in the french language. In LREC.
[20] (2018) From audio to semantics: approaches to end-to-end spoken language understanding. arXiv preprint arXiv:1809.09190.
[21] (2016) Tracking the world state with recurrent entity networks. arXiv preprint arXiv:1612.03969.
[22] (2016) Dependency based embeddings for sentence classification tasks. In NAACL-HLT.
[23] (2019) Dialogue state tracking with convolutional semantic taggers. In ICASSP, pp. 7220–7224.
[24] (2019) Sumbt: slot-utterance matching for universal and scalable belief tracking. arXiv preprint arXiv:1907.07421.
[25] (2012) Robustness and portability of spoken language understanding systems among languages and domains: the PortMedia project [in French]. In JEP-TALN-RECITAL, pp. 779–786.
[26] (2019) Incremental transformer with deliberation decoder for document grounded conversations. In ACL.
[27] (2017) A structured self-attentive sentence embedding. In ICLR.
[28] (2017) Using context information for dialog act classification in dnn framework. In EMNLP.
[29] (2019) Speech model pre-training for end-to-end spoken language understanding. In Interspeech.
[30] (2019) Context-aware neural-based dialog act classification on automatically generated transcriptions. In ICASSP.
[31] (2016) Deep speech 2: end-to-end speech recognition in English and Mandarin. In ICML.
[32] (2018) Deep contextualized word representations. In Proceedings of NAACL-HLT, pp. 2227–2237.
[33] (2017) Exploring ASR-free end-to-end modeling to improve spoken language understanding in a cloud-based dialog system. In ASRU.
[34] (2019) Do neural dialog systems use the conversation history effectively? an empirical study. In ACL.
[35] (2018) Towards end-to-end spoken language understanding. arXiv preprint arXiv:1802.08395.
[36] (2019) Modeling semantic relationship in multi-turn conversations with hierarchical latent variables. In ACL.
[37] (2019) Recent advances in end-to-end spoken language understanding. In SLSP.
[38] (2019) Investigating adaptation and transfer learning for end-to-end spoken language understanding from speech. Interspeech.
[39] (2011) Spoken language understanding: systems for extracting semantic information from speech. John Wiley & Sons.
[40] (2015) Is it time to switch to word embedding and recurrent neural networks for spoken language understanding?. In Interspeech.
[41] (2016) A step beyond local observations with a dialog aware bidirectional GRU network for Spoken Language Understanding. In Interspeech.
[42] (2019) Probing for semantic classes: diagnosing the meaning content of word embeddings. In ACL.
[43] (2018) On the dimensionality of word embedding. In Advances in Neural Information Processing Systems, pp. 887–898.
[44] (2018) Diffusion maps for textual network embedding. In NIPS.
[45] (2019) A hierarchical decoding model for spoken language understanding from unaligned data. In ICASSP.