Joint On-line Learning of a Zero-shot Spoken Semantic Parser and a Reinforcement Learning Dialogue Manager

by Matthieu Riou, et al.

Despite many recent advances in the design of dialogue systems, a true bottleneck remains the acquisition of the data required to train their components. Unlike many other language processing applications, dialogue systems require interactions with users, so it is difficult to develop them with pre-recorded data. Building on previous work, on-line learning is pursued here as a convenient way to address the issue: data collection, annotation and use in learning algorithms are performed in a single process. The main difficulties are then to bootstrap an initial basic system and to control the level of additional cost on the user side. Considering that well-performing solutions can be used directly off the shelf for speech recognition and synthesis, the study focuses on learning the spoken language understanding and dialogue management modules only. Several variants of joint learning are investigated and tested in user trials to confirm that the overall on-line learning can be achieved after only a few hundred training dialogues and can outperform an expert-based system.



1 Introduction

While a new avenue of research on end-to-end deep-learning-based dialogue systems has shown promising results lately [1, 2], a major hindrance remains the need for a huge quantity of data to train these models efficiently. So far, it is not clear how some initial (low-cost) knowledge can be leveraged in this case for a warm start of the system development, followed by on-line training with users as described in [3, 4].

In the experiments reported here, our underlying goal is to develop a system intended to be used in a neuroscience experiment. From inside an fMRI scanner, users interact with a robotic platform, vocally powered by our system, which is live-recorded and displayed inside the head-antenna. It is therefore crucial that the system be developed in French. As a result, publicly available corpora cannot be used, not only because the vast majority are in English [5, 6], but also because a new task is targeted (see Section 5) for which no data is available yet. Likewise, crowdsourcing a large-scale data collection is not affordable, as no platform (AMT or others) offers enough NLP-skilled French-speaking workers.

As a consequence, in this work we still rely on a classical architecture, with proven capabilities, for goal-directed vocal interaction. It is basically a pipeline of modules processing the audio information from the user downstream: successive steps aim first to extract the content (speech recognition), then the meaning (semantic parsing, SP), and finally to combine it with previous information (including grounding status) from the dialogue history (belief tracking), so that a policy can decide, based on this dialogue state representation, the next best action to perform according to some global criteria (generally dialogue length and success in reaching the goal). This dialogue management (DM) step is then followed by processing to convey the information back upstream to the user: conversion of the dialogue manager action into natural language (natural language generation, NLG) followed by speech synthesis. The HIS architecture [7] offers such a setup, encompassing a global statistical framework to account for the relations between the data handled by the main modules of the system and allowing reinforcement learning of the DM policy. The most sample-efficient learning algorithms [8] could be implemented in this system, which in turn made it possible to propose on-line learning through direct interactions with the user [9]. More recently, on-line learning has been generalized to the input/output modules, SP and NLG, with protocols to control the cost of such operations during system development (as in [10, 11, 12]). This work is our first attempt to combine the on-line learning of SP and DM in a single development phase. It is expected not only to speed up and simplify the process, but also to benefit from intertwined improvements of the modules.

In dialogue systems, SP extracts a list of semantic concept hypotheses from an input sentence transcription of the user’s query. This list is generally expressed as a sequence of Dialogue Acts (DAs) of the form acttype(slot=value) and transmitted to the Dialogue Manager (DM), which decides on the next action to perform. State-of-the-art SPs are based on probabilistic approaches and trained with various machine learning methods to tag the user input with these semantic concepts [13, 14, 15]. Supervised machine learning techniques require a large amount of annotated data, which are domain-dependent and rarely available.
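To make the acttype(slot=value) notation concrete, the following minimal sketch parses such strings into a small structure. It is only an illustration: the class name, the function name and the example slot names are hypothetical and are not taken from the system described here.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DialogueAct:
    """One dialogue act of the form acttype(slot=value); slot and value are optional."""
    act_type: str
    slot: Optional[str] = None
    value: Optional[str] = None


# acttype, then parentheses holding nothing, a slot, or slot=value.
_DA_PATTERN = re.compile(r"^(\w+)\((?:(\w+)(?:=(.*))?)?\)$")


def parse_dialogue_act(text: str) -> DialogueAct:
    """Parse a string such as 'inform(color=red)' (hypothetical slot name)."""
    match = _DA_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a dialogue act: {text!r}")
    act_type, slot, value = match.groups()
    return DialogueAct(act_type, slot, value)
```

With this sketch, parse_dialogue_act("bye()") yields an act with no slot, and parse_dialogue_act("request(color)") yields a slot without a value.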

To deal with this limitation, Dauphin et al. [16] proposed a zero-shot learning algorithm for Semantic Utterance Classification (SUC). This method tries to find a sentence-wise link between categories and utterances in a semantic space. A deep neural network can be trained on a large amount of non-annotated and unstructured data to learn this semantic space. Along the same lines, a zero-shot learning method for SP (ZSSP) based on word embeddings [17] was presented in [10]. This approach requires neither annotated data nor in-context data: only the ontological description of the target domain and generic word embedding features (learned from freely available, general-purpose data) are required to initialize the model. On top of that, an active learning strategy based on an adversarial bandit has been proposed [11] in order to train ZSSP with light and controlled supervision from the users.

Along the same lines, thanks to the sample-efficient RL algorithm KTD [18], an active learning scheme has also been proposed for DM training; it uses reward shaping [19] to take into account local (turn-based) rewards from the user, offering better control over the learning process and speeding it up [9].

Since solutions exist for active on-line learning of both SP and DM sub-systems, we now consider their joint application to address the issue of the overall training of the system. First, a direct application of existing techniques is presented and tested: both modules remain separate and the parameters of their on-line training are kept disjoint (a bandit algorithm for SP, a Q-learner for DM). Then a new possibility with shared parameters in a single Q-learner is introduced and evaluated.

The remainder of this paper is organized as follows. After presenting the basis of the on-line learning versions of SP in Section 2 and DM in Section 3, we define the joint on-line learning strategies in Section 4. Section 5 provides an experimental study with human evaluations of the proposed approaches. We conclude in Section 6.

2 On-line Learning for Zero-shot SP

The SP model used in this study is the ZSSP model presented in [11] and illustrated in Fig. 1. It makes use of a semantic knowledge base and a semantic feature space. The knowledge base contains some examples of lexical chunks associated with each targeted DA, and the feature space is a set of word embeddings learnt with neural network algorithms on large non-annotated open-domain data [17, 20]. The SP composes a scored graph of hypotheses from user utterances. All possible contiguous chunks are considered in the graph, and a dot product between the k-most similar vectors and their corresponding assignment coefficients in the associated matrix is computed to attribute to each chunk a list of scored semantic hypotheses. A best-path decoding is then performed to find the best semantic tag hypothesis for the considered user utterance.

Figure 1: Basics of the ZSSP Model
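The decoding step can be illustrated with a minimal sketch under simplifying assumptions: every contiguous chunk receives a single best (tag, score) hypothesis from a hypothetical scoring function, chunk scores are simply added along a path, and the best segmentation is found by dynamic programming. The toy scores below are invented for the example; in the model described here the scores would come from the word-embedding similarity computation instead.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Chunk = Tuple[str, ...]
Hypothesis = Tuple[str, float]  # (semantic tag, score)


def best_path(tokens: Sequence[str],
              score_chunk: Callable[[Chunk], Hypothesis]
              ) -> Tuple[float, List[Tuple[Chunk, str]]]:
    """Return the best-scoring segmentation of tokens into tagged contiguous chunks."""
    n = len(tokens)
    best_score = [float("-inf")] * (n + 1)
    back: List[Optional[Tuple[int, str]]] = [None] * (n + 1)
    best_score[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):  # every contiguous chunk ending at `end`
            tag, score = score_chunk(tuple(tokens[start:end]))
            total = best_score[start] + score  # additive scores: an assumption
            if total > best_score[end]:
                best_score[end] = total
                back[end] = (start, tag)
    # Follow the back-pointers from the end of the utterance.
    path: List[Tuple[Chunk, str]] = []
    end = n
    while end > 0:
        start, tag = back[end]
        path.append((tuple(tokens[start:end]), tag))
        end = start
    path.reverse()
    return best_score[n], path


# Hypothetical toy scores; unknown chunks get a penalty that grows with length.
TOY_SCORES = {
    ("hello",): ("hello()", 2.0),
    ("the", "painting"): ("inform(object=painting)", 3.0),
}


def toy_scorer(chunk: Chunk) -> Hypothesis:
    return TOY_SCORES.get(chunk, ("null", -1.0 * len(chunk)))
```

Calling best_path("hello i see the painting".split(), toy_scorer) keeps the two known chunks and tags the words in between as null.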

An on-line adaptation strategy is adopted, as presented in [11] and briefly recalled here. In this approach, at each dialogue iteration, the system chooses an adaptation action and uses the user feedback to update the knowledge base. The system gain, the user effort and the loss function for performing each action are defined and can be estimated during on-line training.

Three possible actions are considered:

  • Skip: skip the adaptation process for this turn.

  • AskConfirm: a yes/no question is presented to the user about the correctness of the selected DAs in the best semantic hypothesis. The user effort depends on whether the whole sentence is accepted; if it is not, the effort is equal to the number of DAs in the best semantic hypothesis (one yes/no confirmation request per DA).

  • AskAnnotation: the user is asked to re-annotate the whole utterance. The effort is lowest when the sentence is accepted straight away. Otherwise, the user first informs the system about the word boundaries of the concept to annotate, and the system then sequentially asks for the act type, the slot and, if necessary, the value of each DA, each interim question adding to the user effort.

An adversarial bandit algorithm is used to choose, at each dialogue turn, the adaptation action that minimizes the loss. The loss function, detailed in [11], combines the system gain and the user effort: a weighting parameter balances the importance of information improvement against user effort, and the effort is related to the maximum number of exchanges between the system and the user (in a same turn/round).
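As an illustration of how such a bandit can operate, here is a minimal sketch under stated assumptions. Exp3 is used as one standard adversarial bandit algorithm; the text does not say which variant is actually employed. The loss is assumed to be a convex combination of the missing gain and the normalized user effort, which is a hypothetical form chosen for the example and not the formula of [11]. All names and default values are illustrative.

```python
import math
import random
from typing import List, Optional

ACTIONS = ["Skip", "AskConfirm", "AskAnnotation"]


def assumed_loss(gain: float, effort: float, max_effort: float,
                 balance: float = 0.5) -> float:
    """Hypothetical loss in [0, 1]: trade-off between missing gain and user effort.

    ASSUMPTION: this convex combination is NOT taken from the paper.
    """
    return balance * (1.0 - gain) + (1.0 - balance) * (effort / max_effort)


class Exp3:
    """Exp3 adversarial bandit fed with losses in [0, 1]."""

    def __init__(self, n_actions: int, exploration: float = 0.1,
                 rng: Optional[random.Random] = None) -> None:
        self.n = n_actions
        self.exploration = exploration
        self.weights = [1.0] * n_actions
        self.rng = rng or random.Random()

    def probabilities(self) -> List[float]:
        """Mix the normalized weights with uniform exploration."""
        total = sum(self.weights)
        return [(1.0 - self.exploration) * w / total + self.exploration / self.n
                for w in self.weights]

    def choose(self) -> int:
        """Sample an action index from the current distribution."""
        return self.rng.choices(range(self.n), weights=self.probabilities())[0]

    def update(self, action: int, loss: float) -> None:
        """Importance-weighted update using the reward 1 - loss of the played action."""
        p = self.probabilities()[action]
        estimated_reward = (1.0 - loss) / p
        self.weights[action] *= math.exp(
            self.exploration * estimated_reward / self.n)
```

At each turn, one would call choose(), present the corresponding interaction to the user, estimate gain and effort, and pass the resulting loss to update().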

3 On-line Learning for RL Dialogue Manager

The dialogue manager used in this paper adapts a system presented in [9]. It is based on a POMDP-based dialogue management framework, the Hidden Information State (HIS) [7]. In this setup, the system maintains a distribution over possible dialogue states (the belief state) and uses it to generate an adequate answer. A reinforcement learning (RL) algorithm is used to train the system by maximizing an expected cumulative discounted reward.

At each turn, the dialogue manager generates several possible answers, depending on its belief state. It generates 11 dialogue acts, matching the 11 summary acts described in Table 1. Some can be deemed impossible at some point if no conversion to full action is possible (for instance inform if no entity is selected yet).

Summary act   Description
Greet         Greet user
Bye           End the dialogue
BoldRQ        Bold query request
TentRQ        Tentative query request
Confirm       Confirm an ungrounded piece of information
FindAlt       Find alternative database entity
Split         Distinguish two hypotheses
Repeat        Repeat
Offer         Offer a database entity
Inform        Give info about current offer
QMore         Query if the user wants more information

Table 1: Descriptions of the dialogue manager’s summary acts

The dialogue manager then chooses the best summary act according to the given context. To learn this policy, an RL approach is used: the KTDQ learning algorithm [21], derived from the Kalman-based Temporal Differences (KTD) framework. At each turn, the policy selects a summary act to answer the user, and the user then gives feedback to score the response and update the policy. There are two types of feedback. The global feedback is given at the end of the dialogue by asking the user whether the entire dialogue was a success or not. The social feedback is given at each turn to score the last response only. It is composed of two parts: the score given by the user to this last response (named additional-feedback) minus a function of the previous turn’s additional-feedback (to limit the influence of each additional-feedback to its own turn), and the turn cost, which penalises overly long dialogues by adding a negative score to each turn taken (named feedback).

At the end of the dialogue, the policy is updated according to all the collected feedback.

In this work, the global feedback value is set to 20 in case of success and 0 otherwise. The feedback is set to -1 for each turn.
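As a worked example of these values, the sketch below accumulates the feedback collected over one dialogue. It is a simplification under stated assumptions: no discounting is applied, and the per-turn social scores, when given, are taken as already shaped values, since the exact shaping function is not reproduced here. The function and constant names are hypothetical.

```python
from typing import Optional, Sequence

SUCCESS_REWARD = 20.0  # global feedback when the dialogue succeeds (from the text)
FAILURE_REWARD = 0.0   # global feedback otherwise (from the text)
TURN_COST = -1.0       # feedback added at each turn (from the text)


def dialogue_return(n_turns: int, success: bool,
                    social_scores: Optional[Sequence[float]] = None) -> float:
    """Undiscounted sum of turn costs, optional social scores and global feedback.

    ASSUMPTION: social_scores holds one already-shaped value per turn.
    """
    if social_scores is not None and len(social_scores) != n_turns:
        raise ValueError("expected one social score per turn")
    total = n_turns * TURN_COST
    total += SUCCESS_REWARD if success else FAILURE_REWARD
    if social_scores is not None:
        total += sum(social_scores)
    return total
```

For instance, a successful 8-turn dialogue without social scores accumulates 20 - 8 = 12, whereas a failed 5-turn dialogue accumulates -5.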

4 Joint On-line Learning

For the dialogue system to be learned effectively on-line, the user needs to be able to improve both the SP model and the dialogue manager. Two different joint learning protocols are proposed to achieve this.

The first one, referred to as BR hereafter, directly juxtaposes the bandit to learn the ZSSP and the Q-learner RL approach to learn the dialogue manager policy. An adversarial bandit algorithm as described in Section 2 is applied for learning ZSSP and a Q-learner as described in Section 3 is used to learn the DM policy. The knowledge base of the ZSSP as well as the DM policy are adapted after each dialogue turn.

The second protocol, referred to as RR hereafter, directly adds the ZSSP learning actions to the dialogue manager RL policy, and therefore combines the two learning processes into a single policy.

This variant of joint learning merges both policies in a single Q-learner. For that purpose, the DM summary state vector has been augmented with a ZSSP-related dimension. Only one dimension has been added, so as to avoid the dispersion of the policy that could arise from a larger increase in state size.

This new dimension is computed from compound indicators of how much the current ZSSP annotation needs correction. It takes its value on a 3-point scale and relies on three features:

  1. confidence: confidence score of the semantic parser.

  2. fertility: ratio of the number of concepts to the utterance length in words, used because ZSSP tends to over-segment incoming utterances by inserting concepts.

  3. rare: binary presence of rare concepts in the annotation. Rare concepts are “help”, “repeat”, “restart”, “reqalts”, “reqmore”, “ack” or “thankyou”, and are generally wrongly annotated.

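From these features, an inform_zssp feature is computed by testing each feature against thresholds:

  1. all clear: the conditions on confidence and fertility and rare all hold

  2. average condition: the conditions on confidence and fertility and rare hold, and a further condition on (confidence or fertility) holds

  3. alarming: a condition on confidence or fertility or rare holds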
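The three-level rule can be sketched as follows. The threshold values are not available in this text, so the ones used below are purely hypothetical placeholders, as are the direction of each comparison (high confidence and low fertility are assumed to be good) and the function names. Only the logical skeleton (a conjunction for level 1, a disjunction for level 3, the remaining cases for level 2) follows the list above.

```python
def fertility(n_concepts: int, n_words: int) -> float:
    """Ratio of the number of concepts to the utterance length in words."""
    if n_words <= 0:
        raise ValueError("the utterance must contain at least one word")
    return n_concepts / n_words


def inform_zssp(confidence: float, fert: float, rare: bool,
                conf_high: float = 0.8, conf_low: float = 0.5,
                fert_low: float = 0.3, fert_high: float = 0.6) -> int:
    """Return 1 (all clear), 2 (average condition) or 3 (alarming).

    ASSUMPTION: all four thresholds are hypothetical placeholder values.
    """
    # Level 3: any single feature is enough to raise the alarm.
    if confidence < conf_low or fert > fert_high or rare:
        return 3
    # Level 1: every feature looks fine.
    if confidence >= conf_high and fert <= fert_low:
        return 1
    # Level 2: nothing alarming, but confidence or fertility is not ideal.
    return 2
```

The returned level would then be appended to the DM summary state vector as its single ZSSP-related dimension.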

At the same time, the two ZSSP-annotation actions (AskConfirm and AskAnnotation, see Section 2) are included in the list of summary actions handled by the dialogue policy and can be selected by it. In such a case, the user is presented with the appropriate annotation window in the system’s graphical interface and can correct the current annotation. Purely vocal interactions for this process are under study. Although feasible, this remains a challenging task which could introduce errors of its own, so it seemed more appropriate to evaluate the whole process first with a graphical interface and no input errors. Once the annotation is done, the turn is updated (i.e. the annotation process takes the place of the normal user audio response) and the dialogue continues. Even though the policy might be able to learn it by itself, we chose to inhibit two Ask actions in a row (they are tagged as impossible in the next turn). These two ZSSP-annotation actions have a specific social feedback: it uses the loss function described in Section 2, which is rescaled to obtain a score.
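The joint action set and the inhibition of two consecutive Ask actions can be sketched as below. This is a minimal illustration, assuming that the policy exposes one Q-value per action as a dictionary; the function names are hypothetical, and the greedy selection stands in for whatever exploration scheme the learner actually uses.

```python
from typing import Dict, FrozenSet, List, Optional

SUMMARY_ACTS = ["Greet", "Bye", "BoldRQ", "TentRQ", "Confirm", "FindAlt",
                "Split", "Repeat", "Offer", "Inform", "QMore"]
ASK_ACTS = ["AskConfirm", "AskAnnotation"]
JOINT_ACTS = SUMMARY_ACTS + ASK_ACTS  # 11 summary acts + 2 ZSSP-annotation acts


def allowed_actions(previous_action: Optional[str],
                    impossible: FrozenSet[str] = frozenset()) -> List[str]:
    """Joint actions available this turn; Ask actions are blocked right after an Ask."""
    blocked = set(impossible)
    if previous_action in ASK_ACTS:
        blocked.update(ASK_ACTS)
    return [act for act in JOINT_ACTS if act not in blocked]


def greedy_action(q_values: Dict[str, float],
                  previous_action: Optional[str],
                  impossible: FrozenSet[str] = frozenset()) -> str:
    """Pick the allowed action with the highest Q-value (missing values count as -inf)."""
    candidates = allowed_actions(previous_action, impossible)
    return max(candidates, key=lambda act: q_values.get(act, float("-inf")))
```

The impossible argument covers the other cases mentioned in Section 3, such as Inform when no entity has been selected yet.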

5 Experimental Study

5.1 Task Description

The experiments presented in this paper concern a chit-chat dialogue system framed as a goal-oriented dialogue task. In this context, users discuss an image with the system (out of a small predefined set of 6), and they jointly try to discover the message conveyed by the image, as described in [22]. In order to use a goal-oriented system for such a task, the principle followed was to construct, as the system’s back-end, a database containing several hundred possible combinations of characteristics of the image, each associated with a hypothesis about the conveyed message. During the interaction, the user progressively gives elements of the image that match entities in the database, which makes the system select a small subset of possible entities. From this subset the system can pick either additional characteristics to inform the user about or, ultimately, a pre-defined message to give as a plausible explanation of the image’s purpose. This allows the user to speak rather freely about the image for several tens of seconds before arguing briefly about the message. No argumentation is possible from the system’s side: it can only propose a canned message, and the discussion is expected to last around one minute at most.

The task-dependent knowledge base used in the experiments is derived from the INT task description [22], as well as from generic dialogue information. The semantics of the task is represented by 16 different act types, 9 slots and 51 values. The lexical forms used to model act types were manually elaborated.

5.2 Results

Model        Train    Test     Success  Avg cum.  Sys. Underst.  Sys. Gener.
             (#dial)  (#dial)  (%)      Reward    Rate           Rate
ZH           0        94       31       -1.7      1.3            4.1
BH - Train1  80       48       79       9.1       3.2            4.7
BH - Train2  80       48       73       9.1       2.4            4.5
BR - Train1  140      48       62       7.0       2.9            4.4
BR - Train2  140      48       94       11.7      2.8            4.6
RR - Train1  140      48       58       1.7       2.2            4.3
RR - Train2  140      48       67       4.4       2.5            4.8

Table 2: Evaluation of the different configurations of on-line learning

The evaluation of the two joint learning approaches is presented here. Two additional systems are included for comparison: ZH is a baseline system without on-line learning, using the initial ZSSP and a handcrafted dialogue manager policy, whereas BH combines the bandit on-line learning for ZSSP with the handcrafted dialogue manager policy.

For each system, two expert users communicated with the system to train two separate models. Then a group of (mostly) naive users tested each model. At the end of each session, the users were asked to rate the understanding and generation qualities of the system on a scale of 0 (worst) to 5 (best). The numbers of training and test dialogues for each configuration are given in Table 2.

The dialogue success rate is calculated and the average cumulated reward is estimated for each protocol. Figure 2 presents the learning curves for the BH, BR and RR systems. The curves are smoothed with a 20-point sliding window with a 5-point shift. Regarding BH, both trainings tend to have a high initial success rate while the expert users validate the baseline system. The experts then started to diversify their inputs, which quickly and temporarily degraded the results. Within fifty training dialogues, they achieved a higher success score. BR and RR start from scratch and initially have a very low success rate. At the beginning of the learning process, the expert is inclined to use simple dialogues to build an efficient dialogue manager policy, leading to a large increase in the success rate. Then, when the system starts to be usable, more sophisticated dialogue developments are tested to teach the system more adaptability, and during this phase the reward and success rates drift downwards. The integration of the ZSSP learning actions into the dialogue manager RL policy presents some difficulties. These actions are difficult to reward despite the social reward, since the most profitable ones usually happen during unsuccessful dialogues. They thus tend to disappear quickly during learning, which limits ZSSP learning for RR systems. This can explain the lower scores of the RR systems during the user tests discussed below.

For the BR model, it can be observed with Train2 that this protocol can reach very good performance (94% success rate) with fewer than 150 training dialogues. Yet this performance seems to suffer from great variability, depending on the choices made by the expert. Although the experts have the same tools at their disposal, they still have a large margin of action in how they train their system: for instance, they can decide to locally reward only the good actions, or conversely only the bad ones (or ideally, but at a higher cost, both). They also have considerable leeway in the kind of inputs they train the system with: either very simple ones to ensure a steady learning curve, or very realistic ones to try to reach the interesting regions of the policy’s state space immediately. This is what can be observed in the difference between BR-Train1 (62% success rate) and BR-Train2 (94%). However, it is worth noting that in both cases the system performance increases with time, so a system can be further improved up to a certain level. Unfortunately, since these experiments are extremely time-consuming and require expert users to train the models, larger-scale experiments are not affordable, which prevents a statistical estimate of the variance in performance of the resulting models over different experts applying different strategies.
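The smoothing applied to the learning curves can be reproduced with a few lines. This is a minimal sketch, assuming that each point of a smoothed curve is the plain average of the values in its window and that incomplete windows at the end are dropped; the function name is hypothetical.

```python
from typing import List, Sequence


def smooth(values: Sequence[float], window: int = 20, shift: int = 5) -> List[float]:
    """Average `values` over windows of `window` points, moved by `shift` points."""
    if window <= 0 or shift <= 0:
        raise ValueError("window and shift must be positive")
    return [sum(values[start:start + window]) / window
            for start in range(0, len(values) - window + 1, shift)]
```

Applied to a sequence of per-dialogue successes coded as 0 or 1, each output point is the success rate over 20 consecutive training dialogues.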

Figure 2: (a) Average cumulated reward for different models (b) Average cumulated success for different models

The results of the user trials for the two trainings of each protocol are given in Table 2. They show that the different configurations of the system display acceptable performance. The difference in success rate between ZH and the two BH models (+48% absolute for Train1 and +42% absolute for Train2) shows the impact of the ZSSP adaptation on the overall success of the conversation, along with a better understanding rate (1.3 for ZH vs. 3.2 for BH-Train1 and 2.4 for BH-Train2). The average cumulated reward on the test is directly correlated with the success rate and confirms the previous observations. Also, thanks to a well-tuned template-based generation system, the system generation rate is high (above 4 out of 5) for all configurations.

The RR protocol yields overall lower success rates than BH and BR (58% for Train1 and 67% for Train2, compared with 94% for the best BR case). Analysis of the training logs suggests that this is related to the very low triggering level of the ZSSP learning actions after the exploration steps in RR, compared with the use of the bandit (in BH and BR). To remedy this, the policy state space should be modified to better account for the situations favorable to ZSSP actions, while preserving its capacity to discriminate between dialogue actions. In any case, this approach deserves to be developed further and improved, as it relies on a single framework for joint learning, which simplifies system development from a programming point of view.
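The absolute differences quoted from Table 2 can be checked with a short computation. The success rates below are copied from the table; the dictionary and function names are hypothetical.

```python
# Test success rates (%) copied from Table 2.
SUCCESS_RATE = {
    "ZH": 31,
    "BH-Train1": 79, "BH-Train2": 73,
    "BR-Train1": 62, "BR-Train2": 94,
    "RR-Train1": 58, "RR-Train2": 67,
}


def absolute_gain(model: str, baseline: str = "ZH") -> int:
    """Difference in success rate, in percentage points, between two systems."""
    return SUCCESS_RATE[model] - SUCCESS_RATE[baseline]
```

For example, absolute_gain("BH-Train1") gives the 48 points quoted above, and absolute_gain("BR-Train2", "RR-Train2") gives the 27-point gap between the best BR and the best RR models.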

6 Conclusion

After proposing methods to interactively train both semantic parsing and dialogue management on-line, this paper proposed and evaluated ways to combine them in a joint learning process. Experiments have been carried out in real conditions and are therefore scarce. Yet it has been possible to show that joint learning is feasible, and that after only about a hundred dialogues the performance of the various configurations tested was generally good enough compared to a handcrafted system.

Based on these results, we are now investigating the possibility of merging the resulting policies between trials, so as to be able to accumulate trainings coming from different users and save the system developers even more time.


  • [1] Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau, “Building end-to-end dialogue systems using generative hierarchical neural network models,” in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence. 2016, AAAI’16, pp. 3776–3783, AAAI Press.
  • [2] Tsung-Hsien Wen, David Vandyke, Nikola Mrkšić, Milica Gasic, Lina M. Rojas Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young, “A network-based end-to-end trainable task-oriented dialogue system,” in Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers. 2017, pp. 438–449, Association for Computational Linguistics.
  • [3] Emmanuel Ferreira and Fabrice Lefèvre, “Expert-based reward shaping and exploration scheme for boosting policy learning of dialogue management,” in ASRU, 2013.
  • [4] M. Gašić, F. Jurčíček, S. Keizer, F. Mairesse, B. Thomson, K. Yu, and S. Young, “Gaussian processes for fast policy optimisation of pomdp-based dialogue managers,” in SIGDIAL, 2010.
  • [5] J. Williams, A. Raux, D. Ramachandran, and A. Black, “The dialog state tracking challenge,” in SIGDIAL, 2013.
  • [6] Iñigo Casanueva, Paweł Budzianowski, Pei-Hao Su, Nikola Mrkšić, Tsung-Hsien Wen, Stefan Ultes, Lina Rojas-Barahona, Steve Young, and Milica Gašić, “A Benchmarking Environment for Reinforcement Learning Based Task Oriented Dialogue Management,” Nov. 2017.
  • [7] S. Young, M. Gašić, S. Keizer, F. Mairesse, J. Schatzmann, B. Thomson, and K. Yu, “The hidden information state model: A practical framework for pomdp-based spoken dialogue management,” Computer Speech and Language, vol. 24, no. 2, pp. 150–174, 2010.
  • [8] L. Daubigney, M. Geist, S. Chandramohan, and O. Pietquin, “A comprehensive reinforcement learning framework for dialogue management optimization,” Selected Topics in Signal Processing, vol. 6, no. 8, pp. 891–902, 2012.
  • [9] Emmanuel Ferreira and Fabrice Lefèvre, “Reinforcement-learning based dialogue system for human-robot interactions with socially-inspired rewards,” Computer Speech & Language, vol. 34, no. 1, pp. 256–274, 2015.
  • [10] E Ferreira, B Jabaian, and F Lefèvre, “Online adaptative zero-shot learning spoken language understanding using word-embedding,” in ICASSP, 2015.
  • [11] Emmanuel Ferreira, Alexandre Reiffers-Masson, Bassam Jabaian, and Fabrice Lefèvre, “Adversarial bandit for online interactive active learning of zero-shot spoken language understanding,” in Proceedings of ICASSP, 2016.
  • [12] Matthieu Riou, Bassam Jabaian, Stéphane Huet, and Fabrice Lefèvre, “Online adaptation of an attention-based neural network for natural language generation,” in Proceedings of INTERSPEECH, 2017.
  • [13] S. Hahn, M. Dinarelli, C. Raymond, F. Lefèvre, P. Lehnen, R. De Mori, A. Moschitti, H. Ney, and G. Riccardi, “Comparing stochastic approaches to spoken language understanding in multiple languages,” IEEE TASLP, vol. 19, no. 6, pp. 1569–1583, 2010.
  • [14] F. Lefèvre, “Dynamic Bayesian networks and discriminative classifiers for multi-stage semantic interpretation,” in ICASSP, 2007.
  • [15] A. Deoras and R. Sarikaya, “Deep belief network based semantic taggers for spoken language understanding,” in INTERSPEECH, 2013.
  • [16] Y. Dauphin, G. Tur, D. Hakkani-Tur, and L. Heck, “Zero-shot learning and clustering for semantic utterance classification,” arXiv preprint arXiv:1401.0509, 2014.
  • [17] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.
  • [18] M. Geist and O. Pietquin, “Kalman temporal differences,” Artificial Intelligence Research, vol. 39, no. 1, pp. 483–532, Sept. 2010.
  • [19] A. Ng, D. Harada, and S. Russell, “Policy invariance under reward transformations: Theory and application to reward shaping,” in ICML, 1999.
  • [20] J. Bian, B. Gao, and T. Liu, “Knowledge-powered deep learning for word embedding,” in ECML, 2014.
  • [21] Matthieu Geist and Olivier Pietquin, “Managing uncertainty within value function approximation in reinforcement learning,” in Active Learning and Experimental Design workshop (collocated with AISTATS 2010), Sardinia, Italy, 2010, vol. 92.
  • [22] Thierry Chaminade, “An experimental approach to study the physiology of natural social interactions,” Interaction Studies, vol. 18, no. 2, pp. 254–276, 2017.