Attention over Parameters for Dialogue Systems

01/07/2020 ∙ by Andrea Madotto, et al. ∙ Salesforce ∙ The Hong Kong University of Science and Technology

Dialogue systems require a great deal of different but complementary expertise to assist, inform, and entertain humans. For example, different domains (e.g., restaurant reservation, train ticket booking) of goal-oriented dialogue systems can be viewed as different skills, and so can the ordinary chatting abilities of chit-chat dialogue systems. In this paper, we propose to learn a dialogue system that independently parameterizes different dialogue skills and learns to select and combine them through Attention over Parameters (AoP). The experimental results show that this approach achieves competitive performance on a combined dataset of MultiWOZ, In-Car Assistant, and Persona-Chat. Finally, we demonstrate that each dialogue skill is effectively learned and can be combined with other skills to produce selective responses.




1 Introduction

Although humans can do both, goal-oriented dialogues (Williams and Young, 2007; Young et al., 2013) and chit-chat conversations (Serban et al., 2016a; Vinyals and Le, 2015) are often learned with separate models. A more desirable approach for users would be a single chat interface that can handle both casual talk and tasks such as reservation or scheduling. This can be formulated as a problem of learning different conversational skills across multiple domains. A skill can be querying a database, generating daily conversational utterances, or interacting with users in a particular task domain (e.g., booking a restaurant). One challenge of having multiple skills is that existing datasets focus either only on chit-chat or only on goal-oriented dialogues. This is because traditional goal-oriented systems are modularized (Williams and Young, 2007; Hori et al., 2009; Lee et al., 2009; Levin et al., 2000; Young et al., 2013); thus, they cannot be jointly trained with an end-to-end architecture as in chit-chat. However, recently proposed end-to-end trainable models (Eric and Manning, 2017; Wu et al., 2019b; Reddy et al., 2018; Yavuz et al., 2018) and datasets (Bordes and Weston, 2017; Eric et al., 2017) allow us to combine goal-oriented (Budzianowski et al., 2018; Eric et al., 2017) and chit-chat (Zhang et al., 2018) data into a single benchmark dataset with multiple conversational skills, as shown in Table 1.

A straightforward solution would be to have a single model for all the conversational skills, which has been shown to be effective to a certain extent by (Zhao et al., 2017) and (McCann et al., 2018). Putting aside the performance in the tasks, such a fixed shared-parameter framework, without any task-specific designs, would lose controllability and interpretability in the response generation. In this paper, instead, we propose to model multiple conversational skills using the Mixture of Experts (MoE) (Jacobs et al., 1991) paradigm, i.e., a model that learns and combines independent specialized experts using a gating function.

Spk. | Conversation | Skills
Usr: | Can you help me find a cheap 2 star hotel? |
Sys: | SELECT * FROM hotel WHERE pricerange='cheap' AND stars=2 AND type='hotel' | SQL, HOTEL
Mem: | (Result table from the SQL query, with real entities.) |
Sys: | Arms is a 2 star hotel in the centre. Shall I book a room for you? |
Usr: | Yes, thank you! |
Sys: | Sure, for how many people should I book it for? | HOTEL
Sys: | Hello! How are you today? | PERSONA
Usr: | I am good thank you! Do you like to ski? I really like it. |
Sys: | No, I like to code in Haskell. Do you like to code? | PERSONA
Table 1: An example from the dataset, which includes both chit-chat and task-oriented conversations. The model has to predict all the Sys turns, which include issuing SQL queries and generating responses from the Memory content, which is dynamically updated with the query results. The skills are the prior knowledge needed for the response, where Persona refers to chit-chat.

For instance, each expert could specialize in different dialogue domains (e.g., Hotel, Train, Chit-Chat, etc.) and skills (e.g., generating SQL queries). A popular implementation of MoE (Shazeer et al., 2017; Kaiser et al., 2017) uses a set of linear transformations (i.e., experts) in between two LSTM (Schmidhuber, 1987) layers. However, several problems arise with this implementation: 1) the model is computationally expensive, as it has to decode once for each expert and combine the results at the representation level; 2) no prior knowledge (e.g., domains) is injected into the expert selection; 3) a Seq2Seq model has limited ability to extract information from a Knowledge Base (KB) (i.e., generated by the SQL query) (Eric et al., 2017), as required in end-to-end task-oriented dialogue systems (Bordes and Weston, 2017). The last problem can be solved by using more advanced multi-hop models like the Transformer (Vaswani et al., 2017), but the remaining two need to be addressed. Hence, in this paper we:

  • propose a novel Transformer-based architecture called Attention over Parameters (AoP). This model parameterizes the conversational skills of end-to-end dialogue systems with independent decoder parameters (experts), and learns how to dynamically select and combine the appropriate decoder parameter sets by leveraging prior knowledge from the data, such as domains and skill types;

  • prove that AoP is algorithmically more efficient than forwarding all the Transformer decoders and then mixing their output representations, as is normally done in MoE. Figure 1 illustrates the high-level intuition of the difference;

  • empirically show the effectiveness of using specialized parameters on a combined dataset of MultiWOZ (Budzianowski et al., 2018), In-Car Assistant (Eric et al., 2017), and Persona-Chat (Zhang et al., 2018), which, to the best of our knowledge, is the first evaluation of this kind, i.e., end-to-end, large-scale, and multi-domain/multi-skill. Moreover, we show that our model is highly interpretable and is able to combine different learned skills to produce compositional responses.

2 Methodology

We use the standard encoder-decoder architecture and avoid any task-specific designs (Wu et al., 2019b; Reddy et al., 2018), as we aim to build a generic conversation model for both chit-chat and task-oriented dialogues. More specifically, we use a Transformer for both encoder and decoder.

Let us define the sequence of tokens in the dialogue history as $D = \{d_1, \dots, d_m\}$ and the dynamic memory content as a sequence of tokens $M = \{m_1, \dots, m_z\}$. The latter can be the result of a SQL query execution (e.g., a table) or plain text (e.g., a persona description), depending on the task. The dialogue history $D$ and the memory $M$ are concatenated to obtain the final input, denoted by $X = [D; M] = \{x_1, \dots, x_{m+z}\}$. We then denote by $Y = \{y_1, \dots, y_k\}$ the sequence of tokens that the model is expected to produce. Without loss of generality, $Y$ can be either plain text or a SQL-like query. Hence, the model has to learn when to issue database queries and when to generate human-like responses. Finally, we define a binary skill vector $V = \{v_1, \dots, v_r\}$ that specifies the type of skills required to generate $Y$. This can be considered a prior vector for learning to select the correct expert during training (the vector is absent at test time). For example, in Table 1 the first response is of type SQL in the Hotel domain, thus the skill vector $V$ will have $v_{SQL} = 1$ and $v_{Hotel} = 1$, while all the other skills/domains are set to zero (assuming that each index in $V$ is assigned a semantic skill, e.g., SQL at position $i$). More importantly, we may set the vector $V$ to have multiple ones to force the model to compose skills, achieving a semantic compositionality of different experts.
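As a minimal sketch of how such a binary skill vector could be built, the snippet below maps a set of active skill names to a 0/1 list. The skill inventory `SKILLS`, its ordering, and the helper `skill_vector` are hypothetical names chosen for illustration; the actual index assignment used in the paper is not specified here.

```python
# Hypothetical skill inventory: names and ordering are assumptions for illustration.
SKILLS = ["SQL", "BOOK", "Hotel", "Train", "Restaurant", "Persona"]

def skill_vector(active, skills=SKILLS):
    """Return a binary list with a 1 at the index of every active skill."""
    unknown = set(active) - set(skills)
    if unknown:
        raise ValueError(f"unknown skills: {sorted(unknown)}")
    return [1 if s in active else 0 for s in skills]

# First system turn of Table 1: a SQL query in the Hotel domain.
first_response = skill_vector({"SQL", "Hotel"})
# A chit-chat turn only needs the Persona skill.
chit_chat = skill_vector({"Persona"})
```

Setting several entries to one, as in `first_response`, is what allows a domain skill and a query skill to be composed in a single response.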

Figure 1: Comparisons between Single model, Mixture of Experts (MoE) (Jacobs et al., 1991), and Attention over Parameters (AoP).

2.1 Encoder-Decoder

To map the input sequence to the output sequence, we use a standard Transformer (Vaswani et al., 2017) and denote the encoder and decoder as $\mathrm{TRS}_{enc}$ and $\mathrm{TRS}_{dec}$, respectively. The input of a Transformer is the embedded representation of the input words; thus, we define a word embedding matrix $E \in \mathbb{R}^{d \times |\mathcal{W}|}$, where $d$ is the embedding size and $|\mathcal{W}|$ is the cardinality of the vocabulary. The input $X$, together with its positional embedding (see Appendix A1 for more information), is encoded as:

$$H = \mathrm{TRS}_{enc}(E(X)) \tag{1}$$

where $H \in \mathbb{R}^{d \times L}$ is the encoder output, $L$ is the length of $X$, and $E(X)$ is the embedded input. Then the decoder receives the target sequence shifted by one, $Y_{:k-1} = \{\langle SOS \rangle, y_1, \dots, y_{k-1}\}$, as the input. Using teacher forcing (Williams and Zipser, 1989), the model is trained to produce the correct sequence $Y$. The output of the decoder is produced as follows:

$$O = \mathrm{TRS}_{dec}(E(Y_{:k-1}), H) \tag{2}$$

where $O \in \mathbb{R}^{d \times k}$. Finally, a distribution over the vocabulary is generated for each token by an affine transformation followed by a Softmax function:

$$P(Y|X) = \mathrm{Softmax}(O^{\top} E) \tag{3}$$

In addition, $P(Y|X)$ is mixed with the encoder-decoder attention distribution to enable copying tokens from the input sequence, as in (See et al., 2017). The model is then trained to minimize a standard cross-entropy loss and, at inference time, generates one token at a time in an auto-regressive manner (Graves, 2013). Hence, the training loss is defined as:

$$\mathcal{L}_{P(Y|X)} = -\log P(Y|X) \tag{4}$$
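The two training ingredients named above, teacher forcing and the cross-entropy loss, can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: `shift_right`, `cross_entropy`, and the `<SOS>` token name are hypothetical, the logits are given directly rather than produced by a Transformer, and the copy mechanism is omitted.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def shift_right(target, sos="<SOS>"):
    """Decoder input under teacher forcing: the target shifted by one position."""
    return [sos] + target[:-1]

def cross_entropy(logits_per_step, target_ids):
    """Negative log-likelihood of the target tokens, summed over time steps."""
    loss = 0.0
    for logits, y in zip(logits_per_step, target_ids):
        loss -= math.log(softmax(logits)[y])
    return loss

decoder_input = shift_right(["sure", ",", "for", "how", "many", "?"])
uniform_loss = cross_entropy([[0.0, 0.0]], [0])  # one step, two-word vocabulary
```

With uniform logits over a two-word vocabulary, the loss of a single step is log 2, which is the value stored in `uniform_loss`.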


2.2 Attention over Parameters

The main idea is to produce a single set of parameters for the decoder $\mathrm{TRS}_{dec}$ by the weighted sum of $r$ independently parameterized decoders. This process is similar to attention (Luong et al., 2015), where the memories are the parameters and the query is the encoded representation. Let us define $\Theta = [\theta_1, \dots, \theta_r]$ as the list of parameters for $r$ decoders, since a $\mathrm{TRS}_{dec}$ is represented by its parameters $\theta$. Since each $\theta$ can be sized in the order of millions, we assign a corresponding key vector to each $\theta$, similar to key-value memory networks (Miller et al., 2016). Thus, we use a key matrix $K \in \mathbb{R}^{d \times r}$ and a Recurrent Neural Network (RNN), in this instance a GRU (Cho et al., 2014), to produce the query vector by processing the encoder output $H$. The attention weights for each decoder's parameters are computed as follows:

$$q = \mathrm{RNN}(H) \tag{5}$$

$$\alpha = \mathrm{Softmax}(qK) \tag{6}$$

where $q \in \mathbb{R}^{d}$ and $\alpha \in \mathbb{R}^{r}$ is the attention vector, in which each $\alpha_i$ is the score corresponding to $\theta_i$. Hence, the new set of parameters is computed as follows:

$$\theta^{*} = \sum_{i=1}^{r} \alpha_i \theta_i \tag{7}$$

The combined set of parameters $\theta^{*}$ is then used to initialize a new $\mathrm{TRS}_{dec}$, and Equation 2 is applied to the input based on this. Equation 6 is similar to the gating function proposed in (Shazeer et al., 2017; Jacobs et al., 1991), but the resulting scoring vector $\alpha$ is applied directly to the parameters instead of to the output representation of each decoder, yielding an algorithmically faster computation.
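The mixing step can be sketched as follows, assuming each expert's parameters are flattened into a list of floats and the query vector is already available (the GRU that would produce it from the encoder output is left out). The function name `attention_over_parameters` and the toy values are hypothetical, and the softmax scoring is an assumption about the form of the attention.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_over_parameters(query, keys, experts):
    """Mix expert parameters with attention weights.

    query:   list of d floats (assumed to come from an RNN over the encoder output)
    keys:    r key vectors, each a list of d floats (one per expert)
    experts: r flattened parameter lists of equal length
    Returns (alpha, mixed_parameters).
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    alpha = softmax(scores)
    n_params = len(experts[0])
    mixed = [sum(a * theta[j] for a, theta in zip(alpha, experts))
             for j in range(n_params)]
    return alpha, mixed

# Two toy experts with two parameters each; identical keys give uniform weights.
alpha, theta_star = attention_over_parameters(
    query=[1.0, 0.0],
    keys=[[0.0, 0.0], [0.0, 0.0]],
    experts=[[0.0, 2.0], [4.0, 6.0]],
)
```

The decoder is then run once with `theta_star`, instead of once per expert.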

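Theorem 1.

The computational cost of Attention over Parameters (AoP) is always lower than that of Mixture of Experts (MoE) for sequences longer than 1.

Proof.

Let $f_{\theta}: \mathbb{R}^{d} \to \mathbb{R}^{n}$ be a generic function parametrized by $\theta$. Without loss of generality, we define $f_{\theta}$ as an affine transformation $f_{\theta}(x) = Wx + b$. Let $X \in \mathbb{R}^{t \times d}$ be a generic input sequence of length $t$ and dimensionality $d$. Let $F = [f_{\theta_1}, \dots, f_{\theta_r}]$ be the set of $r$ experts. Hence, the operations done by MoE are:

$$\mathrm{MoE}(X) = \sum_{i=1}^{r} \alpha_i f_{\theta_i}(X)$$

Thus, the computational cost in terms of operations is $O(rtdn + rtn)$, since the cost of $f_{\theta_i}(X)$ is $O(tdn)$ and it is repeated $r$ times, and the cost of summing the representations is $O(rtn)$. On the other hand, the operations done by AoP are:

$$\theta^{*} = \sum_{i=1}^{r} \alpha_i \theta_i, \qquad \mathrm{AoP}(X) = f_{\theta^{*}}(X)$$

In this case, the computational cost in terms of operations is $O(rdn + tdn)$, since the cost of summing the parameters is $O(rdn)$ and the cost of $f_{\theta^{*}}(X)$ is $O(tdn)$. Hence, it is easy to verify that if $t > 1$ (and $r > 1$, i.e., there is more than one expert to combine), then:

$$rtdn + rtn > rdn + tdn$$

since $rt \geq r + t$ whenever $r, t \geq 2$. Furthermore, the assumption of a simple affine transformation is actually the most favorable case for MoE. Indeed, assuming that the number of operations of $f$ is of the same order as the number of its parameters is optimistic: for instance, with attention the number of operations increases while the number of parameters remains constant. ∎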
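The counting argument can be checked numerically with a short sketch. The operation counts below follow the affine case of the proof (one expert forward pass costs t·d·n, summing outputs costs r·t·n, summing weight matrices costs r·d·n); the function names and the toy experts are hypothetical. For affine experts, mixing outputs and mixing parameters also give the same result, which the second half illustrates. That equality is specific to the affine case and is not expected to hold exactly for nonlinear Transformer decoders.

```python
def affine(W, b, x):
    """Apply f(x) = Wx + b to one input vector x (W is a list of rows)."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def moe_ops(r, t, d, n):
    """r expert forward passes (t*d*n each) plus summing r outputs of size t*n."""
    return r * t * d * n + r * t * n

def aop_ops(r, t, d, n):
    """Summing r weight matrices (d*n each) plus a single forward pass (t*d*n)."""
    return r * d * n + t * d * n

# Two toy experts mapping R^2 -> R^1, mixed with weights alpha.
experts = [([[1.0, 2.0]], [0.0]), ([[3.0, 4.0]], [2.0])]
alpha = [0.25, 0.75]
x = [1.0, 1.0]

# MoE / AoR style: run every expert, then mix the outputs.
mixed_outputs = sum(a * affine(W, b, x)[0] for a, (W, b) in zip(alpha, experts))

# AoP style: mix the parameters once, then run a single forward pass.
W_star = [[sum(a * W[0][j] for a, (W, _) in zip(alpha, experts)) for j in range(2)]]
b_star = [sum(a * b[0] for a, (_, b) in zip(alpha, experts))]
mixed_params = affine(W_star, b_star, x)[0]

# Example: 13 experts, a 20-token sequence, 512-dimensional input and output.
saving = moe_ops(13, 20, 512, 512) / aop_ops(13, 20, 512, 512)
```

The check is restricted to two or more experts, since with a single expert there is nothing to mix.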

Importantly, if we apply $\alpha$ to each of the output representations generated by the $r$ $\mathrm{TRS}_{dec}$, we end up with a Transformer-based implementation of MoE. We call this model Attention over Representation (AoR). Finally, an additional loss term is used to supervise the attention vector $\alpha$ by using the prior knowledge vector $V$. Since multiple decoder parameters can be selected at the same time, we use a binary cross-entropy to train each $\alpha_i$. Thus, a second loss is defined as:

$$\mathcal{L}_{V} = -\sum_{i=1}^{r} \left[ v_i \log \alpha_i + (1 - v_i) \log (1 - \alpha_i) \right]$$

The final loss is the summation of $\mathcal{L}_{P(Y|X)}$ and $\mathcal{L}_{V}$.
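A minimal sketch of the supervision term, assuming the attention scores lie in (0, 1) and the loss is summed (rather than averaged) over experts; both choices, as well as the name `binary_cross_entropy` and the clipping constant `eps`, are assumptions made for illustration.

```python
import math

def binary_cross_entropy(alpha, v, eps=1e-12):
    """Sum of per-expert binary cross-entropies between scores alpha and binary prior v."""
    total = 0.0
    for a, t in zip(alpha, v):
        a = min(max(a, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= t * math.log(a) + (1 - t) * math.log(1.0 - a)
    return total

prior = [1, 0, 1]                      # e.g. two skills active at once
good = binary_cross_entropy([0.9, 0.1, 0.8], prior)
bad = binary_cross_entropy([0.1, 0.9, 0.2], prior)
```

Because every entry is scored independently, a prior with several ones (as in `prior`) is handled without any change, which is what allows more than one expert to be selected for the same response.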

Finally, in AoP, and more generally in the MoE framework, stacking multiple layers (e.g., Transformer layers) leads to models with a large number of parameters, since multiple experts are repeated across layers. An elegant workaround is the Universal Transformer (Dehghani et al., 2019), which loops over a single layer and, as shown by (Dehghani et al., 2019), achieves similar or better performance than a multi-layer Transformer. In our experiments, we report a version of AoP that uses this architecture, which, notably, does not add any further parameters to the model.
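The parameter-sharing idea can be illustrated with a toy sketch in which a "layer" is any function from state to state. A stacked model owns one layer per depth step, while the looped variant reuses a single layer for a fixed number of hops. The helpers `stacked` and `universal` are hypothetical, and the sketch leaves out other ingredients of the Universal Transformer such as per-hop position signals and adaptive halting.

```python
def stacked(layers, state):
    """Apply a sequence of distinct layers, one after the other."""
    for layer in layers:
        state = layer(state)
    return state

def universal(layer, state, hops=6):
    """Apply the same layer repeatedly, so depth grows without adding parameters."""
    for _ in range(hops):
        state = layer(state)
    return state

toy_layer = lambda x: 2 * x + 1          # stand-in for a Transformer layer
looped = universal(toy_layer, 0, hops=3)
unrolled = stacked([toy_layer] * 3, 0)   # same computation, written as a stack
```

In the looped variant the number of expert parameter sets stays at one per expert regardless of the number of hops.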

3 Experiments and Results

3.1 Dataset

To evaluate the performance of our model for different conversational skills, we propose to combine three publicly available datasets: MultiWOZ (Budzianowski et al., 2018), Stanford Multi-domain Dialogue (Eric et al., 2017) and Persona-Chat (Zhang et al., 2018).

 | SMD | MWOZ | Persona
#Dialogues | 2,425 | 8,438 | 12,875
#Turns | 12,732 | 115,424 | 192,690
Avg. turns | 5.25 | 13.68 | 14.97
Avg. tokens | 8.02 | 13.18 | 11.96
Vocab | 2,842 | 24,071 | 20,343
Figure 2: Dataset statistics


MultiWOZ (MWOZ)

is a human-to-human multi-domain goal-oriented dataset annotated with dialogue acts and states. In this dataset, there are seven domains (i.e., Taxi, Police, Restaurant, Hospital, Hotel, Attraction, Train) and two API interfaces: SQL and BOOK. The former is used to retrieve information about a certain domain and the latter is used to book restaurants, hotels, trains, and taxis. We refine this dataset to include SQL/BOOK queries and their outputs using the same annotation schema as (Bordes and Weston, 2017).

Hence, each response can be either a plain text reply to the user or a SQL/BOOK query, and the memory is dynamically populated with the results of the queries, since the generated response is based on such information. This transformation allows us to train end-to-end models that learn how and when to produce SQL queries, to retrieve knowledge from a dynamic memory, and to produce plain text responses. A detailed explanation is reported in Appendix A3, together with some samples.
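The query-then-respond loop described above can be sketched with Python's built-in `sqlite3` module. The table schema, the rows, and the helper `run_turn` are hypothetical and only mirror the example in Table 1; they are not the dataset's actual schema or the authors' pipeline. Executing model-generated SQL directly, as done here for brevity, would need sandboxing in any real system.

```python
import sqlite3

def run_turn(conn, system_output, memory):
    """Execute a generated SQL query and refresh the memory, or pass plain text through."""
    if system_output.lstrip().upper().startswith("SELECT"):
        rows = conn.execute(system_output).fetchall()
        # Flatten each row into a token sequence, as the memory is a sequence of tokens.
        memory[:] = [" ".join(str(v) for v in row) for row in rows]
        return None  # a query turn shows nothing to the user
    return system_output

# Hypothetical in-memory database with a toy hotel table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hotel (name TEXT, pricerange TEXT, stars INTEGER, type TEXT, area TEXT)")
conn.executemany("INSERT INTO hotel VALUES (?, ?, ?, ?, ?)", [
    ("arms", "cheap", 2, "hotel", "centre"),
    ("lodge", "expensive", 4, "hotel", "north"),
])

memory = []
shown = run_turn(
    conn,
    "SELECT * FROM hotel WHERE pricerange='cheap' AND stars=2 AND type='hotel'",
    memory,
)
reply = run_turn(conn, "Arms is a 2 star hotel in the centre.", memory)
```

After the first call the memory holds the matching row, which the next decoding step can copy entities from; the second call is a plain text turn and leaves the memory untouched.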

Stanford Multi-domain Dialogue (SMD)

is another human-to-human multi-domain goal-oriented dataset that is already designed for end-to-end training. There are three domains in this dataset (i.e., Point-of-Interest, Weather, Calendar). The difference between this dataset and MWOZ is that each dialogue is associated with a set of records relevant to the dialogues. The memory is fixed in this case so the model does not need to issue any API calls. However, retrieving the correct entities from the memory is more challenging as the model has to compare different alternatives among records.


Persona-Chat

is a multi-turn conversational dataset, in which two speakers are paired and different persona descriptions (4-5 sentences) are randomly assigned to each of them. For example, “I am an old man” and “I like to play football” are two of the possible persona descriptions provided to the system. Training models on this dataset results in more persona-consistent and fluent conversations compared to other existing datasets (Zhang et al., 2018). This dataset has become one of the standard benchmarks for chit-chat systems; thus, we include it in our evaluation.

For all three datasets, we use the training/validation/test split provided by the authors, and we keep all the real entities in the input instead of using their delexicalized version as in (Budzianowski et al., 2018; Eric et al., 2017). This makes the task more challenging, but at the same time more interesting, since we force the model to produce real entities instead of generic and frequent placeholders. Figure 2 summarizes the dataset statistics in terms of number of dialogues, turns, and unique tokens. Finally, we merge the three datasets, obtaining 154,768/19,713/19,528 samples for training, validation, and test, respectively, and a vocabulary size of 37,069 unique tokens.

3.2 Evaluation Metrics


For both MWOZ and SMD, we follow the evaluation done by existing works (Eric and Manning, 2017; Zhao et al., 2017; Madotto et al., 2018; Wu et al., 2017). We use the BLEU score (Papineni et al., 2002), computed with the multi-bleu.perl script, to measure the response fluency, and the Entity F1-Score (Wen et al., 2017; Zhao et al., 2017) to evaluate the ability of the model to generate relevant entities from the dynamic memory. Since MWOZ also includes SQL and BOOK queries, we compute the exact match accuracy (i.e., $ACC_{SQL}$ and $ACC_{BOOK}$) and the BLEU score (i.e., $BLEU_{SQL}$ and $BLEU_{BOOK}$) of the generated queries. Furthermore, we also report the F1-score for each domain in both MWOZ and SMD.
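One common way to compute an entity F1 and an exact-match accuracy is sketched below. The micro-averaging over the whole test set, the set-based matching, and the whitespace normalization are assumptions of this sketch; the evaluation scripts used in the cited works may differ in these details. The function names are hypothetical.

```python
def entity_f1(predicted, gold):
    """Micro-averaged F1 over per-response entity lists (predicted vs. gold)."""
    tp = fp = fn = 0
    for pred, ref in zip(predicted, gold):
        pred, ref = set(pred), set(ref)
        tp += len(pred & ref)   # entities correctly generated
        fp += len(pred - ref)   # generated but not in the reference
        fn += len(ref - pred)   # in the reference but not generated
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def exact_match_accuracy(predicted_queries, gold_queries):
    """Fraction of generated queries identical to the gold query (after stripping)."""
    matches = sum(p.strip() == g.strip() for p, g in zip(predicted_queries, gold_queries))
    return matches / len(gold_queries)

f1 = entity_f1([["arms", "centre"]], [["arms", "north"]])
acc = exact_match_accuracy(
    ["SELECT * FROM hotel WHERE stars=2 ", "SELECT * FROM train"],
    ["SELECT * FROM hotel WHERE stars=2", "SELECT * FROM taxi"],
)
```

In the toy call, one of two predicted entities is correct and one of two gold entities is recovered, so precision, recall, and F1 are all 0.5.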


For Persona-Chat, we compare the perplexity, BLEU score, F1-score (Dinan et al., 2019), and Consistency score of the generated sentences against the human-generated responses. The Consistency score is computed using a Natural Language Inference (NLI) model trained on Dialogue NLI (DNLI) (Sean et al., 2018), a recently proposed corpus based on the Persona dataset. We fine-tune a pre-trained BERT model (Devlin et al., 2018) on the DNLI corpus and achieve a test set accuracy of 88.43%, which is similar to the best-reported model in (Sean et al., 2018). The consistency score is defined as follows:

$$\mathrm{NLI}(u, p_j) = \begin{cases} 1 & \text{if } u \text{ entails } p_j \\ 0 & \text{if } u \text{ is independent of } p_j \\ -1 & \text{if } u \text{ contradicts } p_j \end{cases} \qquad C(u) = \sum_{j} \mathrm{NLI}(u, p_j)$$

where $u$ is a generated utterance and $p_j$ is one sentence in the persona description. In Sean et al. (2018); Madotto et al. (2019), the authors showed that by re-ranking the beam search hypotheses using the DNLI score (i.e., the C score), they achieved a substantial improvement in dialogue consistency. Intuitively, a higher consistency C score means a more persona-consistent dialogue response.
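A small sketch of how a consistency score of this kind could be computed, assuming it sums +1 for entailment, 0 for neutral, and -1 for contradiction over the persona sentences. The label names, the `consistency` helper, and the lookup-table classifier `toy_nli` are hypothetical stand-ins; in the paper's setting the labels would come from the fine-tuned BERT model.

```python
LABEL_SCORE = {"entailment": 1, "neutral": 0, "contradiction": -1}

def consistency(utterance, persona, nli):
    """Sum the NLI scores of one utterance against every persona sentence."""
    return sum(LABEL_SCORE[nli(utterance, p)] for p in persona)

def toy_nli(utterance, persona_sentence):
    """Stand-in for a trained NLI classifier; a lookup table keeps the sketch runnable."""
    table = {
        ("i love skiing", "i like to ski"): "entailment",
        ("i love skiing", "i hate snow"): "contradiction",
    }
    return table.get((utterance, persona_sentence), "neutral")

consistent = consistency("i love skiing", ["i like to ski", "i have two dogs"], toy_nli)
inconsistent = consistency("i love skiing", ["i hate snow"], toy_nli)
```

Averaging such per-utterance scores over a test set gives a single number per model, where most utterances contribute zero because they are unrelated to any persona sentence.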

3.3 Baselines

In our experiments, we compare Sequence-to-Sequence (Seq2Seq) (See et al., 2017), Transformer (TRS) (Vaswani et al., 2017), Mixture of Experts (MoE) (Shazeer et al., 2017), and Attention over Representation (AoR) with our proposed Attention over Parameters (AoP). In all the models, we use the same copy mechanism as in (See et al., 2017). In AoR, instead of mixing the parameters as in Equation 7, we mix the output representations of each Transformer decoder (i.e., Equation 2). For AoP, AoR, and MoE, $r = 13$ is the number of decoders (experts): 2 for the skills SQL and BOOK, 10 for the different domains in MWOZ+SMD, and 1 for Persona-Chat. Furthermore, we also include the following experiments: AoP that uses the gold attention vector $V$, which we refer to as AoP w/ Oracle (or AoP + O); AoP trained by removing $\mathcal{L}_{V}$ from the optimization (AoP w/o $\mathcal{L}_{V}$); and, as mentioned above, the Universal Transformer for both AoP (AoP + U) and the standard Transformer (TRS + U) (i.e., 6 hops). The detailed model descriptions and the full set of hyper-parameters used in the experiments are reported in Appendix A4.

3.4 Results

Table 2 and Figure 3 show the evaluation results on the MWOZ+SMD and Persona-Chat datasets, respectively.

Model | Ppl. | F1 | C | BLEU
Seq2Seq | 39.42 | 6.33 | 0.11 | 2.79
TRS | 43.12 | 7.00 | 0.07 | 2.56
MoE | 38.63 | 7.33 | 0.19 | 2.92
AoR | 40.18 | 6.66 | 0.12 | 2.69
AoP | 39.14 | 7.00 | 0.21 | 3.06
TRS + U | 43.04 | 7.33 | 0.15 | 2.66
AoP + U | 37.40 | 7.00 | 0.29 | 3.22
AoP w/o $\mathcal{L}_{V}$ | 42.81 | 6.66 | 0.12 | 2.85
AoP + O | 40.16 | 7.33 | 0.21 | 2.91
Figure 3: Results for the Persona-Chat dataset.

From Table 2, we can identify four patterns. 1) AoP and AoR perform consistently better than the other baselines, which shows the effectiveness of combining parameters by using the correct prior $V$; 2) AoP performs consistently, but marginally, better than AoR, with the advantage of an algorithmically faster inference; 3) using the Oracle (AoP+O) gives the highest performance among the single-layer models in nearly all the measures, which shows the performance upper bound for AoP. Hence, the performance gap when not using oracle attention is most likely due to the error in attention (i.e., 2% error rate); 4) removing $\mathcal{L}_{V}$ (AoP w/o $\mathcal{L}_{V}$) decreases the model performance, which confirms that a good inductive bias is important for learning how to select and combine different parameters (experts). Additionally, in Appendix A5, we report the per-domain F1-Score for SQL, BOOK, and sentences, together with the versions of Table 2 and Figure 3 that include the standard deviation among the three runs.

Furthermore, from Figure 3, we can notice that, among the models without layer recursion, MoE has the lowest perplexity and the highest F1-score, while AoP has the highest Consistency and BLEU scores. Notice that the perplexity reported in (Zhang et al., 2018) is lower since the vocabulary used in their experiments is smaller. In general, the difference in performance among the models is marginal except for the Consistency score; thus, we can conclude that all the models can learn this skill reasonably well. Consistent with the previous results, when $\mathcal{L}_{V}$ is removed from the optimization, the model's performance decreases.

Model | F1 | BLEU | SQL Acc. | SQL BLEU | BOOK Acc. | BOOK BLEU
Seq2Seq | 38.37 | 9.42 | 49.97 | 81.75 | 39.05 | 79.00
TRS | 36.91 | 9.92 | 61.96 | 89.08 | 46.51 | 78.41
MoE | 38.64 | 9.47 | 53.60 | 85.38 | 37.23 | 78.55
AoR | 40.36 | 10.66 | 69.39 | 90.64 | 52.15 | 81.15
AoP | 42.26 | 11.14 | 71.1 | 90.90 | 56.31 | 84.08
TRS + U | 39.39 | 9.29 | 61.80 | 89.70 | 50.16 | 79.05
AoP + U | 44.04 | 11.26 | 74.83 | 91.90 | 56.37 | 84.15
AoP w/o $\mathcal{L}_{V}$ | 38.50 | 10.50 | 61.47 | 88.28 | 52.61 | 80.34
AoP + O | 46.36 | 11.99 | 73.41 | 93.81 | 56.18 | 86.42
Table 2: Results for the goal-oriented responses in both MWOZ and SMD. The last row, italicized, reports the Oracle results, and bold-faced results are the best in each setting (with and without Universal). Results are averaged over three runs (full table in Appendix A6).

Finally, in both Table 2 and Figure 3, we report the results obtained by using the Universal Transformer, for both AoP and the Transformer. By adding the layer recursion, both models improve on most of the evaluated measures, in both Persona-Chat and the task-oriented tasks. This is especially true for AoP, which achieves better performance than the Oracle (i.e., single layer) in SQL accuracy, and consistently better performance in the Persona-Chat evaluation.

4 Skill Composition

To demonstrate the effectiveness of our model in learning independent skills and composing them together, we manually trigger skills by modifying the attention vector $\alpha$ and generate 14 different responses for the same input dialogue context. This experiment allows us to verify whether the model accurately captures the meaning of each skill and whether it can properly learn to compose the selected parameters (skills). Table 3 first shows the dialogue history along with the response of AoP on the top, and then different responses generated by modifying $\alpha$ (i.e., black cells correspond to 1 in the vector, while the white ones are 0). By analyzing Table 3, we can notice that:

  • The model learns the correct semantics of each skill. For instance, the AoP response is of type SQL and Train, and by deactivating the SQL skill and activating other domain skills, including Train, we can see that the responses are grammatical and coherent with the semantics of the selected skill. For instance, by just selecting Train, the generated answer becomes “what time would you like to leave?”, which is coherent with the dialogue context since such information has not yet been provided. Interestingly, when the Persona skill is selected, the generated response is conversational and also coherent with the dialogue, even though it is less fluent.

  • The model effectively learns how to compose multiple skills. For instance, when SQL or BOOK is triggered, the response follows the correct SQL syntax (e.g., “SELECT * FROM ..” etc.). By also adding the corresponding domain skill, the model generates the correct query format and attributes relevant to the domain type (e.g., in SQL, Restaurant, the model queries with the relevant attribute food for restaurants).

5 Related Work


Task-oriented dialogue models (Gao et al., 2018) can be categorized into two types: module-based (Williams and Young, 2007; Hori et al., 2009; Lee et al., 2009; Levin et al., 2000; Young et al., 2013; Wu et al., 2019a) and end-to-end. In this paper, we focus on the latter, which are systems that train a single model directly on text transcripts of dialogues. These tasks are tackled by selecting from a set of predefined utterances (Bordes and Weston, 2017; Liu and Perez, 2017; Williams et al., 2017; Seo et al., 2017) or by generating a sequence of tokens (Wen et al., 2017; Serban et al., 2016b; Zhao et al., 2017; Serban et al., 2017). Especially in the latter, copy-augmented models (Eric and Manning, 2017; Reddy et al., 2018; Yavuz et al., 2018) are very effective, since extracting entities from a knowledge base is fundamental. On the other hand, end-to-end open-domain chit-chat models have been widely studied (Serban et al., 2016a; Vinyals and Le, 2015; Wolf et al., 2019; Lin et al., 2019a, b). Several works improved on the initially reported baselines with various methodologies (Kulikov et al., 2018; Yavuz et al., 2018; Hancock et al., 2019; Zemlyanskiy and Sha, 2018; Dinan et al., 2019). Finally, (Zhao et al., 2017) was the first attempt at an end-to-end system for both task-oriented dialogue and chit-chat. However, the dataset used for the evaluation was small, the evaluation covered only a single domain, and the chit-chat ability was added manually through rules.

Sys: There are lots of trains to choose from! Where are you departing from?
Usr: I am departing from london heading to cambridge.
Sys: What time will you be travelling?
Usr: I need to arrive by 1530.
AoP: SELECT * FROM train WHERE destination=“cambridge” AND day=“monday” AND arriveBy < “1530” and departure=“london”

AoP responses using different parameter combinations
I would be happy to book that for you.
I have found you a few to choose from. what time would ..
I have many options for you. is there a certain time you …
Is there a certain area you would like?
what time would you like to leave?
okay I have two trains what time would you like to cambridge?
Where do you do for work ?
BOOK FROM taxi WHERE leaveAt>“1530” AND destination=..
BOOK FROM restaurant WHERE time=“1530” AND …
BOOK FROM hotel WHERE people=“1” AND day=“monday”
BOOK FROM train WHERE people=“1” AND id_booking=’
SELECT * FROM restaurant WHERE food=“1530” …
SELECT * FROM hotel WHERE type=“1530”
SELECT * FROM attraction WHERE name=“departure”
Table 3: Selecting different skills through the attention vector results in a skill-consistent response. The AoP response activates SQL and Train.

Mixture of Expert & Conditional Computation

The idea of having specialized parameters, or so-called experts, has been a widely studied topic in the last two decades (Jacobs et al., 1991; Jordan and Jacobs, 1994). For instance, different architectures and methodologies have been used, such as Gaussian Processes (Tresp, 2001), Hierarchical Experts (Yao et al., 2009), and sequential expert addition (Aljundi et al., 2017). More recently, the Mixture of Experts model (Shazeer et al., 2017; Kaiser et al., 2017) was proposed, which added a large number of experts between two LSTMs. To the best of our knowledge, none of these previous works applied the results of the gating function to the parameters themselves. On the other hand, there are Conditional Computation models, which learn to dynamically select their computation graph (Bengio et al., 2013; Davis and Arel, 2013). Several methods have been used, such as reinforcement learning (Bengio et al., 2016), a halting function (Graves, 2016; Dehghani et al., 2019; Figurnov et al., 2017), pruning (Lin et al., 2017; He et al., 2018), and routing/controller functions (Rosenbaum et al., 2018). However, this line of work focuses more on optimizing the inference performance of the model than on specializing parts of it for computing a certain task.

Multi-task Learning

Even though our model processes only input and output sequences of text, it actually jointly learns multiple tasks (e.g., SQL and BOOK queries, memory retrieval, and response generation); thus, it is also related to multi-task learning (Caruana, 1997). Interested readers may refer to (Ruder, 2017; Zhou et al., 2011) for a general overview of the topic. In Natural Language Processing, multi-task learning has been applied in a wide range of applications, such as parsing (Collobert et al., 2011; Hashimoto et al., 2017; Ruder et al., 2017), machine translation in multiple languages (Johnson et al., 2017), and parsing, image captioning, and machine translation (Luong et al., 2016). More interestingly, DecaNLP (McCann et al., 2018) has a large set of tasks that are cast as question answering (QA) and learned by a single model. In this work, we focus more on conversational data, but in future work we plan to include these QA tasks.

6 Conclusion

In this paper, we propose a novel way to train a single end-to-end dialogue model with multiple composable and interpretable skills. Unlike previous work, which mostly focused on representation-level mixing (Shazeer et al., 2017), our proposed approach, Attention over Parameters, learns how to softly combine independent sets of specialized parameters (e.g., for making SQL queries, conversing with a consistent persona, etc.) into a single set of parameters. By doing so, we not only achieve compositionality and interpretability but also gain algorithmically faster inference. To train and evaluate our model, we organize multi-domain task-oriented datasets into end-to-end trainable formats and combine them with a conversational dataset (i.e., Persona-Chat). Our model learns to consider each task and domain as a separate skill that can be composed with others or used independently, and we verify its interpretability and compositionality with competitive experimental results and a thorough analysis.

Several extensions of this work are possible, for example incremental learning and zero-shot skill composition. The first would be similar to Rusu et al. (2016), where skills are added over time and the model learns how to combine them with existing ones. The second is more closely related to the semantic compositionality shown in the analysis, where each skill is correctly learned and can be applied to control the generation. An interesting direction would be to learn more general skills (e.g., Machine Translation (MT) or emotional responses) and to mix them with existing skills to obtain compositional responses without labeled data.


  • R. Aljundi, P. Chakravarty, and T. Tuytelaars (2017) Expert gate: lifelong learning with a network of experts. In

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    pp. 3366–3375. Cited by: §5.
  • E. Bengio, P. Bacon, J. Pineau, and D. Precup (2016) Conditional computation in neural networks for faster models. ICLR. Cited by: §5.
  • Y. Bengio, N. Léonard, and A. Courville (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432. Cited by: §5.
  • A. Bordes and J. Weston (2017) Learning end-to-end goal-oriented dialog. International Conference on Learning Representations abs/1605.07683. Cited by: §1, §1, §3.1, §5.
  • P. Budzianowski, T. Wen, B. Tseng, I. Casanueva, S. Ultes, O. Ramadan, and M. Gasic (2018) MultiWOZ-a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 5016–5026. Cited by: 3rd item, §A.3, §A.3, §A.3, Attention over Parameters for Dialogue Systems, 3rd item, §1, §3.1, §3.1.
  • R. Caruana (1997) Multitask learning. Machine learning 28 (1), pp. 41–75. Cited by: §5.
  • K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using rnn encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734. Cited by: §2.2.
  • R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa (2011) Natural language processing (almost) from scratch. Journal of machine learning research 12 (Aug), pp. 2493–2537. Cited by: §5.
  • A. Davis and I. Arel (2013) Low-rank approximations for conditional feedforward computation in deep neural networks. arXiv. Cited by: §5.
  • M. Dehghani, S. Gouws, O. Vinyals, J. Uszkoreit, and L. Kaiser (2019) Universal transformers. ICLR. Cited by: §2.2, §5.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §3.2.
  • E. Dinan, V. Logacheva, V. Malykh, A. Miller, K. Shuster, J. Urbanek, D. Kiela, A. Szlam, I. Serban, R. Lowe, et al. (2019) The second conversational intelligence challenge (convai2). arXiv preprint arXiv:1902.00098. Cited by: §3.2, §5.
  • M. Eric, L. Krishnan, F. Charette, and C. D. Manning (2017) Key-value retrieval networks for task-oriented dialogue. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pp. 37–49. Cited by: Attention over Parameters for Dialogue Systems, 3rd item, §1, §1, §3.1, §3.1.
  • M. Eric and C. Manning (2017) A copy-augmented sequence-to-sequence architecture gives good performance on task-oriented dialogue. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, Valencia, Spain, pp. 468–473. Cited by: §1, §3.2, §5.
  • M. Figurnov, M. D. Collins, Y. Zhu, L. Zhang, J. Huang, D. Vetrov, and R. Salakhutdinov (2017) Spatially adaptive computation time for residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1039–1048. Cited by: §5.
  • J. Gao, M. Galley, and L. Li (2018) Neural approaches to conversational ai. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 1371–1374. Cited by: §5.
  • A. Graves (2013) Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850. Cited by: §2.1.
  • A. Graves (2016) Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983. Cited by: §5.
  • B. Hancock, A. Bordes, P. Mazare, and J. Weston (2019) Learning from dialogue after deployment: feed yourself, chatbot!. arXiv preprint arXiv:1901.05415. Cited by: §5.
  • K. Hashimoto, Y. Tsuruoka, R. Socher, et al. (2017) A joint many-task model: growing a neural network for multiple nlp tasks. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 1923–1933. Cited by: §5.
  • Y. He, J. Lin, Z. Liu, H. Wang, L. Li, and S. Han (2018) Amc: automl for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 784–800. Cited by: §5.
  • C. Hori, K. Ohtake, T. Misu, H. Kashioka, and S. Nakamura (2009) Statistical dialog management applied to wfst-based dialog systems. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2009. ICASSP 2009., pp. 4793–4796. Cited by: §1, §5.
  • R. A. Jacobs, M. I. Jordan, S. J. Nowlan, G. E. Hinton, et al. (1991) Adaptive mixtures of local experts.. Neural computation 3 (1), pp. 79–87. Cited by: §1, Figure 1, §2.2, §5.
  • M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. Viégas, M. Wattenberg, G. Corrado, et al. (2017) Google’s multilingual neural machine translation system: enabling zero-shot translation. Transactions of the Association for Computational Linguistics 5, pp. 339–351. Cited by: §5.
  • M. I. Jordan and R. A. Jacobs (1994) Hierarchical mixtures of experts and the em algorithm. Neural computation 6 (2), pp. 181–214. Cited by: §5.
  • L. Kaiser, A. N. Gomez, N. Shazeer, A. Vaswani, N. Parmar, L. Jones, and J. Uszkoreit (2017) One model to learn them all. arXiv. Cited by: §1, §5.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §A.4.
  • I. Kulikov, A. H. Miller, K. Cho, and J. Weston (2018) Importance of a search strategy in neural dialogue modelling. arXiv preprint arXiv:1811.00907. Cited by: §5.
  • C. Lee, S. Jung, S. Kim, and G. G. Lee (2009) Example-based dialog modeling for practical multi-domain dialog system. Speech Communication 51 (5), pp. 466–484. Cited by: §1, §5.
  • E. Levin, R. Pieraccini, and W. Eckert (2000) A stochastic model of human-machine interaction for learning dialog strategies. IEEE Transactions on speech and audio processing 8 (1), pp. 11–23. Cited by: §1, §5.
  • J. Lin, Y. Rao, J. Lu, and J. Zhou (2017) Runtime neural pruning. In Advances in Neural Information Processing Systems, pp. 2181–2191. Cited by: §5.
  • Z. Lin, A. Madotto, J. Shin, P. Xu, and P. Fung (2019a) MoEL: mixture of empathetic listeners. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 121–132. Cited by: §5.
  • Z. Lin, P. Xu, G. I. Winata, Z. Liu, and P. Fung (2019b) Caire: an end-to-end empathetic chatbot. arXiv preprint arXiv:1907.12108. Cited by: §5.
  • F. Liu and J. Perez (2017) Gated end-to-end memory networks. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, Valencia, Spain, pp. 1–10. Cited by: §5.
  • T. Luong, Q. V. Le, I. Sutskever, O. Vinyals, and L. Kaiser (2016) Multi-task sequence to sequence learning. In International Conference on Learning Representations, Cited by: §5.
  • T. Luong, H. Pham, and C. D. Manning (2015) Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 1412–1421. Cited by: §2.2.
  • A. Madotto, Z. Lin, C. Wu, and P. Fung (2019) Personalizing dialogue agents via meta-learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5454–5459. Cited by: §3.2.
  • A. Madotto, C. Wu, and P. Fung (2018) Mem2seq: effectively incorporating knowledge bases into end-to-end task-oriented dialog systems. arXiv preprint arXiv:1804.08217. Cited by: §3.2.
  • B. McCann, N. S. Keskar, C. Xiong, and R. Socher (2018) The natural language decathlon: multitask learning as question answering. arXiv preprint arXiv:1806.08730. Cited by: §1, §5.
  • A. Miller, A. Fisch, J. Dodge, A. Karimi, A. Bordes, and J. Weston (2016) Key-value memory networks for directly reading documents. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1400–1409. Cited by: §2.2.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) Bleu: a method for automatic evaluation of machine translation. In Proceedings of 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, pp. 311–318. Cited by: §3.2.
  • J. Pennington, R. Socher, and C. Manning (2014) Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543. Cited by: §A.4.
  • R. Reddy, D. Contractor, D. Raghu, and S. Joshi (2018) Multi-level memory for task oriented dialogs. arXiv preprint arXiv:1810.10647. Cited by: §1, §2, §5.
  • C. Rosenbaum, T. Klinger, and M. Riemer (2018) Routing networks: adaptive selection of non-linear functions for multi-task learning. In International Conference on Learning Representations. Cited by: §5.
  • S. Ruder, J. Bingel, I. Augenstein, and A. Søgaard (2017) Learning what to share between loosely related tasks. arXiv preprint arXiv:1705.08142. Cited by: §5.
  • S. Ruder (2017) An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098. Cited by: §5.
  • A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell (2016) Progressive neural networks. arXiv preprint arXiv:1606.04671. Cited by: §6.
  • J. Schmidhuber (1987) Evolutionary principles in self-referential learning. on learning how to learn: the meta-meta-meta…-hook. Diploma Thesis, Technische Universitat Munchen, Germany. Cited by: §1.
  • W. Sean, J. Weston, A. Szlam, and K. Cho (2018) Dialogue natural language inference. arXiv preprint arXiv:1811.00671. Cited by: §3.2.
  • A. See, P. J. Liu, and C. D. Manning (2017) Get to the point: summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1073–1083. Cited by: §2.1, §3.3.
  • M. Seo, S. Min, A. Farhadi, and H. Hajishirzi (2017) Query-reduction networks for question answering. International Conference on Learning Representations. Cited by: §5.
  • I. V. Serban, R. Lowe, L. Charlin, and J. Pineau (2016a) Generative deep neural networks for dialogue: a short review. arXiv preprint arXiv:1611.06216. Cited by: §1, §5.
  • I. V. Serban, A. Sordoni, Y. Bengio, A. C. Courville, and J. Pineau (2016b) Building end-to-end dialogue systems using generative hierarchical neural network models.. In AAAI, pp. 3776–3784. Cited by: §5.
  • I. V. Serban, A. Sordoni, R. Lowe, L. Charlin, J. Pineau, A. C. Courville, and Y. Bengio (2017) A hierarchical latent variable encoder-decoder model for generating dialogues.. In AAAI, pp. 3295–3301. Cited by: §5.
  • N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean (2017) Outrageously large neural networks: the sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538. Cited by: Figure 6, §1, §2.2, §3.3, §5, §6.
  • V. Tresp (2001) Mixtures of gaussian processes. In Advances in neural information processing systems, pp. 654–660. Cited by: §5.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 6000–6010. Cited by: §A.4, §1, §2.1, §3.3.
  • O. Vinyals and Q. V. Le (2015) A neural conversational model. arXiv preprint arXiv:1506.05869. Cited by: §1, §5.
  • T. Wen, M. Gasic, N. Mrksic, L. M. Rojas-Barahona, P. Su, S. Ultes, D. Vandyke, and S. J. Young (2017) A network-based end-to-end trainable task-oriented dialogue system. In EACL, Cited by: §3.2, §5.
  • J. D. Williams, K. Asadi, and G. Zweig (2017) Hybrid code networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 665–677. Cited by: §5.
  • J. D. Williams and S. Young (2007) Partially observable markov decision processes for spoken dialog systems. Computer Speech & Language 21 (2), pp. 393–422. Cited by: §1, §5.
  • R. J. Williams and D. Zipser (1989) A learning algorithm for continually running fully recurrent neural networks. Neural computation 1 (2), pp. 270–280. Cited by: §2.1.
  • T. Wolf, V. Sanh, J. Chaumond, and C. Delangue (2019) TransferTransfo: a transfer learning approach for neural network based conversational agents. arXiv preprint arXiv:1901.08149. Cited by: §A.1, §5.
  • C. Wu, A. Madotto, E. Hosseini-Asl, C. Xiong, R. Socher, and P. Fung (2019a) Transferable multi-domain state generator for task-oriented dialogue systems. arXiv preprint arXiv:1905.08743. Cited by: §5.
  • C. Wu, A. Madotto, G. Winata, and P. Fung (2017) End-to-end recurrent entity network for entity-value independent goal-oriented dialog learning. In Dialog System Technology Challenges Workshop, DSTC6, Cited by: §3.2.
  • C. Wu, R. Socher, and C. Xiong (2019b) Global-to-local memory pointer networks for task-oriented dialogue. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §1, §2.
  • B. Yao, D. Walther, D. Beck, and L. Fei-Fei (2009) Hierarchical mixture of classification experts uncovers interactions between brain regions. In Advances in Neural Information Processing Systems, pp. 2178–2186. Cited by: §5.
  • S. Yavuz, A. Rastogi, G. Chao, D. Hakkani-Tür, and A. A. AI (2018) DEEPCOPY: grounded response generation with hierarchical pointer networks. Conversational AI NIPS workshop. Cited by: §1, §5.
  • S. Young, M. Gašić, B. Thomson, and J. D. Williams (2013) Pomdp-based statistical spoken dialog systems: a review. Proceedings of the IEEE 101 (5), pp. 1160–1179. Cited by: §1, §5.
  • Y. Zemlyanskiy and F. Sha (2018) Aiming to know you better perhaps makes me a more engaging dialogue partner. CoNLL 2018, pp. 551. Cited by: §5.
  • S. Zhang, E. Dinan, J. Urbanek, A. Szlam, D. Kiela, and J. Weston (2018) Personalizing dialogue agents: i have a dog, do you have pets too?. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2204–2213. Cited by: Attention over Parameters for Dialogue Systems, 3rd item, §1, §3.1, §3.1, §3.4.
  • T. Zhao, A. Lu, K. Lee, and M. Eskenazi (2017) Generative encoder-decoder models for task-oriented spoken dialog systems with chatting capability. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pp. 27–36. Cited by: §1, §3.2, §5.
  • J. Zhou, J. Chen, and J. Ye (2011) Clustered multi-task learning via alternating structure optimization. In Advances in neural information processing systems, pp. 702–710. Cited by: §5.

Appendix A Appendix

A.1 Embedded Representation

Figure 4: Positional Embedding of the dialogue history and the memory content.

Since the model input may include structured data (e.g. DB records) we further define another embedding matrix for encoding the types and the segments as where is the set of positional tokens and its cardinality. is used to inform the model of the token types such as speaker information (e.g. Sys and Usr), the data-type for the memory content (e.g. Miles, Traffic etc.), and segment types like dialogue turn information and database record index [Wolf et al., 2019]. Figure 4 shows an example of the embedded representation of the input. Hence, we denote and as the type and segment tokens for each token in input , respectively.

Figure 5: Attention over Parameters visualization, vector for different reference (Ref.) and AoP generated answers. The top rows (Usr) are the last utterances from each dialogue context.

A.2 Attention Visualization

Figure 5 shows the attention vector over parameters for different generated sentences. In this figure, and by analyzing more examples (available in the supplementary material and later online), we can identify two patterns:

  • AoP learns to focus on the correct skills (i.e., SQL, BOOK) when API calls are needed. From the first example in Figure 5, we can see that the activations of the predicted attention vector are consistent with those of the correct (reference) attention vector. There are also false positives, in which AoP puts too much weight on BOOK when the correct response is plain text that should request more information from the user (i.e., "i can help you with that. when would you like to leave the hotel?"). However, this example is in fact "almost correct", as triggering a booking API call may also be considered a valid response. Meanwhile, the third example also fails to attend to the correct skill but nonetheless generates a very fluent and relevant response. This is most likely because the answer is simple and generic.

  • The attention often focuses on multiple skills not directly relevant to the task. We observe this pattern especially when other skill-related entities are mentioned in the context or the response. For example, in the second dialogue example in Figure 5, AoP not only accurately focuses on the taxi domain, but also has non-negligible activations for restaurant and hotel. This is because the words "hotel" and "restaurant" are both mentioned in the dialogue context and the model has to produce two entities of the same type (i.e., finches bed and breakfast and ask).

A.3 Data Pre-Processing

As mentioned in the main article, we convert MultiWOZ into an end-to-end trainable dataset. This requires adding SQL-syntax queries when the system includes particular entities. To do so, we leverage two annotations: the state tracker and the speech acts. The first is used to generate a well-formed query, including key and attribute; the second is used to decide when to include the query. More details on the dialogue state-tracker slots and slot values, and on the different speech acts, can be found in [Budzianowski et al., 2018].

A query is created from the slots, and their values, that have been updated in the latest turn. The SQL query uses the following syntax:

Similarly, for the booking API BOOK the syntax is the following:

In both cases the slot values are kept as real entities.

Deciding when to issue such API calls is more challenging. We use the speech acts for this decision, specifically the "INFORM-DOMAIN" and "RECOMMEND-DOMAIN" tags. Any response that includes one of those speech-act tags triggers an API call if and only if:

  • there has been a change in the state-tracker from the previous turn

  • the produced query has never been issued before
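The triggering rule above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' preprocessing script: the tag format `INFORM-<DOMAIN>`, the dictionary representation of the tracker state, and the function names `updated_slots` and `should_issue_api` are all hypothetical, and the query is treated as an opaque hashable key because the exact query syntax is not reproduced here.

```python
# Speech-act prefixes named in the text; the "PREFIX-DOMAIN" tag format is an assumption.
TRIGGER_ACTS = ("INFORM", "RECOMMEND")


def updated_slots(prev_state, curr_state):
    """Slots whose value changed (or appeared) in the latest turn."""
    return {k: v for k, v in curr_state.items() if prev_state.get(k) != v}


def should_issue_api(speech_acts, prev_state, curr_state, query, issued_queries):
    """Apply the rule: trigger tag present, tracker state changed, query not issued before."""
    has_trigger_act = any(
        act.split("-")[0].upper() in TRIGGER_ACTS for act in speech_acts
    )
    state_changed = bool(updated_slots(prev_state, curr_state))
    is_new_query = query not in issued_queries
    return has_trigger_act and state_changed and is_new_query
```

Keeping the set of already issued queries per dialogue is one simple way to enforce the second condition; any equivalent bookkeeping would do.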

Manual inspection suggests that this strategy is effective. However, as reported by [Budzianowski et al., 2018], the speech act annotation includes some noise, which is also reflected in our dataset.

The results of an SQL query can contain more than 1K records with multiple attributes. Following [Budzianowski et al., 2018], we use the following strategy:

  • If there is no INFORM or RECOMMEND speech act and the number of records is more than 5, we use a special token in the memory.

  • If there is no INFORM or RECOMMEND speech act and the number of records is less than or equal to 5, we put all the records in memory.

  • If there is any INFORM or RECOMMEND speech act, we filter the records to include based on the act value. Notice that this is a fair strategy, since all the resulting records are correct possible answers and the annotators pick one of the records randomly [Budzianowski et al., 2018].
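The three cases above amount to a small decision procedure, sketched below. Several details are assumptions made only for illustration: the placeholder name `<many_results>` (the actual special token is not given in this text), the record and act-value dictionaries, and the exact-match filter used for the third case.

```python
MAX_RECORDS = 5                  # threshold stated in the text
PLACEHOLDER = "<many_results>"   # hypothetical name for the special memory token


def build_memory(records, speech_acts, act_values):
    """Choose what goes into the memory for one system turn."""
    has_act = any(
        act.split("-")[0].upper() in ("INFORM", "RECOMMEND") for act in speech_acts
    )
    if has_act:
        # Assumed filter: keep records that agree with every value in the speech act.
        return [
            r for r in records
            if all(r.get(k) == v for k, v in act_values.items())
        ]
    if len(records) > MAX_RECORDS:
        # Too many candidates and nothing to filter on: store only the placeholder.
        return [PLACEHOLDER]
    return list(records)
```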

Notice that the answer to a booking call is instead only one record containing the booking information (e.g., reference number, taxi plate, etc.) or a "Not Available" token in case the booking cannot be made.

A.4 Hyper-parameters and Training

We used a standard Transformer architecture [Vaswani et al., 2017] with pre-trained GloVe embeddings [Pennington et al., 2014]. For both Seq2Seq and MoE we use the Adam [Kingma and Ba, 2014] optimizer with a learning rate of , whereas for the Transformer we use a warm-up learning rate strategy as in [Vaswani et al., 2017]. In both AoP and AoR we use an additional transformer layer on top of the output of the model. Figures 6, 7, and 8 show the high-level design of MoE, AoR, and AoP, respectively. In all the models we use a batch size of 16, and we early-stop training using the validation set. All the experiments were conducted using a single Nvidia 1080ti.
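For reference, the warm-up strategy of Vaswani et al. (2017) increases the learning rate linearly for a fixed number of steps and then decays it with the inverse square root of the step count. The sketch below uses a hidden size of 300, matching the Transformer rows of Table 4; the value `warmup_steps=4000` is the default from Vaswani et al. and is only an assumption here, since the warm-up length used in these experiments is not stated.

```python
def warmup_lr(step, d_model=300, warmup_steps=4000):
    """Learning rate schedule from Vaswani et al. (2017).

    warmup_steps=4000 is that paper's default and an assumption for this sketch.
    """
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

The rate peaks at `step == warmup_steps` and is smaller on either side of it.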

We used a small grid search to tune each model. The selected hyper-parameters are reported in Table 4; we run each experiment 3 times and report the mean and standard deviation of each result.
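As a small worked example of this reporting, the snippet below summarizes three runs with the mean and the sample standard deviation. The run scores are hypothetical, and the use of the sample (n-1) estimator is an assumption; it is at least consistent with entries of the form 6.33 with deviation 0.58 in the Persona results, which is what three scores such as 6, 6, 7 would produce.

```python
from statistics import mean, stdev


def summarize(runs):
    """Return (mean, sample standard deviation), both rounded to two decimals."""
    return round(mean(runs), 2), round(stdev(runs), 2)


# Hypothetical F1 scores from three runs (not taken from the paper).
print(summarize([6, 6, 7]))  # (6.33, 0.58)
```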

Model      d    d    Layers  Head  Depth  Filter  GloVe  Experts
Seq2Seq    100  100  1       -     -      -       Yes    -
TRS        300  300  1       2     40     50      Yes    -
MoE        100  100  2       -     -      -       Yes    13
AoP/AoR    300  300  1       2     40     50      Yes    13
TRS/AoP+U  300  300  6       2     40     50      Yes    13
Table 4: Hyper-Parameters used for the evaluations.

A.5 MWOZ and SMD with Std.

Seq2Seq   38.37 ± 1.69   9.42 ± 0.38   49.97 ± 3.49   81.75 ± 2.54   39.05 ± 9.52   79.00 ± 3.63
TRS       36.91 ± 1.24   9.92 ± 0.43   61.96 ± 3.95   89.08 ± 1.23   46.51 ± 5.46   78.41 ± 2.03
MoE       38.64 ± 1.11   9.47 ± 0.59   53.60 ± 4.62   85.38 ± 2.68   37.23 ± 3.89   78.55 ± 2.62
AoR       40.36 ± 1.39   10.66 ± 0.34  69.39 ± 1.05   90.64 ± 0.83   52.15 ± 2.22   81.15 ± 0.32
AoP       42.26 ± 2.39   11.14 ± 0.39  71.1 ± 0.47    90.90 ± 0.81   56.31 ± 0.46   84.08 ± 0.99
TRS + U   39.39 ± 1.23   9.29 ± 0.71   61.80 ± 4.82   89.70 ± 1.40   50.16 ± 1.18   79.05 ± 1.42
AoP + U   44.04 ± 0.92   11.26 ± 0.07  74.83 ± 0.79   91.90 ± 1.03   56.37 ± 0.92   84.15 ± 0.32
AoP w/o   38.50 ± 1.15   10.50 ± 0.55  61.47 ± 0.15   88.28 ± 0.50   52.61 ± 0.56   80.34 ± 0.21
AoP+O     46.36 ± 0.92   11.99 ± 0.03  73.41 ± 0.59   93.81 ± 0.16   56.18 ± 1.55   86.42 ± 0.92

A.6 Persona Result with Std

Model     Ppl.           F1            C             BLEU
Seq2Seq   39.42 ± 1.54   6.33 ± 0.58   0.11 ± 0.06   2.80 ± 0.09
TRS       43.12 ± 1.46   7.00 ± 0.00   0.07 ± 0.16   2.56 ± 0.07
MoE       38.63 ± 0.20   7.33 ± 0.05   0.19 ± 0.16   2.92 ± 0.48
AoR       40.18 ± 0.74   6.66 ± 0.05   0.12 ± 0.14   2.69 ± 0.34
AoP       39.14 ± 0.48   7.00 ± 0.00   0.21 ± 0.05   3.06 ± 0.08
TRS + U   43.04 ± 1.78   7.33 ± 0.57   0.15 ± 0.02   2.66 ± 0.43
AoP + U   37.40 ± 0.08   7.00 ± 0.00   0.29 ± 0.07   3.22 ± 0.04
AoP w/o   42.81 ± 0.01   6.66 ± 0.57   0.12 ± 0.04   2.85 ± 0.21
AoP + O   40.16 ± 0.56   7.33 ± 0.58   0.21 ± 0.14   2.98 ± 0.05

A.7 Domain F1-Score

Sentence     Seq2Seq  MoE    TRS    AoR    AoP    AoP+O
Taxi         71.77    75.97  73.92  76.07  76.58  78.30
Police       49.73    49.95  50.24  51.95  56.61  52.05
Restaurant   50.20    49.59  48.34  50.58  50.47  50.90
Hotel        46.82    45.37  43.38  45.51  46.40  44.47
Attraction   37.87    35.21  33.10  36.97  38.79  37.51
Train        46.02    41.72  41.28  44.33  46.32  45.93
Weather      40.38    27.06  18.97  44.77  51.94  55.23
Schedule     35.98    43.94  38.95  32.90  54.18  52.99
Navigate     18.57    21.34  6.96   12.69  12.18  16.56
Taxi         23.16    32.28  30.70  41.93  46.66  43.15
Restaurant   45.02    28.26  49.72  55.70  58.51  57.70
Hotel        49.22    31.48  41.61  51.46  56.62  57.41
Train        55.86    56.38  57.51  57.51  59.15  60.80
Police       81.33    0.00   90.66  76.00  93.33  100.0
Restaurant   71.58    68.00  75.90  81.27  80.43  84.15
Hospital     62.22    15.55  58.89  71.11  76.67  83.33
Hotel        45.25    42.09  48.61  56.69  59.75  63.75
Attraction   65.48    67.69  65.91  70.61  76.22  74.93
Train        30.02    41.01  55.67  66.61  67.34  69.50
Table 5: Per Domain F1 Score.
Figure 6: The Mixture of Experts (MoE) [Shazeer et al., 2017] model consists of feed-forward neural networks (experts) embedded between two LSTM layers, and a trainable gating network that selects the experts.
Figure 7: Attention over Representation (AoR) consists of a transformer encoder which encodes the source input and computes the attention over the skills. Then, the transformer decoder layers compute specialized representations, and the output response is generated based on the weighted sum of the representations. In the figure, we omit the output layer.
Figure 8: Attention over Parameters (AoP) consists of a transformer encoder which encodes the source input and computes the attention over the skills, followed by specialized transformer decoder layers and a dummy transformer decoder layer parameterized by the weighted sum of the parameters of the specialized transformer decoder layers. In the figure, we omit the output layer.
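
The difference between the AoR and AoP captions can be made concrete with a toy example. The sketch below is not the authors' implementation: a single `ReLU(Wx)` layer stands in for a transformer decoder layer, the attention is assumed to come from a softmax over hypothetical skill logits, and all function names are illustrative. AoR runs every skill layer and mixes the outputs, while AoP mixes the weights first and runs one layer.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of skill logits (assumed attention form)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def layer(weights, x):
    """Toy stand-in for a decoder layer: ReLU(W x)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]


def attention_over_parameters(skill_weights, alpha, x):
    """AoP: take the weighted sum of the skill parameters, then run a single layer."""
    rows, cols = len(skill_weights[0]), len(skill_weights[0][0])
    mixed = [
        [sum(a * W[i][j] for a, W in zip(alpha, skill_weights)) for j in range(cols)]
        for i in range(rows)
    ]
    return layer(mixed, x)


def attention_over_representations(skill_weights, alpha, x):
    """AoR: run every skill layer, then take the weighted sum of the outputs."""
    outputs = [layer(W, x) for W in skill_weights]
    return [sum(a * out[i] for a, out in zip(alpha, outputs)) for i in range(len(outputs[0]))]


# Two hypothetical skills with opposite weights, attended equally.
skills = [[[1.0, 0.0]], [[-1.0, 0.0]]]
alpha = softmax([0.0, 0.0])
print(attention_over_parameters(skills, alpha, [2.0, 1.0]))       # [0.0]
print(attention_over_representations(skills, alpha, [2.0, 1.0]))  # [1.0]
```

Because of the non-linearity the two schemes generally give different outputs, and they coincide when the attention is one-hot. In this sketch AoP needs one layer evaluation regardless of the number of skills, whereas AoR needs one per skill.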