NUANCED: Natural Utterance Annotation for Nuanced Conversation with Estimated Distributions

Existing conversational systems are mostly agent-centric, which assumes the user utterances would closely follow the system ontology (for NLU or dialogue state tracking). However, in real-world scenarios, it is highly desirable that the users can speak freely in their own way. It is extremely hard, if not impossible, for the users to adapt to the unknown system ontology. In this work, we attempt to build a user-centric dialogue system. As there is no clean mapping for a user's free form utterance to an ontology, we first model the user preferences as estimated distributions over the system ontology and map the users' utterances to such distributions. Learning such a mapping poses new challenges on reasoning over existing knowledge, ranging from factoid knowledge, commonsense knowledge to the users' own situations. To this end, we build a new dataset named NUANCED that focuses on such realistic settings for conversational recommendation. Collected via dialogue simulation and paraphrasing, NUANCED contains 5.1k dialogues, 26k turns of high-quality user responses. We conduct experiments, showing both the usefulness and challenges of our problem setting. We believe NUANCED can serve as a valuable resource to push existing research from the agent-centric system to the user-centric system. The code and data will be made publicly available.





1 Introduction

Figure 1: Examples of a traditional dataset and nuanced: in real-world scenarios, free-form user utterances often mismatch the system ontology. In nuanced, we model the user preferences (or dialogue state) as distributions over the ontology, which allows entities unknown to the system to be mapped to multiple values and slots for efficient conversation.

Conversational artificial intelligence (ConvAI) is one of the long-standing research problems in natural language processing. With the recent surge of neural models, there has been increasing interest in building intelligent dialogue agents to assist users, such as task-oriented dialogue systems DBLP:conf/icml/WenMBY17; DBLP:conf/acl/MrksicSWTY17; DBLP:conf/emnlp/BudzianowskiWTC18; DBLP:journals/corr/abs-2005-00796, conversational recommendation systems DBLP:conf/recsys/FuXZZ20; DBLP:conf/sigir/SunZ18; DBLP:conf/cikm/ZhangCA0C18, and chit-chat adiwardana2020towards; roller2020recipes. However, most existing dialogue systems are agent-centric. Such systems require users to adapt unnaturally to the system ontology, and even to face a learning curve on it, although it is largely unknown to them (consider the sample instructions for most smart speakers). For example, Figure 1 shows a dialogue snippet commonly found in traditional datasets: the user is expected to closely follow the system ontology and provide the exact ontology terms, or at most minor variations such as synonyms.

In real-world use cases, this formulation suffers from the following issues. (1) It easily results in information loss, or breaks a conversation, if the user says anything outside the system ontology; (2) it greatly limits the expressive power of the user, who must follow a rigid and complex routine dictated by the system ontology. As an example from conversational recommendation in the restaurant domain, a user may prefer to speak freely, as in “I want to have a date wearing my shorts”, where “date” and “shorts” may not be defined in the system ontology but strongly indicate the slot-values ambience=romantic and attire=casual simultaneously.

In this work, we argue that a smart agent should ideally be more user-centric: it should allow the user to speak in their own way without restrictions. The system is expected to understand the user’s utterances in various forms and, more importantly, to reason about the connection between an utterance and one or more values and slots defined by the system ontology. This further unleashes the expressive power of the user’s utterances and thus simplifies the conversation. The lower part of Figure 1 gives examples of such free-form user utterances. In the first turn, the entity “ramen” is in the user’s mind but (we assume) not in the system ontology. From this single word “ramen”, we humans can naturally (using our world knowledge) estimate this user’s preference over multiple values of the slot “Category” defined by the system ontology. In the third turn, what is even more powerful is that words such as “blog”, “laptop” and “martini” naturally picture the scene that the user wishes to have, and imply two slots with multiple values in addition to what the system queried in the second turn. As such, it is highly desirable that a system be capable of reasoning about and mapping the user’s entities to the system ontology as we humans do. This is also a viable way of narrowing the gap between chit-chat and task-oriented dialogue.

To build a user-centric dialogue system, we propose to model the mapping from free-form user utterances to the system ontology as probability distributions over the system ontology. We demonstrate that such a representation captures fine-grained user preferences much more easily. To learn the probability distributions, we construct a new dataset, named nuanced (Natural Utterance Annotation for Nuanced Conversation with Estimated Distributions). nuanced targets conversational recommendation because this type of dialogue system encourages more modeling of soft matching and implicit reasoning about user preferences, although the idea is generally applicable to other forms of dialogue systems. We employ professional linguists to annotate the dataset, and end up with 5.1k dialogues and 26k turns of high-quality user utterances. Our dataset captures a wide range of phenomena naturally occurring in realistic user utterances, including specific factoid knowledge, commonsense knowledge and users’ own situations. We conduct comprehensive experiments and analyses to demonstrate the challenges in our new dataset. We hope nuanced can serve as a valuable resource to help bridge the gap between current dialogue systems and real-world applications. To summarize, we make the following contributions:

  • We study the important problem of building user-centric conversational agents, through modeling the mapping between the system ontology and the user utterances as probability distributions.

  • We propose a new large-scale dataset, nuanced, with high-quality user utterances paired with estimated preference distributions. The user utterances involve complex reasoning types, which raise more challenges in modeling user preferences.

  • We conduct in-depth experiments and analysis on our new dataset to show insights, challenges, and open problems for future research.

2 Related Work

2.1 Dialogue System

In recent years there has been a surge of work on conversational artificial intelligence (ConvAI). Task-oriented dialogue systems are typically divided into several sub-modules, including user intent detection DBLP:conf/interspeech/LiuL16; DBLP:conf/naacl/GangadharaiahN19, dialogue state tracking DBLP:conf/acl/MrksicSWTY17; DBLP:conf/asru/RastogiHH17; DBLP:conf/sigdial/HeckNLGLMG20, dialogue policy learning DBLP:conf/emnlp/PengLLGCLW17; DBLP:conf/acl/SuGMRUVWY16, and response generation DBLP:conf/inlg/DusekNR18; DBLP:conf/emnlp/WenGMSVY15. More recent approaches build unified models that bring the pipeline together DBLP:conf/acl/ChenCQYW19; DBLP:journals/corr/abs-2005-00796. Conversational recommendation focuses on combining the recommendation system with online conversation to capture user preferences DBLP:conf/recsys/FuXZZ20; DBLP:conf/sigir/SunZ18; DBLP:conf/cikm/ZhangCA0C18. Previous works mostly focus on learning the agent-side policy to ask the right questions and make accurate recommendations, such as DBLP:conf/nips/LiKSMCP18; DBLP:conf/emnlp/KangBSCBW19; DBLP:journals/corr/abs-2006-00184; DBLP:conf/kdd/LeiZ0MWCC20; DBLP:journals/corr/abs-2005-12979; DBLP:conf/recsys/PenhaH20. Chit-chat adiwardana2020towards; roller2020recipes is the most free-form kind of dialogue, but has almost no knowledge grounding or state tracking. Both existing task-oriented and conversational recommendation systems have a pre-defined system ontology as a representation connected to the back-end database. The ontology defines all entity attributes as slots, together with the option values for each slot. In existing datasets, such as the DSTC challenges DBLP:journals/aim/WilliamsHRTBR14, Multi-WOZ DBLP:conf/emnlp/BudzianowskiWTC18, MGConvRex DBLP:journals/corr/abs-2006-00184, etc., the user utterances mostly follow the system ontology.
While parsing user utterances into dialogue states in task-oriented dialogue systems relies more on hard matching, conversational recommendation systems encourage soft matching, since user preferences are more salient and diverse in this type of conversation. In this work, we encourage implicit reasoning over the grounded system ontology for both hard and soft matching, to maximize freedom on the user side.

2.2 Dialogue State Tracking

As a core component in dialogue systems, dialogue state tracking (DST) estimates the state of the conversation in the form of a set of discrete <domain-slot,value> pairs, which is passed to the back-end system for database query and dialogue policy generation. Traditional approaches to DST mostly employ feature engineering and domain-specific lexicons DBLP:conf/sigdial/HendersonTY14; DBLP:conf/slt/SunCZY14. Recent neural approaches DBLP:conf/acl/MrksicSWTY17; DBLP:conf/acl/WuMHXSF19 have shown promising results on many benchmarks. They can be roughly categorized into classification over pre-defined slot-values DBLP:conf/acl/SocherZX18, span prediction from the dialogue context DBLP:conf/sigdial/GaoSACH19, hybrid approaches DBLP:journals/corr/abs-1910-03544, and generative approaches DBLP:conf/acl/WuMHXSF19, etc. However, especially for applications with strong freedom, diversity and uncertainty in user utterances, such as recommendation tasks, a state in the discrete labeling or span space may not be able to capture nuanced implicit reasoning during a conversation. For example, in the dialogue snippet from Figure 1, given the utterance “update blog on laptop”, we cannot take a span prediction as the state because the span is not defined by the system ontology. In this work, we propose to expand the state representation to continuous probability distributions over the system ontology, which allows for a larger scope of grounding and implicit reasoning from free-style user utterances.

2.3 Entity Linking and Taxonomy Construction

Our work is also related to studies in entity linking and taxonomy construction. Entity linking is the task of linking entity mentions in text with their corresponding entities in a knowledge base DBLP:journals/tkde/ShenWH15; DBLP:journals/ai/HacheyRNHC13; DBLP:conf/sigir/HanSZ11. The key difference is that our work has no clear one-to-one mapping, but rather implicit reasoning from multiple entities in utterances to multiple entities defined in the system ontology. Taxonomy construction focuses on organizing entities or concepts into hierarchical categories DBLP:conf/kdd/LiuSLW12; DBLP:conf/kdd/ShenWLZRVS018; DBLP:conf/emnlp/LuuKN14. In our work, one of the reasoning types requires resolving entities in the user utterances and mapping them to the system ontology. Part of this is conceptually similar to taxonomy construction (or requires an external taxonomy, as discussed in the last section); e.g., in Figure 1, we need the knowledge that the entity “ramen” has the upper-level entities “Japanese”, “Chinese”, and “Korea”. In addition to such hierarchical relations, we use distributions to capture more fine-grained user preferences, e.g., that the user is more likely to choose a Japanese restaurant that serves ramen.

3 The nuanced Dataset

To build the mappings between the system ontology and the user utterances as probability distributions, our solution is to collect a large-scale dataset and learn the estimated distributions. In this work, we start from conversational recommendation in the restaurant domain, since the user’s preferences are more salient in this type of dialogue system. In this section, we first give the formulation of the preference distribution in §3.1. Then we describe the dataset construction process in §3.2 and the statistics and analysis in §3.3.

3.1 User Preference Modeling

For a given system ontology, without loss of generality, we denote the set of all slots as $S$. For each slot $s \in S$, denote the set of its option values as $V_s$. In a dialogue between user and agent, denote the current user utterance as $u$ and the dialogue context (of past turns) as $c$. In the realistic setting where $u$ does not necessarily contain grounded ontology terms, we model the user preference as a distribution over each slot-value, namely the preference distribution:

$$P(s = v \mid u, c), \quad s \in S,\ v \in V_s$$

Note that we expect the representation to be general and expandable, and to make the fewest assumptions, i.e., there is no assumption on the dependency among slot-values, nor on the completeness of the value set. Therefore we model the distribution as a Bernoulli distribution over each slot-value, independent of the others. Intuitively, $P(s = v \mid u, c)$ represents the probability that the user chooses an item with attribute $s = v$, under the observed condition of the dialogue up to the current turn. Note that preference distributions may differ among individuals, which causes variance. For example, for the same situation ‘I need to download some files there before work’, some people may lean more toward paid wifi because they want high speed, while others may not care. In this work, we aim to aggregate estimated distributions from large-scale data collected from multiple workers as “commonsense” distributions. We leave modeling user-specific distributions to future work.
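As a minimal sketch of the representation described above, the state can be stored as one independent Bernoulli parameter per slot-value. The toy ontology and the helper names `empty_state` and `update_state` are hypothetical and only serve as illustration; they are not taken from the dataset release.

```python
from typing import Dict, List

# Hypothetical toy ontology (illustration only): slot -> option values.
ONTOLOGY: Dict[str, List[str]] = {
    "wifi": ["free", "paid", "no"],
    "alcohol": ["full_bar", "beer_and_wine", "dont_serve"],
}

# One independent Bernoulli parameter per slot-value.
PreferenceState = Dict[str, Dict[str, float]]


def empty_state(ontology: Dict[str, List[str]]) -> PreferenceState:
    """Start with probability 0.0 for every slot-value."""
    return {slot: {v: 0.0 for v in values} for slot, values in ontology.items()}


def update_state(state: PreferenceState, slot: str,
                 probs: Dict[str, float]) -> PreferenceState:
    """Return a copy of the state with new probabilities for one slot.

    Values are independent Bernoulli parameters, so each must lie in
    [0, 1], but they are not required to sum to 1.
    """
    for value, p in probs.items():
        if value not in state[slot]:
            raise KeyError(f"{value!r} is not an option of slot {slot!r}")
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability {p} is outside [0, 1]")
    new_state = {s: dict(vs) for s, vs in state.items()}
    new_state[slot].update(probs)
    return new_state


state = update_state(empty_state(ONTOLOGY), "wifi", {"free": 0.7, "paid": 0.3})
print(state["wifi"])
```

Because no normalization is enforced, adding a new value to a slot does not require touching the probabilities already stored for the other values.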

3.2 Dataset Construction

To construct our dataset, we first simulate the dialogue flow with the user turns filled in as preference distributions; we then ask human annotators to compose (or paraphrase) utterances that imply the given distributions. We adopt this process of simulation followed by rewriting to cover abundant cases of preference distributions and to reduce annotation bias, as suggested by previous work DBLP:journals/corr/abs-1801-04871; DBLP:conf/naacl/ShahHLT18.

3.2.1 Dialogue Simulator

To obtain more valid and natural distributions, we start from the real user data in the MGConvRex dataset DBLP:journals/corr/abs-2006-00184. Specifically, each user has a visiting history, given as a list of restaurants with their corresponding slot-values; we randomly sample a subset of this history and aggregate it to get a value distribution for each slot. Using the sampled distribution as the ground truth distribution, we simulate dialogue skeletons for the following scenarios: 1) Straight dialogue flow: the system asks about each slot, followed by the user response filled in as the corresponding preference distributions; 2) User updating preference: the user updates the preference distributions given in a previous turn; 3) System yes/no questions: the system can choose to ask confirmation questions based on the user history.

For each turn, we randomly select 1 to 3 slots, corresponding to the cases that the user utterances naturally (and powerfully) imply multiple slot-values. Since in this work our focus is on natural language understanding or dialogue state tracking, we adopt the above simple strategy and do not consider a complex policy on the system side. The system turns are composed using templates.
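The sampling-and-aggregation step can be sketched as follows. This is an illustration under stated assumptions rather than the simulator itself: the aggregation is assumed to be a relative frequency over the sampled restaurants, and the function names, the toy history and the fixed seed are all hypothetical.

```python
import random
from collections import Counter
from typing import Dict, List


def aggregate_slot_distribution(history: List[Dict[str, str]], slot: str,
                                sample_size: int,
                                rng: random.Random) -> Dict[str, float]:
    """Sample visited restaurants and return relative value frequencies.

    Assumption: "aggregate" means counting how often each value of the
    slot occurs in the sampled subset and dividing by the subset size.
    """
    sample = rng.sample(history, sample_size)
    counts = Counter(restaurant[slot] for restaurant in sample)
    return {value: count / sample_size for value, count in counts.items()}


def pick_turn_slots(slots: List[str], rng: random.Random) -> List[str]:
    """Choose 1 to 3 distinct slots for a simulated user turn."""
    k = rng.randint(1, min(3, len(slots)))
    return rng.sample(slots, k)


# Toy visiting history (hypothetical data).
history = [{"wifi": "free"}, {"wifi": "free"}, {"wifi": "free"}, {"wifi": "paid"}]
rng = random.Random(0)
print(aggregate_slot_distribution(history, "wifi", 4, rng))
print(pick_turn_slots(["wifi", "price", "attire", "noise"], rng))
```

Sampling a subset rather than using the full history yields many different ground-truth distributions from the same user, which is one plausible way to cover more preference patterns.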

3.2.2 User Utterances Composition

| Reasoning type | Example user utterance | Example preference distribution |
|---|---|---|
| Type I: Factoid Knowledge (37.3%) | I really want a G&T or a Riesling, but I could also have a tonic water. | Slot: Alcohol = (full_bar, 0.7), (beer_and_wine, 0.2), (don’t_serve, 0.1) |
| Type II: Commonsense Knowledge or User Situations (43.8%) | five to ten dollars, I don’t want a place with people wearing ties, you know? | Slot: Price = (cheap, 0.6), (affordable, 0.4), (moderately_priced, 0.0), (expensive, 0.0); Slot: Attire = (casual, 1.0), (dressy, 0.0), (formal, 0.0) |
| Type III: Mixed Type I & II (19.0%) | i want to update blog on my laptop, with a dry martini on side. | Slot: Wifi = (free, 0.7), (paid, 0.3), (no, 0.0); Slot: Alcohol = (full_bar, 1.0), (beer_and_wine, 0.0), (don’t_serve, 0.0) |

Table 1: Examples of different reasoning types. In the Type I utterance, we need to reason from the factoid knowledge that G&T is only served in a full bar, while Riesling is a kind of wine and tonic water does not require alcohol options. In the Type II utterance, based on commonsense knowledge we know that ‘place without people wearing ties’ indicates casual attire, and ‘five to ten dollars’ indicates a price range of cheap or affordable. In the Type III utterance, we need both kinds of reasoning to infer the preferences.

After simulating the dialogue skeletons, where each user utterance is formulated as distributions over user preferences, we employ crowdsourcing to compose the corresponding user utterances. To get a high-quality dataset, we employ professional linguists to do the composition. Specifically, we provide two composing strategies to the linguists:

Implicit Reasoning: do not explicitly mention the slot-value terms. This is the focus of this work, since we expect that users have no idea of the system ontology and are free to describe their wishes in ways that are unlikely to overlap with the formally defined system ontology.

Explicitly Mention: use the slot-value terms (or synonyms), as a backup option when the first one is not applicable.

During the composition, we emphasize the following aspects: 1) Diversity is the most important. Try to compose utterances that are as diverse as possible; 2) Read the whole dialogue first and have an overall “story” in mind, then begin to compose each utterance, to stay consistent with the dialogue context; 3) Since the dialogue skeleton is automatically simulated, there will be a certain number of invalid cases. Reject any cases with invalid or unnatural preference distributions. We provide detailed explanations and examples, as well as learning sessions, to make sure all the linguists have mastered the task. We launched 5,784 simulated dialogue skeletons to the linguists, and ended up with 5,100 completed dialogues after post-processing.

3.3 Dataset Statistics and Analysis

3.3.1 Statistics

With an average of 5.39 user turns per dialogue, we have 5,100 dialogues with 25,757 user turns in total. The user utterances have an average length of 19.43 tokens. In terms of composing strategies, 84.7% of the utterances are composed using implicit reasoning, i.e., the utterance does not contain any grounded ontology term; 6.5% of the utterances explicitly mention the ontology terms, and the remaining 8.8% use mixed strategies. This demonstrates the uniqueness and challenge of our dataset, compared with previous ones in which the user utterances mostly follow the predefined ontology closely.

Table 2 shows the train/valid/test split in the number of dialogues and turns. To evaluate the quality of our dataset, we randomly sample 500 examples and ask humans whether a preference distribution is reasonable based on the corresponding utterance. We end up with a correctness rate of 90.2%, which is the percentage of turns with a reasonable preference distribution.

| | All | Train | Valid | Test |
|---|---|---|---|---|
| # dialogues | 5,100 | 3,600 | 500 | 1,000 |
| # user turns | 25,757 | 18,182 | 2,529 | 5,046 |

Table 2: Train/Valid/Test split of the dataset

3.3.2 Reasoning Types

One major challenge of maximizing the freedom for users is to encourage implicit reasoning. As such, understanding different types of implicit reasoning is crucial for building a user-centric dialogue system. Among the utterances involving implicit reasoning, we summarize 3 basic reasoning types found in our dataset. Type I (Factoid Knowledge) is the hidden backbone connecting user utterances with the preference distribution. Such factoid knowledge is stable and largely agreed upon by humans, such as knowledge from Wikipedia. To learn distributions involving this type, techniques such as pre-trained LMs or external knowledge bases may be needed.

Type II (Commonsense Knowledge or User Situations) is also important and frequent in utterances. The major difference between commonsense knowledge and factoid knowledge is that commonsense knowledge may not be formally defined and may change in the future. For example, a food item that costs less than $10 is cheap. In many cases, such commonsense knowledge needs to be inferred from a situation described by the user at the current moment.

Type III (Mix of Type I and II) covers cases where both types appear in a single utterance. Examples and distributions for these 3 types are shown in Table 1. Note that this is not a comprehensive list of types of implicit reasoning. For example, there could also be user-specific knowledge that is not agreed upon among human beings but is important for mapping entities from the user to the system ontology. We discuss this as an open problem in the last section.

3.3.3 nuanced-reduced

Further, we are interested in the connection between the discrete states (used in existing dialogue state tracking) and our novel continuous states (in the form of continuous distributions). As such, we also provide a reduced variant called nuanced-reduced, obtained by discretizing the preference distributions into binary numbers. Specifically, we label all slot-values with a positive preference distribution as 1.0, and all others as 0.0. (In practice we set a threshold of 10%, because in the utterance composition stage a preference distribution lower than 10% is generally considered ignorable.) For example, the preference distribution Wifi=(free, 0.7),(paid, 0.3),(no, 0.0) is turned into Wifi=(free, 1.0),(paid, 1.0),(no, 0.0), indicating preferences for the values free and paid. As a result, this reduced variant does not have continuous probabilities to express the nuanced differences among positive labels, but it still needs to map free-form utterances to binary labels.
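The discretization can be sketched in a few lines. The function name `reduce_distribution` is hypothetical, and treating a probability of exactly 10% as positive is an assumption, since the text only says that values below 10% are considered ignorable.

```python
from typing import Dict


def reduce_distribution(dist: Dict[str, float],
                        threshold: float = 0.1) -> Dict[str, float]:
    """Turn a preference distribution into binary labels.

    Values at or above the threshold become 1.0, all others 0.0.
    The inclusive boundary is an assumption of this sketch.
    """
    return {value: 1.0 if p >= threshold else 0.0 for value, p in dist.items()}


wifi = {"free": 0.7, "paid": 0.3, "no": 0.0}
print(reduce_distribution(wifi))  # {'free': 1.0, 'paid': 1.0, 'no': 0.0}
```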

We further conduct a human evaluation to obtain insight into these two versions of the dataset. Specifically, we present the utterances with both the preference distributions and the coarse (binary) labels, and ask the annotators to decide which representation better captures fine-grained user preferences. Table 3 shows the evaluation results: nuanced better captures the nuanced information underlying user utterances. Note that in real applications, which version of the data to use may depend on the requirements of the system, i.e., the level of granularity of the state representation. In the experiments, we further explore the impact of the two versions of the dataset on models.

| nuanced win | nuanced-reduced win | Tied |
|---|---|---|
| 54.7% | 16.7% | 28.6% |

Table 3: Human evaluation results of comparing two versions of the dataset: the version using the (continuous) probability distributions can better capture fine-grained user preferences.

4 Experiments

In this section, we conduct experiments on both versions of the dataset in §4.1 and §4.2, respectively. We primarily focus on gaining insight into patterns inside the dataset and on providing baseline approaches to estimate the challenges of building a more user-centric system. As nuanced-reduced is closer to existing datasets in dialogue state tracking, we first follow an evaluation similar to that of previous datasets to better understand the challenges; we then explore nuanced with the continuous distributions and perform a human evaluation on distributions predicted by models to understand the modeling challenges.

4.1 nuanced-reduced

4.1.1 Baselines

We design the following baselines for nuanced-reduced to examine how well nuanced works in a setting similar to existing dialogue state tracking, while being more powerful in connecting users’ (unrecognized) entities with the slots and values defined by the system ontology.

Exact match & Random guess This is a rule-based baseline. For each turn we follow the preceding system query to make the slot prediction; we then use exact match to predict the slot-values and any additionally mentioned slots; if no match is found, we apply a random guess. We use this baseline to estimate how well a simple rule-based method can cover both explicit and implicit mapping.
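The value-prediction part of this baseline might look like the sketch below. It is only an illustration: the whole-word matching rule, the conversion of underscores to spaces, the 50% guessing probability and the function name are all assumptions, and slot prediction from the system query is left out.

```python
import random
import re
from typing import Dict, List


def exact_match_or_random(utterance: str, slot_values: List[str],
                          rng: random.Random) -> Dict[str, float]:
    """Label values mentioned verbatim in the utterance, else guess.

    A value counts as matched when its surface form (underscores read
    as spaces) appears as a whole word or phrase in the utterance.
    """
    text = utterance.lower()
    labels = {}
    for value in slot_values:
        phrase = value.replace("_", " ").lower()
        hit = re.search(r"\b" + re.escape(phrase) + r"\b", text)
        labels[value] = 1.0 if hit else 0.0
    if any(labels.values()):
        return labels
    # No ontology term found: fall back to an independent coin flip per value.
    return {value: float(rng.random() < 0.5) for value in slot_values}


rng = random.Random(0)
print(exact_match_or_random("Somewhere with free wifi please",
                            ["free", "paid", "no"], rng))
print(exact_match_or_random("I want to update my blog on my laptop",
                            ["free", "paid", "no"], rng))
```

The second call contains no ontology term, so the result depends only on the random generator, which mirrors why such a rule struggles on implicit utterances.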

BERT Following DBLP:conf/naacl/DevlinCLT19, we adopt a pre-trained LM to enrich the features of utterances for better tracking of binary labels in nuanced-reduced. The input is the concatenation of the slot name, the current-turn system question and the user utterance, and optionally the dialogue context of past turns (the variant without it is denoted w/o context). We add two types of prediction heads on the [CLS] token of BERT: one head for slot prediction (whether the input slot is updated or not), and one for the value prediction of each slot. The loss is a combination of a cross-entropy loss for slot prediction and a mean squared error (MSE) loss for value prediction. During inference, we use a threshold to decide between positive and negative predictions.
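A plain-Python sketch of such a combined objective for a single (turn, slot) example is given below. The equal weighting of the two terms (`value_weight=1.0`) and the choice to apply the value loss only when the slot is actually updated are assumptions of this sketch; the source does not specify either.

```python
import math
from typing import Dict


def slot_cross_entropy(p_updated: float, is_updated: bool) -> float:
    """Binary cross-entropy for the slot head (is this slot updated?)."""
    eps = 1e-12
    p = min(max(p_updated, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -math.log(p) if is_updated else -math.log(1.0 - p)


def value_mse(pred: Dict[str, float], target: Dict[str, float]) -> float:
    """Mean squared error over the values of one slot."""
    return sum((pred[v] - target[v]) ** 2 for v in target) / len(target)


def combined_loss(p_updated: float, is_updated: bool,
                  pred: Dict[str, float], target: Dict[str, float],
                  value_weight: float = 1.0) -> float:
    """Cross-entropy for slot prediction plus weighted MSE for values."""
    loss = slot_cross_entropy(p_updated, is_updated)
    if is_updated:  # assumption: value loss only counts for updated slots
        loss += value_weight * value_mse(pred, target)
    return loss


target = {"free": 1.0, "paid": 1.0, "no": 0.0}
pred = {"free": 0.9, "paid": 0.6, "no": 0.1}
print(combined_loss(0.8, True, pred, target))
```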

Transformer DBLP:conf/naacl/DevlinCLT19; DBLP:conf/nips/VaswaniSPUJGKP17 To study the effect of pre-training weights, we use the same architecture as BERT but train the weights from scratch.

Train-MGConvRex As the MGConvRex dataset DBLP:journals/corr/abs-2006-00184 has a similar domain and ontology, we compare a BERT model trained on MGConvRex with one trained on nuanced-reduced. We use this baseline to demonstrate the open challenge posed by users’ free-style speaking and the degree to which nuanced-reduced can alleviate this issue.

We omit the model implementation and hyperparameter settings here. We refer the readers to Appendix A for more details. For all baselines, we evaluate in a similar way as in dialogue state tracking on the turn level slot prediction accuracy and joint accuracy.

4.1.2 Results for nuanced-reduced

The results are shown in Table 4. A pre-trained LM (BERT) achieves the best performance (compared to Transformer), as pre-training on a large-scale corpus helps to relate unrecognized entities from the user to entities (such as slots and values) from the agent. We believe that, for certain types of reasoning listed in Table 1, some knowledge about entities, such as factoid knowledge or even commonsense knowledge, may already have been seen by BERT during pre-training. Noticeably, Train-MGConvRex limits such mapping to the system ontology (or even overfits to it). As a comparison, we further test Train-MGConvRex on the testing set of MGConvRex, resulting in 96.52% slot accuracy and 91.35% joint accuracy. The huge performance drop on nuanced-reduced (to 38.70% and 4.02%) indicates that existing dialogue datasets (not just those for conversational recommendation) may limit what an agent can understand from humans. Even worse, Train-MGConvRex probably overfits training data that closely follows a system ontology, because a random solution may perform better on unrecognized entities, as indicated by Exact match & Random guess. Lastly, by comparing with BERT without dialogue context (past turns), we notice that context helps in learning better values (joint accuracy of 36.56% vs. 34.99%) but adds noise to slot prediction (slot accuracy of 88.21% vs. 88.78%). This may be because diverse out-of-ontology entities spanning multiple utterances make it harder for a model to identify the correct slot consistently.

| Baselines | Slot Accuracy (%) | Joint Accuracy (%) |
|---|---|---|
| Exact match & Random guess | 48.83 | 4.84 |
| Train-MGConvRex | 38.70 | 4.02 |
| Transformer | 74.14 | 21.52 |
| BERT | 88.21 | 36.56 |
| BERT w/o context | 88.78 | 34.99 |

Table 4: Evaluation results on nuanced-reduced. Slot Accuracy: percentage of turns in which all slots are correctly predicted; Joint Accuracy: percentage of turns in which all slots and values are correctly predicted. Train-MGConvRex: BERT trained on MGConvRex but evaluated on the testing set of nuanced-reduced; Transformer: the same architecture as BERT without pre-trained weights; w/o context: without past turns.
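The two turn-level metrics defined in the caption can be sketched as follows, assuming each turn is represented as a mapping from updated slots to binary value labels. This data layout and the function names are assumptions made for illustration.

```python
from typing import Dict, List

# One turn: updated slot -> {value: binary label}.
Turn = Dict[str, Dict[str, float]]


def slot_accuracy(preds: List[Turn], golds: List[Turn]) -> float:
    """Share of turns whose predicted set of updated slots equals the gold set."""
    correct = sum(set(p) == set(g) for p, g in zip(preds, golds))
    return correct / len(golds)


def joint_accuracy(preds: List[Turn], golds: List[Turn]) -> float:
    """Share of turns where all slots and all their value labels match."""
    correct = sum(p == g for p, g in zip(preds, golds))
    return correct / len(golds)


golds = [{"wifi": {"free": 1.0, "paid": 0.0}}, {"attire": {"casual": 1.0}}]
preds = [{"wifi": {"free": 1.0, "paid": 1.0}}, {"attire": {"casual": 1.0}}]
print(slot_accuracy(preds, golds))   # 1.0
print(joint_accuracy(preds, golds))  # 0.5
```

In the toy example the first turn gets the slot right but one value wrong, so it counts toward slot accuracy only, which is why joint accuracy is always the lower of the two.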

4.2 nuanced

Next, we focus on the ideal setting of continuous distributions as dialogue states. As a reminder, the major difference between nuanced and nuanced-reduced lies in the set of numbers a value can take: in the former it is a continuous number between 0.0 and 1.0, while in the latter it is a 0/1 binary label. Below, we detail the major differences in the experimental setup compared with nuanced-reduced.

We keep the same evaluation for slot prediction. To reflect the differences in a value assigned with different continuous numbers, we evaluate the soft average mean absolute error (MAE) between the ground truth distribution and that from the predictions, instead of the hard metric on classification.
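A sketch of this soft metric is shown below. The exact averaging scheme is an assumption: here the absolute error is averaged over the values of each correctly predicted slot, and then over all such slots; missing predicted values are treated as 0.0.

```python
from typing import Dict, List

Turn = Dict[str, Dict[str, float]]  # slot -> {value: probability}


def correct_slots_mean_mae(preds: List[Turn], golds: List[Turn]) -> float:
    """Mean absolute error over slots that appear in both prediction and gold."""
    errors = []
    for pred, gold in zip(preds, golds):
        for slot, gold_values in gold.items():
            if slot not in pred:
                continue  # slot was not predicted, so it is not scored here
            err = sum(abs(pred[slot].get(v, 0.0) - p)
                      for v, p in gold_values.items()) / len(gold_values)
            errors.append(err)
    return sum(errors) / len(errors) if errors else 0.0


golds = [{"wifi": {"free": 0.7, "paid": 0.3, "no": 0.0}}]
preds = [{"wifi": {"free": 0.6, "paid": 0.3, "no": 0.1}}]
print(correct_slots_mean_mae(preds, golds))
```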

4.2.1 Baselines

Exact match & Random guess Similar to nuanced-reduced, for matched values, we assign a hard probability of 1.0; otherwise, we randomly assign a probability between 0.0 and 1.0.

BERT, Transformer Similar to nuanced-reduced, we use MSE loss between the distribution of ground truth and predicted distribution. During inference, we take the predicted distribution as the results.

Train-reduced-X Further, we are interested in the connection between states in continuous space and binary space. We train the model on nuanced-reduced and test on nuanced to see how data with binary labeled states can infer states in the continuous space. We define a fixed number of X as the continuous number for all positive predictions. We experiment with X = 0.5 and 1.0.
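To make the Train-reduced-X setup concrete, the sketch below maps binary predictions to a fixed value X and compares the resulting error for X = 0.5 and X = 1.0 on the Wifi example from §3.3.3. The helper names and the per-slot MAE are illustrative assumptions, and the single example is not meant to reproduce the reported numbers.

```python
from typing import Dict


def binary_to_continuous(binary: Dict[str, float], x: float) -> Dict[str, float]:
    """Replace every positive binary label by the fixed value x."""
    return {v: x if label >= 0.5 else 0.0 for v, label in binary.items()}


def mae(pred: Dict[str, float], gold: Dict[str, float]) -> float:
    """Mean absolute error over the values of one slot."""
    return sum(abs(pred[v] - gold[v]) for v in gold) / len(gold)


gold = {"free": 0.7, "paid": 0.3, "no": 0.0}
binary_pred = {"free": 1.0, "paid": 1.0, "no": 0.0}

for x in (0.5, 1.0):
    print(x, round(mae(binary_to_continuous(binary_pred, x), gold), 4))
```

On this example X = 0.5 gives the smaller error, because positive values in the ground truth are spread between 0 and 1 rather than concentrated at 1.0.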

4.2.2 Results for nuanced

Table 5 shows the overall results. As expected, a pre-trained BERT reaches the best performance. One interesting observation is that, using the same BERT model, the slot prediction accuracy increases (from 88.21% to 89.62%) compared with training on the reduced version, even though the loss function for slot prediction does not change.

nuanced helps to reduce the noise of sparse entities in context (past turns). This is probably because numbers in continuous space help to relate different entities to one another.

As we can see, Train-reduced-X has a much larger MAE because of the information loss incurred when turning numbers in the continuous space into binary numbers. This indicates that simply adapting results from the reduced state labels loses the nuanced differences expressed by continuous distributions. It is therefore important to model the dialogue state with numbers in the continuous space, to cover the uncertainty arising from users’ unrecognized entities.

| Baselines | Slot Accuracy (%) | Correct slots mean MAE (1e-2) |
|---|---|---|
| Exact match & Random guess | 48.83 | 46.84 |
| Train-reduced-1.0 | 88.21 | 40.72 |
| Train-reduced-0.5 | 88.21 | 21.62 |
| Transformer | 78.42 | 16.78 |
| BERT | 89.62 | 14.20 |
| BERT w/o context | 88.08 | 14.49 |

Table 5: Evaluation results on nuanced. Slot Accuracy: percentage of turns in which all slots are correctly predicted; Correct slots mean MAE (lower is better): mean absolute error of the predicted distribution over all correctly predicted updated slots; Train-reduced-X: train the model on nuanced-reduced, and test on nuanced with all positive predictions set to a distribution value of X.

4.2.3 Analysis on Slots

To get more insight into how efficiently users can provide information in our dataset, we study how the model performs when different numbers of slots are updated per turn. The results are shown in Table 6. Generally speaking, turns with more slots implied by the utterance are harder to learn. Turns involving 1 slot mostly follow the system query, which contains the slot name, so the slot prediction accuracy is very high. Turns involving multiple slots, which go beyond the system query, become harder to predict as the number of slots increases. Turns that update a preference from previous turns have the highest error for distribution prediction. In such turns, the preference distribution needs to be inferred jointly from the previous mention and the current turn.

We also study how the model performs on each slot in the domain, as shown in Table 7. Generally, slots that involve more factoid knowledge or more choices of values are harder to learn, such as food category and parking. These may require learning long-tailed knowledge from external data, as we discuss in the next section. We provide some case studies in Appendix B.

Type of turn | all | 1 slot | 2 slots | 3 slots | updating preferences
Slot Accuracy (%) | 89.62 | 96.67 | 78.91 | 67.65 | 90.61
Mean MAE (1e-2) | 14.12 | 14.06 | 13.55 | 14.20 | 15.63
Table 6: Performance for different numbers of slots per turn. all: all kinds of turns; n slots: turns in which the user utterance jointly implies n slots; updating preferences: turns in which the user utterance updates a preference from previous turns. Slot Accuracy: percentage of turns in which all slots are correctly predicted. Mean MAE: measured over all correctly predicted slots.
Slot | food category | price | parking | noise | ambience | alcohol | wifi | attire
Mean MAE (1e-2) | 15.48 | 15.29 | 16.94 | 13.34 | 15.04 | 13.88 | 12.30 | 8.95
Table 7: Performance for each slot of our dataset. Here the mean MAE is measured for all correctly predicted slots.

4.2.4 Human Evaluation

We further conduct a human evaluation of the baseline models. Although the automatic evaluation can generally reflect overall performance, it can be imprecise because it averages over all correctly predicted slots. As a reminder, we perform human evaluations to ensure the quality of the ground truth as in §4.2, which serves as the basis for comparing model predictions with the ground truth.

We first evaluate the outputs of Transformer, BERT, and BERT w/o context through pairwise comparison between the model predictions and the gold ones. The results on 200 samples are shown in Figure 2. There is a large gap between the best-performing baseline and the gold reference, which indicates substantial room for improvement in modeling or data augmentation in future research.

Further, we study the breakdown of BERT's predictions across 3 different types of reasoning. As a reminder, we discussed 3 basic types of reasoning in the user utterances in §3.3. As shown in Figure 3, Type I utterances, which involve factoid knowledge, are relatively harder to learn. This matches our intuition: the amount of factoid knowledge is huge (and keeps growing), and our limited set of utterances may not cover all of it. We discuss possible ways to handle the different types of knowledge and improve performance in the next section.

Figure 2: Human evaluation results for the model outputs of Transformer, BERT, and BERT w/o context.
Figure 3: Human evaluation results for different reasoning types. Type I: factoid knowledge; Type II: commonsense knowledge or user situations; Type III: Mixed Type I & II.

5 Conclusion and Open Problems

In this work, we study how to build user-centric conversational systems. We make a first attempt to bridge the gap between the system ontology and the users’ freestyle preferences by learning continuous probability distributions as an intermediate mapping. To this end, we build a new dataset named nuanced that focuses on this challenge. Starting from our dataset, we believe that building a user-centric dialogue system is an open-ended problem and that the following directions are promising:

1) Preliminary experimental results indicate that incorporating external domain text into pre-trained models is a promising way to improve performance, for example, by pre-training the model on domain corpora such as restaurant descriptions and reviews. This would help with both the Type I and Type II knowledge discussed in Table 1.

2) Although our dataset collects knowledge about a large set of domain entities, it cannot be guaranteed to cover the vast number of unknown entities that may appear in the future. One idea is to incorporate a knowledge base (KB) as a source of Type I knowledge. This can take the form of both training data (data augmentation of utterances in nuanced) and modeling (e.g., a graph-based model that infers probability distributions in a structured way).

3) Although one can learn a general agreement on estimated distributions from workers through our large-scale dataset, a more user-specific distribution would be more desirable in practice. For example, when talking about “beef”, different users may actually refer to different parts or types of beef. Capturing such user-specific distributions requires the system to build a user ontology in addition to the formal and stable system ontology. Therefore, we believe that a personalized solution is another natural next step, i.e., learning and maintaining a user ontology while interacting with the user and further learning preference distributions over that user ontology.



A Model Implementation and Training Details

Figure 4: Illustration of the BERT baseline

Figure 4 presents the architecture of the BERT baseline. For each turn, we concatenate each slot with the current turn and the dialogue context as the input. On the [CLS] output, we add one head for slot prediction as binary classification, i.e., whether the input slot is updated in the current turn. For each slot, we add a specific head for value prediction. We use cross-entropy loss for slot prediction and mean squared error loss for value distribution prediction. The overall loss is a weighted combination of the two losses. We set the weight for value prediction to 20.0. We use BERT-base from the official release. The learning rate is set to 3e-5 and the batch size to 32. We report results selected by performance on the validation set.
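The weighted combination of the two losses can be sketched in plain Python for a single (turn, slot) input. Only the value-prediction weight of 20.0 comes from the text. The slot-loss weight of 1.0, the function names, and the choice to skip the value loss when the slot is not updated are assumptions made for illustration.

```python
import math

VALUE_LOSS_WEIGHT = 20.0  # stated in the text
SLOT_LOSS_WEIGHT = 1.0    # assumption: the text does not give this weight


def binary_cross_entropy(p_updated, is_updated, eps=1e-12):
    """Cross-entropy for the binary 'is this slot updated?' head."""
    p = min(max(p_updated, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(is_updated * math.log(p) + (1 - is_updated) * math.log(1.0 - p))


def mean_squared_error(pred_values, gold_values):
    """Mean squared error between predicted and gold value distributions."""
    return sum((p - g) ** 2 for p, g in zip(pred_values, gold_values)) / len(gold_values)


def combined_loss(p_updated, is_updated, pred_values, gold_values):
    """Weighted sum of the slot-prediction loss and the value-distribution loss."""
    slot_loss = binary_cross_entropy(p_updated, is_updated)
    # Assumption: the value loss is only counted when the gold slot is updated.
    value_loss = mean_squared_error(pred_values, gold_values) if is_updated else 0.0
    return SLOT_LOSS_WEIGHT * slot_loss + VALUE_LOSS_WEIGHT * value_loss


# Alcohol slot from Example 2 (Appendix B); the 0.9 slot probability is hypothetical.
loss = combined_loss(0.9, 1, [0.55, 0.47, 0.09], [0.78, 0.33, 0.11])
print(f"combined loss: {loss:.4f}")
```

Squared errors on values in [0, 1] are small compared with the cross-entropy term, which is one plausible reason for a weight as large as 20.0 on the value loss.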

Note that for the slot “food category”, some values are commonly observed in the dataset, such as “American food” and “nightlife”, while others are less frequent, such as “Thai”. During training, we up-sample the less frequent ones.

In the construction of nuanced, we sample a subset of the user history and aggregate it to get the ground-truth preference distributions. The number of viable values differs across slots, and for slots with relatively more values the distribution generally has a long tail, so we only take the top 3 value distributions for each slot. Correspondingly, during model evaluation, we also take the top 3 predicted value distributions to calculate the MAE.
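One possible reading of the top-3 truncation is sketched below. The text does not specify how the MAE is computed when the gold and predicted top-3 sets differ, so comparing over the union of the two sets with missing values counted as 0.0 is an assumption. The function names, the "street" value, and the scores outside each top 3 are hypothetical. The top-3 numbers follow the last turn of Example 3 in Appendix B.

```python
def top_k(distribution, k=3):
    """Keep the k highest-scoring values of a slot distribution."""
    ranked = sorted(distribution.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:k])


def top_k_mae(predicted, gold, k=3):
    """MAE between the truncated distributions.

    Assumption: values missing from one side after truncation count as 0.0.
    """
    pred_top, gold_top = top_k(predicted, k), top_k(gold, k)
    values = sorted(set(pred_top) | set(gold_top))
    return sum(abs(pred_top.get(v, 0.0) - gold_top.get(v, 0.0)) for v in values) / len(values)


gold = {"garage": 0.86, "valet": 0.64, "validated": 0.21, "lot": 0.10, "street": 0.05}
pred = {"garage": 0.67, "valet": 0.48, "lot": 0.40, "validated": 0.30, "street": 0.02}

print(top_k(gold))   # garage, valet, validated
print(top_k(pred))   # garage, valet, lot
print(f"top-3 MAE: {top_k_mae(pred, gold):.4f}")
```

Under this reading, a value that enters the predicted top 3 but not the gold top 3 (here, lot) is penalized by its full predicted score.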

B Case Studies

Example 1
Dialogue turn(s):
Assistant: any preference on attire?

User: I like shorts and a loose tee shirt in this heat.

Gold distributions:
attire ( casual= 1.00, dressy= 0.00, formal= 0.00 )

BERT predictions:
attire ( casual=0.99, dressy= 0.01, formal= 0.00 )

Gold labels:
attire ( casual= 1.00, dressy= 0.00, formal= 0.00 )

BERT predictions:
attire ( casual= 1.00, dressy= 0.00, formal= 0.00 )

Example 2
Dialogue turn(s):
Assistant: any preference on alcohol?

User: I really want a G&T or a Riesling, but I could also have a tonic water.

Gold distributions:
alcohol ( full_bar= 0.78, beer_and_wine= 0.33, don’t_serve= 0.11 )

BERT predictions:
alcohol ( full_bar= 0.55, beer_and_wine= 0.47, don’t_serve= 0.09 )

Gold labels:
alcohol ( full_bar= 1.0, beer_and_wine= 1.0, don’t_serve= 1.0 )
BERT predictions:
alcohol ( full_bar= 1.0, beer_and_wine= 1.0, don’t_serve= 1.0 )

Example 3
Dialogue turn(s):
Assistant: what parking option would you like?

User: I need something fuss-free and out of the rain for my car, Also, I really want a gin and tonic, but it’s not a complete deal-breaker if I can’t have it.

Gold distributions:
parking ( garage= 0.86, valet= 0.00, validated= 0.00 )
alcohol ( full_bar= 0.93, beer_and_wine= 0.21, don’t_serve= 0.14 )

BERT predictions:
parking ( garage= 0.78, valet= 0.41, lot= 0.34 )
alcohol ( full_bar= 0.79, beer_and_wine= 0.17, don’t_serve= 0.12 )

Gold labels:
parking ( garage= 1.0, valet= 0.0, validated= 0.0 )
alcohol ( full_bar= 1.0, beer_and_wine= 1.0, don’t_serve= 1.0 )

BERT predictions:
parking ( garage= 1.0, valet= 1.0, lot= 1.0 )
alcohol ( full_bar= 1.0, beer_and_wine= 1.0, don’t_serve= 1.0 )

(after some turns)
Assistant: here’re the recommendations
User: You know what, if it’s going to be a fancier place then I don’t mind dealing with more complicated parking after all.

Gold distributions:
parking ( garage= 0.86, valet= 0.64, validated= 0.21 )

BERT predictions:
parking ( garage= 0.67, valet= 0.48, lot= 0.40 )

Gold labels:
parking ( garage= 1.0, valet= 1.0, validated= 1.0 )

BERT predictions:
parking ( garage= 1.0, lot= 1.0, validated= 1.0 )