Denotation Extraction for Interactive Learning in Dialogue Systems

by Miroslav Vodolán, et al.
Charles University in Prague

This paper presents a novel task using real user data obtained in human-machine conversation. The task concerns denotation extraction from answer hints collected interactively in a dialogue. The task is motivated by the need for large amounts of training data for question answering dialogue system development, where the data is often expensive and hard to collect. By collecting denotations interactively and directly from users, one could improve, for example, natural language understanding components on-line and ease the collection of training data. This paper also presents introductory results of an evaluation of several denotation extraction models, including attention-based neural network approaches.


1 Introduction

The increasing popularity of dialogue systems and the rapidly growing amount of available information (factoid general data such as Wikidata, domain-specific data such as transportation schedules, etc.) demand dialogue systems that can be quickly adapted to new domains.

This demand has motivated several industry technologies providing tools for quick development of dialogue systems, such as IBM Watson Dialogue Service and Wit.ai. The core principle of these technologies is based on the manual work of domain experts. This handcrafting usually causes the resulting dialogue systems to be quite simple and focused on a single domain (separate systems for pizza ordering, restaurant bookings, transportation information, etc.), and they have trouble when users try to use out-of-domain concepts, as noted in [1, 2]. A typical problem of such systems is that they are hard to maintain in the long term as language usage evolves (e.g. how users refer to concepts in a domain) and the domain itself changes (e.g. new concepts are added). To sum up, a principled approach to dialogue system development must be able to adapt in the long term.

Models with more complexity, covering many domains, could be developed on the basis of supervised machine learning techniques, as shown by recent end-to-end dialogue systems [3]. Such data-driven models require large amounts of labeled training data and many tricks to make them work, e.g. training components of the system separately with different training data [4]. The issue with those models is in obtaining sufficient amounts of labeled training data, which is often not feasible for a non-trivial number of domains, especially when the data must be collected repeatedly.

Some research focuses on reinforcement learning (RL) to alleviate the problem. RL methods allow for on-line adaptation of dialogue policies and learning from user feedback [5]. While a lot of progress has been made, the proposed methods are not very efficient yet. In part, this is caused by learning only from feedback in the form of a delayed numerical reward. Consequently, these methods only learn and adapt a model of communication, e.g. what response (often a handcrafted action) to use given the facts the dialogue system has access to. This severely limits their ability to adapt in the long term.

An alternative approach called interactive learning [6, 7, 8, 9] aims at obtaining factual information directly from conversations with real users and eventually using such information in real time to improve the skills of a dialogue system. Therefore, it has the potential to allow developing dialogue systems with less demand on domain experts and labeled training data. Interactive learning represents a step towards adaptability of dialogue systems.

The work of [6] uses interactive learning to teach a personal agent new skills by combining old ones. Other work shows the possibility of using interactive learning to improve the language understanding of a system [7, 8]. The authors propose a user simulator which can be questioned by the system about simple facts. This setup neglects the complexity of natural language (due to the simple templates that the simulator uses), which is an issue for real dialogue system deployment. Interactive learning in connection with natural language is researched in [9]. The authors collected a large set of natural dialogues, called the Question Dialogue Dataset (QDD), to be used for experiments with interactive learning. They distinguish several kinds of information which an interactive system can obtain during a dialogue with a user. The most basic kind of information is an answer hint, which is obtained by asking a human user a factual question of the system’s interest.

This paper proposes a task which aims to use answer hints to derive question denotations [10] (see Section 2 for details). For example, manually annotated denotations are used to train natural language understanding (NLU) components [10, 11, 12]. The use of denotations itself is outside the scope of this work. However, the ability to obtain denotations automatically in real time potentially enables dialogue systems to improve their NLU interactively while using, for example, the techniques referred to above. Another possibility is to use denotations to enrich a system’s knowledge base (see Section 2 for details). A system may ask questions about topics with poor coverage. Then, the answers (denotations) can be used to derive new facts. This relates to the work on knowledge base population [13, 14].

Section 2 introduces the task of denotation extraction in more detail. Several models for denotation extraction from answer hints are proposed in Section 3. Section 4 describes the evaluation of the models on the QDD. A discussion of the results is provided in Section 5. Finally, we conclude the paper in Section 6.

Figure 1: An example of a short dialogue from the Question Dialogue Dataset containing multiple kinds of information. This work is only concerned with answer hints (U3).

2 The Task of Denotation Extraction

In the framework of interactive learning by [9], an answer hint is a piece of information obtained by asking a human user a question of the system’s interest. This way, one can get utterances as follows:

System: What work did Scooter Libby write?
User: Scooter Libby wrote a novel called The Apprentice.

To make this information useful for a typical dialogue system, natural language utterances have to be mapped to a meaning representation. In a question answering system, a convenient meaning representation of an answer has the form of a denotation [10, 11, 12]. A denotation is a set of entities from the system’s knowledge base (such as Freebase) that represents the correct answer to a question [10]. A knowledge base (KB) is a set of triplets of the form (subject entity, relation, object entity).
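To make the two definitions concrete, the following minimal sketch stores a KB as a set of triplets and reads a denotation off it. The entity and relation identifiers are hypothetical placeholders, not real Freebase ids, and the lookup function is only an illustration of the definition, not part of the proposed method.

```python
from typing import NamedTuple, Set


class Triple(NamedTuple):
    subject: str
    relation: str
    obj: str


# Hypothetical identifiers chosen for illustration only.
KB: Set[Triple] = {
    Triple("e:scooter_libby", "r:works_written", "e:the_apprentice_novel"),
    Triple("e:the_apprentice_tv", "r:genre", "e:reality_tv"),
}


def denotation(kb: Set[Triple], subject: str, relation: str) -> Set[str]:
    """Return the set of KB entities that answer the query (subject, relation, ?)."""
    return {t.obj for t in kb if t.subject == subject and t.relation == relation}


answer = denotation(KB, "e:scooter_libby", "r:works_written")
```

In this toy KB the denotation of "What work did Scooter Libby write?" is a single-entity set, which matches the single-entity denotations used in the QDD.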

The automatic mapping of answer hints to denotations is a challenging task. First, finding entities in an answer hint is difficult because a single entity may be described in many ways using natural language, and not all of these ways will be captured in a KB. Second, the system has to select the denotation entities from potentially several entities in an answer hint. Third, natural dialogues are prone to errors, speech disfluencies or misspellings (in text-based interfaces). Finally, huge knowledge bases often contain many entities with the same label but a different meaning (e.g., The Apprentice is a novel, TV series, rock album, etc.).

The QDD (see Section 1) is a suitable testbed for the denotation extraction task. It is a set of dialogues between human users and a dialogue system where the system attempts to learn to communicate the content of its KB to its users. As the QDD contains manually annotated denotations, it can be used for the evaluation process. Note that the denotations for the QDD questions are single entities. An example dialogue from the dataset can be seen in Figure 1.

3 Denotation Extraction Algorithm

This section describes the process of denotation extraction. In this work, the denotation extraction algorithm is decomposed into two steps, entity linking and denotation identification, and it operates on pairs of a question and the corresponding answer hint. First, entity linking jointly recognizes entities in a pair of a question and an answer hint and aligns these entities with the system’s knowledge base (see Section 3.1). Second, denotation identification selects the denotation from the answer hint’s linked entities (see Section 3.2). The evaluation of the algorithm is in Section 4.

3.1 Entity Linking

Entity linking is the task of identifying entity mentions in a text [14] according to a system’s KB. This task is intensively studied [15, 16, 17, 18]. Also, many entity linking systems are available as web services (links checked on 21.06.2017). One could consider using these services for this task. However, our informal experiments showed that these services do not perform well on the QDD. The reasons are twofold. First, some common entities in the QDD are not recognized (e.g. male, female, …). Second, spelling errors appearing in the QDD are not handled gracefully. Consequently, we propose a custom entity linking algorithm which handles the above problems.

The principle of the proposed entity linking algorithm is shown in Figure 2.

Figure 2: Entity linking algorithm using similarity scores between n-grams and entities from the knowledge base.

The algorithm takes a pair of a question and an answer hint as an input. First, the algorithm finds n-grams in both the question and the answer hint matching some entity name or its alias in the KB and marks the n-grams as entity candidates. The matching is based on the string edit distance, thus compensating for spelling errors. In case the entity candidates overlap (e.g., London, London Street), the shorter candidates are discarded in favor of longer ones, which presumably specify entities more precisely. This heuristic has been shown to be effective in our informal experiments. Second, entity candidates are linked with entities in the KB. For every entity candidate there are possibly many entities with the same matching name or alias, while only one entity needs to be selected. For that purpose a relation maximization disambiguation algorithm is used.
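The candidate detection step can be sketched as follows. This is a minimal illustration under stated assumptions: whitespace tokenization, a maximum n-gram length of 3 and an edit distance threshold of 1 are all hypothetical choices, since the text does not specify the tokenizer, the n-gram range or the threshold actually used.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def find_candidates(text: str, aliases: List[str],
                    max_n: int = 3, max_dist: int = 1) -> List[Tuple[int, int, str]]:
    """Return (start, end, alias) token spans whose n-gram fuzzily matches an alias."""
    tokens = text.lower().split()
    spans = []
    for n in range(max_n, 0, -1):
        for start in range(len(tokens) - n + 1):
            ngram = " ".join(tokens[start:start + n])
            for alias in aliases:
                if edit_distance(ngram, alias.lower()) <= max_dist:
                    spans.append((start, start + n, alias))
                    break
    # Overlap heuristic: visit longer spans first and drop any span that
    # overlaps a span that was already kept.
    kept: List[Tuple[int, int, str]] = []
    for span in sorted(spans, key=lambda s: s[0] - s[1]):
        if all(span[1] <= k[0] or span[0] >= k[1] for k in kept):
            kept.append(span)
    return sorted(kept)


candidates = find_candidates("I live on Londn Street", ["London", "London Street"])
```

With the misspelled input above, both "London" and "London Street" match within the threshold, and the overlap heuristic keeps only the longer candidate.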

The algorithm is applied in two steps. First, it links entities in a question. The algorithm selects the entities in a way that maximizes the number of relations (according to the KB) between all the selected entities. Second, the algorithm links the entities in the corresponding answer hint. In addition to the answer hint entity candidates, the algorithm uses the linked entities from the question as a context. In this case, answer hint entities are selected to maximize the number of relations between each other and between them and the context entities. This follows the intuition that the relationship between an answer hint and its question can be expressed by relations between the KB’s entities. The following example shows how the relations can help to distinguish between entities corresponding to a single entity candidate:

Entity candidate The Apprentice matches three entities but only one is connected to other entities in the answer hint and question. The result of this process is called a linked question and a linked answer hint.
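A minimal sketch of relation maximization is given below. It assumes an exhaustive search over all combinations of candidate entities, which is only one possible way to perform the maximization; the text does not state which search strategy is used. The entity and relation identifiers are hypothetical.

```python
from itertools import product
from typing import Iterable, List, Sequence, Tuple

Triple = Tuple[str, str, str]


def count_relations(entities: Iterable[str], kb: Iterable[Triple]) -> int:
    """Count KB triplets whose subject and object are both among the entities."""
    chosen = set(entities)
    return sum(1 for s, _, o in kb if s in chosen and o in chosen and s != o)


def link_by_relation_maximization(candidates: Sequence[Sequence[str]],
                                  kb: Sequence[Triple],
                                  context: Sequence[str] = ()) -> List[str]:
    """Pick one entity per candidate so that the number of relations among the
    picked entities and the context entities is maximal (brute force)."""
    best, best_score = (), -1
    for assignment in product(*candidates):
        score = count_relations(tuple(assignment) + tuple(context), kb)
        if score > best_score:
            best, best_score = assignment, score
    return list(best)


# Hypothetical toy KB with three entities named "The Apprentice".
kb = [
    ("e:scooter_libby", "r:works_written", "e:apprentice_novel"),
    ("e:apprentice_tv", "r:genre", "e:reality_tv"),
    ("e:apprentice_album", "r:artist", "e:some_band"),
]
question_entities = link_by_relation_maximization([["e:scooter_libby"]], kb)
answer_entities = link_by_relation_maximization(
    [["e:scooter_libby"],
     ["e:apprentice_tv", "e:apprentice_novel", "e:apprentice_album"]],
    kb,
    context=question_entities,
)
```

Only the novel is related to Scooter Libby in the toy KB, so it is the entity selected for the ambiguous candidate. Brute force is exponential in the number of candidates, so a real implementation over a large KB would presumably need a cheaper search.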

As a baseline for the proposed relation maximization disambiguation algorithm, a popularity maximization disambiguation algorithm was also evaluated. In popularity maximization, the entities for entity candidates are selected according to their so-called popularity score, which is defined as the number of relations the entity has with all other entities in the KB, regardless of the context.
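The popularity baseline can be sketched in a few lines. The sketch assumes that popularity counts every triplet in which an entity occurs as subject or object; the identifiers are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Triple = Tuple[str, str, str]


def popularity_scores(kb: Sequence[Triple]) -> Counter:
    """Number of KB triplets each entity takes part in."""
    scores: Counter = Counter()
    for s, _, o in kb:
        scores[s] += 1
        scores[o] += 1
    return scores


def link_by_popularity(candidates: Sequence[Sequence[str]],
                       kb: Sequence[Triple]) -> List[str]:
    """Pick the most popular entity for each candidate, ignoring the context."""
    scores = popularity_scores(kb)
    return [max(options, key=lambda e: scores[e]) for options in candidates]


# Hypothetical toy KB: the TV series has more relations than the novel.
toy_kb = [
    ("e:apprentice_tv", "r:genre", "e:reality_tv"),
    ("e:apprentice_tv", "r:country", "e:usa"),
    ("e:scooter_libby", "r:works_written", "e:apprentice_novel"),
]
picked = link_by_popularity([["e:apprentice_novel", "e:apprentice_tv"]], toy_kb)
```

Because the context is ignored, the more connected TV series wins here even when the question is about Scooter Libby, which illustrates why the context-aware variant can be preferable.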

3.2 Denotation Identification

Figure 3: The neural denotation identification model. An input sequence is made of the question’s and answer hint’s words encoded as trainable embeddings. The model outputs probabilities of being a denotation.

The denotation identification algorithm selects a denotation among all answer hint entities detected during the entity linking (see Section 3.1). The challenge of the denotation identification can be demonstrated on the following example (entities are enclosed by square brackets):

System: What is [Sharon Calcraft]’s nationality?
User: [Australian] [Composer] [Sharon Calcraft] was born in [1955] in [Sydney] [New South Wales] [Australia].

The answer hint in the example contains many entities; however, only the Australia entity is a denotation because it answers the corresponding question.

Two approaches to denotation identification are proposed in this paper. The first one, context entity cancellation, is a simple method based on the observation that most non-denotation entities in the answer hint also appear in its corresponding question (see Section 3.2.1 for more details). This model serves as a baseline for the latter, machine-learned model.

The second approach, the attention selection model, uses an attention-based bidirectional LSTM network, which is frequently used for identification of important parts of an input sequence [19, 20]. This model and its variants are described in Section 3.2.2.

3.2.1 Simple Selection Models

This section describes several variants of the context entity cancellation approach to the denotation identification.

Basic cancellation - The first and simplest approach is based on the observation that most non-denotation entities within the answer hint come from its context, i.e. the corresponding question. Therefore, the algorithm filters out all context entities from the answer hint entities. From the remaining entities, the one with the highest Freebase popularity, which is the number of relations the entity has with other entities, is selected.

+ enumeration detection - The basic cancellation does not deal well with an enumeration in questions. See the following example:

System: Is Stana Katic male or female?

The issue is that the question (the context) includes the correct answer which would be canceled by the basic cancellation model. Therefore, the second algorithm uses an enumeration detection in questions (based on the keyword spotting). If enumeration is detected, the context entities are intersected with those in an answer hint instead of being subtracted as in the basic cancellation.
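The two cancellation variants can be sketched together. The keyword list used for enumeration detection and the popularity values are hypothetical. Note also that a literal intersection keeps the question's subject entity if the user repeats it; this sketch relies on the popularity ranking to resolve that case, and how the original implementation handles it is not stated.

```python
from typing import Dict, List, Optional

# Hypothetical keyword list; the actual keywords are not given in the text.
ENUMERATION_KEYWORDS = {"or"}


def cancel_context(question_tokens: List[str],
                   question_entities: List[str],
                   answer_entities: List[str],
                   popularity: Dict[str, int]) -> Optional[str]:
    """Select a denotation by context entity cancellation."""
    context = set(question_entities)
    if ENUMERATION_KEYWORDS & {t.lower() for t in question_tokens}:
        # Enumeration detected: keep only entities offered in the question.
        remaining = [e for e in answer_entities if e in context]
    else:
        # Basic cancellation: drop entities already present in the question.
        remaining = [e for e in answer_entities if e not in context]
    if not remaining:
        return None
    return max(remaining, key=lambda e: popularity.get(e, 0))


popularity = {"e:female": 1000, "e:male": 1000, "e:stana_katic": 50,
              "e:scooter_libby": 40, "e:apprentice_novel": 10}

enumerated = cancel_context(
    "Is Stana Katic male or female ?".split(),
    ["e:stana_katic", "e:male", "e:female"],
    ["e:stana_katic", "e:female"],
    popularity,
)
basic = cancel_context(
    "What work did Scooter Libby write ?".split(),
    ["e:scooter_libby"],
    ["e:scooter_libby", "e:apprentice_novel"],
    popularity,
)
```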

+ context n-grams - Next, the basic model does not discriminate well between a denotation and entities that provide extra information, which are commonly included in answer hints. See the following example:

System: Where was [Barack Obama] born?
User: [Barack Obama] was a [USA president] born in [Hawaii].

Even though the user was asked about a place of birth, he/she also included information about a function of Barack Obama, which is additional information. To deal with this issue, each entity’s popularity is multiplied by a prior probability of being a denotation (estimated from training data) given the surrounding context n-gram.

In the example above, there are two entities left after the context cancellation: [USA president] and [Hawaii]. Examples of their corresponding 3-gram contexts are “was a #ENTITY” and “born in #ENTITY”. From the training data it is easy to count how many times those contexts appeared with #ENTITY being a denotation and how many times with #ENTITY being an extra entity, which is sufficient information for computing the prior probability.
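The counting described above can be sketched as follows. The sketch makes several assumptions that the text leaves open: only the left context is used, each linked entity is a single token, no smoothing is applied, and unseen contexts fall back to a hypothetical default prior of 0.5. The training sentence and popularity values are made up for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple


def left_context(tokens: List[str], index: int, n: int = 3) -> str:
    """n-gram ending at the entity, with the entity replaced by #ENTITY."""
    start = max(0, index - (n - 1))
    return " ".join(tokens[start:index] + ["#ENTITY"])


def estimate_priors(examples: List[Tuple[List[str], List[int], int]],
                    n: int = 3) -> Dict[str, float]:
    """examples: (tokens, candidate entity positions, denotation position)."""
    denot: Counter = Counter()
    total: Counter = Counter()
    for tokens, positions, gold in examples:
        for pos in positions:
            ctx = left_context(tokens, pos, n)
            total[ctx] += 1
            if pos == gold:
                denot[ctx] += 1
    return {ctx: denot[ctx] / total[ctx] for ctx in total}


def select(tokens: List[str], positions: List[int], popularity: Dict[str, int],
           priors: Dict[str, float], n: int = 3, default_prior: float = 0.5) -> int:
    """Return the position maximizing popularity times the context prior."""
    def score(pos: int) -> float:
        prior = priors.get(left_context(tokens, pos, n), default_prior)
        return popularity.get(tokens[pos], 0) * prior
    return max(positions, key=score)


train = [("[X] was a [writer] born in [Prague]".split(), [3, 6], 6)]
priors = estimate_priors(train)
hint = "[Barack_Obama] was a [USA_president] born in [Hawaii]".split()
chosen = select(hint, [3, 6], {"[USA_president]": 900, "[Hawaii]": 300}, priors)
```

Although [USA_president] is the more popular entity in this toy setup, its context "was a #ENTITY" never preceded a denotation in the training data, so [Hawaii] is selected.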

3.2.2 Attention Selection Models

The other approach for the denotation identification uses an attention-based neural model over word sequences. The word sequences are created by concatenation of a linked question and its linked answer, where each entity in a word sequence is encoded as a single word. See the example in Figure 3. For every answer hint entity, the model outputs a probability of being a denotation.

Bidirectional attention model - First, the model transforms the discrete word sequence into a vector representation using trained word embeddings. Then, a bidirectional LSTM layer converts word embeddings to context embeddings. Finally, the softmax layer produces a probability of being a denotation for every answer hint entity. The architecture of the model can be seen in Figure 3. The minimized objective of the model is a categorical cross-entropy between the model’s output and a one-hot encoding of the denotation entity position (along the sequence length dimension).
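The output layer and the objective can be illustrated with a small worked example in plain Python. The per-position scores stand in for the outputs of the bidirectional LSTM and are hypothetical. Restricting the softmax to answer hint entity positions by a mask is an assumption of this sketch; the text only states that a probability is produced for every answer hint entity.

```python
import math
from typing import List


def masked_softmax(scores: List[float], mask: List[bool]) -> List[float]:
    """Softmax along the sequence dimension over positions where mask is True.
    Masked-out positions receive probability 0."""
    peak = max(s for s, m in zip(scores, mask) if m)
    exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: List[float], target_index: int) -> float:
    """Categorical cross-entropy against a one-hot target at target_index."""
    return -math.log(probs[target_index])


# Hypothetical scores for a 4-word sequence; positions 1 and 3 are
# answer hint entities and position 1 is the annotated denotation.
scores = [0.1, 2.0, 0.3, 1.0]
mask = [False, True, False, True]
probs = masked_softmax(scores, mask)
loss = cross_entropy(probs, target_index=1)
```

Because the target is one-hot, the loss reduces to the negative log-probability assigned to the denotation position.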

+ positional word features - The next model variant additionally uses features describing the position at which an answer hint entity occurs among the question entities. These features are concatenated with the word embeddings produced by the previous model and therefore must be generated for every word in the input sequence. A one-hot encoding of the position with two special symbols is used. If the entity does not appear among the question entities, it gets a NULL word symbol. If an input word is not an answer hint entity, it gets a ZERO word symbol. In Figure 3, the first answer hint entity Scooter Libby appears as the first entity in the question, and the second answer hint entity a novel does not appear in the question at all.
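The positional features can be sketched as follows. The vector layout (ZERO first, NULL second, then one slot per question entity position) and the cap on the number of positions are assumptions made for this illustration; the text only specifies a one-hot encoding with the two special symbols.

```python
from typing import List


def positional_features(sequence: List[str],
                        is_answer_entity: List[bool],
                        question_entities: List[str],
                        max_positions: int = 4) -> List[List[int]]:
    """One-hot vector per word: [ZERO, NULL, pos_0, ..., pos_{max_positions-1}]."""
    size = 2 + max_positions
    features = []
    for word, is_entity in zip(sequence, is_answer_entity):
        vec = [0] * size
        if not is_entity:
            vec[0] = 1  # ZERO: the word is not an answer hint entity
        elif (word in question_entities
              and question_entities.index(word) < max_positions):
            vec[2 + question_entities.index(word)] = 1
        else:
            # NULL: the entity is absent from the question
            # (or beyond the assumed position cap).
            vec[1] = 1
        features.append(vec)
    return features


# Linked question followed by the beginning of the linked answer hint.
sequence = ["What", "work", "did", "[Scooter_Libby]", "write",
            "[Scooter_Libby]", "wrote", "[a_novel]"]
flags = [False, False, False, False, False, True, False, True]
feats = positional_features(sequence, flags, ["[Scooter_Libby]"])
```

In this example the answer hint mention of Scooter Libby is marked with position 0, a novel receives the NULL symbol, and all remaining words receive the ZERO symbol.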

+ pretrained glove - To help the system perform better in setups with a small amount of training data, such as ours, the influence of pretrained embeddings (for non-entity words) on model performance was explored. The GloVe [21] embeddings were used due to their simplicity of use in our framework.

4 Evaluation

The proposed models were tested on the QDD which is divided into 950 training dialogues, 285 validation dialogues, and 665 test dialogues. These dialogues include both dialogues with correct answer hints and incorrect/incomplete answer hints. In this work, only the dialogues with correct answer hints are used as it allows simple measurement of the denotation extraction performance. Therefore, subsets of 176 training, 43 validation, and 132 test dialogues from the QDD were selected.

The entity linking models were implemented in C# and optimized for the performance on large KBs. The surrounding context n-gram size was set to 3.

The denotation identification models were implemented in Keras and TensorFlow [23]. They all used an embedding size of 8 and 8 LSTM cells. The pretrained GloVe embedding dimension was 10. The model parameters were optimized by Adam [24] with default hyper-parameters. The training ran for 50 epochs, from which the best model parameters according to the validation accuracy were selected.

The evaluation considers three metrics: the entity linking accuracy, the denotation identification accuracy, and the denotation extraction accuracy.

The entity linking accuracy is measured as the ratio between the number of answer hints containing a denotation (a QDD label) among their entities (i.e. correctly linked answer hints) and the number of all answer hints. The results for the linking algorithm are shown in Table 1. The results suggest that entity linking with relation maximization outperforms entity linking with popularity maximization. Manual inspection of errors showed that the improvement comes from the use of information about relations between linked entities.

The QDD contains only a limited number of samples that can be used for testing. Therefore, the binomial 95% confidence intervals of the results are quite wide. Narrowing those intervals would require substantially more test data, which is not available yet.
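To give a sense of the magnitude, the half-width of such an interval can be estimated with the normal approximation to the binomial distribution. This is only a rough sketch: the interval method actually used in the evaluation is not stated, and the inputs below are the best linking accuracy from Table 1 (.628) and the 132 test dialogues.

```python
import math


def binomial_ci_halfwidth(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation confidence interval for a
    proportion p estimated from n samples (z = 1.96 for 95%)."""
    return z * math.sqrt(p * (1.0 - p) / n)


half = binomial_ci_halfwidth(0.628, 132)
```

Under these assumptions the half-width is on the order of 0.08, which is larger than most of the differences between the compared models.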


Disambiguation method         Accuracy
relation maximization @1      .628
relation maximization @5      .628
popularity maximization @1    .598
popularity maximization @2    .621
popularity maximization @5    .628

Table 1: Accuracy of the entity linking. An answer hint is correctly linked when it contains a denotation. @n means that a correct answer was among the n-best hypotheses.

Manual inspection of the errors showed that most of the unlinked denotations cannot be linked correctly because their wording differs significantly from the names/aliases stored in the KB, and a simple character string edit distance cannot account for these differences. See the following example:

System: What is the nationality of [Steve Rassin]?
User: [Steve Rassin] is American.

The entity [American] was not recognized although the database contains an entity with names like [United States of America], [USA], etc. In the context of interactive learning, it would be possible to learn unknown aliases such as [American] directly from users by actively asking appropriate questions, or from the context.

The denotation identification accuracy measures a ratio of the number of correctly identified denotations and the number of correctly linked answer hints. This measure evaluates quality of the denotation identification assuming a perfect linking algorithm.

The denotation extraction accuracy measures the ratio of the number of correctly identified denotations to the number of all answer hints. This is an end-to-end measure for denotation extraction, representing the overall chance of the proposed algorithms to identify the correct denotation in an answer hint. The results for the proposed denotation identification models are shown in Table 2. As with the linking results, the small test set means that the binomial 95% confidence intervals are wide. The results suggest that the attention-based model with pretrained GloVe embeddings is comparable to, and possibly slightly better than, the rule-based baseline. This may be surprising as the training dataset consists of only 176 training examples. While it is hard to come to firm conclusions given so little training and test data, informal manual inspection of the test results suggests that the neural attention-based model better disambiguates extra information entities from the denotations. It appears the neural model can learn the usage of prepositions before denotations in the context of the corresponding question, which the baseline system cannot.
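The relationship between the three metrics can be sketched on toy data. The example records below are hypothetical and do not come from the QDD; each record holds the gold denotation, the entities produced by the linker for the answer hint, and the entity selected by the identification model.

```python
from typing import Dict, List, Optional, Tuple

Example = Tuple[str, List[str], Optional[str]]


def evaluate(examples: List[Example]) -> Dict[str, float]:
    """Compute linking, identification and extraction accuracy."""
    total = len(examples)
    # Correctly linked: the gold denotation is among the linked entities.
    linked = [ex for ex in examples if ex[0] in ex[1]]
    # Correctly identified: the selected entity equals the gold denotation.
    correct = [ex for ex in linked if ex[2] == ex[0]]
    return {
        "entity_linking": len(linked) / total,
        "denotation_identification": len(correct) / len(linked) if linked else 0.0,
        "denotation_extraction": len(correct) / total,
    }


toy = [
    ("e:australia", ["e:sydney", "e:australia"], "e:australia"),
    ("e:hawaii", ["e:usa_president", "e:hawaii"], "e:usa_president"),
    ("e:apprentice_novel", ["e:apprentice_novel"], "e:apprentice_novel"),
    ("e:usa", ["e:steve_rassin"], "e:steve_rassin"),  # alias not linked
]
metrics = evaluate(toy)
```

By construction, the extraction accuracy equals the product of the linking accuracy and the identification accuracy, so linking errors bound the end-to-end result from above.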

Model                         Accuracy d.i.   Accuracy d.e.
Basic cancellation            .768            .477
+ enum. detection             .768            .477
+ context ngrams              .780            .485
Bidir attention               .639            .402
+ positional word features    .723            .455
+ pretrained glove            .793            .492

Table 2: Denotation identification accuracy (d.i.) and denotation extraction accuracy (d.e.) when using our best linker. Notice that adding enum. detection did not improve test performance. However, it has been shown to be useful on the dev and train splits.

The typical errors are caused by the vague nature of some questions in the QDD. For example, in the case of “Who lives in the New York City?”, millions of people live in the city and the user can answer with any subset of New York citizens. However, the QDD label contains only one of them, hurting the accuracy of the model.

The code for denotation identification is available on GitHub.

5 Discussion

The evaluation shows that denotations can be automatically extracted from data obtained interactively from a dialogue system’s users. The advantage of this approach compared to extraction from non-interactive sources is the ability of the system to immediately confirm its hypothesis (making sure the extraction was successful). This has proven crucial in projects like NELL [25]. Without humans in the loop, the NELL system was not able to automatically learn facts with high accuracy.

The extracted denotations can be used for learning natural language understanding (NLU) components as shown in [10, 11, 12]. Therefore, a dialogue system built around such NLU components, continuously trained on automatically extracted denotations, can adapt to long-term changes in language.

Another interesting field where denotations may be useful is the knowledge base population task [13]. In this task, a KB is typically expanded by adding information extracted from off-line documents (e.g. Wikipedia pages). However, this information can also be inferred from the denotations. Therefore, denotation extraction from answer hints allows expanding the KB by adding information obtained interactively from users. The advantage of the latter approach is the possibility to focus the extraction effort on entities of the system’s interest, thus making the process more efficient. In addition, the system can confirm extracted facts, ensuring the quality of the extracted information.

6 Conclusion

We have presented a novel task which aims to support interactively learned dialogue systems by extracting denotations from natural dialogues with real users. We also proposed a method for solving the task and evaluated it on the Question Dialogue Dataset.

The experiments showed that a reasonable amount of useful information can be extracted even with our simple rule-based baseline algorithms. Also, we have shown that it is possible to train a neural attention-based model which slightly outperforms the baselines using fewer than 200 training examples.

In future work, we plan to extend the model to support multiple-entity denotations in order to deal with answer hints that offer multiple options in response to overly broad (vague) questions. Further, we plan to devise methods for interactive learning of unknown aliases of entities in a KB to improve the linking accuracy.


This work was funded by the Ministry of Education, Youth and Sports of the Czech Republic, SVV project 260 453, and GAUK grant 1170516 of Charles University in Prague. It used language resources stored and distributed by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2015071).