Non-Sentential Utterances in Dialogue: Experiments in Classification and Interpretation

Paolo Dragone et al. · 11/22/2015

Non-sentential utterances (NSUs) are utterances that lack a complete sentential form but whose meaning can be inferred from the dialogue context, such as "OK", "where?", or "probably at his apartment". The interpretation of non-sentential utterances is an important problem in computational linguistics, since they are a frequent phenomenon in dialogue and they are intrinsically context-dependent. The interpretation of NSUs is the task of retrieving their full semantic content from their form and the dialogue context. The first half of this thesis is devoted to the NSU classification task. Our work builds upon Fernández et al. (2007), which presents a series of machine-learning experiments on the classification of NSUs. We extend their approach with a combination of new features and semi-supervised learning techniques. The empirical results presented in this thesis show a modest but significant improvement over the state-of-the-art classification performance. The subsequent, yet independent, problem is how to infer an appropriate semantic representation of such NSUs on the basis of the dialogue context. Fernández (2006) formalizes this task in terms of "resolution rules" built on top of Type Theory with Records (TTR). Our work focuses on reimplementing the resolution rules from Fernández (2006) with a probabilistic account of the dialogue state. The probabilistic rules formalism of Lison (2014) is particularly suited for this task because, similarly to the framework developed by Ginzburg (2012) and Fernández (2006), it involves the specification of update rules on the variables of the dialogue state to capture the dynamics of the conversation. However, probabilistic rules can also encode probabilistic knowledge, thereby providing a principled account of ambiguities in the NSU resolution process.


1.1 Motivation

Non-sentential utterances are interesting in many ways. First of all, they are very frequent in dialogue. According to Fernandez:2002 and related works, the frequency of NSUs in the dialogue transcripts of the British National Corpus is about 10% of the total number of utterances. However, this number may vary greatly if one takes into account a larger variety of phenomena or different dialogue domains; e.g. Schlangen:thesis estimates the frequency of NSUs at 20% of the total number of utterances.

Despite their ubiquity, the semantic content of NSUs is often difficult to extract automatically. Non-sentential utterances are indeed intrinsically dependent on the dialogue context: it is impossible to make sense of them without access to the surrounding context. Their high context-dependency makes their interpretation a difficult problem from both a theoretical and a computational point of view.

NSUs form a wide range of linguistic phenomena that need to be considered in the formulation of a theory of dialogue context. Only a few previous works have tackled this problem directly, and most of them are situated in the theoretical semantics of dialogue, without addressing possible applications. The interpretation of NSUs thus remains an understudied problem, which makes it an all the more interesting subject.

1.2 Contribution

Our work follows two parallel paths. On the one hand, we address the problem of the classification of NSUs by extending the work of Fernandez:2007. On the other hand, we propose a novel approach to the resolution of NSUs using probabilistic rules Lison2015.

The classification task is needed to select the resolution procedure, but it is nonetheless an independent problem that can arise in many different situations. Our contribution to this problem is a small but significant improvement over the accuracy of previous works, as well as the exploration of one way to tackle the scarcity of labeled data.

Our work on the resolution of NSUs takes inspiration from Fernandez:thesis and Ginzburg:interactivestance, which provide the theoretical background for our study. Their framework is, however, purely logic-based, and can therefore have drawbacks in dealing with raw conversational data, which often contains hidden or partially observable variables. To this end, a probabilistic account of the dialogue state is preferable. In our work we implement a new approach to NSU resolution based on the probabilistic rules formalism of Lison2015. Probabilistic rules are similar, in some ways, to the rules formalized by Ginzburg:interactivestance, as both express updates on the dialogue state given a set of conditions. However, probabilistic rules can also take into account probabilistic knowledge, making them better suited to deal with the uncertainty often associated with conversational data. Our work does not aim to provide a full theory of NSU resolution, but rather to serve as a proof of concept for the resolution of NSUs via the probabilistic rules formalism. Nevertheless, we detail a large set of NSU resolution rules based on the probabilistic rules formalism and provide an actual implementation of a dialogue system for NSU resolution using the OpenDial toolkit semdial2015_opendial, which can serve as a baseline for future developments.

Our work for this thesis has produced the following publications:

  • Paolo Dragone and Pierre Lison. Non-Sentential Utterances in Dialogue: Experiments in classification and interpretation. In: Proceedings of the 19th workshop on the Semantics and Pragmatics of Dialogue, SEMDIAL 2015 – goDIAL, p. 170. Göteborg, 2015.

  • Paolo Dragone and Pierre Lison. An Active Learning Approach to the Classification of Non-Sentential Utterances. In: Proceedings of the second Italian Conference on Computational Linguistics, CLiC-IT 2015, in press. Trento, 2015.

1.3 Outline

Chapter 2

This chapter discusses the background knowledge needed for the development of the following chapters. In particular, it describes the concept of non-sentential utterance and the task of interpreting NSUs, with an emphasis on previous work. Secondly, the chapter contains an overview of the formal representation of the dialogue context from the theory of Ginzburg:interactivestance. We briefly discuss Type Theory with Records, the semantic representation of utterances and the update rules on the dialogue context. Finally, we introduce the probabilistic approach to the definition of the dialogue context from LisonThesis2014. We discuss the basics of Bayesian networks (the dialogue context representation) and the probabilistic rules formalism.

Chapter 3

This chapter describes the task of classifying non-sentential utterances. It provides details on our approach, starting from the replication of the work of Fernandez:2007, which we use as a baseline. We then discuss the extended feature set and the semi-supervised learning techniques we employed in our experiments. Lastly, we discuss the empirical results we obtained.

Chapter 4

This chapter describes the problem of resolving non-sentential utterances and our approach to addressing it through probabilistic rules. First we formalize the NSU resolution task and describe the theoretical notions needed to address it. We then describe our dialogue context design as a Bayesian network and our formulation of the resolution rules as probabilistic rules. Finally, we describe our implementation and an extended example of its application to a real-world scenario.

Chapter 5

This conclusive chapter summarizes the work and describes possible directions for future work.

2.1 Non-Sentential Utterances

From a linguistic perspective, "non-sentential utterances" – also known as fragments – has historically been an umbrella term for many elliptical phenomena that often take place in dialogue. Before giving our own definition of non-sentential utterances, we shall start by quoting the definition given by Fernandez:thesis:

“In a broad sense, non-sentential utterances are utterances that do not have the form of a full sentence according to most traditional grammars, but that nevertheless convey a complete sentential meaning, usually a proposition or a question.”

This is indeed a very general definition, whereas a perhaps simpler approach is taken by Ginzburg:interactivestance which defines NSUs as “utterances without an overt predicate”. The minimal clausal structure of a sentence in English (as in many other languages) is composed of at least a noun phrase and a verb phrase. However, in dialogue the clausal structure is often truncated in favor of shorter sentences that can be understood by inferring their meaning from the surrounding context. We are interested in those utterances that, despite the lack of a complete clausal structure, convey a well-defined meaning given the dialogue context.

The context of an NSU can comprise any variable in the dialogue context, but it usually suffices to consider only the antecedent of the NSU. The "antecedent" of an NSU is the utterance in the dialogue history that can be used to infer its underspecified semantic content. For instance, the NSU in the example below can be interpreted as "Paul went to his apartment" by extracting its semantic content from the antecedent. Generally, it is possible to understand the meaning of an NSU by looking at its antecedent.

  A: Where did Paul go?
  B: To his apartment.

It is often the case that an NSU and its antecedent present a certain degree of parallelism. Usually the meaning of an NSU is associated with a certain aspect of the antecedent. As described in Ginzburg:interactivestance, the parallelism between an NSU and its antecedent can be of a syntactic, semantic or phonological nature. The NSU in the example above presents syntactic parallelism – the use of "his" is syntactically constrained by the fact that Paul is a male individual – as well as semantic parallelism – the content of the NSU is a location, as constrained by the where interrogative. This parallelism is one of the properties of NSUs that can be exploited in their interpretation (more details in Chapter 4). Even though it is often the case, the antecedent of an NSU is not always the preceding utterance, especially in multi-party dialogues.

2.1.1 A taxonomy of NSUs

As briefly mentioned in Chapter 1, non-sentential utterances come in a large variety of forms. We can categorize NSUs on the basis of their form and their intended meaning. For instance, NSUs can be affirmative or negative answers to polar questions, requests for clarification, or corrections.

In order to classify the NSUs, we use the taxonomy defined by Fernandez:2002. This is a wide-coverage taxonomy resulting from a corpus study on a portion of the British National Corpus burnard2000reference. Table 2.1 contains a summary of the taxonomy with an additional categorization of the classes by their function, as defined by Fernandez:thesis and later refined by Ginzburg:interactivestance.

Other taxonomies of NSUs are available from previous works, e.g. Schlangen:thesis, but we opted for the one from Fernandez:2002 because it has been used in an extensive machine learning experiment by Fernandez:2007 and it is also used in the theory of Ginzburg:interactivestance, which is our reference for the resolution part of our investigation. A detailed comparison of this taxonomy with others is given by Fernandez:thesis, which also details the corpus study on the BNC that led to its definition.

Function                     NSU class
Positive Feedback            Plain Acknowledgment
                             Repeated Acknowledgment
Metacommunicative queries    Clarification Ellipsis
                             Check Question
                             Sluice
                             Filler
Answers                      Short Answer
                             Affirmative Answer
                             Rejection
                             Repeated Affirmative Answer
                             Helpful Rejection
                             Propositional Modifier
Extension Moves              Factual Modifier
                             Bare Modifier Phrase
                             Conjunct fragment
Table 2.1: Overview of the classes in the taxonomy, further categorized by their function.

A brief description of each class follows, with some examples. Fernandez:thesis provides more details about the rationale of each class.

Plain Acknowledgment

Acknowledgments are used to signal understanding or acceptance of the preceding utterance, usually using words or sounds like yeah, right, mhm.

  A: I shall be getting a copy of this tape.
  B: Right.
    [BNC: J42 71–72]

Repeated Acknowledgment

This is another type of acknowledgment that makes use of repetition or reformulation of some constituent of the antecedent to show understanding.

  A: Oh so if you press enter it’ll come down one line.
  B: Enter.
    [BNC: G4K 102–103]

Clarification Ellipsis

These are NSUs that are used to request a clarification of some aspect of the antecedent that was not fully understood.

  A: I would try F ten.
  B: Just press F ten?
    [BNC: G4K 72–73]

Check Question

Check Questions are used to request explicit feedback of understanding or acceptance, and are usually uttered by the same speaker as the antecedent.

  A: So (pause) I’m allowed to record you. Okay?
  B: Yes.
    [BNC: KSR 5–6]

Sluice

Sluices are used to request additional information related to, or left underspecified in, the antecedent.

  A: They wouldn’t do it, no.
  B: Why?
    [BNC: H5H 202–203]

Filler

These are fragments used to complete a previous unfinished utterance.

  A: […] would include satellites like erm
  B: Northallerton.
    [BNC: H5D 78–79]

Short Answer

NSUs that are typically answers to wh-questions.

  A: What’s plus three times plus three?
  B: Nine.
    [BNC: J91 172–173]

Plain Affirmative Answer and Plain Rejection

A type of NSUs used to answer polar questions using yes-words and no-words.

  A: Have you settled in?
  B: Yes, thank you.
    [BNC: JSN 36–37]

  A: (pause) Right, are we ready?
  B: No, not yet.
    [BNC: JK8 137–138]

Repeated Affirmative Answer

NSUs used to give an affirmative answer by repeating or reformulating part of the query.

  A: You were the first blind person to be employed in the County Council?
  B: In the County Council, yes.
    [BNC: HDM 19–20]

Helpful Rejection

Helpful Rejections are used to correct some piece of information from the antecedent.

  A: Right disk number four?
  B: Three.
    [BNC: H61 10–11]

Propositional and Factual Modifiers

Used to add modal or attitudinal information to the previous utterance. They are usually expressed (respectively) by modal adverbs and exclamatory factual (or factive) adjectives.

  A: Oh you could hear it?
  B: Occasionally yeah.
    [BNC: J8D 14–15]

  A: You’d be there six o’clock gone mate.
  B: Wonderful.
    [BNC: J40 164–165]

Bare Modifier Phrase

Modifiers that behave like non-sentential adjuncts modifying a contextual utterance.

  A: […] then across from there to there.
  B: From side to side.
    [BNC: HDH 377–378]

Conjunct

A Conjunct is a modifier that extends a previous utterance through a conjunction.

  A: I’ll write a letter to Chris
  B: And other people.
    [BNC: G4K 19–20]

2.1.2 The NSU corpus

The taxonomy presented in the previous section is the result of a corpus study on a portion of the dialogue transcripts in the British National Corpus, first carried out by Fernandez:2002 and then refined by Fernandez:thesis. The dialogue transcripts used in the corpus study contain both two-party and multi-party conversations and cover a wide variety of dialogue domains, including free conversation, interviews, seminars and more. Fernandez:thesis also describes the annotation procedure and a reliability test. The reliability test was carried out on a subset of the annotated instances, comparing the manual annotations of three annotators. The test showed good agreement between the annotators in terms of the kappa statistic, from which it is also clear that humans can reliably distinguish between the NSU classes in the taxonomy. Fernandez:thesis provides the complete analysis of the corpus.

In total, the annotators examined a large sample of sentences from the selected BNC files, extracting a corpus of NSUs that amounts to a small fraction of the sentences examined. Of the extracted NSUs, 1,283 were successfully categorized according to the defined taxonomy. Table 2.2 shows the distribution of the classes in the corpus.

NSU Class Total %
Plain Acknowledgment (Ack) 599 46.1
Short Answer (ShortAns) 188 14.5
Affirmative Answer (AffAns) 105 8.0
Repeated Acknowledgment (RepAck) 86 6.6
Clarification Ellipsis (CE) 82 6.3
Rejection (Reject) 49 3.7
Factual Modifier (FactMod) 27 2.0
Repeated Affirmative Answer (RepAffAns) 26 2.0
Helpful Rejection (HelpReject) 24 1.8
Check Question (CheckQu) 22 1.7
Sluice 21 1.6
Filler 18 1.4
Bare Modifier Phrase (BareModPh) 15 1.1
Propositional Modifier (PropMod) 11 0.8
Conjunct (Conj) 10 0.7
Total 1283 100.0
Table 2.2: The distribution of the classes in the NSU corpus.

The annotated instances were also tagged with a reference to the antecedent of the NSU. The large majority of annotated NSUs (about 87%) have their immediately preceding utterance as antecedent. Fernandez:thesis describes a study of the distance between NSUs and their antecedents, with a comparison between two-party and multi-party dialogues.

2.1.3 Interpretation of NSUs

Due to their incomplete form, non-sentential utterances do not have an exact meaning by themselves. They need to be "interpreted", i.e. their intended meaning must be inferred from the dialogue context. One way to interpret NSUs is developed by Fernandez:thesis, in turn based on Schlangen:thesis, and it consists of two consecutive steps, namely the classification and the resolution of the NSUs. The first step in the interpretation of an NSU is its classification, i.e. finding its class according to the taxonomy described in Section 2.1.1. As demonstrated by Fernandez:2007, we can infer the class of an NSU using machine learning, i.e. we can train a classifier on the corpus detailed in Section 2.1.2 and use it to classify unseen NSU instances. The class of an NSU is then used to determine the right resolution procedure to apply. The resolution of an NSU is the task of recovering its full clausal meaning from its incomplete form on the basis of contextual information. Fernandez:thesis describes a resolution procedure in terms of rules that, given some preconditions on the antecedent and other elements of the dialogue state, build the semantic representation of the NSU. This approach to the resolution of NSUs has been the basis of several implementations of dialogue systems handling NSU resolution, such as Fernandez:shards and Purver:2006.

Extending the interpretation problem to raw conversational data, we also need a way to "detect" NSUs, i.e. to decide whether an utterance should be considered an NSU in the first place. Since this is not our direct concern, we employ in our experiments a simple set of heuristics to distinguish between NSU and non-NSU utterances (see Section 3.5.1).

2.2 A formal model of dialogue

As the theoretical basis of our work we rely on the theory of dialogue context put forward by Ginzburg:interactivestance, which presents a grammatical framework expressly developed for dialogue. The claim of Ginzburg:interactivestance is that the rules that encode the dynamics of dialogue have to be built into the grammar itself. The grammatical framework is formulated using Type Theory with Records cooper2005records. Type Theory with Records (TTR) is a logical formalism developed to cope with the semantics of natural language. TTR is used to build a semantic ontology of abstract entities and events as well as to formalize the dialogue gameboard, i.e. a formal representation of the dialogue context and its rules. The evolution of the conversation is formalized by means of update rules on the dialogue context. Ginzburg:interactivestance also accounts for NSUs and provides a set of dedicated rules.

2.2.1 Type Theory with Records

We now briefly introduce the basic notions of Type Theory with Records (TTR), with just enough detail to understand the following sections, referring to Ginzburg:interactivestance for a complete description.

In TTR, objects can be of different types. The statement $a : T$ is a typing judgment, indicating that the object $a$ is of type $T$. If $a$ is of type $T$, $a$ is said to be a witness of $T$. Types can either be basic (atomic), such as IND (the type of a generic "individual"), or complex, i.e. dependent on other objects or types, such as $\mathrm{drive}(x)$. Types also include constructs such as lists, sets and so on. Other useful constructs are records and record types. A record contains a set of assignments between labels and values, whereas a record type contains a set of judgments between labels and types.
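For illustration, a record (left) pairs labels with values, while a record type (right) pairs the same labels with types; the labels and the predicate below are our own illustrative choices rather than an example taken from Ginzburg:interactivestance:

\[
r = \begin{bmatrix} \mathrm{x} = \mathrm{paul} \\ \mathrm{c} = \mathrm{prf}_1 \end{bmatrix}
\qquad
T = \begin{bmatrix} \mathrm{x} : \mathrm{IND} \\ \mathrm{c} : \mathrm{drive}(\mathrm{x}) \end{bmatrix}
\]

where $\mathrm{prf}_1$ stands for a witness of the fact that paul drives.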

A record $r$ is of record type $T$ if and only if each value in $r$ is of the type assigned to the corresponding label in $T$. The typing judgment can be used to indicate that the record $r$ is of record type $T$, written $r : T$.

TTR also provides function types of the form $(T_1 \rightarrow T_2)$, which map records of type $T_1$ to records of type $T_2$. Functional application is indicated as $f(r)$.

Utterance representation

At the basis of the grammatical framework of Ginzburg:interactivestance lies the notion of proposition. Propositions are entities used to represent facts, events and situations, as well as to characterize the communicative process. In TTR, propositions are records of type Prop, pairing a situation (a record, under the label sit) with a situation type (a record type, under the label sit-type). A simple example is the proposition expressed by the sentence "Paul drives a car".
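A minimal TTR sketch of this proposition (our rendering; the labels and predicates are illustrative, and the actual representation in Ginzburg:interactivestance carries more structure):

\[
\begin{bmatrix}
\mathrm{sit} = s_1 \\
\mathrm{sit\text{-}type} =
  \begin{bmatrix}
    \mathrm{x} : \mathrm{IND} \\
    \mathrm{c_1} : \mathrm{named}(\mathrm{x}, \mathit{paul}) \\
    \mathrm{y} : \mathrm{IND} \\
    \mathrm{c_2} : \mathrm{car}(\mathrm{y}) \\
    \mathrm{c_3} : \mathrm{drive}(\mathrm{x}, \mathrm{y})
  \end{bmatrix}
\end{bmatrix}
\]

Intuitively, the proposition is true if the situation $s_1$ is indeed of the given situation type.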


Questions, on the other hand, are represented as propositional abstracts, i.e. functions from the question domain to propositions, following the definition of Fernandez:thesis.

The question domain is a record type containing the wh-restrictors of the question (Ginzburg:interactivestance extends this field to a list of record types, to take into account situations with multiple question domains). The wh-restrictors are record types that characterize the information needed to resolve a wh-question: e.g. for a where interrogative the answer must be a place, whereas for a when interrogative it must be a time. Clearly, the right wh-restrictor depends on the wh-interrogative used. Consider the following example of a wh-question:

Who drives?

Here the question domain of the who interrogative is an individual x that is a person.
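A sketch of this question as a propositional abstract (again our rendering; the formulation in Ginzburg:interactivestance is richer):

\[
\lambda r :
\begin{bmatrix} \mathrm{x} : \mathrm{IND} \\ \mathrm{c} : \mathrm{person}(\mathrm{x}) \end{bmatrix}
\;.\;
\begin{bmatrix}
  \mathrm{sit} = s_1 \\
  \mathrm{sit\text{-}type} = \begin{bmatrix} \mathrm{c_1} : \mathrm{drive}(r.\mathrm{x}) \end{bmatrix}
\end{bmatrix}
\]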

Polar questions, i.e. bare yes/no-questions, are represented as propositional abstracts like wh-questions, with the difference that their question domain is an empty record type. An example of a polar question:

Does Paul drive?

A special type of proposition is used to represent the content of conversational moves, which needs to take into account the relation that holds between the speaker, the addressee and the content of the move. These are called illocutionary propositions (of type IllocProp), and the relation they contain is called an illocutionary relation (also called an illocutionary act or dialogue act). Illocutionary relations indicate the function of a proposition, such as "Assert", "Ask", "Greet". For a proposition p, the illocutionary proposition that holds p as its content can be indicated as R(spkr, addr, p), where R is the illocutionary relation and spkr and addr refer respectively to the speaker and the addressee (for brevity, only the semantic content of the illocutionary proposition is shown here). Examples of illocutionary propositions are:

Assert(spkr : IND, addr : IND, p : Prop)
Ask(spkr : IND, addr : IND, q : Question)

2.2.2 The dialogue context

In Ginzburg:interactivestance, the dialogue context – also known as the Dialogue Gameboard (DGB) – is a formal representation that describes the current state of the dialogue. It includes a wide range of variables needed to handle different aspects of the dialogue. However, we concentrated on the most basic ones:

  • Facts, a set of known facts;

  • LatestMove, the latest move made in the dialogue;

  • QUD, a partially ordered set of questions under discussion.

The DGB can itself be represented in TTR as a record.
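A simplified version of the corresponding record type (our abstraction, restricted to the three fields above; Ginzburg's full DGB type includes further fields, as noted below):

\[
\mathrm{DGB} =
\begin{bmatrix}
  \mathrm{Facts} : \mathrm{Set}(\mathrm{Prop}) \\
  \mathrm{LatestMove} : \mathrm{IllocProp} \\
  \mathrm{QUD} : \mathrm{poset}(\mathrm{Question})
\end{bmatrix}
\]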

The elements of the DGB represent the common ground of the conversation, shared between all the participants. In this representation we abstract away several details that would be included in the actual DGB presented by Ginzburg:interactivestance, such as the fields tracking who holds the turn, the current time and so on. We now detail the basic variables of the DGB.

Facts

Facts is a set of known facts, shared by all the conversational participants. The elements of Facts are propositions, which are assumed to be sufficient to encode the knowledge of the participants within the context of the dialogue. Facts encodes all the records that have been accepted by all participants, i.e. facts that will not raise issues in the future development of the conversation. A complementary problem, which we only marginally address, is the understanding – or grounding – of an utterance. Ginzburg:interactivestance develops a comprehensive theory of grounding, but we do not include it in our work.

LatestMove

Dialogue utterances are typically coherent responses to the preceding utterances, which is why it is important to keep track of the history of the dialogue. In a two-party dialogue it is usually the case that the current utterance is a response to the previous one, whereas in a multi-party dialogue it can be useful to keep track of a larger window of the dialogue history. Ginzburg:interactivestance keeps track of the history of the dialogue in the variable Moves, while a reference to the latest (illocutionary) proposition is recorded in the field LatestMove.

Qud

QUD is a set of questions under discussion. In a general sense, a “question under discussion” represents an issue being raised in the conversation which drives the future discussion. Despite the name, QUDs may arise from both questions and propositions.

Ginzburg:interactivestance defines QUD as a partially ordered set (poset). Its ordering determines the priority of the issues to be resolved. Of particular importance is the first element of the set according to the defined ordering, which is taken as the topic of discussion of the subsequent utterances until it is resolved. This element is referred to as MaxQUD.

The formalization of the ordering is a rather complex matter in a generic theory of context that needs to account for the beliefs of the participants and it is especially problematic when dealing with multi-party dialogues. The usage of QUD is of particular importance in our case because the MaxQUD is used as the antecedent in the interpretation of NSUs.

2.2.3 Update rules

The dynamics of the DGB are defined by a set of update rules – also called conversational rules – which are applied to the DGB throughout the course of the conversation. Update rules are formalized as a set of effects on the parameters of the DGB given that certain preconditions hold. An update rule can thus be represented as a pair of fields, pre and effects, where both are subsets of the parameters of the DGB: they respectively represent the necessary conditions for the application of the rule and the values of the involved variables right after the application of the rule.

Ginzburg:interactivestance defines all sorts of rules needed to handle a great variety of conversational protocols. Rules that are particularly interesting with respect to our work are those that handle queries and assertions as well as the ones that describe the dynamics of QUD and Facts.

The following rule describes how QUD is incremented when a question is posed:
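Schematically, the rule can be rendered as follows (our paraphrase of Ginzburg's Ask QUD-incrementation rule; the original TTR formulation is considerably richer):

\[
\begin{array}{ll}
\textbf{pre:} & \mathrm{LatestMove} = \mathrm{Ask}(\mathit{spkr}, \mathit{addr}, q) \\
\textbf{effects:} & \mathrm{QUD}' = \mathrm{push}(q, \mathrm{QUD})
\end{array}
\]

that is, the question $q$ becomes the maximal element of QUD.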

As argued above, issues are also raised by assertions: an analogous rule pushes the issue raised by an asserted proposition onto QUD.

The act of answering a question is nothing more than asserting a proposition that resolves that question. As a consequence, the other speaker can either raise another issue related to the previous one or accept the fact that the issue has been resolved; the acceptance move is realized by a dedicated rule.

The speaker can also query the addressee with a Check move in order to ask for an explicit acknowledgment (Confirm) of a question-resolving assertion (the rules for the Check and Confirm moves are omitted for brevity). Acceptance and confirmation lead to an update of Facts and to a "downdate" of QUD, i.e. the removal of the resolved questions from QUD.

While QUD represents the unresolved issues that have been introduced in the dialogue, Facts contains the issues that have been resolved, which is why their update rules are closely related. The function NonResolve in the fact-update rule checks which issues are resolved by the newly added facts and leaves the unresolved ones in QUD.

2.3 Probabilistic modeling of dialogue

In the previous section we detailed the logic-based model of dialogue from Ginzburg:interactivestance. Another possible approach to dialogue modeling relies on probabilistic models to encode the variables and the dynamics of the dialogue context. Arguably, this approach is more robust to the intrinsic uncertainty present in dialogue, which is part of the reason why we explored this strategy; further advantages will be discussed in Chapter 4.

We base our work on the probabilistic rules formalism developed by Lison:2012. This formalism is particularly suited to our purpose because of its commonalities with the update rules described in Section 2.2.3. The probabilistic rules formalism is based on a representation of the dialogue state as a Bayesian network. In this section we briefly describe how Bayesian networks are structured, then we detail the probabilistic rules formalism that we employ in Chapter 4 to model the resolution of NSUs.

2.3.1 Bayesian Networks

Bayesian networks are probabilistic graphical models (probabilistic models represented by graphs) representing a set of random variables (nodes) and their conditional dependency relations (edges). A Bayesian network is a directed acyclic graph, i.e. a directed graph that does not contain cycles (two random variables cannot be mutually dependent). Given the random variables $X_1, \ldots, X_n$ in a Bayesian network, we are interested in their joint probability distribution $P(X_1, \ldots, X_n)$. In general, the size of the joint distribution is exponential in the number $n$ of variables, therefore it becomes difficult to estimate as $n$ grows. In the case of Bayesian networks we can exploit conditional independence to reduce the complexity of the joint distribution. Given three random variables $A$, $B$ and $C$, $A$ and $B$ are said to be conditionally independent given $C$ if and only if $P(A \mid B, C) = P(A \mid C)$ for all combinations of values.

For a variable $X_i$ in the network, we define $\mathrm{Parents}(X_i)$ as the set of variables $X_j$ such that there is a direct edge from $X_j$ to $X_i$. Given a topological ordering of the variables (nodes) of the Bayesian network (an ordering of the nodes such that, for every directed edge from $X_j$ to $X_i$, $X_j$ appears before $X_i$; a topological ordering can only be defined on directed acyclic graphs), a variable $X_i$ is conditionally independent from all its predecessors that are not in $\mathrm{Parents}(X_i)$, therefore the joint probability distribution can be factorized as follows:

\[
P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Parents}(X_i))
\]

For each variable $X_i$, $P(X_i \mid \mathrm{Parents}(X_i))$ is the conditional probability distribution (CPD) of $X_i$. The CPDs together with the directed graph fully determine the joint distribution of the Bayesian network.
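As a small worked example (ours), consider a network with three variables $A$, $B$ and $C$ and the edges $A \rightarrow B$ and $A \rightarrow C$. Then $\mathrm{Parents}(B) = \mathrm{Parents}(C) = \{A\}$ and the joint distribution factorizes as

\[
P(A, B, C) = P(A)\,P(B \mid A)\,P(C \mid A),
\]

which requires far fewer parameters than the full joint table when the variables take many values.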

The network can be used for inference by querying the distribution of a subset of variables, usually given some evidence. Given a subset of query variables $Q$ and an assignment $e$ of values to the evidence variables $E$, the query is the posterior distribution $P(Q \mid E = e)$. To compute the posterior distribution one needs an inference algorithm. Such an algorithm can be exact – such as the variable elimination algorithm zhang1996exploiting – or approximate – such as the loopy belief propagation algorithm murphy1999loopy.

The distributions of the single variables can be learned from observed data using maximum likelihood estimation or Bayesian learning.

2.3.2 Probabilistic rules

The probabilistic rules formalism is a domain-independent dialogue modeling framework. Probabilistic rules are expressed as if … then … else constructs mapping logical conditions on the state variables to effects encoded by either probability distributions or utility functions. The former are called probability rules while the latter are utility rules. While we make use of both types of rules in our work, here we concentrate only on the probability rules which are the ones used for the resolution of the NSUs.

Let $c_1, \ldots, c_n$ be a sequence of logical conditions and $d_1, \ldots, d_n$ a sequence of categorical probability distributions (a categorical distribution is a probability distribution over an event with a finite set of outcomes, each with a defined probability) over mutually exclusive effects. A probability rule maps each condition $c_i$ to its distribution $d_i$ over effects.
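Schematically, a probability rule with an underspecified variable x can be rendered as follows (our paraphrase of the notation; see LisonThesis2014 for the precise definition):

\[
\forall x:\;
\textbf{if } c_1(x) \textbf{ then } \{P(e_{1,1}) = p_{1,1}, \ldots, P(e_{1,k}) = p_{1,k}\}
\;\textbf{ else if } c_2(x) \textbf{ then } \{\ldots\}
\;\ldots
\]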

The random variable associated with each branch encodes a range of possible effects $e_{i,1}, \ldots, e_{i,k}$, each one with a corresponding probability $p_{i,j}$. The conditions and effects of a rule may include underspecified variables, denoted with x, which are universally quantified at the top of the rule. The effects are duplicated for every possible assignment (grounding) of the underspecified variables.

Each pair of a condition and a probability distribution over effects is a branch of the rule; overall, the rule is a sequence of branches $(c_1, d_1), \ldots, (c_n, d_n)$. The rule is "executed" by running sequentially through the branches: only the first satisfied condition triggers its probabilistic effect, and the subsequent branches are ignored (as in an if/else-if cascade in programming languages).

The dialogue state is represented as a Bayesian network containing a set of nodes (random variables). At each state update, rules are instantiated as nodes in the network. For each rule, the input edges of the rule node come from the condition variables, whereas the output edges go towards the effect variables. The probability distribution of the rule node is obtained by executing the rule; the distributions of the effect variables are then retrieved by probabilistic inference. LisonThesis2014 details the rules and the update procedure.

The probabilistic rules are useful in at least three ways:

  • They are expressly designed for dialogue modeling. They combine the expressivity of both probabilistic inference and first order logic. This is an advantage in dialogue modeling where one has to describe objects that relate to each other in the dialogue domain and, at the same time, handle uncertain knowledge of the state variables.

  • They can cope with the scarcity of training data of most dialogue domains by exploiting the internal structure of the dialogue models. By using logical formulae to encode the conditions for a possible outcome, it is possible to group the values of the variables into partitions, reducing the number of parameters needed to infer the outcome distribution and therefore the amount of data needed to learn the distribution.

  • The state update is handled with probabilistic inference, therefore the rules can operate in uncertain settings, which is often needed in dialogue modeling, where variables are best represented as belief states, continuously updated by observed evidence.

The probabilistic rules formalism has also been implemented in a framework called OpenDial semdial2015_opendial. OpenDial is a Java toolkit for developing spoken dialogue systems using the probabilistic rules formalism. Using an XML-based language, one can define in OpenDial the probabilistic rules that handle the evolution of the dialogue state in a domain-independent way. OpenDial can either work on existing transcripts or as an interactive user interface. It can also learn rule parameters from small amounts of data using either supervised or reinforcement learning.

2.4 Summary

In this chapter we discussed the background knowledge needed to describe our work on non-sentential utterances. We first described the notion of non-sentential utterance and the problem of interpreting them. We showed how those utterances can be categorized with the taxonomy from Fernandez:2002. We described how the interpretation of non-sentential utterances can be addressed by first classifying them according to the aforementioned taxonomy and then applying a "resolution" procedure to extract their meaning from the dialogue context. In Chapter 3 we address the NSU classification problem on the basis of the experiments from Fernandez:2007, while in Chapter 4 we address the NSU resolution task. Fernandez:thesis describes a set of NSU resolution rules rooted in a TTR representation of the dialogue context; Section 2.2 briefly described the TTR notions we employ as well as the dialogue context theory based on TTR from Ginzburg:interactivestance.

Finally, we described the probabilistic modeling of dialogue from LisonThesis2014, based on the probabilistic rules formalism. As mentioned in Chapter 1, this formalism is the framework for our formulation of the NSU resolution rules, built on the one developed by Fernandez:thesis. Section 2.3 described the basic notions of Bayesian networks, which provide the representation of the dialogue state employed by the probabilistic rules formalism. Section 2.3.2 then explained the probabilistic rules formalism itself and its advantages.

3.1 The data

The corpus from Fernandez:2007 contains annotated NSU instances, each identified by the name of the containing BNC file and its sentence number, a sequential number that uniquely identifies a sentence within a dialogue transcript. The instances are also tagged with the sentence number of their antecedent, which makes up the context for the classification. The raw utterances can be retrieved from the BNC using this information.

For the classification task, we make the same simplifying restriction on the corpus as Fernandez:2007, that is, we consider only the NSUs whose antecedent is their immediately preceding sentence. This assumption facilitates the feature extraction procedure without significantly reducing the size of the dataset (about 87% of the instances remain). The resulting distribution of the NSUs after the restriction is shown in Table 3.1.

As one can see from Table 3.1, the distribution of the instances is quite skewed, with some classes far more frequent than others. Moreover, the most frequent classes are usually the easiest to classify, leaving the most difficult ones with few instances as examples for the classifiers. The scarcity of training material and the imbalance of the classes are the two major problems for the classification task; we propose a set of methods to address them, as described in the following sections.

NSU class Total
Plain Acknowledgment (Ack) 582
Short Answer (ShortAns) 105
Affirmative Answer (AffAns) 100
Repeated Acknowledgment (RepAck) 80
Clarification Ellipsis (CE) 66
Rejection (Reject) 48
Repeated Affirmative Answer (RepAffAns) 25
Factual Modifier (FactMod) 23
Sluice 20
Helpful Rejection (HelpReject) 18
Filler 16
Check Question (CheckQu) 15
Bare Modifier Phrase (BareModPh) 10
Propositional Modifier (PropMod) 10
Conjunct (Conj) 5
Total 1123
Table 3.1: Distribution of the classes in the corpus after the simplifying restriction.
The British National Corpus

The British National Corpus burnard2000reference – BNC for short – is a collection of spoken and written material, containing about 100 million words of (British) English text from a large variety of sources. Among other things, it contains a vast selection of dialogue transcripts covering a wide range of domains. Each dialogue transcript in the BNC is contained in an XML file along with many details about the dialogue setting. The dialogues are structured following the CLAWS tagging system Garside:1993, which segments the utterances both at the word and at the sentence level. The word units contain the raw text, the corresponding lemma (headword) and the POS tag according to the C5 tagset Leech:1994. Each sentence is identified by a unique ID number within the file. Sentences can also contain information about pauses and unclear passages. The sentences are sorted in their order of appearance and include additional information about temporal alignment in case of overlapping speech.

3.2 Machine learning algorithms

We employ two different supervised learning algorithms: decision trees and support vector machines. The former is used mainly for comparison with Fernandez:2007, which employs the same algorithm. For parameter tuning we implemented a coordinate ascent algorithm. As a framework for our experiments we rely on the Weka toolkit Weka, a Java library containing implementations of many machine learning algorithms as well as a general-purpose machine learning API.

3.2.1 Classification: Decision Trees

We employ the C4.5 algorithm Quinlan:1993 for decision tree learning. Weka contains an implementation of this algorithm called J48. The goal of decision tree learning is to create a predictive model from the training data. The construction of the decision tree is performed by splitting the training set into subsets according to the values of an attribute; this process is then repeated recursively on each subset. The construction algorithm is usually an informed search using some kind of heuristic to drive the choice of the splitting attribute. In the case of C4.5, the metric used for the attribute choice is the expected information gain. The information gain is based on the concept of entropy.

In information theory, the entropy shannon:1948 is the expected value of the information carried by a message (or an event in general). It is also a measure of the "unpredictability" of an event: the more unpredictable an event is, the more information it provides when it occurs. Formally, the entropy of a random variable $X$ is

\[
H(X) = -\sum_{i} P(x_i) \log_2 P(x_i)
\]

where $P(x_i)$ is the probability of the $i$-th value of the variable $X$. A derived notion is the conditional entropy of a random variable $Y$ knowing the value of another variable $X$:

\[
H(Y \mid X) = \sum_{j} P(x_j)\, H(Y \mid X = x_j) = -\sum_{j} P(x_j) \sum_{i} P(y_i \mid x_j) \log_2 P(y_i \mid x_j)
\]

where the $x_j$ are the values of the variable $X$ and the $y_i$ are the values of the variable $Y$.

For the decision tree construction, the information gain of an attribute $A$ is the reduction of the entropy of the class $C$ gained by knowing the value of $A$:

\[
IG(C, A) = H(C) - H(C \mid A)
\]

The attribute with the highest information gain is used as the splitting attribute.
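To make the computation concrete, here is a small self-contained sketch (ours, not part of the Weka implementation of C4.5) that derives the entropy and the information gain of a nominal attribute from raw class counts:

import java.util.Map;

/** Minimal sketch: entropy and information gain from raw class counts. */
public class InfoGain {

    /** Entropy (base 2) of a distribution given by counts over class labels. */
    static double entropy(int[] classCounts) {
        double total = 0;
        for (int c : classCounts) total += c;
        double h = 0.0;
        for (int c : classCounts) {
            if (c == 0) continue;
            double p = c / total;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    /**
     * Information gain of an attribute: entropy of the class distribution
     * minus the weighted entropy of the class distribution within each
     * attribute value (i.e. within each split of the training set).
     */
    static double informationGain(int[] classCounts, Map<String, int[]> countsPerValue) {
        double total = 0;
        for (int c : classCounts) total += c;
        double conditional = 0.0;
        for (int[] counts : countsPerValue.values()) {
            double subTotal = 0;
            for (int c : counts) subTotal += c;
            conditional += (subTotal / total) * entropy(counts);
        }
        return entropy(classCounts) - conditional;
    }
}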

3.2.2 Classification: Support Vector Machines

Support Vector Machines (SVMs) Boser:1992 are one of the most studied and reliable families of learning algorithms. An SVM is a binary classifier that represents the instances as points in an n-dimensional space, where n is the number of attributes. Assuming that the instances of the two classes are linearly separable (a hyperplane can be drawn in the space such that the instances of one class all lie on one side of the hyperplane and the instances of the other class on the other side), the goal of SVMs is to find the hyperplane that separates the classes with the maximum margin. The task of finding the best separating hyperplane is defined as an optimization problem. SVMs can also be formulated with "soft margins", i.e. allowing some points of a class to lie on the opposite side of the hyperplane in order to find a better overall solution. The SVM algorithm we use regularizes the model through a single parameter C.

SVMs can also be used with non-linear (i.e. not linearly separable) data using the so-called kernel method. A kernel function maps the points from the input space into a high-dimensional space where they might be linearly separable. A popular kernel function is the (Gaussian) Radial Basis Function (RBF), which maps the input space into an infinite-dimensional space. Its popularity is partially due to the simplicity of its model, which involves only one parameter, $\gamma$.

Even though SVMs are defined as binary classifiers, they can be extended to a multi-class scenario by e.g. training multiple binary classifiers using a one-vs-all or a one-vs-one classification strategy Duan:2005.

The Weka toolkit contains an implementation of SVMs that uses the Sequential Minimal Optimization (SMO) algorithm Platt:1998. In all our experiments we use the SMO algorithm with an RBF kernel.
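For concreteness, a minimal sketch of how such a classifier can be trained and evaluated with Weka's SMO implementation (the ARFF file name and the parameter values below are illustrative placeholders, not the settings used in our experiments):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.classifiers.functions.supportVector.RBFKernel;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SmoExample {
    public static void main(String[] args) throws Exception {
        // Load the feature vectors (the ARFF file name is illustrative).
        Instances data = DataSource.read("nsu-features.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // SMO with an RBF kernel; C and gamma would be tuned by coordinate ascent.
        SMO smo = new SMO();
        RBFKernel kernel = new RBFKernel();
        kernel.setGamma(0.01);
        smo.setKernel(kernel);
        smo.setC(1.0);

        // 10-fold cross-validation, as in our experiments.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(smo, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}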

3.2.3 Optimization: Coordinate ascent

The parameter tuning for all our experiments is carried out automatically through a simple coordinate ascent optimization algorithm (also known as coordinate descent in its minimization counterpart, distinguished only by the sign of the objective function). Coordinate ascent is based on the idea of maximizing a multivariable function along one direction at a time, as opposed to e.g. gradient ascent, which follows the direction given by the gradient of the function. Our implementation detects the ascent direction by direct lookup of the function value. Algorithm 1 contains a procedure to maximize a function along a single direction, while Algorithm 2 performs the full coordinate ascent. The step-size values decay at a rate given by a decay coefficient. The minimum step sizes determine the stopping condition of the maximize function, whereas the coordinateAscent procedure stops as soon as the found values do not change between two iterations, i.e. when a (local) maximum is reached. The latter algorithm can easily be modified to account for a stopping condition given by a maximum number of iterations.

We implemented this algorithm ourselves because, for technical reasons, it was easier than relying on a third-party API. Its simplicity is one of its advantages, but it is more prone to getting stuck in local maxima than more sophisticated techniques such as gradient ascent.

For all our experiments we use the described algorithm to find the parameters that yield the maximum accuracy of the classifiers (using 10-fold cross-validation). For the SMO algorithm we optimize the parameters C and $\gamma$, whereas for the J48 algorithm we optimize the parameters C (confidence threshold for pruning) and M (minimum number of instances per leaf).

Input: function f to be maximized; index i of the parameter to maximize; vector v of the current parameter values; initial step size s for the i-th parameter; minimum step size m for the i-th parameter
Output: the value of the i-th parameter that maximizes f along the corresponding direction

best ← f(v);
while s > m do
      up ← f(v with v[i] + s);
      down ← f(v with v[i] − s);
      if max(up, down) > best then
            v[i] ← v[i] + s if up ≥ down, otherwise v[i] − s;
            best ← max(up, down);
      else
            s ← d · s;      (decay the step size by the coefficient d)
      end if
end while
return v[i];
Algorithm 1: maximize – line search along a single coordinate

Input: function f to be maximized; number n of parameters of the function; vector v of initial parameter values; vector s of initial step-size values; vector m of minimum step sizes
Output: the vector v that maximizes the function

initialize prev to values different from v;
while v ≠ prev do
      prev ← v;
      for i ← 1 to n do
            v[i] ← maximize(f, i, v, s[i], m[i]);
      end for
end while
return v;
Algorithm 2: coordinateAscent – repeated single-coordinate maximization

3.3 The baseline feature set

Our baseline is a replication of the classification experiments carried out by Fernandez:2007, which is the reference work for our study. That paper contains two experiments, one with a restricted set of classes (leaving out Acknowledgments and Check Questions) and a second taking into account all classes. We are interested in the latter, although the former is useful to understand the problem and to analyze the results of our classifier. The paper also contains an analysis of the results and of the feature contributions, which proved useful in replicating the experiments. For our baseline we use only the features they describe. The feature set comprises features exploiting a series of syntactic and lexical properties of the NSUs and their antecedents. The features can be categorized as NSU features, antecedent features and similarity features. Table 3.2 contains an overview of the feature set.

NSU features

Different NSU classes are often distinguished by their form. The following is a group of features exploiting their syntactic and lexical properties.


  • nsu_cont
    Denotes the “content” of the NSU i.e. whether it is a question or a proposition. This is useful to distinguish between question denoting classes, such as Clarification Ellipsis and Sluices, and the rest.

  • wh_nsu
    Denotes whether the NSU contains a wh-word, namely: what, which, who, where, when, how. This can help for instance to distinguish instances of Sluices and Clarification Ellipsis knowing that the former are wh-questions while the latter are not.

  • aff_neg
    Denotes the presence of a yes-word, a no-word or an ack-word in the NSU. Yes-words are for instance: yes, yep, aye; no-words are for instance: no, not, nay; ack-words are: right, aha, mhm. This is particularly needed to distinguish between Affirmative Answers, Rejections and Acknowledgments.

  • lex
    Indicates the presence of lexical items at the beginning of the NSU. This feature is intended to indicate the presence of modifiers. A modal adverb (e.g. absolutely, clearly, probably) at the beginning of the utterance usually denotes a Propositional Modifier. The same applies to Factual Modifiers, which are usually denoted by factual adjectives (e.g. good, amazing, terrible, brilliant), and to Conjuncts, which are usually denoted by conjunctions. Bare Modifier Phrases are a wider class of NSUs that do not have a precise lexical configuration, but they usually start with lexical patterns containing a Prepositional Phrase (PP) or an Adverbial Phrase (AdvP).

Feature Description Values
nsu_cont the content of the NSU (a question or a proposition) p,q
wh_nsu presence of a wh-word in the NSU yes,no
aff_neg presence of a yes/no-word in the NSU yes,no,e(mpty)
lex presence of different lexical items at the beginning of the NSU p_mod,f_mod,mod,conj,e
ant_mood mood of the antecedent utterance decl,n_decl
wh_ant presence of a wh-word in the antecedent yes,no
finished whether the antecedent is (un)finished fin,unf
repeat number of common words in the NSU and the antecedent -
parallel length of the common tag sequence in the NSU and the antecedent -
Table 3.2: An overview of the baseline feature set.
Antecedent features

As for the NSUs, antecedents also show different syntactic and lexical properties that can be used as features for the classification task. This is a group of features exploiting those properties.


  • ant_mood
    As defined by rodriguez2004form, this feature is intended to distinguish between declarative and non-declarative antecedent sentences. It is useful to indicate the presence of an answer NSU, if the antecedent is a question, or of a modifier, if the antecedent is not a question.

  • wh_ant
    As with the corresponding NSU feature, this indicates the presence of a wh-word in the antecedent. Usually Short Answers are answers to wh-questions, while Affirmative Answers and Rejections are answers to polar questions, i.e. yes/no-questions without a wh-interrogative.

  • finished
    This feature encodes whether the antecedent sentence is truncated, as well as the presence of uncertainties at its end. Truncated sentences lack a closing full stop, question mark or exclamation mark. Uncertainties are signaled by the presence of pauses or unclear words, or by a last word that is "non-closing", e.g. a conjunction or an article.

Similarity features

As discussed in Section 2.1, some classes show some kind of parallelism between the NSU and its antecedent. The parallelism of certain classes can be partially captured by similarity measures. The following is a group of features encoding the similarity at the word and POS-level between the NSUs and their antecedents.


  • repeat
    This feature counts the content words that the NSU and the antecedent have in common (capped at a maximum value as a simplification). Higher values are usually a sign of Repeated Acknowledgments or Repeated Affirmative Answers. A sketch of this computation is given after this list.

  • parallel
    This feature encodes whether there is a common sequence of POS tags between the NSU and the antecedent and denotes its length. This feature can help classify Repeated Acknowledgments, Repeated Affirmative Answers and Helpful Rejections.
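To illustrate how the repeat feature can be computed, here is a small sketch (ours; the stop-word list is a hypothetical stand-in for the actual content-word filter, and the cap value is illustrative):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class RepeatFeature {
    // Hypothetical stop-word list; the real filter distinguishes content words.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "to", "of", "and", "is"));

    /** Number of content words shared by the NSU and its antecedent, capped. */
    static int repeat(String nsu, String antecedent, int cap) {
        Set<String> antWords = contentWords(antecedent);
        int count = 0;
        for (String w : contentWords(nsu)) {
            if (antWords.contains(w)) count++;
        }
        return Math.min(count, cap);
    }

    private static Set<String> contentWords(String utterance) {
        Set<String> words = new HashSet<>();
        for (String token : utterance.toLowerCase().split("\\s+")) {
            String w = token.replaceAll("[^a-z']", "");
            if (!w.isEmpty() && !STOP_WORDS.contains(w)) words.add(w);
        }
        return words;
    }
}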

3.4 Feature engineering

The first and most straightforward method we use to improve the classification performance is to find more features to describe the NSU instances. We present here the combination of features that we employ in our final approach. The extended feature set is composed of all the baseline features plus 23 new linguistic features, for a total of 32 features. Our features can be clustered into five groups: POS-level features, phrase-level features, dependency features, turn-taking features and similarity features. Table 3.3 shows an overview of the additional features in the extended feature set.

Feature Description Values
pos_{1,2,3,4} POS tags of the first four words in the NSU C5 tag-set
ending_punct ending punctuation in the antecedent if any .,?,!,e
has_pause presence of a pause in the antecedent yes,no
has_unclear presence of an “unclear” marker in the antecedent yes,no
ant_sq presence of a SQ tag in the antecedent yes,no
ant_sbarq presence of a SBARQ tag in the antecedent yes,no
ant_sinv presence of a SINV tag in the antecedent yes,no
nsu_first_clause first clause-level syntactic tag in the NSU S,SQ,…
nsu_first_phrase first phrase-level syntactic tag in the NSU NP,ADVP,…
nsu_first_word first word-level syntactic tag in the NSU NN,RB,…
neg_correct presence of a negation followed by a correction yes,no
ant_neg presence of a neg dependency in the antecedent yes,no
wh_inter presence of a wh-interrogative fragment in the antecedent yes,no
same_who whether the NSU and its antecedent have been uttered by the same speaker same,diff,unk
repeat_last number of repeated words between the NSU and the last part of the antecedent numeric
abs_len number of words in the NSU numeric
cont_len number of content-words in the NSU numeric
local_all the local alignment (at character-level) of the NSU and the antecedent numeric
lcs longest common subsequence (at word-level) between the NSU and the antecedent numeric
lcs_pos longest common subsequence (at pos-level) between the NSU and the antecedent numeric
Table 3.3: An overview of the additional features comprised in the extended feature set.
POS-level features

Shallow syntactic properties of the NSUs that make use of the pieces of information already present in the BNC such as POS tags and other markers.


  • pos_{1,2,3,4}
    A feature for each one of the first four POS-tags in the NSU. If an NSU is shorter than four words the value None is assigned to each missing POS tag. Many NSU classes share (shallow) syntactic patterns among their instances, especially at the beginning of the NSU phrase. Those features aim to capture those patterns in a shallow way through the POS tags.

  • ending_punct
    A feature to encode the final punctuation mark of the antecedent if any.

  • has_pause
    Marks the presence of a pause in the antecedent.

  • has_unclear
    Marks the presence of an unclear passage in the antecedent.

Phrase-level features

Occurrence of certain syntactic structures in the NSU and the antecedent. These features were extracted through the use of the Stanford PCFG parser klein2003accurate on the utterances. Refer to Marcus:1993 for more information about the tag set used for the English grammar.


  • ant_{sq,sbarq,sinv}
    Those features indicate the presence of the syntactic tags SQ, SBARQ and SINV in the antecedent. Those tags indicate a question formulated in various ways even when there is no explicit question mark at the end. Useful to recognize e.g. Short Answers.

  • nsu_first_clause
    Marks the first clause-level tag (S, SQ, SBAR, …) in the NSU.

  • nsu_first_phrase
    Marks the first phrase-level tag (NP, VP, ADJP, …) in the NSU.

  • nsu_first_word
    Marks the first word-level tag (NN, RB, UH, …) in the NSU.

  • neg_correct
    Presence of a negation word (no, nope, …), followed by a comma and a correction. For instance:

    (3.1) A Or, or were they different in your childhood? B No, always the same.
      [BNC: HDH 158–159]

    This pattern is useful to describe some of the Helpful Rejections, such as (3.1).

Dependency features

Presence of certain dependency patterns in the antecedent. These features were extracted through the use of the Stanford Dependency Parser chen2014fast on the utterances. For more details about the dependency relations tag set please refer to de2014universal.


  • ant_neg
    Signals the presence of a neg dependency relation in the antecedent. The neg dependency arises from an adverbial negation in the sentence (not, don’t, never, …). This feature helps to capture situations such as the following:

    (3.2) A You’re not getting any funny fits from that at all, June? B Er no.
      [BNC: H4P 36–37]

    Since the question in the antecedent is negative, the NSU in (3.2) is actually an Affirmative Answer, even though it contains a negative word. This feature, in combination with the aff_neg feature, addresses this situation.

  • wh_inter
    Whether the antecedent contains a wh-interrogative fragment such as the one in the following example:

    (3.3) A And you know what the voltage is B Yeah, two forty.
      [BNC: GYR 174–175]

    The feature looks for a dobj dependency whose dependent is a wh-word and then for an nsubj dependency sharing the same governor; for instance, in (3.3) we have dobj(is-7, what-4) and nsubj(is-7, voltage-6). This feature tries to mitigate the absence of an explicit question as antecedent for Short Answers such as (3.3); a small sketch of this check is given below.
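
A minimal sketch of the check, assuming the dependency parse of the antecedent is available as a list of (relation, governor, dependent) triples; the wh-word list is an illustrative assumption:

    WH_WORDS = {'what', 'who', 'whom', 'which', 'where', 'when', 'why', 'how'}

    def wh_inter(dependencies):
        """True if the antecedent contains a wh-interrogative fragment, i.e. a
        dobj relation whose dependent is a wh-word together with an nsubj
        relation sharing the same governor."""
        for rel, gov, dep in dependencies:
            if rel == 'dobj' and dep.rsplit('-', 1)[0].lower() in WH_WORDS:
                if any(r == 'nsubj' and g == gov for r, g, _ in dependencies):
                    return True
        return False

    # Example (3.3): "And you know what the voltage is"
    print(wh_inter([('dobj', 'is-7', 'what-4'), ('nsubj', 'is-7', 'voltage-6')]))  # True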

Turn-taking features

Features indicating certain patterns in the turn-taking of the dialogue.


  • same_who
    Denotes whether the NSU and the antecedent were uttered by the same speaker. Some dialogues do not provide speaker information, so an additional value unk is used for these cases. This feature is particularly important to capture Check Questions, which are almost always uttered by the same speaker.

Similarity features

Additional numeric features and similarity measures between the NSU and its antecedent.


  • repeat_last
    This measures the number of words in common between the NSU and the last portion of the antecedent. It often happens that Repeated Affirmative Answers and Repeated Acknowledgments contain the last words of the antecedent.

  • abs_len
    The total number of words in the NSU.

  • cont_len
    The number of content-words in the NSU.

  • local_all
    A feature that denotes the local alignment at the character-level between the NSU and the antecedent, computed using the Smith–Waterman algorithm smith1981identification.

  • lcs
    A feature to express the longest common subsequence at the word-level between the NSU and its antecedent, computed using a modified version of the Needleman–Wunsch algorithm needleman1970general, tailored to operate on words instead of characters (a word-level sketch is given after this list).

  • lcs_pos
    The longest common subsequence at the POS-level between the NSU and its antecedent, computed with the same algorithm as above but using the list of POS tags instead of the list of words.
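
As an illustration of the word-level variant, the following is a minimal dynamic-programming sketch computing the length of the longest common subsequence between two token lists; it corresponds to the adapted alignment algorithm under a simple scoring scheme (match 1, gap and mismatch 0), and the same function can be applied to POS-tag lists for lcs_pos:

    def lcs_length(a_tokens, b_tokens):
        """Length of the longest common subsequence between two token lists
        (words or POS tags), computed with the standard dynamic program."""
        n, m = len(a_tokens), len(b_tokens)
        table = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                if a_tokens[i - 1] == b_tokens[j - 1]:
                    table[i][j] = table[i - 1][j - 1] + 1
                else:
                    table[i][j] = max(table[i - 1][j], table[i][j - 1])
        return table[n][m]

    print(lcs_length("the current goes up".split(), "current goes up".split()))  # 3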

3.5 Semi-Supervised Learning

The scarcity of labeled data is probably the major problem to face in this classification task. Even though the quality of the data is good, it is still difficult for a classifier to learn patterns from the handful of instances available for some classes (see Table 3.1). However, a large amount of unlabeled data is available in the BNC. There are many classification tasks, such as ours, in which it is hard or costly to label a large number of instances while it is relatively cheap to extract unlabeled ones. The empirical question is whether the use of unlabeled data helps to improve the classification performance. Semi-supervised learning techniques deal with this issue: they exploit the combination of a small amount of labeled data and a large amount of unlabeled data to try to improve the classification accuracy. Even though it is still a young research field, semi-supervised learning has already found many applications liang2005semi,bergsma2010large.

3.5.1 Unlabeled data extraction

With the use of some heuristics it is possible to extract NSU instances of good quality from the BNC. We use a set of rules to determine whether an utterance in a dialogue transcript of the BNC is a probable NSU. The following is a list of such rules:

  • The number of words in the NSU must be less than a given threshold;

  • The number of characters in the NSU must be higher than a given threshold;

  • The NSU must not contain only pauses, unclear passages and punctuation;

  • The NSU must not contain a greeting (e.g. hi, hello, good night);

  • The NSU must not contain a verb in any form.

An accuracy test was run over the annotated corpus of NSUs to check how many of them were correctly detected by this set of rules. The main flaws of the rules concerned overlong NSUs, such as (3.4), and the presence of verbs, such as (3.5).

  (3.4) A Was it a coal fire? B Coal fire and er scrubbed the cabin out like that, soda water and soft soap. (A Repeated Affirmative Answer; the additional content after the conjunction makes the NSU much longer. It is still a valid NSU since it does not have a full clausal structure.)
    [BNC: H5G 151–152]

  (3.5) A […] the resistance the same the current goes up. B Current goes up. (A Repeated Acknowledgment; by repeating the words in the antecedent, it introduces a verb. It is still considered an NSU according to the definition of Fernandez:thesis.)
    [BNC: GYR 112–113]

The detection of NSUs using the rules above is not the only problem to face. Perhaps more challenging is the selection of an antecedent for the NSU. As pointed out in Section 2.1, the antecedent of an NSU is not always the preceding utterance. Nevertheless, as shown in the corpus study of Fernandez:thesis, the percentage of utterances whose antecedent is not the preceding utterance is rather low. Another result of that work is that an antecedent at a distance greater than one is far more probable in a multi-party dialogue context. In light of these considerations, we restrict the instances we extract to those from two-party dialogues and we always take the preceding utterance as the antecedent of an NSU. While there has been some previous work on using machine learning techniques for the detection of the antecedent of NSUs in multi-party dialogue Schlangen:2005, we consider the amount of unlabeled data we can extract under this restriction to be sufficient.

In order to maximize the quality of the unlabeled data that we extract, we also enforce some rules on the antecedent utterance (a sketch of the complete extraction filter is given after the list):

  • The number of words in the antecedent must be greater than the number of words in the NSU;

  • The antecedent must have a complete clausal form i.e. at least a verb phrase and a noun phrase.
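
The two sets of rules can be summarized in a small filter, sketched below. The thresholds, the greeting list and the pause/unclear markers are illustrative assumptions rather than the exact values used in our implementation, and the clausal-form check is approximated by requiring at least one verb tag and one noun tag in the antecedent:

    MAX_WORDS = 8          # illustrative thresholds (assumptions)
    MIN_CHARS = 2
    GREETINGS = {'hi', 'hello', 'bye', 'goodbye', 'goodnight'}

    def is_verb_tag(c5_tag):
        """C5 verb tags (VVB, VBD, VM0, ...) all start with 'V'."""
        return c5_tag.startswith('V')

    def is_probable_nsu(words, tags):
        """Heuristic NSU detector following the rules listed above."""
        if len(words) >= MAX_WORDS:                       # must be short
            return False
        if len(' '.join(words)) <= MIN_CHARS:             # but not degenerately short
            return False
        content = [w for w, t in zip(words, tags)
                   if t != 'PUN' and w not in ('<pause>', '<unclear>')]
        if not content:                                   # not only pauses/unclear/punctuation
            return False
        if any(w.lower() in GREETINGS for w in words):    # no greetings
            return False
        if any(is_verb_tag(t) for t in tags):             # no verbs in any form
            return False
        return True

    def is_valid_antecedent(ant_words, ant_tags, nsu_words):
        """Constraints enforced on the antecedent of a candidate NSU."""
        if len(ant_words) <= len(nsu_words):              # antecedent longer than the NSU
            return False
        has_verb = any(is_verb_tag(t) for t in ant_tags)  # crude proxy for a verb phrase
        has_noun = any(t.startswith('N') for t in ant_tags)  # crude proxy for a noun phrase
        return has_verb and has_noun

    print(is_probable_nsu(['Yes', '.'], ['ITJ', 'PUN']))                      # True
    print(is_probable_nsu(['Current', 'goes', 'up'], ['NN1', 'VVZ', 'AVP']))  # False (contains a verb)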

Using the whole set of heuristics we extracted a new set of unlabeled NSU instances from the BNC (checked not to overlap with the annotated corpus).

3.5.2 Semi-supervised learning techniques

As previously mentioned, semi-supervised learning techniques are used when labeled data is scarce and unlabeled data is abundant. Each technique tries to integrate the information yielded by the unlabeled instances into a learning model based on the available labeled data. In this section we give a brief, high-level description of the semi-supervised learning techniques that we employed, namely: Self Training, Transductive SVM and Active Learning.

Self Training

The simplest way to exploit unlabeled data is to automatically predict the labels of some unlabeled instances with a classifier built from the available labeled data and then add them to the training data for the next step. This is an iterative process: at each step one or more newly labeled instances are added to the training set, the classifier is retrained, and further unlabeled instances are predicted.

Various strategies can be used at each step:

  • Add one or a few (random) instances at a time;

  • Add the few most confidently predicted instances;

  • Add all the unlabeled instances the first time, then correct the wrong predictions in the subsequent iterations.

The last strategy as well as other variants can be cast as an Expectation-Maximization problem, especially when using a probabilistic learning model.
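
A minimal sketch of the self-training loop with the “most confident instances” strategy, written against the scikit-learn interface for illustration (the actual experiments use Weka; the classifier choice, batch size and toy data below are assumptions):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def self_train(clf, X_lab, y_lab, X_unlab, batch_size=10, iterations=20):
        """Self-training: retrain the classifier, add the most confidently
        predicted unlabeled instances with their predicted labels, repeat."""
        X_lab, y_lab = np.asarray(X_lab, float), np.asarray(y_lab)
        X_unlab = np.asarray(X_unlab, float)
        for _ in range(iterations):
            if len(X_unlab) == 0:
                break
            clf.fit(X_lab, y_lab)
            proba = clf.predict_proba(X_unlab)                    # class distributions
            picked = np.argsort(-proba.max(axis=1))[:batch_size]  # most confident instances
            X_lab = np.vstack([X_lab, X_unlab[picked]])
            y_lab = np.concatenate([y_lab, clf.classes_[proba[picked].argmax(axis=1)]])
            X_unlab = np.delete(X_unlab, picked, axis=0)
        return clf.fit(X_lab, y_lab)

    # Toy usage with a logistic regression standing in for the actual classifier
    model = self_train(LogisticRegression(), [[0.0], [0.2], [0.8], [1.0]], [0, 0, 1, 1],
                       [[0.1], [0.6], [0.9]], batch_size=1, iterations=3)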

Transductive SVM

As already described in Section 3.2.2, Support Vector Machines are one of the most studied and reliable families of classification algorithms. Transductive SVM (TSVM) is a variant of the standard SVM algorithm which exploits unlabeled data to help adjust the SVM model. The basic assumption underlying TSVM is that unlabeled instances from different classes are separated with a large margin. Therefore, similarly to the standard SVM, TSVM tries to find the hyperplane that also maximizes the margin over the unlabeled data, i.e. treating unlabeled points as if they were labeled. To decide which class an unlabeled point should be assigned to, clustering techniques are used, e.g. k-nearest neighbors (taking the class of the majority of the neighbors, or some other variant).

We do not go into the mathematical details here and refer the interested reader to vapnik1998statistical, Collobert:2006.

Active Learning

Annotating data is often a very expensive procedure, mostly because one needs to annotate a lot of instances in order to be able to reliably classify unseen ones. An idea to ease this problem is to let the learning algorithm choose which instance would be the most informative (i.e. the most difficult to predict) and then annotate it manually. This technique reduces the cost of manual annotation by making informed guesses about which instances to label and discarding the redundant ones.

These techniques are typically employed to cope with the scarcity of labeled data. In our case, the lack of sufficient training data is especially problematic due to the strong class imbalance between the NSU classes.

The Active Learning (AL) scheme, which is a special case of semi-supervised learning, trains the model on the available labeled data, queries the user for the label of one (or a few) instances, retrains the model, and so on until a stopping criterion is met, e.g. the desired number of new instances is reached.

There are different query strategies; some of them are:

  • Uncertainty Sampling: queries the least confident instance (according to the probability of the prediction). A variant of that uses entropy to determine the most informative instance.

  • Query-by-committee: uses several different classifiers to predict the unlabeled data, then selects as the most informative query the instance about which they disagree the most.

  • Expected Model Change: selects the instance that would impart the greatest change to the model, according to a decision-theoretic approach.

  • Expected Error Reduction: another decision-theoretic approach, which aims to minimize the risk, i.e. the expected future error. The instances are selected on the basis of how much the model generalization error is likely to be reduced. A variant of this approach considers only the output variance of the model.

The particular active learning algorithm we employed in our experiments is a pool-based method (i.e. one that draws the instances to annotate from a fixed “pool” of unlabeled data that remains the same over the iterations, as opposed to stream-based methods, in which sampling is done over a stream of data) with uncertainty sampling lewis1994heterogeneous. The sampling relies on entropy as the measure of uncertainty. Given a particular (unlabeled) instance with a vector of feature values x, we use the existing classifier to predict the class of the instance and derive the probability distribution P(y | x) over the possible output classes y. We can then determine the corresponding entropy of the class:

\[ H(Y \mid x) = -\sum_{y} P(y \mid x) \log P(y \mid x) \]

As seen in Section 3.2.1, entropy indicates the “unpredictability” of a random variable and also how much information it carries. The higher the entropy of the class of an instance, the more information we gain by knowing it. The algorithm we employ (Algorithm 3) selects the instances with the highest entropy as the most informative ones. As argued in settles2010active, entropy sampling is especially useful when there are more than two classes, as in our setting. In practice, we applied the JCLAL active learning library (cf. https://sourceforge.net/projects/jclal) to extract and annotate 100 new instances of NSUs, which were subsequently added to the existing training data.

Input: The classifier C; the unlabeled data U; the sample size k.
Output: The k instances of U with the highest entropy.
1  H ← vector of the same size as U;
2  for each instance x_i in U do
3      P(y | x_i) ← class distribution predicted by C for x_i;
4      H[i] ← − Σ_y P(y | x_i) log P(y | x_i);
5  end for
6  sort U according to H (descending);
7  return the first k elements of U;
Algorithm 3: Pool-based uncertainty sampling with entropy.
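
An equivalent, runnable sketch of Algorithm 3, assuming a classifier that exposes class probabilities (e.g. any scikit-learn model with a predict_proba method):

    import numpy as np

    def entropy_sampling(classifier, unlabeled, sample_size):
        """Pool-based uncertainty sampling: return the indices of the
        'sample_size' unlabeled instances whose predicted class distribution
        has the highest entropy."""
        X = np.asarray(unlabeled, dtype=float)
        proba = classifier.predict_proba(X)                        # P(y | x) per instance
        entropy = -np.sum(proba * np.log(proba + 1e-12), axis=1)   # H(Y | x)
        return np.argsort(-entropy)[:sample_size]                  # most informative first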

3.6 Evaluation

In this section we discuss the evaluation of our experiments and their empirical results. We first describe the evaluation metrics employed for the classification task, then we present the evaluation results for each setting.

3.6.1 Metrics

Given a dataset with a total of N instances, the metrics are based on the numbers of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN).

Accuracy

The ratio of the correctly classified instances over the total number of instances:

\[ \mathit{Accuracy} = \frac{\sum_{c \in C} TP_c}{N} \]

where C is the set of classes and TP_c denotes the true positives of class c (in the multi-class setting, every correctly classified instance is a true positive for exactly one class).

Precision

The ratio between the true positives and the total number of instances classified as positive. In a setting with multiple classes (more than two) such as ours, the precision must be calculated per class, where the positive instances are the ones classified with the current class and the negative instances are the ones classified otherwise. The per-class precision is calculated as follows:

\[ P_c = \frac{TP_c}{TP_c + FP_c} \]

To obtain a summary value over all the classes we can compute the weighted average precision:

\[ P = \sum_{c \in C} \frac{N_c}{N} \, P_c \]

where N_c is the number of instances belonging to class c.

Recall

The recall is the ratio between the true positives and the total number of instances that are actually positive. As for the precision, we can calculate the per-class recall:

\[ R_c = \frac{TP_c}{TP_c + FN_c} \]

and the weighted average recall:

\[ R = \sum_{c \in C} \frac{N_c}{N} \, R_c \]

F1-score

The F1-score is the harmonic mean of precision and recall. As for the other two measures, we compute the per-class F1-score:

\[ F_{1,c} = \frac{2 \, P_c \, R_c}{P_c + R_c} \]

and then the weighted average F1-score:

\[ F_1 = \sum_{c \in C} \frac{N_c}{N} \, F_{1,c} \]

Student’s t-test

Empirical results alone cannot establish whether one classifier performs better than another. To show that the higher performance of one classifier is not due to the randomness associated with the data manipulation but to a statistically significant difference between the classifiers, one needs to disprove the null hypothesis with a high degree of confidence. The null hypothesis is a statement that is assumed to be true until evidence indicates otherwise; when comparing two learning systems, it states that there is no difference between the performances of the two systems. For this purpose we employ a Student’s t-test, a widespread method to compare two sets of measurements. The t-test can be used to find the probability that the performance values of the two classifiers are drawn from the same mean.

To run the t-test, we compare the differences between the performance values of the two classifiers over N independent samples. We first compute the mean of the differences:

\[ \bar{d} = \frac{1}{N} \sum_{i=1}^{N} d_i \qquad \text{with } d_i = a_i - b_i \]

where a_i and b_i are the performance values of the two classifiers on the i-th sample. Then we compute the t-statistic:

\[ t = \frac{\bar{d}}{s_d / \sqrt{N}} \]

where s_d is the sample standard deviation of the differences. From the t-statistic we can derive the p-value from a Student’s t-distribution with N − 1 degrees of freedom. A small p-value means that it is unlikely that the samples would show such a t-statistic by chance, therefore we can conclude that the difference in performance between the two classifiers is statistically significant.

In our case we use a paired t-test on the accuracy values of the 10-fold cross-validation over the dataset (thus N = 10). By convention, an acceptable p-value is below 0.05. For our experiments we rely on the t.test function from the R project, a framework for statistical computing R:manual.
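
An equivalent paired test can be sketched in Python as follows (the accuracy values below are placeholders for illustration, not our actual results):

    from scipy import stats

    # Per-fold accuracies of two classifiers on the same 10 folds (placeholder values)
    baseline = [0.87, 0.88, 0.89, 0.88, 0.90, 0.87, 0.89, 0.88, 0.88, 0.89]
    final    = [0.90, 0.91, 0.90, 0.91, 0.92, 0.90, 0.91, 0.91, 0.90, 0.92]

    t_statistic, p_value = stats.ttest_rel(final, baseline)   # paired Student's t-test
    print(t_statistic, p_value)   # the difference is significant when p_value < 0.05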

3.6.2 Empirical results

Baseline

As in Fernandez:2007, we evaluate our system with 10-fold cross-validation. Weka’s J48 algorithm was used as the classifier for this comparison. By analyzing the resulting decision trees, we managed to imitate the behavior of their system quite closely and to reach a very similar overall performance. Even though we use the same feature set and the same algorithm, the performance figures turn out to be slightly lower than the ones reported in Fernandez:2007. This might be due to a variety of reasons, for instance the way features were extracted or how the parameters were tuned. Nevertheless, the overall performance is matched, as are many of the patterns in the scores. Table 3.4 shows the comparison between the performance figures of the reference classification Fernandez:2007 and the values of the same figures achieved by our implementation.

Our replica Reference classification
NSU Class Precision Recall F1-score Precision Recall F1-score
Ack
AffAns
BareModPh
CE
CheckQu
ConjFrag
FactMod
Filler
HelpReject
PropMod
Reject
RepAck
RepAffAns
ShortAns
Sluice
weighted avg. 0.89 0.89 0.88 0.90 0.90 0.89
Table 3.4: Performance comparison between Fernandez:2007 and our replica.
Self-training and TSVM

Neither of these two techniques performed particularly well; sometimes they even worsened the classification accuracy. Self-training was implemented and tested in many variants, but none were successful. One possible explanation is that the labeled data added at each step to the training data is always biased by the labeled data available in the initial training set. This may lead to adding redundant data that is not actually useful to improve the classification performance. On the other hand, TSVM was unsuccessful mostly due to the computational cost of the implementation and other technical difficulties. It was impractical to run it on a large amount of unlabeled data, so we could only test it on a few unlabeled instances, and consequently no improvement was observed.

Active Learning

Our active learning experiment was carried out using the JCLAL library. For the active learning process we divided the dataset into three parts: training set (50%), development set (25%) and test set (25%). At each iteration, the JCLAL library builds a classifier on the training set and evaluates it over the development set. The same classifier is then used to select an instance from the unlabeled data, as described in Section 3.5.2. The user is then asked to annotate the selected instance. The process iterates in this manner until the stopping criterion is met, that is, when the goal of 100 newly annotated instances is reached. Table 3.5 shows the distribution of the instances annotated with Active Learning. From Table 3.5 we can see that the AL algorithm using the entropy measure prefers the instances that belong to the classes that are most difficult to classify and, in particular, the ones that are ambiguous, such as Clarification Ellipsis and Sluices. This process was performed once with the extended feature set and the SMO classifier; it was then simulated (i.e. using the data obtained in the previous run) with the baseline feature set instead. Figures 3.1, 3.2, 3.3 and 3.4 show the learning curves (i.e. how the performance changes as the newly labeled data extracted with Active Learning is inserted into the training set), respectively for the accuracy, precision, recall and F1-score, of both the extended feature set and the baseline feature set (note that the plots are scaled on the y-axis to make the change visible). All the performance measures clearly improve as new instances become available, for both the extended feature set and the baseline one.

NSU Class Instances
Helpful Rejection
Repeated Acknowledgment
Clarification Ellipsis
Acknowledgment
Propositional Modifier
Filler
Sluice
Repeated Affirmative Answer
Factual Modifier
Conjunct Fragment
Short Answer
Check Question
tot. 100
Table 3.5: Distribution of the classes of the instances annotated with Active Learning.

In the end, the test set was used to evaluate the overall performance of the various settings. Table 3.6 and Table 3.7 show the results of the experiments over the development set and the test set respectively. The results on the test set show that the inclusion of the active learning data is only beneficial when combined with the extended feature set.

We also performed an evaluation of the various settings using 10-fold cross-validation over the full dataset. The evaluation results based on the active learning procedure (AL) refer to the performance of the system after the inclusion of all newly annotated instances. The novel data was added to the training set of each fold.

We compare the results of the various settings using the J48 algorithm (Table 3.8) and SMO algorithm (Table 3.9). The use of active learning was successful and, in the end, the use of the SMO classifier with the extended feature set and the inclusion of the AL instances constitutes our final approach.

The results show a significant improvement in classification performance between the baseline and the final approach. Using a paired t-test between the baseline and the final results (as detailed in Section 3.6.1), the improvement in classification accuracy is statistically significant at the 95% confidence level.

The SVM algorithm does not perform particularly well with the baseline feature set but scales better than the J48 classifier after the inclusion of the additional features. Overall, the results demonstrate that the classification can be improved using a modest amount of additional training data combined with an extended feature set. However, we can observe from Table 3.10 that some NSU classes remain difficult to classify even with the insertion of additional training data. For instance, Helpful Rejections are still the most difficult class to classify, even with the addition of 21 new instances. One of the problems with Helpful Rejections is that they are connected to their antecedents mainly at the semantic level. Consider the following example of a Helpful Rejection that is hard to classify:

  (3.6) A There was one which you said Ernest Morris was born in 1950. B Fifteen. [BNC: J9A 372–373]

Training set (feature set) Accuracy Precision Recall F1-score
Train-set (baseline) 0.853 0.857 0.853 0.848
Train-set (extended) 0.860 0.871 0.860 0.858
Train-set + AL (baseline) 0.867 0.883 0.867 0.868
Train-set + AL (extended) 0.884 0.899 0.885 0.886
Table 3.6: Performances of the SMO classifier in the various settings on the development set.
Training set (feature set) Accuracy Precision Recall F1-score
Train-set + Dev-set (baseline) 0.906 0.911 0.906 0.903
Train-set + Dev-set (extended) 0.928 0.937 0.929 0.930
Train-set + Dev-set + AL (baseline) 0.898 0.911 0.898 0.898
Train-set + Dev-set + AL (extended) 0.932 0.945 0.932 0.935
Table 3.7: Performances of the SMO classifier in the various settings on the test set.

It is clear that, for the Helpful Rejection in (3.6), morpho-syntactic and lexical features, such as the ones we employ, are of little use. Most of the connection is at the semantic level, therefore we would need features that exploit semantic patterns. At the same time, the use of such features would add several layers of complexity to the feature extraction process. Other examples of difficult classes are the Repeated Affirmative Answers and Repeated Acknowledgments. They are highly ambiguous because they can be misclassified as each other, as their respective non-repeated classes, and sometimes as other NSU classes. An example of an ambiguous Repeated Acknowledgment is the following:

  (3.7) A Selected period. B Selected period, right, Andrew? (In this dialogue, speaker B is asking the same question to many people in turn.)
    [BNC: JK8 114–115]

The instance in (3.7) also contains a question, therefore it is often misclassified as one of the question-denoting NSU classes. It is clear that handling these types of NSU requires a deeper semantic analysis of the connection with their antecedents and the design of appropriate semantic features. The extraction of additional labeled data is also especially important, both for the feature engineering and for the learning process of the classifiers. These two directions may be the starting points of any future work on this task.

Training set (feature set) Accuracy Precision Recall F1-score
Train-set (baseline) 0.885 0.888 0.885 0.879
Train-set (extended) 0.889 0.904 0.889 0.889
Train-set + AL (baseline) 0.890 0.896 0.890 0.885
Train-set + AL (extended) 0.896 0.914 0.896 0.897
Table 3.8: Performances of the J48 classifier in the various settings using 10-fold cross-validation.
Training set (feature set) Accuracy Precision Recall F1-score
Train-set (baseline feature set) 0.881 0.884 0.881 0.875
Train-set (extended feature set) 0.899 0.904 0.899 0.896
Train-set + AL (baseline feature set) 0.883 0.893 0.883 0.880
Train-set + AL (extended feature set) 0.907 0.913 0.907 0.905
Table 3.9: Performances of the SMO classifier in the various settings using 10-fold cross-validation.
Baseline Final approach
NSU Class Precision Recall F1-score Precision Recall F1-score
Ack 0.97 0.97 0.97 0.97 0.98 0.97
AffAns 0.89 0.84 0.86 0.81 0.90 0.85
BareModPh 0.63 0.65 0.62 0.77 0.75 0.75
CE 0.87 0.89 0.87 0.88 0.92 0.89
CheckQu 0.85 0.90 0.87 1.00 1.00 1.00
ConjFrag 0.80 0.80 0.80 1.00 1.00 1.00
FactMod 1.00 1.00 1.00 1.00 1.00 1.00
Filler 0.77 0.70 0.71 0.82 0.83 0.78
HelpReject 0.13 0.14 0.14 0.31 0.43 0.33
PropMod 0.92 0.97 0.93 0.92 1.00 0.95
Reject 0.76 0.95 0.83 0.90 0.90 0.89
RepAck 0.74 0.75 0.70 0.77 0.77 0.77
RepAffAns 0.67 0.71 0.68 0.72 0.55 0.58
ShortAns 0.86 0.80 0.81 0.92 0.86 0.89
Sluice 0.67 0.77 0.71 0.80 0.84 0.81
Table 3.10: Per-class performance comparison between the baseline (J48, baseline feature set) and the final approach (SMO, extended feature set, AL instances).
Figure 3.1: Learning curve for the accuracy (output of the JCLAL library).
Figure 3.2: Learning curve for the precision (output of the JCLAL library).
Figure 3.3: Learning curve for the recall (output of the JCLAL library).
Figure 3.4: Learning curve for the F1-score (output of the JCLAL library).

3.7 Summary

This chapter presented the task of classifying non-sentential utterances and our approach to this problem. The task is formulated as a machine learning problem, and we follow and extend the work of Fernandez:2007. We use their corpus as a gold standard and a replica of their approach as a baseline. The data, the machine learning algorithm and the feature set of the baseline were discussed in Sections 3.1, 3.2 and 3.3 respectively. The two main problems we faced in our work are the scarcity of labeled data and the imbalance in the distribution of the classes. To address these problems we extended the baseline approach in two ways: using a larger feature set (detailed in Section 3.4) and employing semi-supervised learning techniques to exploit the abundance of unlabeled data. We described in Section 3.5 the semi-supervised learning techniques that we employed, namely Self Training, Transductive SVM and Active Learning. Section 3.6 presented the empirical results of our experiments. While the extended feature set alone did not improve the performance of the classifiers, its use in combination with Active Learning made a modest but significant difference.

4.1 The resolution task

The resolution of an NSU is the task of extracting its meaning from the dialogue context. More precisely, given the word sequence making up the NSU and its type according to the taxonomy presented in Section 2.1.1, and assuming MaxQUD to be a high-level semantic representation of the antecedent (as mentioned in Section 2.2.2), we want to extract, through a resolution procedure, the high-level semantic representation of the NSU. The right resolution procedure is selected on the basis of the type of the NSU. In our case the type is retrieved using the classifier developed in Chapter 3, which takes as input the raw NSU and the antecedent. Figure 4.1 shows a schema of the task just defined. This is indeed the simplest way to define the task: the resolution procedure may also depend on other variables in the dialogue state, such as the Facts. In principle, the resolution task is defined independently from the actual semantic representation of the utterances. It is also defined independently from the rules used to update the variables in the dialogue state, such as QUD and Facts. In practice, defining a set of rules that are generic enough to handle every possible case and behave independently from the state update rules is a difficult task and still an open research problem.

Figure 4.1: The basic schema for the NSU resolution task.

4.2 Theoretical foundation

As previously stated, we rely on Fernandez:thesis and Ginzburg:interactivestance for the theoretical notions needed to represent the dialogue state and to develop the NSU resolution rules. In Section 2.2 we detailed the basic concepts of TTR, the utterance representation and the update rules for the dialogue state. In this section we describe the notions needed for the resolution of the NSUs. In particular we describe how we can exploit the parallelism between the NSU and its antecedent that we mentioned in Section 2.1. We discuss here the concepts that Ginzburg:interactivestance defines to address the resolution of NSUs then we will describe how we adapt those concepts to our needs in the next section.

4.2.1 Partial Parallelism

Instances of NSU classes such as Acknowledgments and Affirmative Answers are related to their antecedent as a whole, that is, to understand their meaning one has to consider not a specific aspect of the antecedent but the entire sentence. On the other hand, there are NSU classes, such as Short Answers and Sluices, that show a more fine-grained parallelism with their antecedents, i.e. they may refer to particular aspects of the antecedent. In the theory of Ginzburg:interactivestance, this concept is named Partial Parallelism (Fernandez:thesis previously referred to it as Sentential Antecedent (SA)). Partial Parallelism is one way to categorize NSU classes according to the relation with their antecedents: NSU classes are categorized as +/-ParPar in order to find the right way to treat them. An NSU class categorized as +ParPar involves accessing one or more sub-utterances of its antecedent. On the contrary, -ParPar NSU classes do not require access to the internal structure of their antecedents to be resolved. Table 4.1 shows how the NSU classes are categorized in this way.

-ParPar: Plain Acknowledgment, Plain Affirmative Answer, Plain Rejection, Factual Modifier, Check Question, Propositional Modifier
+ParPar: Short Answer, Repeated Acknowledgment, Clarification Ellipsis, Repeated Affirmative Answer, Sluice, Helpful Rejection, Filler, Bare Modifier Phrase, Conjunct Fragment
Table 4.1: An overview of the NSU classes divided according to Partial Parallelism.

4.2.2 Propositional lexemes

-ParPar NSU classes are (mainly) realized by propositional lexemes, i.e. words that can stand alone and form a proposition with full contextual meaning. Among these classes are Plain Affirmative Answers, Plain Rejections and Propositional Modifiers, which are realized respectively by the words yes, no and by adverbials such as probably and possibly.

These classes of NSU arise from polar questions such as (4.1).

  (4.1) A Will you go to the party on Saturday? B Yes. / No. / Probably.

The semantic content of these stand-alone lexemes can be modeled as a function of the content of the antecedent polar question; the function depends on the NSU class.

For Plain Affirmative Answers, the function is the identity, i.e. the function that returns its argument itself. This means that the positive answer “yes” to a polar question is equivalent to the assertion of a proposition with the same content as the polar question.

For Plain Rejections, the function is the relation Neg. Neg indicates the negation of a proposition, but it is sensitive to the polarity of its argument p: when p is positive, Neg(p) is the negation of p (denoted ¬p), whereas when p is negative, Neg(p) is p itself. This rule is needed to account for the asymmetry in the meaning of negative answers to negative questions. A negative answer to a negative question does not equate to a positive one, as exemplified in (4.2) (rephrased from Ginzburg:interactivestance).

  (4.2) A Did Paul not leave? B No. (= Paul did not leave.)

For Propositional Modifiers, the function applies different modalities on the basis of the lexical meaning of the word used as modifier, e.g. “probably” yields a different modality than “clearly”.

4.2.3 Focus Establishing Constituents

To account for the partial parallelism between NSUs of the +ParPar group and their antecedents, we need to keep track of the focal sub-utterances of the antecedents, i.e. of the elements of QUD. For this reason we employ the notion of focus establishing constituents (FEC) from the theory of Ginzburg:interactivestance (the concept was previously formalized by Fernandez:thesis as topical constituents). The FECs are relevant constituents of the elements of QUD that may be used to resolve NSUs. Consider the following example:

  (4.3) A A friend is coming to the party. B Who?

The noun phrase “A friend” in the first sentence of (4.3) is the one the following Sluice refers to. Roughly, the Sluice can be resolved as: “Who is your friend that is coming to the party?”. It is clear that this sub-utterance has to be contextually available to allow the resolution of the subsequent Sluice. In this we follow Ginzburg:interactivestance, who defines a set of rules to make FECs contextually available. In particular we are interested in the following ones:

  • The FEC associated with a wh-interrogative is the wh-phrase itself (as in Ginzburg:interactivestance we consider only unary wh-interrogatives; refer to Fernandez:thesis for an account of utterances with multiple wh-phrases):

    (4.4) A Who is organizing the party? B Paul.

  • The FEC associated with a polar interrogative or declarative utterance can be any (quantified) noun phrase:

    (4.5) A A friend is organizing a party and many people are coming. B Who?

  • The FEC associated with a clarification request is the sub-utterance that has to be clarified, i.e. any sub-utterance of the antecedent:

    (4.6) A Is Paul organizing a party? B Paul? / Organizing? / A party?

4.2.4 Understanding and acceptance

The classes of Plain Acknowledgments and Check Questions are used to handle understanding and acceptance in the conversation. Plain Acknowledgments are used to give direct feedback of understanding or acceptance of the previous utterance. Understanding involves successfully grasping the content of an utterance, while acceptance is a sign of shared belief, which therefore updates the Facts with the accepted content and removes the corresponding issue from the QUD. As argued in Fernandez:thesis, understanding does not always imply acceptance, and Plain Acknowledgments are ambiguous in this respect. Despite this difference, we assume that Plain Acknowledgments are used to show acceptance, therefore the use of a Plain Acknowledgment also downdates the QUD. On the other hand, understanding is assumed to be shown by any utterance that is not a Clarification Ellipsis.

Check Questions are used in conversation to request explicit feedback about the understanding/acceptance of the previous utterance.

4.2.5 Sluicing

Sluices can take a wide range of meanings depending on the particular situation. To formalize the meaning of Sluices, Fernandez:2007 distinguish four types of Sluices that convey different meanings: Direct Sluices, Reprise Sluices, Clarification Sluices, Wh-anaphor.

The aforementioned paper describes a machine learning experiment to automatically classify Sluices according to these types. Ginzburg:interactivestance describes several different treatments for every group of Sluices.

In our work we do not distinguish between these types of Sluices; for simplicity we confine ourselves to direct Sluices only. Direct Sluices, such as the one in (4.7), are used to query the other speaker for additional information about some aspect of the antecedent.

  (4.7) A Can I have some toast please? B Which sort?
    [BNC: KCH 104–105]

4.3 Dialogue context design

As mentioned before, the dialogue context is represented as a Bayesian network containing a set of random variables representing the current information state. The values of those random variables can represent virtually anything, from the raw utterances to their semantic representations. The variables in our dialogue context are inspired by Ginzburg:interactivestance. In order to make the transition from the rules of Fernandez:thesis to probabilistic rules as direct as possible, we mimic the basic dynamics of the DGB detailed in Section 2.2. For our semantics we do not employ TTR, because it would add unnecessary complexity to our formalization. In this section we first describe the semantics we adopt and then discuss the random variables that compose the dialogue context.

4.3.1 Semantics

The semantic content of the utterance is represented by logical predicates, individuals and variables. Predicates are labeled as words or camel-case phrases and can present zero or more arguments. Individuals are labeled with uppercase abbreviations such as IND for generic individuals or E for events. Variables are labeled with an uppercase X. Both individuals and variables are uniquely identified by a numeric subscript.

Predicates represent the high-level semantic meaning of the constituents of the utterances. Intuitively, predicates whose arguments contain no variables represent propositions. As discussed in Section 2.2.1, polar questions and wh-questions can be seen as functions from/to record types; polar questions take as argument the empty record type. Following this schema, in our formalism polar questions are denoted by predicates with no variables, whereas wh-questions are denoted by predicates containing one or more variables, as illustrated by the example below.
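
For illustration, under these conventions a proposition and a wh-question about a (hypothetical) party domain could be written as follows; the predicate and the individual are invented for the example:

    goToParty(IND_1)    (a proposition: the individual IND_1 goes to the party)
    goToParty(X_1)      (a wh-question: who goes to the party?)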


Retrieving the semantic representation from the raw utterances is a Natural Language Understanding (NLU) task, entirely distinct from the resolution of NSUs. We do not attempt to generate predicates from raw utterances; instead we use simple handcrafted predicates in our examples, abstracting away the NLU step needed to retrieve the meaning of the utterances that are not NSUs. We try to keep the problem of NSU resolution generic, separating it as much as possible from the NLU task.

4.3.2 Dialogue acts

As seen in Section 2.2.1, to represent the “purpose” of an utterance, we need to use an illocutionary relation, also known as dialogue act. The set of dialogue acts we employ in our dialogue context is a small subset of the ones defined by Ginzburg:interactivestance:

  • Assert, denoting the act of asserting a proposition;

  • Ask, denoting the act of posing a question;

  • Ground, denoting the act of understanding what was previously said;

  • Accept, denoting the act of accepting what was previously said.

Assertions are applied to propositions and they are implicitly considered truthful unless they violate some predicate in the Facts. Asking is the act of posing questions, which are piled up in the QUD until they are resolved by an answer. Answering a query corresponds, in the case of a wh-interrogative, to finding valid arguments for the variables of the question. In the case of a polar question, the answer is derived simply from its truth status, denoted by the presence of the same predicate in the Facts. In our formalization we use “Ground” to represent the act of understanding. Acceptance is the act of resolving an issue, which involves updating the Facts and downdating the QUD.

4.3.3 Variables of the dialogue context

For our formalization, as in TTR, we assume the availability of various data structures such as variables, lists, sets and complex types. The probabilistic rules formalism provides these structures out of the box as possible values for the random variables. Array elements are accessed with the square-bracket notation (e.g. qud[i]), sets are written with the usual mathematical notation, and the fields of complex types are accessed with the dot notation (e.g. qud[i].q). The classical operations on sets, such as union and intersection, are available, as is array concatenation. We now describe the variables used in our formalization of the dialogue context.

Utterance and dialogue act variables

As a convention, the raw utterance and the dialogue act of each speaker are recorded in dedicated random variables, with a subscript denoting the speaker; only the last utterance and the last dialogue act of each speaker are kept.

NSU type

A random variable that contains the distribution over the NSU classes returned by the classifier for the latest recorded utterance. It uses max-qud to refer to the antecedent, so the probabilistic inference framework takes care of finding the most probable antecedent for the current NSU. Besides the values corresponding to the NSU classes, a distinct value NoNsu is used to account for input utterances that are not NSUs. To determine whether an utterance is an NSU we use the same detection rules explained in Section 3.5.1.

new-fec

The set of FECs introduced by the NLU of the last recorded utterance. It is also a buffer variable used in the NSU resolution to encode the focal constituents of the newly resolved NSU, and it also holds the FECs of the utterance that is being inserted into the qud.

facts

A set of predicates representing the common knowledge of the users. The predicates in facts contain only individuals as arguments (i.e. no variables) and they are implicitly considered truthful.

qud

As defined in Section 2.2, the QUD is a partially ordered set containing the issues currently under discussion. Its ordering determines the “priority” of the issues to be resolved. Here instead qud is represented as a vector and the max-qud variable denotes the index of the MaxQUD element (see below). Each element in qud has a number of sub-fields:

  • utt: The raw utterance associated to the current question under discussion;

  • q: The semantic representation of the utterance;

  • fec: An array of topical sub-utterances used in the resolution of the NSUs.

The qud grows by adding elements at the tail (increasing indices) and shrinks in a random-access fashion, usually by removing the MaxQUD element (which may not be the last element) after its resolution. We denote the number of elements in qud by |qud|.

max-qud

Despite being represented as the maximal element of QUD in Ginzburg:interactivestance, here max-qud denotes the index of that element, which is therefore retrieved as qud[max-qud]. In Ginzburg:interactivestance, MaxQUD is given by the partial ordering imposed on QUD; this ordering is often similar, but not limited, to the behavior of a stack. At our disposal we have the full power of probabilistic modeling, which enables us to encode max-qud as a random variable whose prior gives more probability to the higher indices in qud. In this way, the a priori most probable MaxQUD is the last element inserted into the QUD, but this probability can be modified by other contextual elements through probabilistic inference on the dialogue state.
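
One simple choice for such a prior (an illustrative assumption; the exact weighting function is a design parameter) is an exponentially increasing weight over the indices:

\[ P(\text{max-qud} = i) \;\propto\; 2^{\,i}, \qquad i = 0, \dots, |\mathrm{qud}| - 1 \]

so that the last element inserted receives the highest prior probability, while earlier elements remain possible antecedents with smaller probability.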

4.4 NSU resolution rules

Here we present the probabilistic rules that handle the resolution of NSUs. For each rule we also present an example of usage. Since they are an (almost) direct translation of the deterministic rules from Fernandez:thesis, most of them have deterministic effects (i.e. a single effect with probability 1). Nonetheless, the updates are handled probabilistically by the probabilistic rules framework, through probabilistic inference over the Bayesian network representing the dialogue state. We show an example of probabilistic update in Section 4.4.1, which carries over to every other resolution rule.

4.4.1 Acknowledgments

The only requirement for Acknowledgment resolution is to have at least one issue under discussion to be accepted. As explained in Section 4.2.4, we assume that an explicit Acknowledgment is a sign of acceptance of the latest issue under discussion. For Repeated Acknowledgments, Fernandez:thesis requires co-referentiality between the repeated constituent in the NSU and the corresponding constituent in the FEC of MaxQUD. We decided to drop this requirement, assuming that the co-reference is always present when the classifier assigns the class RepAck to the current NSU. This assumption does not affect the system, given that the effect on the state variables is the same for both Acks and RepAcks. The rule for Acknowledgments accepts the MaxQUD issue: when the latest utterance is classified as an Acknowledgment and qud is not empty, the content of the MaxQUD element is added to facts and the element is removed from qud.

Consider the following example.

  (4.10) B I am going to the party. A OK.

In the dialogue context of (4.10), qud contains the issue raised by B's assertion; after the application of the rule, its content is added to facts and the issue is removed from qud. A minimal sketch of this update for the example is given below.
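
The sketch below uses a dictionary-based dialogue state whose variable names follow Section 4.3.3; the predicate goToParty(B) is an invented illustration of the asserted content, and the plain deterministic function abstracts away the probabilistic-rule notation:

    def ack_rule(state):
        """Acknowledgment: accept the MaxQUD issue, i.e. add its semantic
        content to 'facts' and remove it from 'qud'."""
        if state['nsu_type'] == 'Ack' and state['qud']:
            issue = state['qud'].pop(state['max_qud'])    # remove the MaxQUD element
            state['facts'].add(issue['q'])                # its content becomes shared knowledge
            state['max_qud'] = len(state['qud']) - 1 if state['qud'] else None
        return state

    # Example (4.10): B asserts "I am going to the party.", A answers "OK."
    state = {'qud': [{'utt': 'I am going to the party.', 'q': 'goToParty(B)', 'fec': []}],
             'max_qud': 0, 'facts': set(), 'nsu_type': 'Ack'}
    print(ack_rule(state))   # facts = {'goToParty(B)'}, qud = []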

Notice that this may be an oversimplification, since the values of the variables in the dialogue state are often not determined with full probability; rather, each variable encodes a probability distribution over a set of values. For instance, the classifier will typically return the type of the NSU as a probability distribution with one highly probable value and a few other values with smaller probability scores. In that case, the resulting state update is itself a distribution over outcomes (the alternative outcome is not necessarily None, because other rules may be triggered by the other values of the NSU type). Since the dialogue state is a Bayesian network, the update rules return a distribution of values that depends both on the distribution assigned by the rule (in this case a single value with full probability) and on the distributions of the variables the rule depends on. These considerations extend to all the other classes, so in the following sections we only point out the most relevant use cases.

4.4.2 Affirmative Answers

The context for an Affirmative Answer contains a polar question as MaxQUD. As for the Acknowledgments, we drop the requirement of co-referentiality between the repeated constituent of the RepAffAns and the same constituent in the FECs of the MaxQUD element.

An Affirmative Answer to a polar question corresponds to asserting the same semantic content (predicate) as the question. The rule to handle Affirmative Answers therefore asserts the MaxQUD predicate, adds it to facts and resolves the issue (a single probabilistic effect may contain several such assignments). A sketch of this rule is given below.
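
The sketch follows the same dictionary-based style as before; the dialogue-act field last_act is an illustrative name for the answering speaker's dialogue act:

    def aff_ans_rule(state):
        """Affirmative Answer: assert the same predicate as the MaxQUD polar
        question, add it to 'facts' and resolve the issue."""
        if state['nsu_type'] == 'AffAns' and state['qud']:
            issue = state['qud'].pop(state['max_qud'])
            state['last_act'] = 'Assert(%s)' % issue['q']    # "yes" asserts the question's content
            state['facts'].add(issue['q'])
            state['max_qud'] = len(state['qud']) - 1 if state['qud'] else None
        return state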

An example of application of the affAns rule can be:

  (4.11) B Are you going to the party? A Yes.

In the context of (4.11), the MaxQUD issue is the polar question raised by B; after the application of the rule, the corresponding predicate is asserted, added to facts, and the issue is removed from qud, as in the sketch above.

4.4.3 Rejections

As for Affirmative Answers, the context of a Rejection is a polar question, but, as explained in Section 4.2.2, we need to distinguish the cases in which the question is positive or negative. We define the following function Neg, indicating the negation of a proposition (or equivalently of a question):

\[ \mathit{Neg}(p) = \begin{cases} \neg p & \text{if } p \text{ is positive} \\ p & \text{if } p \text{ is negative} \end{cases} \]

where ¬p is the negative of p. As an extension of this notation, we indicate a proposition that is explicitly negative as ¬p.

Rejections are handled by a rule that asserts Neg applied to the content of the MaxQUD polar question and resolves the issue; a sketch is given below.
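
The sketch mirrors the Affirmative Answer one but applies the polarity-sensitive Neg function defined above; the not(...) string encoding of negation is an illustrative choice:

    def neg(p):
        """Polarity-sensitive Neg: the negation of p if p is positive,
        p itself if p is already negative (cf. Section 4.2.2)."""
        return p if p.startswith('not(') else 'not(%s)' % p

    def reject_rule(state):
        """Rejection: assert Neg(q) for the MaxQUD polar question q and
        resolve the issue."""
        if state['nsu_type'] == 'Reject' and state['qud']:
            issue = state['qud'].pop(state['max_qud'])
            state['facts'].add(neg(issue['q']))
            state['max_qud'] = len(state['qud']) - 1 if state['qud'] else None
        return state

    # Example (4.12): "Are you going to the party?" -- "No."  =>  not(goToParty(A)) in facts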

The following is an example for the above rule:

  (4.12) B Are you going to the party? A No.

In the context of (4.12), the MaxQUD issue is the (positive) polar question raised by B; after the application of the rule, its negation is asserted and added to facts, and the issue is removed from qud.

4.4.4 Propositional Modifiers

Like Affirmative Answers and Rejections, Propositional Modifiers are triggered by polar questions. As seen in Section 4.2.2, their resolution corresponds to asserting the predicate of the polar question modified by a certain modality, given by the lexical meaning of the NSU itself.

We define a function that modifies the meaning of a proposition (or equivalently of a question) with a modality m. The modality is given by the lexical meaning of the word used in the NSU, here indicated for simplicity as the word itself (contained in the variable holding the raw NSU utterance).

The rule for Propositional Modifiers asserts the MaxQUD question modified by the modality of the NSU and resolves the issue accordingly.

Here is an example of application of the above rule:

  (4.13) B Are you going to the party? A Probably.

In the dialogue state of (4.13), the MaxQUD issue is the polar question raised by B; after the application of the rule, the corresponding predicate, modified by the modality associated with “probably”, is asserted and added to facts, and the issue is removed from qud.

Unlike Affirmative Answers and Rejections, Propositional Modifiers need to take into account the lexical meaning of the modifier in order to update the dialogue state accordingly. This requires a set of lexicalized update rules that properly react to each possible modality of the modified proposition. However, these rules only take place at the level of action selection and context update, therefore it is still possible to resolve this kind of NSU in a general way, as previously explained in Section 4.2.2.

An example of a lexicalized rule for updating the context in the presence of a modified proposition is the following.