Givenness Hierarchy Theoretic Cognitive Status Filtering

by   Poulomi Pal, et al.

For language-capable interactive robots to be effectively introduced into human society, they must be able to naturally and efficiently communicate about the objects, locations, and people found in human environments. An important aspect of natural language communication is the use of pronouns. According to the linguistic theory of the Givenness Hierarchy (GH), humans use pronouns due to implicit assumptions about the cognitive statuses their referents have in the minds of their conversational partners. In previous work, Williams et al. presented the first computational implementation of the full GH for the purpose of robot language understanding, leveraging a set of rules informed by the GH literature. However, that approach was designed specifically for language understanding, oriented around GH-inspired memory structures used to assess what entities are candidate referents given a particular cognitive status. In contrast, language generation requires a model in which cognitive status can be assessed for a given entity. We present and compare two such models of cognitive status: a rule-based Finite State Machine model directly informed by the GH literature and a Cognitive Status Filter designed to more flexibly handle uncertainty. The models are demonstrated and evaluated using a silver-standard English subset of the OFAI Multimodal Task Description Corpus.




1 Introduction

As human-robot interaction becomes increasingly common, robots need to be able to talk about the objects, locations, and people in their environments in the same way humans do, to facilitate concise, easy, and unambiguous communication. To reap these benefits, just like humans, robots must be able to understand and use pronouns like it, this, and that. The linguistic theory of the Givenness Hierarchy (GH) (gundel1993cognitive) suggests that humans tend to use pronouns rather than longer referring expressions due to implicit assumptions about the cognitive status the referent has in the mind of their interlocutor. That is, the use of different referring forms is viewed as justified based on whether the referent is In Focus, Activated, Familiar, and so forth, within the current conversation. Thus, for robots to understand and generate human-like natural language they must be able to model this notion of cognitive status.

Previously, williams2018reference (see also (williams2018towards; williams2016situated)) presented the first full computational implementation of the GH for the purpose of robotic natural language understanding, using a set of hand-crafted rules informed by the GH literature. However, that approach was designed specifically for robotic natural language understanding, oriented around GH-inspired memory structures used to assess what entities are candidate referents given a particular cognitive status. In contrast, natural language generation requires a model in which cognitive status can be assessed for a given entity.

Such a model of cognitive status could either be developed as a rule-based model (not dissimilar from the rule-based approach to GH-theoretic language understanding taken by williams2018reference), or could instead be developed as a statistical model which would attempt to learn to predict an entity’s cognitive status from data. While in practice both rule-based and data-driven empirical models are useful (bangalore2005probabilistic), data-driven models may be better able to handle unseen, uncertain situations (bangalore2003balancing; bangalore2005probabilistic).

In this paper, we thus propose (and compare to a rule-based Finite State Machine (FSM) model) the Cognitive Status Filter (CSF): a data-driven probabilistic model of cognitive status, structured to be optimized for natural language generation rather than natural language understanding, trained and evaluated using a silver-standard English subset of the OFAI Multimodal Task Description Corpus (schreitter2016ofai). (Footnote 1: This subset constitutes an English translation of originally German dialogues.) Specifically, the CSF seeks to predict the cognitive status for a given entity based on whether and how it has been referenced in natural language.

The remainder of this paper is organized as follows. After discussing related work on cognitive status and referring expressions, we formally define the concept of a Cognitive Status Filter. We then present the results of a crowdsourced human-subject experiment to gather the data necessary to train and evaluate this model, and compare the CSF model’s performance to that of a rule-based Finite State Machine model. Finally, we discuss our results and conclude with possible directions for future work.

2 Related Work

The Givenness Hierarchy, originally presented by gundel1993cognitive, consists of a nested hierarchy of six tiers of cognitive status: {in focus > activated > familiar > uniquely identifiable > referential > type identifiable}, each of which is associated with a set of referring (or pronominal) forms that can be used when referring to an entity with that status (gundel2006coding; hedberg2013applying). The hierarchical nesting here means that an entity with one status can also be said to have all other statuses lower in the hierarchy. If a target referent is in focus, for example, it can also be inferred to be activated, familiar, and so forth. Accordingly, a speaker’s selection of a pronominal form depends on their assumptions as to the cognitive status of their target referent in the mind of their conversational partner. For example, if a speaker uses “it” to refer to an object, the listener can infer that the object being referenced must be one that is already in focus, whereas if a speaker uses “that N” (a demonstrative followed by a noun), the listener can only infer that the object is at least familiar (but may in fact be activated or even in focus).
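The nesting described above can be made concrete with a small sketch. It is only an illustration: the `Status` enum, the function names, and the two-entry form table are hypothetical, and the familiar-status form is written as "that N" (a demonstrative followed by a noun), following the GH literature.

```python
from enum import IntEnum


class Status(IntEnum):
    """The six GH tiers; a larger value means a higher (more restrictive) status."""
    TYPE_IDENTIFIABLE = 1
    REFERENTIAL = 2
    UNIQUELY_IDENTIFIABLE = 3
    FAMILIAR = 4
    ACTIVATED = 5
    IN_FOCUS = 6


def entailed_statuses(status):
    """Nesting: an entity with a given status also holds every lower status."""
    return [s for s in sorted(Status, reverse=True) if s <= status]


# Minimum status signalled by the two forms discussed in the text
# (illustrative subset only, not the full GH form table).
MIN_STATUS_FOR_FORM = {"it": Status.IN_FOCUS, "that N": Status.FAMILIAR}


def statuses_compatible_with(form):
    """Statuses a listener cannot rule out after hearing the given form."""
    minimum = MIN_STATUS_FOR_FORM[form]
    return [s for s in sorted(Status, reverse=True) if s >= minimum]
```

Here `statuses_compatible_with("it")` leaves only in focus, while `statuses_compatible_with("that N")` leaves familiar, activated, and in focus, matching the inference pattern in the example above.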

The hierarchical structure of the GH is also important due to the way it parallels the hierarchical nesting of models of human memory, such as cowan1998attention’s, in which the focus of attention is a subset of short-term memory (or working memory), which is in turn a subset of long-term memory.

The GH coding protocol, presented by gundel2006coding, provides guidelines as to what features of linguistic and environmental context should dictate the cognitive status of a given entity. For example, this protocol suggests that an entity that is mentioned in a topic role in the preceding clause should be considered to be in focus, and that any entity that is mentioned at all should be considered to be at least activated (gundel2006coding; hedberg2013applying).

Due to the GH’s popularity within the research literature, and its validation across a wide variety of languages beyond English (gundel2010testing), many researchers have sought to computationally implement it in whole or in part, especially within the context of reference resolution algorithms. kehler2000cognitive, for example, use the GH to justify an approach in which elements of an interface that are highlighted are considered to be “in focus”, and referring expressions that use pronominal forms are automatically resolved to those highlighted referents.

Building on this work, chai2004probabilistic proposed a probabilistic graph-matching algorithm for resolving referring expressions that are complex (involving multiple target referents) and ambiguous (involving gestures that could indicate multiple candidate referents) in multimodal user interfaces. Because this algorithm had high computational complexity, chai2006cognitive demonstrated how the algorithm’s performance could be improved using a greedy algorithm based on the theories of Conversational Implicature (grice1975logic; dale1995computational) and the GH. Chai et al. combine these theories to create a reduced hierarchy: Gesture > Focus > Visible > Others, where Focus combines the “in focus” and “activated” tiers of the GH, and Visible combines its “familiar” and “uniquely identifiable” tiers. When a referring expression is processed, the relationship between referring form and status is then used to help resolve that referring expression.

Finally, while the approaches above focused on modeling of reduced versions of the GH, williams2016situated; williams2018reference instead presented an implementation of the full GH, through a set of rules that associated different referring forms with different sequences of actions involving all six tiers of the GH. This required, in part, four data structures corresponding to the top four tiers of cognitive statuses of the GH, while the last two tiers were instead associated with new “mnemonic actions” such as creating new mental representations (williams2018reference).

In all of these previous approaches, the GH is used to justify a set of data structures used to store representations for entities that could be referred to, and to justify which of these data structures should be considered (and how) when a given referring form is used. However, while this is sensible during natural language understanding, it may not be appropriate for the purposes of natural language generation. During generation, the speaker already knows what object they wish to refer to, and does not need to search through these sorts of data structures. Instead, when a speaker decides what referring form to use to refer to a given object, we argue that they would start by determining the status of that object, and only then look through the data structure associated with that status, in order to determine what distractors must be ruled out. Critically, this requires the ability to quickly determine the cognitive status of a given entity. Accordingly, in the next section we propose an approach to this problem, which we term cognitive status modeling.

3 Problem Formulation

We formulate cognitive status modeling as a Bayesian filtering problem. Let a dialogue consist of a set of utterances $U = \{u_1, \ldots, u_n\}$. For object $o$, let $S_o^t$ denote the cognitive status of $o$ at a particular timestep $t$ after utterance $u_t$ (either In Focus, Activated, or Familiar), and let $L_o^t$ denote the linguistic status of $o$ in utterance $u_t$ (e.g., either not mentioned in the utterance, mentioned in the utterance in a non-topic role, or mentioned in the utterance in a topic role). Using this formalism, our goal is to recursively estimate, for a given object, the probability distribution over cognitive statuses for object $o$ at time $t$:

$$p(S_o^t \mid L_o^1, \ldots, L_o^t) = \sum_{S_o^{t-1}} p(S_o^t \mid S_o^{t-1}, L_o^t)\, p(S_o^{t-1} \mid L_o^1, \ldots, L_o^{t-1})$$

We define a Bayesian filter of this form as a Cognitive Status Filter (CSF) for a given object $o$. Given a set of known objects, $O$, our goal is then to estimate this distribution for each $o \in O$ at each time step. To do so, we use a Cognitive Status Modeling Engine consisting of a set of CSFs, one for each object believed to be of status Familiar or higher within the conversation. Here, we make the simplifying assumption that the same set of objects is known to both the robot and its conversational partner, meaning that the set of all objects with status Uniquely Identifiable or higher is simply the set of objects $O$. We assume that it is straightforward to determine whether one of these objects is or is not Familiar based on whether or not it has appeared in the current conversation. This allows us to model whether or not an object is of status Familiar or higher based on whether or not a CSF exists for that object, and to model which of those statuses the object likely has, using its associated CSF.
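A single recursive update of such a filter can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name, the status labels, and every probability in `TOY_ROWS` are placeholders, and for brevity the toy table ignores the previous status, which a learned table would not.

```python
STATUSES = ("I", "A", "F")                  # In Focus, Activated, Familiar
LINGUISTIC = ("none", "non_topic", "topic")  # linguistic status in an utterance


def csf_update(belief, linguistic_status, cpt):
    """One filtering step: marginalise over the previous cognitive status.

    belief: dict status -> probability before the utterance
    cpt:    dict (previous status, linguistic status) -> dict over next status
    """
    new_belief = {s: 0.0 for s in STATUSES}
    for prev, p_prev in belief.items():
        row = cpt[(prev, linguistic_status)]
        for s in STATUSES:
            new_belief[s] += row[s] * p_prev
    total = sum(new_belief.values())
    return {s: v / total for s, v in new_belief.items()}


# Placeholder probabilities, NOT values learned from the corpus.
TOY_ROWS = {
    "topic": {"I": 0.80, "A": 0.15, "F": 0.05},
    "non_topic": {"I": 0.20, "A": 0.70, "F": 0.10},
    "none": {"I": 0.05, "A": 0.25, "F": 0.70},
}
TOY_CPT = {(prev, ling): dict(TOY_ROWS[ling])
           for prev in STATUSES for ling in LINGUISTIC}

uniform = {s: 1.0 / len(STATUSES) for s in STATUSES}
after_topic_mention = csf_update(uniform, "topic", TOY_CPT)
```

Keeping one belief dictionary per object, and calling `csf_update` once per utterance, gives the per-object filter bank described above.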

4 Data Collection

The core component of our CSF model that must be learned ahead of time is the conditional probability of an object's current cognitive status given its previous cognitive status and its current linguistic status, $p(S_o^t \mid S_o^{t-1}, L_o^t)$. To learn this, we trained our model using a silver-standard English translation of the German OFAI Multimodal Task Description corpus (schreitter2016ofai). The corpus is a collection of human-human and human-robot interactions in which a human teacher shows and explains to a human or robot learner how to connect two separate parts of a tube and then how to mount the tube onto a box with holders, as shown in Figure 1, by actually moving the objects around and performing the task while explaining it to the learner. Sentences in this corpus average 8-9 words in length. As the name suggests, the corpus is “multimodal”: it contains both verbal and non-verbal cues such as speech, gaze, and gestures. Realistic multimodal HRI scenarios require the use of such non-verbal cues; however, as a first step, we begin in this work by looking only at our model’s ability to handle the same kind of linguistic factors that are handled by the GH, leaving the ability to model other linguistic factors for future work.

While the OFAI MTD corpus contains data from four task scenarios, we only use the data from one particular task scenario (Task 3). The original dataset for this task consists of 16 monologues, each having approximately 4 to 5 utterances. As a first step, in this work we begin by evaluating our model on a small subset of the original dataset, consisting of 4 of these monologues, each of which comprises exactly 4 utterances, to control for monologue length. As shown in Figure 1, this task context contains 8 objects, including the learner and teacher.

Task 3 was selected because it includes a larger number of objects than the other tasks in a dyadic instruction context, and contains data from both human-human and human-robot dyads. Specifically, Task 1 involved a human teacher explaining and performing a task in front of the camera without the presence of a learner in the scenario; Task 2 involved a human teacher and a human learner jointly performing the task of moving an object; and Task 4 is a pure “navigation task” involving both human-human and human-robot dyads (schreitter2016ofai).

4.1 Appearance Feature Annotation

To collect linguistic status information L, three annotators independently annotated the OFAI Multimodal Task Description Corpus (schreitter2016ofai) according to the following annotation procedure. Each annotator was provided a printed copy of all 16 monologues to annotate. For each sentence in each monologue, the annotator was instructed to underline any piece of the text that could refer to some object in the scene. For each of these underlined pieces of text, the annotator was instructed to indicate the correspondence between the underlined sentence fragment and the object in the scene it referred to. Finally, the annotator was required to circle the fragment-object mapping they believed to be the topic of the sentence. There were a few cases in which annotators circled multiple objects as the topic of the sentence; in these cases, both objects were recorded as being equally probable topic referents. (Footnote 2: The inter-annotator agreement score as measured through Fleiss’ Kappa was κ = 0.37, indicating fair agreement between annotators. It will be important in future work to adapt the annotation protocol to increase the rate of agreement.)
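For readers unfamiliar with the agreement statistic reported for this annotation step, the standard Fleiss' kappa computation can be sketched as below. The function name and the input layout (one row per annotated item, one column per category, each cell counting annotators) are assumptions made for illustration; the paper does not describe how its score was computed in practice.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a table where ratings[i][j] is the number of
    annotators who assigned item i to category j.

    Assumes every item was rated by the same number of annotators (>= 2)
    and that more than one category is actually used.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])

    # Proportion of all assignments that fall in each category.
    p_cat = [sum(row[j] for row in ratings) / (n_items * n_raters)
             for j in range(n_cats)]

    # Observed agreement per item, then averaged over items.
    per_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
                for row in ratings]
    observed = sum(per_item) / n_items

    # Agreement expected by chance.
    expected = sum(p * p for p in p_cat)
    return (observed - expected) / (1 - expected)
```

With three annotators, a row such as `[3, 0, 0]` records unanimous agreement on the first category, while `[1, 1, 1]` records complete disagreement.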

4.2 Cognitive Status Annotation

Ground-truth cognitive status information was then collected through a crowdsourced human-subject experiment. 160 US participants were recruited from Amazon Mechanical Turk. Two participants answered an attention check question incorrectly and were dropped from our analysis, leaving 158 participants (71 female, 85 male, 2 N/A). Participant ages ranged from 19 to 70 years (M = 35.03, SD = 11.36). Each participant was paid $0.25 for completing the study.

4.2.1 Procedure:

At the beginning of the experiment, each participant was shown the scene depicted in Figure 1, and was instructed to remember the objects and their labels in order to perform their upcoming task. Participants were then shown the same scene without labels while listening to a portion of one of the experiment’s four monologues, as read by the experimenters. Specifically, participants were randomly assigned to hear a random prefix of a randomly selected monologue (i.e., either only the first utterance of that monologue, the first two, the first three, or all four).

Figure 1: Scene (labeled)

At the end of this monologue excerpt, participants were asked to answer two questions, presented in a randomized order, with the second question becoming available after the first question was answered. The two questions are as follows:

  • Q1: Click on the object in the scene that you think the speaker would most likely be referring to if the speaker would have said “look at it” at the end of the monologue.

  • Q2: Click on all the objects in the scene that you think the speaker would most likely be referring to if the speaker would have said “look at that” at the end of the monologue.

Two of the monologues used in our experiment are shown below.

Monologue 1:

You must take the tube with your right hand.


And insert it in at the yellow-green connection here.


Put it on the tube.


Again, with your right hand insert it here in the holder.

Monologue 2:

With the right hand stick the two tubes together.


You put that together here with the yellow-green mark.


It is okay that it is not holding firmly.


Now lead the one tube through here.

These questions allowed us to probe the user’s implicit beliefs as to the cognitive status of the objects in the scene. From a GH-theoretic perspective, if a participant implicitly believed a given object to be in focus, they should click on that particular object for both Q1 and Q2, whereas if they believed the object to be activated, they should click on that object for Q2 but not for Q1. Because the context is narrowly defined and participants were given time to examine each object in the scene, we assume that all objects in the scene are familiar or higher. Thus, if a participant believed the object to be familiar or lower, they should not click on the object at all. After completing the task, participants completed a check question (cf. schreitter2016ofai) requiring users to identify the scene they had viewed from among several distractors. This allowed us to ignore data from participants who did not pay sufficient attention while completing the task.
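The coding rule in this paragraph can be written as a small function. This is a sketch: the function name and status labels are hypothetical, and the case where an object is clicked for Q1 but not for Q2 is not covered by the description above, so treating it as in focus is an assumption.

```python
def status_from_clicks(clicked_q1, clicked_q2):
    """Map one participant's responses for one object to a cognitive status.

    clicked_q1: object chosen as the referent of "look at it"
    clicked_q2: object among those chosen as referents of "look at that"
    """
    if clicked_q1 and clicked_q2:
        return "in_focus"
    if clicked_q2:
        return "activated"
    if clicked_q1:
        # Assumption: Q1-only responses are not discussed in the text;
        # this sketch treats them as in focus.
        return "in_focus"
    # Not clicked at all: familiar, the lowest status modelled here.
    return "familiar"
```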

Using this coding procedure, we are thus able to determine the perceived cognitive status of each object in the scene for each participant after the completion of the monologue excerpt they were exposed to. When paired with the linguistic status annotations, this allowed us to train our CSF model, using the procedure described in the following section.

5 Training and Evaluation

5.1 Training

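After collecting this dataset, our CSF was trained in the following way. First, we initialized a 9x3 matrix whose rows correspond to the nine cognitive/linguistic status pairs an object could have at time $t$ (each of the three cognitive statuses In Focus, Activated, and Familiar paired with each of the three linguistic statuses: not mentioned, mentioned in a non-topic role, and mentioned in a topic role), and whose columns correspond to the three cognitive statuses that object could have at time $t+1$ (I, A, F).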

For each pair of adjacent utterances $(u_t, u_{t+1})$ in each monologue $m$, we consider the data from all participants (for all objects) who provided data immediately following utterance $u_t$, and from all participants who provided data immediately following utterance $u_{t+1}$. For each resulting pair of datapoints, we identify and increment the correct cell in this matrix. For example, for the combination of a datapoint from a participant who heard some utterance and subsequently viewed object 1 as in focus, and a datapoint from a participant who heard the next utterance in the same monologue, containing object 1 in a non-topic role, and at that point viewed the object as being activated, we would increment the cell in the row for (in focus, non-topic mention) and the column for activated. Once all data has been considered, we normalize each row of this table to produce a conditional probability table.
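The counting and normalization procedure can be sketched as follows. The names are hypothetical, the pairing of datapoints is read here as a cross product between the two participant groups, and the uniform fallback for rows with no observations is an assumption that the text does not address.

```python
from itertools import product

STATUSES = ("I", "A", "F")                   # In Focus, Activated, Familiar
LINGUISTIC = ("none", "non_topic", "topic")  # linguistic status of the object


def pair_datapoints(statuses_after_first, statuses_after_second, linguistic_status):
    """Pair every judgement collected after one utterance with every
    judgement collected after the next utterance, for a single object."""
    return [(prev, linguistic_status, nxt)
            for prev, nxt in product(statuses_after_first, statuses_after_second)]


def train_cpt(transitions):
    """Count (previous status, linguistic status) -> next status transitions
    in a 9x3 table, then normalise each row."""
    counts = {(p, l): {s: 0 for s in STATUSES}
              for p in STATUSES for l in LINGUISTIC}
    for prev, ling, nxt in transitions:
        counts[(prev, ling)][nxt] += 1

    cpt = {}
    for key, row in counts.items():
        total = sum(row.values())
        # Assumption: rows never observed fall back to a uniform distribution.
        cpt[key] = {s: (row[s] / total if total else 1.0 / len(STATUSES))
                    for s in STATUSES}
    return cpt


# Toy judgements for one object mentioned in a non-topic role.
toy_transitions = pair_datapoints(["I"], ["A", "A", "I"], "non_topic")
toy_cpt = train_cpt(toy_transitions)
```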

5.2 Evaluation

To evaluate our CSF model, we then considered each object $o$ and each monologue $m$, and retrained our model using all data except that which was collected for object $o$ or monologue $m$ (for example, while testing for object $o$ in monologue $m$, we retrain our model with all the data except that concerned with $o$ and/or $m$), and used this model (along with a prior distribution over cognitive statuses for that object as described below) to simulate what status would be predicted for that object at each point in that monologue. After each of these utterances, we evaluated the model’s prediction by comparing it to the majority opinion from participants who had provided data for that object at that point in that monologue. Combining these prediction results for all eight objects in all four utterances in all four monologues produced a 128-element prediction vector for the model.
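The hold-out and scoring steps can be sketched as below. The record layout (dictionaries with `object` and `monologue` keys) and the function names are hypothetical, and the tie-breaking behaviour of the majority vote is an assumption, since the text does not say how ties were resolved.

```python
from collections import Counter


def training_split(datapoints, test_object, test_monologue):
    """Keep only data involving neither the held-out object nor the
    held-out monologue."""
    return [d for d in datapoints
            if d["object"] != test_object and d["monologue"] != test_monologue]


def majority_status(labels):
    """Most frequent participant judgement.

    Assumption: ties go to whichever label was seen first.
    """
    return Counter(labels).most_common(1)[0][0]


def accuracy(predicted, gold):
    """Proportion of entries in the prediction vector that match the
    majority judgements."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)
```

With eight objects, four utterances, and four monologues, `predicted` and `gold` would each hold 128 entries.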

Specifically, we computed these prediction vectors for each of two CSF models, each of which used a different prior distribution over cognitive statuses:


  • an uninformed prior (the U-model), in which each cognitive status was assigned a prior probability of 0.33.

  • a (weakly) informed prior (the I-model), in which the three cognitive statuses were assigned prior probabilities I = 0.05, A = 0.1, F = 0.85. These probabilities reflect the fact that objects are a priori far more likely to be familiar than activated, and that among objects that are at least activated, a given object is more likely to be merely activated than in focus. While in theory this distribution could be learned from data, in a realistic environment it may be the case that hundreds or thousands of objects are familiar and only one is in focus, yielding an extremely unbalanced distribution. This weakly informed prior thus represents an optimistic belief state in which the prior probability of any given object being in focus is artificially boosted.

In addition to these two prediction vectors produced by different parameterizations of our CSF model, we also computed prediction vectors for two baseline models:

Finite State Machine:

First, we computed the decisions made by a rule-based FSM model, which formalized a set of heuristics from the GH coding protocol (the same heuristics previously used in the work of williams2018reference). In this FSM, the states correspond to cognitive statuses, and transitions are triggered based on linguistic statuses observed in incoming utterances. For example, for an FSM dedicated to some object, if that object is mentioned in a topic role, this will deterministically trigger a state transition to in focus.
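A per-object FSM of this kind can be sketched in a few lines. Only the topic-role transition is stated explicitly in this section; the non-topic rule is adapted from the coding-protocol heuristic summarized in the related work (any mentioned entity is at least activated), and the one-level decay for unmentioned objects is purely an assumption of this sketch, not a rule taken from the paper.

```python
# Assumed decay when an object is not mentioned in an utterance.
DEMOTE = {"in_focus": "activated", "activated": "familiar", "familiar": "familiar"}


def fsm_step(state, linguistic_status):
    """Deterministic transition for one object after one utterance."""
    if linguistic_status == "topic":
        return "in_focus"        # stated rule: topic mention -> in focus
    if linguistic_status == "non_topic":
        return "activated"       # adapted heuristic: a mention activates
    return DEMOTE[state]         # assumption: unmentioned objects decay


def fsm_run(linguistic_statuses, start="familiar"):
    """Predicted status after each utterance in a monologue."""
    state, trace = start, []
    for ling in linguistic_statuses:
        state = fsm_step(state, ling)
        trace.append(state)
    return trace
```

For instance, `fsm_run(["topic", "none", "none", "non_topic"])` traces an object that is raised to in focus, decays twice, and is then reactivated by a later mention.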

Random Baseline:

Second, we computed the decisions made by a random baseline (RB) model, which predicted cognitive statuses at random.

5.3 Results

The overall accuracy of each model (i.e., the proportion of correct entries in each model’s prediction vector) is shown in Table 1. The U-model had the highest accuracy, while the I-model and the theoretical FSM model had the same accuracy, slightly lower than that of the U-model. The accuracy of the FSM model suggests that the heuristics encoded in the GH coding protocol are a good representation of the patterns that can be learned from the data we collected, given our choice of data annotations. The similarity of the CSF model’s accuracy to that of the FSM similarly suggests that the CSF did a good job of automatically learning these patterns from our data. The slightly higher accuracy of the U-model over the I-model suggests that the uniformly distributed prior probabilities may have been more helpful than the weakly informed prior distribution. Finally, the performance advantage of all of these models over the RB model provides a good baseline measurement of success.

model      accuracy (%)
U-model    82.03
I-model    81.25
FSM        81.25
RB         32.81
Table 1: Accuracy of each model

To validate these intuitive assessments, we formally compared our four models using six pairwise McNemar’s Tests (mcnemar1947note; bostanci2013evaluation), whose results are shown in Tables 2 and LABEL:table:mcnemars.

model A    model B    both succeeded    only A succeeded    only B succeeded    both failed
U-model    I-model    104               1                   0                   23
U-model    FSM        89                16                  15                  8
U-model    RB         34                71                  8                   15
I-model    FSM        89                15                  15                  9
I-model    RB         33                71                  9                   15
FSM        RB         34                70                  8                   16
Table 2: Contingency table entries for model pairs

Table 2 (see also Figure LABEL:fig_3) shows the contingency table values used by McNemar’s test for each pairwise comparison, where the four counts refer to the contingency table cells shown in Table 3. That table layout simply depicts a general 2x2 contingency table (liddell1976practical; clark1999performance) comparing the performance of two models A and B. Here, $n_{00}$ and $n_{11}$ respectively denote the number of instances where both models failed and succeeded. $n_{01}$ and $n_{10}$ respectively denote the instances where one model failed and the other succeeded ($n_{01}$: A failed and B succeeded; $n_{10}$: A succeeded and B failed).

                   model A success    model A fail
model B success    $n_{11}$           $n_{01}$
model B fail       $n_{10}$           $n_{00}$
Table 3: A 2x2 contingency table
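The counts in Table 2 can be checked against Table 1 and fed into McNemar's test with a short sketch. The column order assumed here (both succeeded, only the first model succeeded, only the second model succeeded, both failed) is inferred from the fact that it reproduces the accuracies in Table 1. The continuity-corrected chi-square form of the test is one common variant and may differ from the variant the authors used.

```python
import math

# Rows of Table 2: (both succeeded, only A succeeded, only B succeeded, both failed)
TABLE_2 = {
    ("U-model", "I-model"): (104, 1, 0, 23),
    ("U-model", "FSM"): (89, 16, 15, 8),
    ("U-model", "RB"): (34, 71, 8, 15),
    ("I-model", "FSM"): (89, 15, 15, 9),
    ("I-model", "RB"): (33, 71, 9, 15),
    ("FSM", "RB"): (34, 70, 8, 16),
}


def accuracies(counts):
    """Accuracy of model A and model B implied by one contingency row."""
    both, only_a, only_b, _neither = counts
    n = sum(counts)
    return (both + only_a) / n, (both + only_b) / n


def mcnemar(counts):
    """Continuity-corrected McNemar statistic and its p-value (1 d.o.f.)."""
    _both, only_a, only_b, _neither = counts
    discordant = only_a + only_b
    if discordant == 0:
        return 0.0, 1.0
    statistic = (abs(only_a - only_b) - 1) ** 2 / discordant
    # Survival function of a chi-square variable with one degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


u_acc, rb_acc = accuracies(TABLE_2[("U-model", "RB")])
u_vs_rb = mcnemar(TABLE_2[("U-model", "RB")])
u_vs_fsm = mcnemar(TABLE_2[("U-model", "FSM")])
```

Only the two discordant cells enter the statistic, which is why model pairs with nearly equal discordant counts (such as the U-model and the FSM) yield a statistic near zero even though they disagree on 31 of the 128 entries.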