In this paper, we present an application of Answer Set Programming (ASP) to the understanding of restaurant narratives. Dining at a restaurant is a stereotypical human activity, i.e., a sequence of actions normally performed in a certain order by one or more actors, according to cultural conventions. Automating a deep understanding of stories about stereotypical human activities is a more difficult task than that of understanding other types of narratives because a larger number of events that are part of the activity are not explicitly mentioned in the text, with the assumption that readers will be able to fill in the gaps based on their shared cultural knowledge. Consider the following example:
Example 1 (Normal scenario from [m07])
Nicole went to a vegetarian restaurant. She ordered lentil soup. The waitress set the soup in the middle of the table. Nicole enjoyed the soup. She left the restaurant.
The story in Example 1 does not mention that the waitress went to the kitchen to get the soup nor that Nicole paid for her meal, as these actions are implicitly assumed.
We chose to focus specifically on restaurants with table service because stories about this domain involve more actors, performing more actions, and interacting in more complex ways than in other types of restaurants (e.g., fast food restaurants). As a consequence, knowledge representation and reasoning techniques become relevant to understanding stories about the chosen stereotypical activity, while an automated learning of the sequence of events that form this activity would be very difficult, as indicated in Section 2. Our work is applicable to stories about other stereotypical activities.
Schank and Abelson sa77 proposed modeling stereotypical human activities as scripts, i.e., “standardized sequences of events” bf81, and Mueller conducted substantial research in this direction (e.g., m04), including work on the restaurant domain specifically m07. However, his system was not able to understand stories describing exceptional scenarios like the ones in Examples 2 and 3 below because of the rigid structure of scripts – actions in the script are assumed to always occur in the exact order specified in the script.
Example 2 (Serendipity)
Nicole went to a vegetarian restaurant. She ordered lentil soup. When the waitress brought her the soup, she told Nicole that it was on the house. Nicole enjoyed the soup and then left. (The reader should understand that Nicole did not pay for the soup.)
Example 3 (Diagnosis)
Nicole went to a vegetarian restaurant. She ordered lentil soup. The waitress brought her a miso soup instead. (The reader is supposed to produce some explanations for what may have gone wrong: either the waitress or the cook misunderstood the order.)
We have argued zi17; izbi18 that modeling actors in a restaurant scenario as goal-driven intentional agents is needed in order to be able to process exceptional restaurant scenarios in addition to normal ones. We have proposed to use theories of intentions written in ASP bg05i or easily translatable into ASP thesisblount13; bgb15 to model the characters in a restaurant episode as intentional agents, and concluded that our methodology has a wider coverage than script-based approaches.
In this paper, we investigate remaining research questions related to the proposed methodology and present a corpus of restaurant stories that we built in order to evaluate our methodology, and which we make publicly available.
Our first research question studies the impact in terms of coverage and performance of modeling all characters in a restaurant scenario as goal-driven intentional agents as defined by Blount et al. bgb15 versus viewing only the main character, the customer, as such, and modeling other characters using a simpler theory of intentions by Baral and Gelfond bg05i. The second research question investigates the optimal structure for the representation of the stereotypical activity of dining at a restaurant from the point of view of each character. We envision the restaurant corpus that we constructed, restaurant-1.0, as a resource to be used in future research on stereotypical activities, but also a useful benchmark for the NLP, natural language understanding, KRR, and ASP communities.
The contributions of this work are as follows:
We demonstrate that, by using ASP theories of intentions, our proposed methodology can reason about exceptional scenarios that cannot be processed using traditional, script-based approaches. Thus, we introduce and highlight an important application area for the ASP body of work on theories of intentions.
We indicate that modeling only the customer role as a goal-driven intentional agent presents advantages in terms of performance, while having only a moderate negative impact on coverage.
We provide guidelines for structuring the representation of stereotypical human activities based on lessons learned from serendipitous scenarios like the one in Example 2, which require a hierarchical structure where activities have sub-activities with sub-goals, and scenarios involving diagnosis like the one in Example 3, which require paying attention to the parameters of the activity.
We introduce and make available a corpus of restaurant stories accompanied by their ASP logic forms.
In what follows, we start by discussing related work and then describe the proposed methodology for reasoning about restaurant stories. Next, we explore the two research questions connected to our methodology. We then briefly present the application of our refined methodology on a few illustrative stories. We present the restaurant story corpus that we created and end with conclusions and future work.
2 Related Work
Restaurant Narratives. Erik Mueller’s work is based on the hypothesis that readers of a text understand it by constructing a mental model c43 of the narrative jl83; vdk83. Mueller’s system m07 showed a deep understanding of restaurant narratives by answering questions about time and space aspects that were not necessarily mentioned explicitly in the text. His system relied on two important pieces of background knowledge: a commonsense knowledge base about actions occurring in a restaurant, their effects and preconditions, encoded in Event Calculus s97; and a script describing a sequence of actions performed in a normal unfolding of a restaurant episode. The script was much more detailed than those used in other systems, for instance Ng and Mooney’s plan recognition software ACCEL nm92, and thus was able to demonstrate a more in-depth understanding.
The system processed English text using information extraction techniques in order to fill out slot values in a template. Table 1 shows the template constructed for the scenario in Example 1. Note that the slot SCRIPT: LAST EVENT is filled with the value Leave, which corresponds to the customer’s last action in the restaurant script.
Next, the template was translated into a reasoning problem that contained: facts about the entities identified in the template; facts about the consecutive occurrence of all actions in the script up to the one corresponding to the value of the SCRIPT: LAST EVENT slot; and default information about the layout of the restaurant and the locations of different objects and characters. Then, the reasoning problem was expanded with the information in the commonsense knowledge base to compute models of the input restaurant scenario. Finally, questions about time and space aspects were automatically generated and answers were obtained from the model resulting from the previous step. Mueller’s system was tested on 124 excerpts of texts retrieved from the web or the Project Gutenberg collection, and correctly answered 70% of the test questions.
In terms of limitations of the system, the author acknowledged the lack of flexibility of scripts, which resulted in scenarios with exceptional cases (or variations, such as an additional wine tasting step) not being processed correctly. For instance, for the scenario in Example 2, Mueller’s system would detect Leave as the SCRIPT: LAST EVENT and thus construct the same template as the one for Example 1. As a result, the reasoning problem built from the template would include the fact that Nicole paid for the soup, since this action precedes the customer’s action of leaving in the script, when in fact a human reader would infer that she did not pay because the soup was on the house. In the script-based approach, such a serendipitous scenario can only be solved by introducing a new script that does not contain the pay action. This means that the knowledge engineer would have to predict all possible exceptional scenarios and create a new script for each of them in advance.
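The rigidity described above can be illustrated with a minimal Python sketch. It is not Mueller's implementation: the script contents and the function name `infer_from_script` are assumptions made purely for illustration. The sketch only shows why a system that asserts every script action up to the detected last event must conclude that the customer paid.

```python
# Hypothetical, simplified customer script; action names are illustrative only.
RESTAURANT_SCRIPT = ["enter", "sit", "order", "eat", "pay", "leave"]


def infer_from_script(script, last_event):
    """Return every script action up to and including last_event."""
    return script[: script.index(last_event) + 1]


# Examples 1 and 2 both end with the customer leaving, so a script-based
# reader derives the same list of actions for both stories.
inferred = infer_from_script(RESTAURANT_SCRIPT, "leave")
print(inferred)
print("pay" in inferred)  # the inference that is wrong for Example 2
```

Because the only story-dependent input is the last event, the sketch cannot distinguish the normal story from the "on the house" story without a second, hand-written script.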
Narrative Corpora. Mueller’s two restaurant story corpora m07, one based on Internet stories and the other on Project Gutenberg texts, are proprietary and thus unavailable. Reconstructing these corpora is a laborious task. General story corpora exist, but they do not apply to the subject of this research. For instance, the InScript narrative corpus maop17 covers other stereotypical human activities (e.g. grocery shopping, taking the bus), but not the topic of dining in a restaurant with table service. The OMCS (Open Mind Common Sense) slmlpz02 and OMICS (Open Mind Indoor Common Sense) guptak04 corpora cited in earlier papers, though publicly available in the past, are no longer readily available and do not contain restaurant stories. The SMILE corpus rkp10 contains stories about eating at a fast food restaurant, but not at an elegant restaurant. Gordon et al. gcs07 processed and annotated an existing corpus of stories extracted from Internet web blogs. However, due to a complex agreement system for using the original web blog corpus, the data is not readily available to the public.
Automated Learning of Activities. In recent years, there has been an increased interest in automatically learning the sequence of events that forms a stereotypical activity cj08; msg08; rkp10. However, these approaches are only able to produce flat sequences of actions that are not associated with goals. Smith and Arnold sa09 are able to produce hierarchical plans, but these are not associated with goals either. As stated in the introduction, a hierarchical structure with sub-sequences and associated (sub-)goals is required for a system to be able to reason about exceptional scenarios like the ones in Examples 2 and 3. Additionally, the targeted stereotypical activities in this unsupervised learning body of work do not include dining at a restaurant with table service and generally have only one actor (e.g., make coffee).
Activity Recognition. The task of automating the understanding of restaurant narratives is somewhat connected to activity recognition, in that it requires observing agents and their environment in order to complete the picture about the agents’ actions and activities. However, unlike activity recognition, understanding restaurant narratives does not require identifying an agent’s goal, which is always the same in our case (e.g., a customer entering a restaurant always seeks to become satiated). Gabaldon Gabaldon09 performed activity recognition using Baral and Gelfond’s theory of intentions bg05i that did not consider goal-driven agents.
Preliminary Work. In an earlier version of this paper, Zhang and Inclezan zi17 presented an initial solution to the problem of reasoning about restaurant stories by using both of the existing theories of intentions. Inclezan et al. izbi17; izbi18 introduced the alternative approach of using the newer theory of intentions for modeling all characters in a restaurant story. Neither of these papers compares the two approaches or includes the work on the restaurant story corpus.
3 Reasoning about Restaurant Stories in ASP
As mentioned in the introduction, the script-based approach is not suitable for reasoning about exceptional scenarios because of the rigidity of scripts. In previous work zi17; izbi17; izbi18, we proposed a new approach, capable of handling normal and exceptional scenarios, based on the idea of viewing the main character in a restaurant scenario (and possibly others) as a goal-driven intentional agent. We used theories of intentions to reason about the actions of such agents, coupled with a background knowledge base about actions and properties (fluents) relevant to the restaurant domain, as well as an encoding of the stereotypical activity itself, from the point of view of each character. In this section, we briefly introduce the two existing theories of intentions, written in ASP or languages closely related to ASP. We then outline our methodology and stress specifically the research questions that resulted from our preliminary work.
3.1 Theory of Intended Actions by Baral and Gelfond
Baral and Gelfond bg05i captured properties of intended actions in an ASP theory we denote by that had two main tenets: “Normally intended actions are executed the moment such execution becomes possible” (non-procrastination) and “Unfulfilled intentions persist” (persistence). Sequences of actions were modeled using predicates: (s is a sequence); (n is the length of sequence s); and (the element of sequence s is x, where x can be either an action or another sequence). An agent’s intentions at different time points were captured by (action/sequence x is intended at time step i). The theory was successfully used in activity recognition Gabaldon09 and question answering about biological processes ig11, but was not sufficient for modeling goal-driven agents. We use the term simple intentional agent to refer to an agent that can be modeled by .
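The two tenets can be made concrete with a small procedural simulation. This is only a sketch in Python and not the ASP theory itself: the function `run_intended_sequence`, the `executable` callback standing in for the domain's executability conditions, and the waiter actions are all hypothetical, and nested sequences are omitted for brevity.

```python
def run_intended_sequence(sequence, executable, horizon):
    """Simulate a simple intentional agent over a bounded time line.

    sequence   : list of action names intended in this order
    executable : function (action, step) -> bool, an assumed stand-in for
                 the executability conditions of the domain
    horizon    : number of time steps to simulate
    Returns a dict mapping each time step to the action executed at it.
    """
    occurrences = {}
    position = 0  # index of the next intended element of the sequence
    for step in range(horizon):
        if position == len(sequence):
            break  # every intention has been fulfilled
        action = sequence[position]
        if executable(action, step):
            # Non-procrastination: execute as soon as execution is possible.
            occurrences[step] = action
            position += 1
        # Persistence: otherwise the unfulfilled intention carries over
        # unchanged to the next time step.
    return occurrences


def executable(action, step):
    """Hypothetical condition: the order cannot be picked up before step 2."""
    return not (action == "pick_up_order" and step < 2)


print(run_intended_sequence(["greet", "pick_up_order", "serve"], executable, 6))
```

In this run the intention to pick up the order persists through step 1 and is executed at step 2, the first moment it becomes possible.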
3.2 Theory of Goal-Driven Intentional Agents by Blount et al.
Blount and collaborators thesisblount13; bgb15 improved on the previous theory of intentions by considering goal-driven agents inspired by the Belief-Desire-Intention (BDI) model b87. For this purpose, each sequence of actions of an agent was associated with a goal that it was meant to achieve – the combination of the two was called an activity. Activities could have nested sub-activities, and were encoded using the predicates: (m is an activity); (the goal of activity m is g); (the length of activity m is n); and (the component of activity m is x, where x is either an action or a sub-activity).
The authors introduced the concept of a goal-driven intentional agent — one that has goals that it intends to pursue, “only attempts to perform those actions that are intended and does so without delay.” To represent the intentions and decisions of an intentional agent, Blount et al. introduced mental fluents and actions. Two important mental fluents are (m is in progress if , and not yet started or stopped if ) and (the next action to be executed as part of activity m is a). Mental actions included and for goals, and and for activities. The new theory of intentions was encoded in action language bg00. We denote by its ASP translation.
Additionally, Blount et al. developed an agent architecture (implemented in CR-Prolog bg03a; b07, an extension of ASP) that adapts the agent loop bg08 to specify the behavior of a goal-driven intentional agent. For instance, while fluent in the theory of intentions indicates the action in activity m that the agent would normally need to execute next, the agent architecture handles exceptions to this rule. The decision not to execute the next action is made if the activity’s goal was already achieved by some other action (Example 2) or was abandoned; or if the current activity needs to be stopped altogether because it no longer has chances of achieving its goal (Example 3).
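The behavior of a goal-driven agent that skips actions whose goal is already achieved can be sketched procedurally. The Python fragment below is an illustration under stated assumptions, not the theory or architecture of Blount et al.: the class `Activity`, the function `trace`, and all goal, fluent, and action names are hypothetical, and the serendipitous event is simplified by making the payment goal true from the start.

```python
from dataclasses import dataclass, field


@dataclass
class Activity:
    name: str
    goal: str
    # Components are action names (str) or nested Activity objects.
    components: list = field(default_factory=list)


def trace(activity, state, effects):
    """Return the mental and physical actions inferred for one activity.

    state   : set of fluents currently true (updated as actions occur)
    effects : dict mapping a physical action to the fluent it makes true
    """
    steps = [f"start({activity.name})"]
    for component in activity.components:
        if activity.goal in state:
            break  # goal already achieved, so remaining components are skipped
        if isinstance(component, Activity):
            steps += trace(component, state, effects)
        else:
            steps.append(component)
            if component in effects:
                state.add(effects[component])
    steps.append(f"stop({activity.name})")
    return steps


# Hypothetical customer activity with one payment sub-activity.
pay = Activity("pay", "bill_settled", ["request_bill", "pay_bill"])
dine = Activity("dine", "satiated_and_out", ["enter", "order", "eat", pay, "leave"])
effects = {"pay_bill": "bill_settled", "leave": "satiated_and_out"}

normal = trace(dine, set(), effects)
on_the_house = trace(dine, {"bill_settled"}, effects)  # simplified Example 2
print(normal)
print(on_the_house)
```

In the second run the payment sub-activity is started and immediately stopped, so neither `request_bill` nor `pay_bill` appears, which mirrors the reading of Example 2 in which Nicole does not pay.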
3.3 Proposed Methodology
Our methodology describes how to construct an ASP logic program for each input restaurant narrative based on the information given in the text, a background commonsense knowledge base, theories of intentions, and an adapted and extended version of the architecture. Answer sets of the resulting program correspond to a cautious reader’s possible mental models of the narrative, which can be used to demonstrate a deep understanding of the story via question answering. By “deep understanding” we mean awareness of the intentions of characters and of the occurrence of actions that were not explicitly stated in the text but would be assumed by a human reader.
Our goal is to focus on the reasoning component. We thus ignore the natural language processing part, which is a difficult task on its own. We distinguish between the story time line containing strictly the events mentioned in the text and the reasoning time line corresponding to the mental model that the reader constructs. We assume that a wide coverage commonsense knowledge base () written in ASP is available to us and that it contains information about a large number of actions, their effects and preconditions, including actions in the stereotypical activity. How to actually build such a knowledge base is a difficult research question, but it is orthogonal to our goal. In practice, in order to be able to evaluate our methodology, we have built a basic knowledge base with core information about restaurants in the spirit of Mueller’s work m07.
According to our methodology, for each input text we construct a logic program consisting of an input-dependent part and a pre-defined part common to all texts.
The input-dependent part of (i.e., the logic form obtained by translating the English text into ASP facts) consists of facts defining objects mentioned in the text as instances of relevant sorts in the and observations about the values of fluents and the occurrences of actions at different points on the story time line. To record observations about fluents and actions, we use predicates and respectively. By we mean that fluent f from the has value v at time step ss on the story time line (where v may be true or false), and indicates that action a from the was observed to have occurred if v is true, or not if v is false, at time step ss on the story time line. Let us illustrate the logic form obtained for a sample scenario.
Example 4 (Input-dependent part (i.e., logic form) for story in Example 1)
The text in Example 1 is translated into a logic form that includes the following facts:
where , , , , and are actions described in
The pre-defined part of consists of:
The background commonsense knowledge base , which contains information about actions and fluents relevant to the restaurant domain, including axioms about the direct, indirect effects and preconditions of actions. These are encoded in ASP using a standard methodology gk14 in which predicates and denote the beliefs that fluent f holds at time step i and action a occurs at i respectively. For example, the two rules below encode one direct effect and one executability condition for action – person p puts thing t on location l:
A theory (or theories) of intentions.
A module encoding the stereotypical activity from the perspective of each actor.
A reasoning module, encoding (i) a mapping of time points on the story time line into points on the reasoning time line; and (ii) reasoning components that reflect a reader’s reasoning process and are expected to allow reasoning about serendipitous achievement of goals, decisions to stop futile activities, and diagnosis.
(i) To encode the mapping of story time steps to reasoning time steps we introduce the predicates and , respectively, as well as the predicate to say that story step s is mapped into reasoning time step i:
Observations about the occurrence of actions and values of fluents, recorded from the text using predicates and , are translated into observations on the reasoning time line, for which we use the predicates and , via rules of the type:
Finally, “gaps” on the reasoning time line are prevented by the rules
where is true if i is the last time step on the reasoning time line that has a correspondent on the story time line.
(ii) Reasoning components are adapted from and expanded to reflect the reasoning process of an outside observer (the cautious reader) instead of that of an agent thinking about its next action. For instance, rules indicating how an agent should select a new activity to satisfy an active goal if the current activity is deemed futile are replaced by a single rule indicating that the mental action of replanning has occurred:
The cautious reader is not expected to guess what new activity the agent decided to start, unless this is explicitly specified in the text.
Default information about the values of fluents in the initial situation (e.g., the restaurant is normally open, dishes listed on the menu are normally available, etc.)
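The mapping in item (i) of the reasoning module, from story time steps to reasoning time steps, defines a search space that can be sketched outside ASP. The Python fragment below is a simplified illustration and not the authors' encoding: it assumes only that the mapping is order-preserving and that distinct story steps receive distinct reasoning steps, and it leaves out the gap-prevention rules. The function name `story_to_reasoning_mappings` is hypothetical.

```python
from itertools import combinations


def story_to_reasoning_mappings(num_story_steps, num_reasoning_steps):
    """Yield every order-preserving assignment of story steps to reasoning steps.

    Each mapping is a tuple m in which m[s] is the reasoning step assigned to
    story step s. Because combinations() emits sorted tuples, later story
    steps always receive later reasoning steps.
    """
    yield from combinations(range(num_reasoning_steps), num_story_steps)


# Three story steps placed on a reasoning time line with five steps.
mappings = list(story_to_reasoning_mappings(3, 5))
print(len(mappings))
print(mappings[0], mappings[-1])
```

The unassigned reasoning steps are where actions that the text leaves implicit, including mental actions, can be placed. The number of candidate mappings grows combinatorially with the length of the reasoning time line, which is one way to see why longer time lines are costly.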
The proposed methodology was tested with good results in previous work: in one instance, only the customer role was modeled as a goal-driven agent using while other actors were modeled as simple intentional agents using zi17; in the other case, all characters were modeled as goal-driven agents using the izbi18. However, a couple of important research questions still remain about components 2 and 3 of the pre-defined part:
If we were interested in answering questions about the goals and intentions of the customer only, what are the trade-offs in terms of coverage and performance between viewing only the customer as a goal-driven intentional agent (i.e., using the for the customer only and the for all other characters – case denoted by +) versus viewing all characters as goal-driven agents (i.e., using the for all of the characters – case denoted by -only)?
How should we structure the representation of the stereotypical activity, from the point of view of each actor, in order to maximize coverage and performance?
We define coverage as the number of different types of scenarios that can be processed correctly. Scenario types include: stories with only one customer versus multiple customers, plus the different scenario types listed in Blount et al.’s bgb15 work on theory of intentions (e.g., normal, serendipitous, diagnosis scenarios).
In the next section we present our insights into these two research questions.
4 Research Questions: Insights
4.1 Insight #1: Two Theories of Intentions
We start by focusing on research question and analyze the two cases listed above: + and -only. The case of viewing all characters as simple intentional agents and thus using only is not an option. This approach is too limiting and does not allow reasoning correctly about scenarios like the one in Example 2 for reasons similar to those related to the script-based approach.
The advantage of viewing a character as a goal-driven intentional agent (and using the instead of the to model the character’s activity) is that it allows reasoning about the serendipitous achievement of the character’s goals and sub-goals. This means that using the instead of for secondary characters (e.g., the waiter) leads to scenarios like the one in Example 5 not being processed correctly:
Example 5 (Serendipity for Waiter)
Nicole went to a vegetarian restaurant. She ordered a lentil soup. Nicole was in a hurry, so as soon as the waitress laid the dish on the table, Nicole paid for it in cash and said that she didn’t need the bill. (The reader is expected to understand that the waitress did not bring the bill to Nicole.)
For the story in Example 5, an answer set would be produced by the + approach, but it would inaccurately state that the waitress did bring the bill to Nicole. This is a drawback for the + case.
On the other hand, the + case allows reasoning about scenarios involving multiple customers, each ordering a different dish as in Example 6:
Example 6 (Multiple Customers)
Nicole and Sam went to a vegetarian restaurant. She ordered a lentil soup. He ordered a miso soup. They both enjoyed their soups.
In our formalization of the domain, the waiter either maintains an sequence of actions for each customer (case +) or, alternatively, it maintains one activity per customer, with the associated goal of serving and billing the customer (case -only). However, a waiter cannot maintain multiple activities at a time, corresponding to multiple customers, because of a current limitation in Blount et al.’s theory of intentions indicating that an agent can only have one top-level active goal at a time. As a result, applying the -only approach to such scenarios would result in no answer sets. Substantial work on goal selection and prioritization is needed in order to lift this restriction. With the + solution, the secondary role of waiter is modeled as a simple intentional agent who does not maintain goals, but rather follows a sequence of intended actions. As a result, the waiter may entertain multiple sequences of actions at a time.
The , which is required to reason about goal-driven intentional agents, is much more complex than the . A comparison in terms of different measures can be seen in Table 2. This has a substantial impact on the performance of a system implemented according to our methodology, especially on input stories that involve diagnosis, which is a combinatorial search for an explanation.
Consider for instance the last metric in Table 2. If activities are represented using a hierarchical structure with sub-activities that have associated sub-goals (which is desired, as we will show in the next subsection), then each sub-activity adds two additional time steps on the reasoning time line: one for the mental action of starting the sub-activity and another one for stopping the sub-activity. This happens even when the sub-activity’s goal is serendipitously satisfied by some other agent’s actions and none of the physical actions in the sub-activity are performed by the agent (see the output for Example 2 in the evaluation section). Moreover, no physical actions of the same agent can occur while a mental action is happening, and some restrictions about physical actions of other agents also exist. A larger number of steps on the reasoning time line has an impact on diagnosis problems especially, as shown in Table 3. The reported times are the averages of ten runs on a machine with an Intel(R) Core(TM) i5-4300U CPU 1.9GHz and 4GB RAM using the clingo 4.5.4 solver (https://sourceforge.net/projects/potassco/files/clingo/).
Answer for . Based on this analysis, we conclude that + has an improved performance over -only and has the potential for a wider coverage as it can handle a larger number of what seem to be recurrent scenarios.
4.2 Insight #2: Hierarchical Activity Representation
To answer research question about the most suitable structure for the representation of each actor’s actions as part of the stereotypical activity, we started from the flat and fixed scripts described in the existing body of literature on narratives about restaurants with table service (e.g., sa77; m07), which we then refined. We considered two main factors that impact decisions about activity structure: (1) in order to be able to reason about serendipitous scenarios, activities must have a hierarchical structure with sub-activities having their own sub-goals; and (2) in order to be able to process scenarios that require diagnosis (e.g., wrong dish / bill) additional parameters may be needed (e.g., one parameter indicating the ordered dish and another one for the actual, possibly wrong, dish brought by the waiter). In what follows, we describe our conclusions related to such decisions, and especially their impact on coverage and performance. We adopt the conclusion from Section 4.1 and assume that the customer’s actions are represented as an activity of , while the waiter and cook intend to execute sequences of actions of .
4.2.1 Activity Structure and Serendipitous Scenarios
In our methodology, reasoning about serendipitous scenarios is possible whenever the customer’s actions whose purpose is satisfied by someone else’s actions are grouped into a sub-activity associated with a goal. For instance, Example 2 can be processed correctly if and only if the customer’s activity contains a sub-activity consisting of the payment-related actions (request bill and pay bill) and associated with a goal that can be satisfied by another character’s actions. When this is the case, the rules in indicate that the customer performs the mental action of starting the payment sub-activity, realizes that the goal is already met, and then performs the mental action of stopping the sub-activity, without performing any of the physical actions in it. To increase the coverage of different serendipitous scenarios, we must make sure that we create a hierarchical structure for the customer’s activity in which all goals that may be satisfied by other actors’ actions are represented and associated with a corresponding sub-goal, thus rendering sub-activities optional. This is the criterion that we employ to divide the customer’s activity into sub-activities, of course in addition to grouping together the actions that are intuitively part of the same sub-plan (e.g., picking up the menu and putting it back on the table are part of the sub-plan of deciding what to order).
One possibility that would guarantee maximum coverage is to package each action into a sub-activity with a sub-goal as in Activity Theory bb03, and then build other sub-activities from there. However, this would be detrimental in terms of performance. Each new sub-activity that is introduced adds two mental actions (start and stop) that need to be executed by the actor, which translates into two additional time points on the reasoning time line given that no other actions, physical or mental, can be executed by the actor at the same time. As a result, this approach would roughly triple the length of the reasoning time line, as compared to a flat activity, which will negatively impact the code that maps story time line steps onto the reasoning time line, as well as scenarios with diagnosis, as shown in the previous section. As an example, consider the structures shown in Table 4.2.1, where only introduces one sub-activity compared to . The average time over ten runs for processing a normal scenario using the + approach is 0.57s for , 0.70s for (22% increase), and 1.07s for (87% increase).
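The effect of sub-activities on the length of the reasoning time line can be checked with a short calculation. The Python sketch below is an assumption-laden illustration: it represents an activity as a nested list, charges one time step per physical action and two per activity (start and stop, including the top-level activity), and uses a hypothetical seven-action customer activity rather than the structures evaluated in the paper.

```python
def reasoning_steps(activity):
    """Count the time steps needed to run an activity with no skipped actions.

    An activity is a list whose elements are action names (str) or nested
    lists (sub-activities). Every activity costs two mental actions, one
    start and one stop, in addition to its components.
    """
    total = 2
    for component in activity:
        if isinstance(component, list):
            total += reasoning_steps(component)
        else:
            total += 1
    return total


# Hypothetical customer actions, used only to compare structures.
actions = ["enter", "sit", "order", "eat", "request_bill", "pay", "leave"]
flat = actions
grouped = ["enter", "sit", "order", "eat", ["request_bill", "pay"], "leave"]
one_per_action = [[a] for a in actions]

for label, structure in [("flat", flat), ("grouped", grouped),
                         ("one sub-activity per action", one_per_action)]:
    print(label, reasoning_steps(structure))
```

Under these assumptions, wrapping each of n actions in its own sub-activity costs 3n + 2 steps against n + 2 for the flat activity, so the ratio approaches three as n grows, while a single payment sub-activity adds only two steps.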
There is an obvious trade-off between coverage and performance that impacts the activity structure we choose. We decided where to draw the line based on the exceptional scenarios in our restaurant corpus that were not hand-crafted. We identified as optimal the activity structure shown in Table 4.2.1, which includes sub-activities for the customer getting ready to eat (), customer deciding what to order () and customer paying the bill (). Note that is optional in a scenario where the wrong dish is brought by the waiter and the customer decides to eat it – at this point the customer drops his initial intention of eating the original dish and starts a new activity of eating the wrong dish, but all the actions up to eating (e.g., sit) become irrelevant and should be made optional.