In an attempt to enable this, we propose a new commonsense reasoning benchmark where the task is to infer commonsense presumptions in commands of the form “If state holds Then perform action Because I want to achieve goal.” The reason for including the “because” clause in the commands is that some presumptions are ambiguous without knowing the user’s purpose, or goal. For instance, if Alice’s goal in the above example was to see snow for the first time, Bob would have presumed that even a snow flurry would be excuse enough to wake her up. Since humans frequently omit details when stating such commands, a computer possessing common sense should be able to infer the hidden presumptions, that is, the additional unstated conditions on the If and/or Then portion of the command (see Tab. 1 for examples).
We propose an approach that infers such missing presumptions, by extracting a chain of reasoning that shows how the commanded action will achieve the desired goal when the state holds. Whenever any additional reasoning steps appear in this reasoning chain, they are output by our system as assumed implicit presumptions associated with the command. For our reasoning method we propose a neuro-symbolic interactive, conversational approach, in which the computer combines its own common sense knowledge with conversationally evoked knowledge provided by a human user. The reasoning chain is extracted using our neuro-symbolic theorem prover that learns sub-symbolic representations (embeddings) for logical statements, making it robust to variations of natural language encountered in a conversational interaction setting.
|(5, at night)||12|
|(5, in the Winter)||6|
We have three main contributions. 1) We propose a benchmark task for commonsense reasoning and release a data set containing if-then-because commands, annotated with commonsense presumptions. 2) We present a system called CORGI (COmmonsense ReasoninG by Instruction) that performs soft logical inference. We propose a neuro-symbolic theorem prover and apply it to extract a multi-hop reasoning chain that reveals commonsense presumptions. 3) We equip CORGI with a conversational interaction mechanism that enables it to collect just-in-time commonsense knowledge from humans. Our user study shows (a) the plausibility of relying on humans to evoke commonsense knowledge and (b) the effectiveness of our theorem prover, enabling us to extract reasoning chains for up to 45% of the studied tasks. Our code and data are publicly available at https://github.com/ForoughA/CORGI.
The literature on commonsense reasoning dates back to the very beginning of the field of AI winograd1972understanding ; mueller2014commonsense ; davis2015commonsense and is studied in several contexts. One line of work focuses on building a large knowledge base (KB) of commonsense facts. Projects like CYC lenat1990cyc , ConceptNet liu2004conceptnet ; havasi2007conceptnet and ATOMIC sap2018atomic ; rashkin2018event2mind are examples of such KBs (see davis2015commonsense for a comprehensive list). Recently, bosselut2019comet proposed COMET, a generative model trained on ConceptNet and ATOMIC that generates commonsense facts. These KBs provide background knowledge for tasks requiring common sense. However, knowledge bases are known to be incomplete, and most have ambiguities and inconsistencies davis2015commonsense that must be clarified for particular reasoning tasks. Therefore, we argue that reasoning engines can benefit greatly from a conversational interaction strategy to ask humans about their missing or inconsistent knowledge. Closest in nature to this proposal is the work by Hixon et al. hixon2015learning on relation extraction through conversation for question answering. The advent of intelligent agents and advancements in natural language processing have given learning from conversational interactions good momentum in the last few years azaria2016instructable ; labutov2018lia ; srivastava2018teaching ; goldwasser2014learning ; christmann2019look ; guo2018dialog ; li2018appinite ; li2017programming ; li2017sugilite .
A current challenge in commonsense reasoning is the lack of benchmarks davis2015commonsense . Benchmark tasks in commonsense reasoning include the Winograd Schema Challenge (WSC) levesque2012winograd , its variations kocijan2020review , and its recently scaled up counterpart, Winogrande sakaguchi2019winogrande ; ROCStories mostafazadeh2017lsdsem , COPA roemmele2011choice , and ART bhagavatula2019abductive , where the task is to choose a plausible outcome, cause or explanation for an input scenario. Most of these benchmarks have a multiple choice design format. However, in the real world the computer is usually not given multiple choice questions. None of these benchmarks targets the extraction of unspoken details in a natural language statement, a task known to be challenging for computers since the 1970s grice1975logic .
CORGI has a neuro-symbolic logic theorem prover. Neuro-symbolic systems are hybrid models that leverage the robustness of connectionist methods and the soundness of symbolic reasoning to effectively integrate learning and reasoning garcez2015neural ; besold2017neural . They have shown promise in different areas of logical reasoning ranging from classical logic to propositional logic, probabilistic logic, abductive logic, and inductive logic mao2019neuro ; manhaeve2018deepproblog ; dong2019neural ; marra2019integrating ; zhou2019abductive ; evans2018learning . To the best of our knowledge, neuro-symbolic solutions for commonsense reasoning have not been proposed before. Examples of commonsense reasoning engines are AnalogySpace speer2008analogyspace ; havasi2009digital , which uses dimensionality reduction, and mueller2014commonsense , which uses the event calculus formal language. TensorLog cohen2016tensorlog converts a first-order logical database into a factor graph and proposes a differentiable strategy for belief propagation over the graph. DeepProbLog manhaeve2018deepproblog developed a probabilistic logic programming language that is suitable for applications containing categorical variables. Unlike our approach, neither of these methods learns embeddings for logical rules, which are needed to make CORGI robust to natural language variations. Therefore, we propose an end-to-end differentiable solution that uses a Prolog colmerauer1990introduction proof trace to learn rule embeddings from data. Our proposal is closest to the Neural Programmer-Interpreter reed2015neural , which uses the trace of algorithms such as addition and sort to learn their execution. The use of Prolog for performing multi-hop logical reasoning has been studied in rocktaschel2017end ; weber2019nlprolog . These methods perform Inductive Logic Programming to learn rules from data, and are not applicable to our problem. DeepLogic cingillioglu2018deeplogic and rocktaschel2014low ; wang2016blearning also learn representations for logical rules using neural networks. Very recently, transformers were used for temporal logic finkbeiner2020teaching and to do multi-hop reasoning clark2020transformers using logical facts and rules stated in natural language. A purely connectionist approach to reasoning suffers from some limitations. For example, the input token size limit of transformers restricts clark2020transformers to small knowledge bases. Moreover, generalizing to an arbitrary number of variables or an arbitrary inference depth is not trivial for them. Since symbolic reasoning can inherently handle all these challenges, a hybrid approach to reasoning takes the burden of handling them off the neural component.
2 Proposed Commonsense Reasoning Benchmark
The benchmark task that we propose in this work is that of uncovering hidden commonsense presumptions given an input if-then-because command. Formally, the commands follow the general format “if state holds then perform action because I want to achieve goal”. We refer to the if-clause as the state, the then-clause as the action, and the because-clause as the goal. These natural language commands were collected from a pool of human subjects (more details in the Appendix). The data is annotated with unspoken commonsense presumptions by a team of annotators. Tab. 1 shows the statistics of the data and annotated examples from the data. We collected two sets of if-then-because commands. The first set contains 83 commands targeted at a state that can be observed by a computer or mobile phone (that is, by checking emails, calendar, maps, alarms, and weather). The second set contains 77 commands whose state is about day-to-day events and activities. 81% of the commands over both sets qualify as “if state then action because goal”. The remaining 19% differ in the categorization of the because-clause (see Tab. 1); common alternate clause types included anti-goals (“…because I don’t want to be late”), modifications of the state or action (“… because it will be difficult to find an Uber”), or conjunctions including at least one non-goal type. We did not instruct the subjects to give us data from these categories; rather, we uncovered them after data collection. For comparison, commonsense benchmarks such as the Winograd Schema Challenge levesque2012winograd included a similar number of examples (100) when first introduced kocijan2020review .
Lastly, after collecting the data we discovered that the if-then-because commands given by humans can be categorized into several different logic templates. The discovered logic templates are given in Table 5 in the Appendix. Our neuro-symbolic theorem prover uses a general reasoning strategy that can address all reasoning templates. However, in an extended discussion in the Appendix, we explain how a reasoning system, including ours, could potentially benefit from these logic templates.
Background and notation
The system’s commonsense knowledge is a KB programmed in a Prolog-like syntax. We have developed a modified version of Prolog, which has been augmented to support several special features (types, soft-matched predicates and atoms, etc.). Prolog colmerauer1990introduction is a declarative logic programming language that consists of a set of predicates whose arguments are atoms, variables or predicates. A predicate is defined by a set of rules (“Head :- Body.”) and facts (“Head.”), where Head is a predicate, Body is a conjunction of predicates, and “:-” is logical implication. We represent the state, action and goal in logical form, each as a predicate name applied to a list of arguments. For example, for the goal “I want to get to work on time”, we have get(i, work, on_time). Prolog can be used to logically “prove” a query (e.g., to prove the logical form of the goal) using the backward chaining algorithm (see the Appendix - Prolog Background).
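As a rough illustration of the backward chaining mentioned above, the following is a minimal Python sketch of a depth-first, left-to-right prover with exact unification over a toy knowledge base. The three rules in `KB`, the tuple encoding of terms, and all function names are assumptions made for this example; they are not CORGI's knowledge base or its modified Prolog.

```python
from itertools import count

# A term is a tuple: (predicate_name, arg1, arg2, ...).
# Strings starting with an uppercase letter are variables; others are atoms.

def is_var(x):
    return isinstance(x, str) and x[:1].isupper()

def walk(x, subst):
    """Follow variable bindings until an atom or an unbound variable is reached."""
    while is_var(x) and x in subst:
        x = subst[x]
    return x

def unify(a, b, subst):
    """Exact unification of two terms; returns an extended substitution or None."""
    if a[0] != b[0] or len(a) != len(b):
        return None
    subst = dict(subst)
    for x, y in zip(a[1:], b[1:]):
        x, y = walk(x, subst), walk(y, subst)
        if x == y:
            continue
        if is_var(x):
            subst[x] = y
        elif is_var(y):
            subst[y] = x
        else:
            return None
    return subst

_fresh = count()

def rename(rule):
    """Give a rule fresh variable names so separate uses do not clash."""
    n = next(_fresh)
    def fix(term):
        return (term[0],) + tuple(f"{a}_{n}" if is_var(a) else a for a in term[1:])
    head, body = rule
    return fix(head), [fix(t) for t in body]

def prove(goals, kb, subst=None):
    """Yield every substitution that proves all goals (depth first, left to right)."""
    subst = {} if subst is None else subst
    if not goals:
        yield subst
        return
    first, rest = goals[0], goals[1:]
    for rule in kb:
        head, body = rename(rule)
        extended = unify(first, head, subst)
        if extended is not None:
            yield from prove(body + rest, kb, extended)

# Hypothetical toy KB: (Head, Body) pairs; a fact is a rule with an empty body.
KB = [
    (("get", "Person", "work", "on_time"), [("leave", "Person", "early")]),
    (("leave", "Person", "early"), [("wake", "Person", "early")]),
    (("wake", "i", "early"), []),
]

solutions = list(prove([("get", "Who", "work", "on_time")], KB))
answer = walk("Who", solutions[0])
```

Here `answer` is the atom bound to `Who` by the first proof found. Note that `unify` requires predicate names to match exactly, which is the brittleness that motivates soft unification later in the paper.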
3.1 CORGI: COmmonsense Reasoning by Instruction
CORGI takes as input a natural language command of the form “if state then action because goal” and infers commonsense presumptions by extracting a chain of commonsense knowledge that explains how the commanded action achieves the goal when the state holds. For example, at a high level, for the command in Fig. 2 CORGI outputs: (1) if it snows more than two inches, then there will be traffic; (2) if there is traffic, then my commute time to work increases; (3) if my commute time to work increases, then I need to leave the house earlier to ensure I get to work on time; (4) if I wake up earlier, then I will leave the house earlier. Formally, this reasoning chain is a proof tree (proof trace), shown in Fig. 2. As shown, the proof tree includes the commonsense presumptions.
CORGI’s architecture is depicted in Figure 1. In the first step, the if-then-because command goes through a parser that extracts the state, action and goal from it and converts them to their logical form representations. For example, the action “wake me up early” is converted to wake(me, early). The parser is presented in the Appendix (Sec. Parsing).
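To make the first parsing step concrete, here is a small Python sketch that only splits a command into its state, action and goal clauses with a regular expression. It is an assumption for illustration and not the parser described in the Appendix; in particular, it does not produce logical forms such as wake(me, early). The names `PATTERN` and `split_command` are hypothetical.

```python
import re

# Hypothetical clause splitter: "if <state> then <action> because <goal>".
PATTERN = re.compile(
    r"^\s*if\s+(?P<state>.+?)\s+then\s+(?P<action>.+?)\s+because\s+(?P<goal>.+?)\s*\.?\s*$",
    re.IGNORECASE,
)

def split_command(command):
    """Return the state, action and goal clauses of an if-then-because command."""
    match = PATTERN.match(command)
    if match is None:
        raise ValueError("not an if-then-because command: " + repr(command))
    return {name: match.group(name) for name in ("state", "action", "goal")}

parts = split_command(
    "If it's going to rain in the afternoon then remind me to bring "
    "an umbrella because I want to remain dry."
)
```

A splitter like this breaks on commands whose state itself contains the word "then", so a real parser needs more than pattern matching.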
The proof trace is obtained by finding a proof for the goal, using the KB and the context of the input if-then-because command. One challenge is that even the largest knowledge bases gathered to date are incomplete, making it virtually infeasible to prove an arbitrary input goal. Therefore, CORGI is equipped with a conversational interaction strategy which enables it to prove a query by combining its own commonsense knowledge with conversationally evoked knowledge provided by a human user in response to a question from CORGI (user feedback loop in Fig. 1). There are 4 possible scenarios that could occur when designing such a conversational knowledge extraction strategy.
1. The user understands the question, but does not know the answer.
2. The user misunderstands the question and responds with an undesired answer.
3. The user understands the question and provides a correct answer, but the system fails to understand the user due to:
   (a) limitations of natural language understanding.
   (b) variations in natural language, which result in misalignment of the data schema in the knowledge base and the data schema in the user’s mind.
4. The user understands the question and provides the correct answer, and the system successfully parses and understands it.
CORGI’s components are designed to address the above challenges, as explained below. Since our benchmark data set deals with day-to-day activities, the first scenario (the user does not know the answer) is unlikely to occur. If the task required more specific domain knowledge, this scenario could be addressed by choosing a pool of domain experts. The second scenario (the user misunderstands the question) is addressed by asking users informative questions. The third scenario, when caused by the limitations of natural language understanding, is addressed by extracting small chunks of knowledge from the users piece by piece. Specifically, the choice of what to ask the user in the user feedback loop is deterministically computed from the user’s goal. The first step is to ask how to achieve the user’s stated goal, and CORGI expects an answer that gives a sub-goal. In the next step, CORGI asks how to achieve the sub-goal the user just mentioned. The reason for this piece-by-piece knowledge extraction is to ensure that the language understanding component can correctly parse the user’s response. CORGI then adds the knowledge extracted from the user to the KB in the knowledge update loop shown in Fig. 1. Missing knowledge outside this goal/sub-goal path is not handled, although it is an interesting future direction. Moreover, the model is user specific and the knowledge extracted from different users is not shared among them. Sharing knowledge raises interesting privacy issues and requires handling personalized conflicts, and falls outside the scope of our current study.
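The goal/sub-goal questioning described above can be sketched as a short loop. The Python below is a minimal illustration under stated assumptions: `ask` stands in for the dialog with the user, `is_known` stands in for an attempt to prove the sub-goal from the KB and the command's context, and rules are kept as plain strings rather than logical forms. None of these names come from CORGI's code.

```python
def normalize(answer):
    """Strip a leading 'if' and trailing punctuation from a user's reply."""
    text = answer.strip().rstrip(".!").strip()
    if text.lower().startswith("if "):
        text = text[3:]
    return text

def elicit_chain(goal, is_known, ask, max_turns=5):
    """Ask how to achieve the goal, then each new sub-goal, until one is known.

    Returns the learned (head, body) rules, or None if no known sub-goal
    is reached within max_turns questions.
    """
    learned = []
    current = goal
    for _ in range(max_turns):
        reply = normalize(ask(f'How do I know if "{current}"?'))
        learned.append((current, reply))  # rule: current holds if reply holds
        if is_known(reply):
            return learned
        current = reply
    return None

# Scripted replies taken from the sample dialog in the paper.
replies = iter(["If I have my umbrella.", "If you remind me to bring an umbrella."])
known = {"you remind me to bring an umbrella"}  # stands in for the commanded action
rules = elicit_chain("I remain dry", lambda s: s in known, lambda q: next(replies))
```

With the scripted replies, two rules are learned before the chain reaches the commanded action. Asking for one sub-goal at a time keeps each reply short enough to be parsed into a single predicate.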
The third scenario, when caused by variations in natural language, results in semantically similar statements being mapped into different logical forms, which is unwanted. For example, “make sure I am awake early morning” vs. “wake me up early morning” will be parsed into the different logical forms awake(i,earlymorning) and wake(me, earlymorning), respectively, although they are semantically similar. This mismatch prevents a logical proof from succeeding since the proof strategy relies on exact match in the unification operation (see Appendix). This is addressed by our neuro-symbolic theorem prover (Fig. 1) that learns vector representations (embeddings) for logical rules and variables and uses them to perform a logical proof through soft unification. If the theorem prover can prove the user’s goal, CORGI outputs the proof trace (Fig. 2) returned by its theorem prover and succeeds. In the next section, we explain our theorem prover in detail.
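The idea of soft unification can be illustrated with a small Python sketch that replaces exact symbol matching with a similarity test between embeddings. The 3-dimensional vectors below are hand-picked for the example, and the cosine measure and threshold are assumptions; CORGI learns its embeddings and its prover is described in the next section.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, hand-picked for illustration (not learned).
EMBEDDINGS = {
    "wake": (0.9, 0.1, 0.0),
    "awake": (0.8, 0.2, 0.1),
    "alarm": (0.0, 0.1, 0.9),
    "me": (0.1, 0.9, 0.0),
    "i": (0.2, 0.9, 0.1),
    "earlymorning": (0.5, 0.5, 0.5),
}

def soft_match(a, b, threshold=0.9):
    """Two symbols match if identical or if their embeddings are close."""
    if a == b:
        return True
    return cosine(EMBEDDINGS[a], EMBEDDINGS[b]) >= threshold

def soft_unify(term_a, term_b, threshold=0.9):
    """Soft unification of two ground terms given as (predicate, arg1, ...)."""
    if len(term_a) != len(term_b):
        return False
    return all(soft_match(x, y, threshold) for x, y in zip(term_a, term_b))

same = soft_unify(("awake", "i", "earlymorning"), ("wake", "me", "earlymorning"))
different = soft_unify(("alarm", "i", "earlymorning"), ("wake", "me", "earlymorning"))
```

With these toy vectors the two logical forms from the example above unify, while an unrelated predicate does not. The threshold trades recall against spurious matches.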
We revisit these scenarios in detail in the discussion section and show real examples from our user study.
4 Neuro-Symbolic Theorem Proving
Our neuro-symbolic theorem prover is a neural modification of backward chaining and uses the vector similarity between rule and variable embeddings for unification. In order to learn these embeddings, our theorem prover learns a general proving strategy by training on proof traces of successful proofs. At a high level, for a given query our model maximizes the probability of choosing the correct rule in each step of the backward chaining algorithm. This proposal is an adaptation of Reed et al.’s Neural Programmer-Interpreter reed2015neural , which learns to execute algorithms such as addition and sort by training on their execution traces.
In what follows, we represent scalars with lowercase letters, vectors with bold lowercase letters and matrices with bold uppercase letters. The rule embedding matrix holds the embeddings of the rules and facts; it has one row per rule or fact, and its number of columns is the rule embedding dimension. The variable embedding matrix holds the embeddings of all the atoms and variables in the knowledge base; it has one row per atom or variable, and its number of columns is the variable embedding dimension. Our knowledge base is type coerced, therefore the variable names are associated with their types (e.g., alarm(Person,Time)).
The model’s core consists of an LSTM network whose hidden state indicates the next rule in the proof trace and a proof termination probability, given a query as input. The model also has a feed forward network that makes variable binding decisions. The model’s training is fully supervised by the proof trace of a query, given in a depth-first traversal order from left to right (Fig. 2). The trace is input to the model sequentially in the traversal order, as explained in what follows. At proof step t, the model’s input has three parts: the query’s embedding, the rule input, and the embeddings of the query’s arguments. The query’s embedding is computed by feeding the predicate name of the query into a character RNN. The rule input is the concatenation of the embeddings of the rules in the parent and the left sister nodes in the proof trace, looked up from the rule embedding matrix. For example, in Fig. 2, for the node at one proof step, the parent rule is the rule highlighted in green and the left sister is the fact alarm(i, 8). The reason for including the left sister nodes in the rule input is that the proof is conducted in a left-to-right, depth-first order. Therefore, the decision of which rule to choose next at each node depends on both the left sisters and the parent (see Fig. 2). The arguments of the query are presented as one embedding per argument, so their number equals the arity of the query predicate. For example, in Fig. 2 one of these is the embedding of the variable Person. Each argument embedding is looked up from the variable embedding matrix. The output of the model at step t has three parts: a probability vector over all the variables and atoms for each argument, a probability vector over all the rules and facts, and a scalar probability of terminating the proof at step t. The feed forward networks in the model each have two fully connected layers. The trainable parameters of the model are the parameters of the feed forward networks, the LSTM network, the character RNN that embeds the query, and the rule and variable embedding matrices.
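The parent and left-sister bookkeeping used to build the model's rule input can be sketched in a few lines of Python. The tree below is a hypothetical stand-in for a proof tree like the one in Fig. 2, with made-up rule labels; only the depth-first, left-to-right traversal order is taken from the text.

```python
def trace(tree):
    """List (step, rule, parent_rule, left_sister_rules) in depth-first order.

    A tree node is a pair (rule_label, list_of_child_nodes).
    """
    steps = []

    def visit(node, parent, left_sisters):
        rule, children = node
        # Snapshot the sisters seen so far; the list keeps growing afterwards.
        steps.append((len(steps), rule, parent, tuple(left_sisters)))
        seen = []
        for child in children:
            visit(child, rule, seen)
            seen.append(child[0])

    visit(tree, None, [])
    return steps

# Hypothetical proof tree with illustrative rule labels.
TREE = (
    "get_on_time",
    [
        ("alarm_fact", []),
        ("leave_early", [("wake_early", [])]),
    ],
)

steps = trace(TREE)
```

Each tuple in `steps` carries what the model would need at that step: the current node, its parent rule, and the rules of the sisters already proved to its left.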
Our model is trained end-to-end. In order to train the model parameters and the embeddings, we maximize the log likelihood of the proof traces, summed over all the proof traces in the training set, with respect to the trainable parameters of the model.
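One way to write down the computation and the training objective described in this section is sketched below. Every symbol and network name here is an assumption introduced for illustration, as is the per-step factorization of the likelihood; the sketch only mirrors the prose (a character-RNN query encoding, an LSTM state that drives the rule and termination outputs, and a feed forward network for variable binding).

```latex
% Assumed notation, not taken from the source:
%   q_t      query embedding at step t
%   r_t      rule input (parent and left sister rule embeddings)
%   v_t^i    embedding of the i-th query argument
%   h_t      LSTM hidden state
%   c_t      probability of terminating the proof at step t
%   \theta   all trainable parameters
\begin{align}
  s_t &= f_{\mathrm{enc}}(q_t), \\
  h_t &= f_{\mathrm{lstm}}(s_t, r_t, h_{t-1}), \\
  c_t &= f_{\mathrm{end}}(h_t), \qquad
  r_{t+1} = f_{\mathrm{rule}}(h_t), \qquad
  v_{t+1}^{i} = f_{\mathrm{var}}(v_t^{i}), \\
  \theta^{*} &= \arg\max_{\theta} \sum_{\text{traces}} \sum_{t=1}^{T}
    \Big[ \log P(c_t \mid h_t; \theta) + \log P(r_{t+1} \mid h_t; \theta)
    + \sum_{i} \log P(v_{t+1}^{i} \mid v_t^{i}; \theta) \Big].
\end{align}
```

Under these assumptions, the outer sum runs over the proof traces in the training set and T is the number of steps in a trace.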
5 Experiment Design
The knowledge base used for all experiments is a small handcrafted set of commonsense knowledge. See Tab. 6 in the Appendix for examples. The KB includes general information about time, restricted domains such as setting alarms and notifications, emails, and so on, as well as commonsense knowledge about day-to-day activities. The KB contains a total of 228 facts and rules. Among these, there are 189 everyday-domain and 39 restricted-domain facts and rules. We observed that most of the if-then-because commands require everyday-domain knowledge for reasoning, even if they are restricted-domain commands (see Table 3 for an example).
Our neuro-symbolic theorem prover is trained on proof traces collected by proving automatically generated queries against the KB using sPyrolog (https://github.com/leonweber/spyrolog). The rule and variable embedding matrices are initialized randomly and with GloVe embeddings pennington2014glove , respectively. Since the KB is type-coerced (e.g. Time, Location), initializing the variables with pre-trained word embeddings helps capture their semantics and improves the performance. The neural components of the theorem prover are implemented in PyTorch paszke2017automatic and the prover is built on top of sPyrolog.
|CORGI variations||Novice User||Expert User|
|If it’s going to rain in the afternoon then remind me to bring an umbrella because I want to remain dry.|
|How do I know if “I remain dry”?|
|If I have my umbrella.|
|How do I know if “I have my umbrella”?|
|If you remind me to bring an umbrella.|
|Okay, I will perform “remind me to bring an umbrella” in order to achieve “I remain dry”.|
|If it’s going to rain in the afternoon then remind me to bring an umbrella because I want to remain dry.|
|How do I know if “I remain dry”?|
|If I have my umbrella.|
|How do I know if “I have my umbrella”?|
|If it’s in my office.|
|How do I know if “it’s in my office”?|
In order to assess CORGI’s performance, we ran a user study. We selected 10 goal-type if-then-because commands from the dataset in Table 1 and used each as the prompt for a reasoning task. We had 28 participants in the study, 4 of whom were experts closely familiar with CORGI and its capabilities. The rest were undergraduate and graduate students, the majority in engineering or computer science fields and some majoring in business administration or psychology. These users had never interacted with CORGI prior to the study (novice users). Each person was issued the 10 reasoning tasks, taking on average 20 minutes to complete all 10.
Solving a reasoning task consists of participating in a dialog with CORGI as the system attempts to complete a proof for the goal of the current task; see sample dialogs in Tab. 3. The task succeeds if CORGI is able to use the answers provided by the participant to construct a reasoning chain (proof) leading from the goal to the state and action. We collected 469 dialogues in our study.
The user study was run with the architecture shown in Fig. 1. We used the participant responses from the study to run a few more experiments: we (1) replace our theorem prover with an oracle prover that selects the optimal rule at each proof step in Alg. 1 and (2) attempt to prove the goal without using any participant responses (no-feedback). Tab. 3 shows the success rate in each setting.
In this section, we analyze the results from the study and provide examples of the 4 scenarios in Section 3.1 that we encountered. As hypothesized there, the first scenario (the user does not know the answer) hardly occurred. We did encounter the second scenario (the user misunderstands the question), however. The study’s dialogs show that some users provided means of sensing the goal rather than the cause of the goal. For example, for the reasoning task “If there are thunderstorms in the forecast within a few hours then remind me to close the windows because I want to keep my home dry”, in response to the system’s prompt “How do I know if ‘I keep my home dry’?” a user responded “if the floor is not wet” as opposed to an answer such as “if the windows are closed”. Moreover, some users did not pay attention to the context of the reasoning task. For example, another user responded to the above prompt (same reasoning task) with “if the temperature is above 80”! Overall, we noticed that CORGI’s ability to successfully reason about an if-then-because statement was heavily dependent on whether the user knew how to give the system what it needed, and not necessarily what it asked for; see Table 3 for an example. As can be seen in Table 3, expert users are able to more effectively provide answers that complete CORGI’s reasoning chain, likely because they know that regardless of what CORGI asks, the object of the dialog is to connect the because-goal back to the knowledge base in some series of if-then rules (goal/sub-goal path in Sec. 3.1). Therefore, one interesting future direction is to develop a dynamic context-dependent Natural Language Generation method for asking more effective questions.
We would like to emphasize that although it seems to us humans that the previous example requires very simple background knowledge that likely exists in SOTA large commonsense knowledge graphs such as ConceptNet (http://conceptnet.io/), ATOMIC (https://mosaickg.apps.allenai.org/kg_atomic) or COMET bosselut2019comet , this is not the case (verifiable by querying them online). For example, for queries such as “the windows are closed”, the COMET-ConceptNet generative model (https://mosaickg.apps.allenai.org/comet_conceptnet) returns knowledge about blocking the sun, and the COMET-ATOMIC generative model (https://mosaickg.apps.allenai.org/comet_atomic) returns knowledge about keeping the house warm or avoiding getting hot, which, while correct, is not applicable in this context. For “my home is dry”, both the COMET-ConceptNet and COMET-ATOMIC generative models return knowledge about house cleaning or house comfort. On the other hand, the fact that 40% of the novice users in our study were able to help CORGI reason about this example with responses such as “If I close the windows” to CORGI’s prompt is an interesting result. This tells us that conversational interactions with humans could pave the way for commonsense reasoning and enable computers to extract just-in-time commonsense knowledge, which would likely either not exist in large knowledge bases or be irrelevant in the context of the particular reasoning task. Lastly, we reiterate that as conversational agents (such as Siri and Alexa) enter people’s lives, leveraging conversational interactions for learning has become a more realistic opportunity than ever before.
In order to address the third scenario, in the case caused by the limitations of natural language understanding, the conversational prompts of CORGI ask for specific small pieces of knowledge that can be easily parsed into a predicate and a set of arguments. However, some users in our study tried to provide additional details, which challenged CORGI’s natural language understanding. For example, for the reasoning task “If I receive an email about water shut off then remind me about it a day before because I want to make sure I have access to water when I need it.”, in response to the system’s prompt “How do I know if ‘I have access to water when I need it.’?” one user responded “If I am reminded about a water shut off I can fill bottles”. This is a successful knowledge transfer. However, the parser expected this to be broken down into two steps. If this user had responded to the prompt with “If I fill bottles” first, CORGI would have asked “How do I know if ‘I fill bottles’?” and if the user had then responded “if I am reminded about a water shut off”, CORGI would have succeeded. The success of such conversational interactions is not reflected in the overall performance, mainly due to the limitations of natural language understanding.
Table 3 evaluates the effectiveness of conversational interactions for proving compared to the no-feedback model. The 0% success rate there reflects the incompleteness of the KB. The improvement in task success rate between the no-feedback case and the other rows indicates that when it is possible for users to contribute useful commonsense knowledge to the system, performance improves. The users contributed a total of 96 rules to our knowledge base, 31 of which were unique. The third scenario, in the case caused by variations in the user’s natural language statement, is addressed with our neuro-symbolic theorem prover. Rows 2-3 in Table 3 evaluate our theorem prover (soft unification). Having access to the optimal rule for unification performs better still, but the task success rate is not 100%, mainly due to the limitations of natural language understanding explained earlier.
In this paper, we introduced a benchmark task for commonsense reasoning that aims at uncovering unspoken intents that humans can easily uncover in a given statement by making presumptions supported by their common sense. In order to solve this task, we propose CORGI (COmmonsense ReasoninG by Instruction), a neuro-symbolic theorem prover that performs commonsense reasoning by initiating a conversation with a user. CORGI has access to a small knowledge base of commonsense facts and completes it over time as it interacts with the user. We further conduct a user study that indicates the possibility of using conversational interactions with humans for evoking commonsense knowledge and verifies the effectiveness of our proposed theorem prover.
-  Hector Levesque, Ernest Davis, and Leora Morgenstern. The winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning, 2012.
-  Ernest Davis and Gary Marcus. Commonsense reasoning and commonsense knowledge in artificial intelligence. Communications of the ACM, 58(9):92–103, 2015.
-  Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. 2019.
-  Herbert P Grice. Logic and conversation. In Speech acts, pages 41–58. Brill, 1975.
-  Terry Winograd. Understanding natural language. Cognitive psychology, 3(1):1–191, 1972.
-  Erik T Mueller. Commonsense reasoning: an event calculus based approach. Morgan Kaufmann, 2014.
-  Douglas B Lenat, Ramanathan V. Guha, Karen Pittman, Dexter Pratt, and Mary Shepherd. Cyc: toward programs with common sense. Communications of the ACM, 33(8):30–49, 1990.
-  Hugo Liu and Push Singh. Conceptnet—a practical commonsense reasoning tool-kit. BT technology journal, 22(4):211–226, 2004.
-  Catherine Havasi, Robert Speer, and Jason Alonso. Conceptnet 3: a flexible, multilingual semantic network for common sense knowledge. In Recent advances in natural language processing, pages 27–29. Citeseer, 2007.
-  Maarten Sap, Ronan LeBras, Emily Allaway, Chandra Bhagavatula, Nicholas Lourie, Hannah Rashkin, Brendan Roof, Noah A Smith, and Yejin Choi. Atomic: An atlas of machine commonsense for if-then reasoning. arXiv preprint arXiv:1811.00146, 2018.
-  Hannah Rashkin, Maarten Sap, Emily Allaway, Noah A Smith, and Yejin Choi. Event2mind: Commonsense inference on events, intents, and reactions. arXiv preprint arXiv:1805.06939, 2018.
-  Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Celikyilmaz, and Yejin Choi. Comet: Commonsense transformers for automatic knowledge graph construction. arXiv preprint arXiv:1906.05317, 2019.
-  Ben Hixon, Peter Clark, and Hannaneh Hajishirzi. Learning knowledge graphs for question answering through conversational dialog. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 851–861, 2015.
-  Amos Azaria, Jayant Krishnamurthy, and Tom M Mitchell. Instructable intelligent personal agent. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
-  Igor Labutov, Shashank Srivastava, and Tom Mitchell. Lia: A natural language programmable personal assistant. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 145–150, 2018.
-  Shashank Srivastava. Teaching Machines to Classify from Natural Language Interactions. PhD thesis, Samsung Electronics, 2018.
-  Dan Goldwasser and Dan Roth. Learning from natural instructions. Machine learning, 94(2):205–232, 2014.
-  Philipp Christmann, Rishiraj Saha Roy, Abdalghani Abujabal, Jyotsna Singh, and Gerhard Weikum. Look before you hop: Conversational question answering over knowledge graphs using judicious context expansion. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pages 729–738, 2019.
-  Daya Guo, Duyu Tang, Nan Duan, Ming Zhou, and Jian Yin. Dialog-to-action: Conversational question answering over a large-scale knowledge base. In Advances in Neural Information Processing Systems, pages 2942–2951, 2018.
-  Toby Jia-Jun Li, Igor Labutov, Xiaohan Nancy Li, Xiaoyi Zhang, Wenze Shi, Wanling Ding, Tom M Mitchell, and Brad A Myers. Appinite: A multi-modal interface for specifying data descriptions in programming by demonstration using natural language instructions. In 2018 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), pages 105–114. IEEE, 2018.
-  Toby Jia-Jun Li, Yuanchun Li, Fanglin Chen, and Brad A Myers. Programming iot devices by demonstration using mobile apps. In International Symposium on End User Development, pages 3–17. Springer, 2017.
-  Toby Jia-Jun Li, Amos Azaria, and Brad A Myers. Sugilite: creating multimodal smartphone automation by demonstration. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, pages 6038–6049. ACM, 2017.
-  Vid Kocijan, Thomas Lukasiewicz, Ernest Davis, Gary Marcus, and Leora Morgenstern. A review of winograd schema challenge datasets and approaches. arXiv preprint arXiv:2004.13831, 2020.
-  Nasrin Mostafazadeh, Michael Roth, Annie Louis, Nathanael Chambers, and James Allen. Lsdsem 2017 shared task: The story cloze test. In Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics, pages 46–51, 2017.
-  Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S Gordon. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series, 2011.
-  Chandra Bhagavatula, Ronan Le Bras, Chaitanya Malaviya, Keisuke Sakaguchi, Ari Holtzman, Hannah Rashkin, Doug Downey, Scott Wen-tau Yih, and Yejin Choi. Abductive commonsense reasoning. In International Conference on Learning Representations (ICLR), 2020.
-  Artur d’Avila Garcez, Tarek R Besold, Luc De Raedt, Peter Földiak, Pascal Hitzler, Thomas Icard, Kai-Uwe Kühnberger, Luis C Lamb, Risto Miikkulainen, and Daniel L Silver. Neural-symbolic learning and reasoning: contributions and challenges. In 2015 AAAI Spring Symposium Series, 2015.
-  Tarek R Besold, Artur d’Avila Garcez, Sebastian Bader, Howard Bowman, Pedro Domingos, Pascal Hitzler, Kai-Uwe Kühnberger, Luis C Lamb, Daniel Lowd, Priscila Machado Vieira Lima, et al. Neural-symbolic learning and reasoning: A survey and interpretation. arXiv preprint arXiv:1711.03902, 2017.
-  Jiayuan Mao, Chuang Gan, Pushmeet Kohli, Joshua B Tenenbaum, and Jiajun Wu. The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision. 2019.
-  Robin Manhaeve, Sebastijan Dumancic, Angelika Kimmig, Thomas Demeester, and Luc De Raedt. Deepproblog: Neural probabilistic logic programming. In Advances in Neural Information Processing Systems, pages 3749–3759, 2018.
-  Honghua Dong, Jiayuan Mao, Tian Lin, Chong Wang, Lihong Li, and Denny Zhou. Neural logic machines. In International Conference on Learning Representations (ICLR), 2019.
-  Giuseppe Marra, Francesco Giannini, Michelangelo Diligenti, and Marco Gori. Integrating learning and reasoning with deep logic models. arXiv preprint arXiv:1901.04195, 2019.
-  Zhi-Hua Zhou. Abductive learning: towards bridging machine learning and logical reasoning. Science China Information Sciences, 62(7):76101, 2019.
-  Richard Evans and Edward Grefenstette. Learning explanatory rules from noisy data. Journal of Artificial Intelligence Research, 61:1–64, 2018.
-  Robert Speer, Catherine Havasi, and Henry Lieberman. Analogyspace: Reducing the dimensionality of common sense knowledge. In AAAI, volume 8, pages 548–553, 2008.
-  Catherine Havasi, Robert Speer, James Pustejovsky, and Henry Lieberman. Digital intuition: Applying common sense using dimensionality reduction. IEEE Intelligent systems, 24(4):24–35, 2009.
-  William W Cohen. Tensorlog: A differentiable deductive database. arXiv preprint arXiv:1605.06523, 2016.
-  Alain Colmerauer. An introduction to prolog iii. In Computational Logic, pages 37–79. Springer, 1990.
-  Scott Reed and Nando De Freitas. Neural programmer-interpreters. arXiv preprint arXiv:1511.06279, 2015.
-  Tim Rocktäschel and Sebastian Riedel. End-to-end differentiable proving. In Advances in Neural Information Processing Systems, pages 3788–3800, 2017.
-  L Weber, P Minervini, J Münchmeyer, U Leser, and T Rocktäschel. Nlprolog: Reasoning with weak unification for question answering in natural language. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, ACL 2019, Florence, Italy, Volume 1: Long Papers, volume 57. ACL (Association for Computational Linguistics), 2019.
-  Nuri Cingillioglu and Alessandra Russo. Deeplogic: Towards end-to-end differentiable logical reasoning. arXiv preprint arXiv:1805.07433, 2018.
-  Tim Rocktäschel, Matko Bošnjak, Sameer Singh, and Sebastian Riedel. Low-dimensional embeddings of logic. In Proceedings of the ACL 2014 Workshop on Semantic Parsing, pages 45–49, 2014.
-  William Yang Wang and William W Cohen. Learning first-order logic embeddings via matrix factorization. In IJCAI, pages 2132–2138, 2016.
-  Bernd Finkbeiner, Christopher Hahn, Markus N Rabe, and Frederik Schmitt. Teaching temporal logics to neural networks. arXiv preprint arXiv:2003.04218, 2020.
-  Peter Clark, Oyvind Tafjord, and Kyle Richardson. Transformers as soft reasoners over language. arXiv preprint arXiv:2002.05867, 2020.
-  Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543, 2014.
-  Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
-  Matthew Honnibal and Ines Montani. spacy 2: Natural language understanding with bloom embeddings. Convolutional Neural Networks and Incremental Parsing, 2017.
-  Bhavana Dalvi Mishra, Niket Tandon, and Peter Clark. Domain-targeted, high precision knowledge extraction. Transactions of the Association for Computational Linguistics, 5:233–246, 2017.
Data collection was done in two stages. In the first stage, we collected if-then-because commands from human subjects. In the second stage, a team of annotators annotated the data with commonsense presumptions. Below we explain the details of the data collection and annotation process.
In the data collection stage, we asked a pool of human subjects to write commands that follow the general format “If state holds then perform action because I want to achieve goal.” The subjects were given the following instructions at the time of data collection:
“ Imagine the two following scenarios:
Scenario 1: Imagine you had a personal assistant that has access to your email, calendar, alarm, weather and navigation apps, what are the tasks you would like the assistant to perform for your day-to-day life? And why?
Scenario 2: Now imagine you have an assistant/friend that can understand anything. What would you like that assistant/friend to do for you?
Our goal is to collect data in the format “If …. then …. because ….” ”
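To make the command format concrete, the following is a minimal sketch of how an utterance in this format could be split into its three clauses. It is only an illustration: the regular expression `COMMAND_RE` and the function `split_command` are hypothetical names, and the sketch assumes that the keywords "if", "then" and "because" each occur once and in that order. It is not the parser used in this work.

```python
import re

# Hypothetical pattern (an assumption for illustration): the keywords
# "if", "then" and "because" appear once each, in this order.
COMMAND_RE = re.compile(
    r"^\s*if\s+(?P<state>.+?)\s+then\s+(?P<action>.+?)"
    r"\s+because\s+(?P<goal>.+?)\s*\.?\s*$",
    re.IGNORECASE,
)


def split_command(utterance):
    """Return the state, action and goal clauses, or None if the format does not match."""
    match = COMMAND_RE.match(utterance)
    if match is None:
        return None
    # Strip stray spaces and commas left at the clause boundaries.
    return {name: match.group(name).strip(" ,") for name in ("state", "action", "goal")}


command = "If it snows tonight then wake me up early because I want to get to work on time."
clauses = split_command(command)
```

Utterances that do not follow the format return `None`, which is one simple way to flag commands that need manual inspection.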
After the data was collected, a team of annotators annotated the commands with the additional presumptions that the human subjects had left unspoken. These presumptions belong to the if-clause, the then-clause, or both; examples are shown in Tables 1 and 4.
As explained in the main text, after data collection we uncovered 5 different logic templates in the data that reflect humans’ reasoning. The templates are listed in Table 5. In what follows, we explain each template in detail using the examples listed in Table 5.
In the blue template (Template 1), the state results in a “bad state” that causes the negation of the goal. The speaker asks for the action in order to avoid the bad state and achieve the goal. For instance, consider the example for the blue template in Table 5. The state of snowing a lot at night results in a bad state of traffic slowdowns, which in turn causes the speaker to be late for work. To overcome this bad state, the speaker would like to take the action of waking up earlier, to account for the possible slowdowns caused by snow and get to work on time.
In the orange template (Template 2), performing the action when the state holds allows the speaker to achieve the goal, and not performing the action when the state holds prevents the speaker from achieving the goal. For instance, in the example for the orange template in Table 5, the speaker would like to know who the attendees of a meeting are while walking to that meeting, in order to be prepared for it. If the speaker is not reminded of this, he/she will not be able to prepare properly for the meeting.
In the green template (Template 3), performing the action when the state holds allows the speaker to take a hidden action that enables him/her to achieve the desired goal. For example, if the speaker is reminded to buy flower bulbs close to the Fall season, he/she will buy and plant the flowers (hidden actions), which allows the speaker to have a pretty spring garden.
In the purple template (Template 4), the goal that the speaker has stated is actually a goal that they want to avoid. In this case, the state causes the speaker’s goal, but the speaker would like to take the action when the state holds to achieve the opposite of the goal. For the example in Table 1, if the speaker has a trip coming up and buys perishables, the perishables will go bad. To prevent this, the speaker would like to be reminded not to buy perishables, so that they do not go bad while he/she is away.
The rest of the statements are categorized under the “other” category. The majority of these statements contain a conjunction in their state and are a mix of the above templates. A reasoning engine could potentially benefit from these logic templates when performing reasoning. We provide more detail about this in the Extended Discussion section in the Appendix.
Prolog is a declarative logic programming language. A Prolog program consists of a set of predicates. A predicate has a name (functor) and $N \geq 0$ arguments; $N$ is referred to as the arity of the predicate. A predicate with functor name $F$ and arity $N$ is represented as $F(T_1, \dots, T_N)$, where the $T_i$, for $i \in [1, N]$, are the arguments, which are arbitrary Prolog terms. A Prolog term is either an atom, a variable, or a compound term (a predicate with arguments). A variable starts with a capital letter (e.g., Time) and an atom starts with a lowercase letter (e.g., monday). A predicate defines a relationship between its arguments. For example, isBefore(monday, tuesday) indicates that Monday is before Tuesday.
A predicate is defined by a set of clauses. A clause is either a Prolog fact or a Prolog rule. A Prolog rule is denoted by Head :- Body., where the Head is a predicate, the Body is a conjunction ($\wedge$) of predicates, :- is logical implication, and the period indicates the end of the clause. Such a rule is an if-then statement that reads “if the Body holds then the Head holds”. A fact is a rule whose body always holds; it is written Head., which is equivalent to Head :- true. Rows 1-4 in Table 6 are rules and rows 5-8 are facts.
Prolog can be used to logically “prove” whether a specific query holds or not (for example, to prove that isAfter(wednesday,thursday)? is false or that status(i, dry, tuesday)? is true using the program in Table 6). The proof is performed through backward chaining, a backtracking algorithm that usually employs a depth-first search strategy implemented recursively. In each step of the recursion, the input is a query (goal) to prove and the output is the proof’s success or failure. To prove a query, a rule or fact whose head unifies with the query is retrieved from the Prolog program. The proof continues recursively for each predicate in the body of the retrieved rule and succeeds if all the statements in the body of the rule are true. The base case (leaf) is reached when a fact is retrieved from the program.
At the heart of backward chaining is the unification operator, which matches the query with a rule’s head. Unification first checks whether the functor of the query is the same as the functor of the rule head. If they are the same, unification checks the arguments. If the number of arguments (the arity) of the two predicates does not match, unification fails. Otherwise, it iterates through the arguments. For each argument pair, if both are grounded atoms, unification succeeds only if they are exactly the same atom. If one is a variable and the other is a grounded atom, unification grounds the variable to the atom and succeeds. If both are variables, unification succeeds without any variable grounding. The backward chaining algorithm and the unification operator are depicted in Figure 3.
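The unification and backward chaining steps described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the algorithm of Figure 3: terms are flat (no compound terms), predicates are tuples of the form (functor, arg1, ..., argN), and the names `unify`, `prove`, `holds` and `PROGRAM` are hypothetical. Apart from the fact isInside(i, home, tuesday), the clauses in `PROGRAM` are invented for the example.

```python
import itertools


def is_var(term):
    """Variables start with a capital letter, atoms with a lowercase letter."""
    return term[0].isupper()


def walk(term, subst):
    """Follow variable bindings until an atom or an unbound variable is reached."""
    while is_var(term) and term in subst:
        term = subst[term]
    return term


def unify(query, head, subst):
    """Return an extended substitution if query matches head, else None."""
    # Functor and arity must agree before the arguments are compared.
    if query[0] != head[0] or len(query) != len(head):
        return None
    subst = dict(subst)
    for a, b in zip(query[1:], head[1:]):
        a, b = walk(a, subst), walk(b, subst)
        if a == b:
            continue
        if is_var(a):
            # Covers variable/atom and variable/variable pairs. Two distinct
            # variables are aliased here so that later bindings propagate.
            subst[a] = b
        elif is_var(b):
            subst[b] = a
        else:
            return None  # two different grounded atoms
    return subst


_fresh = itertools.count()


def rename(clause):
    """Give the clause fresh variable names so separate uses do not clash."""
    head, body = clause
    suffix = f"_{next(_fresh)}"

    def fresh(pred):
        return (pred[0],) + tuple(t + suffix if is_var(t) else t for t in pred[1:])

    return fresh(head), [fresh(p) for p in body]


def prove(goals, program, subst, depth=0, max_depth=20):
    """Depth-first backward chaining; returns a substitution or None."""
    if not goals:
        return subst  # every goal has been proved
    if depth > max_depth:
        return None
    first, rest = goals[0], goals[1:]
    for clause in program:
        head, body = rename(clause)
        new_subst = unify(first, head, subst)
        if new_subst is None:
            continue  # try the next clause (backtracking)
        result = prove(body + rest, program, new_subst, depth + 1, max_depth)
        if result is not None:
            return result
    return None


# Hypothetical program: each clause is (head, body); facts have an empty body.
PROGRAM = [
    (("isBefore", "monday", "tuesday"), []),
    (("isBefore", "tuesday", "wednesday"), []),
    (("isAfter", "Day1", "Day2"), [("isBefore", "Day2", "Day1")]),
    (("isInside", "i", "home", "tuesday"), []),
    (("status", "Person", "dry", "Date"), [("isInside", "Person", "home", "Date")]),
]


def holds(query):
    return prove([query], PROGRAM, {}) is not None
```

With this toy program, `holds(("status", "i", "dry", "tuesday"))` succeeds through the invented status rule, while `holds(("isAfter", "wednesday", "thursday"))` fails because no isBefore fact supports it. The depth limit is a simple safeguard against non-terminating recursion and is not part of the description above.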
The goal of our parser is to extract the state, action and goal from the input utterance and convert them to their logical forms. The parser is built using spaCy. We implement a relation extraction method that uses spaCy’s built-in dependency parser. The language model that we used is en_coref_lg, released by Hugging Face (https://github.com/huggingface/neuralcoref-models/releases/download/en_coref_lg-3.0.0/en_coref_lg-3.0.0.tar.gz). The predicate name is typically the sentence verb or the sentence root. The predicate’s arguments are the subject, objects, named entities and noun chunks extracted by spaCy. The output of the relation extractor is matched against the knowledge base through rule-based mechanisms, including string matching, to decide whether the parsed logical form exists in the knowledge base. If a match is found, the parser re-orders the arguments to match the order of the arguments of the predicate retrieved from the knowledge base. This re-ordering is done through a type coercion method. For type coercion, we use the types released by Allen AI in the Aristo tuple KB v1.03 Mar 2017 Release and have added more entries to cover more nouns. The released types file is a dictionary that maps nouns to their types. For example, doctor is of type person and Tuesday is of type date. If no match is found, the parsed predicate is kept as is and CORGI tries to evoke relevant rules conversationally from humans in the user feedback loop in Figure 1.
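The argument re-ordering step can be illustrated with a small sketch. It assumes that each knowledge-base predicate has a known type signature and that the type dictionary maps lowercase nouns to types; the names `NOUN_TYPES` and `reorder_arguments` and the example signature are hypothetical, and only the doctor/person and Tuesday/date entries come from the text.

```python
# Hypothetical excerpt of a noun -> type dictionary. Only doctor -> person
# and tuesday -> date are taken from the text; the rest are illustrative.
NOUN_TYPES = {
    "doctor": "person",
    "i": "person",
    "tuesday": "date",
    "home": "location",
}


def reorder_arguments(parsed_args, kb_signature, noun_types=NOUN_TYPES):
    """Permute parsed_args to follow the type order in kb_signature.

    Returns None when some slot cannot be filled or arguments are left over.
    """
    remaining = list(parsed_args)
    ordered = []
    for wanted_type in kb_signature:
        # Pick the first unused argument whose type fits this slot.
        match = next(
            (arg for arg in remaining if noun_types.get(arg.lower()) == wanted_type),
            None,
        )
        if match is None:
            return None
        ordered.append(match)
        remaining.remove(match)
    return ordered if not remaining else None


# Assumed signature for a predicate such as isInside(person, location, date).
reordered = reorder_arguments(["tuesday", "home", "i"], ["person", "location", "date"])
```

A greedy match per slot is enough when each type occurs once in the signature; predicates with two arguments of the same type would need an additional tie-breaking rule, which this sketch does not attempt.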
We would like to note that we refrained from using a grammar parser, particularly because we want to enable open-domain discussions with the users and save them the time required to learn the system’s language. As a result, the system learns to adapt to the user’s language over time: the background knowledge is accumulated through user interactions and is therefore adapted to that user. A negative effect, however, is that if the parser makes a mistake, the error will propagate into the system’s future knowledge. This is an interesting future direction that we are planning to address.
The inference algorithm for our proposed neuro-symbolic theorem prover is given in Alg. 1. In each step of the proof, given a query $Q$, we use the trained model to compute a score for each rule. Next, we choose the rules corresponding to the top $k$ scores as candidates for the next proof trace, where $k$ is a tuning parameter. For each of the top $k$ rules, we attempt variable/argument unification by computing the cosine similarity between the arguments of $Q$ and the arguments of the rule’s head. If every corresponding pair of arguments in $Q$ and the rule’s head has a similarity higher than a threshold, unification succeeds; otherwise it fails. If unification succeeds, we move on to prove the body of that rule. If not, we move to the next rule.
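The candidate selection and soft unification steps can be sketched as follows. This is an illustration under stated assumptions rather than the procedure of Alg. 1: the rule scores and the two-dimensional embeddings are hand-written placeholders for the outputs of the trained model, the threshold value 0.9 is chosen only for the example, and the names `top_k_rules`, `soft_unify` and `EMBED` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_rules(scores, k):
    """Indices of the k highest-scoring rules, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def soft_unify(query_args, head_args, embed, threshold):
    """Succeed only if every aligned argument pair is similar enough."""
    if len(query_args) != len(head_args):
        return False
    return all(
        cosine(embed[q], embed[h]) > threshold
        for q, h in zip(query_args, head_args)
    )


# Placeholder embeddings (assumed values; a trained model would supply these).
EMBED = {
    "snow": [1.0, 0.1],
    "snowfall": [0.9, 0.2],
    "email": [0.0, 1.0],
}

# Placeholder rule scores for three rules, standing in for the model output.
candidates = top_k_rules([0.1, 0.7, 0.4], k=2)
matches = soft_unify(["snow"], ["snowfall"], EMBED, threshold=0.9)
```

Compared with exact unification, the similarity test lets near-synonymous arguments such as "snow" and "snowfall" match, at the cost of introducing a threshold that has to be tuned.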
|4||notify(Person1, corgi, Action1) :- email(Person1, Action1).|
|7||isInside(i, home, tuesday).|
Table 7 shows the performance breakdown with respect to the logic templates in Table 5. Currently, CORGI uses a general theorem prover that can prove all the templates. The large variation in performance indicates that taking the different templates into account would improve performance. For example, the low performance on the green template is expected, since CORGI currently does not support the extraction of a hidden action from the user, and interactions only support the extraction of missing goals. This observation indicates that, even within the same benchmark, we might need to develop several reasoning strategies to solve reasoning problems. Therefore, even if CORGI adopts a general theorem prover, accounting for logic templates in the conversational knowledge extraction component would allow it to achieve better performance on other templates.