Understanding in Artificial Intelligence

by Stefan Maetschke, et al.

Current Artificial Intelligence (AI) methods, most based on deep learning, have facilitated progress in several fields, including computer vision and natural language understanding. The progress of these AI methods is measured using benchmarks built around challenging tasks, such as visual question answering. A question remains of how much understanding is leveraged by these methods and how appropriate the current benchmarks are for measuring understanding capabilities. To answer these questions, we have analysed existing benchmarks against a defined set of understanding capabilities, together with current research streams. We show how progress has been made in benchmark development to measure the understanding capabilities of AI methods, and we also review how current methods develop understanding capabilities.





1 Introduction

Recent advancements in deep learning have facilitated tremendous progress in computer vision, speech processing, natural language understanding and many other domains (LeCun et al., 2015; Schmidhuber, 2015). However, this progress is largely driven by increased computational power, namely GPUs, and bigger data sets, rather than by radically new algorithms or knowledge representations. Artificial Neural Networks and Stochastic Gradient Descent, popularized in the 1980s (Rummelhart et al., 1986), remain the fundamental building blocks for most modern AI systems.

While very successful for many applications, especially in vision, the purely deep-learning-based approach has significant weaknesses. For instance, CNNs struggle with same-different relations (Ricci et al., 2018), fail when long-chained reasoning is needed (Johnson et al., 2017a), are non-decomposable, cannot easily incorporate symbolic knowledge, and are hampered by a lack of model interpretability. Many current methods essentially compute higher-order statistics over basic elements such as pixels, phonemes, letters or words to process inputs but do not explicitly model the building blocks and their relations in a (de)composable and interpretable way.

On the other hand, there is a long tradition of symbolic knowledge representation, formal logic and reasoning. However, such approaches struggle with contradictory, incomplete or fuzzy information.

In this article, we focus on presenting existing benchmarks and research streams that go beyond purely deep-learning or symbolic methods, and describe how they represent and apply knowledge in a way that aims at human-level understanding. These works explore knowledge representations that are hierarchically structured, compositional, and unify different modalities. We will start by providing background on the definition of “understanding” and its implications for artificial intelligence in Section 2. As we will discuss, a way of evaluating understanding capabilities is by using benchmarks to measure performance against other systems or humans. We will present and analyse existing benchmarks in Section 3. Once we have a definition and benchmarks to measure understanding, we will introduce the different research streams recently developed in this area (Section 4). Finally, we discuss and summarise our study in Sections 5 and 6.

2 Definition of “understanding”

The Cambridge Dictionary defines “understanding” as to know why or how something happens or works. Similar definitions are available in fields such as psychology (Bereiter, 2005), philosophy (Kant, 1998) and computer science (Chaitin, 2006); in algorithmic information theory, more specifically, an argument is made to relate understanding to data compression.

Understanding has been recognized as a relevant component in artificial intelligence (Lake et al., 2017) and building systems with understanding capabilities is a significant step towards artificial general intelligence. There are several theories regarding the inner workings of human intelligence and how an artificial intelligence could be built (Chollet, 2019).

Current state-of-the-art methods implementing artificial intelligence, some of which even surpass human performance, still lack an understanding of the tasks they are applied to. For instance, systems that learn to play Atari games (Mnih et al., 2013) or AlphaGo (Silver et al., 2017) require a large number of example games to tune the underlying game model. However, even after extensive training, no general understanding of the game is generated and the model is not reusable for other, even closely related, tasks.

Another example is in image analytics, where deep neural networks are currently the state of the art for many tasks, for instance in medical image diagnosis, face recognition and self-driving cars. So-called “adversarial attacks”, in which images are perturbed by tiny amounts that are essentially invisible to humans, can cause these networks to wrongly classify objects with high confidence (Nguyen et al., 2015). This shows that these networks have no actual understanding of the scene but rely on higher-order pixel distributions for classification. Similarly, state-of-the-art natural language understanding systems are sensitive to small changes in the input text that do not affect humans (Jin et al., 2019).

A formal definition and evaluation of a system’s understanding capabilities is challenging (Chollet, 2019). As with intelligence, “understanding” could be measured by human evaluators who judge the output of a system, much as the Turing test has been proposed to recognize intelligence. However, such a measurement is labor intensive and subjective.

An alternative way of evaluating the understanding capabilities of a system is by benchmarking its performance on specific data sets and tasks. Using benchmarks is a common practice for evaluating machine learning systems and allows for a reproducible and objective evaluation.

Benchmarks with increasingly complex tasks can be defined to ensure that algorithms solve increasingly complex problems (Santoro et al., 2018; Hernandez-Orallo, 2020; Crosby, 2020), with the hope of moving closer and closer to a system that shows truly intelligent behavior.

Similar to the assessment of human intelligence, there is a need to prevent memorization, in which success in a benchmark might be achieved by memorizing questions and answers. Biases in the benchmark data set also need to be considered (Goyal et al., 2017), since they can be exploited, impeding the evaluation of understanding capabilities.

Recent benchmarks such as CLEVR (Johnson et al., 2017a) overcome these problems by generating synthetic data with strictly controlled distributions to avoid any biases. However, the understanding capabilities that can be assessed on synthetic data are often limited, since the complexity in comparison with real-world scenarios is severely reduced. For instance, the CLEVR benchmark requires the recognition and localization of a very small number of objects with a small set of fixed properties. It has successfully been solved using neuro-symbolic approaches, surpassing human performance (Mao et al., 2019a). Moving towards more complex understanding capabilities requires the creation and use of benchmarks that exhibit an increasing level of complexity.

Another property relevant to understanding is compositionality (Lake et al., 2017, 2015), which has been discussed in detail for vision tasks (Biederman, 1987) and natural language understanding (Manning, 2016). It essentially refers to a system’s capability to identify the composing parts of an object or a problem and to reuse those parts in new combinations to solve related but different tasks more efficiently, e.g. with less data or faster.

As the level of understanding of a system improves, we will consider the integration of existing information in ontologies and knowledge graphs to quickly and easily increase the system’s knowledge. Being able to integrate knowledge directly could support the transfer of skills across tasks or domains, which would help AI systems that understand to adapt to new problems.

3 Benchmarks

As mentioned in Section 2, benchmarks play an important role in evaluating the capabilities of existing AI systems. They can therefore be seen as guiding the development of algorithms that allow more complex tasks to be solved. In this section, recent advances in benchmarks are explored in each of the following categories: image analytics (Section 3.1), natural language processing (Section 3.2), visual question answering (Section 3.3), commonsense inference (Section 3.4) and commonsense inference that requires NLP (Section 3.5).

3.1 Image analytics benchmarks

Current state-of-the-art methods in processing images and text are evaluated on benchmarks that require some level of understanding. Arguably, for many of these benchmarks the level of understanding is limited and neural-network-based function approximators are able to provide a high level of accuracy. The following examples show existing benchmarks across multiple data modalities that either are, or have been, considered for state-of-the-art model evaluation.

Image analytics has seen a revolution with deep learning, and performance on many data sets has improved to become similar to human performance in image classification (such as ImageNet (Deng et al., 2009)) and in some object detection benchmarks. In object detection, the task is to identify objects in a predefined set of images, and annotated sets are used to identify these objects. Example data sets include COCO (Common Objects in Context) (Lin et al., 2014; https://cocodataset.org/index.ht), which contains over 330K images and 1.5 million annotations and can also be used for image segmentation. The SHAPE benchmark (Shilane et al., 2004) offers a set of objects in 3D to be identified.

3.2 Natural language processing (NLP) benchmarks

A large portion of knowledge is available in unstructured text format, so methods for understanding language are relevant. Although progress in natural language processing has lagged behind other uses of deep learning, the field is catching up with current developments. Some tasks target natural language understanding, as in the GLUE (General Language Understanding Evaluation) benchmark (Wang et al., 2018a), which contains several natural language understanding tasks.

Question answering is found in many common and popular tasks, in which the objective is to find the answer to a question given in text. Even though these tasks might require limited understanding capabilities in most cases, they have motivated research into more challenging benchmarks. Examples of benchmarks for question answering include SQuAD (Rajpurkar et al., 2016), with over 100K questions and answers crowdsourced using Wikipedia articles; this benchmark has provided a valuable resource for testing novel NLP methods. Question answering has taken different shapes and forms with increasing levels of complexity.

The AI2 Reasoning Challenge (ARC; https://allenai.org/data/arc) is a data set of 7,787 genuine grade-school level, multiple-choice science questions, assembled to encourage research in advanced question answering.

In contrast to these question answering benchmarks, other tasks have appeared that require additional attention to the context. Examples include LAMBADA (LAnguage Modeling Broadened to Account for Discourse Aspects) (Paperno et al., 2016), in which the whole text must be read before an answer can be provided. More novel question answering tasks have appeared, such as BREAK (Wolfson et al., 2020), a new question understanding benchmark data set that combines multiple older VQA data sets. In addition, the data set provides high-quality, human-generated question decompositions, so-called “Question Decomposition Meaning Representations” (QDMR), that are similar to the program traces of the CLEVR data set (Johnson et al., 2017a) but at a slightly higher and richer level of abstraction. The data set is composed of 36k question and answer pairs with the corresponding question decompositions.

Furthermore, the creators of the data set also run a challenge. The data set and challenge offer an excellent opportunity to evaluate current state-of-the-art methods and to identify the issues to overcome for understanding systems.

3.3 Visual Question Answering (VQA) benchmarks

In addition to benchmarks covering only one data modality, e.g. text or images, there is recent work that combines both modalities in what is called Visual Question Answering (VQA). Examples of those data sets include VQAv1 (Agrawal et al., 2017), VQAv2 (Goyal et al., 2017), and Visual Genome (Antol et al., 2015).

An analysis of VQA algorithms by Kafle and Kanan (2017a) found that VQA data sets with natural images are biased. For instance, in many cases questions regarding color tend to be much more frequent than others, and VQA algorithms can exploit this bias, circumventing the need for true scene understanding. Consequently, the authors created the Task Driven Image Understanding Challenge (TDIUC) data set to reduce this bias. Another common balanced data set is VQAv2 by Goyal et al. (2017), where every question is associated with a pair of similar images that lead to different answers.
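One way to see how such bias can be exploited is a baseline that never looks at the image and simply returns the most frequent training answer for each question type. The following is a minimal sketch under stated assumptions: the toy (question type, answer) pairs and the function names are hypothetical and are not taken from any of the cited data sets.

```python
from collections import Counter, defaultdict


def fit_prior_baseline(train):
    """Learn the most frequent answer per question type; images are never used."""
    counts = defaultdict(Counter)
    for question_type, answer in train:
        counts[question_type][answer] += 1
    return {qt: c.most_common(1)[0][0] for qt, c in counts.items()}


def accuracy(baseline, test):
    """Fraction of test pairs whose answer equals the memorised prior."""
    hits = sum(baseline.get(qt) == ans for qt, ans in test)
    return hits / len(test)


# Hypothetical toy data: a skewed answer distribution versus a balanced one.
biased = [("color", "white")] * 8 + [("color", "red")] * 2
balanced = [("color", "white")] * 5 + [("color", "red")] * 5

model = fit_prior_baseline(biased)
biased_acc = accuracy(model, biased)
balanced_acc = accuracy(model, balanced)
print(biased_acc, balanced_acc)
```

On the skewed split the image-blind baseline already looks competent, while on the balanced split it falls back to chance, which is the effect that balanced data sets are designed to expose.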

Similarly, synthetic data sets such as SHAPES (Andreas et al., 2015) and CLEVR (Johnson et al., 2017a) were created to control bias but also to determine which question types pose problems for VQA algorithms; for instance, counting of objects (Hu et al., 2017a). SHAPES is a small data set with 64 images of colored geometric shapes in different arrangements. CLEVR is much larger with 999,968 questions, 13 question types and 100,000 images of 3D rendered geometric objects such as cubes, spheres and cylinders of two different sizes, eight colors and two materials.

To overcome the bias that exists in previous benchmarks, CLEVR followed the approach of generating synthetic data, which provides control over the data being generated and at the same time provides challenging tasks that require reasoning about the objects in the scene. Neuro-symbolic reasoning methods (Gan et al., 2017) have surpassed human performance on this task. Recently, this benchmark has been extended with an animated version in which the task is to predict a future outcome of the animation (Yi et al., 2019).
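The idea of controlling the data distribution through synthetic generation can be sketched in a few lines. This is a simplified illustration, not the CLEVR generator: it only samples attribute combinations uniformly (no rendering, no spatial layout, no balancing of question-answer pairs), and the concrete color and material names are assumptions added for the example.

```python
import random

# Attribute vocabulary in the spirit of CLEVR: three shapes, two sizes,
# eight colors and two materials. The concrete names are assumptions.
SHAPES = ["cube", "sphere", "cylinder"]
SIZES = ["small", "large"]
COLORS = ["gray", "red", "blue", "green", "brown", "purple", "cyan", "yellow"]
MATERIALS = ["rubber", "metal"]


def sample_scene(rng, n_objects):
    """Draw every attribute independently and uniformly, so no value is favoured."""
    return [
        {
            "shape": rng.choice(SHAPES),
            "size": rng.choice(SIZES),
            "color": rng.choice(COLORS),
            "material": rng.choice(MATERIALS),
        }
        for _ in range(n_objects)
    ]


def count_objects(scene, **attrs):
    """Ground-truth answer to 'How many objects with these attributes are there?'."""
    return sum(all(obj[k] == v for k, v in attrs.items()) for obj in scene)


rng = random.Random(0)  # fixed seed so the generated scene is reproducible
scene = sample_scene(rng, n_objects=5)
answer = count_objects(scene, shape="cube")
print(scene, answer)
```

Because the generator knows every object it placed, the ground-truth answer to each question is available by construction, which is what makes this kind of benchmark cheap to label and easy to balance.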

Additional benchmarks that address more complex reasoning have been created, such as Math SemEval (Hopkins et al., 2019), in which the task is to solve mathematical problems that in some cases require understanding of the image in which the problem is depicted.

3.4 Commonsense inference benchmarks

In this area of AI, there is a critical need for integrating different modes of reasoning (e.g. symbolic reasoning through deduction and statistical reasoning based on large amounts of data), as well as for benchmarks and evaluation metrics that can quantitatively measure research progress (Davis and Marcus, 2015).

In recent years, there has been a surge of research activities in the NLP community to tackle commonsense reasoning and inference through ever-growing benchmark tasks. These tasks range from earlier textual entailment tasks, e.g. the Recognizing Textual Entailment (RTE) Challenges (Dagan et al., 2005), to more recent tasks that require a comprehensive understanding of everyday physical and social commonsense, e.g. the Story Cloze Test (Mostafazadeh et al., 2016) or SWAG (Zellers et al., 2018). An increasing effort has been devoted to extracting commonsense knowledge from existing data (e.g. Wikipedia) or acquiring it directly from crowd workers. Many learning and inference approaches have been developed for these benchmark tasks, ranging from earlier symbolic and statistical approaches to more recent neural approaches (Storks et al., 2019).

Many commonsense benchmarks are based upon classic language processing problems. The scope of these benchmark tasks ranges from more focused tasks, such as coreference resolution and named entity recognition, to more comprehensive tasks and applications, such as question answering and textual entailment. More focused tasks tend to be useful in creating component technology for NLP systems, building upon each other toward more comprehensive tasks. Meanwhile, rather than restricting tasks by the types of language processing skills required to perform them, a common characteristic of earlier benchmarks, recent benchmarks are more commonly geared toward particular types of commonsense knowledge and reasoning. Some benchmark tasks focus on singular commonsense reasoning processes, e.g., temporal reasoning, requiring a small amount of commonsense knowledge, while others focus on entire domains of knowledge, e.g., social psychology, thus requiring a larger set of related reasoning skills. Furthermore, some benchmarks include a more comprehensive mixture of everyday commonsense knowledge, demanding a more complete commonsense reasoning skill set (Storks et al., 2019). Some of the commonsense inference benchmarks in NLP are as follows:

  • TriviaQA (Joshi et al., 2017): a corpus of webcrawled trivia and quiz-league websites together with evidence documents from the web.

  • CommonsenseQA (Talmor et al., 2018): consists of 9,000 crowdsourced multiple-choice questions with a focus on relations between entities that appear in ConceptNet.

  • NarrativeQA (Kočiskỳ et al., 2018): provides full novels and other long texts as evidence documents and contains approximately 30 crowdsourced questions per text.

  • NewsQA (Trischler et al., 2016): provides news texts with crowdsourced questions and answers, which are spans of the evidence documents.

  • The Story cloze test and the ROC data set (Mostafazadeh et al., 2016): systems have to find the correct ending to a 5-sentence story, using different types of commonsense knowledge.

  • SWAG (Zellers et al., 2018): is a Natural Language Inference (NLI) data set of 113k highly varied grounded situations for commonsense application with a focus on difficult commonsense inferences.

  • WikiSQL (Zhong et al., 2017): is a corpus of 87,726 hand-annotated instances of natural language questions, SQL queries, and SQL tables. This data set was created as the benchmark data set for the table-based question answering task.

  • Datasets for graph-based learning (Xu et al., 2018): i) SDPDAG whose graphs are directed acyclic graphs (DAGs); ii) SDPDCG whose graphs are directed cyclic graphs (DCGs) that always contain cycles; iii) SDPSEQ whose graphs are essentially sequential lines.

  • HotpotQA (Yang et al., 2018) and WorldTree (Jansen et al., 2018): to provide explicit gold explanations that serve as training and evaluation instruments for multi-hop inference models. These data sets were created to overcome the existing limits on the length of inferences in traversing the knowledge graphs (Khashabi et al., 2019).

  • Event2Mind (Rashkin et al., 2018): this corpus has 25,000 narrations about everyday activities and situations, for which the best performing model is ConvNet (Rashkin et al., 2018).

  • Winograd and Winograd NLI Schema Challenge (Mahajan, 2018): employs Winograd Schema questions that require the resolution of anaphora, i.e. the model should identify the antecedent of an ambiguous pronoun.

Common sense knowledge bases include:

  • ConceptNet (Speer et al., 2017): contains over 21 million edges and 8 million nodes (1.5 million nodes in the partition for the English vocabulary), generating triples of the form (C1, r, C2): the natural-language concepts C1 and C2 are associated by commonsense relation r.

  • WebChild (Tandon et al., 2017): is a large collection of commonsense knowledge, automatically extracted from Web contents. WebChild contains triples that connect nouns with adjectives via fine-grained relations. The arguments of these assertions, nouns and adjectives, are disambiguated by mapping them onto their proper WordNet senses.

  • Never Ending Language Learner (NELL) (Mitchell et al., 2018): is C.M.U.’s learning agent that actively learns relations from the web and has kept expanding its knowledge base 24/7 since 2010. It has about 80 million facts from the web with varying confidence. It continuously learns facts and also keeps improving its reading competence and thus learning accuracy.

  • ATOMIC (Sap et al., 2019): is a new knowledge base that focuses on procedural knowledge. Triples are of the form (Event, r, {Effect|Persona|Mental-state}), where head and tail are short sentences or verb phrases and r represents an if-then relation type.
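The knowledge bases above all reduce to collections of (head, relation, tail) triples. As a rough illustration of that data structure, the sketch below stores triples in an in-memory index and looks them up by head and relation. The class, its methods and the example facts are hypothetical and do not correspond to the API of any of the listed resources.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    head: str
    relation: str
    tail: str


class TripleStore:
    """Tiny in-memory index of commonsense triples, keyed by head concept."""

    def __init__(self):
        self._by_head = defaultdict(list)

    def add(self, head, relation, tail):
        self._by_head[head].append(Triple(head, relation, tail))

    def query(self, head, relation=None):
        """Return the tails for a head, optionally restricted to one relation."""
        return [
            t.tail
            for t in self._by_head.get(head, [])
            if relation is None or t.relation == relation
        ]


kb = TripleStore()
# Illustrative facts only: concept-to-concept triples and one if-then style triple.
kb.add("dog", "IsA", "animal")
kb.add("dog", "CapableOf", "bark")
kb.add("PersonX pays PersonY a compliment", "xIntent", "to be nice")

print(kb.query("dog", "IsA"))
print(kb.query("PersonX pays PersonY a compliment"))
```

The same structure covers both styles described above: short concepts as head and tail, or whole event phrases as head and tail with an if-then relation between them.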

In terms of tackling the task of commonsense inference in NLP, some methods propose to explicitly incorporate commonsense knowledge, and more recent works focus on using pre-trained deep learning representations such as BERT or XLNet. Da and Kusai (2019) investigated a number of aspects of BERT’s commonsense representation abilities to provide a better understanding of the capability of these models. Their findings demonstrate that BERT is able to encode various commonsense features in its embedding space, and that if there is a deficiency in its representation, the issue can be solved by pre-training with additional data related to the deficient attributes. Li et al. (2019) follow a similar approach over the COIN 2019 shared task data, and they obtain the best performance for both tasks by relying on XLNet to obtain contextual representations.

On the other hand, explicitly incorporating commonsense requires injecting knowledge into pre-trained deep learning models. Inspired by Bauer et al. (2018), Ma et al. (2019) proposed to convert concept-relation tokens into regular tokens, which are later used to generate a pseudo-sentence. Embeddings of the converted pseudo-sentences are concatenated with other representations for commonsense inference.
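The conversion of concept-relation tokens into a pseudo-sentence can be pictured with a small template lookup. This is only a sketch of the general idea: the template table, the separator and the function names are assumptions made for illustration, not the mapping used by Ma et al. (2019), and the embedding and concatenation steps are left out.

```python
# Hypothetical templates that turn a relation token into ordinary words.
RELATION_TEMPLATES = {
    "IsA": "is a",
    "UsedFor": "is used for",
    "AtLocation": "is at",
}


def triple_to_pseudo_sentence(head, relation, tail):
    """Replace the relation token with regular tokens and join the three parts."""
    phrase = RELATION_TEMPLATES.get(relation, relation.lower())
    return f"{head} {phrase} {tail}"


def build_pseudo_sentence(triples):
    """Concatenate several converted triples into one pseudo-sentence."""
    return " . ".join(triple_to_pseudo_sentence(*t) for t in triples)


facts = [("knife", "UsedFor", "cutting"), ("knife", "AtLocation", "kitchen")]
pseudo = build_pseudo_sentence(facts)
print(pseudo)
```

The resulting string consists only of regular vocabulary tokens, so it can be passed through the same pre-trained encoder as the question and answer text.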

Apart from reading comprehension, there are other commonsense inference tasks that are being used for evaluation. In the work of Huminski et al. (2019), the goal is to perform commonsense inference in human-robot communication. This is an important task because human-human communications often describe the expected change of state after an action, while the actions needed to achieve the change are not detailed. In their work, a method to transform high-level result-verb commands into action-verb commands is implemented and evaluated. Because of the lack of manually annotated data, the proposed solution is a pipeline relying on WordNet, n-gram frequencies, and a search engine to predict the mapping.

Another relevant commonsense knowledge task is the prediction of event sequences. This has been addressed by relying on sequence-to-sequence models trained on annotated data, and data sets have been manually annotated for implementation and evaluation of these systems (Nguyen et al., 2017). Recent work on this task includes the use of conditional variational autoencoders for improved diversity, and the extension of the original annotated data to cover multiple follow-up events (Kiyomaru et al., 2019).

Semantic plausibility is another commonsense-related task that has been used as a test bed for analyzing knowledge representations. In Porada et al. (2019), self-supervision is applied to learn from text via pre-trained models (BERT), and state-of-the-art performance is achieved without requiring manual annotations.

Finally, recent research in this area has focused on physical properties of objects, and being able to make comparisons by applying commonsense reasoning. As an example of this task, Goel et al. (2019) observe that probing pre-trained models (GloVe, ELMo, and BERT) can reliably perform physical comparisons between objects.

In terms of graph-based commonsense inference, question answering in bAbI (Li et al., 2015), shortest path, and natural language generation (Song et al., 2017) are tasks that have been modelled with a sequence-to-sequence (Seq2Seq) learning technique (Xu et al., 2018) to tackle the challenge of accurately converting a graph into the appropriate sequence. To this end, a general end-to-end graph-to-sequence neural encoder-decoder model is proposed that maps an input graph to a sequence of vectors and uses an attention-based LSTM method to decode the target sequence from these vectors (Xu et al., 2018). Using the proposed bi-directional node embedding aggregation strategy, the model converges rapidly to the optimal performance and outperforms existing graph neural networks, Seq2Seq, and Tree2Seq models.

Some graph-based approaches go a step further and try to answer complex questions requiring a combination of multiple facts, and to identify why those answers are correct (Thiem and Jansen, 2019). Combining multiple facts to answer questions is often modeled as a multi-hop graph traversal problem, which suffers from semantic drift, i.e. the tendency for chains of reasoning to drift to unrelated topics. This semantic drift greatly limits the number of facts that can be combined in both free-text and knowledge-base inference. The issue is addressed by extracting large high-confidence multi-hop inference patterns, generated by abstracting large-scale explanatory structure from a corpus of detailed explanations. To that end, a prototype tool for identifying common inference patterns from corpora of semi-structured explanations has been released.
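Multi-hop inference as graph traversal can be illustrated with a breadth-first search that chains facts from a start concept to a goal concept. The sketch below is a generic traversal with a hop limit, not the pattern-extraction method of Thiem and Jansen (2019); the toy graph, the relation names and the function name are assumptions.

```python
from collections import deque


def multi_hop_paths(graph, start, goal, max_hops):
    """Return every cycle-free chain of facts from start to goal within max_hops."""
    paths = []
    queue = deque([(start, [])])
    while queue:
        node, chain = queue.popleft()
        if node == goal and chain:
            paths.append(chain)
            continue
        if len(chain) == max_hops:
            continue  # hop limit reached: a crude guard against drifting chains
        visited = {start} | {tail for _, _, tail in chain}
        for relation, tail in graph.get(node, []):
            if tail not in visited:
                queue.append((tail, chain + [(node, relation, tail)]))
    return paths


# Hypothetical toy knowledge graph: node -> list of (relation, neighbour).
graph = {
    "ice": [("melts_into", "water")],
    "water": [("evaporates_into", "vapor"), ("used_for", "drinking")],
    "drinking": [("done_by", "animals")],
}

two_hop = multi_hop_paths(graph, "ice", "vapor", max_hops=2)
one_hop = multi_hop_paths(graph, "ice", "vapor", max_hops=1)
print(two_hop, one_hop)
```

Every extra hop multiplies the number of candidate chains, most of which lead to unrelated concepts such as "animals" in this toy graph, which gives an intuition for why longer chains are prone to semantic drift.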

3.5 Commonsense inference benchmarks that need Natural Language Processing

  • CommonsenseQA (Talmor et al., 2018): consists of 9,000 crowdsourced multiple-choice questions with a focus on relations between entities that appear in ConceptNet.

  • NewsQA (Trischler et al., 2016): provides news texts with crowdsourced questions and answers, which are spans of the evidence documents.

  • The Story cloze test and the ROC data set (Mostafazadeh et al., 2016): systems have to find the correct ending to a 5-sentence story, using different types of commonsense knowledge.

  • SWAG (Zellers et al., 2018): is a Natural Language Inference (NLI) data set of 113k highly varied grounded situations for commonsense application with a focus on difficult commonsense inferences.

  • WikiSQL (Zhong et al., 2017): is a corpus of 87,726 hand-annotated instances of natural language questions, SQL queries, and SQL tables. This data set was created as the benchmark data set for the table-based question answering task.

  • Datasets for graph-based learning (Xu et al., 2018): i) SDPDAG whose graphs are directed acyclic graphs (DAGs); ii) SDPDCG whose graphs are directed cyclic graphs (DCGs) that always contain cycles; iii) SDPSEQ whose graphs are essentially sequential lines.

  • Event2Mind (Rashkin et al., 2018): this corpus has 25,000 narrations about everyday activities and situations, for which the best performing model is ConvNet.

  • Winograd and Winograd NLI Schema Challenge (Mahajan, 2018): employs Winograd Schema questions that require the resolution of anaphora, i.e. the model should identify the antecedent of an ambiguous pronoun.

There has been progress in defining tasks for measuring understanding, and recently there have been new developments in benchmarks requiring understanding (e.g. the Abstraction and Reasoning Corpus (ARC) (Chollet, 2019)). Several directions might be relevant to consider for extending the current approaches. Such extensions would include understanding the composition of objects in different media and being able to reason about these compositions, as well as integrating different media through a common representation and reasoning over it.

4 Research streams

An AI that can understand requires certain technical capabilities. In this paper, the following are considered relevant features of such a system: it (i) is capable of hierarchical and compositional knowledge representation, (ii) provides multi-modal structure-to-structure mapping, (iii) integrates symbolic and non-symbolic knowledge, and (iv) supports symbolic reasoning with uncertainties.

Different research streams address these desired capabilities in some way, and this paper describes the main trends. Recently, there has been a renewed focus on combining the best of neural networks and symbolic systems (Marcus, 2020). These so-called “neuro-symbolic” systems (Besold et al., 2017) have achieved impressive improvements in Visual Question Answering (VQA) tasks (CLEVR (Johnson et al., 2017a), Neuro-symbolic concept learner (Mao et al., 2019a)), even if they are currently limited to image scenes with simple objects such as cubes, spheres and cylinders. (VQA systems have been applied to real-world scenes, but even then objects are treated as units and not as entities composed of parts, e.g. a person with head, arms, legs and body. Furthermore, performance is significantly lower for real-world scenes.) We start by describing the use of neuro-symbolic systems in Visual Question Answering in Section 4.1.

Another related research trend aims at hierarchically structured knowledge representations for different tasks (Section 4.2), and in this stream we find methods that focus on multimodal representation (Section 4.2.1), tree-based representations (Section 4.2.2), hyperbolic embeddings (Section 4.2.3), and knowledge graph learning (Section 4.2.4).

Understanding images and documents is another of the main areas of research where deep understanding capabilities are required. Fully processing an image scene requires comprehension of the objects it contains and the relations between them. For documents, figures, tables and other structured images need to be decomposed into elements to enable reasoning. Relevant research regarding scene and document understanding will be discussed in Section 4.3.

A required capability for understanding is to learn from example data a (partially) symbolic knowledge representation that allows reasoning and question answering. Research in program synthesis (Section 4.4.2), neuro-symbolic computing (Section 4.4) and commonsense inference (Section 4.5) is of specific interest here.

4.1 Visual Question Answering

Visual Question Answering (VQA) is the task of answering questions regarding the contents of a visual scene (Malinowski and Fritz, 2014; Antol et al., 2015; Kafle and Kanan, 2017b), for instance questions such as “How many black dogs are to the left of the tree?” The scenarios in this task deal with multimodal data (images, questions in plain text) and require structure-to-structure mapping, and symbolic reasoning has to be applied to deal with relations described in both images and text.

VQA has experienced tremendous progress in recent years due to novel neuro-symbolic methods (see Section 4.4) and well-designed benchmarks (see Section 3.3).

The CLEVR data set (see Section 3.3) was especially influential in the development of better VQA methods. Early VQA algorithms achieved very modest accuracies (Johnson et al., 2017a) on CLEVR but rapid progress was made. Current methods can answer very complicated questions such as ”There is an object that is both on the left side of the brown metal block and in front of the large purple shiny ball; how big is it?” with near perfect and super-human accuracy (Yi et al., 2018). However, Kafle and Kanan (2017a) point out that CLEVR is specifically designed for compositional language approaches and requires demanding language reasoning, but only limited visual understanding due to the simple objects.

Beyond images, questions and answers, the CLEVR data set also provides so-called ”ground-truth programs” for each sample, which define functional programs that answer the sample question. Early Neural Module Network (NMN) approaches for VQA required all ground-truth programs for training (Hu et al., 2017a; Suarez et al., 2018), while later methods used fewer and fewer ground-truth programs (Johnson et al., 2017b). The VQA algorithm by Yi et al. (2018) uses only a small subset, and the recent StackNMN (Hu et al., 2018) and the Neuro-symbolic concept learner (Mao et al., 2019a) learn without any ground-truth programs at all, using images, questions and answers alone. Finally, Shi et al. (2019) demonstrated that the Neural Module Network framework can achieve perfect accuracy on the CLEVR benchmark provided the objects and their relations (scene graph) are perfectly extracted from the scene.
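To make the notion of a functional program concrete, the following minimal sketch executes a CLEVR-style program over a hand-written symbolic scene. The scene encoding, the operation names (`filter_objects`, `relate`, `intersect`, `unique`) and the convention that a larger `y` means "in front" are illustrative assumptions, not the actual CLEVR program vocabulary or any published implementation.

```python
# Hypothetical symbolic scene: each object is a dict of attributes plus a 2D position.
SCENE = [
    {"id": 0, "shape": "cube", "color": "brown", "material": "metal", "size": "large", "x": 5, "y": 2},
    {"id": 1, "shape": "sphere", "color": "purple", "material": "metal", "size": "large", "x": 6, "y": 6},
    {"id": 2, "shape": "cylinder", "color": "red", "material": "rubber", "size": "small", "x": 1, "y": 8},
]


def filter_objects(objects, **attrs):
    """Keep the objects whose attributes match all given key/value pairs."""
    return [o for o in objects if all(o[k] == v for k, v in attrs.items())]


def unique(objects):
    """Return the single object in the list, or fail if the reference is ambiguous."""
    if len(objects) != 1:
        raise ValueError(f"expected exactly one object, got {len(objects)}")
    return objects[0]


def relate(objects, anchor, relation):
    """Return the objects standing in the given spatial relation to the anchor."""
    if relation == "left":
        return [o for o in objects if o["x"] < anchor["x"]]
    if relation == "front":  # assumption: larger y is closer to the camera
        return [o for o in objects if o["y"] > anchor["y"]]
    raise ValueError(f"unknown relation: {relation}")


def intersect(first, second):
    """Keep the objects of `first` that also occur in `second`."""
    ids = {o["id"] for o in second}
    return [o for o in first if o["id"] in ids]


def answer_size_question(scene):
    """'... left of the brown metal block and in front of the large purple shiny ball; how big is it?'"""
    block = unique(filter_objects(scene, shape="cube", color="brown", material="metal"))
    ball = unique(filter_objects(scene, shape="sphere", color="purple", size="large", material="metal"))
    target = unique(intersect(relate(scene, block, "left"), relate(scene, ball, "front")))
    return target["size"]


print(answer_size_question(SCENE))
```

Once the scene is available in symbolic form, answering reduces to deterministic program execution, which is consistent with the observation that accuracy then depends on how well the scene graph is extracted.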

4.2 Compositional and hierarchical representations

Compositional and hierarchical representations are often used to solve some of the most challenging tasks, including visual question answering, visual grounding, and compositional learning and reasoning. Most of the existing work in the field focuses on tasks that make use of multi-modal data sources (images and text), and we describe those methods in Section 4.2.1. Tree representations, and in particular Tree-LSTMs, are also a popular way to represent and generate compositional knowledge, and we cover them in Section 4.2.2. Next, we introduce hyperbolic embeddings in Section 4.2.3, where we describe methods that use hyperbolic vector spaces to better capture hierarchical dependencies. Finally, another area where compositional and hierarchical representations are required is the automatic learning of knowledge graphs, which we describe in Section 4.2.4.

4.2.1 Hierarchical representations in multi-modal data

In order to deal with multimodal data, Lu et al. (2016a) proposed a co-attention mechanism for performing visual question answering by attending to both textual descriptions and subareas in images. This approach makes it possible to answer compositional questions such as ”what is the man holding a snowboard on top of a snow covered?”, ”how many snowboarders in formation in the snow, four is sitting?”, etc.

Agrawal et al. (2017) proposed compositional-VQA (C-VQA), which is created by rearranging the VQAv1 dataset (see Section 3.3) so that QA pairs in the C-VQA test data set are not present in the C-VQA training data set, but most concepts constituting the QA pairs in the test data are present in the training data. The idea is to ensure that the question-answer (QA) pairs in the C-VQA test data are compositionally novel with respect to those in the C-VQA training data. They evaluate existing VQA models under this new setting, and show that their performance degrades considerably.

Hu et al. (2017b) propose compositional modular networks (CMN; http://ronghanghu.com/cmn) to localize a referential expression by grounding the components of the expression and exploiting their interactions, in an end-to-end manner. The model performs the task of grounding in a three-step procedure: (a) an expression is parsed into subject, relationship and object, with attention for language representation; (b) the subject or object is matched with each image region with a unary score; and (c) the relationships and region pairs are matched with a pairwise score. To perform the same task, Choi et al. (2018) propose the Recursive Grounding Tree (Rvg-Tree), which is inspired by the intuition that any language expression can be recursively decomposed into two constituent parts, and the grounding confidence score can be recursively accumulated by calculating their grounding scores as returned by the sub-trees.
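Steps (b) and (c) can be illustrated with a small sketch that combines unary and pairwise scores to select the best (subject region, object region) pair. The scores are hard-coded placeholders, whereas in the model they are produced by learned modules; combining them by simple addition and the helper name `best_region_pair` are assumptions made for illustration.

```python
from itertools import product


def best_region_pair(subject_scores, object_scores, pairwise_scores):
    """Return the region pair (i, j) with the highest combined grounding score.

    subject_scores[i]     : unary score of region i for the subject phrase
    object_scores[j]      : unary score of region j for the object phrase
    pairwise_scores[i][j] : score of the region pair (i, j) for the relationship phrase
    """
    best_pair, best_score = None, float("-inf")
    for i, j in product(range(len(subject_scores)), range(len(object_scores))):
        total = subject_scores[i] + object_scores[j] + pairwise_scores[i][j]
        if total > best_score:
            best_pair, best_score = (i, j), total
    return best_pair, best_score


# Three candidate regions; the values are placeholders for learned scores.
subject_scores = [0.9, 0.1, 0.2]
object_scores = [0.1, 0.8, 0.3]
pairwise_scores = [
    [0.0, 0.7, 0.1],
    [0.2, 0.0, 0.1],
    [0.1, 0.3, 0.0],
]
pair, score = best_region_pair(subject_scores, object_scores, pairwise_scores)
print(pair, round(score, 2))
```

The exhaustive loop over all region pairs is quadratic in the number of regions, which is acceptable for the modest number of region proposals typically considered per image.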

Purushwalkam et al. (2019) consider that in order to perform compositional reasoning, it is crucial to capture the intricate interactions between the image, the object and the attribute; while existing research only captures the contextual relationship between objects and attributes. Assuming the original feature space of images is rich enough, inference entails matching image features to an embedding vector of object-attribute pairs. They propose the Task-driven Modular Networks (TMN) to perform the task of compositional learning.

4.2.2 Tree-LSTMs and tree structures

A popular hierarchical structure that has been widely used for compositionality is the tree. A number of neural approaches have been proposed to handle tree structures in different ways: as input, for representation, and for generation. One of the earliest models, the recursive neural network, was developed by Socher et al. (2013b). In this model, sequential data (e.g. natural language sentences) together with tree structures (e.g. the constituency parse tree of the sentence) are fed to the neural network, which represents terminal nodes with word embeddings and non-terminal nodes with the concatenation of all their child nodes. This tree-structured representation is applied to sentiment classification. With the new representation, the system is capable of performing sentiment classification at a more fine-grained level by classifying non-root nodes instead of only the root node of the tree.

The Tree-LSTM neural network (Tai et al., 2015) took this work one step further by replacing the concatenation operation with more sophisticated gates. Its implementation differs from the generic LSTM, which calculates the hidden state at each step by applying input, output, and forget gates to the hidden states from previous steps and the input data of the current step. Tree-LSTMs apply the input, output and forget gates for each non-terminal node based on the hidden states of all child nodes, and the input of the current non-terminal node. Tai et al. (2015) presented two implementations of the Tree-LSTM: the Child-Sum Tree-LSTM and the N-ary Tree-LSTM.
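For orientation, the node update of the Child-Sum variant is commonly written as follows, where $C(j)$ denotes the children of node $j$, $x_j$ its input, $\sigma$ the logistic sigmoid and $\odot$ element-wise multiplication. The symbol names are a notational assumption and may differ in detail from the original paper.

```latex
\begin{aligned}
\tilde{h}_j &= \sum_{k \in C(j)} h_k \\
i_j &= \sigma\left(W^{(i)} x_j + U^{(i)} \tilde{h}_j + b^{(i)}\right) \\
f_{jk} &= \sigma\left(W^{(f)} x_j + U^{(f)} h_k + b^{(f)}\right), \quad k \in C(j) \\
o_j &= \sigma\left(W^{(o)} x_j + U^{(o)} \tilde{h}_j + b^{(o)}\right) \\
u_j &= \tanh\left(W^{(u)} x_j + U^{(u)} \tilde{h}_j + b^{(u)}\right) \\
c_j &= i_j \odot u_j + \sum_{k \in C(j)} f_{jk} \odot c_k \\
h_j &= o_j \odot \tanh(c_j)
\end{aligned}
```

Note that there is one forget gate $f_{jk}$ per child, so the node can selectively retain information from individual sub-trees, while the input and output gates depend only on the summed child state $\tilde{h}_j$.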

The vanilla Tree-LSTM assumes input from both texts and their corresponding tree structures. Choi et al. (2018) propose the Gumbel Tree-LSTM, which learns to compose task-specific tree structures from plain text data only. The model uses a Straight-Through Gumbel-Softmax estimator to decide on the parent node among candidates dynamically, and to calculate gradients of the discrete decision.
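The forward computation of a Straight-Through Gumbel-Softmax choice can be sketched in a few lines of plain Python. This is only an illustration of the sampling step under stated assumptions: the candidate scores and the temperature are made up, and since no automatic differentiation is involved, the backward pass is described in comments rather than implemented.

```python
import math
import random


def gumbel_softmax_sample(logits, temperature=1.0, rng=random):
    """Draw a relaxed (soft) one-hot sample over the candidates."""
    noisy = []
    for logit in logits:
        # Clamp the uniform sample away from 0 and 1 so that both logs are defined.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        noisy.append((logit + gumbel_noise) / temperature)
    peak = max(noisy)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def straight_through(soft_sample):
    """Discretize the soft sample into a hard one-hot vector for the forward pass.

    In an autodiff framework the backward pass would use the gradient of
    `soft_sample` in place of the (zero almost everywhere) gradient of the argmax.
    """
    winner = max(range(len(soft_sample)), key=soft_sample.__getitem__)
    return [1.0 if i == winner else 0.0 for i in range(len(soft_sample))]


# Hypothetical scores for three candidate parent nodes.
rng = random.Random(0)
soft = gumbel_softmax_sample([2.0, 0.5, -1.0], temperature=0.5, rng=rng)
hard = straight_through(soft)
print(soft, hard)
```

Lower temperatures push the soft sample closer to a one-hot vector, at the cost of higher-variance gradients.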

There are multiple versions of tree-structured LSTMs; however, Havrylov et al. (2019) demonstrate that the learned trees do not resemble any semantic or syntactic formalism. They propose latent-LSTM to model the tree learning task as an optimization problem in which the search space of possible tree structures is explored. Learning new concepts via a search over structure space is not uncommon; Lake et al. (2018) proposed a new computational model to discover new structural organizations by using a broad hypothesis space with a preference for sparse connectivity. Different from probabilistic inference, where a predefined set of structural forms (e.g. tree, ring, chain, grid) is provided, here they explore more general structures. Demeter and Downey (2019) presented the Neural-Symbolic Language Model (NSLM), which consists of two components: a hierarchical neural network language model (NNLM), which assigns a probability to each word class, and a micro-model that allocates this probability over the words in that class. They specifically defined a micro-model to incorporate logic for words referring to numbers. Note that the logic aspect is modelled at the leaf node level, i.e. for individual classes of the vocabulary.

Finally, apart from representation, some works also introduce methodologies for generating tree structures. Corro and Titov (2018) proposed the tree-structured variational auto-encoder (VAE) for such purposes. A generic VAE (Doersch, 2016) is composed of an encoder, which extracts the hidden representation, $z$, of the input data, $x$, and a decoder which reconstructs the input data by performing a normalized sampling from the latent representation, $z$ (note that the encoder parameters $\phi$ and the decoder parameters $\theta$ are parameters to be learned). In the tree-structured VAE, the encoder is replaced with a level-by-level representation extraction process that merges the representations of terminal nodes into the representations of non-terminal nodes in higher levels of the tree. The decoder, on the other hand, is replaced with a generative process that splits the representation extracted by the encoder into representations of non-terminal nodes and finally terminal nodes. The values of terminal nodes are determined from their representation via argmax. In both the encoder and the decoder, the merge and split operations are implemented using multi-layer perceptrons (MLPs). Using tree-structured VAEs, Jin et al. (2018) propose the automatic generation of molecular graphs.

4.2.3 Hyperbolic embeddings

As seen above, learning embeddings (e.g. Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014)) of symbolic data such as text, image parts or graphs is a common approach to capture the semantics of entities and enable similarity measures between them. While data often exhibits latent hierarchical structure, most methods generate embeddings in Euclidean vector space, which does not capture hierarchical dependencies adequately. Other embeddings such as linear relational embeddings (Paccanaro and Hinton, 2001), holographic embeddings (Nickel et al., 2016), complex embeddings (Trouillon et al., 2016) and deep recursive network embeddings (Zhu et al., 2020) have been proposed to address this issue. Here we focus on hyperbolic embeddings, which have demonstrated better model generalisation and more interpretable representations for data with an underlying hierarchical structure.

Nickel and Kiela (2017) introduce an algorithm to learn a hyperbolic embedding in an $n$-dimensional Poincaré ball and show improved representation of WordNet (Miller, 1995) noun hierarchies in this space compared to the traditional Euclidean embedding. Further improvements of the method were achieved by Ganea et al. (2018) and De Sa et al. (2018), where the latter achieved a nearly perfect reconstruction of WordNet hypernym graphs.
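The geometry behind these embeddings can be made concrete with the standard distance function of the Poincaré ball model, $d(u, v) = \operatorname{arcosh}\left(1 + 2\,\frac{\lVert u - v \rVert^2}{(1 - \lVert u \rVert^2)(1 - \lVert v \rVert^2)}\right)$. The sketch below is a plain-Python illustration of this formula; the function name and the example points are assumptions for illustration and not taken from any particular implementation.

```python
import math


def poincare_distance(u, v):
    """Hyperbolic distance between two points strictly inside the unit ball."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))


# Two pairs of points with the same Euclidean distance (0.1) between them.
near_origin = poincare_distance([0.0, 0.0], [0.1, 0.0])
near_boundary = poincare_distance([0.85, 0.0], [0.95, 0.0])
print(near_origin, near_boundary)
```

Distances grow rapidly towards the boundary of the ball, so that general concepts can be placed near the origin and an exponentially growing number of specific concepts near the boundary, which matches the branching of a tree.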

Similarly, Chamberlain et al. (2017) employ Poincaré embeddings to achieve improved representation on Zachary’s karate club (Zachary, 1977) and other small-scale benchmarks. Mathieu et al. (2019) modify Variational Auto Encoders (Kingma and Ba, 2014) to embed the latent space in a Poincaré ball and demonstrate better generation of MNIST handwritten digits (LeCun and Cortes, 2010).

While Poincaré embeddings are easy to implement, their application to deep hierarchies is problematic due to the limited numerical precision of digital computers. For instance, on the WordNet graph approximately 500 bits of precision are needed to store values from the combinatorial embedding (De Sa et al., 2018). Yu and De Sa (2019) address this problem by proposing a tiling-based model for hyperbolic embeddings. Noteworthy in this context is also the paper by Nickel and Kiela (2018), where they revisit their earlier work on Poincaré embeddings and find that learning embeddings in the Lorentz model is substantially more efficient than in the Poincaré-ball model.

Considering that a crucial part of understanding is to extract structured representations of text, the work by Le et al. (2019) on inferring concept hierarchies from text corpora via hyperbolic embeddings is especially relevant. Another fundamental capability is structure-to-structure mapping, which has been addressed by Alvarez-Melis et al. (2019), using unsupervised hierarchy matching in hyperbolic spaces. Finally, the works by Suzuki et al. (2019) and Nagano et al. (2019) are potentially useful to learn the assembly of neural network modules with a differentiable method instead of reinforcement learning.

4.2.4 Knowledge graph learning

A system with understanding capabilities needs access to structured information to reason on. Examples of such resources are knowledge graphs, which are defined by entities as nodes and relations of different types as edges (Wang et al., 2014). They are quite popular and have been used in AI in many domains, from linguistic tools such as WordNet (Fellbaum, 2012) to domain-specific resources such as the OBO Foundry (Smith et al., 2007) ontologies. These resources are expensive to build manually, and thus methods have been developed to complement them or to exploit the existing knowledge using automatic means (Paulheim, 2017; Dash et al., 2019).

There is an existing body of literature on learning knowledge from unstructured sources such as text. NLP methods have been extensively studied within information extraction and retrieval (Mao et al., 2019b). State-of-the-art deep learning based methods can be used to learn additional information, but special care is needed to validate the newly extracted information and to integrate it into an existing knowledge graph.

Research has also been devoted to extending the knowledge that is already in an existing knowledge graph. Examples of such methods include link prediction methods (Zhang and Chen, 2018; Nickel et al., 2015), which specifically focus on predicting whether two nodes in a graph should be linked. An additional body of research considers existing knowledge to infer additional facts using tensors (Socher et al., 2013a). Tensor based methods have problems with large knowledge graphs or focus on a single relation (Cai et al., 2018); graph embedding methods solve both issues (Wang et al., 2017; Yang et al., 2014). Existing methods rely on positive information only; to obtain negative examples, a uniform distribution is typically used for sampling, which is not ideal. Adversarial methods have been applied (Cai and Wang, 2017) to alleviate this problem.
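The role of the uniform distribution can be illustrated with a sketch of uniform negative sampling: since a knowledge graph stores only true facts, a negative example is created by replacing the head or tail of a known triple with a uniformly drawn entity. The toy graph, the helper name `corrupt_uniform` and the `max_tries` safeguard are assumptions made for illustration.

```python
import random

# Toy knowledge graph as a set of (head, relation, tail) triples.
TRIPLES = {
    ("dog", "is_a", "mammal"),
    ("cat", "is_a", "mammal"),
    ("mammal", "is_a", "animal"),
}
ENTITIES = sorted({e for h, _, t in TRIPLES for e in (h, t)})


def corrupt_uniform(triple, entities, known_triples, rng=random, max_tries=100):
    """Create a negative triple by replacing head or tail with a uniformly drawn entity."""
    head, relation, tail = triple
    for _ in range(max_tries):
        candidate = rng.choice(entities)
        if rng.random() < 0.5:
            corrupted = (candidate, relation, tail)   # corrupt the head
        else:
            corrupted = (head, relation, candidate)   # corrupt the tail
        if corrupted not in known_triples:            # reject known positives
            return corrupted
    raise RuntimeError("could not find a negative triple")


rng = random.Random(0)
print(corrupt_uniform(("dog", "is_a", "mammal"), ENTITIES, TRIPLES, rng))
```

Many negatives produced this way, such as `("dog", "is_a", "dog")`, are trivially false and therefore provide little training signal, which is one way to read the motivation for adversarially generated negatives.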

4.3 Scene and document understanding

Understanding an image scene requires comprehension of the objects it contains, and the relationships between them. Objects themselves are typically composed of parts, and a complete description of a scene requires a hierarchical, multi-level representation, such as a scene graph, where nodes represent objects or parts, and links represent relationships. Similarly, document structure understanding aims at identifying physical elements in a document layout (images, tables, lists, formulas, text, etc.), and representing their relations as a tree structure.

In the following we firstly discuss a selection of state-of-the-art work in instance segmentation and object detection (Section 4.3.1). Those methods do not extract scene graphs, but provide a basis by extracting the objects in a scene. Next, we will discuss inverse graphics and graphic generation approaches that attempt to learn programs to (re)create an image (Section 4.3.2). The learned program often can be interpreted as a structured and hierarchical scene representation. We then discuss methods with a focus on image decomposition and scene representation in Section 4.3.3. Finally, we present techniques for document understanding in Section 4.3.4.

4.3.1 Object detection and instance segmentation

Object detection and instance segmentation are fast evolving fields and we point to a recent review paper by Zhao et al. (2019) for an overview. Here we focus on a few, very recent, state-of-the-art methods that could serve as the basis of a scene graph extractor. Lee and Park (2019) introduce CenterMask, a real-time anchor-free instance segmentation method that outperforms all previous state-of-the-art models at a much faster speed by adding a spatial attention-guided mask branch to the FCOS (Tian et al., 2019) object detector. Another method for instance segmentation by Wang et al. (2019b) is worth mentioning, since it is conceptually very simple with an accuracy on par with Mask R-CNN (He et al., 2017b) on the COCO data set (Lin et al., 2014). Finally a very recent improvement on Mask R-CNN with respect to speed and accuracy is BlendMask by Chen et al. (2020) that also builds on the FCOS (Tian et al., 2019) object detector.

All the methods above are supervised approaches that are fast and accurate but require large amounts of labeled data and struggle with object occlusion. Generative, probabilistic models are an alternative approach that attempts to address these issues. An early example is DRAW (Gregor et al., 2015), a recurrent neural network to recognize, localize and generate MNIST digits (among other tasks). A faster method, (D)AIR, was proposed by Eslami et al. (2016) and has recently been further improved by Stelzner et al. (2019) and Yuan et al. (2019a).

4.3.2 Inverse graphics

A different approach to model or understand an image is Inverse graphics (Baumgart, 1974). Here the task is to learn a mapping between the content of an input image and a probabilistic or symbolic scene description (e.g. a graphical DSL such as HTML). Fundamentally this is achieved by reconstructing the input image from the scene description and using the differences between the original image and its reconstruction as a training signal (e.g. via reinforcement learning or MCMC). This approach is of special interest, since it typically results in a hierarchically structured and easy to interpret representation of a scene that generalizes well.
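The reconstruct-and-compare loop can be sketched with a deliberately tiny example: the scene description is a single axis-aligned rectangle on a small binary grid, and the best description is found by minimizing the pixel difference between the rendered and the observed image. Exhaustive search is used here in place of reinforcement learning or MCMC, purely to keep the sketch short; the grid size, the scene encoding and all function names are assumptions.

```python
from itertools import product

GRID = 6  # images are GRID x GRID binary pixel arrays


def render(scene):
    """Render a scene description (x0, y0, width, height) into a binary image."""
    x0, y0, width, height = scene
    return [
        [1 if x0 <= x < x0 + width and y0 <= y < y0 + height else 0 for x in range(GRID)]
        for y in range(GRID)
    ]


def pixel_error(image_a, image_b):
    """Number of pixels in which the two images differ."""
    return sum(a != b for row_a, row_b in zip(image_a, image_b) for a, b in zip(row_a, row_b))


def infer_scene(image):
    """Return the scene description whose rendering is closest to the observed image."""
    candidates = (
        (x0, y0, width, height)
        for x0, y0 in product(range(GRID), repeat=2)
        for width in range(1, GRID - x0 + 1)
        for height in range(1, GRID - y0 + 1)
    )
    return min(candidates, key=lambda scene: pixel_error(render(scene), image))


observed = render((1, 2, 3, 2))   # stand-in for an input image
print(infer_scene(observed))
```

The recovered description is symbolic and directly interpretable (position and extent of the rectangle), which is the property that makes inverse graphics attractive; realistic scene languages have search spaces far too large for enumeration, hence the learned or stochastic search used in practice.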

Kulkarni et al. (2015) build upon the differentiable renderer (Loper and Black, 2014) to create ”Picture”, a probabilistic programming framework to construct generative vision models for different inverse graphics applications. They demonstrate its usefulness on 3D face analysis, 3D human pose estimation, and 3D object reconstruction.

Originally, inverse graphics relied largely on probabilistic scene representations. Later work includes symbolic representations; a good example is the work by Wu et al. (2017), where an encoder-decoder architecture with a discrete renderer is used to learn image representations in an XML-flavored scene description language. Similarly, Zhu et al. (2018) employ an encoder-decoder architecture based on hierarchical LSTMs to map screenshots of Graphical User Interfaces (GUIs) to a DSL that describes GUIs, and Ellis et al. (2018) learn to infer graphics programs in LaTeX from hand-drawn sketches, using a combination of deep neural networks and stochastic search. Finally, a neuro-symbolic system for inverse graphics based on a capsule network architecture (Hinton et al., 2011) was proposed by Kissner and Mayer (2019).

4.3.3 Image decomposition

The third approach to understanding or structuring image content, beyond inverse graphics and object detection, is to decompose the image into parts and objects, and to construct some form of scene graph. For instance, Burgess et al. (2019) introduce the Multi-Object Network (MONet), in which a Variational Autoencoder (VAE) together with a recurrent network is trained in an unsupervised manner to decompose 3D scenes such as CLEVR (Johnson et al., 2017a) into objects and background. Charakorn et al. (2020) also employ VAEs but focus on disentangled representations of scene properties. Recent advances along this line of unsupervised scene-mixture models that separate objects from background are IODINE (Greff et al., 2019), GENESIS (Engelcke et al., 2019) and SPACE (Lin et al., 2020).

Deng et al. (2019) go one step further and propose a generative model (RICH) that allows decomposing the objects and parts of a scene in a tree structure. Correctly recognizing occluded objects in a scene is a common problem in scene representation that specifically has been addressed by Yuan et al. (2019b).

4.3.4 Document understanding

A large portion of existing data is available in unstructured document formats such as PDF (portable document format), with over 2.7 trillion documents available in this format. Machines that can understand documents will improve the access to information that is otherwise difficult to process.

Document layout and structure understanding aims at parsing unstructured documents (e.g., PDF, scanned image) into machine readable format (e.g., XML, JSON) for down-stream applications. Document layout analysis identifies physical elements (image, table, list, formula, text, title, etc) in a document, without logical relations between the elements. Document structure analysis aims at representing documents as a tree structure to encode the logical relations.
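The difference between the two analyses can be illustrated with a small data-structure sketch: layout analysis yields a flat list of physical elements, while structure analysis arranges them in a tree that can be serialized to a machine-readable format such as JSON. The `Element` class, its field names and the example content are hypothetical and chosen for illustration only.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Element:
    """One element of a document, optionally with logically subordinate children."""
    kind: str                                   # e.g. "title", "text", "table", "list"
    content: str = ""
    children: List["Element"] = field(default_factory=list)


def to_json(root: Element) -> str:
    """Serialize the document tree into JSON for down-stream applications."""
    return json.dumps(asdict(root), indent=2)


# Output of layout analysis: physical elements without logical relations.
layout = [
    Element("title", "Results"),
    Element("text", "Accuracy is reported in Table 1."),
    Element("table", "Table 1"),
]

# Output of structure analysis: the same elements arranged as a tree.
document = Element(
    "document",
    children=[Element("section", layout[0].content, children=layout[1:])],
)
print(to_json(document))
```

Here the title becomes the heading of a section node that owns the paragraph and the table, which is the kind of logical relation a flat layout cannot express.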

Deep neural networks have been used to understand the physical layout and logical relations in documents. Convolutional neural networks (Hao et al., 2016; Chen et al., 2017a), fully-convolutional neural networks (He et al., 2017a; Kavasidis et al., 2019), region-based convolutional neural networks (Schreiber et al., 2017; Gilani et al., 2017; Staar et al., 2018; Zhong et al., 2019c), and graph neural networks (Qasim et al., 2019; Renton et al., 2019) have all been exploited to parse the physical layout of documents or to detect elements of interest (e.g., tables). Encoder-decoder networks (Zhong et al., 2019b) and visual relation extraction networks (Nguyen et al., 2019) have been adopted to infer the logical structure of documents. Layout information is also integrated in language models for NLP tasks that take unstructured documents as input (Xu et al., 2019).

4.4 Neuro-symbolic computing

4.4.1 Neuro-symbolic reasoning

Neural networks (NN) are well suited to learn from data, and have done exceptionally well for various data modalities including structured, natural language and imaging data (LeCun et al., 2015). Despite the ability of NN to achieve state-of-the-art accuracies (Tan and Le, 2019), they fail to gain a fundamental understanding of the data and rather focus on the distributions and statistical relations in the data. Another disadvantage of NN is that they often require large volumes of data in order to achieve these state-of-the-art accuracies. On the other side of the spectrum there is logical reasoning, which uses a set of facts and rules to make inferences. This allows direct interpretability of the model, something neural networks do not offer. One of the downsides of logical reasoning is that it needs symbolic knowledge, which has to be extracted from data before it can be applied in the reasoning. Logical neural networks (LNN) aim at combining these two approaches, that is, to leverage the ability of neural networks to represent and learn from data and the ability of formal logic to perform reasoning on what has been learned by the NN (Garcez et al., 2019).

Although the research area of combining NN and logical reasoning has received a lot of interest in recent years (Garcez et al., 2019; Manhaeve et al., 2018; Serafini and Garcez, 2016; De Raedt et al., 2019; Dong et al., 2019; Riegel et al., 2020), the idea of combining machine learning, and neural networks in particular, with logical reasoning was already described in the early works of Chan et al. (1993), Quah et al. (1995) and Muggleton (1995), dating back to the early 90’s. Garcez et al. (2019) wrote a review paper on some strategies and methodologies employed for neuro-symbolic approaches and outline the key characteristics of such a system, namely:

  1. Knowledge representation

  2. Learning

  3. Reasoning

  4. Explainability (Interpretability)


1 Knowledge representation: How will the available knowledge be represented? Are there explicit rules that can be represented in propositional or first-order logic, which would allow logical reasoning using logic programs?

2 Learning: Garcez et al. (2019) mention two popular learning approaches, horizontal and vertical learning, and describe how neural learning and logical reasoning are interfaced in each. Hu et al. (2016) explain the horizontal learning approach, which makes use of a ’teacher’ and a ’student’ network. First, the ’student’ network is trained on the available data. Once trained, this ’student’ network is projected into a ’teacher’ network. Secondly, the prior knowledge (rules) is added to the ’teacher’ network by regularizing the network’s loss function. The loss from the ’teacher’ network is then used to optimize the weights of the ’student’ network. Thus, this approach offers a way to distill prior knowledge into the neural network weights, and serves as a knowledge extraction method. The disadvantage of such approaches is that the rules are integrated into the final network and are not available at inference time for further analysis.
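The interplay between the two networks can be summarized by the following sketch of the objective, written in the general form used in rule-distillation approaches of this kind. The notation is an assumption: $p_\theta$ is the student, $q^{*}$ the teacher, $r_l(x, y) \in [0, 1]$ the soft truth value of rule $l$ with confidence $\lambda_l$, $C$ a regularization strength, $\ell$ a loss function, $s_n^{(t)}$ the teacher's soft prediction for example $n$ at iteration $t$, and $\pi$ an imitation weight.

```latex
\begin{aligned}
q^{*}(y \mid x) &\propto p_\theta(y \mid x)\,
  \exp\Big(-\sum_{l} C\,\lambda_l\,\big(1 - r_l(x, y)\big)\Big) \\
\theta^{(t+1)} &= \arg\min_{\theta}\; \frac{1}{N} \sum_{n=1}^{N}
  (1 - \pi)\,\ell\big(y_n, \sigma_\theta(x_n)\big)
  + \pi\,\ell\big(s_n^{(t)}, \sigma_\theta(x_n)\big)
\end{aligned}
```

The first line down-weights outputs that violate the rules when constructing the teacher; the second line trains the student to balance fitting the true labels against imitating the rule-regularized teacher.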

Manhaeve et al. (2018) explain one approach to vertical LNN learning. The first part of such a network is a neural network, such as a CNN or an RNN, that performs low-level perception. In the case of the MNIST handwritten digits (LeCun and Cortes, 2010) example, the CNN is used to extract the low-level features and output a probability score (softmax) for each digit. The second part of the network is logical reasoning, which could for example be the addition of two digits, defined in the logic of ProbLog (Raedt et al., 2007). The network is end-to-end differentiable and allows for uncertainty in the final answer by using probabilities.
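How probabilities from the perception part flow into the reasoning part can be sketched for the digit addition example. The code below is not the DeepProbLog API; it is a plain-Python illustration that marginalizes over all digit pairs, assuming the two digit predictions are independent and that the softmax outputs are given as lists of ten probabilities.

```python
def sum_distribution(p_first, p_second):
    """Distribution over a + b, given independent distributions over digits a and b."""
    distribution = [0.0] * (len(p_first) + len(p_second) - 1)
    for a, prob_a in enumerate(p_first):
        for b, prob_b in enumerate(p_second):
            # Every digit pair (a, b) is one possible explanation of the sum a + b.
            distribution[a + b] += prob_a * prob_b
    return distribution


def digit_distribution(probabilities):
    """Build a 10-way distribution from a {digit: probability} mapping."""
    return [probabilities.get(digit, 0.0) for digit in range(10)]


# Placeholder softmax outputs: the first image is probably a 3 but might be an 8,
# the second image is confidently recognized as a 5.
first = digit_distribution({3: 0.9, 8: 0.1})
second = digit_distribution({5: 1.0})
sums = sum_distribution(first, second)
print({s: round(p, 3) for s, p in enumerate(sums) if p > 0})
```

Because the probability of each sum is a polynomial in the softmax outputs, a loss defined on the sum can be differentiated with respect to the network outputs, which is what allows training from labels on the sum alone.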

3 Reasoning: Arguably one of the most desirable traits an AI system should have. Apart from being able to learn from available data, the system should be able to reason on the learned features in order to come to a decision. This can be achieved by the explicit use of formal logic, as done by Manhaeve et al. (2018). Another example is the approach of Hu et al. (2016), who incorporate a confidence variable to allow for uncertainty and achieve logical reasoning where not all rules can be satisfied exactly. Serafini and Garcez (2016) use approximate satisfiability to allow for flexibility in rules being satisfied. Riegel et al. (2020) make use of truth bounds, consisting of a lower and an upper bound $[L, U]$, which can be used to assign true ($L = U = 1$), false ($L = U = 0$) as well as unknown ($L = 0$, $U = 1$) values to an outcome or a particular node in the network. In addition, the network can also highlight cases where contradicting ($L > U$) knowledge arises in the network.
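The interpretation of truth bounds can be sketched as a small classification function. It assumes truth values in the interval $[0, 1]$, with 1 denoting true and 0 denoting false; the function name, the tolerance and the extra label for partially constrained bounds are illustrative assumptions and not part of any published implementation.

```python
def classify_bounds(lower, upper, tolerance=1e-9):
    """Interpret a pair of truth bounds [lower, upper] on the interval [0, 1]."""
    if lower > upper + tolerance:
        return "contradiction"        # the bounds have crossed: L > U
    if lower >= 1.0 - tolerance:
        return "true"                 # L = U = 1
    if upper <= tolerance:
        return "false"                # L = U = 0
    if lower <= tolerance and upper >= 1.0 - tolerance:
        return "unknown"              # L = 0, U = 1: nothing is known yet
    return "partially known"          # bounds narrowed but not yet decided


for bounds in [(1.0, 1.0), (0.0, 0.0), (0.0, 1.0), (0.8, 0.3), (0.4, 0.9)]:
    print(bounds, classify_bounds(*bounds))
```

Keeping two bounds instead of a single truth value is what allows the explicit representation of both missing knowledge (wide bounds) and inconsistent knowledge (crossed bounds).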

4 Explainability (Interpretability): In the last couple of years there has been an increasing interest in the ability to explain and interpret a model’s decisions and to understand the reasoning behind a given result (Fan et al., 2020). An important point to outline here is the distinction between interpretability and explainability. The former refers to a model that provides insight into its decision making on its own, without the need for additional processing, as opposed to the latter, which requires an additional model or processing to explain the original model. Accuracy is therefore no longer the most important criterion for model performance. This is especially true in domains where models are used in life-critical decision making, such as health care, and where ethical decision making is critical. These models need to be interpretable and should allow transparency and understanding of the reasoning behind a particular model outcome. Logical statements are easily interpretable and thus account for the model’s interpretability. However, the interpretability of the model as a whole depends on how the knowledge was represented to the logical part of the model. As described above, most learning strategies depend on either a horizontal or a vertical approach. Thus, the part performed by the neural network could cloud full interpretability of the model.

The above methods all make use of explicit logic in some form or another. Neural networks are used to learn features and extract symbolic representations from the images, and logic is used in a subsequent reasoning step. An alternative, non-logic, approach is presented by Andreas et al. (2016b). The authors introduce Neural Module Networks (NMN), a composition of multiple neural networks in a jointly-trained arrangement of modules for question answering. The most interesting aspect of this method is that the arrangement of the modules is dynamically inferred from the grammar and content of the question. This not only enables long-chained reasoning but also provides interpretability with detailed text and image attention maps (Lu et al., 2016b). This method offers the advantage that the domain knowledge does not need to be known upfront, as is the case with most of the methods requiring exact logic. This advantage could become a disadvantage when exact knowledge is available that could help provide full model interpretability. An example of this could be medical diagnosis, where a disease is diagnosed according to a set of guidelines. These guidelines can be encoded in logical statements and provide interpretability of the model.

Andreas et al. (2016a) improved Neural Module Networks by learning the arrangement of the modules jointly with the answers, instead of relying on a language parser for the questions and handwritten rules for module assembly. These Dynamic Neural Module Networks (DNMN) are more accurate and understand more complex questions. Further improvements were achieved by Hu et al. (2017a) and their End-to-End Module Networks (N2NMN).

4.4.2 Program synthesis

Program synthesis is the task of learning programs (code) from examples of input and output data. This task is strongly linked to an AI that understands, since induced programs provide explainability (via symbolic rules), enable reasoning, and also result in a structured representation of an input. Early attempts utilized Genetic programming (Banzhaf et al., 1998) to evolve programs (Koza and Koza, 1992), and interest in the field has gained new strength with the success of program-like neural network models. This section describes recent techniques for program synthesis and how they relate to an AI that understands.

Parisotto et al. (2016) introduce the Recursive-Reverse-Recursive Neural Network (R3NN) to incrementally construct programs conforming to a pre-specified Domain Specific Language (DSL). R3NN is a tree-structured generation model that is trained with supervision, and it was able to solve the majority of the Flash-Fill benchmarks (Gulwani, 2011), which consist of synthesizing programs built on regular expressions that perform a desired string transformation. However, R3NN did not generalize well to larger programs, nor does it seem to scale well computationally to more complex tasks. Poor generalization is a known issue in program synthesis. Cai et al. (2017) address this problem by augmenting neural architectures with recursion. They apply the Neural Programming Architecture (NPA) to four tasks: grade-school addition, bubble sort, topological sort, and quicksort; and they demonstrate that recursion improves accuracy.

Ellis et al. (2015) employ general-purpose symbolic solvers for Satisfiability Modulo Theories (SMT) problems (De Moura and Bjørner, 2008) to learn SVRT (Synthetic Visual Reasoning Test) concepts (Fleuret et al., 2011) such as spatial relationships between objects. Training is unsupervised, but the symbolic solvers do not scale well to more complex problems and there is no soft-reasoning. One reason why symbolic solvers do not scale well is that the search space grows exponentially. DeepCoder (Balog et al., 2016) addresses this problem by learning an embedding for the functions of a DSL to guide the (beam) search for program candidates. Zohar and Wolf (2018) were able to further improve and accelerate DeepCoder by learning to remove intermediate variables.
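The exponential growth of the search space is easy to see in a minimal enumerative synthesizer. The sketch below searches all compositions of functions from a tiny, made-up list-processing DSL until one is consistent with the input-output examples; the DSL, the function names and the examples are assumptions for illustration and do not correspond to the DSL of any of the cited systems.

```python
from itertools import product

# A hypothetical DSL of list-to-list functions.
DSL = {
    "reverse": lambda xs: xs[::-1],
    "sort": lambda xs: sorted(xs),
    "double": lambda xs: [2 * x for x in xs],
    "drop_first": lambda xs: xs[1:],
}


def run(program, xs):
    """Apply the named DSL functions to the input, from left to right."""
    for name in program:
        xs = DSL[name](xs)
    return xs


def synthesize(examples, max_length=3):
    """Return the first program (shortest first) consistent with all examples."""
    for length in range(1, max_length + 1):
        # There are len(DSL) ** length candidate programs of this length.
        for program in product(DSL, repeat=length):
            if all(run(program, list(inp)) == out for inp, out in examples):
                return program
    return None


examples = [([3, 1, 2], [6, 4, 2]), ([5, 4], [10, 8])]
program = synthesize(examples)
print(program, run(program, [9, 7, 8]))
```

A learned model in the spirit of the guided search described above would not change this search procedure itself, but would reorder or prune the candidate functions so that likely programs are tried first.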

Feser et al. (2016) implemented a differentiable program interpreter utilizing a DSL inspired by functional programming, but they found, in agreement with Gaunt et al. (2016a), that discrete search-based techniques for program synthesis such as that of Feser et al. (2015) perform better than differentiable programming approaches. Later work by Gaunt et al. (2016b) extends a differentiable programming language with neural networks, enabling the combination of functions with learnable elements, in a similar fashion to Neural Network Modules (Andreas et al., 2016b) with symbolic modules. Discrete search-based techniques for program synthesis are also a focus of Chen et al. (2017b), who in addition highlight the importance of recursion. Using reinforcement learning, they demonstrate significantly improved generalization for learning context-free parsers. Similarly, improved generalization for equation verification and completion was achieved by Arabshahi et al. (2018) using Tree-LSTMs (Tai et al., 2015) with weakly supervised training. Recent work by Pierrot et al. (2019) further advances the above line of research, introducing the novel reinforcement learning algorithm AlphaNPI with structural biases for modularity, hierarchy and recursion.

Program induction is similar to program synthesis in that a program needs to be learned from input-output examples; however, in program induction the program is not explicit and not part of the result (Kant, 2018). In spirit, methods for program induction tend to be closer to neural networks than to symbolic computing. For instance, architectures such as the Neural Turing Machine (Graves et al., 2014, 2016), the Differentiable Neural Computer (Graves et al., 2016; Tanneberg et al., 2019), the Neural programmer (Hudson and Manning, 2019), Neural programmer-interpreters (Reed and De Freitas, 2015; Pierrot et al., 2019), Neural Program Lattices (Li et al., 2017), the Neural State Machine (Hudson and Manning, 2019), and most recently MEMO (Banino et al., 2015) extend neural networks with external memory, and can infer simple algorithms such as adding numbers, copying, sorting and path finding.

Manhaeve et al. (2018) illustrate the use of DeepProbLog to solve three program induction tasks and compare their results to Differentiable Forth (Bošnjak et al.). The first of the three examples looks at calculating the resulting digit and the carry digit given the sum of two digits and the previous carry digit. The results of DeepProbLog and Differentiable Forth were the same, with both obtaining 100% accuracy for training lengths of up to 8 digits and 64 testing digits. The second program induction problem they looked at was the bubble sort algorithm. Given a list of unsorted digits, the task is to arrange the digits in a sorted list. The program was tasked to figure out the action in each step of the bubble sort. DeepProbLog showed higher accuracy than Differentiable Forth for input lists longer than 3 and did so in a fraction of the time Differentiable Forth required. The final program induction problem was solving algebraic word problems, where the program had to decide on the order of input digits and the mathematical operation required in each step. As with the first problem, DeepProbLog and Differentiable Forth obtained similar results, 96.5% and 96.0% respectively.
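To make the first two induction tasks concrete, the following sketch writes down the target functions that the learned programs are expected to reproduce: one column of base-10 addition with carry, and the per-step decision in bubble sort. The function names and the action labels `"swap"` and `"no_swap"` are assumptions for illustration; they are not taken from DeepProbLog or Differentiable Forth, which learn this behaviour from examples rather than having it hand-coded.

```python
def add_step(a, b, carry_in):
    """One column of base-10 addition: returns (result_digit, carry_out)."""
    total = a + b + carry_in
    return total % 10, total // 10


def add_numbers(xs, ys):
    """Add two equal-length digit lists, most significant digit first."""
    carry, out = 0, []
    for a, b in zip(reversed(xs), reversed(ys)):
        digit, carry = add_step(a, b, carry)
        out.append(digit)
    out.append(carry)  # final carry becomes the leading digit
    return out[::-1]


def bubble_action(left, right):
    """Decision taken at each bubble sort step (hypothetical labels)."""
    return "swap" if left > right else "no_swap"


def bubble_sort(digits):
    """Sort by repeatedly applying the per-step decision above."""
    digits = list(digits)
    for end in range(len(digits) - 1, 0, -1):
        for i in range(end):
            if bubble_action(digits[i], digits[i + 1]) == "swap":
                digits[i], digits[i + 1] = digits[i + 1], digits[i]
    return digits


print(add_numbers([4, 7], [8, 5]))  # digits of 47 + 85
print(bubble_sort([3, 9, 1, 4]))
```

Generalisation in these tasks means that a step function learned on short inputs keeps producing correct results when the outer loop runs over much longer inputs.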

4.5 Commonsense inference

An intelligent creature needs to know about the real world and use its knowledge effectively to be able to act sensibly in the world. The knowledge of a schoolchild about the world and the methods for making obvious inferences from this knowledge are called common sense. Commonsense knowledge, such as knowing that “bumping into people annoys them” or “rain makes the road slippery”, helps humans navigate everyday situations seamlessly (Apperly, 2010). This type of knowledge and reasoning plays a crucial role in all aspects of artificial intelligence, from language understanding to computer vision and robotics (Davis, 2014).

Commonsense inference in NLP

A successful linguistic communication relies on a shared experience of the world, and it is this shared experience that makes utterances meaningful. Despite the incredible effectiveness of language processing models trained on text alone, today’s best systems still make mistakes that arise from a failure to relate language to the physical world it describes and to the social interactions it facilitates (Bisk et al., 2020). One important difference between human and machine text understanding lies in the fact that humans can access commonsense knowledge while processing text, which helps them to draw inferences about facts that are not mentioned in a text, but that are assumed to be common ground. However, for a computer system, inferring such unmentioned facts is a non-trivial challenge (Ostermann et al., 2019b).

In recent years, the NLP community has introduced multiple exploratory research directions into automated commonsense understanding (Sap et al., 2020). Recent efforts to acquire and represent this knowledge have resulted in large knowledge graphs, acquired through extractive methods (Speer et al., 2016) or crowdsourcing (Sap et al., 2019). Moreover, reasoning capabilities have been integrated into downstream NLP tasks, in order to develop smarter dialogue (Zhou et al., 2018) and question answering systems (Xiong et al., 2019).

Recent large pretrained language models (Devlin et al., 2018; Liu et al., 2019; Brown et al., 2020), however, have significantly improved human-like understanding capabilities of machines. Hence, machines should be able to model commonsense through symbolic integrations. Given the large number of NLP applications that are designed to require commonsense reasoning, some efforts infer commonsense knowledge from structured KBs as additional inputs to a neural network in generation (Guan et al., 2019), dialogue (Zhou et al., 2018), question answering (Mihaylov and Frank, 2018; Bauer et al., 2018; Lin et al., 2019; Weissenborn et al., 2017; Musa et al., 2018), and classification (Chen et al., 2019; Paul and Frank, 2019; Wang et al., 2019a). In others, researchers have relied on commonsense knowledge aggregated from corpus statistics extracted from unstructured text (Tandon et al., 2018; Lin et al., 2017; Li et al., 2018; Banerjee et al., 2019). Recently, instead of using relevant commonsense as an additional input to neural networks, commonsense knowledge has been encoded into the parameters of neural networks through pretraining on relevant knowledge bases (Zhong et al., 2019a) or explanations (Rajani et al., 2019), or by using multi-task objectives with commonsense relation prediction (Xia et al., 2019).

Commonsense is indirectly evaluated by assessing the performance of higher-level tasks over conventional Natural Language Understanding (NLU) data sets. Direct measurement of capabilities for assessing and explaining commonsense can shed light upon the ability to represent semantic knowledge, and symbolic reasoning with uncertainty. A detailed account of challenges with commonsense reasoning is provided by Davis and Marcus (2015), which spans difficulties in understanding and formulating commonsense knowledge for specific or general domains to complexities in various forms of reasoning and their integration for problem solving (Storks et al., 2019).

Commonsense inference in NLP is normally evaluated by reading comprehension tasks that require answering questions about a text by inferring information that is common knowledge but not necessarily present in the text. Ostermann et al. (2019b) evaluated two commonsense inference tasks, on everyday narratives (task 1) and on news articles (task 2), based on two data sets: MCScript2.0 (Ostermann et al., 2019a) and ReCoRD (Zhang et al., 2018). These two tasks were also released as challenges in the SemEval 2018 shared task 11, and in the COIN (COmmonsense INference in Natural Language Processing) workshop at EMNLP-IJCNLP 2019. Three baseline models were presented for task 1: logistic regression (Merkhofer et al., 2018), attentive reader (Hermann et al., 2015) and three-way attentive network (Wang et al., 2018b). Five baseline models were presented for task 2: BERT (Devlin et al., 2018), KT-NET (Yang et al., 2019), stochastic answer network (SAN) (Liu et al., 2017), DocQA (Clark and Gardner, 2017) and random guess. Systems from five teams at the COIN workshop were evaluated by Ostermann et al. (2019b), among which a best accuracy of 90.6% and a best F1-score of 83.7% were achieved for task 1 and task 2, respectively. Even with the best-performing transformer-based method, machine performance is 7% and 8% lower than human performance in the two tasks. This field also needs better data sets that make it harder to benefit from redundancy in the training data or large-scale pre-training on similar domains (Ostermann et al., 2019b).

Visual commonsense inference

Visual understanding goes well beyond object recognition. Understanding the world beyond the pixels is very straightforward for humans. However, this task, which requires higher-order cognition and commonsense reasoning about the world, is still difficult for today’s vision systems. Various tasks have been introduced for joint understanding of visual information and language, such as image captioning (Chen et al., 2015; Vinyals et al., 2015; Sharma et al., 2018), visual question answering (Antol et al., 2015; Johnson et al., 2017a; Marino et al., 2019) and referring expressions (Kazemzadeh et al., 2014; Plummer et al., 2015; Mao et al., 2016). There is also a recent body of work addressing representation learning using vision and language cues (Tan and Bansal, 2019; Lu et al., 2019; Su et al., 2019). However, these works fall short of understanding the dynamic situation captured in the image, which is the main motivation of visual commonsense inference.

Visual understanding requires seamless integration between recognition and cognition: beyond recognition-level perception (e.g. detecting objects and their attributes), one must perform cognition-level reasoning (e.g. inferring the likely intents, goals, and social dynamics of people) (Davis and Marcus, 2015). State-of-the-art vision systems can reliably perform recognition-level image understanding, but still struggle with complex inferences. In the context of visual question answering, some work focuses on commonsense phenomena, such as ‘what if’ and ‘why’ questions (Pirsiavash et al., 2014; Wagner et al., 2018). However, the space of commonsense inferences is often limited by the underlying dataset chosen (synthetic (Wagner et al., 2018) or COCO (Pirsiavash et al., 2014) scenes). In Zellers et al. (2019), commonsense questions are asked in the context of rich images from movies. Some visual commonsense inference work (Mottaghi et al., 2016; Ye et al., 2018) involves reasoning about commonsense phenomena, such as physics. Some involves commonsense reasoning about social interactions (Alahi et al., 2016; Chuang et al., 2018; Gupta et al., 2018; Vicol et al., 2018), while other work involves procedure understanding (Alayrac et al., 2016; Zhou et al., 2017). Predicting what might happen next in a video is also studied (Singh et al., 2016; Ehsani et al., 2018; Zhou and Berg, 2015; Vondrick et al., 2016a; Felsen et al., 2017; Rhinehart and Kitani, 2017; Yoshikawa et al., 2018).

To move toward incorporating commonsense knowledge in the context of visual understanding, Vedantam et al. (2015) proposed an approach in which human-generated abstract scenes made from clipart are used to learn common sense, although the approach was not applied to real images. Inferring the motivation behind the actions of people from images is also explored by Pirsiavash et al. (2014). Visual Commonsense Reasoning (VCR) (Zellers et al., 2019) tests whether a model can answer questions with a rationale using commonsense knowledge. Given a challenging question about an image, a machine must answer correctly and then provide a rationale to justify its answer. To move towards cognition-level understanding, Zellers et al. (2019) proposed a new reasoning engine, Recognition to Cognition Networks (R2C), that models the necessary layered inferences for grounding, contextualization, and reasoning. Although R2C helps narrow the gap between humans and machines, the challenge is still far from solved. While this work includes rich visual commonsense information, its question answering setup makes it difficult for models to generate commonsense inferences (Park et al., 2020).

ATOMIC (Sap et al., 2019), on the other hand, provides a commonsense knowledge graph containing if-then inferential textual descriptions in a generative setting; however, it relies on generic, textual events and does not consider visually contextualized information. In a further attempt, Park et al. (2020) extended Zellers et al. (2019) and Sap et al. (2019) to general visual commonsense, building a large-scale repository of visual commonsense graphs and models that can explicitly generate commonsense inferences for given images. This repository consists of over 1.4 million textual descriptions of visual commonsense inferences carefully annotated over a diverse set of 59,000 images, each paired with short video summaries of before and after.
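As a rough illustration of what an if-then commonsense knowledge graph looks like as a data structure, the sketch below stores textual events, typed relations and free-text inferences. The class `IfThenGraph` is hypothetical, the relation labels are merely written in the style used by ATOMIC, and every entry is invented for illustration rather than taken from any published resource.

```python
from collections import defaultdict


class IfThenGraph:
    """Hypothetical container for if-then commonsense inferences."""

    def __init__(self):
        # Maps (event, relation) to a list of free-text inferences.
        self._edges = defaultdict(list)

    def add(self, event, relation, inference):
        self._edges[(event, relation)].append(inference)

    def infer(self, event, relation):
        """Return the stored inferences, or an empty list if none exist."""
        return list(self._edges.get((event, relation), []))


graph = IfThenGraph()
# Invented entries; "oReact" stands for how others react, "xEffect" for
# the effect on the person performing the action (assumed label meanings).
graph.add("PersonX bumps into PersonY", "oReact", "annoyed")
graph.add("PersonX bumps into PersonY", "xEffect", "apologises")
graph.add("it rains", "effect", "the road becomes slippery")

print(graph.infer("PersonX bumps into PersonY", "oReact"))
print(graph.infer("PersonX bumps into PersonY", "xWant"))
```

A lookup structure like this only covers events that were stored verbatim, which is one motivation for models that generate inferences for unseen events or, in the visual case, for given images.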

From another perspective, there is also a large body of work on future prediction in different contexts such as future frame generation (Ranzato et al., 2014; Srivastava et al., 2015; Xue et al., 2016; Vondrick et al., 2016b; Mathieu et al., 2015; Villegas et al., 2019; Castrejon et al., 2019), prediction of the trajectories of people and objects (Walker et al., 2014; Alahi et al., 2016; Mottaghi et al., 2016), predicting human pose in future frames (Fragkiadaki et al., 2015; Walker et al., 2017; Chao et al., 2017) and semantic future action recognition (Lan et al., 2014; Zhou and Berg, 2015; Sun et al., 2019).

5 Discussion

In previous sections, we have identified relevant characteristics for an AI that understands. These capabilities are present in benchmarks and research streams that advance the current state of the art in understanding capabilities of AI systems. In this section, we summarize the different benchmarks and streams and present several capabilities that we consider relevant for an AI that understands. Then, we categorize the benchmarks using those capabilities and discuss the different research streams.

5.1 AI that understands capabilities

Following the discussion and requirements from an AI that understands defined in previous sections, we have developed the following capabilities as relevant to an AI that understands:

  • Capability 1: hierarchical and compositional knowledge representation, e.g. can the different components of an object be represented? Can we, for instance, compose several objects to derive a different one? Being able to decompose objects supports better generalisation to previously unseen situations by the AI system.

  • Capability 2: multi-modal structure-to-structure mapping. As we have seen, information is available in several modalities including text and images. In order to combine information from different modalities, a mapping from these modalities to a structured knowledge representation provides the means for understanding information from several modalities.

  • Capability 3: integration of symbolic and non-symbolic knowledge. This is required to model information that does not have a symbolic representation together with existing symbolic knowledge.

  • Capability 4: symbolic reasoning with uncertainties. Uncertainty about the world might be present even with fully symbolic representations. Uncertainties might be due to incompleteness, inconsistencies or noise, or motivated by world changes.
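Two of the capabilities above can be sketched in a few lines. The first part shows a part-whole tree as one possible form of hierarchical, compositional representation (Capability 1); the second part combines uncertain symbolic rules with a noisy-or under an independence assumption (Capability 4). The class `Part`, the function names, the choice of noisy-or and all probabilities are assumptions made for illustration, not definitions taken from the text.

```python
from dataclasses import dataclass, field
from math import prod


@dataclass
class Part:
    """Node in a hypothetical part-whole hierarchy."""
    name: str
    children: list = field(default_factory=list)

    def leaves(self):
        """Names of the atomic parts this object decomposes into."""
        if not self.children:
            return [self.name]
        return [leaf for child in self.children for leaf in child.leaves()]


def compose(name, *parts):
    """Build a new object from existing ones."""
    return Part(name, list(parts))


def noisy_or(fact_probs, rules, head):
    """Probability of `head` given uncertain facts and weighted rules.

    Each rule is (body_facts, rule_head, confidence). Facts and rules are
    assumed independent, which is a strong simplification.
    """
    failure = 1.0
    for body, rule_head, confidence in rules:
        if rule_head == head:
            fires = confidence * prod(fact_probs[f] for f in body)
            failure *= 1.0 - fires
    return 1.0 - failure


wheels = compose("wheels", Part("front wheel"), Part("rear wheel"))
bicycle = compose("bicycle", Part("frame"), wheels)
print(bicycle.leaves())

facts = {"rain": 0.7, "sprinkler": 0.2}  # invented probabilities
rules = [(["rain"], "wet_road", 0.9), (["sprinkler"], "wet_road", 0.8)]
print(noisy_or(facts, rules, "wet_road"))
```

The same symbolic rule base yields graded rather than binary conclusions once facts carry probabilities, which is the behaviour Capability 4 asks for.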

5.2 Benchmarks and understanding

If we revisit the benchmarks using the set of capabilities above, we obtain Table 1. One obvious characteristic of most benchmarks is that they deal with textual or image modalities in tasks such as object detection or question answering. As presented in Table 1, the benchmarks covering a broader set of understanding capabilities are in the visual question answering category, while benchmarks purely based on images or text focus on modelling specific tasks, such as object detection, that can be considered building blocks for understanding. Visual question answering considers a broad set of capabilities including composition of objects, their relation to each other and combination of multimodal data from text and images. The CLEVR data set also supports symbolic reasoning with uncertainties about the world, but there is no compositionality or hierarchical composition of the objects in the scene; thus the object detection methods might just focus on identifying shapes and colors and do not reason about more complex objects. A recent benchmark named Math SemEval (Hopkins et al., 2019) requires reasoning about compositionality or hierarchical composition of the data in the benchmark, and therefore requires analysing images and text and combining both to reason about the answer.

Considering the commonsense benchmarks, there exist a large number of benchmarks focused on question answering, mostly based on natural language. Some of these benchmarks are approached using natural language processing methods using pre-trained models on large corpora. In some examples such as ATOMIC, a knowledge graph is provided to reason on.

Ideally, an AI that understands would meet the requirements for the four capabilities. We have observed that current benchmark development is progressing in the right direction, but it might be possible to expand some of the existing benchmarks with additional capabilities. For instance, the CLEVR benchmark could be extended to deal with compositions of objects, possibly starting with simple geometric objects and then increasing the complexity to more complex objects obtained as a composition of these simpler ones.

A recent benchmark, the Abstraction and Reasoning Challenge (ARC) (Chollet, 2019; https://www.kaggle.com/c/abstraction-and-reasoning-challenge), has been defined to predict changes in an abstract block world, in which reasoning is used to predict the future state of the block world by abstracting from a small number of examples.

Benchmarks CAP1 CAP2 CAP3 CAP4
Image analytics
ImageNet Deng et al. (2009)
COCO Lin et al. (2014)
SHAPE benchmark Shilane et al. (2004)
Natural language processing
SQUAD Rajpurkar et al. (2016)
GLUE Wang et al. (2018a)
BioASQ Tsatsaronis et al. (2015)
AI2 Reasoning Challenge
LAMBADA Paperno et al. (2016)
BREAK Wolfson et al. (2020)
Visual QA
CLEVR Johnson et al. (2017a)
VQAv2 Goyal et al. (2017)
Visual Genome Antol et al. (2015)
Math SemEval Hopkins et al. (2019)
Common sense inference
TriviaQA Joshi et al. (2017)
CommonsenseQA Talmor et al. (2018)
NarrativeQA Kočiskỳ et al. (2018)
NewsQA Trischler et al. (2016)
SWAG Zellers et al. (2018)
WikiSQL Zhong et al. (2017)
Graph-based Learning Dataset Xu et al. (2018)
HotpotQA Yang et al. (2018)
WorldTree Jansen et al. (2018)
Event2Mind Rashkin et al. (2018)
Winograd / WinogradNLI Schema Challenge Mahajan (2018)
ConceptNet Speer et al. (2017)
WebChild Tandon et al. (2017)
NELL Mitchell et al. (2018)
ATOMIC Sap et al. (2019)
Table 1: CAP1: hierarchical and compositional knowledge representation, CAP2: multi-modal structure-to-structure mapping, CAP3: integrates symbolic and non-symbolic knowledge, CAP4: supports symbolic reasoning with uncertainties

5.3 Research streams

The capabilities defined above describe high-level functionality that an AI that understands should have, but obtaining these capabilities requires one or more lower-level building blocks. These building blocks could come from several research fields, or research streams as referred to in this text, combined in a single model architecture that allows the problem at hand to be solved.

The CLEVR dataset is a good example to illustrate this concept. This is a Visual Question Answering (VQA) dataset containing a number of 3D geometrical shapes arbitrarily arranged with respect to each other. Based on questions such as “Are there an equal number of large things and metal spheres?” or “How many objects are either small cylinders or metal spheres?”, the corresponding answer is inferred. This requires identification of attributes in the image, counting of objects, comparison between objects, multiple attention and logical operations. The building blocks required to solve this VQA problem include image analytics, such as a feature extractor using a CNN, and an NLP component that encodes the text using word embeddings with subsequent processing by an LSTM. The use of attention layers allows focus to be placed on specific areas of interest and thus creates a symbolic representation of the input data. The output layer provides probabilities of the outcome and thus reasoning with uncertainty. This example illustrates the combination of multiple research fields or streams to enable an AI that shows understanding. As shown in Table 1, this benchmark captures three of the four capabilities.
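The reasoning steps behind the two example questions can be written as small symbolic programs over a structured scene description. The sketch below assumes the scene has already been parsed into attribute dictionaries; the scene contents and the helper names `select`, `union` and `count` are hypothetical and do not reproduce the functional program format shipped with CLEVR.

```python
# Hypothetical parsed scene: one attribute dictionary per object.
scene = [
    {"size": "large", "material": "rubber", "shape": "cube"},
    {"size": "large", "material": "metal", "shape": "sphere"},
    {"size": "small", "material": "metal", "shape": "sphere"},
    {"size": "small", "material": "rubber", "shape": "cylinder"},
]


def select(objects, **attributes):
    """Keep the objects whose attributes match all given values."""
    return [o for o in objects
            if all(o[key] == value for key, value in attributes.items())]


def union(first, second):
    """Combine two object sets without counting any object twice."""
    seen = {id(o) for o in first}
    return first + [o for o in second if id(o) not in seen]


def count(objects):
    return len(objects)


# "Are there an equal number of large things and metal spheres?"
large_things = select(scene, size="large")
metal_spheres = select(scene, material="metal", shape="sphere")
equal_number = count(large_things) == count(metal_spheres)

# "How many objects are either small cylinders or metal spheres?"
small_cylinders = select(scene, size="small", shape="cylinder")
either = count(union(small_cylinders, metal_spheres))

print(equal_number, either)
```

In a learned system the attribute values would come from the vision component with associated probabilities, while the composition of filtering, counting and comparison plays the role of the reasoning component.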

The streams in section 4 are considered relevant to an AI that understands and applicable to the benchmarks presented in section 3. Future development of these streams, and others, would allow more complex benchmarks to be solved. In turn, the creation of more complex benchmarks also guides and drives the development of the respective research streams. The relatively recent developments in neuro-symbolic computing are a particularly interesting area of research and have sparked new interest in systems that can reason and offer model interpretability.

6 Conclusion

The paper considers the components an artificial intelligence system that understands should have. That is, a system that not only learns statistical relationships within the data, but is capable of forming a human-like understanding of the input data. This is certainly a difficult problem to solve, and the purpose of the paper is not to claim a general solution for solving it, but to look at several research streams and some of their latest developments. Furthermore, several benchmarks are described, which have been used to study certain characteristics of an AI that understands. The work also contributes to a growing interest in artificial intelligence systems that are interpretable and transparent, properties that are crucial in domains such as medical diagnosis.



  • A. Agrawal, A. Kembhavi, D. Batra, and D. Parikh (2017) C-vqa: a compositional split of the visual question answering (vqa) v1. 0 dataset. arXiv preprint arXiv:1704.08243. Cited by: §3.3, §4.2.1.
  • A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese (2016) Social lstm: human trajectory prediction in crowded spaces. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 961–971. Cited by: §4.5, §4.5.
  • J. Alayrac, P. Bojanowski, N. Agrawal, J. Sivic, I. Laptev, and S. Lacoste-Julien (2016) Unsupervised learning from narrated instruction videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4575–4583. Cited by: §4.5.
  • D. Alvarez-Melis, Y. Mroueh, and T. S. Jaakkola (2019) Unsupervised hierarchy matching with optimal transport over hyperbolic spaces. arXiv preprint arXiv:1911.02536. Cited by: §4.2.3.
  • J. Andreas, M. Rohrbach, T. Darrell, and D. Klein (2015) Deep compositional question answering with neural module networks. ArXiv abs/1511.02799. Cited by: §3.3.
  • J. Andreas, M. Rohrbach, T. Darrell, and D. Klein (2016a) Learning to compose neural networks for question answering. arXiv preprint arXiv:1601.01705. Cited by: §4.4.1.
  • J. Andreas, M. Rohrbach, T. Darrell, and D. Klein (2016b) Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 39–48. Cited by: §4.4.1, §4.4.2.
  • S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh (2015) Vqa: visual question answering. In Proceedings of the IEEE international conference on computer vision, pp. 2425–2433. Cited by: §3.3, §4.1, §4.5, Table 1.
  • I. Apperly (2010) Mindreaders: the cognitive basis of “theory of mind”. Psychology Press. Cited by: §4.5.
  • F. Arabshahi, S. Singh, and A. Anandkumar (2018) Combining symbolic and function evaluation expressions in neural programs. ArXiv abs/1801.04342. Cited by: §4.4.2.
  • M. Balog, A. L. Gaunt, M. Brockschmidt, S. Nowozin, and D. Tarlow (2016) Deepcoder: learning to write programs. arXiv preprint arXiv:1611.01989. Cited by: §4.4.2.
  • P. Banerjee, K. K. Pal, A. Mitra, and C. Baral (2019) Careful selection of knowledge to solve open book question answering. arXiv preprint arXiv:1907.10738. Cited by: §4.5.
  • A. Banino, A. P. Badia, R. Köster, M. J. Chadwick, V. Zambaldi, D. Hassabis, C. Barry, M. Botvinick, D. Kumaran, and C. Blundell (2015) MEMO: a deep network for flexible combination of episodic memories. arXiv preprint arXiv:2001.10913. Cited by: §4.4.2.
  • W. Banzhaf, P. Nordin, R. E. Keller, and F. D. Francone (1998) Genetic programming. Springer. Cited by: §4.4.2.
  • L. Bauer, Y. Wang, and M. Bansal (2018) Commonsense for generative multi-hop question answering tasks. arXiv preprint arXiv:1809.06309. Cited by: §3.4, §4.5.
  • B. G. Baumgart (1974) Geometric modelling for computer vision, rep. STA-CS-74-463, Stanford Univ., Stanford, Calf. Cited by: §4.3.2.
  • C. Bereiter (2005) Education and mind in the knowledge age. Routledge. Cited by: §2.
  • T. R. Besold, A. S. d’Avila Garcez, S. Bader, H. Bowman, P. M. Domingos, P. Hitzler, K. Kühnberger, L. C. Lamb, D. Lowd, P. M. V. Lima, L. de Penning, G. Pinkas, H. Poon, and G. Zaverucha (2017) Neural-symbolic learning and reasoning: a survey and interpretation. ArXiv abs/1711.03902. Cited by: §4.
  • I. Biederman (1987) Recognition-by-components: a theory of human image understanding.. Psychological review 94 (2), pp. 115. Cited by: §2.
  • Y. Bisk, A. Holtzman, J. Thomason, J. Andreas, Y. Bengio, J. Chai, M. Lapata, A. Lazaridou, J. May, A. Nisnevich, et al. (2020) Experience grounds language. arXiv preprint arXiv:2004.10151. Cited by: §4.5.
  • M. Bošnjak, T. Rocktäschel, and S. Riedel Programming with a differentiable forth interpreter. In Proceedings of the 34th International Conference on Machine Learning, pp. 547–556. Cited by: §4.4.2.
  • T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020) Language models are few-shot learners. arXiv preprint arXiv:2005.14165. Cited by: §4.5.
  • C. P. Burgess, L. Matthey, N. Watters, R. Kabra, I. Higgins, M. Botvinick, and A. Lerchner (2019) Monet: unsupervised scene decomposition and representation. arXiv preprint arXiv:1901.11390. Cited by: §4.3.3.
  • H. Cai, V. W. Zheng, and K. C. Chang (2018) A comprehensive survey of graph embedding: problems, techniques, and applications. IEEE Transactions on Knowledge and Data Engineering 30 (9), pp. 1616–1637. Cited by: §4.2.4.
  • J. Cai, R. Shin, and D. X. Song (2017) Making neural programming architectures generalize via recursion. ArXiv abs/1704.06611. Cited by: §4.4.2.
  • L. Cai and W. Y. Wang (2017) Kbgan: adversarial learning for knowledge graph embeddings. arXiv preprint arXiv:1711.04071. Cited by: §4.2.4.
  • L. Castrejon, N. Ballas, and A. Courville (2019) Improved conditional vrnns for video prediction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7608–7617. Cited by: §4.5.
  • G. Chaitin (2006) The limits of reason. Scientific American 294 (3), pp. 74–81. Cited by: §2.
  • B. P. Chamberlain, J. Clough, and M. P. Deisenroth (2017) Neural embeddings of graphs in hyperbolic space. arXiv preprint arXiv:1705.10359. Cited by: §4.2.3.
  • S. Chan, L. Hsu, K. Loe, and H. Teh (1993) Neural logic networks. Progress in Neural Networks 2. Cited by: §4.4.1.
  • Y. Chao, J. Yang, B. Price, S. Cohen, and J. Deng (2017) Forecasting human dynamics from static images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 548–556. Cited by: §4.5.
  • R. Charakorn, Y. Thawornwattana, S. Itthipuripat, N. Pawlowski, P. Manoonpong, and N. Dilokthanakul (2020) An explicit local and global representation disentanglement framework with applications in deep clustering and unsupervised object detection. arXiv preprint arXiv:2001.08957. Cited by: §4.3.3.
  • H. Chen, K. Sun, Z. Tian, C. Shen, Y. Huang, and Y. Yan (2020) BlendMask: top-down meets bottom-up for instance segmentation. arXiv preprint arXiv:2001.00309. Cited by: §4.3.1.
  • J. Chen, J. Chen, and Z. Yu (2019) Incorporating structured commonsense knowledge in story completion. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 6244–6251. Cited by: §4.5.
  • K. Chen, M. Seuret, J. Hennebert, and R. Ingold (2017a) Convolutional neural networks for page segmentation of historical document images. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Vol. 1, pp. 965–970. Cited by: §4.3.4.
  • X. Chen, H. Fang, T. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick (2015) Microsoft coco captions: data collection and evaluation server. arXiv preprint arXiv:1504.00325. Cited by: §4.5.
  • X. Chen, C. Liu, and D. Song (2017b) Towards synthesizing complex programs from input-output examples. arXiv preprint arXiv:1706.01284. Cited by: §4.4.2.
  • J. Choi, K. M. Yoo, and S. Lee (2018) Learning to compose task-specific tree structures. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §4.2.1, §4.2.2.
  • F. Chollet (2019) The measure of intelligence. arXiv preprint arXiv:1911.01547. Cited by: §2, §2, §3.5, §5.2.
  • C. Chuang, J. Li, A. Torralba, and S. Fidler (2018) Learning to act properly: predicting and explaining affordances from images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 975–983. Cited by: §4.5.
  • C. Clark and M. Gardner (2017) Simple and effective multi-paragraph reading comprehension. arXiv preprint arXiv:1710.10723. Cited by: §4.5.
  • C. Corro and I. Titov (2018) Differentiable perturb-and-parse: semi-supervised parsing with a structured variational autoencoder. arXiv preprint arXiv:1807.09875. Cited by: §4.2.2.
  • M. Crosby (2020) Building thinking machines by solving animal cognition tasks. Minds and Machines, pp. 1–27. Cited by: §2.
  • J. Da and J. Kusai (2019) Cracking the contextual commonsense code: understanding commonsense reasoning aptitude of deep contextual representations. arXiv preprint arXiv:1910.01157. Cited by: §3.4.
  • I. Dagan, O. Glickman, and B. Magnini (2005) The pascal recognising textual entailment challenge. In Machine Learning Challenges Workshop, pp. 177–190. Cited by: §3.4.
  • S. Dash, Md. F. M. Chowdhury, A. M. Gliozzo, N. Mihindukulasooriya, and N. R. Fauceglia (2019) Hypernym detection using strict partial order networks. ArXiv abs/1909.10572. Cited by: §4.2.4.
  • E. Davis and G. Marcus (2015) Commonsense reasoning and commonsense knowledge in artificial intelligence. Communications of the ACM 58 (9), pp. 92–103. Cited by: §3.4, §4.5, §4.5.
  • E. Davis (2014) Representations of commonsense knowledge. Morgan Kaufmann. Cited by: §4.5.
  • L. De Moura and N. Bjørner (2008) Z3: an efficient smt solver. In International conference on Tools and Algorithms for the Construction and Analysis of Systems, pp. 337–340. Cited by: §4.4.2.
  • L. De Raedt, R. Manhaeve, S. Dumancic, T. Demeester, and A. Kimmig (2019) Neuro-symbolic= neural+ logical+ probabilistic. In NeSy’19@ IJCAI, the 14th International Workshop on Neural-Symbolic Learning and Reasoning, Cited by: §4.4.1.
  • C. De Sa, A. Gu, C. Ré, and F. Sala (2018) Representation tradeoffs for hyperbolic embeddings. Proceedings of machine learning research 80, pp. 4460. Cited by: §4.2.3, §4.2.3.
  • D. Demeter and D. Downey (2019) Just add functions: a neural-symbolic language model. arXiv preprint arXiv:1912.05421. Cited by: §4.2.2.
  • F. Deng, Z. Zhi, and S. Ahn (2019) Generative hierarchical models for parts, objects, and scenes. ArXiv abs/1611.01988v2. Cited by: §4.3.3.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §3.1, Table 1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §4.5, §4.5.
  • C. Doersch (2016) Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908. Cited by: §4.2.2.
  • H. Dong, J. Mao, T. Lin, C. Wang, L. Li, and D. Zhou (2019) Neural logic machines. arXiv preprint arXiv:1904.11694. Cited by: §4.4.1.
  • K. Ehsani, H. Bagherinezhad, J. Redmon, R. Mottaghi, and A. Farhadi (2018) Who let the dogs out? modeling dog behavior from visual data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4051–4060. Cited by: §4.5.
  • K. Ellis, D. Ritchie, A. Solar-Lezama, and J. Tenenbaum (2018) Learning to infer graphics programs from hand-drawn images. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 6059–6068. Cited by: §4.3.2.
  • K. Ellis, A. Solar-Lezama, and J. Tenenbaum (2015) Unsupervised learning by program synthesis. In Advances in neural information processing systems, pp. 973–981. Cited by: §4.4.2.
  • M. Engelcke, A. R. Kosiorek, O. P. Jones, and I. Posner (2019) Genesis: generative scene inference and sampling with object-centric latent representations. arXiv preprint arXiv:1907.13052. Cited by: §4.3.3.
  • S. A. Eslami, N. Heess, T. Weber, Y. Tassa, D. Szepesvari, G. E. Hinton, et al. (2016) Attend, infer, repeat: fast scene understanding with generative models. In Advances in Neural Information Processing Systems, pp. 3225–3233. Cited by: §4.3.1.
  • F. Fan, J. Xiong, and G. Wang (2020) On interpretability of artificial neural networks. arXiv preprint arXiv:2001.02522. Cited by: §4.4.1.
  • C. Fellbaum (2012) WordNet. The encyclopedia of applied linguistics. Cited by: §4.2.4.
  • P. Felsen, P. Agrawal, and J. Malik (2017) What will happen next? forecasting player moves in sports videos. In Proceedings of the IEEE international conference on computer vision, pp. 3342–3351. Cited by: §4.5.
  • J. K. Feser, S. Chaudhuri, and I. Dillig (2015) Synthesizing data structure transformations from input-output examples. ACM SIGPLAN Notices 50 (6), pp. 229–239. Cited by: §4.4.2.
  • J. K. Feser, M. Brockschmidt, A. L. Gaunt, and D. Tarlow (2016) Differentiable functional program interpreters. ArXiv abs/1910.09119. Cited by: §4.4.2.
  • F. Fleuret, T. Li, C. Dubout, E. K. Wampler, S. Yantis, and D. Geman (2011) Comparing machines and humans on a visual categorization test. Proceedings of the National Academy of Sciences 108 (43), pp. 17621–17625. Cited by: §4.4.2.
  • K. Fragkiadaki, S. Levine, P. Felsen, and J. Malik (2015) Recurrent network models for human dynamics. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4346–4354. Cited by: §4.5.
  • C. Gan, Y. Li, H. Li, C. Sun, and B. Gong (2017) Vqs: linking segmentations to questions and answers for supervised attention in vqa and question-focused semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1811–1820. Cited by: §3.3.
  • O. Ganea, G. Bécigneul, and T. Hofmann (2018) Hyperbolic entailment cones for learning hierarchical embeddings. arXiv preprint arXiv:1804.01882. Cited by: §4.2.3.
  • A. d. Garcez, M. Gori, L. C. Lamb, L. Serafini, M. Spranger, and S. N. Tran (2019) Neural-symbolic computing: an effective methodology for principled integration of machine learning and reasoning. arXiv preprint arXiv:1905.06088. Cited by: §4.4.1, §4.4.1, §4.4.1.
  • A. L. Gaunt, M. Brockschmidt, R. Singh, N. Kushman, P. Kohli, J. Taylor, and D. Tarlow (2016a) Terpret: a probabilistic programming language for program induction. arXiv preprint arXiv:1608.04428. Cited by: §4.4.2.
  • A. L. Gaunt, M. Brockschmidt, N. Kushman, and D. Tarlow (2016b) Differentiable programs with neural libraries. In ICML, Cited by: §4.4.2.
  • A. Gilani, S. R. Qasim, I. Malik, and F. Shafait (2017) Table detection using deep learning. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Vol. 1, pp. 771–776. Cited by: §4.3.4.
  • P. Goel, S. Feng, and J. Boyd-Graber (2019) How pre-trained word representations capture commonsense physical comparisons. In Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing, Hong Kong, China, pp. 130–135. Cited by: §3.4.
  • Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh (2017) Making the v in vqa matter: elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6904–6913. Cited by: §2, §3.3, §3.3, Table 1.
  • A. Graves, G. Wayne, and I. Danihelka (2014) Neural turing machines. arXiv preprint arXiv:1410.5401. Cited by: §4.4.2.
  • A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwińska, S. G. Colmenarejo, E. Grefenstette, T. Ramalho, J. Agapiou, et al. (2016) Hybrid computing using a neural network with dynamic external memory. Nature 538 (7626), pp. 471. Cited by: §4.4.2.
  • K. Greff, R. L. Kaufmann, R. Kabra, N. Watters, C. Burgess, D. Zoran, L. Matthey, M. Botvinick, and A. Lerchner (2019) Multi-object representation learning with iterative variational inference. arXiv preprint arXiv:1903.00450. Cited by: §4.3.3.
  • K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra (2015) Draw: a recurrent neural network for image generation. arXiv preprint arXiv:1502.04623. Cited by: §4.3.1.
  • J. Guan, Y. Wang, and M. Huang (2019) Story ending generation with incremental encoding and commonsense knowledge. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 6473–6480. Cited by: §4.5.
  • S. Gulwani (2011) Automating string processing in spreadsheets using input-output examples. ACM Sigplan Notices 46 (1), pp. 317–330. Cited by: §4.4.2.
  • A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi (2018) Social gan: socially acceptable trajectories with generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2255–2264. Cited by: §4.5.
  • L. Hao, L. Gao, X. Yi, and Z. Tang (2016) A table detection method for pdf documents based on convolutional neural networks. In 2016 12th IAPR Workshop on Document Analysis Systems (DAS), pp. 287–292. Cited by: §4.3.4.
  • S. Havrylov, G. Kruszewski, and A. Joulin (2019) Cooperative learning of disjoint syntax and semantics. In NAACL-HLT, Cited by: §4.2.2.
  • D. He, S. Cohen, B. Price, D. Kifer, and C. L. Giles (2017a) Multi-scale multi-task fcn for semantic page segmentation and table detection. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Vol. 1, pp. 254–261. Cited by: §4.3.4.
  • K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017b) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §4.3.1.
  • K. M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom (2015) Teaching machines to read and comprehend. In Advances in neural information processing systems, pp. 1693–1701. Cited by: §4.5.
  • J. Hernandez-Orallo (2020) AI evaluation: on broken yardsticks and measurement scales. In Evaluating Evaluation of AI Systems (Meta-Eval 2020), AAAI, Cited by: §2.
  • G. E. Hinton, A. Krizhevsky, and S. D. Wang (2011) Transforming auto-encoders. In International conference on artificial neural networks, pp. 44–51. Cited by: §4.3.2.
  • M. Hopkins, R. Le Bras, C. Petrescu-Prahova, G. Stanovsky, H. Hajishirzi, and R. Koncel-Kedziorski (2019) SemEval-2019 task 10: math question answering. In Proceedings of the 13th International Workshop on Semantic Evaluation, Minneapolis, Minnesota, USA, pp. 893–899. Cited by: §3.3, §5.2, Table 1.
  • R. Hu, J. Andreas, T. Darrell, and K. Saenko (2018) Explainable neural computation via stack neural module networks. In Proceedings of the European conference on computer vision (ECCV), pp. 53–69. Cited by: §4.1.
  • R. Hu, J. Andreas, M. Rohrbach, T. Darrell, and K. Saenko (2017a) Learning to reason: end-to-end module networks for visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pp. 804–813. Cited by: §3.3, §4.1, §4.4.1.
  • R. Hu, M. Rohrbach, J. Andreas, T. Darrell, and K. Saenko (2017b) Modeling relationships in referential expressions with compositional modular networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1115–1124. Cited by: §4.2.1.
  • Z. Hu, X. Ma, Z. Liu, E. Hovy, and E. Xing (2016) Harnessing deep neural networks with logic rules. arXiv preprint arXiv:1603.06318. Cited by: §4.4.1, §4.4.1.
  • D. Hudson and C. D. Manning (2019) Learning by abstraction: the neural state machine. In Advances in Neural Information Processing Systems, pp. 5901–5914. Cited by: §4.4.2.
  • A. Huminski, Y. B. Ng, K. Kwok, and F. Bond (2019) Commonsense inference in human-robot communication. In Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing, pp. 104–112. Cited by: §3.4.
  • P. A. Jansen, E. Wainwright, S. Marmorstein, and C. T. Morrison (2018) Worldtree: a corpus of explanation graphs for elementary science questions supporting multi-hop inference. arXiv preprint arXiv:1802.03052. Cited by: item 9, Table 1.
  • D. Jin, Z. Jin, J. T. Zhou, and P. Szolovits (2019) Is bert really robust? natural language attack on text classification and entailment. arXiv preprint arXiv:1907.11932. Cited by: §2.
  • W. Jin, R. Barzilay, and T. Jaakkola (2018) Junction tree variational autoencoder for molecular graph generation. arXiv preprint arXiv:1802.04364. Cited by: §4.2.2.
  • J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. Lawrence Zitnick, and R. Girshick (2017a) Clevr: a diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2901–2910. Cited by: §1, §2, §3.2, §3.3, §4.1, §4.3.3, §4.5, §4, Table 1.
  • J. Johnson, B. Hariharan, L. van der Maaten, J. Hoffman, L. Fei-Fei, C. Lawrence Zitnick, and R. Girshick (2017b) Inferring and executing programs for visual reasoning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2989–2998. Cited by: §4.1.
  • M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer (2017) Triviaqa: a large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551. Cited by: item 1, Table 1.
  • K. Kafle and C. Kanan (2017a) An analysis of visual question answering algorithms. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1965–1973. Cited by: §3.3, §4.1.
  • K. Kafle and C. Kanan (2017b) Visual question answering: datasets, algorithms, and future challenges. Computer Vision and Image Understanding 163, pp. 3–20. Cited by: §4.1.
  • I. Kant (1998) Critique of pure reason. The Cambridge Edition of the Works of Immanuel Kant, Cambridge University Press, New York, NY. Note: Translated by Paul Guyer and Allen W. Wood Cited by: §2.
  • N. Kant (2018) Recent advances in neural program synthesis. arXiv preprint arXiv:1802.02353. Cited by: §4.4.2.
  • I. Kavasidis, C. Pino, S. Palazzo, F. Rundo, D. Giordano, P. Messina, and C. Spampinato (2019) A saliency-based convolutional neural network for table and chart detection in digitized documents. In International Conference on Image Analysis and Processing, pp. 292–302. Cited by: §4.3.4.
  • S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg (2014) Referitgame: referring to objects in photographs of natural scenes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 787–798. Cited by: §4.5.
  • D. Khashabi, E. S. Azer, T. Khot, A. Sabharwal, and D. Roth (2019) On the capabilities and limitations of reasoning for natural language understanding. arXiv preprint arXiv:1901.02522. Cited by: item 9.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.2.3.
  • M. Kissner and H. Mayer (2019) A neural-symbolic architecture for inverse graphics improved by lifelong meta-learning. arXiv preprint arXiv:1905.08910. Cited by: §4.3.2.
  • H. Kiyomaru, K. Omura, Y. Murawaki, D. Kawahara, and S. Kurohashi (2019) Diversity-aware event prediction based on a conditional variational autoencoder with reconstruction. In Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing, Hong Kong, China, pp. 113–122. Cited by: §3.4.
  • T. Kočiskỳ, J. Schwarz, P. Blunsom, C. Dyer, K. M. Hermann, G. Melis, and E. Grefenstette (2018) The narrativeqa reading comprehension challenge. Transactions of the Association for Computational Linguistics 6, pp. 317–328. Cited by: item 3, Table 1.
  • J. R. Koza (1992) Genetic programming: on the programming of computers by means of natural selection. Vol. 1, MIT press. Cited by: §4.4.2.
  • T. D. Kulkarni, P. Kohli, J. B. Tenenbaum, and V. Mansinghka (2015) Picture: a probabilistic programming language for scene perception. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4390–4399. Cited by: §4.3.2.
  • B. M. Lake, N. D. Lawrence, and J. B. Tenenbaum (2018) The emergence of organizing structure in conceptual representation. Cognitive science 42, pp. 809–832. Cited by: §4.2.2.
  • B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum (2015) Human-level concept learning through probabilistic program induction. Science 350 (6266), pp. 1332–1338. Cited by: §2.
  • B. M. Lake, T. D. Ullman, J. B. Tenenbaum, and S. J. Gershman (2017) Building machines that learn and think like people. Behavioral and brain sciences 40. Cited by: §2, §2.
  • T. Lan, T. Chen, and S. Savarese (2014) A hierarchical representation for future action prediction. In European Conference on Computer Vision, pp. 689–704. Cited by: §4.5.
  • M. Le, S. Roller, L. Papaxanthos, D. Kiela, and M. Nickel (2019) Inferring concept hierarchies from text corpora via hyperbolic embeddings. arXiv preprint arXiv:1902.00913. Cited by: §4.2.3.
  • Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. Nature 521 (7553), pp. 436–444. Cited by: §1, §4.4.1.
  • Y. LeCun and C. Cortes (2010) MNIST handwritten digit database. Cited by: §4.2.3, §4.4.1.
  • Y. Lee and J. Park (2019) CenterMask: real-time anchor-free instance segmentation. ArXiv abs/1911.06667. Cited by: §4.3.1.
  • C. Li, D. Tarlow, A. L. Gaunt, M. Brockschmidt, and N. Kushman (2017) Neural program lattices. In ICLR, Cited by: §4.4.2.
  • Q. Li, Z. Li, J. Wei, Y. Gu, A. Jatowt, and Z. Yang (2018) A multi-attention based neural network with external knowledge for story ending predicting task. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 1754–1762. Cited by: §4.5.
  • X. Li, Z. Zhang, W. Zhu, Z. Li, Y. Ni, P. Gao, J. Yan, and G. Xie (2019) Pingan smart health and SJTU at COIN - shared task: utilizing pre-trained language models and common-sense knowledge in machine reading tasks. In Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing, Hong Kong, China, pp. 93–98. Cited by: §3.4.
  • Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel (2015) Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493. Cited by: §3.4.
  • B. Y. Lin, X. Chen, J. Chen, and X. Ren (2019) Kagnet: knowledge-aware graph networks for commonsense reasoning. arXiv preprint arXiv:1909.02151. Cited by: §4.5.
  • H. Lin, L. Sun, and X. Han (2017) Reasoning with heterogeneous knowledge for commonsense machine comprehension. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2032–2043. Cited by: §4.5.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §3.1, §4.3.1, Table 1.
  • Z. Lin, Y. Wu, S. V. Peri, W. Sun, G. Singh, F. Deng, J. Jiang, and S. Ahn (2020) SPACE: unsupervised object-oriented scene representation via spatial attention and decomposition. arXiv preprint arXiv:2001.02407. Cited by: §4.3.3.
  • X. Liu, Y. Shen, K. Duh, and J. Gao (2017) Stochastic answer networks for machine reading comprehension. arXiv preprint arXiv:1712.03556. Cited by: §4.5.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §4.5.
  • M. M. Loper and M. J. Black (2014) OpenDR: an approximate differentiable renderer. In European Conference on Computer Vision, pp. 154–169. Cited by: §4.3.2.
  • J. Lu, D. Batra, D. Parikh, and S. Lee (2019) Vilbert: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, pp. 13–23. Cited by: §4.5.
  • J. Lu, J. Yang, D. Batra, and D. Parikh (2016a) Hierarchical question-image co-attention for visual question answering. In Advances In Neural Information Processing Systems, pp. 289–297. Cited by: §4.2.1.
  • J. Lu, J. Yang, D. Batra, and D. Parikh (2016b) Hierarchical question-image co-attention for visual question answering. In NIPS, Cited by: §4.4.1.
  • K. Ma, J. Francis, Q. Lu, E. Nyberg, and A. Oltramari (2019) Towards generalizable neuro-symbolic systems for commonsense question answering. arXiv preprint arXiv:1910.14087. Cited by: §3.4.
  • V. Mahajan (2018) Winograd schema-knowledge extraction using narrative chains. arXiv preprint arXiv:1801.02281. Cited by: item 11, item 8, Table 1.
  • M. Malinowski and M. Fritz (2014) A multi-world approach to question answering about real-world scenes based on uncertain input. In NIPS, Cited by: §4.1.
  • R. Manhaeve, S. Dumancic, A. Kimmig, T. Demeester, and L. De Raedt (2018) Deepproblog: neural probabilistic logic programming. In Advances in Neural Information Processing Systems, pp. 3749–3759. Cited by: §4.4.1, §4.4.1, §4.4.1, §4.4.2.
  • C. Manning (2016) Understanding human language: can nlp and deep learning help?. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pp. 1–1. Cited by: §2.
  • J. Mao, C. Gan, P. Kohli, J. B. Tenenbaum, and J. Wu (2019a) The neuro-symbolic concept learner: interpreting scenes, words, and sentences from natural supervision. arXiv preprint arXiv:1904.12584. Cited by: §2, §4.1, §4.
  • J. Mao, Y. Yao, S. Heinrich, T. Hinz, C. Weber, S. Wermter, Z. Liu, and M. Sun (2019b) Bootstrapping knowledge graphs from images and text. Frontiers in Neurorobotics 13, pp. 93. Cited by: §4.2.4.
  • J. Mao, J. Huang, A. Toshev, O. Camburu, A. L. Yuille, and K. Murphy (2016) Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 11–20. Cited by: §4.5.
  • G. Marcus (2020) The next decade in ai: four steps towards robust artificial intelligence. arXiv preprint arXiv:2002.06177. Cited by: §4.
  • K. Marino, M. Rastegari, A. Farhadi, and R. Mottaghi (2019) Ok-vqa: a visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3195–3204. Cited by: §4.5.
  • E. Mathieu, C. Le Lan, C. J. Maddison, R. Tomioka, and Y. W. Teh (2019) Continuous hierarchical representations with poincaré variational auto-encoders. In Advances in neural information processing systems, pp. 12544–12555. Cited by: §4.2.3.
  • M. Mathieu, C. Couprie, and Y. LeCun (2015) Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440. Cited by: §4.5.
  • E. Merkhofer, J. Henderson, D. Bloom, L. Strickhart, and G. Zarrella (2018) Mitre at semeval-2018 task 11: commonsense reasoning without commonsense knowledge. In Proceedings of The 12th International Workshop on Semantic Evaluation, pp. 1078–1082. Cited by: §4.5.
  • T. Mihaylov and A. Frank (2018) Knowledgeable reader: enhancing cloze-style reading comprehension with external commonsense knowledge. arXiv preprint arXiv:1805.07858. Cited by: §4.5.
  • T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119. Cited by: §4.2.3.
  • G. A. Miller (1995) WordNet: a lexical database for english. Communications of the ACM 38 (11), pp. 39–41. Cited by: §4.2.3.
  • T. Mitchell, W. Cohen, E. Hruschka, P. Talukdar, B. Yang, J. Betteridge, A. Carlson, B. Dalvi, M. Gardner, B. Kisiel, et al. (2018) Never-ending learning. Communications of the ACM 61 (5), pp. 103–115. Cited by: item 3, Table 1.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller (2013) Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. Cited by: §2.
  • N. Mostafazadeh, N. Chambers, X. He, D. Parikh, D. Batra, L. Vanderwende, P. Kohli, and J. Allen (2016) A corpus and evaluation framework for deeper understanding of commonsense stories. arXiv preprint arXiv:1604.01696. Cited by: item 5, item 3, §3.4.
  • R. Mottaghi, M. Rastegari, A. Gupta, and A. Farhadi (2016) “What happens if…” learning to predict the effect of forces in images. In European conference on computer vision, pp. 269–285. Cited by: §4.5, §4.5.
  • S. Muggleton (1995) Inductive logic programming. New Generation Computing 8, pp. 295–318. Cited by: §4.4.1.
  • R. Musa, X. Wang, A. Fokoue, N. Mattei, M. Chang, P. Kapanipathi, B. Makni, K. Talamadupula, and M. Witbrock (2018) Answering science exam questions using query rewriting with background knowledge. arXiv preprint arXiv:1809.05726. Cited by: §4.5.
  • Y. Nagano, S. Yamaguchi, Y. Fujita, and M. Koyama (2019) A differentiable gaussian-like distribution on hyperbolic space for gradient-based learning. arXiv preprint arXiv:1902.02992. Cited by: §4.2.3.
  • A. Nguyen, J. Yosinski, and J. Clune (2015) Deep neural networks are easily fooled: high confidence predictions for unrecognizable images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 427–436. Cited by: §2.
  • D. Q. Nguyen, D. Q. Nguyen, C. X. Chu, S. Thater, and M. Pinkal (2017) Sequence to sequence learning for event prediction. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Taipei, Taiwan, pp. 37–42. Cited by: §3.4.
  • N. Nguyen, C. Rigaud, and J. Burie (2019) Multi-task model for comic book image analysis. In International Conference on Multimedia Modeling, pp. 637–649. Cited by: §4.3.4.
  • M. Nickel and D. Kiela (2018) Learning continuous hierarchies in the lorentz model of hyperbolic geometry. arXiv preprint arXiv:1806.03417. Cited by: §4.2.3.
  • M. Nickel, K. Murphy, V. Tresp, and E. Gabrilovich (2015) A review of relational machine learning for knowledge graphs. Proceedings of the IEEE 104 (1), pp. 11–33. Cited by: §4.2.4.
  • M. Nickel, L. Rosasco, and T. Poggio (2016) Holographic embeddings of knowledge graphs. In Thirtieth AAAI conference on artificial intelligence, Cited by: §4.2.3.
  • M. Nickel and D. Kiela (2017) Poincaré embeddings for learning hierarchical representations. In Advances in neural information processing systems, pp. 6338–6347. Cited by: §4.2.3.
  • S. Ostermann, M. Roth, and M. Pinkal (2019a) MCScript2.0: a machine comprehension corpus focused on script events and participants. arXiv preprint arXiv:1905.09531. Cited by: §4.5.
  • S. Ostermann, S. Zhang, M. Roth, and P. Clark (2019b) Commonsense inference in natural language processing (coin)-shared task report. In Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing, pp. 66–74. Cited by: §4.5, §4.5.
  • A. Paccanaro and G. E. Hinton (2001) Learning distributed representations of concepts using linear relational embedding. IEEE Transactions on Knowledge and Data Engineering 13 (2), pp. 232–244. Cited by: §4.2.3.
  • D. Paperno, G. Kruszewski, A. Lazaridou, Q. N. Pham, R. Bernardi, S. Pezzelle, M. Baroni, G. Boleda, and R. Fernández (2016) The lambada dataset: word prediction requiring a broad discourse context. arXiv preprint arXiv:1606.06031. Cited by: §3.2, Table 1.
  • E. Parisotto, A. Mohamed, R. Singh, L. Li, D. Zhou, and P. Kohli (2016) Neuro-symbolic program synthesis. ArXiv abs/1611.01855. Cited by: §4.4.2.
  • J. S. Park, C. Bhagavatula, R. Mottaghi, A. Farhadi, and Y. Choi (2020) VisualCOMET: reasoning about the dynamic context of a still image. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: §4.5, §4.5.
  • D. Paul and A. Frank (2019) Ranking and selecting multi-hop knowledge paths to better predict human needs. arXiv preprint arXiv:1904.00676. Cited by: §4.5.
  • H. Paulheim (2017) Knowledge graph refinement: a survey of approaches and evaluation methods. Semantic web 8 (3), pp. 489–508. Cited by: §4.2.4.
  • J. Pennington, R. Socher, and C. D. Manning (2014) Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543. Cited by: §4.2.3.
  • T. Pierrot, G. Ligner, S. E. Reed, O. Sigaud, N. Perrin, A. Laterre, D. Kas, K. Beguir, and N. de Freitas (2019) Learning compositional neural programs with recursive tree search and planning. ArXiv abs/1905.12941. Cited by: §4.4.2, §4.4.2.
  • H. Pirsiavash, C. Vondrick, and A. Torralba (2014) Inferring the why in images. Technical report, Massachusetts Institute of Technology, Cambridge. Cited by: §4.5, §4.5.
  • B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik (2015) Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision, pp. 2641–2649. Cited by: §4.5.
  • I. Porada, K. Suleman, and J. C. K. Cheung (2019) Can a gorilla ride a camel? learning semantic plausibility from text. In Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing, Hong Kong, China, pp. 123–129. Cited by: §3.4.
  • S. Purushwalkam, M. Nickel, A. Gupta, and M. Ranzato (2019) Task-driven modular networks for zero-shot compositional learning. arXiv preprint arXiv:1905.05908. Cited by: §4.2.1.
  • S. R. Qasim, H. Mahmood, and F. Shafait (2019) Rethinking table recognition using graph neural networks. In 2019 15th International Conference on Document Analysis and Recognition, pp. accepted. Cited by: §4.3.4.
  • T. Quah, C. Tan, H. Teh, and B. S. Srinivasan (1995) Utilizing a neural logic expert system in currency option trading. Expert Systems with Applications 9, pp. 213–222. Cited by: §4.4.1.
  • L. D. Raedt, A. Kimmig, and H. Toivonen (2007) ProbLog: a probabilistic prolog and its applications in link discovery. In Proceedings of the 20th International conference of Artificial Intelligence, pp. 2462–2467. Cited by: §4.4.1.
  • N. F. Rajani, B. McCann, C. Xiong, and R. Socher (2019) Explain yourself! leveraging language models for commonsense reasoning. arXiv preprint arXiv:1906.02361. Cited by: §4.5.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250. Cited by: §3.2, Table 1.
  • M. Ranzato, A. Szlam, J. Bruna, M. Mathieu, R. Collobert, and S. Chopra (2014) Video (language) modeling: a baseline for generative models of natural videos. arXiv preprint arXiv:1412.6604. Cited by: §4.5.
  • H. Rashkin, M. Sap, E. Allaway, N. A. Smith, and Y. Choi (2018) Event2mind: commonsense inference on events, intents, and reactions. arXiv preprint arXiv:1805.06939. Cited by: item 10, item 7, Table 1.
  • S. Reed and N. De Freitas (2015) Neural programmer-interpreters. arXiv preprint arXiv:1511.06279. Cited by: §4.4.2.
  • G. Renton, P. Héroux, B. Gaüzère, and S. Adam (2019) Graph neural network for symbol detection on document images. In 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW), Vol. 1, pp. 62–67. Cited by: §4.3.4.
  • N. Rhinehart and K. M. Kitani (2017) First-person activity forecasting with online inverse reinforcement learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3696–3705. Cited by: §4.5.
  • M. Ricci, J. Kim, and T. Serre (2018) Same-different problems strain convolutional neural networks. arXiv preprint arXiv:1802.03390. Cited by: §1.
  • R. Riegel, A. Gray, F. Luus, N. Khan, N. Makondo, I. Akhalwaya, H. Qian, R. Fagin, F. Barahona, U. Sharma, S. Ikbal, H. Karanam, S. Neelam, A. Likhyani, and S. Srivastava (2020) Logical neural networks. arXiv preprint arXiv:2006.13155v1. Cited by: §4.4.1, §4.4.1.
  • D. Rumelhart, G. Hinton, and R. Williams (1986) Learning internal representations by error propagation. parallel distributed processing, explorations in the microstructure of cognition. Foundations. Cambridge, MA: MIT Press. Cited by: §1.
  • A. Santoro, F. Hill, D. Barrett, A. Morcos, and T. Lillicrap (2018) Measuring abstract reasoning in neural networks. In International Conference on Machine Learning, pp. 4477–4486. Cited by: §2.
  • M. Sap, R. Le Bras, E. Allaway, C. Bhagavatula, N. Lourie, H. Rashkin, B. Roof, N. A. Smith, and Y. Choi (2019) Atomic: an atlas of machine commonsense for if-then reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 3027–3035. Cited by: item 4, §4.5, §4.5, Table 1.
  • M. Sap, V. Shwartz, A. Bosselut, Y. Choi, and D. Roth (2020) Introductory tutorial: commonsense reasoning for natural language processing. Association for Computational Linguistics (ACL 2020): Tutorial Abstracts, pp. 27. Cited by: §4.5.
  • J. Schmidhuber (2015) Deep learning in neural networks: an overview. Neural networks 61, pp. 85–117. Cited by: §1.
  • S. Schreiber, S. Agne, I. Wolf, A. Dengel, and S. Ahmed (2017) Deepdesrt: deep learning for detection and structure recognition of tables in document images. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Vol. 1, pp. 1162–1167. Cited by: §4.3.4.
  • L. Serafini and A. d. Garcez (2016) Logic tensor networks: deep learning and logical reasoning from data and knowledge. arXiv preprint arXiv:1606.04422. Cited by: §4.4.1, §4.4.1.
  • P. Sharma, N. Ding, S. Goodman, and R. Soricut (2018) Conceptual captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2556–2565. Cited by: §4.5.
  • J. Shi, H. Zhang, and J. Li (2019) Explainable and explicit visual reasoning over scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8376–8384. Cited by: §4.1.
  • P. Shilane, P. Min, M. Kazhdan, and T. Funkhouser (2004) The princeton shape benchmark. In Proceedings Shape Modeling Applications, 2004, pp. 167–178. Cited by: §3.1, Table 1.
  • D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. (2017) Mastering the game of go without human knowledge. Nature 550 (7676), pp. 354–359. Cited by: §2.
  • K. K. Singh, K. Fatahalian, and A. A. Efros (2016) Krishnacam: using a longitudinal, single-person, egocentric dataset for scene understanding tasks. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1–9. Cited by: §4.5.
  • B. Smith, M. Ashburner, C. Rosse, J. Bard, W. Bug, W. Ceusters, L. J. Goldberg, K. Eilbeck, A. Ireland, C. J. Mungall, et al. (2007) The obo foundry: coordinated evolution of ontologies to support biomedical data integration. Nature biotechnology 25 (11), pp. 1251–1255. Cited by: §4.2.4.
  • R. Socher, D. Chen, C. D. Manning, and A. Ng (2013a) Reasoning with neural tensor networks for knowledge base completion. In Advances in neural information processing systems, pp. 926–934. Cited by: §4.2.4.
  • R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts (2013b) Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 1631–1642. Cited by: §4.2.2.
  • L. Song, X. Peng, Y. Zhang, Z. Wang, and D. Gildea (2017) Amr-to-text generation with synchronous node replacement grammar. arXiv preprint arXiv:1702.00500. Cited by: §3.4.
  • R. Speer, J. Chin, and C. Havasi (2017) Conceptnet 5.5: an open multilingual graph of general knowledge. In Thirty-First AAAI Conference on Artificial Intelligence. Cited by: item 1, Table 1.
  • R. Speer, J. Chin, and C. Havasi (2016) Conceptnet 5.5: an open multilingual graph of general knowledge. arXiv preprint arXiv:1612.03975. Cited by: §4.5.
  • N. Srivastava, E. Mansimov, and R. Salakhutdinov (2015) Unsupervised learning of video representations using lstms. In International conference on machine learning, pp. 843–852. Cited by: §4.5.
  • P. W. Staar, M. Dolfi, C. Auer, and C. Bekas (2018) Corpus conversion service: a machine learning platform to ingest documents at scale. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 774–782. Cited by: §4.3.4.
  • K. Stelzner, R. Peharz, and K. Kersting (2019) Faster attend-infer-repeat with tractable probabilistic models. In ICML. Cited by: §4.3.1.
  • S. Storks, Q. Gao, and J. Y. Chai (2019) Commonsense reasoning for natural language understanding: a survey of benchmarks, resources, and approaches. arXiv preprint arXiv:1904.01172, pp. 1–60. Cited by: §3.4, §3.4, §4.5.
  • W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai (2019) Vl-bert: pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530. Cited by: §4.5.
  • J. Suarez, J. M. Johnson, and F. Li (2018) DDRprog: a clevr differentiable dynamic reasoning programmer. arXiv preprint arXiv:1803.11361. Cited by: §4.1.
  • C. Sun, A. Shrivastava, C. Vondrick, R. Sukthankar, K. Murphy, and C. Schmid (2019) Relational action forecasting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 273–283. Cited by: §4.5.
  • R. Suzuki, R. Takahama, and S. Onoda (2019) Hyperbolic disk embeddings for directed acyclic graphs. arXiv preprint arXiv:1902.04335. Cited by: §4.2.3.
  • K. S. Tai, R. Socher, and C. D. Manning (2015) Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075. Cited by: §4.2.2, §4.4.2.
  • A. Talmor, J. Herzig, N. Lourie, and J. Berant (2018) Commonsenseqa: a question answering challenge targeting commonsense knowledge. arXiv preprint arXiv:1811.00937. Cited by: item 2, item 1, Table 1.
  • H. Tan and M. Bansal (2019) Lxmert: learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490. Cited by: §4.5.
  • M. Tan and Q. V. Le (2019) EfficientNet: rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946v3. Cited by: §4.4.1.
  • N. Tandon, G. De Melo, and G. Weikum (2017) Webchild 2.0: fine-grained commonsense knowledge distillation. In Proceedings of ACL 2017, System Demonstrations, pp. 115–120. Cited by: item 2, Table 1.
  • N. Tandon, B. D. Mishra, J. Grus, W. Yih, A. Bosselut, and P. Clark (2018) Reasoning about actions and state changes by injecting commonsense knowledge. arXiv preprint arXiv:1808.10012. Cited by: §4.5.
  • D. Tanneberg, E. Rueckert, and J. Peters (2019) Learning algorithmic solutions to symbolic planning tasks with a neural computer. arXiv preprint arXiv:1911.00926. Cited by: §4.4.2.
  • S. Thiem and P. Jansen (2019) Extracting common inference patterns from semi-structured explanations. In Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing, pp. 53–65. Cited by: §3.4.
  • Z. Tian, C. Shen, H. Chen, and T. He (2019) FCOS: fully convolutional one-stage object detection. arXiv preprint arXiv:1904.01355. Cited by: §4.3.1.
  • A. Trischler, T. Wang, X. Yuan, J. Harris, A. Sordoni, P. Bachman, and K. Suleman (2016) Newsqa: a machine comprehension dataset. arXiv preprint arXiv:1611.09830. Cited by: item 4, item 2, Table 1.
  • T. Trouillon, J. Welbl, S. Riedel, É. Gaussier, and G. Bouchard (2016) Complex embeddings for simple link prediction. In International Conference on Machine Learning (ICML). Cited by: §4.2.3.
  • G. Tsatsaronis, G. Balikas, P. Malakasiotis, I. Partalas, M. Zschunke, M. R. Alvers, D. Weissenborn, A. Krithara, S. Petridis, D. Polychronopoulos, et al. (2015) An overview of the bioasq large-scale biomedical semantic indexing and question answering competition. BMC bioinformatics 16 (1), pp. 138. Cited by: Table 1.
  • R. Vedantam, X. Lin, T. Batra, C. Lawrence Zitnick, and D. Parikh (2015) Learning common sense through visual abstraction. In Proceedings of the IEEE international conference on computer vision, pp. 2542–2550. Cited by: §4.5.
  • P. Vicol, M. Tapaswi, L. Castrejon, and S. Fidler (2018) Moviegraphs: towards understanding human-centric situations from videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8581–8590. Cited by: §4.5.
  • R. Villegas, A. Pathak, H. Kannan, D. Erhan, Q. V. Le, and H. Lee (2019) High fidelity video prediction with large stochastic recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 81–91. Cited by: §4.5.
  • O. Vinyals, A. Toshev, S. Bengio, and D. Erhan (2015) Show and tell: a neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3156–3164. Cited by: §4.5.
  • C. Vondrick, H. Pirsiavash, and A. Torralba (2016a) Anticipating visual representations from unlabeled video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 98–106. Cited by: §4.5.
  • C. Vondrick, H. Pirsiavash, and A. Torralba (2016b) Generating videos with scene dynamics. In Advances in neural information processing systems, pp. 613–621. Cited by: §4.5.
  • M. Wagner, H. Basevi, R. Shetty, W. Li, M. Malinowski, M. Fritz, and A. Leonardis (2018) Answering visual what-if questions: from actions to predicted scene descriptions. In Proceedings of the European Conference on Computer Vision (ECCV). Cited by: §4.5.
  • J. Walker, A. Gupta, and M. Hebert (2014) Patch to the future: unsupervised visual prediction. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 3302–3309. Cited by: §4.5.
  • J. Walker, K. Marino, A. Gupta, and M. Hebert (2017) The pose knows: video forecasting by generating pose futures. In Proceedings of the IEEE international conference on computer vision, pp. 3332–3341. Cited by: §4.5.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2018a) Glue: a multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461. Cited by: §3.2, Table 1.
  • L. Wang, M. Sun, W. Zhao, K. Shen, and J. Liu (2018b) Yuanfudao at semeval-2018 task 11: three-way attention and relational knowledge for commonsense machine comprehension. arXiv preprint arXiv:1803.00191. Cited by: §4.5.
  • Q. Wang, Z. Mao, B. Wang, and L. Guo (2017) Knowledge graph embedding: a survey of approaches and applications. IEEE Transactions on Knowledge and Data Engineering 29 (12), pp. 2724–2743. Cited by: §4.2.4.
  • X. Wang, P. Kapanipathi, R. Musa, M. Yu, K. Talamadupula, I. Abdelaziz, M. Chang, A. Fokoue, B. Makni, N. Mattei, et al. (2019a) Improving natural language inference using external knowledge in the science questions domain. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 7208–7215. Cited by: §4.5.
  • X. Wang, T. Kong, C. Shen, Y. Jiang, and L. Li (2019b) SOLO: segmenting objects by locations. arXiv preprint arXiv:1912.04488. Cited by: §4.3.1.
  • Z. Wang, J. Zhang, J. Feng, and Z. Chen (2014) Knowledge graph embedding by translating on hyperplanes. In Twenty-Eighth AAAI conference on artificial intelligence. Cited by: §4.2.4.
  • D. Weissenborn, T. Kočiský, and C. Dyer (2017) Dynamic integration of background knowledge in neural nlu systems. arXiv preprint arXiv:1706.02596. Cited by: §4.5.
  • T. Wolfson, M. Geva, A. Gupta, M. Gardner, Y. Goldberg, D. Deutch, and J. Berant (2020) Break it down: a question understanding benchmark. Transactions of the Association for Computational Linguistics. Cited by: §3.2, Table 1.
  • J. Wu, J. B. Tenenbaum, and P. Kohli (2017) Neural scene de-rendering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 699–707. Cited by: §4.3.2.
  • J. Xia, C. Wu, and M. Yan (2019) Incorporating relation knowledge into commonsense reading comprehension with multi-task learning. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 2393–2396. Cited by: §4.5.
  • W. Xiong, M. Yu, S. Chang, X. Guo, and W. Y. Wang (2019) Improving question answering over incomplete kbs with knowledge-aware reader. arXiv preprint arXiv:1905.07098. Cited by: §4.5.
  • K. Xu, L. Wu, Z. Wang, Y. Feng, M. Witbrock, and V. Sheinin (2018) Graph2seq: graph to sequence learning with attention-based neural networks. arXiv preprint arXiv:1804.00823. Cited by: item 8, item 6, §3.4, Table 1.
  • Y. Xu, M. Li, L. Cui, S. Huang, F. Wei, and M. Zhou (2019) LayoutLM: pre-training of text and layout for document image understanding. arXiv preprint arXiv:1912.13318. Cited by: §4.3.4.
  • T. Xue, J. Wu, K. Bouman, and B. Freeman (2016) Visual dynamics: probabilistic future frame synthesis via cross convolutional networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §4.5.
  • A. Yang, Q. Wang, J. Liu, K. Liu, Y. Lyu, H. Wu, Q. She, and S. Li (2019) Enhancing pre-trained language representations with rich knowledge for machine reading comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2346–2357. Cited by: §4.5.
  • B. Yang, W. Yih, X. He, J. Gao, and L. Deng (2014) Embedding entities and relations for learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575. Cited by: §4.2.4.
  • Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. W. Cohen, R. Salakhutdinov, and C. D. Manning (2018) Hotpotqa: a dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600. Cited by: item 9, Table 1.
  • T. Ye, X. Wang, J. Davidson, and A. Gupta (2018) Interpretable intuitive physics model. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 87–102. Cited by: §4.5.
  • K. Yi, C. Gan, Y. Li, P. Kohli, J. Wu, A. Torralba, and J. B. Tenenbaum (2019) Clevrer: collision events for video representation and reasoning. arXiv preprint arXiv:1910.01442. Cited by: §3.3.
  • K. Yi, J. Wu, C. Gan, A. Torralba, P. Kohli, and J. Tenenbaum (2018) Neural-symbolic vqa: disentangling reasoning from vision and language understanding. In Advances in Neural Information Processing Systems, pp. 1031–1042. Cited by: §4.1, §4.1.
  • Y. Yoshikawa, J. Lin, and A. Takeuchi (2018) Stair actions: a video dataset of everyday home actions. arXiv preprint arXiv:1804.04326. Cited by: §4.5.
  • T. Yu and C. M. De Sa (2019) Numerically accurate hyperbolic embeddings using tiling-based models. In Advances in Neural Information Processing Systems, pp. 2021–2031. Cited by: §4.2.3.
  • J. Yuan, B. Li, and X. Xue (2019a) Generative modeling of infinite occluded objects for compositional scene representation. In International Conference on Machine Learning, pp. 7222–7231. Cited by: §4.3.1.
  • J. Yuan, B. Li, and X. Xue (2019b) Generative modeling of infinite occluded objects for compositional scene representation. In ICML. Cited by: §4.3.3.
  • W. W. Zachary (1977) An information flow model for conflict and fission in small groups. Journal of anthropological research 33 (4), pp. 452–473. Cited by: §4.2.3.
  • R. Zellers, Y. Bisk, A. Farhadi, and Y. Choi (2019) From recognition to cognition: visual commonsense reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6720–6731. Cited by: §4.5, §4.5, §4.5.
  • R. Zellers, Y. Bisk, R. Schwartz, and Y. Choi (2018) Swag: a large-scale adversarial dataset for grounded commonsense inference. arXiv preprint arXiv:1808.05326. Cited by: item 6, item 4, §3.4, Table 1.
  • M. Zhang and Y. Chen (2018) Link prediction based on graph neural networks. In Advances in Neural Information Processing Systems, pp. 5165–5175. Cited by: §4.2.4.
  • S. Zhang, X. Liu, J. Liu, J. Gao, K. Duh, and B. Van Durme (2018) Record: bridging the gap between human and machine commonsense reading comprehension. arXiv preprint arXiv:1810.12885. Cited by: §4.5.
  • Z. Zhao, P. Zheng, S. Xu, and X. Wu (2019) Object detection with deep learning: a review. IEEE transactions on neural networks and learning systems 30 (11), pp. 3212–3232. Cited by: §4.3.1.
  • V. Zhong, C. Xiong, and R. Socher (2017) Seq2sql: generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103. Cited by: item 7, item 5, Table 1.
  • W. Zhong, D. Tang, N. Duan, M. Zhou, J. Wang, and J. Yin (2019a) Improving question answering by commonsense-based pre-training. In CCF International Conference on Natural Language Processing and Chinese Computing, pp. 16–28. Cited by: §4.5.
  • X. Zhong, E. ShafieiBavani, and A. J. Yepes (2019b) Image-based table recognition: data, model, and evaluation. arXiv preprint arXiv:1911.10683. Cited by: §4.3.4.
  • X. Zhong, J. Tang, and A. J. Yepes (2019c) PubLayNet: largest dataset ever for document layout analysis. In 2019 15th International Conference on Document Analysis and Recognition (accepted). Cited by: §4.3.4.
  • H. Zhou, T. Young, M. Huang, H. Zhao, J. Xu, and X. Zhu (2018) Commonsense knowledge aware conversation generation with graph attention. In IJCAI, pp. 4623–4629. Cited by: §4.5, §4.5.
  • L. Zhou, C. Xu, and J. J. Corso (2017) Towards automatic learning of procedures from web instructional videos. arXiv preprint arXiv:1703.09788. Cited by: §4.5.
  • Y. Zhou and T. L. Berg (2015) Temporal perception and prediction in ego-centric video. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4498–4506. Cited by: §4.5, §4.5.
  • W. Zhu, X. Wang, and P. Cui (2020) Deep learning for learning graph representations. In Deep Learning: Concepts and Architectures, pp. 169–210. Cited by: §4.2.3.
  • Z. Zhu, Z. Xue, and Z. Yuan (2018) Automatic graphics program generation using attention-based hierarchical decoder. In ACCV. Cited by: §4.3.2.
  • A. Zohar and L. Wolf (2018) Automatic program synthesis of long programs with a learned garbage collector. In NeurIPS. Cited by: §4.4.2.